Vocal-Contour Transcription¶

Vocal pitch contour transcription.

Transcribes the monophonic pitch contour of the vocal part in a given polyphonic audio recording. This module is a re-implementation of the repository Vocal-Melody-Extraction.

Feature Storage Format¶

Processed features and labels are stored in .hdf format, one file per piece.

Columns in the file are:

feature: CFP feature representation.
label: 2D numpy array of vocal pitch contour.

References¶

The related publication of this work can be found in [1].

[1] Wei-Tsung Lu and Li Su, “Vocal melody extraction with semantic segmentation and audio-symbolic domain transfer learning,” International Society for Music Information Retrieval Conference (ISMIR), 2018.

App¶

Application class of vocal-contour.

Includes core functions and interfaces for frame-level vocal transcription: model training, feature pre-processing, and audio transcription.

See Also¶

omnizart.base.BaseTranscription: The base class of all transcription/application classes.

class omnizart.vocal_contour.app.VocalContourDatasetLoader(feature_folder=None, feature_files=None, num_samples=100, timesteps=128, channels=0, feature_num=384)¶

Bases: omnizart.base.BaseDatasetLoader

Data loader for training the model of vocal-contour.

Loads features and labels for training.

Parameters

feature_folder: Path: Path to the folder of extracted feature files (*.hdf).
feature_files: list[Path]: List of paths to the feature files (*.hdf).
num_samples: int: Total number of samples to yield.
timesteps: int: Time length of the feature.
channels: list[int]: Channels to be used for training. Allowed values are [1, 2, 3].
feature_num: int: Target size of feature dimension. Zero padding is done to resolve mismatched input and target size.

Yields

feature: Input features for model training.
label: Corresponding labels.
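The sampling and zero-padding behaviour described by the parameters above can be sketched as follows. This is a minimal illustration, not the omnizart implementation: the helper name `sample_windows`, the `seed` argument, the assumed array shapes (feature as frames x bins x channels, label as frames x bins), random window selection, and padding at the end of the frequency axis are all assumptions made for the example.

```python
import numpy as np


def sample_windows(feature, label, num_samples=100, timesteps=128,
                   feature_num=384, seed=0):
    """Yield fixed-size (feature, label) windows.

    Hypothetical helper. Assumes `feature` has shape
    (frames, bins, channels) and `label` has shape (frames, bins).
    """
    rng = np.random.default_rng(seed)

    # Zero-pad the frequency axis when it is smaller than feature_num
    # (padding at the end is an assumption of this sketch).
    pad = max(feature_num - feature.shape[1], 0)
    feature = np.pad(feature, ((0, 0), (0, pad), (0, 0)))
    label = np.pad(label, ((0, 0), (0, pad)))

    max_start = feature.shape[0] - timesteps
    if max_start < 0:
        raise ValueError("piece is shorter than the requested timesteps")

    for _ in range(num_samples):
        # Pick a random start frame and cut a window of `timesteps` frames.
        start = int(rng.integers(0, max_start + 1))
        yield feature[start:start + timesteps], label[start:start + timesteps]


# Usage with synthetic data: 200 frames, 352 bins, 3 channels.
feature = np.random.rand(200, 352, 3)
label = np.zeros((200, 352))
x, y = next(sample_windows(feature, label, num_samples=1))
# x has shape (128, 384, 3) and y has shape (128, 384).
```

Padding once before sampling keeps every yielded window at the same size, which is what a fixed-input model expects.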

class omnizart.vocal_contour.app.VocalContourTranscription(conf_path=None)¶

Bases: omnizart.base.BaseTranscription

Application class for vocal-contour transcription.

Methods

`generate_feature`(dataset_path[, ...])	Extract the feature from the given dataset.
`train`(feature_folder[, model_name, ...])	Model training.
`transcribe`(input_audio[, model_path, output])	Transcribe frame-level fundamental frequency of vocal from the given audio.

generate_feature(dataset_path, vocalcontour_settings=None, num_threads=4)¶

Extract the feature from the given dataset.

To train the model, the first step is to pre-process the data into feature representations. After downloading the dataset, use this function to generate the feature by giving the path of the stored dataset.

To specify the output path, modify the attribute vocalcontour_settings.dataset.feature_save_path (TODO: to confirm). By default, the features are saved under the folder of the stored dataset, in two newly created folders: train_feature and test_feature.

Parameters

dataset_path: Path: Path to the downloaded dataset.
vocalcontour_settings: VocalContourSettings: The configuration instance that holds all related settings for the life-cycle of building a model.
num_threads: int: Number of threads for parallel extraction of the feature.

Inference¶

Loss Functions¶

Loss functions for the Music module.

omnizart.music.losses.focal_loss(target_tensor, prediction_tensor, weights=None, alpha=0.25, gamma=2)¶

Compute focal loss for predictions.

Multi-labels Focal loss formula:

\[FL = -\alpha * (z-p)^\gamma * \log{(p)} -(1-\alpha) * p^\gamma * \log{(1-p)}\]

where \(\alpha\) = 0.25, \(\gamma\) = 2, p = sigmoid(x), x = prediction_tensor (the predicted logits), and z = target_tensor.

Parameters

prediction_tensor: A float tensor of shape [batch_size, num_anchors, num_classes] representing the predicted logits for each class.
target_tensor: A float tensor of shape [batch_size, num_anchors, num_classes] representing one-hot encoded classification targets.
weights: A float tensor of shape [batch_size, num_anchors].
alpha: A scalar tensor for focal loss alpha hyper-parameter.
gamma: A scalar tensor for focal loss gamma hyper-parameter.

Returns

loss: A scalar tensor representing the value of the loss function
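To make the focal loss formula concrete, here is a scalar sketch in plain Python for a single (target, logit) pair. It is an illustration only, not the library code: the function names are hypothetical, the real function operates on tensors and supports `weights`, and the sketch assumes that the first term is applied to positive targets (z > 0) and the second term to negative targets (z = 0), with a small `eps` guarding the logarithm.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def focal_loss_entry(z, x, alpha=0.25, gamma=2.0, eps=1e-8):
    """Focal loss for one target z (0 or 1) and one logit x.

    Assumption: the alpha term applies when z > 0, and the
    (1 - alpha) term applies when z == 0.
    """
    p = sigmoid(x)
    if z > 0:
        # Positive target: down-weight well-classified examples (p close to 1).
        return -alpha * (z - p) ** gamma * math.log(max(p, eps))
    # Negative target: down-weight well-classified examples (p close to 0).
    return -(1.0 - alpha) * p ** gamma * math.log(max(1.0 - p, eps))


# An uncertain prediction (logit 0) costs more than a confident correct one.
uncertain = focal_loss_entry(1, 0.0)
confident = focal_loss_entry(1, 5.0)
print(uncertain > confident)
```

The factor raised to the power \(\gamma\) is what distinguishes this from a weighted cross entropy: the closer the prediction is to the target, the smaller the contribution of that entry.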

omnizart.music.losses.smooth_loss(y_true, y_pred, gamma=0.15, total_chs=22, weight=None)¶

Function to compute loss after applying label-smoothing.
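The smoothing formula itself is not given here. As a rough illustration, a common form of label smoothing mixes the target with a uniform distribution over the channels, q = (1 - gamma) * y + gamma / total_chs. Whether `smooth_loss` uses exactly this form is an assumption, and the helper name `smooth_labels` is hypothetical.

```python
def smooth_labels(y_true, gamma=0.15, total_chs=22):
    """Apply uniform label smoothing to a list of target values.

    Assumed form: q = (1 - gamma) * y + gamma / total_chs.
    """
    return [(1.0 - gamma) * y + gamma / total_chs for y in y_true]


# A one-hot target over 22 channels keeps most of its mass on the true channel.
one_hot = [0.0] * 22
one_hot[3] = 1.0
smoothed = smooth_labels(one_hot)
print(smoothed[3], smoothed[0])
```

With this form, a one-hot target still sums to one after smoothing, so it remains a valid distribution for a cross-entropy style loss.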

Settings¶

Below are the default settings for frame-level vocal transcription. They are loaded by the class omnizart.setting_loaders.VocalContourSettings. Attribute names are converted to snake case (e.g. HopSize -> hop_size). The path is also transformed when the settings are applied to the VocalContourSettings instance. For example, the attribute BatchSize defined at the yaml path General/Training/Settings/BatchSize becomes VocalContourSettings.training.batch_size. The /Settings level is removed from all fields.

General:
    TranscriptionMode:
        Description: Mode of transcription by executing the `omnizart vocal-contour transcribe` command.
        Type: String 
        Value: VocalContour
    CheckpointPath:
        Description: Path to the pre-trained models.
        Type: Map
        SubType: [String, String]
        Value: 
            VocalContour: checkpoints/vocal/vocal_contour
    Feature:
        Description: Default settings of feature extraction
        Settings:
            HopSize:
                Description: Hop size in seconds with respect to sampling rate.
                Type: Float
                Value: 0.02
            SamplingRate:
                Description: Adjust input sampling rate to this value.
                Type: Integer
                Value: 16000
            WindowSize:
                Type: Integer
                Value: 2049
    Dataset:
        Description: Settings of datasets.
        Settings:
            SavePath:
                Description: Path for storing the downloaded datasets.
                Type: String
                Value: ./
            FeatureSavePath:
                Description: Path for storing the extracted feature. Default to the path under the dataset folder.
                Type: String
                Value: +
    Model:
        Description: Default settings of training / testing the model.
        Settings:
            SavePrefix:
                Description: Prefix of the trained model's name to be saved.
                Type: String
                Value: vocal_contour
            SavePath:
                Description: Path to save the trained model.
                Type: String
                Value: ./checkpoints/vocal_contour
    Training:
        Description: Parameters for training
        Settings:
            Epoch:
                Description: Maximum number of epochs for training.
                Type: Integer
                Value: 5
            EarlyStop:
                Description: Terminate the training if the validation performance doesn't improve after n epochs.
                Type: Integer
                Value: 3
            Steps:
                Description: Number of training steps for each epoch.
                Type: Integer
                Value: 6000
            ValSteps:
                Description: Number of validation steps after each training epoch.
                Type: Integer
                Value: 200    
            BatchSize:
                Description: Batch size of each training step.
                Type: Integer
                Value: 12
            ValBatchSize:
                Description: Batch size of each validation step.
                Type: Integer
                Value: 12
            Timesteps:
                Description: Length of time axis of the input feature.
                Type: Integer
                Value: 128
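
The name and path transformation described at the top of this section can be sketched in a few lines of Python. The helper names `to_snake_case` and `to_attribute_path` are hypothetical and exist only for this example; the actual conversion is performed inside omnizart.setting_loaders and may differ in detail.

```python
import re


def to_snake_case(name):
    """Convert a CamelCase setting name to snake case (HopSize -> hop_size)."""
    # Insert an underscore before every capital letter except the first one.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def to_attribute_path(yaml_path):
    """Map a yaml path to a dotted attribute path.

    The top-level 'General' key and every 'Settings' level are dropped,
    mirroring the transformation described in the text.
    """
    parts = [p for p in yaml_path.split("/") if p not in ("General", "Settings")]
    return ".".join(to_snake_case(p) for p in parts)


print(to_attribute_path("General/Training/Settings/BatchSize"))
# prints "training.batch_size"
print(to_attribute_path("General/Dataset/Settings/FeatureSavePath"))
# prints "dataset.feature_save_path"
```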