Vocal-Contour Transcription

Vocal pitch contour transcription.

Transcribes the monophonic pitch contour of the vocal in the given polyphonic audio. This is a re-implementation of the repository Vocal-Melody-Extraction.

Feature Storage Format

Processed features and labels are stored in .hdf format, one file per piece.

Columns in the file are:

  • feature: CFP feature representation.

  • label: 2D numpy array of vocal pitch contour.

References

The related publication of this work can be found in [1].

[1] Wei-Tsung Lu and Li Su, “Vocal melody extraction with semantic segmentation and audio-symbolic domain transfer learning,” International Society of Music Information Retrieval Conference (ISMIR), 2018.

App

Application class of vocal-contour.

Includes core functions and interfaces for frame-level vocal transcription: model training, feature pre-processing, and audio transcription.

See Also

omnizart.base.BaseTranscription: The base class of all transcription/application classes.

class omnizart.vocal_contour.app.VocalContourDatasetLoader(feature_folder=None, feature_files=None, num_samples=100, timesteps=128, channels=0, feature_num=384)

Bases: omnizart.base.BaseDatasetLoader

Data loader for training the model of vocal-contour.

Load feature and label for training.

Parameters
feature_folder: Path

Path to the extracted feature files in *.hdf.

feature_files: list[Path]

List of paths to the feature files in *.hdf.

num_samples: int

Total number of samples to yield.

timesteps: int

Time length of the feature.

channels: list[int]

Channels to be used for training. Allowed values are [1, 2, 3].

feature_num: int

Target size of feature dimension. Zero padding is done to resolve mismatched input and target size.

Yields
feature:

Input features for model training.

label:

Corresponding labels.
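The zero padding mentioned for feature_num can be illustrated with a minimal NumPy sketch. The (time, feature, channel) axis layout, the even split of the padding between both sides, the helper name pad_feature_dim, and the dummy input size are all assumptions made for illustration; the actual loader may pad differently.

```python
import numpy as np


def pad_feature_dim(feature, feature_num=384):
    """Zero-pad axis 1 of a (time, feature, channel) array up to feature_num.

    Hypothetical helper: the axis layout and the even left/right split
    are assumptions of this sketch, not the library's confirmed behavior.
    """
    diff = feature_num - feature.shape[1]
    if diff <= 0:
        return feature  # already large enough, nothing to pad
    left = diff // 2
    right = diff - left
    # Pad only the feature axis; time and channel axes stay untouched.
    return np.pad(feature, ((0, 0), (left, right), (0, 0)), mode="constant")


# A dummy clip: 128 timesteps, 352 feature bins, 3 channels.
clip = np.ones((128, 352, 3), dtype=np.float32)
padded = pad_feature_dim(clip)
print(padded.shape)  # (128, 384, 3)
```

Padding to a fixed size keeps every yielded sample the same shape, so samples can be stacked into batches without further checks.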

class omnizart.vocal_contour.app.VocalContourTranscription(conf_path=None)

Bases: omnizart.base.BaseTranscription

Application class for vocal-contour transcription.

Methods

generate_feature(dataset_path[, ...])

Extract the feature from the given dataset.

train(feature_folder[, model_name, ...])

Model training.

transcribe(input_audio[, model_path, output])

Transcribe frame-level fundamental frequency of vocal from the given audio.

generate_feature(dataset_path, vocalcontour_settings=None, num_threads=4)

Extract the feature from the given dataset.

To train the model, the first step is to pre-process the data into feature representations. After downloading the dataset, use this function to generate the feature by giving the path of the stored dataset.

To specify the output path, modify the attribute vocalcontour_settings.dataset.feature_save_path (TODO: to confirm). It defaults to the folder of the stored dataset, and creates two folders: train_feature and test_feature.

Parameters
dataset_path: Path

Path to the downloaded dataset.

vocalcontour_settings: VocalContourSettings

The configuration instance that holds all related settings for the life-cycle of building a model.

num_threads: int

Number of threads for parallel extraction of the feature.

See also

omnizart.constants.datasets

The supported datasets and the corresponding training/testing splits.

train(feature_folder, model_name=None, input_model_path=None, vocalcontour_settings=None)

Model training.

Train the model from scratch or continue training given a model checkpoint.

Parameters
feature_folder: Path

Path to the generated feature.

model_name: str

The name of the trained model. If not given, it defaults to the current timestamp.

input_model_path: Path

Specify the path to the model checkpoint in order to fine-tune the model.

vocalcontour_settings: VocalContourSettings

The configuration that holds all related settings for the life-cycle of model building.

transcribe(input_audio, model_path=None, output='./')

Transcribe frame-level fundamental frequency of vocal from the given audio.

Parameters
input_audio: Path

Path to the wav audio file.

model_path: Path

Path to the trained model or the transcription mode. If a path is given, it should be the folder that contains arch.yaml, weights.h5, and configuration.yaml.

output: Path (optional)

Path for writing out the extracted vocal f0. Defaults to the current path.

Returns
f0: txt

The transcribed f0 of the vocal contour in Hz.

See also

omnizart.cli.vocal_contour.transcribe

The corresponding command line entry.
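The three methods above are meant to be used in sequence: generate features, train, then transcribe. The following is a minimal sketch of that order, using only the signatures documented here. The helper run_workflow and the example paths are hypothetical, and the sketch assumes the default feature location (the train_feature folder under the dataset path).

```python
from pathlib import Path


def run_workflow(app, dataset_path, audio_path, model_name=None, model_path=None):
    """Call generate_feature, train, and transcribe in order on `app`.

    `app` is expected to expose the three methods documented above,
    for example an instance of VocalContourTranscription.
    """
    dataset_path = Path(dataset_path)

    # Step 1: extract features. By default they are written next to the
    # dataset, in the train_feature and test_feature folders.
    app.generate_feature(dataset_path, num_threads=4)

    # Step 2: train on the generated training features.
    app.train(dataset_path / "train_feature", model_name=model_name)

    # Step 3: transcribe one audio file; model_path=None keeps the default.
    return app.transcribe(audio_path, model_path=model_path, output="./")


# With omnizart installed, the object would be created like this
# (the paths are placeholders):
#   from omnizart.vocal_contour.app import VocalContourTranscription
#   f0 = run_workflow(VocalContourTranscription(), "path/to/dataset", "song.wav")
```

Passing the application object in as an argument keeps the sketch independent of how the instance is configured, for example with a custom conf_path.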

Inference

Loss Functions

Loss functions for Music module.

omnizart.music.losses.focal_loss(target_tensor, prediction_tensor, weights=None, alpha=0.25, gamma=2)

Compute focal loss for predictions.

Multi-labels Focal loss formula:

\[FL = -\alpha * (z-p)^\gamma * \log{(p)} -(1-\alpha) * p^\gamma * \log{(1-p)}\]

where \(\alpha\) = 0.25, \(\gamma\) = 2, p = sigmoid(x), x = prediction_tensor (the predicted logits), and z = target_tensor.

Parameters
prediction_tensor

A float tensor of shape [batch_size, num_anchors, num_classes] representing the predicted logits for each class.

target_tensor

A float tensor of shape [batch_size, num_anchors, num_classes] representing one-hot encoded classification targets.

weights

A float tensor of shape [batch_size, num_anchors].

alpha

A scalar tensor for focal loss alpha hyper-parameter.

gamma

A scalar tensor for focal loss gamma hyper-parameter.

Returns
loss

A scalar tensor representing the value of the loss function.
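To make the focal loss formula concrete, here is a minimal NumPy sketch that evaluates it element-wise. It is not the library's implementation. Three points are assumptions of this sketch: the (z - p) factor is applied only where z > p, the second term is applied only where z = 0, and the per-element values are summed into one scalar. The optional weights argument is left out, and the name focal_loss_np is hypothetical.

```python
import numpy as np


def focal_loss_np(target, logits, alpha=0.25, gamma=2.0, eps=1e-8):
    """Evaluate the focal loss formula element-wise and sum the result."""
    target = np.asarray(target, dtype=np.float64)
    logits = np.asarray(logits, dtype=np.float64)
    p = 1.0 / (1.0 + np.exp(-logits))  # p = sigmoid(x)

    # Assumption: the positive term is active only where z > p,
    # the negative term only where z == 0.
    pos = np.where(target > p, target - p, 0.0)
    neg = np.where(target > 0, 0.0, p)

    per_entry = (
        -alpha * pos**gamma * np.log(np.clip(p, eps, 1.0))
        - (1.0 - alpha) * neg**gamma * np.log(np.clip(1.0 - p, eps, 1.0))
    )
    return per_entry.sum()


# One positive and one negative class, both predicted with logit 0 (p = 0.5).
uncertain = focal_loss_np([[1.0, 0.0]], [[0.0, 0.0]])
# The same targets with confident, correct predictions.
confident = focal_loss_np([[1.0, 0.0]], [[5.0, -5.0]])
print(uncertain > confident)  # True
```

The \((z-p)^\gamma\) and \(p^\gamma\) factors shrink the contribution of entries that are already predicted well, which is why the confident case yields a much smaller loss.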

omnizart.music.losses.smooth_loss(y_true, y_pred, gamma=0.15, total_chs=22, weight=None)

Function to compute loss after applying label-smoothing.
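The exact computation of smooth_loss is not described here. As an illustration only, the sketch below applies the standard label-smoothing formula and then a binary cross-entropy. How gamma and total_chs enter the formula, the choice of cross-entropy, and both function names are assumptions of this sketch.

```python
import numpy as np


def smooth_labels(y_true, gamma=0.15, total_chs=22):
    """Standard label smoothing: shrink hard targets, spread gamma over total_chs."""
    y_true = np.asarray(y_true, dtype=np.float64)
    return y_true * (1.0 - gamma) + gamma / total_chs


def smoothed_cross_entropy(y_true, y_pred, gamma=0.15, total_chs=22, eps=1e-8):
    """Binary cross-entropy of y_pred (probabilities) against smoothed targets."""
    target = smooth_labels(y_true, gamma, total_chs)
    # Clip predictions so that log() never receives 0.
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0 - eps)
    ce = -(target * np.log(y_pred) + (1.0 - target) * np.log(1.0 - y_pred))
    return ce.mean()


# Hard targets 1 and 0 become slightly softer values.
print(smooth_labels([1.0, 0.0]))
print(smoothed_cross_entropy([1.0, 0.0], [0.9, 0.1]))
```

Smoothing keeps the targets away from exactly 0 and 1, which discourages the model from producing over-confident predictions.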

Settings

Below are the default settings for frame-level vocal transcription. They are loaded by the class omnizart.setting_loaders.VocalContourSettings. The names of the attributes are converted to snake-case (e.g. HopSize -> hop_size). There is also a path transformation when applying the settings to the VocalContourSettings instance. For example, the attribute BatchSize defined in the YAML path General/Training/Settings/BatchSize is transformed to VocalContourSettings.training.batch_size. The /Settings level is removed from all fields.

General:
    TranscriptionMode:
        Description: Mode of transcription by executing the `omnizart vocal-contour transcribe` command.
        Type: String
        Value: VocalContour
    CheckpointPath:
        Description: Path to the pre-trained models.
        Type: Map
        SubType: [String, String]
        Value: 
            VocalContour: checkpoints/vocal/vocal_contour
    Feature:
        Description: Default settings of feature extraction
        Settings:
            HopSize:
                Description: Hop size in seconds with respect to sampling rate.
                Type: Float
                Value: 0.02
            SamplingRate:
                Description: Adjust input sampling rate to this value.
                Type: Integer
                Value: 16000
            WindowSize:
                Type: Integer
                Value: 2049
    Dataset:
        Description: Settings of datasets.
        Settings:
            SavePath:
                Description: Path for storing the downloaded datasets.
                Type: String
                Value: ./
            FeatureSavePath:
                Description: Path for storing the extracted feature. Default to the path under the dataset folder.
                Type: String
                Value: +
    Model:
        Description: Default settings of training / testing the model.
        Settings:
            SavePrefix:
                Description: Prefix of the trained model's name to be saved.
                Type: String
                Value: vocal_contour
            SavePath:
                Description: Path to save the trained model.
                Type: String
                Value: ./checkpoints/vocal_contour
    Training:
        Description: Parameters for training
        Settings:
            Epoch:
                Description: Maximum number of epochs for training.
                Type: Integer
                Value: 5
            EarlyStop:
                Description: Terminate the training if the validation performance doesn't improve after n epochs.
                Type: Integer
                Value: 3
            Steps:
                Description: Number of training steps for each epoch.
                Type: Integer
                Value: 6000
            ValSteps:
                Description: Number of validation steps after each training epoch.
                Type: Integer
                Value: 200    
            BatchSize:
                Description: Batch size of each training step.
                Type: Integer
                Value: 12
            ValBatchSize:
                Description: Batch size of each validation step.
                Type: Integer
                Value: 12
            Timesteps:
                Description: Length of time axis of the input feature.
                Type: Integer
                Value: 128
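
The name and path transformation described before the listing can be sketched in a few lines of Python. This is an illustration of the mapping rule only, not the code used by VocalContourSettings; the helper names camel_to_snake and yaml_path_to_attribute are hypothetical.

```python
import re


def camel_to_snake(name):
    """Convert a CamelCase key such as 'HopSize' to 'hop_size'."""
    # Insert an underscore before every capital letter except the first one.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def yaml_path_to_attribute(path):
    """Map 'General/Training/Settings/BatchSize' to 'training.batch_size'."""
    # Drop the top-level General key and every Settings level.
    parts = [p for p in path.split("/") if p not in ("General", "Settings")]
    return ".".join(camel_to_snake(p) for p in parts)


print(camel_to_snake("HopSize"))  # hop_size
print(yaml_path_to_attribute("General/Training/Settings/BatchSize"))  # training.batch_size
```

Under this rule, General/Feature/Settings/SamplingRate would map to feature.sampling_rate on the settings instance.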