Vocal-Contour Transcription¶
Vocal pitch contour transcription.
Transcribes the monophonic pitch contour of the vocal in a given polyphonic audio. This is a re-implementation of the Vocal-Melody-Extraction repository.
Feature Storage Format¶
The processed features and labels are stored in .hdf format, one file per piece.
Columns in the file are:
feature: CFP feature representation.
label: 2D numpy array of vocal pitch contour.
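As a rough illustration of the layout described above, the sketch below writes and reads back one such file with h5py. It assumes the .hdf files are HDF5 containers whose datasets are named feature and label; the file name, the array shapes, and the helper load_piece are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def load_piece(hdf_path):
    """Read the feature and label arrays of one piece (assumed dataset names)."""
    with h5py.File(hdf_path, "r") as hdf:
        feature = hdf["feature"][:]  # CFP feature representation
        label = hdf["label"][:]      # 2D vocal pitch contour
    return feature, label


# Build a small dummy file so the example runs without a real dataset.
# The shapes below are placeholders, not the shapes omnizart produces.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "piece_001.hdf")
with h5py.File(demo_path, "w") as hdf:
    hdf.create_dataset("feature", data=np.zeros((10, 8, 3), dtype=np.float32))
    hdf.create_dataset("label", data=np.zeros((10, 8), dtype=np.float32))

feature, label = load_piece(demo_path)
print(feature.shape, label.shape)
```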
References¶
The related publication of this work can be found in [1].
[1] Wei-Tsung Lu and Li Su, “Vocal melody extraction with semantic segmentation and audio-symbolic domain transfer learning,” International Society for Music Information Retrieval Conference (ISMIR), 2018.
App¶
Application class of vocal-contour.
Includes core functions and interfaces for frame-level vocal transcription: model training, feature pre-processing, and audio transcription.
See Also¶
omnizart.base.BaseTranscription: The base class of all transcription/application classes.
- class omnizart.vocal_contour.app.VocalContourDatasetLoader(feature_folder=None, feature_files=None, num_samples=100, timesteps=128, channels=0, feature_num=384)¶
 Bases: omnizart.base.BaseDatasetLoader
Data loader for training the model of vocal-contour.
Load feature and label for training.
- Parameters
 - feature_folder: Path
 Path to the extracted feature files in *.hdf.
- feature_files: list[Path]
 List of paths to the feature files in *.hdf.
- num_samples: int
 Total number of samples to yield.
- timesteps: int
 Time length of the feature.
- channels: list[int]
 Channels to be used for training. Allowed values are [1, 2, 3].
- feature_num: int
 Target size of feature dimension. Zero padding is done to resolve mismatched input and target size.
- Yields
 - feature:
 Input features for model training.
- label:
 Corresponding labels.
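The zero padding mentioned for feature_num can be pictured with the following sketch. It assumes the feature has shape (time, frequency, channels) and that zeros are appended at the end of the frequency axis; the actual loader may pad differently (for instance on both sides), and pad_feature_dim is a hypothetical helper.

```python
import numpy as np


def pad_feature_dim(feature, feature_num=384):
    """Zero-pad axis 1 (assumed frequency axis) up to feature_num bins."""
    missing = feature_num - feature.shape[1]
    if missing < 0:
        raise ValueError("feature is already larger than feature_num")
    # Pad only the frequency axis; time and channel axes stay untouched.
    pad_width = [(0, 0), (0, missing)] + [(0, 0)] * (feature.ndim - 2)
    return np.pad(feature, pad_width, mode="constant", constant_values=0)


# Placeholder input: 128 timesteps, 352 frequency bins, 3 channels.
dummy = np.ones((128, 352, 3), dtype=np.float32)
padded = pad_feature_dim(dummy)
print(padded.shape)
```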
- class omnizart.vocal_contour.app.VocalContourTranscription(conf_path=None)¶
 Bases: omnizart.base.BaseTranscription
Application class for vocal-contour transcription.
Methods
generate_feature(dataset_path[, ...]): Extract the feature from the given dataset.
train(feature_folder[, model_name, ...]): Model training.
transcribe(input_audio[, model_path, output]): Transcribe frame-level fundamental frequency of vocal from the given audio.
- generate_feature(dataset_path, vocalcontour_settings=None, num_threads=4)¶
 Extract the feature from the given dataset.
To train the model, the first step is to pre-process the data into feature representations. After downloading the dataset, use this function to generate the feature by giving the path of the stored dataset.
To specify the output path, modify the attribute vocalcontour_settings.dataset.feature_save_path (TODO: to confirm). It defaults to the folder of the stored dataset, and creates two folders: train_feature and test_feature.
- Parameters
 - dataset_path: Path
 Path to the downloaded dataset.
- vocalcontour_settings: VocalContourSettings
 The configuration instance that holds all relevant settings for the life-cycle of building a model.
- num_threads: int
 Number of threads for parallel extraction of the feature.
See also
omnizart.constants.datasets: The supported datasets and the corresponding training/testing splits.
- train(feature_folder, model_name=None, input_model_path=None, vocalcontour_settings=None)¶
 Model training.
Train the model from scratch or continue training given a model checkpoint.
- Parameters
 - feature_folder: Path
 Path to the generated feature.
- model_name: str
 The name of the trained model. If not given, will default to the current timestamp.
- input_model_path: Path
 Specify the path to the model checkpoint in order to fine-tune the model.
- vocalcontour_settings: VocalContourSettings
 The configuration that holds all relevant settings for the life-cycle of model building.
- transcribe(input_audio, model_path=None, output='./')¶
 Transcribe frame-level fundamental frequency of vocal from the given audio.
- Parameters
 - input_audio: Path
 Path to the wav audio file.
- model_path: Path
 Path to the trained model or the transcription mode. If a path is given, it should be the folder that contains arch.yaml, weights.h5, and configuration.yaml.
- output: Path (optional)
 Path for writing out the extracted vocal f0. Defaults to the current path.
- Returns
 - f0: txt
 The transcribed f0 of the vocal contour in Hz.
See also
omnizart.cli.vocal_contour.transcribe: The corresponding command line entry.
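Putting the three methods together, a typical workflow might look like the sketch below. It relies only on the class and method signatures documented above; the dataset and audio paths and the model name are placeholders, and the location of the train_feature folder follows the default described for generate_feature. The omnizart import sits inside the function so that the snippet can be loaded without the package installed.

```python
def run_vocal_contour_pipeline(dataset_path, audio_path, output_dir="./"):
    """Extract features, train a model, then transcribe one audio file."""
    # Imported lazily: requires the omnizart package to be installed.
    from omnizart.vocal_contour.app import VocalContourTranscription

    app = VocalContourTranscription()

    # 1. Pre-process the downloaded dataset into feature files.
    app.generate_feature(dataset_path, num_threads=4)

    # 2. Train from scratch on the generated features
    #    (assumes the default train_feature folder under the dataset).
    feature_folder = f"{dataset_path}/train_feature"
    app.train(feature_folder, model_name="my_vocal_contour")

    # 3. Transcribe the audio; model_path is left at its default here.
    #    Pass model_path explicitly to use a specific trained model.
    return app.transcribe(audio_path, output=output_dir)
```

To continue training from an existing checkpoint instead of starting from scratch, pass input_model_path to train.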
Inference¶
Loss Functions¶
Loss functions for Music module.
- omnizart.music.losses.focal_loss(target_tensor, prediction_tensor, weights=None, alpha=0.25, gamma=2)¶
 Compute focal loss for predictions.
Multi-labels Focal loss formula:
\[FL = -\alpha * (z-p)^\gamma * \log{(p)} -(1-\alpha) * p^\gamma * \log{(1-p)}\]
where \(\alpha\) = 0.25, \(\gamma\) = 2, p = sigmoid(x), x = prediction_tensor, and z = target_tensor.
- Parameters
 - prediction_tensor
 A float tensor of shape [batch_size, num_anchors, num_classes] representing the predicted logits for each class.
- target_tensor:
 A float tensor of shape [batch_size, num_anchors, num_classes] representing one-hot encoded classification targets.
- weights
 A float tensor of shape [batch_size, num_anchors].
- alpha
 A scalar tensor for focal loss alpha hyper-parameter.
- gamma
 A scalar tensor for focal loss gamma hyper-parameter.
- Returns
 - loss
 A scalar tensor representing the value of the loss function.
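To make the formula concrete, here is a minimal per-element sketch in plain Python. It assumes the first term is applied only where the target is positive and the second only where it is zero, which is the usual reading of multi-label focal loss but is not spelled out above. The eps clipping and the helper names are likewise illustrative, and this is not the library's implementation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def focal_loss_element(z, x, alpha=0.25, gamma=2.0, eps=1e-8):
    """Focal loss for a single target z and a single logit x."""
    p = sigmoid(x)
    # Assumed masking: positive targets use (z - p), negatives use p.
    pos_term = max(z - p, 0.0) if z > 0 else 0.0
    neg_term = 0.0 if z > 0 else p
    return (
        -alpha * pos_term ** gamma * math.log(max(p, eps))
        - (1.0 - alpha) * neg_term ** gamma * math.log(max(1.0 - p, eps))
    )


# A confident correct prediction is penalized far less than a confident wrong one.
easy_positive = focal_loss_element(z=1.0, x=5.0)
hard_positive = focal_loss_element(z=1.0, x=-5.0)
print(easy_positive < hard_positive)
```

The (z - p)^gamma factor is what down-weights well-classified examples, so training focuses on the hard ones.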
- omnizart.music.losses.smooth_loss(y_true, y_pred, gamma=0.15, total_chs=22, weight=None)¶
 Function to compute loss after applying label-smoothing.
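The text above does not describe how the labels are smoothed. As an illustration only, the sketch below shows the common label-smoothing recipe, in which each target is scaled by 1 - gamma and gamma / total_chs is added to every channel. Whether smooth_loss uses exactly this form is an assumption, and smooth_labels is a hypothetical helper.

```python
def smooth_labels(y_true, gamma=0.15, total_chs=22):
    """Return smoothed targets for one frame (a list of per-channel labels)."""
    return [y * (1.0 - gamma) + gamma / total_chs for y in y_true]


# One-hot target over 22 channels with channel 3 active.
one_hot = [0.0] * 22
one_hot[3] = 1.0
smoothed = smooth_labels(one_hot)
print(smoothed[3], smoothed[0])
```

With this recipe the smoothed targets of a one-hot frame still sum to one, and no target is exactly 0 or 1.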
Settings¶
Below are the default settings for frame-level vocal transcription.
They are loaded by the class omnizart.setting_loaders.VocalContourSettings.
The attribute names are converted to snake case (e.g. HopSize -> hop_size).
A path transformation is also applied when loading the settings into the VocalContourSettings instance.
For example, the attribute BatchSize defined in the yaml path General/Training/Settings/BatchSize is transformed
to VocalContourSettings.training.batch_size.
The /Settings level is removed from all fields.
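The naming and path rules above can be mimicked with a few lines of Python. This is only an illustrative sketch of the described transformation, not the code omnizart uses. camel_to_snake and yaml_path_to_attr are hypothetical helpers, and dropping the leading General section is inferred from the BatchSize example.

```python
import re


def camel_to_snake(name):
    """Convert a CamelCase name such as HopSize to hop_size."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def yaml_path_to_attr(yaml_path, root="VocalContourSettings"):
    """Map a yaml path to the attribute path described in the text."""
    parts = yaml_path.strip("/").split("/")
    # Drop the leading General section and every Settings level.
    parts = [p for p in parts[1:] if p != "Settings"]
    return ".".join([root] + [camel_to_snake(p) for p in parts])


print(camel_to_snake("HopSize"))
# hop_size
print(yaml_path_to_attr("General/Training/Settings/BatchSize"))
# VocalContourSettings.training.batch_size
```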
General:
    TranscriptionMode:
        Description: Mode of transcription by executing the `omnizart vocal-contour transcribe` command.
        Type: String 
        Value: VocalContour
    CheckpointPath:
        Description: Path to the pre-trained models.
        Type: Map
        SubType: [String, String]
        Value: 
            VocalContour: checkpoints/vocal/vocal_contour
    Feature:
        Description: Default settings of feature extraction
        Settings:
            HopSize:
                Description: Hop size in seconds with respect to sampling rate.
                Type: Float
                Value: 0.02
            SamplingRate:
                Description: Adjust input sampling rate to this value.
                Type: Integer
                Value: 16000
            WindowSize:
                Type: Integer
                Value: 2049
    Dataset:
        Description: Settings of datasets.
        Settings:
            SavePath:
                Description: Path for storing the downloaded datasets.
                Type: String
                Value: ./
            FeatureSavePath:
                Description: Path for storing the extracted feature. Default to the path under the dataset folder.
                Type: String
                Value: +
    Model:
        Description: Default settings of training / testing the model.
        Settings:
            SavePrefix:
                Description: Prefix of the trained model's name to be saved.
                Type: String
                Value: vocal_contour
            SavePath:
                Description: Path to save the trained model.
                Type: String
                Value: ./checkpoints/vocal_contour
    Training:
        Description: Parameters for training
        Settings:
            Epoch:
                Description: Maximum number of epochs for training.
                Type: Integer
                Value: 5
            EarlyStop:
                Description: Terminate the training if the validation performance doesn't improve after n epochs.
                Type: Integer
                Value: 3
            Steps:
                Description: Number of training steps for each epoch.
                Type: Integer
                Value: 6000
            ValSteps:
                Description: Number of validation steps after each training epoch.
                Type: Integer
                Value: 200    
            BatchSize:
                Description: Batch size of each training step.
                Type: Integer
                Value: 12
            ValBatchSize:
                Description: Batch size of each validation step.
                Type: Integer
                Value: 12
            Timesteps:
                Description: Length of time axis of the input feature.
                Type: Integer
                Value: 128
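
A few derived quantities follow directly from the default values above. The short calculation below is just arithmetic on those defaults; the variable names are illustrative, and it assumes one feature frame per hop.

```python
hop_size_sec = 0.02      # Feature/HopSize
sampling_rate = 16000    # Feature/SamplingRate
timesteps = 128          # Training/Timesteps
steps = 6000             # Training/Steps
batch_size = 12          # Training/BatchSize

# Hop between consecutive frames, expressed in audio samples.
hop_size_samples = round(hop_size_sec * sampling_rate)
# Number of feature frames per second of audio.
frames_per_second = round(1 / hop_size_sec)
# Audio duration covered by one input of `timesteps` frames.
input_duration_sec = timesteps * hop_size_sec
# Training inputs consumed in one epoch.
inputs_per_epoch = steps * batch_size

print(hop_size_samples, frames_per_second, input_duration_sec, inputs_per_epoch)
```

With the defaults, each training input spans roughly 2.56 seconds of audio, and one epoch consumes 72000 such inputs.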