Vocal-Contour Transcription¶
Vocal pitch contour transcription.
Transcribes the monophonic pitch contour of the vocal part in a given polyphonic audio recording. A re-implementation of the repository Vocal-Melody-Extraction.
Feature Storage Format¶
Processed features and labels are stored in .hdf format, one file per piece.
Columns in the file are:
feature: CFP feature representation.
label: 2D numpy array of the vocal pitch contour.
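A minimal sketch of inspecting one such file, assuming the files are plain HDF5 readable with h5py; the file name is a placeholder, and the "feature"/"label" key names are assumptions taken from the column description above:

import h5py

# "some_piece.hdf" is a placeholder for one generated feature file; the
# "feature" and "label" keys mirror the columns described above and are
# assumptions about the on-disk layout.
with h5py.File("some_piece.hdf", "r") as fin:
    feature = fin["feature"][:]  # CFP feature representation
    label = fin["label"][:]      # 2D numpy array of the vocal pitch contour
    print(feature.shape, label.shape)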
References¶
The related publication of this work can be found in [1].
- [1]
Wei-Tsung Lu and Li Su, “Vocal melody extraction with semantic segmentation and audio-symbolic domain transfer learning,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2018.
App¶
Application class of vocal-contour.
Includes core functions and interfaces for frame-level vocal transcription: model training, feature pre-processing, and audio transcription.
See Also¶
omnizart.base.BaseTranscription: The base class of all transcription/application classes.
- class omnizart.vocal_contour.app.VocalContourDatasetLoader(feature_folder=None, feature_files=None, num_samples=100, timesteps=128, channels=0, feature_num=384)¶
Bases:
omnizart.base.BaseDatasetLoader
Data loader for training the model of vocal-contour. Loads features and labels for training (see the sketch following the parameter list).
- Parameters
- feature_folder: Path
Path to the folder containing the extracted feature files (*.hdf).
- feature_files: list[Path]
List of paths to the feature files (*.hdf).
- num_samples: int
Total number of samples to yield.
- timesteps: int
Time length of the feature.
- channels: list[int]
Channels to be used for training. Allowed values are [1, 2, 3].
- feature_num: int
Target size of feature dimension. Zero padding is done to resolve mismatched input and target size.
- Yields
- feature:
Input features for model training.
- label:
Corresponding labels.
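A minimal construction sketch; the feature folder path is a placeholder, and iterating the loader directly is an assumption based on the generator semantics implied by the Yields section:

from omnizart.vocal_contour.app import VocalContourDatasetLoader

# "./train_feature" is a placeholder for the folder produced by
# generate_feature; the remaining arguments repeat the documented defaults.
loader = VocalContourDatasetLoader(
    feature_folder="./train_feature",
    num_samples=100,
    timesteps=128,
)

# Iterating the loader directly is an assumption based on the Yields
# section above; each step yields one (feature, label) training pair.
for feature, label in loader:
    print(feature.shape, label.shape)
    break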
- class omnizart.vocal_contour.app.VocalContourTranscription(conf_path=None)¶
Bases:
omnizart.base.BaseTranscription
Application class for vocal-contour transcription.
Methods
generate_feature(dataset_path[, ...]): Extract the feature from the given dataset.
train(feature_folder[, model_name, ...]): Model training.
transcribe(input_audio[, model_path, output]): Transcribe the frame-level fundamental frequency of the vocal from the given audio.
- generate_feature(dataset_path, vocalcontour_settings=None, num_threads=4)¶
Extract the feature from the given dataset.
To train the model, the first step is to pre-process the data into feature representations. After downloading the dataset, use this function to generate the features by giving the path of the stored dataset.
To specify the output path, modify the attribute vocalcontour_settings.dataset.feature_save_path (TODO: to confirm). It defaults to the folder of the stored dataset and creates two sub-folders: train_feature and test_feature.
- Parameters
- dataset_path: Path
Path to the downloaded dataset.
- vocalcontour_settings: VocalContourSettings
The configuration instance that holds all relevant settings for the life-cycle of building a model.
- num_threads: int
Number of threads to use for parallel feature extraction.
See also
omnizart.constants.datasets
The supported datasets and the corresponding training/testing splits.
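For example, a minimal sketch (the dataset path is a placeholder; any dataset supported in omnizart.constants.datasets works the same way):

from omnizart.vocal_contour.app import VocalContourTranscription

vc_app = VocalContourTranscription()
# "./MIR-1K" is a placeholder path to a downloaded dataset; settings and
# thread count fall back to the documented defaults.
vc_app.generate_feature("./MIR-1K", num_threads=4)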
- train(feature_folder, model_name=None, input_model_path=None, vocalcontour_settings=None)¶
Model training.
Train the model from scratch, or continue training from a given model checkpoint (see the sketch after the parameter list).
- Parameters
- feature_folder: Path
Path to the generated feature.
- model_name: str
The name of the trained model. If not given, will default to the current timestamp.
- input_model_path: Path
Specify the path to the model checkpoint in order to fine-tune the model.
- vocalcontour_settings: VocalContourSettings
The configuration that holds all relevant settings for the life-cycle of model building.
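A minimal sketch of both modes; the paths and the model name are placeholders:

from omnizart.vocal_contour.app import VocalContourTranscription

vc_app = VocalContourTranscription()
# Train from scratch on the generated features.
vc_app.train("./train_feature", model_name="my_vocal_contour")
# Or continue from an existing checkpoint (fine-tuning).
vc_app.train("./train_feature", input_model_path="./checkpoints/vocal_contour/my_vocal_contour")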
- transcribe(input_audio, model_path=None, output='./')¶
Transcribe frame-level fundamental frequency of vocal from the given audio.
- Parameters
- input_audio: Path
Path to the wav audio file.
- model_path: Path
Path to the trained model or a supported transcription mode. If given a path, it should be the folder that contains arch.yaml, weights.h5, and configuration.yaml.
- output: Path (optional)
Path for writing out the extracted vocal f0. Defaults to the current path.
- Returns
- f0: txt
The transcribed f0 of the vocal contour in Hz, written out as a text file.
See also
omnizart.cli.vocal_contour.transcribe
The corresponding command line entry.
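A minimal sketch; the audio path is a placeholder, and with model_path=None the default checkpoint is resolved from the settings:

from omnizart.vocal_contour.app import VocalContourTranscription

vc_app = VocalContourTranscription()
# Writes the extracted f0 to the current directory and returns it.
f0 = vc_app.transcribe("song.wav", output="./")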
Inference¶
Loss Functions¶
Loss functions for the Music module.
- omnizart.music.losses.focal_loss(target_tensor, prediction_tensor, weights=None, alpha=0.25, gamma=2)¶
Compute focal loss for predictions.
Multi-label focal loss formula:
\[FL = -\alpha (z-p)^\gamma \log{(p)} - (1-\alpha) p^\gamma \log{(1-p)}\]
where \(\alpha = 0.25\), \(\gamma = 2\), \(p = \mathrm{sigmoid}(x)\), and \(z\) = target_tensor.
- Parameters
- target_tensor
A float tensor of shape [batch_size, num_anchors, num_classes] representing one-hot encoded classification targets.
- prediction_tensor
A float tensor of shape [batch_size, num_anchors, num_classes] representing the predicted logits for each class.
- weights
A float tensor of shape [batch_size, num_anchors].
- alpha
A scalar tensor for the focal loss alpha hyper-parameter.
- gamma
A scalar tensor for the focal loss gamma hyper-parameter.
- Returns
- loss
A scalar tensor representing the value of the loss function.
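A toy invocation, assuming the function accepts plain TensorFlow tensors in the documented shapes; the values are random and purely illustrative:

import tensorflow as tf
from omnizart.music.losses import focal_loss

# Shapes follow the documentation: [batch_size, num_anchors, num_classes].
target = tf.cast(tf.random.uniform([2, 4, 3]) > 0.5, tf.float32)  # one-hot style targets
logits = tf.random.normal([2, 4, 3])                              # predicted logits

loss = focal_loss(target, logits, alpha=0.25, gamma=2)  # scalar loss tensor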
- omnizart.music.losses.smooth_loss(y_true, y_pred, gamma=0.15, total_chs=22, weight=None)¶
Function to compute the loss after applying label smoothing to the ground-truth labels. Conventionally, smoothing maps each label \(y\) to \((1-\gamma)\,y + \gamma/C\) over \(C\) classes (presumably total_chs here), flattening hard 0/1 targets toward a uniform distribution.
Settings¶
Below are the default settings for frame-level vocal transcription.
They are loaded by the class omnizart.setting_loaders.VocalContourSettings.
Attribute names are converted to snake-case (e.g. HopSize -> hop_size).
There is also a path transformation when applying the settings to the VocalContourSettings instance: for example, the attribute BatchSize, defined under the yaml path General/Training/Settings/BatchSize, is mapped to VocalContourSettings.training.batch_size. The /Settings level is removed from all fields.
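For example, a short sketch of reading the defaults through the settings class (attribute names follow the transformation rules above):

from omnizart.setting_loaders import VocalContourSettings

settings = VocalContourSettings()
# General/Training/Settings/BatchSize -> training.batch_size
print(settings.training.batch_size)  # 12 by default
# General/Feature/Settings/HopSize -> feature.hop_size
print(settings.feature.hop_size)     # 0.02 by default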
General:
TranscriptionMode:
Description: Mode of transcription when executing the `omnizart vocal-contour transcribe` command.
Type: String
Value: VocalContour
CheckpointPath:
Description: Path to the pre-trained models.
Type: Map
SubType: [String, String]
Value:
VocalContour: checkpoints/vocal/vocal_contour
Feature:
Description: Default settings of feature extraction
Settings:
HopSize:
Description: Hop size in seconds with respect to sampling rate.
Type: Float
Value: 0.02
SamplingRate:
Description: Adjust input sampling rate to this value.
Type: Integer
Value: 16000
WindowSize:
Type: Integer
Value: 2049
Dataset:
Description: Settings of datasets.
Settings:
SavePath:
Description: Path for storing the downloaded datasets.
Type: String
Value: ./
FeatureSavePath:
Description: Path for storing the extracted feature. Default to the path under the dataset folder.
Type: String
Value: +
Model:
Description: Default settings of training / testing the model.
Settings:
SavePrefix:
Description: Prefix of the trained model's name to be saved.
Type: String
Value: vocal_contour
SavePath:
Description: Path to save the trained model.
Type: String
Value: ./checkpoints/vocal_contour
Training:
Description: Parameters for training
Settings:
Epoch:
Description: Maximum number of epochs for training.
Type: Integer
Value: 5
EarlyStop:
Description: Terminate the training if the validation performance doesn't improve after n epochs.
Type: Integer
Value: 3
Steps:
Description: Number of training steps for each epoch.
Type: Integer
Value: 6000
ValSteps:
Description: Number of validation steps after each training epoch.
Type: Integer
Value: 200
BatchSize:
Description: Batch size of each training step.
Type: Integer
Value: 12
ValBatchSize:
Description: Batch size of each validation step.
Type: Integer
Value: 12
Timesteps:
Description: Length of time axis of the input feature.
Type: Integer
Value: 128