Vocal Transcription¶
Vocal melody transcription.
Transcribes the vocal notes in a song and outputs a MIDI file. This is a re-implementation of the work in [1] with tensorflow 2.3.0, with some additional changes to improve performance.
Feature Storage Format¶
Processed features are stored in .hdf file format, one file per piece.
Columns in the file are:
feature: CFP feature specialized for the vocal module.
label: Onset, offset, and duration information of the vocal.
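A minimal sketch of inspecting one of these files, assuming the file can be opened with h5py and that the dataset keys match the column names above (both are assumptions to verify against your setup):

import h5py

# Hypothetical: peek into the processed feature file of one piece.
with h5py.File("some_piece.hdf", "r") as fin:
    feature = fin["feature"][:]  # CFP feature specialized for vocal
    label = fin["label"][:]      # onset / offset / duration information
print(feature.shape, label.shape)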
References¶
See Also¶
omnizart.feature.cfp.extract_vocal_cfp
Function to extract the specialized CFP feature for vocal.
App¶
- class omnizart.vocal.app.VocalTranscription(conf_path=None)¶
Bases:
omnizart.base.BaseTranscription
Application class for vocal note transcription.
This application implements the training procedure in a semi-supervised way.
Methods
generate_feature(dataset_path[, ...])
Extract the feature of the whole dataset.
get_model(settings)
Get the Pyramid model.
train(feature_folder[, semi_feature_folder, ...])
Model training.
transcribe(input_audio[, model_path, output])
Transcribe vocal notes in the audio.
- generate_feature(dataset_path, vocal_settings=None, num_threads=4)¶
Extract the feature of the whole dataset.
Currently supports MIR-1K and TONAS datasets. To train the model, you have to prepare the training data first, then process it into feature representations. After downloading the dataset, use this function to do the pre-processing and transform the raw data into features.
To specify the output path, modify the attribute vocal_settings.dataset.feature_save_path to the value you want. It defaults to a folder under where the dataset is stored, generating two folders: train_feature and test_feature.
- Parameters
- dataset_path: Path
Path to the downloaded dataset.
- vocal_settings: VocalSettings
The configuration instance that holds all relative settings for the life-cycle of building a model.
- num_threads:
Number of threads used to extract the features in parallel.
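A minimal usage sketch, assuming the MIR-1K dataset has already been downloaded to ./MIR-1K (the path is illustrative):

from omnizart.vocal.app import VocalTranscription

app = VocalTranscription()
# Writes train_feature/ and test_feature/ under the dataset folder by default.
app.generate_feature("./MIR-1K", num_threads=4)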
- get_model(settings)¶
Get the Pyramid model.
For a more comprehensive explanation of why this method exists, please refer to omnizart.base.BaseTranscription.get_model.
- train(feature_folder, semi_feature_folder=None, model_name=None, input_model_path=None, vocal_settings=None)¶
Model training.
Train a new model or continue to train on a previously trained model.
- Parameters
- feature_folder: Path
Path to the folder containing the generated features.
- semi_feature_folder: Path
If specified, semi-supervised learning will be leveraged, and the feature files contained in this folder will be used as unsupervised data.
- model_name: str
The name for storing the trained model. If not given, will default to the current timestamp.
- input_model_path: Path
Continue to train on the pre-trained model by specifying the path.
- vocal_settings: VocalSettings
The configuration instance that holds all relative settings for the life-cycle of building a model.
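A training sketch under the assumption that features were generated as above; the folder names and model name are placeholders:

from omnizart.vocal.app import VocalTranscription

app = VocalTranscription()
app.train(
    "./MIR-1K/train_feature",              # supervised feature folder
    semi_feature_folder="./semi_feature",  # optional: enables semi-supervised learning
    model_name="my_vocal_model",
)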
- transcribe(input_audio, model_path=None, output='./')¶
Transcribe vocal notes in the audio.
This function transcribes the onset, offset, and pitch of the vocal in the audio. This module is responsible for predicting the onset and offset times of each note; pitches are estimated by the vocal-contour submodule.
- Parameters
- input_audio: Path
Path to the raw audio file (.wav).
- model_path: Path
Path to the trained model or the supported transcription mode.
- output: Path (optional)
Path for writing out the transcribed MIDI file. Defaults to the current path.
- Returns
- midi: pretty_midi.PrettyMIDI
The transcribed vocal notes.
See also
omnizart.cli.vocal.transcribe
CLI entry point of this function.
omnizart.vocal_contour.transcribe
Pitch estimation function.
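A transcription sketch; song.wav is a placeholder path, and the default checkpoint is used when model_path is omitted:

from omnizart.vocal.app import VocalTranscription

app = VocalTranscription()
# Writes the transcribed MIDI file to the given output path and returns the
# pretty_midi.PrettyMIDI object for further processing.
midi = app.transcribe("song.wav", output="./")
print(midi.instruments[0].notes[:5])  # first few transcribed notes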
Dataset¶
- class omnizart.vocal.app.VocalDatasetLoader(ctx_len=9, feature_folder=None, feature_files=None, num_samples=100, slice_hop=1)¶
Bases:
omnizart.base.BaseDatasetLoader
Dataset loader of ‘vocal’ module.
Defines an additional parameter ‘ctx_len’ to determine the context length of the input feature with respect to the current timestamp.
Inference¶
- omnizart.vocal.inference.infer_interval(pred, ctx_len=2, threshold=0.5, min_dura=0.1, t_unit=0.02)¶
Improved version of the interval inference function.
Infers the onset and offset times of notes given the raw prediction values.
- Parameters
- pred:
Raw prediction array.
- ctx_len: int
Context length for determining peaks.
- threshold: float
Threshold for prediction values to be taken as true positive.
- min_dura: float
Minimum duration for a note, in seconds.
- t_unit: float
Time unit of each frame.
- Returns
- interval: list[tuple[float, float]]
Pairs of inferred onset and offset times, in seconds.
- omnizart.vocal.inference.infer_interval_original(pred, ctx_len=2, threshold=0.5, t_unit=0.02)¶
Original implementation of interval inference.
After checking the inference results of this implementation, we found that many notes were missing from the output. This function is kept for reference only.
- Parameters
- pred:
Raw prediction array.
- ctx_len: int
Context length for determining peaks.
- threshold: float
Threshold for prediction values to be taken as true positive.
- t_unit: float
Time unit of each frame.
- Returns
- interval: list[tuple[float, float]]
Pairs of inferred onset and offset times, in seconds.
- omnizart.vocal.inference.infer_midi(interval, agg_f0, t_unit=0.02)¶
Infers a MIDI object from the given intervals and aggregated F0.
- Parameters
- interval: list[tuple[float, float]]
The return value of the infer_interval function. List of onset/offset pairs in seconds.
- agg_f0: list[dict]
Aggregated F0 information. Each element in the list should contain three fields: start_time, end_time, and frequency. Times should be in seconds, and frequency in Hz.
- t_unit: float
Time unit of each frame.
- Returns
- midi: pretty_midi.PrettyMIDI
The inferred MIDI object.
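A self-contained sketch of the expected input format; in practice, interval would come from infer_interval above and agg_f0 from the aggregated pitch-contour output:

from omnizart.vocal.inference import infer_midi

# Onset/offset pairs in seconds, as returned by infer_interval.
interval = [(0.50, 1.10), (1.40, 2.00)]

# Aggregated F0: start_time/end_time in seconds, frequency in Hz.
agg_f0 = [
    {"start_time": 0.50, "end_time": 1.10, "frequency": 220.0},
    {"start_time": 1.40, "end_time": 2.00, "frequency": 246.9},
]

midi = infer_midi(interval, agg_f0, t_unit=0.02)
midi.write("vocal.mid")  # midi is a pretty_midi.PrettyMIDI instance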
Labels¶
- class omnizart.vocal.labels.BaseLabelExtraction¶
Base class for extracting label information.
Provides basic functions to parse the original label format into the target format for training. All sub-classes should override the load_label function and return a list of Label objects.
Methods
extract_label(label_path[, t_unit])
Extract SDT label.
load_label(label_path)
Load the label file and parse information into Label class.
- classmethod extract_label(label_path, t_unit=0.02)¶
Extract SDT label.
There are six types of events as defined in the original paper: activation, silence, onset, non-onset, offset, and non-offset. The corresponding annotations used in the paper are [a, s, o, o’, f, f’]. The ‘activation’ event covers the onset and offset times, while non-onset and non-offset events refer to frames where no onset/offset event occurs.
- Parameters
- label_path: Path
Path to the ground-truth file.
- t_unit: float
Time unit of each frame.
- Returns
- sdt_label: 2D numpy array
Label in SDT format with dimension: Time x 6
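A minimal sketch of building the frame-level SDT matrix with one of the concrete extractors below; the label path is a placeholder:

from omnizart.vocal.labels import MIR1KlabelExtraction

# Returns a Time x 6 numpy array holding the [a, s, o, o', f, f'] channels.
sdt_label = MIR1KlabelExtraction.extract_label("path/to/label_file", t_unit=0.02)
print(sdt_label.shape)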
- abstract classmethod load_label(label_path)¶
Load the label file and parse information into Label class.
Sub-classes should override this function to process their own label format.
- Parameters
- label_path: Path
Path to the label file.
- Returns
- labels: list[Label]
List of Label instances.
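A sketch of a custom extractor for a hypothetical "onset offset midi_note" text format, assuming the Label class lives in omnizart.base and accepts start_time, end_time, and note arguments (verify against the actual Label definition):

from omnizart.base import Label
from omnizart.vocal.labels import BaseLabelExtraction

class MyLabelExtraction(BaseLabelExtraction):
    @classmethod
    def load_label(cls, label_path):
        labels = []
        with open(label_path) as fin:
            for line in fin:
                onset, offset, note = line.split()
                # The Label field names used here are assumptions.
                labels.append(Label(start_time=float(onset), end_time=float(offset), note=int(note)))
        return labels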
- class omnizart.vocal.labels.CMediaLabelExtraction¶
Label extraction for CMedia dataset.
Methods
load_label(label_path)
Load the label file and parse information into Label class.
- classmethod load_label(label_path)¶
Load the label file and parse information into Label class.
Sub-classes should override this function to process their own label format.
- Parameters
- label_path: Path
Path to the label file.
- Returns
- labels: list[Label]
List of Label instances.
- class omnizart.vocal.labels.MIR1KlabelExtraction¶
Label extraction for MIR-1K dataset.
Methods
load_label(label_path)
Load the label file and parse information into Label class.
- classmethod load_label(label_path)¶
Load the label file and parse information into Label class.
Sub-classes should override this function to process their own label format.
- Parameters
- label_path: Path
Path to the label file.
- Returns
- labels: list[Label]
List of Label instances.
- class omnizart.vocal.labels.TonasLabelExtraction¶
Label extraction for TONAS dataset.
Methods
load_label(label_path)
Load the label file and parse information into Label class.
- classmethod load_label(label_path)¶
Load the label file and parse information into Label class.
Sub-classes should override this function to process their own label format.
- Parameters
- label_path: Path
Path to the label file.
- Returns
- labels: list[Label]
List of Label instances.
Prediction¶
- omnizart.vocal.prediction.create_batches(feature, ctx_len=9, batch_size=64)¶
- omnizart.vocal.prediction.merge_batches(batch_pred)¶
- omnizart.vocal.prediction.predict(feature, model, ctx_len=9, batch_size=16)¶
Settings¶
Below are the default settings for building the vocal model. They will be loaded
by the class omnizart.setting_loaders.VocalSettings. The names of the
attributes are converted to snake-case (e.g. HopSize -> hop_size). There
is also a path transformation process when applying the settings to the
VocalSettings instance. For example, to access the attribute BatchSize
defined at the yaml path General/Training/Settings/BatchSize,
the corresponding attribute is VocalSettings.training.batch_size;
the /Settings level is removed from all fields.
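A short illustration of this naming rule, assuming the defaults below are loaded when the settings class is instantiated:

from omnizart.setting_loaders import VocalSettings

settings = VocalSettings()
# General/Training/Settings/BatchSize -> settings.training.batch_size
print(settings.training.batch_size)  # 64 with the defaults below
# General/Feature/Settings/HopSize -> settings.feature.hop_size
print(settings.feature.hop_size)     # 0.02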
General:
    TranscriptionMode:
        Description: Mode of transcription by executing the `omnizart vocal transcribe` command.
        Type: String
        Value: Semi
    CheckpointPath:
        Description: Path to the pre-trained models.
        Type: Map
        SubType: [String, String]
        Value:
            Super: checkpoints/vocal/vocal_super
            Semi: checkpoints/vocal/vocal_semi
    Feature:
        Description: Default settings of feature extraction for vocal transcription.
        Settings:
            HopSize:
                Description: Hop size in seconds with respect to sampling rate.
                Type: Float
                Value: 0.02
            SamplingRate:
                Description: Adjust input sampling rate to this value.
                Type: Integer
                Value: 16000
            FrequencyResolution:
                Type: Float
                Value: 2.0
            FrequencyCenter:
                Description: Lowest frequency to extract.
                Type: Float
                Value: 80
            TimeCenter:
                Description: Highest frequency to extract (1/time_center).
                Type: Float
                Value: 0.001
            Gamma:
                Type: List
                SubType: Float
                Value: [0.24, 0.6, 1.0]
            BinsPerOctave:
                Description: Number of bins for each octave.
                Type: Integer
                Value: 48
    Dataset:
        Description: Settings of datasets.
        Settings:
            SavePath:
                Description: Path for storing the downloaded datasets.
                Type: String
                Value: ./
            FeatureSavePath:
                Description: Path for storing the extracted feature. Defaults to a path under the dataset folder.
                Type: String
                Value: +
    Model:
        Description: Default settings of training / testing the model.
        Settings:
            SavePrefix:
                Description: Prefix of the trained model's name to be saved.
                Type: String
                Value: vocal
            SavePath:
                Description: Path to save the trained model.
                Type: String
                Value: ./checkpoints/vocal
            MinKernelSize:
                Description: Minimum kernel size of convolution layers in each pyramid block.
                Type: Integer
                Value: 16
            Depth:
                Description: Total number of pyramid blocks will be (Depth - 2) / 2.
                Type: Integer
                Value: 110
            Alpha:
                Type: Integer
                Value: 270
            ShakeDrop:
                Description: Whether to apply ShakeDrop regularization during back-propagation.
                Type: Bool
                Value: True
            SemiLossWeight:
                Description: Weighting factor of the semi-supervised loss. The supervised loss is not affected by this parameter.
                Type: Float
                Value: 1.0
            SemiXi:
                Description: A small constant value for weighting the adversarial perturbation.
                Type: Float
                Value: 0.000001
            SemiEpsilon:
                Description: Weighting factor of the output adversarial perturbation.
                Type: Float
                Value: 8.0
            SemiIterations:
                Description: Number of iterations when generating the adversarial perturbation.
                Type: Integer
                Value: 2
    Inference:
        Description: Default settings when inferring notes.
        Settings:
            ContextLength:
                Description: Length of context that will be used to find the peaks.
                Type: Integer
                Value: 2
            Threshold:
                Description: Threshold that will be applied to clip the predicted values to either 0 or 1.
                Type: Float
                Value: 0.5
            MinDuration:
                Description: Minimum required length of each note, in seconds.
                Type: Float
                Value: 0.1
            PitchModel:
                Description: The model for predicting the pitch contour. Defaults to the vocal-contour module. Can be a path or a mode name.
                Type: String
                Value: VocalContour
    Training:
        Description: Hyperparameters for training.
        Settings:
            Epoch:
                Description: Maximum number of epochs for training.
                Type: Integer
                Value: 10
            Steps:
                Description: Number of training steps for each epoch.
                Type: Integer
                Value: 1000
            ValSteps:
                Description: Number of validation steps after each training epoch.
                Type: Integer
                Value: 50
            BatchSize:
                Description: Batch size of each training step.
                Type: Integer
                Value: 64
            ValBatchSize:
                Description: Batch size of each validation step.
                Type: Integer
                Value: 64
            EarlyStop:
                Description: Terminate the training if the validation performance doesn't improve after n epochs.
                Type: Integer
                Value: 8
            InitLearningRate:
                Description: Initial learning rate.
                Type: Float
                Value: 0.0001
            ContextLength:
                Description: Context to be considered before and after the current timestamp.
                Type: Integer
                Value: 9