Vocal Transcription

Vocal melody transcription.

Transcribes the vocal notes in a song and outputs a MIDI file. This is a re-implementation of the work [1] with TensorFlow 2.3.0. Some changes have also been made to improve performance.

Feature Storage Format

Processed features are stored in the .hdf file format, one file per piece. A short reading sketch follows the column list below.

Columns in the file are:

  • feature: CFP feature specialized for the vocal module.

  • label: Onset, offset, and duration information of the vocal.
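
The stored files can be inspected directly. A minimal reading sketch, assuming the dataset keys inside the file match the column names above:

    import h5py

    # Open one processed piece; the file name is hypothetical and the
    # keys "feature"/"label" are assumed to match the columns above.
    with h5py.File("train_feature/some_piece.hdf", "r") as fin:
        feature = fin["feature"][:]  # CFP feature, one row per frame
        label = fin["label"][:]      # onset/offset/duration information
    print(feature.shape, label.shape)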

References

[1] https://github.com/B05901022/VOCANO

See Also

omnizart.feature.cfp.extract_vocal_cfp:

Function to extract the specialized CFP feature for vocals.

App

class omnizart.vocal.app.VocalTranscription(conf_path=None)

Bases: omnizart.base.BaseTranscription

Application class for vocal note transcription.

This application implements the training procedure in a semi-supervised way.

Methods

generate_feature(dataset_path[, ...])

Extract features for the whole dataset.

get_model(settings)

Get the Pyramid model.

train(feature_folder[, semi_feature_folder, ...])

Model training.

transcribe(input_audio[, model_path, output])

Transcribe vocal notes in the audio.

generate_feature(dataset_path, vocal_settings=None, num_threads=4)

Extract features for the whole dataset.

Currently supports MIR-1K and TONAS datasets. To train the model, you have to prepare the training data first, then process it into feature representations. After downloading the dataset, use this function to do the pre-processing and transform the raw data into features.

To specify the output path, modify the attribute vocal_settings.dataset.feature_save_path to the value you want. It defaults to a folder under where the dataset is stored, generating two sub-folders: train_feature and test_feature. A usage sketch follows the parameter list below.

Parameters
dataset_path: Path

Path to the downloaded dataset.

vocal_settings: VocalSettings

The configuration instance that holds all relevant settings for the life-cycle of building a model.

num_threads:

Number of threads for extracting features in parallel.
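
A minimal usage sketch (the dataset path is hypothetical):

    from omnizart.vocal.app import VocalTranscription

    app = VocalTranscription()
    # Pre-process a downloaded copy of MIR-1K; two sub-folders,
    # train_feature and test_feature, are generated by default.
    app.generate_feature("./MIR-1K", num_threads=4)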

get_model(settings)

Get the Pyramid model.

For a more comprehensive explanation of why this method exists, please refer to omnizart.base.BaseTranscription.get_model.

train(feature_folder, semi_feature_folder=None, model_name=None, input_model_path=None, vocal_settings=None)

Model training.

Train a new model or continue training from a previously trained model. A usage sketch follows the parameter list below.

Parameters
feature_folder: Path

Path to the folder containing generated feature.

semi_feature_folder: Path

If specified, semi-supervised learning will be leveraged, and the feature files contained in this folder will be used as unsupervised data.

model_name: str

The name for storing the trained model. If not given, defaults to the current timestamp.

input_model_path: Path

Path to a pre-trained model to continue training from.

vocal_settings: VocalSettings

The configuration instance that holds all relevant settings for the life-cycle of building a model.
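
A minimal usage sketch (the folder paths and model name are hypothetical; the folder layout follows the generate_feature defaults):

    from omnizart.vocal.app import VocalTranscription

    app = VocalTranscription()
    # Supervised features are required; the semi-supervised folder is
    # optional and enables the semi-supervised training path.
    app.train(
        "./MIR-1K/train_feature",
        semi_feature_folder="./TONAS/train_feature",
        model_name="my-vocal-model",
    )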

transcribe(input_audio, model_path=None, output='./')

Transcribe vocal notes in the audio.

This function transcribes the onset, offset, and pitch of vocals in the audio. This module is responsible for predicting the onset and offset time of each note; pitches are estimated by the vocal-contour submodule. A usage sketch follows the Returns section below.

Parameters
input_audio: Path

Path to the raw audio file (.wav).

model_path: Path

Path to the trained model or one of the supported transcription modes.

output: Path (optional)

Path for writing out the transcribed MIDI file. Defaults to the current path.

Returns
midi: pretty_midi.PrettyMIDI

The transcribed vocal notes.
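
A minimal usage sketch (the audio file name is hypothetical):

    from omnizart.vocal.app import VocalTranscription

    app = VocalTranscription()
    # Writes the transcribed MIDI to the output path and also returns
    # it as a pretty_midi.PrettyMIDI object.
    midi = app.transcribe("song.wav", output="./")
    print(midi.instruments[0].notes[:5])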

See also

omnizart.cli.vocal.transcribe

CLI entry point of this function.

omnizart.vocal_contour.transcribe

Pitch estimation function.

Dataset

class omnizart.vocal.app.VocalDatasetLoader(ctx_len=9, feature_folder=None, feature_files=None, num_samples=100, slice_hop=1)

Bases: omnizart.base.BaseDatasetLoader

Dataset loader for the ‘vocal’ module.

Defines an additional parameter ‘ctx_len’ that determines the context length of the input feature with respect to the current timestamp.
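
A minimal instantiation sketch (the feature folder path is hypothetical):

    from omnizart.vocal.app import VocalDatasetLoader

    # ctx_len=9 frames of context around the current timestamp.
    loader = VocalDatasetLoader(ctx_len=9, feature_folder="./MIR-1K/train_feature")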

Inference

omnizart.vocal.inference.infer_interval(pred, ctx_len=2, threshold=0.5, min_dura=0.1, t_unit=0.02)

Improved version of the interval inference function.

Infers the onset and offset times of notes given the raw prediction values. A simplified conceptual sketch follows the Returns section below.

Parameters
pred:

Raw prediction array.

ctx_len: int

Context length for determining peaks.

threshold: float

Threshold for prediction values to be taken as true positive.

min_dura: float

Minimum duration for a note, in seconds.

t_unit: float

Time unit of each frame.

Returns
interval: list[tuple[float, float]]

Pairs of inferred onset and offset times in seconds.
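
Conceptually, the post-processing combines a probability threshold with a minimum-duration filter. The sketch below illustrates only those two steps and is not the library's actual context-window peak-picking logic:

    import numpy as np

    def naive_intervals(onset_prob, offset_prob, threshold=0.5,
                        min_dura=0.1, t_unit=0.02):
        # Simplified illustration: pair each thresholded onset with the
        # next offset and drop notes shorter than min_dura seconds.
        onsets = np.where(onset_prob >= threshold)[0]
        offsets = np.where(offset_prob >= threshold)[0]
        intervals = []
        last_off = -1
        for on in onsets:
            if on <= last_off:
                continue  # onset falls inside an already-emitted note
            later = offsets[offsets > on]
            if len(later) == 0:
                break  # no matching offset left
            off = later[0]
            if (off - on) * t_unit >= min_dura:
                intervals.append((on * t_unit, off * t_unit))
            last_off = off
        return intervals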

omnizart.vocal.inference.infer_interval_original(pred, ctx_len=2, threshold=0.5, t_unit=0.02)

Original implementation of interval inference.

After checking the inference results of this implementation, we found that many notes were missing from the inferred results. This function is left here only for reference.

Parameters
pred:

Raw prediction array.

ctx_len: int

Context length for determining peaks.

threshold: float

Threshold for prediction values to be taken as true positive.

t_unit: float

Time unit of each frame.

Returns
interval: list[tuple[float, float]]

Pairs of inferred onset and offset times in seconds.

omnizart.vocal.inference.infer_midi(interval, agg_f0, t_unit=0.02)

Infer a MIDI file from the given intervals and aggregated F0 information. A minimal end-to-end sketch follows the Returns section below.

Parameters
interval: list[tuple[float, float]]

The return value of the infer_interval function: a list of onset/offset pairs in seconds.

agg_f0: list[dict]

Aggregated F0 information. Each element in the list should contain three columns: start_time, end_time, and frequency. Times should be in seconds, and frequency in Hz.

t_unit: float

Time unit of each frame.

Returns
midi: pretty_midi.PrettyMIDI

The inferred MIDI object.
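
A minimal end-to-end sketch (the interval and F0 values below are made up for illustration):

    from omnizart.vocal.inference import infer_midi

    # Onset/offset pairs in seconds, e.g. as returned by infer_interval.
    interval = [(0.10, 0.48), (0.52, 0.90)]
    # Aggregated F0: start/end times in seconds, frequency in Hz.
    agg_f0 = [
        {"start_time": 0.10, "end_time": 0.48, "frequency": 220.0},
        {"start_time": 0.52, "end_time": 0.90, "frequency": 246.9},
    ]
    midi = infer_midi(interval, agg_f0)
    midi.write("vocal_notes.mid")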

Labels

class omnizart.vocal.labels.BaseLabelExtraction

Base class for extracting label information.

Provides basic functions to parse the original label format into the target format for training. All sub-classes should override the function load_label and return a list of Label objects.

Methods

extract_label(label_path[, t_unit])

Extract SDT label.

load_label(label_path)

Load the label file and parse information into Label class.

classmethod extract_label(label_path, t_unit=0.02)

Extract SDT label.

There are 6 types of events defined in the original paper: activation, silence, onset, non-onset, offset, and non-offset. The corresponding annotations used in the paper are [a, s, o, o’, f, f’]. ‘Activation’ spans the time between onset and offset, while the non-onset and non-offset events mark frames where no onset/offset occurs. A conceptual layout sketch follows the Returns section below.

Parameters
label_path: Path

Path to the ground-truth file.

t_unit: float

Time unit of each frame.

Returns
sdt_label: 2D numpy array

Label in SDT format with dimension: Time x 6
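
As a rough illustration of the layout (a conceptual sketch only; the exact column order used by the library is an assumption):

    import numpy as np

    def sketch_sdt(intervals, total_frames, t_unit=0.02):
        # Time x 6 array with assumed column order [a, s, o, o', f, f'].
        sdt = np.zeros((total_frames, 6))
        sdt[:, 1] = 1  # silence everywhere by default
        sdt[:, 3] = 1  # non-onset everywhere by default
        sdt[:, 5] = 1  # non-offset everywhere by default
        for onset, offset in intervals:
            on_f = int(round(onset / t_unit))
            off_f = min(int(round(offset / t_unit)), total_frames - 1)
            sdt[on_f:off_f + 1, 0] = 1  # activation spans onset to offset
            sdt[on_f:off_f + 1, 1] = 0  # not silence while active
            sdt[on_f, 2], sdt[on_f, 3] = 1, 0    # onset, clear non-onset
            sdt[off_f, 4], sdt[off_f, 5] = 1, 0  # offset, clear non-offset
        return sdt

    print(sketch_sdt([(0.1, 0.3)], total_frames=25).shape)  # (25, 6)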

abstract classmethod load_label(label_path)

Load the label file and parse information into Label class.

Sub-classes should override this function to process their own label format. A hypothetical example follows the Returns section below.

Parameters
label_path: Path

Path to the label file.

Returns
labels: list[Label]

List of Label instances.
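
A hypothetical subclass for a simple CSV label format might look like the sketch below; the CSV layout, the Label import path, and the Label constructor arguments are all assumptions:

    import csv

    from omnizart.vocal.labels import BaseLabelExtraction
    from omnizart.base import Label  # import path is an assumption

    class CsvLabelExtraction(BaseLabelExtraction):
        """Hypothetical extractor for a CSV of 'onset,offset' rows (seconds)."""

        @classmethod
        def load_label(cls, label_path):
            labels = []
            with open(label_path) as fin:
                for row in csv.reader(fin):
                    onset, offset = float(row[0]), float(row[1])
                    # Constructor arguments are assumptions.
                    labels.append(Label(start_time=onset, end_time=offset))
            return labels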

class omnizart.vocal.labels.CMediaLabelExtraction

Label extraction for CMedia dataset.

Methods

load_label(label_path)

Load the label file and parse information into Label class.

classmethod load_label(label_path)

Load the label file and parse information into Label class.

Sub-classes should override this function to process their own label format.

Parameters
label_path: Path

Path to the label file.

Returns
labels: list[Label]

List of Label instances.

class omnizart.vocal.labels.MIR1KlabelExtraction

Label extraction for MIR-1K dataset.

Methods

load_label(label_path)

Load the label file and parse information into Label class.

classmethod load_label(label_path)

Load the label file and parse information into Label class.

Sub-classes should override this function to process their own label format.

Parameters
label_path: Path

Path to the label file.

Returns
labels: list[Label]

List of Label instances.

class omnizart.vocal.labels.TonasLabelExtraction

Label extraction for TONAS dataset.

Methods

load_label(label_path)

Load the label file and parse information into Label class.

classmethod load_label(label_path)

Load the label file and parse information into Label class.

Sub-classes should override this function to process their own label format.

Parameters
label_path: Path

Path to the label file.

Returns
labels: list[Label]

List of Label instances.

Prediction

omnizart.vocal.prediction.create_batches(feature, ctx_len=9, batch_size=64)
omnizart.vocal.prediction.merge_batches(batch_pred)
omnizart.vocal.prediction.predict(feature, model, ctx_len=9, batch_size=16)
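
These helpers are undocumented; based on their names and signatures, a minimal usage sketch might look like this (loading a generated feature file and assuming a prepared model):

    import h5py

    from omnizart.vocal.prediction import predict

    with h5py.File("train_feature/some_piece.hdf", "r") as fin:
        feature = fin["feature"][:]

    # `model` is assumed to be the Pyramid model obtained elsewhere,
    # e.g. via VocalTranscription.get_model(settings).
    pred = predict(feature, model, ctx_len=9, batch_size=16)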

Settings

Below are the default settings for building the vocal model. They will be loaded by the class omnizart.setting_loaders.VocalSettings. The names of the attributes are converted to snake-case (e.g. HopSize -> hop_size). There is also a path transformation process when applying the settings onto the VocalSettings instance. For example, if you want to access the attribute BatchSize defined at the yaml path General/Training/Settings/BatchSize, the corresponding attribute will be VocalSettings.training.batch_size. The /Settings level is removed from all fields, as the sketch below shows.
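
For example, a minimal sketch of the attribute mapping described above (assuming VocalSettings can be instantiated with its defaults):

    from omnizart.setting_loaders import VocalSettings

    settings = VocalSettings()  # loads the default values listed below
    # General/Training/Settings/BatchSize -> settings.training.batch_size
    print(settings.training.batch_size)  # 64
    # General/Feature/Settings/HopSize -> settings.feature.hop_size
    print(settings.feature.hop_size)     # 0.02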

General:
    TranscriptionMode:
        Description: Mode of transcription by executing the `omnizart vocal transcribe` command.
        Type: String
        Value: Semi
    CheckpointPath:
        Description: Path to the pre-trained models.
        Type: Map
        SubType: [String, String]
        Value:
            Super: checkpoints/vocal/vocal_super
            Semi: checkpoints/vocal/vocal_semi
    Feature:
        Description: Default settings of feature extraction for vocal transcription.
        Settings:
            HopSize:
                Description: Hop size in seconds with respect to sampling rate.
                Type: Float
                Value: 0.02
            SamplingRate:
                Description: Adjust input sampling rate to this value.
                Type: Integer
                Value: 16000
            FrequencyResolution:
                Type: Float
                Value: 2.0
            FrequencyCenter:
                Description: Lowest frequency to extract.
                Type: Float
                Value: 80
            TimeCenter:
                Description: Highest frequency to extract (1/time_center).
                Type: Float
                Value: 0.001
            Gamma:
                Type: List
                SubType: Float
                Value: [0.24, 0.6, 1.0]
            BinsPerOctave:
                Description: Number of bins for each octave.
                Type: Integer
                Value: 48
    Dataset:
        Description: Settings of datasets.
        Settings:
            SavePath:
                Description: Path for storing the downloaded datasets.
                Type: String
                Value: ./
            FeatureSavePath:
                Description: Path for storing the extracted feature. Defaults to a path under the dataset folder.
                Type: String
                Value: +
    Model:
        Description: Default settings of training / testing the model.
        Settings:
            SavePrefix:
                Description: Prefix of the trained model's name to be saved.
                Type: String
                Value: vocal
            SavePath:
                Description: Path to save the trained model.
                Type: String
                Value: ./checkpoints/vocal
            MinKernelSize:
                Description: Minimum kernel size of convolution layers in each pyramid block.
                Type: Integer
                Value: 16
            Depth:
                Description: Total number of pyramid blocks will be (Depth - 2) / 2.
                Type: Integer
                Value: 110
            Alpha:
                Type: Integer
                Value: 270
            ShakeDrop:
                Description: Whether to apply Shake Drop regularization during back-propagation.
                Type: Bool
                Value: True
            SemiLossWeight:
                Description: Weighting factor of the semi-supervised loss. The supervised loss is not affected by this parameter.
                Type: Float
                Value: 1.0
            SemiXi:
                Description: A small constant value for weighting the adversarial perturbation.
                Type: Float
                Value: 0.000001
            SemiEpsilon:
                Description: Weighting factor of the output adversarial perturbation.
                Type: Float
                Value: 8.0
            SemiIterations:
                Description: Number of iterations when generating the adversarial perturbation.
                Type: Integer
                Value: 2
    Inference:
        Description: Default settings for inferring notes.
        Settings:
            ContextLength:
                Description: Length of context that will be used to find the peaks.
                Type: Integer
                Value: 2
            Threshold:
                Description: Threshold that will be applied to clip the predicted values to either 0 or 1.
                Type: Float
                Value: 0.5
            MinDuration:
                Description: Minimum required length of each note, in seconds.
                Type: Float
                Value: 0.1
            PitchModel:
                Description: The model for predicting the pitch contour. Defaults to the vocal-contour module. Can be a path or a mode name.
                Type: String
                Value: VocalContour
    Training:
        Description: Hyper-parameters for training.
        Settings:
            Epoch:
                Description: Maximum number of epochs for training.
                Type: Integer
                Value: 10
            Steps:
                Description: Number of training steps for each epoch.
                Type: Integer
                Value: 1000
            ValSteps:
                Description: Number of validation steps after each training epoch.
                Type: Integer
                Value: 50
            BatchSize:
                Description: Batch size of each training step.
                Type: Integer
                Value: 64
            ValBatchSize:
                Description: Batch size of each validation step.
                Type: Integer
                Value: 64
            EarlyStop:
                Description: Terminate the training if the validation performance doesn't improve after n epochs.
                Type: Integer
                Value: 8
            InitLearningRate:
                Description: Initial learning rate.
                Type: Float
                Value: 0.0001
            ContextLength:
                Description: Context to be considered before and after current timestamp.
                Type: Integer
                Value: 9