Music Transcription¶
Music transcription module.
This module provides utilities for transcribing pitch and instruments in audio. It is an improved version of the original repository BreezeWhite/Music-Transcription-with-Semantic-Segmentation, with a cleaner architecture and a consistent coding style, and it also provides a command-line interface for easy usage.
Feature Storage Format¶
Processed features are stored in the .hdf and .pickle file formats. The former stores the feature representation, and the latter stores the customized label representation. Each piece has both files.
Columns in the .hdf feature file:
- feature
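A minimal sketch of reading one processed piece back, assuming the feature and label files share the same stem (the file names here are hypothetical, and the exact key layout may differ between versions):

import pickle

import h5py

# Hypothetical file names; each piece produces a *.hdf/*.pickle pair.
with h5py.File("train_feature/piece.hdf", "r") as fin:
    feature = fin["feature"][:]  # the stored feature representation

with open("train_feature/piece.pickle", "rb") as fin:
    label = pickle.load(fin)  # the customized label representation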
References¶
Technical details can be found in the publications [1], [2], and [3].
- 1
Yu-Te Wu, Berlin Chen, and Li Su, “Multi-Instrument Automatic Music Transcription With Self-Attention-Based Instance Segmentation.” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020.
- 2
Yu-Te Wu, Berlin Chen, and Li Su. “Polyphonic Music Transcription with Semantic Segmentation.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
- 3
Yu-Te Wu, Berlin Chen, and Li Su. “Automatic Music Transcription Leveraging Generalized Cepstral Features and Deep Learning.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
App¶
- class omnizart.music.app.MusicTranscription(conf_path=None)¶
Bases: omnizart.base.BaseTranscription

Application class for music transcription.

Inherits from the BaseTranscription class, ensuring everything needed is overridden.
Methods
generate_feature(dataset_path[, ...]): Extract the feature from the given dataset.
train(feature_folder[, model_name, ...]): Model training.
transcribe(input_audio[, model_path, output]): Transcribe notes and instruments of the given audio.
- generate_feature(dataset_path, music_settings=None, num_threads=4)¶
Extract the feature from the given dataset.
To train the model, the first step is to pre-process the data into feature representations. After downloading the dataset, use this function to generate the feature by giving the path of the stored dataset.
To specify the output path, modify the attribute music_settings.dataset.feature_save_path. It defaults to the folder where the dataset is stored, generating two sub-folders: train_feature and test_feature.
- Parameters
- dataset_path: Path
Path to the downloaded dataset.
- music_settings: MusicSettings
The configuration instance that holds all the related settings for the life-cycle of building a model.
- num_threads:
Number of threads for parallel feature extraction.
See also
omnizart.constants.datasets
The supported datasets and the corresponding training/testing splits.
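A short usage sketch; the dataset path here is only an example:

from omnizart.music.app import MusicTranscription

app = MusicTranscription()
# Generates train_feature/ and test_feature/ under the dataset folder,
# unless music_settings.dataset.feature_save_path is modified.
app.generate_feature("./maestro-v2.0.0", num_threads=4)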
- train(feature_folder, model_name=None, input_model_path=None, music_settings=None)¶
Model training.
Train the model from scratch or continue training given a model checkpoint.
- Parameters
- feature_folder: Path
Path to the generated feature.
- model_name: str
The name of the trained model. If not given, will default to the current timestamp.
- input_model_path: Path
Specify the path to the model checkpoint in order to fine-tune the model.
- music_settings: MusicSettings
The configuration that holds all the related settings for the life-cycle of model building.
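A usage sketch; the folder and model names are illustrative:

from omnizart.music.app import MusicTranscription

app = MusicTranscription()
# Train from scratch; pass input_model_path to fine-tune a checkpoint instead.
app.train("./maestro-v2.0.0/train_feature", model_name="my-music-model")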
- transcribe(input_audio, model_path=None, output='./')¶
Transcribe notes and instruments of the given audio.
This function transcribes the notes (onset, duration) of each instrument in the audio. The results will be written out as a MIDI file.
- Parameters
- input_audio: Path
Path to the wav audio file.
- model_path: Path
Path to the trained model or the transcription mode. If given a path, it should be the folder that contains arch.yaml, weights.h5, and configuration.yaml.
- output: Path (optional)
Path for writing out the transcribed MIDI file. Defaults to the current path.
- Returns
- midi: pretty_midi.PrettyMIDI
The transcribed notes of different instruments.
See also
omnizart.cli.music.transcribe
The corresponding command line entry.
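A usage sketch, assuming a local wav file and the default model:

from omnizart.music.app import MusicTranscription

app = MusicTranscription()
# Writes the transcribed MIDI to the current path and also returns it.
midi = app.transcribe("example.wav", output="./")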
Dataset¶
- class omnizart.music.app.MusicDatasetLoader(label_conversion_func, feature_folder=None, feature_files=None, num_samples=100, timesteps=128, channels=[1, 3], feature_num=352)¶
Bases: omnizart.base.BaseDatasetLoader
Data loader for training the model of the music module.

Loads features and labels for training, and converts the custom label format into piano-roll representation.
- Parameters
- label_conversion_func: callable
The function that will be used for converting the customized label format into numpy array.
- feature_folder: Path
Path to the extracted feature files, including *.hdf and *.pickle pairs, which refer to feature and label files, respectively.
- feature_files: list[Path]
List of paths to *.hdf feature files. The corresponding label files should be under the same folder.
- num_samples: int
Total number of samples to yield.
- timesteps: int
Time length of the feature.
- channels: list[int]
Channels to be used for training. Allowed values are [1, 2, 3].
- feature_num: int
Target size of feature dimension. Zero padding is done to resolve mismatched input and target size.
- Yields
- feature:
Input features for model training.
- label:
Corresponding labels.
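A sketch of constructing the loader, assuming LabelType("note-stream").get_conversion_func() returns the matching label conversion function (the folder name is illustrative):

from omnizart.music.app import MusicDatasetLoader
from omnizart.music.labels import LabelType

# feature_folder should contain the extracted *.hdf/*.pickle pairs.
loader = MusicDatasetLoader(
    label_conversion_func=LabelType("note-stream").get_conversion_func(),
    feature_folder="./maestro-v2.0.0/train_feature",
    num_samples=100,
    timesteps=128,
)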
Inference¶
- omnizart.music.inference.down_sample(pred)¶
Down sample multi-channel predictions along the feature dimension.
Down sample the feature size from 354 to 88 for inferring the notes from a multi-channel prediction.
- Parameters
- pred: 3D numpy array
Thresholded prediction with multiple channels. Dimension: [timesteps x pitch x instruments]
- Returns
- d_sample: 3D numpy array
Down-sampled prediction. Dimension: [timesteps x 88 x instruments]
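A sketch with random input, assuming the documented dimensions:

import numpy as np

from omnizart.music.inference import down_sample

# Thresholded multi-channel prediction: timesteps x pitch x instruments.
pred = (np.random.rand(100, 354, 2) > 0.5).astype(float)
d_sample = down_sample(pred)  # expected shape: (100, 88, 2)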
- omnizart.music.inference.find_min_max_stren(notes)¶
Determine the note velocity according to the prediction values.
- Parameters
- notes: list[dict]
Data structure returned by function infer_piece.
- omnizart.music.inference.find_occur(pitch, t_unit=0.02, min_duration=0.03)¶
Find the onset and offset of a thresholded prediction.
- Parameters
- pitch: 1D numpy array
Time series of predicted pitch activations.
- t_unit: float
Time unit of each entry.
- min_duration: float
Minimum interval of each note in seconds.
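A sketch of detecting one synthetic note; the exact structure of the returned value is not documented here:

import numpy as np

from omnizart.music.inference import find_occur

# One pitch activated for frames 10-19, i.e. a note of roughly 0.2 seconds.
pitch = np.zeros(100)
pitch[10:20] = 1
occurrences = find_occur(pitch, t_unit=0.02, min_duration=0.03)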
- omnizart.music.inference.infer_piece(piece, shortest_sec=0.05, offset_sec=0.12, t_unit=0.02)¶
Infer notes from the prediction of a single piece. Input dimension: time x 88 x 4 (off, dura, onset, offset).
- omnizart.music.inference.interpolation(data, ori_t_unit=0.02, tar_t_unit=0.01)¶
Interpolate between each frame to increase the time resolution.
The default setting of feature extraction has a time resolution of 0.02 seconds per frame. To fit the conventional evaluation settings, which have a time resolution of 0.01 seconds, we additionally apply this interpolation function to increase the time resolution. Here we use cubic splines for the estimation.
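An illustration of the idea with scipy (the module's own implementation may differ in detail):

import numpy as np
from scipy.interpolate import CubicSpline

ori_t_unit, tar_t_unit = 0.02, 0.01
data = np.random.rand(50)  # one second of frame-level predictions
ori_t = np.arange(len(data)) * ori_t_unit
tar_t = np.arange(0, ori_t[-1], tar_t_unit)
upsampled = CubicSpline(ori_t, data)(tar_t)  # doubled time resolution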
- omnizart.music.inference.multi_inst_note_inference(pred, mode='note-stream', onset_th=5, dura_th=2, frm_th=1, inst_th=0.95, normalize=True, t_unit=0.02, channel_program_mapping=[0, 6, 40, 41, 42, 43, 60, 68, 70, 71, 73])¶
Infer notes from raw multi-instrument predictions.
- Parameters
- mode: {‘note-stream’, ‘note’, ‘frame-stream’, ‘frame’}
Inference mode. The difference between ‘note’ and ‘frame’ is that the former yields two note attributes, ‘onset’ and ‘duration’, while the latter contains only ‘duration’, which in most cases leads to a worse listening experience. The ‘stream’ postfix means instruments are transcribed at the same time, i.e. each note is classified into an instrument class (or, say, a separate track).
- onset_th: float
Threshold of the onset channel. Can be a float or a list of floats.
- dura_th: float
Threshold of the duration channel. Can be a float or a list of floats.
- inst_th: float
Threshold for deciding whether an instrument is present, according to the standard deviation of the prediction.
- normalize: bool
Whether to normalize the predictions. For more details, please refer to our paper.
- t_unit: float
Time unit for each frame. Should not be modified unless you used different settings during feature extraction.
- channel_program_mapping: list[int]
Mapping prediction channels to MIDI program numbers.
- Returns
- out_midi
A pretty_midi.PrettyMIDI object.
References
Related publications are listed in the References section at the top of this page.
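A usage sketch, where pred is assumed to hold the raw model output (e.g. from omnizart.music.prediction.predict):

from omnizart.music.inference import multi_inst_note_inference

# pred: raw multi-instrument prediction, assumed prepared beforehand.
midi = multi_inst_note_inference(pred, mode="note-stream")
midi.write("transcribed.mid")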
- omnizart.music.inference.norm_onset_dura(pred, onset_th, dura_th, interpolate=True, normalize=True)¶
Normalize the prediction values of the onset and duration channels.
- omnizart.music.inference.norm_split_onset_dura(pred, onset_th, lower_onset_th, split_bound, dura_th, interpolate=True, normalize=True)¶
An advanced version of the function for normalizing the onset and duration channels.

Extensive experiments show that the average prediction values for high and low frequencies differ: lower pitches tend to have smaller values, while higher pitches have larger ones. To achieve better transcription results, the most straightforward solution is to assign different thresholds to the low- and high-frequency parts, which is what this function provides.
- Parameters
- pred
The predictions.
- onset_th: float
Threshold for high frequency part.
- lower_onset_th: float
Threshold for low frequency part.
- split_bound: int
The split point of low and high frequency part. Value should be within 0~87.
- interpolate: bool
Whether to apply interpolation between each frame to increase time resolution.
- normalize: bool
Whether to normalize the prediction values.
- Returns
- pred
Thresholded prediction, having value either 0 or 1.
- omnizart.music.inference.roll_down_sample(data, base=88)¶
Down sample feature size for a single pitch.
Down sample the feature size from 354 to 88 for inferring the notes.
- Parameters
- data: 2D numpy array
The thresholded 2D prediction.
- base
Should be constant as there are 88 pitches on the piano.
- Returns
- return_v: 2D numpy array
Down-sampled prediction.
Warning
The parameter data should be thresholded!
- omnizart.music.inference.threshold_type_converter(threshold, length)¶
Convert a scalar value to a list filled with that value.
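A minimal sketch of the expected behavior (the return value is presumed from the description):

from omnizart.music.inference import threshold_type_converter

# A scalar threshold is expanded to one value per channel.
th_list = threshold_type_converter(5, length=3)  # presumably [5, 5, 5]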
- omnizart.music.inference.to_midi(notes, t_unit=0.02)¶
Translate the intermediate data into the final output MIDI file.
Loss Functions¶
Loss functions for the music module.
- omnizart.music.losses.focal_loss(target_tensor, prediction_tensor, weights=None, alpha=0.25, gamma=2)¶
Compute focal loss for predictions.
Multi-label focal loss formula:
\[FL = -\alpha (z-p)^\gamma \log{(p)} - (1-\alpha) p^\gamma \log{(1-p)}\]

where \(\alpha = 0.25\), \(\gamma = 2\), \(p = \mathrm{sigmoid}(x)\), and \(z\) is the target_tensor.
- Parameters
- prediction_tensor
A float tensor of shape [batch_size, num_anchors, num_classes] representing the predicted logits for each class.
- target_tensor:
A float tensor of shape [batch_size, num_anchors, num_classes] representing one-hot encoded classification targets.
- weights
A float tensor of shape [batch_size, num_anchors].
- alpha
A scalar tensor for focal loss alpha hyper-parameter.
- gamma
A scalar tensor for focal loss gamma hyper-parameter.
- Returns
- loss
A scalar tensor representing the value of the loss function.
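A numpy sketch of the documented formula above; the actual implementation operates on tensors and may weight and reduce differently:

import numpy as np

def focal_loss_sketch(target, logits, alpha=0.25, gamma=2):
    # FL = -alpha*(z-p)^gamma*log(p) - (1-alpha)*p^gamma*log(1-p)
    p = 1.0 / (1.0 + np.exp(-logits))  # p = sigmoid(x)
    z = target
    fl = -alpha * (z - p) ** gamma * np.log(p) \
         - (1 - alpha) * p ** gamma * np.log(1 - p)
    return fl.sum()  # aggregated by summation here, for illustration only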
- omnizart.music.losses.smooth_loss(y_true, y_pred, gamma=0.15, total_chs=22, weight=None)¶
Compute the loss after applying label smoothing.
Labels¶
- class omnizart.music.labels.BaseLabelExtraction¶
Base class for extracting label information.
Provides basic functions to process the native label format into the format required by the music module. All sub-classes should parse the original label information into Label instances.

Methods
extract_label(label_path, t_unit[, ...]): Extract labels into the customized storage format.
load_label(label_path): Load the label file and parse the information into Label instances.
name_transform(name): Map the filename of a label file to the name of the corresponding wav file.
process(label_list, out_path[, t_unit, ...]): Process the given list of label files and output to the target folder.
- classmethod extract_label(label_path, t_unit, onset_len_sec=0.03)¶
Extract labels into the customized storage format.

Process the given label file into a list of Label instances, then further convert them into the deliberately customized storage format.
- Parameters
- label_path: Path
Path to the label file.
- t_unit: float
Time unit of each step in seconds. Should be consistent with the time unit of each frame of the extracted feature.
- onset_len_sec: float
Length of the first few frames with probability one. The subsequent onset probabilities fade out until the note offset.
- abstract classmethod load_label(label_path)¶
Load the label file and parse the information into Label instances.

Sub-classes should override this function to process their own label format.
- Parameters
- label_path: Path
Path to the label file.
- Returns
- labels: list[Label]
List of Label instances.
- classmethod name_transform(name)¶
Map the filename of a label file to the name of the corresponding wav file.
- Parameters
- name: str
Name of the label file, without parent directory prefix and file extension.
- Returns
- trans_name: str
The name matching the corresponding wav (i.e. feature) file.
- classmethod process(label_list, out_path, t_unit=0.02, onset_len_sec=0.03)¶
Process the given list of label files and output to the target folder.
- Parameters
- label_list: list[Path]
List of label paths.
- out_path: Path
Path for saving the extracted label files.
- t_unit: float
Time unit of each step in seconds. Should be consistent with the time unit of each frame of the extracted feature.
- onset_len_sec: float
Length of the first few frames with probability one. The subsequent onset probabilities fade out until the note offset.
- class omnizart.music.labels.LabelType(mode)¶
Defines different types of music labels for training.

Defines the functions that convert the customized label format into numpy arrays. The customized format makes it flexible to transform labels into different numpy formats according to the usage scenario, and it also saves a lot of storage space.
- Parameters
- mode: [‘note’, ‘note-stream’, ‘pop-note-stream’, ‘frame’, ‘frame-stream’]
Mode of label conversion.
note: outputs onset and duration channels
note-stream: outputs onset and duration channels of instruments (for MusicNet)
pop-note-stream: similar to note-stream mode, but for the Pop dataset
frame: same as note mode. To truly output only the duration channel, use true-frame mode.
frame-stream: same as note-stream mode. To truly output only the duration channel for each instrument, use true-frame-stream mode.
Methods
get_available_modes
get_conversion_func
get_frame
get_frame_onset
get_out_classes
multi_inst_frm
multi_inst_note
multi_pop_note
- get_available_modes()¶
- get_conversion_func()¶
- get_frame(label)¶
- get_frame_onset(label)¶
- get_out_classes()¶
- multi_inst_frm(label)¶
- multi_inst_note(label)¶
- multi_pop_note(label)¶
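A usage sketch; the exact return values depend on the chosen mode, and the comments below are presumptions from the method names:

from omnizart.music.labels import LabelType

ltype = LabelType("note-stream")
conv_func = ltype.get_conversion_func()  # presumably the label-to-numpy converter
out_classes = ltype.get_out_classes()    # presumably the number of output channels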
- class omnizart.music.labels.MaestroLabelExtraction¶
Label extraction class for Maestro dataset
Methods
load_label(label_path): Load the label file and parse the information into Label instances.
- classmethod load_label(label_path)¶
Load the label file and parse the information into Label instances.

Sub-classes should override this function to process their own label format.
- Parameters
- label_path: Path
Path to the label file.
- Returns
- labels: list[Label]
List of Label instances.
- class omnizart.music.labels.MapsLabelExtraction¶
Label extraction class for Maps dataset
Methods
load_label(label_path): Load the label file and parse the information into Label instances.
- classmethod load_label(label_path)¶
Load the label file and parse the information into Label instances.

Sub-classes should override this function to process their own label format.
- Parameters
- label_path: Path
Path to the label file.
- Returns
- labels: list[Label]
List of Label instances.
- class omnizart.music.labels.MusicNetLabelExtraction¶
Label extraction class for MusicNet dataset
Methods
load_label(label_path): Load the label file and parse the information into Label instances.
- classmethod load_label(label_path)¶
Load the label file and parse the information into Label instances.

Sub-classes should override this function to process their own label format.
- Parameters
- label_path: Path
Path to the label file.
- Returns
- labels: list[Label]
List of Label instances.
- class omnizart.music.labels.PopLabelExtraction¶
Label extraction class for Pop Rhythm dataset
Methods
name_transform(name): Map the filename of a label file to the name of the corresponding wav file.
- classmethod name_transform(name)¶
Map the filename of a label file to the name of the corresponding wav file.
- Parameters
- name: str
Name of the label file, without parent directory prefix and file extension.
- Returns
- trans_name: str
The name matching the corresponding wav (i.e. feature) file.
- class omnizart.music.labels.SuLabelExtraction¶
Label extraction class for Extended-Su dataset
Uses the same process as the Maestro dataset.
- omnizart.music.labels.label_conversion(label, ori_feature_size=352, feature_num=352, base=88, mpe=False, onsets=False, channel_mapping=None)¶
Converts the customized label format into numpy array.
- Parameters
- label: object
List of dicts in the customized label format.
- ori_feature_size: int
Size of the original feature dimension.
- feature_num: int
Size of the target output feature dimension.
- base: int
Number of total available pitches.
- mpe: bool
Whether to merge all channels into a single one, discarding information about instruments.
- onsets: bool
Fill in onset probabilities if set to true; otherwise fill ones into all activations.
- channel_mapping: dict
Maps the instrument program number to the specified channel index, used to indicate which channel should represent what instruments.
See also
omnizart.music.labels.BaseLabelExtraction.extract_label
Function that generates the customized label format.
Prediction¶
Utility functions for the music module.
- omnizart.music.prediction.create_batches(feature, timesteps, b_size=8, step_size=10)¶
Create a series of input batches.

The size of the last batch could be smaller than the given b_size.
- Parameters
- feature: numpy.ndarray
The only constraint is that the first dimension should be the time index. There is no limit on the number of dimensions.
- timesteps: int
Input feature length of the model.
- b_size: int
Batch size of the input.
- step_size: int
Step size for hopping the feature. A value smaller than timesteps indicates overlap between feature slices.
- Returns
- batches: list
List of input batches.
- omnizart.music.prediction.merge_batches(batches, step_size=10)¶
Reverse process of create_batches.

Merge the list of batch predictions into the complete predicted results.
- Parameters
- batches: numpy.ndarray
List of predicted batches.
- step_size: int
Should be the same as the value passed to create_batches.
- Returns
- pred: numpy.ndarray
The final predicted results.
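A round-trip sketch; in practice the model prediction step sits between the two calls, so here the input slices merely stand in for predicted batches:

import numpy as np

from omnizart.music.prediction import create_batches, merge_batches

feature = np.random.rand(1000, 352, 2)  # timesteps x feature_size x channels
batches = create_batches(feature, timesteps=128, b_size=8, step_size=10)
# Normally each batch would be run through the model before merging.
merged = merge_batches(batches, step_size=10)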
- omnizart.music.prediction.predict(feature, model, batch_size=4, step_size=64)¶
Make predictions on the feature.
Generate predictions by using the loaded model.
- Parameters
- feature: numpy.ndarray
Extracted feature of the audio. Dimension: timesteps x feature_size x channels
- model: keras.Model
The loaded model instance.
- batch_size: int
Batch size for the prediction iteration.
- step_size: int
Step size for hopping the feature. A value smaller than timesteps means there will be overlap.
- Returns
- pred: numpy.ndarray
The predicted results.
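A sketch of the prediction step, where feature and model are assumed to have been prepared beforehand (e.g. by the feature extraction utilities and by loading a checkpoint):

from omnizart.music.prediction import predict

# feature: numpy array of shape timesteps x feature_size x channels (assumed).
# model:   a loaded keras.Model instance (assumed).
pred = predict(feature, model, batch_size=4, step_size=64)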
- omnizart.music.prediction.predict_old(feature, model, batch_size=4)¶
Make predictions on the feature.
Generate predictions by using the loaded model.
- Parameters
- feature: numpy.ndarray
Extracted feature of the audio. Dimension: timesteps x feature_size x channels
- model: keras.Model
The loaded model instance.
- batch_size: int
Batch size for each step of prediction. The size depends on the available GPU memory.
- Returns
- pred: numpy.ndarray
The predicted results. The values range from 0 to 1.
Settings¶
Below are the default settings for building the music model. They will be loaded by the class omnizart.setting_loaders.MusicSettings. The names of the attributes are converted to snake-case (e.g., HopSize -> hop_size). There is also a path transformation process when applying the settings to the MusicSettings instance. For example, to access the attribute BatchSize defined at the yaml path General/Training/Settings/BatchSize, use the corresponding attribute MusicSettings.training.batch_size. The /Settings level is removed from all fields.
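A sketch of that mapping in code, assuming MusicSettings can be constructed with its defaults:

from omnizart.setting_loaders import MusicSettings

settings = MusicSettings()
# General/Training/Settings/BatchSize -> training.batch_size
print(settings.training.batch_size)  # 8 by default
settings.training.epoch = 10         # override a default before training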
# Self-documented configurable settings, with description, type hint, and available
# options. All the parameters can be overridden by another specified configuration file
# with selected parameters.
General:
    TranscriptionMode:
        Description: Mode of transcription by executing the `omnizart music transcribe` command.
        Type: String
        Value: Piano
    CheckpointPath:
        Description: Path to the pre-trained models.
        Type: Map
        SubType: [String, String]
        Value:
            Piano: checkpoints/music/music_piano
            Pop: checkpoints/music/music_pop
            Stream: checkpoints/music/music_note_stream
            PianoV2: checkpoints/music/music_piano-v2
    Feature:
        Description: Default settings of feature extraction.
        Settings:
            HopSize:
                Description: Hop size in seconds with respect to sampling rate.
                Type: Float
                Value: 0.02
            SamplingRate:
                Description: Adjust input sampling rate to this value.
                Type: Integer
                Value: 44100
            WindowSize:
                Type: Integer
                Value: 7939
            FrequencyResolution:
                Type: Float
                Value: 2.0
            FrequencyCenter:
                Description: Lowest frequency to extract.
                Type: Float
                Value: 27.5
            TimeCenter:
                Description: Highest frequency to extract (1/time_center).
                Type: Float
                Value: 0.00022287
            Gamma:
                Type: List
                SubType: Float
                Value: [0.24, 0.6, 1.0]
            BinsPerOctave:
                Description: Number of bins for each octave.
                Type: Integer
                Value: 48
            HarmonicNumber:
                Description: Number of harmonic bins of the HCFP feature.
                Type: Integer
                Value: 6
            Harmonic:
                Description: Whether to use the harmonic version of the input feature for training.
                Type: Bool
                Value: False
    Dataset:
        Description: Settings of datasets.
        Settings:
            SavePath:
                Description: Path for storing the downloaded datasets.
                Type: String
                Value: ./
            FeatureType:
                Description: Type of feature to extract.
                Type: String
                Value: CFP
                Choices: ["CFP", "HCFP"]
            FeatureSavePath:
                Description: Path for storing the extracted feature. Defaults to the path under the dataset folder.
                Type: String
                Value: +
    Model:
        Description: Default settings of training / testing the model.
        Settings:
            SavePrefix:
                Description: Prefix of the trained model's name to be saved.
                Type: String
                Value: music
            SavePath:
                Description: Path to save the trained model.
                Type: String
                Value: ./checkpoints/music
            ModelType:
                Description: Default model type to be used for training.
                Type: String
                Value: attn
                Choices: ["aspp", "attn"]
    Inference:
        Description: Default settings when inferring notes.
        Settings:
            MinLength:
                Description: Minimum length of a note in seconds.
                Type: Float
                Value: 0.05
            InstTh:
                Description: Threshold for filtering instruments.
                Type: Float
                Value: 1.1
            OnsetTh:
                Description: Threshold of the predicted onset channel.
                Type: Float
                Value: 3.5
            DuraTh:
                Description: Threshold of the predicted duration channel.
                Type: Float
                Value: 0.5
            FrameTh:
                Description: Threshold of frame-level predictions.
                Type: Float
                Value: 0.5
    Training:
        Description: Parameters for training.
        Settings:
            Epoch:
                Description: Maximum number of epochs for training.
                Type: Integer
                Value: 20
            Steps:
                Description: Number of training steps for each epoch.
                Type: Integer
                Value: 3000
            ValSteps:
                Description: Number of validation steps after each training epoch.
                Type: Integer
                Value: 500
            BatchSize:
                Description: Batch size of each training step.
                Type: Integer
                Value: 8
            ValBatchSize:
                Description: Batch size of each validation step.
                Type: Integer
                Value: 8
            EarlyStop:
                Description: Terminate the training if the validation performance doesn't improve after n epochs.
                Type: Integer
                Value: 6
            LossFunction:
                Description: Loss function for computing the objectives.
                Type: String
                Value: smooth
                Choices: ["smooth", "focal", "bce"]
            LabelType:
                Description: Determines whether the training target is the single- or multi-instrument scenario, among other options.
                Type: String
                Value: note-stream
                Choices:
                    - note-stream
                    - frame-stream
                    - note
                    - frame
                    - true-frame
                    - true-frame-stream
                    - pop-note-stream
            Channels:
                Description: Use different types of features for training.
                Type: List
                SubType: String
                Value: ["Spec", "Ceps"]
                Choices: ["Spec", "GCoS", "Ceps"]
            Timesteps:
                Description: Length of the time axis of the input feature.
                Type: Integer
                Value: 256
            FeatureNum:
                Description: The target size of the feature dimension.
                Type: Integer
                Value: 352