Music Transcription

Music transcription module.

This module provides utilities for transcribing the pitch and instruments in an audio recording. It is an improved version of the original repository BreezeWhite/Music-Transcription-with-Semantic-Segmentation, with a cleaner architecture and a consistent coding style, and it also provides a command line interface for easy usage.

Feature Storage Format

Processed features are stored in .hdf and .pickle file formats. The former stores the feature representation, and the latter stores the customized label representation. Each piece has one file of each type.

Columns in .hdf feature file:

  • feature

References

Technical details can be found in the publications [1], [2], and [3].

[1] Yu-Te Wu, Berlin Chen, and Li Su, “Multi-Instrument Automatic Music Transcription With Self-Attention-Based Instance Segmentation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020.

[2] Yu-Te Wu, Berlin Chen, and Li Su, “Polyphonic Music Transcription with Semantic Segmentation,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

[3] Yu-Te Wu, Berlin Chen, and Li Su, “Automatic Music Transcription Leveraging Generalized Cepstral Features and Deep Learning,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

App

class omnizart.music.app.MusicTranscription(conf_path=None)

Bases: omnizart.base.BaseTranscription

Application class for music transcription.

Inherits from the BaseTranscription class and overrides everything needed.

Methods

generate_feature(dataset_path[, ...])

Extract the feature from the given dataset.

train(feature_folder[, model_name, ...])

Model training.

transcribe(input_audio[, model_path, output])

Transcribe notes and instruments of the given audio.

generate_feature(dataset_path, music_settings=None, num_threads=4)

Extract the feature from the given dataset.

To train the model, the first step is to pre-process the data into feature representations. After downloading the dataset, use this function to generate the feature by giving the path of the stored dataset.

To specify the output path, modify the attribute music_settings.dataset.feature_save_path. It defaults to the folder where the dataset is stored, generating two sub-folders: train_feature and test_feature.

Parameters
dataset_path: Path

Path to the downloaded dataset.

music_settings: MusicSettings

The configuration instance that holds all relevant settings for the life-cycle of building a model.

num_threads:

Number of threads used to extract the features in parallel.

See also

omnizart.constants.datasets

The supported datasets and the corresponding training/testing splits.
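
A minimal usage sketch, assuming the dataset has already been downloaded to ./musicnet (a hypothetical path); the output location is overridden through the music_settings.dataset.feature_save_path attribute mentioned above:

from omnizart.music.app import MusicTranscription
from omnizart.setting_loaders import MusicSettings

app = MusicTranscription()
settings = MusicSettings()
# Hypothetical output location; by default two folders (train_feature and
# test_feature) are generated under the dataset folder.
settings.dataset.feature_save_path = "./features"
app.generate_feature("./musicnet", music_settings=settings, num_threads=4)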

train(feature_folder, model_name=None, input_model_path=None, music_settings=None)

Model training.

Train the model from scratch or continue training given a model checkpoint.

Parameters
feature_folder: Path

Path to the generated feature.

model_name: str

The name of the trained model. If not given, will default to the current timestamp.

input_model_path: Path

Specify the path to the model checkpoint in order to fine-tune the model.

music_settings: MusicSettings

The configuration that holds all relevant settings for the life-cycle of model building.
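
A hedged sketch of the two training entry points described above: training from scratch on a generated feature folder, or fine-tuning from an existing checkpoint (paths are hypothetical):

from omnizart.music.app import MusicTranscription

app = MusicTranscription()
# Train from scratch; the model name defaults to the current timestamp if omitted.
app.train("./musicnet/train_feature", model_name="my-music-model")

# Or continue training from a previously saved checkpoint.
app.train("./musicnet/train_feature", input_model_path="./checkpoints/music/music_piano")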

transcribe(input_audio, model_path=None, output='./')

Transcribe notes and instruments of the given audio.

This function transcribes the notes (onset, duration) of each instrument in the audio. The results will be written out as a MIDI file.

Parameters
input_audio: Path

Path to the wav audio file.

model_path: Path

Path to the trained model or the transcription mode. If a path is given, it should be the folder that contains arch.yaml, weights.h5, and configuration.yaml.

output: Path (optional)

Path for writing out the transcribed MIDI file. Defaults to the current path.

Returns
midi: pretty_midi.PrettyMIDI

The transcribed notes of different instruments.

See also

omnizart.cli.music.transcribe

The corresponding command line entry.
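
A minimal sketch of transcribing a single file, assuming a local wav file and one of the checkpoint folders listed in the Settings section below:

from omnizart.music.app import MusicTranscription

app = MusicTranscription()
# Writes the transcribed MIDI to the output folder and also returns a
# pretty_midi.PrettyMIDI object for further processing.
midi = app.transcribe("song.wav", model_path="checkpoints/music/music_piano", output="./")
print(midi.instruments)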

Dataset

class omnizart.music.app.MusicDatasetLoader(label_conversion_func, feature_folder=None, feature_files=None, num_samples=100, timesteps=128, channels=[1, 3], feature_num=352)

Bases: omnizart.base.BaseDatasetLoader

Data loader for training the model of the music module.

Loads features and labels for training, and converts the custom label format into a piano-roll representation.

Parameters
label_conversion_func: callable

The function that will be used for converting the customized label format into numpy array.

feature_folder: Path

Path to the extracted feature files, including *.hdf and *.pickle pairs, which correspond to the feature and label files, respectively.

feature_files: list[Path]

List of paths to *.hdf feature files. The corresponding label files should be under the same folder.

num_samples: int

Total number of samples to yield.

timesteps: int

Time length of the feature.

channels: list[int]

Channels to be used for training. Allowed values are [1, 2, 3].

feature_num: int

Target size of the feature dimension. Zero padding is applied to resolve mismatches between the input and target sizes.

Yields
feature:

Input features for model training.

label:

Corresponding labels.
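
A hedged sketch of constructing the loader, assuming LabelType("note-stream").get_conversion_func() returns the callable that turns the customized label format into numpy arrays (see the Labels section below); the feature folder path is hypothetical:

from omnizart.music.app import MusicDatasetLoader
from omnizart.music.labels import LabelType

# Folder containing the *.hdf/*.pickle pairs produced by generate_feature.
conversion_func = LabelType("note-stream").get_conversion_func()
loader = MusicDatasetLoader(
    conversion_func,
    feature_folder="./musicnet/train_feature",
    num_samples=100,
    timesteps=128,
    channels=[1, 3],
    feature_num=352,
)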

Inference

omnizart.music.inference.down_sample(pred)

Down sample multi-channel predictions along the feature dimension.

Down sample the feature size from 354 to 88 for inferring the notes from a multi-channel prediction.

Parameters
pred: 3D numpy array

Thresholded prediction with multiple channels. Dimension: [timesteps x pitch x instruments]

Returns
d_sample: 3D numpy array

Down-sampled prediction. Dimension: [timesteps x 88 x instruments]

omnizart.music.inference.find_min_max_stren(notes)

Determine the note velocities according to the prediction values.

Parameters
notes: list[dict]

Data structure returned by function infer_piece.

omnizart.music.inference.find_occur(pitch, t_unit=0.02, min_duration=0.03)

Find the onset and offset of a thresholded prediction.

Parameters
pitch: 1D numpy array

Time series of predicted pitch activations.

t_unit: float

Time unit of each entry.

min_duration: float

Minimum duration of a note in seconds.
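
Conceptually, the search scans the thresholded activation series for contiguous runs that last at least min_duration. A minimal NumPy sketch of that idea (an illustration with a hypothetical function name, not the library's exact implementation):

import numpy as np

def find_occur_sketch(pitch, t_unit=0.02, min_duration=0.03):
    """Return (onset_sec, offset_sec) pairs of activation runs in a binary series."""
    min_frames = max(int(round(min_duration / t_unit)), 1)
    padded = np.concatenate([[0], (pitch > 0).astype(int), [0]])
    diff = np.diff(padded)
    onsets = np.where(diff == 1)[0]    # frame indices where a run starts
    offsets = np.where(diff == -1)[0]  # frame indices where a run ends
    return [(on * t_unit, off * t_unit)
            for on, off in zip(onsets, offsets)
            if off - on >= min_frames]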

omnizart.music.inference.infer_piece(piece, shortest_sec=0.05, offset_sec=0.12, t_unit=0.02)

Infer notes from the prediction of a piece. Dimension: time x 88 x 4 (off, duration, onset, offset).

omnizart.music.inference.interpolation(data, ori_t_unit=0.02, tar_t_unit=0.01)

Interpolate between each frame to increase the time resolution.

The default feature extraction setting has a time resolution of 0.02 seconds per frame. To fit the conventional evaluation setting, which has a time resolution of 0.01 seconds, we additionally apply this interpolation function to increase the time resolution. Cubic spline interpolation is used for the estimation.
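
A minimal sketch of the same idea with scipy's cubic spline, resampling along the time axis from a 0.02-second to a 0.01-second grid; the function name and exact boundary handling are illustrative only:

import numpy as np
from scipy.interpolate import CubicSpline

def interpolate_sketch(data, ori_t_unit=0.02, tar_t_unit=0.01):
    # Original and target time grids; interpolation is applied along the time axis.
    ori_t = np.arange(len(data)) * ori_t_unit
    tar_t = np.arange(0, ori_t[-1], tar_t_unit)
    return CubicSpline(ori_t, data, axis=0)(tar_t)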

omnizart.music.inference.multi_inst_note_inference(pred, mode='note-stream', onset_th=5, dura_th=2, frm_th=1, inst_th=0.95, normalize=True, t_unit=0.02, channel_program_mapping=[0, 6, 40, 41, 42, 43, 60, 68, 70, 71, 73])

Function for inferring notes from raw multi-instrument predictions.

Parameters
mode: {‘note-stream’, ‘note’, ‘frame-stream’, ‘frame’}

Inference mode. The difference between ‘note’ and ‘frame’ is that the former contains two note attributes, ‘onset’ and ‘duration’, while the latter contains only ‘duration’, which in most cases leads to a worse listening experience. The ‘stream’ postfix means instruments are transcribed at the same time, i.e. each note is classified into an instrument class (a separate track).

onset_th: float

Threshold of the onset channel. Can be a list or a float.

dura_th: float

Threshold of the duration channel. Can be a list or a float.

frm_th: float

Threshold of the frame-level predictions.

inst_th: float

Threshold for deciding whether an instrument is present, according to the standard deviation of the prediction.

normalize: bool

Whether to normalize the predictions. For more details, please refer to our paper.

t_unit: float

Time unit of each frame. Should not be modified unless you used different settings during feature extraction.

channel_program_mapping: list[int]

Mapping prediction channels to MIDI program numbers.

Returns
out_midi

A pretty_midi.PrettyMIDI object.

References

Related publications are listed in the References section above.
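
A hedged sketch of how this function can be chained with predict from omnizart.music.prediction (see the Prediction section below); pred is assumed to be the raw multi-channel model output, and pretty_midi handles the final file writing:

from omnizart.music.inference import multi_inst_note_inference
from omnizart.music.prediction import predict

def transcribe_prediction(feature, model, out_path="transcription.mid"):
    """Run the model on an extracted feature and convert the raw prediction to MIDI."""
    pred = predict(feature, model)
    midi = multi_inst_note_inference(pred, mode="note-stream", t_unit=0.02)
    midi.write(out_path)  # pretty_midi.PrettyMIDI.write
    return midi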

omnizart.music.inference.norm_onset_dura(pred, onset_th, dura_th, interpolate=True, normalize=True)

Normalize the prediction values of the onset and duration channels.

omnizart.music.inference.norm_split_onset_dura(pred, onset_th, lower_onset_th, split_bound, dura_th, interpolate=True, normalize=True)

An advanced function for normalizing the onset and duration channels.

From extensive experiments, we observe that the average prediction values for the high- and low-frequency parts differ: lower pitches tend to have smaller values, while higher pitches have larger ones. To achieve better transcription results, the most straightforward solution is to assign different thresholds to the low- and high-frequency parts, which is exactly what this function provides.

Parameters
pred

The predictions.

onset_th: float

Threshold for high frequency part.

lower_onset_th: float

Threshold for low frequency part.

split_bound: int

The split point between the low- and high-frequency parts. The value should be within 0~87.

interpolate: bool

Whether to apply interpolation between each frame to increase time resolution.

normalize: bool

Whether to normalize the prediction values.

Returns
pred

Thresholded prediction, with values of either 0 or 1.
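
A minimal sketch of the split-threshold idea described above, assuming a single onset channel of shape (timesteps x 88); normalization and interpolation are omitted and the function name is hypothetical:

import numpy as np

def split_threshold_sketch(onset, onset_th, lower_onset_th, split_bound):
    """Binarize pitches below split_bound with a lower threshold than the rest."""
    out = np.zeros_like(onset)
    out[:, :split_bound] = onset[:, :split_bound] > lower_onset_th  # low-frequency part
    out[:, split_bound:] = onset[:, split_bound:] > onset_th        # high-frequency part
    return out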

omnizart.music.inference.roll_down_sample(data, base=88)

Down sample feature size for a single pitch.

Down sample the feature size from 354 to 88 for inferring the notes.

Parameters
data: 2D numpy array

The thresholded 2D prediction.

base

Should be kept constant, as there are 88 pitches on a piano.

Returns
return_v: 2D numpy array

Down sampled prediction.

Warning

The parameter data should be thresholded!

omnizart.music.inference.threshold_type_converter(threshold, length)

Convert a scalar threshold value to a list of the given length with the same value.

omnizart.music.inference.to_midi(notes, t_unit=0.02)

Translate the intermediate data into final output MIDI file.

Loss Functions

Loss functions for the music module.

omnizart.music.losses.focal_loss(target_tensor, prediction_tensor, weights=None, alpha=0.25, gamma=2)

Compute focal loss for predictions.

Multi-labels Focal loss formula:

\[FL = -\alpha (z-p)^\gamma \log(p) - (1-\alpha)\, p^\gamma \log(1-p)\]

where \(\alpha = 0.25\), \(\gamma = 2\), \(p = \mathrm{sigmoid}(x)\), and \(z\) is target_tensor.

Parameters
prediction_tensor

A float tensor of shape [batch_size, num_anchors, num_classes] representing the predicted logits for each class.

target_tensor:

A float tensor of shape [batch_size, num_anchors, num_classes] representing one-hot encoded classification targets.

weights

A float tensor of shape [batch_size, num_anchors].

alpha

A scalar tensor for focal loss alpha hyper-parameter.

gamma

A scalar tensor for focal loss gamma hyper-parameter.

Returns
loss

A scalar tensor representing the value of the loss function.
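
A minimal NumPy sketch of the focal-loss idea expressed by the formula above: the positive term is weighted by alpha and modulated by (1-p)^gamma, the negative term by (1-alpha) and p^gamma. The library's TensorFlow implementation may differ in numerical details:

import numpy as np

def focal_loss_sketch(target, logits, alpha=0.25, gamma=2.0, eps=1e-7):
    """Element-wise multi-label focal loss, averaged over all entries."""
    p = 1.0 / (1.0 + np.exp(-logits))  # p = sigmoid(x)
    p = np.clip(p, eps, 1.0 - eps)     # guard the logarithms
    pos = -alpha * target * (1.0 - p) ** gamma * np.log(p)
    neg = -(1.0 - alpha) * (1.0 - target) * p ** gamma * np.log(1.0 - p)
    return np.mean(pos + neg)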

omnizart.music.losses.smooth_loss(y_true, y_pred, gamma=0.15, total_chs=22, weight=None)

Function to compute the loss after applying label smoothing.

Labels

class omnizart.music.labels.BaseLabelExtraction

Base class for extracting label information.

Provides basic functions to process the native label format into the format required by the music module. All sub-classes should parse the original label information into Label instances.

Methods

extract_label(label_path, t_unit[, ...])

Extract labels into customized storage format.

load_label(label_path)

Load the label file and parse information into Label class.

name_transform(name)

Maps the label filename to the name of the corresponding wav file.

process(label_list, out_path[, t_unit, ...])

Process the given list of label files and output to the target folder.

classmethod extract_label(label_path, t_unit, onset_len_sec=0.03)

Extract labels into customized storage format.

Process the label file at the given path into a list of Label instances, then convert them into the customized storage format.

Parameters
label_path: Path

Path to the label file.

t_unit: float

Time unit of each step in seconds. Should be consistent with the time unit of each frame of the extracted feature.

onset_len_sec: float

Length (in seconds) of the first few frames that are assigned probability one. The onset probabilities thereafter fade out until the note offset.

abstract classmethod load_label(label_path)

Load the label file and parse information into Label class.

Sub-classes should override this function to process their own label format.

Parameters
label_path: Path

Path to the label file.

Returns
labels: list[Label]

List of Label instances.

classmethod name_transform(name)

Maps the label filename to the name of the corresponding wav file.

Parameters
name: str

Name of the label file, without parent directory prefix and file extension.

Returns
trans_name: str

The same name as the corresponding wav (i.e. feature) file.

classmethod process(label_list, out_path, t_unit=0.02, onset_len_sec=0.03)

Process the given list of label files and output to the target folder.

Parameters
label_list: list[Path]

List of label paths.

out_path: Path

Path for saving the extracted label files.

t_unit: float

Time unit of each step in seconds. Should be consistent with the time unit of each frame of the extracted feature.

onset_len_sec: float

Length (in seconds) of the first few frames that are assigned probability one. The onset probabilities thereafter fade out until the note offset.

class omnizart.music.labels.LabelType(mode)

Defines different types of music labels for training.

Defines functions that convert the customized label format into numpy arrays. With the customized format, it is more flexible to transform labels into different numpy formats according to the usage scenario, and it also saves a lot of storage space.

Parameters
mode: [‘note’, ‘note-stream’, ‘pop-note-stream’, ‘frame’, ‘frame-stream’]

Mode of label conversion.

  • note: outputs the onset and duration channels

  • note-stream: outputs the onset and duration channels of each instrument (for MusicNet)

  • pop-note-stream: similar to note-stream mode, but for the Pop dataset

  • frame: same as note mode. To truly output only the duration channel, use the true-frame mode.

  • frame-stream: same as note-stream. To truly output only the duration channel for each instrument, use the true-frame-stream mode.

Methods

get_available_modes

get_conversion_func

get_frame

get_frame_onset

get_out_classes

multi_inst_frm

multi_inst_note

multi_pop_note

get_available_modes()
get_conversion_func()
get_frame(label)
get_frame_onset(label)
get_out_classes()
multi_inst_frm(label)
multi_inst_note(label)
multi_pop_note(label)

class omnizart.music.labels.MaestroLabelExtraction

Label extraction class for Maestro dataset

Methods

load_label(label_path)

Load the label file and parse information into Label class.

classmethod load_label(label_path)

Load the label file and parse information into Label class.

Sub-classes should override this function to process their own label format.

Parameters
label_path: Path

Path to the label file.

Returns
labels: list[Label]

List of Label instances.

class omnizart.music.labels.MapsLabelExtraction

Label extraction class for Maps dataset

Methods

load_label(label_path)

Load the label file and parse information into Label class.

classmethod load_label(label_path)

Load the label file and parse information into Label class.

Sub-classes should override this function to process their own label format.

Parameters
label_path: Path

Path to the label file.

Returns
labels: list[Label]

List of Label instances.

class omnizart.music.labels.MusicNetLabelExtraction

Label extraction class for MusicNet dataset

Methods

load_label(label_path)

Load the label file and parse information into Label class.

classmethod load_label(label_path)

Load the label file and parse information into Label class.

Sub-classes should override this function to process their own label format.

Parameters
label_path: Path

Path to the label file.

Returns
labels: list[Label]

List of Label instances.

class omnizart.music.labels.PopLabelExtraction

Label extraction class for Pop Rhythm dataset

Methods

name_transform(name)

Maps the label filename to the name of the corresponding wav file.

classmethod name_transform(name)

Maps the label filename to the name of the corresponding wav file.

Parameters
name: str

Name of the label file, without parent directory prefix and file extension.

Returns
trans_name: str

The same name as the corresponding wav (i.e. feature) file.

class omnizart.music.labels.SuLabelExtraction

Label extraction class for Extended-Su dataset

Uses the same process as the Maestro dataset.

omnizart.music.labels.label_conversion(label, ori_feature_size=352, feature_num=352, base=88, mpe=False, onsets=False, channel_mapping=None)

Converts the customized label format into numpy array.

Parameters
label: object

List of dicts in the customized label format.

ori_feature_size: int

Size of the original feature dimension.

feature_num: int

Size of the target output feature dimension.

base: int

Number of total available pitches.

mpe: bool

Whether to merge all channels into a single one, discarding information about instruments.

onsets: bool

If set to true, fill in the onset probabilities; otherwise, fill all activations with one.

channel_mapping: dict

Maps the instrument program number to the specified channel index, used to indicate which channel should represent what instruments.

See also

omnizart.music.labels.BaseLabelExtraction.extract_label

Function that generates the customized label format.

Prediction

Utility functions for the music module.

omnizart.music.prediction.create_batches(feature, timesteps, b_size=8, step_size=10)

Create a series of batch input.

The size of the last batch could be smaller than the given b_size.

Parameters
feature: numpy.ndarray

The only constraint is that the first dimension should be the time index. There is no limit on the number of dimensions.

timesteps: int

Input feature length of the model.

b_size: int

Batch size of the input.

step_size: int

Step size for hopping the feature. A value smaller than timesteps means there will be overlap between feature slices.

Returns
batches: list

List of input batches.
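
Conceptually, the feature is cut along the time axis into overlapping windows of length timesteps with hop step_size, and the windows are grouped into batches of b_size. A minimal NumPy sketch of that idea (padding of the last incomplete window is omitted, and the function name is hypothetical):

import numpy as np

def create_batches_sketch(feature, timesteps, b_size=8, step_size=10):
    # Slice overlapping windows along the time axis.
    windows = [feature[start:start + timesteps]
               for start in range(0, len(feature) - timesteps + 1, step_size)]
    # Group windows into batches; the last batch may be smaller than b_size.
    return [np.stack(windows[i:i + b_size])
            for i in range(0, len(windows), b_size)]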

omnizart.music.prediction.merge_batches(batches, step_size=10)

Reverse process of create_batches.

Merge the list of batch predictions into the complete predicted results.

Parameters
batches: numpy.ndarray

List of predicted batches.

step_size: int

Should be the same as the value passed to create_batches.

Returns
pred: numpy.ndarray

The final predicted results.

omnizart.music.prediction.predict(feature, model, batch_size=4, step_size=64)

Make predictions on the feature.

Generate predictions by using the loaded model.

Parameters
feature: numpy.ndarray

Extracted feature of the audio. Dimension: timesteps x feature_size x channels

model: keras.Model

The loaded model instance.

batch_size: int

Batch size for the prediction iteration.

step_size: int

Step size for hopping the feature. A value smaller than timesteps means there will be overlap.

Returns
pred: numpy.ndarray

The predicted results.

omnizart.music.prediction.predict_old(feature, model, batch_size=4)

Make predictions on the feature.

Generate predictions by using the loaded model.

Parameters
feature: numpy.ndarray

Extracted feature of the audio. Dimension: timesteps x feature_size x channels

model: keras.Model

The loaded model instance

batch_size: int

Batch size for each prediction step. The appropriate size depends on the available GPU memory.

Returns
pred: numpy.ndarray

The predicted results. Values range from 0 to 1.

Settings

Below are the default settings for building the music model. They will be loaded by the class omnizart.setting_loaders.MusicSettings. The names of the attributes are converted to snake-case (e.g., HopSize -> hop_size). There is also a path transformation applied when loading the settings into the MusicSettings instance: for example, the attribute BatchSize defined at the YAML path General/Training/Settings/BatchSize maps to MusicSettings.training.batch_size. The /Settings level is removed from all fields.
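
For example, a hedged sketch of overriding documented values through the snake-case attributes, assuming the attribute paths follow the transformation described above:

from omnizart.setting_loaders import MusicSettings

settings = MusicSettings()
settings.training.batch_size = 16  # General/Training/Settings/BatchSize
settings.feature.hop_size = 0.02   # General/Feature/Settings/HopSize (assumed mapping)

The full default configuration is listed below.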

# Self-documented configurable settings, with description, type hint, and available
# options. All the parameters can be overridden by another specified configuration file
# with selected parameters.


General:
    TranscriptionMode:
        Description: Mode of transcription by executing the `omnizart music transcribe` command.
        Type: String 
        Value: Piano
    CheckpointPath:
        Description: Path to the pre-trained models.
        Type: Map
        SubType: [String, String]
        Value:
            Piano: checkpoints/music/music_piano
            Pop: checkpoints/music/music_pop
            Stream: checkpoints/music/music_note_stream
            PianoV2: checkpoints/music/music_piano-v2
    Feature:
        Description: Default settings of feature extraction
        Settings:
            HopSize:
                Description: Hop size in seconds with respect to sampling rate.
                Type: Float
                Value: 0.02
            SamplingRate:
                Description: Adjust input sampling rate to this value.
                Type: Integer
                Value: 44100
            WindowSize:
                Type: Integer
                Value: 7939
            FrequencyResolution:
                Type: Float
                Value: 2.0
            FrequencyCenter:
                Description: Lowest frequency to extract.
                Type: Float
                Value: 27.5
            TimeCenter:
                Description: Highest frequency to extract (1/time_center).
                Type: Float
                Value: 0.00022287
            Gamma:
                Type: List
                SubType: Float
                Value: [0.24, 0.6, 1.0]
            BinsPerOctave:
                Description: Number of bins for each octave.
                Type: Integer
                Value: 48
            HarmonicNumber:
                Description: Number of harmonic bins of HCFP feature.
                Type: Integer
                Value: 6
            Harmonic:
                Description: Whether to use harmonic version of the input feature for training.
                Type: Bool
                Value: False
    Dataset:
        Description: Settings of datasets.
        Settings:
            SavePath:
                Description: Path for storing the downloaded datasets.
                Type: String
                Value: ./
            FeatureType:
                Description: Type of feature to extract.
                Type: String
                Value: CFP
                Choices: ["CFP", "HCFP"]
            FeatureSavePath:
                Description: Path for storing the extracted feature. Default to the path under the dataset folder.
                Type: String
                Value: +
    Model:
        Description: Default settings of training / testing the model.
        Settings:
            SavePrefix:
                Description: Prefix of the trained model's name to be saved.
                Type: String
                Value: music
            SavePath:
                Description: Path to save the trained model.
                Type: String
                Value: ./checkpoints/music
            ModelType:
                Description: Default model type to be used for training.
                Type: String
                Value: attn
                Choices: ["aspp", "attn"]
    Inference:
        Description: Default settings when infering notes.
        Settings:
            MinLength:
                Description: Minimum length of a note in seconds.
                Type: Float
                Value: 0.05
            InstTh:
                Description: Threshold for filtering instruments.
                Type: Float
                Value: 1.1
            OnsetTh:
                Description: Threshold of predicted onset channel.
                Type: Float
                Value: 3.5
            DuraTh:
                Description: Threshold of predicted duration channel.
                Type: Float
                Value: 0.5
            FrameTh:
                Description: Threshold of frame-level predictions.
                Type: Float
                Value: 0.5
    Training:
        Description: Parameters for training.
        Settings:
            Epoch:
                Description: Maximum number of epochs for training.
                Type: Integer
                Value: 20
            Steps:
                Description: Number of training steps for each epoch.
                Type: Integer
                Value: 3000
            ValSteps:
                Description: Number of validation steps after each training epoch.
                Type: Integer
                Value: 500
            BatchSize:
                Description: Batch size of each training step.
                Type: Integer
                Value: 8
            ValBatchSize:
                Description: Batch size of each validation step.
                Type: Integer
                Value: 8
            EarlyStop:
                Description: Terminate the training if the validation performance doesn't improve after n epochs.
                Type: Integer
                Value: 6
            LossFunction:
                Description: Loss function for computing the objectives.
                Type: String
                Value: smooth
                Choices: ["smooth", "focal", "bce"]
            LabelType:
                Description: Determines the training target to be single- or multi-instrument scenario, and more options.
                Type: String
                Value: note-stream
                Choices: 
                    - note-stream
                    - frame-stream
                    - note
                    - frame
                    - true-frame
                    - true-frame-stream
                    - pop-note-stream
            Channels:
                Description: Use different types of feature for training.
                Type: List
                SubType: String
                Value: ["Spec", "Ceps"]
                Choices: ["Spec", "GCoS", "Ceps"]
            Timesteps:
                Description: Length of time axis of the input feature.
                Type: Integer
                Value: 256
            FeatureNum:
                Description: The target size of feature dimension.
                Type: Integer
                Value: 352