Music Transcription¶
Music transcription module.
This module provides utilities for transcribing pitch and instruments in audio. It is an improved version of the original repository BreezeWhite/Music-Transcription-with-Semantic-Segmentation, with a cleaner architecture and a consistent coding style, and it also provides a command-line interface for easy usage.
Feature Storage Format¶
Processed features are stored in the .hdf and .pickle file formats. The former stores the feature representation, and the latter stores the customized label representation. Each piece has both files.
Columns in the .hdf feature file:
- feature
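A minimal sketch of reading one processed piece back, assuming the feature and label files share the same stem (the file names here are hypothetical, and the exact key layout may differ between versions):

import pickle

import h5py

# Hypothetical file names; each piece produces a *.hdf/*.pickle pair.
with h5py.File("train_feature/piece.hdf", "r") as fin:
    feature = fin["feature"][:]  # the stored feature representation

with open("train_feature/piece.pickle", "rb") as fin:
    label = pickle.load(fin)  # the customized label representation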
References¶
Technical details can be found in the publications [1], [2], and [3].
- 1
Yu-Te Wu, Berlin Chen, and Li Su, “Multi-Instrument Automatic Music Transcription With Self-Attention-Based Instance Segmentation.” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020.
- 2
Yu-Te Wu, Berlin Chen, and Li Su. “Polyphonic Music Transcription with Semantic Segmentation.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
- 3
Yu-Te Wu, Berlin Chen, and Li Su. “Automatic Music Transcription Leveraging Generalized Cepstral Features and Deep Learning.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
App¶
- class omnizart.music.app.MusicTranscription(conf_path=None)¶
Bases: omnizart.base.BaseTranscription

Application class for music transcription.

Inherits from the BaseTranscription class, ensuring everything needed is overridden.
Methods
generate_feature(dataset_path[, ...]): Extract the feature from the given dataset.
train(feature_folder[, model_name, ...]): Model training.
transcribe(input_audio[, model_path, output]): Transcribe notes and instruments of the given audio.
- generate_feature(dataset_path, music_settings=None, num_threads=4)¶
Extract the feature from the given dataset.
To train the model, the first step is to pre-process the data into feature representations. After downloading the dataset, use this function to generate the feature by giving the path of the stored dataset.
To specify the output path, modify the attribute music_settings.dataset.feature_save_path. It defaults to the folder where the dataset is stored, generating two sub-folders: train_feature and test_feature.
- Parameters
- dataset_path: Path
Path to the downloaded dataset.
- music_settings: MusicSettings
The configuration instance that holds all the related settings for the life-cycle of building a model.
- num_threads:
Number of threads for parallel feature extraction.
See also
omnizart.constants.datasets
The supported datasets and the corresponding training/testing splits.
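A short usage sketch; the dataset path here is only an example:

from omnizart.music.app import MusicTranscription

app = MusicTranscription()
# Generates train_feature/ and test_feature/ under the dataset folder,
# unless music_settings.dataset.feature_save_path is modified.
app.generate_feature("./maestro-v2.0.0", num_threads=4)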
- train(feature_folder, model_name=None, input_model_path=None, music_settings=None)¶
Model training.
Train the model from scratch or continue training given a model checkpoint.
- Parameters
- feature_folder: Path
Path to the generated feature.
- model_name: str
The name of the trained model. If not given, will default to the current timestamp.
- input_model_path: Path
Specify the path to the model checkpoint in order to fine-tune the model.
- music_settings: MusicSettings
The configuration that holds all the related settings for the life-cycle of model building.
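A usage sketch; the folder and model names are illustrative:

from omnizart.music.app import MusicTranscription

app = MusicTranscription()
# Train from scratch; pass input_model_path to fine-tune a checkpoint instead.
app.train("./maestro-v2.0.0/train_feature", model_name="my-music-model")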
- transcribe(input_audio, model_path=None, output='./')¶
Transcribe notes and instruments of the given audio.
This function transcribes the notes (onset, duration) of each instrument in the audio. The results will be written out as a MIDI file.
- Parameters
- input_audio: Path
Path to the wav audio file.
- model_path: Path
Path to the trained model or the transcription mode. If given a path, it should be the folder that contains arch.yaml, weights.h5, and configuration.yaml.
- output: Path (optional)
Path for writing out the transcribed MIDI file. Defaults to the current path.
- Returns
- midi: pretty_midi.PrettyMIDI
The transcribed notes of different instruments.
See also
omnizart.cli.music.transcribe
The corresponding command line entry.
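A usage sketch, assuming a local wav file and the default model:

from omnizart.music.app import MusicTranscription

app = MusicTranscription()
# Writes the transcribed MIDI to the current path and also returns it.
midi = app.transcribe("example.wav", output="./")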
Dataset¶
- class omnizart.music.app.MusicDatasetLoader(label_conversion_func, feature_folder=None, feature_files=None, num_samples=100, timesteps=128, channels=[1, 3], feature_num=352)¶
Bases: omnizart.base.BaseDatasetLoader
Data loader for training the model of the music module.

Loads features and labels for training, and converts the custom label format into piano-roll representation.
- Parameters
- label_conversion_func: callable
The function that will be used for converting the customized label format into numpy array.
- feature_folder: Path
Path to the extracted feature files, including *.hdf and *.pickle pairs, which refer to feature and label files, respectively.
- feature_files: list[Path]
List of paths to *.hdf feature files. The corresponding label files should be under the same folder.
- num_samples: int
Total number of samples to yield.
- timesteps: int
Time length of the feature.
- channels: list[int]
Channels to be used for training. Allowed values are [1, 2, 3].
- feature_num: int
Target size of feature dimension. Zero padding is done to resolve mismatched input and target size.
- Yields
- feature:
Input features for model training.
- label:
Corresponding labels.
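A sketch of constructing the loader, assuming LabelType("note-stream").get_conversion_func() returns the matching label conversion function (the folder name is illustrative):

from omnizart.music.app import MusicDatasetLoader
from omnizart.music.labels import LabelType

# feature_folder should contain the extracted *.hdf/*.pickle pairs.
loader = MusicDatasetLoader(
    label_conversion_func=LabelType("note-stream").get_conversion_func(),
    feature_folder="./maestro-v2.0.0/train_feature",
    num_samples=100,
    timesteps=128,
)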
Inference¶
- omnizart.music.inference.down_sample(pred)¶
Down sample multi-channel predictions along the feature dimension.
Down sample the feature size from 354 to 88 for inferring the notes from a multi-channel prediction.
- Parameters
- pred: 3D numpy array
Thresholded prediction with multiple channels. Dimension: [timesteps x pitch x instruments]
- Returns
- d_sample: 3D numpy array
Down-sampled prediction. Dimension: [timesteps x 88 x instruments]
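A sketch with random input, assuming the documented dimensions:

import numpy as np

from omnizart.music.inference import down_sample

# Thresholded multi-channel prediction: timesteps x pitch x instruments.
pred = (np.random.rand(100, 354, 2) > 0.5).astype(float)
d_sample = down_sample(pred)  # expected shape: (100, 88, 2)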
- omnizart.music.inference.find_min_max_stren(notes)¶
Determine the note velocity according to the prediction values.
- Parameters
- notes: list[dict]
Data structure returned by function infer_piece.
- omnizart.music.inference.find_occur(pitch, t_unit=0.02, min_duration=0.03)¶
Find the onset and offset of a thresholded prediction.
- Parameters
- pitch: 1D numpy array
Time series of predicted pitch activations.
- t_unit: float
Time unit of each entry.
- min_duration: float
Minimum interval of each note in seconds.
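A sketch of detecting one synthetic note; the exact structure of the returned value is not documented here:

import numpy as np

from omnizart.music.inference import find_occur

# One pitch activated for frames 10-19, i.e. a note of roughly 0.2 seconds.
pitch = np.zeros(100)
pitch[10:20] = 1
occurrences = find_occur(pitch, t_unit=0.02, min_duration=0.03)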
- omnizart.music.inference.infer_piece(piece, shortest_sec=0.05, offset_sec=0.12, t_unit=0.02)¶
Infer notes from the prediction of a single piece. Input dimension: time x 88 x 4 (off, dura, onset, offset).
- omnizart.music.inference.interpolation(data, ori_t_unit=0.02, tar_t_unit=0.01)¶
Interpolate between each frame to increase the time resolution.
The default setting of feature extraction has a time resolution of 0.02 seconds per frame. To fit the conventional evaluation settings, which have a time resolution of 0.01 seconds, we additionally apply this interpolation function to increase the time resolution. Here we use cubic splines for the estimation.
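An illustration of the idea with scipy (the module's own implementation may differ in detail):

import numpy as np
from scipy.interpolate import CubicSpline

ori_t_unit, tar_t_unit = 0.02, 0.01
data = np.random.rand(50)  # one second of frame-level predictions
ori_t = np.arange(len(data)) * ori_t_unit
tar_t = np.arange(0, ori_t[-1], tar_t_unit)
upsampled = CubicSpline(ori_t, data)(tar_t)  # doubled time resolution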
- omnizart.music.inference.multi_inst_note_inference(pred, mode='note-stream', onset_th=5, dura_th=2, frm_th=1, inst_th=0.95, normalize=True, t_unit=0.02, channel_program_mapping=[0, 6, 40, 41, 42, 43, 60, 68, 70, 71, 73])¶
Infer notes from raw multi-instrument predictions.
- Parameters
- mode: {‘note-stream’, ‘note’, ‘frame-stream’, ‘frame’}
Inference mode. The difference between ‘note’ and ‘frame’ is that the former yields two note attributes, ‘onset’ and ‘duration’, while the latter contains only ‘duration’, which in most cases leads to a worse listening experience. The ‘stream’ postfix means instruments are transcribed at the same time, i.e. each note is classified into an instrument class (or, say, a separate track).
- onset_th: float
Threshold of the onset channel. Can be a float or a list of floats.
- dura_th: float
Threshold of the duration channel. Can be a float or a list of floats.
- inst_th: float
Threshold for deciding whether an instrument is present, according to the standard deviation of the prediction.
- normalize: bool
Whether to normalize the predictions. For more details, please refer to our paper.
- t_unit: float
Time unit for each frame. Should not be modified unless you used different settings during feature extraction.
- channel_program_mapping: list[int]
Mapping prediction channels to MIDI program numbers.
- Returns
- out_midi
A pretty_midi.PrettyMIDI object.
References
Related publications are listed in the References section at the top of this page.
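A usage sketch, where pred is assumed to hold the raw model output (e.g. from omnizart.music.prediction.predict):

from omnizart.music.inference import multi_inst_note_inference

# pred: raw multi-instrument prediction, assumed prepared beforehand.
midi = multi_inst_note_inference(pred, mode="note-stream")
midi.write("transcribed.mid")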
- omnizart.music.inference.norm_onset_dura(pred, onset_th, dura_th, interpolate=True, normalize=True)¶
Normalize the prediction values of the onset and duration channels.
- omnizart.music.inference.norm_split_onset_dura(pred, onset_th, lower_onset_th, split_bound, dura_th, interpolate=True, normalize=True)¶
An advanced version of the function for normalizing the onset and duration channels.

Extensive experiments show that the average prediction values for high and low frequencies differ: lower pitches tend to have smaller values, while higher pitches have larger ones. To achieve better transcription results, the most straightforward solution is to assign different thresholds to the low- and high-frequency parts, which is what this function provides.
- Parameters
- pred
The predictions.
- onset_th: float
Threshold for high frequency part.
- lower_onset_th: float
Threshold for low frequency part.
- split_bound: int
The split point of low and high frequency part. Value should be within 0~87.
- interpolate: bool
Whether to apply interpolation between each frame to increase time resolution.
- normalize: bool
Whether to normalize the prediction values.
- Returns
- pred
Thresholded prediction, having value either 0 or 1.
- omnizart.music.inference.roll_down_sample(data, base=88)¶
Down sample feature size for a single pitch.
Down sample the feature size from 354 to 88 for inferring the notes.
- Parameters
- data: 2D numpy array
The thresholded 2D prediction.
- base
Should be constant as there are 88 pitches on the piano.
- Returns
- return_v: 2D numpy array
Down-sampled prediction.
Warning
The parameter data should be thresholded!
- omnizart.music.inference.threshold_type_converter(threshold, length)¶
Convert a scalar value to a list filled with that value.
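A minimal sketch of the expected behavior (the return value is presumed from the description):

from omnizart.music.inference import threshold_type_converter

# A scalar threshold is expanded to one value per channel.
th_list = threshold_type_converter(5, length=3)  # presumably [5, 5, 5]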
- omnizart.music.inference.to_midi(notes, t_unit=0.02)¶
Translate the intermediate data into the final output MIDI file.
Loss Functions¶
Loss functions for the music module.
- omnizart.music.losses.focal_loss(target_tensor, prediction_tensor, weights=None, alpha=0.25, gamma=2)¶
Compute focal loss for predictions.
Multi-label focal loss formula:
\[FL = -\alpha (z-p)^\gamma \log{(p)} - (1-\alpha) p^\gamma \log{(1-p)}\]

where \(\alpha = 0.25\), \(\gamma = 2\), \(p = \mathrm{sigmoid}(x)\), and \(z\) is the target_tensor.
- Parameters
- prediction_tensor
A float tensor of shape [batch_size, num_anchors, num_classes] representing the predicted logits for each class.
- target_tensor:
A float tensor of shape [batch_size, num_anchors, num_classes] representing one-hot encoded classification targets.
- weights
A float tensor of shape [batch_size, num_anchors].
- alpha
A scalar tensor for focal loss alpha hyper-parameter.
- gamma
A scalar tensor for focal loss gamma hyper-parameter.
- Returns
- loss
A scalar tensor representing the value of the loss function.
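A numpy sketch of the documented formula above; the actual implementation operates on tensors and may weight and reduce differently:

import numpy as np

def focal_loss_sketch(target, logits, alpha=0.25, gamma=2):
    # FL = -alpha*(z-p)^gamma*log(p) - (1-alpha)*p^gamma*log(1-p)
    p = 1.0 / (1.0 + np.exp(-logits))  # p = sigmoid(x)
    z = target
    fl = -alpha * (z - p) ** gamma * np.log(p) \
         - (1 - alpha) * p ** gamma * np.log(1 - p)
    return fl.sum()  # aggregated by summation here, for illustration only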
- omnizart.music.losses.smooth_loss(y_true, y_pred, gamma=0.15, total_chs=22, weight=None)¶
Compute the loss after applying label smoothing.
Labels¶
- class omnizart.music.labels.BaseLabelExtraction¶
Base class for extracting label information.
Provides basic functions to process the native label format into the format required by the music module. All sub-classes should parse the original label information into Label instances.

Methods
extract_label(label_path, t_unit[, ...]): Extract labels into the customized storage format.
load_label(label_path): Load the label file and parse the information into Label instances.
name_transform(name): Map the filename of a label file to the name of the corresponding wav file.
process(label_list, out_path[, t_unit, ...]): Process the given list of label files and output to the target folder.
- classmethod extract_label(label_path, t_unit, onset_len_sec=0.03)¶
Extract labels into the customized storage format.

Process the given label file into a list of Label instances, then further convert them into the deliberately customized storage format.
- Parameters
- label_path: Path
Path to the label file.
- t_unit: float
Time unit of each step in seconds. Should be consistent with the time unit of each frame of the extracted feature.
- onset_len_sec: float
Length of the first few frames with probability one. The subsequent onset probabilities fade out until the note offset.
- abstract classmethod load_label(label_path)¶
Load the label file and parse the information into Label instances.

Sub-classes should override this function to process their own label format.
- Parameters
- label_path: Path
Path to the label file.
- Returns
- labels: list[Label]
List of Label instances.
- classmethod name_transform(name)¶
Map the filename of a label file to the name of the corresponding wav file.
- Parameters
- name: str
Name of the label file, without parent directory prefix and file extension.
- Returns
- trans_name: str
The name matching the corresponding wav (i.e. feature) file.
- classmethod process(label_list, out_path, t_unit=0.02, onset_len_sec=0.03)¶
Process the given list of label files and output to the target folder.
- Parameters
- label_list: list[Path]
List of label paths.
- out_path: Path
Path for saving the extracted label files.
- t_unit: float
Time unit of each step in seconds. Should be consistent with the time unit of each frame of the extracted feature.
- onset_len_sec: float
Length of the first few frames with probability one. The subsequent onset probabilities fade out until the note offset.
- class omnizart.music.labels.LabelType(mode)¶
Defines different types of music labels for training.

Defines the functions that convert the customized label format into numpy arrays. The customized format makes it flexible to transform labels into different numpy formats according to the usage scenario, and it also saves a lot of storage space.
- Parameters
- mode: [‘note’, ‘note-stream’, ‘pop-note-stream’, ‘frame’, ‘frame-stream’]
Mode of label conversion.
note: outputs onset and duration channels
note-stream: outputs onset and duration channels of instruments (for MusicNet)
pop-note-stream: similar to note-stream mode, but for the Pop dataset
frame: same as note mode. To truly output only the duration channel, use true-frame mode.
frame-stream: same as note-stream mode. To truly output only the duration channel for each instrument, use true-frame-stream mode.
Methods
get_available_modes
get_conversion_func
get_frame
get_frame_onset
get_out_classes
multi_inst_frm
multi_inst_note
multi_pop_note
- get_available_modes()¶
- get_conversion_func()¶
- get_frame(label)¶
- get_frame_onset(label)¶
- get_out_classes()¶
- multi_inst_frm(label)¶
- multi_inst_note(label)¶
- multi_pop_note(label)¶
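A usage sketch; the exact return values depend on the chosen mode, and the comments below are presumptions from the method names:

from omnizart.music.labels import LabelType

ltype = LabelType("note-stream")
conv_func = ltype.get_conversion_func()  # presumably the label-to-numpy converter
out_classes = ltype.get_out_classes()    # presumably the number of output channels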
- class omnizart.music.labels.MaestroLabelExtraction¶
Label extraction class for Maestro dataset
Methods
load_label(label_path): Load the label file and parse the information into Label instances.
- classmethod load_label(label_path)¶
Load the label file and parse the information into Label instances.

Sub-classes should override this function to process their own label format.
- Parameters
- label_path: Path
Path to the label file.
- Returns
- labels: list[Label]
List of Label instances.
- class omnizart.music.labels.MapsLabelExtraction¶
Label extraction class for Maps dataset
Methods
load_label(label_path): Load the label file and parse the information into Label instances.
- classmethod load_label(label_path)¶
Load the label file and parse the information into Label instances.

Sub-classes should override this function to process their own label format.
- Parameters
- label_path: Path
Path to the label file.
- Returns
- labels: list[Label]
List of Label instances.
- class omnizart.music.labels.MusicNetLabelExtraction¶
Label extraction class for MusicNet dataset
Methods
load_label(label_path): Load the label file and parse the information into Label instances.
- classmethod load_label(label_path)¶
Load the label file and parse the information into Label instances.

Sub-classes should override this function to process their own label format.
- Parameters
- label_path: Path
Path to the label file.
- Returns
- labels: list[Label]
List of Label instances.
- class omnizart.music.labels.PopLabelExtraction¶
Label extraction class for Pop Rhythm dataset
Methods
name_transform(name): Map the filename of a label file to the name of the corresponding wav file.
- classmethod name_transform(name)¶
Map the filename of a label file to the name of the corresponding wav file.
- Parameters
- name: str
Name of the label file, without parent directory prefix and file extension.
- Returns
- trans_name: str
The name matching the corresponding wav (i.e. feature) file.
- class omnizart.music.labels.SuLabelExtraction¶
Label extraction class for Extended-Su dataset
Uses the same process as the Maestro dataset.
- omnizart.music.labels.label_conversion(label, ori_feature_size=352, feature_num=352, base=88, mpe=False, onsets=False, channel_mapping=None)¶
Converts the customized label format into numpy array.
- Parameters
- label: object
List of dicts in the customized label format.
- ori_feature_size: int
Size of the original feature dimension.
- feature_num: int
Size of the target output feature dimension.
- base: int
Number of total available pitches.
- mpe: bool
Whether to merge all channels into a single one, discarding information about instruments.
- onsets: bool
Fill in onset probabilities if set to true; otherwise fill ones into all activations.
- channel_mapping: dict
Maps the instrument program number to the specified channel index, used to indicate which channel should represent what instruments.
See also
omnizart.music.labels.BaseLabelExtraction.extract_label
Function that generates the customized label format.
Prediction¶
Utility functions for the music module.
- omnizart.music.prediction.create_batches(feature, timesteps, b_size=8, step_size=10)¶
Create a series of input batches.

The size of the last batch could be smaller than the given b_size.
- Parameters
- feature: numpy.ndarray
The only constraint is that the first dimension should be the time index. There is no limit on the number of dimensions.
- timesteps: int
Input feature length of the model.
- b_size: int
Batch size of the input.
- step_size: int
Step size for hopping the feature. A value smaller than timesteps indicates overlap between feature slices.
- Returns
- batches: list
List of input batches.
- omnizart.music.prediction.merge_batches(batches, step_size=10)¶
Reverse process of create_batches.

Merge the list of batch predictions into the complete predicted results.
- Parameters
- batches: numpy.ndarray
List of predicted batches.
- step_size: int
Should be the same as the value passed to create_batches.
- Returns
- pred: numpy.ndarray
The final predicted results.
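A round-trip sketch; in practice the model prediction step sits between the two calls, so here the input slices merely stand in for predicted batches:

import numpy as np

from omnizart.music.prediction import create_batches, merge_batches

feature = np.random.rand(1000, 352, 2)  # timesteps x feature_size x channels
batches = create_batches(feature, timesteps=128, b_size=8, step_size=10)
# Normally each batch would be run through the model before merging.
merged = merge_batches(batches, step_size=10)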
- omnizart.music.prediction.predict(feature, model, batch_size=4, step_size=64)¶
Make predictions on the feature.
Generate predictions by using the loaded model.
- Parameters
- feature: numpy.ndarray
Extracted feature of the audio. Dimension: timesteps x feature_size x channels
- model: keras.Model
The loaded model instance.
- batch_size: int
Batch size for the prediction iteration.
- step_size: int
Step size for hopping the feature. A value smaller than timesteps means there will be overlap.
- Returns
- pred: numpy.ndarray
The predicted results.
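A sketch of the prediction step, where feature and model are assumed to have been prepared beforehand (e.g. by the feature extraction utilities and by loading a checkpoint):

from omnizart.music.prediction import predict

# feature: numpy array of shape timesteps x feature_size x channels (assumed).
# model:   a loaded keras.Model instance (assumed).
pred = predict(feature, model, batch_size=4, step_size=64)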
- omnizart.music.prediction.predict_old(feature, model, batch_size=4)¶
Make predictions on the feature.
Generate predictions by using the loaded model.
- Parameters
- feature: numpy.ndarray
Extracted feature of the audio. Dimension: timesteps x feature_size x channels
- model: keras.Model
The loaded model instance.
- batch_size: int
Batch size for each step of prediction. The size depends on the available GPU memory.
- Returns
- pred: numpy.ndarray
The predicted results. The values range from 0 to 1.
Settings¶
Below are the default settings for building the music model. They will be loaded by the class omnizart.setting_loaders.MusicSettings. The names of the attributes are converted to snake-case (e.g., HopSize -> hop_size). There is also a path transformation process when applying the settings to the MusicSettings instance. For example, to access the attribute BatchSize defined at the yaml path General/Training/Settings/BatchSize, use the corresponding attribute MusicSettings.training.batch_size. The /Settings level is removed from all fields.
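A sketch of that mapping in code, assuming MusicSettings can be constructed with its defaults:

from omnizart.setting_loaders import MusicSettings

settings = MusicSettings()
# General/Training/Settings/BatchSize -> training.batch_size
print(settings.training.batch_size)  # 8 by default
settings.training.epoch = 10         # override a default before training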
# Self-documented configurable settings, with description, type hint, and available
# options. All the parameters can be overridden by another specified configuration file
# with selected parameters.
General:
    TranscriptionMode:
        Description: Mode of transcription by executing the `omnizart music transcribe` command.
        Type: String
        Value: Piano
    CheckpointPath:
        Description: Path to the pre-trained models.
        Type: Map
        SubType: [String, String]
        Value:
            Piano: checkpoints/music/music_piano
            Pop: checkpoints/music/music_pop
            Stream: checkpoints/music/music_note_stream
            PianoV2: checkpoints/music/music_piano-v2
    Feature:
        Description: Default settings of feature extraction.
        Settings:
            HopSize:
                Description: Hop size in seconds with respect to sampling rate.
                Type: Float
                Value: 0.02
            SamplingRate:
                Description: Adjust input sampling rate to this value.
                Type: Integer
                Value: 44100
            WindowSize:
                Type: Integer
                Value: 7939
            FrequencyResolution:
                Type: Float
                Value: 2.0
            FrequencyCenter:
                Description: Lowest frequency to extract.
                Type: Float
                Value: 27.5
            TimeCenter:
                Description: Highest frequency to extract (1/time_center).
                Type: Float
                Value: 0.00022287
            Gamma:
                Type: List
                SubType: Float
                Value: [0.24, 0.6, 1.0]
            BinsPerOctave:
                Description: Number of bins for each octave.
                Type: Integer
                Value: 48
            HarmonicNumber:
                Description: Number of harmonic bins of the HCFP feature.
                Type: Integer
                Value: 6
            Harmonic:
                Description: Whether to use the harmonic version of the input feature for training.
                Type: Bool
                Value: False
    Dataset:
        Description: Settings of datasets.
        Settings:
            SavePath:
                Description: Path for storing the downloaded datasets.
                Type: String
                Value: ./
            FeatureType:
                Description: Type of feature to extract.
                Type: String
                Value: CFP
                Choices: ["CFP", "HCFP"]
            FeatureSavePath:
                Description: Path for storing the extracted feature. Defaults to the path under the dataset folder.
                Type: String
                Value: +
    Model:
        Description: Default settings of training / testing the model.
        Settings:
            SavePrefix:
                Description: Prefix of the trained model's name to be saved.
                Type: String
                Value: music
            SavePath:
                Description: Path to save the trained model.
                Type: String
                Value: ./checkpoints/music
            ModelType:
                Description: Default model type to be used for training.
                Type: String
                Value: attn
                Choices: ["aspp", "attn"]
    Inference:
        Description: Default settings when inferring notes.
        Settings:
            MinLength:
                Description: Minimum length of a note in seconds.
                Type: Float
                Value: 0.05
            InstTh:
                Description: Threshold for filtering instruments.
                Type: Float
                Value: 1.1
            OnsetTh:
                Description: Threshold of the predicted onset channel.
                Type: Float
                Value: 3.5
            DuraTh:
                Description: Threshold of the predicted duration channel.
                Type: Float
                Value: 0.5
            FrameTh:
                Description: Threshold of frame-level predictions.
                Type: Float
                Value: 0.5
    Training:
        Description: Parameters for training.
        Settings:
            Epoch:
                Description: Maximum number of epochs for training.
                Type: Integer
                Value: 20
            Steps:
                Description: Number of training steps for each epoch.
                Type: Integer
                Value: 3000
            ValSteps:
                Description: Number of validation steps after each training epoch.
                Type: Integer
                Value: 500
            BatchSize:
                Description: Batch size of each training step.
                Type: Integer
                Value: 8
            ValBatchSize:
                Description: Batch size of each validation step.
                Type: Integer
                Value: 8
            EarlyStop:
                Description: Terminate the training if the validation performance doesn't improve after n epochs.
                Type: Integer
                Value: 6
            LossFunction:
                Description: Loss function for computing the objectives.
                Type: String
                Value: smooth
                Choices: ["smooth", "focal", "bce"]
            LabelType:
                Description: Determines whether the training target is the single- or multi-instrument scenario, among other options.
                Type: String
                Value: note-stream
                Choices:
                    - note-stream
                    - frame-stream
                    - note
                    - frame
                    - true-frame
                    - true-frame-stream
                    - pop-note-stream
            Channels:
                Description: Use different types of features for training.
                Type: List
                SubType: String
                Value: ["Spec", "Ceps"]
                Choices: ["Spec", "GCoS", "Ceps"]
            Timesteps:
                Description: Length of the time axis of the input feature.
                Type: Integer
                Value: 256
            FeatureNum:
                Description: The target size of the feature dimension.
                Type: Integer
                Value: 352