Patch-CNN Transcription¶
Vocal pitch contour transcription, PatchCNN version.
Transcribes the monophonic vocal pitch contour from the given polyphonic audio using the PatchCNN approach. A re-implementation of the repository VocalMelodyExtPatchCNN.
Feature Storage Format¶
Processed features and labels are stored in .hdf format, one file per piece.
Columns contained in each file are:
feature: Patch CFP feature.
label: Binary classes of each patch.
Z: The original CFP feature.
mapping: Records the original frequency and time indexes of each patch.
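For instance, a single feature file could be inspected with h5py; this is a minimal sketch assuming the HDF5 dataset keys match the column names above (the file name is hypothetical):

import h5py

# Open one processed piece and read the stored columns.
with h5py.File("train_feature/example_piece.hdf", "r") as fin:
    feature = fin["feature"][:]  # patch CFP feature
    label = fin["label"][:]      # binary classes of each patch
    zzz = fin["Z"][:]            # the original CFP feature
    mapping = fin["mapping"][:]  # frequency/time indexes of each patch

print(feature.shape, label.shape)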
References¶
The publication associated with this module can be found in [1].
- 1
Li Su, "Vocal Melody Extraction Using Patch-based CNN," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
App¶
- class omnizart.patch_cnn.app.PatchCNNTranscription(conf_path=None)¶
Bases:
omnizart.base.BaseTranscription
Application class of PatchCNN module.
Methods
- generate_feature(dataset_path[, ...]): Extract the feature from the given dataset.
- train(feature_folder[, model_name, ...]): Model training.
- transcribe(input_audio[, model_path, output]): Transcribe frame-level fundamental frequency of vocal from the given audio.
- generate_feature(dataset_path, patch_cnn_settings=None, num_threads=4)¶
Extract the feature from the given dataset.
To train the model, the first step is to pre-process the data into feature representations. After downloading the dataset, use this function to generate the feature by giving the path of the stored dataset.
To specify the output path, modify the attribute patch_cnn_settings.dataset.feature_save_path. It defaults to the folder of the stored dataset, under which two sub-folders are created: train_feature and test_feature.
- Parameters
- dataset_path: Path
Path to the downloaded dataset.
- patch_cnn_settings: PatchCNNSettings
The configuration instance that holds all relevant settings for the life-cycle of building a model.
- num_threads:
Number of threads for parallel extraction of the feature.
See also
omnizart.constants.datasets
The supported datasets and the corresponding training/testing splits.
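A minimal usage sketch, assuming the dataset has already been downloaded to a hypothetical local folder:

from omnizart.patch_cnn.app import PatchCNNTranscription

app = PatchCNNTranscription()
# Extract features with 4 worker threads; output goes to
# <dataset>/train_feature and <dataset>/test_feature by default.
app.generate_feature("./path/to/dataset", num_threads=4)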
- train(feature_folder, model_name=None, input_model_path=None, patch_cnn_settings=None)¶
Model training.
Train the model from scratch, or continue training from a given model checkpoint.
- Parameters
- feature_folder: Path
Path to the generated feature.
- model_name: str
The name of the trained model. If not given, will default to the current timestamp.
- input_model_path: Path
Specify the path to the model checkpoint in order to fine-tune the model.
- patch_cnn_settings: PatchCNNSettings
The configuration instance that holds all relevant settings for the life-cycle of model building.
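A minimal training sketch; the feature folder is hypothetical, and the checkpoint path is taken from the default settings listed further below:

from omnizart.patch_cnn.app import PatchCNNTranscription

app = PatchCNNTranscription()
# Train from scratch on the generated features.
app.train("./path/to/dataset/train_feature", model_name="my_patch_cnn")
# Or fine-tune an existing checkpoint instead.
app.train(
    "./path/to/dataset/train_feature",
    input_model_path="./checkpoints/patch_cnn/patch_cnn_melody",
)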
- transcribe(input_audio, model_path=None, output='./')¶
Transcribe frame-level fundamental frequency of vocal from the given audio.
- Parameters
- input_audio: Path
Path to the wav audio file.
- model_path: Path
Path to the trained model or the transcription mode. If given a path, it should be the folder containing arch.yaml, weights.h5, and configuration.yaml.
- output: Path (optional)
Path for writing out the extracted vocal F0. Defaults to the current path.
- Returns
- agg_f0: list[dict]
List of aggregated F0 information, with each entry containing the onset, offset, and frequency (Hz).
See also
omnizart.cli.patch_cnn.transcribe
The corresponding command line entry.
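A minimal transcription sketch; the audio file name is hypothetical, and the default pre-trained model is used when model_path is omitted:

from omnizart.patch_cnn.app import PatchCNNTranscription

app = PatchCNNTranscription()
# Returns the aggregated F0 entries and writes the result under ./.
agg_f0 = app.transcribe("example.wav", output="./")
print(agg_f0[0])  # onset, offset, and frequency (Hz) of the first segment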
Dataset¶
- class omnizart.patch_cnn.app.PatchCNNDatasetLoader(feature_folder=None, feature_files=None, num_samples=100, slice_hop=1, feat_col_name='feature')¶
Bases:
omnizart.base.BaseDatasetLoader
Dataset loader for PatchCNN module.
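The loader can be constructed directly over a generated feature folder, following the signature above (the folder path is hypothetical):

from omnizart.patch_cnn.app import PatchCNNDatasetLoader

# Construct the loader over the generated features; the keyword
# values shown are the documented defaults.
loader = PatchCNNDatasetLoader(
    feature_folder="./path/to/dataset/train_feature",
    num_samples=100,
    slice_hop=1,
    feat_col_name="feature",
)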
Inference¶
- omnizart.patch_cnn.inference.inference(pred, mapping, zzz, cenf, threshold=0.5, max_method='posterior')¶
Infers pitch contour from the model prediction.
- Parameters
- pred:
The predicted results of the model.
- mapping: 2D numpy array
The original frequency and time index of patches. See omnizart.feature.cfp.extract_patch_cfp for more details.
- zzz: 2D numpy array
The original CFP feature.
- cenf: list[float]
Center frequencies in Hz of each frequency index.
- threshold: float
Threshold for filtering value of predictions.
- max_method: {'posterior', 'prior'}
The approach for determining the frequency. The posterior method assigns the frequency value according to the given mapping parameter, while the prior method uses the given zzz feature.
- Returns
- contour: 1D numpy array
Sequence of frequencies in Hz, representing the inferred pitch contour.
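The difference between the two max_method options can be illustrated with a small hypothetical sketch (not the library's actual implementation): posterior trusts the frequency index recorded in mapping, while prior re-derives the frequency from the original CFP feature.

import numpy as np

def pick_frequency(freq_idx, time_idx, zzz, cenf, max_method="posterior"):
    # Hypothetical helper; assumes zzz is indexed as (frequency, time).
    if max_method == "posterior":
        # Posterior: use the patch's own frequency index from `mapping`.
        return cenf[freq_idx]
    # Prior: take the strongest frequency bin of that frame in `zzz`.
    return cenf[int(np.argmax(zzz[:, time_idx]))]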
Labels¶
- omnizart.patch_cnn.app.extract_label(label_path, label_loader, mapping, cenf, t_unit)¶
Label extraction function of PatchCNN module.
Extracts the label representation required by the PatchCNN module. The output dimensions are patch_length x 2. The second dimension indicates whether there is an active vocal pitch in that patch.
Small probabilities are assigned to patches whose pitch is slightly shifted from the ground-truth, to augment the sparse labels. The probabilities are computed according to the distance of the pitch index to the ground-truth index: 1 / (dist + 1).
- Parameters
- label_path: Path
Path to the ground-truth file.
- label_loader:
Label loader that provides the load_label function for parsing the ground-truth file into a list of Label representations.
- mapping: 2D numpy array
The original frequency and time index of patches. See omnizart.feature.cfp.extract_patch_cfp for more details.
- cenf: list[float]
Center frequencies in Hz of each frequency index.
- t_unit: float
Time unit of each frame in seconds.
- Returns
- gt_roll: 2D numpy array
A sequence of binary classes, representing whether each patch contains a vocal pitch.
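The distance weighting can be made concrete with a short sketch (names are hypothetical, not the module's actual code):

def smoothed_label(pitch_idx, gt_idx):
    # Probability assigned to a patch whose pitch index is pitch_idx,
    # given the ground-truth pitch index gt_idx: 1 / (dist + 1).
    dist = abs(pitch_idx - gt_idx)
    return 1.0 / (dist + 1)

# The exact match gets 1.0; neighbouring indexes get 1/2, 1/3, ...
print([smoothed_label(i, 10) for i in range(8, 13)])  # [0.33.., 0.5, 1.0, 0.5, 0.33..]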
Settings¶
Below are the default settings for building the PatchCNN model. They are loaded
by the class omnizart.setting_loaders.PatchCNNSettings. The names of the
attributes are converted to snake-case (e.g. HopSize -> hop_size). There
is also a path transformation process when applying the settings to the
PatchCNNSettings instance. For example, to access the attribute
BatchSize defined at the yaml path General/Training/Settings/BatchSize,
the corresponding attribute is PatchCNNSettings.training.batch_size.
The /Settings level is removed from all fields.
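A minimal sketch of this naming convention, assuming the default values listed below:

from omnizart.setting_loaders import PatchCNNSettings

settings = PatchCNNSettings()
# General/Feature/Settings/HopSize -> feature.hop_size
print(settings.feature.hop_size)     # 0.02
# General/Training/Settings/BatchSize -> training.batch_size
print(settings.training.batch_size)  # 32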
General:
TranscriptionMode:
Description: Transcription mode used when executing the `omnizart patch-cnn transcribe` command.
Type: String
Value: Melody
CheckpointPath:
Description: Path to the pre-trained models.
Type: Map
SubType: [String, String]
Value:
Melody: checkpoints/patch_cnn/patch_cnn_melody
Feature:
Description: Default settings of feature extraction
Settings:
PatchSize:
Description: Input size of feature dimension.
Type: Integer
Value: 25
PeakThreshold:
Description: Threshold used to filter out peaks with small value.
Type: Float
Value: 0.5
HopSize:
Description: Hop size in seconds with respect to sampling rate.
Type: Float
Value: 0.02
SamplingRate:
Description: Adjust input sampling rate to this value.
Type: Integer
Value: 16000
WindowSize:
Type: Integer
Value: 2049
FrequencyResolution:
Type: Float
Value: 2.0
FrequencyCenter:
Description: Lowest frequency to extract.
Type: Float
Value: 80
TimeCenter:
Description: Highest frequency to extract (1/time_center).
Type: Float
Value: 0.001
Gamma:
Type: List
SubType: Float
Value: [0.24, 0.6, 1.0]
BinsPerOctave:
Description: Number of bins for each octave.
Type: Integer
Value: 48
Model:
Description: Default settings of training / testing the model.
Settings:
SavePrefix:
Description: Prefix of the trained model's name to be saved.
Type: String
Value: patch_cnn
SavePath:
Description: Path to save the trained model.
Type: String
Value: ./checkpoints/patch_cnn
Dataset:
Description: Settings of datasets.
Settings:
SavePath:
Description: Path for storing the downloaded datasets.
Type: String
Value: ./
FeatureSavePath:
Description: Path for storing the extracted feature. Defaults to the path under the dataset folder.
Type: String
Value: +
Inference:
Description: Default settings when inferring notes.
Settings:
Threshold:
Description: Threshold of the prediction value.
Type: Float
Value: 0.05
MaxMethod:
Description: Method of determining the position of the maximum prediction value.
Type: String
Value: posterior
Choices: ["posterior", "prior"]
Training:
Description: Hyper-parameters for training.
Settings:
Epoch:
Description: Maximum number of epochs for training.
Type: Integer
Value: 10
Steps:
Description: Number of training steps for each epoch.
Type: Integer
Value: 2000
ValSteps:
Description: Number of validation steps after each training epoch.
Type: Integer
Value: 300
BatchSize:
Description: Batch size of each training step.
Type: Integer
Value: 32
ValBatchSize:
Description: Batch size of each validation step.
Type: Integer
Value: 32
EarlyStop:
Description: Terminate the training if the validation performance doesn't improve after n epochs.
Type: Integer
Value: 4
InitLearningRate:
Description: Initial learning rate.
Type: Float
Value: 0.00001