Patch-CNN Transcription

Vocal pitch contour transcription, PatchCNN version.

Transcribes the monophonic pitch contour of the vocal part in the given polyphonic audio using the PatchCNN approach. This is a re-implementation of the repository VocalMelodyExtPatchCNN.

Feature Storage Format

Processed features and labels will be stored in .hdf format, one file per piece.

Columns contained in each file are:

  • feature: Patch CFP feature.

  • label: Binary classes of each patch.

  • Z: The original CFP feature.

  • mapping: Records the original frequency and time indexes of each patch.
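
These files can be inspected directly. Below is a minimal reading sketch; it assumes the files can be opened with h5py (if they were instead written with pandas.HDFStore, use pandas.read_hdf with the corresponding keys), and the file name is only an example:

    import h5py

    # Open one processed piece (the file name here is hypothetical).
    with h5py.File("train_feature/example_piece.hdf", "r") as fin:
        feature = fin["feature"][:]  # patch CFP feature
        label = fin["label"][:]      # binary class of each patch
        zzz = fin["Z"][:]            # the original CFP feature
        mapping = fin["mapping"][:]  # frequency/time index of each patch

    print(feature.shape, label.shape, zzz.shape, mapping.shape)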

References

The publication for this module can be found in [1].

[1] Li Su, “Vocal Melody Extraction Using Patch-based CNN,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

App

class omnizart.patch_cnn.app.PatchCNNTranscription(conf_path=None)

Bases: omnizart.base.BaseTranscription

Application class of PatchCNN module.

Methods

generate_feature(dataset_path[, ...])

Extract the feature from the given dataset.

train(feature_folder[, model_name, ...])

Model training.

transcribe(input_audio[, model_path, output])

Transcribe the frame-level fundamental frequency of the vocal part from the given audio.

generate_feature(dataset_path, patch_cnn_settings=None, num_threads=4)

Extract the feature from the given dataset.

To train the model, the first step is to pre-process the data into feature representations. After downloading the dataset, use this function to generate the feature by giving the path of the stored dataset.

To specify the output path, modify the attribute patch_cnn_settings.dataset.feature_save_path. It defaults to the folder of the stored dataset, and creates two folders: train_feature and test_feature.

Parameters
dataset_path: Path

Path to the downloaded dataset.

patch_cnn_settings: PatchCNNSettings

The configuration instance that holds all relevant settings for the life-cycle of building a model.

num_threads: int

Number of threads used for parallel feature extraction.

See also

omnizart.constants.datasets

The supported datasets and the corresponding training/testing splits.
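
A minimal usage sketch of feature generation (the dataset path is illustrative):

    from omnizart.patch_cnn.app import PatchCNNTranscription

    app = PatchCNNTranscription()
    # Extract features from a downloaded dataset with 4 worker threads.
    app.generate_feature("./datasets/MIR-1K", num_threads=4)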

train(feature_folder, model_name=None, input_model_path=None, patch_cnn_settings=None)

Model training.

Train the model from scratch or continue training given a model checkpoint.

Parameters
feature_folder: Path

Path to the generated feature.

model_name: str

The name of the trained model. If not given, defaults to the current timestamp.

input_model_path: Path

Specify the path to the model checkpoint in order to fine-tune the model.

patch_cnn_settings: PatchCNNSettings

The configuration instance that holds all relevant settings for the life-cycle of model building.
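
A minimal usage sketch for training (the paths and model name are illustrative; the checkpoint path follows the default settings listed further below):

    from omnizart.patch_cnn.app import PatchCNNTranscription

    app = PatchCNNTranscription()
    # Train from scratch on the generated features.
    app.train("./datasets/MIR-1K/train_feature", model_name="my_patch_cnn")

    # Or fine-tune an existing checkpoint instead.
    app.train(
        "./datasets/MIR-1K/train_feature",
        input_model_path="./checkpoints/patch_cnn/patch_cnn_melody",
    )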

transcribe(input_audio, model_path=None, output='./')

Transcribe the frame-level fundamental frequency of the vocal part from the given audio.

Parameters
input_audio: Path

Path to the wav audio file.

model_path: Path

Path to the trained model or the transcription mode. If given a path, it should be a folder containing arch.yaml, weights.h5, and configuration.yaml.

output: Path (optional)

Path for writing out the extracted vocal f0. Defaults to the current path.

Returns
agg_f0: list[dict]

List of aggregated F0 information, with each entry containing the onset, offset, and frequency (Hz).

See also

omnizart.cli.patch_cnn.transcribe

The corresponding command line entry.
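
A minimal usage sketch of transcription (the audio file name is illustrative; the entry keys follow the return description above):

    from omnizart.patch_cnn.app import PatchCNNTranscription

    app = PatchCNNTranscription()
    # Transcribe with the default pre-trained model and write the
    # extracted f0 to the current directory.
    agg_f0 = app.transcribe("vocal.wav", output="./")

    # Each entry aggregates one voiced segment: onset, offset, frequency.
    for segment in agg_f0[:5]:
        print(segment)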

Dataset

class omnizart.patch_cnn.app.PatchCNNDatasetLoader(feature_folder=None, feature_files=None, num_samples=100, slice_hop=1, feat_col_name='feature')

Bases: omnizart.base.BaseDatasetLoader

Dataset loader for PatchCNN module.
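
A minimal construction sketch (the feature folder path is illustrative; the loader is typically constructed by train() internally):

    from omnizart.patch_cnn.app import PatchCNNDatasetLoader

    loader = PatchCNNDatasetLoader(
        feature_folder="./datasets/MIR-1K/train_feature",
        num_samples=100,
        slice_hop=1,
        feat_col_name="feature",
    )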

Inference

omnizart.patch_cnn.inference.inference(pred, mapping, zzz, cenf, threshold=0.5, max_method='posterior')

Infers the pitch contour from the model prediction.

Parameters
pred:

The predicted results of the model.

mapping: 2D numpy array

The original frequency and time index of patches. See omnizart.feature.cfp.extract_patch_cfp for more details.

zzz: 2D numpy array

The original CFP feature.

cenf: list[float]

Center frequencies in Hz of each frequency index.

threshold: float

Threshold for filtering out low prediction values.

max_method: {‘posterior’, ‘prior’}

The approach for determining the frequency. The posterior method assigns the frequency value according to the given mapping parameter, while the prior method uses the given zzz feature for the determination.

Returns
contour: 1D numpy array

Sequence of frequencies in Hz, representing the inferred pitch contour.
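
The sketch below is a conceptual re-statement of the posterior mode, not the actual implementation: each patch prediction above the threshold votes for its original frequency/time position, and every frame keeps the frequency of its most confident patch. It assumes pred is a 1D array of vocal-class probabilities:

    import numpy as np

    def sketch_inference(pred, mapping, cenf, threshold=0.5):
        """Conceptual sketch of the 'posterior' mode (illustrative only).

        pred: (num_patches,) vocal-class probability of each patch.
        mapping: (num_patches, 2) original (freq_idx, time_idx) per patch.
        cenf: center frequency in Hz of each frequency index.
        """
        num_frames = int(mapping[:, 1].max()) + 1
        scores = np.zeros(num_frames)
        contour = np.zeros(num_frames)
        for prob, (freq_idx, time_idx) in zip(pred, mapping.astype(int)):
            # Keep the most confident patch per frame, above the threshold.
            if prob >= threshold and prob > scores[time_idx]:
                scores[time_idx] = prob
                contour[time_idx] = cenf[freq_idx]
        return contour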

Labels

omnizart.patch_cnn.app.extract_label(label_path, label_loader, mapping, cenf, t_unit)

Label extraction function of PatchCNN module.

Extracts the label representation required by the PatchCNN module. The output dimensions are patch_length x 2, where the second dimension indicates whether the patch contains an active vocal pitch.

Small probabilities are assigned to patches whose pitch is slightly shifted from the ground truth, in order to augment the sparse labels. The probabilities are computed according to the distance of the pitch index to the ground-truth index: 1 / (dist + 1).

Parameters
label_path: Path

Path to the ground-truth file.

label_loader:

Label loader that contains a load_label function for parsing the ground-truth file into a list of Label instances.

mapping: 2D numpy array

The original frequency and time index of patches. See omnizart.feature.cfp.extract_patch_cfp for more details.

cenf: list[float]

Center frequencies in Hz of each frequency index.

t_unit: float

Time unit of each frame in seconds.

Returns
gt_roll: 2D numpy array

A sequence of binary classes, representing whether each patch contains a vocal pitch.
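
As a worked example of the smoothing rule above: a patch whose pitch index is two bins away from the ground-truth index receives probability 1 / (2 + 1) ≈ 0.33. A minimal sketch of the rule (illustrative only):

    def patch_probability(pitch_idx: int, gt_idx: int) -> float:
        # Probability assigned to a patch whose pitch index is `dist`
        # bins away from the ground-truth pitch index.
        dist = abs(pitch_idx - gt_idx)
        return 1.0 / (dist + 1)

    assert patch_probability(40, 40) == 1.0      # exact match
    assert patch_probability(42, 40) == 1.0 / 3  # two bins away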

Settings

Below are the default settings for building the PatchCNN model. They will be loaded by the class omnizart.setting_loaders.PatchCNNSettings. The names of the attributes are converted to snake-case (e.g. HopSize -> hop_size). There is also a path transformation when applying the settings to the PatchCNNSettings instance: for example, to access the attribute BatchSize defined at the yaml path General/Training/Settings/BatchSize, the corresponding attribute is PatchCNNSettings.training.batch_size. The /Settings level is removed from all fields.
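
Based on the conversion rules described above, the sketch below shows how the YAML fields map onto the settings instance (attribute names are derived from those rules; the printed values reflect the defaults listed below):

    from omnizart.setting_loaders import PatchCNNSettings

    settings = PatchCNNSettings()
    # General/Training/Settings/BatchSize -> training.batch_size
    print(settings.training.batch_size)   # 32
    # General/Feature/Settings/HopSize -> feature.hop_size
    print(settings.feature.hop_size)      # 0.02
    # Override a field before passing the instance to generate_feature()
    # or train().
    settings.dataset.feature_save_path = "./features"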

General:
    TranscriptionMode:
        Description: Mode of transcription by executing the `omnizart patch-cnn transcribe` command.
        Type: String 
        Value: Melody
    CheckpointPath:
        Description: Path to the pre-trained models.
        Type: Map
        SubType: [String, String]
        Value:
            Melody: checkpoints/patch_cnn/patch_cnn_melody
    Feature:
        Description: Default settings of feature extraction
        Settings:
            PatchSize:
                Description: Input size of feature dimension.
                Type: Integer
                Value: 25
            PeakThreshold:
                Description: Threshold used to filter out peaks with small value.
                Type: Float
                Value: 0.5
            HopSize:
                Description: Hop size in seconds with respect to sampling rate.
                Type: Float
                Value: 0.02
            SamplingRate:
                Description: Adjust input sampling rate to this value.
                Type: Integer
                Value: 16000
            WindowSize:
                Type: Integer
                Value: 2049
            FrequencyResolution:
                Type: Float
                Value: 2.0
            FrequencyCenter:
                Description: Lowest frequency to extract.
                Type: Float
                Value: 80
            TimeCenter:
                Description: Highest frequency to extract (1/time_center).
                Type: Float
                Value: 0.001
            Gamma:
                Type: List
                SubType: Float
                Value: [0.24, 0.6, 1.0]
            BinsPerOctave:
                Description: Number of bins for each octave.
                Type: Integer
                Value: 48
    Model:
        Description: Default settings of training / testing the model.
        Settings:
            SavePrefix:
                Description: Prefix of the trained model's name to be saved.
                Type: String
                Value: patch_cnn
            SavePath:
                Description: Path to save the trained model.
                Type: String
                Value: ./checkpoints/patch_cnn
    Dataset:
        Description: Settings of datasets.
        Settings:
            SavePath:
                Description: Path for storing the downloaded datasets.
                Type: String
                Value: ./
            FeatureSavePath:
                Description: Path for storing the extracted feature. Default to the path under the dataset folder.
                Type: String
                Value: +
    Inference:
        Description: Default settings when inferring notes.
        Settings:
            Threshold:
                Description: Threshold of the prediction value.
                Type: Float
                Value: 0.05
            MaxMethod:
                Description: Method for determining the position of the max prediction value.
                Type: String
                Value: posterior
                Choices: ["posterior", "prior"]
    Training:
        Description: Hyperparameters for training.
        Settings:
            Epoch:
                Description: Maximum number of epochs for training.
                Type: Integer
                Value: 10
            Steps:
                Description: Number of training steps for each epoch.
                Type: Integer
                Value: 2000
            ValSteps:
                Description: Number of validation steps after each training epoch.
                Type: Integer
                Value: 300
            BatchSize:
                Description: Batch size of each training step.
                Type: Integer
                Value: 32
            ValBatchSize:
                Description: Batch size of each validation step.
                Type: Integer
                Value: 32
            EarlyStop:
                Description: Terminate the training if the validation performance doesn't improve after n epochs.
                Type: Integer
                Value: 4
            InitLearningRate:
                Description: Initial learning rate.
                Type: Float
                Value: 0.00001