Patch-CNN Transcription

Vocal pitch contour transcription, PatchCNN version.

Transcribes the monophonic pitch contour of the vocal part in the given polyphonic audio using the PatchCNN approach. This is a re-implementation of the repository VocalMelodyExtPatchCNN.

Feature Storage Format

Processed features and labels will be stored in .hdf format, one file per piece.

Columns contained in each file are:

  • feature: Patch CFP feature.

  • label: Binary classes of each patch.

  • Z: The original CFP feature.

  • mapping: Records the original frequency and time indexes of each patch.
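
These files can be inspected directly. Below is a minimal reading sketch; it assumes the files can be opened with h5py (if they were instead written with pandas.HDFStore, use pandas.read_hdf with the corresponding keys), and the file name is only an example:

    import h5py

    # Open one processed piece (the file name here is hypothetical).
    with h5py.File("train_feature/example_piece.hdf", "r") as fin:
        feature = fin["feature"][:]  # patch CFP feature
        label = fin["label"][:]      # binary class of each patch
        zzz = fin["Z"][:]            # the original CFP feature
        mapping = fin["mapping"][:]  # frequency/time index of each patch

    print(feature.shape, label.shape, zzz.shape, mapping.shape)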

References

The publication for this module can be found in [1].

[1] Li Su, “Vocal Melody Extraction Using Patch-based CNN,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

App

class omnizart.patch_cnn.app.PatchCNNTranscription(conf_path=None)

Bases: omnizart.base.BaseTranscription

Application class of PatchCNN module.

Methods

generate_feature(dataset_path[, ...])

Extract the feature from the given dataset.

train(feature_folder[, model_name, ...])

Model training.

transcribe(input_audio[, model_path, output])

Transcribe the frame-level fundamental frequency of the vocal part from the given audio.

generate_feature(dataset_path, patch_cnn_settings=None, num_threads=4)

Extract the feature from the given dataset.

To train the model, the first step is to pre-process the data into feature representations. After downloading the dataset, use this function to generate the feature by giving the path of the stored dataset.

To specify the output path, modify the attribute patch_cnn_settings.dataset.feature_save_path. It defaults to the folder of the stored dataset, and creates two folders: train_feature and test_feature.

Parameters
dataset_path: Path

Path to the downloaded dataset.

patch_cnn_settings: PatchCNNSettings

The configuration instance that holds all relevant settings for the life-cycle of building a model.

num_threads: int

Number of threads used for parallel feature extraction.

See also

omnizart.constants.datasets

The supported datasets and the corresponding training/testing splits.
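
A minimal usage sketch of feature generation (the dataset path is illustrative):

    from omnizart.patch_cnn.app import PatchCNNTranscription

    app = PatchCNNTranscription()
    # Extract features from a downloaded dataset with 4 worker threads.
    app.generate_feature("./datasets/MIR-1K", num_threads=4)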

train(feature_folder, model_name=None, input_model_path=None, patch_cnn_settings=None)

Model training.

Train the model from scratch or continue training given a model checkpoint.

Parameters
feature_folder: Path

Path to the generated feature.

model_name: str

The name of the trained model. If not given, defaults to the current timestamp.

input_model_path: Path

Specify the path to the model checkpoint in order to fine-tune the model.

patch_cnn_settings: PatchCNNSettings

The configuration instance that holds all relevant settings for the life-cycle of model building.
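
A minimal usage sketch for training (the paths and model name are illustrative; the checkpoint path follows the default settings listed further below):

    from omnizart.patch_cnn.app import PatchCNNTranscription

    app = PatchCNNTranscription()
    # Train from scratch on the generated features.
    app.train("./datasets/MIR-1K/train_feature", model_name="my_patch_cnn")

    # Or fine-tune an existing checkpoint instead.
    app.train(
        "./datasets/MIR-1K/train_feature",
        input_model_path="./checkpoints/patch_cnn/patch_cnn_melody",
    )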

transcribe(input_audio, model_path=None, output='./')

Transcribe the frame-level fundamental frequency of the vocal part from the given audio.

Parameters
input_audio: Path

Path to the wav audio file.

model_path: Path

Path to the trained model or the transcription mode. If given a path, it should be a folder containing arch.yaml, weights.h5, and configuration.yaml.

output: Path (optional)

Path for writing out the extracted vocal f0. Defaults to the current path.

Returns
agg_f0: list[dict]

List of aggregated F0 information, with each entry containing the onset, offset, and frequency (Hz).

See also

omnizart.cli.patch_cnn.transcribe

The corresponding command line entry.
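
A minimal usage sketch of transcription (the audio file name is illustrative; the entry keys follow the return description above):

    from omnizart.patch_cnn.app import PatchCNNTranscription

    app = PatchCNNTranscription()
    # Transcribe with the default pre-trained model and write the
    # extracted f0 to the current directory.
    agg_f0 = app.transcribe("vocal.wav", output="./")

    # Each entry aggregates one voiced segment: onset, offset, frequency.
    for segment in agg_f0[:5]:
        print(segment)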

Dataset

class omnizart.patch_cnn.app.PatchCNNDatasetLoader(feature_folder=None, feature_files=None, num_samples=100, slice_hop=1, feat_col_name='feature')

Bases: omnizart.base.BaseDatasetLoader

Dataset loader for PatchCNN module.
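
A minimal construction sketch (the feature folder path is illustrative; the loader is typically constructed by train() internally):

    from omnizart.patch_cnn.app import PatchCNNDatasetLoader

    loader = PatchCNNDatasetLoader(
        feature_folder="./datasets/MIR-1K/train_feature",
        num_samples=100,
        slice_hop=1,
        feat_col_name="feature",
    )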

Inference

omnizart.patch_cnn.inference.inference(pred, mapping, zzz, cenf, threshold=0.5, max_method='posterior')

Infers the pitch contour from the model prediction.

Parameters
pred:

The predicted results of the model.

mapping: 2D numpy array

The original frequency and time index of patches. See omnizart.feature.cfp.extract_patch_cfp for more details.

zzz: 2D numpy array

The original CFP feature.

cenf: list[float]

Center frequencies in Hz of each frequency index.

threshold: float

Threshold for filtering out low prediction values.

max_method: {‘posterior’, ‘prior’}

The approach for determining the frequency. The posterior method assigns the frequency value according to the given mapping parameter, while the prior method uses the given zzz feature for the determination.

Returns
contour: 1D numpy array

Sequence of frequencies in Hz, representing the inferred pitch contour.
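
The sketch below is a conceptual re-statement of the posterior mode, not the actual implementation: each patch prediction above the threshold votes for its original frequency/time position, and every frame keeps the frequency of its most confident patch. It assumes pred is a 1D array of vocal-class probabilities:

    import numpy as np

    def sketch_inference(pred, mapping, cenf, threshold=0.5):
        """Conceptual sketch of the 'posterior' mode (illustrative only).

        pred: (num_patches,) vocal-class probability of each patch.
        mapping: (num_patches, 2) original (freq_idx, time_idx) per patch.
        cenf: center frequency in Hz of each frequency index.
        """
        num_frames = int(mapping[:, 1].max()) + 1
        scores = np.zeros(num_frames)
        contour = np.zeros(num_frames)
        for prob, (freq_idx, time_idx) in zip(pred, mapping.astype(int)):
            # Keep the most confident patch per frame, above the threshold.
            if prob >= threshold and prob > scores[time_idx]:
                scores[time_idx] = prob
                contour[time_idx] = cenf[freq_idx]
        return contour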

Labels

omnizart.patch_cnn.app.extract_label(label_path, label_loader, mapping, cenf, t_unit)

Label extraction function of PatchCNN module.

Extracts the label representation required by the PatchCNN module. The output dimensions are patch_length x 2, where the second dimension indicates whether the patch contains an active vocal pitch.

Small probabilities are assigned to patches whose pitch is slightly shifted from the ground truth, in order to augment the sparse labels. The probabilities are computed according to the distance of the pitch index to the ground-truth index: 1 / (dist + 1).

Parameters
label_path: Path

Path to the ground-truth file.

label_loader:

Label loader that contains a load_label function for parsing the ground-truth file into a list of Label instances.

mapping: 2D numpy array

The original frequency and time index of patches. See omnizart.feature.cfp.extract_patch_cfp for more details.

cenf: list[float]

Center frequencies in Hz of each frequency index.

t_unit: float

Time unit of each frame in seconds.

Returns
gt_roll: 2D numpy array

A sequence of binary classes, representing whether each patch contains a vocal pitch.
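
As a worked example of the smoothing rule above: a patch whose pitch index is two bins away from the ground-truth index receives probability 1 / (2 + 1) ≈ 0.33. A minimal sketch of the rule (illustrative only):

    def patch_probability(pitch_idx: int, gt_idx: int) -> float:
        # Probability assigned to a patch whose pitch index is `dist`
        # bins away from the ground-truth pitch index.
        dist = abs(pitch_idx - gt_idx)
        return 1.0 / (dist + 1)

    assert patch_probability(40, 40) == 1.0      # exact match
    assert patch_probability(42, 40) == 1.0 / 3  # two bins away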

Settings

Below are the default settings for building the PatchCNN model. They will be loaded by the class omnizart.setting_loaders.PatchCNNSettings. The names of the attributes are converted to snake-case (e.g. HopSize -> hop_size). There is also a path transformation when applying the settings to the PatchCNNSettings instance: for example, to access the attribute BatchSize defined at the yaml path General/Training/Settings/BatchSize, the corresponding attribute is PatchCNNSettings.training.batch_size. The /Settings level is removed from all fields.
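
Based on the conversion rules described above, the sketch below shows how the YAML fields map onto the settings instance (attribute names are derived from those rules; the printed values reflect the defaults listed below):

    from omnizart.setting_loaders import PatchCNNSettings

    settings = PatchCNNSettings()
    # General/Training/Settings/BatchSize -> training.batch_size
    print(settings.training.batch_size)   # 32
    # General/Feature/Settings/HopSize -> feature.hop_size
    print(settings.feature.hop_size)      # 0.02
    # Override a field before passing the instance to generate_feature()
    # or train().
    settings.dataset.feature_save_path = "./features"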

General:
    TranscriptionMode:
        Description: Mode of transcription by executing the `omnizart patch-cnn transcribe` command.
        Type: String 
        Value: Melody
    CheckpointPath:
        Description: Path to the pre-trained models.
        Type: Map
        SubType: [String, String]
        Value:
            Melody: checkpoints/patch_cnn/patch_cnn_melody
    Feature:
        Description: Default settings of feature extraction
        Settings:
            PatchSize:
                Description: Input size of feature dimension.
                Type: Integer
                Value: 25
            PeakThreshold:
                Description: Threshold used to filter out peaks with small value.
                Type: Float
                Value: 0.5
            HopSize:
                Description: Hop size in seconds with respect to sampling rate.
                Type: Float
                Value: 0.02
            SamplingRate:
                Description: Adjust input sampling rate to this value.
                Type: Integer
                Value: 16000
            WindowSize:
                Type: Integer
                Value: 2049
            FrequencyResolution:
                Type: Float
                Value: 2.0
            FrequencyCenter:
                Description: Lowest frequency to extract.
                Type: Float
                Value: 80
            TimeCenter:
                Description: Highest frequency to extract (1/time_center).
                Type: Float
                Value: 0.001
            Gamma:
                Type: List
                SubType: Float
                Value: [0.24, 0.6, 1.0]
            BinsPerOctave:
                Description: Number of bins for each octave.
                Type: Integer
                Value: 48
    Model:
        Description: Default settings of training / testing the model.
        Settings:
            SavePrefix:
                Description: Prefix of the trained model's name to be saved.
                Type: String
                Value: patch_cnn
            SavePath:
                Description: Path to save the trained model.
                Type: String
                Value: ./checkpoints/patch_cnn
    Dataset:
        Description: Settings of datasets.
        Settings:
            SavePath:
                Description: Path for storing the downloaded datasets.
                Type: String
                Value: ./
            FeatureSavePath:
                Description: Path for storing the extracted feature. Default to the path under the dataset folder.
                Type: String
                Value: +
    Inference:
        Description: Default settings when inferring notes.
        Settings:
            Threshold:
                Description: Threshold of the prediction value.
                Type: Float
                Value: 0.05
            MaxMethod:
                Description: Method for determining the position of the max prediction value.
                Type: String
                Value: posterior
                Choices: ["posterior", "prior"]
    Training:
        Description: Hyperparameters for training.
        Settings:
            Epoch:
                Description: Maximum number of epochs for training.
                Type: Integer
                Value: 10
            Steps:
                Description: Number of training steps for each epoch.
                Type: Integer
                Value: 2000
            ValSteps:
                Description: Number of validation steps after each training epoch.
                Type: Integer
                Value: 300
            BatchSize:
                Description: Batch size of each training step.
                Type: Integer
                Value: 32
            ValBatchSize:
                Description: Batch size of each validation step.
                Type: Integer
                Value: 32
            EarlyStop:
                Description: Terminate the training if the validation performance doesn't improve after n epochs.
                Type: Integer
                Value: 4
            InitLearningRate:
                Description: Initial learning rate.
                Type: Float
                Value: 0.00001