
Collection of functions for converting raw audio into the required input representations.


Author: Lisu

Maintainer: BreezeWhite

omnizart.feature.cfp.STFT(x, fr, fs, Hop, h)
omnizart.feature.cfp.cfp_filterbank(x, fr, fs, Hop, h, fc, tc, g, bin_per_octave)
omnizart.feature.cfp.extract_cfp(filename, down_fs=44100, **kwargs)

CFP feature extraction function.

Given the path to an audio file, returns its CFP feature. The computation is automatically run in parallel to speed it up.

filename: Path

Path to the audio.

hop: float

Hop size in seconds, with regard to the sampling rate.

win_size: int

Window size.

fr: float

Frequency resolution.

fc: float

Lowest start frequency.

tc: float

Reciprocal of the highest frequency bound, i.e. the highest frequency is 1/tc.

g: list[float]

Power factor of the output STFT results.

bin_per_octave: int

Number of bins in each octave.

down_fs: int

Sampling rate to resample to if the loaded audio has a different sampling rate.

max_sample: int

Maximum number of frames processed in each computation. Reduce this value if you run out of memory.
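The parameters above determine the time and frequency grid of the feature. The sketch below shows one way to derive that grid, assuming the hop is converted to samples by rounding `hop * fs`, that the highest frequency is `1/tc`, and that center frequencies are spaced geometrically as `fc * 2 ** (i / bin_per_octave)`. `cfp_grid_sketch` is a hypothetical helper, not part of omnizart, and the library's own mapping may differ in detail.

```python
import math


def cfp_grid_sketch(fs, hop=0.02, fc=80.0, tc=0.001, bin_per_octave=48):
    """Hypothetical helper: derive the hop length and log-frequency bins."""
    # The hop is given in seconds; convert it to a whole number of samples.
    hop_samples = int(round(hop * fs))
    # Assumption: tc is the reciprocal of the highest frequency bound.
    stop_freq = 1.0 / tc
    # Upper bound on the number of bins needed to cover [fc, stop_freq).
    n_max = int(math.ceil(math.log2(stop_freq / fc) * bin_per_octave))
    # Assumption: geometrically spaced center frequencies, bin_per_octave per octave.
    cenf = [fc * 2 ** (i / bin_per_octave) for i in range(n_max)]
    cenf = [f for f in cenf if f < stop_freq]
    return hop_samples, cenf


hop_samples, cenf = cfp_grid_sketch(fs=16000)
print(hop_samples, len(cenf), cenf[0], cenf[48])  # 320 175 80.0 160.0
```

With the defaults of extract_patch_cfp (down_fs=16000, hop=0.02, fc=80.0, tc=0.001, bin_per_octave=48) this gives a hop of 320 samples and 175 bins from 80 Hz up to just below 1 kHz.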


Product of the spectrum and the cepstrum.


Spectrum of the audio.


Generalized Cepstrum of Spectrum (GCoS).


Cepstrum of the audio.


Center frequencies of each frequency bin of the feature.


The CFP approach was first proposed in [1].


[1] L. Su and Y. Yang, “Combining Spectral and Temporal Representations for Multipitch Estimation of Polyphonic Music,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015.
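To illustrate why multiplying the spectrum by the cepstrum helps, here is a toy single-frame version in NumPy. It is a simplified sketch, not omnizart's implementation: it skips windowing, high-pass filtering and the log-frequency filterbank, and reads both representations on a linear grid of candidate pitches (`cfp_frame_sketch` is a hypothetical name). The spectrum peaks at the fundamental and its harmonics, while the cepstrum peaks at the period and its multiples, which read as sub-harmonics in frequency; only the true fundamental is strong in both.

```python
import numpy as np


def cfp_frame_sketch(frame, fs, g=(0.24, 0.6), f_min=50, f_max=1000):
    """Toy single-frame CFP on a linear grid of candidate pitches (Hz)."""
    n = len(frame)
    # Power-scaled magnitude spectrum (first nonlinearity, exponent g[0]).
    spec = np.abs(np.fft.fft(frame)) ** g[0]
    # Generalized cepstrum: inverse FFT of the scaled spectrum, negative
    # values clipped to zero, then a second power (exponent g[1]).
    ceps = np.maximum(np.real(np.fft.ifft(spec)), 0.0) ** g[1]
    freqs = np.arange(f_min, f_max + 1, dtype=float)
    # The spectrum is indexed by frequency: bin = f * n / fs.
    spec_at_f = spec[np.round(freqs * n / fs).astype(int)]
    # The cepstrum is indexed by quefrency in samples: q = fs / f.
    ceps_at_f = ceps[np.round(fs / freqs).astype(int)]
    # CFP: keep only pitches supported by both representations.
    return freqs, spec_at_f * ceps_at_f


fs = 8000
t = np.arange(fs) / fs
# One second of a 200 Hz tone with ten equal-amplitude harmonics.
frame = sum(np.cos(2 * np.pi * 200 * k * t) for k in range(1, 11))
freqs, cfp = cfp_frame_sketch(frame, fs)
print(freqs[np.argmax(cfp)])  # 200.0
```

Here the spectrum alone is equally strong at 200, 400, ..., 1000 Hz, and the cepstrum alone is equally strong at 200, 100, 66.7 Hz and so on; the product singles out 200 Hz.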

omnizart.feature.cfp.extract_patch_cfp(filename, patch_size=25, threshold=0.5, hop=0.02, win_size=2049, fr=2.0, fc=80.0, tc=0.001, g=[0.24, 0.6, 1], bin_per_octave=48, down_fs=16000, max_sample=2000)

Extract patch CFP feature for PatchCNN module.

filename: Path

Path to the audio.

patch_size: int

Height and width of each feature patch.

threshold: float

Threshold for determining peaks.

hop: float

Hop size in seconds, with regard to the sampling rate.

win_size: int

Window size.

fr: float

Frequency resolution.

fc: float

Lowest start frequency.

tc: float

Reciprocal of the highest frequency bound, i.e. the highest frequency is 1/tc.

g: list[float]

Power factor of the output STFT results.

bin_per_octave: int

Number of bins in each octave.

down_fs: int

Sampling rate to resample to if the loaded audio has a different sampling rate.

max_sample: int

Maximum number of frames processed in each computation. Reduce this value if you run out of memory.

patch: 3D numpy array

Sequence of patch CFP features. The positions of the patches are inferred from the amplitude of the spectrogram.

mapping: 2D numpy array

Records the original frequency and time indices of each patch, with shape len(patch) x 2.

Z: 2D numpy array

The original CFP feature, with shape freq x time.

cenf: list[float]

Records the corresponding center frequencies of the frequency dimension.
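A rough sketch of the patch extraction described above, under stated assumptions: a peak is any frequency bin of a frame that exceeds `threshold` and its neighbours, the matrix `Z` (freq x time) is zero-padded so border patches keep their size, and each patch is centered on its peak. For simplicity peaks are picked on `Z` itself, whereas the description above says positions come from the spectrogram amplitude. `extract_patches_sketch` is hypothetical; omnizart's actual peak picking may follow different rules.

```python
import numpy as np


def extract_patches_sketch(Z, patch_size=25, threshold=0.5):
    """Hypothetical peak-centered patch extraction from a freq x time matrix."""
    half = patch_size // 2
    # Zero-pad both axes so patches near the borders keep their full size.
    padded = np.pad(Z, half)
    patches, mapping = [], []
    n_freq, n_time = Z.shape
    for t in range(n_time):
        col = Z[:, t]
        for f in range(1, n_freq - 1):
            # Assumed peak rule: above threshold and a local maximum in frequency.
            if col[f] > threshold and col[f] > col[f - 1] and col[f] >= col[f + 1]:
                # Z[f, t] sits at padded[f + half, t + half], the patch center
                # when patch_size is odd.
                patches.append(padded[f:f + patch_size, t:t + patch_size])
                mapping.append((f, t))
    patches = np.array(patches).reshape(-1, patch_size, patch_size)
    mapping = np.array(mapping, dtype=int).reshape(-1, 2)
    return patches, mapping


Z = np.zeros((10, 4))
Z[5, 2] = 1.0
patches, mapping = extract_patches_sketch(Z, patch_size=3)
print(patches.shape, mapping.tolist())  # (1, 3, 3) [[5, 2]]
```

As in the documented return values, `mapping` has one (frequency index, time index) row per patch.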

omnizart.feature.cfp.extract_vocal_cfp(filename, down_fs=16000, **kwargs)

Specialized CFP feature extraction for vocal submodule.

omnizart.feature.cfp.freq_to_log_freq_mapping(tfr, f, fr, fc, tc, NumPerOct)
omnizart.feature.cfp.nonlinear_func(X, g, cutoff)
omnizart.feature.cfp.parallel_extract(x, samples, max_sample, fr, fs, Hop, h, fc, tc, g, bin_per_octave)
omnizart.feature.cfp.quef_to_log_freq_mapping(ceps, q, fs, fc, tc, NumPerOct)
omnizart.feature.cfp.spectral_flux(spec, invert=False, norm=True)


omnizart.feature.hcfp.extract_hcfp(filename, hop=0.02, win_size=7939, fr=2.0, g=[0.24, 0.6, 1], bin_per_octave=48, down_fs=44100, max_sample=2000, harmonic_num=6)
omnizart.feature.hcfp.fetch_harmonic(data, cenf, ith_har, start_freq=27.5, num_per_octave=48, is_reverse=False)


omnizart.feature.cqt.extract_cqt(audio_path, sampling_rate=44100, lowest_note=16, note_num=120, a_hop=256, pad_sec=1)

Compute the constant-Q spectrogram of the audio, then normalize and log-scale it.

audio_path: Path

Path to the input audio.

sampling_rate: int

Sampling rate of the audio data. It should equal DOWN_SAMPLE_TO_SAPMLING_RATE.

lowest_note: int

Lowest MIDI note number.

note_num: int

Total number of notes. The highest note number is thus lowest_note + note_num.

a_hop: int

Hop size, in samples, for computing the CQT.

pad_sec: float

Length of padding, in seconds, added to the beginning and the end of the raw audio data.

midi_gram: np.ndarray

Log-magnitude, L2-normalized constant-Q spectrogram of the input audio.
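The sketch below shows how the parameters above could translate into inputs for `librosa.cqt`, assuming zero padding and one CQT bin per semitone (12 bins per octave). `prepare_cqt_input` is a hypothetical helper; it does not call librosa itself, so it runs with NumPy alone.

```python
import numpy as np


def prepare_cqt_input(audio, sampling_rate=44100, lowest_note=16, note_num=120,
                      a_hop=256, pad_sec=1):
    """Hypothetical helper: pad the audio and collect librosa.cqt arguments."""
    # Zero padding of pad_sec seconds at the beginning and the end.
    pad = int(round(pad_sec * sampling_rate))
    padded = np.pad(audio, (pad, pad))
    # Standard MIDI-to-Hz conversion (MIDI 69 = A4 = 440 Hz).
    fmin = 440.0 * 2.0 ** ((lowest_note - 69) / 12.0)
    # Assumption: one CQT bin per semitone, so note_num bins in total.
    cqt_kwargs = dict(sr=sampling_rate, hop_length=a_hop, fmin=fmin,
                      n_bins=note_num, bins_per_octave=12)
    return padded, cqt_kwargs


padded, kwargs = prepare_cqt_input(np.ones(100), sampling_rate=1000, pad_sec=0.5)
print(len(padded), round(kwargs["fmin"], 2))  # 1100 20.6
```

The dictionary keys match the `sr`, `hop_length`, `fmin`, `n_bins` and `bins_per_octave` keyword arguments of `librosa.cqt`, so `np.abs(librosa.cqt(padded, **kwargs))` would give a magnitude spectrogram on this grid. MIDI note 16, the default `lowest_note`, is E0 at about 20.6 Hz.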


Normalize and log-scale a constant-Q spectrogram.

gram: np.ndarray

Constant-Q spectrogram, constructed from librosa.cqt.

log_normalized_gram: np.ndarray

Log-magnitude, L2-normalized constant-Q spectrogram.
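A NumPy sketch of the normalize-and-log-scale step, assuming the magnitude is converted to decibels relative to the spectrogram's maximum (floored at `-top_db`) and each time frame (column) is then scaled to unit L2 norm. The helper name and its `amin`/`top_db` defaults are assumptions for illustration; the function documented above builds on librosa and may differ in detail, for example in the orientation of its output.

```python
import numpy as np


def log_normalize_sketch(gram, amin=1e-5, top_db=80.0):
    """Sketch: dB-scale a freq x time CQT, then L2-normalize each frame."""
    mag = np.abs(gram)
    # Decibels relative to the loudest bin, so the maximum maps to 0 dB.
    ref = max(mag.max(), amin)
    db = 20.0 * np.log10(np.maximum(mag, amin) / ref)
    # Floor very quiet bins at -top_db.
    db = np.maximum(db, -top_db)
    # L2-normalize each time frame (column); leave all-zero frames unchanged.
    norms = np.linalg.norm(db, axis=0, keepdims=True)
    return db / np.where(norms > 0, norms, 1.0)


gram = np.array([[1.0, 0.1], [0.1, 1.0]])
print(log_normalize_sketch(gram))
```

In this example each bin is either 0 dB or -20 dB, so every normalized frame contains 0 and -1.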

Beat Tracking

class omnizart.feature.beat_for_drum.MadmomBeatTracking(num_threads=3)

Extract beat information with madmom library.

Three different beat tracking methods are combined to produce a more stable beat tracking result.



Generate beat tracking results with multiple approaches.

omnizart.feature.beat_for_drum.extract_beat_with_madmom(audio_path, sampling_rate=44100)

Extract beat positions (in seconds) from the audio.

Extract beats with a mixture of beat tracking techniques using madmom.

audio_path: Path

Path to the target audio.

sampling_rate: int

Target sampling rate to which the audio is resampled.

beat_arr: 1D numpy array

Contains beat positions in seconds.

audio_len_sec: float

Total length of the audio in seconds.

omnizart.feature.beat_for_drum.extract_mini_beat_from_audio_path(audio_path, sampling_rate=44100, mini_beat_div_n=32)

Wrapper that extracts mini beats directly from an audio path.

omnizart.feature.beat_for_drum.extract_mini_beat_from_beat_arr(beat_arr, audio_len_sec, mini_beat_div_n=32)

Extract mini beats from the beat array.

Further split each beat into shorter intervals, called mini beats, to increase the beat resolution. The mini beats are generated by linear interpolation.

beat_arr: 1D numpy array

Beat array generated by extract_beat_with_madmom.

audio_len_sec: float

Total length of the audio in seconds.

mini_beat_div_n: int

Number of mini beats in a single 4/4 measure.

mini_beat_pos_t: 1D numpy array

Positions of mini beats in seconds.
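Assuming a 4/4 meter, each beat holds `mini_beat_div_n / 4` mini beats (8 with the default of 32). The sketch below interpolates linearly between consecutive beats as described above. How omnizart handles the region after the last beat is not documented here, so extending the last beat interval up to `audio_len_sec` is an assumption, and `mini_beats_sketch` is a hypothetical name.

```python
import numpy as np


def mini_beats_sketch(beat_arr, audio_len_sec, mini_beat_div_n=32):
    """Hypothetical mini-beat interpolation; beats must be strictly increasing."""
    beats = np.asarray(beat_arr, dtype=float)
    if len(beats) < 2:
        # Not enough beats to define an interval.
        return beats.copy()
    # In a 4/4 measure each of the 4 beats gets mini_beat_div_n / 4 mini beats.
    per_beat = mini_beat_div_n // 4
    # Linear interpolation between consecutive beats (end point excluded).
    segments = [np.linspace(s, e, per_beat, endpoint=False)
                for s, e in zip(beats[:-1], beats[1:])]
    # Assumption: keep the last beat interval until the end of the audio.
    step = (beats[-1] - beats[-2]) / per_beat
    n_tail = max(1, int(np.ceil((audio_len_sec - beats[-1]) / step)))
    tail = beats[-1] + step * np.arange(n_tail)
    return np.concatenate(segments + [tail])


print(mini_beats_sketch([0.0, 1.0, 2.0], audio_len_sec=2.5, mini_beat_div_n=8))
```

With beats one second apart and `mini_beat_div_n=8`, this yields a mini beat every 0.5 seconds.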