Estimate pitch using the Kaldi-modified version of RAPT
The algorithm used is a version of the RAPT algorithm that has been modified to assign pitch values also to voiceless frames, and that computes a Normalized Cross Correlation Function (NCCF) which can be used to estimate the probability of voicing (Ghahremani et al. 2014).
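For intuition, a bare-bones version of the NCCF for a single frame might look like the sketch below. This is illustrative only, not the implementation used by kaldi_pitch(); the real algorithm also adds a "ballast" term to the denominator (see nccfBallast) to down-weight quiet frames.
# Illustrative sketch of the NCCF for one analysis frame at a given lag.
# Not the kaldi_pitch() implementation.
nccf <- function(x, lag, window_length) {
  v1 <- x[1:window_length]
  v2 <- x[(1 + lag):(window_length + lag)]
  # Correlation of the frame with its lagged copy, normalized by the
  # energies of the two windows
  sum(v1 * v2) / sqrt(sum(v1^2) * sum(v2^2))
}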
Usage
kaldi_pitch(
listOfFiles,
beginTime = 0,
endTime = 0,
windowShift = 5,
windowSize = 25,
minF = 70,
maxF = 200,
softMinF0 = 10,
voiced_voiceless_cost = 0.1,
lowpass_cutoff = 1000,
resample_frequency = 4000,
deltaChange = 0.005,
nccfBallast = 7000,
lowpass_filter_width = 1,
upsample_filter_width = 5,
max_frames_latency = 0,
frames_per_chunk = 0,
simulate_first_pass_online = FALSE,
recompute_frame = 500,
snip_edges = TRUE,
explicitExt = "kap",
outputDirectory = NULL,
toFile = TRUE,
conda.env = NULL
)
Arguments
- listOfFiles
A vector of file paths to wav files.
- beginTime
The start time of the section of the sound file that should be processed.
- endTime
The end time of the section of the sound file that should be processed.
- windowShift
The measurement interval (frame duration), in ms.
- windowSize
The analysis window length, in ms.
- minF
Candidate f0 frequencies below this frequency will not be considered.
- maxF
Candidate f0 frequencies above this frequency will not be considered.
- softMinF0
Minimum f0, applied in a soft way; must not exceed minF. (default: 10.0)
- voiced_voiceless_cost
Cost factor for f0 change. (default: 0.1)
- lowpass_cutoff
Cutoff frequency of the low-pass filter, in Hz. (default: 1000)
- resample_frequency
Frequency that the signal is down-sampled to. Must be more than twice lowpass_cutoff. (default: 4000)
- deltaChange
Smallest relative change in pitch that the algorithm measures. (default: 0.005)
- nccfBallast
Increasing this factor reduces the NCCF for quiet frames. (default: 7000)
- lowpass_filter_width
Integer that determines the width of the low-pass filter; a larger value gives a sharper filter. (default: 1)
- upsample_filter_width
Integer that determines the filter width when upsampling the NCCF. (default: 5)
- max_frames_latency
Maximum number of frames of latency that pitch tracking is allowed to introduce into the feature processing. Affects the output only if frames_per_chunk > 0 and simulate_first_pass_online = TRUE. (default: 0)
- frames_per_chunk
The number of frames used for energy normalization. (default: 0)
- simulate_first_pass_online
If TRUE, the function will output features that correspond to what an online decoder would see in the first pass of decoding, rather than the final version of the features, which is the default. Relevant if frames_per_chunk > 0. (default: FALSE)
- recompute_frame
Only relevant for compatibility with online pitch extraction. A non-critical parameter: the frame at which some of the forward pointers are recomputed, after the estimate of the signal energy has been revised. Relevant if frames_per_chunk > 0. (default: 500)
- snip_edges
If set to FALSE, the incomplete frames near the end of the file will not be snipped, so that the number of frames is the duration of the file divided by windowShift (see the sketch after this list). This makes different types of features give the same number of frames. (default: TRUE)
- explicitExt
The file extension that should be used for the output file.
- outputDirectory
Set an explicit directory where the output file will be written. If not defined, the file will be written to the same directory as the sound file.
- toFile
Should the output be written to a file? The file will be written in outputDirectory, if defined, or in the same directory as the sound file.
- conda.env
The name of the conda environment in which Python and its required packages are installed. Please make sure that you know what you are doing if you change this.
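As a quick sketch of the frame-count relationship mentioned under snip_edges (illustrative values, not package code):
# With snip_edges = FALSE, the number of frames is the file duration divided
# by the frame shift; here for a hypothetical 2.5 s file at the default
# 5 ms windowShift.
duration_ms <- 2500
windowShift <- 5
n_frames <- duration_ms / windowShift  # 500 frames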
Value
An SSFF track object containing two tracks (f0 and nccf), which is either returned (toFile = FALSE) or stored on disk.
Details
The function calls the torchaudio (Yang et al. 2021) library to compute the pitch estimates and therefore relies on torchaudio being available in a properly set up Python environment. Please refer to the torchaudio manual for further information.
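One way to prepare such an environment from R is via reticulate. This is a sketch under the assumption that a pip-installed torchaudio satisfies the function's requirements; the environment name "superassp-pitch" is an arbitrary example.
# Create a conda environment and install torchaudio into it
reticulate::conda_create("superassp-pitch")
reticulate::conda_install("superassp-pitch", packages = "torchaudio", pip = TRUE)
The environment name would then be passed to kaldi_pitch() as conda.env = "superassp-pitch".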
References
Ghahremani P, BabaAli B, Povey D, Riedhammer K, Trmal J, Khudanpur S (2014).
“A Pitch Extraction Algorithm Tuned for Automatic Speech Recognition.”
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2494–2498. doi:10.1109/icassp.2014.6854049.
Yang Y, Hira M, Ni Z, Chourdia A, Astafurov A, Chen C, Yeh C, Puhrsch C, Pollack D, Genzel D, Greenberg D, Yang EZ, Lian J, Mahadeokar J, Hwang J, Chen J, Goldsborough P, Roy P, Narenthiran S, Watanabe S, Chintala S, Quenneville-Bélair V, Shi Y (2021).
“TorchAudio: Building Blocks for Audio and Speech Processing.”
arXiv preprint arXiv:2110.15018.
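Examples
A minimal call; the file path is a placeholder, and a working torchaudio conda environment (see Details) is assumed.
# Analyse one file and return the SSFF object instead of writing it to disk
f0 <- kaldi_pitch("speech.wav", minF = 70, maxF = 200, toFile = FALSE)
# Print an overview of the returned object and its f0 and nccf tracks
f0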