
The algorithm used is a version of the RAPT algorithm that also considers voicing in voiceless frames and computes a Normalized Cross Correlation Function (NCCF) that can be used to estimate the probability of voicing (Ghahremani et al. 2014).

Usage

kaldi_pitch(
  listOfFiles,
  beginTime = 0,
  endTime = 0,
  windowShift = 5,
  windowSize = 25,
  minF = 70,
  maxF = 200,
  softMinF0 = 10,
  voiced_voiceless_cost = 0.1,
  resample_frequency = 4000,
  deltaChange = 0.005,
  nccfBallast = 7000,
  lowpass_cutoff = 1000,
  lowpass_filter_width = 1,
  upsample_filter_width = 5,
  max_frames_latency = 0,
  frames_per_chunk = 0,
  simulate_first_pass_online = FALSE,
  recompute_frame = 500,
  snip_edges = TRUE,
  explicitExt = "kap",
  outputDirectory = NULL,
  toFile = TRUE,
  conda.env = NULL
)

Arguments

listOfFiles

A vector of file paths to wav files.

beginTime

The start time of the section of the sound file that should be processed.

endTime

The end time of the section of the sound file that should be processed.

windowShift

The measurement interval (frame shift), in ms.

windowSize

The analysis window length, in ms.

minF

Candidate f0 frequencies below this frequency will not be considered.

maxF

Candidates above this frequency will be ignored.

resample_frequency

Frequency that we down-sample the signal to. Must be more than twice lowpass_cutoff. (default: 4000)

lowpass_cutoff

Cutoff frequency for LowPass filter (Hz) (default: 1000)

lowpass_filter_width

Integer that determines the filter width of the lowpass filter; a larger value gives a sharper filter. (default: 1)

max_frames_latency

Maximum number of frames of latency that we allow pitch tracking to introduce into the feature processing (affects output only if frames_per_chunk > 0 and simulate_first_pass_online=TRUE) (default: 0)

frames_per_chunk

The number of frames used for energy normalization. (default: 0)

simulate_first_pass_online

If true, the function will output features that correspond to what an online decoder would see in the first pass of decoding – not the final version of the features, which is the default. (default: FALSE) Relevant if frames_per_chunk > 0.

recompute_frame

Only relevant for compatibility with online pitch extraction. A non-critical parameter; the frame at which we recompute some of the forward pointers, after revising our estimate of the signal energy. Relevant if frames_per_chunk > 0. (default: 500)

snip_edges

If this is set to FALSE, the incomplete frames near the ending edge won’t be snipped, so that the number of frames is the signal duration divided by windowShift. This makes different types of features give the same number of frames. (default: TRUE)

explicitExt

the file extension that should be used.

outputDirectory

set an explicit directory for where the signal file will be written. If not defined, the file will be written to the same directory as the sound file.

toFile

write the output to a file? The file will be written in outputDirectory, if defined, or in the same directory as the soundfile.

conda.env

The name of the conda environment in which Python and its required packages are stored. Please make sure that you know what you are doing if you change this.

softMinF0

Minimum f0, applied in a soft way; must not exceed minF. (default: 10)

voiced_voiceless_cost

Cost factor for f0 change. (default: 0.1)

deltaChange

Smallest relative change in pitch that the algorithm measures. (default: 0.005)

nccfBallast

Increasing this factor reduces NCCF for quiet frames. (default: 7000)

upsample_filter_width

Integer that determines the filter width when upsampling the NCCF. (default: 5)
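With snip_edges = FALSE, the number of output frames follows directly from the signal duration and windowShift. A quick sketch of that arithmetic (the duration value is hypothetical):

```r
# Number of frames produced when snip_edges = FALSE
# (values below are hypothetical; windowShift is in ms)
duration_ms <- 1250   # a 1.25 s recording
windowShift <- 5      # the default frame shift
n_frames <- duration_ms / windowShift
n_frames              # 250 frames
```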

Value

An SSFF track object containing two tracks (f0 and nccf) that are either returned (toFile == FALSE) or stored on disk.

Details

The function calls the torchaudio (Yang et al. 2021) library to compute the pitch estimates and therefore relies on it being present in a properly set up Python environment. Please refer to the torchaudio manual for further information.

References

Ghahremani P, BabaAli B, Povey D, Riedhammer K, Trmal J, Khudanpur S (2014). “A Pitch Extraction Algorithm Tuned for Automatic Speech Recognition.” 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2494--2498. doi:10.1109/icassp.2014.6854049.

Yang Y, Hira M, Ni Z, Chourdia A, Astafurov A, Chen C, Yeh C, Puhrsch C, Pollack D, Genzel D, Greenberg D, Yang EZ, Lian J, Mahadeokar J, Hwang J, Chen J, Goldsborough P, Roy P, Narenthiran S, Watanabe S, Chintala S, Quenneville-Bélair V, Shi Y (2021). “TorchAudio: Building Blocks for Audio and Speech Processing.” arXiv preprint arXiv:2110.15018.

See also

rapt
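Examples

A minimal sketch of a call. The file path is a placeholder, and a Python environment with torchaudio must be available (see Details):

```r
# load the package providing kaldi_pitch() before running this sketch

# estimate pitch for one (hypothetical) file and keep the result in memory
pitch <- kaldi_pitch(
  "speech.wav",            # hypothetical path to a wav file
  minF = 70, maxF = 200,   # search range for f0 candidates (Hz)
  toFile = FALSE           # return the SSFF object instead of writing a .kap file
)

# the returned SSFF track object holds two tracks, f0 and nccf
summary(pitch)
```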