Estimate pitch using the Kaldi-modified version of RAPT
The algorithm used is a version of the RAPT algorithm that has been modified to assign pitch values also to voiceless frames, and that computes a Normalized Cross Correlation Function (NCCF) which can be used to estimate the probability of voicing (Ghahremani et al. 2014).
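For intuition, a bare-bones version of the NCCF for a single frame might look like the sketch below. This is illustrative only, not the implementation used by kaldi_pitch(); the real algorithm also adds a "ballast" term to the denominator (see nccfBallast) to down-weight quiet frames.
# Illustrative sketch of the NCCF for one analysis frame at a given lag.
# Not the kaldi_pitch() implementation.
nccf <- function(x, lag, window_length) {
  v1 <- x[1:window_length]
  v2 <- x[(1 + lag):(window_length + lag)]
  # Correlation of the frame with its lagged copy, normalized by the
  # energies of the two windows
  sum(v1 * v2) / sqrt(sum(v1^2) * sum(v2^2))
}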
Usage
kaldi_pitch(
listOfFiles,
beginTime = 0,
endTime = 0,
windowShift = 5,
windowSize = 25,
minF = 70,
maxF = 200,
softMinF0 = 10,
voiced_voiceless_cost = 0.1,
lowpass_cutoff = 1000,
resample_frequency = 4000,
deltaChange = 0.005,
nccfBallast = 7000,
lowpass_filter_width = 1,
upsample_filter_width = 5,
max_frames_latency = 0,
frames_per_chunk = 0,
simulate_first_pass_online = FALSE,
recompute_frame = 500,
snip_edges = TRUE,
explicitExt = "kap",
outputDirectory = NULL,
toFile = TRUE,
conda.env = NULL
)
Arguments
- listOfFiles
A vector of file paths to wav files.
- beginTime
The start time of the section of the sound file that should be processed.
- endTime
The end time of the section of the sound file that should be processed.
- windowShift
The measurement interval (frame duration), in ms.
- windowSize
The analysis window length, in ms.
- minF
Candidate f0 frequencies below this frequency will not be considered.
- maxF
Candidate f0 frequencies above this frequency will not be considered.
- softMinF0
Minimum f0, applied in a soft way; must not exceed minF. (default: 10.0)
- voiced_voiceless_cost
Cost factor for f0 change. (default: 0.1)
- lowpass_cutoff
Cutoff frequency of the low-pass filter, in Hz. (default: 1000)
- resample_frequency
Frequency that the signal is down-sampled to. Must be more than twice lowpass_cutoff. (default: 4000)
- deltaChange
Smallest relative change in pitch that the algorithm measures. (default: 0.005)
- nccfBallast
Increasing this factor reduces the NCCF for quiet frames. (default: 7000)
- lowpass_filter_width
Integer that determines the width of the low-pass filter; a larger value gives a sharper filter. (default: 1)
- upsample_filter_width
Integer that determines the filter width when upsampling the NCCF. (default: 5)
- max_frames_latency
Maximum number of frames of latency that pitch tracking is allowed to introduce into the feature processing. Affects the output only if frames_per_chunk > 0 and simulate_first_pass_online = TRUE. (default: 0)
- frames_per_chunk
The number of frames used for energy normalization. (default: 0)
- simulate_first_pass_online
If TRUE, the function will output features that correspond to what an online decoder would see in the first pass of decoding, rather than the final version of the features, which is the default. Relevant if frames_per_chunk > 0. (default: FALSE)
- recompute_frame
Only relevant for compatibility with online pitch extraction. A non-critical parameter: the frame at which some of the forward pointers are recomputed, after the estimate of the signal energy has been revised. Relevant if frames_per_chunk > 0. (default: 500)
- snip_edges
If set to FALSE, the incomplete frames near the end of the file will not be snipped, so that the number of frames is the duration of the file divided by windowShift (see the sketch after this list). This makes different types of features give the same number of frames. (default: TRUE)
- explicitExt
The file extension that should be used for the output file.
- outputDirectory
Set an explicit directory where the output file will be written. If not defined, the file will be written to the same directory as the sound file.
- toFile
Should the output be written to a file? The file will be written in outputDirectory, if defined, or in the same directory as the sound file.
- conda.env
The name of the conda environment in which Python and its required packages are installed. Please make sure that you know what you are doing if you change this.
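As a quick sketch of the frame-count relationship mentioned under snip_edges (illustrative values, not package code):
# With snip_edges = FALSE, the number of frames is the file duration divided
# by the frame shift; here for a hypothetical 2.5 s file at the default
# 5 ms windowShift.
duration_ms <- 2500
windowShift <- 5
n_frames <- duration_ms / windowShift  # 500 frames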
Value
An SSFF track object containing two tracks (f0 and nccf), which is either returned (toFile = FALSE) or stored on disk.
Details
The function calls the torchaudio (Yang et al. 2021) library to compute the pitch estimates and therefore relies on torchaudio being available in a properly set up Python environment. Please refer to the torchaudio manual for further information.
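One way to prepare such an environment from R is via reticulate. This is a sketch under the assumption that a pip-installed torchaudio satisfies the function's requirements; the environment name "superassp-pitch" is an arbitrary example.
# Create a conda environment and install torchaudio into it
reticulate::conda_create("superassp-pitch")
reticulate::conda_install("superassp-pitch", packages = "torchaudio", pip = TRUE)
The environment name would then be passed to kaldi_pitch() as conda.env = "superassp-pitch".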
References
Ghahremani P, BabaAli B, Povey D, Riedhammer K, Trmal J, Khudanpur S (2014).
“A Pitch Extraction Algorithm Tuned for Automatic Speech Recognition.”
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2494–2498. doi:10.1109/icassp.2014.6854049.
Yang Y, Hira M, Ni Z, Chourdia A, Astafurov A, Chen C, Yeh C, Puhrsch C, Pollack D, Genzel D, Greenberg D, Yang EZ, Lian J, Mahadeokar J, Hwang J, Chen J, Goldsborough P, Roy P, Narenthiran S, Watanabe S, Chintala S, Quenneville-Bélair V, Shi Y (2021).
“TorchAudio: Building Blocks for Audio and Speech Processing.”
arXiv preprint arXiv:2110.15018.
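Examples
A minimal call; the file path is a placeholder, and a working torchaudio conda environment (see Details) is assumed.
# Analyse one file and return the SSFF object instead of writing it to disk
f0 <- kaldi_pitch("speech.wav", minF = 70, maxF = 200, toFile = FALSE)
# Print an overview of the returned object and its f0 and nccf tracks
f0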