Compute f0 using the algorithm named Yet Another Algorithm for Pitch Tracking

The Yet Another Algorithm for Pitch Tracking (YAAPT) algorithm (Kasi and Zahorian 2002) computes f0 using the Normalized Cross Correlation Function (NCCF) and builds on the work of Talkin (Talkin and Kleijn 1995) in developing the RAPT algorithm.

Usage

yaapt(
  listOfFiles,
  beginTime = 0,
  endTime = 0,
  windowShift = 5,
  windowSize = 35,
  minF = 70,
  maxF = 200,
  tda_frame_length = 35,
  fft_length = 8192,
  bp_forder = 150,
  bp_low = 50,
  bp_high = 1500,
  nlfer_thresh1 = 0.75,
  nlfer_thresh2 = 0.1,
  shc_numharms = 3,
  shc_window = 40,
  shc_maxpeaks = 4,
  shc_pwidth = 50,
  shc_thresh1 = 5,
  shc_thresh2 = 1.25,
  f0_double = 150,
  f0_half = 150,
  dp5_k1 = 11,
  dec_factor = 1,
  nccf_thresh1 = 0.3,
  nccf_thresh2 = 0.9,
  nccf_maxcands = 3,
  nccf_pwidth = 5,
  merit_boost = 0.2,
  merit_pivot = 0.99,
  merit_extra = 0.4,
  median_value = 7,
  dp_w1 = 0.15,
  dp_w2 = 0.5,
  dp_w3 = 0.1,
  dp_w4 = 0.9,
  explicitExt = "yf0",
  outputDirectory = NULL,
  toFile = TRUE
)

Arguments

listOfFiles: A vector of file paths to wav files.
beginTime: The start time of the section of the sound file that should be processed.
endTime: The end time of the section of the sound file that should be processed.
windowShift: The measurement interval (frame shift), in milliseconds (default: 5)
windowSize: length of each analysis frame (default: 35 ms)
minF: Candidate f0 frequencies below this frequency will not be considered.
maxF: Candidates above this frequency will be ignored.
tda_frame_length: The frame length used in the time domain analysis (default: 35 ms, the same as the default windowSize)
fft_length: FFT length (default: 8192 samples)
bp_forder: order of band-pass filter (default: 150)
bp_low: low frequency of filter passband (default: 50 Hz)
bp_high: high frequency of filter passband (default: 1500 Hz)
nlfer_thresh1: NLFER (Normalized Low Frequency Energy Ratio) boundary for voiced/unvoiced decisions (default: 0.75)
nlfer_thresh2: threshold for NLFER definitely unvoiced (default: 0.1)
shc_numharms: number of harmonics in SHC (Spectral Harmonics Correlation) calculation (default: 3)
shc_window: SHC window length (default: 40 Hz)
shc_maxpeaks: maximum number of SHC peaks to be found (default: 4)
shc_pwidth: window width in SHC peak picking (default: 50 Hz)
shc_thresh1: threshold 1 for SHC peak picking (default: 5)
shc_thresh2: threshold 2 for SHC peak picking (default: 1.25)
f0_double: pitch doubling decision threshold (default: 150 Hz)
f0_half: pitch halving decision threshold (default: 150 Hz)
dp5_k1: weight used in dynamic program (default: 11)
dec_factor: factor for signal resampling (default: 1)
nccf_thresh1: threshold for considering a peak in NCCF (Normalized Cross Correlation Function) (default: 0.3)
nccf_thresh2: threshold for terminating search in NCCF (default: 0.9)
nccf_maxcands: maximum number of candidates found (default: 3)
nccf_pwidth: window width in NCCF peak picking (default: 5)
merit_boost: boost merit (default: 0.2)
merit_pivot: merit assigned to unvoiced candidates in definitely unvoiced frames (default: 0.99)
merit_extra: merit assigned to extra candidates in reducing pitch doubling/halving errors (default: 0.4)
median_value: order of median filter (default: 7)
dp_w1: DP (Dynamic Programming) weight factor for voiced-voiced transitions (default: 0.15)
dp_w2: DP weight factor for voiced-unvoiced or unvoiced-voiced transitions (default: 0.5)
dp_w3: DP weight factor for unvoiced-unvoiced transitions (default: 0.1)
dp_w4: Weight factor for local costs (default: 0.9)
explicitExt: the file extension that should be used.
outputDirectory: set an explicit directory for where the signal file will be written. If not defined, the file will be written to the same directory as the sound file.
toFile: write the output to a file? The file will be written to outputDirectory, if defined, or to the same directory as the sound file.
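
Several of the arguments above (minF, maxF, windowSize, nccf_thresh1, nccf_maxcands) control how f0 candidates are picked from the NCCF. The base R sketch below illustrates the general idea on a synthetic signal. It is not the code used by yaapt(): the helper names nccf_frame and pick_peaks, the 16 kHz sampling rate and the simple local-maximum rule are all assumptions made for illustration.

```r
# Normalized cross-correlation of one frame with lagged copies of itself.
# x: signal, start: first sample of the frame, n: frame length in samples.
nccf_frame <- function(x, start, n, lags) {
  ref <- x[start:(start + n - 1)]
  e_ref <- sum(ref^2)
  vapply(lags, function(k) {
    seg <- x[(start + k):(start + k + n - 1)]
    sum(ref * seg) / sqrt(e_ref * sum(seg^2))
  }, numeric(1))
}

# Keep local maxima above `thresh`, strongest first, at most `max_cands`.
# (Illustrative rule only; the real peak picking is more elaborate.)
pick_peaks <- function(vals, thresh, max_cands) {
  n <- length(vals)
  mid <- vals[2:(n - 1)]
  idx <- which(mid > vals[1:(n - 2)] & mid >= vals[3:n]) + 1
  idx <- idx[vals[idx] > thresh]
  idx <- idx[order(vals[idx], decreasing = TRUE)]
  idx[seq_len(min(length(idx), max_cands))]
}

fs <- 16000                          # assumed sampling rate in Hz
tt <- (0:1599) / fs                  # 100 ms of samples
x <- sin(2 * pi * 100 * tt) + 0.5 * sin(2 * pi * 200 * tt)  # 100 Hz tone plus one harmonic

min_f <- 70                          # same values as the minF / maxF defaults
max_f <- 200
lags <- floor(fs / max_f):ceiling(fs / min_f)  # lag range covering 70-200 Hz
n <- round(0.035 * fs)               # 35 ms frame, as in windowSize

vals <- nccf_frame(x, start = 1, n = n, lags = lags)
cand <- pick_peaks(vals, thresh = 0.3, max_cands = 3)  # cf. nccf_thresh1, nccf_maxcands
f0_cands <- fs / lags[cand]          # convert candidate lags to Hz
print(f0_cands)                      # strongest candidate is the 100 Hz fundamental
```

Restricting the lag range to fs / maxF through fs / minF is what keeps candidates outside the permitted f0 range from being considered.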

Value

An SSFF track object containing two tracks: "f0", which holds the computed pitch values, and "voiced", a binary indication of whether each frame was considered voiced (1) or not (0). The tracks are either returned (toFile == FALSE) or stored on disk.
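
As a hedged illustration of the two output modes, the sketch below builds a call that returns the tracks instead of writing them to disk. The file path is hypothetical, and the call is only made when a function named yaapt is available in the session and the file exists; the changed minF and maxF values are arbitrary examples, not recommendations.

```r
# Hypothetical input file; replace with the path to a real wav file.
wav_file <- file.path("recordings", "speaker1.wav")

# Argument names are taken from the Usage section; anything omitted keeps its default.
yaapt_args <- list(
  listOfFiles = wav_file,
  minF = 60,        # lower bound for f0 candidates (example value)
  maxF = 400,       # upper bound for f0 candidates (example value)
  toFile = FALSE    # return the tracks instead of writing a file
)

# Guarded call: skipped when yaapt() or the file is not available.
if (exists("yaapt", mode = "function") && file.exists(wav_file)) {
  f0_track <- do.call(yaapt, yaapt_args)
  str(f0_track)     # inspect the returned "f0" and "voiced" tracks
}
```

With toFile = TRUE (the default) the same call would instead write a file with the explicitExt extension next to the sound file, or into outputDirectory if one is given.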

Details

The YAAPT algorithm processes both the original acoustic signal and a non-linearly processed version of the signal, which partially restores very weak f0 components. It uses intelligent peak picking to select multiple f0 candidates and assign merit factors, and it incorporates highly robust pitch contours obtained from smoothed versions of the low frequency portions of spectrograms. Dynamic programming is then used to find the “best” pitch track among all the candidates, using both local and transition costs.
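
The dynamic programming step can be pictured with a small base R sketch. This is a generic minimum-cost path search, not the YAAPT implementation: the function name best_track, the local cost (1 - merit) and the transition cost (relative f0 change) are assumptions chosen for illustration, and voiced/unvoiced decisions are left out entirely. The default weights merely reuse the values of dp_w1 and dp_w4.

```r
# cands: matrix of f0 candidates (rows = candidates, columns = frames).
# merit: matrix of the same shape with merit values in [0, 1].
best_track <- function(cands, merit, w_trans = 0.15, w_local = 0.9) {
  n_cand <- nrow(cands)
  n_frames <- ncol(cands)
  local_cost <- w_local * (1 - merit)          # assumed local cost
  cost <- matrix(Inf, n_cand, n_frames)        # best accumulated cost per node
  back <- matrix(0L, n_cand, n_frames)         # best predecessor per node
  cost[, 1] <- local_cost[, 1]
  for (fr in seq_len(n_frames)[-1]) {
    for (j in seq_len(n_cand)) {
      # Assumed transition cost: relative f0 change from each predecessor.
      trans <- w_trans * abs(cands[j, fr] - cands[, fr - 1]) / cands[, fr - 1]
      total <- cost[, fr - 1] + trans
      back[j, fr] <- which.min(total)
      cost[j, fr] <- min(total) + local_cost[j, fr]
    }
  }
  # Trace the cheapest path backwards from the last frame.
  path <- integer(n_frames)
  path[n_frames] <- which.min(cost[, n_frames])
  for (fr in rev(seq_len(n_frames))[-1]) {
    path[fr] <- back[path[fr + 1], fr + 1]
  }
  cands[cbind(path, seq_len(n_frames))]
}

# Two candidates per frame: the true f0 and its octave double.
cands <- rbind(c(100, 102, 101, 103, 104),
               c(200, 204, 202, 206, 208))
# In frame 3 the doubled candidate has the higher merit.
merit <- rbind(c(0.9, 0.9, 0.6, 0.9, 0.9),
               c(0.5, 0.5, 0.7, 0.5, 0.5))

greedy <- cands[cbind(apply(merit, 2, which.max), seq_len(ncol(cands)))]
track <- best_track(cands, merit)
print(greedy)  # 100 102 202 103 104: per-frame choice jumps an octave in frame 3
print(track)   # 100 102 101 103 104: transition costs keep the track continuous
```

The comparison with the per-frame choice shows why transition costs matter: a single frame with a misleading merit value does not pull the track to the doubled candidate.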

References

Kasi K, Zahorian SA (2002). “Yet Another Algorithm for Pitch Tracking.” 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1, I-361-I-364. doi:10.1109/icassp.2002.5743729.

Talkin D, Kleijn WB (1995). “A robust algorithm for pitch tracking (RAPT).” Speech coding and synthesis, 495-518.