IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-25, NO. 1, FEBRUARY 1977

On the Use of Autocorrelation Analysis for Pitch Detection

LAWRENCE R. RABINER, FELLOW, IEEE

Abstract-One of the most time honored methods of detecting pitch is to use some type of autocorrelation analysis on speech which has been appropriately preprocessed. The goal of the speech preprocessing in most systems is to whiten, or spectrally flatten, the signal so as to eliminate the effects of the vocal tract spectrum on the detailed shape of the resulting autocorrelation function. The purpose of this paper is to present some results on several types of (nonlinear) preprocessing which can be used to effectively spectrally flatten the speech signal. The types of nonlinearities which are considered are classified by a nonlinear input-output quantizer characteristic. By appropriate adjustment of the quantizer threshold levels, both the ordinary (linear) autocorrelation analysis and the center clipping-peak clipping autocorrelation of Dubnowski et al. [1] can be obtained. Results are presented to demonstrate the degree of spectrum flattening obtained using these methods. Each of the proposed methods was tested on several of the utterances used in a recent pitch detector comparison study by Rabiner et al. [2]. Results of this comparison are included in this paper. One final topic which is discussed in this paper is an algorithm for adaptively choosing a frame size for an autocorrelation pitch analysis.

I. INTRODUCTION

ALTHOUGH a large number of different methods have been proposed for detecting pitch, the autocorrelation pitch detector is still one of the most robust and reliable of pitch detectors [2]. There are several reasons why autocorrelation methods for pitch detection have generally met with good success. The autocorrelation computation is made directly on the waveform and is a fairly straightforward (albeit time consuming) computation. Although a high processing rate is required, the autocorrelation computation is readily amenable to digital hardware implementation, generally requiring only a single multiplier and an accumulator as the computational elements. Finally, the autocorrelation computation is largely phase insensitive.1 Thus, it is a good method to use to detect pitch of speech which has been transmitted over a telephone line, or has suffered some degree of phase distortion via transmission.

Although an autocorrelation pitch detector has some advantages for pitch detection, there are several problems associated with the use of this method. Although the autocorrelation function of a section of voiced speech generally displays a fairly prominent peak at the pitch period, autocorrelation peaks due to the detailed formant structure of the signal are also often present. Thus, one problem is to decide which of several autocorrelation peaks corresponds to the pitch period. Another problem with the autocorrelation computation is the required use of a window for computing the short-time autocorrelation function. The use of a window for analysis leads to at least two difficulties. First there is the problem of choosing an appropriate window. Second there is the problem that (for a stationary analysis)2 no matter which window is selected, the effect of the window is to taper the autocorrelation function smoothly to 0 as the autocorrelation index increases. This effect tends to compound the difficulties mentioned above, in which formant peaks in the autocorrelation function (which occur at lower indices than the pitch period peak) tend to be of greater magnitude than the peak due to the pitch period.

A final difficulty with the autocorrelation computation is the problem of choosing an appropriate analysis frame (window) size. The ideal analysis frame should contain from 2 to 3 complete pitch periods. Thus, for high pitched speakers the analysis frame should be short (5-20 ms), whereas for low pitched speakers it should be long (20-50 ms).
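As a rough illustration of the frame-size rule just stated (and not the adaptive algorithm to be described in Section IV), the sketch below converts a running estimate of the average pitch period into a frame length; the 2.5-period factor and the 5-50 ms clamp are assumed values chosen only to match the ranges quoted above.

```python
# Minimal sketch (not the algorithm of Section IV): pick an analysis frame that
# spans roughly 2-3 complete pitch periods, clamped to the 5-50 ms range quoted
# above.  The 2.5-period factor is an assumed illustrative value.

def frame_length_samples(avg_period_samples, fs=10000,
                         periods_per_frame=2.5, min_ms=5.0, max_ms=50.0):
    """Return an analysis frame length in samples for a given average pitch
    period estimate (in samples) at sampling rate fs (in Hz)."""
    n = int(round(periods_per_frame * avg_period_samples))
    n_min = int(fs * min_ms / 1000.0)
    n_max = int(fs * max_ms / 1000.0)
    return max(n_min, min(n, n_max))

# Example: a 100-sample (10 ms) average period gives a 250-sample (25 ms) frame.
print(frame_length_samples(100))
```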
A wide variety of solutions have been proposed to the above problems. To partially eliminate the effects of the higher formant structure on the autocorrelation function, most methods use a sharp cutoff low-pass filter with a cutoff around 900 Hz. This will, in general, preserve a sufficient number of pitch harmonics for accurate pitch detection, but will eliminate the second and higher formants. In addition to linear filtering to remove the formant structure, a wide variety of methods have been proposed for directly or indirectly spectrally flattening the speech signal to remove the effects of the first formant [3]-[5], [1]. Included among these techniques are center clipping and spectral equalization by filter bank methods [3], inverse filtering using linear prediction methods [4], spectral flattening by linear prediction and a Newton transformation [5], and spectral flattening by a combination of center and peak clipping methods [1].

Each of these methods has met with some degree of success; however, problems still remain. It is the purpose of this paper to investigate the properties of a class of nonlinearities applied to the speech signal prior to autocorrelation analysis with the purpose of spectrally flattening the signal. Also, a solution to the problem of choosing an analysis frame size which adapts to the estimated average pitch of the speaker will be presented.
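For illustration only, the sharp cutoff low-pass filtering step mentioned above might be realized as follows; the use of scipy.signal.firwin, the 101-tap length, and the 10 kHz sampling rate are assumptions, not the specific design used in this paper (a 25-tap filter is described in Section II).

```python
# Illustrative low-pass preprocessing at a 10 kHz sampling rate.  firwin() and
# the 101-tap length are assumed choices, not the 25-tap design of Section II.
import numpy as np
from scipy.signal import firwin, lfilter

fs = 10000                          # sampling rate in Hz
h = firwin(101, 900, fs=fs)         # linear phase FIR low-pass, 900 Hz cutoff

def lowpass_900(s):
    """Low-pass filter the speech signal s(n) prior to autocorrelation analysis."""
    return lfilter(h, [1.0], np.asarray(s, dtype=float))
```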
The organization of this paper is as follows. In Section II we review the theory of short-time autocorrelation analysis and present the various types of nonlinearities to be investigated for spectrally flattening the speech. Examples of signal spectra obtained with the nonlinearities being used will be given in this section. In Section III the results of a limited but formal evaluation of each of the nonlinear autocorrelation analyses are given. Several of the test utterances used in [2] are used in this test for comparison purposes. In Section IV we discuss a simple algorithm for adapting the frame size of the analysis based on the estimated average pitch period for the speaker, and present results on how well it worked on several test examples.

Manuscript received April 4, 1976; revised August 16, 1976.
The author is with Bell Laboratories, Murray Hill, NJ 07974.

1 In the limit of exactly periodic signals, or for an infinite correlation function, it is exactly phase insensitive.

2 A stationary analysis is one for which the same set of input samples is used in computing all the points of the autocorrelation function. A nonstationary analysis is impractical for pitch detection because of the large number of autocorrelation points involved in the computation.

II. SHORT-TIME AUTOCORRELATION ANALYSIS

Given a discrete time signal x(n), defined for all n, the autocorrelation function is generally defined as

\phi_x(m) = \lim_{N \to \infty} \frac{1}{2N+1} \sum_{n=-N}^{N} x(n) x(n+m).     (1)

The autocorrelation function of a signal is basically a (noninvertible) transformation of the signal which is useful for displaying structure in the waveform. Thus, for pitch detection, if we assume x(n) is exactly periodic with period P, i.e., x(n) = x(n + P) for all n, then it is easily shown that

\phi_x(m) = \phi_x(m + P),     (2)

i.e., the autocorrelation is also periodic with the same period. Conversely, periodicity in the autocorrelation function indicates periodicity in the signal.
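As a quick numerical check of (2), the following sketch evaluates a truncated version of (1) for an exactly periodic test signal and verifies that the resulting autocorrelation repeats with the same period; the test signal and the truncation length are illustrative assumptions.

```python
# Numerical check of (1)-(2): the autocorrelation of a periodic signal is
# itself periodic with the same period P.  The sawtooth-like test signal and
# the finite summation length are assumptions for illustration.
import numpy as np

P = 80                                   # period in samples
n = np.arange(20000)                     # long finite segment approximating (1)
x = (n % P) / P - 0.5                    # exactly periodic test signal

def phi(x, m):
    """Truncated version of (1): average of x(n) x(n+m) over the segment."""
    N = len(x) - m
    return np.dot(x[:N], x[m:m + N]) / N

print(round(phi(x, 0), 4), round(phi(x, P), 4), round(phi(x, 2 * P), 4))
# The three values are essentially equal, as predicted by (2).
```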
For a nonstationary signal, such as speech, the concept of a long-time autocorrelation measurement as given by (1) is not really meaningful. Thus, it is reasonable to define a short-time autocorrelation function, which operates on short segments of the signal, as

\phi_Q(m) = \frac{1}{N} \sum_{n=0}^{N'-1} [x(n+Q) w(n)] [x(n+Q+m) w(n+m)], \quad 0 \le m \le M_0 - 1,     (3)

where w(n) is an appropriate window for analysis, N is the section length being analyzed, N' is the number of signal samples used in the computation of \phi_Q(m), M_0 is the number of autocorrelation points to be computed, and Q is the index of the starting sample of the frame. For pitch detection applications N' is generally set to the value

N' = N - m     (4)

so that only the N samples in the analysis frame (i.e., x(Q), x(Q+1), ..., x(Q+N-1)) are used in the autocorrelation computation. Values of 200 and 300 have generally been used for M_0 and N, respectively [1], corresponding to a maximum pitch period of 20 ms (200 samples at a 10 kHz sampling rate) and a 30 ms analysis frame size. As will be discussed later, a rectangular window (i.e., w(n) = 1 for 0 \le n \le N - 1, w(n) = 0 elsewhere) is used for all the computations to be described in this paper.

To reduce the effects of the formant structure on the detailed shape of the short-time autocorrelation function, two preprocessing functions were used prior to the autocorrelation computation of (3), as discussed in Section I. Fig. 1 shows a block diagram of the processing which was used.

[Fig. 1. Block diagram of the nonlinear correlation processing.]

The speech signal s(n) is first low-pass filtered by an FIR, linear phase, digital filter with a passband of 0 to 900 Hz and a stopband beginning at 1700 Hz.3 The output of the low-pass filter is then used as input to two nonlinear processors, labeled NL1 and NL2 in Fig. 1. The nonlinearities used in each path may or may not be identical. The types of nonlinearities which were investigated were various center clippers and peak clippers. Based on earlier work [3], [1] it has been shown that such nonlinearities can provide a fairly high degree of spectral flattening, and are computationally quite efficient to implement [1]. Additionally, the capability of correlating two nonlinearly processed versions of the same signal provides a useful degree of flexibility in the system. It has also been argued that such a correlation will be most appropriate in a variety of actual situations in pitch detection.4

Three types of nonlinearity have been considered. They are classified according to their input-output quantization characteristic in the following way. The first type of nonlinearity is a compressed center clipper whose output y(n) obeys the relation (with x(n) as input)5

y(n) = \mathrm{clc}[x(n)] = \begin{cases} x(n) - CL, & x(n) \ge CL \\ 0, & |x(n)| < CL \\ x(n) + CL, & x(n) \le -CL \end{cases}     (5)

where CL is the clipping threshold. The second nonlinearity is a simple center clipper with the input-output relation6

y(n) = \mathrm{clp}[x(n)] = \begin{cases} x(n), & x(n) \ge CL \\ 0, & |x(n)| < CL \\ x(n), & x(n) \le -CL. \end{cases}     (6)

3 The filter had an impulse response duration of 25 samples. The filter passband was flat to within ±0.03, and the stopband response was down at least 50 dB.

4 M. Sondhi, personal communication.

5 The function clc[x] stands for clip and compress x.

6 The function clp[x] stands for clip x.

7 The function sgn[x] stands for the sign of x.
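A direct implementation of the short-time autocorrelation of (3), with N' chosen as in (4) and a rectangular window, might look as follows; the function name is illustrative, and the default values N = 300 and M0 = 200 are taken from the discussion above.

```python
# Sketch of the short-time autocorrelation of (3) with N' = N - m as in (4),
# using a rectangular window and the values N = 300, M0 = 200 quoted above
# (30 ms frame, 20 ms maximum pitch period at a 10 kHz sampling rate).
import numpy as np

def short_time_autocorr(x, Q, N=300, M0=200):
    """Return phi_Q(m), m = 0, ..., M0-1, computed from x(Q), ..., x(Q+N-1)."""
    frame = np.asarray(x[Q:Q + N], dtype=float)   # rectangular window
    phi = np.zeros(M0)
    for m in range(M0):
        Nprime = N - m                            # N' of (4)
        phi[m] = np.dot(frame[:Nprime], frame[m:m + Nprime]) / N
    return phi
```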
[Fig. 2. Input-output characteristics of each of the three nonlinearities used in the investigation: (a) y = clc[x], (b) y = clp[x], (c) y = sgn[x].]

TABLE I
COMBINATIONS OF NONLINEARITIES FOR CORRELATION ANALYSIS

Correlation No.    x1(n)          x2(n)          Computation
1                  x(n)           x(n)           ×, +
2                  clc[x(n)]      clc[x(n)]      ×, +/φ
3                  clp[x(n)]      clp[x(n)]      ×, +/φ
4                  x(n)           sgn[x(n)]      +/φ
5                  clc[x(n)]      sgn[x(n)]      +/φ
6                  clp[x(n)]      sgn[x(n)]      +/φ
7                  x(n)           clc[x(n)]      ×, +/φ
8                  x(n)           clp[x(n)]      ×, +/φ
9                  clp[x(n)]      clc[x(n)]      ×, +/φ
10                 sgn[x(n)]      sgn[x(n)]      counter/φ

Finally, the third nonlinearity is the combination center and peak clipper with the input-output relation7

y(n) = \mathrm{sgn}[x(n)] = \begin{cases} 1, & x(n) \ge CL \\ 0, & |x(n)| < CL \\ -1, & x(n) \le -CL. \end{cases}     (7)

Fig. 2 illustrates the input-output characteristics for the three nonlinearities of (5)-(7). Allowing a direct path connection between input and output for each of the nonlinearities of Fig. 1 (i.e., y = x), it can be seen that there are ten distinct ways8 in which the signals x1(n) and x2(n) can be correlated, depending on which of the nonlinearities is used for NL1 and NL2. Table I summarizes these ten possibilities.

It should be noted that correlation number 1 in Table I corresponds to an ordinary autocorrelation, whereas correlation number 10 corresponds to the combination peak clipping, center clipping correlation discussed in [1]. Also shown in Table I are the required computations needed to implement the combined correlation for each possibility. In the most general case (correlation number 1) a multiply and an add are required for each sample in the computation. For cases 2, 3, 7, 8, and 9, whenever either x1(n) or x2(n + m) falls below the clipping level CL, no computation is required, as indicated by the φ in the computation column for these cases. For cases 4, 5, and 6 only an adder/subtractor is required because sgn[x(n + m)] can only assume the values +1 [addition of x1(n)], 0 (no computation), or -1 [subtraction of x1(n)]. Finally, case 10 only requires an up-down counter to implement, as discussed in [1].
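A minimal sketch of the three quantizers of (5)-(7), together with the correlation of two differently processed channels as in Table I, is given below; the helper names are illustrative, and the implementation is written for clarity rather than for the hardware efficiency discussed above.

```python
# Sketch of the quantizers of (5)-(7) and of correlating two (possibly
# differently) processed versions x1(n), x2(n) of the same frame, as in
# Table I.  Function names are illustrative, not taken from the paper.
import numpy as np

def clc(x, CL):
    """Center clip and compress, as in (5)."""
    y = np.where(x >= CL, x - CL, 0.0)
    return np.where(x <= -CL, x + CL, y)

def clp(x, CL):
    """Simple center clipper, as in (6)."""
    return np.where(np.abs(x) >= CL, x, 0.0)

def sgn(x, CL):
    """Combined center clipper and peak clipper (3-level), as in (7)."""
    return np.where(x >= CL, 1.0, np.where(x <= -CL, -1.0, 0.0))

def combined_correlation(x1, x2, M0=200):
    """phi(m) = (1/N) sum_n x1(n) x2(n+m), m = 0, ..., M0-1 (rectangular window)."""
    N = len(x1)
    return np.array([np.dot(x1[:N - m], x2[m:]) / N for m in range(M0)])

# Example (correlation number 9 of Table I): clp[x(n)] correlated with clc[x(n)].
# phi9 = combined_correlation(clp(frame, CL), clc(frame, CL))
```

With both channels set to sgn (case 10), every product term is +1, 0, or -1, which is why the correlation can be accumulated with an up-down counter as noted above.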
The justification for considering the nonlinearities of Fig. 2 for use in autocorrelation analysis is obtained by examining the effects of the nonlinearities on the waveforms. It can be argued that a center clipper effectively attenuates the effects of first formant structure on the waveform, without seriously affecting the pitch pulse indications. However, it has been argued that the peak clipping of the sgn quantizer [Fig. 2(c)] gives too much weight to signal amplitudes that just exceed the clipping threshold, and too little weight to signal amplitudes that exceed the clipping threshold by a wide margin. Thus, the justification for the clc (clip and compress) and the clp (clip) quantizers is that they provide a compromise between the extremes of no clipping and infinite peak clipping.

Before proceeding to some examples showing the effects of each of these nonlinearities, it is worth noting that the method used to set the clipping threshold (CL) for each of these nonlinearities was exactly the method used in [1], i.e., the clipping level is set to a fixed percentage (68 percent) of the smaller of the maximum absolute signal levels over the first and last one-thirds of the analysis frame. This method has proven quite successful in all tests to date [2].
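A sketch of this threshold rule (68 percent of the smaller of the two peak levels, computed over the first and last thirds of the frame) is given below; the function name is illustrative.

```python
# Sketch of the clipping-threshold rule described above: CL is set to a fixed
# percentage (68 percent) of the smaller of the maximum absolute signal levels
# over the first and last thirds of the analysis frame.
import numpy as np

def clipping_threshold(frame, fraction=0.68):
    frame = np.asarray(frame, dtype=float)
    third = len(frame) // 3
    peak_first = np.abs(frame[:third]).max()
    peak_last = np.abs(frame[-third:]).max()
    return fraction * min(peak_first, peak_last)
```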
Fig. 3 illustrates the effects of each of the quantizer characteristics of Fig. 1 on a typical frame of voiced speech. The left-hand side of Fig. 3 shows the sequence of signals x(n), clc[x(n)], clp[x(n)], and sgn[x(n)]. Superimposed on the plot of x(n) is the clipping level for this frame of speech. The right-hand side of Fig. 3 shows the sequence of autocorrelations corresponding to each of the sequences at the left (i.e., correlations numbers 1, 2, 3, and 10 in Table I).9 A rectangular window was used in all cases for computing the correlations, as no other type of window is reasonable. The effects of the nonlinear clipping are readily evident in this figure. Although there is a sharp peak in the autocorrelation due to the pitch at m = 80 for all four correlations, the shape of the correlation function for the unprocessed speech [Fig. 3(a)] is significantly different from the shape of the correlation function for all of the nonlinearly processed signals [Fig. 3(b)-(d)], especially in the low time part of the correlation (i.e., m going from 20 to 80). Fig. 3(d) also illustrates the problems associated with using the sgn quantizer in that all speech samples which exceed the clipping threshold are weighted equally.

8 Theoretically there are 16 ways in which x1(n) and x2(n) can be correlated. For all practical purposes, however, six pairs of these results are equivalent. Thus, only ten ways of correlating x1(n) and x2(n) are considered here.

9 Each of the signal amplitudes in Figs. 3-7 and 9 is scaled so that the maximum value is set to 1.0 for display purposes. Thus, it is difficult to compare these amplitude sequences against each other.

[Fig. 3. Each of the processed signals and the resulting correlation function for a section of voiced speech (frame 23, utterance "I saw the cat," speaker LRR).]

Thus in the third period there are five pulses of varying width whereas in the first periods there are only three pulses. Fig. 3(b) shows that such problems are inherently eliminated by the clc quantizer, whose output samples are proportional in amplitude to the amount by which they exceed the clipping threshold.

Spectral Flattening from the Quantizers

It has already been argued that the effect of the nonlinear processing preceding the correlation computation is to approximately spectrally flatten the signal spectrum, thereby enhancing the periodicity of the signal. To investigate this, the power spectrum of each of the correlation functions of Table I was computed directly from the correlation function by the Fourier transform relation

S(f) = \sum_{m=-(M_0-1)}^{M_0-1} \phi(m) e^{-j 2 \pi f m}.     (8)

A 512-point FFT was used to compute S(f_k), k = 0, 1, ..., 511, where

f_k = \frac{k}{512},     (9)

i.e., at 512 points around the unit circle. Since φ(m) is theoretically infinite, a (2M_0 + 1)-point Hamming window was used to taper φ(m) smoothly to 0. (Note we are assuming φ(m) is symmetric, i.e., φ(m) = φ(-m).)
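The power spectra shown in the following figures can be reproduced in outline from a measured correlation function as sketched below; evaluating the sum of (8) directly at the frequencies of (9), rather than with the 512-point FFT used here, and the conversion to dB are illustrative choices.

```python
# Sketch of (8)-(9): taper the (symmetric) correlation with a Hamming window
# and evaluate S(f) at the 512 frequencies f_k = k/512.  The sum in (8) is
# evaluated directly for clarity rather than with an FFT.
import numpy as np

def correlation_spectrum_db(phi, nfft=512):
    """phi: correlation values phi(0), ..., phi(M0-1); returns S(f_k) in dB."""
    phi = np.asarray(phi, dtype=float)
    M0 = len(phi)
    full = np.concatenate((phi[:0:-1], phi))       # phi(m), m = -(M0-1), ..., M0-1
    window = np.hamming(2 * M0 + 1)[1:-1]          # (2 M0 + 1)-point taper, ends dropped
    m = np.arange(-(M0 - 1), M0)
    f = np.arange(nfft) / nfft                     # f_k = k/512 as in (9)
    S = (full * window) @ np.exp(-2j * np.pi * np.outer(m, f))
    return 10.0 * np.log10(np.abs(S) + 1e-12)      # plotted in dB in Figs. 4-7
```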

[Fig. 4. The signal x1(n); the resulting correlation and power spectrum for each of the ten correlators of Table I for a section of voiced speech.]

Figs. 4-7 show plots of the results of processing four different sections of voiced speech. The left-hand column shows the signal x1(n), the middle column shows the correlation ψ(m) of x1(n) with x2(n) (where x2(n) is as specified in Table I), and the right-hand column shows the power spectrum S(f) obtained as described above. The ten rows in each figure correspond to the ten combinations of signals to be correlated as shown in Table I. An examination of Fig. 4 shows that for the unprocessed signal (i.e., the top row) the first several harmonics are seen in the power spectrum. Beyond 1 kHz the spectrum decays rapidly due to the low-pass filter (the lack of a sharp falloff in the spectrum is due to a combination of the signal and autocorrelation windows). The amplitudes of the harmonics vary with the first formant envelope. It can be seen that the spectra for the autocorrelations of each of the nonlinear quantizers (i.e., rows 2, 3, and 10) are much flatter than the original signal spectrum. Additionally, the spectra of the nonlinearly processed signals are much broader than the original spectrum. It is interesting to note that the spectra from correlations involving x(n), i.e., correlations numbers 1, 4, 7, and 8, are the least flattened and are generally quite irregular (i.e., the harmonics are not very easy to find).

Fig. 5 shows similar results from a different section of voiced speech. As seen from the spectrum of the unprocessed signal (on the top line) the bandwidth of the first formant is fairly small, causing the correlation function to show a great deal of formant periodicity for small values of m. The effects of the nonlinearities on the signal spectra are quite impressive even for such a difficult case as this one.

Fig. 6 shows results from a section of speech from a female speaker (high pitch).

[Fig. 5. The signal x1(n); the resulting correlation and power spectrum for each of the ten correlators of Table I for another section of voiced speech (frame 88, utterance "I saw the cat," speaker LRR).]

[Fig. 6. The signal x1(n); the resulting correlation and power spectrum for each of the ten correlators of Table I for a section of voiced speech from a female speaker (frame 50, F105W).]

Again the spectrum from the unprocessed signal shows only a few harmonics whose amplitudes vary with the formant amplitude. The nonlinearly processed samples show various degrees of spectral flattening, as anticipated by the previous discussion.

Finally, Fig. 7 shows the results obtained with a voiced frame from a low pitched (long period) male speaker. In this example the first formant has a very narrow bandwidth, as seen in the spectrum at the top of Fig. 7. Pitch detection directly on the autocorrelation of the signal yields incorrect results in this case due to the first formant peak(s) in the autocorrelation function. However, as shown in Fig. 7, almost any of the nonlinearities flatten the spectrum and eliminate the troublesome effects of the sharp first formant in the resulting correlation function.

[Fig. 7. The signal x1(n); the resulting correlation and power spectrum for each of the ten correlators of Table I for a section of voiced speech from a low pitched male speaker (frame 146, LM05T).]

In summary, we have presented examples which tend to show that, as anticipated, the effect of nonlinearly quantizing the signal amplitudes using the quantizers of Fig. 1 is to effectively flatten and broaden the signal power spectrum, thereby reducing the effects of the first formant on the correlation function, and simplifying the pitch detection problem. In the next section we present results of a comparative test of the performance of the ten correlation pitch detectors discussed in this section on a series of speech utterances.

III. EVALUATION OF THE TEN NONLINEAR CORRELATIONS

In order to evaluate and compare the performance of the ten nonlinear correlations discussed in the preceding section, a small set of the utterances from the data base in [2] was used as a test set. For each of the utterances a reference pitch contour was available from which an error analysis was made [6]. Since the problem of making a reliable voiced-unvoiced decision was not a concern here, the reference voiced-unvoiced contour was used directly, i.e., each correlator was required to estimate the pitch period, assuming a priori that the interval was properly classified as voiced. (No pitch detection was done during unvoiced intervals.) However, if the peak correlation value (normalized) fell below a threshold (0.25), the interval was classified as unvoiced, since reliable selection of the pitch was not possible with a correlation peak below this threshold.
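For reference, the decision rule just described (take the lag of the largest normalized correlation peak as the pitch period, and declare the interval unvoiced if that peak falls below 0.25) can be sketched as follows; the normalization by φ(0) and the 2-20 ms lag search range are assumptions.

```python
# Sketch of the decision rule described above: the pitch period is the lag of
# the largest normalized correlation peak, and the interval is declared
# unvoiced if that peak falls below the 0.25 threshold.  Normalizing by phi(0)
# and the 2-20 ms search range (at 10 kHz) are assumed illustrative choices.
import numpy as np

def pick_period(phi, min_lag=20, max_lag=199, threshold=0.25):
    """phi: short-time correlation phi(0), ..., phi(M0-1) of a voiced interval."""
    phi = np.asarray(phi, dtype=float)
    norm = phi / phi[0] if phi[0] > 0 else phi      # normalized correlation
    lag = min_lag + int(np.argmax(norm[min_lag:max_lag + 1]))
    if norm[lag] < threshold:
        return None                                 # classified as unvoiced
    return lag                                      # pitch period in samples
```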

[TABLE II. Standard deviations of the pitch period for the ten correlators on each test utterance, unsmoothed and smoothed.]

[TABLE III. Error statistics for the ten correlators (unsmoothed): number of gross voiced errors, number of voiced-to-unvoiced errors, and total number of voiced intervals for each test utterance.]

Thirteen utterances from [2] were used in this comparison. Tables II-V present the results of an error analysis which measured the average and standard deviation of the pitch period, the number of gross pitch period errors, and the number of voiced-to-unvoiced errors [2].10 For all utterances the average pitch period error was well below 0.5 samples (10 kHz sampling rate), and so the results of this measurement are not presented. Table II presents the standard deviations of the pitch period for the ten correlations. The results are also presented for the errors when the pitch contours were nonlinearly smoothed using a median smoothing algorithm [7]. From Table II it can be seen that the standard deviations for all correlators were approximately the same for the same utterance. It is also seen that as the average pitch period gets longer (reading from left to right) the standard deviation increases proportionally.

Tables III and IV show the error statistics for gross errors and voiced-to-unvoiced errors, both for the unsmoothed pitch contours (Table III) and for the smoothed pitch contours (Table IV). These tables show that for the high pitched speakers (utterances prefaced by C1, F1, F2), although some differences11 were present in the error scores for the unsmoothed data, the nonlinear smoother was able to correct most of the errors.

10 A voiced-to-unvoiced error occurred when a voiced region was improperly classified as an unvoiced region because no peak above the threshold was present in the correlation function.

11 These differences for the high pitched (short period) speakers were due to pitch period doubling, i.e., the correlation peak at twice the period was somewhat higher than the correlation peak at the true period. This is a common effect when the pitch period is on the order of 3 ms (300 Hz pitch), as was the case for these speakers.
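A rough sketch of how these error categories can be tallied against a reference contour is given below; the 10-sample gross-error threshold is an assumed value for illustration, and the precise definitions are those of [2].

```python
# Sketch of the error statistics discussed above, comparing an estimated pitch
# period contour against the reference contour on voiced intervals.  The
# 10-sample gross-error threshold is an assumed value; the exact definitions
# are those of [2].  A period of 0 marks an interval declared unvoiced.
import numpy as np

def error_statistics(est, ref, gross_threshold=10):
    est, ref = np.asarray(est, dtype=float), np.asarray(ref, dtype=float)
    voiced = ref > 0                                     # reference voiced intervals
    v_to_uv = int(np.sum(voiced & (est == 0)))           # voiced-to-unvoiced errors
    both = voiced & (est > 0)
    diff = est[both] - ref[both]
    gross = int(np.sum(np.abs(diff) > gross_threshold))  # gross pitch period errors
    fine = diff[np.abs(diff) <= gross_threshold]         # fine pitch period errors
    return {"gross": gross, "v_to_uv": v_to_uv,
            "mean": float(fine.mean()) if fine.size else 0.0,
            "std": float(fine.std()) if fine.size else 0.0}
```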

[TABLE IV. Error statistics for the ten correlators (smoothed): number of gross voiced errors and number of voiced-to-unvoiced errors for each test utterance.]

Thus the overall performance on the first four utterances was approximately the same for all correlators. For the low pitched speakers (utterances prefaced by LM, M2) there were more significant differences between the correlators. For the category of gross errors, correlators 1 and 8 generally had the largest numbers of errors across the last six utterances in the test. However, for the category of voiced-to-unvoiced errors, correlators 2 and 9 had consistently the largest number of errors.

[TABLE V. Total error statistics (gross errors plus voiced-to-unvoiced errors) for the ten correlators on each test utterance.]

Although the smoothing significantly reduced the number of gross errors for many of the correlators, in turn it increased the number of voiced-to-unvoiced errors. Since both types of error constitute a pitch error, in this case the most significant error statistic is probably the sum of the gross errors and voiced-to-unvoiced errors; Table V shows these results. Based on this combined error statistic the following conclusions can be drawn about the performance of the ten correlators.

1) For high pitched speakers the differences in performance scores between the different correlators are small and probably insignificant. It is for this class of speakers that any type of correlation measurement of pitch period tends to work very well.

2) For low pitched speakers fairly significant differences in the performance scores existed. Correlator number 1 (the normal linear autocorrelation) tended to give the worst performance for all utterances in this class. Correlators numbers 4, 7, and 8 (the ones involving an unprocessed x(n) in the computation) were also somewhat poorer in their overall performance based on the sum of gross errors and voiced-to-unvoiced errors.

3) Differences in the performance among the remaining six correlators were not consistent. Thus, any one of these correlators would be appropriate for an autocorrelation pitch detector.

It is interesting to note that (as seen in Tables III-V) the results for utterance M208T were significantly worse than for utterance M208M. These utterances were simultaneously recorded, the difference being that M208T was recorded off a telephone line, whereas M208M was recorded from a close talking microphone. This result is due to the band-limiting effects of the telephone line (300 Hz cutoff frequency), which eliminate the first few harmonics of the pitch, thereby making accurate pitch detection more difficult.

To illustrate