`IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-25, NO. 1, FEBRUARY 1977
`
`On the Use of Autocorrelation Analysis for Pitch
`Detection
`
`LAWRENCE R. RABINER, FELLOW, IEEE
`
Abstract-One of the most time-honored methods of detecting pitch
is to use some type of autocorrelation analysis on speech which has
been appropriately preprocessed. The goal of the speech preprocessing
in most systems is to whiten, or spectrally flatten, the signal so as to
eliminate the effects of the vocal tract spectrum on the detailed shape
of the resulting autocorrelation function. The purpose of this paper is
to present some results on several types of (nonlinear) preprocessing
which can be used to effectively spectrally flatten the speech signal.
The types of nonlinearities which are considered are classified by a
nonlinear input-output quantizer characteristic. By appropriate adjustment
of the quantizer threshold levels, both the ordinary (linear)
autocorrelation analysis and the center clipping-peak clipping autocorrelation
of Dubnowski et al. [1] can be obtained. Results are presented to
demonstrate the degree of spectrum flattening obtained using these
methods. Each of the proposed methods was tested on several of the
utterances used in a recent pitch detector comparison study by Rabiner
et al. [2]. Results of this comparison are included in this paper. One
final topic which is discussed in this paper is an algorithm for adaptively
choosing a frame size for an autocorrelation pitch analysis.
`
`I. INTRODUCTION
ALTHOUGH a large number of different methods have
been proposed for detecting pitch, the autocorrelation
pitch detector is still one of the most robust and reliable of
pitch detectors [2]. There are several reasons why autocorrelation
methods for pitch detection have generally met with
good success. The autocorrelation computation is made
directly on the waveform and is a fairly straightforward (albeit
time consuming) computation. Although a high processing
rate is required, the autocorrelation computation is simply
amenable to digital hardware implementation, generally requiring
only a single multiplier and an accumulator as the
computational elements. Finally, the autocorrelation computation
is largely phase insensitive.¹ Thus, it is a good method
to use to detect pitch of speech which has been transmitted
over a telephone line, or has suffered some degree of phase
distortion via transmission.
Although an autocorrelation pitch detector has some advantages
for pitch detection, there are several problems associated
with the use of this method. Although the autocorrelation
function of a section of voiced speech generally displays a
fairly prominent peak at the pitch period, autocorrelation
peaks due to the detailed formant structure of the signal are
also often present. Thus, one problem is to decide which of
several autocorrelation peaks corresponds to the pitch period.
Another problem with the autocorrelation computation is the
required use of a window for computing the short-time
autocorrelation function. The use of a window for analysis leads
to at least two difficulties. First, there is the problem of
choosing an appropriate window. Second, there is the problem
that (for a stationary analysis)² no matter which window is
selected, the effect of the window is to taper the autocorrelation
function smoothly to 0 as the autocorrelation index
increases. This effect tends to compound the difficulties
mentioned above in which formant peaks in the autocorrelation
function (which occur at lower indices than the pitch
period peak) tend to be of greater magnitude than the peak
due to the pitch period.
`A final difficulty with the autocorrelation computation is
`the problem of choosing an appropriate analysis frame
`(window) size. The ideal analysis frame should contain from 2
`to 3 complete pitch periods. Thus, for high pitch speakers the
`analysis frame should be short (5-20 ms), whereas for low
`pitched speakers it should be long (20-50 ms).
A wide variety of solutions have been proposed to the above
problems. To partially eliminate the effects of the higher
formant structure on the autocorrelation function, most
methods use a sharp cutoff low-pass filter with cutoff around
900 Hz. This will, in general, preserve a sufficient number of
pitch harmonics for accurate pitch detection, but will eliminate
the second and higher formants. In addition to linear
filtering to remove the formant structure, a wide variety of
methods have been proposed for directly or indirectly
spectrally flattening the speech signal to remove the effects of
the first formant [3]-[5], [1]. Included among these techniques
are center clipping and spectral equalization by filter
bank methods [3], inverse filtering using linear prediction
methods [4], spectral flattening by linear prediction and a
Newton transformation [5], and spectral flattening by a combination
of center and peak clipping methods [1].
Each of these methods has met with some degree of success;
however, problems still remain. It is the purpose of this paper
to investigate the properties of a class of nonlinearities applied
to the speech signal prior to autocorrelation analysis with the
purpose of spectrally flattening the signal. Also, a solution to
the problem of choosing an analysis frame size which adapts to
the estimated average pitch of the speaker will be presented.
The organization of this paper is as follows. In Section II we
review the theory of short-time autocorrelation analysis and
present the various types of nonlinearities to be investigated
for spectrally flattening the speech. Examples of signal spectra
`
Manuscript received April 4, 1976; revised August 16, 1976.
`The author is with the Bell Laboratories, Murray Hill, NJ 07974.
`1 In the limit of exactly periodic signals, or for an infinite correlation
`function it is exactly phase insensitive.
`
`2 A stationary analysis is one for which the same set of input samples
`is used in computing all the points of the autocorrelation function. A
`nonstationary analysis is impractical for pitch detection because of the
`large number of autocorrelation points involved in the computation.
`
`ZTE EXHIBIT 1025
`
`
`

`
`RABINER: AUTOCORRELATION ANALYSIS FOR PITCH DETECTION
`
`
obtained with the nonlinearities being used will be given in this
section. In Section III the results of a limited but formal
evaluation of each of the nonlinear autocorrelation analyses
are given. Several of the test utterances used in [2] are used in
this test for comparison purposes. In Section IV we discuss a
simple algorithm for adapting the frame size of the analysis
based on the estimated average pitch period for the speaker,
and present results on how well it worked on several test
examples.
`
`II. SHORT-TIME AUTOCORRELATION ANALYSIS
`
Given a discrete time signal x(n), defined for all n, the
autocorrelation function is generally defined as

$$\phi_x(m) = \lim_{N \to \infty} \frac{1}{2N + 1} \sum_{n=-N}^{N} x(n)\, x(n + m). \tag{1}$$
`
The autocorrelation function of a signal is basically a (noninvertible)
transformation of the signal which is useful for
displaying structure in the waveform. Thus, for pitch detection,
if we assume x(n) is exactly periodic with period P, i.e.,
x(n) = x(n + P) for all n, then it is easily shown that

$$\phi_x(m) = \phi_x(m + P), \tag{2}$$

i.e., the autocorrelation is also periodic with the same period.
Conversely, periodicity in the autocorrelation function indicates
periodicity in the signal.
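The step described as "easily shown" can be written out in one line. The following is a sketch that uses only the definition (1) and the stated assumption x(n) = x(n + P); the inner substitution x(n + m + P) = x(n + m) is the whole argument.

```latex
\begin{aligned}
\phi_x(m + P)
  &= \lim_{N \to \infty} \frac{1}{2N + 1} \sum_{n=-N}^{N} x(n)\, x(n + m + P) \\
  &= \lim_{N \to \infty} \frac{1}{2N + 1} \sum_{n=-N}^{N} x(n)\, x(n + m)
   = \phi_x(m).
\end{aligned}
```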
For a nonstationary signal, such as speech, the concept of a
long-time autocorrelation measurement as given by (1) is not
really meaningful. Thus, it is reasonable to define a short-time
autocorrelation function, which operates on short segments
of the signal, as

$$\phi_\ell(m) = \frac{1}{N} \sum_{n=0}^{N'-1} [x(n + \ell)\, w(n)]\, [x(n + \ell + m)\, w(n + m)], \qquad 0 \le m \le M_0 - 1, \tag{3}$$

where w(n) is an appropriate window for analysis, N is the
section length being analyzed, N' is the number of signal
samples used in the computation of φ_ℓ(m), M_0 is the number
of autocorrelation points to be computed, and ℓ is the index
of the starting sample of the frame. For pitch detection
applications N' is generally set to the value

$$N' = N - m \tag{4}$$

so that only the N samples in the analysis frame (i.e., x(ℓ),
x(ℓ + 1), ..., x(ℓ + N - 1)) are used in the autocorrelation
computation. Values of 200 and 300 have generally been used
for M_0 and N, respectively [1], corresponding to a maximum
pitch period of 20 ms (200 samples at a 10 kHz sampling rate)
and a 30 ms analysis frame size. As will be discussed later, a
rectangular window (i.e., w(n) = 1 for 0 ≤ n ≤ N - 1, w(n) = 0
elsewhere) is used for all the computations to be described in
this paper.
To reduce the effects of the formant structure on the detailed
shape of the short-time autocorrelation function, two
preprocessing functions were used prior to the autocorrelation
computation of (3), as discussed in Section I. Fig. 1 shows a
block diagram of the processing which was used. The speech
signal s(n) is first low-pass filtered by an FIR, linear phase,
digital filter with a passband of 0 to 900 Hz, and a stopband
beginning at 1700 Hz.³ The output of the low-pass filter is
then used as input to two nonlinear processors, labeled NL1
and NL2 in Fig. 1. The nonlinearities used in each path may
or may not be identical. The types of nonlinearities which
were investigated were various center clippers and peak clippers.
Based on earlier works [3], [1] it has been shown that
such nonlinearities can provide a fairly high degree of spectral
flattening, and are computationally quite efficient to implement
[1]. Additionally, the capability of correlating two nonlinearly
processed versions of the same signal provides a useful
degree of flexibility in the system. It has also been argued
that such a correlation will be most appropriate in a variety
of actual situations in pitch detection.⁴

[Fig. 1 block diagram: input s(n), output x_1(n), enclosed in a box labeled SPEECH PREPROCESSOR.]
Fig. 1. Block diagram of the nonlinear correlation processing.

Three types of nonlinearity have been considered. They are
classified according to their input-output quantization
characteristic in the following way. The first type of nonlinearity is
a compressed center clipper whose output y(n) obeys the
relation (with x(n) as input)⁵

$$y(n) = \mathrm{clc}[x(n)] = \begin{cases} x(n) - \mathrm{CL}, & x(n) \ge \mathrm{CL} \\ 0, & |x(n)| < \mathrm{CL} \\ x(n) + \mathrm{CL}, & x(n) \le -\mathrm{CL} \end{cases} \tag{5}$$

where CL is the clipping threshold. The second nonlinearity
is a simple center clipper with the input-output relation⁶

$$y(n) = \mathrm{clp}[x(n)] = \begin{cases} x(n), & x(n) \ge \mathrm{CL} \\ 0, & |x(n)| < \mathrm{CL} \\ x(n), & x(n) \le -\mathrm{CL}. \end{cases} \tag{6}$$

Finally, the third nonlinearity is the combination center and
peak clipper with the input-output relation⁷
3 The filter had an impulse response duration of 25 samples. The filter
passband was flat to within ±0.03, and the stopband response was down
at least 50 dB.
4 M. Sondhi, personal communication.
5 The function clc[x] stands for clip and compress x.
6 The function clp[x] stands for clip x.
7 The function sgn[x] stands for the sign of x.
`
`
TABLE I
COMBINATIONS OF NONLINEARITIES FOR CORRELATION ANALYSIS

Correlation No.   x_1(n)       x_2(n)       Computation
 1                x(n)         x(n)         ×, +
 2                clc[x(n)]    clc[x(n)]    ×, + / ∅
 3                clp[x(n)]    clp[x(n)]    ×, + / ∅
 4                x(n)         sgn[x(n)]    ± / ∅
 5                clc[x(n)]    sgn[x(n)]    ± / ∅
 6                clp[x(n)]    sgn[x(n)]    ± / ∅
 7                x(n)         clc[x(n)]    ×, + / ∅
 8                x(n)         clp[x(n)]    ×, + / ∅
 9                clp[x(n)]    clc[x(n)]    ×, + / ∅
10                sgn[x(n)]    sgn[x(n)]    counter / ∅

[Fig. 2 panels: (a) y = clc[x], (b) y = clp[x], (c) y = sgn[x]; each plots the output y against the input x with thresholds at -CL and +CL, and the output in (c) takes the levels +1 and -1.]
Fig. 2. Input-output characteristics of each of the three nonlinearities
used in the investigation.
`
$$y(n) = \mathrm{sgn}[x(n)] = \begin{cases} 1, & x(n) \ge \mathrm{CL} \\ 0, & |x(n)| < \mathrm{CL} \\ -1, & x(n) \le -\mathrm{CL}. \end{cases} \tag{7}$$
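The three quantizer characteristics (5)-(7) can be sketched in a few lines of Python. This is an illustrative sketch only: the use of NumPy, the function names `clc`, `clp`, and `sgn`, and the argument name `cl` are choices made here, and `clp` assumes that a center clipper passes samples beyond ±CL unchanged.

```python
import numpy as np

def clc(x, cl):
    """Compressed center clipper, Eq. (5): samples beyond +/-cl are moved toward zero by cl."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= cl, x - cl, np.where(x <= -cl, x + cl, 0.0))

def clp(x, cl):
    """Simple center clipper, Eq. (6): samples with |x| >= cl pass unchanged (assumed), others become 0."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) >= cl, x, 0.0)

def sgn(x, cl):
    """Combined center and peak clipper, Eq. (7): three-level output -1, 0, +1."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= cl, 1.0, np.where(x <= -cl, -1.0, 0.0))

# Small illustrative input (values chosen here, not taken from the paper).
x = np.array([-1.0, -0.5, -0.2, 0.0, 0.3, 0.5, 0.9])
cl = 0.5
print(clc(x, cl))
print(clp(x, cl))
print(sgn(x, cl))
```

All three functions zero the samples inside the clipping band; they differ only in how much amplitude information they keep for samples outside it.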
`
Fig. 2 illustrates the input-output characteristics for the three
nonlinearities of (5)-(7). Allowing a direct path connection
between input and output for each of the nonlinearities of Fig.
1 (i.e., y = x), it can be seen that there are ten distinct ways⁸ in
which the signals x_1(n) and x_2(n) can be correlated, depending
on which of the nonlinearities is used for NL1 and NL2. Table
I summarizes these ten possibilities.
It should be noted that correlation number 1 in Table I
corresponds to an ordinary autocorrelation, whereas correlation
number 10 corresponds to the combination peak clipping,
center clipping correlation discussed in [1]. Also shown in
Table I are the required computations needed to implement
the combined correlation for each possibility. In the most
general case (correlation number 1) a multiply and an
add are required for each sample in the computation. For
cases 2, 3, 7, 8, and 9, whenever either x_1(n) or x_2(n + m)
falls below the clipping level, CL, no computation is required,
as indicated by the ∅ in the computation column for these
cases. For cases 4, 5, and 6 only an adder/subtractor is required
because sgn[x(n + m)] can only assume the values +1
[addition of x_1(n)], 0 (no computation), or -1 [subtraction
of x_1(n)]. Finally, case 10 only requires an up-down counter
to implement, as discussed in [1].
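The correlation of two preprocessed versions of a frame, following (3) and (4) with a rectangular window, can be sketched as below. The function names, the synthetic pulse-train frame, and the lower search bound of 20 lags are assumptions made for illustration; this is a direct software rendering, not the multiplier, adder, or counter hardware described above.

```python
import numpy as np

def short_time_correlation(x1, x2, n_lags):
    """phi(m) = (1/N) * sum_{n=0}^{N-m-1} x1(n) * x2(n+m) for m = 0..n_lags-1.

    x1 and x2 are two processed versions of the same N-sample frame.
    With a rectangular window only samples inside the frame are used,
    which is the N' = N - m choice of Eq. (4).
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n = len(x1)
    phi = np.zeros(n_lags)
    for m in range(n_lags):
        phi[m] = np.dot(x1[: n - m], x2[m:n]) / n
    return phi

def three_level(x, cl):
    """Three-level quantizer of Eq. (7)."""
    return np.where(x >= cl, 1.0, np.where(x <= -cl, -1.0, 0.0))

# Synthetic frame (an assumption for the demo): N = 300 samples,
# unit pulses every 80 samples, i.e. a "pitch period" of 80.
frame = np.zeros(300)
frame[::80] = 1.0

phi_1 = short_time_correlation(frame, frame, 200)   # correlation 1 of Table I
y = three_level(frame, 0.5)
phi_10 = short_time_correlation(y, y, 200)           # correlation 10 of Table I

# Largest peak away from lag 0 (the lower bound of 20 is illustrative).
lag = 20 + int(np.argmax(phi_10[20:]))
print(lag)
```

When `x2` is three-level, each product in the sum is +x1(n), -x1(n), or 0, which is why an adder/subtractor suffices for cases 4-6; when both inputs are three-level, each product is +1, -1, or 0, which matches the up-down counter of case 10.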
`
The justification for considering the nonlinearities of Fig. 2
for use in autocorrelation analysis is obtained by examining
the effects of the nonlinearities on the waveforms. It can be
argued that a center clipper effectively attenuates the effects
of first formant structure on the waveform, without seriously
affecting the pitch pulse indications. However, it has been
argued that the peak clipping of the sgn quantizer [Fig. 2(c)]
gives too much weight to signal amplitudes that just exceed
the clipping threshold, and too little weight to signal amplitudes
that exceed the clipping threshold by a wide margin.
Thus, the justification for the clc (clip and compress) and the
clp (clip) quantizers is that they provide a compromise between
the extremes of no clipping and infinite peak clipping.
Before proceeding to some examples showing the effects of
each of these nonlinearities, it is worth noting that the method
used to set the clipping threshold (CL) for each of these
nonlinearities was exactly the method used in [1], i.e., set the
clipping level as a fixed percentage (68 percent) of the smaller of
the maximum absolute signal levels over the first and last
one-thirds of the analysis frame. This method has proven quite
successful in all tests to date [2].
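The threshold rule just described is short enough to state as code. This is a minimal sketch: the function name `clipping_level` is hypothetical, and integer division is assumed for the "one-third" boundaries when the frame length is not a multiple of three.

```python
import numpy as np

def clipping_level(frame, fraction=0.68):
    """CL = fraction * min(peak over first third, peak over last third) of the frame.

    Assumes the frame has at least 3 samples; thirds are taken with
    integer division, which is a choice made for this sketch.
    """
    frame = np.asarray(frame, dtype=float)
    third = len(frame) // 3
    peak_first = np.max(np.abs(frame[:third]))
    peak_last = np.max(np.abs(frame[-third:]))
    return fraction * min(peak_first, peak_last)

# Illustrative 300-sample frame: the middle third is the loudest,
# but it does not influence the threshold.
frame = np.concatenate([np.full(100, 0.5), np.full(100, 2.0), np.full(100, -1.0)])
cl = clipping_level(frame)
print(cl)
```

Taking the smaller of the two outer-third peaks keeps the threshold below the pitch pulses at both ends of the frame even when the signal level is rising or falling across it.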
Fig. 3 illustrates the effects of each of the quantizer
characteristics of Fig. 1 on a typical frame of voiced speech. The
left-hand side of Fig. 3 shows the sequence of signals x(n),
clc[x(n)], clp[x(n)], and sgn[x(n)]. Superimposed on the
plot of x(n) is the clipping level for this frame of speech. The
right-hand side of Fig. 3 shows the sequence of autocorrelations
corresponding to each of the sequences at the left (i.e.,
correlations numbers 1, 2, 3, and 10 in Table I).⁹ A rectangular
window was used in all cases for computing the correlations as
no other type of window is reasonable. The effects of the
nonlinear clipping are readily evident in this figure. Although
there is a sharp peak in the autocorrelation due to the pitch
at m = 80 for all four correlations, the shape of the correlation
function for the unprocessed speech [Fig. 3(a)] is significantly
different from the shape of the correlation function for all of
the nonlinearly processed signals [Fig. 3(b)-(d)], especially in
the low time part of the correlation (i.e., m going from 20 to
80). Fig. 3(d) also illustrates the problems associated with
using the sgn quantizer in that all speech samples which exceed
the clipping threshold are weighted equally. Thus in the third
8 Theoretically there are 16 ways in which x_1(n) and x_2(n) can be
correlated. For all practical purposes, however, six pairs of these results
are equivalent. Thus, only ten ways of correlating x_1(n) and x_2(n) are
considered here.
`
`9 Each of the signal amplitudes in Figs. 3-7, and 9 is scaled so that the
`maximum value is set to 1.0 for display purposes. Thus, it is difficult
`to compare these amplitude sequences against each other.
`
[Fig. 3 plots: panel title FRAME 23, LRR - I SAW THE CAT; signals on the left, correlations φ(m) on the right.]
Fig. 3. Each of the processed signals and the resulting correlation
function for a section of voiced speech.
`
period there are five pulses of varying width whereas in the
first periods there are only three pulses. Fig. 3(b) shows that
such problems are inherently eliminated by the clc quantizer
whose output samples are proportional in amplitude to the
amount by which they exceed the clipping threshold.
`
`Spectral Flattening from the Quantizers
`
It has already been argued that the effect of the nonlinear
processing preceding the correlation computation is to
approximately flatten the signal spectrum, thereby enhancing
the periodicity of the signal. To investigate this, the
power spectrum of each of the correlation functions of Table I
was computed directly from the correlation function by the
Fourier transform relation

$$S(f) = \sum_{m=-(M_0-1)}^{M_0-1} \phi(m)\, e^{-j 2 \pi f m}. \tag{8}$$

A 512-point FFT was used to compute S(f_k), k = 0, 1, ..., 511,
where

$$f_k = \frac{k}{512}, \tag{9}$$

i.e., at 512 points around the unit circle. Since φ(m) is
theoretically infinite, a (2M_0 + 1) point Hamming window was
used to taper φ(m) smoothly to 0. (Note we are assuming
φ(m) is symmetric, i.e., φ(m) = φ(-m).)
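The spectrum computation of (8) can be sketched with NumPy's FFT. Several details here are assumptions rather than the author's procedure: the function name, the normalized frequency grid f_k = k/512, the use of the central 2M_0 - 1 points of a (2M_0 + 1)-point Hamming window (only lags up to M_0 - 1 are available), and taking the magnitude before converting to dB.

```python
import numpy as np

def correlation_power_spectrum(phi, n_fft=512):
    """Return S(f_k) in dB, k = 0..n_fft-1, from correlation lags m = 0..M0-1.

    Assumes phi(m) = phi(-m), M0 >= 2, and 2*M0 - 1 <= n_fft.
    """
    phi = np.asarray(phi, dtype=float)
    m0 = len(phi)
    # Two-sided sequence for m = -(M0-1) .. (M0-1).
    two_sided = np.concatenate([phi[:0:-1], phi])
    # Central part of a (2*M0 + 1)-point Hamming window, same length as two_sided.
    window = np.hamming(2 * m0 + 1)[1:-1]
    tapered = two_sided * window
    # Circular layout: lag 0 at index 0, negative lags at the end of the buffer.
    buf = np.zeros(n_fft)
    buf[:m0] = tapered[m0 - 1:]
    buf[-(m0 - 1):] = tapered[: m0 - 1]
    spectrum = np.fft.fft(buf).real          # real because the sequence is symmetric
    return 10.0 * np.log10(np.maximum(np.abs(spectrum), 1e-12))

# Illustrative correlation: a cosine with period 80 lags, M0 = 200.
m = np.arange(200)
phi = np.cos(2 * np.pi * m / 80.0)
spectrum_db = correlation_power_spectrum(phi)
k_peak = int(np.argmax(spectrum_db[:256]))
print(k_peak, k_peak * 10000.0 / 512)   # bin index and frequency in Hz at 10 kHz sampling
```

Placing lag 0 at index 0 of the FFT buffer keeps the transform real for a symmetric correlation, so no phase term needs to be removed afterward.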
`
[Fig. 4 plots: columns SIGNAL x_1(n), CORRELATION φ(m), and POWER SPECTRUM S(f) (dB).]
Fig. 4. The signal x_1(n); the resulting correlation and power spectrum
for each of the ten correlators of Table I for a section of voiced
speech.
`
Figs. 4-7 show plots of the results of processing four different
sections of voiced speech. The left-hand column shows
the signal x_1(n), the middle column shows the signal φ(m) =
x_1(n) correlated with x_2(n) (where x_2(n) is as specified in
Table I), and the right-hand column shows the power spectrum
S(f) obtained as described above. The ten rows in each
figure correspond to the ten combinations of signals to be
correlated as shown in Table I. An examination of Fig. 4
shows that for the unprocessed signal (i.e., the top row) the
first several harmonics are seen in the power spectrum.
Beyond 1 kHz, the spectrum decays rapidly due to the low-pass
filter (the lack of a sharp falloff in the spectrum is due to
a combination of the signal and autocorrelation windows).
The amplitudes of the harmonics vary with the first formant
envelope. It can be seen that the spectra for the autocorrelations
of each of the nonlinear quantizers (i.e., rows 2, 3,
and 10) are much flatter than the original signal spectrum.
Additionally, the spectra of the nonlinearly processed signals
are much broader than the original spectrum. It is interesting
to note that the spectra from correlations involving x(n), i.e.,
correlations numbers 1, 4, 7, and 8, are the least flattened and
are generally quite irregular (i.e., the harmonics are not very
easy to find).
Fig. 5 shows similar results from a different section of voiced
speech. As seen from the spectrum of the unprocessed signal
(on the top line) the bandwidth of the first formant is fairly
small, causing the correlation function to show a great deal of
formant periodicity for small values of m. The effects of the
nonlinearities on the signal spectra are quite impressive even
for such a difficult case as this one.
Fig. 6 shows results from a section of speech from a female
speaker (high pitch). Again the spectrum from the unprocessed
signal shows only a few harmonics whose amplitudes
vary with the formant amplitude. The nonlinearly processed
samples show various degrees of spectral flattening, as
anticipated by the previous discussion.
Finally, Fig. 7 shows the results obtained with a voiced
frame from a low pitched (long period) male speaker. In this
example the first formant has a very narrow bandwidth as seen
in the spectrum at the top of Fig. 7. Pitch detection directly
on the autocorrelation of the signal yields incorrect results in
this case due to the first formant peak(s) in the autocorrelation
function. However, as shown in Fig. 7, almost any of the
nonlinearities flatten the spectrum and eliminate the troublesome
effects of the sharp first formant in the resulting correlation
function.

[Figs. 5-7 plots: columns SIGNAL x_1(n), CORRELATION φ(m), and POWER SPECTRUM S(f) (dB); ten rows, one per correlator of Table I. Panel titles: FRAME 88, LRR - I SAW THE CAT; FRAME 146, LM05T; FRAME 50, F 105W.]
Fig. 5. The signal x_1(n); the resulting correlation and power spectrum
for each of the ten correlators of Table I for another section of voiced
speech.
Fig. 6. The signal x_1(n); the resulting correlation and power spectrum
for each of the ten correlators of Table I for a section of voiced
speech from a female speaker.
Fig. 7. The signal x_1(n); the resulting correlation and power spectrum
for each of the ten correlators of Table I for a section of voiced
speech from a low pitched male speaker.
In summary, we have presented examples which tend to
show that, as anticipated, the effect of nonlinearly quantizing
the signal amplitudes using the quantizers of Fig. 1 is to
effectively flatten and broaden the signal power spectrum, thereby
reducing the effects of the first formant on the correlation
function, and simplifying the pitch detection problem. In the
next section we present results of a comparative test of the
performance of the ten correlation pitch detectors discussed in
this section on a series of speech utterances.
`
III. EVALUATION OF THE TEN NONLINEAR
CORRELATIONS
`
In order to evaluate and compare the performance of the ten
nonlinear correlations discussed in the preceding section, a
small set of the utterances from the data base in [2] was used
as a test set. For each of the utterances a reference pitch
contour was available from which an error analysis was made
[6]. Since the problem of making a reliable voiced-unvoiced
decision was not a concern here, the reference voiced-unvoiced
contour was used directly, i.e., each correlator was required to
estimate the pitch period, assuming a priori that the interval
was properly classified as voiced. (No pitch detection was
done during unvoiced intervals.) However, if the peak correlation
value (normalized) fell below a threshold (0.25), the
interval was classified as unvoiced since reliable selection of
`
`TABLE II
`STANDARD DEVIATIONS FOR TEN CORRELATORS
`
`TABLE III
`ERROR STATISTICS FOR TEN CORRELATORS-UNSMOOTHED
`
`Utterance
`
`Utterance
`
`.53
`.63
`.1\4
`.52
`.63
`.47
`.40
`.57
`_.46
`.45
`
`.80
`.60
`.n
`.84
`. 72
`.61!
`. 79
`.61
`. 76
`.64
`. 72
`.65
`.80
`.63
`.65 1.15
`.68
`.97
`. 79
`.69
`
`3
`
`"
`
`j
`
`10
`
`.n
`.68
`. 70
`.83
`.86
`.18
`.78
`.73
`.67
`.75
`
`.54
`.58
`.56
`.79
`·99
`.82
`.54
`.54
`.56
`. 75
`
`1.34
`1.59
`1.50
`1.54
`1. 47
`1.55
`1.46
`1.67
`1.56
`1.50
`
`.88
`o85
`1.58
`1.70 1.18 1.19
`
`1.66 1.10 1.14
`1. 82
`.97 1.11
`1.75 1.07 1.07
`1.66 1.08 1.19
`1. 70
`.99 1.04
`}..88 1.09 1.18
`1. 76 1.13 1.17
`1.69 1.13 1. 25
`
`1.23
`
`1.21
`
`1.31
`1.32
`1.24
`1.29
`1.34
`1.29
`1.30
`1.39
`
`1.24
`1.32
`1.17
`
`1.24
`
`1.02
`
`1.05
`1.29
`1.51
`1.41
`1.28
`
`1.52
`1.39
`1.46
`1.50
`1.37
`1.43
`1.45
`1.45
`1.46
`1.57
`
`] .23
`1.05
`1.04
`1.34
`1.27
`1.25
`1.24
`1.24
`1.11
`1.17
`
Standard Deviation of Pitch Period - Unsmoothed
`
`1.12 1.59 1.52
`1.40 1.63 1.81
`1.21 1. 49 1. 76
`.97 1. 48 1. 59
`1.02 1. 65 1 :6o
`1.13 1.63 1.68
`1.14 l. 55 1. 75
`1.43 1.78 1.83
`1.41 1.67 1.92
`1.26 1.63 1.76
`
`2.08
`1.54
`1. 34
`1. 87
`1. 36
`1.69
`1. 77
`1.78
`1.48
`1.68
`
`1.22
`1.25
`1.28
`1.33
`1.55
`1.57
`1.21
`1.84
`1.41
`1.56
`
`1.92
`2.15
`1.98
`2.17
`2.23
`1.93
`2.15
`2.48
`2.39
`2.29
`
`1. 75
`1.50
`l.lil
`1.83
`1.53
`1.51
`2.11
`
`1.92·
`1.66
`1.63
`
`24
`8
`13
`21
`14
`13
`24
`28
`16
`15
`
`33
`15
`22
`29
`25
`24
`31
`37
`25
`27
`
`40
`11
`10
`24
`18
`12
`27
`34
`14
`15
`
`18
`6
`10
`15
`18
`11
`13
`16
`10
`
`11
`
`18
`5
`7
`14
`17
`14
`
`13
`8
`
`9
`
`25
`8
`11
`20
`19
`13
`16
`22
`11
`13
`
`10
`
`10
`
`Number of Gross Voiced Errors (Unsmoothed)
`
`10
`
`11
`
`10
`
`11
`
`13
`4
`
`10
`11
`
`10
`
`12
`
`10
`
`14
`
`10
`
`.61
`.62
`.61
`.60
`• 70
`.61
`.59
`.63
`.59
`. 59
`
`.40
`.48
`.49
`.43
`.48
`.56
`.46
`.48
`.50
`.59
`
`.51
`.61
`.55
`-55
`·57
`.60
`.54
`.67
`.64
`.54
`
`.50
`.50
`.50
`.56
`.54
`.56
`.71
`;68
`.51
`.56
`
`.50
`• 59
`.57
`.49
`.62
`.52
`.51
`.50
`.57
`.58
`
`.84
`1.09
`.85
`1.25
`1.18
`1.06
`1.03
`1.19
`1.04
`1.05
`
`10
`
Standard Deviation of Pitch Period - Smoothed
`
`the pitch was not possible with a correlation peak below this
`threshold.
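The peak-picking decision with the 0.25 threshold can be sketched as follows. This is a hedged illustration: normalizing by the zero-lag value φ(0), the lower search bound `min_lag`, and the function name are all assumptions introduced here, since the text states only that the normalized peak is compared with 0.25.

```python
import numpy as np

def pick_pitch_period(phi, min_lag=20, threshold=0.25):
    """Return the lag of the largest correlation peak, or None if judged unvoiced.

    phi       : correlation values for m = 0..M0-1
    min_lag   : smallest lag searched (hypothetical value, not from the paper)
    threshold : minimum normalized peak height (0.25 in the text)
    """
    phi = np.asarray(phi, dtype=float)
    if phi[0] <= 0.0:
        return None                      # no energy in the frame
    normalized = phi / phi[0]            # assumed normalization by phi(0)
    lag = min_lag + int(np.argmax(normalized[min_lag:]))
    if normalized[lag] < threshold:
        return None                      # counted as a voiced-to-unvoiced error
    return lag

# Illustrative correlation functions (values invented for the demo).
phi_voiced = np.zeros(200)
phi_voiced[[0, 80, 160]] = [1.0, 0.7, 0.4]
phi_weak = np.zeros(200)
phi_weak[[0, 57]] = [1.0, 0.1]

print(pick_pitch_period(phi_voiced))
print(pick_pitch_period(phi_weak))
```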
Thirteen utterances from [2] were used in this comparison.
Tables II-V present the results of an error analysis which
measured the average and standard deviation of the pitch
period, the number of gross pitch period errors, and the
number of voiced-to-unvoiced errors [2].¹⁰ For all utterances
the average pitch period error was well below 0.5 samples
(10 kHz sampling rate) and so the results of this measurement
are not presented. Table II presents the standard deviations of
the pitch period for the ten correlations. The results are also
presented for the errors when the pitch contours were nonlinearly
smoothed using a median smoothing algorithm [7].
From Table II it can be seen that the standard deviations for
all correlators were approximately the same for the same
utterance. It is also seen that as the average pitch period gets
longer (reading from left to right) the standard deviation
increases proportionally.
Tables III and IV show the error statistics for gross errors
and voiced-to-unvoiced errors, both for the unsmoothed pitch
contours (Table III) and for the smoothed pitch contours
(Table IV). These tables show that for the high pitched
speakers (utterances prefaced by C1, F1, F2), although some
differences¹¹ were present in the error scores for the unsmoothed
data, the nonlinear smoother was able to correct
most of the errors. Thus the overall performance on the first
`
`10 A voiced-to-unvoiced error occurred when a voiced region was
`improperly classified as an unvoiced region because no peak above the
`threshold was present in the correlation function.
11 These differences for the high pitched (short period) speakers were
due to pitch period doubling, i.e., the correlation peak at twice the
period was somewhat higher than the correlation peak at the true
period. This is a common effect when the pitch period is on the order
of 3 ms (300 Hz pitch) as was the case for these speakers.
`
Number of Voiced-Unvoiced Errors (Unsmoothed)

Total Number of Voiced Intervals (one value per utterance):
213  133  174  169  118  152  157  170  170  144  105  134  141
`
`TABLE IV
`ERROR STATISTICS FOR TEN CORRELATORS-SMOOTHED
`
`Utterance
`
`11
`
`17
`
`7
`8
`
`11
`
`" j
`~ . rl .
`" 0
`
`t u
`
`9
`10
`
Number of Gross Voiced Errors (Smoothed)
`
`13
`9
`3
`3
`
`10
`7
`
`11
`
`13
`10
`14
`8
`
`10
`
`19
`12
`
`10
`
`20
`
`18
`
`11
`
`17
`
`11
`
`19
`15
`
`10
`17
`
`10
`
`12
`11
`
`10
`
`10
`
Number of Voiced-Unvoiced Errors (Smoothed)
`
four utterances was approximately the same for all correlators.
For the low pitched speakers (utterances prefaced by LM, M2)
there were more significant differences between the correlators.
For the category of gross errors, correlators 1 and 8
generally had the largest numbers of errors across the last 6
utterances in the test. However, for the category of
voiced-to-unvoiced errors, correlators 2 and 9 had consistently the
`largest number of errors. Although the smoothing signifi-
`
`
`TABLE V
`TOTAL ERROR STATISTICS FOR TEN CORRELATORS
`
Utterance
`
`0
`
`"'
`_,
`"
`
`'"
`0
`§.!:
`
`'"
`~
`0
`~
`
`~
`
`~
`>'
`
`16
`~
`
`11
`
`l l
`
`0
`~
`0
`
`i\'
`
`14
`
`11
`
`14
`
`11
`
`l
`"
`
`25
`18
`
`16
`
`22
`
`16
`
`16
`
`25
`
`~
`
`~
`
`~
`
`~
`
`g
`"'
`

`i!
`
`36
`
`26
`
`28
`
`30
`
`29
`
`29
`
`34
`
`Lo
`
`15
`12
`
`24
`
`18
`
`15
`
`27
`
`20
`
`16
`
`15
`
`19
`22
`
`16
`
`21
`
`18
`
`11
`
`14
`
`19
`16
`
`27
`19
`16
`
`22
`
`23
`
`17
`
`20
`
`2
`~ 3
`l
`
`0
`0
`
`5
`6
`
`cantly reduced the number of gross errors for many of the
correlators, in turn it increased the number of
voiced-to-unvoiced errors. Since both errors constitute a pitch error,
in this case the most significant error statistic is probably the
sum of the gross errors and voiced-to-unvoiced errors. Table V
shows these results. Based on this combined error statistic
the following conclusions can be drawn about the performance
of the ten correlators.
1) For high pitched speakers the differences in performance
scores between the different correlators are small and probably
insignificant. It is for this class of speakers that any type of
correlation measurement of pitch period tends to work very
well.
2) For low pitched speakers fairly significant differences in
the performance scores existed. Correlator number 1 (the
normal linear autocorrelation) tended to give the worst performance
for all utterances in this class. Correlators numbers
4, 7, 8 (the ones involving an unprocessed x(n) in the computation)
were also somewhat poorer in their overall performance
based on the sum of gross errors and voiced-to-unvoiced
errors.
3) Differences in the performance among the remaining
six correlators were not consistent. Thus, any one of these
correlators would be appropriate for an autocorrelation pitch
detector.
It is interesting to note that (as seen in Tables III-V) the
results for utterance M208T were significantly worse than for
utterance M208M. These utterances were simultaneously
recorded, the difference being that M208T was recorded off a
telephone line, whereas M208M was recorded from a close
talking microphone. This result is due to the band-limiting
effects of the telephone line (300 Hz cutoff frequency) which
eliminate the first few harmonics of the pitch, thereby making
accurate pitch detection more difficult.
`To illustrate
