`
`Pitch Prediction Filters in Speech Coding
`
`RAVI P. RAMACHANDRAN AND PETER KABAL
`
`
Abstract: Prediction error filters which combine short-time prediction
(formant prediction) with long-time prediction (pitch prediction) in a
cascade connection are examined. A number of different solution methods
(autocorrelation, covariance, Burg) and implementations (transversal and
lattice) are considered. It is found that the F-P cascade (formant filter
before the pitch filter) outperforms the P-F cascade for both transversal-
and lattice-structured predictors. The performances of the transversal and
lattice forms are similar. The solution method that yields a transversal
structure requires a stability test and, if necessary, a consequent
stabilization. The lattice form allows for a solution method which ensures
a stable synthesis filter. Simplified solution methods are shown to be
applicable for the pitch filter (multitap case) in an F-P cascade.
Furthermore, new methods to estimate the appropriate pitch lag for a pitch
filter are proposed for both transversal and lattice structures. These
methods perform essentially as well as an exhaustive search in an F-P
cascade. Finally, the two cascade forms are implemented as part of an APC
coder to evaluate their relative subjective performance.
`
`I. INTRODUCTION
`
In this paper, speech coder configurations which use two nonrecursive
prediction error filters to process the incoming speech signal are
examined. Conventionally, the prediction is carried out as a cascade of two
separate filtering operations. The first filter, referred to here as the
formant filter, removes near-sample redundancies. The second is termed the
pitch filter and acts on distant-sample waveform similarities. The
resulting residual signal is quantized and coded for transmission. In an
adaptive predictive coder (APC), these predictors are placed in a feedback
loop around the quantizer. An additional quantization noise shaping filter
can be employed to reduce the perceived distortion in the decoded speech
[1], [2]. An alternative description of an APC coder uses an open-loop
predictor configuration and a noise feedback filter [3]. A block diagram of
such a configuration is shown in Fig. 1. This type of open-loop arrangement
is also used in code-excited linear prediction (CELP) [4]. In CELP, the
coding is accomplished by selecting a waveform from a given repertoire of
waveforms. The selection process uses an analysis-by-synthesis strategy.
Conceptually, each candidate waveform is passed through the synthesis
filters to find the one which produces the best quality speech.
`
Manuscript received June 10, 1987; revised August 30, 1988. This work was
supported by the Natural Sciences and Engineering Research Council of
Canada.
R. P. Ramachandran is with the Department of Electrical Engineering, McGill
University, Montreal, P.Q., Canada H3A 2A7.
P. Kabal is with the Department of Electrical Engineering, McGill
University, Montreal, P.Q., Canada H3A 2A7 and INRS-Telecommunications,
Université du Québec, Verdun, P.Q., Canada H3E 1H6.
IEEE Log Number 8826113.
`
Fig. 1. Block diagram of an APC coder with noise feedback. (a) Analysis
phase. (b) Synthesis phase.
`
Noise shaping is accomplished by including a frequency weighting in the
error criterion which is used to choose the best waveform.
In both APC and CELP, the residual signal or the selected codeword (after
scaling by the gain factor) is passed through a pitch synthesis and a
formant synthesis filter to reproduce the decoded speech. The filtering in
the synthesis phase can be viewed in the frequency domain as first
inserting the fine pitch structure and then shaping the spectral envelope
(formant structure).
The analysis to determine the predictor coefficients is carried out frame
by frame. The filter coefficients are then coded for transmission. The
quantization of these coefficients is outside the scope of the present
study. These parameters, along with the quantized excitation information,
are used by the decoder to reconstruct the speech. The frame update rate is
chosen to be slow enough to keep the required transmission rate small, yet
fast enough to allow the speech segment under analysis to be adequately
described by a set of constant parameters. Depending on the application,
the effective frame size usually corresponds to time intervals between 5
and 20 ms.
The aim of this paper is to study predictors which incorporate both
short-time and long-time prediction. The effect of the ordering of the
prediction filters in the cascade connection is considered. The filters
will be implemented in both lattice and transversal forms. In addition,
methods to determine the lag used for the pitch filter will be derived. The
two predictor configurations incorporating the transversal and lattice
solutions are tested as part of an APC coder that is equivalent to the one
shown in Fig. 1. This allows us to assess the relative perceptual quality
of the decoded speech that results from the use of different configurations
and solutions.
The next section will introduce the different configurations for formant
and pitch filters. This is followed by an
`
`0096-3518/89/0400-0467$01.00 © 1989 IEEE
`
`
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. 37, NO. 4, APRIL 1989
`
analysis of a prediction error filter which uses general delays. This
general structure subsumes both formant and pitch filters and allows for
both autocorrelation and covariance analyses. The following section makes
the analysis specific to pitch filters. A comparison of the techniques is
given in Section V. Then, the stability properties of the synthesis filters
are examined for different configurations. Section VII examines means to
determine an appropriate lag for the pitch filter. Finally, Section VIII
discusses the relative performance of the different options when
implemented as part of a speech coder.
`
`II. FORMANT AND PITCH PREDICTORS
The conventional formant predictor has a transfer function

$$F(z) = \sum_{k=1}^{N_f} a_k z^{-k}. \tag{1}$$

The order $N_f$ is typically between 8 and 16. The system function of the
noise feedback filter is usually related to that of the formant predictor.
One choice is to let $N(z) = F(z/\alpha)$ where $0 < \alpha < 1$. This
reduces the perceptual distortion of the output speech by improving the
signal-to-noise ratio (SNR) in regions where the spectral level is low.
However, this improvement comes at the expense of decreased SNR in the
formant regions [1]. At the receiver, the formant synthesis filter has a
transfer function $H_F(z) = 1/(1 - F(z))$.
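The relation between the formant prediction error filter $1 - F(z)$, the synthesis filter $H_F(z)$, and the noise feedback choice $N(z) = F(z/\alpha)$ can be illustrated with a minimal sketch. The second-order coefficients and the input samples below are hypothetical values chosen only for illustration, and both filters are assumed to start from zero initial state.

```python
def prediction_error(x, a):
    """Apply 1 - F(z), with F(z) = sum_k a[k-1] z^-k, to x (zero initial state)."""
    out = []
    for n in range(len(x)):
        pred = sum(a[k - 1] * x[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        out.append(x[n] - pred)
    return out


def synthesis(d, a):
    """Apply H_F(z) = 1 / (1 - F(z)) to d (zero initial state)."""
    y = []
    for n in range(len(d)):
        pred = sum(a[k - 1] * y[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        y.append(d[n] + pred)
    return y


def bandwidth_expand(a, alpha):
    """Coefficients of F(z / alpha): the coefficient of z^-k is scaled by alpha**k."""
    return [a[k - 1] * alpha ** k for k in range(1, len(a) + 1)]


a = [1.2, -0.5]                           # hypothetical predictor coefficients
s = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0]    # hypothetical input samples
d = prediction_error(s, a)                # formant residual
s_rec = synthesis(d, a)                   # synthesis undoes the analysis filter
noise_feedback = bandwidth_expand(a, 0.5)  # coefficients of N(z) = F(z / 0.5)
```

With no quantization between analysis and synthesis, `s_rec` reproduces `s` up to rounding, which is the sense in which $H_F(z)$ inverts $1 - F(z)$.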
The pitch predictor has a small number of taps $N_p$. The delays associated
with these taps are bunched around the pitch lag value. The system function
for a transversal form pitch predictor is

$$P(z) = \begin{cases}
\beta_1 z^{-M}, & \text{1 tap} \\
\beta_1 z^{-M} + \beta_2 z^{-(M+1)}, & \text{2 taps} \\
\beta_1 z^{-M} + \beta_2 z^{-(M+1)} + \beta_3 z^{-(M+2)}, & \text{3 taps.}
\end{cases} \tag{2}$$

The pitch lag $M$ is usually updated along with the coefficients. The pitch
synthesis filter has a system function $H_P(z) = 1/(1 - P(z))$.
The conventional predictor configuration uses a cascade of a formant
predictor and a pitch predictor, referred to here as an F-P cascade. This
structure can be motivated from a standard speech production model, which
decouples the quasi-periodic source (the vocal folds) from the vocal tract
filter.
In the context of speech coding, pitch predictors are most useful during
voiced speech since voiced speech is characterized as a quasi-periodic
signal with considerable correlation between samples separated by a pitch
period. In the F-P cascade, the formant predictor removes the near-sample
correlations to a large extent. The resulting formant predicted signal is a
low-density quasi-periodic signal consisting mainly of pitch spikes. The
pitch predictor acts on this residual signal. If the pitch period is an
integral number of samples, a one-tap pitch predictor can remove pitch
period correlations. For nonintegral pitch periods, a multitap pitch
predictor serves somewhat as an interpolating filter for the removal of
these distant-sample correlations.
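The integral-period case can be checked with a small sketch. The pulse shape, the period of 5 samples, and the unit coefficient are hypothetical values chosen so that the input is exactly periodic; the tap delays $M + k$ follow the grouping used for the pitch predictor in this paper.

```python
def pitch_prediction_error(d, betas, M):
    """Apply 1 - P(z), with P(z) = sum_k betas[k] z^-(M+k), to d (zero initial state)."""
    out = []
    for n in range(len(d)):
        pred = sum(b * d[n - M - k] for k, b in enumerate(betas) if n - M - k >= 0)
        out.append(d[n] - pred)
    return out


period = 5
pulse = [1.0, -0.4, 0.2, 0.0, 0.0]   # one hypothetical pitch pulse
d = pulse * 4                        # exactly periodic signal, four periods
r = pitch_prediction_error(d, [1.0], period)
```

The first period passes through unchanged because the filter memory is empty; every later sample of `r` is zero, since each sample is predicted exactly by the sample one period earlier.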
The formulation used for the pitch filter is such that it removes long-term
correlations, whether due to actual pitch excitation or not. The use of the
term "pitch filter" is somewhat misleading in describing the action of this
filter for unvoiced speech and even to some extent for voiced speech.
However, for ease of reference and to keep with past nomenclature, in this
paper the long delay filter will be referred to as the pitch filter, and
the corresponding lag value will be referred to as the pitch lag.
The cascade connection of the predictors can also have the pitch predictor
precede the formant predictor (referred to as a P-F cascade). In the P-F
cascade, the pitch filter is chosen to reduce long-term correlations such
as those due to quasi-periodic input signals. The remaining near-sample
correlations are handled by the formant predictor. The filter coefficients
for the two filters in the cascade are determined in a sequential fashion.
The coefficients of the first filter are found, and then the coefficients
of the second filter are determined. The sequential solution process gives
different results for the F-P cascade and the P-F cascade connections. In
addition, the initial conditions at the time of coefficient update are
different for the F-P and P-F connections. This would account for
differences even when the two forms use the same coefficients for their
constituent parts.
The individual filters can be implemented in either transversal or lattice
form. As shown later, the various implementations and solution methods give
rise to systems with differing performance and differing stability
properties.
`
`III. PREDICTORS WITH GENERAL DELAYS
In this section, general formulations for determining the predictor
coefficients for both formant and pitch predictors in transversal or
lattice form are developed. Later, these formulations are made more
specific for the case of pitch filters.
`
`A. Transversal Implementation
A model for calculating the predictor coefficients for a transversal
implementation is shown in Fig. 2. The input signal $x(n)$ is multiplied by
a data window $w_d(n)$ to give $x_w(n)$. The signal $x_w(n)$ is predicted
from a set of its previous samples to form an error signal

$$e(n) = x_w(n) - \sum_{k=1}^{L} c_k x_w(n - M_k). \tag{3}$$

The values $M_k$ are arbitrary but distinct integers corresponding to
delays of the signal $x_w(n)$. The final step is to multiply the error
signal by an error window $w_e(n)$ to obtain a windowed error signal
$e_w(n)$ where

$$e_w(n) = w_e(n)\, e(n) = w_e(n)\, x_w(n) - w_e(n) \sum_{k=1}^{L} c_k x_w(n - M_k). \tag{4}$$

The squared error is defined by

$$\varepsilon^2 = \sum_{n=-\infty}^{\infty} e_w^2(n). \tag{5}$$

Fig. 2. Analysis model for transversal predictors.
`
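The coefficients $c_k$ are computed by minimizing $\varepsilon^2$. This
leads to a linear system of equations that can be written in matrix form
($\boldsymbol{\Phi}\mathbf{c} = \boldsymbol{\alpha}$):

$$\begin{bmatrix}
\phi(M_1, M_1) & \phi(M_1, M_2) & \cdots & \phi(M_1, M_L) \\
\phi(M_2, M_1) & \phi(M_2, M_2) & \cdots & \phi(M_2, M_L) \\
\vdots & \vdots & & \vdots \\
\phi(M_L, M_1) & \phi(M_L, M_2) & \cdots & \phi(M_L, M_L)
\end{bmatrix}
\begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_L \end{bmatrix}
=
\begin{bmatrix} \phi(0, M_1) \\ \phi(0, M_2) \\ \vdots \\ \phi(0, M_L) \end{bmatrix}
\tag{6}$$

where

$$\phi(i, j) = \sum_{n=-\infty}^{\infty} w_e^2(n)\, x_w(n - i)\, x_w(n - j). \tag{7}$$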
`
For both formant and pitch predictors, the delays $M_k$ are grouped. A
formant predictor has a set of delays $M_k = k$ for $k = 1$ to $N_f$. A
pitch predictor has a small number of delays $M_k = M + k$ for $k = 0$ to
$N_p - 1$. Grouping the pitch taps reduces the amount of side information
(which is sent to the decoder) needed to specify the delay values.
1) Autocorrelation Method: The autocorrelation method results if
$w_e(n) = 1$ for all $n$. The data window $w_d(n)$ is typically
time-limited (rectangular, Hamming, or other). An important consideration
is the minimum-phase property of the prediction error filter
$A(z) = 1 - \sum_{k=1}^{L} c_k z^{-M_k}$. If $A(z)$ is minimum phase, the
corresponding synthesis filter $1/A(z)$ used at the decoder is stable. The
autocorrelation method can be shown to give a minimum-phase formant filter
[5]. In the case of pitch filters, the minimum-phase property does not hold
in general. An exception occurs if the delays corresponding to the
coefficients are uniformly spaced, i.e., $M_k = k M_1$. This point is
discussed further in Appendix A.
The matrix $\boldsymbol{\Phi}$ is always symmetric and positive definite.
It is also Toeplitz if the intercoefficient delays are equal. Depending on
whether $\boldsymbol{\Phi}$ is Toeplitz or not, either the Levinson
recursion¹ or the Cholesky decomposition can be used to solve the
autocorrelation equations.
2) Covariance Method: The covariance method results if $w_d(n) = 1$ for all
$n$ and the error window is rectangular, $w_e(n) = 1$ for
$0 \le n \le N - 1$. More general error windows in a covariance approach
have been suggested by Singhal and Atal [6]. The covariance method does not
guarantee that $A(z)$ is minimum phase, but it does minimize the error
energy for each frame. The resulting system of equations (6) has the
entries in $\boldsymbol{\Phi}$ and $\boldsymbol{\alpha}$ defined with no
window applied to the input signal,

$$\phi(i, j) = \sum_{n=0}^{N-1} x(n - i)\, x(n - j) \tag{8}$$

where $N$ is the frame length.
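The covariance formulation in (6) and (8) can be sketched for an arbitrary set of delays. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the frame is taken to start at sample index `start` of a longer buffer so that past samples are available, and plain Gaussian elimination stands in for the Cholesky decomposition mentioned in the text.

```python
def phi(x, i, j, start, N):
    """Covariance entry as in (8): sum over the frame of x(n - i) x(n - j)."""
    return sum(x[n - i] * x[n - j] for n in range(start, start + N))


def solve(A, b):
    """Solve A c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[r][j] -= factor * aug[col][j]
    c = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * c[j] for j in range(i + 1, n))
        c[i] = (aug[i][n] - tail) / aug[i][i]
    return c


def covariance_predictor(x, delays, start, N):
    """Coefficients c_k that minimise the frame error energy for delays M_k."""
    assert start >= max(delays), "frame must leave room for the delayed samples"
    A = [[phi(x, mi, mj, start, N) for mj in delays] for mi in delays]
    b = [phi(x, 0, m, start, N) for m in delays]
    return solve(A, b)


# Hypothetical test signal that obeys x(n) = 1.2 x(n-1) - 0.5 x(n-2) exactly.
x = [1.0, 0.5]
for n in range(2, 40):
    x.append(1.2 * x[n - 1] - 0.5 * x[n - 2])
c = covariance_predictor(x, delays=[1, 2], start=2, N=30)
```

Because the test signal is exactly predictable from its two previous samples, the solution recovers the generating coefficients 1.2 and -0.5 up to rounding. Passing `delays=[M, M + 1, M + 2]` gives the three-tap pitch case.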
An alternative method is the modified covariance technique, which does
guarantee a minimum-phase filter [1]. This technique works well for formant
predictors and is used in many of the experiments described later. A
discussion of the modified covariance approach and its relevance to the
pitch prediction problem appears in Appendix B.
`
`B. Lattice Implementation
Lattice methods have been employed in linear prediction and are useful in
implementing a lattice-structured formant predictor [7].² Here, we consider
more general lattice forms with only a subset of the stages actually
performing a filtering operation. A lattice-structured predictor consisting
of a total of $P$ stages is an all-zero filter, as depicted in Fig. 3. The
input signal is $x(n)$, and the final error signal is $e(n) = f_P(n)$.
Stage $i$ has a reflection coefficient $K_i$ and forms both the forward
residual $f_i(n)$ and backward residual $b_i(n)$. Reflection coefficients
will be calculated for stages corresponding to one of the delay values
$M_k$. Other stages will have zero-valued reflection coefficients. For
these null stages, the forward error term propagates unaltered, and the
backward error term is merely delayed. A lattice form filter will be
minimum phase if all of the reflection coefficients have magnitudes which
are smaller than one [7].

Fig. 3. Analysis model for lattice predictors.

¹ A distinction is made between the general form of the Levinson recursion,
which allows an arbitrary right-hand-side vector, and the Levinson-Durbin
recursion, which applies if the elements of $\boldsymbol{\alpha}$ appear in
the first row of $\boldsymbol{\Phi}$.
² For formant predictors, one can convert between transversal and lattice
implementations with identical impulse responses. However, in a
time-varying environment, they are not equivalent due to their different
initial conditions at frame boundaries.

For those stages for which a reflection coefficient is calculated, the aim,
in terms of maximizing the prediction gain alone, is to minimize the
mean-square value of the forward residual. However, this criterion does not
ensure that the magnitude of the resulting reflection coefficients is
bounded by one and therefore does not ensure the stability of the
corresponding synthesis filter. The Burg algorithm minimizes the sum of the
mean-square values of the forward and backward residuals and ensures the
stability of the synthesis filter. It also has the desirable property of
guaranteeing that the mean-square value of the forward residual is
nonincreasing across each stage of the lattice.
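The two Burg properties stated above can be verified with a short derivation. The stage recursion written below is an assumption (the usual lattice form, with the sign convention chosen to agree with the transfer functions in (13)); $F_{i-1}$, $B_{i-1}$, and $C_{i-1}$ denote the forward energy, delayed backward energy, and cross term that appear in (10).

```latex
% Assumed stage recursion
f_i(n) = f_{i-1}(n) - K_i\, b_{i-1}(n-1), \qquad
b_i(n) = b_{i-1}(n-1) - K_i\, f_{i-1}(n)

% Sum of forward and backward residual energies after stage i
\sum_n \left[ f_i^2(n) + b_i^2(n) \right]
  = (1 + K_i^2)\,(F_{i-1} + B_{i-1}) - 4 K_i C_{i-1}

% Setting the derivative with respect to K_i to zero
K_i = \frac{2\, C_{i-1}}{F_{i-1} + B_{i-1}}

% Bound: \sum_n [ f_{i-1}(n) \mp b_{i-1}(n-1) ]^2 \ge 0 gives
F_{i-1} + B_{i-1} \ge 2\,|C_{i-1}| \quad\Longrightarrow\quad |K_i| \le 1

% Forward residual energy, using K_i (F_{i-1} + B_{i-1}) = 2 C_{i-1}
\sum_n f_i^2(n) = F_{i-1} - 2 K_i C_{i-1} + K_i^2 B_{i-1}
               = (1 - K_i^2)\, F_{i-1}
```

Under this convention the forward residual energy can never increase across a stage, and the bound on $|K_i|$ holds for any input, including stages that follow a run of null stages.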
For the Burg method, the reflection coefficient $K_i$ is calculated as

$$K_i = \frac{2\, C_{i-1}}{F_{i-1} + B_{i-1}} \tag{9}$$

where

$$C_i = \sum_{n=0}^{N-1} f_i(n)\, b_i(n - 1), \qquad
F_i = \sum_{n=0}^{N-1} f_i^2(n), \qquad
B_i = \sum_{n=0}^{N-1} b_i^2(n - 1), \tag{10}$$

and $N$ is the frame length. The mean-square value of the forward residual
is reduced by the factor $(1 - K_i^2)$ across stage $i$. A computationally
efficient procedure, termed the covariance-lattice method [7], calculates
the reflection coefficients using (9), but expresses them in terms of the
covariance of the input signal. With this rearrangement,
`
the computational complexity becomes comparable to the conventional
covariance method.
When applying the Burg formula, the lattice-structured prediction error
filter (as in Fig. 3) is minimum phase even if some of the reflection
coefficients are constrained to be zero. Note also that the lattice
coefficients can be transformed to direct form (impulse response)
coefficients, allowing for an alternative implementation of the filter in
transversal form.

IV. PITCH FILTER ANALYSIS METHOD
This section discusses the analysis methods that are used to implement both
transversal- and lattice-structured pitch prediction error filters. Assume
for the moment that the value of the pitch filter lag $M$ is chosen so as
to maximize the prediction gain. A discussion of the problem of estimating
$M$ is postponed until a later section. The input to the pitch filter is
$d(n)$ for an F-P cascade and $s(n)$ for a P-F cascade.

A. Transversal Pitch Filters
The autocorrelation and covariance methods can be used to determine the
coefficients for a transversal-structured prediction error filter
$1 - P(z)$. For the autocorrelation method, the input signal must be
windowed. Conventionally, the window is of finite duration, and all the
samples outside the range of the window are set to zero. The method is
effective for formant predictors since the largest filter delay is usually
small compared to the length of the window used. This ensures that the
frame edge effects due to the zero-valued analysis samples preceding and
following the window are small. In contrast to the formant predictor, the
delays used for a pitch predictor are comparable to, or even larger than,
the frame and window lengths. For a pitch filter, frame edge effects are no
longer negligible. The problem is not solved by using windows that are
longer than the largest delay of the pitch predictor since too much
time-averaging greatly reduces the performance and, in addition, changes in
the pitch lag are not adequately tracked. Experiments involving various
window shapes (including windows dynamically adapted to the pitch lag)
confirm that the performance of the resulting pitch predictors is poor.
Furthermore, there is no guarantee that the synthesis filter $H_P(z)$
derived using the autocorrelation method is stable if the filter has more
than a single tap.³
The covariance method yields high prediction gains, but may give unstable
pitch synthesis filters. Specifically, for three-tap pitch predictors in an
F-P cascade, the system of equations is

$$\begin{bmatrix}
\phi(M, M) & \phi(M, M+1) & \phi(M, M+2) \\
\phi(M+1, M) & \phi(M+1, M+1) & \phi(M+1, M+2) \\
\phi(M+2, M) & \phi(M+2, M+1) & \phi(M+2, M+2)
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}
=
\begin{bmatrix} \phi(0, M) \\ \phi(0, M+1) \\ \phi(0, M+2) \end{bmatrix}
\tag{11}$$
where

$$\phi(i, j) = \sum_{n=0}^{N-1} d(n - i)\, d(n - j) \tag{12}$$

and $d(n)$ is the input signal to the pitch predictor. The matrix
$\boldsymbol{\Phi}$ is not Toeplitz in general, and the Cholesky
decomposition can be used to solve the system of equations. For reasonable
frame sizes and the small number of taps used for pitch filters,
$\boldsymbol{\Phi}$ can be modified to become Toeplitz with little loss in
prediction gain. Then, the general form of the Levinson recursion can be
used to determine the predictor coefficients. Note that the Toeplitz nature
of $\boldsymbol{\Phi}$ does not guarantee that $H_P(z)$ is stable, but does
allow for a more efficient solution of the system of equations.
Stabilization schemes can be employed whenever $H_P(z)$ is found to be
unstable. Stabilization of the pitch filter is simple to implement and is
derived from a computationally efficient stability test [8]. The
degradation in average pitch prediction gain due to stabilization has been
found to be small for an F-P cascade [8].

³ In our limited experiments with pitch filters derived using an
autocorrelation formulation, no instability was observed.
`
`B. Lattice Pitch Filters
The Burg method works well for a formant predictor. Here, the technique is
used to develop a lattice implementation for a pitch predictor that is used
in cascade connections. The basic motivation for using the Burg approach is
that the corresponding synthesis filter is stable. Hence, no stability test
or fix-up is necessary. Even though a lattice filter involves more
computations per sample than its transversal counterpart, computational
convenience is provided in the above context. The computation required for
a stability test and the consequent stabilization is saved.
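The stability argument can be made concrete for the first non-null stage of a pitch lattice. The sketch below assumes that, with all earlier reflection coefficients set to zero, the forward residual entering stage $M$ is the filter input $d(n)$ and the delayed backward residual is $d(n - M)$. The function name and the test signal (a pulse train whose amplitude doubles every period) are hypothetical.

```python
def pitch_stage_sums(d, M, start, N):
    """Cross term C, forward energy F and backward energy B for stage M, as in (10)."""
    frame = range(start, start + N)
    C = sum(d[n] * d[n - M] for n in frame)
    F = sum(d[n] ** 2 for n in frame)
    B = sum(d[n - M] ** 2 for n in frame)
    return C, F, B


pulse = [1.0, -0.5, 0.25, 0.0, 0.0]
# Four periods of 5 samples; the amplitude doubles from one period to the next.
d = [g * v for g in (1.0, 2.0, 4.0, 8.0) for v in pulse]

C, F, B = pitch_stage_sums(d, M=5, start=5, N=15)
K_burg = 2.0 * C / (F + B)   # Burg formula (9)
beta_cov = C / B             # one-tap covariance solution: phi(0, M) / phi(M, M)
```

For this growing signal the covariance solution is `beta_cov = 2.0`, which places the poles of $1/(1 - \beta z^{-M})$ outside the unit circle, whereas the Burg coefficient is `K_burg = 0.8` and the one-coefficient synthesis filter stays stable without any fix-up.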
Given the value of $M$, the reflection coefficients $K_i$ for $i = 1$ to
$M - 1$ are set to zero. The first nonzero coefficient in the pitch filter
is $K_M$. In the case of formant filters and one-coefficient pitch filters,
the impulse response of a lattice prediction error filter has the same form
as that for a transversal filter. However, two- and three-coefficient
lattice pitch prediction error filters do not have the same transfer
functions as the transversal structured filters. The transfer functions of
lattice prediction error filters are given by
`
$$A(z) = \begin{cases}
1 + K_M K_{M+1} z^{-1} - K_M z^{-M} - K_{M+1} z^{-(M+1)}, & N_p = 2 \\[4pt]
1 + (K_M K_{M+1} + K_{M+1} K_{M+2})\, z^{-1} + K_M K_{M+2}\, z^{-2} - K_M z^{-M} \\
\quad {} - (K_{M+1} + K_M K_{M+1} K_{M+2})\, z^{-(M+1)} - K_{M+2}\, z^{-(M+2)}, & N_p = 3.
\end{cases} \tag{13}$$

The two- and three-coefficient lattice filters have terms in $z^{-1}$ and
$z^{-2}$ which are absent in the corresponding transversal filters. Note,
however, that in the case of the three-coefficient lattice filter, the
three reflection coefficients control the five nontrivial impulse response
values, hence giving a configuration with only three degrees of freedom.
In a pitch filter, the pitch lag changes from frame to frame. This
variation of the position of the nonzero lattice coefficients can be
detrimental to the performance. Consider the case when the pitch lag
increases from one frame to another. In the new frame, the backward
residual in the lattice will have been filtered by both the old
coefficients and the new coefficients. A remedy for this problem is to
reset the backward residual to the delayed filter input signal at each
frame boundary. In addition, it is beneficial to back up the filter at each
frame boundary by $N_p - 1$ samples to allow a "warm-up" period before
generating actual output samples.⁴ This resets the memory in the filter to
be the same as if the filter had been used for the infinite past. This
strategy will be used in the implementations.

V. COMPARISON OF TECHNIQUES
The various predictor configurations were tested using the analysis phase
of a general speech coder, such as that shown in Fig. 1(a), as a test bed.
In comparing different configurations and algorithms involving a pitch
filter, prediction gain will be used as the performance measure. The
prediction gain is used to avoid tying down the results to a specific type
of coder. The aim is to assess the extent to which the predictors remove
redundancies by measuring the energy of the resulting residual. For a
general predictor, the prediction gain is the ratio of the average energy
at the input to that predictor to the average energy of the prediction
residual. For the system shown in Fig. 1(a), the formant gain is

$$G_F = \frac{\sum_n s^2(n)}{\sum_n d^2(n)}. \tag{14}$$

A similar formula applies to the pitch gain. The overall prediction gain is
the prediction gain for the cascade of the filters.
For the present results, the value of pitch lag $M$ is chosen to be the one
that gives the highest prediction gain. Although an exhaustive search for
the best $M$ is not computationally practical, this approach will provide
some insight into the relative performance of the various configurations.
Also, for the present, the pitch filter is not stabilized. The issue of
stability is deferred to Section VI.
The conditions common to all experiments involve the use of a formant
predictor with ten coefficients and a pitch predictor with one, two, or
three coefficients. The input speech samples comprise six utterances, three
spoken by a male and three spoken by a female. The speech database consists
of high-quality recordings (low-pass filtered to just below 4 kHz) at a
sampling frequency of 8 kHz. The relevant average prediction gains for each
sentence were computed and converted to decibels. Since the relative
ordering of the methods is more or less preserved for each sentence, the
tables present averages across the sentences of these decibel values.
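The prediction gain measure and its conversion to decibels can be sketched as follows. The three short sequences are hypothetical stand-ins for the input speech, the formant residual, and the final residual of an F-P cascade; they are not data from the experiments.

```python
import math


def prediction_gain_db(inp, residual):
    """Ratio of input energy to residual energy, as in (14), in decibels."""
    e_in = sum(v * v for v in inp)
    e_res = sum(v * v for v in residual)
    return 10.0 * math.log10(e_in / e_res)


s = [4.0, -2.0, 3.0, 1.0]     # hypothetical input speech samples
d = [2.0, -1.0, 1.0, 0.5]     # hypothetical formant residual
r = [1.0, 0.5, -0.5, 0.25]    # hypothetical residual after the pitch filter

formant_db = prediction_gain_db(s, d)
pitch_db = prediction_gain_db(d, r)
overall_db = prediction_gain_db(s, r)
```

When the energies are measured over the same samples, the overall gain is the product of the two stage gains, so `formant_db + pitch_db` equals `overall_db`. Averaging decibel values across sentences preserves this additivity.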
`
⁴ This strategy is equivalent to converting the reflection coefficients to
direct form coefficients [see (13)] and then implementing the predictor in
transversal form.
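The conversion from reflection coefficients to direct form coefficients can be sketched with the standard step-up recursion. The stage update used here, $A_i(z) = A_{i-1}(z) - K_i z^{-i} A_{i-1}(1/z)$, is an assumed sign convention chosen so that the result agrees with (13); the lag and coefficient values are hypothetical.

```python
def lattice_to_direct(K):
    """Impulse response [1, a_1, ..., a_P] of the lattice prediction error filter.

    K[i - 1] is the reflection coefficient of stage i; null stages carry 0.0.
    """
    a = [1.0]
    for i, k in enumerate(K, start=1):
        rev = a[::-1]          # coefficients of z^-(i-1) A_{i-1}(1/z)
        a = a + [0.0]          # make room for the z^-i term
        for j in range(1, i + 1):
            a[j] -= k * rev[j - 1]
    return a


M = 4
K_M, K_M1 = 0.5, 0.25                      # hypothetical pitch reflection coefficients
K = [0.0] * (M - 1) + [K_M, K_M1]          # stages 1..M-1 are null stages
a = lattice_to_direct(K)
```

For this two-coefficient case the result is `[1.0, 0.125, 0.0, 0.0, -0.5, -0.25]`: the term $K_M K_{M+1} z^{-1}$ appears alongside the taps at $z^{-M}$ and $z^{-(M+1)}$, as in the $N_p = 2$ line of (13).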
`
`
TABLE I
PREDICTION GAINS FOR FORMANT/PITCH PREDICTORS (80-SAMPLE FRAMES). THREE
NUMBERS IN AN ENTRY REFER TO ONE-, TWO-, AND THREE-TAP PITCH FILTERS

cascade  method             formant gain (dB)   pitch gain (dB)   overall gain (dB)
F-P      transversal P      16.1                 4.2  5.3  5.8    20.3 21.4 21.9
F-P      lattice P          16.1                 4.0  5.2  5.7    20.1 21.3 21.8
P-F      transversal P(1)    8.4  8.2  7.6      10.1 11.4 11.9    18.5 19.6 19.5
P-F      transversal P(2)   13.7 11.5 11.2       6.1  8.9  9.4    19.8 20.4 20.6
P-F      lattice P(1)        9.1  7.3  6.5       9.5 11.8 13.5    18.6 19.1 20.0
P-F      lattice P(2)       13.6 12.0 11.0       6.2  7.9  9.4    19.8 19.9 20.4

Notes: P(1): pitch lag chosen to maximize pitch prediction gain.
       P(2): pitch lag chosen to maximize overall prediction gain.

A. Cascade Configurations
In comparing the F-P and P-F configurations, analysis is carried out for
80-sample frames (corresponding to 10 ms intervals). This somewhat rapid
update allows for a higher prediction gain and tends to illustrate the
differences between schemes more clearly. The range of pitch lags is set to
cover the range for both male and female speakers. Minimum and maximum
values for $M$ of 20 and 120 are used.
Table I shows a comparison of several techniques for one-, two-, and
three-tap pitch filters. The formant predictor is implemented in
transversal form with the coefficients determined by using the modified
covariance method. The pitch filter is implemented in transversal form
(covariance method) or lattice form (Burg method). In the case of the F-P
cascade, the pitch lag is that which maximizes the pitch prediction gain
and hence also the overall prediction gain. Only a single figure appears
for the formant prediction gain since the formant filter is unaffected by
the choice of pitch coefficients and pitch lag.
For the P-F cascade, there is more of an interaction between the pitch
predictor and the formant predictor. Values are given for the case in which
the pitch lag is chosen to maximize the prediction gain for the pitch
filter alone and the case in which the pitch lag is chosen to maximize the
overall prediction gain. The myopic view of choosing the pitch lag to
maximize only the pitch prediction gain gives the situation in which the
pitch predictor has a higher prediction gain than the formant predictor.
This phenomenon has also been observed in [9]. The situation reverses when
the pitch lag is chosen to maximize the overall prediction gain. It should
be noted here that the search for the best lag to maximize the overall
prediction gain is impractically complex. Note also that as the number of
taps in the pitch predictor is increased, there is an increase in the pitch
prediction gain at the expense of a decrease in the formant prediction
gain.
Note that the F-P cascade consistently outperforms the P-F cascade in
average overall prediction gain, and even more so for the myopic choice of
pitch lag.⁵
There is only a small difference between the lattice implementation of a
pitch predictor and a transversal implementation, in spite of these forms
having different impulse responses. In fact, the lattice form for two- and
three-tap filters in the F-P cascade slightly outperforms the transversal
form for the utterances spoken by females. Examination of the final
residual formed after formant and pitch prediction verifies that the pitch
pulses are effectively removed by a lattice pitch filter.
For the previous experiments, a modified covariance approach guarantees a
minimum-phase formant prediction error filter. Ancillary experiments were
conducted to compare different options for the formant filter in an F-P
cascade. A lattice implementation of the formant filter using the
reflection coefficients determined by a modified covariance method results
in essentially no change in prediction gain. Using a covariance formulation
(implemented in transversal form) for the formant filter improves the
prediction gain by only 0.1 dB, but introduces nonminimum-phase formant
filters in about 4 percent of the frames.
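The exhaustive lag search used as the reference in these experiments can be sketched for the one-tap transversal case. This is an illustrative sketch under stated assumptions rather than the procedure used by the authors: the function name is hypothetical, the covariance solution $\beta = \phi(0, M)/\phi(M, M)$ is used for each candidate lag, and the test input is an artificial unit pulse train with a period of 37 samples.

```python
def best_pitch_lag(d, start, N, m_min=20, m_max=120):
    """Return (M, beta, residual_energy) for the one-tap lag with the highest gain."""
    frame = range(start, start + N)
    energy = sum(d[n] ** 2 for n in frame)
    best = None
    for M in range(m_min, m_max + 1):
        num = sum(d[n] * d[n - M] for n in frame)    # phi(0, M)
        den = sum(d[n - M] ** 2 for n in frame)      # phi(M, M)
        if den == 0.0:
            continue
        beta = num / den
        residual = energy - num * num / den          # error energy at the optimum beta
        # Strict comparison: among tied lags the smallest one is kept.
        if best is None or residual < best[2]:
            best = (M, beta, residual)
    return best


# Hypothetical input: unit pulses every 37 samples, 80-sample frame starting at n = 120.
d = [1.0 if n % 37 == 0 else 0.0 for n in range(200)]
lag, beta, residual = best_pitch_lag(d, start=120, N=80)
```

Minimizing the residual energy is equivalent to maximizing the pitch prediction gain because the input energy of the frame does not depend on $M$. For an exactly periodic input, multiples of the period predict equally well, so the tie-breaking rule determines which lag is reported.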
`
`B. Formant-Pitch Interaction: Effects of Frame Size
The success of the pitch predictor depends on the formant residual having
adjacent pitch pulses which are similar in shape. Yet if the formant
predictor varies significantly from frame to frame, the pitch pulses may
differ in detail. For short frames, the formant predictor coefficients may
change significantly from frame to frame just due to the asynchronism
between the frames and the positions of the pitch pulses. An investigation
was made of the variation in prediction gain with changes in the sizes of
the analysis frames for the formant and pitch filters. We return to the F-P
cascade with the modified covariance method used to determine the
coefficients of the transversal formant predictor. Also, the covariance
approach determines the coefficients of a transversal pitch predictor. The
results for different combinations of frame sizes are shown in Table II.
Consider the performance of the pitch filter with a 40-sample analysis
frame. The pitch gain increases as the length of the analysis frame for the
formant predictor increases from 40 to 80 and then levels off for a
160-sample formant analysis frame. At the same time, since the formant
prediction gain does not change significantly with the frame size, the
overall prediction gain initially rises and then levels off.⁶ For the
80-sample pitch analysis frame, the performance is again essentially
constant with a change of the formant frame size from 80 to 160. Since the
prediction gain remains high at the slow formant update rates, the slow
formant update rates are to be preferred since they involve less
computation and require a smaller bandwidth for transmission. The number of
frames with unstable pitch synthesis filters also depends on the frame size
combination chosen. But as shown in the next
⁵ This ordering is also true for each utterance, except for one utterance
in which the P-