`
`
`Adaptive Postfiltering for Quality
`Enhancement of Coded Speech
`
`Juin-Hwey Chen, Senior Member, IEEE, and Allen Gersho, Fellow, IEEE
`
`Abstract-An adaptive postfiltering algorithm for enhancing
`the perceptual quality of coded speech is presented. The postfilter
`consists of a long-term postfilter section in cascade with a short-term
`postfilter section and includes spectral tilt compensation
`and automatic gain control. The long-term section emphasizes
`pitch harmonics and attenuates the spectral valleys between pitch
`harmonics. The short-term section, on the other hand, emphasizes
`speech formants and attenuates the spectral valleys between
`formants. Both filter sections have poles and zeros. Unlike earlier
`postfilters that often introduced a substantial amount of muffling
`to the output speech, our postfilter significantly reduces this effect
`by minimizing the spectral tilt in its frequency response. As a
`result, this postfilter achieves noticeable noise reduction while
`introducing only minimal distortion in speech. The complexity
`of the postfilter is quite low. Variations of this postfilter are
`now being used in several national and international speech
`coding standards. This paper presents for the first time a complete
`description of our original postfiltering algorithm and the
`underlying ideas that motivated its development.
`
`I. INTRODUCTION
`
`EARLY speech coders operating at high bit-rates were
`usually designed to minimize the energy of quantization
`noise, or equivalently, to maximize the signal-to-noise ratio
`(SNR). In these traditional coders, the coding noise is roughly
`white, i.e., the noise spectrum is roughly flat. As the encoding
`rate goes down to 16 kb/s and below, the SNR also drops and
`the noise floor of this white coding noise is elevated to such
`an extent that it is very difficult, if not impossible, to keep it
`below the threshold of audibility.
`Two perceptually motivated approaches were proposed to
`deal with this problem. The first one uses noise spectral
`shaping at the speech encoder. This method was first proposed
`in the late 1970s by Atal, Schroeder, and Hall [2], [3] and
`by Makhoul and Berouti [4]. It has been used successfully
`in adaptive predictive coding (APC) [2], [4], [5], multipulse
`linear predictive coding (MPLPC) [6], and code-excited linear
`prediction (CELP) coders [7]. The basic idea is to shape the
`spectrum of the coding noise so that it follows the speech
`
`Manuscript received February 24, 1994; approved May 9, 1994. This
`work was performed for the Jet Propulsion Laboratory, California Institute of
`Technology, sponsored by the National Aeronautics and Space Administration.
`The associate editor coordinating the review of this paper and approving it
`for publication was Dr. Spiros Dimolitsas.
`J.-H. Chen was with the Department of Electrical and Computer Engineering,
`University of California, Santa Barbara. He is now with the Speech
`Coding Research Department, AT&T Bell Laboratories, Murray Hill, NJ
`07974 USA.
`A. Gersho is with the Center for Information Processing Research, Department
`of Electrical and Computer Engineering, University of California, Santa
`Barbara, CA 93106 USA.
`IEEE Log Number 9406780.
`
`spectrum to some extent. Roughly speaking, the ratio of signal-to-noise
`power densities at each frequency should exceed
`some minimum value that depends on frequency and the local
`character of the speech signal. Coding noise spectrally shaped
`in this way is less audible to human ears due to the noise-masking
`effect of the human auditory system [3], [8], [9].
`However, as will be discussed later, at low encoding rates,
`noise spectral shaping alone is not sufficient to make the
`coding noise inaudible.
`The second perceptually-based approach uses an adaptive
`postfilter at the speech decoder output. The use of an adaptive
`rather than fixed filter is based on the need to change the
`filtering operation according to the local character of the
`speech spectrum. The idea of filtering speech with a "formant-equalized"
`frequency response, or even the idea of enhancing
`noisy speech with a filter having a speech-like frequency
`response, dates back at least to a U.S. patent by Schroeder
`in 1965 [10]. In 1981, Sondhi et al. used Schroeder's idea
`of a "formant-equalized" frequency response in a speech enhancement
`system [11]. In 1982, Malah and Cox reported the
`use of pitch-adaptive comb filtering as a speech enhancement
`technique [12].
`To the best of our knowledge, adaptive postfiltering as a
`postprocessing technique for speech coding was first proposed
`in 1981 by Smith and Allen for enhancing the output of an
`adaptive delta modulation (ADM) coder [13]. The postfilter
`they used was an adaptive low-pass filter implemented by
`a short-time Fourier analysis/synthesis method. The cutoff
`frequency of the low-pass filter was adaptive and was chosen
`so that all spectral components above this frequency constituted
`only 1% of the total energy of the input signal. This
`adaptive cutoff frequency needed to be transmitted as side
`information. By eliminating the "out-of-band" high frequency
`noise, this postfiltering technique improved the speech quality
`of a 16 kb/s ADM coder to the extent that it was comparable
`to a 24 kb/s ADM coder without postfiltering [13]. Jayant
`extended this postfiltering idea to the adaptive differential
`pulse code modulation (ADPCM) coder [14]. Instead of the
`frequency-domain approach, he switched between a bank of
`four fixed-bandwidth low-pass finite impulse response (FIR)
`filters and achieved a similar perceptual improvement in
`ADPCM-coded speech.
`The use of postfiltering for speech coding did not become
`popular until 1984 when Ramamoorthy and Jayant proposed a
`new postfiltering technique described in [15] and in a U.S.
`patent [16]. A specific postfilter for 24 kb/s ADPCM was
`shown in [15], which also describes using the technique to
`
`1063-6676/95$04.00 © 1995 IEEE
`
`
`
`
`
`IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 3, NO. 1, JANUARY 1995
`
`enhance speech degraded by additive white Gaussian noise.
`The ADPCM postfilter proposed in [15] moves the poles
`and zeros of the synthesis filter radially toward the origin by
`suitably chosen factors. Such a postfilter reduces the perceived
`level of coding noise [15]. However, if the coding noise of
`ADPCM is high, sufficient noise reduction requires a "strong"
`postfilter, which makes speech sound muffled [17] (similar to
`a low-pass filtering effect when the filter cutoff frequency is
`lower than the effective signal bandwidth). This postfiltering
`technique has been used in many applications, including the
`enhancement of 12 kb/s subband-coded speech as well as 24
`and 16 kb/s ADPCM-coded speech [18].
`In 1986, Yatsuzuka, Iizuka, and Yamazaki were the first
`to combine adaptive postfiltering and noise spectral shaping
`in a speech coder, in this case a 4.8 to 16 kb/s variable-rate
`APC coder [19]. Yatsuzuka et al. were also the first to
`propose explicitly an additional long-term postfilter section
`based on the pitch periodicity in speech. Their short-term
`and long-term postfilter sections were respectively obtained
`by moving the poles of short-term and long-term synthesis
`filters of APC toward the origin. This all-pole postfilter had
`the same muffling (or low-pass) effect mentioned above. An
`all-pole short-term postfilter for a sequentially-adaptive APC
`coder was also reported by Zarkadis and Evans in 1987 [20].
`In his 1987 thesis [1], Chen described a postfilter which
`significantly reduced the low-pass effect. This postfilter was
`described in a U.S. patent [21]. The postfilter proposed in
`[1] contained elaborate long-term and short-term postfilter
`sections which achieved significant noise reduction without
`making the speech sound muffled. The short-term postfilter
`section of this postfilter was reported by Chen and Gersho [22]
`in 1987. At the same time, Kroon and Atal proposed the use of
`postfiltering in a CELP coder [23]. The postfilter they used was
`essentially the same as the postfilter proposed by Yatsuzuka
`et al. [19] for APC. Also at the same time, Veeneman and
`Mazor described an improved version of Malah and Cox's
`pitch-adaptive comb filter [12], with both coefficients and pitch
`period adapted for enhancing block-coded speech [24].
`Since 1987, the use of our postfiltering algorithm [1], [22]
`in CELP-like coders has become very popular. Recently,
`different variations of our postfiltering algorithm have been
`incorporated into several national and international speech
`coding standards. These include the U.S. Federal Standard 4.8
`kb/s CELP (FS1016) [25], the North American digital cellular
`radio standard 8 kb/s VSELP (IS-54) [26], [27], the Japanese
`digital cellular radio standard 6.7 kb/s VSELP (JDC), and the
`recently adopted CCITT standard 16 kb/s low-delay CELP
`(Recommendation G.728) [28], [29]. Recently, a frequency
`domain method for adaptive postfiltering to suppress noise in
`spectral valleys was reported by Wang et al. [30].
`In this paper, we present for the first time a complete
`description of our original postfiltering algorithm in [1] and
`the underlying ideas that motivated its development. We start
`with an explanation of the principle and philosophy of our
`postfilter design in Section II. This is followed by a description
`of the short-term postfilter and the long-term postfilter in
`Sections III and IV, respectively. Next, we describe in Section
`V the structure and operation of the combined postfilter (with
`
`both short-term and long-term sections). In Section VI, we
`comment on the performance of the postfilter. We then discuss
`in Section VII the variations of our postfiltering algorithm that
`are currently used in the speech coding standards mentioned
`above. Our conclusions are given in Section VIII.
`
`II. NOISE MASKING AND POSTFILTERING
`The classical Wiener theory of optimal filtering tells how
`to optimally filter a noise contaminated signal to minimize
`the noise power at the filter output. The theory shows that
`for a signal with power spectral density $S(\omega)$ contaminated
`by independent additive noise with spectral density $N(\omega)$, the
`optimal filter transfer function for minimizing mean squared
`error (MSE) between the filter output and the original signal
`is given by $H(\omega) = S(\omega)/[S(\omega) + N(\omega)]$. See, for example,
`[31]. Thus, in frequency bands where the signal-to-noise power
`density ratio (SNR) is large, the filter gain is approximately
`unity and in bands where the SNR is small the filter gain
`is very small. For postfiltering of coded speech, this theory
`suggests that we seek a filter whose transfer function has a
`magnitude that depends on the SNR at each frequency and
`that, at least qualitatively, follows the above behavior. Such a
`filter would necessarily be adaptive in order to track the time-varying
`spectral character of the speech signal. Of course, the
`performance objective should really be perceived quality rather
`than MSE. Therefore, even if the ideal Wiener filter could be
`computed, it would not be optimal for speech enhancement.
`Nevertheless, the theory provides a conceptual starting point
`for the search for an effective postfiltering technique. Perceptual
`considerations are needed to find an effective trade-off
`between noise reduction and signal distortion resulting from
`a postfiltering operation.
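The qualitative behavior of the Wiener gain described above can be checked numerically. The following Python sketch is only an illustration: the function name `wiener_gain` and the three band powers are hypothetical, and the flat noise spectrum is an assumption made for the example, not a property of any particular coder.

```python
def wiener_gain(signal_psd, noise_psd):
    """Return the Wiener gain H = S / (S + N) for each frequency band."""
    gains = []
    for s, n in zip(signal_psd, noise_psd):
        total = s + n
        # Guard against an empty band (no signal and no noise power).
        gains.append(s / total if total > 0.0 else 0.0)
    return gains

# Hypothetical band powers: a formant peak, a moderate band, a deep valley.
signal_psd = [100.0, 1.0, 0.01]
noise_psd = [1.0, 1.0, 1.0]  # flat (white) noise, assumed for illustration

gains = wiener_gain(signal_psd, noise_psd)
print([round(g, 4) for g in gains])
```

The gain is close to unity in the high-SNR band and close to zero in the low-SNR band, which is the behavior a postfilter should follow at least qualitatively.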
`In [15], Ramamoorthy and Jayant explained from an intuitive
`perspective (rather than from a Wiener filtering perspective)
`why adaptive postfiltering could reduce perceived
`noise. In this section, we give another explanation that takes
`into account auditory masking of noise, based on established
`properties of the human hearing system. We also describe the
`general philosophy of our postfilter design.
`Given a pure tone with a certain frequency and intensity,
`for a normal listener there is a masking threshold function
`associated with this tone such that if noise is added to the
`tone and the power spectrum of the noise is strictly below
`the masking threshold at all frequencies, that noise will be
`inaudible, i.e., it will be completely masked by the tone [9].
`In general, the masking threshold has a peak at the frequency
`of the tone, and monotonically decreases on both sides of
`the peak. This means the noise components near the tone
`frequency are allowed to have higher intensities than other
`noise components that are farther away from that frequency
`while remaining inaudible.
`Some limited studies have also been performed on
`suprathreshold masking that reduces the loudness of the noise
`rather than making the noise completely inaudible [3]. In this
`case, a pulsating narrow-band noise burst is above the masking
`threshold and is partially masked by a masker tone (i.e., has a
`reduced noise loudness). From the experimental data in [3], it
`
`
`
`
`CHEN AND GERSHO: ADAPTIVE POSTFILTERING FOR QUALITY ENHANCEMENT OF CODED SPEECH
`
`
`can be seen that for a given loudness of the partially masked
`noise, the intensity of the noise varies as a function of the
`difference between the center frequency of the narrow-band
`noise and the frequency of the masker tone. Such a function
`generally has a shape similar to that of the masking threshold
`function. Consequently, even for low-bit-rate speech coding
`when suprathreshold masking is present and it is difficult to
`make the noise inaudible, the masking threshold function still
`provides a useful guideline for reducing noise loudness.
`A short segment of a speech signal can be considered as
`a superposition of many sine waves. If each of these sine
`waves were presented alone to a normal listener, there would
`be an associated masking threshold function with a peak at
`the frequency of that sine wave. When all such sine waves
`are superimposed, their associated threshold functions must
`also superimpose. Exactly how these functions interact with
`each other is unknown. However, no matter how complicated
`the interaction might be, there must exist an overall masking
`threshold function for the given segment of speech signal
`such that an added noise will be inaudible if its power
`spectrum is below the threshold at all frequencies. The overall
`masking threshold function follows to some extent the spectral
`peaks and valleys of the speech spectrum. (The suprathreshold
`masking curves for limiting noise to a given level of subjective
`loudness will be similar in shape.) This characteristic
`behavior of the masking threshold function is more commonly
`associated with the spectral envelope of speech. However, by
`spectral peaks, we are referring not only to the formant peaks,
`but also to the pitch harmonic peaks for voiced speech. In
`other words, we believe that at least at the low frequency end,
`the masking threshold function also follows the pitch harmonic
`peaks and valleys to some extent.
`There is little psychophysical evidence to justify that superimposing
`tone masking curves will give the same qualitative
`behavior for pitch harmonic peaks. However, at least at the
`lower frequencies some justification can be given. Specifically,
`for the first few critical bands at the low frequency region
`of the spectrum, the bandwidths are only 100 Hz or slightly
`higher.¹ Hence, except for very low pitch male voices, it is not
`likely that two or more pitch harmonics will fall within a single
`critical band at the low frequency end, and therefore our ears
`should have enough frequency resolution there to resolve two
`adjacent pitch harmonic peaks. The effectiveness of our long-term
`postfilter appears to validate the exploitation of masking
`for pitch harmonics.
`If a speech coder can push the coding noise below the
`masking threshold function at all frequencies and maintain
`this over time as the speech spectrum changes, then the coded
`speech will be noise-free as far as our auditory perception is
`concerned. In practice, however, such an ideal coder is quite
`difficult to develop, especially at low bit-rates. Noise spectral
`shaping may help to obtain the desired shape of the noise
`spectrum. However, in most cases, lowering noise components
`at certain frequencies can only be achieved at the price of increased
`
`¹ Here we are referring to the critical bandwidths listed in Table I of Scharf's
`chapter in [8]. Scharf gives the following definition of the critical band: "As
`a purely empirical phenomenon, the critical band is that bandwidth at which
`subjective responses rather abruptly change."
`
`noise components at other frequencies [2]. Therefore,
`at very low encoding rates when the average level of coding
`noise is quite high, it is very difficult, if not impossible, to force
`noise below the threshold at all frequencies. The situation is
`similar to stepping on a balloon: when we use noise spectral
`shaping to reduce the noise components in the spectral valley
`regions, the noise components near formants will exceed the
`threshold; on the other hand, if we reduce the noise near
`formants, the noise in valley regions will exceed the threshold.
`Hence, at low encoding rates, noise spectral shaping alone is
`not adequate to make the coding noise inaudible.
`In speech perception, the formants of speech are perceptually
`much more important than spectral valley regions. Since
`we cannot push the noise below the threshold in both formant
`and valley regions at low encoding rates, a good strategy is
`to sacrifice valley regions and preserve the formants. This
`can be done in analysis-by-synthesis coders by tuning the
`perceptual weighting filter so that it keeps the noise below
`the masking threshold in formant regions. Of course, in doing
`so, the noise components in some of the valley regions may
`exceed the threshold. However, these noise components can
`later be made inaudible by attenuating them with a postfilter.
`In performing such attenuation, the speech components in
`valley regions will also be attenuated. Fortunately, the just
`noticeable difference (JND) for the intensity of spectral valleys
`can be as large as 10 dB [32]. In other words, the intensity
`of spectral valleys can be altered by as much as 10 dB before
`our ears can detect the difference. Therefore, by attenuating the
`components in spectral valleys, the postfilter only introduces
`minimal distortion in the speech signal, but it could achieve
`a substantial noise reduction.
`As an example, in the upper plot of Fig. 1, we show
`the spectrum of a segment of speech sampled at 8 kHz.
`Suppose during speech encoding we have used noise spectral
`shaping in such a way that the noise components around
`spectral peaks are below the masking threshold while the
`noise components in valley regions are not. Then, most
`of the perceived coding noise comes from spectral valleys,
`including the valleys between pitch harmonic peaks. In this
`case, a useful postfilter may have a frequency response shown
`in the lower plot of Fig. 1. This postfilter attenuates the
`frequency components between pitch harmonics as well as
`the components between formants. An important feature of
`this frequency response is that the three spectral envelope
`peaks corresponding to the three formants have roughly the
`same height. This feature ensures that the relative intensity
`of the three formants will remain roughly unchanged after
`postfiltering. This is essential to avoid the undesirable low-pass
`effect normally associated with previous postfiltering schemes.
`The frequency responses of previously developed postfilters
`often have an overall slope, or spectral tilt, which tends to
`follow the tilt of the speech spectrum. For voiced speech, the
`spectral envelope has a low-pass spectral tilt with roughly 6
`dB per octave spectral fall-off. This results from the net effect
`of the glottal source low-pass character and the lip radiation
`high frequency boost. Since speech quality is dominated by
`voiced sounds, many previous postfilters had low-pass spectral
`tilt most of the time. (The postfilter proposed in [13] is an
`
`
`Fig. 1. An example of the speech spectrum and the corresponding postfilter
`frequency response. (Horizontal axis: frequency in Hz, 0 to 4000.)
`
`Fig. 2. An example of the frequency response of the modified LPC synthesis
`filter $1/[1 - P(z/\alpha)]$ for different values of $\alpha$. Adjacent plots are separated
`by a 20 dB offset to enhance visibility. (Horizontal axis: frequency in Hz, 0 to 4000.)
`
`exception.) Thus, speech filtered by these postfilters often
`sounds muffled.
`Our goal was to develop a postfilter with attenuation between
`formant peaks and pitch harmonic peaks but without a
`spectral tilt (as in the example of Fig. 1). To accomplish this
`goal, we use two stages of postfiltering: a long-term postfilter
`and a short-term postfilter with spectral tilt compensation. The
`short-term postfilter has a frequency response similar to the
`spectral envelope of the frequency response in Fig. 1. The
`long-term postfilter adds the fine structure (closely spaced
`peaks) to the overall frequency response. These two filter
`stages and the combined postfilter are described in the next
`three sections.
`
`III. SHORT-TERM POSTFILTER
`The frequency response of an ideal short-term postfilter
`should follow the peaks and valleys of the spectral envelope
`of speech without giving an overall spectral tilt. In a predictive
`speech coder employing linear prediction, the synthesis filter
`(often called the LPC filter) has a frequency response which
`closely follows the spectral envelope of the input speech.
`Therefore, it is natural to derive the short-term postfilter from
`the LPC predictor.
`Let the transfer function of the LPC predictor be $P(z) = \sum_{i=1}^{M} a_i z^{-i}$,
`where $a_i$ is the $i$th LPC predictor coefficient and
`$M$ is the LPC predictor order, which is typically chosen as 10.
`The corresponding LPC synthesis filter has a transfer function
`of $1/[1 - P(z)]$, and its frequency response is often referred
`to as the LPC spectrum. The plot at the top of Fig. 2 shows
`an example of such an LPC spectrum for a voiced sound.
`If we scale down the radii of the poles of the LPC synthesis
`filter by a factor of $\alpha$, where $0 < \alpha < 1$ (that is, moving
`the poles radially toward the origin of the z-plane), then
`the corresponding modified filter has a transfer function of
`$1/[1 - P(z/\alpha)]$, where $P(z/\alpha) = \sum_{i=1}^{M} a_i \alpha^i z^{-i}$. The poles
`of $1/[1 - P(z)]$ are inside the unit circle since the LPC
`synthesis filters used in practical coders are stable filters.
`
`Therefore, the poles of $1/[1 - P(z/\alpha)]$ are not only inside
`but also farther away from the unit circle. Consequently, the
`frequency response of $1/[1 - P(z/\alpha)]$ has lower peaks with
`wider bandwidth than that of $1/[1 - P(z)]$. In Fig. 2, we show
`the frequency responses of the filter $1/[1 - P(z/\alpha)]$ for
`$\alpha = 1$, 0.9, 0.8, 0.5, and 0. As can be seen in Fig. 2, the frequency
`response becomes smoother and flatter as $\alpha$ decreases toward
`zero.
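The pole-scaling property can be verified on a small example. The sketch below is illustrative only: it uses a hypothetical second-order predictor (the text uses order 10) whose synthesis filter has one conjugate pole pair, and the helper names `expand_bandwidth` and `pole_radius` are not from the paper.

```python
import cmath
import math

def expand_bandwidth(a, alpha):
    """Return the coefficients of P(z/alpha), i.e. a_i * alpha**i for i = 1..M."""
    return [a_i * alpha ** (i + 1) for i, a_i in enumerate(a)]

def pole_radius(a):
    """Radius of the poles of 1/[1 - a1 z^-1 - a2 z^-2] (roots of z^2 - a1 z - a2)."""
    a1, a2 = a
    root = (a1 + cmath.sqrt(a1 * a1 + 4.0 * a2)) / 2.0
    return abs(root)

# Hypothetical 2nd-order predictor: pole pair at radius r and angle theta.
r, theta = 0.95, math.pi / 4
a = [2.0 * r * math.cos(theta), -r * r]

alpha = 0.8
radius_before = pole_radius(a)
radius_after = pole_radius(expand_bandwidth(a, alpha))
print(radius_before, radius_after)
```

Multiplying the $i$th coefficient by $\alpha^i$ moves the pole pair from radius $r$ to radius $\alpha r$ without changing its angle, which is why the resonance peak is lowered and broadened but stays at the same frequency.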
`As discussed above, the postfilter proposed in [15] for
`ADPCM moves the poles and zeros of the 2-pole, 6-zero
`synthesis filter toward the origin. If this idea is used in an
`LPC synthesis filter, the short-term postfilter will have the
`form $1/[1 - P(z/\alpha)]$. This form of short-term postfilter was
`indeed used by Yatsuzuka et al. [19] and Kroon and Atal
`[23]. Such a postfilter does reduce the perceived noise level.
`However, when coding noise is high, sufficient noise reduction
`is accompanied by muffled speech. This is due to the fact
`that the frequency response of this postfilter generally has a
`low-pass spectral tilt for voiced speech, as can be seen in
`Fig. 2.
`To reduce the spectral tilt of the all-pole postfilter $1/[1 - P(z/\alpha)]$,
`we added $M$ zeros with the same phase angles as
`the $M$ poles. The transfer function of the resulting pole-zero
`postfilter has the form
`
`$$H_s(z) = \frac{1 - P(z/\beta)}{1 - P(z/\alpha)}, \qquad 0 < \beta < \alpha < 1. \qquad (1)$$
`
`The frequency response of $H_s(z)$ can be expressed as
`
`$$20 \log |H_s(e^{j\omega})| = 20 \log \left| \frac{1}{1 - P(e^{j\omega}/\alpha)} \right| - 20 \log \left| \frac{1}{1 - P(e^{j\omega}/\beta)} \right|. \qquad (2)$$
`
`Therefore, in the logarithmic scale, the frequency response
`of $H_s(z)$ is simply the difference between the frequency
`responses of two modified LPC synthesis filters $1/[1 - P(z/\alpha)]$
`and $1/[1 - P(z/\beta)]$.
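The pole-zero response in (1) and (2) can be evaluated directly. The following Python sketch assumes a hypothetical second-order predictor with a single resonance at $\pi/4$; the function names and the test predictor are illustrative choices, while $\alpha = 0.8$ and $\beta = 0.5$ are the values quoted in the text for the VAPC coder.

```python
import cmath
import math

def lpc_poly(a, scale, w):
    """Evaluate 1 - P(z/scale) at z = exp(jw), with P(z) = sum_i a_i z^-i."""
    return 1.0 - sum(a_i * scale ** (i + 1) * cmath.exp(-1j * w * (i + 1))
                     for i, a_i in enumerate(a))

def short_term_response_db(a, w, alpha=0.8, beta=0.5):
    """20 log10 |H_s(e^{jw})| for H_s(z) = [1 - P(z/beta)] / [1 - P(z/alpha)]."""
    return 20.0 * math.log10(abs(lpc_poly(a, beta, w)) / abs(lpc_poly(a, alpha, w)))

# Hypothetical predictor: one resonance (formant-like peak) at theta = pi/4.
r, theta = 0.95, math.pi / 4
a = [2.0 * r * math.cos(theta), -r * r]

peak_db = short_term_response_db(a, theta)                  # at the resonance
valley_db = short_term_response_db(a, 3.0 * math.pi / 4.0)  # away from it
print(round(peak_db, 2), round(valley_db, 2))
```

The response is positive at the resonance and negative away from it, so the filter emphasizes the peak and attenuates the valley, as intended for the short-term section.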
`
`
`Fig. 3. Frequency response of the short-term postfilter $[1 - \mu z^{-1}][1 - P(z/\beta)]/[1 - P(z/\alpha)]$
`corresponding to the LPC spectrum in Fig. 2.
`The two plots were obtained with $\alpha = 0.8$, $\beta = 0.5$, and $\mu = 0$ or 0.5. The
`two plots are separated by a 20 dB offset to enhance visibility.
`
`The optimal values of $\alpha$ and $\beta$ depend on the bit-rate and
`the type of speech coder used, and they generally need to be
`determined empirically based on subjective listening tests. Our
`postfilter was originally developed for the 4.8 kb/s vector APC
`(VAPC) coder [22]. For that coder, we chose the parameters $\alpha$
`and $\beta$ to be 0.8 and 0.5, respectively. From Fig. 2, we see that
`the response of $1/[1 - P(z/\alpha)]$ for $\alpha = 0.8$ has both spectral
`tilt and formant peaks (although greatly smoothed), while the
`response for $\alpha = 0.5$ has spectral tilt only. Thus, with $\alpha = 0.8$
`and $\beta = 0.5$ in (2), we can remove the spectral tilt to a
`large extent by subtracting the response for $\alpha = 0.5$ from the
`response for $\alpha = 0.8$. The upper curve in Fig. 3 shows the
`resulting postfilter frequency response. (Note that the vertical
`scale in Fig. 3 has been amplified relative to Fig. 2.)
`In informal listening tests, we found that the low-pass
`effect was significantly reduced after the numerator term
`$[1 - P(z/\beta)]$ was included in the transfer function $H_s(z)$.
`However, the filtered speech was still slightly muffled. To
`further reduce the low-pass effect, we added a first-order filter
`with a transfer function of $[1 - \mu z^{-1}]$ in cascade with $H_s(z)$.
`The parameter $\mu$ was chosen to be 0.5 for the 4.8 kb/s VAPC
`coder. Such a filter provided a slightly high-pass spectral tilt
`and thus helped to reduce the low-pass effect. The lower curve
`in Fig. 3 shows the overall frequency response of the cascaded
`filter, which has the spectral tilt further reduced.
`The first-order filter $[1 - \mu z^{-1}]$ can be made adaptive to
`better track the spectral tilt of $H_s(z)$. In computer simulations,
`however, we found that a fixed filter with $\mu = 0.5$ gave quite
`satisfactory results. Therefore, for the VAPC postfilter, we used
`a fixed value of $\mu$ for simplicity.
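A time-domain form of the short-term section with tilt compensation can be sketched as follows. This is a minimal illustration, assuming zero initial filter state and a fixed set of predictor coefficients for the whole input; the function name and the first-order test predictor are hypothetical, and the automatic gain control mentioned in the abstract is not included.

```python
def short_term_postfilter(x, a, alpha=0.8, beta=0.5, mu=0.5):
    """Filter x through [1 - mu z^-1][1 - P(z/beta)] / [1 - P(z/alpha)].

    a holds the LPC predictor coefficients a_1..a_M of P(z) = sum_i a_i z^-i.
    Zero initial state is assumed (an assumption of this sketch).
    """
    num = [a_i * beta ** (i + 1) for i, a_i in enumerate(a)]   # zeros: a_i beta^i
    den = [a_i * alpha ** (i + 1) for i, a_i in enumerate(a)]  # poles: a_i alpha^i
    y = []    # output of the pole-zero section H_s(z)
    out = []  # output after the first-order tilt compensation
    for n, x_n in enumerate(x):
        acc = x_n
        for i in range(len(a)):
            k = n - (i + 1)
            if k >= 0:
                acc += den[i] * y[k] - num[i] * x[k]
        y.append(acc)
        previous = y[n - 1] if n > 0 else 0.0
        out.append(acc - mu * previous)
    return out

# Impulse response with a hypothetical first-order predictor a = [0.9].
impulse = [1.0] + [0.0] * 7
h_no_tilt = short_term_postfilter(impulse, [0.9], mu=0.0)
h_tilt = short_term_postfilter(impulse, [0.9], mu=0.5)
print(h_no_tilt[:3], h_tilt[:3])
```

In a coder, the coefficients would be updated frame by frame from the decoded LPC parameters, so a practical implementation would carry the filter memory across frames rather than restart from zero.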
`
`IV. LONG-TERM POSTFILTER
`The function of a long-term postfilter is to attenuate frequency
`components between pitch harmonic peaks. Again, no
`overall spectral tilt should be introduced. Such a long-term
`postfilter can be derived from the pitch predictor typically
`
`used in predictive coders like APC or CELP, because the pitch
`predictor contains the information about the pitch period and
`the degree of periodicity.
`In APC, VAPC, or CELP coders, a three-tap pitch predictor
`is frequently used [2], [22], [7]. The pitch synthesis filter corresponding
`to such a three-tap pitch predictor is not guaranteed
`to be stable. Since the poles of such a synthesis filter may be
`outside the unit circle, moving the poles toward the origin may
`not have the same effect as in a stable LPC synthesis filter (in
`terms of spectral peak reduction and bandwidth broadening).
`Even if the three-tap pitch synthesis filter is stabilized, as was
`done in VAPC, its frequency response may have an undesirable
`spectral tilt. In contrast, a long-term postfilter derived from a
`one-tap pitch predictor does not have these problems.
`Consider a one-tap pitch predictor with a transfer function
`of $[1 - g z^{-p}]$, where $g$ is the predictor coefficient and $p$
`is the pitch period (in terms of number of samples). The
`corresponding pitch synthesis filter is given by $1/[1 - g z^{-p}]$,
`which has $p$ poles with the same radius $g^{1/p}$ and uniformly
`spaced phase angles. Assume that $g$ is positive (which is
`normally the case); then the poles are located at phase angles
`$0, 2\pi/p, 4\pi/p, \ldots, (p-1)2\pi/p$, which correspond to the frequencies
`of pitch harmonics. Therefore, to achieve the desired
`spectral peaks at pitch harmonic frequencies, we initially
`choose the long-term postfilter as $1/[1 - \lambda z^{-p}]$, where $\lambda < 1$ is
`a suitably chosen coefficient. The upper curve of Fig. 4 shows
`a typical frequency response of such a postfilter with $\lambda = 0.5$
`and $p = 30$.
`Zeros can also be used in the long-term postfilter to provide
`more flexibility and more control of the frequency response.
`To keep the spectral peaks at the correct frequencies, the zeros
`should be placed at phase angles corresponding to the valleys
`between pitch harmonics, namely, $\pi/p, 3\pi/p, \ldots, (2p-1)\pi/p$.
`For positive $\gamma$, the polynomial $[1 + \gamma z^{-p}]$ has roots located at
`such phase angles. As an example, the middle curve of Fig. 4
`shows the frequency response of the filter $[1 + 0.5 z^{-30}]$.
`With both poles and zeros, the long-term postfilter can be
`represented by the transfer function
`
`$$H_l(z) = G_l \frac{1 + \gamma z^{-p}}{1 - \lambda z^{-p}} \qquad (3)$$
`
`where $G_l$ is an adaptive scaling factor. The bottom curve
`of Fig. 4 shows the frequency response for the example of
`$\gamma = \lambda = 0.25$, $p = 30$, and $G_l = 1$. Note that the pitch period
`$p$ in (3) should be the "true" pitch rather than the double pitch
`or triple pitch sometimes produced by some pitch detectors.
`In Fig. 4, if $p$ were double or triple the true pitch of 30,
`then the frequency response of $H_l(z)$ would have extraneous
`peaks between pitch harmonics. In this case, we would not get
`sufficient attenuation between pitch harmonics, which defeats
`the purpose of the long-term postfilter.
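The peak and valley behavior of the pole-zero long-term postfilter, and the effect of a doubled pitch estimate, can be checked with a few evaluations of its frequency response. The sketch below is illustrative; the function name is hypothetical, and the parameter values are those of the bottom curve of Fig. 4 with unit scaling factor.

```python
import cmath
import math

def long_term_response_db(w, p, gamma, lam, gain=1.0):
    """20 log10 |G (1 + gamma z^-p) / (1 - lam z^-p)| evaluated at z = exp(jw)."""
    z_p = cmath.exp(-1j * w * p)  # z^-p on the unit circle
    h = gain * (1.0 + gamma * z_p) / (1.0 - lam * z_p)
    return 20.0 * math.log10(abs(h))

p, gamma, lam = 30, 0.25, 0.25
harmonic = 2.0 * math.pi / p  # first pitch harmonic (radians per sample)
valley = math.pi / p          # midway between DC and the first harmonic

peak_db = long_term_response_db(harmonic, p, gamma, lam)
valley_db = long_term_response_db(valley, p, gamma, lam)

# With a doubled pitch estimate (2p), the former valley becomes a peak.
doubled_db = long_term_response_db(valley, 2 * p, gamma, lam)
print(round(peak_db, 2), round(valley_db, 2), round(doubled_db, 2))
```

With $\gamma = \lambda$, the gain at a harmonic is $(1+\gamma)/(1-\lambda)$ and the gain in a valley is its reciprocal, so peaks and valleys are symmetric in dB. The doubled-pitch case places a spurious peak exactly where attenuation was wanted.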
`We now describe how $\gamma$, $\lambda$, and $G_l$ are determined. The
`discussion above is based on the assumption that a voiced
`speech frame is encountered. In practice, however, there are
`also unvoiced frames as well as transition frames. Appropriate
`values of $\gamma$, $\lambda$, and $G_l$ should be chosen according to the
`"voicing" information, or degree of periodicity in speech.
`
`
`Fig. 4. Frequency response of the long-term postfilter $G_l[1 + \gamma z^{-p}]/[1 - \lambda z^{-p}]$.
`In all three plots, $p = 30$ and $G_l = 1$. Adjacent plots are
`separated by a 20 dB offset to enhance visibility. (Curve labels: $\gamma = 0$, $\lambda = 0.5$;
`$\gamma = 0.5$, $\lambda = 0$; $\gamma = \lambda = 0.25$. Horizontal axis: frequency in Hz.)
`
`A. Coefficient Determination
`We found that the tap weight g of the one-tap pitch predictor
`is a good indicator of voicing: It tends to be close to unity
`during steady-state voiced speech, and it approaches zero
`during unvoiced speech. In addition, at the beginning of
`some voiced speech segments where the amplitude gradually
`builds up, g tends to be greater than unity. On the other
`hand, in the trailing edge of some voiced segments where
`the amplitude gradually decreases, g tends to be less than
`unity. In both cases, g is a rough approximation of the ratio
`between the waveform amplitude of a pitch period and that of
`its preceding pitch period. Therefore, the tap weight g provides
`useful information about voicing and the change in the speech
`waveform envelope. For a three-tap pitch predictor
`$P_l(z) = b_1 z^{-p+1} + b_2 z^{-p} + b_3 z^{-p-1}$, we found that $b = b_1 + b_2 + b_3$,
`the sum of the three tap weights, plays a similar role. That is, it
`tends to be unity for voiced speech, zero for unvoiced speech,
`and so on.
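The interpretation of $g$ as an amplitude ratio between successive pitch periods can be illustrated with synthetic data. The sketch below assumes the standard least-squares estimate of a one-tap predictor coefficient, which is an assumption of this example and not a formula given in the text; the function names and the synthetic waveform are hypothetical.

```python
import math

def one_tap_pitch_gain(s, p):
    """Least-squares one-tap predictor gain: sum s[n]s[n-p] / sum s[n-p]^2.

    This estimator is assumed for illustration; the text does not specify
    how g is computed in the coder.
    """
    num = sum(s[n] * s[n - p] for n in range(p, len(s)))
    den = sum(s[n - p] ** 2 for n in range(p, len(s)))
    return num / den if den > 0.0 else 0.0

p = 30
cycle = [math.sin(2.0 * math.pi * k / p) for k in range(p)]

def synth(ratio, periods=4):
    """Periodic waveform whose amplitude is multiplied by `ratio` each period."""
    return [ratio ** (n // p) * cycle[n % p] for n in range(periods * p)]

g_onset = one_tap_pitch_gain(synth(1.2), p)   # amplitude building up
g_steady = one_tap_pitch_gain(synth(1.0), p)  # steady-state voiced
g_decay = one_tap_pitch_gain(synth(0.8), p)   # amplitude dying away
print(round(g_onset, 3), round(g_steady, 3), round(g_decay, 3))
```

For these idealized waveforms the estimate equals the per-period amplitude ratio, which matches the observation that $g$ exceeds unity at onsets and falls below unity on trailing edges.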
`In the postfilter of VAPC, we used $b$ as the voicing indicator
`since a three-tap pitch predictor is used in this coder and its
`parameters are available at the decoder. On the other hand,
`for coders with a single-tap pitch predictor, it is natural to use
`the tap weight g as the voicing indicator since it is readily
`available. For speech coders without a pitch predictor at all,
`and for speech coders that do not transmit the true pitch (such
`as those using an adaptive codebook, e.g., [33]), a long-term
`postfilter can still be used, but a separate pitch analysis on
`coded speech is needed as will be discussed later.