IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 3, NO. 1, JANUARY 1995

`Adaptive Postfiltering for Quality
`Enhancement of Coded Speech
`
`Juin-Hwey Chen, Senior Member, IEEE, and Allen Gersho, Fellow, IEEE
`
Abstract: An adaptive postfiltering algorithm for enhancing
the perceptual quality of coded speech is presented. The postfilter
consists of a long-term postfilter section in cascade with a short-term
postfilter section and includes spectral tilt compensation
and automatic gain control. The long-term section emphasizes
pitch harmonics and attenuates the spectral valleys between pitch
harmonics. The short-term section, on the other hand, emphasizes
speech formants and attenuates the spectral valleys between
formants. Both filter sections have poles and zeros. Unlike earlier
postfilters that often introduced a substantial amount of muffling
to the output speech, our postfilter significantly reduces this effect
by minimizing the spectral tilt in its frequency response. As a
result, this postfilter achieves noticeable noise reduction while
introducing only minimal distortion in speech. The complexity
of the postfilter is quite low. Variations of this postfilter are
now being used in several national and international speech
coding standards. This paper presents for the first time a complete
description of our original postfiltering algorithm and the
underlying ideas that motivated its development.
`
`I. INTRODUCTION
`
EARLY speech coders operating at high bit-rates were
usually designed to minimize the energy of quantization
`noise, or equivalently, to maximize the signal-to-noise ratio
`(SNR). In these traditional coders, the coding noise is roughly
`white, i.e., the noise spectrum is roughly flat. As the encoding
`rate goes down to 16 kb/s and below, the SNR also drops and
`the noise floor of this white coding noise is elevated to such
`an extent that it is very difficult, if not impossible, to keep it
`below the threshold of audibility.
`Two perceptually motivated approaches were proposed to
`deal with this problem. The first one uses noise spectral
`shaping at the speech encoder. This method was first proposed
`in the late 1970s by Atal, Schroeder, and Hall [2], [3] and
`by Makhoul and Berouti [4]. It has been used successfully
in adaptive predictive coding (APC) [2], [4], [5], multipulse
linear predictive coding (MPLPC) [6], and code-excited linear
prediction (CELP) coders [7]. The basic idea is to shape the
spectrum of the coding noise so that it follows the speech
spectrum to some extent. Roughly speaking, the ratio of signal-to-noise
power densities at each frequency should exceed
some minimum value that depends on frequency and the local
character of the speech signal. Coding noise spectrally shaped
in this way is less audible to human ears due to the noise-masking
effect of the human auditory system [3], [8], [9].
However, as will be discussed later, at low encoding rates,
noise spectral shaping alone is not sufficient to make the
coding noise inaudible.

Manuscript received February 24, 1994; approved May 9, 1994. This
work was performed for the Jet Propulsion Laboratory, California Institute of
Technology, sponsored by the National Aeronautics and Space Administration.
The associate editor coordinating the review of this paper and approving it
for publication was Dr. Spiros Dimolitsas.
J.-H. Chen was with the Department of Electrical and Computer Engineering,
University of California, Santa Barbara. He is now with the Speech
Coding Research Department, AT&T Bell Laboratories, Murray Hill, NJ
07974 USA.
A. Gersho is with the Center for Information Processing Research, Department
of Electrical and Computer Engineering, University of California, Santa
Barbara, CA 93106 USA.
IEEE Log Number 9406780.

`The second perceptually-based approach uses an adaptive
`postfilter at the speech decoder output. The use of an adaptive
`rather than fixed filter is based on the need to change the
`filtering operation according to the local character of the
speech spectrum. The idea of filtering speech with a "formant-equalized"
frequency response, or even the idea of enhancing
noisy speech with a filter having a speech-like frequency
response, dates back at least to a U.S. patent by Schroeder
in 1965 [10]. In 1981, Sondhi et al. used Schroeder's idea
of a "formant-equalized" frequency response in a speech enhancement
system [11]. In 1982, Malah and Cox reported the
`use of pitch-adaptive comb filtering as a speech enhancement
`technique [12].
`To the best of our knowledge, adaptive postfiltering as a
`postprocessing technique for speech coding was first proposed
`in 1981 by Smith and Allen for enhancing the output of an
`adaptive delta modulation (ADM) coder [13]. The postfilter
`they used was an adaptive low-pass filter implemented by
`a short-time Fourier analysis/synthesis method. The cutoff
`frequency of the low-pass filter was adaptive and was chosen
so that all spectral components above this frequency constituted
only 1% of the total energy of the input signal. This
`adaptive cutoff frequency needed to be transmitted as side
`information. By eliminating the "out-of-band" high frequency
`noise, this postfiltering technique improved the speech quality
`of a 16 kb/s ADM coder to the extent that it was comparable
`to a 24 kb/s ADM coder without postfiltering [13]. Jayant
`extended this postfiltering idea to the adaptive differential
pulse code modulation (ADPCM) coder [14]. Instead of the
`frequency-domain approach, he switched between a bank of
`four fixed-bandwidth low-pass finite impulse response (FIR)
`filters and achieved a similar perceptual improvement in
`ADPCM-coded speech.
`The use of postfiltering for speech coding did not become
`popular until 1984 when Ramamoorthy and Jayant proposed a
`new postfiltering technique described in [15] and in a U.S.
`patent [16]. A specific postfilter for 24 kb/s ADPCM was
`shown in [15], which also describes using the technique to
`enhance speech degraded by additive white Gaussian noise.
`The ADPCM postfilter proposed in [15] moves the poles
`and zeros of the synthesis filter radially toward the origin by
`suitably chosen factors. Such a postfilter reduces the perceived
`level of coding noise [15]. However, if the coding noise of
`ADPCM is high, sufficient noise reduction requires a "strong"
postfilter, which makes speech sound muffled [17] (similar to
`a low-pass filtering effect when the filter cutoff frequency is
`lower than the effective signal bandwidth). This postfiltering
`technique has been used in many applications, including the
`enhancement of 12 kb/s subband-coded speech as well as 24
`and 16 kb/s ADPCM-coded speech [18].
`In 1986, Yatsuzuka, Iizuka, and Yamazaki were the first
`to combine adaptive postfiltering and noise spectral shaping
in a speech coder, in this case a 4.8 to 16 kb/s variable-rate
APC coder [19]. Yatsuzuka et al. were also the first to
`propose explicitly an additional long-term postfilter section
`based on the pitch periodicity in speech. Their short-term
`and long-term postfilter sections were respectively obtained
`by moving the poles of short-term and long-term synthesis
`filters of APC toward the origin. This all-pole postfilter had
`the same muffling (or low-pass) effect mentioned above. An
`all-pole short-term postfilter for a sequentially-adaptive APC
`coder was also reported by Zarkadis and Evans in 1987 [20].
In his 1987 thesis [1], Chen described a postfilter which
`significantly reduced the low-pass effect. This postfilter was
`described in a U.S. patent [21]. The postfilter proposed in
`[1] contained elaborate long-term and short-term postfilter
`sections which achieved significant noise reduction without
`making the speech sound muffled. The short-term postfilter
`section of this postfilter was reported by Chen and Gersho [22]
in 1987. At the same time, Kroon and Atal proposed the use of
postfiltering in a CELP coder [23]. The postfilter they used was
essentially the same as the postfilter proposed by Yatsuzuka
et al. [19] for APC. Also at the same time, Veeneman and
`Mazor described an improved version of Malah and Cox's
`pitch-adaptive comb filter [12], with both coefficients and pitch
`period adapted for enhancing block-coded speech [24].
`Since 1987, the use of our postfiltering algorithm [1], [22]
`in CELP-like coders has become very popular. Recently,
`different variations of our postfiltering algorithm have been
`incorporated into several national and international speech
`coding standards. These include the U.S. Federal Standard 4.8
`kb/s CELP (FS1016) [25], the North American digital cellular
`radio standard 8 kb/s VSELP (IS-54) [26], [27], the Japanese
`digital cellular radio standard 6.7 kb/s VSELP (JDC), and the
recently adopted CCITT standard 16 kb/s low-delay CELP
(Recommendation G.728) [28], [29]. Recently, a frequency
domain method for adaptive postfiltering to suppress noise in
spectral valleys was reported by Wang et al. [30].
`In this paper, we present for the first time a complete
`description of our original postfiltering algorithm in [1] and
`the underlying ideas that motivated its development. We start
`with an explanation of the principle and philosophy of our
`postfilter design in Section II. This is followed by a description
`of the short-term postfilter and the long-term postfilter in
`Sections III and IV, respectively. Next, we describe in Section
V the structure and operation of the combined postfilter (with
both short-term and long-term sections). In Section VI, we
`comment on the performance of the postfilter. We then discuss
`in Section VII the variations of our postfiltering algorithm that
`are currently used in the speech coding standards mentioned
`above. Our conclusions are given in Section VIII.
`
`II. NOISE MASKING AND POSTFILTERING
`The classical Wiener theory of optimal filtering tells how
`to optimally filter a noise contaminated signal to minimize
`the noise power at the filter output. The theory shows that
for a signal with power spectral density $S(\omega)$ contaminated
by independent additive noise with spectral density $N(\omega)$, the
optimal filter transfer function for minimizing mean squared
error (MSE) between the filter output and the original signal
is given by $H(\omega) = S(\omega)/[S(\omega) + N(\omega)]$. See, for example,
[31]. Thus, in frequency bands where the signal-to-noise power
`density ratio (SNR) is large, the filter gain is approximately
`unity and in bands where the SNR is small the filter gain
`is very small. For postfiltering of coded speech, this theory
`suggests that we seek a filter whose transfer function has a
`magnitude that depends on the SNR at each frequency and
`that, at least qualitatively, follows the above behavior. Such a
filter would necessarily be adaptive in order to track the time-varying
spectral character of the speech signal. Of course, the
performance objective should really be perceived quality rather
than MSE. Therefore, even if the ideal Wiener filter could be
computed, it would not be optimal for speech enhancement.
Nevertheless, the theory provides a conceptual starting point
for the search for an effective postfiltering technique. Perceptual
considerations are needed to find an effective trade-off
`between noise reduction and signal distortion resulting from
`a postfiltering operation.
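
The qualitative behavior of the Wiener gain described above can be illustrated with a minimal sketch. The per-band power densities below are hypothetical values chosen only to span high and low SNR; they are not taken from the paper.

```python
def wiener_gain(signal_psd, noise_psd):
    """Return the per-band gain S / (S + N) for two lists of power densities."""
    return [s / (s + n) for s, n in zip(signal_psd, noise_psd)]


# Hypothetical per-band power densities in linear units (not from the paper).
signal_psd = [100.0, 10.0, 1.0, 0.1]
noise_psd = [1.0, 1.0, 1.0, 1.0]

gains = wiener_gain(signal_psd, noise_psd)
for s, n, h in zip(signal_psd, noise_psd, gains):
    print(f"SNR {s / n:6.1f} -> gain {h:.3f}")
```

Bands with large SNR pass almost unchanged, while bands with small SNR are strongly attenuated, which is the behavior a postfilter should follow at least qualitatively.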
In [15], Ramamoorthy and Jayant explained from an intuitive
perspective (rather than from a Wiener filtering perspective)
why adaptive postfiltering could reduce perceived
`noise. In this section, we give another explanation that takes
`into account auditory masking of noise, based on established
`properties of the human hearing system. We also describe the
`general philosophy of our postfilter design.
`Given a pure tone with a certain frequency and intensity,
`for a normal listener there is a masking threshold function
`associated with this tone such that if noise is added to the
`tone and the power spectrum of the noise is strictly below
`the masking threshold at all frequencies, that noise will be
`inaudible, i.e., it will be completely masked by the tone [9].
`In general, the masking threshold has a peak at the frequency
`of the tone, and monotonically decreases on both sides of
`the peak. This means the noise components near the tone
`frequency are allowed to have higher intensities than other
`noise components that are farther away from that frequency
`while remaining inaudible.
Some limited studies have also been performed on
`suprathreshold masking that reduces the loudness of the noise
`rather than making the noise completely inaudible [3]. In this
`case, a pulsating narrow-band noise burst is above the masking
`threshold and is partially masked by a masker tone (i.e., has a
`reduced noise loudness). From the experimental data in [3], it
`can be seen that for a given loudness of the partially masked
`noise, the intensity of the noise varies as a function of the
`difference between the center frequency of the narrow-band
`noise and the frequency of the masker tone. Such a function
`generally has a shape similar to that of the masking threshold
`function. Consequently, even for low-bit-rate speech coding
`when suprathreshold masking is present and it is difficult to
`make the noise inaudible, the masking threshold function still
`provides a useful guideline for reducing noise loudness.
`A short segment of a speech signal can be considered as
`a superposition of many sine waves. If each of these sine
`waves were presented alone to a normal listener, there would
`be an associated masking threshold function with a peak at
`the frequency of that sine wave. When all such sine waves
`are superimposed, their associated threshold functions must
`also superimpose. Exactly how these functions interact with
`each other is unknown. However, no matter how complicated
`the interaction might be, there must exist an overall masking
`threshold function for the given segment of speech signal
`such that an added noise will be inaudible if its power
`spectrum is below the threshold at all frequencies. The overall
`masking threshold function follows to some extent the spectral
`peaks and valleys of the speech spectrum. (The suprathreshold
masking curves for limiting noise to a given level of subjective
loudness will be similar in shape.) This characteristic
`behavior of the masking threshold function is more commonly
`associated with the spectral envelope of speech. However, by
`spectral peaks, we are referring not only to the formant peaks,
`but also to the pitch harmonic peaks for voiced speech. In
`other words, we believe that at least at the low frequency end,
`the masking threshold function also follows the pitch harmonic
`peaks and valleys to some extent.
There is little psychophysical evidence to justify that superimposing
tone masking curves will give the same qualitative
behavior for pitch harmonic peaks. However, at least at the
lower frequencies some justification can be given. Specifically,
for the first few critical bands at the low frequency region
of the spectrum, the bandwidths are only 100 Hz or slightly
higher.¹ Hence, except for very low pitch male voices, it is not
likely that two or more pitch harmonics will fall within a single
critical band at the low frequency end, and therefore our ears
should have enough frequency resolution there to resolve two
adjacent pitch harmonic peaks. The effectiveness of our long-term
postfilter appears to validate the exploitation of masking
for pitch harmonics.
`If a speech coder can push the coding noise below the
`masking threshold function at all frequencies and maintain
`this over time as the speech spectrum changes, then the coded
`speech will be noise-free as far as our auditory perception is
`concerned. In practice, however, such an ideal coder is quite
`difficult to develop, especially at low bit-rates. Noise spectral
`shaping may help to obtain the desired shape of the noise
spectrum. However, in most cases, lowering noise components
at certain frequencies can only be achieved at the price of increased
noise components at other frequencies [2]. Therefore,
at very low encoding rates when the average level of coding
noise is quite high, it is very difficult, if not impossible, to force
noise below the threshold at all frequencies. The situation is
similar to stepping on a balloon: when we use noise spectral
shaping to reduce the noise components in the spectral valley
regions, the noise components near formants will exceed the
threshold; on the other hand, if we reduce the noise near
formants, the noise in valley regions will exceed the threshold.
Hence, at low encoding rates, noise spectral shaping alone is
not adequate to make the coding noise inaudible.

¹ Here we are referring to the critical bandwidths listed in Table I of Scharf's
chapter in [8]. Scharf gives the following definition of the critical band: "As
a purely empirical phenomenon, the critical band is that bandwidth at which
subjective responses rather abruptly change."

In speech perception, the formants of speech are perceptually
much more important than spectral valley regions. Since
`we cannot push the noise below the threshold in both formant
`and valley regions at low encoding rates, a good strategy is
`to sacrifice valley regions and preserve the formants. This
`can be done in analysis-by-synthesis coders by tuning the
`perceptual weighting filter so that it keeps the noise below
`the masking threshold in formant regions. Of course, in doing
`so, the noise components in some of the valley regions may
`exceed the threshold. However, these noise components can
`later be made inaudible by attenuating them with a postfilter.
`In performing such attenuation, the speech components in
`valley regions will also be attenuated. Fortunately, the just
`noticeable difference (JND) for the intensity of spectral valleys
`can be as large as 10 dB [32]. In other words, the intensity
`of spectral valleys can be altered by as much as 10 dB before
`our ears can detect the difference. Therefore, by attenuating the
`components in spectral valleys, the postfilter only introduces
`minimal distortion in the speech signal, but it could achieve
`a substantial noise reduction.
As an example, in the upper plot of Fig. 1, we show
`the spectrum of a segment of speech sampled at 8 kHz.
`Suppose during speech encoding we have used noise spectral
`shaping in such a way that the noise components around
`spectral peaks are below the masking threshold while the
`noise components in valley regions are not. Then, most
`of the perceived coding noise comes from spectral valleys,
`including the valleys between pitch harmonic peaks. In this
`case, a useful postfilter may have a frequency response shown
`in the lower plot of Fig. 1. This postfilter attenuates the
`frequency components between pitch harmonics as well as
`the components between formants. An important feature of
`this frequency response is that the three spectral envelope
`peaks corresponding to the three formants have roughly the
`same height. This feature ensures that the relative intensity
`of the three formants will remain roughly unchanged after
`postfiltering. This is essential to avoid the undesirable low-pass
`effect normally associated with previous postfiltering schemes.
`The frequency responses of previously developed postfilters
`often have an overall slope, or spectral tilt, which tends to
`follow the tilt of the speech spectrum. For voiced speech, the
`spectral envelope has a low-pass spectral tilt with roughly 6
`dB per octave spectral fall-off. This results from the net effect
`of the glottal source low-pass character and the lip radiation
high frequency boost. Since speech quality is dominated by
voiced sounds, many previous postfilters had low-pass spectral
tilt most of the time. (The postfilter proposed in [13] is an
exception.) Thus, speech filtered by these postfilters often
sounds muffled.

Fig. 1. An example of the speech spectrum and the corresponding postfilter
frequency response. (Horizontal axis: frequency in Hz, 0 to 4000.)

Fig. 2. An example of the frequency response of the modified LPC synthesis
filter $1/[1 - P(z/\alpha)]$ for different values of $\alpha$. Adjacent plots are separated
by a 20 dB offset to enhance visibility. (Horizontal axis: frequency in Hz, 0 to 4000.)

Our goal was to develop a postfilter with attenuation between
formant peaks and pitch harmonic peaks but without a
`spectral tilt (as in the example of Fig. 1). To accomplish this
`goal, we use two stages of postfiltering: a long-term postfilter
`and a short-term postfilter with spectral tilt compensation. The
`short-term postfilter has a frequency response similar to the
`spectral envelope of the frequency response in Fig. 1. The
`long-term postfilter adds the fine structure (closely spaced
`peaks) to the overall frequency response. These two filter
`stages and the combined postfilter are described in the next
`three sections.
`
III. SHORT-TERM POSTFILTER
`The frequency response of an ideal short-term postfilter
`should follow the peaks and valleys of the spectral envelope
`of speech without giving an overall spectral tilt. In a predictive
`speech coder employing linear prediction, the synthesis filter
(often called the LPC filter) has a frequency response which
`closely follows the spectral envelope of the input speech.
`Therefore, it is natural to derive the short-term postfilter from
`the LPC predictor.
Let the transfer function of the LPC predictor be
$P(z) = \sum_{i=1}^{M} a_i z^{-i}$, where $a_i$ is the $i$th LPC predictor coefficient and
$M$ is the LPC predictor order, which is typically chosen as 10.
The corresponding LPC synthesis filter has a transfer function
of $1/[1 - P(z)]$, and its frequency response is often referred
to as the LPC spectrum. The plot at the top of Fig. 2 shows
an example of such an LPC spectrum for a voiced sound.
If we scale down the radii of the poles of the LPC synthesis
filter by a factor of $\alpha$, where $0 < \alpha < 1$ (that is, moving
the poles radially toward the origin of the z-plane), then
the corresponding modified filter has a transfer function of
$1/[1 - P(z/\alpha)]$, where $P(z/\alpha) = \sum_{i=1}^{M} a_i \alpha^i z^{-i}$. The poles
of $1/[1 - P(z)]$ are inside the unit circle since the LPC
synthesis filters used in practical coders are stable filters.
Therefore, the poles of $1/[1 - P(z/\alpha)]$ are not only inside
but also farther away from the unit circle. Consequently, the
frequency response of $1/[1 - P(z/\alpha)]$ has lower peaks with
wider bandwidth than that of $1/[1 - P(z)]$. In Fig. 2, we show
the frequency responses of the filter $1/[1 - P(z/\alpha)]$ for $\alpha$ =
1, 0.9, 0.8, 0.5, and 0. As can be seen in Fig. 2, the frequency
response becomes smoother and flatter as $\alpha$ decreases toward
zero.
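
The pole-scaling operation above amounts to multiplying the $i$th predictor coefficient by $\alpha^i$. The following is a minimal sketch of that step, assuming a hypothetical second-order predictor with a single resonance (the function names, the pole radius 0.95, and the angle $\pi/4$ are illustrative choices, not values from the paper).

```python
import cmath
import math


def expand_bandwidth(a, alpha):
    """Coefficients of P(z/alpha): the i-th coefficient a_i becomes a_i * alpha**i."""
    return [a_i * alpha ** i for i, a_i in enumerate(a, start=1)]


def synthesis_db(a, omega):
    """Magnitude in dB of 1/[1 - P(e^{j*omega})], with P(z) = sum_i a_i z^-i."""
    p = sum(a_i * cmath.exp(-1j * omega * i) for i, a_i in enumerate(a, start=1))
    return -20.0 * math.log10(abs(1.0 - p))


# Hypothetical second-order predictor with one pole pair at radius r, angle theta:
# 1 - P(z) = 1 - 2 r cos(theta) z^-1 + r^2 z^-2.
r, theta = 0.95, math.pi / 4
a = [2.0 * r * math.cos(theta), -r * r]

for alpha in (1.0, 0.9, 0.8, 0.5, 0.0):
    scaled = expand_bandwidth(a, alpha)
    radius = math.sqrt(-scaled[1])  # pole radius of the modified filter
    print(f"alpha={alpha:.1f} radius={radius:.3f} peak={synthesis_db(scaled, theta):6.2f} dB")
```

In this toy case the pole radius becomes $\alpha r$ and the response at the resonance frequency drops as $\alpha$ decreases, reaching a flat 0 dB response at $\alpha = 0$.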
As discussed above, the postfilter proposed in [15] for
ADPCM moves the poles and zeros of the 2-pole, 6-zero
synthesis filter toward the origin. If this idea is used in an
LPC synthesis filter, the short-term postfilter will have the
form $1/[1 - P(z/\alpha)]$. This form of short-term postfilter was
indeed used by Yatsuzuka et al. [19] and Kroon and Atal
[23]. Such a postfilter does reduce the perceived noise level.
However, when coding noise is high, sufficient noise reduction
is accompanied by muffled speech. This is due to the fact
that the frequency response of this postfilter generally has a
low-pass spectral tilt for voiced speech, as can be seen in
Fig. 2.
To reduce the spectral tilt of the all-pole postfilter
$1/[1 - P(z/\alpha)]$, we added $M$ zeros with the same phase angles as
the $M$ poles. The transfer function of the resulting pole-zero
postfilter has the form
`
$$H_s(z) = \frac{1 - P(z/\beta)}{1 - P(z/\alpha)}, \qquad 0 < \beta < \alpha < 1. \tag{1}$$

The frequency response of $H_s(z)$ can be expressed as

$$20 \log |H_s(e^{j\omega})| = 20 \log \left| \frac{1}{1 - P(e^{j\omega}/\alpha)} \right| - 20 \log \left| \frac{1}{1 - P(e^{j\omega}/\beta)} \right|. \tag{2}$$

Therefore, in the logarithmic scale, the frequency response
of $H_s(z)$ is simply the difference between the frequency
responses of two modified LPC synthesis filters $1/[1 - P(z/\alpha)]$
and $1/[1 - P(z/\beta)]$.
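
The log-domain identity in (2) can be checked numerically. The sketch below is only an illustration under stated assumptions: the second-order predictor is hypothetical, the function names are invented for this example, and $\alpha = 0.8$, $\beta = 0.5$ are the values the paper reports for the 4.8 kb/s VAPC coder.

```python
import cmath
import math


def inverse_filter(a, scale, omega):
    """Evaluate 1 - P(e^{j*omega}/scale) = 1 - sum_i a_i * scale**i * e^{-j*omega*i}."""
    return 1.0 - sum(
        a_i * scale ** i * cmath.exp(-1j * omega * i)
        for i, a_i in enumerate(a, start=1)
    )


def short_term_db(a, alpha, beta, omega):
    """Magnitude in dB of H_s = [1 - P(z/beta)] / [1 - P(z/alpha)] at z = e^{j*omega}."""
    h = inverse_filter(a, beta, omega) / inverse_filter(a, alpha, omega)
    return 20.0 * math.log10(abs(h))


def all_pole_db(a, scale, omega):
    """Magnitude in dB of the modified synthesis filter 1/[1 - P(z/scale)]."""
    return -20.0 * math.log10(abs(inverse_filter(a, scale, omega)))


# Hypothetical second-order predictor (one resonance at angle theta), not from the paper.
r, theta = 0.95, math.pi / 4
a = [2.0 * r * math.cos(theta), -r * r]
alpha, beta = 0.8, 0.5  # values reported for the 4.8 kb/s VAPC coder

for omega in (0.0, theta, math.pi / 2, math.pi):
    direct = short_term_db(a, alpha, beta, omega)
    difference = all_pole_db(a, alpha, omega) - all_pole_db(a, beta, omega)
    print(f"omega={omega:.3f} H_s={direct:6.2f} dB difference={difference:6.2f} dB")
```

The two columns agree at every frequency, and for this toy predictor the postfilter boosts the resonance while attenuating the region far from it.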
`
Fig. 3. Frequency response of the short-term postfilter
$[1 - \mu z^{-1}][1 - P(z/\beta)]/[1 - P(z/\alpha)]$ corresponding to the LPC spectrum in Fig. 2.
The two plots were obtained with $\alpha = 0.8$, $\beta = 0.5$, and $\mu = 0$ or 0.5. The
two plots are separated by a 20 dB offset to enhance visibility. (Horizontal axis: frequency in Hz, 0 to 4000.)
`
The optimal values of $\alpha$ and $\beta$ depend on the bit-rate and
the type of speech coder used, and they generally need to be
determined empirically based on subjective listening tests. Our
postfilter was originally developed for the 4.8 kb/s vector APC
(VAPC) coder [22]. For that coder, we chose the parameters $\alpha$
and $\beta$ to be 0.8 and 0.5, respectively. From Fig. 2, we see that
the response of $1/[1 - P(z/\alpha)]$ for $\alpha = 0.8$ has both spectral
tilt and formant peaks (although greatly smoothed), while the
response for $\alpha = 0.5$ has spectral tilt only. Thus, with $\alpha = 0.8$
and $\beta = 0.5$ in (2), we can remove the spectral tilt to a
large extent by subtracting the response for $\alpha = 0.5$ from the
response for $\alpha = 0.8$. The upper curve in Fig. 3 shows the
resulting postfilter frequency response. (Note that the vertical
scale in Fig. 3 has been amplified relative to Fig. 2.)
In informal listening tests, we found that the low-pass
effect was significantly reduced after the numerator term
$[1 - P(z/\beta)]$ was included in the transfer function $H_s(z)$.
However, the filtered speech was still slightly muffled. To
further reduce the low-pass effect, we added a first-order filter
with a transfer function of $[1 - \mu z^{-1}]$ in cascade with $H_s(z)$.
The parameter $\mu$ was chosen to be 0.5 for the 4.8 kb/s VAPC
coder. Such a filter provided a slightly high-pass spectral tilt
and thus helped to reduce the low-pass effect. The lower curve
in Fig. 3 shows the overall frequency response of the cascaded
filter, which has the spectral tilt further reduced.
The first-order filter $[1 - \mu z^{-1}]$ can be made adaptive to
better track the spectral tilt of $H_s(z)$. In computer simulations,
however, we found that a fixed filter with $\mu = 0.5$ gave quite
satisfactory results. Therefore, for the VAPC postfilter, we used
a fixed value of $\mu$ for simplicity.
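
To make the high-pass tilt of the first-order section concrete, the following minimal sketch evaluates the magnitude of $[1 - \mu z^{-1}]$ on the unit circle for $\mu = 0.5$. The function name is illustrative; only the filter form and the value of $\mu$ come from the text.

```python
import cmath
import math


def tilt_db(mu, omega):
    """Magnitude in dB of the first-order filter 1 - mu * z^-1 at z = e^{j*omega}."""
    return 20.0 * math.log10(abs(1.0 - mu * cmath.exp(-1j * omega)))


mu = 0.5  # fixed value reported for the 4.8 kb/s VAPC coder
for k in range(5):
    omega = k * math.pi / 4
    print(f"omega={omega:.3f} rad gain={tilt_db(mu, omega):6.2f} dB")
```

The gain rises monotonically from $1 - \mu = 0.5$ (about -6 dB) at zero frequency to $1 + \mu = 1.5$ (about +3.5 dB) at half the sampling rate, which is the mild high-pass tilt used to offset the residual low-pass effect.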
`
IV. LONG-TERM POSTFILTER
The function of a long-term postfilter is to attenuate frequency
components between pitch harmonic peaks. Again, no
overall spectral tilt should be introduced. Such a long-term
postfilter can be derived from the pitch predictor typically
used in predictive coders like APC or CELP, because the pitch
predictor contains the information about the pitch period and
the degree of periodicity.
`In APC, VAPC, or CELP coders, a three-tap pitch predictor
is frequently used [2], [22], [7]. The pitch synthesis filter corresponding
to such a three-tap pitch predictor is not guaranteed
`to be stable. Since the poles of such a synthesis filter may be
`outside the unit circle, moving the poles toward the origin may
`not have the same effect as in a stable LPC synthesis filter (in
`terms of spectral peak reduction and bandwidth broadening).
`Even if the three-tap pitch synthesis filter is stabilized, as was
`done in VAPC, its frequency response may have an undesirable
`spectral tilt. In contrast, a long-term postfilter derived from a
`one-tap pitch predictor does not have these problems.
Consider a one-tap pitch predictor with a transfer function
of $[1 - g z^{-p}]$, where $g$ is the predictor coefficient and $p$
is the pitch period (in terms of number of samples). The
corresponding pitch synthesis filter is given by $1/[1 - g z^{-p}]$,
which has $p$ poles with the same radius $g^{1/p}$ and uniformly
spaced phase angles. Assume that $g$ is positive (which is
normally the case); then the poles are located at phase angles
$0, 2\pi/p, 4\pi/p, \ldots, (p-1)2\pi/p$, which correspond to the frequencies
of pitch harmonics. Therefore, to achieve the desired
spectral peaks at pitch harmonic frequencies, we initially
choose the long-term postfilter as $1/[1 - \lambda z^{-p}]$, where $\lambda < 1$ is
a suitably chosen coefficient. The upper curve of Fig. 4 shows
a typical frequency response of such a postfilter with $\lambda = 0.5$
and $p = 30$.
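
The pole geometry described above can be verified with a short sketch. It assumes a hypothetical coefficient $g = 0.5$ and the 8 kHz sampling rate of the Fig. 1 example; the function and variable names are illustrative.

```python
import cmath
import math


def pitch_synthesis_poles(g, p):
    """Poles of 1/[1 - g z^-p] for g > 0: the p roots of z**p = g."""
    radius = g ** (1.0 / p)
    return [radius * cmath.exp(2j * math.pi * k / p) for k in range(p)]


g, p = 0.5, 30           # p = 30 matches the Fig. 4 example; g is a hypothetical value
sample_rate_hz = 8000.0  # assumes the 8 kHz sampling rate used in the Fig. 1 example

poles = pitch_synthesis_poles(g, p)
harmonics_hz = [k * sample_rate_hz / p for k in range(p // 2 + 1)]

print(f"pole radius = {abs(poles[0]):.4f}")
print(f"harmonic spacing = {harmonics_hz[1]:.1f} Hz")
```

Each pole satisfies $z^p = g$, and the pole angles $2\pi k/p$ map to multiples of the sampling rate divided by $p$, that is, to the pitch harmonics.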
Zeros can also be used in the long-term postfilter to provide
more flexibility and more control of the frequency response.
To keep the spectral peaks at the correct frequencies, the zeros
should be placed at phase angles corresponding to the valleys
between pitch harmonics, namely, $\pi/p, 3\pi/p, \ldots, (2p-1)\pi/p$.
For positive $\gamma$, the polynomial $[1 + \gamma z^{-p}]$ has roots located at
such phase angles. As an example, the middle curve of Fig. 4
shows the frequency response of the filter $[1 + 0.5 z^{-30}]$.
With both poles and zeros, the long-term postfilter can be
represented by the transfer function

$$H_l(z) = G_l \frac{1 + \gamma z^{-p}}{1 - \lambda z^{-p}}, \tag{3}$$

`
where $G_l$ is an adaptive scaling factor. The bottom curve
of Fig. 4 shows the frequency response for the example of
$\gamma = \lambda = 0.25$, $p = 30$, and $G_l = 1$. Note that the pitch period
$p$ in (3) should be the "true" pitch rather than the double pitch
or triple pitch sometimes produced by some pitch detectors.
In Fig. 4, if $p$ were double or triple the true pitch of 30,
then the frequency response of $H_l(z)$ would have extraneous
peaks between pitch harmonics. In this case, we would not get
sufficient attenuation between pitch harmonics, which defeats
the purpose of the long-term postfilter.
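
The effect of a pitch-doubling error can be seen by evaluating the pole-zero long-term postfilter at a harmonic and at the valley next to it. This is a minimal sketch with $\gamma = \lambda = 0.25$, $p = 30$, and unit scaling factor as in the Fig. 4 example; the function name and the choice of test frequencies are illustrative.

```python
import cmath
import math


def long_term_db(gamma, lam, p, omega, gain=1.0):
    """Magnitude in dB of G_l [1 + gamma z^-p] / [1 - lam z^-p] at z = e^{j*omega}."""
    z_neg_p = cmath.exp(-1j * omega * p)
    h = gain * (1.0 + gamma * z_neg_p) / (1.0 - lam * z_neg_p)
    return 20.0 * math.log10(abs(h))


gamma = lam = 0.25
true_pitch = 30
harmonic = 2.0 * math.pi / true_pitch   # first pitch harmonic
valley = math.pi / true_pitch           # midway between DC and the first harmonic

print(f"true pitch, harmonic: {long_term_db(gamma, lam, true_pitch, harmonic):+.2f} dB")
print(f"true pitch, valley:   {long_term_db(gamma, lam, true_pitch, valley):+.2f} dB")
print(f"double pitch, valley: {long_term_db(gamma, lam, 2 * true_pitch, valley):+.2f} dB")
```

With the true pitch, the gain is $(1+\gamma)/(1-\lambda)$ at a harmonic and $(1-\gamma)/(1+\lambda)$ in a valley, roughly +4.4 dB and -4.4 dB here. With the doubled pitch period, the same valley frequency receives the full peak gain, so the noise between harmonics is emphasized rather than attenuated.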
We now describe how $\gamma$, $\lambda$, and $G_l$ are determined. The
discussion above is based on the assumption that a voiced
speech frame is encountered. In practice, however, there are
also unvoiced frames as well as transition frames. Appropriate
values of $\gamma$, $\lambda$, and $G_l$ should be chosen according to the
"voicing" information, or degree of periodicity in speech.
`
Fig. 4. Frequency response of the long-term postfilter
$G_l[1 + \gamma z^{-p}]/[1 - \lambda z^{-p}]$. In all three plots, $p = 30$ and $G_l = 1$. Adjacent plots are
separated by a 20 dB offset to enhance visibility. (Curves, top to bottom:
$\gamma = 0$, $\lambda = 0.5$; $\gamma = 0.5$, $\lambda = 0$; $\gamma = \lambda = 0.25$. Horizontal axis: frequency in Hz, 0 to 4000.)
`
`A. Coefficient Determination
`We found that the tap weight g of the one-tap pitch predictor
`is a good indicator of voicing: It tends to be close to unity
`during steady-state voiced speech, and it approaches zero
`during unvoiced speech. In addition, at the beginning of
`some voiced speech segments where the amplitude gradually
`builds up, g tends to be greater than unity. On the other
`hand, in the trailing edge of some voiced segments where
`the amplitude gradually decreases, g tends to be less than
`unity. In both cases, g is a rough approximation of the ratio
`between the waveform amplitude of a pitch period and that of
`its preceding pitch period. Therefore, the tap weight g provides
`useful information about voicing and the change in the speech
waveform envelope. For a three-tap pitch predictor
$P_1(z) = b_1 z^{-p+1} + b_2 z^{-p} + b_3 z^{-p-1}$, we found that $b = b_1 + b_2 + b_3$,
`the sum of the three tap weights, plays a similar role. That is, it
`tends to be unity for voiced speech, zero for unvoiced speech,
`and so on.
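
The interpretation of $g$ as an amplitude ratio between consecutive pitch periods can be illustrated with a small sketch. It assumes the common least-squares estimate of the one-tap weight, which the text above does not spell out, and uses a hypothetical six-sample waveform; all names are illustrative.

```python
def one_tap_weight(s, p):
    """Least-squares weight g minimizing sum over n of (s[n] - g * s[n - p])**2."""
    num = sum(s[n] * s[n - p] for n in range(p, len(s)))
    den = sum(s[n - p] ** 2 for n in range(p, len(s)))
    return num / den if den > 0.0 else 0.0


def periodic_signal(period, n_periods, ratio):
    """Repeat one pitch period, scaling the amplitude by `ratio` each period."""
    return [x * ratio ** k for k in range(n_periods) for x in period]


period = [0.0, 1.0, 0.5, -0.3, -1.0, -0.2]  # hypothetical waveform, p = 6 samples
p = len(period)

g_onset = one_tap_weight(periodic_signal(period, 5, 1.2), p)
g_steady = one_tap_weight(periodic_signal(period, 5, 1.0), p)
g_decay = one_tap_weight(periodic_signal(period, 5, 0.8), p)
print(f"onset g={g_onset:.3f} steady g={g_steady:.3f} decay g={g_decay:.3f}")
```

Under this estimate, a waveform that grows by a fixed factor each period yields a weight above unity, a steady waveform yields unity, and a decaying waveform yields a weight below unity, in line with the observations above.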
In the postfilter of VAPC, we used $b$ as the voicing indicator
`since a three-tap pitch predictor is used in this coder and its
`parameters are available at the decoder. On the other hand,
`for coders with a single-tap pitch predictor, it is natural to use
`the tap weight g as the voicing indicator since it is readily
`available. For speech coders without a pitch predictor at all,
`and for speech coders that do not transmit the true pitch (such
`as those using an adaptive codebook, e.g., [33]), a long-term
`postfilter can still be used, but a separate pitch analysis on
coded speech is needed as will be discussed later.
