S1.3
`
`LOW-DELAY CODE-EXCITED LINEAR-PREDICTIVE
`CODING OF WIDEBAND SPEECH AT 32 KBPS
`
`Erik Ordentlich
`Yair Shoham
`
`Signal Processing Research Dept.
`AT&T Bell Laboratories
`600 Mountain Ave.
`Murray Hill, NJ 07974
`
`ABSTRACT
`The prospect of multi-channel/multi-user speech
`communication via the emerging ISDN has raised a lot of
`interest in advanced coding algorithms for 7KHz wideband
`speech. The capability of 32Kb/s wideband speech coding
`will support sustained high-quality stereo (or bilingual) voice
`transmission for audio-video teleconferencing over a basic
`rate 64Kb/s ISDN channel. In addition to the high-quality
`requirement, network applications are likely to dictate the use
`of algorithms with a very short coding delay. This paper
`reports on the use of the well known Codebook Excited
`Linear Predictive (CELP) [1-3] algorithm for 32Kb/s
`low-delay (LD-CELP) coding of wideband speech. The main
`problem associated with wideband coding, namely, spectral
`noise weighting, is discussed. The paper proposes an
`enhanced noise weighting technique and demonstrates its
`efficiency via subjective listening tests. In these tests,
`involving 20 listeners and 8 test sentences, the average
`rating for the proposed 32Kb/s LD-CELP was essentially
`equal to that of the 64Kb/s standard (G.722) CCITT
`wideband coder.
`
`I. INTRODUCTION
`The prospect of high-quality multi-channel/multi-user
`speech communication via the emerging ISDN has raised a
`lot of interest in advanced coding algorithms for wideband
`speech. In contrast to the standard telephony band of 200 to
`3400 Hz, wideband speech is assigned the band 50 to
`7000Hz and is sampled at a rate of 16000 Hz for subsequent
`digital processing. The added low frequencies increase the
`voice naturalness and enhance the sense of closeness whereas
`the added high frequencies make the speech sound crisper
`and more intelligible. The overall quality of wideband
`speech as defined above is sufficient for sustained
`commentary-grade voice communication as required, for
`example, in multi-user audio-video teleconferencing.
`Wideband speech is, however, harder to code since the data
`is highly unstructured at high frequencies and the spectral
`dynamic range is very high.
`In some network applications,
`there is also a requirement for a short coding delay which
`limits the size of the processing frame and reduces the
`efficiency of the coding algorithm. This adds another
`dimension to the difficulty of this coding problem.
`The study reported in this paper investigates the use of
`the well known Codebook Excited Linear Predictive (CELP)
`[1-3] algorithm for wideband speech coding. The study is
`reported in detail in [4]. Here, a very brief report is given,
`which highlights the problem, the solution and the main
`results.
`
`II. CELP AND LOW-DELAY CELP
`The basic structure of conventional CELP [1-3] is shown
`in Figure 1. In the system, an excitation signal, drawn from
`an excitation codebook, excites an all-pole filter which is
`usually a cascade of an LPC-derived filter 1/A(z) and a
`so-called pitch filter 1/B(z). The LPC polynomial is given by
`$A(z) = \sum_{i=0}^{M} a_i z^{-i}$ and is obtained by a standard M-th order LPC
`analysis of the speech signal. The pitch filter is determined
`. p
`.
`q
`by the polynomial B (z) = Lbjz_,_ where P is the current
`"=()
`"pitch" lag - a value that best represents the current
`periodicity of the input and the $b_j$'s are the current pitch gains.
`Most often, the order of the pitch filter is q = 1 and it is
`rarely more than 3. Both polynomials A(z), B(z) are monic.
`The CELP algorithm implements a closed-loop
`(analysis-by-synthesis) search procedure for finding the best
`excitation. Each excitation vector is passed through the LPC
`and pitch filters in an effort to find the best match to the
`output, usually, in a weighted mean-squared error (WMSE)
`sense. As seen in Figure 1, the WMSE matching is
`accomplished via the use of a noise-weighting filter W(z).
`The input speech s(n) is first pre-filtered by W(z) and the
`resulting signal x(n) ( X(z) = S(z)W(z) ) serves as a
`reference signal in the closed-loop search. The quantized
`version of x(n), denoted by y(n), is a filtered excitation,
`closest to x(n) in an MSE sense. The filter used in the
`search loop is the weighted synthesis filter
`H(z) = W(z)/[B(z)A(z)].
`The filter W(z) is important for achieving a high
`perceptual quality in CELP systems and it plays a central
`role in the CELP-based wideband coder presented here, as
`will become evident later in this paper.
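The closed-loop search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `synthesize` and `search_codebook` are hypothetical, the filter memory (zero-input response) is ignored, and the gain is chosen optimally per codevector, whereas the LD-CELP coder discussed in this paper adapts the gain in a backward mode. The arrays `num` and `den` stand for the numerator and denominator coefficients of the weighted synthesis filter H(z).

```python
import numpy as np

def synthesize(excitation, num, den):
    """Zero-state response of H(z) = num(z)/den(z) to one excitation block."""
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = 0.0
        for k in range(min(n, len(num) - 1) + 1):
            acc += num[k] * excitation[n - k]      # feed-forward part
        for k in range(1, min(n, len(den) - 1) + 1):
            acc -= den[k] * out[n - k]             # feedback part
        out[n] = acc / den[0]
    return out

def search_codebook(target, codebook, num, den):
    """Return (index, gain, squared error) of the best-matching codevector.

    Every codevector is filtered by H(z); the gain that minimizes the
    squared error against the weighted target x(n) is computed in closed
    form, and the codevector with the smallest error wins.
    """
    best = (-1, 0.0, np.inf)
    for idx, code in enumerate(codebook):
        y = synthesize(code, num, den)
        energy = float(y @ y)
        if energy == 0.0:
            continue
        gain = float(target @ y) / energy
        err = float(np.sum((target - gain * y) ** 2))
        if err < best[2]:
            best = (idx, gain, err)
    return best
```

As a usage example, a target built from a scaled, filtered codevector should be matched by that same codevector: with `codebook = np.random.default_rng(0).standard_normal((16, 5))` and `target = 2.5 * synthesize(codebook[3], [1.0, -0.5], [1.0, -0.9])`, the search returns index 3 with a gain close to 2.5. The vector dimension of 5 mirrors the coding block size reported later in the paper; the 16-entry codebook is only for illustration.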
`In general, the CELP transmitter codes and transmits the
`following five entities: the excitation vector, the excitation
`gain, the pitch lag, the pitch tap(s), and the LPC parameters.
`The overall transmission bit rate is determined by the sum of
`all the bits required for coding these entities.
`The CELP coder uses a long block of future samples
`mainly for the LPC analysis and coding. This implies a long
`coding delay which is unacceptable in many applications.
`Low-Delay CELP [5] is based on the same CELP principle
`but it avoids the excessive coding delay by performing the
`LPC analysis in a backward mode. In this mode LPC
`analysis is performed on recent past quantized speech,
`eliminating the need for a long look-ahead data block and no
`LPC information is transmitted. If the pitch loop is not used
`and the gain is processed in a fully backward mode [5], all
`the transmission bits are assigned to the excitation, which
`facilitates the use of an extremely short data frame. This, in
`turn, results in a very short coding delay.
`The LD-CELP proposed here for coding wideband speech
`at 32 Kb/s employs backward LPC. Two versions of the
`coder were studied. The first included a pitch loop. The
`second did not, i.e., B(z) = 1. The general structure of the
`coder is that of Figure 1, excluding the transmission of the
`LPC information.
`
`CH2977-7/91/0000-0009 $1.00 © 1991 IEEE
`
`III. PERCEPTUAL NOISE WEIGHTING ANALYSIS IN
`CELP SYSTEMS
`In closed-loop MSE waveform coding, the quantization
`error, i.e., the error signal between the output and the
`target, is white.
`As a result, the signal-to-noise ratio (SNR) is not uniform
`across the frequency range and the quantization noise tends
`to mask the low-energy regions of the input spectrum. This
`problem has been recognized and addressed in the context of
`CELP coding of telephony-bandwidth speech [6]. The
`solution was in a form of a noise weighting filter, added to
`the CELP search loop in a way shown in Figure 1. The
`general form of this filter is:
`$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)}, \qquad 0 \le \gamma_2 < \gamma_1 \le 1 \qquad (1)$$
`where A(z) is the LPC polynomial. The effect of $\gamma_1$ or $\gamma_2$ is
`to move the roots of A(z) towards the origin, de-emphasizing
`the spectral peaks of 1/A(z). With $\gamma_1$ and $\gamma_2$ as in (1), the
`response of W(z) has valleys (anti-formants) at the formant
`locations and the inter-formant areas are emphasized. In
`addition, the amount of an overall spectral roll-off is reduced,
`compared to the speech spectral envelope as given by
`1/A(z).
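The root-moving effect of the parameters in (1) can be checked numerically. The sketch below is illustrative only: the helper names and the second-order polynomial `a` (a single resonance with roots at radius 0.95) are assumptions, not data from the paper. It relies on the fact that replacing z by z/gamma multiplies the i-th coefficient of A(z) by gamma to the power i.

```python
import numpy as np

def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): a_i is scaled by gamma**i."""
    a = np.asarray(a, dtype=float)
    return a * gamma ** np.arange(len(a))

def weighting_filter(a, gamma1, gamma2):
    """Numerator and denominator coefficients of W(z) = A(z/gamma1)/A(z/gamma2)."""
    return bandwidth_expand(a, gamma1), bandwidth_expand(a, gamma2)

# Hypothetical monic A(z) = 1 - 1.6 z^-1 + 0.9025 z^-2 (root radius 0.95).
a = np.array([1.0, -1.6, 0.9025])
num, den = weighting_filter(a, 0.9, 0.4)

# Root radii before and after scaling by gamma = 0.9.
root_radius = np.abs(np.roots(a))
scaled_radius = np.abs(np.roots(bandwidth_expand(a, 0.9)))
```

Here `scaled_radius` equals 0.9 times `root_radius`, that is, the roots move radially towards the origin, and both `num` and `den` keep a leading coefficient of 1, so W(z) stays monic.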
`In the CELP system of Figure 1, the unweighted error
`signal E(z) = Y(z) - X(z) is white since this is the signal
`that is actually minimized. The final error signal is
`$$\hat{S}(z) - S(z) = E(z)\,W^{-1}(z) \qquad (2)$$
`and has the spectral shape of $W^{-1}(z)$. This means that the
`noise is now concentrated in the formant peaks and is
`attenuated in between the formants. The idea behind this
`noise shaping is to exploit the auditory masking effect. Noise
`is less audible if it shares the same spectral band with a
`high-level tone-like signal. Capitalizing on this effect, the
`filter W (z) greatly enhances the perceptual quality of the
`CELP coder.
`As will be shown shortly, incorporating W(z) in the
`CELP coder decreases the MSE efficiency of the coder, that
`is, the minimum overall mean square error cannot be
`achieved. This poses a particularly delicate problem in the
`case of wideband speech coding due to the high spectral
`dynamic range. Perceptually efficient spectral shaping is
`obtained at the price of an increased overall noise level.
`However, there is an apparent tradeoff of shaping vs.
`noise level and an excessive noise shaping usually
`deteriorates the performance.
`The performance of the CELP coder is related to
`predictability or the non-spectral-flatness of the speech input.
`This fact is not that obvious in the framework of a
`closed-loop analysis-by-synthesis system since prediction is not
`explicitly performed. The CELP system can be heuristically
`analyzed with the aid of an equivalent "DPCM" coder
`shown in Figure 2. In this figure, the transmitter of Figure 1
`was redrawn in a different way. A virtual predictor and
`quantizer were added to make the circuit look like a standard
`differential coder. Note that the weighting filter is now
`represented by $\beta W(z)$ where W(z) is assumed to be a
`rational monic filter, that is, the leading terms of both the
`numerator and denominator are strictly 1.0. The virtual
`predictor
`$$P_r(z) = 1 - \frac{B(z)A(z)}{W(z)} \qquad (3)$$
`plays an important role in this analysis. It is easy to show
`that it guarantees the fundamental DPCM relation, namely,
`$$\hat{R}(z) - R(z) = Y(z) - X(z) = E(z) \qquad (4)$$
`Note that for $P_r(z)$ to be a valid predictor the leading term of
`its impulse response must be strictly 0.0. This is guaranteed
`by B(z), A(z), W(z) being all monic. Note, also, that if
`W(z) = 1, the predictor reduces to that of a standard DPCM
`system.
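The two properties of the virtual predictor noted above (a zero leading term, and reduction to the standard DPCM predictor when W(z) = 1) can be verified with a short sketch. The function names and the example polynomial `a` are hypothetical; B(z) = 1 is used, as in the coder without a pitch loop.

```python
import numpy as np

def impulse_response(num, den, n):
    """First n samples of the impulse response of num(z)/den(z)."""
    h = np.zeros(n)
    for i in range(n):
        acc = num[i] if i < len(num) else 0.0
        for k in range(1, min(i, len(den) - 1) + 1):
            acc -= den[k] * h[i - k]
        h[i] = acc / den[0]
    return h

def virtual_predictor(a, b, w_num, w_den, n=8):
    """Impulse response of 1 - B(z)A(z)/W(z), with W(z) = w_num(z)/w_den(z)."""
    num = np.convolve(np.convolve(b, a), w_den)   # B(z) A(z) W_den(z)
    h = impulse_response(num, w_num, n)           # divided by W_num(z)
    unit = np.zeros(n)
    unit[0] = 1.0
    return unit - h

# Hypothetical monic LPC polynomial and weighting parameters.
a = np.array([1.0, -1.6, 0.9025])
k = np.arange(len(a))
w_num, w_den = a * 0.9 ** k, a * 0.4 ** k

p_weighted = virtual_predictor(a, [1.0], w_num, w_den)
p_standard = virtual_predictor(a, [1.0], [1.0], [1.0])   # W(z) = 1
```

In both cases the first sample is 0, so the predictor uses only past samples. With W(z) = 1 the remaining samples are the negated LPC coefficients, which is the standard DPCM predictor 1 - A(z).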
`The relation (4) allows us to view the quantization
`operation as done by a virtual vector quantizer Q(r) (see
`Fig. 2) although, in reality, it is done in output domain by
`matching Y(z) to X(z). The performance of such a
`quantizer, in terms of MSE SNR, is given by the well known
`expression
`$$\sigma_e = \sigma_r\,\epsilon\,2^{-b} \qquad (5)$$
`where $\sigma_e$ is the variance of the quantization error, $\sigma_r$ is the
`variance of the virtual residual R(z), and b is the coding rate
`in bits per sample. Experimental results indicate that, for a
`low bit rate (b < 1 as in the usual CELP case) and an
`uncorrelated input (such as R(z)), the quantizer constant $\epsilon$ is
`close to 1.0. Henceforth, we neglect this constant.
`To proceed with this analysis we make the usual
`assumption that the output Y(z) is very close to X(z), i.e.,
`the quantization error is small. So, using Y(z) = X(z), we
`get
`$$R(z) = X(z) - \left[1 - B(z)A(z)W^{-1}(z)\right]X(z) = B(z)A(z)W^{-1}(z)X(z) = \beta S(z)B(z)A(z) \qquad (6)$$
`The result $R(z) = \beta S(z)B(z)A(z)$ is rather striking.
`It tells us that the virtual residual is approximately
`independent of the weighting filter up to some fixed gain.
`Moreover, it tells us that R(z) is approximately a
`minimum-energy residual obtained by inverse filtering that is
`characterized by maximum LPC prediction gain:
`$$P_c = \frac{\sigma_s}{\sigma_r} \qquad (7)$$
`where $\sigma_s$ is the variance of the input speech. Note that $P_c$
`characterizes the input signal only and it has nothing to do
`with the coding system.
`Using the above, the variance of the error E(z) is given
`by $\sigma_e = \beta(\sigma_s/P_c)2^{-b}$. Since E(z) is statistically white,
`the average-magnitude-spectrum of the final error signal
`$$D(z) = \hat{S}(z) - S(z) = E(z)\,\beta^{-1}W^{-1}(z) \qquad (8)$$
`is given by
`$$M_D(\omega) = \frac{\sigma_s}{P_c}\,2^{-b}\left|W^{-1}(\omega)\right| \qquad (9)$$
`The MSE performance is, therefore, linearly related to the
`prediction gain of the signal. The output noise is affected
`only by the monic part of the weighting filter. Henceforth,
`we assume without loss of generality, that W(z) is monic.
`The power of the output error signal D(z) is given by
`$$\sigma_d^2 = \frac{\sigma_s^2}{P_c^2}\,2^{-2b}\int_{-\pi}^{\pi}\left|W^{-2}(\omega)\right|\frac{d\omega}{2\pi} \qquad (10)$$
`
`Using the standard logarithmic inequality ($x \ge \log x + 1$) we get
`$$\sigma_d^2 = \frac{\sigma_s^2}{P_c^2}\,2^{-2b}\int_{-\pi}^{\pi}\left|W^{-2}(\omega)\right|\frac{d\omega}{2\pi} \;\ge\; \frac{\sigma_s^2}{P_c^2}\,2^{-2b}\left[\int_{-\pi}^{\pi}\log\left|W^{-2}(\omega)\right|\frac{d\omega}{2\pi} + 1\right] \qquad (11)$$
`Since W(z) is rational, monic and stable, the first term
`inside the brackets is zero, which leaves us with
`$$\sigma_d^2 \;\ge\; \frac{\sigma_s^2}{P_c^2}\,2^{-2b} \qquad (12)$$
`with equality if and only if W(z) = 1. Therefore, applying a
`weighting filter always increases the overall noise level.
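The two facts behind (11) and (12) can be checked numerically on a frequency grid: the average of the log of the squared inverse magnitude response is zero for a monic weighting filter of the form (1), while the average of the squared inverse magnitude response itself is at least 1. The sketch below is an illustration under assumptions: the polynomial `a`, the helper names and the grid size are all hypothetical choices, and the integrals are approximated by uniform averages.

```python
import numpy as np

def freq_response(coeffs, omega):
    """Evaluate a polynomial in z^-1 at z = exp(j*omega)."""
    k = np.arange(len(coeffs))
    return np.exp(-1j * np.outer(omega, k)) @ np.asarray(coeffs, dtype=float)

def noise_penalty(a, gamma1, gamma2, n_freq=4096):
    """Return (mean of log|W^-2|, mean of |W^-2|) over a uniform grid."""
    a = np.asarray(a, dtype=float)
    k = np.arange(len(a))
    omega = 2.0 * np.pi * np.arange(n_freq) / n_freq
    w = freq_response(a * gamma1 ** k, omega) / freq_response(a * gamma2 ** k, omega)
    inv_sq = 1.0 / np.abs(w) ** 2
    return float(np.mean(np.log(inv_sq))), float(np.mean(inv_sq))

# Hypothetical monic LPC polynomial with roots inside the unit circle.
a = [1.0, -1.6, 0.9025]
mean_log, mean_power = noise_penalty(a, 0.9, 0.4)      # weighting active
_, flat_power = noise_penalty(a, 0.8, 0.8)             # W(z) = 1
```

With these inputs `mean_log` is zero to within rounding, `mean_power` exceeds 1, and `flat_power` equals 1, which matches the statement that any weighting other than W(z) = 1 raises the overall noise power.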
`
`A smoothed version of the speech input spectrum may be
`represented by
`$$S_s(\omega) = \frac{\sigma_r}{B(\omega)A(\omega)} \qquad (13)$$
`Combining (9) and (13) we get an expression for a
`frequency-dependent signal to noise ratio in the log domain:
`$$\mathrm{SNR}(\omega) = \log\left|\frac{W(\omega)}{B(\omega)A(\omega)}\right| + b\log 2 \qquad (14)$$
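The step from (9) and (13) to (14) can be written out as follows. This is a sketch that assumes the SNR is taken as the log ratio of the smoothed signal magnitude spectrum to the noise magnitude spectrum, which the text does not state explicitly; the last term vanishes by the definition of the prediction gain in (7).

```latex
\begin{align*}
\mathrm{SNR}(\omega)
  &= \log\frac{S_s(\omega)}{M_D(\omega)}
   = \log\left[\frac{\sigma_r}{\left|B(\omega)A(\omega)\right|}
     \cdot\frac{P_c\,2^{b}\left|W(\omega)\right|}{\sigma_s}\right] \\
  &= \log\left|\frac{W(\omega)}{B(\omega)A(\omega)}\right|
     + b\log 2 + \log\frac{\sigma_r P_c}{\sigma_s}
   = \log\left|\frac{W(\omega)}{B(\omega)A(\omega)}\right| + b\log 2 .
\end{align*}
```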
`Averaging this expression over the frequency range shows
`that the only non-zero term left is b log 2 (recall that all the
`filters are monic). Since b log 2 is a small value, the SNR is
`bound to be negative in some frequency regions, mostly at
`high frequencies. It is also clear from (14) that the
`weighting filter W(z) cannot improve this situation globally.
`It can only improve the SNR in some regions at the price of
`reducing the SNR in others.
`
`IV. PERCEPTUAL NOISE WEIGHTING IN CELP CODING
`OF WIDEBAND SPEECH
`The spectral dynamic range of wideband speech is
`considerably higher than that of telephony speech where the
`added high-frequency region of 3400 to 6000 Hz is usually
`near the bottom of this range. Expression (14) in the
`previous analysis tells us that, in this case, the unweighted
`SNR tends to be highly negative at high frequencies. While
`coding of the low-frequency region seems to be easier,
`coding of the high-frequency region poses a severe problem.
`The auditory system is quite sensitive in this region and the
`quantization distortions are clearly audible in a form of
`crackling and hiss. Noise weighting is, therefore, more
`crucial in wideband CELP. The balance of low to high
`frequency coding is more delicate. The major effort in this
`study was towards finding a good weighting filter that would
`allow a better control of this balance.
`The starting point of this investigation was the weighting
`filter of the conventional CELP as in (1). The goal was to
`find a good set ($\gamma_1$, $\gamma_2$) for best perceptual performance. It
`was found that, similar to the narrow-band case, the values
`$\gamma_1 = 0.9$, $\gamma_2 = 0.4$ produced reasonable results. However the
`performance was not yet satisfactory. It was found that the
`filter W(z) as in (1) has an inherent limitation in modeling
`the formant structure and the required spectral tilt
`concurrently. The spectral tilt is more or less controlled by
`the difference $\gamma_1 - \gamma_2$. The tilt is global in nature and it is not
`possible to emphasize it separately at high frequencies. Also,
`changing the tilt affects the shape of the formants of W(z).
`A pronounced tilt is obtained along with higher and wider
`formants, which puts too much noise at low frequencies and
`in between the formants. The conclusion was that the
`formant and tilt problems ought to be decoupled. The
`approach taken was to use W(z) only for formant modeling
`and to add another section for controlling the tilt only. The
`general form of the new filter is
`$$W'(z) = W(z)P(z) \qquad (15)$$
`where P(z) is responsible for the tilt only. Various forms of
`P(z) were studied. A detailed discussion of these forms can
`be found in [4]. In this paper we focus on a solution that has
`been found, after careful listening to coded speech, to work
`the best.
`In the proposed system, the weighting filter (15)
`incorporates a two-pole section
`$$P(z) = \frac{1}{1 + \sum_{i=1}^{2} p_i\,\delta^i z^{-i}} \qquad (16)$$
`The coefficients $p_i$ are found by applying the standard LPC
`algorithm to the first three correlation coefficients of the
`current-frame LPC inverse filter ( A(z) ) sequence $a_i$. The
`parameter $\delta$ is used to adjust the spectral tilt of P(z). The
`values $\delta = 0.7$, $\gamma_1 = 0.95$ and $\gamma_2 = 0.8$ were found to yield the
`best perceptual performance.
`Figure 3 demonstrates the effect of the enhanced
`noise-shaping filter. The dashed curve in the figure shows a typical
`spectrum of a conventional inverse filter $W^{-1}(z)$. The solid
`curve is the spectrum of an enhanced inverse filter
`$W^{-1}(z)P^{-1}(z)$ for the same underlying LPC filter. As seen,
`for the same general tilt, the enhanced filter has less
`pronounced formants, especially at the lowest and highest
`formants. Compared to the old filter, the enhanced filter
`attenuates the noise by about 5 dB at the lowest and highest
`formant, while achieving the right overall spectral tilt.
`The CELP coder with the noise weighting as in (15, 16)
`was implemented with various pitch loops, i.e., various
`orders for B(z) and various numbers of bits for the pitch taps.
`Interestingly, the best system - in terms of both perceptual
`and MSE performance - was the one without a pitch loop,
`i.e., B(z) = 1.
`
`V. SUBJECTIVE PERFORMANCE TEST
`The performance of the wideband LD-CELP was assessed
`by comparing it to the G.722 CCITT standard wideband
`coder [7]. This coder operates at 64 Kb/s, twice the rate of
`LD-CELP. The LD-CELP algorithm employed an adaptive
`two-pole tilt filter with $\delta = 0.7$, in combination with a formant
`filter W(z) defined by $\gamma_1 = 0.98$ and $\gamma_2 = 0.8$. Backward LPC
`analysis was used, of order 32. The coder did not use a
`pitch loop. The coding block size was 5 and the codebook
`size was 1024.
`The test material included four male and four female
`utterances. Each utterance was coded by the G.722 and by
`the LD-CELP to form a pair of utterances. The order in a
`pair was set at random. However, each pair was played in
`both orders to eliminate biased decisions. Twenty listeners
`took part in the test. The listener was asked to vote for the
`better sounding utterance in his judgement, or, to split his
`vote equally, if no preference could be made. The final
`scores were defined as the percent number of votes for each
`system per given condition.
`Table 1 shows the experimental results. As shown, the
`overall scores are 51.72 and 48.28 for the LD-CELP and the
`G.722, respectively. This means that, on the average, the
`two systems performed alike, which is extremely
`encouraging, recalling that the LD-CELP rate is half that of
`the G.722. Also, the coding delay of the G.722 coder is 1.5
`msec. whereas that of the LD-CELP is only 0.94 msec. In
`the ISDN environment, this means stereo, multi-user or
`bilingual communication over a basic-rate channel without
`loss of quality. When the scores are broken up in terms of
`utterance gender, one can notice an asymmetry in quality: the
`LD-CELP does better on males (62.81% preference) whereas
`the G.722 does better on females (59.38% preference). The
`reason for this is not yet understood.
`The results show that 32Kb/s LD-CELP is an excellent
`candidate for wideband speech coding in high-quality
`communication networks. Although LD-CELP is more
`complex than the G.722, future advances in DSP technology
`will render this disadvantage insignificant.
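The construction of the two-pole tilt section (16) described in Section IV can be sketched as follows. This is one reading of the text, not the authors' code: the autocorrelation is taken directly over the coefficient sequence of A(z) without windowing, the "standard LPC algorithm" is assumed to be a second-order Levinson-Durbin recursion, and all function names and the example polynomial `a` are hypothetical. The defaults use the parameter values quoted in Section IV.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Autocorrelation of a finite sequence for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(max_lag + 1)])

def lpc_order2(r):
    """Second-order Levinson-Durbin recursion; returns [1, p1, p2]."""
    k1 = -r[1] / r[0]
    e1 = r[0] * (1.0 - k1 * k1)          # first-order prediction error
    k2 = -(r[2] + k1 * r[1]) / e1
    return np.array([1.0, k1 * (1.0 + k2), k2])

def tilt_denominator(a, delta=0.7):
    """Denominator of P(z): 1 + p1*delta*z^-1 + p2*delta^2*z^-2."""
    p = lpc_order2(autocorrelation(a, 2))
    return p * delta ** np.arange(3)

def enhanced_weighting(a, gamma1=0.95, gamma2=0.8, delta=0.7):
    """Numerator and denominator of W'(z) = W(z) P(z) as in (15)."""
    a = np.asarray(a, dtype=float)
    k = np.arange(len(a))
    num = a * gamma1 ** k
    den = np.convolve(a * gamma2 ** k, tilt_denominator(a, delta))
    return num, den

# Hypothetical monic LPC polynomial for illustration.
a = np.array([1.0, -1.6, 0.9025])
tilt = tilt_denominator(a)
num, den = enhanced_weighting(a)
```

Because the recursion runs on a valid autocorrelation sequence, the poles of the tilt section lie inside the unit circle, and the factor delta pulls them further in, so the section stays stable while only a coarse two-pole summary of the spectral shape of A(z) is retained.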
`
`Figure 3. Effect of the enhanced noise-shaping filter
`(horizontal axis: Frequency (Hz), 0 to 8000).
`Dashed line: Spectrum of a conventional inverse
`filter $W^{-1}(z)$ ($\gamma_1 = 0.95$, $\gamma_2 = 0.4$).
`Solid line: Spectrum of an enhanced inverse
`filter $W^{-1}(z)P^{-1}(z)$ ($\gamma_1 = 0.95$, $\gamma_2 = 0.8$, $\gamma_p = 0.7$).
`
`Table 1. Listening test results for the proposed LD-CELP and the
`CCITT G.722 wideband coders.
`
`Utterances   LD-CELP (%)   G.722 (%)
`All          51.72         48.28
`Male         62.81         37.19
`Female       40.62         59.38
`
`Figure 1. CELP Coder
`
`Figure 2. "DPCM" Equivalent of a CELP Coder
`
`REFERENCES
`[1] B.S. Atal, M.R. Schroeder, "Stochastic Coding of Speech
`Signals at Very Low Bit Rates", Proc. IEEE Int. Conf.
`Comm., May 1984, p. 48.1.
`[2] M.R. Schroeder, B.S. Atal, "Code-Excited Linear Predictive
`(CELP): High Quality Speech at Very Low Bit Rates", Proc.
`IEEE Int. Conf. ASSP, 1985, pp. 937-940.
`[3] P. Kroon, E.F. Deprettere, "A Class of Analysis-by-Synthesis
`Predictive Coders for High-Quality Speech Coding at Rates
`Between 4.8 and 16 Kb/s", IEEE J. on Sel. Areas in Comm.,
`SAC-6(2), Feb. 1988, pp. 353-363.
`[4] E. Ordentlich, "Low Delay Code Excited Linear Predictive
`(LD-CELP) Coding of Wide Band Speech at 32Kbit/sec.",
`MS Thesis, EE Dept., MIT, March 1990.
`[5] J.H. Chen, "A Robust Low-Delay CELP Speech Coder at 16
`Kb/s", Proc. GLOBECOM-89, Vol. 2, Nov. 89, pp. 1237-40.
`[6] B.S. Atal, M.R. Schroeder, "Predictive Coding of Speech
`Signals and Subjective Error Criteria", IEEE Tr. ASSP, Vol.
`ASSP-27, No. 3, June 1979, pp. 247-254.
`[7] P. Mermelstein, "G.722, a New CCITT Coding Standard for
`Digital Transmission of Wideband Audio Signals", IEEE
`Comm. Mag., pp. 8-15, Jan. 1988.
`