TRANSFORM CODED EXCITATION (TCX)

R. Lefebvre
J-P. Adoul
R. Salami
C. Laflamme
Department of Electrical Engineering, University of Sherbrooke
Sherbrooke, Québec, Canada, J1K 3R1

ABSTRACT
This paper describes the application of Transform Coded Excitation (TCX) coding, an algorithm recently proposed by the authors, to encoding wideband speech and audio signals in the bit rate range of 16 kbits/s to 32 kbits/s. The approach uses a combination of time domain (linear prediction; pitch prediction) and frequency domain (transform coding; dynamic bit allocation) techniques, and utilizes a synthesis model similar to that of linear prediction coders such as CELP. However, at the encoder, the high complexity analysis-by-synthesis technique is bypassed by directly quantizing the so-called target signal in the frequency domain. The innovative excitation is derived at the decoder by inverse filtering the quantized target signal. The algorithm is intended for applications whereby a large number of bits is available for the innovative excitation. In this paper the TCX algorithm is utilized to encode wideband speech and audio signals with 50-7000 Hz bandwidth. Novel quantization procedures including inter-frame prediction in the frequency domain are proposed to encode the target signal. The proposed algorithm achieves very high quality for speech at 16 kbits/s, and for music at 24 kbits/s.

1. INTRODUCTION
There is currently a growing demand for low bit rate wideband speech (50-7000 Hz) and audio coding for many applications such as audio-video teleconferencing and multimedia. CELP coding has been successfully applied to obtain high quality wideband speech at 16 kbit/s and below [1]. However, the algorithm was fairly complex, and in order to achieve a single DSP chip implementation the codebook size had to be reduced, resulting in a lower bit rate version with some quality degradation [2].
Recently [3], a new approach called Transform Coded Excitation (TCX) coding has been introduced as an efficient technique to encode the innovative excitation in CELP type speech coders at medium and high rates. The proposed algorithm utilizes time-domain linear prediction (LP) and pitch prediction (PP) analysis to determine the reconstructed signal. However, instead of using the computationally demanding analysis-by-synthesis techniques to determine the innovative excitation, the perceptually weighted signal with removed filter ringing and pitch correlations, better known as the target signal, is transformed and encoded in the frequency domain; it is then decoded and inverse transformed to extract the time-domain innovative excitation. The model is best applied to situations where a large number of bits is available for the excitation, such as medium-delay coding with backward LPC analysis, as was presented in [3], and in encoding wideband speech and audio signals as will be presented in this article.
The main advantage of TCX coding is its algorithmic simplicity (as the analysis-by-synthesis search procedure is eliminated). Simple scalar and vector quantization techniques are used and only one inverse filtering is needed to extract the innovative excitation. Further, quantizing the target in the frequency domain does not suffer from the deficiencies, such as frame noise, of traditional transform coding applied directly to the original signal. In TCX the reconstructed signal is obtained by continuous filtering, which significantly reduces the framing discontinuities.
TCX coding can be seen as one possible approach to construct a target codebook at each frame of input signal. In CELP coding, the target codebook is constructed by filtering the entries of a fixed codebook through a time-varying (weighted) synthesis filter, which imposes the formant structure of the input signal on the target. Hence, the target codebook is adaptive, as its spectral characteristics evolve with the input signal. At each frame, the CELP coder selects the excitation that produces the reconstructed target (filtered excitation) closest to the target. In TCX coding, the fixed codebook is actually in the target domain. At each frame, the TCX coder first computes the quantized version of the target, and then calculates the corresponding excitation by filtering through the (zero-state) inverse weighted synthesis filter. Hence, there is no actual excitation codebook, in the sense that only one excitation per frame is computed and used. The target codebook in TCX can still be made adaptive, as in CELP, by the use of predictive encoding in the transform domain. In this case, the fixed codebook contains a set of predictive residuals. As will be shown later, successive amplitude spectra of the target are highly correlated, especially for some music signals, making predictive encoding of the target spectrum very effective.
The present paper concerns itself with the application of the TCX model to wideband audio signals (50-7000 Hz), speech and music, at rates ranging from 16 kbits/s to 32 kbits/s. At such high rates, the size of the innovative codebook of a CELP coder is exceedingly large, making analysis-by-synthesis virtually impossible to implement. The TCX model is attractive precisely because only one filtering is used to determine the innovative excitation, instead of one per innovative codebook vector. Most of the computational effort in TCX is due to the quantization of the target signal.
Section 2 presents the principle of TCX coding. In Section 3, we describe the quantization technique of the target signal, while Section 4 describes the actual coding schemes used for both speech and music signals. Section 5 gives some results and a discussion, and Section 6 the conclusions.

0-7803-1775-0/94 $3.00 © 1994 IEEE

Ex. 1040 / Page 1 of 4
Apple v. Saint Lawrence

Figure 1. Principle of TCX coding.

2. PRINCIPLE OF TCX CODING
Figure 1 shows the schematic diagram of the TCX coder. For each frame of $N$ samples, the input signal $s(n)$ is first preemphasized with the filter $F(z) = 1 - \mu z^{-1}$, where $\mu = 0.5$, to increase the relative energy of the high-frequency components of the signal. This operation is of particular importance in wideband coding, to improve the higher order LP analysis, which is performed on the preemphasized signal $s_p(n)$ using the autocorrelation method. The LP coefficients of the filter $A(z)$ are quantized in the LSP domain [4]. Using the quantized version of the LPC filter, $\hat{A}(z)$, a residual signal $r(n)$ is computed. Closed loop pitch analysis is then performed to find the pitch delay and gain by minimizing the mean-square error between the weighted input speech and the past excitation $\hat{r}(n)$ filtered through the weighted quantized synthesis filter $1/\hat{A}(z/\gamma)$, with $\gamma = 0.8$ (the weighting filter is $W(z) = \hat{A}(z)/\hat{A}(z/\gamma)$). The pitch correlation is removed from the residual signal by subtracting the past excitation, with proper delay and gain, from the residual $r(n)$, to give the signal $u(n)$. $u(n)$ is then filtered through $1/\hat{A}(z/\gamma)$ (with its initial states properly set) to give the target signal $x(n)$.
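The encoder-side chain described above can be illustrated with a minimal Python sketch. It covers only the configuration without pitch prediction, so that $u(n) = r(n)$, and it assumes zero filter memory at the start of the frame, whereas the coder sets the initial states properly. All function names are hypothetical, and `a_hat` stands for an already quantized coefficient list `[1, a_1, ..., a_p]`.

```python
def preemphasize(s, mu=0.5, prev=0.0):
    """Apply F(z) = 1 - mu*z^-1; `prev` is the last input sample of the previous frame."""
    out = []
    for x in s:
        out.append(x - mu * prev)
        prev = x
    return out


def bandwidth_expand(a, gamma=0.8):
    """Coefficients of A(z/gamma): coefficient a_k is scaled by gamma**k."""
    return [ak * gamma ** k for k, ak in enumerate(a)]


def fir_filter(b, x):
    """y(n) = sum_k b[k] * x(n-k), with zero initial state (assumption)."""
    return [sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
            for n in range(len(x))]


def allpole_filter(a, x):
    """Filter x through 1/A(z) with zero initial state (assumption); a[0] must be 1."""
    y = []
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def target_without_pitch(s, a_hat, mu=0.5, gamma=0.8):
    """Target x(n) for one frame when no pitch predictor is used (u(n) = r(n))."""
    s_p = preemphasize(s, mu)                  # preemphasized signal s_p(n)
    r = fir_filter(a_hat, s_p)                 # residual r(n) through A_hat(z)
    a_w = bandwidth_expand(a_hat, gamma)       # coefficients of A_hat(z/gamma)
    return allpole_filter(a_w, r)              # x(n) through 1/A_hat(z/gamma)
```

Setting `gamma=1.0` makes the weighting filter equal to 1, so the target reduces to the preemphasized signal; this gives a quick sanity check of the filter pair.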
In traditional CELP coding, the next step involves filtering through $1/\hat{A}(z/\gamma)$ a set of innovative excitations from a codebook, to find the one that best matches the target $x(n)$. In TCX coding, the target signal is directly quantized, in the frequency domain, as shown in Figure 1, where the transformation T is a Fourier transform. The information transmitted by the coder is thus (1) the LPC parameters, in the form of quantized LSPs, (2) the pitch delay and gain, and (3) the result of the quantization of the complex target signal $X(k)$, namely a set of codebook indices $i$ and gains $g$.
For the TCX coder to accommodate music signals, which cannot in general be efficiently modeled by a unique set of equally spaced harmonics, the pitch predictor was only used on speech, and removed when coding music sequences. This resulted in a significant improvement in music quality. In this context, only the LPC coefficients (LSPs) and the quantized version of the target signal are transmitted.
Figure 2. TCX decoding.

At the decoder end (Figure 2), the quantized version of the complex target signal $\hat{X}(k)$ is inverse transformed to give the time-domain quantized target $\hat{x}(n)$. The innovative excitation $\hat{u}(n)$ is obtained by filtering the quantized target signal through the inverse weighted LPC filter, $\hat{A}(z/\gamma)$, with zero initial states. The pitch correlated excitation $p(n)$ is added back to the innovative excitation, and together they form the total excitation $\hat{r}(n)$, which produces the quantized preemphasized synthesis $\hat{s}_p(n)$ by filtering through $1/\hat{A}(z)$. Finally, the synthesis signal is found by filtering $\hat{s}_p(n)$ through the deemphasis filter $1/F(z) = 1/(1 - \mu z^{-1})$.
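The decoding chain described above can be sketched for a single frame as follows. This is an illustration under stated assumptions rather than the authors' implementation: the synthesis and deemphasis filters start from zero memory here (a real decoder would carry filter memory across frames), the inverse transform is assumed to have been applied already, and the names `decode_frame`, `x_hat`, `a_hat` and `pitch_contrib` are hypothetical.

```python
def fir_filter(b, x):
    """y(n) = sum_k b[k] * x(n-k), with zero initial state."""
    return [sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
            for n in range(len(x))]


def allpole_filter(a, x):
    """Filter x through 1/A(z) with zero initial state (assumption); a[0] must be 1."""
    y = []
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def decode_frame(x_hat, a_hat, pitch_contrib=None, mu=0.5, gamma=0.8):
    """Rebuild one frame of output from the time-domain quantized target x_hat(n)."""
    # Inverse weighted filter A_hat(z/gamma), zero initial states as in the text.
    a_w = [ak * gamma ** k for k, ak in enumerate(a_hat)]
    u_hat = fir_filter(a_w, x_hat)                       # innovative excitation
    if pitch_contrib is None:                            # music mode: no pitch predictor
        pitch_contrib = [0.0] * len(u_hat)
    r_hat = [u + p for u, p in zip(u_hat, pitch_contrib)]  # total excitation
    s_p_hat = allpole_filter(a_hat, r_hat)               # synthesis through 1/A_hat(z)
    return allpole_filter([1.0, -mu], s_p_hat)           # deemphasis 1/(1 - mu*z^-1)
```

Only one FIR filtering per frame is needed to obtain the excitation, which is the complexity argument made in the introduction.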

3. QUANTIZATION OF THE TARGET
Replacing the low-energy, "white" innovation codebook of CELP with the high-energy, "colored" target codebook of TCX requires that the quantization procedure, denoted Q in Figure 1, be carefully designed to take into account the correlations between successive target signals. Since the target is obtained by subtracting the pitch correlations from the weighted input signal $s_w(n)$, it has to some extent the formant structure of $s_w(n)$. Hence, in the transform domain, successive amplitude spectra of the target will be correlated. This correlation will be particularly important in the case of music, where the target is simply the weighted input signal (after removing the zero-input response of $1/\hat{A}(z/\gamma)$). In general, the phase spectrum does not present inter-frame correlation as does the amplitude spectrum.
Figure 3 and Figure 4 show two consecutive amplitude spectra of the target for a voiced segment of male speech, and for an organ sequence, respectively. Recall that in the case of music, there is no pitch prediction, which allows larger frames to be used. It can be seen that the energy contour is roughly the same for the two consecutive speech frames, whereas the correlation is even greater in the case of music. In particular, for music, even the high-frequency fine structure is repeated from one frame to the next, for the example shown in Figure 4.
To exploit this redundancy, the phase and amplitude spectra of the target, respectively $X_\phi$ and $X_a$, are quantized separately. Differential quantization is used to encode $X_a$, while direct quantization is used for $X_\phi$.
Figure 3. Two consecutive target amplitude spectra for a male speech sequence.

Figure 4. Two consecutive target amplitude spectra for an organ sequence.

3.1. Quantizing the amplitude spectrum
The quantization procedure of the amplitude spectrum at frame $m$, $X_a^{(m)}(k)$, is as follows. At each frame $m$, the amplitude spectrum $X_a^{(m)}(k)$ can be expressed as the sum of two terms

$$X_a^{(m)}(k) = \beta \hat{X}_a^{(m-1)}(k) + R(k), \tag{1}$$

where $\beta \hat{X}_a^{(m-1)}(k)$ is the predicted amplitude spectrum from the past frame, and $R(k)$ is the prediction residual. The best prediction gain $\beta$ is first determined by selecting from a prediction gain codebook the one that minimizes the prediction error

$$E_\beta = \sum_{k=0}^{K-1} \left[ X_a^{(m)}(k) - \beta \hat{X}_a^{(m-1)}(k) \right]^2, \tag{2}$$

where $K$ is the number of amplitudes in the spectrum ($K = N/2$, where $N$ is the frame size). Once the gain $\beta$ is known, the prediction residual $R(k)$, given by

$$R(k) = X_a^{(m)}(k) - \beta \hat{X}_a^{(m-1)}(k), \tag{3}$$

is vector quantized. In general, a split-VQ approach is used, since the amplitude spectra are high-dimensional vectors. For each subvector, a gain and a shape are computed (gain-shape VQ). The total quantized amplitude spectrum is then given by

$$\hat{X}_a^{(m)}(k) = \beta \hat{X}_a^{(m-1)}(k) + \hat{R}(k). \tag{4}$$
The spectral prediction residual codebook, containing the reconstructed residuals $\hat{R}(k)$, is obtained using the k-means training algorithm. The training sequence of $R(k)$ vectors is obtained from a large database of speech or music signals. The distribution of the prediction gain $\beta$ was studied and found to lie between 0.2 and 1.2, for both speech and music. The maximum of this distribution lies around 0.75 for speech and around 0.9 for music. Recall that in the case of music, no pitch prediction is used, which increases the average correlation between successive amplitude spectra of the target.
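Equations (1) to (4) can be illustrated with a small Python sketch of the prediction-gain search and the reconstruction step. The contents of the gain codebook are not given in the paper, so the codebook passed in below is purely an assumption, and the function names are hypothetical. The gain-shape vector quantization of the residual is left out; the sketch takes a residual `r_hat` as if it had already been quantized.

```python
def select_prediction_gain(x_cur, x_prev_hat, gain_codebook):
    """Pick beta from the codebook minimizing the error of Eq. (2); return (beta, R)."""
    def prediction_error(beta):
        return sum((c - beta * p) ** 2 for c, p in zip(x_cur, x_prev_hat))

    beta = min(gain_codebook, key=prediction_error)
    residual = [c - beta * p for c, p in zip(x_cur, x_prev_hat)]   # Eq. (3)
    return beta, residual


def reconstruct_amplitudes(x_prev_hat, beta, r_hat):
    """Quantized amplitude spectrum of Eq. (4) from the past frame and residual."""
    return [beta * p + r for p, r in zip(x_prev_hat, r_hat)]


# Illustrative use with a toy 3-entry codebook (an assumption, not the paper's).
prev_hat = [1.0, 2.0, 4.0]
cur = [0.75, 1.5, 3.0]
beta, residual = select_prediction_gain(cur, prev_hat, [0.5, 0.75, 1.0])
rebuilt = reconstruct_amplitudes(prev_hat, beta, residual)
```

Because the predictor uses the previous quantized spectrum, the decoder can form the same prediction from data it already has and needs only the gain index and the quantized residual.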
3.2. Quantizing the phase spectrum
With $R$ the total number of bits available to quantize the target, we have $R_\phi = R - R_a$, where $R_\phi$ and $R_a$ are the number of bits to quantize the phase and amplitude spectra, respectively. The $R_\phi$ bits are dynamically allocated at each frame, as a function of the quantized amplitude spectrum $\hat{X}_a^{(m)}(k)$. This way, no side information is needed to transmit the bit allocation to the decoder, since it can be calculated locally.
The bit allocation algorithm is as follows. Let $\hat{X}_a$ be the quantized amplitude spectrum of the target and let $K$ be the number of phases and amplitudes of the spectrum. Let $R_\phi$ be the number of bits to allocate to the phases. Finally, let $r_k$ be the number of bits allocated to phase $k$, $k = 0, \ldots, K-1$.
Initialization: set $r_k = 0$, for $k = 0, \ldots, K-1$
Iteration: Repeat $R_\phi$ times:
1. Find the maximum amplitude of $\hat{X}_a$, at position $k$.
2. $r_k = r_k + 1$ (allocate one bit to the $k$th phase)
3. Divide $\hat{X}_a(k)$ by 2
At the end, the $r_k$ contain the bits allocated to the quantization of each phase.
A maximum of 7 bits was allowed for each phase. Experiments have shown that allocating 7 bits to each phase, at each frame, produced transparent quality speech and music, in all cases considered. Further, at the chosen encoding rates of 16 kbits/s and 24 kbits/s, the highest observed bit rate allocated to one given phase was 8 bits, using the bit allocation described above.
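The allocation loop and the per-phase cap described above can be sketched in Python as follows. Two details are assumptions of this sketch, since the text does not specify them: ties between equal amplitudes go to the lowest index, and a phase that has reached the cap is skipped in later iterations. Encoder and decoder would have to share whatever rules are actually used. The function name is hypothetical.

```python
def allocate_phase_bits(amp_hat, total_bits, max_bits=7):
    """Greedy phase-bit allocation driven by the quantized amplitude spectrum."""
    work = list(amp_hat)            # working copy; amplitudes are halved as bits are given
    bits = [0] * len(work)          # r_k, initialised to zero
    for _ in range(total_bits):
        # Assumption: phases already at the cap are no longer candidates.
        candidates = [k for k in range(len(work)) if bits[k] < max_bits]
        if not candidates:
            break
        # max() returns the first maximal element, so ties go to the lowest index.
        k = max(candidates, key=lambda i: work[i])
        bits[k] += 1                # allocate one bit to the k-th phase
        work[k] /= 2.0              # divide the amplitude at position k by 2
    return bits


# Illustrative use: 4 bits shared among three phases.
example = allocate_phase_bits([8.0, 4.0, 1.0], 4)
```

Halving the selected amplitude after each bit means that a component must be twice as large as another to receive one more bit, so the allocation follows the base-2 logarithm of the amplitude spectrum.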
The phase spectrum $X_\phi(k)$ is then quantized with $R_\phi$ bits, following the bit allocation. Typically, it is this stage of the quantization procedure that requires the largest number of bits (in the order of 2 bits per phase, on average). The phases that are allocated a sufficient number of bits (for example, 2 bits or more) are separately scalar quantized; the other phases, which have been allocated fewer bits, are block quantized. Many schemes are possible to vector quantize those phases. For instance, if 4 given phases were allocated respectively 0, 1, 0 and 1 bits, they could be block quantized using a 2-bit, 4-dimensional vector quantizer. This implies the use (and storage) of a multi-rate, possibly multi-dimensional, vector quantizer. We used a simpler scheme where all the phases that have been allocated 1 bit are submitted to a second simpler bit allocation, as follows. If there are $K_1$ such phases, then the $K_1/2$ largest of those are allocated 2 bits, and the other $K_1/2$ (the smallest ones) are allocated 0 bit. All amplitudes corresponding to phases that are allocated 0 bit are set to zero. This results in dynamic decimation of the amplitude spectrum.
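The second-stage reallocation described above can be sketched as follows. The text does not say what happens when the number of 1-bit phases is odd; this sketch assumes that the larger half, rounded down, receives 2 bits and the rest receive 0, which leaves one bit unused in that case. "Largest" is taken to mean largest quantized amplitude. The function name is hypothetical.

```python
def redistribute_one_bit_phases(bits, amp_hat):
    """Turn every 1-bit phase into a 2-bit or 0-bit phase, then zero 0-bit amplitudes."""
    ones = [k for k, b in enumerate(bits) if b == 1]
    # Rank the 1-bit phases by quantized amplitude, largest first.
    ones.sort(key=lambda k: amp_hat[k], reverse=True)
    half = len(ones) // 2           # assumption: round down when the count is odd

    new_bits = list(bits)
    for k in ones[:half]:
        new_bits[k] = 2             # the largest ones get 2 bits
    for k in ones[half:]:
        new_bits[k] = 0             # the smallest ones get 0 bit

    # Amplitudes whose phase has 0 bit are set to zero (dynamic decimation).
    new_amp = [a if b > 0 else 0.0 for a, b in zip(amp_hat, new_bits)]
    return new_bits, new_amp


# Illustrative use on a 6-point spectrum.
bits, amps = redistribute_one_bit_phases(
    [3, 1, 1, 0, 1, 1], [9.0, 5.0, 2.0, 1.0, 4.0, 3.0])
```

With an even number of 1-bit phases the total number of phase bits is unchanged, so the scheme needs only scalar quantizers of 2 bits or more.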
4. CODING SCHEMES FOR MUSIC AND SPEECH
The TCX coding algorithm was used to encode both speech and music files, in both configurations, namely with and without pitch prediction. Pitch prediction is used only in the case of speech. The pitch delay is encoded with 8 bits, with possible delays ranging from 40 samples (400 Hz) to 295 samples (54 Hz); the pitch gain is encoded with 4 bits. Hence, a total of 12 bits are needed to encode the pitch parameters.
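The pitch-delay figures above are consistent with one codeword per integer delay at the 16 kHz sampling rate stated in Section 5, as the short check below illustrates. Integer-sample resolution is an inference from the numbers (8 bits give 256 values, exactly the span from 40 to 295), not something the text states, and the function names are hypothetical.

```python
def pitch_delay_range(min_delay=40, bits=8):
    """Smallest and largest delay (in samples) reachable with one codeword per delay."""
    return min_delay, min_delay + 2 ** bits - 1


def delay_to_hz(delay, fs_hz=16000):
    """Fundamental frequency corresponding to a pitch delay at sampling rate fs_hz."""
    return fs_hz / delay


low, high = pitch_delay_range()          # (40, 295)
f_max = delay_to_hz(low)                 # 400.0 Hz
f_min = delay_to_hz(high)                # about 54.2 Hz
```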
For speech, a frame length of $N = 96$ samples (6 ms) is chosen. This frame length is optimal for updating the pitch parameters, since smaller frames would result in an unnecessarily higher bit rate for updating the pitch parameters, and longer frames degrade the performance of the pitch predictor. As noted in Section 2, the value of $\gamma$ in $W(z)$ is set to 0.8. At 16 kbits/s, there are 96 bits available at each frame. The 16 LPC parameters are quantized once every four frames with 48 bits, using split-VQ; the 16 LSPs are split into 7 subvectors of respectively 2, 2, 2, 2, 2, 3 and 3 LSPs, and each subvector is allocated 7 bits, except the last 3 LSPs which are allocated 6 bits. The LPC coefficients are linearly interpolated for the other three frames. This amounts to 12 bits per frame for the LPC coefficients. With already 12 bits used to transmit the pitch parameters, there are (96 - 12 - 12) bits = 72 bits left at each frame to quantize the target signal. The 48-dimensional amplitude spectrum $X_a(k)$ is quantized as in Section 3.1, using 18 bits; the spectral prediction gain $\beta$ is quantized with 5 bits, and the spectral prediction residual $R(k)$ is quantized with 13 bits (single 48-dimensional vector with 8-bit shape codebook, and 5-bit gain codebook). The phase spectrum $X_\phi(k)$ is quantized using the remaining 54 bits, as described in Section 3.2.
For music, a frame length of 256 samples (16 ms) is chosen, and no pitch prediction is used. Again, the value of $\gamma$ in $W(z)$ is set to 0.8. At 24 kbits/s, 384 bits are available at each frame. The LPC coefficients are quantized and transmitted at each frame using 49 bits; the 16 LSPs are split into 7 subvectors of respectively 2, 2, 2, 2, 2, 3 and 3 LSPs, and each subvector is allocated 7 bits. Without pitch prediction, this leaves 335 bits to quantize the target vector. The amplitude spectrum of the target is quantized with 100 bits, as described in Section 3.1. A single spectral prediction gain $\beta$ is used on the whole amplitude spectrum, and quantized on 5 bits. The spectral prediction residual is split into eight 16-dimensional subvectors; the first 6 subvectors are quantized with 12 bits each (8 bits for the shape and 4 bits for the gain; the gains are quantized in pairs with 8 bits per pair); the last two subvectors are quantized with 11.5 bits each (8 bits per shape, and 7 bits for that last gain pair, making it 3.5 bits per gain). The phase spectrum of the target is finally quantized with the remaining 235 bits (384 - 49 - 100), as described in Section 3.2.
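The two bit budgets above can be checked with a few lines of Python. The sketch only re-adds the figures quoted in this section, with the 16 kHz sampling rate taken from Section 5; the function and key names are hypothetical.

```python
FS_HZ = 16000  # sampling rate stated in Section 5


def frame_bits(rate_bps, frame_len):
    """Number of bits available per frame at a given bit rate."""
    return rate_bps * frame_len // FS_HZ


def speech_budget():
    total = frame_bits(16000, 96)          # 6 ms frames at 16 kbits/s
    lsp = (6 * 7 + 6) // 4                 # 48 bits sent once every four frames
    pitch = 8 + 4                          # delay + gain
    amplitude = 5 + 8 + 5                  # beta + shape + gain
    phase = total - lsp - pitch - amplitude
    return {"total": total, "lsp": lsp, "pitch": pitch,
            "amplitude": amplitude, "phase": phase}


def music_budget():
    total = frame_bits(24000, 256)         # 16 ms frames at 24 kbits/s
    lsp = 7 * 7                            # 7 subvectors of 7 bits, every frame
    # beta, six 12-bit subvectors, two 8-bit shapes and one 7-bit gain pair
    amplitude = 5 + 6 * 12 + 2 * 8 + 7
    phase = total - lsp - amplitude
    return {"total": total, "lsp": lsp, "amplitude": amplitude, "phase": phase}
```

Both budgets leave more bits to the phases than to the amplitudes: 54 bits for 48 phases in the speech scheme and 235 bits for 128 phases in the music scheme.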
5. EXPERIMENTAL RESULTS
The TCX coder was tested on eight speech files, four male and four female, and eight music files, sampled at 16 kHz. The test music files contain a wide variety of sounds, ranging from organ, piano and orchestra to a castanet sequence and also a cappella singing (the often used Suzanne Vega sequence). The encoding rate is 16 kbits/s for speech and 24 kbits/s for music.
With the coding scheme described in Section 4, the average SNR obtained for speech is 17.31 dB, with a minimum of 16.14 dB and a maximum of 18.42 dB. Note that because of the low value of $\gamma$ in the perceptual filter $W(z)$, the coder is not as good a waveform follower as with higher values of $\gamma$; this decreases the SNR values, in all cases, but increases the perceptual quality because of the noise masking properties of $W(z)$. With a $\gamma$ value close to 1.0, the average SNR is about 19.0 dB, but the perceptual quality decreases significantly. The speech quality was judged very good by informal listeners.
The coding scheme used for music is also described in Section 4. In the case of music sequences, the SNR varies dramatically from one file to the other, ranging from 5 dB for the castanet sequence, to 14.5 dB for the organ sequence. This is to be expected since music is not in general as stationary as speech. For instance, the castanet sequence resembles a train of pulses, and any kind of predictive coder (either pitch prediction used on speech or spectral prediction used on music) will not be able to capture its highly temporally localized structure. Nevertheless, the perceptual quality of all music sequences was judged very good by informal listeners. In all cases, there is a perceptible difference between the original and synthesis signal, but the artifacts are not unpleasant, and in most cases are not detected when only the synthesis signal is listened to.

6. CONCLUSION
In this paper, we have proposed a wideband audio coder based on a new approach called TCX, which combines efficient time domain and frequency domain analysis to achieve the best perceptual reproduction of the original signal. The algorithm is based in part on the CELP model, with the main difference being that the target signal is directly quantized, in the frequency domain, and inverse filtered to give the innovative excitation, instead of using analysis-by-synthesis as in CELP. To make the target codebook adaptive, spectral prediction is used to encode the amplitude spectrum of the target. At each frame, most of the available bits are used to encode the target, with the phases requiring a larger portion of the bits than the amplitudes.
In the present work, two different schemes were used for encoding speech and music, with pitch prediction being used only when encoding speech. A frame length of 6 ms was used for speech, whereas a frame length of 16 ms was used for music.
Informal listening tests have shown that very high quality is obtained by the TCX coder, at 16 kbits/s for wideband speech and at 24 kbits/s for music.

7. ACKNOWLEDGMENTS
The authors wish to thank the CITI (Center for Information Technologies Innovation; Industry and Science Canada) for the financial support provided for this research, especially Mr. Raymond Descout, Program Head of Multimedia Systems.

REFERENCES
[1] C. Laflamme, J-P. Adoul, R. Salami, S. Morissette, and P. Mabilleau, "16 kbps wideband speech coding technique based on algebraic CELP," Proc. ICASSP'91, Toronto, Canada, 14-17 May, 1991, pp. 177-180.
[2] R. Salami, C. Laflamme, and J-P. Adoul, "Real-time implementation of a 9.6 kbit/s ACELP wideband speech coder," Proc. Globecom '92, Orlando, Florida, Dec. 6-9, 1992, pp. 447-451.
[3] R. Lefebvre, R. Salami, C. Laflamme, and J-P. Adoul, "8 kbits/s coding of speech with 6 ms frame-length," Proc. ICASSP'93, Minneapolis, Minnesota, April 27-30, 1993, pp. 612-615.
[4] K. K. Paliwal and B. Atal, "Efficient vector quantization of LPC parameters at 24 bits/frame," Proc. ICASSP'91, Toronto, Canada, 14-17 May, 1991, pp. 661-664.