ZTE EXHIBIT 1006

ITG-Fachbericht 139

Sprachkommunikation
(Speech Communication)

Papers presented at the ITG-Fachtagung (ITG conference)
on 17 and 18 September 1996 in Frankfurt am Main

Scientific conference chair:
Prof. Dr.-Ing. Arild Lacroix
Johann Wolfgang Goethe-Universität

Organizers:
Informationstechnische Gesellschaft im VDE (ITG)
Institut für Angewandte Physik
Johann Wolfgang Goethe-Universität

VDE-VERLAG GMBH · Berlin · Offenbach
`
Die Deutsche Bibliothek - CIP-Einheitsaufnahme (CIP cataloguing record)

Sprachkommunikation: papers of the ITG-Fachtagung on 17 and 18 September 1996
in Frankfurt am Main / organized by: Informationstechnische Gesellschaft im VDE
(ITG); Institut für Angewandte Physik, Johann-Wolfgang-Goethe-Universität.
Scientific conference chair: Arild Lacroix. - Berlin; Offenbach: VDE-Verlag, 1996
(ITG-Fachbericht; 139)
ISBN 3-8007-2204-6
NE: Lacroix, Arild [ed.]; Informationstechnische Gesellschaft: ITG-Fachbericht

ISSN 0932-6022
ISBN 3-8007-2204-6

© 1996 VDE-VERLAG GMBH, Berlin and Offenbach
Bismarckstraße 33, D-10625 Berlin

All rights reserved

Printed by: Druckerei Weinert, Berlin

9608
`
Preface

For a number of years, speech communication has received growing support in
basic research and in industrial research and development, owing both to
significant technological advances and to new signal processing methods and
algorithms. The driving force behind this development is without question the
increased importance of speech communication in today's digital radio and
cable networks.

This proceedings volume contains those contributions to the ITG-Fachtagung
Sprachkommunikation that were available at the time of going to press. The
conference is now being held for the fourth time. As at the previous
conferences, all disciplines of speech communication are covered in invited
overview lectures and in submitted papers. In line with current research
work, this volume reflects the main topics of speech and speaker recognition,
speech synthesis and speech coding, supplemented by contributions on speech
analysis and speech quality. In addition to fundamental principles and
methods, applications and implemented systems are presented, taking into
account the latest technologies such as signal processors, neural networks
and highly parallel processing.

My thanks go to the members of the programme committee for their help in
compiling the programme, to all authors for their profound contributions, and
to Dr. Schanz of the ITG for his support in preparing and running the
conference.

Frankfurt am Main, August 1996

Arild Lacroix, Scientific Conference Chair
`
WIDEBAND SPEECH CODING FOR THE GSM FULLRATE CHANNEL?

Jürgen Paulus*

Jürgen Schnitzler

Institute of Communication Systems and Data Processing (IND)
RWTH Aachen, University of Technology, D-52056 Aachen, Germany
Phone: +49.241.806982, Fax: +49.241.8888186, E-mail: Juergen.Schnitzler@ind.rwth-aachen.de
`
ABSTRACT

In this paper we propose a wideband speech coding scheme (50-7000 Hz) with
a bit rate of 12.3 kbit/s which could be used for the GSM Fullrate channel.
The coding scheme is based on two unequal subbands, 0-6 kHz and 6-7 kHz.
This approach was motivated by an experimental evaluation of the
instantaneous signal bandwidth of speech frames. The lower subband is
encoded using code-excited linear prediction (CELP). The higher subband is
replaced at the receiver by aliased components of the lower band, using an
interpolation filter with a cut-off frequency of 7 kHz. In informal
listening tests the speech quality was rated higher than that of the CCITT
G.722 wideband codec operating at 48 kbit/s. In comparison to the GSM
Fullrate codec at 13 kbit/s, naturalness and intelligibility are improved
significantly.
`
1. INTRODUCTION

For several years the so-called 'full rate codec' has been used in the GSM
system for mobile communication [1]. This codec, which operates at
13 kbit/s, was designed almost a decade ago for encoding narrowband speech
signals (0.3-3.4 kHz) sampled at 8 kHz. The effective bit rate is therefore
13/8 = 1.625 bit per sample. Recently, a new 8 kbit/s narrowband speech
coder has been standardized by Study Group 15 of ITU-T which provides
telephone quality at only 1 bit per sample [2]. In comparison to these
codecs, a wideband coding scheme would require a sampling rate of 16 kHz,
and the effective target bit rate would be 13/16 = 0.8125 bit per sample.
During the last few years there has been an increasing effort in wideband
speech coding at lower bit rates. This arises not only from applications
such as high quality videophone and digital mobile telephony, but also from
the growing market for multimedia systems where high quality speech and
audio are demanded. Compared to narrowband telephone speech, the reduction
of the lower cut-off frequency from 300 Hz to 50 Hz contributes to increased
naturalness and fullness. The high frequency extension from 3400 Hz to
7000 Hz provides better fricative differentiation and therefore higher
intelligibility.

* Now with Siemens AG, München.
E-mail: juergen.paulus@hl.siemens.de
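The bit-per-sample figures quoted in the introduction follow from dividing the bit rate by the sampling rate. A minimal Python sketch of that arithmetic is given here; the function name and the dictionary are illustrative and not part of the paper, and the 12.3 kbit/s entry uses the rate stated in the abstract.

```python
def bits_per_sample(bit_rate_kbit_s: float, sampling_rate_khz: float) -> float:
    """Effective bit rate in bit per sample (kbit/s divided by kHz)."""
    return bit_rate_kbit_s / sampling_rate_khz


rates = {
    "GSM full rate, 13 kbit/s at 8 kHz": bits_per_sample(13.0, 8.0),
    "ITU-T 8 kbit/s coder at 8 kHz": bits_per_sample(8.0, 8.0),
    "wideband target, 13 kbit/s at 16 kHz": bits_per_sample(13.0, 16.0),
    "proposed codec, 12.3 kbit/s at 16 kHz": bits_per_sample(12.3, 16.0),
}

for name, value in rates.items():
    print(f"{name}: {value:.4f} bit per sample")
```

Under these figures the proposed codec works at roughly 0.77 bit per sample, less than half the per-sample rate of the narrowband full rate codec.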
`
2. ENCODER STRUCTURE

Recently, we presented a split-band coding scheme using two unequal
subbands, i.e. 0-6 kHz and 6-7 kHz [3]. This approach was motivated by an
experimental evaluation of the instantaneous signal bandwidth of speech
frames, i.e. the cut-off frequency necessary to encode the current frame
without loss of perceptual speech quality. In the experiment, a bank of
linear-phase FIR lowpass filters with cut-off frequencies decreasing in
steps of 1 kHz (starting at 7 kHz) was applied to the speech data. After
filtering a frame of 10 ms length with all the lowpass filters, an energy
ratio $\kappa_E = E / \Delta E$ was calculated from the full-band signal
energy $E$ and the energy $\Delta E$ of the difference between the filtered
and the original signal. If the value of this energy ratio is greater than
a certain threshold $\kappa_{E,thr}$, the cut-off frequency of the current
frame can be reduced to the cut-off frequency of the lowpass. The threshold
$\kappa_{E,thr}$ was obtained by informal listening tests. Figure 1 shows a
histogram of the final cut-off frequencies.

[Figure 1: histogram; horizontal axis: frequency [kHz], 2 to 7]

Figure 1. Histogram of allowable cut-off frequencies

The simulation was performed using 100 s of multilingual speech (English,
German, French), each with male and female speakers. The speech material,
which has a speech activity of 95 %, was extracted from the European
Broadcasting Union database [4]. This material was bandlimited to the
frequency range 50-7000 Hz, according to the CCITT G.722 recommendation [5].
With a frame size of 160 samples (10 ms) and an energy threshold of
$\kappa_{E,thr} = 800000$, almost 40 % of the frames (4249 out of 10171)
could be encoded using a bandwidth of only 6 kHz, without loss of perceptual
speech quality.

ITG-Fachtagung SPRACHKOMMUNIKATION, Frankfurt am Main, 17. und 18. September 1996

[Figure 2: spectrogram with the quantized cut-off frequency]

Figure 2. Spectrogram and quantized cut-off frequency $f_{cq}(t)$ for the
speech segment 'To administer medicine'.
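The frame-wise bandwidth decision described in this section (energy ratio $\kappa_E = E / \Delta E$ compared against a threshold, for lowpass cut-offs stepping down from 7 kHz) can be sketched in Python as follows. This is only an illustration under stated assumptions: the Blackman-windowed sinc design, the 201-tap filter length, the fallback to the highest cut-off and the synthetic 1.5 kHz test tone are not taken from the paper, and all function names are hypothetical.

```python
import math


def lowpass_taps(cutoff_hz, fs_hz, num_taps=201):
    """Linear-phase windowed-sinc lowpass (Blackman window), unity DC gain."""
    mid = (num_taps - 1) // 2
    fc = cutoff_hz / fs_hz  # normalized cut-off in cycles per sample
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = (0.42 - 0.5 * math.cos(2.0 * math.pi * n / (num_taps - 1))
                  + 0.08 * math.cos(4.0 * math.pi * n / (num_taps - 1)))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]


def filter_zero_delay(signal, taps):
    """FIR filtering with the (num_taps - 1) / 2 sample delay compensated."""
    mid = (len(taps) - 1) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for j, h in enumerate(taps):
            i = n + mid - j
            if 0 <= i < len(signal):
                acc += h * signal[i]
        out.append(acc)
    return out


def energy_ratio(frame, filtered_frame):
    """kappa_E = E / Delta E for one frame."""
    energy = sum(x * x for x in frame)
    diff_energy = sum((x - y) ** 2 for x, y in zip(frame, filtered_frame))
    return math.inf if diff_energy == 0.0 else energy / diff_energy


def allowable_cutoff(signal, filtered_by_cutoff, start, frame_len, threshold):
    """Lowest cut-off, stepping down from the highest, whose ratio exceeds the threshold."""
    frame = signal[start:start + frame_len]
    best = max(filtered_by_cutoff)  # assumption: fall back to the full band
    for cutoff in sorted(filtered_by_cutoff, reverse=True):
        filtered = filtered_by_cutoff[cutoff][start:start + frame_len]
        if energy_ratio(frame, filtered) <= threshold:
            break
        best = cutoff
    return best


FS_HZ = 16000
FRAME_LEN = 160        # 10 ms at 16 kHz, as in the text
THRESHOLD = 800000.0   # threshold value quoted in the text

# Synthetic test signal (not from the paper): a 1.5 kHz tone, 40 ms long.
tone = [math.sin(2.0 * math.pi * 1500.0 * n / FS_HZ) for n in range(4 * FRAME_LEN)]
filtered = {fc: filter_zero_delay(tone, lowpass_taps(fc, FS_HZ))
            for fc in range(7000, 0, -1000)}
demo_cutoff = allowable_cutoff(tone, filtered, FRAME_LEN, FRAME_LEN, THRESHOLD)
print("allowable cut-off for the second frame:", demo_cutoff, "Hz")
```

The second frame is used in the demonstration so that the zero-padded filter edges do not influence the energy ratio.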
The full bandwidth was selected only during unvoiced parts of the speech
signal. Due to the mechanism of speech production, spectral components
above approximately 4 kHz belong almost exclusively to fricative sounds. In
contrast, during voiced parts of speech most of the signal energy is
present in the lower subband. Therefore, it is not necessary to encode the
higher subband of voiced frames. Transform coding techniques behave in a
similar way in that they allocate more bits to the low frequency components
than to the high frequency components of speech. Figure 2 gives an example
of a speech segment and its quantized instantaneous bandwidth.

Visual inspection of the waveforms of the speech components within the
frequency ranges 5-6 kHz and 6-7 kHz shows that both signals exhibit a
similar distribution of energy along the time and frequency axes for a
given speech sound. Furthermore, informal listening experiments showed that
during unvoiced parts it is sufficient to add some noise-like spectral
components above 6 kHz to obtain the perceptual speech quality of a 7 kHz
speech signal. This effect was used in our previous proposal [3].

This observation can alternatively be used to substitute the speech
components in the frequency range 6-7 kHz without transmitting any side
information. The wideband speech signal is encoded using only the spectral
band 0-6 kHz, and the missing components above 6 kHz are produced at the
receiver by interpolating the lower subband signal from 12 kHz to 16 kHz
with an interpolation filter whose cut-off frequency is 7 kHz. This
deliberately violates the interpolation requirements: a cut-off frequency
of 6 kHz would be needed to suppress the spectral image of the lower band,
so the mirrored 5-6 kHz band passes into the 6-7 kHz range.
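The effect of the 7 kHz interpolation filter can be illustrated with a small Python sketch. It resamples a 12 kHz signal to 16 kHz (zero-stuffing by 4, lowpass filtering at 48 kHz, keeping every third sample) and compares a 6 kHz with a 7 kHz cut-off. The filter design (Blackman-windowed sinc, 481 taps), the synthetic 5.5 kHz test tone and all function names are assumptions made for this illustration and are not taken from the paper.

```python
import math


def lowpass_taps(cutoff_hz, fs_hz, num_taps):
    """Linear-phase windowed-sinc lowpass (Blackman window), unity DC gain."""
    mid = (num_taps - 1) // 2
    fc = cutoff_hz / fs_hz
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = (0.42 - 0.5 * math.cos(2.0 * math.pi * n / (num_taps - 1))
                  + 0.08 * math.cos(4.0 * math.pi * n / (num_taps - 1)))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]


def interpolate_12k_to_16k(x12, cutoff_hz, num_taps=481):
    """Resample 12 kHz -> 16 kHz: zero-stuff by 4, lowpass at 48 kHz, keep every 3rd sample."""
    up, down = 4, 3
    taps = lowpass_taps(cutoff_hz, 48000, num_taps)
    mid = (num_taps - 1) // 2
    y16 = []
    for m in range(len(x12) * up // down):
        center = m * down  # output sample position on the 48 kHz grid
        acc = 0.0
        for j in range(num_taps):
            i = center + mid - j  # index into the zero-stuffed signal
            if i % up == 0 and 0 <= i // up < len(x12):
                acc += taps[j] * x12[i // up]
        y16.append(up * acc)  # gain of 4 restores the amplitude after zero-stuffing
    return y16


def tone_amplitude(x, freq_hz, fs_hz):
    """Amplitude of a sinusoid at freq_hz, estimated by correlation."""
    re = sum(v * math.cos(2.0 * math.pi * freq_hz * n / fs_hz) for n, v in enumerate(x))
    im = sum(v * math.sin(2.0 * math.pi * freq_hz * n / fs_hz) for n, v in enumerate(x))
    return 2.0 * math.hypot(re, im) / len(x)


# Synthetic lower-band signal (not from the paper): a 5.5 kHz tone sampled at 12 kHz.
x12 = [math.cos(2.0 * math.pi * 5500.0 * n / 12000.0) for n in range(480)]

# Keep 20 ms of steady-state output, away from the filter edge effects.
outputs = {fc: interpolate_12k_to_16k(x12, fc)[160:480] for fc in (6000, 7000)}
for fc, steady in outputs.items():
    print(fc, "Hz cut-off:",
          "5.5 kHz amplitude", round(tone_amplitude(steady, 5500.0, 16000.0), 3),
          "6.5 kHz amplitude", round(tone_amplitude(steady, 6500.0, 16000.0), 3))
```

With the 6 kHz cut-off the image of the tone at 12 - 5.5 = 6.5 kHz is suppressed, whereas the 7 kHz cut-off lets it pass, which is how the 6-7 kHz band is filled at the receiver without any transmitted information.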
A further spectrogram for the same speech example is given in Figure 3. The
decimation/interpolation chain described above was applied to the original
speech signal of Figure 2, but without the decimated signal being encoded
yet. Both spectrograms exhibit considerable similarity, although the
aliasing effect is visible, especially near t = 0.5 ms. However, the
enhancement of the perceptual speech quality is surprisingly high. The
basic approach was already proposed by Dietrich [6] in the context of
wideband ADPCM.

[Figure 3: spectrogram]

Figure 3. Spectrogram of the previous speech segment, with the original
signal decimated from 16 kHz to 12 kHz and subsequently re-interpolated
using a 7 kHz interpolation filter.

The encoder and decoder structures are given in Figure 4 and Figure 5.
`
[Figure 4: block diagram; recoverable labels: input s(n), f_s = 16 kHz, 50-6000 Hz]

Figure 4. Encoder structure of the proposed codec

[Figure 5: block diagram; recoverable labels: f_s = 12 kHz, f_c = 7000 Hz, f_s = 16 kHz, output ŝ(n)]

Figure 5. Decoder structure of the proposed codec
`
3. CELP ENCODING SCHEME

For encoding the decimated signal, code-excited linear prediction (CELP,
Atal et al. [7]) is used. The coder operates on speech frames of 20 ms
(240 samples).
The subframe lengths used for the different codec parts are illustrated in
Figure 6: 5 ms for the pitch analysis and 2.5 ms for the fixed codebook
search.
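The frame and subframe lengths in samples follow from the 12 kHz sampling rate of the decimated signal. A short Python check of this arithmetic is sketched below; the helper name `samples` and the variable names are illustrative only.

```python
SAMPLING_RATE_HZ = 12_000  # decimated lower-band signal


def samples(duration_ms: float) -> int:
    """Number of samples in a segment of the given duration."""
    return round(SAMPLING_RATE_HZ * duration_ms / 1000)


frame_len = samples(20)          # speech frame
ltp_subframe_len = samples(5)    # pitch analysis (LTP) subframe
cb_subframe_len = samples(2.5)   # fixed codebook subframe

ltp_subframes_per_frame = frame_len // ltp_subframe_len
cb_subframes_per_frame = frame_len // cb_subframe_len

print(frame_len, ltp_subframe_len, cb_subframe_len,
      ltp_subframes_per_frame, cb_subframes_per_frame)
```

The resulting four LTP subframes and eight fixed codebook subframes per frame match the multipliers 4x and 8x that appear in the bit allocation of Table 2.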
[Figure 6: timing diagram of one 20 ms speech frame; recoverable labels: LPC, actual speech frame, four LTP subframes, time axis 0 to 20 ms in steps of 2.5 ms]

Figure 6. Update of the codec parameters

3.1. LP analysis
The linear prediction (LP) analysis uses a covariance-lattice approach as
described by Cumani [8]. The analysis window length is 260 samples
(≈ 21.7 ms), centered around the middle of the frame. The order of the LP
filter in our realization is 14. The prediction coefficients are updated
every 20 ms and converted to line spectral frequencies (LSF) [9]. Prior to
solving the equations for the coefficients, the covariance matrix is
modified by weighting it with a binomial window having an effective
bandwidth of 80 Hz [10].
The LP coefficients are encoded with 42 bits by a hybrid scheme combining
trained predictive vector quantization and lattice vector quantization
(PVQ-LVQ) of the line spectral frequencies [11]. This computationally
efficient quantization scheme leads to an average spectral distortion [12]
of only 1 dB.
Linear interpolation of the LP filter coefficients is performed for the
first three LTP subframes. This is done in the LSF domain between the
quantized current coefficient set and the quantized coefficient set of the
previous frame. For the last subframe, no interpolation is performed.

3.2. Pitch analysis
Every 5 ms, the long-term prediction (LTP) is carried out as a combination
of open-loop and closed-loop LT analysis. Every 10 ms, an open-loop pitch
estimate is calculated using a weighted correlation measure to avoid
multiples of the pitch period. Thus, a smoothed estimate of the pitch
contour is obtained. In the first and third LTP subframes, a focussed
closed-loop adaptive codebook search is performed around the open-loop
estimate $T_{ol}$, and in the second and fourth subframes a restricted
search is performed around the pitch lag $T_{cl,1}$ of the closed-loop
analysis of the first (third) subframe, as depicted in Figure 7.

[Figure 7: search ranges for the 1st and 2nd subframe]

Figure 7. Long-term analysis using a combined open-loop and closed-loop
analysis and a focussed search strategy.

This procedure results in a delta encoding scheme requiring
2 x (8+6) = 28 bits for coding the 4 pitch lags. The closed-loop search is
performed using an adaptive codebook filled with previously computed
excitation samples. The minimum pitch lag is half of the subframe length,
i.e. $T_{min} = 30$ samples. Additionally, in the lower delay range a
fractional pitch approach is used [13], as shown in Figure 8.

[Figure 8: recoverable labels: integer delay, fractional pitch, $\tau_{min}$, $\tau_{max}$, 120 samples]

Figure 8. Combined integer and fractional pitch search ranges during
closed-loop adaptive codebook search.

The pitch gain is quantized nonuniformly with 4 bits.

3.3. Fixed Codebook
Every 2.5 ms (30 samples), an excitation vector is selected from a modified
16-bit ternary sparse codebook, as described by Salami et al. [14]. An
innovation vector contains 4 nonzero pulses, as shown in Table 1.

Amplitude   Position
±1          0, 4, 8, 12, 16, 20, 24, 28
±1          1, 5, 9, 13, 17, 21, 25, 29
±1          2, 6, 10, 14, 18, 22, 26, (30)
±1          3, 7, 11, 15, 19, 23, 27, (31)

Table 1. 16-bit ternary sparse codebook [14].

Note that the last position of the 3rd and 4th pulse falls outside the
subframe boundary. This gives the possibility of a variable number of
pulses per frame.
Each pulse has 8 possible positions. Therefore, the position of each pulse
is encoded with 3 bits. Furthermore, each pulse amplitude is encoded with
1 bit (i.e. ±1), resulting in a total of 16 bits for the 4 pulses.
Due to the structure of the codebook, a fast search procedure is possible.
Additionally, a focussed search approach is used to further reduce the
computational load of the codebook search [14].
To reduce the dynamic range of the fixed codebook gain, a fixed gain
predictor is used. The gain predictor predicts the log energy of the
current fixed codebook vector based on the log energy of the previously
selected scaled fixed codebook vector. This is done in a similar way as in
a preliminary version of ITU-T G.729 [15]. The residual of the gain
predictor is quantized nonuniformly with 4 bits.

3.4. Perceptual weighting filter
The perceptual weighting filter W(z) used during the minimization process
has a transfer function of the form

$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)} \qquad (1)$$

with A(z) being the LP analysis filter, using unquantized LP filter
coefficients. Different sets of weighting factors $\{\gamma_1, \gamma_2\}$
are used for the adaptive and the fixed codebook search.
The coefficients of the weighting filters are updated similarly to the LP
synthesis filter, but using the unquantized LSF.
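A weighting filter of the form W(z) = A(z/γ1) / A(z/γ2) can be realized by scaling the k-th coefficient of A(z) with γ^k in the numerator and the denominator. The Python sketch below illustrates this. It assumes the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, and it uses a second-order example filter and the weighting factors 0.9 and 0.6 purely for illustration; the paper does not state its convention or the values of γ1 and γ2, and the codec itself uses an LP order of 14.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z / gamma) for a = [1, a_1, ..., a_p]."""
    return [coef * gamma ** k for k, coef in enumerate(a)]


def weighting_filter(x, a, gamma1, gamma2):
    """Filter x with W(z) = A(z / gamma1) / A(z / gamma2), zero initial state."""
    num = bandwidth_expand(a, gamma1)
    den = bandwidth_expand(a, gamma2)
    y = []
    for n in range(len(x)):
        # FIR part: numerator A(z / gamma1) applied to the input
        acc = sum(num[k] * x[n - k] for k in range(len(num)) if n - k >= 0)
        # Recursive part: denominator A(z / gamma2) applied to past outputs
        acc -= sum(den[k] * y[n - k] for k in range(1, len(den)) if n - k >= 0)
        y.append(acc)
    return y


# Illustrative values only: a second-order A(z) and weighting factors 0.9 / 0.6.
a_example = [1.0, -0.5, 0.25]
impulse = [1.0] + [0.0] * 7
weighted = weighting_filter(impulse, a_example, gamma1=0.9, gamma2=0.6)
print([round(v, 4) for v in weighted])
```

Setting both factors to the same value makes numerator and denominator cancel, so the filter passes the signal unchanged; this is a convenient sanity check for an implementation.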
`
4. BIT ALLOCATION

According to Table 2, a total bit rate of 12.3 kbit/s is achieved. The
codec will be demonstrated at the conference by audio tape.
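The total bit rate can be checked by summing the per-frame allocations listed in Table 2 and dividing by the 20 ms frame length. A short Python sketch follows; the dictionary layout and variable names are illustrative.

```python
FRAME_MS = 20  # frame length in milliseconds

bits_per_frame = {
    "LPC": 42,
    "LT-Index": 2 * (8 + 6),
    "LT-Gain": 4 * 4,
    "CB-Index": 8 * 16,
    "CB-Gain": 8 * 4,
}

total_bits = sum(bits_per_frame.values())
bit_rate_kbit_s = total_bits / FRAME_MS  # bits per millisecond equal kbit/s

print(total_bits, "bits per frame,", bit_rate_kbit_s, "kbit/s")
```

The same division applied to the groups gives 2.1 kbit/s for the LP coefficients, 2.2 kbit/s for the long-term predictor (index and gain) and 8.0 kbit/s for the fixed codebook (index and gain).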
Parameter   Bit allocation   Bits/frame   Bit rate
LPC                          42 bit       2.1 kbit/s
LT-Index    2 x (8+6) bit    28 bit
LT-Gain     4 x 4 bit        16 bit       2.2 kbit/s (LT-Index and LT-Gain)
CB-Index    8 x 16 bit       128 bit
CB-Gain     8 x 4 bit        32 bit       8.0 kbit/s (CB-Index and CB-Gain)
Total                                     12.3 kbit/s

Table 2. Bit allocation for a 20 ms frame of the proposed 12.3 kbit/s
wideband codec

5. CONCLUSION

In this paper a split-band coding scheme for wideband speech coding at
12.3 kbit/s has been presented. It is based on two unequal subbands, 0-6 kHz
and 6-7 kHz. This approach was motivated by an experimental evaluation of
the instantaneous signal bandwidth. The lower band codec operates on speech
frames of 20 ms using code-excited linear prediction. Taking advantage of a
similar energy distribution in the 5-6 kHz and 6-7 kHz bands, no information
is transmitted for the upper band; the missing components above 6 kHz are
generated at the decoder by extrapolation from the 5-6 kHz band. According
to informal listening tests, this experimental coding scheme exhibits a
speech quality rated higher than that of the CCITT G.722 wideband codec
operating at 48 kbit/s. Compared to the GSM Fullrate codec at 13 kbit/s, a
significant improvement of naturalness and intelligibility has been
achieved.

ACKNOWLEDGEMENTS

This work has been supported by the Technologiezentrum of Deutsche Telekom
AG. The authors would especially like to thank Mr. G. Schröder.
Acknowledgements are made to Prof. P. Vary and the colleagues of the speech
coding group for inspiring discussions, especially to T. Fingscheidt.

REFERENCES

[1] P. Vary, K. Hellwig, R. Hofmann, R.J. Sluyter, C. Galand, and M. Rosso,
"Speech Codec for the European Mobile Radio System", in Proc. Int. Conf.
Acoust., Speech, Signal Processing, ICASSP, New York, USA, 1988,
pp. 227-230.
[2] ITU-T Recommendation G.729, "Coding of Speech at 8 kbps using
conjugate-structure algebraic-code-excited linear-prediction (CS-ACELP)".
[3] J. Paulus and J. Schnitzler, "16 kbit/s Wideband Speech Coding Based on
Unequal Subbands", in Proc. Int. Conf. Acoust., Speech, Signal Processing,
ICASSP, Atlanta, Georgia, USA, 1996, pp. 255-258.
[4] European Broadcasting Union (EBU), Sound Quality Assessment Material
(Recordings for Subjective Test), no. 422 204-2 edition.
[5] CCITT, "7 kHz Audio Coding within 64 kbit/s", in Recommendation G.722,
vol. Fascicle III.4 of Blue Book, pp. 269-341. Melbourne 1988.
[6] M. Dietrich, "Performance and Implementation of a Robust ADPCM
Algorithm for Wideband Speech Coding with 64 kbit/s", in Proc. Int. Zurich
Seminar on Digital Communications, Zurich, Switzerland, March 1984.
[7] B.S. Atal and M.R. Schroeder, "Stochastic Coding of Speech Signals at
Very Low Bit Rates", in Proc. Int. Conf. Communication (ICC), May 1984,
pp. 1610-1613.
[8] A. Cumani, "On a Covariance-Lattice Algorithm for Linear Prediction",
in Proc. Int. Conf. Acoust., Speech, Signal Processing, ICASSP, Paris,
France, 1982, pp. 651-654.
[9] P. Kabal and R.P. Ramachandran, "The Computation of Line Spectral
Frequencies Using Chebyshev Polynomials", IEEE Trans. Acoust., Speech,
Signal Processing, vol. 34, no. 6, pp. 1419-1426, December 1986.
[10] Y. Tohkura, F. Itakura, and S. Hashimoto, "Spectral Smoothing
Technique in PARCOR Speech Analysis-Synthesis", IEEE Trans. Acoust.,
Speech, Signal Processing, vol. 26, no. 6, pp. 587-596, December 1978.
[11] J. Schnitzler, "Lattice-Quantisierung der LP-Filterkoeffizienten auf
der Basis der Line Spectral Frequencies (LSF)", ITG Fachtagung
Sprachkommunikation, Frankfurt am Main, 1996.
[12] K.K. Paliwal and B.S. Atal, "Efficient Vector Quantization of LPC
Parameters at 24 Bits/Frame", IEEE Transactions on Speech and Audio
Processing, vol. 1, no. 1, pp. 3-13, January 1993.
[13] J.S. Marques, J.M. Tribolet, I.M. Trancoso, and L.B. Almeida, "Pitch
Prediction with Fractional Delays in CELP Coding", in Proc. EUROSPEECH,
Genoa, Italy, 1989, pp. 509-513.
[14] R. Salami, C. Laflamme, J-P. Adoul, A. Kataoka, S. Hayashi,
C. Lamblin, D. Massaloux, S. Proust, P. Kroon, and Y. Shoham, "Description
of the Proposed ITU-T 8 kb/s Speech Coding Standard", in Proc. IEEE
Workshop on Speech Coding, Annapolis, Maryland, USA, September 1995,
pp. 3-4.
[15] R. Salami, C. Laflamme, J.-P. Adoul, and D. Massaloux, "A Toll Quality
8 kb/s Speech Codec for the Personal Communications System (PCS)", IEEE
Trans. Vehicular Technology, vol. 43, no. 3, pp. 808-816, August 1994.
`