`
Jürgen Schnitzler

RWTH Aachen, University of Technology
Institute of Communication Systems and Data Processing (IND), D-52056 Aachen, Germany
http://www.ind.rwth-aachen.de/~juergen
Juergen.Schnitzler@ind.rwth-aachen.de
`
`ABSTRACT
`This paper describes a wideband (7 kHz) speech compres-
`sion scheme operating at a bit rate of 13.0 kbit/s, i.e. 0.8 bit
`per sample. We apply a split-band (SB) technique, where
`the 0-6 kHz band is critically subsampled and coded by an
`ACELP approach. The high frequency signal components
`(6-7 kHz) are generated by an improved High-Frequency-
`Resynthesis (HFR) at the decoder such that no additional
`information has to be transmitted. In informal listening
`tests, the subjective speech quality was rated to be compa-
`rable to the CCITT G.722 wideband codec at 48 kbit/s.
`
1. INTRODUCTION

The interest in using wideband (50 ... 7000 Hz) speech and audio
signals has grown within the last years. Compared to 'narrowband',
i.e. telephone band limited signals, the larger signal bandwidth
provides much more naturalness and intelligibility, and thus promises
a significant quality improvement for telecommunication services. As
a first wideband speech compression standard released in 1988, the
CCITT G.722 [1] subband ADPCM scheme operates at bit rates of 48, 56
or 64 kbit/s (i.e. at effective rates of 3-4 bit per sample).
Recently, ITU-T study group 16 has started a new standardization of a
coding algorithm which is required to exhibit, at bit rates of 16, 24
and 32 kbit/s (1-2 bit per sample), a performance similar to that of
the G.722 codec at its respective rates under most operating
conditions [2]. The new codec aims at wireline applications such as
ISDN wideband telephony, videoconferencing, and also at packet
transmission applications such as B-ISDN and 'multimedia'
transmissions in the internet. In [3] we have proposed a split band
coding scheme that fulfilled most of the requirements for speech at
16 kbit/s.
New applications for wideband speech will arise in the domain of
mobile communications, which experienced a tremendous development
during the last decade. Future interconnections between fixed and
mobile networks and the increasing competition between their
operators, e.g. in the Wireless Local Loop, will certainly excite a
need for high quality services. Low rate wideband speech coding
schemes (i.e. at effective rates of 0.5-1 bit per sample) may play an
important role in this context. In ETSI SMG 11 the introduction of a
wideband mode is currently being discussed for the forthcoming AMR
(Adaptive Multi-Rate) codec standard [4], which shall replace the
existing GSM codecs. In a previous proposal [5] we have introduced an
algorithm that provided, at a rate well below 13 kbit/s, a clean
speech quality similar to that of our original algorithm [3] at
16 kbit/s.
In this paper, we present a modified scheme that shows an improved
performance under both clean speech and acoustic background noise
conditions. In the sequel, section 2 gives an overview of the general
codec structure, whereas section 3 focusses on the core codec, an
ACELP algorithm designed for the main 0-6 kHz subband signal. In
section 4 we propose an improved high-frequency resynthesis of the
6-7 kHz band that does not require the transmission of any side
information.

2. GENERAL CODEC STRUCTURE

`Similarly to CCITT G.722, our basic approach is to split
`the input signal into two subbands, in order to allocate the
`available bit rate according to both the spectral distribution
`and the subjective importance of the subband components.
`An important difference is that we found an unequal split-
`ting at a cutoff frequency of 6 kHz to be a more suitable
`solution [3]. This conclusion was motivated by an inspec-
`tion of the instantaneous bandwidth of speech signals and
`by the spectral resolution of human perception: the 6-7 kHz
`band corresponds to about one critical band only.
In our configuration, those spectral portions of the upper
subband (6-7 kHz) which are sufficient to convey a correct
subjective impression of wideband speech can be represented
either by coding them at a very low bit rate or even, as
described in this paper, by extrapolation at the decoder side.
Furthermore, this band splitting allows the lower subband
(0-6 kHz) to be more efficiently quantized: at an overall target
bit rate of 13 kbit/s, the effective bit rate increases from
R ≈ 0.8 bit per sample at a sampling rate of fs = 16 kHz to
R ≈ 1.1 bit per sample at fs = 12 kHz.
This suggests the use of state-of-the-art ACELP (Algebraic
Code-Excited Linear Prediction) techniques for coding the lower
subband. Currently the domain of toll-quality, medium rate
narrowband speech codecs is dominated by algorithms based on
ACELP, as they best fulfill the performance requirements in
terms of subjective quality, complexity, robustness and delay.
Examples of ACELP codecs are the GSM Enhanced Full Rate (EFR)
codec [6] (12.2 kbit/s, i.e. R ≈ 1.53 bit per sample), the ITU-T
G.729 universal 8 kbit/s codec [7] (R = 1 bit per sample) and its
extensions, or the IS-641 standard [8] (7.4 kbit/s or R ≈ 0.93 bit
per sample) for the US-TDMA system.
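
The effective rates quoted in this section follow from dividing the bit rate by the sampling frequency. As a quick check, assuming the usual 8 kHz sampling rate for the narrowband codecs (a value that is not stated in the text):

```latex
\begin{align*}
R &= \frac{\text{bit rate}}{f_s} \\
R_{\text{wideband}}   &= \frac{13\ \text{kbit/s}}{16\ \text{kHz}} \approx 0.81\ \text{bit/sample} \\
R_{\text{lower band}} &= \frac{13\ \text{kbit/s}}{12\ \text{kHz}} \approx 1.08\ \text{bit/sample} \\
R_{\text{EFR}}        &= \frac{12.2\ \text{kbit/s}}{8\ \text{kHz}} \approx 1.53\ \text{bit/sample} \\
R_{\text{G.729}}      &= \frac{8\ \text{kbit/s}}{8\ \text{kHz}} = 1\ \text{bit/sample} \\
R_{\text{IS-641}}     &= \frac{7.4\ \text{kbit/s}}{8\ \text{kHz}} \approx 0.93\ \text{bit/sample}
\end{align*}
```

With the 12 kHz lower band, the proposed codec therefore works at an effective rate close to that of G.729 and IS-641.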
0-7803-4428-6/98 $10.00 © 1998 IEEE

Ex. 1045 / Page 1 of 4
Apple v. Saint Lawrence

Figure 1: a) Structure of wideband encoder
b) Structure of wideband decoder

Figure 1 a) shows the encoder structure of our proposal. A
rate conversion module extracts the 0-6 kHz lower subband
from the input wideband (7 kHz) signal and reduces the
sampling rate from 16 kHz to 12 kHz using a linear phase
analysis filter.
`The ACELP core codec operates on speech frames of 20 ms
`(240 samples at fs = 12 kHz). For every frame, a total of
`260 bits is transmitted over the channel, including 8 bits of
`protection information that can be used for error conceal-
`ment in the decoder. The resulting bit allocation is shown
`in Table 1. Details of the ACELP configuration are given
`in the next section.
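
The frame budget can be cross-checked with a few lines of Python. This is only a bookkeeping sketch: the per-parameter bit counts are the ones given in Table 1 and section 3, while the function names and the dictionary layout are illustrative choices.

```python
import math

FRAME_MS = 20          # frame length stated in the text
FCB_SUBFRAMES = 8      # one fixed codebook vector every 2.5 ms
LTP_SUBFRAMES = 4      # one adaptive codebook update every 5 ms


def fcb_index_bits():
    """Bits for one 4-pulse shape vector: positions plus one sign per pulse."""
    position_counts = [16, 16, 8, 8]          # pulses 1, 2, 3, 4
    position_bits = sum(int(math.log2(n)) for n in position_counts)
    sign_bits = len(position_counts)
    return position_bits + sign_bits


def frame_budget():
    """Bits per 20 ms frame for each transmitted parameter."""
    return {
        "LPC": 32,
        "ACB-Index": 8 + 6 + 8 + 6,
        "ACB-Gain": LTP_SUBFRAMES * 4,
        "FCB-Index": FCB_SUBFRAMES * fcb_index_bits(),
        "FCB-Gain": FCB_SUBFRAMES * 4,
        "Parity": 8,
    }


def bit_rate_kbps(bits_per_frame):
    # bits per millisecond is numerically equal to kbit/s
    return bits_per_frame / FRAME_MS


budget = frame_budget()
total_bits = sum(budget.values())
for name, bits in budget.items():
    print(f"{name:10s} {bits:4d} bit  {bit_rate_kbps(bits):4.1f} kbit/s")
print(f"{'Total':10s} {total_bits:4d} bit  {bit_rate_kbps(total_bits):4.1f} kbit/s")
```

The entries add up to 260 bits per frame, i.e. 13.0 kbit/s, and the 18 bits per fixed codebook vector come out of the pulse position and sign counts.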
`The decoder structure is shown in Figure 1 b) and will be re-
`fined in section 4. The received bits are used in the ACELP
`decoder to synthesize the lower band signal. A postfilter
`is applied to the signal in order to enhance the percep-
`tual quality. The receiver rate conversion module interpo-
`lates the postfilter output to the original sampling rate of
`16 kHz. Both the decimation (transmitter) and the interpo-
`lation (receiver) filter contribute a delay of 2.5 ms each. In
`conjunction with the framing, the overall algorithmic delay
`therefore amounts to 25 ms.
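
A minimal sketch of the transmitter rate conversion and of the delay bookkeeping is given below. The text only states that the filter has linear phase and contributes 2.5 ms; the 3/4 polyphase structure, the Hamming-windowed sinc design, the 241 taps at an intermediate rate of 48 kHz and all names are assumptions made for illustration.

```python
import math

L_UP, M_DOWN = 3, 4            # 16 kHz * 3 / 4 = 12 kHz
FS_IN, FS_MID = 16000, 48000
HALF_LEN = 120                 # assumed: 120 samples at 48 kHz = 2.5 ms group delay
CUTOFF_HZ = 6000.0


def design_lowpass():
    """Hamming-windowed sinc, symmetric about HALF_LEN (hence linear phase)."""
    fc = CUTOFF_HZ / FS_MID                      # cycles per sample
    taps = []
    for n in range(2 * HALF_LEN + 1):
        t = n - HALF_LEN
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (2 * HALF_LEN))
        taps.append(ideal * window)
    scale = L_UP / sum(taps)                     # restore amplitude lost by zero stuffing
    return [c * scale for c in taps]


def decimate_16k_to_12k(x, h):
    """Polyphase form of: upsample by 3, filter with h, keep every 4th sample."""
    n_out = len(x) * L_UP // M_DOWN
    y = []
    for m in range(n_out):
        j = m * M_DOWN                           # index on the 48 kHz grid
        acc = 0.0
        # only taps that meet a real (not zero-stuffed) input sample contribute
        for k in range(j % L_UP, len(h), L_UP):
            i = (j - k) // L_UP
            if i < 0:
                break
            if i < len(x):
                acc += h[k] * x[i]
        y.append(acc)
    return y


h = design_lowpass()
filter_delay_ms = 1000.0 * HALF_LEN / FS_MID     # one rate conversion filter
frame_ms = 20.0
total_delay_ms = frame_ms + 2 * filter_delay_ms  # framing + decimation + interpolation

# 20 ms of a 1 kHz tone at 16 kHz becomes 240 samples at 12 kHz
x = [math.sin(2 * math.pi * 1000.0 * n / FS_IN) for n in range(320)]
y = decimate_16k_to_12k(x, h)
print(len(y), filter_delay_ms, total_delay_ms)
```

After the filter start-up, the output is the input tone delayed by 2.5 ms, which is where the two 2.5 ms contributions in the 25 ms total come from.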
Finally a High-Frequency-Resynthesis (HFR) module generates
an upper band (6-7 kHz) signal portion. As described in
section 4, the regenerated upper band signal consists of a
gain-amplified and filtered bandpass noise. All necessary
parameters are adapted solely on the basis of the received
lower band parameters and a priori knowledge of the input
speech.

Parameter  | Bit Allocation | Bits/Frame | Bit Rate
LPC        |                | 32 bit     | 1.6 kbit/s
ACB-Index  | 2 × (8+6) bit  | 28 bit     | 2.2 kbit/s (with ACB-Gain)
ACB-Gain   | 4 × 4 bit      | 16 bit     |
FCB-Index  | 8 × 18 bit     | 144 bit    | 8.8 kbit/s (with FCB-Gain)
FCB-Gain   | 8 × 4 bit      | 32 bit     |
Parity     |                | 8 bit      | 0.4 kbit/s
`
Table 1: Bit allocation of the proposed codec

3. ACELP CODING OF 0-6 kHz BAND

3.1. Short-term LP analysis
The linear prediction (LP) analysis uses a modified Split-Levinson
approach as described in [9] to compute the Line Spectral
Frequencies (LSF) from the windowed speech signal. The analysis
window covers 300 samples and is right aligned with the current
20 ms frame, i.e. no lookahead is used. The order of the LP filter
is $p_{lb} = 14$. Before computing the LSF coefficients, the
autocorrelation matrix is weighted using a binomial window,
providing an additional amount of bandwidth expansion to the LP
filter.
The 14 LSF parameters are quantized by a Predictive Multi-stage
Split Vector Quantizer scheme. For the prediction of the LSF
vector, a Moving Average (MA) model of order 4 is used. The
closed-loop residual quantizer consists of two stages of split
vector quantizers, using 2 segments for the first and 3 segments
for the second stage, respectively. One of two fixed predictor sets
can be chosen. This approach, which results in a total of 32 bits
per frame, is similar to the one used for the ITU-T G.729 codec [7].
In addition, a linear interpolation of the LP filter coefficients
is performed in the LSF domain every 5 ms.

3.2. Long-term prediction analysis

Every 5 ms, the long-term prediction (LTP) is carried out in a
combination of open-loop and closed-loop LT-analysis based on an
adaptive codebook (ACB) representation (see [5]). The ACB delays in
the four LTP subframes are coded by 8+6+8+6 = 28 bits. In the lower
delay range a fractional pitch approach is used. The ACB gains are
nonuniformly quantized with 4 bits each.

3.3. Fixed codebook (FCB) excitation
Every 2.5 ms (30 samples), an excitation shape vector is selected
from a sparse algebraic pulse codebook. An innovation vector
contains 4 nonzero pulses, as shown in Table 2. Pulses 1 and 2 can
take one of 16 possible positions, pulses 3 and 4 one of 8
positions. Since each pulse can have an individual sign, 18 bits
are necessary to encode the shape vector. Note that pulses 1 and 3
as well as pulses 2 and 4 may share the same position, and that all
pulses can fall outside the valid range of positions 0 ... 29. This
allows a variable number of pulses and pulse amplitudes of
0, ±1, ±2.
The codebook structure and the efficient focussed search method are
based on [7]. The FCB gain is quantized using a fixed autoregressive
predictor in order to reduce the dynamic range [10]. The residual of
the gain predictor is nonuniformly scalar quantized with 4 bits.

Table 2: 18-bit sparse algebraic pulse codebook

3.4. Perceptual weighting
The perceptual weighting filter $W(z)$ applied during the
optimization processes of the ACB and FCB search has a transfer
function of the form

$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)}$$

with $A(z)$ being the LP-analysis filter computed from the
unquantized, interpolated LSF parameters. Different sets of
weighting factors $\{\gamma_1, \gamma_2\}$ are used for the adaptive
and fixed codebook search. For the fixed codebook search, the
parameters are adapted with respect to the tilt and the strength of
resonances of the LP synthesis filter [7].
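
The weighting can be sketched in a few lines, assuming the usual CELP form W(z) = A(z/γ1) / A(z/γ2) with A(z) = 1 + a1 z^-1 + ... The sign convention, the first-order toy filter and the values γ1 = 0.9 and γ2 = 0.6 are illustrative assumptions; the paper does not list the factors it uses.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma), given A(z) = sum_k a[k] z^-k with a[0] = 1."""
    return [c * gamma ** k for k, c in enumerate(a)]


def weighting_filter(x, a, gamma1, gamma2):
    """Filter x through W(z) = A(z/gamma1) / A(z/gamma2), zero initial state."""
    num = bandwidth_expand(a, gamma1)
    den = bandwidth_expand(a, gamma2)
    y = []
    for n in range(len(x)):
        # FIR part: numerator A(z/gamma1)
        acc = sum(num[k] * x[n - k] for k in range(len(num)) if n - k >= 0)
        # recursive part: denominator A(z/gamma2), den[0] is 1
        acc -= sum(den[k] * y[n - k] for k in range(1, len(den)) if n - k >= 0)
        y.append(acc)
    return y


# toy first-order LP filter, hypothetical values
a_toy = [1.0, -0.9]
impulse_response = weighting_filter([1.0, 0.0, 0.0], a_toy, 0.9, 0.6)
print(impulse_response)
```

Scaling the k-th coefficient by γ^k moves the roots of A(z) towards the origin, which broadens the formant peaks. Choosing γ1 = γ2 makes W(z) = 1, so the difference between the two factors controls how strongly the noise is shaped.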
`
`3.5. Adaptive postfilter
As described in [11], the adaptive postfilter consists of a
cascade of a formant postfilter, a harmonic postfilter and
a tilt compensation filter. The postfilter is updated every
LTP subframe (5 ms). The formant postfilter uses the
transmitted LPC filter coefficients. After postfiltering, an
adaptive gain control is performed.
`
4. HIGH FREQUENCY RESYNTHESIS
`
`For speech, spectral components above 6 kHz are almost al-
`ways due to unvoiced, i.e. fricative, sounds. Informal listen-
`ing tests showed that the presence of this spectral band is
`still very well perceivable. However, a sufficient subjective
`quality does not require an exact reproduction of the noise-
`like signal waveform. In [5] we have demonstrated a very
`simple and efficient spectral folding technique to regenerate
`an upper band signal. This approach exploited the obser-
`vation that the spectral distributions in the 5-6kHz and
`6-7 kHz bands are very similar.
On the other hand, many operating conditions typically include
the presence of acoustic background noise. In such situations,
where non-speech signal components are present, our previous
proposal sometimes revealed perceivable degradations. Techniques
such as those proposed in [12] do not yield the intended quality
either.
`In this paper, we describe a more elaborate scheme based
`on High-Frequency Resynthesis (HFR) techniques that have
`been initially studied for the extension from telephone-band
`to wideband speech [13].
`
`4.1. Resynthesis of the 6-7 kHz band
`Similarly to [3], we model the upper band signal by a band-
`pass (6-7 kHz) noise excitation whose magnitude spectrum
`has to be shaped properly (see Figure 2). The basic idea
`is to separately extrapolate the spectral envelope and the
`residual of the signal. Typically there exists a good cor-
`relation between the lower band spectral envelope and the
`spectrum of the upper band. Since no side information shall
`be transmitted, the task is to predict the spectral shape and
`the energy of this excitation from the received lower band
`parameters.
The overall spectral shape of the output wideband signal
is determined by selecting an appropriate LPC synthesis
filter $1/A_{Hfr}(z)$. Provided that this filter matches the
synthesized lower band spectrum, it is expected that its
behaviour above 6 kHz will reflect the original upper band
speech components. $1/A_{Hfr}(z)$ is determined from a codebook
$C_{wb}$ describing $N_{Hfr}$ LPC filters that are computed at
fs = 16 kHz and stored in their LSF representation [15].
The decoded and interpolated lower band signal is first
filtered by $A_{Hfr}(z)$, such that the gain-adapted noise
excitation for the upper band is added in the residual domain,
before the sum is again filtered using $1/A_{Hfr}(z)$. This
implies that, regardless of the choice of $A_{Hfr}(z)$, the HFR
module does not introduce any additional degradation to the
lower band. The spectral fit of the filter to the actual lower
band signal does not have to be very exact and the number of
stored filter parameters can be limited. We have found
$N_{Hfr} = 48$ different LSF sets of order $p_{wb} = 16$ to be
sufficient.
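
The claim that the lower band passes through unchanged can be illustrated with a small sketch. The filter coefficients, the white noise used in place of the 6-7 kHz bandpass excitation and the function names are placeholders, not values from the paper.

```python
import random


def analysis_filter(x, a):
    """Residual d[n] = sum_k a[k] x[n-k] with a[0] = 1 (the FIR filter A(z))."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]


def synthesis_filter(d, a):
    """Output of 1/A(z): y[n] = d[n] - sum_{k>=1} a[k] y[n-k]."""
    y = []
    for n in range(len(d)):
        acc = d[n] - sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y


def hfr_residual_domain(lower_band, upper_excitation, a_hfr):
    """Add the upper band excitation between A_Hfr(z) and 1/A_Hfr(z)."""
    residual = analysis_filter(lower_band, a_hfr)
    excitation = [r + u for r, u in zip(residual, upper_excitation)]
    return synthesis_filter(excitation, a_hfr)


rng = random.Random(0)
a_hfr = [1.0, -1.2, 0.5]                       # hypothetical stable second-order filter
lower = [rng.uniform(-1.0, 1.0) for _ in range(200)]
noise = [0.1 * rng.uniform(-1.0, 1.0) for _ in range(200)]

silent = hfr_residual_domain(lower, [0.0] * 200, a_hfr)
mixed = hfr_residual_domain(lower, noise, a_hfr)
shaped_noise = synthesis_filter(noise, a_hfr)
print(max(abs(s - l) for s, l in zip(silent, lower)))
```

Because analysis and synthesis filter are exact inverses, the output is the lower band signal plus the excitation shaped by 1/A_Hfr(z). The choice of filter therefore only affects the added upper band component.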
For the selection of $A_{Hfr}(z)$, a second codebook $C_{lb}$ is
necessary: it contains $N_{Hfr}$ LSF vectors describing the
spectral envelope of the lower band at a sampling frequency of
12 kHz. For each LSF vector $\omega_{wb,\nu} \in C_{wb}$
($\nu = 0 \ldots N_{Hfr} - 1$), the associated vector
$\omega_{lb,\nu} \in C_{lb}$ approximates the lower band part of
the spectrum given by $\omega_{wb,\nu}$. Thus the selection
process can be understood as re-quantizing the lower band LPC
filter, $A(z)$, in $C_{lb}$ and looking up the LSF parameters for
$1/A_{Hfr}(z)$ in the 'shadow' codebook $C_{wb}$. The found HFR
filter defined by $1/A_{Hfr}(z)$ is linearly interpolated every
5 ms.
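
The shadow codebook lookup can be sketched as follows. The squared LSF distance used as the selection criterion, the interpolation weights and the tiny toy codebooks are assumptions for illustration; the real codebooks hold 48 entries of order 14 and 16.

```python
def select_hfr_lsf(lsf_lb, codebook_lb, codebook_wb):
    """Re-quantize lsf_lb in codebook_lb and return the paired wideband entry."""
    def sq_dist(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v))

    best = min(range(len(codebook_lb)), key=lambda v: sq_dist(lsf_lb, codebook_lb[v]))
    return best, codebook_wb[best]      # same index in the shadow codebook


def interpolate_lsf(prev_lsf, curr_lsf, n_subframes=4):
    """Linear interpolation between frames, one LSF set per 5 ms subframe."""
    out = []
    for s in range(1, n_subframes + 1):
        w = s / n_subframes
        out.append([(1 - w) * p + w * c for p, c in zip(prev_lsf, curr_lsf)])
    return out


# toy codebooks with hypothetical values (radians)
codebook_lb = [[0.2, 0.5], [0.4, 0.9], [0.7, 1.3]]
codebook_wb = [[0.15, 0.4, 1.0], [0.3, 0.7, 1.4], [0.5, 1.0, 1.8]]

index, lsf_hfr = select_hfr_lsf([0.38, 0.95], codebook_lb, codebook_wb)
steps = interpolate_lsf(codebook_wb[0], lsf_hfr)
print(index, lsf_hfr)
```

Only the index is shared between the two codebooks, so the search cost depends on the lower band dimension alone.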
The adaptation of the HFR gain $g_{Hfr}$ is performed in the
residual domain. Assuming that the inverse HFR filter
$A_{Hfr}(z)$ yields a rather flat spectrum in the 0-6 kHz
frequency range, the bandpass noise $d_{ub}$ is adjusted in
order to adopt the same power spectral density level in the
6-7 kHz range. Since $d_{lb}$ is not completely decorrelated,
better results are obtained when using a high-pass (5 kHz)
filtered portion for the gain adaptation. This scaling is
updated every 5 ms, and the resulting gains are smoothed on a
sample-by-sample basis.
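
A minimal sketch of this gain adaptation is shown below. It assumes that the reference is the 5-6 kHz portion of the lower band residual, so that reference and noise cover equal bandwidths and matching the power also matches the spectral density level. The high-pass filter itself is omitted, and the linear ramp is only one possible form of sample-by-sample smoothing.

```python
import math


def subframe_gain(noise, reference):
    """Gain that gives `noise` the same mean power as `reference`."""
    p_noise = sum(v * v for v in noise) / len(noise)
    p_ref = sum(v * v for v in reference) / len(reference)
    return math.sqrt(p_ref / p_noise) if p_noise > 0.0 else 0.0


def smooth_gains(prev_gain, new_gain, length):
    """Sample-by-sample linear ramp from prev_gain to new_gain."""
    return [prev_gain + (new_gain - prev_gain) * (n + 1) / length
            for n in range(length)]


def scale_noise(noise, reference, subframe_len=80, prev_gain=0.0):
    """Apply per-subframe gains (80 samples = 5 ms at 16 kHz) with smoothing."""
    out = []
    for start in range(0, len(noise), subframe_len):
        seg_n = noise[start:start + subframe_len]
        seg_r = reference[start:start + subframe_len]
        g = subframe_gain(seg_n, seg_r)
        ramp = smooth_gains(prev_gain, g, len(seg_n))
        out.extend(r * v for r, v in zip(ramp, seg_n))
        prev_gain = g
    return out


# toy signals with hypothetical values
noise = [1.0, -1.0] * 80
reference = [0.5] * 160
scaled = scale_noise(noise, reference, prev_gain=0.5)
print(subframe_gain([1.0, -1.0, 1.0, -1.0], [2.0, 2.0, -2.0, -2.0]))
```

The ramp avoids audible steps in the noise level at subframe boundaries.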
`The described HFR method serves to achieve a more trans-
`parent subjective quality than our previous spectral folding
`approach.
`In particular, the performance in background
`noise conditions has been improved.
`
`4.2. Design of HFR codebooks
To obtain the HFR codebooks $C_{lb}$ and $C_{wb}$, an approach
close to the Linde-Buzo-Gray (LBG) algorithm [14] is chosen [15].
Prior to the training phase, an initial codebook $C^0_{lb}$ is
required for the partitioning of the $p_{lb}$-dimensional vector
space filled by all possible lower band LSF parameter sets.
$C^0_{lb}$ contains $N_{Hfr}$ LSF vectors $\omega^0_{lb,\nu}$
($\nu = 0 \ldots N_{Hfr} - 1$) and is obtained by applying the
LBG algorithm to the lower band portion of the training speech
data.
During the training process, for each 20 ms frame $\lambda$ an
LPC analysis (order $p_{wb}$) is performed on the wideband input
speech (fs = 16 kHz); thus, an LSF vector $\omega_{wb}(\lambda)$
is computed. In parallel, the lower band signal portion
(fs = 12 kHz) of frame $\lambda$ is subject to a second LPC
analysis (order $p_{lb}$), yielding an LSF vector
$\omega_{lb}(\lambda)$. Using $C^0_{lb}$, the current frame's
parameters $\omega_{lb}(\lambda)$ and $\omega_{wb}(\lambda)$ are
assigned to the sets $P_{lb}(\nu)$ and $P_{wb}(\nu)$,
respectively. $P_{lb}(\nu)$ and $P_{wb}(\nu)$,
$\nu = 0 \ldots N_{Hfr} - 1$, define the partitioning of the
vector spaces containing the LSF parameters $\omega_{lb}(\lambda)$
and $\omega_{wb}(\lambda)$. This assignment is achieved by
searching $C^0_{lb}$ and selecting $\nu$ such that an inverse LPC
filter, built from $\omega^0_{lb,\nu} \in C^0_{lb}$ and applied to
the lower band speech, yields the minimum mean squared prediction
error.
Figure 2: Wideband speech decoder details: ACELP decoder and High-Frequency Resynthesis (HFR)

After processing all frames of training data, the final
codevectors of $C_{lb}$ and $C_{wb}$ are found as the centroids of
partitions $P_{lb}(\nu)$ and $P_{wb}(\nu)$,
$\nu = 0 \ldots N_{Hfr} - 1$, respectively. This procedure ensures
that pairs of lower band and wideband LSF parameters with a good
spectral fit in the 0-6 kHz range are produced. Furthermore, the
stability of the resulting filters is guaranteed [15].
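
The paired training can be sketched as follows. Note the simplification: the paper assigns each frame to the cell whose inverse LPC filter gives the minimum prediction error on the lower band speech, whereas this sketch substitutes a plain squared LSF distance. The function name and the toy data are hypothetical.

```python
def train_paired_codebooks(pairs, initial_lb):
    """pairs: (lsf_lb, lsf_wb) per training frame; initial_lb: starting codebook."""
    n_cells = len(initial_lb)
    cells_lb = [[] for _ in range(n_cells)]
    cells_wb = [[] for _ in range(n_cells)]
    for lsf_lb, lsf_wb in pairs:
        # assumed criterion: nearest initial codevector in the lower band space
        v = min(range(n_cells),
                key=lambda i: sum((p - q) ** 2 for p, q in zip(lsf_lb, initial_lb[i])))
        cells_lb[v].append(lsf_lb)      # the same index v is used for both partitions
        cells_wb[v].append(lsf_wb)

    def centroid(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    codebook_lb, codebook_wb = [], []
    for v in range(n_cells):
        if cells_lb[v]:                 # empty cells are simply skipped in this sketch
            codebook_lb.append(centroid(cells_lb[v]))
            codebook_wb.append(centroid(cells_wb[v]))
    return codebook_lb, codebook_wb


# three toy training frames and a two-entry initial codebook
pairs = [([0.1], [0.1, 1.0]), ([0.3], [0.3, 1.2]), ([0.9], [0.8, 2.0])]
c_lb, c_wb = train_paired_codebooks(pairs, [[0.2], [1.0]])
print(c_lb, c_wb)
```

Since every wideband vector follows the cell index of its lower band counterpart, the two codebooks stay paired entry by entry. Averaging LSF vectors that are each in ascending order also yields an ascending vector, which is consistent with the stability property mentioned in the text.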
It can be noted that, in order to save memory in a practical
implementation, the codebook $C_{wb}$ may be directly linked to
the high-resolution LSF quantizer of the lower band ACELP
decoder, instead of explicitly storing $C_{lb}$.
`
`5. CONCLUSION
`
`In this paper an SB-ACELP encoding scheme for 13.0 kbit/s
`wideband speech encoding has been presented. The algo-
`rithm is based on a split band (SB) structure. A state-
`of-the-art ACELP codec operating at a 12 kHz sampling
`frequency is used to transmit the 0-6 kHz subband signal.
`An LPC-based High-Frequency-Resynthesis technique has
`been successfully applied to fill the perceptually significant
`upper 6-7kHz band on the decoder side, without the need
to transmit any side information. In informal listening
tests the speech quality was judged to be comparable to
the CCITT G.722 wideband codec operating at 48 kbit/s.
`
`6. REFERENCES
`
[1] CCITT, "7 kHz Audio Coding within 64 kbit/s," in
Recommendation G.722, vol. Fascicle III.4 of Blue Book,
pp. 269-341, International Telecommunication Union,
Melbourne 1988.
[2] ITU-T SG 16 Q.20, "Terms of Reference for the ITU-T
Wideband (7 kHz) Speech Coding Algorithm," April 1997.
[3] J. Paulus and J. Schnitzler, "16 kbit/s Wideband
Speech Coding Based on Unequal Subbands," in
Proc. Int. Conf. Acoust., Speech, Signal Processing,
ICASSP, (Atlanta, Georgia, USA), pp. 651-654, 1996.
[4] ETSI SMG11, "Draft Adaptive Multi-Rate (AMR)
Study Phase Report," Version 0.4, Tdoc SMG11-AMR
128/97, August 1997.
[5] J. Paulus and J. Schnitzler, "Wideband Speechcoding
for the GSM Fullrate Channel?," in Proceedings
ITG-Fachtagung "Sprachkommunikation", (Frankfurt
am Main), pp. 11-14, 1996.
[6] ETSI/TC SMG, "Recommendation GSM 06.60: Enhanced
Full Rate Speech Transcoding," European
Telecommunications Standards Institute, January 1996.
[7] CCITT/ITU-T, "Rec. G.729: Coding of Speech at
8 kbit/s Using Conjugate-Structure Algebraic-Code-
Excited Linear-Prediction (CS-ACELP)," in General
Aspects of Digital Transmission Systems; Terminal
Equipments, Series G Recommendations, International
Telecommunication Union, 1996.
[8] T. Honkanen, J. Vainio, K. Järvinen, P. Haavisto,
R. Salami, C. Laflamme, and J.-P. Adoul, "Enhanced
Full Rate Speech Codec for IS-136 Digital Cellular
System," in Proc. Int. Conf. Acoust., Speech, Signal
Processing, ICASSP, (Munich, Germany), pp. 731-734,
IEEE, 1997.
[9] S. Saoudi and J. Boucher, "A New Efficient Algorithm
to Compute the LSP Parameters for Speech Coding,"
Signal Processing, vol. 28, pp. 201-212, 1992.
[10] R. Salami, C. Laflamme, J. Adoul, and D. Massaloux,
"A Toll Quality 8 Kb/s Speech Codec for the Personal
Communications System (PCS)," IEEE Transactions
on Vehicular Technology, vol. 43, pp. 808-816, August
1994.
[11] J. Chen and A. Gersho, "Adaptive Postfiltering for
Quality Enhancement of Speech," IEEE Trans. Speech
and Audio Processing, vol. 3, pp. 59-71, January 1995.
[12] J. Makhoul and M. Berouti, "High-Frequency
Regeneration in Speech Coding Systems," in Proc. Int. Conf.
Acoust., Speech, Signal Processing, ICASSP,
(Washington, DC), pp. 428-431, IEEE, 1979.
[13] H. Carl, Untersuchung verschiedener Methoden der
Sprachcodierung und eine Anwendung zur
Bandbreitenvergrößerung von Schmalband-Sprachsignalen. PhD
thesis, Ruhr-Universität Bochum, 1994.
[14] Y. Linde, A. Buzo, and R. Gray, "An Algorithm for
Vector Quantizer Design," IEEE Transactions on
Communications, vol. 28, pp. 84-95, January 1980.
[15] J. Kenkenberg, "Untersuchungen zur künstlichen
Bandbreitenvergrößerung von Sprachsignalen."
Diploma thesis D25/96, Institut für Nachrichtengeräte
und Datenverarbeitung, IND, RWTH Aachen, 1996.
`
`