R. Salami, R. Lefebvre, and C. Laflamme
Department of Electrical Engineering, University of Sherbrooke,
Sherbrooke, Quebec, Canada J1K 2R1

ABSTRACT
This paper describes a wideband speech/audio codec at 16/24
kbit/s with 10 ms frames. The algorithm uses an ACELP model
at 16 kbit/s and a switched ACELP/TCX model at 24 kbit/s.
Adaptive preemphasis is used to improve the performance at high
frequencies and a hybrid forward/backward LP filter is used to
improve the performance for stationary signals. Subjective tests
showed that for speech signals, the codec performance at 16 and
24 kbit/s is equivalent to G.722 at 48 and 56 kbit/s, respectively.
For music signals, the performance of the codec at 24 kbit/s was
equivalent to that of G.722 at 48 kbit/s.

1. INTRODUCTION
Compression of wideband speech and audio (7 kHz bandwidth)
is increasingly needed in many applications such as
videoconferencing and the Internet. The ITU-T has recently started a
standardization activity for a wideband codec at 16 and 24 kbit/s which is
required to perform similarly to G.722 at 48 and 56 kbit/s,
respectively, in most operating conditions. Initially, two modes were
proposed: Mode A with 25 ms delay (the delay is considered as
twice the frame size plus the lookahead) and Mode B with longer
delay (60 ms) and lower complexity (15 MIPS). This paper
describes a codec which was proposed for Mode A standardization.
Section 2 describes the codec principles and Section 3 discusses
the codec's performance. The conclusions are given in Section 4.

2. CODEC PRINCIPLES
2.1. Coding model and bit allocation
The coder uses the algebraic code-excited linear predictive
(ACELP) coding model [1] at 16 kbit/s and a switched ACELP/TCX
(transform coded excitation [2]) model at 24 kbit/s. The coder uses
10 ms speech frames (160 samples at the sampling frequency of
16000 samples/s). An adaptive preemphasis procedure is performed
before the encoding process (2 bits are used to quantize
the preemphasis filter). A hybrid forward/backward linear
prediction (LP) analysis is used. The short-term prediction
parameters (or LP parameters) are transmitted every speech frame. 1 bit
is used to signal the LP mode (forward or forward/backward).
The speech frame is divided into 2 subframes of 5 ms (80 samples).
The pitch and algebraic codebook parameters are transmitted
every subframe. The bit allocation of the coder is shown
in Table 1. The LP parameters are quantized with 34 bits. The
pitch lag is encoded with 9 bits in the first subframe and 6 bits
in the second subframe. The pitch gain is quantized with 4 bits
and the fixed codebook gain is quantized with 5 bits in each
subframe. The only difference between the two bit rate modes is the
size of the innovation codebook. In the 16 kbit/s mode, the
innovation codebook index is encoded with 45 bits each subframe,
while in the 24 kbit/s mode it is encoded with 85 bits each
subframe. The algorithm can be switched dynamically from frame
to frame between the two bit rates.

Parameter              1st subfr   2nd subfr   per frame
LP mode                                            1
Preemphasis filter                                 2
ISPs                                              34
Pitch delay                9           6          15
Pitch gain                 4           4           8
Innovation codebook       45          45          90
Codebook gain              5           5          10
Total                                            160

Table 1. Bit allocation of the coding algorithm at 16 kbit/s.
At 24 kbit/s the innovation codebook uses 85 bits per subframe
instead of 45 bits.
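
As a quick consistency check on the bit allocation, the per-frame totals can be converted to bit rates. The sketch below only restates the numbers given in Table 1; the function and variable names are illustrative and not part of the codec description.

```python
FRAME_MS = 10            # frame length stated in the text
SAMPLE_RATE_HZ = 16000   # sampling frequency stated in the text

# Bits per 10 ms frame at 16 kbit/s, copied from Table 1.
BITS_16K = {
    "LP mode": 1,
    "Preemphasis filter": 2,
    "ISPs": 34,
    "Pitch delay": 9 + 6,
    "Pitch gain": 4 + 4,
    "Innovation codebook": 45 + 45,
    "Codebook gain": 5 + 5,
}


def bits_per_frame(innovation_bits_per_subframe):
    """Total bits per frame for a given innovation codebook size."""
    table = dict(BITS_16K)
    table["Innovation codebook"] = 2 * innovation_bits_per_subframe
    return sum(table.values())


def bit_rate(bits):
    """Bit rate in bit/s for a given number of bits per frame."""
    return bits * 1000 // FRAME_MS


samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000

for label, innovation_bits in (("16 kbit/s mode", 45), ("24 kbit/s mode", 85)):
    total = bits_per_frame(innovation_bits)
    print(label, total, "bits/frame ->", bit_rate(total), "bit/s")
```

With 45 bits per subframe the frame carries 160 bits, and with 85 bits per subframe it carries 240 bits, which matches the two nominal rates for 10 ms frames.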

2.2. Pre-processing
The pre-processing block performs adaptive preemphasis. Four
possible 2nd order filters, P(z), can be used (2 bits). The
preemphasis filter is determined based on a 2nd order LP analysis of the
input signal. The use of preemphasis significantly improves the
codec performance at high frequencies. This gave better
performance than introducing a tilt filter into the perceptual weighting
filter [3]. The second advantage of preemphasis is that it reduces
the dynamic range of the input signal, which facilitates the
fixed-point implementation of the algorithm.
2.3. Short-term prediction
Short-term prediction, or linear prediction (LP), analysis is
performed once per speech frame (with a lookahead of 5 ms). The
16th order LP parameters are quantized with 34 bits, and used
for the second subframe while the first subframe uses interpolated
filters. The quantization and interpolation are performed
in the immittance spectral pair (ISP) domain [4]. Predictive split
vector quantization of the ISPs is used.
To improve the quality in case of music signals, a hybrid
forward/backward LP filter configuration is used. The LP filter is
either a forward filter or a hybrid forward/backward filter
(depending on a stationarity criterion). 1 bit is used for the LP mode.
2.4. Long-term prediction analysis
The pitch parameters are the delay and gain of the pitch filter.
In the first subframe, a fractional pitch delay is used with
resolutions: 1/3 in the range [29 1/3, 158 2/3] and integers only in the
range [159, 281]. For the second subframe, a pitch resolution of
1/3 is always used in the range [T1 - 10 2/3, T1 + 9 2/3], where T1 is
the nearest integer to the fractional pitch lag of the first subframe.
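
The first-subframe lag grid can be checked against the 9-bit budget: 389 values at 1/3 resolution plus 123 integer values give exactly 512 codes. The sketch below enumerates that grid; the index ordering (fractional range first, then integer range) and the function names are assumptions made for illustration, since the text does not give the actual index mapping.

```python
from fractions import Fraction

FRAC_MIN_THIRDS = 88      # 29 1/3 expressed in thirds of a sample
FRAC_MAX_THIRDS = 476     # 158 2/3 expressed in thirds of a sample
INT_MIN, INT_MAX = 159, 281
N_FRAC = FRAC_MAX_THIRDS - FRAC_MIN_THIRDS + 1   # 389 fractional lags
N_CODES = N_FRAC + (INT_MAX - INT_MIN + 1)       # 389 + 123 = 512


def encode_lag(lag):
    """Map a first-subframe pitch lag to an index (assumed ordering)."""
    lag = Fraction(lag)
    thirds = lag * 3
    if thirds.denominator == 1 and FRAC_MIN_THIRDS <= thirds <= FRAC_MAX_THIRDS:
        return int(thirds) - FRAC_MIN_THIRDS
    if lag.denominator == 1 and INT_MIN <= lag <= INT_MAX:
        return N_FRAC + int(lag) - INT_MIN
    raise ValueError(f"lag {lag} is not on the first-subframe grid")


def decode_lag(index):
    """Inverse of encode_lag: return the lag as an exact Fraction."""
    if not 0 <= index < N_CODES:
        raise ValueError("index out of range")
    if index < N_FRAC:
        return Fraction(index + FRAC_MIN_THIRDS, 3)
    return Fraction(index - N_FRAC + INT_MIN)


print(N_CODES)                        # number of distinct first-subframe lags
print(decode_lag(0), decode_lag(511)) # lowest and highest lag on the grid
```

Using exact fractions avoids rounding problems when testing whether a lag lies on the 1/3 grid.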
To simplify the pitch analysis procedure, a two-stage approach
is used [5]. First, an open-loop pitch is computed every frame
(10 ms) using the weighted speech signal sw(n) to find a pitch
estimate Top. The weighted speech is low-pass filtered and
decimated by 3 to simplify the search. Second, a closed-loop pitch
analysis is performed around the open-loop pitch estimate on a
subframe basis. In the first subframe the range Top ± 9, bounded
by 30-281, is searched. For the second subframe, closed-loop
pitch analysis is performed around the pitch selected in the first
subframe as described earlier. The pitch delay is encoded with
9 bits in the first subframe and the relative delay of the second
subframe is encoded with 6 bits.
The pitch gain is quantized using 4-bit scalar quantization.

0-7803-4073-6/97/$10.00 © 1997 IEEE.

2.5. Innovation codebook structure
At 16 kbit/s, a 45-bit algebraic codebook is used. The 80
positions in a subframe are divided into 5 interleaved tracks. The
innovation vector contains 10 non-zero pulses, where 2 pulses are
placed in each track. All pulses can have the amplitudes +1 or
-1. The positions and signs of the two pulses in a given track are
encoded with 9 bits. This gives a total of 45 bits. The codebook
is searched using the fast procedure described in [6].
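
One way to fit two signed pulses of a 16-position track into 9 bits is to send two 4-bit positions and a single sign bit, and to convey the second sign through the order of the two positions. The sketch below illustrates that idea, which is used in some ACELP codecs; the text does not state that this codec uses this exact layout, so the bit packing and the names encode_pair and decode_pair are assumptions.

```python
TRACKS = 5
SUBFRAME = 80
POSITIONS_PER_TRACK = SUBFRAME // TRACKS   # 16 positions, 4 bits each


def encode_pair(track, p0, s0, p1, s1):
    """Pack two pulses (sample position, sign +1/-1) of one track into 9 bits.

    Assumed layout: [sign bit | 4-bit first position | 4-bit second position].
    Same signs: smaller position is sent first. Different signs: larger first.
    """
    for p in (p0, p1):
        if not 0 <= p < SUBFRAME or p % TRACKS != track:
            raise ValueError("pulse position is not on this track")
    if p0 == p1 and s0 != s1:
        raise ValueError("opposite-sign pulses at one position cancel")
    i0, i1 = p0 // TRACKS, p1 // TRACKS
    if s0 == s1:
        first, second, sign = min(i0, i1), max(i0, i1), s0
    elif i0 > i1:
        first, second, sign = i0, i1, s0
    else:
        first, second, sign = i1, i0, s1
    sign_bit = 0 if sign > 0 else 1
    return (sign_bit << 8) | (first << 4) | second


def decode_pair(track, index):
    """Recover the two (position, sign) pulses from a 9-bit index."""
    sign = -1 if (index >> 8) & 1 else 1
    first = (index >> 4) & 0xF
    second = index & 0xF
    # A second position smaller than the first signals an opposite sign.
    second_sign = sign if second >= first else -sign
    return [(first * TRACKS + track, sign), (second * TRACKS + track, second_sign)]


index = encode_pair(2, 7, +1, 42, -1)
print(index, decode_pair(2, index))
```

Five tracks at 9 bits each give the 45 bits per subframe quoted above.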
At 24 kbit/s the innovation codebook is based on either an
algebraic codebook structure or a transform-coded excitation (TCX)
structure. The former codebook is more suitable for transient
frames and attacks while the latter codebook is used in case of
stationary periods. In the algebraic codebook case, the
innovation vector contains 20 non-zero pulses, where 4 pulses are placed
in each one of the 5 tracks. All pulses can have the amplitudes +1
or -1. In the TCX case, the target vector for the codebook search
is quantized in the transform domain [2].
The fixed codebook gain is quantized using scalar quantization
with 5 bits, after applying a 2nd order moving average (MA)
prediction to the innovation energy in the logarithmic domain.
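
The principle of MA-predictive gain quantization in the log domain can be sketched as follows: the energy is predicted from the two previously quantized prediction errors, and only the remaining error is sent with 5 bits. The predictor coefficients, mean energy, quantizer range and step size below are hypothetical placeholders, since the text gives none of them, and the class name is illustrative.

```python
MA_COEFFS = (0.6, 0.3)    # hypothetical 2nd order MA predictor coefficients
MEAN_ENERGY_DB = 30.0     # hypothetical mean innovation energy in dB
MIN_ERR_DB = -24.0        # hypothetical lower edge of the quantizer
STEP_DB = 1.5             # hypothetical uniform step size
LEVELS = 32               # 5 bits


class GainQuantizer:
    """Uniform 5-bit scalar quantizer on the MA prediction error (in dB)."""

    def __init__(self):
        # Past quantized prediction errors, most recent first.
        self.history = [0.0, 0.0]

    def predict(self):
        """Predicted energy from the two previous quantized errors."""
        return MEAN_ENERGY_DB + sum(b * r for b, r in zip(MA_COEFFS, self.history))

    def decode(self, index):
        """Reconstruct the energy in dB and update the predictor memory."""
        q_err = MIN_ERR_DB + index * STEP_DB
        energy_db = self.predict() + q_err
        self.history = [q_err, self.history[0]]
        return energy_db

    def encode(self, energy_db):
        """Quantize the prediction error and return the 5-bit index."""
        err = energy_db - self.predict()
        index = round((err - MIN_ERR_DB) / STEP_DB)
        index = max(0, min(LEVELS - 1, index))
        self.decode(index)   # keep encoder memory in step with the decoder
        return index


encoder, decoder = GainQuantizer(), GainQuantizer()
for energy in (28.0, 35.5, 31.2, 20.0):
    idx = encoder.encode(energy)
    print(energy, idx, decoder.decode(idx))
```

Because the predictor memory holds quantized errors only, the encoder and decoder stay synchronized as long as the indices are received correctly, and an MA predictor limits how long a transmission error can propagate.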
2.6. Decoder
The function of the decoder consists of decoding the transmitted
parameters (LP parameters, adaptive codebook vector, algebraic
code vector, and gains) and performing synthesis to obtain the
reconstructed speech.
The output of the LP synthesis filter is passed through the
post-processing block which performs an adaptive deemphasis
procedure (the inverse of the preprocessing procedure) to restore the
dynamic range of the speech signal.

3. CODEC PERFORMANCE
The codec was tested in compliance with the qualification test
plan set by the ITU [7]. The test consisted of three experiments.
Experiment 1a tested the codec performance in case of speech
(single talkers without background noise); Experiment 1b tested
the performance for music signals; and Experiment 2 tested the
performance in case of speech with background noise. Table 2
gives some of the results of Experiment 1a for the nominal level
of -26 dBov (26 dB below overload). The table gives the codec
condition, the combined Mean Opinion Score (MOSc), the standard
deviation (Sc), the difference between the reference and
candidate codec (d) and the 95% confidence interval (C-int). The codec
meets the requirements for speech and is significantly better in case
of bit errors. The coder was significantly better than G.722 at the
lower input level of -36 dBov but didn't meet the requirement
at the higher input level of -16 dBov. This is because the G.722
reference coder is level dependent: its performance increases with
increasing input level. G.722 shows a MOS variation of almost
1 between higher and lower levels while the MOS variation
of the candidate codec is limited to 0.1. The results showed that
performance is slightly below meeting the tandem requirement.
Initially, the nominal level was at -32 dB at which the tandem
requirement was met. Then it was increased to -26 dB which
increased the MOS of G.722 by 0.3 for a single encoding (due to
its level dependency).

Condition              MOSc    Sc      d      C-int
G.722 48k              3.41    0.78
codec 16k              3.33    0.99    0.08    0.18
G.722 56k              3.77    0.78
codec 24k              3.71    0.87    0.06    0.17
G.722 48k 0.001 BER    2.36    0.91
codec 16k 0.001 BER    2.68    1.02   -0.32    0.19
G.722 56k 0.001 BER    2.59    0.88
codec 24k 0.001 BER    2.85    1.11   -0.26    0.20
G.722 48k 2 tandem     2.87    0.71
codec 16k 2 tandem     2.55    0.98    0.32    0.17
G.722 56k 2 tandem     3.34    0.73
codec 24k 2 tandem     3.09    1.02    0.25    0.18

Table 2. Test results from Experiment 1a.
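
The way the Table 2 figures relate to the conclusions drawn in the text can be illustrated with a small check. It assumes that a requirement counts as met when the difference d = MOS(reference) - MOS(candidate) does not exceed the 95% confidence interval, and that the candidate is significantly better when d is below the negative of that interval. This criterion is an assumption that happens to reproduce the statements above; the actual pass/fail rule is defined in the test plan [7].

```python
# (condition, reference MOS, candidate MOS, 95% confidence interval), from Table 2.
RESULTS = [
    ("codec 16k vs G.722 48k", 3.41, 3.33, 0.18),
    ("codec 24k vs G.722 56k", 3.77, 3.71, 0.17),
    ("codec 16k vs G.722 48k, 0.001 BER", 2.36, 2.68, 0.19),
    ("codec 24k vs G.722 56k, 0.001 BER", 2.59, 2.85, 0.20),
    ("codec 16k vs G.722 48k, 2 tandem", 2.87, 2.55, 0.17),
    ("codec 24k vs G.722 56k, 2 tandem", 3.34, 3.09, 0.18),
]


def classify(reference_mos, candidate_mos, conf_interval):
    """Classify a candidate against its reference (assumed criterion)."""
    d = round(reference_mos - candidate_mos, 2)
    if d < -conf_interval:
        return "significantly better"
    if d <= conf_interval:
        return "meets requirement"
    return "below requirement"


verdicts = [classify(ref, cand, ci) for _, ref, cand, ci in RESULTS]
for (name, ref, cand, ci), verdict in zip(RESULTS, verdicts):
    print(f"{name}: d = {ref - cand:+.2f}, C-int = {ci:.2f} -> {verdict}")
```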
In Experiment 1b (music), the coder didn't meet the
requirement at 16 kbit/s, and at 24 kbit/s it was slightly worse than
56 kbit/s G.722. The test showed that for music, the
performance at 24 kbit/s is equivalent to G.722 at 48 kbit/s. The
requirements for music are difficult to attain with the short frame
size of 10 ms due to the lack of frequency resolution to perform
perceptual transform coding.
In Experiment 2, the codec didn't meet the requirements in the
presence of background noise. This is due to the discriminatory
Comparison Category Rating procedure used in this experiment.

4. CONCLUSION
The article described a wideband speech codec operating at
16/24 kbit/s. The coder operates on 10 ms speech frames using
an ACELP algorithm at 16 kbit/s and a switched ACELP/TCX
algorithm at 24 kbit/s. Subjective test results showed that the
codec meets most of the performance requirements for clean speech
(equivalent to 48/56 kbit/s G.722), while it is below the
requirements for music signals and background noise conditions.
In the March 1997 meeting of SG 16, it was decided to keep
only the longer delay mode (60 ms) for the wideband coding
standard while allowing more complexity (single commercial DSP
chip). The larger delay is essential in order to meet the
requirements for music signals. The procedure for testing the
background noise conditions has been changed, which is likely to
make it less difficult to meet the requirements. The remaining
difficulty will be meeting the requirement at -16 dBov. It is in
fact illogical to test the level dependency of the candidate codec
against a reference codec which is itself very level dependent.

REFERENCES
[1] C. Laflamme, J.-P. Adoul, R. Salami, S. Morissette, and
P. Mabilleau, "16 kbps wideband speech coding technique
based on algebraic CELP," Proc. ICASSP'91, pp. 13-16.
[2] R. Lefebvre, R. Salami, C. Laflamme, and J.-P. Adoul,
"High quality coding of wideband audio signals using
Transform-Coded eXcitation (TCX)," Proc. ICASSP'94,
pp. I-193-I-196.
[3] E. Ordentlich and Y. Shoham, "Low-delay code-excited
linear-predictive coding of wideband speech at 32 kbps,"
Proc. ICASSP'91, pp. 9-12.
[4] Y. Bistritz and S. Peller, "Immittance spectral pairs (ISP)
for speech encoding," Proc. ICASSP'93, pp. II-9-II-12.
[5] R. Salami, C. Laflamme, J.-P. Adoul, and D. Massaloux, "A
toll quality 8 kb/s speech codec for the personal
communications system (PCS)," IEEE Trans. Veh. Technol., vol. 43,
no. 3, pp. 808-816, Aug. 1994.
[6] R. Salami et al., "Description of GSM enhanced full rate
codec," Proc. ICC'97.
[7] "Subjective qualification test plan for the ITU-T wideband
(7 kHz) speech coding algorithm," ITU-T, Version 3.1,
November 1996.