`IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 6, NO. 2, MARCH 1998
`
`Design and Description of CS-ACELP:
`A Toll Quality 8 kb/s Speech Coder
`
`Redwan Salami, Member, IEEE, Claude Laflamme, Jean-Pierre Adoul, Fellow, IEEE,
`Akitoshi Kataoka, Member, IEEE, Shinji Hayashi, Takehiro Moriya, Member, IEEE, Claude Lamblin,
Dominique Massaloux, Stéphane Proust, Peter Kroon, Fellow, IEEE, and Yair Shoham, Member, IEEE
`
Abstract—This paper describes the 8 kb/s speech coding algorithm G.729, which has recently been standardized by ITU-T. The algorithm is based on a conjugate-structure algebraic CELP (CS-ACELP) coding technique and uses 10 ms speech frames. The codec delivers toll-quality speech (equivalent to 32 kb/s ADPCM) for most operating conditions. This paper describes the coder structure in detail and discusses the reasons behind certain design choices. A 16-b fixed-point version has been developed as part of Recommendation G.729, and a summary of the subjective test results based on a real-time implementation of this version is presented.
`
`Index Terms—Analysis-by-synthesis, speech coding.
`
`I. INTRODUCTION
`
SINCE 1990, Study Group 15 (SG15) of the ITU-T has
been involved in a standardization process for a speech
`coding algorithm at 8 kb/s. The main applications for this
`coder are 1) personal communication systems (PCS), 2) digital
`satellite systems, and 3) other applications such as packetized
`speech and circuit multiplexing equipment. The speech quality
`produced by this coder should be equivalent to that of 32
`kb/s ADPCM (G.726) for most operating conditions. These
conditions include clean and noisy speech, multiple encodings,
level variations, and nonspeech inputs. The intended wireless
applications require that the coder be robust against channel
`errors. These errors could be either random or bursty, and the
`coder should be able to withstand them without introducing
`major annoying effects. Moreover, if the radio channels suffer
`from long fades, and complete frames are lost, the decoder
`should be able to conceal these missing frames with a minimal
`loss in speech quality.
`Two candidate algorithms were submitted: one from NTT
`[1]–[3] and the other from France Telecom CNET/University
`of Sherbrooke [4]. Both candidates were equivalent to (or
`better than) 32 kb/s ADPCM in most test conditions; however,
they failed some conditions. At the March 1994 meeting
`of SG15, both proponents agreed to join their efforts to
`
`produce a coder that combines the best features of both
`algorithms, and to undertake further research to meet all
`performance requirements. At this time, AT&T joined these
`algorithmic optimization efforts. A floating-point version of
`the resulting coder was tested in January 1995, and it was
`accepted at the ITU-T meeting in February 1995. In the final
`recommendation the algorithm is specified in terms of 16-
`b fixed-point arithmetic. This version was tested in October
`1995, and the recommendation was accepted for ratification
`in November 1995 [5].
`In this paper, we describe the important aspects of the
`algorithm, which is referred to as conjugate-structure algebraic
`CELP (CS-ACELP). Additional information can be found in
`[6]–[10]. The complete algorithm, including ANSI-C source
`code, can be found in [11].
`This paper is organized as follows. In Section II we describe
`the coding algorithm in detail. In Section III we describe
`features of this coder that were included to increase the
`robustness against transmission errors. Section IV reports on
`the performance, and Section V discusses implementation
`aspects. Finally, the conclusions are given in Section VI.
`
`II. DESCRIPTION OF THE CS-ACELP SPEECH CODER
`The coder is based on a code-excited linear prediction
`(CELP) coding model [12]. In this model the locally decoded
`signal is compared against the original signal and the coder
`parameters are selected such that the mean-squared weighted
`error between the original and reconstructed signal is mini-
`mized.
`The CS-ACELP coder is designed to operate with an
`appropriately bandlimited signal sampled at 8000 Hz. The
`input and output samples are represented using 16-b linear
`PCM. The coder operates on frames of 10 ms, using a 5 ms
`look-ahead for linear prediction (LP) analysis. This results in
`an overall algorithmic delay of 15 ms. The encoding principle
`is shown in Fig. 1. After processing the 16-b input samples
`through a 140 Hz highpass filter,
`tenth-order LP analysis
`is performed, and the LP parameters are quantized in the
line spectral frequency (LSF) domain [13] with 18 b [7]. The
`input frame is divided into two subframes of 5 ms each.
`The use of subframes allows better tracking of the pitch and
`gain parameters and reduces the complexity of the codebook
`searches. The quantized and unquantized LP filter coefficients
`are used for the second subframe while in the first subframe
`interpolated LP filter coefficients are used. For each subframe
`
`Manuscript received March 21, 1996; revised March 26, 1997. The associate
`editor coordinating the review of this manuscript and approving it for
`publication was Dr. W. Bastiaan Kleijn.
`R. Salami, C. Laflamme, and J.-P. Adoul are with the Department of
`Electrical Engineering, University of Sherbrooke, P.Q., Canada J1K 2R1.
`A. Kataoka, S. Hayashi, and T. Moriya are with NTT, Tokyo, Japan.
`C. Lamblin, D. Massaloux, and S. Proust are with France Telecom CNET,
`Lannion, France.
`P. Kroon and Y. Shoham are with Bell Laboratories, Lucent Technologies,
`Murray Hill, NJ 07974 USA (e-mail: kroon@research.bell-labs.com).
`Publisher Item Identifier S 1063-6676(98)01691-5.
`
`
`
`
`
`
`
`Fig. 1. Block diagram of the CS-ACELP encoder.
`
`the excitation is represented by an adaptive-codebook and a
`fixed-codebook contribution. The adaptive and fixed-codebook
`parameters are transmitted every subframe.
The adaptive-codebook component represents the periodicity in the excitation signal using a fractional pitch lag [14] with 1/3 sample resolution. The adaptive codebook is searched using a two-step procedure. An open-loop pitch lag is estimated once per frame based on the perceptually weighted speech signal. The adaptive-codebook index and gain are found by a closed-loop search around the open-loop pitch lag. The signal to be matched, referred to as the target signal, is computed by filtering the LP residual through the weighted synthesis filter.
`The adaptive-codebook index is encoded with 8 b in the
`first subframe and differentially encoded with 5 b in the
`second subframe. The target signal is updated by removing the
`adaptive-codebook contribution, and this new target is used
`in the fixed-codebook search. The fixed codebook is a 17-b
`algebraic codebook [10]. The gains of the adaptive and fixed
`codebook are vector quantized with 7 b using a conjugate-
`structure codebook [7] (with moving-average (MA) prediction
`applied to the fixed-codebook gain as in [4] and [15]). The bit
`allocation for a 10 ms frame is shown in Table I.
`The function of the decoder (see Fig. 2) consists of decoding
`the transmitted parameters (LP parameters, adaptive-codebook
`vector, fixed-codebook vector, and gains) and performing
`synthesis to obtain the reconstructed speech, followed by a
`postprocessing stage [8], consisting of an adaptive postfilter
`and a fixed highpass filter.
`
A. Preprocessing

The 16-b PCM input samples to the speech encoder are filtered with a second-order pole/zero highpass filter with a cutoff frequency of 140 Hz. This highpass filter prevents undesired low-frequency or DC components. To prevent overflow in the fixed-point implementation, the input values are divided by two. The filtered and scaled signal is referred to as s(n), and will be used in all subsequent encoder operations.

B. LP Analysis and Quantization

LP analysis is performed once per speech frame using the autocorrelation method [16] with a 30 ms asymmetric window. Every 80 samples (10 ms), the autocorrelation coefficients of the windowed speech are computed and converted to LP coefficients using the Levinson–Durbin algorithm [16]. Then the LP coefficients are transformed to line spectral frequencies (LSF's) [13] for quantization and interpolation purposes. The interpolated quantized and unquantized LSF coefficients are converted back to LP coefficients to construct the synthesis and weighting filters for each subframe. The short-term analysis and synthesis filters are based on tenth-order LP filters. The LP synthesis filter is defined as

  1/Â(z) = 1 / (1 + Σ_{i=1}^{10} â_i z^{-i})    (1)

where â_i, i = 1, ..., 10, are the (quantized) LP coefficients.

1) Windowing and Autocorrelation Computation: The LP analysis window consists of two parts: the first part is half a Hamming window and the second part is a quarter of a cosine function cycle. The window is given by

  w_lp(n) = 0.54 - 0.46 cos(2πn/399),   n = 0, ..., 199
  w_lp(n) = cos(2π(n - 200)/159),       n = 200, ..., 239.    (2)

There is a 5 ms look-ahead in the LP analysis, which means that 40 samples are needed from the future speech frame. This translates into an extra algorithmic delay of 5 ms at the encoder stage. The use of an asymmetric window allows a reduction in the look-ahead without compromising quality [17]. The LP
`
`
`
`
`
`
`Fig. 2. Block diagram of the CS-ACELP decoder.
`
`TABLE I
`BIT ALLOCATION OF G.729 CS-ACELP FOR A 10 ms FRAME. FOR SOME
`PARAMETERS, THE NUMBER OF BITS FOR EACH 5 ms SUBFRAME IS
`IDENTIFIED. THE TOTAL NUMBER OF BITS FOR A 10 ms FRAME = 80
`
`analysis window is applied to 120 samples from past speech
`frames, 80 samples from the present speech frame, and 40
`samples from the future frame. The use of a 30 ms window
`was found to provide a smoother evolution of the LP filter,
`thereby providing better speech quality.
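As an illustration, the preprocessing of Section II-A can be sketched in floating point. The biquad coefficients below are an illustrative Butterworth design for a 140 Hz cutoff at 8 kHz sampling, not the fixed-point coefficients tabulated in the recommendation:

```python
# Sketch of the preprocessing stage: divide the input by two, then apply a
# second-order pole/zero highpass filter with a cutoff near 140 Hz.
# B and A are illustrative Butterworth-style values, NOT the G.729 tables.

B = (0.92519, -1.85038, 0.92519)   # numerator (zeros)
A = (1.0, -1.84478, 0.85600)       # denominator (poles)

def preprocess(x):
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for v in x:
        v *= 0.5                   # scale to prevent fixed-point overflow
        out = B[0] * v + B[1] * x1 + B[2] * x2 - A[1] * y1 - A[2] * y2
        x2, x1, y2, y1 = x1, v, y1, out
        y.append(out)
    return y
```

Because the numerator coefficients sum to zero, a constant (DC) input is rejected exactly once the transient has decayed.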
The autocorrelation coefficients of the windowed speech s'(n) = w_lp(n) s(n) are computed from

  r(k) = Σ_{n=k}^{239} s'(n) s'(n - k),  k = 0, ..., 10.    (3)

To avoid arithmetic problems for low-level input signals, the value of r(0) has a lower boundary of r(0) = 1.0. A 60 Hz bandwidth expansion [18] is applied by multiplying the autocorrelation coefficients with

  w_lag(k) = exp[ -(1/2) (2π f0 k / fs)² ],  k = 1, ..., 10    (4)

where f0 = 60 Hz is the bandwidth expansion and fs = 8000 Hz is the sampling frequency. The bandwidth expansion of the autocorrelation coefficients reduces the possibility of ill-conditioning in the Levinson algorithm (especially in fixed point). In addition, it reduces underestimation of the formant bandwidths, which could create undesirably sharp resonances. To further reduce the possibility of ill-conditioning due to the bandpass filtering of the input, the value of r(0) is multiplied by a white-noise correction factor 1.0001, which is equivalent to adding a noise floor at -40 dB [19]. The modified autocorrelation coefficients are used to obtain the LP filter coefficients a_i, i = 1, ..., 10, by using the Levinson–Durbin algorithm [16].
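The windowing, autocorrelation, bandwidth expansion, and Levinson–Durbin steps above can be sketched as follows (floating point and unoptimized; the recommendation itself specifies bit-exact fixed-point code):

```python
# Illustrative sketch of the G.729-style LP analysis front end.
import math

def lp_window():
    # Eq (2): half Hamming (200 samples) followed by a quarter cosine cycle (40).
    w = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / 399.0) for n in range(200)]
    w += [math.cos(2.0 * math.pi * (n - 200) / 159.0) for n in range(200, 240)]
    return w

def autocorr(s, order=10):
    # Eq (3) with the r(0) floor, white-noise correction, and lag window of eq (4).
    r = [sum(s[n] * s[n - k] for n in range(k, len(s))) for k in range(order + 1)]
    r[0] = max(r[0], 1.0) * 1.0001
    f0, fs = 60.0, 8000.0
    for k in range(1, order + 1):
        r[k] *= math.exp(-0.5 * (2.0 * math.pi * f0 * k / fs) ** 2)
    return r

def levinson_durbin(r, order=10):
    # Returns [1, a_1, ..., a_order] for the filter 1/(1 + sum a_i z^-i) of eq (1),
    # together with the final prediction error energy.
    a, e = [1.0] + [0.0] * order, r[0]
    for i in range(1, order + 1):
        k = -sum(a[j] * r[i - j] for j in range(i)) / e
        a = [a[j] + k * a[i - j] for j in range(i + 1)] + [0.0] * (order - i)
        e *= 1.0 - k * k
    return a, e
```

In use, the window is applied sample by sample (windowed = [wi * si for wi, si in zip(lp_window(), s)]) before calling autocorr.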
2) Quantization of the LSF Coefficients: The LP filter coefficients a_i, i = 1, ..., 10, are converted to line spectral frequencies (LSF's) using Chebyshev polynomials [20]. In this procedure the roots are found in the cosine domain. Since the quantizer is vector quantization (VQ) based, it is more convenient to represent the LSF's as normalized radian frequencies. The relation between the two representations is given by

  q_i = cos(ω_i),  i = 1, ..., 10    (5)

where q_i are the LSF coefficients in the cosine domain, and ω_i are the LSF coefficients in the frequency domain.
`To keep the algorithmic delay as low as possible, the update
`of the LP coefficients is done every 10 ms. However, most
`speech spectra vary slowly in time, and a slower update (e.g.,
`20 ms) would provide a better tradeoff between spectral repre-
`sentation and bit rate. Since the higher update rate introduces
`a strong correlation between coefficients from frame to frame,
`a good compromise is to use predictive VQ. During onsets
`this correlation is not very strong. To accommodate both
`types of correlation the predictor switches between two modes,
`one representing a strong correlation and one representing a
`
`
`
`
`
`mild correlation. Another advantage of using a separate bit
`for the switch is that it effectively reduces the size of the
`codebook, thereby reducing the storage requirements. To limit
`propagation of channel errors, the predictor is based on an
`MA filter. The length of this filter was determined empirically
`using a large data base [2], and it was found that a fourth-order
`MA predictor forms a good compromise between performance
`and error propagation. The quantizer is organized as follows:
`a switched fourth-order MA prediction is used to predict the
`LSF coefficients of the current frame. The difference between
`the computed and predicted coefficients is quantized using a
two-stage vector quantizer. The first stage is a 10-D VQ using codebook L1 with 128 entries (7 b). The second stage is a 10-b VQ implemented as a split VQ using two 5-D codebooks, L2 and L3, containing 32 entries (5 b) each. The reason for using a nonsplit first stage is that it allows for the exploitation of the correlations between the first five and last five LSF coefficients. At the second stage these correlations are less strong, and the split reduces search time and storage requirements.
To explain the quantization process, it is convenient to first describe the decoding process. Each quantized value is obtained from the sum of two codewords, as follows:

  l̂_i = L1_i(k1) + L2_i(k2),        i = 1, ..., 5
  l̂_i = L1_i(k1) + L3_{i-5}(k3),    i = 6, ..., 10    (6)

where k1, k2, and k3 are the codebook indices. To guarantee that the reconstructed filters are stable, the vector l̂ is arranged such that adjacent elements have a minimum distance of J (see [11]). This rearrangement process is done twice: first with a value of J = 0.0012, then with a value of J = 0.0006. The incorporation of this process into the quantization procedure assures that each of the possible reconstructed vectors produces a stable filter. After this rearrangement process, the quantized LSF coefficients ω̂_i^(m) for the current frame m are obtained from the weighted sum of previous quantizer outputs l̂^(m-k) and the current quantizer output l̂^(m):

  ω̂_i^(m) = (1 - Σ_{k=1}^{4} p_{i,k}) l̂_i^(m) + Σ_{k=1}^{4} p_{i,k} l̂_i^(m-k)    (7)

where p_{i,k} are the coefficients of the switched MA predictor, as defined by the one-bit codebook index L0. At startup the initial values of l̂_i are given by l̂_i = iπ/11 for all k < 0.
After computing ω̂_i, the corresponding filter is checked for stability and unnatural sharp resonances by checking the ordering property (i.e., 0 < ω̂_1 < ω̂_2 < ... < ω̂_10 < π). If this condition is not met, the frequencies are moved using a heuristic process, which enforces a minimum spacing of 50 Hz between the coefficients [11].
The procedure for encoding the LSF parameters can be outlined as follows. For each of the two MA predictors the best approximation to the current LSF coefficients has to be found. The best approximation is defined as the one that minimizes the weighted mean-squared error

  E_lsf = Σ_{i=1}^{10} w_i (ω_i - ω̂_i)².    (8)

The weights w_i emphasize the relative importance of each LSF. Spectral resonances (closely spaced LSF's) are perceptually more important, and several (heuristic) procedures to derive these coefficients can be found in the literature (cf. [21]).
The following heuristic procedure was found to improve performance and to be computationally efficient. The weights are made adaptive as a function of the unquantized LSF coefficients:

  w_1  = 10(ω_2 - 0.04π - 1)² + 1        if ω_2 - 0.04π - 1 < 0, and 1.0 otherwise
  w_i  = 10(ω_{i+1} - ω_{i-1} - 1)² + 1  if ω_{i+1} - ω_{i-1} - 1 < 0, and 1.0 otherwise, i = 2, ..., 9
  w_10 = 10(-ω_9 + 0.92π - 1)² + 1       if -ω_9 + 0.92π - 1 < 0, and 1.0 otherwise.    (9)

In addition, the weights w_5 and w_6 are each multiplied by 1.2. The vector to be quantized for the current frame is obtained from

  l_i = [ ω_i - Σ_{k=1}^{4} p_{i,k} l̂_i^(m-k) ] / (1 - Σ_{k=1}^{4} p_{i,k}),  i = 1, ..., 10.    (10)
`
The first codebook L1 is searched, and the entry L1(k1) that minimizes the (unweighted) mean-squared error (MSE) is selected. This is followed by a search of the second codebook L2, which defines the lower part of the second stage. The weighted MSE of (8) is computed, and the vector L2(k2) which results in the lowest error is selected. Using the selected first-stage vector L1(k1) and the lower part of the second stage L2(k2), the higher part of the second stage is searched from codebook L3. The vector L3(k3) that minimizes the weighted MSE is selected. The resulting vector l̂_i, i = 1, ..., 10, is rearranged twice using the procedure outlined earlier. This process is done for each of the two MA predictors defined by L0, and the MA predictor that produces the lowest weighted MSE is selected.
`3) Interpolation of the LSF Coefficients: The quantized
`(and unquantized) LP coefficients are used for the second
`subframe. For the first subframe, the quantized (and unquan-
`tized) LP coefficients are obtained by linear interpolation of
`the corresponding parameters in the adjacent subframes. The
`interpolation is done on the LSF coefficients in the cosine
`domain rather than the frequency domain. Interpolating in
`either domain did not produce noticeable audible differences,
`and the cosine domain was selected because of ease of
`implementation. Once the LSF coefficients are quantized and
interpolated, they are converted back to the LP coefficients a_i.
`
C. Perceptual Weighting

The weighted speech signal sw(n) in a subframe is obtained by filtering the speech through a perceptual weighting filter W(z)
`
`
`
`
`
`
. This perceptual weighting filter [22] is based on the unquantized LP filter coefficients a_i and is given by

  W(z) = A(z/γ1) / A(z/γ2).    (11)

The use of the unquantized coefficients gives a weighting filter that better matches the original spectrum. The values of γ1 and γ2 modify the frequency response of the filter W(z), and thereby the amount of noise weighting. It is difficult to find fixed values of γ1 and γ2 that provide good performance for different input signal characteristics. For example, differences in the low-frequency cutoff would lead to different choices for these coefficients. Hence, the values of γ1 and γ2 are made a function of the spectral shape of the input signal. For signals with a lot of low-frequency energy, the amount of weighting is increased.
This adaptation is done once per 10 ms frame, but an interpolation procedure for each first subframe is used to smooth the adaptation process. The spectral shape is obtained from a second-order linear prediction filter, obtained as a by-product of the Levinson–Durbin recursion. The reflection coefficients k_i are converted to log area ratio (LAR) coefficients o_i by

  o_i = log [ (1 + k_i) / (1 - k_i) ].    (12)

The LAR coefficients are used because they have better interpolation properties than reflection coefficients [23]. The LAR coefficients corresponding to the current 10 ms frame are used for the second subframe. The LAR coefficients for the first subframe are obtained through linear interpolation with the LAR parameters from the previous frame. The spectral envelope is characterized as being either flat (F = 1) or tilted (F = 0). For each subframe this characterization is obtained by applying a threshold function to the LAR coefficients:

  F = 1 (flat) or 0 (tilted), decided by thresholds on o_1 and o_2 together with the value of F in the previous subframe.    (13)

The hysteresis introduced by taking into account the value of F in the previous subframe avoids rapid changes of this classification.
`
If the interpolated spectrum for a subframe is classified as flat (F = 1), the weight factors are set to γ1 = 0.94 and γ2 = 0.6. If the spectrum is classified as tilted (F = 0), the value of γ1 is set to 0.98, and the value of γ2 is adapted to the strength of the resonances in the LP synthesis filter, but is bounded between 0.4 and 0.7. If a strong resonance is present, the value of γ2 is set closer to the upper bound. This adaptation is done to reduce the amount of unmasked noise at the formant frequencies. The adaptation of γ2 is done by using a heuristic criterion based on the minimum distance between two successive LSF coefficients for the current subframe. The minimum distance is given by

  d_min = min [ω_{i+1} - ω_i],  i = 1, ..., 9.    (14)

The value of γ2 is computed using

  γ2 = -6.0 d_min + 1.0,  bounded by 0.4 ≤ γ2 ≤ 0.7.    (15)
`
The values of γ1 and γ2 for the different conditions were obtained through many informal listening experiments using expert listeners. The process was done in stages by selecting speech material that could be characterized as falling into one of three categories: 1) flat spectrum (F = 1); 2) tilted spectrum (F = 0) and no strong resonances; and 3) tilted spectrum (F = 0) with strong resonances. This allowed independent optimization of the weight factors. In all experiments, both single and double encodings were considered. In general, the improvements due to this adaptation of the weights are most noticeable for double encodings.
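The γ2 adaptation of (14) and (15) reduces to a few lines:

```python
# Sketch of the gamma2 adaptation for a tilted spectrum:
# gamma2 = -6.0 * d_min + 1.0, clipped to [0.4, 0.7].

def gamma2_from_lsf(omega):
    d_min = min(b - a for a, b in zip(omega, omega[1:]))   # eq (14)
    return min(0.7, max(0.4, -6.0 * d_min + 1.0))          # eq (15)
```

Closely spaced LSF's (a strong resonance) give a small d_min and hence a γ2 near the 0.7 upper bound; widely spaced LSF's drive γ2 toward 0.4.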
`
`D. Pitch Analysis
The pitch analysis technique described in [4] is used. An open-loop pitch lag T_op is estimated once per 10 ms frame using the weighted speech signal sw(n). The adaptive-codebook approach is used to represent the periodic component in the excitation signal. The selected adaptive-codebook vector is represented by an index, which corresponds to a certain fractional lag value.

For each subframe the target signal x(n) and the impulse response h(n) of the weighted synthesis filter are computed. A closed-loop adaptive-codebook search is performed in the first subframe around the index corresponding to the open-loop pitch lag estimate T_op. A 1/3 fractional sample resolution is used in the range [19 1/3, 84 2/3], and integers only are used in the range 85–143. It was found that this choice of resolution provides a good tradeoff between performance and bit rate. The adaptive-codebook index in the first subframe is encoded with 8 b. In the second subframe, a 1/3 fractional sample resolution is used in the range [int(T1) - 5 2/3, int(T1) + 4 2/3], where int(T1) is the integer part of the adaptive-codebook lag T1 in the first subframe. This range is adapted for the cases where T1 straddles the boundaries of the lag range. The lag in the second subframe is differentially encoded with 5 b. Since the open-loop pitch estimate provides a form of pitch tracking, the differential coding does not introduce noticeable degradations in the speech quality.
1) Open-Loop Pitch Lag Estimation: The open-loop pitch lag estimation uses the weighted speech signal sw(n) and is done as follows. In the first step, three maxima of the correlation

  R(k) = Σ_{n=0}^{79} sw(n) sw(n - k)    (16)

are found in the following three ranges: 1) k = 20, ..., 39; 2) k = 40, ..., 79; and 3) k = 80, ..., 143. Note that for n < k, signal values from the previous frame are used. The retained maxima R(t_i), where t_i, i = 1, 2, 3, are the lag values corresponding to the maxima in the three lag regions, are normalized through

  R'(t_i) = R(t_i) / sqrt( Σ_n sw²(n - t_i) ),  i = 1, 2, 3.    (17)
`
`
`
`
`
`
`The winner among the three normalized correlations is selected
`by favoring the lags with the values in the lower range. This is
`done by weighting the normalized correlations corresponding
`to the longer lag values. This procedure of dividing the lag
`range into three sections and favoring the smaller values is
`used to discourage choosing pitch multiples.
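The open-loop search can be sketched as follows: the best normalized correlation is found in each of the three lag ranges, and the shorter-lag winners are favored. The range weighting factors (1.0, 0.92, 0.85) are illustrative, as the text does not give the exact values:

```python
# Sketch of the three-range open-loop pitch search of (16)-(17).
import math

def open_loop_pitch(buf, frame=80):
    # buf holds past samples followed by the current frame; buf[-frame:] is "now".
    sw = buf[-frame:]
    best_rn, best_lag = None, None
    for lo, hi, w in ((20, 39, 1.0), (40, 79, 0.92), (80, 143, 0.85)):
        r_best, t_best = None, lo
        for t in range(lo, hi + 1):
            past = buf[-frame - t:-t]
            num = sum(a * b for a, b in zip(sw, past))        # eq (16)
            den = math.sqrt(sum(b * b for b in past)) or 1.0
            rn = num / den                                    # eq (17)
            if r_best is None or rn > r_best:
                r_best, t_best = rn, t
        r_best *= w                 # weight down the longer-lag ranges
        if best_rn is None or r_best > best_rn:
            best_rn, best_lag = r_best, t_best
    return best_lag
```

For a periodic input, the weighting makes the true lag win over its double, which sits in a longer-lag range.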
2) Computation of the Target Signal: The LP residual r(n) is given by

  r(n) = s(n) + Σ_{i=1}^{10} â_i s(n - i).    (18)

The target signal x(n) for the adaptive-codebook search is computed by filtering the LP residual r(n) through the combination of the synthesis filter 1/Â(z) and the weighting filter A(z/γ1)/A(z/γ2). After determining the excitation for the subframe, the initial states of these filters are updated by filtering the difference between the residual and excitation signals.
`3) Adaptive-Codebook Search: The adaptive-codebook pa-
`rameters (or pitch parameters) are the indices corresponding to
`a certain lag and gain. In the adaptive-codebook approach for
`implementing the pitch filter [24], the excitation is repeated
`for lags less than the subframe length. The use of fractional
`lags makes this process computationally expensive during the
`search stage. Hence, during the search the excitation beyond
`the duration of the pitch is extended by the LP residual.
`This procedure is simpler, and it was found that it produces
`similar results compared to using the adaptive-codebook for
`the complete subframe. Note that once the lag has been
`determined, the conventional adaptive-codebook approach is
`used to generate the adaptive-codebook vector.
For each 5 ms subframe, the lag is determined using a closed-loop analysis that minimizes the weighted MSE. In the first subframe the lag T1 is found by searching a small range (six samples) of values around the open-loop lag T_op (see Section II-D1). For the second subframe, the closed-loop adaptive-codebook search is done around the lag selected in the first subframe to find the optimal lag T2.
The closed-loop search minimizes the mean-squared weighted error between the original and reconstructed speech. This is achieved by maximizing the term

  R(k) = Σ_{n=0}^{39} x(n) y_k(n) / sqrt( Σ_{n=0}^{39} y_k(n) y_k(n) )    (19)

where x(n) is the target signal and y_k(n) is the past filtered excitation at delay k (past excitation convolved with h(n), where h(n) is the impulse response of the weighted synthesis filter W(z)/Â(z)).
For the determination of T2, and also for T1 if the optimum integer closed-loop lag is less than 85, the fractions around the optimum integer lag have to be tested. Instead of computing the correlation for each fractional lag value, only the normalized correlations (19) corresponding to integer lags are computed. The correlation values corresponding to the fractional lags are obtained through interpolation of the normalized correlation function in (19) using a finite impulse response (FIR) filter based on a Hamming windowed sin(x)/x function truncated at ±11 and padded with zeros at ±12. The lag value corresponding to the (interpolated) maximum correlation is selected, and the adaptive-codebook vector v(n) is computed by interpolating the past excitation signal u(n) at the given integer lag and fraction [14]. This interpolation filter is based on a Hamming windowed sin(x)/x function truncated at ±29 and padded with zeros at ±30. The filter has a cutoff frequency (-3 dB) at 3600 Hz in the oversampled domain. Although the lower cutoff frequency allows fewer taps for the realization of the filter, it makes it necessary to also filter the integer lags, thereby increasing complexity. However, it was found that the lowpass action of the filter resulted in a smoother quality of the reconstructed signal, which justified this tradeoff.
Once the adaptive-codebook index has been determined, the adaptive-codebook gain g_p is computed as

  g_p = Σ_{n=0}^{39} x(n) y(n) / Σ_{n=0}^{39} y(n) y(n),  bounded by 0 ≤ g_p ≤ 1.2    (20)

where x(n) is the target signal and y(n) is the filtered adaptive-codebook vector, obtained by convolving v(n) with h(n). The scaled and filtered adaptive-codebook vector is subtracted from x(n) to produce a new target vector x2(n).
`
`E. Fixed (Algebraic) Codebook
`A 17-b algebraic codebook is used for the fixed-codebook.
`Algebraic codebooks are deterministic codebooks in which the
`codebook vectors are determined from the transmitted index
`using simple algebra rather than lookup tables. This structure
`has advantages in terms of storage, search complexity, and
`robustness [25]–[27]. Each fixed-codebook vector contains
`four nonzero pulses. These pulses can assume the amplitudes
`and positions given in Table II, and are encoded separately
`using the bit allocation given in this table.
Let c_k be the codebook vector at index k. The optimum codeword is the one that maximizes the term

  T_k = (d^t c_k)² / (c_k^t Φ c_k)    (21)

where d = H^t x2 is the vector of correlations between the target signal x2(n) and the impulse response h(n) of the weighted synthesis filter, and Φ = H^t H is the matrix of correlations of h(n).
The structure of the codebook allows for a fast-search procedure, since the fixed-codebook vector c_k contains only four nonzero pulses whose amplitudes are ±1. The search is performed in four nested loops, where in each loop the contribution of a new pulse is added. The correlation in the numerator of (21) is given by

  C = Σ_{i=0}^{3} s_i d(m_i)    (22)

where m_i is the position of the ith pulse and s_i is its sign. The energy in the denominator of (21) is given by

  E = Σ_{i=0}^{3} φ(m_i, m_i) + 2 Σ_{i=0}^{2} Σ_{j=i+1}^{3} s_i s_j φ(m_i, m_j).    (23)
`
TABLE II
STRUCTURE OF THE 17-b FIXED CODEBOOK

  Pulse  Amplitude  Positions
  i0     ±1         0, 5, 10, 15, 20, 25, 30, 35            (3 + 1 b)
  i1     ±1         1, 6, 11, 16, 21, 26, 31, 36            (3 + 1 b)
  i2     ±1         2, 7, 12, 17, 22, 27, 32, 37            (3 + 1 b)
  i3     ±1         3, 8, 13, 18, 23, 28, 33, 38,
                    4, 9, 14, 19, 24, 29, 34, 39            (4 + 1 b)

The search complexity is greatly reduced by using the following procedure. The most likely amplitude of a pulse occurring at a certain position is estimated using d(n). More precisely, the amplitude of a pulse at a certain position is set a priori to the sign of d(n) at that position. This choice of signs for a given combination of pulses maximizes the correlation term in (21). Therefore, before entering the codebook search, the following steps are taken. First, the signal d(n) is decomposed into its absolute value d'(n) = |d(n)| and its sign, which characterizes the preselected pulse amplitudes at each of the 40 possible pulse positions. Second, the matrix Φ is modified in order to include the preset pulse amplitudes; that is, φ'(i, j) = sign[d(i)] sign[d(j)] φ(i, j). The correlation in (22) is now given by

  C = Σ_{i=0}^{3} d'(m_i)    (24)

and the energy in (23) is given by

  E = Σ_{i=0}^{3} φ'(m_i, m_i) + 2 Σ_{i=0}^{2} Σ_{j=i+1}^{3} φ'(m_i, m_j).    (25)
`
To evaluate all pulse positions, a total of 2^13 = 8192 combinations would need to be examined. To reduce the complexity, the search is focused on those combinations that are likely to provide a good match. Since the last loop has 16 possible pulse positions, it was decided to limit the number of times this loop is entered. In an exhaustive search this loop is entered 8 x 8 x 8 = 512 times to find the position of the fourth pulse for all possible combinations of three pulse positions. By keeping track of the relative contribution of the first three pulses toward the objective of maximizing (24), it is possible to make the (heuristic) decision not to enter the search loop for the fourth pulse.
This is done by setting a threshold based on the correlation C. The threshold is a function of the maximum absolute correlation and the average correlation due to the contribution of the first three pulses. The maximum correlation due to the contribution of the first three pulses is given by

  C_max = max d'(m0) + max d'(m1) + max d'(m2)    (26)

where the maxima are taken over the first three position tracks given in Table II. The average correlation due to the contribution of the first three pulses is given by

  C_av = (1/8) [ Σ d'(m0) + Σ d'(m1) + Σ d'(m2) ]    (27)

where the sums run over the positions in the first three tracks. The search threshold is given by

  thr = C_av + K (C_max - C_av).    (28)

This threshold value is preset before starting the codebook search. The last loop is searched only if the correlation due to the contribution of the first three pulses exceeds this threshold. Note that the value of K controls the number of times the last loop is entered. It was found that a value of K = 0.4 provides performance equivalent to an exhaustive search, in which the last loop is entered 512 times. At the threshold factor K = 0.4, the average number of times N that the last loop is entered is 60, and only 5% of the time does it exceed 90. The search complexity is determined by the number of codewords searched in a 10 ms frame (two subframes). To limit the worst-case complexity, the value of N in the two subframes is limited to a maximum value N_max = 180. The first subframe is allowed a maximum value of 105, and the second subframe is left with N_max - N1, where N1 is the value of N in the first subframe. Using this approach, the search is forced to stop only 1% of the time. The average worst case of N is 90 times per subframe. That is, the worst-case number of tested positions is 90 x 16 = 1440 out of 8192 combinations.
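The focused search described above can be sketched as follows. The track layout is the 17-b structure of Table II; the correlation matrix phi is assumed to be precomputed, and the pulse signs are preset from d(n) as in (24) and (25):

```python
# Sketch of the focused algebraic-codebook search: nested loops over four pulse
# tracks, with the threshold of (28) gating the 16-position last loop.
import itertools

TRACKS = [
    list(range(0, 40, 5)),                                  # i0
    list(range(1, 40, 5)),                                  # i1
    list(range(2, 40, 5)),                                  # i2
    sorted(list(range(3, 40, 5)) + list(range(4, 40, 5))),  # i3: 16 positions
]

def search(d, phi, k=0.4):
    dp = [abs(v) for v in d]                  # d'(n): magnitudes, signs preset
    sgn = [1 if v >= 0 else -1 for v in d]
    # Threshold (26)-(28) from the first three tracks.
    cmax = sum(max(dp[p] for p in t) for t in TRACKS[:3])
    cav = sum(sum(dp[p] for p in t) / len(t) for t in TRACKS[:3])
    thr = cav + k * (cmax - cav)
    best, best_q = None, -1.0
    for m0 in TRACKS[0]:
        for m1 in TRACKS[1]:
            for m2 in TRACKS[2]:
                c3 = dp[m0] + dp[m1] + dp[m2]
                if c3 < thr:
                    continue                  # skip the 16-position last loop
                for m3 in TRACKS[3]:
                    pos = (m0, m1, m2, m3)
                    c = c3 + dp[m3]           # correlation, eq (24)
                    e = sum(phi[i][i] for i in pos) + 2.0 * sum(
                        sgn[i] * sgn[j] * phi[i][j]
                        for i, j in itertools.combinations(pos, 2))  # eq (25)
                    q = c * c / max(e, 1e-12)
                    if q > best_q:
                        best_q, best = q, (pos, tuple(sgn[p] for p in pos))
    return best
```

With an identity phi (equal pulse energies), the search simply picks the largest |d(n)| in each track, with the matching signs.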
The periodicity in the excitation is provided by the adaptive-codebook contribution only. High-pitched voices will have more than one pitch pulse in a subframe. In that situation the fixed-codebook contribution could potentially reduce the amount of periodicity, leading to degraded speech quality. This effect is avoided by introducing periodicity in the fixed-codebook excitation by filtering the fixed-codebook vector through the filter 1/(1 - β z^{-T}), where T is the integer part of the pitch lag for the current subframe, and β is the adaptive-codebook gain. This modification of the fixed-codebook vectors is integrated in the search by filtering the impulse response h(n) with 1/(1 - β z^{-T}) before the codebook search (for lag values less than the subframe length). Since at that time the quantized adaptive-codebook gain is not known, it was found that the past quantized pitch gain, bounded by [0.2, 0.8], provided a good alternative.
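The harmonic enhancement amounts to one recursive line per sample, assuming T is smaller than the codevector length:

```python
# Sketch of filtering a codevector through 1/(1 - beta * z^-T):
# c(n) += beta * c(n - T) for n >= T, applied in place (recursively).

def sharpen(c, t, beta):
    c = list(c)
    for n in range(t, len(c)):
        c[n] += beta * c[n - t]
    return c
```

Because the filter is recursive, a single pulse spawns a decaying train of copies spaced T samples apart, injecting pitch periodicity into the fixed-codebook contribution.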
`
`F. Quantization of the Gains
`The adaptive-codebook gain and the fixed codebook gain are
`vector quantized using 7 b. This joint quantization provides a
`saving of about 2 b compared to scalar quantization. Based
`on informal comparisons, it was found that this quantization
`did not introduce any noticeable degradations to the speech
`quality compared to unquantized gains.
The gain codebook search is done by minimizing the mean-squared weighted error between the original and reconstructed speech, which is given by

  E = x^t x + g_p² y^t y + g_c² z^t z - 2 g_p x^t y - 2 g_c x^t z + 2 g_p g_c y^t z    (29)

where x is the target vector (see Section II-D2), and y and z are the filtered adaptive-codebook and fixed-codebook vectors, respectively.
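A sketch of the criterion (29), evaluated over a list of candidate gain pairs (the conjugate-structure search described below restricts which pairs are actually tried):

```python
# Sketch of gain selection by minimizing the weighted error of eq (29).

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def best_gains(x, y, z, candidates):
    # Precompute the six correlations once; each candidate costs a few multiplies.
    xx, yy, zz = dot(x, x), dot(y, y), dot(z, z)
    xy, xz, yz = dot(x, y), dot(x, z), dot(y, z)
    def err(gp, gc):
        return (xx + gp * gp * yy + gc * gc * zz
                - 2.0 * gp * xy - 2.0 * gc * xz + 2.0 * gp * gc * yz)
    return min(candidates, key=lambda g: err(*g))
```

Precomputing the correlations is what makes a full pass over the 7-b gain codebook affordable.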
1) Prediction of the Fixed-Codebook Gain: The fixed-codebook gains in adjacent frames are correlated. An efficient
`
`
`
`
`
`way to exploit this redundancy is to use a log-energy gain
`predictor [15]. This gain-predictor not only helps in reducing
`the dynamic range of the fixed-codebook gain, it also makes
`this range less dependent on the input level variations. The
`use of a moving-average filter reduces the propagation of
`channel errors.
The fixed-codebook gain g_c can be expressed as

  g_c = γ g_c'    (30)

where g_c' is a predicted gain based on previous fixed-codebook energies, and γ is a correction factor, which is coded for transmission. The mean energy of the fixed-codebook contribution is given by

  E = 10 log [ (1/40) Σ_{n=0}^{39} c²(n) ].
2) Codebook Search for Gain Quantization: The adaptive-codebook gain g_p and the factor γ are vector quantized using a two-stage conjugate-structured codebook [2]. The term conjugate refers to the fact that each input vector is quantized as a linear combination of both codebooks. Such a structure reduces both computational and memory requirements [28]. The first stage consists of a 3