IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 6, NO. 2, MARCH 1998

Design and Description of CS-ACELP: A Toll Quality 8 kb/s Speech Coder

Redwan Salami, Member, IEEE, Claude Laflamme, Jean-Pierre Adoul, Fellow, IEEE, Akitoshi Kataoka, Member, IEEE, Shinji Hayashi, Takehiro Moriya, Member, IEEE, Claude Lamblin, Dominique Massaloux, Stéphane Proust, Peter Kroon, Fellow, IEEE, and Yair Shoham, Member, IEEE

Abstract—This paper describes the 8 kb/s speech coding algorithm G.729, which has recently been standardized by ITU-T. The algorithm is based on a conjugate-structure algebraic CELP (CS-ACELP) coding technique and uses 10 ms speech frames. The codec delivers toll-quality speech (equivalent to 32 kb/s ADPCM) for most operating conditions. This paper describes the coder structure in detail and discusses the reasons behind certain design choices. A 16-b fixed-point version has been developed as part of Recommendation G.729, and a summary of the subjective test results based on a real-time implementation of this version is presented.

Index Terms—Analysis-by-synthesis, speech coding.

I. INTRODUCTION

SINCE 1990, Study Group 15 (SG15) of the ITU-T has been involved in a standardization process for a speech coding algorithm at 8 kb/s. The main applications for this coder are 1) personal communication systems (PCS), 2) digital satellite systems, and 3) other applications such as packetized speech and circuit multiplexing equipment. The speech quality produced by this coder should be equivalent to that of 32 kb/s ADPCM (G.726) for most operating conditions. These conditions include clean and noisy speech, multiple encodings, level variations, and nonspeech inputs. The intended wireless applications require that the coder be robust against channel errors. These errors could be either random or bursty, and the coder should be able to withstand them without introducing major annoying effects. Moreover, if the radio channels suffer from long fades and complete frames are lost, the decoder should be able to conceal these missing frames with a minimal loss in speech quality.

Two candidate algorithms were submitted: one from NTT [1]–[3] and the other from France Telecom CNET/University of Sherbrooke [4]. Both candidates were equivalent to (or better than) 32 kb/s ADPCM in most test conditions; however, they failed some conditions. At the March 1994 meeting of SG15, both proponents agreed to join their efforts to produce a coder that combines the best features of both algorithms, and to undertake further research to meet all performance requirements. At this time, AT&T joined these algorithmic optimization efforts. A floating-point version of the resulting coder was tested in January 1995, and it was accepted at the ITU-T meeting in February 1995. In the final recommendation, the algorithm is specified in terms of 16-b fixed-point arithmetic. This version was tested in October 1995, and the recommendation was accepted for ratification in November 1995 [5].

In this paper, we describe the important aspects of the algorithm, which is referred to as conjugate-structure algebraic CELP (CS-ACELP). Additional information can be found in [6]–[10]. The complete algorithm, including ANSI-C source code, can be found in [11].

This paper is organized as follows. In Section II we describe the coding algorithm in detail. In Section III we describe features of this coder that were included to increase the robustness against transmission errors. Section IV reports on the performance, and Section V discusses implementation aspects. Finally, the conclusions are given in Section VI.

II. DESCRIPTION OF THE CS-ACELP SPEECH CODER

The coder is based on a code-excited linear prediction (CELP) coding model [12]. In this model, the locally decoded signal is compared against the original signal, and the coder parameters are selected such that the mean-squared weighted error between the original and reconstructed signal is minimized.

The CS-ACELP coder is designed to operate with an appropriately bandlimited signal sampled at 8000 Hz. The input and output samples are represented using 16-b linear PCM. The coder operates on frames of 10 ms, using a 5 ms look-ahead for linear prediction (LP) analysis. This results in an overall algorithmic delay of 15 ms. The encoding principle is shown in Fig. 1. After processing the 16-b input samples through a 140 Hz highpass filter, tenth-order LP analysis is performed, and the LP parameters are quantized in the line spectral frequency (LSF) domain [13] with 18 b [7]. The input frame is divided into two subframes of 5 ms each. The use of subframes allows better tracking of the pitch and gain parameters and reduces the complexity of the codebook searches. The quantized and unquantized LP filter coefficients are used for the second subframe, while in the first subframe interpolated LP filter coefficients are used. For each subframe
Manuscript received March 21, 1996; revised March 26, 1997. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. W. Bastiaan Kleijn.
R. Salami, C. Laflamme, and J.-P. Adoul are with the Department of Electrical Engineering, University of Sherbrooke, P.Q., Canada J1K 2R1.
A. Kataoka, S. Hayashi, and T. Moriya are with NTT, Tokyo, Japan.
C. Lamblin, D. Massaloux, and S. Proust are with France Telecom CNET, Lannion, France.
P. Kroon and Y. Shoham are with Bell Laboratories, Lucent Technologies, Murray Hill, NJ 07974 USA (e-mail: kroon@research.bell-labs.com).
Publisher Item Identifier S 1063-6676(98)01691-5.

Fig. 1. Block diagram of the CS-ACELP encoder.

the excitation is represented by an adaptive-codebook and a fixed-codebook contribution. The adaptive- and fixed-codebook parameters are transmitted every subframe.

The adaptive-codebook component represents the periodicity in the excitation signal using a fractional pitch lag [14] with 1/3 sample resolution. The adaptive codebook is searched using a two-step procedure. An open-loop pitch lag is estimated once per frame based on the perceptually weighted speech signal. The adaptive-codebook index and gain are found by a closed-loop search around the open-loop pitch lag. The signal to be matched, referred to as the target signal, is computed by filtering the LP residual through the weighted synthesis filter.

The adaptive-codebook index is encoded with 8 b in the first subframe and differentially encoded with 5 b in the second subframe. The target signal is updated by removing the adaptive-codebook contribution, and this new target is used in the fixed-codebook search. The fixed codebook is a 17-b algebraic codebook [10]. The gains of the adaptive and fixed codebooks are vector quantized with 7 b using a conjugate-structure codebook [7] (with moving-average (MA) prediction applied to the fixed-codebook gain as in [4] and [15]). The bit allocation for a 10 ms frame is shown in Table I.

The function of the decoder (see Fig. 2) consists of decoding the transmitted parameters (LP parameters, adaptive-codebook vector, fixed-codebook vector, and gains) and performing synthesis to obtain the reconstructed speech, followed by a postprocessing stage [8], consisting of an adaptive postfilter and a fixed highpass filter.

A. Preprocessing

The 16-b PCM input samples to the speech encoder are filtered with a second-order pole/zero highpass filter with a cutoff frequency of 140 Hz. This highpass filter prevents undesired low-frequency or DC components. To prevent overflow in the fixed-point implementation, the input values are divided by two. The filtered and scaled signal is referred to as s(n), and will be used in all subsequent encoder operations.

B. LP Analysis and Quantization

LP analysis is performed once per speech frame using the autocorrelation method [16] with a 30 ms asymmetric window. Every 80 samples (10 ms), the autocorrelation coefficients of the windowed speech are computed and converted to LP coefficients using the Levinson–Durbin algorithm [16]. Then the LP coefficients are transformed to line spectral frequencies (LSF) [13] for quantization and interpolation purposes. The interpolated quantized and unquantized LSF coefficients are converted back to LP coefficients to construct the synthesis and weighting filters for each subframe. The short-term analysis and synthesis filters are based on tenth-order LP filters. The LP synthesis filter is defined as

    1/Â(z) = 1 / (1 + Σ_{i=1}^{10} â_i z^{−i})    (1)

where â_i, i = 1, …, 10, are the (quantized) LP coefficients.

1) Windowing and Autocorrelation Computation: The LP analysis window consists of two parts: the first part is half a Hamming window and the second part is a quarter of a cosine function cycle. The window is given by

    w_lp(n) = 0.54 − 0.46 cos(2πn/399),    n = 0, …, 199
    w_lp(n) = cos(2π(n − 200)/159),        n = 200, …, 239    (2)

There is a 5 ms look-ahead in the LP analysis, which means that 40 samples are needed from the future speech frame. This translates into an extra algorithmic delay of 5 ms at the encoder stage. The use of an asymmetrical window allows reduction in the look-ahead without compromising quality [17].

Fig. 2. Block diagram of the CS-ACELP decoder.

TABLE I
BIT ALLOCATION OF G.729 CS-ACELP FOR A 10 ms FRAME. FOR SOME PARAMETERS, THE NUMBER OF BITS FOR EACH 5 ms SUBFRAME IS IDENTIFIED. THE TOTAL NUMBER OF BITS FOR A 10 ms FRAME = 80.

The LP analysis window is applied to 120 samples from past speech frames, 80 samples from the present speech frame, and 40 samples from the future frame. The use of a 30 ms window was found to provide a smoother evolution of the LP filter, thereby providing better speech quality.

The autocorrelation coefficients are computed from the windowed speech s′(n) = w_lp(n)s(n) as

    r(k) = Σ_{n=k}^{239} s′(n) s′(n − k),    k = 0, …, 10    (3)

To avoid arithmetic problems for low-level input signals, the value of r(0) has a lower boundary of r(0) = 1.0. A 60 Hz bandwidth expansion [18] is applied by multiplying the autocorrelation coefficients with

    w_lag(k) = exp[ −(1/2)(2π f₀ k / f_s)² ],    k = 1, …, 10    (4)

where f₀ = 60 Hz is the bandwidth expansion and f_s = 8000 Hz is the sampling frequency. The bandwidth expansion on the autocorrelation coefficients reduces the possibility of ill-conditioning in the Levinson algorithm (especially in fixed point). In addition, it reduces underestimation of the formant bandwidths, which could create undesirably sharp resonances. To further reduce the possibility of ill-conditioning due to bandpass filtering of the input, the value of r(0) is multiplied by a white-noise correction factor 1.0001, which is equivalent to adding a noise floor at −40 dB [19]. The modified autocorrelation coefficients are used to obtain the LP filter coefficients a_i, i = 1, …, 10, by using the Levinson–Durbin algorithm [16].
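As a concrete (non-bit-exact) illustration of the analysis chain just described, the following floating-point sketch builds the asymmetric window of (2), computes the autocorrelations of (3) with the r(0) floor, applies the lag window of (4) and the white-noise correction, and runs the Levinson–Durbin recursion. The function name and plain-list style are ours; the standardized algorithm is the 16-b fixed-point code of [11].

```python
import math

def lp_analysis(speech, order=10):
    """Floating-point sketch of the G.729-style LP analysis step:
    240-sample asymmetric window, autocorrelation with 60 Hz lag
    windowing and a -40 dB noise floor, then Levinson-Durbin."""
    assert len(speech) == 240  # 120 past + 80 present + 40 look-ahead

    # Asymmetric analysis window: half-Hamming rise, quarter-cosine decay, (2).
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / 399) for n in range(200)]
    w += [math.cos(2 * math.pi * (n - 200) / 159) for n in range(200, 240)]
    sw = [w[n] * speech[n] for n in range(240)]

    # Autocorrelation of the windowed speech, r(0) floored at 1.0, (3).
    r = [sum(sw[n] * sw[n - k] for n in range(k, 240)) for k in range(order + 1)]
    r[0] = max(r[0], 1.0)

    # 60 Hz bandwidth expansion (lag window), (4), and white-noise correction.
    f0, fs = 60.0, 8000.0
    for k in range(1, order + 1):
        r[k] *= math.exp(-0.5 * (2 * math.pi * f0 * k / fs) ** 2)
    r[0] *= 1.0001

    # Levinson-Durbin recursion: solve for LP coefficients a[1..order].
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k_i = -acc / err                 # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k_i * a[i - j]
        new_a[i] = k_i
        a = new_a
        err *= (1.0 - k_i * k_i)
    return a  # a[0] = 1; A(z) = 1 + sum_i a_i z^-i
```

Feeding a decaying exponential (roughly an AR(1) source with pole 0.9) recovers a first coefficient near −0.9, as expected for this model.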
2) Quantization of the LSF Coefficients: The LP filter coefficients a_i, i = 1, …, 10, are converted to line spectral frequencies (LSF) using Chebyshev polynomials [20]. In this procedure, the roots are found in the cosine domain. Since the quantizer is vector quantization (VQ) based, it is more convenient to represent the LSF's as normalized radian frequencies. The relation between these two representations is given by

    ω_i = arccos(q_i),    i = 1, …, 10    (5)

where q_i are the LSF coefficients in the cosine domain, and ω_i are the LSF coefficients in the frequency domain.

To keep the algorithmic delay as low as possible, the update of the LP coefficients is done every 10 ms. However, most speech spectra vary slowly in time, and a slower update (e.g., 20 ms) would provide a better tradeoff between spectral representation and bit rate. Since the higher update rate introduces a strong correlation between coefficients from frame to frame, a good compromise is to use predictive VQ. During onsets this correlation is not very strong. To accommodate both types of correlation, the predictor switches between two modes, one representing a strong correlation and one representing a
mild correlation. Another advantage of using a separate bit for the switch is that it effectively reduces the size of the codebook, thereby reducing the storage requirements. To limit propagation of channel errors, the predictor is based on an MA filter. The length of this filter was determined empirically using a large data base [2], and it was found that a fourth-order MA predictor forms a good compromise between performance and error propagation. The quantizer is organized as follows: a switched fourth-order MA prediction is used to predict the LSF coefficients of the current frame. The difference between the computed and predicted coefficients is quantized using a two-stage vector quantizer. The first stage is a 10-D VQ using codebook L1 with 128 entries (7 b). The second stage is a 10-b VQ that has been implemented as a split VQ using two 5-D codebooks, L2 and L3, containing 32 entries (5 b) each. The reason for using a nonsplit first stage is that it allows for the exploitation of the correlations between the first 5 and last 5 LSF coefficients. At the second stage these correlations are less strong, and the split reduces search time and storage requirements.
To explain the quantization process, it is convenient to first describe the decoding process. Each quantized value is obtained from the sum of two codewords, as follows:

    l̂_i = L1_i(k₁) + L2_i(k₂),      i = 1, …, 5
    l̂_i = L1_i(k₁) + L3_{i−5}(k₃),  i = 6, …, 10    (6)

where k₁, k₂, and k₃ are the codebook indices. To guarantee that the reconstructed filters are stable, the vector l̂ is arranged such that adjacent elements have a minimum distance of J (see [11]). This rearrangement process is done twice, first with a value of J = 0.0012, then with a value of J = 0.0006. The incorporation of this process into the quantization procedure assures that each of the possible reconstructed vectors l̂ produces a stable filter. After this rearrangement process, the quantized LSF coefficients ω̂_i^(m) for the current frame m are obtained from the weighted sum of previous quantizer outputs l̂^(m−k) and the current quantizer output l̂^(m)

    ω̂_i^(m) = (1 − Σ_{k=1}^{4} p̂_{i,k}) l̂_i^(m) + Σ_{k=1}^{4} p̂_{i,k} l̂_i^(m−k)    (7)

where p̂_{i,k} are the coefficients of the switched MA predictor as defined by codebook L0 and the one-bit codebook index k₀. At startup, the initial values of l̂ are given by l̂_i = iπ/11 for all k < 0.

After computing ω̂_i, the corresponding filter is checked for stability and unnatural sharp resonances by checking the ordering property (i.e., ω̂₁ < ω̂₂ < ⋯ < ω̂₁₀). If this condition is not met, the frequencies are moved using a heuristic process, which enforces a minimum spacing of 50 Hz between the coefficients [11].
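The decoding side of (6) and (7) can be sketched as below. The codebooks, the MA predictor tables, and the per-frame (rather than per-coefficient) predictor weights are simplifying assumptions here; the bit-exact tables are those of [11].

```python
def rearrange(lsf, min_dist):
    """Enforce a minimum distance between adjacent LSF values by pushing
    a too-close pair apart symmetrically (done twice in the text, with
    two different distances)."""
    out = list(lsf)
    for i in range(1, len(out)):
        gap = out[i] - out[i - 1]
        if gap < min_dist:
            shift = (min_dist - gap) / 2.0
            out[i - 1] -= shift
            out[i] += shift
    return out

def decode_lsf(k0, k1, k2, k3, history, L1, L2, L3, ma_predictors):
    """Sketch of the two-stage LSF decoder: codeword = first-stage 10-D
    vector plus split second-stage halves, (6), then a switched
    fourth-order MA predictor reconstructs the LSFs, (7). `history`
    holds the four previous quantizer outputs. The predictor here uses
    one weight per lag, a simplification of the per-coefficient p_{i,k}."""
    l_hat = [L1[k1][i] + (L2[k2][i] if i < 5 else L3[k3][i - 5])
             for i in range(10)]
    l_hat = rearrange(l_hat, 0.0012)   # first rearrangement pass
    l_hat = rearrange(l_hat, 0.0006)   # second, tighter pass
    p = ma_predictors[k0]              # switched MA predictor (1-b index)
    w0 = 1.0 - sum(p)
    lsf = [w0 * l_hat[i] + sum(p[k] * history[k][i] for k in range(4))
           for i in range(10)]
    return lsf, l_hat
```

With an all-zero predictor selected, the decoded LSFs reduce to the rearranged codebook sum, which makes the structure easy to check in isolation.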
The procedure for encoding the LSF parameters can be outlined as follows. For each of the two MA predictors, the best approximation to the current LSF coefficients has to be found. The best approximation is defined as the one that minimizes the weighted mean-squared error

    E = Σ_{i=1}^{10} w_i (ω_i − ω̂_i)²    (8)
The weights emphasize the relative importance of each LSF. Spectral resonances (closely spaced LSF's) are perceptually more important, and several (heuristic) procedures to derive these coefficients can be found in the literature (cf. [21]). The following heuristic procedure was found to improve performance and be computationally efficient. The weights are made adaptive as a function of the unquantized LSF coefficients,

    w₁  = 10(ω₂ − 0.04π − 1)² + 1,          if ω₂ − 0.04π − 1 < 0; otherwise 1
    w_i = 10(ω_{i+1} − ω_{i−1} − 1)² + 1,   if ω_{i+1} − ω_{i−1} − 1 < 0; otherwise 1,   i = 2, …, 9
    w₁₀ = 10(0.92π − ω₉ − 1)² + 1,          if 0.92π − ω₉ − 1 < 0; otherwise 1    (9)

In addition, the weights w₅ and w₆ are multiplied by 1.2 each.

The vector to be quantized for the current frame m is obtained from

    l_i^(m) = [ω_i^(m) − Σ_{k=1}^{4} p̂_{i,k} l̂_i^(m−k)] / (1 − Σ_{k=1}^{4} p̂_{i,k})    (10)
The first codebook L1 is searched, and the entry that minimizes the (unweighted) mean-squared error (MSE) is selected. This is followed by a search of the second codebook L2, which defines the lower part of the second stage. The weighted MSE of (8) is computed, and the vector which results in the lowest error is selected. Using the selected first-stage vector and the lower part of the second stage, the higher part of the second stage is searched from codebook L3. The vector that minimizes the weighted MSE is selected. The resulting vector l̂^(m) is rearranged twice using the procedure outlined earlier. This process is done for each of the two MA predictors defined by L0, and the MA predictor that produces the lowest weighted MSE is selected.
3) Interpolation of the LSF Coefficients: The quantized (and unquantized) LP coefficients are used for the second subframe. For the first subframe, the quantized (and unquantized) LP coefficients are obtained by linear interpolation of the corresponding parameters in the adjacent subframes. The interpolation is done on the LSF coefficients in the cosine domain rather than the frequency domain. Interpolating in either domain did not produce noticeable audible differences, and the cosine domain was selected because of ease of implementation. Once the LSF coefficients are quantized and interpolated, they are converted back to LP coefficients.

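The subframe interpolation above is a two-line operation; the sketch below assumes equal weighting of the previous and current frames, working directly on cosine-domain (LSP) values q = cos(ω).

```python
def interpolate_lsp(q_prev, q_curr):
    """Sketch of the subframe LSF interpolation described above, done in
    the cosine (LSP) domain. The second subframe uses the current
    frame's values; the first subframe is the mean of the previous and
    current frames (equal weights assumed here)."""
    sub1 = [0.5 * (a + b) for a, b in zip(q_prev, q_curr)]
    sub2 = list(q_curr)
    return sub1, sub2
```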
C. Perceptual Weighting

The weighted speech signal sw(n) in a subframe is obtained by filtering the speech through a perceptual weighting filter W(z). This perceptual weighting filter [22] is based on the unquantized LP filter coefficients a_i, and is given by

    W(z) = A(z/γ₁) / A(z/γ₂)    (11)

The use of the unquantized coefficients gives a weighting filter that better matches the original spectrum. The values of γ₁ and γ₂ modify the frequency response of the filter W(z), and thereby the amount of noise weighting. It is difficult to find fixed values of γ₁ and γ₂ that provide good performance for different input signal characteristics. For example, differences in the low-frequency cutoff would lead to different choices for these coefficients. Hence, the values of γ₁ and γ₂ are made a function of the spectral shape of the input signal. For signals with a lot of low-frequency energy, the amount of weighting is increased.
This adaptation is done once per 10 ms frame, but an interpolation procedure for each first subframe is used to smooth this adaptation process. The spectral shape is obtained from a second-order linear prediction filter, obtained as a by-product of the Levinson–Durbin recursion. The reflection coefficients k_i are converted to log area ratio (LAR) coefficients o_i by

    o_i = log[(1 + k_i)/(1 − k_i)],    i = 1, 2    (12)

The LAR coefficients are used because they have better interpolation properties than reflection coefficients [23]. The LAR coefficients corresponding to the current 10 ms frame are used for the second subframe. The LAR coefficients for the first subframe are obtained through linear interpolation with the LAR parameters from the previous frame. The spectral envelope is characterized as being either flat (flat = 1) or tilted (flat = 0). For each subframe, this characterization is obtained by applying a threshold function to the LAR coefficients. To avoid rapid changes, a hysteresis is used by taking into account the value of flat in the previous subframe; the resulting threshold function, which compares the interpolated LAR coefficients against fixed thresholds, is given in (13).
If the interpolated spectrum for a subframe is classified as flat (flat = 1), the weight factors are set to γ₁ = 0.94 and γ₂ = 0.6. If the spectrum is classified as tilted (flat = 0), the value of γ₁ is set to 0.98, and the value of γ₂ is adapted to the strength of the resonances in the LP synthesis filter, but is bounded between 0.4 and 0.7. If a strong resonance is present, the value of γ₂ is set closer to the upper bound. This adaptation is done to reduce the amount of unmasked noise at the formant frequencies. The adaptation of γ₂ is done by using a heuristic criterion based on the minimum distance between two successive LSF coefficients for the current subframe. The minimum distance is given by

    d_min = min[ω_{i+1} − ω_i],    i = 1, …, 9    (14)

The value of γ₂ is computed using

    γ₂ = −6.0 d_min + 1.0,    bounded by 0.4 ≤ γ₂ ≤ 0.7    (15)

The values of γ₁ and γ₂ for the different conditions were obtained through many informal listening experiments using expert listeners. The process was done in stages by selecting speech material that could be characterized as falling into one of three categories: 1) flat spectrum (flat = 1); 2) tilted spectrum (flat = 0) and no strong resonances; and 3) tilted spectrum (flat = 0) with strong resonances. This allowed independent optimization of the weight factors. In all experiments, both single and double encodings were considered. In general, the improvements due to this adaptation of the weights are most noticeable for double encodings.
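The weighting filter of (11) amounts to an FIR stage A(z/γ₁) followed by an all-pole stage 1/A(z/γ₂). The direct-form state handling below is an illustrative sketch, not the fixed-point routine of [11].

```python
def weighting_filter(speech, a, g1, g2):
    """Apply W(z) = A(z/g1) / A(z/g2) sample by sample, where
    A(z) = 1 + sum_i a_i z^-i (a[0] must be 1). Bandwidth-expanded
    coefficients are a_i * g^i."""
    order = len(a) - 1
    num = [a[i] * g1 ** i for i in range(order + 1)]   # A(z/g1), FIR part
    den = [a[i] * g2 ** i for i in range(order + 1)]   # A(z/g2), IIR part
    x_hist = [0.0] * order   # past inputs  s(n-1), s(n-2), ...
    y_hist = [0.0] * order   # past outputs sw(n-1), sw(n-2), ...
    out = []
    for s in speech:
        v = num[0] * s + sum(num[i] * x_hist[i - 1] for i in range(1, order + 1))
        y = v - sum(den[i] * y_hist[i - 1] for i in range(1, order + 1))
        x_hist = [s] + x_hist[:-1]
        y_hist = [y] + y_hist[:-1]
        out.append(y)
    return out
```

A useful sanity check: with γ₁ = γ₂ the filter collapses to W(z) = 1 and the output equals the input.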
D. Pitch Analysis

The pitch analysis technique described in [4] is used. An open-loop pitch lag T_op is estimated once per 10 ms frame using the weighted speech signal sw(n). The adaptive-codebook approach is used to represent the periodic component in the excitation signal. The selected adaptive-codebook vector is represented by an index, which corresponds to a certain fractional lag value.

For each subframe, the target signal x(n) and the impulse response h(n) of the weighted synthesis filter are computed. A closed-loop adaptive-codebook search is performed in the first subframe around the index corresponding to the open-loop pitch lag estimate T_op. A 1/3 fractional sample resolution is used in the range 19⅓–84⅔, and integers only are used in the range 85–143. It was found that this choice of resolution provides a good trade-off between performance and bit rate. The adaptive-codebook index in the first subframe is encoded with 8 b. In the second subframe, a 1/3 fractional sample resolution is used in the range T₁ − 5⅔ to T₁ + 4⅔, where T₁ is the integer part of the adaptive-codebook lag in the first subframe. This range is adapted for the cases where T₁ straddles the boundaries of the lag range. The lag in the second subframe is differentially encoded with 5 b. Since the open-loop pitch estimate provides a form of pitch tracking, the differential coding does not introduce noticeable degradations in the speech quality.

1) Open-Loop Pitch Lag Estimation: The open-loop pitch lag estimation uses the weighted speech signal sw(n), and is done as follows. In the first step, three maxima of the correlation

    R(k) = Σ_{n=0}^{79} sw(n) sw(n − k)    (16)

are found in the following three ranges: 1) k = 80, …, 143; 2) k = 40, …, 79; and 3) k = 20, …, 39. Note that for n < k, signal values from the previous frame are used. The retained maxima R(t_i), where t_i, i = 1, …, 3, are the lag values corresponding to the maxima in the three lag regions, are normalized through

    R′(t_i) = R(t_i) / sqrt[ Σ_n sw²(n − t_i) ],    i = 1, …, 3    (17)
The winner among the three normalized correlations is selected by favoring the lags with values in the lower range. This is done by weighting the normalized correlations corresponding to the longer lag values. This procedure of dividing the lag range into three sections and favoring the smaller values is used to discourage choosing pitch multiples.
2) Computation of the Target Signal: The LP residual is given by

    r(n) = s(n) + Σ_{i=1}^{10} â_i s(n − i)    (18)

The target signal x(n) for the adaptive-codebook search is computed by filtering the LP residual signal r(n) through the combination of the synthesis filter 1/Â(z) and the weighting filter W(z). After determining the excitation for the subframe, the initial states of these filters are updated by filtering the difference between the residual and excitation signals.
3) Adaptive-Codebook Search: The adaptive-codebook parameters (or pitch parameters) are the indices corresponding to a certain lag and gain. In the adaptive-codebook approach for implementing the pitch filter [24], the excitation is repeated for lags less than the subframe length. The use of fractional lags makes this process computationally expensive during the search stage. Hence, during the search, the excitation beyond the duration of the pitch is extended by the LP residual. This procedure is simpler, and it was found to produce results similar to using the adaptive codebook for the complete subframe. Note that once the lag has been determined, the conventional adaptive-codebook approach is used to generate the adaptive-codebook vector.

For each 5 ms subframe, the lag is determined using a closed-loop analysis that minimizes the weighted MSE. In the first subframe, the lag T₁ is found by searching a small range (six samples) of values around the open-loop lag T_op (see Section II-D1). For the second subframe, the closed-loop adaptive-codebook search is done around the lag selected in the first subframe to find the optimal lag T₂.

The closed-loop search minimizes the mean-squared weighted error between the original and reconstructed speech. This is achieved by maximizing the term

    R(k) = Σ_{n=0}^{39} x(n) y_k(n) / sqrt[ Σ_{n=0}^{39} y_k(n) y_k(n) ]    (19)

where x(n) is the target signal and y_k(n) is the past filtered excitation at delay k (past excitation convolved with h(n), where h(n) is the impulse response of the weighted synthesis filter).
For the determination of T₂, and also for T₁ if the optimum integer closed-loop lag is less than 85, the fractions around the optimum integer lag have to be tested. Instead of computing the correlation for each lag value, only the normalized correlations (19) corresponding to integer lags are computed. The correlation values corresponding to the fractional lags are obtained through interpolation of the normalized correlation function in (19) using a finite impulse response (FIR) filter based on a Hamming-windowed sin(x)/x function truncated at ±11. The lag value corresponding to the (interpolated) maximum correlation is selected, and the adaptive-codebook vector v(n) is computed by interpolating the past excitation signal u(n) at the given integer lag and fraction [14]. The interpolation filter is based on a Hamming-windowed sin(x)/x function truncated at ±29 and padded with zeros at ±30. The filter has a cutoff frequency (−3 dB) at 3600 Hz in the oversampled domain. Although the lower cutoff frequency allows fewer taps for the realization of the filter, it is necessary to also filter the integer lags, thereby increasing complexity. However, it was found that the lowpass action of the filter resulted in a smoother quality of the reconstructed signal, which justified this tradeoff.
Once the adaptive-codebook index has been determined, the adaptive-codebook gain g_p is computed as

    g_p = Σ_{n=0}^{39} x(n) y(n) / Σ_{n=0}^{39} y(n) y(n),    bounded by 0 ≤ g_p ≤ 1.2    (20)

where x(n) is the target signal and y(n) is the filtered adaptive-codebook vector, obtained by convolving v(n) with h(n). The scaled and filtered adaptive-codebook vector is subtracted from x(n) to produce a new target vector x₂(n).
E. Fixed (Algebraic) Codebook

A 17-b algebraic codebook is used for the fixed codebook. Algebraic codebooks are deterministic codebooks in which the codebook vectors are determined from the transmitted index using simple algebra rather than lookup tables. This structure has advantages in terms of storage, search complexity, and robustness [25]–[27]. Each fixed-codebook vector contains four nonzero pulses. These pulses can assume the amplitudes and positions given in Table II, and are encoded separately using the bit allocation given in this table.
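The algebraic structure can be made concrete as below. The interleaved track layout (three 8-position tracks plus one 16-position track, giving 3 + 3 + 3 + 4 position bits and 4 sign bits = 17 b) follows the G.729 codebook as commonly tabulated; treat it as a sketch of Table II rather than a reproduction of it.

```python
def build_codevector(positions, signs, n=40):
    """Sketch of a 17-b ACELP codevector: four unit pulses, one per
    interleaved position track, each with its own sign; the vector is
    zero elsewhere."""
    tracks = [
        list(range(0, 40, 5)),                          # pulse 0: 0,5,...,35
        list(range(1, 40, 5)),                          # pulse 1: 1,6,...,36
        list(range(2, 40, 5)),                          # pulse 2: 2,7,...,37
        list(range(3, 40, 5)) + list(range(4, 40, 5)),  # pulse 3: two tracks
    ]
    c = [0.0] * n
    for track, pos, s in zip(tracks, positions, signs):
        assert pos in track, "pulse position must lie on its track"
        c[pos] += s          # s is +1 or -1
    return c
```

Because the vector is defined by four positions and four signs, no codebook table is stored, which is the storage advantage mentioned above.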
Let c_k be the codebook vector at index k. The optimum codeword is the one that maximizes the term

    T_k = C_k² / E_k = (dᵗ c_k)² / (c_kᵗ Φ c_k)    (21)

where d = Hᵗ x₂ is the vector of correlations between the target signal x₂(n) and the impulse response h(n) of the weighted synthesis filter, and Φ = Hᵗ H is the matrix of correlations of h(n).

The structure of the codebook allows for a fast-search procedure, since the fixed-codebook vector c_k contains only four nonzero pulses whose amplitudes are ±1. The search is performed in four nested loops, where in each loop the contribution of a new pulse is added. The correlation in the numerator of (21) is given by

    C = Σ_{i=0}^{3} s_i d(m_i)    (22)

where m_i is the position of the ith pulse and s_i is its sign. The energy in the denominator of (21) is given by

    E = Σ_{i=0}^{3} φ(m_i, m_i) + 2 Σ_{i=0}^{2} Σ_{j=i+1}^{3} s_i s_j φ(m_i, m_j)    (23)
The search complexity is greatly reduced by using the following procedure. The most likely amplitude of a pulse
TABLE II
STRUCTURE OF THE 17-b FIXED CODEBOOK

occurring at a certain position is estimated using d(n). More precisely, the amplitude of a pulse at a certain position is set a priori to the sign of d(n) at that position. This choice of signs for a given combination of pulses maximizes the correlation term in (21). Therefore, before entering the codebook search, the following steps are taken. First, the signal d(n) is decomposed into its absolute value d′(n) = |d(n)| and its sign, which characterizes the preselected pulse amplitudes at each of the 40 possible pulse positions. Second, the matrix Φ is modified in order to include the preset pulse amplitudes; that is, φ′(i, j) = sign[d(i)] sign[d(j)] φ(i, j). The correlation in (22) is now given by

    C = Σ_{i=0}^{3} d′(m_i)    (24)

and the energy in (23) is given by

    E = Σ_{i=0}^{3} φ′(m_i, m_i) + 2 Σ_{i=0}^{2} Σ_{j=i+1}^{3} φ′(m_i, m_j)    (25)
To evaluate all pulse positions, a total of 2¹³ = 8192 combinations need to be examined. To reduce the complexity, the search is focused on those combinations that are likely to provide a good match. Since the last loop has 16 possible pulse positions, it was decided to limit the number of times that this loop is entered. In an exhaustive search, this loop is entered 8 × 8 × 8 = 512 times to find the position of the fourth pulse for all possible combinations of three pulse positions. By keeping track of the relative contribution of the first three pulses toward the objective of maximizing (24), it is possible to make the (heuristic) decision not to enter the search loop for the fourth pulse.
This is done by setting a threshold based on the correlation C. The threshold is a function of the maximum absolute correlation and the average correlation due to the contribution of the first three pulses. The maximum correlation due to the contribution of the first three pulses is given by

    max₃ = Σ_{i=0}^{2} max_j { d′(5j + i) }    (26)

where the maxima of d′(n) are taken over the first three position tracks given in Table II. The average correlation due to the contribution of the first three pulses is given by

    av₃ = (1/8) Σ_{i=0}^{2} Σ_{j=0}^{7} d′(5j + i)    (27)

The search threshold is given by

    thr₃ = av₃ + K₃(max₃ − av₃)    (28)

This threshold value is preset before starting the codebook search. The last loop is searched only if the correlation due to the contribution of the first three pulses exceeds this threshold. Note that the value K₃ controls the number of times the last loop is entered. It was found that a value of K₃ = 0.4 provides performance equivalent to an exhaustive search. In an exhaustive search, the last loop is entered 512 times. At the threshold factor K₃ = 0.4, the average value of N, the number of times the last loop is entered, is 60, and only 5% of the time does it exceed 90. The search complexity is determined by the number of codewords searched in a 10 ms frame (two subframes). To limit the worst-case complexity, the value of N in the two subframes is limited to a maximum value. The first subframe is allowed a maximum value of 105, and the second subframe is left with the remainder after subtracting N₁, where N₁ is the value of N in the first subframe. Using this approach, the search is forced to stop only 1% of the time. The average worst case is 90 times per subframe. That is, the worst case of the number of tested positions is 90 × 16 = 1440 out of 8192 combinations.
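The threshold of (26)–(28) depends only on d′(n) and can be precomputed before the nested loops, as in this sketch (track layout as in Table II; the helper name is ours):

```python
def fourth_loop_threshold(d_abs, K3=0.4):
    """Sketch of the focused-search threshold: from |d(n)| on the first
    three position tracks, compute the max (26) and average (27)
    three-pulse correlations and return thr3 = av3 + K3*(max3 - av3),
    (28). The fourth-pulse loop is entered only when the running
    three-pulse correlation exceeds this value."""
    tracks = [range(0, 40, 5), range(1, 40, 5), range(2, 40, 5)]
    max3 = sum(max(d_abs[p] for p in t) for t in tracks)          # (26)
    av3 = sum(sum(d_abs[p] for p in t) / 8.0 for t in tracks)     # (27)
    return av3 + K3 * (max3 - av3)                                # (28)
```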
The periodicity in the excitation is provided by the adaptive-codebook contribution only. High-pitched voices will have more than one pitch pulse in a subframe. In that situation the fixed-codebook contribution could potentially reduce the amount of periodicity, leading to degraded speech quality. This effect is avoided by introducing periodicity in the fixed-codebook excitation by filtering the fixed-codebook vector through the filter P(z) = 1/(1 - β z^(-T)), where T is the integer part of the pitch lag for the current subframe, and β is the adaptive-codebook gain. This modification of the fixed-codebook vectors is integrated in the search by filtering the impulse response h(n) with P(z) before the codebook search (for lag values less than the frame size). Since at that time the quantized adaptive-codebook gain is not known, it was found that the past quantized pitch gain bounded by [0.2, 0.8] provided a good alternative.
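A minimal sketch of this harmonic sharpening (floating-point here; the G.729 reference is fixed-point): filtering through P(z) = 1/(1 - β z^(-T)) amounts to adding a β-scaled copy of the already-filtered output one pitch lag back.

```python
def sharpen_fixed_codebook(c, T, beta):
    """Filter the fixed-codebook vector c through 1/(1 - beta * z^-T):
    each output sample gets beta times the output sample T positions
    earlier. beta stands in for the past quantized pitch gain bounded
    by [0.2, 0.8]."""
    out = list(c)
    for n in range(T, len(out)):
        out[n] += beta * out[n - T]
    return out
```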

F. Quantization of the Gains

The adaptive-codebook gain and the fixed-codebook gain are vector quantized using 7 b. This joint quantization provides a saving of about 2 b compared to scalar quantization. Based on informal comparisons, it was found that this quantization did not introduce any noticeable degradations to the speech quality compared to unquantized gains.

The gain codebook search is done by minimizing the mean-squared weighted error between original and reconstructed speech, which is given by

	E = ||x - g_p y - g_c z||^2	(29)

where x is the target vector (see Section II-D2), and y and z are the filtered adaptive-codebook and fixed-codebook vectors, respectively.
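The criterion in (29) can be evaluated directly for a candidate gain pair; a sketch with hypothetical vectors:

```python
def gain_error(x, y, z, g_p, g_c):
    """Mean-squared weighted error of Eq. (29): E = ||x - g_p*y - g_c*z||^2,
    where x is the target vector and y, z are the filtered adaptive- and
    fixed-codebook vectors."""
    return sum((xi - g_p * yi - g_c * zi) ** 2
               for xi, yi, zi in zip(x, y, z))
```

The gain codebook search evaluates this error for each candidate gain pair and keeps the pair giving the minimum.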
1) Prediction of the Fixed-Codebook Gain: The fixed-codebook gains in adjacent frames are correlated. An efficient way to exploit this redundancy is to use a log-energy gain predictor [15]. This gain predictor not only helps in reducing the dynamic range of the fixed-codebook gain, it also makes this range less dependent on the input level variations. The use of a moving-average filter reduces the propagation of channel errors.
The fixed-codebook gain g_c can be expressed as

	g_c = γ g_c'	(30)

where g_c' is a predicted gain based on previous fixed-codebook energies, and γ is a correction factor, which is coded for transmission.

The mean energy of the fixed-codebook contribution is given by

	E = 10 log[(1/40) Σ_{n=0}^{39} c^2(n)]	(31)

where c(n) is the fixed-codebook vector.
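The mean energy of the fixed-codebook contribution can be computed directly; a sketch assuming a 40-sample subframe and floating-point arithmetic:

```python
import math

def mean_energy_db(c):
    """Mean energy (in dB) of the fixed-codebook contribution:
    E = 10*log10((1/N) * sum of c(n)^2), with N = len(c)
    (40 samples per subframe in G.729)."""
    return 10.0 * math.log10(sum(v * v for v in c) / len(c))
```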
2) Codebook Search for Gain Quantization: The adaptive-codebook gain, g_p, and the factor γ are vector quantized using a two-stage conjugate-structured codebook [2]. The term conjugate refers to the fact that each input vector is quantized as a linear combination of both codebooks. Such a structure reduces both computational and memory requirements [28]. The first stage consists of a 3-b two-dimensional codebook, and the second stage of a 4-b two-dimensional codebook.