IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 8, NOVEMBER 2001
`
`Enhanced Waveform Interpolative Coding
`at Low Bit-Rate
`
`Oded Gottesman, Member, IEEE, and Allen Gersho, Fellow, IEEE
`
`Abstract—This paper presents a high quality enhanced wave-
`form interpolative (EWI) speech coder at low bit-rate. The system
`incorporates novel features such as optimization of the slowly
`evolving waveform (SEW) for interpolation, analysis-by-synthesis
`(AbS) vector quantization (VQ) of the SEW dispersion phase,
`dual-predictive AbS quantization of the SEW, efficient parameter-
`ization of the rapidly-evolving waveform (REW) magnitude, and
`VQ of the REW parameter, a special pitch search for transitions,
`and switched-predictive analysis-by-synthesis gain VQ. Subjective
`tests indicate that the 2.8 kb/s EWI coder’s quality exceeds that
`of G.723.1 at 5.3 kb/s, and it is slightly better than that of G.723.1
`at 6.3 kb/s.
`Index Terms—Analysis-by-synthesis, phase dispersion, speech
`coding,
`speech compression, vector quantization, waveform
`interpolation, waveform interpolative coding.
`
`I. INTRODUCTION
`
IN RECENT years, there has been increasing interest in
achieving toll-quality speech coding at rates of 4 kb/s
`and below. Currently, there is an ongoing 4 kb/s standardiza-
`tion effort conducted by the ITU-T. The expanding variety
`of emerging applications for speech coding, such as third
`generation wireless networks and Low Earth Orbit (LEO)
`systems, is motivating increased research efforts. The speech
`quality produced by waveform coders such as code-excited
`linear prediction (CELP) coders [1] degrades rapidly at
`rates below 5 kb/s. On the other hand, parametric coders
`such as the waveform-interpolative (WI) coder [8]–[20], the
`sinusoidal-transform coder (STC) [2], the multiband-excita-
`tion (MBE) coder [3], the mixed-excitation linear predictive
`(MELP) vocoder [4], [5], and the harmonic-stochastic excita-
`tion (HSX) coder [6] produce good quality at low rates, but
`they do not achieve toll quality. This is largely due to the lack of
`robustness of speech parameter estimation, which is commonly
performed in open-loop, and to inadequate modeling of nonstationary
speech segments. In this work we propose a paradigm
for WI coding that incorporates analysis-by-synthesis (AbS)
for parameter estimation, offers higher temporal and spectral
resolution for the rapidly-evolving waveform (REW), and more
efficient quantization of the slowly-evolving waveform (SEW).

Manuscript received April 25, 2000; revised June 19, 2001. This work
was supported in part by the National Science Foundation under Grant
MIP-9707764, the University of California MICRO Program, Cisco Systems,
Inc., Conexant Systems, Inc., Dialogic Corp., Fujitsu Laboratories of America,
Inc., General Electric Co., Hughes Network Systems, Lernout & Hauspie
Speech Products, Lockheed Martin, Lucent Technologies, Inc., Panasonic
Speech Technology Laboratory, Qualcomm, Inc., and Texas Instruments, Inc.
The associate editor coordinating the review of this manuscript and approving
it for publication was Dr. Peter Kroon.
A. Gersho is with the Department of Electrical and Computer Engineering,
University of California, Santa Barbara, CA 93106 USA (e-mail:
gersho@ece.ucsb.edu; http://scl.ece.ucsb.edu).
O. Gottesman is with the Department of Electrical and Computer Engineering,
University of California, Santa Barbara, CA 93106 USA and also with
Compandent, Inc., Goleta, CA 93117 USA (e-mail: gottesman@ece.ucsb.edu;
oded@gottesmans.com; http://www.compandent.com).
Publisher Item Identifier S 1063-6676(01)08235-9.

`The WI coders [13]–[20] use nonideal low-pass filters for
`downsampling and upsampling of the SEW. We describe a novel
`AbS SEW quantization scheme, which takes the nonideal filters
`into consideration. An improved match between reconstructed
`and original SEW spectra is obtained, most notably in transition
`segments of speech.
`Commonly in WI coding, the similarity between successive
`REW magnitudes is exploited by downsampling and interpola-
`tion and by bit allocation that constrains similarity [13]. In our
`previous enhanced waveform-interpolative (EWI) coder [22],
`[23], the REW magnitude was quantized on a waveform by
`waveform basis, and with an excessive number of bits—more
`than is perceptually required. Here we propose a novel para-
`metric representation of the REW magnitude and an efficient
`paradigm for AbS predictive vector quantization of the REW
`parameter sequence. The new method achieves a substantial re-
`duction in the REW bit-rate.
`In low bit-rate WI coding, the relation between the SEW and
`the REW magnitudes was exploited by computing the magni-
`tude of one as the unity complement of the other [14], [17]–[20].
`Also, since the sequence of SEW spectrum evolves slowly, suc-
`cessive SEWs exhibit similarity, offering opportunities for re-
`dundancy removal. Additional forms of redundancy that may
`be exploited for coding efficiency are 1) for a fixed SEW/REW
`decomposition filter, the mean SEW magnitude increases with
`the pitch period and 2) the similarity between successive SEWs,
`also increases with the pitch period. These phenomena are due
`to the fact that, for uniformly extracted waveforms, the overlap
`between successive waveforms increases with the pitch period.
`In this work, we introduce a novel “dual-predictive” AbS para-
`digm for quantizing the SEW magnitude that optimally exploits
`the information about the current quantized REW, the past quan-
`tized SEW, and the pitch, in order to estimate the current SEW.
`In parametric coders the phase information is commonly not
`transmitted, and this is for two reasons: first, the phase is of sec-
`ondary perceptual significance; and second, no efficient phase
`quantization scheme is known. WI coders [8]–[20] typically use
`a fixed phase vector for the SEW, for example, in [14], [19],
`a fixed male speaker extracted phase was used. On the other
`hand, waveform coders such as CELP [1], by directly quan-
`tizing the waveform, implicitly allocate an excessive number
`of bits to the phase information—more than is perceptually re-
`quired. In the past [31]–[34], phase modeling and quantization
`
`1063–6676/01$10.00 © 2001 IEEE
`was investigated. In [32] a random phase codebook was used
`at a relatively high number of phase quantization bits. In [33],
`[34], a noncausal all-pole filter’s phase model was discussed,
`but quantization was not optimized. We have observed that such
`a model is quite inadequate in matching the physiological exci-
`tation’s phase, although occasionally it does provide a reason-
`able match. In addition, none of the above methods have in-
`corporated perceptual weighting. Recently [21], we proposed a
`novel, efficient AbS VQ encoding of the dispersion phase of the
`excitation signal to enhance the performance of the WI coder at
`a low bit-rate, which can be used for parametric coders as well
`as for waveform coders. The EWI coder presented here employs
`this scheme, which incorporates perceptual weighting and does
`not require any phase unwrapping.
`Pitch accuracy is crucial for high quality reproduced speech
`in WI coders. We introduce a novel pitch search technique based
`on varying segment boundaries; it allows for locking onto the
`most probable pitch period during transitions or other segments
`with rapidly varying pitch.
`Commonly in speech coding the gain sequence is downsam-
`pled and interpolated. As a result it is often smeared during plo-
`sives and onsets. In the past, this problem was addressed by
`employing a special mechanism that mimicked the gain char-
`acteristics [14]. To alleviate this problem, we propose a novel
`switched-predictive AbS gain VQ scheme based on temporal
`weighting.
`This paper is organized as follows. Section II describes the
`WI coder. In Section III we explain the AbS SEW optimization.
`The dispersion phase quantizer is discussed in Section IV. In
`Section V we explain the REW parameterization, and the cor-
`responding AbS VQ. The dual predictive SEW AbS VQ and its
`performance are discussed in Section VI. Section VII describes
`the pitch search. In Section VIII we present the switched-pre-
`dictive AbS gain VQ. The bit allocation is given in Section IX.
`Subjective results are reported in Section X. Finally, we sum-
`marize our work.
`
`II. DESCRIPTION OF THE WAVEFORM INTERPOLATIVE CODER
`
`A. Introduction to Waveform Interpolation
`
During voiced speech, which is quasiperiodic, one can observe
the gradually evolving shape of successive
pitch cycles. A continuously evolving sequence of pitch cycle
`waveforms can be generated from a continuous-time signal, ei-
`ther from the linear prediction residual or from the speech wave-
`form directly. For coding purposes, one may extract a subse-
`quence of these waveforms, and apply quantization to it. At the
`decoder, following inverse quantization, speech synthesis can be
`performed by interpolating missing waveforms. Such a process
`is the essence of waveform interpolative coding [8]–[20].
`Speech segments typically contain both voiced and unvoiced
`attributes. The different perceived character of the voiced and
`unvoiced components [27] suggests a separation of the compo-
`nents, and applying distinct perceptually based coding to them
`[12]–[20].
`
`B. Definitions
Given a continuous linear prediction residual (or speech)
signal, $r(t)$, and its associated instantaneous pitch period
contour, $P(t)$, a characteristic waveform (CW) [8]–[20], $u(t,\phi)$,
may be generated by extracting pitch cycles at an infinitely
high rate, normalizing their length to $2\pi$, and aligning them
sequentially by a cyclical shift. The differential alignment
phase shift, $d\phi(t)$, is given by

$$d\phi(t) = \frac{2\pi}{P(t)}\,dt. \tag{1}$$

Therefore, the temporal accumulated phase shift is equal to

$$\phi(t) = \phi(t_0) + \int_{t_0}^{t} \frac{2\pi}{P(\tau)}\,d\tau \tag{2}$$

where $\phi(t_0)$ is the initial phase shift at time $t_0$. The CW is a
two-dimensional (2-D) surface which is defined by

$$u(t,\phi) = r\!\left(t + \frac{P(t)}{2\pi}\,\psi\big(\phi - \phi(t)\big)\right) \tag{3}$$

where $\psi(\cdot)$ wraps its argument over the range $[-\pi, \pi)$, and is defined by

$$\psi(\phi) = \big[(\phi + \pi) \;\text{modulo}\; 2\pi\big] - \pi. \tag{4}$$

The CW is a periodic function of the parameter $\phi$, with a period
$2\pi$. The residual (or speech) signal may be generated from the
CW by calculating its value along the phase shift contour

$$r(t) = u\big(t, \phi(t)\big). \tag{5}$$

`
`The WI coder based on this 2-D function is conceptually similar
`to the pitch synchronous transform coder [7].
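
The accumulated phase contour can be illustrated with a minimal discrete-time sketch. It assumes that pitch cycles are normalized to a length of 2π, so that each time step of length `dt` advances the phase by 2π·dt divided by the current pitch period. The function names, the unit time step, and the wrapping range [−π, π) are illustrative assumptions, not taken from the paper.

```python
import math

def accumulated_phase(pitch_periods, dt=1.0, phi0=0.0):
    """Running sum approximating the integral of 2*pi / P(t) (assumed form)."""
    phases = [phi0]
    for p in pitch_periods:
        # Each step of length dt advances the phase by 2*pi*dt / P.
        phases.append(phases[-1] + 2.0 * math.pi * dt / p)
    return phases

def wrap_phase(phi):
    """Wrap a phase value into [-pi, pi) (range chosen for illustration)."""
    return (phi + math.pi) % (2.0 * math.pi) - math.pi

# A constant pitch period of 40 samples completes one cycle in 40 steps.
contour = accumulated_phase([40.0] * 40)
```

With a constant pitch period the contour is a straight line; a time-varying pitch bends it, which is what lets the decoder place each interpolated pitch cycle at the right position.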
`
`C. Waveform Interpolative Coder Description
`The EWI coder is based on the WI coding model [11]–[14]. In
`this model, the CW is decomposed into two components called
`SEW and REW. The SEW, which is computed by low-pass fil-
`tering the 2-D CW surface along the time axis (also known as the
`evolutionary axis), contains most of the voiced speech attribute.
`The SEW is coded at low temporal resolution, high spectral res-
`olution, and using spectrally weighted distortion measure. The
`REW, which is the complementary high-pass component, repre-
`sents primarily the unvoiced speech attribute. The REW is coded
`at high temporal resolution, low spectral resolution, and by ex-
`ploiting spectral and temporal masking.
The EWI encoder is illustrated in Fig. 1. LPC analysis and
quantization are performed once per 20 ms frame, and interpolated
values are used for each of the ten waveforms in the frame.
The input speech is then passed through the resulting whitening
filter to produce the residual signal. A search for the pitch period
is performed and the pitch is quantized every 10 ms, and
is then interpolated. The interpolated pitch values are used for
pitch cycle waveform extraction, which is performed at a regular
rate (every 2 ms). The rate must be higher than the maximal
pitch frequency in order to prevent aliasing along the time axis
[14], [18]. The extracted waveforms are then power normalized,
`and sequentially aligned, to form a discrete-time CW, which is
`represented by a Fourier series (FS). The Fourier coefficients
`
`Fig. 1. Block diagram of the EWI encoder.
`
`Fig. 2. Block diagram of the EWI decoder.
`
`(FCs) are obtained by pitch-synchronous discrete Fourier trans-
`form (DFT). The frequency domain representation is used in
`order to benefit from appropriate perceptually motivated coding
`paradigms for the magnitude, and the phase. The CW is then
`low-pass filtered along the time axis, to produce the SEW. The
`REW is computed as the complementary high-pass component,
`and is then quantized. The SEW is downsampled, and then quan-
`tized every 20 ms. Finally, a local decoder is used to reconstruct
`the speech, then the encoder adjusts the gain to equate the re-
`constructed speech waveform energy to that of the input speech
`waveform, and quantizes the resultant gain.
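
The SEW/REW decomposition can be sketched as a low-pass filter applied to each Fourier coefficient along the sequence of extracted waveforms, with the REW taken as the remainder. This is a minimal illustration only: the fixed number of harmonics per waveform, the symmetric FIR taps, and the edge handling (repeating the first and last waveform) are assumptions made for brevity.

```python
def decompose_cw(cw_frames, taps):
    """Split a CW sequence into SEW (low-pass) and REW (complement).

    cw_frames: list of CW vectors (lists of complex FCs, equal length),
               one per extraction instant.
    taps:      symmetric low-pass FIR taps that sum to 1 (assumed).
    """
    half = len(taps) // 2
    n = len(cw_frames)
    sew, rew = [], []
    for m in range(n):
        acc = [0j] * len(cw_frames[m])
        for j, h in enumerate(taps):
            idx = min(max(m + j - half, 0), n - 1)  # repeat edge waveforms
            for k, c in enumerate(cw_frames[idx]):
                acc[k] += h * c
        sew.append(acc)
        # The REW is whatever the low-pass filter removed.
        rew.append([c - s for c, s in zip(cw_frames[m], acc)])
    return sew, rew

# A CW that does not change over time is purely "slowly evolving".
steady = [[1 + 0j, 0.5j] for _ in range(5)]
sew, rew = decompose_cw(steady, [0.25, 0.5, 0.25])
```

By construction the two components always sum back to the CW, so the split itself loses nothing; all loss comes from how each component is subsequently quantized.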
`The EWI decoder is illustrated in Fig. 2. The REW and the
`SEW are decoded, and an interpolated SEW is computed each
`2 ms. The REW and SEW are phase adjusted to achieve ade-
`quate voicing level and to benefit from temporal masking, and
`then added together. The resulting waveform is then power-nor-
`
`malized, and multiplied by the respective quantized gain. The
`pitch is decoded, and interpolated, and is then used for com-
`puting the phase contour using (2). The reconstructed residual is
computed by continuous waveform interpolation, which is performed
by computing the Fourier series along the phase contour
followed by overlap-and-add. Over the interpolation interval
$t_n \le t \le t_{n+1}$, the continuous reconstructed excitation
signal, $\hat{e}(t)$, is given by

$$\hat{e}(t) = [1-\alpha(t)]\,\hat{u}_n\big(\phi(t)\big) + \alpha(t)\,\hat{u}_{n+1}\big(\phi(t)\big) \tag{6}$$

where $\hat{u}_n$ and $\hat{u}_{n+1}$ are the reconstructed CW at the
interval beginning and ending, respectively, and $\alpha(t)$ is some
increasing interpolation function in the range $0 \le \alpha(t) \le 1$.
The quantized LPC coefficients are interpolated, and are then
used for the synthesis filter. Finally, the reconstructed speech is
obtained by passing the reconstructed residual through the synthesis
filter. For low rate coding, it is beneficial to use a formant
adaptive postfilter [28]. In WI coding the postfilter enhances the
quantized speech quality by reducing the audibility of the nonperiodic
speech component around the formants. This component
is mostly due to the REW, which is still somewhat related to
the SEW and may not always be regarded as independent noise.

Fig. 3. Block diagram of the AbS SEW vector quantization.
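
The continuous waveform interpolation performed by the decoder can be sketched as a cross-fade between the Fourier series of the two CWs that bound an interpolation interval, both evaluated along the phase contour. The linear cross-fade, the harmonic indexing from 1, and the function names below are assumptions for illustration; the paper only requires an increasing interpolation function.

```python
import cmath
import math

def fourier_series(fcs, phi):
    """Real waveform value at phase phi from the FCs of harmonics 1..K."""
    return sum((c * cmath.exp(1j * (k + 1) * phi)).real
               for k, c in enumerate(fcs))

def interpolate_excitation(cw_start, cw_end, phases):
    """Cross-fade from cw_start to cw_end along the given phase contour."""
    n = len(phases)
    out = []
    for i, phi in enumerate(phases):
        alpha = i / n  # assumed linear interpolation function in [0, 1)
        out.append((1.0 - alpha) * fourier_series(cw_start, phi)
                   + alpha * fourier_series(cw_end, phi))
    return out

# Identical bounding CWs with a single harmonic reproduce a cosine.
samples = interpolate_excitation([1 + 0j], [1 + 0j],
                                 [0.0, math.pi / 2, math.pi])
```

Because the phase contour, not the sample index, drives the Fourier series, pitch changes within the interval stretch or compress the synthesized cycles smoothly.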
`
Many speech coding schemes use voiced/unvoiced classification
with separate coding of each type of sound. Such schemes
may suffer severe quality loss whenever a classification error is
made, which causes the coder to apply a coding method that is
inappropriate to the coded speech sound. One of the important
advantages of the WI coding system is that it is universally applied
to all speech sounds, and is therefore more robust than
classification-based coding schemes.
`
`III. SEW OPTIMIZATION
`
Most WI coders [10]–[18] use nonideal low-pass filters for
downsampling and upsampling of the SEW. These filters introduce
aliasing and mirroring distortion, even when no quantization
is applied. We propose, instead, a novel AbS SEW quantization
scheme, illustrated in Fig. 3, which takes the nonideal
interpolation filters into consideration and optimizes the SEW
accordingly. However, some aliasing may already exist (due to
nonideal anti-aliasing filters), and this will not be eliminated by
the AbS quantization scheme. The input speech is analyzed and
LPC parameters are extracted, quantized and interpolated, and
an LPC whitening filter is obtained. Then the speech is passed
through the resulting whitening filter to produce the residual
signal. In each frame $M$ SEWs are extracted from the residual
with $L$ look-ahead waveforms. Each waveform is represented
by a vector of FCs $\mathbf{r}_m$. The local decoder at the encoder
reconstructs $M$ SEWs, $\tilde{\mathbf{r}}_m$, by interpolating between the
quantized SEW at the previous frame, $\hat{\mathbf{r}}_0$, to the current frame
quantized SEW, $\hat{\mathbf{r}}_M$. The interpolated SEW vectors are given by

$$\tilde{\mathbf{r}}_m = [1-\alpha(t_m)]\,\hat{\mathbf{r}}_0 + \alpha(t_m)\,\hat{\mathbf{r}}_M, \qquad m = 1, \ldots, M \tag{7}$$

where $\alpha(t_m)$ is the interpolation function evaluated at the time
of the $m$th waveform.

Fig. 4. Example for the improved interpolation by SEW optimization during
nonstationary speech segment.

Assuming $\hat{\mathbf{r}}_0$ and the LPC coefficients are given, the encoder’s
task is to find the quantized vector $\hat{\mathbf{r}}_M$ such that the accumulated
weighted distortion between original and reconstructed
waveform sequences, denoted by $D_w$, is minimized. Since the
effect of the linear interpolation LPF is taken into account in
the proposed scheme, a true interpolated waveform (synthesis)
is incorporated in the analysis process, unlike the conventional
open-loop WI coders [10]–[18] in which only one waveform
is used for the quantization. Consider the accumulated
weighted distortion between the input SEW FC vectors
and the quantized and interpolated vectors, given by

(8)

The quantities appearing in (8) are the number of waveforms per
frame, the number of look-ahead waveforms, and a diagonal
matrix whose elements are the spectral values of the combined
spectral-weighting and synthesis filters at each harmonic, given by

(9)

The quantities appearing in (9) are the pitch period, the number of
harmonics, the gain, and the input and the quantized LPC
polynomials, respectively.
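
The closed-loop idea of this section can be illustrated with a small sketch. It assumes that the reconstructed waveforms are a linear interpolation between the previous quantized SEW and the candidate codevector, and that the distortion is a per-harmonic weighted squared error summed over the frame. Look-ahead waveforms are left out, and all names, the interpolation weights, and the toy codebook are hypothetical.

```python
def accumulated_distortion(targets, weights, alphas, prev_q, cand):
    """Weighted squared error between targets and interpolated waveforms."""
    d = 0.0
    for r_m, w_m, a in zip(targets, weights, alphas):
        for k in range(len(cand)):
            interp = (1.0 - a) * prev_q[k] + a * cand[k]
            d += w_m[k] * abs(r_m[k] - interp) ** 2
    return d

def abs_sew_search(targets, weights, alphas, prev_q, codebook):
    """Index of the codevector minimizing the accumulated distortion."""
    return min(range(len(codebook)),
               key=lambda i: accumulated_distortion(
                   targets, weights, alphas, prev_q, codebook[i]))

# Toy frame: two waveforms, one harmonic, onset early in the frame.
prev_q = [0j]
alphas = [0.5, 1.0]
codebook = [[1 + 0j], [2 + 0j], [3 + 0j]]
targets = [[2 + 0j], [2 + 0j]]
best = abs_sew_search(targets, [[4.0], [1.0]], alphas, prev_q, codebook)
```

In this toy frame an open-loop quantizer that only looks at the last waveform would pick the codevector equal to 2, whereas the closed-loop search picks 3 because it also accounts for the heavily weighted mid-frame waveform after interpolation.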
The spectral weighting parameters satisfy $0 \le \gamma_2 < \gamma_1 \le 1$. It
can be shown that the accumulated distortion in (8) is equal to
the sum of two components, a modeling distortion and a quantization
distortion

(10)

where the quantization distortion is given by

(11)

where the optimal vector (which minimizes the modeling distortion)
is given by

(12)

and the respective weighting matrix is given by

(13)

Therefore, VQ with the accumulated distortion of (8) can be
simplified by using the distortion of (11), and

(14)

An improved match between reconstructed and original SEW
is obtained, most notably in the transitions. Fig. 4 illustrates
the improved waveform matching obtained for a nonstationary
speech segment by interpolating the optimized SEW.

IV. DISPERSION PHASE QUANTIZATION

The dispersion-phase quantization scheme [21]–[23] is illustrated
in Fig. 5. A pitch cycle that is extracted from the SEW is
applied as an input to the system, and is cyclically shifted so that
its pulse is located at position zero. Its FC vector is the input to
the quantizer. After quantization, the components of the quantized
magnitude vector are multiplied by the exponential of the quantized
phases to yield the quantized waveform FC vector,
which is subtracted from the input FC vector to produce the
error FC vector. The error FC vector is then transformed to the
perceptually-weighted frequency domain by weighting it by the
combined synthesis and weighting filter. The encoder
searches for the phase that minimizes the energy of the
perceptually weighted error, allowing a fine tuning of the cyclic
shift of the input waveform during the search, to eliminate any
residual phase shift between the input waveform and the quantized
waveform. Phase dispersion quantization aims to improve
waveform matching. Efficient AbS quantization can be obtained
by using the perceptually weighted distortion

(15)

between the weighted input SEW prototype and the quantized
and weighted SEW prototype. It can be shown
[21] that the above distortion is equivalent to

(16)

The magnitude is perceptually more significant than the phase
[26] and should therefore be quantized first. Furthermore, if the
phase were quantized first, the very limited bit allocation available
for the phase would lead to an excessively degraded spectral
matching of the magnitude in favor of a somewhat improved,
but less important, matching of the waveform. For this distortion
measure, the quantized phase vector is given by [21]–[23]

(17)

in terms of the running phase codebook index, the respective
diagonal phase exponent matrix, and the quantized magnitude vector.
The AbS search for phase quantization is based on evaluating
(17) for each candidate phase codevector. Since only trigonometric
functions of the phase candidates are used (via complex
exponentials), only phase values modulo $2\pi$ are relevant, and
therefore phase unwrapping is avoided. The EWI coder uses the
optimized SEW and the optimized weighting
for the AbS phase quantization.
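
The AbS phase search can be sketched as an exhaustive loop over candidate phase codevectors, each combined with the already quantized magnitudes and compared with the input FC vector under a per-harmonic weighting. This is a simplified illustration: the fine tuning of the cyclic shift is omitted, and the function name, the real-valued weights, and the toy codebook are assumptions.

```python
import cmath
import math

def abs_phase_search(input_fcs, q_mags, weights, phase_codebook):
    """Return (index, distortion) of the best phase codevector."""
    best_i, best_d = -1, math.inf
    for i, phases in enumerate(phase_codebook):
        d = 0.0
        for r, mag, w, ph in zip(input_fcs, q_mags, weights, phases):
            # Quantized FC = quantized magnitude times exp(j * phase).
            d += w * abs(r - mag * cmath.exp(1j * ph)) ** 2
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

# Toy example: two harmonics, three candidate phase vectors.
fcs = [cmath.exp(0.3j), 0.5 * cmath.exp(-1.0j)]
book = [[0.0, 0.0], [0.3, -1.0], [1.0, 1.0]]
index, dist = abs_phase_search(fcs, [1.0, 0.5], [1.0, 1.0], book)
```

Since the phases enter only through complex exponentials, adding any multiple of 2π to a codevector leaves the distortion unchanged, which is why no phase unwrapping is needed.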
`
`A. Phase Centroid Equations
We will now describe the training of the phase codebook.
Suppose $T$ is the set of SEW
training vectors used for the design of the phase VQ, where
$|T|$ is the cardinality of the set $T$, that is, the number of elements in
$T$. The average global distortion measure for the quantization
of the training set is

(18)

expressed in terms of the $k$th FC of the $n$th input SEW and of the
corresponding quantized SEW. The $j$th optimal partition cell satisfies

(19)

for all codebook indices. For a given partition, the centroid
equation [29] of the $k$th coefficient’s phase, for the $j$th cluster,
which minimizes the global distortion (18), is given by

(20)

Fig. 5. Block diagram of the AbS dispersion phase vector quantization.
`
`Fig. 6. Segmental weighted SNR of the phase VQ versus the number of bits,
`for M-IRS and for nonfiltered (flat) speech.
`
`B. Variable Dimension Vector Quantization
`The phase vector’s dimension depends on the pitch period
`and, therefore, a variable dimension VQ has been implemented.
In our WI coder, the span of possible pitch period values was divided into
several ranges, and for each range of pitch period a codebook
was designed such that all vectors of dimension smaller than the
largest pitch period in that range are zero padded beyond their
highest element. Pitch changes over time cause the quantizer
`to switch among the pitch-range selected codebooks. In order
`to achieve smooth phase variations whenever such a switch oc-
`curs, overlapped training clusters were used and similar initial
`conditions were selected for each codebook. This design method
`does not guarantee smoothness, i.e., for a slight change in pitch
`that causes a switch in codebooks, the quantized vector could
`change substantially. However, significant quality improvement
`was obtained with the procedure. We believe such smoothness
`may be guaranteed by including some heuristic rules in the en-
`coding process.
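
The pitch-range codebook selection and zero padding described above can be sketched as follows. The range boundaries, the function names, and the choice to pass the codebook dimension explicitly are illustrative assumptions; the paper does not list the actual ranges.

```python
import bisect

def select_codebook(pitch_period, range_upper_bounds):
    """Index of the first pitch range whose upper bound covers the pitch."""
    idx = bisect.bisect_left(range_upper_bounds, pitch_period)
    if idx == len(range_upper_bounds):
        raise ValueError("pitch period above the largest supported range")
    return idx

def zero_pad(vector, dimension):
    """Pad a phase vector with zeros up to the codebook dimension."""
    if len(vector) > dimension:
        raise ValueError("vector longer than the codebook dimension")
    return list(vector) + [0.0] * (dimension - len(vector))

# Hypothetical ranges: (0, 40], (40, 80], (80, 120] samples.
bounds = [40, 80, 120]
book_index = select_codebook(41, bounds)
padded = zero_pad([0.1, 0.2], 4)
```

A pitch value just above a boundary selects the next codebook, which is exactly the situation in which the smoothness concern discussed in this subsection arises.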
`
`C. Objective Results
`The segmental weighted signal-to-noise ratio (SNR) of the
`phase quantizer is illustrated in Fig. 6. The segmental SNR was
`calculated by averaging the SNR of the extracted waveforms.
`For each waveform, the SNR was computed using the quan-
`tized phase and nonquantized magnitude. The proposed system
`achieves approximately 14 dB SNR for as few as six bits for
`nonfiltered speech, and nearly 10 dB for modified intermediate
`reference system (M-IRS) [35] filtered speech.
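
The segmental SNR measure can be sketched as the average of per-waveform SNR values in dB. The sketch assumes that the waveforms passed in have already been perceptually weighted and that no waveform is reproduced exactly (which would give zero noise energy); the function names are hypothetical.

```python
import math

def snr_db(reference, test):
    """SNR in dB of one waveform (noise energy is assumed to be nonzero)."""
    signal = sum(abs(x) ** 2 for x in reference)
    noise = sum(abs(x - y) ** 2 for x, y in zip(reference, test))
    return 10.0 * math.log10(signal / noise)

def segmental_snr_db(reference_waveforms, test_waveforms):
    """Average of the per-waveform SNR values, in dB."""
    values = [snr_db(r, t)
              for r, t in zip(reference_waveforms, test_waveforms)]
    return sum(values) / len(values)

# One waveform at 20 dB and one at 0 dB average to 10 dB.
seg = segmental_snr_db([[1.0, 0.0], [1.0, 0.0]],
                       [[0.9, 0.0], [0.0, 0.0]])
```

Averaging in the dB domain gives low-energy waveforms the same influence as high-energy ones, unlike a single SNR computed over the whole utterance.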
`
`Fig. 7. Results of subjective A/B test for comparison between the four-bit
`phase VQ, and male extracted fixed phase.
`
`D. Subjective Results
`Recent WI coders have used a fixed dispersion phase ex-
`tracted from male speakers [14], [19]. We have conducted a sub-
`jective A/B test to compare our dispersion phase VQ, using only
four bits, to a male-extracted dispersion phase. The test data included
16 M-IRS speech sentences, eight of which are of female
speakers, and eight of male speakers. During the test, all pairs
of files were played twice in alternating order, and the listeners
could vote for either of the systems, or for no preference. The
speech material was synthesized using our WI system in which
only the dispersion phase was quantized every 20 ms. Twenty-one
listeners participated in the test. The test results, illustrated
in Fig. 7, show improvement in speech quality by quantizing the
phase with a four-bit VQ. The improvement is larger for female
speakers than for male. This may be due to the fact that for female
speech there is a larger number of bits per vector sample,
resulting in better waveform matching which is more perceivable
particularly during transitions.
`The codebook design for the dispersion-phase quantization
`involves a tradeoff between robustness in terms of smooth phase
`variations and waveform matching. A locally optimized code-
`book for each pitch value may improve the waveform matching
`on the average, but will occasionally yield abrupt and excessive
`changes that can cause temporal artifacts.
`
`V. PARAMETRIC REW QUANTIZATION
`
`Efficient REW quantization can benefit from two observa-
`tions [25]: 1) the REW magnitude is typically an increasing
`function of frequency, which suggests that an efficient para-
`metric representation may be used and 2) one can observe
`similarity between successive REW magnitude spectra, which
`suggests that employing predictive VQ on a group of adjacent
`REWs may yield useful coding gains. The next four sections
`introduce the REW parametric representation and the associ-
`ated VQ technique.
`
`A. REW Parameterization
`Direct quantization of the REW magnitude is a variable
`dimension quantization problem, which may result in spending
`bits and computational effort on perceptually irrelevant in-
`formation. A simple and practical way to obtain a reduced,
`and fixed, dimension representation of the REW is with a
`linear combination of basis functions, such as orthonormal
`polynomials [18]–[20]. Such a representation usually produces
`a smoother REW magnitude, and improves the perceptual
quality. Suppose the REW magnitude, $R(\omega)$, is represented by
a linear combination of orthonormal functions, $\psi_i(\omega)$,

$$R(\omega) = \sum_{i=0}^{I-1} \gamma_i\,\psi_i(\omega), \qquad 0 \le \omega \le \pi \tag{21}$$

where $\omega$ is the angular frequency, and $I$ is the representation
order. The REW magnitude is typically an increasing function of
frequency, which can be coarsely quantized with a small number
of bits per waveform without significant perceptual degradation.
Therefore, it may be advantageous to represent the REW
magnitude in a simple, but perceptually relevant manner. Consequently
we model the REW by the following parametric representation,

$$\hat{R}(\omega, \xi) = \sum_{i=0}^{I-1} \hat{\gamma}_i(\xi)\,\psi_i(\omega), \qquad 0 \le \xi \le 1 \tag{22}$$

where $\hat{\boldsymbol{\gamma}}(\xi) = [\hat{\gamma}_0(\xi), \ldots, \hat{\gamma}_{I-1}(\xi)]^T$ is a parametric vector
of coefficients within the representation model subspace, and
$\xi$ is the “unvoicing” parameter which is zero for a fully voiced
spectrum, and one for a fully unvoiced spectrum. Thus, $\hat{R}(\omega, \xi)$
defines a 2-D surface whose cross sections for each value of $\xi$
give a particular REW magnitude spectrum, which is defined
merely by specifying a scalar parameter value.
`
Fig. 8. REW parametric representation $\hat{R}(\omega, \xi)$.
`
`B. Piecewise Linear REW Representation
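In order to have a simple representation that is computationally
efficient and avoids excessive memory requirements, we
model the 2-D surface by a piecewise linear parametric representation.
Therefore, we introduce a set of $N$ uniformly spaced
spectra, $\{\hat{R}(\omega, \xi_n)\}_{n=0}^{N-1}$, as shown in Fig. 8. (Such a set of
functions is similar to the hand-tuned REW codebook in [19]
and [20].) Then the parametric surface is defined by linear interpolation
according to

$$\hat{R}(\omega,\xi) = (1-\alpha)\,\hat{R}(\omega,\xi_{n-1}) + \alpha\,\hat{R}(\omega,\xi_n), \qquad \xi_{n-1} \le \xi \le \xi_n, \quad \alpha = \frac{\xi - \xi_{n-1}}{\xi_n - \xi_{n-1}}. \tag{23}$$

Because this representation is linear, the coefficients of $\hat{R}(\omega,\xi)$
are linear combinations of the coefficients of $\hat{R}(\omega,\xi_{n-1})$ and
$\hat{R}(\omega,\xi_n)$. Hence,

$$\hat{\boldsymbol{\gamma}}(\xi) = (1-\alpha)\,\hat{\boldsymbol{\gamma}}_{n-1} + \alpha\,\hat{\boldsymbol{\gamma}}_n \tag{24}$$

where $\hat{\boldsymbol{\gamma}}_n$ is the coefficient vector of the $n$th REW magnitude
representation

$$\hat{R}(\omega,\xi_n) = \sum_{i=0}^{I-1} \hat{\gamma}_{n,i}\,\psi_i(\omega). \tag{25}$$
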
`
`C. REW Modeling
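1) Nonweighted Distortion: Suppose for a REW magnitude,
$R(\omega)$, represented by some coefficient vector, $\boldsymbol{\gamma}$, we search for
the parameter value, $\xi(\boldsymbol{\gamma})$, in $0 \le \xi \le 1$, whose respective
representation vector, $\hat{\boldsymbol{\gamma}}(\xi)$, minimizes the mean squared error
(MSE) distortion between the two spectra

$$D\big(R, \hat{R}(\xi)\big) = \int_0^{\pi} \big|R(\omega) - \hat{R}(\omega,\xi)\big|^2\,d\omega. \tag{26}$$

From orthonormality, the distortion is equal to

$$D\big(R, \hat{R}(\xi)\big) = \big\|\boldsymbol{\gamma} - \hat{\boldsymbol{\gamma}}(\xi)\big\|^2. \tag{27}$$

The optimal interpolation factor that minimizes the MSE is

$$\alpha_{\mathrm{opt}} = \frac{(\hat{\boldsymbol{\gamma}}_n - \hat{\boldsymbol{\gamma}}_{n-1})^T (\boldsymbol{\gamma} - \hat{\boldsymbol{\gamma}}_{n-1})}{\|\hat{\boldsymbol{\gamma}}_n - \hat{\boldsymbol{\gamma}}_{n-1}\|^2} \tag{28}$$

where $\hat{\boldsymbol{\gamma}}_{n-1}$ and $\hat{\boldsymbol{\gamma}}_n$ are the coefficient vectors of the two
representation spectra, at $\xi_{n-1}$ and $\xi_n$, that bracket $\xi$,
and the respective optimal parameter value, which is a continuous
variable between zero and one, is given by

$$\xi(\boldsymbol{\gamma}) = \xi_{n-1} + \alpha_{\mathrm{opt}}\,(\xi_n - \xi_{n-1}). \tag{29}$$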
`
`This result allows a rapid search for the best unvoicing param-
`eter value needed to transform the coefficient vector to a scalar
`parameter, for encoding or for VQ design.
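
The rapid search for the unvoicing parameter can be sketched as a projection of the input coefficient vector onto each linear segment between adjacent representation vectors, keeping the segment with the smallest squared error. The sketch assumes representation spectra uniformly spaced on [0, 1], real-valued coefficients, and an interpolation factor clamped to [0, 1]; the function name and the toy vectors are hypothetical.

```python
def best_unvoicing_parameter(gamma, rep_vectors):
    """Return (xi, distortion) for the closest point on the piecewise
    linear path through rep_vectors, where rep_vectors[n] is assumed to
    sit at xi_n = n / (N - 1)."""
    n_seg = len(rep_vectors) - 1
    best_xi, best_d = 0.0, float("inf")
    for n in range(n_seg):
        a, b = rep_vectors[n], rep_vectors[n + 1]
        diff = [y - x for x, y in zip(a, b)]
        denom = sum(d * d for d in diff)
        # Least-squares interpolation factor on this segment.
        alpha = 0.0 if denom == 0 else sum(
            (g - x) * d for g, x, d in zip(gamma, a, diff)) / denom
        alpha = min(1.0, max(0.0, alpha))
        dist = sum((g - (x + alpha * d)) ** 2
                   for g, x, d in zip(gamma, a, diff))
        if dist < best_d:
            best_xi, best_d = (n + alpha) / n_seg, dist
    return best_xi, best_d

# Three representation vectors; the input lies midway on the second segment.
reps = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
xi, err = best_unvoicing_parameter([1.0, 0.5], reps)
```

Each segment costs only a pair of inner products, so the mapping from a coefficient vector to a scalar parameter stays cheap even with many representation spectra.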
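2) Weighted Distortion: Commonly in speech coding, the
magnitude is quantized using a weighted distortion measure.
In this case, the weighted distortion between the input and the
parametric representation modeled spectra is equal to

$$D_w\big(R, \hat{R}(\xi)\big) = \int_0^{\pi} w(\omega)\,\big|R(\omega) - \hat{R}(\omega,\xi)\big|^2\,d\omega = \big[\boldsymbol{\gamma} - \hat{\boldsymbol{\gamma}}(\xi)\big]^T \boldsymbol{\Psi}\,\big[\boldsymbol{\gamma} - \hat{\boldsymbol{\gamma}}(\xi)\big] \tag{30}$$

where $\boldsymbol{\gamma}$ is the input coefficient vector, $\hat{\boldsymbol{\gamma}}(\xi)$ is the modeled
parametric coefficient vector, and $\boldsymbol{\Psi}$ is the weighted correlation
matrix of the orthonormal functions; its elements are

$$\Psi_{ij} = \int_0^{\pi} w(\omega)\,\psi_i(\omega)\,\psi_j(\omega)\,d\omega. \tag{31}$$

The optimal parameter that minimizes (30) is given by

$$\alpha_{\mathrm{opt}} = \frac{(\hat{\boldsymbol{\gamma}}_n - \hat{\boldsymbol{\gamma}}_{n-1})^T\,\boldsymbol{\Psi}\,(\boldsymbol{\gamma} - \hat{\boldsymbol{\gamma}}_{n-1})}{(\hat{\boldsymbol{\gamma}}_n - \hat{\boldsymbol{\gamma}}_{n-1})^T\,\boldsymbol{\Psi}\,(\hat{\boldsymbol{\gamma}}_n - \hat{\boldsymbol{\gamma}}_{n-1})} \tag{32}$$

and the respective optimal parameter value is computed using
(29). Alternatively, in order to eliminate using the matrix $\boldsymbol{\Psi}$,
and to benefit from the orthonormal function simplification,
given in (27), the scalar product may be redefined to incorporate
the time-varying spectral weighting. The respective orthonormal
basis functions then satisfy

$$\int_0^{\pi} w(\omega, t)\,\psi_i(\omega, t)\,\psi_j(\omega, t)\,d\omega = \delta(i - j) \tag{33}$$

where $\delta(i-j)$ denotes the Kronecker delta. The respective parameter
vector is given by

$$\boldsymbol{\gamma}(t) = \int_0^{\pi} w(\omega, t)\,R(\omega)\,\boldsymbol{\psi}(\omega, t)\,d\omega \tag{34}$$

where $\boldsymbol{\psi}(\omega,t) = [\psi_0(\omega,t), \ldots, \psi_{I-1}(\omega,t)]^T$ is an $I$-dimensional
vector of time-varying orthonormal functions.

Fig. 9. REW parametric representation AbS VQ.

Fig. 10. REW parametric representation AbS VQ.

Fig. 11. REW parametric representation simplified weighted AbS VQ.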
`
`D. REW Quantization
1) Full Complexity Spectral Quantization Scheme: A novel
AbS REW parameter VQ paradigm is illustrated in Fig. 9. An
excitation vector is selected from
the VQ codebook and is fed through a synthesis filter to obtain
a (synthesized quantized) parameter vector, which is
then mapped to a sequence of quantized representation coefficient
vectors. These are compared with the sequence of input representation
coefficient vectors, and each error is spectrally weighted.
Each spectrally weighted error is then temporally weighted, and
a distortion measure is obtained. A search through all candidate
excitation vectors determines an optimal choice. The synthesis
filter in Fig. 9 can be viewed as a first order predictor in a feedback
loop. By allowing the value of the predictor parameter to
change, it becomes a “switched-predictor” scheme. Switched-prediction
is introduced to allow for different levels of REW
parameter correlation. The scheme incorporates both spectral
weighting and temporal weighting. The spectral weighting is
used for the distortion between each pair of input and quantized
spectra. In order to improve SEW/REW mixing, particularly in
mixed voiced and unvoiced speech segments, and to increase
speech crispness, especially for plosives and onsets, temporal
weighting is incorporated in the AbS REW VQ. The temporal
weighting is a monotonic function of the temporal gain. Two
codebooks are used, one corresponding to each of two predictor
coefficients. The quantization target is a vector of REW spectra,
one per REW in the frame. Each REW spectrum is represented
by a vector of basis function coefficients.
The search for the minimal weighted mean squared error
(WMSE) is performed over all the vectors of the two
codebooks. The quantized REW function coefficient
vector is a function of the quantized parameter,
which is obtained by passing the quantized vector
through the synthesis filter using the predictor coefficient
associated with the selected codebook. The weighted distortion
between each pair of input
and quantized REW spectra is calculated. The total distortion is
a temporally-weighted sum of the spectrally weighted distortions.
Since the predictor coefficients are known, direct VQ
can be used to simplify the computations. For a piecewise linear
parametric REW representation, a substantial simplification of
the search computations may be obtained by interpolating the
distortion between the representation spectra set.
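
The switched-predictive AbS search can be sketched with a first-order synthesis filter and one codebook per predictor coefficient. For brevity the error here is measured directly on the scalar REW parameters with temporal weights, rather than on spectrally weighted spectra; the function names, predictor values, and toy codebooks are hypothetical.

```python
def synthesize(codevector, predictor, prev):
    """First-order predictive synthesis: x[m] = predictor * x[m-1] + c[m]."""
    out, state = [], prev
    for c in codevector:
        state = predictor * state + c
        out.append(state)
    return out

def switched_predictive_search(target, temporal_w, prev,
                               predictors, codebooks):
    """Return (distortion, predictor index, codevector index, quantized)."""
    best = None
    for s, (p, book) in enumerate(zip(predictors, codebooks)):
        for i, cv in enumerate(book):
            q = synthesize(cv, p, prev)
            d = sum(w * (t - x) ** 2
                    for w, t, x in zip(temporal_w, target, q))
            if best is None or d < best[0]:
                best = (d, s, i, q)
    return best

# Two predictors (0.0 and 0.5), each with its own small codebook.
result = switched_predictive_search(
    target=[0.5, 0.75], temporal_w=[1.0, 1.0], prev=1.0,
    predictors=[0.0, 0.5],
    codebooks=[[[0.1, 0.1]], [[0.0, 0.5], [0.2, 0.2]]])
```

The decoder would need both the predictor choice and the codevector index to repeat the synthesis, and larger temporal weights would bias the search toward matching the high-gain waveforms in a frame.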
2) Simplified Parametric Quantization Scheme: The above
scheme maps each quantized parameter to a coefficient vector,
which is used to compute the spectral distortion. To reduce complexity,
such a mapping, and spectral distortion computation
may be eliminated by using the simplified scheme described
below. For a high rate, and a smooth representation surface,
the total distortion is equal to the sum of a modeling distortion
and a quantization distortion. The quantization distortion is
related to the quantized parameter by

(35)

The quantization distortion is linearly related to the squared
quantization error of the REW parameter, which
therefore justifies direct VQ of the REW parameter.
a) Simplified scheme, nonweighted distortion: The
encoder maps the REW magnitude to an unvoicing parameter,
and then quantizes the parameter by AbS VQ, as illustrated
in Fig. 10. Initially, the magnitudes of the REWs in the
frame are mapped to coefficient vectors. Then,
for each coefficient vector, a search is performed to find the
optimal representation parameter, using (29), to form
a parameter vector for the current frame, with one element per REW.
Finally, the parameter vector is encoded
by AbS VQ. The decoded spectra are
obtained from the quantized parameter vector
using (23). This scheme allows for higher temporal as well as
spectral REW resolution, since no downsampling is performed,
`and
