`Enhanced Waveform Interpolative Coding
`at Low Bit-Rate
`
`Oded Gottesman, Member, IEEE, and Allen Gersho, Fellow, IEEE
`
`Abstract—This paper presents a high quality enhanced wave-
`form interpolative (EWI) speech coder at low bit-rate. The system
`incorporates novel features such as optimization of the slowly
`evolving waveform (SEW) for interpolation, analysis-by-synthesis
`(AbS) vector quantization (VQ) of the SEW dispersion phase,
`dual-predictive AbS quantization of the SEW, efficient parameter-
ization of the rapidly-evolving waveform (REW) magnitude and
`VQ of the REW parameter, a special pitch search for transitions,
`and switched-predictive analysis-by-synthesis gain VQ. Subjective
`tests indicate that the 2.8 kb/s EWI coder’s quality exceeds that
`of G.723.1 at 5.3 kb/s, and it is slightly better than that of G.723.1
`at 6.3 kb/s.
`Index Terms—Analysis-by-synthesis, phase dispersion, speech
`coding,
`speech compression, vector quantization, waveform
`interpolation, waveform interpolative coding.
`
`I. INTRODUCTION
`
IN RECENT years, there has been increasing interest in
achieving toll-quality speech coding at rates of 4 kb/s
`and below. Currently, there is an ongoing 4 kb/s standardiza-
`tion effort conducted by the ITU-T. The expanding variety
`of emerging applications for speech coding, such as third
`generation wireless networks and Low Earth Orbit (LEO)
`systems, is motivating increased research efforts. The speech
`quality produced by waveform coders such as code-excited
`linear prediction (CELP) coders [1] degrades rapidly at
`rates below 5 kb/s. On the other hand, parametric coders
`such as the waveform-interpolative (WI) coder [8]–[20], the
`sinusoidal-transform coder (STC) [2], the multiband-excita-
`tion (MBE) coder [3], the mixed-excitation linear predictive
`(MELP) vocoder [4], [5], and the harmonic-stochastic excita-
`tion (HSX) coder [6] produce good quality at low rates, but
`they do not achieve toll quality. This is largely due to the lack of
`robustness of speech parameter estimation, which is commonly
performed in open-loop, and to inadequate modeling of nonstationary speech segments.
`
`Manuscript received April 25, 2000; revised June 19, 2001. This work
`was supported in part by the National Science Foundation under Grant
`MIP-9707764, the University of California MICRO Program, Cisco Systems,
`Inc., Conexant Systems, Inc., Dialogic Corp., Fujitsu Laboratories of America,
`Inc., General Electric Co., Hughes Network Systems, Lernout & Hauspie
`Speech Products, Lockheed Martin, Lucent Technologies, Inc., Panasonic
`Speech Technology Laboratory, Qualcomm, Inc., and Texas Instruments, Inc.
`The associate editor coordinating the review of this manuscript and approving
`it for publication was Dr. Peter Kroon.
`A. Gersho is with the Department of Electrical and Computer Engi-
`neering, University of California, Santa Barbara, CA 93106 USA (e-mail:
`gersho@ece.ucsb.edu; http://scl.ece.ucsb.edu).
`O. Gottesman is with the Department of Electrical and Computer Engi-
`neering, University of California, Santa Barbara, CA 93106 USA and also with
`Compandent, Inc., Goleta, CA 93117 USA (e-mail: gottesman@ece.ucsb.edu;
`oded@gottesmans.com; http://www.compandent.com).
`Publisher Item Identifier S 1063-6676(01)08235-9.
`
In this work we propose a paradigm
for WI coding that incorporates analysis-by-synthesis (AbS)
for parameter estimation, offers higher temporal and spectral
resolution for the rapidly-evolving waveform (REW), and provides more
efficient quantization of the slowly-evolving waveform (SEW).
`The WI coders [13]–[20] use nonideal low-pass filters for
`downsampling and upsampling of the SEW. We describe a novel
`AbS SEW quantization scheme, which takes the nonideal filters
`into consideration. An improved match between reconstructed
`and original SEW spectra is obtained, most notably in transition
`segments of speech.
`Commonly in WI coding, the similarity between successive
`REW magnitudes is exploited by downsampling and interpola-
`tion and by bit allocation that constrains similarity [13]. In our
`previous enhanced waveform-interpolative (EWI) coder [22],
`[23], the REW magnitude was quantized on a waveform by
`waveform basis, and with an excessive number of bits—more
`than is perceptually required. Here we propose a novel para-
`metric representation of the REW magnitude and an efficient
`paradigm for AbS predictive vector quantization of the REW
`parameter sequence. The new method achieves a substantial re-
`duction in the REW bit-rate.
`In low bit-rate WI coding, the relation between the SEW and
`the REW magnitudes was exploited by computing the magni-
`tude of one as the unity complement of the other [14], [17]–[20].
Also, since the sequence of SEW spectra evolves slowly, suc-
`cessive SEWs exhibit similarity, offering opportunities for re-
`dundancy removal. Additional forms of redundancy that may
`be exploited for coding efficiency are 1) for a fixed SEW/REW
`decomposition filter, the mean SEW magnitude increases with
the pitch period and 2) the similarity between successive SEWs
`also increases with the pitch period. These phenomena are due
`to the fact that, for uniformly extracted waveforms, the overlap
`between successive waveforms increases with the pitch period.
`In this work, we introduce a novel “dual-predictive” AbS para-
`digm for quantizing the SEW magnitude that optimally exploits
`the information about the current quantized REW, the past quan-
`tized SEW, and the pitch, in order to estimate the current SEW.
`In parametric coders the phase information is commonly not
`transmitted, and this is for two reasons: first, the phase is of sec-
`ondary perceptual significance; and second, no efficient phase
`quantization scheme is known. WI coders [8]–[20] typically use
`a fixed phase vector for the SEW, for example, in [14], [19],
a fixed phase extracted from a male speaker was used. On the other
`hand, waveform coders such as CELP [1], by directly quan-
`tizing the waveform, implicitly allocate an excessive number
`of bits to the phase information—more than is perceptually re-
`quired. In the past [31]–[34], phase modeling and quantization
`
`
`was investigated. In [32] a random phase codebook was used
`at a relatively high number of phase quantization bits. In [33],
`[34], a noncausal all-pole filter’s phase model was discussed,
`but quantization was not optimized. We have observed that such
`a model is quite inadequate in matching the physiological exci-
`tation’s phase, although occasionally it does provide a reason-
`able match. In addition, none of the above methods have in-
`corporated perceptual weighting. Recently [21], we proposed a
`novel, efficient AbS VQ encoding of the dispersion phase of the
`excitation signal to enhance the performance of the WI coder at
`a low bit-rate, which can be used for parametric coders as well
`as for waveform coders. The EWI coder presented here employs
`this scheme, which incorporates perceptual weighting and does
`not require any phase unwrapping.
`Pitch accuracy is crucial for high quality reproduced speech
`in WI coders. We introduce a novel pitch search technique based
`on varying segment boundaries; it allows for locking onto the
`most probable pitch period during transitions or other segments
`with rapidly varying pitch.
`Commonly in speech coding the gain sequence is downsam-
`pled and interpolated. As a result it is often smeared during plo-
`sives and onsets. In the past, this problem was addressed by
`employing a special mechanism that mimicked the gain char-
`acteristics [14]. To alleviate this problem, we propose a novel
`switched-predictive AbS gain VQ scheme based on temporal
`weighting.
`This paper is organized as follows. Section II describes the
`WI coder. In Section III we explain the AbS SEW optimization.
`The dispersion phase quantizer is discussed in Section IV. In
`Section V we explain the REW parameterization, and the cor-
`responding AbS VQ. The dual predictive SEW AbS VQ and its
`performance are discussed in Section VI. Section VII describes
`the pitch search. In Section VIII we present the switched-pre-
`dictive AbS gain VQ. The bit allocation is given in Section IX.
`Subjective results are reported in Section X. Finally, we sum-
`marize our work.
`
`II. DESCRIPTION OF THE WAVEFORM INTERPOLATIVE CODER
`
`A. Introduction to Waveform Interpolation
`
`During voiced speech, which is quasiperiodic, one can ob-
`serve the underlying process of evolving shape of successive
`pitch cycles. A continuously evolving sequence of pitch cycle
`waveforms can be generated from a continuous-time signal, ei-
`ther from the linear prediction residual or from the speech wave-
`form directly. For coding purposes, one may extract a subse-
`quence of these waveforms, and apply quantization to it. At the
`decoder, following inverse quantization, speech synthesis can be
`performed by interpolating missing waveforms. Such a process
`is the essence of waveform interpolative coding [8]–[20].
`Speech segments typically contain both voiced and unvoiced
`attributes. The different perceived character of the voiced and
unvoiced components [27] suggests separating the components and
applying distinct, perceptually based coding to each [12]–[20].
`
`B. Definitions
Given a continuous linear prediction residual (or speech) signal, $r(t)$, and its associated instantaneous pitch period contour, $p(t)$, a characteristic waveform (CW) [8]–[20], $u(t, \phi)$, may be generated by extracting pitch cycles at an infinitely high rate, normalizing their length to $2\pi$, and aligning them sequentially by a cyclical shift. The differential alignment phase shift, $d\phi(t)$, is given by

$$d\phi(t) = \frac{2\pi}{p(t)}\, dt \qquad (1)$$

Therefore, the temporal accumulated phase shift is equal to

$$\phi(t) = \phi(t_0) + \int_{t_0}^{t} \frac{2\pi}{p(\tau)}\, d\tau \qquad (2)$$
`
where $\phi(t_0)$ is the initial phase shift at time $t_0$. The CW is a two-dimensional (2-D) surface which is defined by

$$u(t, \phi) = r\!\left(t + \frac{p(t)}{2\pi}\left[\phi - \psi(t)\right]\right) \qquad (3)$$

where $\psi(t)$ wraps $\phi(t)$ over the range $[0, 2\pi)$, and is defined by

$$\psi(t) = \phi(t) \bmod 2\pi \qquad (4)$$

The CW is a periodic function of the parameter $\phi$, with a period $2\pi$. The residual (or speech) signal may be generated from the CW by calculating its value along the phase shift contour

$$r(t) = u\!\left(t, \phi(t)\right) \qquad (5)$$
`
`The WI coder based on this 2-D function is conceptually similar
`to the pitch synchronous transform coder [7].
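As a concrete illustration of (1), (2), and (4), the following sketch (ours, with illustrative names; a per-sample pitch-period contour given in samples is assumed) accumulates the alignment phase from the pitch contour and wraps it to $[0, 2\pi)$:

```python
import numpy as np

def phase_contour(pitch_periods, phi0=0.0):
    """Accumulate the alignment phase of (2) from a per-sample pitch-period
    contour (given in samples) and wrap it as in (4)."""
    p = np.asarray(pitch_periods, dtype=float)
    dphi = 2.0 * np.pi / p              # (1): phase increment per sample
    phi = phi0 + np.cumsum(dphi)        # (2): accumulated phase shift phi(t)
    psi = np.mod(phi, 2.0 * np.pi)      # (4): psi(t) = phi(t) modulo 2*pi
    return phi, psi

# Example: a pitch gliding from 80 to 60 samples over one second at 8 kHz.
phi, psi = phase_contour(np.linspace(80.0, 60.0, 8000))
```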
`
`C. Waveform Interpolative Coder Description
`The EWI coder is based on the WI coding model [11]–[14]. In
`this model, the CW is decomposed into two components called
`SEW and REW. The SEW, which is computed by low-pass fil-
`tering the 2-D CW surface along the time axis (also known as the
`evolutionary axis), contains most of the voiced speech attribute.
`The SEW is coded at low temporal resolution, high spectral res-
olution, and using a spectrally weighted distortion measure. The
`REW, which is the complementary high-pass component, repre-
`sents primarily the unvoiced speech attribute. The REW is coded
`at high temporal resolution, low spectral resolution, and by ex-
`ploiting spectral and temporal masking.
The EWI encoder is illustrated in Fig. 1. The LPC analysis
and quantization are performed once every 20 ms frame, and interpo-
`lated values are used for each of the ten waveforms in the frame.
`The input speech is then passed through the resulting whitening
`filter to produce the residual signal. A search for the pitch pe-
`riod is performed and the pitch is quantized every 10 ms, and
`is then interpolated. The interpolated pitch values are used for
`pitch cycle waveform extraction, which is performed at a reg-
ular rate (every 2 ms). The rate must be higher than the maximal
`pitch frequency in order to prevent aliasing along the time axis
`[14], [18]. The extracted waveforms are then power normalized,
`and sequentially aligned, to form a discrete-time CW, which is
represented by a Fourier series (FS).
`
`
`
`
`Fig. 1. Block diagram of the EWI encoder.
`
`Fig. 2. Block diagram of the EWI decoder.
`
The Fourier coefficients (FCs) are obtained by pitch-synchronous discrete Fourier trans-
`form (DFT). The frequency domain representation is used in
`order to benefit from appropriate perceptually motivated coding
`paradigms for the magnitude, and the phase. The CW is then
`low-pass filtered along the time axis, to produce the SEW. The
`REW is computed as the complementary high-pass component,
`and is then quantized. The SEW is downsampled, and then quan-
`tized every 20 ms. Finally, a local decoder is used to reconstruct
the speech; the encoder then adjusts the gain to equate the re-
`constructed speech waveform energy to that of the input speech
`waveform, and quantizes the resultant gain.
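The extraction step can be sketched as follows (a simplified illustration under our own assumptions, not the coder's implementation: the pitch contour is given per sample in samples, waveform alignment is omitted, and the helper name extract_cw_fcs is ours). It cuts one pitch-period-long cycle every 2 ms, computes a pitch-synchronous DFT to obtain the harmonic Fourier coefficients, and power-normalizes them:

```python
import numpy as np

def extract_cw_fcs(residual, pitch_samples, fs=8000, step_ms=2.0):
    """Extract one pitch cycle every step_ms and return its power-normalized
    harmonic Fourier coefficients (pitch-synchronous DFT)."""
    pitch_samples = np.asarray(pitch_samples, dtype=float)
    step = int(fs * step_ms / 1000.0)
    last_start = len(residual) - int(np.max(pitch_samples)) - 1
    fcs = []
    for start in range(0, last_start, step):
        p = int(round(pitch_samples[start]))
        cycle = residual[start:start + p]                    # one pitch cycle
        k = np.arange(1, p // 2 + 1)                         # harmonic indices
        n = np.arange(p)
        basis = np.exp(-2j * np.pi * np.outer(k, n) / p)     # pitch-synchronous DFT
        s = basis @ cycle / p                                # FC vector of this waveform
        gain = np.sqrt(np.mean(cycle ** 2) + 1e-12)          # power used for normalization
        fcs.append(s / gain)
    return fcs
```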
`The EWI decoder is illustrated in Fig. 2. The REW and the
`SEW are decoded, and an interpolated SEW is computed each
`2 ms. The REW and SEW are phase adjusted to achieve ade-
`quate voicing level and to benefit from temporal masking, and
`then added together. The resulting waveform is then power-nor-
`
`malized, and multiplied by the respective quantized gain. The
`pitch is decoded, and interpolated, and is then used for com-
`puting the phase contour using (2). The reconstructed residual is
`computed by continuous waveform interpolation, which is per-
`formed by computing the Fourier series along the phase con-
tour followed by overlap-and-add. Over the interpolation interval $[t_1, t_2]$, the continuous reconstructed excitation signal, $\hat{r}(t)$, is given by

$$\hat{r}(t) = \left[1 - \alpha(t)\right] \hat{u}_1\!\left(\psi(t)\right) + \alpha(t)\, \hat{u}_2\!\left(\psi(t)\right) \qquad (6)$$

where $\hat{u}_1$ and $\hat{u}_2$ are the reconstructed CWs at the interval beginning and ending, respectively, and $\alpha(t)$ is some increasing interpolation function in the range $[0, 1]$.
`The quantized LPC coefficients are interpolated, and are then
`used for the synthesis filter. Finally, the reconstructed speech is
obtained by passing the reconstructed residual through the synthesis filter.
`
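A minimal sketch of the synthesis of (6) over one interpolation interval, assuming each reconstructed CW is available as a vector of complex Fourier coefficients (the names and the real-part convention are ours):

```python
import numpy as np

def synthesize_interval(fc_start, fc_end, psi, alpha):
    """Reconstruct the residual over one interval per (6).
    fc_start, fc_end : FC vectors of the CWs at the interval beginning and end
    psi              : wrapped phase contour over the interval, one value per sample
    alpha            : increasing interpolation function in [0, 1], same length as psi"""
    k = np.arange(1, len(fc_start) + 1)                       # harmonic indices
    phases = np.exp(1j * np.outer(psi, k))                    # Fourier series terms
    u_start = 2.0 * np.real(phases @ fc_start)                # CW evaluated along psi(t)
    u_end = 2.0 * np.real(phases @ fc_end)
    return (1.0 - alpha) * u_start + alpha * u_end            # cross-fade of (6)

# Example: a linear alpha over a 2 ms interval (16 samples at 8 kHz).
# r_hat = synthesize_interval(fc_prev, fc_curr, psi_segment, np.linspace(0.0, 1.0, 16))
```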
`
`Fig. 3. Block diagram of the AbS SEW vector quantization.
`
For low rate coding, it is beneficial to use a formant
`adaptive postfilter [28]. In WI coding the postfilter enhances the
`quantized speech quality by reducing the audibility of the non-
periodic speech component around the formants. Such a compo-
nent is mostly due to the REW, which is still somehow related to
`the SEW and may not always be regarded as independent noise.
`
`Many speech coding schemes use voiced/unvoiced classifica-
`tion with separate coding of each type of sound. Such schemes
may suffer severe quality loss whenever a classification error is
made, which causes the coder to apply a coding method that is
inappropriate for the coded speech sound. One of the important
advantages of the WI coding system is that it is universally ap-
plied to all speech sounds, and it is therefore more robust than
classification-based coding schemes.
`
`III. SEW OPTIMIZATION
`
`Most WI coders [10]–[18] use nonideal low-pass filters for
`downsampling and upsampling of the SEW. These filters intro-
`duce aliasing and mirroring distortion, even when no quantiza-
`tion is applied. We propose, instead, a novel AbS SEW quan-
`tization scheme, illustrated in Fig. 3, which takes the nonideal
`interpolation filters into consideration and optimizes the SEW
accordingly; however, some aliasing may already exist (due to
nonideal anti-aliasing filters), and this will not be eliminated by
`the AbS quantization scheme. The input speech is analyzed and
`LPC parameters are extracted, quantized and interpolated, and
`an LPC whitening filter is obtained. Then the speech is passed
`through the resulting whitening filter to produce the residual
signal. In each frame, $N$ SEWs are extracted from the residual, with $N_{LA}$ look-ahead waveforms. Each waveform is represented by a vector of FCs, $\mathbf{s}_n$. The local decoder at the encoder reconstructs the $N + N_{LA}$ SEWs, $\hat{\mathbf{s}}_n$, by interpolating from the quantized SEW of the previous frame, $\hat{\mathbf{S}}_{m-1}$, to the current frame quantized SEW, $\hat{\mathbf{S}}_m$. The interpolated SEW vectors are given by

$$\hat{\mathbf{s}}_n = (1 - \alpha_n)\,\hat{\mathbf{S}}_{m-1} + \alpha_n\,\hat{\mathbf{S}}_m, \qquad n = 1, \ldots, N + N_{LA} \qquad (7)$$

where $0 \le \alpha_n \le 1$ are the linear interpolation weights. Assuming $\hat{\mathbf{S}}_{m-1}$ and the LPC coefficients are given, the encoder's task is to find the quantized vector $\hat{\mathbf{S}}_m$ such that the accumulated weighted distortion between the original and reconstructed waveform sequences, denoted by $D$, is minimized.

Fig. 4. Example for the improved interpolation by SEW optimization during a nonstationary speech segment.
`
`
`
`
Since the effect of the linear interpolation LPF is taken into account in the proposed scheme, a true interpolated waveform (synthesis) is incorporated in the analysis process, unlike the conventional open-loop WI coders [10]–[18], in which only one waveform, namely the frame-end SEW, is used for the quantization. Consider the accumulated weighted distortion, $D$, between the input SEW FC vectors, $\mathbf{s}_n$, and the quantized and interpolated vectors, $\hat{\mathbf{s}}_n$, given by

$$D = \sum_{n=1}^{N + N_{LA}} \left(\mathbf{s}_n - \hat{\mathbf{s}}_n\right)^{H} \mathbf{W}_n \left(\mathbf{s}_n - \hat{\mathbf{s}}_n\right) \qquad (8)$$

where
$N$ is the number of waveforms per frame;
$N_{LA}$ is the number of look-ahead waveforms;
$\mathbf{W}_n$ is a diagonal matrix whose elements, $W_n(k)$, are the spectral values of the combined spectral-weighting and synthesis filters at the $k$th harmonic, given by

$$W_n(k) = \left| \frac{g_n\, A\!\left(e^{j\omega_k}/\gamma_1\right)}{\hat{A}\!\left(e^{j\omega_k}\right)\, A\!\left(e^{j\omega_k}/\gamma_2\right)} \right|, \qquad \omega_k = \frac{2\pi k}{p}, \quad k = 1, \ldots, K \qquad (9)$$

where
$p$ is the pitch period;
$K$ is the number of harmonics;
$g_n$ is the gain;
$A(z)$ and $\hat{A}(z)$ are the input and the quantized LPC polynomials, respectively.
The spectral weighting parameters satisfy $0 \le \gamma_2 \le \gamma_1 \le 1$. It can be shown that the accumulated distortion in (8) is equal to the sum of two components, a modeling distortion and a quantization distortion

$$D = D_{\mathrm{model}} + D_q \qquad (10)$$

where the quantization distortion is given by

$$D_q = \left(\hat{\mathbf{S}}_m - \tilde{\mathbf{S}}\right)^{H} \tilde{\mathbf{W}} \left(\hat{\mathbf{S}}_m - \tilde{\mathbf{S}}\right) \qquad (11)$$

where the optimal vector, $\tilde{\mathbf{S}}$ (which minimizes the modeling distortion), is given by

$$\tilde{\mathbf{S}} = \tilde{\mathbf{W}}^{-1} \sum_{n=1}^{N + N_{LA}} \alpha_n \mathbf{W}_n \left[\mathbf{s}_n - (1 - \alpha_n)\,\hat{\mathbf{S}}_{m-1}\right] \qquad (12)$$

and the respective weighting matrix is given by

$$\tilde{\mathbf{W}} = \sum_{n=1}^{N + N_{LA}} \alpha_n^2\, \mathbf{W}_n \qquad (13)$$

Therefore, VQ with the accumulated distortion of (8) can be simplified by using the distortion of (11), and

$$\hat{\mathbf{S}}_m = \arg\min_{\mathbf{S} \in \mathcal{C}} \left(\mathbf{S} - \tilde{\mathbf{S}}\right)^{H} \tilde{\mathbf{W}} \left(\mathbf{S} - \tilde{\mathbf{S}}\right) \qquad (14)$$

where $\mathcal{C}$ is the SEW codebook.
`
`An improved match between reconstructed and original SEW
`is obtained, most notably in the transitions. Fig. 4 illustrates
`the improved waveform matching obtained for a nonstationary
`speech segment by interpolating the optimized SEW.
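In terms of implementation, (12)–(14) amount to a weighted least-squares target followed by a nearest-neighbor codebook search. The sketch below is our illustration (diagonal weights stored as vectors, illustrative names), not the coder's source:

```python
import numpy as np

def abs_sew_quantize(s, W, alpha, S_prev, codebook):
    """AbS SEW quantization that accounts for the linear interpolation filter.
    s        : (N_tot, K) input SEW FC vectors (complex), N_tot = N + N_LA
    W        : (N_tot, K) diagonal spectral weights W_n(k), stored as rows
    alpha    : (N_tot,) interpolation weights alpha_n in [0, 1]
    S_prev   : (K,) quantized SEW of the previous frame
    codebook : (L, K) candidate SEW codevectors
    Returns the index of the selected codevector."""
    a = alpha[:, None]
    W_tilde = np.sum((a ** 2) * W, axis=0)                         # (13)
    b = np.sum(a * W * (s - (1.0 - a) * S_prev), axis=0)
    S_opt = b / np.maximum(W_tilde, 1e-12)                         # (12)
    err = codebook - S_opt
    dq = np.sum(W_tilde * np.abs(err) ** 2, axis=1)                # (11), searched as in (14)
    return int(np.argmin(dq))
```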
`
`IV. DISPERSION PHASE QUANTIZATION
`
`The dispersion-phase quantization scheme [21]–[23] is illus-
`trated in Fig. 5. A pitch cycle that is extracted from the SEW is
`applied as an input to the system, and is cyclically shifted so that
its pulse is located at position zero. Let its FC vector be denoted by $\mathbf{s}$. After quantization, the components of the quantized magnitude vector, $\hat{\mathbf{m}}$, are multiplied by the exponential of the quantized phases, $\hat{\boldsymbol{\theta}}$, to yield the quantized waveform FC vector, $\hat{\mathbf{s}}$, which is subtracted from the input FC vector to produce the error FC vector. The error FC vector is then transformed to the perceptually-weighted frequency domain by weighting it by the combined synthesis and weighting filter, $\mathbf{W}$. The en-
`coder searches for the phase that minimizes the energy of the
`perceptually weighted error, allowing a fine tuning of the cyclic
`shift of the input waveform during the search, to eliminate any
`residual phase shift between the input waveform and the quan-
`tized waveform. Phase dispersion quantization aims to improve
`waveform matching. Efficient AbS quantization can be obtained
`by using the perceptually weighted distortion

$$D = \left(\mathbf{s}_w - \hat{\mathbf{s}}_w\right)^{H}\left(\mathbf{s}_w - \hat{\mathbf{s}}_w\right) \qquad (15)$$

where $\mathbf{s}_w = \mathbf{W}\mathbf{s}$ is the weighted input SEW prototype and $\hat{\mathbf{s}}_w = \mathbf{W}\hat{\mathbf{s}} = \mathbf{W}\boldsymbol{\Phi}\hat{\mathbf{m}}$ is the quantized and weighted SEW prototype, with $\boldsymbol{\Phi} = \mathrm{diag}\{e^{j\hat{\theta}_k}\}$ the diagonal phase exponent matrix. It can be shown [21] that the above distortion is equivalent to

$$D = \mathbf{s}_w^{H}\mathbf{s}_w + \hat{\mathbf{m}}^{H}\mathbf{W}^{H}\mathbf{W}\hat{\mathbf{m}} - 2\,\mathrm{Re}\!\left\{\mathbf{s}^{H}\mathbf{W}^{H}\mathbf{W}\,\boldsymbol{\Phi}\,\hat{\mathbf{m}}\right\} \qquad (16)$$
`
`The magnitude is perceptually more significant than the phase
`[26] and should therefore be quantized first. Furthermore, if the
`phase were quantized first, the very limited bit allocation avail-
`able for the phase would lead to an excessively degraded spectral
`matching of the magnitude in favor of a somewhat improved,
`but less important, matching of the waveform. For this distor-
`tion measure, the quantized phase vector is given by [21]–[23]

$$\hat{\boldsymbol{\theta}} = \arg\max_{i}\, \mathrm{Re}\!\left\{\mathbf{s}^{H}\mathbf{W}^{H}\mathbf{W}\,\boldsymbol{\Phi}_i\,\hat{\mathbf{m}}\right\} \qquad (17)$$
`
`
where $i$ is the running phase codebook index, $\boldsymbol{\Phi}_i$ is the respective diagonal phase exponent matrix, and $\hat{\mathbf{m}}$ is the quantized magnitude vector.
The AbS search for phase quantization is based on evaluating (17) for each candidate phase codevector. Since only trigonometric functions of the phase candidates are used (via complex exponentials), only phase values modulo $2\pi$ are relevant, and therefore phase unwrapping is avoided. The EWI coder uses the optimized SEW, $\tilde{\mathbf{S}}$, and the optimized weighting, $\tilde{\mathbf{W}}$, for the AbS phase quantization.
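Because the quantized magnitude and the weighting are fixed during the phase search, evaluating (17) reduces to one real correlation per codevector; a sketch (illustrative names, and without the fine tuning of the cyclic shift mentioned above) follows:

```python
import numpy as np

def abs_phase_search(s, w, m_hat, phase_codebook):
    """AbS dispersion-phase VQ search in the spirit of (17).
    s              : (K,) complex FC vector of the (optimized) SEW prototype
    w              : (K,) real weights of the combined synthesis/weighting filter
    m_hat          : (K,) quantized magnitude vector
    phase_codebook : (L, K) candidate phase vectors in radians"""
    fixed = np.conj(s) * (w ** 2) * m_hat                # part independent of the candidate
    corr = np.real(np.exp(1j * phase_codebook) @ fixed)  # Re{s^H W^H W Phi_i m_hat}, no unwrapping
    return int(np.argmax(corr))
```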
`
`A. Phase Centroid Equations
`We will now describe the training of the phase codebook.
Suppose $\mathcal{T} = \left\{\mathbf{s}^{(v)}\right\}_{v=1}^{|\mathcal{T}|}$ is the set of SEW
`
`
`
`training vectors used for the design of the phase VQ, where
`is the cardinality of the set
`, that is, the number of elements in
`. The average global distortion measure for the quantization
`of the training set is
`
`(18)
`
`th input and the
`are the th FC of the
`, and
`where
`quantized SEWs, respectively. The th optimal partition cell sat-
`isfies
`
`for all
`
`(19)
`
`, the centroid
`For a given partition
`equation [29] of the th coefficient’s phase, for the th cluster,
`which minimizes the global distortion (18), is given by
`
`-
`
`Fig. 5. Block diagram of the AbS dispersion phase vector quantization.
`
`(20)
`
`Fig. 6. Segmental weighted SNR of the phase VQ versus the number of bits,
`for M-IRS and for nonfiltered (flat) speech.
`
`B. Variable Dimension Vector Quantization
`The phase vector’s dimension depends on the pitch period
`and, therefore, a variable dimension VQ has been implemented.
In our WI coder, the range of possible pitch period values was divided into
several ranges, and for each pitch-period range a codebook
was designed such that all vectors of dimension smaller than that of the
largest pitch period in the range are zero-padded beyond their
highest element.
`to switch among the pitch-range selected codebooks. In order
`to achieve smooth phase variations whenever such a switch oc-
`curs, overlapped training clusters were used and similar initial
`conditions were selected for each codebook. This design method
`does not guarantee smoothness, i.e., for a slight change in pitch
`that causes a switch in codebooks, the quantized vector could
`change substantially. However, significant quality improvement
`was obtained with the procedure. We believe such smoothness
`may be guaranteed by including some heuristic rules in the en-
`coding process.
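A sketch of how a pitch-range-dependent, zero-padded codebook might be selected and truncated to the current vector dimension (the ranges, sizes, and names here are illustrative placeholders, not the coder's actual tables):

```python
import numpy as np

# Illustrative pitch ranges (in samples), each with a zero-padded phase codebook
# whose column count matches the largest number of harmonics in the range.
PITCH_RANGES = [(20, 40), (40, 80), (80, 148)]
PHASE_CODEBOOKS = [np.zeros((16, hi // 2)) for (_, hi) in PITCH_RANGES]  # placeholder tables

def select_phase_codebook(pitch_period, num_harmonics):
    """Pick the codebook for the current pitch range and truncate each
    zero-padded codevector to the current number of harmonics."""
    for (lo, hi), cb in zip(PITCH_RANGES, PHASE_CODEBOOKS):
        if lo <= pitch_period < hi:
            return cb[:, :num_harmonics]
    return PHASE_CODEBOOKS[-1][:, :num_harmonics]
```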
`
`C. Objective Results
`The segmental weighted signal-to-noise ratio (SNR) of the
`phase quantizer is illustrated in Fig. 6. The segmental SNR was
`calculated by averaging the SNR of the extracted waveforms.
`For each waveform, the SNR was computed using the quan-
`tized phase and nonquantized magnitude. The proposed system
`achieves approximately 14 dB SNR for as few as six bits for
`nonfiltered speech, and nearly 10 dB for modified intermediate
`reference system (M-IRS) [35] filtered speech.
`
`Fig. 7. Results of subjective A/B test for comparison between the four-bit
phase VQ and a male-extracted fixed phase.
`
`D. Subjective Results
`Recent WI coders have used a fixed dispersion phase ex-
`tracted from male speakers [14], [19]. We have conducted a sub-
`jective A/B test to compare our dispersion phase VQ, using only
`four bits, to a male-extracted dispersion phase. The test data in-
`cluded 16 M-IRS speech sentences, eight of which are of female
`speakers, and eight of male speakers. During the test, all pairs
of files were played twice in alternating order, and the listeners
`could vote for either of the systems, or for no preference. The
`speech material was synthesized using our WI system in which
only the dispersion phase was quantized every 20 ms. Twenty-one
listeners participated in the test. The test results, illustrated
`in Fig. 7, show improvement in speech quality by quantizing the
`phase with a four-bit VQ. The improvement is larger for female
speakers than for male. This may be due to the fact that for female
`
`
`
`
speech there is a larger number of bits per vector sample,
`resulting in better waveform matching which is more perceiv-
`able particularly during transitions.
`The codebook design for the dispersion-phase quantization
`involves a tradeoff between robustness in terms of smooth phase
`variations and waveform matching. A locally optimized code-
`book for each pitch value may improve the waveform matching
`on the average, but will occasionally yield abrupt and excessive
`changes that can cause temporal artifacts.
`
`V. PARAMETRIC REW QUANTIZATION
`
`Efficient REW quantization can benefit from two observa-
`tions [25]: 1) the REW magnitude is typically an increasing
`function of frequency, which suggests that an efficient para-
`metric representation may be used and 2) one can observe
`similarity between successive REW magnitude spectra, which
`suggests that employing predictive VQ on a group of adjacent
REWs may yield useful coding gains. The next four subsections
`introduce the REW parametric representation and the associ-
`ated VQ technique.
`
`A. REW Parameterization
`Direct quantization of the REW magnitude is a variable
`dimension quantization problem, which may result in spending
`bits and computational effort on perceptually irrelevant in-
`formation. A simple and practical way to obtain a reduced,
`and fixed, dimension representation of the REW is with a
`linear combination of basis functions, such as orthonormal
`polynomials [18]–[20]. Such a representation usually produces
`a smoother REW magnitude, and improves the perceptual
quality. Suppose the REW magnitude, $R(\omega)$, is represented by a linear combination of orthonormal functions

$$R(\omega) = \sum_{i=0}^{M-1} r_i\, \varphi_i(\omega) \qquad (21)$$

where $\omega$ is the angular frequency and $M$ is the representation order. The REW magnitude is typically an increasing function of frequency, which can be coarsely quantized with a small number of bits per waveform without significant perceptual degradation. Therefore, it may be advantageous to represent the REW magnitude in a simple, but perceptually relevant manner. Consequently we model the REW by the following parametric representation,

$$R(\omega, \beta) = \mathbf{r}(\beta)^{T}\boldsymbol{\varphi}(\omega) \qquad (22)$$

where $\mathbf{r}(\beta) = \left[r_0(\beta), \ldots, r_{M-1}(\beta)\right]^{T}$ is a parametric vector of coefficients within the representation model subspace, and $\beta \in [0, 1]$ is the “unvoicing” parameter, which is zero for a fully voiced spectrum and one for a fully unvoiced spectrum. Thus, $R(\omega, \beta)$ defines a 2-D surface whose cross sections for each value of $\beta$ give a particular REW magnitude spectrum, which is defined merely by specifying a scalar parameter value.
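As one concrete, purely illustrative choice of orthonormal functions in (21), Legendre polynomials mapped onto the frequency range can be used (normalization details glossed over); the sketch below fits a sampled REW magnitude by least squares:

```python
import numpy as np
from numpy.polynomial import legendre

def rew_coefficients(rew_mag, order=5):
    """Fit a REW magnitude sampled on a uniform grid over [0, pi] with the
    first `order` Legendre polynomials mapped to that interval, cf. (21)."""
    x = np.linspace(-1.0, 1.0, len(rew_mag))        # map the frequency grid onto [-1, 1]
    basis = legendre.legvander(x, order - 1)        # columns are P_0 .. P_{order-1}
    r, *_ = np.linalg.lstsq(basis, rew_mag, rcond=None)
    return r

def rew_from_coefficients(r, num_points=64):
    """Evaluate the modeled REW magnitude from its coefficient vector."""
    x = np.linspace(-1.0, 1.0, num_points)
    return legendre.legvander(x, len(r) - 1) @ r
```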
`
Fig. 8. REW parametric representation $R(\omega, \beta)$.
`
`B. Piecewise Linear REW Representation
`In order to have a simple representation that is computation-
`ally efficient and avoids excessive memory requirements, we
model the 2-D surface by a piecewise linear parametric representation. Therefore, we introduce a set of $P$ uniformly spaced spectra, $R_j(\omega) = R(\omega, \beta_j)$, $j = 0, \ldots, P-1$, as shown in Fig. 8. (Such a set of functions is similar to the hand-tuned REW codebook in [19] and [20].) Then the parametric surface is defined by linear interpolation according to

$$R(\omega, \beta) = \left[1 - \lambda(\beta)\right] R_j(\omega) + \lambda(\beta)\, R_{j+1}(\omega), \qquad \beta_j \le \beta \le \beta_{j+1}$$

Because this representation is linear, the coefficients of $R(\omega, \beta)$ are linear combinations of the coefficients of $R_j(\omega)$. Hence,

$$\mathbf{r}(\beta) = \left[1 - \lambda(\beta)\right]\mathbf{r}_j + \lambda(\beta)\,\mathbf{r}_{j+1} \qquad (23)$$

and

$$\lambda(\beta) = \frac{\beta - \beta_j}{\beta_{j+1} - \beta_j} \qquad (24)$$

where $\mathbf{r}_j$ is the coefficient vector of the $j$th REW magnitude representation

$$R_j(\omega) = \sum_{i=0}^{M-1} r_{j,i}\, \varphi_i(\omega) \qquad (25)$$
`
`C. REW Modeling
1) Nonweighted Distortion: Suppose for a REW magnitude, $R(\omega)$, represented by some coefficient vector, $\mathbf{r}$, we search for the parameter value, $\beta$, in $[0, 1]$, whose respective representation vector, $\mathbf{r}(\beta)$, minimizes the mean squared error (MSE) distortion between the two spectra

$$D(\beta) = \int_{0}^{\pi}\left[R(\omega) - R(\omega, \beta)\right]^2 d\omega$$

From orthonormality, the distortion is equal to

$$D(\beta) = \left\|\mathbf{r} - \mathbf{r}(\beta)\right\|^2 = \left\|\mathbf{r} - \mathbf{r}_j - \lambda\left(\mathbf{r}_{j+1} - \mathbf{r}_j\right)\right\|^2 \qquad (26)$$

The optimal interpolation factor that minimizes the MSE is

$$\lambda_j = \frac{\left(\mathbf{r} - \mathbf{r}_j\right)^{T}\left(\mathbf{r}_{j+1} - \mathbf{r}_j\right)}{\left\|\mathbf{r}_{j+1} - \mathbf{r}_j\right\|^2} \qquad (27)$$

$$\hat{\lambda}_j = \min\left\{\max\left\{\lambda_j, 0\right\}, 1\right\} \qquad (28)$$

and the respective optimal parameter value, which is a continuous variable between zero and one, is given by

$$\hat{\beta} = \frac{j + \hat{\lambda}_j}{P - 1} \qquad (29)$$
`
`This result allows a rapid search for the best unvoicing param-
`eter value needed to transform the coefficient vector to a scalar
`parameter, for encoding or for VQ design.
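The mapping from a coefficient vector to the scalar unvoicing parameter is then a projection onto each segment of the piecewise-linear representation followed by picking the closest segment; the following sketch mirrors (26)–(29) under our conventions (uniform spacing of the anchor spectra, clamped interpolation factor):

```python
import numpy as np

def unvoicing_parameter(r, anchors):
    """Map a REW coefficient vector to the scalar unvoicing parameter.
    r       : (M,) input coefficient vector
    anchors : (P, M) coefficient vectors of the P uniformly spaced spectra
    Returns beta in [0, 1]."""
    P = anchors.shape[0]
    best_beta, best_err = 0.0, np.inf
    for j in range(P - 1):
        d = anchors[j + 1] - anchors[j]
        lam = np.dot(r - anchors[j], d) / max(np.dot(d, d), 1e-12)   # cf. (27)
        lam = min(max(lam, 0.0), 1.0)                                # keep within the segment, cf. (28)
        err = float(np.sum((r - anchors[j] - lam * d) ** 2))         # cf. (26)
        if err < best_err:
            best_err, best_beta = err, (j + lam) / (P - 1)           # cf. (29)
    return best_beta
```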
`2) Weighted Distortion: Commonly in speech coding, the
`magnitude is quantized using a weighted distortion measure.
`In this case, the weighted distortion between the input and the
`parametric representation modeled spectra is equal to
`
$$D_W(\beta) = \left[\mathbf{r} - \mathbf{r}(\beta)\right]^{T} \boldsymbol{\Phi}_W \left[\mathbf{r} - \mathbf{r}(\beta)\right] \qquad (30)$$

where $\boldsymbol{\Phi}_W$ is the weighted correlation matrix of the orthonormal functions, whose elements are

$$\left[\boldsymbol{\Phi}_W\right]_{i,l} = \int_{0}^{\pi} W(\omega)\, \varphi_i(\omega)\, \varphi_l(\omega)\, d\omega \qquad (31)$$

where $\mathbf{r}$ is the input coefficient vector and $\mathbf{r}(\beta)$ is the modeled parametric coefficient vector. The optimal parameter that minimizes (30) is given by

$$\lambda_j = \frac{\left(\mathbf{r} - \mathbf{r}_j\right)^{T}\boldsymbol{\Phi}_W\left(\mathbf{r}_{j+1} - \mathbf{r}_j\right)}{\left(\mathbf{r}_{j+1} - \mathbf{r}_j\right)^{T}\boldsymbol{\Phi}_W\left(\mathbf{r}_{j+1} - \mathbf{r}_j\right)} \qquad (32)$$

and the respective optimal parameter value is computed using (29). Alternatively, in order to eliminate using the matrix $\boldsymbol{\Phi}_W$, and to benefit from the orthonormal function simplification given in (27), the scalar product may be redefined to incorporate the time-varying spectral weighting. The respective orthonormal basis functions then satisfy

$$\int_{0}^{\pi} W(\omega)\, \varphi_i^{W}(\omega)\, \varphi_l^{W}(\omega)\, d\omega = \delta_{il} \qquad (33)$$

where $\delta_{il}$ denotes the Kronecker delta. The respective parameter vector is given by

$$\mathbf{r} = \int_{0}^{\pi} W(\omega)\, R(\omega)\, \boldsymbol{\varphi}^{W}(\omega)\, d\omega \qquad (34)$$

where $\boldsymbol{\varphi}^{W}(\omega) = \left[\varphi_0^{W}(\omega), \ldots, \varphi_{M-1}^{W}(\omega)\right]^{T}$ is an $M$-dimensional vector of time-varying orthonormal functions.

Fig. 9. REW parametric representation AbS VQ.

Fig. 10. REW parametric representation AbS VQ.

Fig. 11. REW parametric representation simplified weighted AbS VQ.
`
`D. REW Quantization
1) Full Complexity Spectral Quantization Scheme: A novel AbS REW parameter VQ paradigm is illustrated in Fig. 9. An excitation vector $\mathbf{x}$ is selected from the VQ codebook and is fed through a synthesis filter to obtain a parameter vector $\hat{\boldsymbol{\beta}}$ (synthesized quantized), which is then mapped to a quantized sequence of representation coefficient vectors $\hat{\mathbf{r}}_n$. This is compared with a sequence of input representation coefficient vectors $\mathbf{r}_n$, and each is spectrally weighted. Each spectrally weighted error is then temporally weighted, and a distortion measure is obtained. A search through all candidate excitation vectors determines an optimal choice. The synthesis filter in Fig. 9 can be viewed as a first order predictor in a feedback loop. By allowing the value of the predictor parameter $\rho$ to change, it becomes a “switched-predictor” scheme. Switched-prediction is introduced to allow for different levels of REW
`
`
`
`
`parameter correlation. The scheme incorporates both spectral
`weighting and temporal weighting. The spectral weighting is
`used for the distortion between each pair of input and quantized
`spectra. In order to improve SEW/REW mixing, particularly in
`mixed voiced and unvoiced speech segments, and to increase
`speech crispness, especially for plosives and onsets, temporal
`weighting is incorporated in the AbS REW VQ. The temporal
`weighting is a monotonic function of the temporal gain. Two
codebooks are used, one corresponding to each of two predictor coefficients, $\rho_1$ and $\rho_2$. The quantization target is an $N$-dimensional vector of REW spectra. Each REW spectrum is represented by a vector of basis function coefficients denoted by $\mathbf{r}_n$. The search for the minimal weighted mean squared error (WMSE) is performed over all the vectors, $\mathbf{x}$, of the two codebooks for $\rho_1$ and $\rho_2$. The quantized REW function coefficients vector, $\hat{\mathbf{r}}_n$, is a function of the quantized parameter $\hat{\beta}_n$, which is obtained by passing the quantized vector, $\mathbf{x}$, through the synthesis filter using the coefficient $\rho_1$, or $\rho_2$. The weighted distortion between each pair of input and quantized REW spectra is calculated. The total distortion is a temporally-weighted sum of the $N$ spectrally weighted distortions. Since the predictor coefficients are known, direct VQ can be used to simplify the computations. For a piecewise linear parametric REW representation, a substantial simplification of the search computations may be obtained by interpolating the distortion between the representation spectra set.
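The switched-predictive search can be illustrated in the (simplified) parameter domain as follows: each candidate excitation vector is passed through a first-order synthesis filter, and a temporally weighted squared error against the frame's parameter vector is accumulated. Everything below (the names, the clipping, and the direct parameter-domain distortion) is our simplification of the scheme, not the coder's source:

```python
import numpy as np

def rew_switched_predictive_vq(beta_target, prev_beta, codebooks, rhos, temporal_w):
    """Switched-predictive AbS VQ of the REW parameter sequence.
    beta_target : (N,) unvoicing parameters of the frame's N REWs
    prev_beta   : last quantized parameter of the previous frame (filter memory)
    codebooks   : list of (L, N) excitation codebooks, one per predictor coefficient
    rhos        : matching list of predictor coefficients, e.g. [rho1, rho2]
    temporal_w  : (N,) temporal weights (monotonic in the temporal gain)
    Returns (codebook index, codevector index, quantized parameter vector)."""
    best = (0, 0, None, np.inf)
    for cb_idx, (cb, rho) in enumerate(zip(codebooks, rhos)):
        for vec_idx, x in enumerate(cb):
            beta_q = np.empty(len(x))
            mem = prev_beta
            for n in range(len(x)):                      # first-order synthesis filter
                mem = rho * mem + x[n]
                beta_q[n] = np.clip(mem, 0.0, 1.0)
            dist = float(np.sum(temporal_w * (beta_target - beta_q) ** 2))
            if dist < best[3]:
                best = (cb_idx, vec_idx, beta_q, dist)
    return best[:3]
```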
`2) Simplified Parametric Quantization Scheme: The above
`scheme maps each quantized parameter to a coefficient vector,
`which is used to compute the spectral distortion. To reduce com-
`plexity, such a mapping, and spectral distortion computation
`may be eliminated by using the simplified scheme described
below. For a high rate and a smooth representation surface $R(\omega, \beta)$, the total distortion is equal to the sum of a modeling distortion and a quantization distortion. The quantization distortion is related to the quantized parameter by

$$D_q \cong \left\|\mathbf{r}_{j+1} - \mathbf{r}_j\right\|^2 \left(P - 1\right)^2 \left(\beta - \hat{\beta}\right)^2 \qquad (35)$$

The quantization distortion is linearly related to the REW parameter squared quantization error, $(\beta - \hat{\beta})^2$, and therefore justifies direct VQ of the REW parameter.

a) Simplified scheme, nonweighted distortion: The encoder maps the REW magnitude to an unvoicing parameter, and then quantizes the parameter by AbS VQ, as illustrated in Fig. 10. Initially, the magnitudes of the $N$ REWs in the frame are mapped to coefficient vectors, $\mathbf{r}_n$. Then, for each coefficient vector, a search is performed to find the optimal representation parameter, $\hat{\beta}_n$, using (29), to form an $N$-dimensional parameter vector for the current frame, $\boldsymbol{\beta} = [\hat{\beta}_1, \ldots, \hat{\beta}_N]^{T}$. Finally, the parameter vector is encoded by AbS VQ. The decoded spectra, $\hat{R}_n(\omega)$, are obtained from the quantized parameter vector, $\hat{\boldsymbol{\beta}}$, using (23). This scheme allows for higher temporal as well as spectral REW resolution, since no downsampling is performed, and