`Saint Lawrence Communications
`Exhibit 2009
`
`ENHANCING WAVEFORM INTERPOLATIVE CODING WITH
`WEIGHTED REW PARAMETRIC QUANTIZATION
`Oded Gottesman and Allen Gersho
`Signal Compression Laboratory, Department of Electrical and Computer Engineering
`University of California, Santa Barbara, California 93106, USA
`E-mail: [oded, gersho]@scl.ece.ucsb.edu, Web: http://scl.ece.ucsb.edu
`
functions, such as orthonormal polynomials [4]-[6]. Such a
representation usually produces a smoother REW magnitude, and
improves the perceptual quality. Suppose the REW magnitude, $R(\omega)$, is
represented by a linear combination of orthonormal functions, $\psi_i(\omega)$:

$$R(\omega) = \sum_{i=0}^{I-1} \gamma_i \psi_i(\omega), \quad 0 \le \omega \le \pi \qquad (1)$$

where $\omega$ is the angular frequency, and $I$ is the representation order. The
`REW magnitude is typically an increasing function of the frequency,
`which can be coarsely quantized with a small number of bits per
`waveform without significant perceptual degradation. Therefore, it may
`be advantageous to represent the REW magnitude in a simple, but
perceptually relevant manner. Suppose the REW is modeled by the
following parametric representation, $R(\omega, \varphi)$:

$$R(\omega, \varphi) = \sum_{i=0}^{I-1} \gamma_i(\varphi) \psi_i(\omega); \quad 0 \le \omega \le \pi; \quad 0 \le \varphi \le 1 \qquad (2)$$

where $\boldsymbol{\gamma}(\varphi) = [\gamma_0(\varphi), \ldots, \gamma_{I-1}(\varphi)]^T$ is a parametric vector of coefficients
within the representation model subspace, and $\varphi$ is the “unvoicing”
parameter which is zero for a fully voiced spectrum, and one for a fully
unvoiced spectrum.
`
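As a rough illustration of the representation in (1), the following Python sketch projects a magnitude curve onto a small orthonormal basis and evaluates the reconstruction. The cosine basis, the midpoint-rule integration, and the names `cosine_basis`, `project`, `reconstruct`, and `toy_rew` are assumptions made for this example only; the paper suggests orthonormal polynomials and does not specify an implementation.

```python
import math

def cosine_basis(i, omega):
    """Orthonormal cosine basis on [0, pi] (an illustrative choice, not the paper's)."""
    if i == 0:
        return 1.0 / math.sqrt(math.pi)
    return math.sqrt(2.0 / math.pi) * math.cos(i * omega)

def project(magnitude, order, grid=512):
    """Approximate gamma_i = integral over [0, pi] of R(w) * psi_i(w) by the midpoint rule."""
    step = math.pi / grid
    coeffs = []
    for i in range(order):
        acc = 0.0
        for k in range(grid):
            w = (k + 0.5) * step
            acc += magnitude(w) * cosine_basis(i, w)
        coeffs.append(acc * step)
    return coeffs

def reconstruct(coeffs, omega):
    """Evaluate R(w) = sum_i gamma_i * psi_i(w), as in (1)."""
    return sum(g * cosine_basis(i, omega) for i, g in enumerate(coeffs))

def toy_rew(w):
    """Hypothetical rising magnitude that lies inside the I = 3 subspace."""
    return 0.5 - 0.4 * math.cos(w) + 0.1 * math.cos(2 * w)

gamma = project(toy_rew, order=3)
```

Because `toy_rew` lies inside the three-function subspace, `reconstruct(gamma, w)` reproduces it up to rounding; a measured REW magnitude would only be approximated, with the smoothing effect noted above.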
`2.2 Piecewise Linear REW Representation
`For practical considerations we may assume that the parametric
representation is piecewise linear, and may be represented by a set of N
uniformly spaced spectra, $\{R(\omega, \hat{\varphi}_n)\}_{n=0}^{N-1}$, as illustrated in Figure 1.
`This representation is similar to the hand-tuned REW codebook in [5],
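[6]. The parametric surface is linearly interpolated by:

$$R(\omega, \varphi) = (1 - \alpha) R(\omega, \hat{\varphi}_{n-1}) + \alpha R(\omega, \hat{\varphi}_n); \quad \hat{\varphi}_{n-1} \le \varphi \le \hat{\varphi}_n; \quad \alpha = \frac{\varphi - \hat{\varphi}_{n-1}}{\Delta}; \quad \Delta = \hat{\varphi}_n - \hat{\varphi}_{n-1} \qquad (3)$$

From the linearity of the representation:

$$\boldsymbol{\gamma}(\varphi) = (1 - \alpha) \hat{\boldsymbol{\gamma}}_{n-1} + \alpha \hat{\boldsymbol{\gamma}}_n \qquad (4)$$

where $\hat{\boldsymbol{\gamma}}_n$ is the coefficient vector of the n-th REW magnitude
representation, i.e. $\hat{\boldsymbol{\gamma}}_n = \boldsymbol{\gamma}(\hat{\varphi}_n)$.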
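The interpolation of the coefficient vector in (4) can be sketched in a few lines of Python. This is a minimal sketch, assuming the N representation values are spaced uniformly on [0, 1] with the first at 0 and the last at 1; the function name `interpolate_coeffs` and the toy vectors are hypothetical.

```python
def interpolate_coeffs(phi, rep_vectors):
    """Return the coefficient vector for parameter phi by linear interpolation.

    rep_vectors[n] is the representation vector at phi_n = n / (N - 1)
    (uniform spacing on [0, 1] is an assumption of this sketch).
    """
    n_reps = len(rep_vectors)
    if n_reps < 2:
        raise ValueError("need at least two representation vectors")
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must lie in [0, 1]")
    delta = 1.0 / (n_reps - 1)
    # Index of the upper end of the segment that contains phi.
    n = min(int(phi / delta) + 1, n_reps - 1)
    alpha = (phi - (n - 1) * delta) / delta
    lower, upper = rep_vectors[n - 1], rep_vectors[n]
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(lower, upper)]

# Toy representation set: N = 3 vectors of order I = 2.
reps = [[0.0, 0.0], [1.0, 2.0], [4.0, 2.0]]
midway = interpolate_coeffs(0.25, reps)
```

Here `midway` lies halfway between the first two representation vectors, since 0.25 is the midpoint of the first segment.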
`2.3 REW Modeling
`2.3.1 Non-Weighted Distortion
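Suppose for a REW magnitude, $R(\omega)$, represented by some coefficient
vector, $\boldsymbol{\gamma}$, we search for the parameter value, $\varphi(\boldsymbol{\gamma})$, in
$\hat{\varphi}_{n-1} \le \varphi \le \hat{\varphi}_n$, whose respective representation vector, $\boldsymbol{\gamma}(\varphi)$, minimizes the MSE
distortion between the two spectra:

$$D(R, R(\varphi)) = \int_0^{\pi} |R(\omega) - R(\omega, \varphi)|^2 \, d\omega = \int_0^{\pi} |R(\omega) - (1 - \alpha) R(\omega, \hat{\varphi}_{n-1}) - \alpha R(\omega, \hat{\varphi}_n)|^2 \, d\omega \qquad (5)$$

From orthonormality, the distortion is equal to:

$$D(R, R(\varphi)) = \|\boldsymbol{\gamma} - \boldsymbol{\gamma}(\varphi)\|^2 = \|\boldsymbol{\gamma} - (1 - \alpha) \hat{\boldsymbol{\gamma}}_{n-1} - \alpha \hat{\boldsymbol{\gamma}}_n\|^2 \qquad (6)$$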
`
`ABSTRACT
`This paper presents an efficient quantization technique for the rapidly-
`evolving waveforms in waveform interpolative (WI) coders. The
`scheme, based on a parametrization of the rapidly-evolving waveform
`(REW) magnitude, and analysis-by-synthesis (AbS) vector quantization
`(VQ) of the REW parameters, allows both higher temporal and spectral
`resolution of the REW. A perceptually weighted distortion measure
`takes advantage of spectral and temporal masking and leads to improved
`reconstructed speech quality, most notably in mixed voiced and
`unvoiced speech segments. The technique is an important component of
`the Enhanced Waveform Interpolative (EWI) speech coder at 2.8 kbps
`that achieves a subjective quality slightly better than that of G.723.1 at
`6.3 kbps.
`
`1. INTRODUCTION
`In WI coding [1]-[10], the similarity between successive REW
`magnitudes is generally exploited by downsampling and interpolation
`and by constrained bit allocation [1]. In our earlier EWI coder [7]-[9],
`the REW magnitude was quantized on a waveform by waveform basis,
`and with an excessive number of bits – more than is perceptually
`required. Here we propose a novel parametric representation of the REW
`magnitude and an efficient paradigm for AbS predictive vector
`quantization of the REW parameter sequence. The proposed scheme is
`discussed, and a simplified version is derived. The quantization scheme
`employs a perceptually weighted distortion measure, which takes
`advantage of spectral and temporal masking. The new method achieves
`a substantial reduction in the REW bit rate.
`2. REW QUANTIZATION
`Efficient REW quantization can benefit from two observations: (a) the
`REW magnitude is typically an increasing function of the frequency,
`which suggests that an efficient parametric representation may be used;
`(b) one can observe similarity between successive REW magnitude
`spectra, which suggests that employing predictive VQ on a group of
`adjacent REWs may yield useful coding gains. The next four sections
`introduce the REW parametric representation and the associated VQ
`technique.
`
`2.1 REW Parameterization
`Direct quantization of the REW magnitude is a variable dimension
quantization problem, which may result in spending bits and
computational effort on perceptually irrelevant information. A simple
and practical way to obtain a reduced, and fixed dimension
`representation of the REW is with a linear combination of basis
`
`This work was supported in part by the University of California MICRO program,
`Cisco Systems, Inc., Conexant Systems, Inc., Dialogic Corp., Fujitsu Laboratories of
`America, Inc., General Electric Corp., Hughes Network Systems, Lernout & Hauspie
`Speech Products NV, Lucent Technologies, Inc., Nokia Mobile Phones, Panasonic
`Speech Technology Laboratory, Qualcomm, Inc. and Texas Instruments, Inc.
`
`IEEE Workshop on Speech Coding, 2000 ©
`
`IPR2017-01077
`
`
`
all the vectors, $\hat{c}_{ij}(m)$, of the codebook. The quantized REW function,
$\hat{R}(m)$, is a function of the quantized parameter
coefficients vector, $\hat{\boldsymbol{\gamma}}(m)$, which is obtained by passing the quantized vector,
$\hat{c}_{ij}(m)$, through the synthesis filter. The weighted distortion between each pair
of input and quantized REW spectra is calculated. The total distortion is
a temporally-weighted sum of the M spectrally weighted distortions.
Since the predictor coefficients are known, direct VQ can be used to
simplify the computations. For a piecewise linear parametric REW
representation, a substantial simplification of the search computations
may be obtained by interpolating the distortion between the
representation spectra set.
[Block diagram. Inputs: gain $g(m)$, LPC $a(m)$, and the REW representation coefficient vector $\boldsymbol{\gamma}(m)$. Blocks: predictor codebook ($P_i$), vector quantizer codebook ($\hat{c}_{ij}(m)$), synthesis filter $1/(1 - P_i z^{-1})$ producing $\hat{\boldsymbol{\gamma}}(m)$, spectral weighting, temporal weighting, $\|\cdot\|^2$, and $\min(\cdot)$.]
Fig. 2. REW Parametric Representation AbS VQ
`2.4.2 Simplified Parametric Quantization Scheme
`The above scheme maps each quantized parameter to a coefficient
`vector, which is used to compute the spectral distortion. To reduce
`complexity, such a mapping, and spectral distortion computation may be
eliminated by using the simplified scheme described below. For high
rate, and smooth representation function, the total distortion is equal to the
sum of modeling distortion and quantization distortion:

$$\sum_{m=1}^{M} D_w(R(m), \hat{R}(\hat{\varphi}(m))) = \sum_{m=1}^{M} D_w(R(m), R(\varphi(m))) + \sum_{m=1}^{M} D_w(R(\varphi(m)), \hat{R}(\hat{\varphi}(m))) \qquad (12)$$
`
`The optimal interpolation factor that minimizes the MSE is:
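$$\alpha_{opt} = \frac{(\hat{\boldsymbol{\gamma}}_n - \hat{\boldsymbol{\gamma}}_{n-1})^T (\boldsymbol{\gamma} - \hat{\boldsymbol{\gamma}}_{n-1})}{\|\hat{\boldsymbol{\gamma}}_n - \hat{\boldsymbol{\gamma}}_{n-1}\|^2} \qquad (7)$$

and the respective optimal parameter value, which is a continuous
variable between zero and one, is given by:

$$\varphi(\boldsymbol{\gamma}) = (1 - \alpha_{opt}) \hat{\varphi}_{n-1} + \alpha_{opt} \hat{\varphi}_n \qquad (8)$$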
`This result allows a rapid search for the best unvoicing parameter value
`needed to transform the coefficient vector to a scalar parameter, for
`encoding or for VQ design.
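The rapid search described above can be sketched as follows: for every segment, compute the optimal interpolation factor as in (7), convert it to a parameter value as in (8), and keep the segment with the smallest squared error. This is a minimal sketch under stated assumptions: the representation values are taken to be uniformly spaced on [0, 1], the interpolation factor is clamped to [0, 1] so that the result stays inside its segment (a choice made here, not stated in the paper), and the name `best_parameter` is hypothetical.

```python
def best_parameter(gamma, rep_vectors):
    """Return (phi, distortion) minimising the squared error over all segments."""
    n_reps = len(rep_vectors)
    delta = 1.0 / (n_reps - 1)
    best = None
    for n in range(1, n_reps):
        lower, upper = rep_vectors[n - 1], rep_vectors[n]
        diff = [b - a for a, b in zip(lower, upper)]
        target = [g - a for g, a in zip(gamma, lower)]
        energy = sum(d * d for d in diff)
        # Optimal interpolation factor for this segment, as in (7).
        alpha = sum(d * t for d, t in zip(diff, target)) / energy if energy > 0.0 else 0.0
        alpha = min(1.0, max(0.0, alpha))  # assumption: clamp to the segment
        err = sum((t - alpha * d) ** 2 for t, d in zip(target, diff))
        phi = (n - 1 + alpha) * delta  # parameter value, as in (8)
        if best is None or err < best[1]:
            best = (phi, err)
    return best

# Toy representation set: N = 3 vectors of order I = 2.
reps = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
phi_a, err_a = best_parameter([0.5, 0.0], reps)
phi_b, err_b = best_parameter([1.0, 0.4], reps)
```

Both toy inputs lie exactly on the piecewise linear path, so the returned distortions are zero and the parameters fall in the first and second segment respectively.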
`
[Surface plot. Axes: frequency [radians], 0 to $\pi$; REW parameter, 0 to 1; magnitude, 0 to 1.]
Fig. 1. REW Parametric Representation $R(\omega, \varphi)$
`2.3.2 Weighted Distortion
`Commonly in speech coding, quantization is performed with a
`perceptually weighted distortion measure. In this case, the weighted
`distortion between the input and the parametric representation modeled
spectra is equal to:

$$D_w(R, R(\varphi)) = \int_0^{\pi} |R(\omega) - R(\omega, \varphi)|^2 \, W(\omega) \, d\omega = (\boldsymbol{\gamma} - \boldsymbol{\gamma}(\varphi))^T \Psi(W) (\boldsymbol{\gamma} - \boldsymbol{\gamma}(\varphi)) \qquad (9)$$

where $\Psi(W)$ is the weighted correlation matrix of the orthonormal
functions, whose elements are:

$$\Psi_{i,j}(W) = \int_0^{\pi} W(\omega) \psi_i(\omega) \psi_j(\omega) \, d\omega \qquad (10)$$

Here $\boldsymbol{\gamma}$ is the input coefficient vector, and $\boldsymbol{\gamma}(\varphi)$ is the modeled parametric
coefficient vector. The optimal parameter that minimizes (9) is given by:

$$\alpha_{opt} = \frac{(\hat{\boldsymbol{\gamma}}_n - \hat{\boldsymbol{\gamma}}_{n-1})^T \Psi(W) (\boldsymbol{\gamma} - \hat{\boldsymbol{\gamma}}_{n-1})}{(\hat{\boldsymbol{\gamma}}_n - \hat{\boldsymbol{\gamma}}_{n-1})^T \Psi(W) (\hat{\boldsymbol{\gamma}}_n - \hat{\boldsymbol{\gamma}}_{n-1})} \qquad (11)$$

and the respective optimal parameter value is computed using (8).
`
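The weighted interpolation factor of (11) differs from the non-weighted one only in that both inner products pass through the weighted correlation matrix. The Python sketch below illustrates this for one segment; the names `quad_form` and `weighted_alpha`, and the toy diagonal matrix, are assumptions for illustration, and the matrix is taken as given rather than computed from a weighting function.

```python
def quad_form(u, psi, v):
    """Compute u^T * psi * v for a square matrix psi given as a list of rows."""
    return sum(u[i] * psi[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def weighted_alpha(gamma, lower, upper, psi):
    """Weighted optimal interpolation factor for one segment, as in (11)."""
    diff = [b - a for a, b in zip(lower, upper)]
    target = [g - a for g, a in zip(gamma, lower)]
    denom = quad_form(diff, psi, diff)
    if denom <= 0.0:
        raise ValueError("segment has zero weighted length")
    return quad_form(diff, psi, target) / denom

lower, upper = [0.0, 0.0], [1.0, 1.0]
gamma = [1.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
emphasis = [[3.0, 0.0], [0.0, 1.0]]  # hypothetical weighting of the first coefficient
alpha_plain = weighted_alpha(gamma, lower, upper, identity)
alpha_weighted = weighted_alpha(gamma, lower, upper, emphasis)
```

With the identity matrix the result reduces to the non-weighted factor; with the first coefficient emphasised, the factor moves toward the representation vector that matches the input in that coefficient.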
`2.4 REW Quantization
`2.4.1 Full Complexity Spectral Quantization Scheme
`A novel switched-predictive AbS REW parameter VQ paradigm is
`illustrated in Fig. 2. Switched-prediction is introduced to allow for
`different levels of REW parameter correlation. The scheme incorporates
`both spectral weighting and temporal weighting. The spectral weighting
`is used for the distortion between each pair of input and quantized
`spectra. In order to improve SEW/REW mixing, particularly in mixed
`voiced and unvoiced speech segments, and to increase speech crispness,
`especially for plosives and onsets, temporal weighting is incorporated in
`the AbS REW VQ. The temporal weighting is a monotonic function of
`the temporal gain. A codebook with two partitions is used. Each
`partition has a particular predictor coefficient value, Pi, i=1 or 2. The
`quantization target is an M-dimensional vector of REW spectra. Each
`REW spectrum is represented by a vector of basis function coefficients
denoted by $\boldsymbol{\gamma}(m)$. The search for the minimal WMSE is performed over
`
The quantization distortion is related to the quantized parameter by:

$$\sum_{m=1}^{M} D_w(R(\varphi(m)), \hat{R}(\hat{\varphi}(m))) = \sum_{m=1}^{M} (\boldsymbol{\gamma}(\varphi(m)) - \hat{\boldsymbol{\gamma}}(\hat{\varphi}(m)))^T \Psi(W(m)) (\boldsymbol{\gamma}(\varphi(m)) - \hat{\boldsymbol{\gamma}}(\hat{\varphi}(m))) \qquad (13)$$

which, for the piecewise linear representation case, is equal to:

$$\sum_{m=1}^{M} D_w(R(\varphi(m)), \hat{R}(\hat{\varphi}(m))) = \frac{1}{\Delta^2} \sum_{m=1}^{M} \Delta\hat{\boldsymbol{\gamma}}_n^T \Psi(W(m)) \Delta\hat{\boldsymbol{\gamma}}_n \, (\varphi(m) - \hat{\varphi}(m))^2 \qquad (14)$$

where $\Delta\hat{\boldsymbol{\gamma}}_n = \hat{\boldsymbol{\gamma}}_n - \hat{\boldsymbol{\gamma}}_{n-1}$, and $n$ indexes the segment $\hat{\varphi}_{n-1} \le \varphi(m) \le \hat{\varphi}_n$ that contains $\varphi(m)$. The quantization
distortion is linearly related to the REW parameter squared quantization
error, $(\varphi(m) - \hat{\varphi}(m))^2$, and therefore justifies direct VQ of the REW
parameter.
`2.4.2.1 Simplified Scheme, Non-Weighted Distortion
`The encoder maps the REW magnitude to an unvoicing parameter, and
`then quantizes the parameter by AbS VQ, as illustrated in Fig. 3. This
`scheme allows for higher temporal as well as spectral REW resolution,
`since no downsampling is performed, and the continuous parameter is
`vector quantized in AbS [10].
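The analysis-by-synthesis search of this simplified scheme can be sketched in Python as below. Each candidate codevector is passed through a first-order synthesis filter, and the candidate whose filter output is closest to the target parameter sequence is selected. This is a minimal sketch, assuming a single fixed predictor coefficient and a filter of the form 1/(1 - P z^-1) whose state carries over from the previous vector; the names `synthesize` and `abs_vq_search` and the three-entry codebook are hypothetical.

```python
def synthesize(codevector, predictor, prev_output):
    """Run the first-order synthesis filter 1 / (1 - P z^-1) over one codevector."""
    outputs = []
    state = prev_output
    for c in codevector:
        state = predictor * state + c
        outputs.append(state)
    return outputs

def abs_vq_search(target, codebook, predictor, prev_output):
    """Return (index, quantized sequence, squared error) of the best codevector."""
    best = None
    for index, codevector in enumerate(codebook):
        candidate = synthesize(codevector, predictor, prev_output)
        err = sum((t - q) ** 2 for t, q in zip(target, candidate))
        if best is None or err < best[2]:
            best = (index, candidate, err)
    return best

# Toy example: M = 2 parameters per vector and a hypothetical 3-entry codebook.
codebook = [[0.0, 0.0], [0.25, 0.0], [0.5, 0.25]]
index, quantized, err = abs_vq_search([0.5, 0.5], codebook,
                                      predictor=0.5, prev_output=0.5)
```

The error is measured on the filter output rather than on the codevector itself, which is what makes the search analysis-by-synthesis: the encoder evaluates exactly the sequence that the decoder would reconstruct.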
`
`
Parameter        Bits / Frame    Bits / second
LPC              20              1000
Pitch            2x6 = 12        600
Gain             9               450
SEW magnitude    8               400
REW magnitude    7               350
Total            56              2800
Table 1. Bit allocation for 2.8 kbps EWI coder
`4. SUBJECTIVE RESULTS
`We have conducted a subjective A/B test to compare our 2.8 kbps EWI
`coder to the G.723.1. The test data included 24 modified intermediate
`reference system (M-IRS) [11] filtered speech sentences, 12 of which
`are of female speakers, and 12 of male speakers. Twelve listeners
`participated in the test. The test results, listed in Table 2, indicate that
`the subjective quality of the 2.8 kbps EWI is slightly better than that of
`G.723.1 at 6.3 kbps.
`
Test      2.8 kbps WI    6.3 kbps G.723.1    No Preference
Female    38.19%         36.81%              25.00%
Male      43.06%         31.94%              25.00%
Total     40.63%         34.38%              25.00%
Table 2. Results of subjective A/B test for comparison between
the 2.8 kbps EWI coder and 6.3 kbps G.723.1. With 95%
certainty the result lies within +/-5.59%.
`5. SUMMARY
`We have found a new technique that enhances the performance of the
WI coder, and allows for better coding efficiency. It offers an efficient
`parametrization of the REW magnitude, and AbS VQ of the REW
`parameter. This scheme allows for higher temporal as well as spectral
REW resolution. The weighted distortion scheme improves the
`reconstructed speech quality, most notably in mixed voiced and
`unvoiced speech segments. Subjective test results indicate that the
`performance of the 2.8 kbps EWI coder slightly exceeds that of G.723.1
`at 6.3 kbps and therefore EWI achieves very close to toll quality, at least
`under clean speech conditions.
`
6. REFERENCES
[1] W. B. Kleijn and J. Haagen, "A Speech Coder Based on Decomposition of
Characteristic Waveforms," IEEE ICASSP'95, pp. 508-511, 1995.
[2] W. B. Kleijn and J. Haagen, "Waveform Interpolation for Coding and
Synthesis," in Speech Coding and Synthesis by W. B. Kleijn and K. K. Paliwal,
Elsevier Science B. V., Chapter 5, pp. 175-207, 1995.
[3] I. S. Burnett and D. H. Pham, "Multi-Prototype Waveform Coding using
Frame-by-Frame Analysis-by-Synthesis," IEEE ICASSP'97, pp. 1567-1570,
1997.
[4] W. B. Kleijn, Y. Shoham, D. Sen, and R. Haagen, "A Low-Complexity
Waveform Interpolation Coder," IEEE ICASSP'96, pp. 212-215, 1996.
[5] Y. Shoham, "Very Low Complexity Interpolative Speech Coding at 1.2 to 2.4
kbps," IEEE ICASSP'97, pp. 1599-1602, 1997.
[6] Y. Shoham, "Low-Complexity Speech Coding at 1.2 to 2.4 kbps Based on
Waveform Interpolation," International Journal of Speech Technology,
Kluwer Academic Publishers, pp. 329-341, May 1999.
[7] O. Gottesman, "Dispersion Phase Vector Quantization for Enhancement of
Waveform Interpolative Coder," IEEE ICASSP'99, vol. 1, pp. 269-272, 1999.
[8] O. Gottesman and A. Gersho, "Enhanced Waveform Interpolative Coding at 4
kbps," IEEE Speech Coding Workshop, pp. 90-92, 1999, Finland.
[9] O. Gottesman and A. Gersho, "Enhanced Analysis-by-Synthesis Waveform
Interpolative Coding at 4 kbps," EUROSPEECH'99, pp. 1443-1446, 1999,
Hungary.
[10] O. Gottesman and A. Gersho, "High Quality Enhanced Waveform
Interpolative Coding at 2.8 kbps," IEEE ICASSP'00, Turkey, 2000.
[11] ITU-T, "Recommendation P.830, Subjective Performance Assessment of
Telephone Band and Wideband Digital Codecs," Annex D, ITU, Geneva,
February 1996.
`
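[Block diagram. Input: REW polynomial coefficients $\boldsymbol{\gamma}(m)$, mapped to the parameter $\varphi(\boldsymbol{\gamma}(m))$. Blocks: vector quantizer codebook ($\hat{c}_i(m)$), synthesis filter $1/(1 - P z^{-1})$ producing $\hat{\varphi}(m)$, difference, $\|\cdot\|^2$, and $\min(\cdot)$.]
Fig. 3. REW Parametric Representation AbS VQ.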
`2.4.2.2 Simplified Scheme, Weighted Distortion
We may improve the quantization scheme to incorporate spectral and
temporal weightings, as illustrated in Fig. 4. The REW coefficient vector
is first mapped to the REW parameter by minimizing a distortion, which is
weighted by the coefficient spectral weighting matrix $\Psi(W)$, as described
in section 2.3.2. Then, the resulting REW parameter is used to compute a
weighting, $w_s(\varphi(m))$, in the form of the spectral sensitivity to the REW
parameter squared quantization error, $(\varphi(m) - \hat{\varphi}(m))^2$, given by:

$$w_s(\varphi(m)) = \left[ \frac{\partial \boldsymbol{\gamma}(\varphi)}{\partial \varphi} \right]^T \Psi(W(m)) \left[ \frac{\partial \boldsymbol{\gamma}(\varphi)}{\partial \varphi} \right] \Bigg|_{\varphi = \varphi(m)} \qquad (15)$$

For the piecewise linear representation case it is equal to:

$$w_s(\varphi(m)) = \frac{1}{\Delta^2} \Delta\hat{\boldsymbol{\gamma}}_n^T \Psi(W(m)) \Delta\hat{\boldsymbol{\gamma}}_n; \quad \Delta\hat{\boldsymbol{\gamma}}_n = \hat{\boldsymbol{\gamma}}_n - \hat{\boldsymbol{\gamma}}_{n-1}; \quad \hat{\varphi}_{n-1} \le \varphi(m) \le \hat{\varphi}_n \qquad (16)$$

The above derivative can be computed off line. Additionally, a temporal
weighting, in the form of a monotonic function of the gain, may be used,
and denoted by $w_t(g(m))$. The AbS REW parameter quantization is
computed by minimizing the combined spectrally and temporally
weighted distortion:

$$D_w(\varphi, \hat{\varphi}) = \sum_{m=1}^{M} w_t(g(m)) \, w_s(\varphi(m)) \, (\varphi(m) - \hat{\varphi}(m))^2 \qquad (17)$$
`The weighted distortion scheme improves the reconstructed speech
`quality, most notably in mixed voiced and unvoiced speech segments.
`This may be explained by an improvement in REW/SEW mixing, and a
`less destructive REW contribution.
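The combined weighting can be illustrated with a short Python sketch that computes the spectral sensitivity of one segment as in (16) and then the weighted distortion as in (17). The paper states only that the temporal weighting is a monotonic function of the gain, so the identity function used in `temporal_weight` is a placeholder assumption; the function names and the toy numbers are likewise hypothetical.

```python
def spectral_sensitivity(lower, upper, psi, delta):
    """Sensitivity w_s for one segment: (1 / delta^2) * d^T psi d, with d = upper - lower."""
    diff = [b - a for a, b in zip(lower, upper)]
    quad = sum(diff[i] * psi[i][j] * diff[j]
               for i in range(len(diff)) for j in range(len(diff)))
    return quad / (delta * delta)

def temporal_weight(gain):
    """Placeholder monotonic temporal weighting (assumption: identity)."""
    return gain

def weighted_distortion(phi, phi_hat, gains, sensitivities):
    """Combined spectrally and temporally weighted distortion, as in (17)."""
    return sum(temporal_weight(g) * s * (p - q) ** 2
               for p, q, g, s in zip(phi, phi_hat, gains, sensitivities))

# Toy segment with spacing delta = 0.5 and a hypothetical weighting matrix.
w_s = spectral_sensitivity([0.0, 0.0], [1.0, 1.0], [[2.0, 0.0], [0.0, 1.0]], 0.5)
total = weighted_distortion([0.5, 0.5], [0.25, 0.5], [2.0, 1.0], [w_s, w_s])
```

Since the sensitivities depend only on the representation vectors and the weighting matrix, they can be prepared before the codebook search, leaving one multiply-accumulate per parameter inside the search loop.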
[Block diagram. Inputs: gain $g(m)$, LPC $a(m)$, and REW polynomial coefficients $\boldsymbol{\gamma}(m)$. Blocks: spectral weighting, mapping to the parameter $\varphi(\boldsymbol{\gamma}(m))$, vector quantizer codebook ($\hat{c}_i(m)$), synthesis filter $1/(1 - P z^{-1})$ producing $\hat{\varphi}(m)$, temporal/spectral weighting, $\|\cdot\|^2$, and $\min(\cdot)$.]
Fig. 4. REW Parametric Representation Simplified Weighted AbS VQ
`3. BIT ALLOCATION
`The bit allocation for the 2.8 kbps EWI coder is given in Table 1. The
`frame length is 20 ms, and ten waveforms are extracted per frame. The
`line spectral frequencies (LSFs) are coded using predictive MSVQ,
having two stages of 10 bits each, a 2-bit increase compared to the past
`version of our coder [8], [9]. The pitch is coded twice per frame.
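The figures in Table 1 can be cross-checked with a few lines of Python: a 20 ms frame gives 50 frames per second, so each per-frame allocation multiplied by 50 yields the per-second rate. The variable names are chosen for this example only.

```python
FRAME_MS = 20  # frame length stated in this section

# Bits per frame for each parameter, as listed in Table 1.
bits_per_frame = {
    "LPC": 20,            # two MSVQ stages of 10 bits each
    "Pitch": 12,          # coded twice per frame, 6 bits each
    "Gain": 9,
    "SEW magnitude": 8,
    "REW magnitude": 7,
}

frames_per_second = 1000 // FRAME_MS
bits_per_second = {name: bits * frames_per_second
                   for name, bits in bits_per_frame.items()}
total_bits = sum(bits_per_frame.values())
bit_rate = total_bits * frames_per_second
```

The totals come to 56 bits per frame and 2800 bits per second, matching the last row of the table.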
`
`
`