`Oded Gottesman and Allen Gersho
`
`Signal Compression Laboratory
`Department of Electrical and Computer Engineering
`University of California
`Santa Barbara, California 93106, USA
`E-mail: [oded, gersho]@scl.ece.ucsb.edu
`
This work was supported in part by the University of California MICRO
program, ACT Networks, Inc., Cisco Systems, Inc., Conexant Systems, Inc.,
Dialogic Corp., DSP Group, Inc., Fujitsu Laboratories of America, Inc.,
General Electric Corp., Hughes Network Systems, Intel Corp., Lernout &
Hauspie Speech Products NV, Lucent Technologies, Inc., Nokia Mobile
Phones, Panasonic Speech Technology Laboratory, Qualcomm, Inc., Sun
Microsystems Inc., and Texas Instruments, Inc.

ABSTRACT

This paper presents an Enhanced Waveform Interpolative (EWI)
speech coder at 4 kbps. The system incorporates novel features
such as analysis-by-synthesis (AbS) vector-quantization (VQ) of
the dispersion-phase, AbS optimization of the slowly evolving
waveform (SEW), a special pitch search for transitions, and
switched-predictive analysis-by-synthesis gain VQ. Subjective
quality tests indicate that its quality exceeds that of MPEG-4 at
4 kbps and of G.723.1 at 5.3 kbps, and is slightly better than
that of G.723.1 at 6.3 kbps.

1. INTRODUCTION

Recently, there has been growing interest in developing toll-
quality speech coders at rates of 4 kbps and below. The speech
quality produced by waveform coders such as the code-excited
linear predictive (CELP) coder [1] degrades rapidly at rates
below 5 kbps. On the other hand, parametric coders such as the
waveform-interpolative (WI) coder [4]-[6], the sinusoidal-
transform coder (STC) [2], and the multiband-excitation (MBE)
coder [3] produce good quality at low rates, but they do not
achieve toll quality. This is mainly due to a lack of robustness in
the parameter estimation, which is commonly done in open loop,
and to inadequate modeling of non-stationary speech segments.
In this work we propose a paradigm which incorporates an AbS
approach in the parameter estimation, and a special pitch search
for the non-stationary segments.

In parametric coders the phase information is commonly not
transmitted, for two reasons: first, the phase is of secondary
perceptual significance; and second, no efficient phase
quantization scheme is known. WI coders [4]-[6] typically use a
fixed phase vector for the SEW; for example, in [5] a fixed phase
extracted from a male speaker was used. On the other hand,
waveform coders such as CELP [1], by directly quantizing the
waveform, implicitly allocate an excessive number of bits to the
phase information, more than is perceptually required. Recently
[7], we proposed a novel, efficient AbS VQ encoding of the
dispersion phase of the excitation signal to enhance the
performance of the WI coder at a very low bit-rate, which can be
used for parametric coders as well as for waveform coders. The
EWI coder employs this scheme, which incorporates perceptual
weighting and does not require any phase unwrapping.

The WI coders use non-ideal low-pass filters for downsampling
and upsampling of the SEW. We describe a novel AbS SEW
quantization scheme, which takes the non-ideal filters into
consideration. An improved match between reconstructed and
original SEW is obtained, most notably in the transitions.

Pitch accuracy is crucial for high quality reproduced speech in
WI coders. We introduce a novel pitch search technique based on
varying segment boundaries; it allows for locking onto the most
probable pitch period during transitions or other segments with
rapidly varying pitch.

Commonly in speech coding the gain sequence is downsampled
and interpolated; as a result, it is often smeared during plosives
and onsets. To alleviate this problem a novel switched-
predictive AbS gain VQ scheme is introduced; it is based on
temporal weighting.

This paper is organized as follows. In Section 2 we explain the
AbS SEW optimization. The dispersion phase quantizer is
discussed in Section 3. Section 4 describes the pitch search. In
Section 5 we present the switched-predictive AbS gain VQ. The
bit allocation is given in Section 6. Subjective results are
reported in Section 7. Finally, we summarize our work in
Section 8.

2. AbS SEW OPTIMIZATION

Commonly in WI coders the SEW is distorted by downsampling
and upsampling with non-ideal low-pass filters. In order to
reduce such distortion, an optimal SEW vector is calculated and
quantized. Consider the accumulated weighted distortion, $D_{wI}$,
between the input SEW vectors, $\mathbf{r}_m$, and the interpolated
vectors, $\tilde{\mathbf{r}}_m$, given by:

$$
D_{wI} = \sum_{m=1}^{M} [\mathbf{r}_m - \tilde{\mathbf{r}}_m]^H \mathbf{W}_m [\mathbf{r}_m - \tilde{\mathbf{r}}_m]
+ \sum_{m=M+1}^{M+L-1} [1 - \alpha(t_m)]^2 \, [\mathbf{r}_m - \tilde{\mathbf{r}}_M]^H \mathbf{W}_m [\mathbf{r}_m - \tilde{\mathbf{r}}_M]
\tag{1}
$$
`
where M is the number of waveforms per frame, L is the number of
lookahead waveforms, $\alpha(t)$ is some increasing interpolation
function in the range $0 \le \alpha(t) \le 1$, and $\mathbf{W}_m$ is a
diagonal matrix whose elements, $w_{kk}$, are the combined spectral
weighting and synthesis of the k-th harmonic, given by:
`

$$
w_{kk} = \left| \frac{g \, A(z/\gamma_1)}{\hat{A}(z) \, A(z/\gamma_2)} \right|^2_{z = e^{j(2\pi/P)k}} ; \quad k = 1, \ldots, K
\tag{2}
$$

where P is the pitch period, K is the number of harmonics, g is
the gain, $A(z)$ and $\hat{A}(z)$ are the input and the quantized LPC
polynomials respectively, and the spectral weighting parameters
satisfy $0 \le \gamma_2 < \gamma_1 \le 1$. The interpolated SEW vectors
are given by:

$$
\tilde{\mathbf{r}}_m = [1 - \alpha(t_m)] \, \hat{\mathbf{r}}_0 + \alpha(t_m) \, \mathbf{r}_{M,opt} ; \quad m = 1, \ldots, M
\tag{3}
$$

where $\hat{\mathbf{r}}_0$ is the quantized SEW at the previous frame. The
optimal vector, $\mathbf{r}_{M,opt}$, which minimizes $D_{wI}$, is given by:

$$
\mathbf{r}_{M,opt} = \mathbf{W}_{M,opt}^{-1} \left[ \sum_{m=1}^{M} \alpha(t_m) \, \mathbf{W}_m \left[ \mathbf{r}_m - [1 - \alpha(t_m)] \, \hat{\mathbf{r}}_0 \right]
+ \sum_{m=M+1}^{M+L-1} [1 - \alpha(t_m)]^2 \, \mathbf{W}_m \mathbf{r}_m \right]
\tag{4}
$$

where

$$
\mathbf{W}_{M,opt} = \sum_{m=1}^{M} \alpha(t_m)^2 \, \mathbf{W}_m + \sum_{m=M+1}^{M+L-1} [1 - \alpha(t_m)]^2 \, \mathbf{W}_m
\tag{5}
$$

This optimized vector is then quantized using WMSE weighted
by $\mathbf{W}_{M,opt}$. An improved match between reconstructed and
original SEW is obtained, most notably in the transitions.

3. AbS PHASE QUANTIZATION

The dispersion-phase quantization scheme [7] is illustrated in
Figure 1. Consider a pitch cycle which is extracted from the
residual signal, and is cyclically shifted such that its pulse is
located at position zero. Let its DFT be denoted by $\mathbf{r}$; the
resulting DFT phase is the dispersion phase, $\boldsymbol{\varphi}$, which
determines, along with the magnitude $|\mathbf{r}|$, the waveform's pulse
shape. After quantization, the components of the quantized
magnitude vector, $|\hat{\mathbf{r}}|$, are multiplied by the exponential of the
quantized phases, $\hat{\varphi}(k)$, to yield the quantized waveform DFT,
$\hat{\mathbf{r}}$, which is subtracted from the input DFT to produce the error
DFT. The error DFT is then transformed to the perceptual
domain by weighting it by the combined synthesis and weighting
filter W(z). The encoder searches for the phase that minimizes
the energy of the perceptual domain error, allowing a refining
cyclic shift of the input waveform during the search, to eliminate
any residual phase shift between the input waveform and the
quantized waveform. Phase dispersion quantization aims to
improve waveform matching. Efficient AbS quantization can be
obtained by using the perceptually weighted distortion measure:

$$
D_w(\mathbf{r}, \hat{\mathbf{r}}) = (\mathbf{r} - \hat{\mathbf{r}})^H \, \mathbf{W} \, (\mathbf{r} - \hat{\mathbf{r}}) / K
\tag{6}
$$

The magnitude is perceptually more significant than the phase,
and should therefore be quantized first. Furthermore, if the phase
were quantized first, the very limited bit allocation available for
the phase would lead to an excessively degraded spectral
matching of the magnitude in favor of a somewhat improved, but
less important, matching of the waveform. For the above
distortion measure, the quantized phase vector is given by [7]:

$$
\hat{\boldsymbol{\varphi}} = \arg\min_{\hat{\boldsymbol{\varphi}}_i} \left\{ (\mathbf{r} - e^{j\hat{\boldsymbol{\varphi}}_i} |\hat{\mathbf{r}}|)^H \, \mathbf{W} \, (\mathbf{r} - e^{j\hat{\boldsymbol{\varphi}}_i} |\hat{\mathbf{r}}|) \right\}
\tag{7}
$$

where i is the running phase codebook index, and $e^{j\hat{\boldsymbol{\varphi}}_i}$ is the
respective diagonal phase exponent matrix. The AbS search for
phase quantization is based on evaluating (7) for each candidate
phase codevector. Since only trigonometric functions of the
phase candidates are used, phase unwrapping is avoided.

[Figure 1 block diagram. Blocks: pitch-cycle waveform's DFT; crude linear-phase alignment; refined linear-phase alignment; magnitude codebook ($|\hat{\mathbf{r}}|$); phase codebook ($e^{j\hat{\boldsymbol{\varphi}}}$); multiplier producing $\hat{\mathbf{r}}$; subtraction of $\hat{\mathbf{r}}$ from $\mathbf{r}$; W(z); min ||·||^2.]
`
`Figure 1. Block diagram of the AbS dispersion phase’s
`vector quantization.
`
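The two AbS steps of Sections 2 and 3 both reduce to weighted least-squares computations with a diagonal weight, so they can be sketched per harmonic in plain Python. The following is a minimal sketch under stated assumptions, not the coder's implementation: the function names, the list-of-lists data layout, and the toy data are hypothetical, the weights are assumed positive, and the refining cyclic shift used during the phase search is omitted.

```python
import cmath


def optimal_sew(r, w, alpha, r0_hat, M):
    """Closed-form minimizer of eq. (1), i.e. eqs. (4)-(5), for diagonal W_m.

    r[m][k], w[m][k]: input SEW harmonics (complex) and diagonal weights
    (positive reals) for waveforms m = 1..M+L-1, stored 0-based.
    alpha[m]: interpolation weights alpha(t_m); r0_hat: previous-frame SEW.
    """
    K = len(r0_hat)
    num = [0j] * K   # bracketed sum of eq. (4), one entry per harmonic
    den = [0.0] * K  # diagonal of W_{M,opt}, eq. (5)
    for m, (r_m, w_m, a) in enumerate(zip(r, w, alpha)):
        for k in range(K):
            if m < M:  # waveforms inside the current frame
                num[k] += a * w_m[k] * (r_m[k] - (1 - a) * r0_hat[k])
                den[k] += a * a * w_m[k]
            else:      # lookahead waveforms
                num[k] += (1 - a) ** 2 * w_m[k] * r_m[k]
                den[k] += (1 - a) ** 2 * w_m[k]
    return [n / d for n, d in zip(num, den)], den


def accumulated_distortion(r_end, r, w, alpha, r0_hat, M):
    """Eq. (1), with r_end as the end-of-frame vector used in eq. (3)."""
    total = 0.0
    for m, (r_m, w_m, a) in enumerate(zip(r, w, alpha)):
        for k in range(len(r0_hat)):
            if m < M:
                err = r_m[k] - ((1 - a) * r0_hat[k] + a * r_end[k])
                total += w_m[k] * abs(err) ** 2
            else:
                total += (1 - a) ** 2 * w_m[k] * abs(r_m[k] - r_end[k]) ** 2
    return total


def search_phase_codebook(r, mag_hat, w, phase_codebook):
    """Eq. (7): exhaustive search for the phase codevector that minimizes the
    weighted error between r and e^{j*phi_i} |r_hat| (diagonal W)."""
    best_index, best_dist = -1, float("inf")
    for i, phi in enumerate(phase_codebook):
        dist = sum(
            w_k * abs(r_k - m_k * cmath.exp(1j * p_k)) ** 2
            for r_k, m_k, w_k, p_k in zip(r, mag_hat, w, phi)
        )
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index, best_dist


# Toy data (hypothetical): M = 2 frame waveforms, L = 2, K = 2 harmonics.
r0_hat = [1 + 0j, 0.5j]
r = [[1.2 + 0.1j, 0.4j], [1.5 + 0j, 0.2 + 0.3j], [1.6 - 0.1j, 0.1 + 0.2j]]
w = [[1.0, 2.0], [1.5, 0.5], [1.0, 1.0]]
alpha = [0.5, 1.0, 0.25]
r_opt, w_opt = optimal_sew(r, w, alpha, r0_hat, M=2)
print(accumulated_distortion(r_opt, r, w, alpha, r0_hat, M=2))
```

Because only `cmath.exp` of the candidate phases is evaluated, the search never needs an unwrapped phase, which matches the remark following (7). The division by K in (6) is dropped in the search since it does not change the minimizing index.
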
4. PITCH SEARCH
`
`The pitch search is based on varying segment boundaries. It
`allows for locking onto the most probable pitch period even
`during transitions or other segments with rapidly varying pitch.
Initially, pitch periods, $P(n_i)$, are searched every 2 ms at
instants $n_i$ by maximizing the normalized correlation of the
weighted speech $s_w(n)$, that is:

$$
P(n_i) = \arg\max_{\tau, N_1, N_2} \left\{ \rho(n_i, \tau, N_1, N_2) \right\}
= \arg\max_{\tau, N_1, N_2} \left\{
\frac{\sum_{n = n_i - N_1 \Delta}^{n_i + \tau + N_2 \Delta} s_w(n) \, s_w(n - \tau)}
{\sqrt{\sum_{n = n_i - N_1 \Delta}^{n_i + \tau + N_2 \Delta} s_w(n) \, s_w(n) \; \sum_{n = n_i - N_1 \Delta}^{n_i + \tau + N_2 \Delta} s_w(n - \tau) \, s_w(n - \tau)}}
\right\}
\tag{8}
$$
`
where $\Delta$ is some incremental segment used in the summations
for computational simplicity, and $0 \le N_j \le \lfloor 160 / \Delta \rfloor$.
Then, every 10 ms a weighted-mean pitch value is calculated by:
`
$$
P_{mean} = \sum_{i=1}^{5} \rho(n_i) \, P(n_i) \Big/ \sum_{i=1}^{5} \rho(n_i)
\tag{9}
$$
`
where $\rho(n_i)$ is the normalized correlation for $P(n_i)$.
`
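The search over lag and segment boundaries in (8), followed by the weighted mean of (9), can be sketched as a brute-force loop. This is a minimal sketch under assumptions: the function names, the candidate lag range, the boundary limit `n_max`, and the synthetic test signal are hypothetical, and the normalization by the geometric mean of the two segment energies is the usual definition of normalized correlation rather than a detail confirmed by the text.

```python
import math


def normalized_correlation(sw, n_i, tau, n1, n2, delta):
    """rho(n_i, tau, N1, N2) of eq. (8) over n = n_i - N1*delta .. n_i + tau + N2*delta."""
    lo = n_i - n1 * delta
    hi = n_i + tau + n2 * delta
    if lo - tau < 0 or hi >= len(sw):
        return 0.0  # segment falls outside the available signal
    num = e_cur = e_lag = 0.0
    for n in range(lo, hi + 1):
        num += sw[n] * sw[n - tau]
        e_cur += sw[n] * sw[n]
        e_lag += sw[n - tau] * sw[n - tau]
    den = math.sqrt(e_cur * e_lag)
    return num / den if den > 0.0 else 0.0


def search_pitch(sw, n_i, tau_candidates, n_max, delta):
    """Maximize rho jointly over the lag tau and the boundaries N1, N2."""
    best_tau, best_rho = None, -1.0
    for tau in tau_candidates:
        for n1 in range(n_max + 1):
            for n2 in range(n_max + 1):
                rho = normalized_correlation(sw, n_i, tau, n1, n2, delta)
                if rho > best_rho:
                    best_tau, best_rho = tau, rho
    return best_tau, best_rho


def weighted_mean_pitch(pitches, rhos):
    """Eq. (9): correlation-weighted mean of the pitch estimates."""
    return sum(rho * p for rho, p in zip(rhos, pitches)) / sum(rhos)


# Synthetic periodic signal with a 40-sample period (hypothetical test input).
sw = [math.sin(2 * math.pi * n / 40) + 0.5 * math.sin(4 * math.pi * n / 40)
      for n in range(400)]
tau, rho = search_pitch(sw, n_i=200, tau_candidates=range(20, 61), n_max=2, delta=10)
print(tau, rho)
```

In the coder, five such estimates (one every 2 ms) would feed `weighted_mean_pitch` every 10 ms, so estimates with weak correlation contribute less to the mean.
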
5. GAIN QUANTIZATION
`
`The gain trajectory is commonly smeared during plosives and
`onsets by downsampling and interpolation. We address this
`problem and improve speech crispness with a novel Switched-
`Predictive AbS Gain VQ technique, illustrated in Figure 2.
`
`Switched-prediction is introduced to allow for different levels of
`gain correlation, and to reduce the occurrence of gain outliers. In
`order to improve speech crispness, especially for plosives and
`onsets, temporal weighting is incorporated in the AbS gain VQ.
`The weighting is a monotonic function of the temporal gain.
`Two codebooks of 32 vectors each are used. Each codebook has
`an associated predictor coefficient, Pi, and a DC offset Di. The
`quantization target vector is the DC removed log-gain vector
`denoted by t(m). The search for the minimal WMSE is
performed over all the vectors, cij(m), of the codebooks. The
quantized target, $\hat{t}(m)$, is obtained by passing the quantized
vector, cij(m), through the synthesis filter. Since each quantized
`target vector may have a different value of the removed DC, the
`quantized DC is added temporarily to the filter memory after the
`state update, and the next quantized vector’s DC is subtracted
`from it before filtering is performed. Since the predictor
`coefficients are known, direct VQ can be used to simplify the
`computations.
`
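The search described above can be sketched as a nested loop over the two switched codebooks and their codevectors. This is a minimal sketch under assumptions: the function name, the toy codebooks (two vectors each instead of 32), the predictor and DC values, and the choice of `math.exp` of the log-gain as the monotonic temporal weighting are all hypothetical. The synthesis filter is taken as 1/(1 - Pi z^-1), and the filter memory is kept with its DC so that each candidate codebook subtracts its own DC offset before filtering.

```python
import math


def quantize_gain_vector(g, prev_gain_hat, codebooks, predictors, dc_offsets,
                         weight=math.exp):
    """Switched-predictive AbS search for one log-gain vector g.

    prev_gain_hat: last quantized log-gain (filter memory, DC included).
    Returns (weighted error, codebook index i, vector index j, quantized gains).
    """
    best = None
    for i, (codebook, p_i, d_i) in enumerate(zip(codebooks, predictors, dc_offsets)):
        target = [x - d_i for x in g]      # t(m): DC-removed log-gain
        for j, c in enumerate(codebook):
            state = prev_gain_hat - d_i    # memory with this codebook's DC removed
            t_hat, err = [], 0.0
            for m, c_m in enumerate(c):
                state = c_m + p_i * state  # synthesis filter 1 / (1 - P_i z^-1)
                t_hat.append(state)
                # temporal weighting: larger gains weigh more (assumed form)
                err += weight(g[m]) * (target[m] - state) ** 2
            if best is None or err < best[0]:
                best = (err, i, j, [t + d_i for t in t_hat])
    return best


# Toy setup (hypothetical values).
codebooks = [[[0.0, 0.0], [0.5, -0.5]], [[1.0, -1.0], [0.2, 0.1]]]
predictors = [0.9, 0.3]
dc_offsets = [2.0, 1.0]
err, i, j, g_hat = quantize_gain_vector([2.15, 0.345], 1.5,
                                        codebooks, predictors, dc_offsets)
print(i, j, g_hat)
```

Running the synthesis filter inside the loop is what makes the search analysis-by-synthesis: each candidate is judged on the reconstructed gain trajectory rather than on the prediction residual.
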
[Figure 2 block diagram. Signals and blocks: log-gain g(m); DC codebook (Di); predictor codebook (Pi); vector quantizer codebook (cij(m)); synthesis filter $1/(1 - P_i z^{-1})$; quantized target $\hat{t}(m)$; target t(m); temporal weighting; min ||·||^2.]
`
`Figure 2. Switched-Predictive Analysis-by-Synthesis
`gain VQ using temporal weighting.
`
6. BIT ALLOCATION
`
`The bit allocation of the coder is given in Table 1. The frame
`length is 20 ms, and ten waveforms are extracted per frame. The
`pitch and the gain are coded twice per frame.
`
Parameter        Bits / Frame   Bits / second
LPC              18             900
Pitch            2x6=12         600
Gain             2x6=12         600
REW              20             1000
SEW magnitude    14             700
SEW phase        4              200
Total            80             4000
`
`Table 1. Bit allocation for EWI coder
`
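The per-second figures in Table 1 follow directly from the 20 ms frame length, that is, 50 frames per second. A short check of the arithmetic (the variable names are illustrative only):

```python
FRAME_MS = 20  # frame length stated in the text

BITS_PER_FRAME = {
    "LPC": 18,
    "Pitch": 2 * 6,   # coded twice per frame
    "Gain": 2 * 6,    # coded twice per frame
    "REW": 20,
    "SEW magnitude": 14,
    "SEW phase": 4,
}

frames_per_second = 1000 // FRAME_MS
bits_per_second = {name: bits * frames_per_second
                   for name, bits in BITS_PER_FRAME.items()}
total_bits = sum(BITS_PER_FRAME.values())
total_rate = total_bits * frames_per_second

for name, rate in bits_per_second.items():
    print(f"{name}: {rate} bps")
print(f"Total: {total_bits} bits/frame, {total_rate} bps")
```
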
7. SUBJECTIVE RESULTS

We have conducted a subjective A/B test to compare our 4 kbps
EWI coder to MPEG-4 at 4 kbps, and to G.723.1. The test data
included 24 MIRS speech sentences, 12 spoken by female
speakers and 12 by male speakers. Fourteen listeners
participated in the test. The test results, listed in Tables 2 to 4,
indicate that the subjective quality of EWI exceeds that of
MPEG-4 at 4 kbps and of G.723.1 at 5.3 kbps, and it is slightly
better than that of G.723.1 at 6.3 kbps.
`
Test      4 kbps WI   4 kbps MPEG-4
Female    65.48%      34.52%
Male      61.90%      38.10%
Total     63.69%      36.31%
`
Table 2. Results of the subjective A/B test comparing the 4 kbps
WI coder with 4 kbps MPEG-4. With 95% certainty the WI
preference lies in [58.63%, 68.75%].
`
Test      4 kbps WI   5.3 kbps G.723.1
Female    57.74%      42.26%
Male      61.31%      38.69%
Total     59.52%      40.48%
`
Table 3. Results of the subjective A/B test comparing the 4 kbps
WI coder with 5.3 kbps G.723.1. With 95% certainty the WI
preference lies in [54.17%, 64.88%].
`
Test      4 kbps WI   6.3 kbps G.723.1
Female    54.76%      45.24%
Male      52.98%      47.02%
Total     53.87%      46.13%
`
Table 4. Results of the subjective A/B test comparing the 4 kbps
WI coder with 6.3 kbps G.723.1. With 95% certainty the WI
preference lies in [48.51%, 59.23%].
`
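The intervals quoted in Tables 2 to 4 can be roughly cross-checked with a normal-approximation interval for a proportion. This is a sketch under assumptions and not necessarily the method used for the tables: it assumes each of the 14 listeners judged each of the 24 sentences once (336 votes per comparison), and the vote counts below are inferred from the reported total percentages rather than taken from the paper.

```python
import math

VOTES = 24 * 14  # assumption: every listener judged every sentence once


def preference_interval(wins, votes=VOTES, z=1.96):
    """Normal-approximation 95% interval for a preference proportion."""
    p = wins / votes
    half_width = z * math.sqrt(p * (1.0 - p) / votes)
    return p, p - half_width, p + half_width


# Vote counts inferred from the "Total" rows (hypothetical reconstruction).
wi_wins = {
    "MPEG-4 at 4 kbps": 214,
    "G.723.1 at 5.3 kbps": 200,
    "G.723.1 at 6.3 kbps": 181,
}

for name, wins in wi_wins.items():
    p, lo, hi = preference_interval(wins)
    print(f"WI vs {name}: {100 * p:.2f}% [{100 * lo:.2f}%, {100 * hi:.2f}%]")
```

Under these assumptions the computed intervals land within a small fraction of a percentage point of the reported ones, which suggests the tables used a similar binomial interval.
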
8. SUMMARY

We have found that the performance of the WI coder can be
enhanced by adding several new techniques. The most
significant of these, reported here, are analysis-by-synthesis
vector quantization of the dispersion phase, AbS optimization of
the SEW, a special pitch search for transitions, and switched-
predictive analysis-by-synthesis gain VQ. These features
improve the algorithm and its robustness. The test results
indicate that the EWI coder slightly exceeds the G.723.1 coder's
performance at 6.3 kbps and therefore it is very close to toll
quality, at least under clean speech conditions.
`
9. REFERENCES

[1] B. S. Atal and M. R. Schroeder, "Stochastic Coding of Speech at Very
Low Bit Rate", Proc. Int. Conf. Comm., Amsterdam, pp. 1610-1613,
1984.
[2] R. J. McAulay and T. F. Quatieri, "Speech Analysis-Synthesis Based
on a Sinusoidal Representation", IEEE Trans. ASSP, Vol. 34, No. 4,
pp. 744-754, 1986.
[3] D. Griffin and J. S. Lim, "Multiband Excitation Vocoder", IEEE
Trans. ASSP, Vol. 36, No. 8, pp. 1223-1235, August 1988.
[4] Y. Shoham, "High Quality Speech Coding at 2.4 to 4.0 kbps Based on
Time-Frequency-Interpolation", IEEE ICASSP'93, Vol. II, pp. 167-170,
1993.
[5] W. B. Kleijn and J. Haagen, "Waveform Interpolation for Coding and
Synthesis", in Speech Coding and Synthesis, W. B. Kleijn and K. K.
Paliwal, Eds., Elsevier Science B. V., Chapter 5, pp. 175-207, 1995.
[6] I. S. Burnett and D. H. Pham, "Multi-Prototype Waveform Coding
using Frame-by-Frame Analysis-by-Synthesis", IEEE ICASSP'97, pp.
1567-1570, 1997.
[7] O. Gottesman, "Dispersion Phase Vector Quantization For
Enhancement of Waveform Interpolative Coder", IEEE ICASSP'99,
Vol. 1, pp. 269-272, 1999.
`
`