ENHANCED WAVEFORM INTERPOLATIVE CODING AT 4 KBPS
`Oded Gottesman and Allen Gersho
`
`Signal Compression Laboratory
`Department of Electrical and Computer Engineering
`University of California
`Santa Barbara, California 93106, USA
`E-mail: [oded, gersho]@scl.ece.ucsb.edu
`
ABSTRACT
This paper presents an Enhanced analysis-by-synthesis (AbS)
Waveform Interpolative (EWI) speech coder at 4 kbps. The
system incorporates novel features such as: AbS quantization of
the slowly evolving waveform (SEW), AbS vector quantization
(VQ) of the dispersion phase, a special pitch search for
transitions, and switched-predictive analysis-by-synthesis gain
VQ. Subjective quality tests indicate that its quality exceeds
that of MPEG-4 at 4 kbps and of G.723.1 at 5.3 kbps, and is
slightly better than that of G.723.1 at 6.3 kbps.

1. INTRODUCTION

Recently, there has been growing interest in developing toll-
quality speech coders at rates of 4 kbps and below. The speech
quality produced by waveform coders such as code-excited
linear prediction (CELP) coders [1] degrades rapidly at rates
below 5 kbps. On the other hand, parametric coders such as the
waveform-interpolative (WI) coder [4]-[6], the sinusoidal-
transform coder (STC) [2], and the multiband-excitation (MBE)
coder [3] produce good quality at low rates, but they do not
achieve toll quality. This is mainly due to a lack of robustness
in parameter estimation, which is commonly done in open loop,
and to inadequate modeling of non-stationary speech segments.
In this work we propose a paradigm which incorporates AbS for
parameter estimation, and a novel pitch search technique that is
well suited for the non-stationary segments.

In parametric coders the phase information is commonly not
transmitted, for two reasons: first, the phase is of secondary
perceptual significance; and second, no efficient phase
quantization scheme is known. WI coders [4]-[6] typically use a
fixed phase vector for the SEW; for example, in [5], a fixed
phase extracted from a male speaker was used. On the other hand,
waveform coders such as CELP [1], by directly quantizing the
waveform, implicitly allocate an excessive number of bits to the
phase information, more than is perceptually required. Recently
[8], we proposed a novel, efficient AbS VQ encoding of the
dispersion phase of the excitation signal to enhance the
performance of the WI coder at a very low bit-rate, which can be
used for parametric coders as well as for waveform coders. The
EWI coder employs this scheme, which incorporates perceptual
weighting and does not require any phase unwrapping.

The WI coders use non-ideal low-pass filters for downsampling
and upsampling of the SEW. We describe a novel AbS SEW
quantization scheme, which takes the non-ideal filters into
consideration. An improved match between reconstructed and
original SEW is obtained, most notably in the transitions.

Pitch accuracy is crucial for high quality reproduced speech in
WI coders. We introduce a novel pitch search technique based on
varying segment boundaries; it allows for locking onto the most
probable pitch period during transitions or other segments with
rapidly varying pitch.

Commonly in speech coding the gain sequence is downsampled
and interpolated. As a result it is often smeared during plosives
and onsets. To alleviate this problem, we propose a novel
switched-predictive AbS gain VQ scheme based on temporal
weighting.

This paper is organized as follows. In Section 2 we explain the
AbS SEW optimization. The dispersion phase quantizer is
discussed in Section 3. Section 4 describes the pitch search. In
Section 5 we present the switched-predictive AbS gain VQ. The
bit allocation is given in Section 6. Subjective results are
reported in Section 7. Finally, we summarize our work.

This work was supported in part by the University of California MICRO
program, ACT Networks, Inc., Cisco Systems, Inc., Conexant Systems, Inc.,
Dialogic Corp., DSP Group, Inc., Fujitsu Laboratories of America, Inc.,
General Electric Corp., Hughes Network Systems, Intel Corp., Lernout &
Hauspie Speech Products NV, Lucent Technologies, Inc., Nokia Mobile
Phones, Panasonic Speech Technology Laboratory, Qualcomm, Inc., Sun
Microsystems Inc., and Texas Instruments, Inc.

2. AbS SEW QUANTIZATION
`
Commonly in WI coders the SEW is distorted by downsampling
and upsampling with non-ideal low-pass filters. In order to
reduce such distortion, an AbS SEW quantization scheme,
illustrated in Figure 1, was used. An improved match between
reconstructed and original SEW is obtained, most notably in the
transitions. Figure 2 illustrates the improved waveform matching
obtained for a non-stationary speech segment by interpolating
the optimized SEW.

[Figure 1: block diagram. Blocks: LPC analysis and LPC
interpolation; pitch extraction; waveform extraction, alignment
and decomposition; waveform codebooks; waveform synthesizer
(interpolation and lookahead extrapolation); per-waveform
weighting W_m(z)/A_m(z) and error norm; lookahead terms scaled by
[1 - alpha(t_m)]^2; minimization of D_wI.]
Figure 1. Block diagram of the AbS SEW vector quantization.

[Figure 2: three panels titled "Original", "Optimized" and
"Non-optimized"; vertical axis: Amplitude (x 10^4); horizontal
axis: Time (sec), 0.69 to 0.73.]
Figure 2. Example for the improved interpolation by
SEW optimization during a non-stationary speech segment.

Consider the accumulated weighted distortion, $D_{wI}$, between
the input SEW vectors, $r_m$, and the interpolated vectors,
$\tilde{r}_m$, given by:

$$
D_{wI}\left(\hat{r}_M, \{r_m\}_{m=1}^{M+L-1}\right)
= \sum_{m=1}^{M} [r_m - \tilde{r}_m]^H W_m [r_m - \tilde{r}_m]
+ \sum_{m=M+1}^{M+L-1} [1 - \alpha(t_m)]^2 \,
  [r_m - \tilde{r}_M]^H W_m [r_m - \tilde{r}_M]
\tag{1}
$$

where M is the number of waveforms per frame, L is the
lookahead number of waveforms, $\alpha(t)$ is some increasing
interpolation function in the range $0 \le \alpha(t) \le 1$, and
$W_m$ is a diagonal matrix whose elements, $w_{kk}$, are the
combined spectral-weighting and synthesis of the k-th harmonic
given by:

$$
w_{kk} = \frac{1}{K}
\left| \frac{g \, A(z/\gamma_1)}{\hat{A}(z) \, A(z/\gamma_2)} \right|^2_{z = e^{j(2\pi/P)k}}
; \quad k = 1, \ldots, K
\tag{2}
$$

where P is the pitch period, K is the number of harmonics, g is
the gain, $A(z)$ and $\hat{A}(z)$ are the input and the quantized
LPC polynomials respectively, and the spectral weighting
parameters satisfy $0 \le \gamma_2 < \gamma_1 \le 1$. The
interpolated SEW vectors are given by:

$$
\tilde{r}_m = [1 - \alpha(t_m)] \, \hat{r}_0 + \alpha(t_m) \, \hat{r}_M
; \quad m = 1, \ldots, M
\tag{3}
$$

where $\hat{r}_0$ and $\hat{r}_M$ are the quantized SEW at the
previous and at the current frame respectively. It can be shown
that the accumulated distortion in equation (1) is equal to the
sum of modeling distortion and quantization distortion:

$$
D_{wI}\left(\hat{r}_M, \{r_m\}_{m=1}^{M+L-1}\right)
= D_{wI}\left(r_{M,opt}, \{r_m\}_{m=1}^{M+L-1}\right)
+ D_w\left(\hat{r}_M, r_{M,opt}\right)
\tag{4}
$$

where the quantization distortion is given by:

$$
D_w\left(\hat{r}_M, r_{M,opt}\right)
= (\hat{r}_M - r_{M,opt})^H \, W_{M,opt} \, (\hat{r}_M - r_{M,opt})
\tag{5}
$$

The optimal vector, $r_{M,opt}$, which minimizes the modeling
distortion, is given by:

$$
r_{M,opt} = W_{M,opt}^{-1}
\left[ \sum_{m=1}^{M} \alpha(t_m) \, W_m
       \left( r_m - [1 - \alpha(t_m)] \, \hat{r}_0 \right)
     + \sum_{m=M+1}^{M+L-1} [1 - \alpha(t_m)]^2 \, W_m \, r_m \right]
\tag{6}
$$
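The SEW optimization of equations (6) to (8) can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes diagonal weighting matrices stored as lists, a linear interpolation function alpha(t_m) = m/M that restarts for the lookahead waveforms (so that alpha(t_M) = 1 and the interpolated vector at m = M equals the quantized vector), and random toy data. All function names and values are hypothetical.

```python
import random

def alpha(m, M):
    """Assumed linear interpolation weight alpha(t_m); restarts for lookahead (m > M)."""
    return ((m - 1) % M + 1) / M

def accumulated_distortion(c, r, w, r0, M):
    """Equation (1) with diagonal W_m; c is a candidate current-frame SEW vector."""
    total = 0.0
    for m in range(1, len(r) + 1):
        a = alpha(m, M)
        for k in range(len(c)):
            if m <= M:   # current frame: error against the interpolated vector of eq. (3)
                err = r[m - 1][k] - ((1 - a) * r0[k] + a * c[k])
                total += w[m - 1][k] * abs(err) ** 2
            else:        # lookahead: error against c, scaled by [1 - alpha]^2
                err = r[m - 1][k] - c[k]
                total += (1 - a) ** 2 * w[m - 1][k] * abs(err) ** 2
    return total

def optimal_sew(r, w, r0, M):
    """Equations (6) and (7), element by element because every W_m is diagonal."""
    K = len(r0)
    num, w_opt = [0j] * K, [0.0] * K
    for m in range(1, len(r) + 1):
        a = alpha(m, M)
        for k in range(K):
            if m <= M:
                num[k] += a * w[m - 1][k] * (r[m - 1][k] - (1 - a) * r0[k])
                w_opt[k] += a * a * w[m - 1][k]
            else:
                num[k] += (1 - a) ** 2 * w[m - 1][k] * r[m - 1][k]
                w_opt[k] += (1 - a) ** 2 * w[m - 1][k]
    return [n / d for n, d in zip(num, w_opt)], w_opt

def quantization_distortion(c, r_opt, w_opt):
    """Equation (5) for a diagonal W_opt."""
    return sum(wk * abs(ck - rk) ** 2 for ck, rk, wk in zip(c, r_opt, w_opt))

def search(codebook, r_opt, w_opt):
    """Equation (8): index of the codevector closest to r_opt under W_opt."""
    return min(range(len(codebook)),
               key=lambda i: quantization_distortion(codebook[i], r_opt, w_opt))

def rand_vec(K):
    return [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(K)]

# Toy data (hypothetical): M = 4 waveforms per frame, L = 3, K = 5 harmonics.
random.seed(0)
M, L, K = 4, 3, 5
r = [rand_vec(K) for _ in range(M + L - 1)]
w = [[random.uniform(0.5, 2.0) for _ in range(K)] for _ in range(M + L - 1)]
r0 = rand_vec(K)
codebook = [rand_vec(K) for _ in range(8)]

r_opt, w_opt = optimal_sew(r, w, r0, M)
best = search(codebook, r_opt, w_opt)
```

The point of the decomposition is cost: the search of equation (8) evaluates one weighted distance per codevector instead of re-interpolating all M+L-1 waveforms for every candidate, yet it selects the same codevector as a direct minimization of equation (1).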
`
The weighting matrix $W_{M,opt}$ in equation (6) is given by:

$$
W_{M,opt} = \sum_{m=1}^{M} \alpha(t_m)^2 \, W_m
          + \sum_{m=M+1}^{M+L-1} [1 - \alpha(t_m)]^2 \, W_m
\tag{7}
$$

Therefore, VQ with the accumulated distortion of equation (1)
can be simplified by using the distortion of equation (5), and:

$$
\hat{r}_M = \arg\min_{r'_i}
\left\{ (r'_i - r_{M,opt})^H \, W_{M,opt} \, (r'_i - r_{M,opt}) \right\}
\tag{8}
$$

3. AbS PHASE QUANTIZATION

The dispersion-phase quantization scheme [8][9] is illustrated in
Figure 3. Consider a pitch cycle which is extracted from the
residual signal, and is cyclically shifted such that its pulse is
located at position zero. Let its DFT be denoted by $r$; the
resulting DFT phase is the dispersion phase, $\varphi$, which
determines, along with the magnitude $|r|$, the waveform's pulse
shape. After quantization, the components of the quantized
magnitude vector, $|\hat{r}|$, are multiplied by the exponential
of the quantized phases.
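As an illustration of the dispersion-phase idea in Section 3, the sketch below aligns a pitch cycle so that its pulse sits at index zero, takes the DFT phase as the dispersion phase, and searches a phase codebook with the weighted squared error that the section states as equation (10). It is a sketch under assumptions, not the coder's implementation: the pulse is taken to be the largest-magnitude sample, the weighting matrix is diagonal, the refining cyclic shift is omitted, and the three-entry codebook is a toy. All names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Plain O(N^2) DFT, adequate for a single pitch cycle."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))
            for k in range(n_pts)]

def align_pulse(cycle):
    """Cyclically shift the cycle so its pulse is at index 0.
    Assumption: the pulse is the sample of largest magnitude."""
    p = max(range(len(cycle)), key=lambda n: abs(cycle[n]))
    return cycle[p:] + cycle[:p]

def dispersion_phase(cycle):
    """Return (magnitudes, phases) of the DFT of the pulse-aligned cycle."""
    spectrum = dft(align_pulse(cycle))
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]

def weighted_error(r, mag_hat, phase, w):
    """Weighted squared error between r and |r_hat| * exp(j * phase), diagonal W."""
    return sum(wk * abs(rk - mk * cmath.exp(1j * pk)) ** 2
               for rk, mk, pk, wk in zip(r, mag_hat, phase, w))

def search_phase(r, mag_hat, phase_codebook, w):
    """Index of the phase codevector with the smallest weighted error."""
    return min(range(len(phase_codebook)),
               key=lambda i: weighted_error(r, mag_hat, phase_codebook[i], w))

# Toy example (hypothetical values): one 8-sample pitch cycle.
cycle = [0.0, 0.2, 1.0, -0.4, 0.1, 0.0, 0.0, 0.0]
r = dft(align_pulse(cycle))
mags, phases = dispersion_phase(cycle)
phase_codebook = [[0.0] * len(cycle), phases, [p + 0.5 for p in phases]]
weights = [1.0] * len(cycle)
chosen = search_phase(r, mags, phase_codebook, weights)
```

Each candidate phase enters only through `cmath.exp(1j * pk)`, that is, through its cosine and sine, so candidates that differ by multiples of 2*pi are equivalent and no unwrapping step is needed.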
`
`
The product of the quantized magnitudes and the exponential of
the quantized phases, $e^{j\hat{\varphi}(k)}$, yields the
quantized waveform DFT, $\hat{r}$, which is subtracted from the
input DFT to produce the error DFT. The error DFT is then
transformed to the perceptual domain by weighting it by the
combined synthesis and weighting filter W(z)/A(z). The encoder
searches for the phase that minimizes the energy of the
perceptual domain error, allowing a refining cyclic shift of the
input waveform during the search, to eliminate any residual
phase shift between the input waveform and the quantized
waveform. Phase dispersion quantization aims to improve waveform
matching. Efficient AbS quantization can be obtained by using the
perceptually weighted distortion:

$$
D_w(r, \hat{r}) = (r - \hat{r})^H \, W \, (r - \hat{r})
\tag{9}
$$

The magnitude is perceptually more significant than the phase,
and should therefore be quantized first. Furthermore, if the phase
were quantized first, the very limited bit allocation available for
the phase would lead to an excessively degraded spectral
matching of the magnitude in favor of a somewhat improved, but
less important, matching of the waveform. For the above
distortion, the quantized phase vector is given by [8][9]:

$$
\hat{\varphi} = \arg\min_{\hat{\varphi}_i}
\left\{ \left( r - e^{j\hat{\varphi}_i} |\hat{r}| \right)^H W
        \left( r - e^{j\hat{\varphi}_i} |\hat{r}| \right) \right\}
\tag{10}
$$

where i is the running phase codebook index, and
$e^{j\hat{\varphi}_i}$ is the respective diagonal phase exponent
matrix. The AbS search for phase quantization is based on
evaluating (10) for each candidate phase codevector. Since only
trigonometric functions of the phase candidates are used, phase
unwrapping is avoided. The EWI coder uses the optimized SEW,
$r_{M,opt}$, and the optimized weighting, $W_{M,opt}$, for the
AbS phase quantization.

[Figure 3: block diagram. Blocks: pitch-cycle waveform's DFT;
crude linear-phase alignment; refined linear-phase alignment;
magnitude codebook; phase codebook; weighting W(z)/A(z);
minimization of the error norm.]
Figure 3. Block diagram of the AbS dispersion phase
vector quantization.

4. PITCH SEARCH

The pitch search consists of a spectral domain search employed
at 100 Hz and a temporal domain search employed at 500 Hz, as
illustrated in Figure 4. The spectral domain pitch search is based
on harmonic matching [2][3][7]. The temporal domain pitch
search is based on varying segment boundaries. It allows for
locking onto the most probable pitch period even during
transitions or other segments with rapidly varying pitch. Initially,
pitch periods, $P(n_i)$, are searched every 2 ms at instances
$n_i$ by maximizing the normalized correlation of the weighted
speech $s_w(n)$, that is:

$$
P(n_i) = \arg\max_{\tau, N_1, N_2} \left\{ \rho(n_i, \tau, N_1, N_2) \right\}
= \arg\max_{\tau, N_1, N_2} \left\{
\frac{\sum_{n=n_i-N_1\Delta}^{n_i+\tau+N_2\Delta} s_w(n) \, s_w(n-\tau)}
     {\sqrt{\sum_{n=n_i-N_1\Delta}^{n_i+\tau+N_2\Delta} s_w(n) \, s_w(n)
            \sum_{n=n_i-N_1\Delta}^{n_i+\tau+N_2\Delta} s_w(n-\tau) \, s_w(n-\tau)}}
\right\}
\tag{11}
$$

where $\Delta$ is some incremental segment used in the summations
for computational simplicity, and
$0 \le N_j \le \lfloor 160/\Delta \rfloor$. Then, every 10 ms a
weighted-mean pitch value is calculated by:

$$
P_{mean} = \sum_{i=1}^{5} \rho(n_i) \, P(n_i) \Big/ \sum_{i=1}^{5} \rho(n_i)
\tag{12}
$$

where $\rho(n_i)$ is the normalized correlation for $P(n_i)$.

[Figure 4: flowchart. Blocks: spectral domain pitch search and
tracker (100 Hz); "Good pitch?" decision; temporal domain pitch
refinement (500 Hz) on the weighted speech; temporal domain pitch
search (500 Hz); "Good pitches?" decisions; use of a 4 ms
waveform length; weighted-average pitch (100 Hz).]
Figure 4. Pitch search of the EWI coder.

5. GAIN QUANTIZATION
`
The gain trajectory is commonly smeared during plosives and
onsets by downsampling and interpolation. We address this
problem and improve speech crispness with a novel Switched-
Predictive AbS Gain VQ technique, illustrated in Figure 5.
Switched-prediction is introduced to allow for different levels of
gain correlation, and to reduce the occurrence of gain outliers. In
order to improve speech crispness, especially for plosives and
onsets, temporal weighting is incorporated in the AbS gain VQ.
The weighting is a monotonic function of the temporal gain.

Two codebooks of 32 vectors each are used. Each codebook has
an associated predictor coefficient, $P_i$, and a DC offset,
$D_i$. The quantization target vector is the DC-removed log-gain
vector, denoted by $t(m)$. The search for the minimal weighted
mean squared error (WMSE) is performed over all the vectors,
$c_{ij}(m)$, of the codebooks. The quantized target,
$\hat{t}(m)$, is obtained by passing the quantized vector,
$c_{ij}(m)$, through the synthesis filter. Since each quantized
target vector may have a different value of the removed DC, the
quantized DC is added temporarily to the filter memory after the
state update, and the next quantized vector's DC is subtracted
from it before filtering is performed. Since the predictor
coefficients are known, direct VQ can be used to simplify the
computations.
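The search described above can be sketched as follows. This is a minimal illustration, not the coder's actual quantizer: the temporal weighting function, the two tiny codebooks (two vectors each instead of 32), and all numeric values are hypothetical. The filter memory is kept as the last quantized log-gain with its DC included, and each codebook's own DC offset is subtracted before filtering, which is one reading of the DC handling described in the text.

```python
import math

def synthesize(c, p, memory):
    """Pass codevector c through the first-order filter 1 / (1 - p z^-1)."""
    out = []
    for x in c:
        memory = x + p * memory
        out.append(memory)
    return out

def temporal_weights(log_gain):
    """Hypothetical temporal weighting: increases monotonically with the gain."""
    peak = max(log_gain)
    return [math.exp(0.5 * (g - peak)) for g in log_gain]

def search_gain(log_gain, codebooks, prev_quantized):
    """Return (error, i, j, quantized log-gains) with the smallest weighted error.

    codebooks: list of (predictor P_i, DC offset D_i, list of codevectors c_ij).
    prev_quantized: last quantized log-gain of the previous vector, DC included.
    """
    weights = temporal_weights(log_gain)
    best = None
    for i, (p, dc, vectors) in enumerate(codebooks):
        target = [g - dc for g in log_gain]   # DC-removed target t(m)
        memory = prev_quantized - dc          # remove this codebook's DC from the state
        for j, c in enumerate(vectors):
            t_hat = synthesize(c, p, memory)
            err = sum(wt * (t - th) ** 2 for wt, t, th in zip(weights, target, t_hat))
            if best is None or err < best[0]:
                best = (err, i, j, [th + dc for th in t_hat])
    return best

# Toy codebooks (hypothetical values).
codebooks = [
    (0.9, 2.0, [[0.1, 0.0], [-0.2, 0.1]]),   # strongly predictive
    (0.3, 1.0, [[1.0, 0.5], [0.0, 0.0]]),    # weakly predictive, suited to onsets
]
prev_quantized = 1.5
# Build an input that codebook 1, vector 0 reproduces exactly.
log_gain = [th + 1.0 for th in synthesize([1.0, 0.5], 0.3, prev_quantized - 1.0)]
err, i_best, j_best, quantized = search_gain(log_gain, codebooks, prev_quantized)
```

Because the decoder knows the selected predictor and DC offset from the codebook index, it can run the same filter from the same memory and recover `quantized` from the pair (i, j) alone.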
`
[Figure 5: block diagram. Blocks: log-gain input g(m); DC
codebook (D_i); predictor codebook (P_i); vector quantizer
codebook (c_ij(m)); synthesis filter 1/(1 - P_i z^-1) producing
the quantized target; target t(m); temporal weighting;
minimization of the error norm.]
Figure 5. Switched-Predictive Analysis-by-Synthesis
gain VQ using temporal weighting.
`
6. BIT ALLOCATION
`
`The bit allocation of the coder is given in Table 1. The frame
`length is 20 ms, and ten waveforms are extracted per frame. The
`pitch and the gain are coded twice per frame.
`
Parameter        Bits / Frame    Bits / second
LPC              18              900
Pitch            2x6=12          600
Gain             2x6=12          600
REW              20              1000
SEW magnitude    14              700
SEW phase        4               200
Total            80              4000

Table 1. Bit allocation for EWI coder
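The two columns of Table 1 are related by the frame rate: a 20 ms frame gives 50 frames per second. The short check below reproduces the arithmetic; the variable names are illustrative only.

```python
FRAME_MS = 20  # frame length stated in the text

# Bits per frame, copied from Table 1 (pitch and gain are coded twice per frame).
bits_per_frame = {
    "LPC": 18,
    "Pitch": 2 * 6,
    "Gain": 2 * 6,
    "REW": 20,
    "SEW magnitude": 14,
    "SEW phase": 4,
}

frames_per_second = 1000 // FRAME_MS
bits_per_second = {name: bits * frames_per_second
                   for name, bits in bits_per_frame.items()}
total_bits = sum(bits_per_frame.values())
total_rate = total_bits * frames_per_second
```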
`
7. SUBJECTIVE RESULTS

We have conducted a subjective A/B test to compare our 4 kbps
EWI coder to MPEG-4 at 4 kbps, and to G.723.1. The test data
included 24 MIRS speech sentences, 12 of which are of female
speakers, and 12 of male speakers. Fourteen listeners
participated in the test. The test results, listed in Table 2 to Table
4, indicate that the subjective quality of EWI exceeds that of
MPEG-4 at 4 kbps and of G.723.1 at 5.3 kbps, and it is slightly
better than that of G.723.1 at 6.3 kbps.
`
Test      4 kbps WI    4 kbps MPEG-4
Female    65.48%       34.52%
Male      61.90%       38.10%
Total     63.69%       36.31%

Table 2. Results of subjective A/B test comparing the 4 kbps WI
coder with 4 kbps MPEG-4. With 95% certainty the WI
preference lies in [58.63%, 68.75%].
`
Test      4 kbps WI    5.3 kbps G.723.1
Female    57.74%       42.26%
Male      61.31%       38.69%
Total     59.52%       40.48%

Table 3. Results of subjective A/B test comparing the 4 kbps WI
coder with 5.3 kbps G.723.1. With 95% certainty the
WI preference lies in [54.17%, 64.88%].
`
Test      4 kbps WI    6.3 kbps G.723.1
Female    54.76%       45.24%
Male      52.98%       47.02%
Total     53.87%       46.13%

Table 4. Results of subjective A/B test comparing the 4 kbps WI
coder with 6.3 kbps G.723.1. With 95% certainty the
WI preference lies in [48.51%, 59.23%].
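The paper does not say how the 95% intervals were computed. As a plausibility check only, the sketch below assumes one vote per listener per sentence (24 x 14 = 336 votes per comparison), infers the WI vote counts from the reported totals (214, 200 and 181), and uses a normal approximation whose half-width is rounded to whole votes. Under those assumptions the intervals come out equal to the bounds quoted in the captions of Tables 2 to 4.

```python
import math

VOTES = 24 * 14  # assumed: every listener judged every sentence once

def preference_interval(votes_for, votes_total, z=1.96):
    """Normal-approximation interval in percent, half-width rounded to whole votes."""
    p = votes_for / votes_total
    half_width = round(z * math.sqrt(votes_total * p * (1 - p)))
    low = votes_for - half_width
    high = votes_for + half_width
    return round(100 * low / votes_total, 2), round(100 * high / votes_total, 2)

# WI vote counts inferred from the "Total" rows (63.69%, 59.52%, 53.87% of 336).
vs_mpeg4 = preference_interval(214, VOTES)
vs_g7231_53 = preference_interval(200, VOTES)
vs_g7231_63 = preference_interval(181, VOTES)
```

The interval for the 6.3 kbps G.723.1 comparison includes 50%, which is consistent with the paper describing EWI as only slightly better in that case.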
`
8. SUMMARY

We have found several new techniques that enhance the
performance of the WI coder. The most significant of these,
reported here, are analysis-by-synthesis vector quantization of
the dispersion phase, AbS optimization of the SEW, a special
pitch search for transitions, and switched-predictive analysis-by-
synthesis gain VQ. These features improve the algorithm and its
robustness. The test results indicate that the performance of the
EWI coder slightly exceeds that of G.723.1 at 6.3 kbps, and
therefore EWI achieves quality very close to toll quality, at
least under clean speech conditions.
`
9. REFERENCES

[1] B. S. Atal and M. R. Schroeder, "Stochastic Coding of Speech at Very
Low Bit Rate", Proc. Int. Conf. Comm., Amsterdam, pp. 1610-1613,
1984.
[2] R. J. McAulay and T. F. Quatieri, "Sinusoidal Coding", in Speech
Coding and Synthesis by W. B. Kleijn and K. K. Paliwal, Elsevier
Science B. V., Chapter 4, pp. 121-173, 1995.
[3] D. Griffin and J. S. Lim, "Multiband Excitation Vocoder", IEEE
Trans. ASSP, Vol. 36, No. 8, pp. 1223-1235, August 1988.
[4] Y. Shoham, "High Quality Speech Coding at 2.4 to 4.0 kbps Based on
Time-Frequency-Interpolation", IEEE ICASSP'93, Vol. II, pp. 167-170,
1993.
[5] W. B. Kleijn and J. Haagen, "Waveform Interpolation for Coding and
Synthesis", in Speech Coding and Synthesis by W. B. Kleijn and K. K.
Paliwal, Elsevier Science B. V., Chapter 5, pp. 175-207, 1995.
[6] I. S. Burnett and D. H. Pham, "Multi-Prototype Waveform Coding
using Frame-by-Frame Analysis-by-Synthesis", IEEE ICASSP'97, pp.
1567-1570, 1997.
[7] E. Shlomot, V. Cuperman, and A. Gersho, "Hybrid Coding of Speech at
4 kbps", IEEE Speech Coding Workshop, pp. 37-38, 1997.
[8] O. Gottesman, "Dispersion Phase Vector Quantization For
Enhancement of Waveform Interpolative Coder", IEEE ICASSP'99,
vol. 1, pp. 269-272, 1999.
[9] O. Gottesman and A. Gersho, "Enhanced Waveform Interpolative
Coding at 4 kbps", IEEE Speech Coding Workshop, 1999, Finland.