US007643996B1

(12) United States Patent
     Gottesman

(10) Patent No.: US 7,643,996 B1
(45) Date of Patent: Jan. 5, 2010

(54) ENHANCED WAVEFORM INTERPOLATIVE CODER

(75) Inventor: Oded Gottesman, Goleta, CA (US)

(73) Assignee: The Regents of the University of California, Oakland, CA (US)

( * ) Notice: Subject to any disclaimer, the term of this patent is
     extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/831,843

(22) PCT Filed: Dec. 1, 1999

(86) PCT No.: PCT/US99/28449
     § 371 (c)(1), (2), (4) Date: Aug. 13, 2001

(87) PCT Pub. No.: WO00/33297
     PCT Pub. Date: Jun. 8, 2000

Related U.S. Application Data
(60) Provisional application No. 60/110,522, filed on Dec. 1, 1998,
     provisional application No. 60/110,641, filed on Dec. 1, 1998.

(51) Int. Cl.
     G10L 13/04 (2006.01)
(52) U.S. Cl. ........ 704/265; 704/219; 704/230; 704/220; 704/205; 704/223
(58) Field of Classification Search ........ 704/205,
     704/207, 219, 230, 220, 222, 225, 265, 223
     See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
4,653,098 A  *   3/1987  Nakata et al. ............ 704/207
5,086,471 A  *   2/1992  Tanaka et al. ............ 704/222
5,517,595 A  *   5/1996  Kleijn ................... 704/205
6,418,408 B1 *   7/2002  Udaya Bhaskar et al. ..... 704/219
6,493,664 B1 *  12/2002  Udaya Bhaskar et al. ..... 704/222

* cited by examiner

Primary Examiner: Vijay B Chawan
(74) Attorney, Agent, or Firm: Berliner & Associates

(57) ABSTRACT

An Enhanced analysis-by-synthesis Waveform Interpolative speech coder
able to operate at 4 kbps. Novel features include analysis-by-synthesis
quantization of the slowly evolving waveform, analysis-by-synthesis
vector quantization of the dispersion phase, a special pitch search for
transitions, and switched-predictive analysis-by-synthesis gain vector
quantization. Subjective quality tests indicate that its quality exceeds
that of MPEG-4 at 4 kbps and of G.723.1 at 6.3 kbps.

34 Claims, 4 Drawing Sheets
[Front-page drawing: block diagram with the labels SPEECH, LPC ANALYSIS,
LPC INTERPOLATION, RESIDUAL, PITCH EXTRACTION, WAVEFORM EXTRACTION +
ALIGNMENT + DECOMPOSITION, WAVEFORM SYNTHESIZER, WAVEFORM CODEBOOKS, and
INTERPOLATION + LOOKAHEAD EXTRAPOLATION.]
IPR2017-01075
Saint Lawrence Communications
Exhibit 2016
`
`
`
U.S. Patent    Jan. 5, 2010    Sheet 1 of 4    US 7,643,996 B1

[Sheet 1 of 4: block-diagram drawing; its labels are not recoverable from the scan.]

[Sheet 2 of 4. FIG. 2: amplitude-time plots; the only legible panel title
is ORIGINAL. FIG. 3: block diagram with the labels PITCH-CYCLE WAVEFORM'S
DFT, CRUDE LINEAR PHASE ALIGNMENT, REFINED LINEAR PHASE ALIGNMENT,
MAGNITUDE CODEBOOK, PHASE CODEBOOK, and PITCH.]

[Sheet 3 of 4. FIG. 4: plot of SEG. WEIGHTED SNR (dB) versus PHASE BITS;
the legible curve label is NON-MIRS (FLAT). A second plot has a
SUBJECTIVE SCORE axis marked from 0% to 50% in steps of 5% and the label
FEMALE.]

[Sheet 4 of 4. Pitch-search block diagram with the labels SPEECH, SPECTRAL
DOMAIN PITCH SEARCH + TRACKER, TEMPORAL DOMAIN PITCH REFINEMENT, WEIGHTED
SPEECH, TEMPORAL DOMAIN PITCH SEARCH, GOOD PITCHES? (with a NO branch),
WEIGHTED-AVERAGE, WAVEFORM PITCH LENGTH, and 100 Hz. FIG. 7: block diagram
with the labels LOG-GAIN g(m), CODEBOOK, PREDICTOR P_i, SYNTHESIS, MIN, and
TEMPORAL WEIGHTING.]

`ENHANCED WAVEFORM INTERPOLATIVE
`CODER
`
`CROSS REFERENCE TO RELATED
`APPLICATIONS
`
This application claims the benefit of Provisional Patent
Application Nos. 60/110,522, filed Dec. 1, 1998 and 60/110,641,
filed Dec. 1, 1998.
`
`BACKGROUND OF THE INVENTION
`
Recently, there has been growing interest in developing
toll-quality speech coders at rates of 4 kbps and below. The
speech quality produced by waveform coders such as code-excited
linear prediction (CELP) coders degrades rapidly at rates below
5 kbps [B. S. Atal and M. R. Schroeder, "Stochastic Coding of
Speech at Very Low Bit Rate", Proc. Int. Conf. Comm., Amsterdam,
pp. 1610-1613, 1984]. On the other hand, parametric coders such as
the waveform-interpolative (WI) coder, the sinusoidal-transform
coder (STC), and the multiband-excitation (MBE) coder produce good
quality at low rates, but they do not achieve toll quality
[Y. Shoham, "High Quality Speech Coding at 2.4 and 4.0 kbps Based on
Time Frequency-Interpolation", IEEE ICASSP'93, Vol. II, pp. 167-170,
1993; W. B. Kleijn and J. Haagen, "Waveform Interpolation for Coding
and Synthesis", in Speech Coding and Synthesis by W. B. Kleijn and
K. K. Paliwal, Elsevier Science B. V., Chapter 5, pp. 175-207, 1995;
I. S. Burnett and D. H. Pham, "Multi-Prototype Waveform Coding using
Frame-by-Frame Analysis-by-Synthesis", IEEE ICASSP'97, pp. 1567-1570,
1997; R. J. McAulay and T. F. Quatieri, "Sinusoidal Coding", in
Speech Coding and Synthesis by W. B. Kleijn and K. K. Paliwal,
Elsevier Science B. V., Chapter 4, pp. 121-173, 1995; and D. Griffin
and J. S. Lim, "Multiband Excitation Vocoder", IEEE Trans. ASSP,
Vol. 36, No. 8, pp. 1223-1235, August 1988]. This is mainly due to
lack of robustness to parameter estimation, which is commonly done in
open loop, and to inadequate modeling of non-stationary speech
segments. Also, in parametric coders the phase information is
commonly not transmitted, for two reasons: first, the phase is of
secondary perceptual significance; and second, no efficient phase
quantization scheme is known. WI coders typically use a fixed phase
vector for the slowly evolving waveform [Shoham, supra; Kleijn et al,
supra; and Burnett et al, supra]. For example, in Kleijn et al, a
fixed male speaker extracted phase was used. On the other hand,
waveform coders such as CELP, by directly quantizing the waveform,
implicitly allocate an excessive number of bits to the phase
information, more than is perceptually required.
`
`SUMMARY OF THE INVENTION
`
The present invention overcomes the foregoing drawbacks
by implementing a paradigm that incorporates analysis-by-synthesis
(AbS) for parameter estimation, and a novel pitch search technique
that is well suited for the non-stationary segments. In one
embodiment, the invention provides a novel, efficient AbS vector
quantization (VQ) encoding of the dispersion phase of the excitation
signal to enhance the performance of the waveform interpolative (WI)
coder at a very low bit-rate, which can be used for parametric coders
as well as for waveform coders. The enhanced analysis-by-synthesis
waveform interpolative (EWI) coder of this invention employs this
scheme, which incorporates perceptual weighting and does not require
any phase unwrapping.
The WI coders use non-ideal low-pass filters for downsampling
and upsampling of the slowly evolving waveform (SEW). In another
embodiment of the invention, a novel AbS SEW quantization scheme is
provided, which takes the non-ideal filters into consideration. An
improved match between reconstructed and original SEW is obtained,
most notably in the transitions.
Pitch accuracy is crucial for high quality reproduced
speech in WI coders. Still another embodiment of the invention
provides a novel pitch search technique based on varying segment
boundaries; it allows for locking onto the most probable pitch period
during transitions or other segments with rapidly varying pitch.
Commonly in speech coding, the gain sequence is downsampled and
interpolated. As a result it is often smeared during plosives and
onsets. To alleviate this problem, a further embodiment of the
invention provides a novel switched-predictive AbS gain VQ scheme
based on temporal weighting.
More particularly, the invention provides a method for
interpolative coding of input signals at low data rates in which
there may be significant pitch transitivity, the signals having
an evolving waveform, the method incorporating at least one,
and preferably all, of the following steps:
(a) AbS VQ of the SEW, whereby distortion in the signal is reduced
by obtaining the accumulated weighted distortion between an original
sequence of waveforms and a sequence of quantized and interpolated
waveforms;
(b) AbS quantization of the dispersion phase;
(c) locking onto the most probable pitch period of the signal using
both a spectral domain pitch search and a temporal domain pitch
search;
(d) incorporating temporal weighting in the AbS VQ of the signal
gain, whereby local high energy events in the input signal are
emphasized;
(e) applying both high correlation and low correlation synthesis
filters to a vector quantizer codebook in the AbS VQ of the signal
gain, whereby self correlation is added to the codebook vectors and
similarity between the signal waveform and a codebook waveform is
maximized;
(f) using each value of gain in the AbS VQ of the signal gain to
obtain a plurality of shapes, each composed of a predetermined number
of values, and comparing said shapes to a vector quantized codebook
of shapes, each having said predetermined number of values, e.g., in
the range of 2-50, preferably 5-20; and
(g) using a coder in which a plurality of bits, e.g., 4 bits, are
allocated to the SEW dispersion phase.
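
Steps (d) and (e) can be illustrated with a minimal sketch. It is not the coder's actual implementation: the first-order synthesis filter, the exponential temporal weighting, the toy codebooks and the predictor values 0.9 and 0.1 are all assumptions chosen for illustration, and the DC-offset handling of the detailed description is omitted.

```python
import math


def synthesize(codevector, predictor, state=0.0):
    """First-order predictive synthesis: y[m] = c[m] + predictor * y[m-1]."""
    out = []
    prev = state
    for c in codevector:
        prev = c + predictor * prev
        out.append(prev)
    return out


def search_gain_vq(target, codebooks, predictors, state=0.0):
    """Try every vector of every codebook; return (error, codebook, index)."""
    # Assumed temporal weighting: a monotonic function of the log-gain,
    # so that local high-energy events dominate the weighted error.
    weights = [math.exp(t) for t in target]
    best = None
    for i, (codebook, predictor) in enumerate(zip(codebooks, predictors)):
        for j, codevector in enumerate(codebook):
            synth = synthesize(codevector, predictor, state)
            err = sum(w * (t - y) ** 2
                      for w, t, y in zip(weights, target, synth))
            if best is None or err < best[0]:
                best = (err, i, j)
    return best


# Hypothetical toy data: one high-correlation and one low-correlation codebook.
codebooks = [
    [[0.0, 0.0, 0.0, 0.0], [1.0, -0.5, 0.2, 0.0]],
    [[2.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]],
]
predictors = [0.9, 0.1]

# Build a target that the second codebook can reproduce exactly.
target = synthesize(codebooks[1][0], predictors[1])
error, codebook_index, vector_index = search_gain_vq(target, codebooks, predictors)
```

Because every codevector is passed through its codebook's synthesis filter before the error is measured, the choice between high and low self correlation falls out of the same exhaustive search.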
The method of the invention can be used in general with
any waveform signal, and is particularly useful with speech
signals. In the step of AbS VQ of the SEW, distortion is
reduced in the signal by obtaining the accumulated weighted
distortion between an original sequence of waveforms and a
sequence of quantized and interpolated waveforms. In the
step of AbS quantization of the dispersion phase, at least one
codebook is provided that contains magnitude and phase
information for predetermined waveforms. The linear phase
of the input is crudely aligned, then iteratively shifted and
compared to a plurality of waveforms reconstructed from the
magnitude and phase information contained in one or more
codebooks. The reconstructed waveform that best matches
one of the iteratively shifted inputs is selected.
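
The shift-and-compare search just described can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented procedure: the function names, the quarter-sample shift grid, the per-harmonic weights and the toy codebook are hypothetical, and the codebook entries are taken to be already reconstructed complex DFT vectors.

```python
import cmath
import math


def cyclic_shift(dft, shift, period):
    """Shift a pitch cycle by `shift` samples: harmonic k gains a linear phase."""
    return [x * cmath.exp(-2j * math.pi * k * shift / period)
            for k, x in enumerate(dft, start=1)]


def weighted_error(a, b, weights):
    """Weighted squared error between two complex DFT vectors."""
    return sum(w * abs(x - y) ** 2 for x, y, w in zip(a, b, weights))


def align_and_select(dft, codebook, weights, period, step=0.25, max_steps=8):
    """Try each cyclic shift against each codebook waveform; keep the best."""
    best = None
    for n in range(-max_steps, max_steps + 1):
        shifted = cyclic_shift(dft, n * step, period)
        for index, candidate in enumerate(codebook):
            err = weighted_error(shifted, candidate, weights)
            if best is None or err < best[0]:
                best = (err, index, n * step)
    return best


# Hypothetical data: two reconstructed waveforms with three harmonics each.
period = 40
codebook = [
    [1.0 + 0.0j, 0.0 + 0.5j, -0.25 + 0.0j],
    [0.8 + 0.2j, -0.3 + 0.4j, 0.1 - 0.1j],
]
weights = [1.0, 0.5, 0.25]

# The observed input is codebook entry 1 displaced by -0.75 samples.
observed = cyclic_shift(codebook[1], -0.75, period)
err, index, shift = align_and_select(observed, codebook, weights, period)
```

Working on DFT coefficients lets a fractional cyclic shift be applied as a linear phase term, so the refinement is not limited to whole-sample shifts.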
In the step of locking onto the most probable pitch period of
the signal, the invention includes searching the temporal
domain pitch, defining a boundary for a segment of said
temporal domain pitch, maximizing the length of the boundary
by iteratively shrinking and expanding the segment, and
maximizing the similarity by shifting the segment. The
searches are preferably conducted respectively at 100 Hz and
500 Hz.
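
A simplified sketch of a temporal-domain search with varying segment boundaries is given below. It is an assumption-laden illustration rather than the coder's algorithm: the boundary step `delta`, the number of boundary steps, the lag range and the synthetic test signal are hypothetical, and the weighted-mean helper assumes the normalized correlations are used directly as weights.

```python
import math


def normalized_correlation(s, start, end, lag):
    """Normalized correlation between s[start:end] and the same span delayed by lag."""
    num = sum(s[n] * s[n + lag] for n in range(start, end))
    e0 = sum(s[n] ** 2 for n in range(start, end))
    e1 = sum(s[n + lag] ** 2 for n in range(start, end))
    return num / math.sqrt(e0 * e1) if e0 > 0 and e1 > 0 else 0.0


def search_pitch(s, center, lags, delta=8, max_steps=3):
    """Search lag and both segment boundaries for the highest correlation."""
    best_rho, best_lag = -1.0, None
    for lag in lags:
        for n1 in range(1, max_steps + 1):        # left boundary, in steps of delta
            for n2 in range(1, max_steps + 1):    # right boundary, in steps of delta
                start, end = center - n1 * delta, center + n2 * delta
                if start < 0 or end + lag > len(s):
                    continue
                rho = normalized_correlation(s, start, end, lag)
                if rho > best_rho:
                    best_rho, best_lag = rho, lag
    return best_lag, best_rho


def weighted_mean_pitch(pitches, correlations):
    """Average several pitch estimates, weighting each by its correlation."""
    return sum(p * c for p, c in zip(pitches, correlations)) / sum(correlations)


# Synthetic test signal with a period of 40 samples.
signal = [math.sin(2 * math.pi * n / 40) for n in range(400)]
pitch, rho = search_pitch(signal, center=200, lags=range(20, 61))
mean_pitch = weighted_mean_pitch([40, 42], [1.0, 0.5])
```

Letting the two boundaries move independently allows the segment to avoid a neighbouring region with a different pitch, which is the situation that arises at onsets and offsets.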
`
`
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
FIG. 1 is a block diagram of the AbS SEW vector quantization;
FIG. 2 shows amplitude-time plots illustrating the
improved waveform matching obtained for a non-stationary
speech segment by interpolating the optimized SEW;
FIG. 3 is a block diagram of the AbS dispersion phase
vector quantization;
FIG. 4 is a plot of the segmentally weighted signal-to-noise
ratio of the phase vector quantization versus the number of
bits, for modified intermediate reference system (MIRS) and
for non-MIRS (flat) speech;
FIG. 5 shows the results of subjective A/B tests comparing
a 4-bit phase vector quantization and a male extracted fixed
phase;
FIG. 6 is a block diagram of the pitch search of the EWI
coder; and
FIG. 7 is a block diagram of the switched-predictive AbS gain
VQ using temporal weighting.
`
`DETAILED DESCRIPTION OF THE INVENTION
`
The invention has a number of embodiments, some of
which can be used independently of the others to enhance
speech and other signal coding systems. The embodiments
cooperate to produce a superior coding system, involving
AbS SEW optimization and a novel dispersion phase quantizer,
pitch search scheme, switched-predictive AbS gain VQ,
and bit allocation.
`
AbS SEW Quantization
Commonly in WI coders the SEW is distorted by downsampling and
upsampling with non-ideal low-pass filters. In order to reduce such
distortion, an AbS SEW quantization scheme, illustrated in FIG. 1,
was used. Consider the accumulated weighted distortion, $D_{WI}$,
between the input SEW vectors, $r_m$, and the interpolated vectors,
$\tilde{r}_m$, given by:

[Equation (1) is illegible in the source scan.]

where the first sum is that of the current-frame distortions and the
second sum is that of the lookahead distortions. H denotes Hermitian
(transposed + complex conjugate), M is the number of waveforms per
frame, L is the lookahead number of waveforms, $\alpha(t)$ is some
increasing interpolation function in the range
$0 \le \alpha(t) \le 1$, and $W_m$ is a diagonal matrix whose
elements, $w_{kk}$, are the combined spectral weighting and synthesis
of the k-th harmonic, given by:

[Equation (2) is illegible in the source scan.]

where P is the pitch period, K is the number of harmonics, g is the
gain, $A(z)$ and $\hat{A}(z)$ are the input and the quantized LPC
polynomials respectively, and the spectral weighting parameters
satisfy $0 \le \gamma_2 < \gamma_1 \le 1$. It is also possible to
leave out the inverse of the number of harmonics, i.e., the 1/K
parameter, the gain, i.e., the g parameter, or another combination of
input and quantized LPC polynomials, i.e., the $A(z)$ and
$\hat{A}(z)$ parameters.
The interpolated SEW vectors are given by:

$$\tilde{r}_m = [1 - \alpha(t_m)]\,\hat{r}_0 + \alpha(t_m)\,\hat{r}_M, \quad m = 1, \ldots, M \qquad (3)$$

where t is time, m is the index of the waveform within a frame, and
$\hat{r}_0$ and $\hat{r}_M$ are the quantized SEW at the previous and
at the current frame respectively. The parameter $\alpha$ is an
increasing linear function from 0 to 1. It can be shown that the
accumulated distortion in equation (1) is equal to the sum of
modeling distortion and quantization distortion:

$$D_{WI}(\hat{r}_M) = D_{WI}(r_{M,opt}) + D_w(\hat{r}_M, r_{M,opt}) \qquad (4)$$

where the quantization distortion is given by:

$$D_w(\hat{r}_M, r_{M,opt}) = (\hat{r}_M - r_{M,opt})^H\, W_{M,opt}\, (\hat{r}_M - r_{M,opt}) \qquad (5)$$

The optimal vector, $r_{M,opt}$, which minimizes the modeling
distortion, is given by:

[This equation is illegible in the source scan.]

Therefore, VQ with the accumulated distortion of equation (1) can be
simplified by using the distortion of equation (5), and:

$$\hat{r}_M = \arg\min_{r_i} \left\{ (r_i - r_{M,opt})^H\, W_{M,opt}\, (r_i - r_{M,opt}) \right\} \qquad (6)$$
`
An improved match between reconstructed and original
SEW is obtained, most notably in the transitions. FIG. 2
illustrates the improved waveform matching obtained for a
non-stationary speech segment by interpolating the optimized SEW.
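
The split of the accumulated distortion into a modeling term and a quantization term can be checked numerically. The sketch below is a simplified illustration, not the coder's code: it keeps only the current-frame sum (the lookahead sum is omitted), assumes diagonal, real, positive weights, and obtains the optimal target and optimal weighting by completing the square, which is an assumption about their form. All names and the random test data are hypothetical.

```python
import random


def weighted_dist(a, b, w):
    """(a - b)^H W (a - b) for a diagonal, real, positive W stored as a list."""
    return sum(wk * abs(x - y) ** 2 for x, y, wk in zip(a, b, w))


def accumulated_dist(r, r0_hat, rM_hat, alphas, weights):
    """Sum over the frame of the weighted error against the interpolated SEW."""
    total = 0.0
    for r_m, a, w_m in zip(r, alphas, weights):
        interp = [(1 - a) * x0 + a * xM for x0, xM in zip(r0_hat, rM_hat)]
        total += weighted_dist(r_m, interp, w_m)
    return total


def optimal_target(r, r0_hat, alphas, weights):
    """Per-harmonic minimizer of the accumulated distortion and its weighting."""
    K = len(r0_hat)
    w_opt = [sum(a * a * w_m[k] for a, w_m in zip(alphas, weights))
             for k in range(K)]
    r_opt = [
        sum(a * w_m[k] * (r_m[k] - (1 - a) * r0_hat[k])
            for r_m, a, w_m in zip(r, alphas, weights)) / w_opt[k]
        for k in range(K)
    ]
    return r_opt, w_opt


def random_vector(K):
    return [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(K)]


random.seed(0)
M, K = 10, 4
alphas = [m / M for m in range(1, M + 1)]          # linear, reaching 1 at m = M
weights = [[random.uniform(0.5, 2.0) for _ in range(K)] for _ in range(M)]
r = [random_vector(K) for _ in range(M)]
r0_hat = random_vector(K)
codebook = [random_vector(K) for _ in range(8)]

r_opt, w_opt = optimal_target(r, r0_hat, alphas, weights)
modeling = accumulated_dist(r, r0_hat, r_opt, alphas, weights)
full = [accumulated_dist(r, r0_hat, c, alphas, weights) for c in codebook]
simple = [weighted_dist(c, r_opt, w_opt) for c in codebook]
```

Since the modeling term does not depend on the codevector, searching the codebook with the single-vector distortion selects the same entry as evaluating the full accumulated distortion for every candidate, at a fraction of the cost.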
`
`AbS Phase Quantization
The dispersion-phase vector quantization scheme is illustrated in
FIG. 3. Consider a pitch cycle which is extracted from the residual
signal, and is cyclically shifted such that its pulse is located at
position zero. Let its discrete Fourier transform (DFT) be denoted by
r; the resulting DFT phase is the dispersion phase, $\varphi$, which
determines, along with the magnitude $|r|$, the waveform's pulse
shape. The SEW waveform r is the vector of complex DFT coefficients.
The complex number can represent magnitude and phase. After
quantization, the components of the quantized magnitude vector,
$|\hat{r}|$, are multiplied by the exponential of the quantized
phases, $\hat{\varphi}(k)$, to yield the quantized waveform DFT,
$\hat{r}$, which is subtracted from the input DFT to produce the
error DFT. The error DFT is then transformed to the perceptual domain
by weighting it by the combined synthesis and weighting filter
W(z)/A(z). In a crude linear phase alignment, the encoder searches
for the phase that minimizes the energy of the perceptual domain
error, shifting the signal such that the peak is located at time
zero. It then allows a refining cyclic shift of the input waveform
during the search, incrementally increasing or decreasing the linear
phase, to eliminate any residual phase shift between the input
waveform and the quantized waveform. Although shown in FIG. 3 as
occurring immediately after the crude linear phase alignment, the
refined linear phase alignment step can occur elsewhere in the cycle,
e.g., between the X and + steps. Phase dispersion quantization aims
to improve waveform matching. Efficient quantization can be obtained
by using the perceptually weighted distortion:

$$D_w(r, e^{j\hat{\varphi}}|\hat{r}|) = (r - e^{j\hat{\varphi}}|\hat{r}|)^H\, W\, (r - e^{j\hat{\varphi}}|\hat{r}|) \qquad (7)$$

The magnitude is perceptually more significant than the phase, and
should therefore be quantized first. Furthermore, if the phase were
quantized first, the very limited bit allocation available for the
phase would lead to an excessively degraded spectral matching of the
magnitude in favor of a somewhat improved, but less important,
matching of the waveform. For the above distortion, the quantized
phase vector is given by:

$$\hat{\varphi} = \arg\min_{\hat{\varphi}_i} \left\{ (r - e^{j\hat{\varphi}_i}|\hat{r}|)^H\, W\, (r - e^{j\hat{\varphi}_i}|\hat{r}|) \right\} \qquad (8)$$

where i is the running phase codebook index, and the respective
diagonal phase exponent matrix is given by

$$e^{j\hat{\varphi}_i} = \mathrm{diag}\{ e^{j\hat{\varphi}_i(k)} \} \qquad (9)$$
`
The AbS search for phase quantization is based on evaluating
(8) for each candidate phase codevector. Since only trigonometric
functions of the phase candidates are used, phase unwrapping is
avoided. The EWI coder uses the optimized SEW, $r_{M,opt}$, and the
optimized weighting, $W_{M,opt}$, for the AbS phase quantization.
Equation (8) is equivalent to:

$$\hat{\varphi} = \arg\max_{\hat{\varphi}_i} \left\{ \int_0^{2\pi} r_w(\phi)\, \hat{r}_w(\phi, \hat{\varphi}_i)\, d\phi \right\}$$

Equivalently, the quantized phase vector can be simplified to:

$$\hat{\varphi} = \arg\max_{\hat{\varphi}_i} \left\{ \sum_{k=1}^{K} w_{kk}\, |\hat{r}(k)|\, |r(k)| \cos\big(\varphi(k) - \hat{\varphi}_i(k)\big) \right\} \qquad (10)$$

where $\varphi(k)$ is the phase of $r(k)$, the k-th input DFT coefficient.
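
Expanding a weighted squared error between the input DFT and a quantized DFT of fixed magnitude shows that minimizing the error over the phase candidates is the same as maximizing a weighted sum of cosines of phase differences. The sketch below illustrates that equivalence; it is a minimal example with hypothetical data and function names, not the coder's search routine.

```python
import cmath
import math


def phase_vq_search(r, mag_hat, weights, phase_codebook):
    """Pick the codevector maximizing sum_k w_k |r_hat_k| |r_k| cos(phi_k - cand_k).

    |r_k| cos(phi_k - p) is computed as Re(r_k) cos p + Im(r_k) sin p, so only
    trigonometric functions of the candidate phases are needed.
    """
    best_index, best_score = None, -math.inf
    for i, phases in enumerate(phase_codebook):
        score = sum(
            w * m * (x.real * math.cos(p) + x.imag * math.sin(p))
            for x, m, w, p in zip(r, mag_hat, weights, phases)
        )
        if score > best_score:
            best_index, best_score = i, score
    return best_index


def brute_force_search(r, mag_hat, weights, phase_codebook):
    """Reference: minimize the weighted squared error directly."""
    def dist(phases):
        return sum(w * abs(x - m * cmath.exp(1j * p)) ** 2
                   for x, m, w, p in zip(r, mag_hat, weights, phases))
    return min(range(len(phase_codebook)), key=lambda i: dist(phase_codebook[i]))


# Hypothetical input with phases close to the +pi / -pi wrap-around point.
r = [cmath.rect(1.0, 3.0), cmath.rect(0.7, -2.9), cmath.rect(0.4, 0.5)]
mag_hat = [0.9, 0.8, 0.5]
weights = [1.0, 0.6, 0.3]
phase_codebook = [
    [0.0, 0.0, 0.0],
    [3.1, -3.1, 0.4],
    [-3.1, 3.1, 0.6],
    [1.5, -1.5, 1.0],
]

chosen = phase_vq_search(r, mag_hat, weights, phase_codebook)
reference = brute_force_search(r, mag_hat, weights, phase_codebook)
```

The cosine depends only on the phase difference modulo 2 pi, so candidates on either side of the wrap-around point are compared correctly without unwrapping.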
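The average global distortion measure for a set of M vectors is:

$$D_{w,Global} = \frac{1}{M} \sum_{m \in \{\text{Data Vectors}\}} D_w(r_m, e^{j\hat{\varphi}_m}|\hat{r}_m|) = \frac{1}{M} \sum_{m \in \{\text{Data Vectors}\}} (r_m - e^{j\hat{\varphi}_m}|\hat{r}_m|)^H\, W_m\, (r_m - e^{j\hat{\varphi}_m}|\hat{r}_m|) \qquad (11)$$

The centroid equation [A. Gersho et al, "Vector Quantization and
Signal Compression", Kluwer Academic Publishers, 1992] of the k-th
harmonic's phase for the j-th cluster, which minimizes the global
distortion in equation (11), is given by:

$$\hat{\varphi}(k)_{j\text{-th cluster}} = \mathrm{atan}\left[ \frac{\sum_{m \in \{j\text{-th cluster}\}} w_{kk,m}\, |r(k)_m|\, |\hat{r}(k)_m| \sin(\varphi(k)_m)}{\sum_{m \in \{j\text{-th cluster}\}} w_{kk,m}\, |r(k)_m|\, |\hat{r}(k)_m| \cos(\varphi(k)_m)} \right]$$

These centroid equations use trigonometric functions of
the phase, and therefore do not require any phase unwrapping.
It is possible to use $|r(k)_m|^2$ instead of $|r(k)_m||\hat{r}(k)_m|$.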
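
A centroid built from sines and cosines of the phase is a weighted circular mean. The following sketch illustrates it for a single harmonic; the sample phases and weights are hypothetical, each weight stands in for the product of the perceptual weight and the two magnitudes, and `atan2` is used in place of a plain arctangent so that the quadrant is resolved, which is a choice made here for illustration.

```python
import math


def phase_centroid(phases, weights):
    """Weighted circular mean of the training phases of one harmonic."""
    s = sum(w * math.sin(p) for p, w in zip(phases, weights))
    c = sum(w * math.cos(p) for p, w in zip(phases, weights))
    return math.atan2(s, c)


def cluster_score(candidate, phases, weights):
    """Quantity the centroid should maximize: sum of weighted cos(phase - candidate)."""
    return sum(w * math.cos(p - candidate) for p, w in zip(phases, weights))


# Hypothetical cluster whose phases straddle the +pi / -pi boundary.
phases = [3.0, -3.1, 2.9]
weights = [1.0, 0.5, 0.8]

centroid = phase_centroid(phases, weights)

# Compare against a dense grid search over one full turn.
grid = [-math.pi + 2 * math.pi * n / 3600 for n in range(3600)]
best_on_grid = max(cluster_score(g, phases, weights) for g in grid)
```

An arithmetic mean of these three phases would land near 0.93 rad, far from all of them, whereas the circular mean stays near pi; this is the practical benefit of avoiding phase unwrapping during codebook training.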
The phase vector's dimension depends on the pitch period
and, therefore, a variable dimension VQ has been implemented.
In the WI system the range of possible pitch period values was
divided into eight ranges, and for each range of pitch period an
optimal codebook was designed such that vectors of dimension
smaller than the largest pitch period in each range are zero padded.
Pitch changes over time cause the quantizer to switch
among the pitch-range codebooks. In order to achieve smooth
phase variations whenever such a switch occurs, overlapped
training clusters were used.
The phase-quantization scheme has been implemented as a
part of the WI coder, and used to quantize the SEW phase. The
objective performance of the suggested phase VQ has been
tested under the following conditions:
Phase Bits: 0-6 every 20 ms, a bitrate of 0-300 bit/second.
8 pitch ranges were selected, and training has been performed
for each range.
Modified IRS (MIRS) filtered speech (Female+Male)
Training Set: 99,323 vectors.
Test Set: 83,099 vectors.
Non-MIRS filtered speech (Female+Male)
Training Set: 101,359 vectors.
Test Set: 95,446 vectors.
The magnitude was not quantized.
The segmental weighted signal-to-noise ratio (SNR) of the
quantizer is illustrated in FIG. 4. The proposed system
achieves approximately 14 dB SNR with as few as 6 bits for
non-MIRS filtered speech, and nearly 10 dB for MIRS filtered
speech.
Recent WI coders have used a male speaker extracted dispersion
phase [Kleijn et al, supra; Y. Shoham, "Very Low Complexity
Interpolative Speech Coding at 1.2 to 2.4 KBPS", IEEE ICASSP '97,
pp. 1599-1602, 1997]. A subjective A/B test was conducted to compare
the dispersion phase of this invention, using only 4 bits, to a male
extracted dispersion phase. The test data included 16 MIRS speech
sentences, 8 of which are of female speakers, and 8 of male
speakers. During the test, all pairs of files were played twice in
alternating order, and the listeners could vote for either of the
systems, or for no preference. The speech material was synthesized
using a WI system in which only the dispersion phase was quantized
every 20 ms. Twenty one listeners participated
in the test. The test results, illustrated in FIG. 5, show
improvement in speech quality by using the 4-bit phase VQ.
The improvement is larger for female speakers than for male.
This may be explained by a higher number of bits per vector
sample for female, by less spectral masking for female speech,
and by a larger amount of phase-dispersion variation for female.
The codebook design for the dispersion-phase quantization involves a
tradeoff between robustness in terms of smooth phase variations and
waveform matching. A locally optimized codebook for each pitch value
may improve the waveform matching on the average, but may
occasionally yield abrupt and excessive changes which may cause
temporal artifacts.

Pitch Search
The pitch search of the EWI coder consists of a spectral
domain search employed at 100 Hz and a temporal domain
search employed at 500 Hz, as illustrated in FIG. 6. The
spectral domain pitch search is based on harmonic matching
[McAulay et al, supra; Griffin et al, supra; and E. Shlomot,
V. Cuperman, and A. Gersho, "Hybrid Coding of Speech at 4
kbps", IEEE Speech Coding Workshop, pp. 37-38, 1997].
The temporal domain pitch search is based on varying segment
boundaries. It allows for locking onto the most probable
pitch period even during transitions or other segments with
rapidly varying pitch (e.g., speech onset or offset or fast
changing periodicity). Initially, pitch periods, $P(n_i)$, are
searched every 2 ms at instances $n_i$ by maximizing the
normalized correlation of the weighted speech $s_w(n)$, that is:

[Equation (12) is illegible in the source scan.]

where $\tau$ is the shift in the segment, $\Delta$ is some incremental
segment used in the summations for computational simplicity, and
$0 \le N_j \le \lfloor 160/\Delta \rfloor$. Then, every 10 ms a
weighted-mean pitch value is calculated by:

$$P_{mean} = \frac{\sum_{i=1}^{5} \rho(n_i)\, P(n_i)}{\sum_{i=1}^{5} \rho(n_i)} \qquad (13)$$

where $\rho(n_i)$ is the normalized correlation for $P(n_i)$. The above
values (160, 10, 5) are for the particular coder and are used for
illustration. Equation (12) describes the temporal domain
pitch search and the temporal domain pitch refinement blocks
of FIG. 6. Equation (13) describes the weighted average pitch
block of FIG. 6.

Gain Quantization
The gain trajectory is commonly smeared during plosives
and onsets by downsampling and interpolation. This problem
is addressed and speech crispness is improved in accordance
with an embodiment of the invention that provides a novel
switched-predictive AbS gain VQ technique, illustrated in
FIG. 7. Switched prediction is introduced to allow for different
levels of gain correlation, and to reduce the occurrence of
gain outliers. In order to improve speech crispness, especially
for plosives and onsets, temporal weighting is incorporated in
the AbS gain VQ. The weighting is a monotonic function of
the temporal gain. Two codebooks of 32 vectors each are
used. Each codebook has an associated predictor coefficient,
$P_i$, and a DC offset $D_i$. The quantization target vector is the
DC removed log-gain vector denoted by $t(m)$. The search for
the minimal weighted mean squared error (WMSE) is performed
over all the vectors, $c_{ij}(m)$, of the codebooks. The
quantized target, $\hat{t}(m)$, is obtained by passing the quantized
vector, $c_{ij}(m)$, through the synthesis filter. Since each
quantized target vector may have a different value of the removed
DC, the quantized DC is added temporarily to the filter
memory after the state update, and the next quantized vector's
DC is subtracted from it before filtering is performed. Since
the predictor coefficients are known, direct VQ can be used to
simplify the computations. The synthesis filter adds self
correlation to the codebook vector. All combinations are tried,
and whether high or low self correlation is used depends on
which yields the best results.

Bit Allocation
The bit allocation of the coder is given in Table 1. The
frame length is 20 ms, and ten waveforms are extracted per
frame. The pitch and the gain are coded twice per frame.

TABLE 1

Bit allocation for EWI coder

Parameter     Bits/Frame     Bits/second

LPC           18              900
Pitch         2 x 6 = 12      600
Gain          2 x 6 = 12      600
REW           20             1000
SEW magn.     14              700
SEW phase      4              200

Total         80             4000
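
The two rate columns of Table 1 are related by the frame rate: a 20 ms frame gives 50 frames per second. The short check below recomputes the bits-per-second column and the totals from the bits-per-frame column; the dictionary layout and variable names are only illustrative.

```python
FRAME_MS = 20  # frame length stated in the text

# Bits per frame, as listed in Table 1 (pitch and gain are coded twice per frame).
bits_per_frame = {
    "LPC": 18,
    "Pitch": 2 * 6,
    "Gain": 2 * 6,
    "REW": 20,
    "SEW magn.": 14,
    "SEW phase": 4,
}

frames_per_second = 1000 // FRAME_MS
bits_per_second = {name: bits * frames_per_second
                   for name, bits in bits_per_frame.items()}

total_bits = sum(bits_per_frame.values())
total_rate = sum(bits_per_second.values())
```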
`
`Subjective Results
A subjective A/B test was conducted to compare the 4 kbps
EWI coder of this invention to MPEG-4 at 4 kbps, and to
G.723.1. The test data included 24 MIRS speech sentences,
12 of which are of female speakers, and 12 of male speakers.
Fourteen listeners participated in the test. The test results,
listed in Tables 2 to 4, indicate that the subjective quality of
EWI exceeds that of MPEG-4 at 4 kbps and of G.723.1 at 5.3
kbps, and it is slightly better than that of G.723.1 at 6.3 kbps.
`
TABLE 2

Test       4 kbps WI     4 kbps MPEG-4

Female     65.48%        34.52%
Male       61.90%        38.10%

Total      63.69%        36.31%

Table 2 shows the results of subjective A/B tests for comparison
between the 4 kbps WI coder and the 4 kbps MPEG-4.
With 95% certainty the WI preference lies in [58.63%, 68.75%].
`
`
`
TABLE 3

Test       4 kbps WI     5.3 kbps G.723.1

Female     57.74%        42.26%
Male       61.31%        38.69%

Total      59.52%        40.48%

Table 3 shows the results of subjective A/B tests for comparison
between the 4 kbps WI coder and 5.3 kbps G.723.1. With
95% certainty the WI preference lies in [54.17%, 64.88%].
`
TABLE 4

Test       4 kbps WI     6.3 kbps G.723.1

Female     54.76%        45.24%
Male       52.98%        47.02%

Total      53.87%        46.13%

Table 4 shows the results of subjective A/B tests for comparison
between the 4 kbps WI coder and 6.3 kbps G.723.1. With 95%
certainty the WI preference lies in [48.51%, 59.23%].
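
The reported intervals can be reproduced with a normal approximation to the binomial distribution. The sketch below rests on assumptions that the text does not state: that each of the 14 listeners cast one vote on each of the 24 sentences (336 votes per comparison), that the vote counts 214, 200 and 181 are the ones implied by the reported total percentages, and that the half-width of the interval was rounded to a whole number of votes.

```python
import math

LISTENERS = 14
SENTENCES = 24
VOTES = LISTENERS * SENTENCES  # 336, assuming one vote per listener per sentence


def preference_interval(votes_for, total=VOTES, z=1.96):
    """Return (preference %, lower %, upper %) using a normal approximation."""
    p = votes_for / total
    # Assumption: the half-width is rounded to a whole number of votes.
    half = round(z * math.sqrt(total * p * (1 - p)))
    low = 100 * (votes_for - half) / total
    high = 100 * (votes_for + half) / total
    return round(100 * p, 2), round(low, 2), round(high, 2)


# Vote counts back-computed from the reported total percentages (an inference).
table2 = preference_interval(214)  # versus 4 kbps MPEG-4
table3 = preference_interval(200)  # versus 5.3 kbps G.723.1
table4 = preference_interval(181)  # versus 6.3 kbps G.723.1
```

Under these assumptions only the interval for Table 4 includes 50%, which is consistent with the text describing the advantage over G.723.1 at 6.3 kbps as slight.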
The present invention incorporates several new techniques
that enhance the performance of the WI coder: analysis-by-synthesis
vector quantization of the dispersion phase, AbS
optimization of the SEW, a special pitch search for transitions,
and switched-predictive analysis-by-synthesis gain
VQ. These features improve the algorithm and its robustness.
The test results indicate that the performance of the EWI
coder slightly exceeds that of G.723.1 at 6.3 kbps and therefore
EWI achieves very close to toll quality, at least under clean
speech conditions.
`The invention claimed is:
1. A method for using a computer processor to interpolatively
code a digitized audio waveform input signal having a
first bitrate into a coded audio waveform output signal having
a second bitrate lower than said first bitrate, said method
comprising the steps of:
extracting a slowly evolving waveform from the digitized
audio waveform input signal;
estimating a dispersion phase of an excitation signal;
locking onto a most probable pitch period;
quantizing a sequence of gain trajectory correlation values;
using the computer processor to transform the extracted
slowly evolving waveform, the estimated dispersion
phase, the most probable pitch period and the quantized
sequence of gain trajectory values into an interpolatively
coded audio waveform output signal with said lower
bitrate; and
outputting said coded audio waveform output signal,
wherein said method comprises using the computer processor
to execute at least one step selected from the
group consisting of:
(a) performing an analysis-by-synthesis vector quantization
of the dispersion phase such that a linear shift
phase residual is minimized;
(b) computing a weighted average of a group of adjacent
pitch values in order to compute the most probable
pitch period;
(c) performing spectral and temporal pitch searching in
order to compute the most probable pitch period, such
that the temporal pitch searching is performed at a
different rate than the spectral pitch searching;
`
`(d) incorporating temporal weighting in an analysis-by
`synthesis vector-quantization of the gain trajectory
`correlation values;
`(e) quantizing adjacent gain trajectory correlation values
`by analysis-by-synthesis vector-quantization without
`downsampling or interpolation;
(f) incorporating switched prediction filtering in an
analysis-by-synthesis vector-quantization of the
sequence of gain trajectory correlation values;
`(g) temporal pitch searching with varying segment
`boundaries.
2. The method of claim 1 in which said method incorporates
all of steps (a) through (g).
`3. The method of claim 2 in which said digitized audio
`waveform input signal is representative of speech and said
`coded output signal has a subjective speech quality at 4 kbps
`better than that of G.723 coding at 6.3 kbps.
`4. The method of claim 1, wherein distortion is reduced by
`obtaining an accumulated weighted distortion between a
`sequence of input waveforms and a sequence of quantized
`and interpolated waveforms.
`5. The method of claim 1 wherein said at least one step is
`step (a) further comprising providing at least one codebook
`comprising magnitude and dispersion phase information for
`predetermined waveforms, approximately aligning a linear
`phase or output, then iteratively shifting the approximately
`aligned linear phase input or output, comparing the shifted
`input or output to a plurality of waveforms reconstructed from
`the magnitude and dispersion phase information contained in
`said at least one codebook, and selecting the reconstructed
`waveform that best matches one of the iteratively shifted
`inputs or outputs.
`6. The method of claim 1 wherein said at least one step
`includes step (g) and said varying segment boundaries are
`used to compute a best boundary by iteratively shifting and
`changing the length