A NEW MODEL OF LPC EXCITATION FOR PRODUCING
`NATURAL-SOUNDING SPEECH AT LOW BIT RATES
`
`Bishnu S. Atal and Joel R. Remde
`
`Bell Laboratories
`Murray Hill, New Jersey 07974
`
`ABSTRACT
`
`The excitation for LPC speech synthesis usually consists of two
separate signals - a delta-function pulse once every pitch period for
`voiced speech and white noise for unvoiced speech. This manner of
`representing excitation requires that speech segments be classified
accurately into voiced and unvoiced categories and the pitch period of
voiced segments be known. It is now well recognised that such a rigid
idealization of the vocal excitation is often responsible for the
unnatural quality associated with synthesized speech. This paper describes
`a new approach to the excitation problem that does not require a priori
`knowledge of either the voiced-unvoiced decision or the pitch period.
`All classes of sounds are generated by exciting the LPC filter with a
`sequence of pulses; the amplitudes and locations of the pulses are
`determined using a non-iterative analysis-by-synthesis procedure. This
`procedure minimizes a perceptual-distance metric representing
subjectively-important differences between the waveforms of the
original and the synthetic speech signals. The distance metric takes
account of the finite-frequency resolution as well as the differential
`sensitivity of the human ear to errors in the formant and inter-formant
`regions of the speech spectrum.
`
`INTRODUCTION
`
Synthesis of natural-sounding speech at low bit rates has been
`a topic of considerable interest in speech research. Speech coding
`methods can be classified into two broad categories: The first type
of coders are the waveform coders [1], which attempt to mimic the
speech waveform as faithfully as possible. The second type of
coders are vocoders [2-4], which synthesize speech using a
`parametric model of speech production. Typical examples of
`waveform coders are pulse code modulation systems and their
differential generalizations [1], adaptive predictive [5] and
`transform coders [6]. The waveform coders are capable of pro-
`ducing high quality speech but only at bit rates above about 16
`kbits/sec. The vocoders are efficient at reducing the bit rate to
much lower values [7] but do so only at the cost of lower speech
`quality and intelligibility. The speech quality from vocoders can-
`not generally be improved by increasing the bit rate.
A model of speech production for synthesizing speech at low
`bit rates is shown in Fig. 1. The model performs two basic func-
`tions. It has a linear filter to model the characteristics of the
`vocal tract and the spectral shaping of the vocal source. The
second part provides the excitation to the linear filter. The model
`assumes that the speech signal can be classified into one of two
`classes, voiced and unvoiced, and that the pitch period of the
`voiced speech segment is known. For voiced speech, the excita-
`tion is a quasi-periodic pulse train with delta functions located at
`pitch period intervals. For unvoiced speech, the excitation is
`white noise.
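The two-state excitation described above can be sketched in a few lines of Python. This is only an illustration of the traditional model, not code from the paper: the function name `classical_excitation`, the unit pulse amplitude, the unit noise variance, and the default pitch period of 80 samples are all assumptions chosen for the example.

```python
import random

def classical_excitation(num_samples, voiced, pitch_period=80, seed=0):
    """Excitation of the traditional two-state LPC synthesizer (illustrative).

    voiced=True  -> unit pulses spaced `pitch_period` samples apart
    voiced=False -> zero-mean, unit-variance white Gaussian noise
    """
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)  # seeded so the example is repeatable
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

voiced_exc = classical_excitation(240, voiced=True, pitch_period=80)
unvoiced_exc = classical_excitation(240, voiced=False)
pulse_positions = [n for n, v in enumerate(voiced_exc) if v != 0.0]
```

The sketch makes the rigidity of the model visible: every frame must be labeled voiced or unvoiced, and a voiced frame needs a pitch period before any excitation can be produced.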
Fig. 1. Block diagram of an LPC speech synthesizer using a periodic
pulse train for synthesizing voiced speech and white noise for
synthesizing unvoiced speech.

CH1746-7/82/0000-0614 $00.75 © 1982 IEEE

This idealized model of speech production is widely used for
synthesizing speech at low bit rates. However, it is difficult to
`produce high-quality speech with this model, even at high bit
`rates. The model is used, because it is perhaps the only way at
`present to synthesize speech at bit rates below about 4 kbits/s.
Why is it so difficult to produce high-quality speech from this
`model? The central problem is the highly inflexible way in which
`the excitation is generated in the model. The introduction of
`linear predictive coding (LPC) techniques to speech analysis and
`synthesis did provide an alternate way of representing the spectral
information by all-pole filter parameters [8]. However, the model
`for the excitation of the filter was still carried over from the chan-
`nel vocoders [3]. The invention of voice-excited vocoders [9] was
`directed to the solution of the excitation problem but did not pro-
`vide an entirely satisfactory solution. Accurate separation of
`speech into two classes, voiced and unvoiced, is difficult to
`achieve in practice. There are more than two modes in which the
`vocal tract is excited and often these modes are mixed. An exam-
`ple of a short segment of speech waveform, shown in Fig. 2, illus-
`trates this problem. Of course, there are some easily recognizable
`voiced and unvoiced portions. But, there are regions where it is
`not clear whether the signal is voiced or unvoiced and what the
pitch period is. It is indeed a difficult task - using either manual
`or automatic means - to classify short segments of waveform reli-
`ably into voiced and unvoiced categories.
`
`Fig. 2. An example of a short segment of the speech waveform.
`
Ex. 1028 / Page 1 of 4
Apple v. Saint Lawrence

Even when the speech waveform is clearly periodic, it is a
gross simplification to assume that there is only one point of
excitation in the entire pitch period [10]. There is some evidence
that, apart from the main excitation which occurs at glottal
closure, there is secondary excitation, not only at glottal opening and
during the open phase, but also after the closure [10]. These
`results suggest that the excitation for voiced speech should consist
`of several pulses in a pitch period rather than just one at the
`beginning of the period. The main difficulty so far in using a
`multi-pulse model has been our inability to develop a satisfactory
`procedure for determining these pulses.
`We present in this paper a multi-pulse model for the excita-
`tion of LPC synthesizers and describe a simple procedure for
`determining the locations and amplitudes of the pulses. We find
`that this model is flexible enough to provide high quality speech
`even at low bit rates. We make no a priori assumption about the
`nature of the excitation signal. The excitation consists of a
`sequence of pulses for all speech classes - including voiced and
`unvoiced speech. We make no attempt to make it periodic or
`non-periodic.
Of course, if the number of pulses is increased to an arbitrarily
large value so that there is a pulse at every sampling instant, it
`should be possible to duplicate the original speech waveform (at
`the expense of a high bit rate). What we find interesting is that
only a few pulses (typically 8 pulses every 10 msec) are sufficient
`for generating different kinds of speech sounds, including voiced
`and unvoiced, with little audible distortion. Thus, we do not need
`any prior knowledge of either the voiced-unvoiced decision or the
`pitch period for synthesizing speech. The periodic (or non-
`periodic) nature of the speech signal does not play a crucial role
`in analysis or synthesis, although the periodicity could be used to
`decrease the bit rate to lower values.
`
`MULTI-PULSE EXCITATION MODEL
`
`Fig. 3 shows the block diagram of an LPC speech synthesizer
`with multi-pulse excitation. It is quite similar to the traditional
`LPC synthesizer except for the absence of the pulse and white
`noise generators and the voiced-unvoiced switch. The excitation
`for the all-pole filter is generated by an excitation generator that
produces a sequence of pulses located at times $t_1, t_2, \ldots, t_n, \ldots$
with amplitudes $\alpha_1, \alpha_2, \ldots, \alpha_n, \ldots$, respectively. The all-pole
filter could be replaced by a pole-zero filter if desired. The
sampled output of the all-pole filter is passed through a low-pass filter
to produce a continuous speech waveform $\hat{s}(t)$. We will now
discuss an analysis-by-synthesis procedure for determining the
locations and amplitudes of the excitation pulses.
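The synthesizer structure just described can be illustrated with a short Python sketch: a sparse pulse sequence drives an all-pole filter. The helper names, the one-pole filter coefficient, and the pulse values are hypothetical and chosen only to keep the example small. The final low-pass (reconstruction) filter is omitted, and the predictor coefficients are used directly rather than reflection coefficients.

```python
def multipulse_excitation(num_samples, locations, amplitudes):
    """Build the excitation: zeros everywhere except at the pulse locations."""
    v = [0.0] * num_samples
    for t, amp in zip(locations, amplitudes):
        v[t] += amp
    return v

def all_pole_filter(excitation, a, memory=None):
    """All-pole synthesis: s[n] = v[n] + sum_{k=1..p} a[k-1] * s[n-k].

    `memory` holds the last p outputs of the previous interval,
    most recent first; it defaults to zeros.
    """
    p = len(a)
    past = list(memory) if memory is not None else [0.0] * p
    out = []
    for v in excitation:
        s = v + sum(a[k] * past[k] for k in range(p))
        out.append(s)
        past = ([s] + past)[:p]  # shift the filter memory
    return out

# Hypothetical example: one-pole filter, two pulses in an 8-sample interval.
a_demo = [0.9]
v_demo = multipulse_excitation(8, locations=[1, 5], amplitudes=[1.0, -0.5])
s_demo = all_pole_filter(v_demo, a_demo)
```

Note that nothing in this structure distinguishes voiced from unvoiced speech; the character of the output is determined entirely by where the pulses are placed and how large they are.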
`
Analysis-by-Synthesis Method of Determining Excitation
`
`An analysis-by-synthesis procedure for determining the loca-
`tions and the amplitudes of the pulses is shown in the block
diagram of Fig. 4. The LPC synthesizer produces samples $\hat{s}_n$ of
synthetic speech signal in response to the excitation $v_n$. The
synthetic speech samples are compared with the corresponding
speech samples of the original (natural) speech signal to produce
an error signal $e_n$. This error is not very meaningful and must be
modified to take account of how human perception treats the
error. Our knowledge in this area is not perfect but phenomena
like masking and the limited frequency resolution of the ear must
be considered [11,12]. Broadly speaking, the error in the formant
regions must be de-emphasized. This is done by a linear filter
which suppresses the energy in the error signal in the formant
regions. The error signal is thus weighted to produce a
subjectively-meaningful measure of the difference between the
signals $\hat{s}_n$ and $s_n$. The weighted error is squared and averaged
over a short time interval 5 to 10 msec in duration to produce the
mean-squared weighted error $\epsilon$. The locations and amplitudes of
the pulses are chosen to minimize the error $\epsilon$.
`
[Fig. 3 diagram labels: excitation generator, LPC synthesizer, low-pass filter, synthetic speech, reflection coefficients]
`Fig. 3. Block diagram of an LPC speech synthesizer with multi-
`pulse excitation.
`
`Fig. 4. Block diagram of an analysis-by-synthesis procedure for
`determining locations and amplitudes of pulses for the multi-pulse
`excitation.
`
Perceptual Criteria for Selecting the Error-Weighting Filter
`
`The error-weighting procedure needs further explanation. As
`suggested earlier, the error e is not a valid measure of the per-
`ceptual difference between the original and the synthetic speech
`signals. To obtain a better measure of this difference, we define a
`frequency-weighted error as
`
$$\epsilon = \int_0^{f_s} \left| S(f) - \hat{S}(f) \right|^2 W(f) \, df, \tag{1}$$

where $S(f)$ and $\hat{S}(f)$ are the Fourier transforms of the original
and the synthetic speech signals, respectively, $W(f)$ is a
suitably-chosen weighting function, $f$ is the frequency variable,
and $f_s$ is the sampling frequency. Due to the relatively high
concentration of speech energy in formant regions, we can tolerate
larger errors in the formant regions in comparison to the
in-between formant regions. The weighting function $W(f)$ is
chosen to de-emphasize formant regions in the error spectrum.
Let $1 - P(z)$ be the LPC inverse filter in the z-transform
notation. Then, the short-time spectral envelope of the speech signal
is given by

$$S_e(f) = \frac{K}{\left| 1 - P\left(e^{2\pi j f / f_s}\right) \right|^2}, \tag{2}$$

where $K$ is the mean-squared prediction error. The inverse filter
$1 - P(z)$ is given in terms of the inverse filter coefficients $a_k$ as

$$1 - P(z) = 1 - \sum_{k=1}^{p} a_k z^{-k} = 1 - \sum_{k=1}^{p} a_k e^{-2\pi j k f / f_s}. \tag{3}$$

If we now represent the transfer function of the error-weighting
filter by $W(z)$, then a suitable choice for $W(z)$ is given by

$$W(z) = \frac{1 - \sum_{k=1}^{p} a_k z^{-k}}{1 - \sum_{k=1}^{p} a_k \gamma^{k} z^{-k}}, \tag{4}$$

`
where $\gamma$ is a fraction between 0 and 1 and controls the increase in
the error in the formant regions. The filter $W(z)$ changes from
$W(z) = 1$ for $\gamma = 1$ to $W(z) = 1 - P(z)$ for $\gamma = 0$. The value of $\gamma$ is
determined by the degree to which one wishes to de-emphasize
`
the formant regions in the error spectrum. The optimum value of $\gamma$
must be determined by suitable listening tests. We find that the
choice of $\gamma$ is not too critical - a typical value is approximately 0.8
`at a sampling frequency of 8 kHz. Figure 5 shows an example of
`the speech spectrum and the frequency response of the
`corresponding error-weighting filter.
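As an illustration of how the error-weighting filter behaves in the time domain, the sketch below applies a filter whose numerator is the inverse filter and whose denominator uses the coefficients scaled by powers of the weighting fraction. The function name `weighting_filter`, the zero initial conditions, and the one-coefficient example are assumptions made for this sketch, not details given in the paper.

```python
def weighting_filter(x, a, gamma=0.8):
    """Apply W(z) = (1 - sum a_k z^-k) / (1 - sum a_k gamma^k z^-k) to x.

    Difference equation (zero initial state assumed):
        y[n] = x[n] - sum_k a_k x[n-k] + sum_k a_k gamma^k y[n-k]
    """
    p = len(a)
    g = [a[k] * gamma ** (k + 1) for k in range(p)]  # a_k * gamma^k, k = 1..p
    y = []
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc -= a[k - 1] * x[n - k]   # numerator: inverse filter
                acc += g[k - 1] * y[n - k]   # denominator: damped poles
        y.append(acc)
    return y

impulse = [1.0, 0.0, 0.0, 0.0]
a_example = [0.9]  # hypothetical single predictor coefficient
w_identity = weighting_filter(impulse, a_example, gamma=1.0)  # W(z) = 1
w_inverse = weighting_filter(impulse, a_example, gamma=0.0)   # W(z) = 1 - P(z)
w_typical = weighting_filter(impulse, a_example, gamma=0.8)
```

The two limiting cases reproduce the behavior stated in the text: with the fraction set to 1 the filter passes the error unchanged, and with it set to 0 the filter reduces to the LPC inverse filter.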
`
[Fig. 5 plot; horizontal axis: frequency (kHz)]
`
`Fig. 5. An example of the speech spectrum and the frequency
`response of the corresponding error-weighting filter.
`
`Error-Minimization Procedure
`
`The weighted error is used in an error minimization pro-
`cedure to select the optimal locations and amplitudes of the pulses
`in the excitation signal. In general, such a procedure would be
`extremely complex if one seeks to determine all the pulses at
`once. We find that an efficient solution is obtained by determin-
`ing the location and amplitude of pulses - one pulse at a time. No
`doubt, it is a sub-optimal procedure, but our experience so far
`suggests that it is a reasonable one.
`In finding the locations and amplitudes of the pulses - one
`pulse at a time - we have converted a problem with many
`unknowns to just two unknowns. Moreover, the amplitude of the
`pulses appears as a linear factor in the error and as a quadratic
`factor in the mean-squared error. A closed-form solution for the
`pulse amplitude is obtained by setting the derivative of the
`mean-squared error with respect to the unknown amplitude to
`zero. The mean-squared error is then a function of only the loca-
`tion of the pulse. The optimum location is found by computing
`the error for all possible pulse locations in a given time interval
`and by locating the minimum of the error. The procedure can be
`further refined by observing that the unknown amplitudes of all
`the pulses can be determined in a single step by solving a set of
`linear equations provided locations of the pulses are known. We
`determine the pulse locations first, one pulse at a time, and then
`use the resulting locations to determine the amplitudes in a single
`step.
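The closed-form amplitude step described above can be written out explicitly. The notation here is assumed for illustration and does not appear in the paper: $r_n$ is the weighted error remaining before the new pulse is placed, $h_n$ is the impulse response of the synthesis filter followed by the error-weighting filter, $N$ is the number of samples in the search interval, and $A$ and $m$ are the amplitude and location of the new pulse.

```latex
\begin{align*}
E(A, m) &= \sum_{n=0}^{N-1} \bigl( r_n - A\, h_{n-m} \bigr)^2,
  \qquad h_n = 0 \text{ for } n < 0, \\
\frac{\partial E}{\partial A}
  &= -2 \sum_{n=0}^{N-1} \bigl( r_n - A\, h_{n-m} \bigr)\, h_{n-m} = 0
  \;\Longrightarrow\;
  A^{*}(m) = \frac{\sum_{n} r_n\, h_{n-m}}{\sum_{n} h_{n-m}^{2}}, \\
E\bigl(A^{*}(m), m\bigr)
  &= \sum_{n} r_n^{2} - \frac{\bigl( \sum_{n} r_n\, h_{n-m} \bigr)^{2}}{\sum_{n} h_{n-m}^{2}}.
\end{align*}
```

Under these assumptions the error is quadratic in $A$, so the amplitude has a unique minimizer for each candidate location, and the best location is the $m$ that makes the subtracted term in the last line largest. Once all locations are fixed, setting the derivative with respect to every amplitude to zero at the same time gives one linear equation per pulse, which is the single-step amplitude solution mentioned in the text.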
`The procedure for finding the locations and amplitudes of
`various pulses in any given time interval can be summarized as
`follows: At the beginning, without any excitation pulse, the syn-
`thetic speech output is entirely generated from the memory of the
`all-pole filter from previous synthesis intervals. The contribution
`of this past memory is subtracted out from the speech signal and
`the location and amplitude of one single pulse, which minimizes
`the mean-squared weighted error, is determined. A new error
`signal is now computed by subtracting out the contribution of the
`pulse just determined. The process of locating new pulses to
`reduce the mean-squared weighted error is continued until the
`error is reduced to acceptable limits. Figure 6 illustrates the error
`minimization procedure. The waveform in the first row of Fig. 6
`is that of the original speech signal, about 5 msec in duration. At
`the beginning, the excitation is zero and the synthetic speech sig-
`nal is produced by the memory hangover from the previous syn-
`thesis intervals. The error of course is large and is reduced by
`placing a pulse in the excitation as shown in Fig. 6 (b). This pro-
`cess is continued by adding more pulses; the results with two
`pulses are shown in Fig. 6 (c). The error is decreased further.
`The rate of improvement usually slows down after a few pulses
`have been placed. The results with three and four pulses are also
`shown in Fig. 6. The process of adding pulses can be continued
`but in practice we find that only minimal improvements are
`achieved after about 8 pulses have been placed in a 10 msec inter-
`val.
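The one-pulse-at-a-time search summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `target` is taken to be the weighted original speech with the filter-memory contribution already subtracted, `h` is the impulse response of the synthesis filter cascaded with the weighting filter (which reduces to an all-pole filter with coefficients scaled by powers of the weighting fraction), and the function names are hypothetical. The final joint re-solution of all amplitudes is omitted.

```python
def impulse_response(a, gamma, length):
    """Impulse response of 1 / (1 - sum_k a_k gamma^k z^-k)."""
    h = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc += a[k - 1] * gamma ** k * h[n - k]
        h.append(acc)
    return h

def find_pulses(target, h, num_pulses):
    """Greedy search: each new pulse minimizes the remaining squared error."""
    n_len = len(target)
    residual = list(target)
    pulses = []
    for _ in range(num_pulses):
        best = None  # (error reduction, location, amplitude)
        for m in range(n_len):
            corr = sum(residual[n] * h[n - m] for n in range(m, n_len))
            energy = sum(h[n - m] ** 2 for n in range(m, n_len))
            if energy == 0.0:
                continue
            reduction = corr * corr / energy
            if best is None or reduction > best[0]:
                best = (reduction, m, corr / energy)  # closed-form amplitude
        if best is None:
            break
        _, m, amp = best
        pulses.append((m, amp))
        for n in range(m, n_len):  # subtract the new pulse's contribution
            residual[n] -= amp * h[n - m]
    return pulses, residual

# Hypothetical check: a target made from one pulse should be recovered.
h_demo = impulse_response([0.9], 0.8, 20)
target_demo = [0.0] * 3 + [2.0 * x for x in h_demo[:17]]
pulses_demo, residual_demo = find_pulses(target_demo, h_demo, 1)
```

Because each pulse is chosen with the earlier ones held fixed, the result is sub-optimal in general, which matches the caveat in the text; the exhaustive scan over locations is affordable only because the interval is short.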
`
`RESULTS
`
`Several examples of the original and the synthetic speech
`waveforms are shown in Figs. 7-9. The duration of the speech
`segment is 0.1 sec in each figure. An all-pole filter with 16 poles
`was used to synthesize speech. The parameters of the all-pole
filter were determined once every 10 msec using the stabilized
covariance method of LPC analysis with high-frequency correction
[5]. The covariance matrix was computed by averaging speech
data over 20 msec long time intervals. The pulse locations and
amplitudes were determined by minimizing the error over
successive 5 msec time intervals.
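To make the time scales above concrete, the following sketch converts them to sample counts. It assumes an 8 kHz sampling rate, which the paper mentions only in connection with the weighting fraction, so the resulting counts are illustrative rather than reported values; the dictionary keys and helper name are hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # assumption: not stated explicitly for these experiments

def ms_to_samples(duration_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a duration in milliseconds to a whole number of samples."""
    return sample_rate_hz * duration_ms // 1000

analysis_setup = {
    "lpc_order": 16,                                  # poles in the all-pole filter
    "lpc_update_samples": ms_to_samples(10),          # filter update interval
    "covariance_window_samples": ms_to_samples(20),   # averaging window
    "pulse_search_block_samples": ms_to_samples(5),   # error-minimization block
}

# About 8 pulses per 10 msec corresponds to about 4 pulses per 5 msec block.
pulses_per_block = 8 * 5 // 10
```

Under this assumption each 5 msec search block holds 40 candidate pulse locations, so the exhaustive location scan in the error-minimization procedure stays small.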
`
[Fig. 6 panels (a)-(e); rows in each panel: original speech, excitation, synthetic speech, error]
`
`Fig. 6. Illustration of the waveform matching procedure. The waveforms of the original speech, the exci-
`tation, the synthetic speech, and the error signals are shown (a) in the beginning without any excitation
`pulse, (b) with one excitation pulse, (c) with two excitation pulses, (d) with three excitation pulses, and
`(e) with four excitation pulses.
`
`The figures show the waveforms of the original speech signal,
`the synthetic speech signal, the excitation signal, and the error
`between the original and the synthetic waveforms. As can be
`seen, the multi-pulse excitation is able to follow rapid changes in
speech waveforms, such as those occurring in rapid transitions.
For voiced speech, the quasi-periodic nature of the excitation is
produced by the error-minimization procedure without any
`knowledge of whether the segment is voiced or unvoiced and
`what its pitch period is. Similarly, the random excitation pattern
`is also produced automatically by the matching procedure. Mixed
`excitation comes out naturally depending upon the nature of the
`speech signal.
`Informal listening tests show that the synthetic speech signal
`is perceptually close to the original speech signal. The synthetic
`speech signal from the multi-pulse excitation model does not have
`the unnatural or buzzy characteristics so often associated with
`synthetic speech from vocoders.
`
`CONCLUSIONS
`
`Our work on developing a new model of LPC excitation for
`producing high quality speech is as yet very preliminary. We find
that speech quality can be significantly improved by using a
multi-pulse excitation signal. Moreover, the locations and amplitudes of
`the pulses in the multi-pulse excitation can be determined by
`using a computationally efficient non-iterative analysis-by-
`synthesis procedure that can minimize the perceptual difference
`between the natural and the synthetic speech waveforms. The
`difficult problems of voiced-unvoiced decision and pitch analysis
`are eliminated.
`
`REFERENCES
`
[1] N. S. Jayant, "Digital Coding of Speech Waveforms: PCM, DPCM
and DM Quantizers," Proc. IEEE, vol. 62, pp. 611-632, May 1974.
[2] M. R. Schroeder, "Vocoders: Analysis and Synthesis of Speech,"
Proc. IEEE, vol. 54, pp. 720-734, 1966.
[3] J. L. Flanagan, Speech Analysis Synthesis and Perception, 2nd Edition,
New York: Springer-Verlag, 1972.
[4] J. D. Markel and A. H. Gray, Jr., Linear Prediction of Speech, New
York: Springer-Verlag, 1976.
[5] B. S. Atal and M. R. Schroeder, "Predictive Coding of Speech
Signals and Subjective Error Criteria," IEEE Trans. Acoust., Speech,
Signal Processing, vol. ASSP-27, pp. 247-254, June 1979.
[6] R. E. Crochiere and J. M. Tribolet, "Frequency Domain Coding of
Speech," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27,
pp. 512-530, Oct. 1979.
[7] J. D. Markel and A. H. Gray, Jr., "A Linear Prediction Vocoder
Simulation Based Upon the Autocorrelation Method," IEEE Trans.
Acoust., Speech, Signal Processing, vol. ASSP-22, pp. 124-134, 1974.
[8] B. S. Atal and S. L. Hanauer, "Speech Analysis and Synthesis by
Linear Prediction," J. Acoust. Soc. Amer., vol. 50, pp. 637-655, Aug.
1971.
[9] M. R. Schroeder, "Recent Progress in Speech Coding at Bell
Telephone Laboratories," Proc. III Int. Congress on Acoustics, pp.
201-210, Elsevier Publishing Co., Amsterdam, 1961.
[10] J. N. Holmes, "Formant Excitation Before and After Glottal
Closure," Conf. Rec. 1976 IEEE Int. Conf. Acoust., Speech and
Signal Processing, pp. 39-42, April 1976.
[11] M. R. Schroeder, B. S. Atal and J. L. Hall, "Optimizing Digital
Speech Coders by Exploiting Masking Properties of the Human
Ear," Jour. Acoust. Soc. Amer., vol. 66, pp. 1647-1652, Dec. 1979.
[12] M. R. Schroeder, B. S. Atal and J. L. Hall, "Objective Measure of
Certain Speech Signal Degradations Based on Properties of Human
Auditory Perception," in Frontiers of Speech Communication Research,
edited by B. Lindblom and S. Ohman, London: Academic Press,
1979, pp. 217-229.
`
`
Fig. 7. Waveforms of the original speech, the synthetic speech, the
`excitation, and the error signals.
`
`
`Fig. 8. Another example of the waveforms of the original speech,
`the synthetic speech, the excitation signal, and the error.
`
`
`Fig. 9. Another example of the waveforms of the original speech,
`the synthetic speech, the excitation signal, and the error.
`