`HIGH-QUALITY SPEECH AT VERY LOW BIT RATES
`
`Manfred R. Schroeder
`Drittes Physikalisches Institut
`University of Goettingen, F. R. Germany
`and AT&T Bell Laboratories
`Murray Hill, New Jersey 07974
`
`Bishnu S. Atal
`AT&T Bell Laboratories
`Murray Hill, New Jersey 07974
`
`ABSTRACT
`We describe in this paper a code-excited linear predictive coder
`in which the optimum innovation sequence is selected from a code
`book of stored sequences to optimize a given fidelity criterion. Each
`sample of the innovation sequence is filtered sequentially through two
`time-varying linear recursive filters, one with a long-delay (related to
`pitch period) predictor in the feedback loop and the other with a
`short-delay predictor (related to spectral envelope) in the feedback
`loop. We code speech, sampled at 8 kHz, in blocks of 5-msec dura-
`tion. Each block consisting of 40 samples is produced from one of
`1024 possible innovation sequences. The bit rate for the innovation
`sequence is thus 1/4 bit per sample. We compare in this paper
`several different random and deterministic code books for their
`effectiveness in providing the optimum innovation sequence in each
`block. Our results indicate that a random code book has a slight
`speech quality advantage at low bit rates. Examples of speech pro-
`duced by the above method will be played at the conference.
`
`INTRODUCTION
`
The performance of adaptive predictive coders for speech signals
using instantaneous quantizers deteriorates rapidly at bit rates below
about 10 kbits/sec. Our past work has shown that high speech quality
can be maintained in predictive coders at lower bit rates by using
non-instantaneous stochastic quantizers which minimize a subjective
error criterion based on properties of human auditory perception [1].
We have used tree search procedures to encode the innovation signal
and have found the tree codes to perform very well at 1 bit/sample (8
kbits/sec). The speech quality is maintained even at 1/2 bit/sample
when the tree has 4 branches at every node and 4 white Gaussian
random numbers on each branch [2].
`The tree search procedures are suboptimal and the performance
`of tree codes deteriorates significantly when the innovation signal is
`coded at only 1/4 bit/sample (2 kbits/sec). Such low bit rates for the
`innovation signal are necessary to bring the total bit rate for coding
`the speech signal down to 4.8 kbits/sec - a rate that offers the possi-
`bility of carrying digital speech over a single analog voice channel.
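
As a rough check of the rates quoted for the innovation signal, the arithmetic can be sketched in Python. The function name is illustrative only; the numbers (1024 code words, 40 samples per block, 8 kHz sampling) are those given in the abstract.

```python
import math

def innovation_bit_rate(codebook_size, block_len, sample_rate_hz):
    """Return (bits per block, bits per sample, bits per second)."""
    bits_per_block = math.log2(codebook_size)      # index length in bits
    bits_per_sample = bits_per_block / block_len   # spread over the block
    return bits_per_block, bits_per_sample, bits_per_sample * sample_rate_hz

bits_block, bits_sample, bits_sec = innovation_bit_rate(1024, 40, 8000)
print(bits_block, bits_sample, bits_sec)  # 10.0 0.25 2000.0
```

The remaining budget up to 4.8 kbits/sec is then available for the predictor coefficients and the amplitude factor.
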
`Fehn and Noll [3] have discussed merits of various multipath
`search coding procedures: code-book coding, tree coding, and trellis
`coding. Code-book coding is of particular interest at very low bit
`rates. In code-book coding, the set of possible sequences for a block
`of innovation signal is stored in a code book. For a given speech seg-
`ment, the optimum innovation sequence is selected to optimize a given
`fidelity criterion by exhaustive search of the code book and an index
`specifying the optimum sequence is transmitted to the receiver. In
`general, code-book coding is impractical due to the large size of the
`code books. However, at the very low bit rates we are aiming for,
`exhaustive search of the code book to find the best innovation
`sequence for encoding short segments of the speech signal becomes
possible [4].
`
`SPEECH SYNTHESIS MODEL
`
The speech synthesizer in a code-excited linear predictive coder
is identical to the one used in adaptive predictive coders [1]. It
consists of two time-varying linear recursive filters, each with a
predictor in its feedback loop, as shown in Fig. 1. The first feedback
loop includes a long-delay (pitch) predictor which generates the pitch
periodicity of voiced speech. The second feedback loop includes a
short-delay predictor to restore the spectral envelope.
`
[Figure 1 block diagram. Labels: INNOVATION (WHITE), FINE STRUCTURE,
SPECTRAL ENVELOPE (FORMANTS), SPEECH.]
`
`Fig. 1. Speech synthesis model with short and long delay
`predictors.
`
`The two predictors are determined using procedures outlined in
`References 1 and 5. The short-delay predictor has 16 coefficients and
`these are determined using the weighted stabilized covariance method
`of LPC analysis [1,5] once every 10 msec. In this method of LPC
`analysis, the instantaneous prediction error is weighted by a Hamming
`window 20 msec in duration and the predictor coefficients are deter-
`mined by minimizing the energy of the weighted error. The long-
`delay (pitch) predictor has 3 coefficients which are determined by
`minimizing the mean-squared prediction error after pitch prediction
`over a time interval of 5 msec [2].
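
The cascade of the two feedback loops can be sketched as follows. This is a minimal illustration, not the authors' program: the function name, the placement of the three pitch taps at delays M-1, M and M+1, and the handling of filter memory are assumptions of the sketch. The sign convention follows the short-delay predictor polynomial 1 - sum a_k z^-k.

```python
def synthesize(excitation, pitch_delay, pitch_coefs, lpc_coefs,
               pitch_memory=(), lpc_memory=()):
    """Filter the scaled innovation through both feedback loops.

    pitch_coefs: three taps applied at delays pitch_delay-1, pitch_delay,
                 pitch_delay+1 (assumed placement; needs pitch_delay >= 2).
    lpc_coefs:   short-delay predictor coefficients a_1 .. a_p.
    *_memory:    past outputs of each loop, most recent sample last.
    """
    # Zero-pad so every delayed index exists even without supplied memory.
    d = [0.0] * (pitch_delay + 1) + list(pitch_memory)
    s = [0.0] * len(lpc_coefs) + list(lpc_memory)
    first_new = len(s)
    for e in excitation:
        # Long-delay loop: adds back the pitch-predicted part of the signal.
        n = len(d)
        pitch_pred = sum(b * d[n - (pitch_delay - 1 + j)]
                         for j, b in enumerate(pitch_coefs))
        d.append(e + pitch_pred)
        # Short-delay loop: restores the spectral envelope.
        m = len(s)
        lpc_pred = sum(a * s[m - k] for k, a in enumerate(lpc_coefs, start=1))
        s.append(d[-1] + lpc_pred)
    return s[first_new:]
```

With a 16-coefficient short-delay predictor and a 40-sample excitation, one call produces one 5-msec block; the memories would be carried from block to block.
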
`
`SELECTION OF OPTIMUM INNOVATION SEQUENCE
`
`Let us consider the coding of a short block of speech signal 5
`msec in duration. Each such block consists of 40 speech samples at a
`sampling frequency of 8 kHz. A bit rate of 1/4 bit per sample
`corresponds to 1024 possible sequences (10 bits) of length 40 for each
`block. The procedure for selecting the optimum sequence is illus-
`trated in Fig. 2. Each member of the code book provides 40 samples
`of the innovation signal. Each sample of the innovation signal is
`scaled by an amplitude factor that is constant for the 5 msec block
`and is reset to a new value once every 5 msec. The scaled samples
`are filtered sequentially through two recursive filters, one for introduc-
`ing the voice periodicity and the other for the spectral envelope. The
`regenerated speech samples at the output of the second filter are com-
`pared with the corresponding samples of the original speech signal to
form a difference signal. The difference signal representing the objective
error is further processed through a linear filter to attenuate those
`frequencies where the error is perceptually less important and to
`amplify those frequencies where the error is perceptually more impor-
`tant. The transfer function of the weighting filter is given by
`
`
`CH2118-8/85/0000-0937 $1.00 © 1985 IEEE
`
`
`Ex. 1046 / Page 1 of 4
`Apple v. Saint Lawrence
`
`
`
$$W(z) = \frac{1 - \sum_{k=1}^{p} a_k z^{-k}}{1 - \sum_{k=1}^{p} a_k \alpha^k z^{-k}} \qquad (1)$$
`
where $a_k$ are the short-delay predictor coefficients, p = 16, and $\alpha$ is a
parameter for controlling the weighting of the error as a function of
frequency. A suitable value of $\alpha$ is given by

$$\alpha = e^{-2\pi \cdot 100 / f_s} \qquad (2)$$

where $f_s$ is the sampling frequency. The weighted mean-squared
`error is determined by squaring and averaging the error samples at
`the output of the weighting filter for each 5-msec block. The
`optimum innovation sequence for each block is selected by exhaustive
`search to minimize the weighted error. As mentioned earlier, prior to
`filtering, each sample of the innovation sequence is scaled by an
`amplitude factor that is constant for the 5-msec block. This ampli-
`tude factor is determined for each code word by minimizing the
`weighted mean-squared error for the block.
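
The weighting filter of Eq. (1) can be written as a difference equation, y[n] = x[n] - sum a_k x[n-k] + sum a_k alpha^k y[n-k]. The sketch below is illustrative only: the function names are hypothetical, and the filter starts from zero state, whereas a running coder would carry the filter memory across blocks.

```python
import math

def weighting_alpha(sample_rate_hz, bandwidth_hz=100.0):
    """Eq. (2): alpha = exp(-2*pi*100/f_s)."""
    return math.exp(-2.0 * math.pi * bandwidth_hz / sample_rate_hz)

def weight_error(error, lpc_coefs, alpha):
    """Apply W(z) of Eq. (1) to an error block, assuming zero initial state."""
    y = []
    for n, x in enumerate(error):
        acc = x
        for k, a in enumerate(lpc_coefs, start=1):
            if n - k >= 0:
                # Denominator term (feedback) minus numerator term (feedforward).
                acc += a * (alpha ** k) * y[n - k] - a * error[n - k]
        y.append(acc)
    return y

def weighted_mse(error, lpc_coefs, alpha):
    """Mean of the squared samples at the output of the weighting filter."""
    weighted = weight_error(error, lpc_coefs, alpha)
    return sum(v * v for v in weighted) / len(weighted)

print(round(weighting_alpha(8000), 4))
```

Setting alpha to 1 makes numerator and denominator cancel, so the error passes through unweighted; smaller values of alpha put progressively less weight on error in the formant regions.
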
`
`Fig. 2. Block diagram illustrating the procedure for selecting
`the optimum innovation sequence.
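
The search in Fig. 2 can be sketched as below. The sketch assumes that each code word has already been passed, with unit amplitude, through the synthesis and weighting filters, and that the contribution of the filter memories has been subtracted from the weighted original speech to form a target; under that assumption the weighted output is linear in the amplitude factor, so the best amplitude for a code word has a closed form. The function and variable names are hypothetical, and the closed-form step is an inference from the text rather than a procedure stated in it.

```python
def best_codeword(target, filtered_codewords):
    """Exhaustive search: return (index, gain, squared_error) of the best word.

    target:             weighted target block (list of floats).
    filtered_codewords: weighted synthesis-filter response of each code word
                        at unit amplitude.
    """
    best = None
    for index, z in enumerate(filtered_codewords):
        energy = sum(v * v for v in z)
        if energy == 0.0:
            continue  # an all-zero response cannot reduce the error
        corr = sum(t * v for t, v in zip(target, z))
        gain = corr / energy  # minimizes sum (t - gain*z)^2 for this word
        err = sum((t - gain * v) ** 2 for t, v in zip(target, z))
        if best is None or err < best[2]:
            best = (index, gain, err)
    return best

print(best_codeword([2.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]))
# (1, 2.0, 0.0)
```

Only the winning index and the quantized amplitude factor would need to be transmitted for each block.
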
`
CONSTRUCTION OF OPTIMUM CODE BOOKS
`
`A code book, within the limitation of its size, should provide as
`dense a sampling as possible of the space of innovation sequences. In
`principle, the code words could be block codes that are optimally
`placed on a hypersphere in the 40-dimensional space (representing 40
`samples in each 5-msec block). Fehn and Noll [3] have argued that
`random code books (code books with randomly selected code words)
are less restrictive than deterministic code books. Random code books,
`in some sense, provide a lower bound for the performance at any given
`bit rate. A deterministic code book, if properly constructed, should
`provide a performance that is at least equal to - if not better than -
`that of the random code books and the deterministic nature of the
`code book should make it easier to find the optimum innovation
`sequence for each block of speech. However, it is generally very
`difficult to design an optimum deterministic code book.
`As a start, we have chosen a random code book in which each
`possible code word is constructed of white Gaussian random numbers
`with unit variance. We have chosen the Gaussian distribution since
`our earlier work has shown that the probability density function of the
prediction error samples (after both short-delay and long-delay
predictions) is nearly Gaussian [1]. Figure 3 shows a plot of the first-order
`cumulative amplitude distribution function for the prediction residual
`samples and compares it with the corresponding Gaussian distribution
`function with the same mean and variance. A closer examination of
`the prediction error shows that the Gaussian assumption is valid
`almost everywhere except for stop bursts of unvoiced stop consonants
and for a few pitch periods during the transition from unvoiced or
silence regions to voiced speech.
`
`Fig. 3. First-order cumulative probability distribution function
`for the prediction residual samples (solid curve). The
`corresponding Gaussian distribution function with the same
`mean and variance is shown by the dashed curve.
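
A Gaussian random code book of the size used here (1024 code words of 40 unit-variance samples) can be generated with a seeded pseudo-random generator. This is a minimal sketch; the function name and the choice of seed are illustrative, and the generator actually used in this work is not specified.

```python
import random

def gaussian_codebook(num_words=1024, block_len=40, seed=0):
    """Return num_words code words of white Gaussian samples with unit variance."""
    rng = random.Random(seed)  # fixed seed makes the code book reproducible
    return [[rng.gauss(0.0, 1.0) for _ in range(block_len)]
            for _ in range(num_words)]

codebook = gaussian_codebook()
print(len(codebook), len(codebook[0]))  # 1024 40
```

Because only an index is transmitted, the synthesizer at the receiver would need an identical copy of the code book, for example by generating it from the same seed.
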
`
Each sample $v_n$ of the innovation sequence in a Gaussian code
book can be expressed as a Fourier series of N cosine functions
(N = 20):

$$v_n = \sum_{k=0}^{N-1} c_k \cos(\pi k n / N + \phi_k), \qquad n = 0, 1, \ldots, 2N-1, \qquad (3)$$

where $c_k$ and $\phi_k$ are independent random variables, $\phi_k$ is uniformly
distributed between 0 and $2\pi$, and $c_k$ is Rayleigh distributed with
probability density function

$$p(c_k) = c_k \exp(-c_k^2/2), \qquad c_k \geq 0. \qquad (4)$$
`
The function of the innovation sequence in the synthesis model of
Fig. 1 is to provide a correction to the filter output in reproducing the
speech waveform within the limitation of the size of the code book.
Using the Fourier series model of Eq. (3), the correction can be
considered separately for the amplitude and phase of each Fourier
component. Do we need both amplitude and phase corrections for
high-quality speech synthesis? Are the two types of corrections equally
important? These questions can be answered by restricting the
variations in the amplitudes and phases of various Fourier components in
Eq. (3). For example, a code book can be formed by setting the
amplitudes $c_k$ to a constant value and by keeping the phases $\phi_k$
uniformly distributed between 0 and $2\pi$. Another code book is formed
by setting the phases to some constant set of values and by keeping
the amplitudes Rayleigh distributed in accordance with Eq. (4).
`We have also used a code book in which the different innovation
`sequences are obtained directly from the prediction error (after nor-
`malizing to unit variance) of speech signals. The amplitudes and
`phases are no longer distributed according to Rayleigh and uniform
`density functions, respectively, but reflect the distributions represented
`in the actual prediction error.
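
The restricted code books can be sketched directly from Eq. (3) and Eq. (4). The code below is illustrative: the function names are hypothetical, and the particular constant amplitude (1.0) and constant phase set (all zeros) in the usage lines are placeholders, since the values actually used are not stated.

```python
import math
import random

def fourier_codeword(amplitudes, phases):
    """Evaluate Eq. (3): v_n = sum_k c_k cos(pi*k*n/N + phi_k), n = 0..2N-1."""
    n_terms = len(amplitudes)
    return [sum(c * math.cos(math.pi * k * n / n_terms + phi)
                for k, (c, phi) in enumerate(zip(amplitudes, phases)))
            for n in range(2 * n_terms)]

def rayleigh_sample(rng):
    """Draw from p(c) = c*exp(-c^2/2) by inverting its cumulative distribution."""
    return math.sqrt(-2.0 * math.log(1.0 - rng.random()))

def random_codeword(rng, n_terms=20, constant_amplitude=None,
                    constant_phases=None):
    """Return one code word; fix the amplitudes or the phases to restrict it."""
    if constant_amplitude is None:
        amplitudes = [rayleigh_sample(rng) for _ in range(n_terms)]
    else:
        amplitudes = [constant_amplitude] * n_terms
    if constant_phases is None:
        phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_terms)]
    else:
        phases = list(constant_phases)
    return fourier_codeword(amplitudes, phases)

rng = random.Random(1)
full = random_codeword(rng)                                   # Eq. (3) and (4)
flat_amplitude = random_codeword(rng, constant_amplitude=1.0)  # phases random
fixed_phase = random_codeword(rng, constant_phases=[0.0] * 20) # amplitudes random
```

Each call yields 2N = 40 samples, matching the 5-msec block length.
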
`
`RESULTS
`
As we mentioned earlier, the random code book provides a baseline
against which we can compare other code books. We have synthesized
several speech utterances spoken by both male and female speakers
(pitch frequencies ranging from 80 Hz to 400 Hz) using the different
code books discussed in the previous section. The random code book
(with 1024 code words) provided unexpectedly good performance. Even
in close pair-wise comparisons over headphones, only occasional small
differences were noticeable between the original and synthetic speech
utterances. These results suggest that a 10-bit random code book has
sufficient flexibility to produce high-quality speech from the
synthesis model shown in Fig. 1.

The waveforms of the original and synthetic speech signals were
found to match closely for voiced speech and reasonably well for
unvoiced speech. The signal-to-noise ratio averaged over several
seconds of speech was found to be approximately 15 dB. Examples of
speech waveforms are shown in Fig. 4. The figure shows (a) original
speech, (b) synthetic speech, (c) the LPC prediction residual, (d) the
reconstructed LPC residual, (e) the prediction residual after pitch
prediction, and (f) the coded residual from a 10-bit random code
book. As expected, the Gaussian code book is not able to reproduce a
sharp impulse in the coded residual waveform. The absence of the
sharp impulse produces appreciable phase distortion in the
reconstructed LPC prediction residual. However, this phase distortion is
mostly limited to frequency regions outside the formants.

[Figure 4: waveform panels (a) to (f) plotted against time.]

Fig. 4. Waveforms of different signals in the coder: (a) the
original speech, (b) the synthetic speech, (c) the LPC
prediction residual, (d) the reconstructed LPC residual, (e)
the prediction residual after pitch prediction, and (f) the coded
residual from a 10-bit random code book. Waveforms (c) and
(d) are amplified 5 times relative to the speech signal.
Waveforms (e) and (f) are amplified by an additional factor
of 2.

We have also examined the distribution of the reconstruction
error amongst various code words. Figure 5 shows a plot of the
number of code words which produced a given amount of rms error in
a particular 5-msec block of speech. The behavior shown is typical
of what we observed in several blocks. The minimum rms error for
this block was 30 and only 5 code words (out of a total of 1024)
produced an rms error less than 33. This indicates that the size of the
code book cannot be reduced significantly without producing a
substantial increase in the error.

[Figure 5: plot with horizontal axis RMS ERROR, ticks from 30 to 42.]

Fig. 5. Distribution of error amongst the various code words in
a Gaussian code book.

Due to the random nature of code books, different Gaussian code
books produced different innovation sequences. However, we did not
hear any audible difference between the speech signals reconstructed
from these different code books. Figure 6 shows several examples of
the innovation sequences selected from several different Gaussian code
books for one 5-msec block. The innovation sequences for the previous
blocks were kept the same; thus, the filter coefficients and the
filter memories were identical at the beginning of the block. The
coded innovation sequences show very little similarity to each other.
The amplitude spectrum for the different sequences is shown in Fig.
6(b). Again, there is no obvious common pattern amongst the
different amplitude spectra. The corresponding phase responses are
shown in Fig. 6(c).

[Figure 6: panel (a) against TIME (msec), 0 to 5; panels (b) and (c)
against FREQ (kHz), 0 to 4.]

Fig. 6. (a) Waveforms of different innovation sequences for a
particular 5-msec block, (b) amplitude spectra of innovation
sequences, and (c) phase responses of innovation sequences.

The code book with constant amplitude but uniformly distributed
phases performed nearly as well as the Gaussian code book. The
signal-to-noise ratio decreased by about 1.5 dB and there was an
audible difference between the two code books. The code book with
constant phases but Rayleigh-distributed amplitudes performed very
poorly, both in the signal-to-noise ratio and in listening to synthetic
speech. The code book based on the prediction residual signals
`derived from speech performed as well as the Gaussian code book.
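
The signal-to-noise ratios quoted in this section compare the original and synthetic waveforms. A conventional way to compute such a figure is sketched below; the exact averaging used for the reported values is not described, so this plain energy ratio over the whole signal is an assumption, and the function name is illustrative.

```python
import math

def snr_db(original, reconstructed):
    """Ratio of signal energy to reconstruction-error energy, in dB."""
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    if noise == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal / noise)

print(snr_db([10.0, 0.0], [9.0, 0.0]))  # 20.0
```

On this scale a drop of 1.5 dB corresponds to the error energy growing by a factor of about 1.4.
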
`
`CONCLUDING REMARKS
`
`Our present work with the code-excited linear predictive coder
`has demonstrated that such coders offer considerable promise for pro-
`ducing high quality synthetic speech at bit rates as low as 4.8
`kbits/sec. The random code book we have used so far obviously does
`not provide the best choice. The proper design of the code book is the
`key to success for achieving even lower bit rates than we realized in
`this study. We have so far employed a fixed code book for all speech
`data. A fixed code book is somewhat wasteful. Further efficiency
`could be gained by making the code book adaptive to the time-varying
`linear filters used to synthesize speech and to weight the error. The
`coding procedure is computationally very expensive; it took 125 sec of
Cray-1 CPU time to process 1 sec of the speech signal. The program
was, however, not optimized to run on the Cray. Most of the time was
taken up by the search for the optimum innovation sequence. A code
book with sufficient structure amenable to fast search algorithms
could lead to real-time implementation of code-excited coders.
`
`REFERENCES
`
[1] B. S. Atal, "Predictive coding of speech at low bit rates," IEEE
Trans. Commun., vol. COM-30, pp. 600-614, April 1982.
[2] M. R. Schroeder and B. S. Atal, "Speech coding using efficient
block codes," in Proc. Int. Conf. on Acoustics, Speech and Signal
Proc., vol. 3, pp. 1668-1671, May 1982.
[3] H. G. Fehn and P. Noll, "Multipath search coding of stationary
signals with applications to speech," IEEE Trans. Commun., vol.
COM-30, pp. 687-701, April 1982.
[4] B. S. Atal and M. R. Schroeder, "Stochastic coding of speech
signals at very low bit rates," in Proc. Int. Conf. Commun. -
ICC84, part 2, pp. 1610-1613, May 1984.
[5] S. Singhal and B. S. Atal, "Improving performance of multi-pulse
LPC coders at low bit rates," in Proc. Int. Conf. on Acoustics,
Speech and Signal Proc., vol. 1, paper no. 1.3, March 1984.
`
`
`