`
`The Bank-of-Filters Front-End Processor
`
`87
`
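Solution 3.3
Given the definition of $S_n(e^{j\omega})$ we have

$$
\begin{aligned}
X_n(e^{j\omega}) &= |S_n(e^{j\omega})|^2 = [S_n(e^{j\omega})][S_n(e^{j\omega})]^* \\
&= \left[\sum_{m=-\infty}^{\infty} s(m)w(n-m)e^{-j\omega m}\right]\left[\sum_{r=-\infty}^{\infty} s(r)w(n-r)e^{j\omega r}\right] \\
&= \sum_{r=-\infty}^{\infty}\sum_{m=-\infty}^{\infty} w(n-m)s(m)w(n-r)s(r)e^{-j\omega(m-r)}.
\end{aligned}
$$

Let r = k + m, then:

$$
\begin{aligned}
X_n(e^{j\omega}) &= \sum_{k=-\infty}^{\infty}\left[\sum_{m=-\infty}^{\infty} w(n-m)s(m)w(n-k-m)s(m+k)\right]e^{j\omega k}
= \sum_{k=-\infty}^{\infty} R_n(k)e^{j\omega k} \\
&= \sum_{k=-\infty}^{\infty} R_n(k)e^{-j\omega k}
\end{aligned}
$$

(since $R_n(k) = R_n(-k)$); therefore

$$
R_n(k) \longleftrightarrow X_n(e^{j\omega}).
$$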
`
`3.2.2.4 FFT Implementation of Uniform Filter Bank Based on the Short-Time
`Fourier Transform
`
We now return to the question of how to efficiently implement the computation of the set
of filter-bank outputs (Eq. (3.15)) for the uniform filter bank. If we assume, reasonably,
that we are interested in a uniform frequency spacing, that is, if

$$f_i = i(F_s/N), \qquad i = 0, 1, \ldots, N-1 \tag{3.21}$$

then Eq. (3.15a) can be written as

$$x_i(n) = e^{j\left(\frac{2\pi}{N}\right)in}\sum_{m} s(m)w(n-m)e^{-j\left(\frac{2\pi}{N}\right)im}. \tag{3.22}$$

Now consider breaking up the summation over m into a double summation of r and k, in
which

$$m = Nr + k, \qquad 0 \le k \le N-1, \qquad -\infty < r < \infty. \tag{3.23}$$

In other words, we break up the computation over m into pieces of size N. If we let

$$s_n(m) = s(m)w(n-m), \tag{3.24}$$

then Eq. (3.22) can be written as

$$x_i(n) = e^{j\left(\frac{2\pi}{N}\right)in}\sum_{r}\sum_{k=0}^{N-1} s_n(Nr+k)\,e^{-j\left(\frac{2\pi}{N}\right)i(Nr+k)}. \tag{3.25}$$
`
`IPR2023-00035
`Apple EX1015 Page 118
`
`
`
Chap. 3  Signal Processing and Analysis Methods
`
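Since $e^{-j2\pi ir} = 1$ for all i, r, then

$$x_i(n) = e^{j\left(\frac{2\pi}{N}\right)in}\sum_{k=0}^{N-1}\left[\sum_{r} s_n(Nr+k)\right]e^{-j\left(\frac{2\pi}{N}\right)ik}. \tag{3.26}$$

If we define

$$u_n(k) = \sum_{r} s_n(Nr+k), \qquad 0 \le k \le N-1 \tag{3.27}$$

we wind up with

$$x_i(n) = e^{j\left(\frac{2\pi}{N}\right)in}\left[\sum_{k=0}^{N-1} u_n(k)\,e^{-j\left(\frac{2\pi}{N}\right)ik}\right] \tag{3.28}$$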
which is the desired result; that is, $x_i(n)$ is a modulated N-point DFT of the sequence $u_n(k)$.
Thus the basic steps in the computation of a uniform filter bank via FFT methods are
as follows:

1. Form the windowed signal $s_n(m) = s(m)w(n-m)$, $m = n-L+1, \ldots, n$, where
w(n) is a causal, finite window of duration L samples. Figure 3.16a illustrates this
step.

2. Form $u_n(k) = \sum_r s_n(Nr+k)$, $0 \le k \le N-1$. That is, break the signal $s_n(m)$ into
pieces of size N samples and add up the pieces (alias the signal back onto itself) to give a
signal of size N samples. Figures 3.16b and c illustrate this step for the case in which
$L \gg N$.
3. Take the N-point DFT of $u_n(k)$.
4. Modulate the DFT by the sequence $e^{j\left(\frac{2\pi}{N}\right)in}$.

The modulation step 4 can be avoided by circularly shifting the sequence, $u_n(k)$, by $n \oplus N$
samples (where $\oplus$ is the modulo operation), to give $u_n((k-n))_N$, $0 \le k \le N-1$, prior to
the DFT computation.
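
The four steps above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it evaluates Eq. (3.22) directly and compares it with the fold-then-DFT route of Eq. (3.28). The function names, the test signal, the Hamming window, and the values N = 8, L = 32, n = 40 are illustrative assumptions, and a naive DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath
import math

def direct_bank(s, w, n, N):
    """Eq. (3.22): x_i(n) = e^{j(2pi/N)in} * sum_m s(m) w(n-m) e^{-j(2pi/N)im}."""
    L = len(w)
    out = []
    for i in range(N):
        acc = sum(s[m] * w[n - m] * cmath.exp(-2j * math.pi * i * m / N)
                  for m in range(n - L + 1, n + 1))
        out.append(cmath.exp(2j * math.pi * i * n / N) * acc)
    return out

def folded_bank(s, w, n, N):
    """Steps 1-4: window, fold into N samples, N-point DFT, modulate."""
    L = len(w)
    u = [0.0] * N
    for m in range(n - L + 1, n + 1):
        # Step 1: s_n(m) = s(m) w(n-m); step 2: add it into u_n(k), k = m mod N.
        u[m % N] += s[m] * w[n - m]
    out = []
    for i in range(N):
        # Step 3: N-point DFT of u_n(k) (naive sum; an FFT would be used in practice).
        U = sum(u[k] * cmath.exp(-2j * math.pi * i * k / N) for k in range(N))
        # Step 4: modulate by e^{j(2pi/N)in}.
        out.append(cmath.exp(2j * math.pi * i * n / N) * U)
    return out

# Hypothetical test setup (n - L + 1 >= 0 so every index into s is valid).
N, L, n = 8, 32, 40
s = [math.sin(0.2 * m) + 0.5 * math.cos(0.9 * m) for m in range(64)]
w = [0.54 - 0.46 * math.cos(2 * math.pi * j / (L - 1)) for j in range(L)]  # Hamming window
x_direct = direct_bank(s, w, n, N)
x_folded = folded_bank(s, w, n, N)
max_diff = max(abs(a - b) for a, b in zip(x_direct, x_folded))
```

Under these assumptions the two routes should agree to within floating-point rounding, which is what `max_diff` measures.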
The computation to implement the uniform filter bank via Eq. (3.28) is essentially

$$C_{\text{FBFFT}} \approx 2N\log_2 N \quad *, +. \tag{3.29}$$

Consider now the ratio, R, between the computation for the direct form implementation of
a uniform filter bank (Eq. (3.13)), and the FFT implementation (Eq. (3.29)), such that

$$R = \frac{C_{\text{DFFIR}}}{C_{\text{FBFFT}}} = \frac{LQ}{2N\log_2 N}. \tag{3.30}$$

If we assume N = 32 (i.e., a 16-channel filter bank), with L = 128 (i.e., 12.8 msec impulse
response filter at a 10-kHz sampling rate), and Q = 16 channels, we get

$$R = \frac{128 \cdot 16}{2 \cdot 32 \cdot 5} = 6.4.$$

The FFT implementation is about 6.4 times more efficient than the direct form structure.
`
Figure 3.16 FFT implementation of a uniform filter bank.

Figure 3.17 Direct form implementation of an arbitrary nonuniform filter bank (the input s(n) feeds Q parallel bandpass filters $h_1(n), h_2(n), \ldots, h_Q(n)$).
`
`3.2.2.5 Nonuniform FIR Filter Bank Implementations
`
The most general form of a nonuniform FIR filter bank is shown in Figure 3.17, where the
kth bandpass filter impulse response, $h_k(n)$, represents a filter with center frequency $\omega_k$, and
bandwidth $\Delta\omega_k$. The set of Q bandpass filters is intended to cover the frequency range of
interest for the intended speech-processing application.

Figure 3.18 Two arbitrary nonuniform filter-bank ideal filter specifications
consisting of either 3 bands (part a) or 7 bands (part b).

In its most general form, each bandpass filter is implemented via a direct convolution;
that is, no efficient FFT structure can be used. In the case where each bandpass filter is
designed via the windowing design method (Ref. [1]), using the same lowpass window, we
can show that the composite frequency response of the Q-channel filter bank is independent
of the number and distribution of the individual filters. Thus a filter bank with the three
filters shown in Figure 3.18a has the exact same composite frequency response as the filter
bank with the seven filters shown in Figure 3.18b.
To show this we denote the impulse response of the kth bandpass filter as

$$h_k(n) = w(n)\tilde{h}_k(n), \tag{3.31}$$

where w(n) is the FIR window, and $\tilde{h}_k(n)$ is the impulse response of the ideal bandpass filter
being designed. The frequency response of the kth bandpass filter, $H_k(e^{j\omega})$, can be written
as

$$H_k(e^{j\omega}) = W(e^{j\omega}) \circledast \tilde{H}_k(e^{j\omega}). \tag{3.32}$$

Thus the frequency response of the composite filter bank, $H(e^{j\omega})$, can be written as

$$H(e^{j\omega}) = \sum_{k=1}^{Q} H_k(e^{j\omega}) = \sum_{k=1}^{Q} W(e^{j\omega}) \circledast \tilde{H}_k(e^{j\omega}). \tag{3.33}$$

By interchanging the summation and the convolution we get

$$H(e^{j\omega}) = W(e^{j\omega}) \circledast \sum_{k=1}^{Q} \tilde{H}_k(e^{j\omega}). \tag{3.34}$$

By realizing that the summation of Eq. (3.34) is the summation of ideal frequency responses,
we see that it is independent of the number and distribution of the individual filters. Thus
we can write the summation as

$$\hat{H}(e^{j\omega}) = \sum_{k=1}^{Q} \tilde{H}_k(e^{j\omega}) = \begin{cases} 1, & \omega_{\min} \le \omega \le \omega_{\max} \\ 0, & \text{otherwise} \end{cases} \tag{3.35}$$

where $\omega_{\min}$ is the lowest frequency in the filter bank, and $\omega_{\max}$ is the highest frequency.
Then Eq. (3.34) can be expressed as

$$H(e^{j\omega}) = W(e^{j\omega}) \circledast \hat{H}(e^{j\omega}) \tag{3.36}$$

independent of the number of ideal filters, Q, and their distribution in frequency, which is
the desired result.
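
The independence result can be illustrated in the time domain, where Eq. (3.31) says each bandpass filter is the window times an ideal bandpass impulse response. The sketch below is an illustration under stated assumptions, not a design from the text: the band edges, the 101-point Hamming window, and the function names are hypothetical. It sums the windowed filters for a 3-band and a 7-band partition of the same frequency range and compares the composite impulse responses.

```python
import math

def ideal_bandpass(n, w_lo, w_hi):
    """Impulse response of an ideal bandpass filter with passband w_lo <= |w| <= w_hi."""
    if n == 0:
        return (w_hi - w_lo) / math.pi
    return (math.sin(w_hi * n) - math.sin(w_lo * n)) / (math.pi * n)

def composite(edges, window):
    """Sum of windowed ideal bandpass filters for contiguous bands given by their edges."""
    M = (len(window) - 1) // 2          # center the (odd-length) window at n = 0
    h = [0.0] * len(window)
    for w_lo, w_hi in zip(edges[:-1], edges[1:]):
        for idx, w_n in enumerate(window):
            h[idx] += w_n * ideal_bandpass(idx - M, w_lo, w_hi)   # Eq. (3.31), summed over k
    return h

Lw = 101
window = [0.54 - 0.46 * math.cos(2 * math.pi * j / (Lw - 1)) for j in range(Lw)]
pi = math.pi
# Two hypothetical partitions of the same range [0.1*pi, 0.8*pi].
edges3 = [0.1 * pi, 0.2 * pi, 0.4 * pi, 0.8 * pi]
edges7 = [0.1 * pi, 0.15 * pi, 0.2 * pi, 0.3 * pi, 0.4 * pi, 0.5 * pi, 0.65 * pi, 0.8 * pi]
h3 = composite(edges3, window)
h7 = composite(edges7, window)
h1 = composite([0.1 * pi, 0.8 * pi], window)   # single band covering the whole range
max_diff_3_vs_7 = max(abs(a - b) for a, b in zip(h3, h7))
max_diff_3_vs_1 = max(abs(a - b) for a, b in zip(h3, h1))
```

Because the ideal responses of adjacent bands add up to one ideal response over the full range, the composites should differ only by rounding error, whichever partition is used.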
`
`3.2.2.6 FFT-Based Nonuniform Filter Banks
`
One possible way to exploit the FFT structure for implementing uniform filter banks
discussed earlier is to design a large uniform filter bank (e.g., N = 128 or 256 channels)
and then create the nonuniformity by combining two or more uniform channels. This
technique of combining channels is readily shown to be equivalent to applying a modified
analysis window to the sequence prior to the FFT. To see this, consider taking an N-point
DFT of the sequence x(n) (derived from the speech signal, s(n), by windowing by w(n)).
Thus we get

$$X_k = \sum_{n=0}^{N-1} x(n)\,e^{-j\frac{2\pi}{N}nk} \tag{3.37}$$

as the set of DFT values. If we consider adding DFT outputs $X_k$ and $X_{k+1}$, we get

$$X_k + X_{k+1} = \sum_{n=0}^{N-1} x(n)\left(e^{-j\frac{2\pi}{N}nk} + e^{-j\frac{2\pi}{N}n(k+1)}\right) \tag{3.38}$$

which can be written as

$$X'_k = X_k + X_{k+1} = \sum_{n=0}^{N-1} x(n)\left[2e^{-j\frac{\pi n}{N}}\cos\left(\frac{\pi n}{N}\right)\right]e^{-j\frac{2\pi}{N}nk} \tag{3.39}$$

i.e., the equivalent kth channel value, $X'_k$, could have been obtained by weighting the
sequence, x(n), in time, by the complex sequence $2e^{-j\frac{\pi n}{N}}\cos\left(\frac{\pi n}{N}\right)$. If more than two
channels are combined, then a different equivalent weighting sequence results. Thus FFT
channel combining is essentially a "quick and dirty" method of designing broader bandpass
filters and is a simple and effective way of realizing certain types of nonuniform filter bank
analysis structures.
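
The equivalence between adding two adjacent DFT outputs and weighting the sequence in time can be confirmed with a short numerical check. This is a sketch only; the sequence x(n), the size N = 16, the bin k = 3, and the function names are arbitrary choices for illustration.

```python
import cmath
import math

def dft_bin(x, k):
    """Single DFT output X_k = sum_n x(n) e^{-j(2pi/N)nk}, as in Eq. (3.37)."""
    N = len(x)
    return sum(x[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N))

def combining_weights(N):
    """Equivalent time weighting 2 e^{-j pi n/N} cos(pi n/N) for two adjacent channels."""
    return [2 * cmath.exp(-1j * math.pi * n / N) * math.cos(math.pi * n / N)
            for n in range(N)]

N, k = 16, 3
x = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(N)]

summed = dft_bin(x, k) + dft_bin(x, k + 1)                 # X_k + X_{k+1}, Eq. (3.38)
weighted_x = [x[n] * c for n, c in enumerate(combining_weights(N))]
weighted = dft_bin(weighted_x, k)                          # kth bin of the weighted sequence
difference = abs(summed - weighted)
```

The two values should match to rounding error, since $1 + e^{-j2\pi n/N} = 2e^{-j\pi n/N}\cos(\pi n/N)$ for every n.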
`
`3.2.2.7 Tree Structure Realizations of Nonuniform Filter Banks
`
A third method used to implement certain types of nonuniform filter banks is the tree
structure in which the speech signal is filtered in stages, and the sampling rate is successively
reduced at each stage for efficiency of implementation. An example of such a realization
is given in Figure 3.19a for the 4-band, octave-spaced filter bank shown (ideally) in
Figure 3.19b. The original speech signal, s(n), is filtered initially into two bands, a low
band and a high band, using quadrature mirror filters (QMFs), i.e., filters whose frequency
responses are complementary. The high band, which covers half the spectrum, is reduced
in sampling rate by a factor of 2, and represents the highest octave band ($\pi/2 \le \omega \le \pi$) of
the filter bank. The low band is similarly reduced in sampling rate by a factor of 2, and is
fed into a second filtering stage in which the signal is again split into two equal bands using
QMF filters. Again the high band of stage 2 is decimated by a factor of 2 and is used as the
next-highest filter bank output; the low band is also decimated by a factor of 2 and fed into
a third stage of QMF filters. These third-stage outputs, in this case after decimation by a
factor of 2, are used as the two lowest filter bands.

Figure 3.19 Tree structure implementation of a 4-band, octave-spaced, filter bank (band edges at $\pi/8$, $\pi/4$, $\pi/2$, and $\pi$).

QMF filter bank structures are quite efficient and have been used for a number of
speech-processing applications [3]. Their efficiency for arbitrary nonuniform filter bank
structures is not as good as for the octave band designs of Figure 3.19.
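
The staged split-and-decimate structure described above can be sketched in a few lines. The text does not specify the QMF design, so this example assumes the simplest possible pair, the two-tap Haar filters, applied in block form; practical systems would use longer filters with sharper band edges. The function names and the test signal are hypothetical.

```python
import math

def qmf_split(x):
    """One stage: two-tap Haar lowpass/highpass pair, each decimated by 2."""
    c = 1 / math.sqrt(2)
    low = [c * (x[i] + x[i + 1]) for i in range(0, len(x) - 1, 2)]
    high = [c * (x[i] - x[i + 1]) for i in range(0, len(x) - 1, 2)]
    return low, high

def octave_tree(x, stages=3):
    """Three stages give 4 octave-spaced bands, returned highest band first."""
    bands = []
    low = x
    for _ in range(stages):
        low, high = qmf_split(low)   # the high band is an output; the low band is split again
        bands.append(high)
    bands.append(low)                # final low band is the lowest filter band
    return bands

x = [math.sin(0.05 * n) + 0.3 * math.sin(2.5 * n) for n in range(64)]
bands = octave_tree(x)
band_lengths = [len(b) for b in bands]
energy_in = sum(v * v for v in x)
energy_out = sum(v * v for b in bands for v in b)
```

With a 64-sample input the band lengths halve at each stage, so the total number of output samples equals the number of input samples; with the orthonormal Haar pair assumed here the total energy is also preserved.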
`
`3.2.3 Summary of Considerations for Speech-Recognition Filter Banks
`
`In the previous sections we discussed several methods of implementing filter banks for
`speech recognition. We have not gone into great detail here because our goal was to make
`the reader familiar with the issues involved in filter-bank design and implementation, not
`to make the reader an expert in signal processing. The interested reader is urged to pursue
`this fascinating area further by studying the material in the References at the end of this
`chapter. In this section we summarize the considerations that go into choosing the number
`and types of filters used in the structures discussed earlier in this section.
`The first consideration for any filter bank is the type of digital filter used. The
`choices are IIR (recursive) and FIR (nonrecursive) designs. The IIR designs have the
`advantage of being implementable in simple, efficient structures. The big disadvantage of
`IIR filters is that their phase response is nonlinear; hence, to minimize this disadvantage
`
`
a trade-off is usually made between the ideal magnitude characteristics that can readily
be realized, and the highly nonideal phase characteristics. On the other hand, FIR filters
can achieve linear phase without compromising the ability to approximate ideal magnitude
characteristics; however, they are usually computationally expensive in implementation.
For speech-recognition applications, we have shown how an FFT structure can often be
applied to alleviate considerably the computational inefficiency of FIR filter banks; hence,
most practical digital filter bank structures use FIR filters (usually in an FFT realization).
Once the type of filter has been decided, the next consideration is the number of filters
to be used in the filter bank. For uniform filter banks, the number of filters, Q, cannot be too
small or else the ability of the filter bank to resolve the speech spectrum is greatly impaired.
Thus values of Q less than about 8 are generally avoided. Similarly, the value of Q cannot
be too large (unless there is considerable filter overlap), because the filter bandwidths would
eventually be too narrow for some talkers (e.g., high-pitch females or children), and there
would be a high probability that certain bands would have extremely low speech energy
(i.e., no prominent harmonic would fall within the band). Thus, practical systems tend to
have values of Q ≤ 32. Although uniformly spaced filter banks have been widely used for
recognition, many practical systems have used nonuniform spacing in an effort to reduce
overall computation and to characterize the speech spectrum in a manner considered more
consistent with human perception.
A final consideration for practical filter-bank analyzers is the choice of nonlinearity
and lowpass filter used at the output of each channel. Typically the nonlinearity has been
a full wave rectifier (FWR), a half wave rectifier (HWR), or a center clipper. The resultant
spectrum is only weakly sensitive to the nonlinearity. The lowpass filter used in practice
varies from a simple integrator to a fairly good quality IIR lowpass filter (typically a Bessel
filter).
`
`3.2.4 Practical Examples of Speech-Recognition Filter Banks
`
Figures 3.20-3.25 [4] show examples of a wide range of speech-recognition filter banks,
including both uniform and nonuniform designs. Figure 3.20 is for a 15-channel uniform
filter bank in which the basic lowpass filter was designed using the windowing technique
with a 101-point Kaiser window. Part a of the figure shows the impulse response of the
lowpass filter (i.e., an ideal lowpass filter response multiplied by a Kaiser window). Part b
of the figure shows the responses of the individual filters in the filter bank (note there is no
overlap between adjacent filters), and part c shows the composite frequency response of the
overall filter bank. The sidelobe peak ripple of each individual filter is down about 60 dB,
and the composite frequency response is essentially ideally flat over the entire frequency
range of interest (approximately 100-3000 Hz).
By contrast, Figure 3.21 is for a 15-channel uniform filter bank in which the basic
lowpass filter was a Kaiser window (instead of the Kaiser windowed version of the ideal
lowpass filter). From parts b and c of this figure, it can be seen that the individual bandpass
filters are narrower in bandwidth than those of Figure 3.20; furthermore, the composite
filter-bank response shows 18 dB gaps at the boundaries between each filter. Clearly, this
filter bank would be unacceptable for speech-recognition applications.

Figure 3.20 Window sequence, w(n), (part a), the individual filter response (part b), and
the composite response (part c) of a Q = 15 channel, uniform filter bank, designed using a
101-point Kaiser window smoothed lowpass window (after Dautrich et al. [4]).

Figures 3.22 and 3.23 show individual filter frequency responses, and the composite
frequency response, for a 4-channel, octave-band filter bank, and a 12-channel, 1/3 octave
filter bank, respectively. Each of these nonuniform filter banks was designed
to cover the frequency band from 200 to 3200 Hz and used linear-phase FIR filters (101
points for the octave band design, and 201 points for the 1/3 octave band design) for each
individual channel. The peak sidelobe ripple was about -40 dB for both filter banks.
Figure 3.24 shows a similar set of responses for a 7-channel critical band filter bank
in which each individual filter encompassed two critical bands. Again we used 101-point,
linear phase, FIR filters with a peak sidelobe of -54 dB to realize each individual bandpass
filter. Finally, Figure 3.25 shows the responses of a 13-channel, critical band filter bank
in which the individual channels were highly overlapping. The individual bandpass filter
responses are rather poor (e.g., the ratio of center frequency to bandwidth of each filter was
about 8). However, this poor frequency resolution characteristic was balanced somewhat
by the excellent time resolution of the filters.
`
Figure 3.21 Window sequence, w(n), (part a), the individual filter responses (part b), and
the composite response (part c) of a Q = 15 channel, uniform filter bank, designed using a
101-point Kaiser window directly as the lowpass window (after Dautrich et al. [4]).
`
`3.2.5 Generalizations of Filter-Bank Analyzer
`
Although we have been concerned primarily with designing and implementing individual
channels of a filter-bank analyzer, there is a generalized structure that must be considered
as part of the canonic filter-bank analysis method. This generalized structure is shown in
Figure 3.26. The generalization includes a signal preprocessor that "conditions" the speech
signal, s(n), to a new form, $\hat{s}(n)$, which is "more suitable" for filter-bank analysis, and a
postprocessor that operates on the filter-bank output vectors, x(m), to give the processed
vectors $\hat{x}(m)$ that are "more suitable" for recognition. Although a wide range of
signal-processing operations could go into the preprocessor and postprocessor boxes, perhaps the
most reasonable ones include the following.
`
Preprocessor Operations

• signal preemphasis (to equalize the inherent spectral tilt in speech)
• noise elimination
• signal enhancement (to make the formant peaks more prominent)

Postprocessor Operations

• temporal smoothing of sequential filter-bank output vectors
• frequency smoothing of individual filter-bank output vectors
• normalization of each filter-bank output vector
• thresholding and/or quantization of the filter-bank output vectors
• principal components analysis of the filter-bank output vector.

The purpose of the preprocessor is to make the speech signal as clean as possible so far
as the filter bank analyzer is concerned; hence, noise is eliminated, long-time spectral
trends are removed, and the signal is spectrally flattened to give the best immunity to
measurement imperfections. Similarly, the purpose of the postprocessor is to clean up the
sequence of feature vectors from the filter-bank analyzer so as to best represent the spectral
information in the speech signal and thereby to maximize the chances of successful speech
recognition [4,5].

Figure 3.22 Individual channel responses (parts a to d) and composite filter response (part e) of a
Q = 4 channel, octave band design, using 101-point FIR filters in each band (after Dautrich et al. [4]).

Figure 3.23 Individual channel responses and composite filter response of a Q = 12 channel, 1/3 octave band
design, using 201-point FIR filters in each band (after Dautrich et al. [4]).
`
3.3 LINEAR PREDICTIVE CODING MODEL FOR SPEECH RECOGNITION
`
`The theory of linear predictive coding (LPC), as applied to speech, has been well understood
`for many years (see for example Ref. [6]). In this section we describe the basics of how LPC
`has been applied in speech-recognition systems. The mathematical details and derivations
`will be omitted here; the interested reader is referred to the references.
`Before describing a general LPC front-end processor for speech recognition, it is
`worthwhile to review the reasons why LPC has been so widely used. These include the
`following:
`
Figure 3.24 Individual channel responses (parts a to g) and composite filter response (part h) of a
Q = 7 channel critical band filter bank design (after Dautrich et al. [4]).
`
`1. LPC provides a good model of the speech signal. This is especially true for the quasi
`steady state voiced regions of speech in which the all-pole model of LPC provides
`a good approximation to the vocal tract spectral envelope. During unvoiced and
`transient regions of speech, the LPC model is less effective than for voiced regions,
`but it still provides an acceptably useful model for speech-recognition purposes.
`2. The way in which LPC is applied to the analysis of speech signals leads to a reasonable
`source-vocal tract separation. As a result, a parsimonious representation of the vocal
`tract characteristics (which we know are directly related to the speech sound being
`produced) becomes possible.
`3. LPC is an analytically tractable model. The method of LPC is mathematically precise
`and is simple and straightforward to implement in either software or hardware. The
`computation involved in LPC processing is considerably less than that required for
`an all-digital implementation of the bank-of-filters model described in Section 3.2.
4. The LPC model works well in recognition applications. Experience has shown that
the performance of speech recognizers, based on LPC front ends, is comparable to or
better than that of recognizers based on filter-bank front ends (see References [4,5,7]).

Figure 3.25 Individual channel responses and composite filter response of a Q = 13 channel, critical band spacing
filter bank, using highly overlapping filters in frequency (after Dautrich et al. [4]).

Figure 3.26 Generalization of filter-bank analysis model (s(n) → preprocessor → $\hat{s}(n)$ → filter bank analyzer → x(m) → postprocessor → $\hat{x}(m)$).
`
`Based on the above considerations, LPC front-end processing has been used in a large
`number of recognizers. In particular, most of the systems to be described in this book are
`based on this model.
`
Figure 3.27 Linear prediction model of speech.
`
`3.3.1 The LPC Model
`
The basic idea behind the LPC model is that a given speech sample at time n, s(n), can be
approximated as a linear combination of the past p speech samples, such that

$$s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \cdots + a_p s(n-p), \tag{3.40}$$

where the coefficients $a_1, a_2, \ldots, a_p$ are assumed constant over the speech analysis frame.
We convert Eq. (3.40) to an equality by including an excitation term, G u(n), giving:

$$s(n) = \sum_{i=1}^{p} a_i s(n-i) + G\,u(n), \tag{3.41}$$

where u(n) is a normalized excitation and G is the gain of the excitation. By expressing
Eq. (3.41) in the z-domain we get the relation

$$S(z) = \sum_{i=1}^{p} a_i z^{-i} S(z) + G\,U(z) \tag{3.42}$$

leading to the transfer function

$$H(z) = \frac{S(z)}{G\,U(z)} = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}} = \frac{1}{A(z)}. \tag{3.43}$$

The interpretation of Eq. (3.43) is given in Figure 3.27, which shows the normalized
excitation source, u(n), being scaled by the gain, G, and acting as input to the all-pole
system, $H(z) = \frac{1}{A(z)}$, to produce the speech signal, s(n). Based on our knowledge that the
actual excitation function for speech is essentially either a quasiperiodic pulse train (for
voiced speech sounds) or a random noise source (for unvoiced sounds), the appropriate
synthesis model for speech, corresponding to the LPC analysis, is as shown in Figure 3.28.
Here the normalized excitation source is chosen by a switch whose position is controlled
by the voiced/unvoiced character of the speech, which chooses either a quasiperiodic train
of pulses as the excitation for voiced sounds, or a random noise sequence for unvoiced
sounds. The appropriate gain, G, of the source is estimated from the speech signal, and the
scaled source is used as input to a digital filter (H(z)), which is controlled by the vocal tract
parameters characteristic of the speech being produced. Thus the parameters of this model
are voiced/unvoiced classification, pitch period for voiced sounds, the gain parameter, and
the coefficients of the digital filter, $\{a_k\}$. These parameters all vary slowly with time.

Figure 3.28 Speech synthesis model based on LPC model (an impulse train generator driven by the pitch period and a random noise generator feed a voiced/unvoiced switch; the selected excitation u(n) is scaled by G and drives a time-varying digital filter, controlled by the vocal tract parameters, to produce s(n)).
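
The synthesis model can be sketched as a switch between two excitation generators followed by the all-pole recursion of Eq. (3.41). This is a minimal illustration under stated assumptions, not a speech synthesizer: the coefficient values, gain, pitch period, and function names are hypothetical, the parameters are held fixed rather than varying with time, and Gaussian noise is just one possible choice of random noise source.

```python
import random

def excitation(voiced, length, pitch_period=50, seed=0):
    """Voiced/unvoiced switch: unit impulse train or a random noise sequence."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(length)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]

def all_pole_filter(a, G, u):
    """s(n) = sum_{i=1}^{p} a_i s(n-i) + G u(n), with s(n) = 0 for n < 0 (Eq. (3.41))."""
    p = len(a)
    s = []
    for n in range(len(u)):
        past = sum(a[i - 1] * s[n - i] for i in range(1, p + 1) if n - i >= 0)
        s.append(past + G * u[n])
    return s

a = [1.3, -0.6]     # hypothetical stable 2nd-order predictor (pole radius about 0.77)
G = 2.0             # hypothetical gain
u_voiced = excitation(True, 200)
u_unvoiced = excitation(False, 200)
s_voiced = all_pole_filter(a, G, u_voiced)
s_unvoiced = all_pole_filter(a, G, u_unvoiced)
```

In the voiced case each pitch pulse launches a decaying copy of the filter's impulse response; in the unvoiced case the same filter shapes the noise spectrum.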
`
`3.3.2 LPC Analysis Equations
`
Based on the model of Figure 3.27, the exact relation between s(n) and u(n) is

$$s(n) = \sum_{k=1}^{p} a_k s(n-k) + G\,u(n). \tag{3.44}$$

We consider the linear combination of past speech samples as the estimate $\tilde{s}(n)$, defined as

$$\tilde{s}(n) = \sum_{k=1}^{p} a_k s(n-k). \tag{3.45}$$

We now form the prediction error, e(n), defined as

$$e(n) = s(n) - \tilde{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k) \tag{3.46}$$

with error transfer function

$$A(z) = \frac{E(z)}{S(z)} = 1 - \sum_{k=1}^{p} a_k z^{-k}. \tag{3.47}$$

Clearly, when s(n) is actually generated by a linear system of the type shown in Figure 3.27,
then the prediction error, e(n), will equal G u(n), the scaled excitation.
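
The claim that the prediction error equals the scaled excitation can be verified numerically: generate s(n) with Eq. (3.44), then apply Eq. (3.46) with the same coefficients. The coefficient values, gain, pitch period, and function names below are assumptions chosen only for illustration.

```python
def synthesize(a, G, u):
    """Eq. (3.44): s(n) = sum_k a_k s(n-k) + G u(n), with s(n) = 0 for n < 0."""
    p = len(a)
    s = []
    for n in range(len(u)):
        s.append(sum(a[k - 1] * s[n - k] for k in range(1, p + 1) if n - k >= 0)
                 + G * u[n])
    return s

def prediction_error(a, s):
    """Eq. (3.46): e(n) = s(n) - sum_k a_k s(n-k)."""
    p = len(a)
    return [s[n] - sum(a[k - 1] * s[n - k] for k in range(1, p + 1) if n - k >= 0)
            for n in range(len(s))]

a = [1.3, -0.6]                                          # hypothetical predictor coefficients
G = 2.0                                                  # hypothetical gain
u = [1.0 if n % 20 == 0 else 0.0 for n in range(100)]    # impulse train, period 20 samples
s = synthesize(a, G, u)
e = prediction_error(a, s)
max_dev = max(abs(e_n - G * u_n) for e_n, u_n in zip(e, u))
```

When the analysis coefficients match those of the generating system, `max_dev` should be at the level of rounding error, so the error sequence is the impulse train scaled by G.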
The basic problem of linear prediction analysis is to determine the set of predictor
coefficients, $\{a_k\}$, directly from the speech signal so that the spectral properties of the digital
filter of Figure 3.28 match those of the speech waveform within the analysis window. Since
the spectral characteristics of speech vary over time, the predictor coefficients at a given
time, n, must be estimated from a short segment of the speech signal occurring around
time n. Thus the basic approach is to find a set of predictor coefficients that minimize
the mean-squared prediction error over a short segment of the speech waveform. (Usually
this type of short time spectral analysis is performed on successive frames of speech, with
frame spacing on the order of 10 msec.)
To set up the equations that must be solved to determine the predictor coefficients,
we define short-term speech and error segments at time n as

$$s_n(m) = s(n+m) \tag{3.48a}$$
$$e_n(m) = e(n+m) \tag{3.48b}$$

and we seek to minimize the mean squared error signal at time n

$$E_n = \sum_{m} e_n^2(m) \tag{3.49}$$

which, using the definition of $e_n(m)$ in terms of $s_n(m)$, can be written as

$$E_n = \sum_{m}\left[s_n(m) - \sum_{k=1}^{p} a_k s_n(m-k)\right]^2. \tag{3.50}$$

To solve Eq. (3.50) for the predictor coefficients, we differentiate $E_n$ with respect to each
$a_k$ and set the result to zero,

$$\frac{\partial E_n}{\partial a_k} = 0, \qquad k = 1, 2, \ldots, p \tag{3.51}$$

giving

$$\sum_{m} s_n(m-i)s_n(m) = \sum_{k=1}^{p} a_k \sum_{m} s_n(m-i)s_n(m-k). \tag{3.52}$$

By recognizing that terms of the form $\sum_m s_n(m-i)\,s_n(m-k)$ are terms of the short-term
covariance of $s_n(m)$, i.e.,

$$\phi_n(i,k) = \sum_{m} s_n(m-i)s_n(m-k) \tag{3.53}$$

we can express Eq. (3.52) in the compact notation

$$\phi_n(i,0) = \sum_{k=1}^{p} a_k\phi_n(i,k) \tag{3.54}$$

which describes a set of p equations in p unknowns. It is readily shown that the minimum
mean-squared error, $E_n$, can be expressed as
`

$$E_n = \sum_{m} s_n^2(m) - \sum_{k=1}^{p} a_k \sum_{m} s_n(m)s_n(m-k) \tag{3.55}$$

$$\phantom{E_n} = \phi_n(0,0) - \sum_{k=1}^{p} a_k\phi_n(0,k). \tag{3.56}$$

Thus the minimum mean-squared error consists of a fixed term ($\phi_n(0,0)$) and terms that
depend on the predictor coefficients.
To solve Eq. (3.54) for the optimum predictor coefficients (the $a_k$'s) we have to
compute $\phi_n(i,k)$ for $1 \le i \le p$ and $0 \le k \le p$, and then solve the resulting set of p
simultaneous equations. In practice, the method of solving the equations (as well as the
method of computing the $\phi$'s) is a strong function of the range of m used in defining both
the section of speech for analysis and the region over which the mean-squared error is
computed. We now discuss two standard methods of defining this range for speech.
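
The procedure of computing the $\phi_n(i,k)$ values and solving the p simultaneous equations can be sketched as follows. This is an illustration under stated assumptions rather than either of the standard methods: the summation range over m is passed in explicitly (here m = p, ..., 59, a hypothetical choice), the test segment is the impulse response of a known two-coefficient all-pole system, and a small Gaussian elimination routine stands in for whatever solver a real system would use.

```python
def phi(s, i, k, m_range):
    """Eq. (3.53): phi(i, k) = sum_m s(m-i) s(m-k) over the chosen range of m."""
    return sum(s[m - i] * s[m - k] for m in m_range)

def solve_linear(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def lpc_normal_equations(s, p, m_range):
    """Solve Eq. (3.54) for a_1..a_p and evaluate the minimum error of Eq. (3.56)."""
    A = [[phi(s, i, k, m_range) for k in range(1, p + 1)] for i in range(1, p + 1)]
    b = [phi(s, i, 0, m_range) for i in range(1, p + 1)]
    a = solve_linear(A, b)
    E_min = phi(s, 0, 0, m_range) - sum(a[k - 1] * phi(s, 0, k, m_range)
                                        for k in range(1, p + 1))
    return a, E_min

# Hypothetical test segment: impulse response of s(n) = 1.3 s(n-1) - 0.6 s(n-2) + u(n).
true_a = [1.3, -0.6]
s = []
for n in range(60):
    s.append(sum(true_a[k - 1] * s[n - k] for k in (1, 2) if n - k >= 0)
             + (1.0 if n == 0 else 0.0))
p = 2
a_est, E_min = lpc_normal_equations(s, p, range(p, len(s)))
```

Because the excitation is zero over the chosen range, the solved coefficients should reproduce `true_a` and the minimum error should be essentially zero; with real speech neither would hold exactly.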
`
`3.3.3 The Autocorrelation Method
`
A fairly simple and straightforward way of defining the limits on m in the summations is
to assume that the speech segment, $s_n(m)$, is identically zero outside the interval $0 \le m \le N-1$.
This is equivalent to assuming that the speech signal, s(m + n), is multiplied by a
finite length window, w(m), which is identically zero outside the range $0 \le m \le N-1$.
Thus the speech sample for minimization can be expressed as

$$s_n(m) = \begin{cases} s(m+n)\cdot w(m), & 0 \le m \le N-1 \\ 0, & \text{otherwise.} \end{cases} \tag{3.57}$$

The effect of weighting of the speech by a window is illustrated in Figures 3.29-3.31.
In each of these figures, the upper panel shows the running speech waveform, s(m), the
middle panel shows the weighted section of speech (using a Hamming window for w(m)),
and the bottom panel shows the resulting error signal, $e_n(m)$, based on optimum selection
of the predictor parameters.
Based on Eq. (3.57), for m < 0, the error signal $e_n(m)$ is exactly zero since $s_n(m) = 0$
for all m < 0 and therefore there is no prediction error. Furthermore, for m > N - 1 + p
there is again no prediction error because $s_n(m) = 0$ for all m > N - 1. However, in
the region of m = 0 (i.e., from m = 0 to m = p - 1) the windowed speech signal $s_n(m)$
is being predicted from previous samples, some of which are arbitrarily zero. Hence the
potential for relatively large prediction errors exists in this region and can actually be seen
to exist in the bottom panel of Figure 3.29. Furthermore, in the region of m = N - 1 (i.e.,
from m = N - 1 to m = N - 1 + p) the potential of large prediction errors again exists
because the zero-valued (weighted) speech signal is being predicted from at least some
nonzero previous speech samples. In the bottom panel of Figure 3.30 we see this effect
at the end of the prediction error waveform. These two effects are especially prominent
for voiced speech when the beginning of a pitch period occurs at or very close to the
m = 0 or m = N - 1 points of the sample. For unvoiced speech, these problems are
essentially eliminated because no part of the waveform is position sensitive. Hence we see
`

Figure 3.29 Illustration of speech sample, weighted speech section, and prediction error
for voiced speech where the prediction error is large at the beginning of the section.

Figure 3.30 Illustration of speech sample, weighted speech section, and prediction error
for voiced speech where the prediction error is large at the end of the section.