Solution 3.3
Given the definition of $S_n(e^{j\omega})$ we have

$$X_n(e^{j\omega}) = |S_n(e^{j\omega})|^2 = [S_n(e^{j\omega})][S_n(e^{j\omega})]^*$$

$$= \left[\sum_{m=-\infty}^{\infty} s(m)\,w(n-m)\,e^{-j\omega m}\right]\left[\sum_{r=-\infty}^{\infty} s(r)\,w(n-r)\,e^{j\omega r}\right]$$

$$= \sum_{r=-\infty}^{\infty}\sum_{m=-\infty}^{\infty} w(n-m)\,s(m)\,w(n-r)\,s(r)\,e^{-j\omega(m-r)}.$$

Let r = k + m; then

$$X_n(e^{j\omega}) = \sum_{k=-\infty}^{\infty}\left[\sum_{m=-\infty}^{\infty} w(n-m)\,s(m)\,w(n-m-k)\,s(m+k)\right]e^{j\omega k} = \sum_{k=-\infty}^{\infty} R_n(k)\,e^{j\omega k}$$

(since $R_n(k) = R_n(-k)$); therefore

$$X_n(e^{j\omega}) = \sum_{k=-\infty}^{\infty} R_n(k)\,e^{-j\omega k}.$$
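As a quick numerical check of this result (an illustration, not part of the solution): the sketch below assumes NumPy, an arbitrary test signal, and a Hamming window, and compares $|S_n(e^{j\omega})|^2$ evaluated on a DFT grid with the transform of the short-time autocorrelation $R_n(k)$.

import numpy as np

# Arbitrary test signal and analysis window (both are assumptions for illustration).
rng = np.random.default_rng(0)
s = rng.standard_normal(400)
L = 64                       # window length
w = np.hamming(L)
n = 200                      # analysis time

# Short-time sequence s_n(m) = s(m) w(n - m); nonzero only for n-L+1 <= m <= n.
m = np.arange(n - L + 1, n + 1)
sn = s[m] * w[n - m]

# Left side: X_n(e^jw) = |S_n(e^jw)|^2 on a dense DFT grid (zero padding).
Nfft = 512
Xn = np.abs(np.fft.fft(sn, Nfft)) ** 2

# Right side: transform of the short-time autocorrelation
# R_n(k) = sum_m s(m) w(n-m) s(m+k) w(n-m-k), i.e., the autocorrelation of s_n.
Rn = np.correlate(sn, sn, mode="full")          # lags -(L-1) .. (L-1); symmetric
lags = np.arange(-(L - 1), L)
grid = np.exp(-2j * np.pi * np.outer(np.arange(Nfft), lags) / Nfft)
Xn_from_Rn = np.sum(Rn[None, :] * grid, axis=1).real

print(np.allclose(Xn, Xn_from_Rn))              # True: the two computations agree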
`3.2.2.4 FFT Implementation of Uniform Filter Bank Based on the Short-Time
`Fourier Transform
`
We now return to the question of how to efficiently implement the computation of the set of filter-bank outputs (Eq. (3.15)) for the uniform filter bank. If we assume, reasonably, that we are interested in a uniform frequency spacing, that is, if

$$f_i = i\left(\frac{F_s}{N}\right), \qquad i = 0, 1, \ldots, N-1, \tag{3.21}$$

then Eq. (3.15a) can be written as

$$x_i(n) = e^{j\left(\frac{2\pi}{N}\right)in} \sum_{m} s(m)\,w(n-m)\,e^{-j\left(\frac{2\pi}{N}\right)im}. \tag{3.22}$$

Now consider breaking up the summation over m into a double summation over r and k, in which

$$m = Nr + k, \qquad 0 \le k \le N-1, \qquad -\infty < r < \infty. \tag{3.23}$$

In other words, we break up the computation over m into pieces of size N. If we let

$$s_n(m) = s(m)\,w(n-m), \tag{3.24}$$

then Eq. (3.22) can be written as

$$x_i(n) = e^{j\left(\frac{2\pi}{N}\right)in} \sum_{r=-\infty}^{\infty} \sum_{k=0}^{N-1} s_n(Nr+k)\,e^{-j\left(\frac{2\pi}{N}\right)i(Nr+k)}. \tag{3.25}$$
`
Since $e^{-j2\pi i r} = 1$ for all integers i and r, then

$$x_i(n) = e^{j\left(\frac{2\pi}{N}\right)in} \sum_{k=0}^{N-1} \left[\sum_{r=-\infty}^{\infty} s_n(Nr+k)\right] e^{-j\left(\frac{2\pi}{N}\right)ik}. \tag{3.26}$$

If we define

$$u_n(k) = \sum_{r=-\infty}^{\infty} s_n(Nr+k), \tag{3.27}$$

we wind up with

$$x_i(n) = e^{j\left(\frac{2\pi}{N}\right)in} \left[\sum_{k=0}^{N-1} u_n(k)\,e^{-j\left(\frac{2\pi}{N}\right)ik}\right], \tag{3.28}$$

which is the desired result; that is, $x_i(n)$ is a modulated N-point DFT of the sequence $u_n(k)$.
Thus the basic steps in the computation of a uniform filter bank via FFT methods are as follows:
`
1. Form the windowed signal $s_n(m) = s(m)\,w(n-m)$, $m = n-L+1, \ldots, n$, where $w(n)$ is a causal, finite window of duration L samples. Figure 3.16a illustrates this step.

2. Form $u_n(k) = \sum_r s_n(Nr+k)$, $0 \le k \le N-1$. That is, break the signal $s_n(m)$ into pieces of size N samples and add up the pieces (alias them back onto one another) to give a signal of size N samples. Figures 3.16b and c illustrate this step for the case in which $L \gg N$.

3. Take the N-point DFT of $u_n(k)$.

4. Modulate the DFT by the sequence $e^{j(2\pi/N)in}$.

The modulation of step 4 can be avoided by circularly shifting the sequence $u_n(k)$ by $n \oplus N$ samples (where $\oplus$ denotes the modulo operation), to give $u_n((k-n))_N$, $0 \le k \le N-1$, prior to the DFT computation.
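The four steps can be seen end to end in the short sketch below (a NumPy illustration with assumed values of N, L, and n, not code from the text); it forms the windowed segment, aliases it to N samples, takes the N-point DFT, applies the modulation of step 4, and checks the result against a direct evaluation of Eq. (3.22).

import numpy as np

rng = np.random.default_rng(1)
N = 32                              # number of uniform channels / DFT size (assumed)
L = 128                             # window duration in samples, L >> N (assumed)
n = 300                             # analysis time index
s = rng.standard_normal(1000)       # stand-in for the speech signal s(m)
w = np.hamming(L)                   # causal, finite-duration window (assumed Hamming)

# Step 1: windowed signal s_n(m) = s(m) w(n - m), m = n-L+1, ..., n
m = np.arange(n - L + 1, n + 1)
sn = s[m] * w[n - m]

# Step 2: alias the L-sample segment into N samples, u_n(k) = sum_r s_n(Nr + k)
un = np.zeros(N)
for q, mm in enumerate(m):
    un[mm % N] += sn[q]             # k = mm mod N selects the piece each sample falls in

# Step 3: N-point DFT of u_n(k);  Step 4: modulate by exp(j 2*pi*i*n/N)
i = np.arange(N)
x = np.exp(2j * np.pi * i * n / N) * np.fft.fft(un)

# Check against the direct evaluation of Eq. (3.22)
x_direct = np.array([np.exp(2j * np.pi * ii * n / N) *
                     np.sum(sn * np.exp(-2j * np.pi * ii * m / N)) for ii in range(N)])
print(np.allclose(x, x_direct))     # True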
The computation to implement the uniform filter bank via Eq. (3.28) is essentially

$$C_{FBFFT} \approx 2N\log_2 N \quad \text{(multiplications and additions).} \tag{3.29}$$

Consider now the ratio, R, between the computation for the direct-form implementation of a uniform filter bank (Eq. (3.13)) and the FFT implementation (Eq. (3.29)), such that

$$R = \frac{C_{DFIR}}{C_{FBFFT}} = \frac{LQ}{2N\log_2 N}. \tag{3.30}$$

If we assume N = 32 (i.e., a 16-channel filter bank), with L = 128 (i.e., a 12.8-msec impulse response filter at a 10-kHz sampling rate), and Q = 16 channels, we get

$$R = \frac{128 \cdot 16}{2 \cdot 32 \cdot 5} = 6.4.$$

The FFT implementation is about 6.4 times more efficient than the direct form structure.
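As a quick check of the arithmetic (using the assumed values N = 32, L = 128, Q = 16):

import math

N, L, Q = 32, 128, 16                    # values used in the example above
R = (L * Q) / (2 * N * math.log2(N))     # Eq. (3.30)
print(R)                                 # 6.4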
`
`
Figure 3.16  FFT implementation of a uniform filter bank.

Figure 3.17  Direct form implementation of an arbitrary nonuniform filter bank.
`
`3.2.2.5 Nonuniform FIR Filter Bank Implementations
`
The most general form of a nonuniform FIR filter bank is shown in Figure 3.17, where the kth bandpass filter impulse response, $h_k(n)$, represents a filter with center frequency $\omega_k$ and bandwidth $\Delta\omega_k$. The set of Q bandpass filters is intended to cover the frequency range of
`
`
Figure 3.18  Two arbitrary nonuniform filter-bank ideal filter specifications consisting of either 3 bands (part a) or 7 bands (part b).
`
`interest for the intended speech-processing application.
`In its most general form, each bandpass filter is implemented via a direct convolution;
that is, no efficient FFT structure can be used. In the case where each bandpass filter is
designed via the windowing design method (Ref. [1]), using the same lowpass window, we
`can show that the composite frequency response of the Q-channel filter bank is independent
`of the number and distribution of the individual filters. Thus a filter bank with the three
`filters shown in Figure 3.18a has the exact same composite frequency response as the filter
`bank with the seven filters shown in Figure 3.18b.
To show this we denote the impulse response of the kth bandpass filter as

$$h_k(n) = w(n)\,\hat{h}_k(n), \tag{3.31}$$

where $w(n)$ is the FIR window and $\hat{h}_k(n)$ is the impulse response of the ideal bandpass filter being designed. The frequency response of the kth bandpass filter, $H_k(e^{j\omega})$, can be written as

$$H_k(e^{j\omega}) = W(e^{j\omega}) \circledast \hat{H}_k(e^{j\omega}). \tag{3.32}$$

Thus the frequency response of the composite filter bank, $H(e^{j\omega})$, can be written as

$$H(e^{j\omega}) = \sum_{k=1}^{Q} H_k(e^{j\omega}) = \sum_{k=1}^{Q} W(e^{j\omega}) \circledast \hat{H}_k(e^{j\omega}). \tag{3.33}$$

By interchanging the summation and the convolution we get

$$H(e^{j\omega}) = W(e^{j\omega}) \circledast \sum_{k=1}^{Q} \hat{H}_k(e^{j\omega}). \tag{3.34}$$

By realizing that the summation in Eq. (3.34) is a summation of ideal frequency responses, we see that it is independent of the number and distribution of the individual filters. Thus we can write the summation as

$$\sum_{k=1}^{Q} \hat{H}_k(e^{j\omega}) = \begin{cases} 1, & \omega_{\min} \le \omega \le \omega_{\max} \\ 0, & \text{otherwise,} \end{cases} \tag{3.35}$$
`
`
where $\omega_{\min}$ is the lowest frequency in the filter bank and $\omega_{\max}$ is the highest frequency. Then Eq. (3.34) can be expressed as

$$H(e^{j\omega}) = W(e^{j\omega}) \circledast \hat{H}(e^{j\omega}), \tag{3.36}$$

where $\hat{H}(e^{j\omega})$ denotes the ideal composite response of Eq. (3.35). The composite response is therefore independent of the number of ideal filters, Q, and their distribution in frequency, which is the desired result.
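This independence is easy to confirm numerically. The sketch below (NumPy; the band edges, the window, and the filter length are assumed values) builds two window-designed filter banks that cover the same overall range, one with 3 bands and one with 7, and verifies that their composite (summed) impulse responses are identical.

import numpy as np

M = 101                                   # filter length (odd, linear phase)
t = np.arange(M) - (M - 1) / 2            # time axis centered at zero
w = np.hamming(M)                         # assumed design window

def ideal_bandpass(f1, f2):
    """Impulse response of an ideal bandpass filter with edges f1 < f2 (normalized, 0..0.5)."""
    return 2 * f2 * np.sinc(2 * f2 * t) - 2 * f1 * np.sinc(2 * f1 * t)

def composite(band_edges):
    """Sum of window-designed bandpass filters for the contiguous bands in band_edges."""
    h = np.zeros(M)
    for f1, f2 in zip(band_edges[:-1], band_edges[1:]):
        h += w * ideal_bandpass(f1, f2)
    return h

# Same overall range (0.02 .. 0.32 in normalized frequency), two different partitions.
three_bands = [0.02, 0.12, 0.22, 0.32]
seven_bands = [0.02, 0.05, 0.09, 0.14, 0.19, 0.24, 0.28, 0.32]

print(np.allclose(composite(three_bands), composite(seven_bands)))   # True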
`
`3.2.2.6 FFT-Based Nonuniform Filter Banks
`
One possible way to exploit the FFT structure for implementing uniform filter banks discussed earlier is to design a large uniform filter bank (e.g., N = 128 or 256 channels) and then create the nonuniformity by combining two or more uniform channels. This technique of combining channels is readily shown to be equivalent to applying a modified analysis window to the sequence prior to the FFT. To see this, consider taking an N-point DFT of the sequence x(n) (derived from the speech signal, s(n), by windowing by w(n)). Thus we get
`
$$X_k = \sum_{n=0}^{N-1} x(n)\,e^{-j\frac{2\pi}{N}nk} \tag{3.37}$$

as the set of DFT values. If we consider adding DFT outputs $X_k$ and $X_{k+1}$, we get

$$X_k + X_{k+1} = \sum_{n=0}^{N-1} x(n)\left(e^{-j\frac{2\pi}{N}nk} + e^{-j\frac{2\pi}{N}n(k+1)}\right), \tag{3.38}$$

which can be written as

$$X_k + X_{k+1} = \sum_{n=0}^{N-1} \left[x(n)\,2e^{-j\frac{\pi n}{N}}\cos\left(\frac{\pi n}{N}\right)\right] e^{-j\frac{2\pi}{N}nk}; \tag{3.39}$$

i.e., the equivalent kth channel value, $\tilde{X}_k$, could have been obtained by weighting the sequence, x(n), in time, by the complex sequence $2e^{-j\pi n/N}\cos(\pi n/N)$. If more than two channels are combined, then a different equivalent weighting sequence results. Thus FFT channel combining is essentially a "quick and dirty" method of designing broader bandpass filters and is a simple and effective way of realizing certain types of nonuniform filter bank analysis structures.
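The equivalence stated above is easy to confirm numerically. The sketch below (NumPy, with an arbitrary test sequence standing in for the windowed signal x(n)) compares the directly summed DFT outputs $X_k + X_{k+1}$ with the kth DFT value of x(n) weighted by the sequence of Eq. (3.39).

import numpy as np

rng = np.random.default_rng(2)
N = 128
k = 10                                    # channel index to combine with channel k+1
x = rng.standard_normal(N)                # stand-in for the windowed sequence x(n)

X = np.fft.fft(x)                         # Eq. (3.37)
combined = X[k] + X[k + 1]                # direct sum of adjacent DFT channels

n = np.arange(N)
weight = 2 * np.exp(-1j * np.pi * n / N) * np.cos(np.pi * n / N)   # weighting of Eq. (3.39)
equivalent = np.fft.fft(x * weight)[k]    # kth DFT value of the weighted sequence

print(np.allclose(combined, equivalent))  # True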
`
`3.2.2.7 Tree Structure Realizations of Nonuniform Filter Banks
`
`A third method used to implement certain types of nonuniform filter banks is the tree
`structure in which the speech signal is filtered in stages, and the sampling rate is successively
`reduced at each stage for efficiency of implementation. An example of such a realization
`is given in Figure 3.19a for the 4-band, octave-spaced filter bank shown (ideally) in
Figure 3.19b. The original speech signal, s(n), is filtered initially into two bands, a low band and a high band, using quadrature mirror filters (QMFs), i.e., filters whose frequency responses are complementary. The high band, which covers half the spectrum, is reduced in sampling rate by a factor of 2, and represents the highest octave band ($\pi/2 \le \omega \le \pi$) of
`
`
Figure 3.19  Tree structure implementation of a 4-band, octave-spaced filter bank.
`
`the filter bank. The low band is similarly reduced in sampling rate by a factor of 2, and is
`fed into a second filtering stage in which the signal is again split into two equal bands using
`QMF filters. Again the high band of stage 2 is decimated by a factor of 2 and is used as the
`next-highest filter bank output; the low band is also decimated by a factor of 2 and fed into
`a third stage of QMF filters. These third-stage outputs, in this case after decimation by a
`factor of 2, are used as the two lowest filter bands.
`QMF filter bank structures are quite efficient and have been used for a number of
`speech-processing applications [3]. Their efficiency for arbitrary nonuniform filter bank
`structures is not as good as for the octave band designs of Figure 3.19.
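The tree computation can be sketched in a few lines. The example below is only an illustration of the structure, not of the QMF designs of Ref. [3]: it assumes SciPy and uses simple halfband FIR filters (a true QMF pair would be designed jointly so that the two responses are exactly complementary), splitting and decimating three times to produce the four octave-spaced outputs of Figure 3.19.

import numpy as np
from scipy.signal import firwin, lfilter

def split(x, ntaps=31):
    """Split x into complementary low/high halves and decimate each by 2.
    The filters here are ordinary halfband FIR designs, used only to
    illustrate the tree structure."""
    h_lp = firwin(ntaps, 0.5)                      # lowpass with cutoff at pi/2
    h_hp = h_lp * (-1.0) ** np.arange(ntaps)       # highpass by modulating the lowpass to pi
    low = lfilter(h_lp, 1.0, x)[::2]               # lowpass, then decimate by 2
    high = lfilter(h_hp, 1.0, x)[::2]              # highpass, then decimate by 2
    return low, high

rng = np.random.default_rng(3)
s = rng.standard_normal(4096)                      # stand-in for the speech signal s(n)

low1, x4 = split(s)       # stage 1: x4 covers pi/2 .. pi of the original band
low2, x3 = split(low1)    # stage 2: x3 covers pi/4 .. pi/2
x1, x2 = split(low2)      # stage 3: x1 covers 0 .. pi/8, x2 covers pi/8 .. pi/4

for name, band in [("x1", x1), ("x2", x2), ("x3", x3), ("x4", x4)]:
    print(name, len(band))   # successive halving of the sampling rate at each stage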
`
`3.2.3 Summary of Considerations for Speech-Recognition Filter Banks
`
`In the previous sections we discussed several methods of implementing filter banks for
`speech recognition. We have not gone into great detail here because our goal was to make
`the reader familiar with the issues involved in filter-bank design and implementation, not
`to make the reader an expert in signal processing. The interested reader is urged to pursue
`this fascinating area further by studying the material in the References at the end of this
`chapter. In this section we summarize the considerations that go into choosing the number
`and types of filters used in the structures discussed earlier in this section.
`The first consideration for any filter bank is the type of digital filter used. The
`choices are IIR (recursive) and FIR (nonrecursive) designs. The IIR designs have the
`advantage of being implementable in simple, efficient structures. The big disadvantage of
`IIR filters is that their phase response is nonlinear; hence, to minimize this disadvantage
`
`IPR2023-00035
`Apple EX1015 Page 123
`
`

`

`Sec. 3.2
`
`The Bank-of-Filters Front-End Processor
`
`93
`
`a trade-off is usually made between the ideal magnitude characteristics that can readily
be realized, and the highly nonideal phase characteristics. On the other hand, FIR filters
can achieve linear phase without compromising the ability to approximate ideal magnitude
characteristics; however, they are usually computationally expensive in implementation.
For speech-recognition applications, we have shown how an FFT structure can often be
applied to alleviate considerably the computational inefficiency of FIR filter banks; hence,
most practical digital filter bank structures use FIR filters (usually in an FFT realization).
`Once the type of filter has been decided, the next consideration is the number of filters
to be used in the filter bank. For uniform filter banks, the number of filters, Q, cannot be too small or else the ability of the filter bank to resolve the speech spectrum is greatly impaired. Thus values of Q less than about 8 are generally avoided. Similarly, the value of Q cannot be too large (unless there is considerable filter overlap), because the filter bandwidths would eventually be too narrow for some talkers (e.g., high-pitched females or children), and there would be a high probability that certain bands would have extremely low speech energy (i.e., no prominent harmonic would fall within the band). Thus, practical systems tend to have values of Q < 32. Although uniformly spaced filter banks have been widely used for recognition, many practical systems have used nonuniform spacing in an effort to reduce overall computation and to characterize the speech spectrum in a manner considered more consistent with human perception.
`A final consideration for practical filter-bank analyzers is the choice of nonlinearity
`and lowpass filter used at the output of each channel. Typically the nonlinearity has been
`a full wave rectifier (FWR), a half wave rectifier (HWR), or a center clipper. The resultant
`spectrum is only weakly sensitive to the nonlinearity. The lowpass filter used in practice
varies from a simple integrator to a fairly good quality IIR lowpass filter (typically a Bessel
`filter).
`
`3.2.4 Practical Examples of Speech-Recognition Filter Banks
`
Figures 3.20-3.25 [4] show examples of a wide range of speech-recognition filter banks, including both uniform and nonuniform designs. Figure 3.20 is for a 15-channel uniform filter bank in which the basic lowpass filter was designed using the windowing technique with a 101-point Kaiser window. Part a of the figure shows the impulse response of the lowpass filter (i.e., an ideal lowpass filter response multiplied by a Kaiser window). Part b of the figure shows the responses of the individual filters in the filter bank (note there is no overlap between adjacent filters), and part c shows the composite frequency response of the overall filter bank. The sidelobe peak ripple of each individual filter is down about 60 dB, and the composite frequency response is essentially ideally flat over the entire frequency range of interest (approximately 100-3000 Hz).
By contrast, Figure 3.21 is for a 15-channel uniform filter bank in which the basic lowpass filter was a Kaiser window (instead of the Kaiser windowed version of the ideal lowpass filter). From parts b and c of this figure, it can be seen that the individual bandpass filters are narrower in bandwidth than those of Figure 3.20; furthermore, the composite
`
`
Figure 3.20  Window sequence, w(n) (part a), the individual filter responses (part b), and the composite response (part c) of a Q = 15 channel, uniform filter bank, designed using a 101-point Kaiser window smoothed lowpass window (after Dautrich et al. [4]).
`
`filter-bank response shows 18 dB gaps at the boundaries between each filter. Clearly, this
`filter bank would be unacceptable for speech-recognition applications.
Figures 3.22 and 3.23 show individual filter frequency responses, and the composite frequency response, for a 4-channel, octave-band filter bank and a 12-channel, 1/3-octave-band filter bank, respectively. Each of these nonuniform filter banks was designed to cover the frequency band from 200 to 3200 Hz and used linear-phase FIR filters (101 points for the octave band design, and 201 points for the 1/3 octave band design) for each individual channel. The peak sidelobe ripple was about -40 dB for both filter banks.
Figure 3.24 shows a similar set of responses for a 7-channel critical band filter bank in which each individual filter encompassed two critical bands. Again we used 101-point, linear phase, FIR filters with a peak sidelobe of -54 dB to realize each individual bandpass filter. Finally, Figure 3.25 shows the responses of a 13-channel, critical band filter bank in which the individual channels were highly overlapping. The individual bandpass filter responses are rather poor (e.g., the ratio of center frequency to bandwidth of each filter was about 8). However, this poor frequency resolution characteristic was balanced somewhat by the excellent time resolution of the filters.
`
`
Figure 3.21  Window sequence, w(n) (part a), the individual filter responses (part b), and the composite response (part c) of a Q = 15 channel, uniform filter bank, designed using a 101-point Kaiser window directly as the lowpass window (after Dautrich et al. [4]).
`
`3.2.5 Generalizations of Filter-Bank Analyzer
`
`Although we have been concerned primarily with designing and implementing individual
`channels of a filter-bank analyzer, there is a generalized structure that must be considered
`as part of the canonic filter-bank analysis method. This generalized structure is shown in
`Figure 3.26. The generalization includes a signal preprocessor that "conditions" the speech
signal, s(n), to a new form, $\tilde{s}(n)$, which is "more suitable" for filter-bank analysis, and a postprocessor that operates on the filter-bank output vectors, x(m), to give the processed vectors $\tilde{x}(m)$ that are "more suitable" for recognition. Although a wide range of signal-processing operations could go into the preprocessor and postprocessor boxes, perhaps the most reasonable ones include the following.
`
`Preprocessor Operations
`
`• signal preemphasis (to equalize the inherent spectral tilt in speech)
`
`
Figure 3.22  Individual channel responses (parts a to d) and composite filter response (part e) of a Q = 4 channel, octave band design, using 101-point FIR filters in each band (after Dautrich et al. [4]).
`
`• noise elimination
`• signal enhancement (to make the formant peaks more prominent)
`
`Postprocessor Operations
`
`• temporal smoothing of sequential filter-bank output vectors
`• frequency smoothing of individual filter-bank output vectors
`• normalization of each filter-bank output vector
`• thresholding and/or quantization of the filter-bank output vectors
`• principal components analysis of the filter-bank output vector.
`
The purpose of the preprocessor is to make the speech signal as clean as possible so far as the filter-bank analyzer is concerned; hence, noise is eliminated, long-time spectral trends are removed, and the signal is spectrally flattened to give the best immunity to measurement imperfections. Similarly, the purpose of the postprocessor is to clean up the
`
Figure 3.23  Individual channel responses and composite filter response of a Q = 12 channel, 1/3 octave band design, using 201-point FIR filters in each band (after Dautrich et al. [4]).
`
`sequence of feature vectors from the filter-bank analyzer so as to best represent the spectral
`information in the speech signal and thereby to maximize the chances of successful speech
`recognition [4,5].
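As a deliberately minimal illustration of this generalized structure (not a design recommended by the text), the sketch below implements one plausible preprocessor operation (first-order preemphasis) and one plausible postprocessor operation (log compression plus per-vector normalization) around a stand-in filter-bank analyzer; the coefficient 0.95, the frame sizes, and the analyzer itself are all assumptions.

import numpy as np

def preprocess(s, alpha=0.95):
    """Preemphasis: s~(n) = s(n) - alpha * s(n-1), flattening the spectral tilt."""
    return np.append(s[0], s[1:] - alpha * s[:-1])

def filter_bank(s, frame_len=256, hop=100, nbands=16):
    """Stand-in filter-bank analyzer: magnitude short-time DFT, grouped into uniform bands."""
    frames = np.stack([s[i:i + frame_len] * np.hamming(frame_len)
                       for i in range(0, len(s) - frame_len, hop)])
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    bands = np.array_split(np.arange(spectra.shape[1]), nbands)
    return np.stack([spectra[:, b].sum(axis=1) for b in bands], axis=1)

def postprocess(X, eps=1e-10):
    """Postprocess the output vectors x(m): log compression, then per-vector normalization."""
    Xlog = np.log(X + eps)
    return Xlog - Xlog.mean(axis=1, keepdims=True)

rng = np.random.default_rng(4)
s = rng.standard_normal(8000)            # stand-in for a speech signal
x = filter_bank(preprocess(s))           # x(m): one filter-bank output vector per frame
x_tilde = postprocess(x)                 # x~(m): vectors passed on to the recognizer
print(x_tilde.shape)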
`
3.3 LINEAR PREDICTIVE CODING MODEL FOR SPEECH RECOGNITION
`
`The theory of linear predictive coding (LPC), as applied to speech, has been well understood
`for many years (see for example Ref. [6]). In this section we describe the basics of how LPC
`has been applied in speech-recognition systems. The mathematical details and derivations
`will be omitted here; the interested reader is referred to the references.
`Before describing a general LPC front-end processor for speech recognition, it is
`worthwhile to review the reasons why LPC has been so widely used. These include the
`following:
`
`
Figure 3.24  Individual channel responses (parts a to g) and composite filter response (part h) of a Q = 7 channel critical band filter bank design (after Dautrich et al. [4]).
`
`1. LPC provides a good model of the speech signal. This is especially true for the quasi
`steady state voiced regions of speech in which the all-pole model of LPC provides
`a good approximation to the vocal tract spectral envelope. During unvoiced and
`transient regions of speech, the LPC model is less effective than for voiced regions,
`but it still provides an acceptably useful model for speech-recognition purposes.
`2. The way in which LPC is applied to the analysis of speech signals leads to a reasonable
`source-vocal tract separation. As a result, a parsimonious representation of the vocal
`tract characteristics (which we know are directly related to the speech sound being
`produced) becomes possible.
`3. LPC is an analytically tractable model. The method of LPC is mathematically precise
`and is simple and straightforward to implement in either software or hardware. The
`computation involved in LPC processing is considerably less than that required for
`an all-digital implementation of the bank-of-filters model described in Section 3.2.
`4. The LPC model works well in recognition applications. Experience has shown that
`
`
Figure 3.25  Individual channel responses and composite filter response of a Q = 13 channel, critical band spacing filter bank, using highly overlapping filters in frequency (after Dautrich et al. [4]).

Figure 3.26  Generalization of filter-bank analysis model: s(n) → preprocessor → $\tilde{s}(n)$ → filter-bank analyzer → x(m) → postprocessor → $\tilde{x}(m)$.
`
the performance of speech recognizers based on LPC front ends is comparable to or
better than that of recognizers based on filter-bank front ends (see References [4,5,7]).
`
`Based on the above considerations, LPC front-end processing has been used in a large
`number of recognizers. In particular, most of the systems to be described in this book are
`based on this model.
`
`
Figure 3.27  Linear prediction model of speech.
`
`3.3.1 The LPC Model
`
The basic idea behind the LPC model is that a given speech sample at time n, s(n), can be approximated as a linear combination of the past p speech samples, such that

$$s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \cdots + a_p s(n-p), \tag{3.40}$$

where the coefficients $a_1, a_2, \ldots, a_p$ are assumed constant over the speech analysis frame. We convert Eq. (3.40) to an equality by including an excitation term, G u(n), giving

$$s(n) = \sum_{i=1}^{p} a_i s(n-i) + G\,u(n), \tag{3.41}$$

where u(n) is a normalized excitation and G is the gain of the excitation. By expressing Eq. (3.41) in the z-domain we get the relation

$$S(z) = \sum_{i=1}^{p} a_i z^{-i} S(z) + G\,U(z), \tag{3.42}$$

leading to the transfer function

$$H(z) = \frac{S(z)}{G\,U(z)} = \frac{1}{1 - \sum\limits_{i=1}^{p} a_i z^{-i}} = \frac{1}{A(z)}. \tag{3.43}$$
The interpretation of Eq. (3.43) is given in Figure 3.27, which shows the normalized excitation source, u(n), being scaled by the gain, G, and acting as input to the all-pole system, $H(z) = 1/A(z)$, to produce the speech signal, s(n). Based on our knowledge that the
`actual excitation function for speech is essentially either a quasiperiodic pulse train (for
`voiced speech sounds) or a random noise source (for unvoiced sounds), the appropriate
`synthesis model for speech, corresponding to the LPC analysis, is as shown in Figure 3.28.
`Here the normalized excitation source is chosen by a switch whose position is controlled
`by the voiced/unvoiced character of the speech, which chooses either a quasiperiodic train
`of pulses as the excitation for voiced sounds, or a random noise sequence for unvoiced
`sounds. The appropriate gain, G, of the source is estimated from the speech signal, and the
`scaled source is used as input to a digital filter (H(z)), which is controlled by the vocal tract
`
`
Figure 3.28  Speech synthesis model based on the LPC model: an impulse-train generator (with pitch period) or a random-noise generator is selected by the voiced/unvoiced switch, the excitation u(n) is scaled by the gain G, and the result drives a time-varying digital filter controlled by the vocal tract parameters to produce s(n).
`
`parameters characteristic of the speech being produced. Thus the parameters of this model
`are voiced/unvoiced classification, pitch period for voiced sounds, the gain parameter, and
`the coefficients of the digital filter, { ak}. These parameters all vary slowly with time.
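As an illustration of this synthesis model (not an implementation from the text), the sketch below selects the excitation with the voiced/unvoiced switch, scales it by the gain G, and filters it with the all-pole system H(z) = 1/A(z) of Eq. (3.43); the pole locations, gain, pitch period, and frame length are arbitrary assumed values.

import numpy as np
from scipy.signal import lfilter

def lpc_synthesize(A, G, n_samples, voiced, pitch_period=80, seed=0):
    """Synthesize one frame of speech per the model of Figure 3.28.
    A: coefficients of A(z) = 1 - sum_i a_i z^-i (lfilter denominator);
    G: gain; voiced: position of the voiced/unvoiced switch."""
    rng = np.random.default_rng(seed)
    if voiced:
        u = np.zeros(n_samples)
        u[::pitch_period] = 1.0              # quasiperiodic impulse train
    else:
        u = rng.standard_normal(n_samples)   # random noise excitation
    return lfilter([1.0], A, G * u)          # all-pole filter H(z) = 1/A(z)

# Illustrative vocal-tract filter built from two assumed resonances (stable by construction).
poles = [0.95 * np.exp(1j * 0.3 * np.pi), 0.9 * np.exp(1j * 0.6 * np.pi)]
A = np.real(np.poly(poles + [p.conjugate() for p in poles]))   # [1, -a_1, ..., -a_p]
a = -A[1:]                                                     # the predictor coefficients {a_k}

s_voiced = lpc_synthesize(A, G=0.5, n_samples=800, voiced=True, pitch_period=80)
s_unvoiced = lpc_synthesize(A, G=0.1, n_samples=800, voiced=False)
print(a, s_voiced.shape, s_unvoiced.shape)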
`
`3.3.2 LPC Analysis Equations
`
`Based on the model of Figure 3.27, the exact relation between s(n) and u(n) is
`
$$s(n) = \sum_{k=1}^{p} a_k s(n-k) + G\,u(n). \tag{3.44}$$

We consider the linear combination of past speech samples as the estimate $\tilde{s}(n)$, defined as

$$\tilde{s}(n) = \sum_{k=1}^{p} a_k s(n-k). \tag{3.45}$$

We now form the prediction error, e(n), defined as

$$e(n) = s(n) - \tilde{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k), \tag{3.46}$$

with error transfer function

$$A(z) = \frac{E(z)}{S(z)} = 1 - \sum_{k=1}^{p} a_k z^{-k}. \tag{3.47}$$

Clearly, when s(n) is actually generated by a linear system of the type shown in Figure 3.27, then the prediction error, e(n), will equal G u(n), the scaled excitation.
The basic problem of linear prediction analysis is to determine the set of predictor
`
`
coefficients, $\{a_k\}$, directly from the speech signal so that the spectral properties of the digital filter of Figure 3.28 match those of the speech waveform within the analysis window. Since the spectral characteristics of speech vary over time, the predictor coefficients at a given time, n, must be estimated from a short segment of the speech signal occurring around time n. Thus the basic approach is to find a set of predictor coefficients that minimize the mean-squared prediction error over a short segment of the speech waveform. (Usually this type of short-time spectral analysis is performed on successive frames of speech, with frame spacing on the order of 10 msec.)
To set up the equations that must be solved to determine the predictor coefficients, we define short-term speech and error segments at time n as

$$s_n(m) = s(n+m), \tag{3.48a}$$
$$e_n(m) = e(n+m), \tag{3.48b}$$

and we seek to minimize the mean squared error signal at time n,

$$E_n = \sum_{m} e_n^2(m), \tag{3.49}$$

which, using the definition of $e_n(m)$ in terms of $s_n(m)$, can be written as

$$E_n = \sum_{m} \left[s_n(m) - \sum_{k=1}^{p} a_k s_n(m-k)\right]^2. \tag{3.50}$$
`
To solve Eq. (3.50) for the predictor coefficients, we differentiate $E_n$ with respect to each $a_i$ and set the result to zero,

$$\frac{\partial E_n}{\partial a_i} = 0, \qquad i = 1, 2, \ldots, p, \tag{3.51}$$

giving

$$\sum_{m} s_n(m-i)\,s_n(m) = \sum_{k=1}^{p} a_k \sum_{m} s_n(m-i)\,s_n(m-k). \tag{3.52}$$

By recognizing that terms of the form $\sum_m s_n(m-i)\,s_n(m-k)$ are terms of the short-term covariance of $s_n(m)$, i.e.,

$$\phi_n(i,k) = \sum_{m} s_n(m-i)\,s_n(m-k), \tag{3.53}$$

we can express Eq. (3.52) in the compact notation

$$\phi_n(i,0) = \sum_{k=1}^{p} a_k\,\phi_n(i,k), \tag{3.54}$$
`
which describes a set of p equations in p unknowns. It is readily shown that the minimum
mean-squared error, $E_n$, can be expressed as
`
$$E_n = \sum_{m} s_n^2(m) - \sum_{k=1}^{p} a_k \sum_{m} s_n(m)\,s_n(m-k) \tag{3.55}$$

$$E_n = \phi_n(0,0) - \sum_{k=1}^{p} a_k\,\phi_n(0,k). \tag{3.56}$$
`
Thus the minimum mean-squared error consists of a fixed term ($\phi_n(0,0)$) and terms that depend on the predictor coefficients.
To solve Eq. (3.54) for the optimum predictor coefficients (the $a_k$'s) we have to compute $\phi_n(i,k)$ for $1 \le i \le p$ and $0 \le k \le p$, and then solve the resulting set of p simultaneous equations. In practice, the method of solving the equations (as well as the method of computing the $\phi$'s) is a strong function of the range of m used in defining both the section of speech for analysis and the region over which the mean-squared error is computed. We now discuss two standard methods of defining this range for speech.
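Equations (3.53)-(3.54) can be set up and solved directly with a general-purpose linear solver. The sketch below is illustrative only (NumPy; the synthetic frame, the order p, and the particular summation range for m are assumptions): it computes the φ_n(i, k) terms over a fixed frame, solves the p simultaneous equations of Eq. (3.54), and evaluates the minimum error of Eq. (3.56).

import numpy as np

def phi(sn, i, k, p):
    """Short-term covariance phi_n(i, k) = sum_m sn(m - i) sn(m - k), Eq. (3.53).
    The sum runs over m = p .. N-1 so that every sample it needs lies inside the
    frame (one possible choice of the range discussed in the text)."""
    m = np.arange(p, len(sn))
    return float(np.dot(sn[m - i], sn[m - k]))

def solve_normal_equations(sn, p):
    """Solve Eq. (3.54): phi_n(i, 0) = sum_k a_k phi_n(i, k), i = 1..p."""
    Phi = np.array([[phi(sn, i, k, p) for k in range(1, p + 1)] for i in range(1, p + 1)])
    psi = np.array([phi(sn, i, 0, p) for i in range(1, p + 1)])
    return np.linalg.solve(Phi, psi)

# Illustrative frame: a decaying sinusoid plus a little noise, standing in for voiced speech.
rng = np.random.default_rng(5)
nn = np.arange(240)
sn = np.sin(0.25 * np.pi * nn) * 0.999 ** nn + 0.01 * rng.standard_normal(nn.size)

p = 10
a = solve_normal_equations(sn, p)
E_min = phi(sn, 0, 0, p) - sum(a[k - 1] * phi(sn, 0, k, p) for k in range(1, p + 1))  # Eq. (3.56)
print(a, E_min)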
`
`3.3.3 The Autocorrelation Method
`
A fairly simple and straightforward way of defining the limits on m in the summations is to assume that the speech segment, $s_n(m)$, is identically zero outside the interval $0 \le m \le N-1$. This is equivalent to assuming that the speech signal, $s(m+n)$, is multiplied by a finite-length window, $w(m)$, which is identically zero outside the range $0 \le m \le N-1$. Thus the speech sample for minimization can be expressed as

$$s_n(m) = \begin{cases} s(m+n)\,w(m), & 0 \le m \le N-1 \\ 0, & \text{otherwise.} \end{cases} \tag{3.57}$$
`
`The effect of weighting of the speech by a window is illustrated in Figures 3.29-3.31.
`In each of these figures, the upper panel shows the running speech waveform, s(m), the
`middle panel shows the weighted section of speech (using a Hamming window for w(m)),
`and the bottom panel shows the resulting error signal, en(m), based on optimum selection
`of the predictor parameters.
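It may help to see the windowing of Eq. (3.57) and the resulting prediction error in code. The sketch below is an illustration only: the synthetic "voiced" frame, the Hamming window, the analysis order p = 10, and the use of a general linear solver (in place of the Levinson-Durbin recursion normally used for the Toeplitz system that arises from this choice of s_n(m)) are all assumptions, not prescriptions from the text.

import numpy as np
from scipy.signal import lfilter

def autocorrelation_lpc(sn, p):
    """LPC analysis of one windowed frame s_n(m) defined as in Eq. (3.57).
    Because s_n(m) is zero outside 0 <= m <= N-1, the covariance terms phi_n(i, k)
    depend only on |i - k| and reduce to the short-time autocorrelation R_n(|i - k|),
    so Eq. (3.54) becomes a Toeplitz system.  It is solved here with a general linear
    solver for brevity."""
    N = len(sn)
    R = np.array([np.dot(sn[:N - k], sn[k:]) for k in range(p + 1)])
    Toep = np.array([[R[abs(i - k)] for k in range(p)] for i in range(p)])
    return np.linalg.solve(Toep, R[1:p + 1])

# Illustrative "voiced" frame: an impulse-train-excited resonance plus a little noise,
# weighted by a Hamming window as in Eq. (3.57).  All parameter values are assumptions.
rng = np.random.default_rng(6)
N, p = 320, 10
excitation = np.zeros(N)
excitation[::80] = 1.0                                   # pitch period of 80 samples
impulse_resp = 0.97 ** np.arange(200) * np.sin(0.3 * np.pi * np.arange(200))
s = np.convolve(excitation, impulse_resp)[:N]
sn = (s + 0.001 * rng.standard_normal(N)) * np.hamming(N)

a = autocorrelation_lpc(sn, p)
# Prediction error e_n(m) = s_n(m) - sum_k a_k s_n(m - k), with s_n zero outside the frame;
# padding by p zeros exposes the region m = N .. N-1+p discussed below.
e = lfilter(np.concatenate(([1.0], -a)), [1.0], np.concatenate((sn, np.zeros(p))))
print(np.abs(e[:p]).max(), np.abs(e[N:]).max())          # errors near the two frame edges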
Based on Eq. (3.57), for m < 0 the error signal $e_n(m)$ is exactly zero, since $s_n(m) = 0$ for all m < 0 and therefore there is no prediction error. Furthermore, for $m > N-1+p$ there is again no prediction error because $s_n(m) = 0$ for all $m > N-1$. However, in the region of m = 0 (i.e., from m = 0 to m = p-1) the windowed speech signal $s_n(m)$ is being predicted from previous samples, some of which are arbitrarily zero. Hence the potential for relatively large prediction errors exists in this region and can actually be seen to exist in the bottom panel of Figure 3.29. Furthermore, in the region of m = N-1 (i.e., from m = N-1 to m = N-1+p) the potential of large prediction errors again exists because the zero-valued (weighted) speech signal is being predicted from at least some nonzero previous speech samples. In the bottom panel of Figure 3.30 we see this effect at the end of the prediction error waveform. These two effects are especially prominent for voiced speech when the beginning of a pitch period occurs at or very close to the m = 0 or m = N-1 points of the sample. For unvoiced speech, these problems are essentially eliminated because no part of the waveform is position sensitive. Hence we see no such large prediction errors at either end of the error signal in the unvoiced example of Figure 3.31.
Figure 3.29  Illustration of speech sample, weighted speech section, and prediction error for voiced speech where the prediction error is large at the beginning of the section.

Figure 3.30  Illustration of speech sample, weighted speech section, and prediction error for voiced speech where the prediction error is large at the end of the section.

Figure 3.31  Illustration of speech sample, weighted speech section, and prediction error for unvoiced speech.