`
`The Bank-of-Filters Front-End Processor
`
`
3.2 THE BANK-OF-FILTERS FRONT-END PROCESSOR
`
A block diagram of the canonic structure of a complete filter-bank front-end analyzer is given in Figure 3.4. The sampled speech signal, $s(n)$, is passed through a bank of $Q$ bandpass filters, giving the signals

$$
\begin{aligned}
s_i(n) &= s(n) * h_i(n), \qquad 1 \le i \le Q, && \text{(3.1a)} \\
       &= \sum_{m=0}^{M_i-1} h_i(m)\, s(n-m), && \text{(3.1b)}
\end{aligned}
$$

where we have assumed that the impulse response of the $i$th bandpass filter is $h_i(m)$ with a duration of $M_i$ samples; hence, we use the convolution representation of the filtering operation to give an explicit expression for $s_i(n)$, the bandpass-filtered speech signal. Since the purpose of the filter-bank analyzer is to give a measurement of the energy of the speech signal in a given frequency band, each of the bandpass signals, $s_i(n)$, is passed through a nonlinearity, such as a full-wave or half-wave rectifier. The nonlinearity shifts the bandpass signal spectrum to the low-frequency band and also creates high-frequency images. A lowpass filter is used to eliminate the high-frequency images, giving a set of signals, $u_i(n)$, $1 \le i \le Q$, which represent an estimate of the speech signal energy in each of the $Q$ frequency bands.
To more fully understand the effects of the nonlinearity and the lowpass filter, let us assume that the output of the $i$th bandpass filter is a pure sinusoid at frequency $\omega_i$, i.e.,

$$s_i(n) = \alpha_i \sin(\omega_i n). \tag{3.2}$$

This assumption is valid for speech in the case of steady-state voiced sounds when the bandwidth of the filter is sufficiently narrow that only a single speech harmonic is passed by the bandpass filter. If we use a full-wave rectifier as the nonlinearity, that is,

$$
f(s_i(n)) =
\begin{cases}
s_i(n) & \text{for } s_i(n) \ge 0 \\
-s_i(n) & \text{for } s_i(n) < 0,
\end{cases}
\tag{3.3}
$$

then we can represent the nonlinearity output as

$$v_i(n) = f(s_i(n)) = s_i(n) \cdot w(n), \tag{3.4}$$

where

$$
w(n) =
\begin{cases}
+1 & \text{if } s_i(n) \ge 0 \\
-1 & \text{if } s_i(n) < 0,
\end{cases}
\tag{3.5}
$$

as illustrated in Figure 3.5(a)-(c). Since the nonlinearity output can be viewed as a modulation in time, as shown in Eq. (3.4), in the frequency domain we get the result that

$$V_i(e^{j\omega}) = S_i(e^{j\omega}) \circledast W(e^{j\omega}), \tag{3.6}$$

where $V_i(e^{j\omega})$, $S_i(e^{j\omega})$, and $W(e^{j\omega})$ are the Fourier transforms of the signals $v_i(n)$, $s_i(n)$, and $w(n)$, respectively, and $\circledast$ denotes circular convolution. The spectrum $S_i(e^{j\omega})$ is a single impulse at $\omega = \omega_i$, whereas the spectrum $W(e^{j\omega})$ is a set of impulses at the odd-harmonic
`IPR2023-00037
`Apple EX1013 Page 104
`
`
`
frequencies $\omega_q = \omega_i q$, $q = 1, 3, \ldots, q_{\max}$. Hence the spectrum $V_i(e^{j\omega})$ consists of an impulse at $\omega = 0$ and a set of smaller-amplitude impulses at $\omega_q = \omega_i q$, $q = 2, 4, 6, \ldots$, as shown in Figure 3.5(d)-(f). The effect of the lowpass filter is to retain the DC component of $V_i(e^{j\omega})$ and to filter out the higher-frequency components due to the nonlinearity.

Figure 3.4 Complete bank-of-filters analysis model. (Each channel $i = 1, \ldots, Q$: $s(n)$ → bandpass filter $i$ → $s_i(n)$ → nonlinearity → $v_i(n)$ → lowpass filter → $t_i(n)$ → sampling rate reduction → $u_i(m)$ → amplitude compression → $x_i(m)$.)

Figure 3.5 Typical waveforms and spectra for analysis of a pure sinusoid in the filter-bank model. (Panels (a)-(c): the waveforms $s_i(n)$, $w(n)$, $v_i(n)$; panels (d)-(f): the corresponding spectra.)

The above analysis, although only strictly correct for a pure sinusoid, is a reasonably good model for voiced, quasiperiodic speech sounds so long as the bandpass filter is not so wide that it passes two or more strong signal harmonics. Because of the time-varying nature of the speech signal (i.e., the quasiperiodicity), the spectrum of the lowpass signal is not a pure DC impulse; instead, the information in the signal is contained in a low-frequency band around DC. Figure 3.6 illustrates typical waveforms of $s(n)$, $s_i(n)$, $w(n)$, and $v_i(n)$ for a brief (20 msec) section of voiced speech processed by a narrow-bandwidth channel with a center frequency of 500 Hz (the sampling frequency for this example is 10,000 Hz). Also shown are the resulting spectral magnitudes for the four signals. It can be seen that $|S_i(e^{j\omega})|$ has most of its energy around 500 Hz ($\omega = 1000\pi$), whereas $|W(e^{j\omega})|$ (which is quasiperiodic) approximates an odd-harmonic signal with peaks at 500, 1500, and 2500 Hz. The resulting signal spectrum, $|V_i(e^{j\omega})|$, shows the desired low-frequency concentration of energy as well as the undesired spectral peaks at 1000 Hz, 2000 Hz, etc. The role of the final lowpass filter is to eliminate the undesired spectral peaks.

Figure 3.6 Typical waveforms and spectra of a voiced speech signal in the bank-of-filters analysis model.

The bandwidth of the signal, $v_i(n)$, is related to the fastest rate of motion of speech harmonics in a narrow band and is generally acknowledged to be on the order of 20-30 Hz. Hence, the final two blocks of the canonic bank-of-filters model of Figure 3.4 are a sampling rate reduction box, in which the lowpass-filtered signals, $t_i(n)$, are resampled at a rate on the order of 40-60 Hz (for economy of representation), and an amplitude compression box, in which the signal dynamic range is compressed using an amplitude compression scheme (e.g., logarithmic encoding, µ-law encoding).
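The rectifier analysis above can be checked numerically. The following is a minimal sketch, assuming a unit-amplitude 500-Hz sinusoid sampled at 10 kHz (the values used for the Figure 3.6 example); the function names `rectified_spectrum` and `level_at` are illustrative only and do not come from the text.

```python
import numpy as np

def rectified_spectrum(f0=500.0, fs=10_000.0, n=2000):
    """Full-wave rectify a sinusoid; return (freqs, magnitude spectrum, v)."""
    t = np.arange(n) / fs
    s = np.sin(2 * np.pi * f0 * t)       # s_i(n) of Eq. (3.2), amplitude 1 (assumed)
    w = np.where(s >= 0, 1.0, -1.0)      # switching signal w(n) of Eq. (3.5)
    v = s * w                            # rectifier output v_i(n) of Eq. (3.4)
    mag = np.abs(np.fft.rfft(v)) / n     # magnitude spectrum, scaled so mag[0] is the mean
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, mag, v

freqs, mag, v = rectified_spectrum()

def level_at(f_hz):
    """Spectral magnitude at the DFT bin nearest to f_hz."""
    return mag[np.argmin(np.abs(freqs - f_hz))]

for f in (0, 500, 1000, 1500, 2000):
    print(f"{f:5d} Hz: {level_at(f):.4f}")
```

In this sketch the energy of the rectified signal sits at DC and at the even multiples of 500 Hz, while the components at 500 Hz and 1500 Hz vanish, which is the behavior described for $V_i(e^{j\omega})$. A lowpass filter that keeps only the DC term then yields the energy estimate for the channel.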
Consider the design of a $Q = 16$ channel filter bank for a wideband speech signal where the highest frequency of interest is 8 kHz. Assume we use a sampling rate of $F_s = 20$ kHz on the speech data to minimize the effects of aliasing in the analog-to-digital conversion. The information rate (bit rate) of the raw speech signal is on the order of 240 kbits per second (20 k samples per second times 12 bits per sample). At the output of the analyzer, if we use a sampling rate of 50 Hz and a 7-bit logarithmic amplitude compressor, we get an information rate of 16 channels times 50 samples per second per channel times 7 bits per sample, or 5600 bits per second. Thus, for this simple example we have achieved about a 40-to-1 reduction in bit rate (240,000/5600 ≈ 43), and hopefully such a data reduction would result in an improved representation of the significant information in the speech signal.

Figure 3.7 Ideal (a) and realistic (b) set of filter responses of a Q-channel filter bank covering the frequency range $F_s/N$ to $(Q + \tfrac{1}{2})F_s/N$.
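The bit-rate arithmetic of this example can be reproduced in a few lines. This is only a sketch of the calculation stated above; the helper name `bit_rate` is an assumption made for illustration.

```python
def bit_rate(channels, samples_per_sec, bits_per_sample):
    """Information rate in bits per second."""
    return channels * samples_per_sec * bits_per_sample

raw_rate = bit_rate(1, 20_000, 12)      # raw speech: 20 kHz sampling, 12 bits/sample
analyzer_rate = bit_rate(16, 50, 7)     # 16 channels, 50 Hz each, 7-bit log compression
reduction = raw_rate / analyzer_rate    # roughly the "40-to-1" figure quoted in the text

print(raw_rate, analyzer_rate, round(reduction, 1))
```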
`
3.2.1 Types of Filter Bank Used for Speech Recognition

The most common type of filter bank used for speech recognition is the uniform filter bank, for which the center frequency, $f_i$, of the $i$th bandpass filter is defined as

$$f_i = \frac{F_s}{N}\, i, \qquad 1 \le i \le Q, \tag{3.7}$$

where $F_s$ is the sampling rate of the speech signal, and $N$ is the number of uniformly spaced filters required to span the frequency range of the speech. The actual number of filters used in the filter bank, $Q$, satisfies the relation

$$Q \le N/2 \tag{3.8}$$

with equality when the entire frequency range of the speech signal is used in the analysis. The bandwidth, $b_i$, of the $i$th filter generally satisfies the property

$$b_i \ge \frac{F_s}{N}, \tag{3.9}$$

with equality meaning that there is no frequency overlap between adjacent filter channels, and with inequality meaning that adjacent filter channels overlap. (If $b_i < F_s/N$, then certain portions of the speech spectrum would be missing from the analysis and the resulting speech spectrum would not be considered very meaningful.) Figure 3.7a shows a set of $Q$ ideal, non-overlapping bandpass filters covering the range from $(F_s/N)(\tfrac{1}{2})$ to $(F_s/N)(Q + \tfrac{1}{2})$. Similarly, Figure 3.7b shows a more realistic set of $Q$ overlapping filters covering approximately the same range.
`
`
The alternative to uniform filter banks is nonuniform filter banks designed according to some criterion for how the individual filters should be spaced in frequency. One commonly used criterion is to space the filters uniformly along a logarithmic frequency scale. (A logarithmic frequency scale is often justified from a human auditory perception point of view, as will be discussed in Chapter 4.) Thus for a set of $Q$ bandpass filters with center frequencies, $f_i$, and bandwidths, $b_i$, $1 \le i \le Q$, we set

$$b_1 = C \tag{3.10a}$$

$$b_i = \alpha\, b_{i-1}, \qquad 2 \le i \le Q, \tag{3.10b}$$

$$f_i = f_1 + \sum_{j=1}^{i-1} b_j + \frac{(b_i - b_1)}{2}, \tag{3.11}$$

where $C$ and $f_1$ are the arbitrary bandwidth and center frequency of the first filter, and $\alpha$ is the logarithmic growth factor.
The most commonly used values of $\alpha$ are $\alpha = 2$, which gives an octave band spacing of adjacent filters, and $\alpha = 4/3$, which gives approximately a 1/3 octave filter spacing (the exact third-octave ratio is $2^{1/3} \approx 1.26$). Consider the design of a four-band, octave-spaced, non-overlapping filter bank covering the frequency band from 200 to 3200 Hz (with a sampling rate of 6.67 kHz). Figure 3.8a shows the ideal filters for this filter bank. Values for $f_1$ and $C$ of 300 Hz and 200 Hz are used, giving the following filter specifications:

Filter 1: $f_1$ = 300 Hz, $b_1$ = 200 Hz
Filter 2: $f_2$ = 600 Hz, $b_2$ = 400 Hz
Filter 3: $f_3$ = 1200 Hz, $b_3$ = 800 Hz
Filter 4: $f_4$ = 2400 Hz, $b_4$ = 1600 Hz

An example of a 12-band, 1/3-octave, ideal filter-bank specification, covering the band from about 200 to 3200 Hz, is given in Figure 3.8b. For this example, $C = 50$ Hz and $f_1 = 225$ Hz.
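The octave-band specification above follows directly from Eqs. (3.10) and (3.11). A minimal sketch, with the function name `log_spaced_bank` chosen here purely for illustration:

```python
def log_spaced_bank(f1, c, alpha, q):
    """Center frequencies and bandwidths of a log-spaced bank, Eqs. (3.10)-(3.11)."""
    bandwidths = [c * alpha**i for i in range(q)]   # b_1 = C, b_i = alpha * b_(i-1)
    centers = []
    for i in range(q):
        # f_i = f_1 + sum_{j<i} b_j + (b_i - b_1) / 2
        centers.append(f1 + sum(bandwidths[:i]) + (bandwidths[i] - bandwidths[0]) / 2)
    return centers, bandwidths

centers, bandwidths = log_spaced_bank(f1=300.0, c=200.0, alpha=2.0, q=4)
edges = [(f - b / 2, f + b / 2) for f, b in zip(centers, bandwidths)]

for i, (f, b, (lo, hi)) in enumerate(zip(centers, bandwidths, edges), start=1):
    print(f"Filter {i}: f = {f:.0f} Hz, b = {b:.0f} Hz, band = {lo:.0f}-{hi:.0f} Hz")
```

The band edges computed this way are contiguous, so the four filters tile the 200-3200 Hz range without gaps or overlap.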
An alternative criterion for designing a nonuniform filter bank is to use the critical band scale directly. The spacing of filters along the critical band scale is based on perceptual studies and is intended to choose bands that give equal contribution to speech articulation. The general shape of the critical band scale is given in Figure 3.9. The scale is close to linear for frequencies below about 1000 Hz (i.e., the bandwidth is essentially constant as a function of $f$), and is close to logarithmic for frequencies above 1000 Hz (i.e., the bandwidth is essentially exponential as a function of $f$). Several variants on the critical band scale have been used, including the mel scale and the bark scale. The differences between these variants are small and are, for the most part, insignificant with regard to the design of filter banks for speech-recognition purposes. For example, Figure 3.8c shows a 7-band critical-band filter-bank specification.

Other criteria for designing nonuniform filter banks have been proposed in the literature. For the most part, the uniform designs and the nonuniform designs based on critical band scales have been the most widely used and studied filter-bank methods.
`
Figure 3.8 Ideal specifications of a 4-channel octave band-filter bank (a), a 12-channel third-octave band filter bank (b), and a 7-channel critical band scale filter bank (c) covering the telephone bandwidth range (200-3200 Hz). (Band edges in Hz: (a) 200, 400, 800, 1600, 3200; (b) 200, 250, 315, 400, 500, 630, 800, 1000, 1260, 1600, 2000, 2520, 3200; (c) 200, 400, 630, 920, 1270, 1720, 2320, 3200.)

Figure 3.9 The variation of bandwidth with frequency for the perceptually based critical band scale.
`
3.2.2 Implementations of Filter Banks

A filter bank can be implemented in several ways, depending on the method used to design the individual filters. Design methods for digital filters fall into two broad classes: (1) infinite impulse response (IIR) and (2) finite impulse response (FIR) methods. For IIR filters (also commonly called recursive filters in the literature), the most straightforward, and generally the most efficient, implementation is to realize each individual bandpass filter as a cascade or parallel structure. (See Reference [1], pp. 40-46, for a discussion of such structures.)

For FIR filters there are several possible methods of implementing the bandpass filters in the filter bank. The most straightforward and the simplest implementation is the direct form structure. In this case, if we denote the impulse response for the $i$th channel as $h_i(n)$, $0 \le n \le L - 1$, then the output of the $i$th channel, $x_i(n)$, can be expressed as the discrete, finite convolution of the input signal, $s(n)$, with the impulse response, $h_i(n)$, i.e.,

$$
\begin{aligned}
x_i(n) &= s(n) * h_i(n) && \text{(3.12a)} \\
       &= \sum_{m=0}^{L-1} h_i(m)\, s(n-m). && \text{(3.12b)}
\end{aligned}
$$

The computation of Eq. (3.12) is iterated on each channel $i$, for $i = 1, 2, \ldots, Q$. The advantages of the convolutional, direct form structure are its simplicity and that it works for arbitrary $h_i(n)$. The disadvantage of this implementation is the high computational requirement. Thus for a $Q$-channel FIR filter bank, where each bandpass FIR filter has an impulse response of $L$ samples duration, we require

$$C_{\mathrm{DFFIR}} = LQ \quad (\cdot, +) \quad \text{(multiplications, additions)} \tag{3.13}$$

for a complete evaluation of $x_i(n)$, $i = 1, 2, \ldots, Q$, at a single value of $n$.

An alternative, less expensive implementation can be derived for the case in which each bandpass filter impulse response can be represented as a fixed lowpass window, $w(n)$, modulated by the complex exponential, $e^{j\omega_i n}$, that is,

$$h_i(n) = w(n)\, e^{j\omega_i n}. \tag{3.14}$$

In this case Eq. (3.12b) becomes

$$
\begin{aligned}
x_i(n) &= \sum_m w(m)\, e^{j\omega_i m}\, s(n-m) \\
       &= \sum_m s(m)\, w(n-m)\, e^{j\omega_i (n-m)} \\
       &= e^{j\omega_i n} \sum_m s(m)\, w(n-m)\, e^{-j\omega_i m} && \text{(3.15a)} \\
       &= e^{j\omega_i n}\, S_n(e^{j\omega_i}), && \text{(3.15b)}
\end{aligned}
$$

where $S_n(e^{j\omega_i})$ is the short-time Fourier transform of $s(n)$ at frequency $\omega_i = 2\pi f_i$. The importance of Eq. (3.15) is that efficient procedures often exist for evaluating the short-time Fourier transform using FFT methods. We will discuss such procedures shortly; first, however, we briefly review the interpretations of the short-time Fourier transform (see Ref. [2] for a more complete discussion of this fascinating branch of signal processing).

Figure 3.10 The signals $s(m)$ and $w(n - m)$ used in evaluation of the short-time Fourier transform.
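The identity between the direct-form output and the modulated short-time Fourier transform in Eq. (3.15) can be verified numerically. The sketch below assumes a random test signal, a 32-point Hamming window as the lowpass window $w(n)$, and an arbitrary channel frequency; these choices, and the helper name `stft_at`, are illustrative assumptions rather than values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 32
s = rng.standard_normal(256)                      # test signal (assumed), zero for n < 0
w = np.hamming(L)                                 # lowpass window w(n), n = 0..L-1
omega_i = 2 * np.pi * 0.1                         # channel frequency in rad/sample (assumed)
h_i = w * np.exp(1j * omega_i * np.arange(L))     # Eq. (3.14)

# Direct form, Eq. (3.12b): x_i(n) = sum_m h_i(m) s(n - m)
x_direct = np.convolve(s, h_i)[: len(s)]

def stft_at(n, omega):
    """S_n(e^{j omega}) = sum_m s(m) w(n - m) e^{-j omega m}, Eq. (3.16)."""
    m = np.arange(max(0, n - L + 1), n + 1)       # samples under the window
    return np.sum(s[m] * w[n - m] * np.exp(-1j * omega * m))

n0 = 100
x_stft = np.exp(1j * omega_i * n0) * stft_at(n0, omega_i)   # Eq. (3.15b)

print(abs(x_direct[n0] - x_stft))                 # difference at rounding-error level
```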
`
3.2.2.1 Frequency Domain Interpretation of the Short-Time Fourier Transform

The short-time Fourier transform of the sequence $s(m)$ is defined as

$$S_n(e^{j\omega_i}) = \sum_m s(m)\, w(n-m)\, e^{-j\omega_i m}. \tag{3.16}$$

If we take the point of view that we are evaluating $S_n(e^{j\omega_i})$ for a fixed $n = n_0$, then we can interpret Eq. (3.16) as

$$S_{n_0}(e^{j\omega_i}) = \mathrm{FT}\left[s(m)\, w(n_0 - m)\right]\Big|_{\omega = \omega_i}, \tag{3.17}$$

where FT[·] denotes the Fourier transform. Thus $S_{n_0}(e^{j\omega_i})$ is the conventional Fourier transform of the windowed signal, $s(m)\, w(n_0 - m)$, evaluated at the frequency $\omega = \omega_i$. Figure 3.10 illustrates the signals $s(m)$ and $w(n - m)$ at times $n = n_0 = 50$, 100, and 200 to show which parts of $s(m)$ are used in the computation of the short-time Fourier transform. Since $w(m)$ is an FIR filter (i.e., of finite size), if we denote that size by $L$, then using the conventional Fourier transform interpretation of $S_n(e^{j\omega_i})$, we can state the following:

1. If $L$ is large relative to the signal periodicity (pitch), then $S_n(e^{j\omega_i})$ gives good frequency resolution. That is, we can resolve individual pitch harmonics but only roughly see the overall spectral envelope of the section of speech within the window.
2. If $L$ is small relative to the signal periodicity, then $S_n(e^{j\omega_i})$ gives poor frequency resolution (i.e., no pitch harmonics are resolved), but a good estimate of the gross spectral shape is obtained.

To illustrate these points, Figures 3.11-3.14 show examples of windowed signals, $s(m)\, w(n - m)$ (part a of each figure), and the resulting log magnitude short-time spectra, $20 \log_{10} |S_n(e^{j\omega})|$ (part b of each figure). Figure 3.11 shows results for an $L = 500$-point Hamming window applied to a section of voiced speech. The periodicity of the signal is clearly seen in the windowed time waveform, as well as in the short-time spectrum, in which the fundamental frequency and its harmonics show up as narrow peaks at equally spaced frequencies. Figure 3.12 shows a similar set of comparisons for an $L = 50$-point Hamming window. For such short windows, the time sequence $s(m)\, w(n - m)$ does not show the signal periodicity, nor does the signal spectrum. In fact, what we see in the short-time Fourier transform log magnitude is a few rather broad peaks in frequency corresponding roughly to the speech formants.

Figures 3.13 and 3.14 show the effects of using windows on a section of unvoiced speech (corresponding to the fricative /sh/) for an $L = 500$ sample window (Figure 3.13) and an $L = 50$ sample window (Figure 3.14). Since there is no periodicity in the signal, the resulting short-time spectral magnitude of Figure 3.13, for the $L = 500$ sample window, shows a ragged series of local peaks and valleys due to the random nature of the unvoiced speech. Using the shorter window smooths out the random fluctuations in the short-time spectral magnitude and again shows the broad spectral envelope very well.

Figure 3.11 Short-time Fourier transform using a long (500 points or 50 msec) Hamming window on a section of voiced speech.

Figure 3.12 Short-time Fourier transform using a short (50 points or 5 msec) Hamming window on a section of voiced speech.

Figure 3.13 Short-time Fourier transform using a long (500 points or 50 msec) Hamming window on a section of unvoiced speech.

Figure 3.14 Short-time Fourier transform using a short (50 points or 5 msec) Hamming window on a section of unvoiced speech.

Figure 3.15 Linear filter interpretation of the short-time Fourier transform.
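The trade-off between long and short windows can be reproduced on a synthetic signal. The sketch below is not the speech data of Figures 3.11-3.14; it assumes a periodic test signal with a 100-Hz pitch (harmonic amplitudes $1/k$) sampled at 10 kHz, and it introduces an ad hoc measure, `harmonic_contrast_db`, that compares the spectral power at the pitch harmonics with the power midway between them.

```python
import numpy as np

fs = 10_000                       # sampling rate in Hz (as in the figure examples)
f0 = 100.0                        # assumed pitch of the synthetic "voiced" signal
n = np.arange(1000)
s = sum(np.cos(2 * np.pi * k * f0 * n / fs) / k for k in range(1, 41))

def harmonic_contrast_db(L, nfft=2000):
    """Power at multiples of f0 relative to power midway between them, in dB."""
    seg = s[:L] * np.hamming(L)                  # windowed segment of length L
    power = np.abs(np.fft.fft(seg, nfft)) ** 2   # zero-padded short-time power spectrum
    step = int(round(f0 * nfft / fs))            # DFT bins between adjacent harmonics
    on = power[0::step].sum()                    # bins at 0, f0, 2*f0, ...
    off = power[step // 2::step].sum()           # bins at f0/2, 3*f0/2, ...
    return 10 * np.log10(on / off)

contrast_long = harmonic_contrast_db(500)        # 50-msec window
contrast_short = harmonic_contrast_db(50)        # 5-msec window
print(f"L = 500: {contrast_long:.1f} dB, L = 50: {contrast_short:.1f} dB")
```

Under these assumptions the 500-point window gives a large contrast (individual harmonics are resolved), while the 50-point window gives a contrast of about 0 dB (the harmonic structure is smeared into a smooth envelope).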
`
3.2.2.2 Linear Filtering Interpretation of the Short-Time Fourier Transform

The linear filtering interpretation of the short-time Fourier transform is derived by considering $S_n(e^{j\omega_i})$, of Eq. (3.16), for fixed values of $\omega_i$, in which case we have

$$S_n(e^{j\omega_i}) = \left[s(n)\, e^{-j\omega_i n}\right] * w(n). \tag{3.18}$$

That is, $S_n(e^{j\omega_i})$ is a convolution of the lowpass window, $w(n)$, with the speech signal, $s(n)$, modulated to center frequency $\omega_i$. This linear filtering interpretation of $S_n(e^{j\omega_i})$ is illustrated in Figure 3.15.

If we denote the conventional Fourier transforms of $s(n)$ and $w(n)$ by $S(e^{j\omega})$ and $W(e^{j\omega})$, then we see that the Fourier transform of $\tilde{s}(n) = s(n)\, e^{-j\omega_i n}$ of Figure 3.15 is just

$$\tilde{S}(e^{j\omega}) = S(e^{j(\omega + \omega_i)}), \tag{3.19}$$

and thus we get

$$\mathrm{FT}\left[S_n(e^{j\omega_i})\right] = \tilde{S}(e^{j\omega})\, W(e^{j\omega}) = S(e^{j(\omega + \omega_i)})\, W(e^{j\omega}). \tag{3.20}$$

Since $W(e^{j\omega})$ approximates 1 over a narrow band, and is 0 everywhere else, we see that, for fixed values of $\omega_i$, the short-time Fourier transform gives a signal representative of the signal spectrum in a band around $\omega_i$. Thus the short-time Fourier transform, $S_n(e^{j\omega_i})$, represents the signal spectral analysis at frequency $\omega_i$ by a filter whose bandwidth is that of $W(e^{j\omega})$.
`
`
3.2.2.3 Review Exercises

Exercise 3.1
A speech signal is sampled at a rate of 20,000 samples per second ($F_s = 20$ kHz). A 20-msec window is used for short-time spectral analysis, and the window is moved by 10 msec in consecutive analysis frames. Assume that a radix-2 FFT is used to compute DFTs.
1. How many speech samples are used in each segment?
2. What is the frame rate of the short-time spectral analysis?
3. What size DFT and FFT are required to guarantee that no time-aliasing will occur?
4. What is the resulting frequency resolution (spacing in Hz) between adjacent spectral samples?

Solution 3.1
1. Twenty msec of speech at the rate of 20,000 samples per second gives

$$20 \times 10^{-3}\ \text{sec} \times 20{,}000\ \text{samples/sec} = 400\ \text{samples}.$$

Each section of speech is 400 samples in duration.
2. Since the shift between consecutive speech frames is 10 msec (i.e., 200 samples at a 20,000 samples/sec rate), the frame rate is

$$\text{frame rate} = \frac{1}{\text{frame shift}} = \frac{1}{10 \times 10^{-3}\ \text{sec}} = 100/\text{sec}.$$

That is, 100 spectral analyses are performed per second of speech.
3. To avoid time aliasing in using the DFT to evaluate the short-time Fourier transform, we require the DFT size to be at least as large as the frame size of the analysis frame. Hence, from part 1, we require at least a 400-point DFT. Since we are using a radix-2 FFT, we require, in theory, a 512-point FFT (the smallest power of 2 greater than 400) to compute the DFT without time aliasing. (We would use the 400 speech samples as the first 400 points of the 512-point array; we pad 112 zero-valued samples to the end of the array to fill in and give a 512-point array.) Since the speech signal is real (as opposed to complex), we can use an FFT size of 256 by appropriate signal preprocessing and postprocessing with a complex FFT algorithm.
4. The frequency resolution of the analysis is defined as

$$\text{frequency resolution} = \frac{\text{sampling rate}}{\text{DFT size}} = \frac{20{,}000\ \text{Hz}}{512} \approx 39\ \text{Hz}.$$
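The numbers in Solution 3.1 can be reproduced with a short calculation. This is a sketch of the arithmetic only; the variable names are chosen here for illustration.

```python
fs = 20_000              # sampling rate in Hz
window_sec = 0.020       # 20-msec analysis window
shift_sec = 0.010        # 10-msec frame shift

frame_len = round(window_sec * fs)               # samples per analysis frame
frame_rate = 1 / shift_sec                       # frames per second
fft_size = 1 << (frame_len - 1).bit_length()     # smallest power of 2 >= frame_len
zero_pad = fft_size - frame_len                  # zeros appended to each frame
resolution_hz = fs / fft_size                    # spacing between spectral samples

print(frame_len, frame_rate, fft_size, zero_pad, resolution_hz)
```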
`
Exercise 3.2
If the sequences $s(n)$ and $w(n)$ have normal (long-time) Fourier transforms $S(e^{j\omega})$ and $W(e^{j\omega})$, then show that the short-time Fourier transform

$$S_n(e^{j\omega}) = \sum_{m=-\infty}^{\infty} s(m)\, w(n-m)\, e^{-j\omega m}$$

can be expressed in the form

$$S_n(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, e^{j\theta n}\, S(e^{j(\omega+\theta)})\, d\theta.$$

That is, $S_n(e^{j\omega})$ is a smoothed (by the window spectrum) spectral estimate of $S(e^{j\omega})$ at frequency $\omega$.

Solution 3.2
The long-time Fourier transforms of $s(n)$ and $w(n)$ can be expressed as

$$S(e^{j\omega}) = \sum_{n=-\infty}^{\infty} s(n)\, e^{-j\omega n}, \qquad W(e^{j\omega}) = \sum_{n=-\infty}^{\infty} w(n)\, e^{-j\omega n}.$$

The window sequence, $w(n)$, can be recovered from its long-time Fourier transform via the integration

$$w(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\omega})\, e^{j\omega n}\, d\omega.$$

Hence, the short-time Fourier transform

$$S_n(e^{j\omega}) = \sum_{m=-\infty}^{\infty} s(m)\, w(n-m)\, e^{-j\omega m}$$

can be put in the form (by substituting for $w(n - m)$):

$$
\begin{aligned}
S_n(e^{j\omega}) &= \sum_{m=-\infty}^{\infty} s(m) \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, e^{j\theta(n-m)}\, d\theta \right] e^{-j\omega m} \\
&= \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, e^{j\theta n} \left[ \sum_{m=-\infty}^{\infty} s(m)\, e^{-j(\omega+\theta)m} \right] d\theta \\
&= \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, e^{j\theta n}\, S(e^{j(\omega+\theta)})\, d\theta.
\end{aligned}
$$
`
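Exercise 3.3
If we define the short-time spectrum of a signal in terms of its short-time Fourier transform as

$$X_n(e^{j\omega}) = |S_n(e^{j\omega})|^2$$

and we define the short-time autocorrelation of the signal as

$$R_n(k) = \sum_{m=-\infty}^{\infty} w(n-m)\, s(m)\, w(n-k-m)\, s(m+k),$$

then show that for

$$S_n(e^{j\omega}) = \sum_{m=-\infty}^{\infty} s(m)\, w(n-m)\, e^{-j\omega m},$$

$R_n(k)$ and $X_n(e^{j\omega})$ are related as a normal (long-time) Fourier transform pair. In other words, show that $X_n(e^{j\omega})$ is the (long-time) Fourier transform of $R_n(k)$, and vice versa.

Solution 3.3
Given the definition of $S_n(e^{j\omega})$ we have

$$
\begin{aligned}
X_n(e^{j\omega}) &= |S_n(e^{j\omega})|^2 = \left[S_n(e^{j\omega})\right]\left[S_n(e^{j\omega})\right]^* \\
&= \left[ \sum_{m=-\infty}^{\infty} s(m)\, w(n-m)\, e^{-j\omega m} \right] \left[ \sum_{r=-\infty}^{\infty} s(r)\, w(n-r)\, e^{j\omega r} \right] \\
&= \sum_{r=-\infty}^{\infty} \sum_{m=-\infty}^{\infty} w(n-m)\, s(m)\, w(n-r)\, s(r)\, e^{-j\omega(m-r)}.
\end{aligned}
$$

Let $r = k + m$; then

$$
\begin{aligned}
X_n(e^{j\omega}) &= \sum_{k=-\infty}^{\infty} \left[ \sum_{m=-\infty}^{\infty} w(n-m)\, s(m)\, w(n-k-m)\, s(m+k) \right] e^{j\omega k} \\
&= \sum_{k=-\infty}^{\infty} R_n(k)\, e^{j\omega k} = \sum_{k=-\infty}^{\infty} R_n(k)\, e^{-j\omega k}
\end{aligned}
$$

(since $R_n(k) = R_n(-k)$); therefore

$$X_n(e^{j\omega}) \longleftrightarrow R_n(k),$$

that is, $X_n(e^{j\omega})$ and $R_n(k)$ form a (long-time) Fourier transform pair.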
`
3.2.2.4 FFT Implementation of Uniform Filter Bank Based on the Short-Time Fourier Transform

We now return to the question of how to efficiently implement the computation of the set of filter-bank outputs (Eq. (3.15)) for the uniform filter bank. If we assume, reasonably, that we are interested in a uniform frequency spacing, that is, if

$$f_i = i\,(F_s/N), \qquad i = 0, 1, \ldots, N-1, \tag{3.21}$$

then Eq. (3.15a) can be written as

$$x_i(n) = e^{j\left(\frac{2\pi}{N}\right) i n} \sum_m s(m)\, w(n-m)\, e^{-j\left(\frac{2\pi}{N}\right) i m}. \tag{3.22}$$

Now consider breaking up the summation over $m$ into a double summation over $r$ and $k$, in which

$$m = Nr + k, \qquad 0 \le k \le N-1, \qquad -\infty < r < \infty. \tag{3.23}$$

In other words, we break up the computation over $m$ into pieces of size $N$. If we let

$$s_n(m) = s(m)\, w(n-m), \tag{3.24}$$

then Eq. (3.22) can be written as

$$x_i(n) = e^{j\left(\frac{2\pi}{N}\right) i n} \sum_r \sum_{k=0}^{N-1} s_n(Nr + k)\, e^{-j\left(\frac{2\pi}{N}\right) i (Nr + k)}. \tag{3.25}$$

Since $e^{-j 2\pi i r} = 1$ for all $i$, $r$, then

$$x_i(n) = e^{j\left(\frac{2\pi}{N}\right) i n} \sum_{k=0}^{N-1} \left[ \sum_r s_n(Nr + k) \right] e^{-j\left(\frac{2\pi}{N}\right) i k}. \tag{3.26}$$

If we define

$$u_n(k) = \sum_r s_n(Nr + k), \qquad 0 \le k \le N-1, \tag{3.27}$$

we wind up with

$$x_i(n) = e^{j\left(\frac{2\pi}{N}\right) i n} \left[ \sum_{k=0}^{N-1} u_n(k)\, e^{-j\left(\frac{2\pi}{N}\right) i k} \right], \tag{3.28}$$

which is the desired result; that is, $x_i(n)$ is a modulated $N$-point DFT of the sequence $u_n(k)$.

Thus the basic steps in the computation of a uniform filter bank via FFT methods are as follows:

1. Form the windowed signal $s_n(m) = s(m)\, w(n - m)$, $m = n - L + 1, \ldots, n$, where $w(n)$ is a causal, finite window of duration $L$ samples. Figure 3.16a illustrates this step.
2. Form $u_n(k) = \sum_r s_n(Nr + k)$, $0 \le k \le N - 1$. That is, break the signal $s_n(m)$ into pieces of size $N$ samples and add up the pieces (alias them back onto one another) to give a signal of size $N$ samples. Figures 3.16b and c illustrate this step for the case in which $L \gg N$.
3. Take the $N$-point DFT of $u_n(k)$.
4. Modulate the DFT by the sequence $e^{j\left(\frac{2\pi}{N}\right) i n}$.

The modulation step 4 can be avoided by circularly shifting the sequence, $u_n(k)$, by $n \oplus N$ samples (where $\oplus$ is the modulo operation), to give $u_n((k - n))_N$, $0 \le k \le N - 1$, prior to the DFT computation.

The computation to implement the uniform filter bank via Eq. (3.28) is essentially

$$C_{\mathrm{FBFFT}} \approx 2N \log_2 N \quad (\cdot, +). \tag{3.29}$$

Consider now the ratio, $R$, between the computation for the direct form implementation of a uniform filter bank (Eq. (3.13)), and the FFT implementation (Eq. (3.29)), such that

$$R = \frac{C_{\mathrm{DFFIR}}}{C_{\mathrm{FBFFT}}} = \frac{LQ}{2N \log_2 N}. \tag{3.30}$$

If we assume $N = 32$ (i.e., a 16-channel filter bank), with $L = 128$ (i.e., a 12.8-msec impulse response filter at a 10-kHz sampling rate), and $Q = 16$ channels, we get

$$R = \frac{128 \cdot 16}{2 \cdot 32 \cdot 5} = 6.4.$$

The FFT implementation is about 6.4 times more efficient than the direct form structure.
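The four steps above can be sketched in code and compared against the direct-form computation of Eq. (3.12b) with $h_i(n) = w(n)\, e^{j(2\pi/N) i n}$. This is a minimal sketch under assumed test conditions (a random signal, a Hamming window, and the $N = 32$, $L = 128$ values of the example); the function names are hypothetical, and the modulation of step 4 is applied explicitly rather than through a circular shift.

```python
import numpy as np

def fft_filter_bank(s, w, n, N):
    """All N channel outputs x_i(n) via Eq. (3.28); requires n >= len(w) - 1."""
    L = len(w)
    m = np.arange(n - L + 1, n + 1)
    s_n = s[m] * w[n - m]                       # step 1: s_n(m) = s(m) w(n - m)
    u = np.zeros(N)
    np.add.at(u, m % N, s_n)                    # step 2: u_n(k) = sum_r s_n(Nr + k)
    U = np.fft.fft(u)                           # step 3: N-point DFT of u_n(k)
    i = np.arange(N)
    return np.exp(2j * np.pi * i * n / N) * U   # step 4: modulate by e^{j(2 pi / N) i n}

def direct_filter_bank(s, w, n, N):
    """Reference: direct convolution with h_i(m) = w(m) e^{j(2 pi / N) i m}."""
    m = np.arange(len(w))
    out = np.empty(N, dtype=complex)
    for i in range(N):
        h_i = w * np.exp(2j * np.pi * i * m / N)
        out[i] = np.sum(h_i * s[n - m])         # sum_m h_i(m) s(n - m)
    return out

rng = np.random.default_rng(0)
s = rng.standard_normal(512)                    # test signal (assumed)
w = np.hamming(128)                             # L = 128 window (assumed shape)
x_fft = fft_filter_bank(s, w, n=300, N=32)
x_dir = direct_filter_bank(s, w, n=300, N=32)
print(np.max(np.abs(x_fft - x_dir)))            # agreement to rounding error

R = (128 * 16) / (2 * 32 * np.log2(32))         # computation ratio of Eq. (3.30)
print(R)
```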
`
Figure 3.16 FFT implementation of a uniform filter bank.

Figure 3.17 Direct form implementation of an arbitrary nonuniform filter bank.

3.2.2.5 Nonuniform FIR Filter Bank Implementations

The most general form of a nonuniform FIR filter bank is shown in Figure 3.17, where the $k$th bandpass filter impulse response, $h_k(n)$, represents a filter with center frequency $\omega_k$ and bandwidth $\Delta\omega_k$. The set of $Q$ bandpass filters is intended to cover the frequency range of interest for the intended speech-processing application.

Figure 3.18 Two arbitrary nonuniform filter-bank ideal filter specifications consisting of either 3 bands (part a) or 7 bands (part b).

In its most general form, each bandpass filter is implemented via a direct convolution; that is, no efficient FFT structure can be used. In the case where each bandpass filter is designed via the windowing design method (Ref. [1]), using the same lowpass window, we can show that the composite frequency response of the $Q$-channel filter bank is independent of the number and distribution of the individual filters. Thus a filter bank with the three filters shown in Figure 3.18a has exactly the same composite frequency response as the filter bank with the seven filters shown in Figure 3.18b.

To show this we denote the impulse response of the $k$th bandpass filter as

$$h_k(n) = w(n)\, \tilde{h}_k(n), \tag{3.31}$$

where $w(n)$ is the FIR window, and $\tilde{h}_k(n)$ is the impulse response of the ideal bandpass filter being designed. The frequency response of the $k$th bandpass filter, $H_k(e^{j\omega})$, can be written as

$$H_k(e^{j\omega}) = W(e^{j\omega}) \circledast \tilde{H}_k(e^{j\omega}). \tag{3.32}$$

Thus the frequency response of the composite filter bank, $H(e^{j\omega})$, can be written as

$$H(e^{j\omega}) = \sum_{k=1}^{Q} H_k(e^{j\omega}) = \sum_{k=1}^{Q} W(e^{j\omega}) \circledast \tilde{H}_k(e^{j\omega}). \tag{3.33}$$

By interchanging the summation and the convolution we get

$$H(e^{j\omega}) = W(e^{j\omega}) \circledast \sum_{k=1}^{Q} \tilde{H}_k(e^{j\omega}). \tag{3.34}$$

By realizing that the summation of Eq. (3.34) is the summation of ideal frequency responses, we see that it is independent of the number and distribution of the individual filters. Thus we can write the summation as

$$
\hat{H}(e^{j\omega}) = \sum_{k=1}^{Q} \tilde{H}_k(e^{j\omega}) =
\begin{cases}
1, & \omega_{\min} \le \omega \le \omega_{\max} \\
0, & \text{otherwise},
\end{cases}
\tag{3.35}
$$

where $\omega_{\min}$ is the lowest frequency in the filter bank, and $\omega_{\max}$ is the highest frequency. Then Eq. (3.34) can be expressed as

$$H(e^{j\omega}) = W(e^{j\omega}) \circledast \hat{H}(e^{j\omega}), \tag{3.36}$$

independent of the number of ideal filters, $Q$, and their distribution in frequency, which is the desired result.
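The independence result can be illustrated with window-designed filters. The sketch below assumes contiguous, non-overlapping ideal bands between $0.1\pi$ and $0.9\pi$ rad/sample, a 65-point Hamming window, and real (two-sided) ideal bandpass responses; the band edges and the function names `ideal_bandpass` and `composite` are assumptions made for illustration.

```python
import numpy as np

def ideal_bandpass(lo, hi, L):
    """Ideal bandpass impulse response for [lo, hi] rad/sample, delayed to be causal."""
    t = np.arange(L) - (L - 1) / 2
    # np.sinc(x) = sin(pi x) / (pi x), so (hi/pi) * sinc(hi*t/pi) = sin(hi*t) / (pi*t)
    return (hi / np.pi) * np.sinc(hi * t / np.pi) - (lo / np.pi) * np.sinc(lo * t / np.pi)

def composite(edges, window):
    """Sum of window-designed bandpass filters, h_k(n) = w(n) * ideal h_k(n), Eq. (3.31)."""
    L = len(window)
    return sum(window * ideal_bandpass(lo, hi, L) for lo, hi in zip(edges[:-1], edges[1:]))

window = np.hamming(65)
edges3 = np.pi * np.array([0.1, 0.3, 0.6, 0.9])                         # 3 bands
edges7 = np.pi * np.array([0.1, 0.15, 0.2, 0.3, 0.45, 0.6, 0.75, 0.9])  # 7 bands

h3 = composite(edges3, window)
h7 = composite(edges7, window)
h1 = window * ideal_bandpass(0.1 * np.pi, 0.9 * np.pi, len(window))     # single wide band

print(np.max(np.abs(h3 - h7)), np.max(np.abs(h3 - h1)))
```

Because the composite impulse responses agree, so do the composite frequency responses, regardless of how the range from $\omega_{\min}$ to $\omega_{\max}$ is partitioned.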
`
3.2.2.6 FFT-Based Nonuniform Filter Banks

One possible way to exploit the FFT structure for implementing uniform filter banks discussed earlier is to design a large uniform filter bank (e.g., $N = 128$ or 256 channels) and then create the nonuniformity by combining two or more uniform channels. This technique of combining channels is readily shown to be equivalent to applying a modified analysis window to the sequence prior to the FFT. To see this, consider taking an $N$-point DFT of the sequence $x(n)$ (derived from the speech signal, $s(n)$, by windowing by $w(n)$). Thus we get

$$X_k = \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi}{N} n k} \tag{3.37}$$

as the set of DFT values. If we consider adding DFT outputs $X_k$ and $X_{k+1}$, we get

$$X_k + X_{k+1} = \sum_{n=0}^{N-1} x(n) \left( e^{-j\frac{2\pi}{N} n k} + e^{-j\frac{2\pi}{N} n (k+1)} \right), \tag{3.38}$$

which can be written as

$$X_k' = X_k + X_{k+1} = \sum_{n=0}^{N-1} \left[ x(n)\, 2 e^{-j\frac{\pi n}{N}} \cos\left(\frac{\pi n}{N}\right) \right] e^{-j\frac{2\pi}{N} n k}, \tag{3.39}$$

i.e., the equivalent $k$th channel value, $X_k'$, could have been obtained by weighting the sequence, $x(n)$, in time, by the complex sequence $2 e^{-j\pi n/N} \cos(\pi n/N)$. If more than two channels are combined, then a different equivalent weighting sequence results. Thus FFT channel combining is essentially a "quick and dirty" method of designing broader bandpass filters and is a simple and effective way of realizing certain types of nonuniform filter bank analysis structures.
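The equivalence between adding adjacent DFT channels and weighting the sequence in time, Eq. (3.39), is easy to confirm numerically. The following sketch assumes a random 16-point test sequence; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 16
x = rng.standard_normal(N)            # stands in for the windowed speech frame x(n)
n = np.arange(N)

X = np.fft.fft(x)                     # X_k of Eq. (3.37)
weight = 2 * np.exp(-1j * np.pi * n / N) * np.cos(np.pi * n / N)
X_weighted = np.fft.fft(x * weight)   # DFT of the time-weighted sequence, Eq. (3.39)

k = 3
combined = X[k] + X[k + 1]            # channel combining, Eq. (3.38)
print(abs(combined - X_weighted[k]))  # difference at rounding-error level
```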
`
3.2.2.7 Tree Structure Realizations of Nonuniform Filter Banks

A third method used to implement certain types of nonuniform filter banks is the tree structure, in which the speech signal is filtered in stages and the sampling rate is successively reduced at each stage for efficiency of implementation. An example of such a realization is given in Figure 3.19a for the 4-band, octave-spaced filter bank shown (ideally) in Figure 3.19b. The original speech signal, $s(n)$, is filtered initially into two bands, a low band and a high band, using quadrature mirror filters (QMFs), i.e., filters whose frequency responses are complementary. The high band, which covers half the spectrum, is reduced in sampling rate by a factor of 2, and represents the highest octave band ($\pi/2 \le \omega \le \pi$) of
`
`
`Figure 3.19 Tree structure implementati