`
`Vector Quantization
`
`129
`
`M-vector codebook as Ym, 1 :'.S m ::; M, and we denote the spectral vector to be classified
`(and quantized) as v, then the index, m*, of the best codebook entry is
`
`m* = arg min d(V,Ym),
`1$m$M
`
`(3.95)
`
`For codebooks with large values of M (e.g., M ~ 1024), the computation ofEq. (3.95) could
`be excessive, depending on the exact details of the distance measure; hence, alternative,
`suboptimal, procedures for designing VQ codebooks have been investigated. We will
`briefly discuss such methods in a later section of this chapter.
`
`3.4.6 Comparison of Vector and Scalar Quantizers
`
`To illustrate the power of the concept of quantizing an entire vector (rather than quantizing
`individual components of the vector), Figures 3.45 and 3.46 show comparisons of the results
`of using vector and scalar quantizers on typical speech spectral frames. In Figure 3.45 we
`see both model (speech) spectra and the resulting quantization error spectrum for 10-bit and
`24-bit scalar quantizers and for a IO-bit vector quantizer. It is clear that the quantization
`error of the I 0-bit vector quantizer is comparable to that of the 24-bit scalar quantizer. This
`implies that the vector quantizer provides a 14-bit reduction in storage (per frame) over a
`scalar quantizer, i.e., more than a 50% reduction in storage for the same distortion.
`Figure 3.46 shows temporal plots of distortion as well as distortion error histograms
`for the three quantizers of Figure 3.45. It can be seen that even though the average distortion
`of the IO-bit VQ is comparable to that of the 24-bit scalar quantizer, the peak distortion of
`the I 0-bit VQ is much smaller than the peak distortion of the 24-bit scalar quantizer. This
`represents another distinct advantage of VQ over scalar quantization.
`
`3.4.7 Extensions of Vector Quantization
`
`As mentioned earlier, several straightforward extensions of the ideas of VQ have been
`proposed and studied, including the following:
`
`1. Use of multiple codebooks in which codebooks are created separately (and indepen(cid:173)
`dently) for each of several spectral (or temporal) representations of speech. Thus
`we might consider using a separate codebook for cepstral vectors and a separate
`codebook for the time derivatives of the cepstral vectors. This method of multiple
`codebooks has been used extensively in large vocabulary speech-recognition systems.
`2. Binary search trees in which a series of suboptimal VQs is used to limit the search
`space so as to reduce the computation of the overall VQ from M distances to log (M)
`distances. The training procedure first designs an optimal M = 2 VQ and then
`assigns all training vectors to one of the VQ cells. Next the procedure designs a
`pair of M = 2 VQs, one for each subset of the preceding stage. This process is
`iterated until the desired size is obtained in log M steps. The suboptimality of the
`procedure is related to the fact that training vectors initially split along one branch of
`the VQ cannot join the other branch at a later stage of processing; hence, the overall
`
`IPR2023-00037
`Apple EX1013 Page 160
`
`
`
`130
`
`Chap. 3
`
`Signal Processing and Analysis Methods
`
`MODEL SPECTRA ERROR SPECTRA
`
`-
`UNQUANTIZED
`+SOB
`······QUANTIZED ~
`10-BIT SCALAR
`
`MODEL SPECTRA ERROR SPECTRA
`
`UNQUANTIZEO
`-
`····•QUANTIZED
`
`[
`
`+SOB
`-5D8
`
`24-BIT SCALAR
`
`= ~ , .., ===--" •
`~ 14-tt\..--·...:::::>~"~c::,-,-.
`A'--"'....._,,.'-_.
`Pc===
`
`o,
`
`t b, •
`MODEL SPECTRA ERROR SPECTRA
`
`UNQUANTIZED
`-
`····••QUANTIZED
`
`[
`
`+508
`-SOB
`
`10-BIT VECTOR
`
`Figure 3AS Model and distonion error spectra for scalar and vector quantizers (after Juang
`et al. [121).
`
`distortion is not minimal at each branch of the tree.
`3. K-tuple (fixed-length block) quantizers in which K-frames of speech are coded at a
`time, rather than single frames, as is conventionally the case. The idea is to exploit
`correlations in time for vowels and vowel-like sounds. The disadvantage occurs for
`transient sounds and
`sounds where the correlation along the K-tuple is low-i.e.,
`many consonants.
`4. Matrix quantization in which a codebook of sounds or words of variable sequence
`
`IPR2023-00037
`Apple EX1013 Page 161
`
`
`
`Sec. 3.4
`
`Vector Quantization
`
`131
`
`30%
`
`20%
`
`10%
`
`LL W
`0 l)
`>-z
`l) w
`z a:
`w a:
`::> ::>
`0 l) w l)
`ff: 0
`
`24 BITS SCALAR
`MAX
`1 433
`AVE
`0.148
`u
`0.151
`
`O ,ee+et h MtJ+t ht 11,,L,o liY, ,;. ,+L.n<L Md
`10 BITS VECTOR QUAN
`TIME-+
`
`0%11W.W.LI..L.:i:n::~--_.,._-~
`0
`0.5
`LIKELIHOOD RATIO
`DISTORTION 24 BITS QUAN
`
`1.0
`
`10 BITS SCALAR
`MAX
`3.869
`AVE
`0.463
`0.514
`
`0
`
`30%
`
`20%
`
`LL W
`0 l)
`>- z
`l) UJ
`z a: w a:
`:::> :::>
`0
`l) w l)
`10%
`ff: 0
`
`10 BITS VECTOR
`MAX
`0.478
`AVE
`0.151
`0.073
`0
`
`30%
`
`20%
`
`10%
`
`LL ow
`l)
`>-z
`l) w
`z a:
`w a:
`::> ::>
`0() w l)
`ff: 0
`
`0%
`0
`
`0.5
`LIKELIHOOD RATIO
`DISTORTION 10 BITS QUAN
`
`1.0
`
`0%
`0
`
`0.5
`LIKELIHOOD RATIO
`DISTORTION 10 BITS QUAN
`
`1.0
`
`Figure 3.46 Plots and histograms of temporal distortion for scalar and vector quantizers (after
`Juang et al. [12)).
`
`length is created. The concept here is to handle time variability via some types of
`dynamic programming procedure and thereby create a codebook of sequences of
`vectors that represent typical sounds or words. Such techniques are most applicable
`to word-recognition systems.
`5. Trellis codes in which time sequential dependencies among codebook entries are
`explicitly determined as part of the training phase. The idea here is that when
`input vector Vn is quantized using codeword Ye, then input vector Vn+I is quantized
`using one of a limited subset of codebook entries that are related to ye via a set of
`learned sequential constraints, thereby reducing computation of encoding the input,
`and increasing the ability to interpret the codebook output in terms of basic speech
`units.
`6. Hidden Markov models in which both time and spectral constraints are used to
`quantize an entire speech utterance in a well-defined and efficient manner. We defer
`a discussion of hidden Markov models to Chapter 6.
`
`3.4.8 Summary of the VQ Method
`
`In later chapters of this book we will see several examples of how VQ concepts can be
`exploited in speech-recognition systems. Here we have shown that the basic idea of VQ
`is to reduce the information rate of the speech signal to a low rate through the use of
`a codebook with a relatively small number of code words. The goal is to be able to
`
`IPR2023-00037
`Apple EX1013 Page 162
`
`
`
`132
`
`Chap. 3
`
`Signal Processing and Analysis Methods
`
`OUTER
`EAR
`
`Figure 3.47 Physiological model of the human ear.
`
`represent the spectral information of the signal in an efficient manner and in a way that
`direct connections to the acoustic-phonetic framework discussed in Chapter 2 can be made.
`Various techniques for achieving this efficiency of representation were discussed, and their
`properties were illustrated on representative examples of speech.
`
`3.5 AUDITORY-BASED SPECTRAL ANALYSIS MODELS
`
`The motivation for investigating spectral analysis methods that are physiologically based
`is to gain an understanding of how the human auditory system processes speech, so as to
`be able to design and implement robust, efficient methods of analyzing and representing
`speech. It is generally assumed that the better we understand the signal processing in the
`human auditory system, the closer we will come to being able to design a system that can
`truly understand meaning as well as content of speech.
`With these considerations in mind, we first examine a physiological model of the
`human ear. Such a model is given in Figure 3.47 and it shows that the ear has three distinct
`regions called the outer ear, the middle ear, and the inner ear. The outer ear consists
`of the pinna (the ear surface surrounding the canal in which sound is funneled), and the
`external canal. Sound waves reach the ear and are guided through the outer ear to the
`
`IPR2023-00037
`Apple EX1013 Page 163
`
`
`
`sec. 3.5
`
`Auditory-Based Spectral Analysis Models
`
`133
`
`MIOOL
`T
`
`VESTIBULAR SYSTEM
`
`OVAL WINDOW
`
`COCHLEAR
`FILTERS
`
`NNER
`AIR
`CELLS
`
`~
`
`EUSTACHIAN TUBE
`
`--
`
`..,
`AUDITORY
`NERVE
`
`'I
`
`Figure 3.48 Expanded view of the middle and inner ear mechanics.
`
`middle ear, which consists of the tympanic membrane or eardrum upon which the sound
`wave impinges and causes to move and a mechanical transducer (the malleus or hammer,
`the incus or anvil, and the stapes or stirrup), which converts the acoustical sound wave to
`mechanical vibrations along the inner ear. The inner ear consists of the cochlea, which is a
`fluid-filled chamber partitioned by the basilar membrane, and the cochlea or auditory nerve.
`The mechanical vibrations
`impinging on the oval window at the entrance to the cochlea
`create standing waves ( of the fluid inside the cochlea) that cause the basilar membrane to
`vibrate at frequencies commensurate with the input acoustic wave frequencies (e.g., the
`formants of voiced speech) and at a place along the basilar membrane that is associated
`with these frequencies.
`(An expanded view of the middle and inner ear mechanics is given
`in Figure 3.48. The 2 ½ turn, snail-like shape of the cochlea is shown as a straight tube in
`this figure for ease of presentation.)
`The basilar membrane
`is characterized by a set of frequency responses at different
`points along the membrane. Hence, in its simplest form, the cochlea can be modeled as a
`mechanical realization of a bank of filters (appropriately called cochlea filters). Distributed
`along the basilar member (in a dense but discrete manner) is a set of sensors called inner
`hair cells (IHC), which act as mechanical motion to neural activity converters. Mechanical
`motion at some point along the basilar membrane is sensed by the inner hair cells and
`causes firing activity at the nerve fibers that innervate the bottom of each IHC. Each IHC
`is connected to about IO nerve fibers, each of different diameter and of different synaptic
`connection. It has been shown experimentally that thin fibers fire (emit neural impulses)
`only at high motion levels, whereas thick fibers fire at much lower motion levels. A total
`of about 30,000 nerve fibers link the IHCs to the auditory nerve.
`
`IPR2023-00037
`Apple EX1013 Page 164
`
`
`
`134
`
`Chap. 3
`
`Signal Processing and Analysis Methods
`
`Beyond the auditory nerve, our knowledge of how the information signals (the neural
`activity along the auditory nerve) are processed and eventually converted to intelligence in
`the brain is almost primitive. Hence when we attempt to build auditory models for signal
`processing, we are primarily modeling the middle ear, cochlea, and hair cell systems. The
`assumption is that the signal produced by such a model exhibits some of the robustness
`(immunity to noise, reverberation) and efficiency of the human auditory systems. Thus
`in the remainder of this section we present one such model, called the Ensemble Interval
`Histogram (EIH) model and show some of the properties of speech signals processed by
`such a model.
`
`3.5.1 The EIH Model
`
`On the basis of the discussion in the preceding section, a model of the cochJea and the hair
`cell transduction consists of a filter bank that models the frequency selectivity at various
`points along a simulated basilar membrane, and a nonlinear processor for converting the
`filter bank output to neural firing patterns along a simulated auditory nerve. Such a model
`is shown in Figure 3.49 and is called the EIH model [ 13].
`In the EIH model, the mechanical motion of the basilar membrane is sampled using
`165 IHC channels, equally spaced, on a log-frequency scale, between 150 and 7000 Hz.
`The corresponding cochlear filters are based on actual neural tuning curves for cats. The
`amplitude responses of 28 of these filters (i.e., about 1 in 8 from the model) are shown in
`Figure 3.50. The phase characteristics of these filters is minimum phase, and the relative
`gain, measured at the center frequency of the filter, reflects the corresponding value of the
`cat's middle ear transfer function.
`The next stage of processing in the EIH model of Figure 3.49 is an array oflevel cross(cid:173)
`ing detectors that models the motion-to-neural activity transduction of the hair cell mech(cid:173)
`anisms. The detection levels of each detector are pseudo-randomly distributed (based on
`measured distributions of level firings), thereby simulating the variability of fiber diameters
`and their synaptic connections.
`The output of the level-crossing detectors represents the discharge activity of the
`auditory nerve fibers. Figure 3.51 shows simulated auditory nerve activity, for the first
`60 msec of the vowel /o/ in the word "job," as a function of both time and the "characteristic
`frequency" of the IHC channels. (Note the logarithmic scale of the characteristic frequency
`which represents the place-to-frequency mapping on the basilar membrane.) In Figure 3.51,
`a level-crossing occurrence is marked as a single dot, and the output activity of each level(cid:173)
`crossing detector is plotted as a separate trace. Each IHC channel contributes seven
`parallel traces (corresponding to the seven level-crossing detectors for each channel), with
`the lowest trace representing the lowest-threshold level-crossing detector. If the magnitude
`of the filter's output is low, only one level will be crossed, as is seen for the very top
`channels in Figure 3.51. However, for large signal magnitudes, several levels will be
`activated, creating a "darker" area of activity in the figure.
`The level-crossing patterns represent the auditory nerve activity, which, in tum, is
`the input to a second, more central stage of neural processing, which gives the overall
`ensemble interval histogram (EIH). Conceptually,
`the EIH is a measure of the spatial
`
`IPR2023-00037
`Apple EX1013 Page 165
`
`
`
`Sec. 3.5
`
`Auditory-Based Spectral Analysis Models
`
`135
`
`COCHLEAR
`FILTER
`1
`
`COCHLEAR
`FILTER
`I-
`
`s ( t)
`
`,---
`----
`INTERVAL 7
`LEVEL
`I CROSSINGS : HISTOGRAMS I
`I
`I
`I
`
`L1
`.....____,
`
`I
`
`I H
`• 1
`
`I
`L AUDITORY NERVE _j
`
`LEVEL
`
`-
`
`-
`
`•
`•
`•
`•
`•
`
`COCHLEAR
`FILTER
`165
`
`Figure 3.49 Block diagram of the EIH model (after Ghitza [ 13]).
`
`extent of coherent neural activity across the simulated auditory nerve. Mathematically,
`it is the short-term probability density function of the reciprocal of the intervals between
`successive firings, measured over the entire simulated auditory nerve in a characteristic
`frequency-dependent time-frequency zone.
`As a consequence of the multilevel crossing detectors, the EIH representation pre(cid:173)
`serves information about the signal's overall energy. To illustrate this point, consider the
`case in which the input signal is a pure sinusoid, i.e. s(t) = A sin(21rfot), and the character(cid:173)
`istic frequency of a selected channel is Jo, as shown in Figure 3.52a. For a given intensity
`A, the cochlear filter output will activate only some low level-crossing detectors. For a
`given detector, the time interval between two successive positive-going level crossings is
`I /Jo. Since the histogram is scaled in units of frequency, this interval contributes a count
`to thefo bin. For the input signal in Figure 3.52a, all of the intervals are the same, resulting
`in a histogram in which the magnitude of each bin, save one (Jo), is zero. As the signal
`
`IPR2023-00037
`Apple EX1013 Page 166
`
`
`
`+20
`
`0
`
`CD
`
`"U -w
`Cf) z
`-20
`0
`~
`Cl') w
`
`Q::
`
`I.LI
`0
`-40
`::::>
`....
`
`...J
`~
`~
`<X
`
`-60
`
`-80l_..o....,~::.........=L,/,,/j=:..........._.__.__~ ..........
`
`100
`1000
`FREOUENCY,Hz
`
`_._L..l...J..__~.u.L.Jl,L....L.J....L.L......L.
`
`.......... _._1.....iL...._~~LJ
`10000
`
`Figure 3.50 Frequency response cuives of a cat's basilar membrane (after Ghitza [ 13 ]).
`
`4800,--------.--------------------.
`..
`.
`
`...
`
`···•
`
`.
`
`•·•
`
`............ .
`
`N
`:I:
`
`> (.) z
`
`LU
`:::,
`0
`LU
`a: u..
`~ en a:
`LU ....
`< a:
`<
`:I:
`(.)
`
`(.)
`
`(.)
`
`Figure 3.51 Magnitude of EIH for vowel / / h
`.
`0 s owing the time-frequency resolution (after
`Ghitza [ 13)).
`
`IPR2023-00037
`Apple EX1013 Page 167
`
`
`
`Sec. 3.5
`
`Auditory-Based Spectral Analysis Models
`
`137
`
`(a)
`
`LEVEL CROSSINGS
`
`HISTOGRAMS
`
`COCHLEAR
`FILTER
`I
`
`sit l=A sin (27Tf 0 t l
`
`(b)
`
`ID .,,
`
`s (tl •A sin (2,,-f 0 t)
`
`tL f0
`
`f
`
`16
`
`=U_
`
`0
`
`fo
`
`f
`
`Figure 3.52 Operation of the EIH model for a pure sinusoid (after Ghitza [ 13)).
`
`IPR2023-00037
`Apple EX1013 Page 168
`
`
`
`138
`
`0
`-10
`
`-20
`
`-~
`-40
`ID
`~ -50
`2
`0
`::,
`a: -10
`....
`0 "' -20
`"' -30
`0 ::, ~o
`~
`..J
`-50
`CL
`~ 0
`C
`
`CL
`Cl)
`
`-10
`
`-30
`-40
`-50
`
`Chap. 3
`
`Signal Processing and Analysis Methods
`
`FOURIER
`
`[ooJ
`<400,------------,
`
`EIH
`
`300
`
`200
`
`100
`0 k:._~:..,__:....__~__:_-=:!.~::::t
`400~-----------,
`t 300
`i'
`E 200
`x w 100
`
`400~-----------,
`
`300
`
`100
`
`CLEAN
`
`SNR
`Oda
`
`LPC
`fit
`
`0
`
`2400
`1600
`800
`FREOUENCY,HZ
`
`3200
`
`800
`FREQUENCY,HZ
`
`Figure 3.53
`Ghitza [13)).
`
`Comparison of Fourier and EIH log spectra for clean and noisy speech signals (after
`
`amplitude A increases, more levels are activated. As a result, this cochlear filter contributes
`additional counts to the Jo bin of the EIH. Since the crossing levels are equally distributed
`on a log-amplitude scale, the magnitude of any EIH bin is related, in some fashion, to
`decibel units. However, this relation is not a straightforward one because there are several
`sources contributing counts to the Jo bin in a nonlinear manner. Figure 3.52b shows an
`input signal s(t) = A sin(21rJot) driving five adjacent cochlear filters with an amplitude
`response IH;(f)I and a phase response </>;(f), i = 1, 2, ... , 5. Due to the shape of the filters,
`more than one cochlear channel will contribute to the Jo bin. In fact, all the cochlear filters
`that produce s;(t) = A IH;(fo)I sin(21rJot + </>;<Jo)) will contribute to the Jo bin of the EIH,
`provided that A IH;(fo)I exceeds any of the level-crossing
`thresholds.
`In Figure 3.52b only
`cochlear filters 2, 3, and 4 are contributing nonzero histograms to the EIH. The number of
`counts is different for each filter, depending on the magnitude of A IH;(fo)I.
`One goal of auditory-based signal processing
`is to make the signal more robust to
`noise and reverberation than alternative spectral analysis procedures such as the filter-bank
`method or the LPC method. Figure 3.53 illustrates how well the EIH model achieves this
`goal. Shown in the figure are the log magnitude spectra of a clean (no noise) and a noisy
`(signal-to-noise ratio of O dB) speech signal processed by a standard Fourier filter bank
`(curves on the left) and by the EIH model (curves on the right). Also shown are LPC
`polynomial fits to the original signal spectrum (on the left) and to the EIH signal spectrUITl
`(on the right) for both the clean signal and the noisy signal. This figure clearly shows a
`tremendous sensitivity of the Fourier and LPC analyses to noise for the original signals.
`
`IPR2023-00037
`Apple EX1013 Page 169
`
`
`
`Sec. 3.6
`
`Summary
`
`139
`
`(This is especially seen in the LPC polynomial fits.) In the EIH case, the log magnitude
`spectra are almost unaltered by the noise, and the LPC polynomial fits are extremely close
`to each other.
`The implication of the above results for speech recognition is that the EIH model has
`potential for use in recognizing speech robustly in noisy and reverberant environments. We
`will explore this issue in Chapter 5 when we talk about the effects of noise on performance
`of speech recognizers.
`
`3.6 SUMMARY
`
`In this chapter we have discussed several ways of performing spectral analysis of speech,
`including filter-bank analysis, LPC analysis, vector quantization, and auditory modeling.
`We have discussed the relative strengths and weaknesses of each approach and given a hint
`of the advantages and disadvantages for application to actual speech-recognition systems.
`We will see, in later chapters, how the type of spectral analysis that is used interacts with
`the processing of other parts of the recognizer. Only through such an understanding can
`one fully see the trade-offs among the different approaches to speech spectrum analysis.
`
`REFERENCES
`
`[I] L.R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Prentice
`Hall, Englewood Cliffs, NJ, 1975.
`[2] L.R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice Hall,
`Englewood Cliffs, NJ, 1978.
`[3] R.E. Crochiere and L.R. Rabiner, Multirate Digital Signal Processing, Prentice Hall,
`Englewood Cliffs, NJ, 1983.
`[4] B.A. Dautrich, L.R. Rabiner, and T.B. Martin, "On the Effects of Varying Filter Bank
`Parameters on Isolated Word Recognition," IEEE Trans. Acoustics, Speech, Signal Proc.,
`ASSP-31 (4): 793-807, August 1983.
`[5] B.A. Dautrich, L.R. Rabiner, and T.B. Martin, "The Effects of Selected Signal Processing
`Techniques on the Performance of a Filter Bank Based Isolated Word Recognizer," Bell
`System Tech. J., 62 (5): 1311-1336, May-June 1983.
`[6] J.D. Markel and A.H. Gray, Jr., Linear Prediction of Speech, Springer-Verlag, 1976.
`[7] G.M. White and R.B. Neely, "Speech Recognition Experiments with Linear Prediction,
`Bandpass Filtering, and Dynamic Programming," IEEE Trans. Acoustics, Speech, Signal
`Proc., ASSP-24 (2): 183-188, 1976.
`[8] L.R. Rabiner, B.S. Atal, and M.R. Sambur, "LPC Prediction Error-Analysis of its Variation
`with the Position of the Analysis Frame," IEEE Trans. Acoustics, Speech, Signal Proc.,
`ASSP-25 (5): 434-442, October 1977.
`[9] H. Strube, "Determination of the Instant of Glottal Closure from the Speech Wave," J.
`Acoust. Soc. Am., 56 (5): 1625-1629, November 1974.
`
`IPR2023-00037
`Apple EX1013 Page 170
`
`
`
`140
`
`Chap. 3
`
`Signal Processing and Analysis M
`ethods
`
`[ 10] B.S. AtaJ and S.L. Hanauer, "Speech Analysis and Synthesis by Linear Predicti
`on of the
`Speech Wave," J. Acoust. Soc. Am., 50 (2): 637-055, August 1971.
`(11] S. Furui, "Speaker Independent Isolated Word Recognition Using Dynamic Feat
`.
`.
`ures of
`Speech Spectrum," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-34 (I): 52_
`59
`,
`February 1986.
`(12] J.-H. Juang, D.Y. Wong, and A.H. Gray, Jr., "J?istortion Pe~onnance of Vector Quantization
`for LPC Voice Coding," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-30 (2): 294--
`304, April 1982.
`[13] O. Ghitza, "Auditory Nerve Representation as a Basis for Speech Processing," in Advances
`in Speech Signal Processing, S. Furui and M. Sondhi, Eds., Marcel Dekker, NY, 453-48S,
`1991.
`
`J
`
`I I
`
`I I
`I
`I
`I
`I
`I
`I
`I
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`
`II
`
`11
`
`I I
`//
`I
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`I
`I
`I
`I
`I
`I
`
`,
`~
`
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`I
`J
`
`IPR2023-00037
`Apple EX1013 Page 171
`
`
`
`Chapter 6
`
`THEORY
`AND IMPLEMENTATION
`OF HIDDEN
`MARKOV MODELS
`
`6.1 INTRODUCTION
`
`In Chapters 4 and 5 we presented one major pattern-recognition approach to speech recogni(cid:173)
`tion, namely the template method. One key idea in the template method is to derive typical
`sequences of speech frames for a pattern (e.g., a word) via some averaging procedure, and
`to rely on the use of local spectral distance measures to compare patterns. Another key
`idea is to use some form of dynamic programming to temporally align patterns to account
`for differences in speaking rates across talkers as well as across repetitions of the word by
`the same talker. The methodology of the template approach is well developed and provides
`good recognition performance for a variety of practical applications.
`The template approach, however, is not based on· the ideas of statistical signal model(cid:173)
`ing in a strict sense. Even though statistical techniques have been widely used in clustering
`to create reference patterns, the template approach is best classified as a simplified, non(cid:173)
`parametric method in which a multiplicity of refe1ence tokens (sequences) are used to
`characterize the variation among different utterances. As such, statistical signal charac(cid:173)
`terization inherent in the template representation is only implicit and often inadequate.
`Consider, for example, the use of a truncated cepstral distortion measure as the local dis(cid:173)
`tance for template matching. The Euclidean distance form of the cepstral distance measure
`suggests that the reference vector can be viewed as the mean of some assumed distribution.
`
`-
`
`I
`I
`
`321
`
`IPR2023-00037
`Apple EX1013 Page 172
`
`
`
`322
`
`Chap. 6
`
`Theory and Implementation of Hidden Markov Models
`
`Obviously, this simple form of the sufficient statistic 1 (use of only the mean reference
`covariances, which, as will be seen later,
`vector) neglects the second-order statistics-Le.,
`are of particular significance in statistical modeling. (Note that this distribution is used
`to account for variations of the cepstral coefficients at the frame level since time align(cid:173)
`ment is performed so as to match appropriate frames of the patterns being compared.)
`There is clearly a need to use a more elaborate and analytical statistical method for speech
`recognition.
`In this chapter we will study one well-known and widely used statistical method
`of characterizing the spectral properties of the frames of a pattern, namely the hidden
`Markov model (HMM) approach. (These models are also referred to as Markov sources or
`probabilistic functions of Markov chains in the communications literature.) The underlying
`assumption of the HMM (or any other type of statistical model) is that the speech signal
`can be well characterized as a parametric random process, and that the parameters of the
`stochastic process can be determined (estimated) in a precise, well-defined manner. We
`will show that the HMM method provides a natural and highly reliable way of recognizing
`speech for a wide range of applications and integrates well into systems incorporating both
`task syntax and semantics.
`The basic theory of hidden Markov models was published in a series of classic
`papers by Baum and his colleagues ([ I ]-[5]) in the late 1960s and early 1970s and was
`implemented for speech-processing applications by Baker [6] at CMU, and by Jelinek and
`his colleagues at IBM ([7]-[13]) in the 1970s.
`We begin this chapter with a review of the theory of Markov chains and then extend the
`ideas to HMMs using several simple examples. Based on the now-classical approach of Jack
`Ferguson of IDA (Institute for Defense Analyses), as introduced in lectures and in writing
`[ 14], we will focus our attention on the three fundamental problems for HMM design,
`namely:
`the evaluation of the probability (or likelihood) of a sequence of observations
`given a specific HMM; the determination of a best sequence of model states; and the
`adjustment of model parameters so as to best account for the observed signal. We will
`show that once these three fundamental problems are solved, we can readily apply HMMs
`to selected problems in speech recognition.
`
`6.2 DISC::RETE-TIME MARKOV PROCESSES
`
`Consider a system that may be described at any time as being in one of a set of N distinct
`states indexed by { l, 2, ... , N} as illustrated in Figure 6.1 (where N = 5 for simplicity).
`At regularly spaced, discrete times, the system undergoes a change of state (possibly back
`to the same state) according to a set of probabilities associated with the state. We denote
`the time instants associated with state changes as t = 1, 2, ... , and we denote the actual
`
`1 Sufficient statistics are a set of measurements from a process which contain all the relevant
`information for estimating the parameters of that process.
`
`IPR2023-00037
`Apple EX1013 Page 173
`
`
`
`sec. 6.2
`
`Discrete-Time Markov Processes
`
`323
`
`Figure 6.1 A Markov
`chain with five states
`(labeled I to 5) with selected state transitions.
`
`Figure 6.2 Markov model of the weather.
`
`state at time t as q,. A full probabilistic description of the above system would, in general,
`require specification of the current state (at time t), as well as all the predecessor states. For
`the special case of a discrete-time, first order, Markov chain, the probabilistic dependence
`is truncated to just the preceding state-that
`is,
`
`Furthermore, we consider only those processes in which the right-hand side of (6.1) is
`independent of time, thereby leading to the set of state-transition probabilities aij of the
`form
`
`(6.1)
`
`aii = P[q, = jlq,_ 1 = i],
`with the following properties
`
`1 ~ i,j 5: N
`
`a·· >O
`I}
`-
`
`Vj,i
`
`N
`Laij = 1
`J=I
`
`Vi
`
`(6.2)
`
`(6.3a)
`
`(6.3b)
`
`since they obey standard stochastic constraints.
`The above stochastic process could be called an observable Markov model because
`the output of the process is the set of states at each instant of time, where each state
`corresponds to an observable event. To set ideas, consider a simple three-state Markov
`model of the weather as shown in Figure 6.2. We assume that once a day (e.g., at noon),
`the weather is observed as being one of the following:
`
`State 1: precipitation (rain or snow)
`State 2: cloudy
`State 3: sunny.
`
`IPR2023-00037
`Apple EX1013 Page 174
`
`
`
`324
`
`Chap. 6
`
`Theory and Implementation of Hidden Markov Models
`
`We postulate that the weather on day t is characterized by a single one of the three states
`above, and that the matrix A of state-transition probabilities is
`
`0.4 0.3 0.3 ]
`A= {aij} = 0.2 0.6 0.2
`[
`0.1 0.1 0.8
`
`.
`
`Given the model of Figure 6.2 we can now ask (and answer) several interesting
`questions about weather patterns over time. For example, we can pose the following
`simple problem:
`
`Problem
`
`What is the probability (according to the model) that the weather for eight consecutive days is
`"sun-sun-sun-rain-rain-sun-cloudy-sun"?
`
`Solution
`We define the observation sequence, 0, as
`0
`( sunny,
`sunny )
`cloudy,
`sunny,
`rain,
`rain,
`sunny,
`sunny,
`3
`)
`2,
`3,
`1,
`1,
`(
`3,
`3,
`3,
`8
`7
`6
`5
`4
`1
`2
`3
`day
`corresponding to the postulated set of weather conditions over the eight-day period and we
`want to calculate P(OIModel), the probability of the observation sequence 0, given the model
`of Figure 6.2. We can directly determine P(OIModel) as:
`
`P(O!Model) = P[3, 3, 3, I, I, 3, 2, 3IModel]
`= P[3]P[3l3J2 P[ll3]P[lll]
`P[311] P[213] P[3j2]
`= 1r3 • (a33)2 a31 a11 a13 a32 a23
`= ( 1.0)(0.8)2(0.1 )(0.4 )(0.3)(0.1 )(0.2)
`= 1.536 X 10- 4
`
`where we use the notation:
`
`1r; = P[qi = i],
`
`(6.4)
`
`to denote the initial state probabilities.
`
`Another interesting question we can ask (and answer using the model) is:
`
`Problem
`
`Given that the system is in a known state, what is the probability that it stays in that state for
`exactly
`
`Solution
`This probability can be evaluated as th
`·
`f
`b b'l'
`e pro a 1 tty o the observation sequence
`, ... ,
`
`i,
`d
`
`Jc/=i
`d+l
`
`0 = ( i,
`day
`I
`
`i,
`2
`
`3
`
`IPR2023-00037
`Apple EX1013 Page 175
`
`
`
`sec. 6.3
`
`Extensions to Hidden Markov Models
`
`given the model, which is
`
`P(OIModel,q1 = i)= P(O,q 1 = ilModel)/P(q1 = l)
`= 7r;(a;;t-l(l
`- a;;)/1r;
`= (a;l-
`10 - au)
`= p;(d)
`
`325
`
`(6.5)
`
`The quantity p;(d) is the probability distribution function of duration d in state i. This
`exponential distribution is characteristic of the state duration in a Markov chain. Based on
`p;(d), we can readily calculate the expected number of observations (duration) in a state
`conditioned on starting in that state as
`'
`
`00
`
`d; = Ldp;(d)
`
`00
`
`(6.6a)
`
`(6.6b)
`
`Thus the expected number of consecutive days of sunny weather, according to the model, is
`1/(0.2) = 5; for cloudy it is 2.5; for rain it is 1.67.
`
`Problem
`Derive the expression for the mean of p;(d), i.e. Eq. (6.6b).
`
`Solution
`
`00
`
`d; = Ldp;(d)
`
`d=I
`
`00
`
`d=I
`
`00
`
`d]
`
`)
`
`= (1 - a;;)-
`8
`[
`"a;;
`oa;; LJ
`d=I
`a ( a··
`-
`--"
`1 - a;;
`
`= (1 - a;;)-
`1
`---1 - a;;
`
`oa;;
`
`&.3 EXTENSIONS TO HIDDEN MARKOV MODELS
`
`So far we have considered Markov models in which each state corresponded to a deter(cid:173)
`ministically observable event. Thus, the output of such sources in any given state is not
`random. This model is too restrictive to be applicable to many problems of interest. In
`this section we extend the concept of Markov models to include the case in which the
`observation is a probabilistic function of the state-that
`is, the resulting model (which is
`
`IPR2023-00037
`Apple EX1013 Page 176
`
`
`
`I
`
`- I
`
`326
`
`Chap. 6
`
`Theory and Implementation of Hidden Markov M
`Odels
`
`called a hidden Markov model) is a doubly embedded stochastic process with an und 1 .
`.
`.
`I (. • h. dd ) b
`er Ytng
`stochastic process that is not directly observab e It is 1 e