M-vector codebook as y_m, 1 \le m \le M, and we denote the spectral vector to be classified
(and quantized) as v, then the index, m^*, of the best codebook entry is

    m^* = \arg\min_{1 \le m \le M} d(v, y_m).    (3.95)

For codebooks with large values of M (e.g., M \approx 1024), the computation of Eq. (3.95)
could be excessive, depending on the exact details of the distance measure; hence,
alternative, suboptimal procedures for designing VQ codebooks have been investigated. We
will briefly discuss such methods in a later section of this chapter.
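
As a concrete illustration of Eq. (3.95), the following Python sketch (ours, not from the
text) implements the full-search classification rule, with a squared Euclidean distance
standing in for the spectral distortion measure d; the codebook size, dimension, and names
are illustrative assumptions only.

import numpy as np

def vq_full_search(v, codebook):
    # Return the index m* of the nearest codeword (Eq. 3.95).
    # codebook: (M, K) array of codewords y_m; v: (K,) input vector.
    # A squared Euclidean distance stands in for d(v, y_m); any of the
    # spectral distortion measures discussed earlier could be substituted.
    dists = np.sum((codebook - v) ** 2, axis=1)   # d(v, y_m) for all m
    return int(np.argmin(dists))                  # m* = arg min_m d(v, y_m)

# Example with an arbitrary random codebook of M = 1024 twelve-dimensional
# vectors; a real codebook would come from training.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((1024, 12))
m_star = vq_full_search(rng.standard_normal(12), codebook)

Note that the full search costs M distance computations per input vector, which is exactly
the burden the tree-structured methods of Section 3.4.7 are designed to reduce.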

3.4.6 Comparison of Vector and Scalar Quantizers

To illustrate the power of the concept of quantizing an entire vector (rather than
quantizing individual components of the vector), Figures 3.45 and 3.46 show comparisons of
the results of using vector and scalar quantizers on typical speech spectral frames. In
Figure 3.45 we see both model (speech) spectra and the resulting quantization error
spectrum for 10-bit and 24-bit scalar quantizers and for a 10-bit vector quantizer. It is
clear that the quantization error of the 10-bit vector quantizer is comparable to that of
the 24-bit scalar quantizer. This implies that the vector quantizer provides a 14-bit
reduction in storage (per frame) over a scalar quantizer, i.e., a reduction of more than
50% (14 of 24 bits, about 58%) in storage for the same distortion.

Figure 3.46 shows temporal plots of distortion as well as distortion error histograms
for the three quantizers of Figure 3.45. It can be seen that even though the average
distortion of the 10-bit VQ is comparable to that of the 24-bit scalar quantizer, the peak
distortion of the 10-bit VQ is much smaller than the peak distortion of the 24-bit scalar
quantizer. This represents another distinct advantage of VQ over scalar quantization.

3.4.7 Extensions of Vector Quantization

As mentioned earlier, several straightforward extensions of the ideas of VQ have been
proposed and studied, including the following:

1. Use of multiple codebooks in which codebooks are created separately (and independently)
for each of several spectral (or temporal) representations of speech. Thus we might
consider using a separate codebook for cepstral vectors and a separate codebook for the
time derivatives of the cepstral vectors. This method of multiple codebooks has been used
extensively in large-vocabulary speech-recognition systems.

2. Binary search trees in which a series of suboptimal VQs is used to limit the search
space so as to reduce the computation of the overall VQ from M distances to on the order
of log(M) distances (a search sketch is given after this list). The training procedure
first designs an optimal M = 2 VQ and then assigns all training vectors to one of the VQ
cells. Next the procedure designs a pair of M = 2 VQs, one for each subset of the
preceding stage. This process is iterated until the desired size is obtained in log(M)
steps. The suboptimality of the procedure is related to the fact that training vectors
initially split along one branch of the VQ cannot join the other branch at a later stage
of processing; hence, the overall distortion is not minimal at each branch of the tree.

[Figure 3.45 Model and distortion error spectra for scalar and vector quantizers (after
Juang et al. [12]). Panels compare unquantized and quantized model spectra, and the
resulting error spectra on a +/-5 dB scale, for 10-bit scalar, 24-bit scalar, and 10-bit
vector quantizers.]

3. K-tuple (fixed-length block) quantizers in which K frames of speech are coded at a
time, rather than single frames, as is conventionally the case. The idea is to exploit
correlations in time for vowels and vowel-like sounds. The disadvantage occurs for
transient sounds and for sounds where the correlation along the K-tuple is low, i.e.,
many consonants.

4. Matrix quantization in which a codebook of sounds or words of variable sequence length
is created. The concept here is to handle time variability via some type of
dynamic-programming procedure and thereby create a codebook of sequences of vectors that
represent typical sounds or words. Such techniques are most applicable to
word-recognition systems.

[Figure 3.46 Plots and histograms of temporal (likelihood-ratio) distortion for scalar
and vector quantizers (after Juang et al. [12]). Reported statistics: 24-bit scalar, max
1.433, average 0.148, sigma 0.151; 10-bit scalar, max 3.869, average 0.463, sigma 0.514;
10-bit vector, max 0.478, average 0.151, sigma 0.073.]

5. Trellis codes in which time-sequential dependencies among codebook entries are
explicitly determined as part of the training phase. The idea here is that when input
vector v_n is quantized using codeword y_e, then input vector v_{n+1} is quantized using
one of a limited subset of codebook entries that are related to y_e via a set of learned
sequential constraints, thereby reducing the computation of encoding the input and
increasing the ability to interpret the codebook output in terms of basic speech units.

6. Hidden Markov models in which both time and spectral constraints are used to quantize
an entire speech utterance in a well-defined and efficient manner. We defer a discussion
of hidden Markov models to Chapter 6.
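
The binary-tree search of item 2 can be sketched in a few lines. This is our illustrative
sketch, not the book's implementation: the layout of split_nodes (a trained pair of
centroids at every node of every level) is an assumed data structure, and the
stage-by-stage training described in item 2 is omitted.

import numpy as np

def tree_vq_search(v, split_nodes, leaves):
    # split_nodes[level] is a (2**level, 2, K) array holding, for each node
    # at that level, the two child centroids trained on the vectors routed
    # to that node.  leaves is the (M, K) final codebook with M = 2**depth.
    # The search costs 2*log2(M) distance computations instead of M.
    index = 0
    for level_centroids in split_nodes:
        pair = level_centroids[index]              # the node's two children
        d0 = np.sum((pair[0] - v) ** 2)
        d1 = np.sum((pair[1] - v) ** 2)
        index = 2 * index + (1 if d1 < d0 else 0)  # descend the closer branch
    return index, leaves[index]

Because each comparison commits the input vector to one branch, the leaf reached need not
be the globally nearest codeword; this is precisely the suboptimality noted in item 2.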

3.4.8 Summary of the VQ Method

In later chapters of this book we will see several examples of how VQ concepts can be
exploited in speech-recognition systems. Here we have shown that the basic idea of VQ is
to reduce the information rate of the speech signal to a low rate through the use of a
codebook with a relatively small number of codewords. The goal is to be able to represent
the spectral information of the signal in an efficient manner and in a way that direct
connections to the acoustic-phonetic framework discussed in Chapter 2 can be made.
Various techniques for achieving this efficiency of representation were discussed, and
their properties were illustrated on representative examples of speech.

[Figure 3.47 Physiological model of the human ear, showing the outer, middle, and inner
ear.]

3.5 AUDITORY-BASED SPECTRAL ANALYSIS MODELS

The motivation for investigating spectral analysis methods that are physiologically based
is to gain an understanding of how the human auditory system processes speech, so as to
be able to design and implement robust, efficient methods of analyzing and representing
speech. It is generally assumed that the better we understand the signal processing in
the human auditory system, the closer we will come to being able to design a system that
can truly understand the meaning as well as the content of speech.

With these considerations in mind, we first examine a physiological model of the human
ear. Such a model is given in Figure 3.47, and it shows that the ear has three distinct
regions called the outer ear, the middle ear, and the inner ear. The outer ear consists
of the pinna (the ear surface surrounding the canal into which sound is funneled) and the
external canal. Sound waves reach the ear and are guided through the outer ear to the
middle ear, which consists of the tympanic membrane (or eardrum), upon which the sound
wave impinges and which it causes to move, and a mechanical transducer (the malleus or
hammer, the incus or anvil, and the stapes or stirrup), which converts the acoustical
sound wave into mechanical vibrations along the inner ear. The inner ear consists of the
cochlea, which is a fluid-filled chamber partitioned by the basilar membrane, and the
cochlear (auditory) nerve. The mechanical vibrations impinging on the oval window at the
entrance to the cochlea create standing waves (of the fluid inside the cochlea) that
cause the basilar membrane to vibrate at frequencies commensurate with the input acoustic
wave frequencies (e.g., the formants of voiced speech) and at a place along the basilar
membrane that is associated with these frequencies. (An expanded view of the middle and
inner ear mechanics is given in Figure 3.48; the 2-1/2-turn, snail-like shape of the
cochlea is shown as a straight tube in this figure for ease of presentation.)

[Figure 3.48 Expanded view of the middle and inner ear mechanics, showing the oval
window, the cochlear filters, the inner hair cells, the vestibular system, the Eustachian
tube, and the auditory nerve.]

The basilar membrane is characterized by a set of frequency responses at different points
along the membrane. Hence, in its simplest form, the cochlea can be modeled as a
mechanical realization of a bank of filters (appropriately called cochlear filters).
Distributed along the basilar membrane (in a dense but discrete manner) is a set of
sensors called inner hair cells (IHC), which act as converters from mechanical motion to
neural activity. Mechanical motion at some point along the basilar membrane is sensed by
the inner hair cells and causes firing activity at the nerve fibers that innervate the
bottom of each IHC. Each IHC is connected to about 10 nerve fibers, each of different
diameter and of different synaptic connection. It has been shown experimentally that thin
fibers fire (emit neural impulses) only at high motion levels, whereas thick fibers fire
at much lower motion levels. A total of about 30,000 nerve fibers link the IHCs to the
auditory nerve.

Beyond the auditory nerve, our knowledge of how the information signals (the neural
activity along the auditory nerve) are processed and eventually converted to intelligence
in the brain is primitive at best. Hence, when we attempt to build auditory models for
signal processing, we are primarily modeling the middle ear, cochlea, and hair-cell
systems. The assumption is that the signal produced by such a model exhibits some of the
robustness (immunity to noise and reverberation) and efficiency of the human auditory
system. Thus, in the remainder of this section, we present one such model, called the
Ensemble Interval Histogram (EIH) model, and show some of the properties of speech
signals processed by such a model.

3.5.1 The EIH Model

On the basis of the discussion in the preceding section, a model of the cochlea and the
hair-cell transduction consists of a filter bank that models the frequency selectivity at
various points along a simulated basilar membrane, and a nonlinear processor for
converting the filter-bank output to neural firing patterns along a simulated auditory
nerve. Such a model is shown in Figure 3.49 and is called the EIH model [13].

In the EIH model, the mechanical motion of the basilar membrane is sampled using 165 IHC
channels, equally spaced on a log-frequency scale between 150 and 7000 Hz. The
corresponding cochlear filters are based on actual neural tuning curves for cats. The
amplitude responses of 28 of these filters (i.e., about 1 in 6 from the model) are shown
in Figure 3.50. The phase characteristics of these filters are minimum phase, and the
relative gain, measured at the center frequency of the filter, reflects the corresponding
value of the cat's middle-ear transfer function.

The next stage of processing in the EIH model of Figure 3.49 is an array of
level-crossing detectors that models the motion-to-neural-activity transduction of the
hair-cell mechanisms. The detection levels of each detector are pseudo-randomly
distributed (based on measured distributions of level firings), thereby simulating the
variability of fiber diameters and their synaptic connections.

The output of the level-crossing detectors represents the discharge activity of the
auditory nerve fibers. Figure 3.51 shows simulated auditory nerve activity, for the first
60 msec of the vowel /o/ in the word "job," as a function of both time and the
"characteristic frequency" of the IHC channels. (Note the logarithmic scale of the
characteristic frequency, which represents the place-to-frequency mapping on the basilar
membrane.) In Figure 3.51, a level-crossing occurrence is marked as a single dot, and the
output activity of each level-crossing detector is plotted as a separate trace. Each IHC
channel contributes seven parallel traces (corresponding to the seven level-crossing
detectors for each channel), with the lowest trace representing the lowest-threshold
level-crossing detector. If the magnitude of the filter's output is low, only one level
will be crossed, as is seen for the very top channels in Figure 3.51. However, for large
signal magnitudes, several levels will be activated, creating a "darker" area of activity
in the figure.

The level-crossing patterns represent the auditory nerve activity, which, in turn, is the
input to a second, more central stage of neural processing, which gives the overall
ensemble interval histogram (EIH). Conceptually, the EIH is a measure of the spatial
extent of coherent neural activity across the simulated auditory nerve. Mathematically,
it is the short-term probability density function of the reciprocal of the intervals
between successive firings, measured over the entire simulated auditory nerve in a
characteristic-frequency-dependent time-frequency zone.

[Figure 3.49 Block diagram of the EIH model (after Ghitza [13]): the input s(t) drives an
array of 165 cochlear filters, each followed by level-crossing detectors whose outputs
(the simulated auditory nerve) feed interval histograms.]
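
To make the interval-histogram idea concrete, here is a heavily simplified Python sketch
(ours, not Ghitza's implementation) for a single channel: detect positive-going crossings
of each threshold, convert the intervals between successive crossings to frequencies, and
accumulate a histogram. The cochlear filter shapes, the 165-channel bank, and the
characteristic-frequency-dependent time-frequency zones of the real EIH model are all
omitted, and the thresholds and bin widths are assumed values.

import numpy as np

def interval_histogram(x, fs, levels, bins):
    # x: one simulated cochlear-filter output; fs: sampling rate in Hz;
    # levels: level-crossing thresholds (one per simulated nerve fiber);
    # bins: histogram bin edges in Hz.  Each pair of successive
    # positive-going crossings of a level contributes one count at the
    # reciprocal of the interval between them.
    counts = np.zeros(len(bins) - 1)
    for level in levels:
        above = x >= level
        crossings = np.flatnonzero(~above[:-1] & above[1:])  # upward crossings
        intervals = np.diff(crossings) / fs                  # in seconds
        freqs = 1.0 / intervals[intervals > 0]               # reciprocal intervals
        counts += np.histogram(freqs, bins=bins)[0]
    return counts

# For a pure 1 kHz sinusoid, all counts land in the bin containing 1000 Hz,
# and raising the amplitude activates more thresholds, adding more counts.
fs = 16000
t = np.arange(0, 0.06, 1.0 / fs)
x = 0.8 * np.sin(2 * np.pi * 1000 * t)
levels = np.geomspace(0.05, 1.0, 7)       # 7 log-spaced thresholds per channel
hist = interval_histogram(x, fs, levels, np.linspace(0, 4000, 81))

This behavior for a pure sinusoid is exactly the case analyzed next.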

[Figure 3.50 Frequency response curves of a cat's basilar membrane (after Ghitza [13]);
relative amplitude response in dB versus frequency, 100 Hz to 10,000 Hz.]

[Figure 3.51 Magnitude of EIH for the vowel /o/, showing the time-frequency resolution
(after Ghitza [13]); level-crossing activity plotted against time and the characteristic
frequency of the IHC channels.]

[Figure 3.52 Operation of the EIH model for a pure sinusoid (after Ghitza [13]): (a) a
single cochlear filter with characteristic frequency f_0, whose crossing intervals all
contribute counts to the f_0 histogram bin; (b) five adjacent cochlear filters, of which
only those whose outputs exceed a crossing threshold contribute counts.]

[Figure 3.53 Comparison of Fourier and EIH log spectra for clean and noisy speech signals
(after Ghitza [13]); spectral amplitude (Fourier, left) and EIH magnitude (right) versus
frequency, for the clean signal and for 0 dB SNR, with LPC fits overlaid.]

As a consequence of the multilevel crossing detectors, the EIH representation preserves
information about the signal's overall energy. To illustrate this point, consider the
case in which the input signal is a pure sinusoid, i.e., s(t) = A \sin(2\pi f_0 t), and
the characteristic frequency of a selected channel is f_0, as shown in Figure 3.52a. For
a given intensity A, the cochlear filter output will activate only some low
level-crossing detectors. For a given detector, the time interval between two successive
positive-going level crossings is 1/f_0. Since the histogram is scaled in units of
frequency, this interval contributes a count to the f_0 bin. For the input signal in
Figure 3.52a, all of the intervals are the same, resulting in a histogram in which the
magnitude of each bin, save one (f_0), is zero. As the signal amplitude A increases, more
levels are activated. As a result, this cochlear filter contributes additional counts to
the f_0 bin of the EIH. Since the crossing levels are equally distributed on a
log-amplitude scale, the magnitude of any EIH bin is related, in some fashion, to decibel
units. However, this relation is not a straightforward one, because several sources
contribute counts to the f_0 bin in a nonlinear manner. Figure 3.52b shows an input
signal s(t) = A \sin(2\pi f_0 t) driving five adjacent cochlear filters with amplitude
responses |H_i(f)| and phase responses \phi_i(f), i = 1, 2, ..., 5. Due to the shape of
the filters, more than one cochlear channel will contribute to the f_0 bin. In fact, all
the cochlear filters that produce s_i(t) = A |H_i(f_0)| \sin(2\pi f_0 t + \phi_i(f_0))
will contribute to the f_0 bin of the EIH, provided that A |H_i(f_0)| exceeds any of the
level-crossing thresholds. In Figure 3.52b only cochlear filters 2, 3, and 4 contribute
nonzero histograms to the EIH. The number of counts is different for each filter,
depending on the magnitude of A |H_i(f_0)|.
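
The bookkeeping of Figure 3.52b reduces to comparing A |H_i(f_0)| against the crossing
thresholds. The sketch below uses invented filter gains and thresholds purely to
illustrate the counting rule; it is not Ghitza's model.

import numpy as np

A = 0.5                                        # assumed sinusoid amplitude
H_f0 = np.array([0.05, 0.4, 1.0, 0.6, 0.1])    # assumed |H_i(f0)|, i = 1..5
levels = np.geomspace(0.05, 1.0, 7)            # log-spaced crossing thresholds

# Filter i contributes counts to the f0 bin iff A*|H_i(f0)| exceeds at least
# one threshold; the number of activated thresholds sets its contribution.
counts = np.array([(A * h > levels).sum() for h in H_f0])
print(counts)   # only the middle filters contribute, as in Figure 3.52b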

One goal of auditory-based signal processing is to make the signal representation more
robust to noise and reverberation than alternative spectral analysis procedures such as
the filter-bank method or the LPC method. Figure 3.53 illustrates how well the EIH model
achieves this goal. Shown in the figure are the log magnitude spectra of a clean (no
noise) and a noisy (signal-to-noise ratio of 0 dB) speech signal processed by a standard
Fourier filter bank (curves on the left) and by the EIH model (curves on the right). Also
shown are LPC polynomial fits to the original signal spectrum (on the left) and to the
EIH signal spectrum (on the right) for both the clean signal and the noisy signal. This
figure clearly shows a tremendous sensitivity of the Fourier and LPC analyses to noise
for the original signals. (This is especially seen in the LPC polynomial fits.) In the
EIH case, the log magnitude spectra are almost unaltered by the noise, and the LPC
polynomial fits are extremely close to each other.

The implication of the above results for speech recognition is that the EIH model has
potential for use in recognizing speech robustly in noisy and reverberant environments.
We will explore this issue in Chapter 5 when we talk about the effects of noise on the
performance of speech recognizers.

3.6 SUMMARY

In this chapter we have discussed several ways of performing spectral analysis of speech,
including filter-bank analysis, LPC analysis, vector quantization, and auditory modeling.
We have discussed the relative strengths and weaknesses of each approach and given a hint
of the advantages and disadvantages of each for application to actual speech-recognition
systems. We will see, in later chapters, how the type of spectral analysis that is used
interacts with the processing of other parts of the recognizer. Only through such an
understanding can one fully see the trade-offs among the different approaches to speech
spectrum analysis.

REFERENCES

[1] L.R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing,
Prentice Hall, Englewood Cliffs, NJ, 1975.
[2] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall,
Englewood Cliffs, NJ, 1978.
[3] R.E. Crochiere and L.R. Rabiner, Multirate Digital Signal Processing, Prentice Hall,
Englewood Cliffs, NJ, 1983.
[4] B.A. Dautrich, L.R. Rabiner, and T.B. Martin, "On the Effects of Varying Filter Bank
Parameters on Isolated Word Recognition," IEEE Trans. Acoustics, Speech, Signal Proc.,
ASSP-31 (4): 793-807, August 1983.
[5] B.A. Dautrich, L.R. Rabiner, and T.B. Martin, "The Effects of Selected Signal
Processing Techniques on the Performance of a Filter Bank Based Isolated Word
Recognizer," Bell System Tech. J., 62 (5): 1311-1336, May-June 1983.
[6] J.D. Markel and A.H. Gray, Jr., Linear Prediction of Speech, Springer-Verlag, 1976.
[7] G.M. White and R.B. Neely, "Speech Recognition Experiments with Linear Prediction,
Bandpass Filtering, and Dynamic Programming," IEEE Trans. Acoustics, Speech, Signal
Proc., ASSP-24 (2): 183-188, 1976.
[8] L.R. Rabiner, B.S. Atal, and M.R. Sambur, "LPC Prediction Error-Analysis of Its
Variation with the Position of the Analysis Frame," IEEE Trans. Acoustics, Speech, Signal
Proc., ASSP-25 (5): 434-442, October 1977.
[9] H. Strube, "Determination of the Instant of Glottal Closure from the Speech Wave,"
J. Acoust. Soc. Am., 56 (5): 1625-1629, November 1974.
[10] B.S. Atal and S.L. Hanauer, "Speech Analysis and Synthesis by Linear Prediction of
the Speech Wave," J. Acoust. Soc. Am., 50 (2): 637-655, August 1971.
[11] S. Furui, "Speaker Independent Isolated Word Recognition Using Dynamic Features of
Speech Spectrum," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-34 (1): 52-59,
February 1986.
[12] B.-H. Juang, D.Y. Wong, and A.H. Gray, Jr., "Distortion Performance of Vector
Quantization for LPC Voice Coding," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-30
(2): 294-304, April 1982.
[13] O. Ghitza, "Auditory Nerve Representation as a Basis for Speech Processing," in
Advances in Speech Signal Processing, S. Furui and M. Sondhi, Eds., Marcel Dekker, New
York, 453-485, 1991.

Chapter 6

THEORY AND IMPLEMENTATION OF HIDDEN MARKOV MODELS

6.1 INTRODUCTION

In Chapters 4 and 5 we presented one major pattern-recognition approach to speech
recognition, namely the template method. One key idea in the template method is to derive
typical sequences of speech frames for a pattern (e.g., a word) via some averaging
procedure, and to rely on the use of local spectral distance measures to compare
patterns. Another key idea is to use some form of dynamic programming to temporally align
patterns to account for differences in speaking rates across talkers as well as across
repetitions of the word by the same talker. The methodology of the template approach is
well developed and provides good recognition performance for a variety of practical
applications.

The template approach, however, is not based on the ideas of statistical signal modeling
in a strict sense. Even though statistical techniques have been widely used in clustering
to create reference patterns, the template approach is best classified as a simplified,
nonparametric method in which a multiplicity of reference tokens (sequences) is used to
characterize the variation among different utterances. As such, the statistical signal
characterization inherent in the template representation is only implicit and often
inadequate. Consider, for example, the use of a truncated cepstral distortion measure as
the local distance for template matching. The Euclidean distance form of the cepstral
distance measure suggests that the reference vector can be viewed as the mean of some
assumed distribution.

Obviously, this simple form of the sufficient statistic¹ (use of only the mean reference
vector) neglects the second-order statistics, i.e., the covariances, which, as will be
seen later, are of particular significance in statistical modeling. (Note that this
distribution is used to account for variations of the cepstral coefficients at the frame
level, since time alignment is performed so as to match appropriate frames of the
patterns being compared.) There is clearly a need to use a more elaborate and analytical
statistical method for speech recognition.

In this chapter we will study one well-known and widely used statistical method of
characterizing the spectral properties of the frames of a pattern, namely the hidden
Markov model (HMM) approach. (These models are also referred to as Markov sources or
probabilistic functions of Markov chains in the communications literature.) The
underlying assumption of the HMM (or any other type of statistical model) is that the
speech signal can be well characterized as a parametric random process, and that the
parameters of the stochastic process can be determined (estimated) in a precise,
well-defined manner. We will show that the HMM method provides a natural and highly
reliable way of recognizing speech for a wide range of applications and integrates well
into systems incorporating both task syntax and semantics.

The basic theory of hidden Markov models was published in a series of classic papers by
Baum and his colleagues ([1]-[5]) in the late 1960s and early 1970s and was implemented
for speech-processing applications by Baker [6] at CMU, and by Jelinek and his colleagues
at IBM ([7]-[13]) in the 1970s.

We begin this chapter with a review of the theory of Markov chains and then extend the
ideas to HMMs using several simple examples. Based on the now-classical approach of Jack
Ferguson of IDA (Institute for Defense Analyses), as introduced in lectures and in
writing [14], we will focus our attention on the three fundamental problems for HMM
design, namely: the evaluation of the probability (or likelihood) of a sequence of
observations given a specific HMM; the determination of a best sequence of model states;
and the adjustment of model parameters so as to best account for the observed signal. We
will show that once these three fundamental problems are solved, we can readily apply
HMMs to selected problems in speech recognition.

¹ Sufficient statistics are a set of measurements from a process that contain all the
relevant information for estimating the parameters of that process.

6.2 DISCRETE-TIME MARKOV PROCESSES

Consider a system that may be described at any time as being in one of a set of N
distinct states indexed by {1, 2, ..., N}, as illustrated in Figure 6.1 (where N = 5 for
simplicity). At regularly spaced, discrete times, the system undergoes a change of state
(possibly back to the same state) according to a set of probabilities associated with the
state. We denote the time instants associated with state changes as t = 1, 2, ..., and we
denote the actual state at time t as q_t. A full probabilistic description of the above
system would, in general, require specification of the current state (at time t), as well
as all the predecessor states. For the special case of a discrete-time, first-order
Markov chain, the probabilistic dependence is truncated to just the preceding state; that
is,

    P[q_t = j \mid q_{t-1} = i, q_{t-2} = k, \ldots] = P[q_t = j \mid q_{t-1} = i].    (6.1)

Furthermore, we consider only those processes in which the right-hand side of (6.1) is
independent of time, thereby leading to the set of state-transition probabilities a_{ij}
of the form

    a_{ij} = P[q_t = j \mid q_{t-1} = i],    1 \le i, j \le N    (6.2)

with the following properties

    a_{ij} \ge 0    \forall i, j    (6.3a)

    \sum_{j=1}^{N} a_{ij} = 1    \forall i    (6.3b)

since they obey standard stochastic constraints.

[Figure 6.1 A Markov chain with five states (labeled 1 to 5) with selected state
transitions.]

[Figure 6.2 Markov model of the weather.]

The above stochastic process could be called an observable Markov model because the
output of the process is the set of states at each instant of time, where each state
corresponds to an observable event. To set ideas, consider a simple three-state Markov
model of the weather, as shown in Figure 6.2. We assume that once a day (e.g., at noon),
the weather is observed as being one of the following:

State 1: precipitation (rain or snow)
State 2: cloudy
State 3: sunny.

We postulate that the weather on day t is characterized by a single one of the three
states above, and that the matrix A of state-transition probabilities is

    A = \{a_{ij}\} = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}.

Given the model of Figure 6.2, we can now ask (and answer) several interesting questions
about weather patterns over time. For example, we can pose the following simple problem:

Problem

What is the probability (according to the model) that the weather for eight consecutive
days is "sun-sun-sun-rain-rain-sun-cloudy-sun"?

Solution

We define the observation sequence, O, for days 1 through 8 as

    O = (sunny, sunny, sunny, rain, rain, sunny, cloudy, sunny)
      = (3, 3, 3, 1, 1, 3, 2, 3),

corresponding to the postulated set of weather conditions over the eight-day period, and
we want to calculate P(O|Model), the probability of the observation sequence O, given the
model of Figure 6.2. We can directly determine P(O|Model) as:

    P(O|Model) = P[3, 3, 3, 1, 1, 3, 2, 3 \mid Model]
               = P[3] \, P[3|3]^2 \, P[1|3] \, P[1|1] \, P[3|1] \, P[2|3] \, P[3|2]
               = \pi_3 (a_{33})^2 a_{31} a_{11} a_{13} a_{32} a_{23}
               = (1.0)(0.8)^2 (0.1)(0.4)(0.3)(0.1)(0.2)
               = 1.536 \times 10^{-4},

where we use the notation

    \pi_i = P[q_1 = i],    1 \le i \le N    (6.4)

to denote the initial state probabilities.
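
The calculation above is easy to verify numerically. The following Python sketch (ours,
for illustration) scores the same eight-day sequence against the model of Figure 6.2;
state indices are shifted to 0-based for array indexing, and, as in the text's
computation, the chain is taken to start in the sunny state (\pi_3 = 1.0).

import numpy as np

A = np.array([[0.4, 0.3, 0.3],     # state-transition matrix of Figure 6.2;
              [0.2, 0.6, 0.2],     # rows/columns: rain, cloudy, sunny
              [0.1, 0.1, 0.8]])
pi = np.array([0.0, 0.0, 1.0])     # pi_3 = 1.0, as assumed in the text

def sequence_prob(states, A, pi):
    # P(O|Model) for an observable Markov chain: the initial probability of
    # Eq. (6.4) times the state-transition probabilities of Eq. (6.2).
    p = pi[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        p *= A[prev, cur]
    return p

O = [2, 2, 2, 0, 0, 2, 1, 2]       # sun, sun, sun, rain, rain, sun, cloudy, sun
print(sequence_prob(O, A, pi))     # 1.536e-04, matching the hand calculation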

Another interesting question we can ask (and answer using the model) is:

Problem

Given that the system is in a known state, what is the probability that it stays in that
state for exactly d days?

Solution

This probability can be evaluated as the probability of the observation sequence

    O = (i, i, i, ..., i, j \ne i),    days 1, 2, 3, ..., d, d+1,

given the model, which is

    P(O \mid Model, q_1 = i) = P(O, q_1 = i \mid Model) / P(q_1 = i)
                             = \pi_i (a_{ii})^{d-1} (1 - a_{ii}) / \pi_i
                             = (a_{ii})^{d-1} (1 - a_{ii})
                             = p_i(d).    (6.5)

The quantity p_i(d) is the probability distribution function of duration d in state i.
This exponential distribution is characteristic of the state duration in a Markov chain.
Based on p_i(d), we can readily calculate the expected number of observations (duration)
in a state, conditioned on starting in that state, as

    \bar{d}_i = \sum_{d=1}^{\infty} d \, p_i(d)    (6.6a)

              = \sum_{d=1}^{\infty} d \, (a_{ii})^{d-1} (1 - a_{ii}) = \frac{1}{1 - a_{ii}}.    (6.6b)

Thus the expected number of consecutive days of sunny weather, according to the model, is
1/(0.2) = 5; for cloudy it is 2.5; for rain it is 1.67.

Problem

Derive the expression for the mean of p_i(d), i.e., Eq. (6.6b).

Solution

    \bar{d}_i = \sum_{d=1}^{\infty} d \, p_i(d)
              = \sum_{d=1}^{\infty} d \, (a_{ii})^{d-1} (1 - a_{ii})
              = (1 - a_{ii}) \frac{\partial}{\partial a_{ii}} \left[ \sum_{d=1}^{\infty} (a_{ii})^d \right]
              = (1 - a_{ii}) \frac{\partial}{\partial a_{ii}} \left( \frac{a_{ii}}{1 - a_{ii}} \right)
              = (1 - a_{ii}) \frac{1}{(1 - a_{ii})^2}
              = \frac{1}{1 - a_{ii}}.
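
Eq. (6.6b) is also easy to check numerically: truncating the infinite sum at a large D
and comparing against 1/(1 - a_ii) for the three self-transition probabilities of the
weather model reproduces the expected durations quoted above. This is an illustrative
sketch only.

import numpy as np

def expected_duration(a_ii, max_d=10000):
    # Truncated sum of d * p_i(d), with p_i(d) = a_ii**(d-1) * (1 - a_ii).
    d = np.arange(1, max_d + 1)
    return np.sum(d * a_ii ** (d - 1) * (1 - a_ii))

for a_ii in (0.8, 0.6, 0.4):       # sunny, cloudy, rain self-loops
    print(expected_duration(a_ii), 1.0 / (1.0 - a_ii))
# prints approximately 5.0, 2.5, and 1.667, matching Eq. (6.6b)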

6.3 EXTENSIONS TO HIDDEN MARKOV MODELS

So far we have considered Markov models in which each state corresponded to a
deterministically observable event. Thus, the output of such sources in any given state
is not random. This model is too restrictive to be applicable to many problems of
interest. In this section we extend the concept of Markov models to include the case in
which the observation is a probabilistic function of the state; that is, the resulting
model (which is called a hidden Markov model) is a doubly embedded stochastic process
with an underlying stochastic process that is not directly observable (it is hidden) but
can only be observed through another set of stochastic processes that produce the
sequence of observations.