6.15 HMM SYSTEM FOR ISOLATED WORD RECOGNITION
Figure 6.13 Block diagram of an isolated word HMM recognizer (after Rabiner [38]): LPC feature analysis (vector quantization) of the observation sequence O, computation of the probability P(O | λ_v) for each of the V word HMMs, and selection of the maximum to give the index of the recognized word.

6.15.1 Choice of Model Parameters
We now return to the issue that we have raised several times in this chapter, namely: how do we select the type of model, and how do we choose the parameters of the selected model? For isolated word recognition with a distinct HMM designed for each word in the vocabulary, it should be clear that a left-right model is more appropriate than an ergodic model, since we can then associate time with model states in a fairly straightforward manner. Furthermore, we can envision the physical meaning of the model states as distinct sounds (e.g., phonemes, syllables) of the word being modeled.
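As a concrete illustration of the left-right topology (a sketch added here, not part of the original text), the transition matrix of a small left-right word model can be initialized so that transitions only move forward by a bounded number of states; the five states, the maximum jump of two, and the uniform initial values are assumptions of the sketch, and reestimation would refine them.

```python
import numpy as np

def init_left_right_transitions(num_states: int = 5, max_jump: int = 2) -> np.ndarray:
    """Initialize a left-right (Bakis-type) transition matrix: a_ij = 0 for j < i or j > i + max_jump."""
    A = np.zeros((num_states, num_states))
    for i in range(num_states):
        # Allowed successors: stay in state i or move ahead by at most max_jump states.
        successors = range(i, min(i + max_jump + 1, num_states))
        for j in successors:
            A[i, j] = 1.0 / len(successors)   # uniform starting values; training refines these
    return A

print(init_left_right_transitions())
```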
The issue of the number of states to use in each word model leads to two schools of thought. One idea is to let the number of states correspond roughly to the number of sounds (phonemes) within the word; hence, models with from 2 to 10 states would be appropriate [18]. The other idea is to let the number of states correspond roughly to the average number of observations in a spoken version of the word, the so-called Bakis model [11]. In this manner each state corresponds to an observation interval, i.e., about 10-15 ms for the standard methods of analysis.
Figure 6.14 Average word error rate (for a digits vocabulary) versus the number of states N in the HMM (after Rabiner et al. [18]).

In the results to be described later in this section, we use the former approach. Furthermore, we restrict each word model to have the same number of states; this implies that the models will work best when they represent words with the same number of sounds.
To illustrate the effect of varying the number of states in a word model, Figure 6.14 shows a plot of average word error rate versus N, for the case of recognition of isolated digits (i.e., a 10-word vocabulary). It can be seen that the error is somewhat insensitive to N, achieving a local minimum at N = 6; however, differences in error rate for values of N close to 6 are small.
The next issue is the choice of observation vector and the way it is represented. As discussed in Chapter 3, possibilities include LPC-derived weighted cepstral coefficients and weighted cepstral derivatives or (for autoregressive HMMs) the autocorrelation coefficients as the observation vectors for continuous models; for discrete symbol models we use a codebook to generate the discrete symbols. For the continuous models we use as many as M = 64 to 256 mixtures per state; for the discrete symbol models we use codebooks with as many as M = 512 to 1024 code words. Also, for the continuous models, it has been found that it is more convenient and sometimes preferable to use diagonal covariance matrices with several mixtures, rather than fewer mixtures with full covariance matrices. The reason for this is simple, namely the difficulty in performing reliable reestimation of the off-diagonal components of the covariance matrix from the necessarily limited training data. Figure 6.15 illustrates the need for using mixture densities for modeling observation vectors (eighth-order cepstral vectors derived from LPC with log energy appended as the ninth vector component). Figure 6.15 shows a comparison of marginal distributions b_j(o) against a histogram of the actual observations within a state (as determined by a maximum likelihood segmentation of all the training observations into states).
Figure 6.15 Comparison of estimated density (jagged contour) and model density (smooth contour) for each of the nine components of the observation vector (eight cepstral components, one log energy component) for state 1 of the digit zero (after Rabiner et al. [38]).

The observation vectors are ninth order, and the model density uses M = 5 mixtures. The covariance matrices are constrained to be diagonal for each individual mixture. The results of Figure 6.15 are for the first model state of the word "zero." The need for values of M > 1 is clearly seen in the histogram of the first parameter (the first cepstral component), which is inherently multimodal; similarly, the second, fourth, and eighth cepstral parameters show the need for more than a single Gaussian component to provide good fits to the empirical data. Many of the other parameters appear to be well fitted by a single Gaussian; in some cases, however, even M = 5 mixtures do not provide a sufficiently good fit.
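To make the diagonal-covariance mixture density concrete, the following sketch (an illustration added here, with assumed array shapes) evaluates log b_j(o) for one state as the log of a sum of M diagonal-covariance Gaussian components weighted by the mixture gains c_jm.

```python
import numpy as np

def log_mixture_density(obs, weights, means, diag_vars):
    """log b_j(o) for one state: a mixture of M diagonal-covariance Gaussians.

    obs:       (D,)   observation vector (e.g., 8 cepstral coefficients + log energy)
    weights:   (M,)   mixture gains c_jm (summing to 1)
    means:     (M, D) mixture means mu_jm
    diag_vars: (M, D) diagonal covariance terms U_jm(r, r)
    """
    D = obs.shape[0]
    # Per-mixture Gaussian log densities with diagonal covariance.
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(diag_vars), axis=1))
    log_expo = -0.5 * np.sum((obs - means) ** 2 / diag_vars, axis=1)
    log_comp = np.log(weights) + log_norm + log_expo
    # Log-sum-exp over the M mixture components.
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))
```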
Another experimentally verified fact about the HMM is that it is important to limit some of the parameter estimates to prevent them from becoming too small. For example, for the discrete symbol models, the constraint that b_j(k) be greater than or equal to some minimum value ε is necessary to ensure that even when the kth symbol never occurred in some state j in the training observation set, there is always a finite probability of its occurrence when scoring an unknown observation set. To illustrate this point, Figure 6.16 shows a curve of average word error rate versus the parameter ε (on a log scale) for a standard word-recognition experiment. It can be seen that over a very broad range (10^-10 ≤ ε ≤ 10^-3) the average error rate remains at about a constant value; however, when ε is set to 0 (i.e., 10^-∞), the error rate increases sharply.
Figure 6.16 Average word error rate as a function of the minimum discrete density value ε (after Rabiner et al. [18]).

Similarly, for continuous densities it is important to constrain the mixture gains c_jm as well as the diagonal covariance coefficients U_jm(r, r) to be greater than or equal to some minimum value (we use 10^-4 in all cases) [18].
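A minimal sketch of the flooring constraints just described; renormalizing after clipping is an assumption of the sketch (the text only requires that the estimates be at least the minimum value).

```python
import numpy as np

def floor_discrete_density(b_j, eps=1e-10):
    """Constrain b_j(k) >= eps so unseen symbols keep nonzero probability, then renormalize."""
    b_j = np.maximum(np.asarray(b_j, dtype=float), eps)
    return b_j / b_j.sum()

def floor_mixture_params(c_j, diag_vars, eps=1e-4):
    """Apply the analogous constraint to mixture gains c_jm and diagonal covariances U_jm(r, r)."""
    c_j = np.maximum(np.asarray(c_j, dtype=float), eps)
    c_j /= c_j.sum()
    diag_vars = np.maximum(np.asarray(diag_vars, dtype=float), eps)
    return c_j, diag_vars
```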
6.15.2 Segmental K-Means Segmentation into States
In this chapter, we have emphasized that good initial estimates of the parameters of the b_j(o_t) densities are essential for rapid and proper convergence of the reestimation formulas. Hence a procedure for providing good initial estimates of these parameters was devised; it is shown in Figure 6.17. The training procedure is a variant on the well-known K-means iterative procedure for clustering data.
We assume that we have a training set of observations (the same as is required for parameter reestimation), and an initial estimate of all model parameters. However, unlike the one required for reestimation, the initial model estimate can be chosen randomly or on the basis of any available model appropriate to the data.
Following model initialization, the set of training observation sequences is segmented into states, based on the current model λ. This segmentation is achieved by finding the optimum state sequence via the Viterbi algorithm, and then backtracking along the optimal path. This procedure is illustrated in Figure 6.18, which shows a log-energy plot, an accumulated log-likelihood plot, and a state segmentation for one occurrence of the word "six." The states correspond roughly to the sounds in the spoken word "six."
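The segmentation step can be sketched as a log-domain Viterbi pass over one training sequence followed by backtracking; the precomputed array log_B[t, j] = log b_j(o_t) and the (left-right) log transition matrix log_A are assumed inputs of the sketch, not a prescribed interface.

```python
import numpy as np

def viterbi_segment(log_A, log_pi, log_B):
    """Return the optimal state sequence for one observation sequence.

    log_A:  (N, N) log transition probabilities
    log_pi: (N,)   log initial state probabilities
    log_B:  (T, N) log_B[t, j] = log b_j(o_t), precomputed for each frame
    """
    T, N = log_B.shape
    delta = np.full((T, N), -np.inf)
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A            # (N, N): predecessor x successor
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(N)] + log_B[t]
    # Backtrack along the optimal path to get a state index for every frame.
    states = np.zeros(T, dtype=int)
    states[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return states
```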
The result of segmenting each of the training sequences is, for each of the N states, a maximum likelihood estimate of the set of the observations that occur within each state j according to the current model. In the case where we are using discrete symbol densities, each of the observation vectors within a state is coded using the M-code-word codebook, and the updated estimate of the b_j(k) parameters is

b_j(k) = number of vectors with codebook index k in state j divided by the number of vectors in state j.
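In code, this discrete update is just a normalized count of codebook indices per state; the sketch below assumes states[t] is the Viterbi-assigned state and codes[t] the codebook index of frame t, and applies the floor discussed in Section 6.15.1.

```python
import numpy as np

def update_discrete_b(states, codes, num_states, codebook_size, eps=1e-10):
    """b_j(k) = (# frames assigned to state j with codebook index k) / (# frames in state j)."""
    B = np.zeros((num_states, codebook_size))
    for j, k in zip(states, codes):
        B[j, k] += 1.0
    B = np.maximum(B, eps)                    # floor, as discussed in Section 6.15.1
    return B / B.sum(axis=1, keepdims=True)
```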
Figure 6.17 The segmental k-means training procedure used to estimate parameter values for the optimal continuous mixture density fit to a finite number of observation sequences (after Rabiner et al. [38]): the loop cycles through model initialization, state sequence segmentation of the training data, estimation of the B(·) parameters via segmental K-means, and model reestimation until convergence.

When we are using continuous observation densities, a segmental K-means procedure is used to cluster the observation vectors within each state j into a set of M clusters (using a Euclidean distortion measure), where each cluster represents one of the M mixtures of the b_j(o_t) density. From the clustering, an updated set of model parameters is derived as follows:

c_jm = number of vectors classified in cluster m of state j divided by the number of vectors in state j
μ_jm = sample mean of the vectors classified in cluster m of state j
U_jm = sample covariance matrix of the vectors classified in cluster m of state j.
Based on this state segmentation, updated estimates of the a_ij coefficients can be obtained by counting the number of transitions from state i to state j and dividing it by the number of transitions from state i to any state (including itself).
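A sketch of the continuous-density updates just described, assuming the frames of all training sequences have already been grouped by their Viterbi-assigned state; scikit-learn's KMeans is used here as a stand-in for the Euclidean clustering step, and a variance floor is applied as in Section 6.15.1.

```python
import numpy as np
from sklearn.cluster import KMeans

def update_state_mixtures(frames_by_state, num_mixtures, var_floor=1e-4):
    """For each state j, cluster its frames into M mixtures and reestimate c_jm, mu_jm, diag U_jm."""
    params = []
    for X in frames_by_state:                              # X: (T_j, D) array of frames in state j
        labels = KMeans(n_clusters=num_mixtures, n_init=10).fit(X).labels_
        c, mu, var = [], [], []
        for m in range(num_mixtures):
            Xm = X[labels == m]
            c.append(len(Xm) / len(X))                     # c_jm
            mu.append(Xm.mean(axis=0))                     # sample mean mu_jm
            var.append(np.maximum(Xm.var(axis=0), var_floor))   # diagonal of U_jm, floored
        params.append((np.array(c), np.array(mu), np.array(var)))
    return params

def update_transitions(state_sequences, num_states):
    """a_ij = (# transitions i -> j) / (# transitions out of i), counted over all segmented sequences."""
    A = np.zeros((num_states, num_states))
    for seq in state_sequences:
        for i, j in zip(seq[:-1], seq[1:]):
            A[i, j] += 1.0
    return A / np.maximum(A.sum(axis=1, keepdims=True), 1.0)
```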
An updated model λ̄ is obtained from the new model parameters, and the formal reestimation procedure is used to reestimate all model parameters.
Figure 6.18 Plots of (a) log energy, (b) accumulated log likelihood, and (c) state assignment versus frame number for one occurrence of the word "six" (after Rabiner et al. [38]).

The resulting model is then compared to the previous model by computing a distance score that reflects the statistical similarity of the HMMs. If the model distance score exceeds a threshold, then the old model λ is replaced by the new (reestimated) model λ̄, and the overall training loop is repeated. If the model distance score falls below the threshold, then model convergence is assumed and the final model parameters are saved.
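The overall loop of Figure 6.17 can be summarized as in the sketch below; segment_and_update, reestimate, and model_distance are placeholder callables standing in for the segmental K-means step, the formal reestimation formulas, and the model distance score (e.g., the probabilistic distance of [29]).

```python
def segmental_kmeans_training(model, training_sequences,
                              segment_and_update, reestimate, model_distance,
                              threshold=0.01, max_iterations=20):
    """Iterate segmentation + reestimation until the distance between successive models
    falls below a convergence threshold (see Figure 6.17)."""
    for _ in range(max_iterations):
        candidate = segment_and_update(model, training_sequences)   # segmental K-means step
        candidate = reestimate(candidate, training_sequences)       # formal reestimation
        if model_distance(model, candidate) < threshold:
            return candidate                                        # convergence assumed
        model = candidate                                           # replace old model and repeat
    return model
```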
6.15.3 Incorporation of State Duration into the HMM
In Section 6.9 we discussed the theoretically correct method of incorporating state duration information into the mechanics of the HMM [39]. We also showed that the cost of including duration density was rather high, namely a D²-fold increase in computation and a D-fold increase in storage. Using a value of D = 25 (as is required for word recognition), the cost of the increased computation tended to make the techniques not worth using. Thus the following alternative procedure was formulated for incorporating state duration information into the HMM.
For this alternative procedure, the state duration probability p_j(d) was measured directly from the segmented training sequences used in the segmental K-means procedure of the previous section. Hence the estimates of p_j(d) are strictly heuristic ones. A typical set of histograms of p_j(d) for a five-state model of the word "six" is shown in Figure 6.19. (In this figure the histograms are plotted versus normalized duration (d/T), rather than absolute duration d.) The first two states account for the initial /s/ in "six"; the third state accounts for the transition to the vowel /i/; the fourth state accounts for the vowel; and the fifth state accounts for the stop and the final /s/ sound.
Figure 6.19 Histograms of the normalized duration density (versus d/T) for the five states of the digit "six" (after Rabiner et al. [38]).

The way in which the heuristic duration densities were used in the recognizer was as follows. First the normal Viterbi algorithm is used to give the best segmentation of the observation sequence of the unknown word into states, via a backtracking procedure. The duration of each state is then measured from the state segmentation. A postprocessor then increments the log-likelihood score of the Viterbi algorithm by the quantity

$$\log \hat{P}(q, O \mid \lambda) = \log P(q, O \mid \lambda) + \alpha_d \sum_{j=1}^{N} \log \left[ p_j(d_j) \right] \qquad (6.123)$$
where α_d is a scaling multiplier on the state duration scores, and d_j is the duration of state j along the optimal path as determined by the Viterbi algorithm. The incremental cost of the postprocessor for duration is essentially negligible, and experience has shown that recognition performance is essentially as good as that obtained using the theoretically correct duration model.
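A sketch of the duration postprocessor of Eq. (6.123): given the Viterbi log likelihood, the backtracked state sequence, and the measured histograms of normalized duration, the score is incremented by the weighted sum of log duration probabilities. The binning of d/T into histogram cells is an assumption of the sketch.

```python
import numpy as np

def duration_corrected_score(viterbi_log_prob, states, duration_hist, alpha_d=1.0, eps=1e-10):
    """Apply Eq. (6.123): add alpha_d * sum_j log p_j(d_j) to the Viterbi log likelihood.

    states:        per-frame state indices from Viterbi backtracking
    duration_hist: (N, num_bins) histograms of normalized duration d/T for each state
    """
    states = np.asarray(states)
    T = len(states)
    num_states, num_bins = duration_hist.shape
    # Measure each state's duration d_j from the segmentation.
    durations = np.array([np.sum(states == j) for j in range(num_states)])
    correction = 0.0
    for j, d in enumerate(durations):
        bin_idx = min(int(num_bins * d / T), num_bins - 1)   # bin the normalized duration d/T
        correction += np.log(duration_hist[j, bin_idx] + eps)
    return viterbi_log_prob + alpha_d * correction
```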
6.15.4 HMM Isolated-Digit Performance
We conclude this section on isolated word recognition using HMMs by giving a set of performance results (in terms of average word error rate) on the task of recognizing isolated digits in a speaker-independent manner. For this task, a training set consisting of 100 occurrences of each digit by 100 talkers (i.e., a single occurrence of each digit per talker) was used. Half the talkers were male and half were female.
TABLE 6.1. Average Digit Error Rates for Several Recognizers and Evaluation Sets

  Recognizer                       Evaluation Set
  Type           Original Training    TS2      TS3      TS4
  LPC/DTW               0.1           0.2      2.0      1.1
  LPC/DTW/VQ             -            3.5       -        -
  HMM/VQ                 -            3.7       -        -
  HMM/CD                0             0.2      1.3      1.8
  HMM/AR                0.3           1.8      3.4      4.1

For testing the algorithm, we used the initial training set, as well as three other independent test sets with the following characteristics:
TS2   The same 100 talkers as were used in the training; 100 occurrences of each digit
TS3   A new set of 100 talkers (50 male, 50 female); 100 occurrences of each digit
TS4   Another new set of 100 talkers (50 male, 50 female); 100 occurrences of each digit

The results of the recognition tests are given in Table 6.1. The recognizers are the following:
LPC/DTW       Conventional template-based recognizer using dynamic time warping (DTW) alignment
LPC/DTW/VQ    Conventional recognizer with vector quantization of the feature vectors (M = 64)
HMM/VQ        HMM recognizer with an M = 64 codebook
HMM/CD        HMM recognizer using a continuous density model with M = 5 mixtures per state
HMM/AR        HMM recognizer using autoregressive observation densities

It can be seen that, when using a VQ, the performance of the isolated word recognizer degrades in both the conventional and HMM modes. It can also be seen that the performances of the conventional template-based recognizer and the HMM recognizer with a continuous density model are comparable. Finally, Table 6.1 shows that the autoregressive density HMM gives poorer performance than the standard mixture density model.
6.16 SUMMARY
In this chapter we have attempted to present the theory of hidden Markov models from the simplest concepts (discrete Markov chains) to the most sophisticated models (variable duration, continuous density models).
It has been our purpose to focus on physical explanations of the basic mathematics; hence we have avoided long, drawn-out proofs or derivations of the key results, and concentrated primarily on trying to interpret the meaning of the math and how it could be implemented in practice in real-world systems. We have also attempted to illustrate one application of the theory of HMMs to a simple problem in speech recognition, namely isolated word recognition.
REFERENCES
[1] L.E. Baum and T. Petrie, "Statistical inference for probabilistic functions of finite state Markov chains," Ann. Math. Stat., 37: 1554-1563, 1966.
[2] L.E. Baum and J.A. Eagon, "An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology," Bull. Amer. Meteorol. Soc., 73: 360-363, 1967.
[3] L.E. Baum and G.R. Sell, "Growth functions for transformations on manifolds," Pac. J. Math., 27 (2): 211-227, 1968.
[4] L.E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Ann. Math. Stat., 41 (1): 164-171, 1970.
[5] L.E. Baum, "An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes," Inequalities, 3: 1-8, 1972.
[6] J.K. Baker, "The DRAGON system-An overview," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-23 (1): 24-29, February 1975.
[7] F. Jelinek, "A fast sequential decoding algorithm using a stack," IBM J. Res. Develop., 13: 675-685, 1969.
[8] L.R. Bahl and F. Jelinek, "Decoding for channels with insertions, deletions, and substitutions with applications to speech recognition," IEEE Trans. Information Theory, IT-21: 404-411, 1975.
[9] F. Jelinek, L.R. Bahl, and R.L. Mercer, "Design of a linguistic statistical decoder for the recognition of continuous speech," IEEE Trans. Information Theory, IT-21: 250-256, 1975.
[10] F. Jelinek, "Continuous speech recognition by statistical methods," Proc. IEEE, 64: 532-536, April 1976.
[11] R. Bakis, "Continuous speech word recognition via centisecond acoustic states," in Proc. ASA Meeting (Washington, DC), April 1976.
[12] F. Jelinek, L.R. Bahl, and R.L. Mercer, "Continuous speech recognition: Statistical methods," in Handbook of Statistics, II, P.R. Krishnaiah, Ed. Amsterdam, The Netherlands: North-Holland, 1982.
[13] L.R. Bahl, F. Jelinek, and R.L. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Trans. Pattern Anal. Machine Intell., PAMI-5: 179-190, 1983.
[14] J.D. Ferguson, "Hidden Markov Analysis: An Introduction," in Hidden Markov Models for Speech, Institute for Defense Analyses, Princeton, NJ, 1980.
[15] A.J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimal decoding algorithm," IEEE Trans. Information Theory, IT-13: 260-269, April 1967.
[16] G.D. Forney, "The Viterbi algorithm," Proc. IEEE, 61: 268-278, March 1973.
[17] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Stat. Soc., 39 (1): 1-38, 1977.
[18] S.E. Levinson, L.R. Rabiner, and M.M. Sondhi, "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition," Bell System Tech. J., 62 (4): 1035-1074, April 1983.
[19] L.A. Liporace, "Maximum likelihood estimation for multivariate observations of Markov sources," IEEE Trans. Information Theory, IT-28 (5): 729-734, 1982.
[20] B.H. Juang, "Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT&T Tech. J., 64 (6): 1235-1249, July-Aug. 1985.
[21] B.H. Juang, S.E. Levinson, and M.M. Sondhi, "Maximum likelihood estimation for multivariate mixture observations of Markov chains," IEEE Trans. Information Theory, IT-32 (2): 307-309, March 1986.
[22] A.B. Poritz, "Linear predictive hidden Markov models and the speech signal," in Proc. ICASSP 82 (Paris, France), 1291-1294, May 1982.
[23] B.H. Juang and L.R. Rabiner, "Mixture autoregressive hidden Markov models for speech signals," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-33 (6): 1404-1413, December 1985.
[24] M.J. Russell and R.K. Moore, "Explicit modeling of state occupancy in hidden Markov models for automatic speech recognition," in Proc. ICASSP 85 (Tampa, FL), 5-8, March 1985.
[25] S.E. Levinson, "Continuously variable duration hidden Markov models for automatic speech recognition," Computer Speech and Language, 1 (1): 29-45, March 1986.
[26] L.R. Bahl, P.F. Brown, P.V. deSouza, and R.L. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proc. ICASSP 86 (Tokyo, Japan), 49-52, April 1986.
[27] Y. Ephraim, A. Dembo, and L.R. Rabiner, "A minimum discrimination information approach for hidden Markov modeling," IEEE Trans. Information Theory, 35 (5): 1001-1003, September 1989.
[28] Y. Ephraim and L. Rabiner, "On the Relations Between Modeling Approaches for Speech Recognition," IEEE Trans. Information Theory, 36 (2): 372-380, March 1990.
[29] B.H. Juang and L.R. Rabiner, "A probabilistic distance measure for hidden Markov models," AT&T Tech. J., 64 (2): 391-408, February 1985.
[30] F. Jelinek and R.L. Mercer, "Interpolated estimation of Markov source parameters from sparse data," in Pattern Recognition in Practice, E.S. Gelsema and L.N. Kanal, Eds. Amsterdam, The Netherlands: North-Holland, 1980, 381-397.
[31] C.-H. Lee, C.-H. Lin, and B.H. Juang, "A study on speaker adaptation of the parameters of continuous density hidden Markov models," IEEE Trans. Signal Processing, 39 (4): 806-841, April 1991.
[32] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, New York, 1973.
[33] L.R. Bahl, P.F. Brown, P.V. deSouza, and R.L. Mercer, "A new algorithm for the estimation of hidden Markov model parameters," Proc. ICASSP 88, 493-496, New York, April 1988.
[34] S. Katagiri, C.H. Lee, and B.H. Juang, "New Discriminative Training Algorithms Based on the Generalized Probabilistic Descent Method," Proc. 1991 Workshop Neural Networks for Signal Processing, 299-308, IEEE Signal Processing Society, Princeton, September 1991.
[35] Y. Linde, A. Buzo, and R.M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Communications, COM-28: 84-95, January 1980.
[36] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees, Wadsworth & Brooks, Monterey, CA, 1984.
[37] K.F. Lee, Automatic Speech Recognition-The Development of the SPHINX System, Kluwer Academic Publishers, Boston, 1989.
[38] L.R. Rabiner, B.H. Juang, S.E. Levinson, and M.M. Sondhi, "Recognition of Isolated Digits Using Hidden Markov Models With Continuous Mixture Densities," AT&T Tech. J., 64 (6): 1211-1234, July-Aug. 1985.
[39] L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, 77 (2): 257-286, February 1989.
Chapter 7

SPEECH RECOGNITION BASED ON CONNECTED WORD MODELS
7.1 INTRODUCTION
Up to this point we have been discussing discrete utterance (or, as it is often called, isolated word or phrase) recognition. The assumption was that the speech to be recognized comprised a single word or phrase and was to be recognized as a complete entity with no explicit knowledge or regard for the phonetic content of the word or phrase. Hence, for a vocabulary of V words (or phrases), the recognition algorithm consisted of matching (via time alignment) the measured sequence of spectral vectors of the unknown spoken input against each of the set of spectral patterns for the V words and selecting the pattern whose accumulated time-aligned spectral distance was smallest as the recognized word. Another implicit assumption was that each spoken utterance had a clearly defined beginning and ending point that could be found using some type of speech endpoint detector. As a result, pattern matching could be reliably performed without having to be concerned about uncertainties in the endpoints of the patterns being compared. For many applications, notably those referred to as "command-and-control" applications, in which the user is required to speak the command words one at a time (i.e., with distinct pauses between command words), this paradigm works well and is entirely appropriate. (We will discuss command-and-control applications in Chapter 9.) However, for applications in which the speech to be recognized consists of a sequence of words from the recognition vocabulary (e.g., strings of digits), such a paradigm is often unacceptable for practical reasons.
Figure 7.1 Illustration of an isolated string of digits (upper panel) and a fluently spoken version of the same digit string (lower panel); log energy versus time (sec).

Figure 7.1 illustrates this point for a three-digit sequence: the upper panel shows the log energy contour of a digit string spoken as a sequence of isolated digits, where each digit is followed by a discernible pause (i.e., a period of no speaking), and the lower panel shows the same digit string spoken in a fluent (continuous) manner. It should be self-evident that speaking a sequence of isolated digits is unnatural (from the point of view of human speech production) and grossly inefficient (it takes about 2.5 seconds to speak the isolated digit sequence versus less than 1 second to speak the same string of digits in a fluent manner). Hence, it is important to be able to extend the techniques described in earlier chapters of this book so that accurate recognition of fluent speech becomes possible.
From the point of view of speech-recognition algorithms, there are two interesting classes of fluent speech strings. One class is the set of strings derived from small-to-moderate-size vocabularies, including digit strings, spelled letter sequences, combinations of alphanumerics, and strings for accessing limited databases based on small-to-moderate-size vocabularies and highly constrained word syntax. This set has the property that the basic speech-recognition unit can be the word (or phrase), much as in the case of isolated word- (or phrase-) recognition systems. The other class is the set of continuous speech drawn from moderate-to-large-size vocabularies where the basic speech-recognition unit cannot be the word because of complexity constraints. In such a case, subword speech units are necessary for the implementation of the speech-recognition system. We defer a discussion of the second class of speech strings to Chapter 8, where we provide a complete description of techniques used in the recognition of large vocabulary continuous speech.
Figure 7.2 Illustration of the connected word-recognition problem.

In this chapter we discuss in depth the class of speech-recognition systems commonly referred to as connected word speech recognizers, because the basic unit of recognition is a whole word.
We can now formulate the basic problem in connected word recognition. To do this we refer to Figure 7.2, which gives a pictorial description of the connected word-recognition problem. We assume that we are given the spectral vectors from a fluently spoken string of words, T = {t(1), t(2), ..., t(M)}, and we are also given the spectral patterns for each of V reference patterns (R_1 to R_V) corresponding to V unique words in the vocabulary. The connected word-recognition problem can be compactly stated as follows:
    Given a fluently spoken sequence of words, how can we determine the optimum match in terms of a concatenation of word reference patterns?

To solve the connected word-recognition problem we must resolve the following problems:
1. We don't usually know the number of words, L, in the string (although we usually have a good bound on the range, e.g., one to seven words).
2. We don't know word utterance boundaries within the spoken string; i.e., except for the beginning of the first word in the string and the end of the last word in the string, we don't know precisely where any word begins or ends.
3. The word boundaries are often fuzzy or nonunique. That is, it is often difficult, if not impossible, to specify (i.e., find accurately and automatically) the word boundaries because of sound coarticulation. Thus, for example, the boundary between the digit 3 and the digit 8 in Figure 7.1 is fuzzy because the ending sound /i/ in 3 coarticulates strongly with the initial sound /ey/ in 8. We indicate such fuzziness of the boundaries by the squiggly lines in Figure 7.2.
4. For a set of V word-reference patterns, and for a given value of L (the number of words in the string), there are V^L possible combinations of composite matching patterns (possible L-word sequences); for anything but extremely small values of V and L, the exponential number of composite matching patterns implies that the connected word-recognition problem cannot be solved by exhaustive means.

The bottom line is that a nonexhaustive matching algorithm is required to solve the connected word-recognition problem. In this chapter we discuss several algorithms that have this property and are capable of solving the connected word-recognition problem with various degrees of efficiency.
Although we will spend a great deal of time in this chapter discussing efficient algorithms for optimal matching of sequences of word patterns, it is also important to address the problem of word reference pattern training using connected word sequences. Unlike the isolated word case, for which training procedures were discussed in Chapters 5 and 6, for connected word sequence training we encounter a fundamental difficulty, namely the uncertainty in word boundaries within the spoken word sequence. This makes the connected word sequence training problem somewhat more difficult; therefore in this chapter we also describe an embedded word training method based on the segmental k-means algorithm.
7.2 GENERAL NOTATION FOR THE CONNECTED WORD-RECOGNITION PROBLEM
We denote the spectral sequence of vectors of the test pattern as

$$T = \{t(1), t(2), \ldots, t(M)\} = \{t(m)\}_{m=1}^{M} \qquad (7.1)$$
in which each t(m) is an appropriate spectral vector (e.g., filter bank, LPC, ear model, etc.). Similarly, we denote the set of word reference patterns (either templates or statistical models) as R_i, 1 ≤ i ≤ V, for a V-word vocabulary, where each pattern is of the form

$$R_i = \{r_i(1), r_i(2), \ldots, r_i(N_i)\} \qquad (7.2)$$

where N_i is the duration (in frames or states) of the ith word reference pattern. (For convenience we will develop the algorithms on the basis of using word templates; we will see later in this chapter that the required modifications for statistical models are essentially trivial.)
The connected word-recognition problem can now be stated as a problem in finding the optimum sequence of word reference patterns, R*, that best matches T. With no loss of generality we assume that there are L word patterns in R* (where we vary L from the minimum to the maximum possible values and take the optimum over all values of L). Hence the best sequence of reference patterns, R*, is a concatenation of L reference patterns, i.e.,

$$R^{*} = R_{q^{*}(1)} \oplus R_{q^{*}(2)} \oplus \cdots \oplus R_{q^{*}(L)} \qquad (7.3)$$

in which each index q*(ℓ) is in the range [1, V].
To determine R*, that is, to determine the sequence of word indices q*(ℓ), 1 ≤ ℓ ≤ L, that gives the best matching sequence, consider constructing an arbitrary "super-reference" pattern R^s of the form

$$R^{s} = R_{q(1)} \oplus R_{q(2)} \oplus R_{q(3)} \oplus \cdots \oplus R_{q(L)} = \{r^{s}(n)\}_{n=1}^{N^{s}}$$

in which N^s is the total duration of the concatenated
