[Figure 6.13 (block diagram): speech signal → LPC feature analysis (vector quantization) → observation sequence O → probability P(O | λv) computation for each of the V word HMMs → select maximum, v* = argmax_{1≤v≤V} P(O | λv) → index of recognized word.]
`
Figure 6.13 Block diagram of an isolated word HMM recognizer (after Rabiner [38]).
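As Figure 6.13 suggests, recognition reduces to scoring the unknown observation sequence against each word model and selecting the highest-likelihood word. A minimal sketch of that decision rule is given below; Python/NumPy is assumed, and `log_likelihood` is a hypothetical placeholder for whichever P(O | λv) computation the system uses (e.g., the forward algorithm or a Viterbi score):

```python
import numpy as np

def recognize_word(observations, word_models, log_likelihood):
    """Pick the word whose HMM gives the highest log-likelihood.

    observations : feature vectors (or VQ symbols) for the unknown utterance.
    word_models  : list of V trained HMMs, one per vocabulary word.
    log_likelihood(obs, model) : caller-supplied scoring function, e.g. the
        forward algorithm or the Viterbi score (a placeholder here).
    Returns the index v* of the recognized word.
    """
    scores = np.array([log_likelihood(observations, lam)
                       for lam in word_models])
    return int(np.argmax(scores))      # v* = argmax_v P(O | lambda_v)
```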
`
`6.15.1 Choice of Model Parameters
`
We now return to the issue that we have raised several times in this chapter, namely: how
do we select the type of model, and how do we choose the parameters of the selected
`model? For isolated word recognition with a distinct HMM designed for each word in the
`vocabulary, it should be clear that a left-right model is more appropriate than an ergodic
`model, since we can then associate time with model states in a fairly straightforward
`manner. Furthermore, we can envision the physical meaning of the model states as distinct
`sounds (e.g., phonemes, syllables) of the word being modeled.
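To make the left-right (Bakis) structure concrete, the sketch below initializes a transition matrix that allows only self-loops and forward jumps of at most two states. The particular parameterization (uniform initial probabilities, a maximum jump of two) is an assumption for illustration, not a prescription from the text:

```python
import numpy as np

def left_right_transitions(num_states, max_jump=2):
    """Initial A for a left-right (Bakis) model: a_ij = 0 for j < i and for
    j > i + max_jump, so the state index can only stay the same or advance."""
    A = np.zeros((num_states, num_states))
    for i in range(num_states):
        j_hi = min(i + max_jump, num_states - 1)
        n_allowed = j_hi - i + 1
        A[i, i:j_hi + 1] = 1.0 / n_allowed   # uniform over the allowed moves
    return A

# Example: a 5-state word model
print(left_right_transitions(5))
```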
`The issue of the number of states to use in each word model leads to two schools
`of thought. One idea is to let the number of states correspond roughly to the number of
`sounds (phonemes) within the word-hence, models with from 2 to 10 states would be
`appropriate [18]. The other idea is to let the number of states correspond roughly to the
average number of observations in a spoken version of the word, the so-called Bakis model [11].
`
`
Figure 6.14 Average word error rate (for a digits vocabulary) versus the number
of states N in the HMM (after Rabiner et al. [18]).
`
In this manner each state corresponds to an observation interval, i.e., about 10-15 ms
for the standard methods of analysis. In the results to be described later in this section,
`we use the former approach. Furthermore, we restrict each word model to have the same
`number of states; this implies that the models will work best when they represent words
`with the same number of sounds.
`To illustrate the effect of varying the number of states in a word model, Figure 6.14
`shows a plot of average word error rate versus N, for the case of recognition of isolated
digits (i.e., a 10-word vocabulary). It can be seen that the error is somewhat insensitive to
`N, achieving a local minimum at N = 6; however, differences in error rate for values of N
`close to 6 are small.
`The next issue is the choice of observation vector and the way it is represented. As
`discussed in Chapter 3, possibilities include LPC-derived weighted cepstral coefficients and
`weighted cepstral derivatives or (for autoregressive HMMs) the autocorrelation coefficients
`as the observation vectors for continuous models; for discrete symbol models we use a
`codebook to generate the discrete symbols. For the continuous models we use as many
as M = 64 to 256 mixtures per state; for the discrete symbol models we use codebooks
with as many as M = 512 to 1024 codewords. Also, for the continuous models, it has
`been found that it is more convenient and sometimes preferable to use diagonal covariance
`matrices with several mixtures, rather than fewer mixtures with full covariance matrices.
The reason for this is simple, namely the difficulty in performing reliable reestimation of
the off-diagonal components of the covariance matrix from the necessarily limited training
`data. Figure 6.15 illustrates the need for using mixture densities for modeling observation
`vectors (eighth-order cepstral vectors derived from LPC with log energy appended as the
`ninth vector component). Figure 6.15 shows a comparison of marginal distributions bj(o)
`against a histogram of the actual observations within a state (as determined by a maximum
`likelihood segmentation of all the training observations into states). The observation vectors
`
`
[Figure 6.15 panels: histograms labeled "WORD: ZERO, STATE 1"; horizontal axis: parameter range; vertical axis: count.]
`
Figure 6.15 Comparison of estimated density (jagged contour) and model density (smooth contour)
for each of the nine components of the observation vector (eight cepstral components, one log energy
component) for state 1 of the digit zero (after Rabiner et al. [38]).
`
`are ninth order, and the model density uses M = 5 mixtures. The covariance matrices are
`constrained to be diagonal for each individual mixture. The results of Figure 6.15 are for
the first model state of the word "zero." The need for values of M > 1 is clearly seen
`in the histogram of the first parameter (the first cepstral component) which is inherently
`multimodal; similarly, the second, fourth, and eighth cepstral parameters show the need for
`more than a single Gaussian component to provide good fits to the empirical data. Many of
`the other parameters appear to be well fitted by a single Gaussian; in some cases, however,
`even M = 5 mixtures do not provide a sufficiently good fit.
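As a concrete illustration of the diagonal-covariance mixture densities discussed above, the sketch below evaluates log bj(o) for one state as a weighted sum of M diagonal Gaussians. It is a minimal sketch; the parameter names and array shapes are illustrative assumptions, not taken from the text:

```python
import numpy as np

def log_mixture_density(o, c, mu, var):
    """log b_j(o) for one state modeled by M diagonal-covariance Gaussians.

    o   : (D,)   observation vector (e.g., eight cepstra plus log energy, D = 9)
    c   : (M,)   mixture gains c_jm (sum to 1)
    mu  : (M, D) mixture mean vectors
    var : (M, D) diagonal covariance terms U_jm(r, r)
    """
    # log N(o; mu_m, diag(var_m)) for every mixture m, using broadcasting
    log_gauss = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (o - mu) ** 2 / var,
                              axis=1)
    # log sum_m c_m N(o; ...), computed stably in the log domain
    log_terms = np.log(c) + log_gauss
    peak = np.max(log_terms)
    return peak + np.log(np.sum(np.exp(log_terms - peak)))
```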
`Another experimentally verified fact about the HMM is that it is important to limit
`some of the parameter estimates to prevent them from becoming too small. For example,
for the discrete symbol models, the constraint that bj(k) be greater than or equal to some
minimum value ε is necessary to ensure that even when the kth symbol never occurred
in some state j in the training observation set, there is always a finite probability of its
occurrence when scoring an unknown observation set. To illustrate this point, Figure 6.16
shows a curve of average word error rate versus the parameter ε (on a log scale) for
a standard word-recognition experiment. It can be seen that over a very broad range
(10^-10 < ε < 10^-3) the average error rate remains at about a constant value; however,
`
`
Figure 6.16 Average word error rate as a function of the minimum discrete density
value ε (after Rabiner et al. [18]).
`
when ε is set to 0 (i.e., 10^-∞), then the error rate increases sharply. Similarly, for continuous
densities it is important to constrain the mixture gains cjm as well as the diagonal covariance
coefficients Ujm(r, r) to be greater than or equal to some minimum values (we use 10^-4 in
all cases) [18].
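A minimal sketch of this kind of flooring for a discrete output distribution of one state is shown below; the renormalization step is an assumption about how the constraint is typically enforced, since the text only states that bj(k) ≥ ε is imposed:

```python
import numpy as np

def floor_discrete_density(b_j, eps=1e-10):
    """Enforce b_j(k) >= eps for one state's discrete output distribution,
    then renormalize so the probabilities still sum to one (an assumption)."""
    b_j = np.maximum(np.asarray(b_j, dtype=float), eps)
    return b_j / b_j.sum()

# Example: a symbol that never occurred in training keeps a small probability
print(floor_discrete_density([0.5, 0.5, 0.0, 0.0]))
```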
`
`6.15.2 Segmental K-Means Segmentation into States
`
`In this chapter, we have emphasized that good initial estimates of the parameters of the
bj(ot) densities are essential for rapid and proper convergence of the reestimation formulas.
`Hence a procedure for providing good initial estimates of these parameters was devised and
`is shown in Figure 6.17. The training procedure is a variant on the well-known K-means
`iterative procedure for clustering data.
`We assume that we have a training set of observations (the same as is required for
`parameter reestimation), and an initial estimate of all model parameters. However, unlike
`the one required for reestimation, the initial model estimate can be chosen randomly or on
`the basis of any available model appropriate to the data.
`Following model initialization, the set of training observation sequences is segmented
into states, based on the current model λ. This segmentation is achieved by finding the
`optimum state sequence, via the Viterbi algorithm, and then backtracking along the optimal
`path. This procedure is illustrated in Figure 6.18, which shows a log-energy plot, an
`accumulated log-likelihood plot, and a state segmentation for one occurrence of the word
`"six." The states correspond roughly to the sounds in the spoken word "six."
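The segmentation step can be sketched as a standard Viterbi pass with backtracking; the version below works in the log domain and returns the optimal state for every frame. It is a generic textbook sketch, not the exact implementation used in [38]:

```python
import numpy as np

def viterbi_segment(log_A, log_pi, log_B):
    """Most likely state sequence for one training observation sequence.

    log_A  : (N, N) log transition probabilities
    log_pi : (N,)   log initial-state probabilities
    log_B  : (T, N) log b_j(o_t) for every frame t and state j
    Returns (best_path, best_log_likelihood).
    """
    T, N = log_B.shape
    delta = np.full((T, N), -np.inf)
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A      # entry (i, j): come from i
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(N)] + log_B[t]
    # Backtrack along the optimal path
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, float(delta[-1, path[-1]])
```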
`The result of segmenting each of the training sequences is, for each of the N states,
`a maximum likelihood estimate of the set of the observations that occur within each state j
`according to the current model. In the case where we are using discrete symbol densities,
`each of the observation vectors within a state is coded using the M-code-word codebook,
`
[Figure 6.17 (flowchart): model initialization → state sequence segmentation of the training data → estimate parameters of B(·) via segmental K-means → model reestimation → model convergence test; if not converged, loop back to the segmentation step; otherwise output the model parameters.]
`
`Figure 6.17 The segmental k-means training procedure used to estimate parameter values for the optimal
`continuous mixture density fit to a finite number of observation sequences (after Rabiner et al. [38]).
`
`and the updated estimate of the bj(k) parameters is
bj(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j).
`
`When we are using continuous observation densities, a segmental K-means procedure is
`used to cluster the observation vectors within each state j into a set of M clusters (using
`a Euclidean distortion measure), where each cluster represents one of the M mixtures of
the bj(ot) density. From the clustering, an updated set of model parameters is derived as
`follows:
`
cjm = number of vectors classified in cluster m of state j divided by the number of
      vectors in state j
μjm = sample mean of the vectors classified in cluster m of state j
Ujm = sample covariance matrix of the vectors classified in cluster m of state j.
`Based on this state segmentation, updated estimates of the aij coefficients can be obtained
`by counting the number of transitions from state i to j and dividing it by the number of
`transitions from state i to any state (including itself).
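A minimal sketch of these count-based updates from a Viterbi state segmentation is given below for the discrete-symbol case of bj(k), together with the transition counts for aij; the variable names are illustrative, and the small floor applied to B follows the constraint from Section 6.15.1:

```python
import numpy as np

def update_from_segmentation(state_paths, symbol_seqs, N, M, eps=1e-10):
    """Count-based parameter updates from Viterbi state segmentations.

    state_paths : list of state sequences (one per training utterance)
    symbol_seqs : list of matching VQ codebook-index sequences
    N, M        : number of states and codebook size
    Returns (A, B): a_ij from normalized transition counts, and
    b_j(k) = fraction of vectors in state j coded with codebook index k.
    """
    A_counts = np.zeros((N, N))
    B_counts = np.zeros((N, M))
    for path, symbols in zip(state_paths, symbol_seqs):
        for t, (j, k) in enumerate(zip(path, symbols)):
            B_counts[j, k] += 1
            if t + 1 < len(path):
                A_counts[j, path[t + 1]] += 1
    # Normalize each row of counts (rows never visited stay at zero)
    A = A_counts / np.maximum(A_counts.sum(axis=1, keepdims=True), 1)
    # Floor b_j(k) as in Section 6.15.1, then renormalize
    B = np.maximum(B_counts, eps)
    B = B / B.sum(axis=1, keepdims=True)
    return A, B
```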
An updated model λ̂ is obtained from the new model parameters, and the formal
`reestimation procedure is used to reestimate all model parameters. The resulting model
`is then compared to the previous model (by computing a distance score that reflects the
`
`
Figure 6.18 Plots of (a) log energy; (b) accumulated log likelihood; and (c) state assignment for one
occurrence of the word "six" (after Rabiner et al. [38]).
`
`statistical similarity of the HMMs). If the model distance score exceeds a threshold, then
the old model λ is replaced by the new (reestimated) model λ̂, and the overall training loop
`is repeated. If the model distance score falls below the threshold, then model convergence
`is assumed and the final model parameters are saved.
`
6.15.3 Incorporation of State Duration into the HMM
`
`In Section 6.9 we discussed the theoretically correct method of incorporating state duration
`information into the mechanics of the HMM ([39]). We also showed that the cost of
including duration density was rather high; namely, a D²-fold increase in computation and
`a D-fold increase in storage. Using a value of D = 25 (as is required for word recognition),
`the cost of the increased computation tended to make the techniques not worth using.
`Thus the following alternative procedure was formulated for incorporating state duration
`information into the HMM.
`For this alternative procedure, the state duration probability pj(d) was measured
`directly from the segmented training sequences used in the segmental K-means procedure
`of the previous section. Hence the estimates of pj(d) are strictly heuristic ones. A typical
`set of histograms of pj(d) for a five-state model of the word "six" is shown in Figure 6.19.
(In this figure the histograms are plotted versus normalized duration (d/T), rather than
absolute duration d.) The first two states account for the initial /s/ in "six"; the third state
accounts for the transition to the vowel /i/; the fourth state accounts for the vowel; and the
fifth state accounts for the stop and the final /s/ sound.
`The way in which the heuristic duration densities were used in the recognizer was
`
`
Figure 6.19 Histograms of the normalized duration density for the five states of the digit
"six" (after Rabiner et al. [38]).
`
`as follows. First the normal Viterbi algorithm is used to give the best segmentation of the
`observation sequence of the unknown word into states via a backtracking procedure. The
`duration of each state is then measured from the state segmentation. A postprocessor then
increments the log-likelihood score of the Viterbi algorithm by the quantity
`
log P̂(q, O | λ) = log P(q, O | λ) + αd Σ_{j=1}^{N} log[ pj(dj) ]        (6.123)
`
where αd is a scaling multiplier on the state duration scores, and dj is the duration of
`state j along the optimal path as determined by the Viterbi algorithm. The incremental
`cost of the postprocessor for duration is essentially negligible, and experience has shown
`that recognition performance is essentially as good as that obtained using the theoretically
`correct duration model.
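A minimal sketch of this duration postprocessor is given below; it assumes that the Viterbi score, the backtracked state path, and the empirically measured duration histograms pj(d) are already available (the dictionary representation of the histograms is an illustrative assumption):

```python
import numpy as np

def add_duration_score(viterbi_log_prob, state_path, duration_hists,
                       alpha_d=1.0, floor=1e-10):
    """Increment the Viterbi log-likelihood with heuristic state-duration
    scores, in the spirit of Eq. (6.123).

    state_path     : optimal state sequence from Viterbi backtracking
    duration_hists : duration_hists[j] is a dict mapping duration d -> p_j(d),
                     measured from the segmented training data
    """
    # Measure the duration of each visited state along the optimal path
    durations = []
    run_state, run_len = state_path[0], 0
    for s in state_path:
        if s == run_state:
            run_len += 1
        else:
            durations.append((run_state, run_len))
            run_state, run_len = s, 1
    durations.append((run_state, run_len))
    # Add alpha_d * sum_j log p_j(d_j), flooring unseen durations
    bonus = sum(np.log(max(duration_hists[j].get(d, 0.0), floor))
                for j, d in durations)
    return viterbi_log_prob + alpha_d * bonus
```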
`
`6.15.4 HMM Isolated-Digit Performance
`
`We conclude this section on isolated word recognition using HMMs by giving a set of
`performance results (in terms of average word error rate) on the task of recognizing
`isolated digits in a speaker-independent manner. For this task, a training set consisting
`of 100 occurrences of each digit by 100 talkers (i.e., a single occurrence of each digit
`per talker) was used. Half the talkers were male and half were female. For testing the
`
`
TABLE 6.1. Average Digit Error Rates for Several Recognizers and Evaluation Sets

                               Evaluation Set
Recognizer      Original
Type            Training      TS2       TS3       TS4
LPC/DTW           0.1         0.2       2.0       1.1
LPC/DTW/VQ         -          3.5        -         -
HMM/VQ             -          3.7        -         -
HMM/CD            0           0.2       1.3       1.8
HMM/AR            0.3         1.8       3.4       4.1
`
`algorithm, we used the initial training set, as well as three other independent test sets with
`the following characteristics:
`
`TS2 The same 100 talkers as were used in the training; 100 occurrences
`of each digit
`TS3 A new set of 100 talkers (50 male, 50 female); 100 occurrences
`of each digit
`TS4 Another new set of 100 talkers (50 male, 50 female); 100 occurrences
`of each digit
`
`The results of the recognition tests are given in Table 6.1. The recognizers are the following:
`
LPC/DTW       Conventional template-based recognizer using dynamic time warping
              (DTW) alignment
LPC/DTW/VQ    Conventional recognizer with vector quantization of the feature vectors
              (M = 64)
HMM/VQ        HMM recognizer with M = 64 codebook
HMM/CD        HMM recognizer using continuous density model with M = 5 mixtures
              per state
HMM/AR        HMM recognizer using autoregressive observation density
`
It can be seen that, when using a VQ, the performance of the isolated word recognizer
degrades in both the conventional and HMM modes. It can also be seen that the performances
of the conventional template-based recognizer and the HMM recognizer with a
continuous density model are comparable. Finally, Table 6.1 shows that the autoregressive
density HMM gives poorer performance than the standard mixture density model.
`
`6.16 SUMMARY
`
`In this chapter we have attempted to present the theory of hidden Markov models from
`the simplest concepts (discrete Markov chains) to the most sophisticated models (variable
`
duration, continuous density models). It has been our purpose to focus on physical explanations
of the basic mathematics; hence we have avoided long, drawn-out proofs or
derivations of the key results, and concentrated primarily on trying to interpret the meaning
of the math, and how it could be implemented in practice in real-world systems. We have
also attempted to illustrate one application of the theory of HMMs to a simple problem in
speech recognition, namely isolated word recognition.
`
`REFERENCES
`
[1] L.E. Baum and T. Petrie, "Statistical inference for probabilistic functions of finite state Markov chains," Ann. Math. Stat., 37: 1554-1563, 1966.
[2] L.E. Baum and J.A. Egon, "An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology," Bull. Amer. Meteorol. Soc., 73: 360-363, 1967.
[3] L.E. Baum and G.R. Sell, "Growth functions for transformations on manifolds," Pac. J. Math., 27 (2): 211-227, 1968.
[4] L.E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Ann. Math. Stat., 41 (1): 164-171, 1970.
[5] L.E. Baum, "An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes," Inequalities, 3: 1-8, 1972.
[6] J.K. Baker, "The dragon system--An overview," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-23 (1): 24-29, February 1975.
[7] F. Jelinek, "A fast sequential decoding algorithm using a stack," IBM J. Res. Develop., 13: 675-685, 1969.
[8] L.R. Bahl and F. Jelinek, "Decoding for channels with insertions, deletions, and substitutions with applications to speech recognition," IEEE Trans. Information Theory, IT-21: 404-411, 1975.
[9] F. Jelinek, L.R. Bahl, and R.L. Mercer, "Design of a linguistic statistical decoder for the recognition of continuous speech," IEEE Trans. Information Theory, IT-21: 250-256, 1975.
[10] F. Jelinek, "Continuous speech recognition by statistical methods," Proc. IEEE, 64: 532-536, April 1976.
[11] R. Bakis, "Continuous speech word recognition via centisecond acoustic states," in Proc. ASA Meeting (Washington, DC), April 1976.
[12] F. Jelinek, L.R. Bahl, and R.L. Mercer, "Continuous speech recognition: Statistical methods," in Handbook of Statistics, II, P.R. Krishnaiah, Ed. Amsterdam, The Netherlands: North-Holland, 1982.
[13] L.R. Bahl, F. Jelinek, and R.L. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Trans. Pattern Anal. Machine Intell., PAMI-5: 179-190, 1983.
[14] J.D. Ferguson, "Hidden Markov Analysis: An Introduction," in Hidden Markov Models for Speech, Institute for Defense Analyses, Princeton, NJ, 1980.
[15] A.J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimal decoding algorithm," IEEE Trans. Information Theory, IT-13: 260-269, April 1967.
[16] G.D. Forney, "The Viterbi algorithm," Proc. IEEE, 61: 268-278, March 1973.
`
`
[17] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Stat. Soc., 39 (1): 1-38, 1977.
[18] S.E. Levinson, L.R. Rabiner, and M.M. Sondhi, "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition," Bell System Tech. J., 62 (4): 1035-1074, April 1983.
[19] L.A. Liporace, "Maximum likelihood estimation for multivariate observations of Markov sources," IEEE Trans. Information Theory, IT-28 (5): 729-734, 1982.
[20] B.H. Juang, "Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT&T Tech. J., 64 (6): 1235-1249, July-Aug. 1985.
[21] B.H. Juang, S.E. Levinson, and M.M. Sondhi, "Maximum likelihood estimation for multivariate mixture observations of Markov chains," IEEE Trans. Information Theory, IT-32 (2): 307-309, March 1986.
[22] A.B. Poritz, "Linear predictive hidden Markov models and the speech signal," in Proc. ICASSP 82 (Paris, France), 1291-1294, May 1982.
[23] B.H. Juang and L.R. Rabiner, "Mixture autoregressive hidden Markov models for speech signals," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-33 (6): 1404-1413, December 1985.
[24] M.J. Russell and R.K. Moore, "Explicit modeling of state occupancy in hidden Markov models for automatic speech recognition," in Proc. ICASSP 85 (Tampa, FL), 5-8, March 1985.
[25] S.E. Levinson, "Continuously variable duration hidden Markov models for automatic speech recognition," Computer Speech and Language, 1 (1): 29-45, March 1986.
[26] L.R. Bahl, P.F. Brown, P.V. deSouza, and L.R. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proc. ICASSP 86 (Tokyo, Japan), 49-52, April 1986.
[27] Y. Ephraim, A. Dembo, and L.R. Rabiner, "A minimum discrimination information approach for hidden Markov modeling," IEEE Trans. Information Theory, 35 (5): 1001-1003, September 1989.
[28] Y. Ephraim and L. Rabiner, "On the Relations Between Modeling Approaches for Speech Recognition," IEEE Trans. Information Theory, 36 (2): 372-380, March 1990.
[29] B.H. Juang and L.R. Rabiner, "A probabilistic distance measure for hidden Markov models," AT&T Tech. J., 64 (2): 391-408, February 1985.
[30] F. Jelinek and R.L. Mercer, "Interpolated estimation of Markov source parameters from sparse data," in Pattern Recognition in Practice, E.S. Gelsema and L.N. Kanal, Eds. Amsterdam, The Netherlands: North-Holland, 1980, 381-397.
[31] C.-H. Lee, C.-H. Lin, and B.H. Juang, "A study on speaker adaptation of the parameters of continuous density hidden Markov models," IEEE Trans. Signal Processing, 39 (4): 806-841, April 1991.
[32] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, New York, 1973.
[33] L.R. Bahl, P.F. Brown, P.V. deSouza, and R.L. Mercer, "A new algorithm for the estimation of hidden Markov model parameters," Proc. ICASSP 88, 493-496, New York, April 1988.
[34] S. Katagiri, C.H. Lee, and B.H. Juang, "New Discriminative Training Algorithms Based on the Generalized Probabilistic Descent Method," Proc. 1991 Workshop Neural Networks for Signal Processing, 299-308, IEEE Signal Processing Society, Princeton, September 1991.
`
`
[35] Y. Linde, A. Buzo, and R.M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Communications, COM-28: 84-95, January 1980.
[36] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees, Wadsworth & Brooks, Monterey, CA, 1984.
[37] K.F. Lee, Automatic Speech Recognition--The Development of the SPHINX System, Kluwer Academic Publishers, Boston, 1989.
[38] L.R. Rabiner, B.H. Juang, S.E. Levinson, and M.M. Sondhi, "Recognition of Isolated Digits Using Hidden Markov Models With Continuous Mixture Densities," AT&T Tech. J., 64 (6): 1211-1234, July-Aug. 1985.
[39] L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, 77 (2): 257-286, February 1989.
`
`
`
`
Chapter 7
`
`SPEECH RECOGNITION
`BASED ON CONNECTED
`WORD MODELS
`
7.1 INTRODUCTION
`
Up to this point we have been discussing discrete utterance (or, as it is often called, isolated
word or phrase) recognition. The assumption was that the speech to be recognized
`comprised a single word or phrase and was to be recognized as a complete entity with no
`explicit knowledge or regard for the phonetic content of the word or phrase. Hence, for a
`vocabulary of V words (or phrases), the recognition algorithm consisted of matching (via
`time alignment) the measured sequence of spectral vectors of the unknown spoken input
`against each of the set of spectral patterns for the V words and selecting the pattern whose
`accumulated time aligned spectral distance was smallest as the recognized word. Another
`implicit assumption was that each spoken utterance had a clearly defined beginning and
`ending point that could be found using some type of speech endpoint detector. As a result,
pattern matching could be reliably performed without having to be concerned about uncertainties
in the endpoints of the patterns being compared. For many applications, notably
`those referred to as "command-and-control" applications, in which the user is required
`to speak the command words one at a time (i.e., with distinct pauses between command
`words), this paradigm works well and is entirely appropriate. (We will discuss command(cid:173)
`and-control applications in Chapter 9.) However, for applications in which the speech
`to be recognized consists of a sequence of words from the recognition vocabulary (e.g.,
`strings of digits) such a paradigm is often unacceptable for practical reasons. Figure 7.1
`illustrates this point for a three-digit sequence in which the upper panel shows the log
`
`
`
`
`
Figure 7.1 Illustration of an isolated string of digits (upper panel) and a fluently
spoken version of the same digit string (lower panel). The horizontal axis is time in seconds.
`
`energy contour of a digit string spoken as a sequence of isolated digits where each digit is
`followed by a discernible pause (i.e., period of no speaking), and the lower panel shows
`the same digit string spoken in a fluent (continuous) manner. It should be self-evident that
`speaking a sequence of isolated digits is unnatural (from the point of view of human speech
`production) and grossly inefficient (it takes about 2.5 seconds to speak the isolated digit
`sequence versus less than 1 second to speak the same string of digits in a fluent manner).
`Hence, it is important to be able to extend the techniques described in earlier chapters of
`this book so that accurate recognition of fluent speech becomes possible.
`From the point of view of speech-recognition algorithms, there are two interesting
classes of fluent speech strings. One class is the set of strings derived from small-to-moderate-size
vocabularies, including digit strings, spelled letter sequences, combinations
of alphanumerics, and strings for accessing limited databases based on small-to-moderate-size
vocabularies and highly constrained word syntax. This set has the property that the
`size vocabularies and highly constrained word syntax. This set has the property that the
`basic speech-recognition unit can be the word (or phrase), much as in the case of isolated
`word- (or phrase-) recognition systems. The other class is the set of continuous speech
`drawn from moderate-to-large-size vocabularies where the basic speech-recognition unit
`cannot be the word because of complexity constraints. In such a case, subword speech
`units are necessary for the implementation of the speech-recognition system. We defer a
`discussion of the second class of speech strings to Chapter 8, where we provide a complete
`description of techniques used in the recognition of large vocabulary continuous speech. In
`
`
Figure 7.2 Illustration of the connected word-recognition problem.
`
`this chapter we discuss in depth the class of speech-recognition systems commonly referred
`to as connected word speech recognizers, because the basic unit of recognition is a whole
`word.
`We can now formulate the basic problem in connected word recognition. To do this
`we refer to Figure 7 .2, which gives a pictorial description of the connected word-recognition
`problem. We assume that we are given the spectral vectors from a fluently spoken string of
words, T = {t(1), t(2), ..., t(M)}, and we are also given the spectral patterns for each of the
V reference patterns (R1 to RV) corresponding to V unique words in the vocabulary. The
`connected word-recognition problem can be compactly stated as follows:
`
`Given a fluently spoken sequence of words, how can we determine the optimum match
`in terms of a concatenation of word reference patterns?
`
`To solve the connected word-recognition problem we must resolve the following problems:
`
`1. We don't usually know the number of words, L, in the string (although we usually
`have a good bound on the range, e.g., one to seven words).
`2. We don't know word utterance boundaries within the spoken string; i.e. except for
`the beginning of the first word in the string, and the end of the last word in the string,
`we don't know precisely where any word begins or ends.
`3. The word boundaries are often fuzzy or nonunique. That is, it is often difficult, if not
`impossible, to specify (i.e., find accurately and automatically)
`the word boundaries
`because of sound coarticulation. Thus, for example, the boundary between the digit
3 and the digit 8 in Figure 7.1 is fuzzy because the ending sound /i/ in 3 coarticulates
strongly with the initial sound /ey/ in 8. We indicate such fuzziness of the boundaries
`by the squiggly lines in Figure 7.2.
`4. For a set of V word-reference patterns, and for a given value of L (the number
of words in the string), there are V^L possible combinations of composite matching
`patterns (possible L word sequences); for anything but extremely small values of
V and L the exponential number of composite matching patterns implies that the
`connected word-recognition problem cannot be solved by exhaustive means.
`
The bottom line is that a nonexhaustive matching algorithm is required to solve the
`
`
connected word-recognition problem. In this chapter we discuss several algorithms that have
`this property and are capable of solving the connected word-recognition problem with
`various degrees of efficiency.
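To get a feel for why exhaustive matching is hopeless, the short calculation below counts the V^L composite patterns referred to in point 4 above; the numbers (a 10-word digit vocabulary, strings of up to seven words) are illustrative only:

```python
# Size of the exhaustive search space: V^L composite patterns (point 4 above).
V = 10
total = 0
for L in range(1, 8):
    total += V ** L
    print(f"L = {L}: {V ** L:>10,d} candidate L-word concatenations")
print(f"all lengths 1..7 combined: {total:,d} composite reference patterns")
# Each composite pattern would need a full time-alignment match against the
# test pattern, which is why a nonexhaustive (dynamic programming) search
# is required.
```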
`Although we will spend a great deal of time in this chapter discussing efficient
`algorithms for optimal matching of sequences of word patterns, it is also important to
`address the problem of word reference pattern training using connected word sequences.
`Unlike the isolated word case, for which training procedures were discussed in Chapters
`5 and 6, for connected word sequence training we encounter a fundamental difficulty,
`namely the uncertainty in word boundaries within the spoken word sequence. This makes
`the connected word sequence training problem somewhat more difficult; therefore in this
`chapter we also describe an embedded word training method based on the segmental
`k-means algorithm.
`
7.2 GENERAL NOTATION FOR THE CONNECTED WORD-RECOGNITION PROBLEM
`
`We denote the spectral sequence of vectors of the test pattern as
`
T = {t(1), t(2), ..., t(M)} = {t(m)}_{m=1}^{M}        (7.1)
`
`in which each t(m) is an appropriate spectral vector (e.g., filter bank, LPC, ear model,
`etc.). Similarly, we denote the set of word reference patterns (either templates or statistical
models) as Ri, 1 ≤ i ≤ V, for a V-word vocabulary, where each pattern is of the form
`
Ri = {ri(1), ri(2), ..., ri(Ni)}        (7.2)
`
where Ni is the duration (in frames or states) of the ith word reference pattern. (For
`convenience we will develop the algorithms on the basis of using word templates; we will
`see later in this chapter that the required modifications for statistical models are essentially
`trivial.)
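To make the notation concrete, a minimal sketch of how these quantities might be represented in code is given below; the array shapes, dimensions, and names are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

# Test pattern T = {t(1), ..., t(M)}: M spectral vectors of dimension D,
# e.g. LPC-derived cepstral vectors (illustrative shapes only).
M, D = 120, 12
T = np.zeros((M, D))            # T[m - 1] is the spectral vector t(m)

# Word reference patterns R_i = {r_i(1), ..., r_i(N_i)}, i = 1, ..., V.
# Each template has its own duration N_i, so a list of arrays is natural.
V = 10
durations = [40 + 3 * i for i in range(V)]        # the N_i values (made up)
R = [np.zeros((N_i, D)) for N_i in durations]     # R[i - 1] is pattern R_i
```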
`The connected word-recognition problem can now be stated as a problem in finding
`the optimum sequence of word reference patterns, R *, that best matches T. With no
`loss of generality we assume that there are L word patterns in R* (where we vary L from
the minimum to the maximum possible values and take the optimum over all values of
`L). Hence the best sequence of reference patterns, R*, is a concatenation of L reference
`patterns, i.e.,
`
R* = Rq*(1) ⊕ Rq*(2) ⊕ ... ⊕ Rq*(L)        (7.3)
`
in which each index, q*(ℓ), is in the range [1, V].
To determine R*, that is, to determine the sequence of word indices q*(ℓ), 1 ≤ ℓ ≤ L,
that give the best matching sequence, consider constructing an arbitrary "super-reference"
pattern R^s of the form
R^s = Rq(1) ⊕ Rq(2) ⊕ Rq(3) ⊕ ... ⊕ Rq(L) = {r^s(n)}_{n=1}^{N^s}

in which N^s is the total duration of the concatena