`
`Chap. 6
`
`Theory and Implementation of Hidden Markov Models
`
`Figure 6.11
`Illustration of general interstate connections of (a) a normal HMM with
`exponential state duration density, and (b) a variable duration HMM with specified state
`densities and no self-transitions from a state back to itself.
`
`6.9
`
`INCLUSION OF EXPLICIT STATE DURATION DENSIT¥ IN HMMS
`
`(6.64)
`
`Earlier we showed via Eq. (6.5) that the inherent duration probability density p;(d)
`associated with state i, with self-transition coefficient a;;, was of the form
`p;(d) = (a;;l-
`1(1 - a;;)
`= probability of d consecutive observations in state i.
For most physical signals, this exponential state duration density is inappropriate. Instead we would prefer to explicitly model duration density in some analytic form. (An extensive treatment of state duration modeling can be found in the work of Ferguson of IDA [14], which is the basis of the material presented here. Other valuable references include [24] and [25].) Figure 6.11 illustrates, for a pair of model states i and j, the differences between HMMs without and with explicit duration density. In part (a) the states have exponential duration densities based on self-transition coefficients a_ii and a_jj, respectively. In part (b), the self-transition coefficients are set to zero, and an explicit duration density is specified. For this case, a transition is made only after the appropriate number of observations have occurred in the state (as specified by the duration density). Such a model is called a semi-Markov model.
Based on the simple model of Figure 6.11(b), the sequence of events of the variable duration HMM is as follows:
`
1. An initial state, q_1 = i, is chosen according to the initial state distribution π_i.
`
`IPR2023-00037
`Apple EX1013 Page 209
`
`
`
2. A duration d_1 is chosen according to the state duration density p_q1(d_1). (For expedience and ease of implementation the duration density p_q(d) is truncated at a maximum duration value D.)
3. Observations o_1 o_2 ... o_d1 are chosen according to the joint observation density b_q1(o_1 o_2 ... o_d1). Generally we assume independence of observations, so that b_q1(o_1 o_2 ... o_d1) = Π_{s=1}^{d_1} b_q1(o_s).
4. The next state, q_2 = j, is chosen according to the state-transition probabilities a_q1q2, with the constraint that a_q1q1 = 0; i.e., no transition back to the same state can occur. (Clearly this is a requirement, because we assume that, in state q_1, exactly d_1 observations occur.)

A little thought should convince the reader that the variable duration HMM can be made equivalent to the standard HMM by setting p_i(d) to be the exponential density of Eq. (6.64).
Using the above formulation, we must make several changes to the formulas of Section 6.4.3 to allow calculation of P(O|λ) and for reestimation of model parameters. In particular we assume that the first state begins at t = 1 and the last state ends at t = T. We then define the forward variable α_t(i) as

α_t(i) = P(o_1 o_2 ... o_t, the stay in state i ends at t | λ).   (6.65)

We assume that a total of r states have been visited during the first t observations, and we denote the states as q_1, q_2, ..., q_r, with durations associated with each state of d_1, d_2, ..., d_r. Thus the constraints of Eq. (6.65) are

q_r = i   (6.66a)

Σ_{s=1}^{r} d_s = t.   (6.66b)

Eq. (6.65) can then be written as

α_t(i) = Σ_q Σ_d π_q1 p_q1(d_1) P(o_1 o_2 ... o_d1 | q_1) a_q1q2 p_q2(d_2) P(o_{d_1+1} ... o_{d_1+d_2} | q_2) ⋯ a_{q_{r-1}q_r} p_qr(d_r) P(o_{d_1+...+d_{r-1}+1} ... o_t | q_r)   (6.67)

where the sum is over all states q and all possible state durations d. By induction we can write α_t(j) as

α_t(j) = Σ_{i=1}^{N} Σ_{d=1}^{D} α_{t-d}(i) a_ij p_j(d) Π_{s=t-d+1}^{t} b_j(o_s)   (6.68)

where D is the maximum duration within any state. To initialize the computation of α_t(j) we use

α_1(i) = π_i p_i(1) b_i(o_1)   (6.69a)
`
`
`
`
`360
`
`Theory and Implementation of Hidden Markov Models
`
`Chap. 6
α_2(i) = π_i p_i(2) Π_{s=1}^{2} b_i(o_s) + Σ_{j=1, j≠i}^{N} α_1(j) a_ji p_i(1) b_i(o_2)   (6.69b)

α_3(i) = π_i p_i(3) Π_{s=1}^{3} b_i(o_s) + Σ_{d=1}^{2} Σ_{j=1, j≠i}^{N} α_{3-d}(j) a_ji p_i(d) Π_{s=4-d}^{3} b_i(o_s)   (6.69c)

and so on, until α_D(i) is computed; then Eq. (6.68) can be used for all t > D. It should be clear that the desired probability of O given the model λ can be written in terms of the αs as

P(O|λ) = Σ_{i=1}^{N} α_T(i)   (6.70)
`
`as was previously used for ordinary HMMs.
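The recursion of Eqs. (6.68)-(6.70), with the initialization of Eq. (6.69), can be sketched in code. This is an illustrative sketch under our own conventions (the array layout pi/A/B/P and the brute-force emission products are our choices, not something prescribed by the text):

```python
import numpy as np

def duration_hmm_forward(pi, A, B, P, obs, D):
    """P(O|lambda) for a variable duration HMM via Eqs. (6.68)-(6.70).

    pi : (N,) initial state distribution
    A  : (N, N) transition matrix, zero diagonal (no self-transitions)
    B  : (N, M) discrete emission probabilities
    P  : (N, D) duration probabilities, P[j, d-1] = p_j(d)
    obs: list of observation symbol indices o_1 .. o_T
    """
    N, T = len(pi), len(obs)
    alpha = np.zeros((T + 1, N))      # alpha[t, i]: stay in state i ends at t
    for t in range(1, T + 1):
        for j in range(N):
            for d in range(1, min(D, t) + 1):
                # product of emissions over the d observations spent in state j
                emit = np.prod(B[j, obs[t - d:t]])
                if d == t:
                    # first state occupies o_1 .. o_t (initialization, Eq. (6.69))
                    alpha[t, j] += pi[j] * P[j, d - 1] * emit
                else:
                    # Eq. (6.68): enter state j after some state i ends at t - d
                    alpha[t, j] += alpha[t - d] @ A[:, j] * P[j, d - 1] * emit
    return alpha[T].sum()             # Eq. (6.70)
```

With D = 1 and p_i(1) = 1 this reduces to a standard forward pass over the non-self-loop transitions, which makes small hand-checked cases easy to verify.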
`To give reestimation formulas for all the variables of the variable duration HMM, we
`must define three more forward-backward variables, namely
α*_t(i) = P(o_1 o_2 ... o_t, stay in state i starts at t + 1 | λ)   (6.71)

β_t(i) = P(o_{t+1} ... o_T | stay in state i ends at t, λ)   (6.72)

β*_t(i) = P(o_{t+1} ... o_T | stay in state i starts at t + 1, λ).   (6.73)
`
The relationships between α, α*, β, and β* are as follows:

α*_t(j) = Σ_{i=1}^{N} α_t(i) a_ij   (6.74)

α_t(i) = Σ_{d=1}^{D} α*_{t-d}(i) p_i(d) Π_{s=t-d+1}^{t} b_i(o_s)   (6.75)

β_t(i) = Σ_{j=1}^{N} a_ij β*_t(j)   (6.76)

β*_t(i) = Σ_{d=1}^{D} β_{t+d}(i) p_i(d) Π_{s=t+1}^{t+d} b_i(o_s).   (6.77)
`
`•
`•
`Based on the above relationships and definitio
`ns, the reeSttmatton formulas for the variable
`duration HMM with discrete obse
`t·
`'
`rva ions, are
`
`ft· - 1r;/3o U)
`I - P(OI-X)
`
`(6.78)
`
`
`
`
`
ā_ij = [Σ_{t=1}^{T} α_t(i) a_ij β*_t(j)] / [Σ_{j=1}^{N} Σ_{t=1}^{T} α_t(i) a_ij β*_t(j)]   (6.79)

b̄_i(k) = [Σ_{t=1, s.t. o_t=v_k}^{T} (Σ_{τ<t} α*_τ(i) β*_τ(i) − Σ_{τ<t} α_τ(i) β_τ(i))] / [Σ_{k=1}^{M} Σ_{t=1, s.t. o_t=v_k}^{T} (Σ_{τ<t} α*_τ(i) β*_τ(i) − Σ_{τ<t} α_τ(i) β_τ(i))]   (6.80)

p̄_i(d) = [Σ_{t=1}^{T} α*_t(i) p_i(d) β_{t+d}(i) Π_{s=t+1}^{t+d} b_i(o_s)] / [Σ_{d=1}^{D} Σ_{t=1}^{T} α*_t(i) p_i(d) β_{t+d}(i) Π_{s=t+1}^{t+d} b_i(o_s)]   (6.81)
`
The interpretation of the reestimation formulas is the following: The formula for π̄_i is the probability that state i was the first state, given O. The formula for ā_ij is almost the same as for the usual HMM, except it uses the condition that the alpha terms in which a state ends at t join with the beta terms in which a new state begins at t + 1. The formula for b̄_i(k) is the expected number of times that observation o_t = v_k occurred in state i, normalized by the expected number of times that any observation occurred in state i. Finally, the reestimation formula for p̄_i(d) is the ratio of the expected number of times state i occurred with duration d to the expected number of times state i occurred with any duration.
`The importance of incorporating state duration densities is reflected in the observation
`that, for some problems, the quality of the modeling is significantly improved when explicit
`state duration densities are used. However, there are drawbacks to the use of the variable
`duration model discussed in this section. One is the greatly increased computational load
`associated with using variable durations. It can be seen from the definition and initialization
conditions on the forward variable α_t(i), from Eqs. (6.68)-(6.69), that about D times the storage and D^2/2 times the computation is required. For D on the order of 25 (as is reasonable for many speech-processing problems), computation is increased by a factor of 300. Another problem with the variable duration models is the large number of parameters (D), associated with each state, that must be estimated, in addition to the usual HMM parameters. Furthermore, for a fixed number of observations T in the training set, there are, on average, fewer state transitions and much less data to estimate p_i(d) than would be used in a standard HMM. Thus the reestimation problem is more difficult for variable duration HMMs than for the standard HMM.
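The rough cost argument above can be made concrete. In our reading, the sum over d = 1..D in Eq. (6.68) needs a product of d emission terms per state and frame, versus a single term in the standard recursion; a tiny sketch of the arithmetic:

```python
def duration_hmm_overhead(D):
    """Rough work factor of the duration recursion (Eq. (6.68)) relative to
    the standard forward recursion: the sum over d = 1..D needs a product
    of d emission terms, i.e. about D*(D+1)/2 operations per state and frame."""
    return D * (D + 1) / 2

storage_factor = 25                           # about D times the alpha storage
compute_factor = duration_hmm_overhead(25)    # 325.0, i.e. roughly the ~300x cited
```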
One proposal to alleviate some of these problems is to use a parametric state duration density instead of the nonparametric p_i(d) used above [23-24]. In particular, proposals
`
`
include the Gaussian family with

p_i(d) = N(d, μ_i, σ_i^2)   (6.82)

with parameters μ_i and σ_i^2, or the Gamma family with

p_i(d) = η_i^{ν_i} d^{ν_i - 1} e^{-η_i d} / Γ(ν_i)   (6.83)

with parameters ν_i and η_i and with mean ν_i η_i^{-1} and variance ν_i η_i^{-2}. Reestimation formulas for η_i and ν_i have been derived and used with good results. Another possibility, which has been used with good success, is to assume a uniform duration distribution over an appropriate range of durations and use a path-constrained Viterbi decoding procedure.
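As one concrete reading of Eq. (6.83), a parametric duration density can be evaluated on d = 1..D and renormalized into a proper discrete distribution. The truncation-and-renormalization step below is our own assumption for illustration, not something the text prescribes:

```python
import math

def gamma_duration_density(nu, eta, D):
    """Discretized Gamma duration density p_i(d) of Eq. (6.83), truncated at D."""
    p = [eta**nu * d**(nu - 1) * math.exp(-eta * d) / math.gamma(nu)
         for d in range(1, D + 1)]
    total = sum(p)
    return [x / total for x in p]     # renormalize over d = 1..D

# hypothetical parameters: shape nu = 4, rate eta = 0.5, mean nu/eta = 8 frames
p = gamma_duration_density(nu=4.0, eta=0.5, D=25)
mean_d = sum(d * pd for d, pd in zip(range(1, 26), p))
```

Only the two parameters (ν_i, η_i) per state need to be estimated, instead of the D free values of a nonparametric p_i(d).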
`
`6.10 OPTIMIZATION CRITERION-ML, MMI, AND MDI
`
The standard ML design criterion is to use a training sequence of observations O to derive the set of model parameters λ, yielding

(λ)_ML = arg max_λ P(O|λ).   (6.84)
`
Any of the reestimation algorithms discussed previously provides a solution to this optimization problem.
The need to consider alternative design criteria, however, comes from several concerns ([26-28]). The basic philosophy in statistical modeling methods, such as HMM, is that the signal or observation sequence can be well modeled if the parameters of the model are carefully and correctly chosen. The problem with this philosophy is that the assumed model (HMM in the present case) is sometimes inadequate to model the observed signal, so that no matter how carefully the parameters are chosen, the modeling accuracy is limited. Often, this situation is described as a "model mismatch." The first alternative optimization criterion we discuss here is one that tries to overcome the problem of model mismatch in order to achieve a more accurate modeling of the observation signal.
The observed signal O = (o_1, o_2, ..., o_T) is associated with a sequence of constraints R = (R_1, R_2, ..., R_T). For example, R_t may be the autocorrelation matrix that characterizes the observation o_t. Then, obviously, O is only one of possibly uncountably many observation sequences that satisfy the constraint sequence R. Furthermore, in terms of the probability distributions of the observation sequences, there exists a set of such distributions that would also satisfy R. We denote this set Q(R). The minimum discrimination information (MDI) is a measure of closeness between two probability measures (one of which bears the HMM form here) under the given constraint R and is defined by
`
ν(R, P_λ) = inf_{Q ∈ Q(R)} I(Q : P_λ)   (6.85)

where

I(Q : P_λ) = ∫ q(O) log [q(O) / p(O|λ)] dO   (6.86)
`
`
`
`
`
is the discrimination information between distributions Q and P_λ [27,28]. (The functions q(·) and p(·|λ) are the probability density functions corresponding to Q and P_λ, respectively.) The discrimination information is calculated based on the given training set of observations.
The MDI criterion tries to choose a model parameter set λ such that ν(R, P_λ) is minimized. An interpretation of MDI is that the model parameter set λ is chosen so that the model p(O|λ) is as close as it can be to a member of the set Q(R). Since the closeness is always measured in terms of the discrimination information evaluated on the given observations, the intrinsic characteristics of the training sequences would then have substantial influence on the parameter selection. By emphasizing the measure of discrimination, the model estimation is no longer solely dictated by the assumed model form. The MDI optimization problem is, however, not as straightforward as the ML optimization problem, and no simple robust implementation of the procedure is known.
Another concern about the HMM optimization criterion arises when we attempt to use it to solve a class of speech-recognition problems. Consider recognition of a vocabulary of V words, each of which is represented by an HMM, with parameter set λ_v, v = 1, 2, ..., V. We assume P(v) to be the a priori probability for word v, v = 1, 2, ..., V. The set of HMMs Λ = {λ_v} together with the a priori probabilities thus defines a probability measure for an arbitrary observation sequence O,

P_Λ(O) = Σ_{v=1}^{V} P(O|λ_v, v) P(v).   (6.87)
`
(The notation P(O|λ_v, v) indicates that it is a probability conditioned on the word v. We include the model parameter λ_v, sometimes, because of the necessity of treating λ_v as random variables for estimation purposes. Obviously, when λ_v is fixed, P(O|λ_v, v) is the conditional probability, parameterized by λ_v.) To train these models (i.e., to estimate the optimum parameters of the associated models), utterances of known (labeled) words are used. We denote the labeled training sequence by O^v, where superscript v reflects the fact that O^v is a rendition of word v. The standard ML criterion of Eq. (6.84) is to use O^v to estimate model parameters λ_v, yielding

(λ_v)_ML = arg max_λ P(O^v|λ).

Each model is estimated separately using the correspondingly labeled training observation sequence(s). The resultant models, however, need not be the optimal solution for minimizing the probability of recognition error.
An alternative design criterion that aims at maximizing the "discrimination" of each model (i.e., the ability to distinguish between observation sequences generated by the correct word model and those generated by alternative models) is the maximum mutual information (MMI) criterion [26]. The mutual information between an observation sequence O^v and the word v, parameterized by Λ = {λ_v}, v = 1, 2, ..., V, is

I_Λ(O^v, v) = log [P(O^v, v|Λ) / (P_Λ(O^v) P(v))]   (6.88)
`Since
`
`
`
`
`364
`
`Chap. 6
`
`Theory and Implementation of Hidden Markov Models
`
`I
`I
`
`~ (,
`
`I
`
I_Λ(O^v, v) = log P(O^v|λ_v) − log Σ_{w=1}^{V} P(O^v|λ_w, w) P(w).   (6.89)

The MMI criterion is to find the entire model set Λ such that the mutual information is maximized,

(Λ)_MMI = max_Λ { Σ_{v=1}^{V} I_Λ(O^v, v) }.   (6.90)
`
The MMI criterion is obviously different from the ML criterion. Both are minimum cross-entropy approaches. In the ML approach, an HMM for the distribution of the data given the word is matched to the empirical distribution. In the MMI approach, a model for the distribution of the word given the data is matched to the empirical distribution. This explains the merit of the MMI approach. The optimization procedure for the MMI approach involves the entire model parameter set Λ even if only one labeled training sequence O^v is used. The ML criterion addresses the likelihood P(O^v|λ_v) alone, while the MMI criterion compares the likelihood P(O^v|λ_v) against the "probability background" P_Λ(O^v) and attempts to maximize the difference. However, (Λ)_MMI is not as straightforward to obtain as (Λ)_ML. One often has to use general optimization procedures like the descent algorithms to solve Eq. (6.90). Such optimization procedures often lead to numerical problems in implementation.
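The difference between the two criteria can be seen numerically. Given hypothetical per-model likelihoods P(O^v|λ_w) for a 3-word vocabulary (the numbers below are invented for illustration), the ML score looks only at the correct model's likelihood, while the MMI objective of Eq. (6.89) subtracts the log of the probability background:

```python
import math

# hypothetical likelihoods P(O^v | lambda_w) for one utterance of word v = 0,
# scored under models w = 0, 1, 2
likelihood = [2.0e-5, 1.5e-5, 0.4e-5]
prior = [1 / 3, 1 / 3, 1 / 3]            # P(w)

v = 0
ml_score = math.log(likelihood[v])                        # ML: log P(O^v | lambda_v)
background = sum(l * p for l, p in zip(likelihood, prior))  # P_Lambda(O^v)
mmi_score = ml_score - math.log(background)               # Eq. (6.89)
# mmi_score grows only when the correct model dominates the background,
# which is what the discriminative criterion tries to maximize.
```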
`
`6.11 COMPARISONS OF HMMS
`
An interesting question associated with HMMs is the following: Given two HMMs, λ_1 and λ_2, what is a reasonable measure of the similarity of the two models ([29])? Consider the case of two models
`
with

A_1 = [ p    1-p ]        B_1 = [ q    1-q ]        π_1 = [1/2  1/2]
      [ 1-p  p   ]              [ 1-q  q   ]

and

A_2 = [ r    1-r ]        B_2 = [ s    1-s ]        π_2 = [1/2  1/2].
      [ 1-r  r   ]              [ 1-s  s   ]
`
For λ_1 to be equivalent to λ_2, in the sense of having the same statistical properties for the observation symbols, i.e., E[o_t = v_k|λ_1] = E[o_t = v_k|λ_2], for all v_k, we require

pq + (1 - p)(1 - q) = rs + (1 - r)(1 - s)
`
`
`
`
`
or, by solving for s, we get

s = (p + q - 2pq - r) / (1 - 2r).

By choosing (arbitrarily) p = 0.6, q = 0.7, r = 0.2, we get s = 13/30 ≈ 0.433. Thus, even when the two models, λ_1 and λ_2, look ostensibly very different (i.e., A_1 is very different from A_2 and B_1 is very different from B_2), statistical equivalence of the models can occur.
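The algebra can be checked directly: given p, q, and r, the value of s above makes the two sides of the equivalence condition agree. A quick verification sketch:

```python
def equivalent_s(p, q, r):
    # solve pq + (1-p)(1-q) = rs + (1-r)(1-s) for s
    return (p + q - 2 * p * q - r) / (1 - 2 * r)

p, q, r = 0.6, 0.7, 0.2
s = equivalent_s(p, q, r)                  # 13/30, about 0.433
lhs = p * q + (1 - p) * (1 - q)
rhs = r * s + (1 - r) * (1 - s)
assert abs(lhs - rhs) < 1e-12
```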
We can generalize [29] the concept of model distance (dissimilarity) by defining a distance measure D(λ_1, λ_2), between two Markov models, λ_1 and λ_2, as

D(λ_1, λ_2) = (1/T) [log P(O^(2)|λ_1) - log P(O^(2)|λ_2)]   (6.91)

where O^(2) = (o_1 o_2 o_3 ... o_T) is a sequence of observations generated by model λ_2. Basically, Eq. (6.91) is a measure of how well model λ_1 matches observations generated by model λ_2, relative to how well model λ_2 matches observations generated by itself. Several interpretations of Eq. (6.91) exist in terms of cross-entropy, or divergence, or discrimination information [29].
One of the problems with the distance measure of Eq. (6.91) is that it is nonsymmetric. Hence a natural expression of this measure is the symmetrized version, namely

D_s(λ_1, λ_2) = [D(λ_1, λ_2) + D(λ_2, λ_1)] / 2.   (6.92)
`
`6.12 IMPLEMENTATION ISSUES FOR HMMS
`
`The discussion in the previous sections has dealt primarily with the theory of HMMs
`and several variations on the form of the model. In this section we deal with several
`practical implementation issues, including scaling, multiple observation sequences, initial
`parameter estimates, missing data, and choice of model size and type. For some of these
`implementation issues we can prescribe exact analytical solutions; for other issues we can
`provide only some seat-of-the-pants experience gained from working with HMMs.
`
`6.12.1 Scaling
`
To understand why scaling ([18,23]) is required for implementing the reestimation procedure of HMMs, consider the definition of α_t(i) of Eq. (6.18). It can be seen that α_t(i) consists of the sum of a large number of terms, each of the form

( Π_{s=1}^{t-1} a_{q_s q_{s+1}} ) ( Π_{s=1}^{t} b_{q_s}(o_s) )

with q_t = i, and b a discrete probability as defined by Eq. (6.8). Since each a and b term is less than 1 (generally significantly less than 1), it can be seen that as t starts to get big (e.g., 10 or more), each term of α_t(i) starts to head exponentially to zero. For sufficiently large t (e.g., 100 or more) the dynamic range of the α_t(i) computation will exceed the precision
`
`
`
`366
`
`Chap. 6
`
`Theory and Implementation of Hidden Markov Models
`
range of essentially any machine (even in double precision). Hence the only reasonable way to perform the computation is to incorporate a scaling procedure.
The basic scaling procedure multiplies α_t(i) by a scaling coefficient that is independent of i (i.e., it depends only on t), with the goal of keeping the scaled α_t(i) within the dynamic range of the computer for 1 ≤ t ≤ T. A similar scaling is done to the β_t(i) coefficients (since these also tend to zero exponentially fast) and then, at the end of the computation, the scaling coefficients are canceled out exactly.
To understand this scaling procedure better, consider the reestimation formula for the state-transition coefficients a_ij. If we write the reestimation formula (Eq. (6.40b)) directly in terms of the forward and backward variables, we get
`
ā_ij = [Σ_{t=1}^{T-1} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)] / [Σ_{t=1}^{T-1} Σ_{j=1}^{N} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)]   (6.93)
`
Consider the computation of α_t(i). We use the notation α_t(i) to denote the unscaled αs, α̂_t(i) to denote the scaled (and iterated) αs, and α̃_t(i) to denote the local version of α before scaling. Initially, for t = 1, we compute α_1(i) according to Eq. (6.19) and set α̃_1(i) = α_1(i), with c_1 = 1 / Σ_{i=1}^{N} α̃_1(i), and α̂_1(i) = c_1 α̃_1(i). For each t, 2 ≤ t ≤ T, we first compute α̃_t(i) according to the induction formula (Eq. (6.20)), in terms of the previously scaled α̂_{t-1}(j); that is,

α̃_t(i) = Σ_{j=1}^{N} α̂_{t-1}(j) a_ji b_i(o_t).   (6.94a)

We determine the scaling coefficient c_t as

c_t = 1 / Σ_{i=1}^{N} α̃_t(i)   (6.94b)

giving

α̂_t(i) = c_t α̃_t(i).   (6.94c)

From Eqs. (6.94a)-(6.94c) we can write the scaled α̂_t(i) as

α̂_t(i) = [Σ_{j=1}^{N} α̂_{t-1}(j) a_ji b_i(o_t)] / [Σ_{i=1}^{N} Σ_{j=1}^{N} α̂_{t-1}(j) a_ji b_i(o_t)]   (6.95)
`
`
`
`
`
By induction we can write α̂_{t-1}(j) as

α̂_{t-1}(j) = [Π_{s=1}^{t-1} c_s] α_{t-1}(j).   (6.96a)

Thus we can write α̂_t(i) as

α̂_t(i) = α_t(i) / Σ_{i=1}^{N} α_t(i)   (6.96b)

that is, each α_t(i) is effectively scaled by the sum over all states of α_t(i).
Next we compute the β_t(i) terms from the backward recursion. The only difference here is that we use the same scale factors for each time t for the betas as was used for the alphas. Hence the scaled βs are of the form

β̂_t(i) = c_t β_t(i).   (6.97)

Because each scale factor effectively restores the magnitude of the α terms to 1, and because the magnitudes of the α and β terms are comparable, using the same scaling factors on the βs as was used on the αs is an effective way to keep the computation within reasonable bounds. Furthermore, in terms of the scaled variables, we see that the reestimation Eq. (6.93) becomes

ā_ij = [Σ_{t=1}^{T-1} α̂_t(i) a_ij b_j(o_{t+1}) β̂_{t+1}(j)] / [Σ_{t=1}^{T-1} Σ_{j=1}^{N} α̂_t(i) a_ij b_j(o_{t+1}) β̂_{t+1}(j)]   (6.98)

but each α̂_t(i) can be written as

α̂_t(i) = [Π_{s=1}^{t} c_s] α_t(i) = C_t α_t(i)   (6.99)

and each β̂_{t+1}(j) can be written as

β̂_{t+1}(j) = [Π_{s=t+1}^{T} c_s] β_{t+1}(j) = D_{t+1} β_{t+1}(j).   (6.100)

Thus Eq. (6.98) can be written as
`
`
`
`
`368
`
`Chap. 6
`
`Theory and Implementation of Hidden Marko
`v •v1odels
`
`~A
`
ā_ij = [Σ_{t=1}^{T-1} C_t α_t(i) a_ij b_j(o_{t+1}) D_{t+1} β_{t+1}(j)] / [Σ_{t=1}^{T-1} Σ_{j=1}^{N} C_t α_t(i) a_ij b_j(o_{t+1}) D_{t+1} β_{t+1}(j)]   (6.101)

Finally, the term C_t D_{t+1} can be seen to be of the form

C_t D_{t+1} = Π_{s=1}^{t} c_s Π_{s=t+1}^{T} c_s = Π_{s=1}^{T} c_s = C_T   (6.102)

independent of t. Hence the terms C_t D_{t+1} cancel out of both the numerator and denominator of Eq. (6.101), and the exact reestimation equation is therefore realized.
It should be obvious that the above scaling procedure applies equally well to reestimation of the π or B coefficients. It should also be obvious that the scaling procedure of Eq. (6.95) need not be applied at every time instant t, but can be performed whenever desired, or whenever necessary (e.g., to prevent underflow). If scaling is not performed at some instant t, the scaling coefficients c_t are set to 1 at that time, and all the conditions discussed above are then met.
The only real change to the HMM procedure because of scaling is the procedure for computing P(O|λ). We cannot merely sum up the α̂_T(i) terms, because these are scaled already. However, we can use the property that

Π_{t=1}^{T} c_t Σ_{i=1}^{N} α_T(i) = C_T Σ_{i=1}^{N} α_T(i) = 1.   (6.103)

Thus we have

Π_{t=1}^{T} c_t · P(O|λ) = 1   (6.104)

or

P(O|λ) = 1 / Π_{t=1}^{T} c_t   (6.105)

or

log [P(O|λ)] = - Σ_{t=1}^{T} log c_t.   (6.106)

Thus the log of P can be computed, but not P, since it would be out of the dynamic range of the machine anyway.
Finally, we note that when using the Viterbi algorithm to give the maximum likelihood state sequence, no scaling is required if we use logarithms as discussed in the alternate Viterbi implementation.
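The scaling recursion of Eqs. (6.94)-(6.95) together with the log-likelihood identity of Eq. (6.106) amounts to renormalizing the forward vector at every frame while accumulating log c_t. A compact sketch (the model arrays are hypothetical; only the standard-HMM forward pass is shown):

```python
import numpy as np

def scaled_forward(pi, A, B, obs):
    """Forward pass with the per-frame scaling of Eqs. (6.94)-(6.95).

    Returns the scaled alpha-hat array and log P(O|lambda) via Eq. (6.106):
    since c_t = 1 / sum_i alpha-tilde_t(i), we have
    log P(O|lambda) = -sum_t log c_t = sum_t log(sum_i alpha-tilde_t(i)).
    """
    T, N = len(obs), len(pi)
    alpha_hat = np.zeros((T, N))
    log_c = np.zeros(T)
    tilde = pi * B[:, obs[0]]              # Eq. (6.19), before scaling
    for t in range(T):
        if t > 0:                          # Eq. (6.94a), from scaled alphas
            tilde = (alpha_hat[t - 1] @ A) * B[:, obs[t]]
        norm = tilde.sum()                 # 1 / c_t
        alpha_hat[t] = tilde / norm        # Eqs. (6.94b)-(6.94c)
        log_c[t] = -np.log(norm)           # log c_t
    return alpha_hat, -log_c.sum()         # Eq. (6.106)
```

Even for T in the thousands this never underflows, whereas multiplying the raw a and b terms directly would reach zero long before t = 100.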
`
`
`
`
`
6.12.2 Multiple Observation Sequences
`
In Section 6.5 we discussed a form of HMM called the left-right or Bakis model, in which the state proceeds from state 1 at t = 1 to state N at t = T in a sequential manner (recall the model of Figure 6.8(b)). We have already discussed how a left-right model imposes constraints on the state-transition matrix and the initial state probabilities (Eqs. (6.45)-(6.48)). However, the major problem with left-right models is that one cannot use a single observation sequence to train the model (i.e., for reestimation of model parameters). This is because the transient nature of the states within the model allows only a small number of observations for any state (until a transition is made to a successor state). Hence, to have sufficient data to make reliable estimates of all model parameters, one has to use multiple observation sequences ([18]).
The modification of the reestimation procedure is straightforward and is as follows. We denote the set of K observation sequences as

O = [O^(1), O^(2), ..., O^(K)]   (6.107)

where O^(k) = (o_1^(k) o_2^(k) ... o_{T_k}^(k)) is the kth observation sequence. We assume each observation sequence is independent of every other observation sequence, and our goal is to adjust the parameters of the model λ to maximize

P(O|λ) = Π_{k=1}^{K} P(O^(k)|λ)   (6.108)
        = Π_{k=1}^{K} P_k.   (6.109)
Since the reestimation formulas are based on frequencies of occurrence of various events, the reestimation formulas for multiple observation sequences are modified by adding together the individual frequencies of occurrence for each sequence. Thus the modified reestimation formulas for ā_ij and b̄_j(ℓ) are

ā_ij = [Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k-1} α_t^k(i) a_ij b_j(o_{t+1}^(k)) β_{t+1}^k(j)] / [Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k-1} α_t^k(i) β_t^k(i)]   (6.110)

and

b̄_j(ℓ) = [Σ_{k=1}^{K} (1/P_k) Σ_{t=1, s.t. o_t^(k)=v_ℓ}^{T_k-1} α_t^k(j) β_t^k(j)] / [Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k-1} α_t^k(j) β_t^k(j)]   (6.111)
`
`
`
`
`370
`
`Chap. 6
`
`Theory and Implementation of Hidden Markov Models
`
and π_1 is not reestimated, since π_1 = 1, π_i = 0, i ≠ 1.
The proper scaling of Eqs. (6.110)-(6.111) is now straightforward, since each observation sequence has its own scaling factor. The key idea is to remove the scaling factor from each term before summing. This can be accomplished by writing the reestimation equations in terms of the scaled variables, i.e.,

ā_ij = [Σ_{k=1}^{K} Σ_{t=1}^{T_k-1} α̂_t^k(i) a_ij b_j(o_{t+1}^(k)) β̂_{t+1}^k(j)] / [Σ_{k=1}^{K} Σ_{t=1}^{T_k-1} Σ_{j=1}^{N} α̂_t^k(i) a_ij b_j(o_{t+1}^(k)) β̂_{t+1}^k(j)]   (6.112)

In this manner, for each sequence O^(k), the same scale factors will appear in each term of the sum over t as appears in the P_k term, and hence will cancel exactly. Thus using the scaled values of the αs and βs results in an unscaled ā_ij. A similar result is obtained for the b̄_j(ℓ) term.
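In code, Eqs. (6.110)-(6.111) just accumulate per-sequence numerators and denominators (each weighted by 1/P_k) before dividing once at the end. A schematic sketch with invented numbers; the per-sequence statistics here stand in for the sums a full forward-backward pass would produce:

```python
def pooled_reestimate(per_seq_stats):
    """Combine per-sequence counts into one estimate, as in Eqs. (6.110)-(6.111).

    per_seq_stats: list of (numerator, denominator, P_k) triples, where the
    numerator/denominator are the unscaled per-sequence sums and P_k is the
    sequence likelihood.  Summing first and dividing once is what distinguishes
    the multiple-sequence formulas from averaging per-sequence ratios.
    """
    num = sum(n / pk for n, _, pk in per_seq_stats)
    den = sum(d / pk for _, d, pk in per_seq_stats)
    return num / den

# toy numbers: pooling is not the same as averaging the individual ratios
stats = [(2.0, 4.0, 0.5), (1.0, 8.0, 0.25)]
pooled = pooled_reestimate(stats)                       # (4 + 4) / (8 + 32) = 0.2
mean_of_ratios = sum(n / d for n, d, _ in stats) / 2    # 0.3125
```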
`
`6.12.3
`
`Initial Estimates of HMM Parameters
`
`In theory, the reestimation equations should give values of the HMM parameters that
`correspond to a local maximum of the likelihood function. A key question is, therefore,
`How do we choose initial estimates of the HMM parameters so that the local maximum is
`equal to or as close as possible to the global maximum of the likelihood function?
`Basically there is no simple or straightforward answer. Instead, experience has shown
`that either random (subject to the stochastic and the nonzero value constraints) or uniform
`initial estimates of the 1r and A parameters are adequate for giving useful reestimates of these
`parameters in almost all cases. However, for the B parameters, experience has shown that
`good initial estimates are helpful in the discrete symbol case and are essential (when dealing
`with multiple mixtures) in the continuous-distribution case. Such initial estimates can be
`obtained in a number of ways; these include (a) manual segmentation of the observation
`sequence(s) into states and averaging of observations within states, (b) maximum likelihood
`segmentation of observations and averaging, and (c) segmental k-means segmentation with
`clustering, etc. We discuss such segmentation techniques later in this chapter.
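One simple flavor of option (a) can be sketched directly: split the training sequence evenly into N state segments (a crude stand-in for a manual or maximum likelihood segmentation) and use within-segment symbol frequencies as the initial B. This is our own illustrative recipe, not one the text specifies:

```python
import numpy as np

def uniform_segment_init(obs, N, M):
    """Initial discrete emission estimates b_j(k) from a uniform segmentation.

    obs: observation symbol indices; N states; M symbols.  A small additive
    floor keeps every initial estimate nonzero (cf. Section 6.12.4).
    """
    T = len(obs)
    B = np.full((N, M), 1e-3)                       # floor avoids zero estimates
    bounds = [round(n * T / N) for n in range(N + 1)]
    for j in range(N):
        for o in obs[bounds[j]:bounds[j + 1]]:      # count symbols in segment j
            B[j, o] += 1.0
    return B / B.sum(axis=1, keepdims=True)         # rows are distributions
```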
`
6.12.4 Effects of Insufficient Training Data
`
`Another problem associated with training HMM parameters via reestimation methods is
`that the observation sequence used for training is, of necessity, finite ([30]). Thus there
`is always an inadequate number of occurrences of low-probability events (e.g., symbol
`occurrences within states) to give good estimates of the model parameters. By way of
`example, consider the case of a discrete observation HMM. Recall that the reestimation
`transformation of bj(k), Eq. (6.40c), requires a count of the expected number of times in
`state j and observing symbol vk simultaneously. If the training sequence is so small that
it does not have any occurrences of this event (i.e., q_t = j and o_t = v_k), then b_j(k) = 0 and
`
`
`
`
b̄_j(k) will stay 0 after reestimation. The resultant model would produce zero likelihood for any observation sequence that actually includes (o_t = v_k and q_t = j). This outcome is obviously a consequence of the unreliable estimates caused by the insufficiency of the training set.
One solution to this problem is to increase the size of the training set. Often this is impractical. A second possible solution is to reduce the size of the model (e.g., number of states, number of symbols per state). Although this is always possible, often there are physical reasons why a given model is used, and therefore the model size cannot be changed. A third possible solution is to select alternative parameter estimation algorithms that can somehow enhance the reliability of the parameter estimates even based on limited training data. Deleted interpolation and parameter smoothing are two such alternatives. Since deleted interpolation is considered more than just a parameter estimation method, we discuss that subject in the next section.
The simplest way to handle the effects of insufficient training data is to add threshold constraints to the model parameters to ensure that no model parameter estimate falls below a specified level [18]. Thus, for example, we might specify, for a discrete symbol model, that

b̄_j(k) = b_j(k)   if b_j(k) ≥ δ_b
        = δ_b     otherwise   (6.113a)

or, for a continuous distribution model, that

Ū_jk(r, r) = U_jk(r, r)   if U_jk(r, r) ≥ δ_U
           = δ_U          otherwise.   (6.113b)

When the numeric floor is invoked in the reestimation equations, all remaining parameters need to be rescaled so that the densities obey the required stochastic constraints. Such postprocessor techniques are thus considered implementational measures to combat the insufficient data problem and have been applied with good success to several problems in speech processing. The method of parameter thresholding has a justification from a Bayesian statistics point of view. It can be shown that Eq. (6.113b) is, in fact, a maximum a posteriori (MAP) estimate of the variance under the assumption that the parameter prior P(U_jk(r, r)) is a noninformative one with uniform distribution and (U_jk(r, r))_min = δ_U [31]. (See Section 6.13.)
`
`·6.12.5 Choice of Model
`
`The 1remaining issues in implementing HMMs are the choice of type of model ergodjc
`and cboa (Jf
`or lef1-right or some other form), choice of model size (number of--,,~
`observation -symbols (discrete or continuous. single or multimixtum-cboice of oblervalim
`parameters). Unfortunately.
`there is no simple. theoreticall COIRCt
`of making
`choices. These choices must be made depending on lhe signal being modeled.
`ilh dae
`comments, we end our discussion of the theorerieal aspects of hidden Mln&:W 111oaa:u
`such models ha e been applied lO •
`and proceed to a discussion of ho
`11· OlallCd
`recognition .problem.
`
`
`
`
`372
`
`Chap. 6
`
`Theory and Implementation of Hidden Markov Models
`
`6.13
`
`IMPROVING iTHE EFFECTIVENESS OF MODEL ESTIMATES
`
We discuss three methods that have been shown to be able to enhance the effectiveness of HMM model estimates for speech recognition. These are (1) deleted interpolation, (2)
`Bayesian adaptation, and (3) corrective training. The first two methods are motivated by
`the problem of insufficient data, while the last method has the unique objective of trying to
`reduce recognition errors directly.
`
`6.13.1 Deleted Interpolation
`
`When training data are insufficient, reliable and robust determination of HMM parameters
`cannot be accomplished. The HMM obtained by the Baum-Welch reestimation method,
`based on the maximum likelihood criterion, may be adequate in characterizing the training
`data, but for new data the match may be quite poor. One parameter estimation method that
`aims to improve the model reliability is the method of "deleted interpolation."
`The concept involves combining two (or more) separately trained models, one of
`which is more reliably trained than the other. A scenario in which this can happen is the
`case when we use tied states which forces "different" states to share an identical statistical
`characterization, effectively reducing the number of parameters in the model. A model with
`tied states is often more robust than a model without tied states when trained on the same
`amount of data. But a model with tied states is also less precise than a model without tied
`states if the training data are sufficient. Therefore, the idea of combining the two models
`is to allow us to fall back to the more reliable model when the supposedly more precise
`model is, in fact, unreliable. A similar scenario occurs when context-independent (more
`robust) and context-dependent (more precise) phone models are used in large vocabulary
`recognition (see Chapter 8).
Let the two models be defined by the parameter sets λ = (A, B, π) and λ' = (A', B',