`
`Chap. 6
`
`Theory and Implementation of Hidden Markov Models
`
`Figure 6.11
`Illustration of general interstate connections of (a) a normal HMM with
`exponential state duration density, and (b) a variable duration HMM with specified state
`densities and no self-transitions from a state back to itself.
`
`6.9
`
`INCLUSION OF EXPLICIT STATE DURATION DENSIT¥ IN HMMS
`
`(6.64)
`
`Earlier we showed via Eq. (6.5) that the inherent duration probability density p;(d)
`associated with state i, with self-transition coefficient a;;, was of the form
`p;(d) = (a;;l-
`1(1 - a;;)
`= probability of d consecutive observations in state i.
For most physical signals, this exponential state duration density is inappropriate. Instead we would prefer to explicitly model duration density in some analytic form. (An extensive treatment of state duration modeling can be found in the work of Ferguson of IDA [14], which is the basis of the material presented here. Other valuable references include [24] and [25].) Figure 6.11 illustrates, for a pair of model states i and j, the differences between HMMs without and with explicit duration density. In part (a) the states have exponential duration densities based on self-transition coefficients a_ii and a_jj, respectively. In part (b), the self-transition coefficients are set to zero, and an explicit duration density is specified. For this case, a transition is made only after the appropriate number of observations have occurred in the state (as specified by the duration density). Such a model is called a semi-Markov model.
Based on the simple model of Figure 6.11(b), the sequence of events of the variable duration HMM is as follows:
`
1. An initial state, q_1 = i, is chosen according to the initial state distribution π_i.
`
`IPR2023-00037
`Apple EX1013 Page 209
`
`
`
2. A duration d_1 is chosen according to the state duration density p_q1(d_1). (For expedience and ease of implementation the duration density p_q(d) is truncated at a maximum duration value D.)
3. Observations o_1 o_2 ... o_d1 are chosen according to the joint observation density b_q1(o_1 o_2 ... o_d1). Generally we assume independence of observations, so that b_q1(o_1 o_2 ... o_d1) = Π_{s=1}^{d_1} b_q1(o_s).
4. The next state, q_2 = j, is chosen according to the state-transition probabilities a_q1q2, with the constraint that a_q1q1 = 0; i.e., no transition back to the same state can occur. (Clearly this is a requirement, because we assume that, in state q_1, exactly d_1 observations occur.)

A little thought should convince the reader that the variable duration HMM can be made equivalent to the standard HMM by setting p_i(d) to be the exponential density of Eq. (6.64).
Using the above formulation, we must make several changes to the formulas of Section 6.4.3 to allow calculation of P(O|λ) and for reestimation of model parameters. In particular we assume that the first state begins at t = 1 and the last state ends at t = T. We then define the forward variable α_t(i) as

α_t(i) = P(o_1 o_2 ... o_t, the stay in state i ends at t | λ).   (6.65)

We assume that a total of r states have been visited during the first t observations, and we denote the states as q_1, q_2, ..., q_r, with durations associated with each state of d_1, d_2, ..., d_r. Thus the constraints of Eq. (6.65) are

q_r = i   (6.66a)

Σ_{s=1}^{r} d_s = t.   (6.66b)

Eq. (6.65) can then be written as

α_t(i) = Σ_q Σ_d π_q1 p_q1(d_1) P(o_1 o_2 ... o_d1 | q_1) a_q1q2 p_q2(d_2) P(o_{d_1+1} ... o_{d_1+d_2} | q_2) ⋯ a_{q_{r-1}q_r} p_qr(d_r) P(o_{d_1+...+d_{r-1}+1} ... o_t | q_r)   (6.67)

where the sum is over all states q and all possible state durations d. By induction we can write α_t(j) as

α_t(j) = Σ_{i=1}^{N} Σ_{d=1}^{D} α_{t-d}(i) a_ij p_j(d) Π_{s=t-d+1}^{t} b_j(o_s)   (6.68)

where D is the maximum duration within any state. To initialize the computation of α_t(j) we use

α_1(i) = π_i p_i(1) b_i(o_1)   (6.69a)
`
`
`
`
`360
`
`Theory and Implementation of Hidden Markov Models
`
`Chap. 6
α_2(i) = π_i p_i(2) Π_{s=1}^{2} b_i(o_s) + Σ_{j=1, j≠i}^{N} α_1(j) a_ji p_i(1) b_i(o_2)   (6.69b)

α_3(i) = π_i p_i(3) Π_{s=1}^{3} b_i(o_s) + Σ_{d=1}^{2} Σ_{j=1, j≠i}^{N} α_{3-d}(j) a_ji p_i(d) Π_{s=4-d}^{3} b_i(o_s)   (6.69c)

and so on, until α_D(i) is computed; then Eq. (6.68) can be used for all t > D. It should be clear that the desired probability of O given the model λ can be written in terms of the αs as

P(O|λ) = Σ_{i=1}^{N} α_T(i)   (6.70)
`
`as was previously used for ordinary HMMs.
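The recursion of Eqs. (6.68)-(6.70), with the initialization of Eq. (6.69), can be sketched in code. This is an illustrative sketch under our own conventions (the array layout pi/A/B/P and the brute-force emission products are our choices, not something prescribed by the text):

```python
import numpy as np

def duration_hmm_forward(pi, A, B, P, obs, D):
    """P(O|lambda) for a variable duration HMM via Eqs. (6.68)-(6.70).

    pi : (N,) initial state distribution
    A  : (N, N) transition matrix, zero diagonal (no self-transitions)
    B  : (N, M) discrete emission probabilities
    P  : (N, D) duration probabilities, P[j, d-1] = p_j(d)
    obs: list of observation symbol indices o_1 .. o_T
    """
    N, T = len(pi), len(obs)
    alpha = np.zeros((T + 1, N))      # alpha[t, i]: stay in state i ends at t
    for t in range(1, T + 1):
        for j in range(N):
            for d in range(1, min(D, t) + 1):
                # product of emissions over the d observations spent in state j
                emit = np.prod(B[j, obs[t - d:t]])
                if d == t:
                    # first state occupies o_1 .. o_t (initialization, Eq. (6.69))
                    alpha[t, j] += pi[j] * P[j, d - 1] * emit
                else:
                    # Eq. (6.68): enter state j after some state i ends at t - d
                    alpha[t, j] += alpha[t - d] @ A[:, j] * P[j, d - 1] * emit
    return alpha[T].sum()             # Eq. (6.70)
```

With D = 1 and p_i(1) = 1 this reduces to a standard forward pass over the non-self-loop transitions, which makes small hand-checked cases easy to verify.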
`To give reestimation formulas for all the variables of the variable duration HMM, we
`must define three more forward-backward variables, namely
α*_t(i) = P(o_1 o_2 ... o_t, stay in state i starts at t + 1 | λ)   (6.71)

β_t(i) = P(o_{t+1} ... o_T | stay in state i ends at t, λ)   (6.72)

β*_t(i) = P(o_{t+1} ... o_T | stay in state i starts at t + 1, λ).   (6.73)
`
The relationships between α, α*, β, and β* are as follows:

α*_t(j) = Σ_{i=1}^{N} α_t(i) a_ij   (6.74)

α_t(i) = Σ_{d=1}^{D} α*_{t-d}(i) p_i(d) Π_{s=t-d+1}^{t} b_i(o_s)   (6.75)

β_t(i) = Σ_{j=1}^{N} a_ij β*_t(j)   (6.76)

β*_t(i) = Σ_{d=1}^{D} β_{t+d}(i) p_i(d) Π_{s=t+1}^{t+d} b_i(o_s).   (6.77)
`
`•
`•
`Based on the above relationships and definitio
`ns, the reeSttmatton formulas for the variable
`duration HMM with discrete obse
`t·
`'
`rva ions, are
`
`ft· - 1r;/3o U)
`I - P(OI-X)
`
`(6.78)
`
`
`
`
`
ā_ij = [Σ_{t=1}^{T} α_t(i) a_ij β*_t(j)] / [Σ_{j=1}^{N} Σ_{t=1}^{T} α_t(i) a_ij β*_t(j)]   (6.79)

b̄_i(k) = [Σ_{t=1, s.t. o_t=v_k}^{T} (Σ_{τ<t} α*_τ(i) β*_τ(i) − Σ_{τ<t} α_τ(i) β_τ(i))] / [Σ_{k=1}^{M} Σ_{t=1, s.t. o_t=v_k}^{T} (Σ_{τ<t} α*_τ(i) β*_τ(i) − Σ_{τ<t} α_τ(i) β_τ(i))]   (6.80)

p̄_i(d) = [Σ_{t=1}^{T} α*_t(i) p_i(d) β_{t+d}(i) Π_{s=t+1}^{t+d} b_i(o_s)] / [Σ_{d=1}^{D} Σ_{t=1}^{T} α*_t(i) p_i(d) β_{t+d}(i) Π_{s=t+1}^{t+d} b_i(o_s)]   (6.81)
`
The interpretation of the reestimation formulas is the following: The formula for π̄_i is the probability that state i was the first state, given O. The formula for ā_ij is almost the same as for the usual HMM, except it uses the condition that the alpha terms in which a state ends at t join with the beta terms in which a new state begins at t + 1. The formula for b̄_i(k) is the expected number of times that observation o_t = v_k occurred in state i, normalized by the expected number of times that any observation occurred in state i. Finally, the reestimation formula for p̄_i(d) is the ratio of the expected number of times state i occurred with duration d to the expected number of times state i occurred with any duration.
`The importance of incorporating state duration densities is reflected in the observation
`that, for some problems, the quality of the modeling is significantly improved when explicit
`state duration densities are used. However, there are drawbacks to the use of the variable
`duration model discussed in this section. One is the greatly increased computational load
`associated with using variable durations. It can be seen from the definition and initialization
conditions on the forward variable α_t(i), from Eqs. (6.68)-(6.69), that about D times the storage and D^2/2 times the computation is required. For D on the order of 25 (as is reasonable for many speech-processing problems), computation is increased by a factor of 300. Another problem with the variable duration models is the large number of parameters (D), associated with each state, that must be estimated, in addition to the usual HMM parameters. Furthermore, for a fixed number of observations T in the training set, there are, on average, fewer state transitions and much less data to estimate p_i(d) than would be used in a standard HMM. Thus the reestimation problem is more difficult for variable duration HMMs than for the standard HMM.
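The rough cost argument above can be made concrete. In our reading, the sum over d = 1..D in Eq. (6.68) needs a product of d emission terms per state and frame, versus a single term in the standard recursion; a tiny sketch of the arithmetic:

```python
def duration_hmm_overhead(D):
    """Rough work factor of the duration recursion (Eq. (6.68)) relative to
    the standard forward recursion: the sum over d = 1..D needs a product
    of d emission terms, i.e. about D*(D+1)/2 operations per state and frame."""
    return D * (D + 1) / 2

storage_factor = 25                           # about D times the alpha storage
compute_factor = duration_hmm_overhead(25)    # 325.0, i.e. roughly the ~300x cited
```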
One proposal to alleviate some of these problems is to use a parametric state duration density instead of the nonparametric p_i(d) used above [23-24]. In particular, proposals
`
`
include the Gaussian family with

p_i(d) = N(d, μ_i, σ_i^2)   (6.82)

with parameters μ_i and σ_i^2, or the Gamma family with

p_i(d) = η_i^{ν_i} d^{ν_i - 1} e^{-η_i d} / Γ(ν_i)   (6.83)

with parameters ν_i and η_i and with mean ν_i η_i^{-1} and variance ν_i η_i^{-2}. Reestimation formulas for η_i and ν_i have been derived and used with good results. Another possibility, which has been used with good success, is to assume a uniform duration distribution over an appropriate range of durations and use a path-constrained Viterbi decoding procedure.
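As one concrete reading of Eq. (6.83), a parametric duration density can be evaluated on d = 1..D and renormalized into a proper discrete distribution. The truncation-and-renormalization step below is our own assumption for illustration, not something the text prescribes:

```python
import math

def gamma_duration_density(nu, eta, D):
    """Discretized Gamma duration density p_i(d) of Eq. (6.83), truncated at D."""
    p = [eta**nu * d**(nu - 1) * math.exp(-eta * d) / math.gamma(nu)
         for d in range(1, D + 1)]
    total = sum(p)
    return [x / total for x in p]     # renormalize over d = 1..D

# hypothetical parameters: shape nu = 4, rate eta = 0.5, mean nu/eta = 8 frames
p = gamma_duration_density(nu=4.0, eta=0.5, D=25)
mean_d = sum(d * pd for d, pd in zip(range(1, 26), p))
```

Only the two parameters (ν_i, η_i) per state need to be estimated, instead of the D free values of a nonparametric p_i(d).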
`
`6.10 OPTIMIZATION CRITERION-ML, MMI, AND MDI
`
The standard ML design criterion is to use a training sequence of observations O to derive the set of model parameters λ, yielding

(λ)_ML = arg max_λ P(O|λ).   (6.84)
`
Any of the reestimation algorithms discussed previously provides a solution to this optimization problem.
The need to consider alternative design criteria, however, comes from several concerns ([26-28]). The basic philosophy in statistical modeling methods, such as HMM, is that the signal or observation sequence can be well modeled if the parameters of the model are carefully and correctly chosen. The problem with this philosophy is that the assumed model (HMM in the present case) is sometimes inadequate to model the observed signal, so that no matter how carefully the parameters are chosen, the modeling accuracy is limited. Often, this situation is described as a "model mismatch." The first alternative optimization criterion we discuss here is one that tries to overcome the problem of model mismatch in order to achieve a more accurate modeling of the observation signal.
The observed signal O = (o_1, o_2, ..., o_T) is associated with a sequence of constraints R = (R_1, R_2, ..., R_T). For example, R_t may be the autocorrelation matrix that characterizes the observation o_t. Then, obviously, O is only one of possibly uncountably many observation sequences that satisfy the constraint sequence R. Furthermore, in terms of the probability distributions of the observation sequences, there exists a set of such distributions that would also satisfy R. We denote this set Q(R). The minimum discrimination information (MDI) is a measure of closeness between two probability measures (one of which bears the HMM form here) under the given constraint R and is defined by
`
ν(R, P_λ) = inf_{Q ∈ Q(R)} I(Q : P_λ)   (6.85)

where

I(Q : P_λ) = ∫ q(O) log [q(O) / p(O|λ)] dO   (6.86)
`
`
`
`
`
is the discrimination information between distributions Q and P_λ [27,28]. (The functions q(·) and p(·|λ) are the probability density functions corresponding to Q and P_λ, respectively.) The discrimination information is calculated based on the given training set of observations.
The MDI criterion tries to choose a model parameter set λ such that ν(R, P_λ) is minimized. An interpretation of MDI is that the model parameter set λ is chosen so that the model p(O|λ) is as close as it can be to a member of the set Q(R). Since the closeness is always measured in terms of the discrimination information evaluated on the given observations, the intrinsic characteristics of the training sequences would then have substantial influence on the parameter selection. By emphasizing the measure of discrimination, the model estimation is no longer solely dictated by the assumed model form. The MDI optimization problem is, however, not as straightforward as the ML optimization problem, and no simple robust implementation of the procedure is known.
Another concern about the HMM optimization criterion arises when we attempt to use it to solve a class of speech-recognition problems. Consider recognition of a vocabulary of V words, each of which is represented by an HMM, with parameter set λ_v, v = 1, 2, ..., V. We assume P(v) to be the a priori probability for word v, v = 1, 2, ..., V. The set of HMMs Λ = {λ_v} together with the a priori probabilities thus defines a probability measure for an arbitrary observation sequence O,

P_Λ(O) = Σ_{v=1}^{V} P(O|λ_v, v) P(v).   (6.87)
`
(The notation P(O|λ_v, v) indicates that it is a probability conditioned on the word v. We include the model parameter λ_v, sometimes, because of the necessity of treating λ_v as random variables for estimation purposes. Obviously, when λ_v is fixed, P(O|λ_v, v) is the conditional probability, parameterized by λ_v.) To train these models (i.e., to estimate the optimum parameters of the associated models), utterances of known (labeled) words are used. We denote the labeled training sequence by O^v, where superscript v reflects the fact that O^v is a rendition of word v. The standard ML criterion of Eq. (6.84) is to use O^v to estimate model parameters λ_v, yielding

(λ_v)_ML = arg max_λ P(O^v|λ).

Each model is estimated separately using the correspondingly labeled training observation sequence(s). The resultant models, however, need not be the optimal solution for minimizing the probability of recognition error.
An alternative design criterion that aims at maximizing the "discrimination" of each model (i.e., the ability to distinguish between observation sequences generated by the correct word model and those generated by alternative models) is the maximum mutual information (MMI) criterion [26]. The mutual information between an observation sequence O^v and the word v, parameterized by Λ = {λ_v}, v = 1, 2, ..., V, is

I_Λ(O^v, v) = log [P(O^v, v|Λ) / (P_Λ(O^v) P(v))]   (6.88)
`Since
`
`
`
`
`364
`
`Chap. 6
`
`Theory and Implementation of Hidden Markov Models
`
`I
`I
`
`~ (,
`
`I
`
I_Λ(O^v, v) = log P(O^v|λ_v) − log Σ_{w=1}^{V} P(O^v|λ_w, w) P(w).   (6.89)

The MMI criterion is to find the entire model set Λ such that the mutual information is maximized,

(Λ)_MMI = max_Λ { Σ_{v=1}^{V} I_Λ(O^v, v) }.   (6.90)
`
The MMI criterion is obviously different from the ML criterion. Both are minimum cross-entropy approaches. In the ML approach, an HMM for the distribution of the data given the word is matched to the empirical distribution. In the MMI approach, a model for the distribution of the word given the data is matched to the empirical distribution. This explains the merit of the MMI approach. The optimization procedure for the MMI approach involves the entire model parameter set Λ even if only one labeled training sequence O^v is used. The ML criterion addresses the likelihood P(O^v|λ_v) alone, while the MMI criterion compares the likelihood P(O^v|λ_v) against the "probability background" P_Λ(O^v) and attempts to maximize the difference. However, (Λ)_MMI is not as straightforward to obtain as (Λ)_ML. One often has to use general optimization procedures like the descent algorithms to solve Eq. (6.90). Such optimization procedures often lead to numerical problems in implementation.
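The difference between the two criteria can be seen numerically. Given hypothetical per-model likelihoods P(O^v|λ_w) for a 3-word vocabulary (the numbers below are invented for illustration), the ML score looks only at the correct model's likelihood, while the MMI objective of Eq. (6.89) subtracts the log of the probability background:

```python
import math

# hypothetical likelihoods P(O^v | lambda_w) for one utterance of word v = 0,
# scored under models w = 0, 1, 2
likelihood = [2.0e-5, 1.5e-5, 0.4e-5]
prior = [1 / 3, 1 / 3, 1 / 3]            # P(w)

v = 0
ml_score = math.log(likelihood[v])                        # ML: log P(O^v | lambda_v)
background = sum(l * p for l, p in zip(likelihood, prior))  # P_Lambda(O^v)
mmi_score = ml_score - math.log(background)               # Eq. (6.89)
# mmi_score grows only when the correct model dominates the background,
# which is what the discriminative criterion tries to maximize.
```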
`
`6.11 COMPARISONS OF HMMS
`
An interesting question associated with HMMs is the following: Given two HMMs, λ_1 and λ_2, what is a reasonable measure of the similarity of the two models ([29])? Consider the case of two models
`
with

A_1 = [ p    1-p ]        B_1 = [ q    1-q ]        π_1 = [1/2  1/2]
      [ 1-p  p   ]              [ 1-q  q   ]

and

A_2 = [ r    1-r ]        B_2 = [ s    1-s ]        π_2 = [1/2  1/2].
      [ 1-r  r   ]              [ 1-s  s   ]
`
For λ_1 to be equivalent to λ_2, in the sense of having the same statistical properties for the observation symbols, i.e., E[o_t = v_k|λ_1] = E[o_t = v_k|λ_2], for all v_k, we require

pq + (1 - p)(1 - q) = rs + (1 - r)(1 - s)
`
`
`
`
`
or, by solving for s, we get

s = (p + q - 2pq - r) / (1 - 2r).

By choosing (arbitrarily) p = 0.6, q = 0.7, r = 0.2, we get s = 13/30 ≈ 0.433. Thus, even when the two models, λ_1 and λ_2, look ostensibly very different (i.e., A_1 is very different from A_2 and B_1 is very different from B_2), statistical equivalence of the models can occur.
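The algebra can be checked directly: given p, q, and r, the value of s above makes the two sides of the equivalence condition agree. A quick verification sketch:

```python
def equivalent_s(p, q, r):
    # solve pq + (1-p)(1-q) = rs + (1-r)(1-s) for s
    return (p + q - 2 * p * q - r) / (1 - 2 * r)

p, q, r = 0.6, 0.7, 0.2
s = equivalent_s(p, q, r)                  # 13/30, about 0.433
lhs = p * q + (1 - p) * (1 - q)
rhs = r * s + (1 - r) * (1 - s)
assert abs(lhs - rhs) < 1e-12
```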
We can generalize [29] the concept of model distance (dissimilarity) by defining a distance measure D(λ_1, λ_2), between two Markov models, λ_1 and λ_2, as

D(λ_1, λ_2) = (1/T) [log P(O^(2)|λ_1) - log P(O^(2)|λ_2)]   (6.91)

where O^(2) = (o_1 o_2 o_3 ... o_T) is a sequence of observations generated by model λ_2. Basically, Eq. (6.91) is a measure of how well model λ_1 matches observations generated by model λ_2, relative to how well model λ_2 matches observations generated by itself. Several interpretations of Eq. (6.91) exist in terms of cross-entropy, or divergence, or discrimination information [29].
One of the problems with the distance measure of Eq. (6.91) is that it is nonsymmetric. Hence a natural expression of this measure is the symmetrized version, namely

D_s(λ_1, λ_2) = [D(λ_1, λ_2) + D(λ_2, λ_1)] / 2.   (6.92)
`
`6.12 IMPLEMENTATION ISSUES FOR HMMS
`
`The discussion in the previous sections has dealt primarily with the theory of HMMs
`and several variations on the form of the model. In this section we deal with several
`practical implementation issues, including scaling, multiple observation sequences, initial
`parameter estimates, missing data, and choice of model size and type. For some of these
`implementation issues we can prescribe exact analytical solutions; for other issues we can
`provide only some seat-of-the-pants experience gained from working with HMMs.
`
`6.12.1 Scaling
`
To understand why scaling ([18,23]) is required for implementing the reestimation procedure of HMMs, consider the definition of α_t(i) of Eq. (6.18). It can be seen that α_t(i) consists of the sum of a large number of terms, each of the form

( Π_{s=1}^{t-1} a_{q_s q_{s+1}} ) ( Π_{s=1}^{t} b_{q_s}(o_s) )

with q_t = i, and b a discrete probability as defined by Eq. (6.8). Since each a and b term is less than 1 (generally significantly less than 1), it can be seen that as t starts to get big (e.g., 10 or more), each term of α_t(i) starts to head exponentially to zero. For sufficiently large t (e.g., 100 or more) the dynamic range of the α_t(i) computation will exceed the precision
`
`
`
`366
`
`Chap. 6
`
`Theory and Implementation of Hidden Markov Models
`
range of essentially any machine (even in double precision). Hence the only reasonable way to perform the computation is to incorporate a scaling procedure.
The basic scaling procedure multiplies α_t(i) by a scaling coefficient that is independent of i (i.e., it depends only on t), with the goal of keeping the scaled α_t(i) within the dynamic range of the computer for 1 ≤ t ≤ T. A similar scaling is done to the β_t(i) coefficients (since these also tend to zero exponentially fast) and then, at the end of the computation, the scaling coefficients are canceled out exactly.
To understand this scaling procedure better, consider the reestimation formula for the state-transition coefficients a_ij. If we write the reestimation formula (Eq. (6.40b)) directly in terms of the forward and backward variables, we get
`
ā_ij = [Σ_{t=1}^{T-1} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)] / [Σ_{t=1}^{T-1} Σ_{j=1}^{N} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)]   (6.93)
`
Consider the computation of α_t(i). We use the notation α_t(i) to denote the unscaled αs, α̂_t(i) to denote the scaled (and iterated) αs, and α̃_t(i) to denote the local version of α before scaling. Initially, for t = 1, we compute α_1(i) according to Eq. (6.19) and set α̃_1(i) = α_1(i), with c_1 = 1 / Σ_{i=1}^{N} α̃_1(i), and α̂_1(i) = c_1 α̃_1(i). For each t, 2 ≤ t ≤ T, we first compute α̃_t(i) according to the induction formula (Eq. (6.20)), in terms of the previously scaled α̂_{t-1}(j); that is,

α̃_t(i) = Σ_{j=1}^{N} α̂_{t-1}(j) a_ji b_i(o_t).   (6.94a)

We determine the scaling coefficient c_t as

c_t = 1 / Σ_{i=1}^{N} α̃_t(i)   (6.94b)

giving

α̂_t(i) = c_t α̃_t(i).   (6.94c)

From Eqs. (6.94a)-(6.94c) we can write the scaled α̂_t(i) as

α̂_t(i) = [Σ_{j=1}^{N} α̂_{t-1}(j) a_ji b_i(o_t)] / [Σ_{i=1}^{N} Σ_{j=1}^{N} α̂_{t-1}(j) a_ji b_i(o_t)]   (6.95)
`
`
`
`
`
By induction we can write α̂_{t-1}(j) as

α̂_{t-1}(j) = [Π_{s=1}^{t-1} c_s] α_{t-1}(j).   (6.96a)

Thus we can write α̂_t(i) as

α̂_t(i) = α_t(i) / Σ_{i=1}^{N} α_t(i)   (6.96b)

that is, each α_t(i) is effectively scaled by the sum over all states of α_t(i).
Next we compute the β_t(i) terms from the backward recursion. The only difference here is that we use the same scale factors for each time t for the betas as was used for the alphas. Hence the scaled βs are of the form

β̂_t(i) = c_t β_t(i).   (6.97)

Because each scale factor effectively restores the magnitude of the α terms to 1, and because the magnitudes of the α and β terms are comparable, using the same scaling factors on the βs as was used on the αs is an effective way to keep the computation within reasonable bounds. Furthermore, in terms of the scaled variables, we see that the reestimation Eq. (6.93) becomes

ā_ij = [Σ_{t=1}^{T-1} α̂_t(i) a_ij b_j(o_{t+1}) β̂_{t+1}(j)] / [Σ_{t=1}^{T-1} Σ_{j=1}^{N} α̂_t(i) a_ij b_j(o_{t+1}) β̂_{t+1}(j)]   (6.98)

but each α̂_t(i) can be written as

α̂_t(i) = [Π_{s=1}^{t} c_s] α_t(i) = C_t α_t(i)   (6.99)

and each β̂_{t+1}(j) can be written as

β̂_{t+1}(j) = [Π_{s=t+1}^{T} c_s] β_{t+1}(j) = D_{t+1} β_{t+1}(j).   (6.100)

Thus Eq. (6.98) can be written as
`
`
`
`
`368
`
`Chap. 6
`
`Theory and Implementation of Hidden Marko
`v •v1odels
`
`~A
`
ā_ij = [Σ_{t=1}^{T-1} C_t α_t(i) a_ij b_j(o_{t+1}) D_{t+1} β_{t+1}(j)] / [Σ_{t=1}^{T-1} Σ_{j=1}^{N} C_t α_t(i) a_ij b_j(o_{t+1}) D_{t+1} β_{t+1}(j)]   (6.101)

Finally, the term C_t D_{t+1} can be seen to be of the form

C_t D_{t+1} = Π_{s=1}^{t} c_s Π_{s=t+1}^{T} c_s = Π_{s=1}^{T} c_s = C_T   (6.102)

independent of t. Hence the terms C_t D_{t+1} cancel out of both the numerator and denominator of Eq. (6.101), and the exact reestimation equation is therefore realized.
It should be obvious that the above scaling procedure applies equally well to reestimation of the π or B coefficients. It should also be obvious that the scaling procedure of Eq. (6.95) need not be applied at every time instant t, but can be performed whenever desired, or whenever necessary (e.g., to prevent underflow). If scaling is not performed at some instant t, the scaling coefficients c_t are set to 1 at that time, and all the conditions discussed above are then met.
The only real change to the HMM procedure because of scaling is the procedure for computing P(O|λ). We cannot merely sum up the α̂_T(i) terms, because these are scaled already. However, we can use the property that

Π_{t=1}^{T} c_t Σ_{i=1}^{N} α_T(i) = C_T Σ_{i=1}^{N} α_T(i) = 1.   (6.103)

Thus we have

Π_{t=1}^{T} c_t · P(O|λ) = 1   (6.104)

or

P(O|λ) = 1 / Π_{t=1}^{T} c_t   (6.105)

or

log [P(O|λ)] = - Σ_{t=1}^{T} log c_t.   (6.106)

Thus the log of P can be computed, but not P, since it would be out of the dynamic range of the machine anyway.
Finally, we note that when using the Viterbi algorithm to give the maximum likelihood state sequence, no scaling is required if we use logarithms as discussed in the alternate Viterbi implementation.
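The scaling recursion of Eqs. (6.94)-(6.95) together with the log-likelihood identity of Eq. (6.106) amounts to renormalizing the forward vector at every frame while accumulating log c_t. A compact sketch (the model arrays are hypothetical; only the standard-HMM forward pass is shown):

```python
import numpy as np

def scaled_forward(pi, A, B, obs):
    """Forward pass with the per-frame scaling of Eqs. (6.94)-(6.95).

    Returns the scaled alpha-hat array and log P(O|lambda) via Eq. (6.106):
    since c_t = 1 / sum_i alpha-tilde_t(i), we have
    log P(O|lambda) = -sum_t log c_t = sum_t log(sum_i alpha-tilde_t(i)).
    """
    T, N = len(obs), len(pi)
    alpha_hat = np.zeros((T, N))
    log_c = np.zeros(T)
    tilde = pi * B[:, obs[0]]              # Eq. (6.19), before scaling
    for t in range(T):
        if t > 0:                          # Eq. (6.94a), from scaled alphas
            tilde = (alpha_hat[t - 1] @ A) * B[:, obs[t]]
        norm = tilde.sum()                 # 1 / c_t
        alpha_hat[t] = tilde / norm        # Eqs. (6.94b)-(6.94c)
        log_c[t] = -np.log(norm)           # log c_t
    return alpha_hat, -log_c.sum()         # Eq. (6.106)
```

Even for T in the thousands this never underflows, whereas multiplying the raw a and b terms directly would reach zero long before t = 100.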
`
`
`
`
`
6.12.2 Multiple Observation Sequences
`
In Section 6.5 we discussed a form of HMM called the left-right or Bakis model, in which the state proceeds from state 1 at t = 1 to state N at t = T in a sequential manner (recall the model of Figure 6.8(b)). We have already discussed how a left-right model imposes constraints on the state-transition matrix and the initial state probabilities (Eqs. (6.45)-(6.48)). However, the major problem with left-right models is that one cannot use a single observation sequence to train the model (i.e., for reestimation of model parameters). This is because the transient nature of the states within the model allows only a small number of observations for any state (until a transition is made to a successor state). Hence, to have sufficient data to make reliable estimates of all model parameters, one has to use multiple observation sequences ([18]).
The modification of the reestimation procedure is straightforward and is as follows. We denote the set of K observation sequences as

O = [O^(1), O^(2), ..., O^(K)]   (6.107)

where O^(k) = (o_1^(k) o_2^(k) ... o_{T_k}^(k)) is the kth observation sequence. We assume each observation sequence is independent of every other observation sequence, and our goal is to adjust the parameters of the model λ to maximize

P(O|λ) = Π_{k=1}^{K} P(O^(k)|λ)   (6.108)
        = Π_{k=1}^{K} P_k.   (6.109)
Since the reestimation formulas are based on frequencies of occurrence of various events, the reestimation formulas for multiple observation sequences are modified by adding together the individual frequencies of occurrence for each sequence. Thus the modified reestimation formulas for ā_ij and b̄_j(ℓ) are

ā_ij = [Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k-1} α_t^k(i) a_ij b_j(o_{t+1}^(k)) β_{t+1}^k(j)] / [Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k-1} α_t^k(i) β_t^k(i)]   (6.110)

and

b̄_j(ℓ) = [Σ_{k=1}^{K} (1/P_k) Σ_{t=1, s.t. o_t^(k)=v_ℓ}^{T_k-1} α_t^k(j) β_t^k(j)] / [Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k-1} α_t^k(j) β_t^k(j)]   (6.111)
`
`
`
`
`370
`
`Chap. 6
`
`Theory and Implementation of Hidden Markov Models
`
and π_1 is not reestimated, since π_1 = 1, π_i = 0, i ≠ 1.
The proper scaling of Eqs. (6.110)-(6.111) is now straightforward, since each observation sequence has its own scaling factor. The key idea is to remove the scaling factor from each term before summing. This can be accomplished by writing the reestimation equations in terms of the scaled variables, i.e.,

ā_ij = [Σ_{k=1}^{K} Σ_{t=1}^{T_k-1} α̂_t^k(i) a_ij b_j(o_{t+1}^(k)) β̂_{t+1}^k(j)] / [Σ_{k=1}^{K} Σ_{t=1}^{T_k-1} Σ_{j=1}^{N} α̂_t^k(i) a_ij b_j(o_{t+1}^(k)) β̂_{t+1}^k(j)]   (6.112)

In this manner, for each sequence O^(k), the same scale factors will appear in each term of the sum over t as appears in the P_k term, and hence will cancel exactly. Thus using the scaled values of the αs and βs results in an unscaled ā_ij. A similar result is obtained for the b̄_j(ℓ) term.
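In code, Eqs. (6.110)-(6.111) just accumulate per-sequence numerators and denominators (each weighted by 1/P_k) before dividing once at the end. A schematic sketch with invented numbers; the per-sequence statistics here stand in for the sums a full forward-backward pass would produce:

```python
def pooled_reestimate(per_seq_stats):
    """Combine per-sequence counts into one estimate, as in Eqs. (6.110)-(6.111).

    per_seq_stats: list of (numerator, denominator, P_k) triples, where the
    numerator/denominator are the unscaled per-sequence sums and P_k is the
    sequence likelihood.  Summing first and dividing once is what distinguishes
    the multiple-sequence formulas from averaging per-sequence ratios.
    """
    num = sum(n / pk for n, _, pk in per_seq_stats)
    den = sum(d / pk for _, d, pk in per_seq_stats)
    return num / den

# toy numbers: pooling is not the same as averaging the individual ratios
stats = [(2.0, 4.0, 0.5), (1.0, 8.0, 0.25)]
pooled = pooled_reestimate(stats)                       # (4 + 4) / (8 + 32) = 0.2
mean_of_ratios = sum(n / d for n, d, _ in stats) / 2    # 0.3125
```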
`
`6.12.3
`
`Initial Estimates of HMM Parameters
`
`In theory, the reestimation equations should give values of the HMM parameters that
`correspond to a local maximum of the likelihood function. A key question is, therefore,
`How do we choose initial estimates of the HMM parameters so that the local maximum is
`equal to or as close as possible to the global maximum of the likelihood function?
`Basically there is no simple or straightforward answer. Instead, experience has shown
`that either random (subject to the stochastic and the nonzero value constraints) or uniform
`initial estimates of the 1r and A parameters are adequate for giving useful reestimates of these
`parameters in almost all cases. However, for the B parameters, experience has shown that
`good initial estimates are helpful in the discrete symbol case and are essential (when dealing
`with multiple mixtures) in the continuous-distribution case. Such initial estimates can be
`obtained in a number of ways; these include (a) manual segmentation of the observation
`sequence(s) into states and averaging of observations within states, (b) maximum likelihood
`segmentation of observations and averaging, and (c) segmental k-means segmentation with
`clustering, etc. We discuss such segmentation techniques later in this chapter.
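One simple flavor of option (a) can be sketched directly: split the training sequence evenly into N state segments (a crude stand-in for a manual or maximum likelihood segmentation) and use within-segment symbol frequencies as the initial B. This is our own illustrative recipe, not one the text specifies:

```python
import numpy as np

def uniform_segment_init(obs, N, M):
    """Initial discrete emission estimates b_j(k) from a uniform segmentation.

    obs: observation symbol indices; N states; M symbols.  A small additive
    floor keeps every initial estimate nonzero (cf. Section 6.12.4).
    """
    T = len(obs)
    B = np.full((N, M), 1e-3)                       # floor avoids zero estimates
    bounds = [round(n * T / N) for n in range(N + 1)]
    for j in range(N):
        for o in obs[bounds[j]:bounds[j + 1]]:      # count symbols in segment j
            B[j, o] += 1.0
    return B / B.sum(axis=1, keepdims=True)         # rows are distributions
```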
`
6.12.4 Effects of Insufficient Training Data
`
`Another problem associated with training HMM parameters via reestimation methods is
`that the observation sequence used for training is, of necessity, finite ([30]). Thus there
`is always an inadequate number of occurrences of low-probability events (e.g., symbol
`occurrences within states) to give good estimates of the model parameters. By way of
`example, consider the case of a discrete observation HMM. Recall that the reestimation
`transformation of bj(k), Eq. (6.40c), requires a count of the expected number of times in
`state j and observing symbol vk simultaneously. If the training sequence is so small that
it does not have any occurrences of this event (i.e., q_t = j and o_t = v_k), then b_j(k) = 0 and
`
`
`
`
b̄_j(k) will stay 0 after reestimation. The resultant model would produce zero likelihood for any observation sequence that actually includes (o_t = v_k and q_t = j). This outcome is obviously a consequence of the unreliable estimates caused by the insufficiency of the training set.
One solution to this problem is to increase the size of the training set. Often this is impractical. A second possible solution is to reduce the size of the model (e.g., number of states, number of symbols per state). Although this is always possible, often there are physical reasons why a given model is used, and therefore the model size cannot be changed. A third possible solution is to select alternative parameter estimation algorithms that can somehow enhance the reliability of the parameter estimates even based on limited training data. Deleted interpolation and parameter smoothing are two such alternatives. Since deleted interpolation is considered more than just a parameter estimation method, we discuss that subject in the next section.
The simplest way to handle the effects of insufficient training data is to add threshold constraints to the model parameters to ensure that no model parameter estimate falls below a specified level [18]. Thus, for example, we might specify, for a discrete symbol model, that

b̄_j(k) = b_j(k)   if b_j(k) ≥ δ_b
        = δ_b     otherwise   (6.113a)

or, for a continuous distribution model, that

Ū_jk(r, r) = U_jk(r, r)   if U_jk(r, r) ≥ δ_U
           = δ_U          otherwise.   (6.113b)

When the numeric floor is invoked in the reestimation equations, all remaining parameters need to be rescaled so that the densities obey the required stochastic constraints. Such postprocessor techniques are thus considered implementational measures to combat the insufficient data problem and have been applied with good success to several problems in speech processing. The method of parameter thresholding has a justification from a Bayesian statistics point of view. It can be shown that Eq. (6.113b) is, in fact, a maximum a posteriori (MAP) estimate of the variance under the assumption that the parameter prior P(U_jk(r, r)) is a noninformative one with uniform distribution and (U_jk(r, r))_min = δ_U [31]. (See Section 6.13.)
`
`·6.12.5 Choice of Model
`
`The 1remaining issues in implementing HMMs are the choice of type of model ergodjc
`and cboa (Jf
`or lef1-right or some other form), choice of model size (number of--,,~
`observation -symbols (discrete or continuous. single or multimixtum-cboice of oblervalim
`parameters). Unfortunately.
`there is no simple. theoreticall COIRCt
`of making
`choices. These choices must be made depending on lhe signal being modeled.
`ilh dae
`comments, we end our discussion of the theorerieal aspects of hidden Mln&:W 111oaa:u
`such models ha e been applied lO •
`and proceed to a discussion of ho
`11· OlallCd
`recognition .problem.
`
`
`
`
`372
`
`Chap. 6
`
`Theory and Implementation of Hidden Markov Models
`
`6.13
`
`IMPROVING iTHE EFFECTIVENESS OF MODEL ESTIMATES
`
We discuss three methods that have been shown to be able to enhance the effectiveness of HMM model estimates for speech recognition. These are (1) deleted interpolation, (2)
`Bayesian adaptation, and (3) corrective training. The first two methods are motivated by
`the problem of insufficient data, while the last method has the unique objective of trying to
`reduce recognition errors directly.
`
`6.13.1 Deleted Interpolation
`
`When training data are insufficient, reliable and robust determination of HMM parameters
`cannot be accomplished. The HMM obtained by the Baum-Welch reestimation method,
`based on the maximum likelihood criterion, may be adequate in characterizing the training
`data, but for new data the match may be quite poor. One parameter estimation method that
`aims to improve the model reliability is the method of "deleted interpolation."
`The concept involves combining two (or more) separately trained models, one of
`which is more reliably trained than the other. A scenario in which this can happen is the
`case when we use tied states which forces "different" states to share an identical statistical
`characterization, effectively reducing the number of parameters in the model. A model with
`tied states is often more robust than a model without tied states when trained on the same
`amount of data. But a model with tied states is also less precise than a model without tied
`states if the training data are sufficient. Therefore, the idea of combining the two models
`is to allow us to fall back to the more reliable model when the supposedly more precise
`model is, in fact, unreliable. A similar scenario occurs when context-independent (more
`robust) and context-dependent (more precise) phone models are used in large vocabulary
`recognition (see Chapter 8).
Let the two models be defined by the parameter sets λ = (A, B, π) and λ' = (A', B',