326    Chap. 6    Theory and Implementation of Hidden Markov Models
called a hidden Markov model) is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (it is hidden) but can be observed only through another set of stochastic processes that produce the sequence of observations.
To illustrate the basic concepts of the hidden Markov model, we will use several simple examples including simple coin-tossing experiments. We begin with a review of some basic ideas of probability in the following exercise.
`Exercise 6.1
`Given a single fair coin, i.e., P(Heads) = P(Tails) = 0.5, which you toss once and observe
`Tails,
1. What is the probability that the next 10 tosses will provide the sequence (HHTHTTHTTH)?
2. What is the probability that the next 10 tosses will produce the sequence (HHHHHHHHHH)?
3. What is the probability that 5 of the next 10 tosses will be tails? What is the expected number of tails over the next 10 tosses?
`
`Solution 6.1
1. For a fair coin, with independent coin tosses, the probability of any specific observation sequence of length 10 (10 tosses) is (1/2)^10, since there are 2^10 such sequences and all are equally probable. Thus:

    P(HHTHTTHTTH) = (1/2)^10.

2. Similarly:

    P(HHHHHHHHHH) = (1/2)^10.

Thus a specified run of length 10 is as likely as a specified run of interlaced H and T.
3. The probability of 5 tails in the next 10 tosses is just the number of observation sequences with 5 tails and 5 heads (in any order) times the probability of each sequence, and this is

    P(5H, 5T) = C(10, 5) (1/2)^10 = 252/1024 ≈ 0.25,

since there are C(10, 5) ways of getting 5H and 5T in 10 tosses, and each sequence has probability (1/2)^10. The expected number of tails in 10 tosses is

    E(number of T in 10 tosses) = Σ_{d=0}^{10} d C(10, d) (1/2)^10 = 5.

Thus, on average, there will be 5H and 5T in 10 tosses, but the probability of exactly 5H and 5T is only 0.25.
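These binomial calculations are easy to verify numerically; the following sketch uses only the Python standard library:

```python
from math import comb

# Probability of any one specific length-10 sequence of fair-coin tosses
# (parts 1 and 2 of the exercise):
p_specific = (1 / 2) ** 10            # = 1/1024

# Part 3: exactly 5 tails in 10 tosses, with C(10,5) equally likely orderings:
p_5_tails = comb(10, 5) * (1 / 2) ** 10
print(p_5_tails)                      # 252/1024, about 0.246

# Expected number of tails: sum over d of d * C(10,d) * (1/2)^10
expected_tails = sum(d * comb(10, d) * (1 / 2) ** 10 for d in range(11))
print(expected_tails)                 # 5.0
```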
`
6.3.1 Coin-Toss Models
`
Assume the following scenario. You are in a room with a barrier (e.g., a curtain) through
`IPR2023-00035
`Apple EX1015 Page 177
`
`
`
sec. 6.3    Extensions to Hidden Markov Models    327
`
`which you cannot see what is happening. On the other side of the barrier is another person
who is performing a coin-tossing experiment (using one or more coins). The person will not
`tell you which coin he selects at any time; he will only tell you the result of each coin flip.
`Thus a sequence of hidden coin-tossing experiments is performed, with the observation
sequence consisting of a series of heads and tails. A typical observation sequence would be

    O = (o_1 o_2 o_3 ... o_T) = (H H T T T H T T H ... H)

where H stands for heads and T stands for tails.
Given the above scenario, the question is, How do we build an HMM to explain (model) the observed sequence of heads and tails? The first problem we face is deciding
`what the states in the model correspond to, and then deciding how many states should be
`in the model. One possible choice would be to assume that only a single biased coin was
`being tossed. In this case, we could model the situation with a two-state model in which
`each state corresponds
`to the outcome of the previous toss (i.e., heads or tails). This model
`is depicted in Figure 6.3a. In this case, the Markov model is observable, and the only issue
`for complete specification of the model would be to decide on the best value for the single
`parameter of the model (i.e., the probability of, say, heads). Interestingly, an equivalent
`HMM to that of Figure 6.3a would be a degenerate one-state model in which the state
`corresponds to the single biased coin, and the unknown parameter is the bias of the coin.
`A second HMM for explaining the observed sequence of coin toss outcomes is given
`in Figure 6.3b. In this case there are two states in the model, and each state corresponds to a
`different, biased coin being tossed. Each state is characterized by a probability distribution
`of heads and tails, and transitions between states are characterized by a state-transition
`matrix. The physical mechanism
`that accounts for how state transitions are selected could
`itself be a set of independent coin tosses or some other probabilistic event.
`A third form of HMM for explaining the observed sequence of coin toss outcomes
`is given in Figure 6.3c. This model corresponds to using three biased coins, and choosing
`from among the three, based on some probabilistic event.
`Given the choice among the three models shown in Figure 6.3 for explaining the
`observed sequence of heads and tails, a natural question would be which model best matches
`the actual observations.
`It should be clear that the simple one-coin model of Figure 6.3a
`has only one unknown parameter;
`the two-coin model of Figure 6.3b has four unknown
`parameters; and the three-coin model of Figure 6.3c has nine unknown parameters. Thus,
`with the greater degrees of freedom, the larger HMMs would seem to be inherently more
`capable of modeling a series of coin-tossing experiments than would equivalently smaller
`models. Although this is theoretically true, we will see later in this chapter that practical
`considerations impose some strong limitations on the size of models that we can consider.
`A fundamental question here is whether the observed head-tail sequence is long and rich
`enough to be able to specify a complex model. Also, it might just be the case that only
`a single coin is being tossed. Then using the three-coin model of Figure 6.3c would be
`inappropriate because we would be using an underspecified system.
`
`
[Figure 6.3 panels: (a) 1-coin model (observable Markov model): two states, HEADS and TAILS, with parameter P(H) and 1 − P(H); example O = H H T T H T H H T T H, S = 1 1 2 2 1 2 1 1 2 2 1. (b) 2-coins model (hidden Markov model): per-state probabilities P(H) = P_1, P(T) = 1 − P_1 and P(H) = P_2, P(T) = 1 − P_2; example O = H H T T H T H H T T H, S = 2 1 1 2 2 2 1 2 2 1 2. (c) 3-coins model (hidden Markov model): per-state head probabilities P_1, P_2, P_3; example O = H H T T H T H H T T H, S = 3 1 2 3 3 1 1 2 3 1 3.]
`Figure 6.3 Three possible Markov models that can account for the results of hidden coin-tossing
`experiments. (a) one-coin model, (b) two-coins model, (c) three-coins model.
`
`6.3.2 The Urn-and-Ball Model
`
To extend the ideas of the HMM to a somewhat more complicated situation, consider the urn-and-ball system of Figure 6.4. We assume that there are N (large) glass urns in a room. Within each urn is a large quantity of colored balls. We assume there are M distinct colors of the balls. The physical process for obtaining observations is as follows. A genie is in the room, and, according to some random procedure, it chooses an initial urn. From this urn, a ball is chosen at random, and its color is recorded as the observation. The ball is
`
[Figure: URN 1, URN 2, ..., URN N; for each urn i, P(RED) = b_i(1), P(BLUE) = b_i(2), P(GREEN) = b_i(3), P(YELLOW) = b_i(4), ..., P(ORANGE) = b_i(M). Example: O = {GREEN, GREEN, BLUE, RED, YELLOW, RED, ..., BLUE}.]
`
Figure 6.4 An N-state urn-and-ball model illustrating the general case of a discrete symbol HMM.
`
then replaced in the urn from which it was selected. A new urn is then selected according to the random selection procedure associated with the current urn, and the ball selection
`process is repeated. This entire process generates a finite observation sequence of colors,
`which we would like to model as the observable output of an HMM.
It should be obvious that the simplest HMM that corresponds to the urn-and-ball process is one in which each state corresponds to a specific urn, and for which a (ball) color
`probability is defined for each state. The choice of urns is dictated by the state-transition
`matrix of the HMM.
It should be noted that the ball colors in each urn may be the same, and the distinction among various urns is in the way the collection of colored balls is composed. Therefore, an isolated observation of a particular color ball does not immediately tell which urn it is drawn from.
`
`6.3.3 Elements of an HMM
`
`The above examples give us some idea of what an HMM is and how it can be applied to
`some simple scenarios. We now formally define the elements of an HMM.
An HMM for discrete symbol observations such as the above urn-and-ball model is characterized by the following:
`
`1. N, the number of states in the model. Although the states are hidden, for many
`practical applications there is often some physical significance attached to the states
or to sets of states of the model. Thus, in the coin-tossing experiments, each state corresponded to a distinct biased coin. In the urn-and-ball model, the states corresponded to the urns. Generally the states are interconnected in such a way that
`any state can be reached from any other state (i.e., an ergodic model); however, we
`will see later in this chapter that other possible interconnections of states are often
`
`
of interest and may better suit speech applications. We label the individual states as {1, 2, ..., N}, and denote the state at time t as q_t.
2. M, the number of distinct observation symbols per state, i.e., the discrete alphabet size. The observation symbols correspond to the physical output of the system being modeled. For the coin-toss experiments the observation symbols were simply heads or tails; for the ball-and-urn model they were the colors of the balls selected from the urns. We denote the individual symbols as V = {v_1, v_2, ..., v_M}.
3. The state-transition probability distribution A = {a_ij}, where

    a_ij = P[q_{t+1} = j | q_t = i],    1 ≤ i, j ≤ N.    (6.7)

For the special case in which any state can reach any other state in a single step, we have a_ij > 0 for all i, j. For other types of HMMs, we would have a_ij = 0 for one or more (i, j) pairs.
4. The observation symbol probability distribution B = {b_j(k)}, in which

    b_j(k) = P[o_t = v_k | q_t = j],    1 ≤ k ≤ M,    (6.8)

defines the symbol distribution in state j, j = 1, 2, ..., N.
5. The initial state distribution π = {π_i}, in which

    π_i = P[q_1 = i],    1 ≤ i ≤ N.    (6.9)
`
It can be seen from the above discussion that a complete specification of an HMM requires specification of two model parameters, N and M, specification of observation symbols, and the specification of the three sets of probability measures A, B, and π. For convenience, we use the compact notation

    λ = (A, B, π)    (6.10)

to indicate the complete parameter set of the model. This parameter set, of course, defines a probability measure for O, i.e., P(O|λ), which we discuss in the next section. We use the terminology HMM to indicate the parameter set λ and the associated probability measure interchangeably without ambiguity.
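The compact parameter set λ = (A, B, π) maps naturally onto a small container type. The sketch below is illustrative (the class name and the example values are assumptions, not from the text):

```python
from dataclasses import dataclass

@dataclass
class HMM:
    """Discrete-symbol HMM parameter set, lambda = (A, B, pi)."""
    A: list    # A[i][j] = a_ij, state-transition probabilities
    B: list    # B[j][k] = b_j(k), symbol probabilities
    pi: list   # pi[i], initial state distribution

    def n_states(self):
        return len(self.pi)

# A two-state, two-symbol example (values are illustrative):
model = HMM(A=[[0.7, 0.3], [0.4, 0.6]],
            B=[[0.9, 0.1], [0.2, 0.8]],
            pi=[0.5, 0.5])
print(model.n_states())  # → 2
```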
`
`6.3.4 HMM Generator of Observations
`
Given appropriate values of N, M, A, B, and π, the HMM can be used as a generator to give an observation sequence

    O = (o_1 o_2 ... o_T)    (6.11)

(in which each observation o_t is one of the symbols from V, and T is the number of observations in the sequence) as follows:
`
1. Choose an initial state q_1 = i according to the initial state distribution π.
2. Set t = 1.
3. Choose o_t = v_k according to the symbol probability distribution in state i, i.e., b_i(k).
`
4. Transit to a new state q_{t+1} = j according to the state-transition probability distribution for state i, i.e., a_ij.
5. Set t = t + 1; return to step 3 if t < T; otherwise, terminate the procedure.
`
The following table shows the sequence of states and observations generated by the above procedure:

    time, t        1    2    3    4    5    6    ...    T
    state          q_1  q_2  q_3  q_4  q_5  q_6  ...    q_T
    observation    o_1  o_2  o_3  o_4  o_5  o_6  ...    o_T
`The above procedure can be used as both a generator of observations and as a model to
`simulate how a given observation sequence was generated by an appropriate HMM.
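The five-step generation procedure above can be sketched directly in code; the two-state model values below are illustrative assumptions, not a model from the text:

```python
import random

A  = [[0.7, 0.3], [0.4, 0.6]]   # a_ij (illustrative)
B  = [[0.9, 0.1], [0.2, 0.8]]   # b_j(k) (illustrative)
pi = [0.5, 0.5]                 # initial state distribution (illustrative)

def draw(dist):
    """Sample an index from a discrete probability distribution."""
    return random.choices(range(len(dist)), weights=dist)[0]

def generate(T):
    """Steps 1-5: generate an observation sequence o_1 ... o_T."""
    q = draw(pi)                          # step 1: choose initial state
    observations = []
    for _ in range(T):                    # steps 2 and 5: t = 1, ..., T
        observations.append(draw(B[q]))   # step 3: emit o_t via b_q(k)
        q = draw(A[q])                    # step 4: transit via a_ij
    return observations

print(generate(10))
```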
`Exercise 6.2
Consider an HMM representation (parametrized by λ) of a coin-tossing experiment. Assume a three-state model (corresponding to three different coins) with probabilities

            State 1    State 2    State 3
    P(H)    0.5        0.75       0.25
    P(T)    0.5        0.25       0.75

and with all state-transition probabilities equal to 1/3. (Assume initial state probabilities of 1/3.)
1. You observe the sequence

    O = (H H H H T H T T T T).

What state sequence is most likely? What is the probability of the observation sequence and this most likely state sequence?
2. What is the probability that the observation sequence came entirely from state 1?
3. Consider the observation sequence

    Ō = (H T T H T H H T T H).

How would your answers to parts 1 and 2 change?
4. If the state-transition probabilities were

    a11 = 0.9,   a12 = 0.05,  a13 = 0.05,
    a21 = 0.45,  a22 = 0.1,   a23 = 0.45,
    a31 = 0.45,  a32 = 0.45,  a33 = 0.1,

that is, a new model λ', how would your answers to parts 1-3 change? What does this suggest about the type of sequences generated by the models?
`
Solution 6.2
1. Given O = (H H H H T H T T T T) and that all state transitions are equiprobable, the most likely state sequence is the one for which the probability of each individual observation is maximum. Thus for each H the most likely state is 2, and for each T the most likely state is 3. Thus the most likely state sequence is

    q = (2 2 2 2 3 2 3 3 3 3).

The probability of O and q (given the model) is

    P(O, q | λ) = (0.75)^10 (1/3)^10.

2. The probability of O and the state sequence

    q̄ = (1 1 1 1 1 1 1 1 1 1)

is

    P(O, q̄ | λ) = (0.50)^10 (1/3)^10.

The ratio of P(O, q | λ) to P(O, q̄ | λ) is

    R = P(O, q | λ) / P(O, q̄ | λ) = (3/2)^10 = 57.67

which shows, as expected, that q is more likely than q̄.
3. Given Ō, which has the same number of Hs and Ts, the answers to parts 1 and 2 would remain the same, as the most likely states occur the same number of times in both cases.
4. The new probability of O and q becomes

    P(O, q | λ') = (0.75)^10 (1/3) (0.1)^6 (0.45)^3.

The new probability of O and q̄ becomes

    P(O, q̄ | λ') = (0.50)^10 (1/3) (0.9)^9.

The ratio is

    R = (3/2)^10 (1/9)^6 (1/2)^3 = 1.36 × 10^−5.

In other words, because of the nonuniform transition probabilities, q̄ is now more likely than q. (The reader is encouraged to find the most likely state sequence in this case.) Now, the probability of Ō and q is not the same as the probability of O and q. We have

    P(Ō, q | λ') = (1/3) (0.1)^6 (0.45)^3 (0.25)^4 (0.75)^6
    P(Ō, q̄ | λ') = (0.50)^10 (1/3) (0.9)^9

with ratio

    R = (1/9)^6 (1/2)^7 (3/2)^6 = 1.67 × 10^−7.

Clearly, because a11 = 0.9, q̄ is more likely.
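The ratio in part 2 can be checked numerically with a direct implementation of P(O, q | λ) for a fixed state sequence; the emission values come from Exercise 6.2 and the transitions are the uniform 1/3 case:

```python
def joint_prob(obs, states, b, a, pi=1/3):
    """P(O, q | lambda) for a fixed state sequence q (uniform initial prob)."""
    p = pi * b[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= a[states[t - 1]][states[t]] * b[states[t]][obs[t]]
    return p

# Emission probabilities from Exercise 6.2; uniform transitions a_ij = 1/3.
b = {1: {"H": 0.5, "T": 0.5},
     2: {"H": 0.75, "T": 0.25},
     3: {"H": 0.25, "T": 0.75}}
a_uniform = {i: {j: 1 / 3 for j in (1, 2, 3)} for i in (1, 2, 3)}

O = list("HHHHTHTTTT")
q = [2, 2, 2, 2, 3, 2, 3, 3, 3, 3]    # most likely per-observation states
q_bar = [1] * 10                      # all tosses from state 1

R = joint_prob(O, q, b, a_uniform) / joint_prob(O, q_bar, b, a_uniform)
print(round(R, 2))  # → 57.67, i.e. (3/2)^10
```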
`
`
`
`
sec. 6.4    The Three Basic Problems for HMMs    333
`
6.4 THE THREE BASIC PROBLEMS FOR HMMS
`
Given the form of HMM of the previous section, three basic problems of interest must be solved for the model to be useful in real-world applications. These problems are the following:

Problem 1
Given the observation sequence O = (o_1 o_2 ... o_T) and a model λ = (A, B, π), how do we efficiently compute P(O|λ), the probability of the observation sequence, given the model?

Problem 2
Given the observation sequence O = (o_1 o_2 ... o_T) and the model λ, how do we choose a corresponding state sequence q = (q_1 q_2 ... q_T) that is optimal in some sense (i.e., best "explains" the observations)?

Problem 3
How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?
Problem 1 is the evaluation problem; namely, given a model and a sequence of
`observations, how do we compute the probability that the observed sequence was produced
`by the model? We can also view the problem as one of scoring how well a given model
`matches a given observation sequence. The latter viewpoint is extremely useful. For
`example, if we consider the case in which we are trying to choose among several competing
`models, the solution
`to Problem 1 allows us to choose the model that best matches the
`observations.
Problem 2 is the one in which we attempt to uncover the hidden part of the model, that is, to find the "correct" state sequence. It should be clear that for all but the case of degenerate models, there is no "correct" state sequence to be found. Hence for practical situations, we usually use an optimality criterion to solve this problem as best as possible. As we will see, several reasonable optimality criteria can be imposed, and hence the choice
`of criterion is a strong function of the intended use for the uncovered state sequence. Typical
`uses might be to learn about the structure of the model, to find optimal state sequences for
`continuous speech recognition, or to get average statistics of individual states, etc.
`Problem 3 is the one in which we attempt to optimize the model parameters to best
`describe how a given observation sequence comes about. The observation sequence used
`to adjust the model parameters is called a training sequence because it is used to "train" the
`HMM. The training problem is the crucial one for most applications of HMMs, because
`it allows us to optimally adapt model parameters to observed training data-i.e.,
`to create
`best models for real phenomena.
`To fix ideas, consider the following simple isolated-word speech recognizer. For each
word of a W word vocabulary, we want to design a separate N-state HMM. We represent the
`speech signal of a given word as a time sequence of coded spectral vectors. We assume that
`the coding is done using a spectral codebook with M unique spectral vectors; hence each
`observation is the index of the spectral vector closest (in some spectral distortion sense) to
`the original speech signal. Thus, for each vocabulary word, we have a training sequence
`
`
`consisting of a number of repetitions of sequences of codebook indices of the word (by
`one or more talkers). The first task is to build individual word models. This task is done
`by using the solution to Problem 3 to optimally estimate model parameters for each word
`model. To develop an understanding of the physical meaning of the model states, we use
`the solution to Problem 2 to segment each of the word training sequences into states, and
`then study the properties of the spectral vectors that lead to the observations occurring in
`each state. The goal here is to make refinements of the model (e.g., more states, different
`codebook size) to improve its capability of modeling the spoken word sequences. Finally,
`once the set of W HMMs has been designed and optimized, recognition of an unknown
word is performed using the solution to Problem 1 to score each word model based upon
`the given test observation sequence, and select the word whose model score is highest (i.e.,
`the highest likelihood).
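The recognition step described above reduces to a maximum over per-word model scores. The sketch below is purely illustrative: `score` stands in for the solution to Problem 1 (computing P(O|λ)), and the word models shown are hypothetical placeholders:

```python
def recognize(observation, word_models, score):
    """Pick the vocabulary word whose model scores the observation highest."""
    return max(word_models, key=lambda w: score(observation, word_models[w]))

# Toy usage: the models and scoring function below are placeholders only,
# standing in for trained HMMs and a real P(O | lambda) evaluator.
models = {"yes": {"bias": 0.9}, "no": {"bias": 0.2}}
toy_score = lambda obs, m: m["bias"] ** len(obs)
print(recognize([1, 2, 3], models, toy_score))  # → yes
```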
`In the next sections we present formal mathematical solutions to each fundamental
`problem for HMMs. We shall see that the three problems are tightly linked together under
`the probabilistic framework.
`
`6.4.1 Solution to Problem 1-Probability Evaluation
`
We wish to calculate the probability of the observation sequence, O = (o_1 o_2 ... o_T), given the model λ, i.e., P(O|λ). The most straightforward way of doing this is through enumerating every possible state sequence of length T (the number of observations). There are N^T such state sequences. Consider one such fixed state sequence

    q = (q_1 q_2 ... q_T)    (6.12)

where q_1 is the initial state. The probability of the observation sequence O given the state sequence of Eq. (6.12) is

    P(O|q, λ) = ∏_{t=1}^{T} P(o_t | q_t, λ)    (6.13a)

where we have assumed statistical independence of observations. Thus we get

    P(O|q, λ) = b_{q1}(o_1) · b_{q2}(o_2) · · · b_{qT}(o_T).    (6.13b)

The probability of such a state sequence q can be written as

    P(q|λ) = π_{q1} a_{q1 q2} a_{q2 q3} · · · a_{q(T−1) qT}.    (6.14)

The joint probability of O and q, i.e., the probability that O and q occur simultaneously, is simply the product of the above two terms, i.e.,

    P(O, q|λ) = P(O|q, λ) P(q|λ).    (6.15)

The probability of O (given the model) is obtained by summing this joint probability over all possible state sequences q, giving

    P(O|λ) = Σ_{all q} P(O|q, λ) P(q|λ)    (6.16)
`
`
           = Σ_{q1, q2, ..., qT} π_{q1} b_{q1}(o_1) a_{q1 q2} b_{q2}(o_2) · · · a_{q(T−1) qT} b_{qT}(o_T).    (6.17)
`
The interpretation of the computation in the above equation is the following. Initially (at time t = 1) we are in state q_1 with probability π_{q1}, and generate the symbol o_1 (in this state) with probability b_{q1}(o_1). The clock changes from time t to t + 1 (time = 2) and we make a transition to state q_2 from state q_1 with probability a_{q1 q2}, and generate symbol o_2 with probability b_{q2}(o_2). This process continues in this manner until we make the last transition (at time T) from state q_{T−1} to state q_T with probability a_{q(T−1) qT} and generate symbol o_T with probability b_{qT}(o_T).
A little thought should convince the reader that the calculation of P(O|λ), according to its direct definition (Eq. (6.17)), involves on the order of 2T · N^T calculations, since at every t = 1, 2, ..., T, there are N possible states that can be reached (i.e., there are N^T possible state sequences), and for each such state sequence about 2T calculations are required for each term in the sum of Eq. (6.17). (To be precise, we need (2T − 1)N^T multiplications, and N^T − 1 additions.) This calculation is computationally infeasible, even for small values of N and T; e.g., for N = 5 (states), T = 100 (observations), there are on the order of 2 · 100 · 5^100 ≈ 10^72 computations! Clearly a more efficient procedure is required to solve Problem 1. Fortunately such a procedure (called the forward procedure) exists.
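For tiny N and T, the direct enumeration of Eq. (6.17) can be written out explicitly; the model values below are illustrative assumptions, and the same model is reused in the later sketches so the results can be cross-checked:

```python
from itertools import product

A  = [[0.7, 0.3], [0.4, 0.6]]   # illustrative two-state model
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]

def prob_direct(obs):
    """P(O | lambda) by summing P(O, q | lambda) over all N^T sequences q."""
    total = 0.0
    for q in product(range(len(pi)), repeat=len(obs)):
        p = pi[q[0]] * B[q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += p
    return total

print(prob_direct([0, 1, 0]))  # feasible here only because N = 2, T = 3
```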
`
`6.4.1.1 The Forward Procedure
`
Consider the forward variable α_t(i) defined as

    α_t(i) = P(o_1 o_2 ... o_t, q_t = i | λ)    (6.18)

that is, the probability of the partial observation sequence, o_1 o_2 ... o_t (until time t) and state i at time t, given the model λ. We can solve for α_t(i) inductively, as follows:

1. Initialization

    α_1(i) = π_i b_i(o_1),    1 ≤ i ≤ N.    (6.19)

2. Induction

    α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(o_{t+1}),    1 ≤ t ≤ T − 1,  1 ≤ j ≤ N.    (6.20)

3. Termination

    P(O|λ) = Σ_{i=1}^{N} α_T(i).    (6.21)
`
Step 1 initializes the forward probabilities as the joint probability of state i and initial observation o_1. The induction step, which is the heart of the forward calculation, is illustrated in Figure 6.5(a). This figure shows how state j can be reached at time t + 1 from the N possible states, i, 1 ≤ i ≤ N, at time t. Since α_t(i) is the probability of the joint event that o_1 o_2 ... o_t are observed, and the state at time t is i, the product α_t(i)a_ij is
`
`
Figure 6.5 (a) Illustration of the sequence of operations required for the computation of the forward variable α_{t+1}(j). (b) Implementation of the computation of α_t(i) in terms of a lattice of observations t and states i.
`
then the probability of the joint event that o_1 o_2 ... o_t are observed, and state j is reached at time t + 1 via state i at time t. Summing this product over all the N possible states i, 1 ≤ i ≤ N, at time t results in the probability of j at time t + 1 with all the accompanying previous partial observations. Once this is done and j is known, it is easy to see that α_{t+1}(j) is obtained by accounting for observation o_{t+1} in state j, i.e., by multiplying the summed quantity by the probability b_j(o_{t+1}). The computation of Eq. (6.20) is performed for all states j, 1 ≤ j ≤ N, for a given t; the computation is then iterated for t = 1, 2, ..., T − 1. Finally, step 3 gives the desired calculation of P(O|λ) as the sum of the terminal forward variables α_T(i). This is the case since, by definition,

    α_T(i) = P(o_1 o_2 ... o_T, q_T = i | λ)    (6.22)

and hence P(O|λ) is just the sum of the α_T(i)'s.
If we examine the computation involved in the calculation of α_t(j), 1 ≤ t ≤ T, 1 ≤ j ≤ N, we see that it requires on the order of N²T calculations, rather than 2T N^T as required by the direct calculation. (Again, to be precise, we need N(N + 1)(T − 1) + N
`
multiplications and N(N − 1)(T − 1) additions.) For N = 5, T = 100, we need about 3000 computations for the forward method, versus 10^72 computations for the direct calculation, a savings of about 69 orders of magnitude.
The forward probability calculation is, in effect, based upon the lattice (or trellis) structure shown in Figure 6.5(b). The key is that, because there are only N states (nodes at each time slot in the lattice), all the possible state sequences will remerge into these N nodes, no matter how long the observation sequence. At time t = 1 (the first time slot in the lattice), we need to calculate values of α_1(i), 1 ≤ i ≤ N. At times t = 2, 3, ..., T, we need only calculate values of α_t(j), 1 ≤ j ≤ N, where each calculation involves only the N previous values of α_{t−1}(i), because each of the N grid points can be reached from only the N grid points at the previous time slot.
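The three steps of the forward procedure can be sketched as follows, using the same illustrative two-state model as the enumeration sketch (not a model from the text):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]

def forward(obs):
    """P(O | lambda) via Eqs. (6.19)-(6.21), on the order of N^2 T operations."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]        # Eq. (6.19)
    for o in obs[1:]:                                       # Eq. (6.20)
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)                                       # Eq. (6.21)

print(forward([0, 1, 0]))  # matches the brute-force enumeration result
```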
`
`6.4.1.2 The Backward Procedure
`
In a similar manner, we can consider a backward variable β_t(i) defined as

    β_t(i) = P(o_{t+1} o_{t+2} ... o_T | q_t = i, λ)    (6.23)

that is, the probability of the partial observation sequence from t + 1 to the end, given state i at time t and the model λ. Again we can solve for β_t(i) inductively, as follows:

1. Initialization

    β_T(i) = 1,    1 ≤ i ≤ N.    (6.24)

2. Induction

    β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j),    t = T − 1, T − 2, ..., 1,  1 ≤ i ≤ N.    (6.25)
`
The initialization step 1 arbitrarily defines β_T(i) to be 1 for all i. Step 2, which is illustrated in Figure 6.6, shows that in order to have been in state i at time t, and to account for the observation sequence from time t + 1 on, you have to consider all possible states j at time t + 1, accounting for the transition from i to j (the a_ij term), as well as the observation o_{t+1} in state j (the b_j(o_{t+1}) term), and then account for the remaining partial observation sequence from state j (the β_{t+1}(j) term). We will see later how the backward as well as the forward calculations are used to help solve fundamental Problems 2 and 3 of HMMs.
Again, the computation of β_t(i), 1 ≤ t ≤ T, 1 ≤ i ≤ N, requires on the order of N²T calculations, and can be computed in a lattice structure similar to that of Figure 6.5(b).
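The backward procedure admits a similar sketch (same illustrative model as the forward sketch); as a consistency check, P(O | λ) can be recovered from β_1 as Σ_i π_i b_i(o_1) β_1(i):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]

def backward(obs):
    """Return beta_t(i) for t = 1, ..., T (Eqs. (6.24)-(6.25))."""
    N = len(pi)
    betas = [[1.0] * N]                   # Initialization: beta_T(i) = 1
    for o in reversed(obs[1:]):           # Induction: t = T-1 down to 1
        prev = betas[0]
        betas.insert(0, [sum(A[i][j] * B[j][o] * prev[j] for j in range(N))
                         for i in range(N)])
    return betas

obs = [0, 1, 0]
beta1 = backward(obs)[0]
p = sum(pi[i] * B[i][obs[0]] * beta1[i] for i in range(len(pi)))
print(p)  # equals the forward result P(O | lambda) for this model
```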
`
6.4.2 Solution to Problem 2-"Optimal" State Sequence
`
Unlike Problem 1, for which an exact solution can be given, there are several possible ways of solving Problem 2-namely, finding the "optimal" state sequence associated with the given observation sequence. The difficulty lies with the definition of the optimal state sequence-that is, there are several possible optimality criteria. For example, one possible
`
Figure 6.6 Sequence of operations required for the computation of the backward variable β_t(i).
`
optimality criterion is to choose the states q_t that are individually most likely at each time t. This optimality criterion maximizes the expected number of correct individual states. To implement this solution to Problem 2, we can define the a posteriori probability variable

    γ_t(i) = P(q_t = i | O, λ)    (6.26)

that is, the probability of being in state i at time t, given the observation sequence O, and the model λ. We can express γ_t(i) in several forms, including

    γ_t(i) = P(q_t = i | O, λ) = P(O, q_t = i | λ) / P(O | λ) = P(O, q_t = i | λ) / Σ_{i=1}^{N} P(O, q_t = i | λ).    (6.27)

Since P(O, q_t = i | λ) is equal to α_t(i)β_t(i), we can write γ_t(i) as

    γ_t(i) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i)    (6.28)

where we see that α_t(i) accounts for the partial observation sequence o_1 o_2 ... o_t and state i at t, while β_t(i) accounts for the remainder of the observation sequence o_{t+1} o_{t+2} ... o_T, given state q_t = i at t.
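Combining the forward and backward recursions gives Eq. (6.28) directly; the model values below are the same illustrative two-state assumptions used in the earlier sketches:

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
N = 2

def forward_all(obs):
    """All forward variables alpha_t(i), t = 1, ..., T."""
    alphas = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = alphas[-1]
        alphas.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                       for j in range(N)])
    return alphas

def backward_all(obs):
    """All backward variables beta_t(i), t = 1, ..., T."""
    betas = [[1.0] * N]
    for o in reversed(obs[1:]):
        nxt = betas[0]
        betas.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(N))
                         for i in range(N)])
    return betas

def gamma(obs):
    """gamma_t(i) = alpha_t(i) beta_t(i) / sum_i alpha_t(i) beta_t(i)."""
    out = []
    for a, b in zip(forward_all(obs), backward_all(obs)):
        num = [a[i] * b[i] for i in range(N)]
        denom = sum(num)                  # equals P(O | lambda) at every t
        out.append([x / denom for x in num])
    return out

g = gamma([0, 1, 0])
print(g[0])  # posterior over states at t = 1; entries sum to 1
```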
`
`
Using γ_t(i), we can solve for the individually most likely state q_t* at time t, as

    q_t* = argmax_{1 ≤ i ≤ N} [γ_t(i)],    1 ≤ t ≤ T.    (6.29)
`
Although Eq. (6.29) maximizes the expected number of correct states (by choosing the most likely state for each t), there could be some problems with the resulting state sequence. For example, when the HMM has state transitions which have zero probability (a_ij = 0 for some i and j), the "optimal" state sequence may, in fact, not even be a valid state sequence. This is because the solution of Eq. (6.29) simply determines the most likely state at every instant, without regard to the probability of occurrence of sequences of states.
One possible solution to the above problem is to modify the optimality criterion. For example, one could solve for the state sequence that maximizes the expected number of correct pairs of states (q_t, q_{t+1}), or triples of states (q_t, q_{t+1}, q_{t+2}), etc. Although these criteria might be reasonable for some applications, the most widely used criterion is to find the single best state sequence (path)-that is, to maximize P(q|O, λ), which is equivalent to maximizing P(q, O|λ). A formal technique for finding this single best state sequence exists, based on dynamic programming methods, and is called the Viterbi algorithm [15, 16].
`6.4.2.1 The Viterbi Algorithm
`
To find the single best state sequence, q = (q_1 q_2 ... q_T), for the given observation sequence O = (o_1 o_2 ... o_T), we need to define the quantity

    δ_t(i) = max_{q1, q2, ..., q(t−1)} P[q_1 q_2 ... q_{t−1}, q_t = i, o_1 o_2 ... o_t | λ]    (6.30)

that is, δ_t(i) is the best score (highest probability) along a single path, at time t, which accounts for the first t observations and ends in state i. By induction we have

    δ_{t+1}(j) = [max_{1 ≤ i ≤ N} δ_t(i) a_ij] · b_j(o_{t+1}).    (6.31)

To actually retrieve the state sequence, we need to keep track of the argument that maximized Eq. (6.31), for each t and j. We do this via the array ψ_t(j). The complete procedure for finding the best state sequence can now be stated as follows:

1. Initialization

    δ_1(i) = π_i b_i(o_1),    1 ≤ i ≤ N    (6.32a)
    ψ_1(i) = 0.    (6.32b)

2. Recursion

    δ_t(j) = max_{1 ≤ i ≤ N} [δ_{t−1}(i) a_ij] b_j(o_t),    2 ≤ t ≤ T,  1 ≤ j ≤ N    (6.33a)
    ψ_t(j) = argmax_{1 ≤ i ≤ N} [δ_{t−1}(i) a_ij],    2 ≤ t ≤ T,  1 ≤ j ≤ N.    (6.33b)
`
`
3. Termination

    P* = max_{1 ≤ i ≤ N} [δ_T(i)]    (6.34a)
    q_T* = argmax_{1 ≤ i ≤ N} [δ_T(i)].    (6.34b)

4. Path (state sequence) backtracking

    q_t* = ψ_{t+1}(q_{t+1}*),    t = T − 1, T − 2, ..., 1.    (6.35)
`
`It should be noted that the Viterbi algorithm is similar (except for the backtracking step) in
`implementation to the forward calculation of Eqs. (6.19)-(6.21). The major difference is
`the maximization in Eq. (6.33a) over previous states, which is used in place of the summing
`procedure in Eq. (6.20). It also should be clear that a lattice (or trellis) structure efficiently
`implements the computation of the Viterbi procedure.
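The Viterbi recursion and backtracking can be sketched as follows, again on the illustrative two-state model used in the forward and backward sketches:

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]

def viterbi(obs):
    """Best path probability P* and state sequence (Eqs. (6.32)-(6.35))."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]       # Eq. (6.32a)
    psi = []                                               # backpointers
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)                            # Eq. (6.33b)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])  # (6.33a)
        psi.append(step)
        delta = new_delta
    # Termination (Eq. (6.34)) and backtracking (Eq. (6.35)):
    q = [max(range(N), key=lambda i: delta[i])]
    for step in reversed(psi):
        q.insert(0, step[q[0]])
    return max(delta), q

p_star, path = viterbi([0, 1, 0])
print(path)  # → [0, 1, 0]
```

Note that the recursion mirrors the forward procedure exactly, with the sum over previous states replaced by a max, as the text points out.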
`
6.4.2.2 Alternative Viterbi Implementation

By taking logarithms of the model parameters, the Viterbi algorithm of the preceding section can be implemented without the need for any multiplications. Thus:

0. Preprocessing

    π̃_i = log(π_i),    1 ≤ i ≤ N
    b̃_i(o_t) = log[b_i(o_t)],    1 ≤ i ≤ N,  1 ≤ t ≤ T
    ã_ij = log(a_ij),    1 ≤ i, j ≤ N

1. Initialization

    δ̃_1(i) = log(δ_1(i)) = π̃_i + b̃_i(o_1),    1 ≤ i ≤ N
    ψ_1(i) = 0,    1 ≤ i ≤ N

2. Recursion

    δ̃_t(j) = log(δ_t(j)) = max_{1 ≤ i ≤ N} [δ̃_{t−1}(i) + ã_ij] + b̃_j(o_t),    2 ≤ t ≤ T,  1 ≤ j ≤ N
    ψ_t(j) = argmax_{1 ≤ i ≤ N} [δ̃_{t−1}(i) + ã_ij],    2 ≤ t ≤ T,  1 ≤ j ≤ N

3. Termination

    P̃* = max_{1 ≤ i ≤ N} [δ̃_T(i)]
    q_T* = argmax_{1 ≤ i ≤ N} [δ̃_T(i)]

4. Backtracking

    q_t* = ψ_{t+1}(q_{t+1}*),    t = T − 1, T − 2, ..., 1

The calculation required for this alternative implementation is on the order of N²T additions (plus the calculation for preprocessing). Because the preprocessing needs to be perform