IPR2023-00037, No. 1029 Exhibit - EX1029 Tutorial on Hidden Markov Models and Selected Applications in Speech (P.T.A.B. Dec. 2, 2022)

A Tutorial on Hidden Markov Models and
`Selected Applications in Speech Recognition
`
`LAWRENCE R. RABINER, FELLOW,
`
`IEEE
`
`Although initially introduced and studied in the late 1960s and
`early 1970s, statistical methods of Markov source or hidden Markov
`modeling have become increasingly popular in the last several
`years. There are two strong reasons why this has occurred. First the
`models are very rich in mathematical structure and hence can form
`the theoretical basis for use in a wide range of applications. Sec-
`ond the models, when applied properly, work very well in practice
`for several important applications. In this paper we attempt to care-
`fully and methodically review the theoretical aspects of this type
`of statistical modeling and show how they have been applied to
`selected problems in machine recognition of speech.
`
`I. INTRODUCTION
`
`Real-world processes generally produce observable out-
`puts which can be characterized as signals. The signals can
`bediscrete in nature(e.g.,charactersfrom afinitealphabet,
`quantized vectors from a codebook, etc.), or continuous in
`nature (e.g., speech samples, temperature measurements,
`music, etc.). The signal source can be stationary (i.e., its sta-
`tistical properties do not vary with time), or nonstationary
`(i.e., the signal properties vary over time). The signals can
`be pure (i.e., coming strictly from a single source), or can
`be corrupted from other signal sources (e.g., noise) or by
`transmission distortions, reverberation, etc.
`A problem of fundamental interest is characterizing such
`real-world signals in terms of signal models. There are sev-
`eral reasons why one is interested in applying signal models.
`First of all, a signal model can provide the basis for a the-
`oretical description of a signal processing system which can
`be used to process the signal so as to provide a desired out-
`put. For example if we are interested in enhancing a speech
`signal corrupted by noise and transmission distortion, we
`can use the signal model to design a system which will opti-
`mally remove the noise and undo the transmission distor-
`tion. A second reason why signal models are important is
`that they are potentially capable of letting us learn a great
`deal about the signal source (i.e., the real-world process
`which produced the signal) without having to have the
`sourceavailable. This property is especially important when
`the cost of getting signals from the actual source is high.
`
`In this case, with a good signal model, we can simulate the
`source and learn as much as possible via simulations.
`Finally, the most important reason why signal models are
`important is that they often workextremelywell in practice,
`and enable us to realize important practical systems-e.g.,
`prediction systems, recognition systems, identification sys-
`tems, etc., in a very efficient manner.
`These are several possible choices for what type of signal
`model is used for characterizing the properties of a given
`signal. Broadly one can dichotomize the types of signal
`models into the class of deterministic models, and the class
`of statistical models. Deterministic models generally exploit
`some known specific properties of the signal, e.g., that the
`signal is a sine wave, or a sum of exponentials, etc. In these
`cases, specification of the signal model is generally straight-
`forward;all that is required istodetermine(estimate)values
`of the parameters of the signal model (e.g., amplitude, fre-
`quency, phase of a sine wave, amplitudes and rates of expo-
`nentials, etc.). The second broad class of signal models is
`the set of statistical models in which one tries to charac-
`terize only the statistical properties of the signal. Examples
`of such statistical models include Gaussian processes, Pois-
`son processes, Markov processes, and hidden Markov pro-
`cesses, among others. The underlying assumption of the
`statistical model i s that the signal can be well characterized
`as a parametric random process, and that the parameters
`of the stochastic process can be determined (estimated) in
`a precise, well-defined manner.
`For the applications of interest, namely speech process-
`ing, both deterministic and stochastic signal models have
`had good success. In this paper we will concern ourselves
`strictlywith one typeof stochastic signal model, namelythe
`hidden Markov model (HMM). (These models are referred
`to as Markov sources or probabilistic functions of Markov
`chains in the communications literature.) We will first
`review the theory of Markov chains and then extend the
`ideas to the class of hidden Markov models using several
`simple examples. We will then focus our attention on the
`three fundamental problems' for HMM design, namely: the
`
`Manuscript received January 15,1988; revised October 4,1988.
`The author is with AT&T Bell Laboratories, Murray Hill, NJ 07974-
`2070, USA.
`IEEE Log Number 8825949.
`
`'The idea of characterizing the theoretical aspects of hidden
`Markov modeling in terms of solving three fundamental problems
`is due to Jack Ferguson of IDA (Institute for Defense Analysis) who
`introduced it in lectures and writing.
`
`0018-9219/89/02000257$01.00 0 1989 IEEE
`
`PROCEEDINGS OF THE IEEE, VOL. 77, NO. 2, FEBRUARY 1989
`
`257
`
`IPR2023-00037
`Apple EX1029 Page 1
`
`

`evaluation of the probability (or likelihood) of a sequence
`of observations given a specific HMM; the determination
`of a best sequence of model states; and the adjustment of
`model parameters so as to best account for the observed
`signal. We will show that once these three fundamental
`problems are solved, we can apply HMMs to selected prob-
`lems in speech recognition.
`Neither the theory of hidden Markov models nor its
`applications to speech recognition is new. The basic theory
`was published in a series of classic papers by Baum and his
`colleagues [I]-[5] in the late 1960s and early 1970s and was
`implemented for speech processing applications by Baker
`161 at CMU, and by Jelinek and his colleagues at IBM [7-[13]
`in the 1970s. However, widespread understanding and
`application of the theory of HMMs to speech processing
`has occurred only within the past several years. There are
`several reasons why this has been the case. First, the basic
`theory of hidden Markov models was published in math-
`ematical journals which were not generally read by engi-
`neers working on problems in speech processing. The sec-
`ond reason was that the original applications of the theory
`to speech processing did not provide sufficient tutorial
`material for most readers to understand the theory and to
`be able to apply it to their own research. As a result, several
`tutorial papers were written which provided a sufficient
`level of detail for a number of research labs to begin work
`using HMMs in individual speech processing applications
`[14]-[19]. This tutorial i s intended to provide an overview
`of the basic theory of HMMs (as originated by Baum and
`his colleagues), provide practical details on methods of
`implementation of the theory, and describe a couple of
`selected applications of the theory to distinct problems in
`speech recognition. The paper combines results from a
`number of original sources and hopefully provides a single
`source for acquiring the background required to pursue
`further this fascinating area of research.
`The organization of this paper is as follows. In Section I1
`we review the theory of discrete Markov chains and show
`how the concept of hidden states, where the observation
`i s a probabilistic function of the state, can be used effec-
`tively. We illustrate the theory with two simple examples,
`namely coin-tossing, and the classic balls-in-urns system.
`In Section Ill we discuss the three fundamental problems
`of HMMs, and give several practical techniques for solving
`these problems. In Section IV we discuss the various types
`of HMMs that have been studied including ergodic as well
`as left-right models. In this section we also discuss the var-
`ious model features including the form of the observation
`density function, the state duration density, and the opti-
`mization criterion for choosing optimal HMM parameter
`values. In Section Vwe discuss the issues that arise in imple-
`menting HMMs including the topics of scaling, initial
`parameter estimates, model size, model form, missingdata,
`and multiple observation sequences. In Section VI we
`describean isolated word speech recognizer, implemented
`with HMM ideas, and show how it performs as compared
`to alternative implementations. In Section VI1 we extend
`the ideas presented in Section VI to the problem of recog-
`nizing a string of spoken words based on concatenating
`individual HMMsofeachword in thevocabulary. In Section
`V l l l we briefly outline how the ideas of HMM have been
`applied to a largevocabulary speech recognizer, and in Sec-
`
`tion IX we summarize the ideas discussed throughout the
`paper.
`
`11. DISCRETE MARKOV PROCESSES~
`Consider a system which may be described at any time
`as being in one of a set of N distinct states, S1, SzI . . . , SN,
`as illustrated in Fig. 1 (where N = 5 for simplicity). At reg-
`
`Fig. 1. A Markov chain with 5 states (labeled S, to S,) with
`selected state transitions.
`
`ularlyspaced discrete times, the system undergoesachange
`of state (possibly back to the same state) according to a set
`of probabilities associated with the state. We denote the
`time instants associated with state changes as t = 1, 2,
`. . . , and we denote the actual state at time t as qr. A full
`probabilistic description of the above system would, in gen-
`eral, require specification of the current state (at time t), as
`well as all the predecessor states. For the special case of a
`discrete, first order, Markov chain, this probabilistic
`description is truncated to just the current and the pre-
`decessor state, i.e.,
`99, = q q t - 1 = S I , q t - 2 = S k r . . . I
`= 9s: = S&:
`= SJ.
`(1 )
`Furthermoreweonlyconsider those processes in which the
`right-hand side of (1) i s independent of time, thereby lead-
`ing to the set of state transition probabilities a,, of the form
`1 5 i , j 5 N
`a,, = 99, = S,(q,-, = S,],
`(2)
`with the state transition coefficients having the properties
`a,, 2 0
`
`(3a)
`
`N
`
`C a,, = I
`
`(3b)
`
`/ = 1
`since they obey standard stochastic constraints.
`The above stochastic process could be called an observ-
`able Markov model since the output of the process is the
`set of states at each instant of time, where each state cor-
`responds to a physical (observable) event. To set ideas, con-
`sider a simple 3-state Markov model of the weather. We
`assume that once a day (e.g., at noon), the weather is
`
`'A good overview of discrete Markov processes is in [20, ch. 51.
`
`258
`
`PROCEEDINGS OF THE IEEE, VOL. 77, NO. 2, FEBRUARY 1989
`
`IPR2023-00037
`Apple EX1029 Page 2
`
`

`observed as being one of the following:
`State 1: rain or (snow)
`State 2: cloudy
`State 3: sunny.
`We postulate that the weather on day tis characterized by
`a single one of the three states above, and that the matrix
`A of state transition probabilities is
`
`0.4 0.3 0.3
`
`LO.1 0.1 0.81
`Given that the weather on day 1 (t = 1) is sunny (state 3),
`we can ask the question: What is the probability (according
`to the model) that the weather for the next 7 days will be
`"sun-sun-rain-rain-sun-cloudy-sun * *
`a " ? Stated more for-
`mally, we define the observation sequence 0 as 0 = {S3,
`S3, S3, S1, S1, S3, Sz, S3} corresponding to t = 1, 2, . . . , 8,
`and we wish to determine the probability of 0, given the
`model. This probability can be expressed (and evaluated)
`as
`
`P(0IModel) = RS,, S3, S3, S1, S1, S3, Sz, S31Modell
`= SS31 . RS3lS3l SS3lS3l RSlIS31
`
`a33 * a33 . a31 * all . a13 . a32 . aZ3
`= 7r3
`= 1 . (0.8)(0.8)(0.1)(0.4)(0.3)(0.1)(0.2)
`= 1.536 X
`where we use the notation
`
`(4)
`
`1 5 i 5 N
`K, = 491 = S;],
`to denote the initial state probabilities.
`Another interesting question we can ask (and answer
`using the model) is: Given that the model is in a known state,
`what is the probabilityit stays in that stateforexactlyddays?
`This probability can be evaluated as the probability of the
`observation sequence
`0 = {Si, Si, Si, . * * , S. s # S;},
`d' dkl
`
`
`
`1
`2
`3
`given the model, which is
`P(OIMode1, ql = S;) = (aJd-'(l - a;;) = p,(d).
`(5)
`The quantityp;(d) i s the (discrete) probability density func-
`tion of duration d i n state i. This exponential duration den-
`sity is characteristic of the state duration in a Markovchain.
`Based on pi(d), we can readily calculate the expected num-
`ber of observations (duration) in a state, conditioned on
`starting in that state as
`d; = c dpi(d)
`-
`m
`= c d(ajJd-'(1 - a;,) = -.
`
`d = l
`
`(6a)
`
`m
`
`1
`1 - ai;
`d = l
`Thus the expected number of consecutive days of sunny
`weather, according to the model, i s 140.2) = 5; for cloudy
`it is 2.5; for rain it is 1.67.
`
`(6b)
`
`A. Extension to Hidden Markov Models
`So far we have considered Markov models in which each
`state corresponded to an observable (physical) event. This
`model is too restrictive to be applicable to many problems
`of interest. In this section we extend the concept of Markov
`models to include the case where the observation is a prob-
`abilistic function of the state-i.e.,
`the resulting model
`(which iscalled a hidden Markovmodel) isadoublyembed-
`ded stochastic process with an underlying stochastic pro-
`cess that is not observable (it is hidden), but can only be
`observed through another set of stochastic processes that
`produce the sequence of observations. To fix ideas, con-
`sider the following model of some simple coin tossing
`experiments.
`Coin Toss Models: Assume the following scenario. You
`are in a room with a barrier (e.g., a curtain) through which
`you cannot see what is happening. On the other side of the
`barrier is another person who is performing a coin (or mul-
`tiplecoin) tossing experiment. Theother person will not tell
`you anything about what he is doing exactly; he will only
`tell you the result of each coin flip. Thus a sequence of hid-
`den coin tossing experiments is performed, with the obser-
`vation sequence consisting of a series of heads and tails;
`e.g., a typical observation sequence would be
`0 = O1 O2 O3 . . . OT
`= x x333x 3 3 x . . . x
`where X stands for heads and 3 stands for tails.
`Given the above scenario, the problem of interest is how
`do we build an HMM to explain (model) the observed
`sequence of heads and tails. The first problem one faces is
`deciding what the states in the model correspond to, and
`then deciding how many states should be in the model. One
`possiblechoicewould betoassumethatonlyasingle biased
`coin was being tossed. In this case we could model the sit-
`uation with a 2-state model where each state corresponds
`to a side of the coin (i.e., heads or tails). This model is
`depicted in Fig. 2(a).3 In this case the Markov model is
`observable, and the only issue for complete specification
`of the model would be to decide on the best value for the
`bias (i.e., the probability of, say, heads). Interestingly, an
`equivalent HMM to that of Fig. 2(a) would be a degenerate
`I-state model, where the state corresponds to the single
`biased coin, and the unknown parameter is the bias of the
`coin.
`A second form of HMM for explaining the observed
`sequence of coin toss outcome is given in Fig. 2(b). In this
`case there are 2 states in the model and each state corre-
`sponds to a different, biased, coin being tossed. Each state
`is characterized by a probability distribution of heads and
`tails, and transitions between states are characterized by a
`state transition matrix. The physical mechanism which
`accounts for how state transitions are selected could itself
`be a set of independent coin tosses, or some other prob-
`abilistic event.
`A third form of HMM for explaining the observed
`sequence of coin toss outcomes is given in Fig. 2(c). This
`model corresponds to using 3 biased coins, and choosing
`from among the three, based on some probabilistic event.
`
`3The model of Fig. 2(a) is a memoryless process and thus is a
`degenerate case of a Markov model.
`
`RABINER: HIDDEN MARKOV MODELS
`
`259
`
`IPR2023-00037
`Apple EX1029 Page 3
`
`

`P(H1
`
`4 - P(HI
`
`HEADS
`
`TAILS
`
`0 - H H T T H T H H T T H ...
`S = l 1 2 2 4 2 1 1 2 2 i ...
`
`0 = H H T T H T H H T T H ...
`S = 2 1 I 2 2 2 1 2 2 1 2 ...
`
`O = H H T T H T H H T T H ...
`s = 3 1 2 3 3 1 4 2 3 1 3...
`
`0 33
`STATE
`---
`3
`2
`t
`
`P3
`Pp
`P(H1 PI
`I-P3
`i-Pp
`P(T)
`I-P,
`Fig. 2. Three possible Markov models which can account
`for the resultsof hidden coin tossing experiments. (a) I-coin
`model. (b) 2-coins model. (c) 3-coins model.
`
`Given the choice among the three models shown in Fig.
`2 for explaining the observed sequence of heads and tails,
`a natural question would bewhich model best matches the
`actual observations. It should beclearthat the simple I-coin
`model of Fig. 2(a) has only 1 unknown parameter; the 2-coin
`model of Fig. 2(b) has4 un known parameters; and the 3-coin
`model of Fig. 2(c) has 9 unknown parameters. Thus, with
`the greater degrees of freedom, the larger HMMs would
`seem to inherently be more capable of modeling a series
`of coin tossing experiments than would equivalently smaller
`models. Although this is theoretically true, we will see later
`in this paper that practical considerations impose some
`strong limitations on the size of models that we can con-
`sider. Furthermore, it might just be the case that only a sin-
`glecoin is being tossed. Then using the 3-coin model of Fig.
`2(c) would be inappropriate, since the actual physical event
`would not correspond to the model being used-i.e., we
`would be using an underspecified system.
`The Urn and BallMode14:To extend the ideas of the HMM
`to a somewhat more complicated situation, consider the
`urn and ball system of Fig. 3. We assume that there are N
`(1arge)glassurnsin aroom. Withineach urntherearealarge
`number of colored balls. We assume there are M distinct
`colorsofthe balls. The physical processforobtainingobser-
`vations is as follows. A genie is in the room, and according
`to some random process, he (or she) chooses an initial urn.
`From this urn, a ball is chosen at random, and i t s color is
`recorded as theobservation.The ball is then replaced in the
`urn from which it was selected. A new urn is then selected
`
`4The urn and ball model was introduced by Jack Ferguson, and
`his colleagues, in lectures on HMM theory.
`
`os {GREEN, GREEN, BLUE, RED, YELLOW, RED, .. . . . ... BLUE}
`Fig. 3. An N-state urn and ball model which illustrates the
`general case of a discrete symbol HMM.
`
`according to the random selection process associated with
`the current urn, and the ball selection process is repeated.
`This entire process generates afinite observation sequence
`of colors, which we would like to model as the observable
`output of an HMM.
`It should be obvious that the simplest HMM that cor-
`responds to the urn and ball process is one in which each
`state corresponds to a specific urn, and for which a (ball)
`color probability is defined for each state. The choice of
`urns is dictated by the state transition matrix of the HMM.
`
`5. Elements of an HMM
`The above examples give us a pretty good idea of what
`an HMM is and how it can be applied to some simple sce-
`narios. We now formally define the elements of an HMM,
`and explain how the model generates observation
`sequences.
`An HMM is characterized by the following:
`1) N, the number of states in the model. Although the
`states are hidden, for many practical applications there is
`often some physical significance attached to the states or
`to sets of states of the model. Hence, in the coin tossing
`experiments, each state corresponded to a distinct biased
`coin. In the urn and ball model, the states corresponded
`to the urns. Generally the states are interconnected in such
`a way that any state can be reached from any other state
`(e.g., an ergodic model); however, we will see later in this
`paper that other possible interconnections of states are
`often of interest. We denote the individual states as S = {Sl,
`S2, . . . , S N } , and the state at time t as g,.
`2) M, the number of distinct observation symbols per
`state, i.e., the discrete alphabet size. The observation sym-
`bols correspond to the physical output of the system being
`modeled. For the coin toss experiments the observation
`symbols were simply heads or tails; for the ball and urn
`model they were the colors of the balls selected from the
`urns. We denote the individual symbols as V = {vl, v,,
`* . * , V M ) .
`3) The state transition probability distribution A = {a,}
`where
`
`1 5 i, j I N.
`a,, = p[q,+l = S,lq, = S,],
`(7)
`For the special case where any state can reach any other
`state in a single step, we have a, > 0 for all i, j . For other
`types of HMMs, we would have a,] = 0 for one or more (i,
`j ) pairs.
`
`260
`
`PROCEEDINGS OF THE IEEE, VOL. 77, NO. 2, FEBRUARY 1989
`
`IPR2023-00037
`Apple EX1029 Page 4
`
`

`4) The observation symbol probability distribution in
`statej, B = {b,(k)}, where
`b,(k) = p[vk at t)qt = S,],
`
`1 I j 5 N
`
`I i k i M .
`5) The initial state distribution T = { T ~ } where
`T , = p[ql = SI], 1 i i i N.
`(9)
`Given appropriate values of N, M, A, B, and ir, the H M M
`can be used as a generator to give an observation sequence
`
`(8)
`
`0 = 0 1 O ~ ~ ~ ~ o J (10)
`(where each observation 0, is one of the symbols from V,
`and Tis the number of observations in the sequence) as
`follows:
`1) Choose an initial state q, = SI according to the initial
`state distribution T .
`2) Set t = 1.
`3) Choose 0, = vk according to the symbol probability
`distribution in state SI, i.e., b,(k).
`4) Transit to a new state q,,, = S, according to the state
`transition probability distribution for state S,, i.e., a,.
`5) Set t = t + 1; return to step 3)if t < T; otherwise ter-
`minate the procedure.
`The above procedure can be used as both a generator of
`observations, and as a model for how a given observation
`sequence was generated by an appropriate HMM.
`It can be seen from the above discussion that a complete
`specification of an H M M requires specification of two
`model parameters (N and M), specification of observation
`symbols, and the specification of the three probability mea-
`sures A, B, and T . For convenience, we use the compact
`notation
`
`A = (A, 6, T )
`to indicate the complete parameter set of the model.
`
`(11)
`
`C. The Three Basic Problems for HMMs5
`Given the form of H M M of the previous section, there are
`three basic problems of interest that must be solved for the
`model to be useful in real-world applications. These prob-
`lems are the following:
`Problem 7: Given the observation sequence 0 = O1 O2
`. . * Or, and a model A = (A, 6, ir), how do
`we efficiently compute P(OIA), the proba-
`bilityof theobservation sequence,given the
`model?
`Problem 2: Given the observation sequence 0 = 0, O2
`. . . Or, and the model A, how do we choose
`a corresponding state sequence Q = q1 q2
`. . . qJwhich is optimal in some meaningful
`sense (i.e., best “explains”
`the observa-
`t ion s)?
`Problem 3: How do we adjust the model parameters A
`= (A, B, T ) to maximize P(OJA)?
`
`5The material in this section and in Section I l l is based on the
`ideas presented by Jack Ferguson of IDA in lectures at Bell Lab-
`oratories.
`
`Problem 1 i s the evaluation problem, namely given a
`model and asequenceof observations, how dowecompute
`the probability that the observed sequence was produced
`by the model. We can also view the problem as one of scor-
`ing how well a given model matches a given observation
`sequence. The latter viewpoint is extremely useful. For
`example, if we consider the case in which we are trying to
`choose among several competing models, the solution to
`Problem 1 allows us to choose the model which best
`matches the observations.
`Problem 2 is the one in which we attempt to uncover the
`hidden part of the model, i.e., to find the “correct” state
`sequence. It should be clear that for all but the case of
`degenerate models, there is no “correct” state sequence
`to be found. Hence for practical situations, we usually use
`an optimality criterion to solve this problem as best as pos-
`sible. Unfortunately, as we will see, there are several rea-
`sonable optimality criteria that can be imposed, and hence
`the choice of criterion is a strong function of the intended
`use for the uncovered state sequence. Typical uses might
`be to learn about the structure of the model, to find optimal
`state sequences for continuous speech recognition, or to
`get average statistics of individual states, etc.
`Problem 3 is the one in which we attempt to optimize the
`model parameters so as to best describe how a given obser-
`vation sequence comes about. The observation sequence
`used to adjust the model parameters is called a training
`sequence since it is used to “train” the HMM. The training
`problem is the crucial one for most applications of HMMs,
`since it allows us to optimally adapt model parameters to
`observed training data-i.e.,
`to create best models for real
`phenomena.
`To fix ideas, consider the following simple isolated word
`speech recognizer. For each word of a Wword vocabulary,
`we want to design a separate N-state HMM. We represent
`the speech signal of a given word as a time sequence of
`coded spectral vectors. We assume that the coding is done
`using a spectral codebook with M unique spectral vectors;
`hence each observation is the index of the spectral vector
`closest (in some spectral sense) to the original speech sig-
`nal. Thus, for each vocabulary word, we have a training
`sequence consisting of a number of repetitions of
`sequencesofcodebook indicesoftheword (byoneor more
`talkers). The first task is to build individual word models.
`This task is done by using the solution to Problem 3 to opti-
`mally estimate model parameters for each word model. To
`develop an understanding of the physical meaning of the
`model states, we use the solution to Problem 2 to segment
`each of the word training sequences into states, and then
`study the properties of the spectral vectors that lead to the
`observations occurring in each state. The goal here would
`be to make refinements on the model (e.g., more states,
`different codebook size, etc.) so as to improve its capability
`of modeling the spoken word sequences. Finally, once the
`set of W HMMs has been designed and optimized and thor-
`oughly studied, recognition of an unknown word is per-
`formed using the solution to Problem 1 to score each word
`model based upon the given test observation sequence,
`and select the word whose modelscore is highe5t [k,? the
`highest I i kel i hood).
`In the next section we present formal mathematical solu-
`tionstoeachofthethreefundamental problemsfor HMMs.
`
`RABINER: HIDDEN MARKOV MODELS
`
`261
`
`IPR2023-00037
`Apple EX1029 Page 5
`
`

`We shall see that the three problems are linked together
`tightly under our probabilistic framework.
`
`I l l . SOLUTIONS TO THE THREE BASIC PROBLEMS OF HMMs
`A. Solution to Problem 1
`We wish to calculate the probability of the observation
`sequence, 0 = O1 O2 . . . Or, given the model A, i.e., P ( 0 I X ) .
`The most straightforward way of doing this is through
`enumerating every possible state sequence of length T(the
`number of observations). Consider one such fixed state
`sequence
`
`Q = q i q 2 . * . q r
`(12)
`where q1 is the initial state. The probability of the obser-
`vation sequence 0 for the state sequence of (12) i s
`r
`P(OIQ, N = II P(OtJqt, h)
`
`i = 1
`where we have assumed statistical independence of obser-
`vations. Thus we get
`p(OlQ, N = bql(Oi) . bqz(OJ . . . bqJ(OT). (13b)
`The probability of such a state sequence Q can be written
`as
`
`(13a)
`
`P(QIA) = rq1aq1qzaq2q3 * . * a q r - l q r .
`(14)
`The joint probability of 0 and Q, i.e., the probability that
`Oand Qoccur simultaneously, is simplythe product of the
`above two terms, i.e.,
`P(0, QIN = P(OIQ, N P(Q, N.
`(1 5)
`The probability of 0 (given the model)is obtained by sum-
`ming this joint probabilityover all possible state sequences
`q giving
`
`P ( 0 I N =
`
`P(OIQ, N P(QIN
`
`(1 6)
`
`computations! Clearly a more efficient procedure is
`required to solve Problem 1. Fortunately such a procedure
`exists and is called the forward-backward procedure.
`The Forward-Backward Procedure [2], [316: Consider the
`forward variable a t ( ; ) defined as
`at(;) = P(O1 0 2 . . . oi, qt = S,IN
`(18)
`i.e., the probability of the partial observation sequence, 0,
`02. . . O,,(until timet)andstateS,at time t,given the model
`A. We can solve for a,(;) inductively, as follows:
`1) Initialization:
`
`CY,(;) = ~,b,(Ol), 1 5 i 5 N.
`2) Induction:
`
`(1 9)
`
`I
`
`= C ai(i)al, b , ( ~ ( + ~ ) , I 5 t 5 T - I
`
`1 5 j 5 N.
`
`(20)
`
`3) Termination:
`
`N
`= C a T ( i ) .
`~ ( 0 1 ~ )
`r = l
`
`(21 )
`
`Stepl) initializesthe forward probabilitiesasthejoint prob-
`ability of state SI and initial observation O1. The induction
`step, which is the heart of the forward calculation, is illus-
`trated in Fig. 4(a). This figure shows how state S, can be
`
`(a)
`
`The interpretation of the computation in the above equa-
`tion is the following. Initially (at time t = l) we are in state
`q1 with probability rq,, and generate the symbol O1 (in this
`state) with probability bqI(O1). The clock changes from time
`t to t + 1 (t = 2) and we make a transition to state q, from
`state q1 with probability aqIq2, and generate symbol O2 with
`probability bq2(O2). This process continues in this manner
`until we make the list transition (at time T ) from state q T - 1
`to state qT with probability aqr-lqr and generate symbol Or
`with probability bql(Or).
`A little thought should convince the reader that the cal-
`culation of P(O(h), according to its direct definition (17)
`involves on the order of 2T. N'calculations, since at every
`t = 1, 2, . . . , T, there are N possible states which can be
`reached (i.e., there are Nr possible state sequences), and
`for each such state sequence about 2T calculations are
`required for each term in the sum of (17). (To be precise,
`we need (2T - l)Nr multiplications, and NT - 1 additions.)
`This calculation is computationally unfeasible, even for
`small values of N and T; e.g., for N = 5 (states), T = 100
`(observations), there are on the order of 2 . 100 * 5" =
`
`I
`
`I
`1
`
` I
`2
`
`I
`
`3
`OBSERVATION, t
`(a) Illustration of the sequence of operations
`Fig. 4.
`required for thecomputation of the forward variableol,+,(j).
`(b) Implementation of the computation of a,(;) in terms of
`a lattice of observations t, and states i.
`
`I
`T
`
`bStrictly speaking, we only need the forward part of the forward-
`backward procedure to solve Problem 1. We will introduce the
`backward part of the procedure in this section since it will be used
`to help solve Problem 3.
`
`262
`
`PROCEEDINGS OF THE IEEE, VOL. 77, NO. 2, FEBRUARY 1989
`
`IPR2023-00037
`Apple EX1029 Page 6
`
`

`(22)
`
`reached at time t + 1 from the N possible states, S I , 1 I i
`I N, at timet. Since a,(;) is the probabilityof the joint event
`that 0, O2 . . . 0, are observed, and the state at time t is SI,
`the product a , ( i ) a , is then the probabilityof the joint event
`that O1 O2 . . . 0, are observed, and state S, is reached at
`time t + 1 via state S, at time t. Summing this product over
`all the N possible states SI, 1 5 i I N at time t results in the
`probabilityof Slat time t + 1 with all theaccompanying pre-
`vious partial observations. Once this is done and SI is known,
`it is easy to see that a,+l(j) i s obtained by accounting for
`in state j , i.e., by multiplying the summed
`observation
`quantity bythe probabilityb,(O,+l).Thecomputation of(20)
`is performed for all states j , 1 I j c N, for a given t; the
`computation i s then iterated for t = 1,2, . . . , T - 1. Finally,
`step 3) gives the desired calculation of P(0IX) as the sum
`of the terminal forward variables a T ( i ) . This is the case since,
`by definition,
`aJ(i) = P(o1 0 2 . . ' or, qJ = s,Ih)
`and hence P(O(X) i s just the sum of the aJ(i)'s.
`If we examine the computation involved in the calcula-
`tion of a,(j), 1 5 t I T, 1 5 j 5 N, we see that it requires
`on the order of N2T calculations, rather than 2TNr as
`required by the direct calculation. (Again, to be precise, we
`need N(N + 1)(T - 1) + N multiplications and N(N - 1)(T
`- 1) additions.) For N = 5, T = 100, we need about 3000
`computations for the forward method, versus IO7* com-
`putations for t

This document is available on Docket Alarm but you must sign up to view it.

Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

Up-to-date information for this case.
Email alerts whenever there is an update.
Full text search for other cases.
Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.

Access Government Site

We are redirecting you
to a mobile optimized page.

Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket

Supplemental Search

Search for PTAB Motions

PTAB Analytics

TTAB Analytics

Basic Search

Filters

Party Search

Advanced

Selected Courts

Recently Selected Courts

Find PTAB Decisions

PTAB Analytics

Special PTAB Alerts

Orange Book

Directly Search Federal Courts

Search Trademark ...

This document is available on Docket Alarm but you must sign up to view it.

Accessing this document will incur an additional charge of $.

Still Working On It

A few More Minutes ... Still Working

This document could not be displayed.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

One Moment Please

Your document is on its way!

Sealed Document

We are redirecting youto a mobile optimized page.

Document Unreadable or Corrupt

We are unable to display this document.

STEP 2 of 2

Choose your membership type

Flat-Fee

Pay-As-You-Go

Add your payment information

Login or Join

Enter your corporate Email

Thousands of your peers are saving time and gaining a competitive advantage with Docket Alarm.

Join Docket Alarm to perform smarter legal research.

Download this document and millions of others instantly with a Docket Alarm membership.

Join Docket Alarm and start performing smarter legal research.

Start tracking this docket instantly with a Docket Alarm membership.

Join thousands of your peers and start performing smarter legal research.

STEP 1 of 2

Millions of Documents | 15 Seconds to Signup

Hi !

Welcome to Docket Alarm

Welcome to Docket Alarm!

Explore Litigation Insights andManage Your Cases

Reset Password

What is PACER?

Why do I need it?

What will I be charged?

Do other courts have fees?

Basic Free Access

Welcome

Thank you

Check Firm Account

We are redirecting you
to a mobile optimized page.

Explore Litigation Insights and
Manage Your Cases