`Selected Applications in Speech Recognition
`
`LAWRENCE R. RABINER, FELLOW,
`
`IEEE
`
`Although initially introduced and studied in the late 1960s and
`early 1970s, statistical methods of Markov source or hidden Markov
`modeling have become increasingly popular in the last several
`years. There are two strong reasons why this has occurred. First the
`models are very rich in mathematical structure and hence can form
`the theoretical basis for use in a wide range of applications. Sec-
`ond the models, when applied properly, work very well in practice
`for several important applications. In this paper we attempt to care-
`fully and methodically review the theoretical aspects of this type
`of statistical modeling and show how they have been applied to
`selected problems in machine recognition of speech.
`
`I. INTRODUCTION
`
`Real-world processes generally produce observable out-
`puts which can be characterized as signals. The signals can
`bediscrete in nature(e.g.,charactersfrom afinitealphabet,
`quantized vectors from a codebook, etc.), or continuous in
`nature (e.g., speech samples, temperature measurements,
`music, etc.). The signal source can be stationary (i.e., its sta-
`tistical properties do not vary with time), or nonstationary
`(i.e., the signal properties vary over time). The signals can
`be pure (i.e., coming strictly from a single source), or can
`be corrupted from other signal sources (e.g., noise) or by
`transmission distortions, reverberation, etc.
`A problem of fundamental interest is characterizing such
`real-world signals in terms of signal models. There are sev-
`eral reasons why one is interested in applying signal models.
`First of all, a signal model can provide the basis for a the-
`oretical description of a signal processing system which can
`be used to process the signal so as to provide a desired out-
`put. For example if we are interested in enhancing a speech
`signal corrupted by noise and transmission distortion, we
`can use the signal model to design a system which will opti-
`mally remove the noise and undo the transmission distor-
`tion. A second reason why signal models are important is
`that they are potentially capable of letting us learn a great
`deal about the signal source (i.e., the real-world process
`which produced the signal) without having to have the
`sourceavailable. This property is especially important when
`the cost of getting signals from the actual source is high.
`
`In this case, with a good signal model, we can simulate the
`source and learn as much as possible via simulations.
`Finally, the most important reason why signal models are
`important is that they often workextremelywell in practice,
`and enable us to realize important practical systems-e.g.,
`prediction systems, recognition systems, identification sys-
`tems, etc., in a very efficient manner.
`These are several possible choices for what type of signal
`model is used for characterizing the properties of a given
`signal. Broadly one can dichotomize the types of signal
`models into the class of deterministic models, and the class
`of statistical models. Deterministic models generally exploit
`some known specific properties of the signal, e.g., that the
`signal is a sine wave, or a sum of exponentials, etc. In these
`cases, specification of the signal model is generally straight-
`forward;all that is required istodetermine(estimate)values
`of the parameters of the signal model (e.g., amplitude, fre-
`quency, phase of a sine wave, amplitudes and rates of expo-
`nentials, etc.). The second broad class of signal models is
`the set of statistical models in which one tries to charac-
`terize only the statistical properties of the signal. Examples
`of such statistical models include Gaussian processes, Pois-
`son processes, Markov processes, and hidden Markov pro-
`cesses, among others. The underlying assumption of the
`statistical model i s that the signal can be well characterized
`as a parametric random process, and that the parameters
`of the stochastic process can be determined (estimated) in
`a precise, well-defined manner.
`For the applications of interest, namely speech process-
`ing, both deterministic and stochastic signal models have
`had good success. In this paper we will concern ourselves
`strictlywith one typeof stochastic signal model, namelythe
`hidden Markov model (HMM). (These models are referred
`to as Markov sources or probabilistic functions of Markov
`chains in the communications literature.) We will first
`review the theory of Markov chains and then extend the
`ideas to the class of hidden Markov models using several
`simple examples. We will then focus our attention on the
`three fundamental problems' for HMM design, namely: the
`
`Manuscript received January 15,1988; revised October 4,1988.
`The author is with AT&T Bell Laboratories, Murray Hill, NJ 07974-
`2070, USA.
`IEEE Log Number 8825949.
`
`'The idea of characterizing the theoretical aspects of hidden
`Markov modeling in terms of solving three fundamental problems
`is due to Jack Ferguson of IDA (Institute for Defense Analysis) who
`introduced it in lectures and writing.
`
`0018-9219/89/02000257$01.00 0 1989 IEEE
`
`PROCEEDINGS OF THE IEEE, VOL. 77, NO. 2, FEBRUARY 1989
`
`257
`
`IPR2023-00037
`Apple EX1029 Page 1
`
`
`
`evaluation of the probability (or likelihood) of a sequence
`of observations given a specific HMM; the determination
`of a best sequence of model states; and the adjustment of
`model parameters so as to best account for the observed
`signal. We will show that once these three fundamental
`problems are solved, we can apply HMMs to selected prob-
`lems in speech recognition.
`Neither the theory of hidden Markov models nor its
`applications to speech recognition is new. The basic theory
`was published in a series of classic papers by Baum and his
`colleagues [I]-[5] in the late 1960s and early 1970s and was
`implemented for speech processing applications by Baker
`161 at CMU, and by Jelinek and his colleagues at IBM [7-[13]
`in the 1970s. However, widespread understanding and
`application of the theory of HMMs to speech processing
`has occurred only within the past several years. There are
`several reasons why this has been the case. First, the basic
`theory of hidden Markov models was published in math-
`ematical journals which were not generally read by engi-
`neers working on problems in speech processing. The sec-
`ond reason was that the original applications of the theory
`to speech processing did not provide sufficient tutorial
`material for most readers to understand the theory and to
`be able to apply it to their own research. As a result, several
`tutorial papers were written which provided a sufficient
`level of detail for a number of research labs to begin work
`using HMMs in individual speech processing applications
`[14]-[19]. This tutorial i s intended to provide an overview
`of the basic theory of HMMs (as originated by Baum and
`his colleagues), provide practical details on methods of
`implementation of the theory, and describe a couple of
`selected applications of the theory to distinct problems in
`speech recognition. The paper combines results from a
`number of original sources and hopefully provides a single
`source for acquiring the background required to pursue
`further this fascinating area of research.
`The organization of this paper is as follows. In Section I1
`we review the theory of discrete Markov chains and show
`how the concept of hidden states, where the observation
`i s a probabilistic function of the state, can be used effec-
`tively. We illustrate the theory with two simple examples,
`namely coin-tossing, and the classic balls-in-urns system.
`In Section Ill we discuss the three fundamental problems
`of HMMs, and give several practical techniques for solving
`these problems. In Section IV we discuss the various types
`of HMMs that have been studied including ergodic as well
`as left-right models. In this section we also discuss the var-
`ious model features including the form of the observation
`density function, the state duration density, and the opti-
`mization criterion for choosing optimal HMM parameter
`values. In Section Vwe discuss the issues that arise in imple-
`menting HMMs including the topics of scaling, initial
`parameter estimates, model size, model form, missingdata,
`and multiple observation sequences. In Section VI we
`describean isolated word speech recognizer, implemented
`with HMM ideas, and show how it performs as compared
`to alternative implementations. In Section VI1 we extend
`the ideas presented in Section VI to the problem of recog-
`nizing a string of spoken words based on concatenating
`individual HMMsofeachword in thevocabulary. In Section
`V l l l we briefly outline how the ideas of HMM have been
`applied to a largevocabulary speech recognizer, and in Sec-
`
`tion IX we summarize the ideas discussed throughout the
`paper.
`
`11. DISCRETE MARKOV PROCESSES~
`Consider a system which may be described at any time
`as being in one of a set of N distinct states, S1, SzI . . . , SN,
`as illustrated in Fig. 1 (where N = 5 for simplicity). At reg-
`
`Fig. 1. A Markov chain with 5 states (labeled S, to S,) with
`selected state transitions.
`
`ularlyspaced discrete times, the system undergoesachange
`of state (possibly back to the same state) according to a set
`of probabilities associated with the state. We denote the
`time instants associated with state changes as t = 1, 2,
`. . . , and we denote the actual state at time t as qr. A full
`probabilistic description of the above system would, in gen-
`eral, require specification of the current state (at time t), as
`well as all the predecessor states. For the special case of a
`discrete, first order, Markov chain, this probabilistic
`description is truncated to just the current and the pre-
`decessor state, i.e.,
`99, = q q t - 1 = S I , q t - 2 = S k r . . . I
`= 9s: = S&:
`= SJ.
`(1 )
`Furthermoreweonlyconsider those processes in which the
`right-hand side of (1) i s independent of time, thereby lead-
`ing to the set of state transition probabilities a,, of the form
`1 5 i , j 5 N
`a,, = 99, = S,(q,-, = S,],
`(2)
`with the state transition coefficients having the properties
`a,, 2 0
`
`(3a)
`
`N
`
`C a,, = I
`
`(3b)
`
`/ = 1
`since they obey standard stochastic constraints.
`The above stochastic process could be called an observ-
`able Markov model since the output of the process is the
`set of states at each instant of time, where each state cor-
`responds to a physical (observable) event. To set ideas, con-
`sider a simple 3-state Markov model of the weather. We
`assume that once a day (e.g., at noon), the weather is
`
`'A good overview of discrete Markov processes is in [20, ch. 51.
`
`258
`
`PROCEEDINGS OF THE IEEE, VOL. 77, NO. 2, FEBRUARY 1989
`
`IPR2023-00037
`Apple EX1029 Page 2
`
`
`
`observed as being one of the following:
`State 1: rain or (snow)
`State 2: cloudy
`State 3: sunny.
`We postulate that the weather on day tis characterized by
`a single one of the three states above, and that the matrix
`A of state transition probabilities is
`
`0.4 0.3 0.3
`
`LO.1 0.1 0.81
`Given that the weather on day 1 (t = 1) is sunny (state 3),
`we can ask the question: What is the probability (according
`to the model) that the weather for the next 7 days will be
`"sun-sun-rain-rain-sun-cloudy-sun * *
`a " ? Stated more for-
`mally, we define the observation sequence 0 as 0 = {S3,
`S3, S3, S1, S1, S3, Sz, S3} corresponding to t = 1, 2, . . . , 8,
`and we wish to determine the probability of 0, given the
`model. This probability can be expressed (and evaluated)
`as
`
`P(0IModel) = RS,, S3, S3, S1, S1, S3, Sz, S31Modell
`= SS31 . RS3lS3l SS3lS3l RSlIS31
`
`a33 * a33 . a31 * all . a13 . a32 . aZ3
`= 7r3
`= 1 . (0.8)(0.8)(0.1)(0.4)(0.3)(0.1)(0.2)
`= 1.536 X
`where we use the notation
`
`(4)
`
`1 5 i 5 N
`K, = 491 = S;],
`to denote the initial state probabilities.
`Another interesting question we can ask (and answer
`using the model) is: Given that the model is in a known state,
`what is the probabilityit stays in that stateforexactlyddays?
`This probability can be evaluated as the probability of the
`observation sequence
`0 = {Si, Si, Si, . * * , S. s # S;},
`d' dkl
`
`
`
`1
`2
`3
`given the model, which is
`P(OIMode1, ql = S;) = (aJd-'(l - a;;) = p,(d).
`(5)
`The quantityp;(d) i s the (discrete) probability density func-
`tion of duration d i n state i. This exponential duration den-
`sity is characteristic of the state duration in a Markovchain.
`Based on pi(d), we can readily calculate the expected num-
`ber of observations (duration) in a state, conditioned on
`starting in that state as
`d; = c dpi(d)
`-
`m
`= c d(ajJd-'(1 - a;,) = -.
`
`d = l
`
`(6a)
`
`m
`
`1
`1 - ai;
`d = l
`Thus the expected number of consecutive days of sunny
`weather, according to the model, i s 140.2) = 5; for cloudy
`it is 2.5; for rain it is 1.67.
`
`(6b)
`
`A. Extension to Hidden Markov Models
`So far we have considered Markov models in which each
`state corresponded to an observable (physical) event. This
`model is too restrictive to be applicable to many problems
`of interest. In this section we extend the concept of Markov
`models to include the case where the observation is a prob-
`abilistic function of the state-i.e.,
`the resulting model
`(which iscalled a hidden Markovmodel) isadoublyembed-
`ded stochastic process with an underlying stochastic pro-
`cess that is not observable (it is hidden), but can only be
`observed through another set of stochastic processes that
`produce the sequence of observations. To fix ideas, con-
`sider the following model of some simple coin tossing
`experiments.
`Coin Toss Models: Assume the following scenario. You
`are in a room with a barrier (e.g., a curtain) through which
`you cannot see what is happening. On the other side of the
`barrier is another person who is performing a coin (or mul-
`tiplecoin) tossing experiment. Theother person will not tell
`you anything about what he is doing exactly; he will only
`tell you the result of each coin flip. Thus a sequence of hid-
`den coin tossing experiments is performed, with the obser-
`vation sequence consisting of a series of heads and tails;
`e.g., a typical observation sequence would be
`0 = O1 O2 O3 . . . OT
`= x x333x 3 3 x . . . x
`where X stands for heads and 3 stands for tails.
`Given the above scenario, the problem of interest is how
`do we build an HMM to explain (model) the observed
`sequence of heads and tails. The first problem one faces is
`deciding what the states in the model correspond to, and
`then deciding how many states should be in the model. One
`possiblechoicewould betoassumethatonlyasingle biased
`coin was being tossed. In this case we could model the sit-
`uation with a 2-state model where each state corresponds
`to a side of the coin (i.e., heads or tails). This model is
`depicted in Fig. 2(a).3 In this case the Markov model is
`observable, and the only issue for complete specification
`of the model would be to decide on the best value for the
`bias (i.e., the probability of, say, heads). Interestingly, an
`equivalent HMM to that of Fig. 2(a) would be a degenerate
`I-state model, where the state corresponds to the single
`biased coin, and the unknown parameter is the bias of the
`coin.
`A second form of HMM for explaining the observed
`sequence of coin toss outcome is given in Fig. 2(b). In this
`case there are 2 states in the model and each state corre-
`sponds to a different, biased, coin being tossed. Each state
`is characterized by a probability distribution of heads and
`tails, and transitions between states are characterized by a
`state transition matrix. The physical mechanism which
`accounts for how state transitions are selected could itself
`be a set of independent coin tosses, or some other prob-
`abilistic event.
`A third form of HMM for explaining the observed
`sequence of coin toss outcomes is given in Fig. 2(c). This
`model corresponds to using 3 biased coins, and choosing
`from among the three, based on some probabilistic event.
`
`3The model of Fig. 2(a) is a memoryless process and thus is a
`degenerate case of a Markov model.
`
`RABINER: HIDDEN MARKOV MODELS
`
`259
`
`IPR2023-00037
`Apple EX1029 Page 3
`
`
`
`P(H1
`
`4 - P(HI
`
`HEADS
`
`TAILS
`
`0 - H H T T H T H H T T H ...
`S = l 1 2 2 4 2 1 1 2 2 i ...
`
`0 = H H T T H T H H T T H ...
`S = 2 1 I 2 2 2 1 2 2 1 2 ...
`
`O = H H T T H T H H T T H ...
`s = 3 1 2 3 3 1 4 2 3 1 3...
`
`0 33
`STATE
`---
`3
`2
`t
`
`P3
`Pp
`P(H1 PI
`I-P3
`i-Pp
`P(T)
`I-P,
`Fig. 2. Three possible Markov models which can account
`for the resultsof hidden coin tossing experiments. (a) I-coin
`model. (b) 2-coins model. (c) 3-coins model.
`
`Given the choice among the three models shown in Fig.
`2 for explaining the observed sequence of heads and tails,
`a natural question would bewhich model best matches the
`actual observations. It should beclearthat the simple I-coin
`model of Fig. 2(a) has only 1 unknown parameter; the 2-coin
`model of Fig. 2(b) has4 un known parameters; and the 3-coin
`model of Fig. 2(c) has 9 unknown parameters. Thus, with
`the greater degrees of freedom, the larger HMMs would
`seem to inherently be more capable of modeling a series
`of coin tossing experiments than would equivalently smaller
`models. Although this is theoretically true, we will see later
`in this paper that practical considerations impose some
`strong limitations on the size of models that we can con-
`sider. Furthermore, it might just be the case that only a sin-
`glecoin is being tossed. Then using the 3-coin model of Fig.
`2(c) would be inappropriate, since the actual physical event
`would not correspond to the model being used-i.e., we
`would be using an underspecified system.
`The Urn and BallMode14:To extend the ideas of the HMM
`to a somewhat more complicated situation, consider the
`urn and ball system of Fig. 3. We assume that there are N
`(1arge)glassurnsin aroom. Withineach urntherearealarge
`number of colored balls. We assume there are M distinct
`colorsofthe balls. The physical processforobtainingobser-
`vations is as follows. A genie is in the room, and according
`to some random process, he (or she) chooses an initial urn.
`From this urn, a ball is chosen at random, and i t s color is
`recorded as theobservation.The ball is then replaced in the
`urn from which it was selected. A new urn is then selected
`
`4The urn and ball model was introduced by Jack Ferguson, and
`his colleagues, in lectures on HMM theory.
`
`os {GREEN, GREEN, BLUE, RED, YELLOW, RED, .. . . . ... BLUE}
`Fig. 3. An N-state urn and ball model which illustrates the
`general case of a discrete symbol HMM.
`
`according to the random selection process associated with
`the current urn, and the ball selection process is repeated.
`This entire process generates afinite observation sequence
`of colors, which we would like to model as the observable
`output of an HMM.
`It should be obvious that the simplest HMM that cor-
`responds to the urn and ball process is one in which each
`state corresponds to a specific urn, and for which a (ball)
`color probability is defined for each state. The choice of
`urns is dictated by the state transition matrix of the HMM.
`
`5. Elements of an HMM
`The above examples give us a pretty good idea of what
`an HMM is and how it can be applied to some simple sce-
`narios. We now formally define the elements of an HMM,
`and explain how the model generates observation
`sequences.
`An HMM is characterized by the following:
`1) N, the number of states in the model. Although the
`states are hidden, for many practical applications there is
`often some physical significance attached to the states or
`to sets of states of the model. Hence, in the coin tossing
`experiments, each state corresponded to a distinct biased
`coin. In the urn and ball model, the states corresponded
`to the urns. Generally the states are interconnected in such
`a way that any state can be reached from any other state
`(e.g., an ergodic model); however, we will see later in this
`paper that other possible interconnections of states are
`often of interest. We denote the individual states as S = {Sl,
`S2, . . . , S N } , and the state at time t as g,.
`2) M, the number of distinct observation symbols per
`state, i.e., the discrete alphabet size. The observation sym-
`bols correspond to the physical output of the system being
`modeled. For the coin toss experiments the observation
`symbols were simply heads or tails; for the ball and urn
`model they were the colors of the balls selected from the
`urns. We denote the individual symbols as V = {vl, v,,
`* . * , V M ) .
`3) The state transition probability distribution A = {a,}
`where
`
`1 5 i, j I N.
`a,, = p[q,+l = S,lq, = S,],
`(7)
`For the special case where any state can reach any other
`state in a single step, we have a, > 0 for all i, j . For other
`types of HMMs, we would have a,] = 0 for one or more (i,
`j ) pairs.
`
`260
`
`PROCEEDINGS OF THE IEEE, VOL. 77, NO. 2, FEBRUARY 1989
`
`IPR2023-00037
`Apple EX1029 Page 4
`
`
`
`4) The observation symbol probability distribution in
`statej, B = {b,(k)}, where
`b,(k) = p[vk at t)qt = S,],
`
`1 I j 5 N
`
`I i k i M .
`5) The initial state distribution T = { T ~ } where
`T , = p[ql = SI], 1 i i i N.
`(9)
`Given appropriate values of N, M, A, B, and ir, the H M M
`can be used as a generator to give an observation sequence
`
`(8)
`
`0 = 0 1 O ~ ~ ~ ~ o J (10)
`(where each observation 0, is one of the symbols from V,
`and Tis the number of observations in the sequence) as
`follows:
`1) Choose an initial state q, = SI according to the initial
`state distribution T .
`2) Set t = 1.
`3) Choose 0, = vk according to the symbol probability
`distribution in state SI, i.e., b,(k).
`4) Transit to a new state q,,, = S, according to the state
`transition probability distribution for state S,, i.e., a,.
`5) Set t = t + 1; return to step 3)if t < T; otherwise ter-
`minate the procedure.
`The above procedure can be used as both a generator of
`observations, and as a model for how a given observation
`sequence was generated by an appropriate HMM.
`It can be seen from the above discussion that a complete
`specification of an H M M requires specification of two
`model parameters (N and M), specification of observation
`symbols, and the specification of the three probability mea-
`sures A, B, and T . For convenience, we use the compact
`notation
`
`A = (A, 6, T )
`to indicate the complete parameter set of the model.
`
`(11)
`
`C. The Three Basic Problems for HMMs5
`Given the form of H M M of the previous section, there are
`three basic problems of interest that must be solved for the
`model to be useful in real-world applications. These prob-
`lems are the following:
`Problem 7: Given the observation sequence 0 = O1 O2
`. . * Or, and a model A = (A, 6, ir), how do
`we efficiently compute P(OIA), the proba-
`bilityof theobservation sequence,given the
`model?
`Problem 2: Given the observation sequence 0 = 0, O2
`. . . Or, and the model A, how do we choose
`a corresponding state sequence Q = q1 q2
`. . . qJwhich is optimal in some meaningful
`sense (i.e., best “explains”
`the observa-
`t ion s)?
`Problem 3: How do we adjust the model parameters A
`= (A, B, T ) to maximize P(OJA)?
`
`5The material in this section and in Section I l l is based on the
`ideas presented by Jack Ferguson of IDA in lectures at Bell Lab-
`oratories.
`
`Problem 1 i s the evaluation problem, namely given a
`model and asequenceof observations, how dowecompute
`the probability that the observed sequence was produced
`by the model. We can also view the problem as one of scor-
`ing how well a given model matches a given observation
`sequence. The latter viewpoint is extremely useful. For
`example, if we consider the case in which we are trying to
`choose among several competing models, the solution to
`Problem 1 allows us to choose the model which best
`matches the observations.
`Problem 2 is the one in which we attempt to uncover the
`hidden part of the model, i.e., to find the “correct” state
`sequence. It should be clear that for all but the case of
`degenerate models, there is no “correct” state sequence
`to be found. Hence for practical situations, we usually use
`an optimality criterion to solve this problem as best as pos-
`sible. Unfortunately, as we will see, there are several rea-
`sonable optimality criteria that can be imposed, and hence
`the choice of criterion is a strong function of the intended
`use for the uncovered state sequence. Typical uses might
`be to learn about the structure of the model, to find optimal
`state sequences for continuous speech recognition, or to
`get average statistics of individual states, etc.
`Problem 3 is the one in which we attempt to optimize the
`model parameters so as to best describe how a given obser-
`vation sequence comes about. The observation sequence
`used to adjust the model parameters is called a training
`sequence since it is used to “train” the HMM. The training
`problem is the crucial one for most applications of HMMs,
`since it allows us to optimally adapt model parameters to
`observed training data-i.e.,
`to create best models for real
`phenomena.
`To fix ideas, consider the following simple isolated word
`speech recognizer. For each word of a Wword vocabulary,
`we want to design a separate N-state HMM. We represent
`the speech signal of a given word as a time sequence of
`coded spectral vectors. We assume that the coding is done
`using a spectral codebook with M unique spectral vectors;
`hence each observation is the index of the spectral vector
`closest (in some spectral sense) to the original speech sig-
`nal. Thus, for each vocabulary word, we have a training
`sequence consisting of a number of repetitions of
`sequencesofcodebook indicesoftheword (byoneor more
`talkers). The first task is to build individual word models.
`This task is done by using the solution to Problem 3 to opti-
`mally estimate model parameters for each word model. To
`develop an understanding of the physical meaning of the
`model states, we use the solution to Problem 2 to segment
`each of the word training sequences into states, and then
`study the properties of the spectral vectors that lead to the
`observations occurring in each state. The goal here would
`be to make refinements on the model (e.g., more states,
`different codebook size, etc.) so as to improve its capability
`of modeling the spoken word sequences. Finally, once the
`set of W HMMs has been designed and optimized and thor-
`oughly studied, recognition of an unknown word is per-
`formed using the solution to Problem 1 to score each word
`model based upon the given test observation sequence,
`and select the word whose modelscore is highe5t [k,? the
`highest I i kel i hood).
`In the next section we present formal mathematical solu-
`tionstoeachofthethreefundamental problemsfor HMMs.
`
`RABINER: HIDDEN MARKOV MODELS
`
`261
`
`IPR2023-00037
`Apple EX1029 Page 5
`
`
`
`We shall see that the three problems are linked together
`tightly under our probabilistic framework.
`
`I l l . SOLUTIONS TO THE THREE BASIC PROBLEMS OF HMMs
`A. Solution to Problem 1
`We wish to calculate the probability of the observation
`sequence, 0 = O1 O2 . . . Or, given the model A, i.e., P ( 0 I X ) .
`The most straightforward way of doing this is through
`enumerating every possible state sequence of length T(the
`number of observations). Consider one such fixed state
`sequence
`
`Q = q i q 2 . * . q r
`(12)
`where q1 is the initial state. The probability of the obser-
`vation sequence 0 for the state sequence of (12) i s
`r
`P(OIQ, N = II P(OtJqt, h)
`
`i = 1
`where we have assumed statistical independence of obser-
`vations. Thus we get
`p(OlQ, N = bql(Oi) . bqz(OJ . . . bqJ(OT). (13b)
`The probability of such a state sequence Q can be written
`as
`
`(13a)
`
`P(QIA) = rq1aq1qzaq2q3 * . * a q r - l q r .
`(14)
`The joint probability of 0 and Q, i.e., the probability that
`Oand Qoccur simultaneously, is simplythe product of the
`above two terms, i.e.,
`P(0, QIN = P(OIQ, N P(Q, N.
`(1 5)
`The probability of 0 (given the model)is obtained by sum-
`ming this joint probabilityover all possible state sequences
`q giving
`
`P ( 0 I N =
`
`P(OIQ, N P(QIN
`
`(1 6)
`
`computations! Clearly a more efficient procedure is
`required to solve Problem 1. Fortunately such a procedure
`exists and is called the forward-backward procedure.
`The Forward-Backward Procedure [2], [316: Consider the
`forward variable a t ( ; ) defined as
`at(;) = P(O1 0 2 . . . oi, qt = S,IN
`(18)
`i.e., the probability of the partial observation sequence, 0,
`02. . . O,,(until timet)andstateS,at time t,given the model
`A. We can solve for a,(;) inductively, as follows:
`1) Initialization:
`
`CY,(;) = ~,b,(Ol), 1 5 i 5 N.
`2) Induction:
`
`(1 9)
`
`I
`
`= C ai(i)al, b , ( ~ ( + ~ ) , I 5 t 5 T - I
`
`1 5 j 5 N.
`
`(20)
`
`3) Termination:
`
`N
`= C a T ( i ) .
`~ ( 0 1 ~ )
`r = l
`
`(21 )
`
`Stepl) initializesthe forward probabilitiesasthejoint prob-
`ability of state SI and initial observation O1. The induction
`step, which is the heart of the forward calculation, is illus-
`trated in Fig. 4(a). This figure shows how state S, can be
`
`(a)
`
`The interpretation of the computation in the above equa-
`tion is the following. Initially (at time t = l) we are in state
`q1 with probability rq,, and generate the symbol O1 (in this
`state) with probability bqI(O1). The clock changes from time
`t to t + 1 (t = 2) and we make a transition to state q, from
`state q1 with probability aqIq2, and generate symbol O2 with
`probability bq2(O2). This process continues in this manner
`until we make the list transition (at time T ) from state q T - 1
`to state qT with probability aqr-lqr and generate symbol Or
`with probability bql(Or).
`A little thought should convince the reader that the cal-
`culation of P(O(h), according to its direct definition (17)
`involves on the order of 2T. N'calculations, since at every
`t = 1, 2, . . . , T, there are N possible states which can be
`reached (i.e., there are Nr possible state sequences), and
`for each such state sequence about 2T calculations are
`required for each term in the sum of (17). (To be precise,
`we need (2T - l)Nr multiplications, and NT - 1 additions.)
`This calculation is computationally unfeasible, even for
`small values of N and T; e.g., for N = 5 (states), T = 100
`(observations), there are on the order of 2 . 100 * 5" =
`
`I
`
`I
`1
`
` I
`2
`
`I
`
`3
`OBSERVATION, t
`(a) Illustration of the sequence of operations
`Fig. 4.
`required for thecomputation of the forward variableol,+,(j).
`(b) Implementation of the computation of a,(;) in terms of
`a lattice of observations t, and states i.
`
`I
`T
`
`bStrictly speaking, we only need the forward part of the forward-
`backward procedure to solve Problem 1. We will introduce the
`backward part of the procedure in this section since it will be used
`to help solve Problem 3.
`
`262
`
`PROCEEDINGS OF THE IEEE, VOL. 77, NO. 2, FEBRUARY 1989
`
`IPR2023-00037
`Apple EX1029 Page 6
`
`
`
`(22)
`
`reached at time t + 1 from the N possible states, S I , 1 I i
`I N, at timet. Since a,(;) is the probabilityof the joint event
`that 0, O2 . . . 0, are observed, and the state at time t is SI,
`the product a , ( i ) a , is then the probabilityof the joint event
`that O1 O2 . . . 0, are observed, and state S, is reached at
`time t + 1 via state S, at time t. Summing this product over
`all the N possible states SI, 1 5 i I N at time t results in the
`probabilityof Slat time t + 1 with all theaccompanying pre-
`vious partial observations. Once this is done and SI is known,
`it is easy to see that a,+l(j) i s obtained by accounting for
`in state j , i.e., by multiplying the summed
`observation
`quantity bythe probabilityb,(O,+l).Thecomputation of(20)
`is performed for all states j , 1 I j c N, for a given t; the
`computation i s then iterated for t = 1,2, . . . , T - 1. Finally,
`step 3) gives the desired calculation of P(0IX) as the sum
`of the terminal forward variables a T ( i ) . This is the case since,
`by definition,
`aJ(i) = P(o1 0 2 . . ' or, qJ = s,Ih)
`and hence P(O(X) i s just the sum of the aJ(i)'s.
`If we examine the computation involved in the calcula-
`tion of a,(j), 1 5 t I T, 1 5 j 5 N, we see that it requires
`on the order of N2T calculations, rather than 2TNr as
`required by the direct calculation. (Again, to be precise, we
`need N(N + 1)(T - 1) + N multiplications and N(N - 1)(T
`- 1) additions.) For N = 5, T = 100, we need about 3000
`computations for the forward method, versus IO7* com-
`putations for t