`
DYNAMIC PROGRAMMING SEARCH STRATEGIES: FROM DIGIT STRINGS TO LARGE VOCABULARY WORD GRAPHS

Hermann Ney 1 and Xavier Aubert 2

1 Lehrstuhl für Informatik VI, RWTH Aachen -
University of Technology, D-52056 Aachen, Germany
`
`2 Philips GmbH Forschungslaboratorien,
`D-52066 Aachen, Germany
`
`ABSTRACT
`
`This chapter gives an overview of the dynamic programming (DP) search strategy
`for large-vocabulary, continuous-speech recognition. Starting with the basic one-pass
`algorithm for word string recognition, we extend the search strategy to vocabularies
`of 20,000 words and more by using a tree organization of the vocabulary. Finally,
`we describe how this predecessor conditioned algorithm can be refined to produce
`high-quality word graphs. This method has been tested successfully on the Wall
`Street Journal task (American English, continuous speech, 20,000 words, speaker
independent).
`
`1
`
`INTRODUCTION
`
Search strategies based on dynamic programming (DP) have been used
successfully in a large number of speech recognition tasks, ranging from
digit string recognition through 1,000-word recognition using heavily constrained
grammars to 20,000-word unconstrained speech recognition.
`
`Dynamic programming had already been used in the early days of automatic
`speech recognition [29, 3, 23]. Despite these successes, until recently, it was
`a highly controversial issue whether high-perplexity natural language speech
`recognition could be handled efficiently by dynamic programming. In 1992,
`
`385
`C.-H. Lee et al. (eds.), Automatic Speech and Speaker Recognition
`© Kluwer Academic Publishers 1996
`
`IPR2023-00034
`Apple EX1053 Page 1
`
`
`
`386
`
`CHAPTER 16
`
when we submitted a paper on this topic to a well-reputed journal, we were
advised by one of the reviewers that "current wisdom seems to indicate that
a stack decoding (A*) strategy may be more efficient for large vocabulary
recognition". Meanwhile, three years later, even for 20,000-word tasks like the
Wall Street Journal task, dynamic programming in one variant or the other has
established itself as the most widely used search technique.
`
`
`
The characteristic features of the presentation given in this chapter are:

■ The search problem in continuous speech recognition is put into the
framework of statistical decision theory.

■ As baseline strategy, we review the time-synchronous one-pass algorithm.

■ The time-synchronous concept is extended towards a tree organization of
the pronunciation lexicon so that the search effort is significantly reduced.

■ As an alternative to the single-best concept, i.e. computing only the single
most likely word sequence, we extend the one-pass strategy to produce
a word graph or word lattice. We show that this word graph concept
can be embedded in the successful one-pass beam search strategy that
uses predecessor conditioned lexical trees. Basically, only the bookkeeping
technique has to be modified to employ the predecessor conditioned one-pass
algorithm for word graph construction.

■ For the word graph construction, we present recent experimental results
for the 20,000-word Wall Street Journal (WSJ) task.
`
`2 SYSTEM ARCHITECTURE
`
`
`
`2.1 Bayes Decision Rule
`
Every approach to automatic speech recognition faces the problem of taking
decisions in the presence of ambiguity and context and of modeling the
interdependence of these decisions at various levels. If it were possible to
recognize phonemes (or words) with a very high reliability, it would not
be necessary to rely heavily on delayed-decision techniques, error-correcting
techniques and statistical methods. In the near future, this problem of reliable
and virtually error-free phoneme or word recognition without using high-level
knowledge is unlikely to be solved for large-vocabulary continuous-speech
`
`
Figure 1   Bayes decision rule for speech recognition (speech input → acoustic analysis producing x1 ... xT → global search maximizing Pr(w1 ... wN) · Pr(x1 ... xT | w1 ... wN) over word sequences w1 ... wN, using the phoneme inventory, pronunciation lexicon and language model → recognized word sequence).
`
`recognition. As a consequence, the recognition system has to deal with a large
`number of hypotheses about phonemes, words and sentences, and ideally has
`to take into account the "high-level constraints" as given by syntax, semantics
`and pragmatics. Given this state of affairs, statistical decision theory tells us
`how to minimize the probability of recognition errors [2]:
`
Maximize the posterior probability Pr(w1 ... wN | x1 ... xT), i.e. determine
the sequence of words w1 ... wn ... wN (of unknown length N) which has
most probably caused the observed sequence of acoustic vectors x1 ... xt ... xT
(over time t = 1, ..., T), which are derived from the speech signal in the
preprocessing step of acoustic analysis.
`
By applying Bayes' theorem on conditional probabilities, the problem can be
written in the following form: Determine the sequence of words w1 ... wn ... wN
which maximizes

$$\Pr(w_1 \ldots w_N) \cdot \Pr(x_1 \ldots x_T \mid w_1 \ldots w_N) \,. \qquad (1)$$

This so-called Bayes decision rule is illustrated in Fig. 1. It requires two types
of probability distributions, which we refer to as stochastic knowledge sources,
along with a search strategy:

• The language model, i.e. Pr(w1 ... wN), is independent of the acoustic
observations and is meant to incorporate restrictions on how to concatenate
words of the vocabulary to form whole sentences.

• The acoustic-phonetic model, i.e. Pr(x1 ... xT | w1 ... wN), is the conditional
probability of observing the acoustic vectors x1 ... xT when the speaker
utters the words w1 ... wN. Like the language model probabilities, these
probabilities are estimated during the training phase of the recognition
system. For a large-vocabulary system, there is typically a set of basic
recognition units that are smaller than whole words. Examples of these so-called
subword units are phonemes, demisyllables or syllables. The word
models are then obtained by concatenating the subword models according
to the phonetic transcription of the words in a pronunciation lexicon or
dictionary.

• The decision on the spoken words must be taken by an optimization
procedure which combines information from several knowledge sources: the
language model, the acoustic-phonetic models of single phonemes, and the
pronunciation lexicon. The optimization procedure is usually referred to
as search in a state space defined by the knowledge sources (see the sketch
below).
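To make the decision rule concrete, here is a minimal sketch (not from the chapter; the toy vocabulary, the frame-count heuristic and all probabilities are invented for illustration) that enumerates candidate word sequences up to a small length and scores each one by the product of a language model probability and an acoustic likelihood, as in Eq. (1). A real recognizer never enumerates sequences explicitly; avoiding exactly this enumeration is the job of the DP search described in the following sections.

```python
import itertools

# Toy vocabulary and hand-made scores; purely illustrative placeholders.
VOCAB = ["yes", "no", "maybe"]

def lm_prob(words):
    """Toy language model Pr(w_1...w_N): uniform over sequences of a given length."""
    return (1.0 / len(VOCAB)) ** len(words)

def acoustic_prob(observations, words):
    """Toy stand-in for Pr(x_1...x_T | w_1...w_N); a real system would use HMMs."""
    expected_frames = 4 * len(words)          # pretend every word lasts 4 frames
    mismatch = abs(expected_frames - len(observations))
    return 0.5 ** mismatch

def bayes_decision(observations, max_len=3):
    """Return the word sequence maximizing Pr(W) * Pr(X|W), cf. Eq. (1)."""
    best_seq, best_score = None, -1.0
    for n in range(1, max_len + 1):
        for seq in itertools.product(VOCAB, repeat=n):
            score = lm_prob(seq) * acoustic_prob(observations, seq)
            if score > best_score:
                best_seq, best_score = list(seq), score
    return best_seq, best_score

if __name__ == "__main__":
    fake_observations = list(range(8))        # 8 acoustic frames
    print(bayes_decision(fake_observations))  # -> a two-word sequence wins
```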
`
`2.2 Stochastic Knowledge Sources
`
`The task of a language model is to capture the restrictions on the combinations
`of words due to the inherent redundancy of the language subset handled by the
`system. This redundancy results from the syntactic, semantic and pragmatic
`constraints of the recognition task and may be modeled by probabilistic or
nonprobabilistic ('yes/no') methods. In large-vocabulary recognition tasks, the
`language model probabilities are typically approximated by bigram or trigram
`models:
`
$$\Pr(w_n \mid w_1 \ldots w_{n-1}) \;\approx\; p(w_n \mid w_{n-1}) \qquad (2)$$

$$\Pr(w_n \mid w_1 \ldots w_{n-1}) \;\approx\; p(w_n \mid w_{n-2}, w_{n-1}) \,. \qquad (3)$$
`
`
Figure 2   Structure of a phoneme model and search space.
`
`In order to reduce the number of parameters, bigram and trigram models can
`be defined at the level of word categories rather than words themselves [22].
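As a small illustration of the bigram case in Eq. (2) (a sketch with an invented two-sentence corpus, unsmoothed relative frequencies and no category modeling), a bigram model can be estimated from counts and then used to score a word sequence as a product of conditional probabilities:

```python
from collections import defaultdict

def train_bigram(corpus):
    """Estimate p(w_n | w_{n-1}) by relative frequencies; '<s>' marks sentence start."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for sentence in corpus:
        words = ["<s>"] + sentence
        for prev, cur in zip(words, words[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    return lambda cur, prev: bigram[(prev, cur)] / unigram[prev] if unigram[prev] else 0.0

def sentence_prob(p, sentence):
    """Pr(w_1...w_N) approximated by a product of bigram probabilities, cf. Eq. (2)."""
    prob, prev = 1.0, "<s>"
    for w in sentence:
        prob *= p(w, prev)
        prev = w
    return prob

if __name__ == "__main__":
    corpus = [["the", "market", "fell"], ["the", "market", "rose"]]
    p = train_bigram(corpus)
    print(sentence_prob(p, ["the", "market", "fell"]))   # -> 0.5
```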
`
As pointed out in the preceding section, the statistical approach requires the
conditional probability Pr(x1 ... xT | w1 ... wN) of observing an acoustic vector
sequence x1 ... xT, given the word sequence w1 ... wN. These probabilities are
obtained by concatenating the corresponding word models, which again are
obtained by concatenating phoneme or other subword unit models according
to the pronunciation lexicon. As in many other systems, these subword units are
modeled by so-called Hidden Markov Models (HMM). Hidden Markov Models
are stochastic finite-state automata (or stochastic regular grammars) which
consist of a Markov chain of acoustic states, modeling the temporal structure of
speech, and a probabilistic function for each of the states, modeling the emission
and observation of acoustic vectors [3, 2, 16]. A typical model along with the
resulting search space for phonemes is shown in Fig. 2; here, the phoneme
X consists of three parts (X1, X2, X3), resulting in a linear arrangement of
six states. Words are obtained by concatenating the HMM phoneme units
according to the nominal phonetic transcription. Usually, for a given state σ in
a word model w, we have a transition probability a(s|σ; w) for going to state s
and an emission probability (density) b(xt|σ; w) for observing vector xt. For the
following, it suffices to consider only the product of the emission and transition
probabilities:

$$q(x_t, s \mid \sigma; w) = a(s \mid \sigma; w) \cdot b(x_t \mid \sigma; w) \,, \qquad (4)$$

which is the conditional probability that, given state σ in word w, the acoustic
vector xt is observed and the state s is reached.
`
`
Figure 3   Illustration of the search problem for a three-word vocabulary (A, B, C).
`
`3 BASELINE SEARCH: ONE-PASS
`ALGORITHM
`
`3.1 Specification of the Search Problem
`
`In this section, we describe the one-pass algorithm which forms the baseline
`for all search strategies described in this chapter. Originally the one-pass
`algorithm had been designed for small vocabulary recognition tasks like digit
`string recognition. Over the last 20 years, however, it has turned out to be
`surprisingly successful in handling vocabularies of 10,000 and more words. The
`characteristic property of all search variants to be presented in this chapter is
`that they are conceptually based on a strictly time-synchronous left-to-right
`search strategy.
`
`The decision on the spoken sentence is taken in the search procedure which
`attempts to determine the word sequence which best explains the input speech
`signal in terms of the given knowledge sources. The search space can be
described as a huge finite-state machine [22, 20], which is defined in terms of the
`acoustic word models and the language model. For a three-word vocabulary,
`the search space is illustrated in Fig. 3. There are two types of transitions,
`namely the acoustic transitions representing the probabilities of the acoustic
`word models (A, B, C in Fig. 3) and the language transitions representing the
`language model probabilities. In Fig. 3, a bigram language model is assumed.
`
`
`For each word bigram (v, w), there is a language transition that is assigned
the probability p(w|v) and that links the end of predecessor v to the beginning
of word w. For recognition, as shown in Fig. 3, we unfold the finite-state
machine along the time axis of the spoken utterance. For simplicity,
Fig. 3 does not cover the details of the acoustic models and shows the language
model transitions at times t2 and t3 only. Both the acoustic transitions (as
`shown in Fig. 2) and the language transitions must be considered every 10-
`ms time frame. All in all, there is a huge number of possible state sequences,
`and all combinations of state and time must be systematically considered for
`recognition. In formulae, we have:
`
$$[w_1^N]_{\mathrm{opt}} = \arg\max_{[w_1^N]} \Big\{ \Pr(w_1^N) \cdot \sum_{[s_1^T]} \Pr(x_1^T, s_1^T \mid w_1^N) \Big\} \,, \qquad (5)$$

where the sum is to be taken over all paths [s_1^T] that are consistent with the word
sequence [w_1^N] under consideration. Approximating the sum by the maximum,
we obtain what is usually called the Viterbi approximation:

$$\arg\max_{[w_1^N]} \Big\{ \Pr(w_1^N) \cdot \max_{[s_1^T]} \Pr(x_1^T, s_1^T \mid w_1^N) \Big\} \,. \qquad (6)$$
`
In other words, the 'most likely word sequence' is approximated by the 'most
likely state sequence' [3, 2, 16]. Note that for the maximum approximation
to work, we only need the assumption that the resulting optimum
word sequences are the same, not that the maximum itself provides a good
approximation to the sum.
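The practical difference between the sum in Eq. (5) and the maximum in Eq. (6) fits in a few lines. The sketch below (a toy two-state model and discrete emissions of my own; the recursions are the standard forward and Viterbi ones rather than the chapter's exact formulation) scores the same observation sequence once by summing over all state paths and once by taking the best path only:

```python
def forward_score(obs, n_states, trans, emit, use_max=False):
    """Return the sum (or the max, if use_max) over all state paths of Pr(obs, path).

    trans[(s, sigma)] = a(s | sigma), emit(x, sigma) = b(x | sigma); start in state 0.
    """
    combine = max if use_max else sum
    scores = {0: 1.0}                                  # path score per current state
    for x in obs:
        new_scores = {}
        for s in range(n_states):
            contributions = [scores[sigma] * trans.get((s, sigma), 0.0) * emit(x, sigma)
                             for sigma in scores]
            if contributions:
                new_scores[s] = combine(contributions)
        scores = new_scores
    return combine(scores.values()) if scores else 0.0

if __name__ == "__main__":
    trans = {(0, 0): 0.6, (1, 0): 0.4, (1, 1): 1.0}     # 2-state left-to-right model
    emit = lambda x, sigma: 0.8 if x == sigma else 0.2  # toy discrete emissions
    obs = [0, 0, 1]
    print("sum over paths:", forward_score(obs, 2, trans, emit))
    print("best path (Viterbi):", forward_score(obs, 2, trans, emit, use_max=True))
```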
`
In the maximum approximation, the search problem can be specified as follows.
We wish to assign each acoustic vector at time t to a (state, word) index pair.
This mapping can be viewed as a time alignment path, which is a sequence of
(state, word) index pairs (stretching notation):

$$(s_1, w_1), \ldots, (s_t, w_t), \ldots, (s_T, w_T) \,. \qquad (7)$$
`
`An example of such a time alignment path in connected word recognition is
`depicted in Fig. 4. For such paths, there are obvious continuity constraints or
`transition rules as shown in Fig. 5. Since the word models are obtained by
`concatenating phoneme models, the transition rules in the word interior (Fig. 5
`top) are those of the used HMMs as shown in Fig. 2. At word boundaries
`(Fig. 5 bottom), we have to allow for transitions that link the terminal state
`S( v) of any predecessor word v to the beginning states s = 1 and s = 2 of any
`
`
Figure 4   Example of a time alignment path.
`
Figure 5   Path recombinations (top: in the word interior; bottom: at word boundaries).
`
`
word w. The dynamic programming search to be presented will allow us to
compute the probabilities

$$Q(t, s; w) := \max_{w_1^n,\, s_1^t\,:\; s_t = (s; w)} \Big\{ \Pr(w_1^n) \cdot \Pr(x_1^t, s_1^t \mid w_1^n) \Big\} \qquad (8)$$
`
`in a left-to-right fashion over time t and to carry out the optimization over
`the unknown word sequence at the same time. Note that, unlike Eq. (5),
`the unknown word sequence and the unknown state sequence are determined
`simultaneously. Within the framework of the Viterbi criterion, the dynamic
`programming algorithm presents a closed-form solution for handling the
`interdependence of nonlinear time alignment, word boundary detection and
`word identification in continuous speech recognition [29, 23, 5, 19, 15].
`
`3.2 Time-Synchronous Beam Search
`
`The key concept of the dynamic programming strategy is based on the following
`two quantities:
`
`Q(t, s; w) := score of the best path up to time t that ends in state s of
`word w.
`
`B(t, s; w) := starting time of the best path up to time t that ends in state
`s of word w.
`
`Although the back pointer B(t, s; w) is not needed for small-vocabulary tasks
`like digit string recognition, it is essential for vocabularies of 10,000 words and
`more to drastically reduce the storage requirements.
`
`As shown in Fig. 5, there are two types of transition rules for the path, namely
`rules in the word interior and at word boundaries. The concept of dynamic
`programming is to use these rules to decompose the path into two parts and
`formulate recurrence relations, which can be solved by filling in tables, namely
`Q(t, s; w) for the problem under consideration. In a more general setting of
`optimization problems, this concept is often referred to as Bellman's principle
`of optimality [4]. In the word interior, we have the recurrence equation:
$$Q(t, s; w) = \max_{\sigma} \big\{ q(x_t, s \mid \sigma; w) \cdot Q(t-1, \sigma; w) \big\} \qquad (9)$$

$$B(t, s; w) = B\big(t-1, \sigma_{\max}(t, s; w); w\big) \,, \qquad (10)$$

where σmax(t, s; w) is the optimum predecessor state for the hypothesis (t, s; w).
`The back pointers B(t, s; w) are propagated simply according to the DP
`
`
decision. To include word boundaries, we introduce a special state s = 0 which
is used to start up a word. When encountering a potential word boundary, we
have to perform the recombination over the predecessor words. For this purpose,
we define:

$$H(w; t) := \max_{v} \big\{ p(w \mid v) \cdot Q(t, S(v); v) \big\} \,, \qquad (11)$$

where p(w|v) are the conditional bigram probabilities as defined in Eq. (2).
The symbol S(v) denotes the terminal state of word v from which new words
can be started. For this purpose, we have to pass on both the score and the
time index:

$$Q(t-1, 0; w) = H(w; t-1) \qquad (12)$$

$$B(t-1, 0; w) = t - 1 \,. \qquad (13)$$

These equations assume that first the normal states s = 1, ..., S(w) are evaluated
for each word w before the start-up states s = 0 are evaluated. The same time
index t is used intentionally, because the language model does not 'absorb'
an acoustic vector. Note that the scores Q(t, s; w) capture both the acoustic
observation-dependent probabilities resulting from the HMM and the language
model probabilities.
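The sketch below (all names and the log-space bookkeeping layout are my own; a linear lexicon, no pruning and no silence model) puts the recursions of Eqs. (9)-(13) together: within-word dynamic programming on Q and B, bigram recombination H over predecessor words at word ends, and re-injection of H into the special start-up state s = 0 for the next frame.

```python
NEG_INF = float("-inf")

def one_pass_single_best(obs, words, log_q, log_bigram, final_state):
    """Time-synchronous one-pass DP in log space, following Eqs. (9)-(13).

    obs                   : acoustic vectors x_1 ... x_T
    words                 : list of word identifiers
    log_q(x, s, sigma, w) : log q(x, s | sigma; w); state 0 is the start-up state
    log_bigram(w, v)      : log p(w | v); v is None at the sentence start
    final_state[w]        : terminal state S(w) of word w
    """
    Q = {w: {0: log_bigram(w, None)} for w in words}   # start-up scores at t = 0
    B = {w: {0: 0} for w in words}                     # back pointers (start times)
    word_ends = []                                     # (t, w, best predecessor v0, start time of v0)

    for t, x in enumerate(obs, start=1):
        newQ = {w: {} for w in words}
        newB = {w: {} for w in words}
        # Eqs. (9)/(10): within-word recursion over the predecessor state sigma.
        for w in words:
            for s in range(1, final_state[w] + 1):
                best, best_sigma = NEG_INF, None
                for sigma, score in Q[w].items():
                    cand = score + log_q(x, s, sigma, w)
                    if cand > best:
                        best, best_sigma = cand, sigma
                if best_sigma is not None and best > NEG_INF:
                    newQ[w][s] = best
                    newB[w][s] = B[w][best_sigma]
        # Eq. (11): recombination over predecessor words at word ends.
        H = {}
        for w in words:
            best, best_v = NEG_INF, None
            for v in words:
                end = newQ[v].get(final_state[v], NEG_INF)
                if end + log_bigram(w, v) > best:
                    best, best_v = end + log_bigram(w, v), v
            if best_v is not None and best > NEG_INF:
                H[w] = best
                word_ends.append((t, w, best_v, newB[best_v][final_state[best_v]]))
        # Eqs. (12)/(13): seed the start-up state for the next frame.
        for w in words:
            if w in H:
                newQ[w][0] = H[w]
                newB[w][0] = t
        Q, B = newQ, newB

    return max((Q[w].get(final_state[w], NEG_INF) for w in words), default=NEG_INF), word_ends
```

With per-word HMM scores plugged into log_q and a trained bigram behind log_bigram, the recognized sequence would be recovered from the word_ends bookkeeping by the traceback described next.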
`
`The operations to be performed are summarized in Table 1. The sequence of
`acoustic vectors extracted from the input speech signal is processed strictly from
left to right. The search procedure works with a time-synchronous breadth-first
strategy, i.e. all hypotheses for word sequences are extended in parallel
for each incoming acoustic vector. To reduce the storage requirements, it is
convenient to introduce back pointers and traceback arrays. The back pointers
`are propagated from state to state during the dynamic programming recursion
`and report the start time for each word end hypothesis. The traceback arrays
`are used to record the decisions about the best predecessor word for each word
`start-up. Using the back pointers and the traceback arrays, the recognized word
`sequence can be recovered efficiently by a few table look-ups into the traceback
`arrays at the end of the utterance.
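A sketch of that recovery step (the array layout is invented for illustration and differs from the per-frame list used in the sketch above): starting from the utterance end, each look-up yields the last word and the time at which its predecessor ended, until time zero is reached.

```python
def recover_word_sequence(traceback, utterance_end):
    """Walk traceback entries from the utterance end back to time 0.

    traceback: dict end_time -> (word, boundary_time) for the best word ending there,
               where boundary_time is the end time of its predecessor.
    """
    words = []
    t = utterance_end
    while t > 0:
        word, boundary = traceback[t]
        words.append(word)
        t = boundary
    return list(reversed(words))

if __name__ == "__main__":
    # Best path: 'hello' spans frames 1..38, 'world' spans 39..80.
    tb = {38: ("hello", 0), 80: ("world", 38)}
    print(recover_word_sequence(tb, utterance_end=80))   # -> ['hello', 'world']
```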
`
`Since all hypotheses cover the same portion of the input, their scores can be
`directly compared. This enables the system to avoid an exhaustive search
and to perform a data-driven search instead, i.e. to focus the search on those
hypotheses which are most likely to result in the best state sequence [18]. Every
10-ms frame, the score of the best hypothesis is determined, and then all hypotheses
`whose scores fall short of this optimal score by more than a fixed factor are
`pruned, i.e. are removed from further consideration. The experimental tests
`
`
Table 1   One-pass algorithm ('single best').

proceed over time t from left to right

  ACOUSTIC LEVEL: process states
    - initialization: Q(t-1, 0; w) = H(w; t-1)
                      B(t-1, 0; w) = t-1
    - time alignment: Q(t, s; w) using DP
    - propagate back pointers B(t, s; w)
    - prune unlikely hypotheses
    - purge bookkeeping lists

  WORD PAIR LEVEL: process word ends
    for each pair (w; t) do
      H(w; t)  = max_v [ p(w|v) · Qv(t, S(w); w) ]
      v0(w; t) = argmax_v [ p(w|v) · Qv(t, S(w); w) ]
    - store best predecessor v0 := v0(w; t)
    - store best boundary τ0 := B(t, S(v0); v0)
`
`indicate that for this type of beam search, depending on the acoustic input
`and the language model constraints, only a few percent of the potential state
`hypotheses have to be processed for every 10 ms of the input speech while
`at the same time the number of recognition errors is virtually not increased.
`To fully exploit the computational advantages of this beam search strategy, a
`list organization of the active search space [20] has been developed. At several
`levels such as words and states, lists are introduced to dynamically construct the
`search space such that the computational cost of the search is now proportional
`only to the number of active hypotheses and is independent of the overall size
`of the potential search space.
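A sketch of the pruning step (threshold handling and names of my own): because all active hypotheses cover the same input prefix, their scores are directly comparable, and a fixed factor on probabilities becomes a fixed offset on log scores.

```python
def prune(hypotheses, log_beam_width):
    """Keep only hypotheses whose log score is within log_beam_width of the best one.

    hypotheses: dict mapping a (state, word) hypothesis to its log score.
    """
    if not hypotheses:
        return hypotheses
    best = max(hypotheses.values())
    return {hyp: score for hyp, score in hypotheses.items()
            if score >= best - log_beam_width}

if __name__ == "__main__":
    active = {("s3", "hello"): -10.0, ("s1", "yellow"): -12.5, ("s2", "fellow"): -30.0}
    print(prune(active, log_beam_width=5.0))   # the 'fellow' hypothesis is pruned
```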
`
`To simplify the presentation, we have not explicitly included intraphrase silence
`models. It is straightforward to extend the algorithm to allow for optional
silence between words. The algorithm shown in Table 1 is still the most
successful algorithm for both small-vocabulary and large-vocabulary speech
recognition [15, 6, 14, 1, 7, 9, 11, 13, 30].
`
`
`4 LEXICAL TREE AND ONE-PASS
`ALGORITHM
`
`4.1 Lexical Pronunciation Tree
`
`So far, we have considered a straightforward approach to organizing the search
`space, which was based on the use of a linear lexicon, i.e. each word is
`represented as a linear sequence of phonemes, independently of other words.
`In a 20,000-word vocabulary, there are typically many words that share the
`same beginning phonemes. Therefore a tree representation of the lexicon
`will be considered as an alternative to a linear lexicon. Such a lexical tree
`representation is an essential ingredient for handling vocabularies of 20,000
`words and more (despite the apparent complications in the definition of the
`search space).
`
`When analyzing the distribution of the state hypotheses over the state positions
`within a word, one typically finds in the experimental tests that the lion's
`share of the hypotheses covers the first and second phonemes in a word. In
`[10], for a 12,306-word dictation task, 79% and 16% of the state hypotheses
`were in the first and second phoneme, respectively. On the other hand, a large
`number of words in the 12,306-word vocabulary share the same initial sequence
`of phonemes. Therefore it seems promising to arrange the pronunciation lexicon
`as a tree. Each arc of the tree stands for a phoneme such that an arc sequence
`from the tree root to a tree leaf represents a legal phoneme sequence and thus
`a word of the vocabulary. For the 12,306-word vocabulary, Table 2 shows the
`distribution of arcs over the first 6 generation levels of the tree. The lexical tree
`consists of 43,000 arcs, which is a compression factor of 2.5 over the 108,000
`phoneme copies of the linear lexicon. Even in the fourth level of the tree, the
`number (3,116) of phoneme arcs is only one quarter of the number (12,306) of
`words in the lexicon.
`
Table 2   Distribution of the number of phoneme arcs for a 12,306-word lexicon
using a tree organization.

  Tree level               1    2     3     4     5     6
  Number of phoneme arcs  28  331  1511  3116  4380  4950
`
`
`4.2 Predecessor Conditioned One-Pass
`Algorithm
`
`For a unigram language model, we need only one tree, which is started up
`after incorporating the unigram language probability. However, for a bigram
`language model (and other more complicated language models), there is an
`added complication due to the fact that the identity of the word is known
`only at the end of a tree and only then can the language model probability be
`applied. Therefore the language model probabilities can only be incorporated
`after reaching the end, i.e. the final state, of the second word of the bigram.
`In other words, the use of the language model may be viewed as delayed by
`one word as compared to the bigram model without using a tree lexicon. To
`incorporate the bigram language model into the search, we will keep a separate
`copy of the lexical tree for each predecessor word. Fig. 6 illustrates this concept
`for a three-word vocabulary (A, B, C). For each predecessor word v, there is
`a separate lexical tree so that during the search process we always know the
`predecessor word v when a word end hypothesis w is hypothesized. In the
`set-up of Fig. 6, we apply the bigram probability p(wlv) when the final state
`of word w with predecessor v has been reached, and use the resulting overall
`score to start up the corresponding lexical tree, i.e. the tree that has word w
`as predecessor.
`
`The key quantities for dynamic programming are again the partial scores
`Q(t, s; w) and the back pointers B(t, s; w). However, the quantities must
`be made dependent on the predecessor word v to account for the delayed
`application of the bigram language model:
`
Qv(t, s; w) := score of the best path up to time t that ends in state s of
`word w for predecessor v.
`
`Bv(t, s; w) := starting time of the best path up to time t that ends in state
`s of word w for predecessor v.
`
Both quantities are evaluated using the dynamic programming recursion for
Qv(t, s; w):

$$Q_v(t, s; w) = \max_{\sigma} \big\{ q(x_t, s \mid \sigma; w) \cdot Q_v(t-1, \sigma; w) \big\} \qquad (14)$$

$$B_v(t, s; w) = B_v\big(t-1, \sigma_{\max}(t, s; w, v); w\big) \,, \qquad (15)$$

where σmax(t, s; w, v) is the optimum predecessor state for the hypothesis
(t, s; w) and predecessor word v. The back pointers Bv(t, s; w) are propagated
`
`
Figure 6   Word recombination for a lexical tree.
`
`according to the DP decision. Unlike the predecessor word v, the index w for
`the word under consideration is not really needed, because from the phoneme
`arcs of each tree, the corresponding words can be uniquely recovered. However,
`for notational consistency, we have retained both indices v and w.
Using a suitable initialization for σ = 0, this equation includes the optimization
over the unknown word boundaries. At word boundaries, we have to find the
best predecessor word for each word. For this purpose, we define:

$$H(w; t) = \max_{v} \big\{ p(w \mid v) \cdot Q_v(t, S(w); w) \big\} \,. \qquad (16)$$
`
`
`To start up words, we have to pass on the score and the time index:
`
$$Q_v(t-1, 0; w) = H(v; t-1) \qquad (17)$$

$$B_v(t-1, 0; w) = t - 1 \,. \qquad (18)$$
`
`Note that here the right-hand side does not depend on word index w due to
`the notational scheme explained above.
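A small sketch of the word-end bookkeeping implied by Eqs. (16)-(18) (data layout, names and toy scores are my own; no acoustics): hypotheses are keyed by their predecessor word, i.e. by the tree copy in which they live, the bigram probability is applied only when the final state S(w) is reached, and the winning score is what seeds the tree copy that has the just-ended word as predecessor.

```python
def recombine_word_ends(word_end_scores, log_bigram):
    """Apply Eq. (16): H(w; t) = max over predecessors v of p(w|v) * Q_v(t, S(w); w).

    word_end_scores: dict (v, w) -> log Q_v(t, S(w); w) for hypotheses that reached
                     the final state S(w) of word w inside the tree copy of predecessor v.
    Returns: dict w -> (log H(w; t), best predecessor v); per Eqs. (17)/(18) the score
             seeds the start-up state of the tree copy that has w as predecessor.
    """
    H = {}
    for (v, w), score in word_end_scores.items():
        cand = score + log_bigram(w, v)
        if w not in H or cand > H[w][0]:
            H[w] = (cand, v)
    return H

if __name__ == "__main__":
    # Two tree copies (predecessors 'the' and 'a') both reach the end of word 'market'.
    ends = {("the", "market"): -42.0, ("a", "market"): -41.0}
    bigram = lambda w, v: {"the": -1.0, "a": -3.0}[v]     # toy log p(w|v)
    print(recombine_word_ends(ends, bigram))
    # -> {'market': (-43.0, 'the')}: the 'the' copy wins once the bigram is applied.
```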
`
`4.3 Experimental Results
`
`Systematic experimental tests have been carried out for the search methods
`presented so far [22, 20]; here, we only summarize the most important results:
`
`•
`
`•
`
`•
`
`•
`
`Throughout all recognition tests, ranging from 100 to 10,000 vocabulary
`words, the experimental tests indicate that by using the beam search
`strategy, only a small fraction, typically a few percent or less, of the
`potential state hypotheses have to be processed while at the same time the
`number of recognition errors is virtually not increased over a full search.
`
`For a 12,000-word dictation task, a series of experiments was run to study
`the effect of the tree lexicon on the search effort [20]. By using the tree
`lexicon, the search effort was reduced by a factor of 6 over the linear
`lexicon. A further reduction by a factor of 3 was obtained by applying a
`look-ahead technique at the phoneme level in the lexical tree [10].
`
`The dramatic reduction for a tree lexicon must be attributed to the
`subtle interaction between the focusing property of beam search and the
`organization of the search space. The ambiguity in the decisions and
`therefore the number of hypotheses is particularly high at the beginning of
`words; by a tree lexicon, such hypotheses are compressed in a most efficient
`way. In other words, the tree lexicon leads to a search space that is most
`compact where the highest uncertainty is expected, i.e. at the beginning
`portions of words.
`
`In general, when more detailed models with a potentially larger search
`space are used, the effort of the beam search strategy is not much affected.
`For example, for a linear lexicon, when going from a bigram to a trigram
`model, the search effort remained the same for the particular experiments
`[22]. For a tree lexicon, when switching from a unigram tree to the bigram
`tree copies, the search effort was increased by a factor of 2 only [20].
`In other words, the early use of more detailed models results in a more
`focussed beam search despite the increase in the potential search space.
`
`
`• As to the pruning strategy, we have found that it is advantageous to use
`two pruning thresholds: an acoustic pruning threshold for the intraword
`recombinations and a language pruning threshold for the interword
`recombinations [28]. This second threshold is related to an anticipated
`language model probability, since a state hypothesis at any point in
`the lexical tree implies a certain lower bound for the language model
`probability. Ultimately, the choice of the pruning thresholds is based on
`trial and error, although the choice is not critical [28].
`
`•
`
`There is a (close to) real time implementation of the search strategy
`the hardware is based on an Intel 80486
`described in this chapter;
`microprocessor and 16 Mbyte RAM for a 25,000-word continuous speech
`recognition system [27].
`
`5 WORD GRAPHS AND ONE-PASS
`ALGORITHM
`
`5.1
`
`Specification of Word Graphs
`
`The main idea of a word graph is to come up with word alternatives in regions of
`the speech signal, where the ambiguity in the acoustic recognition is high. The
`advantage is that the acoustic recognition is decoupled from the application
`of the language model and that a complex language model, in particular a
`long-span language model, can be applied in a subsequent postprocessing step.
`The number of word alternatives should be adapted to the level of ambiguity
`in the acoustic recognition. The difficulty in efficiently constructing a good
`word graph is the following: the start time of a word depends in general on the
`predecessor words. In a first approximation, we limit this dependence to the
`immediate predecessor word and obtain the so-called word pair approximation:
`
`Given a word pair and its ending time, the word boundary between the
`two words is independent of the further predecessor words.
`
`This word pair approximation was originally introduced in [24] to efficiently
`calculate multiple or n-best sentences. The word graph can be expected to be
`more efficient than the n-best approach. In the word graph approach, word
`hypotheses need to be generated only locally whereas in n-best methods each
`local alternative requires a whole sentence to be added to the n-best list. To
`
`
give an (over)simplified example, suppose we have 10 spoken words and 2 word
hypotheses for each word position. The n-best method then requires 2^10 = 1024
sentence hypotheses, whereas the word graph approach produces a graph of
only 2 · 10 = 20 word arcs.
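The counting argument can be made concrete with a toy data structure (field names and the frame layout are invented; real word graph arcs would also carry the boundary information from the word pair approximation): one arc per local word hypothesis, with alternative sentences represented implicitly as paths.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class WordArc:
    word: str
    start_time: int     # frame index where the hypothesis starts
    end_time: int       # frame index where it ends
    score: float        # acoustic (and possibly partial LM) log score

def toy_graph(num_positions=10, frames_per_word=4):
    """Two word alternatives per position -> 2 * num_positions arcs."""
    arcs = []
    for i in range(num_positions):
        t0, t1 = i * frames_per_word, (i + 1) * frames_per_word
        arcs.append(WordArc(f"word{i}a", t0, t1, -10.0))
        arcs.append(WordArc(f"word{i}b", t0, t1, -11.0))
    return arcs

if __name__ == "__main__":
    arcs = toy_graph()
    sentences = product(*[("a", "b")] * 10)            # what an n-best list would store
    print(len(arcs), "word arcs versus", len(list(sentences)), "explicit sentences")
    # -> 20 word arcs versus 1024 explicit sentences
```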
`
`There have been a number of attempts at using two explicit levels in search: the
`first level producing a short list of either single word or sentence hypotheses,
`and a second level where the final decision is taken using a complex language
model [23, 24, 25, 8, 21, 17].
`
`As in [21], we prefer the term 'word graph' to 'word lattice' to indicate that
`when combining word hypotheses, we do not allow overlaps or gaps along the
time axis. The difference from the word graph algorithm in [21] is that we use
`a formal specification of word graphs and make explicit use of the word pair
`approximation. As a result, we can incorporate the word graph construction
`into our standard one-pass algorithm using predecessor conditioned lexical
`trees, whereas in [21], the search was based on time conditioned lexical trees.
`
`In this section, we will formally specify the word graph construction problem
`and pave the way for the word graph algorithm. We start with the fundamental
`problem of word graph construction:
`
• Given a word w and ending time t, how can we find a limited number of
`'most likely' predecessor words?
`
`•
`
`This task is difficult since the start time of word w may very well
`depend on the predecessor word under consideration, which results in an
`interdependence of start times and predecessor words.
`
In view of the highly successful one-pass beam search strategy, what we want
to achieve conceptually is to keep track of word sequence hypotheses whose
`scores are very close to the locally optimal hypothesis, but that do not survive
`due to the recombination process.
`
`The basic idea is to represent all these word sequences by a word graph, in
`which each arc represents a word hypothesis. Each word sequence contained in
`the word graph should be close (in terms of scoring) to the single best sentence
`produced by the one-pass algorithm. To describe the word graph construction
`algorithm, we introduce the following definitions:
`
`
`•
`
`h( w; r, t) := Pr{ x~+llw) = probability that word W produces the acoustic
`vectors XT+I ... Xt.
`• G(wl;t):= Pr(w1) . Pr(xilwl ) = (joint) probability of generating the
`acoustic vectors XI ... Xt and a word sequence wI with ending time t.
`
`Using these definitions, we can isolate the probability contributions of a
`particular word hypothesis with respect to both the language model and the
`acoustic model. This decomposition can be visualized as follows:
`
$$\underbrace{x_1, \ldots, x_{\tau}}_{G(w_1^{n-1};\, \tau)} \;\; \underbrace{x_{\tau+1}, \ldots, x_t}_{h(w_n;\, \tau,\, t)} \;\; x_{t+1}, \ldots, x_T \qquad (19)$$
`
`Exploiting an m-gram language mo