16

DYNAMIC PROGRAMMING SEARCH STRATEGIES: FROM DIGIT STRINGS TO LARGE VOCABULARY WORD GRAPHS

Hermann Ney 1 and Xavier Aubert 2

1 Lehrstuhl für Informatik VI, RWTH Aachen - University of Technology, D-52056 Aachen, Germany
2 Philips GmbH Forschungslaboratorien, D-52066 Aachen, Germany
ABSTRACT

This chapter gives an overview of the dynamic programming (DP) search strategy for large-vocabulary, continuous-speech recognition. Starting with the basic one-pass algorithm for word string recognition, we extend the search strategy to vocabularies of 20,000 words and more by using a tree organization of the vocabulary. Finally, we describe how this predecessor conditioned algorithm can be refined to produce high-quality word graphs. This method has been tested successfully on the Wall Street Journal task (American English, continuous speech, 20,000 words, speaker independent).
1 INTRODUCTION
Search strategies based on dynamic programming (DP) have been used successfully in a large number of speech recognition tasks, ranging from digit string recognition through 1,000-word recognition using heavily constrained grammars to 20,000-word unconstrained speech recognition.
Dynamic programming had already been used in the early days of automatic speech recognition [29, 3, 23]. Despite these successes, until recently it was a highly controversial issue whether high-perplexity natural language speech recognition could be handled efficiently by dynamic programming.
In 1992, when we submitted a paper on this topic to a well-reputed journal, we were advised by one of the reviewers that "current wisdom seems to indicate that a stack decoding (A*) strategy may be more efficient for large vocabulary recognition". Meanwhile, three years later, even for 20,000-word tasks like the Wall Street Journal task, dynamic programming in one variant or the other has established itself as the most widely used search technique.
The characteristic features of the presentation given in this chapter are:

■ The search problem in continuous speech recognition is put into the framework of statistical decision theory.

■ As baseline strategy, we review the time-synchronous one-pass algorithm.

■ The time-synchronous concept is extended towards a tree organization of the pronunciation lexicon so that the search effort is significantly reduced.

■ As an alternative to the single-best concept, i.e. computing only the single most likely word sentence, we extend the one-pass strategy to produce a word graph or word lattice. We show that this word graph concept can be embedded in the successful one-pass beam search strategy that uses predecessor conditioned lexical trees. Basically, only the bookkeeping technique has to be modified to employ the predecessor conditioned one-pass algorithm for word graph construction.

■ For the word graph construction, we present recent experimental results for the 20,000-word Wall Street Journal (WSJ) task.
2 SYSTEM ARCHITECTURE

2.1 Bayes Decision Rule
Every approach to automatic speech recognition faces the problem of taking decisions in the presence of ambiguity and context and of modeling the interdependence of these decisions at various levels. If it were possible to recognize phonemes (or words) with very high reliability, it would not be necessary to rely heavily on delayed decision techniques, error correcting techniques and statistical methods. In the near future, this problem of reliable and virtually error-free phoneme or word recognition without using high-level knowledge is unlikely to be solved for large-vocabulary continuous-speech recognition.
[Figure 1: Bayes decision rule for speech recognition. The speech input passes through the acoustic analysis, which produces the vectors x_1 ... x_T; the global search maximizes Pr(w_1 ... w_N) · Pr(x_1 ... x_T | w_1 ... w_N) over the word sequences w_1 ... w_N, drawing on the phoneme inventory, the pronunciation lexicon (for Pr(x_1 ... x_T | w_1 ... w_N)) and the language model Pr(w_1 ... w_N); its output is the recognized word sequence.]
As a consequence, the recognition system has to deal with a large number of hypotheses about phonemes, words and sentences, and ideally has to take into account the "high-level constraints" as given by syntax, semantics and pragmatics. Given this state of affairs, statistical decision theory tells us how to minimize the probability of recognition errors [2]:

Maximize the posterior probability Pr(w_1 ... w_N | x_1 ... x_T), i.e. determine the sequence of words w_1 ... w_n ... w_N (of unknown length N) which has most probably caused the observed sequence of acoustic vectors x_1 ... x_t ... x_T (over time t = 1 ... T), which are derived from the speech signal in the preprocessing step of acoustic analysis.

By applying Bayes' theorem on conditional probabilities, the problem can be written in the following form:
Determine the sequence of words w_1 ... w_n ... w_N which maximizes

    Pr(w_1 ... w_N) · Pr(x_1 ... x_T | w_1 ... w_N) .                      (1)

This so-called Bayes decision rule is illustrated in Fig. 1. It requires two types of probability distributions, which we refer to as stochastic knowledge sources, along with a search strategy:
• The language model, i.e. Pr(w_1 ... w_N), is independent of the acoustic observations and is meant to incorporate restrictions on how to concatenate words of the vocabulary to form whole sentences.

• The acoustic-phonetic model, i.e. Pr(x_1 ... x_T | w_1 ... w_N), is the conditional probability of observing the acoustic vectors x_1 ... x_T when the speaker utters the words w_1 ... w_N. Like the language model probabilities, these probabilities are estimated during the training phase of the recognition system. For a large-vocabulary system, there is typically a set of basic recognition units that are smaller than whole words. Examples of these so-called subword units are phonemes, demisyllables or syllables. The word models are then obtained by concatenating the subword models according to the phonetic transcription of the words in a pronunciation lexicon or dictionary.

• The decision on the spoken words must be taken by an optimization procedure which combines information of several knowledge sources: the language model, the acoustic-phonetic models of single phonemes, and the pronunciation lexicon. The optimization procedure is usually referred to as search in a state space defined by the knowledge sources (a toy illustration follows below).

2.2 Stochastic Knowledge Sources

The task of a language model is to capture the restrictions on the combinations of words due to the inherent redundancy of the language subset handled by the system. This redundancy results from the syntactic, semantic and pragmatic constraints of the recognition task and may be modeled by probabilistic or nonprobabilistic ('yes/no') methods. In large vocabulary recognition tasks, the language model probabilities are typically approximated by bigram or trigram models:
    Pr(w_n | w_1 ... w_{n-1}) ≈ p(w_n | w_{n-1})                           (2)
    Pr(w_n | w_1 ... w_{n-1}) ≈ p(w_n | w_{n-2}, w_{n-1}) .                (3)
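As a rough illustration of Eq. (2), the following sketch estimates bigram probabilities as smoothed relative frequencies over a toy corpus. The corpus and the add-one smoothing are assumptions made for this example only; they are not the estimation procedure of the systems discussed in this chapter.

```python
# Hedged sketch of a bigram language model p(w_n | w_{n-1}), Eq. (2):
# relative-frequency estimates with add-one smoothing over an invented toy corpus.
from collections import Counter

corpus = [["the", "deal", "fell", "through"],
          ["the", "deal", "was", "closed"]]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence          # <s> marks the sentence start
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

vocab_size = len({w for s in corpus for w in s}) + 1   # +1 for <s>

def p_bigram(w, v):
    """Approximate Pr(w_n = w | w_{n-1} = v) with add-one smoothing."""
    return (bigrams[(v, w)] + 1) / (unigrams[v] + vocab_size)

print(p_bigram("deal", "the"))      # frequent bigram -> relatively high probability
print(p_bigram("closed", "the"))    # unseen bigram   -> small smoothed probability
```

A trigram model, Eq. (3), is obtained in the same way by conditioning on the two preceding words.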
[Figure 2: Structure of a phoneme model and the resulting search space, plotted over the time index.]
In order to reduce the number of parameters, bigram and trigram models can be defined at the level of word categories rather than words themselves [22].
As pointed out in the preceding section, the statistical approach requires the conditional probability Pr(x_1 ... x_T | w_1 ... w_N) of observing an acoustic vector sequence x_1 ... x_T, given the word sequence w_1 ... w_N. These probabilities are obtained by concatenating the corresponding word models, which again are obtained by concatenating phoneme or other subword unit models according to the pronunciation lexicon. As in many other systems, these subword units are modeled by so-called Hidden Markov Models (HMM). Hidden Markov Models are stochastic finite-state automata (or stochastic regular grammars) which consist of a Markov chain of acoustic states, modeling the temporal structure of speech, and a probabilistic function for each of the states, modeling the emission and observation of acoustic vectors [3, 2, 16]. A typical model along with the resulting search space for phonemes is shown in Fig. 2; here, the phoneme X consists of three parts (X1, X2, X3), resulting in a linear arrangement of six states. Words are obtained by concatenating the HMM phoneme units according to the nominal phonetic transcription. Usually, for a given state σ in a word model w, we have a transition probability a(s | σ; w) for going to state s and an emission probability (density) b(x_t | σ; w) for observing vector x_t. For the following, it suffices to consider only the product of the emission and transition probabilities:

    q(x_t, s | σ; w) = a(s | σ; w) · b(x_t | σ; w) ,                       (4)

which is the conditional probability that, given state σ in word w, the acoustic vector x_t is observed and the state s is reached.
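A minimal sketch of Eq. (4), assuming a one-dimensional Gaussian emission density and a toy transition table; both are invented placeholders, not trained model parameters.

```python
# Sketch of Eq. (4): q(x_t, s | sigma; w) = a(s | sigma; w) * b(x_t | sigma; w),
# with a toy transition table and a 1-D Gaussian emission density (invented values).
import math

def gaussian_density(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

a = {0: {0: 0.6, 1: 0.4}, 1: {1: 0.7, 2: 0.3}}            # a[sigma][s]
emission = {0: (0.0, 1.0), 1: (1.5, 0.5), 2: (3.0, 1.0)}  # (mean, variance) per state

def q(x_t, s, sigma):
    mean, var = emission[sigma]       # emission conditioned on sigma, as in Eq. (4)
    return a[sigma].get(s, 0.0) * gaussian_density(x_t, mean, var)

print(q(0.3, 1, 0))   # score for emitting x_t = 0.3 from state 0 while moving to state 1
```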
[Figure 3: Illustration of the search problem for a three-word vocabulary (A, B, C). The acoustic word models A, B, C are unfolded along the time axis over successive word positions, with language model transitions linking word ends to word beginnings.]
3 BASELINE SEARCH: ONE-PASS ALGORITHM

3.1 Specification of the Search Problem
In this section, we describe the one-pass algorithm which forms the baseline for all search strategies described in this chapter. Originally, the one-pass algorithm had been designed for small vocabulary recognition tasks like digit string recognition. Over the last 20 years, however, it has turned out to be surprisingly successful in handling vocabularies of 10,000 and more words. The characteristic property of all search variants to be presented in this chapter is that they are conceptually based on a strictly time-synchronous left-to-right search strategy.
The decision on the spoken sentence is taken in the search procedure, which attempts to determine the word sequence which best explains the input speech signal in terms of the given knowledge sources. The search space can be described as a huge finite-state machine [22, 20], which is defined in terms of the acoustic word models and the language model. For a three-word vocabulary, the search space is illustrated in Fig. 3. There are two types of transitions, namely the acoustic transitions representing the probabilities of the acoustic word models (A, B, C in Fig. 3) and the language transitions representing the language model probabilities. In Fig. 3, a bigram language model is assumed.
For each word bigram (v, w), there is a language transition that is assigned the probability p(w|v) and that links the end of predecessor v to the beginning of word w. For recognition, as shown in Fig. 3, we unfold the finite-state machine along the time axis of the spoken utterance. For simplification reasons, Fig. 3 does not cover the details of the acoustic models and shows the language model transitions at times t2 and t3 only. Both the acoustic transitions (as shown in Fig. 2) and the language transitions must be considered every 10-ms time frame. All in all, there is a huge number of possible state sequences, and all combinations of state and time must be systematically considered for recognition. In formulae, we have:

    [w_1^N]_opt = argmax_{[w_1^N]} { Pr(w_1^N) · Σ_{[s_1^T]} Pr(x_1^T, s_1^T | w_1^N) } ,          (5)

where the sum is to be taken over all paths [s_1^T] that are consistent with the word sequence [w_1^N] under consideration. Approximating the sum by the maximum, we obtain what is usually called the Viterbi approximation:

    argmax_{[w_1^N]} { Pr(w_1^N) · max_{[s_1^T]} Pr(x_1^T, s_1^T | w_1^N) } .                      (6)

In other words, the 'most likely word sequence' is approximated by the 'most likely state sequence' [3, 2, 16]. Note that for the maximum approximation to work, we need only the assumption that the resulting optimum word sequences are the same, not that the maximum provides a good approximation to the sum.
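The following toy example, with made-up HMM parameters, contrasts the sum over all state paths (as in Eq. (5)) with the single best path (as in Eq. (6)); it only illustrates the distinction, not the recognizer itself.

```python
# Toy comparison of the path sum and the best path for a tiny 2-state HMM with
# invented parameters; in the recognizer only the argmax over word sequences matters.
from itertools import product

states = [0, 1]
start = {0: 0.5, 1: 0.5}
trans = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
emit  = {0: {"a": 0.9, "b": 0.1}, 1: {"a": 0.2, "b": 0.8}}

def path_prob(path, obs):
    p = start[path[0]] * emit[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= trans[prev][cur] * emit[cur][o]
    return p

obs = ["a", "b", "b"]
paths = list(product(states, repeat=len(obs)))
print("sum over all paths:", sum(path_prob(p, obs) for p in paths))
print("best single path  :", max(paths, key=lambda p: path_prob(p, obs)))
```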

In the maximum approximation, the search problem can be specified as follows. We wish to assign each acoustic vector at time t to a (state, word) index pair. This mapping can be viewed as a time alignment path, which is a sequence of (state, word) index pairs (stretching notation):

    t → (s_t, w_t) ,   t = 1, ..., T .                                     (7)

An example of such a time alignment path in connected word recognition is depicted in Fig. 4. For such paths, there are obvious continuity constraints or transition rules as shown in Fig. 5. Since the word models are obtained by concatenating phoneme models, the transition rules in the word interior (Fig. 5 top) are those of the used HMMs as shown in Fig. 2. At word boundaries (Fig. 5 bottom), we have to allow for transitions that link the terminal state S(v) of any predecessor word v to the beginning states s = 1 and s = 2 of any word w.

[Figure 4: Example of a time alignment path through the states 1, ..., S(w) of the words w = 1, ..., 5 over time t = 1, ..., T.]
`
`• iii
`~
`~
`~
`~
`
`sew)
`
`Sew)
`
`0 0 0 0
`
`~~ 0
`
`0
`o
`0 0
`0 0 0 0
`
`TIME
`
`T
`
`o o-;p 0
`O~O
`S(.)E======~o~~gO~o~=======9
`0 0 0 0
`
`T
`
`TIME
`
`Figure:; Path recombinations.
The dynamic programming search to be presented will allow us to compute the probabilities

    max_{[w_1^n]} max_{[s_1^t]} { Pr(w_1^n) · Pr(x_1^t, s_1^t | w_1^n) }                           (8)
in a left-to-right fashion over time t and to carry out the optimization over the unknown word sequence at the same time. Note that, unlike Eq. (5), the unknown word sequence and the unknown state sequence are determined simultaneously. Within the framework of the Viterbi criterion, the dynamic programming algorithm presents a closed-form solution for handling the interdependence of nonlinear time alignment, word boundary detection and word identification in continuous speech recognition [29, 23, 5, 19, 15].
3.2 Time-Synchronous Beam Search

The key concept of the dynamic programming strategy is based on the following two quantities:

    Q(t, s; w) := score of the best path up to time t that ends in state s of word w.

    B(t, s; w) := starting time of the best path up to time t that ends in state s of word w.

Although the back pointer B(t, s; w) is not needed for small-vocabulary tasks like digit string recognition, it is essential for vocabularies of 10,000 words and more to drastically reduce the storage requirements.
As shown in Fig. 5, there are two types of transition rules for the path, namely rules in the word interior and at word boundaries. The concept of dynamic programming is to use these rules to decompose the path into two parts and formulate recurrence relations, which can be solved by filling in tables, namely Q(t, s; w) for the problem under consideration. In a more general setting of optimization problems, this concept is often referred to as Bellman's principle of optimality [4]. In the word interior, we have the recurrence equation:

    Q(t, s; w) = max_σ { q(x_t, s | σ; w) · Q(t-1, σ; w) }                 (9)
    B(t, s; w) = B(t-1, σ_max(t, s; w); w) ,                               (10)
where σ_max(t, s; w) is the optimum predecessor state for the hypothesis (t, s; w). The back pointers B(t, s; w) are propagated simply according to the DP decision.
To include word boundaries, we introduce a special state s = 0 which is used to start up a word. When encountering a potential word boundary, we have to perform the recombination over the predecessor words. To this purpose, we define:

    H(w; t) := max_v { p(w|v) · Q(t, S(v); v) } ,                          (11)

where p(w|v) are the conditional bigram probabilities as defined in Eq. (2). The symbol S(v) denotes the terminal state of word v from which new words can be started. To this purpose, we have to pass on both the score and the time index:

    Q(t-1, 0; w) = H(w; t-1)                                               (12)
    B(t-1, 0; w) = t-1 .                                                   (13)

This equation assumes that first the normal states s = 1, ..., S(w) are evaluated for each word w before the start-up states s = 0 are evaluated. The same time index t is used intentionally, because the language model does not 'absorb' an acoustic vector. Note that the scores Q(t, s; w) capture both the acoustic observation dependent probabilities resulting from the HMM and the language model probabilities.
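To make the recursions concrete, here is a minimal sketch of the time-synchronous evaluation for a toy two-word task: Eqs. (9)-(10) in the word interior and Eqs. (11)-(13) at the word boundaries, with a simple traceback list for the word-end decisions. The vocabulary, state topology, score function q and bigram table are invented for illustration; a real system works with log scores, trained HMM densities, optional silence models and pruning.

```python
# Hedged sketch of the one-pass DP recursion with back pointers and bigram
# recombination at word ends; all models below are toy stand-ins.
import math

VOCAB = ["one", "two"]
S = {"one": 3, "two": 3}                    # terminal state index S(w) per word
BIGRAM = {("<s>", "one"): 0.6, ("<s>", "two"): 0.4,
          ("one", "one"): 0.3, ("one", "two"): 0.7,
          ("two", "one"): 0.7, ("two", "two"): 0.3}

def q(x_t, s, sigma, w):
    # toy stand-in for Eq. (4): loop/forward transition times a fake emission score
    trans = 0.5 if s in (sigma, sigma + 1) else 0.0
    emit = math.exp(-abs(x_t - (s if w == "one" else s + 3)))
    return trans * emit

def one_pass(observations):
    # Q[w][s] and B[w][s] for the current frame; state 0 is the word start-up state
    Q = {w: {s: 0.0 for s in range(S[w] + 1)} for w in VOCAB}
    B = {w: {s: -1 for s in range(S[w] + 1)} for w in VOCAB}
    for w in VOCAB:                         # sentence start at t = 0
        Q[w][0], B[w][0] = BIGRAM[("<s>", w)], 0
    traceback = []                          # (t, word w, best predecessor v0, start time of v0)
    for t, x_t in enumerate(observations, start=1):
        newQ = {w: {} for w in VOCAB}
        newB = {w: {} for w in VOCAB}
        for w in VOCAB:                     # Eqs. (9)-(10): word-interior recursion
            for s in range(1, S[w] + 1):
                sig = max(range(S[w] + 1), key=lambda u: q(x_t, s, u, w) * Q[w][u])
                newQ[w][s] = q(x_t, s, sig, w) * Q[w][sig]
                newB[w][s] = B[w][sig]
        for w in VOCAB:                     # Eq. (11): recombination over predecessors
            v0 = max(VOCAB, key=lambda v: BIGRAM[(v, w)] * newQ[v][S[v]])
            traceback.append((t, w, v0, newB[v0][S[v0]]))
            newQ[w][0] = BIGRAM[(v0, w)] * newQ[v0][S[v0]]   # Eq. (12): pass on H(w; t)
            newB[w][0] = t                                   # Eq. (13): new start time
        Q, B = newQ, newB
    return Q, traceback

Q, tb = one_pass([0.8, 2.1, 3.0, 3.9, 5.1, 6.2])
best = max(VOCAB, key=lambda w: Q[w][S[w]])
print("best ending word:", best, "score:", Q[best][S[best]])
```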
The operations to be performed are summarized in Table 1. The sequence of acoustic vectors extracted from the input speech signal is processed strictly from left to right. The search procedure works with a time-synchronous breadth-first strategy, i.e. all hypotheses for word sequences are extended in parallel for each incoming acoustic vector. To reduce the storage requirements, it is suitable to introduce back pointers and traceback arrays. The back pointers are propagated from state to state during the dynamic programming recursion and report the start time for each word end hypothesis. The traceback arrays are used to record the decisions about the best predecessor word for each word start-up. Using the back pointers and the traceback arrays, the recognized word sequence can be recovered efficiently by a few table look-ups into the traceback arrays at the end of the utterance.
Since all hypotheses cover the same portion of the input, their scores can be directly compared. This enables the system to avoid an exhaustive search and to perform a data driven search instead, i.e. to focus the search on those hypotheses which are most likely to result in the best state sequence [18]. Every 10-ms frame, the score of the best hypothesis is determined, then all hypotheses whose scores fall short of this optimal score by more than a fixed factor are pruned, i.e. are removed from further consideration.
Table 1 One-pass algorithm ('single best').

    proceed over time t from left to right:

      ACOUSTIC LEVEL: process states
        - initialization:   Q(t-1, 0; w) = H(w; t-1)
                            B(t-1, 0; w) = t-1
        - time alignment:   Q(t, s; w) using DP
        - propagate back pointers B(t, s; w)
        - prune unlikely hypotheses
        - purge bookkeeping lists

      WORD PAIR LEVEL: process word ends
        for each pair (w; t) do
          H(w; t)   = max_v    [ p(w|v) · Q(t, S(v); v) ]
          v_0(w; t) = argmax_v [ p(w|v) · Q(t, S(v); v) ]
          - store best predecessor v_0 := v_0(w; t)
          - store best boundary    τ_0 := B(t, S(v_0); v_0)
The experimental tests indicate that for this type of beam search, depending on the acoustic input and the language model constraints, only a few percent of the potential state hypotheses have to be processed for every 10 ms of the input speech, while at the same time the number of recognition errors is virtually not increased. To fully exploit the computational advantages of this beam search strategy, a list organization of the active search space [20] has been developed. At several levels, such as words and states, lists are introduced to dynamically construct the search space such that the computational cost of the search is now proportional only to the number of active hypotheses and is independent of the overall size of the potential search space.
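A minimal sketch of the pruning step, assuming the active hypotheses of the current frame are held in a dictionary keyed by (state, word) and scored as probabilities; the beam factor is an arbitrary illustrative value.

```python
# Hedged sketch of beam pruning: keep only hypotheses whose score is within a
# fixed factor of the current best score of the frame.
def prune(hypotheses, beam_factor=1e-4):
    """hypotheses: dict mapping (state, word) -> path score for the current frame."""
    best = max(hypotheses.values())
    return {h: score for h, score in hypotheses.items() if score >= best * beam_factor}

active = {("s1", "one"): 3.0e-5, ("s2", "one"): 2.9e-9, ("s1", "two"): 1.2e-5}
print(prune(active))   # the hopeless hypothesis ("s2", "one") is dropped
```

In practice the scores are kept as (negative) log probabilities, so the multiplicative beam factor becomes an additive threshold.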
To simplify the presentation, we have not explicitly included intraphrase silence models. It is straightforward to extend the algorithm to allow for optional silence between words. The algorithm shown in Table 1 is still the most successful algorithm for both small-vocabulary and large-vocabulary speech recognition [15, 6, 14, 1, 7, 9, 11, 13, 30].
4 LEXICAL TREE AND ONE-PASS ALGORITHM

4.1 Lexical Pronunciation Tree
So far, we have considered a straightforward approach to organizing the search space, which was based on the use of a linear lexicon, i.e. each word is represented as a linear sequence of phonemes, independently of other words. In a 20,000-word vocabulary, there are typically many words that share the same beginning phonemes. Therefore a tree representation of the lexicon will be considered as an alternative to a linear lexicon. Such a lexical tree representation is an essential ingredient for handling vocabularies of 20,000 words and more (despite the apparent complications in the definition of the search space).
When analyzing the distribution of the state hypotheses over the state positions within a word, one typically finds in the experimental tests that the lion's share of the hypotheses covers the first and second phonemes in a word. In [10], for a 12,306-word dictation task, 79% and 16% of the state hypotheses were in the first and second phoneme, respectively. On the other hand, a large number of words in the 12,306-word vocabulary share the same initial sequence of phonemes. Therefore it seems promising to arrange the pronunciation lexicon as a tree. Each arc of the tree stands for a phoneme such that an arc sequence from the tree root to a tree leaf represents a legal phoneme sequence and thus a word of the vocabulary. For the 12,306-word vocabulary, Table 2 shows the distribution of arcs over the first 6 generation levels of the tree. The lexical tree consists of 43,000 arcs, which is a compression factor of 2.5 over the 108,000 phoneme copies of the linear lexicon. Even in the fourth level of the tree, the number (3,116) of phoneme arcs is only one quarter of the number (12,306) of words in the lexicon.
Table 2 Distribution of the number of phoneme arcs for a 12,306-word lexicon using a tree organization.

    tree level:      1     2     3     4     5     6
    phoneme arcs:   28   331  1511  3116  4380  4950
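The following sketch builds such a prefix tree for a small invented mini-lexicon and counts the arcs per generation level, mirroring the kind of statistic reported in Table 2; the words and phoneme symbols are assumptions made for illustration only.

```python
# Hedged sketch of a lexical pronunciation tree: words sharing initial phonemes
# share tree arcs; a leaf marker records the word identity.
LEXICON = {
    "speech": ["s", "p", "iy", "ch"],
    "speed":  ["s", "p", "iy", "d"],
    "spin":   ["s", "p", "ih", "n"],
    "tree":   ["t", "r", "iy"],
}

def build_tree(lexicon):
    root = {}
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            node = node.setdefault(ph, {})
        node["#word"] = word              # mark the leaf with the word identity
    return root

def arcs_per_level(node, level=1, counts=None):
    counts = counts if counts is not None else {}
    for key, child in node.items():
        if key == "#word":
            continue
        counts[level] = counts.get(level, 0) + 1
        arcs_per_level(child, level + 1, counts)
    return counts

print(arcs_per_level(build_tree(LEXICON)))
# {1: 2, 2: 2, 3: 3, 4: 3}: 15 phoneme positions of the linear lexicon collapse to 10 arcs
```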
4.2 Predecessor Conditioned One-Pass Algorithm
For a unigram language model, we need only one tree, which is started up after incorporating the unigram language probability. However, for a bigram language model (and other more complicated language models), there is an added complication due to the fact that the identity of the word is known only at the end of a tree and only then can the language model probability be applied. Therefore the language model probabilities can only be incorporated after reaching the end, i.e. the final state, of the second word of the bigram. In other words, the use of the language model may be viewed as delayed by one word as compared to the bigram model without using a tree lexicon. To incorporate the bigram language model into the search, we will keep a separate copy of the lexical tree for each predecessor word. Fig. 6 illustrates this concept for a three-word vocabulary (A, B, C). For each predecessor word v, there is a separate lexical tree so that during the search process we always know the predecessor word v when a word end hypothesis w is hypothesized. In the set-up of Fig. 6, we apply the bigram probability p(w|v) when the final state of word w with predecessor v has been reached, and use the resulting overall score to start up the corresponding lexical tree, i.e. the tree that has word w as predecessor.
The key quantities for dynamic programming are again the partial scores Q(t, s; w) and the back pointers B(t, s; w). However, the quantities must be made dependent on the predecessor word v to account for the delayed application of the bigram language model:

    Q_v(t, s; w) := score of the best path up to time t that ends in state s of word w for predecessor v.

    B_v(t, s; w) := starting time of the best path up to time t that ends in state s of word w for predecessor v.
Both quantities are evaluated using the dynamic programming recursion for Q_v(t, s; w):

    Q_v(t, s; w) = max_σ { q(x_t, s | σ; w) · Q_v(t-1, σ; w) }             (14)
    B_v(t, s; w) = B_v(t-1, σ_max(t, s; w, v); w) ,                        (15)

where σ_max(t, s; w, v) is the optimum predecessor state for the hypothesis (t, s; w) and predecessor word v. The back pointers B_v(t, s; w) are propagated according to the DP decision.
[Figure 6: Word recombination for a lexical tree — for each predecessor word (A, B, C), a separate copy of the lexical tree (acoustic model) is maintained, and the language model transitions at the word ends start up the corresponding tree copies.]
Unlike the predecessor word v, the index w for the word under consideration is not really needed, because from the phoneme arcs of each tree, the corresponding words can be uniquely recovered. However, for notational consistency, we have retained both indices v and w. Using a suitable initialization for σ = 0, this equation includes the optimization over the unknown word boundaries. At word boundaries, we have to find the best predecessor word for each word. To this purpose, we define:

    H(w; t) := max_v { p(w|v) · Q_v(t, S(w); w) } .                        (16)
To start up words, we have to pass on the score and the time index:

    Q_v(t-1, 0; w) = H(v; t-1)                                             (17)
    B_v(t-1, 0; w) = t-1 .                                                 (18)

Note that here the right-hand side does not depend on the word index w due to the notational scheme explained above.
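A minimal sketch of the bookkeeping implied by Eqs. (16)-(18): one score table per predecessor word (one lexical tree copy each), the bigram probability applied only when a word end is reached, and the winning score used to start up the tree copy that has this word as predecessor. All numbers are invented, and the state-level recursion of Eqs. (14)-(15) is omitted.

```python
# Hedged sketch of predecessor-conditioned word-end recombination (Eq. 16) and
# tree-copy start-up (Eqs. 17-18); only the terminal-state scores are filled in.
VOCAB = ["A", "B", "C"]
S = {"A": 4, "B": 3, "C": 5}                                  # terminal state S(w)
BIGRAM = {(v, w): 0.2 if v != w else 0.1 for v in VOCAB for w in VOCAB}

# Q[v][w][s]: score of the best path ending at the current time t in state s of
# word w within the tree copy for predecessor v (toy values at the word ends only).
Q = {v: {w: {S[w]: 1e-3 * (i + 1) * (j + 2)} for j, w in enumerate(VOCAB)}
     for i, v in enumerate(VOCAB)}

def word_end_recombination(t):
    """Eq. (16): H(w; t) = max_v p(w|v) * Q_v(t, S(w); w); start-up per Eqs. (17)-(18)."""
    startup = {}
    for w in VOCAB:
        v0 = max(VOCAB, key=lambda v: BIGRAM[(v, w)] * Q[v][w][S[w]])
        H = BIGRAM[(v0, w)] * Q[v0][w][S[w]]
        # start up the tree copy whose predecessor is w, with score H and start time t
        startup[w] = {"score": H, "start_time": t, "best_predecessor": v0}
    return startup

print(word_end_recombination(t=42))
```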

4.3 Experimental Results

Systematic experimental tests have been carried out for the search methods presented so far [22, 20]; here, we only summarize the most important results:

• Throughout all recognition tests, ranging from 100 to 10,000 vocabulary words, the experimental tests indicate that by using the beam search strategy, only a small fraction, typically a few percent or less, of the potential state hypotheses have to be processed while at the same time the number of recognition errors is virtually not increased over a full search.

• For a 12,000-word dictation task, a series of experiments was run to study the effect of the tree lexicon on the search effort [20]. By using the tree lexicon, the search effort was reduced by a factor of 6 over the linear lexicon. A further reduction by a factor of 3 was obtained by applying a look-ahead technique at the phoneme level in the lexical tree [10].

• The dramatic reduction for a tree lexicon must be attributed to the subtle interaction between the focusing property of beam search and the organization of the search space. The ambiguity in the decisions and therefore the number of hypotheses is particularly high at the beginning of words; by a tree lexicon, such hypotheses are compressed in a most efficient way. In other words, the tree lexicon leads to a search space that is most compact where the highest uncertainty is expected, i.e. at the beginning portions of words.

• In general, when more detailed models with a potentially larger search space are used, the effort of the beam search strategy is not much affected. For example, for a linear lexicon, when going from a bigram to a trigram model, the search effort remained the same for the particular experiments [22]. For a tree lexicon, when switching from a unigram tree to the bigram tree copies, the search effort was increased by a factor of 2 only [20]. In other words, the early use of more detailed models results in a more focussed beam search despite the increase in the potential search space.
• As to the pruning strategy, we have found that it is advantageous to use two pruning thresholds: an acoustic pruning threshold for the intraword recombinations and a language pruning threshold for the interword recombinations [28]. This second threshold is related to an anticipated language model probability, since a state hypothesis at any point in the lexical tree implies a certain lower bound for the language model probability. Ultimately, the choice of the pruning thresholds is based on trial and error, although the choice is not critical [28].

• There is a (close to) real-time implementation of the search strategy described in this chapter; the hardware is based on an Intel 80486 microprocessor and 16 Mbyte RAM for a 25,000-word continuous speech recognition system [27].
5 WORD GRAPHS AND ONE-PASS ALGORITHM

5.1 Specification of Word Graphs
The main idea of a word graph is to come up with word alternatives in regions of the speech signal where the ambiguity in the acoustic recognition is high. The advantage is that the acoustic recognition is decoupled from the application of the language model and that a complex language model, in particular a long-span language model, can be applied in a subsequent postprocessing step. The number of word alternatives should be adapted to the level of ambiguity in the acoustic recognition. The difficulty in efficiently constructing a good word graph is the following: the start time of a word depends in general on the predecessor words. In a first approximation, we limit this dependence to the immediate predecessor word and obtain the so-called word pair approximation:

    Given a word pair and its ending time, the word boundary between the two words is independent of the further predecessor words.
This word pair approximation was originally introduced in [24] to efficiently calculate multiple or n-best sentences. The word graph can be expected to be more efficient than the n-best approach. In the word graph approach, word hypotheses need to be generated only locally, whereas in n-best methods each local alternative requires a whole sentence to be added to the n-best list.
To give an (over)simplified example, suppose we have 10 spoken words and 2 word hypotheses for each word position. The n-best method then requires 2^10 = 1024 sentence hypotheses, whereas the word graph approach produces a graph of only 2 · 10 = 20 word arcs.
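A tiny sketch of this counting argument, with invented word labels:

```python
# 2 word hypotheses at each of 10 positions: the word graph needs 20 arcs, while an
# explicit n-best list would need 2**10 sentence hypotheses to cover the same set.
from itertools import product

alternatives = [(f"w{i}a", f"w{i}b") for i in range(10)]    # 2 hypotheses per position

num_arcs = sum(len(alts) for alts in alternatives)          # size of the word graph
num_sentences = len(list(product(*alternatives)))           # sentences it encodes

print(num_arcs, num_sentences)   # 20 1024
```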
There have been a number of attempts at using two explicit levels in search: the first level producing a short list of either single word or sentence hypotheses, and a second level where the final decision is taken using a complex language model [23, 24, 25, 8, 21, 17].
As in [21], we prefer the term 'word graph' to 'word lattice' to indicate that when combining word hypotheses, we do not allow overlaps or gaps along the time axis. The difference to the word graph algorithm in [21] is that we use a formal specification of word graphs and make explicit use of the word pair approximation. As a result, we can incorporate the word graph construction into our standard one-pass algorithm using predecessor conditioned lexical trees, whereas in [21], the search was based on time conditioned lexical trees.
In this section, we will formally specify the word graph construction problem and pave the way for the word graph algorithm. We start with the fundamental problem of word graph construction:
• Given a word w and ending time t, how can we find a limited number of 'most likely' predecessor words?

• This task is difficult since the start time of word w may very well depend on the predecessor word under consideration, which results in an interdependence of start times and predecessor words.
In view of the most successful one-pass beam search strategy, what we want to achieve conceptually is to keep track of word sequence hypotheses whose scores are very close to the locally optimal hypothesis, but that do not survive due to the recombination process.
The basic idea is to represent all these word sequences by a word graph, in which each arc represents a word hypothesis. Each word sequence contained in the word graph should be close (in terms of scoring) to the single best sentence produced by the one-pass algorithm. To describe the word graph construction algorithm, we introduce the following definitions:
• h(w; τ, t) := Pr(x_{τ+1} ... x_t | w) = probability that word w produces the acoustic vectors x_{τ+1} ... x_t.

• G(w_1^n; t) := Pr(w_1^n) · Pr(x_1 ... x_t | w_1^n) = (joint) probability of generating the acoustic vectors x_1 ... x_t and a word sequence w_1^n with ending time t.
Using these definitions, we can isolate the probability contributions of a particular word hypothesis with respect to both the language model and the acoustic model. This decomposition can be visualized as follows:

    x_1 , ... , x_τ      x_{τ+1} , ... , x_t      x_{t+1} , ... , x_T
    \______________/     \___________________/                             (19)
     G(w_1^{n-1}; τ)          h(w_n; τ, t)
Exploiting an m-gram language mo

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket