`
The Level Building (LB) Algorithm

m_2(ℓ) = max_{1 ≤ v ≤ V} [m_{v2}(ℓ)]    (7.17)
`
and for each frame, m, in the composite ending range, m_1(ℓ) ≤ m ≤ m_2(ℓ), we need to store

D̂_ℓ(m) = min_{1 ≤ v ≤ V} [D_ℓ^v(m)]    - best distance at level ℓ to frame m    (7.18)

N̂_ℓ(m) = argmin_{1 ≤ v ≤ V} [D_ℓ^v(m)]    - reference pattern index which gave best distance at level ℓ to frame m    (7.19)

F̂_ℓ(m)    - back pointer to best ending frame at previous level that achieves D̂_ℓ(m)    (7.20)

(By definition F̂_1(m) = 0 for all m, since the ending frame of the 0th level is 0.) By storing only D̂_ℓ(m), N̂_ℓ(m), and F̂_ℓ(m), we significantly reduce the storage at each level and yet retain all the information required to pick up the optimal path through the entire grid, as shown below.
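The minimization of Eqs. (7.18)-(7.20) across reference patterns can be sketched in a few lines of code. This is an illustrative sketch, not the text's implementation: the per-reference level distances D_ℓ^v(m) and their within-level backpointers are assumed to have been produced already by the V time warps at the level, and all names and numbers are invented for the example.

```python
# Sketch of the per-level minimization of Eqs. (7.18)-(7.20): collapse
# the per-reference distances D_l^v(m) into, for each ending frame m,
# the best distance D_hat, the winning reference index N_hat, and the
# backpointer F_hat to the best ending frame at the previous level.
def collapse_level(per_ref):
    """per_ref maps reference index v -> {frame m: (distance, backptr)}.
    Returns {frame m: (D_hat, N_hat, F_hat)}."""
    collapsed = {}
    for v, frames in per_ref.items():
        for m, (dist, backptr) in frames.items():
            if m not in collapsed or dist < collapsed[m][0]:
                collapsed[m] = (dist, v, backptr)
    return collapsed

# Two reference patterns whose warps end over overlapping frame ranges.
per_ref = {
    1: {10: (4.0, 3), 11: (3.5, 3)},
    2: {11: (3.0, 4), 12: (2.5, 4)},
}
print(collapse_level(per_ref)[11])  # (3.0, 2, 4): reference 2 wins frame 11
```

Only these three arrays per level need to be kept, which is exactly the storage reduction described above.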
`The second-level computation does not begin until the first-level computation is
`finished. To see how the computation is picked up at level 2, consider Figure 7.9, which
`shows a series of time warps that span a set of beginning frames and provide a new set of
`ending frames.
For each reference pattern, R_v, the range of starting frames is m_1(1) ≤ m ≤ m_2(1), because every ending point of the first level is a possible starting point in the second level. Hence for each frame m in the beginning range, and for each frame at the beginning of the reference pattern, we must consider paths coming from both the current reference pattern and the previous level. Other than the broadened beginning region, each DTW at level 2 is essentially identical to those at level 1. Thus for reference R_1 the range of ending frames at level 2 is m_{11}(2) ≤ m ≤ m_{12}(2); for reference R_2 the range of ending frames at level 2 is m_{21}(2) ≤ m ≤ m_{22}(2), etc. We again derive the composite ending range at the end of level 2 as
m_1(2) = min_{1 ≤ v ≤ V} [m_{v1}(2)]    (7.21)

m_2(2) = max_{1 ≤ v ≤ V} [m_{v2}(2)]    (7.22)
`
and for each frame in m_1(2) ≤ m ≤ m_2(2) we determine the best distance D̂_2(m), the reference with the best distance, N̂_2(m), and the backpointer F̂_2(m). We can continue the level building procedure through all levels up to level L_max in the above manner, and we obtain, as the final solution, D* as

D* = min_{1 ≤ ℓ ≤ L_max} [D̂_ℓ(M)].    (7.23)
`
It should be clear that by performing the computation in levels (i.e., a word at a time) and by doing appropriate minimization within levels, we avoid much of the computation of the two-level DP algorithm described in the previous section. However, the negative feature of the level building algorithm is that the computation is level synchronous, not time synchronous; that is, we can go back to a given test frame at many levels. Hence, real-time hardware implementations of level building are difficult, if not impossible, to implement. (We will return to these issues later in this section.)

IPR2023-00037
Apple EX1013 Page 254

Chap. 7: Speech Recognition Based on Connected Word Models

[Figure 7.9: Implementation of level 2 of the level building algorithm (after Myers and Rabiner [4]).]
To gain a better understanding of the basic concepts of the level building algorithm, consider the simple example shown in Figure 7.10. Here we assume that the vocabulary consists of two words (which we call A and B for simplicity) and that the two reference patterns, R_A and R_B, are of equal length. We also assume we are only interested in an ℓ = 4 level solution. (This greatly simplifies the range of the dynamic programming search parallelogram which, as seen in Figure 7.10, is essentially identical to the case for isolated word template matching. In the next section, where multiple level considerations are discussed, the dynamic programming search range will be shown to be significantly more complicated to specify.) Since both patterns are of equal length, the ending regions for both words, at each level, are identical. Hence at each level ending region we choose the reference pattern (A or B) that gave the smallest accumulated distance to that frame. In this simple example, there are 6 ending frames at level 1, with the best path corresponding to R_A for the first two frames, and R_B for the next 4 frames. At level 2 there are 10 ending frames; at level 3 there are 6 ending frames; and finally at level 4 there is only one ending frame, corresponding to frame M, the end of the test utterance. By tracing back the (best)
`
`
`,--------------
`
`_ __,!8
`
`i=4
`
`2=3
`
`Q=2
`
`i= 1
`
`m(TEST)
`
`Figure 7.10 Simple example illustrating level building on two reference patterns of
`equal length (after Myers and Rabiner (4)).
`
path ending at m = M, we see that the sequence of reference patterns giving the best score is

with test frames e_1, e_2, e_3, and e_4 = M corresponding to the last frames of the 4 words in the sequence.
`
7.4.2 Multiple Level Considerations
`
In the more realistic case in which we want the level building solutions for all feasible levels (i.e., all levels where there is a possible solution), there are some very simple techniques that can be used to eliminate unnecessary computation. To understand this issue, consider the "standard" warping range of the level building algorithm as shown in Figure 7.11. (We assume here that we use a DTW algorithm with a maximum expansion of 2 and a minimum expansion of 1/2.)

[Figure 7.11: Computation region of level building algorithm for a fixed-length reference pattern (after Myers and Rabiner [4]).]

If we define lower and upper constraint lines as

L(m) = (m + 1)/2    (7.24)

U(m) = 2(m - 1) + 1    (7.25)
`
then for a fixed-length reference pattern we get the computation and ending regions shown in Figure 7.11. (We denote the computation region at level ℓ as G_ℓ, with ending region E_ℓ.) It should be clear that some of the computation of Figure 7.11 is unnecessary, since there do not exist paths from some of the computed points to ends of reference patterns at any level.
We can use the constraint that, in order to do the computation at any grid point, the path from that grid point must be capable of reaching the end of the reference pattern before the end of the test pattern, as shown in Figure 7.12. Here we have drawn lines, at each level, of slope 2, from the last frame of the test pattern and the last frame of the reference pattern, and used them as constraints on the grid. We have also drawn a line, at level L_max, of slope 1/2 from the last test frame and the last reference frame to further constrain the grid at the last level. The lower and upper constraints of the simplified grid can be described by the equations
`
L(m) = max[ (m + 1)/2, 2(m - M) + Q(ℓ) ]    (7.26)

U(m) = min[ 2(m - 1) + 1, (m - M)/2 + Q(L_max) ]    (7.27)

[Figure 7.12: Reduced computation region using upper- and lower-level constraints (after Myers and Rabiner [4]).]

in which the reference pattern length function, Q(ℓ), is the accumulated number of frames of the reference patterns used up to level ℓ, where (m + 1)/2 = 2(m - M) + Q(ℓ) at some frame m < M. The resulting region of computation is somewhat reduced from that shown in Figure 7.11.
Since the length of each reference pattern is different (in general), the actual computation regions for a level building search based on variable-length reference patterns are shown in Figure 7.13. Here we show the computation regions, at each level, for the shortest and for the longest reference patterns. Fundamentally, the pattern of computation is similar to that of Figure 7.11.
`
`7.4.3 Computation of the Level Building Algorithm
`
From the above discussion it should be clear that the basic computation of the level building algorithm is a series of V time warps at each level, where V is the size of the vocabulary. If we assume the maximum number of levels is L_max, then we need V · L_max time warps. An overbound on the size of each time warp is N · M/3 grid points, where N is the average length of each reference pattern and M is the number of frames in the test pattern. Hence the total
`
`
[Figure 7.13: Overall computation pattern of level building algorithm for variable-length reference patterns; shaded regions show the search region for the longest reference at each level, open regions the search region for the shortest reference (after Myers and Rabiner [4]).]
`
computation of the level building algorithm is

C_LB = V · L_max · N · M/3  grid points    (7.28)

with storage

S_LB = 3M · L_max    (7.29)

since we need storage for D̂_ℓ, N̂_ℓ, and F̂_ℓ at each value of m and for each level ℓ.
Using the same values for M, N, and V as used in determining the computation and storage of the two-level DP method, namely M = 300, N = 40, V = 10, and using L_max = 7, we get C_LB = 280,000 and S_LB = 6300. The basic computation of the level building algorithm is a factor of 4.7 less than that of the two-level DP method; the storage of both methods is about the same.
The above calculations are based on full DTWs at each level for each reference pattern. Because of the overlaps of regions in the (m, n) plane, at different levels, some of the assumed computation is redundant. The following exercise illustrates this point.
`
`
Exercise 7.2
In implementing different levels of the level building algorithm, we have assumed all levels are independent. Thus, at each level, we have performed a full DTW, for each reference pattern, where the size of the DTW was N · M/3 (grid points). An alternative, and, we hope, more efficient, approach is to realize that for each reference pattern, a significant portion of the computation at each level (namely that of distance computation) may have previously been performed at an earlier level, and therefore only the combinatorics part of the DTW is truly independent at each level.

1. Show that for the assumed parameters of the system (i.e., M = number of test frames = 300, N = average number of frames per reference pattern = 40, V = size of vocabulary = 10, L_max = maximum number of levels = 7), the alternative implementation of storing all previously computed distances is more efficient than the standard implementation. (Assume that the cost of the combinatorics in a DTW procedure is negligible.)
2. What is the ratio of computation of the standard implementation to the stored distance implementation?
3. If we assume the cost of the combinatorics at each grid point is 1/5 the cost of a distance computation, how do the results to parts 1 and 2 change?
`
Solution 7.2
1. The simplest way of exploiting previously computed distances is to precompute the entire grid of distances from each reference pattern frame to each test frame. (Clearly this is not the most efficient implementation, since many of these full grid distances will never be required in practice.) If we do this we need

C_DIST = V · M · N (distances) = 120,000.

The number of combinatorics is just the number of grid points used in the level building method; hence

C_COMB = V · L_max · N · M/3 (grid points) = 280,000.

The total computation of this approach is then

C_TOTAL = C_DIST + α · C_COMB

in which α is the weight for combinatorics. If we assume combinatorics are negligible, α = 0, we get

C_TOTAL = C_DIST = 120,000,

which is significantly less than the 280,000 distances for the standard implementation (i.e., one per grid point).
2. The ratio of computation of the two approaches is

C_LB / C_TOTAL = (V · L_max · N · M/3) / (V · M · N) = L_max/3 = 2.33

and is independent of V, M, and N; it depends only on L_max.
`
`
3. If we assume the cost of combinatorics is 1/5 the cost of a distance computation, we get

C_TOTAL = V · M · N + V · L_max · N · (M/3) · (1/5) = 176,000

with the ratio in computation being

C_LB / C_TOTAL = (V · L_max · N · M/3) / (V · M · N + V · L_max · N · (M/3) · (1/5)) = L_max / (3 + L_max/5) = 1.59
`
`7 .4.4
`
`Implementation Aspects of Level Building
`
`Even though we have already shown that the computation for the level building approach
`to connected word recognition is significantly less than that of the two-level DP approach
`of Section 7 .2, there are several ways of even further reducing the computational load of
`the algorithm. These include the following:
`
`1. Beginning range reduction in which we reduce the size of the initial region (range of
`m) for which valid paths to a given level can begin.
`2. Global range reduction in which we reduce the size of the region (width of the
`template) that is tracked, within a level, to determine a best path to each possible
`level ending frame, m.
`3. Test pattern ending range in which we increase the range over which the global path
`can match the connected word string; this procedure provides some robustness to
`ending frame errors.
`4. Reference pattern uncertainty ranges in which we increase the range of search at
`the beginning and end of reference patterns to allow for some degree of word
`coarticulation at reference word boundaries.
`
`The way in which we implement these features is as follows.
`
1. Beginning range reduction (M_T)
For the complete level building algorithm, at level ℓ - 1, we retain the best score, D̂_{ℓ-1}(m), for each frame m in the ending region m_1(ℓ - 1) ≤ m ≤ m_2(ℓ - 1). It should be obvious that, in general, the best global path is not at either boundary but somewhere in the middle range of m. Hence if we could eliminate some of the ending range at level ℓ - 1, the search at level ℓ would involve less computation. (To see this, consider the limiting case where we make the ending region, at each level, a single frame; then the computation at each new level is a simple DTW with a single initial frame, much as occurs at level 1.)
To determine how to reduce the ending range at level ℓ - 1, we need to normalize the best accumulated distance scores by the number of frames. Thus we first find the locally best (minimum) normalized accumulated distance as

φ_{ℓ-1} = min_{m_1(ℓ-1) ≤ m ≤ m_2(ℓ-1)} [D̂_{ℓ-1}(m)/m].    (7.30)

We now define a reduced-range level threshold as M_T · φ_{ℓ-1}, where M_T is a defined parameter, and search the range m_1(ℓ - 1) ≤ m ≤ m_2(ℓ - 1) to find the indices S_ℓ^1 and S_ℓ^2 such that

S_ℓ^1 = min { m : D̂_{ℓ-1}(m)/m ≤ M_T · φ_{ℓ-1} }    (7.31)

S_ℓ^2 = max { m : D̂_{ℓ-1}(m)/m ≤ M_T · φ_{ℓ-1} }.    (7.32)
`
To see what is achieved by the beginning range reduction, consider Figure 7.14, which shows a sequence of levels and the resulting ending ranges with and without range reduction (top graph), and the way in which the reduced range is determined (bottom graph). For the level shown, the best normalized distance score, φ_{ℓ-1}, is determined as the smallest value of the curve, and the threshold M_T · φ_{ℓ-1} is shown as a level above φ_{ℓ-1} (clearly M_T ≥ 1.0). The initial reduced range point, S_ℓ^1, is the last index (beginning from m = m_1(ℓ - 1)) where the curve is always above the threshold; similarly, the index S_ℓ^2 is the last index (beginning from m = m_2(ℓ - 1)) where the curve is always above the threshold. The reduced range means that the computation at the next level is smaller, as seen at the top of Figure 7.14. Depending on the value of M_T, the reduced range can get smaller (as M_T → 1) or larger (as M_T → ∞). It should be clear that too small a value of M_T will allow the best path to be prematurely eliminated; hence proper choice of M_T is essential.
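The thresholding step can be sketched as follows. This is an illustrative sketch only: the crossing-point search of Figure 7.14 is simplified to a direct scan over the ending range, and the distances are invented for the example.

```python
# Sketch of beginning range reduction: normalize the level-(l-1) ending
# distances by frame index, find the best normalized score phi, and keep
# only the frames whose normalized score is within M_T * phi.
def reduce_range(D_hat, m1, m2, M_T):
    """D_hat[m]: best accumulated distance to ending frame m.
    Returns (S1, S2), the first and last surviving ending frames."""
    norm = {m: D_hat[m] / m for m in range(m1, m2 + 1)}
    phi = min(norm.values())
    kept = [m for m in range(m1, m2 + 1) if norm[m] <= M_T * phi]
    return kept[0], kept[-1]

D_hat = {8: 8.8, 9: 6.3, 10: 6.0, 11: 7.7, 12: 10.8}
print(reduce_range(D_hat, 8, 12, M_T=1.2))  # (9, 11)
```

As M_T shrinks toward 1 only the locally best frame survives; a large M_T keeps the full original range, illustrating the trade-off described above.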
2. Global range reduction (ε)
The idea behind global range reduction is to reduce the search range along the reference axis, for each test frame, by tracking the global minimum, and allowing only a range around the global minimum. Thus for each test frame, m, at each level ℓ, and for each reference pattern, R_v, we determine the local minimum, c(m), as

c(m) = argmin_{c(m-1)-ε ≤ n ≤ c(m-1)+ε} [D_ℓ(m - 1, n)]    (7.33)

in which D_ℓ(m - 1, n) is defined to be the best distance at level ℓ using reference R_v at test frame m - 1 to reference frame n (as determined by the local alignment path), and with c(1) defined to be 1. Figure 7.15 illustrates the global range reduction for a typical level building search. For global range reduction to be effective we must have the reduced range, 2ε + 1, be smaller than the typical reference pattern width.
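A sketch of the band-tracking step of Eq. (7.33); names and numbers are illustrative, and the full algorithm applies this per reference pattern and per level:

```python
# Sketch of global range reduction: at test frame m, only reference
# frames within epsilon of the previous frame's best frame c(m-1) are
# searched, i.e. a band of width 2*epsilon + 1 along the reference axis.
def next_center(D_prev, c_prev, eps, N):
    """D_prev[n]: best distance at test frame m-1 to reference frame n.
    Returns c(m), the reference frame minimizing D_prev within the band."""
    lo, hi = max(1, c_prev - eps), min(N, c_prev + eps)
    return min(range(lo, hi + 1), key=lambda n: D_prev[n])

D_prev = {1: 5.0, 2: 4.1, 3: 2.0, 4: 2.6, 5: 6.0}
print(next_center(D_prev, c_prev=2, eps=1, N=5))  # 3
```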
3. Test pattern ending range (δ_END)
The idea here is to allow a range of test pattern ending frames, rather than restricting the ending frame to m = M. This feature provides a measure of robustness to test pattern endpoint errors. If we extend the end of the test pattern by δ_END frames, the global level building solution is modified to be

D* = min_{1 ≤ ℓ ≤ L_max} min_{M-δ_END ≤ m ≤ M} [D̂_ℓ(m)].    (7.34)
`
`
[Figure 7.14: Illustration of beginning range reduction (after Myers and Rabiner [4]).]
`
4. Reference pattern uncertainty regions (δ_R1, δ_R2)
To account for coarticulation of words across word boundaries, the level building algorithm allows a range of beginning and ending frames of the reference pattern. In this manner, at any level, the path can begin over the range 1 ≤ n ≤ 1 + δ_R1 (where n is the reference pattern index), and end at any frame in the range N_v - δ_R2 ≤ n ≤ N_v. For appropriate values of δ_R1 and δ_R2, it is possible to have a path which skips (δ_R1 + δ_R2) frames at highly coarticulated word boundaries (e.g., the boundary between six and seven in the string "six-seven"). Figure 7.16 illustrates the use of reference pattern uncertainty regions in level building.
`
[Figure 7.15: Illustration of global range reduction (after Myers and Rabiner [4]).]

[Figure 7.16: Illustration of the use of reference pattern uncertainty regions (after Myers and Rabiner [4]).]

[Figure 7.17: Summary of level building implementation of computation reduction methods (after Myers and Rabiner [4]).]

A summary of the four implementation aspects of level building, as discussed in this section, is shown in Figure 7.17, in which all four features are combined in a single level building search. It can be shown that with judicious choice of the implementation parameters, M_T, ε, δ_END, δ_R1, and δ_R2, the overall computation can be substantially reduced from that of the standard level building approach.
`
`7 .4.5
`
`Integration of a Grammar Network
`
We have been implicitly assuming that, for connected word recognition, each word in the string can be followed by any other word in the string. This implicit form of grammar is most appropriate for things like digit strings, in which any digit can follow any other digit. However, for some connected word recognition tasks there is an explicit set of rules (a grammar) governing which words can logically follow other words to form valid sentences in the language [10]. Although the form of such a grammar can be of several different types (we will discuss this in more detail in Chapter 8), we will restrict ourselves here to those tasks in which we can represent the grammar by a finite state network (FSN) or finite state automaton (FSA) of the form

G = A(Q, V, δ, q_0, Z)    (7.35)

where

Q = set of states
`
`
V = set of vocabulary words
δ = set of transitions
q_0 ∈ Q = initial state
Z ⊆ Q = set of terminal states

and the set of transitions obeys the rule

δ(q, v) = s,    (7.36)

meaning that word v drives the state from q to s.
`
To integrate the FSN grammar network into the level building algorithm we must do the following:

1. Identify levels with states rather than word position, so that word candidates at the ℓth level need not be temporally contiguous to those at the (ℓ + 1)st level.
2. Partition the vocabulary so that only the reference patterns for words leaving the ℓth state are matched at the ℓth level.
3. Retain state backtracking pointers for recovering the best matching string.

(It should be noted that for the most efficient computation, the states in Q should be topologically sorted.)
To illustrate how a simple grammar FSN can be integrated into the LB algorithm, consider the following example vocabulary:

1. I, 2. WANT, 3. NEED, 4. THREE, 5. ONE, 6. A, 7. AN, 8. BOOK, 9. BOOKS, 10. COAT, 11. COATS, 12. NEW, 13. OLD
`
`
Current State | Words Used     | Predecessor State | Current Level | Predecessor Levels
2             | I              | 1                 | 1             | 0
3             | WANT           | 2                 | 2             | 1
4             | NEED           | 2                 | 3             | 1
5             | THREE          | 3                 | 4             | 2
6             | A              | 4                 | 5             | 3
7             | AN             | 4                 | 6             | 3
8             | ONE            | 3                 | 7             | 2
8             | NEW            | 6                 | 8             | 5
8             | OLD            | 7                 | 9             | 6
9*            | BOOK, COAT     | 8                 | 10            | 7, 8, 9
9*            | BOOKS, COATS   | 5                 | 11            | 4
`
If we study this simple example, we see that levels 2 and 3 both pick up from level 1; similarly, both levels 4 and 7 pick up from level 2. By building up the computation by levels (keeping track of the correct predecessor level), and by backtracking the final result by states (according to the grammar FSN), we can essentially use all the techniques described to efficiently find the best grammatically correct string.
`
7.4.6 Examples of LB Computation of Digit Strings

Figures 7.18 and 7.19 show two examples of connected digit strings matched using the LB algorithm. Figure 7.18 is for the string "51560" and shows the computation building up level by level. For this example, the locally best path at each level (shown by the digit to the right of the last test frame) is actually the globally best digit. At the end of level four the algorithm provided the four best choices, with the string "5157" having the lowest score of 0.553. At the end of level 5 there were six string choices, with the correct string "51560" having the lowest average accumulated distance of 0.333.
The example in Figure 7.19 is for the string "99211," which did not provide a match nearly as good as the previous example. We see here that the locally best string at each level, namely "90111," is not globally best, and is actually incorrect in two positions. Although the algorithm gets the correct string as the best score, the second-best string is the string "901," which has two digit deletions and one digit-substitution error.
`
[Figure 7.18: Level building of the string "51560" (after Myers and Rabiner [4]). Level-4 string scores: 0.553 "5157", 0.569 "1157", 0.591 "5957", 0.596 "5117"; level-5 string scores: 0.333 "51560", 0.349 "11560", 0.354 "51570", 0.370 "59560", 0.375 "51160", 0.413 "51562".]

7.5 THE ONE-PASS (ONE-STATE) ALGORITHM

The third general approach to the connected word recognition problem is a procedure that was originally proposed by Vintsyuk in 1971 [5], has been "rediscovered" several times in the last two decades [6-8], and has been generalized in scope [9]. The algorithm has been called the one-pass procedure, the one-state algorithm, or, most recently, the frame-synchronous level building (FSLB) method. The basic idea behind the algorithm is illustrated in Figure 7.20, which shows a grid with the test pattern, T, mapped to the horizontal axis, and the set of reference patterns, {R_1, R_2, ..., R_V}, mapped to the vertical axis.
Using the standard notation of m to represent the test frame index, 1 ≤ m ≤ M, v to represent the reference pattern (R_v) index, 1 ≤ v ≤ V, and n to represent the reference frame index of pattern R_v, 1 ≤ n ≤ N_v, then for each test frame we calculate the accumulated distance, d_A(m, n, v), as

d_A(m, n, v) = d(m, n, v) + min_{n-2 ≤ j ≤ n} [d_A(m - 1, j, v)]    (7.37)

for 2 ≤ n ≤ N_v, 1 ≤ v ≤ V, where d(m, n, v) is the local distance between test frame t(m) and reference frame r_v(n), and we assume a maximum path expansion of 2 to 1 (hence we search back by 2 reference pattern frames in the combinatoric stage). The recursion of Eq. (7.37) is carried out for all internal frames of each reference pattern (i.e., n ≥ 2). At the reference pattern boundary, i.e., n = 1, we have the simple recursion

d_A(m, 1, v) = d(m, 1, v) + min[ min_{1 ≤ r ≤ V} [d_A(m - 1, N_r, r)], d_A(m - 1, 1, v) ].    (7.38)
`
[Figure 7.19: Level building of the string "99211" (after Myers and Rabiner [4]). Candidate string scores include 0.523 "99211", 0.570 "19211", 0.572 "92211", 0.603 "99219"; 0.545 "9921", 0.547 "9901", 0.592 "1921", 0.593 "9221", 0.604 "9929"; 0.535 "901", 0.565 "101", 0.594 "909".]

[Figure 7.20: The one-pass connected word recognition algorithm (after Bridle et al. [6]).]

[Figure 7.21: Combinatorics for the one-pass algorithm.]

Thus the combinatorics for internal and boundary frames are as shown in Figure 7.21. Figure 7.21a shows that for internal frames the combinatorics choose the best internal path within the reference pattern, whereas at boundary frames the combinatorics choose either a straight (horizontal) path from within the reference pattern (subject to the constraint that the path cannot remain constant for more than one frame) or the best ending frame of any reference pattern. (Within the dynamic programming framework, of course, incorporation of a set of local constraints that differ from those of Eq. (7.38) is possible, subject to pragmatic considerations.)
The final solution for the best path (corresponding to the best word string) is

D* = min_{1 ≤ v ≤ V} [d_A(M, N_v, v)].    (7.39)

Thus the one-pass algorithm computes best paths to every reference pattern frame at every test frame and eventually is able to backtrack the best score (from Eq. (7.39)) to give the best word sequence, as shown in Figure 7.20.
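The recursion of Eqs. (7.37)-(7.39) can be sketched compactly. This is a distance-only illustrative sketch: the constraint that a boundary path not remain at n = 1 for more than one frame is omitted, as is the backpointer bookkeeping needed to recover the word string, and all names are invented.

```python
# Sketch of the one-pass recursion, Eqs. (7.37)-(7.39) (simplified).
INF = float("inf")

def one_pass(local_d, lengths, M):
    """local_d(m, n, v): local distance, 1-based frame indices.
    lengths[v]: N_v for each reference pattern v.
    Returns the best final accumulated distance, Eq. (7.39)."""
    V = len(lengths)
    prev = {}  # (n, v) -> accumulated distance at test frame m-1
    for m in range(1, M + 1):
        cur = {}
        # Best ending score at m-1 over all references (boundary re-entry).
        best_end = min((prev.get((lengths[v], v), INF) for v in range(V)),
                       default=INF)
        for v in range(V):
            for n in range(1, lengths[v] + 1):
                if n == 1:  # boundary frame, Eq. (7.38)
                    back = min(best_end, prev.get((1, v), INF))
                    if m == 1 and back == INF:
                        back = 0.0  # paths start at the first test frame
                else:       # internal frame, Eq. (7.37)
                    back = min(prev.get((j, v), INF)
                               for j in range(max(1, n - 2), n + 1))
                if back < INF:
                    cur[(n, v)] = local_d(m, n, v) + back
        prev = cur
    return min(prev.get((lengths[v], v), INF) for v in range(V))

# One reference pattern of 2 frames, 2 test frames, unit local distances.
print(one_pass(lambda m, n, v: 1.0, [2], 2))  # 2.0
```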
The major problem with the one-pass algorithm is that no mechanism is provided for controlling the resulting string length, that is, for giving a best path for a string of arbitrary length. The algorithm inherently finds a single best path whose string length is whatever it turns out to be. Thus there is no simple way of exploiting given constraints on string length within the fundamental procedure.
There is, however, a simple and straightforward way of incorporating a level (i.e., string-length) constraint in the computation. We do this by extending the accumulated distance to include level information; that is, we extend the recursion to compute the accumulated distance at level ℓ as

d_A^ℓ(m, n, v) = d(m, n, v) + min_{n-2 ≤ j ≤ n} [d_A^ℓ(m - 1, j, v)]    (7.40)

where the computation is for 1 ≤ ℓ ≤ L_max, 2 ≤ n ≤ N_v, 1 ≤ v ≤ V, 1 ≤ m ≤ M. At each boundary frame the computation now becomes

d_A^ℓ(m, 1, v) = d(m, 1, v) + min[ min_{1 ≤ r ≤ V} [d_A^{ℓ-1}(m - 1, N_r, r)], d_A^ℓ(m - 1, 1, v) ]    (7.41)

with

D* = min_{1 ≤ ℓ ≤ L_max} min_{1 ≤ v ≤ V} [d_A^ℓ(M, N_v, v)].    (7.42)

The key difference is in Eq. (7.41), which only allows a path to an ending frame at level (ℓ - 1) to become a path at a beginning frame at level ℓ.
The main advantage of the one-pass algorithm is that the computation for a given test frame, m, can be done frame synchronously; hence the one-pass algorithm is well suited to real-time implementation on processors that are capable of doing all the necessary computations of Eqs. (7.40) and (7.41) in a single frame interval. Although at first glance it seems as though the computation of the one-pass algorithm is significantly greater than that of the LB approach, it is easily seen that the computation of d(m, n, v) of Eq. (7.40) is independent of level, ℓ; hence it can be computed once (e.g., at level 1), stored, and used in subsequent levels with no extra computation. Because of its suitability for real-time implementation, the level-based version of the one-pass algorithm is generally the one used for connected word recognition tasks.
`
7.6 MULTIPLE CANDIDATE STRINGS
`
In the previous sections we have been discussing ways of determining the globally best path through the extended grid, corresponding to the best match to the spoken word string. There are many ways of obtaining multiple candidate strings from which we can determine, at least in theory, the second-best string match, the third-best string match, etc. Multiple string candidates are particularly useful when using grammar networks, in order to provide robust recognition decisions. The way in which we obtain multiple candidate strings is simple; namely, we keep track of both the best distance and the second-best distance to each ending frame at each level (to get the second-best string match). Since we have computed all reference pattern distances already (as needed to determine the best distance), all that is required is additional storage (to keep track of the array of second-best distances) and the bookkeeping to determine the second-best string. (In Chapter 4 we discussed the general problem of using dynamic programming methods to determine the N-best paths through a grid. What is described here is essentially what was called the parallel algorithm in Chapter 4, Section 4.7.5. The main difference is that here the path ranking is in terms of word candidate strings instead of frame-based time warping paths. This difference gives rise to some extra optimality considerations, as discussed in this section.) We illustrate the procedure in Figure 7.22, which shows a simple two-level LB search with tracking of the two best strings at each grid point. At the end of level 2 there are two distinct paths to the last frame of the test pattern, namely the best distance (labeled 1 and shown as a solid path) and the second-best distance (labeled 2 and shown as a dashed line). The best path from level 2 branches off to two other paths in level 1, with one being a best path and the other being a second-best path. Similarly, the second-best path from level 2 branches off to two other paths in level 1. Thus a total of four paths are followed: the 11 path (best from level 2, best from level 1), the 12 path (best from level 2, second best from level 1), the 21 path (second best from level 2, best from level 1 to the beginning point of the level 2 path), and the 22 path (second best from level 2, second best from level 1). The overall best path is, by definition, the 11 path through the grid. The second-best path, according to this method, is (with some pathological exceptions, to be explained next) either the 12 path or the 21 path; the 22 path cannot ever be as good as the 21 path.
The procedure described above can be extended trivially to the best three candidates at each level, in which case a total of 3^L scores are obtained after L levels. This is illustrated at the bottom of Figure 7.22 for a two-level match in which there are nine string candidates.
`
`
[Figure 7.22: Description of procedure for determining multiple candidate scores (after Myers and Rabiner [4]).]
`
Figure 7.23 shows a case of using L = 4 levels with two best candidates. There are now four possible choices for the second-best string, including the 1112 path, the 1121 path,
`the 12