8.8 OVERALL RECOGNITION SYSTEM BASED ON SUBWORD UNITS
[Figure 8.7 Overall block diagram of subword unit based continuous speech recognizer. Speech input passes through spectral analysis to give feature vectors; a word-level match (using word models obtained by word model composition from the subword models and the lexicon) is combined with a sentence-level match (using a language model built from the grammar and semantics) to produce the recognized sentence.]
features were also studied, but results on such systems will not be presented here.) The second step in the recognizer is a combined word-level/sentence-level match, which is accomplished as follows. Using the set of subword HMMs and the word lexicon, a set of word models (HMMs) is created by concatenating the subword unit HMMs as specified in the word lexicon. At this point, the system is very similar to the connected word recognizers of Chapter 7. The sentence-level match is performed via an FSN realization of the word grammar (the syntax of the system) and the semantics, as expressed in a composite FSN language model. The implementation of the combined word-level match/sentence-level match is via any of the structures described in Chapter 7. In particular, most systems use structures similar to the frame-synchronous level-building method (usually with some type of beam search to restrict the range of paths) to solve for the "best" recognition sentence.
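As an illustration of the word model composition step, the following sketch concatenates left-to-right subword HMMs into a single word HMM. The SubwordHMM container, the 3-state unit size, and the lexicon entry are invented for this example, not structures from the text; Python is used purely for exposition.

```python
# Sketch of word model composition: a word HMM is built by chaining the
# states of its subword unit HMMs in lexicon order (left-to-right topology).

from dataclasses import dataclass

@dataclass
class SubwordHMM:
    name: str
    n_states: int          # states in this left-to-right unit model

def compose_word_model(word, lexicon, unit_models):
    """Concatenate subword HMMs as specified by the word's lexicon entry."""
    states = []
    for unit_name in lexicon[word]:
        unit = unit_models[unit_name]
        # The exit of one unit feeds the entry of the next.
        states.extend((unit.name, s) for s in range(unit.n_states))
    return states

# "above" spelled with context-independent PLUs:
lexicon = {"above": ["ax", "b", "ah", "v"]}
unit_models = {p: SubwordHMM(p, 3) for p in lexicon["above"]}
print(len(compose_word_model("above", lexicon, unit_models)))  # 12 states
```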
Consider using the recognizer of Figure 8.7 for a database management task called the Naval Resource (Battleship) Management Task, as popularly defined within the DARPA community [13]. This task, which has a 991-word vocabulary (plus a separate silence word), can be used to query a database as to locations, attributes, constraints, history, and other information about ships within the database. Typical examples of sentences used to query the database include

• what is mishawaka's percent fuel
• total the ships that will arrive in diego-garcia by next month
• do any vessels that are in gulf of tonkin have asw mission area of m4
• show the names of any submarines in yellow sea on twenty eight october
• list all the alerts
• what's jason's m-rating on mob
• give t-lam vessels that weren't deployed in november.

[Figure 8.8 FSN for the NG syntax: a word loop over the full vocabulary with optional silence and null arcs.]
The vocabulary thus includes many jargon words, such as m4, m-rating, mob, and t-lam, several long-content words, such as mishawaka's, diego-garcia, submarines, and november, and many short-function words, such as is, the, by, do, in, of, and on.

A wide range of sentences can be constructed from the 991-word vocabulary to query this database. It is possible to construct a finite-state network representation of the full grammar associated with all such sentences. The perplexity (average word branching factor) (see Section 8.7) of the full grammar network is computed to be about 9. However, such a network is rather large (because of the high degree of constraint among words within the vocabulary which form syntactically valid and semantically meaningful sentences), with upward of 50,000 arcs and 20,000 nodes, and cannot easily be implemented as a practical system. Instead, several types of FSN approximations to the full grammar have been constructed.
Perhaps the least constraining grammar (and the simplest to implement) is the no grammar (NG) case, in which any word in the vocabulary is allowed to follow any word in the vocabulary. Such an FSN has the property that, although its coverage of valid sentences is perfect, its overcoverage of the language (i.e., the ratio of sentences generated by the grammar to valid sentences within the task language) is extremely large. The perplexity of the FSN for the NG case is 991, since each word can follow every word in the grammar (assuming all words are essentially equiprobable). The FSN for the NG case is shown in Figure 8.8. (Note that the FSN of Figure 8.8 allows arbitrary phrasing, i.e., groups of words spoken together followed by a pause, because of the silence model and the null arcs.)
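To make the perplexity figure concrete, note that a model over V equiprobable words has per-word entropy log2 V and hence perplexity exactly V. A minimal sketch (the computation is standard; the code is ours, not the book's):

```python
import math

def perplexity(word_probs):
    """Perplexity = 2**H, where H is the per-word entropy of the model."""
    entropy = -sum(p * math.log2(p) for p in word_probs if p > 0)
    return 2 ** entropy

V = 991                                   # NG case: any word can follow any word
print(round(perplexity([1.0 / V] * V)))   # 991, matching the text
```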
A second FSN form of the task syntax is to create a word pair (WP) grammar that specifies explicitly which words can follow each of the 991 words in the vocabulary. The perplexity of this grammar is about 60, and the overcoverage, while significantly below that of the NG case, is still very high. Although the network of Figure 8.8 could be used for the WP grammar (by explicitly including the word pair information at node 2), a somewhat more efficient structure exploits the fact that only a subset of the vocabulary occurs as the first word in a sentence (B or beginning words), and only a subset of the vocabulary occurs as the last word in a sentence (E or ending words); hence we can partition the vocabulary into four nonoverlapping sets of words, namely

{BE} = set of words that can either begin or end a sentence, |BE| = 117
{BĒ} = set of words that can begin a sentence but cannot end a sentence, |BĒ| = 64
{B̄E} = set of words that cannot begin a sentence but can end a sentence, |B̄E| = 488
{B̄Ē} = set of words that cannot begin or end a sentence, |B̄Ē| = 322.

[Figure 8.9 FSN of the WP syntax.]

The resulting FSN, based on this partitioning scheme, is shown in Figure 8.9. This network has 995 real arcs and 18 null arcs. To account for silence between words (which is optional), each word arc bundle (e.g., nodes 1 to 4) is expanded to individual words followed by optional silence, as shown at the bottom of Figure 8.9. Hence the overall FSN allows recognition of sentences of the form

S: (silence) - {BE, BĒ} - (silence) - ({W}) ... ({W}) - (silence) - {BE, B̄E} - (silence).
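The partitioning itself is simple set algebra, sketched below; the six-word vocabulary, the begin/end sets, and the group names are invented for illustration.

```python
def partition_vocabulary(vocab, begin_words, end_words):
    """Split vocab into the four disjoint sets used by the WP network."""
    B, E = set(begin_words), set(end_words)
    return {
        "BE":     vocab & B & E,        # can begin or end a sentence
        "B_only": (vocab & B) - E,      # can begin but not end
        "E_only": (vocab & E) - B,      # can end but not begin
        "mid":    vocab - B - E,        # interior words only
    }

vocab = {"show", "list", "ships", "the", "alerts", "what"}
parts = partition_vocabulary(vocab,
                             begin_words={"show", "list", "what"},
                             end_words={"ships", "alerts"})
for name, words in parts.items():
    print(name, sorted(words))
```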
Finally, one could construct a task syntax based on statistical word bigram (or even trigram) probabilities; that is, we assign a probability p_ij to each word pair (W_i, W_j), where p_ij is the probability that W_i is followed immediately by W_j. That is, if W_n is the nth word in a string of words, then p_ij = P(W_n = W_j | W_{n-1} = W_i) is the language model according to Eq. (8.6). The advantage of the word bigram (WB) approach is that the perplexity is reduced considerably (to 20) for the Resource Management task, with essentially no increase in complexity of the implementation.
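A minimal maximum-likelihood bigram estimator in the spirit of the WB grammar (the two-sentence corpus is invented, and a real system would smooth these counts):

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate p_ij = P(W_n = W_j | W_{n-1} = W_i) by relative frequency."""
    pair_counts, history_counts = Counter(), Counter()
    for words in sentences:
        for prev, cur in zip(words[:-1], words[1:]):
            pair_counts[(prev, cur)] += 1
            history_counts[prev] += 1
    return {pair: c / history_counts[pair[0]] for pair, c in pair_counts.items()}

corpus = [["list", "all", "the", "alerts"],
          ["list", "all", "the", "ships"]]
p = train_bigram(corpus)
print(p[("the", "alerts")])   # 0.5: "the" is followed by "alerts" half the time
```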
8.8.1 Control of Word Insertion/Word Deletion Rate

Using a structure of the type shown in Figure 8.9, there is no control on the sentence length. That is, it is possible to generate sentences that are arbitrarily long by inserting a large number of short-function words. To prevent this from occurring, it is a simple matter to incorporate a word insertion penalty into the Viterbi decoding, such that a fixed negative quantity is added to the likelihood score at the end of each word arc (i.e., at nodes 5-8 in Figure 8.9). By adjusting the word penalty, we can control the rate of word insertion and word deletion; a very large word penalty will reduce the word insertion rate and increase the word deletion rate, and a very small penalty will have the opposite effect. A value for the word penalty is usually determined experimentally to balance these adverse effects.
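The effect of the penalty can be sketched as a fixed debit per word-end node; the scores, word counts, and penalty values below are illustrative, not tuned values from the text.

```python
def sentence_score(acoustic_logprob, n_words, word_penalty):
    """Viterbi sentence score with a per-word insertion penalty.

    Each word-end node (nodes 5-8 in Figure 8.9) debits the accumulated
    log-likelihood by word_penalty, so a hypothesis with many short words
    must earn its extra words acoustically.
    """
    return acoustic_logprob - n_words * word_penalty

# Two competing hypotheses with similar acoustic scores:
short_hyp = sentence_score(-118.0, n_words=4, word_penalty=2.5)   # -128.0
long_hyp  = sentence_score(-116.0, n_words=7, word_penalty=2.5)   # -133.5
print(short_hyp > long_hyp)   # True: the penalty suppresses word insertions
```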
8.8.2 Task Semantics

We have discussed how task syntax can be incorporated into the overall recognition structure. At the end of this chapter we will briefly describe a general procedure for integrating a semantic component into the recognizer.
8.8.3 System Performance on the Resource Management Task

Using the segmental k-means training algorithm, the set of 47 PLUs of Table 8.1 was trained using a set of 4360 sentences from 109 talkers. The likelihood scores were essentially unchanged after two iterations of the k-means loop. The number of mixtures per state was varied from 1 to 256 in multiples of 2 to investigate the effects of higher acoustic resolution on performance.

To evaluate the recognizer performance, five different sets of test data were used, including:
train 109: A randomly selected set of 2 sentences from each of the 109 training talkers; this set was used to evaluate the ability of the algorithm to recognize the training material
feb 89: A set of 30 sentences from each of 10 talkers, none of whom was in the training set; this set was distributed by DARPA in February of 1989 to evaluate performance
oct 89: A second set of 30 sentences from each of 10 additional talkers, none of whom was in the training set; this set was distributed by DARPA in October of 1989
jun 90: A set of 120 sentences from each of 4 new talkers, none of whom was in the training set (distributed by DARPA in June of 1990)
feb 91: A set of 30 sentences from each of 10 new talkers, none of whom was in the training set (distributed by DARPA in February of 1991).
[Figure 8.10 Word and sentence accuracies versus number of mixtures per state for the training subset using the WP syntax: (a) word accuracy, (b) sentence accuracy, for word penalties of 2.5 and 3.0.]
The recognizer performance was evaluated for each of the test sets, using both the WP and NG syntax, and with different word penalties. For all cases, evaluations were made using models with from 1 to 256 mixtures per state for each PLU.

The recognition results are presented in terms of word accuracy (percentage of words correct minus percentage of word insertions) and sentence accuracy, as a function of the number of mixtures per state for each PLU model. The alignment of the text of the recognized string with the text of the spoken string was performed using a dynamic programming alignment method as specified by DARPA.
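A sketch of this scoring computation: a Levenshtein-style DP alignment yields substitution, deletion, and insertion counts, from which word accuracy (percent correct minus percent insertions) follows. This illustrates the idea only; it is not the DARPA scoring software.

```python
def align_counts(ref, hyp):
    """DP alignment returning (substitutions, deletions, insertions)."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (total_errors, subs, dels, ins) for ref[:i] vs hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):               # ref words with no hyp: deletions
        t = cost[i - 1][0]
        cost[i][0] = (t[0] + 1, t[1], t[2] + 1, t[3])
    for j in range(1, m + 1):               # hyp words with no ref: insertions
        t = cost[0][j - 1]
        cost[0][j] = (t[0] + 1, t[1], t[2], t[3] + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            miss = 0 if ref[i - 1] == hyp[j - 1] else 1
            d = cost[i - 1][j - 1]
            sub = (d[0] + miss, d[1] + miss, d[2], d[3])
            u = cost[i - 1][j]
            dele = (u[0] + 1, u[1], u[2] + 1, u[3])
            l = cost[i][j - 1]
            ins = (l[0] + 1, l[1], l[2], l[3] + 1)
            cost[i][j] = min(sub, dele, ins)   # fewest total errors wins
    _, subs, dels, ins = cost[n][m]
    return subs, dels, ins

ref = "list all the alerts".split()
hyp = "list all of the alerts".split()
subs, dels, ins = align_counts(ref, hyp)
correct = len(ref) - subs - dels
print(100.0 * (correct - ins) / len(ref))   # 75.0: one insertion, 4-word ref
```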
The recognition results on the training subset (train 109) are given in Figures 8.10 (for the WP syntax) and 8.11 (for the NG syntax). The upper curves show word accuracy (in percentage) versus number of mixtures per state (on a logarithmic scale) for two different values of the word penalty, and the lower curves show sentence accuracy for the same parameters. A sharp and steady increase in accuracy is obtained as the number of mixtures per state increases, going from about 43.6% word accuracy (10.6% sentence accuracy) for 1 mixture per state to 97.3% word accuracy (83% sentence accuracy) for 256 mixtures per state for the WP syntax using a word penalty of 2.5. For the NG syntax (using a word penalty of 6.0), the comparable results were 24% word accuracy (0.9% sentence accuracy) for 1 mixture per state and 84.2% word accuracy (34.9% sentence accuracy) for 256 mixtures per state.

[Figure 8.11 Word and sentence accuracies versus number of mixtures per state for the training subset using the NG syntax: (a) word accuracy, (b) sentence accuracy, for word penalties of 5.0 and 6.0.]
The recognition results on the independent test sets are given in Figures 8.12 (for the WP syntax) and 8.13 (for the NG syntax). Although there are detailed differences in performance among the different test sets (especially for small numbers of mixtures per state), the performance trends are essentially the same for all the test sets. In particular, we see that for the WP syntax the range of word accuracies for 1 mixture per state is 42.9% (for feb 89) to 56.0% (for jun 90), whereas for 256 mixtures per state the range is 90.9% (for feb 89) to 93.0% (for jun 90). For the NG syntax, the range of word accuracies for 1 mixture per state is 20.1% (for feb 91) to 28.5% (for jun 90), and for 256 mixtures per state it is 68.5% (for oct 89) to 70.0% (for feb 91).

[Figure 8.12 Word and sentence accuracies versus number of mixtures per state for the four test sets using the WP syntax.]
Perhaps the most significant aspect of the performance is the difference in accuracies between the test sets and the training subset. Thus there is a gap of 4-7% in word accuracy for the WP syntax at 256 mixtures per state, and a gap of 14.2-15.7% for the NG syntax at 256 mixtures per state. Such gaps are indicative of the ability of the training procedure to overtrain (learn details) on the training set, thereby achieving significantly higher recognition accuracy on this set than on any other representative test set.

[Figure 8.13 Word and sentence accuracies versus number of mixtures per state for the four test sets using the NG syntax.]

The results presented in this section show that a simple set of context-independent PLUs can be trained for a continuous speech large vocabulary recognition task, using standard Viterbi training procedures, and be used to provide reasonably good recognition accuracy for a moderately complex task. The key issue now is what can be done, in meaningful ways, to improve recognizer performance. To answer this question, we will examine several possible extensions of the basic recognition system in the next few sections.
8.9 CONTEXT-DEPENDENT SUBWORD UNITS

There are several advantages to using a small basic set of context-independent subword units for large vocabulary speech recognition. First of all, we have shown that the models of these subword units are easily trained from fluent speech, with essentially no human decisions as to segmentation and labeling of individual sections of speech. Second, the resulting units are generalizable to new contexts (word vocabularies, tasks with different syntax and semantics) with no extra effort. Finally, the resulting models are relatively insensitive to the details of the context from which the training tokens are extracted. By this we mean that, in theory, we can derive the subword unit model parameters from two arbitrary but sufficiently large training sets of fluent speech (hopefully of the same size and general linguistic content but not necessarily the same vocabulary words and sentences) to obtain essentially the same parameter estimates for each model. In practice, this is almost the case.
However, there are situations in which the subword unit model parameters are extracted from a training set whose linguistic content matches the test set precisely, that is, when the training set is a set of sentences drawn from the recognition task (with the same vocabulary, syntax, and semantics). In such a case, the resulting subword units are somewhat "word sensitive" (showing higher likelihood scores than in the general case) and typically provide higher recognition performance than equivalent model sets derived from arbitrary input speech. In particular, for the Resource Management task discussed in the previous section, "word-sensitive" subword unit models, trained on task-specific training sentences, give about 10% higher word recognition accuracy than the same set of subword unit models trained on arbitrary sentences of comparable size. If the training set is increased in size by a factor of about 3, the word accuracy of the text-independent models approaches that of the word-sensitive models.

Obviously, this performance difference can be attributed to the fact that context-independent subword unit models are not adequate in representing the spectral and temporal properties of the speech unit in all contexts. (By context we mean the effects of the preceding and following sounds as well as the sound stress and intonation, and even the word in which the sound occurs.) The ultimate effect is a decrease in performance in word and sentence accuracy on speech-recognition tasks.
The solution to this problem is basically a simple, straightforward one, namely, to extend the set of subword units to include context-dependent units (either in addition to or as a replacement for context-independent units) in the recognition system. In theory, the only change necessary in either training or recognition is to modify the word lexicon to be consistent with the final set of subword units. Consider the word "above." Based on using (1) context-independent units, (2) triphone (left and right context) units, (3) multiple-phone models, and (4) word-dependent units, we could have the following lexical representations:

(1) above: ax b ah v (context-independent units)
(2) above: $-ax-b ax-b-ah b-ah-v ah-v-$ (triphones, context dependent)
(3) above: ax2 b2 ah1 v1 (multiple phone units)
(4) above: ax(above) b(above) ah(above) v(above) (word-dependent units).
In representation (2), using triphone units, the number of units needed for all sounds in all words is very large (on the order of 10,000-20,000). In practice, only a small percentage of such triphone units are used, since most units are seen rarely, if at all, in a finite training set. (We discuss this issue below in more detail.) In representation (3), using multiple models of each subword unit, the idea is to cluster common contexts together so as to reduce the number of context-dependent models. This leads to problems in defining lexical entries for words. (We discuss this issue further in a later section of this chapter.) Finally, the use of word-dependent units is most effective for modeling short-function words (like a, the, in, of, an, and, or) whose spectral variability is significantly greater than that of long-content words like aboard and battleship. (We discuss the modeling of function words in a later section of this chapter.) It is also both reasonable and meaningful to combine all four types of units in a common structure. In theory, as well as in practice, the training and recognition architectures can handle subword unit sets of arbitrary size and complexity. We now discuss each of these issues in more detail.
8.9.1 Creation of Context-Dependent Diphones and Triphones

Consider the basic set of context-independent PLUs in which we use the symbol p to denote an arbitrary PLU. We can define a set of context-dependent (CD) diphones as

pL-p-$   left context (LC) diphone
$-p-pR   right context (RC) diphone,

in which pL is the PLU immediately preceding p (the left context sound), pR is the PLU immediately following p (the right context sound), and $ denotes a don't care (or don't know) condition. Similarly, we can define a set of context-dependent triphones as

pL-p-pR   left-right context (LRC) triphone.

In theory, the potential number of left (or right) context diphones is 46 x 45 (for a basic set of 47 PLUs and excluding silence), or about 2070 left context diphone units. The potential number of left-right context triphone units is 45 x 46 x 45, or 93,150 units. In practice, the number of context-dependent PLUs actually seen in a finite training set of sentences is significantly smaller than these upper bounds.

To better understand these concepts, consider the RM task (991-word vocabulary) with a training set of 3990 sentences. To use diphone and triphone context-dependent units, we first convert the lexicon to such units using the rule that the initial sound becomes a right context diphone, the middle sounds become left-right context triphones, and the final sound becomes a left context diphone. Hence the word "above" is converted to the set of units $-ax-b, ax-b-ah, b-ah-v, ah-v-$. (We must use diphone units at the beginnings and ends of words because we do not know the preceding or following words.) The above rule is modified to eliminate triphone middles for words with only two PLUs (e.g., in, or) and to revert to the context-independent PLU for words with only one PLU (e.g., a). Using this method of creating the lexicon, one can count the number of left-right context-dependent (LRC) units (1778), the number of left-context (LC) units (279), the number of right-context (RC) units (280), and the number of context-independent (CI) units (3), for a total of 2340 PLUs in the training set. This number of units, although significantly smaller than the maximum possible number of context-dependent units, is deceiving because many of the units occur only a small number of times in the training set, and therefore it would be difficult to reliably estimate model parameters for such models.
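The conversion rule can be written down directly; the sketch below follows the rule just described, with '$' as the don't-care marker (the function name and phone spellings are ours, for illustration only):

```python
def word_to_cd_units(phones):
    """Convert a word's phone string to context-dependent units.

    Rule from the text: first sound -> right context diphone, interior
    sounds -> left-right context triphones, last sound -> left context
    diphone. One- and two-phone words fall back to simpler units.
    """
    n = len(phones)
    if n == 1:                                      # e.g. "a": stays CI
        return [phones[0]]
    units = [f"$-{phones[0]}-{phones[1]}"]          # right context diphone
    for i in range(1, n - 1):                       # left-right triphones
        units.append(f"{phones[i-1]}-{phones[i]}-{phones[i+1]}")
    units.append(f"{phones[n-2]}-{phones[n-1]}-$")  # left context diphone
    return units

print(word_to_cd_units(["ax", "b", "ah", "v"]))
# ['$-ax-b', 'ax-b-ah', 'b-ah-v', 'ah-v-$'], matching "above" in the text
print(word_to_cd_units(["ih", "n"]))
# ['$-ih-n', 'ih-n-$']: two-phone word, no triphone middle
```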
TABLE 8.4. Number of intra-word CD units as a function of count threshold, T.

Count Threshold (T) | Number of LRC PLUs | Number of LC PLUs | Number of RC PLUs | Number of CI PLUs | Total Number of PLUs
50 | 378 | 158 | 171 | 47 | 754
40 | 461 | 172 | 188 | 47 | 868
30 | 639 | 199 | 205 | 47 | 1090
20 | 952 | 234 | 212 | 46 | 1444
10 | 1302 | 243 | 258 | 44 | 1847
5 | 1608 | 265 | 270 | 32 | 2175
1 | 1778 | 279 | 280 | 3 | 2340
To combat the difficulties due to the small number of occurrences of some context-dependent units, one can use one of three strategies. Perhaps the simplest approach is to eliminate all models that don't occur sufficiently often in the training set. More formally, we define c(·) as the occurrence count for a given unit. Then, given a threshold T on the required number of occurrences of a unit (for reliable model estimation), a reasonable Unit Reduction Rule is:

If c(pL-p-pR) < T, then
1. pL-p-pR → $-p-pR   if c($-p-pR) ≥ T
2. pL-p-pR → pL-p-$   if c(pL-p-$) ≥ T
3. pL-p-pR → $-p-$    otherwise.

The tests above are made sequentially until one passes and the procedure terminates. To illustrate the sensitivity of the CD PLU set to the threshold T, Table 8.4 shows the counts of LRC PLUs, LC PLUs, RC PLUs, CI PLUs, and the total PLU count for the 3990-sentence training set. For a threshold of 50, which is generally adequate for estimating model parameters, there are only 378 LRC PLUs (almost a 5-to-1 reduction over the number with a count threshold of 1) and a total of 754 PLUs. We will see later that although such CD PLU sets do provide improvements in recognition performance over CI PLU sets, the amount of context dependency achieved is small and alternative techniques are required to create CD PLU sets.
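A direct transcription of the Unit Reduction Rule as a back-off chain (the count table is invented for illustration; '≥ T' is used for the first two tests, matching the reconstruction above):

```python
def reduce_unit(pl, p, pr, counts, T):
    """Back off a triphone to a diphone or CI unit when its count is low.

    Tests are applied in order; the first context that clears the
    threshold T wins.
    """
    if counts.get((pl, p, pr), 0) >= T:
        return (pl, p, pr)            # triphone seen often enough: keep it
    if counts.get(("$", p, pr), 0) >= T:
        return ("$", p, pr)           # fall back to right context diphone
    if counts.get((pl, p, "$"), 0) >= T:
        return (pl, p, "$")           # fall back to left context diphone
    return ("$", p, "$")              # otherwise: context-independent unit

counts = {("ax", "b", "ah"): 12, ("$", "b", "ah"): 65, ("ax", "b", "$"): 40}
print(reduce_unit("ax", "b", "ah", counts, T=50))   # ('$', 'b', 'ah')
```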
8.9.2 Using Interword Training to Create CD Units

Although the lexical entry for each word uses right or left context diphone units for the first and last sound of each word, both in training and in scoring one can utilize the known (or postulated) sequence of words to replace these diphone units with the triphone units appropriate to the words actually (or assumed) spoken. Hence the sentence "Show all ships" would be represented as

$-sh-ow  sh-ow-$  $-aw-l  aw-l-$  $-sh-i  sh-i-p  i-p-s  p-s-$

using only intraword units, whereas the sentence would be represented as
$-sh-ow  sh-ow-aw  ow-aw-l  aw-l-sh  l-sh-i  sh-i-p  i-p-s  p-s-$

using both intraword and interword units. From this simple example we see that, whereas there were only two triphones based on intraword units, there are six triphones based on intraword and interword units, that is, a threefold increase in context-dependent triphone units. (We are assuming no silence between words; it is straightforward to handle the cases when silence actually occurs between words.) To illustrate this effect, Figure 8.14 shows a plot of the number of intraword units, the number of interword units, and the combined count, as a function of the count threshold, for the 3990-sentence DARPA training set. More than 5000 interword triphone units occur one or more times, versus fewer than 2000 intraword units for the same count threshold.

[Figure 8.14 Plots of the number of intraword units, interword units, and combined units as a function of the count threshold.]
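For a known (or hypothesized) word string, the cross-word expansion can be sketched as follows, assuming no silence between words as in the example above (the function name and phone spellings are ours):

```python
def sentence_to_interword_units(words_phones):
    """Expand a word sequence into triphones across word boundaries.

    words_phones is a list of phone lists, one per word; the sentence is
    treated as one long phone string, so only the first and last sounds
    of the whole sentence keep '$' (unknown) contexts.
    """
    phones = [p for word in words_phones for p in word]
    units = [f"$-{phones[0]}-{phones[1]}"]
    for i in range(1, len(phones) - 1):
        units.append(f"{phones[i-1]}-{phones[i]}-{phones[i+1]}")
    units.append(f"{phones[-2]}-{phones[-1]}-$")
    return units

# "show all ships" with the illustrative phone spellings used above:
print(sentence_to_interword_units([["sh", "ow"], ["aw", "l"],
                                   ["sh", "i", "p", "s"]]))
# ['$-sh-ow', 'sh-ow-aw', 'ow-aw-l', 'aw-l-sh', 'l-sh-i',
#  'sh-i-p', 'i-p-s', 'p-s-$']
```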
Even when using interword units, the problem of estimating model parameters from a small number of occurrences of the units remains the major issue. In the next sections we discuss various ways of smoothing and interpolating context-dependent models, created from small numbers of occurrences in the training set, with context-independent models, created from large numbers of occurrences in the training set.
8.9.3 Smoothing and Interpolation of CD PLU Models

As shown above, we are faced with the following problem. For a training set of reasonable size, there is sufficient data to reliably train context-independent unit models. However, as the number of units becomes larger (by including more context dependencies), the amount of data available for each unit decreases and the model estimates become less reliable. Although there is no ideal solution to this problem (short of increasing the amount of training data ad infinitum), a reasonable compromise is to exploit the reliability of the estimates of the higher level (e.g., CI) unit models to smooth or interpolate the estimates of the lower level (CD) unit models. There are many ways in which such smoothing or interpolation can be achieved.

[Figure 8.15 Deleted interpolation model for smoothing discrete density models.]
The simplest way to smooth the parameter estimates for the CD models is to interpolate the spectral parameters with all higher (less context dependency) models that are consistent with the model [12]. By this we mean that the model for the CD unit pL-p-pR (call this λ(pL-p-pR)) should be interpolated with the models for the units $-p-pR (λ($-p-pR)), pL-p-$ (λ(pL-p-$)), and $-p-$ (λ($-p-$)). Such an interpolation of model parameters is meaningful only for discrete densities, within states of the HMM, based on a common codebook. Thus if each model λ is of the form (A, B, π), where B is a discrete density over a common codebook, then we can formulate the interpolation as

B̄(pL-p-pR) = α(pL-p-pR)·B(pL-p-pR) + α($-p-pR)·B($-p-pR) + α(pL-p-$)·B(pL-p-$) + α($-p-$)·B($-p-$),   (8.19)

where B̄(pL-p-pR) is the interpolated density. We constrain the αs to add up to 1; hence

α(pL-p-pR) + α($-p-pR) + α(pL-p-$) + α($-p-$) = 1.   (8.20)
The αs are determined according to the deleted interpolation algorithm discussed in Section 6.13; we review the ideas here as they apply to these speech unit models. Each of the discrete densities, B(pL-p-pR), B(pL-p-$), B($-p-pR), and B($-p-$), is estimated from the training data, from which a small percentage (e.g., 20%) is withheld (deleted). Using the withheld data, the αs are estimated using a standard forward-backward approach based on the HMM shown in Figure 8.15. The interpretation of the αs is essentially the probability-weighted percentage of new data (unseen in training) that favors each of the distributions over the others. Hence, for well-trained detailed models we get α(pL-p-pR) → 1, whereas for poorly trained models we get α(pL-p-pR) → 0 (i.e., the LRC model is essentially obtained from interpolating higher-level, lower context dependency models that are better trained than the detailed CD model).

Other smoothing methods include empirical estimates of the αs based on occurrence counts, co-occurrence smoothing based on joint probabilities of pairs of codebook symbols [14], and the use of fuzzy VQs in which an input spectral vector is coded into two or more codebook symbols.
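A sketch of the mixing step of Eq. (8.19) for discrete densities over a shared codebook; here the α weights are fixed by hand for illustration, whereas in the text they are trained by deleted interpolation on held-out data.

```python
import numpy as np

def interpolate_density(B_lrc, B_rc, B_lc, B_ci, alphas):
    """Eq. (8.19): mix a triphone density with its back-off densities.

    All four B vectors are discrete densities over the same codebook;
    alphas = (a_lrc, a_rc, a_lc, a_ci) must sum to 1 per Eq. (8.20).
    """
    a = np.asarray(alphas)
    assert np.isclose(a.sum(), 1.0)
    B = a[0] * B_lrc + a[1] * B_rc + a[2] * B_lc + a[3] * B_ci
    return B / B.sum()               # guard against rounding drift

# Toy 4-symbol codebook densities; a poorly trained triphone gets a small
# weight, so the smoothed density leans on the well-trained CI estimate.
B_lrc = np.array([0.70, 0.10, 0.10, 0.10])   # sparse, unreliable
B_rc  = np.array([0.40, 0.30, 0.20, 0.10])
B_lc  = np.array([0.35, 0.30, 0.20, 0.15])
B_ci  = np.array([0.25, 0.25, 0.25, 0.25])   # well trained, flat
print(interpolate_density(B_lrc, B_rc, B_lc, B_ci, (0.1, 0.2, 0.2, 0.5)))
```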
8.9.4 Smoothing and Interpolation of Continuous Densities

When one uses continuous density modeling of PLUs, it is very difficult to devise a good smoothing or interpolation algorithm, because the acoustic space of different units is inherently different. There are two reasonable ways to handle this problem. One is to exploit the so-called semicontinuous or tied-mixture modeling approach discussed earlier, in which each PLU uses a fixed set (a codebook) of mixture means and variances, and the only variables are the mixture gains for each model. In this case it is trivial to apply the method of deleted interpolation to the mixture gains in a manner virtually identical to the one discussed in the previous section.

An alternative modeling approach, one more in line with independent continuous density modeling of different sounds, is to use a tied-mixture approach at the CI unit level; that is, we design a separate (large) codebook of densities for each CI PLU and then constrain each derived CD unit to use the same mixture means and variances but with independent mixture gains. Again we can use the method of deleted interpolation to smooth the mixture gains in an optimal manner.
8.9.5 Implementation Issues Using CD Units

The FSN structure of Figure 8.9 is used to implement the continuous speech-recognition algorithm based on a given vocabulary and task syntax [15-19]. The structure is straightforward to implement when using strictly intraword units, because there is no effect of context at word boundaries. Hence the models (HMMs) for each word can be constructed independently and concatenated at the appropriate point of the processing. This is illustrated below for the recognition of the string "what {is, are}" based on intraword units, where the individual words are represented in the lexicon as

what: {$-w-aa, w-aa-t, aa-t-$}
is: {$-ih-z, ih-z-$}
are: {$-aa-r, aa-r-$}
