Overall Recognition System Based on Subword Units                451

[Figure 8.7 block diagram: speech input passes through spectral analysis to produce feature vectors, which feed a combined word-level match / sentence-level match (the continuous sentence recognizer) that outputs the recognized sentence. Word models are built by word model composition from the subword models and the lexicon; the sentence-level match is driven by a language model built from the grammar and semantics.]
`Figure 8.7 Overall block diagram of subword unit based continuous speech recognizer.
`
`features were also studied, but results on such systems will not be presented here.)
`The second step in the recognizer is a combined word-level/sentence-level match.
`The way this is accomplished is as follows. Using the set of subword HMMs and the word
`lexicon, a set of word models (HMMs) is created by concatenating each of the subword
`unit HMMs as specified in the word lexicon. At this point, the system is very similar to
`the connected word recognizers of Chapter 7. The way in which the sentence-level match
`is done is via an FSN realization of the word grammar (the syntax of the system) and the
`semantics as expressed in a composite FSN language model. The implementation of the
`combined word-level match/sentence-level match is via any of the structures described
`in Chapter 7. In particular, most systems use structures similar to the frame synchronous
`level-building method (usually with some type of beam search to restrict the range of paths)
`to solve for the "best" recognition sentence.
Consider using the recognizer of Figure 8.7 for a database management task called the
Naval Resource (Battleship) Management Task, as popularly defined within the DARPA
`community [13]. This task, which has a 991-word vocabulary (plus a separate silence
`word), can be used to query a database as to locations, attributes, constraints, history, and
`other information about ships within the database. Typical examples of sentences used to
`query the database include
`
`• what is mishawaka's percent fuel
`• total the ships that will arrive in diego-garcia by next month
`
`IPR2023-00037
`Apple EX1013 Page 302
`
`
`
452                Chap. 8    Large Vocabulary Continuous Speech Recognition
`
[Figure 8.8: a simple FSN in which any vocabulary word (plus a silence word) can follow any word, with null arcs permitting arbitrary phrasing.]

Figure 8.8 FSN for the NG syntax.
`
`• do any vessels that are in gulf of tonkin have asw mission area of m4
`• show the names of any submarines in yellow sea on twenty eight october
`• list all the alerts
• what's jason's m-rating on mob
`• give t-lam vessels that weren't deployed in november.
`
`The vocabulary thus includes many jargon words, such as m4, m-rating, mob, and t-lam,
and several long-content words, such as mishawaka's, diego-garcia, submarines, november,
`etc., and many short-function words, such as is, the, by, do, in, of, and on.
`A wide range of sentences can be constructed from the 991-word vocabulary to
`query this database. It is possible to construct a finite-state network representation of the
`full grammar associated with all such sentences. The perplexity (average word branching
`factor) (see Section 8.7) of the full grammar network is computed to be about 9. However,
`such a network is rather large (because of the high degree of constraint among words within
`the vocabulary which form syntactically valid and semantically meaningful sentences) with
upward of 50,000 arcs and 20,000 nodes, and cannot easily be implemented as a practical
system. Instead, several types of FSN approximations to the full grammar have been
constructed.
`Perhaps the least constraining grammar (and the simplest to implement) is the no
`grammar (NG) case, in which any word in the vocabulary is allowed to follow any word in
`the vocabulary. Such an FSN has the property that, although its coverage of valid sentences
`is perfect, its overcoverage of the language (i.e., the ratio of sentences generated by the
`grammar to valid sentences within the task language) is extremely large. The perplexity of
`the FSN for the NG case is 991, since each word can follow every word in the grammar
`(assuming all words are essentially equiprobable). The FSN for the NG case is shown in
`Figure 8.8. (Note that the FSN of Figure 8.8 allows arbitrary phrasing, i.e., groups of words
`spoken together followed by a pause, because of the silence model and the null arcs.)
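The NG perplexity claim is easy to check numerically. A minimal sketch, assuming only a uniform next-word distribution over the vocabulary:

```python
import math

def perplexity(word_probs):
    # Perplexity = 2^H, where H is the entropy (in bits) of the
    # next-word distribution; for equiprobable words this equals
    # the average word branching factor.
    h = -sum(p * math.log2(p) for p in word_probs if p > 0)
    return 2 ** h

# No-grammar case: any of the 991 words may follow any word,
# all words assumed essentially equiprobable.
uniform = [1.0 / 991] * 991
print(round(perplexity(uniform)))  # 991
```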
`A second FSN form of the task syntax is to create a word pair (WP) grammar that
`specifies explicitly which words can follow each of the 991 words in the vocabulary. The
`perplexity of this grammar is about 60, and the overcoverage, while significantly below
`that of the NG case, is still very high. Although the network of Figure 8.8 could be used for
`the WP grammar (by explicitly including the word pair information at node 2), a somewhat
`more efficient structure exploits the fact that only a subset of the vocabulary occurs as the
`first word in a sentence (B or beginning words), and only a subset of the vocabulary occurs
`
`
`
`
`
[Figure 8.9: FSN with word-arc bundles connecting the four vocabulary partitions, with optional silence between words and null arcs; each word arc bundle expands to individual words followed by optional silence.]

Figure 8.9 FSN of the WP syntax.
`
as the last word in a sentence (E or ending words); hence we can partition the vocabulary
into four nonoverlapping sets of words, namely

{BE} = set of words that can either begin or end a sentence, |BE| = 117
{BĒ} = set of words that can begin a sentence but cannot end a sentence, |BĒ| = 64
{B̄E} = set of words that cannot begin a sentence but can end a sentence, |B̄E| = 488
{B̄Ē} = set of words that cannot begin or end a sentence, |B̄Ē| = 322.
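The partitioning itself is simple to compute given lists of begin-capable and end-capable words. A sketch with a toy vocabulary (the word lists here are invented, not the actual RM sets):

```python
def partition_vocabulary(vocab, begin_words, end_words):
    # Split the vocabulary into the four nonoverlapping WP-grammar sets:
    # can begin and end, begin only, end only, neither.
    b, e = set(begin_words), set(end_words)
    parts = {"begin_end": [], "begin_only": [], "end_only": [], "neither": []}
    for w in vocab:
        if w in b and w in e:
            parts["begin_end"].append(w)
        elif w in b:
            parts["begin_only"].append(w)
        elif w in e:
            parts["end_only"].append(w)
        else:
            parts["neither"].append(w)
    return parts

parts = partition_vocabulary(
    ["what", "list", "fuel", "month", "the"],
    begin_words=["what", "list"],
    end_words=["fuel", "month", "list"])
print(parts["begin_end"], parts["neither"])  # ['list'] ['the']
```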
`The resulting FSN, based on this partitioning scheme, is shown in Figure 8.9. This
`network has 995 real arcs and 18 null arcs. To account for silence between words (which
`is optional), each word arc bundle (e.g., nodes 1 to 4) is expanded to individual words
`followed by optional silence, as shown at the bottom of Figure 8.9. Hence the overall FSN
`allows recognition of sentences of the form
S: (silence) - {BE, BĒ} - (silence) - ({W}) ... ({W}) - (silence) - {BE, B̄E} - (silence).
`
Finally, one could construct a task syntax based on statistical word bigram (or even
trigram) probabilities; that is, we assign a probability p_ij to each word pair (W_i, W_j), where
p_ij is the probability that W_i is followed immediately by W_j. That is, if W_n is the nth word
in a string of words, then p_ij = P(W_n = W_j | W_{n-1} = W_i) is the language model according
to Eq. (8.6). The advantage of the word bigram (WB) approach is that the perplexity
is reduced considerably (to 20) for the Resource Management task, with essentially no
increase in complexity of the implementation.
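The bigram estimates themselves are just normalized pair counts. A minimal maximum-likelihood sketch (the two toy sentences are invented for illustration):

```python
from collections import Counter

def bigram_probs(sentences):
    # Maximum-likelihood bigram estimates:
    # P(W_n = w_j | W_{n-1} = w_i) = count(w_i, w_j) / count(w_i as a predecessor).
    pair_counts, prev_counts = Counter(), Counter()
    for words in sentences:
        for prev, cur in zip(words, words[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return {(wi, wj): c / prev_counts[wi]
            for (wi, wj), c in pair_counts.items()}

probs = bigram_probs([["list", "the", "alerts"],
                      ["list", "the", "ships"]])
print(probs[("list", "the")], probs[("the", "alerts")])  # 1.0 0.5
```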
`
`8.8.1 Control of Word Insertion/Word Deletion Rate
`
`Using a structure of the type shown in Figure 8.9, there is no control on the sentence length.
`That is, it is possible to generate sentences that are arbitrarily long by inserting a large
`number of short-function words. To prevent this from occurring, it is a simple matter to
`incorporate a word insertion penalty into the Viterbi decoding, such that a fixed negative
`quantity is added to the likelihood score at the end of each word arc (i.e., at nodes 5-8 in
`Figure 8.9). By adjusting the word penalty, we can control the rate of word insertion and
`word deletion; a very large word penalty will reduce the word insertion rate and increase
`the word deletion rate, and a very small penalty will have the opposite effect. A value for
word penalty is usually experimentally determined to balance these adverse effects.
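As a sketch of the mechanism (the penalty value and word counts below are illustrative only), the penalty is simply subtracted from the running Viterbi log score once per decoded word, so hypotheses built from many short words are charged more:

```python
def penalized_score(acoustic_log_likelihood, num_words, word_penalty):
    # Subtract a fixed positive penalty from the Viterbi log score at
    # the end of each word arc; more words means a larger total charge.
    return acoustic_log_likelihood - word_penalty * num_words

# A hypothesis with many short function words needs a noticeably better
# acoustic score to survive a larger word penalty.
few_words  = penalized_score(-100.0, num_words=2, word_penalty=2.5)
many_words = penalized_score(-99.0,  num_words=4, word_penalty=2.5)
print(few_words > many_words)  # True
```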
`
`8.8.2 Task Semantics
`
We have discussed how task syntax can be incorporated into the overall recognition structure.
At the end of this chapter we will briefly describe a general procedure for integrating
a semantic component into the recognizer.
`
`8.8.3 System Performance on the Resource Management Task
`
Using the segmental k-means training algorithm, the set of 47 PLUs of Table 8.1 was trained
using a set of 4360 sentences from 109 talkers. The likelihood scores were essentially
unchanged after two iterations of the k-means loop. The number of mixtures per state was
varied from 1 to 256 in multiples of 2 to investigate the effects of higher acoustic resolution
on performance.
To evaluate the recognizer performance, five different sets of test data were used,
including:

train 109: A randomly selected set of 2 sentences from each of the 109 training talkers; this set was used to evaluate the ability of the algorithm to recognize the training material
feb 89: A set of 30 sentences from each of 10 talkers, none of whom was in the training set; this set was distributed by DARPA in February of 1989 to evaluate performance
oct 89: A second set of 30 sentences from each of 10 additional talkers, none of whom was in the training set; this set was distributed by DARPA in October of 1989
jun 90: A set of 120 sentences from each of 4 new talkers, none of whom was in the training set (distributed by DARPA in June of 1990)
feb 91: A set of 30 sentences from each of 10 new talkers, none of whom was in the training set (distributed by DARPA in February of 1991).
`
`
`
`
`
[Figure 8.10, "Word Pair Grammar": (a) word accuracy and (b) sentence accuracy, in percent, versus number of mixtures per state (1 to 256), for word penalties of 2.5 (train109-2.5wp) and 3.0 (train109-3.0wp).]

Figure 8.10 Word and sentence accuracies versus number of mixtures
per state for the training subset using the WP syntax.
`
`The recognizer performance was evaluated for each of the test sets, using both WP
`and NG syntax, and with different word penalties. For all cases, evaluations were made
`using models with from 1 to 256 mixtures per state for each PLU.
`The recognition results are presented in terms of word accuracy (percentage words
`correct minus percentage word insertions) and sentence accuracy as a function of the number
`of mixtures per state for each PLU model. The alignment of the text of the recognized
`string with the text of the spoken string was performed using a dynamic programming
`alignment method as specified by DARPA.
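A minimal sketch of such an alignment-based scorer is given below. This is not the official DARPA/NIST tool, and the backtrace tie-breaking here is one of several reasonable choices; it simply illustrates word accuracy as percent correct minus percent insertions:

```python
def word_accuracy(ref, hyp):
    # Dynamic-programming (Levenshtein) alignment of the recognized
    # word string against the spoken (reference) string.
    n, m = len(ref), len(hyp)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): dp[i][0] = i
    for j in range(m + 1): dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i-1] == hyp[j-1] else 1
            dp[i][j] = min(dp[i-1][j-1] + cost,  # match / substitution
                           dp[i-1][j] + 1,       # deletion
                           dp[i][j-1] + 1)       # insertion
    # Backtrace to count correct words and insertions.
    correct = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            correct += ref[i-1] == hyp[j-1]
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j-1] + 1:
            ins += 1
            j -= 1
        else:
            i -= 1
    # Word accuracy = % correct - % insertions, relative to reference length.
    return 100.0 * (correct - ins) / n

print(word_accuracy("show all the ships".split(),
                    "show all of the ships".split()))  # 75.0
```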
`The recognition results on the training subset (train 109) are given in Figures 8.10 (for
`the WP syntax) and 8.11 (for the NG syntax). The upper curves show word accuracy (in
`percentage) versus number of mixtures per state (on a logarithmic scale) for two different
`values of the word penalty, and the lower curves show sentence accuracy for the same
`
`
`
`
`
`
[Figure 8.11, "No Grammar Case": (a) word accuracy and (b) sentence accuracy versus number of mixtures per state (1 to 256), for word penalties of 6.0 (train109-6.0wp) and 5.0 (train109-5.0wp).]

Figure 8.11 Word and sentence accuracies versus number of mixtures
per state for the training subset using the NG syntax.
`
parameters. A sharp and steady increase in accuracy is obtained as the number of mixtures
per state increases, going from about 43.6% word accuracy (10.6% sentence accuracy) for
1 mixture per state to 97.3% word accuracy (83% sentence accuracy) for 256 mixtures
per state for the WP syntax using a word penalty of 2.5. For the NG syntax (using a
word penalty of 6.0), the comparable results were 24% word accuracy (0.9% sentence
accuracy) for 1 mixture per state and 84.2% word accuracy (34.9% sentence accuracy) for
256 mixtures per state.
`The recognition results on the independent test sets are given in Figures 8.12 (for
`the WP syntax) and 8.13 (for the NG syntax). Although there are detailed differences in
`performance among the different test sets (especially for small numbers of mixtures per
`state), the performance trends are essentially the same for all the test sets. In particular we
see that for the WP syntax, the range of word accuracies for 1 mixture per state is 42.9%
`
`
`
`
`
[Figure 8.12, "Word Pair Grammar": (a) word accuracy and (b) sentence accuracy versus number of mixtures per state (1 to 256) for the feb89-3wp, feb91-3wp, jun90-3wp, and oct89-3wp test sets.]

Figure 8.12 Word and sentence accuracies versus number of mixtures
per state for the four test sets using the WP syntax.
`
`(for feb 89) to 56.0% (for jun 90), whereas for 256 mixtures per state the range is 90.9%
`(for feb 89) to 93.0% (for jun 90). For the NG syntax, the range of word accuracies for 1
`mixture per state is 20.1 % (for feb 91) to 28.5% (for jun 90) and for 256 mixtures per state
`it is 68.5% (for oct 89) to 70.0% (for feb 91).
Perhaps the most significant aspect of the performance is the difference in accuracies
between the test sets and the training subset. Thus there is a gap of 4-7% in word
accuracy for the WP syntax at 256 mixtures per state, and a gap of 14.2-15.7% for the
`NG syntax at 256 mixtures per state. Such gaps are indicative of the ability of the training
`procedure to overtrain (learn details) on the training set, thereby achieving significantly
`higher recognition accuracy on this set than on any other representative test set.
`The results presented in this section show that a simple set of context-independent
`PLUs can be trained for a continuous speech large vocabulary recognition task, using
`
`
`
`
`
[Figure 8.13, "No Grammar Case": (a) word accuracy and (b) sentence accuracy versus number of mixtures per state (1 to 256) for the feb89-6wp, feb91-6wp, jun90-6wp, and oct89-6wp test sets.]

Figure 8.13 Word and sentence accuracies versus number of mixtures
per state for the four test sets using the NG syntax.
`
standard Viterbi training procedures, and be used to provide reasonably good recognition
accuracy for a moderately complex task. The key issue now is what can be done, in
meaningful ways, to improve recognizer performance. To answer this question, we will
examine several possible extensions of the basic recognition system in the next few sections.
`
8.9 CONTEXT-DEPENDENT SUBWORD UNITS
`
`There are several advantages to using a small basic set of context-independent subword
`units for large vocabulary speech recognition. First of all we have shown that the models
of these subword units are easily trained from fluent speech, with essentially no human
decisions as to segmentation and labeling of individual sections of speech. Second, the
`
`
`
`
`
resulting units are generalizable to new contexts (word vocabularies, tasks with different
syntax and semantics) with no extra effort. Finally, the resulting models are relatively
insensitive to the details of the context from which the training tokens are extracted. By
this we mean that, in theory, we can derive the subword unit model parameters from two
arbitrary but sufficiently large training sets of fluent speech (hopefully of the same size and
general linguistic content but not necessarily the same vocabulary words and sentences)
and obtain essentially the same parameter estimates for each model. In practice, this is
almost the case.
However, there are situations in which the subword unit model parameters are
extracted from a training set whose linguistic content matches the test set precisely; that
is, when the training set is a set of sentences drawn from the recognition task (with the
same vocabulary, syntax, and semantics). In such a case, the resulting subword units are
somewhat "word sensitive" (showing higher likelihood scores than in the general case)
and typically provide higher recognition performance than equivalent model sets derived
from arbitrary input speech. In particular, for the Resource Management task discussed
in the previous section, "word-sensitive" subword unit models, trained on task-specific
training sentences, give about 10% higher word recognition accuracy than the same set of
subword unit models trained on arbitrary sentences of comparable size. If the training set is
increased in size by a factor of about 3, the word accuracy of the text-independent models
approaches that of the word-sensitive models.
Obviously, this performance difference can be attributed to the fact that context-independent
subword unit models are not adequate in representing the spectral and temporal
properties of the speech unit in all contexts. (By context we mean the effects of the preceding
and following sounds, as well as the sound stress and intonation, and even the word in
which the sound occurs.) The ultimate effect is a decrease in word and
sentence accuracy on speech-recognition tasks.
The solution to this problem is basically a simple, straightforward one: namely, to
extend the set of subword units to include context-dependent units (either in addition to or
as a replacement for context-independent units) in the recognition system. In theory, the
only change necessary in either training or recognition is to modify the word lexicon to be
consistent with the final set of subword units. Consider the word "above." Based on using
(1) context-independent units, (2) triphone (left and right context) units, (3) multiple-phone
units, and (4) word-dependent units, we could have the following lexical representations:
`
(1) above: ax b ah v                                    Context-Independent Units
(2) above: $-ax-b  ax-b-ah  b-ah-v  ah-v-$              Triphones (Context Dependent)
(3) above: ax2 b2 ah1 v1                                Multiple Phone Units
(4) above: ax (above) b (above) ah (above) v (above)    Word-Dependent Units.
`
In representation (2), using triphone units, the number of units needed for all sounds in all
words is very large (on the order of 10,000-20,000). In practice, only a small percentage of
such triphone units is used, since most units are seen rarely, if at all, in a finite training set.
(We discuss this issue below in more detail.) In representation (3), using multiple models
of each subword unit, the idea is to cluster common contexts together so as to reduce the
`
`
`
`
`
`number of context-dependent models. This leads to problems in defining lexical entries for
`words. (We discuss this issue further in a later section of this chapter.) Finally, the use of
`word-dependent units is most effective for modeling short-function words (like a, the, in,
`of, an, and, or) whose spectral variability is significantly greater than that of long-content
`words like aboard and battleship. (We discuss the modeling of function words in a later
`section of this chapter.) Finally, it is both reasonable and meaningful to combine all four
`types of units in a common structure. In theory, as well as in practice, the training and
`recognition architectures can handle subword unit sets of arbitrary size and complexity.
`We now discuss each of these issues in more detail.
`
`8.9.1 Creation of Context-Dependent Diphones and Triphones
`
Consider the basic set of context-independent PLUs, in which we use the symbol p to denote
an arbitrary PLU. We can define a set of context-dependent (CD) diphones as

pL - p - $     left context (LC) diphone
$ - p - pR     right context (RC) diphone,

in which pL is the PLU immediately preceding p (the left context sound), pR is the PLU
immediately following p (the right context sound), and $ denotes a don't care (or don't
know) condition.
Similarly we can define a set of context-dependent triphones as

pL - p - pR    left-right context (LRC) triphone.
`
In theory, the potential number of left (or right) context diphones is 46 x 45 (for a basic set
of 47 PLUs and excluding silence), or about 2070 left context diphone units. The potential
number of left-right context triphone units is 45 x 46 x 45, or 93,150 units. In practice, the
number of context-dependent PLUs actually seen in a finite training set of sentences
is significantly smaller than these upper bounds.
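The upper bounds quoted above follow directly from the counting convention the text uses (46 possible center phones, 45 context phones once silence is excluded):

```python
# 47 PLUs in the base set; silence is excluded from context positions,
# following the counting convention in the text.
centers, contexts = 46, 45
diphones  = centers * contexts             # left- (or right-) context diphones
triphones = contexts * centers * contexts  # left-right context triphones
print(diphones, triphones)  # 2070 93150
```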
To better understand these concepts, consider the RM task (991-word vocabulary)
with a training set of 3990 sentences. To use diphone and triphone context-dependent units,
we first convert the lexicon to such units using the rule that the initial sound becomes a
right context diphone, the middle sounds become left-right context triphones, and the final
sound becomes a left context diphone. Hence the word "above" is converted to the set of
units $-ax-b, ax-b-ah, b-ah-v, ah-v-$. (We must use diphone units at the beginnings
and ends of words because we do not know the preceding or following words.) The above
rule is modified to eliminate triphone middles for words with only two PLUs (e.g., in, or)
and to revert to the context-independent PLU for words with only one PLU (e.g., a). Using
the above method of creating the lexicon, one can count the number of left-right context-dependent
(LRC) units (1778), the number of left-context (LC) units (279), the number of
right-context (RC) units (280), and the number of context-independent (CI) units (3), for a
total of 2340 PLUs in the training set. This number of units, although significantly smaller
than the maximum possible number of context-dependent units, is deceiving because many
of the units occur only a small number of times in the training set, and therefore it would
be difficult to reliably estimate model parameters for such models.
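The lexicon-conversion rule above can be sketched directly; the phone symbols follow the text's examples, and the handling of 1- and 2-PLU words follows the stated modifications:

```python
def to_cd_units(phones):
    # First sound -> right-context diphone, middle sounds -> LRC triphones,
    # last sound -> left-context diphone. Two-PLU words get no triphone
    # middles; one-PLU words stay context independent.
    if len(phones) == 1:
        return list(phones)
    units = ["$-{}-{}".format(phones[0], phones[1])]
    for i in range(1, len(phones) - 1):
        units.append("{}-{}-{}".format(phones[i-1], phones[i], phones[i+1]))
    units.append("{}-{}-$".format(phones[-2], phones[-1]))
    return units

print(to_cd_units(["ax", "b", "ah", "v"]))  # ['$-ax-b', 'ax-b-ah', 'b-ah-v', 'ah-v-$']
print(to_cd_units(["ih", "z"]))             # ['$-ih-z', 'ih-z-$']
```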
`
`
`
`
`
TABLE 8.4. Number of intra-word CD units as a function of count threshold, T.

Count           Number of   Number of   Number of   Number of   Total Number
Threshold (T)   LRC PLUs    LC PLUs     RC PLUs     CI PLUs     of PLUs
50              378         158         171         47          754
40              461         172         188         47          868
30              639         199         205         47          1090
20              952         234         212         46          1444
10              1302        243         258         44          1847
5               1608        265         270         32          2175
1               1778        279         280         3           2340
`
To combat the difficulties due to the small number of occurrences of some context-dependent
units, one can use one of three strategies. Perhaps the simplest approach is to
eliminate all models that don't occur sufficiently often in the training set. More formally,
we define c(·) as the occurrence count for a given unit. Then, given a threshold T on the
required number of occurrences of a unit (for reliable model estimation), a reasonable Unit
Reduction Rule is

If c(pL - p - pR) < T, then
1. pL - p - pR → $ - p - pR    if c($ - p - pR) ≥ T
2. pL - p - pR → pL - p - $    if c(pL - p - $) ≥ T
3. pL - p - pR → $ - p - $     otherwise.
`
`The tests above are made sequentially until one passes and the procedure terminates. To
`illustrate the sensitivity of the CD PLU set to the threshold T, Table 8.4 shows the counts of
`LRC PLUs, LC PLUs, RC PLUs, CI PLUs, and the total PLU count for the 3990 sentence
`training set. For a threshold of 50, which is generally adequate for estimating model
`parameters, there are only 378 LRC PLUs (almost a 5-to-1 reduction over the number with
`a count threshold of 1) and a total of 754 PLUs. We will see later that although such
`CD PLU sets do provide improvements in recognition performance over CI PLU sets, the
`amount of context dependency achieved is small and alternative techniques are required to
`create CD PLU sets.
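The Unit Reduction Rule maps directly to code. A sketch, with illustrative counts and threshold:

```python
def reduce_unit(pL, p, pR, counts, T):
    # Back off a rarely seen triphone pL-p-pR, per the Unit Reduction Rule:
    # try the right-context diphone, then the left-context diphone,
    # then fall back to the context-independent form $-p-$.
    tri = "{}-{}-{}".format(pL, p, pR)
    if counts.get(tri, 0) >= T:
        return tri  # seen often enough; keep the triphone
    if counts.get("$-{}-{}".format(p, pR), 0) >= T:
        return "$-{}-{}".format(p, pR)
    if counts.get("{}-{}-$".format(pL, p), 0) >= T:
        return "{}-{}-$".format(pL, p)
    return "$-{}-$".format(p)

counts = {"ax-b-ah": 3, "$-b-ah": 60, "ax-b-$": 10}
print(reduce_unit("ax", "b", "ah", counts, T=50))  # $-b-ah
```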
`
8.9.2 Using Interword Training to Create CD Units
`
`Although the lexical entry for each word uses right or left context diphone units for the
`first and last sound of each word, both in training and in scoring, one can utilize the known
`(or postulated) sequence of words to replace these diphone units with the triphone unit
`appropriate to the words actually (or assumed) spoken. Hence the sentence "Show all
`ships" would be represented as
`
$-sh-ow  sh-ow-$  $-aw-l  aw-l-$  $-sh-i  sh-i-p  i-p-s  p-s-$
`
`using only intraword units, whereas the sentence would be represented as
`
`
`
`
`
[Figure 8.14: number of intraword units alone, interword units alone, and both combined (0 to 8000) versus count threshold (0 to 100); all three curves fall off sharply as the threshold increases.]

Figure 8.14 Plots of the number of intraword units, interword units,
and combined units as a function of the count threshold.
`
$-sh-ow  sh-ow-aw  ow-aw-l  aw-l-sh  l-sh-i  sh-i-p  i-p-s  p-s-$
`
using both intraword and interword units. From this simple example we see that, whereas
there were only two triphones based on intraword units, there are six triphones based
on intraword and interword units; that is, a threefold increase in context-dependent triphone
units. (We are assuming no silence between words; it is straightforward to handle
the cases when silence actually occurs between words.) To illustrate this effect,
Figure 8.14 shows a plot of the number of intraword units, the number of interword units,
and the combined count, as a function of the count threshold, for the 3990 sentence DARPA
training set. More than 5000 interword triphone units occur one or more times, versus fewer
than 2000 intraword units for the same count threshold.
Even when using interword units, the problem of estimating model parameters
from a small number of occurrences of the units remains the major issue. In the next
sections we discuss various ways of smoothing and interpolating context-dependent models,
created from small numbers of occurrences in the training set, with context-independent
models, created from large numbers of occurrences in the training set.
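The interword-unit expansion illustrated above with "Show all ships" amounts to flattening the phone streams of consecutive words before windowing. A sketch, assuming no silence between words as in the example:

```python
def interword_triphones(words_phones):
    # Treat the phone streams of consecutive words as one string,
    # with $ (don't care) only at the sentence edges, then take
    # every three-phone window as a triphone unit.
    stream = [p for w in words_phones for p in w]
    padded = ["$"] + stream + ["$"]
    return ["{}-{}-{}".format(padded[i-1], padded[i], padded[i+1])
            for i in range(1, len(padded) - 1)]

# "Show all ships" with phones sh-ow, aw-l, sh-i-p-s:
print(interword_triphones([["sh", "ow"], ["aw", "l"], ["sh", "i", "p", "s"]]))
# ['$-sh-ow', 'sh-ow-aw', 'ow-aw-l', 'aw-l-sh', 'l-sh-i', 'sh-i-p', 'i-p-s', 'p-s-$']
```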
`
`8.9.3 Smoothing and Interpolation of CD PLU Models
`
`As shown above, we are faced with the following problem. For a training set of reasonable
`size, there is sufficient data to reliably train context-independent unit models. However, as
`the number of units becomes larger (by including more context dependencies) the amount
`of data available for each unit decreases and the model estimates become less reliable.
`Although there is no ideal solution to this problem (short of increasing the amount of
`training data ad infinitum), a reasonable compromise is to exploit the reliability of the
`estimates of the higher level (e.g., Cl) unit models to smooth or interpolate the estimates
`of the lower level (CD) unit models. There are many ways in which such smoothing or
`
`
`
`
`
[Figure 8.15: HMM in which parallel branches, weighted by the interpolation weights α($-p-$), etc., select among the densities B(pL-p-pR), B(pL-p-$), B($-p-pR), and B($-p-$).]

Figure 8.15 Deleted interpolation model for smoothing discrete
density models.
`
interpolation can be achieved.
The simplest way to smooth the parameter estimates for the CD models is to interpolate
the spectral parameters with all higher (less context dependency) models that are
consistent with the model [12]. By this we mean that the model for the CD unit pL - p - pR
(call this λ(pL-p-pR)) should be interpolated with the models for the units $ - p - pR (λ($-p-pR)),
pL - p - $ (λ(pL-p-$)), and $ - p - $ (λ($-p-$)). Such an interpolation of model parameters
is meaningful only for discrete densities, within states of the HMM, based on a common
codebook. Thus if each model λ is of the form (A, B, π), where B is a discrete density over
a common codebook, then we can formulate the interpolation as

B̄(pL-p-pR) = α(pL-p-pR) B(pL-p-pR) + α(pL-p-$) B(pL-p-$)
            + α($-p-pR) B($-p-pR) + α($-p-$) B($-p-$),          (8.19)

where B̄(pL-p-pR) is the interpolated density. We constrain the αs to add up to 1; hence

α(pL-p-pR) + α(pL-p-$) + α($-p-pR) + α($-p-$) = 1.              (8.20)
`
The way in which the αs are determined is according to the deleted interpolation algorithm
discussed in Section 6.13. We review the ideas, as they apply to these speech unit models,
here. Each of the discrete densities, B(pL-p-pR), B(pL-p-$), B($-p-pR), and B($-p-$), is estimated
from the training data, where a small percentage (e.g., 20%) is withheld (deleted). Using the
withheld data, the αs are estimated using a standard forward-backward approach based on
the HMM shown in Figure 8.15. The interpretation of the αs is essentially the probability-weighted
percentage of new data (unseen in training) that favors each of the distributions
over the others. Hence, for well-trained detailed models we get α(pL-p-pR) → 1, whereas for
poorly trained models we get α(pL-p-pR) → 0 (i.e., the LRC model is essentially obtained
by interpolating higher-level, lower context dependency models that are better trained
than the detailed CD model).
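The interpolation of Eq. (8.19) is just a convex combination of discrete densities over a common codebook. A sketch with given αs (in practice they would come from deleted interpolation on the withheld data; the density values here are toy numbers):

```python
def interpolate_discrete(b_lrc, b_lc, b_rc, b_ci, alphas):
    # Interpolated density of Eq. (8.19): a convex combination of the
    # triphone (LRC), diphone (LC, RC), and CI densities, all defined
    # over the same codebook.
    assert abs(sum(alphas) - 1.0) < 1e-9  # constraint of Eq. (8.20)
    return [sum(a * b[k] for a, b in zip(alphas, (b_lrc, b_lc, b_rc, b_ci)))
            for k in range(len(b_lrc))]

b_lrc = [0.7, 0.2, 0.1]   # sparsely trained triphone density
b_lc  = [0.5, 0.3, 0.2]
b_rc  = [0.4, 0.4, 0.2]
b_ci  = [0.3, 0.4, 0.3]   # well-trained CI density
smoothed = interpolate_discrete(b_lrc, b_lc, b_rc, b_ci,
                                alphas=(0.4, 0.2, 0.2, 0.2))
# smoothed remains a proper distribution (its entries sum to 1).
```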
Other smoothing methods include empirical estimates of the αs based on occurrence
counts, co-occurrence smoothing based on joint probabilities of pairs of codebook symbols
[14], and the use of fuzzy VQs, in which an input spectral vector is coded into two or more
codebook symbols.
`
`8.9.4 Smoothing and Interpolation of Continuous Densities
`
When one uses continuous density modeling of PLUs, it is very difficult to devise a good
smoothing or interpolation algorithm because the acoustic space of different units is inherently
different. There are two reasonable ways to handle this problem. One is to exploit
the so-called semicontinuous or tied mixture modeling approach discussed earlier, in which
each PLU uses a fixed set (a codebook) of mixture means and variances, and the only
variables are the mixture gains for each model. In this case it is trivial to apply the method
of deleted interpolation to the mixture gains in a manner virtually identical to the one
discussed in the previous section.
`An alternative modeling approach, and one more in line with independent continuous
`density modeling of different sounds, is to use a tied-mixture approach on the CI unit level;
`that is, we design a separate (large) codebook of densities for each CI PLU and then constrain
`each derived CD unit to use the same mixture means and variances but with independent
`mixture gains. Again we can use the method of deleted interpolation to smooth mixture
`gains in an optimal manner.
`
8.9.5 Implementation Issues Using CD Units
`
The FSN structure of Figure 8.9 is used to implement the continuous speech-recognition
algorithm based on a given vocabulary and task syntax [15-19]. The structure is straightforward
to implement when using strictly intraword units because there is no effect of
context at word boundaries. Hence the models (HMMs) for each word can be constructed
independently and concatenated at the appropriate point of the processing. This is illustrated
below for the recognition of the string "what {is, are}" based on intraword units,
where the individual words are represented in the lexicon as
`
what  {$-w-aa, w-aa-t, aa-t-$}
is    {$-ih-z, ih-z-$}
are   {$-aa-r, aa-r-$}
`
`w - aa - t
`$ - w