8

LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION

8.1 INTRODUCTION
`
`Throughout this book we have developed a wide range of tools, techniques, and algorithms
`for attacking several fundamental problems in speech recognition. In the previous chapter
`we saw how the different techniques came together to solve the connected word recognition
`problem. In this chapter we extend the concepts to include issues needed to solve the large
`vocabulary, continuous speech recognition problem. We will see that the fundamental ideas
`need modification because of the use of subword speech units; however, a great deal of the
`formalism for recognition, based on word units, is still preserved.
`The standard approach to large vocabulary continuous speech recognition is to assume
`a simple probabilistic model of speech production whereby a specified word sequence, W,
`produces an acoustic observation sequence Y, with probability P(W, Y). The goal is then
`to decode the word string, based on the acoustic observation sequence, so that the decoded
`string has the maximum a posteriori (MAP) probability, i.e.,
`
\hat{W} \ni P(\hat{W} \mid Y) = \max_{W} P(W \mid Y).                                   (8.1)

Using Bayes' Rule, Equation (8.1) can be written as

P(W \mid Y) = \frac{P(Y \mid W) P(W)}{P(Y)}.                                            (8.2)
`
Since P(Y) is independent of W, the MAP decoding rule of Eq. (8.1) is

\hat{W} = \arg\max_{W} P(Y \mid W) P(W).                                                (8.3)
`
The first term in Eq. (8.3), P(Y|W), is generally called the acoustic model, as it estimates the
probability of a sequence of acoustic observations, conditioned on the word string. The way
in which we compute P(Y|W), for large vocabulary speech recognition, is to build statistical
`models for subword speech units, build up word models from these subword speech
`unit models (using a lexicon to describe the composition of words), and then postulate
`word sequences and evaluate the acoustic model probabilities via standard concatenation
`methods. Such methods are discussed in Sections 8.2-8.4 of this chapter.
`The second term in Eq. (8.3), P(W), is generally called the language model, as it
`describes the probability associated with a postulated sequence of words. Such language
`models can incorporate both syntactic and semantic constraints of the language and the
`recognition task. Often, when only syntactic constraints are used, the language model
`is called a grammar and may be of the form of a formal parser and syntax analyzer, an
`N-gram word model (N = 2, 3, ... ), or a word pair grammar of some type. Generally
`such language models are represented in a finite state network so as to be integrated into
`the acoustic model in a straightforward manner. We discuss language models further in
`Section 8.5 of this chapter.
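To make Eq. (8.3) concrete, the sketch below scores a small set of explicitly enumerated candidate word strings by combining log acoustic and log language model scores. The function names (acoustic_logprob, language_logprob) and the exhaustive enumeration are illustrative assumptions; a practical recognizer searches this space with dynamic programming rather than enumerating it.

```python
import math

def map_decode(candidates, acoustic_logprob, language_logprob):
    """Pick the word string W maximizing P(Y|W)P(W), as in Eq. (8.3).

    candidates: iterable of postulated word sequences (tuples of words).
    acoustic_logprob(W): returns log P(Y|W) from the concatenated word/subword models.
    language_logprob(W): returns log P(W) from the language model.
    Working in the log domain avoids underflow for long observation sequences.
    """
    best, best_score = None, -math.inf
    for W in candidates:
        score = acoustic_logprob(W) + language_logprob(W)  # log P(Y|W) + log P(W)
        if score > best_score:
            best, best_score = W, score
    return best, best_score
```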
`We begin the chapter with a discussion of subword speech units. We formally define
`subword units and discuss their relative advantages (and disadvantages) as compared to
`whole-word models. We next show how we use standard statistical modeling techniques
`(i.e., hidden Markov models) to model subword units based on either discrete or continuous
`densities. We then show how such units can be trained automatically from continuous
`speech, without the need for a bootstrap model of each of the subword units. Next we
`discuss the problem of creating and implementing word lexicons (dictionaries) for use in
`both training and recognition phases. To evaluate the ideas discussed in this chapter we
`use a specified database access task, called the DARPA Resource Management (RM) task,
`in which there is a word vocabulary of 991 words (plus a silence or background word), and
`any one of several word grammars can be used. Using such a system, we show how a basic
`set of subword units performs on this task. Several directions for creating subword units
`which are more specialized are described, and several of these techniques are evaluated on
`the RM task. Finally we conclude the chapter with a discussion of how task semantics can
`be applied to further constrain the recognizer and improve overall performance.
`
8.2 SUBWORD SPEECH UNITS
`
`We began Chapter 2 with a discussion of the basic phonetic units of language and discussed
the acoustic properties of the phonemes in different speech contexts. We then argued
that the acoustic variability of the phonemes due to context was sufficiently large, and not
well enough understood, that such units would not be useful as the basis for speech models for
recognition. Instead, we have used whole-word models as the basic speech unit, both for
`
`isolated word recognition systems and for connected word recognition systems, because
`whole words have the property that their acoustic representation is well defined, and the
`acoustic variability occurs mainly in the region of the beginning and the end of the word.
Another advantage of using whole-word speech models is that it obviates the need for a
word lexicon, thereby making the recognition structure inherently simple.
The disadvantages of using whole-word speech models for continuous speech recognition
are twofold. First, to obtain reliable whole-word models, the number of word
utterances in the training set needs to be sufficiently large; i.e., each word in the vocabulary
should appear in each possible phonetic context several times in the training set.
`In this way the acoustic variability at the beginning and at the end of each word can be
modeled appropriately. For word vocabularies like the digits, we know that each digit
can be preceded and followed by every other digit; hence for an 11-digit vocabulary (zero
to nine plus oh), there are exactly 121 phonetic contexts (some of which are essentially
identical). Thus with a training set of several thousand digit strings, it is both realistic
and practical to see every digit in every phonetic context several times. Now consider a
vocabulary of 1000 words with an average of 100 phonetic contexts for both the beginning
and end of each word. To see each word in each phonetic context exactly once requires
100 x 1000 x 100 = 10 million carefully designed sentences. To see each combination 10
times requires 100 million such sentences. Clearly, the recording and processing of such
enormous amounts of speech data is both impractical and unthinkable. Second, with
`a large vocabulary the phonetic content of the individual words will inevitably overlap.
Thus storing and comparing whole-word patterns would be unduly redundant because the
constituent sounds of individual words are treated independently, regardless of their identifiable
similarities. Hence some more efficient speech representation is required for such
`large vocabulary systems. This is essentially the reason we use subword speech units.
`There are several possible choices for subword units that can be used to model speech.
`These include the following:
`
• Phonelike units (PLUs) in which we use the basic phoneme set (or some appropriately
modified set) of sounds but recognize that the acoustic properties of these units
could be considerably different than the acoustic properties of the "basic" phonemes
[1-7]. This is because we define the units based on linguistic similarity but model
`the unit based on acoustic similarity. In cases in which the acoustic and phonetic
`similarities are roughly the same (e.g., stressed vowels) then the phoneme and the
`PLU will be essentially identical. In other cases there can be large differences and a
`simple one-to-one correspondence may be inadequate in terms of modeling accuracy.
`Typically there are about 50 PLUs for English.
`• Syllable-like units in which we again use the linguistic definition of a syllable
`(namely a vowel nucleus plus the optional initial and final consonants or consonant
`clusters) to initially define these units, and then model the unit based on acoustic
`similarity. In English there are approximately 10,000 syllables.
`• Dyad or demisyllable-like units consisting of either the initial (optional) consonant
`cluster and some part of the vowel nucleus, or the remaining part of the vowel nucleus
`and the final (optional) consonant cluster [8]. For English there is something on the
`
`order of 2000 demisyllable-like units.
`• Acoustic units, which are defined on the basis of clustering speech segments from
`a segmentation of fluent, unlabeled speech using a specified objective criterion (e.g.,
`maximum likelihood) [9]. Literally a codebook of speech units is created whose
interpretation, in terms of classical linguistic units, is at best vague and at worst totally
nonexistent. It has been shown that a set of 256-512 acoustic units is appropriate for
`modeling a wide range of speech vocabularies.
`
`Consider the English word segmentation. Its representation according to each of the above
`subword unit sets is
`
• PLUs: /s/ /eh/ /g/ /m/ /ax/ /n/ /t/ /ey/ /sh/ /ax/ /n/ (11 units)
• syllables: /seg/ /men/ /ta/ /tion/ (4 syllables)
• demisyllables: /se/ /eg/ /ma/ /an/ /tey/ /eysh/ /sha/ /an/ (8 demisyllables)
`• acoustic units: 17 111 37 3 241 121 99 171 37 (9 acoustic units).
`
We see, from the above example, that the number of subword units for this word can be as
small as 4 (from a set of 10,000 units) or as large as 11 (from a set of 50 units).
`Since each of the above subword unit sets is capable of representing any word in the
`English language, the issues in the choice of subword unit sets are the context sensitivity
`and the ease of training the unit from fluent speech. (In addition, for acoustic units, an
`issue is the creation of a word lexicon since the units themselves have no inherent linguistic
`interpretation.) It should be clear that there is no ideal (perfect) set of subword units.
The PLU set is extremely context sensitive because each unit is potentially affected by its
predecessors (one or more) and its followers. However, there are only a small number of
PLUs and they are relatively easy to train. At the other extreme are the syllables, which
are the longest units and are the least context sensitive. However, there are so many of them
`that they are almost as difficult to train as whole-word models.
`For simplicity we will initially assume that we use PLUs as the basic speech units.
In particular we use the set of 47 PLUs shown in Table 8.1 (which includes an explicit
symbol for silence, h#). For each PLU we show an orthographic symbol (e.g., aa) and a
word associated with the symbol (e.g., father). (These symbols are essentially identical to
the ARPAbet alphabet of Table 2.1; lowercase symbols are used throughout this chapter
for consistency with the DARPA community.) Table 8.2 shows typical pronunciations
for several words from the DARPA RM task in terms of the PLUs in Table 8.1. A strong
advantage of using PLUs is the ease of creating word lexicons of the type shown in Table 8.2
from standard (electronic) dictionaries. We will see later in this chapter how we exploit the
advantages of PLUs, while reducing the context dependencies, by going to more specialized
PLUs which take into consideration either the left or right (or both) contexts in which the
PLU appears.
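To illustrate the kind of PLU-based word lexicon just described, the sketch below builds a tiny dictionary of transcriptions patterned after Table 8.2 and expands a word string into PLUs. The entries (and the alternate pronunciation listed for "a") are illustrative placeholders, not the actual RM lexicon.

```python
# A toy PLU-based pronunciation lexicon, patterned after Table 8.2.
# Entries and the alternate pronunciation are illustrative, not the book's exact lexicon.
LEXICON = {
    "a":      [["ax"], ["ey"]],             # unstressed vs. isolated pronunciation
    "above":  [["ax", "b", "ah", "v"]],
    "bad":    [["b", "ae", "d"]],
    "define": [["d", "iy", "f", "ay", "n"]],
}

def transcribe(words):
    """Expand a word string into a PLU string (first listed pronunciation of each word)."""
    plus = []
    for w in words:
        plus.extend(LEXICON[w.lower()][0])
    return plus

print(transcribe(["a", "bad", "define"]))
# ['ax', 'b', 'ae', 'd', 'd', 'iy', 'f', 'ay', 'n']
```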
`One problem with word lexicons of the type shown in Table 8.2 is that they don't easily
`account for variations in word pronunciation across different dialects and in the context of
`a sentence. Hence a simple word like "a" is often pronounced as /ey/ in isolation (e.g., the
`
`
TABLE 8.1. Set of basic PLUs for speech.

Number  Symbol  Word        Number  Symbol  Word
  1     h#      silence       26    k       kick
  2     aa      father        27    l       led
  3     ae      bat           28    m       mom
  4     ah      butt          29    n       no
  5     ao      bought        30    ng      sing
  6     aw      bough         31    ow      boat
  7     ax      again         32    oy      boy
  8     axr     diner         33    p       pop
  9     ay      bite          34    r       red
 10     b       bob           35    s       sis
 11     ch      church        36    sh      shoe
 12     d       dad           37    t       tot
 13     dh      they          38    th      thief
 14     eh      bet           39    uh      book
 15     el      bottle        40    uw      boot
 16     en      button        41    v       very
 17     er      bird          42    w       wet
 18     ey      bait          43    y       yet
 19     f       fief          44    z       zoo
 20     g       gag           45    zh      measure
 21     hh      hag           46    dx      butter
 22     ih      bit           47    nx      center
 23     ix      roses
 24     iy      beat
 25     jh      judge
`
TABLE 8.2. Typical word pronunciations (word lexicon) based on context-independent PLUs.

Word     Number of phones   Transcription
a               1           ax
above           4           ax b ah v
bad             3           b ae d
carry           4           k ae r iy
define          5           d iy f ay n
end             3           eh n d
gone            3           g ao n
hours           4           aw w axr z
`
Figure 8.1  HMM representations of a word (a) and a subword unit (b).
`
letter A), but is pronounced as /ax/ in context. Another example is a word like "data," which
can be pronounced as /d ey t ax/ or /d ae t ax/ depending on the speaker's dialect. Finally
`words like "you" are normally pronounced as /y uw/ but in context often are pronounced as
`/jh ax/ or /jh uh/. There are several ways of accounting for word pronunciation variability,
`including multiple entries in the word lexicon, use of phonological rules in the recognition
`grammar, and use of context dependent PLUs. We will discuss these options later in this
`chapter.
`
`8.3 SUBWORD UNIT MODELS BASED ON HMMS
`
`As we have shown several times in this book, the most popular way in which speech is
modeled is as a left-to-right hidden Markov model. As shown in Figure 8.1a, a whole-word
model typically uses a left-to-right HMM with N states, where N can be a fixed value (e.g.,
5-10 for each word), or can be variable with the number of sounds (phonemes) in the
word, or can be set equal to the average number of frames in the word. For subword units,
typically, the number of states in the HMM is set to a fixed value, as shown in Figure 8.1b
where a three-state model is used. This means that the shortest tokens of each subword
`unit must last at least three frames, a restriction that seems reasonable in practice. (Models
`that use jumps to eliminate this restriction have been studied [2].)
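A minimal sketch of the three-state left-to-right topology of Figure 8.1b is shown below; the particular transition probabilities are placeholders rather than trained values.

```python
import numpy as np

# Three-state left-to-right HMM topology for one subword unit (Figure 8.1b).
# Each state may loop on itself or advance to the next state; with no skips,
# a token must spend at least one frame in each state (>= 3 frames total).
# The numerical values below are illustrative placeholders, not trained ones.
N_STATES = 3
A = np.array([
    [0.6, 0.4, 0.0],   # state 1: self-loop or go to state 2
    [0.0, 0.6, 0.4],   # state 2: self-loop or go to state 3
    [0.0, 0.0, 1.0],   # state 3: self-loop (exit handled by unit concatenation)
])
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a probability distribution
```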
`To represent the spectral density associated with the states of each subword unit,
`one of three approaches can be used. These approaches are illustrated in Figure 8.2.
`Perhaps the simplest approach is to design a VQ-based codebook for all speech sounds (as
`shown in part a of the figure). For this approach the probability density of the observed
`
Figure 8.2  Representations of the acoustic space of speech by (a) partitioned VQ cells, (b) sets of continuous mixture Gaussian densities, and (c) a continuous-density codebook (after Lee et al. [7]).

For this approach the probability density of the observed
`
`spectral sequence within each state of each PLU is simply a discrete density defined over
`the codebook vectors. The interpretation of the discrete density within a state is that of
`implicitly isolating the part of the acoustic space in which the spectral vectors occur and
`assigning the appropriate codebook vector ( over that part of the space) a fixed probability for
`spectral vectors within each isolated region regardless of its proximity to the corresponding
codebook vector. A second alternative, illustrated in part b of Figure 8.2, is to represent
`the continuous probability density in each subword unit state by a mixture density that
`explicitly defines the part of the acoustic space in which the spectral vectors occur. Each
`mixture component has a spectral mean and variance that is highly dependent on the spectral
`characteristics of the subword unit (i.e., highly localized in the acoustic space). Hence the
`models for different subword units usually do not have substantial overlap in the acoustic
`space. Finally, a third alternative is to design a type of continuous density codebook over
`the entire acoustic space, as illustrated in part c of Figure 8.2. Basically the entire acoustic
`space is covered by a set of independent Gaussian densities, derived in much the same
`way as the discrete VQ codebook, with the resulting set of means and covariances stored
`in a codebook. This alternative is a compromise between the previous two possibilities. It
`differs from the discrete density case in the way the probability of an observation vector is
computed; instead of assigning a fixed probability to any observation vector that falls within
`an isolated region, it actually determines the probability according to the closeness of the
`observation vector to the codebook vector (i.e., it calculates the exponents of the Gaussian
`distributions). For each state of each subword unit, the density is assumed to be a mixture of
`the fixed codebook densities. Hence, even though each state is characterized by a continuous
`mixture density, one need only estimate the set of mixture gains to specify the continuous
`density completely. Furthermore, since the codebook set of Gaussian densities is common
`for all states of all subword models, one can precompute the likelihoods associated with
`an input spectral vector for each of the codebook vectors, and ultimately determine state
`likelihoods using only a simple dot product with the state mixture gains. This represents
`a significant computational reduction over the full mixture continuous density case. This
mixed density method has been called the tied mixture approach [10,28] as well as the
semicontinuous modeling method [11] and has been applied to the entire acoustic space
`as well as to pieces of the acoustic space for detailed PLU modeling. This method can be
`further extended to the case in which a set of continuous density codebooks is designed,
`one for each state of each basic (context independent) speech unit. One can then estimate
`sets of mixture gains appropriate to context dependent versions of each basic speech unit
`and use them appropriately for recognition. We will return to this issue later in this chapter.
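The computational shortcut described above can be sketched as follows: the shared codebook Gaussians are evaluated once per frame, and each state likelihood is then a single dot product with that state's mixture gains. The codebook size, feature dimension, and parameter values below are illustrative assumptions.

```python
import numpy as np

# Tied-mixture (semicontinuous) state likelihood sketch.
# One shared codebook of K Gaussian densities covers the acoustic space; each
# HMM state j stores only its mixture gains c_j (a length-K probability vector).
# Codebook size, feature dimension, and all numerical values are illustrative.
K, D = 256, 12
rng = np.random.default_rng(0)
means = rng.normal(size=(K, D))      # placeholder codebook means
covs = np.ones((K, D))               # placeholder diagonal covariances

def codebook_densities(o):
    """Evaluate all K shared diagonal-Gaussian densities N_k(o) once per frame."""
    diff = o - means                                               # (K, D)
    log_norm = -0.5 * (D * np.log(2.0 * np.pi) + np.log(covs).sum(axis=1))
    return np.exp(log_norm - 0.5 * ((diff ** 2) / covs).sum(axis=1))

def state_likelihood(densities, gains):
    """b_j(o) = sum_k c_jk N_k(o): one dot product per state."""
    return float(np.dot(gains, densities))

o = rng.normal(size=D)
dens = codebook_densities(o)         # computed once, shared by every state
c_j = np.full(K, 1.0 / K)            # placeholder mixture gains for one state
print(state_likelihood(dens, c_j))
```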
`
`8.4 TRAINING OF SUBWORD UNITS
`
Implicitly it would seem that training of the models for subword units would be extremely
difficult, because there is no simple way to create a bootstrap model of such short, imprecisely
defined, speech sounds. Fortunately, this is not the case. The reason for this is
the inherent tying of subword units across words and sentences; that is, every
subword unit occurs a large number of times in any reasonable size training set. Hence
estimation algorithms like the forward-backward procedure, or the segmental k-means
algorithm, can start with a uniform segmentation (flat or random initial models) and rapidly
converge to the best model estimates in just a few iterations.
`To illustrate how models of subword units are estimated, assume we have a labeled
`training set of speech sentences, where each sentence consists of the speech waveform and
`its transcription into words. (We assume that waveform segmentation into words is not
`available.) We further assume a word lexicon is available that provides a transcription of
`every word in the training set strings in terms of the set of subword units being trained. We
`assume that silence can (but needn't) precede or follow any word within a sentence (i.e.,
`we allow pauses in speaking), with silence at the beginning and end of each sentence the
`most likely situation. Based on the above assumptions, a typical sentence in the training
`set can be transcribed as
`
S_W:  silence ⊕ W_1 ⊕ (silence) ⊕ W_2 ⊕ (silence) ⊕ ··· ⊕ W_I ⊕ silence,
`
in which each W_i, 1 ≤ i ≤ I, is a word in the lexicon. Hence the sentence "show all alerts"
is a three-word sentence with W_1 = show, W_2 = all, and W_3 = alerts. Each word can
be looked up in the lexicon to find its transcription in terms of subword units. Hence the
sentence S can be written in terms of subword units as

S_U:  U_1(W_1) U_2(W_1) ... U_{L(W_1)}(W_1) ⊕ U_1(W_2) U_2(W_2) ... U_{L(W_2)}(W_2) ⊕
      U_1(W_3) U_2(W_3) ... U_{L(W_3)}(W_3) ⊕ ··· ⊕ U_1(W_I) U_2(W_I) ... U_{L(W_I)}(W_I),

where L(W_1) is the length (in units) of word W_1, etc.

Figure 8.3  Representation of a sentence, word, and subword unit in terms of FSNs.

Finally we replace each subword unit
`by its HMM (the three-state models shown in Figure 8.1) and incorporate the assumptions
`about silence between words to give an extended HMM for each sentence.
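A sketch of this expansion from word transcription to subword-unit sequence (with optional interword silence marked for the composite model) is given below; the lexicon entries and the silence-marker convention are simplified placeholders for illustration.

```python
# Sketch: expand a training sentence into a sequence of subword units.
# LEXICON maps each word to a PLU transcription; in a full system each PLU
# (plus "h#" for silence) would then be replaced by its three-state HMM.
# The entries below are illustrative placeholders, not the RM lexicon.
LEXICON = {"show": ["sh", "ow"], "all": ["ao", "l"],
           "alerts": ["ax", "l", "er", "t", "s"]}

def sentence_units(words, optional_silence=True):
    """Return the PLU sequence for a sentence, with optional silence between words."""
    units = ["h#"]                                   # silence at sentence start
    for i, w in enumerate(words):
        units.extend(LEXICON[w])
        if optional_silence and i < len(words) - 1:
            units.append("(h#)")                     # marker: silence may be skipped here
    units.append("h#")                               # silence at sentence end
    return units

print(sentence_units(["show", "all", "alerts"]))
# ['h#', 'sh', 'ow', '(h#)', 'ao', 'l', '(h#)', 'ax', 'l', 'er', 't', 's', 'h#']
```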
`The above process is illustrated (in general) in Figure 8.3. We see that a sentence
`is represented as a finite-state network (FSN) where the arcs are either words or silence
or null arcs (where a null (φ) transition is required to skip the alternative silence). Each
`word is represented as an FSN of subword units and each subword unit is represented as a
`three-state HMM.
`Figure 8.4 shows the process of creating the composite FSN for the sentence "Show all
`alerts," based on a single-word pronunciation lexicon. One feature of this implementation
is the use of a single-state HMM for the silence word. This is used (rather than the three-state
HMMs used for each PLU), since silence is generally stationary and has no temporal
`structure to exploit.
When there are multiple representations of words in the lexicon (e.g., for two or more
distinct pronunciations) it is easy to modify the FSN of Figures 8.3 and 8.4 to add parallel
`paths for the word arcs. (We will see that only one path is chosen in training, namely the
`best representation of the actual word pronunciation in the context of the spoken sentence.)
`Furthermore, multiple models of each subword unit can be used by introducing parallel
`paths in the word FSNs and then choosing the best version of each subword unit in the
`decoding process.
`
Figure 8.4  Creation of the composite FSN for the sentence "Show all alerts."
`
`Once a composite sentence FSN is created for each sentence in the training set, the
`training problem becomes one of estimating the subword unit model parameters which
`maximize the likelihood of the models for all the given training data. The maximum
`likelihood parameters can be solved for using either the forward-backward procedure (see
`Ref. [2] for example) or the segmental k-means training algorithm. The way in which
`we use the segmental k-means training procedure to estimate the set of model parameters
`(based on using a mixture density with M mixtures/state) is as follows:
`
`1. Initialization: Linearly segment each training utterance into units and HMM states
`assuming no silence between words (i.e., silence only at the beginning and end of
`each sentence), a single lexical pronunciation of each word, and a single model for
`each subword unit. Figure 8.5, iteration 0, illustrates this step for the first few units
`of one training sentence. Literally we assume every unit is of equal duration initially.
`2. Clustering: All feature vectors from all segments corresponding to a given state (i)
`of a given subword unit are partitioned into M clusters using the k-means algorithm.
`(This step is iterated for all states of all subword units.)
3. Estimation: The mean vectors, μ_ik, the (diagonal) covariance matrices, U_ik, and the
mixture weights, c_ik, are estimated for each cluster k in state i. (This step is iterated
for all states of all subword units.)
4. Segmentation: The updated set of subword unit models (based on the estimation of
step 3) is used to resegment each training utterance into units and states (via Viterbi
decoding). At this point multiple lexical entries can be used for any word in the
vocabulary. Figure 8.5 shows the result of this resegmentation step for iterations 1-4
and 10 for one training utterance. It can be seen that by iteration 2 the segmentation
into subword units is remarkably stable.
5. Iteration: Steps 2-4 are iterated until convergence (i.e., until the overall likelihoods
stop increasing).

Figure 8.5  Segmentations of a training utterance resulting from the segmental k-means training for the first several iterations (after Lee et al. [7]).
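In outline, the five-step procedure above corresponds to the training loop sketched below; the helper routines passed in (lexicon expansion, uniform segmentation, k-means clustering, mixture estimation, and Viterbi resegmentation) are assumed to be provided elsewhere rather than implemented here.

```python
def segmental_kmeans(training_set, expand, uniform_segment, kmeans_cluster,
                     estimate_mixtures, viterbi_segment, n_iter=10):
    """Outline of segmental k-means training of subword-unit HMMs.

    training_set: list of (feature_frames, word_transcription) pairs.
    The callables are assumed to exist elsewhere:
      expand(words)                     -> PLU sequence via the word lexicon
      uniform_segment(x, units)         -> equal-duration (unit, state) frame labels
      kmeans_cluster(segments)          -> per-(unit, state) clusters of feature vectors
      estimate_mixtures(clusters)       -> means, diagonal covariances, mixture weights
      viterbi_segment(x, units, models) -> realigned (unit, state) frame labels
    """
    # Step 1: initialization -- linear (equal-duration) segmentation of each utterance.
    segments = [uniform_segment(x, expand(words)) for x, words in training_set]
    models = None
    for _ in range(n_iter):
        # Step 2: cluster the frames assigned to each state of each subword unit.
        clusters = kmeans_cluster(segments)
        # Step 3: estimate mixture parameters (mu_ik, U_ik, c_ik) for every cluster.
        models = estimate_mixtures(clusters)
        # Step 4: resegment every utterance with the updated models (Viterbi decoding).
        segments = [viterbi_segment(x, expand(words), models) for x, words in training_set]
        # Step 5: in practice, stop once the overall likelihood no longer increases.
    return models
```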
`
`Figure 8.6 illustrates the resulting segmentation of the first few units of the utterance
`
Figure 8.6  Segmentation of an utterance ("What is the Constellation's gross displacement in long tons") into PLUs (after Lee et al. [7]).
`
`"What is the constellation .... " Shown in this figure are the power contour in dB (upper
`panel), the running LPC spectral slices (the middle panel), and the likelihood scores and
`delta-cepstral values (lower panel) for the first second of the sentence. The resulting
`segmentations are generally remarkably consistent with those one might manually choose
`based on acoustic-phonetic criteria. Since we use an acoustic criterion for choice of
`segmentation points, the closeness of PLU units to true phonetic units is often remarkable,
`especially in light of the phonetic variability in word pronunciation discussed previously.
In summary we have shown how one can use a training set of speech sentences
that have only word transcriptions associated with each sentence and optimally determine
the parameters of a set of subword unit HMMs. The resulting parameter estimates are
extremely robust to the training material as well as to details of word pronunciation as
obtained from the word lexicon. The reason for this is that a common word lexicon (with
associated word pronunciation errors) is used for both training and recognition; hence
errors in associating proper subword units to words are consistent throughout the process
and are less harmful than they would be in alternative methods of estimating parameters of
subword models.
The results of applying the segmental k-means training procedure to a set of 3990
training sentences from 109 different talkers, in terms of PLU counts and PLU likelihood
scores, are shown in Table 8.3.
`
TABLE 8.3. PLU statistics on count and average likelihood score.

PLU    Count     %      Average likelihood   (Rank)
h#     10638    6.9          18.5             (1)
r       8997    5.8           8.4             (45)
t       8777    5.7           9.7             (37)
ax      8715    5.6           7.1             (47)
s       8625    5.6          15.4             (3)
n       8478    5.5           8.3             (46)
ih      6542    4.2           9.9             (35)
iy      5816    3.7          12.0             (17)
d       5391    3.5           8.5             (44)
ae      4873    3.1          13.3             (10)
l       4857    3.1           8.9             (41)
z       4733    3.0          12.4             (14)
eh      4604    3.0          11.2             (21)
k       4286    2.8          10.6             (27)
p       3793    2.4          14.3             (6)
m       3625    2.3           8.5             (43)
ao      3489    2.2          10.4             (32)
f       3276    2.1          17.7             (2)
ey      3271    2.1          14.5             (5)
w       3188    2.1          10.2             (34)
ix      3079    2.0           8.7             (42)
dh      2984    1.9          11.8             (18)
v       2979    1.9          12.0             (16)
aa      2738    1.8          10.3             (33)
b       2138    1.4          10.7             (25)
y       2137    1.4          13.1             (11)
uw      2032    1.3          10.6             (26)
sh      1875    1.2          13.1             (12)
ow      1875    1.2          10.9             (24)
axr     1825    1.2           9.5             (38)
ah      1566    1.0          11.3             (20)
dx      1548    1.0          10.4             (31)
ay      1527    1.0          13.9             (8)
en      1478    0.9           9.1             (40)
g       1416    0.9           9.8             (36)
hh      1276    0.8          11.4             (19)
th       924    0.6          14.1             (7)
ng       903    0.6           9.1             (39)
ch       885    0.6          12.5             (13)
el       863    0.6          11.0             (23)
er       852    0.5          10.6             (29)
jh       816    0.5          10.6             (28)
aw       682    0.4          13.6             (9)
uh       242    0.2          11.0             (22)
zh       198    0.1          12.2             (15)
oy       130    0.1          15.3             (4)
nx        57    0.04         10.4             (30)

A total of 155,000 PLUs occurred in the 3990 sentences, with silence (h#) having the most
occurrences (10,638 or 6.9% of the total) and nx (flapped
`
`
n) having the fewest occurrences (57 or 0.04% of the total). In terms of average likelihood
scores, silence (h#) had the highest score (18.5) followed by f (17.7) and s (15.4), while
ax had the lowest score (7.1), followed by n (8.3) and r (8.4). (Note that, in this case, a
`higher average likelihood implies less variation among different renditions of the particular
`sound.) It is interesting to note that the PLUs with the three lowest average likelihood
`scores (ax, n, and r) were among the most frequently occurring sounds (r was second,
`n sixth, and ax fourth in frequency of occurrence). Similarly, some of the sounds with
`the highest likelihood scores were among the least occurring sounds (e.g., oy was fourth
`according to likelihood score but 21st according to frequency of occurrence).
`
8.5 LANGUAGE MODELS FOR LARGE VOCABULARY SPEECH RECOGNITION
`
`Small vocabulary speech-recognition systems are used primarily for command-and-control
`applications where the vocabulary words are essentially acoustic control signals that the
system has to respond to. (See Chapter 9 for a discussion of command-and-control applications
of speech recognition.) As such, these systems generally do not rely heavily on
`language models to accomplish their selected tasks. A large vocabulary speech-recognition
`system, however, is generally critically dependent on linguistic knowledge embedded in
`the input speech. Therefore, for large vocabulary speech recognition, incorporation of
`knowledge of the language, in the form of a "language" model, is essential. In this section
`we discuss a statistically motivated framework for language modeling.
`The goal of the (statistical) language model is to provide an estimate of the probability
of a word sequence W for the given recognition task. If we assume that W is a specified
sequence of words, i.e.,

W = w_1 w_2 \cdots w_Q,                                                                 (8.4)

then it would seem reasonable that P(W) can be computed as

P(W) = P(w_1 w_2 \cdots w_Q)
     = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1 w_2) \cdots P(w_Q \mid w_1 w_2 \cdots w_{Q-1}).   (8.5)
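As a quick numerical illustration of the chain-rule decomposition in Eq. (8.5), the sketch below accumulates conditional word probabilities in the log domain (to avoid underflow for long sentences); the probability values themselves are invented purely for illustration.

```python
import math

def sentence_logprob(words, cond_prob):
    """log P(w_1 ... w_Q) = sum_j log P(w_j | w_1 ... w_{j-1}), per Eq. (8.5).

    cond_prob(history, word) returns P(word | history); here it is a toy lookup.
    """
    total = 0.0
    for j, w in enumerate(words):
        total += math.log(cond_prob(tuple(words[:j]), w))
    return total

# Toy conditional probabilities (illustrative numbers only).
TOY = {((), "show"): 0.05,
       (("show",), "all"): 0.10,
       (("show", "all"), "alerts"): 0.20}
print(sentence_logprob(["show", "all", "alerts"], lambda h, w: TOY[(h, w)]))
```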
`
Unfortunately, it is essentially impossible to reliably estimate the conditional word probabilities,
P(w_j \mid w_1 w_2 \cdots w_{j-1}), for all words and all sequence lengths in a given language.
`Hence, in practice, it is convenient t