Chapter 8
LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION

8.1 INTRODUCTION
Throughout this book we have developed a wide range of tools, techniques, and algorithms for attacking several fundamental problems in speech recognition. In the previous chapter we saw how the different techniques came together to solve the connected word recognition problem. In this chapter we extend the concepts to include issues needed to solve the large vocabulary, continuous speech recognition problem. We will see that the fundamental ideas need modification because of the use of subword speech units; however, a great deal of the formalism for recognition, based on word units, is still preserved.
The standard approach to large vocabulary continuous speech recognition is to assume a simple probabilistic model of speech production whereby a specified word sequence, W, produces an acoustic observation sequence Y, with probability P(W, Y). The goal is then to decode the word string, based on the acoustic observation sequence, so that the decoded string has the maximum a posteriori (MAP) probability, i.e.,
    Ŵ ∋ P(Ŵ|Y) = max_W P(W|Y).                                        (8.1)

Using Bayes' Rule, Equation (8.1) can be written as

    P(W|Y) = P(Y|W)P(W) / P(Y).                                       (8.2)

Since P(Y) is independent of W, the MAP decoding rule of Eq. (8.1) is

    Ŵ = arg max_W P(Y|W)P(W).                                         (8.3)
The first term in Eq. (8.3), P(Y|W), is generally called the acoustic model, as it estimates the probability of a sequence of acoustic observations, conditioned on the word string. The way in which we compute P(Y|W), for large vocabulary speech recognition, is to build statistical models for subword speech units, build up word models from these subword speech unit models (using a lexicon to describe the composition of words), and then postulate word sequences and evaluate the acoustic model probabilities via standard concatenation methods. Such methods are discussed in Sections 8.2-8.4 of this chapter.
The second term in Eq. (8.3), P(W), is generally called the language model, as it describes the probability associated with a postulated sequence of words. Such language models can incorporate both syntactic and semantic constraints of the language and the recognition task. Often, when only syntactic constraints are used, the language model is called a grammar and may be of the form of a formal parser and syntax analyzer, an N-gram word model (N = 2, 3, ...), or a word-pair grammar of some type. Generally such language models are represented in a finite state network so as to be integrated into the acoustic model in a straightforward manner. We discuss language models further in Section 8.5 of this chapter.
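To make the decoding rule of Eq. (8.3) concrete, the following minimal Python sketch scores a few hypothesized word strings by combining a per-hypothesis acoustic log-likelihood with a bigram language model estimated from counts. The word strings, counts, and acoustic scores are invented for illustration and are not taken from the tasks described later in the chapter.

    import math

    # Hypothetical bigram counts harvested from a tiny training corpus.
    # P(w2 | w1) is estimated as count(w1, w2) / count(w1).
    bigram_counts = {("<s>", "show"): 8, ("show", "all"): 5, ("show", "the"): 3,
                     ("all", "alerts"): 4, ("all", "ships"): 1, ("the", "alerts"): 2}
    unigram_counts = {"<s>": 11, "show": 8, "all": 5, "the": 3}

    def log_p_bigram(w1, w2, floor=1e-4):
        """Log bigram probability with a crude floor for unseen pairs."""
        c12 = bigram_counts.get((w1, w2), 0)
        c1 = unigram_counts.get(w1, 0)
        return math.log(c12 / c1) if c12 and c1 else math.log(floor)

    def log_language_model(words):
        """log P(W) under a bigram (N = 2) language model."""
        return sum(log_p_bigram(w1, w2) for w1, w2 in zip(["<s>"] + words, words))

    # Hypothetical acoustic log-likelihoods log P(Y|W) for each hypothesis,
    # as would be produced by concatenated subword HMMs (Sections 8.2-8.4).
    acoustic_log_like = {("show", "all", "alerts"): -120.4,
                         ("show", "the", "alerts"): -118.9,
                         ("show", "all", "ships"): -131.0}

    def map_decode(hypotheses):
        """Return the hypothesis maximizing log P(Y|W) + log P(W), as in Eq. (8.3)."""
        return max(hypotheses,
                   key=lambda W: acoustic_log_like[W] + log_language_model(list(W)))

    print(map_decode(acoustic_log_like.keys()))

In a real recognizer the maximization of Eq. (8.3) is, of course, carried out over the combined acoustic/language-model network rather than over an explicit list of hypotheses.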
We begin the chapter with a discussion of subword speech units. We formally define subword units and discuss their relative advantages (and disadvantages) as compared to whole-word models. We next show how we use standard statistical modeling techniques (i.e., hidden Markov models) to model subword units based on either discrete or continuous densities. We then show how such units can be trained automatically from continuous speech, without the need for a bootstrap model of each of the subword units. Next we discuss the problem of creating and implementing word lexicons (dictionaries) for use in both training and recognition phases. To evaluate the ideas discussed in this chapter we use a specified database access task, called the DARPA Resource Management (RM) task, in which there is a word vocabulary of 991 words (plus a silence or background word), and any one of several word grammars can be used. Using such a system, we show how a basic set of subword units performs on this task. Several directions for creating subword units which are more specialized are described, and several of these techniques are evaluated on the RM task. Finally we conclude the chapter with a discussion of how task semantics can be applied to further constrain the recognizer and improve overall performance.
`
8.2 SUBWORD SPEECH UNITS
`
We began Chapter 2 with a discussion of the basic phonetic units of language and discussed the acoustic properties of the phonemes in different speech contexts. We then argued that the acoustic variability of the phonemes due to context was sufficiently large, and not well understood, that such units would not be useful as the basis for speech models for recognition. Instead, we have used whole-word models as the basic speech unit, both for isolated word recognition systems and for connected word recognition systems, because whole words have the property that their acoustic representation is well defined, and the acoustic variability occurs mainly in the region of the beginning and the end of the word. Another advantage of using whole-word speech models is that it obviates the need for a word lexicon, thereby making the recognition structure inherently simple.
The disadvantages of using whole-word speech models for continuous speech recognition are twofold. First, to obtain reliable whole-word models, the number of word utterances in the training set needs to be sufficiently large; i.e., each word in the vocabulary should appear in each possible phonetic context several times in the training set. In this way the acoustic variability at the beginning and at the end of each word can be modeled appropriately. For word vocabularies like the digits, we know that each digit can be preceded and followed by every other digit; hence for an 11-digit vocabulary (zero to nine plus oh), there are exactly 121 phonetic contexts (some of which are essentially identical). Thus with a training set of several thousand digit strings, it is both realistic and practical to see every digit in every phonetic context several times. Now consider a vocabulary of 1000 words with an average of 100 phonetic contexts for both the beginning and end of each word. To see each word in each phonetic context exactly once requires 100 x 1000 x 100 = 10 million carefully designed sentences. To see each combination 10 times requires 100 million such sentences. Clearly, the recording and processing of such enormous amounts of speech data is both impractical and unthinkable. Second, with a large vocabulary the phonetic content of the individual words will inevitably overlap. Thus storing and comparing whole-word patterns would be unduly redundant, because the constituent sounds of individual words are treated independently, regardless of their identifiable similarities. Hence some more efficient speech representation is required for such large vocabulary systems. This is essentially the reason we use subword speech units.
There are several possible choices for subword units that can be used to model speech. These include the following:
`
• Phonelike units (PLUs), in which we use the basic phoneme set (or some appropriately modified set) of sounds but recognize that the acoustic properties of these units could be considerably different than the acoustic properties of the "basic" phonemes [1-7]. This is because we define the units based on linguistic similarity but model the unit based on acoustic similarity. In cases in which the acoustic and phonetic similarities are roughly the same (e.g., stressed vowels), the phoneme and the PLU will be essentially identical. In other cases there can be large differences, and a simple one-to-one correspondence may be inadequate in terms of modeling accuracy. Typically there are about 50 PLUs for English.
• Syllable-like units, in which we again use the linguistic definition of a syllable (namely, a vowel nucleus plus the optional initial and final consonants or consonant clusters) to initially define these units, and then model the unit based on acoustic similarity. In English there are approximately 10,000 syllables.
• Dyad or demisyllable-like units, consisting of either the initial (optional) consonant cluster and some part of the vowel nucleus, or the remaining part of the vowel nucleus and the final (optional) consonant cluster [8]. For English there is something on the order of 2000 demisyllable-like units.
• Acoustic units, which are defined on the basis of clustering speech segments from a segmentation of fluent, unlabeled speech using a specified objective criterion (e.g., maximum likelihood) [9]. Literally, a codebook of speech units is created whose interpretation, in terms of classical linguistic units, is at best vague and at worst totally nonexistent. It has been shown that a set of 256-512 acoustic units is appropriate for modeling a wide range of speech vocabularies.
`
Consider the English word segmentation. Its representation according to each of the above subword unit sets is

• PLUs: /s/ /eh/ /g/ /m/ /ax/ /n/ /t/ /ey/ /sh/ /ax/ /n/ (11 units)
• syllables: /seg/ /men/ /ta/ /tion/ (4 syllables)
• demisyllables: /s-eh/ /eh-g/ /m-ax/ /ax-n/ /t-ey/ /ey-sh/ /sh-ax/ /ax-n/ (8 demisyllables)
• acoustic units: 17 111 37 3 241 121 99 171 37 (9 acoustic units).

We see, from the above example, that the number of subword units for this word can be as small as 4 (from a set of 10,000 units) or as large as 11 (from a set of 50 units).
Since each of the above subword unit sets is capable of representing any word in the English language, the issues in the choice of subword unit sets are the context sensitivity and the ease of training the unit from fluent speech. (In addition, for acoustic units, an issue is the creation of a word lexicon, since the units themselves have no inherent linguistic interpretation.) It should be clear that there is no ideal (perfect) set of subword units. The PLU set is extremely context sensitive because each unit is potentially affected by its predecessors (one or more) and its followers. However, there is only a small number of PLUs and they are relatively easy to train. On the other extreme are the syllables, which are the longest units and are the least context sensitive. However, there are so many of them that they are almost as difficult to train as whole-word models.
For simplicity we will initially assume that we use PLUs as the basic speech units. In particular we use the set of 47 PLUs shown in Table 8.1 (which includes an explicit symbol for silence, h#). For each PLU we show an orthographic symbol (e.g., aa) and a word associated with the symbol (e.g., father). (These symbols are essentially identical to the ARPAbet alphabet of Table 2.1; lowercase symbols are used throughout this chapter for consistency with the DARPA community.) Table 8.2 shows typical pronunciations for several words from the DARPA RM task in terms of the PLUs in Table 8.1. A strong advantage of using PLUs is the ease of creating word lexicons of the type shown in Table 8.2 from standard (electronic) dictionaries. We will see later in this chapter how we exploit the advantages of PLUs, while reducing the context dependencies, by going to more specialized PLUs which take into consideration either the left or right (or both) contexts in which the PLU appears.
One problem with word lexicons of the type shown in Table 8.2 is that they don't easily account for variations in word pronunciation across different dialects and in the context of a sentence.
`
`

`

TABLE 8.1. Set of basic PLUs for speech.

Number  Symbol  Word
1       h#      silence
2       aa      father
3       ae      bat
4       ah      butt
5       ao      bought
6       aw      bough
7       ax      again
8       axr     diner
9       ay      bite
10      b       bob
11      ch      church
12      d       dad
13      dh      they
14      eh      bet
15      el      bottle
16      en      button
17      er      bird
18      ey      bait
19      f       fief
20      g       gag
21      hh      hag
22      ih      bit
23      ix      roses
24      iy      beat
25      jh      judge
26      k       kick
27      l       led
28      m       mom
29      n       no
30      ng      sing
31      ow      boat
32      oy      boy
33      p       pop
34      r       red
35      s       sis
36      sh      shoe
37      t       tot
38      th      thief
39      uh      book
40      uw      boot
41      v       very
42      w       wet
43      y       yet
44      z       zoo
45      zh      measure
46      dx      butter
47      nx      center
`
TABLE 8.2. Typical word pronunciations (word lexicon) based on context-independent PLUs.

Word     Number of phones   Transcription
a        1                  ax
above    4                  ax b ah v
bad      3                  b ae d
carry    4                  k ae r iy
define   5                  d iy f ay n
end      3                  eh n d
gone     3                  g ao n
hours    4                  aw w axr z
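A lexicon of the kind shown in Table 8.2 maps directly onto a simple lookup structure. The sketch below is a minimal illustration, assuming a Python dictionary keyed by orthographic word with PLU lists as values; the entries are copied from Table 8.2 and the helper name is our own.

    # A toy word lexicon mapping orthographic words to PLU transcriptions
    # (entries taken from Table 8.2; a real RM lexicon holds ~1000 such entries).
    LEXICON = {
        "a":      ["ax"],
        "above":  ["ax", "b", "ah", "v"],
        "bad":    ["b", "ae", "d"],
        "carry":  ["k", "ae", "r", "iy"],
        "define": ["d", "iy", "f", "ay", "n"],
        "end":    ["eh", "n", "d"],
        "gone":   ["g", "ao", "n"],
        "hours":  ["aw", "w", "axr", "z"],
    }

    def transcribe(sentence):
        """Expand a word string into its PLU sequence via lexicon lookup."""
        plus = []
        for word in sentence.lower().split():
            plus.extend(LEXICON[word])   # raises KeyError for out-of-vocabulary words
        return plus

    print(transcribe("carry above"))   # ['k', 'ae', 'r', 'iy', 'ax', 'b', 'ah', 'v']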
[Figure 8.1 HMM representations of a word (a) and a subword unit (b).]
`
Hence a simple word like "a" is often pronounced as /ey/ in isolation (e.g., the letter A), but is pronounced as /ax/ in context. Another example is a word like "data," which can be pronounced as /d ey t ax/ or /d ae t ax/ depending on the speaker's dialect. Finally, words like "you" are normally pronounced as /y uw/ but in context often are pronounced as /jh ax/ or /jh uh/. There are several ways of accounting for word pronunciation variability, including multiple entries in the word lexicon, use of phonological rules in the recognition grammar, and use of context-dependent PLUs. We will discuss these options later in this chapter.
`
8.3 SUBWORD UNIT MODELS BASED ON HMMs

As we have shown several times in this book, the most popular way in which speech is modeled is as a left-to-right hidden Markov model. As shown in Figure 8.1a, a whole-word model typically uses a left-to-right HMM with N states, where N can be a fixed value (e.g., 5-10 for each word), or can be variable with the number of sounds (phonemes) in the word, or can be set equal to the average number of frames in the word. For subword units, typically, the number of states in the HMM is set to a fixed value, as shown in Figure 8.1b, where a three-state model is used. This means that the shortest tokens of each subword unit must last at least three frames, a restriction that seems reasonable in practice. (Models that use jumps to eliminate this restriction have been studied [2].)
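As an illustration of the topology just described, here is a minimal sketch (our own, not code from the text) of a three-state left-to-right subword HMM: each state either repeats or advances to the next state, so any token of the unit must occupy at least three frames. The transition values are arbitrary placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    N_STATES = 3  # fixed number of states per subword unit (Figure 8.1b)

    # Left-to-right transition matrix: each state either repeats or advances,
    # so every path 0 -> 1 -> 2 occupies at least three frames.
    A = np.array([
        [0.6, 0.4, 0.0],
        [0.0, 0.6, 0.4],
        [0.0, 0.0, 1.0],
    ])

    def sample_duration(a, exit_prob=0.4):
        """Sample how many frames a token of this unit lasts."""
        state, frames = 0, 0
        while True:
            frames += 1
            if state == N_STATES - 1 and rng.random() < exit_prob:
                return frames                  # leave the unit after the last state
            state = rng.choice(N_STATES, p=a[state])

    print(min(sample_duration(A) for _ in range(1000)))   # never smaller than 3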
[Figure 8.2 Representations of the acoustic space of speech by (a) partitioned VQ cells, (b) sets of continuous mixture Gaussian densities, and (c) a continuous-density codebook (after Lee et al. [7]).]

To represent the spectral density associated with the states of each subword unit, one of three approaches can be used. These approaches are illustrated in Figure 8.2. Perhaps the simplest approach is to design a VQ-based codebook for all speech sounds (as shown in part a of the figure). For this approach the probability density of the observed
`
spectral sequence within each state of each PLU is simply a discrete density defined over the codebook vectors. The interpretation of the discrete density within a state is that of implicitly isolating the part of the acoustic space in which the spectral vectors occur and assigning the appropriate codebook vector (over that part of the space) a fixed probability for spectral vectors within each isolated region, regardless of their proximity to the corresponding codebook vector. A second alternative, illustrated in part b of Figure 8.2, is to represent the continuous probability density in each subword unit state by a mixture density that explicitly defines the part of the acoustic space in which the spectral vectors occur. Each mixture component has a spectral mean and variance that is highly dependent on the spectral characteristics of the subword unit (i.e., highly localized in the acoustic space). Hence the models for different subword units usually do not have substantial overlap in the acoustic space. Finally, a third alternative is to design a type of continuous density codebook over the entire acoustic space, as illustrated in part c of Figure 8.2. Basically the entire acoustic space is covered by a set of independent Gaussian densities, derived in much the same way as the discrete VQ codebook, with the resulting set of means and covariances stored in a codebook. This alternative is a compromise between the previous two possibilities. It differs from the discrete density case in the way the probability of an observation vector is computed; instead of assigning a fixed probability to any observation vector that falls within an isolated region, it actually determines the probability according to the closeness of the observation vector to the codebook vector (i.e., it calculates the exponents of the Gaussian distributions). For each state of each subword unit, the density is assumed to be a mixture of the fixed codebook densities. Hence, even though each state is characterized by a continuous mixture density, one need only estimate the set of mixture gains to specify the continuous density completely. Furthermore, since the codebook set of Gaussian densities is common for all states of all subword models, one can precompute the likelihoods associated with an input spectral vector for each of the codebook vectors, and ultimately determine state likelihoods using only a simple dot product with the state mixture gains. This represents a significant computational reduction over the full mixture continuous density case. This mixed density method has been called the tied mixture approach [10, 28] as well as the semicontinuous modeling method [11] and has been applied to the entire acoustic space as well as to pieces of the acoustic space for detailed PLU modeling. This method can be further extended to the case in which a set of continuous density codebooks is designed, one for each state of each basic (context-independent) speech unit. One can then estimate sets of mixture gains appropriate to context-dependent versions of each basic speech unit and use them appropriately for recognition. We will return to this issue later in this chapter.
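To make the tied-mixture computation concrete, the following minimal sketch (our own illustration, with arbitrary dimensions and parameter values) precomputes the likelihood of one observation vector under every codebook Gaussian and then obtains each state's likelihood as a dot product with that state's mixture gains.

    import numpy as np

    rng = np.random.default_rng(1)
    DIM, N_CODEWORDS, N_STATES = 12, 256, 3   # illustrative sizes

    # Shared ("tied") codebook of diagonal Gaussians, common to all states and units.
    means = rng.normal(size=(N_CODEWORDS, DIM))
    variances = np.full((N_CODEWORDS, DIM), 1.0)

    # Per-state mixture gains (rows sum to one); in practice these are the only
    # state-specific parameters that need to be estimated.
    gains = rng.dirichlet(np.ones(N_CODEWORDS), size=N_STATES)

    def codebook_likelihoods(x):
        """Likelihood of observation x under each codebook Gaussian (diag. cov.)."""
        diff = x - means
        log_norm = -0.5 * (DIM * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
        log_like = log_norm - 0.5 * (diff**2 / variances).sum(axis=1)
        return np.exp(log_like)

    x = rng.normal(size=DIM)          # one input spectral vector
    shared = codebook_likelihoods(x)  # computed once, reused by every state
    state_likelihoods = gains @ shared
    print(state_likelihoods)          # one likelihood per state, via dot products only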
`
8.4 TRAINING OF SUBWORD UNITS

Implicitly it would seem that training of the models for subword units would be extremely difficult, because there is no simple way to create a bootstrap model of such short, imprecisely defined speech sounds. Fortunately, this is not the case. The reason for this is the inherent tying of subword units across words and sentences; that is, every subword unit occurs a large number of times in any reasonable-size training set. Hence estimation algorithms like the forward-backward procedure, or the segmental k-means algorithm, can start with a uniform segmentation (flat or random initial models) and rapidly converge to the best model estimates in just a few iterations.
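The "flat start" mentioned above simply divides each utterance evenly among the unit states it is supposed to contain. A minimal sketch of such a uniform segmentation (an illustration, not the book's code) follows.

    import numpy as np

    def uniform_segmentation(n_frames, n_segments):
        """Assign n_frames as evenly as possible to n_segments consecutive states."""
        edges = np.linspace(0, n_frames, n_segments + 1).round().astype(int)
        return [(int(edges[i]), int(edges[i + 1])) for i in range(n_segments)]

    # Example: a 100-frame utterance covering 11 PLUs of 3 states each (33 segments).
    segments = uniform_segmentation(100, 11 * 3)
    print(segments[:4])   # first few (start, end) frame ranges, each about 3 frames long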
To illustrate how models of subword units are estimated, assume we have a labeled training set of speech sentences, where each sentence consists of the speech waveform and its transcription into words. (We assume that waveform segmentation into words is not available.) We further assume a word lexicon is available that provides a transcription of every word in the training set strings in terms of the set of subword units being trained. We assume that silence can (but needn't) precede or follow any word within a sentence (i.e., we allow pauses in speaking), with silence at the beginning and end of each sentence the most likely situation. Based on the above assumptions, a typical sentence in the training set can be transcribed as
`
    S_W: silence ⊕ W_1 ⊕ silence ⊕ W_2 ⊕ silence ⊕ ... ⊕ W_I ⊕ silence,

in which each W_i, 1 ≤ i ≤ I, is a word in the lexicon and each silence is optional. Hence the sentence "show all alerts" is a three-word sentence with W_1 = show, W_2 = all, and W_3 = alerts. Each word can be looked up in the lexicon to find its transcription in terms of subword units. Hence the sentence S can be written in terms of subword units as

    S_U: U_1(W_1) U_2(W_1) ... U_{L(W_1)}(W_1) ⊕ U_1(W_2) U_2(W_2) ... U_{L(W_2)}(W_2) ⊕ ... ⊕ U_1(W_I) U_2(W_I) ... U_{L(W_I)}(W_I),
`
where L(W_i) is the length (in units) of word W_i, etc. Finally we replace each subword unit by its HMM (the three-state models shown in Figure 8.1) and incorporate the assumptions about silence between words to give an extended HMM for each sentence.
The above process is illustrated (in general) in Figure 8.3. We see that a sentence is represented as a finite-state network (FSN) where the arcs are either words or silence or null arcs (where a null (φ) transition is required to skip the alternative silence). Each word is represented as an FSN of subword units and each subword unit is represented as a three-state HMM.

[Figure 8.3 Representation of a sentence, word, and subword unit in terms of FSNs.]

Figure 8.4 shows the process of creating the composite FSN for the sentence "Show all alerts," based on a single-word pronunciation lexicon. One feature of this implementation is the use of a single-state HMM for the silence word. This is used (rather than the three-state HMMs used for each PLU), since silence is generally stationary and has no temporal structure to exploit.
When there are multiple representations of words in the lexicon (e.g., for two or more distinct pronunciations), it is easy to modify the FSN of Figures 8.3 and 8.4 to add parallel paths for the word arcs. (We will see that only one path is chosen in training, namely the best representation of the actual word pronunciation in the context of the spoken sentence.) Furthermore, multiple models of each subword unit can be used by introducing parallel paths in the word FSNs and then choosing the best version of each subword unit in the decoding process.
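The sketch below illustrates, under our own simplifying assumptions, how a sentence transcription and a PLU lexicon can be expanded into the linear state sequence of a composite sentence model; optional silences are simply marked rather than expanded into parallel arcs, and the lexicon entries are invented for illustration.

    # Hypothetical single-pronunciation lexicon for the example sentence.
    LEXICON = {"show": ["sh", "ow"], "all": ["ao", "l"],
               "alerts": ["ax", "l", "er", "t", "s"]}
    STATES_PER_PLU = 3      # three-state left-to-right model per PLU (Figure 8.1b)
    STATES_PER_SILENCE = 1  # silence uses a single-state HMM

    def composite_states(sentence):
        """List the HMM states of the composite FSN for one training sentence."""
        states = [("h#", 0, "optional")]              # optional leading silence
        for word in sentence.lower().split():
            for plu in LEXICON[word]:
                states += [(plu, s, word) for s in range(STATES_PER_PLU)]
            states.append(("h#", 0, "optional"))      # optional silence after each word
        return states

    model = composite_states("show all alerts")
    print(len(model), model[:5])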
`
[Figure 8.4 Creation of the composite FSN for the sentence "Show all alerts."]
`
Once a composite sentence FSN is created for each sentence in the training set, the training problem becomes one of estimating the subword unit model parameters that maximize the likelihood of the models for all the given training data. The maximum likelihood parameters can be solved for using either the forward-backward procedure (see Ref. [2], for example) or the segmental k-means training algorithm. The way in which we use the segmental k-means training procedure to estimate the set of model parameters (based on using a mixture density with M mixtures/state) is as follows (a brief code sketch of the clustering and estimation steps is given after the list):
`
1. Initialization: Linearly segment each training utterance into units and HMM states assuming no silence between words (i.e., silence only at the beginning and end of each sentence), a single lexical pronunciation of each word, and a single model for each subword unit. Figure 8.5, iteration 0, illustrates this step for the first few units of one training sentence. Literally, we assume every unit is of equal duration initially.
2. Clustering: All feature vectors from all segments corresponding to a given state (i) of a given subword unit are partitioned into M clusters using the k-means algorithm. (This step is iterated for all states of all subword units.)
3. Estimation: The mean vectors, μ_ik, the (diagonal) covariance matrices, U_ik, and the mixture weights, c_ik, are estimated for each cluster k in state i. (This step is iterated for all states of all subword units.)
`
[Figure 8.5 Segmentations of a training utterance resulting from the segmental k-means training for the first several iterations (after Lee et al. [7]).]
`
4. Segmentation: The updated set of subword unit models (based on the estimation of step 3) is used to resegment each training utterance into units and states (via Viterbi decoding). At this point multiple lexical entries can be used for any word in the vocabulary. Figure 8.5 shows the result of this resegmentation step for iterations 1-4 and 10 for one training utterance. It can be seen that by iteration 2 the segmentation into subword units is remarkably stable.
5. Iteration: Steps 2-4 are iterated until convergence (i.e., until the overall likelihoods stop increasing).
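The sketch referred to above is a minimal illustration of steps 2 and 3 for one state of one subword unit, using scikit-learn's KMeans for the clustering; the array shapes and the number of mixtures are arbitrary choices, not values from the text.

    import numpy as np
    from sklearn.cluster import KMeans

    M = 4                                   # mixtures per state (illustrative)
    rng = np.random.default_rng(2)

    # All feature vectors assigned to one state of one PLU by the current segmentation.
    frames = rng.normal(size=(500, 12))     # 500 frames, 12-dimensional features

    # Step 2 (Clustering): partition the state's frames into M clusters.
    labels = KMeans(n_clusters=M, n_init=10, random_state=0).fit(frames).labels_

    # Step 3 (Estimation): per-cluster mean, diagonal covariance, and mixture weight.
    means = np.stack([frames[labels == k].mean(axis=0) for k in range(M)])
    variances = np.stack([frames[labels == k].var(axis=0) for k in range(M)])
    weights = np.array([(labels == k).mean() for k in range(M)])

    print(weights.sum())                    # mixture weights sum to 1.0

Step 4 would then resegment the training utterances by Viterbi decoding with the updated models, and the loop repeats until the likelihoods stop increasing.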
`
[Figure 8.6 Segmentation of an utterance ("What is the Constellation's gross displacement in long tons") into PLUs (after Lee et al. [7]).]
`
Figure 8.6 illustrates the resulting segmentation of the first few units of the utterance "What is the constellation ...." Shown in this figure are the power contour in dB (upper panel), the running LPC spectral slices (middle panel), and the likelihood scores and delta-cepstral values (lower panel) for the first second of the sentence. The resulting segmentations are generally remarkably consistent with those one might manually choose based on acoustic-phonetic criteria. Since we use an acoustic criterion for the choice of segmentation points, the closeness of PLU units to true phonetic units is often remarkable, especially in light of the phonetic variability in word pronunciation discussed previously.
In summary, we have shown how one can use a training set of speech sentences that have only word transcriptions associated with each sentence and optimally determine the parameters of a set of subword unit HMMs. The resulting parameter estimates are extremely robust to the training material as well as to details of word pronunciation as obtained from the word lexicon. The reason for this is that a common word lexicon (with associated word pronunciation errors) is used for both training and recognition; hence errors in associating proper subword units to words are consistent throughout the process and are less harmful than they would be in alternative methods of estimating parameters of subword models.
The results of applying the segmental k-means training procedure to a set of 3990 training sentences from 109 different talkers, in terms of PLU counts and PLU likelihood scores, are shown in Table 8.3. A total of 155,000 PLUs occurred in the 3990 sentences, with silence (h#) having the most occurrences (10,638, or 6.9% of the total) and nx (flapped n) having the fewest occurrences (57, or 0.04% of the total).
`
TABLE 8.3. PLU statistics on count and average likelihood score.

PLU   Count   %      Average likelihood   (Rank)
h#    10638   6.9    18.5                 (1)
r     8997    5.8    8.4                  (45)
t     8777    5.7    9.7                  (37)
ax    8715    5.6    7.1                  (47)
s     8625    5.6    15.4                 (3)
n     8478    5.5    8.3                  (46)
ih    6542    4.2    9.9                  (35)
iy    5816    3.7    12.0                 (17)
d     5391    3.5    8.5                  (44)
ae    4873    3.1    13.3                 (10)
l     4857    3.1    8.9                  (41)
z     4733    3.0    12.4                 (14)
eh    4604    3.0    11.2                 (21)
k     4286    2.8    10.6                 (27)
p     3793    2.4    14.3                 (6)
m     3625    2.3    8.5                  (43)
ao    3489    2.2    10.4                 (32)
f     3276    2.1    17.7                 (2)
ey    3271    2.1    14.5                 (5)
w     3188    2.1    10.2                 (34)
ix    3079    2.0    8.7                  (42)
dh    2984    1.9    11.8                 (18)
v     2979    1.9    12.0                 (16)
aa    2738    1.8    10.3                 (33)
b     2138    1.4    10.7                 (25)
y     2137    1.4    13.1                 (11)
uw    2032    1.3    10.6                 (26)
sh    1875    1.2    13.1                 (12)
ow    1875    1.2    10.9                 (24)
axr   1825    1.2    9.5                  (38)
ah    1566    1.0    11.3                 (20)
dx    1548    1.0    10.4                 (31)
ay    1527    1.0    13.9                 (8)
en    1478    0.9    9.1                  (40)
g     1416    0.9    9.8                  (36)
hh    1276    0.8    11.4                 (19)
th    924     0.6    14.1                 (7)
ng    903     0.6    9.1                  (39)
ch    885     0.6    12.5                 (13)
el    863     0.6    11.0                 (23)
er    852     0.5    10.6                 (29)
jh    816     0.5    10.6                 (28)
aw    682     0.4    13.6                 (9)
uh    242     0.2    11.0                 (22)
zh    198     0.1    12.2                 (15)
oy    130     0.1    15.3                 (4)
nx    57      0.04   10.4                 (30)
`
`

`

In terms of average likelihood scores, silence (h#) had the highest score (18.5), followed by f (17.7) and s (15.4), while ax had the lowest score (7.1), followed by n (8.3) and r (8.4). (Note that, in this case, a higher average likelihood implies less variation among different renditions of the particular sound.) It is interesting to note that the PLUs with the three lowest average likelihood scores (ax, n, and r) were among the most frequently occurring sounds (r was second, n sixth, and ax fourth in frequency of occurrence). Similarly, some of the sounds with the highest likelihood scores were among the least occurring sounds (e.g., oy was fourth according to likelihood score but 46th according to frequency of occurrence).
`
8.5 LANGUAGE MODELS FOR LARGE VOCABULARY SPEECH RECOGNITION
`
Small vocabulary speech-recognition systems are used primarily for command-and-control applications, where the vocabulary words are essentially acoustic control signals to which the system has to respond. (See Chapter 9 for a discussion of command-and-control applications of speech recognition.) As such, these systems generally do not rely heavily on language models to accomplish their selected tasks. A large vocabulary speech-recognition system, however, is generally critically dependent on linguistic knowledge embedded in the input speech. Therefore, for large vocabulary speech recognition, incorporation of knowledge of the language, in the form of a "language" model, is essential. In this section we discuss a statistically motivated framework for language modeling.
The goal of the (statistical) language model is to provide an estimate of the probability of a word sequence W for the given recognition task. If we assume that W is a specified sequence of words, i.e.,
`
    W = w1 w2 ... wQ,                                                 (8.4)

then it would seem reasonable that P(W) can be computed as

    P(W) = P(w1 w2 ... wQ)
         = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wQ|w1 w2 ... wQ-1).       (8.5)

Unfortunately, it is essentially impossible to reliably estimate the conditional word probabilities, P(wj|w1 w2 ... wj-1), for all words and all sequence lengths in a given language. Hence, in practice, it is convenient t
