STATISTICAL METHODS

FREDERICK JELINEK


Statistical Methods for
Speech Recognition

Frederick Jelinek

The MIT Press
Cambridge, Massachusetts
London, England

Third printing, 2001
© 1997 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

This book was set in Times New Roman by Asco Typesetters, Hong Kong.

Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

Jelinek, Frederick, 1932-
Statistical methods for speech recognition / Frederick Jelinek.
p. cm. - (Language, speech, and communication)
Includes bibliographical references and index.
ISBN 0-262-10066-5 (alk. paper)
1. Automatic speech recognition-Statistical methods. I. Title. II. Series.
TK7895.S65J45 1998
006.4'54-dc21
97-34860
CIP

Contents

Preface

Chapter 1  The Speech Recognition Problem
1.1 Introduction
1.2 A Mathematical Formulation
1.3 Components of a Speech Recognizer
  1.3.1 Acoustic Processing
  1.3.2 Acoustic Modeling
  1.3.3 Language Modeling
  1.3.4 Hypothesis Search
  1.3.5 The Source-Channel Model of Speech Recognition
1.4 About This Book
1.5 Vector Quantization
1.6 Additional Reading
References

Chapter 2  Hidden Markov Models
2.1 About Markov Chains
2.2 The Hidden Markov Model Concept
2.3 The Trellis
2.4 Search for the Likeliest State Transition Sequence
2.5 Presence of Null Transitions
2.6 Dealing with an HMM That Has Null Transitions That Do Not Form a Loop
2.7 Estimation of Statistical Parameters of HMMs
2.8 Practical Need for Normalization
2.9 Alternative Definitions of HMMs
  2.9.1 HMMs Outputting Real Numbers
  2.9.2 HMM Outputs Attached to States
2.10 Additional Reading
References

Chapter 3  The Acoustic Model
3.1 Introduction
3.2 Phonetic Acoustic Models
3.3 More on Acoustic Model Training
3.4 The Effect of Context
3.5 Viterbi Alignment
3.6 Singleton Fenonic Base Forms
3.7 A Needed Generalization
3.8 Generation of Synthetic Base Forms
3.9 A Further Refinement
3.10 Singleton Base Forms for Words Outside the Vocabulary
3.11 Additional Reading
References

Chapter 4  Basic Language Modeling
4.1 Introduction
4.2 Equivalence Classification of History
4.3 The Trigram Language Model
4.4 Optimal Linear Smoothing
4.5 An Example of a Trigram Language Model
4.6 Practical Aspects of Deleted Interpolation
4.7 Backing-Off
4.8 HMM Tagging
4.9 Use of Tag Equivalence Classification in a Language Model
4.10 Vocabulary Selection and Personalization from Text Databases
4.11 Additional Reading
References

Chapter 5  The Viterbi Search
5.1 Introduction
5.2 Finding the Most Likely Word Sequence
5.3 The Beam Search
5.4 Successive Language Model Refinement Search
5.5 Search versus Language Model State Spaces
5.6 N-Best Search
5.7 A Maximum Probability Lattice
5.8 Additional Reading
References

Chapter 6  Hypothesis Search on a Tree and the Fast Match
6.1 Introduction
6.2 Tree Search versus Trellis (Viterbi) Search
6.3 A* Search
6.4 Stack Algorithm for Speech Recognition
6.5 Modifications of the Tree Search
6.6 Multiple-Stack Search
  6.6.1 First Algorithm
  6.6.2 A Multistack Algorithm
  6.6.3 Actual Multistack Algorithm
6.7 Fast Match
6.8 The Cost of Search Shortcuts
6.9 Additional Reading
References

Chapter 7  Elements of Information Theory
7.1 Introduction
7.2 Functional Form of the Basic Information Measure
7.3 Some Mathematical Properties of Entropy
7.4 An Alternative Point of View and Notation
7.5 A Source-Coding Theorem
7.6 A Brief Digression
7.7 Mutual Information
7.8 Additional Reading
References

Chapter 8  The Complexity of Tasks-The Quality of Language Models
8.1 The Problem with Estimation of Recognition Task Complexity
8.2 The Shannon Game
8.3 Perplexity
8.4 The Conditional Entropy of the System
8.5 Additional Reading
References

Chapter 9  The Expectation-Maximization Algorithm and Its Consequences
9.1 Introduction
9.2 The EM Theorem
9.3 The Baum-Welch Algorithm
9.4 Real Vector Outputs of the Acoustic Processor
  9.4.1 Development for Two Dimensions
  9.4.2 The Generalization to k Dimensions
9.5 Constant and Tied Parameters
  9.5.1 Keeping Some Parameters Constant
  9.5.2 Tying of Parameter Sets
9.6 Tied Mixtures
9.7 Additional Reading
References

Chapter 10  Decision Trees and Tree Language Models
10.1 Introduction
10.2 Application of Decision Trees to Language Modeling
10.3 Decision Tree Example
10.4 What Questions?
10.5 The Entropy Goodness Criterion for the Selection of Questions, and a Stopping Rule
10.6 A Restricted Set of Questions
10.7 Selection of Questions by Chou's Method
10.8 Selection of the Initial Split of a Set 𝒮 into Complementary Subsets
10.9 The Two-ing Theorem
10.10 Practical Considerations of Chou's Method
  10.10.1 Problem of 0s in the q-Distribution: The Gini Index
  10.10.2 Equivalence Classification Induced by Decision Trees
  10.10.3 Computationally Feasible Specification of Decision Tree Equivalence Classes
10.11 Construction of Decision Trees Based on Word Encoding
10.12 A Hierarchical Classification of Vocabulary Words
10.13 More on Decision Trees Based on Word Encoding
  10.13.1 Implementing Hierarchical Word Classification
  10.13.2 Predicting Encoded Words One Bit at a Time
  10.13.3 Treatment of Unseen Training Data
  10.13.4 Problems and Advantages of Word Encoding
10.14 Final Remarks on the Decision Tree Method
  10.14.1 Smoothing
  10.14.2 Fragmentation of Data
10.15 Additional Reading
References

Chapter 11  Phonetics from Orthography: Spelling-to-Base Form Mappings
11.1 Overview of Base Form Generation from Spelling
11.2 Generating Alignment Data
11.3 Decision Tree Classification of Phonetic Environments
11.4 Finding the Base Forms
11.5 Additional Reading
References

Chapter 12  Triphones and Allophones
12.1 Introduction
12.2 Triphones
12.3 The General Method
12.4 Collecting Realizations of Particular Phones
12.5 A Direct Method
12.6 The Consequences
12.7 Back to Triphones
12.8 Additional Reading
References

Chapter 13  Maximum Entropy Probability Estimation and Language Models
13.1 Outline of the Maximum Entropy Approach
13.2 The Main Idea
13.3 The General Solution
13.4 The Practical Problem
13.5 An Example
13.6 A Trigram Language Model
13.7 Limiting Computation
13.8 Iterative Scaling
13.9 The Problem of Finding Appropriate Constraints
13.10 Weighting of Diverse Evidence: Voting
13.11 Limiting Data Fragmentation: Multiple Decision Trees
  13.11.1 Combining Different Knowledge Sources
  13.11.2 Spontaneous Multiple Tree Development
13.12 Remaining Unsolved Problems
13.13 Additional Reading
References

Chapter 14  Three Applications of Maximum Entropy Estimation to Language Modeling
14.1 About the Applications
14.2 Simple Language Model Adaptation to a New Domain
14.3 A More Complex Adaptation
14.4 A Dynamic Language Model: Triggers
14.5 The Cache Language Model
14.6 Additional Reading
References

Chapter 15  Estimation of Probabilities from Counts and the Back-Off Method
15.1 Inadequacy of Relative Frequency Estimates
15.2 Estimation of Probabilities from Counts Using Held-Out Data
  15.2.1 The Basic Idea
  15.2.2 The Estimation
  15.2.3 Deciding the Value of M
15.3 Universality of the Held-Out Estimate
15.4 The Good-Turing Estimate
15.5 Applicability of the Held-Out and Good-Turing Estimates
15.6 Enhancing Estimation Methods
  15.6.1 Frequency Enhancement of Held-Out Estimation of Bigrams
  15.6.2 Frequency Enhancement of Good-Turing Estimation of Bigrams
  15.6.3 Other Enhancements
15.7 The Back-Off Language Model
15.8 Additional Reading
References

Name Index
Subject Index

Chapter 1
The Speech Recognition Problem

1.1 Introduction

A speech recognizer is a device that automatically transcribes speech into text. It can be thought of as a voice-actuated "typewriter" in which a computer program carries out the transcription and the transcribed text appears on a workstation display. The recognizer is usually based on some finite vocabulary that restricts the words that can be "printed" out. Until we state otherwise, the designation word denotes a word form defined by its spelling. Two differently spelled inflections or derivations of the same stem are considered different words (e.g., table and tables). Homographs having different parts of speech (e.g., absent [VERB] and absent [ADJECTIVE]) or meanings (e.g., bank [FINANCIAL INSTITUTION] and bank [OF A RIVER]) constitute the same word.

Figure 1.1, the speech waveform corresponding to an utterance of the chess move "Bishop moves to king knight five," illustrates our problem. The waveform was produced at Stanford University in the late 1960s in a speech recognition project undertaken by Raj Reddy [1].

Because the field was then in its infancy, Reddy tried to give himself all the conceivable but fair advantages. Recognition of spoken chess moves could be based on a small vocabulary, a restricted syntax, and a relatively complete world knowledge: The system knew which chess moves were legal, so that, for instance, a recognition hypothesis corresponding to a move of a piece to a square occupied by another piece of the same color could immediately be rejected. The speech was recorded by a very good microphone in a quiet environment, and the system was adjusted (as well as the state of the art allowed) to the speaker.

[Figure 1.1 The sentence "Bishop moves to king knight five" aligned with its speech waveform. Only the caption and the aligned word labels (BISHOP, MOVES, TO, KING, KNIGHT, FIVE) are recoverable from this scan; the waveform itself is not reproduced here.]

In the early design strategy of the field, a recognizer would segment the speech into successive phones (basic pronunciation units [2]), then identify the particular phones corresponding to the segments, and finally transcribe the recognized phone strings into an English text.

Figure 1.1 aligns the speech waveform with the spoken words. Inspecting it, we can see that although the boundary between different speech events cannot be accurately placed, distinct events certainly appear in succession. So even though the boundaries are fuzzy, most of the segments seem to contain repeating nuclei that are candidates for recognition. Although this does not hold for stops¹ like b (see the beginning of BISHOP) or k (beginning of KING), it does seem to hold for vowels such as i (BISHOP and KING) or u (the first vowel of MOVES). But inspecting figure 1.1 more closely, we encounter a very big problem: The i of KING looks much more like the u of MOVES than it does like the i of BISHOP! So visually similar waveforms do not necessarily indicate perceptually similar sounds.

Actually, context is involved here, and the apparent waveform similarities are due to the influence of the nasality of the sound of ng that follows the i and of the sound of m that precedes the u sound.

Having now indicated an aspect of why speech recognition is not an easy task, we are ready to begin its serious consideration. In this chapter we will formulate the large vocabulary² speech recognition problem mathematically, which will result in a recognizer's natural breakup into its components. The chapter will conclude with the first example of self-organization from data: the basic vector quantization algorithm [11] that can be used to transform the speech signal into a sequence of symbols from a relatively limited (and automatically selected!) alphabet.

1. Stops, also called plosives, are sounds that consist of two parts, a stop portion and a release portion. English stops are b, d, g, k, p, t. [2]

2. Although no hard and fast distinction between small and large vocabulary tasks exists, here are some examples of each:
• Small vocabulary: recognition of digits, yes-no answers to questions, inventory control, etc.
• Large vocabulary: text creation by dictation, transcription of e-mail, transcription of telephone conversations, text creation for the handicapped.
Small vocabulary tasks are of great economic importance. They are just not the subject of this book.

1.2 A Mathematical Formulation

To discuss the problem of speech recognizer design, we need its mathematical formulation. A precise statement of the problem leads directly to a fruitful decomposition into easier to treat subproblems. Our approach is statistical,³ so the formulation will involve probabilities. Here it is [3] [4]:

Let A denote the acoustic evidence (data) on the basis of which the recognizer will make its decision about which words were spoken. Because we are dealing with digital computers, we may assume without loss of generality that A is a sequence of symbols taken from some (possibly very large) alphabet 𝒜:

A = a_1, a_2, ..., a_m,    a_i ∈ 𝒜    (1)

The symbols a_i can be thought of as having been generated in time, as indicated by the index i.

Let

W = w_1, w_2, ..., w_n,    w_i ∈ 𝒱    (2)

denote a string of n words, each belonging to a fixed and known vocabulary 𝒱.

If P(W|A) denotes the probability that the words W were spoken, given that the evidence A was observed, then the recognizer should decide in favor of a word string Ŵ satisfying

Ŵ = arg max_W P(W|A)    (3)

That is, the recognizer will pick the most likely word string given the observed acoustic evidence.

Of course, underlying the target formula (3) is the tacit assumption that all words of a message are equally important to the user, that is, that misrecognition does not carry a different penalty depending on which word was misrecognized. Under this philosophy the warning "Fire!"
carries no more importance in a crowded theater than an innocuous commercial announcement. But let that possible criticism pass.⁴

3. No advanced results of probability or statistical theory will be used in this self-contained text. The student is required simply to be comfortable with statistical concepts and be able to manipulate them intuitively. So although nothing in this text presumes more than the knowledge of the first four chapters of a book like [5], the required sophistication can probably be gained only by completing an entire course.

4. Strictly speaking, formula (3) is appropriate only if we are after a perfect transcription of the utterance, that is, if one error is as bad as many. Were we to accept that errors are inevitable (which they certainly are) and aim explicitly at minimizing their number, a much more complex formula would be required. So our formula only approximates (it turns out very fruitfully) what we are intuitively after.

The well-known Bayes' formula of probability theory allows us to rewrite the right-hand side probability of (3) as

P(W|A) = P(W) P(A|W) / P(A)    (4)

where P(W) is the probability that the word string W will be uttered, P(A|W) is the probability that when the speaker says W the acoustic evidence A will be observed, and P(A) is the average probability that A will be observed. That is,

P(A) = Σ_{W'} P(W') P(A|W')    (5)

Since the maximization in (3) is carried out with the variable A fixed (there is no other acoustic data save the one we are given), it follows from (3) and (4) that the recognizer's aim is to find the word string Ŵ that maximizes the product P(W)P(A|W), that is, it satisfies

Ŵ = arg max_W P(W) P(A|W)    (6)
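To make the decision rule (6) concrete, here is a minimal sketch (not from the text: the candidate strings and the probability values are invented, and in a real recognizer P(W) would come from the language model of section 1.3.3, P(A|W) from the acoustic model of section 1.3.2, and the candidate set from the hypothesis search of section 1.3.4):

```python
import math

# Hypothetical candidate word strings W with toy scores (P(W), P(A|W)).
candidates = {
    "bishop moves to king knight five":    (1e-6, 1e-40),
    "bishop moves to king night five":     (2e-7, 3e-41),
    "this shop moves to king knight five": (5e-9, 8e-41),
}

def decode(candidates):
    """Return the word string W maximizing P(W) * P(A|W), as in formula (6).

    Log-probabilities are used so the tiny products do not underflow."""
    best_w, best_score = None, -math.inf
    for w, (p_w, p_a_given_w) in candidates.items():
        score = math.log(p_w) + math.log(p_a_given_w)
        if score > best_score:
            best_w, best_score = w, score
    return best_w

print(decode(candidates))  # -> "bishop moves to king knight five"
```

Note that P(A) of formula (4) never has to be computed: it is the same for every candidate W, which is exactly why (3) reduces to (6).
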
`
1.3 Components of a Speech Recognizer

Formula (6) determines what processes and components are of concern in the design of a speech recognizer.

1.3.1 Acoustic Processing

First, it is necessary to decide what acoustic data A will be observed. That is, one needs to decide on a front end that will transform the pressure waveform (which is what sound is) into the symbols a_i with which the recognizer will deal. So in principle, this front end includes a microphone whose output is an electric signal, a means of sampling that signal, and a manner of processing the resulting sequence of samples.

[Figure 1.2 A schematic diagram of a recognizer front end (acoustic processor). Speech enters the signal processor, which puts out the vectors σ_i; the comparator matches each σ_i against the prototypes held in the prototype storage and puts out the corresponding acoustic symbols a_i.]

In this book, dedicated to statistical models of speech recognition, we will not address the front-end problem, except in the penultimate section of this chapter.⁵ For the present, we will assume that the alphabet 𝒜 is simply given. Those interested in front-end design should consult the many books and articles that thoroughly discuss signal processing [6] [7]. For the reader to gain some idea of what might be involved, however, we present in figure 1.2 a schematic diagram of a rudimentary front end (acoustic processor).

5. Obviously, good signal processing is crucial and is the subject of intensive research. This book is about extracting information from the processed signal. Bad processing means loss of information: There is less of it to extract.

We can think of the signal processor as a device that at regular intervals of time (e.g., a hundred times per second⁶) generates real-valued vectors σ_i. The components of σ_i could be sample values of outputs of band-pass filters applied to the signal coming out of the microphone.

6. This happens to be the prevalent standard in our field.

The prototype storage contains a set of vector prototypes 𝒫 = {p_1, p_2, ..., p_K} of the same kind as σ_i. The comparator finds the closest element of 𝒫 to σ_i, and the index of that element is the acoustic symbol a_i. To be precise,

ĵ = arg min_{1≤j≤K} d(σ_i, p_j)    (7)

and

a_i = ĵ    (8)

In (7), d(·,·) denotes a suitable distance function.⁷

7. A very adequate distance is Euclidean: d(x, y) = √( Σ_l (x_l − y_l)² ).

In the penultimate section of this chapter we introduce a simple method, called vector quantization, that can derive the prototype set 𝒫 = {p_1, p_2, ..., p_K} from speech data. There, the intuition behind the output symbol selection rule (7) and (8) will become apparent.

In most state-of-the-art large vocabulary recognition systems the comparator of figure 1.2 is omitted and the rest of the recognizer handles directly the signal processor outputs σ_i. They then constitute the observable symbols a_i. To introduce the relevant methods in the simplest setting, however, we will assume an acoustic symbol alphabet size of the order of hundreds (the size 200 is very common), which will still allow us to deal with the essence of the problem. We will generalize certain important results to "continuous" vector spaces (for signal processor outputs σ) in chapter 9. The added complication is that statistical estimation for vector spaces requires parametric methods. Section 2.9.1 of the next chapter will briefly introduce an appropriate model.
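As a small illustration of the selection rule (7) and (8), here is a sketch of the comparator (the prototype values and the test vector below are invented; a real front end would use prototypes derived from speech, for example by the vector quantization of section 1.5):

```python
import math

def euclidean(x, y):
    # The distance of footnote 7: d(x, y) = sqrt(sum_l (x_l - y_l)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def quantize(sigma, prototypes):
    """Apply rules (7) and (8): return the index j of the prototype p_j
    closest to the signal-processor output vector sigma."""
    return min(range(len(prototypes)),
               key=lambda j: euclidean(sigma, prototypes[j]))

# Toy prototype storage with K = 3 two-dimensional prototypes (invented values).
prototypes = [(0.0, 0.0), (1.0, 1.0), (4.0, -2.0)]
sigma = (0.9, 1.2)                    # one hypothetical signal-processor output
print(quantize(sigma, prototypes))    # -> 1, so the acoustic symbol a_i is 1
```
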
`
1.3.2 Acoustic Modeling

Returning now to formula (6), the recognizer needs to be able to determine the value P(A|W) of the probability that when the speaker uttered the word sequence W the acoustic processor produced the data A. Since this number must be made available for all possible pairings of W with A, it follows that it must be computable "on the fly." The number of different possible values of A and W is just too large to permit a lookup.

Thus to compute P(A|W) we need a statistical acoustic model of the speaker's interaction with the acoustic processor. The total process we are modeling involves the way the speaker pronounces the words of W, the ambience (room noise, reverberation, etc.), the microphone placement and characteristics, and the acoustic processing performed by the front end. The usual acoustic model employed in speech recognizers, the hidden Markov model, will be discussed in the next chapters. Other models are possible, for instance those based on artificial neural networks [8] or on dynamic time warping [9]. These methods are not treated in this book.

1.3.3 Language Modeling

Formula (6) further requires that we be able to compute for every word string W the a priori probability P(W) that the speaker wishes to utter W. Bayes' formula allows many decompositions of P(W), but because the recognizer "naturally" wishes to convey the text in the sequence in which it was spoken, we will use the decomposition

P(W) = ∏_{i=1}^{n} P(w_i | w_1, ..., w_{i-1})    (9)

The recognizer must thus be able to determine estimates of the probabilities P(w_i | w_1, ..., w_{i-1}). We use the term estimate on purpose, because even for moderate values of i and vocabularies of reasonable size, the probability P(w_i | w_1, ..., w_{i-1}) has just too many arguments. In fact, if |𝒱| denotes the size of the vocabulary, then for |𝒱| = 20,000 and i = 3, the number of arguments is 8 × 10¹².

It is, of course, absurd to think that the speaker's choice of his ith word depends on the entire history w_1, ..., w_{i-1} of all of his previous speech. It is therefore natural that for purposes of the choice of w_i, the history be put into equivalence classes Φ(w_1, ..., w_{i-1}). Thus in reality formula (9) becomes

P(W) = ∏_{i=1}^{n} P(w_i | Φ(w_1, ..., w_{i-1}))    (10)

and the art of language modeling consists of determining the appropriate equivalence classification Φ and a method of estimating the probabilities P(w_i | Φ(w_1, ..., w_{i-1})).
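As an illustration of formula (10), the trigram language model of chapter 4 takes Φ(w_1, ..., w_{i-1}) to be just the last two words of the history. The sketch below is illustrative only: the toy corpus is invented, and the raw relative-frequency estimates it uses would in practice have to be smoothed (chapters 4 and 15):

```python
from collections import defaultdict

# Toy training text; a real language model is estimated from a large corpus.
corpus = "bishop moves to king knight five . pawn moves to king four .".split()

trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for u, v, w in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(u, v, w)] += 1
    bigram_counts[(u, v)] += 1

def p_next(word, history):
    """P(w_i | Phi(w_1, ..., w_{i-1})) with Phi = the last two words."""
    u, v = history[-2], history[-1]
    if bigram_counts[(u, v)] == 0:
        return 0.0           # unseen history; smoothing would fix this
    return trigram_counts[(u, v, word)] / bigram_counts[(u, v)]

def p_sentence(words):
    """Formula (10) restricted to trigrams, ignoring the first two positions."""
    p = 1.0
    for i in range(2, len(words)):
        p *= p_next(words[i], words[:i])
    return p

print(p_sentence("pawn moves to king knight five".split()))   # -> 0.5
```
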
It is worth stressing that the language model used should depend on the use to which the recognizer will be put. The transcription of dictated radiological reports requires different language models than the writing of movie reviews. If text is to be produced, then the language model may reasonably be constructed by processing examples of corresponding written materials. It will then depend on text only and not in any way on speech.

1.3.4 Hypothesis Search

Finally, to find the desired transcription Ŵ of the acoustic data A by formula (6), we must search over all possible word strings W to find the maximizing one. This search cannot be conducted by brute force: The space of Ws is astronomically large.

A parsimonious hypothesis search is needed that will not even consider the overwhelming number of possible candidates W and will examine only those word strings in some way suggested by the acoustics A. We will devote chapters 5 and 6 to showing two ways to proceed.

1.3.5 The Source-Channel Model of Speech Recognition

Our formulation leads to the schematic diagram of figure 1.3. The human speaker is shown as consisting of two parts: The source of the communication is his mind, which specifies the words W that will be pronounced by his vocal apparatus, the speech producer. The recognizer also consists of two parts, the acoustic processor and the linguistic decoder, the latter containing the acoustic and language models and the hypothesis search algorithm. Figure 1.3 embeds the process into the communication theory [10] framework: the source (modeled by the language model), the "noisy" channel (consisting of the tandem of the speech producer and the acoustic processor and modeled by the acoustic model), and the linguistic decoder.

The only difference between our communication situation and the standard one is that in the standard, the system designer can introduce an encoder and a modulator (signal generator) between the source and the channel. We must make do with the coder-modulator that evolution has bequeathed to us: human language and speech.

[Figure 1.3 Source-channel model of speech recognition. The speaker consists of the speaker's mind (the source), which produces the word string W, and the speech producer; the speech producer and the acoustic processor together form the acoustic channel, which delivers the acoustic evidence A; the speech recognizer consists of the acoustic processor and the linguistic decoder, which outputs the transcription Ŵ.]

1.4 About This Book

The previous section revealed the topics with which speech recognition research must be concerned. This book investigates methods that would accomplish the tasks outlined in sections 1.3.2 through 1.3.4. The headings of the above sections use the expression modeling because the required values of P(A|W) and P(W) can only be produced by evaluating probabilities of output strings generated by abstract models of corresponding processes. These models will have no more than a mathematical reality. No claims whatever can conceivably be made about their relation to humans' actual speech production or recognition.

For all practical purposes the rest of the book is devoted to the structure of models of progressively higher sophistication, to their parameters, and to the problem of estimating the parameter values from actual speech data.

The plan of this text is to present first (chapters 1 through 5) each basic component of a large vocabulary speech recognizer (acoustic model, language model, hypothesis search) and devote the rest of the book (chapters 6 through 15) to refinements.

1.5 Vector Quantization

We conclude this chapter by introducing a simple method that can be used to find appropriate vector prototypes 𝒫 = {p_1, p_2, ..., p_K} for the prototype storage of the acoustic processor of figure 1.2 (see the discussion of section 1.3.1) [11].

Consider the real vectors σ_i put out periodically by the signal processor. If their dimension is L, then they can be regarded as points in the real L-dimensional space. As speech is input to the signal processor, the space is being occupied by the points σ_i. If speech can be represented as a succession of phones [2] produced by the speaker (see the pronunciation specification of words in any dictionary), then the points σ_i will cluster in regions of the L-space characteristic of the particular phones the speaker produced. Therefore, if the prototypes p_j ∈ 𝒫 are selected to be the centers of the σ-clusters, then the nearest prototype to an observed σ_i will be an appropriate representative of σ_i. This in fact is the idea behind the output symbol selection rule (7) and (8).

Vector quantization is a simple and effective method for finding cluster centers. Along with the rule (7) it presupposes a distance measure d(·,·) between points of the L-space. It turns out that a very adequate measure

is the Euclidean distance

d(σ, p) = √( Σ_{l=1}^{L} (σ(l) − p(l))² )    (11)

where σ(l) and p(l) denote the l-th components of the vectors σ and p.

The Vector Quantization Algorithm⁸

1. Select K, the number of clusters.
2. Send speech through the signal processor (see figure 1.2), obtaining vectors σ_i for i = 1, 2, ..., N (the total amount of speech, proportional to N, must be judiciously chosen).⁹
3. Select uniformly at random K initial candidate cluster centers p_j^0 from among the speech vectors {σ_1, σ_2, ..., σ_N}, that is,

   P{p_j^0 = σ_i} = 1/N

4. Partition {σ_1, σ_2, ..., σ_N} into K regions 𝒮_j^0 comprising all σ_i nearer to p_j^0 than to any other p_h^0, h ≠ j. That is,

   σ_i ∈ 𝒮_j^0  if  d(σ_i, p_j^0) < d(σ_i, p_h^0) for all h ≠ j

5. Find the center of gravity p_j^1 of each collection 𝒮_j^0. That is,

   p_j^1 = arg min_p Σ_{σ_i ∈ 𝒮_j^0} d(σ_i, p)

6. Repartition {σ_1, σ_2, ..., σ_N} into K regions 𝒮_j^1 comprising all σ_i nearer to p_j^1 than to any other p_h^1, h ≠ j.
7. Find the center of gravity p_j^2 of each collection 𝒮_j^1.
8. And so on.

The particular method of finding clusters by alternately determining nearest neighbors of candidate centers and then locating neighborhood centers is also referred to as K-means clustering.
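The following sketch implements steps 1 through 8 (illustrative only: the sample vectors are random stand-ins for signal processor output, the iteration is simply stopped after a fixed number of passes, and each center of gravity is computed as the coordinate-wise mean of its region):

```python
import random

def euclidean2(x, y):
    # Squared Euclidean distance; the square root in (11) is monotone,
    # so the nearest center is the same either way.
    return sum((a - b) ** 2 for a, b in zip(x, y))

def vector_quantize(sigmas, K, passes=10, seed=0):
    """Steps 1-8 of the vector quantization (K-means) algorithm.

    sigmas: list of L-dimensional vectors sigma_i from the signal processor.
    Returns the K prototype vectors p_1, ..., p_K."""
    rng = random.Random(seed)
    centers = rng.sample(sigmas, K)            # step 3: random initial centers
    for _ in range(passes):                    # steps 4-8, repeated
        regions = [[] for _ in range(K)]
        for s in sigmas:                       # step 4: nearest-center regions
            j = min(range(K), key=lambda k: euclidean2(s, centers[k]))
            regions[j].append(s)
        for j, region in enumerate(regions):   # step 5: centers of gravity
            if region:                         # keep the old center if a region is empty
                centers[j] = tuple(sum(c) / len(region) for c in zip(*region))
    return centers

# Toy data: 200 two-dimensional "sigma" vectors (random stand-ins for speech).
rng = random.Random(1)
sigmas = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)] + \
         [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(100)]
prototypes = vector_quantize(sigmas, K=2)
print(prototypes)   # two centers, one near (0, 0) and one near (5, 5)
```

The returned prototypes would then be loaded into the prototype storage of figure 1.2 and used by the comparator through rules (7) and (8).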
