`
`
`
`
`Statistical Methods for
`Speech Recognition
`
Frederick Jelinek
`
`' ..
`
`I.
`
`'
`
`• \
`
`.-
`
`. . .
`' '
`
`..
`
`',
`
`i;
`
`,:,·.
`
`,I
`
`I
`
`• I'
`
`•
`. ,-t
`
`• I
`,,
`
`••
`. •.
`
`•
`
`.
`
`;
`
`• ••
`
`. 1
`
`1
`
`•
`
`. ' .
`. . .
`
`....
`
`,- .
`
`:1.••I·•
`
`•
`• 1.
`
`)
`.-.
`
`: ·'
`
`.,
`
`..
`..
`
`'
`
`• i '. t
`
`. ,,: . ' ' . :. .
`
`The MIT Press
Cambridge, Massachusetts
`London, England
`
`
`
`
`Third printing, 2001
`© 1997 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any
electronic or mechanical means (including photocopying, recording, or information
storage and retrieval) without permission in writing from the publisher.
`
`This book was set in Times New Roman by Asco Typesetters, Hong Kong.
`
`Printed and bound in the United States of America.
`
`Library of Congress Cataloging-in-Publication Data
`
`Jelinek, Frederick, 1932-
Statistical methods for speech recognition / Frederick Jelinek.
p. cm. - (Language, speech, and communication)
Includes bibliographical references and index.
ISBN 0-262-10066-5 (alk. paper)
1. Automatic speech recognition-Statistical methods. I. Title. II. Series.
TK7895.S65J45 1998
006.4'54-dc21

97-34860
CIP
`
`
`
`
`Contents
`
Preface  xix

Chapter 1  The Speech Recognition Problem  1
1.1 Introduction  1
1.2 A Mathematical Formulation  4
1.3 Components of a Speech Recognizer  5
1.3.1 Acoustic Processing  5
1.3.2 Acoustic Modeling  7
1.3.3 Language Modeling  8
1.3.4 Hypothesis Search  8
1.3.5 The Source-Channel Model of Speech Recognition  9
1.4 About This Book  9
1.5 Vector Quantization  10
1.6 Additional Reading  12
References  12

Chapter 2  Hidden Markov Models  15
2.1 About Markov Chains  15
2.2 The Hidden Markov Model Concept  17
2.3 The Trellis  19
2.4 Search for the Likeliest State Transition Sequence  21
2.5 Presence of Null Transitions  23
2.6 Dealing with an HMM That Has Null Transitions That Do Not Form a Loop  25
2.7 Estimation of Statistical Parameters of HMMs  27
2.8 Practical Need for Normalization  33
2.9 Alternative Definitions of HMMs  35
2.9.1 HMMs Outputting Real Numbers  35
2.9.2 HMM Outputs Attached to States  35
2.10 Additional Reading  36
References  37

Chapter 3  The Acoustic Model  39
3.1 Introduction  39
3.2 Phonetic Acoustic Models  40
3.3 More on Acoustic Model Training  43
3.4 The Effect of Context  44
3.5 Viterbi Alignment  45
3.6 Singleton Fenonic Base Forms  45
3.7 A Needed Generalization  47
3.8 Generation of Synthetic Base Forms  48
3.9 A Further Refinement  51
3.10 Singleton Base Forms for Words Outside the Vocabulary  52
3.11 Additional Reading  52
References  54

Chapter 4  Basic Language Modeling  57
4.1 Introduction  57
4.2 Equivalence Classification of History  59
4.3 The Trigram Language Model  60
4.4 Optimal Linear Smoothing  62
4.5 An Example of a Trigram Language Model  66
4.6 Practical Aspects of Deleted Interpolation  66
4.7 Backing-Off  69
4.8 HMM Tagging  70
4.9 Use of Tag Equivalence Classification in a Language Model  72
4.10 Vocabulary Selection and Personalization from Text Databases  73
4.11 Additional Reading  75
References  76

Chapter 5  The Viterbi Search  79
5.1 Introduction  79
5.2 Finding the Most Likely Word Sequence  79
5.3 The Beam Search  81
5.4 Successive Language Model Refinement Search  84
5.5 Search versus Language Model State Spaces  86
5.6 N-Best Search  86
5.7 A Maximum Probability Lattice  89
5.8 Additional Reading  90
References  90

Chapter 6  Hypothesis Search on a Tree and the Fast Match  93
6.1 Introduction  93
6.2 Tree Search versus Trellis (Viterbi) Search  95
6.3 A* Search  95
6.4 Stack Algorithm for Speech Recognition  97
6.5 Modifications of the Tree Search  99
6.6 Multiple-Stack Search  99
6.6.1 First Algorithm  100
6.6.2 A Multistack Algorithm  101
6.6.3 Actual Multistack Algorithm  102
6.7 Fast Match  103
6.8 The Cost of Search Shortcuts  109
6.9 Additional Reading  110
References  110

Chapter 7  Elements of Information Theory  113
7.1 Introduction  113
7.2 Functional Form of the Basic Information Measure  114
7.3 Some Mathematical Properties of Entropy  119
7.4 An Alternative Point of View and Notation  123
7.5 A Source-Coding Theorem  126
7.6 A Brief Digression  132
7.7 Mutual Information  132
7.8 Additional Reading  135
References  135

Chapter 8  The Complexity of Tasks-The Quality of Language Models  137
8.1 The Problem with Estimation of Recognition Task Complexity  137
8.2 The Shannon Game  139
8.3 Perplexity  141
8.4 The Conditional Entropy of the System  142
8.5 Additional Reading  144
References  145

Chapter 9  The Expectation-Maximization Algorithm and Its Consequences  147
9.1 Introduction  147
9.2 The EM Theorem  147
9.3 The Baum-Welch Algorithm  149
9.4 Real Vector Outputs of the Acoustic Processor  152
9.4.1 Development for Two Dimensions  152
9.4.2 The Generalization to k Dimensions  158
9.5 Constant and Tied Parameters  158
9.5.1 Keeping Some Parameters Constant  158
9.5.2 Tying of Parameter Sets  159
9.6 Tied Mixtures  161
9.7 Additional Reading  163
References  163

Chapter 10  Decision Trees and Tree Language Models  165
10.1 Introduction  165
10.2 Application of Decision Trees to Language Modeling  166
10.3 Decision Tree Example  166
10.4 What Questions?  168
10.5 The Entropy Goodness Criterion for the Selection of Questions, and a Stopping Rule  170
10.6 A Restricted Set of Questions  172
10.7 Selection of Questions by Chou's Method  173
10.8 Selection of the Initial Split of a Set 𝒮 into Complementary Subsets  176
10.9 The Two-ing Theorem  177
10.10 Practical Considerations of Chou's Method  179
10.10.1 Problem of 0s in the q-Distribution: The Gini Index  179
10.10.2 Equivalence Classification Induced by Decision Trees  182
10.10.3 Computationally Feasible Specification of Decision Tree Equivalence Classes  183
10.11 Construction of Decision Trees Based on Word Encoding  184
10.12 A Hierarchical Classification of Vocabulary Words  186
10.13 More on Decision Trees Based on Word Encoding  188
10.13.1 Implementing Hierarchical Word Classification  188
10.13.2 Predicting Encoded Words One Bit at a Time  189
10.13.3 Treatment of Unseen Training Data  190
10.13.4 Problems and Advantages of Word Encoding  190
10.14 Final Remarks on the Decision Tree Method  191
10.14.1 Smoothing  191
10.14.2 Fragmentation of Data  192
10.15 Additional Reading  193
References  194

Chapter 11  Phonetics from Orthography: Spelling-to-Base Form Mappings  197
11.1 Overview of Base Form Generation from Spelling  197
11.2 Generating Alignment Data  199
11.3 Decision Tree Classification of Phonetic Environments  201
11.4 Finding the Base Forms  204
11.5 Additional Reading  204
References  205

Chapter 12  Triphones and Allophones  207
12.1 Introduction  207
12.2 Triphones  208
12.3 The General Method  210
12.4 Collecting Realizations of Particular Phones  210
12.5 A Direct Method  211
12.6 The Consequences  214
12.7 Back to Triphones  215
12.8 Additional Reading  217
References  218

Chapter 13  Maximum Entropy Probability Estimation and Language Models  219
13.1 Outline of the Maximum Entropy Approach  219
13.2 The Main Idea  220
13.3 The General Solution  221
13.4 The Practical Problem  222
13.5 An Example  224
13.6 A Trigram Language Model  227
13.7 Limiting Computation  228
13.8 Iterative Scaling  231
13.9 The Problem of Finding Appropriate Constraints  233
13.10 Weighting of Diverse Evidence: Voting  234
13.11 Limiting Data Fragmentation: Multiple Decision Trees  236
13.11.1 Combining Different Knowledge Sources  236
13.11.2 Spontaneous Multiple Tree Development  238
13.12 Remaining Unsolved Problems  240
13.13 Additional Reading  241
References  242

Chapter 14  Three Applications of Maximum Entropy Estimation to Language Modeling  245
14.1 About the Applications  245
14.2 Simple Language Model Adaptation to a New Domain  246
14.3 A More Complex Adaptation  248
14.4 A Dynamic Language Model: Triggers  251
14.5 The Cache Language Model  253
14.6 Additional Reading  255
References  255

Chapter 15  Estimation of Probabilities from Counts and the Back-Off Method  257
15.1 Inadequacy of Relative Frequency Estimates  257
15.2 Estimation of Probabilities from Counts Using Held-Out Data  258
15.2.1 The Basic Idea  258
15.2.2 The Estimation  259
15.2.3 Deciding the Value of M  261
15.3 Universality of the Held-Out Estimate  262
15.4 The Good-Turing Estimate  263
15.5 Applicability of the Held-Out and Good-Turing Estimates  265
15.6 Enhancing Estimation Methods  268
15.6.1 Frequency Enhancement of Held-Out Estimation of Bigrams  268
15.6.2 Frequency Enhancement of Good-Turing Estimation of Bigrams  269
15.6.3 Other Enhancements  270
15.7 The Back-Off Language Model  271
15.8 Additional Reading  273
References  274

Name Index  275
Subject Index  279
`
`
`
`Chapter 1
`The Speech Recognition
`Problem
`
`1.1 Introduction
`
`A speech recognizer is a device that automatically transcribes speech into
`text. It can be thought of as a voice-actuated "typewriter" in which a
`computer program carries out the transcription and the transcribed text
`appears on a workstation display. The recognizer is usually based on some
`finite vocabulary that restricts the words that can be "printed" out. Until
`we state otherwise, the designation word denotes a word form defined by
`its spelling. Two differently spelled inflections or derivations of the same
`stem are considered different words (e.g., table and tables). Homographs
`having different parts of speech (e.g., absent [VERB] and absent [ADJECTIVE])
`or meanings (e.g., bank [FINANCIAL INSTITUTION] and bank [OF A RIVER])
`constitute the same word.
`Figure 1.1, the speech waveform corresponding to an utterance of the
`chess move "Bishop moves to king knight five," illustrates our problem.
`The waveform was produced at Stanford University in the late 1960s in a
speech recognition project undertaken by Raj Reddy [1].
`Because the field was then in its infancy, Reddy tried to give himself all
`the conceivable but fair advantages. Recognition of spoken chess moves
`could be based on a small vocabulary, a restricted syntax, and a relatively
`complete world knowledge: The system knew which chess moves were
`legal, so that, for instance, a recognition hypothesis corresponding to a
`move of a piece to a square occupied by another piece of the same color
`could immediately be rejected. The speech was recorded by a very good
microphone in a quiet environment, and the system was adjusted (as well
`as the state of the art allowed) to the speaker.
`
`
`
`
`2
`
`Chapter I
`
`• BISIO' rovES lO lll«i ._'NIGil FIVE -- PMiE 8.1
`'
`
`I
`I
`I
`I
`✓~1t~1;~~~1'~l,~
`I
`I
`I
`
`I
`
`• IISfa'
`
`,1gcp
`
`~~I I
`
`8l1HP
`
`819d
`
`8191P
`
`I BISKP
`
`6tSfO'
`
`8190'
`
`Bl!ita
`
`BISKP I BISKP
`
`61gq, Bl
`
`I I I I I
`
`I I :·
`
`I I I I
`
`I I
`
`
`
`I tr-tvS-.../S~S-S.J;:_s.......__s_....,s"--s~:s_s~s---s..,,....~-s-s_s'"'$'---S~• •••
`
`tCP
`•
`•
`
`lltld
`•
`• I•
`
`8190'
`•
`•
`•
`
`I ll&KP
`
`815KP
`
`1111 •
`
`•
`
`•
`
`•
`
`•
`
`•
`
`!IMS ~ 111\'ES 1 !'CMS ~s
`
`• • • . ,. • • • I. • • u u lu u
`~;
`
`I
`
`I
`
`8190'
`
`BIW
`
`• I( • • • ... -
`
`BIS,O,
`- I -
`
`BIS,O,
`
`I
`
`BIS,O,
`- 1::
`I
`
`I
`
`Bl!MP
`
`- .. -
`
`I
`
`BIW
`t
`
`Bl!Hf
`
`I
`!'CMS
`
`I
`rdvEs
`u u " u u u • u u u J u u u
`
`lfJVES I IIMS
`
`u u
`
`IIIVES
`u u u
`
`I
`I
`1
`!'CMS "¥5
`lfJVES I
`u u u u u u u u u
`I
`I
`I
`
`!'CMS
`t
`I
`I
`
`I
`I
`l1lVES
`z .; z
`I
`
`I
`rves
`-
`:r :r
`I
`I
`
`I
`I
`IIMS
`
`I
`r1lVES I IIMS
`
`I
`
`:r z z z • - - _1 - -
`
`IIMS
`
`-
`
`I
`IIMSI
`
`I
`I
`
`• TO l0
`- - •
`
`I
`I
`
`t
`
`t
`
`I
`
`I
`
`TO
`l
`t
`I t
`
`,
`
`• • • • • • • • . . - -
`
`48
`
`TO TOj TO TO I TO TO I TO TO TO TO I TO TO
`22
`
`68
`
`88
`
`188
`
`~llli Kiili I
`Kit«;
`(KIN; Kl~
`• Kit(;
`tk
`k k k k k k k k k k k k
`II k
`-
`1aa
`'12e
`·1,a
`'168
`
`Bl!JIJ' !IMS lO ICIIG ICNIQfT FIVE -- Pia. 8.2
`
`I
`I
`
`I
`
`·V1;
`
`\
`
`I
`
`klNi
`II I
`
`I
`
`I
`
`KINil Klt«i
`Kltci I Kl i«;
`Klt«i I Kit£ WI
`ICI~
`"'' ,c;
`I
`It GIG G G G 1G C C C ~ C C C C C C C 0
`I I
`I
`
`~~{~\,f,f~\~
`
`\cN1GHT KN!CHT KNICH
`'1 y y y YI Y Y Y YI
`
`I
`ICNIOO I ICNIGIT b.1CHT KNietfT
`ICNlr.HT KNIGHT I kNICHT
`'1 n n n n( * Y y y IY y y y y )( y y
`
`ICIC KIIC. I KIIii KIit.
`CIIG
`I I I
`• I I I I I I
`I
`I
`I
`
`ICINi
`I
`I
`
`I
`
`I
`
`• ICNfGKT l(NIGtff
`
`I - I
`G G . "'" " " " l'I " " " • n n n
`"fl· I
`
`I
`
`I
`
`I
`
`I
`
`kMIQfT I IDII GKT
`y y y t y y y
`
`dftCHT
`~ y y
`
`ICNICH(
`ICNl~tT
`y yl .y y y y I y
`I
`I
`
`KNIGHT I KNIGHT ~ICHT KNl&T
`J
`-
`.
`-
`Y. y y It -
`-
`-
`.I
`-
`
`Figure I.I
`The sentence "Bishop moves to king knight five" aligned with its speech waveform
`
`IPR2023-00037
`Apple EX1011 Page 16
`
`
`
`
In the early design strategy of the field, a recognizer would segment the speech into successive phones (basic pronunciation units [2]), then identify the particular phones corresponding to the segments, and finally transcribe the recognized phone strings into an English text.
Figure 1.1 aligns the speech waveform with the spoken words. Inspecting it, we can see that although the boundary between different speech events cannot be accurately placed, distinct events certainly appear in succession. So even though the boundaries are fuzzy, most of the segments seem to contain repeating nuclei that are candidates for recognition. Although this does not hold for stops1 like b (see the beginning of BISHOP) or k (beginning of KING), it does seem to hold for vowels such as i (BISHOP and KING) or u (the first vowel of MOVES). But inspecting figure 1.1 more closely, we encounter a very big problem: The i of KING looks much more like the u of MOVES than it does like the i of BISHOP! So visually similar waveforms do not necessarily indicate perceptually similar sounds. Actually, context is involved here, and the apparent waveform similarities are due to the influence of the nasality of the sound of ng that follows the i and of the sound of m that precedes the u sound.

Having now indicated an aspect of why speech recognition is not an easy task, we are ready to begin its serious consideration. In this chapter we will formulate the large vocabulary2 speech recognition problem mathematically, which will result in a recognizer's natural breakup into its components. The chapter will conclude with the first example of self-organization from data: the basic vector quantization algorithm [11] that can be used to transform the speech signal into a sequence of symbols from a relatively limited (and automatically selected!) alphabet.
`
`1. Stops, also called plosives, are sounds that consist of two parts, a stop portion
`and a release portion. English stops are b, d, g, k, p, t. [2]
`2. Although no hard and fast distinction between small and large vocabulary
`tasks exists, here are some examples of each:
`• Small vocabulary: recognition of digits, yes-no answers to questions, inventory
`control, etc.
• Large vocabulary: text creation by dictation, transcription of e-mail, transcription of telephone conversations, text creation for the handicapped.
`Small vocabulary tasks are of great economic importance. They are just not the
`subject of this book.
`
`
`
`
`4
`
`Chapter 1
`
`1.2 A Mathematical Formulation
`
To discuss the problem of speech recognizer design, we need its mathematical formulation. A precise statement of the problem leads directly to a fruitful decomposition into easier-to-treat subproblems. Our approach is statistical,3 so the formulation will involve probabilities. Here it is [3] [4]:
Let A denote the acoustic evidence (data) on the basis of which the recognizer will make its decision about which words were spoken. Because we are dealing with digital computers, we may assume without loss of generality that A is a sequence of symbols taken from some (possibly very large) alphabet 𝒜:

A = a_1, a_2, ..., a_m,   a_i ∈ 𝒜   (1)

The symbols a_i can be thought of as having been generated in time, as indicated by the index i.
Let

W = w_1, w_2, ..., w_n,   w_i ∈ 𝒱   (2)

denote a string of n words, each belonging to a fixed and known vocabulary 𝒱.
If P(W|A) denotes the probability that the words W were spoken, given that the evidence A was observed, then the recognizer should decide in favor of a word string Ŵ satisfying

Ŵ = arg max_W P(W|A)   (3)

That is, the recognizer will pick the most likely word string given the observed acoustic evidence.

Of course, underlying the target formula (3) is the tacit assumption that all words of a message are equally important to the user, that is, that misrecognition does not carry a different penalty depending on which word was misrecognized. Under this philosophy the warning "Fire!"
`
3. No advanced results of probability or statistical theory will be used in this self-contained text. The student is required simply to be comfortable with statistical concepts and be able to manipulate them intuitively. So although nothing in this text presumes more than the knowledge of the first four chapters of a book like [5], the required sophistication can probably be gained only by completing an entire course.
`
`
`
`
`
`carries no more importance in a crowded theater than an innocuous
`commercial announcement. But let that possible criticism pass. 4
The well known Bayes' formula of probability theory allows us to rewrite the right-hand side probability of (3) as

P(W|A) = P(W) P(A|W) / P(A)   (4)

where P(W) is the probability that the word string W will be uttered, P(A|W) is the probability that when the speaker says W the acoustic evidence A will be observed, and P(A) is the average probability that A will be observed. That is,

P(A) = Σ_{W'} P(W') P(A|W')   (5)

Since the maximization in (3) is carried out with the variable A fixed (there is no other acoustic data save the one we are given), it follows from (3) and (4) that the recognizer's aim is to find the word string Ŵ that maximizes the product P(W) P(A|W), that is, it satisfies

Ŵ = arg max_W P(W) P(A|W)   (6)
`
`1.3 Components of a Speech Recognizer
`
Formula (6) determines what processes and components are of concern in the design of a speech recognizer.

1.3.1 Acoustic Processing
First, it is necessary to decide what acoustic data A will be observed. That is, one needs to decide on a front end that will transform the pressure waveform (which is what sound is) into the symbols a_i with which the recognizer will deal. So in principle, this front end includes a microphone whose output is an electric signal, a means of sampling that signal, and a manner of processing the resulting sequence of samples.
`
4. Strictly speaking, formula (3) is appropriate only if we are after a perfect transcription of the utterance, that is, if one error is as bad as many. Were we to accept that errors are inevitable (which they certainly are) and aim explicitly at minimizing their number, a much more complex formula would be required. So our formula only approximates (it turns out very fruitfully) what we are intuitively after.
`
`
`
`
`6
`
`Chapter I
`
`-
`Speech
`
`Signal
`Processor
`
`-
`-
`CJ;
`
`Comparator
`
`-
`-
`a.
`'
`
`Pa,
`
`Prototype
`Storage
`
`Figure 1.2
`A schematic diagram of a recognizer front end (acoustic processor)
`
In this book, dedicated to statistical models of speech recognition, we will not address the front-end problem, except in the penultimate section of this chapter.5 For the present, we will assume that the alphabet 𝒜 is simply given. Those interested in front-end design should consult the many books and articles that thoroughly discuss signal processing [6] [7]. For the reader to gain some idea of what might be involved, however, we present in figure 1.2 a schematic diagram of a rudimentary front end (acoustic processor).
We can think of the signal processor as a device that at regular intervals of time (e.g., a hundred times per second6) generates real-valued vectors σ_i. The components of σ_i could be sample values of outputs of band-pass filters applied to the signal coming out of the microphone.
The prototype storage contains a set of vector prototypes 𝒫 = {p_1, p_2, ..., p_K} of the same kind as σ_i. The comparator finds the closest element of 𝒫 to σ_i, and the index of that element is the acoustic symbol a_i. To be precise,

ĵ = arg min_{j=1,...,K} d(σ_i, p_j)   (7)

and

a_i = ĵ   (8)

In (7), d( , ) denotes a suitable distance function.7

5. Obviously, good signal processing is crucial and is the subject of intensive research. This book is about extracting information from the processed signal. Bad processing means loss of information: There is less of it to extract.
6. This happens to be the prevalent standard in our field.
In the penultimate section of this chapter we introduce a simple method, called vector quantization, that can derive the prototype set 𝒫 = {p_1, p_2, ..., p_K} from speech data. There, the intuition behind the output symbol selection rule (7) and (8) will become apparent.
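The selection rule (7)-(8) itself amounts to a nearest-neighbor lookup. A minimal Python sketch follows (not from the book; the prototype values are invented, and the Euclidean distance of note 7 is assumed):

def euclidean(x, y):
    # Distance of note 7: d(x, y) = sqrt(sum over l of (x_l - y_l)^2).
    return sum((xl - yl) ** 2 for xl, yl in zip(x, y)) ** 0.5

def quantize(sigma, prototypes):
    # Rule (7)-(8): the output symbol a_i is the index j of the prototype
    # p_j closest to the signal-processor output vector sigma_i.
    return min(range(len(prototypes)), key=lambda j: euclidean(sigma, prototypes[j]))

prototypes = [(0.0, 0.0), (1.0, 1.0), (5.0, 2.0)]   # invented 2-dimensional prototypes
print(quantize((0.9, 1.2), prototypes))             # -> 1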
In most state-of-the-art large vocabulary recognition systems the comparator of figure 1.2 is omitted and the rest of the recognizer handles directly the signal processor outputs σ_i. They then constitute the observable symbols a_i. To introduce the relevant methods in the simplest setting, however, we will assume an acoustic symbol alphabet size of the order of hundreds (the size 200 is very common), which will still allow us to deal with the essence of the problem. We will generalize certain important results to "continuous" vector spaces (for signal processor outputs σ_i) in chapter 9. The added complication is that statistical estimation for vector spaces requires parametric methods. Section 2.9.1 of the next chapter will briefly introduce an appropriate model.
`
`1.3.2 Acoustic Modeling
Returning now to formula (6), the recognizer needs to be able to determine the value P(A|W) of the probability that when the speaker uttered the word sequence W the acoustic processor produced the data A. Since this number must be made available for all possible pairings of W with A, it follows that it must be computable "on the fly." The number of different possible values of A and W is just too large to permit a lookup.
Thus to compute P(A|W) we need a statistical acoustic model of the speaker's interaction with the acoustic processor. The total process we are modeling involves the way the speaker pronounces the words of W, the ambience (room noise, reverberation, etc.), the microphone placement and characteristics, and the acoustic processing performed by the front end.
The usual acoustic model employed in speech recognizers, the hidden Markov model, will be discussed in the next chapters. Other models are possible, for instance those based on artificial neural networks [8] or on dynamic time warping [9]. These methods are not treated in this book.
`
7. A very adequate distance is Euclidean: d(x, y) = √( Σ_i (x_i − y_i)² ).
`
`
`
`
`8
`
`Chapter I
`
`1.3.3 Language Modeling
Formula (6) further requires that we be able to compute for every word string W the a priori probability P(W) that the speaker wishes to utter W. Bayes' formula allows many decompositions of P(W), but because the recognizer "naturally" wishes to convey the text in the sequence in which it was spoken, we will use the decomposition

P(W) = Π_{i=1}^{n} P(w_i | w_1, ..., w_{i-1})   (9)

The recognizer must thus be able to determine estimates of the probabilities P(w_i | w_1, ..., w_{i-1}). We use the term estimate on purpose, because even for moderate values of i and vocabularies of reasonable size, the probability P(w_i | w_1, ..., w_{i-1}) has just too many arguments. In fact, if |𝒱| denotes the size of the vocabulary, then for |𝒱| = 20,000 and i = 3, the number of arguments is 8 × 10^12.
It is, of course, absurd to think that the speaker's choice of his ith word depends on the entire history w_1, ..., w_{i-1} of all of his previous speech. It is therefore natural that for purposes of the choice of w_i, the history be put into equivalence classes Φ(w_1, ..., w_{i-1}). Thus in reality formula (9) becomes

P(W) = Π_{i=1}^{n} P(w_i | Φ(w_1, ..., w_{i-1}))   (10)

and the art of language modeling consists of determining the appropriate equivalence classification Φ and a method of estimating the probabilities P(w_i | Φ(w_1, ..., w_{i-1})).
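A minimal Python sketch of formula (10) may help fix ideas. It is not from the book; it assumes the trigram-style equivalence classification Φ(w_1, ..., w_{i-1}) = (w_{i-2}, w_{i-1}) that chapter 4 develops, and the probability table is invented for illustration.

def phi(history):
    # Equivalence classification: keep only the last two words of the
    # history, padding short histories with a boundary marker.
    padded = ["<s>", "<s>"] + list(history)
    return tuple(padded[-2:])

# Invented table of P(w_i | Phi(history)) values, for the example only.
cond_prob = {
    (("<s>", "<s>"), "bishop"): 0.01,
    (("<s>", "bishop"), "moves"): 0.30,
    (("bishop", "moves"), "to"): 0.50,
}

def sentence_probability(words, p=cond_prob, unseen=1e-6):
    # Formula (10): P(W) = product over i of P(w_i | Phi(w_1, ..., w_{i-1})).
    prob = 1.0
    for i, w in enumerate(words):
        prob *= p.get((phi(words[:i]), w), unseen)
    return prob

print(sentence_probability(["bishop", "moves", "to"]))  # 0.01 * 0.30 * 0.50 = 0.0015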
`It is worth stressing that the language model used should depend on
`the use to which the recognizer will be put. The transcription of dictated
`radiological reports requires different language models than the writing
`of movie reviews. If text is to be produced, then the language model may
`reasonably be constructed by processing examples of corresponding
`written materials. It will then depend on text only and not in any way on
`speech.
`
`1.3.4 Hypothesis Search
Finally, to find the desired transcription Ŵ of the acoustic data A by formula (6), we must search over all possible word strings W to find the
`maximizing one. This search cannot be conducted by brute force: The
`space of Ws is astronomically large.
`
`
`
`
`
`A parsimonious hypothesis search is needed that will not even consider
`the overwhelming number of possible candidates W and will examine
`only those word strings in some way suggested by the acoustics A. We will
`devote chapters 5 and 6 to showing two ways to proceed.
`
`1.3.5 The Source-Channel Model of Speech Recognition
Our formulation leads to the schematic diagram of figure 1.3. The human speaker is shown as consisting of two parts: The source of the communication is his mind, which specifies the words W that will be pronounced by his vocal apparatus, the speech producer. The recognizer also consists of two parts, the acoustic processor and the linguistic decoder, the latter containing the acoustic and language models and the hypothesis search algorithm. Figure 1.3 embeds the process into the communication theory [10] framework: the source (modeled by the language model), the "noisy" channel (consisting of the tandem of the speech producer and the acoustic processor and modeled by the acoustic model), and the linguistic decoder.
The only difference between our communication situation and the standard one is that in the standard, the system designer can introduce an encoder and a modulator (signal generator) between the source and the channel. We must make do with the coder-modulator that evolution has bequeathed to us: human language and speech.
`
`------------------------------------,
`
`________________
`
`Speaker's
`Mind
`
`J _____________
`I
`I
`I
`I
`I
`I
`I
`
`Speech
`Producer
`
`. .
`w:
`
`I
`
`I
`I
`
`~
`
`I
`I
`I
`
`I
`I
`I
`J
`I
`
`I
`I
`I
`
`I ~
`Acoustic
`.
`Processor
`I Speech :
`
`I
`I
`,-------------➔----------------~
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`'
`I
`I
`I A
`I
`I
`I
`I
`I
`I
`I
`
`Linguistic
`Decoder
`
`I w
`
`I A
`
`I
`I
`I
`I
`I
`
`I
`I
`I
`
`~-------------~----------------·
`-----------------L-------------•
`Speech Recognizer
`I L-----------------------------------'
`Speaker
`Acoustic Channel
`
`I
`
`Figure 1.3
`Source-channel model of speech recognition
`
`L
`
`r
`
`
`
`
`10
`
`1.4 About This Book
`
`
The previous section revealed the topics with which speech recognition research must be concerned. This book investigates methods that would accomplish the tasks outlined in sections 1.3.2 through 1.3.4. The headings of the above sections use the expression modeling because the required values of P(A|W) and P(W) can only be produced by evaluating probabilities of output strings generated by abstract models of corresponding processes. These models will have no more than a mathematical reality. No claims whatever can conceivably be made about their relation to humans' actual speech production or recognition.
`For all practical purposes the rest of the book is devoted to the structure
`of models of progressively higher sophistication, to their parameters, and
`to the problem of estimating the parameter values from actual speech data.
The plan of this text is to present first (chapters 1 through 5) each basic component of a large vocabulary speech recognizer (acoustic model,
`language model, hypothesis search) and devote the rest of the book
`(chapters 6 through 15) to refinements.
`
`1.5 Vector Quantization
`
We conclude this chapter by introducing a simple method that can be used to find appropriate vector prototypes 𝒫 = {p_1, p_2, ..., p_K} for the prototype storage of the acoustic processor of figure 1.2 (see the discussion of section 1.3.1) [11].
Consider the real vectors a_i put out periodically by the signal processor. If their dimension is L, then they can be regarded as points in the real L-dimensional space. As speech is input to the signal processor, the space is being occupied by the points a_i. If speech can be represented as a succession of phones [2] produced by the speaker (see the pronunciation specification of words in any dictionary), then the points a_i will cluster in regions of the L-space characteristic of the particular phones the speaker produced. Therefore, if the prototypes p_j ∈ 𝒫 are selected to be the centers of the a-clusters, then the nearest prototype to an observed a_i will be an appropriate representative of a_i. This in fact is the idea behind the output symbol selection rule (7) and (8).
Vector quantization is a simple and effective method for finding cluster centers. Along with the rule (7) it presupposes a distance measure d( , ) between points of the L-space. It turns out that a very adequate measure is the Euclidean distance

d(a, p) = √( Σ_{l=1}^{L} (a(l) − p(l))² )   (11)

where a(l) and p(l) denote the l-th components of the vectors a and p.
The Vector Quantization Algorithm8
1. Select K, the number of clusters.
2. Send speech through the signal processor (see figure 1.2) obtaining vectors a_i for i = 1, 2, ..., N (the total amount of speech, proportional to N, must be judiciously chosen).9
3. Select uniformly at random K initial candidate cluster centers p_j^0 from among the speech vectors {a_1, a_2, ..., a_N}, that is, P{p_j^0 = a_i} = 1/N.
4. Partition {a_1, a_2, ..., a_N} into K regions 𝒮_j^0 comprising all a_i nearer to p_j^0 than to any other p_h^0, h ≠ j. That is, a_i ∈ 𝒮_j^0 if d(a_i, p_j^0) < d(a_i, p_h^0) for all h ≠ j.
5. Find the center of gravity p_j^1 of each collection 𝒮_j^0. That is, p_j^1 = arg min_p Σ_{a_i ∈ 𝒮_j^0} d(a_i, p).
6. Repartition {a_1, a_2, ..., a_N} into K regions 𝒮_j^1 comprising all a_i nearer to p_j^1 than to any other p_h^1, h ≠ j.
7. Find the center of gravity p_j^2 of each collection 𝒮_j^1.
8. And so on.
`
`.
`
The particular method of finding clusters by alternately determining nearest neighbors of candidate centers and then locating neighborhood centers is also referred to as K-means clustering.
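The steps above translate directly into code. The following Python sketch is a minimal illustration only (not from the book): it assumes the Euclidean distance (11), takes the arithmetic mean of a region as its center of gravity, and simply runs a fixed number of iterations rather than testing for convergence.

import random

def dist_sq(a, p):
    # Squared Euclidean distance; the nearest center under (11) is the
    # same as the nearest center under the squared distance.
    return sum((al - pl) ** 2 for al, pl in zip(a, p))

def vector_quantize(vectors, K, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(vectors, K)          # step 3: random initial centers
    for _ in range(iterations):               # steps 4-8: alternate the two phases
        regions = [[] for _ in range(K)]
        for a in vectors:                     # partition step (steps 4 and 6)
            j = min(range(K), key=lambda k: dist_sq(a, centers[k]))
            regions[j].append(a)
        for j, region in enumerate(regions):  # center-of-gravity step (steps 5 and 7)
            if region:
                dim = len(region[0])
                centers[j] = tuple(sum(a[l] for a in region) / len(region)
                                   for l in range(dim))
    return centers

# Tiny invented example: two obvious clusters in two dimensions.
data = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
print(vector_quantize(data, K=2))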