`
`Efficient Algorithms for Speech Recognition
`
`Mosur K. Ravishankar
`May 15, 1996
`CMU-CS-96-143
`
`Carnegie
`Mellon
`
`510.7808
`C28R
`96-143
`
`IPR2023-00037
`Apple EX1007 Page 1
`
`
`
`University Libraries
`Carnegie Mellon Universitr
`Pittsburgh PA 15213-389CJ
`
`IPR2023-00037
`Apple EX1007 Page 2
`
`
`
`Efficient Algorithms for Speech Recognition
`
`Mosur K. Ravishankar
`May 15, 1996
`CMU-CS-96-143
`
`School of Computer Science
`Computer Science Division
`Carnegie Mellon University
`Pittsburgh, PA 15213
`
`Submitted in partial fulfillment of the requirements
`for the degree of Doctor of Philosophy.
`
`Thesis Committee:
`
`Roberto Bisiani, co-chair (University of Milan)
`Raj Reddy, co-chair
`Alexander Rudnicky
`Richard Stern
`Wayne Ward
`
`© 1996 Mosur K Ravishankar
`
`This research was supported by the Department of the Navy, Naval Research Laboratory under
`Grant No. N00014-93-1-2005. The views and conclusions contained in this document are those of
`the author and should not be interpreted as representing
`the official policies. either expressed or
`implied, of the U.S. government.
`
`IPR2023-00037
`Apple EX1007 Page 3
`
`
`
`• eg1e
`llon
`
`. School of Computer Science
`
`DOCTORAL THESIS
`in the field of
`Computer Science
`
`Efficient Algorithms for Speech Recognition
`
`MOSUR K. RA VISHANKAR
`
`Submitted in Partial Fulfillment of the Requirements
`for the Degree of Doctor of Philosophy
`
`ACCEPTED:
`
`\
`
`<(ffiis1s COMMITTEE CHAIR
`
`TIIESIS COMMITI"EE CHAIR
`
`DEPARTMENT HEAD
`
`APPROVED:
`
`~/3u/%
`r
`
`7
`
`i--:>o-qb
`
`~/,7 /9b
`
`DATE
`
`DATE
`
`DATE
`
`DEAN
`
`DATE
`
`IPR2023-00037
`Apple EX1007 Page 4
`
`
`
`Abstract
`
`Advances in speech technology and computing power have created a surge of
`interest in the practical application of speech recognition. However, the most accurate
`speech recognition systems in the research world are still far too slow and expensive to
`be used in practical, large vocabulary continuous speech applications. Their main goal
`has been recognition accuracy, with emphasis on acoustic and language modelling.
`But practical speech recognition also requires the computation to be carried out in
`real time within the limited resources-CPU
`power and memory size-of commonly
`available computers. There has been relatively little work in this direction while
`preserving the accuracy of research systems.
`
`In this thesis, we focus on efficient and accurate speech recognition. It is easy to
`improve recognition speed and reduce memory requirements by trading away accu(cid:173)
`racy, for example by greater pruning, and using simpler acoustic and language models.
`It is much harder to improve both the recognition speed and reduce main memory
`size while preserving the accuracy.
`
`This thesis presents several techniques for improving the overall performance of
`the CMU Sphinx-II system. Sphinx-II employs semi-continuous hidden Markov mod(cid:173)
`els for acoustics and trigram language models, and is one of the premier research
`systems of its kind. The-techniques in this thesis are validated on several widely used
`benchmark test sets using two vocabulary sizes of about 20K and 58K words.
`
`The main contributions of this thesis are an 8-fold speedup and 4-fold memory size
`reduction over the baseline Sphinx~II system. The improvement in speed is obtained
`from the following techniques: lexical tree search, phonetic fast match heuristic, and
`global best path search of the word lattice. The gain in speed from the tree search is
`about a factor of 5. The phonetic fast match heuristic speeds up the tree search by
`another factor of 2 by finding the most likely candidate phones active at any time.
`Though the tree search incurs some loss of accuracy, it also produces compact word
`lattices with low error rate which can be rescored for accuracy. Such a rescoring is
`combined with the best path algorithm to find a globally optimum path through a
`word lattice. This recovers the original accuracy of the baseline system. The total
`recognition time is about 3 times real time for the 20K task on a 175MHz DEC Alpha
`workstation.
`
`The memory requirements of Sphinx-II are minimized by reducing the sizes of
`the acoustic and language models. The language model is maintained on disk and
`bigrams and trigrams are read in on demand. Explicit software caching mechanisms
`effectively overcome the disk access latencies. The acoustic model size is reduced by
`simply truncating precision of probability values to 8 bits. Several other engineering
`solutions, not explored in this thesis, can be applied to reduce memory requirements
`further. The memory size for the 20K task is reduced to about 30-40MB.
`
`IPR2023-00037
`Apple EX1007 Page 5
`
`
`
`11
`
`IPR2023-00037
`Apple EX1007 Page 6
`
`IPR2023-00037
`Apple EX1007 Page 6
`
`
`
`Acknowledgements
`
`I cannot overstate the debt I owe to Roberto Bisiani and Raj Reddy. They have
`not only helped me and given me every opportunity to extend my professional career.
`but also helped me through personal difficulties as well. It is quite remarkable that I
`have landed not one but two advisors that combine integrity towards research with a
`human touch that transcends the proverbial hard-headedness of science. One cannot
`hope for better mentors than them. Alex Rudnicky, Rich Stern, and Wayne Ward,
`all have a clarity of thinking and self-expression that simply amazes me without end.
`They have given me the most insightful advice. comments, and questions that I could
`have asked for. Thank you, all.
`The C:\1u speech group has been a pleasure to work with. First of all, I would
`like to thank some former and current members, :-.1ei-Yuh Hwang, Fil Alleva, Lin
`Chase, Eric Thayer, Sunil Issar, Bob Weide, and Roni Rosenfeld. They have helped
`me through the early stages of my induction into the group, and later given invaluable
`support in my work. I'm fortunate to have inherited the work of 1vlei-Yuh and Fil.
`Lin Chase has been a great friend and sounding board for ideas through these years.
`Eric has been all of that and a great officemate. I have learnt a lot from discussions
`with Paul Placeway. The rest of the speech group and the robust gang has made it a
`most lively environment to work in. I hope the charge continues through Sphinx-III
`and beyond.
`I have spent a good fraction of my life in the C:-.1U-CS community so far. It has
`been, and still is, the greatest intellectual environment. The spirit of cooperation, and
`informality of interactions as simply unique. I would like to acknowledge the support
`of everyone I have ever come to know here, too many to name, from the Warp and
`Kectar days until now. The administrative folks have always succeeded in blunting
`the edge off a difficult day. You never know what nickname Catherine Copetas will
`christen you with next. And Sharon Burks has always put up with all my antics.
`It goes without saying that I owe everything to my parents. I have had tremendous
`support from my brothers, and some very special uncles and aunts. In particular, I
`must mention the fun rve had with my brother Kuts. I would also like to acknowledge
`K. Gopinath's help during my stay in Bangalore. Finally, "BB", who has suffered
`through my tantrums on bad days, kept me in touch with the rest of the world, has a
`most creative outlook on the commonplace, can drive me nuts some days, but ·when
`all is said and done, is a most relaxed and comfortable person to have around.
`Last but not least, I would like to thank Andreas Kowatzyk, Monica Lam, Duane
`~orthcutt and Ray Clark. It has been my good fortune to witness and participate in
`some of Andreas's creative work. This thesis owes a lot to his unending support and
`encouragement.
`
`Ill
`
`IPR2023-00037
`Apple EX1007 Page 7
`
`
`
`1V
`
`IPR2023-00037
`Apple EX1007 Page 8
`
`IPR2023-00037
`Apple EX1007 Page 8
`
`
`
`Contents
`
`Abstract
`
`Acknowledgements
`
`1
`
`Introduction
`
`1.1 The Modelling Problem
`
`1.2 The Search Problem
`
`1.3 Thesis Contributions
`
`1.3.1
`
`Improving Speed
`
`1.3.2 Reducing Memory Size .
`
`1.4 Summary and Dissertation Outline
`
`2 Background
`
`2.1 Acoustic Modelling
`
`.....
`
`.
`
`2.1. l Phones and Triphones
`
`2.1.2 HMM modelling of Phones and Triphones
`
`2.2 Language Modelling
`
`2.3 Search Algorithms
`
`.
`
`2.3.l Viterbi Beam Search
`
`2.4 Related Work
`
`. . . . . . . .
`
`2.4.1 Tree Structured Lexicons .
`
`2.4.2 Memory Size and Speed Improvements in Whisper
`
`2.4.3 Search Pruning Using Posterior Phone Probabilities
`
`V
`
`111
`
`1
`
`3
`
`5
`
`7
`
`8
`
`8
`
`9
`
`11
`
`11
`
`11
`
`12
`
`13
`
`15
`
`15
`
`17
`
`17
`
`19
`
`20
`
`IPR2023-00037
`Apple EX1007 Page 9
`
`
`
`2.4.4 Lower Complexity Viterbi Algorithm
`
`2.5 Summary
`
`...........
`
`.
`
`3 The Sphinx-II Baseline System
`
`3.1 Knowledge Sources
`
`..
`
`3.1.l Acoustic Model
`
`3.1.2 Pronunciation Lexicon
`
`3.2
`
`Forward Beam Search
`
`. . . .
`
`3.2.l
`
`Flat Lexical Structure
`
`3.2.2
`
`Incorporating the Language Model
`
`3.2.3 Cross-Word Triphone Modeling
`
`3.2.4 The Forward Search
`
`3.3
`
`Backward and A* Search . .
`
`3.3.l Backward Viterbi Search .
`
`3.3.2 A* Search ........
`
`.
`
`3.4 Baseline Sphinx-II System Performance .
`
`3.4.l Experimentation Methodology .
`
`3.4.2
`
`3.4.3
`
`Recognition Accuracy
`
`Search Speed
`
`3.4.4
`
`Memory Usage
`
`3.5 Baseline System Summary
`
`4 Search Speed Optimization
`
`4.1 Motivation
`
`.....
`
`4.2 Lexical Tree Search
`
`4.2.1 Lexical Tree Construction
`
`4.2.2
`
`Incorporating Language Model Probabilities
`
`4.2.3 Outline of Tree Search Algorithm
`
`4.2.4 Performance of Lexical Tree Search
`
`4.2.5 Lexical Tree Search Summary
`
`...
`
`Vl
`
`20
`21
`
`22
`
`24
`
`24
`
`26
`
`26
`
`26
`
`27
`
`28
`
`31
`
`36
`37
`37
`
`38
`39
`
`41
`
`42
`
`45
`
`48
`
`49
`
`49
`
`51
`
`54
`
`56
`61
`
`62
`
`67
`
`IPR2023-00037
`Apple EX1007 Page 10
`
`
`
`4.3 Global Best Path Search
`
`. . . . . .
`
`4.3.1 Best Path Search Algorithm
`
`4.3.2 Performance
`
`...•......
`
`4.3.3 Best Path Search Summary
`
`4.4 Rescoring Tree-Search Word Lattice .
`
`4.4.1 Motivation ..
`
`4.4.2 Performance
`
`.
`
`4.4.3 Summary
`
`..
`
`4.5 Phonetic Fast Match
`
`4.5.1 Motivation ..
`
`4.5.2 Details of Phonetic Fast Match
`
`4.5.3 Performance of Fast Match Using All Senones
`
`4.5.4 Performance of Fast Match Using CI Senones
`
`4.5.5 Phonetic Fast Match Summary
`
`4.6 Exploiting Concurrency
`
`4.6.1 Multiple Levels of Concurrency
`
`4.6.2 Parallelization Summary
`
`. . . .
`
`4.7 Summary of Search Speed Optimization
`
`5 Memory Size Reduction
`
`5.1 Senone Mixture Weights Compression .
`
`5.2 Disk-Based Language Models
`
`. . . . .
`
`5.3 Summary of Experiments on Memory Size
`
`6 Small Vocabulary
`
`Systems
`
`6.1 General Issues .....
`
`6.2 Performance on ATIS .
`
`6.2.1 Baseline System Performance
`
`6.2.2 Performance of Lexical Tree Based System
`
`6.3 Small Vocabulary Systems Summary
`
`......
`
`.
`
`vii
`
`68
`
`68
`
`73
`
`74
`
`76
`76
`
`76
`
`78
`
`78
`
`78
`
`80
`
`84
`87
`
`88
`
`89
`90
`
`93
`
`93
`
`97
`
`97
`
`98
`
`100
`
`101
`
`101
`
`102
`
`102
`
`103
`
`106
`
`IPR2023-00037
`Apple EX1007 Page 11
`
`
`
`7 Conclusion
`
`7.1 Summary of Results
`
`7.2 Contributions
`
`....
`
`7.3 Future Work on Efficient Speech Recognition .
`
`Appendices
`
`A The Sphinx-II Phone Set
`
`B Statistical Significance Tests
`
`B.l 20K Task
`
`B. 2 5,8K Task
`
`Bibliography
`
`107
`
`108
`
`109
`
`111
`
`115
`
`116
`
`117
`
`121
`
`125
`
`Vil!
`
`IPR2023-00037
`Apple EX1007 Page 12
`
`
`
`List of Figures
`
`2.1 Viterbi Search as Dynamic Programming
`
`3.1 Sphinx-II Signal Processing Front End ......
`
`.
`
`3.2 Sphinx-II HMM Topology: 5-State Bakis Model. .
`
`3.3 Cross-word Triphone Modelling at Word Ends in Sphinx-II ..
`
`3.4 Word Initial Triphone HMM Modelling in Sphinx-II.
`
`3.5 One Frame of Forward Viterbi Beam Search in the Baseline System.
`
`3.6 Word Transitions in Sphinx-II Baseline System.
`
`3. 7 Outline of A* Algorithm in Baseline System
`
`. .
`
`3.8 Language Model Structure in Baseline Sphinx-II System.
`
`4.1 Basephone Lexical Tree Example.
`
`4.2 Tripbone Lexical Tree Example.
`
`.
`
`4.3 Cross-Word Transitions With Flat and Tree Lexicons.
`
`4.4 Auxiliary Flat Lexical Structure for Bigram Transitions.
`
`4.5 Path Score Adjustment Factor f for Word wi Upon Its Exit.
`
`4.6 One Frame of Forward Viterbi Beam Search in Tree Search Algorithm.
`
`4. 7 Word Lattice for Utterance: Take Fidelity's case as an example.
`
`4.8 Word Lattice Example Represented as a DAG. . . . . . .
`
`4.9 Word Lattice DAG Example Using a Trigram Gram.mar.
`
`4.10 Suboptimal Usage of Trigrams in Sphinx-II Viterbi Search.
`
`4.11 Base Phones Predicted by Top Scoring Senones in Each Frame; Speech
`Fragment for Phrase THIS TREND. Pronounced DH-IX-S T-R-EH-
`. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
`N-DD.
`
`15
`
`24
`
`25
`
`29
`
`31
`
`33
`
`35
`
`38
`
`46
`
`52
`
`55
`
`57
`
`58
`
`59
`
`63
`
`69
`
`70
`
`71
`
`73
`
`81
`
`ix
`
`IPR2023-00037
`Apple EX1007 Page 13
`
`
`
`4.12 Position of Correct Phone in Ranking Created by Phonetic Fast Match. 82
`
`4.13 Lookahead Window for Smoothing the Active Phone List.
`
`4.14 Phonetic Fast Match Performance Using All Senones ( 20K Task).
`
`4.15 Word Error Rate vs Recognition Speed of Various Systems ..
`
`4.16 Configuration of a Practical Speech Recognition System.
`
`. .
`
`83
`
`85
`
`94
`
`95
`
`X
`
`IPR2023-00037
`Apple EX1007 Page 14
`
`
`
`List of Tables
`
`3.1 No. of Words and Sentences in Each Test Set
`
`. . . . . . .
`
`3.2 Percentage Word Error Rate of Baseline Sphinx-II System.
`
`3.3 Overall Execution Times of Baseline Sphinx-II System (xRealTime).
`
`.
`
`3.4 Baseline Sphinx-II System Forward Viterbi Search Execution Times
`{xRealTime).
`. . . . . . . . . . . . . . . . . . . . . . . . .
`
`3.5 HMMs Evaluated Per Frame in Baseline Sphinx-II System.
`
`. .
`
`3.6 N-gram Transitions Per Frame in Baseline Sphinx-II System ..
`
`4.1 No. of Nodes at Each Level in Tree and Flat Lexicons.
`
`4.2 Execution Times for Lexical Tree Viterbi Search.
`
`4.3 Breakdown of Tree Viterbi Search Execution Times (xRealTime).
`
`4.4 No. of HMMs Evaluated Per Frame in Lexical Tree Search. . . . .
`
`4.5 No. of Language Model Operations/Frame
`
`in Lexical Tree Search.
`
`4.6 Word Error Rates for Lexical Tree Viterbi Search.
`
`. . . . . . . . .
`
`4. 7 Word Error Rates from Global Best Path Search of Word Lattice Pro-
`duced by Lexical Tree Search. . . . . . . . . . . . . . . . . . . . .
`
`4.8 Execution Times for Global Best Path DAG Search (x RealTime).
`
`.
`
`4.9 Word Error Rates From Lexical Tree+Rescoring+Best Path Search.
`
`4.10 Execution Times With Rescoring Pass.
`
`. . . . . . . . . . . . . . . .
`
`4.11 Fast Match Using All Senones; Lookahead Window=3 (20]{ Task). .
`
`4.12 Fast Match Using All Senones; Lookahead Window=3 (58KTask)..
`
`4.13 Fast Match Using CI Senones; Lookahead Window=3 ..
`
`40
`
`41
`
`43
`
`43
`
`44
`
`45
`
`55
`
`64
`
`65
`
`65
`
`65
`
`66
`
`74
`
`74
`
`77
`
`77
`
`86
`
`87
`
`88
`
`6.1 Baseline System Performance on ATIS.
`
`103
`
`Xl
`
`IPR2023-00037
`Apple EX1007 Page 15
`
`
`
`6.2 Ratio of Number of Root HMMs in Lexical Tree and Words in Lexicon
`103
`. . . . . .
`(approximate).
`6.3 Execution Times on ATIS. . . . . . . . . . . . . . . . . . . . . . . . . 104
`6.4 Breakdown of Tree Search Execution Times on A.TIS (Without Pho-
`. . . . . . .
`netic Fast Match).
`6.5 Recognition Accuracy on A.TIS.
`
`104
`
`105
`
`A.l The Sphinx-II Phone Set.
`
`115
`
`IPR2023-00037
`Apple EX1007 Page 16
`
`
`
`Chapter
`
`1
`
`Introduction
`
`Recent advances in speech technology and computing power have created a surge
`of interest in the practical application of speech recognition. Speech is the primary
`mode of communication among humans. Our ability to communicate with machines
`and computers, through keyboards, mice and other devices, is an order of magnitude
`In order to make this communication more user(cid:173)
`slower and more cumbersome.
`friendly, speech input is an essential component.
`
`There are broadly three classes of speech recognition applications, as described
`in [53]. In isolated word recognition systems each word is spoken with pauses before
`and after it, so that end-pointing techniques can be used to identify word boundaries
`reliably. Second, highly constrained command-and-control applications use small vo(cid:173)
`cabularies, limited to specific phrases, but use connected word or continuous speech.
`Finally, large vocabulary continuous speech systems have vocabularies of several tens
`of thousands of words, and sentences can be arbitrarily long, spoken in a natural fash(cid:173)
`ion. The last is the most user-friendly but also the most challenging to implement.
`However, the most accurate speech recognition systems in the research world are still
`far too slow and expensive to be used in practical, large vocabulary continuous speech
`applications on a wide scale.
`
`Speech research has been concentrated heavily on acoustic and language modelling
`issues. Since the late 1980s, the complexity of tasks undertaken by speech researchers
`has grown from the 1000-word Resource Management (RM)
`task [51] to essentially
`unlimited vocabulary tasks such as transcription of radio news broadcast
`in 1995
`[48]. While the word recognition accuracy has remained impressive, considering the
`increase in task complexity, the resource requirements have grown as well. The R.11
`task ran about an order of magnitude slower than real time on processors of that
`day. The unlimited vocabulary tasks run about two orders of magnitude slower than
`real time on modern workstations whose power has grown by an order of magnitude
`again, in the meantime.
`
`The task of large vocabulary continuous speech recognition is inherently hard :o:-
`
`1
`
`IPR2023-00037
`Apple EX1007 Page 17
`
`
`
`2
`
`CHAPTER l.
`
`IFTRODUCTION
`
`the following reasons. First, word boundaries are not known in advance. One must
`be constantly prepared
`to encounter such a boundary at every time instant. We can
`draw a rough analogy to reading a. para.graph of text without any punctuation marks
`or spaces between words:
`
`myspiritwillsleepinpeaceorifthinksitwillsurelythinkthusfarewellhesprangfrom
`thecabinwindowashesaidthisupontheiceraftwhichlayclosetothevesselhewassoon
`borneaway bythewavesandlostindar
`knessanddistance
`...
`
`from incorrect seg(cid:173)
`Furthermore, many incorrect word hypotheses will be produced
`mentation of speech. Sophisticated
`langu.age models that provide word context or
`semantic
`information are needed to disambiguate between the available hypotheses.
`
`in natural or
`is that co-articulatory effects are very strong
`The second problem
`conversational
`speech, so that
`the sound produced at one instant
`is influenced by
`the preceding and following ones. Distinguishing between these requires
`the use of
`detailed acoustic models that
`take such contextual conditions
`into account. The in(cid:173)
`creasing sophistication of language models and acoustic models, as well as the growth
`in the complexity of tasks, has far exceeded the computational
`and memory capacities
`of commonly available workstations.
`
`the pro(cid:173)
`that
`for practical applications also requires
`Efficient speech recognition
`power and
`cessing be carried out in real time within
`the limited resources-CPu
`memory size-of
`commonly available computers. There certainly are various such
`commercial and demonstration
`systems in existence, but their performance has never
`been formally evaluated with respect
`to the research systems or with respect
`to one
`the accuracy of research systems has been. This thesis is
`another,
`in the way that
`primarily concerned with these issues-in
`improving the computational
`and memory
`efficiency of current speech recognition
`technology without compromising
`the achieve(cid:173)
`ments in recognition accuracy.
`
`recognition speed, memory resource require(cid:173)
`The three aspects of performance,
`ments, and recognition accuracy, are in mutual conflict. It is relatively easy to improve
`recognition speed and reduce memory requirements while trading away some accu(cid:173)
`racy, for example by pruning the search space more drastically: and by using simpler
`acoustic and language models. Alternatively, one can reduce memory requirements
`through efficient encoding schemes at the expense of computation
`time needed to de(cid:173)
`code such representations,
`and vice versa. But it is much harder to improve both the
`recognition speed and reduce main memory requ1rements while preserving or improv(cid:173)
`ing recognition accuracy.
`In this thesis, we demonstrate
`algorithmic
`and heuristic
`techniques
`to tackle the problem.
`
`This work has been carried out in the context of the CMU Sphinx-II
`speech
`recog:iition system as a baseline. There are two main schools of speech recognition
`-ed1~o·ogy ~oday. based on statistical hidden Markov modelling (HMM), and neural
`
`IPR2023-00037
`Apple EX1007 Page 18
`
`
`
`1.1. THE MODELLING PROBLEM
`
`3
`
`net technology, respectively. Sphinx-II uses HMM-based statistical modelling tech(cid:173)
`niques and is one of the premier recognizers of its kind. Using several commonly used
`benchmark test sets and two different vocabulary sizes of about 20,000 and 58,000
`words, we demonstrate that the recognition accuracy of the baseline Sphinx-II system
`can be attained while its execution time is reduced by about an order of magnitude
`and memory requirements reduced by a factor of about 4.
`
`1.1 The Modelling Problem
`
`As the complexity of tasks tackled by speech research has grown, so has that of
`the modelling techniques. In systems that use statistical modelling techniques, such
`as the Sphinx system, this translates
`into several tens to hundreds of megabytes of
`memory needed to store information regarding statistical distributions underlying the
`models.
`
`Acoustic Models
`
`One of the key issues in acoustic modelling has been the choice of a good unit of
`speech (32, 27]. In small vocabulary systems of a few tens of words, it is possible to
`build separate models for entire words, but this approach quickly becomes infeasible
`as the vocabulary size grows. For one thing, it is hard to obtain sufficient training
`data to build all individual word models. It is necessary to represent words in terms
`of sub-word units, and train acoustic models for the latter, in such a way that the
`pronunciation of new vvords can be defined in terms of the already trained sub-word
`units.
`
`The phoneme (or phone) has been the most commonly accepted sub-word unjt.
`There are approximately 50 phones in spoken English language; words are defined as
`sequences of such phones 1 (see Appendix A for the Sphinx-II phone set and examples).
`Each phone is, in turn, modelled by an HMM ( described in greater detail in Section
`2.1.2).
`
`As mentioned earlier, natural continuous speech has strong co-articulatory ef(cid:173)
`fects. Informally, a phone models the position of various a.rticulators in the mouth
`and nasal passage (such as the tongue and the lips) in the making of a particular
`sound. Since these articulators have to move smoothly between different sounds in
`producing speech, each phone is influenced by the neighbouring ones, especially dur(cid:173)
`ing the transition from one phone to the next. This is not a major concern in small
`vocabulary systems in which words are not easily confusable, but becomes an issue
`as the vocabulary size and the degree of confusability increase.
`
`1Some systems define word pronunciations as networks of phones instead of simple linea: se(cid:173)
`quences (36).
`
`IPR2023-00037
`Apple EX1007 Page 19
`
`
`
`4
`
`CHAPTER l.
`
`INTRODUCTION
`
`Most systems employ triphones as one form of context-dependent HMM models
`[4, 33] to deal with. this problem. Triphones are basically phones observed
`in the
`context of given preceding and succeeding phones. There are approximately 50 phones
`there can be a total of about 503 triphones,
`in spoken English
`language. Thus,
`although only a fraction of them are actually observed in the language. Limiting
`the vocabulary can further reduce this number. For example, in Sphinx-II, a 20,000
`word vocabulary has about 75,000 distinct triphones, each of which is modelled by a
`5-state HMM, for a total of about 375,000 states. Since there isn't sufficient training
`data to build models for each state, they are clustered into equivalence classes called
`senones [27].
`
`acoustic models, even after clustering into
`The introduction of context-dependent
`equivalence classes, creates an explosion in the memory requirements
`to store such
`models. For example,
`the Sphinx-II system with 10,000 senones occupies
`tens of
`megabytes of memory.
`
`Language Models
`
`Large vocabulary continuous speech recognition requires the use of a language model
`or grammar to select the most likely word sequence from the relatively large number
`of alternative word hypotheses produced during the search process. As mentioned
`earlier, the absence of explicit word boundary markers in continuous speech causes
`several additional word hypotheses
`to be produced,
`in addition
`to the intended or
`correct ones. For example, the phrase It's a nice day can be equally well recognized
`as It sun iced A. or It son ice day. They a.re all acoustically indistinguishable, but the
`word boundaries have been drawn at a different set of locations in each case. Clearly,
`many more alternatives can be produced with varying degrees of likelihood, given the
`input speech. The language model is necessary to pick the most likely sequence of
`words from the available alternatives.
`
`set of
`to recognize a constrained
`in which one is only required
`Simple tasks,
`phrases, can use rule-based regular or context-free grammars which can be repre(cid:173)
`sented compactly. However, that is impossible with large vocabulary
`tasks. Instead,
`bigram and trigram grammars, consisting of word pairs and triples with given prob(cid:173)
`abilities of occurrence, are most commonly used. One can also build such language
`models based on word classes, such as city names, months of the year, etc. However,
`creating such grammars
`is tecl!ious as they require a fair amount of hand compilation
`of the classes. Ordinary word n-gram language models, on the other hand, can be
`created almost entirely automatically
`from a corpus of training text.
`
`Clearly, it is infeasible to create a complete set of word bigrams for even medium
`vocabulary
`tasks. Thus, the set of bigram and trigram probabilities actually present
`in a given grammar
`is usually a small subset of the possible number. Even then, they
`usually number in the millions for large vocabulary
`tasks. The memory requirements
`
`IPR2023-00037
`Apple EX1007 Page 20
`
`
`
`1.2. THE SEARCH PROBLEM
`
`5
`
`for such language models range from several tens to hundreds of megabytes.
`
`1.2 The Search Problem
`
`There are two components to the computational cost of speech recognition: acoustic
`probability computation, and search. In the case of HMM-based systems, the former
`refers to the computation of the probability of a given HMM state emitting the
`observed speech at a given time. The latter refers to the search for the best word
`sequence given the complete speech input. The search cost is largely unaffected by
`the complexity of the acoustic models. It is much more heavily influenced by the size
`of the task. As we shall see later, the search cost is significant for medium and large
`vocabulary recognition; it is the main focus of this thesis.
`
`for the most l.ikely sequence of words given the
`Speech recognition-searching
`rise to an ex-ponential search space if all possible sequences. of
`input speech-gives
`words are considered. The problem has generally been tackled in two ways: Viterbi
`decoding f62, 52] using beam search [37], or stack decoding [9, 50] which is a variant
`of the A* algorithm [42]. Some hybrid versions that combine Viterbi decoding with
`the A* algorithm also exist [21].
`
`Viterbi Decoding
`
`Viterbi decoding is a dynamic programming algorithm that searches the state space
`for the most likely state sequence that accounts for the input speech. The st:ate
`space is constructed by creating word HMM models from its constituent phone or
`triphone HMM models, and all word HMM models are searched in parallel. Since
`the state space is huge for even medium vocabulary applications, the beam seaJrch
`heuristic is usually applied to limit the search by pruning out the less likely states.
`The combination is often simply referred to as Viterbi beam search. Viterbi decodiing
`is a time-synchronous search that processes the input speech one frame at a time,
`iYfost
`updating all the states for that frame before moving on to the next frame.
`systems employ a frame input rate of 100 frames/sec. Viterbi decoding is described
`in greater detail in Section 2.3.1.
`
`Stack Decoding
`
`Stack decoding maintains a stack of partial hypotheses 2 sorted in descending order of
`posterior likelihood. At each step it pops the best one off the stack. If it is a complete
`hypothesis it is output. Otherwise the algorithm expands it by one word, trying all
`
`2 A partial hypothesis accounts for an initial portion of the input speech. A complete hypothes1s.
`or simply hypothesis, accounts for the entire input speech.
`
`IPR2023-00037
`Apple EX1007 Page 21
`
`
`
`6
`
`CHAPTER 1. INTRODUCTION
`
`the resulting (partial) hypotheses with respect
`possible word extensions, evaluates
`to the input speech and re-inserts
`them in the sorted stack. Any number of N-best
`hypotheses [59] can be generated
`in this manner. To avoid an exponential growth in
`the set of possible word sequences in medium and large vocabulary systems, partial
`hypotheses are expanded only by a limited set of candidate words at each step. These
`candidates are identified by a fast match step [6, 7, 8, 20]. Since our experiments have
`been mostly confined to Viterbi decoding, we do not explore stack decoding in any
`greater detail.
`
`Tree Structured
`
`Lexicons
`
`Even with the beam search heuristic, straightforward Viterbi decoding is expensive.
`The network of states to be searched is formed by a linear sequence of HMM models
`for each word in the vocabulary. The number of models actively searched in this
`organization
`is still one to two orders of magnitude beyond the capabilities of modern
`workstations.
`
`Lexical trees can be used tto reduce the size of the search space. Since many
`words share common pronunciation prefixes, they can also share models and avoid
`duplication. Trees were initially used in fast match algorithms for producing candidate
`word lists for further search. Recently, they have been introduced
`in the main search
`component of several systems [44, 39, 43, 3]. The main problem faced by them is in
`using a language model. Normally, transitions between words are accompanied by
`a prior language model probability. But with trees, the destination nodes of such
`transitions are not individual words but entire groups of them, related phonetically
`but quite unrelated grammatically. An efficient solution to this problem is one of the
`important contributions of this thesis.
`
`Multipass
`
`Search Techniques
`
`to the best
`Viterbi search algorithms usually also create a word lattice in addition
`recognition hypothesis. The lattice includes several alternative words that were recog(cid:173)
`nized at any given time during the search. It also typically contains other information
`such as the time segmentations
`for these words, and their posterior acoustic scores
`(i.e., the probability of observing a word given that time segment of input speech).
`The lattice error rate measures the number of correct words missing from the lattice
`around the expected time. It is typically much lower than the word error rate 3 of the
`single best hypotheses produced for each sentence.
`
`Word lattices can be kept very compact, with low lattice error rate, if they are
`produced using sufficiently detailed acoustic models (as opposed to primitive models
`
`3Word error rates are measured by counting the number of word substitutions, deletions, and
`insertions in the hypothesis, compared to the correct reference sentence.
`
`IPR2023-00037
`Apple EX1007 Page 22
`
`
`
`1.3. THESIS CONTRIBUTIONS
`
`7
`
`as in, for example, fast match algorithms). In our work, a 10sec long sentence typically
`produces a word lattice containing about 1000 word instances.
`
`Given such compact lattices with low error rates, one can search them us.ing
`sophisticated models and search a.lgorithms very efficiently and obtain results with a
`lower word error rate. as described in [38, 65, 41]. Most systems use such multipass
`techniques.
`
`However, there has been relatively little work reported in actually creating such
`lattices efficiently. This is important for the practical applicability of such techniques.
`Lattices can be created with low computational ove