Computer Science

Efficient Algorithms for Speech Recognition

Mosur K. Ravishankar
May 15, 1996
CMU-CS-96-143

Carnegie Mellon
Efficient Algorithms for Speech Recognition

Mosur K. Ravishankar
May 15, 1996
CMU-CS-96-143

School of Computer Science
Computer Science Division
Carnegie Mellon University
Pittsburgh, PA 15213

Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy.

Thesis Committee:

Roberto Bisiani, co-chair (University of Milan)
Raj Reddy, co-chair
Alexander Rudnicky
Richard Stern
Wayne Ward

© 1996 Mosur K. Ravishankar

This research was supported by the Department of the Navy, Naval Research Laboratory under
Grant No. N00014-93-1-2005. The views and conclusions contained in this document are those of
the author and should not be interpreted as representing the official policies, either expressed or
implied, of the U.S. government.
Carnegie Mellon
School of Computer Science

DOCTORAL THESIS
in the field of
Computer Science

Efficient Algorithms for Speech Recognition

MOSUR K. RAVISHANKAR

Submitted in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy

ACCEPTED:

THESIS COMMITTEE CHAIR                                DATE

THESIS COMMITTEE CHAIR                                DATE

DEPARTMENT HEAD                                       DATE

APPROVED:

DEAN                                                  DATE
Abstract

Advances in speech technology and computing power have created a surge of
interest in the practical application of speech recognition. However, the most accurate
speech recognition systems in the research world are still far too slow and expensive to
be used in practical, large vocabulary continuous speech applications. Their main goal
has been recognition accuracy, with emphasis on acoustic and language modelling.
But practical speech recognition also requires the computation to be carried out in
real time within the limited resources (CPU power and memory size) of commonly
available computers. There has been relatively little work in this direction while
preserving the accuracy of research systems.

In this thesis, we focus on efficient and accurate speech recognition. It is easy to
improve recognition speed and reduce memory requirements by trading away accuracy,
for example by greater pruning, and using simpler acoustic and language models.
It is much harder to both improve the recognition speed and reduce main memory
size while preserving the accuracy.

This thesis presents several techniques for improving the overall performance of
the CMU Sphinx-II system. Sphinx-II employs semi-continuous hidden Markov models
for acoustics and trigram language models, and is one of the premier research
systems of its kind. The techniques in this thesis are validated on several widely used
benchmark test sets using two vocabulary sizes of about 20K and 58K words.

The main contributions of this thesis are an 8-fold speedup and 4-fold memory size
reduction over the baseline Sphinx-II system. The improvement in speed is obtained
from the following techniques: lexical tree search, phonetic fast match heuristic, and
global best path search of the word lattice. The gain in speed from the tree search is
about a factor of 5. The phonetic fast match heuristic speeds up the tree search by
another factor of 2 by finding the most likely candidate phones active at any time.
Though the tree search incurs some loss of accuracy, it also produces compact word
lattices with low error rate which can be rescored for accuracy. Such a rescoring is
combined with the best path algorithm to find a globally optimum path through a
word lattice. This recovers the original accuracy of the baseline system. The total
recognition time is about 3 times real time for the 20K task on a 175MHz DEC Alpha
workstation.

The memory requirements of Sphinx-II are minimized by reducing the sizes of
the acoustic and language models. The language model is maintained on disk and
bigrams and trigrams are read in on demand. Explicit software caching mechanisms
effectively overcome the disk access latencies. The acoustic model size is reduced by
simply truncating the precision of probability values to 8 bits. Several other engineering
solutions, not explored in this thesis, can be applied to reduce memory requirements
further. The memory size for the 20K task is reduced to about 30-40MB.
Acknowledgements

I cannot overstate the debt I owe to Roberto Bisiani and Raj Reddy. They have
not only helped me and given me every opportunity to extend my professional career,
but also helped me through personal difficulties as well. It is quite remarkable that I
have landed not one but two advisors that combine integrity towards research with a
human touch that transcends the proverbial hard-headedness of science. One cannot
hope for better mentors than them. Alex Rudnicky, Rich Stern, and Wayne Ward
all have a clarity of thinking and self-expression that simply amazes me without end.
They have given me the most insightful advice, comments, and questions that I could
have asked for. Thank you, all.

The CMU speech group has been a pleasure to work with. First of all, I would
like to thank some former and current members, Mei-Yuh Hwang, Fil Alleva, Lin
Chase, Eric Thayer, Sunil Issar, Bob Weide, and Roni Rosenfeld. They have helped
me through the early stages of my induction into the group, and later given invaluable
support in my work. I'm fortunate to have inherited the work of Mei-Yuh and Fil.
Lin Chase has been a great friend and sounding board for ideas through these years.
Eric has been all of that and a great officemate. I have learnt a lot from discussions
with Paul Placeway. The rest of the speech group and the robust gang has made it a
most lively environment to work in. I hope the charge continues through Sphinx-III
and beyond.

I have spent a good fraction of my life in the CMU-CS community so far. It has
been, and still is, the greatest intellectual environment. The spirit of cooperation and
informality of interactions is simply unique. I would like to acknowledge the support
of everyone I have ever come to know here, too many to name, from the Warp and
Nectar days until now. The administrative folks have always succeeded in blunting
the edge off a difficult day. You never know what nickname Catherine Copetas will
christen you with next. And Sharon Burks has always put up with all my antics.

It goes without saying that I owe everything to my parents. I have had tremendous
support from my brothers, and some very special uncles and aunts. In particular, I
must mention the fun I've had with my brother Kuts. I would also like to acknowledge
K. Gopinath's help during my stay in Bangalore. Finally, "BB", who has suffered
through my tantrums on bad days, kept me in touch with the rest of the world, has a
most creative outlook on the commonplace, can drive me nuts some days, but when
all is said and done, is a most relaxed and comfortable person to have around.

Last but not least, I would like to thank Andreas Nowatzyk, Monica Lam, Duane
Northcutt and Ray Clark. It has been my good fortune to witness and participate in
some of Andreas's creative work. This thesis owes a lot to his unending support and
encouragement.
Contents

Abstract

Acknowledgements

1 Introduction
   1.1 The Modelling Problem
   1.2 The Search Problem
   1.3 Thesis Contributions
       1.3.1 Improving Speed
       1.3.2 Reducing Memory Size
   1.4 Summary and Dissertation Outline

2 Background
   2.1 Acoustic Modelling
       2.1.1 Phones and Triphones
       2.1.2 HMM modelling of Phones and Triphones
   2.2 Language Modelling
   2.3 Search Algorithms
       2.3.1 Viterbi Beam Search
   2.4 Related Work
       2.4.1 Tree Structured Lexicons
       2.4.2 Memory Size and Speed Improvements in Whisper
       2.4.3 Search Pruning Using Posterior Phone Probabilities
       2.4.4 Lower Complexity Viterbi Algorithm
   2.5 Summary

3 The Sphinx-II Baseline System
   3.1 Knowledge Sources
       3.1.1 Acoustic Model
       3.1.2 Pronunciation Lexicon
   3.2 Forward Beam Search
       3.2.1 Flat Lexical Structure
       3.2.2 Incorporating the Language Model
       3.2.3 Cross-Word Triphone Modeling
       3.2.4 The Forward Search
   3.3 Backward and A* Search
       3.3.1 Backward Viterbi Search
       3.3.2 A* Search
   3.4 Baseline Sphinx-II System Performance
       3.4.1 Experimentation Methodology
       3.4.2 Recognition Accuracy
       3.4.3 Search Speed
       3.4.4 Memory Usage
   3.5 Baseline System Summary

4 Search Speed Optimization
   4.1 Motivation
   4.2 Lexical Tree Search
       4.2.1 Lexical Tree Construction
       4.2.2 Incorporating Language Model Probabilities
       4.2.3 Outline of Tree Search Algorithm
       4.2.4 Performance of Lexical Tree Search
       4.2.5 Lexical Tree Search Summary
   4.3 Global Best Path Search
       4.3.1 Best Path Search Algorithm
       4.3.2 Performance
       4.3.3 Best Path Search Summary
   4.4 Rescoring Tree-Search Word Lattice
       4.4.1 Motivation
       4.4.2 Performance
       4.4.3 Summary
   4.5 Phonetic Fast Match
       4.5.1 Motivation
       4.5.2 Details of Phonetic Fast Match
       4.5.3 Performance of Fast Match Using All Senones
       4.5.4 Performance of Fast Match Using CI Senones
       4.5.5 Phonetic Fast Match Summary
   4.6 Exploiting Concurrency
       4.6.1 Multiple Levels of Concurrency
       4.6.2 Parallelization Summary
   4.7 Summary of Search Speed Optimization

5 Memory Size Reduction
   5.1 Senone Mixture Weights Compression
   5.2 Disk-Based Language Models
   5.3 Summary of Experiments on Memory Size

6 Small Vocabulary Systems
   6.1 General Issues
   6.2 Performance on ATIS
       6.2.1 Baseline System Performance
       6.2.2 Performance of Lexical Tree Based System
   6.3 Small Vocabulary Systems Summary

7 Conclusion
   7.1 Summary of Results
   7.2 Contributions
   7.3 Future Work on Efficient Speech Recognition

Appendices

A The Sphinx-II Phone Set

B Statistical Significance Tests
   B.1 20K Task
   B.2 58K Task

Bibliography
List of Figures

2.1 Viterbi Search as Dynamic Programming
3.1 Sphinx-II Signal Processing Front End
3.2 Sphinx-II HMM Topology: 5-State Bakis Model
3.3 Cross-word Triphone Modelling at Word Ends in Sphinx-II
3.4 Word Initial Triphone HMM Modelling in Sphinx-II
3.5 One Frame of Forward Viterbi Beam Search in the Baseline System
3.6 Word Transitions in Sphinx-II Baseline System
3.7 Outline of A* Algorithm in Baseline System
3.8 Language Model Structure in Baseline Sphinx-II System
4.1 Basephone Lexical Tree Example
4.2 Triphone Lexical Tree Example
4.3 Cross-Word Transitions With Flat and Tree Lexicons
4.4 Auxiliary Flat Lexical Structure for Bigram Transitions
4.5 Path Score Adjustment Factor f for Word wi Upon Its Exit
4.6 One Frame of Forward Viterbi Beam Search in Tree Search Algorithm
4.7 Word Lattice for Utterance: Take Fidelity's case as an example
4.8 Word Lattice Example Represented as a DAG
4.9 Word Lattice DAG Example Using a Trigram Grammar
4.10 Suboptimal Usage of Trigrams in Sphinx-II Viterbi Search
4.11 Base Phones Predicted by Top Scoring Senones in Each Frame; Speech Fragment for Phrase THIS TREND, Pronounced DH-IX-S T-R-EH-N-DD
4.12 Position of Correct Phone in Ranking Created by Phonetic Fast Match
4.13 Lookahead Window for Smoothing the Active Phone List
4.14 Phonetic Fast Match Performance Using All Senones (20K Task)
4.15 Word Error Rate vs Recognition Speed of Various Systems
4.16 Configuration of a Practical Speech Recognition System
List of Tables

3.1 No. of Words and Sentences in Each Test Set
3.2 Percentage Word Error Rate of Baseline Sphinx-II System
3.3 Overall Execution Times of Baseline Sphinx-II System (xRealTime)
3.4 Baseline Sphinx-II System Forward Viterbi Search Execution Times (xRealTime)
3.5 HMMs Evaluated Per Frame in Baseline Sphinx-II System
3.6 N-gram Transitions Per Frame in Baseline Sphinx-II System
4.1 No. of Nodes at Each Level in Tree and Flat Lexicons
4.2 Execution Times for Lexical Tree Viterbi Search
4.3 Breakdown of Tree Viterbi Search Execution Times (xRealTime)
4.4 No. of HMMs Evaluated Per Frame in Lexical Tree Search
4.5 No. of Language Model Operations/Frame in Lexical Tree Search
4.6 Word Error Rates for Lexical Tree Viterbi Search
4.7 Word Error Rates from Global Best Path Search of Word Lattice Produced by Lexical Tree Search
4.8 Execution Times for Global Best Path DAG Search (xRealTime)
4.9 Word Error Rates From Lexical Tree+Rescoring+Best Path Search
4.10 Execution Times With Rescoring Pass
4.11 Fast Match Using All Senones; Lookahead Window=3 (20K Task)
4.12 Fast Match Using All Senones; Lookahead Window=3 (58K Task)
4.13 Fast Match Using CI Senones; Lookahead Window=3
6.1 Baseline System Performance on ATIS
6.2 Ratio of Number of Root HMMs in Lexical Tree and Words in Lexicon (approximate)
6.3 Execution Times on ATIS
6.4 Breakdown of Tree Search Execution Times on ATIS (Without Phonetic Fast Match)
6.5 Recognition Accuracy on ATIS
A.1 The Sphinx-II Phone Set
Chapter 1

Introduction

Recent advances in speech technology and computing power have created a surge
of interest in the practical application of speech recognition. Speech is the primary
mode of communication among humans. Our ability to communicate with machines
and computers, through keyboards, mice and other devices, is an order of magnitude
slower and more cumbersome. In order to make this communication more
user-friendly, speech input is an essential component.

There are broadly three classes of speech recognition applications, as described
in [53]. In isolated word recognition systems each word is spoken with pauses before
and after it, so that end-pointing techniques can be used to identify word boundaries
reliably. Second, highly constrained command-and-control applications use small
vocabularies, limited to specific phrases, but use connected word or continuous speech.
Finally, large vocabulary continuous speech systems have vocabularies of several tens
of thousands of words, and sentences can be arbitrarily long, spoken in a natural
fashion. The last is the most user-friendly but also the most challenging to implement.
However, the most accurate speech recognition systems in the research world are still
far too slow and expensive to be used in practical, large vocabulary continuous speech
applications on a wide scale.

Speech research has been concentrated heavily on acoustic and language modelling
issues. Since the late 1980s, the complexity of tasks undertaken by speech researchers
has grown from the 1000-word Resource Management (RM) task [51] to essentially
unlimited vocabulary tasks such as transcription of radio news broadcast in 1995
[48]. While the word recognition accuracy has remained impressive, considering the
increase in task complexity, the resource requirements have grown as well. The RM
task ran about an order of magnitude slower than real time on processors of that
day. The unlimited vocabulary tasks run about two orders of magnitude slower than
real time on modern workstations whose power has grown by an order of magnitude
again, in the meantime.

The task of large vocabulary continuous speech recognition is inherently hard for
the following reasons. First, word boundaries are not known in advance. One must
be constantly prepared to encounter such a boundary at every time instant. We can
draw a rough analogy to reading a paragraph of text without any punctuation marks
or spaces between words:

myspiritwillsleepinpeaceorifthinksitwillsurelythinkthusfarewellhesprangfrom
thecabinwindowashesaidthisupontheiceraftwhichlayclosetothevesselhewassoon
borneawaybythewavesandlostindarknessanddistance...

Furthermore, many incorrect word hypotheses will be produced from incorrect
segmentation of speech. Sophisticated language models that provide word context or
semantic information are needed to disambiguate between the available hypotheses.
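To make this boundary ambiguity concrete, the following minimal sketch (illustrative only: the toy vocabulary and test string are invented, and a real decoder hypothesizes boundaries over phone models rather than characters) counts the number of ways an unspaced string can be segmented into dictionary words:

# Minimal sketch: counting the segmentations of an unspaced string into
# dictionary words. The vocabulary and test string are invented.
def count_segmentations(text, vocab):
    """Number of ways text can be split into words drawn from vocab."""
    n = len(text)
    ways = [0] * (n + 1)
    ways[0] = 1                      # the empty prefix has one segmentation
    for end in range(1, n + 1):
        for start in range(end):
            if ways[start] and text[start:end] in vocab:
                ways[end] += ways[start]
    return ways[n]

vocab = {"it", "its", "a", "nice", "day", "an", "ice", "iced", "son", "sun"}
print(count_segmentations("itsaniceday", vocab))   # 2: "its a nice day" and "its an ice day"

Even this toy example admits two readings; a real decoder faces the same ambiguity at every frame, over tens of thousands of words.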
The second problem is that co-articulatory effects are very strong in natural or
conversational speech, so that the sound produced at one instant is influenced by
the preceding and following ones. Distinguishing between these requires the use of
detailed acoustic models that take such contextual conditions into account. The
increasing sophistication of language models and acoustic models, as well as the
growth in the complexity of tasks, has far exceeded the computational and memory
capacities of commonly available workstations.

Efficient speech recognition for practical applications also requires that the
processing be carried out in real time within the limited resources (CPU power and
memory size) of commonly available computers. There certainly are various such
commercial and demonstration systems in existence, but their performance has never
been formally evaluated with respect to the research systems or with respect to one
another, in the way that the accuracy of research systems has been. This thesis is
primarily concerned with these issues: improving the computational and memory
efficiency of current speech recognition technology without compromising the
achievements in recognition accuracy.

The three aspects of performance, recognition speed, memory resource requirements,
and recognition accuracy, are in mutual conflict. It is relatively easy to improve
recognition speed and reduce memory requirements while trading away some accuracy,
for example by pruning the search space more drastically, and by using simpler
acoustic and language models. Alternatively, one can reduce memory requirements
through efficient encoding schemes at the expense of the computation time needed to
decode such representations, and vice versa. But it is much harder to improve both
the recognition speed and the main memory requirements while preserving or
improving recognition accuracy. In this thesis, we demonstrate algorithmic and
heuristic techniques to tackle the problem.

This work has been carried out in the context of the CMU Sphinx-II speech
recognition system as a baseline. There are two main schools of speech recognition
technology today, based on statistical hidden Markov modelling (HMM) and neural
net technology, respectively. Sphinx-II uses HMM-based statistical modelling
techniques and is one of the premier recognizers of its kind. Using several commonly used
benchmark test sets and two different vocabulary sizes of about 20,000 and 58,000
words, we demonstrate that the recognition accuracy of the baseline Sphinx-II system
can be attained while its execution time is reduced by about an order of magnitude
and memory requirements reduced by a factor of about 4.
1.1 The Modelling Problem

As the complexity of tasks tackled by speech research has grown, so has that of
the modelling techniques. In systems that use statistical modelling techniques, such
as the Sphinx system, this translates into several tens to hundreds of megabytes of
memory needed to store information regarding statistical distributions underlying
the models.

Acoustic Models

One of the key issues in acoustic modelling has been the choice of a good unit of
speech [32, 27]. In small vocabulary systems of a few tens of words, it is possible to
build separate models for entire words, but this approach quickly becomes infeasible
as the vocabulary size grows. For one thing, it is hard to obtain sufficient training
data to build all individual word models. It is necessary to represent words in terms
of sub-word units, and train acoustic models for the latter, in such a way that the
pronunciation of new words can be defined in terms of the already trained sub-word
units.

The phoneme (or phone) has been the most commonly accepted sub-word unit.
There are approximately 50 phones in spoken English language; words are defined as
sequences of such phones¹ (see Appendix A for the Sphinx-II phone set and examples).
Each phone is, in turn, modelled by an HMM (described in greater detail in Section
2.1.2).

¹Some systems define word pronunciations as networks of phones instead of simple linear
sequences [36].
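As a minimal illustration of this idea (the toy entries below are invented and do not reproduce the lexicon or the phone set of Appendix A), a pronunciation lexicon maps each word to a phone sequence, and a word-level model is simply the concatenation of the per-phone HMMs:

# Minimal sketch: a pronunciation lexicon mapping words to phone sequences,
# with one (placeholder) HMM per phone. Entries are invented for illustration.
LEXICON = {
    "nice": ["N", "AY", "S"],
    "day":  ["D", "EY"],
    "ice":  ["AY", "S"],
}

# One HMM per phone; here just a placeholder object keyed by the phone name.
PHONE_HMMS = {p: f"hmm<{p}>" for phones in LEXICON.values() for p in phones}

def word_model(word):
    """Concatenate the phone HMMs of a word into a word-level model."""
    return [PHONE_HMMS[p] for p in LEXICON[word]]

print(word_model("nice"))   # ['hmm<N>', 'hmm<AY>', 'hmm<S>']

Adding a new word then requires only a new phone string in the lexicon, not new acoustic training.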
As mentioned earlier, natural continuous speech has strong co-articulatory effects.
Informally, a phone models the position of various articulators in the mouth
and nasal passage (such as the tongue and the lips) in the making of a particular
sound. Since these articulators have to move smoothly between different sounds in
producing speech, each phone is influenced by the neighbouring ones, especially
during the transition from one phone to the next. This is not a major concern in small
vocabulary systems in which words are not easily confusable, but becomes an issue
as the vocabulary size and the degree of confusability increase.

Most systems employ triphones as one form of context-dependent HMM models
[4, 33] to deal with this problem. Triphones are basically phones observed in the
context of given preceding and succeeding phones. There are approximately 50 phones
in spoken English language. Thus, there can be a total of about 50³ triphones,
although only a fraction of them are actually observed in the language. Limiting
the vocabulary can further reduce this number. For example, in Sphinx-II, a 20,000
word vocabulary has about 75,000 distinct triphones, each of which is modelled by a
5-state HMM, for a total of about 375,000 states. Since there isn't sufficient training
data to build models for each state, they are clustered into equivalence classes called
senones [27].
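The following minimal sketch (illustrative only: the toy lexicon is invented, and it ignores the cross-word contexts and the senone clustering that a real system performs) enumerates the within-word triphones required by a small lexicon and the HMM states they introduce before any state tying:

# Minimal sketch: enumerating within-word triphones for a toy lexicon.
LEXICON = {
    "nice":  ["N", "AY", "S"],
    "night": ["N", "AY", "T"],
    "say":   ["S", "EY"],
}

def within_word_triphones(phones, context="SIL"):
    """Each phone in the context of its left and right neighbours."""
    padded = [context] + phones + [context]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

triphones = set()
for phones in LEXICON.values():
    triphones.update(within_word_triphones(phones))

STATES_PER_HMM = 5
print(len(triphones), "distinct triphones;",
      len(triphones) * STATES_PER_HMM, "HMM states before senone clustering")

The same counting at a 20,000 word scale is what produces the roughly 75,000 triphones and 375,000 states quoted above, and it is those states that are clustered into shared senones.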
The introduction of context-dependent acoustic models, even after clustering into
equivalence classes, creates an explosion in the memory requirements to store such
models. For example, the Sphinx-II system with 10,000 senones occupies tens of
megabytes of memory.

Language Models

Large vocabulary continuous speech recognition requires the use of a language model
or grammar to select the most likely word sequence from the relatively large number
of alternative word hypotheses produced during the search process. As mentioned
earlier, the absence of explicit word boundary markers in continuous speech causes
several additional word hypotheses to be produced, in addition to the intended or
correct ones. For example, the phrase It's a nice day can be equally well recognized
as It sun iced A or It son ice day. They are all acoustically indistinguishable, but the
word boundaries have been drawn at a different set of locations in each case. Clearly,
many more alternatives can be produced with varying degrees of likelihood, given the
input speech. The language model is necessary to pick the most likely sequence of
words from the available alternatives.

Simple tasks, in which one is only required to recognize a constrained set of
phrases, can use rule-based regular or context-free grammars which can be represented
compactly. However, that is impossible with large vocabulary tasks. Instead,
bigram and trigram grammars, consisting of word pairs and triples with given
probabilities of occurrence, are most commonly used. One can also build such language
models based on word classes, such as city names, months of the year, etc. However,
creating such grammars is tedious as they require a fair amount of hand compilation
of the classes. Ordinary word n-gram language models, on the other hand, can be
created almost entirely automatically from a corpus of training text.

Clearly, it is infeasible to create a complete set of word bigrams for even medium
vocabulary tasks. Thus, the set of bigram and trigram probabilities actually present
in a given grammar is usually a small subset of the possible number. Even then, they
usually number in the millions for large vocabulary tasks. The memory requirements
for such language models range from several tens to hundreds of megabytes.
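A minimal sketch of how such a model is used in scoring follows (illustrative only: the probabilities, vocabulary, and crude backoff constant are invented, whereas real bigram and trigram models are estimated from large text corpora with properly estimated backoff weights). It scores the intended reading of the earlier example above an acoustically confusable alternative:

# Minimal sketch: a bigram language model with a crude backoff to unigrams.
import math

UNIGRAM = {"it's": 0.2, "a": 0.3, "nice": 0.1, "day": 0.2, "sun": 0.1, "iced": 0.1}
BIGRAM = {("it's", "a"): 0.5, ("a", "nice"): 0.4, ("nice", "day"): 0.6}
BACKOFF = 0.1    # stand-in for a properly estimated backoff weight

def log_prob(prev, word):
    if (prev, word) in BIGRAM:
        return math.log(BIGRAM[(prev, word)])
    return math.log(BACKOFF * UNIGRAM.get(word, 1e-6))

def sentence_score(words):
    """Sum of conditional log probabilities; higher means more likely."""
    return sum(log_prob(p, w) for p, w in zip(words[:-1], words[1:]))

print(sentence_score(["it's", "a", "nice", "day"]))    # about -2.1
print(sentence_score(["it's", "sun", "iced", "day"]))  # much lower, about -13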
1.2 The Search Problem

There are two components to the computational cost of speech recognition: acoustic
probability computation, and search. In the case of HMM-based systems, the former
refers to the computation of the probability of a given HMM state emitting the
observed speech at a given time. The latter refers to the search for the best word
sequence given the complete speech input. The search cost is largely unaffected by
the complexity of the acoustic models. It is much more heavily influenced by the size
of the task. As we shall see later, the search cost is significant for medium and large
vocabulary recognition; it is the main focus of this thesis.

Speech recognition, searching for the most likely sequence of words given the
input speech, gives rise to an exponential search space if all possible sequences of
words are considered. The problem has generally been tackled in two ways: Viterbi
decoding [62, 52] using beam search [37], or stack decoding [9, 50] which is a variant
of the A* algorithm [42]. Some hybrid versions that combine Viterbi decoding with
the A* algorithm also exist [21].

Viterbi Decoding

Viterbi decoding is a dynamic programming algorithm that searches the state space
for the most likely state sequence that accounts for the input speech. The state
space is constructed by creating word HMM models from their constituent phone or
triphone HMM models, and all word HMM models are searched in parallel. Since
the state space is huge for even medium vocabulary applications, the beam search
heuristic is usually applied to limit the search by pruning out the less likely states.
The combination is often simply referred to as Viterbi beam search. Viterbi decoding
is a time-synchronous search that processes the input speech one frame at a time,
updating all the states for that frame before moving on to the next frame. Most
systems employ a frame input rate of 100 frames/sec. Viterbi decoding is described
in greater detail in Section 2.3.1.
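The following is a minimal sketch of time-synchronous Viterbi decoding with a beam (illustrative only: the states, transition scores, and emission function are invented stand-ins for a word/phone HMM network, and a real decoder also keeps backpointers so that the best word sequence can be recovered):

# Minimal sketch: frame-synchronous Viterbi search with beam pruning.
import math

def viterbi_beam(frames, transitions, emit_logprob, start_state, beam=10.0):
    """frames: observations; transitions: {state: [(next_state, log_prob), ...]}."""
    active = {start_state: 0.0}                  # state -> best path log score
    for obs in frames:
        scores = {}
        for state, score in active.items():      # expand every active state
            for nxt, t_logp in transitions.get(state, []):
                s = score + t_logp + emit_logprob(nxt, obs)
                if s > scores.get(nxt, -math.inf):
                    scores[nxt] = s
        best = max(scores.values())
        # Beam pruning: keep only states scoring within 'beam' of the frame's best.
        active = {st: sc for st, sc in scores.items() if sc >= best - beam}
    return active

# Toy example: states "A" and "B" prefer emitting "a" and "b" respectively.
trans = {"S": [("A", math.log(0.5)), ("B", math.log(0.5))],
         "A": [("A", math.log(0.7)), ("B", math.log(0.3))],
         "B": [("A", math.log(0.4)), ("B", math.log(0.6))]}
emit = lambda st, o: math.log(0.9 if (st, o) in {("A", "a"), ("B", "b")} else 0.1)
print(viterbi_beam("aab", trans, emit, "S"))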
Stack Decoding

Stack decoding maintains a stack of partial hypotheses² sorted in descending order of
posterior likelihood. At each step it pops the best one off the stack. If it is a complete
hypothesis it is output. Otherwise the algorithm expands it by one word, trying all
possible word extensions, evaluates the resulting (partial) hypotheses with respect
to the input speech and re-inserts them in the sorted stack. Any number of N-best
hypotheses [59] can be generated in this manner. To avoid an exponential growth in
the set of possible word sequences in medium and large vocabulary systems, partial
hypotheses are expanded only by a limited set of candidate words at each step. These
candidates are identified by a fast match step [6, 7, 8, 20]. Since our experiments have
been mostly confined to Viterbi decoding, we do not explore stack decoding in any
greater detail.

²A partial hypothesis accounts for an initial portion of the input speech. A complete hypothesis,
or simply hypothesis, accounts for the entire input speech.
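A minimal sketch of this control structure follows (illustrative only: the scoring function, candidate generator, and completion test below are stubs standing in for acoustic rescoring and the fast match step):

# Minimal sketch: best-first stack decoding over partial hypotheses.
import heapq

def stack_decode(score_fn, extend_fn, is_complete, max_pops=1000):
    """Repeatedly pop the best partial hypothesis, extend it by one word, re-insert."""
    # heapq is a min-heap, so scores are negated to pop the best hypothesis first.
    stack = [(-score_fn(()), ())]
    for _ in range(max_pops):
        if not stack:
            break
        neg_score, hyp = heapq.heappop(stack)
        if is_complete(hyp):
            return hyp, -neg_score
        for word in extend_fn(hyp):              # candidates from a fast match step
            new_hyp = hyp + (word,)
            heapq.heappush(stack, (-score_fn(new_hyp), new_hyp))
    return None, float("-inf")

# Toy setup: hypotheses closer to a target sentence score higher.
TARGET = ("it's", "a", "nice", "day")
score = lambda hyp: sum(1.0 for a, b in zip(hyp, TARGET) if a == b) - 0.1 * len(hyp)
extend = lambda hyp: ["it's", "a", "nice", "day", "sun"] if len(hyp) < 4 else []
done = lambda hyp: len(hyp) == 4
print(stack_decode(score, extend, done))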
Tree Structured Lexicons

Even with the beam search heuristic, straightforward Viterbi decoding is expensive.
The network of states to be searched is formed by a linear sequence of HMM models
for each word in the vocabulary. The number of models actively searched in this
organization is still one to two orders of magnitude beyond the capabilities of modern
workstations.

Lexical trees can be used to reduce the size of the search space. Since many
words share common pronunciation prefixes, they can also share models and avoid
duplication. Trees were initially used in fast match algorithms for producing candidate
word lists for further search. Recently, they have been introduced in the main search
component of several systems [44, 39, 43, 3]. The main problem faced by them is in
using a language model. Normally, transitions between words are accompanied by
a prior language model probability. But with trees, the destination nodes of such
transitions are not individual words but entire groups of them, related phonetically
but quite unrelated grammatically. An efficient solution to this problem is one of the
important contributions of this thesis.
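The sharing itself is easy to illustrate (the sketch below is illustrative only: the toy lexicon is invented, and in the real search tree each node corresponds to a phone or triphone HMM rather than a dictionary key):

# Minimal sketch: building a lexical prefix tree over phone pronunciations.
LEXICON = {
    "start":  ["S", "T", "AA", "R", "T"],
    "starts": ["S", "T", "AA", "R", "T", "S"],
    "stop":   ["S", "T", "AA", "P"],
    "day":    ["D", "EY"],
}

def build_tree(lexicon):
    root = {}
    for word, phones in lexicon.items():
        node = root
        for p in phones:
            node = node.setdefault(p, {})    # shared node for a shared prefix
        node["#word"] = word                 # leaf marker: word identity known here
    return root

def count_nodes(node):
    return sum(1 + count_nodes(child) for key, child in node.items() if key != "#word")

tree = build_tree(LEXICON)
flat_nodes = sum(len(p) for p in LEXICON.values())
print(f"flat lexicon: {flat_nodes} phone nodes; tree: {count_nodes(tree)} nodes")

The savings grow with vocabulary size, and the language model difficulty noted above arises precisely because a shared root node no longer identifies a single word.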
Multipass Search Techniques

Viterbi search algorithms usually also create a word lattice in addition to the best
recognition hypothesis. The lattice includes several alternative words that were
recognized at any given time during the search. It also typically contains other
information such as the time segmentations for these words, and their posterior acoustic
scores (i.e., the probability of observing a word given that time segment of input speech).
The lattice error rate measures the number of correct words missing from the lattice
around the expected time. It is typically much lower than the word error rate³ of the
single best hypotheses produced for each sentence.

Word lattices can be kept very compact, with low lattice error rate, if they are
produced using sufficiently detailed acoustic models (as opposed to primitive models
as in, for example, fast match algorithms). In our work, a 10sec long sentence typically
produces a word lattice containing about 1000 word instances.

³Word error rates are measured by counting the number of word substitutions, deletions, and
insertions in the hypothesis, compared to the correct reference sentence.
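For concreteness, the word error rate defined in the footnote can be computed with a standard edit-distance recurrence; the sketch below is illustrative only and the example strings are invented:

# Minimal sketch: word error rate = (substitutions + deletions + insertions) / reference length.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                     # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                     # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("it's a nice day", "it's an ice day"))   # 0.5: two substitutions in four words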
Given such compact lattices with low error rates, one can search them using
sophisticated models and search algorithms very efficiently and obtain results with a
lower word error rate, as described in [38, 65, 41]. Most systems use such multipass
techniques.
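The following minimal sketch shows the general idea of such a rescoring pass (illustrative only: the lattice, edge scores, and weights are invented; the global best path search actually used over the lattice DAG is described in Chapter 4):

# Minimal sketch: finding the best-scoring path through a word lattice DAG.
from functools import lru_cache

# Each edge carries a combined acoustic + language model log score for moving
# from one word instance to the next (values invented for illustration).
EDGES = {
    ("<s>", "it's"): -1.0, ("<s>", "it"): -1.5,
    ("it's", "a"): -0.5,   ("it", "sun"): -2.0,
    ("a", "nice"): -0.7,   ("sun", "iced"): -1.8,
    ("nice", "day"): -0.4, ("iced", "day"): -1.2,
    ("day", "</s>"): -0.1,
}
SUCCESSORS = {}
for (u, v), w in EDGES.items():
    SUCCESSORS.setdefault(u, []).append((v, w))

@lru_cache(maxsize=None)
def best_path(node):
    """Best (score, word sequence) from this node to the final lattice node."""
    if node == "</s>":
        return 0.0, ("</s>",)
    candidates = []
    for nxt, weight in SUCCESSORS.get(node, []):
        score, path = best_path(nxt)
        candidates.append((weight + score, (node,) + path))
    return max(candidates) if candidates else (float("-inf"), (node,))

print(best_path("<s>"))   # best score and the word sequence <s> it's a nice day </s>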
However, there has been relatively little work reported in actually creating such
lattices efficiently. This is important for the practical applicability of such techniques.
Lattices can be created with low computational ove
