United States Patent [19]
Stubley et al.

[11] Patent Number: 6,092,045
[45] Date of Patent: Jul. 18, 2000

[54] METHOD AND APPARATUS FOR SPEECH RECOGNITION

[75] Inventors: Peter R. Stubley, Lachine; Andre Gillet, St-Laurent; Vishwa N. Gupta, Brossard, all of Canada; Christopher K. Toulson, Palo Alto; David B. Peters, San Carlos, both of Calif.

[73] Assignee: Nortel Networks Corporation

[21] Appl. No.: 09/119,621

[22] Filed: Jul. 21, 1998

[30] Foreign Application Priority Data
Sep. 19, 1997 [CA] Canada .......... 2216224

[51] Int. Cl.7 .......... G10L 15/10; G10L 15/14
[52] U.S. Cl. .......... 704/254; 704/256
[58] Field of Search .......... 704/256, 251, 254, 255, 200, 231, 240, 242, 238, 236, 243, 244, 252

[56] References Cited

U.S. PATENT DOCUMENTS
`
4,164,025 8/1979 Dubnowski et al. .......... 364/900
4,751,737 6/1988 Gerson et al. .......... 381/43
4,797,910 1/1989 Daudelin .......... 379/67
4,805,219 2/1989 Baker .......... 381/43
4,959,855 9/1990 Daudelin .......... 379/213
4,979,206 12/1990 Padden et al. .......... 379/67
5,050,215 9/1991 Nishimura .......... 381/41
5,052,038 9/1991 Shepard .......... 379/88
5,086,479 2/1992 Takenaga et al. .......... 382/14
5,091,947 2/1992 Ariyoshi et al. .......... 381/42
5,097,509 3/1992 Lennig .......... 381/43
5,127,055 6/1992 Larkey .......... 381/43
5,163,083 11/1992 Dowden et al. .......... 379/88
5,204,894 4/1993 Darden .......... 379/88
5,271,088 12/1993 Bahler .......... 704/200
5,274,695 12/1993 Green .......... 379/88
5,307,444 4/1994 Tsuboka .......... 395/22
5,390,278 2/1995 Gupta et al. .......... 395/2.52
5,459,798 10/1995 Bailey et al. .......... 382/218
5,488,652 1/1996 Bielby et al. .......... 379/88
5,515,475 5/1996 Gupta et al. .......... 395/2.51
5,621,859 4/1997 Schwartz et al. .......... 395/2.65
5,857,169 1/1999 Seide .......... 704/256
5,892,960 4/1999 Seide .......... 395/800
5,930,753 7/1999 Potamianos et al. .......... 704/243
5,933,806 8/1999 Beyerlein et al. .......... 704/256

FOREIGN PATENT DOCUMENTS
0 703 568 3/1996 European Pat. Off. .......... G10L 5/06
0 810 583 12/1997 European Pat. Off. .......... G10L 5/06

OTHER PUBLICATIONS

Kuhn, R., et al., "A Cache-Based Natural Language Model for Speech Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, No. 6, Jun. 1990, pp. 570-583 (XP000136978).
Bakis, R. and Cole, A.G., "Dynamic Modification of the Vocabulary of a Speech Recognition Machine," IBM Technical Disclosure Bulletin, vol. 27, No. 7A, (1984).
Gao, et al., "Dynamic Adaptation of Hidden Markov Model for Robust Speech Recognition," 1989 IEEE International Symposium on Circuits and Systems, vol. 2 of 3, pp. 1336-1339, (1989).
(List continued on next page.)

Primary Examiner—Richemond Dorvil
Attorney, Agent, or Firm—Foley & Lardner

[57] ABSTRACT

Comparing a series of observations representing unknown speech, to stored models representing known speech, the series of observations being divided into at least two blocks each comprising two or more of the observations, is carried out in an order which makes better use of memory. First, the observations in one of the blocks are compared (31) to a subset comprising one or more of the models, to determine a likelihood of a match to each of the one or more models. This step is repeated (33) for models other than those in the subset; and the whole process is repeated (34) for each block.

21 Claims, 12 Drawing Sheets
`
[Representative drawing (FIG. 5): WORD RECOGNITION 14: FIND FIRST OBSERVATION IN FIRST BLOCK, AND FIND FIRST MODEL 30 → COMPARE OBSERVATION TO MODEL 31 → NEXT OBSERVATION, UNTIL END OF BLOCK 32 → NEXT MODEL 33 → NEXT BLOCK OF OBSERVATIONS 34]
`
`IPR2023-00037
`Apple EX1020 Page1
`
`
`
`
6,092,045
Page 2

OTHER PUBLICATIONS

Gorin, A.L., et al., "Adaptive Acquisition of Language," Computer Speech and Language, vol. 5, No. 2, pp. 101-132, (1991).
Lennig, M., et al., "Automated Bilingual Directory Assistance Trial in Bell Canada," IEEE Workshop on Interactive Voice Technology for Telecom Applications, (1992).
Lennig, M. and Sharp, D., "Unleashing The Potential Of Human-To-Machine Communication," Telesis, No. 97, pp. 23-37, (1993).
Lennig, M., "Putting Speech Recognition to Work in the Telephone Network," IEEE Computer Society, vol. 23, No. 8 (1990).
Lennig, M., et al., "Flexible Vocabulary Recognition of Speech Over the Telephone," IEEE Workshop on Interactive Voice Technology for Telecom Applications, (1992).
Nagata, S., et al., "Mobile Robot Control by a Structured Hierarchical Neural Network," pp. 69-76, (1989).
Rabiner, L.R. and Juang, B.H., "An Introduction to Hidden Markov Models," IEEE ASSP Magazine, vol. 3, No. 1 (1986).
Young, S., "Large Vocabulary Continuous Speech Recognition: a Review," IEEE Automatic Speech Recognition Workshop, (1995).
`
`
`
`
[Drawing Sheet 1 of 12, FIG. 1 (prior art): flowchart of a known speech recognition process: SPEECH INPUT → CHANNEL ADAPTATION 11 → FEATURE EXTRACTION 12 → WORD ENDPOINT DETECTION 13 → WORD RECOGNITION BY FINDING CLOSEST MATCHES TO STORED VOCABULARY 14 → ACCEPT/REJECT DECISION 15 → OUTPUT THE BEST MATCH, OR REJECTION]
`
`
`
`
[Drawing Sheet 2 of 12, FIG. 2: known service controller apparatus: service processors 101 and 102, exchange 104, and speech recognition processor 105 on a voice bus 107; X.25 interface 106 to a data network; multi-channel telecoms link to the PSTN; connections to a LAN/WAN]
`
`
`
`
[Drawing Sheet 3 of 12, FIG. 3: speech recognition processor card 105, with VME bus interface 121 and global memory; FIG. 18: desktop computer 132, with Pentium processor 133, on-chip cache 134, and global memory]
`
`
`
`
[Drawing Sheet 4 of 12, FIG. 4 (prior art): WORD RECOGNITION (BY VITERBI ALGORITHM) 14: INPUT WORD AS A SERIES OF OBSERVATIONS IN FORM OF CEPSTRAL VECTORS 20 → COMPARISON STEP: FOR CURRENT OBSERVATION IN SERIES, COMPUTE PROBABILITY OF IT BEING THE SAME AS EACH PATH IN THE STORED HMM TRELLIS 21 → REPEAT FOR NEXT HMM UNTIL END 22 → REPEAT FOR NEXT OBSERVATION UNTIL END 23 → OUTPUT SCORES FOR BEST MATCHING WORDS]
`
`
`
`
[Drawing Sheet 5 of 12, FIG. 5: WORD RECOGNITION: FIND FIRST OBSERVATION IN FIRST BLOCK, AND FIND FIRST MODEL 30 → COMPARE OBSERVATION TO MODEL 31 → NEXT OBSERVATION, UNTIL END OF BLOCK 32 → NEXT MODEL 33 → NEXT BLOCK OF OBSERVATIONS 34]
`
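The block order of FIG. 5, in contrast to the observation-at-a-time order of FIG. 4, amounts to a loop nest in which the model loop sits inside the block loop, so each model is revisited once per block rather than once per observation. The following is a minimal illustrative sketch, not the patent's implementation; the function name, the dictionary-of-totals layout, and the scoring callback are all hypothetical.

```python
# Illustrative sketch of the block-ordered comparison of FIG. 5
# (names are hypothetical, not from the patent).

def block_order_scores(observations, models, block_size, score):
    """Score every model against every observation, visiting the
    observations block by block (FIG. 5, steps 30-34) rather than one
    at a time (FIG. 4).  Each model's parameters are touched once per
    block, not once per observation."""
    totals = {m: 0.0 for m in models}
    for start in range(0, len(observations), block_size):   # step 34
        block = observations[start:start + block_size]
        for m in models:                                    # step 33
            for obs in block:                               # steps 31-32
                totals[m] += score(m, obs)
    return totals

# Toy usage: "models" are reference values, "score" a negative distance.
obs = [1.0, 2.0, 3.0, 4.0]
models = {"A": 1.5, "B": 4.0}
t = block_order_scores(obs, models, block_size=2,
                       score=lambda m, o: -abs(models[m] - o))
```

Either loop order computes the same totals; only the memory access pattern differs, which is the point of the block order.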
`
`
`
[Drawing Sheet 6 of 12, FIG. 6: table of scores for a series of observations making up the word being recognised, matched against a dictionary of models ("ABRAHAM", "BEACH", "CARROT", "DEPTH"), indicating the known, observation-by-observation order of calculating the scores; FIG. 7: the same table of scores with the observations divided into BLOCK 1 and BLOCK 2, with example percentage scores, indicating the block order of calculation according to an embodiment of the invention]
`
`
`
`
[Drawing Sheet 7 of 12, FIG. 9: trellis of an HMM: states 1 to 4 plotted against successive frames, with paths leading to the end state or the next HMM]
`
`
`
`
[Drawing Sheet 8 of 12, FIG. 10: trellis for a block Viterbi search, with input scores u(·), state scores w(·), and output scores v(·) across the observations of each block]
`
`
`
`
`
`
[Drawing Sheet 9 of 12, FIG. 12: a block Viterbi lexical graph, showing branch types and score buffer assignment; FIG. 13: a fast match model with states S1 and S2]
`
`
`
`
[Drawing Sheet 10 of 12, FIG. 14: pseudo code for the block Viterbi algorithm:]

 1. Copy in to Buffer[0].
 2. Mark all buffers free. Mark Buffer[0] in use.
 3.
 4. for each branch b {
 5.   Define inp as the index of the input buffer to b.
 6.   if (b is a terminal branch) {
 7.     do_terminal_processing(leaf(b), Buffer[inp]).
 8.   }
 9.   else {
10.     Define outp as the index of the output buffer for b.
11.     if (outp is in use) {
12.       Mark b as an update type.
13.     }
14.     else {
15.       Mark outp in use.
16.     }
17.     process_branch_by_type(b, inp, outp).
18.   }
19.   Mark inp free.
20. }

FIG. 14
`
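The bookkeeping in the FIG. 14 pseudo code (assign Buffer[0] the input scores, walk the branches, give each non-terminal branch an output buffer, and mark a branch "update" when its output buffer already holds scores from an earlier branch) can be sketched as follows. This is a simplified, runnable analog under assumed data layouts: branches are reduced to (input buffer, output buffer, score delta) tuples, and the patent's process_branch_by_type is collapsed to a max-merge; none of these names or signatures are from the patent.

```python
# Simplified analog of the FIG. 14 buffer bookkeeping (hypothetical
# data layout; branch processing reduced to a max-merge of scores).

def process_graph(in_scores, branches, num_buffers):
    """branches: list of (inp, outp, delta); outp None marks a terminal
    branch.  Returns terminal scores and the final buffer contents."""
    buffers = [None] * num_buffers
    buffers[0] = list(in_scores)            # 1. copy in to Buffer[0]
    in_use = [False] * num_buffers
    in_use[0] = True                        # 2. mark Buffer[0] in use
    terminals = []
    for inp, outp, delta in branches:       # 4. for each branch b
        src = buffers[inp]
        if outp is None:                    # terminal branch
            terminals.append(max(src) + delta)
        elif in_use[outp]:                  # "update" type: merge scores
            buffers[outp] = [max(a, b + delta)
                             for a, b in zip(buffers[outp], src)]
        else:                               # first writer: plain assign
            buffers[outp] = [b + delta for b in src]
            in_use[outp] = True
    return terminals, buffers

# Two branches feed buffer 1, so the second is an update; a terminal
# branch then reads buffer 1.
terminals, bufs = process_graph([0.0, 1.0],
                                [(0, 1, 0.5), (0, 1, 1.0), (1, None, 0.0)],
                                num_buffers=2)
```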
`
`
`
[Drawing Sheet 11 of 12, FIG. 15: DETERMINE TYPE OF EACH BRANCH IN LEXICAL GRAPH 205 → COMPARE OBSERVATION 206 → INPUT SCORE FROM PREVIOUS BRANCH 207 → PROCESS BRANCH ACCORDING TO BRANCH TYPE 208 → NEXT BRANCH 209 → NEXT OBSERVATION UNTIL END OF BLOCK 210 → NEXT MODEL 211]
`
`
`
`
[Drawing Sheet 12 of 12, FIG. 16: an optional HMM and trellis, t = 0 through t = 5; FIG. 17: pseudo code for a block Viterbi algorithm for optional branches:]

 1. Restore state scores from global memory.
 2. s1 = max[s1, u(0)].
 3. v(0) = u(0).
 4. for t = 0, 1, ..., B-1 {
 5.   v(t+1) = max[u(t+1), s3 + b3(t)].
 6.   s3 = max[s2 + b2(t), s3 + b3(t)].
 7.   s2 = max[s1 + b1(t), s2 + b2(t)].
 8.   s1 = max[u(t+1), s1 + b1(t)].
 9. }
10. Save the state scores in global memory.

FIG. 17
`
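A FIG. 17-style block update can be sketched as below: the state scores s1..s3 persist across blocks, the states are updated in place from last to first so each update reads the pre-update value of the state before it, u(t) are input scores from the previous branch and v(t) output scores for the next. This is a hedged sketch, not the patent's code: the single per-state score b(i, t) (standing for combined transition and observation log-scores) and the function signature are assumptions.

```python
# Hedged sketch of a FIG. 17-style block update for a 3-state
# left-to-right HMM.  b(i, t) is a hypothetical combined transition +
# observation log-score for state i at frame t of the block.

def block_update(s, u, b, B):
    """s = [s1, s2, s3]: state scores restored from (global) memory;
    u: input scores u(0..B).  Returns updated s and outputs v(0..B)."""
    s1, s2, s3 = s
    v = [0.0] * (B + 1)
    s1 = max(s1, u[0])                         # merge incoming score
    v[0] = u[0]
    for t in range(B):
        v[t + 1] = max(u[t + 1], s3 + b(3, t))  # score leaving the model
        s3 = max(s2 + b(2, t), s3 + b(3, t))    # update s3 before s2 ...
        s2 = max(s1 + b(1, t), s2 + b(2, t))    # ... so old values are read
        s1 = max(u[t + 1], s1 + b(1, t))        # re-enter or stay
    return [s1, s2, s3], v   # then saved back to global memory

# Toy run: uniform score of -1, block of B = 2 observations.
scores, v = block_update([0.0, 0.0, 0.0], [1.0, -5.0, -5.0],
                         lambda i, t: -1.0, B=2)
```

Because state scores are restored and saved only once per block, the slow global-memory traffic is once per block per model, matching the advantage claimed for the block order.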
`
`
`
METHOD AND APPARATUS FOR SPEECH RECOGNITION

FIELD OF THE INVENTION

The invention relates to methods of comparing a series of observations representing unknown speech, to stored models representing known speech, to methods of recognising patterns in a series of observations, by comparing the observations to stored models, to apparatus for such methods, and to software for such methods.

BACKGROUND TO THE INVENTION

Pattern recognition generally, and recognition of patterns in continuous signals such as speech signals, has been a rapidly developing field. A limitation in many applications has been the cost of providing sufficient processing power for the complex calculations often required. This is particularly the case in speech recognition, all the more so when real time response is required, for example to enable automated directory enquiry assistance, or for control operations based on speech input. To simulate the speed of response of a human operator, and thus avoid a perception of "unnatural" delays, which can be disconcerting, the spoken input needs to be recognised within about half a second of the end of the spoken input.

The computational load varies directly with the number of words or other elements of speech which are modelled and held in a dictionary for comparison to the spoken input. This is also known as the size of vocabulary of the system. The computational load also varies according to the complexity of the models in the dictionary, and how the speech input is processed into a representation ready for the comparison to the models. Finally, the actual algorithm for carrying out the comparison is clearly a key factor. Numerous attempts have been made over many years to improve the trade off between computational load, accuracy of recognition, and speed of recognition. For useable systems, having a tolerable recognition accuracy, the computational demands are high. Despite continuous refinements to models, speech input representations, and recognition algorithms, and advances in processing hardware, there remains great demand to improve the above mentioned trade off.

There are five main steps: audio channel adaptation, feature extraction, word end-point detection, speech recognition, and accept/reject decision logic. The speech recognition step, the fourth stage, is the most computationally intensive step, and thus a limiting factor as far as the above mentioned trade off is concerned. Depending on the size of vocabularies used, and the size of each model, both the memory requirements and the number of calculations required for each recognition decision may limit the speed/accuracy/cost trade off. Examples of such systems are described in U.S. Pat. No. 5,390,278 (Gupta et al.), and in U.S. Pat. No. 5,515,475 (Gupta).

SUMMARY OF THE INVENTION

It is an object of the invention to provide an improved method and apparatus.

According to the invention, there is provided a method of comparing a series of observations representing unknown speech, to stored models representing known speech, the series of observations being divided into at least two blocks each comprising two or more of the observations, the method comprising the steps of:
a) comparing two or more of the observations in one of the blocks of observations, to a subset comprising one or more of the models, to determine a likelihood of a match to each of the one or more models;
b) repeating step a) for models other than those in the subset; and
c) repeating steps a) and b) for a different one of the blocks.

Amongst the advantages of such a method is that the speed of processing of step a), the critical innermost loop, can be improved. The innermost loop is the part which is repeated most often, and by carrying out the operations in the order specified, cache misses can be reduced, and pipelining efficiency improved. More specifically, slow main memory accesses may be required only once per block per model, instead of once per observation per model.

Preferably, the observations are represented as multidimensional vectors, for the comparison at step a). In such cases, processing requirements may be particularly onerous, and the advantages of the method become more valuable.

Advantageously, the comparison at step a) uses a Viterbi algorithm. The Viterbi algorithm is a dynamic programming algorithm which is preferred for comparing observations to particular types of models. If such models are time series of states joined by transitions, the Viterbi algorithm determines the probability of each observation matching one or more of those states. If the models comprise strings of states, cumulative probability scores for each string can be determined, to find the best path. In such cases, processing requirements may be particularly onerous, and the advantages of the method become more valuable.

Advantageously, the models are represented as finite state machines with probability distribution functions attached. Continuously varying signals can be modeled as a series of stationary states, with a probability function at each state, indicating whether the signal moves to the next state or stays at the given state, after a given time interval. An example is the Hidden Markov Model (HMM). This enables the models to contain a representation of patterns with varying rates of change, yet a common underlying pattern. However, again, processing requirements may be particularly onerous, and the advantages of the method become more valuable.

Advantageously, the models comprise groups of representations of phonemes. This enables the models to be stored and used more efficiently. Recognising patterns in speech signals is particularly processor intensive, because of the wide range of differences in sounds carrying the same meaning. Also, in many speech recognition applications, it is necessary to carry out recognition in real time, that is, in half a second or so after the speaker has finished. Accordingly, again, processing requirements may be particularly onerous, and the advantages of the method become more valuable.

Advantageously, the models comprise representations of elements of speech, and step a) comprises the step of: comparing the block of observations to a predetermined sequence of the models in the subset. The amount of processing can be reduced if the models are stored as sequences of elements of speech. Many utterances share elements with other utterances. To avoid comparing the same elements multiple times, the individual models of the elements can be merged into sequences. The common parts can be represented once, with branches where the elements are no longer common.

Advantageously, step a) comprises the steps of:
comparing the block of observations to a predetermined sequence of the models in the subset;
determining, for each of the models in the sequence, a score which represents the likelihood of a match with the observations compared so far;
`
`
`
storing the score in a score buffer for use in determining scores of subsequent models in the sequence; and
determining when the score is no longer needed, then re-using the score buffer to store a subsequent score.
This can bring an advantage in reducing memory requirements, which can relieve a significant constraint. This may occur in particular where the models comprise large numbers of paths, each of which has to have a cumulative probability score maintained, or where there are a large number of models.
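The score-buffer reuse described above can be sketched as a small buffer pool: a slot holding one model's cumulative score is released as soon as no later model in the sequence needs it, and the same slot is then handed out for a subsequent score. The pool class and its method names below are hypothetical, purely to illustrate the reuse.

```python
# Minimal sketch of score-buffer reuse (hypothetical pool API, not the
# patent's implementation).

class ScorePool:
    def __init__(self):
        self.free = []      # indices of slots available for reuse
        self.slots = []     # the score buffers themselves

    def acquire(self, scores):
        """Place scores in a free slot if one exists, else grow."""
        slot = self.free.pop() if self.free else len(self.slots)
        if slot == len(self.slots):
            self.slots.append(None)
        self.slots[slot] = list(scores)
        return slot

    def release(self, slot):
        self.free.append(slot)   # slot may now hold a subsequent score

pool = ScorePool()
a = pool.acquire([0.1, 0.2])
pool.release(a)                  # score no longer needed
b = pool.acquire([0.3])          # reuses the same slot: no new memory
```

The total memory used is bounded by the number of scores simultaneously live, not by the number of models in the sequence.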
Advantageously, step a) comprises the step of:
comparing the block of observations to a lexical graph comprising a predetermined sequence of the models in the subset, wherein the sequence comprises different types of models, and the comparison is dependent on the type; and the method comprises the step of:
determining the types of the models before the block is compared.
Where different algorithms are used, tuned to suit different types of model, or different branches of a sequence of models, processing can be speeded up if it is unnecessary to test for branch type during processing. In particular, it becomes easier for a compiler to pipeline the algorithm. If a branch is encountered, which route to take must be resolved before continuing; thus the pipelining, by preprocessing of instructions ahead of the current instruction, may be held up.
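Determining the branch types once, before the block is compared, can be sketched as building a precomputed dispatch plan so that no type test sits inside the time-critical loop. The structures below (dict-based branches, a handler table, an event log) are hypothetical, for illustration only.

```python
# Illustrative sketch of branch-type pre-determination (hypothetical
# structures): classify every branch once per block, then run the
# inner loop through a precomputed list of (handler, branch) pairs.

def classify(branches, handlers):
    """Resolve each branch's handler once, outside the inner loop."""
    return [(handlers[b["type"]], b) for b in branches]

def run_block(block, plan):
    for obs in block:
        for handler, b in plan:   # no branching on type in here
            handler(b, obs)

log = []
handlers = {"normal": lambda b, o: log.append(("n", o)),
            "update": lambda b, o: log.append(("u", o))}
plan = classify([{"type": "normal"}, {"type": "update"}], handlers)
run_block([1, 2], plan)
```

With the type resolved ahead of time, the inner loop's control flow is uniform, which is what makes it easier to pipeline.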
Preferably, the models comprise finite state machines, having multiple state sequences, wherein step a) comprises the steps of:
determining state scores for the matches between each respective observation and state sequences of the respective model,
making an approximation of the state scores, for the observation, for storing to use in matching subsequent observations, the approximation comprising fewer state scores than were determined for the respective observation.
This enables the amount of storage required for state scores to be reduced. This is particularly advantageous where there are a large number of models, each with multiple states and therefore a large number of state scores to be stored for use in the next block. In such cases, a two pass approach may be used, in which a rough, or fast match, algorithm is used on the first pass, to identify a small number of similar models, which are then compared in more detail in the second pass. The approximation is particularly advantageous for the first pass, to reduce the memory used, and to speed up the process, since fewer memory accesses and fewer calculations are required. This is particularly advantageous if the calculations are time consuming floating point calculations.
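One simple form such an approximation could take (a sketch only; the patent does not prescribe this particular compression, and the names are illustrative) is to carry forward just the best state score and its index between blocks, instead of every per-state score:

```python
# Hedged sketch: approximating per-state scores between blocks by
# keeping only the best score and its state index, rather than all N
# scores (illustrative, not the patent's specific approximation).

def approximate_state_scores(state_scores):
    best = max(range(len(state_scores)), key=lambda i: state_scores[i])
    return best, state_scores[best]          # 2 numbers instead of N

def restore_state_scores(approx, num_states, floor=float("-inf")):
    best, score = approx
    scores = [floor] * num_states
    scores[best] = score                     # other states start at floor
    return scores

approx = approximate_state_scores([-3.0, -1.5, -2.2])
full = restore_state_scores(approx, 3)
```

For a first, rough pass this cuts both the memory saved per model per block and the number of memory accesses, at the cost of some scoring precision.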
According to another aspect of the invention, there is provided a method of recognising patterns in a series of observations, by comparing the observations to stored models, using a processing means having a main memory for storing the models and a cache memory, the cache memory being too small to contain all the models and observations, the series of observations being divided into blocks of at least two observations, the method comprising the steps of:
a) using the processor to compare a subset of the models to the observations in one of the blocks of observations, to recognise the patterns, the subset of the models being small enough to fit in the cache memory;
b) repeating step a) for a different subset of the models; and
c) repeating steps a) and b) for a different one of the blocks.
`
`According to another aspect of the invention, there is
`provided a method of recognising patterns in a series of
`observations by comparing the observations to stored
`models, the series of observations being divided into at least
`two blocks each comprising two or more of the observations,
`the models comprising finite state machines, having multiple
`state sequences, the method comprising the steps of:
`a) comparing two or more of the observations in one of
`the blocks of observations, to a subset comprising one
`or more of the models, to determine a likelihood of a
`match to each of the one or more models, by deter-
`mining which of the state sequences of the respective
`modelis the closest match, and how close is the match;
`b) repeating step a) for models other than those in the
`subset; and
`c) repeating steps a) and b) for a different one of the
`blocks.
`According to another aspect of the invention, there is
`provided a method of comparing a series of observations
`representing unknownspeech,to stored models representing
`known speech, by comparing the observations to stored
`models, the series of observations being grouped into one or
`more blocks each comprising two or more of the
`observations, the models comprising finite state machines,
`having multiple state sequences, the method comprising, for
`each of the one or more blocks, the steps of:
`a) comparing two or more of the observations in the
`respective block, to a subset comprising one or more of
`the models, to determinea likelihood of a match to each
`of the one or more models, by determining which of the
`state sequences of the respective model is the closest
`match, and how close is the match; and
`b) repeating step a) for models other than those in the
`subset.
`In some applications, the series of observations may be
`short enough that only a single block is necessary. The
`advantages set out abovestill apply.
`According to another aspect of the invention, there is
`provided software stored on a computer readable medium
`for comparing a series of observations representing
`unknown speech,
`to stored models representing known
`speech,the series of observations being divided into at least
`two blocks each comprising two or more of the observations,
`the software being arranged for carrying out the steps of:
`a) comparing two or more of the observations in one of
`the blocks of observations, to a subset comprising one
`or more of the models, to determine a likelihood of a
`match to each of the one or more models;
`b) repeating step a) for models other than those in the
`subset; and
`c) repeating steps a) and b) for a different one of the
`blocks.
`According to another aspect of the invention, there is
`provided software stored on a computer readable medium
`for recognising patterns in a series of observations by
`comparing the observations to stored models, the series of
`observations being divided into at least two blocks each
`comprising two or more of the observations, the models
`comprising finite state machines, having multiple state
`sequences, the software being arranged to carry out the steps
`of:
`
`a) comparing two or more of the observations in one of
`the blocks of observations, to a subset comprising one
`or more of the models, to determine a likelihood of a
`
`IPR2023-00037
`Apple EX1020 Page 16
`
`IPR2023-00037
`Apple EX1020 Page 16
`
`
`
match to each of the one or more models, by determining which of the state sequences of the respective model is the closest match, and how close is the match;
b) repeating step a) for models other than those in the subset; and
c) repeating steps a) and b) for a different one of the blocks.
According to another aspect of the invention, there is provided software stored on a computer readable medium for comparing a series of observations representing unknown speech, to stored models representing known speech, by comparing the observations to stored models, the series of observations being grouped into one or more blocks each comprising two or more of the observations, the models comprising finite state machines, having multiple state sequences, the software being arranged to carry out, for each of the one or more blocks, the steps of:
a) comparing two or more of the observations in the respective block, to a subset comprising one or more of the models, to determine a likelihood of a match to each of the one or more models, by determining which of the state sequences of the respective model is the closest match, and how close is the match; and
b) repeating step a) for models other than those in the subset.
According to another aspect of the invention, there is provided a speech recognition processor for comparing a series of observations representing unknown speech, to stored models representing known speech, the series of observations being divided into at least two blocks each comprising two or more of the observations, the processor being arranged to carry out the steps of:
a) comparing two or more of the observations in one of the blocks of observations, to a subset comprising one or more of the models, to determine a likelihood of a match to each of the one or more models;
b) repeating step a) for models other than those in the subset; and
c) repeating steps a) and b) for a different one of the blocks.
According to another aspect of the invention, there is provided a speech recognition processor for recognising patterns in a series of observations by comparing the observations to stored models, the series of observations being divided into at least two blocks each comprising two or more of the observations, the models comprising finite state machines, having multiple state sequences, the processor being arranged to carry out the steps of:
a) comparing two or more of the observations in one of the blocks of observations, to a subset comprising one or more of the models, to determine a likelihood of a match to each of the one or more models, by determining which of the state sequences of the respective model is the closest match, and how close is the match;
b) repeating step a) for models other than those in the subset; and
c) repeating steps a) and b) for a different one of the blocks.
According to another aspect of the invention, there is provided a speech recognition processor for comparing a series of observations representing unknown speech, to stored models representing known speech, by comparing the observations to stored models, the series of observations being grouped into one or more blocks each comprising two or more of the observations, the models comprising finite state machines, having multiple state sequences, the processor being arranged to carry out, for each of the one or more blocks, the steps of:
a) comparing two or more of the observations in the respective block, to a subset comprising one or more of the models, to determine a likelihood of a match to each of the one or more models, by determining which of the state sequences of the respective model is the closest match, and how close is the match; and
b) repeating step a) for models other than those in the subset.
Preferred features may be combined as would be apparent to a skilled person, and may be combined with any aspect of the invention.
To show, by way of example, how to put the invention into practice, embodiments will now be described in more detail, with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows principal steps in a known speech recognition process;
FIG. 2 shows known service controller apparatus for carrying out the method of FIG. 1;
FIG. 3 shows the speech recognition processor of FIG. 2 in more detail;
FIG. 4 shows the recognition step of FIG. 1 in more detail;
FIG. 5 shows the recognition step according to an embodiment of the invention;
FIG. 6 shows a table of scores for a series of observations matched against models, indicating the known order of calculating the scores;
FIG. 7 shows the table of scores as shown in FIG. 6, indicating an order of calculating the scores according to an embodiment of the invention;
FIG. 8 shows a basic HMM topology;
FIG. 9 shows a trellis of an HMM;
FIG. 10 shows a trellis for a block Viterbi search;
FIG. 11 shows how full models are collapsed to become fast-match models;
FIG. 12 shows a block Viterbi lexical graph, showing branch types and score buffer assignment;
FIG. 13 shows a fast match model;
FIG. 14 shows pseudo code for the block Viterbi algorithm;
FIG. 15 shows the recognition step of FIG. 1 adapted for branch type optimization;
FIG. 16 shows an optional HMM and trellis;
FIG. 17 shows pseudo code for a block Viterbi algorithm for optional branches; and
FIG. 18 shows an alternative recognition processor arrangement for carrying out some of the steps of FIG. 1.

DETAILED DESCRIPTION

FIG. 1: Voice Recognition
FIG. 1 shows an overall view of an example of a known pattern recognition process, for voice recognition. There are five main steps: channel adaptation (1), feature extraction (2), word end-point detection (3), speech recognition (4), and accept/reject decision logic (5).
In the first step—channel adaptation—the system monitors the characteristics of the telephone line that carries the speech, and creates an initial mathematical model of channel parameters, such as line noise level. This model, which is
`
`
`
updated continually throughout the recognition process, is used to adapt the vector of coefficients generated during feature extraction so that the coefficients respond to foreground speech and block out line channel variations.
During the second step—feature extraction—the system divides the unknown word or phrase into short segments, called frames. Each frame lasts 12.75 milliseconds. Thus, a word such as Mississippi, which takes about a second to say, would be broken down into about 100 frames.
`the spectrum. Each feature vector is a string of 15 coeffi-
`cients. Thefirst seven coefficients measure the energy of the
`frame at different
`frequency bands to distinguish,
`for
`example, between a high-frequency /s/ sound, and a lower-
`frequency vowel. The remaining coefficients measure the
`rate of change of these spectral energies.
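The frame count quoted above follows if frames are advanced more often than their own length, i.e. if successive 12.75 ms frames overlap. The sketch below assumes a 10 ms frame advance, which is not stated in the text (the text gives only the 12.75 ms frame length and the "about 100 frames" figure); it is included purely to make the arithmetic concrete.

```python
# Sketch of the framing arithmetic.  The 12.75 ms frame length is from
# the text; the 10 ms frame advance (hop) is an assumption.

def frame_count(duration_ms, frame_ms=12.75, hop_ms=10.0):
    """Number of full frames obtainable from a signal of duration_ms,
    with frames of frame_ms advanced by hop_ms."""
    if duration_ms < frame_ms:
        return 0
    return 1 + int((duration_ms - frame_ms) // hop_ms)

n = frame_count(1000)   # a one-second word
```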
The third stage—word endpoint detection—is performed by an algorithm that uses the energy and short-time spectrum of the speech signal. This algorithm removes silence before, after, and in the middle of the speech signal, and filters out unwanted background noise in order to expedite the speech-recognition stage.
Speech recognition is the fourth stage of the process. At this stage, the vector representations generated for each frame are compared with the FVR system's stored models for the recognition vocabulary. Each word in the vocabulary is represented by a string of hidden Markov models (HMMs), one for each phoneme in the word, as shown in U.S. Pat. No. 5,390,278 (Gupta et al.). The stored string of phoneme models that produces the best match is determined using the multiple-pass approach, at which point the spoken word is considered recognized, as shown in U.S. Pat. No. 5,515,475 (Gupta). The matching process is aided by statistical models and efficient search algorithms embedded in the hidden Markov models.
The final stage of the process is rejection decision scoring, which determines whether the best match found during the speech-recognition stage should be accepted or rejected. To perform this evaluation, the recognizer employs a mathematical distribution matrix. The closer the match, the more likely it is to be accepted. This feature provides the system with the ability to reject imposter input, or words not in its vocabulary.
FIG. 2, known service controller apparatus:
FIG. 2 shows some of the principal hardware elements for a voice processing platform known as the Network Applications Vehicle (NAV). The NAV platform is designed for enhanced network services, and enables network providers to support an array of revenue-generating and cost-saving interactive and automated speech-recognition s