`Robinson
`
US005983180A
[11] Patent Number: 5,983,180
[45] Date of Patent: Nov. 9, 1999
`
[54] RECOGNITION OF SEQUENTIAL DATA
USING FINITE STATE SEQUENCE MODELS
ORGANIZED IN A TREE STRUCTURE
`
[75] Inventor: Anthony John Robinson, Cambridge,
United Kingdom

[73] Assignee: SoftSound Limited, Hertfordshire,
United Kingdom
`
Tony Robinson and James Christie, "Time-First Search for
Large Vocabulary Speech Recognition," Proc. 1998 IEEE
Int. Conf. Acoust., Speech, and Sig. Proc., May 12–15,
1998, vol. 2, pp. 829–832.
Aubert, X., et al., "Large Vocabulary Continuous Speech
Recognition of Wall Street Journal Data", Proceedings of the
IEEE International Conference on Acoustics, Speech, and
Signal Processing, vol. 2, 129–132, (1994).
`
[21] Appl. No.: 09/031,881
[22] Filed: Feb. 27, 1998
[30] Foreign Application Priority Data
Oct. 31, 1997 [GB] United Kingdom ................... 9723092
Jan. 13, 1998 [GB] United Kingdom ................... 9800658
[51] Int. Cl. .................................................. G10L 5/06
[52] U.S. Cl. ............................................. 704/254; 704/256
[58] Field of Search ..................................... 704/241–245, 704/254–257
`
`56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`4,748,670 5/1988 Bahl et al. .............................. 704/256
`5,033,087
`7/1991 Bahl et al. ...
`... 704/256
`5,349,645 9/1994 Zhao ................
`... 704/243
`5,621,859 4/1997 Schwartz et al. ....................... 704/256
`FOREIGN PATENT DOCUMENTS
`0627726 12/1994 European Pat. Off. .......... G10L 3/OO
`97/08686 3/1997 WIPO .............................. G10L 5/06
`
OTHER PUBLICATIONS
Douglas B. Paul, "An Efficient A* Stack Decoder Algorithm
for Continuous Speech Recognition with a Stochastic Language
Model," Int. Conf. on Acoust., Speech, and Sig. Proc.,
1992, ICASSP-92, 23–25 Mar. 1992, vol. 1, pp. 125–128.
Patrick Kenny, Rene Hollan, Vishwa N. Gupta, Matthew
Lennig, P. Mermelstein, and Douglas O'Shaughnessy, "A*-admissible
Heuristics for Rapid Lexical Access," IEEE
Trans. Speech and Audio Proc., Jan. 1993, vol. 1, iss. 1, pp.
49–58.
`
`
`
`
Primary Examiner-David R. Hudspeth
Assistant Examiner-Donald L. Storm
Attorney, Agent, or Firm-Schwegman, Lundberg, Woessner
& Kluth, P.A.

[57] ABSTRACT
In a method of automatically recognizing data which comprises
sequential data units represented as sequential tokens
grouped into one or more items, known items are stored as
respective finite state sequence models. Each state corresponds
to a token and the models which have common prefix
states are organized in a tree structure such that suffix states
comprise branches from common prefix states and there are
a plurality of tree structures each having a different prefix
state. Each sequential data unit is compared with stored
reference data units identified by reference tokens to generate
scores indicating the similarity of the data units to
reference data units. An accumulated score for the final state
in the models is determined by steps of (a) sequentially
calculating the accumulated score for a model to reach the
final state comprising a leaf in the tree, (b) identifying the
closest branch to the leaf corresponding to a next model for
which an accumulated score for the final state has not yet
been calculated, and (c) accumulating the score from the
identified closest branch for the next model to the final state.
These steps are repeated for the branches of the trees. The
item corresponding to a model having the highest accumulated
score is recognized as the model best matching the
data.
`
`48 Claims, 24 Drawing Sheets
`
`47
`AccEPTABLE
`
`Amazon / Zentian Limited
`Exhibit 1009
`Page 1
`
`
`
`5,983,180
`Page 2
`
`OTHER PUBLICATIONS
Deller, Jr., J.R., et al., Discrete-Time Processing of Speech
Signals, J. Griffin, ed., Macmillan Publishing Company,
New York, table of contents, (1993).
Jelinek, F., "Up From Trigrams: The Struggle for Improved
Language Models", Proceedings of the European Conference
on Speech Technology, 1037–1040, (1991).
Katz, S.M., "Estimation of Probabilities From Sparse Data
for the Language Model Component of a Speech Recognizer",
IEEE Transactions on Acoustics, Speech, and Signal
Processing, vol. ASSP-35, 400–401, (Mar. 1987).
Moyano, J.H., "Memory Efficient Search for Large Vocabulary
Recognition", Thesis for MPhil in Computer Speech
and Language Processing, Cambridge University, (Aug.
1997).
Rabiner, L., et al., Fundamentals of Speech Recognition,
PTR Prentice-Hall, Inc., publisher, Englewood Cliffs, New
Jersey, table of contents, (1993).
Rabiner, L.R., "A Tutorial on Hidden Markov Models and
Selected Applications in Speech Recognition", Proceedings
of the IEEE, vol. 77, 257–286, (Feb. 1989).
Renals, S., et al., "Decoder Technology for Connectionist
Large Vocabulary Speech Recognition", Cambridge University
Engineering Department Technical Report CS-95-17,
1–22 and appdx., (Sep. 1995).
Robinson, T., et al., "The Use of Recurrent Neural Networks
in Continuous Speech Recognition", Advanced Topics in
Automatic Speech and Speaker Recognition, C.-H. Lee and
F. K. Soong, eds., Kluwer Academic Publishers, 159–184
(Chap. 7), (1996).
Sedgewick, R., Algorithms, Second Edition, Addison-Wesley
Publishing Company, Inc., table of contents, (1988).
Woodland, P., et al., "MPhil in Computer Speech and
Language Processing, Introduction to Speech Recognition",
Course Notes, Cambridge University, (1995).
`
`
`
`
[Sheet 1 of 24: FIG. 1]
`
`
`
[Sheet 2 of 24: FIG. 2 (HMM states against observations O(1)–O(6)); FIG. 3 (HMM state against time)]
`
`
`
[Sheet 3 of 24: FIG. 4]
`
`
`
[Sheet 4 of 24: FIG. 5]
`
`
`
[Sheet 5 of 24: FIG. 6 (HMM state diagram)]
`
`
`
[Sheet 6 of 24: FIG. 7]
`
`
`
[Sheet 7 of 24: FIG. 8; FIG. 9 (HMM state diagram)]
`
`
`
[Sheet 8 of 24: FIG. 10]
`
`
`
[Sheet 9 of 24: FIG. 11 (tree of phone models, e.g. sil, ax, b, ih, t, each with number of states, self probability and exit probability)]
`
`
`
[Sheet 10 of 24: FIG. 12 (the UK phone table: phone, number of states, self probability, exit probability)]
`
`
`
[Sheet 11 of 24: FIG. 12 (continued)]
`
`
`
[Sheet 12 of 24: FIG. 13 (main flow: S1 initialization; S2 point to first phoneme; S4 process node routine; S5 all nodes processed?; S6 output word(s) with highest probability; point to next phoneme)]
`
`
`
[Sheet 13 of 24: FIG. 14 (the process node subroutine: point to first word; store ID of best word and φj(T); next word; done?)]
`
`
`
[Sheet 14 of 24: FIG. 14 (continued) (S21 any children?; S22 point to first child; pass φj(1)..φj(T) to process node subroutine; S25 point to next child; children done?)]
`
`
`
[Sheet 15 of 24: FIG. 15 (accumulated score recursion: t0 = start time, t1 = end time; φj(t0+1) = φi(t0) + log aij + log bj(O(t0+1)); φj(t) = max(φj(t−1) + log ajj, φi(t−1) + log aij) + log bj(O(t)); local pruning of φj(t); find last start time and last end time above the global pruning threshold)]
`
`
`
[Sheet 16 of 24: FIG. 16 (HMM state diagram)]
`
`
`
[Sheet 17 of 24: FIG. 17]
`
`
`
[Sheet 18 of 24: FIG. 18 (stack decoder flow: S30 initialize heap with (START), time = 0 and probability = 0; S31 pop heap item from top of heap; S32 call process tree routine; S33 any words out?; S34 point at first word; S35 language model state in heap?; push new language model state, probabilities, histories and time range onto heap; point at next word; any more words?)]
`
`
`
[Sheet 19 of 24: FIG. 19 (merge routine: S50 merge start; is there overlap in time ranges?; S52 for each time, if new is better replace probability and history; S53 concatenate the rest; are they disjoint?; S57 pad gap with dummy; S58 concatenate two ranges plus dummy; concatenate)]
`
`
`
[Sheet 20 of 24: FIG. 20 ((START) hypotheses plotted against time, 0–90 frames)]
`
`
`
[Sheet 21 of 24: FIG. 21]
`
`
`
[Sheet 22 of 24: FIG. 22 (word hypothesis tree against time, 0–140 frames); FIG. 23 (time, 70–140 frames)]
`
`
`
[Sheet 23 of 24: FIG. 25 (context-dependent phone models, e.g. (ae−t+k)−ax−CAT+k, for the words CAT, THE and SAT)]
`
`
`
[Sheet 24 of 24: FIG. 26A (codons C1–C5); FIG. 26B (genes G1, G2); FIG. 26C (phonemes Ph1–Ph4); FIG. 26D (words W1, W2 composed of phonemes Ph1, Ph2, …, Phn)]
`
`
`
`1
`RECOGNITION OF SEQUENTIAL DATA
`USING FINITE STATE SEQUENCE MODELS
`ORGANIZED IN A TREE STRUCTURE
`
The present invention generally relates to a recognition
method and apparatus for recognising data comprising
sequential data units represented by sequential tokens
grouped into one or more items. In an exemplary embodiment
the present invention is applied to speech recognition.
Systems for recognising sequences of data units represented
as sequential tokens which can be grouped into one
or more items attempt to recognise items within the data by
matching groups of tokens to a pre-stored dictionary of
sequences of tokens representing known items. Once a
match has been found and thus the input data recognised, an
appropriate output can be generated. Typically for speech
recognition this output can take the form of a command to
instruct apparatus to carry out a function, or it can be
translated into text for input into an application such as a
word processor.
FIG. 1 illustrates a typical apparatus used for speech
recognition. A computer 1 has a keyboard 3 and is interfaced
via an input/output device 11 to a microphone 7 and a
loudspeaker 5. A speech signal is detected by the microphone
7 and input to the computer 1, whereupon a program
operates to recognise the speech in order to at least one of
control the computer and generate an acoustic output at the
loudspeaker 5. The loudspeaker 5 can also be used to play
back the speech signal if this is digitally recorded. The
computer 1 is operated under the control of a program which
can be loaded using a computer storage medium such as a
floppy disc 13 or any other type of computer storage medium
such as an optical disc, magnetic tape, or programmable
memory devices.
There are many different methods for carrying out speech
recognition and many techniques are based on probabilistic
finite state sequence models known as hidden Markov
models (HMMs). A Markov chain comprises a plurality of
states wherein a transition probability is defined for each
transition from a state to every other state and a self transition
probability is defined for transitions from a state to itself.
FIG. 2 illustrates a hidden Markov model for the phonetic
sequence ae, b, ax and t for the pronunciation of ABBOT. As
can be seen there are four sequential states which form the
hidden Markov model for the word and there are six
observations O(1) to O(6) which correspond to six sequential
data units representing extracted portions of speech.
Since the hidden Markov model can undergo self transitions,
the transition along the hidden Markov model need not be,
and indeed is rarely, synchronous with the observed portions
of speech. As can be seen in FIG. 2, self transitions occur in
the ae state and the ax state, corresponding to the longer
duration of the pronunciation of these phonemes.
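For concreteness, such a model can be pictured as a small data structure. The following sketch is illustrative only; the probability values are invented and not taken from the patent:

```python
# A hypothetical left-to-right HMM for the word ABBOT (FIG. 2).
# Each state corresponds to one phoneme token; "self" is the
# self-transition probability and "next" the probability of
# advancing to the following state. The values are made up.
abbot_model = [
    {"token": "ae", "self": 0.6, "next": 0.4},
    {"token": "b",  "self": 0.3, "next": 0.7},
    {"token": "ax", "self": 0.6, "next": 0.4},
    {"token": "t",  "self": 0.3, "next": 0.7},
]
# Six observations O(1)..O(6) must be aligned against these four
# states; self transitions absorb the extra frames (e.g. two frames
# spent in "ae" and two in "ax", as in FIG. 2).
```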
FIG. 3 illustrates more clearly the relationship between
the HMM states and time. Each of the transitions from a
state to another state, and the self transition within a state, has
associated with it a probability. Also, there is a probability
that the observation corresponds with an HMM state. Thus
in order to determine the most likely match between the
sequence of observations and an HMM model for a word it
is necessary to calculate an accumulated probability in a
final state of the HMM model. The accumulated probability
in a final state can be calculated by summing log probabilities
or by multiplying the probabilities. In a system with
more than one word model, the word having the highest
accumulated probability can then be identified as the most
likely spoken word corresponding to the speech data.
`
`15
`
`25
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`5,983,180
`
`2
A prior art method for calculating the accumulated
probability for each HMM model is the Viterbi method,
which is described in detail, together with an application of
HMMs to speech recognition, in a paper entitled "A Tutorial
on Hidden Markov Models and Selected Applications in
Speech Recognition" by L. R. Rabiner (Proceedings of the
IEEE, Vol. 77, No. 2, February 1989, pages 257–286). In
implementing the Viterbi algorithm, all of the prior art
methods calculate the accumulated probabilities for all of
the HMM states for all of the models time sequentially. Thus
as can be seen in FIG. 3, where the numbers under each state
indicate the order of calculation of the accumulated probability
for each state, for each observation (O(1) to O(6)), i.e. for
each input portion of speech, a probability of the portion of
speech matching each phoneme (HMM state) is generated
and used to sequentially determine the probability for each
HMM state of each model. This is then repeated for each
observation until an accumulated probability is determined
for the final HMM state in step 24. Thus, as can be seen in
FIG. 4, which illustrates a set of HMM models for a part of
a dictionary, using the Viterbi method it is necessary for the
accumulated probability for the first temporal state for each
of the HMM states of all of the models to be stored. In fact,
two temporal levels of HMM states are stored, since the
accumulated probabilities of the HMM states in the current
time period must be calculated based on the probabilities of
the HMM states in the previous time period. For a small
dictionary this does not present a problem. However, for
large dictionary speech recognition, e.g. 65,000 words, a
large memory is required for the calculation of the probabilities.
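A minimal sketch of this time-synchronous recursion follows. It is not the patent's code; `log_b`, a helper scoring a frame against a token, is assumed, and transition log probabilities are folded into it for brevity. It makes the memory cost visible: two full rows of scores, one per HMM state of every model, are kept alive at all times.

```python
import math

def viterbi_time_sync(models, observations, log_b):
    """Time-synchronous Viterbi over left-to-right word models.

    models: {word: [token, token, ...]}, one state per token.
    log_b(frame, token): assumed helper giving the log score of a
    frame against a token (transition terms omitted for brevity).
    Returns the best-scoring word.
    """
    # One score per HMM state of every model: this is the memory
    # cost the passage describes (two temporal levels, prev and curr).
    prev = {w: [-math.inf] * len(s) for w, s in models.items()}
    for w, states in models.items():
        prev[w][0] = log_b(observations[0], states[0])
    for frame in observations[1:]:
        curr = {}
        for w, states in models.items():
            row = [-math.inf] * len(states)
            for j, token in enumerate(states):
                stay = prev[w][j]                       # self transition
                advance = prev[w][j - 1] if j > 0 else -math.inf
                row[j] = max(stay, advance) + log_b(frame, token)
            curr[w] = row
        prev = curr                                     # two rows alive
    return max(models, key=lambda w: prev[w][-1])
```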
As can be seen in FIG. 4, the calculation of the accumulated
probability at many of the HMM states represents
a duplication. This has been recognised in the prior art, and
in order to avoid this repetition the plurality of HMM models
has been organised into a tree structure (U.S. Pat. No.
5,621,859). FIG. 5 illustrates how the HMM models of FIG.
4 can be rearranged in such a tree structure. As can be seen,
this reduces the memory requirement for the search space in
order to accumulate the probabilities.
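The rearrangement of FIG. 5 amounts to building a prefix tree (trie) over the pronunciations, so that models with common prefix states share nodes. A minimal sketch, with illustrative pronunciations:

```python
def build_lexical_tree(lexicon):
    """Fold pronunciations sharing prefix states into one tree.

    lexicon: {word: [phoneme, ...]}. Each node is
    {"children": {phoneme: node}, "word": word-or-None}.
    """
    root = {"children": {}, "word": None}
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node["children"].setdefault(
                ph, {"children": {}, "word": None})
        node["word"] = word  # final state of this model
    return root

# Illustrative entries: ABBOT and ABBEY share the prefix states ae, b.
tree = build_lexical_tree({
    "ABBOT": ["ae", "b", "ax", "t"],
    "ABBEY": ["ae", "b", "iy"],
})
```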
The prior art method utilising the tree structure of FIG.
5, however, still requires all of the probabilities for the tree
structure to be stored in memory. Thus, whilst there is a
memory saving compared to the use of a complete set of
HMM models arranged linearly, it is desirable to still further
reduce the memory requirements for the search space to
calculate the accumulated probabilities, and it is to the
solution of this problem that one aspect of the present
invention is directed.
There are two types of speech recognition. Discrete
speech recognition constrains the speaker to pause between
each word to thereby provide discretely spoken words for
recognition. Such systems are now being superseded, however,
by systems capable of recognising natural speech, i.e.
continuous speech, wherein a speaker is not constrained to
pause between each spoken word. The problem with such a
system is that spoken words tend to merge together and the
boundaries between words are often indefinite. Thus, the
transition between HMM models is not defined by a single
transition from a final state but instead can comprise multiple
transitions, as illustrated in FIG. 6. As can be seen in
FIG. 6, there are four possible transitions from a previous
word ending in t to a succeeding word ABBEY comprising
the phonemes ae, b, and iy. Thus, for continuous speech
recognition it is necessary for such transitions to be taken
into account.
In speech recognition, in order to improve the accuracy
of the recognition, not only are models for the words used,
but also a language model is used to provide models
indicating the probability of the occurrence of sequences of
words. Typically an N-gram is used, wherein the accumulated
probability to the final state is modified in dependence
upon one or more previous words. Thus, for instance, in a
bigram model for general English there is a higher probability
of the word CAT following the word THE than of the
word THE following the word THE. These probabilities can
be used to modify the accumulated probability in the final
HMM state. In order to do this, however, the accumulated
probability for each HMM model is used as an initial
probability for one or more initial HMM states in a preceding
word. This may be achieved using stack decoding as, for
example, disclosed in U.S. Pat. No. 4,748,670.
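As an illustration, such a bigram correction can be applied to the accumulated log score at a word's final state. The sketch below uses invented probabilities and a hypothetical `lm_weight` parameter; it is the standard adjustment the passage describes, not the patent's specific method:

```python
import math

# Invented bigram probabilities P(word | previous word).
bigram = {("THE", "CAT"): 0.01, ("THE", "THE"): 0.0001}

def lm_adjusted_score(acoustic_log_score, prev_word, word,
                      lm_weight=1.0, floor=1e-8):
    """Add the (weighted) bigram log probability to the accumulated
    acoustic score at the final HMM state of `word`."""
    p = bigram.get((prev_word, word), floor)
    return acoustic_log_score + lm_weight * math.log(p)

# CAT after THE outranks THE after THE for equal acoustic scores.
assert lm_adjusted_score(-100.0, "THE", "CAT") > \
       lm_adjusted_score(-100.0, "THE", "THE")
```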
It is an object of a second aspect of the present invention
to provide a more efficient recognition method for use in,
for example, continuous speech recognition, to reduce the
computational requirements.
In accordance with the first aspect the present invention
provides a method of automatically recognising data comprising
sequential data units represented as sequential tokens
grouped into one or more items, the method comprising the
steps of:
storing data representing known items as respective finite
state sequence models, where each state corresponds to
a token and said models having common prefix states
are organised in a tree structure such that suffix states
comprise branches from common prefix states and
there are a plurality of tree structures each having a
different prefix state;
comparing each sequential data unit with stored reference
data units identified by respective reference tokens to
generate scores for each data unit indicating the similarity
of the data unit to respective said reference data
units;
determining an accumulated score for the final state in the
models by
a) sequentially calculating the accumulated score for a
model to reach the final state comprising a leaf in the
tree,
b) identifying the closest branch to the leaf corresponding
to a next model for which an accumulated score
for the final state has not yet been calculated,
c) accumulating the score from the identified closest
branch for the next model to the final state,
d) repeating steps (b) and (c) for the branches of the
tree, and
e) repeating steps (a) to (d) for the plurality of trees; and
identifying at least the item corresponding to the model
having the highest accumulated score. (A sketch of steps
(a) to (e) is given below.)
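A minimal sketch of steps (a) to (e), assuming a lexical tree of the kind shown earlier and an assumed helper `score_state` that accumulates one state's scores over all frames from its parent's scores:

```python
def search_tree(node, init_scores, score_state, results):
    """Depth-first accumulation of final-state scores (steps (a)-(c)).

    node: lexical tree node as built by build_lexical_tree().
    init_scores: accumulated scores entering this node.
    score_state(token, entry_scores): assumed helper returning the
    accumulated scores for one state over all frames.
    results: {word: final_state_score}, filled in place.
    """
    for token, child in node["children"].items():
        scores = score_state(token, init_scores)   # steps (a) and (c)
        if child["word"] is not None:              # final state reached
            results[child["word"]] = max(scores)
        # Steps (b) and (d): descend into the closest remaining branch.
        search_tree(child, scores, score_state, results)
        # On return, `scores` goes out of use and its memory can be
        # reclaimed before the next sibling branch is computed.
    return results

def recognise(trees, init_scores, score_state):
    """Step (e): repeat over the plurality of trees, then pick the
    item with the highest accumulated score."""
    results = {}
    for tree in trees:
        search_tree(tree, init_scores, score_state, results)
    return max(results, key=results.get)
```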
Also, in accordance with this aspect of the present invention
there is provided recognition apparatus for recognising
sequential tokens grouped into one or more items, the
apparatus comprising:
storage means for storing data representing known items
as respective finite state sequence models, where each
state corresponds to a token and said models having
common prefix states are organised in a tree structure
such that suffix states comprise branches from common
prefix states and there are a plurality of tree structures
each having a different prefix state;
comparing means for comparing each sequential data unit
with stored reference data units identified by respective
reference tokens to generate scores for each data unit
indicating the similarity of the data unit to respective
said reference data units;
determining means for determining an accumulated score
for the final state in the models, comprising
a) means for sequentially calculating the accumulated
score for a model to reach the final state comprising
a leaf in the tree,
b) means for identifying the closest branch to the leaf
corresponding to a next model for which an accumulated
score for the final state has not yet been
calculated, and
c) means for accumulating the score from the identified
closest branch for the next model to the final state,
wherein the scores are accumulated for the branches
of the tree and accumulated for the plurality of trees;
and
means for identifying at least the item corresponding to
the model having the highest accumulated score.
The present invention can be implemented using dynamic
programming to determine the most likely state sequences or
the most likely word sequences, i.e. the maximum accumulated
score is calculated, or within a word the score for all
possible state sequences can be summed to calculate what is
termed the forward probability (see page 335 of "Fundamentals
of Speech Recognition" by L. Rabiner and B. Juang,
Prentice Hall, 1993).
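The two quantities differ only in the reduction applied where competing paths meet: a maximum for the most likely sequence, a (log-domain) sum for the forward probability. A small sketch:

```python
import math

def combine(log_scores, mode="viterbi"):
    """Combine competing path scores arriving at one state.

    mode="viterbi": keep the single best path (max of log scores).
    mode="forward": sum the path probabilities (log-sum-exp).
    """
    if mode == "viterbi":
        return max(log_scores)
    m = max(log_scores)
    return m + math.log(sum(math.exp(s - m) for s in log_scores))

paths = [-10.0, -10.5, -12.0]
# The forward probability is never smaller than the Viterbi score.
assert combine(paths, "forward") > combine(paths, "viterbi")
```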
In a preferred embodiment of the present invention each
state of the hidden Markov model corresponds to a token.
The present invention is, however, not limited to such a direct
comparison and more than one HMM state may correspond
to a token.
Thus, instead of calculating the accumulated probabilities
for all of the states in a time instance, i.e. time synchronously,
in accordance with prior art methods of implementing the
Viterbi algorithm, the accumulated probabilities for each
state are calculated for nodes of a tree until a final state of
a model is reached, which may or may not be a leaf node of
the tree, i.e. state sequentially down branches of the trees.
Once the accumulated probability at the final state has been
determined, the memory space storing the accumulated probabilities
for the nodes of the tree back to the nearest branch
for which an accumulated probability for a model has still to
be calculated is freed for use in the calculation of the
accumulated probabilities down the next branch. In this way
the amount of memory required is reduced. The memory
requirement depends on the number of states in the longest
hidden Markov model and the number of observations. In
contrast, the time synchronous implementation of the Viterbi
algorithm requires a memory capacity equivalent to the
average number of states in the hidden Markov models times
the number of hidden Markov models. The comparing step
can comprise any method of generating probabilities for
each data unit indicating the likelihood of the data unit being
represented by a particular token.
In an embodiment each token is modelled as an HMM
where each state of the HMM corresponds to a subdata unit.
The comparison step then comprises determining accumulated
probabilities for each data unit using the respective HMM.
In order to reduce the calculations, a pruning step is used
to halt the accumulation of scores for any branches where
the accumulated score falls below a threshold. Such a
threshold can be an absolute threshold, although in a preferred
embodiment the threshold is relative to the highest
accumulated score. Such a technique is termed "beam pruning".
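A sketch of such a relative threshold, assuming log-domain scores and a hypothetical beam width:

```python
def beam_prune(scores, beam_width):
    """Keep only hypotheses within `beam_width` of the best log score;
    accumulation is halted for the rest (beam pruning)."""
    best = max(scores.values())
    return {h: s for h, s in scores.items() if s >= best - beam_width}

active = {"ae-b-ax": -42.0, "ae-b-iy": -44.5, "s-ae-t": -70.0}
# With a beam of 10, the s-ae-t branch falls below -52 and is pruned.
print(beam_prune(active, beam_width=10.0))
```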
In accordance with the present invention the tokens can
represent any form of sequential data units which occur in
one or more groups corresponding to items. The tokens
comprise a finite set of tokens where each token appears in
more than one item. An example of such is speech recognition,
wherein speech data comprises parts of speech, e.g.
phonemes or syllables, which occur sequentially and are
grouped into one or more words.
Each token can be independent of preceding and succeeding
tokens. For example, in speech recognition it can be
assumed that there is a set of parts of speech, e.g. phonemes,
which are not modified by the preceding or succeeding part
of speech. Alternatively, the tokens can depend on neighbouring
tokens in the finite state sequence; e.g. in speech
recognition, context dependent phone models can be used to
generate phoneme probabilities.
Since the probabilities of the states of the hidden Markov
model are calculated for each state sequentially, the HMM
states can be considered to have temporal states corresponding
to each data unit, e.g. each time frame. However, the
present invention is not limited to the recognition of time
sequential data and can recognise any data once it is presented
to the system sequentially. The accumulated score for the
final state in the models is calculated by calculating the
accumulated score for the temporal states for a state in a
timewise manner before calculating the temporal states for
the next state in a timewise manner. This makes it possible
for each branch of the lexical tree in a speech recognition
system to be calculated one at a time, reusing the common
suffix phonemes by releasing the memory for HMM states
of the previous model.
In an embodiment of the present invention the accumulated
scores for the temporal states for a final state of a model
of an item are stored as a group in a memory together with
information identifying a grouping model state used to
arrive at the final state, information identifying positions in
the tree structure and information identifying the temporal
position of the accumulated scores. A previously stored
group of accumulated scores is then read from memory and
the accumulated scores for the plurality of temporal states
are used as a plurality of temporally different initial scores for
the calculation of the accumulated scores for the plurality of
temporal states in the final state of a model of a subsequent
item. The grouping model state used to arrive at the final
state of the model of the subsequent item is determined from
the identification of the subsequent item and the grouping
model state of the read group of accumulated scores.
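One way to picture such a stored group is sketched below; the field names are hypothetical, since the patent does not prescribe a particular layout:

```python
from dataclasses import dataclass, field

@dataclass
class ScoreGroup:
    """Accumulated final-state scores for one item, stored as a group."""
    lm_state: tuple          # grouping (language) model state
    history: list            # information identifying how we got here
    tree_position: int      # position in the tree structure (hypothetical id)
    start_time: int          # temporal position of the first score
    scores: list = field(default_factory=list)  # one score per time frame

# Reading a stored group back, its per-frame scores act as temporally
# different initial scores for the model of the subsequent item, and
# the new grouping model state is derived from the old state plus the
# subsequent item.
prev = ScoreGroup(lm_state=("THE",), history=["THE"],
                  tree_position=0, start_time=30,
                  scores=[-50.0, -48.5, -49.2])
new_lm_state = (prev.lm_state + ("CAT",))[-1:]  # bigram keeps last word
```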
This embodiment of the present invention reduces the
calculations required for continuous speech recognition
when there is an overlap between the final states and initial
states of neighbouring words, i.e. the boundary between two
spoken words is not sharply defined. In this embodiment for
speech recognition the grouping model state comprises a
language model state, which for an N-gram model comprises
N−1 words.
A score for each of the grouping model states can be stored
to indicate the likelihood of items being sequentially
grouped based on a grouping model of likely item sequences.
In order to take into account the likelihood of
sequences of items, the accumulated scores for the plurality
of temporal states for the final state of a model of an item can
be modified by the score for the grouping model state before
being stored. Thus for the speech recognition system an
N-gram language model can be used to give the probability
of word sequences.
The memory means in which the groups of accumulated
scores are stored comprises a stack in which the groups are
ordered temporally or by their scores. In this way groups of
accumulated scores for grouping model states can be pushed
onto and popped off the stack as groups.
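A sketch of a score-ordered stack of such groups, using Python's heapq (a min-heap, so scores are negated); the group objects are assumed to carry a `scores` list as in the hypothetical ScoreGroup above:

```python
import heapq

heap = []   # entries: (-best score, insertion order, group)
order = 0   # tie-breaker so groups never compare directly

def push_group(group):
    """Push a group of accumulated scores; best-scoring group on top."""
    global order
    heapq.heappush(heap, (-max(group.scores), order, group))
    order += 1

def pop_group():
    """Pop the group with the highest score off the stack."""
    return heapq.heappop(heap)[2]
```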
In an embodiment of the present invention it can be
determined if a grouping model state is already stored in the
memory means, and if so the calculated accumulated scores
for a final state of a model of an item and associated
information are merged with the stored group of accumulated
scores and associated information in temporal alignment,
whereby if there is temporal overlap between accumulated
scores the highest accumulated score is stored in the
memory means together with the respective associated information.
In accordance with a second aspect of the present invention
there is provided a method of automatically recognising
data comprising sequential data units represented as sequential
tokens grouped into a plurality of items, the method
comprising the steps of:
storing data representing known items as respective finite
state sequence models where each state corresponds to
a token and each state comprises a plurality of temporal
states corresponding in time to the sequence of tokens;
comparing each sequential data unit with stored reference
data units identified by respective reference tokens to
generate scores for each data unit indicating the similarity
of the data unit to respective said reference data
units;
determining accumulated scores for a plurality of temporal
states for a final state in the models by accumulating
scores for a pl