Jiang                                                                US006374219B1
(10) Patent No.: US 6,374,219 B1
(45) Date of Patent: Apr. 16, 2002

(54) SYSTEM FOR USING SILENCE IN SPEECH RECOGNITION

(75) Inventor: Li Jiang, Redmond, WA (US)

(73) Assignee: Microsoft Corporation, Redmond, WA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/026,841

(22) Filed: Feb. 20, 1998

Related U.S. Application Data

(63) Continuation-in-part of application No. 08/934,622, filed on Sep. 19, 1997, now Pat. No. 6,076,056.

(51) Int. Cl.: G10L 15/04
(52) U.S. Cl.: 704/255; 704/251
(58) Field of Search: 704/251, 252, 253, 255, 256, 257, 242

(56) References Cited

U.S. PATENT DOCUMENTS

4,336,421 A     6/1982  Welch et al.
4,852,173 A     7/1989  Bahl et al.
4,977,599 A *  12/1990  Bahl et al.
5,159,637 A *  10/1992  Okazaki et al.
5,202,952 A     4/1993  Gillick et al.
5,794,197 A     8/1998  Alleva et al.
5,799,278 A     8/1998  Cobbett et al.
5,812,974 A *   9/1998  Hemphill et al.
5,848,388 A *  12/1998  Power et al.
6,076,056 A *   6/2000  Huang et al.

FOREIGN PATENT DOCUMENTS

EP  0 762 358 A1   3/1997

OTHER PUBLICATIONS

"Automatic Speech Segment Boundary Detection Using Markov Models," IBM Technical Disclosure Bulletin, vol. 33, No. 7, pp. 323-324, Dec. 1990.
Jelinek et al., "Design of a Linguistic Statistical Decoder for the Recognition of Continuous Speech," IEEE Transactions on Information Theory, vol. IT-21, No. 3, pp. 250-256, May 1975.
Averbuch et al., "An IBM PC Based Large-Vocabulary Isolated-Utterance Speech Recognizer," ICASSP 86, Tokyo, pp. 53-56, 1986.
R.A. Cole et al., "Experiments with a spoken dialogue system for taking the US census," Speech Communication, vol. 23, No. 2, Nov. 1, 1997, pp. 243-260.
H. Ney, R. Haeb-Umbach, B.-H. Tran and M. Oerder, "Improvements in Beam Search for 10000-Word Continuous Speech Recognition," 1992 IEEE, pp. I-9 to I-12.
Fil Alleva, Xuedong Huang and Mei-Yuh Hwang, "Improvements on the Pronunciation Prefix Tree Search Organization," 1996 IEEE, pp. 133-136.

* cited by examiner

Primary Examiner: William Korzuch
Assistant Examiner: Abul K. Azad
(74) Attorney, Agent, or Firm: Joseph R. Kelly; Westman, Champlin & Kelly, P.A.

(57) ABSTRACT

A system for recognizing speech based on an input data stream indicative of the speech provides possible words represented by the input data stream as a prefix tree including a plurality of phoneme branches connected at nodes. The plurality of phoneme branches is bracketed by at least one input silence branch corresponding to a silence phone on an input side of the prefix tree and at least one output silence branch corresponding to a silence phone on an output side of the prefix tree. The prefix tree is traversed to obtain a word that is likely represented by the input data stream. The silence phones provided in the prefix tree can vary based on context.

23 Claims, 7 Drawing Sheets
`
[Cover drawing: block diagram of the speech recognition system. Speech enters microphone 62 and an analog/digital converter; a tree search engine consults the silence-bracketed lexicon, phonetic speech unit model, and CSR lexicon and language model, and drives output device 76.]
`Amazon / Zentian Limited
`Exhibit 1004
`Page 1
`
`
`
`
`
`
[Sheet 2 of 7 — FIG. 2: block diagram of speech recognition system 60, including the microphone, speech detection and silence detection modules, trainer, tree search engine, lexicon, phonetic speech unit model memory, and output device.]
`
`
`
`
[Sheet 3 of 7 — FIG. 3 (PRIOR ART): portion of a prior art prefix tree, terminating in words such as ORANGES.]
`
`
`
`
[Sheet 4 of 7 — FIG. 4: prefix tree with phoneme branches (AO, AE, T, ...) leading to words such as AT, ORANGE, ORANGES, TRY, TRAY, TAX and TRIED, each followed by an output silence branch, e.g. SIL(T,SIL), SIL(JH,SIL), SIL(M,SIL), SIL(EY,SIL), SIL(S,SIL), SIL(D,SIL).]
`
`
`
`
[Sheet 5 of 7 — FIG. 5: the prefix tree of FIG. 4 preceded by context-dependent input silence branches SIL(SIL,AO), SIL(SIL,AE) and SIL(SIL,T) at input node 102, with output silence branches following each word.]
`
`
`
`
[Sheet 6 of 7 — FIG. 6: the prefix tree of FIG. 5 after pruning, showing input silence branches 104 (SIL(SIL,AO)), 106 (SIL(SIL,AE)) and 108 (SIL(SIL,T)) and the surviving branches leading to ORANGE and ORANGES.]
`
`
`
`
[Sheet 7 of 7 — FIG. 7: block diagram of another speech recognition system. Speech enters microphone 62 and an analog/digital converter, then feature extraction; a CS/IS indicator 40 and keyboard feed a trainer; the tree search engine consults the silence-bracketed lexicon, phonetic speech unit model, and CSR lexicon and language model, and drives output device 76.]
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
SYSTEM FOR USING SILENCE IN SPEECH RECOGNITION

REFERENCE TO RELATED APPLICATION

The present application is a continuation-in-part of, and claims priority from, U.S. patent application Ser. No. 08/934,622, filed Sep. 19, 1997, entitled SPEECH RECOGNITION SYSTEM FOR RECOGNIZING CONTINUOUS AND ISOLATED SPEECH, now U.S. Pat. No. 6,076,056, issued to Huang et al.

BACKGROUND OF THE INVENTION

The present invention relates to computer speech recognition. More particularly, the present invention relates to computer speech recognition performed by conducting a prefix tree search of a silence-bracketed lexicon.

The most successful current speech recognition systems employ probabilistic models known as hidden Markov models (HMMs). A hidden Markov model includes a plurality of states, wherein a transition probability is defined for each transition from each state to every state, including transitions to the same state. An observation is probabilistically associated with each unique state. The transition probabilities between states (the probabilities that an observation will transition from one state to the next) are not all the same. Therefore, a search technique, such as a Viterbi algorithm, is employed in order to determine a most likely state sequence for which the overall probability is maximum, given the transition probabilities between states and the observation probabilities.

A sequence of state transitions can be represented, in a known manner, as a path through a trellis diagram that represents all of the states of the HMM over a sequence of observation times. Therefore, given an observation sequence, a most likely path through the trellis diagram (i.e., the most likely sequence of states represented by an HMM) can be determined using a Viterbi algorithm.

In current speech recognition systems, speech has been viewed as being generated by a hidden Markov process. Consequently, HMMs have been employed to model observed sequences of speech spectra, where specific spectra are probabilistically associated with a state in an HMM. In other words, for a given observed sequence of speech spectra, there is a most likely sequence of states in a corresponding HMM. This corresponding HMM is thus associated with the observed sequence. This technique can be extended, such that if each distinct sequence of states in the HMM is associated with a subword unit, such as a phoneme, then a most likely sequence of subword units can be found. Moreover, using models of how subword units are combined to form words, then using language models of how words are combined to form sentences, complete speech recognition can be achieved.

When actually processing an acoustic signal, the signal is typically sampled in sequential time intervals called frames. The frames typically include a plurality of samples and may overlap or be contiguous. Each frame is associated with a unique portion of the speech signal. The portion of the speech signal represented by each frame is analyzed to provide a corresponding acoustic vector. During speech recognition, a search is performed for the state sequence most likely to be associated with the sequence of acoustic vectors.

In order to find the most likely sequence of states corresponding to a sequence of acoustic vectors, the Viterbi algorithm is employed. The Viterbi algorithm performs a computation which starts at the first frame and proceeds one frame at a time, in a time-synchronous manner. A probability score is computed for each state in the state sequences (i.e., the HMMs) being considered. Therefore, a cumulative probability score is successively computed for each of the possible state sequences as the Viterbi algorithm analyzes the acoustic signal frame by frame. By the end of an utterance, the state sequence (or HMM or series of HMMs) having the highest probability score computed by the Viterbi algorithm provides the most likely state sequence for the entire utterance. The most likely state sequence is then converted into a corresponding spoken subword unit, word, or word sequence.

The Viterbi algorithm reduces an exponential computation to one that is proportional to the number of states and transitions in the model and the length of the utterance. However, for a large vocabulary, the number of states and transitions becomes large, and the computation required to update the probability score at each state in each frame for all possible state sequences takes many times longer than the duration of one frame, which is typically approximately 10 milliseconds.

Thus, a technique called pruning, or beam searching, has been developed to greatly reduce the computation needed to determine the most likely state sequence. This type of technique eliminates the need to compute the probability score for state sequences that are very unlikely. This is typically accomplished by comparing, at each frame, the probability score for each remaining state sequence (or potential sequence) under consideration with the largest score associated with that frame. If the probability score of a state for a particular potential sequence is sufficiently low (when compared to the maximum computed probability score for the other potential sequences at that point in time), the pruning algorithm assumes that it is unlikely that such a low-scoring state sequence will be part of the completed, most likely state sequence. The comparison is typically accomplished using a minimum threshold value. Potential state sequences having a score that falls below the minimum threshold value are removed from the searching process. The threshold value can be set at any desired level, based primarily on desired memory and computational savings, and the error rate increase those savings cause.

Another conventional technique for further reducing the magnitude of computation required for speech recognition involves the use of a prefix tree. A prefix tree represents the lexicon of the speech recognition system as a tree structure wherein all of the words likely to be encountered by the system are represented in the tree structure.

In such a prefix tree, each subword unit (such as a phoneme) is typically represented by a branch which is associated with a particular phonetic model (such as an HMM). The phoneme branches are connected, at nodes, to subsequent phoneme branches. All words in the lexicon which share the same first phoneme share the same first branch. All words which have the same first and second phonemes share the same first and second branches. By contrast, words which have a common first phoneme but different second phonemes share the same first branch in the prefix tree but have second branches which diverge at the first node in the prefix tree, and so on. The tree structure continues in such a fashion such that all words likely to be encountered by the system are represented by the end nodes of the tree (i.e., the leaves on the tree).
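The prefix-tree organization described above can be sketched in a few lines. This is a minimal illustration only; the `Node` and `add_word` names are assumptions, not anything the patent specifies, and real systems attach a phonetic model (an HMM) to each branch rather than a bare label.

```python
class Node:
    """A prefix-tree node; each outgoing branch is labeled with a phoneme."""
    def __init__(self):
        self.branches = {}   # phoneme label -> child Node
        self.word = None     # set at nodes where a word's pronunciation ends

def add_word(root, word, phonemes):
    """Insert a pronunciation, sharing branches with common prefixes."""
    node = root
    for ph in phonemes:
        node = node.branches.setdefault(ph, Node())
    node.word = word

root = Node()
add_word(root, "AT",   ["AE", "T"])
add_word(root, "TAX",  ["T", "AE", "K", "S"])
add_word(root, "TRY",  ["T", "R", "AY"])
add_word(root, "TRAY", ["T", "R", "EY"])
```

Here TRY and TRAY share the T and R branches and diverge only at the third phoneme, while AT and TAX diverge at the root, so the number of first-level branches is bounded by the phoneme inventory rather than the vocabulary size.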
`
`
`
It is apparent that, by employing a prefix tree structure, the number of initial branches will be far fewer than the typical number of words in the lexicon or vocabulary of the system. In fact, the number of initial branches cannot exceed the total number of phonemes (approximately 40-50), regardless of the size of the vocabulary or lexicon being searched. If allophonic variations are used, the initial number of branches could be larger, depending on the allophones used.

This type of structure lends itself to a number of significant advantages. For example, given the small number of initial branches in the tree, it is possible to consider the beginning of all words in the lexicon, even if the vocabulary is very large, by evaluating the probability of each of the possible first phonemes. Further, using pruning, a number of the lower-probability phoneme branches can be eliminated very early in the search. Therefore, while the second level of the tree has many more branches than the first level, the number of branches actually being considered (i.e., the number of hypotheses) is also reduced relative to the number of possible branches.
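The per-frame pruning comparison described in the Background (drop any hypothesis whose score falls a fixed beam below the frame's best score) can be sketched as follows; the log-probability scores and the beam width are illustrative assumptions.

```python
def prune(scores, beam_width):
    """Keep only hypotheses whose log-probability score is within
    beam_width of the best score at this frame; drop the rest."""
    best = max(scores.values())
    threshold = best - beam_width
    return {hyp: s for hyp, s in scores.items() if s >= threshold}

# Log-probability scores for four active phoneme hypotheses at one frame.
active = {"AE": -10.0, "T": -11.5, "AO": -25.0, "M": -30.0}
survivors = prune(active, beam_width=8.0)
# AE and T fall within the beam; AO and M are eliminated from the search.
```

Widening the beam trades computation and memory for a lower chance of discarding the eventual best path, which is exactly the threshold trade-off the text describes.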
Speech recognition systems employing the above techniques can typically be classified in two types. The first type is a continuous speech recognition (CSR) system which is capable of recognizing fluent speech. The second type of system is an isolated speech recognition (ISR) system which is typically employed to recognize only isolated speech (or discrete speech), but which is also typically more accurate and efficient than continuous speech recognition systems because the search space is generally smaller. Also, isolated speech recognition systems have been thought of as a special case of continuous speech recognition, because continuous speech recognition systems generally can accept isolated speech as well. They simply do not perform as well when attempting to recognize isolated speech.

Silence information plays a role in both systems. To date, both types of speech recognition systems have treated silence as a special word in the lexicon. The silence word participates in the normal search process so that it can be inserted between words as speech is recognized.

However, it is known that considering word transitions in a speech recognition system is a computationally intensive and costly process. Therefore, in an isolated speech recognition system in which silence is treated as a separate word, the transition from the silence word to all other words in the lexicon must be considered, as well as the transition from all words in the lexicon (or all remaining words at the end of the search) to the silence word.

Further, in continuous speech recognition systems, even if the system has identified that the speaker is speaking discretely, or in an isolated fashion, the CSR system still considers hypotheses which do not have silence between words. This leads to a tendency to improperly break one word into two or more words. Of course, this results in a higher error rate than would otherwise be expected. Moreover, it is computationally inefficient, since it still covers the part of the search space which belongs to continuous speech but not isolated speech.
In addition to employing the silence phone as a separate word in the lexicon, conventional modeling of the silence phone has also led to problems and errors in prior speech recognition systems. It is widely believed that silence is independent of context. Thus, silence has been modeled in conventional speech recognition systems without regard to context. In other words, the silence phone has been modeled the same way, regardless of the words or subword units that precede or follow it. This not only decreases the accuracy of the speech recognition system, but also renders it less efficient than it could be with modeling in accordance with the present invention.
`
SUMMARY OF THE INVENTION

A speech recognition system recognizes speech based on an input data stream indicative of the speech. Possible words represented by the input data stream are provided as a prefix tree including a plurality of phoneme branches connected at nodes. The plurality of phoneme branches is bracketed by at least one input silence branch corresponding to a silence phone on an input side of the prefix tree and at least one output silence branch corresponding to a silence phone on an output side of the prefix tree.

In one preferred embodiment, a plurality of silence branches are provided in the prefix tree. The plurality of silence branches represent context-dependent silence phones.

In another preferred embodiment of the present invention, the speech recognition system includes both a continuous speech recognition system lexicon and an isolated speech recognition system lexicon. The system switches between using the CSR lexicon and the ISR lexicon based upon the type of speech then being employed by the user of the system.
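As a rough sketch of the silence-bracketing idea summarized above, one could wrap each pronunciation in context-dependent silence phones before inserting it into the prefix tree. The helper below is hypothetical (the patent does not prescribe code); the SIL(left,right) notation simply follows the labeling used in the figures.

```python
def bracket_with_silence(phonemes):
    """Bracket a pronunciation with context-dependent silence phones:
    an input silence conditioned on the word's first phoneme and an
    output silence conditioned on its last phoneme."""
    input_sil = f"SIL(SIL,{phonemes[0]})"    # input side of the prefix tree
    output_sil = f"SIL({phonemes[-1]},SIL)"  # output side of the prefix tree
    return [input_sil] + phonemes + [output_sil]

bracketed = bracket_with_silence(["AE", "T"])   # the word AT
# bracketed is ["SIL(SIL,AE)", "AE", "T", "SIL(T,SIL)"]
```

Because silence then participates in the tree search as bracketing branches rather than as a separate lexicon word, word-to-silence and silence-to-word transitions no longer have to be hypothesized explicitly.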
`
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an exemplary environment for implementing a speech recognition system in accordance with the present invention.

FIG. 2 is a more detailed block diagram of a portion of the system shown in FIG. 1.

FIG. 3 is a diagram illustrating a prior art prefix tree.

FIG. 4 is a diagram illustrating one embodiment of a prefix tree in accordance with the present invention.

FIG. 5 is a diagram illustrating another embodiment of a prefix tree in accordance with the present invention.

FIG. 6 is a diagram illustrating the prefix tree shown in FIG. 5, employing a pruning technique in accordance with another aspect of the present invention.

FIG. 7 is a block diagram of another embodiment of a speech recognition system in accordance with another aspect of the present invention.
`
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 and the related discussion are intended to provide a brief, general description of a suitable computing environment in which the invention may be implemented. Although not required, the invention will be described, at least in part, in the general context of computer-executable instructions, such as program modules, being executed by a personal computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are
`
`
`
linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

With reference to FIG. 1, an exemplary system for implementing the invention includes a general purpose computing device in the form of a conventional personal computer 20, including a processing unit 21, a system memory 22, and a system bus 23 that couples various system components, including the system memory, to the processing unit 21. The system bus 23 may be any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory includes read only memory (ROM) 24 and a random access memory (RAM) 25. A basic input/output system 26 (BIOS), containing the basic routines that help to transfer information between elements within the personal computer 20, such as during start-up, is stored in ROM 24. The personal computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk (not shown), a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD ROM or other optical media. The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical drive interface 34, respectively. The drives and the associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the personal computer 20.

Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 29 and a removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROMs), and the like, may also be used in the exemplary operating environment.
A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into the personal computer 20 through input devices such as a keyboard 40, a pointing device 42 and a microphone 62. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus 23, but may be connected by other interfaces, such as a sound card, a parallel port, a game port or a universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor 47, personal computers may typically include other peripheral output devices, such as a speaker 45 and printers (not shown).
The personal computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49. The remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device or other network node, and typically includes any or all of the elements described above relative to the personal computer 20, although only a memory storage device 50 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 51 and a wide area network (WAN) 52. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
When used in a LAN networking environment, the personal computer 20 is connected to the local area network 51 through a network interface or adapter 53. When used in a WAN networking environment, the personal computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage devices. It will be appreciated that the network connections shown are exemplary, and other means of establishing a communications link between the computers may be used.
FIG. 2 illustrates a block diagram of a speech recognition system 60 in accordance with one aspect of the present invention. Speech recognition system 60 includes microphone 62, analog-to-digital (A/D) converter 64, training module 65, feature extraction module 66, silence detection module 68, lexicon storage module 70, phonetic speech unit storage module 72, tree search engine 74, and output device 76. In addition, a language model storage module 75 can also be provided and accessed by search engine 74. It should be noted that the entire system 60, or part of system 60, can be implemented in the environment illustrated in FIG. 1. For example, microphone 62 may preferably be provided as an input device to personal computer 20, through an appropriate interface and through A/D converter 64. Training module 65, feature extraction module 66 and silence detection module 68 may be either hardware modules in computer 20, or software modules stored in any of the information storage devices disclosed in FIG. 1 and accessible by CPU 21 or another suitable processor. In addition, lexicon storage module 70 and phonetic speech unit storage module 72 are also preferably stored in any suitable memory devices shown in FIG. 1. Further, tree search engine 74 is preferably implemented in CPU 21 (which may include one or more processors) or may be performed by a dedicated speech recognition processor employed by personal computer 20. In addition, output device 76 may, in one preferred embodiment, be implemented as monitor 47, or as a printer, or any other suitable output device.
In any case, during speech recognition, speech is input into system 60 in the form of an audible voice signal provided by the user to microphone 62. Microphone 62 converts the audible speech signal into an analog electronic signal which is provided to A/D converter 64. A/D converter 64 converts the analog speech signal into a sequence of digital signals which is provided to feature extraction module 66. In a preferred embodiment, feature extraction module 66 is a conventional array processor which performs spectral analysis on the digital signals and computes a magnitude value for each frequency band of a frequency spectrum. The signals are, in one preferred embodiment, provided to feature extraction module 66 by A/D converter 64 at a sample rate of approximately 16 kHz; A/D converter 64 may be implemented as a commercially available, well-known A/D converter.
Feature extraction module 66 divides the digital signal received from A/D converter 64 into frames which include a plurality of digital samples. Each frame is approximately 10 milliseconds in duration. The frames are then preferably encoded by feature extraction module 66 into a feature vector reflecting the spectral characteristics for a plurality of frequency bands. In the case of discrete and semi-continuous hidden Markov modeling, feature extraction module 66 also preferably encodes the feature vectors into one or more codewords using vector quantization techniques and a codebook derived from training data. Thus, feature extraction module 66 provides, at its output, the feature vectors (or codewords) for each spoken utterance. The feature extraction module 66 preferably provides the feature vectors (or codewords) at a rate of one codeword approximately every 10 milliseconds.
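The framing arithmetic above (a roughly 16 kHz sample stream cut into frames of approximately 10 milliseconds) can be sketched as follows. Contiguous, non-overlapping frames are an assumption made for simplicity, since the text allows frames to overlap.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=10):
    """Split a digital signal into contiguous frames of frame_ms each."""
    frame_len = sample_rate * frame_ms // 1000   # 160 samples per 10 ms frame
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

one_second = [0.0] * 16000      # one second of (silent) 16 kHz audio
frames = frame_signal(one_second)
# One second of audio at 16 kHz yields 100 contiguous 10 ms frames of
# 160 samples each, matching the one-codeword-per-10-ms output rate.
```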
Output probability distributions are then preferably computed against hidden Markov models using the feature vector (or codewords) of the particular frame being analyzed. These probability distributions are later used in performing a Viterbi or similar type of technique.
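The time-synchronous Viterbi computation referred to here can be sketched as follows. The toy two-state model (one silence state and one vowel state, with dictionary-based transition and emission log scores) is an assumption purely for illustration.

```python
import math

def viterbi(observations, states, log_init, log_trans, log_emit):
    """Time-synchronous Viterbi: update a cumulative log score for every
    state at each frame, keeping a back-pointer to the best predecessor."""
    scores = {s: log_init[s] + log_emit[s][observations[0]] for s in states}
    back = []
    for obs in observations[1:]:
        prev, new_scores, ptr = scores, {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev[p] + log_trans[p][s])
            new_scores[s] = prev[best_prev] + log_trans[best_prev][s] + log_emit[s][obs]
            ptr[s] = best_prev
        scores = new_scores
        back.append(ptr)
    # Trace back from the best final state to recover the state sequence.
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

# Toy two-state model: a silence state and a vowel state (log probabilities).
ln = math.log
states = ["SIL", "AE"]
log_init = {"SIL": ln(0.9), "AE": ln(0.1)}
log_trans = {"SIL": {"SIL": ln(0.6), "AE": ln(0.4)},
             "AE":  {"SIL": ln(0.3), "AE": ln(0.7)}}
log_emit = {"SIL": {"quiet": ln(0.8), "loud": ln(0.2)},
            "AE":  {"quiet": ln(0.1), "loud": ln(0.9)}}

path = viterbi(["quiet", "loud", "loud"], states, log_init, log_trans, log_emit)
# path is ["SIL", "AE", "AE"]: silence for the quiet frame, then the vowel.
```

The per-frame inner loop is where the score of every active state is updated, and it is exactly this loop that the beam pruning described in the Background restricts to high-scoring hypotheses.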
As feature extraction module 66 is processing the digital samples from A/D converter 64, silence detection module 68 is also processing the samples. Silence detection module 68 can be implemented either on the same processor as that used to implement feature extraction module 66, or on a different one. Silence detection module 68 operates in a well known manner. Briefly, silence detection module 68 processes the digital samples provided by A/D converter 64 so as to detect silence, in order to determine boundaries between words being uttered by the user. Silence detection module 68 then provides a boundary detection signal to tree search engine 74 indicative of the detection of a word boundary.
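The text only says silence detection operates "in a well known manner." One common approach, offered here purely as an assumed illustration (the threshold and run-length values are arbitrary), is to flag a word boundary whenever enough consecutive frames have low short-term energy.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)

def word_boundaries(frames, threshold=0.01, min_silence=3):
    """Indices at which a run of at least min_silence low-energy frames
    begins; each such run is treated as a word boundary."""
    boundaries, run_start, run_len = [], None, 0
    for i, frame in enumerate(frames):
        if frame_energy(frame) < threshold:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len == min_silence:
                boundaries.append(run_start)
        else:
            run_len = 0
    return boundaries

# Four loud frames, three quiet frames, four loud frames.
speech = [[0.5, -0.5]] * 4 + [[0.0, 0.0]] * 3 + [[0.4, -0.4]] * 4
boundaries = word_boundaries(speech)
# The run of quiet frames starting at index 4 marks one word boundary.
```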
Upon receiving the codewords from feature extraction module 66, and the boundary detection signal provided by silence detection module 68, tree search engine 74 accesses information stored in the phonetic speech unit model memory 72. Memory 72 stores phonetic speech unit models, such as hidden Markov models, which represent speech units to be detected by system 60. In one preferred embodiment, the phonetic models stored in memory 72 include HMMs which represent phonemes. Based upon the HMMs stored in memory 72, tree search engine 74 determines a most likely phoneme represented by the codewords received from feature extraction module 66, and hence representative of the utterance received from the user of the system. It should also be noted that the proper phoneme can be chosen in any number of ways, including by examining the particular senones calculated for each state of the HMMs for each phoneme. Also, a phonetic HMM tree search can be performed in order to find the proper phoneme.
Tree search engine 74 also accesses the lexicon stored in memory 70. The information received by tree search engine 74, based on its accessing of the phonetic speech unit models in memory 72, is used in searching lexicon 70 to determine a word which most likely represents the codewords received from feature extraction module 66 between word boundaries as indicated by silence detection module 68. Also, search engine 74 preferably accesses a language model in module 75, such as a 60,000-word trigram language model derived from the North American Business News corpus and set out in greater detail in a publication entitled CSR-III Text Language Model, University of Penn., 1994. The language model is used in identifying the most likely word or word sequence represented by the input data. The word or word sequence determined is thus most likely representative of the utterance received from the user. The word or word sequence is then output by tree search engine 74 to output device 76.
In a preferred embodiment, lexicon 70 contains information which is representative of all of the words in the vocabulary of speech recognition system 60. The words are preferably presented to tree search engine 74 in the form of a prefix tree which can be traversed from a root to a leaf (or to an internal word node) to arrive at the word most likely indicative of the utterance of the user.
FIG. 3 illustrates a prefix tree used in accordance with prior art speech recognition systems. For the sake of clarity, only a portion of the prefix tree is illustrated by FIG. 3. A root node (or input node) 78 is encountered at a first word boundary. A plurality of branches 80 lead from root node 78 to the remainder of the prefix tree. Each of the plurality of branches is associated with a phoneme. In FIG. 3, the branches leaving root node 78 represent only the phonemes represented by the letters AO, AE, and T. The tree extends through other nodes and branches and terminates in an output node 79.

In accordance with one searching technique, as tree 77 is traversed from the input node 78 to the output node 79, a score is assigned to each node connected to a phoneme branch then under consideration by the speech recognition system. The score is indicative of a likelihood that the par