(12) United States Patent
Jiang

(10) Patent No.: US 6,374,219 B1
(45) Date of Patent: Apr. 16, 2002

(54) SYSTEM FOR USING SILENCE IN SPEECH RECOGNITION

(75) Inventor: Li Jiang, Redmond, WA (US)

(73) Assignee: Microsoft Corporation, Redmond, WA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/026,841

(22) Filed: Feb. 20, 1998

Related U.S. Application Data

(63) Continuation-in-part of application No. 08/934,622, filed on Sep. 19, 1997, now Pat. No. 6,076,056.

(51) Int. Cl. .......................... G10L 15/04
(52) U.S. Cl. .......................... 704/255; 704/251
(58) Field of Search .................. 704/251, 252, 253, 255, 256, 257, 242

(56) References Cited

U.S. PATENT DOCUMENTS

4,336,421 A     6/1982  Welch et al. .............. 179/1 SD
4,852,173 A     7/1989  Bahl et al. ............... 381/43
4,977,599 A  * 12/1990  Bahl et al. ............... 704/256
5,159,637 A  * 10/1992  Okazaki et al. ............ 381/43
5,202,952 A     4/1993  Gillick et al. ............ 704/254 X
5,794,197 A     8/1998  Alleva et al. ............. 704/255
5,799,278 A     8/1998  Cobbett et al. ............ 704/256
5,812,974 A  *  9/1998  Hemphill et al. ........... 704/256
5,848,388 A  * 12/1998  Power et al. .............. 704/254 X
6,076,056 A  *  6/2000  Huang et al. .............. 704/254

FOREIGN PATENT DOCUMENTS

EP   0 762 358 A1   3/1997

OTHER PUBLICATIONS

"Automatic Speech Segment Boundary Detection Using Markov Models," IBM Technical Disclosure Bulletin, vol. 33, No. 7, pp. 323-324, Dec. 1990.*
Jelinek et al., "Design of a Linguistic Statistical Decoder for the Recognition of Continuous Speech," IEEE Transactions on Information Theory, vol. IT-21, No. 3, pp. 250-256, May 1975.*
Averbuch et al., "An IBM PC Based Large-Vocabulary Isolated-Utterance Speech Recognizer," ICASSP 86, Tokyo, pp. 53-56, 1986.*
"Experiments with a spoken dialogue system for taking the US census," by R. A. Cole et al., Speech Communication, vol. 23, No. 2, Nov. 1, 1997, pp. 243-260.
"Improvements in Beam Search for 10000-Word Continuous Speech Recognition," by H. Ney, R. Haeb-Umbach, B.-H. Tran, M. Oerder, 1992 IEEE, pp. I-9 to I-12.
"Improvements on the Pronunciation Prefix Tree Search Organization," by Fil Alleva, Xuedong Huang and Mei-Yuh Hwang, 1996 IEEE, pp. 133-136.

* cited by examiner

Primary Examiner - William Korzuch
Assistant Examiner - Abul K. Azad
(74) Attorney, Agent, or Firm - Joseph R. Kelly; Westman, Champlin & Kelly, P.A.

(57) ABSTRACT

A system for recognizing speech based on an input data stream indicative of the speech provides possible words represented by the input data stream as a prefix tree including a plurality of phoneme branches connected at nodes. The plurality of phoneme branches is bracketed by at least one input silence branch corresponding to a silence phone on an input side of the prefix tree and at least one output silence branch corresponding to a silence phone on an output side of the prefix tree. The prefix tree is traversed to obtain a word that is likely represented by the input data stream. The silence phones provided in the prefix tree can vary based on context.

23 Claims, 7 Drawing Sheets

[Cover figure: block diagram of the speech recognition system of FIG. 7: speech, microphone 62, analog/digital converter 46, feature extraction, tree search engine, silence bracketed lexicon 72, phonetic speech unit model, CSR lexicon and language model, output device 76.]

IPR2023-00037
Apple EX1004 Page 1
[Sheet 1 of 7. FIG. 1: block diagram of an exemplary computing environment (conventional personal computer 20) for implementing the invention; legible labels include keyboard, hard disk drive, optical drive, monitor, video adapter, network adapter, RAM/ROM, BIOS, microphone, speaker, serial port, storage, operating system, application program, program modules, and program data.]
[Sheet 2 of 7. FIG. 2: more detailed block diagram of the speech recognition system of FIG. 1; legible (mirrored) labels include lexicon, tree search, trainer, phonetic speech unit model, memory, silence detection, and speech.]
[Sheet 3 of 7. FIG. 3 (PRIOR ART): diagram of a prior art prefix tree; legible word label: ORANGES.]
[Sheet 4 of 7. FIG. 4: one embodiment of a prefix tree in accordance with the invention; phoneme branches (AO, AE, T, M, B, R, EY, AY, N, JH, S, D, X) lead through nodes (reference numerals 78-100) to words such as ORANGE, TRY, TRAY, and TAX, with output-side silence branches SIL(T,SIL), SIL(M,SIL), SIL(JH,SIL), SIL(EY,SIL), SIL(S,SIL), and SIL(D,SIL) at nodes 94, 96, 98.]
[Sheet 5 of 7. FIG. 5: another embodiment of the prefix tree, adding input-side silence branches SIL(SIL,AO), SIL(SIL,AE), and SIL(SIL,T) at node 102 in addition to the output-side silence branches of FIG. 4; legible word labels include ORANGE, ORANGES, TRY, TRAY, TAX, and TRIED.]
[Sheet 6 of 7. FIG. 6: the prefix tree of FIG. 5 after pruning, retaining input silence branches SIL(SIL,AO) 104, SIL(SIL,AE) 106, and SIL(SIL,T) 108 at node 102, and output silence branches SIL(T,SIL), SIL(M,SIL), SIL(JH,SIL), and SIL(S,SIL), leading to words ORANGE and ORANGES at node 94.]
[Sheet 7 of 7. FIG. 7: block diagram of another embodiment of the speech recognition system: speech, microphone 62, analog/digital converter, feature extraction, CS/IS indicator 40, keyboard, trainer, tree search engine 74, output device 76, silence bracketed lexicon, phonetic speech unit model, CSR lexicon and language model.]
SYSTEM FOR USING SILENCE IN SPEECH RECOGNITION

REFERENCE TO RELATED APPLICATION

The present application is a continuation-in-part of, and claims priority from, U.S. patent application Ser. No. 08/934,622, filed Sep. 19, 1997, entitled SPEECH RECOGNITION SYSTEM FOR RECOGNIZING CONTINUOUS AND ISOLATED SPEECH, now U.S. Pat. No. 6,076,056, issued to Huang et al.

BACKGROUND OF THE INVENTION

The present invention relates to computer speech recognition. More particularly, the present invention relates to computer speech recognition performed by conducting a prefix tree search of a silence bracketed lexicon.

The most successful current speech recognition systems employ probabilistic models known as hidden Markov models (HMMs). A hidden Markov model includes a plurality of states, wherein a transition probability is defined for each transition from each state to every state, including transitions to the same state. An observation is probabilistically associated with each unique state. The transition probabilities between states (the probabilities that an observation will transition from one state to the next) are not all the same. Therefore, a search technique, such as a Viterbi algorithm, is employed in order to determine the most likely state sequence, for which the overall probability is maximum, given the transition probabilities between states and the observation probabilities.

A sequence of state transitions can be represented, in a known manner, as a path through a trellis diagram that represents all of the states of the HMM over a sequence of observation times. Therefore, given an observation sequence, the most likely path through the trellis diagram (i.e., the most likely sequence of states represented by an HMM) can be determined using a Viterbi algorithm.

In current speech recognition systems, speech has been viewed as being generated by a hidden Markov process. Consequently, HMMs have been employed to model observed sequences of speech spectra, where specific spectra are probabilistically associated with a state in an HMM. In other words, for a given observed sequence of speech spectra, there is a most likely sequence of states in a corresponding HMM.

This corresponding HMM is thus associated with the observed sequence. This technique can be extended, such that if each distinct sequence of states in the HMM is associated with a subword unit, such as a phoneme, then a most likely sequence of subword units can be found. Moreover, using models of how subword units are combined to form words, then using language models of how words are combined to form sentences, complete speech recognition can be achieved.

When actually processing an acoustic signal, the signal is typically sampled in sequential time intervals called frames. The frames typically include a plurality of samples and may overlap or be contiguous. Each frame is associated with a unique portion of the speech signal. The portion of the speech signal represented by each frame is analyzed to provide a corresponding acoustic vector. During speech recognition, a search is performed for the state sequence most likely to be associated with the sequence of acoustic vectors.

In order to find the most likely sequence of states corresponding to a sequence of acoustic vectors, the Viterbi algorithm is employed. The Viterbi algorithm performs a computation which starts at the first frame and proceeds one frame at a time, in a time-synchronous manner. A probability score is computed for each state in the state sequences (i.e., the HMMs) being considered. Therefore, a cumulative probability score is successively computed for each of the possible state sequences as the Viterbi algorithm analyzes the acoustic signal frame by frame. By the end of an utterance, the state sequence (or HMM or series of HMMs) having the highest probability score computed by the Viterbi algorithm provides the most likely state sequence for the entire utterance. The most likely state sequence is then converted into a corresponding spoken subword unit, word, or word sequence.

The Viterbi algorithm reduces an exponential computation to one that is proportional to the number of states and transitions in the model and the length of the utterance. However, for a large vocabulary, the number of states and transitions becomes large, and the computation required to update the probability score at each state in each frame, for all possible state sequences, takes many times longer than the duration of one frame, which is typically approximately 10 milliseconds.

Thus, a technique called pruning, or beam searching, has been developed to greatly reduce the computation needed to determine the most likely state sequence. This type of technique eliminates the need to compute the probability score for state sequences that are very unlikely. This is typically accomplished by comparing, at each frame, the probability score for each remaining state sequence (or potential sequence) under consideration with the largest score associated with that frame. If the probability score of a state for a particular potential sequence is sufficiently low (when compared to the maximum computed probability score for the other potential sequences at that point in time), the pruning algorithm assumes that it is unlikely that such a low-scoring state sequence will be part of the completed, most likely state sequence. The comparison is typically accomplished using a minimum threshold value. Potential state sequences having a score that falls below the minimum threshold value are removed from the searching process. The threshold value can be set at any desired level, based primarily on the desired memory and computational savings, and the error rate increase caused by those savings.

Another conventional technique for further reducing the magnitude of computation required for speech recognition is the use of a prefix tree. A prefix tree represents the lexicon of the speech recognition system as a tree structure wherein all of the words likely to be encountered by the system are represented in the tree structure.

In such a prefix tree, each subword unit (such as a phoneme) is typically represented by a branch which is associated with a particular phonetic model (such as an HMM). The phoneme branches are connected, at nodes, to subsequent phoneme branches. All words in the lexicon which share the same first phoneme share the same first branch. All words which have the same first and second phonemes share the same first and second branches. By contrast, words which have a common first phoneme, but which have different second phonemes, share the same first branch in the prefix tree but have second branches which diverge at the first node in the prefix tree, and so on. The tree structure continues in such a fashion such that all words likely to be encountered by the system are represented by the end nodes of the tree (i.e., the leaves of the tree).
It is apparent that, by employing a prefix tree structure, the number of initial branches will be far fewer than the typical number of words in the lexicon or vocabulary of the system. In fact, the number of initial branches cannot exceed the total number of phonemes (approximately 40-50), regardless of the size of the vocabulary or lexicon being searched (although, if allophonic variations are used, the initial number of branches could be larger, depending on the allophones used).

This type of structure lends itself to a number of significant advantages. For example, given the small number of initial branches in the tree, it is possible to consider the beginnings of all words in the lexicon, even if the vocabulary is very large, by evaluating the probability of each of the possible first phonemes. Further, using pruning, a number of the lower probability phoneme branches can be eliminated very early in the search. Therefore, while the second level of the tree has many more branches than the first level, the number of branches actually being considered (i.e., the number of hypotheses) is also reduced relative to the number of possible branches.
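For illustration only, the phoneme sharing and the beam pruning discussed above can be sketched as follows. The lexicon, phoneme strings, scores, and beam width below are invented for the example and are not taken from this patent; words sharing leading phonemes share branches, and hypotheses scoring far below the frame's best are dropped.

```python
# Illustrative sketch: a toy phoneme prefix tree plus beam pruning.
import math

def build_prefix_tree(lexicon):
    """Build a prefix tree (trie) over phoneme sequences.

    Words sharing leading phonemes share the corresponding branches;
    each word is stored at the node reached by its full phoneme
    sequence (a leaf or an internal word node).
    """
    root = {}
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node.setdefault(ph, {})
        node["#word"] = word  # word marker at this node
    return root

def beam_prune(hypotheses, beam_width):
    """Keep hypotheses whose log score is within beam_width of the best."""
    best = max(score for _, score in hypotheses)
    return [(h, s) for h, s in hypotheses if s >= best - beam_width]

# Invented pronunciations, loosely echoing the words in FIGS. 3-5.
lexicon = {
    "ORANGE":  ["AO", "R", "AX", "N", "JH"],
    "ORANGES": ["AO", "R", "AX", "N", "JH", "AX", "S"],
    "TRY":     ["T", "R", "AY"],
    "TRAY":    ["T", "R", "EY"],
    "TAX":     ["T", "AE", "K", "S"],
}

tree = build_prefix_tree(lexicon)
# Five words, but only two first-level branches (AO and T): far fewer
# initial branches than words, as noted above.
print(sorted(tree.keys()))  # ['AO', 'T']

# Toy pruning: (label, log-probability) hypotheses; AE falls outside the beam.
hyps = [("AO", math.log(0.5)), ("T", math.log(0.45)), ("AE", math.log(0.001))]
print([h for h, _ in beam_prune(hyps, beam_width=3.0)])  # ['AO', 'T']
```

Note that ORANGE ends at an internal word node of ORANGES, matching the "root to a leaf (or to an internal word node)" traversal described later in the specification.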
Speech recognition systems employing the above techniques can typically be classified into two types. The first type is a continuous speech recognition (CSR) system, which is capable of recognizing fluent speech. The second type of system is an isolated speech recognition (ISR) system, which is typically employed to recognize only isolated (or discrete) speech, but which is also typically more accurate and efficient than continuous speech recognition systems because the search space is generally smaller. Also, isolated speech recognition systems have been thought of as a special case of continuous speech recognition, because continuous speech recognition systems generally can accept isolated speech as well; they simply do not perform as well when attempting to recognize isolated speech.

Silence information plays a role in both systems. To date, both types of speech recognition systems have treated silence as a special word in the lexicon. The silence word participates in the normal search process so that it can be inserted between words as they are recognized.

However, it is known that considering word transitions in a speech recognition system is a computationally intensive and costly process. Therefore, in an isolated speech recognition system in which silence is treated as a separate word, the transition from the silence word to all other words in the lexicon must be considered, as well as the transition from all words in the lexicon (or all remaining words at the end of the search) to the silence word.

Further, in continuous speech recognition systems, even if the system has identified that the speaker is speaking discretely, or in an isolated fashion, the CSR system still considers hypotheses which do not have silence between words. This leads to a tendency to improperly break one word into two or more words. Of course, this results in a higher error rate than would otherwise be expected. Moreover, it is computationally inefficient, since it still covers the part of the search space which belongs to continuous speech but not to isolated speech.
In addition to employing the silence phone as a separate word in the lexicon, conventional modeling of the silence phone has also led to problems and errors in prior speech recognition systems. It has been widely believed that silence is independent of context. Thus, silence has been modeled in conventional speech recognition systems regardless of context. In other words, the silence phone has been modeled the same, regardless of the words or subword units that precede or follow it. This not only decreases the accuracy of the speech recognition system, but also renders it less efficient than it could be with modeling in accordance with the present invention.
SUMMARY OF THE INVENTION

A speech recognition system recognizes speech based on an input data stream indicative of the speech. Possible words represented by the input data stream are provided as a prefix tree including a plurality of phoneme branches connected at nodes. The plurality of phoneme branches is bracketed by at least one input silence branch corresponding to a silence phone on an input side of the prefix tree and at least one output silence branch corresponding to a silence phone on an output side of the prefix tree.

In one preferred embodiment, a plurality of silence branches is provided in the prefix tree. The plurality of silence branches represents context dependent silence phones.

In another preferred embodiment of the present invention, the speech recognition system includes both a continuous speech recognition system lexicon and an isolated speech recognition system lexicon. The system switches between using the CSR lexicon and the ISR lexicon based upon the type of speech then being employed by the user of the system.
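As an informal illustration of the bracketing idea in the summary, context dependent silence branches can be derived mechanically from the lexicon: one input silence per distinct word-initial phoneme and one output silence per distinct word-final phoneme. The SIL(left, right) naming mirrors the labels in FIGS. 4-6; the lexicon itself is invented for the example, and this is only a sketch, not the patented decoder.

```python
# Hypothetical sketch of silence-bracketing a lexicon with context
# dependent silence phones, in the spirit of the summary above.

def silence_brackets(lexicon):
    """Return context dependent input and output silence branches.

    Input branches model silence followed by a word-initial phoneme,
    SIL(SIL, p); output branches model a word-final phoneme followed
    by silence, SIL(p, SIL).
    """
    input_sil = {("SIL", phones[0]) for phones in lexicon.values()}
    output_sil = {(phones[-1], "SIL") for phones in lexicon.values()}
    return input_sil, output_sil

# Invented pronunciations for illustration.
lexicon = {
    "ORANGE":  ["AO", "R", "AX", "N", "JH"],
    "ORANGES": ["AO", "R", "AX", "N", "JH", "AX", "S"],
    "TRY":     ["T", "R", "AY"],
    "TAX":     ["T", "AE", "K", "S"],
}

inp, out = silence_brackets(lexicon)
print(sorted(inp))  # one input silence branch per distinct first phoneme
print(sorted(out))  # one output silence branch per distinct last phoneme
```

With this toy lexicon the input side needs only SIL(SIL,AO) and SIL(SIL,T), echoing how few first-level branches the prefix tree has, while the output side needs SIL(JH,SIL), SIL(S,SIL), and SIL(AY,SIL).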
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an exemplary environment for implementing a speech recognition system in accordance with the present invention.

FIG. 2 is a more detailed block diagram of a portion of the system shown in FIG. 1.

FIG. 3 is a diagram illustrating a prior art prefix tree.

FIG. 4 is a diagram illustrating one embodiment of a prefix tree in accordance with the present invention.

FIG. 5 is a diagram illustrating another embodiment of a prefix tree in accordance with the present invention.

FIG. 6 is a diagram illustrating the prefix tree shown in FIG. 5, employing a pruning technique in accordance with another aspect of the present invention.

FIG. 7 is a block diagram of another embodiment of a speech recognition system in accordance with another aspect of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 and the related discussion are intended to provide a brief, general description of a suitable computing environment in which the invention may be implemented. Although not required, the invention will be described, at least in part, in the general context of computer-executable instructions, such as program modules, being executed by a personal computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are
linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

With reference to FIG. 1, an exemplary system for implementing the invention includes a general purpose computing device in the form of a conventional personal computer 20, including a processing unit 21, a system memory 22, and a system bus 23 that couples various system components, including the system memory, to the processing unit 21. The system bus 23 may be any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory includes read only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system 26 (BIOS), containing the basic routines that help to transfer information between elements within the personal computer 20, such as during start-up, is stored in ROM 24. The personal computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk (not shown), a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD ROM or other optical media. The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical drive interface 34, respectively. The drives and the associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the personal computer 20.

Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 29 and a removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROMs), and the like, may also be used in the exemplary operating environment.

A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into the personal computer 20 through input devices such as a keyboard 40, a pointing device 42 and a microphone 62. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus 23, but may be connected by other interfaces, such as a sound card, a parallel port, a game port or a universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor 47, personal computers may typically include other peripheral output devices, such as a speaker 45 and printers (not shown).

The personal computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49. The remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device or other network node, and typically includes any or all of the elements described above relative to the personal computer 20, although only a memory storage device 50 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area
network (LAN) 51 and a wide area network (WAN) 52. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the personal computer 20 is connected to the local area network 51 through a network interface or adapter 53. When used in a WAN networking environment, the personal computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage devices. It will be appreciated that the network connections shown are exemplary, and other means of establishing a communications link between the computers may be used.
FIG. 2 illustrates a block diagram of a speech recognition system 60 in accordance with one aspect of the present invention. Speech recognition system 60 includes microphone 62, analog-to-digital (A/D) converter 64, training module 65, feature extraction module 66, silence detection module 68, lexicon storage module 70, phonetic speech unit storage module 72, tree search engine 74, and output device 76. In addition, a language model storage module 75 can also be provided and accessed by search engine 74. It should be noted that the entire system 60, or part of system 60, can be implemented in the environment illustrated in FIG. 1. For example, microphone 62 may preferably be provided as an input device to personal computer 20, through an appropriate interface and through A/D converter 64. Training module 65, feature extraction module 66 and silence detection module 68 may be either hardware modules in computer 20, or software modules stored in any of the information storage devices disclosed in FIG. 1 and accessible by CPU 21 or another suitable processor. In addition, lexicon storage module 70 and phonetic speech unit storage module 72 are also preferably stored in any suitable memory device shown in FIG. 1. Further, tree search engine 74 is preferably implemented in CPU 21 (which may include one or more processors) or may be performed by a dedicated speech recognition processor employed by personal computer 20. In addition, output device 76 may, in one preferred embodiment, be implemented as monitor 47, or as a printer, or any other suitable output device.
In any case, during speech recognition, speech is input into system 60 in the form of an audible voice signal provided by the user to microphone 62. Microphone 62 converts the audible speech signal into an analog electronic signal which is provided to A/D converter 64. A/D converter 64 converts the analog speech signal into a sequence of digital signals which is provided to feature extraction module 66. In a preferred embodiment, feature extraction module 66 is a conventional array processor which performs spectral analysis on the digital signals and computes a magnitude value for each frequency band of a frequency spectrum. The signals are, in one preferred embodiment, provided to feature extraction module 66 by A/D converter 64 at a sample rate of approximately 16 kHz, with A/D converter 64 implemented as a commercially available, well known A/D converter.

Feature extraction module 66 divides the digital signal received from A/D converter 64 into frames which include a plurality of digital samples. Each frame is approximately 10 milliseconds in duration. The frames are then preferably encoded by feature extraction module 66 into a feature vector reflecting the spectral characteristics for a plurality of
frequency bands. In the case of discrete and semi-continuous hidden Markov modeling, feature extraction module 66 also preferably encodes the feature vectors into one or more codewords using vector quantization techniques and a codebook derived from training data. Thus, feature extraction module 66 provides, at its output, the feature vectors (or codewords) for each spoken utterance. The feature extraction module 66 preferably provides the feature vectors (or codewords) at a rate of one codeword approximately every 10 milliseconds.
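A minimal sketch of the framing and vector quantization steps just described follows. The 16 kHz rate and 10 ms frames come from the text above; the synthetic signal, the tiny 2-D codebook, and the use of non-overlapping frames are invented simplifications for illustration (real frames may overlap, and real feature vectors are spectral, not raw samples).

```python
# Illustrative framing (16 kHz, 10 ms frames) and nearest-codeword
# vector quantization; signal and codebook are invented toy data.

SAMPLE_RATE_HZ = 16_000
FRAME_MS = 10
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 160 samples per frame

def frames(samples, frame_len=FRAME_LEN):
    """Split samples into contiguous, non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def nearest_codeword(vector, codebook):
    """Vector quantization: index of the codebook entry closest to vector."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: dist2(vector, codebook[k]))

samples = [0.0] * 16_000             # one second of (silent) audio
print(len(frames(samples)))          # 100 frames, i.e. one every 10 ms

codebook = [(0.0, 0.0), (1.0, 1.0), (3.0, -1.0)]  # invented 2-D codebook
print(nearest_codeword((0.9, 1.2), codebook))     # quantizes to entry 1
```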
Output probability distributions are then preferably computed against hidden Markov models using the feature vector (or codewords) of the particular frame being analyzed. These probability distributions are later used in performing a Viterbi or similar type of technique.
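For the reader's orientation, the time-synchronous Viterbi recursion referred to here can be sketched as follows. The two-state model, transition probabilities, and per-frame observation scores are invented toy values, not anything from this patent; the recursion itself is the standard dynamic program: extend each state's best cumulative log score one frame at a time, then trace back the winning path.

```python
# Illustrative Viterbi recursion over per-frame log probabilities.
import math

def viterbi(log_trans, log_obs, log_init):
    """Return (best log score, best state sequence).

    log_trans[i][j]: log P(state j at t+1 | state i at t)
    log_obs[t][j]:   log P(observation at frame t | state j)
    log_init[j]:     log P(state j at t=0)
    """
    n_states = len(log_init)
    # scores[j]: best cumulative log score of any path ending in state j
    scores = [log_init[j] + log_obs[0][j] for j in range(n_states)]
    back = []
    for t in range(1, len(log_obs)):
        prev, scores = scores, []
        back.append([])
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: prev[i] + log_trans[i][j])
            back[-1].append(best_i)
            scores.append(prev[best_i] + log_trans[best_i][j] + log_obs[t][j])
    # Trace the highest scoring path backwards.
    j = max(range(n_states), key=lambda k: scores[k])
    path = [j]
    for ptrs in reversed(back):
        j = ptrs[j]
        path.append(j)
    return max(scores), path[::-1]

ln = math.log
trans = [[ln(0.7), ln(0.3)], [ln(0.1), ln(0.9)]]          # invented
obs = [[ln(0.9), ln(0.1)], [ln(0.8), ln(0.2)], [ln(0.2), ln(0.8)]]
init = [ln(0.5), ln(0.5)]
score, path = viterbi(trans, obs, init)
print(path)  # [0, 0, 1]: the model switches to state 1 on the last frame
```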
As feature extraction module 66 is processing the digital samples from A/D converter 64, silence detection module 68 is also processing the samples. Silence detection module 68 can be implemented either on the same processor as that used to implement feature extraction module 66, or on a different one. Silence detection module 68 operates in a well known manner. Briefly, silence detection module 68 processes the digital samples provided by A/D converter 64 so as to detect silence, in order to determine boundaries between words being uttered by the user. Silence detection module 68 then provides a boundary detection signal to tree search engine 74 indicative of the detection of a word boundary.
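The patent leaves the silence detector as conventional; as one assumed illustration (not the patent's method), a short-term energy detector can flag frames whose energy falls below a threshold, with runs of silent frames marking word boundaries. The frame size, threshold, and synthetic signal below are invented.

```python
# Hypothetical energy-based silence flagging, one boolean per frame.

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def silence_flags(samples, frame_len=160, threshold=0.01):
    """Return one boolean per frame: True where the frame looks silent."""
    return [frame_energy(samples[i:i + frame_len]) < threshold
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

# Synthetic signal: 2 "loud" frames, 3 silent frames, 2 "loud" frames.
signal = [0.5] * 320 + [0.0] * 480 + [0.5] * 320
flags = silence_flags(signal)
print(flags)  # [False, False, True, True, True, False, False]
```

A run of True flags between runs of False flags would correspond to the inter-word silence whose boundary is reported to tree search engine 74.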
Upon receiving the codewords from feature extraction module 66, and the boundary detection signal provided by silence detection module 68, tree search engine 74 accesses information stored in the phonetic speech unit model memory 72. Memory 72 stores phonetic speech unit models, such as hidden Markov models, which represent the speech units to be detected by system 60. In one preferred embodiment, the phonetic models stored in memory 72 include HMMs which represent phonemes. Based upon the HMMs stored in memory 72, tree search engine 74 determines the most likely phoneme represented by the codewords received from feature extraction module 66, and hence representative of the utterance received from the user of the system. It should also be noted that the proper phoneme can be chosen in any number of ways, including by examining the particular senones calculated for each state of the HMMs for each phoneme. Also, a phonetic HMM tree search can be performed in order to find the proper phoneme.
Tree search engine 74 also accesses the lexicon stored in memory 70. The information received by tree search engine 74, based on its accessing of the phonetic speech unit models in memory 72, is used in searching lexicon 70 to determine the word which most likely represents the codewords received from feature extraction module 66 between word boundaries, as indicated by silence detection module 68. Also, search engine 74 preferably accesses a language model in module 75, such as a 60,000 word trigram language model derived from the North American Business News corpus and set out in greater detail in a publication entitled CSR-III Text Language Model, University of Penn., 1994. The language model is used in identifying the most likely word or word sequence represented by the input data. The word or word sequence so determined is thus most likely representative of the utterance received from the user. The word or word sequence is then output by tree search engine 74 to output device 76.
In a preferred embodiment, lexicon 70 contains information which is representative of all of the words in the vocabulary of speech recognition system 60. The words are preferably presented to tree search engine 74 in the form of a prefix tree which can be traversed from a root to a leaf (or to an internal word node) to arrive at the word most likely indicative of the utterance of the user.

FIG. 3 illustrates a prefix tree used in accordance with prior art speech recognition systems. For the sake of clarity, only a portion of a prefix tree is illustrated in FIG. 3. A root node (or input node) 78 is encountered at a first word boundary. A plurality of branches 80 lead from root node 78 to the remainder of the prefix tree. Each of the plurality of branches is associated with a phoneme. In FIG. 3, the branches leaving root node 78 represent only the phonemes represented by the letters AO, AE, and T. The tree extends through other nodes and branches and terminates in an output node 79.

In accordan
