(12) Patent Application Publication (10) Pub. No.: US 2005/0049854 A1
Reding et al.
(43) Pub. Date: Mar. 3, 2005
`
`
(54) METHODS AND APPARATUS FOR GENERATING, UPDATING AND DISTRIBUTING SPEECH RECOGNITION MODELS

(76) Inventors: Craig Reding, Midland Park, NJ (US); Suzi Levas, Nanuet, NY (US)

Correspondence Address:
VERIZON CORPORATE SERVICES GROUP INC.
C/O CHRISTIAN R. ANDERSEN
600 HIDDEN RIDGE DRIVE
MAILCODE HQEO3H14
IRVING, TX 75038 (US)

(21) Appl. No.: 10/961,781

(22) Filed: Oct. 8, 2004

Related U.S. Application Data

(63) Continuation of application No. 09/726,972, filed on Nov. 30, 2000, now Pat. No. 6,823,306.
`
Publication Classification

(51) Int. Cl. .................................................. G10L 11/00

(52) U.S. Cl. .............................................................. 704/201

(57) ABSTRACT
`
Techniques for generating, distributing, and using speech recognition models are described. A shared speech processing facility is used to support speech recognition for a wide variety of devices with limited capabilities including business computer systems, personal data assistants, etc., which are coupled to the speech processing facility via a communications channel, e.g., the Internet. Devices with audio capture capability record and transmit to the speech processing facility, via the Internet, digitized speech and receive speech processing services, e.g., speech recognition model generation and/or speech recognition services, in response. The Internet is used to return speech recognition models and/or information identifying recognized words or phrases. Thus, the speech processing facility can be used to provide speech recognition capabilities to devices without such capabilities and/or to augment a device's speech processing capability. Voice dialing, telephone control and/or other services are provided by the speech processing facility in response to speech recognition results.
`
[Representative drawing: communications system 100 including a business premises, customer premises 1 through N, a speech processing facility, and a telephone network 22.]
`Exhibit 1023
`Page 01 of 26
`
`
`
[FIG. 1 (Sheet 1 of 14): Communications system 100 showing business premises and customer premises coupled to the speech processing facility via the Internet and the telephone network 22.]
`
`
`
[FIG. 2 (Sheet 2 of 14): The communications system of FIG. 1 shown in greater detail, including telephone network elements.]
`
`
`
[FIG. 3 (Sheet 3 of 14): Computer system including a processor, modem, audio input/output devices, and connections to the telephone network.]
`
`
`
[FIG. 4 (Sheet 4 of 14): Memory contents including a voice dialing routine 416, a word processor with speech recognition interface 418, voice dialing data 422 with SD and SI voice dialing customer records, a telephone contact number, speech recognition routines, speaker independent and speaker dependent SRMs, a model training routine 408, and speech data including extracted feature information and a digital speech recording.]
`
`
`
[FIG. 5 (Sheet 5 of 14): Voice dialing customer record identifying remote speech processing systems and their addresses.]
`
`
`
[FIG. 6 (Sheet 6 of 14): Voice dialing IP device including memory with speech recognition, call setup and voice dialing routines, a database of personal and corporate dialer records, a speech recognizer circuit, a switch interface to the SSP, a processor, and a network interface to the Internet.]
`
`
`
[FIG. 7 (Sheet 7 of 14): Model training routine 700: receive text of the word or name to be trained; prompt the user to state it; record the user's speech; if local feature extraction is supported, perform feature extraction on the received speech; transmit the user identifier together with the recorded speech, extracted feature information, text version of the speech and/or an existing speech recognition model to the speech processing facility; receive and store the returned speech recognition model(s).]
`
`
`
[FIG. 8 (Sheet 8 of 14): Voice dialing routine 800: monitor for speech; when speech is received, perform local feature extraction if supported; if local speech recognition capability is available, call the local voice dialing routine; if the local operation is unsuccessful, transmit the system user ID, the speech and/or extracted feature information, and, where no telephone-computer connection exists, a telephone contact number to the remote speech processing facility; if the facility's response includes a telephone number, dial it; provide a message to the system user indicating that the named party is being called or that the system was unable to identify a party to be called.]
`
`
`
[FIG. 9 (Sheet 9 of 14): Local voice dialing subroutine 900: perform speech recognition using locally stored speech recognition models; if no name is recognized, return with an unsuccessful voice dialing indicator; otherwise, if a computer-telephone connection exists, dial the telephone number associated with the recognized name and detect call completion, else transmit the number to be dialed and the contact telephone number to a call establishment device; return with a successful voice dialing indicator.]
`
`
`
[FIG. 10 (Sheet 10 of 14): Remote voice dialing routine 1000: receive voice dialing service input from a remote device; retrieve voice dialing information for the identified user; perform feature extraction if extracted feature information was not received; perform speech recognition using the retrieved voice dialing information; if a name is recognized, either transmit the number to be dialed and the contact telephone number to a call initiation device (when a contact number was received) or transmit the number associated with the recognized name to the remote computer system; if no name is recognized, forward the user ID and speech/feature information to any additional remote speech processing system associated with the identified user, or send a message to the remote computer system notifying it of the failed voice dialing attempt.]
`
`
`
[FIG. 11 (Sheet 11 of 14): Call establishment routine 1100: receive the telephone number to be dialed and the contact telephone number; initiate a call to each number; bridge the initiated calls; allow the bridged call to terminate in the normal manner.]
`
`
`
[FIG. 12 (Sheet 12 of 14): Model generation routine 1200: maintain a system clock and, after a preselected time between updates, transmit updated models and/or updated speech recognition software to user devices; monitor for input including user ID, speech/feature information, text information, model type and/or existing model information; augment the training database with the received information; generate a new speech model from the received speech, other received information and/or speech in the training database; store the generated speech recognition model; transmit it to the device which requested a model generation or update service.]
`
`
`
[FIG. 13 (Sheet 13 of 14): Speech processing facility including a processor, audio signal processing circuitry, a training database, and connections to the telephone network.]
`
`
`
[FIG. 14 (Sheet 14 of 14): Speech recognition routine 1400: receive a speech recognition service request from a remote device, the request including speech or extracted feature information and a system identifier; perform a speech recognition operation using the received speech or extracted feature information; generate a message including the speech recognition results, including recognized words in text form; transmit the generated message to the system identified by the system identifier associated with the received speech.]
`
`
`
`
`METHODS AND APPARATUS FOR GENERATING,
`UPDATING AND DISTRIBUTING SPEECH
`RECOGNITION MODELS
`
FIELD OF THE INVENTION

[0001] The present invention is directed to speech recognition techniques and, more particularly, to methods and apparatus for generating speech recognition models, distributing speech recognition models and performing speech recognition operations, e.g., voice dialing and word processing operations, using speech recognition models.
`
BACKGROUND OF THE INVENTION

[0002] Speech recognition, which includes both speaker independent speech recognition and speaker dependent speech recognition, is used for a wide variety of applications.
[0003] Speech recognition normally involves the use of speech recognition models or templates that have been trained using speech samples provided by one or more individuals. Commonly used speech recognition models include Hidden Markov Models (HMMs). An example of a common template is a dynamic time warping (DTW) template. In the context of the present application, "speech recognition model" is intended to encompass both speech recognition models and templates which are used for speech recognition purposes.
[0004] As part of a speech recognition operation, speech input is normally digitized and then processed. The processing normally involves extracting feature information, e.g., energy and/or timing information, from the digitized signal. The extracted feature information normally takes the form of one or more feature vectors. The extracted feature vectors are then compared to one or more speech recognition models in an attempt to recognize words, phrases or sounds.
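The extract-then-compare step described above can be sketched in a few lines. The sketch below uses dynamic time warping (DTW) template matching, one of the template techniques named in this background; the feature vectors, template contents, and function names are illustrative assumptions, not anything specified in the application.

```python
# Minimal sketch: compare an extracted feature-vector sequence to stored
# templates by DTW alignment cost, and pick the best-matching word.

def dtw_distance(seq, template):
    """Return the DTW alignment cost between two feature-vector sequences."""
    n, m = len(seq), len(template)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance: Euclidean distance between two feature vectors.
            d = sum((a - b) ** 2 for a, b in zip(seq[i - 1], template[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

def recognize(features, templates):
    """Pick the template (word) whose DTW distance to the input is smallest."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))
```

A recognizer built this way scales linearly in the number of templates, which is one reason the vocabulary-size limits discussed later in this background arise on low-power devices.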
[0005] In speech recognition systems, various actions, e.g., dialing a telephone number, entering information into a form, etc., are often performed in response to the results of the speech recognition operation.
[0006] Before speech recognition operations can be performed, one or more speech recognition models need to be trained. Speech recognition models can be either speaker dependent or speaker independent. Speaker dependent (SD) speech recognition models are normally trained using speech from a single individual and are designed so that they should accurately recognize the speech of the individual who provided the training speech, but not necessarily other individuals. Speaker independent (SI) speech recognition models are normally generated from speech provided by numerous individuals or from text. The generated speaker independent speech recognition models often represent composite models which take into consideration variations between different speakers, e.g., due to differing pronunciations of the same word. Speaker independent speech recognition models are designed to accurately identify speech from a wide range of individuals, including individuals who did not provide speech samples for training purposes.
[0007] In general, model training involves one or more individuals speaking a word or phrase, converting the speech into digital signal data, and then processing the digital signal data to generate a speech recognition model. Model training frequently involves an iterative process of computing a speech recognition model, scoring the model, and then using the results of the scoring operation to further improve and retrain the speech recognition model.
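The compute-score-retrain loop described above can be sketched generically. The `train` and `score` callables, the tolerance, and the iteration cap below are assumptions for illustration; a real system might plug in, e.g., Baum-Welch re-estimation for HMMs.

```python
# Hedged sketch of an iterative model training loop: retrain until the
# model's score on the training samples stops improving meaningfully.

def train_model(samples, train, score, max_iters=10, tol=1e-4):
    """Repeatedly retrain a model until its score stops improving."""
    model = train(samples, model=None)          # initial estimate
    best = score(model, samples)
    for _ in range(max_iters):
        model = train(samples, model=model)     # refine using previous model
        new = score(model, samples)
        if new - best < tol:                    # no meaningful improvement
            break
        best = new
    return model
```

The loop structure is what matters here: each pass uses the scoring result of the previous model to decide whether further refinement is worthwhile, mirroring the iterative process the paragraph describes.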
[0008] Speech recognition model training processes can be very computationally complex. This is particularly true in the case of SI models, where audio data from numerous speakers is normally processed to generate each model. For this reason, speech recognition models are often generated using relatively powerful computer systems.
[0009] Individual speech recognition models can take up a considerable amount of storage space. For this reason, it is often impractical to store speech recognition models corresponding to large numbers of words or phrases, e.g., the names of all the people in a mid-sized company or a large dictionary, in a portable device or speech recognizer where storage space, e.g., memory, is limited.
[0010] In addition to limits in storage capacity, portable devices are often equipped with limited processing power. Speech recognition, like the model training process, can be a relatively computationally complex process and can therefore be time consuming given limited processing resources. Since most users of a speech processing system expect a prompt response from the system, to satisfy user demands speech processing often needs to be performed in real or near real time. As the number of potential words which may be recognized increases, so does the amount of processing required to perform a speech recognition operation. Thus, devices with limited processing power which may be able to perform a speech recognition operation involving recognizing, e.g., 20 possible names in near real time, may not be fast enough to perform a recognition operation in near real time where the number of names is increased to 100 possible names.
[0011] In the case of voice dialing and other applications where the recognition results need to be generated in near real time, e.g., with relatively little delay, the limited processing power of portable devices often limits the size of the vocabulary which can be considered as possible recognition outcomes.
[0012] In addition to the above implementation problems, implementers of speech recognition systems are often confronted with logistical problems associated with collecting speech samples to be used for model training purposes. This is particularly a problem in the case of speaker independent speech recognition models, where the robustness of the models is often a function of the number of speech samples used for training and the differences between the individuals providing the samples. In applications where speech recognition models are to be used over a wide geographical region, it is particularly desirable that speech samples be collected from the various geographic regions where the models will ultimately be used. In this manner, regional speech differences can be taken into account during model training.
[0013] Another problem confronting implementers of speech recognition systems is that older speech recognition models may include different feature information than current speech recognition models. When updating a system to use newer speech recognition models, previously used models, in addition to speech recognition software, may have to be revised or replaced. This frequently requires speech samples to retrain and/or update the older models. Thus, the problems of collecting training data and training speech recognition models discussed above are often encountered when updating existing speech recognition systems.
[0014] In systems using multiple speech recognition devices, speech model incompatibility may require the extraction of different speech features for different speech recognition devices when the devices are used to perform a speech recognition operation on the same speech segment. Accordingly, in some cases it is desirable to be able to supply the speech to be processed to multiple systems so that each system can perform its own feature extraction operation.
[0015] In view of the above discussion, it is apparent that there is a need for new and improved methods and apparatus addressing a wide range of speech recognition issues. For example, there is a need for improvements with regard to the collection of speech samples for purposes of training speech recognition models. There is also a need for improved methods of providing speech recognition functionality to users of portable devices with limited processing power, e.g., notebook computers and personal data assistants (PDAs). Improved methods of providing speech recognition functionality in systems where different types of speech recognition models are used by different speech recognizers are also desirable. Enhanced methods and apparatus for updating speech recognition models are also desirable.
`
SUMMARY OF THE INVENTION

[0016] The present invention is directed to methods and apparatus for generating, distributing, and using speech recognition models. In accordance with the present invention, a shared, e.g., centralized, speech processing facility is used to support speech recognition for a wide variety of devices, e.g., notebook computers, business computer systems, personal data assistants, etc. The centralized speech processing facility of the present invention may be located at a physically remote site, e.g., in a different room, building, or even country, than the devices to which it provides speech processing and/or speech recognition services. The shared speech processing facility may be coupled to numerous devices via the Internet and/or one or more other communications channels, such as telephone lines, a local area network (LAN), etc.
[0017] In various embodiments, the Internet is used as the communications channel via which model training data is collected and/or speech recognition input is received by the shared speech processing facility of the present invention. Speech files may be sent to the speech processing facility as electronic mail (E-mail) message attachments. The Internet is also used to return speech recognition models and/or information identifying recognized words or phrases included in the processed speech. The speech recognition models may be returned as E-mail message attachments, while the recognized words may be returned as text in the body of an E-mail message or in a text file attachment to an E-mail message.
[0018] Thus, via the Internet, devices with audio capture capability and Internet access can record and transmit digitized speech, e.g., as speech files, to the centralized speech processing facility of the present invention. The speech processing facility then performs a model training operation or speech recognition operation using the received speech. A speech recognition model or a data message including the recognized words, phrases or other information is then returned, depending on whether a model training or recognition operation was performed, to the device which supplied the speech.
[0019] Thus, the speech processing facility of the present invention can be used to provide speech recognition capabilities and/or to augment a device's speech processing capability by performing speech recognition model training operations and/or additional speech recognition operations which can be used to supplement local speech recognition attempts.
[0020] For example, in various embodiments of the present invention, the generation of speech recognition models to be used locally is performed by the remote speech processing facility. In one such embodiment, when the local computer device needs a speech recognition model to be trained, the local computer system collects the necessary training data, e.g., speech samples from the system user and text corresponding to the retrieved speech samples, and then transmits the training data, e.g., via the Internet, to the speech processing facility of the present invention. The speech processing facility then generates one or more speech recognition models and returns them to the local computer system for use in local speech recognition operations.
[0021] In various embodiments, the shared speech processing facility updates a training database with the speech samples received from local computer systems. In this way, a more robust set of training data is created at the remote speech processing facility as part of the model training and/or updating process, without imposing additional burdens on individual devices beyond those needed to support services being provided to a user of an individual device, e.g., a notebook computer or PDA. As the training database is augmented, speaker independent speech recognition models may be retrained periodically using the updated training data and then transmitted to those computer systems which use speech recognition models corresponding to the models which are retrained. In this manner, multiple local systems can benefit from one or more different users initiating the retraining of speech recognition models to enhance recognition results.
[0022] As discussed above, in various embodiments, the remote speech processing facility of the present invention is used to perform speech recognition operations and then return the recognition results or take other actions based on the recognition results. For example, in one embodiment business computer systems capture speech from, e.g., customers, and then transmit the speech or extracted speech information to the shared speech processing facility via the Internet. The remote speech processing facility performs speech recognition operations on the received speech and/or received extracted speech information. The results of the recognition operation, e.g., recognized words in the form of, e.g., text, are then returned to the business computer system which supplied the processed speech or speech information. The business system can then use the information returned by the speech processing facility, e.g., recognized text, to fill in forms or perform other services such as automatically responding to verbal customer inquiries. Thus, the remote speech processing method of the present invention can be used to supply speech processing capabilities to customers, e.g., businesses, who can't, or do not want to, support local speech processing operations.
[0023] In addition to providing speech recognition capabilities to systems which can't perform speech recognition locally, the speech processing facility of the present invention is used in various embodiments to augment the speech recognition capabilities of various devices such as notebook computers and personal data assistants. In such embodiments the remote speech processing facility may be used to perform speech recognition when the local device is unable to obtain a satisfactory recognition result, e.g., because of a limited vocabulary or limited processing capability.
[0024] In one particular exemplary embodiment, a notebook computer attempts to perform a voice dialing operation on received speech using locally stored speech recognition models prior to contacting the speech processing facility of the present invention. If the local speech recognition operation fails to result in the recognition of a name, the received speech or extracted feature information is transmitted to the remote speech processing facility. If the local notebook computer can't perform a dialing operation, the notebook computer also transmits to the remote speech processing facility a telephone number where the user of the notebook computer can be contacted by telephone. The remote speech processing facility performs a speech recognition operation using the received speech and/or extracted feature information. If the speech recognition operation results in the recognition of a name with which a telephone number is associated, the telephone number is retrieved from the remote speech processing facility's memory. The telephone number is returned to the device requesting that the voice dialing speech recognition operation be performed, unless a contact telephone number was provided with the speech and/or extracted feature information. In such a case, the speech processing facility uses telephone circuitry to initiate one telephone call to the telephone number retrieved from memory and another telephone call to the received contact telephone number. When the two calls are answered, they are bridged, thereby completing the voice dialing operation.
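The local-first fallback flow just described can be summarized as a small decision function. The dictionary-lookup stand-in for local recognition, the `remote_recognize` callable, and the action names are illustrative assumptions, not an interface defined by the application.

```python
# Sketch: attempt voice dialing locally, fall back to the remote facility,
# and decide whether the device dials the returned number or the facility
# bridges two calls (when a contact number was supplied).

def voice_dial(speech, local_models, remote_recognize, contact_number=None):
    """Return an action dict describing how the call should proceed."""
    number = local_models.get(speech)        # stand-in for local recognition
    if number is not None:
        return {"action": "dial_locally", "number": number}

    number = remote_recognize(speech, contact_number)
    if number is None:
        return {"action": "failed"}
    if contact_number is None:
        # Facility returns the number; the requesting device dials it.
        return {"action": "dial_locally", "number": number}
    # Facility calls both numbers and bridges them when answered.
    return {"action": "bridged_by_facility",
            "numbers": (number, contact_number)}
```

Note the asymmetry the paragraph specifies: supplying a contact number shifts call setup from the device to the facility, which matters when the device has no telephone-computer connection of its own.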
[0025] In addition to generating new speech recognition models to be used in speech processing operations and providing speech recognition services, the centralized speech processing facility of the present invention can be used for modernizing existing speech recognition systems by upgrading the speech recognition models and the speech recognition engine used therewith. In one particular embodiment, speech recognition models or templates are received via the Internet from a system to be updated, along with speech corresponding to the modeled words. The received models or templates and/or speech are used to generate updated models which include different speech characteristic information or have a different model format than the existing speech recognition models. The updated models are returned to the speech recognition systems along with, in some cases, new speech recognition engine software.
[0026] In one particular embodiment, speech recognition templates used by voice dialing systems are updated and replaced with HMMs generated by the central processing system of the present invention.

[0027] At the time the templates are replaced, the speech recognition engine software is also replaced with a new speech recognition engine which uses HMMs for recognition purposes.
[0028] Various additional features and advantages of the present invention will be apparent from the detailed description which follows.
`
BRIEF DESCRIPTION OF THE DRAWINGS

[0029] FIG. 1 illustrates a communication system implemented in accordance with an exemplary embodiment of the present invention.

[0030] FIG. 2 illustrates the communications system of FIG. 1 in greater detail.

[0031] FIG. 3 illustrates a computer system for use in the communications system illustrated in FIG. 1.

[0032] FIG. 4 illustrates memory which may be used as the memory of a computer in the system illustrated in FIG. 1.

[0033] FIG. 5 illustrates a voice dialing customer record implemented in accordance with the present invention.

[0034] FIG. 6 illustrates a voice dialing IP device which may be used in the system illustrated in FIG. 1.

[0035] FIG. 7 illustrates a model training routine of the present invention.

[0036] FIG. 8 illustrates an exemplary voice dialing routine of the present invention.

[0037] FIG. 9 illustrates a local voice dialing subroutine of the present invention.

[0038] FIG. 10 illustrates a remote voice dialing routine implemented in accordance with the present invention.

[0039] FIG. 11 illustrates a call establishment routine of the present invention.

[0040] FIG. 12 illustrates a model generation routine of the present invention.

[0041] FIG. 13 illustrates a speech processing facility implemented in accordance with one embodiment of the present invention.

[0042] FIG. 14 illustrates a speech recognition routine that can be executed by the speech processing facility of FIG. 13.
`
DETAILED DESCRIPTION

[0043] As discussed above, the present invention is directed to methods and apparatus for generating speech recognition models, distributing speech recognition models and performing speech recognition operations, e.g., voice dialing and word processing operations, using speech recognition models.
[0044] FIG. 1 illustrates a communications system 100 implemented in accordance with the present invention. As illustrated, the system 100 includes a business premises 10 and customer premises 12, 14, 16. Each one of the premises 10, 12, 14, 16 represents a customer or business site. While only one business premise 10 is shown, it is to be understood that any number of business and customer premises may be included in the system 100. The various premises 10, 12, 14, 16 are coupled together and to a shared speech processing facility 18 of the present invention via the Internet 20 and a telephone network 22. Connections to the Internet 20 may be via digital subscriber lines (DSL), cable modems, cable lines, high speed data links, e.g., T1 links, dial-up lines, wireless connections or a wide range of other communications channels. The premises 10, 12, 14, 16 may be connected to the speech processing facility via a LAN or other communications channel instead of, or in addition to, the Internet.
[0045] While businesses have frequently contracted for high speed Internet connections, e.g., T1 links and other high speed services, which may be on during all hours of business service, residential customers are now also moving to relatively high speed Internet connections which are "always on". As part of such services, a link to the Internet is maintained while the computer user has his/her computer on, avoiding delays associated with establishing an Internet connection when data needs to be sent or received over the Internet. Examples of such Internet connections include cable modem services and DSL services. Such services frequently support sufficient bandwidth for the transmission of audio signals. As the speed of Internet connections increases, the number of Internet service subscribers capable of transmitting audio signals in real or near real time will continue to increase.
[0046] The speech processing facility 18 is capable of receiving speech from the various premises 10, 12, 14, 16 and performing speech processing operations thereon. The operations may include speech model training, e.g., gener