(12) Patent Application Publication          (10) Pub. No.: US 2006/0241948 A1
     Abrash et al.                           (43) Pub. Date: Oct. 26, 2006

(54) METHOD AND APPARATUS FOR OBTAINING COMPLETE SPEECH SIGNALS
     FOR SPEECH RECOGNITION APPLICATIONS

(76) Inventors: Victor Abrash, Montara, CA (US); Federico Cesari, Menlo Park,
     CA (US); Horacio Franco, Menlo Park, CA (US); Christopher George,
     Los Osos, CA (US); Jing Zheng, Sunnyvale, CA (US)

     Correspondence Address:
     PATTERSON & SHERIDAN, LLP
     SRI INTERNATIONAL
     595 SHREWSBURY AVENUE
     SUITE 100
     SHREWSBURY, NJ 07702 (US)

(21) Appl. No.: 11/217,912

(22) Filed: Sep. 1, 2005

               Related U.S. Application Data

(60) Provisional application No. 60/606,644, filed on Sep. 1, 2004.

               Publication Classification

(51) Int. Cl.
     G10L 21/00            (2006.01)

(52) U.S. Cl. ..................................................... 704/275

(57)                      ABSTRACT
`
The present invention relates to a method and apparatus for obtaining complete speech signals for speech recognition applications. In one embodiment, the method continuously records an audio stream comprising a sequence of frames to a circular buffer. When a user command to commence or terminate speech recognition is received, the method obtains a number of frames of the audio stream occurring before or after the user command in order to identify an augmented audio signal for speech recognition processing. In further embodiments, the method analyzes the augmented audio signal in order to locate starting and ending speech endpoints that bound at least a portion of speech to be processed for recognition. At least one of the speech endpoints is located using a Hidden Markov Model.
`
`
`
[FIG. 1 (Sheet 1 of 5): Flow diagram of the method 100 for speech recognition processing of an augmented audio stream. Steps: 102 Start; 104 Continuously record audio stream to circular buffer; 106 Receive user command to commence speech recognition at t=Ts; 108 User begins speaking at t=S; 110 Request portion of audio stream from circular buffer starting at t=Ts−N1, where Ts−N1≤S≤Ts most of the time; 112 Receive user command to terminate speech recognition at t=TE; 114 User stops speaking at t=E; 116 Request portion of audio stream from circular buffer up to t=TE+N2, where TE≤E≤TE+N2 most of the time; 118 Perform endpoint search on audio stream between Ts−N1 and TE+N2 (shown in phantom); 120 Apply speech recognition processing to endpointed audio signal.]
`
`
`
[FIG. 2 (Sheet 2 of 5): Flow diagram of the method 200. Steps: 202 Start; 204 Receive audio signal; 206 Perform endpointing search using endpointing HMM to detect speech in received audio signal; 208 Back up a predefined number of frames; 210 Commence recognition processing starting at new start frame; 212 Detect end of speech; 214 Terminate recognition processing and output recognized speech.]
`
`
`
[FIG. 3 (Sheet 3 of 5): Flow diagram of the method 300. Steps: 304 Count number of frames of audio signal in which the most likely word is speech in the N1 preceding frames; 306 Does number exceed first predefined threshold? If no, 310 Continue to search audio signal; if yes, 308 Back up to B frames before first speech frame in sequence, in accordance with step 208 of the method 200.]
`
`
`
[FIG. 4 (Sheet 4 of 5): Flow diagram of the method 400. Steps: 402 Start; 404 Identify most likely word in endpointing search; 406 Is most likely word speech? If yes, 408 Compute most likely word's duration back to most recent pause-to-speech transition, then 410 Does duration meet or exceed first predefined threshold? If yes, 412 Detect start of speech. If the most likely word is silence, 414 Is most likely word's frame subsequent to speech starting frame? If yes, 416 Compute pause duration back to last speech-to-pause transition, then 418 Does duration meet or exceed second predefined threshold? If yes, 420 Detect end of speech; 422 End.]
`
`
`
[FIG. 5 (Sheet 5 of 5): High-level block diagram of a general purpose computing device 500, comprising a processor 502, a memory 504, and an I/O device 506 (e.g., storage).]
`
`
`
`
`METHOD AND APPARATUS FOR OBTAINING
`COMPLETE SPEECH SIGNALS FOR SPEECH
`RECOGNITION APPLICATIONS
`
`CROSS REFERENCE TO RELATED
`APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional Patent Application No. 60/606,644, filed Sep. 1, 2004 (entitled "Method and Apparatus for Obtaining Complete Speech Signals for Speech Recognition Applications"), which is herein incorporated by reference in its entirety.
`
`REFERENCE TO GOVERNMENT FUNDING
[0002] This invention was made with Government support under contract number DAAH01-00-C-R003, awarded by the Defense Advanced Research Projects Agency, and under contract number NAG2-1568, awarded by NASA. The Government has certain rights in this invention.
`
`FIELD OF THE INVENTION
[0003] The present invention relates generally to the field of speech recognition and relates more particularly to methods for obtaining speech signals for speech recognition applications.
`
`BACKGROUND OF THE DISCLOSURE
[0004] The accuracy of existing speech recognition systems is often adversely impacted by an inability to obtain a complete speech signal for processing. For example, imperfect synchronization between a user's actual speech signal and the times at which the user commands the speech recognition system to listen for the speech signal can cause an incomplete speech signal to be provided for processing. For instance, a user may begin speaking before he provides the command to process his speech (e.g., by pressing a button), or he may terminate the processing command before he is finished uttering the speech signal to be processed (e.g., by releasing or pressing a button). If the speech recognition system does not "hear" the user's entire utterance, the results that the speech recognition system subsequently produces will not be as accurate as otherwise possible. In open-microphone applications, audio gaps between two utterances (e.g., due to latency or other factors) can also produce incomplete results if an utterance is started during the audio gap.
[0005] Poor endpointing (e.g., determining the start and the end of speech in an audio signal) can also cause incomplete or inaccurate results to be produced. Good endpointing increases the accuracy of speech recognition results and reduces speech recognition system response time by eliminating background noise, silence, and other non-speech sounds (e.g., breathing, coughing, and the like) from the audio signal prior to processing. By contrast, poor endpointing may produce more flawed speech recognition results or may require the consumption of additional computational resources in order to process a speech signal containing extraneous information. Efficient and reliable endpointing is therefore extremely important in speech recognition applications.
[0006] Conventional endpointing methods typically use short-time energy or spectral energy features (possibly augmented with other features such as zero-crossing rate, pitch, or duration information) in order to determine the start and the end of speech in a given audio signal. However, such features become less reliable under conditions of actual use (e.g., noisy real-world situations), and some users elect to disable endpointing capabilities in such situations because they contribute more to recognition error than to recognition accuracy.
[0007] Thus, there is a need in the art for a method and apparatus for obtaining complete speech signals for speech recognition applications.
`
`SUMMARY OF THE INVENTION
[0008] In one embodiment, the present invention relates to a method and apparatus for obtaining complete speech signals for speech recognition applications. In one embodiment, the method continuously records an audio stream, which is converted to a sequence of frames of acoustic speech features and stored in a circular buffer. When a user command to commence or terminate speech recognition is received, the method obtains a number of frames of the audio stream occurring before or after the user command in order to identify an augmented audio signal for speech recognition processing.
[0009] In further embodiments, the method analyzes the augmented audio signal in order to locate starting and ending speech endpoints that bound at least a portion of speech to be processed for recognition. At least one of the speech endpoints is located using a Hidden Markov Model.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:

[0011] FIG. 1 is a flow diagram illustrating one embodiment of a method for speech recognition processing of an augmented audio stream, according to the present invention;

[0012] FIG. 2 is a flow diagram illustrating one embodiment of a method for performing endpoint searching and speech recognition processing on an audio signal;

[0013] FIG. 3 is a flow diagram illustrating a first embodiment of a method for performing an endpointing search using an endpointing HMM, according to the present invention;

[0014] FIG. 4 is a flow diagram illustrating a second embodiment of a method for performing an endpointing search using an endpointing HMM, according to the present invention; and

[0015] FIG. 5 is a high-level block diagram of the present invention implemented using a general purpose computing device.

[0016] To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
`
`DETAILED DESCRIPTION
[0017] The present invention relates to a method and apparatus for obtaining an improved audio signal for speech recognition processing, and to a method and apparatus for improved endpointing for speech recognition. In one embodiment, an audio stream is recorded continuously by a speech recognition system, enabling the speech recognition system to retrieve portions of a speech signal that conventional speech recognition systems might miss due to user commands that are not properly synchronized with user utterances.
[0018] In further embodiments of the invention, one or more Hidden Markov Models (HMMs) are employed to endpoint an audio signal in real time in place of a conventional signal processing endpointer. Using HMMs for this function enables speech start and end detection that is faster and more robust to noise than conventional endpointing techniques.
[0019] FIG. 1 is a flow diagram illustrating one embodiment of a method 100 for speech recognition processing of an augmented audio stream, according to the present invention. The method 100 is initialized at step 102 and proceeds to step 104, where the method 100 continuously records an audio stream (e.g., a sequence of audio frames containing user speech, background audio, etc.) to a circular buffer. In step 106, the method 100 receives a user command (e.g., via a button press or other means) to commence speech recognition, at time t=Ts.
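For illustration only, step 104 can be pictured with a minimal Python sketch of a circular frame buffer; the class and method names here (CircularFrameBuffer, frames_between) and the fixed capacity are hypothetical choices for the sketch, not details of the disclosed method.

    from collections import deque

    class CircularFrameBuffer:
        """Ring buffer of (timestamp, frame) pairs; once capacity is
        reached, the oldest frames are silently overwritten, so the
        buffer always holds the most recent stretch of the audio stream."""

        def __init__(self, capacity_frames):
            self._frames = deque(maxlen=capacity_frames)

        def append(self, timestamp, frame):
            # Called once per incoming audio frame, in parallel with any
            # recognition that is under way (cf. step 104 of method 100).
            self._frames.append((timestamp, frame))

        def frames_between(self, t_start, t_end):
            # Retrieve the buffered frames whose timestamps fall in
            # [t_start, t_end]; this is what lets the method "back up"
            # or "go forward" relative to a user command.
            return [(t, f) for (t, f) in self._frames if t_start <= t <= t_end]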
[0020] In step 108, the user begins speaking, at time t=S. The user command to commence speech recognition, received at time t=Ts, and the actual start of the user speech, at time t=S, are only approximately synchronized; the user may begin speaking before or after the command to commence speech recognition received in step 106.
[0021] Once the user begins speaking, the method 100 proceeds to step 110 and requests a portion of the recorded audio stream from the circular buffer starting at time t=Ts−N1, where N1 is an interval of time such that Ts−N1≤S≤Ts most of the time. In one embodiment, the interval N1 is chosen by analyzing real or simulated user data and selecting the minimum value of N1 that minimizes the speech recognition error rate on that data. In some embodiments, a sufficient value for N1 is in the range of tenths of a second. In another embodiment, where the audio signal for speech recognition processing has been acquired using an open-microphone mode, N1 is approximately equal to Ts−T', where T' is the absolute time at which the previous speech recognition process on the previous utterance ended. Thus, the current speech recognition process will start on the first audio frame that was not recognized in the previous speech recognition processing.
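The data-driven selection of N1 described above might be sketched as follows; the error_rate probe, which runs recognition on real or simulated user data for a candidate padding value, is a hypothetical caller-supplied function, not part of the disclosure.

    def select_interval(candidates, error_rate):
        """Return the minimum candidate value of N1 (in seconds) that
        minimizes the recognition error rate, per the selection rule
        described above."""
        errors = {n: error_rate(n) for n in candidates}
        best = min(errors.values())
        # Among all candidates achieving the minimum error, prefer the
        # smallest interval, since larger padding adds non-speech audio.
        return min(n for n, e in errors.items() if e == best)

    # e.g., candidates in tenths of a second, as suggested above:
    # n1 = select_interval([0.1, 0.2, 0.3, 0.4, 0.5], error_rate=my_wer_probe)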
[0022] In step 112, the method 100 receives a user command (e.g., via a button press or other means) to terminate speech recognition, at time t=TE. In step 114, the user stops speaking, at time t=E. The user command to terminate speech recognition, received at time t=TE, and the actual end of the user speech, at time t=E, are only approximately synchronized; the user may stop speaking before or after the command to terminate speech recognition received in step 112.
[0023] In step 116, the method 100 requests a portion of the audio stream from the circular buffer up to time t=TE+N2, where N2 is an interval of time such that TE≤E≤TE+N2 most of the time. In one embodiment, N2 is chosen by analyzing real or simulated user data and selecting the minimum value of N2 that minimizes the speech recognition error rate on that data. Thus, an augmented audio signal starting at time Ts−N1 and ending at time TE+N2 is identified.
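Putting steps 110 and 116 together, a minimal sketch of assembling the augmented signal from the CircularFrameBuffer defined above (with N1 and N2 chosen, e.g., by the selection sketch above) might read:

    def augmented_signal(buffer, t_s, t_e, n1, n2):
        """Assemble the augmented audio signal: all buffered frames from
        Ts - N1 through TE + N2, so that the actual utterance [S, E] is
        bounded by the retrieved window most of the time."""
        return buffer.frames_between(t_s - n1, t_e + n2)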
[0024] In step 118 (illustrated in phantom), the method 100 optionally performs an endpoint search on at least a portion of the augmented audio signal. In one embodiment, an endpointing search in accordance with step 118 is performed using a conventional endpointing technique. In another embodiment, an endpointing search in accordance with step 118 is performed using one or more Hidden Markov Models (HMMs), as described in further detail below in connection with FIG. 2.
[0025] In step 120, the method 100 applies speech recognition processing to the endpointed audio signal. Speech recognition processing may be applied in accordance with any known speech recognition technique.
[0026] The method 100 then returns to step 104 and continues to record the audio stream to the circular buffer. Recording of the audio stream to the circular buffer is performed in parallel with the speech recognition processes, e.g., steps 106-120 of the method 100.
[0027] The method 100 affords greater flexibility in choosing speech signals for recognition processing than conventional speech recognition techniques. Importantly, the method 100 improves the likelihood that a user's entire utterance is provided for recognition processing, even when user operation of the speech recognition system would normally provide an incomplete speech signal. Because the method 100 continuously records the audio stream containing the speech signals, the method 100 can "back up" or "go forward" to retrieve portions of a speech signal that conventional speech recognition systems might miss due to user commands that are not properly synchronized with user utterances. Thus, more complete and more accurate speech recognition results are produced.
[0028] Moreover, because the audio stream is continuously recorded even when speech is not being actively processed, the method 100 enables new interaction strategies. For example, speech recognition processing can be applied to an audio stream immediately upon command, from a specified point in time (e.g., in the future or recent past), or from a last detected speech endpoint (e.g., a speech starting or speech ending point), among other times. Thus, speech recognition can be performed, on the user's command, from a frame that is not necessarily the most recently recorded frame (e.g., occurring some time before or after the most recently recorded frame).
[0029] FIG. 2 is a flow diagram illustrating one embodiment of a method 200 for performing endpoint searching and speech recognition processing on an audio signal, e.g., in accordance with steps 118-120 of FIG. 1. The method 200 is initialized at step 202 and proceeds to step 204, where the method 200 receives an audio signal, e.g., from the method 100.
[0030] In step 206, the method 200 performs a speech endpointing search using an endpointing HMM to detect the start of the speech in the received audio signal. In one embodiment, the endpointing HMM recognizes speech and silence in parallel, enabling the method 200 to hypothesize the start of speech when speech is more likely than silence. Many topologies can be used for the speech HMM, and a standard silence HMM may also be used. In one embodiment, the topology of the speech HMM is defined as a sequence of one or more reject "phones", where a reject phone is an HMM model trained on all types of speech. In another embodiment, the topology of the speech HMM is defined as a sequence (or sequence of loops) of context-independent (CI) or other phones. In further embodiments, the endpointing HMM has a pre-determined but configurable minimum duration, which may be a function of the number of reject or other phones in sequence in the speech HMM, and which enables the endpointer to more easily reject short noises as speech.
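The relationship between the phone chain and the minimum duration can be illustrated with a small configuration sketch; the field names and the numeric defaults below are hypothetical placeholders, not values taken from the disclosure.

    from dataclasses import dataclass

    @dataclass
    class EndpointingTopology:
        """Illustrative configuration for the endpointing HMM: a 'speech'
        word built as a chain of reject phones, decoded in parallel with
        a standard silence model."""
        reject_phones_in_sequence: int = 3   # length of the speech-word phone chain
        min_frames_per_phone: int = 3        # minimum state occupancy per phone

        def min_speech_frames(self):
            # The configurable minimum speech duration grows with the
            # number of phones in sequence, which is what lets the
            # endpointer reject short noises as speech.
            return self.reject_phones_in_sequence * self.min_frames_per_phone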
[0031] In one embodiment, the method 200 identifies the speech starting frame when it detects a predefined sufficient number of frames of speech in the audio signal. The number of frames of speech that are required to indicate a speech endpoint may be adjusted as appropriate for different speech recognition applications. Embodiments of methods for implementing an endpointing HMM in accordance with step 206 are described in further detail below with reference to FIGS. 3-4.
[0032] In step 208, once the speech starting frame, FS, is detected, the method 200 backs up a pre-defined number B of frames to a frame FS' preceding the speech starting frame FS, such that FS'=FS−B becomes the new "start" frame for the speech for the purposes of the speech recognition process. In one embodiment, the number B of frames by which the method 200 backs up is relatively small (e.g., approximately 10 frames), but is large enough to ensure that the speech recognition process begins on a frame of silence.
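A one-function sketch of step 208, using the suggested B of roughly 10 frames (the clamp at frame 0 is an added safeguard assumed here, not stated in the disclosure):

    def backed_up_start(fs, b=10):
        """Compute FS' = FS - B per step 208, clamped so the new start
        frame never precedes the beginning of the buffered signal."""
        return max(0, fs - b)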
[0033] In step 210, the method 200 commences recognition processing starting from the new start frame FS' identified in step 208. In one embodiment, recognition processing is performed in accordance with step 210 using a standard speech recognition HMM separate from the endpointing HMM.
[0034] In step 212, the method 200 detects the end of the speech to be processed. In one embodiment, a speech "end" frame is detected when the recognition process started in step 210 of the method 200 detects a predefined sufficient number of frames of silence following frames of speech. In one embodiment, the number of frames of silence that are required to indicate a speech endpoint is adjustable based on the particular speech recognition application. In another embodiment, the ending/silence frames might be required to legally end the speech recognition grammar, forcing the endpointer not to detect the end of speech until a legal ending point. In another embodiment, the speech end frame is detected using the same endpointing HMM used to detect the speech start frame. Embodiments of methods for implementing an endpointing HMM in accordance with step 212 are described in further detail below with reference to FIGS. 3-4.
[0035] In step 214, the method 200 terminates speech recognition processing and outputs recognized speech, and in step 216, the method 200 terminates.
[0036] Implementation of endpointing HMMs in conjunction with the method 200 enables more accurate detection of speech endpoints in an input audio signal, because the method 200 does not have any internal parameters that directly depend on the characteristics of the audio signal and that require extensive tuning. Moreover, the method 200 does not utilize speech features that are unreliable in noisy environments. Furthermore, because the method 200 requires minimal computation (e.g., processing while detecting the start and the end of speech is minimal), speech recognition results can be produced more rapidly than is possible by conventional speech recognition systems. Thus, the method 200 can rapidly and reliably endpoint an input speech signal in virtually any environment.
[0037] Moreover, implementation of the method 200 in conjunction with the method 100 improves the likelihood that a user's complete utterance is provided for speech recognition processing, which ultimately produces more complete and more accurate speech recognition results.
[0038] FIG. 3 is a flow diagram illustrating a first embodiment of a method 300 for performing an endpointing search using an endpointing HMM, according to the present invention. The method 300 may be implemented in accordance with step 206 and/or step 212 of the method 200 to detect endpoints of speech in an audio signal received by a speech recognition system.
[0039] The method 300 is initialized at step 302 and proceeds to step 304, where the method 300 counts a number, F1, of frames of the received audio signal in which the most likely word (e.g., according to the standard HMM Viterbi search criteria) is speech in the last N1 preceding frames. In one embodiment, N1 is a predefined parameter that is configurable based on the particular speech recognition application and the desired results. Once the number F1 of frames is determined, the method 300 proceeds to step 306 and determines whether the number F1 of frames exceeds a first predefined threshold, T1. Again, the first predefined threshold, T1, is configurable based on the particular speech recognition application and the desired results.
[0040] If the method 300 concludes in step 306 that F1 does not exceed T1, the method 300 proceeds to step 310 and continues to search the audio signal for a speech endpoint, e.g., by returning to step 304, incrementing the location in the speech signal by one frame, and continuing to count the number of speech frames in the last N1 frames of the audio signal. Alternatively, if the method 300 concludes in step 306 that F1 does exceed T1, the method 300 proceeds to step 308 and defines the first frame FS of the frame sequence that includes the number (F1) of frames as the speech starting point. The method 300 then backs up to a predefined number B of frames before the speech starting frame for speech recognition processing, e.g., in accordance with step 208 of the method 200. In one embodiment, values for the parameters N1 and T1 are determined to simultaneously minimize the probability of detecting short noises as speech and maximize the probability of detecting single, short words (e.g., "yes" or "no") as speech.
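A minimal Python sketch of this window-counting start search follows. It assumes a stream of per-frame booleans, one per frame, that is True when the most likely word at that frame is speech (taken here to come from the endpointing HMM's Viterbi search); the function name and the reset rule for the tracked first speech frame are choices made for the sketch.

    from collections import deque

    def method_300_start_search(frame_is_speech, n1, t1, b=10):
        """Return the recognition start frame FS' = FS - B once more
        than T1 of the last N1 frames are speech, or None otherwise."""
        window = deque(maxlen=n1)   # decisions for the last N1 frames
        fs = None                   # first speech frame of the current run
        for i, speech in enumerate(frame_is_speech):
            window.append(speech)
            if speech and fs is None:
                fs = i
            elif not speech and not any(window):
                fs = None           # no speech left in the window: reset
            if sum(window) > t1:    # F1 speech frames in last N1 exceed T1
                return max(0, fs - b)
        return None

The stopping-frame variant of paragraph [0041] is symmetric: count silence decisions in the last N2 frames and terminate recognition when that count meets the threshold T2.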
[0041] In one embodiment, the method 300 may be adapted to detect the speech stopping frame as well as the speech starting frame (e.g., in accordance with step 212 of the method 200). However, in step 304, the method 300 would count the number, F2, of frames of the received audio signal in which the most likely word is silence in the last N2 preceding frames. Then, when that number, F2, meets a second predefined threshold, T2, speech recognition processing is terminated (e.g., effectively identifying the frame at which recognition processing is terminated as the speech endpoint). In either case, the method 300 is robust to noise and produces accurate speech recognition results with minimal computational complexity.
[0042] FIG. 4 is a flow diagram illustrating a second embodiment of a method 400 for performing an endpointing search using an endpointing HMM, according to the present invention. Similar to the method 300, the method 400 may be implemented in accordance with step 206 and/or step 212 of the method 200 to detect endpoints of speech in an audio signal received by a speech recognition system.
[0043] The method 400 is initialized at step 402 and proceeds to step 404, where the method 400 identifies the most likely word in the endpointing search (e.g., in accordance with the standard Viterbi HMM search algorithm).
[0044] In order to determine the speech starting endpoint, in step 406 the method 400 determines whether the most likely word identified in step 404 is speech or silence. If the method 400 concludes that the most likely word is speech, the method 400 proceeds to step 408 and computes the duration, DS, back to the most recent pause-to-speech transition.
[0045] In step 410, the method 400 determines whether the duration DS meets or exceeds a first predefined threshold T1. If the method 400 concludes that the duration DS does not meet or exceed T1, then the method 400 determines that the identified most likely word does not represent a starting endpoint of the speech, and the method 400 processes the next audio frame and returns to step 404 to continue the search for a starting endpoint.
[0046] Alternatively, if the method 400 concludes in step 410 that the duration DS does meet or exceed T1, then the method 400 proceeds to step 412 and identifies the first frame FS of the most likely speech word identified in step 404 as a speech starting endpoint. Note that according to step 208 of the method 200, speech recognition processing will start some number B of frames before the speech starting point identified in step 404 of the method 400, at frame FS'=FS−B. The method 400 then terminates in step 422.
[0047] To determine the speech ending endpoint, referring back to step 406, if the method 400 concludes that the most likely word identified in step 404 is not speech (i.e., is silence), the method 400 proceeds to step 414, where the method 400 confirms that the frame(s) in which the most likely word appears is subsequent to the frame representing the speech starting point. If the method 400 concludes that the frame in which the most likely word appears is not subsequent to the frame of the speech starting point, then the method 400 concludes that the most likely word identified in step 404 is not a speech endpoint and returns to step 404 to process the next audio frame and continue the search for a speech endpoint.
[0048] Alternatively, if the method 400 concludes in step 414 that the frame in which the most likely word appears is subsequent to the frame of the speech starting point, the method 400 proceeds to step 416 and computes the duration, DP, back to the most recent speech-to-pause transition.
[0049] In step 418, the method 400 determines whether the duration, DP, meets or exceeds a second predefined threshold T2. If the method 400 concludes that the duration DP does not meet or exceed T2, then the method 400 determines that the identified most likely word does not represent an endpoint of the speech, and the method 400 processes the next audio frame and returns to step 404 to continue the search for an ending endpoint.
[0050] However, if the method 400 concludes in step 418 that the duration DP does meet or exceed T2, then the method 400 proceeds to step 420 and identifies the most likely word identified in step 404 as a speech endpoint (specifically, as a speech ending endpoint). The method 400 then terminates in step 422.
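A compact sketch of the method 400 search over a stream of per-frame labels follows; the labels ("speech"/"silence") are assumed to be the most likely word at each frame from the Viterbi search, and the function and variable names are illustrative only.

    def method_400_search(most_likely_word, t1, t2, b=10):
        """Return (start_frame, end_frame), where start_frame is
        FS' = FS - B per step 208; either value may be None if the
        corresponding endpoint is never detected."""
        start = end = None
        run_start, prev = 0, None
        for i, word in enumerate(most_likely_word):
            if word != prev:    # pause-to-speech or speech-to-pause transition
                run_start, prev = i, word
            duration = i - run_start + 1   # frames since the last transition
            if word == "speech" and start is None and duration >= t1:
                start = max(0, run_start - b)   # steps 408-412
            elif word == "silence" and start is not None and duration >= t2:
                end = run_start                 # steps 414-420
                break
        return start, end

Note that the duration here is a per-frame backtrace to the most recent transition, which is why this approach needs more backtrace information (and computation) than the simple window count of the method 300.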
[0051] The method 400 produces accurate speech recognition results in a manner that is more robust to noise, but more computationally complex, than the method 300. Thus, the method 400 may be implemented in cases where greater noise robustness is desired and the additional computational complexity is less of a concern. The method 300 may be implemented in cases where it is not feasible to determine the duration back to the most recent pause-to-speech or speech-to-pause transition (e.g., when backtrace information is limited due to memory constraints).
[0052] In one embodiment, when determining the speech ending frame in step 418 of the method 400, an additional requirement that the speech ending word legally end the speech recognition grammar can prevent premature speech endpoint detection when a user pauses for a long time in the middle of an utterance.
[0053] FIG. 5 is a high-level block diagram of the present invention implemented using a general purpose computing device 500. It should be understood that the digital scheduling engine, manager or application (e.g., for endpointing audio signals for speech recognition) can be implemented as a physical device or subsystem that is coupled to a processor through a communication channel. Therefore, in one embodiment, a general purpose computing device 500 comprises a processor 502, a memory 504, a speech endpointer or module 505, and various input/output (I/O) devices 506 such as a display, a keyboard, a mouse, a modem, and the like. In one embodiment, at least one I/O device is a storage device (e.g., a disk drive, an optical disk drive, a floppy disk drive).
[0054] Alternatively, the digital scheduling engine, manager or application (e.g., speech endpointer 505) can be represented by one or more software applications (or even a combination of software and hardware, e.g., using Application Specific Integrated Circuits (ASIC)), where the software is loaded from a storage medium (e.g., I/O devices 506) and operated by the processor 502 in the memory 504 of the general purpose computing device 500. Thus, in one embodiment, the speech endpointer 505 for endpointing audio signals described herein with reference to the preceding Figures can be stored on a computer readable medium or carrier (e.g., RAM, magnetic or optical drive or diskette, and the like).
[0055] The endpointing methods of the present invention may also be easily implemented in a variety of existing speech recognition systems, including systems using "hold-to-talk", "push-to-talk", "open microphone", "barge-in" and other audio acquisition techniques. Moreover, the simplicity of the endpointing methods enables the endpointing methods to automatically take advantage of improvements to a speech recognition system's acoustic speech features or acoustic models with little or no modification to the endpointing methods themselves. For example, upgrades or improvements to the noise robustness of the system's speech features or acoustic models correspondingly improve the noise robustness of the endpointing methods employed.
[0056] Thus, the present invention represents a significant advancement in the field of speech recognition. One or more Hidden Markov Models are implemented to endpoint (potentially augmented) audio signals for speech recognition processing, resulting in an endpointing method that is more efficient, more robust to noise and more reliable than existing endpointing methods. The method is more accurate and less computationally complex than conventional methods, making it especially useful for speech recognition applications in which input audio signals may contain background noise and/or other non-speech sounds.
[0057] Although various embodiments which incorporate the teachings of the present invention have been shown and described in detail herein, those skilled in the art can readily devise many other varied embodiments