(19) United States

(12) Patent Application Publication    (10) Pub. No.: US 2006/0241948 A1
     Abrash et al.                     (43) Pub. Date: Oct. 26, 2006

(54) METHOD AND APPARATUS FOR OBTAINING COMPLETE SPEECH SIGNALS FOR
     SPEECH RECOGNITION APPLICATIONS

(76) Inventors: Victor Abrash, Montara, CA (US); Federico Cesari,
     Menlo Park, CA (US); Horacio Franco, Menlo Park, CA (US);
     Christopher George, Los Osos, CA (US); Jing Zheng, Sunnyvale,
     CA (US)

     Correspondence Address:
     PATTERSON & SHERIDAN, LLP
     SRI INTERNATIONAL
     595 SHREWSBURY AVENUE, SUITE 100
     SHREWSBURY, NJ 07702 (US)

(21) Appl. No.: 11/217,912

(22) Filed: Sep. 1, 2005

     Related U.S. Application Data

(60) Provisional application No. 60/606,644, filed on Sep. 1, 2004.

     Publication Classification

(51) Int. Cl.
     G10L 21/00 (2006.01)
(52) U.S. Cl. ................................................ 704/275

(57) ABSTRACT

The present invention relates to a method and apparatus for obtaining
complete speech signals for speech recognition applications. In one
embodiment, the method continuously records an audio stream comprising
a sequence of frames to a circular buffer. When a user command to
commence or terminate speech recognition is received, the method
obtains a number of frames of the audio stream occurring before or
after the user command in order to identify an augmented audio signal
for speech recognition processing. In further embodiments, the method
analyzes the augmented audio signal in order to locate starting and
ending speech endpoints that bound at least a portion of speech to be
processed for recognition. At least one of the speech endpoints is
located using a Hidden Markov Model.

[Front-page representative drawing: the FIG. 1 flow diagram,
reproduced as Sheet 1 of 5 below.]

Exhibit 1016
Page 01 of 14

Patent Application Publication    Oct. 26, 2006    Sheet 1 of 5    US 2006/0241948 A1

FIG. 1 (flow diagram of the method 100):

  102  START
  104  CONTINUOUSLY RECORD AUDIO STREAM TO CIRCULAR BUFFER
  106  RECEIVE USER COMMAND TO COMMENCE SPEECH RECOGNITION AT t=Ts
  108  USER BEGINS SPEAKING AT t=S
  110  REQUEST PORTION OF AUDIO STREAM FROM CIRCULAR BUFFER STARTING AT
       t=Ts-N1, WHERE Ts-N1 ≤ S ≤ Ts MOST OF THE TIME
  112  RECEIVE USER COMMAND TO TERMINATE SPEECH RECOGNITION AT t=TE
  114  USER STOPS SPEAKING AT t=E
  116  REQUEST PORTION OF AUDIO STREAM FROM CIRCULAR BUFFER UP TO
       t=TE+N2, WHERE Ts ≤ E < TE+N2 MOST OF THE TIME
  118  (OPTIONAL) PERFORM ENDPOINT SEARCH ON AUDIO STREAM BETWEEN
       Ts-N1 AND TE+N2
  120  APPLY SPEECH RECOGNITION PROCESSING TO ENDPOINTED AUDIO SIGNAL

Patent Application Publication    Oct. 26, 2006    Sheet 2 of 5    US 2006/0241948 A1

FIG. 2 (flow diagram of the method 200):

  202  START
  204  RECEIVE AUDIO SIGNAL
  206  PERFORM ENDPOINTING SEARCH USING ENDPOINTING HMM TO DETECT
       SPEECH IN RECEIVED AUDIO SIGNAL
  208  BACK UP A PREDEFINED NUMBER OF FRAMES
  210  COMMENCE RECOGNITION PROCESSING STARTING AT NEW START FRAME
  212  DETECT END OF SPEECH
  214  TERMINATE RECOGNITION PROCESSING AND OUTPUT RECOGNIZED SPEECH

Patent Application Publication    Oct. 26, 2006    Sheet 3 of 5    US 2006/0241948 A1

FIG. 3 (flow diagram of the method 300):

  304  COUNT NUMBER OF FRAMES OF AUDIO SIGNAL IN WHICH THE MOST LIKELY
       WORD IS SPEECH IN THE N1 PRECEDING FRAMES
  306  DOES NUMBER EXCEED FIRST PREDEFINED THRESHOLD?
         NO:  310  CONTINUE TO SEARCH AUDIO SIGNAL
         YES: 308  BACK UP TO B FRAMES BEFORE FIRST SPEECH FRAME IN
                   SEQUENCE, IN ACCORDANCE WITH STEP 208 OF THE
                   METHOD 200

Patent Application Publication    Oct. 26, 2006    Sheet 4 of 5    US 2006/0241948 A1

FIG. 4 (flow diagram of the method 400):

  402  START
  404  IDENTIFY MOST LIKELY WORD IN ENDPOINTING SEARCH
  406  IS MOST LIKELY WORD SPEECH?
         YES: 408  COMPUTE MOST LIKELY WORD'S DURATION BACK TO MOST
                   RECENT PAUSE-TO-SPEECH TRANSITION
              410  DOES DURATION MEET OR EXCEED FIRST PREDEFINED
                   THRESHOLD? IF YES: 412  DETECT START OF SPEECH
         NO:  414  IS MOST LIKELY WORD FRAME SUBSEQUENT TO SPEECH
                   STARTING FRAME?
              416  COMPUTE PAUSE DURATION BACK TO LAST SPEECH-TO-PAUSE
                   TRANSITION
              418  DOES DURATION MEET OR EXCEED SECOND PREDEFINED
                   THRESHOLD? IF YES: 420  DETECT END OF SPEECH
  422  END

Patent Application Publication    Oct. 26, 2006    Sheet 5 of 5    US 2006/0241948 A1

FIG. 5 (general purpose computing device 500):

  502  PROCESSOR
  504  MEMORY
  506  I/O DEVICE, e.g. STORAGE

METHOD AND APPARATUS FOR OBTAINING COMPLETE SPEECH SIGNALS FOR SPEECH
RECOGNITION APPLICATIONS

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Provisional Patent
Application No. 60/606,644, filed Sep. 1, 2004 (entitled "Method and
Apparatus for Obtaining Complete Speech Signals for Speech Recognition
Applications"), which is herein incorporated by reference in its
entirety.

REFERENCE TO GOVERNMENT FUNDING

[0002] This invention was made with Government support under contract
number DAAH01-00-C-R003, awarded by the Defense Advanced Research
Projects Agency, and under contract number NAG2-1568, awarded by NASA.
The Government has certain rights in this invention.

FIELD OF THE INVENTION

[0003] The present invention relates generally to the field of speech
recognition and relates more particularly to methods for obtaining
speech signals for speech recognition applications.

BACKGROUND OF THE DISCLOSURE

[0004] The accuracy of existing speech recognition systems is often
adversely impacted by an inability to obtain a complete speech signal
for processing. For example, imperfect synchronization between a
user's actual speech signal and the times at which the user commands
the speech recognition system to listen for the speech signal can
cause an incomplete speech signal to be provided for processing. For
instance, a user may begin speaking before he provides the command to
process his speech (e.g., by pressing a button), or he may terminate
the processing command before he is finished uttering the speech
signal to be processed (e.g., by releasing or pressing a button). If
the speech recognition system does not "hear" the user's entire
utterance, the results that the speech recognition system subsequently
produces will not be as accurate as otherwise possible. In
open-microphone applications, audio gaps between two utterances (e.g.,
due to latency or other factors) can also produce incomplete results
if an utterance is started during the audio gap.

[0005] Poor endpointing (e.g., determining the start and the end of
speech in an audio signal) can also cause incomplete or inaccurate
results to be produced. Good endpointing increases the accuracy of
speech recognition results and reduces speech recognition system
response time by eliminating background noise, silence, and other
non-speech sounds (e.g., breathing, coughing, and the like) from the
audio signal prior to processing. By contrast, poor endpointing may
produce more flawed speech recognition results or may require the
consumption of additional computational resources in order to process
a speech signal containing extraneous information. Efficient and
reliable endpointing is therefore extremely important in speech
recognition applications.

[0006] Conventional endpointing methods typically use short-time
energy or spectral energy features (possibly augmented with other
features such as zero-crossing rate, pitch, or duration information)
in order to determine the start and the end of speech in a given
audio signal. However, such features become less reliable under
conditions of actual use (e.g., noisy real-world situations), and some
users elect to disable endpointing capabilities in such situations
because they contribute more to recognition error than to recognition
accuracy.

[0007] Thus, there is a need in the art for a method and apparatus for
obtaining complete speech signals for speech recognition applications.

SUMMARY OF THE INVENTION

[0008] In one embodiment, the present invention relates to a method
and apparatus for obtaining complete speech signals for speech
recognition applications. In one embodiment, the method continuously
records an audio stream, which is converted to a sequence of frames of
acoustic speech features and stored in a circular buffer. When a user
command to commence or terminate speech recognition is received, the
method obtains a number of frames of the audio stream occurring before
or after the user command in order to identify an augmented audio
signal for speech recognition processing.

[0009] In further embodiments, the method analyzes the augmented
audio signal in order to locate starting and ending speech endpoints
that bound at least a portion of speech to be processed for
recognition. At least one of the speech endpoints is located using a
Hidden Markov Model.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] The teachings of the present invention can be readily
understood by considering the following detailed description in
conjunction with the accompanying drawings, in which:

[0011] FIG. 1 is a flow diagram illustrating one embodiment of a
method for speech recognition processing of an augmented audio stream,
according to the present invention;

[0012] FIG. 2 is a flow diagram illustrating one embodiment of a
method for performing endpoint searching and speech recognition
processing on an audio signal;

[0013] FIG. 3 is a flow diagram illustrating a first embodiment of a
method for performing an endpointing search using an endpointing HMM,
according to the present invention;

[0014] FIG. 4 is a flow diagram illustrating a second embodiment of a
method for performing an endpointing search using an endpointing HMM,
according to the present invention; and

[0015] FIG. 5 is a high-level block diagram of the present invention
implemented using a general purpose computing device.

[0016] To facilitate understanding, identical reference numerals have
been used, where possible, to designate identical elements that are
common to the figures.

DETAILED DESCRIPTION

[0017] The present invention relates to a method and apparatus for
obtaining an improved audio signal for speech recognition processing,
and to a method and apparatus for improved endpointing for speech
recognition. In one embodiment, an audio stream is recorded
continuously by a speech recognition system, enabling the speech
recognition system to retrieve portions of a speech signal that
conventional speech recognition systems might miss due to user
commands that are not properly synchronized with user utterances.

[0018] In further embodiments of the invention, one or more Hidden
Markov Models (HMMs) are employed to endpoint an audio signal in real
time in place of a conventional signal processing endpointer. Using
HMMs for this function enables speech start and end detection that is
faster and more robust to noise than conventional endpointing
techniques.

[0019] FIG. 1 is a flow diagram illustrating one embodiment of a
method 100 for speech recognition processing of an augmented audio
stream, according to the present invention. The method 100 is
initialized at step 102 and proceeds to step 104, where the method 100
continuously records an audio stream (e.g., a sequence of audio frames
containing user speech, background audio, etc.) to a circular buffer.
In step 106, the method 100 receives a user command (e.g., via a
button press or other means) to commence speech recognition, at time
t=Ts.

[0020] In step 108, the user begins speaking, at time t=S. The user
command to commence speech recognition, received at time t=Ts, and the
actual start of the user speech, at time t=S, are only approximately
synchronized; the user may begin speaking before or after the command
to commence speech recognition received in step 106.

[0021] Once the user begins speaking, the method 100 proceeds to step
110 and requests a portion of the recorded audio stream from the
circular buffer starting at time t=Ts-N1, where N1 is an interval of
time such that Ts-N1 ≤ S ≤ Ts most of the time. In one embodiment, the
interval N1 is chosen by analyzing real or simulated user data and
selecting the minimum value of N1 that minimizes the speech
recognition error rate on that data. In some embodiments, a sufficient
value for N1 is in the range of tenths of a second. In another
embodiment, where the audio signal for speech recognition processing
has been acquired using an open microphone mode, N1 is approximately
equal to Ts-Tp, where Tp is the absolute time at which the previous
speech recognition process on the previous utterance ended. Thus, the
current speech recognition process will start on the first audio frame
that was not recognized in the previous speech recognition processing.

[0022] In step 112, the method 100 receives a user command (e.g., via
a button press or other means) to terminate speech recognition, at
time t=TE. In step 114, the user stops speaking, at time t=E. The user
command to terminate speech recognition, received at time t=TE, and
the actual end of the user speech, at time t=E, are only approximately
synchronized; the user may stop speaking before or after the command
to terminate speech recognition received in step 112.

[0023] In step 116, the method 100 requests a portion of the audio
stream from the circular buffer up to time t=TE+N2, where N2 is an
interval of time such that Ts ≤ E < TE+N2 most of the time. In one
embodiment, N2 is chosen by analyzing real or simulated user data and
selecting the minimum value of N2 that minimizes the speech
recognition error rate on that data. Thus, an augmented audio signal
starting at time Ts-N1 and ending at time TE+N2 is identified.

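The buffering of steps 104-116 can be sketched as a ring buffer
indexed by frame number. This is only an illustrative sketch under
simplifying assumptions not stated in the specification: the class and
function names are invented here, and Ts, TE, N1 and N2 are treated as
integer frame indices rather than times.

```python
from collections import deque

class CircularAudioBuffer:
    """Ring buffer of audio frames (step 104): once capacity is
    reached, the newest frames overwrite the oldest."""

    def __init__(self, capacity_frames):
        self.buf = deque(maxlen=capacity_frames)
        self.total = 0  # index of the next frame to be written

    def push(self, frame):
        self.buf.append(frame)
        self.total += 1

    def window(self, start, end):
        """Frames with indices in [start, end), clamped to what is
        still buffered (older frames may have been overwritten)."""
        oldest = self.total - len(self.buf)
        start = max(start, oldest)
        end = min(end, self.total)
        return [self.buf[i - oldest] for i in range(start, end)]

def augmented_window(buf, Ts, TE, N1, N2):
    """Steps 110 and 116: widen the user-commanded span [Ts, TE]
    by N1 frames before and N2 frames after."""
    return buf.window(Ts - N1, TE + N2)
```

Because recording continues in parallel (step 104), the retrievable
window is limited only by the buffer's capacity, which is what lets
the method "back up" behind a late start command.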
[0024] In step 118 (illustrated in phantom), the method 100 optionally
performs an endpoint search on at least a portion of the augmented
audio signal. In one embodiment, an endpointing search in accordance
with step 118 is performed using a conventional endpointing technique.
In another embodiment, an endpointing search in accordance with step
118 is performed using one or more Hidden Markov Models (HMMs), as
described in further detail below in connection with FIG. 2.

[0025] In step 120, the method 100 applies speech recognition
processing to the endpointed audio signal. Speech recognition
processing may be applied in accordance with any known speech
recognition technique.

[0026] The method 100 then returns to step 104 and continues to record
the audio stream to the circular buffer. Recording of the audio stream
to the circular buffer is performed in parallel with the speech
recognition processes, e.g., steps 106-120 of the method 100.

[0027] The method 100 affords greater flexibility in choosing speech
signals for recognition processing than conventional speech
recognition techniques. Importantly, the method 100 improves the
likelihood that a user's entire utterance is provided for recognition
processing, even when user operation of the speech recognition system
would normally provide an incomplete speech signal. Because the method
100 continuously records the audio stream containing the speech
signals, the method 100 can "back up" or "go forward" to retrieve
portions of a speech signal that conventional speech recognition
systems might miss due to user commands that are not properly
synchronized with user utterances. Thus, more complete and more
accurate speech recognition results are produced.

[0028] Moreover, because the audio stream is continuously recorded
even when speech is not being actively processed, the method 100
enables new interaction strategies. For example, speech recognition
processing can be applied to an audio stream immediately upon command,
from a specified point in time (e.g., in the future or recent past),
or from a last detected speech endpoint (e.g., a speech starting or
speech ending point), among other times. Thus, speech recognition can
be performed, on the user's command, from a frame that is not
necessarily the most recently recorded frame (e.g., occurring some
time before or after the most recently recorded frame).

[0029] FIG. 2 is a flow diagram illustrating one embodiment of a
method 200 for performing endpoint searching and speech recognition
processing on an audio signal, e.g., in accordance with steps 118-120
of FIG. 1. The method 200 is initialized at step 202 and proceeds to
step 204, where the method 200 receives an audio signal, e.g., from
the method 100.

[0030] In step 206, the method 200 performs a speech endpointing
search using an endpointing HMM to detect the start of the speech in
the received audio signal. In one embodiment, the endpointing HMM
recognizes speech and silence in parallel, enabling the method 200 to
hypothesize the start of speech when speech is more likely than
silence. Many topologies can be used for the speech HMM, and a
standard silence HMM may also be used. In one embodiment, the topology
of the speech HMM is defined as a sequence of one or more reject
"phones", where a reject phone is an HMM model trained on all types of
speech. In another embodiment, the topology of the speech HMM is
defined as a sequence (or sequence of loops) of context-independent
(CI) or other phones. In further embodiments, the endpointing HMM has
a pre-determined but configurable minimum duration, which may be a
function of the number of reject or other phones in sequence in the
speech HMM, and which enables the endpointer to more easily reject
short noises as speech.

[0031] In one embodiment, the method 200 identifies the speech
starting frame when it detects a predefined sufficient number of
frames of speech in the audio signal. The number of frames of speech
that are required to indicate a speech endpoint may be adjusted as
appropriate for different speech recognition applications. Embodiments
of methods for implementing an endpointing HMM in accordance with step
206 are described in further detail below with reference to FIGS. 3-4.

[0032] In step 208, once the speech starting frame, Fs, is detected,
the method 200 backs up a pre-defined number B of frames to a frame
Fs' preceding the speech starting frame Fs, such that Fs'=Fs-B becomes
the new "start" frame for the speech for the purposes of the speech
recognition process. In one embodiment, the number B of frames by
which the method 200 backs up is relatively small (e.g., approximately
10 frames), but is large enough to ensure that the speech recognition
process begins on a frame of silence.

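The frame arithmetic of step 208 amounts to a single subtraction; as a
minimal sketch (the function name is invented here, and the clamp to
frame 0 is an added safeguard, not part of the text):

```python
def backed_up_start_frame(Fs, B=10):
    """Step 208: recognition starts B frames before the detected
    speech starting frame Fs (B is approximately 10 frames in one
    embodiment), clamped so the result is never negative."""
    return max(0, Fs - B)
```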
[0033] In step 210, the method 200 commences recognition processing
starting from the new start frame Fs' identified in step 208. In one
embodiment, recognition processing is performed in accordance with
step 210 using a standard speech recognition HMM separate from the
endpointing HMM.

[0034] In step 212, the method 200 detects the end of the speech to be
processed. In one embodiment, a speech "end" frame is detected when
the recognition process started in step 210 of the method 200 detects
a predefined sufficient number of frames of silence following frames
of speech. In one embodiment, the number of frames of silence that are
required to indicate a speech endpoint is adjustable based on the
particular speech recognition application. In another embodiment, the
ending/silence frames might be required to legally end the speech
recognition grammar, forcing the endpointer not to detect the end of
speech until a legal ending point. In another embodiment, the speech
end frame is detected using the same endpointing HMM used to detect
the speech start frame. Embodiments of methods for implementing an
endpointing HMM in accordance with step 212 are described in further
detail below with reference to FIGS. 3-4.

[0035] In step 214, the method 200 terminates speech recognition
processing and outputs recognized speech, and in step 216, the method
200 terminates.

[0036] Implementation of endpointing HMMs in conjunction with the
method 200 enables more accurate detection of speech endpoints in an
input audio signal, because the method 200 does not have any internal
parameters that directly depend on the characteristics of the audio
signal and that require extensive tuning. Moreover, the method 200
does not utilize speech features that are unreliable in noisy
environments. Furthermore, because the method 200 requires minimal
computation (e.g., processing while detecting the start and the end of
speech is minimal), speech recognition results can be produced more
rapidly than is possible by conventional speech recognition systems.
Thus, the method 200 can rapidly and reliably endpoint an input speech
signal in virtually any environment.

[0037] Moreover, implementation of the method 200 in conjunction with
the method 100 improves the likelihood that a user's complete
utterance is provided for speech recognition processing, which
ultimately produces more complete and more accurate speech recognition
results.

[0038] FIG. 3 is a flow diagram illustrating a first embodiment of a
method 300 for performing an endpointing search using an endpointing
HMM, according to the present invention. The method 300 may be
implemented in accordance with step 206 and/or step 212 of the method
200 to detect endpoints of speech in an audio signal received by a
speech recognition system.

[0039] The method 300 is initialized at step 302 and proceeds to step
304, where the method 300 counts a number, F1, of frames of the
received audio signal in which the most likely word (e.g., according
to the standard HMM Viterbi search criteria) is speech in the last N1
preceding frames. In one embodiment, N1 is a predefined parameter that
is configurable based on the particular speech recognition application
and the desired results. Once the number F1 of frames is determined,
the method 300 proceeds to step 306 and determines whether the number
F1 of frames exceeds a first predefined threshold, T1. Again, the
first predefined threshold, T1, is configurable based on the
particular speech recognition application and the desired results.

[0040] If the method 300 concludes in step 306 that F1 does not exceed
T1, the method 300 proceeds to step 310 and continues to search the
audio signal for a speech endpoint, e.g., by returning to step 304,
incrementing the location in the speech signal by one frame, and
continuing to count the number of speech frames in the last N1 frames
of the audio signal. Alternatively, if the method 300 concludes in
step 306 that F1 does exceed T1, the method 300 proceeds to step 308
and defines the first frame Fs of the frame sequence that includes the
number (F1) of frames as the speech starting point. The method 300
then backs up to a predefined number B of frames before the speech
starting frame for speech recognition processing, e.g., in accordance
with step 208 of the method 200. In one embodiment, values for the
parameters N1 and T1 are determined to simultaneously minimize the
probability of detecting short noises as speech and maximize the
probability of detecting single, short words (e.g., "yes" or "no") as
speech.

[0041] In one embodiment, the method 300 may be adapted to detect the
speech stopping frame as well as the speech starting frame (e.g., in
accordance with step 212 of the method 200). However, in step 304, the
method 300 would count the number, F2, of frames of the received audio
signal in which the most likely word is silence in the last N2
preceding frames. Then, when that number, F2, meets a second
predefined threshold, T2, speech recognition processing is terminated
(e.g., effectively identifying the frame at which recognition
processing is terminated as the speech endpoint). In either case, the
method 300 is robust to noise and produces accurate speech recognition
results with minimal computational complexity.

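The counting rule of steps 304-308 can be sketched over per-frame
flags that are True where the most likely word is speech. This is an
illustrative sketch, not the claimed implementation: the function name
is invented, and the flags are assumed to come from the Viterbi search
described above.

```python
def find_speech_start(is_speech, N1, T1):
    """Slide frame by frame (step 310); when more than T1 of the last
    N1 frames are speech (step 306), return the first speech frame of
    that window as the starting point (step 308), else None."""
    for t in range(len(is_speech)):
        lo = max(0, t - N1 + 1)
        window = is_speech[lo: t + 1]
        if sum(window) > T1:
            # first speech frame of the qualifying sequence
            return lo + window.index(True)
    return None
```

The end-of-speech variant of paragraph [0041] is symmetric: count
silence flags against N2 and T2 instead.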
[0042] FIG. 4 is a flow diagram illustrating a second embodiment of a
method 400 for performing an endpointing search using an endpointing
HMM, according to the present invention. Similar to the method 300,
the method 400 may be implemented in accordance with step 206 and/or
step 212 of the method 200 to detect endpoints of speech in an audio
signal received by a speech recognition system.

[0043] The method 400 is initialized at step 402 and proceeds to step
404, where the method 400 identifies the most likely word in the
endpointing search (e.g., in accordance with the standard Viterbi HMM
search algorithm).

[0044] In order to determine the speech starting endpoint, in step 406
the method 400 determines whether the most likely word identified in
step 404 is speech or silence. If the method 400 concludes that the
most likely word is speech, the method 400 proceeds to step 408 and
computes the duration, Ds, back to the most recent pause-to-speech
transition.

[0045] In step 410, the method 400 determines whether the duration Ds
meets or exceeds a first predefined threshold T1. If the method 400
concludes that the duration Ds does not meet or exceed T1, then the
method 400 determines that the identified most likely word does not
represent a starting endpoint of the speech, and the method 400
processes the next audio frame and returns to step 404 to continue the
search for a starting endpoint.

[0046] Alternatively, if the method 400 concludes in step 410 that the
duration Ds does meet or exceed T1, then the method 400 proceeds to
step 412 and identifies the first frame Fs of the most likely speech
word identified in step 404 as a speech starting endpoint. Note that
according to step 208 of the method 200, speech recognition processing
will start some number B of frames before the speech starting point
identified in step 404 of the method 400, at frame Fs'=Fs-B. The
method 400 then terminates in step 422.

[0047] To determine the speech ending endpoint, referring back to step
406, if the method 400 concludes that the most likely word identified
in step 404 is not speech (i.e., is silence), the method 400 proceeds
to step 414, where the method 400 confirms that the frame(s) in which
the most likely word appears is subsequent to the frame representing
the speech starting point. If the method 400 concludes that the frame
in which the most likely word appears is not subsequent to the frame
of the speech starting point, then the method 400 concludes that the
most likely word identified in step 404 is not a speech endpoint and
returns to step 404 to process the next audio frame and continue the
search for a speech endpoint.

[0048] Alternatively, if the method 400 concludes in step 414 that the
frame in which the most likely word appears is subsequent to the frame
of the speech starting point, the method 400 proceeds to step 416 and
computes the duration, Dp, back to the most recent speech-to-pause
transition.

[0049] In step 418, the method 400 determines whether the duration,
Dp, meets or exceeds a second predefined threshold T2. If the method
400 concludes that the duration Dp does not meet or exceed T2, then
the method 400 determines that the identified most likely word does
not represent an endpoint of the speech, and the method 400 processes
the next audio frame and returns to step 404 to continue the search
for an ending endpoint.

[0050] However, if the method 400 concludes in step 418 that the
duration Dp does meet or exceed T2, then the method 400 proceeds to
step 420 and identifies the most likely word identified in step 404 as
a speech endpoint (specifically, as a speech ending endpoint). The
method 400 then terminates in step 422.

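The duration tests of steps 404-420 can be sketched over a per-frame
sequence of "speech"/"pause" labels standing in for the most likely
word at each frame. The function name and this flat label
representation are assumptions made for illustration; a real
implementation would read durations from the Viterbi backtrace.

```python
def endpoint_by_duration(labels, T1, T2):
    """Detect the start when a speech run back to the last
    pause-to-speech transition lasts at least T1 frames (steps
    408-412), and the end when a later pause run lasts at least T2
    frames (steps 414-420). Returns (start_frame, end_frame); either
    may be None if not found."""
    start = end = None
    run_begin = 0  # frame of the most recent label transition
    for t, label in enumerate(labels):
        if t > 0 and labels[t - 1] != label:
            run_begin = t
        duration = t - run_begin + 1
        if start is None and label == "speech" and duration >= T1:
            start = run_begin                       # step 412
        elif (start is not None and label == "pause"
              and run_begin > start and duration >= T2):
            end = run_begin                         # step 420
            break
    return start, end
```

The `run_begin > start` check mirrors step 414: a pause run only
counts toward the ending endpoint if it begins after the detected
start of speech.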
[0051] The method 400 produces accurate speech recognition results in
a manner that is more robust to noise, but more computationally
complex, than the method 300. Thus, the method 400 may be implemented
in cases where greater noise robustness is desired and the additional
computational complexity is less of a concern. The method 300 may be
implemented in cases where it is not feasible to determine the
duration back to the most recent pause-to-speech or speech-to-pause
transition (e.g., when backtrace information is limited due to memory
constraints).

[0052] In one embodiment, when determining the speech ending frame in
step 418 of the method 400, an additional requirement that the speech
ending word legally ends the speech recognition grammar can prevent
premature speech endpoint detection when a user utters a long pause in
the middle of an utterance.

[0053] FIG. 5 is a high-level block diagram of the present invention
implemented using a general purpose computing device 500. It should be
understood that the digital scheduling engine, manager or application
(e.g., for endpointing audio signals for speech recognition) can be
implemented as a physical device or subsystem that is coupled to a
processor through a communication channel. Therefore, in one
embodiment, a general purpose computing device 500 comprises a
processor 502, a memory 504, a speech endpointer or module 505 and
various input/output (I/O) devices 506 such as a display, a keyboard,
a mouse, a modem, and the like. In one embodiment, at least one I/O
device is a storage device (e.g., a disk drive, an optical disk drive,
a floppy disk drive).

[0054] Alternatively, the digital scheduling engine, manager or
application (e.g., speech endpointer 505) can be represented by one or
more software applications (or even a combination of software and
hardware, e.g., using Application Specific Integrated Circuits
(ASICs)), where the software is loaded from a storage medium (e.g.,
I/O devices 506) and operated by the processor 502 in the memory 504
of the general purpose computing device 500. Thus, in one embodiment,
the speech endpointer 505 for endpointing audio signals described
herein with reference to the preceding Figures can be stored on a
computer readable medium or carrier (e.g., RAM, magnetic or optical
drive or diskette, and the like).

[0055] The endpointing methods of the present invention may also be
easily implemented in a variety of existing speech recognition
systems, including systems using "hold-to-talk", "push-to-talk", "open
microphone", "barge-in" and other audio acquisition techniques.
Moreover, the simplicity of the endpointing methods enables the
endpointing methods to automatically take advantage of improvements to
a speech recognition system's acoustic speech features or acoustic
models with little or no modification to the endpointing methods
themselves. For example, upgrades or improvements to the noise
robustness of the system's speech features or acoustic models
correspondingly improve the noise robustness of the endpointing
methods employed.

[0056] Thus, the present invention represents a significant
advancement in the field of speech recognition. One or more Hidden
Markov Models are implemented to endpoint (potentially augmented)
audio signals for speech recognition processing, resulting in an
endpointing method that is more efficient, more robust to noise and
more reliable than existing endpointing methods. The method is more
accurate and less computationally complex than conventional methods,
making it especially useful for speech recognition applications in
which input audio signals may contain background noise and/or other
non-speech sounds.

[0057] Although various embodiments which incorporate the teachings of
the present invention have been shown and described in detail herein,
those skilled in the art can readily devise many other varied
embodiments
