(12) Patent Application Publication        (10) Pub. No.: US 2007/0043563 A1
     Comerford et al.                      (43) Pub. Date: Feb. 22, 2007

(54) METHODS AND APPARATUS FOR BUFFERING DATA FOR USE IN
     ACCORDANCE WITH A SPEECH RECOGNITION SYSTEM

(75) Inventors: Liam D. Comerford, Carmel, NY (US); David Carl Frank,
     Ossining, NY (US); Burn L. Lewis, Ossining, NY (US); Leonid
     Rachevsky, Ossining, NY (US); Mahesh Viswanathan, Yorktown
     Heights, NY (US)

     Correspondence Address:
     Ryan, Mason & Lewis, LLP
     90 Forest Avenue
     Locust Valley, NY 11560 (US)

(73) Assignee: International Business Machines Corporation, Armonk, NY

(21) Appl. No.: 11/209,004

(22) Filed: Aug. 22, 2005

                    Publication Classification

(51) Int. Cl.
     G10L 15/06          (2006.01)
(52) U.S. Cl. .............................................................. 704/243

(57)                        ABSTRACT
`
Techniques are disclosed for overcoming errors in speech recognition systems. For example, a technique for processing acoustic data in accordance with a speech recognition system comprises the following steps/operations. Acoustic data is obtained in association with the speech recognition system. The acoustic data is recorded using a combination of a first buffer area and a second buffer area, such that the recording of the acoustic data using the combination of the two buffer areas at least substantially minimizes one or more truncation errors associated with operation of the speech recognition system.
`
Representative drawing: FIG. 2 flow diagram (reproduced as Sheet 2 of 4 below).
`
`
`
FIG. 1 (Sheet 1 of 4): diagram of the anti-truncation buffering arrangement, showing the circular buffer (110) and the linear buffer (120) described below.
`
`
`
FIG. 2 (Sheet 2 of 4), flow diagram: 201 start; 202 initialize buffers; 203 begin circular buffer recording; 204 MICON received?; 205 save circular buffer write pointer address; 206 set write pointer to start of linear buffer; 207 prepend linear copy of circular buffer to linear buffer; 208 find silence in composite audio buffer; 209 pass buffer start address to ASR and start decoding; 210 wait for ASR silence detection; 211 halt recording; 212 overwrite circular buffer; 213 restart circular buffer recording; 214 allocate new or clear linear buffer.
`
`
`
FIG. 3 (Sheet 3 of 4): small linear buffer (300) with start (302), current write position (304) and finish (306); large circular buffer (308) whose filling proceeds "clockwise" from the buffer start and end point (310, where filling begins on first use), with its current write position (312); regions of low amplitude sounds (314) and high amplitude sounds (316); speech to silence transition detected (318); silence to speech transition detected (320).
`
`
`
FIG. 4 (Sheet 4 of 4), buffer management process flow diagram: 400 start; 402 initialize ASR, circular buffer and other software components; 406 start audio recording process and speech transition detection; 410 append linear buffer contents to circular buffer at current write position; further steps (reference numerals including 424 and 428 appear in the drawing) examine the speech transition record to determine whether the user is speaking, test whether the system is in a silent buffer interval, construct a data buffer for the ASR either from circular buffer data or from leading silence and incoming audio, start the ASR recognition process, and, when the ASR detects silence, store the transition and the current circular buffer write position.
FIG. 5: computing system 500 comprising a processor, a memory and a user interface (reference numeral 540 also appears in the drawing).
`
`
`
`
`METHODS AND APPARATUS FOR BUFFERING
`DATA FOR USE IN ACCORDANCE WITH A
`SPEECH RECOGNITION SYSTEM
`
`FIELD OF INVENTION
0001. The present invention relates generally to speech processing systems and, more particularly, to speech recognition systems.
`
`BACKGROUND OF THE INVENTION
0002. Speech recognition systems may be described in terms of several properties, including whether they use discrete word vocabularies (typical of large vocabulary recognizers) or grammar-based vocabularies (typical of small vocabulary recognizers), and whether they continuously process an uninterrupted stream of input audio or commence processing on command (typically a "microphone on" or "MICON" event). Recognizers that use control events may terminate recognition on an external event (typically a "microphone off" or "MICOFF" event), completion of processing of an audio buffer, or detection of silence in the buffered audio data. The processed audio stream in any case may be "live" or streamed from a buffer.
0003. It is a common problem of continuously operated recognition systems that they generate large numbers of errors of recognition and spurious recognition output at times that the recognition system is not being addressed. For example, in a vehicle-based speech recognition system, this problem may occur due to audio from the radio, person-to-person conversation, and/or noise. This fact makes the use of a microphone button or other dialog pacing mechanism almost universal in automotive (telematic) speech recognition applications.
0004. It is a common problem of microphone-button paced speech applications that the application user fails to operate the button correctly. The two typical errors are completing the pushing of the microphone-on button after speech has already begun and releasing the microphone-on button (or pushing the microphone-off button) before speech has ended. In either case, some speech intended to be recognized is being cut off due to these errors of operation by the user.
`0005 Accordingly, techniques for overcoming errors in
`speech recognition systems are needed.
`
`SUMMARY OF THE INVENTION
0006. Principles of the present invention provide techniques for overcoming errors in speech recognition systems.
0007. It is to be understood that any errors that may occur in a speech recognition system that could cause loss of speech intended to be recognized will be generally referred to herein as "truncation errors." While two examples of user operation failure in microphone-button paced speech applications are given above, it is to be understood that the phrase "truncation error" is not intended to be limited thereto.
0008. In one aspect of the invention, a technique for processing acoustic data in accordance with a speech recognition system comprises the following steps/operations. Acoustic data is obtained in association with the speech recognition system. The acoustic data is recorded using a combination of a first buffer area and a second buffer area, such that the recording of the acoustic data using the combination of the two buffer areas at least substantially minimizes one or more truncation errors associated with operation of the speech recognition system.
0009. It is to be appreciated that the data structure types chosen for each of the buffer areas reflect the functions of those areas. By way of example, the first buffer type may be chosen to allow continuous recording so that the most recent few seconds (depending on the buffer size) of acoustic data are available for processing. Data structures such as circular or ring buffers and FIFO (First In First Out) stacks are suitable examples. The second buffer type may be chosen to ensure that the system will not run out of buffer space when recording long utterances. Linked lists or appropriately large memory blocks are exemplary data structures for implementing such buffers.
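By way of illustration only, the two buffer areas might be declared as in the following C sketch. The structure names, field names, element type and sizes are hypothetical choices made for this example and do not limit the description above.

#include <stdlib.h>

/* First buffer area: a fixed-size circular (ring) buffer holding the
   most recent few seconds of encoded audio.                           */
typedef struct {
    short  *samples;      /* backing storage for encoded audio samples */
    size_t  capacity;     /* fixed number of samples                   */
    size_t  write_pos;    /* current write position; wraps to 0        */
} ring_buffer_t;

/* Second buffer area: a growable linear buffer for long utterances.   */
typedef struct {
    short  *samples;      /* reallocated (grown) as needed             */
    size_t  capacity;     /* samples currently allocated               */
    size_t  length;       /* samples written so far                    */
} linear_buffer_t;

static int ring_buffer_init(ring_buffer_t *rb, size_t capacity)
{
    rb->samples   = calloc(capacity, sizeof *rb->samples);
    rb->capacity  = capacity;
    rb->write_pos = 0;
    return rb->samples ? 0 : -1;
}

static int linear_buffer_init(linear_buffer_t *lb, size_t initial_capacity)
{
    lb->samples  = malloc(initial_capacity * sizeof *lb->samples);
    lb->capacity = initial_capacity;
    lb->length   = 0;
    return lb->samples ? 0 : -1;
}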
0010. It is to be further appreciated that the acoustic data referenced here may be, by way of example, digital representations of speech and other audio signals present at a system input microphone. It is, in any case, data comprising acoustic features or data in a format that is suitable for extracting acoustic features, as is well known in the art. Such features may be used to determine whether speech was taking place at the time of the recording or not. Further, such data may be decoded into text that represents the words uttered by the user to create a part of the acoustic data.
0011. In one embodiment, the recording step/operation may further comprise: recording acoustic data obtained by the speech recognition system in the first buffer area; stopping recording of acoustic data in the first buffer area and starting recording of acoustic data obtained by the speech recognition system at the start of the second buffer area, when an indication that the speech recognition system is being addressed is detected; and prepending, to the beginning of the acoustic data stored in the second buffer area, acoustic data in the first buffer. This data may be arranged so that the oldest acoustic data in the first buffer is located at the start of the segment prepended to the second buffer. This means that the acoustic data recorded immediately before the indication that the system is being addressed ends the prepended segment and is contiguous in memory with the acoustic data which immediately followed the "being addressed" event that is stored in the second area.
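A minimal C sketch of the prepending arrangement just described is given below, under the assumption that the circular buffer contents, the saved write position and the linear buffer contents are available as arrays; all identifiers are hypothetical. The copy is ordered oldest-first so that the newest circular-buffer sample is contiguous with the first sample recorded after the "being addressed" event.

#include <stdlib.h>
#include <string.h>

/* Build a composite buffer: [circular contents, oldest first][linear contents].
   circ       : circular buffer storage holding circ_cap samples
   circ_write : write position saved when the "being addressed" event arrived
   lin        : samples recorded into the second (linear) buffer after the event
   Returns a newly allocated composite buffer of circ_cap + lin_len samples,
   or NULL on allocation failure.                                             */
static short *prepend_circular(const short *circ, size_t circ_cap,
                               size_t circ_write,
                               const short *lin, size_t lin_len)
{
    short *composite = malloc((circ_cap + lin_len) * sizeof *composite);
    if (composite == NULL)
        return NULL;

    size_t oldest_len = circ_cap - circ_write;           /* oldest samples first */
    memcpy(composite,              circ + circ_write, oldest_len * sizeof *circ);
    memcpy(composite + oldest_len, circ,              circ_write * sizeof *circ);
    /* The newest circular-buffer sample now sits immediately before the
       first sample recorded after the "being addressed" event.          */
    memcpy(composite + circ_cap,   lin,               lin_len    * sizeof *lin);
    return composite;
}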
0012. The technique may further comprise processing the acoustic data in the composite buffer area (prepended area and second buffer area) to detect features indicating silence. The location of the silence closest to the end of the prepended segment may then be used as the location in the composite buffer at which speech intended for the system to process begins. This silence will be in the prepended segment if the indication of speech was given after speech started. It will follow the end of the prepended segment if speech began after the indication event.
`0013 The technique may further comprise decoding
`acoustic data in the composite buffer area from the acoustic
`data format into text. The decoding of acoustic data in the
`composite buffer area may begin when the starting silence
`location has been established from the acoustic data.
0014. The recording of acoustic data in the second buffer area may continue until an indication that the speech recognition system is no longer being addressed is detected and a silence indication is detected in the acoustic data recorded in the second buffer area. Recording of acoustic data in the second buffer area may stop and recording of acoustic data in the first buffer area may restart, when the indication that the speech recognition system is no longer being addressed is detected and the silence indication is detected in the acoustic data recorded in the second buffer area.
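For illustration only, the stop-and-restart condition described above might be expressed as in the following C sketch; the helper functions are hypothetical placeholders for the indications provided by a particular implementation.

#include <stdbool.h>

/* Hypothetical helpers assumed to be supplied by the surrounding system. */
extern void record_frame_into_second_buffer(void);
extern bool no_longer_addressed_indication_received(void);   /* e.g., MICOFF */
extern bool silence_indication_detected(void);
extern void restart_recording_in_first_buffer(void);

/* Recording into the second buffer area continues until BOTH the "no
   longer being addressed" indication and a silence indication have been
   observed; only then does recording in the first buffer area restart.  */
void record_until_done(void)
{
    for (;;) {
        record_frame_into_second_buffer();
        if (no_longer_addressed_indication_received() &&
            silence_indication_detected())
            break;
    }
    restart_recording_in_first_buffer();
}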
0015. The indication that the speech recognition system is being addressed may comprise a microphone on event, and the indication that the speech recognition system is no longer being addressed may comprise a microphone off event.
`0016. The first buffer area may comprise a circular buffer,
`and the second buffer area may comprise a linear buffer.
`Further, the first buffer area and the second buffer area may
`be at least part of a single storage data structure, or may be
`at least part of separate storage data structures. These buffers
`may be in addition to any buffer resources maintained by the
`speech recognizer.
0017. In another embodiment, the recording step/operation may further comprise: recording acoustic data obtained by the speech recognition system in the first buffer area; appending acoustic data recorded in the first buffer area to the second buffer area, when the first buffer area is full; identifying the existence of a speech region and a silence region in the acoustic data appended to the second buffer area; detecting when an indication that the speech recognition system is being addressed occurs; and filling a recognition buffer area at least with the acoustic data appended to the second buffer area, when a speech region is identified and when the indication that the speech recognition system is being addressed is detected, or filling the recognition buffer area at least with incoming acoustic data for the speech recognition system, when a silence region is identified and when the indication that the speech recognition system is being addressed is detected.
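The selection between the two filling behaviors might be sketched as follows in C; the helper functions are hypothetical and stand in for the recognizer's region records and the buffer management code of a particular implementation.

#include <stdbool.h>

/* Hypothetical helpers standing in for the recognizer's speech/silence
   region records and the buffer management code.                       */
extern bool speech_region_identified(void);
extern void fill_recognition_buffer_from_second_buffer(void);
extern void fill_recognition_buffer_from_incoming_audio(void);

/* On the "being addressed" indication: if a speech region was identified
   in the data already appended to the second buffer area, the recognition
   buffer is filled from that data (speech had begun before the indication);
   if a silence region was identified, it is filled from incoming audio.   */
void on_addressed_indication(void)
{
    if (speech_region_identified())
        fill_recognition_buffer_from_second_buffer();
    else
        fill_recognition_buffer_from_incoming_audio();
}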
0018. These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
0019. FIG. 1 is a diagram illustrating an anti-truncation buffering methodology, according to an embodiment of the invention;
0020. FIG. 2 is a flow diagram illustrating more details of an anti-truncation buffering methodology, according to an embodiment of the invention;
0021. FIG. 3 is a diagram illustrating an anti-truncation buffering methodology, according to another embodiment of the invention;
0022. FIG. 4 is a flow diagram illustrating more details of an anti-truncation buffering methodology, according to another embodiment of the invention; and
0023. FIG. 5 is a diagram illustrating a computing system for use in implementing an anti-truncation buffering methodology, according to an embodiment of the invention.
`
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
0024. It is to be understood that while the present invention may be described below in the context of a particular computing system environment and an illustrative speech recognition application, the invention is not so limited. Rather, the invention is more generally applicable to any computing system environment and any speech recognition application in which it would be desirable to overcome truncation errors.
0025. As used herein, the phrase "acoustic data" generally refers to any acoustic input picked up and transduced by a microphone of the Automatic Speech Recognizer (ASR) including, but not limited to, acoustic input representative of speech and acoustic input representative of silence. Depending on the requirements or features of a particular implementation, acoustic data may refer to compressed audio, audio that has undergone feature extraction and is represented, for example, as cepstra, or any other digital representation of audio that is suitable for silence detection and decoding into text.
0026. As will be explained below, illustrative embodiments of the invention address the truncation error problem by means of a software apparatus employing the "silence detection" capability of an automatic speech recognizer, an audio buffer arrangement, a buffer management algorithm, and the events generated by microphone mechanisms that signal the user's intention to address the system, such as "Push-to-Talk" or "Push-for-Attention" buttons.
0027. When a "Push-to-Talk" button is employed in an apparatus containing an ASR, button depression typically causes a "MICON" event, and button release typically causes a "MICOFF" event. When a "Push-for-Attention" button is employed in an apparatus containing an ASR, button depression typically causes a "MICON" event, and button release typically does not produce any event. Other mechanisms such as video speech recognition may imitate either of these patterns.
0028. Conventional ASRs provide "alignment" data that indicate which parts of an audio stream or buffer have been recognized or decoded into which particular words or silence.
`0029. In a conventional ASR, audio to be recognized
`begins to be stored in buffer memory upon receipt of the
`MICON message and ceases to be stored upon receipt of the
`MICOFF message. In accordance with illustrative principles
`of the invention, the acoustic data, which may include
`speech by the user, is continuously recorded in a circular
`buffer. That is, acoustic data is continuously picked up by the
`microphone of the ASR and is continuously stored in the
`buffer arrangement of the invention, as will be explained in
`detail below, regardless of the receipt of a MICON event or
`a MICOFF event (as will be seen, such events serve to
`trigger which buffer of the buffer arrangement does the
`storing).
0030. As is known, the terms "circular buffer" or "ring buffer" refer to a commonly used programming method in which a region of memory is managed by a software module so that when the region has been filled with incoming data, new data is written beginning again at the start of the memory region. The management software retains the address of the current "write" location and the locations of the beginning and end of the memory region. Any portion of the memory region may be read. This permits the management software to read the data from the memory region as a continuous stream, in the correct order, even if, when the region is viewed as a linear segment of memory, the end of the data appears at a lower memory address than the start of the data. The effective topology of the memory region is thus made into a ring by the managing software. In contrast, a "linear buffer" typically does not have such a wrapping feature, and thus when the end of the buffer is reached, the processor must allocate additional buffer space.
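A minimal sketch of such a wrapping write is shown below in C, assuming a fixed region of samples and a write index maintained by the managing software; the names are illustrative only. Reading the region back in correct temporal order, oldest data first, is illustrated in the prepending sketch given earlier.

#include <stddef.h>

/* Write incoming samples into a fixed memory region, beginning again at
   the start of the region when its end is reached (ring topology).
   Returns the updated write position.                                   */
static size_t ring_write(short *region, size_t capacity, size_t write_pos,
                         const short *in, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        region[write_pos] = in[i];
        write_pos = (write_pos + 1) % capacity;   /* wrap to region start */
    }
    return write_pos;
}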
0031. Accordingly, in accordance with illustrative principles of the invention using a circular buffer and a linear buffer, upon receipt of the MICON message, the ASR and the software system of this invention proceed to:
0032. 1. mark the point in the circular buffer corresponding to the time at which the MICON was received by storing the address of that memory location, and halt recording in that buffer;
0033. 2. buffer all further speech in a separate linear buffer;
0034. 3. prepend a linear copy of the circular buffer to the linear buffer; and
0035. 4. process the resulting composite buffer until the silence closest to the MICON marker is found.
0036. The silence found in step 4 is taken to be the start of the utterance.
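By way of example only, steps 1 through 4 might be coordinated as in the following C sketch; every helper named here is hypothetical and stands in for the corresponding buffer or recognizer operation of a particular implementation.

#include <stddef.h>

/* Hypothetical helpers corresponding to steps 1 through 4; the buffer
   plumbing itself is elided for brevity.                               */
extern size_t circular_write_position(void);
extern void   halt_circular_recording(void);
extern void   redirect_recording_to_linear_buffer(void);
extern void   prepend_circular_copy_to_linear_buffer(size_t saved_write_pos);
extern size_t find_silence_nearest_marker(size_t micon_marker);

/* Executed on receipt of the MICON message: mark the circular buffer,
   switch recording to the linear buffer, build the composite buffer,
   and locate the silence taken as the start of the utterance.          */
size_t handle_micon(void)
{
    size_t marker = circular_write_position();         /* step 1 */
    halt_circular_recording();                         /* step 1 */
    redirect_recording_to_linear_buffer();             /* step 2 */
    prepend_circular_copy_to_linear_buffer(marker);    /* step 3 */
    return find_silence_nearest_marker(marker);        /* step 4 */
}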
0037. Depending on the particular ASR configuration, decoding into text may now proceed or be postponed until the MICOFF message is received or silence is detected in the most recently buffered audio. In either case, the audio buffer of step 2 continues to store additional acoustic data until both a detected silence occurs and a MICOFF message is received.
0038. At detection of the terminal silence and MICOFF event, recording re-commences in the circular buffer.
0039. By these means, the meanings of the MICON and MICOFF messages are converted into indications of the approximate time segment of speech to be decoded, and the exact boundaries of that period are determined using the silence detection features of the ASR. The silence detection feature of ASRs, such as the Embedded ViaVoice™ product available from IBM Corporation (Armonk, N.Y.), operates in parallel with the speech-to-text decoding functions so, at the point that the MICOFF and silence condition has been met, the speech sounds have also been decoded to text and are available for use by application software or dialog managers or other software components.
0040. Referring initially to FIG. 1, a diagram illustrates an anti-truncation buffering methodology, according to an embodiment of the invention. It is to be understood that illustrative principles of the invention operate in the context of a computing system containing an acoustic signal capture and encoding capability. Software embodying the invention provides the means for directing the encoded audio into a circular buffer 110 or a linear buffer 120 by changing the value of a Write Pointer (shown as 130 for the circular buffer and 140 for the linear buffer) to indicate the next available memory address within a buffer's address range.
`0041. Thus, as illustrated in FIG. 1, audio recording into
`the circular buffer has taken place for as long as the system
`has been turned on. At time 1, a MICON event is received.
`User speech may have begun prior to this event or may
`follow this event. The write pointer originally pointing to a
`position (130) in circular buffer 110 is then repositioned to
`point to a position (140) in linear buffer 120.
0042. A linear copy of the circular buffer is prepended to the linear buffer (170). The data in this linear copy is arranged so that the oldest acoustic data in the circular buffer is located at the start of the segment prepended to the second buffer. This means that the acoustic data recorded immediately before the indication that the system is being addressed ends the prepended segment and is contiguous in memory with the acoustic data which immediately followed the MICON event that is stored in the second area.
`0043. The resulting composite buffer is searched for
`acoustic data representing silence. The location of the
`silence closest to the end of the prepended segment may then
`be used as the location in the composite buffer at which
`speech intended for the system to process begins. This
`silence will be in the prepended segment if the indication of
`speech was given after speech started. It will follow the end
`of the prepended segment if speech began after the MICON
`event.
0044. The ASR may now start decoding (into text) the content of the augmented linear buffer from the address of the first detected silence. Later, at say time 3, the ASR detects silence (150) in the linear buffer data or receives a MICOFF event corresponding to time 3. As shown, at time 3, the write pointer points to a position (160) in the linear buffer since the encoded audio continues to be written as the ASR decodes.
0045. Referring now to FIG. 2, a more detailed flow diagram illustrates an anti-truncation buffering methodology, according to an embodiment of the invention.
0046. In step 201, the methodology is started.
0047. In step 202, a set of buffers (a circular buffer and a linear buffer) and their low-level support software are instantiated. The specification and programming support for such buffers and low-level support is well known to those of ordinary skill in the art. The circular buffer is of fixed size so that its oldest content is continually being over-written by the most recently captured acoustic signal. The linear buffer and its support are arranged to permit additional buffer space to be concatenated to the end of the buffer at any time the buffer runs low on space for recording.
0048. In step 203, recording into the circular buffer begins. This completes the initialization phase.
0049. In step 204, the methodology waits for a "MICON" event or signal or otherwise tests for the indication that the user is addressing the system, by the means provided in the specific implementation of the system. If no event or indication is present, the methodology remains at step 204. Otherwise, in the presence of an indication or event or message, the system proceeds to step 205.
0050. In step 205, the memory location of the Write Pointer in the circular buffer is stored for later use.
`
`
0051. In step 206, the value of the Write Pointer is changed to the memory address of the start of a linear buffer, at which location audio recording continues without interruption. This is possible because the process of encoding the audio into a form which can be stored in digital computer memory requires computing on digital samples acquired at a rate of several tens of thousands per second, while modern computers are capable of several billion operations per second. There is, therefore, a surplus of time available to switch storage locations (change the Write Pointer) without interrupting the receipt or encoding or recording of the audio signal.
0052. In step 207, a linear copy is made of the circular buffer segment. This copy begins at the oldest recording (from the address of the write pointer in the circular buffer, stored in step 205) and continues using data from the circular buffer until the address in the circular buffer which immediately precedes the write pointer address is reached. This copy buffer is then prepended to the linear buffer so the location in the copy buffer corresponding to the last data written into the circular buffer is contiguous with the first location in which data has been written into the linear buffer. Alternatively, the copy of the circular buffer could be made into memory space immediately preceding the linear buffer which had been allocated for the purpose of making the copy buffer.
0053. In step 208, the audio in the composite buffer is decoded by the silence detection mechanism of the ASR. The "alignment" data of the ASR is used to determine the beginning memory address of the silence that is in closest temporal proximity to the MICON event. The address of the silence closest to the end of the prepended segment may then be used as the address in the composite buffer at which speech intended for the system to process begins. This silence will be in the prepended segment if the indication of speech was given after speech started. It will follow the end of the prepended segment if speech began after the indication event. This address is stored for later use.
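A hedged C sketch of this search follows, under the assumption that the alignment data can be represented as spans with start and end offsets and a silence flag; the structure and field names are hypothetical, and recognizer-specific alignment formats will differ.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical alignment record; real recognizers expose alignment data
   in their own formats.                                                  */
typedef struct {
    size_t start;       /* offset of the span within the composite buffer */
    size_t end;
    bool   is_silence;  /* true if the span was decoded as silence        */
} align_span_t;

/* Return the start offset of the silence span closest to the saved MICON
   marker offset; this offset is taken as the point at which the intended
   speech begins.  Returns 0 if no silence span was reported.             */
static size_t silence_nearest_marker(const align_span_t *spans, size_t n,
                                     size_t micon_offset)
{
    size_t best = 0;
    size_t best_dist = (size_t)-1;

    for (size_t i = 0; i < n; i++) {
        if (!spans[i].is_silence)
            continue;
        size_t d = (spans[i].start > micon_offset)
                 ? spans[i].start - micon_offset
                 : micon_offset - spans[i].start;
        if (d < best_dist) {
            best_dist = d;
            best = spans[i].start;
        }
    }
    return best;
}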
0054. In step 209, the start address found in step 208 is passed to the ASR with the instruction to begin decoding.
0055. In step 210, the ASR or some other mechanism detects that the user has stopped speaking for long enough to signal that the utterance is complete.
0056. In step 211, the linear buffer recording is halted. At some time after this, the ASR returns the results of the recognition process through a channel created in another part of the application. This return is not a focus of the invention.
0057. In step 212, the circular buffer is overwritten with values that cannot be misrecognized as silence and, in step 213, the recording process begins again.
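As an illustration only, the overwrite of step 212 might fill the buffer with a non-silence pattern such as the following; the particular pattern is an assumption made for this sketch, not a requirement of the description above.

#include <stddef.h>

/* Fill the circular buffer with a pattern that the recognizer will not
   treat as silence before recording restarts.  The alternating full-scale
   pattern used here is only an assumption for illustration.               */
static void scrub_circular_buffer(short *samples, size_t capacity)
{
    for (size_t i = 0; i < capacity; i++)
        samples[i] = (i & 1) ? 32767 : -32768;
}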
0058. In step 214, either the old linear buffer is cleared or a new buffer is allocated with a "pre-pend" segment (e.g., the segment between arrows 170 and 140 in FIG. 1) long enough to hold the complete contents of the circular buffer. Control is then returned to step 204 so that the anti-truncation buffering functionality can be used for the next utterance.
0059. The above embodiment has been described in terms of a circular buffer of fixed size, a linear buffer that can be extended, a microphone button supplying a "MICON" signal, and an automatic speech recognizer with a silence detection feature. It should be understood that other configurations of buffers and other buffer segment selection mechanisms may be realized by those of ordinary skill in the art and could be applied to implementing this invention without departing from the spirit of the invention. These include, but are not limited to, single buffer configurations which are expanded or changed in topology when "MICON" or its equivalent is detected, multiple buffer configurations in which new allocation plays a reduced or nonexistent role, mechanisms which detect that speech is being directed to the recognition system without the use of a "microphone button", configurations which use First In First Out (FIFO) stacks in hardware or software in place of the circular buffer, and utterance absence detection mechanisms other than acoustic silence detection.
`0060 An alternative embodiment of the invention is
`described below.
0061. In this alternative embodiment, a large circular buffer is allocated when the program is started. This buffer is longer than the longest expected user utterance. For practical purposes, the buffer may hold approximately 100 seconds of recorded speech. For the purpose of discussion, an ASR capable of providing several services is assumed. These include the capability to convert analog audio signals into a digital format such as pulse code modulated (PCM) format, the ability to detect (within some several hundreds of milliseconds) when speech sounds have begun and when they have ended, and the ability to decode PCM stored in a memory buffer into a text representation of the speech audio stored in that buffer. The Embedded ViaVoice speech recognition engine from IBM Corporation (Armonk, N.Y.) is a currently commercially available ASR with these capabilities. That is, the ASR adapted for use in the embodiments of FIGS. 1 and 2 may be adapted for use here as well.
0062. Referring now to FIG. 3, a diagram illustrates the anti-truncation buffering methodology according to another embodiment of the invention. As in the embodiment above, data structures called buffers are used in order to capture and retain audio signals so that a memory buffer with a complete recording of a user utterance can be supplied to the ASR for decoding into text. The data structures allocated at initialization are the linear buffer 300 and the circular buffer 308. Buffers, as is well understood in the art, are associated with data structures which are used to keep track of the buffer start (302, 310), finish (306, 310), and current writing location (304, 312), along with other management data.
0063. In operation, the ASR is used to capture speech in a form suitable for later recognition. It is typical for an ASR to deal with short frames of speech audio. In the case of the IBM Embedded ViaVoice™ ASR, these frames are 100 milliseconds in length. In order to preserve the captured audio for later use, the ASR is provided with a linear buffer in which it stores audio until the buffer is filled. A signal, referred to as a "callback," is used to trigger a software component that transfers the content of the small linear buffer to the next appropriate location in the large circular buffer.
0064. Operation of the invention can be understood by considering several computer processes that are carried on "simultaneously" in "threads," in the sense used by those with ordinary skill in the art of computer programming. These processes and operation of the invention are described below in the context of FIGS. 3 and 4.
0065. The software instantiating this invention begins at step 400.
0066. In step 402, the software components supporting the invention (the ASR) and the software components comprising the invention are initialized. For example, in the case of the IBM Embedded ViaVoice ASR product, part of the initialization includes making Application Programmers Interface (API) calls to the ASR to cause it to allocate and use the small linear buffer 300.
`0067. An API call is made in step 406 to cause the ASR
`to begin its audio recording function using the linear buffer
`300. This buffer 300 fills until the current write location 304
`corresponds to the buffer finish location 306. When this
`condition is detected, in step 408, a callback is generated to
`a function 410 that appends the content of buffer 300 to the
`content of buffer 308 at location 312.
0068. Function 412 then sets the linear buffer 300 current write location 304 to equal the buffer start 302. This switch is accomplished in less time than is required by the ASR to process the next frame of speech, so the change in write location has no effect on the continuity or integrity of the recorded data. The same function 412 advances the circular buffer current write location 312 to the location corresponding to the end of the newly appended frame.
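A compact C sketch of the callback behavior of functions 410 and 412 is given below; the structure and function names are hypothetical and the frame transfer is simplified for illustration.

#include <stddef.h>

/* Simplified buffer descriptor used by this sketch only.                 */
typedef struct {
    short *samples;
    size_t capacity;
    size_t write_pos;
} audio_buf_t;

/* Frame-transfer callback: when the small linear buffer fills, append its
   contents to the large circular buffer at the current write position,
   advance the circular write position past the appended frame, and reset
   the linear buffer's write location to its start.                        */
static void on_frame_full(audio_buf_t *small_linear, audio_buf_t *large_circular)
{
    size_t n = small_linear->write_pos;          /* samples in the frame */
    for (size_t i = 0; i < n; i++) {
        large_circular->samples[large_circular->write_pos] = small_linear->samples[i];
        large_circular->write_pos =
            (large_circular->write_pos + 1) % large_circular->capacity;
    }
    small_linear->write_pos = 0;                 /* back to the buffer start */
}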
`0069. In step 414, the ASR is queried to determine
`whether a transition from speech sounds to silence 318 or
`from silence to speech sounds 320 occurred as detected in
`the last few frames. I