US007729912B1
`
`(12) United States Patent
`Bacchiani et al.
`
`(10) Patent No.: US 7,729,912 B1
`(45) Date of Patent: Jun. 1, 2010
`
`(54) SYSTEM AND METHOD FOR LATENCY REDUCTION FOR AUTOMATIC SPEECH
`RECOGNITION USING PARTIAL MULTI-PASS RESULTS
`
`(75) Inventors: Michiel Adriaan Unico Bacchiani, Summit, NJ (US);
`Brian Scott Amento, Morris Plains, NJ (US)
`
`(73) Assignee: AT&T Intellectual Property II, L.P., New York, NY (US)
`
`(*) Notice: Subject to any disclaimer, the term of this patent is
`extended or adjusted under 35 U.S.C. 154(b) by 995 days.
`
`(21) Appl. No.: 10/742,852
`
`(22) Filed: Dec. 23, 2003
`
`(51) Int. Cl.
`G10L 15/04 (2006.01)
`
`(52) U.S. Cl. ........................ 704/252; 704/236; 704/276
`
`(58) Field of Classification Search ................. 704/231,
`704/235, 246, 251, 270, 252, 229, 243, 236,
`704/244, 275, 239, 255, 276, 240
`See application file for complete search history.
`
`(56) References Cited
`
`U.S. PATENT DOCUMENTS
`
`6,122,613 A       9/2000  Baker ......................... 704/235
`6,487,534 B1     11/2002  Thelen et al.
`6,950,795 B1 *    9/2005  Wong .......................... 704/231
`7,058,573 B1 *    6/2006  Murveit et al. ................ 704/229
`7,184,957 B2 *    2/2007  Brookes et al. ................ 704/246
`7,440,895 B1 *   10/2008  Miller et al. ................. 704/244
`2003/0028375 A1   2/2003  Kellner
`2004/0138885 A1   7/2004  Lin
`
`OTHER PUBLICATIONS
`
`Steve Whittaker et al., "SCANMail: a voicemail interface that makes
`speech browsable, readable and searchable," Proceedings of the
`SIGCHI conference on Human factors in computing systems:
`Changing our world, changing ourselves, Apr. 20-25, 2002,
`Minneapolis, Minnesota.
`
`* cited by examiner
`
`Primary Examiner: Huyen X. Vo
`
`(57) ABSTRACT
`
`A system and method is provided for reducing latency for
`automatic speech recognition. In one embodiment, intermediate
`results produced by multiple search passes are used to
`update a display of transcribed text.
`
`20 Claims, 2 Drawing Sheets
`
`
`
`[Representative drawing: flowchart of FIG. 3]
`302  PERFORM INITIAL ASR PASS ON SPEECH SEGMENT
`304  DISPLAY TRANSCRIBED TEXT FROM INITIAL ASR PASS
`306  PERFORM ADDITIONAL ASR PASS ON SPEECH SEGMENT
`308  UPDATE INITIAL DISPLAY OF TRANSCRIBED TEXT
`
`Exhibit 1020
`Page 01 of 08
`
`

`

`U.S. Patent      Jun. 1, 2010      Sheet 1 of 2      US 7,729,912 B1
`
`FIG. 1
`[Block diagram. Labels: 100, 110, 112 ASR MODULE, 120, 130, USER INTERFACE]
`
`FIG. 2
`[User interface 200. Labels: 210; SPEECH 1 HEADER, SPEECH 2 HEADER, ... SPEECH M HEADER; TEXT SEGMENT; IDENTIFIER 1, IDENTIFIER 2, ... IDENTIFIER N]
`
`

`

`U.S. Patent      Jun. 1, 2010      Sheet 2 of 2      US 7,729,912 B1
`
`FIG. 3
`302  PERFORM INITIAL ASR PASS ON SPEECH SEGMENT
`304  DISPLAY TRANSCRIBED TEXT FROM INITIAL ASR PASS
`306  PERFORM ADDITIONAL ASR PASS ON SPEECH SEGMENT
`308  UPDATE INITIAL DISPLAY OF TRANSCRIBED TEXT
`
`
`

`

`SYSTEM AND METHOD FOR LATENCY
`REDUCTION FOR AUTOMATIC SPEECH
`RECOGNITION USING PARTIAL
`MULTI-PASS RESULTS
`
`BACKGROUND
`
`1. Field of the Invention
`The present invention relates generally to speech recognition
`systems and, more particularly, to a system and method
`for latency reduction for automatic speech recognition using
`partial multi-pass results.
`2. Introduction
`Automatic speech recognition (ASR) is a valuable tool that
`enables spoken audio to be automatically converted into textual
`output. The elimination of manual transcription represents
`a huge user benefit. Thus, whether applied to the generation
`of transcribed text, the interpretation of voice
`commands, or any other time-saving application, ASR is presumed
`to have immense utility.
`In practice, however, ASR comes at a great computational
`cost. As computing technology has improved, so has the
`complexity of the computation models being applied to ASR.
`Computing capacity is rarely wasted in the ever continuing
`search for accuracy and speed in the recognition of speech.
`These two criteria, accuracy and speed, in particular represent
`the thresholds by which user adoption and acceptance
`of the technology are governed. Quite simply, if the promise
`of the technology exceeds the practical benefit in real-world
`usage, the ASR technology quickly moves into the category
`of novelty, not usefulness.
`Conventionally, high accuracy ASR of continuous spontaneous
`speech requires computations taking far more time
`than the duration of the speech. As a result, a long latency
`exists between the delivery of the speech and the availability
`of the final text transcript. What is needed therefore is a
`mechanism that accommodates real-world ASR latencies
`without sacrificing application usefulness.
`
`SUMMARY
`
`In accordance with the present invention, a process is provided
`for reducing latency for automatic speech recognition.
`In one embodiment, intermediate results produced by multiple
`search passes are used to update a display of transcribed
`text.
`Additional features and advantages of the invention will be
`set forth in the description which follows, and in part will be
`obvious from the description, or may be learned by practice of
`the invention. The features and advantages of the invention
`may be realized and obtained by means of the instruments and
`combinations particularly pointed out in the appended
`claims. These and other features of the present invention will
`become more fully apparent from the following description
`and appended claims, or may be learned by the practice of the
`invention as set forth herein.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`In order to describe the manner in which the above-recited
`and other advantages and features of the invention can be
`obtained, a more particular description of the invention
`briefly described above will be rendered by reference to specific
`embodiments thereof which are illustrated in the
`appended drawings. Understanding that these drawings
`depict only typical embodiments of the invention and are not
`therefore to be considered to be limiting of its scope, the
`invention will be described and explained with additional
`specificity and detail through the use of the accompanying
`drawings in which:
`FIG. 1 illustrates an embodiment of a system of the present
`invention;
`FIG. 2 illustrates an embodiment of a user interface for
`navigating transcription data; and
`FIG. 3 illustrates a flowchart of a method of the present
`invention.
`
`DETAILED DESCRIPTION
`
`Various embodiments of the invention are discussed in
`detail below. While specific implementations are discussed, it
`should be understood that this is done for illustration purposes
`only. A person skilled in the relevant art will recognize
`that other components and configurations may be used without
`parting from the spirit and scope of the invention.
`Access to speech data is becoming increasingly prevalent
`due to the ubiquitous nature of digitized storage. In particular,
`digitized storage has enabled public, corporate, and private
`speech data to be easily transferred over public and private
`networks. With increasing frequency, speech content is being
`recorded, archived and distributed on demand to interested
`users.
`While access to speech content is increasing, its usability
`has remained relatively stagnant. This results from the nature
`of speech as a serial medium, an inherent characteristic that
`demands serial playback in its retrieval. Even with conventional
`technologies that can increase the rate of playback, the
`fundamental disadvantages in access to the speech content
`remain.
`It is a feature of the present invention that access to speech
`content is improved through the removal of inherent difficulties
`of speech access. As will be described in greater detail
`below, serial access to speech content is replaced by an efficient
`graphical user interface that supports visual scanning,
`search and information extraction of transcription text generated
`from the speech content.
`As is well known in the art, automatic speech recognition
`(ASR) represents an evolving technology that enables the
`generation of transcription data from speech content. ASR
`has shown increasing potential as new generations of ASR
`technology have leveraged the continual advances in computing
`technology. Notwithstanding these advancements, ASR
`technology has not yet broken into a full range of uses among
`every day tasks. This has likely resulted due to fundamental
`issues of transcription accuracy and speed.
`As would be appreciated, typical applications of ASR technology
`are faced with a tradeoff between transcription accuracy
`and speed. Quite simply, increased transcription accuracy
`often requires more complex modeling (e.g., acoustic
`and language model), the inevitable consequence of which is
`increased processing time. This increased processing time
`comes at an ever-increasing penalty as it goes significantly
`beyond the real time (or actual) rate of the speech content. The
`delay in completion (or latency) of the speech processing can
`often become the primary reason that bars a user from accepting
`the application of the ASR technology in a given context.
`User acceptance being a key, what is needed is a user
`interface that enhances a user's experience with transcription
`data. Prior to illustrating the various features of the present
`invention, reference is made first to the generic system diagram
`of FIG. 1. As illustrated, system 100 includes a processing
`system 110 that includes ASR module 112. In one
`embodiment, processing system 110 is a generic computer
`system. ASR module 112 is generally operative on spoken
`audio data stored in audio source 130. As would be appreciated,
`audio source 130 may be representative of a storage unit
`that may be local or remote to processing system 110. In other
`scenarios, audio source 130 may be representative of a transmission
`medium that is providing live content to processing
`system 110. Upon receipt of audio data from audio source
`130, ASR module 112 would be operative to generate text
`data that would be displayable in user interface 120.
`An embodiment of user interface 120 is illustrated in FIG.
`2. As illustrated, user interface 200 includes three primary
`element areas, including speech header section 210, text transcription
`section 220, and text segment identifier section 230.
`In general, speech header section 210 includes some form of
`identifying information for various speech content (e.g., public
`speeches, lectures, audio books, voice mails, etc.) that is
`accessible through user interface 200. In one embodiment,
`selection of a particular speech header in speech header section
`210 initiates replay of the speech content along with the
`generation of transcription text that appears in transcription
`section 220. As illustrated in FIG. 2, selection of Speech 2
`Header produces a display of its corresponding transcription
`text in transcription section 220.
`In an example related to a voice mail embodiment, speech
`header section 210 could be designed to include information
`such as the caller's name, the date of the voice mail, the length
`of the voice mail, etc.; text transcription section 220 could be
`designed to include the transcription text generated from a
`selected voice mail; and text segment identifier section 230 could be
`designed to include keywords for segments of the selected
`voice mail.
`As further illustrated in FIG. 2, transcription section 220
`can be designed to display transcription text as text segments
`1-N. In one embodiment, text segments 1-N are formatted
`into audio paragraphs using an acoustic segmentation algorithm.
`In this process, segments are identified using pause
`duration data, along with information about changes in acoustic
`signal energy. As would be appreciated, the specific formatting
`of the transcription text into text segments 1-N would
`be implementation dependent in accordance with any criteria
`that would be functionally useful for a viewing user.
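The pause-based part of this segmentation can be sketched as follows. This is a minimal illustration only, assuming the recognizer supplies word-level start and end times. The `Word` tuple, the `segment_by_pause` helper, and the 0.8-second threshold are hypothetical and do not come from the patent, and the sketch leaves out the acoustic signal energy cue that the description also mentions.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    start: float  # seconds from the start of the recording
    end: float


def segment_by_pause(words: List[Word], min_pause: float = 0.8) -> List[List[Word]]:
    """Start a new segment whenever the silence before a word reaches min_pause."""
    segments: List[List[Word]] = []
    for word in words:
        if segments and word.start - segments[-1][-1].end < min_pause:
            segments[-1].append(word)   # short gap: same audio paragraph
        else:
            segments.append([word])     # first word, or a long pause: new paragraph
    return segments


# Made-up timings: a 1.5-second pause separates the two phrases.
words = [
    Word("hi", 0.0, 0.3), Word("it's", 0.4, 0.6), Word("Pat", 0.7, 1.0),
    Word("call", 2.5, 2.8), Word("me", 2.9, 3.0), Word("back", 3.1, 3.4),
]
paragraphs = [" ".join(w.text for w in seg) for seg in segment_by_pause(words)]
print(paragraphs)
```

Each resulting list would correspond to one of text segments 1-N in transcription section 220.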
`In one embodiment, text segments can also have associated
`therewith an identifier that relates to the text segment. These
`identifiers are displayed in text segment identifier section 230
`and can be collectively used to enable a user to intelligently
`navigate through the transcribed text. As would be appreciated,
`the specific form of the text segment identifiers would be
`implementation dependent in accordance with any criteria
`that would be functionally useful for a viewing user. In one
`example, the text segment identifiers could represent one or
`more keywords that were extracted from the corresponding
`transcription text segment.
`As noted, one of the goals of user interface 200 is to
`improve upon a user's experience in interacting with transcription
`text. A significant drawback in this process is the
`relevant tradeoff between transcription accuracy and speed.
`Indeed, conventional ASR technology that produces reasonably
`accurate text can be expected to run at a rate four times
`that of real time. This delay in the generation of transcription
`text represents a real impediment to a user's adoption of the
`technology. It is therefore a feature of the present invention
`that user interface 200 is designed to accommodate a user's
`sense of both transcription speed and accuracy. As will be
`described in greater detail below, this process leverages transcription
`efforts that incrementally improve upon transcription
`accuracy.
`To obtain high accuracy transcripts for continuous spontaneous
`speech, several normalization and adaptation algorithms
`can be applied. These techniques can take into account
`the specific channel conditions as well as the gender, vocal
`tract length and dialect of the speaker. The model parameters
`for the compensation/adaptation model can be estimated at
`test time in an unsupervised fashion. The unsupervised algorithms
`use an initial guess at the transcription of the speech.
`Based on that guess and the audio, an adapted/normalized
`model is estimated and a re-transcription with the adapted/
`normalized model improves the accuracy of the transcript.
`The final, most accurate transcript is obtained by iterative
`transcription with models adapted/normalized in multiple
`stages. Hence this process can be referred to as multi-pass
`transcription.
`The ASR transcription passes are computationally expensive.
`To express their cost, the processing time is related to the
`duration of the speech and the quotient of the two expresses
`the computation cost in terms of a real-time factor. In one
`embodiment, to reduce the computational cost of repeated
`transcription passes, the initial search pass produces a word-graph
`representing a few of the possible transcriptions
`deemed most likely by the current model. Subsequent transcription
`passes, using the more accurate adapted/normalized
`model, only consider the transcriptions enumerated by the
`word-graph, dramatically reducing the computational cost of
`transcription.
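The real-time factor described above is the processing time divided by the speech duration. A minimal sketch, with a hypothetical function name and made-up timings:

```python
def real_time_factor(processing_seconds: float, speech_seconds: float) -> float:
    """Processing time divided by speech duration (1.0 means exactly real time)."""
    if speech_seconds <= 0:
        raise ValueError("speech duration must be positive")
    return processing_seconds / speech_seconds


# A 60-second recording that takes 240 seconds to transcribe runs at 4 times real time.
rtf = real_time_factor(240.0, 60.0)
print(rtf)  # 4.0
```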
`The first recognition pass, which uses an unadapted/unnormalized
`model and performs an unconstrained search for the
`transcript, takes about 4 times real-time. On an independent
`test set, the word-accuracy of this transcript was 74.2%.
`Besides an initial guess of the transcript, this search pass
`produces a word-graph that is used in subsequent search
`passes.
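The idea of a search constrained by the first-pass word-graph can be illustrated with a deliberately simplified sketch. A real word-graph is a lattice of word hypotheses; here it is flattened, as an assumption, into a short list of candidate transcriptions. The `constrained_search` helper, the candidate strings, and the toy `adapted_score` function are all hypothetical stand-ins, not the recognizer described in the patent.

```python
from typing import Callable, List


def constrained_search(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the best candidate under `score`, looking only at the given candidates."""
    return max(candidates, key=score)


# Candidates kept by the (hypothetical) first pass, best first under its own model.
word_graph = ["wreck a nice beach", "recognize speech", "recognize peach"]

# Toy stand-in for an adapted/normalized model: it counts the words it prefers.
preferred = {"recognize", "speech"}


def adapted_score(text: str) -> float:
    return float(sum(word in preferred for word in text.split()))


best = constrained_search(word_graph, adapted_score)
print(best)
```

Because the later pass scores only the few candidates the first pass retained, it avoids repeating the unconstrained search, which is where the cost reduction described above comes from.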
`The second recognition pass estimates the gender and
`vocal tract of the speaker based on the audio and the transcript
`produced in the first pass. A second search pass is then performed,
`constrained by the word-graph produced by the first
`pass. In one embodiment, this second search pass uses a
`Gender Dependent (GD), Vocal Tract Length Normalized
`(VTLN) model (based on the unsupervised gender estimate)
`and uses VTLN acoustic features (based on the unsupervised
`vocal tract length estimate). The result of this search pass is a
`new transcript. The word-accuracy of this second transcript
`was 77.0% (a 2.8% absolute improvement over the first pass
`accuracy). The computation cost of this second pass is about
`1.5 times real-time.
`The third recognition pass uses two adaptation algorithms
`to adapt to the channel and speaker characteristics of the
`speaker. Again, this is performed in an unsupervised way,
`using the audio and the transcript of the second (VTLN) pass.
`It estimates two linear transformations, one applied to the
`acoustic features using Constrained Model-space Adaptation
`(CMA) and one applied to the model mean parameters using
`Maximum Likelihood Linear Regression (MLLR). First the
`CMA transform is estimated using the VTLN transcript and
`the audio. Then a search pass is performed, again constrained
`by the word-graph from the first pass, using the CMA rotated
`acoustic features and the GD VTLN model. The transcript
`from that pass is then used with the audio to estimate the
`MLLR transform. After rotating the model means using that
`transform, another word-graph constrained search pass is
`executed that provides the third pass transcript. The total
`processing time of this adaptation pass is about 1.5 times
`real-time. The adaptation pass transcript is the final system
`output and improves the accuracy to 78.4% (a 1.4% absolute
`improvement over the VTLN pass result).
`Since the first pass still represents a large latency of 4 times
`real-time, an initial search pass is performed before running
`the first pass, referred to as the Quick Pass. This output from
`this pass is not used in the multi-pass recognition process but
`is simply used as a low latency result for presentation to the
`user. It can be obtained in about one times real time. Here, the
`exact rate of the initial pass as being at real time or slightly
`greater than real time is not a critical factor. Rather, one of the
`goals of the initial pass is to produce a result for the user as
`quickly as possible.
`The run-time reduction of the initial pass as compared to
`the first pass can be obtained by reducing the search beam
`used internally in the recognizer. As a result, the speed-up
`comes at an accuracy cost: the Quick Pass accuracy was
`68.7% (a 5.5% absolute degradation compared to the first
`pass result).
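The absolute differences quoted for the four passes can be checked with a few lines of arithmetic. The accuracy figures are the ones reported in the description; the dictionary layout and variable names are illustrative only.

```python
# Word accuracy in percent, as reported in the description, in pass order.
accuracy = {
    "Quick Pass": 68.7,
    "first pass": 74.2,
    "second pass (VTLN)": 77.0,
    "third pass (adaptation)": 78.4,
}

names = list(accuracy)
# Gain of each pass over the one before it, in percentage points.
gains = {
    later: round(accuracy[later] - accuracy[earlier], 1)
    for earlier, later in zip(names, names[1:])
}
# Overall gain from the Quick Pass to the final transcript.
total_gain = round(accuracy[names[-1]] - accuracy[names[0]], 1)

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
print(f"Quick Pass to final: +{total_gain} points")
```

The per-pass gains are 5.5, 2.8, and 1.4 percentage points, for 9.7 points overall between the Quick Pass and the final transcript.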
`It is a feature of the present invention that the intermediate
`guesses at the transcript of the speech can be presented to the
`user at lower latency than the final, most accurate transcript.
`In other words, the results of each of the search passes enables
`the developer to produce a more usable interface by presenting
`the best quality results available at the time of the user
`request.
`Thus, in one embodiment, the particular search pass can be
`indicated in the interface by using color shading or explicitly
`indicating which transcript the user is currently viewing. For
`example, as illustrated in FIG. 2, Text Segment N is displayed
`in a different shade or color, thereby indicating that further,
`more accurate transcription results would be forthcoming. In
`other embodiments, the user interface can be designed to
`show the user how many additional transcription passes are
`forthcoming, the estimated transcription accuracy rate, etc.
`With this approach, when a one-minute speech file is being
`processed, a rough transcript can be displayed in the interface
`within the first minute, so that users can begin working with
`the intermediate results. As each pass is completed, the display
`is updated with the new information. After eight minutes,
`for example, all processing may be completed and the final
`transcript would then be shown in text transcription section
`220.
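The eight-minute figure is consistent with the real-time factors quoted earlier if the four passes are assumed to run one after another. That sequential schedule is an assumption of this sketch (the description notes elsewhere that passes may also run contemporaneously), and the names below are illustrative.

```python
from itertools import accumulate

# (pass name, cost as a multiple of real time), using the figures quoted in the description.
passes = [
    ("Quick Pass", 1.0),
    ("first pass", 4.0),
    ("second pass (VTLN)", 1.5),
    ("third pass (adaptation)", 1.5),
]


def completion_times(speech_minutes: float):
    """Minutes after start at which each pass finishes, if passes run back to back."""
    costs = [factor * speech_minutes for _, factor in passes]
    return list(zip((name for name, _ in passes), accumulate(costs)))


schedule = completion_times(1.0)
for name, minute in schedule:
    print(f"{name}: display can update at {minute:g} min")
```

For a one-minute recording the display could be refreshed at 1, 5, 6.5, and 8 minutes under this assumption.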
`In one embodiment, accuracy information from one or
`more of the multiple search passes can also be used on a word
`or utterance level. Illustration of this feature is provided by
`Text Segment 2, which is broken down into words and/or
`utterances 1-12. Here, accuracy information (e.g., confidence
`scores) generated by a particular search pass can be used to
`differentiate between the estimated accuracy of different
`words or utterances. For example, in one embodiment, words
`or utterances having confidence scores below a certain threshold
`can be targeted for differential display, such as that shown
`by words/utterances 3, 6, 8, and 12. As would be appreciated,
`various levels of differentiation can be defined with associated
`levels of shading, colors, patterns, etc. to communicate
`issues of relative accuracy in the transcribed text. In this
`manner, the accelerated display of transcribed text would not
`hinder the user in his appreciation of the data. Indeed, the
`highlighting of known or estimated accuracy issues, would
`enable the user to listen to specific portions of the speech
`content to discern words or utterances on his own. For
`example, the user interface can be designed to replay a specific
`portion of the speech content upon selection of a highlighted
`portion of the transcribed text. Thus, it is a feature of
`the present invention that the user can work with transcribed
`text earlier than he otherwise would be permitted to do so, and
`in a manner that is not hindered by the lower accuracy of an
`initial or intermediate search pass.
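Threshold-based differential display can be sketched as below. The confidence values, the 0.5 threshold, the `mark_low_confidence` helper, and the use of square brackets in place of shading or color are all assumptions made for illustration.

```python
from typing import List, Tuple


def mark_low_confidence(words: List[Tuple[str, float]], threshold: float = 0.5) -> str:
    """Wrap words whose confidence score is below the threshold in [brackets]."""
    return " ".join(
        f"[{text}]" if confidence < threshold else text
        for text, confidence in words
    )


# Made-up (word, confidence) pairs such as a search pass might produce.
scored = [
    ("please", 0.93), ("call", 0.88), ("Morris", 0.41),
    ("back", 0.77), ("tonight", 0.32),
]
rendered = mark_low_confidence(scored)
print(rendered)
```

In an interactive interface, each bracketed word could also carry its time offsets so that selecting it replays the matching portion of the speech content.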
`Having described an exemplary user interface enabled by
`principles of the present invention, a brief description of a
`process of the invention is now described with reference to the
`flowchart of FIG. 3. As illustrated, the process begins at step
`302 where an initial ASR pass is performed on a speech
`segment. This initial ASR pass can represent any search pass
`that produces discernible transcribed output. At step 304, this
`transcribed output is displayed.
`Next, at step 306, an additional ASR pass is performed on
`the speech segment. As would be appreciated, this additional
`ASR pass can be initiated after completion of the initial ASR
`pass, or can be performed contemporaneously with the initial
`ASR pass. Regardless, it is expected that the additional ASR
`pass would produce discernible transcribed output after the
`initial ASR pass. At step 308, the transcribed output from the
`additional ASR pass is used to update the initial display of
`transcribed text.
`It should be noted that the particular manner in which the
`transcribed text from the additional ASR pass is used to
`update the display is implementation dependent. In one
`embodiment, the displayed text itself would be modified. In
`other embodiments, an indicator on the display reflective of
`the state of a multi-pass search strategy would be updated. In
`still other embodiments, the relative highlighting or other
`communicated differentiator would be modified.
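The flow of FIG. 3 can be sketched as a loop that refreshes the display after every pass. This is a simplified illustration that runs the passes sequentially, although the description also allows them to run contemporaneously. The `run_multipass` helper, the `AsrPass` alias, and the stand-in passes that return fixed strings are hypothetical.

```python
from typing import Callable, List, Tuple

AsrPass = Tuple[str, Callable[[bytes], str]]  # (label, transcribe function)


def run_multipass(audio: bytes, passes: List[AsrPass], show: Callable[[str], None]) -> str:
    """Run each pass in order and refresh the display as soon as it finishes."""
    text = ""
    for index, (label, transcribe) in enumerate(passes):
        text = transcribe(audio)  # steps 302 / 306: perform an ASR pass
        remaining = len(passes) - index - 1
        # Steps 304 / 308: display or update the text, with a pass-status indicator.
        show(f"[{label}; {remaining} more pass(es) pending] {text}")
    return text


# Stand-in passes that return fixed strings instead of running a recognizer.
demo_passes: List[AsrPass] = [
    ("initial", lambda audio: "wreck a nice beach"),
    ("additional", lambda audio: "recognize speech"),
]
shown: List[str] = []
final = run_multipass(b"", demo_passes, shown.append)
print("\n".join(shown))
```

The status prefix here stands in for the display indicator mentioned above, and it also shows one way to tell the user how many passes are still forthcoming.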
`Embodiments within the scope of the present invention
`may also include computer-readable storage media for carrying
`or having computer-executable instructions or data structures
`stored thereon. Such computer-readable storage media
`can be any available media that can be accessed by a general
`purpose or special purpose computer. By way of example, and
`not limitation, such computer-readable media can comprise
`RAM, ROM, EEPROM, CD-ROM or other optical disk storage,
`magnetic disk storage or other magnetic storage devices,
`or any other medium which can be used to carry or store
`desired program code means in the form of computer-executable
`instructions or data structures. When information is
`transferred or provided over a network or another communications
`connection (either hardwired, wireless, or a combination
`thereof) to a computer, the computer properly views the
`connection as a computer-readable medium. Thus, any such
`connection is properly termed a computer-readable medium.
`Combinations of the above should also be included within the
`scope of the computer-readable media.
`Computer-executable instructions include, for example,
`instructions and data which cause a general purpose computer,
`special purpose computer, or special purpose processing
`device to perform a certain function or group of functions.
`Computer-executable instructions also include program modules
`that are executed by computers in stand-alone or network
`environments. Generally, program modules include routines,
`programs, objects, components, and data structures, etc. that
`perform particular tasks or implement particular abstract data
`types. Computer-executable instructions, associated data
`structures, and program modules represent examples of the
`program code means for executing steps of the methods disclosed
`herein. The particular sequence of such executable
`instructions or associated data structures represents examples
`of corresponding acts for implementing the functions
`described in such steps.
`Those of skill in the art will appreciate that other embodiments
`of the invention may be practiced in network computing
`environments with many types of computer system configurations,
`including personal computers, hand-held
`devices, multi-processor systems, microprocessor-based or
`programmable consumer electronics, network PCs, minicomputers,
`mainframe computers, and the like. Embodiments
`may also be practiced in distributed computing environments
`where tasks are performed by local and remote
`processing devices that are linked (either by hardwired links,
`wireless links, or by a combination thereof) through a communications
`network. In a distributed computing environment,
`program modules may be located in both local and
`remote memory storage devices.
`Although the above description may contain specific
`details, they should not be construed as limiting the claims in
`any way. Other configurations of the described embodiments
`of the invention are part of the scope of this invention. For
`example, the invention may have applicability in a variety of
`environments where ASR may be used. Therefore, the invention
`is not limited to ASR within any particular application.
`Accordingly, the appended claims and their legal equivalents
`only should define the invention, rather than any specific
`examples given.
`What is claimed is:
`1. A computer-implemented method for reducing latency
`in an automatic speech recognition (ASR) system, the method
`comprising:
`transcribing via a processor speech data using a first ASR
`pass, which operates at a first transcription rate near real
`time, to produce first transcription data;
`transcribing said speech data using a second ASR pass
`slower than said first ASR pass to produce second transcription
`data, wherein said second transcription data is
`more accurate than said first transcription data;
`transcribing said speech data using a third ASR pass based
`on the speech data and transcribed speech data from the
`second ASR pass to produce third transcription data;
`displaying via a display a part of said first transcription
`data, which corresponds to a portion of said speech data,
`prior to transcription of said portion of said speech data
`by said second ASR pass;
`updating said displayed part of said first transcription data
`with one or more of said second transcription data and
`said third transcription data upon completion of the transcription
`of said portion of said speech data by said
`second or third ASR pass; and
`displaying with transcription data an indication of how
`many additional transcription passes are forthcoming
`that will update the transcription data.
`2. The method of claim 1, wherein said first non-normalized
`ASR pass operates at real time.
`3. The method of claim 1, wherein said first non-normalized
`ASR pass operates at greater than real time.
`4. The method of claim 1, wherein said displaying comprises
`displaying an indicator that signifies that more accurate
`transcription data is being generated.
`5. The method of claim 1, wherein said displayed data is in
`a different color or shade as compared to said updated displayed
`data.
`6. The method of claim 1, wherein portions of displayed
`data having a relatively lower confidence score are distinctly
`displayed as compared to displayed data having a relatively
`higher confidence score.
`7. The method of claim 6, wherein the displayed data
`having a relatively lower confidence score is displayed in a
`darker shade as compared to displayed data having a relatively
`higher confidence score.
`8. The method of claim 6, wherein said portions of displayed
`data having a relatively lower confidence score enable
`a user to listen to the corresponding portions of speech data.
`9. The method of claim 1, further comprising transcribing
`said speech data using one or more normalized ASR passes
`beyond said second normalized ASR pass.
`10. A tangible computer-readable storage medium that
`stores a program for controlling a computer device to perform
`a method to reduce latency in an automatic speech recognition
`(ASR) system, the method comprising:
`transcribing speech data via a processor in the computer
`device using a first ASR pass, which operates at a first
`transcription rate near real time, to produce first transcription
`data;
`transcribing said speech data using a second ASR pass
`slower than said first ASR pass to produce second transcription
`data, wherein said second transcription data is
`more accurate than said first transcription data;
`transcribing said speech data using a third ASR pass based
`on the speech data and transcribed speech data from the
`second ASR pass to produce third transcription data;
`displaying on a display a part of said first transcription data,
`which corresponds to a portion of said speech data, prior
`to transcription of said portion of said speech data by
`said second ASR pass;
`updating said displayed part of said first transcription data
`with one or more of said second transcription data and
`said third transcription data upon completion of the transcription
`of said portion of said speech data by said
`second or third ASR pass; and
`displaying with transcription data an indication of how
`many additional transcription passes are forthcoming
`that will update the transcription data.
`11. An automatic speech recognition (ASR) system using a
`method of reducing latency, the method comprising:
`transcribing, via a processor in the ASR system, speech
`data using a first ASR pass, which operates at a first
`transcription rate near real time, to produce first transcription
`data;
`transcribing said speech data using a second ASR pass
`slower than said first ASR pass to produce second transcription
`data, wherein said second transcription data is
`more accurate than said first transcription data;
`transcribing said speech data using a third ASR pass based
`on the speech data and transcribed speech data from the
`second ASR pass to produce third transcription data;
`displaying a part of said first transcription data, which
`corresponds to a portion of said speech data, prior to
`transcription of said portion of said speech data by said
`second ASR pass;
`updating said displayed part of said first transcription data
`with one or more of said second transcription data and
`said third transcription data upon completion of the transcription
`of said portion of said speech data by said
`second or third ASR pass; and
`displaying with transcription data an indication of how
`many additional transcription passes are forthcoming
`that will update the transcription data.
`12. A computer-implemented method of reducing latency
`in the display of transcribed data generated by automatic
`speech recognition (ASR) process, the method comprising:
`transcribing via a processor a segment of speech data using
`a plurality of normalized ASR passes, said plurality of
`normalized ASR passes having varying levels of accuracy
`and speed, wherein a second normalized ASR pass
`estimates a gender and a vocal tract of a speaker based on
`audio and a first transcription data obtained prior to the
`transcribing using the plurality of normalized ASR
`passes;
`incrementally updating a display of transcribed text as
`more accurate text is generated by one of said plurality
`of normalized ASR passes;
`displaying an indicator that communicates a relative accuracy
`of words in said displayed text; and
`
`
`

`

`displaying with transcription data an indication of how
`many additional transcription passes are forthcoming
`that will upda
