HARNESS, DICKEY & PIERCE, P.L.C.
ATTORNEYS AND COUNSELORS
P.O. BOX 828
BLOOMFIELD HILLS, MICHIGAN 48303
U.S.A.

Date: April 16, 1997

Hon. Commissioner of Patents and Trademarks
Washington, D.C. 20231
Re:  Inventors:          Jean-Claude Junqua; Michael Galler
     For:                SPEECH RECOGNITION SYSTEM EMPLOYING MULTIPLE GRAMMAR NETWORKS
     Atty. Docket No.:   9432-000024

Sir:
TELEPHONE: (810) 641-16__
TELEFACSIMILE: (810) 641-02__
Transmitted herewith for filing is the above referenced patent application.

1.  [X] Eight formal sheets of drawings showing Figures 1-9 are enclosed.

2.  [ ] A Verified Statement Claiming Small Entity Status is enclosed.

3a. [ ] A check is enclosed to cover the fees as calculated below. The Commissioner is hereby authorized to charge any additional fees which may be required, or credit any overpayment to Deposit Account No. 08-0750. A duplicate copy of this document is enclosed.
3b. [X] The fees calculated below will be paid within the time allotted for completion of the filing requirements.

3c. [ ] The fees calculated below are to be charged to Deposit Account No. 08-0750. The Commissioner is hereby authorized to charge any additional fees which may be required, or credit any overpayment to said Deposit Account. A duplicate copy of this document is enclosed.
FILING FEE CALCULATION              Number Filed   Number Extra   Rate
Basic Fee .........................................................   $770.00
Total Claims        21 - 20 =            1         x  $22.00  = ...     22.00
Independent Claims   4 -  3 =            1         x  $80.00  = ...     80.00
Multiple Dependent Claim(s) Used ..................................
FILING FEE - NON-SMALL ENTITY .....................................   $872.00
FILING FEE - SMALL ENTITY: Reduction by 1/2 .......................
    [ ] Verified Statement enclosed.
    [ ] Verified Statement filed in priority application.
Assignment Recordal Fee ($40.00) ..................................
37 C.F.R. §1.17(k) Fee (non-English application) ..................
TOTAL .............................................................   $872.00

Page 1 of 2
4.  [ ] An Assignment of the invention is enclosed. The required cover sheet under 37 C.F.R. §3.11, §3.28 and §3.41 is attached.

5.  [X] A signed Oath/Declaration [ ] is enclosed [X] will be filed in accordance with 37 C.F.R. 1.63.

6.  [ ] Foreign Priority: Priority based on Application No. ________, filed ________, is claimed.
7.  [ ] A copy of the above referenced priority document [ ] is enclosed [ ] will be filed in due course, pursuant to 35 U.S.C. §119(a)-(d).

8.  [ ] Because the enclosed application is in a non-English language, a verified English translation for examination purposes of same [ ] is enclosed [ ] will be filed as soon as it is available.
9.  [ ] Provisional Application Priority: Priority based on United States Provisional Application No. ________, filed ________, is claimed under 35 U.S.C. §119(e).

10. [ ] A Preliminary Amendment is enclosed.

11. [ ] An Information Disclosure Statement, ____ sheets of PTO Form 1449, and ____ patent(s)/publications/documents are enclosed.

12. [ ] An Establishment of Assignee's Right To Prosecute Application Under 37 C.F.R. § 3.73(b), and Power Of Attorney is enclosed.
13. [X] An Express Mailing Certificate is enclosed.

14. [ ] Other ________

Attention is directed to the fact that the address of this firm has been designated as the correspondence address for this application.

Respectfully,

Page 2 of 2
SPEECH RECOGNITION SYSTEM
EMPLOYING MULTIPLE GRAMMAR NETWORKS

Cross Reference to Related Applications

This is a continuation-in-part of U.S. patent application Serial Number 08/642,766, filed May 6, 1996, entitled "Call Routing Device Employing Continuous Speech" by applicants Jean-Claude Junqua and Michael Galler.
Field of the Invention

The present invention relates generally to computer-implemented speech recognition. More particularly, the invention relates to a method and apparatus for processing acoustic speech data using multiple grammar networks. Use of multiple networks results in different segmentation of the acoustic speech data, to facilitate extracting speech that has utility from speech that does not.

Although the invention has many uses, it is described here in a spelled name recognition system of the type suitable for telephone call routing applications. In the illustrated embodiment, first and second grammar networks are used to separately detect the N-best and M-best letter sequences. One grammar network is configured on the assumption that the user will immediately begin spelling in response to the system prompt. The second grammar network is configured on the assumption that the spelled name letter sequence begins with extraneous noise or utterances that are not recognized by the system. The N-best and M-best letter sequences are separately subjected to a dynamic programming match against a dictionary of valid names, to extract the N-best and M-best name hypotheses corresponding to each of the N-best and M-best letter sequences. The recognition decision is then made by selecting the best candidate from these sets of name hypotheses.
Background and Summary of the Invention

Current speech recognition technology involves the recognition of patterns in acoustic data and the correlation of those patterns with a predetermined set of dictionary entries that are recognized by the system. The speech recognition problem is an extremely challenging one because there are so many different variants. In general, the speech recognizer applies the incoming acoustic data in digital form to a mathematical recognition process that converts the digital data into parameters based on a predefined model.

Conventionally, the model is one that has been previously trained using a sufficiently large training set so that individual speaker variances are largely abated. The model-based recognition process segments the incoming data into elemental components, such as phonemes, that are then labeled by comparison with the trained model. In one form of recognizer, once individual phonemes have been labeled, the phoneme data are compared with prestored words in the system dictionary. This comparison is performed through an alignment process that will accommodate imprecise matches due to incorrect phoneme recognition as well as insertion and deletion of phonemes within a given sequence.

The system works on the basis of probabilities. Conventionally, the speech recognizer will select the most probable word candidate that results from the previously described segmentation, labeling and alignment process.
By their very nature, current speech recognizers select word candidates from the predefined dictionary and hence they will only recognize a predefined set of words. This raises a problem, particularly in systems where further decisions are made based on the results of speech recognition. Extraneous noise or speech utterances of words that are not found in the dictionary are often incorrectly interpreted as words that are found in the dictionary. Subsequent decisions based on such incorrect recognition can lead to faulty system performance.
To illustrate the problem, consider a spelled name telephone call routing application. The user is instructed by synthesized voice prompt to spell the name of the person to whom the call should be routed. If the user follows these instructions, the speech recognizer identifies each of the letters uttered and then is able to look up the spelled name by alignment of the sequence of letters with the dictionary. The system then routes the call to the appropriate extension using routing information found in the dictionary. However, if the user utters extraneous information first, such as by pronouncing the person's name prior to spelling, the recognition process is highly likely to fail. This is because the recognition system expects to receive only a sequence of uttered letters and will attempt to "recognize" the spoken name as one or more letters. The conventional system simply is not equipped to properly segment the incoming acoustic data, because the underlying model on which the system is built assumes a priori that the data are all equivalent units (spoken letters) that are useful or meaningful to the system.
The present invention solves the aforementioned problem through a speech recognition system that employs and integrates multiple grammar networks to generate multiple sets of recognition candidates, some based on a model that assumes the presence of extraneous speech, and some based on a model that assumes the absence of extraneous speech. The results of both models are used to make the ultimate recognition decision, relying on the respective match probability scores to select the most likely candidate.
According to one aspect of the invention, the acoustic speech data are separately processed using different first and second grammar networks that result in different segmentation of the acoustic speech data. In this way, the system will extract speech that has utility from speech that does not. For each grammar network, a plurality of recognition candidates are generated. The preferred embodiment generates the N-best candidates using a first grammar network and the M-best candidates using a second grammar network, where N and M are integers greater than one and may be equal. The first and second plurality of recognition candidates (N-best, M-best) are transformed based on at least one set of a priori constraints about the speech that has utility. The transformation may comprise, for example, matching the candidates to a dictionary of spelled names that are recognized by the system. The recognition decision is then based on the transformed recognition candidates.
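By way of illustration only, this decision stage can be sketched as follows. This is a minimal sketch, not the patented implementation: the hypothesis lists and names are hypothetical, and Python's standard difflib matcher merely stands in for the dynamic-programming dictionary match described later.

```python
import difflib

def decide(n_best, m_best, dictionary):
    """Pool both hypothesis sets, map each letter-sequence hypothesis to
    its closest dictionary name (the a priori constraint), and return
    the best overall match. difflib stands in for DP alignment."""
    best_name, best_score = None, -1.0
    for hyp in n_best + m_best:
        # Closest dictionary entry to this hypothesis, however poor.
        for name in difflib.get_close_matches(hyp, dictionary, n=1, cutoff=0.0):
            score = difflib.SequenceMatcher(None, hyp, name).ratio()
            if score > best_score:
                best_name, best_score = name, score
    return best_name
```

For example, decide(["HANSON"], ["XHANSON"], ["HANSON", "WILSON"]) selects "HANSON" because the first-path hypothesis matches a dictionary name exactly.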
As will be more fully explained below, the invention splits the acoustic speech data into two or more paths that are each handled differently. One path is processed using a first grammar network based on the assumption that only useful utterances (e.g., letters) are supplied. Another path is processed using a different grammar network that assumes extraneous, non-useful speech precedes the useful speech. The different grammar networks thus result in different segmentation of the data.

The recognition candidates generated by each path may each be scored according to how well each candidate matched the respective models. Rather than requiring the two paths to compete at this stage (in order to select the single candidate with the highest score), the two sets of recognition candidates are maintained separate. At this stage, the recognition candidates represent the N-best and M-best letter sequence hypotheses. To select which hypothesis is the best candidate, both sets are separately matched to the dictionary of all names that are recognized by the system.
The dictionary is, in effect, a collection of a priori constraints about the speech that will have utility to the system. Thus certain letter sequence hypotheses may be scored less probable because those letter sequences do not well match the letter sequences stored in the dictionary. The presently preferred embodiment uses the N-best and M-best letter sequences to select the N-best and M-best names from the dictionary. Thus contributions from both paths are included in the decision making process. Finally, the N-best and M-best sets of names may be combined to form a reduced set of dictionary candidates to which the input utterance is applied.
This reduced size dictionary can be used to build a dynamic grammar that is built from the N-best and M-best name candidates. This dynamic grammar will tend to favor one set of candidates or the other, depending on whether the input utterance contains extraneous speech or not. If extraneous speech is present, the grammar network designed to identify and reject the extraneous speech will tend to produce better recognition results, and those results will be reflected as the better candidates in the dynamic grammar that is constructed from the N-best and M-best name candidates. On the other hand, if no extraneous speech is present, the other grammar network will produce better recognition results, and it will be better reflected as the better candidates in the dynamic grammar.
Once the dynamic grammar is constructed, the input acoustic speech data may be processed using a recognizer based on the dynamic grammar to extract the single most probable name candidate as the recognized name. The recognized name is then used to access a suitable database for routing the telephone call appropriately.
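The combination step that yields the reduced dictionary can be sketched as a simple order-preserving union of the two candidate name lists; the function name and the sample names are assumptions for illustration.

```python
def build_dynamic_grammar(n_best_names, m_best_names):
    """Combine the N-best and M-best name candidates into one reduced,
    duplicate-free candidate list for the final constrained pass.
    Order is preserved so higher-ranked candidates stay first."""
    seen, reduced = set(), []
    for name in list(n_best_names) + list(m_best_names):
        if name not in seen:
            seen.add(name)
            reduced.append(name)
    return reduced
```

Names proposed by both paths appear once, so agreement between the two grammar networks naturally concentrates the reduced dictionary.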
For a more complete understanding of the invention, its objects and advantages, reference may be had to the following specification and to the accompanying drawings.
Brief Description of the Drawings

Figure 1 is a block diagram of an exemplary system using the call routing device of the invention;

Figure 2 is a block diagram of an exemplary embodiment of the call routing device of the invention;

Figure 3 is a state diagram illustrating the grammar network G1, configured on the assumption that the spelled name letter sequence begins with valid letters;

Figure 4 is a state diagram illustrating the grammar network G2, configured on the assumption that the spelled name letter sequence begins with extraneous noise or utterances that are not recognized by the system;

Figure 5 is a detailed block diagram of the presently preferred recognition system of the invention;

Figure 6 is a diagram illustrating different types of recognition errors;

Figure 7 is a graph showing optimization of the PLP-RASTA filter coefficients to decrease the number of substitution, deletion and insertion errors;

Figure 8 is a diagram showing the improved lattice N-best technique;

Figure 9 is a diagram further describing how hypotheses generation is performed during the backtracking phase of recognition.
Description of the Preferred Embodiment

The principles of the invention will be illustrated and described in the context of a call routing device that will prompt users to supply call routing information by verbally spelling names into the system. Therefore, to aid in understanding the speech recognition system, a brief description will first be provided of a call routing device in which the speech recognition system may be employed. It should be kept in mind, however, that the speech recognition system of the invention is not limited to call routing devices. Rather, the recognition system finds utility in a wide range of different applications where useful speech must be extracted from extraneous noise or speech that is not useful.
System Overview and Basic Operation

The call routing device employing continuous speech recognition will be illustrated in an exemplary embodiment, suitable for plug-and-play connection to an existing PBX switch, or for incorporation into the PBX equipment at the time of manufacture. Referring to Figure 1, the PBX switch 210 is connected to the telephone network infrastructure 212 by conventional means such as telephone lines 214. In the illustrated embodiment three lines have been illustrated for convenience. This is not intended as a limitation of the invention as the invention is capable of being deployed in systems having a greater or fewer number of telephone lines.

The PBX switch is of conventional design, capable of routing incoming calls from network 212 to any of the selected telephone devices such as handsets 216. The spelled name recognition call router 218 of the invention is connected, as the handsets 216 are connected, to additional extensions or ports on the PBX switch 210. As will be more fully discussed, the presently preferred embodiment connects to the PBX switch through a plurality of lines 220 that carry voice traffic and through an additional line 222 that carries the control logic signals that allow the call router to work integrally with the existing PBX system.
Figure 2 shows the call router 218 in greater detail. PBX switch 210 and lines 220 and 222 have also been illustrated. The call router 218 can be configured in a variety of different ways, depending on the architecture of the PBX system. In the illustrated embodiment the call router has three separate audio channels connected respectively to the three lines 220. Of course, the number of channels required will depend on the architecture of the telephone system. Three channels have been illustrated here to illustrate how a system may simultaneously provide spelled name recognition for three callers on each of the three incoming telephone lines 214. To support additional callers additional audio channels may be included or multiplexing circuitry may be included to allow channels to be shared.
Each of the audio channels has a digital signal processor (DSP) 224 and associated analog-to-digital/digital-to-analog conversion circuitry 226. The digital signal processors are coupled to the host processor 228 that includes a data store 230 in which all references or names are stored. Data store 230 may be any suitable digital storage medium such as random access memory. Data store 230 stores the continuous speech recognition dictionary of all names that can be recognized by the system, with the associated telephone exchange numbers. As will be explained more fully below, the preferred embodiment uses a special speech recognizer that is optimized for speaker-independent recognition of continuously spelled names.

Also coupled to host processor 228 (or incorporated as part of the host processor) is the call switching logic 232. This switching logic connects to the signal line 222 and communicates with the PBX switching system, following the communication protocols dictated by the PBX switch.
Before proceeding with a detailed explanation of the speech recognizer, a brief explanation of the operation of the call router 218 may be helpful. Referring to Figures 1 and 2, when an incoming call reaches the PBX switch through one of the telephone lines 214, it may be handled by a human operator without intervention by the call router of the invention. However, if the human operator is not able to handle the call (for example, the call comes in after normal business hours when there is no human operator), the PBX switch is programmed to forward the call to the call router 218. The switch does this by simply assigning the call to one of the audio channels of the call router (to one of the lines 220), based on switching instructions sent on line 222. If desired, the PBX switch can be programmed to jump to a different signal line on a different audio channel within the router 218 if the first line is busy. Having done this, the incoming call is now in communication with a selected one of the DSP processors 224. The processor supplies any required voice prompts to the incoming caller (requesting the caller to spell the name of the desired person) and it also processes the caller's spelled name response.
The details of the speech recognition algorithm used by DSP processors 224 will be described below. As part of the recognition process, the DSP processor 224 downloads from the host 228 a copy of the shared speech recognition resources, namely the data reflecting all reference names and their associated telephone extension numbers. Using an N-best strategy for real-time recognition, the DSP-implemented speech recognizer selects the most probable candidate from data store 230. The name of this candidate is spoken back to the caller using the DSP processor to supply a speech synthesized signal or to play back a prerecorded audio signal rendition of the selected person's name. The caller is then asked to respond "yes" or "no," indicating whether the candidate is the correct one. If it is, the host processor 228 uses the call switching logic 232 to instruct the PBX switch to transfer the call from one of the lines 220 to a selected one of the handsets 216. After this switching has occurred, the call router's audio channel is once again free to handle a new incoming call.
Details of Preferred Speech Recognition Processing

The presently preferred speech recognition system may be viewed as a multipass procedure, with the final pass being used only if the preceding (alignment) pass does not produce a single recognized name as output. The first and final passes employ Hidden Markov Model recognition, while the alignment pass employs dynamic programming alignment with the dictionary. As will be more fully discussed, the first pass (Hidden Markov Model recognition) is, itself, split into plural parallel subpaths. In Figure 5, the first, second and third passes are illustrated. Note that the first pass is bifurcated through separate Hidden Markov Model recognition blocks 26a and 26b.
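The multipass control flow just described can be sketched as follows. All four callables are hypothetical stand-ins (the patent's components are HMM recognizers and a DP aligner); only the control flow, including the early exit when alignment yields a single name, is illustrated.

```python
def recognize(utterance, recognize_g1, recognize_g2, dp_align, constrained_pass):
    """Multipass control flow: parallel first-pass recognition under
    grammars G1 and G2, DP alignment of each hypothesis set against the
    name dictionary, and a final constrained pass only when alignment
    leaves more than one candidate name."""
    n_best = recognize_g1(utterance)   # path assuming clean spelling
    m_best = recognize_g2(utterance)   # path assuming leading noise/speech
    candidates = dp_align(n_best) + dp_align(m_best)
    unique = sorted(set(candidates))
    if len(unique) == 1:               # decision strategy: single name, done
        return unique[0]
    return constrained_pass(utterance, unique)  # dynamic-grammar final pass
```

With stub callables, a single surviving alignment candidate is returned directly and the final pass is never invoked.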
The illustrated embodiment is designed to recognize continuously spelled names comprising a sequence of letters that are supplied to the recognition system as input through a caller's telephone handset 10. To illustrate examples of useful and nonuseful input, two handsets 10 have been illustrated. Into one handset the caller has correctly used the system by supplying the sequence of letters: H-A-N-S-O-N. Into the other handset, the caller has incorrectly used the system, by uttering the spoken name and then following with a sequence of letters: "Hanson" H-A-N-S-O-N. As will be described below, the system is designed to accommodate both correct usage and incorrect usage, resulting in a more robust recognition system.
The recognition system, shown generally at 12, includes a name retrieval system shown generally at 13. As will be discussed, the name retrieval system has the ability to construct dynamic grammars representing a selected subset of entries found in the name dictionary. The dynamic grammars are used in the event recognition is not accomplished in the second pass and processing proceeds to the third pass.
The input sequence of letters may be fed to a suitable speech analysis module 14. This module performs front end optimization designed to decrease the number of substitution, deletion and insertion errors. In a continuously spelled name a substitution error is the substitution of an incorrect letter for the correct one. Figure 6 illustrates at 16 and 18 examples of substitution errors made in the recognition of the spelled name JOHNSON. A deletion error is the omission of one or more letters from the continuously spelled name. This is illustrated at 20 in Figure 6. An insertion error is the inclusion of additional letters not originally uttered in the continuously spelled name. An example of an insertion error is shown at 22 and 24 in Figure 6.
The speech analysis module 14 is designed to operate on digitized speech data. Thus if an analog speech input system is used, the analog signal should first be digitized. This may be done by suitable analog-to-digital circuitry that may be included in the speech analysis module 14.
The presently preferred speech analysis module uses an 8th-order PLP-RASTA process to compensate for the effect of the communication channel. For more information regarding the PLP-RASTA compensation, see H. Hermansky, N. Morgan, A. Bayya and P. Kohn, EUROSPEECH '91, pages 1367-1370, 1991. The presently preferred embodiment uses a 10 millisecond frame shift and a 20 millisecond analysis window. The RASTA filter coefficient is optimized to decrease the number of substitution, deletion and insertion errors. The best filter coefficient compromise is selected for a value of 0.90.
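The effect of RASTA-style temporal filtering can be illustrated with a deliberately simplified sketch: a first-order high-pass filter, with the pole set to the 0.90 compromise value cited above, applied to one coefficient's frame-by-frame trajectory. This is an assumption-laden stand-in for intuition only, not the published PLP-RASTA filter (which uses a higher-order bandpass numerator).

```python
def rasta_like_filter(trajectory, pole=0.90):
    """Simplified RASTA-style filtering of one cepstral coefficient's
    trajectory: first difference plus an IIR pole. Constant (slowly
    varying) channel offsets are suppressed over time."""
    out, prev_in, prev_out = [], 0.0, 0.0
    for x in trajectory:
        y = (x - prev_in) + pole * prev_out  # high-pass: removes constant bias
        out.append(y)
        prev_in, prev_out = x, y
    return out
```

A constant input trajectory (a fixed channel bias) decays geometrically toward zero, which is the channel-compensation behavior the text describes.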
In determining the optimized RASTA filter coefficients, the energy, the first derivative of the energy and the first derivative of the static cepstral coefficients C1 through C8 (computed over … frames) are alternatively combined with the static cepstral coefficients to form the speech parametric representation (a total of 18 coefficients). Figure 7 illustrates the optimized RASTA filter coefficients that will decrease the number of substitution, deletion and insertion errors. In this figure PLP-RASTA stands for the combination of energy, the first derivative of energy, static cepstral coefficients and first derivative of static cepstral coefficients.
While the PLP-RASTA optimization is presently preferred, other forms of optimization may also be used. For example, a mel frequency cepstrum coefficient (MFCC) analysis may alternatively be used. Suitable results can be obtained using a 14th-order MFCC analysis. For the MFCC analysis, 11 static cepstral coefficients (C0 included) are computed with a frame shift of 16 milliseconds and an analysis window of 32 milliseconds.
Different recognition accuracy may be obtained using different feature sets. These feature sets may include static features and dynamic features separately and combined. To illustrate the robustness of the parameterization used in the invention, clean as well as filtered data were used. To obtain filtered data for the test set in the presently preferred embodiment, a distorting filter is used and the test data is filtered to artificially create a mismatch between the training set and the test set. In this regard, see H. Murveit, J. Butzberger and M. Weintraub, in DARPA Workshop Speech and Natural Language, pages 280-284, February 1992.
Returning to Figure 5, the output from speech analysis module 14 is split into two paths, one associated with Hidden Markov Model recognition block 26a and one associated with Hidden Markov Model recognition block 26b. Recognition block 26a works with a pre-defined letter grammar G1, depicted diagrammatically at 28a. Recognition block 26b works with a different letter grammar G2, depicted diagrammatically at 28b. These different letter grammars are constructed as grammar networks illustrated in Figures 3 and 4, respectively. These grammar networks are graphs comprising nodes associated with each of the possible letters and what node-to-node transitions are possible. The grammars both include a silence node followed by letter loops, where any letter may follow any letter. Grammar G1 of Figure 3 begins with a silence (Sil) node 50, transitioning to individual starting letters A, B, C .... Grammar G2, depicted in Figure 4, begins with a filler node 52 to represent extraneous speech or noise spoken prior to spelling. The filler node transitions to silence node 52 and then to the individual letter nodes as G1. In the presently preferred implementation, recognition blocks 26a and 26b are frame-synchronous, first order, continuous density Hidden Markov Model recognizers employing Viterbi decoding.
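The structural difference between the two grammar networks can be sketched as adjacency maps. A toy two-letter alphabet is an assumption to keep the sketch readable; the real networks span the full alphabet.

```python
# Each grammar is a graph mapping a node to its possible successors.
# "sil" is the silence node; "filler" absorbs extraneous speech; letter
# nodes loop so that any letter may follow any letter.
LETTERS = ["A", "B"]

def grammar_g1():
    g = {"sil": list(LETTERS)}      # G1: silence, then starting letters
    for letter in LETTERS:
        g[letter] = list(LETTERS)   # letter loop
    return g

def grammar_g2():
    g = grammar_g1()
    g["filler"] = ["sil"]           # G2: filler precedes silence, then letters
    return g
```

G2 is G1 with one extra entry node, which is exactly what produces the different segmentations of the same acoustic data.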
The presently preferred embodiment employs a modified Viterbi decoder that yields the N-best or M-best hypotheses (instead of a single hypothesis). Usually the Viterbi decoder is designed to provide only the best hypothesis, based on probability of a match between HMM models and the test utterance. This standard Viterbi decoder is modified for use in the invention so that it provides the N-best or M-best hypotheses, based on the highest probabilities of a match between HMM models and the test utterance. Recognition blocks 26a and 26b each generate their own N-best or M-best hypotheses. If desired, these two recognition blocks need not generate the same number of hypotheses, although in the preferred embodiment the same number is used (e.g., N=M=10). Thus in Figure 5, recognition block 26a yields the N-best hypotheses and recognition block 26b yields the M-best hypotheses. As previously noted, the symbols N and M may be any integer greater than 1. The precise value chosen for integers N and M may depend on the speed of the processor and on the memory size. The techniques for generating the N-best (or M-best) letter candidates will be discussed more fully below. It will be understood that the techniques for generating the N-best (or M-best) hypotheses are essentially the same for both cases.
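A toy version of N-best decoding, retaining several hypotheses instead of the single Viterbi winner, can be sketched as follows. The three-letter alphabet and the per-frame scorer are assumptions; a real decoder works over HMM states, not whole letters per frame.

```python
import heapq

def n_best_decode(frames, letter_scores, n=3):
    """Toy N-best search: at each frame, extend every kept hypothesis by
    every letter, then prune back to the n highest-scoring partial
    hypotheses (log-scores add). letter_scores(frame, letter) is a
    hypothetical per-frame acoustic scorer."""
    hypotheses = [(0.0, "")]                      # (score, letter string)
    for frame in frames:
        extended = [(score + letter_scores(frame, letter), hyp + letter)
                    for score, hyp in hypotheses
                    for letter in "ABC"]
        hypotheses = heapq.nlargest(n, extended)  # keep only the N best
    return hypotheses
```

Setting n=1 recovers ordinary single-best decoding; larger n preserves the runner-up letter sequences that the later alignment stage needs.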
The Hidden Markov Model recognizers employed at 26a and 26b are provided with beam search capability designed to limit the search space, so that the recognizer will process the incoming speech more quickly. The Hidden Markov Model recognizer produces a score that represents the likelihood of a match between input speech and reference speech. Without the beam search mechanism, the recognizer must score all possible paths at each frame during the search process. With beam search the recognizer considers only those paths whose scores deviate from the best score no more than an amount equal to the beam width. Rather than searching the entire search space, a beam search is implemented whereby the least likely search paths are pruned, such that only the best hypotheses are returned.
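The pruning rule stated above (keep only paths within the beam width of the best score) can be sketched directly; the tuple representation of a path is an assumption for the sketch.

```python
def beam_prune(scored_paths, beam_width):
    """Keep only paths whose score lies within beam_width of the best
    score at the current frame; all other paths are abandoned. This
    shrinks the search space at the risk of discarding the true path
    if the beam is too narrow."""
    best = max(score for score, _ in scored_paths)
    return [(score, path) for score, path in scored_paths
            if best - score <= beam_width]
```

A wider beam trades speed for safety: with an infinite beam this reduces to the exhaustive search the text describes.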
The N-best (or M-best) hypotheses resulting from the recognizers 26a and 26b are then passed to dynamic programming (DP) alignment modules 38a and 38b, respectively. The dynamic programming alignment modules have access to an associated name dictionary 39 against which the N-best (or M-best) hypotheses are compared. Dynamic programming is used to account for insertion, substitution and deletion errors.
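The DP alignment can be sketched as a standard edit-distance recurrence charging unit penalties for the three error types of Figure 6; the helper names and unit costs are assumptions, not the patent's exact scoring.

```python
def align_cost(hypothesis, name, ins=1, dele=1, sub=1):
    """Dynamic-programming alignment cost between a letter-sequence
    hypothesis and a dictionary name, penalizing insertions, deletions
    and substitutions."""
    rows, cols = len(hypothesis) + 1, len(name) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * ins
    for j in range(1, cols):
        d[0][j] = j * dele
    for i in range(1, rows):
        for j in range(1, cols):
            match = 0 if hypothesis[i - 1] == name[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + ins,        # extra letter inserted
                          d[i][j - 1] + dele,       # letter deleted
                          d[i - 1][j - 1] + match)  # substitution or match
    return d[rows - 1][cols - 1]

def closest_names(hypothesis, dictionary, k=2):
    """Return the k dictionary names with the lowest alignment cost."""
    return sorted(dictionary, key=lambda name: align_cost(hypothesis, name))[:k]
```

For example, the hypothesis "JOHNSN" (a deletion error against JOHNSON) aligns at cost 1, so the correct name still ranks first.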
`
`20
`
`In some instances,
`
`the result of dynamic programming alignment
`
`will produce a single name with no other candidates.
`
`Decision strategy
`
`module 40 detects this and provides the recognized name as the output, when
`
`there is only one candidate resulting from the DP alignment.
`
`In most cases,
`
`however,
`
`a single candidate does not result,
`
`in which case the decision
`
`20
`
`strategy module passes the N-best and M-best hypotheses to module 42 for
`
`building a dynamic grammar.
`
Module 42 builds a grammar using the N-best and M-best candidates provided by the DP alignment modules. The highly constrained recognizer 44 is then invoked to evaluate the N-best and M-best candidates using the dynamic grammar 42. The recognizer 44 may also be a Hidden Markov Model recognizer. Even though highly constrained, the data pass through this recognizer is not time-consuming because the dynamic grammar is small and because the parametric representation (computed in 14) need not be recomputed. If desired, a neural network discriminator can be applied at the output of recognizers 26a and 26b or recognizer 44.
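The dynamic-grammar rescoring step might be sketched as follows. Both function names and the scoring interface are illustrative assumptions: the grammar here is simply the union of surviving candidates, and `score_against` stands in for the constrained HMM pass over the precomputed parametric representation.

```python
# Hedged sketch of the dynamic-grammar pass: build a small grammar from the
# N-best and M-best survivors, then pick the candidate that best matches the
# utterance. `score_against` is a stand-in for the constrained recognizer.

def build_dynamic_grammar(n_best, m_best):
    """The dynamic grammar accepts only the candidate names from both passes."""
    return sorted(set(n_best) | set(m_best))

def constrained_pass(utterance_letters, grammar, score_against):
    """Evaluate the utterance against each grammar entry; return the best match."""
    return max(grammar, key=lambda name: score_against(utterance_letters, name))
```

Because the grammar contains only a handful of names rather than the whole dictionary, this second pass is cheap even though each candidate is scored exactly.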
The listing in Appendix A shows how the system of the invention performs in recognizing the spelled name WILSON. In the listing, the section designated [First Pass] shows all hypotheses produced by both grammars. None of these is the name WILSON. In the section labeled [DP Alignment] the top candidates have been listed; included in the list is the name WILSON (candidate 1 of 10). In the section labeled [Costly Constrained Pass] the input utterance is compared with only the candidates selected during DP Alignment. In this case, the recognizer correctly detects the name WILSON.
N-Best Processing Techniques
The N-best or M-best candidates are chosen using an N-best selection algorithm. For details regarding this technique, see R. Schwartz and Steve Austin, "Efficient, High Performance Algorithms for N-Best Search," DARPA Workshop on Speech Recognition, pp. 6-11, 1990. In speech recognition, the incoming speech data are broken up into time frames and analyzed on a frame-by-frame basis. For any given utterance, there may be several possible hypotheses. The presently preferred N-best (or M-best) algorithm selects the best starting time for a letter based only on the preceding letter and not on letters before the preceding letter.
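The frame-by-frame bookkeeping behind N-best selection can be sketched roughly as follows. This is a simplified illustration that keeps the top N partial letter sequences at each step; the patent's word-dependent start-time approximation (conditioning a letter's start time only on the preceding letter) is not reproduced here, and `letter_score` is a hypothetical stand-in for the HMM scorer.

```python
# Rough sketch of N-best bookkeeping: at each frame, only the n highest-scoring
# partial letter sequences are retained. `letter_score` is a hypothetical
# per-letter scorer, not the patent's HMM.
import heapq

def n_best(frames, letters, letter_score, n=10):
    """Track the top-n letter sequences, extending each by one letter per frame."""
    hyps = [(0.0, ())]  # (accumulated log score, letter sequence)
    for frame in frames:
        expanded = [(score + letter_score(frame, seq, letter), seq + (letter,))
                    for score, seq in hyps
                    for letter in letters]
        hyps = heapq.nlargest(n, expanded)  # keep only the n best hypotheses
    return hyps
```

Limiting the list to n entries per frame keeps the cost linear in the number of frames, at the price of possibly discarding a sequence that a full search would have ranked among the true N best.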
As each letter is spoken and analyzed, the Hidden Markov Model recognizer will generate probability scores for each of the models. Because the objective of the system is ultimately to select the most probable sequence of letters, the system stores a plurality of paths, representing possible spelled combinations of letters. To make the system work better as a real time recognizer, two different levels of data pruning are implemented. The pruning tech
