`
HARNESS, DICKEY & PIERCE, P.L.C.
`ATTORNEYS AND COUNSELORS
`P.O. BOX 828
`BLOOMFIELD HILLS, MICHIGAN 48303
`U.S.A.
`
`Date: April 16, 1997
`
`Hon. Commissioner of Patents and Trademarks
`Washington, D. C. 20231
`
Re:

Inventors: Jean-Claude Junqua; Michael Galler

For: SPEECH RECOGNITION SYSTEM EMPLOYING MULTIPLE GRAMMAR NETWORKS

Atty. Docket No.: 9432-000024
`
`Sir:
`
TELEPHONE: (810) 641-16…
TELEFACSIMILE: (810) 641-02…
`
`
Transmitted herewith for filing is the above referenced patent application.

1.  [X] Eight formal sheets of drawings showing Figures 1-9 are enclosed.

2.  [ ] A Verified Statement Claiming Small Entity Status is enclosed.

3a. [ ] A check is enclosed to cover the fees as calculated below. The Commissioner is
        hereby authorized to charge any additional fees which may be required, or credit
        any overpayment to Deposit Account No. 08-0750. A duplicate copy of this
        document is enclosed.
`
3b. [X] The fees calculated below will be paid within the time allotted for completion of
        the filing requirements.

3c. [ ] The fees calculated below are to be charged to Deposit Account No. 08-0750.
        The Commissioner is hereby authorized to charge any additional fees which may
        be required, or credit any overpayment to said Deposit Account. A duplicate copy
        of this document is enclosed.
`
FILING FEE CALCULATION

                                  Number Filed   Number Extra   Rate        Fee
Basic Fee                                                                   $770.00
Total Claims                      21 - 20 =      1              x $22.00 =  $22.00
Independent Claims                4 - 3 =        1              x $80.00 =  $80.00
Multiple Dependent Claim(s) Used ..............................  ($260.00)

FILING FEE - NON-SMALL ENTITY .................................  $872.00

FILING FEE - SMALL ENTITY: Reduction by 1/2 ...................
    [ ] Verified Statement enclosed.
    [ ] Verified Statement filed in priority application.

Assignment Recordal Fee ($40.00) ..............................
37 C.F.R. §1.17(k) Fee (non-English application) ..............

TOTAL .........................................................  $872.00
`Page 1 of 2
`
`
`
`
4.  [ ] An Assignment of the invention is enclosed. The required cover sheet under 37
        C.F.R. §3.11, §3.28 and §3.41 is attached.

5.  [X] A signed Oath/Declaration [ ] is enclosed [X] will be filed in accordance with
        37 C.F.R. §1.63.

6.  [ ] Foreign Priority: Priority based on Application No. ________, filed ________,
        is claimed.

7.  [ ] A copy of the above referenced priority document [ ] is enclosed [ ] will be filed
        in due course, pursuant to 35 U.S.C. §119(a)-(d).

8.  [ ] Because the enclosed application is in a non-English language, a verified English
        translation for examination purposes of same [ ] is enclosed [ ] will be filed as
        soon as it is available.
`
9.  [ ] Provisional Application Priority: Priority based on United States Provisional
        Application No. ________, filed ________, is claimed under 35 U.S.C. §119(e).

10. [ ] A Preliminary Amendment is enclosed.

11. [ ] An Information Disclosure Statement, ____ sheets of PTO Form 1449, and ____
        patent(s)/publications/documents are enclosed.

12. [ ] An Establishment of Assignee's Right To Prosecute Application Under 37 C.F.R.
        §3.73(b), and Power Of Attorney is enclosed.

13. [X] An Express Mailing Certificate is enclosed.

14. [ ] Other: ________
`
Attention is directed to the fact that the address of this firm has been designated as the correspondence address for this application.
`
`Respectfully,
`
`
`
`Page 2 of 2
`
`
`
SPEECH RECOGNITION SYSTEM
EMPLOYING MULTIPLE GRAMMAR NETWORKS
`
`
Cross Reference to Related Applications

This is a continuation-in-part of U.S. patent application Serial Number 08/642,766, filed May 6, 1996, entitled "Call Routing Device Employing Continuous Speech" by applicants Jean-Claude Junqua and Michael Galler.
`
Field of the Invention

The present invention relates generally to computer-implemented speech recognition. More particularly, the invention relates to a method and apparatus for processing acoustic speech data using multiple grammar networks. Use of multiple networks results in different segmentation of the acoustic speech data, to facilitate extracting speech that has utility from speech that does not.

Although the invention has many uses, it is described here in a spelled name recognition system of the type suitable for telephone call routing applications. In the illustrated embodiment, first and second grammar networks are used to separately detect the N-best and M-best letter sequences. One grammar network is configured on the assumption that the user will immediately begin spelling in response to the system prompt. The second grammar network is configured on the assumption that the spelled name letter sequence begins with extraneous noise or utterances that are not recognized by the system. The N-best and M-best letter sequences are separately subjected to a dynamic programming match against a dictionary of valid names, to extract the N-best and M-best name hypotheses corresponding to each of the N-best and M-best letter sequences. The recognition decision is then made by selecting the best candidate from these sets of name hypotheses.
`
`
`
Background and Summary of the Invention

Current speech recognition technology involves the recognition of patterns in acoustic data and the correlation of those patterns with a predetermined set of dictionary entries that are recognized by the system. The speech recognition problem is an extremely challenging one because there are so many different variants. In general, the speech recognizer applies the incoming acoustic data in digital form to a mathematical recognition process that converts the digital data into parameters based on a predefined model.

Conventionally, the model is one that has been previously trained using a sufficiently large training set so that individual speaker variances are largely abated. The model-based recognition process segments the incoming data into elemental components, such as phonemes, that are then labeled by comparison with the trained model. In one form of recognizer, once individual phonemes have been labeled, the phoneme data are compared with prestored words in the system dictionary. This comparison is performed through an alignment process that will accommodate imprecise matches due to incorrect phoneme recognition as well as insertion and deletion of phonemes within a given sequence. The system works on the basis of probabilities. Conventionally, the speech recognizer will select the most probable word candidate that results from the previously described segmentation, labeling and alignment process.
`
By their very nature, current speech recognizers select word candidates from the predefined dictionary and hence they will only recognize a predefined set of words. This raises a problem, particularly in systems where further decisions are made based on the results of speech recognition. Extraneous noise or speech utterances of words that are not found in the dictionary are often incorrectly interpreted as words that are found in the dictionary. Subsequent decisions based on such incorrect recognition can lead to faulty system performance.

To illustrate the problem, consider a spelled name telephone call routing application. The user is instructed by synthesized voice prompt to spell the name of the person to whom the call should be routed. If the user follows these instructions, the speech recognizer identifies each of the letters uttered and then is able to look up the spelled name by alignment of the sequence of letters with the dictionary. The system then routes the call to the appropriate extension using routing information found in the dictionary. However, if the user utters extraneous information first, such as by pronouncing the person's name prior to spelling, the recognition process is highly likely to fail. This is because the recognition system expects to receive only a sequence of uttered letters and will attempt to "recognize" the spoken name as one or more letters. The conventional system simply is not equipped to properly segment the incoming acoustic data, because the underlying model on which the system is built assumes a priori that the data are all equivalent units (spoken letters) that are useful or meaningful to the system.
`
`
`
The present invention solves the aforementioned problem through a speech recognition system that employs and integrates multiple grammar networks to generate multiple sets of recognition candidates, some based on a model that assumes the presence of extraneous speech, and some based on a model that assumes the absence of extraneous speech. The results of both models are used to make the ultimate recognition decision, relying on the respective match probability scores to select the most likely candidate.

According to one aspect of the invention, the acoustic speech data are separately processed using different first and second grammar networks that result in different segmentation of the acoustic speech data. In this way, the system will extract speech that has utility from speech that does not. For each grammar network, a plurality of recognition candidates are generated. The preferred embodiment generates the N-best candidates using a first grammar network and the M-best candidates using a second grammar network, where N and M are integers greater than one and may be equal. The first and second plurality of recognition candidates (N-best, M-best) are transformed based on at least one set of a priori constraints about the speech that has utility. The transformation may comprise, for example, matching the candidates to a dictionary of spelled names that are recognized by the system. The recognition decision is then based on the transformed recognition candidates.
`
As will be more fully explained below, the invention splits the acoustic speech data into two or more paths that are each handled differently. One path is processed using a first grammar network based on the assumption that only useful utterances (e.g., letters) are supplied. Another path is processed using a different grammar network that assumes extraneous, non-useful speech precedes the useful speech. The different grammar networks thus result in different segmentation of the data.

The recognition candidates generated by each path may each be scored according to how well each candidate matched the respective models. Rather than requiring the two paths to compete at this stage (in order to select the single candidate with the highest score), the two sets of recognition candidates are maintained separate. At this stage, the recognition candidates represent the N-best and M-best letter sequence hypotheses. To select which hypothesis is the best candidate, both sets are separately matched to the dictionary of all names that are recognized by the system.

The dictionary is, in effect, a collection of a priori constraints about the speech that will have utility to the system. Thus certain letter sequence hypotheses may be scored less probable because those letter sequences do not well match the letter sequences stored in the dictionary. The presently preferred embodiment uses the N-best and M-best letter sequences to select the N-best and M-best names from the dictionary. Thus contributions from both paths are included in the decision making process. Finally, the N-best and M-best sets of names may be combined to form a reduced set of dictionary candidates to which the input utterance is applied.

This reduced size dictionary can be used to build a dynamic grammar that is built from the N-best and M-best name candidates. This dynamic grammar will tend to favor one set of candidates or the other, depending on whether the input utterance contains extraneous speech or not. If extraneous speech is present, the grammar network designed to identify and reject the extraneous speech will tend to produce better recognition results, and those results will be reflected as the better candidates in the dynamic grammar that is constructed from the N-best and M-best name candidates. On the other hand, if no extraneous speech is present, the other grammar network will produce better recognition results, and it will be better reflected as the better candidates in the dynamic grammar.

Once the dynamic grammar is constructed, the input acoustic speech data may be processed using a recognizer based on the dynamic grammar to extract the single most probable name candidate as the recognized name. The recognized name is then used to access a suitable database for routing the telephone call appropriately.

For a more complete understanding of the invention, its objects and advantages, reference may be had to the following specification and to the accompanying drawings.
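By way of illustration only, the two-path decision flow summarized above can be sketched in Python. This sketch is not part of the original disclosure: the two recognizers are stubbed as lists of letter-sequence hypotheses, and the dictionary-matching and final scoring functions are hypothetical placeholders supplied by the caller.

```python
def recognize(nbest_letters, mbest_letters, match_names, final_score):
    """Sketch of the two-path flow: each path's letter hypotheses are
    matched to dictionary names, both name sets are merged into a reduced
    dynamic grammar, and a final constrained pass picks the best name.

    nbest_letters / mbest_letters: hypotheses from the two grammar networks.
    match_names: maps one letter hypothesis to candidate dictionary names.
    final_score: scores a name against the input utterance (stand-in for
    the final constrained HMM pass)."""
    n_names = [nm for h in nbest_letters for nm in match_names(h)]
    m_names = [nm for h in mbest_letters for nm in match_names(h)]
    dynamic_grammar = sorted(set(n_names + m_names))  # reduced dictionary
    return max(dynamic_grammar, key=final_score)
```

Both paths contribute candidates to the dynamic grammar, so whichever grammar network segmented the utterance correctly tends to supply the winning name.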
`
`
`
Brief Description of the Drawings

Figure 1 is a block diagram of an exemplary system using the call routing device of the invention;

Figure 2 is a block diagram of an exemplary embodiment of the call routing device of the invention;

Figure 3 is a state diagram illustrating the grammar network G1, configured on the assumption that the spelled name letter sequence begins with valid letters;

Figure 4 is a state diagram illustrating the grammar network G2, configured on the assumption that the spelled name letter sequence begins with extraneous noise or utterances that are not recognized by the system;

Figure 5 is a detailed block diagram of the presently preferred recognition system of the invention;

Figure 6 is a diagram illustrating different types of recognition errors;

Figure 7 is a graph showing optimization of the PLP-RASTA filter coefficients to decrease the number of substitution, deletion and insertion errors;

Figure 8 is a diagram showing the improved lattice N-best technique;

Figure 9 is a diagram further describing how hypotheses generation is performed during the backtracking phase of recognition.
`
Description of the Preferred Embodiment

The principles of the invention will be illustrated and described in the context of a call routing device that will prompt users to supply call routing information by verbally spelling names into the system. Therefore, to aid in understanding the speech recognition system, a brief description will first be provided of a call routing device in which the speech recognition system may be employed. It should be kept in mind, however, that the speech recognition system of the invention is not limited to call routing devices. Rather, the recognition system finds utility in a wide range of different applications where useful speech must be extracted from extraneous noise or speech that is not useful.
`
System Overview and Basic Operation

The call routing device employing continuous speech recognition will be illustrated in an exemplary embodiment, suitable for plug-and-play connection to an existing PBX switch, or for incorporation into the PBX equipment at the time of manufacture. Referring to Figure 1, the PBX switch 210 is connected to the telephone network infrastructure 212 by conventional means such as telephone lines 214. In the illustrated embodiment three lines have been illustrated for convenience. This is not intended as a limitation of the invention as the invention is capable of being deployed in systems having a greater or fewer number of telephone lines.

The PBX switch is of conventional design, capable of routing incoming calls from network 212 to any of the selected telephone devices such as handsets 216. The spelled name recognition call router 218 of the invention is connected, as the handsets 216 are connected, to additional extensions or ports on the PBX switch 210. As will be more fully discussed, the presently preferred embodiment connects to the PBX switch through a plurality of lines 220 that carry voice traffic and through an additional line 222 that carries the control logic signals that allow the call router to work integrally with the existing PBX system.
`
Figure 2 shows the call router 218 in greater detail. PBX switch 210 and lines 220 and 222 have also been illustrated. The call router 218 can be configured in a variety of different ways, depending on the architecture of the PBX system. In the illustrated embodiment the call router has three separate audio channels connected respectively to the three lines 220. Of course, the number of channels required will depend on the architecture of the telephone system. Three channels have been illustrated here to illustrate how a system may simultaneously provide spelled name recognition for three callers on each of the three incoming telephone lines 214. To support additional callers, additional audio channels may be included or multiplexing circuitry may be included to allow channels to be shared.

Each of the audio channels has a digital signal processor (DSP) 224 and associated analog-to-digital/digital-to-analog conversion circuitry 226. The digital signal processors are coupled to the host processor 228 that includes a data store 230 in which all references or names are stored. Data store 230 may be any suitable digital storage medium such as random access memory. Data store 230 stores the continuous speech recognition dictionary of all names that can be recognized by the system, with the associated telephone exchange numbers. As will be explained more fully below, the preferred embodiment uses a special speech recognizer that is optimized for speaker-independent recognition of continuously spelled names.

Also coupled to host processor 228 (or incorporated as part of the host processor) is the call switching logic 232. This switching logic connects to the signal line 222 and communicates with the PBX switching system, following the communication protocols dictated by the PBX switch.
`
Before proceeding with a detailed explanation of the speech recognizer, a brief explanation of the operation of the call router 218 may be helpful. Referring to Figures 1 and 2, when an incoming call reaches the PBX switch through one of the telephone lines 214, it may be handled by a human operator without intervention by the call router of the invention. However, if the human operator is not able to handle the call (for example, the call comes in after normal business hours when there is no human operator), the PBX switch is programmed to forward the call to the call router 218. The switch does this by simply assigning the call to one of the audio channels of the call router (to one of the lines 220), based on switching instructions sent on line 222. If desired, the PBX switch can be programmed to jump to a different signal line on a different audio channel within the router 218 if the first line is busy.

Having done this, the incoming call is now in communication with a selected one of the DSP processors 224. The processor supplies any required voice prompts to the incoming caller (requesting the caller to spell the name of the desired person) and it also processes the caller's spelled name response. The details of the speech recognition algorithm used by DSP processors 224 will be described below. As part of the recognition process, the DSP processor 224 downloads from the host 228 a copy of the shared speech recognition resources, namely the data reflecting all reference names and their associated telephone extension numbers. Using an N-best strategy for real-time recognition, the DSP-implemented speech recognizer selects the most probable candidate from data store 230.

The name of this candidate is spoken back to the caller using the DSP processor to supply a speech synthesized signal or to play back a prerecorded audio signal rendition of the selected person's name. The caller is then asked to respond "yes" or "no," indicating whether the candidate is the correct one. If it is, the host processor 228 uses the call switching logic 232 to instruct the PBX switch to transfer the call from one of the lines 220 to a selected one of the handsets 216. After this switching has occurred, the call router's audio channel is once again free to handle a new incoming call.
`
Details of Preferred Speech Recognition Processing

The presently preferred speech recognition system may be viewed as a multipass procedure, with the final pass being used only if the preceding (alignment) pass does not produce a single recognized name as output. The first and final passes employ Hidden Markov Model recognition, while the alignment pass employs dynamic programming alignment with the dictionary. As will be more fully discussed, the first pass (Hidden Markov Model recognition) is, itself, split into plural parallel subpaths. In Figure 5, the first, second and third passes are illustrated. Note that the first pass is bifurcated through separate Hidden Markov Model recognition blocks 26a and 26b.
`
The illustrated embodiment is designed to recognize continuously spelled names comprising a sequence of letters that are supplied to the recognition system as input through a caller's telephone handset 10. To illustrate examples of useful and nonuseful input, two handsets 10 have been illustrated. Into one handset the caller has correctly used the system by supplying the sequence of letters: H-A-N-S-O-N. Into the other handset, the caller has incorrectly used the system, by uttering the spoken name and then following with a sequence of letters: "Hanson" H-A-N-S-O-N. As will be described below, the system is designed to accommodate both correct usage and incorrect usage, resulting in a more robust recognition system.

The recognition system, shown generally at 12, includes a name retrieval system shown generally at 13. As will be discussed, the name retrieval system has the ability to construct dynamic grammars representing a selected subset of entries found in the name dictionary. The dynamic grammars are used in the event recognition is not accomplished in the second pass and processing proceeds to the third pass.
`
The input sequence of letters may be fed to a suitable speech analysis module 14. This module performs front end optimization designed to decrease the number of substitution, deletion and insertion errors. In a continuously spelled name a substitution error is the substitution of an incorrect letter for the correct one. Figure 6 illustrates at 16 and 18 examples of substitution errors made in the recognition of the spelled name JOHNSON. A deletion error is the omission of one or more letters from the continuously spelled name. This is illustrated at 20 in Figure 6. An insertion error is the inclusion of additional letters not originally uttered in the continuously spelled name. An example of an insertion error is shown at 22 and 24 in Figure 6.

The speech analysis module 14 is designed to operate on digitized speech data. Thus if an analog speech input system is used, the analog signal should first be digitized. This may be done by suitable analog-to-digital circuitry that may be included in the speech analysis module 14.
`
The presently preferred speech analysis module uses an 8th-order PLP-RASTA process to compensate for the effect of the communication channel. For more information regarding the PLP-RASTA compensation, see H. Hermansky, N. Morgan, A. Bayya and P. Kohn, EUROSPEECH '91, pages 1367-1370, 1991. The presently preferred embodiment uses a 10 millisecond frame shift and a 20 millisecond analysis window. The RASTA filter coefficient is optimized to decrease the number of substitution, deletion and insertion errors. The best filter coefficient compromise is selected for a value of 0.90.

In determining the optimized RASTA filter coefficients, the energy, the first derivative of the energy and the first derivative of the static cepstral coefficients C1 through C8 (computed over 7 frames) are alternatively combined with the static cepstral coefficients to form the speech parametric representation (a total of 18 coefficients). Figure 7 illustrates the optimized RASTA filter coefficients that will decrease the number of substitution, deletion and insertion errors. In this figure PLP-RASTA stands for the combination of energy, the first derivative of energy, static cepstral coefficients and first derivative of static cepstral coefficients.

While the PLP-RASTA optimization is presently preferred, other forms of optimization may also be used. For example, a mel frequency cepstrum coefficient (MFCC) analysis may alternatively be used. Suitable results can be obtained using a 14th-order MFCC analysis. For the MFCC analysis, 11 static cepstral coefficients (C0 included) are computed with a frame shift of 16 milliseconds and an analysis window of 32 milliseconds.

Different recognition accuracy may be obtained using different feature sets. These feature sets may include static features and dynamic features separately and combined. To illustrate the robustness of the parameterization used in the invention, clean as well as filtered data were used. To obtain filtered data for the test set in the presently preferred embodiment, a distorting filter is used and the test data is filtered to artificially create a mismatch between the training set and the test set. In this regard, see H. Murveit, J. Butzberger and M. Weintraub, in DARPA Workshop Speech and Natural Language, pages 280-284, February 1992.
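As a rough illustration of the RASTA filtering step, the band-pass filter can be applied independently to each per-frame feature trajectory. The text above specifies only the pole value (0.90); the FIR numerator taps 0.1*[2, 1, 0, -1, -2] used below are the commonly published RASTA form and are an assumption, not taken from this document.

```python
def rasta_filter(traj, pole=0.90):
    """Band-pass filter one per-frame feature trajectory (list of floats),
    suppressing slowly varying channel effects.

    FIR part: 0.1*[2, 1, 0, -1, -2] (assumed standard RASTA numerator);
    IIR part: single pole `pole`, the 0.90 value selected in the text."""
    taps = [0.2, 0.1, 0.0, -0.1, -0.2]
    out, prev = [], 0.0
    for t in range(len(traj)):
        # FIR term over the last five frames (shorter at the start).
        fir = sum(taps[k] * traj[t - k] for k in range(5) if t - k >= 0)
        prev = fir + pole * prev  # single-pole IIR integration
        out.append(prev)
    return out
```

Because the filter is band-pass, a constant (channel-induced) offset in the trajectory decays toward zero at the output, which is the compensation effect described above.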
Returning to Figure 5, the output from speech analysis module 14 is split into two paths, one associated with Hidden Markov Model recognition block 26a and one associated with Hidden Markov Model recognition block 26b. Recognition block 26a works with a pre-defined letter grammar G1, depicted diagrammatically at 28a. Recognition block 26b works with a different letter grammar G2, depicted diagrammatically at 28b. These different letter grammars are constructed as grammar networks illustrated in Figures 3 and 4, respectively. These grammar networks are graphs comprising nodes associated with each of the possible letters and what node-to-node transitions are possible. The grammars both include a silence node followed by letter loops, where any letter may follow any letter. Grammar G1 of Figure 3 begins with a silence (Sil) node 50, transitioning to individual starting letters A, B, C .... Grammar G2, depicted in Figure 4, begins with a filler node 52 to represent extraneous speech or noise spoken prior to spelling. The filler node transitions to a silence node and then to the individual letter nodes as in G1. In the presently preferred implementation, recognition blocks 26a and 26b are frame-synchronous, first order, continuous density Hidden Markov Model recognizers employing Viterbi decoding.
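The two letter-loop grammars described above can be sketched as simple transition graphs. This is an illustrative sketch only; the node names ("sil", "filler") and the dictionary-of-lists representation are invented for the example.

```python
LETTERS = [chr(c) for c in range(ord('A'), ord('Z') + 1)]

def build_g1():
    # G1: silence node -> any starting letter; then a letter loop in
    # which any letter may follow any letter.
    g = {"sil": list(LETTERS)}
    for letter in LETTERS:
        g[letter] = list(LETTERS)
    return g

def build_g2():
    # G2: a filler node (extraneous speech/noise before spelling)
    # -> silence -> the same letter loop as G1.
    g = build_g1()
    g["filler"] = ["sil"]
    return g

def accepts(grammar, start, path):
    """True if `path` is a valid node-to-node walk starting from `start`."""
    node = start
    for nxt in path:
        if nxt not in grammar.get(node, []):
            return False
        node = nxt
    return True
```

The only structural difference between the two networks is the extra filler node in G2, which is what allows it to segment off leading extraneous speech.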
The presently preferred embodiment employs a modified Viterbi decoder that yields the N-best or M-best hypotheses (instead of a single hypothesis). Usually the Viterbi decoder is designed to provide only the best hypothesis, based on probability of a match between HMM models and the test utterance. This standard Viterbi decoder is modified for use in the invention so that it provides the N-best or M-best hypotheses, based on the highest probabilities of a match between HMM models and the test utterance. Recognition blocks 26a and 26b each generate their own N-best or M-best hypotheses. If desired, these two recognition blocks need not generate the same number of hypotheses, although in the preferred embodiment the same number is used (e.g., N=M=10). Thus in Figure 5, recognition block 26a yields the N-best hypotheses and recognition block 26b yields the M-best hypotheses. As previously noted, the symbols N and M may be any integer greater than 1. The precise value chosen for integers N and M may depend on the speed of the processor and on the memory size. The techniques for generating the N-best (or M-best) letter candidates will be discussed more fully below. It will be understood that the techniques for generating the N-best (or M-best) hypotheses are essentially the same for both cases.
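The idea of keeping the N best hypotheses rather than a single winner can be sketched as one extension step over partial letter sequences. This is a simplified illustration (the actual decoder is frame-synchronous Viterbi over HMM states, not letter-by-letter); the scores here are hypothetical log probabilities.

```python
import heapq

def extend_nbest(partials, letter_scores, n):
    """One step of a simplified N-best decode.

    partials: list of (letter_sequence, log_score) pairs kept so far.
    letter_scores: hypothetical per-letter log scores for the next step.
    Returns the n highest-scoring extended hypotheses."""
    extended = [(seq + [letter], score + ls)
                for seq, score in partials
                for letter, ls in letter_scores.items()]
    return heapq.nlargest(n, extended, key=lambda p: p[1])
```

Repeating this step letter by letter maintains a ranked set of N candidate letter sequences instead of the single best path a standard Viterbi decoder would keep.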
`
The Hidden Markov Model recognizers employed at 26a and 26b are provided with beam search capability designed to limit the search space, so that the recognizer will process the incoming speech more quickly. The Hidden Markov Model recognizer produces a score that represents the likelihood of a match between input speech and reference speech. Without the beam search mechanism, the recognizer must score all possible paths at each frame during the search process. With beam search the recognizer considers only those paths whose scores deviate from the best score no more than an amount equal to the beam width. Rather than searching the entire search space, a beam search is implemented whereby the least likely search paths are pruned, such that only the best hypotheses are returned.
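The frame-level pruning rule just described can be sketched in a few lines. The path labels and log scores below are hypothetical; only the beam-width test reflects the mechanism in the text.

```python
def prune_to_beam(paths, beam_width):
    """Keep only the paths whose log score is within `beam_width` of the
    best (highest) score at the current frame; all others are pruned."""
    best = max(score for _, score in paths)
    return [(p, s) for p, s in paths if best - s <= beam_width]
```

Applying this after every frame keeps the active search space small, which is how the recognizer processes incoming speech more quickly.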
`
The N-best (or M-best) hypotheses resulting from the recognizers 26a and 26b are then passed to dynamic programming (DP) alignment modules 38a and 38b, respectively. The dynamic programming alignment modules have access to an associated name dictionary 39 against which the N-best (or M-best) hypotheses are compared. Dynamic programming is used to account for insertion, substitution and deletion errors.
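A minimal sketch of this DP alignment follows, using the classic edit-distance recurrence over letters. The unit costs for insertion, deletion and substitution are an assumption for illustration; the document does not specify the cost weights.

```python
def align_cost(hyp, ref):
    """DP alignment cost between a letter-sequence hypothesis and a
    dictionary name; insertions, deletions and substitutions each
    cost 1 (assumed unit costs)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[m][n]

def best_names(hyp, dictionary, k):
    """Return the k dictionary names that align best with `hyp`."""
    return sorted(dictionary, key=lambda name: align_cost(hyp, name))[:k]
```

Ranking every dictionary name by alignment cost is what turns the N-best letter sequences into N-best name hypotheses.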
`
In some instances, the result of dynamic programming alignment will produce a single name with no other candidates. Decision strategy module 40 detects this and provides the recognized name as the output, when there is only one candidate resulting from the DP alignment. In most cases, however, a single candidate does not result, in which case the decision strategy module passes the N-best and M-best hypotheses to module 42 for building a dynamic grammar.

Module 42 builds a grammar using the N-best and M-best candidates provided by the DP alignment modules. The highly constrained recognizer 44 is then invoked to evaluate the N-best and M-best candidates using the dynamic grammar 42. The recognizer 44 may also be a Hidden Markov Model recognizer. Even though highly constrained, the data pass through this recognizer is not time-consuming because the dynamic grammar is small and because the parametric representation (computed in 14) need not be recomputed. If desired, a neural network discriminator can be applied at the output of recognizers 26a and 26b or recognizer 44.
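The grammar-building step of module 42 amounts to merging the two candidate name sets into one small dictionary for the final constrained pass. A sketch (the order-preserving union is an implementation choice, not specified in the text):

```python
def build_dynamic_grammar(n_best_names, m_best_names):
    """Combine the N-best and M-best name candidates into one reduced
    dictionary (duplicates removed, first-seen order kept) for the
    final highly constrained recognition pass."""
    seen, grammar = set(), []
    for name in list(n_best_names) + list(m_best_names):
        if name not in seen:
            seen.add(name)
            grammar.append(name)
    return grammar
```

Because both paths contribute names, the dynamic grammar contains good candidates whether or not the utterance began with extraneous speech.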
`
The listing in Appendix A shows how the system of the invention performs in recognizing the spelled name WILSON. In the listing the section designated [First Pass] shows all hypotheses produced by both grammars. None of these is the name WILSON.

In the section labeled [DP Alignment] the top candidates have been listed; included in the list is the name WILSON (candidate 1 of 10).

In the section labeled [Costly Constrained Pass] the input utterance is compared with only the candidates selected during DP Alignment. In this case, the recognizer correctly detects the name WILSON.
`
`
`
`
`
`
`
`
N-Best Processing Techniques

The N-best or M-best candidates are chosen using an N-best selection algorithm. For details regarding this technique, see R. Schwartz and Steve Austin, "Efficient, High Performance Algorithms for N-Best Search," DARPA Workshop on Speech Recognition, pp. 6-11, 1990. In speech recognition, the incoming speech data are broken up into time frames and analyzed on a frame-by-frame basis. For any given utterance, there may be several possible hypotheses. The presently preferred N-best (or M-best) algorithm selects the best starting time for a letter based only on the preceding letter and not on letters before the preceding letter. As each letter is spoken and analyzed, the Hidden Markov Model recognizer will generate probability scores for each of the models. Because the objective of the system is ultimately to select the most probable sequence of letters, the system stores a plurality of paths, representing possible spelled combinations of letters.

To make the system work better as a real time recognizer, two different levels of data pruning are implemented.
`
`The pruning tech