United States Patent [19]
D'hoore et al.

US006085160A
[11] Patent Number: 6,085,160
[45] Date of Patent: Jul. 4, 2000

[54] LANGUAGE INDEPENDENT SPEECH RECOGNITION

[75] Inventors: Bart D'hoore, Aalter; Dirk Van Compernolle, Korbeek-Dijle, both of Belgium

[73] Assignee: Lernout & Hauspie Speech Products N.V., Ieper, Belgium

[21] Appl. No.: 09/113,589
[22] Filed: Jul. 10, 1998
[51] Int. Cl.: G10L 5/04
[52] U.S. Cl.: 704/256; 704/2; 704/277
[58] Field of Search: 704/251, 254, 255, 243, 256, 2, 277
`
[56] References Cited

U.S. PATENT DOCUMENTS

5,540,589    7/1996   Waters             704/246
5,717,743    2/1998   McMahan et al.     704/244
5,758,023    5/1998   Bordeaux           704/232
5,768,603    6/1998   Brown et al.       704/232
5,882,202    3/1999   Sameth et al.      704/8
5,915,001    6/1999   Uppaluru           379/88.22
5,963,892   10/1999   Tanka et al.       704/2
5,963,903   10/1999   Hon et al.         704/254

FOREIGN PATENT DOCUMENTS

DE 19634138    2/1998   Germany.
WO 98/11534    3/1998   WIPO.
`
OTHER PUBLICATIONS

Bub, U. et al., “In-Service Adaptation of Multilingual Hidden-Markov-Models”, Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '97), Apr. 21, 1997, pp. 1451–1454.
Constantinescu, A. et al., “On Cross-Language Experiments and Data-Driven Units for ALISP (Automatic Language Independent Speech Processing)”, Proceedings IEEE Workshop on Automatic Speech Recognition and Understanding, Dec. 14–17, 1997, pp. 606–613.
Joachim Kohler, “Multi-Lingual Phoneme Recognition Exploiting Acoustic-Phonetic Similarities of Sounds”, pp. 2195-2198.
Weng, et al., “A Study of Multilingual Speech Recognition”, ESCA, Eurospeech97, pp. 359-362.
Schultz, et al., “Fast Bootstrapping of LVCSR Systems With Multilingual Phoneme Sets”, ESCA, Eurospeech97, pp. 371-374.
Schultz, et al., “Japanese LVCSR on the Spontaneous Scheduling Task With Janus-3”, ESCA, Eurospeech97, pp. 367-370.
Bonaventura, et al., “Multilingual Speech Recognition for Flexible Vocabularies”, ESCA, Eurospeech97, pp. 355-358.
Jayadev Billa, et al., “Multilingual Speech Recognition: The 1996 Byblos Callhome System”, ESCA, Eurospeech97, pp. 363-366.
Wang, Chao, et al., “Yinhe: A Mandarin Chinese Version of the Galaxy System”, ESCA, Eurospeech97, pp. 351-354.
`
Primary Examiner: David R. Hudspeth
Assistant Examiner: Susan Wieland
Attorney, Agent, or Firm: Bromberg & Sunstein LLP

[57] ABSTRACT
A speech recognition system uses language independent acoustic models derived from speech data from multiple languages to represent speech units which are concatenated into words. In addition, the input speech signal which is compared to the language independent acoustic models may be vector quantized according to a codebook which is derived from speech data from multiple languages.

26 Claims, 3 Drawing Sheets
`
`
`
[Front-page drawing. Labels: MULTILINGUAL ACOUSTIC MODEL; PHONEME N; RECORDED SPEECH DATA for LANGUAGE 1, LANGUAGE 2, ..., LANGUAGE M]
`Petitioner Google Ex-1023, 0001
`
`

`

U.S. Patent    Jul. 4, 2000    Sheet 1 of 3    6,085,160

[FIG. 1 (PRIOR ART): drawing not reproduced. Legible label: ACOUSTIC MODELS]
`
`
`

`

U.S. Patent    Jul. 4, 2000    Sheet 2 of 3    6,085,160

[FIG. 2 (PRIOR ART): drawing not reproduced. Labels: ACOUSTIC MODEL LANGUAGE 1 (22); ACOUSTIC MODEL LANGUAGE 2 (24); phonemes (23); RECORDED SPEECH DATA LANGUAGE 1; RECORDED SPEECH DATA LANGUAGE 2 (25)]

[FIG. 3: drawing not reproduced. Labels: MULTILINGUAL ACOUSTIC MODEL; RECORDED SPEECH DATA for LANGUAGE 1, LANGUAGE 2, ..., LANGUAGE M]
`
`

`

U.S. Patent    Jul. 4, 2000    Sheet 3 of 3    6,085,160

[FIG. 4: drawing not legibly reproduced. Labels that appear recoverable: INPUT SPEECH; VECTOR QUANTIZING FEATURE EXTRACTOR; CODEBOOK; K-MEANS CLUSTERING ALGORITHM; RECORDED SPEECH LANGUAGE; SPEECH DATABASE; PHONEME MODELS; SPEECH RECOGNIZER]
`
`
`
`
`

`

LANGUAGE INDEPENDENT SPEECH RECOGNITION

TECHNICAL FIELD
The present invention relates to speech recognition systems.
`
BACKGROUND ART
Current speech recognition systems support only individual languages. If words of another language need to be recognized, acoustic models must be exchanged. For most speech recognition systems, these models are built, or trained, by extracting statistical information from a large body of recorded speech. To provide speech recognition in a given language, one typically defines a set of symbols, known as phonemes, that represent all sounds of that language. Some systems use other subword units more generally known as phoneme-like units to represent the fundamental sounds of a given language. These phoneme-like units include biphones and triphones modeled by Hidden Markov Models (HMMs), and other speech models well known within the art.
A large quantity of spoken samples is typically recorded to permit extraction of an acoustic model for each of the phonemes. Usually, a number of native speakers (i.e., people having the language as their mother tongue) are asked to record a number of utterances. A set of recordings is referred to as a speech database. The recording of such a speech database for every language one wants to support is very costly and time consuming.
SUMMARY OF THE INVENTION
(As used in the following description and claims, and unless context otherwise requires, the term “language independent” in connection with a speech recognition system means a recognition capability that is independently existing in a plurality of languages that are modeled in the speech recognition system.)
In a preferred embodiment of the present invention, there is provided a language independent speech recognition system comprising a speech pre-processor, a database of acoustic models, a language model, and a speech recognizer. The speech pre-processor receives input speech and produces a speech-related signal representative of the input speech. The database of acoustic models represents each subword unit in each of a plurality of languages. The language model characterizes a vocabulary of recognizable words and a set of grammar rules, and the speech recognizer compares the speech-related signal to the acoustic models and the language model, and recognizes the input speech as a specific word sequence of at least one word.
In a further and related embodiment, the speech pre-processor comprises a feature extractor which extracts relevant speech parameters to produce the speech-related signal. The feature extractor may include a codebook created using speech data from the plurality of languages, and use vector quantization such that the speech-related signal is a sequence of feature vectors.
Alternatively, or in addition, an embodiment may create the acoustic models using speech data from the plurality of languages. The subword units may be at least one of phonemes, parts of phonemes, and sequences of phonemes. The vocabulary of recognizable words may contain words in the plurality of languages, including proper nouns, or words in a language not present in the plurality of languages, or foreign-loan words. In addition, the words in the vocabulary of recognizable words may be described by a voice print comprised of a user-trained sequence of acoustic models from the database. Such an embodiment may further include a speaker identifier which uses the voice prints to determine the identity of the speaker of the speech input.
In yet another embodiment, the speech recognizer may compare the relevant speech parameters to acoustic models which represent subword units in a first language in the plurality of languages, and then recognize the speech input as a specific word sequence of at least one word in a second language in the plurality of languages so that input speech from a non-native speaker may be recognized.
Another embodiment of the present invention includes a computer-readable digital storage medium encoded with a computer program for teaching a foreign language to a user which when loaded into a computer operates in conjunction with an embodiment of the language independent speech recognition system described.
Embodiments of the present invention may also include a method of a language independent speech recognition system using one of the systems described above.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will be more readily understood by reference to the following detailed description taken with the accompanying drawings, in which:
FIG. 1 illustrates the logical flow associated with a typical speech recognition system.
FIG. 2 illustrates acoustic models of phonemes for multiple languages according to prior art.
FIG. 3 illustrates multi-language acoustic models using a universal set of phonemes according to a preferred embodiment.
FIG. 4 illustrates a speech recognition system according to a preferred embodiment.
DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS
Operation of a typical speech recognition engine according to the prior art is illustrated in FIG. 1. A speech signal 10 is directed to a pre-processor 11, where relevant parameters are extracted from the speech signal 10. The pattern matching recognizer 12 tries to find the best word sequence recognition result 15 based on acoustic models 13 and a language model 14. The language model 14 describes words and how they connect to form a sentence. It might be as simple as a list of words in the case of an isolated word recognizer, or as complicated as a statistical language model for large vocabulary continuous speech recognition. The acoustic models 13 establish a link between the speech parameters from the pre-processor 11 and the recognition symbols that need to be recognized. In medium and large vocabulary systems, the recognition symbols are phonemes, or phoneme-like units, that are concatenated to form words. Further information on the design of a speech recognition system is provided, for example, in Rabiner and Juang, Fundamentals of Speech Recognition (hereinafter “Rabiner and Juang”), Prentice Hall 1993, which is hereby incorporated herein by reference.
In a prior art system, as illustrated in FIG. 2, for any given Language 1, Language 1-specific recorded speech data 20 is used to generate acoustic models 22 which represent each phoneme 21 in the language. For any other given Language 2, Language 2-specific recorded speech data 25 is used to generate other acoustic models 24 specific to that language which represent each phoneme 23 in that Language 2.
FIG. 3 illustrates acoustic models generated according to a preferred embodiment of the present invention. Instead of recording speech data and building acoustic models for all languages separately, as described above, a single universal set of acoustic models is used that may support all languages of the world, or a large group of languages, such as European or Oriental languages, or any plurality of languages. To accomplish this, the speech database from which the statistical information is retrieved to create the acoustic models contains the speech of several languages 33 and will cover all possible phonemes or phoneme-like units in those languages. Thus, the acoustic model of a particular phoneme is constructed based on speech from multiple languages. Accordingly, a list of universal phonemes 31 that cover all the desired languages is included in the speech recognition system, along with corresponding acoustic models 32. Since each phoneme 31 is a unique representation of a single sound, a sound that appears in several languages will be represented by the same phoneme 31 and have the same corresponding acoustic model 32. Instead of phonemes, an alternative embodiment may use phoneme-like subword units such as biphones and triphones based on Hidden Markov Models (HMMs), etc. In another embodiment, the language model 14 in FIG. 1 may be omitted and pattern matching by the recognizer 12 may be based solely on comparison of the speech parameters from the preprocessor 11 to the acoustic models 13.
A speech recognition system according to a preferred embodiment is shown in FIG. 4, based on a discrete density HMM phoneme-based continuous recognition engine. These recognition engines may be useful for telephone speech, for microphone speech, or for other advantageous applications. An input speech signal initially undergoes some form of pre-processing. As shown in FIG. 4, a preferred embodiment uses a vector quantizing feature extraction module 41 which processes an input speech signal and calculates energy and spectral properties (cepstrum) for a 30 msec speech segment once every 10 msec. A preferred embodiment of a telephone speech recognition engine uses the commonly known LPC analysis method to derive 12 cepstral coefficients and log energy, along with first and second order derivatives. A preferred embodiment of a microphone speech recognition engine uses the commonly known MEL-FFT method to accomplish the same purpose. The result for both engines for each speech frame is a vector of 12 cepstra, 12 delta cepstra, 12 delta delta cepstra, delta log energy and delta delta log energy. These speech pre-processing techniques are well known in the art. See, for example, Rabiner and Juang, supra, pp. 112-17 and 188-90, for additional discussion of this subject. The remainder of the processing is the same for both engines.
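The frame layout described above can be illustrated with a minimal Python sketch. It assumes the 12 cepstra and the log energy per frame have already been computed by some front end (LPC or MEL-FFT analysis is not shown), and it estimates derivatives with a centred difference, which is an assumption, since the text does not say how the derivatives are obtained. The names `frame_bounds`, `deltas` and `assemble` are hypothetical.

```python
from typing import List, Sequence, Tuple

FRAME_MS = 30   # analysis window length stated in the text
STEP_MS = 10    # frame advance stated in the text


def frame_bounds(n_samples: int, sample_rate: int) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of 30 ms windows taken every 10 ms."""
    width = sample_rate * FRAME_MS // 1000
    step = sample_rate * STEP_MS // 1000
    return [(s, s + width) for s in range(0, n_samples - width + 1, step)]


def deltas(track: Sequence[Sequence[float]]) -> List[List[float]]:
    """Time derivative as a centred difference (assumed, not from the text).
    The first and last frames reuse themselves as the missing neighbour."""
    last = len(track) - 1
    out = []
    for t in range(len(track)):
        prev, nxt = track[max(t - 1, 0)], track[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def assemble(cepstra: List[List[float]], log_energy: List[float]) -> List[List[float]]:
    """Per-frame vector: 12 cepstra, 12 delta cepstra, 12 delta delta cepstra,
    delta log energy and delta delta log energy (38 values in total)."""
    d_cep = deltas(cepstra)
    dd_cep = deltas(d_cep)
    d_en = deltas([[e] for e in log_energy])
    dd_en = deltas(d_en)
    return [c + dc + ddc + de + dde
            for c, dc, ddc, de, dde in zip(cepstra, d_cep, dd_cep, d_en, dd_en)]
```

With one second of speech sampled at 8 kHz, `frame_bounds(8000, 8000)` yields windows of 240 samples advancing by 80 samples. Note that the log energy itself is not part of the assembled vector, matching the list of components given in the text.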
In a preferred embodiment which employs discrete density HMMs, the system employs a vector quantizing feature extraction module 41 which replaces each observed feature vector by a prototype (or codeword) out of a codebook 43 that best matches the feature vector. The codebooks 43 are designed and created using a large speech database 44 which contains recorded speech data 45 from each of a plurality of languages together with an algorithm 46 that minimizes some cost function, such as the commonly used k-means clustering method that minimizes the total distortion of the codebooks 43. Single language system codebooks according to the prior art are designed and created using speech data from the target language only. Preferred embodiments of the present invention, on the other hand, are based on multi-language models using speech from a large number of languages and selecting the speech data such that there is an equal amount of data from all languages. In such an embodiment, four codebooks 43 may be constructed: one for cepstra, one for delta cepstra, one for delta delta cepstra, and one for delta log energy and delta delta log energy. Each codebook 43 uses a design algorithm:
`
number of codewords: number of codewords calculated so far
target: number of codewords chosen to calculate
codebook: list of codewords

while (number of codewords < target) do
    split (codewords)
    update (codewords)
end

split (codewords)    # splits each codeword into two new ones based on the covariance matrix
    foreach codeword
        eigenvector = calculate_eigenvector (covariance matrix)
        alfa = epsilon * eigenvalue
        new codeword1 = codeword + alfa * eigenvector
        new codeword2 = codeword - alfa * eigenvector
    end

update (codewords)    # updates codewords with mean + calculates covariance matrix
    until (stop criterium) do
        # running through the speech data
        foreach vector in trainingset
            # select codeword belonging to vector
            codeword_i = classify (vector, codebook)
            update_mean (codeword_i, vector)
            update_covariance (codeword_i, vector)
        end
    end
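The split-and-update loop above might be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: the eigenvector is taken to be the dominant one and is found by power iteration, a fixed number of passes stands in for the unspecified stop criterion, distances are squared Euclidean, and `alfa` is read as the product of `epsilon` and the eigenvalue. All function names other than `split`, `update` and `classify` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def classify(vector: Vector, codebook: List[Vector]) -> int:
    """Index of the codeword closest to `vector` (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((x - y) ** 2 for x, y in zip(vector, codebook[i])))


def principal_axis(cov: List[Vector], iters: int = 50) -> Tuple[Vector, float]:
    """Dominant eigenvector and eigenvalue of a covariance matrix (power iteration)."""
    dim = len(cov)
    v = [1.0 / (i + 1) for i in range(dim)]  # asymmetric start vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        lam = math.sqrt(sum(x * x for x in w))
        if lam == 0.0:
            break  # zero-variance cell: no preferred direction
        v = [x / lam for x in w]
    return v, lam


def update(training: List[Vector], codebook: List[Vector], passes: int = 10):
    """Move each codeword to the mean of its cell, then compute cell covariances.
    The fixed pass count is a stand-in for the stop criterion."""
    dim = len(training[0])
    for _ in range(passes):
        cells = [[] for _ in codebook]
        for vec in training:
            cells[classify(vec, codebook)].append(vec)
        for i, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous codeword
                codebook[i] = [sum(col) / len(cell) for col in zip(*cell)]
    covs = []
    for i, cell in enumerate(cells):
        mean, n = codebook[i], max(len(cell), 1)
        covs.append([[sum((v[a] - mean[a]) * (v[b] - mean[b]) for v in cell) / n
                      for b in range(dim)] for a in range(dim)])
    return codebook, covs


def split(codebook: List[Vector], covs, epsilon: float = 0.1) -> List[Vector]:
    """Replace every codeword by two, displaced along its principal axis."""
    out = []
    for word, cov in zip(codebook, covs):
        axis, eigenvalue = principal_axis(cov)
        alfa = epsilon * eigenvalue
        out.append([w + alfa * a for w, a in zip(word, axis)])
        out.append([w - alfa * a for w, a in zip(word, axis)])
    return out


def design_codebook(training: List[Vector], target: int, epsilon: float = 0.1):
    """Grow the codebook by repeated split and update until it holds `target` codewords."""
    codebook = [[sum(col) / len(training) for col in zip(*training)]]
    codebook, covs = update(training, codebook, passes=1)
    while len(codebook) < target:
        codebook = split(codebook, covs, epsilon)
        codebook, covs = update(training, codebook)
    return codebook
```

Because every split doubles the codebook, `target` is best chosen as a power of two. The same `classify` function can serve at run time as the quantization step that replaces a feature vector with the index of its best matching codeword. In a multi-language setting, `training` would hold vectors drawn in equal amounts from every language, as the text describes.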
`
Although a preferred embodiment has been described as using a codebook based vector quantizing technique to initially process a speech input signal, other embodiments of the invention may employ other methods of initial speech processing, for example, such as would be used in a continuous density based speech recognition system.
Once the input speech signal has been pre-processed as previously described, such as by vector quantizing, the speech recognizer, 48 in FIG. 4, compares the speech signal to acoustic models in the phoneme database 47 together with the language model 49. Instead of creating acoustic models for the phonemes (or other subword units) of any one particular language, a preferred embodiment uses acoustic models for all the phonemes that appear in a large number of languages. A list of such universal language independent phonemes may be constructed by merging specific phoneme lists from each of the various desired languages. A preferred embodiment uses L&H+, a phonetic alphabet designed to cover all languages which represents each sound by a single symbol, and wherein each symbol represents a single sound. Table 1 shows a multi-language phoneme list used to train microphone models on British English, Dutch, American English, French, German, Italian, Spanish, and Japanese. For each phoneme, the table indicates in which language it appears. For example, the phoneme A has been trained on British English, Dutch, American English, French, and Japanese speech.
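Merging per-language phoneme lists into one universal list, with a record of the languages in which each phoneme appears, can be sketched as below. The function name `merge_phoneme_lists` is hypothetical, and the inventories in the usage note are illustrative placeholders, not the contents of Table 1.

```python
from typing import Dict, Iterable, List, Set


def merge_phoneme_lists(per_language: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each phoneme symbol to the sorted list of languages that use it.

    Assumes the lists share one phonetic alphabet in which a symbol denotes
    the same sound in every language, as the text says of L&H+.
    """
    universal: Dict[str, Set[str]] = {}
    for language, phonemes in per_language.items():
        for symbol in phonemes:
            universal.setdefault(symbol, set()).add(language)
    return {symbol: sorted(langs) for symbol, langs in sorted(universal.items())}
```

For example, `merge_phoneme_lists({"Dutch": ["A", "x1"], "French": ["A", "x2"]})` returns one entry for `A` listing both languages and one entry each for the placeholder symbols `x1` and `x2`. A phoneme listed under several languages is trained on speech from all of them, while one listed under a single language stays language specific.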
The training procedures for single language and multi-language acoustic models both use standard training techniques; they differ in the type of data that is presented and the speech units that are trained. The training can be viewed as the construction of a database of acoustic models 47 covering a specific phoneme set. The training process begins by training context independent models using Viterbi training of discrete density HMMs. Then the phoneme models are automatically classified into 14 classes. Based on the class information, context dependent phoneme models are constructed. Next, the context dependent models are trained using Viterbi training of discrete density HMMs. The context dependent and context independent phoneme models are merged, and then, lastly, badly trained context dependent models are smoothed with the context independent models. Such acoustic model training methods are well-known within the art of speech recognition. Similar training techniques may be employed in other embodiments such as for continuous density based speech recognition systems.
Prior art single language acoustic models are trained on speech from the target language. Thus an acoustic model of a given phoneme will be trained based only on speech samples from a single language. The speech recognizer engine will be able to recognize words from that language only. Separate acoustic model libraries for several languages may be constructed, but they cannot easily be combined. In a discrete density based speech recognition system, it is not even possible to combine them into one database since the codebooks are incompatible across languages. On the other hand, multi-language acoustic models in a preferred embodiment are trained on a speech database 44 which contains recorded speech data 45 from multiple languages. The result of the training is a database of discrete density HMM acoustic models 47 corresponding to a universal list of language independent phonemes. Some of the phoneme models will still be language specific, since they are only observed in one language. Other phoneme models will be trained on speech from more than one language.
The system describes recognizable words in its vocabulary by representing their pronunciation in speech units that are available in the acoustic model database 47. For single language acoustic model databases, this implies that only words of one language can be described, or that foreign words are simulated by describing them in speech units of that particular language. In a preferred embodiment, the multi-language acoustic model database 47 contains phoneme models that can describe words in any of the targeted languages. In either a single language or multi-language implementation, words may be added to the vocabulary of the speech recognizer system either automatically or by interaction with the user. Whether automatically or interactively, however, a preferred embodiment of a multi-language recognizer uses a vocabulary, i.e. the list of words the recognizer knows of, which can contain words of several languages. It is thus possible to recognize words of different languages. The detailed procedures for word addition differ accordingly between single language and multi-language speech recognition systems.
In a single language system, the interactive word addition mode starts with the user entering a word by typing it (e.g. “L&H”). The new word is automatically converted to a phonetic representation by a rule based system derived from the automatic text to speech conversion module or by dictionary look-up. The user can then check the transcription by listening to the output of the text to speech system that reads the phonetic transcription that it just generated (e.g. the system says “Lernout and Hauspie Speech Products”). If the user is not satisfied with the pronunciation, he can change the phonetic transcription in two ways (e.g. the user would have liked “el and eitch”). By editing the phonetic transcriptions directly, the user can listen to the changes he made by having the text to speech system play back the altered phonetic string. Alternatively, the user may enter a word that sounds like what he actually wants in a separate orthographic field (e.g. “L. and H.”) and the system will convert the sound-like item into phonemes and use this as phonetic transcription for the real word. Once the user is satisfied with the pronunciation of the new word, he can check it in. The transcription units are retrieved from the model database, the word is added to the recognizer and can now be recognized.
In the multi-language system of a preferred embodiment, however, the procedure for adding words interactively differs somewhat. First, as before, the user enters a new word by typing it. The system then automatically determines the language of the word via dictionary look-up and/or a rule based system and presents one or more choices to the user. For each of the chosen languages, the word is automatically converted to a phonetic representation by a rule based system derived from an automatic text to speech conversion module of that particular language. The user can check the transcriptions by listening to the output of the text to speech system that reads the phonetic transcriptions that it just generated. If the user is not satisfied with the language choice the system made, he can overrule the system and indicate explicitly one or more languages. If the user is not satisfied with the pronunciation, he can change the phonetic transcription in two ways, for each of the selected languages. The user may edit the phonetic transcriptions directly; he can listen to the changes he made by having the text to speech system play back the altered phonetic string. In this way, the user can use phoneme symbols coming from another language, but will then not necessarily be able to listen to the changes. Alternatively, the user may enter a word that sounds like what he actually wants in a separate orthographic field. The system will convert the sound-like item into phonemes and use this as phonetic transcription for the real word. Once the user is satisfied with the transcriptions of the word, he can check it in. The transcription units are retrieved from the model database, the word is added to the recognizer and can now be recognized.
The automatic mode for entering words to the recognizer also differs between single language and multi-language systems. In a single language system, the application program presents the words it wants to have recognized to the speech recognition system, and the word is automatically converted to a phonetic representation by a rule based system derived from the automatic text to speech conversion module or by dictionary look-up. The transcription units then are retrieved from the model database, the word is added to the recognizer and can now be recognized. In a multi-language system of a preferred embodiment, however, the application program presents the words it wants to have recognized to the speech recognition system and optionally indicates one or more languages for the word. If the language is not indicated, the system will automatically determine the language by dictionary lookup or via a rule-based system, resulting in one or more language choices. For each language, the word is automatically converted to a phonetic representation by a rule based system derived from the automatic text to speech conversion module. The transcription units then are retrieved from the model database, the word is added to the recognizer and can now be recognized.
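The automatic multi-language word addition flow can be sketched as a small Python class. Everything here is hypothetical scaffolding: the `Vocabulary` class, the per-language word sets used for dictionary look-up, and the transcriber callables standing in for the rule based text to speech front ends. Falling back to all known languages when the look-up finds nothing is an assumption made to keep the sketch short; the text describes a rule based system for that case.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set

Transcriber = Callable[[str], List[str]]  # word -> list of phoneme symbols


class Vocabulary:
    """Toy multi-language lexicon: word -> language -> phonetic transcription."""

    def __init__(self, dictionaries: Dict[str, Set[str]],
                 transcribers: Dict[str, Transcriber],
                 supported_phonemes: Set[str]) -> None:
        self.dictionaries = dictionaries      # language -> known words
        self.transcribers = transcribers      # language -> transcription rules
        self.supported = supported_phonemes   # symbols with an acoustic model
        self.entries: Dict[str, Dict[str, List[str]]] = {}

    def detect_languages(self, word: str) -> List[str]:
        """Dictionary look-up; if the word is unlisted, propose every language."""
        hits = [lang for lang, words in self.dictionaries.items()
                if word.lower() in words]
        return hits or list(self.transcribers)

    def add_word(self, word: str,
                 languages: Optional[Sequence[str]] = None) -> Dict[str, List[str]]:
        """Transcribe `word` once per language and store every transcription."""
        langs = list(languages) if languages else self.detect_languages(word)
        for lang in langs:
            phonemes = self.transcribers[lang](word)
            missing = [p for p in phonemes if p not in self.supported]
            if missing:
                raise ValueError(f"no acoustic model for {missing}")
            self.entries.setdefault(word, {})[lang] = phonemes
        return self.entries[word]
```

An application would call `add_word("straat")` to let the system choose the language, or `add_word("rue", languages=["nl", "fr"])` to request one transcription per indicated language. The check against `supported_phonemes` mirrors the step in which transcription units are retrieved from the model database.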
A multi-language system of a preferred embodiment also supports a translation mode. In such a system, one or more words are added to the recognizer for a single language following the procedures explained above. An automatic translation system then translates the words to one or more other languages that are supported by the recognizer. For each word, the system can propose one or more candidates. The automatically translated words may be added to the recognizer or edited interactively.
A preferred embodiment also enables recognition of words of a new language. Since creating acoustic models for a particular language requires the recording of a large amount of speech data, the development of a speech recognizer for a new language is costly and time consuming. The model database of the multi-language recognizer supports many more phonemes than a single language model does. Since the probability of finding a non-supported phoneme in this database is low, it becomes possible to describe a word of a language that was not present in the training data. This description will be much more accurate than the description of that word in phonemes of a single different language. To recognize words of a new language, a preferred embodiment requires only the input of the new words and their phonetic representation. No training is necessary.
Prior art speech recognition systems generally have problems recognizing speech from non-native speakers. There are two main reasons: 1) non-native speakers sometimes do not pronounce the words correctly, and 2) non-native speakers sometimes do not pronounce some sounds correctly. Multi-language models, such as in a preferred embodiment, more effectively recognize the speech of non-native speakers because the models for each of the phonemes have been trained on several languages and are more robust to variations due to accent. In addition, when creating a word vocabulary, the user can easily edit phonetic transcriptions and is allowed to use phonemes of a different language to describe foreign influences.
Some algorithms, such as speaker dependent training of words, try to find the best possible phonetic representation for a particular word based on a few utterances of that word by the user. In most cases, the native language of the user is not known. When single language models are used, the speech recognition system is restricted to mapping the speech onto language specific symbols, even though the speech may be from a completely different language. Non-native speakers may produce sounds that cannot be represented well by the model database of a single language model. Preferred embodiments of the present invention avoid this type of problem since the phoneme model database covers a much wider span of sounds. A word can be added to the recognizer by having the user pronounce the word a few times. The system will automatically construct the best possible phoneme or model unit sequence to describe the word, based on the phoneme model database and the uttered speech. This sequence is referred to as a voice print. These voice prints can be used to recognize utterances of the trained word by the speaker. Since the voice print will better match the speech of the targeted speaker than the speech of another speaker, it can also be used to check or detect the identity of the speaker. This is referred to as speaker verification, or speaker identification.
A preferred embodiment is also advantageously employed for language independent recognition of words with language dependent transcriptions. The pronunciation of some words strongly depends on the native language of the speaker. This is a problem for systems in which the native language of the user either varies or is unknown. A typical example is the recognition of proper names. A Dutch name is pronounced differently by a Dutch speaker and a French speaker. Language dependent systems usually describe the foreign pronunciation variants by mapping them to the phonemes of the native language. As described above, it is possible to add a word to the speech recognition system of a preferred embodiment and indicate that it will be spoken in several languages. The system will transcribe the word with rule sets from several languages and generate several phonetic transcriptions. The recognizer uses all the transcriptions in parallel, thus covering all pronunciation variants. This is particularly useful for recognizing proper names in an application that will be used by a variety of speakers whose language is not known.
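Using several transcriptions of one word in parallel can be illustrated with a toy matcher. This is only a stand-in: a real recognizer of the kind described scores HMM acoustic models against the speech signal, whereas the sketch below assumes an already decoded phoneme sequence and ranks candidates by edit distance. The lexicon layout, the phoneme symbols and the function names are all hypothetical.

```python
from typing import Dict, List, Tuple


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    row = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, y in enumerate(b, 1):
            # row[j] is the deletion source, row[j - 1] the insertion source,
            # prev the substitution source from the previous row
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (x != y))
    return row[-1]


def best_match(observed: List[str],
               lexicon: Dict[str, Dict[str, List[str]]]) -> Tuple[str, str, int]:
    """Compare the observed phonemes with every transcription of every word
    and return (word, language of the winning variant, distance)."""
    candidates = [(edit_distance(observed, phones), word, lang)
                  for word, variants in lexicon.items()
                  for lang, phones in variants.items()]
    distance, word, lang = min(candidates)
    return word, lang, distance
```

With a lexicon that stores a Dutch and a French variant of the same name, either pronunciation maps back to that name, and the returned language tag shows which variant matched. Ties are broken alphabetically by word and language, which is an arbitrary choice of this sketch.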
Language learning programs are computer programs that help users to learn to speak a language without intervention of a live tutor. Automatic speech recognition systems are often used in such programs to help the users test the progress they make and to help them improve the pronunciation of the language to be learned. The confidence level of the recognizer, i.e. an indication of how well a model matches the uttered speech, is an indication of how well the user pronounced a word or sentence that is represented by that model. The local confidence, which is a measure for how well the model matches a small portion of the uttered speech, a word in a sentence or a phoneme in an utterance, can give an indication on what type of error the user made and can be used to indicate specific problem areas the user should work on. Multi-language models are more suited for language learning applications than single language models. Users having Language 1 as a mother tongue, who want to learn Language 2, will make mistakes that are typical of the language couple (Language 1, Language 2). Some phonemes that appear in Language 2 do not appear in Language 1 and are thus not known to people having Language 1 as a mother tongue. They will typically replace the unknown phoneme with a phoneme that appears in Language 1, thus mispronouncing words. A typical example is a French person pronouncing an English word in English text in a French manner, because the same word also exists in French. This type of mistake is typical of each language couple (Language 1, Language 2). A single language recognition system, be it Language 1 or Language 2 specific, cannot detect these substitutions because models to describe the particular phoneme combination are not available. Multi-language models can be used to detect this type of error since all phonemes of Language 1 and Language 2 are covered. Thus it becomes possible to create language learning systems for language couples that are enhanced with rules that describe mistakes typical to the language couple, and automatically detect specific mistakes with the help of an automatic speech recognition system.
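Rules describing mistakes typical of a language couple could be applied as in the following sketch. It assumes the recognizer has already produced a phoneme sequence aligned one to one with the expected transcription, which sidesteps insertions and deletions. The rule table, its single entry and the phoneme symbols are hypothetical examples, not data from the patent.

```python
from typing import Dict, List, Tuple

# Hypothetical rule table:
# (mother tongue, target language) -> {target phoneme: usual replacement}
TYPICAL_SUBSTITUTIONS: Dict[Tuple[str, str], Dict[str, str]] = {
    ("fr", "en"): {"T": "s", "D": "z"},
}


def diagnose(expected: List[str], recognized: List[str],
             mother_tongue: str, target_language: str
             ) -> List[Tuple[int, str, str, str]]:
    """List mismatches as (position, expected, recognized, kind), where kind is
    'typical' when the substitution is in the rule table and 'other' otherwise."""
    rules = TYPICAL_SUBSTITUTIONS.get((mother_tongue, target_language), {})
    report = []
    for position, (want, got) in enumerate(zip(expected, recognized)):
        if want == got:
            continue
        kind = "typical" if rules.get(want) == got else "other"
        report.append((position, want, got, kind))
    return report
```

A tutoring program might turn each `typical` entry into targeted feedback for that sound, and could combine it with the local confidence described above to decide which problem areas to present first. Detecting the substituted phoneme at all presupposes acoustic models covering both languages, which is the point the text makes about multi-language models.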
`
`TABLE 1
