`D'hoore et al.
`
US006085160A
(11) Patent Number: 6,085,160
(45) Date of Patent: Jul. 4, 2000
`
(54) LANGUAGE INDEPENDENT SPEECH RECOGNITION
`
(75) Inventors: Bart D'hoore, Aalter; Dirk Van Compernolle, Korbeek-Dijle, both of Belgium

(73) Assignee: Lernout & Hauspie Speech Products N.V., Ieper, Belgium
`
(21) Appl. No.: 09/113,589
(22) Filed: Jul. 10, 1998
(51) Int. Cl.7 .............. G10L 5/04
(52) U.S. Cl. .............. 704/256; 704/2; 704/277
(58) Field of Search .............. 704/251, 254, 255, 243, 256, 2, 277
`
`56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`5,540,589 7/1996 Waters .................................... 704/246
`5,717,743 2/1998 McMahan et al. ..
`... 704/244
`5,758,023 5/1998 Bordeaux .............
`... 704/232
`5,768,603
`6/1998 Brown et al. ...
`... 704/232
`5,882,202 3/1999 Sameth et al. .............................. 704/8
`5,915,001 6/1999 Uppaluru ...
`379/88.22
`5,963,892 10/1999 Tanka et al. ................................ 704/2
`5,963,903 10/1999 Hon et al. ............................... 704/254
FOREIGN PATENT DOCUMENTS

196 34 138   2/1998  Germany
WO 98/11534  3/1998  WIPO
`
OTHER PUBLICATIONS

Bub, U. et al., "In-Service Adaptation of Multilingual Hidden-Markov-Models", Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '97), Apr. 21, 1997, pp. 1451-1454.
Constantinescu, A. et al., "On Cross-Language Experiments and Data-Driven Units for ALISP (Automatic Language Independent Speech Processing)", Proceedings IEEE Workshop on Automatic Speech Recognition and Understanding, Dec. 14-17, 1997, pp. 606-613.
Joachim Kohler, "Multi-Lingual Phoneme Recognition Exploiting Acoustic-Phonetic Similarities of Sounds", pp. 2195-2198.
Weng, et al., "A Study of Multilingual Speech Recognition", ESCA, Eurospeech97, pp. 359-362.
Schultz, et al., "Fast Bootstrapping of LVCSR Systems With Multilingual Phoneme Sets", ESCA, Eurospeech97, pp. 371-374.
Schultz, et al., "Japanese LVCSR on the Spontaneous Scheduling Task With Janus-3", ESCA, Eurospeech97, pp. 367-370.
Bonaventura, et al., "Multilingual Speech Recognition for Flexible Vocabularies", ESCA, Eurospeech97, pp. 355-358.
Jayadev Billa, et al., "Multilingual Speech Recognition: The 1996 Byblos Callhome System", ESCA, Eurospeech97, pp. 363-366.
Wang, Chao, et al., "Yinhe: A Mandarin Chinese Version of the Galaxy System", ESCA, Eurospeech97, pp. 351-354.
`
Primary Examiner - David R. Hudspeth
Assistant Examiner - Susan Wieland
Attorney, Agent, or Firm - Bromberg & Sunstein LLP

(57) ABSTRACT
A speech recognition system uses language independent acoustic models derived from speech data from multiple languages to represent speech units which are concatenated into words. In addition, the input speech signal which is compared to the language independent acoustic models may be vector quantized according to a codebook which is derived from speech data from multiple languages.
`
`26 Claims, 3 Drawing Sheets
`
`
`
[Representative drawing: a multilingual acoustic model for phoneme N built from recorded speech data for Language 1, Language 2, ... Language M]
`
`
`
[Drawing Sheet 1 of 3: FIG. 1 (Prior Art), showing acoustic models in a typical speech recognition system]
`
`
`
[Drawing Sheet 2 of 3: FIG. 2 (Prior Art), language-specific acoustic models built from recorded speech data for Language 1 and Language 2; FIG. 3, a multilingual acoustic model built from recorded speech data for Languages 1 through M]
`
`
`
[Drawing Sheet 3 of 3: FIG. 4, a speech recognition system with an input speech feature extractor and vector quantizing means, codebooks designed by clustering algorithms over a database of recorded speech from several languages, a phoneme model database, and a speech recognizer]
`
`
`
`
`
`
`1
`LANGUAGE INDEPENDENT SPEECH
`RECOGNITION
`
TECHNICAL FIELD

The present invention relates to speech recognition systems.
`
BACKGROUND ART

Current speech recognition systems support only individual languages. If words of another language need to be recognized, acoustic models must be exchanged. For most speech recognition systems, these models are built, or trained, by extracting statistical information from a large body of recorded speech. To provide speech recognition in a given language, one typically defines a set of symbols, known as phonemes, that represent all sounds of that language. Some systems use other subword units, more generally known as phoneme-like units, to represent the fundamental sounds of a given language. These phoneme-like units include biphones and triphones modeled by Hidden Markov Models (HMMs), and other speech models well known within the art.

A large quantity of spoken samples is typically recorded to permit extraction of an acoustic model for each of the phonemes. Usually, a number of native speakers, i.e., people having the language as their mother tongue, are asked to record a number of utterances. A set of such recordings is referred to as a speech database. Recording such a speech database for every language one wants to support is very costly and time consuming.
`
`15
`
`25
`
SUMMARY OF THE INVENTION

(As used in the following description and claims, and unless context otherwise requires, the term "language independent" in connection with a speech recognition system means a recognition capability that exists independently in a plurality of languages that are modeled in the speech recognition system.)

In a preferred embodiment of the present invention, there is provided a language independent speech recognition system comprising a speech pre-processor, a database of acoustic models, a language model, and a speech recognizer. The speech pre-processor receives input speech and produces a speech-related signal representative of the input speech. The database of acoustic models represents each subword unit in each of a plurality of languages. The language model characterizes a vocabulary of recognizable words and a set of grammar rules, and the speech recognizer compares the speech-related signal to the acoustic models and the language model, and recognizes the input speech as a specific word sequence of at least one word.

In a further and related embodiment, the speech pre-processor comprises a feature extractor which extracts relevant speech parameters to produce the speech-related signal. The feature extractor may include a codebook created using speech data from the plurality of languages, and may use vector quantization such that the speech-related signal is a sequence of feature vectors.

Alternatively, or in addition, an embodiment may create the acoustic models using speech data from the plurality of languages. The subword units may be at least one of phonemes, parts of phonemes, and sequences of phonemes. The vocabulary of recognizable words may contain words in the plurality of languages, including proper nouns, or words in a language not present in the plurality of languages, or foreign-loan words. In addition, the words in the vocabulary of recognizable words may be described by a voice print comprised of a user-trained sequence of acoustic models from the database. Such an embodiment may further include a speaker identifier which uses the voice prints to determine the identity of the speaker of the speech input.

In yet another embodiment, the speech recognizer may compare the relevant speech parameters to acoustic models which represent subword units in a first language in the plurality of languages, and then recognize the speech input as a specific word sequence of at least one word in a second language in the plurality of languages, so that input speech from a non-native speaker may be recognized.

Another embodiment of the present invention includes a computer-readable digital storage medium encoded with a computer program for teaching a foreign language to a user, which, when loaded into a computer, operates in conjunction with an embodiment of the language independent speech recognition system described.

Embodiments of the present invention may also include a method of language independent speech recognition using one of the systems described above.
BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be more readily understood by reference to the following detailed description taken with the accompanying drawings, in which:

FIG. 1 illustrates the logical flow associated with a typical speech recognition system.

FIG. 2 illustrates acoustic models of phonemes for multiple languages according to the prior art.

FIG. 3 illustrates multi-language acoustic models using a universal set of phonemes according to a preferred embodiment.

FIG. 4 illustrates a speech recognition system according to a preferred embodiment.
DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

Operation of a typical speech recognition engine according to the prior art is illustrated in FIG. 1. A speech signal 10 is directed to a pre-processor 11, where relevant parameters are extracted from the speech signal 10. The pattern matching recognizer 12 tries to find the best word sequence recognition result 15 based on acoustic models 13 and a language model 14. The language model 14 describes words and how they connect to form a sentence. It might be as simple as a list of words in the case of an isolated word recognizer, or as complicated as a statistical language model for large vocabulary continuous speech recognition. The acoustic models 13 establish a link between the speech parameters from the pre-processor 11 and the recognition symbols that need to be recognized. In medium and large vocabulary systems, the recognition symbols are phonemes, or phoneme-like units, that are concatenated to form words. Further information on the design of a speech recognition system is provided, for example, in Rabiner and Juang, Fundamentals of Speech Recognition (hereinafter "Rabiner and Juang"), Prentice Hall 1993, which is hereby incorporated herein by reference.
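As a minimal illustration of the FIG. 1 flow, the following Python sketch wires a pre-processor, a set of acoustic models, and a language model together in an isolated-word pattern matcher. The function names and the toy template-distance scoring are illustrative assumptions, not the patented implementation.

# Sketch of the FIG. 1 flow: pre-processor -> recognizer combining
# acoustic models with a language model (illustrative names only).
from typing import Dict, List

def pre_process(speech_signal: List[float]) -> List[float]:
    # Stand-in for feature extraction: pass the signal through unchanged.
    return speech_signal

def acoustic_score(features: List[float], model: List[float]) -> float:
    # Toy acoustic model: negative squared distance to a stored template.
    return -sum((f - m) ** 2 for f, m in zip(features, model))

def recognize(speech_signal: List[float],
              acoustic_models: Dict[str, List[float]],
              language_model: Dict[str, float]) -> str:
    # Isolated-word recognizer: pick the word whose combined acoustic and
    # language model score (log domain) is highest.
    features = pre_process(speech_signal)
    return max(acoustic_models,
               key=lambda w: acoustic_score(features, acoustic_models[w])
                             + language_model.get(w, float("-inf")))

# Example: a two-word vocabulary with equal prior probabilities.
models = {"yes": [1.0, 1.0, 0.0], "no": [0.0, 1.0, 1.0]}
priors = {"yes": -0.7, "no": -0.7}          # log probabilities
print(recognize([0.9, 1.1, 0.1], models, priors))   # -> "yes"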
In a prior art system, as illustrated in FIG. 2, for any given Language 1, Language 1-specific recorded speech data 20 is used to generate acoustic models 22 which represent each phoneme 21 in the language. For any other given Language 2, Language 2-specific recorded speech data 25 is used to generate other acoustic models 24, specific to that language, which represent each phoneme 23 in that Language 2.
FIG. 3 illustrates acoustic models generated according to a preferred embodiment of the present invention. Instead of recording speech data and building acoustic models for all languages separately, as described above, a single universal set of acoustic models is used that may support all languages of the world, or a large group of languages, such as European or Oriental languages, or any plurality of languages. To accomplish this, the speech database from which the statistical information is retrieved to create the acoustic models contains the speech of several languages 33 and will cover all possible phonemes or phoneme-like units in those languages. Thus, the acoustic model of a particular phoneme is constructed based on speech from multiple languages. Accordingly, a list of universal phonemes 31 that cover all the desired languages is included in the speech recognition system, along with corresponding acoustic models 32. Since each phoneme 31 is a unique representation of a single sound, a sound that appears in several languages will be represented by the same phoneme 31 and have the same corresponding acoustic model 32. Instead of phonemes, an alternative embodiment may use phoneme-like subword units such as biphones and triphones based on Hidden Markov Models (HMMs), etc. In another embodiment, the language model 14 in FIG. 1 may be omitted and pattern matching by the recognizer 12 may be based solely on comparison of the speech parameters from the pre-processor 11 to the acoustic models 13.
A speech recognition system according to a preferred embodiment is shown in FIG. 4, based on a discrete density HMM phoneme-based continuous recognition engine. These recognition engines may be useful for telephone speech, for microphone speech, or for other advantageous applications. An input speech signal initially undergoes some form of pre-processing. As shown in FIG. 4, a preferred embodiment uses a vector quantizing feature extraction module 41 which processes an input speech signal and calculates energy and spectral properties (cepstrum) for a 30 msec speech segment once every 10 msec. A preferred embodiment of a telephone speech recognition engine uses the commonly known LPC analysis method to derive 12 cepstral coefficients and log energy, along with first and second order derivatives. A preferred embodiment of a microphone speech recognition engine uses the commonly known MEL-FFT method to accomplish the same purpose. The result for both engines for each speech frame is a vector of 12 cepstra, 12 delta cepstra, 12 delta delta cepstra, delta log energy and delta delta log energy. These speech pre-processing techniques are well known in the art. See, for example, Rabiner and Juang, supra, pp. 112-17 and 188-90, for additional discussion of this subject. The remainder of the processing is the same for both engines.
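As a rough illustration of the framing described above (30 msec segments every 10 msec, with delta and delta-delta features), the following Python sketch cuts a signal into frames and assembles the per-frame vector of 12 cepstra, 12 delta cepstra, 12 delta delta cepstra, delta log energy and delta delta log energy. The cepstrum computation is a simple placeholder, not the LPC or MEL-FFT analysis named in the text.

import numpy as np

def frame_signal(signal, sample_rate, frame_ms=30, step_ms=10):
    # Cut the signal into overlapping 30 msec frames, one every 10 msec.
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    n_frames = 1 + max(0, len(signal) - frame_len) // step
    return np.stack([signal[i * step:i * step + frame_len] for i in range(n_frames)])

def cepstrum_stub(frame, n_coeff=12):
    # Placeholder cepstrum; not the patent's actual LPC or MEL-FFT analysis.
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-10
    return np.fft.irfft(np.log(spectrum))[:n_coeff]

def deltas(x):
    # First order differences along the time axis, padded to keep the length.
    return np.diff(x, axis=0, prepend=x[:1])

def extract_features(signal, sample_rate):
    frames = frame_signal(np.asarray(signal, dtype=float), sample_rate)
    cep = np.stack([cepstrum_stub(f) for f in frames])           # 12 cepstra
    log_e = np.log(np.sum(frames ** 2, axis=1, keepdims=True) + 1e-10)
    d_cep, dd_cep = deltas(cep), deltas(deltas(cep))              # 12 + 12
    d_e, dd_e = deltas(log_e), deltas(deltas(log_e))              # 1 + 1
    # One 38-dimensional vector per frame, grouped as in the text above.
    return np.hstack([cep, d_cep, dd_cep, d_e, dd_e])

features = extract_features(np.random.randn(16000), sample_rate=16000)
print(features.shape)   # (number_of_frames, 38)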
In a preferred embodiment which employs discrete density HMMs, the system employs a vector quantizing feature extraction module 41 which replaces each observed feature vector by a prototype (or codeword) out of a codebook 43 that best matches the feature vector. The codebooks 43 are designed and created using a large speech database 44 which contains recorded speech data 45 from each of a plurality of languages, together with an algorithm 46 that minimizes some cost function, such as the commonly used k-means clustering method that minimizes the total distortion of the codebooks 43. Single language system codebooks according to the prior art are designed and created using speech data from the target language only. Preferred embodiments of the present invention, on the other hand, are based on multi-language models using speech from a large number of languages and selecting the speech data such that there is an equal amount of data from all languages. In such an embodiment, four codebooks 43 may be constructed: one for cepstra, one for delta cepstra, one for delta delta cepstra, and one for delta log energy and delta delta log energy. Each codebook 43 uses a design algorithm of the following form:
`
# number_of_codewords: number of codewords calculated so far
# target: number of codewords chosen to calculate
# codebook: list of codewords

while (number_of_codewords < target) do
    split(codewords)
    update(codewords)
end

split(codewords)    # splits each codeword into two new ones based on the covariance matrix
    foreach codeword
        eigenvector = calculate_eigenvector(covariance_matrix)
        alfa = epsilon * eigenvalue     # eigenvalue belonging to the eigenvector
        new_codeword1 = codeword + alfa * eigenvector
        new_codeword2 = codeword - alfa * eigenvector
    end

update(codewords)   # updates codewords with their means and calculates covariance matrices
    until (stop criterion) do
        # run through the speech data
        foreach vector in trainingset
            # select the codeword belonging to this vector
            codeword_i = classify(vector, codebook)
            update_mean(codeword_i, vector)
            update_covariance(codeword_i, vector)
        end
    end
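As a concrete illustration, the following Python sketch implements a binary-splitting codebook design of the kind outlined above: each codeword is split along a perturbation direction, then the codewords are re-estimated with k-means style updates until the distortion stops improving. It is a minimal reading of the pseudocode, not the patent's actual implementation; the perturbation along the principal eigenvector of the per-cluster covariance and the value of epsilon are assumptions.

import numpy as np

def classify(vectors, codebook):
    # Select, for each vector, the index of the nearest codeword.
    distances = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)

def design_codebook(training_vectors, target_size, epsilon=0.05, tol=1e-4):
    # training_vectors: (N, D) array of feature vectors pooled from all languages.
    data = np.asarray(training_vectors, dtype=float)
    codebook = data.mean(axis=0, keepdims=True)     # start from a single codeword
    while len(codebook) < target_size:
        # Split: perturb each codeword along the principal axis of its cluster.
        labels = classify(data, codebook)
        new_codewords = []
        for i, cw in enumerate(codebook):
            cluster = data[labels == i]
            cov = np.cov(cluster.T) if len(cluster) > 1 else np.eye(data.shape[1])
            eigvals, eigvecs = np.linalg.eigh(cov)
            direction = epsilon * np.sqrt(max(eigvals[-1], 1e-12)) * eigvecs[:, -1]
            new_codewords += [cw + direction, cw - direction]
        codebook = np.array(new_codewords)
        # Update: re-estimate codeword means until the total distortion converges.
        previous = np.inf
        while True:
            labels = classify(data, codebook)
            distortion = np.mean(np.sum((data - codebook[labels]) ** 2, axis=1))
            for i in range(len(codebook)):
                cluster = data[labels == i]
                if len(cluster):
                    codebook[i] = cluster.mean(axis=0)
            if previous - distortion < tol:
                break
            previous = distortion
    return codebook

# Example: a 16-codeword codebook from random 12-dimensional "cepstra".
codebook = design_codebook(np.random.randn(2000, 12), target_size=16)
print(codebook.shape)   # (16, 12)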
`
Although a preferred embodiment has been described as using a codebook based vector quantizing technique to initially process a speech input signal, other embodiments of the invention may employ other methods of initial speech processing, for example, such as would be used in a continuous density based speech recognition system.
Once the input speech signal has been pre-processed as previously described, such as by vector quantizing, the speech recognizer 48 in FIG. 4 compares the speech signal to acoustic models in the phoneme database 47 together with the language model 49. Instead of creating acoustic models for the phonemes (or other subword units) of any one particular language, a preferred embodiment uses acoustic models for all the phonemes that appear in a large number of languages. A list of such universal language independent phonemes may be constructed by merging specific phoneme lists from each of the various desired languages. A preferred embodiment uses L&H+, a phonetic alphabet designed to cover all languages which represents each sound by a single symbol, and wherein each symbol represents a single sound. Table 1 shows a multi-language phoneme list used to train microphone models on British English, Dutch, American English, French, German, Italian, Spanish, and Japanese. For each phoneme, the table indicates in which language it appears. For example, the phoneme A has been trained on British English, Dutch, American English, French, and Japanese speech.
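To illustrate the merging step just described, the following Python sketch builds a universal phoneme list from per-language phoneme inventories and records, Table 1 style, which languages each phoneme appears in. The tiny inventories shown are hypothetical examples, not the L&H+ alphabet.

# Build a universal phoneme list by merging per-language inventories and
# record which languages each phoneme appears in (as in Table 1).
# The inventories below are illustrative only.
per_language_phonemes = {
    "British English": ["A", "I", "T", "TH"],
    "Dutch":           ["A", "I", "T", "G"],
    "French":          ["A", "I", "T", "ON"],
    "Japanese":        ["A", "I", "T", "TS"],
}

universal = {}
for language, phonemes in per_language_phonemes.items():
    for phoneme in phonemes:
        universal.setdefault(phoneme, set()).add(language)

for phoneme in sorted(universal):
    # Each acoustic model will be trained on speech from all listed languages.
    print(phoneme, "->", sorted(universal[phoneme]))
# e.g. A -> ['British English', 'Dutch', 'French', 'Japanese']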
The training procedures for single language and multi-language acoustic models both use standard training techniques; they differ in the type of data that is presented and the speech units that are trained. The training can be viewed as the construction of a database of acoustic models 47 covering a specific phoneme set. The training process begins by training context independent models using Viterbi training of discrete density HMMs. Then the phoneme models are automatically classified into 14 classes. Based on the class information, context dependent phoneme models are constructed. Next, the context dependent models are trained using Viterbi training of discrete density HMMs. The context dependent and context independent phoneme models are merged, and then, lastly, badly trained context dependent models are smoothed with the context independent models. Such acoustic model training methods are well known within the art of speech recognition. Similar training techniques may be employed in other embodiments, such as for continuous density based speech recognition systems.
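The patent does not spell out the smoothing step; one common approach is to interpolate a sparsely trained context dependent output distribution with its context independent counterpart, weighted by the amount of training data. The following Python sketch shows that idea under those assumptions; the weighting rule and threshold are hypothetical, not taken from the patent.

import numpy as np

def smooth_distribution(cd_counts, ci_distribution, threshold=500.0):
    # cd_counts: codeword observation counts for one context dependent state.
    # ci_distribution: trained context independent distribution for the same phoneme.
    cd_counts = np.asarray(cd_counts, dtype=float)
    total = cd_counts.sum()
    cd_distribution = cd_counts / total if total > 0 else np.zeros_like(cd_counts)
    # The more data the context dependent model has seen, the more it is trusted.
    weight = total / (total + threshold)
    return weight * cd_distribution + (1.0 - weight) * np.asarray(ci_distribution)

# Example: a badly trained context dependent state (only 20 observations over
# 4 codewords) is pulled strongly toward the robust context independent model.
cd = [10, 10, 0, 0]
ci = [0.25, 0.25, 0.25, 0.25]
print(smooth_distribution(cd, ci).round(3))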
Prior art single language acoustic models are trained on speech from the target language. Thus an acoustic model of a given phoneme will be trained based only on speech samples from a single language. The speech recognizer engine will be able to recognize words from that language only. Separate acoustic model libraries for several languages may be constructed, but they cannot easily be combined. In a discrete density based speech recognition system, it is not even possible to combine them into one database, since the codebooks are incompatible across languages. On the other hand, multi-language acoustic models in a preferred embodiment are trained on a speech database 44 which contains recorded speech data 45 from multiple languages. The result of the training is a database of discrete density HMM acoustic models 47 corresponding to a universal list of language independent phonemes. Some of the phoneme models will still be language specific, since they are only observed in one language. Other phoneme models will be trained on speech from more than one language.
The system describes recognizable words in its vocabulary by representing their pronunciation in speech units that are available in the acoustic model database 47. For single language acoustic model databases, this implies that only words of one language can be described, or that foreign words are simulated by describing them in speech units of that particular language. In a preferred embodiment, the multi-language acoustic model database 47 contains phoneme models that can describe words in any of the targeted languages. In either a single language or multi-language implementation, words may be added to the vocabulary of the speech recognizer system either automatically or by interaction with the user. Whether automatically or interactively, however, a preferred embodiment of a multi-language recognizer uses a vocabulary, i.e. the list of words the recognizer knows of, which can contain words of several languages. It is thus possible to recognize words of different languages. The detailed procedures for word addition differ accordingly between single language and multi-language speech recognition systems.
In a single language system, the interactive word addition mode starts with the user entering a word by typing it (e.g. "L&H"). The new word is automatically converted to a phonetic representation by a rule based system derived from the automatic text to speech conversion module or by dictionary look-up. The user can then check the transcription by listening to the output of the text to speech system that reads the phonetic transcription that it just generated (e.g. the system says "Lernout and Hauspie Speech Products"). If the user is not satisfied with the pronunciation, he can change the phonetic transcription in two ways (e.g. the user would have liked "el and eitch"). First, the user can edit the phonetic transcription directly and listen to the changes he made by having the text to speech system play back the altered phonetic string. Alternatively, the user may enter a word that sounds like what he actually wants in a separate orthographic field (e.g. "L. and H.") and the system will convert the sound-like item into phonemes and use this as the phonetic transcription for the real word. Once the user is satisfied with the pronunciation of the new word, he can check it in; the transcription units are retrieved from the model database, and the word is added to the recognizer and can now be recognized.
In the multi-language system of a preferred embodiment, however, the procedure for adding words interactively differs somewhat. First, as before, the user enters a new word by typing it. The system then automatically determines the language of the word via dictionary look-up and/or a rule based system and presents one or more choices to the user. For each of the chosen languages, the word is automatically converted to a phonetic representation by a rule based system derived from an automatic text to speech conversion module of that particular language. The user can check the transcriptions by listening to the output of the text to speech system that reads the phonetic transcriptions that it just generated. If the user is not satisfied with the language choice the system made, he can overrule the system and indicate explicitly one or more languages. If the user is not satisfied with the pronunciation, he can change the phonetic transcription in two ways, for each of the selected languages. The user may edit the phonetic transcriptions directly; he can listen to the changes he made by having the text to speech system play back the altered phonetic string. In this way, the user can use phoneme symbols coming from another language, but will then not necessarily be able to listen to the changes. Alternatively, the user may enter a word that sounds like what he actually wants in a separate orthographic field. The system will convert the sound-like item into phonemes and use this as the phonetic transcription for the real word. Once the user is satisfied with the transcriptions of the word, he can check it in. The transcription units are retrieved from the model database, and the word is added to the recognizer and can now be recognized.
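The following Python sketch outlines this multi-language word addition flow at a very high level: guess the language(s) of a typed word, generate one phonetic transcription per language, and register all transcriptions in parallel for the same vocabulary entry. The language guesser and letter-to-sound rule are placeholders, not the patent's dictionary or rule based systems.

# High-level sketch of multi-language word addition (names are illustrative).
def guess_languages(word, dictionaries):
    # Dictionary look-up stands in for the dictionary/rule based language guesser.
    found = [lang for lang, words in dictionaries.items() if word.lower() in words]
    return found or list(dictionaries)        # unknown word: offer every language

def letter_to_sound(word, language):
    # Placeholder for the per-language text to speech transcription rules.
    return [f"{language[:2].upper()}:{ch}" for ch in word.lower() if ch.isalpha()]

def add_word(word, vocabulary, dictionaries, chosen_languages=None):
    languages = chosen_languages or guess_languages(word, dictionaries)
    # One transcription per language; the recognizer uses them all in parallel.
    vocabulary[word] = {lang: letter_to_sound(word, lang) for lang in languages}

dictionaries = {"Dutch": {"hauspie"}, "French": set(), "English": {"speech"}}
vocabulary = {}
add_word("Hauspie", vocabulary, dictionaries)
print(vocabulary["Hauspie"])    # transcription(s) for the language(s) found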
The automatic mode for entering words into the recognizer also differs between single language and multi-language systems. In a single language system, the application program presents the words it wants to have recognized to the speech recognition system, and the word is automatically converted to a phonetic representation by a rule based system derived from the automatic text to speech conversion module or by dictionary look-up. The transcription units are then retrieved from the model database, and the word is added to the recognizer and can now be recognized. In a multi-language system of a preferred embodiment, however, the application program presents the words it wants to have recognized to the speech recognition system and optionally indicates one or more languages for the word. If the language is not indicated, the system will automatically determine the language by dictionary look-up or via a rule-based system, resulting in one or more language choices. For each language, the word is automatically converted to a phonetic representation by a rule based system derived from the automatic text to speech conversion module. The transcription units are then retrieved from the model database, and the word is added to the recognizer and can now be recognized.
A multi-language system of a preferred embodiment also supports a translation mode. In such a system, one or more words are added to the recognizer for a single language following the procedures explained above. An automatic translation system then translates the words into one or more other languages that are supported by the recognizer. For each word, the system can propose one or more candidates. The automatically translated words may be added to the recognizer or edited interactively.
A preferred embodiment also enables recognition of words of a new language. Since creating acoustic models for a particular language requires the recording of a large amount of speech data, the development of a speech recognizer for a new language is costly and time consuming. The model database of the multi-language recognizer supports many more phonemes than a single language model does. Since the probability of finding a non-supported phoneme in this database is low, it becomes possible to describe a word of a language that was not present in the training data. This description will be much more accurate than the description of that word in phonemes of a single different language. To recognize words of a new language, a preferred embodiment requires only the input of the new words and their phonetic representation. No training is necessary.
Prior art speech recognition systems generally have problems recognizing speech from non-native speakers. There are two main reasons: 1) non-native speakers sometimes do not pronounce the words correctly, and 2) non-native speakers sometimes do not pronounce some sounds correctly. Multi-language models, such as in a preferred embodiment, more effectively recognize the speech of non-native speakers because the models for each of the phonemes have been trained on several languages and are more robust to variations due to accent. In addition, when creating a word vocabulary, the user can easily edit phonetic transcriptions and is allowed to use phonemes of a different language to describe foreign influences.
Some algorithms, such as speaker dependent training of words, try to find the best possible phonetic representation for a particular word based on a few utterances of that word by the user. In most cases, the native language of the user is not known. When single language models are used, the speech recognition system is restricted to mapping the speech onto language specific symbols, even though the speech may be from a completely different language. Non-native speakers may produce sounds that cannot be represented well by the model database of a single language model. Preferred embodiments of the present invention avoid this type of problem, since the phoneme model database covers a much wider span of sounds. A word can be added to the recognizer by having the user pronounce the word a few times. The system will automatically construct the best possible phoneme or model unit sequence to describe the word, based on the phoneme model database and the uttered speech. This sequence is referred to as a voice print. These voice prints can be used to recognize utterances of the trained word by the speaker. Since the voice print will better match the speech of the targeted speaker than the speech of another speaker, it can also be used to check or detect the identity of the speaker. This is referred to as speaker verification, or speaker identification.
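As an illustration of how such voice prints might be used for speaker identification, the following Python sketch scores an utterance (here reduced to a phoneme label sequence) against each enrolled speaker's voice print and picks the closest match. The edit-distance scoring is an assumption for illustration, not the patent's scoring method.

# Hypothetical speaker identification over stored voice prints.
def edit_distance(a, b):
    # Classic dynamic programming edit distance between two label sequences.
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # substitution
        previous = current
    return previous[-1]

def identify_speaker(utterance_phonemes, voice_prints):
    # voice_prints: speaker name -> phoneme/model-unit sequence trained by that speaker.
    return min(voice_prints,
               key=lambda speaker: edit_distance(utterance_phonemes, voice_prints[speaker]))

voice_prints = {
    "alice": ["l", "E", "r", "n", "u", "t"],
    "bob":   ["l", "@", "r", "n", "o", "U", "t"],
}
print(identify_speaker(["l", "E", "r", "n", "u", "t"], voice_prints))   # -> alice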
A preferred embodiment is also advantageously employed for language independent recognition of words with language dependent transcriptions. The pronunciation of some words strongly depends on the native language of the speaker. This is a problem for systems in which the native language of the user either varies or is unknown. A typical example is the recognition of proper names. A Dutch name is pronounced differently by a Dutch speaker and a French speaker. Language dependent systems usually describe the foreign pronunciation variants by mapping them to the phonemes of the native language. As described above, it is possible to add a word to the speech recognition system of a preferred embodiment and indicate that it will be spoken in several languages. The system will transcribe the word with rule sets from several languages and generate several phonetic transcriptions. The recognizer uses all the transcriptions in parallel, thus covering all pronunciation variants. This is particularly useful for recognizing proper names in an application that will be used by a variety of speakers whose language is not known.
Language learning programs are computer programs that help users learn to speak a language without the intervention of a live tutor. Automatic speech recognition systems are often used in such programs to help the users test the progress they make and to help them improve their pronunciation of the language to be learned. The confidence level of the recognizer, i.e. an indication of how well a model matches the uttered speech, is an indication of how well the user pronounced a word or sentence that is represented by that model. The local confidence, which is a measure of how well the model matches a small portion of the uttered speech, such as a word in a sentence or a phoneme in an utterance, can give an indication of what type of error the user made and can be used to indicate specific problem areas the user should work on. Multi-language models are more suited for language learning applications than single language models. Users having Language 1 as a mother tongue, who want to learn Language 2, will make mistakes that are typical of the language couple (Language 1, Language 2). Some phonemes that appear in Language 2 do not appear in Language 1 and are thus not known to people having Language 1 as a mother tongue. They will typically replace the unknown phoneme with a phoneme that appears in Language 1, thus mispronouncing words. A typical example is a French person pronouncing an English word in an English text in a French manner, because the same word also exists in French. This type of mistake is typical of each language couple (Language 1, Language 2). A single language recognition system, be it Language 1 or Language 2 specific, cannot detect these substitutions because models to describe the particular phoneme combination are not available. Multi-language models can be used to detect this type of error, since all phonemes of Language 1 and Language 2 are covered. Thus it becomes possible to create language learning systems for language couples that are enhanced with rules that describe mistakes typical of the language couple, and to automatically detect specific mistakes with the help of an automatic speech recognition system.
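To illustrate how a multi-language phoneme set could flag the substitution errors described above, the following Python sketch compares the phonemes the learner was expected to produce with the phonemes the recognizer scored best, and reports substitutions that match a hypothetical confusion list for the language couple. A real system would derive the recognized sequence and its alignment from the recognizer's local confidences.

# Hypothetical detection of language-couple-specific substitutions.
# Example confusions typical of a (French, English) couple (illustrative only).
typical_confusions = {("TH", "S"), ("TH", "Z"), ("H", "")}

def detect_substitutions(expected, recognized, confusions):
    # expected / recognized: aligned phoneme sequences of equal length.
    errors = []
    for target, produced in zip(expected, recognized):
        if target != produced and (target, produced) in confusions:
            errors.append((target, produced))
    return errors

expected   = ["TH", "I", "NG", "K"]      # the word "think"
recognized = ["S",  "I", "NG", "K"]      # learner produced something like "sink"
print(detect_substitutions(expected, recognized, typical_confusions))
# -> [('TH', 'S')] : a substitution typical of the (Language 1, Language 2) couple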
`
`25
`
`35
`
`45
`
`50
`
`55
`
`TABLE 1