(12) Patent Application Publication    (10) Pub. No.: US 2001/0053974 A1
Lucke et al.    (43) Pub. Date: Dec. 20, 2001
`
`(54) SPEECH RECOGNITION APPARATUS,
`SPEECH RECOGNITION METHOD, AND
`RECORDING MEDIUM
`(76) Inventors: Helmut Lucke, Tokyo (JP); Katsuki
`Minamino, Tokyo (JP); Yasuharu
`Asano, Kanagawa (JP); Hiroaki
`Ogawa, Chiba (JP)
Correspondence Address:
William S. Frommer, Esq.
FROMMER LAWRENCE & HAUG LLP
745 Fifth Avenue
New York, NY 10151 (US)
(21) Appl. No.: 09/804,354

(22) Filed: Mar. 12, 2001

(30) Foreign Application Priority Data

Mar. 14, 2000 (JP) ...................................... 2000-069698
`
`Publication Classification
`
(51) Int. Cl.7 .......................... G10L 15/12; G10L 15/08; G10L 15/00
(52) U.S. Cl. .............................................................. 704/240
`(57)
`ABSTRACT
In order to prevent degradation of speech recognition accuracy due to an unknown word, a dictionary database has stored therein a word dictionary in which are stored, in addition to words which are the objects of speech recognition, suffixes, which are sound elements and sound element sequences forming the unknown word, for classifying the unknown word by its part of speech. Based on such a word dictionary, a matching section connects the acoustic models of a sound model database, and calculates a score using the series of features output by a feature extraction section on the basis of the connected acoustic model. Then, the matching section selects a series of words, which represents the speech recognition result, on the basis of the score.
`
[Front-page figure: block diagram — SPEECH INPUT → MICROPHONE → AD CONVERSION SECTION → FEATURE EXTRACTION SECTION → MATCHING SECTION → SPEECH RECOGNITION RESULT, with SOUND MODEL DATABASE, DICTIONARY DATABASE, and GRAMMAR DATABASE connected to the matching section.]
`Amazon / Zentian Limited
`Exhibit 1008
`Page 1
`
`
`
[FIG. 1 (Sheet 1 of 7): block diagram of the conventional apparatus — SPEECH INPUT → 1 MICROPHONE → 2 AD CONVERSION SECTION → 3 FEATURE EXTRACTION SECTION → 4 MATCHING SECTION → SPEECH RECOGNITION RESULT; 5 SOUND MODEL DATABASE, 6 DICTIONARY DATABASE, 7 GRAMMAR DATABASE.]
`
`
`
[FIG. 2 (Sheet 2 of 7): example word-network structure of the word dictionary; labeled "PHONOLOGICAL INFORMATION".]
`
`
`
[FIG. 3 (Sheet 3 of 7): block diagram of the embodiment — SPEECH INPUT → 1 MICROPHONE → 2 AD CONVERSION SECTION → 3 FEATURE EXTRACTION SECTION → 4 MATCHING SECTION → SPEECH RECOGNITION RESULT; 5 SOUND MODEL DATABASE, 6 DICTIONARY DATABASE, 7 GRAMMAR DATABASE.]

[Sheet 4 of 7 (FIG. 4): word-dictionary diagram; no text recovered.]
`
`
`
[FIG. 5 (Sheet 5 of 7): flowchart of SPEECH RECOGNITION PROCESSING — S1: EXTRACT FEATURES; S2: CALCULATE SCORE; S3: SELECT WORD SEQUENCE HAVING GREATEST FINAL SCORE.]
`
`
`
[FIG. 6 (Sheet 6 of 7): another example word-dictionary structure; rotated labels include "ROOT NODE" and "PHONOLOGICAL ..."; remaining text not recoverable.]
`
`
`
[FIG. 7 (Sheet 7 of 7): block diagram of a computer embodiment; labels include 110, REMOVABLE RECORDING MEDIUM, 111.]
`
`
`
`
`SPEECH RECOGNITION APPARATUS, SPEECH
`RECOGNITION METHOD, AND RECORDING
`MEDIUM
`
BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] The present invention relates to a speech recognition apparatus, a speech recognition method, and a recording medium. More particularly, the present invention relates to a speech recognition apparatus and a speech recognition method which are capable of reducing degradation of speech recognition accuracy, for example, in a case where an unknown word is contained in an utterance, and to a recording medium therefor.
`0003 2. Description of the Related Art
`0004 FIG. 1 shows the construction of an example of a
`conventional Speech recognition apparatus for performing
`continuous speech recognition.
`0005 Speech produced by a user is input to a mike
`(microphone) 1. In the microphone 1, the input speech is
`converted into an audio signal as an electrical Signal. This
`audio signal is Supplied to an AD (Analog-to-Digital) con
`version Section 2. In the AD conversion Section 2, the audio
`Signal, which is an analog signal, from the microphone 1 is
`Sampled and quantized, and is converted into audio data
`which is a digital Signal. This audio data is Supplied to a
`feature extraction Section 3.
`0006 The feature extraction section 3 performs, for each
`appropriate frame, acoustic processing, Such as Fourier
`transforming and filtering, on the audio data from the AD
`conversion Section 2, thereby extracting features, Such as,
`for example, MFCC (Mel Frequency Cepstrum Coefficient),
`and Supplies the features to a matching Section 4. Addition
`ally, it is possible for the feature extraction Section 3 to
`extract features, Such as a Spectrum, a linear prediction
`coefficient, a cepstrum coefficient, and a line Spectrum pair.
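As a hedged illustration of the per-frame processing described above, the sketch below computes a simple log-power spectral feature with NumPy. It is a simplified stand-in for a full MFCC front end (the mel filterbank and DCT stages are omitted), and all names and parameter values are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def extract_features(audio, frame_len=400, hop=160, n_coeffs=13):
    """Frame the signal and compute log-power FFT features per frame.

    Simplified stand-in for an MFCC front end: a real MFCC pipeline
    would apply a mel filterbank and a DCT after the power spectrum.
    """
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        # Window each frame before the FFT to reduce spectral leakage.
        frame = audio[start:start + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame)) ** 2
        # Log compression, keeping only the first n_coeffs bins.
        frames.append(np.log(power[:n_coeffs] + 1e-10))
    return np.array(frames)

# One second of synthetic 16 kHz audio yields a (frames, 13) matrix.
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = extract_features(audio)
```

The 25 ms frame length and 10 ms hop (at 16 kHz) are conventional choices, not values stated in the patent.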
`0007. The matching section 4 performs speech recogni
`tion of Speech input to the matching Section 4 (input speech)
`based on, for example, the continuous distribution HMM
`method, while referring to a Sound model database 5, a
`dictionary database 6, and a grammar database 7 as neces
`Sary by using the features from the feature extraction Section
`3.
`0008 More specifically, the sound model database 5
`Stores therein a Sound model showing acoustic features of
`individual Sound elements and Syllables in a spoken lan
`guage for which Speech recognition is performed. Here,
`Since Speech recognition is performed based on the continu
`ous distribution HMM method, for the Sound model, for
`example, HMM (Hidden Markov Model) is used. The dic
`tionary database 6 stores therein word dictionaries in which
`information for the pronunciation (phonological informa
`tion) of each word (vocabulary) which is the object of
`Speech recognition is described. The grammar database 7
`Stores therein grammar rules (language models) for the way
`in which each word entered in the word dictionary of the
`dictionary database 6 is connected (chained). Here, as the
`grammar rule, for example, a rule based on context free
`grammar (CFG), Statistical word Sequencing probability
`(N-gram), etc., can be used.
`
`0009. The matching section 4 connects sound models
`stored in the sound model database 5 by referring to the
`word dictionary of the dictionary database 6, thereby form
`ing a sound model (word model) of the word. Furthermore,
`the matching Section 4 connects Several word models by
`referring to the grammar rules Stored in the grammar data
`base 7, and uses the word model which is connected in that
`manner in order to recognize, based on the features, the
`Speech input to the microphone 1 by the continuous distri
`bution HMM method. That is, the matching section 4 detects
`a series of word models in which the score (likelihood) at
`which the features of the time series output by the feature
`extraction Section 3 are observed is greatest, and outputs a
`word Sequence corresponding to that Series of word models
`as the Speech recognition result.
`0010 More specifically, the matching section 4 accumu
`lates the appearance probability of each feature for the word
`Sequence corresponding to the connected word model,
`assumes the accumulated value as a Score, and outputs the
`word Sequence which maximizes the Score as a speech
`recognition result.
`0011. The score calculation is generally performed by
`jointly evaluating an acoustic Score (hereinafter referred to
`as an "acoustic score” where appropriate) given by the
`Sound model stored in the Sound model database 5 and a
`linguistic Score (hereinafter referred to as a “linguistic
`Score” where appropriate) given by the grammar rule Stored
`in the grammar database 7.
`0012 More specifically, for example, in the case of the
`HMM method, the acoustic score is calculated, for each
`word from the acoustic models which form a word model,
`based on the probability at which the Sequence of features
`output by the feature extraction section 3 is observed
`(appearance probability). Also, for example, in the case of a
`bigram, the linguistic Score is determined based on the
`probability at which a particular word and a word immedi
`ately before that word are connected (chained). Then, the
`Speech recognition result is determined based on a final
`Score (hereinafter referred to as a “final score” where appro
`priate) obtained by jointly evaluating the acoustic score and
`the linguistic Score for each word.
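A bigram linguistic score of the kind mentioned above can be estimated from word-pair counts. The sketch below uses a plain maximum-likelihood estimate; the toy corpus and the absence of smoothing are illustrative assumptions, not details from the patent.

```python
from collections import Counter

def bigram_scores(corpus):
    """Estimate P(word | previous word) by maximum likelihood from
    a list of sentences, each given as a list of words."""
    pair_counts = Counter()
    prev_counts = Counter()
    for sentence in corpus:
        for prev, word in zip(sentence, sentence[1:]):
            pair_counts[(prev, word)] += 1
            prev_counts[prev] += 1
    # Conditional probability = pair count / count of the previous word.
    return {pair: c / prev_counts[pair[0]]
            for pair, c in pair_counts.items()}

corpus = [["i", "want", "to", "go"], ["i", "want", "a", "map"]]
probs = bigram_scores(corpus)
# P(want | i) = 2/2 = 1.0; P(to | want) = 1/2 = 0.5.
```

A deployed system would use log probabilities and smoothing for unseen pairs; both are omitted here for brevity.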
`0013 Specifically, when a k-th word in a word sequence
`composed of N words is denoted as w, and when the
`acoustic Score of the word w is denoted as A(w) and the
`linguistic Score is denoted as L(w), the final Score of that
`word Sequence is calculated, for example, based on the
`following equation:
`
`0014 where X represents summation by varying k from
`1 to N, and C represents a weight applied to the linguistic
`Score L(w) of the word W.
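Equation (1) can be computed directly once per-word acoustic and linguistic scores are available. A minimal sketch, assuming log-domain scores and a single scalar weight C:

```python
def final_score(acoustic, linguistic, c=1.0):
    """Final score of a word sequence per equation (1):
    S = sum over k of (A(w_k) + C * L(w_k)), with log-domain scores."""
    assert len(acoustic) == len(linguistic)
    return sum(a + c * l for a, l in zip(acoustic, linguistic))

# Two candidate word sequences; the greater final score wins.
s1 = final_score([-10.0, -12.0], [-1.0, -2.0], c=2.0)   # -28.0
s2 = final_score([-11.0, -11.0], [-0.5, -1.0], c=2.0)   # -25.0
best = max(s1, s2)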
`0015 The matching section 4 performs a matching pro
`ceSS for determining, for example, N by which the final Score
`shown in equation (1) is maximized and a word Sequence
`W1, W2, ..., WN, and outputs the Word Sequence W1, W2, ...,
`WN as the Speech recognition result.
`0016. As a result of processing such as that described
`above being performed, in the Speech recognition apparatus
`in FIG. 1, for example, when a user utters “
`3fai-vacs, (I want to go to New York)", an
`
`-- 7 if
`
`Amazon / Zentian Limited
`Exhibit 1008
`Page 9
`
`
`
`US 2001/0053974 A1
`
`Dec. 20, 2001
`
`acoustic Score and a linguistic Score are given to each word,
`such as “ - - - -27”, “c”, “ffs favy, and “C3”. When the
`final Score obtained by jointly evaluating those is greatest, a
`word sequence “ -si-a-27”, “c”, “frasi-va”, and “Ed” is
`output as a speech recognition result.
`0.017. If the calculation of the acoustic score is performed
`independently for all the words entered in the word dictio
`nary of the dictionary database 6, Since the amount of
`calculations is large, a method of making common (sharing)
`portions of calculations of the acoustic Score for a plurality
`of words may be used. That is, there is a method in which,
`of the words of the word dictionary, for the words whose
`phonemes at the Start thereof are the Same, a common
`acoustic model is used from the Start phoneme up to the
`phoneme which is the same as the Start phoneme, and
`individual acoustic models are used for the phonemes there
`after, thereby forming one tree-structured network as a
`whole, and an acoustic Score is determined by using this
`network.
`0.018. In this case, for example, as shown in FIG. 2, the
`word dictionary is formed by a network of words of a tree
`Structure (word network), which is obtained by Sequentially
`connecting branches corresponding to the phonemes from
`the Start of each word which is the object of Speech recog
`nition, from a root node which is a starting point.
`0019. When the word network is formed, for the words
`whose phonemes at the Start thereof are the Same, in the
`manner described above, branches corresponding to the Start
`phoneme up to the phoneme which is the same as the start
`phoneme are commonly used. That is, in FIG. 2, an alpha
`betic character Surrounded by Slashes (/) attached to each
`branch indicates a phoneme, and a portion enclosed by a
`rectangle indicates a word. For example, for words “I”,
`“ice”, “icy', and “up', the phoneme /A/ at the start thereof
`is the same and, therefore, a common branch corresponding
`to the phoneme /A/ is made. Also, for the words “I”, “ice',
`and "icy', Since the Second phoneme /I? thereof is also the
`Same, in addition to the Start phoneme /A/, a common branch
`corresponding to the Second phoneme /I/ is also made.
`Furthermore, for the words “ice” and “icy', since the third
`phoneme /S/ thereof is the same, a common branch corre
`sponding to the third phoneme /S/thereof, in addition to the
`Start phoneme /A/ and the Second phoneme /I/, is also made.
`0020) Furthermore, for the words “be” and “beat”, since
`the first phoneme /B/ thereof and the second phoneme /I/
`thereof are the Same, common branches corresponding to the
`Start phoneme /B/ and the Second phoneme // are made.
`0021. In a case where the word dictionary which forms
`the word network of FIG. 2 is used, the matching section 4
`reads, from the Sound model database 5, an acoustic model
`of phonemes corresponding to a Series of branches extend
`ing from the root node of the word network, connects them,
`and calculates, based on the connected acoustic model, an
`acoustic Score by using the Series of features from the feature
`extraction Section 3.
`0022 Consequently, the acoustic scores of the words “I”,
`“ice”, “icy', and “up' are calculated in a common manner
`for the first phoneme /A/ thereof. Also, the acoustic Scores
`of the words “I”, “ice”, and “icy” are calculated in a
`common manner for the first and Second phonemes /A/ and
`/I/. In addition, the acoustic scores of the words “ice” and
`
`“icy” are calculated in a common manner for the first to third
`phonemes /A/, //, and /S/. For the remaining phoneme
`(second phoneme) /P/ of the word “up” and the remaining
`phoneme (fourth phoneme) /I/ of the word "icy", the acous
`tic Score is calculated Separately.
`0023 The acoustic scores of the words “be” and “beat”
`are calculated in a common manner for the first and Second
`phonemes /B/ and // thereof. Then, for the remaining
`phoneme (third phoneme) /T/ of the word “beat”, the acous
`tic Score is calculated Separately.
`0024 Consequently, by using the word dictionary which
`forms the word network, the amount of calculations of
`acoustic Scores can be greatly reduced.
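The prefix sharing described in the paragraphs above is essentially a trie keyed by phonemes. A minimal sketch, using the phoneme transcriptions from the patent's own FIG. 2 example (the dict-of-dicts representation is an implementation choice, not the patent's):

```python
def build_word_network(lexicon):
    """Build a tree-structured word network (a trie) in which words
    sharing initial phonemes share branches.  Each node is a dict of
    phoneme -> child node; a node's "word" key marks a word end."""
    root = {}
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            # setdefault reuses an existing branch or creates a new one,
            # which is exactly the prefix sharing of [0019]-[0020].
            node = node.setdefault(ph, {})
        node["word"] = word
    return root

# FIG. 2 example: "I", "ice", "icy", "up" share /A/; "be", "beat"
# share /B/ /I/.
lexicon = {
    "I": ["A"], "ice": ["A", "I", "S"], "icy": ["A", "I", "S", "I"],
    "up": ["A", "P"], "be": ["B", "I"], "beat": ["B", "I", "T"],
}
net = build_word_network(lexicon)
```

Only two branches leave the root (/A/ and /B/), so the acoustic score for each shared phoneme is evaluated once instead of once per word.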
`0025. In the matching section 4, in the manner described
`above, when acoustic Scores are calculated using a Series of
`features on the basis of acoustic models which are connected
`along a series of branches (hereinafter referred to as a "path’
`where appropriate) extending from the root node of the word
`network, eventually, the end node (in FIG. 2, the end of the
`final branch in a case where movement occurs from the root
`node to the right along the branches) of the word network is
`reached. That is, for example, in a case where an HMM is
`used as an acoustic model, when acoustic Scores are calcu
`lated using the series of features on the basis of the HMMs
`connected along the Series of branches which form the path,
`there is a time when the acoustic Score becomes large to a
`certain degree (hereinafter referred to as a “local maximum
`time' where appropriate) in the final State of the connected
`HMMS
`0026. In this case, in the matching section 4, it is assumed
`that the region from the time of the features at the Start, used
`for the calculation of the acoustic Scores, to the local
`maximum time is a speech region in which a word corre
`sponding to the path is spoken, and the word is assumed to
`be a candidate for a word which is a constituent of the word
`Sequence as the Speech recognition result. Then, based on
`the acoustic models connected along the Series of the
`branches (path) extending from the root node of the word
`network, the calculations of the acoustic Scores of the
`candidate for the word which is connected after the candi
`date of that word are performed again using the Series of
`features after the local maximum time.
`0027. In the matching section 4, as a result of the above
`processing being repeated, a Word Sequence as a candidate
`of a large number of Speech recognition results is obtained.
`The matching Section 4 discards words with a low acoustic
`Score among the candidates of Such a large number of word
`Sequences, that is, performs acoustic pruning, thereby Select
`ing (leaving) only a word Sequence whose acoustic score is
`equal to or greater than a predetermined threshold value, that
`is, only a word Sequence which has a certain degree of
`certainty, from an acoustic point of view, as a speech
`recognition result, and the processing continues.
`0028. In addition, in the process in which a candidate of
`a word Sequence as a speech recognition result is created
`while calculating the acoustic Score in the manner described
`above, the matching Section 4 calculates the linguistic Score
`of a word which is a constituent of the candidates of the
`word Sequence as a speech recognition result, on the basis of
`the grammar rule, Such as N-gram, entered in the grammar
`database 7. Then, the matching Section 4 discards words
`
`Amazon / Zentian Limited
`Exhibit 1008
`Page 10
`
`
`
`US 2001/0053974 A1
`
`Dec. 20, 2001
`
`having a low acoustic Score, that is, performs linguistic
`pruning, thereby selecting (leaving) only a word Sequence
`whose linguistic Score is equal to or greater than a prede
`termined threshold value, that is, only a word Sequence
`which has a certain degree of certainty, from a linguistic
`point of View, as a Speech recognition result, and the
`processing continues.
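The acoustic and linguistic pruning steps described above both amount to dropping candidates whose score falls below a threshold. A minimal sketch (the patent says only "a predetermined threshold value"; the fixed absolute threshold below is an illustrative assumption):

```python
def prune(candidates, threshold):
    """Keep only candidates whose score is >= threshold.

    `candidates` maps a word-sequence tuple to its score (log domain).
    The same routine serves for acoustic and for linguistic pruning,
    called with the respective score and threshold.
    """
    return {seq: s for seq, s in candidates.items() if s >= threshold}

candidates = {
    ("new", "york"): -5.0,
    ("new", "yolk"): -9.0,
    ("newt",): -20.0,
}
kept = prune(candidates, threshold=-10.0)
```

Practical decoders often make the threshold relative to the current best score (beam pruning); that refinement is omitted here.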
`0029. As described above, the matching section 4 calcu
`lates the acoustic Score and the linguistic Score of a word,
`and performs acoustic and linguistic pruning on the basis of
`the acoustic Score and the linguistic Score, thereby Selecting
`one or more word Sequences which seem likely as a speech
`recognition result. Then, by repeating the calculations of the
`acoustic Score and the linguistic Score of a word connected
`after the connected word Sequence, eventually, one or more
`word Sequences which have a certain degree of certainty is
`obtained as a candidate of the Speech recognition result.
`Then, the matching Section 4 determines, from among Such
`word Sequences, a word Sequence having the greatest final
`Score, for example, as shown in equation (1), as the speech
`recognition result.
`0.030. In the speech recognition apparatus, the number of
`words, as the object of Speech recognition, to be entered in
`the word dictionary of the dictionary database 6 is limited,
`for example, due to the computation Speed of the apparatus,
`the memory capacity, etc.
`0.031 When the number of words as the object of speech
`recognition is limited, various problems occur if a user
`speaks a word which is not the object of speech recognition
`(hereinafter referred to as an “unknown word” where appro
`priate).
`0.032 More specifically, in the matching section 4, even
`when an unknown word is spoken, the acoustic Score of each
`word entered in the word dictionary is calculated using the
`features of the Speech of the unknown word, and a word
`whose acoustic Score is large to a certain degree is errone
`ously Selected as a candidate of the Speech recognition result
`of the unknown word.
`0.033 AS described above, when an unknown word is
`spoken, an error occurs at the portion of that unknown word,
`and furthermore, this error may cause an error at other
`portions.
`0034) More specifically, for example, in the manner
`described above, in a case where the user Speaks “/
`- -27 fiéi-vac g”, (I want to go to New York)", when
`i.7E.g 27, (New York)" is an unknown word, since an
`erroneous word is selected in the portion "/- 1 - 3 - 27,
`(New York)", it is difficult to precisely determine the bound
`ary between “/ -a, -sa-27, (New York)", which is an
`unknown word, and the word “/3, (to)” which follows. As
`a result, an error occurs at the boundary between the words
`and this error affects the calculation of the acoustic Score of
`the other portions.
`0035) Specifically, in the manner described above, after
`an erroneous word, which is not “? a - 3 - 27, (New York)”,
`is Selected, the acoustic Score of the next word is calculated
`using the Series of features in which the end point of the
`Series of features, used for the calculation of the acoustic
`Score of that erroneous word, is a Starting point. Conse
`
`quently, the calculation of the acoustic Score is performed,
`for example, using the features of the end portion of the
`speech "/ t-r-sa-27, (New York)”, or is performed without
`using the features of the initial portion of the next speech "/
`, (to)”. As a result, there are cases in which the acoustic
`score of the correct word “/k, (to)" as the speech recogni
`tion result becomes Smaller than that of the other words.
`0036). In addition, in this case, even if the acoustic score
`of the word which was wrongly recognized as the Speech
`recognition result does not become very large, the linguistic
`Score of the word becomes large. As a result, there are cases
`in which the Score when the acoustic Score and the linguistic
`Score are jointly evaluated becomes greater than the Score
`when the acoustic Score and the linguistic Score of the
`correct word "/3, (to) as the speech recognition result are
`jointly evaluated (hereinafter referred to as a “word score”
`where appropriate).
`0037 AS described above, as a result of making a mistake
`in the Speech recognition of the unknown word, the Speech
`recognition of a word at a position close to the unknown
`word is also performed mistakenly.
`0038. As a word which is the object of speech recognition
`in the Speech recognition apparatus, generally, for example,
`a word with a high appearance incidence in newspapers,
`novels, etc., is often Selected, but there is no guarantee that
`a word with a low appearance incidence will not be spoken
`by a user. Therefore, it is necessary to Somehow cope with
`an unknown word.
`0039. An example of a method for coping with an
`unknown word, is one in which, for example, an unknown
`word, which is a word which is not the object of speech
`recognition, is divided into Segments, Such as Sound ele
`ments which form the word or a Sound element Sequence
`composed of Several Sound elements, and this Segment is
`considered as a word in a pseudo manner (what is commonly
`called a “sub-word”) so that the word is made an object of
`Speech recognition.
`0040 Since there are not very large number of types of
`Sound elements which form a word and Sound element
`Sequences, even if Such Sound elements and Sound element
`Sequences are made objects of Speech recognition as pseudo
`words, this does not exert a very large influence on the
`amount of calculations and the memory capacity. In this
`case, the unknown word is recognized as a Series of pseudo
`words (hereinafter referred to as “pseudo-words” where
`appropriate), and as a result, the number of unknown words
`apparently becomes Zero.
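As a hedged sketch of the sub-word idea, an out-of-vocabulary word's phoneme string can be covered greedily by a small inventory of sound elements and sound element sequences. The inventory and the romanized phoneme string below are illustrative assumptions, not the patent's data:

```python
def cover_with_subwords(phonemes, inventory):
    """Greedily segment a phoneme string into pseudo-words, preferring
    the longest matching sound-element sequence at each position."""
    units = sorted(inventory, key=len, reverse=True)
    out, i = [], 0
    while i < len(phonemes):
        for u in units:
            if phonemes.startswith(u, i):
                out.append(u)
                i += len(u)
                break
        else:
            raise ValueError(f"no unit covers position {i}")
    return out

# A (romanized) unknown word recognized as a series of pseudo-words.
inventory = ["ny", "u", "yo", "o", "k", "r"]
segments = cover_with_subwords("nyuyooku", inventory)
```

A real recognizer would choose the segmentation by acoustic score rather than greedily, but the coverage property is the same: any phoneme string over the inventory can be recognized.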
`0041. In this case, even if not only an unknown word, but
`also a word entered in the word dictionary is spoken, it can
`be recognized as a Series of pseudo-words. Whether the
`spoken word will be recognized as a word entered in the
`word dictionary or as an unknown word as a Series of
`pseudo-words, is determined based on the Score calculated
`for each word.
`0042. However, in a case where a pseudo-word is used,
`Since the unknown word is recognized as Sound elements
`which are a pseudo-word or a Series of Sound element
`Sequences, the unknown word cannot be processed by using
`an attribute thereof. That is, for the unknown word, since, for
`example, the part of Speech as the attribute thereof cannot be
`
`Amazon / Zentian Limited
`Exhibit 1008
`Page 11
`
`
`
`US 2001/0053974 A1
`
`Dec. 20, 2001
`
`known, the grammar rule cannot be applied, causing the
`Speech recognition accuracy to be degraded.
`0.043
`Also, there are some types of speech recognition
`apparatuses in which the word dictionary for each of a
`plurality of languages is prestored in the dictionary database
`6, and the word dictionary is, for example, Switched accord
`ing to an operation by a user So that speech recognition of
`a plurality of languages is made possible. In this case, the
`words of the languages other than the language of the word
`dictionary which is currently used become unknown words,
`however, if the language, as the attribute, of the unknown
`word is known, it is possible to automatically Switch to the
`word dictionary of that language, and furthermore, in this
`case, the word which was an unknown word can be recog
`nized correctly.
`0044 Specifically, for example, in a case where English
`and French word dictionaries are Stored in the dictionary
`database 6, when the English word dictionary is in use, if it
`is known that the unknown word is a French word, consid
`ering that the Speaker changed to a French perSon, the word
`dictionary may be switched to the French dictionary from
`the English dictionary, So that Speech recognition with a
`higher accuracy is made possible.
`
SUMMARY OF THE INVENTION

[0045] The present invention has been achieved in view of such circumstances. An object of the present invention is to improve speech recognition accuracy by allowing the attribute of an unknown word to be obtained.
`0046) To achieve the above-mentioned object, according
`to one aspect of the present invention, there is provided a
`Speech recognition apparatus comprising: extraction means
`for extracting features of the Speech from the Speech;
`calculation means for calculating the Score using the features
`on the basis of a dictionary in which unknown-word
`forming elements, which are elements forming an unknown
`word, for classifying an unknown word by an attribute
`thereof and words for the object of Speech recognition are
`entered; and Selection means for Selecting a Series of the
`words, which represents a speech recognition result, on the
`basis of the Score.
`0047. In the dictionary, unknown-word-forming elements
`for classifying an unknown word by a part of Speech thereof
`may be entered.
`0.048. In the dictionary. Suffixes may be entered as the
`unknown-word-forming elements.
`0049. In the dictionary, phonemes which form an
`unknown word may be entered together with the suffixes.
`0050. In the dictionary, unknown-word-forming elements
`for classifying an unknown word by a language thereof may
`be entered.
`0051. The speech recognition apparatus of the present
`invention may further comprise a dictionary.
`0.052 According to another aspect of the present inven
`tion, there is provided a speech recognition method com
`prising the Steps of extracting features of the Speech from
`the Speech, calculating the Score using the features on the
`basis of a dictionary in which unknown-word-forming ele
`ments, which are elements forming an unknown word, for
`
`classifying an unknown word by an attribute thereof and
`words for the object of Speech recognition are entered; and
`Selecting a Series of the words, which represents a speech
`recognition result, on the basis of the Score.
`0053 According to yet another aspect of the present
`invention, there is provided a recording medium having
`recorded therein a program, the program comprising the
`Steps of: extracting features of the Speech from the Speech;
`calculating the Score using the features on the basis of a
`dictionary in which unknown-word-forming elements,
`which are elements forming an unknown word, for classi
`fying an unknown word by an attribute thereof and words for
`the object of Speech recognition are entered; and Selecting a
`Series of the words, which represents a speech recognition
`result, on the basis of the Score.
`0054.
`In the speech recognition apparatus, the speech
`recognition method, and the recording medium therefor of
`the present invention, a Score is calculated using features on
`the basis of a dictionary in which unknown-word-forming
`elements, which are elements forming an unknown word, for
`classifying an unknown word by an attribute thereof and
`words for the object of Speech recognition are entered, and
`a Series of words, which represents a speech recognition
`result, is Selected on the basis of the Score.
`0055. The above and further objects, aspects and novel
`features of the invention will become more fully apparent
`from the following detailed description when read in con
`junction with the accompanying drawings.
`
BRIEF DESCRIPTION OF THE DRAWINGS

[0056] FIG. 1 is a block diagram showing the construction of an example of a conventional speech recognition apparatus.

[0057] FIG. 2 is a diagram showing an example of the structure of a word dictionary stored in a dictionary database 6 of FIG. 1.

[0058] FIG. 3 is a block diagram showing an example of the construction of an embodiment of a speech recognition apparatus to which the present invention is applied.

[0059] FIG. 4 is a diagram showing an example of the structure of a word dictionary stored in the dictionary database 6 of FIG. 3.

[0060] FIG. 5 is a flowchart illustrating the processing of the speech recognition apparatus of FIG. 3.

[0061] FIG. 6 is a diagram showing another example of the structure of a word dictionary stored in the dictionary database 6 of FIG. 3.

[0062] FIG. 7 is a block diagram showing an example of the construction of an embodiment of a computer to which the present invention is applied.
`
DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0063] FIG. 3 shows an example of the construction of an embodiment of a speech recognition apparatus to which the present invention is applied. Components in FIG. 3 corresponding to those in FIG. 1 are given the same reference numerals and, accordingly, descriptions thereof are omitted in the following. That is, the speech recognition apparatus of FIG. 3 is constructed basically similarly to the speech recognition apparatus of FIG. 1.
`0064. However, in addition to a word dictionary in which
`are entered words for the objects of Speech recognition
`(hereinafter referred to as a “standard dictionary” where
`appropriate), stored in the dictionary database 6 of FIG. 1,
`in the dictionary database 6 of the Speech recognition
`apparatus of FIG. 3, an unknown word dictionary is also
`Stored in which unknown-word-forming elements, which are
`elements forming an unknown word, for classifying an
`unknown word by an attribute thereof are entered. That is,
`in the embodiment of FIG. 3, the word dictionary stored in
`the dictionary database 6 is composed of the Standard
`dictionary and the unknown word dictionary.
`0065. Also in the word dictionary of the dictionary data
`base 6 of FIG. 3, a word network is formed similarly to the
`word dictionary of the dictionary database 6 of FIG. 1.
`0.066 More specifically, in the word dictionary of the
`dictionary database 6 of FIG. 3, for example, as shown in
`FIG. 4, a word dictionary similar to the case in FIG. 2
`described above is formed, and this is assumed to be a
`Standard dictionary. Furthermore, in the word dictionary of
`the dictionary database 6 of FIG. 3, a general-purpose
`branch, which is one or more branches, to which the
`phonemes of a pseudo-word which is a Sound element or a
`Sound element Sequence which form an unknown word
`correspond, is connected to the root node, and furthermore,
`an attribute branch, which is one or more branches, to which
`phonemes (sequence) for classifying the unknown word by
`an attribute thereof corresponds, is connected to the general
`purpose branch, thereby forming a word network for coping
`with the unknown word, and this is assumed to be an
`unknown word dictionary.
`0067 More specifically, in the embodiment of FIG.4, the
`unknown word dictionary is formed in Such a way that a
`general-purpose branch and an attribute branch are con
`nected in Sequence to the root node. Furthermore, a branch
`which acts as a loop (hereinafter referred to as a "loop
`branch” where appropriate) is connected to the general
`purpose branch. Since the general-purpose branch is formed
`of one or more branches to which the phonemes of a
`pseudo-word which is various Sound elements or a Sound
`element Sequence correspond, by repeating passing through
`the general-purpose branch and after going through the loop
`branch, passing through the general-purpose branch again,
`all the words (containing both the words entered in the
`Standard dictionary, and the unknown words) can be recog
`nized as a Series of pseudo-words.
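The attribute branch described above classifies an unknown word by, for example, a part-of-speech suffix. A minimal sketch of that classification step, assuming a toy suffix table (the suffixes and attribute tags below are illustrative, not taken from the patent's dictionary):

```python
def classify_unknown(phonemes, suffix_table):
    """Classify an unknown word by matching its ending against a
    table of attribute suffixes, preferring the longest match.
    Returns the attribute, or None if no suffix matches."""
    for suffix in sorted(suffix_table, key=len, reverse=True):
        if phonemes.endswith(suffix):
            return suffix_table[suffix]
    return None

# Toy table mapping phoneme-string suffixes to parts of speech.
suffix_table = {"shon": "noun", "suru": "verb", "teki": "adjectival"}
attr = classify_unknown("apurikeeshon", suffix_table)
```

In the word network of FIG. 4, this matching is realized structurally: the general-purpose branches (with the loop) absorb the body of the unknown word, and the attribute branch that accepts the word's ending identifies its attribute.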
`0068. However, whether the spoken word will be recog
`nized as a word entered in the Standard dictionary or as an
`unknown word as a Series of pseudo-words is determined
`based on the Score calculated