Lee et al.

US006151574A
[11] Patent Number: 6,151,574
[45] Date of Patent: Nov. 21, 2000
`
`[54] TECHNIQUE FOR ADAPTATION OF
`HIDDEN MARKOV MODELS FOR SPEECH
`RECOGNITION
`
[75] Inventors: Chin-Hui Lee, Basking Ridge, N.J.; Koichi Shinoda, Chiba, Japan

[73] Assignee: Lucent Technologies Inc., Murray Hill, N.J.

[21] Appl. No.: 09/149,782
[22] Filed: Sep. 8, 1998
`
Related U.S. Application Data
[60] Provisional application No. 60/067,822, Dec. 5, 1997.

[51] Int. Cl.7 .................................................. G10L 15/14
[52] U.S. Cl. .......................................... 704/256; 704/243
[58] Field of Search .................................. 704/243, 244,
`
[56] References Cited

U.S. PATENT DOCUMENTS

5,737,487   4/1998   Bellegarda et al. .......... 704/250
5,787,394   7/1998   Bahl et al. ................ 704/238
5,794,197   8/1998   Alleva et al. .............. 704/255
5,797,123   8/1998   Chou et al. ................ 704/256
5,857,169   1/1999   Seide ...................... 704/256
5,907,825   5/1999   Tzirkel-Hancock ............ 704/243
5,912,989   6/1999   Watanabe ................... 382/228
5,960,395   9/1999   Tzirkel-Hancock ............ 704/241
5,983,180   11/1999  Robinson ................... 704/254

OTHER PUBLICATIONS

J. Gauvain et al., "Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, Apr. 1994, pp. 291-298.

C. Lee et al., "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models," IEEE Transactions on Signal Processing, vol. 39, no. 4, Apr. 1991, pp. 806-814.

Boulianne et al., "Optimal Tying of HMM Densities Using Decision Trees," Spoken Language, Oct. 1996.

Miglietta, "Bayesian Adaptation of Speech Recognizers to Field Speech Data," Oct. 1996.

Jen-Tzung Chien, Chin-Hui Lee, and Hsiao-Chuan Wang, "A Hybrid Algorithm for Speaker Adaptation Using MAP

J. Chien et al., "Improved Bayesian Learning of Hidden Markov Models for Speaker Adaptation," Proc. ICASSP-97, 1997, pp. 1027-1030.

K. Shinoda et al., "Speaker Adaptation with Autonomous Model Complexity Control by MDL Principle," Proc. ICASSP-96, 1996, pp. 717-720.

(List continued on next page.)

Primary Examiner—Krista Zele
Assistant Examiner—Michael N. Opsasnick
`
[57] ABSTRACT

A speech recognition system learns characteristics of speech by a user during a learning phase to improve its performance. Adaptation data derived from the user's speech and its recognized result is collected during the learning phase. Parameters characterizing hidden Markov models (HMMs) used in the system for speech recognition are modified based on the adaptation data. To that end, a hierarchical structure is defined in an HMM parameter space. This structure may assume the form of a tree structure having multiple layers, each of which includes one or more nodes. Each node on each layer is connected to at least one node on another layer. The nodes on the lowest layer of the tree structure are referred to as "leaf nodes." Each node in the tree structure represents a subset of the HMM parameters, and is associated with a probability measure which is derived from the adaptation data. In particular, each leaf node represents a different one of the HMM parameters, which is derivable from the probability measure associated with the leaf node. This probability measure is a function of the probability measures which are associated with the nodes connected to the leaf node, and which represent "hierarchical priors" to such a probability measure.

38 Claims, 3 Drawing Sheets
`
`6,151,574
`Page 2
`
`OTHER PUBLICATIONS
`
`Transformation and Adaption, XP—002130508, IEEE Signal
`Processing Letters, pp. 167—169, vol. 4, No. 6, Jun. 1997.
`Carrnelo Giarnrnarco Miglietta, Cha?c Mokbel, Denis Jou
`vet and Jean Monne, Bayesian Adaption of Speech Recog
`niZers to Field.
`Speech Data, XP—002130507, ICSLI96, pp. 917—920, Oct.
`1996, Lannion cedeX, France.
`
`Douglas B. Paul, “Extensions to Phone—State Decision—Tree
`ClusteringzSingle
`Tree
`and
`Tagged
`Clustering”,
`XP—000822740, pp. 1487—1490, Apr. 21, 1997, NeWton,
`Massachusettes, USA.
`
`Koichi Shinoda adn Chin—Hui Lee, “Structural MAP
`Speaker
`Adaption
`Using
`Hierarchical
`Priors”,
`XP—002130506, pp. 381—388, Dec. 1997, Murray Hill, NeW
`Jersey, USA.
`
`Amazon / Zentian Limited
`Exhibit 1021
`Page 2
`
`
`
[FIG. 1 (Sheet 1 of 3): block diagram of speech recognition system 100, including A/D convertor 103, feature extractor 105, end-point detector 113, speech recognizer 117, word model processor 125, switch 139 and delay element 141.]
`
[FIG. 2 (Sheet 2 of 3): hierarchical tree structure 200, with root node 203 on the 1st level, intermediate nodes on the kth level, and leaf nodes 231, 233, 235, 237, 239, 241, 243, 245 and 247 on the Hth level.]
`
[FIG. 3 (Sheet 3 of 3): flow chart of clustering algorithm 300. Step 305: set root node to node n. Step 308: any child nodes? If no, end. Step 311: assign initial pdfs to child nodes. Step 315: divide set of mixture components represented by node n among child nodes. Step 318: calculate pdf for each child node. Step 321: if the sum of distances between child node pdfs and mixture components has not converged, return to step 315. Step 322: set each child node to node n and repeat for each node n.]
`
`
`
`6,151,574
`
`1
`TECHNIQUE FOR ADAPTATION OF
`HIDDEN MARKOV MODELS FOR SPEECH
`RECOGNITION
`
`This application claims the bene?t of US. Provisional
`No. 60/067,822 ?led Dec. 5, 1997.
`
`FIELD OF THE INVENTION
`
The invention relates to speech recognition systems and methods, and more particularly to systems and methods for recognizing speech based on hidden Markov models (HMMs), which are adapted to acoustic inputs during a learning phase of speech recognition.
`
`BACKGROUND OF THE INVENTION
`
`15
`
`2
`tion are estimated using the adaptation data. In implement
`ing the transformation based approach, it is desirable to
`divide the acoustic feature space into a number of subspaces
`and estimate the transformation parameters in each sub
`space. HoWever, the performance of speech recognition
`using the transformation based approach does not signi?
`cantly improve With an increasing amount of adaptation data
`as any improvement is restricted by the limited number of
`variable transformation parameters used in the approach.
`An attempt to combine the Bayesian adaptation approach
`With the transformation based approach to improve the
`speech recognition performance has been made. This
`attempt is described in: Chien et al., “Improved Bayesian
`Learning of Hidden Markov Models for Speaker
`Adaptation,” ICASSP-97, 1997, pp. 1027—1030. HoWever,
`the success of such an attempt relies on the requirement that
`the number of subspaces in the acoustic feature space be
`optimiZed for various amounts of adaptation data, Which is
`usually impractical.
`Accordingly, there eXists a need for combining the Baye
`sian adaptation approach With the transformation based
`approach in a feasible manner to improve the speech rec
`ognition performance, regardless of the amount of available
`adaptation data.
`
`SUMMARY OF THE INVENTION
`
`In accordance With the invention, the HMMs in a speech
`recognition system are adapted during a learning phase
`using a “structural maXimum a posteriori (SMAP)”
`approach. According to this inventive approach, a hierarchi
`cal structure, e.g., a tree structure, having multiple levels is
`constructed in an HMM parameter space. Such a structure is
`characteriZed by transformation parameters associated With
`different levels of the structure. The transformation param
`eters associated With a level represent prior knoWledge,
`referred to as “hierarchical priors,” for the transformation
`parameters associated With the levels therebeneath. The
`transformation parameters associated With each level of the
`structure are estimated based on the adaptation data, and
`hierarchical priors from the levels thereabove. The HMM
`parameters are updated based on the transformation param
`eters associated With the loWest level of the structure, Which
`thus are a function of at least the transformation parameters
`associated With a level other than the loWest level.
`Advantageously, given an amount of available adaptation
`data, the inventive SMAP approach effectively combines the
`Bayesian adaptation approach, characteriZed by use of the
`aforementioned hierarchical priors, With the transformation
`based approach, characteriZed by use of the aforementioned
`hierarchical structure, to synergistically effect the adaptation
`of HMMs.
`
`BRIEF DESCRIPTION OF THE DRAWING
`In the draWing,
`FIG. 1 is a block diagram of a speech recognition system
`in accordance With the invention;
`FIG. 2 illustrates a hierarchical structure Whereby HMMs
`in the system of FIG. 1 are adapted during its learning phase;
`and
`FIG. 3 is a How chart depicting an algorithm for con
`structing the hierarchical structure of FIG. 2.
`
`In communication, data processing and similar systems, a
`user interface using audio facilities is often advantageous
`especially When it is anticipated that the user Would be
`physically engaged in an activity (e.g., driving a car) While
`he/she is operating one such system. Techniques for recog
`niZing human speech in such systems to perform certain
`tasks have been developed.
`In accordance With one such technique, input speech is
`analyZed in signal frames, represented by feature vectors
`corresponding to phonemes making up individual Words.
`The phonemes are characteriZed by hidden Markov models
`(HMMs), and a Viterbi algorithm is used to identify a
`sequence of HMMs Which best matches, in a maXimum
`likelihood sense, a respective concatenation of phonemes
`corresponding to an unknoWn, spoken utterance.
`It is Well knoWn that each HMM comprises model
`parameters, e.g., miXture components Which are character
`iZed by Gaussian distributions. In a learning phase in a
`speech recognition system, the HMMs are adapted to input
`speech by a user to adjust to the particular speech charac
`teristics of the user, thereby increasing accuracy of the
`speech recognition. In prior art, tWo Well knoWn approaches
`for adaptation of HMMs, namely, the Bayesian adaptation
`approach and the transformation based approach, have been
`employed.
`According to the Bayesian adaptation approach, prior
`distributions are assumed for the model parameters in
`HMMs, and the maXimum a posteriori (MAP) estimates for
`the model parameters are calculated. For details on this
`approach, one may refer to: C. Lee et al., “A study on
`Speaker Adaptation of the Parameters of Continuous Den
`sity Hidden Markov Models,” IEEE Transactions on Signal
`Processing, Vol. 39, No. 4, April 1991, pp. 806—814; and J.
`Gauvain et al., “Maximum a Posteriori Estimation for Mul
`tivariate Gaussian Mixture Observations of Markov
`Chains,” IEEE Transactions on Speech and Audio
`Processing, Vol. 2, No. 2, 1994, pp. 291—298. Since the
`Bayesian adaptation approach utiliZes the MAP estimates
`based on knoWledge of prior distributions, it requires less
`input speech data for the adaptation than, e.g., one utiliZing
`maXimum likelihood (ML) estimates Which does not rely on
`any such knoWledge.
`HoWever, if the adaptation data is scarce, the transforma
`tion based approach may be more effective than the Baye
`sian adaptation approach to adapt the HMMs. According to
`the transformation based approach, a transformation, e.g., a
`shift or an affine transformation, is de?ned in an acoustic
`feature space, also knoWn as an “HMM parameter space,” to
`eXplore correlations betWeen different HMMs, and such
`correlations help adapt the HMMs despite the scarcity of the
`adaptation data. Parameters characteriZing the transforma
`
`25
`
`35
`
`45
`
`55
`
`65
`
`DETAILED DESCRIPTION
`FIG. 1 illustrates speech recognition system 100 embody
`ing the principles of the invention. As shoWn in FIG. 1,
`
system 100 includes a number of functional blocks, including analog-to-digital (A/D) convertor 103, feature extractor 105, end-point detector 113, speech recognizer 117, word model processor 125, switch 139 and delay element 141. The functionality of each block of system 100 may be performed by a respective different processor, or the functionality of several or all blocks may be performed by the same processor. Furthermore, each stage can include multiple processing elements. The stages are pipelined and their operations are performed in a synchronous manner.

Specifically, input speech including a sequence of spoken words is provided, through a microphone (not shown), to A/D convertor 103 in system 100. Convertor 103 in a conventional manner samples the input speech. The digital samples are then provided to feature extractor 105.

Upon receiving the digital samples, feature extractor 105 organizes the received samples in speech frames of, say, 20 ms long, and derives for each frame a measure of energy in the frame and a set of short spectrum vectors, e.g., linear-predictive-coding (LPC) parameters. In this instance, the LPC parameters specify a spectrum of an all-pole model which best matches the signal spectrum over the period of time in which the frame of speech samples are accumulated. Based on the LPC parameters, extractor 105 produces a feature vector per frame, which comprises, for example, twelve cepstral features, twelve delta-cepstral features and a delta-energy feature. In a well-known manner, these cepstral and delta-cepstral features characterize the spectrum and its time variation of the speech frame. The delta-energy feature indicates an amount of change in energy in the speech frame from the previous frame.
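As a minimal illustrative sketch of such a 25-dimensional feature layout (the function name and the simple first-difference deltas are assumptions, standing in for whatever delta computation extractor 105 actually employs):

```python
import numpy as np

def assemble_features(cepstra: np.ndarray, energy: np.ndarray) -> np.ndarray:
    """cepstra: (T, 12) cepstral coefficients per frame; energy: (T,) frame energies.
    Returns a (T, 25) array: 12 cepstra, 12 delta-cepstra, 1 delta-energy."""
    # Delta features: change relative to the previous frame (first-frame deltas are 0).
    delta_cep = np.diff(cepstra, axis=0, prepend=cepstra[:1])
    delta_energy = np.diff(energy, prepend=energy[:1])[:, None]
    return np.hstack([cepstra, delta_cep, delta_energy])
```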
End-point detector 113 of conventional design uses the delta-energy feature, in conjunction with the energy measure by feature extractor 105, to determine the beginning and end of a speech signal. It then passes data signals containing the 25 features in each feature vector onto speech recognizer 117 of conventional design, along with any end-point determinations. Based on such data signals and word models provided by word model processor 125 in accordance with the invention, recognizer 117 determines what the spoken words are.

Processor 125, among other things, provides hidden Markov models (HMMs), e.g., continuous density (CD) HMMs in this instance, for various spoken words. Based on the Viterbi algorithm, recognizer 117 identifies an optimum sequence of CDHMMs which best matches, in a maximum likelihood sense, a respective concatenation of phonemes corresponding to an unknown spoken utterance. Such an identification process is realized by dynamic programming in a conventional manner.

It should be noted at this point that system 100 operates in two modes, namely, a learning mode and a normal operation mode. In the normal operation mode, which is the current mode, switch 139 is set at a first position to relay a series of recognized words from recognizer 117 to the output of system 100.

During the manufacture of system 100, all of the model parameters of the CDHMMs in processor 125 are predefined in a conventional manner using training data representing samples of speech from many speakers. However, when a user initially uses system 100 for speech recognition, system 100 undergoes the learning mode where the CDHMMs are adapted to the user's input speech to adjust to his/her particular speech characteristics, thereby further increasing the recognition accuracy of system 100.

In implementing unsupervised learning in this instance, switch 139 is set at a second position in the learning mode
`10
`
`15
`
`25
`
`35
`
`45
`
`55
`
`65
`
`4
to feed the series of recognized words from recognizer 117 back to processor 125, along with the input speech corresponding to the recognized words. According to the unsupervised learning, the input speech is uncontrolled, and the user is not restricted to speak only certain words as in supervised learning. In the learning mode, delay element 141 imparts a proper amount of delay to the input of recognizer 117 representing the input speech to synchronize it with the corresponding recognized words. Processor 125 uses the input of recognizer 117 and the corresponding recognized words as "adaptation data" to adapt the CDHMMs therein.

In accordance with the invention, the HMMs in a speech recognition system, e.g., system 100, are adapted during the learning phase using a "structural maximum a posteriori (SMAP)" approach. According to this inventive approach, a hierarchical structure, e.g., a tree structure, having multiple levels is constructed in an HMM parameter space, also known as an "acoustic feature space." Such a structure is characterized by transformation parameters associated with different levels of the structure. The transformation parameters associated with a level represent prior knowledge, referred to as "hierarchical priors," for the transformation parameters associated with the levels beneath, or subordinate to, that level. The transformation parameters associated with each level of the structure are estimated based on the adaptation data, and aggregate hierarchical priors from the levels thereabove. The estimated transformation parameters associated with the lowest level of the hierarchical structure are used to update the HMM parameters in the learning phase. As a result, such updated HMM parameters are a function of a combination of all of the transformation parameters connected to the structure, which are weighted according to the respective levels associated therewith. The weight for each level varies with the amount of adaptation data used. The updated HMM parameters are used by processor 125 in system 100 to provide the HMMs, in particular, CDHMMs, during its normal operation mode for speech recognition as described before.
In order to fully appreciate the adaptation of HMMs in accordance with the invention, a normalization technique which simplifies the adaptation will now be described. Let G = {g_m; m = 1, ..., M} be the set of all mixture components of the CDHMMs in processor 125, where M represents the total number of mixture components in all the states of all the CDHMMs in processor 125, and g_m represents a normal density function for mixture component m. This normal density function is denoted N(x | μ_m, S_m), where μ_m represents a mean vector and S_m represents a covariance matrix.

Let X = {x_1, ..., x_t, ..., x_T} denote a set of T given observation vectors, which represents the adaptation data. In the normalization, each sample vector x_t is transformed into a vector y_mt for each mixture component m as follows:

y_{mt} = S_m^{-1/2}(x_t - \mu_m),   [1]

where t = 1, ..., T and m = 1, ..., M.
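A small sketch of the normalization in expression [1], assuming full covariance matrices; a system with diagonal covariances would simply divide each dimension by its standard deviation:

```python
import numpy as np

def normalize(X: np.ndarray, mu: np.ndarray, S: np.ndarray) -> np.ndarray:
    """Expression [1]: y_mt = S_m^{-1/2} (x_t - mu_m) for all frames t.
    X: (T, D) observation vectors; mu: (D,) mean; S: (D, D) covariance."""
    vals, vecs = np.linalg.eigh(S)                      # S = V diag(vals) V^T
    S_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T  # symmetric S^{-1/2}
    return (X - mu) @ S_inv_sqrt.T                      # (T, D) normalized vectors
```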
As mentioned before, all of the CDHMM parameters here have been predefined in a conventional manner using training data derived from samples of speech from many speakers during the manufacture of system 100. Where there is no mismatch between the training data and adaptation data, it is apparent that the probability density function (pdf) for Y_m = {y_{m1}, ..., y_{mt}, ..., y_{mT}} may be represented by the standard normal distribution N(Y | 0, I), where 0 represents a vector whose components have zero values, and I represents
an identity matrix. Otherwise, where there is a mismatch between the training data and adaptation data, the pdf for Y_m may be generically represented by N(Y | ν, η), where ν ≠ 0 and η ≠ I. A "mismatch pdf" is defined below to represent the differences between the acoustic characteristics depicted by the training data and those depicted by the adaptation data. It can be shown that the number of mismatch pdfs required to model the acoustic differences is smaller than that of the pdfs for mixture components of HMMs.

Thus, to reduce the number of required pdfs, the set of mixture components, G, is divided into more than one subset G_p, 1 ≤ p ≤ P, where P is the total number of subsets, which is smaller than the total number of mixture components, i.e., M. A common mismatch pdf h_p = N(Y | ν_p, η_p) is shared by all the mixture components in each subset G_p. In the following discussion, each mixture component g in subset G_p is renumbered as g_1^{(p)}, ..., g_m^{(p)}, ..., g_{M_p}^{(p)}, where M_p represents the number of mixture components in subset G_p. The observed vector sequence X = {x_1, ..., x_T} is normalized to Y_m^{(p)} = {y_{m1}^{(p)}, ..., y_{mT}^{(p)}} for mixture component g_m^{(p)}.

The parameters for the mismatch pdfs are estimated using the well known estimation-maximization (EM) algorithm. When the transition probabilities and the mixture component weight coefficients are specified, the auxiliary function Q for the HMM parameters is given by:
`
`15
`
`25
`
`representing the current HMM parameter set; 6={pm(p),
`§m(p); m=1, .
`.
`. Mp and p=1, .
`.
`. SP} representing the HMM
`35
`parameter set to be estimated; and ymfp) represents the
`posterior probability of using mixture component gm(P) at
`time t. The relation between the HMM parameters and the
`mismatch pdf parameters is established as follows:
`
`where Jm(P)=(Sm(P))1/2 representing the Jacobian matrix for
`the normaliZation expressed in
`Based on expression [3],
`the auxiliary function may be modi?ed as follows;
`
`T
`
`P M”
`
`(
`
`>
`
`[4]
`
`Q19 0) = 22y"? 10g a m
`
`
`‘:1 p11 m:1
`
`
`
`1/2
`|(S£np))
`I
`
`I
`
`By differentiating function [4], the maximum likelihood
`(ML) estimates of the mismatch pdf parameters can be
`de?ned as follows:
`
`Mp
`(p) (p)
`7m: ymt
`
`[5]
`
`45
`
`55
`
`65
`
`6
`-continued
`
`T Mp
`Z Z MHZ’ — VP) (yin?) — VP)
`
`T
`
`~ _ 1:1 m:l
`
`[6]
`
`
`
`11p — *T Mp ,
`
`Z 7%’
`
`1 5 H
`
`where (yWWLY/P)t represents the transpose of (ymfPLY/p).
`Using these mismatch pdf parameters, the corresponding
`HMM parameters can be updated as follows:
`
`smwmsme.
`[8]
`where ftmw) and §m(P) represent the updated mean and
`covariance, respectively.
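To make expressions [5]-[8] concrete, here is a minimal NumPy sketch (the function names and array shapes are assumptions, not the patent's notation): it accumulates the ML estimates ν̃_p and η̃_p of the shared mismatch pdf from the normalized vectors and their posteriors, then maps them back onto a member mixture per [7] and [8]:

```python
import numpy as np

def mismatch_ml_estimates(Y: np.ndarray, gamma: np.ndarray):
    """Expressions [5]-[6]. Y: (M_p, T, D) normalized vectors y_mt^(p);
    gamma: (M_p, T) posteriors gamma_mt^(p). Returns (nu, eta)."""
    w = gamma / gamma.sum()                          # normalize the occupancies
    nu = np.einsum('mt,mtd->d', w, Y)                # weighted mean, [5]
    d = Y - nu
    eta = np.einsum('mt,mtd,mte->de', w, d, d)       # weighted covariance, [6]
    return nu, eta

def update_mixture(mu: np.ndarray, S: np.ndarray, nu: np.ndarray, eta: np.ndarray):
    """Expressions [7]-[8]: push the shared mismatch pdf back through S^{1/2}."""
    vals, vecs = np.linalg.eigh(S)
    S_sqrt = vecs @ np.diag(np.sqrt(vals)) @ vecs.T  # symmetric S^{1/2}
    return mu + S_sqrt @ nu, S_sqrt @ eta @ S_sqrt.T # updated mean and covariance
```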
The optimal number of mismatch pdfs for deriving the corresponding HMMs varies with the amount of available adaptation data. We have recognized that a methodology which utilizes the mismatch pdf parameters as transformation parameters for constructing the aforementioned hierarchical structure, and which takes advantage of both its entire structure and sub-structures, is preferable to achieve a good speech recognition performance with any given amount of adaptation data. To that end, a tree structure for the set G is realized in accordance with the invention.
FIG. 2 illustrates one such tree structure, denoted 200. As shown in FIG. 2, tree structure 200 has H levels, where H is an integer greater than zero. The first level of structure 200 consists of root node 203, which represents the set G containing all M mixture components of the CDHMMs. The Hth level of structure 200 consists of M leaf nodes representing the M mixture components of the CDHMMs, respectively. Each intermediate node or non-terminal node between the first level and the Hth level represents a subset of G, i.e., G_p, containing M_p mixture components respectively represented by those M_p leaf nodes or terminal nodes on the Hth level which are subordinate to the intermediate node. A leaf node is said to be subordinate to an intermediate node if the leaf node "emanates" from that intermediate node, or in other words its origin can be traced back to that intermediate node through intervening paths. For example, in structure 200, the leaf nodes which are subordinate to intermediate node 222 are leaf nodes 231, 233 and 235, as they are connected to node 222 through paths 224, 226 and 228, respectively. Similarly, the leaf nodes which are subordinate to intermediate node 211 are leaf nodes 231, 233, 235, 237, 239, 241, 243, 245 and 247. Of course, root node 203 has each leaf node on the Hth level subordinate thereto. The methodology for constructing structure 200 is described in detail below.
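The parent-child bookkeeping just described is simple to realize; a minimal sketch (hypothetical names, not from the patent) of a node record and the "subordinate leaves" relation:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node of a structure like tree 200: the subset G_p it represents
    (as mixture-component indices) plus its child nodes."""
    mixture_ids: list
    children: list = field(default_factory=list)

def subordinate_leaves(node: Node) -> list:
    """Leaf nodes whose origin traces back to `node` through intervening paths."""
    if not node.children:
        return [node]
    return [leaf for c in node.children for leaf in subordinate_leaves(c)]
```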
For each node N in tree structure 200, a mismatch pdf, which is shared by the mixture components in G_p represented by the node, is assigned thereto. Specifically, for each node N, the ML estimates of the mismatch pdf parameters, ν̃_p and η̃_p, are determined using the adaptation data based on expressions [5] and [6].

Without loss of generality, let's estimate the parameter set θ_m = (μ_m, S_m) for each mixture component m in G. For the sake of convenience, it should be noted that in the ensuing discussion, the subscript "m" of each variable associated with the mixture component m is dropped except where a confusion may arise without such a subscript. In tree structure 200, each sequence of connected nodes with one node from each level, including root node 203 and a leaf node, corresponds to a mixture component represented by the leaf
node. Let the sequence of connected nodes corresponding to the mixture component m be denoted {N_1, ..., N_k, ..., N_H}, where N_1 denotes root node 203; N_k denotes one of the intermediate nodes in the sequence; and N_H denotes the leaf node representing the mixture component m.
Let λ_k = (ν_k, η_k) represent the set of mismatch pdf parameters associated with node N_k. The parameter set λ_H is determined by maximizing the posterior probabilities described below, given a sequence of feature vectors, Y. It should be noted that once λ_H is determined, the HMM parameter set θ is readily obtained using expressions [7] and [8] above.

In accordance with the inventive SMAP approach, the set {λ_0, λ_1, ..., λ_k, ..., λ_{H-1}} is used as the aforementioned "hierarchical priors" for λ_H, where λ_0 = (0, I). The mismatch pdf parameter set λ_0 is assumed to be the prior for the parameter set λ_1 associated with root node 203, and the parameter set λ_k associated with node N_k is used as the prior for the parameter set λ_{k+1} associated with its immediately subordinate node or child node, N_{k+1}.

The posteriori distribution for λ_k may be expressed as follows:
`
`15
`
`6,151,574
`
`8
`
`1
`
`Where Gk is a subset of G represented by node Nk; (hvk, T1 k)
`represents ML estimates for (vk, 11k). The parameters 1p>0,
`and E>1 are standard control parameters. By sccessively
`applying expressions [18] and [19] from the root node N1 to
`a leaf node NH along the node sequence, the mean vH and
`the variance 11H for the leaf node NH are obtained. These vH
`and 11H are used as v and 11p, respectively in expressions [7]
`and [8] to update the mixture components, in accordance
`With the inventive SMAP approach.
`It should be noted that expression [18] can be reWritten for
`the leaf node NH as follows:
`
`25
`
`Where
`
`Pk
`
`H
`
`W;
`
`[21]
`
`[22]
`
`to simplify the evaluation of the posteriori distribution of
`[9]. The ?rst approximation using the probability given by
`the maximum a posteriori (MAP) estimates is expressed as
`folloWs:
`
`35
`
`YJdA'kNPO‘k+1:}\'k> Y),
`
`Where
`
`XIFA'U;
`
`Xk=argmaXMpO‘kiXk*1> YJ=aTEmaXMP(YlAQPO‘kIXkAI
`
`The second approximation is expressed as folloWs:
`
`p<Y:Ak)~N(Y:m),
`
`[12]
`
`[13]
`
`[14]
`
`[15]
`
`With these approximations, the MAP estimates of the mis
`match pdf parameters for each node Nk can be determined as
`folloWs:
`
`45
`
`55
`
From expression [20], the mean estimate ν_H in accordance with the SMAP approach can be envisioned as the weighted sum of the ML estimates ν̃_k at different levels of tree structure 200. It should also be noted that at node N_k, the more adaptation data available, the larger is Γ_k, and thus w_k. In addition, the weight w_k for ν̃_k at node N_k decreases exponentially with the value of k. Thus, when the amount of available adaptation data is relatively small, the ML estimates ν̃_k corresponding to upper levels of tree structure 200 dominate in the mean estimate of [20]. On the other hand, when the amount of available adaptation data is relatively large, the ML estimates ν̃_k corresponding to lower levels thereof dominate in such a mean estimate.
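The propagation of [18] (and its weighted-sum reading in [20]-[22]) can be sketched in a few lines. This follows the reconstruction above and handles the mean only; the variance recursion [19] is omitted, and all names are hypothetical:

```python
import numpy as np

def smap_mean(nu_tilde: list, Gamma: list, psi: float) -> np.ndarray:
    """Apply expression [18] level by level, k = 1..H.
    nu_tilde[k-1]: ML mean estimate at level k; Gamma[k-1]: occupancy Gamma_k."""
    nu = np.zeros_like(nu_tilde[0])           # prior lambda_0 = (0, I)
    for nt, g in zip(nu_tilde, Gamma):
        nu = (g * nt + psi * nu) / (g + psi)  # more data => trust nu_tilde more
    return nu                                 # equals the weighted sum of [20]-[22]
```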
The methodology for construction of tree structure 200 will now be described. The a priori knowledge about the embedded structure in the acoustic feature space should be used for the construction of tree structure 200 for the set of all mixture components, G. The tree construction is based on the methodology described, e.g., in: T. Watanabe et al., "Speech Recognition Using Tree-Structured Probability Density Function," Proc. of ICSLP-94, 1994, pp. 223-226. In this tree construction, the well known Kullback divergence between pdfs of the mixture components is used as the distance measure between the mixture components. Accordingly, the distance d(m, n) between two mixture components, say, g_m and g_n, is determined as follows:
`
d(m, n) = \int g_m(x) \log\frac{g_m(x)}{g_n(x)}\, dx = \frac{1}{2} \sum_{i} \left[ \log\frac{\sigma_n(i)^2}{\sigma_m(i)^2} + \frac{\sigma_m(i)^2 + (\mu_m(i) - \mu_n(i))^2}{\sigma_n(i)^2} - 1 \right],   [23]

where σ_m(i) represents the ith diagonal element of covariance S_m. A node pdf is assigned to each node of tree structure
200. The pdf parameters for node N_k, to which mixture components g_m^{(k)} = N(x | μ_m^{(k)}, S_m^{(k)}), m = 1, ..., M_k, belong, are given as follows:

\mu_k = \frac{1}{M_k} \sum_{m=1}^{M_k} \mu_m^{(k)},   [24]

S_k = \frac{1}{M_k} \sum_{m=1}^{M_k} \left[ S_m^{(k)} + (\mu_m^{(k)} - \mu_k)(\mu_m^{(k)} - \mu_k)^t \right].   [25]
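A brief sketch of the two quantities the clustering below needs: the divergence of [23] and the node pdf of [24]-[25]. Diagonal covariances (vectors of variances) are assumed here, matching the σ_m(i) notation:

```python
import numpy as np

def distance(mu_m, var_m, mu_n, var_n):
    """Expression [23]: Kullback divergence d(m, n) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var_n / var_m)
                        + (var_m + (mu_m - mu_n) ** 2) / var_n - 1.0)

def node_pdf(mus: np.ndarray, variances: np.ndarray):
    """Expressions [24]-[25]: pool the M_k member mixtures into one node pdf.
    mus, variances: (M_k, D) stacked means and diagonal variances."""
    mu_k = mus.mean(axis=0)                               # [24]
    var_k = (variances + (mus - mu_k) ** 2).mean(axis=0)  # [25], diagonal case
    return mu_k, var_k
```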
`
FIG. 3 illustrates conventional k-means clustering algorithm 300 for clustering the mixture components. According to this algorithm, the number of branches from each node and the number of levels in tree structure 200 are predetermined, and the clustering is recursively carried out. Specifically, root node 203 is initially set to be node n, as indicated at step 305. At step 308, it is determined whether node n has any child nodes. If node n has no child node, algorithm 300 comes to an end.

Otherwise, if node n has one or more child nodes, initial pdfs are assigned to the child nodes, respectively, using the well known min-max method, as indicated at step 311. At step 315, the set of mixture components represented by node n is divided among the child nodes by calculating the distances between the child node pdfs and the mixture components using expression [23]. Each mixture component is assigned to a child node whose pdf is closest to the mixture component. A pdf is then calculated at step 318 for each child node of node n based on expressions [24] and [25]. For each child node, the sum of the distances from the child node pdf to the mixture components represented by the child node is computed. The respective sums of distances for all of the child nodes of node n are added, resulting in a grand sum. It is then determined whether such a grand sum converges, as indicated at step 321. If the grand sum does not converge, algorithm 300 returns to step 315. Otherwise, if it converges, each child node is set to be a node n, as indicated at step 322. For each new node n, algorithm 300 returns to step 308.
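Pulling the pieces together, a compact sketch of the recursion of algorithm 300, reusing the hypothetical Node, distance and node_pdf helpers above; n_branches and max_depth are the predetermined branch and level counts, and the evenly spaced seeding merely stands in for the min-max method of step 311:

```python
import numpy as np

def cluster(node, mus, variances, n_branches: int, depth: int, max_depth: int,
            tol: float = 1e-4) -> None:
    """Recursive k-means clustering of the mixtures under `node` (FIG. 3)."""
    ids = node.mixture_ids
    if depth >= max_depth or len(ids) <= 1:            # step 308: no children -> stop
        return
    k = min(n_branches, len(ids))
    # Step 311 (stand-in for min-max): seed child pdfs from spread-out members.
    centers = [(mus[ids[i * len(ids) // k]], variances[ids[i * len(ids) // k]])
               for i in range(k)]
    prev = np.inf
    while True:
        groups = [[] for _ in range(k)]
        total = 0.0
        for m in ids:                                  # step 315: assign to closest pdf
            dists = [distance(c_mu, c_var, mus[m], variances[m])
                     for c_mu, c_var in centers]
            j = int(np.argmin(dists))
            groups[j].append(m)
            total += dists[j]
        centers = [node_pdf(mus[g], variances[g]) if g else c
                   for g, c in zip(groups, centers)]   # step 318: recompute pdfs
        if abs(prev - total) < tol:                    # step 321: grand sum converged?
            break
        prev = total
    for g in groups:                                   # step 322: recurse on each child
        child = Node(g)
        node.children.append(child)
        cluster(child, mus, variances, n_branches, depth + 1, max_depth, tol)
```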
The foregoing merely illustrates the principles of the invention. It will thus be appreciated that a person skilled in the art will be able to devise numerous systems which, although not explicitly shown or described herein, embody the principles of the invention and are thus within its spirit and scope.

For example, in the disclosed embodiment, the learning by speech recognition system 100 is unsupervised. That is, the speech provided by a user for the learning is uncontrolled. However, it will be appreciated that such learning may be supervised, in which the user may be allowed to speak only certain pre-selected words for the learning.
We claim:
1. Apparatus for recognizing selected speech based on acoustic models, the apparatus comprising:
a processor responsive to selected data representing a sample of the selected speech for modifying the acoustic models; and
a mechanism for defining a hierarchical structure which includes a plurality of levels, each level having one or more nodes thereon, each level in the structure arranged higher than every other level having a smaller number of nodes thereon than the other level, each node being associated with a probability measure which is determined based on at least the s