United States Patent [19]                                   [11] Patent Number: 6,151,574
Lee et al.                                                  [45] Date of Patent: Nov. 21, 2000

US006151574A

[54] TECHNIQUE FOR ADAPTATION OF HIDDEN MARKOV MODELS FOR SPEECH RECOGNITION

[75] Inventors: Chin-Hui Lee, Basking Ridge, N.J.; Koichi Shinoda, Chiba, Japan

[73] Assignee: Lucent Technologies Inc., Murray Hill, N.J.

[21] Appl. No.: 09/149,782

[22] Filed: Sep. 8, 1998

Related U.S. Application Data

[60] Provisional application No. 60/067,822, Dec. 5, 1997.

[51] Int. Cl.7 ........................................ G10L 15/14
[52] U.S. Cl. ....................................... 704/256; 704/243
[58] Field of Search ............................... 704/243, 244,
[56] References Cited

U.S. PATENT DOCUMENTS

5,737,487    4/1998    Bellegarda et al. .............. 704/250
5,787,394    7/1998    Bahl et al. .................... 704/238
5,794,197    8/1998    Alleva et al. .................. 704/255
5,797,123    8/1998    Chou et al. .................... 704/256
5,857,169    1/1999    Seide .......................... 704/256
5,907,825    5/1999    Tzirkel-Hancock ................ 704/243
5,912,989    6/1999    Watanabe ....................... 382/228
5,960,395    9/1999    Tzirkel-Hancock ................ 704/241
5,983,180   11/1999    Robinson ....................... 704/254

OTHER PUBLICATIONS

J. Gauvain et al., "Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, vol. 2, No. 2, Apr. 1994, pp. 291-298.

C. Lee et al., "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models," IEEE Transactions on Signal Processing, vol. 39, No. 4, Apr. 1991, pp. 806-814.

Boulianne et al., "Optimal tying of HMM densities using decision trees," Spoken Language, Oct. 1996.

Miglietta, "Bayesian Adaptation of Speech Recognizers to Field Speech Data," Oct. 1996.

Jen-Tzung Chien, Chin-Hui Lee, and Hsiao-Chuan Wang, "A Hybrid Algorithm for Speaker Adaptation Using MAP Transformation and Adaptation," XP-002130508, IEEE Signal Processing Letters, vol. 4, No. 6, Jun. 1997, pp. 167-169.

J. Chien et al., "Improved Bayesian Learning of Hidden Markov Models for Speaker Adaptation," Proc. ICASSP-97, 1997, pp. 1027-1030.

K. Shinoda et al., "Speaker Adaptation with Autonomous Model Complexity Control by MDL Principle," Proc. ICASSP-96, 1996, pp. 717-720.

(List continued on next page.)

Primary Examiner—Krista Zele
Assistant Examiner—Michael N. Opsasnick

[57] ABSTRACT

A speech recognition system learns characteristics of speech by a user during a learning phase to improve its performance. Adaptation data derived from the user's speech and its recognized result is collected during the learning phase. Parameters characterizing hidden Markov models (HMMs) used in the system for speech recognition are modified based on the adaptation data. To that end, a hierarchical structure is defined in an HMM parameter space. This structure may assume the form of a tree structure having multiple layers, each of which includes one or more nodes. Each node on each layer is connected to at least one node on another layer. The nodes on the lowest layer of the tree structure are referred to as "leaf nodes." Each node in the tree structure represents a subset of the HMM parameters, and is associated with a probability measure which is derived from the adaptation data. In particular, each leaf node represents a different one of the HMM parameters, which is derivable from the probability measure associated with the leaf node. This probability measure is a function of the probability measures which are associated with the nodes connected to the leaf node, and which represent "hierarchical priors" to such a probability measure.

38 Claims, 3 Drawing Sheets
OTHER PUBLICATIONS

Carmelo Giammarco Miglietta, Chafic Mokbel, Denis Jouvet and Jean Monne, "Bayesian Adaptation of Speech Recognizers to Field Speech Data," XP-002130507, ICSLP 96, pp. 917-920, Oct. 1996, Lannion cedex, France.

Douglas B. Paul, "Extensions to Phone-State Decision-Tree Clustering: Single Tree and Tagged Clustering," XP-000822740, pp. 1487-1490, Apr. 21, 1997, Newton, Massachusetts, USA.

Koichi Shinoda and Chin-Hui Lee, "Structural MAP Speaker Adaptation Using Hierarchical Priors," XP-002130506, pp. 381-388, Dec. 1997, Murray Hill, New Jersey, USA.
U.S. Patent          Nov. 21, 2000          Sheet 1 of 3          6,151,574

[FIG. 1: block diagram of speech recognition system 100.]
U.S. Patent          Nov. 21, 2000          Sheet 2 of 3          6,151,574

FIG. 2
[Tree structure 200: root node 203 on the 1st level, intermediate nodes on the kth level, and leaf nodes 231, 233, 235, 237, 239, 241, 243, 245 and 247 on the Hth level.]
U.S. Patent          Nov. 21, 2000          Sheet 3 of 3          6,151,574

FIG. 3
[Flow chart of algorithm 300: set root node to node n (305); if node n has no child nodes (308), end; otherwise assign initial PDFs to the child nodes (311); divide the set of mixture components represented by node n among the child nodes (315); calculate a PDF for each child node (318); if the sum of distances between the child node PDFs and the mixture components has not converged (321), repeat from step 315; otherwise set each child node to node n (322) and repeat for each node n.]
TECHNIQUE FOR ADAPTATION OF HIDDEN MARKOV MODELS FOR SPEECH RECOGNITION

This application claims the benefit of U.S. Provisional Application No. 60/067,822 filed Dec. 5, 1997.

FIELD OF THE INVENTION

The invention relates to speech recognition systems and methods, and more particularly to systems and methods for recognizing speech based on hidden Markov models (HMMs), which are adapted to acoustic inputs during a learning phase of speech recognition.

BACKGROUND OF THE INVENTION
In communication, data processing and similar systems, a user interface using audio facilities is often advantageous especially when it is anticipated that the user would be physically engaged in an activity (e.g., driving a car) while he/she is operating one such system. Techniques for recognizing human speech in such systems to perform certain tasks have been developed.

In accordance with one such technique, input speech is analyzed in signal frames, represented by feature vectors corresponding to phonemes making up individual words. The phonemes are characterized by hidden Markov models (HMMs), and a Viterbi algorithm is used to identify a sequence of HMMs which best matches, in a maximum likelihood sense, a respective concatenation of phonemes corresponding to an unknown, spoken utterance.

It is well known that each HMM comprises model parameters, e.g., mixture components which are characterized by Gaussian distributions. In a learning phase in a speech recognition system, the HMMs are adapted to input speech by a user to adjust to the particular speech characteristics of the user, thereby increasing accuracy of the speech recognition. In prior art, two well known approaches for adaptation of HMMs, namely, the Bayesian adaptation approach and the transformation based approach, have been employed.

According to the Bayesian adaptation approach, prior distributions are assumed for the model parameters in HMMs, and the maximum a posteriori (MAP) estimates for the model parameters are calculated. For details on this approach, one may refer to: C. Lee et al., "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models," IEEE Transactions on Signal Processing, vol. 39, No. 4, April 1991, pp. 806-814; and J. Gauvain et al., "Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, vol. 2, No. 2, 1994, pp. 291-298. Since the Bayesian adaptation approach utilizes the MAP estimates based on knowledge of prior distributions, it requires less input speech data for the adaptation than, e.g., one utilizing maximum likelihood (ML) estimates which does not rely on any such knowledge.

However, if the adaptation data is scarce, the transformation based approach may be more effective than the Bayesian adaptation approach to adapt the HMMs. According to the transformation based approach, a transformation, e.g., a shift or an affine transformation, is defined in an acoustic feature space, also known as an "HMM parameter space," to explore correlations between different HMMs, and such correlations help adapt the HMMs despite the scarcity of the adaptation data. Parameters characterizing the transformation are estimated using the adaptation data. In implementing the transformation based approach, it is desirable to divide the acoustic feature space into a number of subspaces and estimate the transformation parameters in each subspace. However, the performance of speech recognition using the transformation based approach does not significantly improve with an increasing amount of adaptation data as any improvement is restricted by the limited number of variable transformation parameters used in the approach.

An attempt to combine the Bayesian adaptation approach with the transformation based approach to improve the speech recognition performance has been made. This attempt is described in: Chien et al., "Improved Bayesian Learning of Hidden Markov Models for Speaker Adaptation," ICASSP-97, 1997, pp. 1027-1030. However, the success of such an attempt relies on the requirement that the number of subspaces in the acoustic feature space be optimized for various amounts of adaptation data, which is usually impractical.

Accordingly, there exists a need for combining the Bayesian adaptation approach with the transformation based approach in a feasible manner to improve the speech recognition performance, regardless of the amount of available adaptation data.

SUMMARY OF THE INVENTION

In accordance with the invention, the HMMs in a speech recognition system are adapted during a learning phase using a "structural maximum a posteriori (SMAP)" approach. According to this inventive approach, a hierarchical structure, e.g., a tree structure, having multiple levels is constructed in an HMM parameter space. Such a structure is characterized by transformation parameters associated with different levels of the structure. The transformation parameters associated with a level represent prior knowledge, referred to as "hierarchical priors," for the transformation parameters associated with the levels therebeneath. The transformation parameters associated with each level of the structure are estimated based on the adaptation data, and hierarchical priors from the levels thereabove. The HMM parameters are updated based on the transformation parameters associated with the lowest level of the structure, which thus are a function of at least the transformation parameters associated with a level other than the lowest level.

Advantageously, given an amount of available adaptation data, the inventive SMAP approach effectively combines the Bayesian adaptation approach, characterized by use of the aforementioned hierarchical priors, with the transformation based approach, characterized by use of the aforementioned hierarchical structure, to synergistically effect the adaptation of HMMs.

BRIEF DESCRIPTION OF THE DRAWING

In the drawing,

FIG. 1 is a block diagram of a speech recognition system in accordance with the invention;

FIG. 2 illustrates a hierarchical structure whereby HMMs in the system of FIG. 1 are adapted during its learning phase; and

FIG. 3 is a flow chart depicting an algorithm for constructing the hierarchical structure of FIG. 2.
DETAILED DESCRIPTION

FIG. 1 illustrates speech recognition system 100 embodying the principles of the invention. As shown in FIG. 1,
system 100 includes a number of functional blocks including analog-to-digital convertor 103, feature extractor 105, end-point detector 113, speech recognizer 117, word model processor 125, switch 139 and delay element 141. The functionality of each block of system 100 may be performed by a respective different processor, or the functionality of several or all blocks may be performed by the same processor. Furthermore, each stage can include multiple processing elements. The stages are pipelined and their operations are performed in a synchronous manner.

Specifically, input speech including a sequence of spoken words is provided, through a microphone (not shown), to A/D convertor 103 in system 100. Convertor 103 in a conventional manner samples the input speech. The digital samples are then provided to feature extractor 105.

Upon receiving the digital samples, feature extractor 105 organizes the received samples in speech frames of, say, 20 ms long, and derives for each frame a measure of energy in the frame and a set of short spectrum vectors, e.g., linear-predictive-coding (LPC) parameters. In this instance, the LPC parameters specify a spectrum of an all-pole model which best matches the signal spectrum over a period of time in which the frame of speech samples are accumulated. Based on the LPC parameters, extractor 105 produces a feature vector per frame, which comprises, for example, twelve cepstral features, twelve delta-cepstral features and a delta-energy feature. In a well-known manner, these cepstral and delta-cepstral features characterize the spectrum and its time variation of the speech frame. The delta-energy feature indicates an amount of change in energy in the speech frame from the previous frame.
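To make the front end concrete, the following sketch (Python with NumPy) frames a signal into 20 ms frames and assembles a 25-dimensional feature vector per frame, in the spirit of feature extractor 105. The FFT-based real cepstrum used in place of the LPC-derived cepstrum described above, the 8 kHz sampling rate and the simple first-difference deltas are assumptions made only for illustration; this is not the patent's implementation.

# Illustrative sketch (not the patent's implementation) of per-frame
# feature extraction: cepstra, delta-cepstra and delta-energy.
import numpy as np

def frame_signal(x, sample_rate=8000, frame_ms=20):
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

def cepstra(frame, order=12):
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-10     # magnitude spectrum
    c = np.fft.irfft(np.log(spectrum))                # real cepstrum (stand-in for LPC cepstrum)
    return c[1:order + 1]                             # 12 cepstral features

def features(x, sample_rate=8000):
    frames = frame_signal(x, sample_rate)
    ceps = np.array([cepstra(f) for f in frames])                 # (T, 12)
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)          # per-frame energy
    d_ceps = np.vstack([np.zeros(12), np.diff(ceps, axis=0)])     # delta-cepstra
    d_energy = np.concatenate([[0.0], np.diff(energy)])           # delta-energy
    # 12 cepstra + 12 delta-cepstra + 1 delta-energy = 25 features per frame
    return np.hstack([ceps, d_ceps, d_energy[:, None]])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(8000)        # one second of synthetic "speech"
    print(features(speech).shape)             # (50, 25) for 20 ms frames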
End-point detector 113 of conventional design uses the delta-energy feature, in conjunction with the energy measure by feature extractor 105, to determine the beginning and end of a speech signal. It then passes data signals containing the 25 features in each feature vector onto speech recognizer 117 of conventional design, along with any end-point determinations. Based on such data signals and word models provided by word model processor 125 in accordance with the invention, recognizer 117 determines what the spoken words are.

Processor 125, among other things, provides hidden Markov models (HMMs), e.g., continuous density (CD) HMMs in this instance, for various spoken words. Based on the Viterbi algorithm, recognizer 117 identifies an optimum sequence of CDHMMs which best matches, in a maximum likelihood sense, a respective concatenation of phonemes corresponding to an unknown spoken utterance. Such an identification process is realized by dynamic programming in a conventional manner.
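The maximum-likelihood search performed by recognizer 117 can be illustrated with a generic log-domain Viterbi decoder over Gaussian state densities. The single-Gaussian, diagonal-covariance states and the model shapes below are simplifying assumptions for illustration, not the CDHMM topology of processor 125.

# Minimal log-domain Viterbi sketch over one HMM with diagonal-covariance
# Gaussian state densities (textbook decoder, not the patent's recognizer).
import numpy as np

def log_gauss(x, mean, var):
    # log N(x | mean, diag(var)) for one feature vector x
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def viterbi(obs, log_trans, means, variances, log_init):
    T, n_states = len(obs), len(means)
    delta = np.full((T, n_states), -np.inf)    # best log score ending in state j
    back = np.zeros((T, n_states), dtype=int)  # backpointers
    for j in range(n_states):
        delta[0, j] = log_init[j] + log_gauss(obs[0], means[j], variances[j])
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] + log_trans[:, j]
            back[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[back[t, j]] + log_gauss(obs[t], means[j], variances[j])
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):              # trace back the optimum state sequence
        path.append(back[t, path[-1]])
    return path[::-1], float(np.max(delta[-1]))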
It should be noted at this point that system 100 operates in two modes, namely, a learning mode and a normal operation mode. In the normal operation mode, which is the current mode, switch 139 is set at a first position to relay a series of recognized words from recognizer 117 to the output of system 100.

During the manufacture of system 100, all of the model parameters of the CDHMMs in processor 125 are predefined in a conventional manner using training data representing samples of speech from many speakers. However, when a user initially uses system 100 for speech recognition, system 100 undergoes the learning mode where the CDHMMs are adapted to the user's input speech to adjust to his/her particular speech characteristics, thereby further increasing the recognition accuracy of system 100.

In implementing unsupervised learning in this instance, switch 139 is set at a second position in the learning mode
to feed the series of recognized words from recognizer 117 back to processor 125, along with the input speech corresponding to the recognized words. According to the unsupervised learning, the input speech is uncontrolled and the user is not restricted to speak only certain words as in supervised learning. In the learning mode, delay element 141 imparts a proper amount of delay to the input of recognizer 117 representing the input speech to synchronize it with the corresponding recognized words. Processor 125 uses the input of recognizer 117 and the corresponding recognized words as "adaptation data" to adapt the CDHMMs therein.

In accordance with the invention, the HMMs in a speech recognition system, e.g., system 100, are adapted during the learning phase using a "structural maximum a posteriori (SMAP)" approach. According to this inventive approach, a hierarchical structure, e.g., a tree structure, having multiple levels is constructed in an HMM parameter space, also known as an "acoustic feature space." Such a structure is characterized by transformation parameters associated with different levels of the structure. The transformation parameters associated with a level represent prior knowledge, referred to as "hierarchical priors," for the transformation parameters associated with the levels beneath, or subordinate to, that level. The transformation parameters associated with each level of the structure are estimated based on the adaptation data, and aggregate hierarchical priors from the levels thereabove. The estimated transformation parameters associated with the lowest level of the hierarchical structure are used to update the HMM parameters in the learning phase. As a result, such updated HMM parameters are a function of a combination of all of the transformation parameters connected to the structure, which are weighted according to the respective levels associated therewith. The weight for each level varies with the amount of adaptation data used. The updated HMM parameters are used by processor 125 in system 100 to provide the HMMs, in particular, CDHMMs, during its normal operation mode for speech recognition as described before.
In order to fully appreciate the adaptation of HMMs in accordance with the invention, a normalization technique which simplifies the adaptation will now be described. Let G = {g_m; m = 1, . . . , M} be the set of all mixture components of the CDHMMs in processor 125, where M represents the total number of mixture components in all the states of all the CDHMMs in processor 125, and g_m represents a normal density function for a mixture component m. This normal density function is denoted N(x|μ_m, S_m), where μ_m represents a mean vector and S_m represents a covariance matrix.

Let X = {x_1, . . . , x_t, . . . , x_T} denote a set of T given observation vectors, which represents the adaptation data. In the normalization, each sample vector x_t is transformed into a vector y_mt for each mixture component m as follows:

y_mt = S_m^(-1/2) (x_t - μ_m),   [1]

where t = 1, . . . , T and m = 1, . . . , M.
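Assuming diagonal covariances, the normalization of expression [1] can be sketched as follows; the example checks that data drawn from a mixture component's own distribution is mapped to approximately the standard normal distribution N(0, I).

# Sketch of expression [1] for diagonal covariances: y_mt = S_m^(-1/2) (x_t - mu_m).
# The diagonal S_m is an assumption made here to keep the example short.
import numpy as np

def normalize(X, mu_m, diag_S_m):
    """X: (T, D) observations; mu_m: (D,) mean; diag_S_m: (D,) variances."""
    return (X - mu_m) / np.sqrt(diag_S_m)       # (T, D) normalized vectors y_mt

# With matching statistics the normalized data is roughly N(0, I).
rng = np.random.default_rng(0)
mu, var = np.array([1.0, -2.0]), np.array([4.0, 0.25])
X = rng.normal(mu, np.sqrt(var), size=(1000, 2))
Y = normalize(X, mu, var)
print(Y.mean(axis=0).round(2), Y.var(axis=0).round(2))   # close to [0, 0] and [1, 1]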
As mentioned before, all of the CDHMM parameters here have been predefined in a conventional manner using training data derived from samples of speech from many speakers during the manufacture of system 100. Where there is no mismatch between the training data and adaptation data, it is apparent that the probability density function (pdf) for Y_m = {y_m1, . . . , y_mt, . . . , y_mT} may be represented by the standard normal distribution N(Y|0, I), where 0 represents a vector whose components have zero values, and I represents
an identity matrix. Otherwise, where there is a mismatch between the training data and adaptation data, the pdf for Y_m may be generically represented by N(Y|ν, η), where ν ≠ 0 and η ≠ I. A "mismatch pdf" is defined below to represent the differences between the acoustic characteristics depicted by the training data and those depicted by the adaptation data. It can be shown that the number of mismatch pdfs required to model the acoustic differences is smaller than that of the pdfs for mixture components of HMMs.

Thus, to reduce the number of required pdfs, the set of mixture components, G, is divided into more than one subset G_p, 1 ≤ p ≤ P, where P is the total number of subsets, which is smaller than the total number of mixture components, i.e., M. A common mismatch pdf h_p = N(Y|ν_p, η_p) is shared by all the mixture components in each subset G_p. In the following discussion, each mixture component g in subset G_p is renumbered as g_1^(p), . . . , g_m^(p), . . . , g_Mp^(p), where M_p represents the number of mixture components in subset G_p. The observed vector sequence X = {x_1, . . . , x_T} is normalized to Y_m^(p) = {y_m1^(p), . . . , y_mT^(p)} for mixture components g_m^(p).

The parameters for the mismatch pdfs are estimated using the well known estimation-maximization (EM) algorithm.
When the transition probabilities and the mixture component weight coefficients are specified, the auxiliary function Q for the HMM parameters is given by:

Q(θ, θ̂) = Σ_(t=1)^T Σ_(p=1)^P Σ_(m=1)^(M_p) γ_mt^(p) log N(x_t | μ̂_m^(p), Ŝ_m^(p)),   [2]

where θ = {μ_m^(p), S_m^(p); m = 1, . . . , M_p and p = 1, . . . , P} represents the current HMM parameter set; θ̂ = {μ̂_m^(p), Ŝ_m^(p); m = 1, . . . , M_p and p = 1, . . . , P} represents the HMM parameter set to be estimated; and γ_mt^(p) represents the posterior probability of using mixture component g_m^(p) at time t. The relation between the HMM parameters and the mismatch pdf parameters is established as follows:

N(x_t | μ̂_m^(p), Ŝ_m^(p)) = h_p(y_mt^(p)) / |J_m^(p)|,   [3]

where J_m^(p) = (S_m^(p))^(1/2) represents the Jacobian matrix for the normalization expressed in [1]. Based on expression [3], the auxiliary function may be modified as follows:

Q(θ, θ̂) = Σ_(t=1)^T Σ_(p=1)^P Σ_(m=1)^(M_p) γ_mt^(p) log [ h_p(y_mt^(p)) / |(S_m^(p))^(1/2)| ].   [4]

By differentiating function [4], the maximum likelihood (ML) estimates of the mismatch pdf parameters can be defined as follows:

ν̃_p = [ Σ_(t=1)^T Σ_(m=1)^(M_p) γ_mt^(p) y_mt^(p) ] / [ Σ_(t=1)^T Σ_(m=1)^(M_p) γ_mt^(p) ],   [5]

η̃_p = [ Σ_(t=1)^T Σ_(m=1)^(M_p) γ_mt^(p) (y_mt^(p) − ν̃_p)(y_mt^(p) − ν̃_p)^t ] / [ Σ_(t=1)^T Σ_(m=1)^(M_p) γ_mt^(p) ],   [6]

where (y_mt^(p) − ν̃_p)^t represents the transpose of (y_mt^(p) − ν̃_p). Using these mismatch pdf parameters, the corresponding HMM parameters can be updated as follows:

μ̂_m^(p) = μ_m^(p) + (S_m^(p))^(1/2) ν_p,   [7]

Ŝ_m^(p) = (S_m^(p))^(1/2) η_p (S_m^(p))^(1/2),   [8]

where μ̂_m^(p) and Ŝ_m^(p) represent the updated mean and covariance, respectively.
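For diagonal covariances, expressions [5] through [8] can be sketched as below: the posterior-weighted statistics of the normalized vectors give the ML estimates of the mismatch pdf, which are then mapped back through (S_m^(p))^(1/2) to update the mean and covariance of each mixture component. The array shapes and the diagonal-covariance simplification are assumptions made for brevity.

# Hedged sketch of expressions [5]-[8] with diagonal covariances.
import numpy as np

def mismatch_ml_estimates(Y, gamma):
    """Y: (M_p, T, D) normalized vectors y_mt; gamma: (M_p, T) posteriors."""
    w = gamma[:, :, None]                                   # broadcast posterior weights
    total = gamma.sum()
    nu = (w * Y).sum(axis=(0, 1)) / total                   # expression [5]
    eta = (w * (Y - nu) ** 2).sum(axis=(0, 1)) / total      # diagonal of expression [6]
    return nu, eta

def update_component(mu_m, diag_S_m, nu, eta):
    sqrt_S = np.sqrt(diag_S_m)
    mu_new = mu_m + sqrt_S * nu                             # expression [7]
    S_new = diag_S_m * eta                                  # diagonal of expression [8]
    return mu_new, S_new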
The optimal number of mismatch pdfs for deriving the corresponding HMMs varies with the amount of available adaptation data. We have recognized that a methodology which utilizes the mismatch pdf parameters as transformation parameters for constructing the aforementioned hierarchical structure, and which takes advantage of both its entire structure and sub-structures, is preferable to achieve a good speech recognition performance with any given amount of adaptation data. To that end, a tree structure for the set G is realized in accordance with the invention.

FIG. 2 illustrates one such tree structure, denoted 200. As shown in FIG. 2, tree structure 200 has H levels, where H is an integer greater than zero. The first level of structure 200 consists of root node 203 which represents the set G containing all M mixture components of the CDHMMs. The Hth level of structure 200 consists of M leaf nodes representing the M mixture components of the CDHMMs, respectively. Each intermediate node or non-terminal node between the first level and the Hth level represents a subset of G, i.e., G_p, containing M_p mixture components respectively represented by those M_p leaf nodes or terminal nodes on the Hth level which are subordinate to the intermediate node. A leaf node is said to be subordinate to an intermediate node if the leaf node "emanates" from that intermediate node, or in other words its origin can be traced back to that intermediate node through intervening paths. For example, in structure 200, the leaf nodes which are subordinate to intermediate node 222 are leaf nodes 231, 233 and 235 as they are connected to node 222 through paths 224, 226 and 228, respectively. Similarly, the leaf nodes which are subordinate to intermediate node 211 are leaf nodes 231, 233, 235, 237, 239, 241, 243, 245 and 247. Of course, root node 203 has each leaf node on the Hth level subordinate thereto. The methodology for constructing structure 200 is described in detail below.
For each node N in tree structure 200, a mismatch pdf, which is shared by the mixture components in G_p represented by the node, is assigned thereto. Specifically, for each node N, the ML estimates of the mismatch pdf parameters, ν̃_p and η̃_p, are determined using the adaptation data based on expressions [5] and [6].

Without loss of generality, let's estimate the parameter set θ_m = (μ_m, S_m) for each mixture component m in G. For the sake of convenience, it should be noted that in the ensuing discussion, the subscript "m" of each variable associated with the mixture component m is dropped except where a confusion may arise without such a subscript. In tree structure 200, each sequence of connected nodes with one node from each level, including root node 203 and a leaf node, corresponds to a mixture component represented by the leaf
node. Let the sequence of connected nodes corresponding to the mixture component m be denoted {N_1, . . . , N_k, . . . , N_H}, where N_1 denotes root node 203; N_k denotes one of the intermediate nodes in the sequence; and N_H denotes the leaf node representing the mixture component m.

Let λ_k = (ν_k, η_k) represent the set of mismatch pdf parameters associated with node N_k. The parameter set λ_k is determined by maximizing the posterior probabilities described below, given a sequence of feature vectors, Y. It should be noted that once λ_H is determined, the HMM parameter set θ is readily obtained using expressions [7] and [8] above.

In accordance with the inventive SMAP approach, the set {λ_0, λ_1, . . . , λ_k, . . . , λ_(H−1)} is used as the aforementioned "hierarchical priors" for λ_H, where λ_0 = (0, I). The mismatch pdf parameter set λ_0 is assumed to be the prior for the parameter set λ_1 associated with root node 203, and the parameter set λ_k associated with node N_k is used as the prior for the parameter set λ_(k+1) associated with its immediately subordinate node or child node, N_(k+1). The posteriori distribution for λ_k may be expressed as follows:
p(λ_k | Y) = ∫ · · · ∫ p(λ_k | λ_(k−1), Y) p(λ_(k−1) | λ_(k−2), Y) · · · p(λ_1 | λ_0, Y) dλ_1 · · · dλ_(k−1).   [9]

Two approximations are used to simplify the evaluation of the posteriori distribution of [9]. The first approximation, using the probability given by the maximum a posteriori (MAP) estimates, is expressed as follows:

∫ p(λ_(k+1) | λ_k, Y) p(λ_k | λ̄_(k−1), Y) dλ_k ≈ p(λ_(k+1) | λ̄_k, Y),   [10]

where

λ̄_0 = λ_0;   [11]

λ̄_k = argmax_(λ_k) p(λ_k | λ̄_(k−1), Y) = argmax_(λ_k) p(Y | λ_k) p(λ_k | λ̄_(k−1)).   [12]

The second approximation is expressed as follows:

p(Y | λ_k) ≈ N(Y | ν_k, η_k).   [13]

With these approximations, the MAP estimates of the mismatch pdf parameters for each node N_k can be determined, as given by expressions [16] through [19]. In these expressions, G_k is a subset of G represented by node N_k; (ν̃_k, η̃_k) represents the ML estimates for (ν_k, η_k); and the parameters ψ > 0 and ξ > 1 are standard control parameters. By successively applying expressions [18] and [19] from the root node N_1 to a leaf node N_H along the node sequence, the mean ν_H and the variance η_H for the leaf node N_H are obtained. These ν_H and η_H are used as ν_p and η_p, respectively, in expressions [7] and [8] to update the mixture components, in accordance with the inventive SMAP approach.

It should be noted that expression [18] can be rewritten for the leaf node N_H as a weighted sum,

ν_H = Σ_k w_k ν̃_k,   [20]

where w_k represents the weight given to the ML estimate ν̃_k at the kth level, as given by expression [21], and

Γ_k = Σ_(g_m ∈ G_k) Σ_(t=1)^T γ_mt.   [22]
From expression [21], the mean estimate ν_H in accordance with the SMAP approach can be envisioned as the weighted sum of the ML estimates ν̃_k at different levels of tree structure 200. It should also be noted that at node N_k, the more adaptation data available, the larger is Γ_k, and thus w_k. In addition, the weight w_k for ν̃_k at node N_k decreases exponentially with the value of k. Thus, when the amount of available adaptation data is relatively small, the ML estimates ν̃_k corresponding to upper levels of tree structure 200 dominate in the mean estimate of [21]. On the other hand, when the amount of available adaptation data is relatively large, the ML estimates ν̃_k corresponding to lower levels thereof dominate in such a mean estimate.
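The qualitative behavior just described can be illustrated with a simple shrinkage recursion along a root-to-leaf node sequence: each level interpolates between its parent's estimate and its own ML estimate ν̃_k, with the amount of adaptation data (through Γ_k) deciding the balance. The particular interpolation rule below is an assumption chosen only for illustration; it is not the patent's expressions [18] through [22].

# Illustrative sketch only: an assumed shrinkage rule with the behavior
# described above, not the patent's exact MAP-estimate expressions.
import numpy as np

def propagate_means(ml_estimates, gamma_counts, psi=10.0):
    """ml_estimates: ML mean estimates nu~_k from root (k=1) to leaf (k=H);
    gamma_counts: matching occupation counts Gamma_k; psi: control parameter."""
    nu = np.zeros_like(ml_estimates[0])          # hierarchical prior nu_0 = 0
    for nu_t, count in zip(ml_estimates, gamma_counts):
        nu = (psi * nu + count * nu_t) / (psi + count)   # parent prior vs. own ML estimate
    return nu                                    # smoothed estimate at the leaf

# With little data the result stays near the upper levels' estimates; with a
# lot of data it approaches the leaf's own ML estimate.
ml_estimates = [np.array([0.1]), np.array([0.5]), np.array([2.0])]
print(propagate_means(ml_estimates, [5.0, 3.0, 1.0]))       # stays near upper-level values
print(propagate_means(ml_estimates, [500.0, 300.0, 100.0])) # close to the leaf's 2.0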
The methodology for construction of tree structure 200 will now be described. The a priori knowledge about the embedded structure in the acoustic feature space should be used for the construction of tree structure 200 for the set of all mixture components, G. The tree construction is based on the methodology described, e.g., in: T. Watanabe et al., "Speech Recognition Using Tree-Structured Probability Density Function," Proc. of ICSLP-94, 1994, pp. 223-226. In this tree construction, a well known Kullback divergence between pdfs of the mixture components is used as the distance measure between the mixture components. Accordingly, the distance d(m, n) between two mixture components, say, g_m and g_n, is determined from the Kullback divergence between their pdfs, as given by expression [23], which is evaluated in terms of the mean components and the diagonal covariance elements of the two components, where σ_m(i) represents the ith diagonal element of covariance S_m.
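A concrete candidate for the distance of expression [23] is the symmetric Kullback divergence between two diagonal-covariance Gaussians, sketched below; treating this closed form as exactly the patent's expression [23] is an assumption.

# Hedged sketch: symmetric Kullback divergence between two diagonal Gaussians,
# usable as the distance d(m, n) in the tree construction.
import numpy as np

def symmetric_kl(mu_m, var_m, mu_n, var_n):
    diff2 = (mu_m - mu_n) ** 2
    kl_mn = 0.5 * np.sum(np.log(var_n / var_m) + (var_m + diff2) / var_n - 1.0)
    kl_nm = 0.5 * np.sum(np.log(var_m / var_n) + (var_n + diff2) / var_m - 1.0)
    return kl_mn + kl_nm

print(symmetric_kl(np.array([0.0]), np.array([1.0]),
                   np.array([1.0]), np.array([2.0])))   # about 1.0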
A node pdf is assigned to each node of tree structure 200. The pdf parameters for node N_k, to which mixture components g_m^(k) = N(x|μ_m^(k), S_m^(k)), m = 1, . . . , M_k, belong, are given by expressions [24] and [25].
FIG. 3 illustrates conventional k-means clustering algorithm 300 for clustering the mixture components. According to this algorithm, the number of branches from each node and the number of levels in tree structure 200 are predetermined, and the clustering is recursively carried out. Specifically, root node 203 is initially set to be node n, as indicated at step 305. At step 308, it is determined whether node n has any child nodes. If node n has no child node, algorithm 300 comes to an end.

Otherwise, if node n has one or more child nodes, initial pdfs are assigned to the child nodes, respectively, using the well known min-max method, as indicated at step 311. At step 315, the set of mixture components represented by node n is divided among the child nodes by calculating the distances between the child node pdfs and the mixture components using expression [23]. Each mixture component is assigned to a child node whose pdf is closest to the mixture component. A pdf is then calculated at step 318 for each child node of node n based on expressions [24] and [25]. For each child node, the sum of the distances from the child node pdf to the mixture components represented by the child node is computed. The respective sums of distances for all of the child nodes of node n are added, resulting in a grand sum. It is then determined whether such a grand sum converges, as indicated at step 321. If the grand sum does not converge, algorithm 300 returns to step 315. Otherwise, if it converges, each child node is set to be a node n, as indicated at step 322. For each new node n, algorithm 300 returns to step 308.
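The divisive clustering loop of algorithm 300 can be sketched as follows. In this sketch, Euclidean distance between component means stands in for the Kullback divergence of expression [23], each child node's pdf is re-estimated as the average of its members, and a fixed branching factor and depth are assumed; none of these choices is prescribed by the patent.

# Hedged sketch of the recursive divisive clustering of algorithm 300.
import numpy as np

def split_node(means, n_children=2, max_iter=20, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    centers = means[rng.choice(len(means), n_children, replace=False)]
    prev_total = np.inf
    for _ in range(max_iter):
        d = ((means[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)                       # step 315: divide the set
        total = d[np.arange(len(means)), assign].sum()  # step 321: grand sum of distances
        if prev_total - total < tol:                    # converged?
            break
        prev_total = total
        for c in range(n_children):                     # step 318: new child-node pdf
            if np.any(assign == c):
                centers[c] = means[assign == c].mean(axis=0)
    return assign, centers

def build_tree(means, ids, depth, max_depth=3, min_size=2):
    node = {"components": list(ids), "children": []}
    if depth >= max_depth or len(ids) < min_size * 2:
        return node                                     # leaf of the constructed tree
    assign, _ = split_node(means)
    for c in range(assign.max() + 1):                   # step 322: recurse on each child
        sub = assign == c
        node["children"].append(build_tree(means[sub], ids[sub], depth + 1,
                                           max_depth, min_size))
    return node

rng = np.random.default_rng(1)
component_means = rng.normal(size=(16, 3))
tree = build_tree(component_means, np.arange(16), depth=1)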
The foregoing merely illustrates the principles of the invention. It will thus be appreciated that a person skilled in the art will be able to devise numerous systems which, although not explicitly shown or described herein, embody the principles of the invention and are thus within its spirit and scope.

For example, in the disclosed embodiment, the learning by speech recognition system 100 is unsupervised. That is, the speech provided by a user for the learning is uncontrolled. However, it will be appreciated that such learning may be supervised, in which the user may be allowed to speak only certain pre-selected words for the learning.
We claim:

1. Apparatus for recognizing selected speech based on acoustic models, the apparatus comprising:

a processor responsive to selected data representing a sample of the selected speech for modifying the acoustic models; and

a mechanism for defining a hierarchical structure which includes a plurality of levels, each level having one or more nodes thereon, each level in the structure arranged higher than every other level having a smaller number of nodes thereon than the other level, each node being associated with a probability measure which is determined based on at least the s
