`6,151,574
`(114) Patent Number:
`United States Patent 55
`Lee et al.
`[45] Date of Patent:
`Nov. 21, 2000
`
`
`[54] TECHNIQUE FOR ADAPTATION OF
`HIDDEN MARKOV MODELS FOR SPEECH
`RECOGNITION
`
`[75]
`
`Inventors: Chin-Hui Lee, Basking Ridge, N.J.;
`Koichi Shinoda, Chiba, Japan
`.
`.
`.
`[73] Assignee: Lucent Technologies Inc., Murray Hill,
`NJ.
`
`[21] Appl. No.: 09/149,782
`[22]
`Filed:
`Sep. 8, 1998
`
`
Related U.S. Application Data

[60] Provisional application No. 60/067,822, Dec. 5, 1997.

[51] Int. Cl. .......................................... G10L 15/14
[52] U.S. Cl. ..................................... 704/256; 704/243
[58] Field of Search ................................. 704/243, 244

Primary Examiner—Krista Zele
Assistant Examiner—Michael N. Opsasnick
`
`[56]
`
`References Cited
`
U.S. PATENT DOCUMENTS

5,657,424    8/1997   Farrell et al.
5,737,487    4/1998   Bellegarda et al. .............. 704/250
5,787,394    7/1998   Bahl et al. .................... 704/238
5,794,197    8/1998   Alleva et al. .................. 704/256
5,797,123    8/1998   Chou et al. .................... 704/255
5,857,169    1/1999   Seide .......................... 704/256
5,907,825    5/1999   Tzirkel-Hancock ................ 704/243
5,912,989    6/1999   Watanabe ....................... 382/228
5,933,806    8/1999   Beyerlein et al. ............... 704/256
5,956,676    9/1999   Shinoda ........................ 704/244
5,960,395    9/1999   Tzirkel-Hancock ................ 704/241
5,983,180   11/1999   Robinson ....................... 704/254

OTHER PUBLICATIONS

J. Gauvain et al., "Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 2, Apr. 1994, pp. 291-298.

C. Lee et al., "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models," IEEE Transactions on Signal Processing, Vol. 39, No. 4, Apr. 1991, pp. 806-814.

Boulianne et al., "Optimal Tying of HMM Densities Using Decision Trees," Spoken Language, Oct. 1996.

J. Chien et al., "Improved Bayesian Learning of Hidden Markov Models for Speaker Adaptation," Proc. ICASSP-97, 1997, pp. 1027-1030.

K. Shinoda et al., "Speaker Adaptation with Autonomous Model Complexity Control by MDL Principle," Proc. ICASSP-96, 1996, pp. 717-720.

Jen-Tzung Chien, Chin-Hui Lee and Hsiao-Chuan Wang, "A Hybrid Algorithm for Speaker Adaptation Using MAP Transformation and Adaptation," XP-002130508, IEEE Signal Processing Letters, Vol. 4, No. 6, Jun. 1997, pp. 167-169.

Carmelo Giammarco Miglietta, Chafic Mokbel, Denis Jouvet and Jean Monne, "Bayesian Adaptation of Speech Recognizers to Field Speech Data," XP-002130507, ICSLP 96, Oct. 1996, pp. 917-920, Lannion cedex, France.

Douglas B. Paul, "Extensions to Phone-State Decision-Tree Clustering: Single Tree and Tagged Clustering," XP-000822740, Apr. 21, 1997, pp. 1487-1490, Newton, Massachusetts, USA.

Koichi Shinoda and Chin-Hui Lee, "Structural MAP Speaker Adaptation Using Hierarchical Priors," XP-002130506, Dec. 1997, pp. 381-388, Murray Hill, New Jersey, USA.

[57] ABSTRACT

A speech recognition system learns characteristics of speech by a user during a learning phase to improve its performance. Adaptation data derived from the user's speech and its recognized result is collected during the learning phase. Parameters characterizing hidden Markov models (HMMs) used in the system for speech recognition are modified based on the adaptation data. To that end, a hierarchical structure is defined in an HMM parameter space. This structure may assume the form of a tree structure having multiple layers, each of which includes one or more nodes. Each node on each layer is connected to at least one node on another layer. The nodes on the lowest layer of the tree structure are referred to as "leaf nodes." Each node in the tree structure represents a subset of the HMM parameters, and is associated with a probability measure which is derived from the adaptation data. In particular, each leaf node represents a different one of the HMM parameters, which is derivable from the probability measure associated with the leaf node. This probability measure is a function of the probability measures which are associated with the nodes connected to the leaf node, and which represent "hierarchical priors" to such a probability measure.
`
`38 Claims, 3 Drawing Sheets
`
`300
`
`SET ROOT NODE TO NODE n Lv 305
`308
`NO.
`ANY CHILD NODES ?
`YES
`
`¥
`fol
`END
`ASSIGN INIIITIAL POFs
`TO CHILD NODES
`
`¥
`DIVIDE SET OF MIXTURE
`COMPONENTS
`REPRESENTED
`BY NODE n AMONG CHILD
`NODES
`
`
`
`
`
`
`
`IPR2023-00037
`Apple EX1021 Page1
`
`
`
`
` FOR EACH NODE n
`
`CALCULATE POF FOR
`EACH CHI
`ILD_ NODE
`
`DISTANCES BETWEEN
`CHILD NODE PDFs AND
`MIXTURE COMPONENTS.
`CONVERGE 2
`YES
`
`¥
`SET EACH CHILD NODE
`n
`10
`
`
`
`IPR2023-00037
`Apple EX1021 Page 1
`
`
`
`6,151,574
`
`Page 2
`
`OTHER PUBLICATIONS
`
`Transformation and Adaption, XP—002130508, IEEE Signal
`Processing Letters, pp. 167-169, vol. 4, No. 6, Jun. 1997.
`Carmelo Giammarco Miglietta, Chafic Mokbel, Denis Jou-
`vet and Jean Monne, Bayesian Adaption of Speech Recog-
`nizers to Field.
`
`Speech Data, XP—002130507, ICSLI96, pp. 917-920, Oct.
`1996, Lannion cedex, France.
`
`Douglas B. Paul, “Extensions to Phone-State Decision—Tree
`Clustering:Single
`Tree
`and
`‘Tagged
`Clustering”,
`XP-000822740, pp. 1487-1490, Apr. 21, 1997, Newton,
`Massachusettes, USA.
`
`Koichi Shinoda adn Chin-Hui Lee, “Structural MAP
`Speaker
`Adaption
`Using
`Hierarchical
`Priors”,
`XP-002130506, pp. 381-388, Dec. 1997, Murray Hill, New
`Jersey, USA.
`
`IPR2023-00037
`Apple EX1021 Page 2
`
`IPR2023-00037
`Apple EX1021 Page 2
`
`
`
U.S. Patent                Nov. 21, 2000                Sheet 1 of 3                6,151,574

[FIG. 1: block diagram of speech recognition system 100, showing A/D convertor 103, feature extractor 105, end-point detector 113, speech recognizer 117, word model processor 125, switch 139 and delay element 141.]
`
`
`
`
`
`
`
U.S. Patent                Nov. 21, 2000                Sheet 2 of 3                6,151,574

[FIG. 2: tree structure 200, with levels 1 through H; root node 203 on the 1st level, intermediate nodes (e.g., 211, 222), and leaf nodes 231, 233, 235, 237, 239, 241, 243, 245 and 247 on the Hth level.]
`
`
U.S. Patent                Nov. 21, 2000                Sheet 3 of 3                6,151,574

FIG. 3: flow chart of clustering algorithm 300

  305  SET ROOT NODE TO NODE n
  308  ANY CHILD NODES?  (NO: END; YES: continue)
  311  ASSIGN INITIAL PDFs TO CHILD NODES
  315  DIVIDE SET OF MIXTURE COMPONENTS REPRESENTED BY NODE n AMONG CHILD NODES
  318  CALCULATE PDF FOR EACH CHILD NODE
  321  SUM OF DISTANCES BETWEEN CHILD NODE PDFs AND MIXTURE COMPONENTS CONVERGE?  (NO: return to 315; YES: continue)
  322  SET EACH CHILD NODE TO NODE n; FOR EACH NODE n, return to 308
`
`
`
`6,151,574
`
`1
`TECHNIQUE FOR ADAPTATION OF
`HIDDEN MARKOV MODELS FOR SPEECH
`RECOGNITION
`
This application claims the benefit of U.S. Provisional Application No. 60/067,822, filed Dec. 5, 1997.
`
`FIELD OF THE INVENTION
`
`The invention relates to speech recognition systems and
`methods, and more particularly to systems and methods for
`recognizing speech based on hidden Markov models
`(HMMs), which are adapted to acoustic inputs during a
`learning phase of speech recognition.
`BACKGROUND OF THE INVENTION
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`50
`
`
In communication, data processing and similar systems, a user interface using audio facilities is often advantageous, especially when it is anticipated that the user would be physically engaged in an activity (e.g., driving a car) while he/she is operating one such system. Techniques for recognizing human speech in such systems to perform certain tasks have been developed.

In accordance with one such technique, input speech is analyzed in signal frames, represented by feature vectors corresponding to phonemes making up individual words. The phonemes are characterized by hidden Markov models (HMMs), and a Viterbi algorithm is used to identify a sequence of HMMs which best matches, in a maximum likelihood sense, a respective concatenation of phonemes corresponding to an unknown, spoken utterance.

It is well known that each HMM comprises model parameters, e.g., mixture components which are characterized by Gaussian distributions. In a learning phase in a speech recognition system, the HMMs are adapted to input speech by a user to adjust to the particular speech characteristics of the user, thereby increasing accuracy of the speech recognition. In the prior art, two well known approaches for adaptation of HMMs, namely, the Bayesian adaptation approach and the transformation based approach, have been employed.

According to the Bayesian adaptation approach, prior distributions are assumed for the model parameters in HMMs, and the maximum a posteriori (MAP) estimates for the model parameters are calculated. For details on this approach, one may refer to: C. Lee et al., "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models," IEEE Transactions on Signal Processing, Vol. 39, No. 4, April 1991, pp. 806-814; and J. Gauvain et al., "Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 2, 1994, pp. 291-298. Since the Bayesian adaptation approach utilizes the MAP estimates based on knowledge of prior distributions, it requires less input speech data for the adaptation than, e.g., one utilizing maximum likelihood (ML) estimates which does not rely on any such knowledge.

However, if the adaptation data is scarce, the transformation based approach may be more effective than the Bayesian adaptation approach to adapt the HMMs. According to the transformation based approach, a transformation, e.g., a shift or an affine transformation, is defined in an acoustic feature space, also known as an "HMM parameter space," to explore correlations between different HMMs, and such correlations help adapt the HMMs despite the scarcity of the adaptation data. Parameters characterizing the transformation are estimated using the adaptation data. In implementing the transformation based approach, it is desirable to divide the acoustic feature space into a number of subspaces and estimate the transformation parameters in each subspace. However, the performance of speech recognition using the transformation based approach does not significantly improve with an increasing amount of adaptation data, as any improvement is restricted by the limited number of variable transformation parameters used in the approach.

An attempt to combine the Bayesian adaptation approach with the transformation based approach to improve the speech recognition performance has been made. This attempt is described in: Chien et al., "Improved Bayesian Learning of Hidden Markov Models for Speaker Adaptation," ICASSP-97, 1997, pp. 1027-1030. However, the success of such an attempt relies on the requirement that the number of subspaces in the acoustic feature space be optimized for various amounts of adaptation data, which is usually impractical.

Accordingly, there exists a need for combining the Bayesian adaptation approach with the transformation based approach in a feasible manner to improve the speech recognition performance, regardless of the amount of available adaptation data.
`
SUMMARY OF THE INVENTION

In accordance with the invention, the HMMs in a speech recognition system are adapted during a learning phase using a "structural maximum a posteriori (SMAP)" approach. According to this inventive approach, a hierarchical structure, e.g., a tree structure, having multiple levels is constructed in an HMM parameter space. Such a structure is characterized by transformation parameters associated with different levels of the structure. The transformation parameters associated with a level represent prior knowledge, referred to as "hierarchical priors," for the transformation parameters associated with the levels therebeneath. The transformation parameters associated with each level of the structure are estimated based on the adaptation data, and hierarchical priors from the levels thereabove. The HMM parameters are updated based on the transformation parameters associated with the lowest level of the structure, which thus are a function of at least the transformation parameters associated with a level other than the lowest level.

Advantageously, given an amount of available adaptation data, the inventive SMAP approach effectively combines the Bayesian adaptation approach, characterized by use of the aforementioned hierarchical priors, with the transformation based approach, characterized by use of the aforementioned hierarchical structure, to synergistically effect the adaptation of HMMs.
`
`BRIEF DESCRIPTION OF THE DRAWING
`
`In the drawing,
`FIG. 1 is a block diagram of a speech recognition system
`in accordance with the invention;
FIG. 2 illustrates a hierarchical structure whereby HMMs
in the system of FIG. 1 are adapted during its learning phase;
and
`
`FIG. 3 is a flow chart depicting an algorithm for con-
`structing the hierarchical structure of FIG. 2.
`DETAILED DESCRIPTION
`
`FIG. 1 illustrates speech recognition system 100 embody-
`ing the principles of the invention. As shown in FIG. 1,
`
`
`3
`4
`to feed the series of recognized words from recognizer 117
`system 100 includes a numberof functional blocks including
`analog-to-digital (A/D) convertor 103, feature extractor 105,
`back to processor 125, and the input speech corresponding
`end-point detector 113, speech recognizer 117, word model
`to the recognized words. According to the unsupervised
`processor 125, switch 139 and delay element 141. The
`learning, the input speech is uncontrolled and the user is not
`functionality of each block of system 100 may be performed
`restricted to speak only certain words as in supervised
`by a respective different processor, or the functionality of
`learning. In the learning mode, delay element 141 imparts a
`several or all blocks may be performed by the same pro-
`proper amount of delay to the input of recognizer 117
`cessor. Furthermore, each stage can include multiple pro-
`representing the input speech to synchronize it with the
`cessing elements. The stages are pipelined and their opera-
`corresponding recognized words. Processor 125 uses the
`tions are performed in a synchronous manner.
`input of recognizer 117 and the corresponding recognized
`Specifically, input speech including a sequence of spoken
`words as “adaptation data” to adapt the CDHMMstherein.
`words is provided, through a microphone (not shown), to
`In accordance with the invention, the HMMsin a speech
`A/D convertor 103 in system 100. Convertor 103 in a
`recognition system,e.g., system 100, are adapted during the
`conventional manner samples the input speech. The digital
`learning phase using a “structural maximumaposteriori
`samples are then provided to feature extractor 105.
`15
`(SMAP)”approach. According to this inventive approach, a
`Upon receiving the digital samples, feature extractor 105
`hierarchical structure, e.g., a tree structure, having multiple
`organizes the received samples in speech framesof, say, 20
`levels is constructed in an HMM parameter space, also
`ms long, and derives for each frame a measure of energy in
`known as an “acoustic feature space.” Such a structure is
`the frame and a set of short spectrum vectors, e.g., linear-
`characterized by transformation parameters associated with
`predictive-coding (LPC) parameters. In this instance,
`the
`different levels of the structure. The transformation param-
`LPC parameters specify a spectrum of an all-pole model
`eters associated with a level represent prior knowledge,
`which best matchesthe signal spectrum overa period of time
`referred to as “hierarchical priors,” for the transformation
`in which the frame of speech samples are accumulated.
`parameters associated with the levels beneath, or subordi-
`Based on the LPC parameters, extractor 105 produces a
`nate to, that level. The transformation parameters associated
`feature vector per frame, which comprises, for example,
`with each level of the structure are estimated based on the
`twelve cepstral features, twelve delta-cepstral features and a
`adaptation data, and aggregate hierarchical priors from the
`delta-energy feature. In a well-known manner,these cepstral
`levels thereabove. The estimated transformation parameters
`and delta-cepstral features characterize the spectrum andits
`associated with the lowest level of the hierarchical structure
`time variation of the speech frame. The delta-energy feature
`are used to update the HMM parameters in the learning
`indicates an amount of change in energy in the speech frame
`phase. As a result, such updated HMM parameters are a
`from the previous frame.
`function of a combination of all of the transformation
`End-point detector 113 of conventional design uses the
`delta energy feature, in conjunction with the energy measure
`by feature extractor 105, to determine the beginning and end
`of a speech signal. It then passes data signals containing the
`25 features in each feature vector onto speech recognizer
`117 of conventional design, along with any end-point deter-
`minations. Based on such data signals and word models
`provided by word model processor 125 in accordance with
`the invention, recognizer 117 determines what the spoken
`wordsare.
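By way of illustration only, the following minimal Python sketch mimics such a front end: 20 ms frames and 25 features per frame (twelve cepstral features, twelve delta-cepstral features and a delta-energy feature). The sampling rate, the FFT-based cepstrum (the patent derives its cepstra from LPC parameters) and all names are assumptions, not part of the patent.

```python
import numpy as np

def frame_signal(samples, rate=8000, frame_ms=20):
    """Split digitized speech into fixed-length frames (assumed rate)."""
    samples = np.asarray(samples, dtype=float)
    n = int(rate * frame_ms / 1000)
    usable = len(samples) - len(samples) % n
    return samples[:usable].reshape(-1, n)

def features_per_frame(frames, n_cepstra=12):
    """Assemble 25 features per frame: 12 cepstra, 12 delta-cepstra,
    and a delta-energy feature, as in the description above."""
    spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)
    ceps = np.fft.irfft(spec, axis=1)[:, 1:n_cepstra + 1]   # cepstral features
    energy = np.log((frames ** 2).sum(axis=1) + 1e-8)       # per-frame energy
    d_ceps = np.diff(ceps, axis=0, prepend=ceps[:1])        # delta-cepstra
    d_energy = np.diff(energy, prepend=energy[:1])          # delta-energy
    return np.hstack([ceps, d_ceps, d_energy[:, None]])     # shape (T, 25)
```

For a waveform `samples`, `features_per_frame(frame_signal(samples))` yields the per-frame vectors that, in the description above, end-point detector 113 and recognizer 117 consume.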
`
`20
`
`25
`
`30
`
`35
`
`40
`
Processor 125, among other things, provides hidden Markov models (HMMs), e.g., continuous density (CD) HMMs in this instance, for various spoken words. Based on the Viterbi algorithm, recognizer 117 identifies an optimum sequence of CDHMMs which best matches, in a maximum likelihood sense, a respective concatenation of phonemes corresponding to an unknown spoken utterance. Such an identification process is realized by dynamic programming in a conventional manner.

It should be noted at this point that system 100 operates in two modes, namely, a learning mode and a normal operation mode. In the normal operation mode, which is the current mode, switch 139 is set at a first position to relay a series of recognized words from recognizer 117 to the output of system 100.

During the manufacture of system 100, all of the model parameters of the CDHMMs in processor 125 are predefined in a conventional manner using training data representing samples of speech from many speakers. However, when a user initially uses system 100 for speech recognition, system 100 undergoes the learning mode where the CDHMMs are adapted to the user's input speech to adjust to his/her particular speech characteristics, thereby further increasing the recognition accuracy of system 100.
`
`45
`
`50
`
`55
`
`60
`
`65
`
In implementing unsupervised learning in this instance, switch 139 is set at a second position in the learning mode to feed the series of recognized words from recognizer 117 back to processor 125, and the input speech corresponding to the recognized words. According to the unsupervised learning, the input speech is uncontrolled and the user is not restricted to speak only certain words as in supervised learning. In the learning mode, delay element 141 imparts a proper amount of delay to the input of recognizer 117 representing the input speech to synchronize it with the corresponding recognized words. Processor 125 uses the input of recognizer 117 and the corresponding recognized words as "adaptation data" to adapt the CDHMMs therein.

In accordance with the invention, the HMMs in a speech recognition system, e.g., system 100, are adapted during the learning phase using a "structural maximum a posteriori (SMAP)" approach. According to this inventive approach, a hierarchical structure, e.g., a tree structure, having multiple levels is constructed in an HMM parameter space, also known as an "acoustic feature space." Such a structure is characterized by transformation parameters associated with different levels of the structure. The transformation parameters associated with a level represent prior knowledge, referred to as "hierarchical priors," for the transformation parameters associated with the levels beneath, or subordinate to, that level. The transformation parameters associated with each level of the structure are estimated based on the adaptation data, and aggregate hierarchical priors from the levels thereabove. The estimated transformation parameters associated with the lowest level of the hierarchical structure are used to update the HMM parameters in the learning phase. As a result, such updated HMM parameters are a function of a combination of all of the transformation parameters connected to the structure, which are weighted according to the respective levels associated therewith. The weight for each level varies with the amount of adaptation data used. The updated HMM parameters are used by processor 125 in system 100 to provide the HMMs, in particular, CDHMMs, during its normal operation mode for speech recognition as described before.

In order to fully appreciate the adaptation of HMMs in accordance with the invention, a normalization technique which simplifies the adaptation will now be described. Let G = {g_m; m = 1, ..., M} be the set of all mixture components of the CDHMMs in processor 125, where M represents the total number of mixture components in all the states of all the CDHMMs in processor 125, and g_m represents a normal density function for a mixture component m. This normal density function is denoted N(X|mu_m, S_m), where mu_m represents a mean vector and S_m represents a covariance matrix.

Let X = {x_1, ..., x_t, ..., x_T} denote a set of T given observation vectors, which represents the adaptation data. In the normalization, each sample vector x_t is transformed into a vector y_mt for each mixture component m as follows:

    y_mt = S_m^(-1/2) (x_t - mu_m),        [1]

where t = 1, ..., T and m = 1, ..., M.
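As an illustration of expression [1], a minimal Python sketch follows; the symmetric inverse square root via eigendecomposition is one admissible choice of S_m^(-1/2), which the patent does not fix:

```python
import numpy as np

def normalize_observations(X, mu_m, S_m):
    """Expression [1]: y_mt = S_m^(-1/2) (x_t - mu_m) for one component m.
    X: (T, d) adaptation vectors; mu_m: (d,); S_m: (d, d) covariance."""
    vals, vecs = np.linalg.eigh(S_m)
    S_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T   # symmetric S_m^(-1/2)
    return (X - mu_m) @ S_inv_half                       # row t is y_mt
```

Under a perfect match between training and adaptation data, the rows returned for the "true" component are standard-normal, which is exactly the property the mismatch pdf below measures departures from.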
As mentioned before, all of the CDHMM parameters here have been predefined in a conventional manner using training data derived from samples of speech from many speakers during the manufacture of system 100. Where there is no mismatch between the training data and adaptation data, it is apparent that the probability density function (pdf) for Y_m = {y_m1, ..., y_mt, ..., y_mT} may be represented by the standard normal distribution N(Y|0, I), where 0 represents a vector whose components have zero values, and I represents an identity matrix.
`
`
`
`
`6,151,574
`
`5
`an identity matrix. Otherwise, where there is a mismatch
`between the training data and adaptation data, the pdf for Y,,
`[6]
`T Mp
`Yeoero)lol 7)
`may be generically represented by N(Ylv, 4), where v~O
`y2) —¥,)(y\ —%,)°
`and yzI. A “mismatch pdf”is defined below to represent the
`differences between the acoustic characteristics depicted by
`the training data and those depicted by the adaptation data.
`It can be shown that the number of mismatch pdfs required
`to model the acoustic differences is smaller than that of the
`
`t=1 m=11REPIE
`
`6
`-continued
`
`where(y,,,-V,) represents the transposeof (y,,;”-V,).
`Using these mismatch pdf parameters, the corresponding
`HMMparameters can be updated as follows:
`
`[7]
`[8]
`
`1pPigSOW
`
`the updated mean and
`
`5,0=80,
`where Hy? and §,™ represent
`covariance, respectively.
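A sketch of expressions [5] through [8] under stated assumptions: the posteriors gamma_mt^(p) are taken as given (in practice they come from the EM computation above), covariances are full matrices, and a Cholesky factor serves as one admissible square root (S_m^(p))^(1/2):

```python
import numpy as np

def ml_mismatch_estimates(Y, gamma):
    """ML estimates of one mismatch pdf, expressions [5] and [6].
    Y: (M_p, T, d) normalized vectors y_mt; gamma: (M_p, T) posteriors."""
    total = gamma.sum()
    nu = np.einsum('mt,mtd->d', gamma, Y) / total                 # [5]
    diff = Y - nu
    eta = np.einsum('mt,mtd,mte->de', gamma, diff, diff) / total  # [6]
    return nu, eta

def update_hmm_params(mu, S, nu, eta):
    """Fold the mismatch pdf back into one component, expressions [7]/[8]."""
    S_half = np.linalg.cholesky(S)      # one admissible S^(1/2)
    return mu + S_half @ nu, S_half @ eta @ S_half.T
```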
The optimal number of mismatch pdfs for deriving the corresponding HMMs varies with the amount of available adaptation data. We have recognized that a methodology which utilizes the mismatch pdf parameters as transformation parameters for constructing the aforementioned hierarchical structure, and which takes advantage of both its entire structure and sub-structures, is preferable to achieve a good speech recognition performance with any given amount of adaptation data. To that end, a tree structure for the set G is realized in accordance with the invention.
FIG. 2 illustrates one such tree structure, denoted 200. As shown in FIG. 2, tree structure 200 has H levels, where H is an integer greater than zero. The first level of structure 200 consists of root node 203, which represents the set G containing all M mixture components of the CDHMMs. The Hth level of structure 200 consists of M leaf nodes representing the M mixture components of the CDHMMs, respectively. Each intermediate node or non-terminal node between the first level and the Hth level represents a subset of G, i.e., G_p, containing M_p mixture components respectively represented by those M_p leaf nodes or terminal nodes on the Hth level which are subordinate to the intermediate node. A leaf node is said to be subordinate to an intermediate node if the leaf node "emanates" from that intermediate node, or in other words its origin can be traced back to that intermediate node through intervening paths. For example, in structure 200, the leaf nodes which are subordinate to intermediate node 222 are leaf nodes 231, 233 and 235, as they are connected to node 222 through paths 224, 226 and 228, respectively. Similarly, the leaf nodes which are subordinate to intermediate node 211 are leaf nodes 231, 233, 235, 237, 239, 241, 243, 245 and 247. Of course, root node 203 has each leaf node on the Hth level subordinate thereto. The methodology for constructing structure 200 is described in detail below.
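A structure such as tree 200 can be carried in an ordinary recursive record; the sketch below (field names assumed) also shows how the leaf nodes subordinate to a node (e.g., leaf nodes 231, 233 and 235 under node 222) can be enumerated:

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    """One node of a tree like structure 200; `components` holds the
    indices of the mixture components in G represented by the node."""
    components: list
    children: list = field(default_factory=list)

    def leaves(self):
        """All leaf nodes subordinate to this node."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]
```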
For each node N in tree structure 200, a mismatch pdf, which is shared by the mixture components in the subset G_p represented by the node, is assigned thereto. Specifically, for each node N, the ML estimates of the mismatch pdf parameters, nu~_p and eta~_p, are determined using the adaptation data based on expressions [5] and [6].
Without loss of generality, let's estimate the parameter set theta_m = (mu_m, S_m) for each mixture component m in G. For the sake of convenience, it should be noted that in the ensuing discussion, the subscript "m" of each variable associated with the mixture component m is dropped except where a confusion may arise without such a subscript. In tree structure 200, each sequence of connected nodes with one node from each level, including root node 203 and a leaf node, corresponds to a mixture component represented by the leaf node.
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`
`
`
`6,151,574
`
`7
Let the sequence of connected nodes corresponding to the mixture component m be denoted {N_1, ..., N_k, ..., N_H}, where N_1 denotes root node 203; N_k denotes one of the intermediate nodes in the sequence; and N_H denotes the leaf node representing the mixture component m.
Let lambda_k = (nu_k, eta_k) represent the set of mismatch pdf parameters associated with node N_k. The parameter set lambda_H is determined by maximizing the posterior probabilities described below, given a sequence of feature vectors, Y. It should be noted that once lambda_H is determined, the HMM parameter set theta is readily obtained using expressions [7] and [8] above.

In accordance with the inventive SMAP approach, the set {lambda_0, lambda_1, ..., lambda_k, ..., lambda_{H-1}} is used as the aforementioned "hierarchical priors" for lambda_H, where lambda_0 = (0, I). The mismatch pdf parameter set lambda_0 is assumed to be the prior for the parameter set lambda_1 associated with root node 203, and the parameter set lambda_k associated with node N_k is used as the prior for the parameter set lambda_{k+1} associated with its immediately subordinate node or child node, N_{k+1}. The posteriori distribution for lambda_H may be expressed as follows:

    p(lambda_H|Y) = INT ... INT p(lambda_H|lambda_{H-1}, Y) ... p(lambda_k|lambda_{k-1}, Y) ... p(lambda_1|lambda_0, Y) p(lambda_0) d_lambda_0 d_lambda_1 ... d_lambda_{H-1},        [9]

where

    p(lambda_k|lambda_{k-1}, Y) = p(Y|lambda_k) p(lambda_k|lambda_{k-1}) / INT p(Y|lambda_k) p(lambda_k|lambda_{k-1}) d_lambda_k        [10]

and k = H, ..., 1. Two approximations are employed here to simplify the evaluation of the posteriori distribution of [9]. The first approximation, using the probability given by the maximum a posteriori (MAP) estimates, is expressed as follows:

    INT p(lambda_1|lambda_0, Y) p(lambda_0) d_lambda_0 ~ p(lambda_1|lambda-_0, Y);        [11]

    INT p(lambda_k|lambda_{k-1}, Y) p(lambda_{k-1}|lambda-_{k-2}, Y) d_lambda_{k-1} ~ p(lambda_k|lambda-_{k-1}, Y),        [12]

where

    lambda-_0 = lambda_0;        [13]

    lambda-_k = argmax_lambda p(lambda|lambda-_{k-1}, Y) = argmax_lambda p(Y|lambda) p(lambda|lambda-_{k-1}).        [14]

The second approximation is expressed as follows:

    p(Y|lambda_k) ~ N(nu~_k|nu_k, eta_k/Gamma_k).        [15]

With these approximations, the MAP estimates of the mismatch pdf parameters for each node N_k can be determined as follows:

    nu_0 = 0;        [16]

    eta_0 = I;        [17]

    nu_k = (Gamma_k nu~_k + w nu_{k-1}) / (Gamma_k + w);        [18]

    eta_k = [ (epsilon - 1) eta_{k-1} + Gamma_k eta~_k + (Gamma_k w / (Gamma_k + w)) (nu~_k - nu_{k-1})(nu~_k - nu_{k-1})^T ] / (Gamma_k + epsilon - 1);        [19]

    Gamma_k = SUM_{t=1..T} SUM_{m in G_k} gamma_mt,        [20]
`
`10
`
`15
`
`20
`
`25
`
where G_k is a subset of G represented by node N_k, and (nu~_k, eta~_k) represents the ML estimates for (nu_k, eta_k). The parameters w > 0 and epsilon > 1 are standard control parameters. By successively applying expressions [18] and [19] from the root node N_1 to a leaf node N_H along the node sequence, the mean nu_H and the variance eta_H for the leaf node N_H are obtained. These nu_H and eta_H are used as nu_p and eta_p, respectively, in expressions [7] and [8] to update the mixture components, in accordance with the inventive SMAP approach.

It should be noted that expression [18] can be rewritten for the leaf node N_H as follows:

    nu_H = SUM_{k=1..H} w_k nu~_k,        [21]

where

    w_k = (Gamma_k / (Gamma_k + w)) PROD_{j=k+1..H} (w / (Gamma_j + w)).        [22]
`
From expression [21], the mean estimate nu_H in accordance with the SMAP approach can be envisioned as the weighted sum of the ML estimates nu~_k at different levels of tree structure 200. It should also be noted that at node N_k, the more adaptation data available, the larger is Gamma_k, and thus w_k. In addition, the weight w_k for nu~_k at node N_k decreases exponentially with the value of k. Thus, when the amount of available adaptation data is relatively small, the ML estimates nu~_k corresponding to upper levels of tree structure 200 dominate in the mean estimate of [21]. On the other hand, when the amount of available adaptation data is relatively large, the ML estimates nu~_k corresponding to lower levels thereof dominate in such a mean estimate.
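The recursion of expressions [16] through [19] is short enough to state directly; the sketch below walks one root-to-leaf sequence N_1, ..., N_H, assuming each node carries its ML estimates from expressions [5]/[6] and its occupancy Gamma_k from expression [20] (nu_ml, eta_ml and gamma are assumed names):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class NodeStats:
    nu_ml: np.ndarray   # ML mean estimate at this node (expression [5])
    eta_ml: np.ndarray  # ML covariance estimate at this node (expression [6])
    gamma: float        # occupancy Gamma_k (expression [20])

def smap_estimates(node_path, w=1.0, eps=2.0):
    """Expressions [16]-[19] along one root-to-leaf node sequence;
    returns (nu_H, eta_H), used as nu_p and eta_p in [7] and [8]."""
    d = node_path[0].nu_ml.shape[0]
    nu, eta = np.zeros(d), np.eye(d)                  # [16], [17]: lambda_0 = (0, I)
    for nd in node_path:                              # k = 1, ..., H
        g = nd.gamma
        nu_next = (g * nd.nu_ml + w * nu) / (g + w)   # [18]
        diff = (nd.nu_ml - nu)[:, None]
        eta = ((eps - 1.0) * eta + g * nd.eta_ml
               + (g * w / (g + w)) * (diff @ diff.T)) / (g + eps - 1.0)  # [19]
        nu = nu_next
    return nu, eta
```

Nodes with little adaptation data (small gamma) leave nu and eta close to their priors from the level above, which is precisely the weighted-sum behavior described by expressions [21] and [22].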
`
The methodology for construction of tree structure 200 will now be described. The a priori knowledge about the embedded structure in the acoustic feature space should be used for the construction of tree structure 200 for the set of all mixture components, G. The tree construction is based on the methodology described, e.g., in: T. Watanabe et al., "Speech Recognition Using Tree-Structured Probability Density Function," Proc. of ICSLP-94, 1994, pp. 223-226. In this tree construction, the well known Kullback divergence between pdfs of the mixture components is used as the distance measure between the mixture components. Accordingly, the distance d(m,n) between two mixture components, say, g_m and g_n, is determined as follows:
`
`45
`
`50
`
`55
`
`[13]
`
`[14]
`
`[15]
`
`With these approximations, the MAP estimates of the mis-
`match pdf parameters for each node N, can be determined as
`follows:
`
`Vo=0
`
`No=l
`
`Vk
`
`LV + WM
`Ty + We
`
`[16]
`
`[17]
`
`[18]
`
`65
`
`
    d(m, n) = INT (g_m(x) - g_n(x)) log (g_m(x) / g_n(x)) dx
            = SUM_i [ (sigma_m(i)^2 + (mu_m(i) - mu_n(i))^2) / sigma_n(i)^2 + (sigma_n(i)^2 + (mu_n(i) - mu_m(i))^2) / sigma_m(i)^2 ],        [23]

where sigma_m(i)^2 represents the i-th diagonal element of covariance S_m. A node pdf is assigned to each node of tree structure 200.
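For the diagonal covariances used here, expression [23] reduces to a few lines; var_m and var_n are assumed to hold the diagonal elements sigma_m(i)^2 and sigma_n(i)^2:

```python
import numpy as np

def kullback_distance(mu_m, var_m, mu_n, var_n):
    """Distance of expression [23] between mixture components g_m and g_n."""
    gap = (mu_m - mu_n) ** 2
    return float(np.sum((var_m + gap) / var_n + (var_n + gap) / var_m))
```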
`
`
`
`
`6,151,574
`
`9
The pdf parameters for node N_k, to which mixture components g_m = N(X|mu_m, S_m), m = 1, ..., M_k, belong, are given as follows:

    mu_k(i) = (1/M_k) SUM_{m=1..M_k} mu_m(i),        [24]

    sigma_k(i)^2 = (1/M_k) SUM_{m=1..M_k} [ sigma_m(i)^2 + (mu_m(i) - mu_k(i))^2 ].        [25]
`
FIG. 3 illustrates conventional k-means clustering algorithm 300 for clustering the mixture components. According to this algorithm, the number of branches from each node and the number of levels in tree structure 200 are predetermined, and the clustering is recursively carried out.
`
`10
`
`15
`
`20
`
`25
`
`30
`
`40
`
`45
`
`50
`
`1. Apparatus for recognizing selected speech based on
`acoustic models, the apparatus comprising:
`a processor responsive to selected data representing a
`sample of the selected speech for modifying the acous-
`tic models; and
`a mechanism for defining a hierarchical structure which
`includes a plurality of levels, each level having one or
`more nodesthereon, each level in the structure arranged
`higher than every other level having a smaller number
`of nodes thereon than the other level, each node being
`associated with a probability measure which is deter-
`mined based onat least the selected data, each node on
`each level having two or more nodes thereon being
`
`55
`
`60
`
`65
`
Specifically, root node 203 is initially set to be node n, as indicated at step 305. At step 308, it is determined whether node n has any child nodes. If node n has no child node, algorithm 300 comes to an end.

Otherwise, if node n has one or more child nodes, initial pdfs are assigned to the child nodes, respectively, using the well known min-max method, as indicated at step 311. At step 315, the set of mixture components represented by node n is divided among the child nodes by calculating the distances between the child node pdfs and the mixture components using expression [23]. Each mixture component is assigned to a child node whose pdf is closest to the mixture component. A pdf is then calculated at step 318 for each child node of node n based on expressions [24] and [25]. For each child node, the sum of the distances from the child node pdf to the mixture components represented by the child node is computed. The respective sums of distances for all of the child nodes of node n are added, resulting in a grand sum. It is then determined whether such a grand sum converges, as indicated at step 321. If the grand sum does not converge, algorithm 300 returns to step 315. Otherwise, if it converges, each child node is set to be a node n, as indicated at step 322. For each new node n, algorithm 300 returns to step 308.
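Pulling these pieces together, the sketch below follows steps 308 through 322 of algorithm 300, reusing kullback_distance, node_pdf and TreeNode from the earlier sketches. Two liberties are taken and should be read as assumptions: the initial child pdfs are seeded from the first components of the node rather than by the min-max method of step 311 (which the patent does not spell out), and the branching factor is a free parameter.

```python
import numpy as np

def cluster_node(node, mus, vars_, n_branches=2, tol=1e-4, max_iter=50):
    """Recursive k-means clustering in the manner of algorithm 300."""
    if len(node.components) <= 1:                # step 308: nothing to split
        return
    idx = np.array(node.components)
    k = min(n_branches, len(idx))
    cent_mu = mus[idx[:k]].copy()                # step 311 (stand-in seeding)
    cent_var = vars_[idx[:k]].copy()
    prev = np.inf
    for _ in range(max_iter):
        # step 315: assign each component to the closest child pdf under [23]
        d = np.array([[kullback_distance(mus[m], vars_[m], cent_mu[c], cent_var[c])
                       for c in range(k)] for m in idx])
        assign = d.argmin(axis=1)
        # step 318: recompute each child pdf via expressions [24] and [25]
        for c in range(k):
            members = idx[assign == c]
            if len(members):
                cent_mu[c], cent_var[c] = node_pdf(mus[members], vars_[members])
        grand = d[np.arange(len(idx)), assign].sum()   # step 321: grand sum
        if abs(prev - grand) < tol:
            break
        prev = grand
    nonempty = [c for c in range(k) if (assign == c).any()]
    if len(nonempty) < 2:                        # degenerate split: stop here
        return
    node.children = [TreeNode(list(idx[assign == c])) for c in nonempty]
    for child in node.children:                  # step 322: recurse on children
        cluster_node(child, mus, vars_, n_branches, tol, max_iter)
```

Starting from cluster_node(TreeNode(list(range(M))), mus, vars_), the recursion bottoms out at single-component nodes, so the leaves correspond one-to-one with the M mixture components, as on the Hth level of structure 200.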
The foregoing merely illustrates the principles of the invention. It will thus be appreciated that a person skilled in the art will be able to devise numerous systems which, although not explicitly shown or described herein, embody the principles of the invention and are thus within its spirit and scope.

What is claimed is:

1. Apparatus for recognizing selected speech based on acoustic models, the apparatus comprising:

a processor responsive to selected data representing a sample of the selected speech for modifying the acoustic models; and

a mechanism for defining a hierarchical structure which includes a plurality of levels, each level having one or more nodes thereon, each level in the structure arranged higher than every other level having a smaller number of nodes thereon than the other level, each node being associated with a probability measure which is determined based on at least the selected data, each node on each level having two or more nodes thereon being

10. The apparatus of claim 8 wherein the set represented by each node on a level other than the selected level includes the sets represented by the nodes on the selected level which are connected to the node.