US006151574A

United States Patent                              [11] Patent Number:     6,151,574
Lee et al.                                        [45] Date of Patent:    Nov. 21, 2000

[54] TECHNIQUE FOR ADAPTATION OF HIDDEN MARKOV MODELS FOR SPEECH RECOGNITION

[75] Inventors: Chin-Hui Lee, Basking Ridge, N.J.; Koichi Shinoda, Chiba, Japan

[73] Assignee: Lucent Technologies Inc., Murray Hill, N.J.

[21] Appl. No.: 09/149,782

[22] Filed: Sep. 8, 1998

OTHER PUBLICATIONS

J. Gauvain et al., "Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, Apr. 1994, pp. 291-298.

C. Lee et al., "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models," IEEE Transactions on Signal Processing, vol. 39, no. 4, Apr. 1991, pp. 806-814.

Boulianne et al., "Optimal Tying of HMM Densities Using Decision Trees," Spoken Language, Oct. 1996.

Miglietta, "Bayesian Adaptation of Speech Recognizers to Field Speech Data," Oct. 1996.

Jen-Tzung Chien, Chin-Hui Lee and Hsiao-Chuan Wang, "A Hybrid Algorithm for Speaker Adaptation Using MAP Transformation and Adaptation," XP-002130508, IEEE Signal Processing Letters, vol. 4, no. 6, Jun. 1997, pp. 167-169.

(List continued on next page.)
`
Related U.S. Application Data

[60] Provisional application No. 60/067,822, Dec. 5, 1997.

[51] Int. Cl. ................................. G10L 15/14
[52] U.S. Cl. ................................ 704/256; 704/243
[58] Field of Search ......................... 704/243, 244

Primary Examiner—Krista Zele
Assistant Examiner—Michael N. Opsasnick
`
[56] References Cited

U.S. PATENT DOCUMENTS

5,657,424    8/1997   Farrell et al.
5,737,487    4/1998   Bellegarda et al.
5,787,394    7/1998   Bahl et al.
5,794,197    8/1998   Alleva et al.
5,797,123    8/1998   Chou et al.
5,857,169    1/1999   Seide
5,907,825    5/1999   Tzirkel-Hancock
5,912,989    6/1999   Watanabe
5,933,806    8/1999   Beyerlein et al.
5,956,676    9/1999   Shinoda
5,960,395    9/1999   Tzirkel-Hancock
5,983,180   11/1999   Robinson

OTHER PUBLICATIONS

J. Chien et al., "Improved Bayesian Learning of Hidden Markov Models for Speaker Adaptation," Proc. ICASSP-97, 1997, pp. 1027-1030.

K. Shinoda et al., "Speaker Adaptation with Autonomous Model Complexity Control by MDL Principle," Proc. ICASSP-96, 1996, pp. 717-720.

[57] ABSTRACT

A speech recognition system learns characteristics of speech by a user during a learning phase to improve its performance. Adaptation data derived from the user's speech and its recognized result is collected during the learning phase. Parameters characterizing hidden Markov models (HMMs) used in the system for speech recognition are modified based on the adaptation data. To that end, a hierarchical structure is defined in an HMM parameter space. This structure may assume the form of a tree structure having multiple layers, each of which includes one or more nodes. Each node on each layer is connected to at least one node on another layer. The nodes on the lowest layer of the tree structure are referred to as "leaf nodes." Each node in the tree structure represents a subset of the HMM parameters, and is associated with a probability measure which is derived from the adaptation data. In particular, each leaf node represents a different one of the HMM parameters, which is derivable from the probability measure associated with the leaf node. This probability measure is a function of the probability measures which are associated with the nodes connected to the leaf node, and which represent "hierarchical priors" to such a probability measure.

38 Claims, 3 Drawing Sheets
`
[Representative drawing: FIG. 3 — flow chart of tree-construction algorithm 300 (see Sheet 3 of 3).]
6,151,574
Page 2

OTHER PUBLICATIONS

Carmelo Giammarco Miglietta, Chafic Mokbel, Denis Jouvet and Jean Monne, "Bayesian Adaptation of Speech Recognizers to Field Speech Data," XP-002130507, ICSLP 96, Oct. 1996, pp. 917-920, Lannion cedex, France.

Douglas B. Paul, "Extensions to Phone-State Decision-Tree Clustering: Single Tree and Tagged Clustering," XP-000822740, pp. 1487-1490, Apr. 21, 1997, Newton, Massachusetts, USA.

Koichi Shinoda and Chin-Hui Lee, "Structural MAP Speaker Adaptation Using Hierarchical Priors," XP-002130506, pp. 381-388, Dec. 1997, Murray Hill, New Jersey, USA.

[FIG. 1 (Sheet 1 of 3): block diagram of speech recognition system 100, showing the A/D convertor, feature extractor, end-point detector, speech recognizer, word model processor, and delay element.]

[FIG. 2 (Sheet 2 of 3): tree structure 200, showing its 1st level, k-th level and H-th level, with leaf nodes 231, 233, 235, 237, 239, 241, 243, 245 and 247.]

[FIG. 3 (Sheet 3 of 3): flow chart of tree-construction algorithm 300 — assign initial PDFs to child nodes; divide the set of mixture components represented by node n among the child nodes; calculate a PDF for each child node; test whether the sum of distances between child node PDFs and mixture components converges; set each child node to node n; repeat for each node n.]

TECHNIQUE FOR ADAPTATION OF HIDDEN MARKOV MODELS FOR SPEECH RECOGNITION

This application claims the benefit of U.S. Provisional Application No. 60/067,822 filed Dec. 5, 1997.

FIELD OF THE INVENTION

The invention relates to speech recognition systems and methods, and more particularly to systems and methods for recognizing speech based on hidden Markov models (HMMs), which are adapted to acoustic inputs during a learning phase of speech recognition.

BACKGROUND OF THE INVENTION

In communication, data processing and similar systems, a user interface using audio facilities is often advantageous, especially when it is anticipated that the user would be physically engaged in an activity (e.g., driving a car) while he/she is operating one such system. Techniques for recognizing human speech in such systems to perform certain tasks have been developed.

In accordance with one such technique, input speech is analyzed in signal frames, represented by feature vectors corresponding to phonemes making up individual words. The phonemes are characterized by hidden Markov models (HMMs), and a Viterbi algorithm is used to identify a sequence of HMMs which best matches, in a maximum likelihood sense, a respective concatenation of phonemes corresponding to an unknown, spoken utterance.

It is well known that each HMM comprises model parameters, e.g., mixture components which are characterized by Gaussian distributions. In a learning phase in a speech recognition system, the HMMs are adapted to input speech by a user to adjust to the particular speech characteristics of the user, thereby increasing accuracy of the speech recognition. In prior art, two well known approaches for adaptation of HMMs, namely, the Bayesian adaptation approach and the transformation based approach, have been employed.

According to the Bayesian adaptation approach, prior distributions are assumed for the model parameters in HMMs, and the maximum a posteriori (MAP) estimates for the model parameters are calculated. For details on this approach, one may refer to: C. Lee et al., "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models," IEEE Transactions on Signal Processing, Vol. 39, No. 4, April 1991, pp. 806-814; and J. Gauvain et al., "Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 2, 1994, pp. 291-298. Since the Bayesian adaptation approach utilizes the MAP estimates based on knowledge of prior distributions, it requires less input speech data for the adaptation than, e.g., one utilizing maximum likelihood (ML) estimates which does not rely on any such knowledge.

However, if the adaptation data is scarce, the transformation based approach may be more effective than the Bayesian adaptation approach to adapt the HMMs. According to the transformation based approach, a transformation, e.g., a shift or an affine transformation, is defined in an acoustic feature space, also known as an "HMM parameter space," to explore correlations between different HMMs, and such correlations help adapt the HMMs despite the scarcity of the adaptation data. Parameters characterizing the transformation are estimated using the adaptation data. In implementing the transformation based approach, it is desirable to divide the acoustic feature space into a number of subspaces and estimate the transformation parameters in each subspace. However, the performance of speech recognition using the transformation based approach does not significantly improve with an increasing amount of adaptation data, as any improvement is restricted by the limited number of variable transformation parameters used in the approach.

An attempt to combine the Bayesian adaptation approach with the transformation based approach to improve the speech recognition performance has been made. This attempt is described in: Chien et al., "Improved Bayesian Learning of Hidden Markov Models for Speaker Adaptation," ICASSP-97, 1997, pp. 1027-1030. However, the success of such an attempt relies on the requirement that the number of subspaces in the acoustic feature space be optimized for various amounts of adaptation data, which is usually impractical.

Accordingly, there exists a need for combining the Bayesian adaptation approach with the transformation based approach in a feasible manner to improve the speech recognition performance, regardless of the amount of available adaptation data.

SUMMARY OF THE INVENTION

In accordance with the invention, the HMMs in a speech recognition system are adapted during a learning phase using a "structural maximum a posteriori (SMAP)" approach. According to this inventive approach, a hierarchical structure, e.g., a tree structure, having multiple levels is constructed in an HMM parameter space. Such a structure is characterized by transformation parameters associated with different levels of the structure. The transformation parameters associated with a level represent prior knowledge, referred to as "hierarchical priors," for the transformation parameters associated with the levels therebeneath. The transformation parameters associated with each level of the structure are estimated based on the adaptation data, and hierarchical priors from the levels thereabove. The HMM parameters are updated based on the transformation parameters associated with the lowest level of the structure, which thus are a function of at least the transformation parameters associated with a level other than the lowest level.

Advantageously, given an amount of available adaptation data, the inventive SMAP approach effectively combines the Bayesian adaptation approach, characterized by use of the aforementioned hierarchical priors, with the transformation based approach, characterized by use of the aforementioned hierarchical structure, to synergistically effect the adaptation of HMMs.

BRIEF DESCRIPTION OF THE DRAWING

In the drawing,

FIG. 1 is a block diagram of a speech recognition system in accordance with the invention;

FIG. 2 illustrates a hierarchical structure whereby HMMs in the system of FIG. 1 are adapted during its learning phase; and

FIG. 3 is a flow chart depicting an algorithm for constructing the hierarchical structure of FIG. 2.

DETAILED DESCRIPTION

FIG. 1 illustrates speech recognition system 100 embodying the principles of the invention.
`

As shown in FIG. 1, system 100 includes a number of functional blocks including analog-to-digital (A/D) convertor 103, feature extractor 105, end-point detector 113, speech recognizer 117, word model processor 125, switch 139 and delay element 141. The functionality of each block of system 100 may be performed by a respective different processor, or the functionality of several or all blocks may be performed by the same processor. Furthermore, each stage can include multiple processing elements. The stages are pipelined and their operations are performed in a synchronous manner.

Specifically, input speech including a sequence of spoken words is provided, through a microphone (not shown), to A/D convertor 103 in system 100. Convertor 103 in a conventional manner samples the input speech. The digital samples are then provided to feature extractor 105.

Upon receiving the digital samples, feature extractor 105 organizes the received samples in speech frames of, say, 20 ms long, and derives for each frame a measure of energy in the frame and a set of short spectrum vectors, e.g., linear-predictive-coding (LPC) parameters. In this instance, the LPC parameters specify a spectrum of an all-pole model which best matches the signal spectrum over a period of time in which the frame of speech samples are accumulated. Based on the LPC parameters, extractor 105 produces a feature vector per frame, which comprises, for example, twelve cepstral features, twelve delta-cepstral features and a delta-energy feature. In a well-known manner, these cepstral and delta-cepstral features characterize the spectrum and its time variation of the speech frame. The delta-energy feature indicates an amount of change in energy in the speech frame from the previous frame.
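For concreteness, the following minimal sketch assembles per-frame feature vectors of the kind just described (twelve cepstral features, twelve delta-cepstral features and a delta-energy feature). The array layout, the function name, and the use of simple first-order differences for the "delta" features are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def assemble_feature_vectors(cepstra, energies):
    """Build 25-dimensional feature vectors per 20 ms frame:
    12 cepstral features, 12 delta-cepstral features and one delta-energy feature.

    cepstra  : (num_frames, 12) cepstral coefficients
    energies : (num_frames,) per-frame energy measures
    """
    # First-order differences stand in for the "delta" features here.
    delta_cep = np.diff(cepstra, axis=0, prepend=cepstra[:1])
    delta_energy = np.diff(energies, prepend=energies[:1])
    return np.hstack([cepstra, delta_cep, delta_energy[:, None]])
```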
End-point detector 113 of conventional design uses the delta-energy feature, in conjunction with the energy measure by feature extractor 105, to determine the beginning and end of a speech signal. It then passes data signals containing the 25 features in each feature vector onto speech recognizer 117 of conventional design, along with any end-point determinations. Based on such data signals and word models provided by word model processor 125 in accordance with the invention, recognizer 117 determines what the spoken words are.
`
`
Processor 125, among other things, provides hidden Markov models (HMMs), e.g., continuous density (CD) HMMs in this instance, for various spoken words. Based on the Viterbi algorithm, recognizer 117 identifies an optimum sequence of CDHMMs which best matches, in a maximum likelihood sense, a respective concatenation of phonemes corresponding to an unknown spoken utterance. Such an identification process is realized by dynamic programming in a conventional manner.
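As a reminder of what such a dynamic-programming search does, here is a generic textbook Viterbi decoder over log-probabilities; it is not the patent's recognizer, and the interface is assumed purely for illustration.

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Find the maximum-likelihood state sequence of an HMM.

    log_init  : (S,)   log initial state probabilities
    log_trans : (S, S) log transition probabilities
    log_obs   : (T, S) log observation likelihoods per frame and state
    """
    T, S = log_obs.shape
    score = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans              # candidate predecessors
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + log_obs[t]
    path = [int(np.argmax(score))]                     # backtrace the best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```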
`
It should be noted at this point that system 100 operates in two modes, namely, a learning mode and a normal operation mode. In the normal operation mode, which is the current mode, switch 139 is set at a first position to relay a series of recognized words from recognizer 117 to the output of system 100.

During the manufacture of system 100, all of the model parameters of the CDHMMs in processor 125 are predefined in a conventional manner using training data representing samples of speech from many speakers. However, when a user initially uses system 100 for speech recognition, system 100 undergoes the learning mode where the CDHMMs are adapted to the user's input speech to adjust to his/her particular speech characteristics, thereby further increasing the recognition accuracy of system 100.
In implementing unsupervised learning in this instance, switch 139 is set at a second position in the learning mode to feed the series of recognized words from recognizer 117 back to processor 125, along with the input speech corresponding to the recognized words. According to the unsupervised learning, the input speech is uncontrolled and the user is not restricted to speak only certain words as in supervised learning. In the learning mode, delay element 141 imparts a proper amount of delay to the input of recognizer 117 representing the input speech to synchronize it with the corresponding recognized words. Processor 125 uses the input of recognizer 117 and the corresponding recognized words as "adaptation data" to adapt the CDHMMs therein.

In accordance with the invention, the HMMs in a speech recognition system, e.g., system 100, are adapted during the learning phase using a "structural maximum a posteriori (SMAP)" approach. According to this inventive approach, a hierarchical structure, e.g., a tree structure, having multiple levels is constructed in an HMM parameter space, also known as an "acoustic feature space." Such a structure is characterized by transformation parameters associated with different levels of the structure. The transformation parameters associated with a level represent prior knowledge, referred to as "hierarchical priors," for the transformation parameters associated with the levels beneath, or subordinate to, that level. The transformation parameters associated with each level of the structure are estimated based on the adaptation data, and aggregate hierarchical priors from the levels thereabove. The estimated transformation parameters associated with the lowest level of the hierarchical structure are used to update the HMM parameters in the learning phase. As a result, such updated HMM parameters are a function of a combination of all of the transformation parameters connected to the structure, which are weighted according to the respective levels associated therewith. The weight for each level varies with the amount of adaptation data used. The updated HMM parameters are used by processor 125 in system 100 to provide the HMMs, in particular, CDHMMs, during its normal operation mode for speech recognition as described before.

In order to fully appreciate the adaptation of HMMs in accordance with the invention, a normalization technique which simplifies the adaptation will now be described. Let G = {g_m; m = 1, ..., M} be the set of all mixture components of the CDHMMs in processor 125, where M represents the total number of mixture components in all the states of all the CDHMMs in processor 125, and g_m represents a normal density function for a mixture component m. This normal density function is denoted N(X | \mu_m, S_m), where \mu_m represents a mean vector and S_m represents a covariance matrix.

Let X = {x_1, ..., x_t, ..., x_T} denote a set of T given observation vectors, which represents the adaptation data. In the normalization, each sample vector x_t is transformed into a vector y_{mt} for each mixture component m as follows:

    y_{mt} = S_m^{-1/2} (x_t - \mu_m),    [1]

where t = 1, ..., T and m = 1, ..., M.

As mentioned before, all of the CDHMM parameters here have been predefined in a conventional manner using training data derived from samples of speech from many speakers during the manufacture of system 100. Where there is no mismatch between the training data and the adaptation data, it is apparent that the probability density function (pdf) for Y_m = {y_{m1}, ..., y_{mt}, ..., y_{mT}} may be represented by the standard normal distribution N(Y | O, I), where O represents a vector whose components have zero values, and I represents an identity matrix.

Otherwise, where there is a mismatch between the training data and the adaptation data, the pdf for Y_m may be generically represented by N(Y | \nu, \eta), where \nu \neq O and \eta \neq I. A "mismatch pdf" is defined below to represent the differences between the acoustic characteristics depicted by the training data and those depicted by the adaptation data. It can be shown that the number of mismatch pdfs required to model the acoustic differences is smaller than that of the pdfs for mixture components of HMMs.
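Before turning to how mismatch pdfs are shared across subsets of components, here is a rough sketch of the normalization in expression [1]; the function name and the diagonal-covariance assumption are simplifications made only for brevity.

```python
import numpy as np

def normalize_observations(X, mu_m, s_m_diag):
    """Expression [1]: y_mt = S_m^(-1/2) (x_t - mu_m) for one mixture component m.

    X        : (T, d) adaptation observation vectors x_1 ... x_T
    mu_m     : (d,)   mean vector of mixture component m
    s_m_diag : (d,)   diagonal of the covariance S_m (diagonal case assumed)

    With no mismatch between training and adaptation data, the returned Y_m
    is distributed approximately as the standard normal N(0, I).
    """
    return (X - mu_m) / np.sqrt(s_m_diag)
```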
`

Thus, to reduce the number of required pdfs, the set of mixture components, G, is divided into more than one subset G_p, 1 <= p <= P, where P is the total number of subsets, which is smaller than the total number of mixture components, i.e., M. A common mismatch pdf h_p = N(Y | \nu_p, \eta_p) is shared by all the mixture components in each subset G_p. In the following discussion, each mixture component g in subset G_p is renumbered as g_1^{(p)}, ..., g_m^{(p)}, ..., g_{M_p}^{(p)}, where M_p represents the number of mixture components in subset G_p. The observed vector sequence X = {x_1, ..., x_T} is normalized to Y_m^{(p)} = {y_{m1}^{(p)}, ..., y_{mT}^{(p)}} for mixture component g_m^{(p)}.

The parameters for the mismatch pdfs are estimated using the well known estimation-maximization (EM) algorithm. When the transition probabilities and the mixture component weight coefficients are specified, the auxiliary function Q for the HMM parameters is given by:

    Q(\theta, \hat{\theta}) = \sum_{p=1}^{P} \sum_{t=1}^{T} \sum_{m=1}^{M_p} \gamma_{mt}^{(p)} \log \hat{g}_m^{(p)}(x_t \mid \hat{\mu}_m^{(p)}, \hat{S}_m^{(p)}),    [2]

where \theta = {\mu_m^{(p)}, S_m^{(p)}; m = 1, ..., M_p, and p = 1, ..., P} represents the current HMM parameter set; \hat{\theta} = {\hat{\mu}_m^{(p)}, \hat{S}_m^{(p)}; m = 1, ..., M_p, and p = 1, ..., P} represents the HMM parameter set to be estimated; and \gamma_{mt}^{(p)} represents the posterior probability of using mixture component g_m^{(p)} at time t. The relation between the HMM parameters and the mismatch pdf parameters is established as follows:

    \hat{g}_m^{(p)}(x \mid \hat{\mu}_m^{(p)}, \hat{S}_m^{(p)}) = |J_m^{(p)}|\, h_p(y \mid \nu_p, \eta_p) = \frac{h_p(y \mid \nu_p, \eta_p)}{\left|\left(S_m^{(p)}\right)^{1/2}\right|},    [3]

where J_m^{(p)} = (S_m^{(p)})^{-1/2} represents the Jacobian matrix for the normalization expressed in [1]. Based on expression [3], the auxiliary function may be modified as follows:

    Q(\theta, \hat{\theta}) = \sum_{p=1}^{P} \sum_{m=1}^{M_p} \sum_{t=1}^{T} \gamma_{mt}^{(p)} \log \frac{h_p(y_{mt}^{(p)} \mid \nu_p, \eta_p)}{\left|\left(S_m^{(p)}\right)^{1/2}\right|}.    [4]

By differentiating function [4], the maximum likelihood (ML) estimates of the mismatch pdf parameters can be defined as follows:

    \tilde{\nu}_p = \frac{\sum_{t=1}^{T} \sum_{m=1}^{M_p} \gamma_{mt}^{(p)}\, y_{mt}^{(p)}}{\sum_{t=1}^{T} \sum_{m=1}^{M_p} \gamma_{mt}^{(p)}},    [5]

    \tilde{\eta}_p = \frac{\sum_{t=1}^{T} \sum_{m=1}^{M_p} \gamma_{mt}^{(p)} \left(y_{mt}^{(p)} - \tilde{\nu}_p\right)\left(y_{mt}^{(p)} - \tilde{\nu}_p\right)^T}{\sum_{t=1}^{T} \sum_{m=1}^{M_p} \gamma_{mt}^{(p)}},    [6]

where (y_{mt}^{(p)} - \tilde{\nu}_p)^T represents the transpose of (y_{mt}^{(p)} - \tilde{\nu}_p).

Using these mismatch pdf parameters, the corresponding HMM parameters can be updated as follows:

    \hat{\mu}_m^{(p)} = \mu_m^{(p)} + \left(S_m^{(p)}\right)^{1/2} \nu_p,    [7]

    \hat{S}_m^{(p)} = \left(S_m^{(p)}\right)^{1/2} \eta_p \left(S_m^{(p)}\right)^{1/2},    [8]

where \hat{\mu}_m^{(p)} and \hat{S}_m^{(p)} represent the updated mean and covariance, respectively.

The optimal number of mismatch pdfs for deriving the corresponding HMMs varies with the amount of available adaptation data. We have recognized that a methodology which utilizes the mismatch pdf parameters as transformation parameters for constructing the aforementioned hierarchical structure, and which takes advantage of both its entire structure and its sub-structures, is preferable to achieve a good speech recognition performance with any given amount of adaptation data. To that end, a tree structure for the set G is realized in accordance with the invention.

FIG. 2 illustrates one such tree structure, denoted 200. As shown in FIG. 2, tree structure 200 has H levels, where H is an integer greater than zero. The first level of structure 200 consists of root node 203, which represents the set G containing all M mixture components of the CDHMMs. The H-th level of structure 200 consists of M leaf nodes representing the M mixture components of the CDHMMs, respectively. Each intermediate node or non-terminal node between the first level and the H-th level represents a subset of G, i.e., G_k, containing M_k mixture components respectively represented by those M_k leaf nodes or terminal nodes on the H-th level which are subordinate to the intermediate node. A leaf node is said to be subordinate to an intermediate node if the leaf node "emanates" from that intermediate node, or in other words its origin can be traced back to that intermediate node through intervening paths. For example, in structure 200, the leaf nodes which are subordinate to intermediate node 222 are leaf nodes 231, 233 and 235, as they are connected to node 222 through paths 224, 226 and 228, respectively. Similarly, the leaf nodes which are subordinate to intermediate node 211 are leaf nodes 231, 233, 235, 237, 239, 241, 243, 245 and 247. Of course, root node 203 has each leaf node on the H-th level subordinate thereto. The methodology for constructing structure 200 is described in detail below.

For each node N in tree structure 200, a mismatch pdf, which is shared by the mixture components in the subset G_k represented by the node, is assigned thereto. Specifically, for each node N, the ML estimates of the mismatch pdf parameters, \tilde{\nu}_k and \tilde{\eta}_k, are determined using the adaptation data based on expressions [5] and [6].
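To make expressions [5]-[8] concrete, here is a minimal NumPy sketch of the per-node (or per-subset) ML estimation of the mismatch pdf and of the corresponding parameter update; the array shapes and function names are assumptions made for illustration, and full covariances are used for generality.

```python
import numpy as np

def mismatch_ml_estimates(Y, gamma):
    """Expressions [5] and [6]: ML estimates of a mismatch pdf for one subset/node.

    Y     : (T, M, d) normalized vectors y_mt from expression [1]
    gamma : (T, M)    posterior occupancies gamma_mt
    Returns (nu, eta): mean vector (d,) and covariance matrix (d, d).
    """
    total = gamma.sum()
    nu = np.einsum('tm,tmd->d', gamma, Y) / total
    diff = Y - nu
    eta = np.einsum('tm,tmd,tme->de', gamma, diff, diff) / total
    return nu, eta

def update_hmm_parameters(mu_m, S_m_sqrt, nu, eta):
    """Expressions [7] and [8]: map the mismatch pdf back to a mixture component.

    S_m_sqrt is S_m^(1/2), e.g. np.diag(np.sqrt(s)) for a diagonal covariance.
    """
    mu_new = mu_m + S_m_sqrt @ nu
    S_new = S_m_sqrt @ eta @ S_m_sqrt
    return mu_new, S_new
```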
`

Without loss of generality, let us estimate the parameter set \theta_m = (\mu_m, S_m) for each mixture component m in G. For the sake of convenience, it should be noted that in the ensuing discussion, the subscript "m" of each variable associated with the mixture component m is dropped, except where a confusion may arise without such a subscript. In tree structure 200, each sequence of connected nodes with one node from each level, including root node 203 and a leaf node, corresponds to a mixture component represented by the leaf node. Let the sequence of connected nodes corresponding to the mixture component m be denoted {N_1, ..., N_k, ..., N_H}, where N_1 denotes root node 203; N_k denotes one of the intermediate nodes in the sequence; and N_H denotes the leaf node representing the mixture component m.

Let \lambda_k = (\nu_k, \eta_k) represent the set of mismatch pdf parameters associated with node N_k. The parameter set \lambda_H is determined by maximizing the posterior probabilities described below, given a sequence of feature vectors, Y. It should be noted that once \lambda_H is determined, the HMM parameter set \theta is readily obtained using expressions [7] and [8] above.

In accordance with the inventive SMAP approach, the set {\lambda_0, \lambda_1, ..., \lambda_k, ..., \lambda_H} is used as the aforementioned "hierarchical priors" for \lambda_H, where \lambda_0 = (O, I). The mismatch pdf parameter set \lambda_0 is assumed to be the prior for the parameter set \lambda_1 associated with root node 203, and the parameter set \lambda_k associated with node N_k is used as the prior for the parameter set \lambda_{k+1} associated with its immediately subordinate node or child node, N_{k+1}.

The posteriori distribution for \lambda_H may be expressed as follows:

    p(\lambda_H \mid Y) = \int \cdots \int p(\lambda_H \mid \lambda_{H-1}, Y) \cdots p(\lambda_k \mid \lambda_{k-1}, Y) \cdots p(\lambda_1 \mid \lambda_0, Y)\, d\lambda_{H-1} \cdots d\lambda_0,    [9]

where

    p(\lambda_k \mid \lambda_{k-1}, Y) = \frac{p(Y \mid \lambda_k)\, p(\lambda_k \mid \lambda_{k-1})}{\int p(Y \mid \lambda_k)\, p(\lambda_k \mid \lambda_{k-1})\, d\lambda_k},    [10]

and where k = H, ..., 1. Two approximations are employed here to simplify the evaluation of the posteriori distribution of [9]. The first approximation, using the probability given by the maximum a posteriori (MAP) estimates, is expressed as follows:

    \int p(\lambda_1 \mid \lambda_0, Y)\, p(\lambda_0)\, d\lambda_0 \approx p(\lambda_1 \mid \bar{\lambda}_0, Y),    [11]

    \int p(\lambda_k \mid \lambda_{k-1}, Y)\, p(\lambda_{k-1} \mid \bar{\lambda}_{k-2}, Y)\, d\lambda_{k-1} \approx p(\lambda_k \mid \bar{\lambda}_{k-1}, Y),    [12]

where

    \bar{\lambda}_0 = \lambda_0,    [13]

    \bar{\lambda}_k = \arg\max_{\lambda_k} p(\lambda_k \mid \bar{\lambda}_{k-1}, Y) = \arg\max_{\lambda_k} p(Y \mid \lambda_k)\, p(\lambda_k \mid \bar{\lambda}_{k-1}).    [14]

The second approximation is expressed as follows:

    p(Y \mid \lambda_k) \approx N(\tilde{\nu}_k \mid \lambda_k).    [15]

With these approximations, the MAP estimates of the mismatch pdf parameters for each node N_k can be determined as follows:

    \bar{\nu}_0 = O,    [16]

    \bar{\eta}_0 = I,    [17]

    \bar{\nu}_k = \frac{\Gamma_k \tilde{\nu}_k + w\, \bar{\nu}_{k-1}}{\Gamma_k + w},    [18]

    \bar{\eta}_k = \frac{\xi\, \bar{\eta}_{k-1} + \Gamma_k \tilde{\eta}_k + \frac{\Gamma_k w}{\Gamma_k + w} \left(\tilde{\nu}_k - \bar{\nu}_{k-1}\right)\left(\tilde{\nu}_k - \bar{\nu}_{k-1}\right)^T}{\Gamma_k + \xi},    [19]

    \Gamma_k = \sum_{t=1}^{T} \sum_{m \in G_k} \gamma_{mt},    [20]

where G_k is the subset of G represented by node N_k, and (\tilde{\nu}_k, \tilde{\eta}_k) represents the ML estimates for (\nu_k, \eta_k). The parameters w > 0 and \xi > 1 are standard control parameters. By successively applying expressions [18] and [19] from the root node N_1 to a leaf node N_H along the node sequence, the mean \bar{\nu}_H and the variance \bar{\eta}_H for the leaf node N_H are obtained. These \bar{\nu}_H and \bar{\eta}_H are used as \nu_p and \eta_p, respectively, in expressions [7] and [8] to update the mixture components, in accordance with the inventive SMAP approach.

It should be noted that expression [18] can be rewritten for the leaf node N_H as follows:

    \bar{\nu}_H = \sum_{k=1}^{H} w_k \tilde{\nu}_k,    [21]

where

    w_k = \frac{\Gamma_k}{\Gamma_k + w} \prod_{i=k+1}^{H} \frac{w}{\Gamma_i + w}.    [22]

From expression [21], the mean estimate \bar{\nu}_H in accordance with the SMAP approach can be envisioned as the weighted sum of the ML estimates \tilde{\nu}_k at the different levels of tree structure 200. It should also be noted that at node N_k, the more adaptation data available, the larger is \Gamma_k, and thus w_k. In addition, the weight w_k for \tilde{\nu}_k at node N_k decreases exponentially with the value of k. Thus, when the amount of available adaptation data is relatively small, the ML estimates \tilde{\nu}_k corresponding to upper levels of tree structure 200 dominate in the mean estimate of [21]. On the other hand, when the amount of available adaptation data is relatively large, the ML estimates \tilde{\nu}_k corresponding to lower levels thereof dominate in such a mean estimate.
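The recursion of expressions [16]-[19] can be sketched as follows. The covariance update mirrors the reconstruction of expression [19] given above and should be read as an approximation rather than a verbatim implementation of the patent's formula; the data layout and the default values of the control parameters w and xi are likewise assumptions.

```python
import numpy as np

def smap_estimate_along_path(path_stats, w=1.0, xi=2.0):
    """Propagate MAP estimates of the mismatch pdf from root to leaf.

    path_stats : list of (Gamma_k, nu_ml_k, eta_ml_k) tuples, ordered from the
                 root node N_1 to the leaf node N_H, where Gamma_k is the
                 occupancy count of expression [20] and (nu_ml_k, eta_ml_k)
                 are the node's ML estimates from expressions [5] and [6].
    Returns (nu_bar_H, eta_bar_H), used in expressions [7] and [8].
    """
    dim = len(path_stats[0][1])
    nu_bar, eta_bar = np.zeros(dim), np.eye(dim)   # [16], [17]: lambda_0 = (O, I)
    for Gamma, nu_ml, eta_ml in path_stats:
        nu_prev = nu_bar
        nu_bar = (Gamma * nu_ml + w * nu_prev) / (Gamma + w)               # [18]
        diff = (nu_ml - nu_prev)[:, None]
        eta_bar = (xi * eta_bar + Gamma * eta_ml
                   + (Gamma * w / (Gamma + w)) * (diff @ diff.T)) / (Gamma + xi)  # [19], as reconstructed
    # Little data (small Gamma) keeps the parent's estimate dominant; ample
    # data lets the node's own ML estimate dominate, as the text explains.
    return nu_bar, eta_bar
```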
`

The methodology for construction of tree structure 200 will now be described. The a priori knowledge about the embedded structure in the acoustic feature space should be used for the construction of tree structure 200 for the set of all mixture components, G. The tree construction is based on the methodology described, e.g., in: T. Watanabe et al., "Speech Recognition Using Tree-Structured Probability Density Function," Proc. of ICSLP-94, 1994, pp. 223-226. In this tree construction, a well known Kullback divergence between the pdfs of the mixture components is used as the distance measure between the mixture components. Accordingly, the distance d(m, n) between two mixture components, say, g_m and g_n, is determined as follows:

    d(m, n) = \int \left(g_m(x) - g_n(x)\right) \log \frac{g_m(x)}{g_n(x)}\, dx
            = \sum_{i} \left[ \frac{\sigma_n(i)^2 + \left(\mu_m(i) - \mu_n(i)\right)^2}{\sigma_m(i)^2} + \frac{\sigma_m(i)^2 + \left(\mu_m(i) - \mu_n(i)\right)^2}{\sigma_n(i)^2} \right],    [23]

where \sigma_m(i) represents the i-th diagonal element of covariance S_m. A node pdf is assigned to each node of tree structure 200. The pdf parameters for node N_k, to which the mixture components g_m = N(X | \mu_m, S_m), m = 1, ..., M_k, belong, are given as follows:

    \mu_k(i) = \frac{1}{M_k} \sum_{m=1}^{M_k} \mu_m(i),    [24]

    \sigma_k(i)^2 = \frac{1}{M_k} \sum_{m=1}^{M_k} \left[ \sigma_m(i)^2 + \left(\mu_m(i) - \mu_k(i)\right)^2 \right].    [25]
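The distance of expression [23] and the node pdf of expressions [24] and [25] translate directly into code for diagonal covariances; the helper names below are invented for illustration.

```python
import numpy as np

def pair_distance(mu_m, var_m, mu_n, var_n):
    """Expression [23]: symmetric Kullback-divergence-style distance between two
    diagonal-covariance mixture components."""
    d2 = (mu_m - mu_n) ** 2
    return np.sum((var_n + d2) / var_m + (var_m + d2) / var_n)

def node_pdf(mus, variances):
    """Expressions [24] and [25]: pooled mean and variance of the M_k mixture
    components assigned to a node.

    mus, variances : (M_k, d) arrays of component means and diagonal variances.
    """
    mu_node = mus.mean(axis=0)                                    # [24]
    var_node = (variances + (mus - mu_node) ** 2).mean(axis=0)    # [25]
    return mu_node, var_node
```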
`
FIG. 3 illustrates conventional k-means clustering algorithm 300 for clustering the mixture components. According to this algorithm, the number of branches from each node and the number of levels in tree structure 200 are predetermined, and the clustering is recursively carried out. Specifically, root node 203 is initially set to be node n, as indicated at step 305. At step 308, it is determined whether node n has any child nodes. If node n has no child node, algorithm 300 comes to an end. Otherwise, if node n has one or more child nodes, initial pdfs are assigned to the child nodes, respectively, using the well known min-max method, as indicated at step 311. At step 315, the set of mixture components represented by node n is divided among the child nodes by calculating the distances between the child node pdfs and the mixture components using expression [23]. Each mixture component is assigned to the child node whose pdf is closest to the mixture component. A pdf is then calculated at step 318 for each child node of node n based on expressions [24] and [25]. For each child node, the sum of the distances from the child node pdf to the mixture components represented by the child node is computed. The respective sums of distances for all of the child nodes of node n are added, resulting in a grand sum. It is then determined whether such a grand sum converges, as indicated at step 321. If the grand sum does not converge, algorithm 300 returns to step 315. Otherwise, if it converges, each child node is set to be a node n, as indicated at step 322. For each new node n, algorithm 300 returns to step 308.
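One level of the clustering loop of algorithm 300 might be sketched as follows, reusing the pair_distance and node_pdf helpers above. It is a simplified stand-in: the children are seeded from the first few components rather than by the min-max method, and convergence is tested by a stable assignment instead of the grand sum of distances described in the text.

```python
import numpy as np

def split_node(mus, variances, n_branches, max_iter=50):
    """One split of algorithm 300: divide the mixture components of node n among
    its child nodes (step 315), re-estimate each child's pdf (step 318), and stop
    when the partition stabilizes (step 321).

    mus, variances : (M, d) arrays of the node's mixture-component parameters.
    Returns (assignment, child_mus, child_vars); each child then becomes a new
    node n (step 322) and the procedure repeats (step 308).
    """
    child_mus = mus[:n_branches].copy()
    child_vars = variances[:n_branches].copy()
    assignment = None
    for _ in range(max_iter):
        # Step 315: assign every component to the closest child pdf via [23].
        dist = np.array([[pair_distance(mus[m], variances[m], child_mus[c], child_vars[c])
                          for c in range(len(child_mus))]
                         for m in range(len(mus))])
        new_assignment = dist.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
        # Step 318: recompute each child's pdf from expressions [24] and [25].
        for c in range(len(child_mus)):
            members = assignment == c
            if members.any():
                child_mus[c], child_vars[c] = node_pdf(mus[members], variances[members])
    return assignment, child_mus, child_vars
```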
The foregoing merely illustrates the principles of the invention. It will thus be appreciated that a person skilled in the art will be able to devise numerous systems which, although not explici

1. Apparatus for recognizing selected speech based on acoustic models, the apparatus comprising:
a processor responsive to selected data representing a sample of the selected speech for modifying the acoustic models; and
a mechanism for defining a hierarchical structure which includes a plurality of levels, each level having one or more nodes thereon, each level in the structure arranged higher than every other level having a smaller number of nodes thereon than the other level, each node being associated with a probability measure which is determined based on at least the selected data, each node on each level having two or more nodes thereon being

10. The apparatus of claim 8 wherein the set represented by each node on a level other than the selected level includes the sets represented by the nodes on the selected level which are connected to the node.
