Nadas et al.

(11) Patent Number: 4,926,488
(45) Date of Patent: May 15, 1990

(54) NORMALIZATION OF SPEECH BY ADAPTIVE LABELLING

(75) Inventors: Arthur J. Nadas, Rock Tavern; David Nahamoo, White Plains, both of N.Y.

(73) Assignee: International Business Machines Corporation, Armonk, N.Y.

(21) Appl. No.: 71,687

(22) Filed: Jul. 9, 1987

(51) Int. Cl.: G10L 5/04; G10L 9/16
(52) U.S. Cl.: 381/41; 381/46
(58) Field of Search: 364/513.5; 381/41-50

(56) References Cited

U.S. PATENT DOCUMENTS

2,938,079  5/1960  Flanagan ............... 381/50
3,673,331  6/1972  Hair et al. ............ 381/42
3,770,891 11/1973  Kalfaian ............... 381/42
3,969,698  7/1976  Bollinger et al. ....... 381/43
4,227,046 10/1980  Nakajima et al. ........ 381/47
4,256,924  3/1981  Sakoe .................. 381/43
4,282,403  8/1981  Sakoe .................. 364/513.5
4,292,471  9/1981  Kuhn et al. ............ 381/42
4,394,538  7/1983  Warren et al. .......... 381/43
4,519,094  5/1985  Brown et al. ........... 381/43
4,559,604 12/1985  Ichikawa et al. ........ 364/513.5
4,597,098  6/1986  Noso et al. ............ 381/46
4,601,054  7/1986  Watari et al. .......... 381/43
4,658,426  4/1987  Chabries et al. ........ 381/47
4,718,094  1/1988  Bahl et al. ............ 381/43
4,720,802  1/1988  Damoulakis et al. ...... 364/513.5
4,752,957  6/1988  Maeda .................. 381/42
4,802,224  1/1989  Shiraki et al. ......... 381/41
4,803,729  2/1989  Baker .................. 381/43

OTHER PUBLICATIONS

Paul, "An 800 BPS Adaptive Vector Quantization Vocoder Using a Perceptual Distance Measure", ICASSP '83, Boston, pp. 73-76.
Burton et al., "Isolated-Word Recognition Using Multisection Vector Quantization Codebooks", IEEE Trans. on ASSP, vol. 33, No. 4, Aug. 1985, pp. 837-849.
Sugawara, K., "Method for Making Confusion Matrix by DP Matching", IBM Technical Disclosure Bulletin, vol. 28, No. 11, Apr. 1986, pp. 5401-5402.
Shikano, K., et al., "Speaker Adaptation Through Vector Quantization", ICASSP '86, Tokyo, pp. 2643-2646.
Tappert, C. C., et al., "Fast Training Method for Speech Recognition Systems", IBM Tech. Discl. Bull., vol. 21, No. 8, Jan. 1979, pp. 3413-3414.

Primary Examiner: Gary V. Harkcom
Assistant Examiner: David D. Knepper
Attorney, Agent, or Firm: Marc A. Block; Marc D. Schechter

(57) ABSTRACT

In a speech processor system in which prototype vectors of speech are generated by an acoustic processor under reference noise and known ambient conditions and in which feature vectors of speech are generated during varying noise and other ambient and recording conditions, normalized vectors are generated to reflect the form the feature vectors would have if generated under the reference conditions. The normalized vectors are generated by: (a) applying an operator function Ai to a set of feature vectors x occurring at or before time interval i to yield a normalized vector yi = Ai(x); (b) determining a distance error vector Ei by which the normalized vector is projectively moved toward the closest prototype vector to the normalized vector yi; (c) up-dating the operator function for the next time interval to correspond to the most recently determined distance error vector; and (d) incrementing i to the next time interval and repeating steps (a) through (d), wherein the feature vector corresponding to the incremented i value has the most recent up-dated operator function applied thereto. With successive time intervals, successive normalized vectors are generated based on a successively up-dated operator function. For each normalized vector, the closest prototype thereto is associated therewith. The string of normalized vectors or the string of associated prototypes (or respective label identifiers thereof) or both provide output from the acoustic processor.
`
`8 Claims, 8 Drawing Sheets
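Steps (a) through (d) of the abstract can be sketched in outline as follows. This is an illustrative reconstruction, not the patented implementation: the operator function Ai is modeled here as a per-band offset vector `theta`, the closeness measure as squared Euclidean distance, and the up-date as a fixed-rate gradient step; all three are assumptions, since the patent develops the actual operator function and objective function in the detailed description.

```python
def closest(y, prototypes):
    """Index of the prototype minimizing squared Euclidean distance to y."""
    return min(range(len(prototypes)),
               key=lambda j: sum((yb - pb) ** 2 for yb, pb in zip(y, prototypes[j])))

def adaptive_labeller(features, prototypes, rate=0.2):
    """Outline of steps (a)-(d): normalize each feature vector with an
    operator function whose parameters adapt at every time interval.

    The operator A_i is modeled as addition of an offset vector theta (an
    illustrative assumption); theta is up-dated by a gradient step on the
    squared distance between the normalized vector and its closest prototype.
    """
    theta = [0.0] * len(prototypes[0])          # initialized parameters
    ys, fenemes = [], []
    for x in features:                          # (d) successive time intervals i
        y = [xb + tb for xb, tb in zip(x, theta)]   # (a) y_i = A_i(x)
        j = closest(y, prototypes)                  # (b) closest prototype to y_i
        p = prototypes[j]
        # (c) up-date parameters so the next normalized vector sits nearer
        # its prototype: step along -d/dtheta of |y - p|^2
        theta = [tb - rate * 2.0 * (yb - pb)
                 for tb, yb, pb in zip(theta, y, p)]
        ys.append(y)
        fenemes.append(j)
    return ys, fenemes
```

Both output strings named in the abstract fall out of the loop: `ys` is the string of normalized vectors and `fenemes` the string of associated prototype labels.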
`Amazon / Zentian Limited
`Exhibit 1017
`Page 1
`
`
`
U.S. Patent   May 15, 1990   Sheet 1 of 8   4,926,488

[FIG. 1: speech input to an acoustic processor 102, with output at 104. FIG. 2: the acoustic processor 102 feeding back-end elements including a speech coder and speech synthesizer.]
[FIG. 3 (Sheet 2 of 8):
PROTOTYPE SPACE = {P1, P2, ..., P200}
INPUT FEATURE VECTORS = {X1, X2, X3, X4, X5, ...}
OUTPUT FEATURE VECTORS = {X1, X2, X3, X4, X5, ...}
FENEME STRING = P11, P11, P11, P11, P11, ...]
[FIG. 4 (Sheet 3 of 8):
PROTOTYPE SPACE = {P1, P2, ..., P200}
INPUT FEATURE VECTORS = {X1, X2, X3, X4, X5, ...}
OUTPUT FEATURE VECTORS = {Y1, Y2, Y3, Y4, Y5, ...}
FENEME STRING = P11, P11, P3, P3, P56, ...]
[FIG. 5 (Sheet 4 of 8): acoustic processor 200 — microphone 202, pre-amp 204, filter 206, amplifier 208, A/D convertor 210, FFT and filter bank processor 212, adaptive labeller processor (outputs: normalized output vectors and adapted fenemes), clustering operator processor, and prototype memory 216.]
[FIG. 6 (Sheet 5 of 8): adaptive labeller 300 — input vector features X enter via counter 302; an initial parameter memory feeds a parameter memory through switch 306; an FIR filter produces the normalized output vector, which feeds a distance calculator (supplied from the prototype memory), a minimum selector (prototype output), a derivative calculator, and a first-order FIR filter.]
[FIGS. 7 and 9 (Sheet 6 of 8): FIG. 7 — distance calculator chain: adder 400, squarer 402, accumulator 404. FIG. 9 — derivative calculator element 422.]
[FIG. 10 (Sheet 7 of 8): flow diagram — speech input; initialize parameters (502); prepare feature vectors (504); perform normalization (506); find closest prototype (508); calculate closest distance derivative with respect to normalization parameters (510); update the normalization parameters (512). Outputs: normalized output vectors Y and adapted fenemes.]
NORMALIZATION OF SPEECH BY ADAPTIVE LABELLING

BACKGROUND OF THE INVENTION

I. Field of the Invention

In general, the present invention relates to speech processing (such as speech recognition). In particular, the invention relates to apparatus and method for characterizing speech as a string of spectral vectors and/or labels representing predefined prototype vectors of speech.

II. Description of the Problem

In speech processing, speech is generally represented by an n-dimensional space in which each dimension corresponds to some prescribed acoustic feature. For example, each component may represent an amplitude of energy in a respective frequency band. For a given time interval of speech, each component will have a respective amplitude. Taken together, the n amplitudes for the given time interval represent an n-component vector in the n-dimensional space.

Based on a known sample text uttered during a training period, the n-dimensional space is divided into a fixed number of regions by some clustering algorithm. Each region represents sounds of a common prescribed type: sounds having component values which are within regional bounds. For each region, a prototype vector is defined to represent the region.

The prototype vectors are defined and stored for later processing. When an unknown speech input is uttered, for each time interval, a value is measured or computed for each of the n components, where each component is referred to as a feature. The values of all of the features are consolidated to form an n-component feature vector for a time interval.

In some instances, the feature vectors are used in subsequent processing. In other instances, each feature vector is associated with one of the predefined prototype vectors and the associated prototype vectors are used in subsequent processing.

In associating prototype vectors with feature vectors, the feature vector for each time interval is typically compared to each prototype vector. Based on a predefined closeness measure, the distance between the feature vector and each prototype vector is determined and the closest prototype vector is selected.

A speech type of event, such as a word or a phoneme, is characterized by a sequence of feature vectors in the time period over which the speech event was produced. Some prior art accounts for temporal variations in the generation of feature vector sequences. These variations may result from differences in speech between speakers or for a single speaker speaking at different times. The temporal variations are addressed by a process referred to as time warping in which time periods are stretched or shrunk so that the time period of a feature vector sequence conforms to the time period of a reference prototype vector sequence, called a template. Oftentimes, the resultant feature vector sequence is styled as a "time normalized" feature vector sequence.

Because feature vectors or prototype vectors (or representations thereof) associated with the feature vectors or both are used in subsequent speech processing, the proper characterization of the feature vectors and proper selection of the closest prototype vector for each feature vector is critical.

The relationship between a feature vector and the prototype vectors has normally, in the past, been static; there has been a fixed set of prototype vectors and a feature vector based on the values of set features. However, due to ambient noise, signal drift, changes in the speech production of the talker, differences between talkers, or a combination of these, signal traits may vary over time. That is, the acoustic traits of the training data from which the prototype vectors are derived may differ from the acoustic traits of the data from which the test or new feature vectors are derived. The fit of the prototype vectors to the new data traits is normally not as good as to the original training data. This affects the relationship between the prototype vectors and later-generated feature vectors, which results in a degradation of performance in the speech processor.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide apparatus and method for adapting feature vectors in order to account for noise and other ambient conditions as well as intra- and inter-speaker variations which cause the speech data traits from which feature vectors are derived to vary from the training data traits from which the prototypes are derived.

In particular, each feature vector xi generated at a time interval i is transformed into a normalized vector yi according to the expression:

yi = Ai(x)

where x is a set of one or more feature vectors at or before time interval i and where Ai is an operator function which includes a number of parameters. According to the invention, the values of the parameters in the operator function are up-dated so that the vector yi (at a time interval i) is more informative than the feature vector xi (at a time interval i) with respect to the representation of the acoustic space characterized by an existing set of prototypes. That is, the transformed vectors yi more closely correlate to the training data upon which the prototype vectors are based than do the feature vectors xi.

Generally, the invention includes transforming a feature vector xi to a normalized vector yi according to an operator function; determining the closest prototype vector for yi; altering the operator function in a manner which would move yi closer to the closest prototype thereto; and applying the altered operator function to the next feature vector in the transforming thereof to a normalized vector. Stated more specifically, the present invention provides that parameters of the operator function be first initialized. The operator function A0 at the first time interval i=0 is defined with the initialized parameters and is applied to a first vector x0 to produce a transformed vector y0. For y0, the closest prototype vector is selected based on an objective closeness function D. The objective function D is in terms of the parameters used in the operator function. Optimizing the function D with respect to the various parameters (e.g., determining, with a "hill-climbing" approach, a value for each parameter at which the closeness function is maximum), up-dated values for the parameters are determined and incorporated into the operator function for the next time interval i=1. The adapted opera-
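The static closest-prototype selection described in the Background above can be sketched as follows. The Euclidean closeness measure is an assumption for illustration, since the text leaves the predefined closeness measure open.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def label_sequence(feature_vectors, prototypes):
    """Static prior-art labelling: map each feature vector to the index of
    its closest prototype vector; the deviation distance is discarded."""
    return [min(range(len(prototypes)), key=lambda j: dist(v, prototypes[j]))
            for v in feature_vectors]
```

For example, `label_sequence([[1.0, 0.0], [4.0, 5.0]], [[0.0, 0.0], [5.0, 5.0]])` yields `[0, 1]`: each vector is assigned to whichever prototype it lies nearest, regardless of how large the deviation is.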
tor function A1 is applied to the next feature vector x1 to produce a normalized vector y1. For the normalized vector y1, the closest prototype vector is selected. The objective function D is again optimized with respect to the various parameters to determine up-dated values for the parameters. The operator function A2 is then defined in terms of the up-dated parameter values.

With each successive feature vector, the operator function parameters are up-dated from the previous values thereof.

In accordance with the invention, the following improved outputs are generated. One output corresponds to "normalized" vectors yi. Another output corresponds to respective prototype vectors (or label representations thereof) associated with the normalized vectors.

When a speech processor receives continuously normalized vectors yi as input rather than the raw feature vectors xi, the degradation of performance is reduced. Similarly, for those speech processors which receive successive prototype vectors from a fixed set of prototype vectors and/or label representations as input, performance is improved when the input prototype vectors are selected based on the transformed vectors rather than raw feature vectors.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a general block diagram of a speech processing system.

FIG. 2 is a general block diagram of a speech processing system with designated back ends.

FIG. 3 is a drawing illustrating acoustic space partitioned into regions, where each region has a representative prototype included therein. Feature vectors are also shown, each being associated with a "closest" prototype vector.

FIG. 4 is a drawing illustrating acoustic space partitioned into regions, where each region has a representative prototype included therein. Feature vectors are shown transformed according to the present invention into normalized vectors which are each associated with a "closest" prototype vector.

FIG. 5 is a block diagram showing an acoustic processor which embodies the adaptive labeller of the present invention.

FIG. 6 is a block diagram showing a specific embodiment of an adaptive labeller according to the present invention.

FIG. 7 is a diagram of a distance calculator element of FIG. 6.

FIG. 8 is a diagram of a minimum selector element of FIG. 6.

FIG. 9 is a diagram of a derivative calculator element of FIG. 6.

FIG. 10 is a flowchart generally illustrating the steps of adaptive labelling according to the present invention.

FIG. 11 is a specific flowchart illustrating the steps of adaptive labelling according to the present invention.

DESCRIPTION OF THE INVENTION

In FIG. 1, the general diagram for a speech processing system 100 is shown. An acoustic processor 102 receives as input an acoustic speech waveform and converts it into data which a back-end 104 processes for a prescribed purpose. Such purposes are suggested in FIG. 2.

In FIG. 2, the acoustic processor 102 is shown generating output to three different elements. The first element is a speech coder 110. The speech coder 110 alters the form of the data exiting the acoustic processor 102 to provide a coded representation of speech data. The coded data can be transferred more rapidly and can be contained in less storage than the original uncoded data.

The second element receiving input from the acoustic processor 102 is a speech synthesizer 112. In some environments, it is desired to enhance a spoken input by reducing noise which accompanies the speech signal. In such environments, a speech waveform is passed through an acoustic processor 102 and the data therefrom enters a speech synthesizer 112 which provides a speech output with less noise.

The third element corresponds to a speech recognizer 114 which converts the output of the acoustic processor 102 into text format. That is, the output from the acoustic processor 102 is formed into a sequence of words which may be displayed on a screen, processed by a text editor, used in providing commands to machinery, stored for later use in a textual context, or used in some other text-related manner.

Various examples of the three elements are found in the prior technology. In that the present invention is mainly involved with generating input to these various elements, further details are not provided. It is noted, however, that a preferred use of the invention is in conjunction with a "Speech Recognition System" invented by L. Bahl, S. V. DeGennaro, and R. L. Mercer for which a patent application was filed on Mar. 27, 1986 (S.N. 06/845155), now Pat. No. 4,718,094. The earlier filed application is assigned to the IBM Corporation, the assignee of the present application, and is incorporated herein by reference to the extent necessary to provide background disclosure of a speech recognizer which may be employed with the present invention.

At this point, it is noted that the present invention may be used with any speech processing element which receives as input either feature vectors or prototype vectors (or labels representative thereof) associated with feature vectors. By way of explanation, reference is made to FIG. 3. In FIG. 3, speech is represented by an acoustic space. The acoustic space has n dimensions and is partitioned into a plurality of regions (or clusters) by any of various known techniques referred to as "clustering". In the present embodiment, acoustic space is divided into 200 non-overlapping clusters which are preferably Voronoi regions. FIG. 3 is a two-dimensional representation of part of the acoustic space.

For each region in the acoustic space, there is defined a respective, representative n-component prototype vector. In FIG. 3, four of the 200 prototype vectors P5, P11, P3, and P56 are illustrated. Each prototype represents a region which, in turn, may be viewed as a "sound type." Each region, it is noted, contains vector points for which the n components, when taken together, are somewhat similar.

In a first embodiment, the n components correspond to energy amplitudes in n distinct frequency bands. The points in a region represent sounds in which the n frequency band amplitudes are collectively within regional bounds.

Alternatively, in another earlier filed patent application commonly assigned to the IBM Corporation, which is incorporated herein by reference, the n components are based on a model of the human ear. That is, a neural firing rate in the ear is determined for each of n frequency bands; the n neural firing rates serving as
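The clustering step that partitions acoustic space into Voronoi regions is not pinned to a specific algorithm in the text; a toy Lloyd (k-means) iteration is one conventional way to derive representative prototype vectors, sketched here purely for illustration (the naive seeding from the first k training points is also an assumption).

```python
from math import dist  # Euclidean distance (Python 3.8+)

def derive_prototypes(training_points, k, iters=10):
    """Toy Lloyd/k-means clustering: derive k prototype vectors (region
    centroids) from training data. Each point is assigned to the Voronoi
    region of its nearest prototype; prototypes then move to the centroids
    of their regions."""
    protos = [list(p) for p in training_points[:k]]   # naive seeding
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for pt in training_points:        # assign point to its Voronoi region
            j = min(range(k), key=lambda j: dist(pt, protos[j]))
            clusters[j].append(pt)
        for j, members in enumerate(clusters):        # recompute centroids
            if members:
                protos[j] = [sum(c) / len(members) for c in zip(*members)]
    return protos
```

With two well-separated groups of training points, the two derived prototypes settle on the group centroids, and the induced Voronoi regions are non-overlapping by construction.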
the n components which define the acoustic space, the prototype vectors, and feature vectors used in speech recognition. The sound types in this case are defined based on the n neural firing rates, the points in a given region having somewhat similar neural firing rates in the n frequency bands. The prior application, entitled "Nonlinear Signal Processing in a Speech Recognition System", U.S.S.N. 06/665401, was filed on Oct. 26, 1984 and was invented by J. Cohen and R. Bakis.

Referring still to FIG. 3, five feature vectors at respective successive time intervals i=1, i=2, i=3, i=4, and i=5 are shown as X1, X2, X3, X4, and X5, respectively. According to standard prior art methodology, each of the five identified feature vectors would be assigned to the Voronoi region corresponding to the prototype vector P11.

The two selectable outputs for a prior art acoustic processor would be (1) the feature vectors X1, X2, X3, X4, and X5 themselves and (2) the prototypes associated therewith, namely P11, P11, P11, P11, P11, respectively. It is noted that each feature vector X1, X2, X3, X4, and X5 is displaced from the prototype vector P11 by some considerable deviation distance; however the prior technology ignores the deviation distance.

In FIG. 4, the effect underlying the present invention is illustrated. With each feature vector, at least part of the deviation distance is considered in generating more informative vector outputs for subsequent speech coding, speech synthesis, or speech recognition processing. Looking first at feature vector x1, a transformation is formed based on an operator function A1 to produce a transformed normalized vector y1. The operator function is defined in terms of parameters which, at time interval i=1, are initialized so that y1=x1 in the FIG. 4 embodiment; x1 and y1 are directed to the same point.

It is observed that initialization may be set to occur at time interval i=0 or i=1 or at other time intervals depending on convention. In this regard, in FIG. 4 initialization occurs at time interval i=1; in other parts of the description herein initialization occurs at time interval i=0.

Based on a predefined objective function, an error vector E1 is determined. In FIG. 4, E1 is the difference vector of projected movement of y1 in the direction of the closest prototype thereto. (The meaning of "closeness" is discussed hereinbelow.) E1 may be viewed as a determined error vector for the normalized vector y1 at time interval i=1.

Turning next to feature vector x2, it is noted that y2 is determined by simply vectorally adding the E1 error vector to feature vector x2. A projected distance vector of movement of y2 toward the prototype associated therewith (in this case prototype P11) is then computed according to a predefined objective function. The result of adding (1) the computed projected distance vector from y2 onto (2) the error vector E1 (extending from the feature vector x2) is an error vector E2 for time interval i=2. The error vector E2 is shown in FIG. 4 by a dashed line arrow.

Turning next to feature vector x3, the accumulated error vector E2 is shown being added to vector x3 in order to derive the normalized vector y3. Significantly, it is observed that y3 is in the region represented by the prototype P3. A projected move of y3 toward the prototype associated therewith is computed based on an objective function. The result of adding (1) the computed projected distance vector from y3 onto (2) the error vector E2 (extending from the feature vector x3) is a next error vector E3 for time interval i=3. The error vector E3 in effect builds from the projected errors of previous feature vectors.

Referring still to FIG. 4, it is observed that error vector E3 is added to feature vector x4 to provide a transformed normalized vector y4, which is projected a distance toward the prototype associated therewith. y4 is in the region corresponding to prototype P3; the projected move is thus toward prototype vector P3 by a distance computed according to an objective function. Error vector E4 is generated and is applied to feature vector x5 to yield y5. y5 is in the region corresponding to prototype vector P56; the projected move of y5 is thus toward that prototype vector.

In FIG. 4, each feature vector xi is transformed into a normalized vector yi. It is the normalized vectors which serve as one output of the acoustic processor 102, namely y1, y2, y3, y4, y5. Each normalized vector, in turn, has an associated prototype vector. A second output of the acoustic processor 102 is the associated prototype vector for each normalized vector. In the FIG. 4 example, this second type of output would include the prototype vector string P11, P11, P3, P3, P56. Alternatively, assigning each prototype a label (or "feneme") which identifies each prototype vector by a respective number, the second output may be represented by a string such as 11, 11, 3, 3, 56 rather than the vectors themselves.

In FIG. 5, an acoustic processor 200 which embodies the present invention is illustrated. A speech input enters a microphone 202, such as a Crown PZM microphone. The output from the microphone 202 passes through a pre-amplifier 204, such as a Studio Consultants Inc. pre-amplifier, enroute to a filter 206 which operates in the 200 Hz to 8 kHz range. (Precision Filters markets a filter and amplifier which may be used for elements 206 and 208.) The filtered output is amplified in amplifier 208 before being digitized in an A/D convertor 210. The convertor 210 is a 12-bit, 100 kHz analog-to-digital convertor. The digitized output passes through a Fast Fourier Transform (FFT)/Filter Bank Stage 212 (which is preferably an IBM 3081 Processor). The FFT/Filter Bank Stage 212 separates the digitized output of the A/D convertor 210 according to frequency bands. That is, for a given time interval, a value is measured or computed for each frequency band based on a predefined characteristic (e.g., the neural firing rate mentioned hereinabove). The value for each of the frequency bands represents one component of a point in the acoustic space. For 20 frequency bands, the acoustic space has n=20 dimensions and each point has 20 components.

During a training period in which known sounds are uttered, the characteristic(s) for each frequency band is measured or computed at successive time intervals. Based on the points generated during the training period, in response to known speech inputs, acoustic space is divided into regions. Each region is represented by a prototype vector. In the present discussion, a prototype vector is preferably defined as a fully specified probability distribution over the n-dimensional space of possible acoustic vectors.

A clustering operator 214 (e.g., an IBM 3081 processor) determines how the regions are to be defined, based on the training data. The prototype vectors which represent the regions, or clusters, are stored in a memory 216. The memory 216 stores the components of each prototype vector and, preferably, stores a label (or feneme) which uniquely identifies the prototype vector.
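The banding performed by the FFT/Filter Bank Stage can be sketched as follows. This is illustrative only: pooling DFT bins into equal-width bands, and using magnitude-squared energy rather than a neural firing rate, are assumptions, and a naive O(N^2) DFT stands in for the real FFT so the sketch stays self-contained.

```python
from cmath import exp, pi

def band_energies(frame, n_bands):
    """Pool DFT magnitude-squared values of one time interval's samples into
    n_bands components, one feature-vector component per frequency band.
    (Naive O(N^2) DFT for self-containment; a real stage would use an FFT.)"""
    N = len(frame)
    half = N // 2                        # keep non-negative frequencies only
    spectrum = []
    for k in range(half):
        s = sum(frame[t] * exp(-2j * pi * k * t / N) for t in range(N))
        spectrum.append(abs(s) ** 2)     # energy in DFT bin k
    width = half // n_bands              # pool bins into equal-width bands
    return [sum(spectrum[b * width:(b + 1) * width]) for b in range(n_bands)]
```

With n_bands = 20, the returned list plays the role of one 20-component feature vector for the time interval covered by `frame`.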
Preferably, the clustering operator 214 divides the acoustic space into 200 clusters, so that there are 200 prototype vectors which are defined based on the training data. Clustering and storing respective prototypes for the clusters are discussed in prior technology.

During the training period, the FFT/Filter Bank Stage 212 provides data used in clustering and forming prototypes. After the training period, the FFT/Filter Bank Stage 212 provides its output to an adaptive labeller 218 (which preferably comprises an IBM 3081 processor). After the training period and the prototypes are defined and stored, unknown speech inputs (i.e., an unknown acoustic waveform) are uttered into the microphone 202 for processing. The FFT/Filter Bank Stage 212 produces an output for each successive time interval (i=1, 2, 3, ...), the output having a value for each of the n=20 frequency bands. The 20 values, taken together, represent a feature vector. The feature vectors enter the adaptive labeller 218 as a string of input feature vectors.

The other input to the adaptive labeller 218 is from the prototype memory 216. The adaptive labeller 218, in response to an input feature vector, provides as output: (1) a normalized output vector and (2) a label corresponding to the prototype vector associated with a normalized output vector. At each successive time interval, a respective normalized output vector and a corresponding label (or feneme) is output from the adaptive labeller 218.

FIG. 6 is a diagram illustrating a specific embodiment of an adaptive labeller 300 (see labeller 218 of FIG. 5). The input feature vectors xi are shown entering a counter 302. The counter 302 increments with each time interval starting with i=0. At i=0, initial parameters are provided by memory 304 through switch 306 to a parameter storage memory 308. The input feature vector x0 enters an FIR filter 310 together with the stored parameter values. The FIR filter 310 applies the operator function A0 to the input feature vector x0 as discussed hereinabove. (A preferred operator function is outlined in the description hereinbelow.) The normalized output vector y0 from the FIR filter 310 serves as an output of the adaptive labeller 300 and also as an input to distance calculator 312 of the labeller 300. The distance calculator 312 is also connected to the prototype memory (see FIG. 5). The distance calculator 312 computes the distance between each prototype vector

[...] 308 are incorporated into the operator function implemented by the FIR filter 310 to generate a normalized output vector y1. y1 exits the labeller 300 as the output vector following y0 and also enters the distance calculator 312. An associated prototype is selected by the minimum selector 314; the label therefor is provided as the next prototype output from the labeller 300. The parameters are again up-dated by means of the derivative calculator 316 and the filter 318.

Referring to FIG. 7, a specific embodiment of the distance calculator 312 is shown to include an adder 400 for subtracting the value of one frequency band of a given prototype vector from the normalized value of the same band of the output vector. In similar fashion, a difference value is determined for each band. Each resulting difference is supplied to a squarer element 402. The output of the squarer element 402 enters an accumulator 404. The accumulator 404 sums the difference values for all bands. The output from the accumulator 404 enters the minimum selector 314.

FIG. 8 shows a specific minimum selector formed of a comparator 410 which compares the current minimum distance dj against the current computed distance dk for a prototype vector Pk. If dk < dj, j=k; otherwise j retains its value. After all distance computations are processed by the comparator 410, the last value for j represents the (label) prototype output.

FIG. 9 shows a specific embodiment for the derivative calculator which includes an adder 420 followed by a multiplier 422. The adder 420 subtracts the associated prototype from the normalized output vector; the difference is multiplied in the multiplier 422 by another value (described in further detail with regard to FIG. 11).

FIG. 10 is a general flow diagram of a process 500 performed by the adaptive labeller 300. Normalization parameters are initialized in step 502. Input speech is converted into input feature vectors in step 504. The input feature vectors xi are transformed in step 506 into normalized vectors yi which replace the input feature vectors in subsequent speech processing. The normalized vectors provide one output of the process 500. The closest prototype for each normalized vector is found in step 508 and the label therefor is provided as a second output of the process 500. In step 510, a calculation is
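FIG. 7's adder/squarer/accumulator chain and FIG. 8's running comparator amount to a squared-Euclidean distance and an argmin over prototypes. A software sketch, with the hardware element numbers noted in comments:

```python
def distance(proto, y):
    """FIG. 7 pipeline: per-band difference, square, accumulate."""
    acc = 0.0
    for p_band, y_band in zip(proto, y):
        diff = y_band - p_band       # adder 400
        acc += diff * diff           # squarer 402 feeding accumulator 404
    return acc

def min_selector(y, prototypes):
    """FIG. 8 comparator: stream the distances and keep the running minimum;
    the surviving index j is the (label) prototype output."""
    j, d_min = 0, distance(prototypes[0], y)
    for k in range(1, len(prototypes)):
        d_k = distance(prototypes[k], y)
        if d_k < d_min:              # comparator 410: if d_k < d_j, j = k
            j, d_min = k, d_k
    return j
```

For example, `min_selector([1.0, 1.0], [[5.0, 5.0], [0.0, 0.0], [1.0, 2.0]])` returns `2`, since the third prototype gives the smallest accumulated squared difference.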