S9.1

Recent Advances in Speech Processing

J. Mariani

LIMSI/CNRS
BP 30
91406 Orsay Cedex (France)
On invitation from the ICASSP'89 Technical Committee, this paper aims at giving to non-specialists in Signal Processing an overview of recent advances in the domain of Speech Recognition. The paper mainly focuses on Speech Recognition, but also mentions some progress in other areas of Speech Processing (speaker recognition, speech synthesis, speech analysis and coding) using similar methodologies.

It first gives a view of what the problems related to automatic speech processing are, and then describes the initial approaches that have been followed in order to address those problems.

It then introduces the methodological novelties that allowed for progress along three axes: from isolated-word recognition to continuous speech, from speaker-dependent recognition to speaker-independent, and from small vocabularies to large vocabularies. Special emphasis centers on the improvements made possible by Markov Models, and, more recently, by Connectionist Models, resulting in progress simultaneously obtained along the above different axes, in improved performance for difficult vocabularies, or in more robust systems. Some specialised hardware is also described, as well as the efforts aimed at assessing Speech Recognition systems.

Most of the progress will be referenced with papers that have been presented at the IEEE ICASSP Conference, which is the major annual conference in the field. We will take this opportunity to produce some statistical data on the "Speech Processing" part of the conference, from its beginning in 1976 to its present fourteenth issue.
Introduction
The aim of this paper is to give non-specialists in Signal Processing an overview of recent advances in the domain of Speech Recognition. It can also be considered an introduction to the papers that will be presented in that field during this conference, especially those presenting the latest results on large vocabulary, continuous speech recognition systems.

As a general comment, one may feel that in recent years, the choice between methods based on extended knowledge introduced by human experts with corresponding heuristic strategies, and self-organizing methods, based on speech data bases and learning methodologies, with little human input, has turned toward the latter. This is partly due to the results of comparative assessment trials.
Problems related to speech processing

Several problems make speech processing difficult, and unsolved at the present time:

A. There is no separator, no silence between words, comparable to spaces in written language.

B. Each elementary sound (also called phoneme) is modified by its (close) context: the phoneme which is before it, and the one which comes after it. This is related to coarticulation: the fact that when a phoneme is pronounced, the pronunciation of the next phoneme is prepared by a movement of the vocal apparatus. This cause is also referred to as the "teleological" nature of speech [110]. Other (second order) modifications of the signal corresponding to a phoneme will be caused by larger context, such as its place in the whole sentence.

C. A good deal of variability is present in speech: intra-speaker variability, due to the speaking mode (singing, shouting, whispering, stuttering, with a cold, when hoarse, creakiness, voice under stress, etc.), inter-speaker variability (different timbre, male, female, child, etc.), and variability due to the signal input device (type of microphone), or to the environment (noise, co-channel interference, etc.).

D. Because of B and C, it will be necessary to observe, or to process, a large amount of data in order to find, or to obtain, what makes an elementary sound, despite the different contexts, the different speaking modes, the different speakers and the different environments. A difficult problem for the system is to be able to decide that an "a" pronounced by an aged male adult is more similar to an "a" pronounced in a different word by a child, in a different environment, than to an "o" pronounced in the same sentence by the same male adult.

E. The same signal carries different kinds of information (the sounds themselves, the syntactic structure, the meaning, the sex and the identity of the person speaking, his mood, etc.). A system will have to focus on the kinds of information which are of interest for its task.

F. There are no real rules at the present time for formalizing the information at the different levels of Language (including syntax, semantics, pragmatics), thus making it difficult to use fluent speech. Moreover, those different levels seem to be heavily linked to each other (syntax and semantics, for example). Fortunately, the problem mentioned in E. also means that the information in the signal will be redundant, and that the different types of information will cooperate with each other to make the signal understandable, despite the ambiguity and noise that may be found at each level.
First results on a simplified problem

After some overly optimistic hopes about the difficulty of the Speech Recognition task, similar to early views concerning automatic translation, a beneficial reaction in the late '60s was to consider the importance of the problem in its generality, and to try to solve a simpler problem by introducing simplifying hypotheses. Instead of trying to recognize anyone pronouncing anything, in any manner, and in fluent speech, a first sub-problem was isolated: recognizing only one person, using a small vocabulary (on the order of 20 to 50 words), and asking for short pauses between words.
The basic approach used two passes: a training pass and a recognition pass. During the training pass, the user pronounces each word of the vocabulary once. The corresponding signal is processed at the so-called "acoustic" or "parametric" level, and the resulting information, also called "acoustic image", "speech spectrogram", "template" or "reference pattern", which usually represents the signal in 3 dimensions (time, frequency, amplitude), is stored in memory, with its corresponding label. During the recognition pass, similar processing is conducted at the "acoustic" level: the corresponding pattern is then compared with all the reference patterns in memory, using an appropriate distance measure. The reference with the smallest distance is said to have been recognized, and its label can be furnished as a result. If that distance is too high, compared with a pre-defined threshold value, the decision can be non-recognition of the uttered word, thus allowing the system to "reject" a word which is not in its vocabulary.
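To make the mechanics concrete, here is a minimal sketch of this recognition pass, assuming the patterns have already been reduced to fixed-size arrays; the names (recognize, references, threshold) are illustrative, not taken from the original systems.

    import numpy as np

    def recognize(test_pattern, references, threshold):
        """Nearest-template decision with rejection.

        test_pattern: (T, d) array of spectral vectors.
        references: dict mapping a word label to a (T, d) template
                    (equal lengths assumed here; alignment comes later).
        Returns the best label, or None if the smallest distance exceeds
        the rejection threshold (word deemed out of vocabulary).
        """
        best_label, best_dist = None, float("inf")
        for label, template in references.items():
            dist = np.linalg.norm(test_pattern - template)
            if dist < best_dist:
                best_label, best_dist = label, dist
        return best_label if best_dist <= threshold else None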
This approach led to the first commercial systems, appearing on the market in the early '70s, such as the VIP 100 from Threshold Technology Inc., which won a US National Award in 1972. Due to those simplifications, this approach doesn't have to deal with the problems of segmenting continuous speech into words (problem A, above), of the context effect (as it deals with a complete pattern corresponding to a word always spoken in the same context, silence) (B), or of inter-speaker variability (C). Also, indirectly, it bypasses the problem of allowing for "natural language" speech (F), as the small size of the vocabulary and the pronunciation in isolation prevent fluent speech! However, the intra-speaker variability, the sound recording and the environment problems are still present.
- Pattern Matching using Dynamic Programming
In the recognition pass, the distance between the pattern to be recognized (test pattern) and each of the reference patterns in the vocabulary has to be computed. Each pattern is represented by a sequence of vectors regularly spaced along the time axis. Those vectors can represent the output of a filter bank (analog or simulated by different means, including the (Fast) Fourier Transform [33]), coefficients obtained by an autoregressive process such as Linear Prediction Coding (LPC) [157], or coefficients derived from these methods, like the Cepstral coefficients [19], or even obtained by using an auditory model [50,58,95]. Typical values are a vector of dimensions 8 to 20 (also called a spectrum or a frame), each 10 ms (for general information on speech signal processing techniques, see [117,101,129]).

The problem is that when a speaker pronounces the same word twice, the corresponding spectrograms will never be exactly the same. There are non-linear differences in time (rhythm), in frequency (timbre), and in amplitude (intensity). Thus, it is necessary to align the two spectrograms, so that, when the test pattern is compared to the correct reference pattern, the vectors representing the same sound in the two words correspond to each other. The distance measure between the two spectrograms will be calculated according to this alignment. Optimal alignment can be obtained by using the Dynamic Programming method
(Figure 1). If we consider the distance matrix D obtained by computing the distances d(i,j) (for example, the Euclidean distance) between each vector of the test pattern and of the reference pattern, this method furnishes the optimal path from (1,1) to (I,J) (where I and J are respectively the lengths of the test and of the reference patterns), and the corresponding distance measure between the two patterns. In the case of Speech Recognition, this method is also called Dynamic Time Warping, or DTW, since the main result is to "warp" the time axis. Dynamic Programming was first presented by R. Bellman [14], and first applied to speech by the Russian researchers T. Vintsjuk and G. Slutsker in the late '60s [165,160].
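A compact sketch of the DTW computation just described, using one common symmetric local equation (the exact local constraints vary between systems; this particular choice is an assumption, not the paper's specific variant):

    import numpy as np

    def dtw(test, ref):
        """Cumulated DTW distance between two sequences of spectral frames.

        test: (I, d) array, ref: (J, d) array. Local equation:
        g(i,j) = d(i,j) + min(g(i-1,j), g(i-1,j-1), g(i,j-1)).
        """
        I, J = len(test), len(ref)
        d = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=2)
        g = np.full((I, J), np.inf)
        g[0, 0] = d[0, 0]
        for i in range(I):
            for j in range(J):
                if i == 0 and j == 0:
                    continue
                g[i, j] = d[i, j] + min(
                    g[i - 1, j] if i > 0 else np.inf,
                    g[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                    g[i, j - 1] if j > 0 else np.inf)
        return g[-1, -1] / (I + J)      # length-normalized distance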
Figure 1: Example of Dynamic Time Warping between two speech patterns (the word "Paris" represented by a schematic spectrogram). G is the distance measure between the two utterances of the word. d(i,j) is the distance between two frames of the reference and test patterns at instants i and j. An example of a local DP equation is given. The optimal path is represented by squares. The cumulated distances involved in the computation of the cumulated distance g(i,j) are represented by circles.

- Speech and AI: the ARPA-SUR project

A different approach, mainly based on "Artificial Intelligence" techniques, was initiated in 1971, in the framework of the ARPA-SUR project [114]. The idea behind it was that the use of "upper level" knowledge (lexicon, syntax, semantics, pragmatics) could produce an acceptable recognition rate, even if the initial phoneme recognition rate was poor [70]. The task was speaker-dependent, continuous speech recognition (improperly called "understanding" because upper levels were used), with a 1,000-word vocabulary. Several systems were delivered at the end of the project, in 1976. From CMU, the DRAGON system [4] was designed, using a Markov approach. The HEARSAY I and HEARSAY II systems were based on the use of a Blackboard Model, where each Knowledge Source can read and write information during the decoding, with a heuristic strategy, and the HARPY system merged parts of the DRAGON and HEARSAY systems. From BBN, the SPEECHLIS and HWIM systems were developed. SDC also produced a system [76]. Although the initial requirements (which were in fact rather vague) were attained by at least one system (HARPY), the systems needed so much computer power, at a time when this was expensive, and were so cumbersome to use and so non-robust, that there was no follow-up. In fact, one of the major conclusions was that there was a need for better acoustic-phonetic decoding [71]!

Improvements along each of the 3 axes

From the basic IWR method, progress has been made which independently addresses the three different problems: the size of the population using the system, the speaking rate, and the size of the vocabulary.

- Speaker-Dependent (SD) to Speaker-Independent (SI)

In order to allow any speaker to use a recognition system, a multi-reference approach has been experimented with. Each word of the vocabulary is pronounced by a large population, male and female, with different timbres and different dialectal origins. The distance between the different pronunciations of the same word is computed using DTW. A clustering algorithm (such as K-means) is used to determine clusters corresponding to a certain type of pronunciation for that word. The centroid of each cluster is chosen to be the reference pattern for this type of pronunciation (Figure 2). Each word is then represented by several reference patterns. Recognition is carried out in the same way as it is in the speaker-dependent mode, with, eventually, a more sophisticated decision process (like KNN (K-nearest neighbors)) [130].

Figure 2: An illustration of Clustering. Each cross is a word. The distance between crosses represents the DTW distance between the words. Each cluster is represented by its centroid (circles).
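Since plain K-means needs a vector space while only DTW distances between utterances are available here, the sketch below uses a K-medoids variant (each reference constrained to be one of the training utterances) as a stand-in for the clustering step; the names and iteration scheme are illustrative assumptions.

    import numpy as np

    def cluster_templates(dist, k, iters=20, seed=0):
        """Greedy k-medoids over a precomputed DTW distance matrix `dist`
        (n x n). Returns the indices of the k medoid utterances, which
        serve as the multiple reference patterns for one word."""
        rng = np.random.default_rng(seed)
        n = len(dist)
        medoids = rng.choice(n, size=k, replace=False)
        for _ in range(iters):
            assign = np.argmin(dist[:, medoids], axis=1)   # nearest medoid
            new = medoids.copy()
            for c in range(k):
                members = np.where(assign == c)[0]
                if len(members):
                    # medoid = member minimizing total distance to its cluster
                    within = dist[np.ix_(members, members)].sum(axis=1)
                    new[c] = members[np.argmin(within)]
            if np.array_equal(new, medoids):
                break
            medoids = new
        return medoids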
- Isolated Word Recognition (IWR) to Connected Word Recognition (CWR) and Word Spotting

In order to allow the user to speak continuously (without pauses between words), several problems have to be solved: how many words are in the sentence and where their boundaries are; and, if the training is to be done in isolation, the patterns corresponding to the beginning and to the end of the words will be modified, due to the context of the end of the previous word, and to the context of the beginning of the following word. The first two problems have been solved by using methods generalising the IWR DTW, such as "Two-Level DP matching" proposed by H. Sakoe, "Level Building" proposed by C. Myers and L. Rabiner [108], and "One-Pass DP" proposed by J. Bridle, also called "One-Stage DP" by H. Ney [115]. It appears in fact that the DP approach as first described by T. Vintsjuk in 1968 [165] already had its extension to Connected Word Recognition [87]. To address the second problem, the "embedded training" method has been proposed [132], where each word is first pronounced in isolation. It is then pronounced in a sentence known to the system. The "isolated" reference templates will be used to optimally segment the sentence into its constituents, and extract the "contextual" image of the words, which will be added as new reference templates.

The "Word Spotting" technique is very similar, and uses the same DTW techniques. But it should allow rejection of words in the sentence which are not in the vocabulary. Recent results on Speaker Independent Word Spotting give 61% correct detection in clean speech, and 44% when Gaussian noise is added for a Signal-to-Noise ratio of 10 dB, with a 20-word vocabulary (1 to 3 syllables long), the false alarm rate being set to 10 false alarms per hour [20].
A syntax can be used during the recognition process. The syntax represents the word sequences that are allowed for the language corresponding to the task using speech recognition. The role of the syntax is to determine which words can follow a given word (sub-vocabulary), thus accelerating the recognition process by reducing the size of the vocabulary to be recognized at each step, and improving the performance by possibly eliminating words that are acoustically similar, but do not belong to the same sub-vocabulary, and thus do not compete. The introduction of the grammar into the search procedure may be more or less difficult, depending on the CWR-DTW algorithm used. Most of the syntaxes used in DTW-type systems correspond to simple command languages (regular, or context-free grammars introduced manually by the system user).
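As an illustration, a regular command-language syntax can be stored as a successor table that restricts, at each step, the set of templates to be matched; the tiny grammar below is a made-up example, not one from the cited systems.

    # Hypothetical command-language syntax as a successor table.
    syntax = {
        "<start>": ["call", "dial"],
        "call":    ["john", "mary"],
        "dial":    ["one", "two", "three"],
    }

    def active_vocabulary(previous_word):
        """Words allowed to follow `previous_word`; only these templates
        are matched at the next step, which both speeds up the search and
        removes acoustically similar but ungrammatical competitors."""
        return syntax.get(previous_word, [])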
It appears that the better the training, the better the recognition. In order to improve training, several techniques have been tried, like the "embedded training" already mentioned; multireference training, where several references of a word are kept, using the same clustering techniques for representing intra-speaker variations as those used for inter-speaker variations in multi-reference speaker-independent recognition; and robust training [131].
- Small Vocabularies to Large Vocabularies:

Increasing the size of the vocabulary raises several problems: since each word is represented by its spectrogram, the necessary memory size gets very large. Since matching between the test pattern and the reference patterns is done sequentially, computation time also greatly increases. If a speaker has to train the system by pronouncing all of the words, the task rapidly becomes tedious. A large vocabulary has the consequence of many acoustically similar words, thus increasing the error rate. It also implies that the speaker will want to use a natural way of speaking, without strong syntax constraints. To address these problems, there have been several improvements:
- Vector Quantization [53,49,97]: In the domain of Speech Processing, this method was first used for low bit rate speech coding [91]. Considering a reasonable amount of speech pronounced by one speaker, the method consists of computing the distances (like the Euclidean distance) between each vector of the corresponding spectrogram, and using a clustering
algorithm to determine clusters corresponding to a type of vector, which is represented by the centroid (called "prototype" or "codeword"). The set of prototypes is a "codebook". In the training phase, after acoustic decoding of the word, each spectrum is recognized to be one of the prototypes of the codebook. Thus, instead of being represented by a sequence of vectors, the word will be represented by a sequence of numbers (also called labels) corresponding to the prototypes. A distortion measure can be obtained by computing the average distance between the incoming vector and the closest prototype. On a practical level, if the size of the codebook is 256, or less (it is addressable on one byte), and each vector component is coded on one byte, the reduction of information is equal to the dimension of the vectors. Also, computing time is saved during recognition for large vocabularies since, for each input vector of the test pattern, only 256 distances have to be computed, instead of computing the distances with all the vectors of all the reference templates. Moreover, the distances between prototypes can be computed after training, and kept in a distance matrix. Those codebooks can concern not only spectral information, but also energy, or variation of spectral information or of energy in time. All this can be represented by a single codebook with supervectors, constructed by including the different kinds of information. It can also be represented by a different codebook for each type of information. This approach was applied with success to speaker identification, and to speech recognition [55]. The codebooks can also be constructed from the speech of several speakers (speaker-independent codebooks) [83].
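A minimal sketch of codebook training and of the quantization of a pattern into a label sequence, assuming plain k-means over the spectral vectors (the names, the codebook size of 256 and the iteration count are illustrative):

    import numpy as np

    def train_codebook(vectors, size=256, iters=10, seed=0):
        """k-means codebook training. vectors: (n, d) array of spectra."""
        rng = np.random.default_rng(seed)
        codebook = vectors[rng.choice(len(vectors), size, replace=False)]
        for _ in range(iters):
            labels = np.argmin(
                np.linalg.norm(vectors[:, None] - codebook[None], axis=2),
                axis=1)
            for k in range(size):
                members = vectors[labels == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
        return codebook

    def quantize(pattern, codebook):
        """Replace each spectrum by the index (label) of its closest
        prototype; distortion is the average distance to that prototype."""
        d = np.linalg.norm(pattern[:, None] - codebook[None], axis=2)
        labels = d.argmin(axis=1)
        distortion = d[np.arange(len(pattern)), labels].mean()
        return labels, distortion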
It should be noted that similar methods had been used previously (Centisecond Model). The problem at that time was that the vectors had been labelled with a linguistic label (a phoneme), thus making a decision too early. The Vector Quantization scheme inspired much thought. One remark was that each word could have a specific set of prototypes, without taking into account the chronological sequence of those prototypes. Even if some words contain the same phonemes in a different order, the transitions between those phonemes are different, and the prototypes corresponding to those transitions may be different, the latter making the distinction between words. During training, a codebook is built for each reference word. The recognition process then consists of simply recognizing the incoming vectors, and choosing the reference which gives the smallest average distortion with the test. A refined approach consisted of segmenting the words into multiple sections, in order to partly reflect time sequencing for words having several phonemes in common [26]. This refinement increases the computation time, without giving better results than the DTW-based approach does.
- Sub-word units: Another way to reduce the memory requirement is to use decision units that are shorter than the words (also called subword units). The words will then be recognized as the concatenation of such units, using a Connected "Word" DTW algorithm. These units must be chosen so that they are not too affected by the coarticulation problem at their boundaries. But they also should not be too numerous. Examples of such units are phonemes [163], diphones [98,148,32,2,149], syllables [59,168,48], demi-syllables [144,139], and disyllables [159] (Figure 3).
graphemic word:  émigrante
phonemic word:   $emigRãt$
phonemes:        $ e m i g R ã t $
diphones:        $e em mi ig gR Rã ãt t$
syllables:       $e mi gRãt$
demi-syllables:  $e em mi ig gRã ãt t$
disyllables:     $e emi igRã ãt$

Figure 3: Representation of a word by subword units
(the word is "émigrante" ("emigrant") in French, $ stands for silence)
Other approaches tend to use units with no linguistic affiliation, for example segments, obtained by a segmentation algorithm. This approach leads to Segment (or Matrix) Quantization, very similar to Vector Quantization, except that the distance between segment prototypes may need time alignment, if the segments do not have a constant length.
- Time compression: Time compression can also reduce the amount of information [75,46]. The idea is to compress (linearly, or non-linearly) the steady states, which may have very different lengths depending on speaking rate, while keeping all the vectors during the transitions, thus moving from the time space to the variation space. An algorithm like the VLTS (Variable Length Trace Segmentation) [46] halves the amount of information used. It also obtains better results when the pronunciation rate is very different between training and recognition (some often-used DTW equations, for example, do not accept speaking rate variations of more than a 2-to-1 ratio, which is easily reached between isolated word pronunciation and continuous speech). However, if duration itself carries meaning, that information may be lost.
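The idea can be illustrated by resampling a pattern at equal steps of cumulated spectral variation, so steady states are compressed and transitions preserved; this is a simplified sketch of the principle, not the VLTS algorithm itself.

    import numpy as np

    def trace_segmentation(frames, n_out):
        """Resample (T, d) frames at equal arc-length steps along the
        trajectory in spectral space: long steady states collapse to a
        few frames, fast transitions keep most of theirs."""
        steps = np.linalg.norm(np.diff(frames, axis=0), axis=1)
        arc = np.concatenate([[0.0], np.cumsum(steps)])  # cumulated variation
        targets = np.linspace(0.0, arc[-1], n_out)
        idx = np.clip(np.searchsorted(arc, targets), 0, len(frames) - 1)
        return frames[idx]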
- Two-pass recognition: In order to accelerate recognition, it can be processed in two passes: first, there can be a rough, but fast, match aimed at eliminating words of the vocabulary that are very different from the test pattern, before applying an optimal match (DTW or Viterbi) on the remaining reduced subvocabulary. In this case, the goal is not to get just the correct word, but to eliminate as many word-candidates as possible (without eliminating the right one, of course). Simple approaches like summing the distances on the diagonal of the distance matrix used for DTW [48] have been tried. Other approaches are based on Vector Quantization without time alignment, the system being based on Pattern Matching [99] or on Stochastic Modeling (called "Poisson Polling") [5]. Using a phonetic classifier, based on broad [45] or usual [16] phonetic classes, and matching the recognised phoneme lattice with the reference phonemic words in the lexicon by DTW, is another reported method.
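A sketch of such a two-pass search, using a crude diagonal distance as the fast match and the dtw function sketched earlier as the optimal match (the keep parameter and the truncation to equal lengths are illustrative simplifications):

    import numpy as np

    def two_pass_recognize(test, references, keep=10):
        """references: dict label -> (T, d) template."""
        def rough(ref):
            # cheap stand-in for summing distances along the diagonal
            n = min(len(test), len(ref))
            return np.linalg.norm(test[:n] - ref[:n], axis=1).mean()
        shortlist = sorted(references,
                           key=lambda lab: rough(references[lab]))[:keep]
        # optimal match only on the surviving candidates
        return min(shortlist, key=lambda lab: dtw(test, references[lab]))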
- Speaker Adaptation: The adaptation of one speaker's references to a new speaker can be achieved through their respective codebooks, if a Vector Quantization scheme is used. The reference speaker produces several sentences, which are vector quantized with his codebook. The new speaker produces the same sentences, which are vector quantized with his own codebook. Time alignment of the two sets of sentences creates a mapping between the two codebooks. This basic method has several variants [155,21,43].
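One simple way to realize the codebook mapping, assuming the two label sequences have already been aligned frame-by-frame by DTW (a sketch of the basic idea; the cited variants are more elaborate):

    import numpy as np

    def codebook_mapping(ref_labels, new_labels, ref_size, new_size):
        """ref_labels[t] and new_labels[t] are the codeword indices
        emitted at aligned frame t. Each reference codeword is mapped to
        the new-speaker codeword it co-occurs with most often."""
        counts = np.zeros((ref_size, new_size), dtype=int)
        for r, n in zip(ref_labels, new_labels):
            counts[r, n] += 1
        return counts.argmax(axis=1)   # mapping: reference idx -> new idx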
Most of the progress related to these techniques has been obtained on one aspect of the problem. Some systems addressing two aspects can also be found, like the Conversant system from AT&T [152], which allows for speaker-independent connected digit recognition over telephone lines using a multireference CWR-DTW approach. Further advances have been obtained by using more elaborate techniques: Hidden Markov Models, and Connectionist Models.
The Hidden Markov Model approach

Whereas in the previous pattern matching approach, a reference was represented by the pattern itself, which was stored in memory, the Markov Model approach carries a higher level of abstraction, representing the reference by a model [125,135]. To be recognized, the input is thus compared to the reference models. The first uses of this approach for speech recognition can be found at CMU [4], IBM [62] and, apparently, IDA [124].

In a stochastic approach, if we consider an acoustic signal A, the recognition process can be described as computing the probability P(W|A) that any word string W (or sentence) corresponds to the acoustic signal A, and as finding the word string having the maximum probability. Using Bayes' rule, P(W|A) can be represented as:

P(W|A) = P(W) . P(A|W) / P(A)

where P(W) is the probability of the word string W, P(A|W) is the probability of the acoustic signal A, given the word string W, and P(A) is the probability of the acoustic signal (which does not depend on W). Thus it is necessary to take into account P(A|W) (which is the acoustic model), and P(W) (which is the language model). Both models can be represented as Markov models [6]. We will first consider Acoustic Modeling.
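In code, the decision rule is a simple argmax once the two models supply their scores (log domain, to avoid underflow on long sentences; the function names are placeholders for the acoustic and language models):

    def best_word_string(candidates, acoustic_logprob, language_logprob):
        """argmax over word strings W of P(W) * P(A|W);
        P(A) is the same for all W and can be dropped."""
        return max(candidates,
                   key=lambda W: language_logprob(W) + acoustic_logprob(W))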
- Basic discrete approach: Here each acoustic entity to be recognized, each reference word for example, is represented by a finite state machine, also called a Markov machine, composed of states, and of arcs between states. A transition probability a_ij is attached to the arc going from state i to state j, representing the probability that this arc will be taken. The sum of the transition probabilities attached to the arcs issued from a given state i is equal to 1. There is also an output probability b_ij(k) that a symbol k from a finite alphabet can be emitted when the arc from state i to state j is taken. In some variants, this output probability is attached to the state, not to the arc. When Vector Quantization is used, this output probability distribution (also called output probability density function (pdf)) is the probability distribution of the prototypes. The sum of the probabilities in the distribution is also equal to 1 (Figure 4). In a first-order Hidden Markov Model, it is assumed that the probability that the Markov chain is in a particular state at time t depends only on the state where it was at time t-1, and that the output probability at time t depends only on the arc being taken at time t.
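The constraints above can be written down directly; the small left-to-right model below is an arbitrary numerical example (3 states, as in Figure 4), not a model from the cited systems.

    import numpy as np

    # a[i, j]: transition probability from state i to state j.
    a = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])

    # b[i, j, k]: probability of emitting symbol k on the arc (i, j),
    # drawn at random here just to satisfy the constraints.
    K = 4                                   # size of the discrete alphabet
    b = np.random.default_rng(0).dirichlet(np.ones(K), size=(3, 3))

    assert np.allclose(a.sum(axis=1), 1.0)  # transition probs sum to 1
    assert np.allclose(b.sum(axis=2), 1.0)  # each output pdf sums to 1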
Figure 4: An example of a Hidden Markov Model. The output probability distributions b_ij(k) are enclosed in rectangles. a_ij is the transition probability. This left-to-right model has 3 states and 4 arcs.
- Continuous models: We have just presented what are usually called "Discrete Hidden Markov Models". Another type of Markov Model is the "Continuous Markov Model". In this case, the discrete output pdf on one arc is replaced by a model of the continuous spectrum on that arc. A usual model is the multivariate Gaussian density [20], which describes the pdf by a mean vector and a covariance matrix (eventually diagonal). The use of a multivariate Gaussian mixture density seems to be more appropriate [135,66,137,122]. The Laplacian mixture density seems to allow for good quality results, with reduced computation time [118]. Several attempts to compare discrete and continuous HMMs have been reported. It seems that only complex continuous models allow for better results than discrete ones, reflecting the fact that with the usual Maximum Likelihood training, the complete model should be correct to allow for good recognition results [7]. But complex continuous models need a good deal of computation.
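For concreteness, a diagonal-covariance Gaussian mixture output density can be evaluated as follows (a generic sketch, not tied to any of the cited systems):

    import numpy as np

    def gaussian_mixture_logpdf(x, weights, means, variances):
        """Log-density of one spectral vector x (d,) under a mixture of M
        diagonal-covariance Gaussians.
        weights: (M,), means: (M, d), variances: (M, d)."""
        log_comp = (np.log(weights)
                    - 0.5 * (np.log(2 * np.pi * variances)
                             + (x - means) ** 2 / variances).sum(axis=1))
        m = log_comp.max()
        return m + np.log(np.exp(log_comp - m).sum())   # log-sum-exp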
- Smoothing: Similar prototypes can be defined from the distance measure. If the output probability of a prototype is null on an arc, it can be smoothed with the non-null probability of a similar prototype. A third method is co-occurrence smoothing [52], which smooths, on all the arcs, the probabilities of labels that sometimes appear on the same arcs.
- Deleted interpolation: In order to smooth the estimates of the different methods, it is necessary to apply weights to the different estimates. Those weights will reflect the quality of each estimate, or the quantity of information used to calculate each of them. A method to automatically determine those weights is the deleted interpolation estimation, which splits the estimates on two arcs, and defines the weights as the transition probabilities of the arcs, as computed by the Forward-Backward algorithm [63].
- Time modeling: The modelisation of time in a Markov model is contained in the probabilities of the arcs. It appears that the probability to stay at a given state will decrease as a power of the probability to follow the arc looping on that state, which seems to be a poor time model in the case of the speech signal. Several attempts to improve that issue can be found.

In the Semi-Hidden Markov Model [44,145], a set of probability density functions P_i(d) at each state i indicates the probability of staying in that state for a given duration, d. This set of probabilities is trained together with the transition and output probabilities by using a modified Forward-Backward algorithm. A simpler approach is to independently train the duration probability and the HMM parameters [134].

To allow for a more easily trainable model, continuous probability density functions can be used for duration modeling, like the Poisson distribution [145] or the gamma distribution, used by S. Levinson in his Continuously Variable Duration Hidden Markov Model (CVDHMM) [85].

Another way of indirectly taking time into account is to include the dynamics of the spectrum as a new parameter. It can be represented by the differenced Cepstrum coefficients corresponding to adjacent frames, and can also include the differenced power. After Vector Quantization, multiple codebooks for those new parameters are built. They are introduced in the HMM with independent output pdfs on the arcs [83].
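The differenced parameters themselves are straightforward to compute; a small sketch (the treatment of the first frame is an arbitrary choice here):

    import numpy as np

    def delta_features(cepstra, power):
        """Differenced cepstra and differenced power between adjacent
        frames; each stream would then get its own codebook and its own
        output pdf on the arcs of the HMM.

        cepstra: (T, d) array, power: (T,) array.
        """
        d_cep = np.diff(cepstra, axis=0, prepend=cepstra[:1])
        d_pow = np.diff(power, prepend=power[:1])
        return d_cep, d_pow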
- Decision Units:

- The natural idea is to modelise a word with an HMM. An example of a Markov word machine, from R. Bakis [62], is given in Figure 5. The number of states in the word model is equal to the average duration of the word (50 states for a 500 ms word, with a frame each 10 ms). It should be noted that the model includes the frame deletion and insertion phenomena previously detected during DTW. More recently, models with less states have been successfully tried [133]. The problem is that, to get a good model of the word, there should be a large number of pronunciations of that word. Also, the recognition process considers a word as a whole, and does not focus on the information discriminating two acoustically similar words.
The number of states, the number of arcs, and the initial and final states for each arc are chosen by the system designer. The parameters of the model (transition probabilities, and output probabilities) have to be obtained through training. Three problems have to be addressed:

- the Evaluation problem (what is the probability that a sequence of labels has been produced by a given model?). This can be obtained by using the Forward algorithm, which gives the Maximum Likelihood estimation that the sequence was produced by the model.

- the Decoding problem (which sequence of states has produced the sequence of labels?). This can be obtained by the Viterbi algorithm, which is very similar to DTW [166].

- the Learning (or Training) problem (how to get the parameters of the model, given a sequence of labels?). This can be obtained by the Forward-Backward (also called Baum-Welch) algorithm [12], when the training is based on Maximum Likelihood.
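The first two problems have compact dynamic-programming solutions. The sketch below assumes output probabilities attached to states (one common variant) rather than to arcs:

    import numpy as np

    def forward(a, b, pi, obs):
        """Evaluation problem: probability that the model produced the
        label sequence `obs`, by the Forward algorithm.
        a: (N, N) transitions, b: (N, K) output pdfs per state, pi: (N,)."""
        alpha = pi * b[:, obs[0]]
        for o in obs[1:]:
            alpha = (alpha @ a) * b[:, o]
        return alpha.sum()

    def viterbi(a, b, pi, obs):
        """Decoding problem: most likely state sequence, by the Viterbi
        algorithm (same DP structure as DTW, with max instead of sum)."""
        with np.errstate(divide="ignore"):      # log 0 -> -inf is fine
            la, lb, lpi = np.log(a), np.log(b), np.log(pi)
        delta, back = lpi + lb[:, obs[0]], []
        for o in obs[1:]:
            scores = delta[:, None] + la        # scores[i, j]: via state i
            back.append(scores.argmax(axis=0))  # best predecessor of j
            delta = scores.max(axis=0) + lb[:, o]
        path = [int(delta.argmax())]
        for bp in reversed(back):               # trace the path backwards
            path.append(int(bp[path[-1]]))
        return path[::-1]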
- Training:

- Initialisation: Initialisation of the parameters in the model has to be carried out before starting
