`
Recent Advances in Speech Processing
J. Mariani

LIMSI/CNRS
BP 30
91406 Orsay Cedex (France)
`
On invitation from the ICASSP'89 Technical Committee, this paper aims at giving non-specialists in Signal Processing an overview of recent advances in the domain of Speech Recognition. The paper mainly focuses on Speech Recognition, but also mentions some progress in other areas of Speech Processing (speaker recognition, speech synthesis, speech analysis and coding) using similar methodologies.
It first gives a view of what the problems related to automatic speech processing are, and then describes the initial approaches that have been followed in order to address those problems.
It then introduces the methodological novelties that allowed for progress along three axes: from isolated-word recognition to continuous speech, from speaker-dependent recognition to speaker-independent, and from small vocabularies to large vocabularies. Special emphasis centers on the improvements made possible by Markov Models, and, more recently, by Connectionist Models, resulting in progress simultaneously obtained along the above different axes, in improved performance for difficult vocabularies, or in more robust systems. Some specialised hardware is also described, as well as the efforts aimed at assessing Speech Recognition systems.
Most of the progress will be referenced with papers that have been presented at the IEEE ICASSP Conference, which is the major annual conference in the field. We will take this opportunity to produce some statistical data on the "Speech Processing" part of the conference, from its beginning in 1976 to its present fourteenth issue.
`Introduction
`
The aim of this paper is to give non-specialists in Signal Processing an overview of recent advances in the domain of Speech Recognition. It can also be considered an introduction to the papers that will be presented in that field during this conference, especially those presenting the latest results on large vocabulary, continuous speech recognition systems.
`As a general comment, one may feel that in recent years, the
`choice between methods based on extended knowledge introduced by
`human experts with corresponding heuristic strategies, and self-organizing
`methods, based on speech data bases and learning methodologies, with
`little human input, has turned toward the latter. This is partly due to the
`results of comparative assessment trials.
`Problems related to speech processing
Several problems make speech processing difficult, and leave it unsolved at the present time:
A. There is no separator, no silence between words, comparable to spaces in written language.
B. Each elementary sound (also called phoneme) is modified by its (close) context: the phoneme which is before it, and the one which comes after it. This is related to coarticulation: the fact that when a phoneme is pronounced, the pronunciation of the next phoneme is prepared by a movement of the vocal apparatus. This cause is also referred to as the "teleological" nature of speech [110]. Other (second order) modifications of the signal corresponding to a phoneme will be caused by larger context such as its place in the whole sentence.
C. A good deal of variability is present in speech: intra-speaker variability, due to the speaking mode (singing, shouting, whispering, stuttering, with a cold, when hoarse, creakiness, voice under stress, etc.); inter-speaker variability (different timbre, male, female, child, etc.); and variability due to the signal input device (type of microphone), or to the environment (noise, co-channel interference, etc.).
D. Because of B and C, it will be necessary to observe, or to process, a large amount of data in order to find, or to obtain, what makes an elementary sound, despite the different contexts, the different speaking modes, the different speakers and the different environments. A difficult problem for the system is to be able to decide that an "a" pronounced by an aged male adult is more similar to an "a" pronounced in a different word by a child, in a different environment, than to an "o" pronounced in the same sentence by the same male adult.
E. The same signal carries different kinds of information (the sounds themselves, the syntactic structure, the meaning, the sex and the identity of the person speaking, his mood, etc.). A system will have to focus on the kinds of information which are of interest for its task.
`
F. There are no real rules at the present time for formalizing the information at the different levels of Language (including syntax, semantics, pragmatics), thus making it difficult to use fluent speech. Moreover, those different levels seem to be heavily linked to each other (syntax and semantics, for example). Fortunately, the problem mentioned in E. also means that the information in the signal will be redundant, and that the different types of information will cooperate with each other to make the signal understandable, despite the ambiguity and noise that may be found at each level.
First results on a simplified problem
After some overly optimistic hopes concerning the Speech Recognition task, similar to early views concerning automatic translation, a beneficial reaction in the late '60s was to consider the importance of the problem in its generality, and to try to solve a simpler problem by introducing simplifying hypotheses. Instead of trying to recognize anyone pronouncing anything, in any manner, and in fluent speech, a first sub-problem was isolated: recognizing only one person, using a small vocabulary (on the order of 20 to 50 words), and asking for short pauses between words.
`The basic approach used two passes: a training pass and a
`recognition pass. During the training pass, the user pronounces each word
`of the vocabulary once. The corresponding signal is processed at the so-
`called "acoustic" or "parametric" level, and the resulting information, also
`called "acoustic image", "speech spectrogram", "template" or "reference
`pattern", which usually represents the signal in 3 dimensions (time,
`frequency, amplitude), is stored in memory, with its corresponding label.
During the recognition pass, similar processing is conducted at the "acoustic" level: the corresponding pattern is then compared with all the
`reference patterns in memory, using an appropriate distance measure. The
`reference with the smallest distance is said to have been recognized, and
`its label can be furnished as a result. If that distance is too high, compared
`with a pre-defined threshold value, the decision can be non-recognition of
`the uttered word, thus allowing the system to "reject" a word which is not
`in its vocabulary.
This approach led to the first commercial systems, appearing on the market in the early '70s, such as the VIP 100 from Threshold Technology Inc., which won a US National Award in 1972. Due to those simplifications, this approach doesn't have to deal with the problems of segmenting continuous speech into words (problem A, above), of the context effect (as it deals with a complete pattern corresponding to a word always spoken in the same context - silence) (B), or of inter-speaker variability (C). Also, indirectly, it bypasses the problem of allowing for "natural language" speech (F), as the small size of the vocabulary and the pronunciation in isolation prevent fluent speech! However, the intra-speaker variability, the sound recording and the environment problems are still present.
- Pattern Matching using Dynamic Programming
In the recognition pass, the distance between the pattern to be recognized (test pattern) and each of the reference patterns in the vocabulary has to be computed. Each pattern is represented by a sequence of vectors regularly spaced along the time axis. Those vectors can represent the output of a filter bank (analog or simulated by different means, including the (Fast) Fourier Transform [33]), coefficients obtained by an autoregressive process such as Linear Prediction Coding (LPC) [157], or coefficients derived from these methods, like the Cepstral coefficients [19], or even obtained by using an auditory model [50,58,95]. Typical values are a vector of dimension 8 to 20 (also called a spectrum or a frame), every 10 ms (for general information on speech signal processing techniques, see [117,101,129]).
The problem is that when a speaker pronounces the same word twice, the corresponding spectrograms will never be exactly the same. There are non-linear differences in time (rhythm), in frequency (timbre), and in amplitude (intensity). Thus, it is necessary to align the two spectrograms, so that, when the test pattern is compared to the correct reference pattern, the vectors representing the same sound in the two words correspond to each other. The distance measure between the two spectrograms will be calculated according to this alignment. Optimal alignment can be obtained by using the Dynamic Programming method
`
`429
`
`CH26'13-2/89/0000429 $1.00 0 1989 IEEE
`
`Authorized licensed use limited to: Callie Pendergrass. Downloaded on October 14,2022 at 16:01:50 UTC from IEEE Xplore. Restrictions apply.
`
`IPR2023-00035
`Apple EX1013 Page 1
`
`
`
(Figure 1). If we consider the distance matrix D obtained by computing the distances d(i,j) (for example, the Euclidean distance) between each vector of the test pattern and of the reference pattern, this method furnishes the optimal path from (1,1) to (I,J) (where I and J are respectively the lengths of the test and of the reference patterns), and the corresponding distance measure between the two patterns. In the case of Speech Recognition, this method is also called Dynamic Time Warping, or DTW, since the main result is to "warp" the time axis. Dynamic Programming was first presented by R. Bellman [14], and first applied to speech by the Russian researchers T. Vintsjuk and G. Slutsker in the late '60s [165,160].
`
For speaker-independent recognition (described further below), a clustering algorithm (such as K-means) is used to determine clusters corresponding to a certain type of pronunciation of a word. The centroid of each cluster is chosen to be the reference pattern for this type of pronunciation (Figure 2). Each word is then represented by several reference patterns. Recognition is carried out in the same way as it is in the speaker-dependent mode, possibly with a more sophisticated decision process (like KNN (K-nearest neighbors)) [130].
`
Figure 2: An illustration of Clustering.
Each cross is a word. The distance between crosses represents the DTW distance between the words. Each cluster is represented by its centroid (circles).
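Since only pairwise DTW distances are available (the templates do not live in a common vector space), the "centroid" is in practice a medoid. The following is a hypothetical sketch of such clustering over a precomputed DTW distance matrix; the K-medoids variant shown is an illustrative assumption, not the specific algorithm of the cited systems:

```python
import numpy as np

def cluster_templates(dist, k, iters=20, seed=0):
    """Cluster the utterances of one word from their pairwise DTW
    distance matrix `dist` (n x n, symmetric).  Returns the indices of
    k representative templates (medoids), one per pronunciation type."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(iters):
        # Assign each utterance to its closest medoid.
        assign = np.argmin(dist[:, medoids], axis=1)
        new = medoids.copy()
        for c in range(k):
            members = np.where(assign == c)[0]
            if len(members):
                # Medoid = member minimising the summed distance
                # to the other members of its cluster.
                within = dist[np.ix_(members, members)].sum(axis=1)
                new[c] = members[np.argmin(within)]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids
```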
- Isolated Word Recognition (IWR) to Connected Word Recognition (CWR) and Word Spotting
In order to allow the user to speak continuously (without pauses between words), several problems have to be solved: how many words are in the sentence, and where their boundaries are; and, if the training is to be done in isolation, the patterns corresponding to the beginning and to the end of the words will be modified, due to the context of the end of the previous word, and to the context of the beginning of the following word. The first two problems have been solved by using methods generalising the IWR DTW, such as "Two-Level DP matching" proposed by H. Sakoe [146], "Level building" proposed by C. Myers and L. Rabiner [108], and "One-Pass DP" proposed by J. Bridle, also called "One-stage DP" by H. Ney [115]. It appears in fact that the DP approach as first described by T. Vintsjuk in 1968 [165] already had its extension to Connected Word Recognition [87]. To address the second problem, the "embedded training" method has been proposed [132], where each word is first pronounced in isolation. It is then pronounced in a sentence known to the system. The "isolated" reference templates will be used to optimally segment the sentence into its constituents, and extract the "contextual" image of the words, which will be added as new reference templates.
The "Word Spotting" technique is very similar, and uses the same DTW techniques, but it should allow rejection of words in the sentence which are not in the vocabulary. Recent results on Speaker Independent Word Spotting give 61% correct detection in clean speech, and 44% when Gaussian noise is added for a Signal-to-Noise ratio of 10 dB, with a 20-word vocabulary (1 to 3 syllables long), the false alarm rate being set to 10 false alarms per hour [20].
A syntax can be used during the recognition process. The syntax represents the word sequences that are allowed for the language corresponding to the task using speech recognition. The role of the syntax is to determine which words can follow a given word (sub-vocabulary), thus accelerating the recognition process by reducing the size of the vocabulary to be recognized at each step, and improving the performance by possibly eliminating words that are acoustically similar, but do not belong to the same sub-vocabulary, and thus do not compete. Introduction of the grammar into the search procedure may be more or less difficult, depending on the CWR-DTW algorithm used. Most of the syntaxes used in DTW-type systems correspond to simple command languages (regular, or context-free grammars introduced manually by the system user).
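As an illustration of the role of the syntax, here is a toy sketch in which a regular command grammar is stored as a successor table; the words and the table are invented for the example:

```python
# Successor table for a toy command language (invented example):
# after each word, only the listed sub-vocabulary competes.
SYNTAX = {
    "<start>": ["call", "dial"],
    "call":    ["john", "mary"],
    "dial":    ["one", "two", "three"],
}

def active_vocabulary(previous_word):
    """Sub-vocabulary allowed to follow `previous_word`; the CWR-DTW
    search then only matches references from this reduced set."""
    return SYNTAX.get(previous_word, [])
```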
It appears that the better the training, the better the recognition. In order to improve training, several techniques have been tried, like "embedded training", already mentioned; multireference training, where several references of a word are kept, using the same clustering techniques for representing intra-speaker variations as those used for inter-speaker variations in multi-reference speaker-independent recognition; and robust training [131].
`
- Small Vocabularies to Large Vocabularies:
`Increasing the size of the vocabulary raises several problems:
`since each word is represented by its spectrogram, necessary memory size
`gets very large. Since matching between the test pattern and the reference
`patterns is done sequentially, computation time also greatly increases. If a
speaker has to train the system by pronouncing all of the words, the task
`rapidly becomes tedious. A large vocabulary has the consequence of many
`acoustically similar words, thus increasing the error rate. This also implies
`that the speaker will want to use a natural way of speaking, without strong
`syntax constraints. To address these problems, there have been several
`improvements:
- Vector Quantization [53, 49, 97]: In the domain of Speech Processing, this method was first used for low bit rate speech coding [91]. Considering a reasonable amount of speech pronounced by one speaker, the method consists of computing the distances (like the Euclidean distance) between each vector of the corresponding spectrogram, and using a clustering algorithm, as described below.
`
`
Figure 1: Example of Dynamic Time Warping between two speech patterns (the word "Paris" represented by a schematic spectrogram).
G is the distance measure between the two utterances of the word. d(i,j) is the distance between two frames of the reference and test patterns at instants i and j. An example of a local DP equation is given. The optimal path is represented by squares. The cumulated distances involved in the computation of the cumulated distance D(i,j) are represented by circles.
- Speech and AI: the ARPA-SUR project
A different approach, mainly based on "Artificial Intelligence" techniques, was initiated in 1971, in the framework of the ARPA-SUR project [114]. The idea behind it was that the use of "upper level" knowledge (lexicon, syntax, semantics, pragmatics) could produce an acceptable recognition rate, even if the initial phoneme recognition rate was poor [70]. The task was speaker-dependent, continuous speech recognition (improperly called "understanding" because upper levels were used), with a 1,000-word vocabulary. Several systems were delivered at the end of the project, in 1976. From CMU, the DRAGON system [4] was designed, using a Markov approach. The HEARSAY I and HEARSAY II systems were based on the use of a Blackboard Model, where each Knowledge Source can read and write information during the decoding, with a heuristic strategy, and the HARPY system merged parts of the DRAGON and HEARSAY systems. From BBN, the SPEECHLIS and HWIM systems were developed. SDC also produced a system [76]. Although the initial requirements (which were in fact rather vague) were attained by at least one system (HARPY), the systems needed so much computer power, at a time when this was expensive, and were so cumbersome to use and so non-robust, that there was no follow-up. In fact, one of the major conclusions was that there was a need for better acoustic-phonetic decoding [70]!
Improvements along each of the 3 axes
From the basic IWR method, progress has been made which independently addresses the three different problems: size of the population using the system, speaking rate, and size of the vocabulary.
- Speaker-Dependent (SD) to Speaker-Independent (SI):
In order to allow any speaker to use a recognition system, a multi-reference approach has been experimented with. Each word of the vocabulary is pronounced by a large population, male and female, with different timbres and different dialectal origins. The distance between the different pronunciations of the same word is computed using DTW. A clustering algorithm is then applied, as described above (see Figure 2).
`
`430
`
`Authorized licensed use limited to: Callie Pendergrass. Downloaded on October 14,2022 at 16:01:50 UTC from IEEE Xplore. Restrictions apply.
`
`IPR2023-00035
`Apple EX1013 Page 2
`
`
`
Returning to Vector Quantization: the clustering algorithm determines clusters corresponding to a type of vector, each represented by its centroid (called "prototype" or "codeword"); the set of prototypes is a "codebook". In the training phase, after acoustic processing of the word, each spectrum is recognized as being one of the prototypes of the codebook. Thus, instead of being represented by a sequence of vectors, the word will be represented by a sequence of numbers (also called labels) corresponding to the prototypes. A distortion measure can be obtained by computing the average distance between the incoming vector and the closest prototype. On a practical level, if the size of the codebook is 256, or less (it is addressable on one byte), and each vector component is coded on one byte, the reduction of information is equal to the dimension of the vectors. Also, computing time is saved during recognition for large vocabularies since, for each input vector of the test pattern, only 256 distances have to be computed, instead of computing the distances with all the vectors of all the reference templates. Moreover, the distances between prototypes can be computed after training, and kept in a distance matrix. Those codebooks concern not only spectral information, but also energy, or variation of spectral information or of energy in time. All this can be represented by a single codebook with supervectors, constructed by including the different kinds of information. It can also be represented by a different codebook for each type of information. This approach was applied with success to speaker identification [181], and to speech recognition [55]. The codebooks can also be constructed from the speech of several speakers (speaker-independent codebooks) [83].
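A minimal sketch of codebook construction and of the coding step, assuming numpy and plain K-means with Euclidean distances (256 codewords so that each label fits on one byte, as noted above); initialisation and stopping criteria are simplified:

```python
import numpy as np

def train_codebook(vectors, size=256, iters=25, seed=0):
    """K-means codebook: `vectors` is an (n, d) array of training
    spectra from one speaker; returns (size, d) prototypes."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size, replace=False)].astype(float)
    for _ in range(iters):
        # Label each training vector with its closest prototype.
        dists = np.linalg.norm(vectors[:, None] - codebook[None], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(size):
            if (labels == k).any():
                codebook[k] = vectors[labels == k].mean(axis=0)
    return codebook

def quantize(pattern, codebook):
    """Replace each spectral vector by the label (one byte) of its
    closest prototype; the mean distance is the distortion measure."""
    dists = np.linalg.norm(pattern[:, None] - codebook[None], axis=2)
    labels = dists.argmin(axis=1)
    distortion = dists[np.arange(len(pattern)), labels].mean()
    return labels.astype(np.uint8), distortion
```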
It should be noted that similar methods have been used previously (Centisecond Model) [88]. The problem at that time was that the vectors had been labelled with a linguistic label (a phoneme), thus making a decision too early. The Vector Quantization scheme inspired much thought. One remark was that each word could have a specific set of prototypes, without taking into account the chronological sequence of those prototypes. Even if some words contain the same phonemes in a different order, the transitions between those phonemes are different, and the prototypes corresponding to those transitions may be different, the latter making the distinction between words. During training, a codebook is built for each reference word. The recognition process then consists of simply recognizing the incoming vectors, and choosing the reference which gives the smallest average distortion with the test [126]. A refined approach consisted of segmenting the words into multiple sections, in order to partly reflect time sequencing for words having several phonemes in common [26]. This refinement increases the computation time, without giving better results than the DTW-based approach does.
- Sub-word units: Another way to reduce the memory requirement is to use decision units that are shorter than the words (also called subword units). The words will then be recognized as the concatenation of such units, using a Connected "Word" DTW algorithm. These units must be chosen so that they are not too affected by the coarticulation problem at their boundaries. But they also should not be too numerous. Examples of such units are phonemes [163], diphones [98, 148, 32, 2, 149], syllables [59, 168, 48], demi-syllables [144, 139], and disyllables [159].
Graphemic word: émigrante
Phonemic word: $emigRãt$
phonemes: $ e m i g R ã t $
diphones: $e em mi ig gR Rã ãt t$
syllables: $e mi gRãt$
demi-syllables: $e em mi ig gRã ãt t$
disyllables: $e emi igRã ãt$
Figure 3: Representation of a word by subword units
(The word is "émigrante" ("emigrant") in French; $ stands for silence.)
Other approaches tend to use units with no linguistic affiliation, for example segments, obtained by a segmentation algorithm. This approach led to Segment (or Matrix) Quantization, very similar to Vector Quantization, except that the distance between segment prototypes may need time alignment, if the segments do not have a constant length.
`
`
- Time compression: Time compression can also reduce the amount of information [75,46]. The idea is to compress (linearly, or non-linearly) the steady states, which may have very different lengths depending on speaking rate, while keeping all the vectors during the transitions, thus mapping from the time space to the variation space. An algorithm like the VLTS (Variable Length Trace Segmentation) [46] halves the amount of information used. It also obtains better results when the pronunciation rate is very different between training and recognition (some often-used DTW equations, for example, do not accept speaking rate variations of more than a 2-to-1 ratio, which is easily reached between isolated word pronunciation and continuous speech). However, if duration itself carries meaning, that information may be lost.
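A hypothetical sketch of the idea, using a simple thresholding rule rather than the actual VLTS algorithm (which segments the trace into portions of equal cumulated spectral variation):

```python
import numpy as np

def compress_steady_states(frames, threshold):
    """Keep a frame only when the spectrum has moved by more than
    `threshold` since the last kept frame: steady states collapse
    while transitions are kept intact.  Note that duration
    information is lost, as stated above."""
    kept = [0]
    for t in range(1, len(frames)):
        if np.linalg.norm(frames[t] - frames[kept[-1]]) > threshold:
            kept.append(t)
    return frames[np.array(kept)]
```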
`
- Two-pass recognition: In order to accelerate recognition, it can be processed in two passes: first there can be a rough, but fast match aimed at eliminating words of the vocabulary that are very different from the test pattern, before applying an optimal match (DTW or Viterbi) on the remaining reduced subvocabulary. In this case, the goal is not to get just the correct word, but to eliminate as many word-candidates as possible (without eliminating the right one, of course). Simple approaches like summing the distances on the diagonal of the distance matrix used for DTW [48] have been tried. Other approaches are based on Vector Quantization without time alignment, the system being based on Pattern Matching [99] or on Stochastic Modeling (called "Poisson Polling") [51]. Using a phonetic classifier, based on broad [45] or usual [16] phonetic classes, and matching the recognized phoneme lattice with the reference phonemic words in the lexicon by DTW, is another reported method.
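A sketch of the simplest fast match mentioned above, summing frame distances along the (linearly resampled) diagonal of the distance matrix; the shortlist size is an invented parameter:

```python
import numpy as np

def diagonal_score(test, ref):
    """Rough, fast match: frame-by-frame distance along the linearly
    resampled diagonal of the distance matrix, with no time warping."""
    idx = np.linspace(0, len(ref) - 1, num=len(test)).astype(int)
    return np.linalg.norm(test - ref[idx], axis=1).sum() / len(test)

def fast_match(test, references, shortlist=10):
    """First pass: keep only the `shortlist` closest candidates; the
    optimal match (DTW or Viterbi) then runs on this reduced set.
    `references` is a list of (label, pattern) pairs."""
    scores = sorted((diagonal_score(test, ref), label)
                    for label, ref in references)
    return [label for _, label in scores[:shortlist]]
```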
- Speaker Adaptation: The adaptation of one speaker's references to a new speaker can be performed through their respective codebooks, if a Vector Quantization scheme is used. The reference speaker produces several sentences, which are vector quantized with his codebook. The new speaker produces the same sentences, which are vector quantized with his own codebook. Time alignment of the two sets of sentences creates a mapping between the two codebooks. This basic method has several variants [155,21,43].
Most of the progress related to these techniques has been obtained on one aspect of the problem. Some systems addressing two aspects can also be found, like the Conversant system from AT&T [152], which allows for speaker-independent connected digit recognition over telephone lines using a multireference CWR-DTW approach. Further advances have been obtained by using more elaborate techniques: Hidden Markov Models, and Connectionist Models.
`
`The Hidden Markov Model approach
Whereas in the previous pattern matching approach, a reference was represented by the pattern itself, which was stored in memory, the Markov Model approach carries a higher level of abstraction, representing the reference by a model [125,135]. To be recognized, the input is thus compared to the reference models. The first uses of this approach for speech recognition can be found at CMU [4], IBM [62] and, apparently, IDA [124].
In a stochastic approach, if we consider an acoustic signal A, the recognition process can be described as computing the probability P(W|A) that any word string (or sentence) W corresponds to the acoustic signal A, and as finding the word string having the maximum probability. Using Bayes' rule, P(W|A) can be represented as:
P(W|A) = P(W) . P(A|W) / P(A)
where P(W) is the probability of the word string W, P(A|W) is the probability of the acoustic signal A, given the word string W, and P(A) is the probability of the acoustic signal (which does not depend on W). Thus it is necessary to take into account P(A|W) (which is the acoustic model), and P(W) (which is the language model). Both models can be represented as Markov models [6]. We will first consider Acoustic Modeling.
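In log-probability form, the decoding rule can be sketched as follows; the two scoring functions stand for the acoustic and language models and are assumptions for the illustration:

```python
def decode(A, candidates, log_p_acoustic, log_p_language):
    """Return the word string W maximising P(W|A).  Since P(A) does
    not depend on W, it is enough to maximise
    log P(W) + log P(A|W)."""
    return max(candidates,
               key=lambda W: log_p_language(W) + log_p_acoustic(A, W))
```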
- Basic discrete approach: Here each acoustic entity to be recognized, each reference word for example, is represented by a finite state machine, also called a Markov machine, composed of states, and of arcs between states. A transition probability a_ij is attached to the arc going from state i to state j, representing the probability that this arc could be taken. The sum of the transition probabilities attached to the arcs issued from a given state i is equal to 1. There is also an output probability b_ij(k) that a symbol k from a finite alphabet can be emitted when the arc from state i to state j is taken. In some variants, this output probability is attached to the state, not to the arc. When Vector Quantization is used, this output probability distribution (also called output probability density function (pdf)) is the probability distribution of the prototypes. The sum of the probabilities in the distribution is also equal to 1 (Figure 4). In a first-order Hidden Markov Model, it is assumed that the probability that the Markov chain is in a particular state at time t depends only on the state where it was at time t-1, and that the output probability at time t depends only on the arc being taken at time t.
`
Figure 4: An example of a Hidden Markov Model
The output probability distributions b_ij(k) are enclosed in rectangles. a_ij is the transition probability. This left-to-right model has 3 states and 4 arcs.
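A minimal encoding of such a discrete HMM, assuming a left-to-right topology similar to Figure 4; all probability values are invented placeholders:

```python
import numpy as np

# Transition probabilities a[i, j]; each row sums to 1.
a = np.array([[0.3, 0.7, 0.0],
              [0.0, 0.4, 0.6],
              [0.0, 0.0, 1.0]])

# Output probability distributions b[i, j, k] over a K-symbol
# alphabet, attached to the arcs; each sums to 1 (uniform here).
K = 4
b = np.full((3, 3, K), 1.0 / K)

# The two constraints stated in the text.
assert np.allclose(a.sum(axis=1), 1.0)
assert np.allclose(b.sum(axis=2), 1.0)
```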
`
`431
`
`Authorized licensed use limited to: Callie Pendergrass. Downloaded on October 14,2022 at 16:01:50 UTC from IEEE Xplore. Restrictions apply.
`
`IPR2023-00035
`Apple EX1013 Page 3
`
`
`
- Continuous models: We have just presented what are usually called "Discrete Hidden Markov Models". Another type of Markov Model is the "Continuous Markov Model". In this case, the discrete output pdf on one arc is replaced by a model of the continuous spectrum on that arc. A common model is the multivariate Gaussian density [120], which describes the pdf by a mean vector and a covariance matrix (possibly diagonal). The use of a multivariate Gaussian mixture density seems to be more appropriate [135,66,137,122]. The Laplacian mixture density seems to allow for good quality results, with reduced computation time [118]. Several attempts to compare discrete and continuous HMMs have been reported. It seems that only complex continuous models allow for better results than discrete ones, reflecting the fact that with the usual Maximum Likelihood training, the complete model should be correct to allow for good recognition results [7]. But complex continuous models need a good deal of computation.
`
from the distance measure, thus defining similar prototypes. If the output probability of a prototype is null on an arc, it can be smoothed with the non-null probability of a similar prototype [110]. A third method is the co-occurrence smoothing [82], which smooths on all the arcs the probabilities of labels that sometimes appear on the same arcs.
In order to smooth the estimates of the different methods, it is necessary to apply weights to the different estimates. Those weights will reflect the quality of each estimate, or the quantity of information used to calculate each of them. A method to automatically determine those weights is the deleted interpolation estimation, which splits the estimates on two arcs, and defines the weights as the transition probabilities of the arcs, as computed by the Forward-Backward algorithm [63].
- Time modeling: The modeling of time in a Markov model is contained in the probabilities of the arcs. It appears that the probability of staying in a given state will decrease as a power of the probability of following the arc looping on that state, which seems to be a poor time model in the case of the speech signal. Several attempts to improve that issue can be found.
In the Semi-Hidden Markov Model [44,145], a set of probability density functions P_i(d) at each state i indicates the probability of staying in that state for a given duration, d. This set of probabilities is trained together with the transition and output probabilities by using a modified Forward-Backward algorithm. A simpler approach is to independently train the duration probability and the HMM parameters [134].
To allow for a more easily trainable model, continuous probability density functions can be used for duration modeling, like the Poisson distribution [145] or the gamma distribution, used by S. Levinson in his Continuously Variable Duration Hidden Markov Model (CVDHMM) [85].
Another way of indirectly taking time into account is to include the dynamics of the spectrum as a new parameter. It can be represented by the differenced Cepstrum coefficients corresponding to adjacent frames, and can also include the differenced power. After Vector Quantization, multiple codebooks for those new parameters are built. They are introduced in the HMM with independent output pdfs on the arcs [83].
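A sketch of such differenced parameters, assuming simple frame differences over a window of +/- 2 frames (the window size is an assumption); each resulting stream would get its own codebook, as described above:

```python
import numpy as np

def differenced_features(cepstra, spread=2):
    """Differenced Cepstrum: change of each coefficient over a window
    of +/- `spread` frames, approximating the spectral dynamics.  The
    same operation on the power track gives the differenced power."""
    padded = np.pad(cepstra, ((spread, spread), (0, 0)), mode="edge")
    return padded[2 * spread:] - padded[:-2 * spread]
```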
`
- Decision Units:
The natural idea is to model a word with an HMM. An example of a Markov word machine, from R. Bakis [62], is given in Figure 5. The number of states in the word model is equal to the average duration of the word (50 states for a 500 ms word, with a frame every 10 ms). It should be noted that the model includes the frame deletion and insertion phenomena previously detected during DTW. More recently, models with fewer states have been successfully tried [133]. The problem is that to get a good model of the word, there should be a large number of pronunciations of that word. Also, the recognition process considers a word as a whole, and does not focus on the information discriminating two acoustically similar words.
`
The number of states, the number of arcs, and the initial and final states for each arc are chosen by the system designer. The parameters of the model (transition probabilities, and output probabilities) have to be obtained through training. Three problems have to be addressed:
- the Evaluation problem (what is the probability that a sequence of labels has been produced by a given model?). This can be solved by using the Forward algorithm, which gives the Maximum Likelihood estimation that the sequence was produced by the model.
- the Decoding problem (which sequence of states has produced the sequence of labels?). This can be solved by the Viterbi algorithm, which is very similar to DTW [166].
- the Learning (or training) problem (how to get the parameters of the model, given a sequence of labels?). This can be solved by the Forward-Backward (also called Baum-Welch) algorithm [12], when the training is based on Maximum Likelihood.
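For the Evaluation problem, a minimal sketch of the Forward recursion, assuming output probabilities attached to the states (one of the variants mentioned above); replacing the summation by a maximisation, and keeping back-pointers, yields the Viterbi decoding:

```python
import numpy as np

def forward(a, b, labels, initial):
    """Evaluation problem: probability that the label sequence was
    produced by the model.  a: (N, N) transition probabilities,
    b: (N, K) output probabilities per state, initial: (N,) start
    distribution.  alpha[i] is the probability of having emitted the
    labels so far and being in state i."""
    alpha = initial * b[:, labels[0]]
    for k in labels[1:]:
        alpha = (alpha @ a) * b[:, k]   # sum over predecessor states
    return alpha.sum()
```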
- Training:
`- Initialisation: Initialisation of the parameters in the model has to be
`carried out before starting