`3.4 Neural response
`3.5 Psychophysical measurements
`3.6 Analysis of simple and complex signals
`3.7 Models of the auditory system
`3.7.1 Mechanical filtering
`3.7.2 Models of neural transduction
`3.7.3 Higher-level neural processing
`Chapter 3 summary
`Chapter 3 exercises
`4 Digital Coding of Speech
`4.2 Simple waveform coders
`4.2.1 Pulse code modulation
`4.2.2 Deltamodulation
`4.3 Analysis/synthesis systems (vocoders)
`4.3.1 Channel vocoders
`4.3.2 Sinusoidal coders
`4.3.3 LPC vocoders
`4.3.4 Formant vocoders
`4.3.5 Efficient parameter coding
`4.3.6 Vocoders based on segmental/phonetic structure
`4.4 Intermediate systems
`4.4.1 Sub-band coding
`4.4.2 Linear prediction with simple coding of the residual
`4.4.3 Adaptive predictive coding
`4.4.4 Multipulse LPC
`4.4.5 Code-excited linear prediction
`4.5 Evaluating speech coding algorithms
`4.5.1 Subjective speech intelligibility measures
`4.5.2 Subjective speech quality measures
`4.5.3 Objective speech quality measures
`4.6 Choosing a coder
`Chapter 4 summary
`Chapter 4 exercises
`5 Message Synthesis from Stored Human Speech Components
`5.1
`5.2 Concatenation of whole words
`5.2.1 Simple waveform concatenation
`5.2.2 Concatenation of vocoded words
`5.2.3 Limitations of concatenating word-size units
Apple EX1016 Page 5


`5.3 Concatenation of sub-word units: general principles
`5.3.1 Choice of sub-word unit
`5.3.2 Recording and selecting data for the units
`5.3.3 Varying durations of concatenative units
`5.4 Synthesis by concatenating vocoded sub-word units
`5.5 Synthesis by concatenating waveform segments
`5.5.1 Pitch modification
`5.5.2 Timing modification
`5.5.3 Performance of waveform concatenation
`5.6 Variants of concatenative waveform synthesis
`5.7 Hardware requirements
`Chapter 5 summary
`Chapter 5 exercises
`6 Phonetic synthesis by rule
`6.2 Acoustic-phonetic rules
`6.3 Rules for formant synthesizers
`6.4 Table-driven phonetic rules
`6.4.1 Simple transition calculation
`6.4.2 Overlapping transitions
`6.4.3 Using the tables to generate utterances
`6.5 Optimizing phonetic rules
`6.5.1 Automatic adjustment of phonetic rules
`6.5.2 Rules for different speaker types
`6.5.3 Incorporating intensity rules
`6.6 Current capabilities of phonetic synthesis by rule
`Chapter 6 summary
`Chapter 6 exercises
`7 Speech Synthesis from Textual or Conceptual Input
`7.1
`7.2 Emulating the human speaking process
`7.3 Converting from text to speech
`7.3.1 TTS system architecture
`7.3.2 Overview of tasks required for TTS conversion
`7.4 Text analysis
`7.4.1 Text pre-processing
`7.4.2 Morphological analysis
`7.4.3 Phonetic transcription
`7.4.4 Syntactic analysis and prosodic phrasing
`7.4.5 Assignment of lexical stress and pattern of word accents












`Introduction to Automatic Speech
`Recognition: Template Matching
`Much of the early work on automatic speech recognition (ASR), starting in the
`1950s, involved attempting to apply rules based either on acoustic/phonetic
`knowledge or in many cases on simple ad hoc measurements of properties of the
`speech signal for different types of speech sound. The intention was to decode the
`signal directly into a sequence of phoneme-like units. These early methods,
`extensively reviewed by Hyde (1972), achieved very little success. The poor results
`were mainly because co-articulation causes the acoustic properties of individual
`phones to vary very widely, and any rule-based hard decisions about phone identity
`will often be wrong if they use only local information. Once wrong decisions have
`been made at an early stage, it is extremely difficult to recover from the errors later.
`An alternative to rule-based methods is to use pattern-matching techniques.
`Primitive pattern-matching approaches were being investigated at around the same
`time as the early rule-based methods, but major improvements in speech recognizer
`performance did not occur until more general pattern-matching techniques were
`invented. This chapter describes typical methods that were developed for spoken
`word recognition during the 1970s. Although these methods were widely used in
`commercial speech recognizers in the 1970s and 1980s, they have now been largely
`superseded by more powerful methods (to be described in later chapters), which
`can be understood as a generalization of the simpler pattern-matching techniques
`introduced here. A thorough understanding of the principles of the first successful
`pattern-matching methods is thus a valuable introduction to the later techniques.
`When a person utters a word, as we saw in Chapter 1, the word can be considered
`as a sequence of phonemes (the linguistic units) and the phonemes will be realized
`as phones. Because of inevitable co-articulation, the acoustic patterns associated
`with individual phones overlap in time, and therefore depend on the identities of
`their neighbours. Even for a word spoken in isolation, therefore, the acoustic
`pattern is related in a very complicated way to the word's linguistic structure.
`However, if the same person repeats the same isolated word on separate
`occasions, the pattern is likely to be generally similar, because the same phonetic
`relationships will apply. Of course, there will probably also be differences, arising
`from many causes. For example, the second occurrence might be spoken faster or
`more slowly; there may be differences in vocal effort; the pitch and its variation
`during the word could be different; one example may be spoken more precisely


`Speech Synthesis and Recognition
`than the other, etc. It is obvious that the waveform of separate utterances of the
`same word may be very different. There are likely to be more similarities between
`spectrograms because (assuming that a short time-window is used, see Section 2.6),
`they better illustrate the vocal-tract resonances, which are closely related to the
`positions of the articulators. But even spectrograms will differ in detail due to the
`above types of difference, and timescale differences will be particularly obvious.
`A well-established approach to ASR is to store in the machine example
`acoustic patterns (called templates) for all the words to be recognized, usually
`spoken by the person who will subsequently use the machine. Any incoming word
`can then be compared in turn with all words in the store, and the one that is most
`similar is assumed to be the correct one. In general none of the templates will match
`perfectly, so to be successful this technique must rely on the correct word being
`more similar to its own template than to any of the alternatives.
`It is obvious that in some sense the sound pattern of the correct word is likely
`to be a better match than a wrong word, because it is made by more similar
`articulatory movements. Exploiting this similarity is, however, critically dependent
`on how the word patterns are compared, i.e. on how the 'distance' between two
`word examples is calculated. For example, it would be useless to compare
`waveforms, because even very similar repetitions of a word will differ appreciably
`in waveform detail from moment to moment, largely due to the difficulty of
`repeating the intonation and timing exactly.
`It is implicit in the above comments that it must also be possible to identify
`the start and end points of words that are to be compared.
`In this section we will consider the problem of comparing the templates with the
`incoming speech when we know that corresponding points in time will be
`associated with similar articulatory events. In effect, we appear to be assuming that
`the words to be compared are spoken in isolation at exactly the same speed, and
`that their start and end points can be reliably determined. In practice these
`assumptions will very rarely be justified, and methods of dealing with the resultant
`problems will be discussed later in the chapter.
`In calculating a distance between two words it is usual to derive a short-term
`distance that is local to corresponding parts of the words, and to integrate this
`distance over the entire word duration. Parameters representing the acoustic signal
`must be derived over some span of time, during which the properties are assumed
`not to change much. In one such span of time the measurements can be stored as a
`set of numbers, or feature vector, which may be regarded as representing a point
`in multi-dimensional space. The properties of a whole word can then be described
`as a succession of feature vectors (often referred to as frames), each representing a
`time slice of, say, 10-20 ms. The integral of the distance between the patterns then
`reduces to a sum of distances between corresponding pairs of feature vectors. To be
`useful, the distance must not be sensitive to small differences in intensity between
`otherwise similar words, and it should not give too much weight to differences in
`pitch. Those features of the acoustic signal that are determined by the phonetic
`properties should obviously be given more weight in the distance calculation.
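The frame-by-frame comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the squared Euclidean frame distance and the names frame_distance and word_distance are ours, not a metric prescribed by the text, and the two patterns are assumed to have the same number of frames so that frames at the same position correspond.

```python
def frame_distance(a, b):
    """Squared Euclidean distance between two feature vectors (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def word_distance(word_a, word_b):
    """Sum the frame distances over the word, pairing frames at the same position."""
    if len(word_a) != len(word_b):
        raise ValueError("this linear comparison needs equal-length patterns")
    return sum(frame_distance(a, b) for a, b in zip(word_a, word_b))


# Made-up two-channel feature vectors, three frames per word.
template = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]]
example = [[1.0, 2.5], [2.0, 3.0], [2.0, 1.0]]
print(word_distance(template, example))
```

The equal-length requirement is exactly the assumption that the text says will very rarely be justified in practice.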




`spacings are roughly equal to those of critical bands and whose range of centre
`frequencies covers the frequencies most important for speech perception (say from
`300 Hz up to around 5 kHz). The total number of band-pass filters is therefore not
`likely to be more than about 20, and successful results have been achieved with as
`few as 10. When the necessary time-smoothing is included, the feature vector will
`represent the signal power in the filters averaged over the frame interval.
`The usual name for this type of speech analysis is filter-bank analysis.
`Whether it is provided by a bank of discrete filters, implemented in analogue or
`digital form, or is implemented by sampling the outputs from short-term Fourier
`transforms, is a matter of engineering convenience. Figure 8.1 displays word
`patterns from a typical 10-channel filter-bank analyser for two examples of one
`word and one example of another. It can be seen from the frequency scales that the
`channels are closer together in the lower-frequency regions.
`A consequence of removing the effect of the fundamental frequency and of
`using filters at least as wide as critical bands is to reduce the amount of information
`needed to describe a word pattern to much less than is needed for the waveform.
`Thus storage and computation in the pattern-matching process are much reduced.
`8.3.2 Level normalization
`Mean speech level normally varies by a few dB over periods of a few seconds, and
`changes in spacing between the microphone and the speaker's mouth can also cause
`changes of several dB. As these changes will be of no phonetic significance, it is
`desirable to minimize their effects on the distance metric. Use of filter-bank power
`directly gives most weight to more intense regions of the spectrum, where a change
`of 2 or 3 dB will represent a very large absolute difference. On the other hand, a
`3 dB difference in one of the weaker formants might be of similar phonetic
`significance, but will cause a very small effect on the power. This difficulty can be
`avoided to a large extent by representing the power logarithmically, so that similar
`power ratios have the same effect on the distance calculation whether they occur in
`intense or weak spectral regions. Most of the phonetically unimportant variations
`discussed above will then have much less weight in the distance calculation than the
`differences in spectrum level that result from formant movements, etc.
`Although comparing levels logarithmically is advantageous, care must be
`exercised in very low-level sounds, such as weak fricatives or during
`stop-consonant closures. At these times the logarithm of the level in a channel will
`depend more on the ambient background noise level than on the speech signal. If
`the speaker is in a very quiet environment the logarithmic level may suffer quite
`wide irrelevant variations as a result of breath noise or the rustle of clothing. One
`way of avoiding this difficulty is to add a small constant to the measured level
`before taking logarithms. The value of the constant would be chosen to dominate
`the greatest expected background noise level, but to be small compared with the
`level usually found during speech.
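The effect of adding a small constant before taking logarithms can be illustrated numerically. In this sketch the channel powers, the value of floor_constant and the function name log_channel_levels are all hypothetical, chosen only to show the behaviour; a real system would choose the constant from measured noise and speech levels.

```python
import math


def log_channel_levels(powers, floor_constant):
    """Return channel levels in dB after adding a small constant to each power."""
    return [10.0 * math.log10(p + floor_constant) for p in powers]


# Hypothetical channel powers (arbitrary linear units) for two near-silent frames.
quiet_a = [1e-6, 2e-6, 1e-6]
quiet_b = [4e-6, 1e-6, 8e-6]
floor_constant = 1e-3  # assumed: above the background noise, far below speech power

# Without the constant, irrelevant noise fluctuations give differences of several dB.
raw_diff = [abs(10 * math.log10(a) - 10 * math.log10(b))
            for a, b in zip(quiet_a, quiet_b)]

# With the constant, the same fluctuations almost vanish.
floored_diff = [abs(x - y) for x, y in zip(log_channel_levels(quiet_a, floor_constant),
                                           log_channel_levels(quiet_b, floor_constant))]
print(max(raw_diff), max(floored_diff))
```

For a channel whose power is far larger than the constant, the added term changes the logarithmic level by a negligible amount, so speech frames are left essentially unaltered.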
`Differences in vocal effort will mainly have the effect of adding a constant to
`all components of the log spectrum, rather than changing the shape of the spectrum
`cross-section. Such differences can be made to have no effect on the distance
`metric by subtracting the mean of the logarithm of the spectrum level of each frame










`consider values of D(i-1,j) or D(i-
`(As the scheme is symmetrical we could equally well have chosen the horizontal
`direction instead.) When the first
`column values for D(l,j) are known, Equation (8.2) can be applied successively to
`calculate D(i,j) for columns 2 to n. The value obtained for D(n, N) is the score for
`the best way of matching the two words. For simple speech recognition
`applications, just the final score is required, and so the only working memory
`needed during the calculation is a one-dimensional array for holding a column ( or
`row) of D(i,j) values. However, there will then be no record at the end of what the
`optimum path was, and if this information is required for any purpose it is also
`necessary to store a two-dimensional array of back-pointers, to indicate which
`direction was chosen at each stage. It is not possible to know until the end has been
`reached whether any particular point will lie on the optimum path, and this
`information can only be found by tracing back from the end.
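The column-by-column calculation with back-pointers might be sketched as follows. We assume here that the basic recurrence takes the form D(i,j) = d(i,j) + min[D(i-1,j), D(i-1,j-1), D(i,j-1)]; the function name dtw_with_traceback is ours, indices are 0-based rather than 1-based, and for clarity the whole two-dimensional array of D values is kept even though only one column is strictly needed for the score.

```python
def dtw_with_traceback(word_a, word_b, dist):
    """Symmetric DP alignment; returns (final score, optimum path as (i, j) pairs)."""
    n, big_n = len(word_a), len(word_b)
    inf = float("inf")
    D = [[inf] * big_n for _ in range(n)]
    back = [[None] * big_n for _ in range(n)]  # back-pointer to the chosen predecessor
    for i in range(n):
        for j in range(big_n):
            d = dist(word_a[i], word_b[j])
            if i == 0 and j == 0:
                D[i][j] = d  # start of both words
                continue
            candidates = []
            if i > 0:
                candidates.append((D[i - 1][j], (i - 1, j)))          # horizontal
            if i > 0 and j > 0:
                candidates.append((D[i - 1][j - 1], (i - 1, j - 1)))  # diagonal
            if j > 0:
                candidates.append((D[i][j - 1], (i, j - 1)))          # vertical
            best, back[i][j] = min(candidates)
            D[i][j] = best + d
    # The optimum path is only known once the end is reached: trace back from there.
    path, point = [], (n - 1, big_n - 1)
    while point is not None:
        path.append(point)
        point = back[point[0]][point[1]]
    path.reverse()
    return D[n - 1][big_n - 1], path


# Scalar "frames" keep the example small; real frames would be feature vectors.
score, path = dtw_with_traceback([1, 2, 3, 3], [1, 1, 2, 3], lambda a, b: abs(a - b))
print(score, path)
```

If only the score is wanted, the array D can be replaced by a single column and the back-pointer array dropped altogether.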
`The DP algorithm represented by Equation (8.2) is intended to deal with variations
`of timescale between two otherwise similar words. However, if two examples of a
`word have the same length but one is spoken faster at the beginning and slower at
`the end, there will be more horizontal and vertical steps in the optimum path and
`fewer diagonals. As a result there will be a greater number of values of d(i, j) in the
`final score for words with timescale differences than when the timescales are the
`same. Although it may be justified to have some penalty for timescale distortion, on
`the grounds that an utterance with a very different timescale is more likely to be the
`wrong word, it is better to choose values of such penalties explicitly than to have
`them as an incidental consequence of the algorithm. Making the number of
`contributions of d(i,j) to D(n, N) independent of the path can be achieved by
`modifying Equation (8.2) to add twice the value of d(i,j) when the path is diagonal.
`One can then add an explicit penalty to the right-hand side of Equation (8.2) when
`the step is either vertical or horizontal. Equation (8.2) thus changes to:
`D(i,j) = min[D(i-1,j) + d(i,j) + hdp,
`             D(i-1,j-1) + 2d(i,j),
`             D(i,j-1) + d(i,j) + vdp].          (8.3)
`Suitable values for the horizontal and vertical distortion penalties, hdp and vdp,
`would probably have to be found by experiment in association with the chosen
`distance metric. It is, however, obvious that, all other things being equal, paths with
`appreciable timescale distortion should be given a worse score than diagonal paths,
`and so the values of the penalties should certainly not be zero.
`Even in Equation (8.3) the number of contributions to a cumulative distance
`will depend on the lengths of both the example and the template, and so there will
`be a tendency for total distances to be smaller with short templates and larger with
`long templates. The final best-match decision will as a result favour short words.
`This bias can be avoided by dividing the total distance by the template length.
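A minimal sketch of Equation (8.3), followed by division by the template length, is given below. The function name dtw_score is ours, and the treatment of the starting point (counting d(1,1) twice, as if it were reached by a diagonal step) is an assumption made so that every path receives the same number of contributions. Only one column of D is stored, so the optimum path itself is not recoverable.

```python
def dtw_score(input_frames, template, dist, hdp, vdp):
    """Cumulative distance D(n, N) with doubled diagonals and explicit penalties."""
    inf = float("inf")
    prev = None  # previous column of D values
    for i, frame in enumerate(input_frames):
        col = [inf] * len(template)
        for j, ref in enumerate(template):
            d = dist(frame, ref)
            if i == 0 and j == 0:
                col[j] = 2 * d  # assumed start: weight the first point like a diagonal
                continue
            best = inf
            if prev is not None:
                best = min(best, prev[j] + d + hdp)        # horizontal step
                if j > 0:
                    best = min(best, prev[j - 1] + 2 * d)  # diagonal step
            if j > 0:
                best = min(best, col[j - 1] + d + vdp)     # vertical step
            col[j] = best
        prev = col
    return prev[-1]


def abs_diff(a, b):
    return abs(a - b)


# Scalar "frames" for brevity; the penalty values are arbitrary illustrations.
word, template = [1, 3], [1, 2, 3]
score = dtw_score(word, template, abs_diff, hdp=0.5, vdp=0.5)
normalized = score / len(template)  # remove the bias towards short templates
print(score, normalized)
```

With both penalties set to zero the same pair of patterns gives a lower score, which shows directly how much of the total is due to timescale distortion.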
`The algorithm described above is inherently symmetrical, and so makes no
`distinction between the word in the store of templates and the new word to be
`identified. DP is, in fact, a much more general technique that can be applied to a
`wide range of applications, and which has been popularized especially by the work
`of Bellman (1957). The number of choices at each stage is not restricted to three, as
`in the example given in Figure 8.3. Nor is it necessary in speech recognition
`applications to assume that the best path should include all frames of both patterns.
`If the properties of the speech only change slowly compared with the frame
`interval, it is permissible to skip occasional frames, so achieving timescale
`compression of the pattern. A particularly useful alternative version of the
`algorithm is asymmetrical, in that vertical paths are not permitted. The steps have a
`slope of zero (horizontal), one (diagonal), or two (which skips one frame in the
`template). Each input frame then makes just one contribution to the total distance,
`so it is not appropriate to double the distance contribution for diagonal paths. Many
`other variants of the algorithm have been proposed, including one that allows
`average slopes of 0.5, 1 and 2, in which the 0.5 is achieved by preventing a
`horizontal step if the previous step was horizontal. Provided the details of the
`formula are sensibly chosen, all of these algorithms can work well. In a practical
`implementation computational convenience may be the reason for choosing one in
`preference to another.
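The asymmetrical version with slopes of 0, 1 and 2 can be sketched as below. The function name asymmetric_dtw is ours, and we assume that the path must start at the first template frame and end at the last one. Each input frame contributes exactly one local distance, so no doubling is applied.

```python
def asymmetric_dtw(input_frames, template, dist):
    """DP score with steps of slope 0, 1 or 2 (no vertical steps)."""
    inf = float("inf")
    big_n = len(template)
    prev = [inf] * big_n
    prev[0] = dist(input_frames[0], template[0])  # assumed: start at template frame 1
    for frame in input_frames[1:]:
        col = [inf] * big_n
        for j in range(big_n):
            best = prev[j]                     # slope 0: stay on the same template frame
            if j >= 1:
                best = min(best, prev[j - 1])  # slope 1: advance one template frame
            if j >= 2:
                best = min(best, prev[j - 2])  # slope 2: skip one template frame
            if best < inf:
                col[j] = best + dist(frame, template[j])
        prev = col
    return prev[-1]


def abs_diff(a, b):
    return abs(a - b)


# Scalar "frames" for brevity. The fast input skips the middle template frame;
# the slow input repeats the first one.
fast = asymmetric_dtw([1, 3], [1, 2, 3], abs_diff)
slow = asymmetric_dtw([1, 1, 2, 3], [1, 2, 3], abs_diff)
print(fast, slow)
```

Because the steepest slope is 2, an input with fewer than about half as many frames as the template cannot reach the template's final frame, and the function then returns an infinite score.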
`Although DP algorithms provide a great computational saving compared with
`exhaustive search of all possible paths, the remaining computation can be
`substantial, particularly if each incoming word has to be compared with a large
`number of candidates for matching. Any saving in computation that does not affect
`the accuracy of the recognition result is therefore desirable. One possible
`computational saving is to exploit the fact that, in the calculations for any column
`in Figure 8.3, it is very unlikely that the best path for a correctly matching word
`will pass through any points for which the cumulative distance, D(i,j), is much in
`excess of the lowest value in that column. The saving can be achieved by not
`allowing paths from relatively badly scoring points to propagate further. (This
`process is sometimes known as pruning because the growing paths are like
`branches of a tree.) There will then only be a small subset of possible paths
`considered, usually lying on either side of the best path. If this economy is applied
`it can no longer be guaranteed that the DP algorithm will find the best-scoring path.
`However, with a value of score-pruning threshold that reduces the average amount
`of computation by a factor of 5-10 the right path will almost always be obtained if
`the words are fairly similar. The only circumstances where this amount of pruning
`is likely to prevent the optimum path from being obtained will be if the words are
`actually different, when the resultant over-estimate of total distance would not
`cause any error in recognition.
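Score pruning might be sketched as follows. For simplicity we use the asymmetrical recurrence with slopes 0, 1 and 2, so that every predecessor lies in the previous column; that choice, the function name pruned_dtw and the threshold values are assumptions for illustration. The function also counts how many local distances are evaluated, to show where the saving comes from.

```python
def pruned_dtw(input_frames, template, dist, threshold):
    """Return (score, number of local distances evaluated) with score pruning."""
    inf = float("inf")
    big_n = len(template)
    prev = [inf] * big_n
    prev[0] = dist(input_frames[0], template[0])
    evaluated = 1
    for frame in input_frames[1:]:
        col = [inf] * big_n
        for j in range(big_n):
            best = min(prev[max(0, j - 2): j + 1])  # slopes 2, 1 and 0
            if best == inf:
                continue  # no surviving path reaches this point, so skip the distance
            col[j] = best + dist(frame, template[j])
            evaluated += 1
        lowest = min(col)
        # Abandon points scoring much worse than the best point in this column.
        prev = [v if v <= lowest + threshold else inf for v in col]
    return prev[-1], evaluated


def abs_diff(a, b):
    return abs(a - b)


ramp = list(range(10))  # scalar "frames" for brevity
full = pruned_dtw(ramp, ramp, abs_diff, float("inf"))  # no pruning
pruned = pruned_dtw(ramp, ramp, abs_diff, 0.5)
print(full, pruned)
```

In this small example the pruned search finds the same score as the full search while evaluating fewer local distances; as the text notes, that outcome is usual for similar words but is not guaranteed in general.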
`Figures 8.4(a), 8.5 and 8.6 show DP paths using the symmetrical algorithm
`for the words illustrated in Figures 8.1 and 8.2. Figure 8.4(b) illustrates the
`asymmetrical algorithm for comparison, with slopes of 0, 1 and 2. In Figure 8.4
`there is no time-distortion penalty, and Figure 8.5 with a small distortion penalty
`shows a much more plausible matching of the two timescales. The score pruning
`used in these figures illustrates the fact that there are low differences in cumulative








`point where one word stops and the next one starts. However, it is mainly the ends
`of words that are affected and, apart from a likely speeding up of the timescale,
`words in a carefully spoken connected sequence do not normally differ greatly from
`their isolated counterparts except near the ends. In matching connected sequences
`of words for which separate templates are already available one might thus define
`the best-matching word sequence to be given by the sequence of templates which,
`when joined end to end, offers the best match to the input. It is of course assumed
`that the optimum time alignment is used for the sequence, as with DP for isolated
`words. Although this model of connected speech totally ignores co-articulation, it
`has been successfully used in many connected-word speech recognizers.
`As with the isolated-word time-alignment process, there seems to be a
`potentially explosive increase in computation, as every frame must be considered as
`a possible boundary between words. When each frame is considered as an end point
`for one word, all other permitted words in the vocabulary have to be considered as
`possible starters. Once again the solution to the problem is to apply dynamic
`programming, but in this case the algorithm is applied to word sequences as well as
`to frame sequences within words. A few algorithms have been developed to extend
`the isolated-word DP method to work economically across word boundaries. One
`of the most straightforward and widely used is described below.
`In Figure 8.8 consider a point that represents a match between frame i of a
`multi-word input utterance and frame j of template number k. Let the cumulative
`distance from the beginning of the utterance along the best-matching sequence of
`complete templates followed by the first j frames of template k be D(i,j, k). The
`best path through template k can be found by exactly the same process as for
`isolated-word recognition. However, in contrast to the isolated-word case, it is not
`known where on the input utterance the match with template k should finish, and
`for every input frame any valid path that reaches the end of template k could join to
`the beginning of the path through another template, representing the next word.
`Thus, for each input frame i, it is necessary to consider all templates that may have
`just ended in order to find which one has the lowest cumulative score so far. This
`score is then used in the cumulative distance at the start of any new template, m:
`D(i, 1, m) = min_k [D(i-1, L(k), k)] + d(i, 1, m),          (8.4)
`where L(k) is the length of template k. The use of i - 1 in Equation (8.4) implies
`that moving from the last frame of one template to the first frame of another always
`involves advancing one frame on the input (i.e. in effect only allowing diagonal
`paths between templates). This restriction is necessary, because the scores for the
`ends of all other templates may not yet be available for input frame i when the path
`decision has to be made. A horizontal path from within template m could have been
`included in Equation (8.4), but has been omitted merely to simplify the explanation.
`A timescale distortion penalty has not been included for the same reason.
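The connected-word scheme can be sketched as below, under several assumptions of our own: the within-template recurrence is the simple symmetric one without distortion penalties, a horizontal step within the first frame of a template is allowed (the option that Equation (8.4) omits for simplicity), and instead of back-pointers each point carries the tuple of words on its best path so far. The function name connected_word_dp and the toy templates are hypothetical.

```python
def connected_word_dp(input_frames, templates, dist):
    """templates maps a word name to its list of frames.

    Returns (cumulative distance, list of word names on the best path).
    """
    prev = None           # per-template columns for the previous input frame
    best_end = (0.0, ())  # D(0, L(k), k) = 0 for every k: nothing matched yet
    for frame in input_frames:
        col = {}
        for name, tmpl in templates.items():
            cells = []  # cells[j] = (cumulative distance, words on the path)
            for j, ref in enumerate(tmpl):
                cands = []
                if prev is not None:
                    cands.append(prev[name][j])          # horizontal
                    if j > 0:
                        cands.append(prev[name][j - 1])  # diagonal
                if j > 0:
                    cands.append(cells[j - 1])           # vertical
                if j == 0:
                    # Equation (8.4): enter from the best template end at frame i-1.
                    cands.append((best_end[0], best_end[1] + (name,)))
                score, words = min(cands, key=lambda c: c[0])
                cells.append((score + dist(frame, ref), words))
            col[name] = cells
        prev = col
        # Lowest score among all templates that may have just ended at this frame.
        best_end = min((cells[-1] for cells in col.values()), key=lambda c: c[0])
    return best_end[0], list(best_end[1])


# Scalar "frames" and two toy templates, for brevity.
toy_templates = {"one": [1, 1], "two": [5, 5]}
score, words = connected_word_dp([1, 1, 5, 5], toy_templates, lambda a, b: abs(a - b))
print(score, words)
```

Carrying the word history with each point is wasteful for long utterances; storing pointers to the preceding templates and tracing back at the end achieves the same result with less copying.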
`In the same way as for isolated words, the process can be started off at the
`beginning of an utterance because all values of D(0, L(k), k) will be zero. At the end
`of an utterance the template that gives the lowest cumulative distance is assumed to
`represent the final word of the sequence, but its identity gives no indication of the
`templates that preceded it. These can only be determined by storing pointers to the
`preceding templates of each path as it evolves, and then tracing back when the final
`point is reached. It is also possible to recover the positions in the input sequence






`be wrong because of inherent ambiguity in the acoustic signal.) On the other hand,
`if the input matches very badly to all except one of the permitted words, all paths
`not including that word will be abandoned as soon as the word has finished. In fact,
`if score pruning is used to cause poor paths to be abandoned early, the path in such
`a case may be uniquely determined even at a matching point within the word. There
`is plenty of evidence that human listeners also often decide on the identity of a long
`word before it is complete if its beginning is sufficiently distinctive.
`The rules of grammar often prevent certain sequences of words from occurring in
`human language, and these rules apply to particular syntactic classes, such as
`nouns, verbs, etc. In the more artificial circumstances in which speech recognizers
`are often used, the tasks can sometimes be arranged to apply much more severe
`constraints on which words are permitted to follow each other. Although applying
`such constraints requires more care in designing the application of the recognizer, it
`usually offers a substantial gain in recognition accuracy because there are then
`fewer potentially confusable words to be compared. The reduction in the number of
`templates that need to be matched at any point also leads to a computational saving.
`In all the algorithms described in this chapter it is assumed that suitable templates
`for the words of the vocabulary are available in the machine. Usually the templates
`are made from speech of the intended user, and thus a training session is needed
`for enrolment of each new user, who is required to speak examples of all the
`vocabulary words. If the same user regularly uses the machine, the templates can be
`stored in some back-up memory and re-loaded prior to each use of the system. For
`isolated-word recognizers the only technical problem with training is end-point
`detection. If the templates are stored with incorrect end points the error will affect
`recognition of every subsequent occurrence of the faulty word. Some systems have
`tried to ensure more reliable templates by time aligning a few examples of each
`word and averaging the measurements in corresponding frames. This technique
`gives some protection against occasional end-point errors, because such words
`would then give a poor match in this alignment process and so could be rejected.
`If a connected-word recognition algorithm is available, each template can be
`segmented from the surrounding silence by means of a special training syntax that
`only allows silence and wildcard templates. The new template candidate will
`obviously not match the silence, so it will be allocated to the wildcard. The
`boundaries of the wildcard match can then be taken as end points of the template.
`In acquiring templates for connected-word recognition, more realistic training
`examples can be obtained if connected words are used for the training. Again the
`recognition algorithm can be used to determine the template end points, but the
`syntax would specify the preceding and following words as existing templates, with
`just the new word to be captured represented by a wildcard between them. Provided
`the surrounding words can be chosen to give clear acoustic boundaries where they
`join to the new word, the segmentation will then be fairly accurate. This process is
`often called embedded training. More powerful embedded training procedures for
`use with statistical recognizers are discussed in Chapters 9 and 11.
`• Most early successful speech recognition machines worked by pattern
`matching on whole words. Acoustic analysis, for example by a bank of band-pass
`filters, describes the speech as a sequence of feature vectors, which can be
`compared with stored templates for all the words in the vocabulary using a
`suitable distance metric. Matching is improved if speech level is coded
`logarithmically and level variations are normalized.
`• Two major problems in isolated-word recognition are end-point detection and
`timescale variation. The timescale problem can be overcome by dynamic
`programming (DP) to find the best way to align the timescales of the incoming
`word and each template (known as dynamic time warping). Performance is
`improved by using penalties for timescale distortion. Score pruning, which
`abandons alignment paths that are scoring badly, can save a lot of computation.
`• DP can be extended to deal with sequences of connected words, which has the
`added advantage of solving the end-point detection problem. DP can also
`operate continuously, outputting words a second or two after they have been
`spoken. A wildcard template can be provided to cope with extraneous noises
`and words that are not in the vocabulary.
`• A syntax is often provided to prevent illegal sequences of words from being
`recognized. This method increases accuracy and reduces the computation.
`E8.1 Give examples of factors which cause acoustic differences between
`utterances of the same word. Why does simple pattern matching work
`reasonably well in spite of this variability?
`E8.2 What factors influence the choice of bandwidth for filter-bank analysis?
`E8.3 What are the reasons in favour of logarithmic representation of power in
`filter-bank analysis? What difficulties can arise due to the logarithmic scale?
`E8.4 Explain the principles behind dynamic time warping, with a simple diagram.
`E8.5 Describe the special precautions which are necessary when using the
`symmetrical DTW algorithm for isolated-word recognition.
`E8.6 How can a DTW isolated-word recognizer be made more tolerant of
`end-point errors?
`E8.7 How can a connected-word recognizer be used to segment a speech signal
`into individual words?
`E8.8 What extra processes are needed to turn a connected-word recognizer into a
`continuous recognizer?
`E8.9 Describe a training technique suitable for connected-word recognizers.

