First edition by the late Dr J. N. Holmes published 1988 by Van Nostrand Reinhold
Second edition published 2001 by Taylor & Francis
11 New Fetter Lane, London EC4P 4EE

Simultaneously published in the USA and Canada
by Taylor & Francis
29 West 35th Street, New York, NY 10001

Taylor & Francis is an imprint of the Taylor & Francis Group

© 2001 Wendy J. Holmes

Publisher's Note
This book has been prepared from camera-ready copy provided by the authors.
Printed and bound in Great Britain by Biddles Ltd, Guildford and King's Lynn

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Every effort has been made to ensure that the advice and information in this book is true and accurate at the time of going to press. However, neither the publisher nor the authors can accept any legal responsibility or liability for any errors or omissions that may be made. In the case of drug administration, any medical procedure or the use of technical equipment mentioned within this book, you are strongly advised to consult the manufacturer's guidelines.

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging in Publication Data

Holmes, J.N.
Speech synthesis and recognition / John Holmes and Wendy Holmes.--2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-7484-0856-8 (hbk.) -- ISBN 0-7484-0857-6 (pbk.)
1. Speech processing systems. I. Holmes, Wendy (Wendy J.) II. Title.

TK7882.S65 H64 2002
006.4'54--dc21

ISBN 0-7484-0856-8 (hbk)
ISBN 0-7484-0857-6 (pbk)

2001044279

vi  Contents

3.4 Neural response  36
3.5 Psychophysical measurements  38
3.6 Analysis of simple and complex signals  41
3.7 Models of the auditory system  42
3.7.1 Mechanical filtering  42
3.7.2 Models of neural transduction  43
3.7.3 Higher-level neural processing  43
Chapter 3 summary  46
Chapter 3 exercises  46

4 Digital Coding of Speech  47
4.1 Introduction  47
4.2 Simple waveform coders  48
4.2.1 Pulse code modulation  48
4.2.2 Delta modulation  50
4.3 Analysis/synthesis systems (vocoders)  52
4.3.1 Channel vocoders  53
4.3.2 Sinusoidal coders  53
4.3.3 LPC vocoders  54
4.3.4 Formant vocoders  56
4.3.5 Efficient parameter coding  57
4.3.6 Vocoders based on segmental/phonetic structure  58
4.4 Intermediate systems  58
4.4.1 Sub-band coding  59
4.4.2 Linear prediction with simple coding of the residual  60
4.4.3 Adaptive predictive coding  60
4.4.4 Multipulse LPC  62
4.4.5 Code-excited linear prediction  62
4.5 Evaluating speech coding algorithms  63
4.5.1 Subjective speech intelligibility measures  64
4.5.2 Subjective speech quality measures  64
4.5.3 Objective speech quality measures  64
4.6 Choosing a coder  65
Chapter 4 summary  66
Chapter 4 exercises  66

5 Message Synthesis from Stored Human Speech Components  67
5.1 Introduction  67
5.2 Concatenation of whole words  67
5.2.1 Simple waveform concatenation  67
5.2.2 Concatenation of vocoded words  70
5.2.3 Limitations of concatenating word-size units  71

Contents  vii

5.3 Concatenation of sub-word units: general principles  71
5.3.1 Choice of sub-word unit  71
5.3.2 Recording and selecting data for the units  72
5.3.3 Varying durations of concatenative units  73
5.4 Synthesis by concatenating vocoded sub-word units  74
5.5 Synthesis by concatenating waveform segments  74
5.5.1 Pitch modification  75
5.5.2 Timing modification  77
5.5.3 Performance of waveform concatenation  77
5.6 Variants of concatenative waveform synthesis  78
5.7 Hardware requirements  79
Chapter 5 summary  80
Chapter 5 exercises  80

6 Phonetic Synthesis by Rule  81
6.1 Introduction  81
6.2 Acoustic-phonetic rules  81
6.3 Rules for formant synthesizers  82
6.4 Table-driven phonetic rules  83
6.4.1 Simple transition calculation  84
6.4.2 Overlapping transitions  85
6.4.3 Using the tables to generate utterances  86
6.5 Optimizing phonetic rules  89
6.5.1 Automatic adjustment of phonetic rules  89
6.5.2 Rules for different speaker types  90
6.5.3 Incorporating intensity rules  91
6.6 Current capabilities of phonetic synthesis by rule  91
Chapter 6 summary  92
Chapter 6 exercises  92

7 Speech Synthesis from Textual or Conceptual Input  93
7.1 Introduction  93
7.2 Emulating the human speaking process  93
7.3 Converting from text to speech  94
7.3.1 TTS system architecture  94
7.3.2 Overview of tasks required for TTS conversion  96
7.4 Text analysis  97
7.4.1 Text pre-processing  97
7.4.2 Morphological analysis  99
7.4.3 Phonetic transcription  100
7.4.4 Syntactic analysis and prosodic phrasing  101
7.4.5 Assignment of lexical stress and pattern of word accents  102

CHAPTER 8

Introduction to Automatic Speech Recognition: Template Matching

8.1 INTRODUCTION

Much of the early work on automatic speech recognition (ASR), starting in the 1950s, involved attempting to apply rules based either on acoustic/phonetic knowledge or in many cases on simple ad hoc measurements of properties of the speech signal for different types of speech sound. The intention was to decode the signal directly into a sequence of phoneme-like units. These early methods, extensively reviewed by Hyde (1972), achieved very little success. The poor results were mainly because co-articulation causes the acoustic properties of individual phones to vary very widely, and any rule-based hard decisions about phone identity will often be wrong if they use only local information. Once wrong decisions have been made at an early stage, it is extremely difficult to recover from the errors later.

An alternative to rule-based methods is to use pattern-matching techniques. Primitive pattern-matching approaches were being investigated at around the same time as the early rule-based methods, but major improvements in speech recognizer performance did not occur until more general pattern-matching techniques were invented. This chapter describes typical methods that were developed for spoken word recognition during the 1970s. Although these methods were widely used in commercial speech recognizers in the 1970s and 1980s, they have now been largely superseded by more powerful methods (to be described in later chapters), which can be understood as a generalization of the simpler pattern-matching techniques introduced here. A thorough understanding of the principles of the first successful pattern-matching methods is thus a valuable introduction to the later techniques.

8.2 GENERAL PRINCIPLES OF PATTERN MATCHING

When a person utters a word, as we saw in Chapter 1, the word can be considered as a sequence of phonemes (the linguistic units) and the phonemes will be realized as phones. Because of inevitable co-articulation, the acoustic patterns associated with individual phones overlap in time, and therefore depend on the identities of their neighbours. Even for a word spoken in isolation, therefore, the acoustic pattern is related in a very complicated way to the word's linguistic structure.

However, if the same person repeats the same isolated word on separate occasions, the pattern is likely to be generally similar, because the same phonetic relationships will apply. Of course, there will probably also be differences, arising from many causes. For example, the second occurrence might be spoken faster or more slowly; there may be differences in vocal effort; the pitch and its variation during the word could be different; one example may be spoken more precisely
than the other, etc. It is obvious that the waveform of separate utterances of the same word may be very different. There are likely to be more similarities between spectrograms because (assuming that a short time-window is used, see Section 2.6), they better illustrate the vocal-tract resonances, which are closely related to the positions of the articulators. But even spectrograms will differ in detail due to the above types of difference, and timescale differences will be particularly obvious.

A well-established approach to ASR is to store in the machine example acoustic patterns (called templates) for all the words to be recognized, usually spoken by the person who will subsequently use the machine. Any incoming word can then be compared in turn with all words in the store, and the one that is most similar is assumed to be the correct one. In general none of the templates will match perfectly, so to be successful this technique must rely on the correct word being more similar to its own template than to any of the alternatives.

It is obvious that in some sense the sound pattern of the correct word is likely to be a better match than a wrong word, because it is made by more similar articulatory movements. Exploiting this similarity is, however, critically dependent on how the word patterns are compared, i.e. on how the 'distance' between two word examples is calculated. For example, it would be useless to compare waveforms, because even very similar repetitions of a word will differ appreciably in waveform detail from moment to moment, largely due to the difficulty of repeating the intonation and timing exactly.

It is implicit in the above comments that it must also be possible to identify the start and end points of words that are to be compared.

8.3 DISTANCE METRICS

In this section we will consider the problem of comparing the templates with the incoming speech when we know that corresponding points in time will be associated with similar articulatory events. In effect, we appear to be assuming that the words to be compared are spoken in isolation at exactly the same speed, and that their start and end points can be reliably determined. In practice these assumptions will very rarely be justified, and methods of dealing with the resultant problems will be discussed later in the chapter.

In calculating a distance between two words it is usual to derive a short-term distance that is local to corresponding parts of the words, and to integrate this distance over the entire word duration. Parameters representing the acoustic signal must be derived over some span of time, during which the properties are assumed not to change much. In one such span of time the measurements can be stored as a set of numbers, or feature vector, which may be regarded as representing a point in multi-dimensional space. The properties of a whole word can then be described as a succession of feature vectors (often referred to as frames), each representing a time slice of, say, 10-20 ms. The integral of the distance between the patterns then reduces to a sum of distances between corresponding pairs of feature vectors. To be useful, the distance must not be sensitive to small differences in intensity between otherwise similar words, and it should not give too much weight to differences in pitch. Those features of the acoustic signal that are determined by the phonetic properties should obviously be given more weight in the distance calculation.
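
This frame-by-frame comparison can be sketched in a few lines. The following is an illustrative sketch (not taken from the book): it assumes each word has already been reduced to a sequence of feature vectors of equal length, held as NumPy arrays, and it uses a Euclidean frame distance, which is only one of several plausible choices of metric.

    import numpy as np

    def word_distance(word_a, word_b):
        # word_a, word_b: arrays of shape (n_frames, n_features), assumed
        # to have identical timescales so that frame t of one word
        # corresponds to frame t of the other.
        frame_dists = np.sqrt(((word_a - word_b) ** 2).sum(axis=1))
        # Integrate the local distance over the whole word duration.
        return frame_dists.sum()
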
spacings are roughly equal to those of critical bands and whose range of centre frequencies covers the frequencies most important for speech perception (say from 300 Hz up to around 5 kHz). The total number of band-pass filters is therefore not likely to be more than about 20, and successful results have been achieved with as few as 10. When the necessary time-smoothing is included, the feature vector will represent the signal power in the filters averaged over the frame interval.

The usual name for this type of speech analysis is filter-bank analysis. Whether it is provided by a bank of discrete filters, implemented in analogue or digital form, or is implemented by sampling the outputs from short-term Fourier transforms, is a matter of engineering convenience. Figure 8.1 displays word patterns from a typical 10-channel filter-bank analyser for two examples of one word and one example of another. It can be seen from the frequency scales that the channels are closer together in the lower-frequency regions.

A consequence of removing the effect of the fundamental frequency and of using filters at least as wide as critical bands is to reduce the amount of information needed to describe a word pattern to much less than is needed for the waveform. Thus storage and computation in the pattern-matching process are much reduced.
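
As an illustration of the Fourier-transform route to filter-bank analysis mentioned above, here is a minimal sketch (mine, not the authors'). The sampling rate, frame length, channel count and the geometric spacing of the band edges are all assumptions chosen to mimic channels that widen with frequency; a real analyser would pick these to match critical bands more carefully.

    import numpy as np

    def filterbank_frames(signal, fs=8000, frame_len=160, n_channels=10):
        # Band edges from 300 Hz to 4 kHz, spaced geometrically so that
        # channel width grows with centre frequency (illustrative values).
        edges = np.geomspace(300.0, 4000.0, n_channels + 1)
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
        n_frames = len(signal) // frame_len
        feats = np.empty((n_frames, n_channels))
        for t in range(n_frames):
            frame = signal[t * frame_len:(t + 1) * frame_len] * np.hanning(frame_len)
            power = np.abs(np.fft.rfft(frame)) ** 2
            for c in range(n_channels):
                band = (freqs >= edges[c]) & (freqs < edges[c + 1])
                # Power in each channel, accumulated over the frame interval.
                feats[t, c] = power[band].sum()
        return feats
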
8.3.2 Level normalization

Mean speech level normally varies by a few dB over periods of a few seconds, and changes in spacing between the microphone and the speaker's mouth can also cause changes of several dB. As these changes will be of no phonetic significance, it is desirable to minimize their effects on the distance metric. Use of filter-bank power directly gives most weight to more intense regions of the spectrum, where a change of 2 or 3 dB will represent a very large absolute difference. On the other hand, a 3 dB difference in one of the weaker formants might be of similar phonetic significance, but will cause a very small effect on the power. This difficulty can be avoided to a large extent by representing the power logarithmically, so that similar power ratios have the same effect on the distance calculation whether they occur in intense or weak spectral regions. Most of the phonetically unimportant variations discussed above will then have much less weight in the distance calculation than the differences in spectrum level that result from formant movements, etc.

Although comparing levels logarithmically is advantageous, care must be exercised in very low-level sounds, such as weak fricatives or during stop-consonant closures. At these times the logarithm of the level in a channel will depend more on the ambient background noise level than on the speech signal. If the speaker is in a very quiet environment the logarithmic level may suffer quite wide irrelevant variations as a result of breath noise or the rustle of clothing. One way of avoiding this difficulty is to add a small constant to the measured level before taking logarithms. The value of the constant would be chosen to dominate the greatest expected background noise level, but to be small compared with the level usually found during speech.

Differences in vocal effort will mainly have the effect of adding a constant to all components of the log spectrum, rather than changing the shape of the spectrum cross-section. Such differences can be made to have no effect on the distance metric by subtracting the mean of the logarithm of the spectrum level of each frame from the log level in every channel of that frame.
consider values of D(i-1, j) or D(i-1, j-1). (As the scheme is symmetrical we could equally well have chosen the horizontal direction instead.) When the first column values for D(1, j) are known, Equation (8.2) can be applied successively to calculate D(i, j) for columns 2 to n. The value obtained for D(n, N) is the score for the best way of matching the two words. For simple speech recognition applications, just the final score is required, and so the only working memory needed during the calculation is a one-dimensional array for holding a column (or row) of D(i, j) values. However, there will then be no record at the end of what the optimum path was, and if this information is required for any purpose it is also necessary to store a two-dimensional array of back-pointers, to indicate which direction was chosen at each stage. It is not possible to know until the end has been reached whether any particular point will lie on the optimum path, and this information can only be found by tracing back from the end.
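
A sketch of this column-by-column computation follows (mine, not the authors'). Equation (8.2) itself falls on a page not reproduced here, so the recursion below assumes the standard symmetrical form D(i,j) = d(i,j) + min[D(i-1,j), D(i-1,j-1), D(i,j-1)], with one column of cumulative distances as working memory and a two-dimensional array of back-pointers for the optional traceback.

    import numpy as np

    def dp_match(dist):
        # dist[i, j]: local distance d(i, j) between input frame i and
        # template frame j.
        n, N = dist.shape
        back = np.zeros((n, N), dtype=int)  # 0=vertical, 1=diagonal, 2=horizontal
        D = np.empty(N)                     # one column of cumulative distances
        D[0] = dist[0, 0]
        for j in range(1, N):               # first column: vertical steps only
            D[j] = D[j - 1] + dist[0, j]
        for i in range(1, n):               # apply the recursion for columns 2 to n
            prev = D.copy()
            D[0] = prev[0] + dist[i, 0]
            back[i, 0] = 2
            for j in range(1, N):
                steps = (D[j - 1], prev[j - 1], prev[j])  # vertical, diagonal, horizontal
                best = int(np.argmin(steps))
                D[j] = steps[best] + dist[i, j]
                back[i, j] = best
        # D(n, N) is the score for the best match; trace `back` from the
        # end to recover the optimum path if it is needed.
        return D[N - 1], back
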
8.7 REFINEMENTS TO ISOLATED-WORD DP MATCHING

The DP algorithm represented by Equation (8.2) is intended to deal with variations of timescale between two otherwise similar words. However, if two examples of a word have the same length but one is spoken faster at the beginning and slower at the end, there will be more horizontal and vertical steps in the optimum path and fewer diagonals. As a result there will be a greater number of values of d(i, j) in the final score for words with timescale differences than when the timescales are the same. Although it may be justified to have some penalty for timescale distortion, on the grounds that an utterance with a very different timescale is more likely to be the wrong word, it is better to choose values of such penalties explicitly than to have them as an incidental consequence of the algorithm. Making the number of contributions of d(i, j) to D(n, N) independent of the path can be achieved by modifying Equation (8.2) to add twice the value of d(i, j) when the path is diagonal. One can then add an explicit penalty to the right-hand side of Equation (8.2) when the step is either vertical or horizontal. Equation (8.2) thus changes to:

    D(i, j) = min[ D(i-1, j)   + d(i, j) + hdp,
                   D(i-1, j-1) + 2 d(i, j),
                   D(i, j-1)   + d(i, j) + vdp ].        (8.3)

Suitable values for the horizontal and vertical distortion penalties, hdp and vdp, would probably have to be found by experiment in association with the chosen distance metric. It is, however, obvious that, all other things being equal, paths with appreciable timescale distortion should be given a worse score than diagonal paths, and so the values of the penalties should certainly not be zero.

Even in Equation (8.3) the number of contributions to a cumulative distance will depend on the lengths of both the example and the template, and so there will be a tendency for total distances to be smaller with short templates and larger with long templates. The final best-match decision will as a result favour short words. This bias can be avoided by dividing the total distance by the template length.
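
The following sketch implements Equation (8.3) together with the length normalization just described. The penalty values are arbitrary placeholders, to be tuned by experiment as the text says, and initializing the origin with a doubled local distance follows the diagonal-doubling convention; neither detail is prescribed by the book.

    import numpy as np

    def dp_match_penalized(dist, hdp=0.5, vdp=0.5):
        n, N = dist.shape
        D = np.full((n, N), np.inf)
        D[0, 0] = 2 * dist[0, 0]
        for i in range(n):
            for j in range(N):
                if i == 0 and j == 0:
                    continue
                best = np.inf
                if i > 0:   # horizontal step, with distortion penalty hdp
                    best = min(best, D[i - 1, j] + dist[i, j] + hdp)
                if i > 0 and j > 0:   # diagonal step counts d(i, j) twice
                    best = min(best, D[i - 1, j - 1] + 2 * dist[i, j])
                if j > 0:   # vertical step, with distortion penalty vdp
                    best = min(best, D[i, j - 1] + dist[i, j] + vdp)
                D[i, j] = best
        # Divide by the template length N so the final decision does not
        # favour short templates.
        return D[n - 1, N - 1] / N
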
The algorithm described above is inherently symmetrical, and so makes no distinction between the word in the store of templates and the new word to be identified. DP is, in fact, a much more general technique that can be applied to a wide range of applications, and which has been popularized especially by the work of Bellman (1957). The number of choices at each stage is not restricted to three, as in the example given in Figure 8.3. Nor is it necessary in speech recognition applications to assume that the best path should include all frames of both patterns. If the properties of the speech only change slowly compared with the frame interval, it is permissible to skip occasional frames, so achieving timescale compression of the pattern. A particularly useful alternative version of the algorithm is asymmetrical, in that vertical paths are not permitted. The steps have a slope of zero (horizontal), one (diagonal), or two (which skips one frame in the template). Each input frame then makes just one contribution to the total distance, so it is not appropriate to double the distance contribution for diagonal paths. Many other variants of the algorithm have been proposed, including one that allows average slopes of 0.5, 1 and 2, in which the 0.5 is achieved by preventing a horizontal step if the previous step was horizontal. Provided the details of the formula are sensibly chosen, all of these algorithms can work well. In a practical implementation computational convenience may be the reason for choosing one in preference to another.
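
For comparison with the symmetrical recursion sketched earlier, here is a possible rendering of the asymmetrical variant with slopes of 0, 1 and 2 (an assumed formulation consistent with the description above, not code from the book):

    import numpy as np

    def dp_match_asymmetric(dist):
        # Vertical steps are not permitted: each input frame i contributes
        # exactly one d(i, j), reached by a step of slope 0 (repeat the
        # template frame), 1 (diagonal) or 2 (skip one template frame),
        # so the diagonal contribution is not doubled.
        n, N = dist.shape
        D = np.full((n, N), np.inf)
        D[0, 0] = dist[0, 0]
        for i in range(1, n):
            for j in range(N):
                best = D[i - 1, j]                     # slope 0
                if j >= 1:
                    best = min(best, D[i - 1, j - 1])  # slope 1
                if j >= 2:
                    best = min(best, D[i - 1, j - 2])  # slope 2
                D[i, j] = best + dist[i, j]
        return D[n - 1, N - 1]
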
8.8 SCORE PRUNING

Although DP algorithms provide a great computational saving compared with exhaustive search of all possible paths, the remaining computation can be substantial, particularly if each incoming word has to be compared with a large number of candidates for matching. Any saving in computation that does not affect the accuracy of the recognition result is therefore desirable. One possible computational saving is to exploit the fact that, in the calculations for any column in Figure 8.3, it is very unlikely that the best path for a correctly matching word will pass through any points for which the cumulative distance, D(i, j), is much in excess of the lowest value in that column. The saving can be achieved by not allowing paths from relatively badly scoring points to propagate further. (This process is sometimes known as pruning because the growing paths are like branches of a tree.) There will then only be a small subset of possible paths considered, usually lying on either side of the best path. If this economy is applied it can no longer be guaranteed that the DP algorithm will find the best-scoring path. However, with a value of score-pruning threshold that reduces the average amount of computation by a factor of 5-10 the right path will almost always be obtained if the words are fairly similar. The only circumstances where this amount of pruning is likely to prevent the optimum path from being obtained will be if the words are actually different, when the resultant over-estimate of total distance would not cause any error in recognition.
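
Score pruning drops straight into the column-by-column sketch given earlier. In the variant below (mine, not the authors'), any point whose cumulative distance exceeds the lowest value in its column by more than an assumed threshold is set to infinity, so no paths propagate from it; the threshold value is a placeholder to be chosen by experiment.

    import numpy as np

    def dp_match_pruned(dist, beam=10.0):
        n, N = dist.shape
        D = np.empty(N)
        D[0] = dist[0, 0]
        for j in range(1, N):
            D[j] = D[j - 1] + dist[0, j]
        for i in range(1, n):
            prev = D
            D = np.full(N, np.inf)
            D[0] = prev[0] + dist[i, 0]
            for j in range(1, N):
                best = min(D[j - 1], prev[j - 1], prev[j])
                if best < np.inf:    # pruned points propagate no paths
                    D[j] = best + dist[i, j]
            # Discard relatively badly scoring points in this column.
            D[D > D.min() + beam] = np.inf
        return D[N - 1]   # may be infinite if the true path was pruned away
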
Figures 8.4(a), 8.5 and 8.6 show DP paths using the symmetrical algorithm for the words illustrated in Figures 8.1 and 8.2. Figure 8.4(b) illustrates the asymmetrical algorithm for comparison, with slopes of 0, 1 and 2. In Figure 8.4 there is no time-distortion penalty, and Figure 8.5 with a small distortion penalty shows a much more plausible matching of the two timescales. The score pruning used in these figures illustrates the fact that there are low differences in cumulative
point where one word stops and the next one starts. However, it is mainly the ends of words that are affected and, apart from a likely speeding up of the timescale, words in a carefully spoken connected sequence do not normally differ greatly from their isolated counterparts except near the ends. In matching connected sequences of words for which separate templates are already available one might thus define the best-matching word sequence to be given by the sequence of templates which, when joined end to end, offers the best match to the input. It is of course assumed that the optimum time alignment is used for the sequence, as with DP for isolated words. Although this model of connected speech totally ignores co-articulation, it has been successfully used in many connected-word speech recognizers.

As with the isolated-word time-alignment process, there seems to be a potentially explosive increase in computation, as every frame must be considered as a possible boundary between words. When each frame is considered as an end point for one word, all other permitted words in the vocabulary have to be considered as possible starters. Once again the solution to the problem is to apply dynamic programming, but in this case the algorithm is applied to word sequences as well as to frame sequences within words. A few algorithms have been developed to extend the isolated-word DP method to work economically across word boundaries. One of the most straightforward and widely used is described below.
In Figure 8.8 consider a point that represents a match between frame i of a multi-word input utterance and frame j of template number k. Let the cumulative distance from the beginning of the utterance along the best-matching sequence of complete templates followed by the first j frames of template k be D(i, j, k). The best path through template k can be found by exactly the same process as for isolated-word recognition. However, in contrast to the isolated-word case, it is not known where on the input utterance the match with template k should finish, and for every input frame any valid path that reaches the end of template k could join to the beginning of the path through another template, representing the next word. Thus, for each input frame i, it is necessary to consider all templates that may have just ended in order to find which one has the lowest cumulative score so far. This score is then used in the cumulative distance at the start of any new template, m:

    D(i, 1, m) = min over k [ D(i-1, L(k), k) ] + d(i, 1, m),        (8.4)

where L(k) is the length of template k. The use of i-1 in Equation (8.4) implies that moving from the last frame of one template to the first frame of another always involves advancing one frame on the input (i.e. in effect only allowing diagonal paths between templates). This restriction is necessary, because the scores for the ends of all other templates may not yet be available for input frame i when the path decision has to be made. A horizontal path from within template m could have been included in Equation (8.4), but has been omitted merely to simplify the explanation. A timescale distortion penalty has not been included for the same reason.

In the same way as for isolated words, the process can be started off at the beginning of an utterance because all values of D(0, L(k), k) will be zero. At the end of an utterance the template that gives the lowest cumulative distance is assumed to represent the final word of the sequence, but its identity gives no indication of the templates that preceded it. These can only be determined by storing pointers to the preceding templates of each path as it evolves, and then tracing back when the final point is reached. It is also possible to recover the positions in the input sequence at which each template match begins and ends.
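
The following sketch shows how Equation (8.4) might be realized in code. It is an assumed rendering, not the book's: penalties and the horizontal template-initial path are omitted as in the text's simplified explanation, and only enough back-pointer storage is kept to trace the sequence of templates, not the full alignment.

    import numpy as np

    def connected_word_dp(dists):
        # dists[k][i, j]: local distance between input frame i and frame j
        # of template k; all templates see the same n input frames.
        n = dists[0].shape[0]
        D = [np.full(d.shape[1], np.inf) for d in dists]
        pred = [np.full(n, -1, dtype=int) for _ in dists]
        for k, d in enumerate(dists):      # frame 0: any template may start
            D[k][0] = d[0, 0]
        for i in range(1, n):
            # Equation (8.4): best template that could have just ended at
            # input frame i-1, i.e. min over k of D(i-1, L(k), k).
            ends = [Dk[-1] for Dk in D]
            best_k = int(np.argmin(ends))
            prev = [Dk.copy() for Dk in D]
            for k, d in enumerate(dists):
                restart = ends[best_k] + d[i, 0]   # start a new word here
                cont = prev[k][0] + d[i, 0]        # stay in template k
                D[k][0] = min(cont, restart)
                if restart <= cont:
                    pred[k][i] = best_k            # record preceding template
                for j in range(1, d.shape[1]):     # isolated-word recursion
                    D[k][j] = min(D[k][j - 1], prev[k][j - 1], prev[k][j]) + d[i, j]
        final = int(np.argmin([Dk[-1] for Dk in D]))
        return final, pred   # trace pred backwards to recover the word sequence
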
be wrong because of inherent ambiguity in the acoustic signal.) On the other hand, if the input matches very badly to all except one of the permitted words, all paths not including that word will be abandoned as soon as the word has finished. In fact, if score pruning is used to cause poor paths to be abandoned early, the path in such a case may be uniquely determined even at a matching point within the word. There is plenty of evidence that human listeners also often decide on the identity of a long word before it is complete if its beginning is sufficiently distinctive.
8.12 SYNTACTIC CONSTRAINTS

The rules of grammar often prevent certain sequences of words from occurring in human language, and these rules apply to particular syntactic classes, such as nouns, verbs, etc. In the more artificial circumstances in which speech recognizers are often used, the tasks can sometimes be arranged to apply much more severe constraints on which words are permitted to follow each other. Although applying such constraints requires more care in designing the application of the recognizer, it usually offers a substantial gain in recognition accuracy because there are then fewer potentially confusable words to be compared. The reduction in the number of templates that need to be matched at any point also leads to a computational saving.
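
One simple way such constraints might be expressed is a word-pair syntax listing the words allowed to precede each vocabulary word; the minimization over k in Equation (8.4) is then restricted to those templates. The vocabulary and rules below are invented for illustration only.

    # Hypothetical word-pair syntax for a small digit-dialling task.
    ALLOWED_PREDECESSORS = {
        "dial":  [],                        # "dial" may only start the utterance
        "one":   ["dial", "one", "three"],  # digits follow "dial" or a digit
        "three": ["dial", "one", "three"],
    }

    def allowed_k(word, template_names):
        # Template indices over which Equation (8.4) may minimize when a
        # path starts the template for `word`. Fewer candidates means
        # fewer confusable words and less computation.
        return [k for k, name in enumerate(template_names)
                if name in ALLOWED_PREDECESSORS[word]]
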
8.13 TRAINING A WHOLE-WORD RECOGNIZER

In all the algorithms described in this chapter it is assumed that suitable templates for the words of the vocabulary are available in the machine. Usually the templates are made from speech of the intended user, and thus a training session is needed for enrolment of each new user, who is required to speak examples of all the vocabulary words. If the same user regularly uses the machine, the templates can be stored in some back-up memory and re-loaded prior to each use of the system. For isolated-word recognizers the only technical problem with training is end-point detection. If the templates are stored with incorrect end points the error will affect recognition of every subsequent occurrence of the faulty word. Some systems have tried to ensure more reliable templates by time aligning a few examples of each word and averaging the measurements in corresponding frames. This technique gives some protection against occasional end-point errors, because such words would then give a poor match in this alignment process and so could be rejected.

If a connected-word recognition algorithm is available, each template can be segmented from the surrounding silence by means of a special training syntax that only allows silence and wildcard templates. The new template candidate will obviously not match the silence, so it will be allocated to the wildcard. The boundaries of the wildcard match can then be taken as end points of the template. In acquiring templates for connected-word recognition, more realistic training examples can be obtained if connected words are used for the training. Again the recognition algorithm can be used to determine the template end points, but the syntax would specify the preceding and following words as existing templates, with just the new word to be captured represented by a wildcard between them. Provided the surrounding words can be chosen to give clear acoustic boundaries where they join to the new word, the segmentation will then be fairly accurate. This process is often called embedded training. More powerful embedded training procedures for use with statistical recognizers are discussed in Chapters 9 and 11.
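
As a small sketch of the two training syntaxes just described (the word names and data layout are assumptions for illustration): the recognizer is run with a constrained syntax, and the input frames it assigns to the wildcard become the new template.

    # Hypothetical training syntaxes for end-point detection by recognition.
    ISOLATED_TRAINING_SYNTAX = ("silence", "WILDCARD", "silence")
    EMBEDDED_TRAINING_SYNTAX = ("silence", "one", "WILDCARD", "three", "silence")

    def extract_template(frames, wildcard_bounds):
        # frames: feature vectors of the training recording.
        # wildcard_bounds: (start, end) input-frame indices that the
        # recognizer assigned to the wildcard; this slice is taken as the
        # new word template.
        start, end = wildcard_bounds
        return frames[start:end]
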
CHAPTER 8 SUMMARY

• Most early successful speech recognition machines worked by pattern matching on whole words. Acoustic analysis, for example by a bank of band-pass filters, describes the speech as a sequence of feature vectors, which can be compared with stored templates for all the words in the vocabulary using a suitable distance metric. Matching is improved if speech level is coded logarithmically and level variations are normalized.

• Two major problems in isolated-word recognition are end-point detection and timescale variation. The timescale problem can be overcome by dynamic programming (DP) to find the best way to align the timescales of the incoming word and each template (known as dynamic time warping). Performance is improved by using penalties for timescale distortion. Score pruning, which abandons alignment paths that are scoring badly, can save a lot of computation.

• DP can be extended to deal with sequences of connected words, which has the added advantage of solving the end-point detection problem. DP can also operate continuously, outputting words a second or two after they have been spoken. A wildcard template can be provided to cope with extraneous noises and words that are not in the vocabulary.

• A syntax is often provided to prevent illegal sequences of words from being recognized. This method increases accuracy and reduces the computation.
CHAPTER 8 EXERCISES

E8.1 Give examples of factors which cause acoustic differences between utterances of the same word. Why does simple pattern matching work reasonably well in spite of this variability?
E8.2 What factors influence the choice of bandwidth for filter-bank analysis?
E8.3 What are the reasons in favour of logarithmic representation of power in filter-bank analysis? What difficulties can arise due to the logarithmic scale?
E8.4 Explain the principles behind dynamic time warping, with a simple diagram.
E8.5 Describe the special precautions which are necessary when using the symmetrical DTW algorithm for isolated-word recognition.
E8.6 How can a DTW isolated-word recognizer be made more tolerant of end-point errors?
E8.7 How can a connected-word recognizer be used to segment a speech signal into individual words?
E8.8 What extra processes are needed to turn a connected-word recognizer into a continuous recognizer?
E8.9 Describe a training technique suitable for connected-word recognizers.