Speech Synthesis and Recognition

Second Edition

John Holmes and Wendy Holmes

Taylor & Francis
London and New York

`First edition by the late Dr J.N. Holmes published 1988 by Van Nostrand
`Reinhold
`Second edition published 2001 by Taylor & Francis
`11 New Fetter Lane, London EC4P 4EE
`
`Simultaneously published in the USA and Canada
`by Taylor & Francis
`29 West 35th Street, New York, NY 10001
`
`Taylor & Francis is an imprint of the Taylor & Francis Group
`
© 2001 Wendy J. Holmes
`
`Publisher's Note
`This book has been prepared from camera-ready copy provided by the authors.
Printed and bound in Great Britain by Biddles Ltd, Guildford and King's Lynn
`
`All rights reserved. No part of this book may be reprinted or reproduced or
`utilised in any form or by any electronic, mechanical, or other means, now
`known or hereafter invented, including photocopying and recording, or in any
`information storage or retrieval system, without permission in writing from the
`publishers.
`
`Every effort has been made to ensure that the advice and information in this
`book is true and accurate at the time of going to press. However, neither the
`publisher nor the authors can accept any legal responsibility or liability for any
`errors or omissions that may be made. In the case of drug administration, any
`medical procedure or the use of technical equipment mentioned within this
`book, you are strongly advised to consult the manufacturer's guidelines.
`
British Library Cataloguing in Publication Data
`A catalogue record for this book is available from the British Library
`
Library of Congress Cataloging in Publication Data
`
`Holmes, J.N.
`Speech synthesis and recognition/John Holmes and Wendy Holmes.--2nd ed
`p.cm.
`Includes bibliographical references and index.
ISBN 0-7484-0856-8 (hc.) -- ISBN 0-7484-0857-6 (pbk.)
`1. Speech processing systems. I. Holmes, Wendy (Wendy J.) II. Title.
`
TK7882.S65 H64 2002
`006.4'54--dc21
`
`ISBN 0-7484-0856-8 (hbk)
`ISBN 0-7484-0857-6 (pbk)
`
`2001044279
`
CONTENTS

Preface to the First Edition
Preface to the Second Edition
List of Abbreviations

1 Human Speech Communication

1.1 Value of speech for human-machine communication
1.2 Ideas and language
1.3 Relationship between written and spoken language
1.4 Phonetics and phonology
1.5 The acoustic signal
1.6 Phonemes, phones and allophones
1.7 Vowels, consonants and syllables
1.8 Phonemes and spelling
1.9 Prosodic features
1.10 Language, accent and dialect
1.11 Supplementing the acoustic signal
1.12 The complexity of speech processing
Chapter 1 summary
Chapter 1 exercises

2 Mechanisms and Models of Human Speech Production

2.1 Introduction
2.2 Sound sources
2.3 The resonant system
2.4 Interaction of laryngeal and vocal tract functions
2.5 Radiation
2.6 Waveforms and spectrograms
2.7 Speech production models
2.7.1 Excitation models
2.7.2 Vocal tract models
Chapter 2 summary
Chapter 2 exercises

3 Mechanisms and Models of the Human Auditory System

3.1 Introduction
3.2 Physiology of the outer and middle ears
3.3 Structure of the cochlea
`
`

`

3.4 Neural response
3.5 Psychophysical measurements
3.6 Analysis of simple and complex signals
3.7 Models of the auditory system
3.7.1 Mechanical filtering
3.7.2 Models of neural transduction
3.7.3 Higher-level neural processing
Chapter 3 summary
Chapter 3 exercises

4 Digital Coding of Speech

4.1 Introduction
4.2 Simple waveform coders
4.2.1 Pulse code modulation
4.2.2 Delta modulation
4.3 Analysis/synthesis systems (vocoders)
4.3.1 Channel vocoders
4.3.2 Sinusoidal coders
4.3.3 LPC vocoders
4.3.4 Formant vocoders
4.3.5 Efficient parameter coding
4.3.6 Vocoders based on segmental/phonetic structure
4.4 Intermediate systems
4.4.1 Sub-band coding
4.4.2 Linear prediction with simple coding of the residual
4.4.3 Adaptive predictive coding
4.4.4 Multipulse LPC
4.4.5 Code-excited linear prediction
4.5 Evaluating speech coding algorithms
4.5.1 Subjective speech intelligibility measures
4.5.2 Subjective speech quality measures
4.5.3 Objective speech quality measures
4.6 Choosing a coder
Chapter 4 summary
Chapter 4 exercises

5 Message Synthesis from Stored Human Speech Components

5.1 Introduction
5.2 Concatenation of whole words
5.2.1 Simple waveform concatenation
5.2.2 Concatenation of vocoded words
5.2.3 Limitations of concatenating word-size units
`
`

`

5.3 Concatenation of sub-word units: general principles
5.3.1 Choice of sub-word unit
5.3.2 Recording and selecting data for the units
5.3.3 Varying durations of concatenative units
5.4 Synthesis by concatenating vocoded sub-word units
5.5 Synthesis by concatenating waveform segments
5.5.1 Pitch modification
5.5.2 Timing modification
5.5.3 Performance of waveform concatenation
5.6 Variants of concatenative waveform synthesis
5.7 Hardware requirements
Chapter 5 summary
Chapter 5 exercises

6 Phonetic synthesis by rule

6.1 Introduction
6.2 Acoustic-phonetic rules
6.3 Rules for formant synthesizers
6.4 Table-driven phonetic rules
6.4.1 Simple transition calculation
6.4.2 Overlapping transitions
6.4.3 Using the tables to generate utterances
6.5 Optimizing phonetic rules
6.5.1 Automatic adjustment of phonetic rules
6.5.2 Rules for different speaker types
6.5.3 Incorporating intensity rules
6.6 Current capabilities of phonetic synthesis by rule
Chapter 6 summary
Chapter 6 exercises

7 Speech Synthesis from Textual or Conceptual Input

7.1 Introduction
7.2 Emulating the human speaking process
7.3 Converting from text to speech
7.3.1 TTS system architecture
7.3.2 Overview of tasks required for TTS conversion
7.4 Text analysis
7.4.1 Text pre-processing
7.4.2 Morphological analysis
7.4.3 Phonetic transcription
7.4.4 Syntactic analysis and prosodic phrasing
7.4.5 Assignment of lexical stress and pattern of word accents
`
`

`

7.5 Prosody generation
7.5.1 Timing pattern
7.5.2 Fundamental frequency contour
7.6 Implementation issues
7.7 Current TTS synthesis capabilities
7.8 Speech synthesis from concept
Chapter 7 summary
Chapter 7 exercises

8 Introduction to automatic speech recognition: template matching

8.1 Introduction
8.2 General principles of pattern matching
8.3 Distance metrics
8.3.1 Filter-bank analysis
8.3.2 Level normalization
8.4 End-point detection for isolated words
8.5 Allowing for timescale variations
8.6 Dynamic programming for time alignment
8.7 Refinements to isolated-word DP matching
8.8 Score pruning
8.9 Allowing for end-point errors
8.10 Dynamic programming for connected words
8.11 Continuous speech recognition
8.12 Syntactic constraints
8.13 Training a whole-word recognizer
Chapter 8 summary
Chapter 8 exercises

9 Introduction to stochastic modelling

9.1 Feature variability in pattern matching
9.2 Introduction to hidden Markov models
9.3 Probability calculations in hidden Markov models
9.4 The Viterbi algorithm
9.5 Parameter estimation for hidden Markov models
9.5.1 Forward and backward probabilities
9.5.2 Parameter re-estimation with forward and backward probabilities
9.5.3 Viterbi training
9.6 Vector quantization
9.7 Multi-variate continuous distributions
9.8 Use of normal distributions with HMMs
9.8.1 Probability calculations
9.8.2 Estimating the parameters of a normal distribution
`
`

`

9.8.3 Baum-Welch re-estimation
9.8.4 Viterbi training
9.9 Model initialization
9.10 Gaussian mixtures
9.10.1 Calculating emission probabilities
9.10.2 Baum-Welch re-estimation
9.10.3 Re-estimation using the most likely state sequence
9.10.4 Initialization of Gaussian mixture distributions
9.10.5 Tied mixture distributions
9.11 Extension of stochastic models to word sequences
9.12 Implementing probability calculations
9.12.1 Using the Viterbi algorithm with probabilities in logarithmic form
9.12.2 Adding probabilities when they are in logarithmic form
9.13 Relationship between DTW and a simple HMM
9.14 State durational characteristics of HMMs
Chapter 9 summary
Chapter 9 exercises

10 Introduction to front-end analysis for automatic speech recognition

10.1 Introduction
10.2 Pre-emphasis
10.3 Frames and windowing
10.4 Filter banks, Fourier analysis and the mel scale
10.5 Cepstral analysis
10.6 Analysis based on linear prediction
10.7 Dynamic features
10.8 Capturing the perceptually relevant information
10.9 General feature transformations
10.10 Variable-frame-rate analysis
Chapter 10 summary
Chapter 10 exercises

11 Practical techniques for improving speech recognition performance

11.1 Introduction
11.2 Robustness to environment and channel effects
11.2.1 Feature-based techniques
11.2.2 Model-based techniques
11.2.3 Dealing with unknown or unpredictable noise corruption
11.3 Speaker-independent recognition
11.3.1 Speaker normalization
11.4 Model adaptation
`
`

`

11.4.1 Bayesian methods for training and adaptation of HMMs
11.4.2 Adaptation methods based on linear transforms
11.5 Discriminative training methods
11.5.1 Maximum mutual information training
11.5.2 Training criteria based on reducing recognition errors
11.6 Robustness of recognizers to vocabulary variation
Chapter 11 summary
Chapter 11 exercises

12 Automatic speech recognition for large vocabularies

12.1 Introduction
12.2 Historical perspective
12.3 Speech transcription and speech understanding
12.4 Speech transcription
12.5 Challenges posed by large vocabularies
12.6 Acoustic modelling
12.6.1 Context-dependent phone modelling
12.6.2 Training issues for context-dependent models
12.6.3 Parameter tying
12.6.4 Training procedure
12.6.5 Methods for clustering model parameters
12.6.6 Constructing phonetic decision trees
12.6.7 Extensions beyond triphone modelling
12.7 Language modelling
12.7.1 N-grams
12.7.2 Perplexity and evaluating language models
12.7.3 Data sparsity in language modelling
12.7.4 Discounting
12.7.5 Backing off in language modelling
12.7.6 Interpolation of language models
12.7.7 Choice of more general distribution for smoothing
12.7.8 Improving on simple N-grams
12.8 Decoding
12.8.1 Efficient one-pass Viterbi decoding for large vocabularies
12.8.2 Multiple-pass Viterbi decoding
12.8.3 Depth-first decoding
12.9 Evaluating LVCSR performance
12.9.1 Measuring errors
12.9.2 Controlling word insertion errors
12.9.3 Performance evaluations
12.10 Speech understanding
12.10.1 Measuring and evaluating speech understanding performance
Chapter 12 summary
Chapter 12 exercises
`
`

`

13 Neural networks for speech recognition

13.1 Introduction
13.2 The human brain
13.3 Connectionist models
13.4 Properties of ANNs
13.5 ANNs for speech recognition
13.5.1 Hybrid HMM/ANN methods
Chapter 13 summary
Chapter 13 exercises

14 Recognition of speaker characteristics

14.1 Characteristics of speakers
14.2 Verification versus identification
14.2.1 Assessing performance
14.2.2 Measures of verification performance
14.3 Speaker recognition
14.3.1 Text dependence
14.3.2 Methods for text-dependent/text-prompted speaker recognition
14.3.3 Methods for text-independent speaker recognition
14.3.4 Acoustic features for speaker recognition
14.3.5 Evaluations of speaker recognition performance
14.4 Language recognition
14.4.1 Techniques for language recognition
14.4.2 Acoustic features for language recognition
Chapter 14 summary
Chapter 14 exercises

15 Applications and performance of current technology

15.1 Introduction
15.2 Why use speech technology?
15.3 Speech synthesis technology
15.4 Examples of speech synthesis applications
15.4.1 Aids for the disabled
15.4.2 Spoken warning signals, instructions and user feedback
15.4.3 Education, toys and games
15.4.4 Telecommunications
15.5 Speech recognition technology
15.5.1 Characterizing speech recognizers and recognition tasks
15.5.2 Typical recognition performance for different tasks
15.5.3 Achieving success with ASR in an application
15.6 Examples of ASR applications
`
`

`

15.6.1 Command and control
15.6.2 Education, toys and games
15.6.3 Dictation
15.6.4 Data entry and retrieval
15.6.5 Telecommunications
15.7 Applications of speaker and language recognition
15.8 The future of speech technology applications
Chapter 15 summary
Chapter 15 exercises

16 Future research directions in speech synthesis and recognition

16.1 Introduction
16.2 Speech synthesis
16.2.1 Speech sound generation
16.2.2 Prosody generation and higher-level linguistic processing
16.3 Automatic speech recognition
16.3.1 Advantages of statistical pattern-matching methods
16.3.2 Limitations of HMMs for speech recognition
16.3.3 Developing improved recognition models
16.4 Relationship between synthesis and recognition
16.5 Automatic speech understanding
Chapter 16 summary
Chapter 16 exercises

17 Further Reading

17.1 Books
17.2 Journals
17.3 Conferences and workshops
17.4 The Internet
17.5 Reading for individual chapters

References
Solutions to Exercises
Glossary
Index
`
`

`

CHAPTER 8
`
`Introduction to Automatic Speech
`Recognition: Template Matching
`
`8.1 INTRODUCTION
`
`Much of the early work on automatic speech recognition (ASR), starting in the
`1950s, involved attempting
`to apply rules based either on acoustic/phonetic
`knowledge or in many cases on simple ad hoc measurements of properties of the
`speech signal for different types of speech sound. The intention was to decode the
`signal directly into a sequence of phoneme-like units. These early methods,
extensively reviewed by Hyde (1972), achieved very little success. The poor results
`were mainly because co-articulation causes the acoustic properties of individual
`phones to vary very widely, and any rule-based hard decisions about phone identity
`will often be wrong if they use only local information. Once wrong decisions have
`been made at an early stage, it is extremely difficult to recover from the errors later.
`An alternative to rule-based methods is to use pattern-matching techniques.
`Primitive pattern-matching approaches were being investigated at around the same
`time as the early rule-based methods, but major improvements in speech recognizer
`performance did not occur until more general pattern-matching techniques were
`invented. This chapter describes typical methods that were developed for spoken
`word recognition during the 1970s. Although these methods were widely used in
`commercial speech recognizers in the 1970s and 1980s, they have now been largely
superseded by more powerful methods (to be described in later chapters), which
`can be understood as a generalization of the simpler pattern-matching techniques
`introduced here. A thorough understanding of the principles of the first successful
`pattern-matching methods is thus a valuable introduction to the later techniques.
`
`8.2 GENERAL PRINCIPLES OF PATTERN MATCHING
`
`When a person utters a word, as we saw in Chapter 1, the word can be considered
as a sequence of phonemes (the linguistic units) and the phonemes will be realized
`as phones. Because of inevitable co-articulation, the acoustic patterns associated
`with individual phones overlap in time, and therefore depend on the identities of
their neighbours. Even for a word spoken in isolation, therefore, the acoustic
`pattern is related in a very complicated way to the word's linguistic structure.
`However, if the same person repeats the same isolated word on separate
`occasions, the pattern is likely to be generally similar, because the same phonetic
`relationships will apply. Of course, there will probably also be differences, arising
`from many causes. For example, the second occurrence might be spoken faster or
`more slowly; there may be differences in vocal effort; the pitch and its variation
`during the word could be different; one example may be spoken more precisely
`than the other, etc. It is obvious that the waveform of separate utterances of the
`same word may be very different. There are likely to be more similarities between
`spectrograms because (assuming that a short time-window is used, see Section 2.6),
`they better illustrate the vocal-tract resonances, which are closely related to the
`positions of the articulators. But even spectrograms will differ in detail due to the
`above types of difference, and timescale differences will be particularly obvious.
`A well-established approach to ASR is to store in the machine example
acoustic patterns (called templates) for all the words to be recognized, usually
`spoken by the person who will subsequently use the machine. Any incoming word
can then be compared in turn with all words in the store, and the one that is most
`similar is assumed to be the correct one. In general none of the templates will match
`perfectly, so to be successful this technique must rely on the correct word being
`more similar to its own template than to any of the alternatives.
`It is obvious that in some sense the sound pattern of the correct word is likely
`to be a better match than a wrong word, because it is made by more similar
`articulatory movements. Exploiting this similarity is, however, critically dependent
`on how the word patterns are compared, i.e. on how the 'distance' between two
word examples is calculated. For example, it would be useless to compare
`waveforms, because even very similar repetitions of a word will differ appreciably
`in waveform detail from moment to moment, largely due to the difficulty of
`repeating the intonation and timing exactly.
`It is implicit in the above comments that it must also be possible to identify
`the start and end points of words that are to be compared.
`
`8.3 DISTANCE METRICS
`
`In this section we will consider the problem of comparing the templates with the
`incoming speech when we know that corresponding points
`in time will be
`associated with similar articulatory events. In effect, we appear to be assuming that
`the words to be compared are spoken in isolation at exactly the same speed, and
`that their start and end points can be reliably determined.
`In practice these
`assumptions will very rarely be justified, and methods of dealing with the resultant
`problems will be discussed later in the chapter.
`In calculating a distance between two words it is usual to derive a short-term
`distance that is local to corresponding parts of the words, and to integrate this
`distance over the entire word duration. Parameters representing the acoustic signal
`must be derived over some span of time, during which the properties are assumed
`not to change much. In one such span of time the measurements can be stored as a
`set of numbers, or feature vector, which may be regarded as representing a point
`in multi-dimensional space. The properties of a whole word can then be described
as a succession of feature vectors (often referred to as frames), each representing a
`time slice of, say, 10-20 ms. The integral of the distance between the patterns then
`reduces to a sum of distances between corresponding pairs of feature vectors. To be
`useful, the distance must not be sensitive to small differences in intensity between
`otherwise similar words, and it should not give too much weight to differences in
`pitch. Those features of the acoustic signal that are determined by the phonetic
`properties should obviously be given more weight in the distance calculation.
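To make these ideas concrete, the following Python sketch (not from the original text) represents each word as an array of frame feature vectors, forms a whole-word distance as the sum of frame-by-frame distances, and then chooses the stored template whose total distance from the incoming word is smallest. It assumes for the moment that the two words being compared have the same number of frames and are already time-aligned, which is exactly the simplification examined in the rest of this section; the feature extraction itself is left to the filter-bank analysis discussed in Section 8.3.1.

import numpy as np

def frame_distance(frame_a, frame_b):
    # Distance between two feature vectors (one per 10-20 ms time slice).
    # The squared Euclidean distance used here anticipates Section 8.3.2.
    diff = frame_a - frame_b
    return float(np.dot(diff, diff))

def word_distance(word_a, word_b):
    # Whole-word distance as the sum of distances between corresponding
    # pairs of frames.  word_a and word_b are arrays of shape
    # (number_of_frames, number_of_features), assumed here to be the same
    # length and already aligned frame for frame.
    assert word_a.shape == word_b.shape
    return sum(frame_distance(a, b) for a, b in zip(word_a, word_b))

def recognize(unknown_word, templates):
    # templates: dictionary mapping each word label to its stored template.
    # The incoming word is compared in turn with every template and the
    # label of the most similar (smallest-distance) template is returned.
    return min(templates,
               key=lambda label: word_distance(unknown_word, templates[label]))

In practice the equal-length, pre-aligned assumption rarely holds, which is why the timescale and alignment methods of Sections 8.5 and 8.6 are needed; the sketch only illustrates the bookkeeping of template matching.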
`
8.3.1 Filter-bank analysis
`
`The most obvious approach in choosing a distance metric which has some of the
`desirable properties is to use some representation of the short-term power spectrum.
`It has been explained in Chapter 2 how the short-term spectrum can represent the
`effects of moving formants, excitation spectrum, etc.
`Although in tone languages pitch needs to be taken into account, in Western
`languages there is normally only slight correlation between pitch variations and the
`phonetic content of a word. The likely idiosyncratic variations of pitch that will
`occur from occasion to occasion mean that, except for tone languages, it is
`normally safer to ignore pitch in whole-word pattern-matching recognizers. Even
`for tone languages it is probably desirable to analyse pitch variations separately
`from effects due to the vocal tract configuration. It is best, therefore, to make the
`bandwidth of the spectral resolution such that it will not resolve the harmonics of
`the fundamental of voiced speech. Because the excitation periodicity is evident in
`the amplitude variations of the output from a broad-band analysis, it is also
`necessary to apply some time-smoothing to remove it. Such time-smoothing will
`also remove most of the fluctuations
`that result from randomness in turbulent
`excitation.
`At higher frequencies the precise formant positions become less significant,
and the resolving power of the ear (critical bandwidth - see Chapter 3) is such that
`detailed spectral information is not available to human listeners at high frequencies.
`It is therefore permissible to make the spectral analysis less selective, such that the
`effective filter bandwidth is several times the typical harmonic spacing. The desired
analysis can thus be provided by a set of bandpass filters whose bandwidths and
spacings are roughly equal to those of critical bands and whose range of centre
frequencies covers the frequencies most important for speech perception (say from
300 Hz up to around 5 kHz). The total number of band-pass filters is therefore not
likely to be more than about 20, and successful results have been achieved with as
few as 10. When the necessary time-smoothing is included, the feature vector will
represent the signal power in the filters averaged over the frame interval.

Figure 8.1 Spectrographic displays of a 10-channel filter-bank analysis (with a non-linear
frequency spacing of the channels), shown for one example of the word "three" and two
examples of the word "eight". It can be seen that the examples of "eight" are generally similar,
although the lower one has a shorter gap for the [t] and a longer burst.
`The usual name for this type of speech analysis is filter-bank analysis.
`Whether it is provided by a bank of discrete filters, implemented in analogue or
`digital form, or is implemented by sampling the outputs from short-term Fourier
`transforms, is a matter of engineering convenience. Figure 8.1 displays word
patterns from a typical 10-channel filter-bank analyser for two examples of one
`word and one example of another. It can be seen from the frequency scales that the
`channels are closer together in the lower-frequency regions.
`A consequence of removing the effect of the fundamental frequency and of
`using filters at least as wide as critical bands is to reduce the amount of information
`needed to describe a word pattern to much less than is needed for the waveform.
`Thus storage and computation in the pattern-matching process are much reduced.
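As a rough illustration only (not the book's implementation), the Python sketch below builds such a filter bank by sampling the outputs of short-term Fourier transforms: the signal is cut into frames, each frame's power spectrum is summed into about ten channels whose edges are spaced non-linearly between roughly 300 Hz and 5 kHz, and the per-frame summation provides the time-smoothing described above. The logarithmic channel spacing, frame length and other parameter values are assumptions chosen for the example.

import numpy as np

def filterbank_features(signal, sample_rate, num_channels=10,
                        frame_ms=20, fmin=300.0, fmax=5000.0):
    # Returns one feature vector of smoothed channel powers per frame.
    frame_len = int(sample_rate * frame_ms / 1000)
    # Non-linearly spaced channel edges, so that the effective bandwidths
    # widen with frequency in the spirit of critical bands (the exact
    # spacing here is illustrative, not prescribed by the text).
    edges = np.geomspace(fmin, fmax, num_channels + 1)
    features = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame)) ** 2          # power spectrum
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
        channels = [power[(freqs >= lo) & (freqs < hi)].sum()
                    for lo, hi in zip(edges[:-1], edges[1:])]
        features.append(channels)
    return np.array(features)   # shape: (number_of_frames, num_channels)

Equivalent results could be obtained with a bank of discrete analogue or digital filters; as the text notes, the choice is a matter of engineering convenience.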
`
`8.3.2 Level normalization
`
`Mean speech level normally varies by a few dB over periods of a few seconds, and
`changes in spacing between the microphone and the speaker's mouth can also cause
`changes of several dB. As these changes will be of no phonetic significance, it is
`desirable to minimize their effects on the distance metric. Use of filter-bank power
`directly gives most weight to more intense regions of the spectrum, where a change
`of 2 or 3 dB will represent a very large absolute difference. On the other hand, a
`3 dB difference in one of the weaker formants might be of similar phonetic
`significance, but will cause a very small effect on the power. This difficulty can be
`avoided to a large extent by representing the power logarithmically, so that similar
`power ratios have the same effect on the distance calculation whether they occur in
`intense or weak spectral regions. Most of the phonetically unimportant variations
`discussed above will then have much less weight in the distance calculation than the
`differences in spectrum level that result from formant movements, etc.
`Although comparing levels logarithmically is advantageous, care must be
exercised in very low-level sounds, such as weak fricatives or during stop-
`consonant closures. At these times the logarithm of the level in a channel will
`depend more on the ambient background noise level than on the speech signal. If
`the speaker is in a very quiet environment the logarithmic level may suffer quite
`wide irrelevant variations as a result of breath noise or the rustle of clothing. One
`way of avoiding this difficulty is to add a small constant to the measured level
`before taking logarithms. The value of the constant would be chosen to dominate
`the greatest expected background noise level, but to be small compared with the
`level usually found during speech.
`Differences in vocal effort will mainly have the effect of adding a constant to
`all components of the log spectrum, rather than changing the shape of the spectrum
`cross-section. Such differences can be made to have no effect on the distance
`metric by subtracting the mean of the logarithm of the spectrum level of each frame
`from all the separate spectrum components for the frame. In practice this amount of
`level compensation is undesirable because extreme level variations are of some
`phonetic significance. For e~ample, . a substa~tial part of the acoustic difference
`between [ f] and any vowel 1s the difference m level, which can be as much as
`
30 dB. Recognition accuracy might well suffer if level differences of this
magnitude were ignored. A useful compromise is to compensate only partly for
level variations, by subtracting some fraction (say in the range 0.7 to 0.9) of the
mean logarithmic level from each spectral channel. There are also several other
techniques for achieving a similar effect.

Figure 8.2 Graphical representation of the distance between frames of the spectrograms
shown in Figure 8.1. The larger the blob the smaller the distance. It can be seen that there is a
continuous path of fairly small distances between the bottom left and top right when the two
examples of "eight" are compared, but not when "eight" is compared with "three".
`A suitable distance metric for use with a filter bank is the sum of the squared
`differences between the logarithms of power levels in corresponding channels (i.e.
`the square of the Euclidean distance in the multi-dimensional space). A graphical
`representation of the Euclidean distance between frames for the words used in
Figure 8.1 is shown in Figure 8.2.
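A minimal sketch of these level-handling steps is given below (again in Python and not taken from the book); the noise-floor constant and the compensation fraction of 0.8 are illustrative values within the ranges discussed above, and the channel powers are assumed to come from a filter-bank analysis such as that of Section 8.3.1.

import numpy as np

def log_levels(channel_powers, noise_floor=1e-4):
    # Convert channel powers to logarithmic (dB) levels.  The small constant
    # added before taking logarithms should be chosen to dominate the largest
    # expected background noise level while remaining small compared with the
    # levels usually found during speech, so that weak fricatives and stop
    # closures do not produce wide irrelevant variations.
    return 10.0 * np.log10(channel_powers + noise_floor)

def partially_normalize(log_frame, fraction=0.8):
    # Subtract only a fraction (0.7 to 0.9) of the frame's mean log level:
    # gross differences in vocal effort are largely removed, but phonetically
    # significant level differences (e.g. [f] versus a vowel) are not
    # eliminated entirely.
    return log_frame - fraction * np.mean(log_frame)

def frame_distance(log_frame_a, log_frame_b):
    # Sum of squared differences between the log levels in corresponding
    # channels, i.e. the square of the Euclidean distance in the
    # multi-dimensional feature space.
    diff = partially_normalize(log_frame_a) - partially_normalize(log_frame_b)
    return float(np.dot(diff, diff))

Computing this distance for every pair of frames from two words gives a grid of local distances like the one pictured in Figure 8.2, whose low-distance paths motivate the time-alignment methods treated later in the chapter.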
