BIING-HWANG JUANG
`
`
`
`
`FUNDAMENTALS
`OF SPEECH
`RECOGNITION
`
`Lawrence Rabiner
`Biing-Hwang Juang
`
Pearson Education
`
Prentice Hall PTR
Upper Saddle River, New Jersey 07458
`
`
`
`
Library of Congress Cataloging-in-Publication Data

Rabiner, Lawrence R., 1943-
Fundamentals of speech recognition / Lawrence Rabiner, Biing-Hwang Juang.
p. cm.
Includes bibliographical references and index.
ISBN 0-13-015157-2
1. Automatic speech recognition. 2. Speech processing systems.
I. Juang, B. H. (Biing-Hwang) II. Title.
TK7895.S65R33 1993
006.4'54--dc20

92-34093
CIP
`
Editorial production and interior design: bookworks
Acquisitions Editor: Karen Gettman
Manufacturing Buyer: Mary Elizabeth McCartney
Cover Designer: Ben Santora

© 1993 by AT&T. All rights reserved.
Published by Prentice Hall PTR
Prentice-Hall, Inc.
A Pearson Education Company
Upper Saddle River, NJ 07458
`
`-
`
`The publisher offers discounts on this book when ordered
`in bulk quantities. For more information, contact:
`Corporate Sales Department
`PTR Prentice Hall
`
`All rights reserved. No part of this book may be
`reproduced, in any form or by any means,
`without permission in writing from the publisher.
`
`Printed in the United States of America
`10 9 8
`
ISBN 0-13-015157-2
`
Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Pearson Education Asia Pte. Ltd., Singapore
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
`
`
`
`
CONTENTS

LIST OF FIGURES   xiii

LIST OF TABLES   xxix

PREFACE   xxxi

1 FUNDAMENTALS OF SPEECH RECOGNITION   1

1.1 Introduction   1
1.2 The Paradigm for Speech Recognition   3
1.3 Outline   3
1.4 A Brief History of Speech-Recognition Research   6

2 THE SPEECH SIGNAL: PRODUCTION, PERCEPTION, AND ACOUSTIC-PHONETIC CHARACTERIZATION   11

2.1 Introduction   11
    2.1.1 The Process of Speech Production and Perception in Human Beings   11
2.2 The Speech-Production Process   14
2.3 Representing Speech in the Time and Frequency Domains   17
2.4 Speech Sounds and Features   20
`
`
`
`
    2.4.1 The Vowels   21
    2.4.2 Diphthongs   28
    2.4.3 Semivowels   29
    2.4.4 Nasal Consonants   30
    2.4.5 Unvoiced Fricatives   31
    2.4.6 Voiced Fricatives   32
    2.4.7 Voiced and Unvoiced Stops   33
    2.4.8 Review Exercises   37
2.5 Approaches to Automatic Speech Recognition by Machine   42
    2.5.1 Acoustic-Phonetic Approach to Speech Recognition   45
    2.5.2 Statistical Pattern-Recognition Approach to Speech Recognition   51
    2.5.3 Artificial Intelligence (AI) Approaches to Speech Recognition   52
    2.5.4 Neural Networks and Their Application to Speech Recognition   54
2.6 Summary   65

3 SIGNAL PROCESSING AND ANALYSIS METHODS FOR SPEECH RECOGNITION   69

3.1 Introduction   69
    3.1.1 Spectral Analysis Models   70
3.2 The Bank-of-Filters Front-End Processor   73
    3.2.1 Types of Filter Bank Used for Speech Recognition   77
    3.2.2 Implementations of Filter Banks   80
    3.2.3 Summary of Considerations for Speech-Recognition Filter Banks   92
    3.2.4 Practical Examples of Speech-Recognition Filter Banks   93
    3.2.5 Generalizations of Filter-Bank Analyzer   95
3.3 Linear Predictive Coding Model for Speech Recognition   97
    3.3.1 The LPC Model   100
    3.3.2 LPC Analysis Equations   101
    3.3.3 The Autocorrelation Method   103
    3.3.4 The Covariance Method   106
    3.3.5 Review Exercise   107
    3.3.6 Examples of LPC Analysis   108
    3.3.7 LPC Processor for Speech Recognition   112
    3.3.8 Review Exercises   117
    3.3.9 Typical LPC Analysis Parameters   121
3.4 Vector Quantization   122
    3.4.1 Elements of a Vector Quantization Implementation   123
    3.4.2 The VQ Training Set   124
    3.4.3 The Similarity or Distance Measure   125
    3.4.4 Clustering the Training Vectors   125
    3.4.5 Vector Classification Procedure   128
    3.4.6 Comparison of Vector and Scalar Quantizers   129
`
`
`
`
    3.4.7 Extensions of Vector Quantization   129
    3.4.8 Summary of the VQ Method   131
3.5 Auditory-Based Spectral Analysis Models   132
    3.5.1 The EIH Model   134
3.6 Summary   139

4 PATTERN-COMPARISON TECHNIQUES   141

4.1 Introduction   141
4.2 Speech (Endpoint) Detection   143
4.3 Distortion Measures-Mathematical Considerations   149
4.4 Distortion Measures-Perceptual Considerations   150
4.5 Spectral-Distortion Measures   154
    4.5.1 Log Spectral Distance   158
    4.5.2 Cepstral Distances   163
    4.5.3 Weighted Cepstral Distances and Liftering   166
    4.5.4 Likelihood Distortions   171
    4.5.5 Variations of Likelihood Distortions   177
    4.5.6 Spectral Distortion Using a Warped Frequency Scale   183
    4.5.7 Alternative Spectral Representations and Distortion Measures   190
    4.5.8 Summary of Distortion Measures-Computational Considerations   193
4.6 Incorporation of Spectral Dynamic Features into the Distortion Measure   194
4.7 Time Alignment and Normalization   200
    4.7.1 Dynamic Programming-Basic Considerations   204
    4.7.2 Time-Normalization Constraints   208
    4.7.3 Dynamic Time-Warping Solution   221
    4.7.4 Other Considerations in Dynamic Time Warping   229
    4.7.5 Multiple Time-Alignment Paths   232
4.8 Summary   238

5 SPEECH RECOGNITION SYSTEM DESIGN AND IMPLEMENTATION ISSUES   242

5.1 Introduction   242
5.2 Application of Source-Coding Techniques to Recognition   244
    5.2.1 Vector Quantization and Pattern Comparison Without Time Alignment   244
    5.2.2 Centroid Computation for VQ Codebook Design   246
    5.2.3 Vector Quantizers with Memory   254
    5.2.4 Segmental Vector Quantization   256
    5.2.5 Use of a Vector Quantizer as a Recognition Preprocessor   257
    5.2.6 Vector Quantization for Efficient Pattern Matching   263
5.3 Template Training Methods   264
    5.3.1 Casual Training   265
`
`
`
`
    5.3.2 Robust Training   266
    5.3.3 Clustering   267
5.4 Performance Analysis and Recognition Enhancements   274
    5.4.1 Choice of Distortion Measures   274
    5.4.2 Choice of Clustering Methods and kNN Decision Rule   277
    5.4.3 Incorporation of Energy Information   280
    5.4.4 Effects of Signal Analysis Parameters   282
    5.4.5 Performance of Isolated Word-Recognition Systems   284
5.5 Template Adaptation to New Talkers   285
    5.5.1 Spectral Transformation   286
    5.5.2 Hierarchical Spectral Clustering   288
5.6 Discriminative Methods in Speech Recognition   291
    5.6.1 Determination of Word Equivalence Classes   294
    5.6.2 Discriminative Weighting Functions   297
    5.6.3 Discriminative Training for Minimum Recognition Error   302
5.7 Speech Recognition in Adverse Environments   305
    5.7.1 Adverse Conditions in Speech Recognition   306
    5.7.2 Dealing with Adverse Conditions   309
5.8 Summary   317

6 THEORY AND IMPLEMENTATION OF HIDDEN MARKOV MODELS   321

6.1 Introduction   321
6.2 Discrete-Time Markov Processes   322
6.3 Extensions to Hidden Markov Models   325
    6.3.1 Coin-Toss Models   326
    6.3.2 The Urn-and-Ball Model   328
    6.3.3 Elements of an HMM   329
    6.3.4 HMM Generator of Observations   330
6.4 The Three Basic Problems for HMMs   333
    6.4.1 Solution to Problem 1-Probability Evaluation   334
    6.4.2 Solution to Problem 2-"Optimal" State Sequence   337
    6.4.3 Solution to Problem 3-Parameter Estimation   342
    6.4.4 Notes on the Reestimation Procedure   347
6.5 Types of HMMs   348
6.6 Continuous Observation Densities in HMMs   350
6.7 Autoregressive HMMs   352
6.8 Variants on HMM Structures-Null Transitions and Tied States   356
6.9 Inclusion of Explicit State Duration Density in HMMs   358
6.10 Optimization Criterion-ML, MMI, and MDI   362
6.11 Comparisons of HMMs   364
6.12 Implementation Issues for HMMs   365
    6.12.1 Scaling   365
    6.12.2 Multiple Observation Sequences   369
    6.12.3 Initial Estimates of HMM Parameters   370
`
`
`
`
    6.12.4 Effects of Insufficient Training Data   370
    6.12.5 Choice of Model   371
6.13 Improving the Effectiveness of Model Estimates   372
    6.13.1 Deleted Interpolation   372
    6.13.2 Bayesian Adaptation   373
    6.13.3 Corrective Training   376
6.14 Model Clustering and Splitting   377
6.15 HMM System for Isolated Word Recognition   378
    6.15.1 Choice of Model Parameters   379
    6.15.2 Segmental K-Means Segmentation into States   382
    6.15.3 Incorporation of State Duration into the HMM   384
    6.15.4 HMM Isolated-Digit Performance   385
6.16 Summary   386

7 SPEECH RECOGNITION BASED ON CONNECTED WORD MODELS   390

7.1 Introduction   390
7.2 General Notation for the Connected Word-Recognition Problem   393
7.3 The Two-Level Dynamic Programming (Two-Level DP) Algorithm   395
    7.3.1 Computation of the Two-Level DP Algorithm   399
7.4 The Level Building (LB) Algorithm   400
    7.4.1 Mathematics of the Level Building Algorithm   401
    7.4.2 Multiple Level Considerations   405
    7.4.3 Computation of the Level Building Algorithm   407
    7.4.4 Implementation Aspects of Level Building   410
    7.4.5 Integration of a Grammar Network   414
    7.4.6 Examples of LB Computation of Digit Strings   416
7.5 The One-Pass (One-State) Algorithm   416
7.6 Multiple Candidate Strings   420
7.7 Summary of Connected Word Recognition Algorithms   423
7.8 Grammar Networks for Connected Digit Recognition   425
7.9 Segmental K-Means Training Procedure   427
7.10 Connected Digit Recognition Implementation   428
    7.10.1 HMM-Based System for Connected Digit Recognition   429
    7.10.2 Performance Evaluation on Connected Digit Strings   430
7.11 Summary   432

8 LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION   434

8.1 Introduction   434
8.2 Subword Speech Units   435
8.3 Subword Unit Models Based on HMMs   439
8.4 Training of Subword Units   441
`
`'
`~
`(
`~
`\
`
`r::
`...
`..,
`...
`i:
`)
`"'
`
`...
`I
`
`i'.
`~
`(
`"
`
`IPR2023-00035
`Apple EX1015 Page 8
`
`
`
8.5 Language Models for Large Vocabulary Speech Recognition   447
8.6 Statistical Language Modeling   448
8.7 Perplexity of the Language Model   449
8.8 Overall Recognition System Based on Subword Units   450
    8.8.1 Control of Word Insertion/Word Deletion Rate   454
    8.8.2 Task Semantics   454
    8.8.3 System Performance on the Resource Management Task   454
8.9 Context-Dependent Subword Units   458
    8.9.1 Creation of Context-Dependent Diphones and Triphones   460
    8.9.2 Using Interword Training to Create CD Units   461
    8.9.3 Smoothing and Interpolation of CD PLU Models   462
    8.9.4 Smoothing and Interpolation of Continuous Densities   464
    8.9.5 Implementation Issues Using CD Units   464
    8.9.6 Recognition Results Using CD Units   467
    8.9.7 Position Dependent Units   469
    8.9.8 Unit Splitting and Clustering   470
    8.9.9 Other Factors for Creating Additional Subword Units   475
    8.9.10 Acoustic Segment Units   476
8.10 Creation of Vocabulary-Independent Units   477
8.11 Semantic Postprocessor for Recognition   478
8.12 Summary   478

9 TASK ORIENTED APPLICATIONS OF AUTOMATIC SPEECH RECOGNITION   482

9.1 Introduction   482
9.2 Speech-Recognizer Performance Scores   484
9.3 Characteristics of Speech-Recognition Applications   485
    9.3.1 Methods of Handling Recognition Errors   486
9.4 Broad Classes of Speech-Recognition Applications   487
9.5 Command-and-Control Applications   488
    9.5.1 Voice Repertory Dialer   489
    9.5.2 Automated Call-Type Recognition   490
    9.5.3 Call Distribution by Voice Commands   491
    9.5.4 Directory Listing Retrieval   491
    9.5.5 Credit Card Sales Validation   492
9.6 Projections for Speech Recognition   493

INDEX   497
`
`
`
`
LIST OF FIGURES

1.1 General block diagram of a task-oriented speech-recognition system.   3
2.1 Schematic diagram of speech-production/speech-perception process (after Flanagan [unpublished]).   12
2.2 Alternative view of speech-production/speech-perception process (after Rabiner and Levinson [1]).   13
2.3 Mid-sagittal plane X-ray of the human vocal apparatus (after Flanagan et al. [2]).   15
2.4 Schematic view of the human vocal mechanism (after Flanagan [3]).   16
2.5 Glottal volume velocity and resulting sound pressure at the start of a voiced sound (after Ishizaka and Flanagan [4]).   16
2.6 Schematic representation of the complete physiological mechanism of speech production (after Flanagan [3]).   17
2.7 Waveform plot of the beginning of the utterance "It's time."   18
2.8 Wideband and narrowband spectrograms and speech amplitude for the utterance "Every salt breeze comes from the sea."   19
`
`
`
`
`~
`
`xiv
`
`List of Figures
`
`2.9 Wideband spectrogram and formant frequency representation of the
`utterance "Why do I owe you a letter" (after Atal and
`Hanauer [5]).
`2.10 Wideband spectrogram and intensity contour of the phrase "Should
`we chase."
`2.11 The speech waveform and a segmentation and labeling of the
`constituent sounds of the phrase "Should we chase."
`2.12 Chart of the classification of the standard phonemes of American
`English into broad sound classes.
`2.13 Articulatory configurations for typical vowel sounds (after
`Flanagan [3]).
`2.14 Acoustic waveform plots of typical vowel sounds.
`2.15 Spectrograms of the vowel sounds.
`2.16 Measured frequencies of first and second formants for a wide range of
`talkers for several vowels (after Peterson & Barney [7]).
`2.17 The vowel triangle with centroid positions of the common
`vowels.
`2.18 Spectrogram plots of four diphthongs.
`2.19 Time variation of the first two formants for the diphthongs (after
`Holbrook and Fairbanks [9]).
`2.20 Waveforms for the sequences /a-m-a/ and /a-n-a/.
`2.21 Spectrograms of the sequences /a-m-a/ and a-n-a/.
`2.22 Waveforms for the sounds /f/, /s/ and /sh/ in the context /a-x-a/ where
`/x/ is the unvoiced fricative.
`2.23 Spectrogram comparisons of the sounds /a-f-a/, /a-s-a/ and
`/a-sh-a/.
`2.24 Waveforms for the sequences /a-v-a/ and /a-zh-a/.
`2.25 Spectrograms for the sequences /a-v-a/ and /a-zh-a/.
`2.26 Waveform for the sequence /a-b-a/.
`2.27 Waveforms for the sequences /a-p-a/ and /a-t-a/.
`2.28 Spectrogram comparisons of the sequences of voiced (/a-b-a/) and
`voiceless (/a-p-a/ and /a-t-a/) stop consonants.
`2.29 Spectrograms of the 11 isolated digits, 0 through 9 plus oh, in random
`sequence.
`2.30 Spectrograms of two connected digit sequences.
`2.31 Phoneme lattice for word string.
`2.32 Block diagram of acoustic-phonetic speech-recognition
`system.
`2.33 Acoustic-phonetic vowel classifier.
`2.34 Binary tree speech sound classifier.
`2.35 Segmentation and labeling for word sequence "seven-six."
`2.36 Segmentation and labeling for word sequence "did you."
`2.37 Block diagram of pattern-recognition speech recognizer.
`
`21
`
`22
`
`23
`
`25
`
`25
`26
`27
`
`27
`
`28
`30
`
`31
`32
`33
`
`34
`
`34
`35
`36
`36
`37
`
`38
`
`40
`41
`43
`
`45
`47
`48
`49
`50
`51
`
`
`
`
2.38 Illustration of the word correction capability of syntax in speech recognition (after Rabiner and Levinson [1]).   54
2.39 A bottom-up approach to knowledge integration for speech recognition.   55
2.40 A top-down approach to knowledge integration for speech recognition.   55
2.41 A blackboard approach to knowledge integration for speech recognition (after Lesser et al. [11]).   56
2.42 Conceptual block diagram of a human speech understanding system.   56
2.43 Simple computation element of a neural network.   57
2.44 McCullough-Pitts model of neurons (after McCullough and Pitts [12]).   58
2.45 Single-layer and three-layer perceptrons.   59
2.46 A multilayer perceptron for classifying steady vowels based on F1, F2 measurements (after Lippmann [13]).   59
2.47 Model of a recurrent neural network.   60
2.48 A fixed point interpretation of the Hopfield network.   60
2.49 The time delay neural network computational element (after Waibel et al. [14]).   63
2.50 A TDNN architecture for recognizing /b/, /d/ and /g/ (after Waibel et al. [14]).   64
2.51 A combination neural network and matched filter for speech recognition (after Tank & Hopfield [15]).   65
2.52 Example illustrating the combination of a neural network and a set of matched filters (after Tank & Hopfield [15]).   66
2.53 The hidden control neural network (after Levin [16]).   67
3.1 (a) Pattern recognition and (b) acoustic phonetic approaches to speech recognition.   71
3.2 Bank-of-filters analysis model.   72
3.3 LPC analysis model.   72
3.4 Complete bank-of-filters analysis model.   74
3.5 Typical waveforms and spectra for analysis of a pure sinusoid in the filter-bank model.   75
3.6 Typical waveforms and spectra of a voiced speech signal in the bank-of-filters analysis model.   76
3.7 Ideal (a) and realistic (b) set of filter responses of a Q-channel filter bank covering the frequency range Fs/N to (Q + ½)Fs/N.   77
3.8 Ideal specifications of a 4-channel octave band-filter bank (a), a 12-channel third-octave band filter bank (b), and a 7-channel critical band scale filter bank (c) covering the telephone bandwidth range (200-3200 Hz).   79
`
`
`
`
3.9 The variation of bandwidth with frequency for the perceptually based critical band scale.   79
3.10 The signals s(m) and w(n - m) used in evaluation of the short-time Fourier transform.   81
3.11 Short-time Fourier transform using a long (500 points or 50 msec) Hamming window on a section of voiced speech.   82
3.12 Short-time Fourier transform using a short (50 points or 5 msec) Hamming window on a section of voiced speech.   82
3.13 Short-time Fourier transform using a long (500 points or 50 msec) Hamming window on a section of unvoiced speech.   83
3.14 Short-time Fourier transform using a short (50 points or 5 msec) Hamming window on a section of unvoiced speech.   83
3.15 Linear filter interpretation of the short-time Fourier transform.   84
3.16 FFT implementation of a uniform filter bank.   89
3.17 Direct form implementation of an arbitrary nonuniform filter bank.   89
3.18 Two arbitrary nonuniform filter-bank ideal filter specifications consisting of either 3 bands (part a) or 7 bands (part b).   90
3.19 Tree structure implementation of a 4-band, octave-spaced, filter bank.   92
3.20 Window sequence, w(n), (part a), the individual filter response (part b), and the composite response (part c) of a Q = 15 channel, uniform filter bank, designed using a 101-point Kaiser window smoothed lowpass window (after Dautrich et al. [4]).   94
3.21 Window sequence, w(n), (part a), the individual filter responses (part b), and the composite response (part c) of a Q = 15 channel, uniform filter bank, designed using a 101-point Kaiser window directly as the lowpass window (after Dautrich et al. [4]).   95
3.22 Individual channel responses (parts a to d) and composite filter response (part e) of a Q = 4 channel, octave band design, using 101-point FIR filters in each band (after Dautrich et al. [4]).   96
3.23 Individual channel responses and composite filter response of a Q = 12 channel, 1/3 octave band design, using 201-point FIR filters in each band (after Dautrich et al. [4]).   97
3.24 Individual channel responses (parts a to g) and composite filter response (part h) of a Q = 7 channel critical band filter bank design (after Dautrich et al. [4]).   98
3.25 Individual channel responses and composite filter response of a Q = 13 channel, critical band spacing filter bank, using highly overlapping filters in frequency (after Dautrich et al. [4]).   99
3.26 Generalization of filter-bank analysis model.   99
3.27 Linear prediction model of speech.   100
3.28 Speech synthesis model based on LPC model.   101
`
`
`
`
3.29 Illustration of speech sample, weighted speech section, and prediction error for voiced speech where the prediction error is large at the beginning of the section.   104
3.30 Illustration of speech sample, weighted speech section, and prediction error for voiced speech where the prediction error is large at the end of the section.   104
3.31 Illustration of speech sample, weighted speech section, and prediction error for unvoiced speech where there are almost no artifacts at the boundaries of the section.   105
3.32 Typical signals and spectra for LPC autocorrelation method for a segment of speech spoken by a male speaker (after Rabiner et al. [8]).   108
3.33 Typical signals and spectra for LPC autocorrelation method for a segment of speech spoken by a female speaker (after Rabiner et al. [8]).   109
3.34 Examples of signal (differentiated) and prediction error for several vowels (after Strube [9]).   110
3.35 Variation of the RMS prediction error with the number of predictor coefficients, p (after Atal and Hanauer [10]).   110
3.36 Spectra for a vowel sound for several values of predictor order, p.   111
3.37 Block diagram of LPC processor for speech recognition.   113
3.38 Magnitude spectrum of LPC preemphasis network for a = 0.95.   113
3.39 Blocking of speech into overlapping frames.   114
3.40 Block diagram of the basic VQ training and classification structure.   124
3.41 Partitioning of a vector space into VQ cells with each cell represented by a centroid vector.   126
3.42 Flow diagram of binary split codebook generation algorithm.   127
3.43 Codebook distortion versus codebook size (measured in bits per frame) for both voiced and unvoiced speech (after Juang et al. [12]).   128
3.44 Codebook vector locations in the F1-F2 plane (for a 32-vector codebook) superimposed on the vowel ellipses (after Juang et al. [12]).   128
3.45 Model and distortion error spectra for scalar and vector quantizers (after Juang et al. [12]).   130
3.46 Plots and histograms of temporal distortion for scalar and vector quantizers (after Juang et al. [12]).   131
3.47 Physiological model of the human ear.   132
3.48 Expanded view of the middle and inner ear mechanics.   133
3.49 Block diagram of the EIH model (after Ghitza [13]).   135
`
`~
`
`;;;;
`
`-
`
`IPR2023-00035
`Apple EX1015 Page 14
`
`
`
3.50 Frequency response curves of a cat's basilar membrane (after Ghitza [13]).   136
3.51 Magnitude of EIH for vowel /o/ showing the time-frequency resolution (after Ghitza [13]).   136
3.52 Operation of the EIH model for a pure sinusoid (after Ghitza [13]).   137
3.53 Comparison of Fourier and EIH log spectra for clean and noisy speech signals (after Ghitza [13]).   138
4.1 Contour of digit recognition accuracy (percent correct) as a function of endpoint perturbation (in ms) in a multispeaker digit-recognition experiment. Both the initial (beginning point) and the final (ending point) boundary of the detected speech signal were varied (after Wilpon et al. [2]).   144
4.2 Example of mouth click preceding a spoken word (after Wilpon et al. [2]).   145
4.3 Example of breathy speech due to heavy breathing while speaking (after Wilpon et al. [2]).   146
4.4 Example of click produced at the end of a spoken word (after Wilpon et al. [2]).   147
4.5 Block diagram of the explicit approach to speech endpoint detection.   147
4.6 Block diagram of the implicit approach to speech-endpoint detection.   148
4.7 Examples of word boundaries as determined by the implicit endpoint detection algorithm.   148
4.8 Block diagram of the hybrid approach to speech endpoint detection.   149
4.9 Block diagram of typical speech activity detection algorithm.   149
4.10 LPC pole frequency JNDs as a function of the pole bandwidth; the blank circles denote positive frequency perturbations, and the solid circles represent negative frequency perturbations; the fitting curves are parabolic (after Erell et al. [7]).   153
4.11 LPC pole bandwidth JNDs, in a logarithmic scale, as a function of the pole bandwidth itself (after Erell et al. [7]).   154
4.12 Two typical FFT power spectra, S(ω), of the sound /æ/ in a log scale and their difference magnitude |V(ω)| as a function of frequency.   159
4.13 LPC model spectra corresponding to the FFT spectra in Figure 4.12, plotted also in a log scale, and their difference magnitude |V(ω)| as a function of frequency.   159
4.14 Two typical FFT power spectra, S(ω), of the sound /sh/ in a log scale and their difference magnitude |V(ω)| as a function of frequency.   160
`
`
`
`
4.15 LPC model spectra corresponding to the FFT spectra in Figure 4.14, plotted also in a log scale, and their difference magnitude |V(ω)| as a function of frequency.   160
4.16 Typical FFT power spectra of the sounds /æ/ and /i/ respectively and their difference magnitude as a function of frequency.   161
4.17 LPC model spectra corresponding to the FFT spectra in Figure 4.16 and their difference magnitude |V(ω)| as a function of frequency.   161
4.18 Scatter plot of d_c, the cepstral distance, versus 2d_c(L), the truncated cepstral distance (multiplied by 2), for 800 pairs of all-pole model spectra; the truncation is at L = 20 (after Gray and Markel [9]).   165
4.19 Scatter plot of d_c, the cepstral distance, versus 2d_c(L), the truncated cepstral distance (multiplied by 2), for 800 pairs of all-pole model spectra; the truncation is at L = 30 (after Gray and Markel [9]).   165
4.20 Effects of cepstral liftering on a log LPC spectrum, as a function of the lifter length (L = 8 to 16) (after Juang et al. [11]).   169
4.21 Comparison of (a) original sequence of LPC log magnitude spectra; (b) liftered LPC log magnitude spectra, and (c) liftered log magnitude spectra (after Juang et al. [11]).   170
4.22 Comparison of the distortion integrands V²(ω)/2 and e^V(ω) − V(ω) − 1 (after Gray and Markel [9]).   173
4.23 A scatter plot of d_IS(1/|A_p|², 1/|A|²) + 1 versus d_IS(1/|A|², 1/|A_p|²) + 1 as measured from 6800 pairs of speech model spectra.   176
4.24 A linear system with transfer function H(z) = A(z)/B(z).   177
4.25 A scatter diagram of the log spectral distance versus the COSH distortion as measured from a database of 6800 pairs of speech spectra.   178
4.26 LPC spectral pair and various spectral weighting functions; W1(ω), W2(ω), W3(ω) and W4(ω) are defined in (4.71), (4.72), (4.73), and (4.74), respectively.   180
4.27a An example of the cosh spectral deviation F4(ω) and its weighted version using W3(ω) = 1/|A(e^jω)|² as the weighting function; in this case the two spectra are of comparable power levels.   182
4.27b An example of the cosh spectral deviation F4(ω) and its weighted version using W3(ω) = 1/|A(e^jω)|² as the weighting function; in this case, the two spectra have significantly different power levels.   182
`
`
`
`
4.28 Subjectively perceived pitch, in mels, of a tone as a function of the frequency, in Hz; the upper curve relates the subjective pitch to frequency in a linear scale and the lower curve shows the subjective pitch as a function of the frequency in a logarithmic scale (after Stevens and Volkmann [13]).   184
4.29 The critical bandwidth phenomenon; the critical bandwidth as a function of the frequency at the center of the band (after Zwicker, Flottorp and Stevens [14]).   185
4.30 Real part of exp[jθ(b)k] as a function of b, the Bark scale, for different values of k (after Nocerino et al. [16]).   188
4.31 A filter-bank design in which each filter has a triangle bandpass frequency response with bandwidth and spacing determined by a constant mel frequency interval (spacing = 150 mels, bandwidth = 300 mels) (after Davis and Mermelstein [17]).   190
4.32 (a) Series of cylindrical sections concatenated as an acoustic tube model of the vocal tract; (b) the area function of the cylindrical sections in (a) (after Markel and Gray [10]).   191
4.33 A critical band spectrum (a) of a typical vowel sound and the corresponding log spectral differences V_LM (b) and V_CM (c) as functions of the critical band number (after Nocerino et al. [16]).   193
4.34 A trajectory of the (2nd) cepstral coefficient with 2nd-order polynomial (h₁ + h₂t + h₃t²) fitting on short portions of the trajectory; the width for polynomial fitting is 7 points.   197
4.35 Scatter diagram showing the correlation between the "instantaneous" cepstral distance, d_c, and the "differential" or "dynamic" cepstral distance, d_Δc; the correlation index is 0.6.   199
4.36 Linear time alignment for two sequences of different durations.   202
4.37 An example of time normalization of two sequential patterns to a common time index; the time warping functions φx and φy map the individual time indices ix and iy, respectively, to the common time index k.   203
4.38 The optimal path problem-finding the minimum cost path from point 1 to point i in as many moves as needed.   205
4.39 A trellis structure that illustrates the problem of finding the optimal path from point i to point j in M steps.   207
4.40 An example of local continuity constraints expressed in terms of coordinate increments (after Myers et al. [23]).   210
4.41 The effects of global path constraints and range limiting on the allowable regions for time warping functions.   215
4.42 Illustration of the extreme cases, Tx − 1 = Qmax(Ty − 1) or Ty − 1 = Qmax(Tx − 1), where only linear time warping (single straight path) is allowed.   216
`
`
`
`
4.43 Type III local continuity constraints with four types of slope weighting (after Myers et al. [23]).   217
4.44 Type II local continuity constraints with 4 types of slope weighting and their smoothed version in which the slope weights are uniformly redistributed along paths where abrupt weight changes exist (after Myers et al. [23]).   218
4.45 Set of allowable grid points for dynamic programming implementation of local path expansion and contraction by 2 to 1.   225
4.46 The allowable path region for dynamic time alignment with relaxed endpoint constraints.   230
4.47 Set of allowable grid points when opening up the initial point range to 5 frames and the final point range to 9 frames.   231
4.48 The allowable path region for dynamic time alignment with localized range constraints.   232
4.49 Dynamic programming for finding K-best paths implemented in a parallel manner.   234
4.50 The serial dynamic programming algorithm for finding the K-best paths.   236
4.51 Example illustrating the need for nonlinear time alignment of two versions of a spoken word.   239
4.52 Illustration of the effectiveness of dynamic time warping alignment of two versions of a spoken word.   240
5.1 A vector-quantizer-based speech-recognition system.   246
5.2 A trellis quantizer as a finite state machine.   255
5.3 Codebook training for segmental vector quantization.   256
5.4 Block diagram of isolated word recognizer incorporating a word-based VQ preprocessor and a DTW-based postprocessor (after Pan et al. [4]).   258
5.5 Plots of the variation of preprocessor performance parameters P1, E1, and (1 − γ1) as a function of the distortion threshold D' for several codebook sizes for the digits vocabulary (after Pan et al. [4]).   260
5.6 Plots of the variation of preprocessor performance parameters E2 and β as a function of the distortion threshold D' for several codebook sizes for the digits vocabulary (after Pan et al. [4]).   260
5.7 Plots of average fraction of decisions made by the preprocessor, γ, versus preprocessor decision threshold D" for several codebook sizes for the digits vocabulary (after Pan et al. [4]).   262
5.8 Plots of average fraction of candidate words, β, passed on to the postprocessor, versus preprocessor decision threshold D" for several codebook sizes for the digits vocabulary (after Pan et al. [4]).   262
`
`
`
`
5.9 Accumulated DTW distortion scores versus test frame based on casual training with two reference patterns per word (after Rabiner et al. [5]).
5.10 A flow diagram of the UWA clustering procedure (after Wilpon and Rabiner [7]).
5.11 A flow diagram of the MKM clustering procedure (after Wilpon and Rabiner [7]).
5.12 Recognition accuracy (percent c