LAWRENCE RABINER
BIING-HWANG JUANG
`
`IPR2023-00035
`Apple EX1015 Page 1
`
`

`

`FUNDAMENTALS
`OF SPEECH
`RECOGNITION
`
`Lawrence Rabiner
`Biing-Hwang Juang
`
Pearson
Education
`---
`
`Prentice Hall P T R
`Upper Saddle River, New Jersey 07458
`
`
`

`

`Library of Congress Cataloging-in-Publication Data
`
Rabiner, Lawrence R., 1943-
Fundamentals of speech recognition / Lawrence Rabiner, Biing-Hwang
Juang.
p. cm.
Includes bibliographical references and index.
ISBN 0-13-015157-2
1. Automatic speech recognition. 2. Speech processing systems.
I. Juang, B. H. (Biing-Hwang) II. Title.
TK7895.S65R33 1993
006.4'54--dc20
`
`92-34093
`CIP
`
`Editorial production
`and interior design: bookworks
Acquisitions Editor: Karen Gettman
Manufacturing Buyer: Mary Elizabeth McCartney
`
`Cover Designer: Ben Santora
`
© 1993 by AT&T. All rights reserved.
Published by Prentice Hall PTR
Prentice-Hall, Inc.
`A Pearson Education Company
`Upper Saddle River, NJ 07458
`
`
`The publisher offers discounts on this book when ordered
`in bulk quantities. For more information, contact:
`Corporate Sales Department
`PTR Prentice Hall
`
`All rights reserved. No part of this book may be
`reproduced, in any form or by any means,
`without permission in writing from the publisher.
`
`Printed in the United States of America
`10 9 8
`
ISBN 0-13-015157-2
`
Prentice-Hall International (UK) Limited, London
`Prentice-Hall of Australia Pty. Limited, Sydney
`Prentice-Hall Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
`Prentice-Hall of India Private Limited, New Delhi
`Prentice-Hall of Japan, Inc., Tokyo
`Pearson Education Asia Pte. Ltd., Singapore
`Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
`
`
`

`

CONTENTS

LIST OF FIGURES
LIST OF TABLES

PREFACE

1 FUNDAMENTALS OF SPEECH RECOGNITION

1.1 Introduction
1.2 The Paradigm for Speech Recognition
1.3 Outline
1.4 A Brief History of Speech-Recognition Research

2 THE SPEECH SIGNAL: PRODUCTION, PERCEPTION, AND ACOUSTIC-PHONETIC CHARACTERIZATION

2.1 Introduction
2.1.1 The Process of Speech Production and Perception in Human Beings
2.2 The Speech-Production Process
2.3 Representing Speech in the Time and Frequency Domains
2.4 Speech Sounds and Features
2.4.1 The Vowels
2.4.2 Diphthongs
2.4.3 Semivowels
2.4.4 Nasal Consonants
2.4.5 Unvoiced Fricatives
2.4.6 Voiced Fricatives
2.4.7 Voiced and Unvoiced Stops
2.4.8 Review Exercises
2.5 Approaches to Automatic Speech Recognition by Machine
2.5.1 Acoustic-Phonetic Approach to Speech Recognition
2.5.2 Statistical Pattern-Recognition Approach to Speech Recognition
2.5.3 Artificial Intelligence (AI) Approaches to Speech Recognition
2.5.4 Neural Networks and Their Application to Speech Recognition
2.6 Summary

3 SIGNAL PROCESSING AND ANALYSIS METHODS FOR SPEECH RECOGNITION

3.1 Introduction
3.1.1 Spectral Analysis Models
3.2 The Bank-of-Filters Front-End Processor
3.2.1 Types of Filter Bank Used for Speech Recognition
3.2.2 Implementations of Filter Banks
3.2.3 Summary of Considerations for Speech-Recognition Filter Banks
3.2.4 Practical Examples of Speech-Recognition Filter Banks
3.2.5 Generalizations of Filter-Bank Analyzer
3.3 Linear Predictive Coding Model for Speech Recognition
3.3.1 The LPC Model
3.3.2 LPC Analysis Equations
3.3.3 The Autocorrelation Method
3.3.4 The Covariance Method
3.3.5 Review Exercise
3.3.6 Examples of LPC Analysis
3.3.7 LPC Processor for Speech Recognition
3.3.8 Review Exercises
3.3.9 Typical LPC Analysis Parameters
3.4 Vector Quantization
3.4.1 Elements of a Vector Quantization Implementation
3.4.2 The VQ Training Set
3.4.3 The Similarity or Distance Measure
3.4.4 Clustering the Training Vectors
3.4.5 Vector Classification Procedure
3.4.6 Comparison of Vector and Scalar Quantizers
3.4.7 Extensions of Vector Quantization
3.4.8 Summary of the VQ Method
3.5 Auditory-Based Spectral Analysis Models
3.5.1 The EIH Model
3.6 Summary

4 PATTERN-COMPARISON TECHNIQUES

4.1 Introduction
4.2 Speech (Endpoint) Detection
4.3 Distortion Measures-Mathematical Considerations
4.4 Distortion Measures-Perceptual Considerations
4.5 Spectral-Distortion Measures
4.5.1 Log Spectral Distance
4.5.2 Cepstral Distances
4.5.3 Weighted Cepstral Distances and Liftering
4.5.4 Likelihood Distortions
4.5.5 Variations of Likelihood Distortions
4.5.6 Spectral Distortion Using a Warped Frequency Scale
4.5.7 Alternative Spectral Representations and Distortion Measures
4.5.8 Summary of Distortion Measures-Computational Considerations
4.6 Incorporation of Spectral Dynamic Features into the Distortion Measure
4.7 Time Alignment and Normalization
4.7.1 Dynamic Programming-Basic Considerations
4.7.2 Time-Normalization Constraints
4.7.3 Dynamic Time-Warping Solution
4.7.4 Other Considerations in Dynamic Time Warping
4.7.5 Multiple Time-Alignment Paths
4.8 Summary

5 SPEECH RECOGNITION SYSTEM DESIGN AND IMPLEMENTATION ISSUES

5.1 Introduction
5.2 Application of Source-Coding Techniques to Recognition
5.2.1 Vector Quantization and Pattern Comparison Without Time Alignment
5.2.2 Centroid Computation for VQ Codebook Design
5.2.3 Vector Quantizers with Memory
5.2.4 Segmental Vector Quantization
5.2.5 Use of a Vector Quantizer as a Recognition Preprocessor
5.2.6 Vector Quantization for Efficient Pattern Matching
5.3 Template Training Methods
5.3.1 Casual Training
5.3.2 Robust Training
5.3.3 Clustering
5.4 Performance Analysis and Recognition Enhancements
5.4.1 Choice of Distortion Measures
5.4.2 Choice of Clustering Methods and kNN Decision Rule
5.4.3 Incorporation of Energy Information
5.4.4 Effects of Signal Analysis Parameters
5.4.5 Performance of Isolated Word-Recognition Systems
5.5 Template Adaptation to New Talkers
5.5.1 Spectral Transformation
5.5.2 Hierarchical Spectral Clustering
5.6 Discriminative Methods in Speech Recognition
5.6.1 Determination of Word Equivalence Classes
5.6.2 Discriminative Weighting Functions
5.6.3 Discriminative Training for Minimum Recognition Error
5.7 Speech Recognition in Adverse Environments
5.7.1 Adverse Conditions in Speech Recognition
5.7.2 Dealing with Adverse Conditions
5.8 Summary

6 THEORY AND IMPLEMENTATION OF HIDDEN MARKOV MODELS

6.1 Introduction
6.2 Discrete-Time Markov Processes
6.3 Extensions to Hidden Markov Models
6.3.1 Coin-Toss Models
6.3.2 The Urn-and-Ball Model
6.3.3 Elements of an HMM
6.3.4 HMM Generator of Observations
6.4 The Three Basic Problems for HMMs
6.4.1 Solution to Problem 1-Probability Evaluation
6.4.2 Solution to Problem 2-"Optimal" State Sequence
6.4.3 Solution to Problem 3-Parameter Estimation
6.4.4 Notes on the Reestimation Procedure
6.5 Types of HMMs
6.6 Continuous Observation Densities in HMMs
6.7 Autoregressive HMMs
6.8 Variants on HMM Structures-Null Transitions and Tied States
6.9 Inclusion of Explicit State Duration Density in HMMs
6.10 Optimization Criterion-ML, MMI, and MDI
6.11 Comparisons of HMMs
6.12 Implementation Issues for HMMs
6.12.1 Scaling
6.12.2 Multiple Observation Sequences
6.12.3 Initial Estimates of HMM Parameters
6.12.4 Effects of Insufficient Training Data
6.12.5 Choice of Model
6.13 Improving the Effectiveness of Model Estimates
6.13.1 Deleted Interpolation
6.13.2 Bayesian Adaptation
6.13.3 Corrective Training
6.14 Model Clustering and Splitting
6.15 HMM System for Isolated Word Recognition
6.15.1 Choice of Model Parameters
6.15.2 Segmental K-Means Segmentation into States
6.15.3 Incorporation of State Duration into the HMM
6.15.4 HMM Isolated-Digit Performance
6.16 Summary

7 SPEECH RECOGNITION BASED ON CONNECTED WORD MODELS

7.1 Introduction
7.2 General Notation for the Connected Word-Recognition Problem
7.3 The Two-Level Dynamic Programming (Two-Level DP) Algorithm
7.3.1 Computation of the Two-Level DP Algorithm
7.4 The Level Building (LB) Algorithm
7.4.1 Mathematics of the Level Building Algorithm
7.4.2 Multiple Level Considerations
7.4.3 Computation of the Level Building Algorithm
7.4.4 Implementation Aspects of Level Building
7.4.5 Integration of a Grammar Network
7.4.6 Examples of LB Computation of Digit Strings
7.5 The One-Pass (One-State) Algorithm
7.6 Multiple Candidate Strings
7.7 Summary of Connected Word Recognition Algorithms
7.8 Grammar Networks for Connected Digit Recognition
7.9 Segmental K-Means Training Procedure
7.10 Connected Digit Recognition Implementation
7.10.1 HMM-Based System for Connected Digit Recognition
7.10.2 Performance Evaluation on Connected Digit Strings
7.11 Summary

8 LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION

8.1 Introduction
8.2 Subword Speech Units
8.3 Subword Unit Models Based on HMMs
8.4 Training of Subword Units
8.5 Language Models for Large Vocabulary Speech Recognition
8.6 Statistical Language Modeling
8.7 Perplexity of the Language Model
8.8 Overall Recognition System Based on Subword Units
8.8.1 Control of Word Insertion/Word Deletion Rate
8.8.2 Task Semantics
8.8.3 System Performance on the Resource Management Task
8.9 Context-Dependent Subword Units
8.9.1 Creation of Context-Dependent Diphones and Triphones
8.9.2 Using Interword Training to Create CD Units
8.9.3 Smoothing and Interpolation of CD PLU Models
8.9.4 Smoothing and Interpolation of Continuous Densities
8.9.5 Implementation Issues Using CD Units
8.9.6 Recognition Results Using CD Units
8.9.7 Position Dependent Units
8.9.8 Unit Splitting and Clustering
8.9.9 Other Factors for Creating Additional Subword Units
8.9.10 Acoustic Segment Units
8.10 Creation of Vocabulary-Independent Units
8.11 Semantic Postprocessor for Recognition
8.12 Summary

9 TASK ORIENTED APPLICATIONS OF AUTOMATIC SPEECH RECOGNITION

9.1 Introduction
9.2 Speech-Recognizer Performance Scores
9.3 Characteristics of Speech-Recognition Applications
9.3.1 Methods of Handling Recognition Errors
9.4 Broad Classes of Speech-Recognition Applications
9.5 Command-and-Control Applications
9.5.1 Voice Repertory Dialer
9.5.2 Automated Call-Type Recognition
9.5.3 Call Distribution by Voice Commands
9.5.4 Directory Listing Retrieval
9.5.5 Credit Card Sales Validation
9.6 Projections for Speech Recognition

INDEX

`LIST OF FIGURES
`
`1.1 General block diagram of a task-oriented speech-recognition
`system.
`
`2.1 Schematic diagram of speech-production/speech-perception process
`(after Flanagan [unpublished]).
`2.2 Alternative view of speech-production/speech-perception process
(after Rabiner and Levinson [1]).
`2.3 Mid-sagittal plane X-ray of the human vocal apparatus (after
`Flanagan et al. [2]).
`2.4 Schematic view of the human vocal mechanism (after Flanagan
`[3]).
`2.5 Glottal volume velocity and resulting sound pressure at the start of a
voiced sound (after Ishizaka and Flanagan [4]).
`2.6 Schematic representation of the complete physiological mechanism of
`speech production (after Flanagan [3]).
`2.7 Waveform plot of the beginning of the utterance "It's time."
`2.8 Wideband and narrowband spectrograms and speech amplitude for
`the utterance "Every salt breeze comes from the sea."
`2.9 Wideband spectrogram and formant frequency representation of the
`utterance "Why do I owe you a letter" (after Atal and
`Hanauer [5]).
`2.10 Wideband spectrogram and intensity contour of the phrase "Should
`we chase."
`2.11 The speech waveform and a segmentation and labeling of the
`constituent sounds of the phrase "Should we chase."
`2.12 Chart of the classification of the standard phonemes of American
`English into broad sound classes.
`2.13 Articulatory configurations for typical vowel sounds (after
`Flanagan [3]).
`2.14 Acoustic waveform plots of typical vowel sounds.
`2.15 Spectrograms of the vowel sounds.
`2.16 Measured frequencies of first and second formants for a wide range of
`talkers for several vowels (after Peterson & Barney [7]).
`2.17 The vowel triangle with centroid positions of the common
`vowels.
`2.18 Spectrogram plots of four diphthongs.
`2.19 Time variation of the first two formants for the diphthongs (after
`Holbrook and Fairbanks [9]).
`2.20 Waveforms for the sequences /a-m-a/ and /a-n-a/.
2.21 Spectrograms of the sequences /a-m-a/ and /a-n-a/.
`2.22 Waveforms for the sounds /f/, /s/ and /sh/ in the context /a-x-a/ where
`/x/ is the unvoiced fricative.
`2.23 Spectrogram comparisons of the sounds /a-f-a/, /a-s-a/ and
`/a-sh-a/.
`2.24 Waveforms for the sequences /a-v-a/ and /a-zh-a/.
`2.25 Spectrograms for the sequences /a-v-a/ and /a-zh-a/.
`2.26 Waveform for the sequence /a-b-a/.
`2.27 Waveforms for the sequences /a-p-a/ and /a-t-a/.
`2.28 Spectrogram comparisons of the sequences of voiced (/a-b-a/) and
`voiceless (/a-p-a/ and /a-t-a/) stop consonants.
`2.29 Spectrograms of the 11 isolated digits, 0 through 9 plus oh, in random
`sequence.
`2.30 Spectrograms of two connected digit sequences.
`2.31 Phoneme lattice for word string.
`2.32 Block diagram of acoustic-phonetic speech-recognition
`system.
`2.33 Acoustic-phonetic vowel classifier.
`2.34 Binary tree speech sound classifier.
`2.35 Segmentation and labeling for word sequence "seven-six."
`2.36 Segmentation and labeling for word sequence "did you."
`2.37 Block diagram of pattern-recognition speech recognizer.
2.38 Illustration of the word correction capability of syntax in speech
recognition (after Rabiner and Levinson [1]).
`2.39 A bottom-up approach to knowledge integration for speech
`recognition.
`2.40 A top-down approach to knowledge integration for speech
`recognition.
`2.41 A blackboard approach to knowledge integration for speech
recognition (after Lesser et al. [11]).
`2.42 Conceptual block diagram of a human speech understanding
`system.
`2.43 Simple computation element of a neural network.
`2.44 McCullough-Pitts model of neurons (after McCullough and
Pitts [12]).
`2.45 Single-layer and three-layer perceptrons.
2.46 A multilayer perceptron for classifying steady vowels based on F1, F2
measurements (after Lippmann [13]).
`2.47 Model of a recurrent neural network.
`2.48 A fixed point interpretation of the Hopfield network.
`2.49 The time delay neural network computational element (after Waibel
`et al. [14]).
`2.50 A TDNN architecture for recognizing /b/, /d/ and /g/ (after Waibel
et al. [14]).
`2.51 A combination neural network and matched filter for speech
recognition (after Tank & Hopfield [15]).
`2.52 Example illustrating the combination of a neural network and a set of
matched filters (after Tank & Hopfield [15]).
`2.53 The hidden control neural network (after Levin [16]).
`
3.1 (a) Pattern recognition and (b) acoustic phonetic approaches to speech
`recognition.
`3.2 Bank-of-filters analysis model.
`3.3 LPC analysis model.
`3.4 Complete bank-of-filters analysis model.
`3.5 Typical waveforms and spectra for analysis of a pure sinusoid in the
`filter-bank model.
`3.6 Typical waveforms and spectra of a voice speech signal in the
`bank-of-filters analysis model.
3.7 Ideal (a) and realistic (b) set of filter responses of a Q-channel filter
bank covering the frequency range $F_s/N$ to $(Q + 1/2)F_s/N$.
3.8 Ideal specifications of a 4-channel octave band-filter bank (a), a
12-channel third-octave band filter bank (b), and a 7-channel critical
band scale filter bank (c) covering the telephone bandwidth range
(200-3200 Hz).
3.9 The variation of bandwidth with frequency for the perceptually based
critical band scale.
3.10 The signals s(m) and w(n - m) used in evaluation of the short-time
Fourier transform.
3.11 Short-time Fourier transform using a long (500 points or 50 msec)
Hamming window on a section of voiced speech.
3.12 Short-time Fourier transform using a short (50 points or 5 msec)
Hamming window on a section of voiced speech.
3.13 Short-time Fourier transform using a long (500 points or 50 msec)
Hamming window on a section of unvoiced speech.
3.14 Short-time Fourier transform using a short (50 points or 5 msec)
Hamming window on a section of unvoiced speech.
3.15 Linear filter interpretation of the short-time Fourier transform.
3.16 FFT implementation of a uniform filter bank.
3.17 Direct form implementation of an arbitrary nonuniform filter
bank.
3.18 Two arbitrary nonuniform filter-bank ideal filter specifications
consisting of either 3 bands (part a) or 7 bands (part b).
3.19 Tree structure implementation of a 4-band, octave-spaced, filter
bank.
3.20 Window sequence, w(n), (part a), the individual filter response
(part b), and the composite response (part c) of a Q = 15 channel,
uniform filter bank, designed using a 101-point Kaiser window
smoothed lowpass window (after Dautrich et al. [4]).
3.21 Window sequence, w(n), (part a), the individual filter responses
(part b), and the composite response (part c) of a Q = 15 channel,
uniform filter bank, designed using a 101-point Kaiser window
directly as the lowpass window (after Dautrich et al. [4]).
3.22 Individual channel responses (parts a to d) and composite filter
response (part e) of a Q = 4 channel, octave band design, using
101-point FIR filters in each band (after Dautrich et al. [4]).
3.23 Individual channel responses and composite filter response of a
Q = 12 channel, 1/3 octave band design, using 201-point FIR filters
in each band (after Dautrich et al. [4]).
3.24 Individual channel responses (parts a to g) and composite filter
response (part h) of a Q = 7 channel critical band filter bank design
(after Dautrich et al. [4]).
3.25 Individual channel responses and composite filter response of a
Q = 13 channel, critical band spacing filter bank, using highly
overlapping filters in frequency (after Dautrich et al. [4]).
3.26 Generalization of filter-bank analysis model.
3.27 Linear prediction model of speech.
3.28 Speech synthesis model based on LPC model.
3.29 Illustration of speech sample, weighted speech section, and prediction
error for voiced speech where the prediction error is large at the
beginning of the section.
3.30 Illustration of speech sample, weighted speech section, and prediction
error for voiced speech where the prediction error is large at the end
of the section.
3.31 Illustration of speech sample, weighted speech section, and prediction
error for unvoiced speech where there are almost no artifacts at the
boundaries of the section.
`3.32 Typical signals and spectra for LPC autocorrelation method for a
`segment of speech spoken by a male speaker (after Rabiner et al.
`[8]).
`3.33 Typical signals and spectra for LPC autocorrelation method for a
`segment of speech spoken by a female speaker (after Rabiner et al.
[8]).
`3.34 Examples of signal (differentiated) and prediction error for several
`vowels (after Strube [9]).
`3.35 Variation of the RMS prediction error with the number of predictor
coefficients, p (after Atal and Hanauer [10]).
`3.36 Spectra for a vowel sound for several values of predictor order,
`p.
`3.37 Block diagram of LPC processor for speech recognition.
`3.38 Magnitude spectrum of LPC preemphasis network for
a = 0.95.
`3.39 Blocking of speech into overlapping frames.
`3.40 Block diagram of the basic VQ training and classification
`structure.
`3.41 Partitioning of a vector space into VQ cells with each cell represented
`by a centroid vector.
`3.42 Flow diagram of binary split codebook generation algorithm.
`3.43 Codebook distortion versus codebook size (measured in bits per
`frame) for both voiced and unvoiced speech (after Juang et al.
[12]).
3.44 Codebook vector locations in the F1-F2 plane (for a 32-vector
codebook) superimposed on the vowel ellipses (after Juang et al.
[12]).
`3.45 Model and distortion error spectra for scalar and vector quantizers
`(after Juang et al. [12]).
`3.46 Plots and histograms of temporal distortion for scalar and vector
quantizers (after Juang et al. [12]).
`3.47 Physiological model of the human ear.
`3.48 Expanded view of the middle and inner ear mechanics.
`3.49 Block diagram of the EIH model (after Ghitza [13]).
3.50 Frequency response curves of a cat's basilar membrane (after Ghitza
[13]).
3.51 Magnitude of EIH for vowel /o/ showing the time-frequency
resolution (after Ghitza [13]).
3.52 Operation of the EIH model for a pure sinusoid (after
Ghitza [13]).
3.53 Comparison of Fourier and EIH log spectra for clean and noisy
speech signals (after Ghitza [13]).
`
`4.1 Contour of digit recognition accuracy (percent correct) as a function
`of endpoint perturbation (in ms) in a multispeaker digit-recognition
`experiment. Both the initial (beginning point) and the final (ending
`point) boundary of the detected speech signal were varied (after
`Wilpon et al. [2]).
`4.2 Example of mouth click preceding a spoken word (after Wilpon et al.
`[2]).
`4.3 Example of breathy speech due to heavy breathing while speaking
`(after Wilpon et al. [2]).
`4.4 Example of click produced at the end of a spoken word (after Wilpon
`et al. [2]).
`4.5 Block diagram of the explicit approach to speech endpoint
`detection.
`4.6 Block diagram of the implicit approach to speech-endpoint
`detection.
`4.7 Examples of word boundaries as determined by the implicit endpoint
`detection algorithm.
`4.8 Block diagram of the hybrid approach to speech endpoint
`detection.
`4.9 Block diagram of typical speech activity detection algorithm.
`4.10 LPC pole frequency JNDs as a function of the pole bandwidth; the
`blank circles denote positive frequency perturbations, and the solid
`circles represent negative frequency perturbations; the fitting curves
`are parabolic (after Erell et al. [7]).
`4.11 LPC pole bandwidth JNDs, in a logarithmic scale, as a function of the
`pole bandwidth itself (after Erell et al. [7]).
4.12 Two typical FFT power spectra, $S(\omega)$, of the sound /æ/ in a log scale
and their difference magnitude $|V(\omega)|$ as a function of
frequency.
4.13 LPC model spectra corresponding to the FFT spectra in Figure 4.12,
plotted also in a log scale, and their difference magnitude $|V(\omega)|$ as a
function of frequency.
4.14 Two typical FFT power spectra, $S(\omega)$, of the sound /sh/ in a log scale
and their difference magnitude $|V(\omega)|$ as a function of
frequency.
4.15 LPC model spectra corresponding to the FFT spectra in Figure 4.14,
plotted also in a log scale, and their difference magnitude $|V(\omega)|$ as a
function of frequency.
4.16 Typical FFT power spectra of the sounds /æ/ and /i/ respectively and
their difference magnitude as a function of frequency.
4.17 LPC model spectra corresponding to the FFT spectra in Figure 4.16
and their difference magnitude $|V(\omega)|$ as a function of
frequency.
4.18 Scatter plot of $d_2^2$, the cepstral distance, versus $2d_c^2(L)$, the truncated
cepstral distance (multiplied by 2), for 800 pairs of all-pole model
spectra; the truncation is at L = 20 (after Gray and
Markel [9]).
4.19 Scatter plot of $d_2^2$, the cepstral distance, versus $2d_c^2(L)$, the truncated
cepstral distance (multiplied by 2), for 800 pairs of all-pole model
spectra; the truncation is at L = 30 (after Gray and
Markel [9]).
`4.20 Effects of cepstral liftering on a log LPC spectrum, as a function of
`the lifter length (L = 8 to 16) (after Juang et al. [11]).
`4.21 Comparison of (a) original sequence of LPC log magnitude spectra;
`(b) liftered LPC log magnitude spectra, and (c) liftered log magnitude
`spectra (after Juang et al. [11]).
4.22 Comparison of the distortion integrands $V^2(\omega)/2$ and
$e^{V(\omega)} - V(\omega) - 1$ (after Gray and Markel [9]).
4.23 A scatter plot of $d_{IS}(1/|A_p|^2, 1/|A|^2) + 1$ versus
$d_{IS}(1/|A|^2, 1/|A_p|^2) + 1$ as measured from 6800 pairs of speech model
spectra.
`4.24 A linear system with transfer function H(z) = A(z)/B(z).
`4.25 A scatter diagram of the log spectral distance versus the COSH
`distortion as measured from a database of 6800 pairs of speech
`spectra.
4.26 LPC spectral pair and various spectral weighting functions;
$W_1(\omega)$, $W_2(\omega)$, $W_3(\omega)$, and $W_4(\omega)$ are defined in (4.71), (4.72), (4.73),
and (4.74), respectively.
4.27a An example of the cosh spectral deviation $F_4(\omega)$ and its weighted
version using $W_3(\omega) = 1/|A(e^{j\omega})|^2$ as the weighting function; in this
case the two spectra are of comparable power levels.
4.27b An example of the cosh spectral deviation $F_4(\omega)$ and its weighted
version using $W_3(\omega) = 1/|A(e^{j\omega})|^2$ as the weighting function; in this
case, the two spectra have significantly different power levels.
`4.28 Subjectively perceived pitch, in mels, of a tone as a function of the
`frequency, in Hz; the upper curve relates the subjective pitch to
`frequency in a linear scale and the lower curve shows the subjective
`pitch as a function of the frequency in a logarithmic scale (after
`Stevens and Volkmann [13]).
`4.29 The critical bandwidth phenomenon; the critical bandwidth as a
`function of the frequency at the center of the band (after Zwicker,
Flottorp and Stevens [14]).
4.30 Real part of $\exp[j\theta(b)k]$ as a function of b, the Bark scale, for
different values of k (after Nocerino et al. [16]).
`4.31 A filter-bank design in which each filter has a triangle bandpass
`frequency response with bandwidth and spacing determined by a
`constant mel frequency interval (spacing = 150 mels, bandwidth =
300 mels) (after Davis and Mermelstein [17]).
`4.32 (a) Series of cylindrical sections concatenated as an acoustic tube
`model of the vocal tract; (b) the area function of the cylindrical
sections in (a) (after Markel and Gray [10]).
`4.33 A critical band spectrum (a) of a typical vowel sound and the
`corresponding log spectral differences VLM (b) and VcM (c) as
`functions of the critical band number (after Nocerino et al.
`[16]).
4.34 A trajectory of the (2nd) cepstral coefficient with 2nd-order
polynomial ($h_1 + h_2 t + h_3 t^2$) fitting on short portions of the trajectory;
the width for polynomial fitting is 7 points.
4.35 Scatter diagram showing the correlation between the "instantaneous"
cepstral distance, $d_2^2$, and the "differential" or "dynamic" cepstral
distance, $d_{2\delta^{(1)}}^2$; the correlation index is 0.6.
`4.36 Linear time alignment for two sequences of different
`durations.
4.37 An example of time normalization of two sequential patterns to a
common time index; the time warping functions $\phi_x$ and $\phi_y$ map the
individual time index $i_x$ and $i_y$, respectively, to the common time
index k.
4.38 The optimal path problem: finding the minimum cost path from
point 1 to point i in as many moves as needed.
`4.39 A trellis structure that illustrates the problem of finding the optimal
`path from point i to point j in M steps.
`4.40 An example of local continuity constraints expressed in terms of
coordinate increments (after Myers et al. [23]).
`4.41 The effects of global path constraints and range limiting on the
`allowable regions for time warping functions.
4.42 Illustration of the extreme cases, $T_x - 1 = Q_{\max}(T_y - 1)$ or
$T_y - 1 = Q_{\max}(T_x - 1)$, where only linear time warping (single
straight path) is allowed.
`4.43 Type III local continuity constraints with four types of slope
weighting (after Myers et al. [23]).
`4.44 Type II local continuity constraints with 4 types of slope weighting
`and their smoothed version in which the slope weights are uniformly
`redistributed along paths where abrupt weight changes exist (after
`Myers et al. [23]).
`4.45 Set of allowable grid points for dynamic programming
`implementation of local path expansion and contraction by 2
`to 1.
`4.46 The allowable path region for dynamic time alignment with relaxed
`endpoint constraints.
`4.47 Set of allowable grid points when opening up the initial point range to
`5 frames and the final point range to 9 frames.
`4.48 The allowable path region for dynamic time alignment with localized
`range constraints.
`4.49 Dynamic programming for finding K-best paths implemented in a
`parallel manner.
`4.50 The serial dynamic programming algorithm for finding the K-best
`paths.
4.51 Example illustrating the need for nonlinear time alignment of two
versions of a spoken word.
4.52 Illustration of the effectiveness of dynamic time warping alignment of
two versions of a spoken word.
`
`5.1 A vector-quantizer-based speech-recognition system.
`5.2 A trellis quantizer as a finite state machine.
`5.3 Codebook training for segmental vector quantization.
`5.4 Block diagram of isolated word recognizer incorporating a
`word-based VQ preprocessor and a DTW-based postprocessor (after
`Pan et al. [4]).
5.5 Plots of the variation of preprocessor performance parameters $P_1$, $E_1$,
and $(1 - \gamma)$ as a function of the distortion threshold D' for several
codebook sizes for the digits vocabulary (after Pan et al. [4]).
5.6 Plots of the variation of preprocessor performance parameters $E_2$ and
$\beta$ as a function of the distortion threshold D' for several codebook
sizes for the digits vocabulary (after Pan et al. [4]).
5.7 Plots of average fraction of decisions made by the preprocessor, $\gamma$,
versus preprocessor decision threshold D" for several codebook sizes
for the digits vocabulary (after Pan et al. [4]).
5.8 Plots of average fraction of candidate words, $\beta$, passed on to the
postprocessor, versus preprocessor decision threshold D" for several
codebook sizes for the digits vocabulary (after Pan et al. [4]).
`5.9 Accumulated DTW distortion scores versus test frame based on
`casual training with two reference patterns per word (after Rabiner
`et al. [5]).
`5.10 A flow diagram of the UWA clustering procedure (after Wilpon and
`Rabiner [7]).
`5.11 A flow diagram of the MKM clustering procedure (after Wilpon and
`Rabiner [7]).
`5.12 Recognition accuracy (percent c
