VOICE RECOGNITION

Richard L. Klevans
Robert D. Rodman

Artech House
Boston, London

Library of Congress Cataloging-in-Publication Data

Klevans, Richard L.
    Voice recognition / Richard L. Klevans, Robert D. Rodman.
        p. cm. (Artech House telecommunications library)
    Includes bibliographical references and index.
    ISBN 0-89006-927-1 (alk. paper)
    1. Speech processing systems. 2. Automatic speech recognition.
    3. Voiceprints. 4. Natural language processing (Computer science)
    I. Rodman, Robert D. II. Title. III. Series.
    TK7882.2.S65K55  1997
    006.4'54-dc21                                          97-30792
                                                                CIP

British Library Cataloguing in Publication Data

Klevans, Richard L.
    Voice recognition
    1. Automatic speech recognition
    I. Title. II. Rodman, Robert D.
    006.4'54

ISBN 0-89006-927-1

Cover design by Jennifer L. Stuart

© 1997 ARTECH HOUSE, INC.
685 Canton Street
Norwood, MA 02062

All rights reserved. Printed and bound in the United States of America. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.

All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Artech House cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.

International Standard Book Number: 0-89006-927-1
Library of Congress Catalog Card Number: 97-30792

10 9 8 7 6 5 4 3 2 1

Contents

Chapter 1  Introduction                                             1
    Speech Synthesis                                                1
    Speech Recognition                                              3
    Speaker Classification                                          4
    Areas of Application for Voice Recognition                      6
    Design Tradeoffs in Voice Recognition                           7
        Text-Dependent Versus Text-Independent                      7
        Ideal Recording Environment Versus Noisy Environment        8
        Speaker Verification Versus Speaker Identification          9
        Real-Time Operation Versus Off-Line Operation               9
    Regarding This Book                                            10
        Intended Readers                                           10
        What Is Covered                                            11
        Why?                                                       12
    References                                                     13

Chapter 2  Background of Voice Recognition                         15
    Voiceprint Analysis                                            16
    Parameter Extraction                                           20
        The Parameter Extraction Process                           20
        Types of Parameters                                        21
        Evaluation of Parameters                                   27
    Distance Measures                                              29
    Pattern Recognition                                            32
    Voice Recognition in Noisy Environments                        54
    Summary                                                        56
    References                                                     57

Chapter 3  Methods of Context-Free Voice Recognition               61
    Voice Recognition in Law Enforcement                           61
    Forensic Recognition Classification                            62
    Ideal Voice Recognition                                        64
    A Segregating Voice-Recognition System                         72
        System Tasks                                               75
        Channel Variation Compensation                             94
        Software Implementation                                    99
    Logistics of Forensic Speaker Identification                  101
    Summary                                                       105
    References                                                    105

Chapter 4  Experimental Results                                   107
    Test Utterance Length Experiments                             107
    Large Population Results                                      112
    Filtered Data Test                                            114
    Channel Compensation Tests                                    116
        Average Filter Compensation Technique Experiment          117
        Rehumanizing Filter Technique Experiment                  119
    Secondary Parameters                                          121
        Secondary Parameter Usage                                 130
        Effects of Varying the Cutoff Value                       131
        Best-Case Secondary Parameter Usage                       132
    Mock Forensic Cases                                           135
        SBI Case 1                                                136
        SBI Case 2                                                144
        SBI Case 3                                                147
    Summary                                                       148
    References                                                    149

Chapter 5  The Future of Context-Free Voice Recognition           151
    Rehumanizing Filter Technique Tests                           151
    Voice-Recognition Databases                                   153
    Medium-Term Goals                                             156
    Long-Term Goals                                               157
    Other Applications                                            158
    Summary                                                       159
    References                                                    160

Chapter 6  Conclusions                                            161

About the Authors                                                 165

Index                                                             167

Background of Voice Recognition

This chapter will present a review of the research in the area of voice recognition. Initially, research in this area concentrated on determining whether speakers' voices were unique or at least distinguishable from those of a group of other speakers. In these studies, manual intervention was necessary to carry out the recognition task. As computer power increased and knowledge about speech signals improved, research became aimed at fully automated systems executed on general-purpose computers or specially designed computer hardware.

Voice recognition consists of two major tasks: feature extraction and pattern recognition. Feature extraction attempts to discover characteristics of the speech signal unique to the individual speaker. The process is analogous to a police description of a suspect, which typically lists height, weight, skin color, facial shape, body type, and any distinguishing marks or disfigurements. Pattern recognition refers to the matching of features in such a way as to determine, within probabilistic limits, whether two sets of features are from the same or different individuals. In this chapter, we will discuss research related to these tasks. The chapter will conclude with a short description of methods for dealing with noise in voice-recognition systems.

VOICEPRINT ANALYSIS

The first type of automatic speaker recognition, called voiceprint analysis, was begun in the 1960s. The term voiceprint was derived from the more familiar term fingerprint. Researchers hoped that voiceprints would provide a reliable method for uniquely identifying people by their voices, just as fingerprints had proven to be a reliable method of identification in forensic situations.

Voiceprint analysis was only a semiautomatic process. First, a graphical representation of each speaker's voice was created. Then, human experts manually determined whether two graphs represented utterances spoken by the same person. The graphical representations took one of two forms: a speech spectrogram (called a bar voiceprint at the time; see Figure 2.1) or a contour voiceprint [1]. The former, the more commonly used form, consists of a representation of a spoken utterance in which time is displayed on the abscissa, frequency on the ordinate, and spectral energy as the darkness at a given point.

Figure 2.1  Spectrogram of author saying, "This is Rick."

Prior to a voiceprint identification attempt, spectrograms would have been produced by a sound spectrograph from recordings of the speakers in question. Typically, the input data for voiceprint analysis consisted of recordings of utterances of 10 commonly used words, such as "the," "you," and "I," from each speaker in the set to be identified. These 10 words can be thought of as roughly analogous to the 10 fingers used in fingerprint analysis. Human experts determined the identity of speakers by visually inspecting the spectrograms of a given word spoken by several known speakers and comparing those to a spectrogram of the same word spoken by an unknown speaker.

The experts looked for features of the spectrograms that best characterized each speaker. Some commonly used features were absolute formant frequency, formant bandwidths, and formant trajectories. Formants are bands of energy in the spectrogram that are related to the resonant frequencies of the speaker's vocal tract. Therefore, formant locations and trajectories are related to the fixed shapes of the speaker's vocal tract as well as the way in which the speaker manipulates his or her vocal tract during utterances.

The voiceprint identification method described above had many flaws. First, identification was based on the subjective judgment of human experts. Second, multiple voiceprints of a word spoken by one person can vary as much as voiceprints by two different speakers speaking the same word. This phenomenon introduces the general problem of interspeaker versus intraspeaker variance that is of primary concern for all voice-recognition research. A final concern was the vulnerability of the voiceprint identification process to impostors that had been trained to mimic other speakers. Thus, researchers were uncertain about the worth of voiceprint identification. In the 1960s, a number of experiments were performed that addressed these issues. L. G. Kersta reported an error rate of 1% for 2,000 identification attempts with populations of 9 to 15 known speakers for each unknown [1]. Richard Bolt summarized the results of several similar studies with widely varying error rates [2]. Some studies reported error rates as high as 21% and others as low as 0.42%. Bolt criticized all the studies as being artificial inasmuch as the experiments consisted of matching tasks. If used as evidence in court, he pointed out, the analysis would be a verification task, in which the experts would have to decide from two sets of voiceprints (one set from the accused, one set from the unknown) whether or not the accused and the unknown person were the same.

The inconsistent experimental evidence caused experts to disagree about the viability of voiceprints. Kersta's original study led him to believe that voiceprint analysis could be as effective as fingerprint analysis:

    Other experimental data encourages me to believe that unique identifications from voiceprints can be made. Work continues, there being questions to answer and problems to solve.... It is my opinion, however, that identifiable uniqueness does exist in each voice, and that masking, disguising, or distorting the voice will not defeat identification if the speech is intelligible [1].

A study by Richard Bolt and others a few years later reached the opposite conclusion:

    Fingerprints show directly the physical pattern of the finger producing them, and these patterns are readily discernible. Spectrographic patterns and the sound waves that they represent are not, however, related so simply and directly to vocal anatomy; moreover, the spectrogram is not the primary evidence, but only a graphic means for examining the sounds that a speaker makes [2].

Between 1970 and 1985, the Federal Bureau of Investigation (FBI) made extensive use of spectrogram identification, the results of which were analyzed by Bruce Koenig [3]. The FBI formulated a 10-point procedural protocol dictating how voice comparison was to take place. The Bureau insisted on high-quality recordings, from which spectrograms in the frequency range of 0-4,000 Hz were to be made. FBI technicians examined twenty words pronounced alike (supposedly) for similarities and differences, and these results were supplemented by aural comparisons made by repeatedly and simultaneously playing the two voice samples on separate tape recorders. In the end, the examiner determined whether two exemplars were "no or low confidence," "very similar," or "very dissimilar," and these results were confirmed by two other examiners. Identification of an individual was only claimed in the presence of a sufficiently high percentage of "very similar" determinations.

A survey of the results of 2,000 voice comparisons found that in two-thirds (1,304) of the cases, examiners had no or low confidence; in 318 cases there was a positive identification; and in 378 cases a positive elimination. There was one false identification and two false eliminations. Koenig observes:

    Most of the no or low confidence decisions were due to poor recording quality and/or an insufficient number of comparable words. Decisions were also affected by high-pitched voices (usually female) and some forms of voice disguise [3].

The attempt to use voiceprints in a forensic setting left unanswered many questions about the practicality of using voice to identify individuals uniquely. It became clear that research must be focused on the following goals:

1. Automating the recognition procedures;
2. Freeing recognition procedures from dependency on fixed words;
3. Standardizing testing so improvements in procedures could be measured;
4. Handling noisy signals;
5. Coping with unknown and/or inadequate channels;
6. Dealing with intervoice and intravoice variation, both natural and artificial (i.e., disguised voice).

Advances in digital computer hardware in the mid-1980s made achievement of these goals seem possible. The six points enumerated above were the basis of many research programs in voice recognition during subsequent years. These programs will be discussed in the remaining sections of this chapter. Since voice-recognition research progressed along many different paths after the 1960s, a historical perspective is not fully appropriate. Thus, we have partitioned the discussion of research by task: parameter extraction, distance measurements, pattern recognition techniques, and special considerations.

PARAMETER EXTRACTION

In this section, we will discuss methods of extracting information from speech waveforms. Parameter or feature extraction consists of preprocessing an electrical signal to transform it into a usable digital form, applying algorithms to extract only speaker-related information from the signal, and determining the quality of the extracted parameters.

The Parameter Extraction Process

The preprocessing required by voice-recognition systems uses digital signal processing (DSP) methods that are common to all computer speech systems. First, the sound wave created by an individual's speech is transduced into an analog electrical signal via a microphone. The electrical signal is sampled and quantized, resulting in a digital representation of the analog signal. Typical representations of signals for voice-recognition systems are sampled at rates of between 8 and 16 kHz with 8 to 16 bits of resolution [4].

The digital signal may then be subjected to conditioning. For example, bandpass filtering can be used for attenuating parts of the spectrum that are corrupted with additive noise. Spectral flattening can be used to improve the pitch extraction process by compensating for the effect of the vocal tract on the excitation signal created by the vibrating vocal folds. Many other conditioning techniques have been reported. After the signal has been conditioned, it may then be used as input to an algorithm for parameter extraction (Figure 2.2).

Figure 2.2  The parameter extraction process.

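As a concrete, if simplified, illustration, this front end can be sketched in a few lines of Python. The sketch below is not an implementation from the text; the 16-bit resolution, the 0.97 pre-emphasis constant (one simple form of conditioning), and the 20-ms/10-ms frame schedule are assumed, commonly used values.

```python
import numpy as np

def quantize(signal, n_bits=16):
    """Quantize an analog-style signal in [-1, 1] to n_bits of resolution."""
    levels = 2 ** (n_bits - 1) - 1
    return np.round(signal * levels) / levels

def preemphasize(signal, alpha=0.97):
    """First-order high-pass conditioning filter (flattens spectral tilt)."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame(signal, sample_rate, frame_ms=20, hop_ms=10):
    """Split the conditioned signal into overlapping analysis windows."""
    n = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    count = 1 + (len(signal) - n) // hop
    return np.stack([signal[i * hop:i * hop + n] for i in range(count)])

# One second of a synthetic 120-Hz "voiced" tone sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
speech = 0.5 * np.sin(2 * np.pi * 120 * t)
frames = frame(preemphasize(quantize(speech)), fs)
print(frames.shape)  # (99, 160): 99 frames of 160 samples each
```

Each row of the resulting array is one analysis window, ready to be handed to an extraction algorithm such as those described next.
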
Types of Parameters

The most basic type of parameters used for voice recognition are either quantifiable by a human listener, such as pitch or loudness, or have been borrowed from systems for speech coding, recognition, or synthesis.

Pitch

The pitch of a speaker's voice during an utterance is roughly describable by a human listener. The human listener can sense the average pitch and detect changes of pitch during an utterance. Although it is not an easy process, pitch determination can be performed by computer algorithms. Many different algorithms have been devised for pitch extraction [5].

At first glance, pitch appears to be a valuable parameter for speaker identification. For example, a distinction between male voices, female voices, and juvenile voices can be made based mainly on pitch. However, pitch is affected by the speaker's mood and can be modified intentionally by an uncooperative speaker or one with criminal intent.

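One of the many algorithm families alluded to above [5] is autocorrelation-based pitch extraction, which can be sketched minimally as follows; the 50-400 Hz search range is an assumption, not a value prescribed by the text.

```python
import numpy as np

def estimate_pitch(frame, sample_rate, fmin=50.0, fmax=400.0):
    """Autocorrelation pitch estimate for one voiced frame, in Hz."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sample_rate / fmax)   # shortest candidate period, in samples
    hi = int(sample_rate / fmin)   # longest candidate period, in samples
    best_lag = lo + np.argmax(corr[lo:hi])
    return sample_rate / best_lag

fs = 8000
t = np.arange(0, 0.03, 1 / fs)                          # one 30-ms frame
print(estimate_pitch(np.sin(2 * np.pi * 120 * t), fs))  # close to 120 Hz
```
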
Frequency Representations

A second simple type of parameter is the frequency representation of a signal in various time frames. This representation is equivalent to a spectrogram in numerical form. The numerical form of a spectrogram is usually computed using the fast Fourier transform (FFT) algorithm. Many processors have been designed specifically to execute FFT algorithms in real time.

The results obtainable using the FFT algorithm vary with the design parameters of the algorithm. If short analysis windows are used, the FFT algorithm accurately represents changes in the spectral energy of the signal over time but will not have high resolution in the frequency dimension. Conversely, if long analysis windows are used, the results will be accurate in the frequency dimension but coarse in the time dimension.¹ Most voice-processing systems use FFTs with moderate-sized analysis windows (approximately 20 ms). The magnitudes of the resulting FFT coefficients are commonly called inverse filter spectral coefficients.

¹ This is actually a manifestation of a classical trade-off in physics known as the Heisenberg uncertainty principle. Most readers will know it as follows: "one cannot determine both the position and the velocity of an elementary particle with complete accuracy; the more highly determined the one, the less highly determined the other." It is a consequence of the wave description of matter and, in the particular case of digital signal processing, of the wave description of sound.

As mentioned earlier, formant frequencies, which can be determined from the frequency representation of a speech signal, are related to the resonant cavities of an individual's vocal tract. Thus, researchers believed that this correlation might be useful for voice recognition. The original research in this area required manual intervention for determining formant frequencies, but soon, automated methods became available [6].

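The short-time FFT computation is compact enough to show directly. The sketch below (the Hamming window and 20-ms/10-ms frame schedule are assumed details) also makes the time-frequency trade-off concrete: each FFT bin spans sample_rate/n Hz, so doubling the window halves the bin width while halving the time resolution.

```python
import numpy as np

def spectrogram(signal, sample_rate, window_ms=20, hop_ms=10):
    """Numerical spectrogram: matrix of short-time FFT magnitudes
    (rows = analysis frames, columns = frequency bins)."""
    n = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = np.hamming(n)
    frames = [signal[i:i + n] * window
              for i in range(0, len(signal) - n + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 500 * t)   # a 500-Hz test tone
S = spectrogram(tone, fs)            # bins are fs/160 = 50 Hz wide
print(S.shape, S[0].argmax())        # peak at bin 10, i.e., 500 Hz
```
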
LPC Coefficients

Linear predictive coding (LPC) coefficients are commonly used as features for voice-recognition systems. LPC was developed as an efficient method for representing speech signals and became widely used in many areas of speech processing [7].

In LPC, a parametric representation of speech is created by using past values of the signal to predict future values. The nth value of the speech signal can be predicted by the formula below:

$$\hat{s}_n = \sum_{i=1}^{p} a_i\, s_{n-i}$$

where $s_n$ is the nth speech sample, the $a_i$ are the predictor coefficients, and $\hat{s}_n$ is the prediction of the nth value of the speech signal. Predictor coefficients can be estimated by an iterative algorithm that minimizes the mean square error between the predicted waveform, $\hat{s}$, and the actual waveform, $s$. The number of coefficients derived using LPC, $p$, is a parameter of the algorithm and is roughly related to the number of real and complex poles of the vocal tract filter. With more coefficients, the original signal can be reconstructed more accurately but at a higher computational cost. Typically, 12 coefficients are calculated for speech sampled at 10 kHz [8-10].

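The text does not spell out the estimation procedure; a common choice is the autocorrelation method solved by the Levinson-Durbin recursion, sketched below under that assumption.

```python
import numpy as np

def lpc(frame, order=12):
    """LPC predictor coefficients a_1..a_p by the autocorrelation method
    (Levinson-Durbin), for the convention s_hat[n] = sum_i a_i * s[n-i]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order)            # a[0] holds a_1, a[1] holds a_2, ...
    err = r[0]                     # prediction-error power so far
    for i in range(order):
        # Reflection (PARCOR) coefficient for order i + 1.
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        a[:i] = a[:i] - k * a[:i][::-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

# Fit a 12th-order predictor to one 20-ms frame sampled at 10 kHz.
fs = 10000
t = np.arange(0, 0.02, 1 / fs)
frame = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 1200 * t)
coeffs = lpc(frame * np.hamming(len(frame)))
print(coeffs.round(3))
```
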
Although the LPC predictor coefficients can be used directly as features, many transformations of the coefficients are also used. The transformations are designed to create a new set of coefficients that are optimized for various performance criteria.

The most commonly used transformation is that which derives the anagrammatically named cepstrum from the spectrum. The LPC-derived cepstral coefficients are defined as follows, where $c_i$ is the ith cepstral coefficient:

$$c_i = a_i + \sum_{k=1}^{i-1} \left(1 - \frac{k}{i}\right) a_k\, c_{i-k}, \qquad 1 \le i \le p$$

Unlike LPC coefficients, cepstral coefficients are independent, and the distance between cepstral coefficient vectors can be calculated with a Euclidean-type distance measure [11].

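The recursion translates directly into code; this small sketch (a hypothetical helper, not from the text) converts a predictor-coefficient vector into its cepstral counterpart.

```python
import numpy as np

def lpc_to_cepstrum(a):
    """LPC-derived cepstral coefficients via
    c_i = a_i + sum_{k=1}^{i-1} (1 - k/i) * a_k * c_{i-k}."""
    p = len(a)
    c = np.zeros(p)
    for i in range(1, p + 1):          # 1-based index of the formula
        total = a[i - 1]
        for k in range(1, i):
            total += (1.0 - k / i) * a[k - 1] * c[i - k - 1]
        c[i - 1] = total
    return c

# e.g., features for the frame fitted above: lpc_to_cepstrum(coeffs)
```
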
The reflection coefficients are natural byproducts of the computation of the LPC predictor coefficients. They are defined from the following backward recursion:

$$b_{j,i-1} = \frac{b_{j,i} + b_{i,i}\, b_{i-j,i}}{1 - k_i^2}$$

$$b_{j,p} = a_j, \qquad 1 \le j \le i - 1, \quad 1 \le i \le p$$

where $k_i$ is the value of the ith reflection coefficient, $i = (p, p-1, \ldots, 1)$, $a_i$ is the ith LPC coefficient, $b_{j,i}$ is a variable within the recurrence relation, and $p$ is the number of LPC coefficients.

The log area coefficients are defined by:

$$g_i = \log\left(\frac{1 - k_i}{1 + k_i}\right), \qquad 1 \le i \le p$$

where $g_i$ is the ith log area coefficient, $k_i$ is the ith reflection coefficient, and $k_i < 1$ [12].

Another such transformation is the impulse response function, calculated as follows:

$$h_i = \sum_{k=1}^{p} a_k\, h_{i-k} \quad (i > 0), \qquad h_i = 1 \quad (i = 0), \qquad h_i = 0 \quad (i < 0)$$

where $a_k$ is the kth LPC coefficient and $p$ is equal to the number of LPC coefficients. The impulse response function is the time-domain output function that would result from inputting an impulse function to a finite duration impulse response (FIR) filter that used the LPC coefficients as the filter coefficients.

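A compact sketch of these transformations follows. The identification $k_i = b_{i,i}$ at each step of the backward recursion is an assumption (it is how the step-down recursion is usually stated) rather than something the text makes explicit.

```python
import numpy as np

def reflection_coefficients(a):
    """Reflection coefficients k_1..k_p from LPC coefficients a_1..a_p
    via the backward recursion above, taking k_i = b_{i,i}."""
    p = len(a)
    b = np.array(a, dtype=float)      # current order-i coefficients b_{j,i}
    k = np.zeros(p)
    for i in range(p, 0, -1):
        k[i - 1] = b[i - 1]           # k_i = b_{i,i}
        if i > 1:
            prev = b[:i - 1].copy()   # b_{j,i} for j = 1..i-1
            b[:i - 1] = (prev + k[i - 1] * prev[::-1]) / (1 - k[i - 1] ** 2)
    return k

def log_area(k):
    """g_i = log((1 - k_i) / (1 + k_i)), valid while |k_i| < 1."""
    return np.log((1 - k) / (1 + k))

def impulse_response(a, length=32):
    """First `length` samples of h: h_0 = 1, h_i = sum_k a_k * h_{i-k}."""
    p = len(a)
    h = np.zeros(length)
    h[0] = 1.0
    for i in range(1, length):
        for kk in range(1, min(i, p) + 1):
            h[i] += a[kk - 1] * h[i - kk]
    return h
```
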
Other Parameters

The selection of features, for the most part, is not affected by the type of application. Most text-independent voice-recognition systems currently developed have used the same kinds of features as are used in text-dependent systems. However, some features have been developed specifically to improve performance in noisy environments.

For example, Delta-Cepstrum coefficients are calculated by determining the differences between cepstral coefficients in each time frame. Thus, any constant bias caused by the channel would be removed [11]. The relative spectral-based coefficients (RASTA) use a series of transformations to remove linear distortion of a signal (i.e., filtering). With this technique, the slow-moving variations in the frequency domain are detected and removed. Fast-moving variations, caused by the speech itself, are captured in the resulting parameters [13,14]. The intensity deviation spectrum (IDS) parameters constitute another attempt to remove the frequency characteristics of the transmission channel by normalizing by the mean value at each frequency in the spectrum [15].

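The delta-cepstrum idea is easy to demonstrate. A first-difference sketch follows (practical systems often fit a regression over several neighboring frames instead, a refinement beyond what the text states); note how a fixed channel bias, identical in every frame, cancels exactly.

```python
import numpy as np

def delta_cepstrum(cepstra):
    """Frame-to-frame differences of cepstral vectors
    (cepstra: n_frames x n_coeffs array)."""
    return np.diff(cepstra, axis=0)

frames = np.random.randn(100, 12)   # clean cepstral trajectories
biased = frames + 0.5               # constant channel offset
print(np.allclose(delta_cepstrum(frames), delta_cepstrum(biased)))  # True
```
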
Other miscellaneous features have also been suggested: perceptual linear predictive (PLP) coefficients attempt to modify LPC coefficients based on the way human perception and physiology affect sounds [16]. Line spectral pair (LSP) frequencies have also been used as parameters. LSP frequencies are derived from the LPC coefficients and have a rough correlation to formant bandwidths and locations [17]. The partial correlation (PARCOR) coefficients, which are another natural byproduct of LPC analysis, have also been used [18]. Finally, smoothed discrete Wigner distributions (SDWD) attempt to eliminate the problem of time versus frequency accuracy when calculating FFTs. By smoothing the FFT calculation in an efficient manner, the resulting SDWD parameters achieve accuracy in both time and frequency dimensions without a high computation cost. The resulting parameters have been used effectively for voice recognition [19].

The list of features used for voice recognition discussed in this section consists of many parameters that are common to other voice-processing applications as well as some parameters that were devised specifically for the voice-recognition task. Most of these parameters were derived by performing some kind of transformation of the LPC coefficients.

Evaluation of Parameters

To build a successful voice-recognition system, one must make informed decisions concerning which parameters to use. The penalties for choosing parameters incorrectly include poor recognition performance and excessive processing time and storage space. The goal of parameter evaluation should be to determine the smallest set of parameters which contain as much useful information as possible.

The theory of analysis of variance provides a method for determining the relative merits of parameters for voice recognition. Features are identified which remain relatively constant for the speech of a single individual but vary over the speech of different individuals. Typical voice-recognition systems use a set of parameters (features) that may be represented by a vector $W$:

$$W = (w_1, w_2, \ldots, w_n)$$

where $w_1$, $w_2$, etc., are individual features such as LPC coefficients or cepstral coefficients. Numerous vectors can be obtained by performing feature extraction on evenly spaced analysis windows throughout utterances spoken by the individuals to be recognized. Thus, at different time positions in an utterance, the same parameters are calculated.

The F-ratio for each feature, $k$, in $W$ can be determined as follows [6]:

$$F_k = \frac{\text{Variance of Speaker Means}}{\text{Average Within-Speaker Variance}} \tag{2.1}$$

If $s$ vectors have been collected for each of $q$ speakers, then:

$$S_{i,k} = \frac{1}{s} \sum_{j=1}^{s} w_{i,j,k} \tag{2.2}$$

$$U_k = \frac{1}{q} \sum_{i=1}^{q} S_{i,k} \tag{2.3}$$

$$F_k = \frac{\dfrac{1}{q} \sum_{i=1}^{q} \left(S_{i,k} - U_k\right)^2}{\dfrac{1}{qs} \sum_{i=1}^{q} \sum_{j=1}^{s} \left(w_{i,j,k} - S_{i,k}\right)^2} \tag{2.4}$$

where $w_{i,j,k}$ is the value of the kth feature for the ith speaker during the jth reference frame. $S_{i,k}$ estimates the value of the kth feature for the ith speaker. The average of the kth feature over all frames of all speakers is represented by $U_k$.

Features with larger F-ratios will be more useful for voice recognition. However, F-ratios are only valid for the set of data from which they were calculated. Features that appear to be useful for one set of speakers may be worthless for another set of speakers. To calculate meaningful F-ratios, a large population with a large number of examples from each speaker must be used.

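As a sketch of how (2.2) through (2.4) are computed in practice (the array layout is an assumption; the text prescribes no particular organization of the data):

```python
import numpy as np

def f_ratios(features):
    """F-ratio per feature; features has shape
    (q speakers, s frames, d features)."""
    speaker_means = features.mean(axis=1)        # S_{i,k}, per (2.2)
    grand_mean = speaker_means.mean(axis=0)      # U_k, per (2.3)
    between = ((speaker_means - grand_mean) ** 2).mean(axis=0)
    within = ((features - speaker_means[:, None, :]) ** 2).mean(axis=(0, 1))
    return between / within                      # per (2.4)

# Feature 0 carries a per-speaker offset, so its F-ratio dominates.
rng = np.random.default_rng(0)
data = rng.normal(size=(10, 50, 12))             # 10 speakers, 50 frames
data[:, :, 0] += rng.normal(scale=3.0, size=(10, 1))
print(f_ratios(data).round(2))
```
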
Other methods for evaluating the usefulness of features exist. For example, the feature effectiveness criterion (FEC) is defined by Shridhar as follows [10]:

$$\text{FEC} = \sum \text{Interspeaker distances} - \sum \text{Intraspeaker distances}$$

Parameters with higher FEC values are more desirable since high interspeaker distances are favorable for discrimination and low intraspeaker distances are favorable for speaker variability. Another method for choosing which features to use in a voice-recognition system is simply to use the recognition error rates of the system when different features are used as input. By using the same input speech data and pattern matching algorithm (algorithms will be discussed later in this chapter), the performance of different sets of parameters may be evaluated by comparison of recognition scores. Better parameters will yield better recognition scores. A broad range of testing must be used to prevent overtraining, which occurs when the system parameters are varied slightly in an attempt to achieve better performance on a specific set of input data, while the performance of the system actually drops for more general input data.

Distance Measures

Distance measures refer to methods of calculating differences between parameter vectors. Typically, one of the vectors is calculated from data of the unknown speaker while the other vector is calculated from that of a known speaker. However, some pattern-matching techniques require that vectors from the same speaker be compared to each other to determine the expected variance of the speaker in question. Descriptions of how distance measures are used will be presented later in this chapter.

Many different distance measures have been proposed, and deciding which one to use is as difficult as determining which set of parameters to use. Often, a method is chosen simply because it yields favorable results and/or compensates for the ineffectiveness of certain parameters within a feature vector.

Most distance measures are variations of either the Euclidean or Manhattan distance between two vectors.

Euclidean:

$$d(a, b) = \left( \sum_{i=1}^{p} (a_i - b_i)^2 \right)^{1/2}$$

Manhattan:

$$d(a, b) = \sum_{i=1}^{p} |a_i - b_i|$$

where $a_i$ and $b_i$ are the ith components of the two vectors to be compared and $p$ is the number of features to compare.

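Both measures are one-liners in code; this small sketch simply mirrors the two formulas.

```python
import numpy as np

def euclidean(a, b):
    """d(a, b) = (sum_i (a_i - b_i)^2)^(1/2)"""
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    """d(a, b) = sum_i |a_i - b_i|"""
    return np.sum(np.abs(a - b))

a = np.array([0.2, 0.3, -0.1])
b = np.array([0.1, 0.5, 0.0])
print(euclidean(a, b))   # ~0.245
print(manhattan(a, b))   # 0.4
```
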
The Euclidean and Manhattan distance measures are not appropriate for comparing two vectors of LPC coefficients since the coefficients are not independent. However, the likelihood ratio distortion, which only applies to LPC coefficients, can be used. It is defined as follows:

$$d_{LR}(a, b) = \frac{b^T R_a\, b}{a^T R_a\, a}$$

where $a$ and $b$ are vectors of LPC predictor coefficients, $R_a$ is the Toeplitz autocorrelation matrix (a byproduct of the calculation of the predictor coefficients) associated with $a$, and $T$ is transpose [20]. The log likelihood distance can be computed as follows:

$$d_{LLR} = \log(d_{LR})$$

These two distance measures are effective ways of comparing vectors of LPC predictor coefficients.

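A sketch of the computation follows, assuming the Itakura-style convention in which each coefficient vector is extended to its prediction-error filter [1, -a_1, ..., -a_p]; that convention is an assumption here, as references state the measure in slightly different forms.

```python
import numpy as np

def lr_distortion(a, b, r):
    """Likelihood ratio distortion between LPC vectors a and b, where r
    holds autocorrelation lags 0..p of the frame that produced a."""
    p = len(a)
    # Toeplitz autocorrelation matrix R_a, size (p + 1) x (p + 1).
    R = np.array([[r[abs(i - j)] for j in range(p + 1)]
                  for i in range(p + 1)])
    va = np.concatenate(([1.0], -np.asarray(a)))   # error filter for a
    vb = np.concatenate(([1.0], -np.asarray(b)))   # error filter for b
    return (vb @ R @ vb) / (va @ R @ va)

def log_lr_distortion(a, b, r):
    """The log likelihood distance d_LLR = log(d_LR)."""
    return np.log(lr_distortion(a, b, r))
```
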
Since cepstral coefficients are the most commonly used type of voice-recognition parameter, several distance measures for cepstral coefficients have been suggested. Most of these distance measures are simple variations of the weighted cepstral distance:

$$d(a, b) = \sum_{i=1}^{p} f_i\, (a_i - b_i)^2$$

where, again, $p$ is the number of features and $f_i$ is the weighting function [21]. Several weighting functions have been suggested:

Uniform:

$$f_i = 1$$

Expected difference:

$$f_i = \frac{1}{E_i}$$

where $E_i$ is the expected difference between two features, determined from a population of speakers.

Inverse variance:

$$f_i = \frac{1}{\sigma_i^2}$$

where $\sigma_i^2$ is the variance of the ith feature.

Uniform without first coefficient:

$$f_i = \begin{cases} 0 & \text{if } i = 1 \\ 1 & \text{if } 1 < i \le p \end{cases}$$

The expected difference and inverse variance weighting functions attempt to maximize the F-ratio of each feature. The uniform without first coefficient function discounts the first coefficient, which has been shown to contain little information for speaker recognition [21].

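A short sketch ties the pieces together; the weight constructions mirror the definitions above, and treating training data as a frames-by-coefficients array is an assumption.

```python
import numpy as np

def weighted_cepstral_distance(a, b, f):
    """d(a, b) = sum_i f_i * (a_i - b_i)^2."""
    return np.sum(f * (a - b) ** 2)

p = 12
uniform = np.ones(p)
uniform_no_first = np.concatenate(([0.0], np.ones(p - 1)))  # discounts c_1

def inverse_variance_weights(train):
    """f_i = 1 / var_i, estimated from training cepstra (n_frames x p)."""
    return 1.0 / train.var(axis=0)
```
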
The number of different distance measures is as great as the number of different extracted parameter types. Some distance measures were designed for a specific type of parameter. Others were chosen to maximize the F-ratios of any given feature. However, most were chosen simply because of their favorable performance with specific pattern matching algorithms.

Pattern Recognition

Pattern recognition in voice-recognition systems consists of developing a database of information about known speakers (training) and determining if an unknown speaker is one of the known speakers (testing). The result of the pattern recognition step is a decision about an unknown speaker's identity. In the previous sections, we discussed feature extraction and distance measures. In this section, algorithms that use these features and distance measures for making voice-recognition decisions will be explained.

Testing Voice-Recognition Systems

To compare the relative performance of the different pattern-recognition techniques, a brief discussion on the testing of voice-recognition systems is necessary. Typically, the relative performance of voice-recognition systems is based on the error rates for either verification or identification tasks. Unfortunately, error rates can be misleading owing to the large number of
