VOICE RECOGNITION

ROBERT D. RODMAN

RICHARD L. KLEVANS
`
`
`
Voice Recognition

Richard L. Klevans
Robert D. Rodman

Artech House
Boston • London
`
`
`
`Library of Congress Cataloging-in-Publication Data
Klevans, Richard L.
    Voice recognition / Richard L. Klevans, Robert D. Rodman.
        p. cm. - (Artech House telecommunications library)
    Includes bibliographical references and index.
    ISBN 0-89006-927-1 (alk. paper)
    1. Speech processing systems. 2. Automatic speech recognition.
    3. Voiceprints. 4. Natural language processing (Computer science)
    I. Rodman, Robert D. II. Title. III. Series.
TK7882.2.S65K55 1997
006.4'54-dc21                                                  97-30792
                                                                    CIP
`
`British Library Cataloguing in Publication Data
`Klevans, Richard L.
`Voice recognition
`1. Automatic speech recognition
`I. Title. II. Rodman, Robert D.
`006.4'54
`
`ISBN 0-89006-927-1
`
`Cover design by Jennifer L. Stuart
`
`© 1997 ARTECH HOUSE, INC.
`685 Canton Street
`Norwood, MA 02062
`
`All rights reserved. Printed and bound in the United States of America. No part
of this book may be reproduced or utilized in any form or by any means, electronic
or mechanical, including photocopying, recording, or by any information
`storage and retrieval system, without permission in writing from the publisher.
`All terms mentioned in this book that are known to be trademarks or service
marks have been appropriately capitalized. Artech House cannot attest to the accuracy
of this information. Use of a term in this book should not be regarded as
`affecting the validity of any trademark or service mark.
`
`International Standard Book Number: 0-89006-927-1
`Library of Congress Catalog Card Number: 97-30792
`
`10 9 8 7 6 5 4 3 2 1
`
`
`
`
`Contents
`
Chapter 1  Introduction
    Speech Synthesis
    Speech Recognition
    Speaker Classification
    Areas of Application for Voice Recognition
    Design Tradeoffs in Voice Recognition
        Text-Dependent Versus Text-Independent
        Ideal Recording Environment Versus Noisy Environment
        Speaker Verification Versus Speaker Identification
        Real-Time Operation Versus Off-Line Operation
    Regarding This Book
        Intended Readers
        What Is Covered
        Why?
    References

Chapter 2  Background of Voice Recognition
    Voiceprint Analysis
    Parameter Extraction
        The Parameter Extraction Process
        Types of Parameters
        Evaluation of Parameters
    Distance Measures
    Pattern Recognition
    Voice Recognition in Noisy Environments
    Summary
    References

Chapter 3  Methods of Context-Free Voice Recognition
    Voice Recognition in Law Enforcement
    Forensic Recognition Classification
    Ideal Voice Recognition
    A Segregating Voice-Recognition System
        System Tasks
        Channel Variation Compensation
        Software Implementation
    Logistics of Forensic Speaker Identification
    Summary
    References

Chapter 4  Experimental Results
    Test Utterance Length Experiments
    Large Population Results
    Filtered Data Test
    Channel Compensation Tests
        Average Filter Compensation Technique Experiment
        Rehumanizing Filter Technique Experiment
    Secondary Parameters
        Secondary Parameter Usage
        Effects of Varying the Cutoff Value
        Best-Case Secondary Parameter Usage
    Mock Forensic Cases
        SBI Case 1
        SBI Case 2
        SBI Case 3
    Summary
    References

Chapter 5  The Future of Context-Free Voice Recognition
    Rehumanizing Filter Technique Tests
    Voice-Recognition Databases
    Medium-Term Goals
    Long-Term Goals
    Other Applications
    Summary
    References

Chapter 6  Conclusions

About the Authors

Index
`
`
`
Background of Voice Recognition
`
This chapter will present a review of the research in the area
of voice recognition. Initially, research in this area concentrated
on determining whether speakers' voices were unique
or at least distinguishable from those of a group of other
speakers. In these studies, manual intervention was necessary
to carry out the recognition task. As computer power
increased and knowledge about speech signals improved,
research became aimed at fully automated systems executed
on general-purpose computers or specially designed computer
hardware.

Voice recognition consists of two major tasks: feature
extraction and pattern recognition. Feature extraction
attempts to discover characteristics of the speech signal
unique to the individual speaker. The process is analogous
to a police description of a suspect, which typically lists
height, weight, skin color, facial shape, body type, and any
distinguishing marks or disfigurements. Pattern recognition
refers to the matching of features in such a way as to determine,
within probabilistic limits, whether two sets of features
are from the same or different individuals. In this chapter,
we will discuss research related to these tasks. The chapter
will conclude with a short description of methods for dealing
with noise in voice-recognition systems.
`
`-----------
`
`15 -----------
`
`IPR2023-00034
`Apple EX1016 Page 7
`
`
`
`-
`
`16 Voice Recognition
`
VOICEPRINT ANALYSIS
The first type of automatic speaker recognition, called voiceprint
analysis, was begun in the 1960s. The term voiceprint
was derived from the more familiar term fingerprint.
Researchers hoped that voiceprints would provide a reliable
method for uniquely identifying people by their voices, just
as fingerprints had proven to be a reliable method of identification
in forensic situations.

Voiceprint analysis was only a semiautomatic process.
First, a graphical representation of each speaker's voice was
created. Then, human experts manually determined whether
two graphs represented utterances spoken by the same person.
The graphical representations took one of two forms: a speech
spectrogram (called a bar voiceprint at the time)-see Figure
2.1-or a contour voiceprint [1]. The former, the more commonly
used form, consists of a representation of a spoken
utterance in which time is displayed on the abscissa, frequency
on the ordinate, and spectral energy as the darkness
at a given point.
`
Figure 2.1 Spectrogram of author saying, "This is Rick."
`
`
Prior to a voiceprint identification attempt, spectrograms
would have been produced by a sound spectrograph from
recordings of the speakers in question. Typically, the input
data for voiceprint analysis consisted of recordings of utterances
of 10 commonly used words-such as "the," "you,"
and "I"-from each speaker in the set to be identified. These
10 words can be thought of as roughly analogous to the 10
fingers used in fingerprint analysis. Human experts determined
the identity of speakers by visually inspecting the spectrograms
of a given word spoken by several known speakers
and comparing those to a spectrogram of the same word spoken
by an unknown speaker.
The experts looked for features of the spectrograms that
best characterized each speaker. Some commonly used features
were absolute formant frequency, formant bandwidths,
and formant trajectories. Formants are bands of energy in the
spectrogram that are related to the resonant frequencies of
the speaker's vocal tract. Therefore, formant locations and
trajectories are related to the fixed shapes of the speaker's
vocal tract as well as the way in which the speaker manipulates
his or her vocal tract during utterances.
The voiceprint identification method described above
had many flaws. First, identification was based on the subjective
judgment of human experts. Second, multiple voiceprints
of a word spoken by one person can vary as much as voiceprints
by two different speakers speaking the same word. This
phenomenon introduces the general problem of interspeaker
versus intraspeaker variance that is of primary concern for
all voice-recognition research. A final concern was the vulnerability
of the voiceprint identification process to impostors
who had been trained to mimic other speakers. Thus, researchers
were uncertain about the worth of voiceprint identification.
In the 1960s, a number of experiments were performed
that addressed these issues. L.G. Kersta reported an error rate
of 1% for 2,000 identification attempts with populations of 9
to 15 known speakers for each unknown [1]. Richard Bolt
`
`IPR2023-00034
`Apple EX1016 Page 9
`
`
`
`r.------------·
`
`--
`
`-
`
`L----~
`
`--
`
`18 Voice Recognition
`
summarized the results of several similar studies with widely
varying error rates [2]. Some studies reported error rates as
high as 21% and others as low as 0.42%. Bolt criticized all
the studies as being artificial inasmuch as the experiments
consisted of matching tasks. If used as evidence in court, he
pointed out, the analysis would be a verification task, in
which the experts would have to decide from two sets of
voiceprints (one set from the accused, one set from the
unknown) whether or not the accused and the unknown person
were the same.
The inconsistent experimental evidence caused experts
to disagree about the viability of voiceprints. Kersta's original
study led him to believe that voiceprint analysis could be as
effective as fingerprint analysis:

    Other experimental data encourages me to believe
    that unique identifications from voiceprints can be
    made. Work continues, there being questions to
    answer and problems to solve. ... It is my opinion,
    however, that identifiable uniqueness does exist in
    each voice, and that masking, disguising, or distorting
    the voice will not defeat identification if the
    speech is intelligible [1].
`
A study by Richard Bolt and others a few years later reached
the opposite conclusion:

    Fingerprints show directly the physical pattern of
    the finger producing them, and these patterns are
    readily discernible. Spectrographic patterns and the
    sound waves that they represent are not, however,
    related so simply and directly to vocal anatomy;
    moreover, the spectrogram is not the primary evidence,
    but only a graphic means for examining the
    sounds that a speaker makes [2].
`
`
Between 1970 and 1985, the Federal Bureau of Investigation
(FBI) made extensive use of spectrogram identification, the
results of which were analyzed by Bruce Koenig [3]. The
FBI formulated a 10-point procedural protocol dictating how
voice comparison was to take place. The Bureau insisted on
high-quality recordings, from which spectrograms in the frequency
range of 0-4,000 Hz were to be made. FBI technicians
examined twenty words pronounced alike (supposedly) for
similarities and differences, and these results were supplemented
by aural comparisons made by repeatedly and simultaneously
playing the two voice samples on separate tape
recorders. In the end, the examiner determined whether two
exemplars were "no or low confidence," "very similar," or
"very dissimilar," and these results were confirmed by two
other examiners. Identification of an individual was only
claimed in the presence of a sufficiently high percentage of
"very similar" determinations.

A survey of the results of 2,000 voice comparisons found
that in two-thirds (1,304) of the cases, examiners had no or
low confidence; in 318 cases there was a positive identification;
and in 378 cases a positive elimination. There was one
false identification and two false eliminations. Koenig
observes:
`
    Most of the no or low confidence decisions were
    due to poor recording quality and/or an insufficient
    number of comparable words. Decisions were also
    affected by high-pitched voices (usually female) and
    some forms of voice disguise [3].
`
The attempt to use voiceprints in a forensic setting left unanswered
many questions about the practicality of using voice
to identify individuals uniquely. It became clear that research
must be focused on the following goals:
`
`
1. Automating the recognition procedures;
2. Freeing recognition procedures from dependency on fixed words;
3. Standardizing testing so improvements in procedures could be measured;
4. Handling noisy signals;
5. Coping with unknown and/or inadequate channels;
6. Dealing with intervoice and intravoice variation, both natural and artificial (i.e., disguised voice).
`
Advances in digital computer hardware in the mid-1980s
made achievement of these goals seem possible. The six points
enumerated above were the basis of many research programs
in voice recognition during subsequent years. These programs
will be discussed in the remaining sections of this chapter.
Since voice-recognition research progressed along many different
paths after the 1960s, a historical perspective is not
fully appropriate. Thus, we have partitioned the discussion
of research by task: parameter extraction, distance measurements,
pattern recognition techniques, and special considerations.
`
PARAMETER EXTRACTION
`
In this section, we will discuss methods of extracting information
from speech waveforms. Parameter or feature extraction
consists of preprocessing an electrical signal to transform it
into a usable digital form, applying algorithms to extract only
speaker-related information from the signal, and determining
the quality of the extracted parameters.
`
The Parameter Extraction Process
`
`The preprocessing required by voice-recognition systems uses
`digital signal processing (DSP) methods that are common to
`
`
all computer speech systems. First, the sound wave created by
an individual's speech is transduced into an analog electrical
signal via a microphone. The electrical signal is sampled and
quantized, resulting in a digital representation of the analog
signal. Typical representations of signals for voice-recognition
systems are sampled at rates of between 8 and 16 kHz
with 8 to 16 bits of resolution [4].

The digital signal may then be subjected to conditioning.
For example, bandpass filtering can be used for attenuating
parts of the spectrum that are corrupted with additive noise.
Spectral flattening can be used to improve the pitch extraction
process by compensating for the effect of the vocal tract on
the excitation signal created by the vibrating vocal folds. Many
other conditioning techniques have been reported. After the
signal has been conditioned, it may then be used as input to
an algorithm for parameter extraction (Figure 2.2).

Figure 2.2 The parameter extraction process: sampler and quantizer, signal conditioning, extraction algorithm, output parameter vector.
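By way of illustration, the front end of Figure 2.2 might be sketched in Python as follows. This sketch is ours, not code from any system described in this book; it assumes the NumPy and SciPy libraries, and its sampling rate, bit depth, and passband are arbitrary choices within the ranges cited above.

```python
import numpy as np
from scipy.signal import butter, lfilter

def preprocess(samples, fs=10_000, bits=12, band=(300.0, 3400.0)):
    """Front end of Figure 2.2: quantization followed by conditioning.

    `samples` stands in for the sampled microphone signal, scaled
    to [-1, 1]; `fs`, `bits`, and `band` are illustrative values.
    """
    # Quantize to the chosen resolution (8 to 16 bits are typical).
    levels = 2 ** (bits - 1)
    digital = np.round(np.clip(samples, -1.0, 1.0) * levels) / levels

    # Conditioning: bandpass filtering to attenuate parts of the
    # spectrum that may be corrupted with additive noise.
    low, high = band
    b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
    return lfilter(b, a, digital)
```

The conditioned samples are then passed to an extraction algorithm such as the FFT- or LPC-based methods described below.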
`
Types of Parameters

The most basic types of parameters used for voice recognition
are either quantifiable by a human listener, such as pitch or
loudness, or have been borrowed from systems for speech
coding, recognition, or synthesis.
`
`Pitch
`
The pitch of a speaker's voice during an utterance is roughly
describable by a human listener. The human listener can sense
the average pitch and detect changes of pitch during an utterance.
Although it is not an easy process, pitch determination
can be performed by computer algorithms. Many different
algorithms have been devised for pitch extraction [5].

At first glance, pitch appears to be a valuable parameter
for speaker identification. For example, a distinction between
male voices, female voices, and juvenile voices can be made
based mainly on pitch. However, pitch is affected by the
speaker's mood and can be modified intentionally by an uncooperative
speaker or one with criminal intent.
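As an illustration of the kind of computation involved, the following is a minimal autocorrelation-based pitch estimator. It is one of the simplest approaches among the many surveyed in [5], not the algorithm of any particular system; NumPy is assumed, and the 50-400 Hz search range is an arbitrary choice.

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Estimate the pitch (Hz) of a voiced frame from its autocorrelation."""
    frame = frame - np.mean(frame)
    # Autocorrelation at nonnegative lags.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # The strongest peak within the plausible range of pitch periods
    # gives the period in samples.
    lo, hi = int(fs / fmax), int(fs / fmin)
    period = lo + int(np.argmax(corr[lo:hi]))
    return fs / period
```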
`
Frequency Representations
`
A second simple type of parameter is the frequency representation
of a signal in various time frames. This representation is
equivalent to a spectrogram in numerical form. The numerical
form of a spectrogram is usually computed using the fast
Fourier transform (FFT) algorithm. Many processors have
been designed specifically to execute FFT algorithms in real
time.

The results obtainable using the FFT algorithm vary with
the design parameters of the algorithm. If short analysis windows
are used, the FFT algorithm accurately represents
changes in the spectral energy of the signal over time but
will not have high resolution in the frequency dimension.
Conversely, if long analysis windows are used, the results
will be accurate in the frequency dimension but coarse in the
`
time dimension.¹ Most voice-processing systems use FFTs
with moderate-sized analysis windows (approximately
20 ms). The magnitudes of the resulting FFT coefficients are
commonly called inverse filter spectral coefficients.
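The window-length tradeoff can be made concrete with a short sketch (our own, assuming NumPy): FFT magnitudes are computed over evenly spaced analysis windows, and doubling frame_len doubles the frequency resolution while halving the time resolution.

```python
import numpy as np

def fft_magnitudes(signal, frame_len=200, hop=100):
    """Magnitude spectra over evenly spaced analysis windows.

    At a 10-kHz sampling rate, frame_len=200 gives the roughly 20-ms
    windows mentioned above, with bins fs/frame_len = 50 Hz apart.
    """
    window = np.hamming(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # One magnitude spectrum per frame: a spectrogram in numerical form.
    return np.array([np.abs(np.fft.rfft(f)) for f in frames])
```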
`As mentioned earlier, formant frequencies, which can be
`determined from the frequency representation of a speech
`signal, are related to the resonant cavities of an individual's
`vocal tract. Thus, researchers believed that this correlation
`might be useful for voice recognition. The original research
`in this area required manual intervention for determining
`formant frequencies, but soon, automated methods became
`available [6].
`
LPC Coefficients
Linear predictive coding (LPC) coefficients are commonly
used as features for voice-recognition systems. LPC was developed
as an efficient method for representing speech signals
and became widely used in many areas of speech
processing [7].
In LPC, a parametric representation of speech is created
by using past values of the signal to predict future values.
The nth value of the speech signal can be predicted by the
formula below:

$$\hat{s}_n = \sum_{i=1}^{p} s_{n-i}\, a_i$$

where $s_n$ is the nth speech sample, the $a_i$ are the predictor
coefficients, and $\hat{s}_n$ is the prediction of the nth value of the
speech signal.
`
1. This is actually a manifestation of a classical trade-off in physics known
as the Heisenberg Uncertainty Principle. Most readers will know it as
follows: "one cannot determine both the position and the velocity
of an elementary particle with complete accuracy; the more highly
determined the one, the less highly determined the other." It is a consequence
of the wave description of matter and, in the particular case of
digital signal processing, of the wave description of sound.
`
`
Predictor coefficients can be estimated by an
iterative algorithm that minimizes the mean square error
between the predicted waveform, $\hat{s}$, and the actual waveform,
$s$. The number of coefficients derived using LPC, $p$, is a parameter
of the algorithm and is roughly related to the number of
real and complex poles of the vocal tract filter. With more
coefficients, the original signal can be reconstructed more
accurately but at a higher computational cost. Typically,
12 coefficients are calculated for speech sampled at
10 kHz [8-10].
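As an illustration of how such an estimate might be computed, the following sketch uses the standard autocorrelation (Levinson-Durbin) method, one common way of carrying out the minimization; it is not necessarily the procedure used by the systems discussed in this book.

```python
import numpy as np

def lpc_coefficients(frame, p=12):
    """Estimate p LPC predictor coefficients by Levinson-Durbin recursion."""
    # Autocorrelation values r[0..p] of the analysis frame.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + p]
    a = np.zeros(p + 1)  # prediction-error filter coefficients; a[0] = 1
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff.
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
        err *= 1.0 - k * k
    # Predictor coefficients a_1..a_p in the convention of the formula above.
    return -a[1:]
```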
Although the LPC predictor coefficients can be used
directly as features, many transformations of the coefficients
are also used. The transformations are designed to create a
new set of coefficients that are optimized for various performance
criteria.

The most commonly used transformation is that which
derives the anagrammatically named cepstrum from the spectrum.
The LPC-derived cepstral coefficients are defined as
follows, where $c_i$ is the ith cepstral coefficient:

$$c_i = a_i + \sum_{k=1}^{i-1}\left(1 - \frac{k}{i}\right) a_k\, c_{i-k}, \qquad 1 \le i \le p$$

Unlike LPC coefficients, cepstral coefficients are independent,
and the distance between cepstral coefficient vectors can be
calculated with a Euclidean-type distance measure [11].
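Translated directly into code (a sketch; `a` is assumed to hold predictor coefficients $a_1, \ldots, a_p$ from an LPC routine such as the one above):

```python
def lpc_to_cepstrum(a):
    """Cepstral coefficients c_1..c_p from LPC predictor coefficients."""
    p = len(a)
    c = []
    for i in range(1, p + 1):
        # c_i = a_i + sum_{k=1}^{i-1} (1 - k/i) a_k c_{i-k}
        value = a[i - 1]
        for k in range(1, i):
            value += (1.0 - k / i) * a[k - 1] * c[i - k - 1]
        c.append(value)
    return c
```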
The reflection coefficients are natural byproducts of the
computation of the LPC predictor coefficients. They are
defined from the following backward recursion:

$$b_{j,p} = a_j$$

$$k_i = b_{i,i}$$

$$b_{j,i-1} = \frac{b_{j,i} + b_{i,i}\, b_{i-j,i}}{1 - k_i^2}, \qquad 1 \le j \le i-1, \quad i = p, p-1, \ldots, 1$$

where $k_i$ is the value of the ith reflection coefficient,
$i = (p, p-1, \ldots, 1)$, $a_i$ is the ith LPC coefficient, $b_{j,i}$ is a
variable within the recurrence relation, and $p$ is the number
of LPC coefficients.
The log area coefficients are defined by:

$$g_i = \log\left(\frac{1 - k_i}{1 + k_i}\right), \qquad 1 \le i \le p$$

where $g_i$ is the ith log area coefficient, $k_i$ is the ith reflection
coefficient, and $|k_i| < 1$ [12].
Another such transformation is the impulse response
function, calculated as follows:

$$h_i = \sum_{k=1}^{p} a_k\, h_{i-k}, \qquad i > 0$$

$$h_i = 1, \qquad i = 0$$

$$h_i = 0, \qquad i < 0$$

where $a_k$ is the kth LPC coefficient and $p$ is equal to the
number of LPC coefficients. The impulse response function
is the time-domain output function that would result from
inputting an impulse function to a finite duration impulse
response (FIR) filter that used the LPC coefficients as the filter
coefficients.
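In code, the recursion might read as follows (a sketch; `a` again holds $a_1, \ldots, a_p$):

```python
def impulse_response(a, n):
    """First n values h_0..h_{n-1} of the impulse response defined above."""
    h = [1.0]  # h_0 = 1; terms with negative index are taken as 0
    for i in range(1, n):
        h.append(sum(a[k - 1] * h[i - k]
                     for k in range(1, len(a) + 1) if i - k >= 0))
    return h
```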
`
Other Parameters
`
The selection of features, for the most part, is not affected by
the type of application. Most text-independent voice-recognition
systems currently developed have used the same kinds
`
`
of features as are used in text-dependent systems. However,
some features have been developed specifically to improve
performance in noisy environments.

For example, Delta-Cepstrum coefficients are calculated
by determining the differences between cepstral coefficients
in each time frame. Thus, any constant bias caused by the
channel would be removed [11]. The relative spectral-based
coefficients (RASTA) use a series of transformations to remove
linear distortion of a signal (i.e., filtering). With this technique,
the slow-moving variations in the frequency domain
are detected and removed. Fast-moving variations-caused
by the speech itself-are captured in the resulting parameters
[13,14]. The intensity deviation spectrum (IDS) parameters
constitute another attempt to remove the frequency characteristics
of the transmission channel by normalizing by the mean
value at each frequency in the spectrum [15].
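In its simplest form, the Delta-Cepstrum amounts to differencing consecutive cepstral vectors, as in the sketch below (an assumption on our part; published variants often use a regression over several neighboring frames):

```python
import numpy as np

def delta_cepstrum(cepstra):
    """Frame-to-frame differences of an (n_frames, p) array of cepstra.

    Any constant additive bias, such as a fixed channel offset in the
    cepstral domain, cancels in the subtraction.
    """
    return np.diff(cepstra, axis=0)
```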
Other miscellaneous features have also been suggested:
perceptual linear predictive (PLP) coefficients attempt to
modify LPC coefficients based on the way human perception
and physiology affect sounds [16]. Line spectral pair (LSP)
frequencies have also been used as parameters. LSP frequencies
are derived from the LPC coefficients and have a rough
correlation to formant bandwidths and locations [17]. The
partial correlation (PARCOR) coefficients, which are another
natural byproduct of LPC analysis, have also been used [18].
Finally, smoothed discrete Wigner distributions (SDWD)
attempt to eliminate the problem of time versus frequency
accuracy when calculating FFTs. By smoothing the FFT calculation
in an efficient manner, the resulting SDWD parameters
achieve accuracy in both time and frequency dimensions
without a high computation cost. The resulting parameters
have been used effectively for voice recognition [19].
The list of features used for voice recognition discussed
in this section consists of many parameters that are common
to other voice-processing applications as well as some parameters
that were devised specifically for the voice-recognition
task. Most of these parameters were derived by performing
some kind of transformation of the LPC coefficients.
`
`Evaluation of Parameters
`
To build a successful voice-recognition system, one must
make informed decisions concerning which parameters to
use. The penalties for choosing parameters incorrectly
include poor recognition performance and excessive processing
time and storage space. The goal of parameter evaluation
should be to determine the smallest set of parameters
which contains as much useful information as possible.

The theory of analysis of variance provides a method
for determining the relative merits of parameters for voice
recognition. Features are identified which remain relatively
constant for the speech of a single individual but vary over
the speech of different individuals. Typical voice-recognition
systems use a set of parameters (features) that may be represented
by a vector $W$:

$$W = (w_1, w_2, \ldots, w_p)$$

where $w_1$, $w_2$, etc., are individual features such as LPC coefficients
or cepstral coefficients. Numerous vectors can be
obtained by performing feature extraction on evenly spaced
analysis windows throughout utterances spoken by the individuals
to be recognized. Thus, at different time positions in
an utterance, the same parameters are calculated.
The F-ratio for each feature, $k$, in $W$ can be determined
as follows [6]:

$$F_k = \frac{\text{Variance of Speaker Means}}{\text{Average Within Speaker Variance}} \tag{2.1}$$

If $s$ vectors have been collected for each of $q$ speakers,
then:
`
`
`(2.2)
`
`(2.3)
`
`(2.4)
`
where $w_{i,j,k}$ is the value of the kth feature for the ith speaker
during the jth reference frame. $S_{i,k}$ estimates the value of the
kth feature for the ith speaker. The average of the kth feature
over all frames of all speakers is represented by $U_k$.
`Features with larger F-ratios will be more useful for voice
`recognition. However, F-ratios are only valid for the set of
`data from which they were calculated. Features that appear
`to be useful for one set of speakers may be worthless for
`another set of speakers. To calculate meaningful F-ratios, a
`large population with a large number of examples from each
`speaker must be used.
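A compact sketch of (2.1) through (2.4), assuming the feature vectors are stacked in a NumPy array of shape (q speakers, s frames, p features):

```python
import numpy as np

def f_ratios(w):
    """F-ratio of each feature for an array w of shape (q, s, p)."""
    speaker_means = w.mean(axis=1)           # S[i, k] of (2.2)
    grand_mean = speaker_means.mean(axis=0)  # U[k] of (2.3)
    # Variance of speaker means (numerator of (2.1)).
    between = ((speaker_means - grand_mean) ** 2).mean(axis=0)
    # Average within-speaker variance (denominator of (2.1)).
    within = ((w - speaker_means[:, None, :]) ** 2).mean(axis=(0, 1))
    return between / within
```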
Other methods for evaluating the usefulness of features
exist. For example, the feature effectiveness criterion (FEC)
is defined by Shridhar as follows [10]:

$$\mathrm{FEC} = \sum \text{Interspeaker distances} - \sum \text{Intraspeaker distances}$$

Parameters with higher FEC values are more desirable since
high interspeaker distances are favorable for discrimination
and low intraspeaker distances are favorable for speaker variability.
Another method for choosing which features to use
in a voice-recognition system is simply to use recognition
error rates of the system when different features are used
as input. By using the same input speech data and pattern
`
`
`matching algorithm (algorithms will be discussed later in this
`chapter), the performance of different sets of parameters may
`be evaluated by comparison of recognition scores. Better
`parameters will yield better recognition scores. A broad range
`of testing must be used to prevent overtraining, which occurs
`when the system parameters are varied slightly in an attempt
`to achieve better performance on a specific set of input data,
`while the performance of the system actually drops for more
`general input data.
`
`Distance Measures
`
`Distance measures refer to methods of calculating differences
`between parameter vectors. Typically, one of the vectors is
`calculated from data of the unknown speaker while the other
`vector is calculated from that of a known speaker. However,
`some pattern-matching
`techniques require that vectors from
`the same speaker be compared to each other to determine the
`expected variance of the speaker in question. Descriptions of
`how distance measures are used will be presented later in
`this chapter.
`Many different distance measures have been proposed,
`and deciding which one to use is as difficult as determining
`which set of parameters to use. Often, a method is chosen
simply because it yields favorable results and/or compensates
`for the ineffectiveness of certain parameters within a feature
`vector.
Most distance measures are variations of either the
Euclidean or Manhattan distance between two vectors.

Euclidean:

$$d(a,b) = \left(\sum_{i=1}^{p} (a_i - b_i)^2\right)^{1/2}$$
`
`
`
`
`
Manhattan:

$$d(a,b) = \sum_{i=1}^{p} |a_i - b_i|$$

where $a_i$ and $b_i$ are the ith components of the two vectors to be
compared and $p$ is the number of features to compare.
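Both measures are immediate to compute; a sketch with NumPy:

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two parameter vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    """Manhattan (city-block) distance between two parameter vectors."""
    return np.sum(np.abs(a - b))
```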
The Euclidean and Manhattan distance measures are not
appropriate for comparing two vectors of LPC coefficients
since the coefficients are not independent. However, the likelihood
ratio distortion, which only applies to LPC coefficients,
can be used. It is defined as follows:

$$d_{LR}(a,b) = \frac{b^T R_a\, b}{a^T R_a\, a}$$

where $a$ and $b$ are vectors of LPC predictor coefficients, $R_a$
is the Toeplitz autocorrelation matrix (a byproduct of the
calculation of the predictor coefficients) associated with $a$,
and $T$ is transpose [20]. The log likelihood distance can be
computed as follows:

$$d_{LLR} = \log(d_{LR})$$
`
These two distance measures are effective ways of comparing
vectors of LPC predictor coefficients.

Since cepstral coefficients are the most commonly used
type of voice-recognition parameter, several distance measures
for cepstral coefficients have been suggested. Most of
these distance measures are simple variations of the weighted
cepstral distance:

$$d(a,b) = \sum_{i=1}^{p} f_i\, (a_i - b_i)^2$$

where, again, $p$ is the number of features and $f_i$ is the weighting
function [21]. Several weighting functions have been
suggested:
Uniform:

$$f_i = 1$$
`
Expected difference:

$$f_i = \frac{1}{E_i}$$

where $E_i$ is the expected difference between two features determined
from a population of speakers.

Inverse variance:

$$f_i = \frac{1}{\sigma_i^2}$$

where $\sigma_i^2$ is the variance of the ith feature.
Uniform without first coefficient:

$$f_i = \begin{cases} 0 & \text{if } i = 1 \\ 1 & \text{if } 1 < i \le p \end{cases}$$
`
The expected difference and inverse variance weighting
functions attempt to maximize the F-ratio of each feature. The
uniform without first coefficient function discounts the first
coefficient, which has been shown to contain little information
for speaker recognition [21].
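A sketch of the weighted cepstral distance, with the weights left as an argument so any of the four functions above can be plugged in (the examples in the comments are illustrative assumptions):

```python
import numpy as np

def weighted_cepstral_distance(a, b, f):
    """Weighted cepstral distance between coefficient vectors a and b."""
    return np.sum(f * (a - b) ** 2)

# Example weightings for p coefficients:
# uniform:              f = np.ones(p)
# uniform w/o first:    f = np.r_[0.0, np.ones(p - 1)]
# inverse variance:     f = 1.0 / training_vectors.var(axis=0)
```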
The number of different distance measures is as great
as the number of different extracted parameter types. Some
distance measures were designed for a specific type of parameter.
Others were chosen to maximize the F-ratios of any given
feature. However, most were chosen simply because of their
favorable performance with specific pattern matching
algorithms.
`
`
Pattern Recognition
`
Pattern recognition in voice-recognition systems consists of
developing a database of information about known speakers
(training) and determining if an unknown speaker is one of the
known speakers (testing). The result of the pattern recognition
step is a decision about an unknown speaker's identity. In
the previous sections, we discussed feature extraction and
distance measures. In this section, algorithms that use these
features and distance measures for making voice-recognition
decisions will be explained.
`
`Testing Voice-Recognition Systems
`
To compare the relative performance of the different pattern-recognition
techniques, a brief discussion on the testing of
voice-recognition systems is necessary. Typically, the relative
performance of voice-recognition systems is based on the error
rates, for either verification or identification tasks. Unfortunately,
error rates can be misleading owing to the large number
of