An Introduction to Speech and Speaker Recognition

Richard D. Peacocke and Daryl H. Graf
Bell-Northern Research

Speech recognition, to identify the spoken words, and speaker recognition, to identify who is saying them, are becoming commonplace applications of speech processing technology.
`
`~~
`
`I
`
`SDeech recognition,
`to identify
`the
`spoken words, and
`speaker recognition,
`to identify
`the
`who is saying them,
`are becoming
`-
`commonplace
`applications
`of speech processing Factors affecting
`technology.
`speech recognition
`
Being able to speak to your personal
`computer, and have it recognize
`and understand what you say,
`would provide a comfortable and natural
`form of communication. It would reduce
`the amount of typing you have to do, leave
`your hands free, and allow you to move
`away from the terminal or screen. You
`would not even have to be in the line of
`sight of the terminal. It would also help in
`some cases if the computer could tell who
`was speaking.
`If you want to use voice as a new me-
`dium on a computer workstation, it is natu-
`ral to explore how speech recognition can
`contribute to such an environment. Here,
`we will review the state of speech and
`speaker recognition, focusing on current
`technology applied to personal worksta-
`tions.
`Limited forms of speech recognition are
`available on personal workstations. Cur-
`rently there is much interest in speech
`recognition, and performance is improv-
`ing. Speech recognition has already proven
`useful for certain applications, such as
`telephone voice-response systems for se-
`lecting services or information, digit rec-
`ognition for cellular phones, and data entry
`while walking around a railway yard or
`clambering over a jet engine during an
`inspection.
`Nonetheless, comfortable and natural
`communication in a general setting (no
`constraints on what you can say and how
`
`you say it) is beyond us for now, posing
`a problem too difficult to solve. Fortu-
`nately, we can simplify the problem to
`allow the creation of applications like the
`examples just mentioned. Some of these
`simplifying constraints are discussed in
`the next section.
Speaker recognition is related to work on speech recognition. Instead of determining what was said, you determine who said it. Deciding whether or not a particular speaker produced the utterance is called verification, and choosing a person's identity from a set of known speakers is called identification. The most general form of speaker recognition (text-independent) is still not very accurate for large speaker populations, but if you constrain the words spoken by the user (text-dependent) and do not allow the speech quality to vary too wildly, then it too can be done on a workstation.
See the sidebar "Applications" for a description of typical speech and speaker recognition applications.

Factors affecting speech recognition

`Modern speech recognition research
`began in the late 1950s with the advent of
`the digital computer. Combined with tools
`to capture and analyze speech, such as
`analog-to-digital converters and sound
`spectrograms, the computer allowed re-
`searchers to search for ways to extract
`features from speech that allow discrimi-
`nation between different words. The 1960s
saw advances in the automatic segmentation of speech into units of linguistic relevance (such as phonemes, syllables, and words) and in new pattern-matching and classification algorithms.
`
`
`
`
`Applications
`
`Although the performance of speech and speaker recogni-
`tion systems is far from perfect, these systems have already
`proven their usefulness for certain applications.
`
`Speech recognition. Currently, speech recognition is most
`often applied in manufacturing for companies needing voice
`entry of data or commands while the operator’s hands are
`otherwise occupied. Related applications occur in product in-
`spection, inventory control, command/control, and material
`handling. Speech recognition also finds frequent application
`in medicine, where voice input can significantly accelerate
`the writing of routine reports.
`Speech recognition over the telephone network, although
`less used, has the greatest potential for growth. Automating
`the telephone operator’s job can greatly reduce operating
`costs for telephone companies. Furthermore, speech recog-
`nition can help users control the personal workstation or in-
`teract with other applications remotely when touch-tone key-
`pads are not available. (Telephone network applications are
`described in articles by Matthew Lennig and Ryohei Nakatsu
`elsewhere in this issue.)
`Finally, speech recognition offers greater freedom to the
`physically handicapped.
`Typical real-world applications:
`
• Delco Electronics employs IBM PC/AT-Cherry Electronics and Intel RMX86 recognition systems to collect circuit board inspection data while the operator repairs and marks the boards.
• Southern Pacific Railway inspectors now routinely use a PC-based Votan recognition system to enter car inspection information from the field by walkie-talkie.
• Michigan Bell has installed a Northern Telecom recognition system to automate collect and third-number billed calls. AT&T has also put in field trial systems to automate call-type selection in its Reno, Nevada, and Hayward, California, offices.
`
`Speaker recognition. Speaker recognition has been ap-
`plied most often as a security device to control access to
`buildings or information. One of the best known examples is
`the Texas Instruments corporate computer center security
`system. Security Pacific has employed speaker verification
`as a security mechanism on telephone-initiated transfers of
`large sums of money. In addition to adding security, verifica-
`tion is advantageous because it reduces the turnaround time
`on these banking transactions. Bellcore uses speaker verifi-
`cation to limit remote access of training information to au-
`thorized field personnel. Speaker recognition also provides a
`mechanism to limit the remote access of a personal worksta-
`tion to its owner or a set of registered users.
`In addition to its use as a security device, speaker recogni-
`tion could be used to trigger specialized services based on a
`user’s identity. For example, you could configure an answer-
`ing machine to deliver personalized messages to a small set
`of frequent callers.
`
By the 1970s, a
`number of important techniques essential
`to today’s state-of-the-art speech recogni-
`tion systems had emerged, spurred on in
`part by the Defense Advanced Research
`Projects Agency speech recognition proj-
`ect. These techniques have now been re-
`fined to the point where very high recogni-
`tion rates are possible, and commercial
`systems are available at reasonable prices.
Five factors can be used to control and simplify the speech recognition task¹:

(1) Isolated words. Speech consisting of isolated words (short silences between the words) is much easier to recognize than continuous speech because word boundaries are difficult to find in continuous speech. Also, coarticulation effects in continuous speech cause the pronunciation of a word to change depending on its position relative to other words in a sentence. For example, "did you?" is not the same as "did" + short silence + "you?" Other effects depend on the rate of speaking as well, such as our tendency to drop the "t" in want when saying "want to" casually and quickly.
Error rates can definitely be reduced by requiring the user to pause between each word. For example, in a study by Bahl et al.,² error rates of 9 percent for continuous recognition decreased to 3 percent for isolated-word recognition. However, this type of restriction places a burden on the user and reduces the speed with which information can be input to the system (from a range of about 150-250 words per minute down to about 20-100 words per minute).
(2) Single speaker. Speech from a single speaker is also easier to recognize than speech from a variety of speakers because most parametric representations of speech are sensitive to the characteristics of the particular speaker. This makes a set of pattern-matching templates for one speaker perform poorly for another speaker. Therefore, many systems are speaker dependent - trained for use with each different operator. Relatively few speech recognition systems can be used by the general public. A rule of thumb used by many researchers is that, for the same task, speaker-dependent systems will have error rates roughly three to five times smaller than speaker-independent ones.
One way to make a system speaker independent is simply to mix training templates from a wide variety of speakers. A more sophisticated approach will attempt to look for phonetic features that are relatively invariant between speakers.
(3) Vocabulary size. The size of the vocabulary of words to be recognized also strongly influences recognition accuracy. Large vocabularies are more likely to contain ambiguous words than small vocabularies. Ambiguous words are those whose pattern-matching templates appear similar to the classification algorithm used by the recognizer. They are therefore harder to distinguish from each other. Of course, small vocabularies composed of many ambiguous words can be particularly difficult to recognize. A famous example is the E-set, which consists of a subset of the English alphabet and digits: "B," "C," "D," "E," "G," "P," "T," "V," "Z," and "three."
The amount of time it takes to search the speech model database also relates to vocabulary size. Systems containing many pattern templates typically require pruning techniques to cut down the computational load of the pattern-matching algorithm. By ignoring potentially useful search paths, pruning heuristics can also introduce recognition errors.
(4) Grammar. The grammar of the recognition domain defines the allowable sequences of words. A tightly constrained grammar is one in which the number of words that can legally follow any given word is small. The amount of constraint on word choice is referred to as the perplexity of the grammar (illustrated in the sketch following this list). Systems with low perplexity are potentially more accurate than those that give the user more freedom because the system can limit the effective vocabulary (and search space) to those words that can occur in the current input context. For example, a system described in Kimball et al.³ had an error rate of 1.6 percent with perplexity 19 (tightly constrained), while the error rate hit about 4.5 percent with perplexity 58 (more loosely constrained).
(5) Environment. Background noise, changes in microphone characteristics, and loudness can all dramatically affect recognition accuracy. Many recognition systems are capable of very low error rates as long as the environmental conditions remain quiet and controlled. However, performance degrades when noise is introduced or when conditions differ from the training session used to build the reference templates. To compensate, the user must almost always wear a head-mounted, noise-limiting microphone with the same response characteristics as the microphone used during training.
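To make the notion of perplexity concrete, here is a minimal sketch (ours, not from the article; the grammar and its words are invented) that treats perplexity as the information-theoretic average branching factor of a word-successor grammar, assuming every legal successor of a word is equally likely and weighting each word equally:

```python
import math

def grammar_perplexity(successors):
    """Perplexity of a word grammar given as {word: allowed next words},
    assuming each legal successor is equally likely.

    Perplexity = 2 ** H, where H is the average per-word entropy in bits.
    If every word allows k equally likely successors, this is exactly k."""
    entropies = [math.log2(len(nxt)) for nxt in successors.values()]
    return 2.0 ** (sum(entropies) / len(entropies))

# Toy grammar (hypothetical words): two legal successors everywhere.
grammar = {
    "show":    ["ships", "ports"],
    "list":    ["ships", "ports"],
    "ships":   ["near", "docked"],
    "ports":   ["near", "docked"],
    "near":    ["Norfolk", "Guam"],
    "docked":  ["Norfolk", "Guam"],
    "Norfolk": ["show", "list"],
    "Guam":    ["show", "list"],
}
print(grammar_perplexity(grammar))  # 2.0 -- tightly constrained
```

With two equally likely choices after every word, the perplexity is exactly 2; a grammar that allowed any of 997 words to follow any word would have perplexity 997.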
Components of a speech recognition system

Most computer systems for speech recognition include the following five components (see Figure 1):

(1) A speech capture device. This usually consists of a microphone and associated analog-to-digital converter, which digitally encodes the raw speech waveform.
(2) A digital signal processing module. The DSP module performs endpoint (word boundary) detection to separate speech from nonspeech, converts the raw waveform into a frequency domain representation, and performs further windowing, scaling, filtering, and data compression.⁴ The goal is to enhance and retain only those components of the spectral representation that are useful for recognition purposes, thereby reducing the amount of information that the pattern-matching algorithm must contend with. A set of these speech parameters for one interval of time (usually 10-30 milliseconds) is called a speech frame.
(3) Preprocessed signal storage. Here, the preprocessed speech is buffered for the recognition algorithm.
(4) Reference speech patterns. Stored reference patterns can be matched against the user's speech sample once it has been preprocessed by the DSP module. This information is stored as a set of speech templates or as generative speech models.
(5) A pattern matching algorithm. The algorithm must compute a measure of goodness-of-fit between the preprocessed signal from the user's speech and all the stored templates or speech models. A selection process chooses the template or model (possibly more than one) with the best match.

[Figure 1. Components of a typical speech recognition system: speech capture device, DSP module, preprocessed signal storage, reference speech patterns, and pattern-matching algorithm.]
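The following Python sketch (ours, not the article's; every name and signature is invented, and the DSP stage is reduced to a bare log-magnitude FFT rather than a full windowing, filtering, and compression chain) shows one way the five components might fit together in software:

```python
import numpy as np

FRAME_MS = 20                         # one frame covers 10-30 ms; we pick 20
RATE = 8000                           # assumed A/D sampling rate, Hz
FRAME_LEN = RATE * FRAME_MS // 1000   # samples per frame

def capture(device) -> np.ndarray:
    """(1) Speech capture: microphone plus A/D converter. Here we just
    take already digitized samples from any object with a read() method."""
    return np.asarray(device.read(), dtype=float)

def dsp(samples: np.ndarray) -> np.ndarray:
    """(2) DSP module, radically simplified: chop the waveform into
    frames and keep a crude spectral representation (log magnitude of
    the FFT) of each frame."""
    n_frames = len(samples) // FRAME_LEN
    frames = samples[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-9)

def recognize(device, references: dict) -> str:
    """(3) Buffer the preprocessed speech, then (5) match it against
    (4) the stored reference patterns and pick the best-scoring word."""
    buffered = dsp(capture(device))            # preprocessed signal storage
    def score(template: np.ndarray) -> float:  # naive frame-by-frame
        n = min(len(buffered), len(template))  # distance; no time warping
        return float(np.linalg.norm(buffered[:n] - template[:n]))
    return min(references, key=lambda word: score(references[word]))
```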
Two major types of pattern matching in use are template matching by dynamic time warping and hidden Markov models. Artificial neural networks applied to speech recognition have also had some success, but this work is still in the early stages of research.⁵ Moreover, linguistic knowledge incorporated into the pattern-recognition algorithm can enhance performance. However, such sophisticated techniques lie outside the scope of this article (see, for example, O'Shaughnessy⁴ and Mariani⁶).
Template matching by dynamic time warping became very popular in the 1970s. Template matching is conceptually simple. You want to compare the preprocessed speech waveform directly against a reference template by summing the distances between respective speech frames. However, biological limitations tend to produce nonlinear variations in timing from utterance to utterance. Consequently, the various frames of a word may be out of alignment with the corresponding frames of the given template. Since the order of speech events is fairly constant, you correct the misalignment by stretching the template in some places and compressing it in others to find an optimum match. Dynamic programming helps compute the optimum match. The sidebar "Dynamic time warping" illustrates the resulting time warp process.
Hidden Markov models are used in most current research systems because this technique produces better results for continuous speech with moderate-size vocabularies. HMMs are stochastic state machines that associate probabilities of producing sounds with transitions from state to state. An ideal HMM models speech with the same variations that occur in human speech due to coarticulation and other effects. Speech generated by a human being is matched against an HMM by computing the probability that the HMM would have generated the same utterance or by finding the state sequence through the HMM that has the highest probability of producing the utterance. The fact that HMMs generate poor-quality speech explains why recognition based on HMMs is still not perfect.
The sidebar "Hidden Markov models" further details the use of HMMs. Markov chains, although known about for almost a century, have only been successfully used in the context of speech recognition for the past 15 years or so. Until recently, no method existed for optimizing the model parameters to generate observed speech patterns. (The US Department of Defense actually suppressed publication of the advances in HMM algorithms for a while in the mid-1970s, probably because of their use in cryptanalysis.) As well as representing low-level speech segments and transitions, hidden Markov models provide a framework on which you can model higher level structures in continuous speech signals and incorporate other knowledge about the communication.
`
`28
`
`COMPUTER
`
`Authorized licensed use limited to: MIT Libraries. Downloaded on October 22,2022 at 01:30:24 UTC from IEEE Xplore. Restrictions apply.
`
`Petitioner’s Ex. 1016, Page 3
`
`
`
`Dynamic time warping
`
`Frame distances between the pro-
`cessed speech frames and those of
`the reference templates are summed
`to provide an overall distance measure
`of similarity. But, instead of taking
`frames that correspond exactly in
`time, you would do a time “warp” on
`the utterance (and scale its length) so
`that similar frames in the utterance
`line up better against the reference
`frames. A dynamic programming pro-
`cedure finds a warp that minimizes
`the sum of frame distances in the tem-
`plate comparison. The distance pro-
`
`duced by this warp is chosen as the
`similarity measure.
`In the illustration here, the speech
`frames that make up the test and ref-
`erence templates are shown as scalar
`amplitude values plotted on a graph
`with time as the x axis. In practice,
`they are multidimensional vectors, and
`the distance between them is usually
`taken as the Euclidean distance. The
`graphs show how warping one of the
`templates improves the match be-
`tween them. (For further information,
see chapter 10 of O'Shaughnessy.⁴)
`
[Figure: the test and reference templates drawn as amplitude-versus-time curves, before and after the time warp; warping one of the templates visibly improves the match between them.]
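In code, the warp reduces to a standard dynamic-programming recurrence. The sketch below is ours, not the sidebar's: it uses the common three-move local path (stretch, compress, or advance both) and Euclidean frame distances, and the toy templates are invented:

```python
import numpy as np

def dtw_distance(test: np.ndarray, ref: np.ndarray) -> float:
    """Dynamic time warping distance between two sequences of speech
    frames (rows are frames; columns, feature dimensions). The sidebar's
    frames are scalars, but the same recurrence handles vectors."""
    n, m = len(test), len(ref)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(test[i - 1] - ref[j - 1])  # frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # stretch the template
                                 cost[i, j - 1],      # compress the template
                                 cost[i - 1, j - 1])  # advance both
    return float(cost[n, m])

# Two toy scalar "templates": same shape, different timing.
test = np.array([[0.0], [1.0], [1.0], [2.0], [1.0]])
ref  = np.array([[0.0], [1.0], [2.0], [1.0]])
print(dtw_distance(test, ref))  # 0.0: the warp aligns them exactly
```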
`
`
`
Current speech recognition systems
`
`Current speech recognition systems can
`be categorized according to the types of
`constraint they place on the speech. At one
`end of the spectrum fall speaker-independ-
`ent, continuous, unconstrained-grammar,
`large-vocabulary systems. These systems
`are still very much in the research stage.
Several systems among those representing the state of the art were trained and tested on the same speech data - the DARPA resource management database - and are easily compared. The DARPA resource management task involves queries and commands to a database of warships. The associated database consists of a 997-word vocabulary and grammars with various complexities. Sphinx, a recognizer developed at Carnegie Mellon University, has a maximum word-recognition accuracy of 93.7 percent for a grammar of perplexity 60 and 70.6 percent for a grammar of perplexity 997.⁷ BBN's Byblos and a system developed at Lincoln Labs⁸ have word accuracies of 88.7 percent and 87.4 percent, respectively, for the perplexity 60 grammar (BBN's system requires about two minutes of speech to adapt to a particular speaker before reaching this level of performance). Texas Instruments* and Stanford Research Institute⁹ have reported systems with 44.3 percent and 40.4 percent accuracy on the perplexity 997 grammar. These systems have considerably lower sentence accuracies.
`Representative of the state of the art in
`speaker-dependent, isolated-word, large-
`vocabulary recognizers are systems like
IBM's Tangora recognizer, which is capable of 97 percent accuracy for a 20,000-word vocabulary,¹⁰ and NEC's 97.5 percent accurate, 1,800-word system.¹¹
`A variety of other systems trade off
`constraints on the input speech for higher
`recognition accuracies. Among these are
`the AT&T Bell Labs telephone-grade,
speaker-independent, connected-digit recognizer (98.5 percent accurate when the number of digits is known¹²) and a speaker-
`dependent version of BBN’s Byblos,
`which measured 94.8 percent accurate on
`the perplexity 60 DARPA resource man-
`agement task.
`At the highly constrained speech end of
`the spectrum fall speaker-dependent,
`single-word, small-vocabulary recogni-
tion systems. A variety of such systems has been developed that achieve accuracies above
`99 percent.
`Various commercial systems have ap-
`peared for Sun workstations and IBM-
`compatible PCs over the past few years.
`Table 1 summarizes the capabilities, costs,
`and manufacturers’ claimed accuracies of
`a sample of these commercial products.
Although several companies advertise speaker-independent, continuous, large-vocabulary speech recognition, they carefully avoid making strong claims about the accuracy of their products.
`
* See Kai-Fu Lee,⁷ p. 133.
`
`29
`
`Authorized licensed use limited to: MIT Libraries. Downloaded on October 22,2022 at 01:30:24 UTC from IEEE Xplore. Restrictions apply.
`
`Petitioner’s Ex. 1016, Page 4
`
`
`
`Hidden Markov models
`
`A hidden Markov model (HMM) is a
`doubly stochastic process for produc-
`ing a sequence of observed symbols.
`An underlying stochastic finite state
`machine (FSM) drives a set of sto-
`chastic processes, which produce the
`symbols. When a state is entered after
`a state transition in the FSM, a symbol
`from that state’s set of symbols is se-
`
`lected probabilistically for output. The
`term “hidden” is appropriate because
`the actual state of the FSM cannot be
`observed directly, only through the
`symbols emitted. In the example
`here, the sequence of symbols
`AAaaB could have been produced by
`any of three different state transition
`sequences.
`
State | Possible outputs
1     | A, a
2     | a
3     | B

AAaaB could be produced by the following state sequences:

1 → 1 → 1 → 1 → 3
or 1 → 1 → 1 → 2 → 3
or 1 → 1 → 2 → 2 → 3
`
`Although not shown in the example,
`probabilities are attached to the finite
`state transitions, and discrete probabil-
`ity distributions control the symbol out-
`put for each state (continuous density
`HMMs also exist). In the case of iso-
`lated word recognition, each word in
`the vocabulary has a corresponding
`HMM. These HMMs might actually
`consist of HMMs that model subword
`units such as phonemes connected to
`form a single word-model HMM. In the
`case of continuous word recognition, a
`single HMM corresponds to the do-
`main grammar. This grammar model is
`constructed from word-model HMMs.
`The observable symbols correspond to
`(quantized) speech frame measure-
`ments.
`An algorithm known as the forward/
`backward (or Baum-Welch) algorithm
`
`finds a set of state transition proba-
`bilities and symbol output distribu-
tions for each HMM. This reestimation algorithm uses training data
`to iteratively refine an initial (possibly
`random) set of model parameters
`such that the HMM is more likely to
`generate patterns from the training
`set.
`After this initial training stage, a
`word or sentence to be recognized is
`spoken, and speech measurements
`are made that reduce the utterance to
`a sequence of symbols. In the case of
`isolated word recognition, the forward
`algorithm computes the probability
`that each word model produced the
`observed sequence of symbols - the
`model with the highest probability
`represents the recognized word. In
`the case of continuous recognition,
`the Viterbi algorithm finds the state
`transition path, through the grammar
`model, with the maximum likelihood
`of generating the set of measure-
`ments. The sequence of word models
`on this path corresponds to the rec-
`ognized sentence. (For further infor-
`mation see “Introduction to Hidden
`Markov Models” by L.R. Rabiner and
B.H. Juang, published in IEEE Trans. Acoustics, Speech, and Signal Processing, Jan. 1986, pp. 4-16.)
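As a concrete illustration of the example above, the sketch below scores AAaaB against the three-state model with the forward algorithm. The transition and output probabilities are invented (they are deliberately omitted from the example), so only the structure, not the numbers, comes from the sidebar:

```python
import numpy as np

# The sidebar's model: state 1 emits A or a, state 2 emits a, state 3
# emits B. Probabilities below are assumed for illustration only.
start = np.array([1.0, 0.0, 0.0])        # always start in state 1
trans = np.array([[0.6, 0.3, 0.1],       # left-to-right transitions
                  [0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0]])
emit = {"A": np.array([0.5, 0.0, 0.0]),  # P(symbol | state)
        "a": np.array([0.5, 1.0, 0.0]),
        "B": np.array([0.0, 0.0, 1.0])}

def forward(symbols: str) -> float:
    """Forward algorithm: P(observed symbols | model), summing over
    every state path that could have produced them (the sidebar lists
    three such paths for AAaaB)."""
    alpha = start * emit[symbols[0]]
    for s in symbols[1:]:
        alpha = (alpha @ trans) * emit[s]  # advance one step, re-emit
    return float(alpha.sum())

print(forward("AAaaB"))  # 0.01935: total over the three possible paths
```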
`
`30
`
With commercial systems, you typically get what you
`pay for. Products available for less than
`$1,000 US are isolated-word, small-vo-
`cabulary recognizers. Speaker-dependent,
`isolated-word, large-vocabulary recogniz-
`ers for automated dictation are available
`for a few thousand dollars. You’ll see an
`order of magnitude leap in price when you
`move to large-vocabulary, speaker-inde-
`pendent, continuous-speech recognizers.
`
Speaker recognition - the voice, not just the words
`
Speaker recognition is related to speech recognition. When the task involves identifying the person talking rather than what is said, the speech signal must be processed to extract measures of speaker variability instead of being analyzed by segments corresponding to phonemes or pieces of text one after the other. For speaker recognition, only one classification is made, based on part or all of an input test utterance. Although various studies have shown that certain acoustical features work better than others in predicting speaker identity, few recognizers examine specific sounds because of difficulties in phone segmentation and identification.
Both automatic speaker verification and speaker identification use a stored database of reference patterns (templates) for N known speakers. Both involve similar analysis and decision techniques. Verification is simpler because it only requires comparing the test pattern against one reference pattern, and it involves a binary decision: Is there a good enough match against the template of the claimed speaker? The error rate for speaker identification can be much greater because it requires choosing which of the N voices known to the system best matches the test voice, or "no match" if the test voice differs sufficiently from all the reference templates.
Comparing test and reference utterances for speaker identity is much simpler for identical underlying texts, as in text-dependent speaker recognition. With cooperative speakers you can apply speaker recognition straightforwardly by using the same words to train the system and then test it. This usually happens in verification, but speaker identification often requires text-independent methods.
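The decision logic that separates the two tasks is simple to state in code. A minimal sketch of ours, with an invented distance function standing in for whatever pattern match (template or model based) a real system uses:

```python
import numpy as np

def distance(test, template) -> float:
    """Placeholder similarity measure: Euclidean distance between
    fixed-length feature vectors. A real system would compare frame
    sequences, for example with dynamic time warping."""
    return float(np.linalg.norm(np.asarray(test) - np.asarray(template)))

def verify(test, claimed_template, threshold) -> bool:
    """Verification: one comparison, one binary decision. The threshold
    trades off rejecting true speakers against accepting impostors."""
    return distance(test, claimed_template) <= threshold

def identify(test, templates: dict, no_match_threshold):
    """Identification: choose the best of the N known voices, or report
    "no match" (None) if even the best is too far away."""
    best = min(templates, key=lambda name: distance(test, templates[name]))
    return None if distance(test, templates[best]) > no_match_threshold else best
```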
`
`
`
`
Table 1. A sample of commercially available speech recognition systems (for IBM PCs unless otherwise indicated).

Recognizer | Constraints (speaker/speech/vocabulary) | Price (US $) | Percent word accuracy (per the manufacturer)
Dragon Voice-Scribe 400 | Speaker dependent; isolated-word recognition; 400 words | $995 | >95
Dragon Dictate | Speaker adaptive; isolated-word recognition; 30,000 words | $9,000 | >90
ITT VRS 1280PC | Speaker dependent; continuous-speech recognition; 2,000 words | $9,000 | >98
Phonetic Engine (Speech Systems Inc.)* | Speaker independent; continuous-speech recognition; 10,000-40,000 words | $10,500-$47,100 | 95
Telerec (Voice Control Systems)** | Speaker independent; connected-word recognition; 50 words | $3,000 | >98
Verbex series 5000, 6000, 7000 | Speaker dependent; continuous-speech recognition; 80-10,000 words | $5,600-$9,600 | >99.5
Voice Card (Votan) | Speaker dependent or independent; continuous-speech recognition; 300 words | $3,500 | >99 (speaker dependent), 95 (speaker independent)
Voice Comm Unit (Fujitsu) | Speaker dependent; connected-word recognition; 4,000 words | Only in Japan | 99.9
Voice Master Key (COVOX) | Speaker dependent; isolated-word recognition; 64 words | $150 | 95-96
Voice Navigator (Articulate Systems)† | Speaker dependent; isolated-word recognition; 1,000 words | $1,300 | 95
Voice Pro (Voice Processing Corp.) | Speaker independent; continuous-speech recognition; 13 words | $5,000 | 97-99
Voice Report (Kurzweil AI) | Speaker dependent; isolated-word recognition; 20,000 words | $18,900 | 98

* Available for Sun workstations
** VCS technology is used in Dialogic products
† Available for Macintosh (based on Dragon Systems technology)
`
Higher error rates for text-independent methods mean you will need much more speech data, both for training and testing.
`
`Automatic speaker recognition by com-
`puter has been an active research area since
`the early 1960s. A 1962 paper introduced
`the spectrogram as a means of personal
`
`identification, and this stimulated a good
deal of further research. The term "voiceprint" also appeared in that paper. Unfortu-
`nately, the analogy with fingerprint read-
`
`
`
`
ing is incorrect. As pointed out by Doddington,¹³ the spectrogram is a function of
`the speech signal, not of the physical anat-
`omy of the speaker. The speech signal
`depends far more on the speaker’s actions,
`themselves a complex function of many
`factors, than on the shape of the speaker’s
`vocal tract. The term “voiceprint” is
`misleading.
`
Speaker recognition systems
`
`Speaker recognition by computer has
`only had limited success to date in applica-
`tions using free text (text independent).
`Nonetheless, text-independent recogni-
`tion of speakers has become an increas-
`ingly popular area of research, particularly
`for applications such as forensic, intelli-
`gence gathering, and passive surveillance
`of voice circuits. Free-text recognition
`usually lacks control over conditions that
`influence system performance, including
`variability in the speech signal and distor-
`tions and noise in the communications
`channel. The recognition task faces mul-
`tiple problems: unconstrained
`input
`speech, uncooperative speakers, and un-
`controlled environmental parameters. This
`has made it necessary to focus on features
`and characteristics of speech unique to the
`individual.
`Performance of text-independent sys-
`tems has lagged behind that of text-de-
`pendent systems, as you might expect.
However, Markel and Davis¹⁴ achieved
`excellent results with a linguistically un-
`constrained database of unrehearsed
`speech. Using voice pitch and linear pre-
`dictive coding (LPC) reflection coeffi-
`cients in their model, they reached 2 per-
`cent identification error and 4 percent
`verification error rates for 40-second seg-
`ments of input speech. Results were not
`nearly as good with shorter input speech
`segments, even though the system avoided
`operational problems of microphone deg-
`radation, acoustic noise, and channel dis-
`tortion. In text-independent recognition of
`nine male speakers over a radio channel
`at Bolt Beranek and Newman, the best
`performance was a 30 percent error rate
`for input speech segments of about two
seconds.¹⁵
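For reference, the LPC reflection coefficients that Markel and Davis used fall out of the standard Levinson-Durbin recursion on each frame's autocorrelation. The sketch below is ours (the analysis order, frame length, and random test frame are invented, and their pitch tracker is omitted):

```python
import numpy as np

def reflection_coefficients(frame: np.ndarray, order: int = 10) -> np.ndarray:
    """LPC reflection (PARCOR) coefficients of one speech frame, via the
    Levinson-Durbin recursion on the frame's autocorrelation sequence."""
    r = np.array([frame[: len(frame) - lag] @ frame[lag:]
                  for lag in range(order + 1)])  # autocorrelation, lags 0..order
    a = np.zeros(order + 1)                      # prediction coefficients
    a[0] = 1.0
    err = r[0]                                   # residual energy so far
    ks = np.zeros(order)                         # reflection coefficients
    for i in range(1, order + 1):
        if err <= 0.0:                           # degenerate (e.g., silent) frame
            break
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
        ks[i - 1] = k
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]  # update predictor
        err *= 1.0 - k * k
    return ks

rng = np.random.default_rng(0)
frame = rng.standard_normal(160)             # one 20-ms frame at 8 kHz
print(reflection_coefficients(frame))        # 10 values, each in (-1, 1)
```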
`Text-independent recognition seems
`mainly slated for unobtrusive surveillance
`of individuals. As mentioned earlier, text-
`independent speaker identification poses a
`difficult problem. The accuracy of state-
`of-the-art text-independent identification
`
`32
`
`is low, and it requires continuous use of
`computing power. Text-dependent
`speaker verification has the greatest poten-
`tial for practical application at the moment.
`A number of organizations have research
`and development programs in speaker
`verification, and Texas Instruments and
`AT&T Bell Labs have both made major
`efforts in this research area.
`AT&T Bell Labs has concentrated on
`speaker recognition over telephone lines,
`which faces difficult problems of micro-
`phone and channel distortion. Speaker
`recognition over telephone lines opens up
`an enormous set of possible uses, such as
`identification for various kinds of transac-
`tion processing in banking, shopping, and
`database access.
`AT&T Bell Labs started its automatic
`speaker verification system in 1970. Re-
`searchers there chose measurements that
`are largely insensitive to the phase and
`spectral amplitude distortions likely over
`telephone lines. In an early five-month
`operational simulation, the system showed
`a user rejection rate and impostor accep-
`tance rate of about 10 percent initially for
`new users, dropping to about 5 percent for
`experienced users and fully adapted tem-
`plates. A more recent system used over
`telephone lines has achieved error rates
`(rejection of true speakers and acceptance
of impostors) of approximately 2 percent.¹⁶
`Texas Instruments has applied speaker
`verification to control access to its corpo-
rate computer center.¹⁷ The gross rejection
`rate of the operational system measured
`0.9 percent, with a casual impostor accep-
`tance rate of 0.7 percent. The system has
`been operational 24 hours a day for more
`than a decade. The verification step uses a
`comparison of dynamic features, and time
`alignment is established using a simplified
`form of dynamic time warping. Verifica-
`tion utterances are constructed randomly
`using a four-word fixed phrase structure,
`for example, “Proud Ben served hard.” As
`well as indicating what the user should say,
`the voice prompt helps stabilize pronun-
`ciation because the user tends to say i