An Introduction to Speech and Speaker Recognition

Richard D. Peacocke and Daryl H. Graf
Bell-Northern Research

Speech recognition, to identify the spoken words, and speaker recognition, to identify who is saying them, are becoming commonplace applications of speech processing technology.

Being able to speak to your personal computer, and have it recognize and understand what you say, would provide a comfortable and natural form of communication. It would reduce the amount of typing you have to do, leave your hands free, and allow you to move away from the terminal or screen. You would not even have to be in the line of sight of the terminal. It would also help in some cases if the computer could tell who was speaking.

If you want to use voice as a new medium on a computer workstation, it is natural to explore how speech recognition can contribute to such an environment. Here, we will review the state of speech and speaker recognition, focusing on current technology applied to personal workstations.

Limited forms of speech recognition are available on personal workstations. Currently there is much interest in speech recognition, and performance is improving. Speech recognition has already proven useful for certain applications, such as telephone voice-response systems for selecting services or information, digit recognition for cellular phones, and data entry while walking around a railway yard or clambering over a jet engine during an inspection.

Nonetheless, comfortable and natural communication in a general setting (no constraints on what you can say and how you say it) is beyond us for now; the unconstrained problem is still too difficult to solve. Fortunately, we can simplify the problem enough to allow the creation of applications like the examples just mentioned. Some of these simplifying constraints are discussed in the next section.

Speaker recognition is related to work on speech recognition. Instead of determining what was said, you determine who said it. Deciding whether or not a particular speaker produced the utterance is called verification, and choosing a person's identity from a set of known speakers is called identification. The most general form of speaker recognition (text-independent) is still not very accurate for large speaker populations, but if you constrain the words spoken by the user (text-dependent) and do not allow the speech quality to vary too wildly, then it too can be done on a workstation.

See the sidebar "Applications" for a description of typical speech and speaker recognition applications.

Factors affecting speech recognition

Modern speech recognition research began in the late 1950s with the advent of the digital computer. Combined with tools to capture and analyze speech, such as analog-to-digital converters and sound spectrograms, the computer allowed researchers to search for ways to extract features from speech that allow discrimination between different words. The 1960s saw advances in the automatic segmentation of speech into units of linguistic relevance (such as phonemes, syllables, and words) and in new pattern-matching and classification algorithms. By the 1970s, a number of important techniques essential to today's state-of-the-art speech recognition systems had emerged, spurred on in part by the Defense Advanced Research Projects Agency speech recognition project. These techniques have now been refined to the point where very high recognition rates are possible, and commercial systems are available at reasonable prices.

Applications

Although the performance of speech and speaker recognition systems is far from perfect, these systems have already proven their usefulness for certain applications.

Speech recognition. Currently, speech recognition is most often applied in manufacturing for companies needing voice entry of data or commands while the operator's hands are otherwise occupied. Related applications occur in product inspection, inventory control, command/control, and material handling. Speech recognition also finds frequent application in medicine, where voice input can significantly accelerate the writing of routine reports.

Speech recognition over the telephone network, although less used, has the greatest potential for growth. Automating the telephone operator's job can greatly reduce operating costs for telephone companies. Furthermore, speech recognition can help users control the personal workstation or interact with other applications remotely when touch-tone keypads are not available. (Telephone network applications are described in articles by Matthew Lennig and Ryohei Nakatsu elsewhere in this issue.)

Finally, speech recognition offers greater freedom to the physically handicapped.

Typical real-world applications:

- Delco Electronics employs IBM PC/AT-Cherry Electronics and Intel RMX86 recognition systems to collect circuit board inspection data while the operator repairs and marks the boards.
- Southern Pacific Railway inspectors now routinely use a PC-based Votan recognition system to enter car inspection information from the field by walkie-talkie.
- Michigan Bell has installed a Northern Telecom recognition system to automate collect and third-number billed calls.
- AT&T has also put in field trial systems to automate call-type selection in its Reno, Nevada, and Hayward, California, offices.

Speaker recognition. Speaker recognition has been applied most often as a security device to control access to buildings or information. One of the best known examples is the Texas Instruments corporate computer center security system. Security Pacific has employed speaker verification as a security mechanism on telephone-initiated transfers of large sums of money. In addition to adding security, verification is advantageous because it reduces the turnaround time on these banking transactions. Bellcore uses speaker verification to limit remote access of training information to authorized field personnel. Speaker recognition also provides a mechanism to limit the remote access of a personal workstation to its owner or a set of registered users.

In addition to its use as a security device, speaker recognition could be used to trigger specialized services based on a user's identity. For example, you could configure an answering machine to deliver personalized messages to a small set of frequent callers.

Five factors can be used to control and simplify the speech recognition task¹:

(1) Isolated words. Speech consisting of isolated words (short silences between the words) is much easier to recognize than continuous speech because word boundaries are difficult to find in continuous speech. Also, coarticulation effects in continuous speech cause the pronunciation of a word to change depending on its position relative to other words in a sentence. For example, "did you?" is not the same as "did" + short silence + "you?" Other effects depend on the rate of speaking as well, such as our tendency to drop the "t" in "want" when saying "want to" casually and quickly.

Error rates can definitely be reduced by requiring the user to pause between each word. For example, in a study by Bahl et al.,² error rates of 9 percent for continuous recognition decreased to 3 percent for isolated-word recognition. However, this type of restriction places a burden on the user and reduces the speed with which information can be input to the system (from a range of about 150-250 words per minute down to about 20-100 words per minute).

(2) Single speaker. Speech from a single speaker is also easier to recognize than speech from a variety of speakers because most parametric representations of speech are sensitive to the characteristics of the particular speaker. This makes a set of pattern-matching templates for one speaker perform poorly for another speaker. Therefore, many systems are speaker dependent, trained for use with each different operator. Relatively few speech recognition systems can be used by the general public. A rule of thumb used by many researchers is that, for the same task, speaker-dependent systems will have error rates roughly three to five times smaller than speaker-independent ones.

One way to make a system speaker independent is simply to mix training templates from a wide variety of speakers. A more sophisticated approach will attempt to look for phonetic features that are relatively invariant between speakers.

(3) Vocabulary size. The size of the vocabulary of words to be recognized also strongly influences recognition accuracy. Large vocabularies are more likely to contain ambiguous words than small vocabularies. Ambiguous words are those whose pattern-matching templates appear similar to the classification algorithm used by the recognizer. They are therefore harder to distinguish from each other. Of course, small vocabularies composed of many ambiguous words can be particularly difficult to recognize. A famous example is the E-set, which consists of a subset of the English alphabet and digits: "B," "C," "D," "E," "G," "P," "T," "V," "Z," and "three."

The amount of time it takes to search the speech model database also relates to vocabulary size. Systems containing many pattern templates typically require pruning techniques to cut down the computational load of the pattern-matching algorithm. By ignoring potentially useful search paths, pruning heuristics can also introduce recognition errors, as the sketch below illustrates.
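
Here is a minimal sketch of score-based beam pruning, assuming the matcher accumulates a distance score per candidate; the candidate words and scores below are invented for illustration only:

    # Beam pruning: keep only the best-scoring (lowest-distance) partial
    # matches and abandon the rest, trading accuracy for speed.
    def prune(hypotheses: dict, beam_width: int) -> dict:
        best = sorted(hypotheses.items(), key=lambda kv: kv[1])[:beam_width]
        return dict(best)

    active = prune({"bat": 3.1, "bad": 3.4, "cat": 9.8, "cap": 12.0},
                   beam_width=2)
    print(active)  # {'bat': 3.1, 'bad': 3.4}
    # 'cat' has been discarded; if later frames would have favored it,
    # the pruning itself has introduced a recognition error.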

(4) Grammar. The grammar of the recognition domain defines the allowable sequences of words. A tightly constrained grammar is one in which the number of words that can legally follow any given word is small. The amount of constraint on word choice is referred to as the perplexity of the grammar. Systems with low perplexity are potentially more accurate than those that give the user more freedom because the system can limit the effective vocabulary (and search space) to those words that can occur in the current input context. For example, a system described in Kimball et al.³ had an error rate of 1.6 percent with perplexity 19 (tightly constrained), while the error rate hit about 4.5 percent with perplexity 58 (more loosely constrained).
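
The article uses perplexity informally as the amount of constraint on word choice. A common formal definition (an assumption here; the article itself does not give one) ties perplexity to the per-word entropy of the grammar or language model:

    % Perplexity as the geometric-mean branching factor (standard
    % definition, not taken from the article):
    PP = P(w_1, w_2, \ldots, w_N)^{-1/N} = 2^{H},
    \quad H = -\frac{1}{N} \log_2 P(w_1, w_2, \ldots, w_N)

Roughly, a perplexity of 19 means the recognizer faces an effective choice of about 19 equally likely words at each point, which is why the perplexity 19 grammar above yields much lower error rates than the perplexity 58 one.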

(5) Environment. Background noise, changes in microphone characteristics, and loudness can all dramatically affect recognition accuracy. Many recognition systems are capable of very low error rates as long as the environmental conditions remain quiet and controlled. However, performance degrades when noise is introduced or when conditions differ from the training session used to build the reference templates. To compensate, the user must almost always wear a head-mounted, noise-limiting microphone with the same response characteristics as the microphone used during training.

Components of a speech recognition system

Most computer systems for speech recognition include the following five components (see Figure 1; a sketch of the framing step appears after the list):

(1) A speech capture device. This usually consists of a microphone and associated analog-to-digital converter, which digitally encodes the raw speech waveform.

(2) A digital signal processing module. The DSP module performs endpoint (word boundary) detection to separate speech from nonspeech, converts the raw waveform into a frequency domain representation, and performs further windowing, scaling, filtering, and data compression.⁴ The goal is to enhance and retain only those components of the spectral representation that are useful for recognition purposes, thereby reducing the amount of information that the pattern-matching algorithm must contend with. A set of these speech parameters for one interval of time (usually 10-30 milliseconds) is called a speech frame.

(3) Preprocessed signal storage. Here, the preprocessed speech is buffered for the recognition algorithm.

(4) Reference speech patterns. Stored reference patterns can be matched against the user's speech sample once it has been preprocessed by the DSP module. This information is stored as a set of speech templates or as generative speech models.

(5) A pattern-matching algorithm. The algorithm must compute a measure of goodness-of-fit between the preprocessed signal from the user's speech and all the stored templates or speech models. A selection process chooses the template or model (possibly more than one) with the best match.

[Figure 1. Components of a typical speech recognition system: a speech capture device feeds a DSP module, whose output is buffered in storage and matched against reference speech patterns.]

Two major types of pattern matching in use are template matching by dynamic time warping and hidden Markov models. Artificial neural networks applied to speech recognition have also had some success, but this work is still in the early stages of research.⁵ Moreover, linguistic knowledge incorporated into the pattern-recognition algorithm can enhance performance. However, such sophisticated techniques lie outside the scope of this article (see, for example, O'Shaughnessy⁴ and Mariani⁶).

Template matching by dynamic time warping became very popular in the 1970s. Template matching is conceptually simple. You want to compare the preprocessed speech waveform directly against a reference template by summing the distances between respective speech frames. However, biological limitations tend to produce nonlinear variations in timing from utterance to utterance. Consequently, the various frames of a word may be out of alignment with the corresponding frames of the given template. Since the order of speech events is fairly constant, you correct the misalignment by stretching the template in some places and compressing it in others to find an optimum match. Dynamic programming helps compute the optimum match. The sidebar "Dynamic time warping" illustrates the resulting time warp process.

Hidden Markov models are used in most current research systems because this technique produces better results for continuous speech with moderate-size vocabularies. HMMs are stochastic state machines that associate probabilities of producing sounds with transitions from state to state. An ideal HMM models speech with the same variations that occur in human speech due to coarticulation and other effects. Speech generated by a human being is matched against an HMM by computing the probability that the HMM would have generated the same utterance or by finding the state sequence through the HMM that has the highest probability of producing the utterance. The fact that HMMs generate poor-quality speech explains why recognition based on HMMs is still not perfect. The sidebar "Hidden Markov models" further details the use of HMMs.

Markov chains, although known about for almost a century, have only been successfully used in the context of speech recognition for the past 15 years or so. Until recently, no method existed for optimizing the model parameters to generate observed speech patterns. (The US Department of Defense actually suppressed publication of the advances in HMM algorithms for a while in the mid-1970s, probably because of their use in cryptanalysis.) As well as representing low-level speech segments and transitions, hidden Markov models provide a framework on which you can model higher level structures in continuous speech signals and incorporate other knowledge about the communication.

Dynamic time warping

Frame distances between the processed speech frames and those of the reference templates are summed to provide an overall distance measure of similarity. But, instead of taking frames that correspond exactly in time, you would do a time "warp" on the utterance (and scale its length) so that similar frames in the utterance line up better against the reference frames. A dynamic programming procedure finds a warp that minimizes the sum of frame distances in the template comparison. The distance produced by this warp is chosen as the similarity measure.

In the illustration here, the speech frames that make up the test and reference templates are shown as scalar amplitude values plotted on a graph with time as the x axis. In practice, they are multidimensional vectors, and the distance between them is usually taken as the Euclidean distance. The graphs show how warping one of the templates improves the match between them. (For further information, see chapter 10 of O'Shaughnessy.⁴)

[Illustration: two amplitude-versus-time plots of the reference and test templates, before and after the time warp.]
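
As a minimal sketch of the procedure the sidebar describes, the following computes a dynamic time warping distance between two frame sequences by dynamic programming. It assumes each frame is a feature vector, uses the Euclidean frame distance mentioned above, and omits the slope constraints and path normalization of production systems:

    import numpy as np

    def dtw_distance(test, ref):
        # test, ref: arrays of shape (num_frames, feature_dim).
        # cost[i][j] = best summed frame distance aligning the first i
        # test frames with the first j reference frames.
        n, m = len(test), len(ref)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(test[i - 1] - ref[j - 1])
                # Three moves: stretch the test, match one-to-one,
                # or compress the test against the reference.
                cost[i, j] = d + min(cost[i - 1, j],
                                     cost[i - 1, j - 1],
                                     cost[i, j - 1])
        return cost[n, m]

    def recognize(test, templates):
        # Isolated-word recognition: the reference template with the
        # smallest warped distance wins.
        return min(templates,
                   key=lambda word: dtw_distance(test, templates[word]))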

Current speech recognition systems

Current speech recognition systems can be categorized according to the types of constraint they place on the speech. At one end of the spectrum fall speaker-independent, continuous, unconstrained-grammar, large-vocabulary systems. These systems are still very much in the research stage.

Several systems among those representing the state of the art were trained and tested on the same speech data (the DARPA resource management database) and are easily compared. The DARPA resource management task involves queries and commands to a database of warships. The associated database consists of a 997-word vocabulary and grammars with various complexities. Sphinx, a recognizer developed at Carnegie Mellon University, has a maximum word-recognition accuracy of 93.7 percent for a grammar of perplexity 60 and 70.6 percent for a grammar of perplexity 997.⁷ BBN's Byblos and a system developed at Lincoln Labs⁸ have word accuracies of 88.7 percent and 87.4 percent, respectively, for the perplexity 60 grammar (BBN's system requires about two minutes of speech to adapt to a particular speaker before reaching this level of performance). Texas Instruments* and Stanford Research Institute⁹ have reported systems with 44.3 percent and 40.4 percent accuracy on the perplexity 997 grammar. These systems have considerably lower sentence accuracies.

Representative of the state of the art in speaker-dependent, isolated-word, large-vocabulary recognizers are systems like IBM's Tangora recognizer, which is capable of 97 percent accuracy for a 20,000-word vocabulary,¹⁰ and NEC's 97.5 percent accurate, 1,800-word system.¹¹

A variety of other systems trade off constraints on the input speech for higher recognition accuracies. Among these are the AT&T Bell Labs telephone-grade, speaker-independent, connected-digit recognizer (98.5 percent accurate when the number of digits is known¹²) and a speaker-dependent version of BBN's Byblos, which measured 94.8 percent accurate on the perplexity 60 DARPA resource management task.

At the highly constrained speech end of the spectrum fall speaker-dependent, single-word, small-vocabulary recognition systems. A variety of such systems can achieve accuracies above 99 percent.

Various commercial systems have appeared for Sun workstations and IBM-compatible PCs over the past few years. Table 1 summarizes the capabilities, costs, and manufacturers' claimed accuracies of a sample of these commercial products. Although several companies advertise speaker-independent, continuous, large-vocabulary speech recognition, they carefully avoid making strong claims about the accuracy of their products. With commercial systems, you typically get what you pay for. Products available for less than $1,000 US are isolated-word, small-vocabulary recognizers. Speaker-dependent, isolated-word, large-vocabulary recognizers for automated dictation are available for a few thousand dollars. You'll see an order of magnitude leap in price when you move to large-vocabulary, speaker-independent, continuous-speech recognizers.

* See Kai-fu Lee,⁷ p. 133.

Hidden Markov models

A hidden Markov model (HMM) is a doubly stochastic process for producing a sequence of observed symbols. An underlying stochastic finite state machine (FSM) drives a set of stochastic processes, which produce the symbols. When a state is entered after a state transition in the FSM, a symbol from that state's set of symbols is selected probabilistically for output. The term "hidden" is appropriate because the actual state of the FSM cannot be observed directly, only through the symbols emitted. In the example here, the sequence of symbols AAaaB could have been produced by any of three different state transition sequences.

    State    Possible Outputs
    1        A, a
    2        a
    3        B

AAaaB could be produced by the following state sequences:

    1 -> 1 -> 1 -> 1 -> 3, or
    1 -> 1 -> 1 -> 2 -> 3, or
    1 -> 1 -> 2 -> 2 -> 3

Although not shown in the example, probabilities are attached to the finite state transitions, and discrete probability distributions control the symbol output for each state (continuous density HMMs also exist). In the case of isolated word recognition, each word in the vocabulary has a corresponding HMM. These HMMs might actually consist of HMMs that model subword units such as phonemes connected to form a single word-model HMM. In the case of continuous word recognition, a single HMM corresponds to the domain grammar. This grammar model is constructed from word-model HMMs. The observable symbols correspond to (quantized) speech frame measurements.

An algorithm known as the forward/backward (or Baum-Welch) algorithm finds a set of state transition probabilities and symbol output distributions for each HMM. This iterative reestimation algorithm uses training data to refine an initial (possibly random) set of model parameters such that the HMM is more likely to generate patterns from the training set.

After this initial training stage, a word or sentence to be recognized is spoken, and speech measurements are made that reduce the utterance to a sequence of symbols. In the case of isolated word recognition, the forward algorithm computes the probability that each word model produced the observed sequence of symbols; the model with the highest probability represents the recognized word. In the case of continuous recognition, the Viterbi algorithm finds the state transition path, through the grammar model, with the maximum likelihood of generating the set of measurements. The sequence of word models on this path corresponds to the recognized sentence. (For further information see "An Introduction to Hidden Markov Models" by L.R. Rabiner and B.H. Juang, IEEE ASSP Magazine, Jan. 1986, pp. 4-16.)
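
The sidebar's example can be scored mechanically with the forward algorithm. The sketch below fills in transition and output probabilities that the sidebar omits (the numbers are invented for illustration only) and sums the probability of AAaaB over all hidden state paths:

    import numpy as np

    # Row i gives P(next state | current state i+1); state 3 absorbs.
    A = np.array([[0.5, 0.3, 0.2],
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
    # Output distributions matching the sidebar's table: state 1 emits
    # "A" or "a", state 2 only "a", state 3 only "B".
    B = {"A": np.array([0.7, 0.0, 0.0]),
         "a": np.array([0.3, 1.0, 0.0]),
         "B": np.array([0.0, 0.0, 1.0])}
    pi = np.array([1.0, 0.0, 0.0])   # always start in state 1

    def forward(symbols):
        # alpha[i] = P(symbols seen so far, current state = i + 1);
        # the final sum covers every possible hidden state path.
        alpha = pi * B[symbols[0]]
        for s in symbols[1:]:
            alpha = (alpha @ A) * B[s]
        return alpha.sum()

    print(forward(list("AAaaB")))
    # In isolated-word recognition this score is computed for every
    # word model, and the highest-scoring model names the word.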

Speaker recognition - the voice, not just the words

Speaker recognition is related to speech recognition. When the task involves identifying the person talking rather than what is said, the speech signal must be processed to extract measures of speaker variability instead of being analyzed by segments corresponding to phonemes or pieces of text one after the other. For speaker recognition, only one classification is made, based on part or all of an input test utterance. Although various studies have shown that certain acoustical features work better than others in predicting speaker identity, few recognizers examine specific sounds because of difficulties in phone segmentation and identification.

Both automatic speaker verification and speaker identification use a stored database of reference patterns (templates) for N known speakers. Both involve similar analysis and decision techniques. Verification is simpler because it only requires comparing the test pattern against one reference pattern, and it involves a binary decision: Is there a good enough match against the template of the claimed speaker? The error rate for speaker identification can be much greater because it requires choosing which of the N voices known to the system best matches the test voice, or "no match" if the test voice differs sufficiently from all the reference templates.
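
A minimal sketch of the two decision rules, reusing the dtw_distance matcher from the earlier sketch (the article does not prescribe a particular matcher, and the acceptance threshold here is an assumed, per-application tuning parameter):

    def verify(test, claimed_template, threshold):
        # Binary decision: does the utterance match the claimed
        # speaker's template well enough?
        return dtw_distance(test, claimed_template) <= threshold

    def identify(test, templates, threshold):
        # Choose the best of N known speakers, or report no match if
        # even the best is too far from the test voice.
        best = min(templates,
                   key=lambda spk: dtw_distance(test, templates[spk]))
        return best if dtw_distance(test, templates[best]) <= threshold \
               else None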

Comparing test and reference utterances for speaker identity is much simpler for identical underlying texts, as in text-dependent speaker recognition. With cooperative speakers you can apply speaker recognition straightforwardly by using the same words to train the system and then test it. This usually happens in verification, but speaker identification often requires text-independent methods. Higher error rates for text-independent methods mean you will need much more speech data, both for training and for testing.

Table 1. A sample of commercially available speech recognition systems (for IBM PCs unless otherwise indicated).

Recognizer | Constraints (Speaker/Speech/Vocabulary) | Price (US $) | Word Accuracy % (per the manufacturer)
Dragon Voice-Scribe 400 | Speaker dependent; isolated-word recognition; 400 words | $995 | >95
Dragon Dictate | Speaker adaptive; isolated-word recognition; 30,000 words | $9,000 | >90
ITT VRS 1280PC | Speaker dependent; continuous-speech recognition; 2,000 words | $9,000 | >98
Phonetic Engine* (Speech Systems Inc.) | Speaker independent; continuous-speech recognition; 10,000-40,000 words | $10,500-$47,100 | 95
Telerec (Voice Control Systems)** | Speaker independent; connected-word recognition; 50 words | $3,000 | >98
Verbex series 5000, 6000, 7000 | Speaker dependent; continuous-speech recognition; 80-10,000 words | $5,600-$9,600 | >99.5
Voice Card (Votan) | Speaker dependent or independent; continuous-speech recognition; 300 words | $3,500 | >99 (speaker dependent), 95 (speaker independent)
Voice Comm Unit (Fujitsu) | Speaker dependent; connected-word recognition; 4,000 words | Only in Japan | 99.9
Voice Master Key (COVOX) | Speaker dependent; isolated-word recognition; 64 words | $150 | 95-96
Voice Navigator† (Articulate Systems) | Speaker dependent; isolated-word recognition; 1,000 words | $1,300 | 95
Voice Pro (Voice Processing Corp.) | Speaker independent; continuous-speech recognition; 13 words | $5,000 | 97-99
Voice Report (Kurzweil AI) | Speaker dependent; isolated-word recognition; 20,000 words | $18,900 | 98

* Available for Sun workstations.
** VCS technology is used in Dialogic products.
† Available for Macintosh (based on Dragon Systems technology).

Automatic speaker recognition by computer has been an active research area since the early 1960s. A 1962 paper introduced the spectrogram as a means of personal identification, and this stimulated a good deal of further research. The term "voiceprint" also appeared in that paper. Unfortunately, the analogy with fingerprint reading is incorrect. As pointed out by Doddington,¹³ the spectrogram is a function of the speech signal, not of the physical anatomy of the speaker. The speech signal depends far more on the speaker's actions, themselves a complex function of many factors, than on the shape of the speaker's vocal tract. The term "voiceprint" is misleading.

Speaker recognition systems

Speaker recognition by computer has had only limited success to date in applications using free text (text independent). Nonetheless, text-independent recognition of speakers has become an increasingly popular area of research, particularly for applications such as forensics, intelligence gathering, and passive surveillance of voice circuits. Free-text recognition usually lacks control over conditions that influence system performance, including variability in the speech signal and distortions and noise in the communications channel. The recognition task faces multiple problems: unconstrained input speech, uncooperative speakers, and uncontrolled environmental parameters. This has made it necessary to focus on features and characteristics of speech unique to the individual.

Performance of text-independent systems has lagged behind that of text-dependent systems, as you might expect. However, Markel and Davis¹⁴ achieved excellent results with a linguistically unconstrained database of unrehearsed speech. Using voice pitch and linear predictive coding (LPC) reflection coefficients in their model, they reached 2 percent identification error and 4 percent verification error rates for 40-second segments of input speech. Results were not nearly as good with shorter input speech segments, even though the system avoided operational problems of microphone degradation, acoustic noise, and channel distortion. In text-independent recognition of nine male speakers over a radio channel at Bolt Beranek and Newman, the best performance was a 30 percent error rate for input speech segments of about two seconds.¹⁵
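
As a minimal sketch of one of the features Markel and Davis used, the following derives LPC reflection (PARCOR) coefficients from a single speech frame with the standard Levinson-Durbin recursion (an assumed implementation choice; pitch extraction, averaging over the 40-second segment, and the classification step are omitted):

    import numpy as np

    def reflection_coefficients(frame, order=10):
        # Autocorrelation of the frame at lags 0..order (the frame must
        # be longer than 'order' samples).
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        a = np.zeros(order + 1)    # prediction polynomial, a[0] = 1
        a[0] = 1.0
        k = np.zeros(order)        # reflection (PARCOR) coefficients
        err = r[0]                 # prediction error energy
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
            k[i - 1] = -acc / err
            # Update the polynomial and shrink the residual energy.
            a[1:i] = a[1:i] + k[i - 1] * a[1:i][::-1]
            a[i] = k[i - 1]
            err *= 1.0 - k[i - 1] ** 2
        return k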

Text-independent recognition seems mainly slated for unobtrusive surveillance of individuals. As mentioned earlier, text-independent speaker identification poses a difficult problem. The accuracy of state-of-the-art text-independent identification is low, and it requires continuous use of computing power. Text-dependent speaker verification has the greatest potential for practical application at the moment. A number of organizations have research and development programs in speaker verification, and Texas Instruments and AT&T Bell Labs have both made major efforts in this research area.

AT&T Bell Labs has concentrated on speaker recognition over telephone lines, which faces difficult problems of microphone and channel distortion. Speaker recognition over telephone lines opens up an enormous set of possible uses, such as identification for various kinds of transaction processing in banking, shopping, and database access.

AT&T Bell Labs started its automatic speaker verification system in 1970. Researchers there chose measurements that are largely insensitive to the phase and spectral amplitude distortions likely over telephone lines. In an early five-month operational simulation, the system showed a user rejection rate and impostor acceptance rate of about 10 percent initially for new users, dropping to about 5 percent for experienced users and fully adapted templates. A more recent system used over telephone lines has achieved error rates (rejection of true speakers and acceptance of impostors) of approximately 2 percent.¹⁶

Texas Instruments has applied speaker verification to control access to its corporate computer center.¹⁷ The gross rejection rate of the operational system measured 0.9 percent, with a casual impostor acceptance rate of 0.7 percent. The system has been operational 24 hours a day for more than a decade. The verification step uses a comparison of dynamic features, and time alignment is established using a simplified form of dynamic time warping. Verification utterances are constructed randomly using a four-word fixed phrase structure, for example, "Proud Ben served hard." As well as indicating what the user should say, the voice prompt helps stabilize pronunciation because the user tends to say i
