`
`Speech Recognition by Machine: A Review
`
D. RAJ REDDY
`
`..
`use of knowledge such as acoustic-phonetiis, syntax, seman-
`Abrtroct-Tl~is paper provides I review of recent developments m
`mteooenrboe
`research. The concept of (10u~cea of kao-
`is
`tics, and context are more clearly understood. Computer pro-
`i n t r o d d l l l d t h e u r e o f k n o w ~ t o ~ n c s a t e . a d P a i t y h y p o t h e t g
`grams for speech recognition seem to deal with ambiguity, er-
`bdirured. ThedifficuttiedthrtPireintbeconrtnlctiwofdiffmnt
`ror, and nongrammaticality of input in a graceful and effective
`is pmemted. Aspects of compo-
`typerOfSpeechncognitionCyste~Uedbaued~the~ctmemd
`pedornunce of sed such
`manner that is uncommon to most other computer programs.
`&lent 8UbSyStertM 8t tbe 8COQ.&, phoaetic, Sy’Dt8Cliq lad e m t i c k V -
`Yet there is still a long way to go. We can handle relatively re-
`eb, are presented. System o x p i z a t h ~ that ue required for effective
`stricted task domains requiring simple grammatical structure
`interactioll .rd OT d various component Wbsystemsin the plesence
`and a few hundred words of vocabulary for single trained
`0fewr.adlmbigUityaredisamed.
`speakers in controlled environments, but we are very far from
`I. INTRODUCTION
`being able to handle relatively unrestricted dialogs from a large
`population of speakers in uncontrolled environments. Many
`more years of intensive research seem necessary
`to achieve
`such a goal.
`Sources of Information: The primary sources of informa-
`on Acoustics,
`tion in this area are the IEEE Transactions
`Speech, and Signal Processing (pertinent special issues: vol. 21,
`June 1973; vol. 23, Feb. 1975) and the Journul of the Acous-
`tical Society of America (in particular, Semiannual Conference
`Abstracts which appear with January
`and July issues each
`have been appearing as spring and fall
`year; recently they
`relevant journals are IEEE Transactions
`supplements). Other
`(Computer; Information Theory;
`and Systems, Man, and
`
`Manuscript received September 1, 1975; revised November 19, 1975.
`This work was supported in part by the Advanced Research Projects
`Agency and in part by the John Simon Guggenheim Memorial Founda-
`tion.
`is with the Computer Science Department, Carnegie-
`The author
`MeUon University, Pittsburgh, PA 152 13.
`
`T HE OBJECT of this paper is to review recent develop-
`
`speech recognition. The Advanced Research
`ments in
`Projects Agency’s support of speech understanding re-
`search has led to a significantly increased level of activity in
`this area since 1971. Several connected speech recognition
`systems have been developed and demonstrated. The role and
`
`Cybernetics), Communications of ACM, International Journal
`of Man-Machine Studies, Artificial Intelligence,
`and Pattern
`Recognition.
The books by Flanagan [44], Fant [40], and Lehiste [84] provide extensive coverage of speech, acoustics, and phonetics, and form the necessary background for speech recognition research. Collections of papers, in the books edited by David and Denes [25], Lehiste [83], Reddy [121], and Wathen-Dunn [158], and in conference proceedings edited by Erman [34] and Fant [41], provide a rich source of relevant material.
The articles by Lindgren [88], Hyde [66], Fant [39], Zagoruiko [171], Derkach [27], Hill [63], and Otten [113] cover the research progress in speech recognition prior to 1970 and proposals for the future. The papers by Klatt [74] and Wolf [163] provide other points of view of recent advances.
`Other useful sources of information are research reports pub-
`lished by various research groups active in this area (and can be
`obtained by writing to one of the principal researchers given in
`parentheses): Bell Telephone Laboratories (Denes, Flanagan,
`Fujimura, Rabiner); Bolt Beranek and Newman, Inc. (Makhoul,
`Wolf, Woods); Carnegie-Mellon University
`(Erman, Newell,
`Reddy); Department of Speech Communication, KTH, Stock-
`holm (Fant); Haskins Laboratories (Cooper, Mermelstein);
`IBM Research Laboratories (Bahl, Dixon, Jelinek); M.I.T. Lin-
`coln Laboratories (Forgie, Weinstein); Research Laboratory
`of Electronics, M.I.T. (Klatt); Stanford
`Research Institute
`(Walker); Speech Communication
`Research Laboratory
`(Broad, Markel, Shoup); System Development Corporation
(Barnett, Ritea); Sperry Univac (Lea, Medress); University of
`California, Berkeley (O’Malley);
`Xerox Palo Alto Research
`Center (White); and Threshold Technology (Martin).
`In addi-
`tion there are several groups in Japan and Europe who publish
`reports in national languages and English. Complete addresses
`for most of these groups can be obtained by
`referring to
`author addresses in the IEEE Trans. Acoust., Speech, Signal
`Processing, June 1973 and Feb. 1975. For background and in-
`troductory information on various aspects of speech recogni-
`tion we recommend the tutorial-review papers on “Speech
`understanding systems” by Newell, “Parametric representa-
`tions of Speech” by Schafer and Rabiner, “Linear prediction
`in automatic speech recognition” by Makhoul, “Concepts for
`Acoustic-Phonetic recognition” by Broad and Shoup, “Syn-
`tax, Semantics and Speech” by Woods, and “System organiza-
`tion for
`speech understanding”
`by Reddy and Erman, all
`appearing in Speech Recognition: Invited Papers of the IEEE
Symposium [121].
`Scope of the Paper: This paper is intended as a review and
`not as an exhaustive survey of all research in speech recogni-
`tion. It is hoped that, upon reading this paper, the reader will
know what a speech recognition system consists of, what makes speech recognition a difficult problem, and what aspects of the problem remain unsolved. To this end we will
`study the structure and performance of some typical systems,
`component subsystems that are needed, and system organiza-
`tion that permits effective interaction and use of
`the compo-
`nents. We do not attempt to give detailed descriptions of sys-
`tems or mathematical formulations, as these are available in
`published literature. Rather, we will mainly present distinctive
and novel features of selected systems and their relative advantages.
`Many of the comments of an editorial nature that appear in
this paper represent one point of view and are not necessarily
`shared by all the researchers in the field. Two other papers
`
`appearing in this issue, Jelinek’s on statistical approaches and
`Martin’s on applications, augment and complement this paper.
`Papers by Flanagan and others, also appearing in this issue,
`look at the total problem
`of man-machine
`communication
`by voice.
`A. The Nature of the Speech Recognition Problem
`The main goal of this area of research
`is to develop tech-
niques and systems for speech input to machines. In earlier
`attempts, it was hoped that learning how
`to build simple
`recognition systems would lead
`in a natural
`way to more
`sophisticated systems. Systems were built in the 1950’s for
`vowel recognition and digit recognition, producing creditable
`performance. But these techniques and
`results could not be
`extended and extrapolated toward
`larger and more sophisti-
cated systems. This has led to the appreciation that linguistic
`and contextual cues must be brought to bear on the recogni-
`tion strategy if we are to achieve significant progress. The
`many dimensions that affect the feasibility and performance
`of a speech recognition system
`are clearly stated in Newell
`[108].
`Fig. 1 characterizes several different types of speech recogni-
`tion systems ordered according
`to their intrinsic difficulty.
`There are already several commercially available isolated word
`recognition systems today. A few research systems have been
`developed for restricted connected speech recognition
`and
`speech understanding. There
`is hope among some researchers
`that, in the not too distant future, we may be able to develop
`interactive systems for taking dictation
`using a restricted
`vocabulary. Unlimited
`vocabulary speech understanding and
`connected speech recognition systems seem feasible to some,
`but are likely to require many years of directed research.
`The main feature that
`is used to characterize the com-
`plexity of a speech recognition task
`is whether the speech is
connected or is spoken one word at a time. In connected
`speech, it is difficult to determine where one word ends and
`another begins, and the characteristic acoustic patterns
`of
`words exhibit much greater variability depending on the con-
`text.
Isolated word recognition systems do not have these problems since words are separated by pauses.
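As a concrete illustration of why pause-separated input is easier, the short sketch below locates word endpoints simply by thresholding short-time energy. The frame length, silence threshold, and toy waveform are assumptions introduced here for illustration and are not taken from any system discussed in this paper.

```python
# Minimal sketch: locating word endpoints in pause-separated speech by
# thresholding short-time energy. The frame size, silence threshold, and the
# toy signal are illustrative assumptions, not values from any cited system.
import math

def short_time_energy(samples, frame_len=160):
    """Average squared amplitude over consecutive non-overlapping frames."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [sum(x * x for x in f) / len(f) for f in frames if f]

def find_word_endpoints(samples, frame_len=160, silence_thresh=0.01):
    """Return (start_frame, end_frame) pairs for runs of frames above threshold."""
    energy = short_time_energy(samples, frame_len)
    endpoints, start = [], None
    for i, e in enumerate(energy):
        if e > silence_thresh and start is None:
            start = i                         # word onset: energy rises above silence
        elif e <= silence_thresh and start is not None:
            endpoints.append((start, i))      # word offset: energy falls back to silence
            start = None
    if start is not None:
        endpoints.append((start, len(energy)))
    return endpoints

if __name__ == "__main__":
    # Toy "utterance" at 8 kHz: silence, a 0.4-s burst of tone, silence.
    silence = [0.0] * 1600
    tone = [0.5 * math.sin(2 * math.pi * 1000 * n / 8000) for n in range(3200)]
    print(find_word_endpoints(silence + tone + silence))   # -> [(10, 30)]
```

In connected speech no such reliable energy dips separate the words, which is precisely the segmentation difficulty described above.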
The second feature that affects the complexity of a system is
`the vocabulary size. As the size or the confusability of a
`vocabulary increases, simple brute-force methods of represen-
`tation and matching become too expensive and unacceptable.
`Techniques for compact representation
`of acoustic patterns
`of words, and techniques for reducing search by constraining
`the number of possible words that can occur at a given point,
`assume added importance.
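One hedged sketch of the kind of search reduction meant here: if the task grammar supplies, for each word, the set of words that may legally follow it, the matcher needs to score only those candidates against the acoustics. The vocabulary, successor table, and crude similarity score below are invented for illustration.

```python
# Illustrative sketch of search reduction: only words the task grammar allows
# after the current word are matched against the acoustics. The tiny
# vocabulary, successor table, and crude match score are all invented.

SUCCESSORS = {                      # hypothetical command-language grammar
    "<start>": ["show", "list"],
    "show":    ["status", "map"],
    "list":    ["files", "status"],
}

def candidate_words(previous_word):
    """Words the grammar permits at this point; the full vocabulary otherwise."""
    full_vocabulary = sorted({w for ws in SUCCESSORS.values() for w in ws})
    return SUCCESSORS.get(previous_word, full_vocabulary)

def acoustic_match(word, observed):
    """Stand-in for template matching: crude character-overlap similarity."""
    shared = len(set(word) & set(observed))
    return shared / max(len(word), len(observed))

def best_next_word(previous_word, observed):
    # Instead of scoring every word in the vocabulary, score only the
    # grammatically possible successors; this is the search reduction.
    return max(candidate_words(previous_word),
               key=lambda w: acoustic_match(w, observed))

if __name__ == "__main__":
    print(best_next_word("<start>", "shaw"))   # 'show': grammar and acoustics agree
    print(best_next_word("show", "statas"))    # 'status'
```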
`Just as vocabulary is restricted to make a speech recognition
`problem more tractable, there are several other aspects of the
`problem which can be used to constrain the speech recognition
`task so that what might otherwise be an unsolvable problem
`becomes solvable. The rest of the features in Fig. 1, i.e., task-
`specific knowledge, language of communication, number and
`cooperativeness of speakers,
`and quietness of environment,
`represent some of the commonly used constraints in speech
`recognition systems.
`One way
`to reduce the problems of error and ambiguity
`resulting from the use of connected speech and large vocabu-
`laries is to use all the available task-specific information to
`reduce search. The restricted speech understanding systems
`(Fig. 1, line 3) assume that the speech signal does not have all
`the necessary information to uniquely decode the message and
`
that, to be successful, one must use all the available sources of knowledge to infer (or deduce) the intent of the message [107]. The performance criterion is somewhat relaxed in that, as long as the message is understood, it is not important to recognize each and every phoneme and/or word correctly. The requirement of using all the sources of knowledge, and the representation of the task, conversational context, understanding, and response generation, all add to the difficulty and overall complexity of speech understanding systems.

System type                                        Mode of speech     Vocabulary   Task-specific   Language                     Speaker             Environment
                                                                      size         information
Word recognition-isolated (WR)                     isolated words     10-300       limited use     -                            cooperative         -
Connected speech recognition-restricted (CSR)      connected speech   30-500       limited use     restricted command language  cooperative         quiet room
Speech understanding-restricted (SU)               connected speech   100-2000     full use        English-like                 not uncooperative   -
Dictation machine-restricted (DM)                  connected speech   1000-10000   limited use     English-like                 cooperative         quiet room
Unrestricted speech understanding (US)             connected speech   unlimited    full use        English                      not uncooperative   -
Unrestricted connected speech recognition (UCSR)   connected speech   unlimited    none            English                      not uncooperative   quiet room

Fig. 1. Different types of speech recognition systems ordered according to their intrinsic difficulty, and the dimensions along which they are usually constrained. Vocabulary sizes given are for some typical systems and can vary from system to system. It is assumed that a cooperative speaker would speak clearly and would be willing to repeat or spell a word. A not uncooperative speaker does not try to confuse the system but does not want to go out of his way to help it either. In particular, the system would have to handle "uhms" and "ahs" and other speech-like noise. The "-" indicates an "unspecified" entry, variable from system to system.
`The restricted connected speech recognition systems (Fig. 1,
`line 2) keep their program structure simple by using only some
`task-specific knowledge, such as restricted vocabulary and syn-
`tax, and by requiring that the speaker speak clearly and use a
`quiet room. The
`simpler program structure of these systems
`provides an economical solution in a restricted
`class of con-
`nected speech recognition tasks. Further, by not
`being task-
`specific, they can be used
`in a wider variety of applications
`without modification.
`The restricted speech understanding systems have the advan-
`tage that by making effective use of all the available knowl-
`edge, including semantics, conversational context, and speaker
`preferences, they can provide a more flexible and hopefully
`higher performance system. For example, they usually permit
`an English-like grammatical structure, do not require the
`speaker to speak clearly, and permit some nongrammaticality
`(including babble, mumble, and
`cough). Further, by paying
`careful attention to the task, many aspects of error detection
`and correction can be handled naturally, thus
`providing a
`graceful interaction with the user.
`The (restricted) dictation machine problem (Fig. 1, line 4)
`requires larger vocabularies (1000 to 10 000 words).
`It is
`assumed that the user would be willing to spell any word that
`is unknown to the system. The
`task requires an English-like
`syntax, but can assume a cooperative speaker speaking clearly
`in a quiet room.
`
`The unrestricted speech understanding problem requires un-
`limited vocabulary connected speech recognition, but permits
`the use of all the available task-specific information. The most
`difficult of all recognition tasks is the unrestricted connected
`speech recognition problem which requires unlimited vocabu-
`lary, but does not assume the availability of any task-specific
`information.
We do not have anything interesting to say about the last three tasks, except perhaps speculatively. In Section II, we will study the structure and performance of several systems of the first three types (Fig. 1), i.e., isolated word recognition systems, restricted connected speech recognition systems, and restricted speech understanding systems.
`In general, for a given system and task, performance depends
`on the size and speed of the computer and on the accuracy of
the algorithm used. Accuracy is often task dependent. (We shall see in Section II that a system which gives 99-percent
`accuracy on a 200-word vocabulary might give only 89-percent
`accuracy on a 36-word vocabulary.) Accuracy versus response
`time tradeoff is also possible, i.e., it is often possible to tune a
`system and adjust thresholds so as to improve the response
`time while reducing accuracy and vice versa.
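The accuracy versus response-time tradeoff can be made concrete with a small beam-search sketch: widening the beam (or loosening a pruning threshold) evaluates more hypotheses and therefore responds more slowly, but is less likely to discard the path that would eventually win. The word lattice, acoustic scores, and word-pair bonuses below are fabricated for illustration only.

```python
# Sketch of the accuracy / response-time tradeoff: a beam search over word
# hypotheses in which a wider beam does more work (slower response) but is
# less likely to prune the eventual best path. Lattice, scores, and word-pair
# bonuses are fabricated for illustration.
import heapq

LATTICE = [{"show": 0.6, "so": 0.5},     # per-position acoustic word scores
           {"it": 0.3, "be": 0.3},
           {"now": 0.4}]
PAIR_BONUS = {("so", "be"): 1.0, ("show", "it"): 0.1}   # toy syntactic bonuses

def extension_score(prev_word, word, acoustic):
    return acoustic + PAIR_BONUS.get((prev_word, word), 0.0)

def beam_search(beam_width):
    """Return the best word sequence found and the number of hypothesis
    evaluations performed (a stand-in for response time)."""
    paths, work = [([], 0.0)], 0
    for position in LATTICE:
        extended = []
        for words, score in paths:
            prev = words[-1] if words else None
            for word, acoustic in position.items():
                extended.append((words + [word],
                                 score + extension_score(prev, word, acoustic)))
                work += 1
        # Tightening the beam plays a role analogous to the thresholds mentioned
        # in the text: it speeds the system up but may discard the correct path.
        paths = heapq.nlargest(beam_width, extended, key=lambda p: p[1])
    return max(paths, key=lambda p: p[1]), work

if __name__ == "__main__":
    for width in (1, 2):
        (words, score), work = beam_search(width)
        print(f"beam {width}: {' '.join(words)} (score {score:.1f}, {work} evaluations)")
```

In this toy example the narrow beam answers after fewer evaluations but misses the higher-scoring sentence, which is exactly the tuning dilemma described above.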
`Sources of Knowledge: Many of us are aware that a native
`speaker uses, subconsciously, his knowledge of the language,
`the environment, and the context in understanding a sentence.
`These sources of knowledge (KS’s) include the characteristics
of speech sounds (phonetics), variability in pronunciations (phonology), the stress and intonation patterns of speech (prosodics), the sound patterns of words (lexicon), the grammatical structure of language (syntax), the meaning of words and sentences (semantics), and the context of conversation
`(pragmatics). Fig. 2 shows the many dimensions of variability
`of these KS’s; it is but a slight reorganization (to correspond to
`the sections of this paper) of a similar figure appearing in
`[ 1081.
`
1. Nature of performance
   Input: isolated words? connected speech?
   Response time: real time? close to real time? no hurry?
   Accuracy: error-free (>99.9%)? almost error-free (>99%)? occasional error (>95%)?

2. Source characteristics (acoustic knowledge)
   Acoustic analysis: spectrum? formants? zero crossings? LPC?
   Noise sources: air-conditioning noise? computer room? reverberation?
   Speaker characteristics: dialect? sex? age? cooperative?
   High quality microphone? telephone?

3. Language characteristics (phonetic knowledge)
   Features: voiced? energy? stress? intonation?
   Phones: number? distinguishability?
   Phonology: phone realization rules? junction rules?
   Word realization: insertion, deletion, and change rules? word hypothesis? word verification?

4. Problem characteristics (task-specific knowledge)
   Size of vocabulary: 10? 100? 1,000? 10,000?
   Confusability of vocabulary: high? what equivalent vocabulary?
   Syntactic support: artificial language? free English?
   Semantic and contextual support: constrained task? open semantics?

5. System characteristics
   Organization: strategy? representation?
   Interaction: graceful interaction with user? graceful error recovery?

Fig. 2. Factors affecting feasibility and performance of speech recognition systems. (Adapted from Newell et al. [108].)
`
To illustrate the effect of some of these KS's, consider the
`following sentences.
`1) Colorless paper packages crackle loudly.
`2) Colorless yellow ideas sleep furiously.
`3) Sleep roses dangerously young colorless.
`4) Ben burada ne yaptigimi bilmiyorum.
`The first sentence, though grammatical
`and meaningful,
`is
`pragmatically implausible. The second is syntactically correct
`but meaningless. The third is both syntactically and semanti-
`cally unacceptable. The fourth (a sentence in Turkish) is com-
pletely unintelligible to most of us. One would expect a
`listener to have more difficulty in recognizing a sentence if it
is inconsistent with one or more KS's. Miller and Isard [101]
`show that this is indeed the case.
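A minimal computational analogue of this observation treats each KS as an acceptance test over a sentence hypothesis; a hypothesis that violates one or more KS's is the one a listener (or a program) has trouble with. The toy lexicon, grammar, and selectional restrictions below are invented for this example and are far cruder than the knowledge an actual listener brings to bear.

```python
# Toy illustration: each knowledge source (KS) is an acceptance test over a
# word string. The lexicon, grammar, and semantic restrictions below are
# invented and deliberately tiny; they only mirror the four example sentences.
# (A pragmatic KS, which would flag sentence 1 as implausible, is omitted.)

LEXICON = {"colorless", "yellow", "young", "ideas", "paper", "packages",
           "roses", "sleep", "crackle", "furiously", "loudly", "dangerously"}
ADJ  = {"colorless", "yellow", "young"}
NOUN = {"ideas", "paper", "packages", "roses"}
VERB = {"sleep", "crackle"}
ADV  = {"furiously", "loudly", "dangerously"}

def lexical_ks(words):
    """Every word must be in the (toy) vocabulary."""
    return all(w in LEXICON for w in words)

def syntactic_ks(words):
    """Toy grammar: zero or more adjectives, one or more nouns, a verb, an adverb."""
    i = 0
    while i < len(words) and words[i] in ADJ:
        i += 1
    j = i
    while j < len(words) and words[j] in NOUN:
        j += 1
    return j > i and j + 2 == len(words) and words[j] in VERB and words[j + 1] in ADV

def semantic_ks(words):
    """Toy selectional restrictions standing in for meaning."""
    if "colorless" in words and "yellow" in words:
        return False               # contradictory color modifiers
    if "ideas" in words and "sleep" in words:
        return False               # abstract nouns do not sleep
    return True

KNOWLEDGE_SOURCES = [lexical_ks, syntactic_ks, semantic_ks]

def violated_sources(sentence):
    words = sentence.lower().split()
    return [ks.__name__ for ks in KNOWLEDGE_SOURCES if not ks(words)]

if __name__ == "__main__":
    for s in ["Colorless paper packages crackle loudly",
              "Colorless yellow ideas sleep furiously",
              "Sleep roses dangerously young colorless",
              "Ben burada ne yaptigimi bilmiyorum"]:
        bad = violated_sources(s)
        print(("consistent" if not bad else "violates " + ", ".join(bad)) + ": " + s)
```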
`If the knowledge is incomplete or inaccurate, people will
tend to make erroneous hypotheses. This can be illustrated by
`a simple experiment. Subjects were asked to listen to two sen-
`tences and write down what
`they heard. The sentences were
`“In mud eels are, in clay none are” and “In pine tar is, in oak
`none is.” The responses of four subjects are given below.
In mud eels are,     In clay none are
in muddies sar       in clay nanar
in my deals are      en clainanar
in my ders           en clain
in model sar         in claynanar

In pine tar is,      In oak none is
in pine tarrar       in oak es
in pyntar es         in oak nonnus
in pine tar is       in ocnonin
en pine tar is       in oak is
`The responses show that the listener forces his own interpre-
`tation of what he hears, and not necessarily what may have
`been intended by the speaker. Because the subjects do not
`have the contextual
`framework to expect the words “mud
`eels” together, they write more likely sounding combinations
`such as “my deals” or “models.” We find the same problem
`with words such as “oak none is.” Notice that they failed to
`detect where one word ends and another begins. It is not un-
`common for machine recognition systems to have similar
problems with word segmentation. To approach human performance, a machine must also use all the available KS's effectively.
`
Reddy and Newell [124] show that knowledge at various levels can be further decomposed into sublevels (Fig. 3) based on whether it is task-dependent a priori knowledge, conversation-dependent knowledge, speaker-dependent knowledge, or analysis-dependent knowledge. One can further decompose each of these sublevels into sets of rules relating to specific topics. Many of the present systems have only a small subset of all the KS's shown in Fig. 3. This is because much of this knowledge is yet to be identified and codified in ways that can be conveniently used in a speech understanding system. Sections III through V review the recent progress in representation and use of various sources of knowledge.
In Section III, we consider aspects of
`signal processing for
`speech recognition. There is a great deal of research and many
`publications in this area, but very few of them are addressed to
`questions that arise in building speech recognition systems. It
`is not uncommon for a
`speech recognition system to show a
`catastrophic drop in performance when the microphone
`is
`changed or moved to a slightly noisy room. Many parametric
`representations of speech have been proposed but there are
few comparative studies. In Section III, we shall review the techniques that are presently used in speech signal analysis and noise normalization, and examine their limitations.
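To make the notion of a parametric representation concrete, the sketch below computes two classic frame-level parameters, short-time energy and zero-crossing rate, with a crude noise normalization that subtracts a background level estimated from the opening frames. The frame length, the use of the first frames as noise, and the synthetic signal are assumptions made here for illustration.

```python
# Sketch of a simple frame-level parametric representation: short-time energy
# and zero-crossing rate, with a crude noise normalization that subtracts a
# background level estimated from the opening frames. The frame length, the
# use of the first frames as "noise", and the synthetic signal are assumptions.
import math

def split_frames(samples, frame_len=160):
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def energy(frame):
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / len(frame)

def parameterize(samples, frame_len=160, noise_frames=5):
    frames = split_frames(samples, frame_len)
    head = frames[:noise_frames]
    background = sum(energy(f) for f in head) / max(1, len(head))
    # Subtracting the estimated background is a very crude noise normalization.
    return [(max(0.0, energy(f) - background), zero_crossing_rate(f)) for f in frames]

if __name__ == "__main__":
    rate = 8000
    hiss = [0.01 * math.sin(2 * math.pi * 3000 * n / rate) for n in range(800)]
    vowel_like = [0.4 * math.sin(2 * math.pi * 300 * n / rate) for n in range(800)]
    for e, z in parameterize(hiss + vowel_like):
        print(round(e, 4), round(z, 3))   # low-energy/high-ZCR, then high-energy/low-ZCR
```

Even this simple normalization hints at why a change of microphone or room can upset a recognizer: the parameters themselves shift with the recording conditions.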
`There are several KS’s which are common to most connected
`speech recognition systems and independent
`of the task.
`These can be broadly grouped together as task-independent as-
`pects of a speech recognition system. Topics such as feature
`extraction, phonetic labeling, phonological rules, (bottom-up)
`word hypothesis, and word verification fall into this category.
`In Section IV, we will review the techniques used and
`the
`present state of accomplishment in these areas.
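As one hedged illustration of bottom-up word hypothesization and verification, the sketch below proposes words from a small phonetic lexicon wherever their phone spellings approximately match a stretch of errorful phone labels; the lexicon, the phone labels, and the matching threshold are invented for this example rather than drawn from any system reviewed here.

```python
# Sketch of bottom-up word hypothesization: words from a small phonetic
# lexicon are proposed wherever their phone spelling approximately matches a
# stretch of the (errorful) phone labels produced by an acoustic front end.
# The lexicon, the phone labels, and the score threshold are toy choices.
from difflib import SequenceMatcher

PHONETIC_LEXICON = {            # hypothetical word -> phone-string spelling
    "speech": "s p iy ch",
    "peach":  "p iy ch",
    "each":   "iy ch",
    "teach":  "t iy ch",
}

def hypothesize_words(phone_labels, threshold=0.7):
    """Return (word, score, start, end) for lexicon entries whose spelling
    roughly matches some contiguous span of the labelled phones."""
    labels = phone_labels.split()
    hypotheses = []
    for word, spelling in PHONETIC_LEXICON.items():
        target = spelling.split()
        for start in range(len(labels)):
            # Consider spans a little shorter or longer than the spelling,
            # since the front end may insert or delete phones.
            for end in range(start + 1, min(len(labels), start + len(target) + 2) + 1):
                score = SequenceMatcher(None, target, labels[start:end]).ratio()
                if score >= threshold:
                    hypotheses.append((word, round(score, 2), start, end))
    return sorted(hypotheses, key=lambda h: -h[1])

if __name__ == "__main__":
    # An errorful labelling of "teach speech": one 'ch' missed, one 's' doubled.
    for hyp in hypothesize_words("t iy s s p iy ch"):
        print(hyp)
```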
`Given a task that is to be performed using a speech recogni-
`tion system, one is usually able to specify the vocabulary, the
`grammatical structure of sentences, and the semantic and con-
`textual constraints provided by the task. In Section V, we will
`discuss the nature, representation, and use of
`these KS’s in a
`recognition (or understanding) system.
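A small sketch of what such a task specification can look like: a finite-state grammar over a task vocabulary both enumerates the legal sentences and serves as an acceptor during recognition. The retrieval-style task, states, and words below are hypothetical, not taken from any system reviewed in this paper.

```python
# Sketch of a task specification: a tiny finite-state grammar over a task
# vocabulary both enumerates the legal sentences and acts as an acceptor
# during recognition. The task, states, and words are hypothetical.

GRAMMAR = {                     # state -> list of (word, next_state)
    "S":    [("show", "OBJ"), ("delete", "OBJ")],
    "OBJ":  [("the", "NOUN")],
    "NOUN": [("abstract", "END"), ("report", "END"), ("figure", "END")],
}

def sentences(state="S", prefix=()):
    """All word strings accepted by the grammar, by depth-first expansion."""
    if state == "END":
        yield " ".join(prefix)
        return
    for word, nxt in GRAMMAR[state]:
        yield from sentences(nxt, prefix + (word,))

def accepts(words):
    """True if the word string is a legal sentence of the task grammar."""
    state = "S"
    for w in words:
        for word, nxt in GRAMMAR.get(state, []):
            if word == w:
                state = nxt
                break
        else:
            return False
    return state == "END"

if __name__ == "__main__":
    print(list(sentences()))                      # the six legal sentences
    print(accepts("show the report".split()))     # True
    print(accepts("report the show".split()))     # False
```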
`Control Structure and System Organization: How is a given
source of knowledge used in recognition? The Shannon [140] experiment gives a clue. In this experiment, human subjects demonstrate their ability to predict (and correct) what will appear next, given a portion of a sentence.
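A rough computational analogue of the Shannon guessing game is sketched below: bigram counts over a small body of task sentences rank the words likely to appear next. The sample text is invented; an actual system would use task-specific language statistics or a grammar.

```python
# A rough computational analogue of the Shannon guessing game: bigram counts
# over a small body of task sentences rank the words likely to appear next.
# The sample text is invented for illustration.
from collections import Counter, defaultdict

TEXT = ("show the status of the line "
        "show the map of the area "
        "list the status of the line").split()

bigrams = defaultdict(Counter)
for prev, word in zip(TEXT, TEXT[1:]):
    bigrams[prev][word] += 1            # how often `word` follows `prev`

def predict_next(prev_word, k=3):
    """The k words most often seen after prev_word, with relative frequencies."""
    counts = bigrams[prev_word]
    total = sum(counts.values()) or 1
    return [(w, round(c / total, 2)) for w, c in counts.most_common(k)]

if __name__ == "__main__":
    print(predict_next("the"))    # e.g. [('status', 0.33), ('line', 0.33), ('map', 0.17)]
    print(predict_next("show"))   # [('the', 1.0)]
```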
`Just as in the above experiment, many recognition systems
`use the KS’s to generate hypotheses about what word might
`
appear in a given context, or to reject a guess. When one of these systems makes errors, it is usually because the present state of its knowledge is incomplete and possibly inaccurate. In Section VI, we shall review aspects of system organization such as control strategies, error handling, real-time system design, and knowledge acquisition.

Fig. 3. Sources of knowledge (KS). (From Reddy and Newell [124].) The figure decomposes pragmatic and semantic, syntactic, lexical, phonemic and phonetic, and parametric (acoustic) knowledge into task-dependent, conversation-dependent, speaker-dependent, and analysis-dependent sublevels.

B. The Uses of Speech Recognition
Until recently there has been little experience in the use of speech recognition systems in real applications. Most of the systems developed in the 1960's were laboratory systems, which were expensive and had an unacceptable error rate for real life situations. Recently, however, there have been commercially available systems for isolated word recognition, costing from $10 000 to $100 000, with less than 1-percent error rate in noisy environments. The paper by Martin in this issue illustrates a variety of applications where these systems have been found to be useful and cost-effective.
As long as speech recognition systems continue to cost around $10 000 to $100 000, the range of applications for which they will be used will be limited. As the research under way at present comes to fruition over the next few years, and as connected speech recognition systems costing under $10 000 begin to become available, one can expect a significant increase in the number of applications. Fig. 4, adapted from Newell et al. [109], summarizes and extends the views expressed by several authors earlier [63], [78], [87], and [89] on the desirability and usefulness of speech; it provides a list of task situation characteristics that are likely to benefit from speech input. Beek et al. [17] provide an assessment of the potential military applications of automatic speech recognition.

(1) Speed of Communication: Speech is about 4 times faster than standard manual input for continuous text.
(2) Total System Response Time: Direct data entry from remote source, which avoids relayed entry via intermediate human transducers, speeds up communication substantially.
(3) Total System Reliability: Direct data entry from remote source with immediate feedback, avoiding relayed entry via intermediate human transducers, increases reliability substantially.
(4) Parallel Channel: Provides an independent communication channel in hands-busy operational situations.
(5) Freedom of Movement: Within small physical regions speech can be used while moving about freely doing a task.
(6) Untrained Users: No training in basic physical skill required for use (as opposed to acquisition of typing or keying skills); speech is natural for users at all general skill levels (clerical to executive).
(7) Unplanned Communication: Speech is to be used immediately by users to communicate unplanned information, in a way not true of manual input.
(8) Identification of Speaker: Speakers are recognizable by their voice characteristics.
(9) Long Term Reliability: Performance of speech reception and processing tasks which require monotonous vigilant operation can be done more reliably by computer than by humans.
(10) Low Cost Operation: Speech can provide cost savings where it eliminates substantial numbers of people.

Fig. 4. Task demands providing comparative advantages for speech. (From Newell et al. [109].)

As computers get cheaper and more powerful, it is estimated that 60-80 percent of the cost of running a business computer installation will be spent on data collection, preparation, and entry (unpublished proprietary studies; should be considered
`speculative for the present). Given speech recognition systems
`that are flexible enough to change speakers or task definitions
`with a few days of effort, speech will begin to be used as an
`alternate medium of input to computers. Speech is likely to
`be used not so much for program entry, but rather primarily
in data entry situations [33]. This increased usage should in
`turn lead to increased versatility and reduced cost in speech
`input systems.
`There was some earlier skepticism as to whether speech in-
`put was necessary or even desirable as an input medium for
computers [116]. The present attitude among the researchers
`in the field appears to be just the opposite, i.e., if speech input
`systems of reasonable cost and reliability were available, they
`would be the preferred mode of communication even though
the relative cost is higher than other types of input [109].
`Recent human factors studies in cooperative problem
`solving
[23], [110] seem to support the view that speech is the pre-
`ferred mode of communication. If it is indeed preferred, it
`seems safe to assume that the user would be willing
`to pay
`somewhat higher prices to be able to talk to computers. This
`prospect of being able to talk to computers is what drives the
`field, not just the development of a few systems for highly spe-
cialized applications.
`
II. SYSTEMS
`This section provides an overview of the structure of differ-
`en