PROCEEDINGS OF THE IEEE, VOL. 64, NO. 4, APRIL 1976
Speech Recognition by Machine: A Review

D. RAJ REDDY

Abstract—This paper provides a review of recent developments in speech recognition research. The concept of sources of knowledge is introduced, and the use of knowledge to generate and verify hypotheses is discussed. The difficulties that arise in the construction of different types of speech recognition systems are illustrated, and the structure and performance of several such systems are presented. Aspects of component subsystems at the acoustic, phonetic, syntactic, and semantic levels are presented. System organizations that are required for effective interaction and use of various component subsystems in the presence of error and ambiguity are discussed.

Manuscript received September 1, 1975; revised November 19, 1975. This work was supported in part by the Advanced Research Projects Agency and in part by the John Simon Guggenheim Memorial Foundation.
The author is with the Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA 15213.

I. INTRODUCTION

THE OBJECT of this paper is to review recent developments in speech recognition. The Advanced Research Projects Agency's support of speech understanding research has led to a significantly increased level of activity in this area since 1971. Several connected speech recognition systems have been developed and demonstrated. The role and use of knowledge such as acoustic-phonetics, syntax, semantics, and context are more clearly understood. Computer programs for speech recognition seem to deal with ambiguity, error, and nongrammaticality of input in a graceful and effective manner that is uncommon to most other computer programs. Yet there is still a long way to go. We can handle relatively restricted task domains requiring simple grammatical structure and a few hundred words of vocabulary for single trained speakers in controlled environments, but we are very far from being able to handle relatively unrestricted dialogs from a large population of speakers in uncontrolled environments. Many more years of intensive research seem necessary to achieve such a goal.

Sources of Information: The primary sources of information in this area are the IEEE Transactions on Acoustics, Speech, and Signal Processing (pertinent special issues: vol. 21, June 1973; vol. 23, Feb. 1975) and the Journal of the Acoustical Society of America (in particular, the Semiannual Conference Abstracts which appear with the January and July issues each year; recently they have been appearing as spring and fall supplements). Other relevant journals are the IEEE Transactions (Computer; Information Theory; and Systems, Man, and
`Comcast - Exhibit 1004, page 501
Cybernetics), Communications of the ACM, International Journal of Man-Machine Studies, Artificial Intelligence, and Pattern Recognition.

The books by Flanagan [44], Fant [40], and Lehiste [84] provide extensive coverage of speech, acoustics, and phonetics, and form the necessary background for speech recognition research. Collections of papers, in the books edited by David and Denes [25], Lehiste [83], Reddy [121], and Wathen-Dunn [158], and in conference proceedings edited by Erman [34] and Fant [41], provide a rich source of relevant material. The articles by Lindgren [88], Hyde [66], Fant [39], Zagoruiko [171], Derkach [27], Hill [63], and Otten [113] cover the research progress in speech recognition prior to 1970 and proposals for the future. The papers by Klatt [74] and Wolf [163] provide other points of view of recent advances. Other useful sources of information are research reports published by the various research groups active in this area (and can be obtained by writing to one of the principal researchers given in parentheses): Bell Telephone Laboratories (Denes, Flanagan, Fujimura, Rabiner); Bolt Beranek and Newman, Inc. (Makhoul, Wolf, Woods); Carnegie-Mellon University (Erman, Newell, Reddy); Department of Speech Communication, KTH, Stockholm (Fant); Haskins Laboratories (Cooper, Mermelstein); IBM Research Laboratories (Bahl, Dixon, Jelinek); M.I.T. Lincoln Laboratories (Forgie, Weinstein); Research Laboratory of Electronics, M.I.T. (Klatt); Stanford Research Institute (Walker); Speech Communication Research Laboratory (Broad, Markel, Shoup); System Development Corporation (Barnett, Ritea); Sperry Univac (Lea, Medress); University of California, Berkeley (O'Malley); Xerox Palo Alto Research Center (White); and Threshold Technology (Martin). In addition there are several groups in Japan and Europe who publish reports in national languages and English. Complete addresses for most of these groups can be obtained by referring to author addresses in the IEEE Trans. Acoust., Speech, Signal Processing, June 1973 and Feb. 1975. For background and introductory information on various aspects of speech recognition we recommend the tutorial-review papers on "Speech understanding systems" by Newell, "Parametric representations of speech" by Schafer and Rabiner, "Linear prediction in automatic speech recognition" by Makhoul, "Concepts for acoustic-phonetic recognition" by Broad and Shoup, "Syntax, semantics and speech" by Woods, and "System organization for speech understanding" by Reddy and Erman, all appearing in Speech Recognition: Invited Papers of the IEEE Symposium [121].

Scope of the Paper: This paper is intended as a review and not as an exhaustive survey of all research in speech recognition. It is hoped that, upon reading this paper, the reader will know what a speech recognition system consists of, what makes speech recognition a difficult problem, and what aspects of the problem remain unsolved. To this end we will study the structure and performance of some typical systems, the component subsystems that are needed, and the system organization that permits effective interaction and use of the components. We do not attempt to give detailed descriptions of systems or mathematical formulations, as these are available in the published literature. Rather, we will mainly present distinctive and novel features of selected systems and their relative advantages.

Many of the comments of an editorial nature that appear in this paper represent one point of view and are not necessarily shared by all the researchers in the field. Two other papers appearing in this issue, Jelinek's on statistical approaches and Martin's on applications, augment and complement this paper. Papers by Flanagan and others, also appearing in this issue, look at the total problem of man-machine communication by voice.
A. The Nature of the Speech Recognition Problem

The main goal of this area of research is to develop techniques and systems for speech input to machines. In earlier attempts, it was hoped that learning how to build simple recognition systems would lead in a natural way to more sophisticated systems. Systems were built in the 1950's for vowel recognition and digit recognition, producing creditable performance. But these techniques and results could not be extended and extrapolated toward larger and more sophisticated systems. This has led to the appreciation that linguistic and contextual cues must be brought to bear on the recognition strategy if we are to achieve significant progress. The many dimensions that affect the feasibility and performance of a speech recognition system are clearly stated in Newell [108].

Fig. 1 characterizes several different types of speech recognition systems ordered according to their intrinsic difficulty. There are already several commercially available isolated word recognition systems today. A few research systems have been developed for restricted connected speech recognition and speech understanding. There is hope among some researchers that, in the not too distant future, we may be able to develop interactive systems for taking dictation using a restricted vocabulary. Unlimited vocabulary speech understanding and connected speech recognition systems seem feasible to some, but are likely to require many years of directed research.

The main feature that is used to characterize the complexity of a speech recognition task is whether the speech is connected or is spoken one word at a time. In connected speech, it is difficult to determine where one word ends and another begins, and the characteristic acoustic patterns of words exhibit much greater variability depending on the context. Isolated word recognition systems do not have these problems since words are separated by pauses.
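The pauses that make isolated-word input tractable can be located with a simple short-time energy test. The following sketch is our own illustration of that idea; the frame length, threshold, and signal are invented assumptions, not parameters of any system reviewed here.

```python
# Sketch: energy-based endpoint detection for isolated-word input.
# Frame length and threshold are illustrative assumptions.

def frame_energies(samples, frame_len=160):
    """Short-time energy of consecutive non-overlapping frames."""
    return [
        sum(s * s for s in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def word_regions(samples, frame_len=160, threshold=0.1):
    """Return (start_frame, end_frame) pairs where energy exceeds threshold."""
    energies = frame_energies(samples, frame_len)
    regions, start = [], None
    for i, e in enumerate(energies):
        if e > threshold and start is None:
            start = i                      # word onset
        elif e <= threshold and start is not None:
            regions.append((start, i))     # word offset at a pause
            start = None
    if start is not None:
        regions.append((start, len(energies)))
    return regions
```

A signal consisting of silence, a burst, and silence again yields a single word region; connected speech, with no reliable interword silences, is precisely where this simple device fails.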
The second feature that affects the complexity of a system is the vocabulary size. As the size or the confusability of a vocabulary increases, simple brute-force methods of representation and matching become too expensive and unacceptable. Techniques for compact representation of acoustic patterns of words, and techniques for reducing search by constraining the number of possible words that can occur at a given point, assume added importance.
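How constraining the words possible at a given point reduces search can be seen with a toy finite-state task grammar; the grammar and vocabulary below are invented purely for illustration.

```python
# Sketch: a task grammar limits which words must be matched at each
# point in the utterance. The tiny command grammar is invented.

GRAMMAR = {                      # state -> {word: next_state}
    "start": {"show": "verb", "delete": "verb"},
    "verb":  {"file": "noun", "message": "noun"},
    "noun":  {},                 # accepting state
}

def candidate_words(state):
    """Words the grammar allows next; only these need acoustic matching."""
    return sorted(GRAMMAR[state])

def parse(words):
    """Accept a word sequence iff the grammar allows it."""
    state = "start"
    for w in words:
        if w not in GRAMMAR[state]:
            return False
        state = GRAMMAR[state][w]
    return state == "noun"
```

At any point only two of the four vocabulary words are candidates, so the matcher's work is halved; with realistic vocabularies the savings grow accordingly.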
Just as vocabulary is restricted to make a speech recognition problem more tractable, there are several other aspects of the problem which can be used to constrain the speech recognition task so that what might otherwise be an unsolvable problem becomes solvable. The rest of the features in Fig. 1, i.e., task-specific knowledge, language of communication, number and cooperativeness of speakers, and quietness of environment, represent some of the commonly used constraints in speech recognition systems.

One way to reduce the problems of error and ambiguity resulting from the use of connected speech and large vocabularies is to use all the available task-specific information to reduce search. The restricted speech understanding systems (Fig. 1, line 3) assume that the speech signal does not have all the necessary information to uniquely decode the message and
REDDY: SPEECH RECOGNITION BY MACHINE

503
Type of system | Mode of speech | Vocabulary size | Task-specific information | Language | Speaker | Environment
Word recognition-isolated (IWR) | isolated words | 10-300 | limited use | - | cooperative | -
Connected speech recognition-restricted (CSR) | connected speech | 30-500 | limited use | restricted command language | cooperative | quiet room
Speech understanding-restricted (SU) | connected speech | 100-2000 | full use | English-like | cooperative | quiet room
Dictation machine-restricted (DM) | connected speech | 1000-10000 | limited use | English-like | cooperative | quiet room
Unrestricted speech understanding (USU) | connected speech | unlimited | full use | English | not uncooperative | -
Unrestricted connected speech recognition (UCSR) | connected speech | unlimited | none | English | not uncooperative | -

Fig. 1. Different types of speech recognition systems ordered according to their intrinsic difficulty, and the dimensions along which they are usually constrained. Vocabulary sizes given are for some typical systems and can vary from system to system. It is assumed that a cooperative speaker would speak clearly and would be willing to repeat or spell a word. A not uncooperative speaker does not try to confuse the system but does not want to go out of his way to help it either. In particular, the system would have to handle "uhms" and "ahs" and other speech-like noise. The "-" indicates an "unspecified" entry, variable from system to system.
that, to be successful, one must use all the available sources of knowledge to infer (or deduce) the intent of the message [107]. The performance criterion is somewhat relaxed in that, as long as the message is understood, it is not important to recognize each and every phoneme and/or word correctly. The requirement of using all the sources of knowledge, and the representation of the task, conversational context, understanding, and response generation, all add to the difficulty and overall complexity of speech understanding systems.

The restricted connected speech recognition systems (Fig. 1, line 2) keep their program structure simple by using only some task-specific knowledge, such as a restricted vocabulary and syntax, and by requiring that the speaker speak clearly and use a quiet room. The simpler program structure of these systems provides an economical solution in a restricted class of connected speech recognition tasks. Further, by not being task-specific, they can be used in a wider variety of applications without modification.

The restricted speech understanding systems have the advantage that, by making effective use of all the available knowledge, including semantics, conversational context, and speaker preferences, they can provide a more flexible and hopefully higher performance system. For example, they usually permit an English-like grammatical structure, do not require the speaker to speak clearly, and permit some nongrammaticality (including babble, mumble, and cough). Further, by paying careful attention to the task, many aspects of error detection and correction can be handled naturally, thus providing a graceful interaction with the user.

The (restricted) dictation machine problem (Fig. 1, line 4) requires larger vocabularies (1000 to 10 000 words). It is assumed that the user would be willing to spell any word that is unknown to the system. The task requires an English-like syntax, but can assume a cooperative speaker speaking clearly in a quiet room.

The unrestricted speech understanding problem requires unlimited vocabulary connected speech recognition, but permits the use of all the available task-specific information. The most difficult of all recognition tasks is the unrestricted connected speech recognition problem, which requires unlimited vocabulary but does not assume the availability of any task-specific information.

We do not have anything interesting to say about the last three tasks, except perhaps speculatively. In Section II, we will study the structure and performance of several systems of the first three types (Fig. 1), i.e., isolated word recognition systems, restricted connected speech recognition systems, and restricted speech understanding systems.

In general, for a given system and task, performance depends on the size and speed of the computer and on the accuracy of the algorithm used. Accuracy is often task dependent. (We shall see in Section II that a system which gives 99-percent accuracy on a 200-word vocabulary might give only 89-percent accuracy on a 36-word vocabulary.) Accuracy versus response time tradeoff is also possible, i.e., it is often possible to tune a system and adjust thresholds so as to improve the response time while reducing accuracy, and vice versa.
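The threshold tradeoff just mentioned can be made concrete with a beam-pruned search: a narrower beam examines fewer hypotheses (faster response) but may discard the prefix of the best-scoring sentence (lower accuracy). The word layers and scores below are an invented toy, not data from any reviewed system.

```python
# Sketch: accuracy vs. response time as a beam-width tradeoff.
# Words and bigram scores are invented illustrative data.

SCORES = {("<s>", "a"): 2, ("<s>", "b"): 1, ("a", "c"): 0, ("b", "c"): 5}

def score(prev, word):
    """Bigram score; path-dependent, so early pruning can be wrong."""
    return SCORES.get((prev, word), 0)

def beam_search(layers, beam_width):
    """Keep only the best beam_width partial paths after each layer.

    Returns (best_word_sequence, its_score, hypotheses_examined).
    """
    paths = [(["<s>"], 0)]
    examined = 0
    for layer in layers:
        expanded = []
        for words, total in paths:
            for word in layer:
                examined += 1
                expanded.append((words + [word], total + score(words[-1], word)))
        expanded.sort(key=lambda p: -p[1])
        paths = expanded[:beam_width]
    words, total = paths[0]
    return words[1:], total, examined
```

With layers [["a", "b"], ["c"]], a beam of 2 finds the best path b, c (score 6) after examining four hypotheses; a beam of 1 commits to a early and settles for a, c (score 2) after only three. Widening the beam buys accuracy at the price of response time.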
Sources of Knowledge: Many of us are aware that a native speaker uses, subconsciously, his knowledge of the language, the environment, and the context in understanding a sentence. These sources of knowledge (KS's) include the characteristics of speech sounds (phonetics), variability in pronunciations (phonology), the stress and intonation patterns of speech (prosodics), the sound patterns of words (lexicon), the grammatical structure of language (syntax), the meaning of words and sentences (semantics), and the context of conversation (pragmatics). Fig. 2 shows the many dimensions of variability of these KS's; it is but a slight reorganization (to correspond to the sections of this paper) of a similar figure appearing in [108].
`
`
`
`
`
`
` input 1.solated words? 1. Performance connected speech?
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`of
`
`Nature
`
`
`Response time Real time?
`
`
`Accuracy
`
`no hurry?
` close to real-time?
`
`
`
`
`Error-free (>?9.9%)? almost error-free
`(>99$)?
`occasional error
`( > ~ s c $ ) ?
`
`Airconditioning noise? computer room? reverberation:
`Dialect? sex? age? cooperative?
`H i g h quality microphone? telephone?
`Spectrum? formants? rerocrossings?
`
`LPc?
`
`Voiced? energy? stress? intonation?
`Number? distinguishability?
`Phone realization rules? junction rules?
`Insertion, deletion and change rules?
`Brd hypothesis? word v e r i f i c a t i o n ?
`
`2. Source characteristics
`(acoustic knowledge)
`
`Acoustic analysis
`b i s e sources
`speaker characteristics
`
`
`3. Language characteristics Features
`(phonetic knowledge)
`
`Phones
`Phonology
`Word realization
`
`Size of vocabulary
`
`4 . Problem Characteristics
`(task specific
`knowledge) Confusability
`
`
`
`
`
`
`
`
`Syntactic support Artificial lanpage? free English?
`Semantic and contextual support constrained task?
`
`of vocabulary
`
`l o ? l o o ? 1,000? 10,000?
`High? what equivalent vocabulary?
`
`open semantics?
`
`
`
`5 . System characteristics organization
`
`Interaction
`
`Strategy? representation?
`Graceful interaction with user? gracefu? error
`recovery?
`
`Fig. 2. Factors affecting feasibility and performance of speech recognition systems. (Adapted from Newell etal. [ 1081 .)
`
To illustrate the effect of some of these KS's, consider the following sentences.

1) Colorless paper packages crackle loudly.
2) Colorless yellow ideas sleep furiously.
3) Sleep roses dangerously young colorless.
4) Ben burada ne yaptigimi bilmiyorum.

The first sentence, though grammatical and meaningful, is pragmatically implausible. The second is syntactically correct but meaningless. The third is both syntactically and semantically unacceptable. The fourth (a sentence in Turkish) is completely unintelligible to most of us. One would expect a listener to have more difficulty in recognizing a sentence if it is inconsistent with one or more KS's. Miller and Isard [101] show that this is indeed the case.

If the knowledge is incomplete or inaccurate, people will tend to make erroneous hypotheses. This can be illustrated by a simple experiment. Subjects were asked to listen to two sentences and write down what they heard. The sentences were "In mud eels are, in clay none are" and "In pine tar is, in oak none is." The responses of four subjects are given below.

In mud eels are,      In clay none are
  in muddies sar        in clay nanar
  in my deals are       en clainanar
  in my ders            en clain
  in model sar          in claynanar

In pine tar is,       In oak none is
  in pine tarrar        in oak?es
  in pyntar es          in oak nonnus
  in pine tar is        in ocnonin
  en pine tar is        in oak is

The responses show that the listener forces his own interpretation of what he hears, and not necessarily what may have been intended by the speaker. Because the subjects do not have the contextual framework to expect the words "mud eels" together, they write more likely sounding combinations such as "my deals" or "models." We find the same problem with words such as "oak none is." Notice that they failed to detect where one word ends and another begins. It is not uncommon for machine recognition systems to have similar problems with word segmentation. To approach human performance, a machine must also use all the available KS's effectively.
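The segmentation failures above ("mud eels" heard as "my deals") can be mimicked mechanically: given an unsegmented stream and a lexicon, several word sequences may fit it equally well. The lexicon in this sketch is invented for illustration.

```python
# Sketch: enumerating word segmentations of an unsegmented string,
# to show why word boundaries are ambiguous. Toy lexicon.

LEXICON = {"in", "mud", "eels", "are", "no", "where", "now", "here"}

def segmentations(s, lexicon=LEXICON):
    """Return every way to split s into a sequence of lexicon words."""
    if not s:
        return [[]]  # one way to segment the empty string
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in lexicon:
            for rest in segmentations(s[i:], lexicon):
                results.append([prefix] + rest)
    return results
```

Here "nowhere" admits both "no where" and "now here"; a listener, like a recognizer, must use higher level KS's to choose between such parses rather than enumerate them exhaustively, since the number of segmentations grows rapidly with utterance length.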
Reddy and Newell [124] show that knowledge at various levels can be further decomposed into sublevels (Fig. 3), based on whether it is task-dependent a priori knowledge, conversation-dependent knowledge, speaker-dependent knowledge, or analysis-dependent knowledge. One can further decompose each of these sublevels into sets of rules relating to specific topics. Many of the present systems have only a small subset of all the KS's shown in Fig. 3. This is because much of this knowledge is yet to be identified and codified in ways that can be conveniently used in a speech understanding system. Sections III through V review the recent progress in representation and use of various sources of knowledge.
In Section III, we consider aspects of signal processing for speech recognition. There is a great deal of research and many publications in this area, but very few of them are addressed to questions that arise in building speech recognition systems. It is not uncommon for a speech recognition system to show a catastrophic drop in performance when the microphone is changed or moved to a slightly noisy room. Many parametric representations of speech have been proposed, but there are few comparative studies. In Section III, we shall review the techniques that are presently used in speech signal analysis and noise normalization, and examine their limitations.
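Two of the cheapest frame-level parameters in this literature are short-time energy and zero-crossing counts; the minimal sketch below is our own illustration, with an arbitrary frame length.

```python
# Sketch: two classic frame-level parameters, short-time energy and
# zero-crossing count. Frame length is an illustrative choice.

def zero_crossings(frame):
    """Number of sign changes between adjacent samples."""
    return sum(
        1 for a, b in zip(frame, frame[1:])
        if (a >= 0) != (b >= 0)
    )

def energy(frame):
    """Sum of squared samples in the frame."""
    return sum(s * s for s in frame)

def parameterize(samples, frame_len=160):
    """(energy, zero-crossing count) per non-overlapping frame."""
    return [
        (energy(samples[i:i + frame_len]),
         zero_crossings(samples[i:i + frame_len]))
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
```

Crude as they are, high energy with few crossings suggests voiced speech and low energy with many crossings suggests frication or noise, which is exactly why such parameters are sensitive to microphone and room changes.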
There are several KS's which are common to most connected speech recognition systems and independent of the task. These can be broadly grouped together as task-independent aspects of a speech recognition system. Topics such as feature extraction, phonetic labeling, phonological rules, (bottom-up) word hypothesis, and word verification fall into this category. In Section IV, we will review the techniques used and the present state of accomplishment in these areas.

Given a task that is to be performed using a speech recognition system, one is usually able to specify the vocabulary, the grammatical structure of sentences, and the semantic and contextual constraints provided by the task. In Section V, we will discuss the nature, representation, and use of these KS's in a recognition (or understanding) system.

Control Structure and System Organization: How is a given source of knowledge used in recognition? The Shannon [140] experiment gives a clue. In this experiment, human subjects demonstrate their ability to predict (and correct) what will appear next, given a portion of a sentence.
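The guessing behavior in Shannon's experiment can be caricatured with bigram counts over a corpus: given the preceding word, predict its most frequent successor. The three-sentence corpus is an invented toy, not material from the paper.

```python
# Sketch: predicting the next word from bigram counts, in the spirit
# of the Shannon guessing experiment. Corpus is an invented toy.
from collections import Counter, defaultdict

def bigram_counts(corpus):
    """Map each word to a Counter of the words that followed it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

def predict_next(counts, word):
    """Most frequent successor of word, or None if unseen."""
    if word not in counts or not counts[word]:
        return None
    return counts[word].most_common(1)[0][0]

corpus = [
    "show the file",
    "show the message",
    "delete the file",
]
counts = bigram_counts(corpus)
```

Such a predictor both proposes likely words and flags unlikely ones, which is the dual role, hypothesization and rejection, described in the text.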
Just as in the above experiment, many recognition systems use the KS's to generate hypotheses about what word might
Task-dependent a priori knowledge:
  Semantic: pragmatic and a priori semantic knowledge about the task domain.
  Syntactic: grammar for the language.
  Lexical: size and confusability of the vocabulary.
  Phonemic and phonetic: characteristics of phones and phonemes of the language; contextual variability in phonemic characteristics.
  Acoustic: variations resulting from the size and shape of the vocal tract.

Conversation-dependent knowledge:
  Semantic: concept subselection based on conversation.
  Syntactic: grammar subselection based on topic.
  Lexical: vocabulary subselection based on topic.
  Phonemic and phonetic: phonemic subselection based on previous knowledge.

Speaker-dependent knowledge:
  Semantic: psychological model of the user.
  Syntactic: grammar subselection based on speaker.
  Lexical: vocabulary selection and ordering based on speaker preference.
  Phonemic and phonetic: dialectal variations.
  Acoustic: acoustic characteristics of the speaker.

Analysis-dependent knowledge:
  Semantic: concept subselection based on partial sentence recognition.
  Syntactic: grammar subselection based on partial phrase recognition.
  Lexical: vocabulary subselection based on segmental features.
  Phonemic and phonetic: phonemic subselection based on segmental features.
  Acoustic: parameter tracking.

Fig. 3. Sources of knowledge (KS). (From Reddy and Newell [124].)
appear in a given context, or to reject a guess. When one of these systems makes errors, it is usually because the present state of its knowledge is incomplete and possibly inaccurate. In Section VI, we shall review aspects of system organization such as control strategies, error handling, real-time system design, and knowledge acquisition.

B. The Uses of Speech Recognition

Until recently there has been little experience in the use of speech recognition systems in real applications. Most of the systems developed in the 1960's were laboratory systems, which were expensive and had an unacceptable error rate for real life situations. Recently, however, there have been commercially available systems for isolated word recognition, costing from $10 000 to $100 000, with less than 1-percent error rate in noisy environments. The paper by Martin in this issue illustrates a variety of applications where these systems have been found to be useful and cost-effective.

As long as speech recognition systems continue to cost around $10 000 to $100 000, the range of applications for which they will be used will be limited. As the research under way at present comes to fruition over the next few years, and as connected speech recognition systems costing under $10 000 begin to become available, one can expect a significant increase in the number of applications. Fig. 4, adapted from Newell et al. [109], summarizes and extends the views expressed by several authors earlier [63], [78], [87], and [89] on the desirability and usefulness of speech; it provides a list of task situation characteristics that are likely to benefit from speech input. Beek et al. [17] provide an assessment of the potential military applications of automatic speech recognition.

(1) Speed of communication: Speech is about 4 times faster than standard manual input for continuous text.
(2) Total system response time: Direct data entry from a remote source, which avoids relayed entry via intermediate human transducers, speeds up communication substantially.
(3) Total system reliability: Direct data entry from a remote source with immediate feedback, avoiding relayed entry via intermediate human transducers, increases reliability substantially.
(4) Parallel channel: Provides an independent communication channel in hands-busy operational situations.
(5) Freedom of movement: Within small physical regions speech can be used while moving about freely doing a task.
(6) Untrained users: No training in basic physical skill is required for use (as opposed to acquisition of typing or keying skills); speech is natural for users at all general skill levels (clerical to executive).
(7) Unplanned communication: Speech is to be used immediately by users to communicate unplanned information, in a way not true of manual input.
(8) Identification of speaker: Speakers are recognizable by their voice characteristics.
(9) Long term reliability: Performance of speech reception and processing tasks which require monotonous vigilant operation can be done more reliably by computer than by humans.
(10) Low cost operation: Speech can provide cost savings where it eliminates substantial numbers of people.

Fig. 4. Task demands providing comparative advantages for speech. (From Newell et al. [109].)

As computers get cheaper and more powerful, it is estimated that 60-80 percent of the cost of running a business computer installation will be spent on data collection, preparation, and entry (unpublished proprietary studies; should be considered
speculative for the present). Given speech recognition systems that are flexible enough to change speakers or task definitions with a few days of effort, speech will begin to be used as an alternate medium of input to computers. Speech is likely to be used not so much for program entry, but rather primarily in data entry situations [33]. This increased usage should in turn lead to increased versatility and reduced cost in speech input systems.

There was some earlier skepticism as to whether speech input was necessary or even desirable as an input medium for computers [116]. The present attitude among the researchers in the field appears to be just the opposite, i.e., if speech input systems of reasonable cost and reliability were available, they would be the preferred mode of communication even though the relative cost is higher than other types of input [109]. Recent human factors studies in cooperative problem solving [23], [110] seem to support the view that speech is the preferred mode of communication. If it is indeed preferred, it seems safe to assume that the user would be willing to pay somewhat higher prices to be able to talk to computers. This prospect of being able to talk to computers is what drives the field, not just the development of a few systems for highly specialized applications.

II. SYSTEMS

This section provides an overview of the structure of different
