`Foreword by Nicholas Negroponte
`MIT Media Lab
`
Voice Communication with Computers
`
`Conversational Systems
`
`IPR2023-00035
`Apple EX1010 Page 1
`
`
`
`Copyright© 1994 by Christopher Schmandt
`Library of Congress Catalog Card Number 93-36404
`ISBN 0-442-23935-1
All rights reserved. No part of this work covered by the copyright hereon may be
reproduced or used in any form or by any means (graphic, electronic, or
mechanical, including photocopying, recording, taping, or information storage
and retrieval systems) without the written permission of the publisher.
Van Nostrand Reinhold is an International Thomson Publishing company.
ITP logo is a trademark under license.
`Printed in
Van Nostrand Reinhold
115 Fifth Avenue
New York, NY 10003

International Thomson Publishing GmbH
Königswinterer Str. 418
53277 Bonn
Germany

International Thomson Publishing Asia
221 Henderson Building #05-10
Singapore 0315

International Thomson Publishing
Berkshire House, 168-173
High Holborn, London WC1V 7AA
England

Thomas Nelson Australia
102 Dodds Street
South Melbourne 3205
Victoria, Australia

International Thomson Publishing Japan
Kyowa Building, 3F
2-2-1 Hirakawacho
Chiyoda-Ku, Tokyo 102
Japan

Nelson Canada
1120 Birchmount Road
Scarborough, Ontario
M1K 5G4, Canada
`16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
`
Library of Congress Cataloging in Publication Data
Schmandt, Chris.
Voice communication with computers / Chris Schmandt.
p. cm.
Includes bibliographical references and index.
ISBN 0-442-23935-1
1. Interactive computer systems. 2. Natural language processing
(Computer science) I. Title.
QA76.9.I58S35 1993
006.4'54-dc20    93-36404
CIP
`
`
`
`
Contents

Speaking of Talk
Preface
Acknowledgments
Introduction

Chapter 1. Speech as Communication
    SPEECH AS CONVERSATION
    HIERARCHICAL STRUCTURE OF CONVERSATION
    REPRESENTATIONS OF SPEECH
        Acoustic Representations
    PHONEMES AND SYLLABLES
        Phonemes
        Syllables
        Other Representations
    SUMMARY

Chapter 2. Speech Production and Perception
    VOCAL TRACT
    THE SPEECH SOUNDS
        Vowels
        Consonants
        Liquids and Glides
        Acoustic Features of Phonemes
    HEARING
        Auditory System
        Localization of Sounds
        Psychoacoustics
    SUMMARY
    FURTHER READING

Chapter 3. Speech Coding
    SAMPLING AND QUANTIZATION
    SPEECH-CODING ALGORITHMS
        Waveform Coders
        Source Coders
    CODER CONSIDERATIONS
        Intelligibility
        Editing
        Silence Removal
        Time Scaling
        Robustness
    SUMMARY
    FURTHER READING

Chapter 4. Applications and Editing of Stored Voice
    TAXONOMY OF VOICE OUTPUT APPLICATIONS
        Playback-Only Applications
        Interactive Record and Playback Applications
        Dictation
        Voice as a Document Type
    VOICE IN INTERACTIVE DOCUMENTS
    VOICE EDITING
        Temporal Granularity
        Manipulation of Audio Data
    EXAMPLES OF VOICE EDITORS
        Intelligent Ear, M.I.T.
        Tioga Voice, Xerox PARC
        PX Editor, Bell Northern Research
        Sedit, Olivetti Research Center, and M.I.T. Media Laboratory
        Pitchtool, M.I.T. Media Laboratory
    SUMMARY

Chapter 5. Speech Synthesis
    SYNTHESIZING SPEECH FROM TEXT
    FROM TEXT TO PHONEMES
        Additional Factors for Pronunciation
    FROM PHONEMES TO SOUND
        Parametric Synthesis
        Concatenative Synthesis
    QUALITY OF SYNTHETIC SPEECH
        Measuring Intelligibility
        Listener Satisfaction
        Performance Factors
    APPLICATIONS OF SYNTHETIC SPEECH
    SUMMARY
    FURTHER READING

Chapter 6. Interactive Voice Response
    LIMITATIONS OF SPEECH OUTPUT
        Speed
        Temporal Nature
        Serial Nature
        Bulkiness
        Privacy
    ADVANTAGES OF VOICE
    DESIGN CONSIDERATIONS
        Application Appropriateness
        Data Appropriateness
        Responsiveness
        Speech Rate
        Interruption
        Repetition
        Exception Pronunciation
        Multiple Voices
    USER INPUT WITH TOUCHTONES
        Menus
        Data Entry
    CASE STUDIES
        Direction Assistance
        Back Seat Driver
        Voiced Mail
    SUMMARY

Chapter 7. Speech Recognition
    BASIC RECOGNIZER COMPONENTS
    SIMPLE RECOGNIZER
        Representation
        Templates
        Pattern Matching
    CLASSES OF RECOGNIZERS
        Who Can Use the Recognizer?
        Speaking Style: Connected or Isolated Words?
        Vocabulary Size
    ADVANCED RECOGNITION TECHNIQUES
        Dynamic Time Warping
        Hidden Markov Models
        Vector Quantization
        Employing Constraints
    ADVANCED RECOGNITION SYSTEMS
        IBM's Tangora
        CMU's Sphinx
        MIT's SUMMIT
    SUMMARY
    FURTHER READING

Chapter 8. Using Speech Recognition
    USES OF VOICE INPUT
        Sole Input Channel
        Auxiliary Input Channel
        Keyboard Replacement
    SPEECH RECOGNITION ERRORS
        Classes of Recognition Errors
        Factors Influencing the Error Rate
    INTERACTION TECHNIQUES
        Minimizing Errors
        Confirmation Strategies
        Error Correction
    CASE STUDIES
        Xspeak: Window Management by Voice
        Put That There
    SUMMARY

Chapter 9. Higher Levels of Linguistic Knowledge
    SYNTAX
        Syntactic Structure and Grammars
        Parsers
    SEMANTICS
    PRAGMATICS
        Knowledge Representation
        Speech Acts
        Conversational Implicature and Speech Acts
    DISCOURSE
        Regulation of Conversation
        Discourse Focus
    CASE STUDIES
        Grunt
        Conversational Desktop
    SUMMARY
    FURTHER READING

Chapter 10. Basics of Telephones
    FUNCTIONAL OVERVIEW
    ANALOG TELEPHONES
        Signaling
        Transmission
    DIGITAL TELEPHONES
        Signaling
        Transmission
    PBXS
    SUMMARY
    FURTHER READING

Chapter 11. Telephones and Computers
    MOTIVATION
        Access to Multiple Communication Channels
        Improved User Interfaces
        Enhanced Functionality
        Voice and Computer Access
    PROJECTS IN INTEGRATED TELEPHONY
        Etherphone
        MICE
        BerBell
        Personal eXchange
        Phonetool
    ARCHITECTURES
        Distributed Architectures
        Centralized Architectures
        Comparison of Architectures
    CASE STUDIES
        Phone Slave
        Xphone and Xrolo
        Flexible Call Routing
    SUMMARY

Chapter 12. Desktop Audio
    EFFECTIVE DEPLOYMENT OF DESKTOP AUDIO
    GRAPHICAL USER INTERFACES
    AUDIO SERVER ARCHITECTURES
    UBIQUITOUS AUDIO
    CASE STUDIES
        Evolution of a Visual Interface
        Conversational Desktop
        Phoneshell
        Visual User Interfaces to Desktop Audio
    SUMMARY

Chapter 13. Toward More Robust Communication
    ROBUST COMMUNICATION
    SPEECH RECOGNITION AND ROBUST PARSING
    PROSODY
    WHAT NEXT?

Bibliography
Index
`
`
`
`Introduction
`
For most of us, speech has been an integral part of our daily lives since we were
small children. Speech is communication; it is highly expressive and conveys
subtle intentions clearly. Our conversations employ a range of interactive
techniques to facilitate mutual understanding and ensure that we are understood.
But despite the effectiveness of speech communication, few of us use speech in
our daily computing environments. In most workplaces voice is relegated to
specialized industrial applications or aids to the disabled; voice is not a part
of the computer interfaces based on displays, keyboards, and mice. Although
current workstations have become capable of supporting much more sophisticated
voice processing, the most successful speech application to date, voice mail, is
tied most closely to the telephone.
As speech technologies and natural language understanding mature in the
coming decades, many more potential applications will become reality. But much
more than raw technology is required to bridge the gap between human
conversation and computer interfaces; we must understand the assets and
liabilities of voice communication if we are to gauge under which circumstances
it will prove to be valuable to end users.
`Conversational systems must speak and listen, but they also must understand,
`pose queries, take turns, and remember the topic of conversation. Understanding
`how people converse lets us develop better models for interaction with computers
`by voice. But speech is a very demanding medium to employ effectively, and
`unless user interaction techniques are chosen with great care, voice applications
`tend to be slow and awkward to use.
`
`
This book is about using speech in a variety of computing environments based
on appreciating its role in human communication. Speech can be used as a
method of interacting with a computer to place requests or receive warnings and
notices. Voice can also be used as the underlying data itself, such as notes stored
in a calendar, voice annotations of a text document, or telephone messages.
Desktop workstations can already support both these speech functions. Speech
excels as a method of interacting with the desktop computer over the telephone
and has strong potential as the primary channel to access a computer small
enough to fit in one's shirt pocket. The full utility of speech will be realized
only when it is integrated across all these situations; when users find it
effective to talk to their computers over the telephone, for example, they will
suddenly have more utility for voice as data while in the office.
`
`CONTENTS OF THIS BOOK
`
This book serves different needs for different readers. The author believes that a
firm grounding in the theory of operation of speech technologies forms an
important basis for appreciating the difficulties of building applications and
interfaces to employ them. This understanding is necessary if we wish to be
capable of making any predictions or even guesses of where this field will lead
us over the next decade. Paired with descriptions of voice technologies are
chapters devoted to applications and user interaction techniques for each,
including case studies to illustrate potential applications in more detail. But
many chapters stand more or less on their own, and individual readers may pick
and choose among them. Readers interested primarily in user interface design
issues will gain most benefit from Chapters 4, 6, 8, 9, and 12. Those most
concerned about system architectures and support for voice in multimedia
computing environments should focus on Chapters 3, 5, 7, and 12. A
telecommunications perspective is the emphasis of Chapters 10, 11, and 6.
A conversation requires the ability to speak and to listen, and, if the parties are
not in close proximity, some means of transporting their voices across a distance.
Chapter 1 discusses the communicative role of speech and introduces some
representations of speech and an analytic approach that frames the content of
this book. Chapter 2 discusses the physiology of human speech and how we perceive
it through our ears; although later chapters refer back to this information, it
is not essential for understanding the remainder of the book.
Voice interface technologies are required for computers to participate in
conversations. These technologies include digital recording, speech synthesis, and
speech recognition; these are the topics of Chapters 3, 5, and 7. Knowledge of the
operations of the speech technologies better prepares the reader to appreciate
their limitations and understand the impact of improvements in the technologies
in the near and distant future.
Although speech is intuitive and seemingly effortless for most of us, it is
actually quite difficult to employ as a computer interface. This difficulty is
partially due to limitations of current technology but is also a result of
characteristics inherent in the speech medium itself. The heart of this book is
both a set of criteria for evaluating the suitability of voice for a range of
applications and a set of interaction techniques to make its use effective in
the user interface. Although these topics are treated throughout this book, they
receive particular emphasis in Chapters 4, 6, 8, and 12. These design guidelines
are accentuated by case studies scattered throughout the book but especially in
these chapters.
These middle chapters are presented in pairs. Each pair contains a chapter
describing underlying technology matched with a chapter discussing how to
apply the technology. Chapter 3 describes various speech coding methods and
differentiates coding schemes based on data rate, intelligibility, and
flexibility. Chapter 4 then focuses on simple applications of stored voice
in computer documents and the internal structure of audio editors used to
produce those documents. Chapter 5 introduces text-to-speech algorithms. Chapter
6 then draws on both speech coding and speech synthesis to discuss interactive
applications using speech output over the telephone.
Chapter 7 introduces an assortment of speech recognition techniques. After
this, Chapter 8 returns to interactive systems, this time emphasizing voice
input instead of touch tones. The vast majority of work to date on systems that
speak and listen has involved short utterances and brief transactions. But both
sentences and conversations exhibit a variety of structures that must be
mastered if computers are to become fluent. Syntax and semantics constrain
sentences in ways that facilitate interpretation; pragmatics relates a person's
utterances to intentions and real-world objects; and discourse knowledge
indicates how to respond and carry on the thread of a conversation across
multiple exchanges. These aspects of speech communication, which are the focus
of Chapters 9 and 13, must be incorporated into any system that can engage
successfully in a conversation that in any way approaches the way we speak to
each other.
Although a discussion of the workings of the telephone network may at first
seem tangential to a book about voice in computing, the telephone plays a key
role in any discussion of speech and computers. The ubiquity of the telephone
assures it a central role in our voice communication tasks. Every aspect of
telephone technology is rapidly changing, from the underlying network to the
devices we hold in our hands, and this is creating many opportunities for
computers to get involved in our day-to-day communication tasks. Chapter 10
describes the telephone technologies, while Chapter 11 discusses the integration
of telephone functionality into computer workstations. Much of Chapter 6 is about
building telephone-based voice applications that can provide a means of accessing
personal databases while not in the office.
When we work at our desks, we may employ a variety of speech processing
technologies in isolation, but the full richness of voice at the desktop comes with
the combination of multiple voice applications. Voice applications on the
workstation also raise issues of interaction between both audio and window systems
and operating system and run-time support for voice. This is the topic of Chapter
12. Speakers and microphones at every desk may allow us to capture many of
the spontaneous conversations we hold every day, which are such an essential
aspect of our work lives. Desktop voice processing also enables remote telephone
access to many of the personal information management utilities that we use in
our offices.
`
`ASSUMPTIONS
`
This book covers material derived from a number of specialized disciplines in a
way that is accessible to a general audience. It is divided equally between
background knowledge of speech technologies and practical application and
interaction techniques. The broad view of voice communication taken in this book
is by definition interdisciplinary. Speech communication is so vital and so rich
that a number of specialized areas of research have arisen around it, including
speech science, digital signal processing and linguistics, aspects of artificial
intelligence (computational linguistics), cognitive psychology, and human
factors. This book touches on all these areas but makes no pretense of covering
any of them in depth. This book attempts to open doors by revealing why each of
these research areas is relevant to the design of conversational computer
systems; the reader with further interest in any of these fields is encouraged
to pursue the key overview references mentioned in each chapter.
Significant knowledge of higher mathematics as well as digital signal processing
is assumed by many speech texts. These disciplines provide an important
level of abstraction and on a practical level are tools required for any serious
development of speech technology itself. But to be accessible to a wider audience,
this book makes little use of mathematics beyond notation from basic algebra.
This book provides an intuitive, rather than rigorous, treatment of speech signal
processing to aid the reader in evaluating and selecting technologies and in
appreciating their operation and design tradeoffs.
There is a wide gap between the goal of emulating conversational human
behavior and what is commercially viable with today's speech technology. Despite
the large amount of basic speech research around the world, there is little
innovative work on how speech devices may be used in advanced systems, but it is
difficult to discuss applications without examples. To this end, the author has
taken the liberty to provide more detail with a series of voice projects from the
Speech Research Group of M.I.T.'s Media Laboratory (including work from one of its
predecessors, the Architecture Machine Group). Presented as case studies, these
projects are intended both to illustrate applications of the ideas presented in each
chapter and to present pertinent design issues. It is hoped that taken collectively
these projects will offer a vision of the many ways in which computers can take
part in communication.
`
Chapter 1. Speech as Communication
`
Speech can be viewed in many ways. Although chapters of this book focus on
specific aspects of speech and the computer technologies that utilize speech, the
reader should begin with a broad perspective on the role of speech in our daily
lives. It is essential to appreciate the range of capabilities that conversational
systems must possess before attempting to build them. This chapter lays the
groundwork for the entire book by presenting several perspectives on speech
communication.
The first section of this chapter emphasizes the interactive and expressive
role of voice communication. Except in formal circumstances such as lectures
and dramatic performances, speech occurs in the context of a conversation,
wherein participants take turns speaking, interrupt each other, nod in
agreement, or try to change the topic. Computer systems that talk or listen
may ultimately be judged by their ability to converse in like manner simply
because conversation permeates human experience. The second section discusses
the various components or layers of a conversation. Although the
distinctions between these layers are somewhat contrived, they provide a
means of analyzing the communication process; research disciplines have
evolved for the study of each of these components. Finally, the last section
introduces the representations of speech and conversation, corresponding in part
to the layers identified in the second section. These representations provide
abstractions that a computer program may employ to engage in a conversation
with a human.
`
`
`SPEECH AS CONVERSATION
`
Conversation is a process involving multiple participants, shared knowledge, and
a protocol for taking turns and providing mutual feedback. Voice is our primary
channel of interaction in conversation, and speech evolved in humans in response
to their need to communicate. It is hard to imagine many uses of speech that do
not involve some interchange between multiple participants in a conversation; if
we are discovered talking to ourselves, we usually feel embarrassed.
For people of normal physical and mental ability, speech is both rich in
expressiveness and easy to use. We learn it without much apparent effort as
children and employ it spontaneously on a daily basis.1 People employ many layers
of knowledge and sophisticated protocols while having a conversation; until we
attempt to analyze dialogues, we are unaware of the complexity of this interplay
between parties.
Although much is known about language, study of interactive speech
communication has begun only recently. Considerable research has been done on
natural language processing systems, but much of this is based on keyboard input.
It is important to note the contrast between written and spoken language and
between read or rehearsed speech and spontaneous utterances. Spoken language
is less formal than written language, and errors in construction of spoken
sentences are less objectionable. Spontaneous speech shows much evidence of the
real-time processes associated with its production, including false starts,
nonspeech noises such as mouth clicks and breath sounds, and pauses either silent
or filled (" ... um ... ") [Zue et al. 1989b]. In addition, speech naturally
conveys intonational and emotional information that fiction writers and
playwrights must struggle to impart to written language.
Speech is rich in interactive techniques to guarantee that the listener
understands what is being expressed, including facial expressions, physical and
vocal gestures, "uh-huhs," and the like. At certain points in a conversation, it
is appropriate for the listener to begin speaking; these points are often
indicated by longer pauses and lengthened final syllables or marked decreases in
pitch at the end of a sentence. Each round of speech by one person is called a
turn; interruption occurs when a participant speaks before a break point offered
by the talker. Instead of taking a turn, the listener may quickly indicate
agreement with a word or two, a nonverbal sound ("uh-huh"), or a facial gesture.
Such responses, called back channels, speed the exchange and result in more
effective conversations [Kraut et al. 1982].2
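The distinction among turns, interruptions, and back channels can be made
concrete with a small sketch. The following Python fragment is only an
illustration under stated assumptions: the function name, the list of agreement
tokens, and the two-word limit are hypothetical choices, not part of any system
described in this book.

```python
# Hypothetical vocabulary of short agreement tokens; not taken from the text.
BACK_CHANNEL_TOKENS = {"uh-huh", "mm-hmm", "yes", "yeah", "right", "ok"}


def classify_response(words, at_break_point, max_back_channel_words=2):
    """Classify what a listener does while another participant is talking.

    words: the listener's words, lowercased; an empty list stands for a
        nonverbal gesture such as a nod.
    at_break_point: True if the talker has just offered a point at which
        the listener may begin speaking.
    """
    short = len(words) <= max_back_channel_words
    if short and all(w in BACK_CHANNEL_TOKENS for w in words):
        return "back channel"   # agreement without taking a turn
    if at_break_point:
        return "turn"           # the listener takes the floor when offered
    return "interruption"       # speaking before a break point is offered


print(classify_response(["uh-huh"], at_break_point=False))
print(classify_response("what about the other trail".split(), at_break_point=True))
print(classify_response("no wait".split(), at_break_point=False))
```

A real conversational system would have to detect break points from pauses,
lengthened syllables, and pitch rather than receive them as a flag; the sketch
only fixes the vocabulary of the three outcomes.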
Because of these interactive characteristics, speech is used for immediate
communication needs, while writing often implies a distance, either in time or
space, between the author and reader. Speech is used in transitory interactions
or situations in which the process of the interaction may be as important as its
result. For example, the agenda for a meeting is likely to be written, and a
written summary or minutes may be issued "for the record," but the actual
decisions are made during a conversation. Chapanis and his colleagues arranged a
series of experiments to compare the effectiveness of several communication
media, i.e., voice, video, handwriting, and typewriting, either alone or in
combination, for problem-solving tasks [Ochsman and Chapanis 1974]. Their
findings indicated an overwhelming contribution of voice for such interactions.
Any experimental condition that included voice was superior to any excluding
voice; the inclusion of other media with voice resulted in only a small
additional effectiveness. Although these experiments were simplistic in their use
of student subjects and invented tasks and more recent work by others [Minneman
and Bly 1991] clarifies a role for video interaction, the dominance of voice
seems unassailable.

1 For a person with normal speech and hearing to spend a day without speaking is
quite a novel experience.
2 We will return to these topics in Chapter 9.

`But conversation is more than mere interaction; communication often serves a
`purpose of changing or influencing the parties speaking to each other. I tell you
`something I have learned with the intention that you share my knowledge and
`hence enhance your view of the world. Or I wish to obtain some information from
`you so I ask you a question, hoping to elicit a reply. Or perhaps I seek to convince
`you to perform some activity for me; this may be satisfied either by your physical
`performance of the requested action or by your spoken promise to perform the act
`at a later time. "Speech Act" theories (to be discussed in more detail in Chapter 9)
`attempt
`to explain language as action, e.g., to request, command, query, and
`promise, as well as to inform.
The intention behind an utterance may not be explicit. For example, "Can you
pass the salt?" is not a query about one's ability; it is a request. Many actual
conversations resist such purposeful classifications. Some utterances ("go ahead,"
"uh-huh," "just a moment") exist only to guide the flow of the conversation or
comment on the state of the discourse, rather than to convey information. Directly
purposeful requests are often phrased in a manner allowing flexibility of
interpretation and response. This looseness is important to the process of people
defining and maintaining their work roles with respect to each other and
establishing socially comfortable relationships in a hierarchical organization.
The richness of speech allows a wide range of "acceptance" and "agreement" from
wholehearted to skeptical to incredulous.
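The gap between what an utterance literally says and what it is meant to do can
be sketched as a small data structure. This is a minimal illustration, assuming
hypothetical names throughout: the `Act` labels follow the actions named above
(request, command, query, promise, inform), `FLOW_CONTROL` is an invented label
for utterances that only guide the conversation, and the second example sentence
is made up for contrast.

```python
from dataclasses import dataclass
from enum import Enum


class Act(Enum):
    INFORM = "inform"
    QUERY = "query"
    REQUEST = "request"
    COMMAND = "command"
    PROMISE = "promise"
    FLOW_CONTROL = "flow control"  # invented label for "go ahead", "uh-huh"


@dataclass(frozen=True)
class Utterance:
    text: str
    surface_act: Act   # what the literal wording looks like
    intended_act: Act  # what the speaker means to accomplish

    def is_indirect(self) -> bool:
        """True when the wording and the intention do not match."""
        return self.surface_act is not self.intended_act


EXAMPLES = [
    Utterance("Can you pass the salt?", Act.QUERY, Act.REQUEST),
    Utterance("Please pass the salt.", Act.REQUEST, Act.REQUEST),  # made-up contrast
    Utterance("uh-huh", Act.FLOW_CONTROL, Act.FLOW_CONTROL),
]

for u in EXAMPLES:
    kind = "indirect" if u.is_indirect() else "direct"
    print(f"{u.text!r}: {kind}, intended as {u.intended_act.value}")
```

A system that inspected only `surface_act` would answer "yes" to the first
example and leave the salt where it is, which is why the intended act has to be
represented separately.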
Speech also serves a strong social function among individuals and is often used
just to pass the time, tell jokes, or talk about the weather. Indeed, extended
periods of silence among a group may be associated with interpersonal awkwardness
or discomfort. Sometimes the actual occurrence of the conversation serves a more
significant purpose than any of the topics under discussion. Speech may be used
to call attention to oneself in a social setting or as an exclamation of surprise or
dismay in which an utterance has little meaning with respect to any preceding
conversation [Goffman 1981].
The expressiveness of speech and robustness of conversation strongly support
the use of speech in computer systems, both for stored voice as a data type and
for speech as a medium of interaction. Unfortunately, current computers are
capable of uttering only short sentences of marginal intelligibility and
occasionally recognizing single words. Engaging a computer in a conversation can
be like an interaction in a foreign country. One studies the phrase book, utters
a request, and in return receives either a blank stare (wrong pronunciation, try
again) or a torrent of fluent speech in which one cannot perceive even the word
boundaries.
However, limitations in technology only reinforce the need to take advantage of
conversational techniques to ensure that the user is understood. Users will judge
the performance of computer systems employing speech on the basis of their
expectations about conversation developed from years of experience speaking
with fellow humans. Users may expect computers to be either deaf and dumb, or
once they realize the system can talk and listen, expect it to speak fluently like
you and me. Since the capabilities of current speech technology lie between these
extremes, building effective conversational computer systems can be very
frustrating.
`
`
`
`HIERARCHICAL STRUCTURE OF CONVERSATION
`
`
`
A more analytic approach to speech communication reveals a number of different
ways of describing what actually occurs when we speak. The hierarchical
structure of such analysis suggests goals to be attained at various stages in
computer-based speech communication.
Conversation requires apparatus both for listening and speaking. Effective
communication invokes mental processes employing the mouth and ears to
convey a message thoroughly and reliably. There are many layers at which we can
analyze the communication process, from the lower layers where speech is
considered primarily acoustically to higher layers that express meaning and
intention. Each layer involves increased knowledge and potential for intelligence
and interactivity.
`From the point of view of the speaker, we may look at speech from at least eight
`layers of processing as shown in Figure 1.1.
`
Layers of Speech Processing

discourse  The regulation of conversation for pragmatic ends. This includes
taking turns talking, the history of referents in a conversation so pronouns can
refer to words spoken earlier, and the process of introducing new topics.

pragmatics  The intent or motivation for an utterance. This is the underlying
reason the utterance was spoken.

semantics  The meaning of the words individually and their meaning as
combined in a particular sentence.

syntax  The rules governing the combination of words in a sentence, their parts
of speech, and their forms, such as case and number.

lexical  The set of words in a language, the rules for forming new words from
affixes (prefixes and suffixes), and the stress ("accent") of syllables within the
words.

phonetics  The series of sounds that uniquely convey the series of words in the
sentence.

articulation  The motions or configurations of the vocal tract that produce the
sounds, e.g., the tongue touching the lips or the vocal cords vibrating.

acoustics  The realization of the string of phonemes in the sentence as
vibrations of air molecules to produce pressure waves, i.e., sound.

Figure 1.1. A layered view of speech communication.

    speaker         listener
    discourse       discourse
    pragmatic       pragmatic
    semantic        semantic
    syntactic       syntactic
    lexical         lexical
    phonemic        phonemic
    articulatory    perceptual
    acoustic        acoustic

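The eight layers can also be written down as an ordered data type, which makes
the hierarchy explicit. The sketch below is illustrative only: the names `Layer`,
`speaker_path`, `listener_path`, and `layer_name` are hypothetical, and the
direction of travel (a speaker working from discourse down to acoustics, a
listener working from acoustics up) is an assumption about how the layered view
is read, not a claim made in the list itself.

```python
from enum import IntEnum


class Layer(IntEnum):
    """Layers of speech processing, numbered from lowest (1) to highest (8)."""
    ACOUSTIC = 1
    ARTICULATORY = 2   # shown as "perceptual" on the listener's side
    PHONEMIC = 3
    LEXICAL = 4
    SYNTACTIC = 5
    SEMANTIC = 6
    PRAGMATIC = 7
    DISCOURSE = 8


def speaker_path():
    """Assumed order for a speaker: from intention down to sound."""
    return sorted(Layer, reverse=True)


def listener_path():
    """Assumed order for a listener: from sound up to intention."""
    return sorted(Layer)


def layer_name(layer, role):
    """Name a layer as it appears in the speaker or listener column."""
    if layer is Layer.ARTICULATORY and role == "listener":
        return "perceptual"
    return layer.name.lower()


print([layer_name(l, "speaker") for l in speaker_path()])
print([layer_name(l, "listener") for l in listener_path()])
```

The only asymmetry between the two columns is the second layer: the speaker
articulates, while the listener perceives.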
`
`Consider two hikers walking through the forest when one hiker's shoelace
`becomes untied. The other hiker sees this and says, "Hey, you're going to trip on
`your shoelace." The listener then ties the shoelace. We can consider this utterance
`at each layer of description.
Discourse analysis reveals that "Hey" serves to call attention to the urgency
of the message and probably indicates the introduction of a new topic of
conversation. It is probably spoken in a raised tone and an observer would
reasonably expect the listener to acknowledge this utterance, either with a vocal
response or by tying the shoe. Experience with discourse indicates that this is
an appropriate interruption or initiation of a conversation at least under some
circumstances. Discourse structure may help the listener understand that
subsequent utterances refer to the shoelace instead of the difficulty of the
terrain on which the conversants are traveling.
In terms of pragmatics, the speaker's intent is to warn the listener against
tripping; presumably the speaker does not wish the listener to fall. But this
utterance might also have been a ruse intended to get the liste