Christopher Schmandt
`Foreword by Nicholas Negroponte
`MIT Media Lab
`
`Voice Communication with Computers
`
`Conversational Systems
`
`IPR2023-00035
`Apple EX1010 Page 1
`
`

`

`Copyright© 1994 by Christopher Schmandt
`Library of Congress Catalog Card Number 93-36404
`ISBN 0-442-23935-1
`All rights reserved. No part of this work covered by the copyright hereon may be
`reproduced or used in any form or by any means (graphic, electronic, or mechanical,
`including photocopying, recording, taping, or information storage and retrieval
`systems) without the written permission of the publisher.
`ITP Van Nostrand Reinhold is an International Thomson Publishing company.
`ITP logo is a trademark under license.
`Printed in
`Van Nostrand Reinhold
`115 Fifth Avenue
`New York, NY 10003
`
`International Thomson Publishing GmbH
`Konigswinterer Str. 418
`53277 Bonn
`Germany
`International Thomson Publishing Asia
`221 Henderson Building #05-10
`Singapore 0315
`
`International Thomson Publishing
`Berkshire House, 168-173
`High Holborn, London WC1V 7AA
`England
`Thomas Nelson Australia
`102 Dodds Street
`South Melbourne 3205
`Victoria, Australia
`
`International Thomson Publishing Japan
`Kyowa Building, 3F
`2-2-1 Hirakawacho
`Chiyoda-Ku, Tokyo 102
`Japan
`
`Nelson Canada
`1120 Birchmount Road
`Scarborough, Ontario
`M1K 5G4, Canada
`16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
`
`Library of Congress Cataloging in Publication Data 93-36404
`Schmandt, Chris.
`Voice communication with computers / Chris Schmandt.
`p. cm.
`Includes bibliographical references and index.
`ISBN 0-442-23935-1
`1. Interactive computer systems. 2. Natural language processing
`(Computer science) I. Title.
`QA76.9.I58S35 1993
`006.4'54-dc20
`
`93-36404
`CIP
`
`Contents
`
`Speaking of Talk
`Preface
`Acknowledgments
`Introduction
`
`Chapter 1. Speech as Communication
`SPEECH AS CONVERSATION
`HIERARCHICAL STRUCTURE OF CONVERSATION
`REPRESENTATIONS OF SPEECH
`Acoustic Representations
`PHONEMES AND SYLLABLES
`Phonemes
`Syllables
`Other Representations
`SUMMARY
`
`Chapter 2. Speech Production and Perception
`VOCAL TRACT
`THE SPEECH SOUNDS
`Vowels
`Consonants
`Liquids and Glides
`Acoustic Features of Phonemes
`HEARING
`Auditory System
`Localization of Sounds
`Psychoacoustics
`SUMMARY
`FURTHER READING
`
`Chapter 3. Speech Coding
`SAMPLING AND QUANTIZATION
`SPEECH-CODING ALGORITHMS
`Waveform Coders
`Source Coders
`CODER CONSIDERATIONS
`Intelligibility
`Editing
`Silence Removal
`Time Scaling
`Robustness
`SUMMARY
`FURTHER READING
`
`Chapter 4. Applications and Editing of Stored Voice
`TAXONOMY OF VOICE OUTPUT APPLICATIONS
`Playback-Only Applications
`Interactive Record and Playback Applications
`Dictation
`Voice as a Document Type
`VOICE IN INTERACTIVE DOCUMENTS
`VOICE EDITING
`Temporal Granularity
`Manipulation of Audio Data
`EXAMPLES OF VOICE EDITORS
`Intelligent Ear, M.I.T.
`Tioga Voice, Xerox PARC
`PX Editor, Bell Northern Research
`Sedit, Olivetti Research Center, and M.I.T. Media Laboratory
`Pitchtool, M.I.T. Media Laboratory
`SUMMARY
`
`Chapter 5. Speech Synthesis
`SYNTHESIZING SPEECH FROM TEXT
`FROM TEXT TO PHONEMES
`Additional Factors for Pronunciation
`FROM PHONEMES TO SOUND
`Parametric Synthesis
`Concatenative Synthesis
`QUALITY OF SYNTHETIC SPEECH
`Measuring Intelligibility
`Listener Satisfaction
`Performance Factors
`APPLICATIONS OF SYNTHETIC SPEECH
`SUMMARY
`FURTHER READING
`
`Chapter 6. Interactive Voice Response
`LIMITATIONS OF SPEECH OUTPUT
`Speed
`Temporal Nature
`Serial Nature
`Bulkiness
`Privacy
`ADVANTAGES OF VOICE
`DESIGN CONSIDERATIONS
`Application Appropriateness
`Data Appropriateness
`Responsiveness
`Speech Rate
`Interruption
`Repetition
`Exception Pronunciation
`Multiple Voices
`USER INPUT WITH TOUCHTONES
`Menus
`Data Entry
`CASE STUDIES
`Direction Assistance
`Back Seat Driver
`Voiced Mail
`SUMMARY
`
`Chapter 7. Speech Recognition
`BASIC RECOGNIZER COMPONENTS
`SIMPLE RECOGNIZER
`Representation
`Templates
`Pattern Matching
`CLASSES OF RECOGNIZERS
`Who Can Use the Recognizer?
`Speaking Style: Connected or Isolated Words?
`Vocabulary Size
`ADVANCED RECOGNITION TECHNIQUES
`Dynamic Time Warping
`Hidden Markov Models
`Vector Quantization
`Employing Constraints
`ADVANCED RECOGNITION SYSTEMS
`IBM's Tangora
`CMU's Sphinx
`MIT's SUMMIT
`SUMMARY
`FURTHER READING
`
`Chapter 8. Using Speech Recognition
`USES OF VOICE INPUT
`Sole Input Channel
`Auxiliary Input Channel
`Keyboard Replacement
`SPEECH RECOGNITION ERRORS
`Classes of Recognition Errors
`Factors Influencing the Error Rate
`INTERACTION TECHNIQUES
`Minimizing Errors
`Confirmation Strategies
`Error Correction
`CASE STUDIES
`Xspeak: Window Management by Voice
`Put That There
`SUMMARY
`
`Chapter 9. Higher Levels of Linguistic Knowledge
`SYNTAX
`Syntactic Structure and Grammars
`Parsers
`SEMANTICS
`PRAGMATICS
`Knowledge Representation
`Speech Acts
`Conversational Implicature and Speech Acts
`DISCOURSE
`Regulation of Conversation
`Discourse Focus
`CASE STUDIES
`Grunt
`Conversational Desktop
`SUMMARY
`FURTHER READING
`
`Chapter 10. Basics of Telephones
`FUNCTIONAL OVERVIEW
`ANALOG TELEPHONES
`Signaling
`Transmission
`DIGITAL TELEPHONES
`Signaling
`Transmission
`PBXS
`SUMMARY
`FURTHER READING
`
`Chapter 11. Telephones and Computers
`MOTIVATION
`Access to Multiple Communication Channels
`Improved User Interfaces
`Enhanced Functionality
`Voice and Computer Access
`PROJECTS IN INTEGRATED TELEPHONY
`Etherphone
`MICE
`BerBell
`Personal eXchange
`Phonetool
`ARCHITECTURES
`Distributed Architectures
`Centralized Architectures
`Comparison of Architectures
`CASE STUDIES
`Phone Slave
`Xphone and Xrolo
`Flexible Call Routing
`SUMMARY
`
`Chapter 12. Desktop Audio
`EFFECTIVE DEPLOYMENT OF DESKTOP AUDIO
`GRAPHICAL USER INTERFACES
`AUDIO SERVER ARCHITECTURES
`UBIQUITOUS AUDIO
`CASE STUDIES
`Evolution of a Visual Interface
`Conversational Desktop
`Phoneshell
`Visual User Interfaces to Desktop Audio
`SUMMARY
`
`Chapter 13. Toward More Robust Communication
`ROBUST COMMUNICATION
`SPEECH RECOGNITION AND ROBUST PARSING
`PROSODY
`WHAT NEXT?
`
`Bibliography
`Index
`
`Introduction
`
`For most of us, speech has been an integral part of our daily lives since we were
`small children. Speech is communication; it is highly expressive and conveys subtle
`intentions clearly. Our conversations employ a range of interactive techniques
`to facilitate mutual understanding and ensure that we are understood.
`But despite the effectiveness of speech communication, few of us use speech in
`our daily computing environments. In most workplaces voice is relegated to specialized
`industrial applications or aids for the disabled; voice is not a part of the
`computer interfaces based on displays, keyboards, and mice. Although current
`workstations have become capable of supporting much more sophisticated voice
`processing, the most successful speech application to date, voice mail, is tied most
`closely to the telephone.
`As speech technologies and natural language understanding mature in the
`coming decades, many more potential applications will become reality. But much
`more than raw technology is required to bridge the gap between human conversation
`and computer interfaces; we must understand the assets and liabilities of
`voice communication if we are to judge the circumstances in which it will prove
`valuable to end users.
`Conversational systems must speak and listen, but they also must understand,
`pose queries, take turns, and remember the topic of conversation. Understanding
`how people converse lets us develop better models for interaction with computers
`by voice. But speech is a very demanding medium to employ effectively, and
`unless user interaction techniques are chosen with great care, voice applications
`tend to be slow and awkward to use.
`
`This book is about using speech in a variety of computing environments, based
`on an appreciation of its role in human communication. Speech can be used as a
`method of interacting with a computer to place requests or receive warnings and
`notices. Voice can also be used as the underlying data itself, such as notes stored
`in a calendar, voice annotations of a text document, or telephone messages. Desktop
`workstations can already support both these speech functions. Speech excels
`as a method of interacting with the desktop computer over the telephone and has
`strong potential as the primary channel to access a computer small enough to fit
`in one's shirt pocket. The full utility of speech will be realized only when it is
`integrated across all these situations; when users find it effective to talk to their
`computers over the telephone, for example, they will suddenly find voice as data
`more useful while in the office.
`
`CONTENTS OF THIS BOOK
`
`This book serves different needs for different readers. The author believes that a
`firm grounding in the theory of operation of speech technologies forms an important
`basis for appreciating the difficulties of building applications and interfaces
`to employ them. This understanding is necessary if we wish to make any
`predictions, or even guesses, about where this field will lead us over the next
`decade. Paired with descriptions of voice technologies are chapters devoted to
`applications and user interaction techniques for each, including case studies to
`illustrate potential applications in more detail. But many chapters stand more or
`less on their own, and individual readers may pick and choose among them.
`Readers interested primarily in user interface design issues will benefit most
`from Chapters 4, 6, 8, 9, and 12. Those most concerned about system architectures
`and support for voice in multimedia computing environments should focus
`on Chapters 3, 5, 7, and 12. A telecommunications perspective is the emphasis of
`Chapters 6, 10, and 11.
`A conversation requires the ability to speak and to listen, and, if the parties are
`not in close proximity, some means of transporting their voices across a distance.
`Chapter 1 discusses the communicative role of speech and introduces some
`representations of speech and an analytic approach that frames the content of this
`book. Chapter 2 discusses the physiology of human speech and how we perceive it
`through our ears; although later chapters refer back to this information, it is not
`essential for understanding the remainder of the book.
`Voice interface technologies are required for computers to participate in
`conversations. These technologies include digital recording, speech synthesis, and
`speech recognition; these are the topics of Chapters 3, 5, and 7. Knowing how the
`speech technologies operate better prepares the reader to appreciate
`their limitations and understand the impact of improvements in the technologies
`in the near and distant future.
`Although speech is intuitive and seemingly effortless for most of us, it is actually
`quite difficult to employ as a computer interface. This difficulty is partly due to
`the limitations of current technology and partly a result of characteristics inherent
`in the speech medium itself. The heart of this book lies in two complementary topics:
`criteria for evaluating the suitability of voice for a range of applications, and
`interaction techniques that make its use effective in the user interface. Although
`these topics are treated throughout this book, they receive particular emphasis in
`Chapters 4, 6, 8, and 12. These design guidelines are illustrated by case studies
`scattered throughout the book, especially in these chapters.
`These middle chapters are presented in pairs. Each pair contains a chapter
`describing an underlying technology matched with a chapter discussing how to
`apply it. Chapter 3 describes various speech coding methods and differentiates
`coding schemes by data rate, intelligibility, and flexibility. Chapter 4 then
`focuses on simple applications of stored voice in computer documents and the
`internal structure of the audio editors used to produce those documents. Chapter 5
`introduces text-to-speech algorithms. Chapter 6 then draws on both speech coding
`and speech synthesis to discuss interactive applications using speech output over
`the telephone.
`Chapter 7 introduces an assortment of speech recognition techniques. After
`this, Chapter 8 returns to interactive systems, this time emphasizing voice
`input instead of touch tones. The vast majority of work to date on systems that
`speak and listen has involved short utterances and brief transactions. But both
`sentences and conversations exhibit a variety of structures that must be mastered
`if computers are to become fluent. Syntax and semantics constrain sentences
`in ways that facilitate interpretation; pragmatics relates a person's utterances
`to intentions and real-world objects; and discourse knowledge indicates how to
`respond and carry on the thread of a conversation across multiple exchanges.
`These aspects of speech communication, which are the focus of Chapters 9 and 13,
`must be incorporated into any system that hopes to converse in a way that even
`approaches how we speak to each other.
`Although a discussion of the workings of the telephone network may at first
`seem tangential to a book about voice in computing, the telephone plays a key
`role in any discussion of speech and computers. The ubiquity of the telephone
`assures it a central role in our voice communication tasks. Every aspect of
`telephone technology, from the underlying network to the devices we hold in our
`hands, is changing rapidly, and this is creating many opportunities for computers
`to get involved in our day-to-day communication tasks. Chapter 10 describes
`telephone technologies, while Chapter 11 discusses the integration of telephone
`functionality into computer workstations. Much of Chapter 6 is about building
`telephone-based voice applications that provide a means of accessing personal
`databases while out of the office.
`When we work at our desks, we may employ a variety of speech processing
`technologies in isolation, but the full richness of voice at the desktop comes with
`the combination of multiple voice applications. Voice applications on the workstation
`also raise issues of interaction between audio and window systems, as well as
`operating system and run-time support for voice. These are the topics of Chapter 12.
`Speakers and microphones at every desk may allow us to capture many of
`the spontaneous conversations we hold every day, which are such an essential
`aspect of our work lives. Desktop voice processing also enables remote telephone
`access to many of the personal information management utilities that we use in
`our offices.
`
`ASSUMPTIONS
`
`This book covers material derived from a number of specialized disciplines in a
`way that is accessible to a general audience. It is divided equally between background
`knowledge of speech technologies and practical application and interaction
`techniques. The broad view of voice communication taken in this book is by
`definition interdisciplinary. Speech communication is so vital and so rich that a
`number of specialized areas of research have arisen around it, including speech
`science, digital signal processing, linguistics, aspects of artificial intelligence
`(computational linguistics), cognitive psychology, and human factors. This book
`touches on all these areas but makes no pretense of covering any of them in
`depth. Instead, it attempts to open doors by revealing why each of these research
`areas is relevant to the design of conversational computer systems; the reader
`with further interest in any of these fields is encouraged to pursue the key
`overview references mentioned in each chapter.
`Many speech texts assume significant knowledge of higher mathematics as well as
`digital signal processing. These disciplines provide an important level of
`abstraction and, on a practical level, are tools required for any serious
`development of speech technology itself. But to be accessible to a wider audience,
`this book makes little use of mathematics beyond notation from basic algebra.
`This book provides an intuitive, rather than rigorous, treatment of speech signal
`processing to help the reader evaluate and select technologies and appreciate
`their operation and design tradeoffs.
`There is a wide gap between the goal of emulating conversational human
`behavior and what is commercially viable with today's speech technology. Despite
`the large amount of basic speech research around the world, there is little
`innovative work on how speech devices may be used in advanced systems, yet it is
`difficult to discuss applications without examples. To this end, the author has taken
`the liberty of providing more detail on a series of voice projects from the Speech
`Research Group of M.I.T.'s Media Laboratory (including work from one of its
`predecessors, the Architecture Machine Group). Presented as case studies, these
`projects are intended both to illustrate applications of the ideas presented in each
`chapter and to present pertinent design issues. It is hoped that, taken collectively,
`these projects will offer a vision of the many ways in which computers can take
`part in communication.
`
`1
`
`Speech as Communication
`
`Speech can be viewed in many ways. Although the chapters of this book focus on
`specific aspects of speech and the computer technologies that utilize speech, the
`reader should begin with a broad perspective on the role of speech in our daily
`lives. It is essential to appreciate the range of capabilities that conversational
`systems must possess before attempting to build them. This chapter lays the
`groundwork for the entire book by presenting several perspectives on speech
`communication.
`The first section of this chapter emphasizes the interactive and expressive
`role of voice communication. Except in formal circumstances such as lectures
`and dramatic performances, speech occurs in the context of a conversation,
`wherein participants take turns speaking, interrupt each other, nod in
`agreement, or try to change the topic. Computer systems that talk or listen
`may ultimately be judged by their ability to converse in like manner simply
`because conversation permeates human experience. The second section discusses
`the various components or layers of a conversation. Although the
`distinctions between these layers are somewhat contrived, they provide a
`means of analyzing the communication process; research disciplines have
`evolved for the study of each of these components. Finally, the last section
`introduces the representations of speech and conversation, corresponding in part
`to the layers identified in the second section. These representations provide
`abstractions that a computer program may employ to engage in a conversation
`with a human.
`
`SPEECH AS CONVERSATION
`
`Conversation is a process involving multiple participants, shared knowledge, and
`a protocol for taking turns and providing mutual feedback. Voice is our primary
`channel of interaction in conversation, and speech evolved in humans in response
`to their need to communicate with one another. It is hard to imagine many uses
`of speech that do not involve some interchange between multiple participants in
`a conversation; if we are discovered talking to ourselves, we usually feel
`embarrassed.
`For people of normal physical and mental ability, speech is both rich in
`expressiveness and easy to use. We learn it without much apparent effort as children
`and employ it spontaneously on a daily basis.¹ People employ many layers of
`knowledge and sophisticated protocols while having a conversation; until we
`attempt to analyze dialogues, we are unaware of the complexity of this interplay
`between parties.
`Although much is known about language, study of interactive speech
`communication has begun only recently. Considerable research has been done on
`natural language processing systems, but much of this is based on keyboard input.
`It is important to note the contrast between written and spoken language and
`between read or rehearsed speech and spontaneous utterances. Spoken language
`is less formal than written language, and errors in construction of spoken
`sentences are less objectionable. Spontaneous speech shows much evidence of the
`real-time processes associated with its production, including false starts,
`nonspeech noises such as mouth clicks and breath sounds, and pauses either silent or
`filled ("... um ...") [Zue et al. 1989b]. In addition, speech naturally conveys
`intonational and emotional information that fiction writers and playwrights must
`struggle to impart to written language.
`Speech is rich in interactive techniques to guarantee that the listener
`understands what is being expressed, including facial expressions, physical and vocal
`gestures, "uh-huhs," and the like. At certain points in a conversation, it is
`appropriate for the listener to begin speaking; these points are often indicated by
`longer pauses and lengthened final syllables or marked decreases in pitch at the
`end of a sentence. Each round of speech by one person is called a turn;
`interruption occurs when a participant speaks before a break point offered by the
`talker. Instead of taking a turn, the listener may quickly indicate agreement with
`a word or two, a nonverbal sound ("uh-huh"), or a facial gesture. Such responses,
`called back channels, speed the exchange and result in more effective
`conversations [Kraut et al. 1982].²
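A conversational system that wants to take its turn without interrupting has to estimate where these break points fall. The sketch below shows one simple heuristic. It assumes a front end (not described here) that delivers fixed-length frames, each marked as speech or silence and carrying a pitch estimate. It flags a candidate transition when a long pause follows a marked fall in pitch. The names `Frame`, `pitch_fell`, and `transition_points` and every threshold value are hypothetical and are not taken from the text. The lengthened-final-syllable cue is ignored.

```python
from dataclasses import dataclass

# Hypothetical parameters: none of these values come from the text.
FRAME_MS = 10        # duration of one analysis frame
MIN_PAUSE_MS = 500   # silence long enough to count as a "longer pause"
FALL_FRAMES = 5      # voiced frames averaged on each side of the pitch comparison
MIN_FALL = 0.15      # fractional pitch drop treated as a "marked decrease"


@dataclass
class Frame:
    speech: bool      # True while the talker is producing sound
    pitch_hz: float   # estimated pitch; 0.0 for unvoiced or silent frames


def pitch_fell(pitches: list[float]) -> bool:
    """True if the last few voiced frames are markedly lower than the ones before."""
    if len(pitches) < 2 * FALL_FRAMES:
        return False
    earlier = sum(pitches[-2 * FALL_FRAMES:-FALL_FRAMES]) / FALL_FRAMES
    final = sum(pitches[-FALL_FRAMES:]) / FALL_FRAMES
    return final <= earlier * (1.0 - MIN_FALL)


def transition_points(frames: list[Frame]) -> list[int]:
    """Indices of frames where a pause that may offer the turn begins."""
    points: list[int] = []
    pitches: list[float] = []  # pitch of every voiced frame seen so far
    silence_run = 0
    needed = MIN_PAUSE_MS // FRAME_MS
    for i, frame in enumerate(frames):
        if frame.speech:
            silence_run = 0
            if frame.pitch_hz > 0:
                pitches.append(frame.pitch_hz)
            continue
        silence_run += 1
        # Report each pause once, when it first reaches the required length.
        if silence_run == needed and pitch_fell(pitches):
            points.append(i - silence_run + 1)
    return points
```

A flagged point is only a candidate. The text says these cues "often" mark a break point, not always, so a system acting on them should be ready for the talker to resume.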
`Because of these interactive characteristics, speech is used for immediate
`communication needs, while writing often implies a distance, either in time or space,
`between the author and reader. Speech is used in transitory interactions or
`situations in which the process of the interaction may be as important as its result.
`For example, the agenda for a meeting is likely to be written, and a written
`summary or minutes may be issued "for the record," but the actual decisions are made
`during a conversation. Chapanis and his colleagues arranged a series of
`experiments to compare the effectiveness of several communication media (voice,
`video, handwriting, and typewriting, either alone or in combination) for
`problem-solving tasks [Ochsman and Chapanis 1974]. Their findings indicated an
`overwhelming contribution of voice for such interactions. Any experimental condition
`that included voice was superior to any excluding voice; adding other media to
`voice increased effectiveness only slightly. Although these experiments were
`simplistic in their use of student subjects and invented tasks, and more recent
`work by others [Minneman and Bly 1991] clarifies a role for video interaction,
`the dominance of voice seems unassailable.
`
`¹For a person with normal speech and hearing, spending a day without speaking is
`quite a novel experience.
`²We will return to these topics in Chapter 9.
`
`But conversation is more than mere interaction; communication often serves a
`purpose of changing or influencing the parties speaking to each other. I tell you
`something I have learned with the intention that you share my knowledge and
`hence enhance your view of the world. Or I wish to obtain some information from
`you so I ask you a question, hoping to elicit a reply. Or perhaps I seek to convince
`you to perform some activity for me; this may be satisfied either by your physical
`performance of the requested action or by your spoken promise to perform the act
`at a later time. "Speech Act" theories (to be discussed in more detail in Chapter 9)
`attempt
`to explain language as action, e.g., to request, command, query, and
`promise, as well as to inform.
`The intention behind an utterance may not be explicit. For example, "Can you
`pass the salt?" is not a query about one's ability; it is a request. Many actual
`conversations resist such purposeful classifications. Some utterances ("go ahead,"
`"uh-huh," "just a moment") exist only to guide the flow of the conversation or
`comment on the state of the discourse, rather than to convey information. Directly
`purposeful requests are often phrased in a manner allowing flexibility of
`interpretation and response. This looseness is important to the process of people
`defining and maintaining their work roles with respect to each other and establishing
`socially comfortable relationships in a hierarchical organization. The richness of
`speech allows a wide range of "acceptance" and "agreement," from wholehearted
`to skeptical to incredulous.
`Speech also serves a strong social function among individuals and is often used
`just to pass the time, tell jokes, or talk about the weather. Indeed, extended
`periods of silence among a group may be associated with interpersonal awkwardness
`or discomfort. Sometimes the actual occurrence of the conversation serves a more
`significant purpose than any of the topics under discussion. Speech may be used
`to call attention to oneself in a social setting or as an exclamation of surprise or
`dismay in which an utterance has little meaning with respect to any preceding
`conversation [Goffman 1981].
`The expressiveness of speech and robustness of conversation strongly support
`the use of speech in computer systems, both for stored voice as a data type and
`for speech as a medium of interaction. Unfortunately, current computers are
`capable of uttering only short sentences of marginal intelligibility and
`occasionally recognizing single words. Engaging a computer in a conversation can be
`like an interaction in a foreign country. One studies the phrase book, utters a
`request, and in return receives either a blank stare (wrong pronunciation, try
`again) or a torrent of fluent speech in which one cannot perceive even the word
`boundaries.
`However, limitations in technology only reinforce the need to take advantage of
`conversational techniques to ensure that the user is understood. Users will judge
`the performance of computer systems employing speech on the basis of their
`expectations about conversation developed from years of experience speaking
`with fellow humans. Users may expect computers to be deaf and dumb or,
`once they realize the system can talk and listen, expect it to speak as fluently as
`you and I. Since the capabilities of current speech technology lie between these
`extremes, building effective conversational computer systems can be very
`frustrating.
`
`
`
`HIERARCHICAL STRUCTURE OF CONVERSATION
`
`
`
`A more analytic approach to speech communication reveals a number of different
`ways of describing what actually occurs when we speak. The hierarchical structure
`of such analysis suggests goals to be attained at various stages in computer-based
`speech communication.
`Conversation requires apparatus for both listening and speaking. Effective
`communication invokes mental processes employing the mouth and ears to convey
`a message thoroughly and reliably. There are many layers at which we can
`analyze the communication process, from the lower layers where speech is
`considered primarily acoustically to higher layers that express meaning and
`intention. Each layer involves increased knowledge and potential for intelligence
`and interactivity.
`From the point of view of the speaker, we may look at speech from at least eight
`layers of processing as shown in Figure 1.1.
`
`Layers of Speech Processing
`
`discourse The regulation of conversation for pragmatic ends. This includes
`taking turns talking, the history of referents in a conversation so pronouns can
`refer to words spoken earlier, and the process of introducing new topics.
`pragmatics The intent or motivation for an utterance. This is the underlying
`reason the utterance was spoken.
`
`semantics The meaning of the words individually and their meaning as combined
`in a particular sentence.
`
`syntax The rules governing the combination of words in a sentence, their parts
`of speech, and their forms, such as case and number.
`
`lexical The set of words in a language, the rules for forming new words from
`affixes (prefixes and suffixes), and the stress ("accent") of syllables within the
`words.
`
`phonetics The series of sounds that uniquely convey the series of words in the
`sentence.
`
`articulation The motions or configurations of the vocal tract that produce the
`sounds, e.g., the tongue touching the lips or the vocal cords vibrating.
`
`acoustics The realization of the string of phonemes in the sentence as vibrations
`of air molecules to produce pressure waves, i.e., sound.
`
`Figure 1.1. A layered view of speech communication. (Speaker's layers, top to bottom: discourse, pragmatic, semantic, syntactic, lexical, phonemic, articulatory, acoustic. Listener's layers, top to bottom: discourse, pragmatic, semantic, syntactic, lexical, phonemic, perceptual, acoustic.)
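A program can represent the layered view as data by encoding Figure 1.1 as two ordered stacks. The sketch below is a minimal illustration, not a method described in this book. The names `SPEAKER_LAYERS`, `LISTENER_LAYERS`, `processing_order`, and `asymmetric_layers` are hypothetical. Reading the speaker's stack from the top down and the listener's from the bottom up is an interpretation of the figure.

```python
# Layer labels from Figure 1.1, listed from the top (discourse) to the bottom (acoustic).
SPEAKER_LAYERS = [
    "discourse", "pragmatic", "semantic", "syntactic",
    "lexical", "phonemic", "articulatory", "acoustic",
]
LISTENER_LAYERS = [
    "discourse", "pragmatic", "semantic", "syntactic",
    "lexical", "phonemic", "perceptual", "acoustic",
]


def processing_order(role: str) -> list[str]:
    """Return the layers in the order a participant is assumed to traverse them.

    Assumption: the speaker works down from discourse to acoustics, and the
    listener starts from the acoustic signal and works back up to discourse.
    """
    if role == "speaker":
        return list(SPEAKER_LAYERS)
    if role == "listener":
        return list(reversed(LISTENER_LAYERS))
    raise ValueError(f"unknown role: {role!r}")


def asymmetric_layers() -> list[tuple[str, str]]:
    """Speaker/listener layer pairs at the same level whose labels differ."""
    return [(s, l) for s, l in zip(SPEAKER_LAYERS, LISTENER_LAYERS) if s != l]


if __name__ == "__main__":
    print(" -> ".join(processing_order("speaker")))
    print(" -> ".join(processing_order("listener")))
    print(asymmetric_layers())
```

In this encoding the two stacks differ at only one level. Just above the acoustic layer, the speaker articulates while the listener perceives. Every other layer has a counterpart on the other side.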
`
`Consider two hikers walking through the forest when one hiker's shoelace
`becomes untied. The other hiker sees this and says, "Hey, you're going to trip on
`your shoelace." The listener then ties the shoelace. We can consider this utterance
`at each layer of description.
`Discourse analysis reveals that "Hey" serves to call attention to the urgency
`of the message and probably indicates the introduction of a new topic of
`conversation. It is probably spoken in a raised tone, and an observer would reasonably
`expect the listener to acknowledge this utterance, either with a vocal response or
`by tying the shoe. Experience with discourse indicates that this is an appropriate
`interruption or initiation of a conversation, at least under some circumstances.
`Discourse structure may help the listener understand that subsequent
`utterances refer to the shoelace instead of the difficulty of the terrain on which the
`conversants are traveling.
`In terms of pragmatics, the speaker's intent is to warn the listener against
`tripping; presumably the speaker does not wish the listener to fall. But this
`utterance might also have been a ruse intended to get the liste
