`
The Automatic Recognition of Gestures

Dean Harris Rubine
`
`December, 1991
CMU-CS-91-202
`
`Submitted in partial fulfillment of the requirements for the degree of
`Doctor of Philosophy in Computer Science at Carnegie Mellon University.
`
`Thesis Committee:
`Roger B. Dannenberg, Advisor
`Dario Giuse
`Brad Myers
`William A. S. Buxton, University of Toronto
`
Copyright © 1991 Dean Harris Rubine
`
`
`
`
`Abstract
`
`Gesture-based interfaces, in which the user specifies commands by simple freehand drawings,
`offer an alternative to traditional keyboard, menu, and direct manipulation interfaces. The ability
`to specify objects, an operation, and additional parameters with a single intuitive gesture makes
`gesture-based systems appealing to both novice and experienced users.
`Unfortunately, the difficulty in building gesture-based systems has prevented such systems from
`being adequately explored. This dissertation presents work that attempts to alleviate two of the
`major difficulties: the construction of gesture classifiers and the integration of gestures into direct-
`manipulation interfaces. Three example gesture-based applications were built to demonstrate this
`work.
`Gesture-based systems require classifiers to distinguish between the possible gestures a user
`may enter. In the past, classifiers have often been hand-coded for each new application, making
`them difficult to build, change, and maintain. This dissertation applies elementary statistical pattern
`recognition techniques to produce gesture classifiers that are trained by example, greatly simplifying
`their creation and maintenance. Both single-path gestures (drawn with a mouse or stylus) and
`multiple-path gestures (consisting of the simultaneous paths of multiple fingers) may be classified.
`On a 1 MIPS workstation, a 30-class single-path recognizer takes 175 milliseconds to train (once
`the examples have been entered), and classification takes 9 milliseconds, typically achieving 97%
`accuracy. A method for classifying a gesture as soon as it is unambiguous is also presented.
`This dissertation also describes GRANDMA, a toolkit for building gesture-based applications
`based on Smalltalk’s Model/View/Controller paradigm. Using GRANDMA, one associates sets of
`gesture classes with individual views or entire view classes. A gesture class can be specified at
runtime by entering a few examples of the class, typically 15. The semantics of a gesture class can be
`specified at runtime via a simple programming interface. Besides allowing for easy experimentation
`with gesture-based interfaces, GRANDMA sports a novel input architecture, capable of supporting
multiple input devices and multi-threaded dialogues. The notions of virtual tools and semantic
feedback are shown to arise naturally from GRANDMA’s approach.
`
`
`
`
`Acknowledgments
`
`First and foremost, I wish to express my enormous gratitude to my advisor, Roger Dannenberg.
Roger was always there when I needed him, never failing to come up with a fresh idea. In
retrospect, I should have availed myself more than I did. In his own work, Roger always addresses
`fundamental problems, and his solutions are always simple and elegant. I try to follow Roger’s
`example in my own work, usually falling far short. Roger, thank you for your insight and your
`example. Sorry for taking so long.
`I was incredibly lucky that Brad Myers showed up at CMU while I was working on this research.
`His seminar on user interface software gave me the knowledge and breadth I needed to approach the
`problem of software architectures for gesture-based systems. Furthermore, his extensive comments
`on drafts of this document improved it immensely. Much of the merit in this work is due to him.
`Thank you, Brad.
`I am also grateful to Bill Buxton and Dario Giuse, both of whom provided
`valuable criticism and excellent suggestions during the course of this work.
`It was Paul McAvinney’s influence that led me to my thesis topic; had I never met him, mine
`would have been a dissertation on compiler technology. Paul is an inexhaustible source of ideas,
`and this thesis is really the second idea of Paul’s that I’ve spent multiple years pursuing. Exploring
`Paul’s ideas could easily be the life’s work of hundreds of researchers. Thanks, Paul, you madman
`you.
`My wife Ruth Sample deserves much of the credit for the existence of this dissertation. She
`supported me immeasurably, fed me and clothed me, made me laugh, motivated me to finish, and
`lovingly tolerated me the whole time. Honey, I love you. Thanks for everything.
I could not have done it without the love and support of my parents, Shirley and Stanley, my brother
`Scott, my uncle Donald, and my grandma Bertha. For years they encouraged me to be a doctor, and
`they were not the least bit dismayed when they found out the kind of doctor I wanted to be. They
`hardly even balked when “just another year” turned out to be six. Thanks, folks, you’re the best. I
`love you all very much.
`My friends Dale Amon, Josh Bloch, Blaine Burks, Paul Crumley, Ken Goldberg, Klaus Gross,
`Gary Keim, Charlie Krueger, Kenny Nail, Eric Nyberg, Barak Pearlmutter, Todd Rockoff, Tom
`Neuendorffer, Marie-Helene Serra, Ellen Siegal, Kathy Swedlow, Paul Vranesevic, Peter Velikonja,
`and Brad White all helped me in innumerable ways, from technical assistance to making life worth
`living. Peter and Klaus deserve special thanks for all the time and aid they’ve given me over the
`years. Also, Mark Maimone and John Howard provided valuable criticism which helped me prepare
`for my oral examination. I am grateful to you all.
`
`
`I wish to also thank my dog Dismal, who was present at my feet during much of the design,
`implementation, and writing efforts, and who concurs on all opinions. Dismal, however, strongly
`objects to this dissertation’s focus on human gesture.
`I also wish to acknowledge the excellent environment that CMU Computer Science provides;
`none of this work would have been possible without their support. In particular, I’d like to thank
`Nico Habermann and the faculty for supporting my work for so long, and my dear friends Sharon
`Burks, Sylvia Berry, Edith Colmer, and Cathy Copetas.
`
`5
`
`
`
`Contents
`
`
`
`
`
`1 Introduction
`1.1 An Example Gesture-based Application
`1.1.1 GDP from the user’s perspective
`1.1.2 Using GRANDMA to Design GDP’s Gestures
`1.2 Glossary
`1.3 Summary of Contributions
`1.4 Motivation for Gestures
`1.5 Primitive Interactions
`1.6 The Anatomy of a Gesture
`1.6.1 Gestural motion
`1.6.2 Gestural meaning
`1.7 Gesture-based systems
`1.7.1 The four states of interaction
`1.8 A Comparison with Handwriting Systems
`1.9 Motivation for this Research
`1.10 Criteria for Gesture-based Systems
`1.10.1 Meaningful gestures must be specifiable
`1.10.2 Accurate recognition
`1.10.3 Evaluation of accuracy
`1.10.4 Efficient recognition
`1.10.5 On-line/real-time recognition
`1.10.6 General quantitative application interface
`1.10.7 Immediate feedback
`1.10.8 Context restrictions
`1.10.9 Efficient training
`1.10.10 Good handling of misclassifications
`1.10.11 Device independence
`1.10.12 Device utilization
`1.11 Outline
`1.12 What Is Not Covered
`
`
`
`
`
`1
`2
`2
`4
`7
`8
`9
`12
`12
`12
`13
`14
`15
`16
`17
`18
`18
`18
`19
`19
`19
`19
`20
`20
`20
`20
`20
`21
`21
`22
`
`v
`
`6
`
`
`
`vi
`
`CONTENTS
`
`
`
`2 Related Work
`2.1
`Input Devices
`2.2 Example Gesture-based Systems
`2.3 Approaches for Gesture Classification
`2.3.1 Alternatives for Representers
`2.3.2 Alternatives for Deciders
`2.4 Direct Manipulation Architectures
`2.4.1 Object-oriented Toolkits
`
`
`
`
`
`3 Statistical Single-Path Gesture Recognition
`3.1 Overview
`3.2 Single-path Gestures
`3.3 Features
`3.4 Gesture Classification
`3.5 Classifier Training
`3.5.1 Deriving the linear classifier
`3.5.2 Estimating the parameters
`3.6 Rejection
`3.7 Discussion
`3.7.1 The features
`3.7.2 Training considerations
`3.7.3 The covariance matrix
`3.8 Conclusion
`
`
`
`
`
`4 Eager Recognition
`4.1
`Introduction
`4.2 An Overview of the Algorithm
`4.3
`Incomplete Subgestures
`4.4 A First Attempt
`4.5 Constructing the Recognizer
`4.6 Discussion
`4.7 Conclusion
`
`
`
`
`
`
`
`5 Multi-Path Gesture Recognition
`5.1 Path Tracking
`5.2 Path Sorting
`5.3 Multi-path Recognition
`5.4 Training a Multi-path Classifier
`5.4.1 Creating the statistical classifiers
`5.4.2 Creating the decision tree
`5.5 Path Features and Global Features
`5.6 A Further Improvement
`5.7 An Alternate Approach: Path Clustering
`
`
`
`
`
`
`
`25
`25
`28
`34
`35
`37
`41
`43
`
`47
`47
`48
`49
`53
`55
`55
`58
`59
`61
`62
`63
`63
`65
`
`67
`67
`68
`69
`71
`72
`76
`78
`
`79
`79
`81
`83
`85
`85
`86
`86
`87
`88
`
`7
`
`
`
`CONTENTS
`
`5.7.1 Global features without path sorting
`5.7.2 Multi-path recognition using one single-path classifier
`5.7.3 Clustering
`5.7.4 Creating the decision tree
`5.8 Discussion
`5.9 Conclusion
`
`
`
`
`
`6 An Architecture for Direct Manipulation
`6.1 Motivation
`6.2 Architectural Overview
`6.2.1 An example: pressing a switch
`6.2.2 Tools
`6.3 Objective-C Notation
`6.4 The Two Hierarchies
`6.5 Models
`6.6 Views
`6.7 Event Handlers
`6.7.1 Events
`6.7.2 Raising an Event
`6.7.3 Active Event Handlers
`6.7.4 The View Database
`6.7.5 The Passive Event Handler Search Continues
`6.7.6
`Passive Event Handlers
`6.7.7
`Semantic Feedback
`6.7.8 Generic Event Handlers
`6.7.9 The Drag Handler
`6.8 Summary of GRANDMA
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`7 Gesture Recognizers in GRANDMA
`7.1 A Note on Terms
`7.2 Gestures in MVC systems
`7.2.1 Gestures and the View Class Hierarchy
`7.2.2 Gestures and the View Tree
`7.3 The GRANDMA Gesture Subsystem
`7.4 Gesture Event Handlers
`7.5 Gesture Classification and Training
`7.5.1 Class Gesture
`7.5.2 Class GestureClass
`7.5.3 Class GestureSemClass
`7.5.4 Class Classifier
`7.6 Manipulating Gesture Event Handlers at Runtime
`7.7 Gesture Semantics
`7.7.1 Gesture Semantics Code
`
`
`
`
`
`
`7.7.2 The User Interface
`7.7.3
`Interpreter Implementation
`7.8 Conclusion
`
`
`
`8 Applications
`8.1 GDP
`8.1.1 GDP’s gestural interface
`8.1.2 GDP Implementation
`8.1.3 Models
`8.1.4 Views
`8.1.5 Event Handlers
`8.1.6 Gestures in GDP
`8.2 GSCORE
`8.2.1 A brief description of the interface
`8.2.2 Design and implementation
`8.3 MDP
`8.3.1
`Internals
`8.3.2 MDP gestures and their semantics
`8.3.3 Discussion
`8.4 Conclusion
`
`9 Evaluation
`9.1 Basic single-path recognition
`9.1.1 Recognition Rate
`9.1.2 Rejection parameters
`9.1.3 Coverage
`9.1.4 Varying orientation and size
`9.1.5
`Interuser variability
`9.1.6 Recognition Speed
`9.1.7 Training Time
`9.2 Eager recognition
`9.3 Multi-finger recognition
`9.4 GRANDMA
`9.4.1 The author’s experience with GRANDMA
`9.4.2 A user uses GSCORE and GRANDMA
`
`
`
`10 Conclusion and Future Directions
`10.1 Contributions
`10.1.1 New interactions techniques
`10.1.2 Recognition Technology
`10.1.3 Integrating gestures into interfaces
`10.1.4 Input in Object-Oriented User Interface Toolkits
`10.2 Future Directions
`
`150
`156
`162
`
`163
`163
`164
`164
`166
`166
`167
`168
`170
`170
`173
`181
`181
`188
`193
`194
`
`195
`195
`195
`201
`205
`205
`208
`213
`216
`218
`221
`222
`222
`223
`
`225
`225
`225
`226
`227
`228
`228
`
`9
`
`
`
`CONTENTS
`
`10.3 Final Remarks
`
`A Code for Single-Stroke Gesture Recognition and Training
`A.1 Feature Calculation
`A.2 Deriving and Using the Linear Classifier
`A.3 Undefined functions
`
`
`
`
`List of Figures
`
`1.1 Proofreader’s Gesture (from Buxton [15])
`1.2 GDP, a gesture-based drawing program
`1.3 GDP’s View class hierarchy and associated gestures
`1.4 Manipulating gesture handlers at runtime
`1.5 Adding examples of the delete gesture
`1.6 Macintosh Finder, MacDraw, and MacWrite (from Apple [2])
`
`
`
`
`
`
`
`
`
`
`
`2.1 The Sensor Frame
`2.2 The DataGlove, Dexterous Hand Master, and PowerGlove (from Eglowstein [32])
`2.3 Proofreading symbols (from Coleman [25])
`2.4 Note gestures (from Buxton [21])
2.5 Button Box (from Minsky [86])
`2.6 A gesture-based spreadsheet (from Rhyne and Wolf [109])
`2.7 Recognizing flowchart symbols
`2.8 Sign language recognition (from Tamura [128])
`2.9 Copying a group of objects in GEdit (from Kurtenbach and Buxton [75])
`2.10 GloveTalk (from Fels and Hinton [34])
`2.11 Basic PenPoint gestures (from Carr [24])
`2.12 Shaw’s Picture Description Language
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`3.1 Some example gestures
`3.2 Feature calculation
`3.3 Feature vector computation
`3.4 Two different gestures with identical feature vectors
`3.5 A potentially troublesome gesture set
`
`
`
`
`
`
`
`4.1 Eager recognition overview
`4.2
`Incomplete and complete subgestures of U and D
`4.3 A first attempt at determining the ambiguity of subgestures
`4.4 Step 1: Computing complete and incomplete sets
`4.5 Step 2: Moving accidentally complete subgestures
`4.6 Accidentally complete subgestures have been moved
`4.7 Step 3: Building the AUC
`
`
`
`
`
`4.8 Step 4: Tweaking the classifier
`4.9 Classification of subgestures of U and D
`
`5.1 Some multi-path gestures
`5.2
`Inconsistencies in path sorting
`5.3 Classifying multi-path gestures
`5.4 Path Clusters
`
`
`
`6.1 GRANDMA’s Architecture
`6.2 The Event Hierarchy
`
`7.1 GRANDMA’s gesture subsystem
`7.2 Passive Event Handler Lists
`7.3 A Gesture Event Handler
`7.4 Window of examples of a gesture class
`7.5 The interpreter window for editing gesture semantics
`7.6 An empty message and a selector browser
`7.7 Attributes to use in gesture semantics
`
`
`
`
`
`
`
`8.1 GDP gestures
`8.2 GDP’s class hierarchy
`8.3 GSCORE’s cursor menu
`8.4 GSCORE’s palette menu
`8.5 GSCORE gestures
`8.6 A GSCORE session
`8.7 GSCORE’s class hierarchy
`8.8 An example MDP session
`8.9 MDP internal structure
`8.10 MDP gestures
`
`
`
`
`
`
`
`9.1 GSCORE gesture classes used for evaluation
`9.2 Recognition rate vs. number of classes
`9.3 Recognition rate vs. training set size
`9.4 Misclassified GSCORE gestures
`9.5 A looped corner
`9.6 Rejection parameters
`9.7 Counting correct and incorrect rejections
9.8 Correctly classified gestures with d² > 90
9.9 Correctly classified gestures with P̂ < .95
`9.10 Recognition rates for various gesture sets
`9.11 Classes used to study variable size and orientation
`9.12 Recognition rate for set containing classes that vary
`9.13 Mistakes in the variable class test
`9.14 Testing program (user’s gesture not shown)
`
`
`
`9.15 PV’s misclassified gestures (author’s set)
`9.16 PV’s gesture set
`9.17 The performance of the eager recognizer on easily understood data
`9.18 The performance of the eager recognizer on GDP gestures
`9.19 PV’s task
`9.20 PV’s result
`
`
`
`
`
`
`List of Tables
`
`9.1 Speed of various computers used for testing
`9.2 Speed of feature calculation
`9.3 Speed of Classification
`9.4 Speed of classifier training
`
`214
`214
`215
`217
`
`xv
`
`16
`
`
`
`xvi
`XVi
`
`LIST OF TABLES
`LIST OF TABLES
`
`17
`
`17
`
`
`
to Grandma Bertha
`
`
`
`
`Chapter 1
`
`Introduction
`
`People naturally use hand motions to communicate with other people. This dissertation explores the
`use of human gestures to communicate with computers.
`Random House [122] defines “gesture” as “the movement of the body, head, arms, hands, or
`face that is expressive of an idea, opinion, emotion, etc.” This is a rather general definition, which
`characterizes well what is generally thought of as gesture. It might eventually be possible through
`computer vision for machines to interpret gestures, as defined above, in real time. Currently such
`an approach is well beyond the state of the art in computer science.
`Because of this, the term “gesture” usually has a restricted connotation when used in the context
`of human-computer interaction. There, gesture refers to hand markings, entered with a stylus or
`mouse, which function to indicate scope and commands [109]. Buxton [14] gives a fine example,
`reproduced here as figure 1.1.
`In this dissertation, such gestures are referred to as single-path
`gestures.
`Recently, input devices able to track the paths of multiple fingers have come into use. The
`Sensor Frame [84] and the DataGlove [32, 130] are two examples. The human-computer interaction
`community has naturally extended the use of the term “gesture” to refer to hand motions used to
`indicate commands and scope, entered via such multiple finger input devices. These are referred to
`here as multi-path gestures.
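
To make these two notions concrete, the sketch below gives one possible C representation; the type names are illustrative only, not taken from GRANDMA. A single-path gesture is a time-ordered array of positioned samples, and a multi-path gesture is simply an array of such paths, one per finger.

    /* A minimal sketch of gesture representations, assuming integer
       screen coordinates and millisecond timestamps. These names are
       hypothetical, chosen for this illustration. */

    typedef struct {
        int x, y;          /* position of the mouse, stylus, or finger */
        long t;            /* time of the sample, in milliseconds */
    } GesturePoint;

    typedef struct {
        GesturePoint *p;   /* samples, in the order they were entered */
        int n;             /* number of samples */
    } GesturePath;         /* a single-path gesture is one GesturePath */

    typedef struct {
        GesturePath *paths; /* one path per finger, e.g. from a Sensor Frame */
        int npaths;
    } MultiPathGesture;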
Figure 1.1: Proofreader’s Gesture (from Buxton [15])

Rather than defining gesture more precisely at this point, the following section describes an
example application with a gestural interface. A more technical definition of gesture will be
presented in section 1.6.

Figure 1.2: GDP, a gesture-based drawing program
`
`1.1 An Example Gesture-based Application
`
`GRANDMA is a toolkit used to create gesture-based systems. It was built by the author and is
`described in detail in the pages that follow. GRANDMA was used to create GDP, a gesture-based
`drawing editor loosely based on DP [42]. GDP provides for the creation and manipulation of lines,
`rectangles, ellipses, and text. In this section, GDP is used as an example gesture-based system.
`GDP’s operation is presented first, followed by a description of how GRANDMA was used to create
`GDP’s gestural interface.
`
`1.1.1 GDP from the user’s perspective
GDP’s operation from a user’s point of view will now be described. (GDP’s design and
implementation is presented in detail in Section 8.1.) The intent is to give the reader a concrete
example of a gesture-based system before embarking on a general discussion of such systems.
Furthermore, the description of GDP serves to illustrate many of GRANDMA’s capabilities. A new
interaction technique, which combines gesture and direct manipulation in a single interaction, is
also introduced in the description.
`
`21
`
`
`
`1.1. AN EXAMPLE GESTURE-BASED APPLICATION
`
`3
`
`Figure 1.2 shows some snapshots of GDP in action. When first started, GDP presents the user
`with a blank window. Panel (a) shows the rectangle gesture being entered. This gesture is drawn
like an “L.” The user begins the gesture by positioning the mouse cursor and pressing a mouse
`button. The user then draws the gesture by moving the mouse.
The gesture is shown on the screen as it is being entered. This technique is called inking [109],
`and provides valuable feedback to the user. In the figure, inking is shown with dotted lines so that
`the gesture may be distinguished from the objects in the drawing. In GDP, the inking is done with
`solid lines, and disappears as soon as the gesture has been recognized.
`The end of the rectangle gesture is indicated in one of two ways. If the user simply releases
the mouse button immediately after drawing the “L,” a rectangle is created, one corner of which is at
`the start of the gesture (where the button was first pressed), with the opposite corner at the end of
`the gesture (where the button was released). Another way to end the gesture is to stop moving the
`mouse for a given amount of time (0.2 seconds works well), while still pressing the mouse button.
`In this case, a rectangle is created with one corner at the start of the gesture, and the opposite corner
`at the current mouse location. As long as the button is held, that corner is dragged by the mouse,
`enabling the size and shape of the rectangle to be determined interactively.
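
The two ways of ending a gesture amount to a small state machine, sketched below in C. The event primitives (next_event, ink, and the three handlers) are assumed for illustration; this is a reconstruction of the interaction just described, not GDP’s actual code.

    #include <stdbool.h>

    #define STILL_MS 200   /* 0.2 seconds of stillness ends collection */

    typedef enum { MOUSE_MOVE, MOUSE_UP } EventKind;
    typedef struct { EventKind kind; int x, y; long t; } Event;

    /* Assumed primitives, supplied by the window system or toolkit: */
    extern bool next_event(Event *e, long timeout_ms); /* false on timeout */
    extern void ink(int x, int y);          /* echo the stroke on screen */
    extern void classify_gesture(void);     /* run the recognizer */
    extern void begin_manipulation(void);   /* drag phase, button still held */
    extern void complete_interaction(void); /* finish the new shape at once */

    /* Called on button press; collects points until the gesture ends. */
    void collect_gesture(void)
    {
        Event e;
        for (;;) {
            if (!next_event(&e, STILL_MS)) {
                classify_gesture();    /* mouse held still: recognize now */
                begin_manipulation();  /* drag until the button is released */
                return;
            }
            if (e.kind == MOUSE_UP) {
                classify_gesture();    /* button released: recognize and */
                complete_interaction();/* complete in one shot */
                return;
            }
            ink(e.x, e.y);  /* inking, erased once the gesture is recognized */
            /* ...the point (e.x, e.y, e.t) is also appended to the gesture... */
        }
    }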
`Panel (b) of figure 1.2 shows the rectangle that has been created and the ellipse gesture. This
`gesture creates an ellipse with its center at the start of the gesture. A point on the ellipse tracks the
`mouse after the gesture has been recognized; this gives the user interactive control over the size and
`eccentricity of the ellipse.
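
One simple way to realize this control, assumed here purely for illustration since the text does not spell out GDP’s formula, keeps the tracked mouse position exactly on an axis-aligned ellipse centered at the gesture’s start: choosing semi-axes of sqrt(2) times the horizontal and vertical distances to the mouse makes the tracked point satisfy (dx/a)² + (dy/b)² = 1/2 + 1/2 = 1.

    #include <math.h>

    /* Illustrative only: semi-axes a, b of an axis-aligned ellipse
       centered at (cx, cy) such that the tracked mouse point (mx, my)
       lies exactly on the ellipse. */
    void ellipse_from_mouse(double cx, double cy, double mx, double my,
                            double *a, double *b)
    {
        *a = sqrt(2.0) * fabs(mx - cx);  /* horizontal semi-axis */
        *b = sqrt(2.0) * fabs(my - cy);  /* vertical semi-axis */
    }

Under this choice, moving the mouse along a diagonal scales the ellipse uniformly, while moving it toward an axis makes the ellipse more eccentric.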
`Panel (c) shows the created ellipse, and a line gesture. Similar to the rectangle and the ellipse, the
`start of the gesture determines one endpoint of the newly created line, and the mouse position after
`the gesture has been recognized determines the other endpoint, allowing the line to be rubberbanded.
`Panel (d) shows all three shapes being encircled by a pack gesture. This gesture packs (groups)
`all the objects which it encloses into a single composite object, which can then be manipulated as
`a unit. Panel (e) shows a copy gesture being made; the composite object is copied and the copy is
`dragged by the mouse.
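
Deciding which objects a pack gesture encloses is a point-in-polygon question. One common approach, assumed here and not necessarily GDP’s, treats the gesture’s sampled points as the vertices of a closed polygon and applies the standard even-odd ray-crossing test to a reference point of each object:

    #include <stdbool.h>

    typedef struct { double x, y; } Pt;

    /* Even-odd ray-crossing test: does the closed polygon formed by the
       gesture's n sampled points contain the point q? */
    bool gesture_encloses(const Pt *poly, int n, Pt q)
    {
        bool inside = false;
        for (int i = 0, j = n - 1; i < n; j = i++) {
            if ((poly[i].y > q.y) != (poly[j].y > q.y) &&
                q.x < (poly[j].x - poly[i].x) * (q.y - poly[i].y) /
                          (poly[j].y - poly[i].y) + poly[i].x)
                inside = !inside;
        }
        return inside;
    }

    /* An object might be considered enclosed when, say, the center of
       its bounding box passes this test. */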
`Panel (f) shows the rotate-and-scale gesture. The object is made to rotate around the starting
`point of the gesture; a point on the object is dragged by the mouse, allowing the user to interactively
determine the size and orientation of the object.