`
`Data Mining and Knowledge Discovery Series
`
`Constrained Clustering
`Advances in Algorithms,
`, Theory, and Applications
`
`-
`
`Edited by
`
`Sugato Basu ' Ian Davidson
`Kiri L. Wagstaff
`
`CRC Press
`Taylor& Francis Group
`BocaRaton London New York
`
`CRC Press is an imprint of the
`Taylor 81: Francis Group, an lnfonna business
`A CHAPMAN 8t HALL BOOK
`
`1
`
`-
`
`APPLE 1017
`
`APPLE 1017
`
`1
`
`
`
`QC ’09 I7?
`Cover image shows the result of clustering Migegral image of Mars using soft constraints to
`
`A 23%
`
`impose spatial contiguity on cluster assignments. The data set was collected by the Space Telescope
`Imaging Spectrograph (STIS) on the Hubble Space Telescope. This image was reproduced with permis-
`sion from Intelligent Clustering with Instance—Level Constraints by Kiri Wagstaff.
`
`Chapman & Hali/CRC
`Taylor 8: Francis Group
`6000 Broken Sound Parkway NW, Suite 300
`Boca Raton, FL 33487-2742
`
`© 2009 by Taylor 8: Francis Group, LLC
`Chapman 8: Hall/CRC is an imprint of Taylor 8: Francis Group. an Informa busineSs
`
`No claim to original US. Government works
`Printed in the United States of America on acid‘free paper
`10 9 8 7 6 5 4 3 2 1
`
`International Standard Book Number-13: 978-1-58488-996-0 (Hardcover)
`
`This book contains information obtained from authentic and highly regarded sources. Reasonable
`efforts have been made to publish reliable data and information, but the author and publisher cannot
`assume responsibility for the validity ofall materials or the consequences of their use. The authors and
`publishers have attempted to trace the copyright holders of all material reproduced in this publication
`and apologize to copyright holders if permission to publish in this form has not been obtained. If any
`copyright material has not been acknowledged please write and let us know so we may rectifyin any
`future reprint.
`
`Except as permitted under US. Copyright Law, no part of this book may be reprinted. reproduced,
`transmitted, or utilized in any form by any electronic. mechanical. or other means. now known or here
`after invented, including photocopying, microfilming, and recording, or in any information storage or
`retrieval system, without written permission from the publishers.
`
`For permission to photocopy or use material electronically from this work, please access www.copy-
`right.com (http://www.copyright.coml) or contact the Copyright Clearance Center, Inc. (CCC), 222
`Rosewood Drive, Danvers, MA 01923, 978 -750- 8400. CCC is a not-for-profit organization that provides
`licenses and registration for a variety of users. For organizations that have been granted a photocopy
`license by the CCC, a separate system of payment has been arranged
`
`Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
`used only for identification and explanation without intent to infringe.
`
`Library of Congress Cataloging—in-Publication Data
`
`Constrained clustering: advances in algorithms, theory, and applications] editors,
`Sugato Basu, Ian Davidson, Kiri Wagstaff.
`p. cm. —- (Chapman & Hali/CRC data mining and knowledge discovery series)
`Includes bibliographical references and index.
`ISBN 978—1—58488-996-0 (hardback : alk. paper)
`1. Cluster analysis-—Data processing. 2. Data mining. 3. Computer algorithms. 1.
`Basu, Sugato. ll. Davidson, Ian, 1971- [11. Wagstaff, Kiri. IV. Title. V. Series. _
`
`QA278.C63 2008
`519.5'3—-dc22
`
`-
`
`2008014590
`
`Visit the Taylor 8: Francis Web site at
`http://www.taylora ndfrancls.com
`
`and the CRC Press Web site at
`
`http:/Iwww.crcpress.com
`
`2
`
`
`
`Chapter 2
`
`. Semi-Supervised Clustering with
`User Feedback
`
`David Cohn
`
`Google, Inc., cothgoogle . com
`
`Rich Caruana
`
`Cornell University, caruanans . cornell . edu
`
`Andrew Kachites McCallum
`
`University of Massachusetts, Amherst, mccallqucs .umass.edu
`
`Abstract We present an approach to clustering based on the observa-
`tion that “it is easier to criticize than to construct.” Our approach of semi-
`superuised clustering allows a user to iteratively provide feedback to a clus-
`tering algorithm. The feedback is incorporated in the form of constraints,
`which the clustering algorithm attempts to satisfy on future iterations. These
`constraints allow the user to guide the clusterer toward clusterings of the data
`that the user finds more useful. We demonstrate semi-supervised clustering
`with a system that learns to cluster news stories from a Reuters data set.1
`
`2.1 Introduction
`
`Consider the following problem: you are given 100,000 text documents (e.g.,
`papers, newsgroup articles, or web pages) and asked to group them into classes
`or into a hierarchy such that related documents are grouped together. You
`are not told what classes or hierarchy to use or what documents are related.
`Your job is simply. to create this taxonomy so that the documents can be
`browsed and accessed efficiently, either by yourself or by other people. While
`
`1This work was originally circulated as an unpublished manuscript [4] when all the authors
`were at Justsystem Pittsburgh Research Center.
`
`17
`
`3
`
`
`
`18 Constrained Clustering: Advances in Algorithms, Theory, and Applications
`
`you may have some criterion in mind, you would probably be hard-pressed to
`express it algorithmically.
`This problem is ubiquitous. The web has created a number of new examples
`of it, but it can be found in many fields that don’t involve the web, as well as
`with many different types of “documents.” Librarians, astronomers, biologists
`— practically everyone tasked with creating a taxonomy from data faces this
`problem in one form or another.
`We propose the following iterative solution to this problem:
`
`1. Give the 100,000 documents to an unsupervised clustering algorithm
`and have it cluster them.
`
`2. Browse the resulting clusters and tell the system which clusters you like,
`and which clusters you don’t like. Don’t do this for all theiclusters, just '
`for some of the ones you browsed. Provide feedback to the system by
`saying “This document doesn’t belong in here,” “Move this document to
`that cluster, ” or “These two documents shouldn’t be {or should be) .in
`the same cluster.”
`
`,
`
`Don’t do this for all, or even many, of the documents; only for the few
`that look most out of place.
`
`‘3. After your critique, re—cluster the documents, allowihg’the clustering
`algorithm to modify the the distance metric parameters to try to find a
`new clustering that satisfies the constraints you provided in the critique.
`
`4. Repeat this until you are happy with the clustering.
`
`This solution is distinct from both traditional supervised and unsupervised
`learning. Unsupervised clustering takes an unlabeled collection of data and,
`without intervention or additional knowledge, partitions it into sets of ex-
`amples such that examples within clusters are more “similar” than examples
`between clusters. Much work in unsupervised clustering is dedicated to the
`problem of manually engineering similarity criteria that yield good partition-
`ing of data for a given domain.
`.
`Supervised learning, on the other hand, assumes that the class structure or
`hierarchy already is known. It takes a set of examples with class labels, and
`returns a function that maps examples to class labels. The goal of supervised
`learning is to learn mappings that are accurate enough to be useful when
`classifying new examples, and perhaps to learn mappings that allow users to
`understand the relationships between the data and the labels, such as which
`features are important.
`
`Semi-supemised clustering falls between the extremes of totally unsuper-
`vised clustering and totally supervised learning. The main goal of our ap—
`proach to semi—supervised clustering is to allow a human to “steer” the clus—
`tering process so that examples can be partitioned into a useful set of clusters
`with minimum time and human effort. A secondary goal of semi-supervised
`
`4
`
`
`
`Semi—Supervised Clustering with User Feedback
`
`19
`
`clustering is to give the user a. way to interact and play with the data so that
`they can understand it better.2
`Our approach to semi-supervised clustering assumes that the human user
`has in their mind criteria that enable them to evaluate the duality of a cluster-
`ing. It does not assume that the user is conscious of what they think defines
`a good clustering but that, as with art, they will “know it when they see it.”
`Most importantly, semi-supervised clustering never expects a user to write a
`function that defines the clustering criterion. Instead, the user interacts with
`the clustering system, which attempts to learn a criterion that yields clusters
`the user is satisfied with. As such, one of the primary challenges of semi—
`supervised clustering is finding ways to elicit and make use of user' feedback
`during clustering.
`‘ The remainder of this chapter describes one simple, illustrative way in which
`this may be accomplished. Other challenges that need to be addressed by
`future research on semi-supervised clustering are briefly described in the dis-
`cussion section.
`
`2.1.1 Relation to Active Learning
`
`Semi-supervised clustering with user feedback is closely related to active
`learning [5]. In the most common form of active learning, a learning system
`attempts to identify which data points, if labeled by a human, would be most
`informative. In semi-supervised clustering, the human selects the data points,
`and puts on them a wide array of possible constraints instead of labels. These
`two key differences point toward some situations in which the semi-supervised
`approach is preferable.
`
`1. In some clustering problems the desired similarity metric may be so dif-
`ferent from the default that traditional active learning would make many
`inefficient queries. This problem also arises when there are many differ—
`ent plausible clusterings. Although less automated, a human browsing
`the data would do less work by selecting the feedback data points them-
`self.
`‘
`
`2. The intuitive array of possible constraints are easier to apply than labels,
`especially when the final clusters are not known in advance.
`
`3. The very act of human browsing can lead to the discovery of what
`clusters are desired. Semi—supervised learning can thus be seen as a
`method of data exploration and pattern discovery, efficiently aided by
`cluster—based summarization.
`
`[7] independently introduced a. semi-supervised clustering model similar to
`2Derniriz et a1.
`the one we describe here. The main distinction between our work and theirs is our use of
`iterative feedback to acquire labelings; Demiriz et a1. assume that all available labels are
`given a. priori.
`
`5
`
`
`
`20 Constrained Clustering: Advances in Algorithms, Theory, and Applications
`
`However, the distinction with active learning is subjective. As we will see
`in Section 2.5.1, our system could easily be viewed as a practical application
`of learning by counterexamples [1] — one of the earliest and most powerful
`forms of active learning studied in the theory community.
`
`Hybrid active—semi-supervised systems are also plausible. In situations with
`
`a large number of data points and data types that are difficult to browse, one
`could imagine a system that combines some of the automated selection of
`active learning with the human browsing of semi-supervised clustering. The
`active learner could make many disparate hypotheses about the underlying
`labels and present the examples that would be most indicative of each.
`
`2.2 Clustering
`
`Formally, clustering is the process of partitioning a data set into subsets
`such that all members of a given subset are “similar” according to some dis-
`tance measure D. We will denote the distance between two examples m1 and
`502 as D(:r1, :32). We can generalize this to refer to D(y1, 312), the distance be-
`tween two cluster centers, or D(y1,a:1), the distance between a cluster center
`and an example.
`The two most popular approaches to clustering are agglomerative cluster—
`ing and prototype-based clustering. In agglomerative clustering, each datum
`is initially placed in its own cluster. The clusters that are most similar (ac-
`cording to D) are iteratively merged, until the desired number of clusters
`is reached, or some limit on data likelihood or distortion is exceeded (see
`Hofmann and Buhmann [14] for an in—depth treatment of agglomerative clus-
`tering).
`’
`In prototype-based clustering, the final number of clusters is usually set a
`priori, and the corresponding prototypes are found using some form of Expec—
`tation Maximization (EM) [8]. Each prototype is initialized to some position
`- (in our case, a randomly weighted sample of the training points). Examples
`are assigned to prototypes according to their similarity to each prototype (the
`assignment may be 0-1 or fractional, depending on the algorithm). Prototypes
`are then adjusted to maximize the data likelihood, or, equivalently, minimize _
`the data distortion. The assignment/ adjustment process .is repeated until no
`significant changes result (see Meila and Heckerman [17} for concise review of
`prototype—based clustering).
`In the present chapter, we adopt a statistical prototype—based approach,
`resulting from the naive Bayes model of document generation [16].3 Given a
`
`3We reiterate that the approach described in this chapter is only for the point of exploration
`and illustration; the approach is, in theory, applicable to almost any clustering algorithm.
`
`6
`
`
`
`Semi—Supervised Clustering with User Feedback
`
`21
`
`vocabulary V, a document is assumed to be a “bag of words” generated from
`a multinomial distribution 0. In this model, the probability of document at is
`
`P($) : H P(tj|9)N(t"’“),
`tjEV
`
`where P (t,- |6) is the parameterized probability of term tj being generated, and
`N (tj, :12) is the number of times t, appears in the document. Each document
`at forms an estimate ‘of a multinomial distribution 6x; likewise, each cluster of
`documents 1r forms an estimate 6,, composed from the 03. of its constituent
`documents.4
`'
`
`For clustering we assume that, instead of being produced by a single multi—
`nomial distribution, each of the observed documents was drawn from one of
`distributions raven, .
`.
`. ,BM, corresponding to the unknown distribution of
`clusters 7r1,7r2, .
`. . ,7rk:
`
`Pe) = Dam-weir» : 2 PM.) 11 P(tjl6m)”“j"‘)-
`i
`i
`tj EV
`
`Our task is to estimate values for P(7r,-) and 6,7,, which will in turn allow us
`to estimate cluster memberships P(1r,-|a:) by Bayes rule:
`
`P(1r,-|:1:) = P(m|1r,:)P(1r,:)/P(:c).
`
`-
`
`-
`
`(2.1)
`
`We find estimates for P(1r,-) and 9,“. via the standard procedure for EM,
`beginning with randomized estimates of 67,, drawn as a weighted sample from
`the observations. Then, for each cluster 7r,- and document :12, we compute
`P(:t:|9,,,) and apply Equation 2.1 to compute P(7r,:|:r). Each cluster is given
`partial ownership of a document proportional to P(1r,-[:z:). The parameters
`0,” are recomputed as the weighted sum of their component documents, and
`the process is repeated. The algorithm is guaranteed to converge to a locally
`optimal clustering (see, e.g., MacK-ay [15] or Meila and Heckerman [17] 'for
`details).
`
`2.3 Semi-Supervised Clustering
`
`The goodness of any clustering depends on how well the metric D matches
`the user’s (perhaps unknown) internal model of the target domain. We pro-
`pose allowing the user to impose their model on the metric via the clustering
`
`4The estimates for term probabilities are derived from the relative term frequencies in the
`documents. Following McCallum and Nigam [16], we smooth with a LaPlaCean|prior to
`avoid zero term probabilities.
`
`7
`
`
`
`22 Constrained Clustering: Advances in Algorithms, Theory, and. Applications
`
`
`
`
`FIGURE 2.1: Illustration of semi-supervised clustering. Given an initial clus-
`tering, the user specifies two points that should not have been placed in the
`same cluster. The system warps its metric, allowing it to find a clustering
`'that respects the constraint.
`
`algorithm, by having the user provide the algorithm with feedback, and al-
`lowing it to alter the metric so as to accommodate that feedback. Not only
`is it easier to critique than to construct, but the user’s criticism can take
`many forms —— specifying that a particular example does/doesn’t belong in a
`particular cluster, that two examples do/don’t belong in the same cluster, or
`that a particular cluster is good (and should be preserved) or bad (and should
`be split up).
`Feedback may be incorporated into the metric as constraints to be respected
`by the clustering algorithm. Consider two examples, $1 and $2, that are con-
`strained by the user feedback to be in separate clusters. When the clustering
`algorithm attempts a partitioning which places 2:1 and 3:2 in the same cluster,
`the metric may be altered to increase the distance between $1 and x2 until
`one or the other of them falls in a different cluster (Figure 2.1). Other con-
`straints may be implemented similarly, shrinking the distance between some
`eXample and a cluster prototype, or increasing the distance between a cluster
`prototype and all the examples assigned to it.
`
`2.3.1
`
`Implementing Pairwise Document Constraints
`
`In this probabilistic setting, the natural measure of dissimilarity between
`two documents, x1 and $2, is the probability that they were generated by the
`same multinomial. From Pereira et a1. [19], this is proportional to the XL
`divergence to the mean of their multinomial distributions:
`
`DKLM(-’1?1,932) = lmllDKL(9:cl:0:1,zr2) + |$2|DKL(02216:1:1,$2):
`
`where |a:| is the length of document :12, DKL (01, 02) is the standard Kullback—
`Leibler divergence of 01 to 62, and 931,32 is a distribution such that
`
`P(tjl9$1,$2) = (P(tjl021) +P(tjl0:cz)) /2'
`
`8
`
`
`
`Semi-Supervised Clustering with User Feedback
`
`23
`
`The advantage of this measure is that it is symmetric, unlike standard KL
`divergence.
`
`To implement our constraints, we augment the standard KL divergence
`' Dwamfl62 ) with a weighting function
`
`,
`
`0
`
`DKL( 31’03’2)
`
`=
`
`lag/7'7 P(t]|6xl)log (P(tjl0x1)) '
`
`P(t-|0 )
`_J__€v_2_
`
`..
`
`_
`
`where 7,— may be interpreted as indicating the importance of ti for distinguish-
`ing 3:1 and :32. Then, given a constraint that :31 and 332 must be in separate
`clusters, we can warp the metric by computing
`
`aITIKLMCEI-iaa) _
`
`P(t.'il0$1$2)
`
`|x2lp(tjl6$2) log (W)
`
`and hillclimbing over '7 to increase the effective distance between the two.
`This gradient tells us the direction to move the 7’s in order to increase (or
`decrease) the separation between two documents. (In the current experiments
`we constrain the 1’s to be positive, but it might be interesting to relax this
`and allow some 7’s to become negative.)
`
`These 7’s are incorporated back into the E-step of clustering algorithm as
`weights attached to the individual term frequencies:
`
`P($l7r-;) =- H P(tj]0m)7jN(t,-,z)‘
`tjEV
`
`Intuitively, a small 71' reduces the effect of tj,S presence or absence on doc-
`ument likelihood, effectiVely scaling its effect on the document’s divergence
`from its cluster center. As such, we are able to inject a learned distance metric
`directly into the clustering algorithm.
`
`2.3.2 Other Constraints
`
`Other constraints described in the previous section may be similarly im—
`plemented by hillclimbing over the example-to—cluster and cluster—to-cluster
`distance. Note that the linear warping we describe will not guarantee that
`all constraints can be satisfied; some clusterings desired by the user may be
`non—convex and unrealizable in the space of models supported by naive Bayes.
`In this case, the hillclimbing will converge to a weighting that provides a local
`minimum of constraint violations. Local or nonlinear warpings of the distance
`metric, such as the ones described by Friedman [11] and Yianilos [21] may be
`of use in these situations.
`
`9
`
`
`
`24 Constrained Clustering: Advances in Algorithms, Theory, and Applications
`
`2.4 Experiments
`
`In this section, we illustrate the semi-supervised approach on a small docu—
`‘ ment clustering problem. We use a. set of 25 documents each from five Reuters
`topic areas: business, health, politics, sports, and tech. Starting from five ran—
`domly initialized prototypes, the EM—based clustering algorithm described in '
`the previous sections finds clusters that maximize data likelihood.
`Each time clustering converges, we add a constraint. We simulate a human
`user by identifying two documents from the same cluster whose sources are
`different Reuters topics, and constrain them to be in different clusters.5 For
`each unsatisfied constraint, we reweight the divergence by a fixed number
`of hillclimbing steps, re—initialize the cluster prototypes, and repeat the EM
`training.
`
`2.4.1 Clustering Performance
`
`Figure 2.2 compares the performance of supervised, unsupervised, and semi-
`supervised learning. For unsupervised and semi-supervised learners, we plot
`cluster purity:
`the fraction of examples that would be classified correctly
`if all examples were assigned the majority label in each cluster. For the
`supervised learner, we plot both. cluster purity and classification accuracy
`(generalization) .
`After only a few constraints have been added, cluster purity increases
`sharply over that of unsupervised clustering.
`It is not clear, however, how
`to fairly comparethe performance of semi—supervised clustering with that of
`fully supervised clustering: constraints do not exactly correspond to labeled
`examples, and it is uncertain what constitutes a proper test set. _In super-
`vised learning, documents used for training are traditionally excluded-from
`the test set, since their labels are already known. But the semi-supervised
`model clusters (and is tested on) the entire corpus, so it is also reasonable to
`gauge it against a supervised learner tested the same way. In the figure we
`show the cluster purity of supervised learning on the training set as well as
`its generalization to an independent test set.
`J
`The semi-supervised learner reaches its asymptotic performance after about
`10 constraints have been added; the supervised learners require between 3 and
`6 times more labeled examples to reach that level of performance.6 It is in-
`
`5A fully—operational semi—supervised clustering system would benefit from a graphical user
`interface that permits efficient browsing of the current clusters and supports easy specifi-
`cation of user constraints. See the discussion of Scatter/Gather later in this chapter.
`6To assure ourselves that metric-warping alone wasn’t responsible for the performance dis-
`parity, we also incorporated metric warping into the supervised clusterer, shrinking the
`divergence between a document and its assigned cluster. The addition resulted in no sig-
`nificant performance improvement.
`
`10
`
`10
`
`
`
`Semi—Supervised Clustering with User Feedback
`0.9
`
`25
`
`0.3 .
`
`Pw
`
`Pa)
`
`correct)9ob01
`
`
`
`
`
`clusterpurity(orfractlon
`
`o
`
`10
`
`20
`
`50
`40
`30
`number 0! constraints (or labeled examples)
`
`so
`
`70
`
`80
`
`FIGURE 2.2: Learning curves for supervised, unsupervised, and semi—
`supervised clustering. For supervised clustering, cluster purity (measured
`on the train set) and generalization (measured on an independent test set)
`are plotted against the number of labeled examples; for semi-supervised clus—
`tering, purity is plotted against the number of constraints. Averages over 10
`runs each, with the upper and lower lines indicating error bars at one standard
`deviation. See text for details.
`
`teresting to note that the performance of the semi—supervised learner actually
`begins to decrease after roughly 20 constraints have been added. The Reuters
`data set contains many documents that appear under more than one topic
`(an identical article on Microsoft, for example, appears under both business
`and tech). We hypothesize that, in an attempt to separate these unseparable
`documents, the learner is pushing its term weightings to unhealthy extremes.
`
`Experiments on a larger data set consisting of 20,000 USENET articles sug-
`gest that semi-supervised clustering is just as effective with large data sets.
`More importantly, these experiments show that semi-supervised clustering is
`able to cluster the same data according to different orthogonal criteria. This
`data set contains articles on four subjects: aviation simulators, real aviation,
`auto simulators, and real autos. Semi-supervised clustering can cluster the
`simulators and real groups together (e.g., aviation simulators and real avia-
`tion) or the auto and aviation groups together (e.g., aviation simulators and
`auto simulators) depending on the feedback provided by the user.
`In both
`cases it does so at'about 80% accuracy with 10 constraints. When the dis—
`tance metric is not adjusted, the same constraints give an average of only 64%
`accuracy. (Purely unsupervised clustering achieves only about 50% acCuracy.)
`
`11
`
`11
`
`
`
`26 Constrained Clustering: Advances in Algorithms, Theory, and Applications
`
`—-r——‘-"“'
`l
`
`0.5
`
`0.3
`
`
`
`
`
`50 words 0.7% P.. us
`lractlonwelspwithmaxcross—entropywords E
`
`—- top 1200 words (20%)
`- - top 750 words (11%)
`- - top 150 words (2.2%)
`l
`
`"-
`
`......
`
`~
`
`0.!
`
`.'
`
`5‘
`
`0
`
`2
`
`4
`
`45
`
`12
`10
`a
`number oiconstrslnts
`
`14
`
`16
`
`18
`
`20
`
`FIGURE 2.3: Fraction overlap of the top n weighted terms with top 12 terms
`ranked by information gain on fully-supervised data. As the number of con-
`straints increases, there is increasing correlation with terms that strongly
`affect class conditional probabilities. Note that this overlap is achieved with
`far fewer constraints than the number of labels in the fully—supervised data.
`
`2.4.2 Learning Term Weightings
`
`Adjusting .7]. warps the metric by adjusting the resolving power of term
`15,-, essentially identifying which terms are most useful for distinguishing docu—
`ments. If 'Yj is large, small disparities in the frequency of 15,- become important
`and will tend to separate documents; if '73- is small, large disparities in fre-
`quency will be ignored.
`
`Empirically, this behavior is borne out on the Reuters experiments. Terms
`that subjectively appear highly relevant for distinguishing topics, such as Iraq,
`economy, weapons and council are given large weightings. We computed the
`information gain of t,- using all document labels [18], and compared it with
`39-. Figure 2.3 shows the overlap between the top—weighted n% terms in the
`vocabulary with the same terms ranked by information gain. After about
`a dozen constraints, semi-supervised clustering learns term weightings with
`moderate overlap to the term weightings learned by supervised learning from
`all 125 document labels.
`
`12
`
`12
`
`
`
`Semi-Supervised Clustering with User Feedback
`
`27
`
`2.5 Discussion
`
`This chapter only scratches the surface of semi—supervised clustering with
`user feedback. There are still many issues to be addressed; we touch on a few
`of these next.
`
`2.5.1 Constraints vs. Labels
`
`When applying supervised learning to classification problems, it is assumed
`that the users know the target classes and have labeled examples from each
`target class.
`In many interesting problems, this is an unrealistic assump-
`tion. A semi-supervised system allows users to give label—like information to
`the learner without having to know labels. Although user feedback in semi-
`supervised clustering serves a similar role as class labels serve in supervised
`learning, comparing supervised learning with semi-supervised clustering is an
`apples—to—oranges comparison. Semi—supervised clustering usually will be ap—
`plied to problems where labels are not readily available. However, evaluating
`clustering systems is difficult and usually subjective. We compare the per—
`formance of semi-supervised clustering to supervised learning using a labeled
`data set principally to avoid this subjectivity.
`
`The performance disparity between supervised and semi-supervised cluster—
`ing is surprising. While we have argued that it is easier to provide constraints
`than labels, constraints also provide less information than labels. Constraints
`don’t require the user to know the correct label (or even what labels exist!) #
`only the relationship among pairs or sets of labels. There are only 125 possible
`labels in the small Reuters data set, but thousands of possible separation con—
`straints. Yet empirically, even with very few constraints, the semi-supervised
`learner is able to perform surprisingly well.
`
`One explanation is in the connection to active learning. As a means of
`user feedback, the addition of a constraint indicates a problem and effectively
`acts as a counterexample for the present clustering. Counterexamples are a
`powerful tool for doing active learning, which, in some situations, are much
`more efiicient than learning from randomly labeled examples [1]. As such,
`the user, by iteratively directing the clusterer’s attention toward points that
`are incorrectly clustered, gives a semi-supervised clustering system the many
`advantages of an active learning system.
`
`2.5.2 Types of User Feedback
`
`As we have discussed, there are many different types of feedback that users
`might provide to a semi-supervised clustering system. One type of feedback
`is the constraints on individual data points and clusters we used earlier. But
`many other forms of feedback might prove useful as well. For example, a user
`
`13
`
`13
`
`
`
`28 Constrained Clustering: Advances in Algorithms, Theory, and Applications
`
`might tell the system that the current clustering is too coarse or too fine. Or
`the user might point to a cluster and indicate that the cluster is bad without
`saying how it is bad. Similarly, a user might indicate that a cluster is good,
`suggesting that future re-clusterings of the data should attempt to maintain
`this cluster. Users might also give feedback that is not cluster specific, such
`as telling the system that the entire clustering looks bad and that the next
`clustering should be very different.
`Some types of user feedback may require adaptive clustering that cannot be
`easily handled by the 'y weighting scheme we used above. For example, we con-
`sidered an approach to finding good—but qualitatively different—w—clusterings
`of the same data by exploiting EM’s weakness for getting trapped in local
`minima. Different local minima may capture qualitatively different ways of
`clustering the data, one of which may better match the user’s internal prefer—
`ence function than the deepest minima the system can find. In the long run
`we hope to develop a general framework for representing user feedback about
`clusters.
`
`2.5.3 Other Applications
`
`We believe there are many applications of feedback-driven semi-supervised
`clustering. Imagine a Yahoo! hierarchy for web pages that allows the user to
`tailor the hierarchy to better match their own interests by providing feedback
`while browsing. Similarly, consider an automatic e—mail system in which a user
`allows the system to cluster e-mail into related mailboxes instead of manually
`specifying the mailboxes. Semi-supervised feedback would allow the user to
`tailor mailbox clusters to fit their (possibly changing) needs. As a different
`example, consider a user clustering proteins into homology groups (groups of
`proteins with similar structures). Large proteins have complex structures and
`could be clustered many different ways.
`.A feedback-driven semi-supervised
`clustering system would allow the user to explore many different ways the
`proteins might be clustered and to find clusterings most suitable to their
`purposes.
`
`2.5.4 Related Work
`
`The core operation of semi—supervised clustering involves learning a distance
`metric, of which a great deal of work has been done for classification problems
`(see Hastie and Tibshirani [13] for an overview); more recently, researchers
`have begun applying these techniques to clustering and other forms of machine
`learning (see, e.g., King et al.
`[20]).
`As indicated earlier, our model is most similar to the work of Demirii et
`a]. They report how a fixed set of labeled examples may be used to bias a
`clustering algorithm; we investigate how a user, interacting with the system,
`may efficiently guide the learner to a desired clustering.
`In the time since this work was first presented, there has been a great deal
`
`14
`
`14
`
`
`
`Semi-Supervised Clustering with User Feedback:
`
`29
`
`of research in improving clusterings by the (semi—supervised) learning of a
`distance measure.
`Instead of attempting a complete list of references here,
`we refer the reader the references in Chapter 1 and to the other, more recent
`contributions in this volume.
`
`Our technique of incorporating user feedback is a cousin to relevance feed-
`back, a technique for information retrieval [2]. Given a query and initial set
`of retrieved documents, relevance feedback asks the user to tag documents
`as being more or less relevant to the query being pursued. As the process is
`iterated, the retrieval system builds an increasingly accurate model of what
`the user is searching for.
`The question of how a