`Data Mining and Knowledge Discovery Series
`
Constrained Clustering
Advances in Algorithms, Theory, and Applications
`
`Edited by
Sugato Basu • Ian Davidson
`Kiri L. Wagstaff
`
`CRC Press
`Taylor & Francis Group
`Boca Raton London New York
`
CRC Press is an imprint of the
Taylor & Francis Group, an informa business
`A CHAPMAN & HALL BOOK
`
`1
`
`.
`
`APPLE 1017
`
`APPLE 1017
`
`1
`
`
`
`ay bY 1 FY
Cover image shows the result of clustering a spectral image of Mars using soft constraints to
impose spatial contiguity on cluster assignments. The data set was collected by the Space Telescope
Imaging Spectrograph (STIS) on the Hubble Space Telescope. This image was reproduced with permission from Intelligent Clustering with Instance-Level Constraints by Kiri Wagstaff.
`
`Chapman & Hall/CRC
`Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
`Boca Raton, FL 33487-2742
`
`© 2009 by Taylor & Francis Group, LLC
`Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business
`
No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1
`
`International Standard Book Number-13: 978-1-58488-996-0 (Hardcover)
`
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
`
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information storage or
retrieval system, without written permission from the publishers.
`
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides
licenses and registration for a variety of users. For organizations that have been granted a photocopy
license by the CCC, a separate system of payment has been arranged.
`
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
`
`Library of Congress Cataloging-in-Publication Data
`
`Constrained clustering : advances in algorithms, theory, and applications / editors,
`Sugato Basu, Ian Davidson, Kiri Wagstaff.
`p. cm. -- (Chapman & Hall/CRC data mining and knowledge discovery series)
`Includes bibliographical references and index.
`ISBN 978-1-58488-996-0 (hardback: alk. paper)
1. Cluster analysis--Data processing. 2. Data mining. 3. Computer algorithms. I.
Basu, Sugato. II. Davidson, Ian, 1971- III. Wagstaff, Kiri. IV. Title. V. Series.
`
`QA278.C63 2008
519.5'3--dc22
`
`2008014590
`
`Visit the Taylor & Francis Website at
`http://www.taylorandfrancis.com
`
`and the CRC Press Website at
`http://www.crcpress.com
`
`2
`
`
`
`Chapter 2
`
Semi-Supervised Clustering with
`User Feedback
`
`David Cohn
`
`Google, Inc., cohn@google.com
`
`Rich Caruana
`
`Cornell University, caruana@cs.cornell.edu
`
`Andrew Kachites McCallum
`
`University of Massachusetts, Amherst, mccallum@cs.umass.edu
`
`Abstract We present an approach to clustering based on the observa-
`tion that “it is easier to criticize than to construct.” Our approach of semi-
`supervised clustering allows a user to iteratively provide feedback to a clus-
`tering algorithm. The feedback is incorporated in the form of constraints,
`which the clustering algorithm attempts to satisfy on future iterations. These
`constraints allow the user to guide the clusterer toward clusterings of the data
`that the user finds more useful. We demonstrate semi-supervised clustering
`with a system that learns to cluster news stories from a Reuters data set.
`
2.1 Introduction
`
Consider the following problem: you are given 100,000 text documents (e.g.,
papers, newsgroup articles, or web pages) and asked to group them into classes
or into a hierarchy such that related documents are grouped together. You
are not told what classes or hierarchy to use or what documents are related.
Your job is simply to create this taxonomy so that the documents can be
browsed and accessed efficiently, either by yourself or by other people. While

¹This work was originally circulated as an unpublished manuscript [4] when all the authors
were at Justsystem Pittsburgh Research Center.
`
`17
`
`3
`
`
`
`18 Constrained Clustering: Advances in Algorithms, Theory, and Applications
`
you may have some criterion in mind, you would probably be hard-pressed to
express it algorithmically.
This problem is ubiquitous. The web has created a number of new examples
of it, but it can be found in many fields that don't involve the web, as well as
with many different types of "documents." Librarians, astronomers, biologists:
practically everyone tasked with creating a taxonomy from data faces this
problem in one form or another.
`We propose the following iterative solution to this problem:
`
`1. Give the 100,000 documents to an unsupervised clustering algorithm
`and have it cluster them.
`
2. Browse the resulting clusters and tell the system which clusters you like,
and which clusters you don't like. Don't do this for all the clusters, just
for some of the ones you browsed. Provide feedback to the system by
saying "This document doesn't belong in here," "Move this document to
that cluster," or "These two documents shouldn't be (or should be) in
the same cluster."

Don't do this for all, or even many, of the documents; only for the few
that look most out of place.

3. After your critique, re-cluster the documents, allowing the clustering
algorithm to modify the distance metric parameters to try to find a
new clustering that satisfies the constraints you provided in the critique.
`
`4. Repeat this until you are happy with the clustering.
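The four steps above amount to a simple interactive loop. The sketch below shows one way that control flow might look; the `recluster` and `critique` callables are hypothetical stand-ins for the clustering algorithm and the user's feedback, not part of the chapter's actual system.

```python
def semi_supervised_loop(docs, recluster, critique, max_rounds=10):
    """Iterate: cluster, collect user constraints, re-cluster, repeat.

    recluster(docs, constraints) -> cluster assignment for each document
    critique(docs, assignment)   -> new constraints; empty when the user is happy
    """
    constraints = []  # accumulated feedback, e.g., cannot-link pairs
    assignment = recluster(docs, constraints)
    for _ in range(max_rounds):
        new = critique(docs, assignment)
        if not new:          # step 4: stop once the user is satisfied
            break
        constraints.extend(new)
        assignment = recluster(docs, constraints)
    return assignment
```

The loop terminates either when the user has no further complaints or after a bounded number of rounds.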
`
`This solution is distinct from both traditional supervised and unsupervised
`learning. Unsupervised clustering takes an unlabeled collection of data and,
`without intervention or additional knowledge, partitions it into sets of ex-
`amples such that examples within clusters are more “similar” than examples
`between clusters. Much work in unsupervised clustering is dedicated to the
`problem of manually engineering similarity criteria that yield good partition-
`ing of data for a given domain.
Supervised learning, on the other hand, assumes that the class structure or
hierarchy already is known. It takes a set of examples with class labels, and
returns a function that maps examples to class labels. The goal of supervised
learning is to learn mappings that are accurate enough to be useful when
classifying new examples, and perhaps to learn mappings that allow users to
understand the relationships between the data and the labels, such as which
features are important.
Semi-supervised clustering falls between the extremes of totally unsupervised
clustering and totally supervised learning. The main goal of our approach
to semi-supervised clustering is to allow a human to "steer" the clustering
process so that examples can be partitioned into a useful set of clusters
with minimum time and human effort. A secondary goal of semi-supervised
`
`4
`
`
`
`Semi-Supervised Clustering with User Feedback
`
`19
`
clustering is to give the user a way to interact and play with the data so that
they can understand it better.
Our approach to semi-supervised clustering assumes that the human user
has in mind criteria that enable them to evaluate the quality of a clustering.
It does not assume that the user is conscious of what they think defines
a good clustering but that, as with art, they will "know it when they see it."
Most importantly, semi-supervised clustering never expects a user to write a
function that defines the clustering criterion. Instead, the user interacts with
the clustering system, which attempts to learn a criterion that yields clusters
the user is satisfied with. As such, one of the primary challenges of semi-
supervised clustering is finding ways to elicit and make use of user feedback
during clustering.
The remainder of this chapter describes one simple, illustrative way in which
this may be accomplished. Other challenges that need to be addressed by
future research on semi-supervised clustering are briefly described in the
discussion section.
`
`2.1.1 Relation to Active Learning
`
Semi-supervised clustering with user feedback is closely related to active
learning [5]. In the most common form of active learning, a learning system
attempts to identify which data points, if labeled by a human, would be most
informative. In semi-supervised clustering, the human selects the data points,
and puts on them a wide array of possible constraints instead of labels. These
two key differences point toward some situations in which the semi-supervised
approach is preferable.
`
1. In some clustering problems the desired similarity metric may be so
different from the default that traditional active learning would make many
inefficient queries. This problem also arises when there are many different
plausible clusterings. Although less automated, a human browsing
the data would do less work by selecting the feedback data points themselves.
`
2. The intuitive array of possible constraints is easier to apply than labels,
especially when the final clusters are not known in advance.
`
3. The very act of human browsing can lead to the discovery of what
clusters are desired. Semi-supervised learning can thus be seen as a
method of data exploration and pattern discovery, efficiently aided by
cluster-based summarization.
`
²Demiriz et al. [7] independently introduced a semi-supervised clustering model similar to
the one we describe here. The main distinction between our work and theirs is our use of
iterative feedback to acquire labelings; Demiriz et al. assume that all available labels are
given a priori.
`
`5
`
`
`
`20 Constrained Clustering: Advances in Algorithms, Theory, and Applications
`
However, the distinction with active learning is subjective. As we will see
in Section 2.5.1, our system could easily be viewed as a practical application
of learning by counterexamples [1], one of the earliest and most powerful
forms of active learning studied in the theory community.
Hybrid active/semi-supervised systems are also plausible. In situations with
a large number of data points and data types that are difficult to browse, one
could imagine a system that combines some of the automated selection of
active learning with the human browsing of semi-supervised clustering. The
active learner could make many disparate hypotheses about the underlying
labels and present the examples that would be most indicative of each.
`
`2.2 Clustering
`
Formally, clustering is the process of partitioning a data set into subsets
such that all members of a given subset are "similar" according to some
distance measure D. We will denote the distance between two examples x_1 and
x_2 as D(x_1, x_2). We can generalize this to refer to D(y_1, y_2), the distance
between two cluster centers, or D(y, x_1), the distance between a cluster center
and an example.
The two most popular approaches to clustering are agglomerative clustering
and prototype-based clustering. In agglomerative clustering, each datum
is initially placed in its own cluster. The clusters that are most similar
(according to D) are iteratively merged, until the desired number of clusters
is reached, or some limit on data likelihood or distortion is exceeded (see
Hofmann and Buhmann [14] for an in-depth treatment of agglomerative
clustering).
In prototype-based clustering, the final number of clusters is usually set a
priori, and the corresponding prototypes are found using some form of
Expectation Maximization (EM) [8]. Each prototype is initialized to some position
(in our case, a randomly weighted sample of the training points). Examples
are assigned to prototypes according to their similarity to each prototype (the
assignment may be 0-1 or fractional, depending on the algorithm). Prototypes
are then adjusted to maximize the data likelihood, or, equivalently, minimize
the data distortion. The assignment/adjustment process is repeated until no
significant changes result (see Meilă and Heckerman [17] for a concise review of
prototype-based clustering).
In the present chapter, we adopt a statistical prototype-based approach,
resulting from the naive Bayes model of document generation [16].³ Given a

³We reiterate that the approach described in this chapter is only for the point of exploration
and illustration; the approach is, in theory, applicable to almost any clustering algorithm.
`
`6
`
`
`
`Semi-Supervised Clustering with User Feedback
`
`21
`
vocabulary V, a document is assumed to be a "bag of words" generated from
a multinomial distribution θ. In this model, the probability of document x is

\[ P(x) = \prod_{t_j \in V} P(t_j \mid \theta)^{N(t_j, x)}, \]

where P(t_j|θ) is the parameterized probability of term t_j being generated, and
N(t_j, x) is the number of times t_j appears in the document. Each document
x forms an estimate of a multinomial distribution θ_x; likewise, each cluster of
documents π forms an estimate θ_π, composed from the θ_x of its constituent
documents.⁴
For clustering we assume that, instead of being produced by a single
multinomial distribution, each of the observed documents was drawn from one of
distributions θ_{π_1}, θ_{π_2}, ..., θ_{π_N}, corresponding to the unknown distribution of
clusters π_1, π_2, ..., π_N:

\[ P(x) = \sum_i P(\pi_i) P(x \mid \pi_i) = \sum_i P(\pi_i) \prod_{t_j \in V} P(t_j \mid \theta_{\pi_i})^{N(t_j, x)}. \]

Our task is to estimate values for P(π_i) and θ_{π_i}, which will in turn allow us
to estimate cluster memberships P(π_i|x) by Bayes rule:

\[ P(\pi_i \mid x) = P(x \mid \pi_i) P(\pi_i) / P(x). \tag{2.1} \]
`
We find estimates for P(π_i) and θ_{π_i} via the standard procedure for EM,
beginning with randomized estimates of θ_{π_i} drawn as a weighted sample from
the observations. Then, for each cluster π_i and document x, we compute
P(x|θ_{π_i}) and apply Equation 2.1 to compute P(π_i|x). Each cluster is given
partial ownership of a document proportional to P(π_i|x). The parameters
θ_{π_i} are recomputed as the weighted sum of their component documents, and
the process is repeated. The algorithm is guaranteed to converge to a locally
optimal clustering (see, e.g., MacKay [15] or Meilă and Heckerman [17] for
details).
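The EM procedure just described can be sketched in a few lines. This is an illustrative implementation for a mixture of multinomials over raw term counts; a tiny probability floor stands in for the Laplacean smoothing mentioned in the footnotes, and none of the names below come from the chapter's system.

```python
import math
import random

def em_cluster(docs, k, iters=20, seed=0):
    """docs: list of term-count vectors (fixed vocabulary); returns P(pi_i|x)."""
    rng = random.Random(seed)
    v = len(docs[0])
    priors = [1.0 / k] * k
    # Random initial term distributions (the chapter instead samples from the data).
    theta = []
    for _ in range(k):
        row = [rng.random() + 1e-3 for _ in range(v)]
        s = sum(row)
        theta.append([x / s for x in row])
    resp = []
    for _ in range(iters):
        # E-step: P(pi_i | x) by Bayes rule (Equation 2.1), computed in log space.
        resp = []
        for counts in docs:
            logp = [math.log(priors[i]) +
                    sum(n * math.log(theta[i][j]) for j, n in enumerate(counts))
                    for i in range(k)]
            m = max(logp)
            w = [math.exp(l - m) for l in logp]
            z = sum(w)
            resp.append([x / z for x in w])
        # M-step: recompute priors and term probabilities from weighted counts.
        priors = [max(sum(r[i] for r in resp) / len(docs), 1e-9) for i in range(k)]
        for i in range(k):
            wc = [1e-9] * v  # floor stands in for Laplacean smoothing
            for counts, r in zip(docs, resp):
                for j, n in enumerate(counts):
                    wc[j] += r[i] * n
            s = sum(wc)
            theta[i] = [x / s for x in wc]
    return resp
```

Each row of the returned matrix gives one document's fractional ownership across the k clusters.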
`
`2.3 Semi-Supervised Clustering
`
The goodness of any clustering depends on how well the metric D matches
the user's (perhaps unknown) internal model of the target domain. We propose
allowing the user to impose their model on the metric via the clustering
`
⁴The estimates for term probabilities are derived from the relative term frequencies in the
documents. Following McCallum and Nigam [16], we smooth with a Laplacean prior to
avoid zero term probabilities.
`
`7
`
`
`
`22 Constrained Clustering: Advances in Algorithms, Theory, and Applications
`
`
`
FIGURE 2.1: Illustration of semi-supervised clustering. Given an initial clustering,
the user specifies two points that should not have been placed in the
same cluster. The system warps its metric, allowing it to find a clustering
that respects the constraint.
`
algorithm, by having the user provide the algorithm with feedback, and
allowing it to alter the metric so as to accommodate that feedback. Not only
is it easier to critique than to construct, but the user's criticism can take
many forms: specifying that a particular example does/doesn't belong in a
particular cluster, that two examples do/don't belong in the same cluster, or
that a particular cluster is good (and should be preserved) or bad (and should
be split up).
Feedback may be incorporated into the metric as constraints to be respected
by the clustering algorithm. Consider two examples, x_1 and x_2, that are
constrained by the user feedback to be in separate clusters. When the clustering
algorithm attempts a partitioning which places x_1 and x_2 in the same cluster,
the metric may be altered to increase the distance between x_1 and x_2 until
one or the other of them falls in a different cluster (Figure 2.1). Other
constraints may be implemented similarly, shrinking the distance between some
example and a cluster prototype, or increasing the distance between a cluster
prototype and all the examples assigned to it.
`
2.3.1 Implementing Pairwise Document Constraints

In this probabilistic setting, the natural measure of dissimilarity between
two documents, x_1 and x_2, is the probability that they were generated by the
same multinomial. From Pereira et al. [19], this is proportional to the KL
divergence to the mean of their multinomial distributions:

\[ D_{KM}(x_1, x_2) = |x_1| D_{KL}(\theta_{x_1}, \theta_{x_1,x_2}) + |x_2| D_{KL}(\theta_{x_2}, \theta_{x_1,x_2}), \]

where |x| is the length of document x, D_{KL}(θ_1, θ_2) is the standard Kullback-
Leibler divergence of θ_1 to θ_2, and θ_{x_1,x_2} is a distribution such that

\[ P(t_j \mid \theta_{x_1,x_2}) = \left( P(t_j \mid \theta_{x_1}) + P(t_j \mid \theta_{x_2}) \right) / 2. \]
`
`8
`
`
`
`Semi-Supervised Clustering with User Feedback
`
`23
`
`The advantage of this measure is that it is symmetric, unlike standard KL
`divergence.
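As a concrete illustration, the divergence-to-the-mean distance might be computed as below, representing each document as a dict of term counts. This is a sketch of the formula above, not the chapter's implementation; the function names are ours.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p, q) over a shared support."""
    return sum(pi * math.log(pi / q[t]) for t, pi in p.items() if pi > 0)

def d_km(counts1, counts2):
    """|x1| * D_KL(theta_x1, mean) + |x2| * D_KL(theta_x2, mean)."""
    vocab = set(counts1) | set(counts2)
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    t1 = {t: counts1.get(t, 0) / n1 for t in vocab}
    t2 = {t: counts2.get(t, 0) / n2 for t in vocab}
    mean = {t: (t1[t] + t2[t]) / 2 for t in vocab}
    return n1 * kl(t1, mean) + n2 * kl(t2, mean)
```

Note that whenever a term has nonzero probability in either document, its probability under the mean distribution is also nonzero, so the logarithms are always defined.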
To implement our constraints, we augment the standard KL divergence
D(θ_{x_1}, θ_{x_2}) with a per-term weighting:

\[ D'_{KL}(\theta_{x_1}, \theta_{x_2}) = \sum_{t_j \in V} \gamma_j \, P(t_j \mid \theta_{x_1}) \log\left( \frac{P(t_j \mid \theta_{x_1})}{P(t_j \mid \theta_{x_2})} \right), \]

where γ_j may be interpreted as indicating the importance of t_j for distinguishing
x_1 and x_2. Then, given a constraint that x_1 and x_2 must be in separate
clusters, we can warp the metric by computing the gradient

\[ \frac{\partial D'_{KM}(x_1, x_2)}{\partial \gamma_j} = |x_1| P(t_j \mid \theta_{x_1}) \log\left( \frac{P(t_j \mid \theta_{x_1})}{P(t_j \mid \theta_{x_1,x_2})} \right) + |x_2| P(t_j \mid \theta_{x_2}) \log\left( \frac{P(t_j \mid \theta_{x_2})}{P(t_j \mid \theta_{x_1,x_2})} \right) \]

and hillclimbing over γ_j to increase the effective distance between the two.
This gradient tells us the direction to move the γ's in order to increase (or
decrease) the separation between two documents. (In the current experiments
we constrain the γ's to be positive, but it might be interesting to relax this
and allow some γ's to become negative.)
These γ's are incorporated back into the E-step of the clustering algorithm as
weights attached to the individual term frequencies:

\[ P(x \mid \pi_i) = \prod_{t_j \in V} P(t_j \mid \theta_{\pi_i})^{\gamma_j N(t_j, x)}. \]

Intuitively, a small γ_j reduces the effect of t_j's presence or absence on
document likelihood, effectively scaling its effect on the document's divergence
from its cluster center. As such, we are able to inject a learned distance metric
directly into the clustering algorithm.
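One way to sketch the hillclimbing step for a single cannot-link pair is shown below. The function name, step size, and step count are illustrative assumptions; only the gradient expression follows the formula above. (Because the weighted divergence is linear in γ, the gradient does not depend on γ, so each weight simply moves linearly in its gradient direction.)

```python
import math

def separate(t1, t2, n1, n2, gamma, steps=50, lr=0.1):
    """Hillclimb per-term weights gamma to push two documents apart.

    t1, t2: term -> P(t | theta_x) dicts over a shared vocabulary
    n1, n2: document lengths |x1|, |x2|
    """
    mean = {t: (t1[t] + t2[t]) / 2 for t in t1}
    for _ in range(steps):
        for t in gamma:
            # Gradient of the weighted D_KM with respect to gamma_t.
            g = 0.0
            if t1[t] > 0:
                g += n1 * t1[t] * math.log(t1[t] / mean[t])
            if t2[t] > 0:
                g += n2 * t2[t] * math.log(t2[t] / mean[t])
            gamma[t] = max(1e-6, gamma[t] + lr * g)  # keep gammas positive
    return gamma
```

Terms on which the two documents agree have zero gradient and keep their weights, while terms that distinguish them receive larger weights, increasing the pair's effective distance.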
`
`2.3.2 Other Constraints
`
Other constraints described in the previous section may be similarly
implemented by hillclimbing over the example-to-cluster and cluster-to-cluster
distances. Note that the linear warping we describe will not guarantee that
all constraints can be satisfied; some clusterings desired by the user may be
non-convex and unrealizable in the space of models supported by naive Bayes.
In this case, the hillclimbing will converge to a weighting that provides a local
minimum of constraint violations. Local or nonlinear warpings of the distance
metric, such as the ones described by Friedman [11] and Yianilos [21], may be
of use in these situations.
`
`9
`
`
`
`24 Constrained Clustering: Advances in Algorithms, Theory, and Applications
`
`2.4 Experiments
`
In this section, we illustrate the semi-supervised approach on a small
document clustering problem. We use a set of 25 documents each from five Reuters
topic areas: business, health, politics, sports, and tech. Starting from five
randomly initialized prototypes, the EM-based clustering algorithm described in
the previous sections finds clusters that maximize data likelihood.
Each time clustering converges, we add a constraint. We simulate a human
user by identifying two documents from the same cluster whose sources are
different Reuters topics, and constrain them to be in different clusters.⁵ For
each unsatisfied constraint, we reweight the divergence by a fixed number
of hillclimbing steps, re-initialize the cluster prototypes, and repeat the EM
training.
`
`2.4.1 Clustering Performance
`
Figure 2.2 compares the performance of supervised, unsupervised, and semi-
supervised learning. For unsupervised and semi-supervised learners, we plot
cluster purity: the fraction of examples that would be classified correctly
if all examples were assigned the majority label in each cluster. For the
supervised learner, we plot both cluster purity and classification accuracy
(generalization).
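The purity measure just defined can be computed in a few lines; this sketch assumes the true labels are available for scoring, as they are in this experiment.

```python
from collections import Counter

def cluster_purity(assignments, labels):
    """Fraction of examples correct if each cluster votes its majority label."""
    total = 0
    for c in set(assignments):
        members = [l for a, l in zip(assignments, labels) if a == c]
        total += Counter(members).most_common(1)[0][1]  # majority count
    return total / len(labels)
```

For example, a clustering that isolates one mislabeled document out of four scores 0.75.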
After only a few constraints have been added, cluster purity increases
sharply over that of unsupervised clustering. It is not clear, however, how
to fairly compare the performance of semi-supervised clustering with that of
fully supervised clustering: constraints do not exactly correspond to labeled
examples, and it is uncertain what constitutes a proper test set. In supervised
learning, documents used for training are traditionally excluded from
the test set, since their labels are already known. But the semi-supervised
model clusters (and is tested on) the entire corpus, so it is also reasonable to
gauge it against a supervised learner tested the same way. In the figure we
show the cluster purity of supervised learning on the training set as well as
its generalization to an independent test set.
The semi-supervised learner reaches its asymptotic performance after about
10 constraints have been added; the supervised learners require between 3 and
6 times more labeled examples to reach that level of performance.⁶ It is
interesting to note that the performance of the semi-supervised learner actually
begins to decrease after roughly 20 constraints have been added. The Reuters
data set contains many documents that appear under more than one topic
(an identical article on Microsoft, for example, appears under both business
and tech). We hypothesize that, in an attempt to separate these inseparable
documents, the learner is pushing its term weightings to unhealthy extremes.

⁵A fully-operational semi-supervised clustering system would benefit from a graphical user
interface that permits efficient browsing of the current clusters and supports easy
specification of user constraints. See the discussion of Scatter/Gather later in this chapter.

⁶To assure ourselves that metric-warping alone wasn't responsible for the performance
disparity, we also incorporated metric warping into the supervised clusterer, shrinking the
divergence between a document and its assigned cluster. The addition resulted in no
significant performance improvement.

[Figure 2.2 appears here: cluster purity (or fraction correct) vs. number of constraints (or labeled examples).]

FIGURE 2.2: Learning curves for supervised, unsupervised, and semi-
supervised clustering. For supervised clustering, cluster purity (measured
on the train set) and generalization (measured on an independent test set)
are plotted against the number of labeled examples; for semi-supervised
clustering, purity is plotted against the number of constraints. Averages over 10
runs each, with the upper and lower lines indicating error bars at one standard
deviation. See text for details.
`
Experiments on a larger data set consisting of 20,000 USENET articles
suggest that semi-supervised clustering is just as effective with large data sets.
More importantly, these experiments show that semi-supervised clustering is
able to cluster the same data according to different orthogonal criteria. This
data set contains articles on four subjects: aviation simulators, real aviation,
auto simulators, and real autos. Semi-supervised clustering can cluster the
simulators and real groups together (e.g., aviation simulators and real aviation)
or the auto and aviation groups together (e.g., aviation simulators and
auto simulators) depending on the feedback provided by the user. In both
cases it does so at about 80% accuracy with 10 constraints. When the distance
metric is not adjusted, the same constraints give an average of only 64%
accuracy. (Purely unsupervised clustering achieves only about 50% accuracy.)
`
`11
`
`11
`
`
`
`26 Constrained Clustering: Advances in Algorithms, Theory, and Applications
`
[Figure 2.3 appears here: fraction overlap with max cross-entropy words vs. number of constraints, with curves for the top 1200 words (20%), top 750 words (11%), and top 150 words (2.2%) of the vocabulary.]

FIGURE 2.3: Fraction overlap of the top n weighted terms with top n terms
ranked by information gain on fully-supervised data. As the number of
constraints increases, there is increasing correlation with terms that strongly
affect class conditional probabilities. Note that this overlap is achieved with
far fewer constraints than the number of labels in the fully-supervised data.
`
`2.4.2 Learning Term Weightings
`
Adjusting γ_j warps the metric by adjusting the resolving power of term
t_j, essentially identifying which terms are most useful for distinguishing
documents. If γ_j is large, small disparities in the frequency of t_j become important
and will tend to separate documents; if γ_j is small, large disparities in
frequency will be ignored.

Empirically, this behavior is borne out on the Reuters experiments. Terms
that subjectively appear highly relevant for distinguishing topics, such as Iraq,
economy, weapons, and council, are given large weightings. We computed the
information gain of t_j using all document labels [18], and compared it with
γ_j. Figure 2.3 shows the overlap between the top-weighted n% terms in the
vocabulary with the same terms ranked by information gain. After about
a dozen constraints, semi-supervised clustering learns term weightings with
moderate overlap to the term weightings learned by supervised learning from
all 125 document labels.
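The overlap statistic plotted in Figure 2.3 can be sketched as follows: given per-term scores from the two rankings, take the top-n sets and measure their intersection. The function and argument names here are illustrative.

```python
def top_n_overlap(weights, info_gain, n):
    """Fraction of the top-n terms by learned weight that also rank in the
    top-n terms by information gain; both arguments are term->score dicts
    over the same vocabulary."""
    top_w = set(sorted(weights, key=weights.get, reverse=True)[:n])
    top_ig = set(sorted(info_gain, key=info_gain.get, reverse=True)[:n])
    return len(top_w & top_ig) / n
```

An overlap of 1.0 means the two rankings agree perfectly on their top-n terms; independent rankings over a large vocabulary would give an overlap near n divided by the vocabulary size.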
`
`12
`
`12
`
`
`
`Semi-Supervised Clustering with User Feedback
`
`27
`
`2.5 Discussion
`
`This chapter only scratches the surface of semi-supervised clustering with
`user feedback. There are still many issues to be addressed; we touch on a few
`of these next.
`
`2.5.1 Constraints vs. Labels
`
When applying supervised learning to classification problems, it is assumed
that the users know the target classes and have labeled examples from each
target class. In many interesting problems, this is an unrealistic assumption.
A semi-supervised system allows users to give label-like information to
the learner without having to know labels. Although user feedback in semi-
supervised clustering serves a similar role as class labels serve in supervised
learning, comparing supervised learning with semi-supervised clustering is an
apples-to-oranges comparison. Semi-supervised clustering usually will be
applied to problems where labels are not readily available. However, evaluating
clustering systems is difficult and usually subjective. We compare the
performance of semi-supervised clustering to supervised learning using a labeled
data set principally to avoid this subjectivity.
The performance disparity between supervised and semi-supervised clustering
is surprising. While we have argued that it is easier to provide constraints
than labels, constraints also provide less information than labels. Constraints
don't require the user to know the correct label (or even what labels exist!);
only the relationship among pairs or sets of labels. There are only 125 possible
labels in the small Reuters data set, but thousands of possible separation
constraints. Yet empirically, even with very few constraints, the semi-supervised
learner is able to perform surprisingly well.
One explanation is in the connection to active learning. As a means of
user feedback, the addition of a constraint indicates a problem and effectively
acts as a counterexample for the present clustering. Counterexamples are a
powerful tool for doing active learning, which, in some situations, can be much
more efficient than learning from randomly labeled examples [1]. As such,
the user, by iteratively directing the clusterer's attention toward points that
are incorrectly clustered, gives a semi-supervised clustering system the many
advantages of an active learning system.
`
`2.5.2 Types of User Feedback
`
`As we have discussed, there are many different types of feedback that users
`might provide to a semi-supervised clustering system. One type of feedback
`is the constraints on individual data points and clusters we used earlier. But
`many other forms of feedback might prove useful as well. For example, a user
`
`13
`
`13
`
`
`
`28 Constrained Clustering: Advancesin Algorithms, Theory, and Applications
`
might tell the system that the current clustering is too coarse or too fine. Or
the user might point to a cluster and indicate that the cluster is bad without
saying how it is bad. Similarly, a user might indicate that a cluster is good,
suggesting that future re-clusterings of the data should attempt to maintain
this cluster. Users might also give feedback that is not cluster specific, such
as telling the system that the entire clustering looks bad and that the next
clustering should be very different.
Some types of user feedback may require adaptive clustering that cannot be
easily handled by the γ weighting scheme we used above. For example, we
considered an approach to finding good, but qualitatively different, clusterings
of the same data by exploiting EM's weakness for getting trapped in local
minima. Different local minima may capture qualitatively different ways of
clustering the data, one of which may better match the user's internal
preference function than the deepest minima the system can find. In the long run
we hope to develop a general framework for representing user feedback about
clusters.
`
`2.5.3 Other Applications
`
We believe there are many applications of feedback-driven semi-supervised
`clustering. Imagine a Yahoo! hierarchy for web pages that allows the user to
`tailor the hierarchy to better match their own interests by providing feedback
`while browsing. Similarly, consider an automatic e-mail system in which a user
`allows the system to cluster e-mail into related mailboxes instead of manually
`specifying the mailboxes. Semi-supervised feedback would allow the user to
`tailor mailbox clusters to fit their (possibly changing) needs. As a different
`example, consider a user clustering proteins into homology groups (groups of
`proteins with similar structures). Large proteins have complex structures and
`could be clustered many different ways. A feedback-driven semi-supervised
`clustering system would allow the user to explore many different ways the
`proteins might be clustered and to find clusterings most suitable to their
`purposes.
`
`2.5.4 Related Work
`
The core operation of semi-supervised clustering involves learning a distance
metric, on which a great deal of work has been done for classification problems
(see Hastie and Tibshirani [13] for an overview); more recently, researchers
have begun applying these techniques to clustering and other forms of machine
learning (see, e.g., Xing et al. [20]).
As indicated earlier, our model is most similar to the work of Demiriz et
al. They report how a fixed set of labeled examples may be used to bias a
clustering algorithm; we investigate how a user, interacting with the system,
may efficiently guide the learner to a desired clustering.
`In the time since this work was first presented, there has been a great deal
`
`14
`
`14
`
`
`
`Semi-Supervised Clustering with User Feedback
`
`29
`
of research in improving clusterings by the (semi-supervised) learning of a
distance measure. Instead of attempting a complete list of references here,
we refer the reader to the references in Chapter 1 and to the other, more recent
contributions in this volume.
Our technique of incorporating user feedback is a cousin to relevance
feedback, a technique for information retrieval [2]. Given a query and initial set
of retrieved documents, relevance feedback asks the user to tag documents
as being more or less relevant to the query being pursued. As the process is
iterated, the retrieval system builds an increasingly accurate model of what
the user is searching for.
`The question of how a user (or teacher) may best select examples to help a
`learner identify a target concept is the focus of much work in computational
`learning theory. See Goldman and Kearns [12] for a detailed treatment of the
`problem.
The Scatter/Gather algorithm [6] is an interactive clustering algorithm designed f