© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

Correlation Clustering∗

NIKHIL BANSAL
AVRIM BLUM
SHUCHI CHAWLA
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA

nikhil@cs.cmu.edu
avrim@cs.cmu.edu
shuchi@cs.cmu.edu

Editors: Nina Mishra and Rajeev Motwani
`
Abstract. We consider the following clustering problem: we have a complete graph on n vertices (items), where each edge (u, v) is labeled either + or − depending on whether u and v have been deemed to be similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of + edges within clusters, plus the number of − edges between clusters (equivalently, minimizes the number of disagreements: the number of − edges inside clusters plus the number of + edges between clusters). This formulation is motivated by a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of “agnostic learning” problem.
An interesting feature of this clustering formulation is that one does not need to specify the number of clusters k as a separate parameter, as in measures such as k-median or min-sum or min-max clustering. Instead, in our formulation, the optimal number of clusters could be any value between 1 and n, depending on the edge labels. We look at approximation algorithms both for minimizing disagreements and for maximizing agreements. For minimizing disagreements, we give a constant factor approximation. For maximizing agreements we give a PTAS, building on ideas of Goldreich, Goldwasser, and Ron (1998) and de la Vega (1996). We also show how to extend some of these results to graphs with edge labels in [−1, +1], and give some results for the case of random noise.
`
Keywords: clustering, approximation algorithm, document classification

1. Introduction
`
Suppose that you are given a set of n documents to cluster into topics. Unfortunately, you have no idea what a “topic” is. However, you have at your disposal a classifier f(A, B) that, given two documents A and B, outputs whether or not it believes A and B are similar to each other. For example, perhaps f was learned from some past training data. In this case, a natural approach to clustering is to apply f to every pair of documents in your set, and then to find the clustering that agrees as much as possible with the results.
Specifically, we consider the following problem. Given a fully-connected graph G with edges labeled “+” (similar) or “−” (different), find a partition of the vertices into clusters that agrees as much as possible with the edge labels. In particular, we can look at this in
`
∗This research was supported in part by NSF grants CCR-0085982, CCR-0122581, CCR-0105488, and an IBM Graduate Fellowship.
`
`
terms of maximizing agreements (the number of + edges inside clusters plus the number of − edges between clusters) or in terms of minimizing disagreements (the number of − edges inside clusters plus the number of + edges between clusters). These two are equivalent at optimality but, as usual, differ from the point of view of approximation. In this paper we give a constant factor approximation to the problem of minimizing disagreements, and a PTAS¹ for maximizing agreements. We also extend some of our results to the case of real-valued edge weights.
This problem formulation is motivated in part by a set of clustering problems at Whizbang Labs (Cohen & McCallum, 2001; Cohen & Richman, 2001, 2002) in which learning algorithms were trained to help with various clustering tasks. An example of one such problem, studied by Cohen and Richman (2001, 2002), is clustering entity names. In this problem, items are entries taken from multiple databases (e.g., think of names/affiliations of researchers), and the goal is to do a “robust uniq”: collecting together the entries that correspond to the same entity (person). E.g., in the case of researchers, the same person might appear multiple times with different affiliations, or might appear once with a middle name and once without, etc. In practice, the classifier f typically would output a probability, in which case the natural edge label is log(Pr(same)/Pr(different)). This is 0 if the classifier is unsure, positive if the classifier believes the items are more likely in the same cluster, and negative if the classifier believes they are more likely in different clusters. The case of {+, −} labels corresponds to the setting in which the classifier has equal confidence about each of its decisions.
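For concreteness, the log-odds labeling described above can be sketched in a few lines of Python (our illustration, not code from the paper; the function name and the clamping constant are ours):

```python
import math

def edge_weight(p_same, eps=1e-9):
    """Log-odds label log(Pr(same)/Pr(different)) for an edge, given
    the classifier's probability that the two items belong together."""
    p = min(max(p_same, eps), 1.0 - eps)  # clamp away from 0 and 1
    return math.log(p / (1.0 - p))
```

An unsure classifier (p = 0.5) yields weight 0; confident “same” gives a positive weight and confident “different” a negative one, matching the description above.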
What is interesting about the clustering problem defined here is that unlike most clustering formulations, we do not need to specify the number of clusters k as a separate parameter. For example, in min-sum clustering (Schulman, 2000) or min-max clustering (Hochbaum & Shmoys, 1986) or k-median (Charikar & Guha, 1999; Jain & Vazirani, 2001), one can always get a perfect score by putting each node into its own cluster; the question is how well one can do with only k clusters. In our clustering formulation, there is just a single objective, and the optimal clustering might have few or many clusters: it all depends on the edge labels.
To get a feel for this problem, notice that if there exists a perfect clustering, i.e., one that gets all the edges correct, then the optimal clustering is easy to find: just delete all “−” edges and output the connected components of the remaining graph. In Cohen and Richman (2002) this is called the “naive algorithm”. Thus, the interesting case is when no clustering is perfect. Also, notice that for any graph G, it is trivial to produce a clustering that agrees with at least half of the edge labels: if there are more + edges than − edges, then simply put all vertices into one big cluster; otherwise, put each vertex into its own cluster. This observation means that for maximizing agreements, getting a 2-approximation is easy (note: we will show a PTAS). In general, finding the optimal clustering is NP-hard (shown in Section 3).
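Both observations in the preceding paragraph are easy to make concrete. The sketch below (ours, not from the paper) implements the naive algorithm via union-find, a disagreement counter, and the trivial half-agreement baseline; `labels` maps each pair (u, v) with u < v to '+' or '-':

```python
def naive_clustering(n, labels):
    """Naive algorithm: delete all '-' edges and output the connected
    components of the remaining '+' graph (via union-find)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (u, v), s in labels.items():
        if s == '+':
            parent[find(u)] = find(v)
    return [find(u) for u in range(n)]  # cluster id per vertex

def disagreements(cluster, labels):
    """Mistakes: '-' edges inside clusters plus '+' edges across clusters."""
    return sum((s == '-') == (cluster[u] == cluster[v])
               for (u, v), s in labels.items())

def trivial_baseline(n, labels):
    """One big cluster if '+' edges are in the majority, else all
    singletons; agrees with at least half of the edge labels."""
    plus = sum(s == '+' for s in labels.values())
    return [0] * n if 2 * plus >= len(labels) else list(range(n))
```

If a perfect clustering exists, `naive_clustering` finds it (zero disagreements); otherwise `trivial_baseline` still makes mistakes on at most half the edges.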
Another simple fact to notice is that if the graph contains a triangle in which two edges are labeled + and one is labeled −, then no clustering can be perfect. More generally, the number of edge-disjoint triangles of this form gives a lower bound on the number of disagreements of the optimal clustering. This fact is used in our constant-factor approximation algorithm.
`
`
For maximizing agreements, our PTAS is quite similar to the PTAS developed by de la Vega (1996) for MAXCUT on dense graphs, and related to PTASs of Arora, Karger, and Karpinski (1999) and Arora, Frieze, and Kaplan (2002). Notice that since there must exist a clustering with at least n(n − 1)/4 agreements, it suffices to approximate agreements to within an additive factor of εn². This problem is also closely related to work on testing graph properties of Goldreich, Goldwasser, and Ron (1998), Parnas and Ron (2002), and Alon et al. (2000). In fact, we show how we can use the General Partition Property Tester of Goldreich, Goldwasser, and Ron (1998) as a subroutine to get a PTAS with running time O(n·e^(O((1/ε)^(1/ε)))). Unfortunately, this is doubly exponential in 1/ε, so we also present an alternative direct algorithm (based more closely on the approach of de la Vega (1996)) that takes only O(n²·e^(O(1/ε))) time.
`
Relation to agnostic learning. One way to view this clustering problem is that edges are “examples” (labeled as positive or negative) and we are trying to represent the target function f using a hypothesis class of vertex clusters. This hypothesis class has limited representational power: if we want to say (u, v) and (v, w) are positive in this language, then we have to say (u, w) is positive too. So, we might not be able to represent f perfectly. This sort of problem, trying to find the (nearly) best representation of some arbitrary target f in a given limited hypothesis language, is sometimes called agnostic learning (Kearns, Schapire, & Sellie, 1994; Ben-David, Long, & Mansour, 2001). The observation that one can trivially agree with at least half the edge labels is equivalent to the standard machine learning fact that one can always achieve error at most 1/2 using either the all-positive or the all-negative hypothesis.
Our PTAS for approximating the number of agreements means that if the optimal clustering has error rate ν, then we can find one of error rate at most ν + ε. Our running time is exponential in 1/ε, but this means that we can achieve any constant error gap in polynomial time.² What makes this interesting from the point of view of agnostic learning is that there are very few problems where agnostic learning can be done in polynomial time. Even for simple classes such as conjunctions and disjunctions, no polynomial-time algorithms are known that give even an error gap of 1/2 − ε.
`
Organization of this paper. We begin by describing notation in Section 2. In Section 3 we prove that the clustering problem defined here is NP-complete. Then we describe a constant factor approximation algorithm for minimizing disagreements in Section 4. In Section 5, we describe a PTAS for maximizing agreements. In Section 6, we present simple algorithms and motivation for the random noise model. Section 7 extends some of our results to the case of real-valued edge labels. Finally, subsequent work by others is briefly described in Section 8.
`
2. Notation and definitions

Let G = (V, E) be a complete graph on n vertices, and let e(u, v) denote the label (+ or −) of the edge (u, v). Let N+(u) = {u} ∪ {v : e(u, v) = +} and N−(u) = {v : e(u, v) = −} denote the positive and negative neighbors of u respectively.
`
`
We let OPT denote an optimal clustering on this graph. In general, for a clustering C, let C(v) be the set of vertices in the same cluster as v. We will use A to denote the clustering produced by our algorithms.
In a clustering C, we call an edge (u, v) a mistake if either e(u, v) = + and yet u ∉ C(v), or e(u, v) = − and u ∈ C(v). When e(u, v) = +, we call the mistake a positive mistake; otherwise it is called a negative mistake. We denote the total number of mistakes made by a clustering C by m_C, and use m_OPT to denote the number of mistakes made by OPT.
For positive real numbers x, y and z, we use x ∈ y ± z to denote x ∈ [y − z, y + z]. Finally, for X ⊆ V, let X̄ denote the complement (V \ X).
`
3. NP-completeness

In this section, we will prove that the problem of minimizing disagreements, or equivalently, maximizing agreements, is NP-complete. It is easy to see that the decision version of this problem (viz. is there a clustering with at most z disagreements?) is in NP, since we can easily check the number of disagreements given a clustering. Also, if we allow arbitrary weights on edges with the goal of minimizing weighted disagreements, then a simple reduction from the Multiway Cut problem proves NP-hardness: simply put a −∞-weight edge between every pair of terminals; then the value of the multiway cut is equal to the value of weighted disagreements. We use this reduction to give a hardness of approximation result for the weighted case in Section 7.
We give a proof of NP-hardness for the unweighted case by reducing the problem of Partition into Triangles (GT11 in Garey and Johnson, 2000) to the problem of minimizing disagreements. The reader who is not especially interested in NP-completeness proofs should feel free to skip this section.
The Partition into Triangles problem is described as follows: Given a graph G with n = 3k vertices, does there exist a partition of the vertices into k sets V1, . . . , Vk, such that for all i, |Vi| = 3 and the vertices in Vi form a triangle?
Given a graph G = (V, E), we first transform it into a complete graph G′ on the same vertex set V. An edge in G′ is weighted +1 if it is an edge in G and −1 otherwise.
Let A be an algorithm that, given a graph, outputs a clustering that minimizes the number of mistakes. First notice that if we impose the additional constraint that all clusters produced by A should be of size at most 3, then given the graph G′, the algorithm will produce a partition into triangles if the graph admits one. This is because if the graph admits a partition into triangles, then the clustering corresponding to this triangulation has no negative mistakes, and any other clustering with clusters of size at most 3 has more positive mistakes than this clustering. Thus we could use such an algorithm to solve the Partition into Triangles problem.
We will now design a gadget that forces the optimal clustering to contain at most 3 vertices from G′ in each cluster. In particular, we will augment the graph G′ to a larger complete graph H, such that in the optimal clustering on H, each cluster contains at most 3 vertices from G′.
The construction of H is as follows: In addition to the vertices and edges of G′, for every 3-tuple {u, v, w} ⊂ G′, H contains a clique C_{u,v,w} containing n^6 vertices. All edges inside
`
`
these cliques have weight +1. Edges between vertices belonging to two different cliques have weight −1. Furthermore, for all u, v, w ∈ G′, each vertex in C_{u,v,w} has a positive edge to u, v and w, and a negative edge to all other vertices in G′.
Now assume that G admits a triangulation and let us examine the behavior of algorithm A on graph H. Let N = n^6·(n choose 3).
`
Lemma 1. Given H as input, in any clustering that A outputs, every cluster contains at most three vertices of G′.
Proof: First consider a clustering C of the following form:

1. There are (n choose 3) clusters.
2. Each cluster contains exactly one clique C_{u,v,w} and some vertices of G′.
3. Every vertex u ∈ G′ is in the same cluster as C_{u,v,w} for some v and w.

In any such clustering, there are no mistakes among edges between cliques. The only mistakes are between vertices of G′ and the cliques, and those between the vertices of G′. The number of mistakes of this clustering is at most n^7((n choose 2) − 1) + (n choose 2), because each vertex in G′ has n^6 positive edges to (n choose 2) cliques and is clustered with only one of them.
Now consider a clustering in which some cluster has four vertices of G′, say u, v, w and y. We show that this clustering has at least n^7((n choose 2) − 1) + n^6/2 mistakes. Call this clustering X. Firstly, without loss of generality we can assume that each cluster in X has size at most n^6 + n^4, otherwise there are at least Ω(n^10) negative mistakes within a cluster. This implies that each vertex in G′ makes at least (n choose 2)·n^6 − (n^6 + n^4) positive mistakes. Hence the total number of positive mistakes is at least n^7((n choose 2) − 1) − n^5. Let X_u be the cluster containing the vertices u, v, w, y ∈ G′. Since X_u has at most n^6 + n^4 vertices, at least one of u, v, w, y will have at most n^4 positive edges inside X_u and hence will contribute at least an additional n^6 − n^4 negative mistakes to the clustering. Thus the total number of mistakes is at least ((n choose 2) − 1)n^7 − n^5 + n^6 − n^4 ≥ n^7((n choose 2) − 1) + n^6/2. Thus the result follows.
`
The above lemma shows that the clustering produced by A will have at most 3 vertices of G in each cluster. Thus we can use the algorithm A to solve the Partition into Triangles problem, and the reduction is complete.
`
4. A constant factor approximation for minimizing disagreements

As a warm-up to the general case, we begin by giving a very simple 3-approximation to the best clustering containing two clusters. That is, if the best two-cluster partition of the graph has x mistakes, then the following algorithm will produce one with at most 3x mistakes.
Let OPT(2) be the best clustering containing two clusters, and let the corresponding clusters be C1 and C2. Our algorithm simply considers all two-cluster partitions of the form {N+(v), N−(v)} for v ∈ V. Of these, it outputs the one that minimizes the number of mistakes.

Theorem 2. The number of mistakes of the clustering output by the algorithm stated above is at most m_A ≤ 3m_OPT(2).
`
`
Proof: Let’s say an edge is “bad” if OPT(2) disagrees with it, and define the “bad degree” of a vertex to be the number of bad edges incident to it. Clearly, if there is a vertex that has no bad edges incident to it, the clustering produced by that vertex would be the same as {C1, C2}, and we are done with as many mistakes as m_OPT(2).
Otherwise, let v be a vertex with minimum bad degree d, and without loss of generality, let v ∈ C1. Consider the partition {N+(v), N−(v)}. Let X be the set of bad neighbors of v: the d vertices that are in the wrong set of the partition with respect to {C1, C2}. The total number of extra mistakes due to this set X (other than the mistakes already made by OPT(2)) is at most dn. However, since all vertices have bad degree at least d, m_OPT(2) ≥ nd/2. So, the number of extra mistakes made by taking the partition {N+(v), N−(v)} is at most 2m_OPT(2). This proves the theorem.
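The warm-up algorithm above can be transcribed directly (our sketch; `labels` maps each pair (u, v) with u < v to '+' or '-'):

```python
def best_neighborhood_split(n, labels):
    """Try the 2-clustering {N+(v), N-(v)} for every v and keep the one
    with fewest mistakes; by Theorem 2 this is a 3-approximation to the
    best clustering with two clusters."""
    def mistakes(side):
        # a mistake is a '-' edge within a side or a '+' edge across sides
        return sum((s == '-') == (side[u] == side[v])
                   for (u, v), s in labels.items())
    best_m, best_side = None, None
    for v in range(n):
        # side 0 is N+(v): v together with its positive neighbors
        side = [0 if u == v or labels.get((min(u, v), max(u, v))) == '+' else 1
                for u in range(n)]
        m = mistakes(side)
        if best_m is None or m < best_m:
            best_m, best_side = m, side
    return best_m, best_side
```

On a graph whose best two-cluster partition is perfect, some vertex has bad degree zero and the sketch recovers that partition exactly.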
`
We now describe our main algorithm: a constant-factor approximation for minimizing the number of disagreements.
The high-level idea of the algorithm is as follows. First, we show (Lemmas 3 and 4) that if we can cluster a portion of the graph using clusters that each look sufficiently “clean” (Definition 1), then we can charge off the mistakes made within that portion to “erroneous triangles”: triangles with two + edges and one − edge. Furthermore, we can do this in such a way that the triangles we charge are nearly edge-disjoint, allowing us to bound the number of these mistakes by a constant factor of OPT. Second, we show (Lemma 6) that there must exist a nearly optimal clustering OPT′ in which all non-singleton clusters are “clean”. Finally, we show (Theorem 7 and Lemma 11) that we can algorithmically produce a clustering of the entire graph containing only clean clusters and singleton clusters, such that the mistakes that have an endpoint in a singleton cluster are bounded by OPT′, and the mistakes with both endpoints in clean clusters are bounded using Lemma 4.
We begin by showing a lower bound for OPT. We call a triangle “erroneous” if it contains two positive edges and one negative edge. A fractional packing of erroneous triangles is a set of erroneous triangles {T_1, . . . , T_m} together with a positive real number r_i associated with each triangle T_i, such that for every edge e ∈ E, Σ_{i : e ∈ T_i} r_i ≤ 1.

Lemma 3. Given any fractional packing of erroneous triangles {r_1, . . . , r_m}, we have Σ_i r_i ≤ m_OPT.

Proof: Let M be the set of mistakes made by OPT. Then m_OPT = Σ_{e ∈ M} 1 ≥ Σ_{e ∈ M} Σ_{i : e ∈ T_i} r_i = Σ_i |M ∩ T_i| r_i, by the definition of a fractional packing. Now, for each T_i we must have |M ∩ T_i| ≥ 1, because OPT must make at least one mistake on each erroneous triangle. This gives us the result.
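A crude but concrete instance of this lower bound (our sketch, not the packing constructed in Lemma 4) takes r_i = 1 on a greedily chosen maximal edge-disjoint set of erroneous triangles; by Lemma 3 the count is a lower bound on m_OPT:

```python
from itertools import combinations

def erroneous_triangle_lower_bound(n, labels):
    """Greedily pick edge-disjoint triangles with two '+' edges and one
    '-' edge; their number lower-bounds OPT's disagreements (Lemma 3)."""
    used, count = set(), 0
    for t in combinations(range(n), 3):
        edges = list(combinations(t, 2))  # the three edges of triangle t
        if any(e in used for e in edges):
            continue  # keep the packing edge-disjoint
        if sorted(labels[e] for e in edges) == ['+', '+', '-']:
            used.update(edges)
            count += 1
    return count
```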
`
Next we give a definition of a “clean” cluster and a “good” vertex.

Definition 1. A vertex v is called δ-good with respect to C, where C ⊆ V, if it satisfies the following:
– |N+(v) ∩ C| ≥ (1 − δ)|C|
– |N+(v) ∩ (V \ C)| ≤ δ|C|

If a vertex v is not δ-good with respect to (w.r.t.) C, then it is called δ-bad w.r.t. C. Finally, a set C is δ-clean if all v ∈ C are δ-good w.r.t. C.
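Definition 1 translates directly into code (our sketch; `pos_nbrs[v]` is the set N+(v), which by the convention of Section 2 contains v itself):

```python
def is_delta_good(v, cluster, pos_nbrs, delta):
    """v is delta-good w.r.t. cluster if N+(v) covers all but a delta
    fraction of the cluster, and few positive neighbors lie outside it."""
    inside = len(pos_nbrs[v] & cluster)
    outside = len(pos_nbrs[v] - cluster)
    return (inside >= (1 - delta) * len(cluster)
            and outside <= delta * len(cluster))

def is_delta_clean(cluster, pos_nbrs, delta):
    """A cluster is delta-clean if every member is delta-good w.r.t. it."""
    return all(is_delta_good(v, cluster, pos_nbrs, delta) for v in cluster)
```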
`
We now present two key lemmas.
Lemma 4. Given a clustering of V in which all clusters are δ-clean for some δ ≤ 1/4, there exists a fractional packing {T_i, r_i}_{i=1}^m such that the number of mistakes made by this clustering is at most 4Σ_i r_i.

Proof: Let the clustering on V be (C_1, . . . , C_k). First consider the case where the number of negative mistakes, m⁻_C, is at least half the total number of mistakes m_C. We will construct a fractional packing of erroneous triangles with Σ_i r_i ≥ (1/2)m⁻_C ≥ (1/4)m_C.
Pick a negative edge (u, v) ∈ C_i × C_i that has not been considered so far. We will pick a vertex w ∈ C_i such that both (u, w) and (v, w) are positive, and associate (u, v) with the erroneous triangle (u, v, w) (see figure 1). We now show that for all (u, v), such a w can always be picked so that no other negative edge (u′, v) or (u, v′) (i.e., one sharing u or v) also picks w.
Since C_i is δ-clean, neither u nor v has more than δ|C_i| negative neighbors inside C_i. Thus (u, v) has at least (1 − 2δ)|C_i| vertices w such that both (u, w) and (v, w) are positive. Moreover, at most 2δ|C_i| − 2 of these could have already been chosen by other negative edges (u, v′) or (u′, v). Thus (u, v) has at least (1 − 4δ)|C_i| + 2 choices of w that satisfy the required condition. Since δ ≤ 1/4, (u, v) will always be able to pick such a w. Let T_{uvw} denote the erroneous triangle on u, v, w.
Note that any positive edge (v, w) can be chosen at most 2 times by the above scheme: once for negative mistakes on v, and possibly again for negative mistakes on w. Thus we can give a value of r_{uvw} = 1/2 to each erroneous triangle picked, ensuring that Σ_{i : (v,w) ∈ T_i} r_i ≤ 1. Now, since we pick a triangle for each negative mistake, we get Σ_i r_i = (1/2)Σ_i 1 ≥ (1/2)m⁻_C.

Figure 1. Construction of a triangle packing for Lemma 4.
`
`
Next, consider the case when at least half the mistakes are positive mistakes. Just as above, we will associate mistakes with erroneous triangles. We will start afresh, without taking into account the labelings from the previous part.
Consider a positive edge between u ∈ C_i and v ∈ C_j, with |C_i| ≥ |C_j|. Pick a w ∈ C_i such that (u, w) is positive and (v, w) is negative (see figure 1). As before, there will be at least |C_i| − δ(|C_i| + |C_j|) such vertices, and at most δ(|C_i| + |C_j|) of them will already be taken. Thus, there are at least |C_i| − 2δ(|C_i| + |C_j|) ≥ |C_i|(1 − 4δ) > 0 choices for w. Moreover, only the positive edge (u, w) can be chosen twice (once as (u, w) and once as (w, u)). Thus, as before, to obtain a packing we can give a fractional value of r_{uvw} = 1/2 to the triangle T_{uvw}. We get Σ_i r_i = (1/2)Σ_i 1 ≥ (1/2)m⁺_C.
Now, depending on whether there are more negative mistakes or more positive mistakes, we can choose the triangles appropriately, and hence account for at least a quarter of the total mistakes in the clustering.
`
Lemma 4 along with Lemma 3 gives us the following corollary.

Corollary 5. Any clustering in which all clusters are δ-clean for some δ ≤ 1/4 has at most 4m_OPT mistakes.
`
Lemma 6. There exists a clustering OPT′ in which each non-singleton cluster is δ-clean, and m_OPT′ ≤ (9/δ² + 1)m_OPT.

Proof: Consider the following procedure applied to the clustering of OPT, and call the resulting clustering OPT′.

Procedure δ-Clean-Up. Let C_1^OPT, C_2^OPT, . . . , C_k^OPT be the clusters in OPT.
1. Let S = ∅.
2. For i = 1, . . . , k do:
(a) If the number of δ/3-bad vertices in C_i^OPT is more than (δ/3)|C_i^OPT|, then S = S ∪ C_i^OPT and C′_i = ∅. We call this “dissolving” the cluster.
(b) Else, let B_i denote the δ/3-bad vertices in C_i^OPT. Then S = S ∪ B_i and C′_i = C_i^OPT \ B_i.
3. Output the clustering OPT′: C′_1, . . . , C′_k, {x}_{x ∈ S}.
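Procedure δ-Clean-Up can be sketched as follows (our illustration; `pos_nbrs[v]` is the set N+(v), including v, and the δ-goodness test is Definition 1):

```python
def delta_clean_up(clusters, pos_nbrs, delta):
    """Dissolve any cluster with more than a delta/3 fraction of
    delta/3-bad vertices; otherwise move only the bad vertices to S."""
    def good(v, c, d):  # Definition 1: is v d-good w.r.t. cluster c?
        return (len(pos_nbrs[v] & c) >= (1 - d) * len(c)
                and len(pos_nbrs[v] - c) <= d * len(c))
    cleaned, singletons = [], set()
    for c in clusters:
        bad = {v for v in c if not good(v, c, delta / 3)}
        if len(bad) > (delta / 3) * len(c):
            singletons |= c            # step 2(a): dissolve the cluster
        else:
            singletons |= bad          # step 2(b): evict only bad vertices
            cleaned.append(c - bad)
    return cleaned, singletons         # non-singleton clusters, then S
```

By Lemma 6, each surviving cluster in `cleaned` is δ-clean, and the mistakes of the resulting clustering are within a (9/δ² + 1) factor of OPT’s.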
`
We will prove that m_OPT and m_OPT′ are closely related.
We first show that each C′_i is δ-clean. Clearly, this holds if C′_i = ∅. Now if C′_i is non-empty, we know that |C′_i| ≥ |C_i^OPT|(1 − δ/3). For each point v ∈ C′_i, we have

|N+(v) ∩ C′_i| ≥ (1 − δ/3)|C_i^OPT| − (δ/3)|C_i^OPT| = (1 − 2δ/3)|C_i^OPT| > (1 − δ)|C′_i|
`
`
Similarly, counting the positive neighbors of v inside C_i^OPT \ C′_i and outside C_i^OPT, we get

|N+(v) \ C′_i| ≤ (δ/3)|C_i^OPT| + (δ/3)|C_i^OPT| ≤ ((2δ/3)/(1 − δ/3))|C′_i| < δ|C′_i|   (as δ < 1)

Thus each C′_i is δ-clean.
We now account for the number of mistakes. If we dissolve some C_i^OPT, then clearly the number of mistakes associated with vertices in the original cluster C_i^OPT is at least (δ/3)²|C_i^OPT|²/2, while the mistakes added due to dissolving the cluster are at most |C_i^OPT|²/2. If C_i^OPT was not dissolved, then the original mistakes in C_i^OPT were at least (δ/3)|C_i^OPT||B_i|/2, while the mistakes added by the procedure are at most |B_i||C_i^OPT|. Noting that 6/δ < 9/δ², the lemma follows.

For the clustering OPT′ given by the above lemma, we use C′_i to denote the non-singleton clusters and S to denote the set of singleton clusters. We will now describe Algorithm Cautious, which tries to find clusters similar to those of OPT′. Throughout the rest of this section, we assume that δ = 1/44.
`
Algorithm Cautious

1. Pick an arbitrary vertex v and do the following:
(a) Let A(v) = N+(v).
(b) (Vertex Removal Step): While ∃x ∈ A(v) such that x is 3δ-bad w.r.t. A(v), set A(v) = A(v) \ {x}.
(c) (Vertex Addition Step): Let Y = {y : y ∈ V, y is 7δ-good w.r.t. A(v)}. Let A(v) = A(v) ∪ Y.³
2. Delete A(v) from the set of vertices and repeat until no vertices are left or until all the produced sets A(v) are empty. In the latter case, output the remaining vertices as singleton nodes.
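A direct transcription of Algorithm Cautious (our sketch; `pos_nbrs[v]` is N+(v) including v, the pivot choice is arbitrary and here simply the smallest remaining vertex, and for simplicity the sketch outputs all remaining vertices as singletons as soon as one pivot yields an empty A(v)):

```python
def cautious(vertices, pos_nbrs, delta=1/44):
    """Grow a cluster from a pivot's positive neighborhood: prune
    3*delta-bad vertices, then add all 7*delta-good vertices."""
    remaining = set(vertices)
    clusters = []
    while remaining:
        v = min(remaining)                   # arbitrary pivot
        A = pos_nbrs[v] & remaining          # step (a): A(v) = N+(v)
        def good(x, factor):                 # Definition 1 w.r.t. A
            inside = len(pos_nbrs[x] & A)
            outside = len((pos_nbrs[x] & remaining) - A)
            return (inside >= (1 - factor) * len(A)
                    and outside <= factor * len(A))
        # step (b), Vertex Removal: drop 3*delta-bad vertices one by one
        removed = True
        while removed:
            removed = False
            for x in sorted(A):
                if not good(x, 3 * delta):
                    A.discard(x)
                    removed = True
                    break
        if not A:                            # A(v) came up empty:
            clusters.extend({u} for u in remaining)   # rest are singletons
            break
        # step (c), Vertex Addition: add every 7*delta-good vertex
        A |= {y for y in remaining if good(y, 7 * delta)}
        clusters.append(A)
        remaining -= A
    return clusters
```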
`
Call the clusters output by algorithm Cautious A_1, A_2, . . . . Let Z be the set of singleton vertices created in the final step. Our main goal will be to show that the clusters output by our algorithm satisfy the property stated below.

Theorem 7. ∀j, ∃i such that C′_j ⊆ A_i. Moreover, each A_i is 11δ-clean.
`
In order to prove this theorem, we need the following two lemmas.

Lemma 8. If v ∈ C′_i, where C′_i is a δ-clean cluster in OPT′, then any vertex w ∈ C′_i is 3δ-good w.r.t. N+(v).
`
`
Proof: As v ∈ C′_i, we have |N+(v) ∩ C′_i| ≥ (1 − δ)|C′_i| and |N+(v) \ C′_i| ≤ δ|C′_i|. So (1 − δ)|C′_i| ≤ |N+(v)| ≤ (1 + δ)|C′_i|. The same holds for w. Thus, we get the following two conditions:

|N+(w) ∩ N+(v)| ≥ (1 − 2δ)|C′_i| ≥ (1 − 3δ)|N+(v)|
|N+(w) \ N+(v)| ≤ |(N+(w) \ N+(v)) ∩ C′_i| + |N+(w) \ C′_i| ≤ 2δ|C′_i| ≤ ((2δ)/(1 − δ))|N+(v)| ≤ 3δ|N+(v)|

Thus, w is 3δ-good w.r.t. N+(v).

Lemma 9. Given an arbitrary set X, if v1 ∈ C′_i and v2 ∈ C′_j with i ≠ j, then v1 and v2 cannot both be 3δ-good w.r.t. X.
Proof: Suppose that v1 and v2 are both 3δ-good with respect to X. Then |N+(v1) ∩ X| ≥ (1 − 3δ)|X| and |N+(v2) ∩ X| ≥ (1 − 3δ)|X|, hence |N+(v1) ∩ N+(v2) ∩ X| ≥ (1 − 6δ)|X|, which implies that

|N+(v1) ∩ N+(v2)| ≥ (1 − 6δ)|X|    (1)

Also, since v1 and v2 lie in δ-clean clusters C′_i and C′_j in OPT′ respectively, |N+(v1) \ C′_i| ≤ δ|C′_i|, |N+(v2) \ C′_j| ≤ δ|C′_j|, and C′_i ∩ C′_j = ∅. It follows that

|N+(v1) ∩ N+(v2)| ≤ δ(|C′_i| + |C′_j|)    (2)

Now notice that |C′_i| ≤ |N+(v1) ∩ C′_i| + δ|C′_i| ≤ |N+(v1) ∩ X ∩ C′_i| + |N+(v1) \ X| + δ|C′_i| ≤ |N+(v1) ∩ X ∩ C′_i| + 3δ|X| + δ|C′_i| ≤ (1 + 3δ)|X| + δ|C′_i|. So |C′_i| ≤ ((1 + 3δ)/(1 − δ))|X|. The same holds for C′_j. Using Eq. (2), |N+(v1) ∩ N+(v2)| ≤ 2δ((1 + 3δ)/(1 − δ))|X|.
However, since δ < 1/9, we have 2δ(1 + 3δ) < (1 − 6δ)(1 − δ). Thus the above bound along with Eq. (1) gives a contradiction, and the result follows.
`
This gives us the following important corollary.

Corollary 10. After every application of the removal step 1b of the algorithm, no two vertices from distinct C′_i and C′_j can be present in A(v).
`
`
Now we go on to prove Theorem 7.

Proof of Theorem 7: We will first show that each A_i is either a subset of S or contains exactly one of the clusters C′_j. The first part of the theorem will follow.
We proceed by induction on i. Consider the inductive step. For a cluster A_i, let A′_i be the set produced after the vertex removal phase, such that the cluster A_i is obtained by applying the vertex addition phase to A′_i. We have two cases. First, we consider the case when A′_i ⊆ S. Now during the vertex addition step, no vertex u ∈ C′_j can enter A′_i for any j. This follows because, since C′_j is δ-clean and disjoint from A′_i, for u to enter we would need both δ|C′_j| ≥ (1 − 7δ)|A′_i| and (1 − δ)|C′_j| ≤ 7δ|A′_i|, and these two conditions cannot be satisfied simultaneously. Thus A_i ⊆ S.
In the second case, some u ∈ C′_j is present in A′_i. However, in this case observe that from Corollary 10, no vertex from C′_k can be present in A′_i for any k ≠ j. Also, by the same reasoning as for the case A′_i ⊆ S, no vertex from C′_k will enter A′_i in the vertex addition phase. Now it only remains to show that C′_j ⊆ A_i. Note that all vertices of C′_j are still present in the remaining graph G \ (∪_{ℓ<i} A_ℓ).
Since u was not removed from A′_i, it follows that many vertices from C′_j are present in A′_i. In particular, |N+(u) ∩ A′_i| ≥ (1 − 3δ)|A′_i| and |N+(u) \ A′_i| ≤ 3δ|A′_i|. Now (1 − δ)|C′_j| ≤ |N+(u)| implies that |C′_j| ≤ ((1 + 3δ)/(1 − δ))|A′_i| < 2|A′_i|. Also, |A′_i ∩ C′_j| ≥ |A′_i ∩ N+(u)| − |N+(u) \ C′_j| ≥ |A′_i ∩ N+(u)| − δ|C′_j|. So we have |A′_i ∩ C′_j| ≥ (1 − 5δ)|A′_i|.
We now show that all remaining vertices from C′_j will enter A_i during the vertex addition phase. For w ∈ C′_j such that w ∉ A′_i, the bounds |A′_i \ C′_j| ≤ 5δ|A′_i| and |N+(w) \ C′_j| ≤ δ|C′_j| together imply that |A′_i \ N+(w)| ≤ 5δ|A′_i| + δ|C′_j| ≤ 7δ|A′_i|. The same holds for |N+(w) \ A′_i|. So w is 7δ-good w.r.t. A′_i and will be added in the Vertex Addition step. Thus we have shown that A(v) can contain C′_j for at most one j, and in fact will contain this set entirely.
Next, we will show that for every j there exists an i such that C′_j ⊆ A_i. Let the vertex v chosen in Step 1 of the algorithm be such that v ∈ C′_j. We show that during the vertex removal step, no vertex from N+(v) ∩ C′_j is removed. The proof follows by an easy induction on the number r of vertices removed so far in the vertex removal step. The base case (r = 0) follows from Lemma 8, since every vertex in C′_j is 3δ-good with respect to N+(v). For the induction step, observe that since no vertex from N+(v) ∩ C′_j has been removed thus far, every vertex in C′_j is still 3δ-good w.r.t. the intermediate A(v) (by mimicking the proof of Lemma 8 with N+(v) replaced by A(v)). Thus A′_i contains at least (1 − δ)|C′_j| vertices of C′_j at the end of the vertex removal phase, and hence by the second case above, C′_j ⊆ A_i after the vertex addition phase.
Finally, we show that every non-singleton cluster A_i is 11δ-clean. We know that at the end of the vertex removal phase, every x ∈ A′_i is 3δ-good w.r.t. A′_i. Thus |N+(x) \ A′_i| ≤ 3δ|A′_i|, so the total number of positive edges leaving A′_i is at most 3δ|A′_i|². Since, in the vertex addition step, we add only vertices that are 7δ-good w.r.t. A′_i, the number of added vertices can be at most 3δ|A′_i|²/((1 − 7δ)|A′_i|) < 4δ|A′_i|. Thus |A_i| < (1 + 4δ)|A′_i|. Since all vertices v in A_i are at least 7δ-good w.r.t. A′_i, we have |N+(v) ∩ A_i| ≥ (1 − 7δ)|A′_i| ≥ ((1 − 7δ)/(1 + 4δ))|A_i| ≥ (1 − 11δ)|A_i|. Similarly, |N+(v) \ A_i| ≤ 7δ|A′_i| ≤ 11δ|A_i|. This gives us the result.
`
Now we are ready to bound the mistakes of A in terms of OPT and OPT′. Call mistakes that have both end points in some clusters A_i and A_j as internal