Machine Learning, 56, 89–113, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

Correlation Clustering∗

NIKHIL BANSAL  nikhil@cs.cmu.edu
AVRIM BLUM  avrim@cs.cmu.edu
SHUCHI CHAWLA  shuchi@cs.cmu.edu
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Editors: Nina Mishra and Rajeev Motwani

∗This research was supported in part by NSF grants CCR-0085982, CCR-0122581, CCR-0105488, and an IBM Graduate Fellowship.
Abstract. We consider the following clustering problem: we have a complete graph on n vertices (items), where each edge (u, v) is labeled either + or − depending on whether u and v have been deemed to be similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of + edges within clusters, plus the number of − edges between clusters (equivalently, minimizes the number of disagreements: the number of − edges inside clusters plus the number of + edges between clusters). This formulation is motivated by a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of “agnostic learning” problem.

An interesting feature of this clustering formulation is that one does not need to specify the number of clusters k as a separate parameter, as in measures such as k-median or min-sum or min-max clustering. Instead, in our formulation, the optimal number of clusters could be any value between 1 and n, depending on the edge labels. We look at approximation algorithms for both minimizing disagreements and for maximizing agreements. For minimizing disagreements, we give a constant factor approximation. For maximizing agreements we give a PTAS, building on ideas of Goldreich, Goldwasser, and Ron (1998) and de la Vega (1996). We also show how to extend some of these results to graphs with edge labels in [−1, +1], and give some results for the case of random noise.
Keywords: clustering, approximation algorithm, document classification

1. Introduction
Suppose that you are given a set of n documents to cluster into topics. Unfortunately, you have no idea what a “topic” is. However, you have at your disposal a classifier f(A, B) that, given two documents A and B, outputs whether or not it believes A and B are similar to each other. For example, perhaps f was learned from some past training data. In this case, a natural approach to clustering is to apply f to every pair of documents in your set, and then to find the clustering that agrees as much as possible with the results.

Specifically, we consider the following problem. Given a fully-connected graph G with edges labeled “+” (similar) or “−” (different), find a partition of the vertices into clusters that agrees as much as possible with the edge labels. In particular, we can look at this in
terms of maximizing agreements (the number of + edges inside clusters plus the number of − edges between clusters) or in terms of minimizing disagreements (the number of − edges inside clusters plus the number of + edges between clusters). These two are equivalent at optimality but, as usual, differ from the point of view of approximation. In this paper we give a constant factor approximation to the problem of minimizing disagreements, and a PTAS¹ for maximizing agreements. We also extend some of our results to the case of real-valued edge weights.

This problem formulation is motivated in part by a set of clustering problems at Whizbang Labs (Cohen & McCallum, 2001; Cohen & Richman, 2001, 2002) in which learning algorithms were trained to help with various clustering tasks. An example of one such problem, studied by Cohen and Richman (2001, 2002), is clustering entity names. In this problem, items are entries taken from multiple databases (e.g., think of names/affiliations of researchers), and the goal is to do a “robust uniq”—collecting together the entries that correspond to the same entity (person). E.g., in the case of researchers, the same person might appear multiple times with different affiliations, or might appear once with a middle name and once without, etc. In practice, the classifier f typically would output a probability, in which case the natural edge label is log(Pr(same)/Pr(different)). This is 0 if the classifier is unsure, positive if the classifier believes the items are more likely in the same cluster, and negative if the classifier believes they are more likely in different clusters. The case of {+, −} labels corresponds to the setting in which the classifier has equal confidence about each of its decisions.
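As a small illustration, here is a minimal sketch (in Python) of how a classifier probability could be turned into such a real-valued edge label; the function name and the clipping constant are our own choices, not from the paper:

```python
import math

def edge_label(p_same, eps=1e-9):
    """Log-odds edge label log(Pr(same)/Pr(different)) from a classifier's
    estimate of Pr(same).  Positive: more likely the same cluster; negative:
    more likely different; 0: the classifier is unsure."""
    p_same = min(max(p_same, eps), 1.0 - eps)  # clip to avoid log(0)
    return math.log(p_same / (1.0 - p_same))

# edge_label(0.5) == 0.0;  edge_label(0.9) > 0;  edge_label(0.1) < 0
```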
What is interesting about the clustering problem defined here is that, unlike most clustering formulations, we do not need to specify the number of clusters k as a separate parameter. For example, in min-sum clustering (Schulman, 2000), min-max clustering (Hochbaum & Shmoys, 1986), or k-median (Charikar & Guha, 1999; Jain & Vazirani, 2001), one can always get a perfect score by putting each node into its own cluster—the question is how well one can do with only k clusters. In our clustering formulation, there is just a single objective, and the optimal clustering might have few or many clusters: it all depends on the edge labels.

To get a feel for this problem, notice that if there exists a perfect clustering, i.e., one that gets all the edges correct, then the optimal clustering is easy to find: just delete all “−” edges and output the connected components of the remaining graph. In Cohen and Richman (2002) this is called the “naive algorithm”. Thus, the interesting case is when no clustering is perfect. Also, notice that for any graph G it is trivial to produce a clustering that agrees with at least half of the edge labels: if there are more + edges than − edges, then simply put all vertices into one big cluster; otherwise, put each vertex into its own cluster. This observation means that for maximizing agreements, getting a 2-approximation is easy (note: we will show a PTAS). In general, finding the optimal clustering is NP-hard (shown in Section 3).
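For concreteness, a minimal sketch of the naive algorithm and of this trivial baseline, assuming the graph is given as a vertex list plus a dictionary mapping each unordered pair to '+' or '-' (the representation is our choice, not the paper's):

```python
from itertools import combinations

def naive_clustering(vertices, label):
    """Delete all '-' edges and output the connected components of what remains.
    If a perfect clustering exists, this recovers it."""
    parent = {v: v for v in vertices}
    def find(v):                       # union-find with path compression
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for u, v in combinations(vertices, 2):
        if label[frozenset((u, v))] == '+':
            parent[find(u)] = find(v)
    comps = {}
    for v in vertices:
        comps.setdefault(find(v), []).append(v)
    return list(comps.values())

def trivial_half_agreements(vertices, label):
    """Agree with at least half the edge labels: one big cluster if '+' edges
    are in the majority, otherwise all singletons."""
    plus = sum(1 for e in combinations(vertices, 2) if label[frozenset(e)] == '+')
    total = len(vertices) * (len(vertices) - 1) // 2
    return [list(vertices)] if plus >= total - plus else [[v] for v in vertices]
```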
Another simple fact to notice is that if the graph contains a triangle in which two edges are labeled + and one is labeled −, then no clustering can be perfect. More generally, the number of edge-disjoint triangles of this form gives a lower bound on the number of disagreements of the optimal clustering. This fact is used in our constant-factor approximation algorithm.
For maximizing agreements, our PTAS is quite similar to the PTAS developed by de la Vega (1996) for MAXCUT on dense graphs, and related to PTASs of Arora, Karger, and Karpinski (1999) and Arora, Frieze, and Kaplan (2002). Notice that since there must exist a clustering with at least n(n − 1)/4 agreements, it suffices to approximate agreements to within an additive factor of εn^2. This problem is also closely related to work on testing graph properties of Goldreich, Goldwasser, and Ron (1998), Parnas and Ron (2002), and Alon et al. (2000). In fact, we show how we can use the General Partition Property Tester of Goldreich, Goldwasser, and Ron (1998) as a subroutine to get a PTAS with running time O(n·e^{O((1/ε)^{1/ε})}). Unfortunately, this is doubly exponential in 1/ε, so we also present an alternative direct algorithm (based more closely on the approach of de la Vega (1996)) that takes only O(n^2·e^{O(1/ε)}) time.
Relation to agnostic learning. One way to view this clustering problem is that edges are “examples” (labeled as positive or negative) and we are trying to represent the target function f using a hypothesis class of vertex clusters. This hypothesis class has limited representational power: if we want to say (u, v) and (v, w) are positive in this language, then we have to say (u, w) is positive too. So, we might not be able to represent f perfectly. This sort of problem—trying to find the (nearly) best representation of some arbitrary target f in a given limited hypothesis language—is sometimes called agnostic learning (Kearns, Schapire, & Sellie, 1994; Ben-David, Long, & Mansour, 2001). The observation that one can trivially agree with at least half the edge labels is equivalent to the standard machine learning fact that one can always achieve error at most 1/2 using either the all positive or all negative hypothesis.

Our PTAS for approximating the number of agreements means that if the optimal clustering has error rate ν, then we can find one of error rate at most ν + ε. Our running time is exponential in 1/ε, but this means that we can achieve any constant error gap in polynomial time. What makes this interesting from the point of view of agnostic learning is that there are very few problems where agnostic learning can be done in polynomial time.² Even for simple classes such as conjunctions and disjunctions, no polynomial-time algorithms are known that give even an error gap of 1/2 − ε.
Organization of this paper. We begin by describing notation in Section 2. In Section 3 we prove that the clustering problem defined here is NP-complete. Then we describe a constant factor approximation algorithm for minimizing disagreements in Section 4. In Section 5, we describe a PTAS for maximizing agreements. In Section 6, we present simple algorithms and motivation for the random noise model. Section 7 extends some of our results to the case of real-valued edge labels. Finally, subsequent work by others is briefly described in Section 8.
2. Notation and definitions

Let G = (V, E) be a complete graph on n vertices, and let e(u, v) denote the label (+ or −) of the edge (u, v). Let N+(u) = {u} ∪ {v : e(u, v) = +} and N−(u) = {v : e(u, v) = −} denote the positive and negative neighbors of u, respectively.
We let OPT denote an optimal clustering on this graph. In general, for a clustering C, let C(v) be the set of vertices in the same cluster as v. We will use A to denote the clustering produced by our algorithms.

In a clustering C, we call an edge (u, v) a mistake if either e(u, v) = + and yet u ∉ C(v), or e(u, v) = − and u ∈ C(v). When e(u, v) = +, we call the mistake a positive mistake; otherwise it is called a negative mistake. We denote the total number of mistakes made by a clustering C by mC, and use mOPT to denote the number of mistakes made by OPT.

For positive real numbers x, y and z, we use x ∈ y ± z to denote x ∈ [y − z, y + z]. Finally, for X ⊆ V, let X̄ denote the complement (V \ X).
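As a small illustration of these definitions, the number of mistakes mC of a clustering can be counted directly; a minimal sketch using the same pair-to-label dictionary representation as before (our choice of representation):

```python
from itertools import combinations

def count_mistakes(clusters, label):
    """Return (positive_mistakes, negative_mistakes) for a clustering.
    `clusters` is a list of vertex lists; `label` maps frozenset({u, v}) to '+'/'-'."""
    cluster_of = {v: i for i, cl in enumerate(clusters) for v in cl}
    pos = neg = 0
    for u, v in combinations(cluster_of, 2):
        same = cluster_of[u] == cluster_of[v]
        if label[frozenset((u, v))] == '+' and not same:
            pos += 1      # positive mistake: a '+' edge between clusters
        elif label[frozenset((u, v))] == '-' and same:
            neg += 1      # negative mistake: a '-' edge inside a cluster
    return pos, neg
```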
3. NP-completeness

In this section, we will prove that the problem of minimizing disagreements, or equivalently, maximizing agreements, is NP-complete. It is easy to see that the decision version of this problem (viz. is there a clustering with at most z disagreements?) is in NP, since we can easily check the number of disagreements given a clustering. Also, if we allow arbitrary weights on edges with the goal of minimizing weighted disagreements, then a simple reduction from the Multiway Cut problem proves NP-hardness—simply put a −∞-weight edge between every pair of terminals; then the value of the multiway cut is equal to the value of weighted disagreements. We use this reduction to give a hardness of approximation result for the weighted case in Section 7.

We give a proof of NP-hardness for the unweighted case by reducing the problem of Partition into Triangles (problem GT11 in Garey and Johnson (2000)) to the problem of minimizing disagreements. The reader who is not especially interested in NP-completeness proofs should feel free to skip this section.

The Partition into Triangles problem is described as follows: Given a graph G with n = 3k vertices, does there exist a partition of the vertices into k sets V1, . . . , Vk such that for all i, |Vi| = 3 and the vertices in Vi form a triangle?
Given a graph G = (V, E), we first transform it into a complete graph G′ on the same vertex set V. An edge in G′ is weighted +1 if it is an edge in G and −1 otherwise. Let A be an algorithm that, given a graph, outputs a clustering minimizing the number of mistakes. First notice that if we impose the additional constraint that all clusters produced by A should be of size at most 3, then given the graph G′, the algorithm will produce a partition into triangles whenever the graph admits one. This is because if the graph admits a partition into triangles, then the clustering corresponding to this triangulation has no negative mistakes, and any other clustering with clusters of size at most 3 has more positive mistakes than this clustering. Thus we could use such an algorithm to solve the Partition into Triangles problem.

We will now design a gadget that forces the optimal clustering to contain at most 3 vertices of G′ in each cluster. In particular, we will augment the graph G′ to a larger complete graph H, such that in the optimal clustering on H, each cluster contains at most 3 vertices from G′.

The construction of H is as follows: in addition to the vertices and edges of G′, for every 3-tuple {u, v, w} ⊂ G′, H contains a clique Cu,v,w on n^6 vertices. All edges inside
these cliques have weight +1. Edges between vertices belonging to two different cliques have weight −1. Furthermore, for all u, v, w ∈ G′, each vertex in Cu,v,w has a positive edge to u, v and w, and a negative edge to all other vertices in G′.

Now assume that G admits a triangulation, and let us examine the behavior of algorithm A on the graph H. Let N = n^6·(n choose 3).
Lemma 1. Given H as input, in any clustering that A outputs, every cluster contains at most three vertices of G′.

Proof: First consider a clustering C of the following form:

1. There are (n choose 3) clusters.
2. Each cluster contains exactly one clique Cu,v,w and some vertices of G′.
3. Every vertex u ∈ G′ is in the same cluster as Cu,v,w for some v and w.

In any such clustering, there are no mistakes among edges between cliques. The only mistakes are between vertices of G′ and the cliques, and those between the vertices of G′. The number of mistakes of this clustering is at most n^7·((n choose 2) − 1) + (n choose 2), because each vertex in G′ has n^6 positive edges to (n choose 2) cliques and is clustered with only one of them.

Now consider a clustering in which some cluster has four vertices of G′, say u, v, w and y. We show that this clustering has at least n^7·((n choose 2) − 1) + n^6/2 mistakes. Call this clustering X. Firstly, without loss of generality we can assume that each cluster in X has size at most n^6 + n^4, otherwise there are at least Ω(n^10) negative mistakes within a cluster. This implies that each vertex in G′ makes at least (n choose 2)·n^6 − (n^6 + n^4) positive mistakes. Hence the total number of positive mistakes is at least n^7·((n choose 2) − 1) − n^5. Let Xu be the cluster containing the vertices u, v, w, y ∈ G′. Since Xu has at most n^6 + n^4 vertices, at least one of u, v, w, y will have at most n^4 positive edges inside Xu and hence will contribute at least an additional n^6 − n^4 negative mistakes to the clustering. Thus the total number of mistakes is at least n^7·((n choose 2) − 1) − n^5 + n^6 − n^4 ≥ n^7·((n choose 2) − 1) + n^6/2. Thus the result follows.

The above lemma shows that the clustering produced by A will have at most 3 vertices of G′ in each cluster. Thus we can use the algorithm A to solve the Partition into Triangles problem, and the reduction is complete.
4. A constant factor approximation for minimizing disagreements

As a warm-up to the general case, we begin by giving a very simple 3-approximation to the best clustering containing two clusters. That is, if the best two-cluster partition of the graph has x mistakes, then the following algorithm will produce one with at most 3x mistakes. Let OPT(2) be the best clustering containing two clusters, and let the corresponding clusters be C1 and C2. Our algorithm simply considers all partitions of the form {N+(v), N−(v)} for v ∈ V. Of these, it outputs the one that minimizes the number of mistakes.
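A minimal sketch of this warm-up algorithm: it tries the partition {N+(v), N−(v)} for every vertex v and keeps the cheapest one (the helper `count_mistakes` is the mistake-counting sketch from Section 2, passed in so that the snippet stays self-contained):

```python
def two_cluster_approx(vertices, label, count_mistakes):
    """3-approximation to the best clustering with (at most) two clusters."""
    best, best_cost = None, float('inf')
    for v in vertices:
        n_plus = [u for u in vertices if u == v or label[frozenset((u, v))] == '+']
        n_minus = [u for u in vertices if u != v and label[frozenset((u, v))] == '-']
        clusters = [n_plus, n_minus] if n_minus else [n_plus]
        cost = sum(count_mistakes(clusters, label))
        if cost < best_cost:
            best, best_cost = clusters, cost
    return best, best_cost
```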
Theorem 2. The number of mistakes of the clustering output by the algorithm stated above satisfies mA ≤ 3·mOPT(2).
Proof: Let us say an edge is “bad” if OPT(2) disagrees with it, and define the “bad degree” of a vertex to be the number of bad edges incident to it. Clearly, if there is a vertex with no bad edges incident to it, the clustering produced by that vertex would be the same as {C1, C2}, and we are done with as many mistakes as mOPT(2).

Otherwise, let v be a vertex with minimum bad degree d, and without loss of generality let v ∈ C1. Consider the partition {N+(v), N−(v)}. Let X be the set of bad neighbors of v—the d vertices that are in the wrong set of the partition with respect to {C1, C2}. The total number of extra mistakes due to this set X (other than the mistakes already made by OPT(2)) is at most dn. However, since all vertices have bad degree at least d, mOPT(2) ≥ nd/2. So, the number of extra mistakes made by taking the partition {N+(v), N−(v)} is at most 2·mOPT(2). This proves the theorem.
We now describe our main algorithm: a constant-factor approximation for minimizing the number of disagreements.

The high-level idea of the algorithm is as follows. First, we show (Lemmas 3 and 4) that if we can cluster a portion of the graph using clusters that each look sufficiently “clean” (Definition 1), then we can charge off the mistakes made within that portion to “erroneous triangles”: triangles with two + edges and one − edge. Furthermore, we can do this in such a way that the triangles we charge are nearly edge-disjoint, allowing us to bound the number of these mistakes by a constant factor of OPT. Second, we show (Lemma 6) that there must exist a nearly optimal clustering OPT′ in which all non-singleton clusters are “clean”. Finally, we show (Theorem 7 and Lemma 11) that we can algorithmically produce a clustering of the entire graph containing only clean clusters and singleton clusters, such that mistakes that have an endpoint in singleton clusters are bounded by OPT′, and mistakes with both endpoints in clean clusters are bounded using Lemma 4.
We begin by showing a lower bound for OPT. We call a triangle “erroneous” if it contains two positive edges and one negative edge. A fractional packing of erroneous triangles is a set of erroneous triangles {T1, . . . , Tm} together with positive real numbers ri associated with each triangle Ti, such that for every edge e ∈ E, Σ_{i : e ∈ Ti} ri ≤ 1.

Lemma 3. Given any fractional packing of erroneous triangles {r1, . . . , rm}, we have Σ_i ri ≤ mOPT.

Proof: Let M be the set of mistakes made by OPT. Then mOPT = Σ_{e ∈ M} 1 ≥ Σ_{e ∈ M} Σ_{i : e ∈ Ti} ri, by the definition of a fractional packing. So we have mOPT ≥ Σ_i |M ∩ Ti|·ri. Now, for each Ti we must have |M ∩ Ti| ≥ 1, because OPT must make at least one mistake on each erroneous triangle. This gives us the result.
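The special case of an integral packing (each ri = 1, triangles edge-disjoint) already gives a lower bound that is easy to compute; a greedy sketch (the greedy order is our own choice and is only an illustration of Lemma 3, not the packing used in the analysis):

```python
from itertools import combinations

def edge_disjoint_erroneous_triangles(vertices, label):
    """Greedily collect edge-disjoint erroneous triangles (two '+' edges, one '-').
    By Lemma 3 with all r_i = 1, their number lower-bounds mOPT."""
    used = set()                       # edges already consumed by some triangle
    count = 0
    for tri in combinations(vertices, 3):
        edges = [frozenset(p) for p in combinations(tri, 2)]
        if any(e in used for e in edges):
            continue
        if sorted(label[e] for e in edges) == ['+', '+', '-']:
            used.update(edges)
            count += 1
    return count
```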
Next we give a definition of a “clean” cluster and a “good” vertex.

Definition 1. A vertex v is called δ-good with respect to C, where C ⊆ V, if it satisfies the following:
– |N+(v) ∩ C| ≥ (1 − δ)|C|
– |N+(v) ∩ (V \ C)| ≤ δ|C|
If a vertex v is not δ-good with respect to (w.r.t.) C, then it is called δ-bad w.r.t. C. Finally, a set C is δ-clean if all v ∈ C are δ-good w.r.t. C.
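Definition 1 translates directly into a predicate; a minimal sketch (same pair-to-label representation as before, which is our choice):

```python
def is_delta_good(v, C, vertices, label, delta):
    """Check whether v is delta-good w.r.t. C (Definition 1)."""
    C = set(C)
    n_plus = {u for u in vertices
              if u == v or label[frozenset((u, v))] == '+'}   # N+(v), including v
    return (len(n_plus & C) >= (1 - delta) * len(C)
            and len(n_plus - C) <= delta * len(C))

def is_delta_clean(C, vertices, label, delta):
    """A set C is delta-clean if every v in C is delta-good w.r.t. C."""
    return all(is_delta_good(v, C, vertices, label, delta) for v in C)
```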
We now present two key lemmas.

Lemma 4. Given a clustering of V in which all clusters are δ-clean for some δ ≤ 1/4, there exists a fractional packing {(ri, Ti)}, i = 1, . . . , m, such that the number of mistakes made by this clustering is at most 4·Σ_i ri.

Proof: Let the clustering on V be (C1, . . . , Ck). First consider the case where the number of negative mistakes (mC^−) is at least half the total number of mistakes mC. We will construct a fractional packing of erroneous triangles with Σ_i ri ≥ (1/2)mC^− ≥ (1/4)mC.

Pick a negative edge (u, v) ∈ Ci × Ci that has not been considered so far. We will pick a vertex w ∈ Ci such that both (u, w) and (v, w) are positive, and associate (u, v) with the erroneous triangle (u, v, w) (see figure 1). We now show that for every (u, v), such a w can always be picked so that no other negative edge (u′, v) or (u, v′) (i.e., one sharing u or v) also picks w.

Since Ci is δ-clean, neither u nor v has more than δ|Ci| negative neighbors inside Ci. Thus (u, v) has at least (1 − 2δ)|Ci| vertices w such that both (u, w) and (v, w) are positive. Moreover, at most 2δ|Ci| − 2 of these could have already been chosen by other negative edges (u, v′) or (u′, v). Thus (u, v) has at least (1 − 4δ)|Ci| + 2 choices of w that satisfy the required condition. Since δ ≤ 1/4, (u, v) will always be able to pick such a w. Let Tuvw denote the erroneous triangle u, v, w.

Note that any positive edge (v, w) can be chosen at most twice by the above scheme: once for a negative mistake on v and possibly again for a negative mistake on w. Thus we can give a value of ruvw = 1/2 to each erroneous triangle picked, ensuring that Σ_{i : Ti contains (v,w)} ri ≤ 1. Now, since we pick a triangle for each negative mistake, we get that Σ_{Ti} ri = (1/2)·Σ_{Ti} 1 ≥ (1/2)mC^−.

Figure 1. Construction of a triangle packing for Lemma 4.
Next, consider the case when at least half the mistakes are positive mistakes. Just as above, we will associate mistakes with erroneous triangles. We will start afresh, without taking into account the labelings from the previous part.

Consider a positive edge between u ∈ Ci and v ∈ Cj, and let |Ci| ≥ |Cj|. Pick a w ∈ Ci such that (u, w) is positive and (v, w) is negative (see figure 1). As before, there will be at least |Ci| − δ(|Ci| + |Cj|) such vertices, and at most δ(|Ci| + |Cj|) of them will already be taken. Thus, there are at least |Ci| − 2δ(|Ci| + |Cj|) ≥ |Ci|(1 − 4δ) > 0 choices for w. Moreover, only the positive edge (u, w) can be chosen twice (once as (u, w) and once as (w, u)). Thus, as before, to obtain a packing, we can give a fractional value of ruvw = 1/2 to the triangle Tuvw. We get that Σ_{Ti} ri = (1/2)·Σ_{Ti} 1 ≥ (1/2)mC^+.

Now depending on whether there are more negative mistakes or more positive mistakes, we can choose the triangles appropriately, and hence account for at least a quarter of the total mistakes in the clustering.

Lemma 4 along with Lemma 3 gives us the following corollary.

Corollary 5. Any clustering in which all clusters are δ-clean for some δ ≤ 1/4 has at most 4·mOPT mistakes.
Lemma 6. There exists a clustering OPT′ in which each non-singleton cluster is δ-clean, and mOPT′ ≤ (9/δ^2 + 1)·mOPT.

Proof: Consider the following procedure applied to the clustering of OPT, and call the resulting clustering OPT′.

Procedure δ-Clean-Up. Let C1^OPT, C2^OPT, . . . , Ck^OPT be the clusters in OPT.

1. Let S = ∅.
2. For i = 1, . . . , k do:
(a) If the number of δ/3-bad vertices in Ci^OPT is more than (δ/3)|Ci^OPT|, then set S = S ∪ Ci^OPT and Ci′ = ∅. We call this “dissolving” the cluster.
(b) Else, let Bi denote the set of δ/3-bad vertices in Ci^OPT. Then set S = S ∪ Bi and Ci′ = Ci^OPT \ Bi.
3. Output the clustering OPT′: C1′, . . . , Ck′, {x} for each x ∈ S.
We will prove that mOPT and mOPT′ are closely related.

We first show that each Ci′ is δ-clean. Clearly, this holds if Ci′ = ∅. Now if Ci′ is non-empty, we know that |Ci^OPT| ≥ |Ci′| ≥ (1 − δ/3)|Ci^OPT|. For each point v ∈ Ci′, we have:

|N+(v) ∩ Ci′| ≥ (1 − δ/3)|Ci^OPT| − (δ/3)|Ci^OPT| = (1 − 2δ/3)|Ci^OPT| > (1 − δ)|Ci′|

Similarly, counting the positive neighbors of v lying in Ci^OPT \ Ci′ and outside Ci^OPT, we get

|N+(v) \ Ci′| ≤ (δ/3)|Ci^OPT| + (δ/3)|Ci^OPT| ≤ (2δ/3)·|Ci′|/(1 − δ/3) < δ|Ci′|   (as δ < 1)

Thus each Ci′ is δ-clean.

We now account for the number of mistakes. If we dissolve some Ci^OPT, then clearly the number of mistakes associated with vertices in the original cluster Ci^OPT is at least (δ/3)^2·|Ci^OPT|^2/2, while the mistakes added due to dissolving the cluster are at most |Ci^OPT|^2/2. If Ci^OPT was not dissolved, then the original mistakes in Ci^OPT were at least (δ/3)|Ci^OPT||Bi|/2, while the mistakes added by the procedure are at most |Bi||Ci^OPT|. Noting that 6/δ < 9/δ^2, the lemma follows.
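Although the procedure is only used here as an existence argument (it starts from OPT, which we do not have), it is a concrete transformation of one clustering into another; a minimal sketch, with representation choices of our own:

```python
def delta_clean_up(clusters, vertices, label, delta):
    """Sketch of Procedure delta-Clean-Up from the proof of Lemma 6: dissolve any
    cluster with too many (delta/3)-bad vertices, strip the (delta/3)-bad vertices
    from the others, and output the stripped vertices as singletons."""
    def n_plus(v):
        return {u for u in vertices if u == v or label[frozenset((u, v))] == '+'}

    def good(v, C, d):                 # Definition 1: v is d-good w.r.t. C
        np = n_plus(v)
        return (len(np & C) >= (1 - d) * len(C)
                and len(np - C) <= d * len(C))

    singletons, cleaned = set(), []
    for C in clusters:
        C = set(C)
        bad = {v for v in C if not good(v, C, delta / 3)}
        if len(bad) > (delta / 3) * len(C):
            singletons |= C            # step 2(a): dissolve the whole cluster
        else:
            singletons |= bad          # step 2(b): remove only the bad vertices
            cleaned.append(sorted(C - bad))
    return cleaned + [[x] for x in singletons]
```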
For the clustering OPT′ given by the above lemma, we use Ci′ to denote the non-singleton clusters and S to denote the set of singleton clusters. We will now describe Algorithm Cautious, which tries to find clusters similar to those of OPT′; a code sketch follows the description. Throughout the rest of this section, we assume that δ = 1/44.

Algorithm Cautious

1. Pick an arbitrary vertex v and do the following:
(a) Let A(v) = N+(v).
(b) (Vertex Removal Step): While ∃x ∈ A(v) such that x is 3δ-bad w.r.t. A(v), set A(v) = A(v) \ {x}.
(c) (Vertex Addition Step): Let Y = {y : y ∈ V, y is 7δ-good w.r.t. A(v)}. Let A(v) = A(v) ∪ Y.³
2. Delete A(v) from the set of vertices and repeat until no vertices are left or until all the produced sets A(v) are empty. In the latter case, output the remaining vertices as singleton nodes.
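A minimal sketch of Algorithm Cautious; the tie-breaking, the decision to restrict N+ and the goodness test to the not-yet-clustered vertices, and the handling of empty A(v) are our own interpretation of the description above:

```python
def cautious(vertices, label, delta=1/44):
    """Sketch of Algorithm Cautious: start from A(v) = N+(v), prune 3*delta-bad
    vertices, add every 7*delta-good vertex, then remove A(v) and repeat."""
    def n_plus(v, universe):
        return {u for u in universe if u == v or label[frozenset((u, v))] == '+'}

    def good(v, C, universe, d):       # Definition 1: v is d-good w.r.t. C
        np = n_plus(v, universe)
        return (len(np & C) >= (1 - d) * len(C)
                and len(np - C) <= d * len(C))

    remaining, clusters = set(vertices), []
    while remaining:
        chosen = None
        for v in remaining:                           # step 1: pick a vertex v
            A = n_plus(v, remaining)                  # (a) A(v) = N+(v)
            while True:                               # (b) Vertex Removal Step
                bad = [x for x in A if not good(x, A, remaining, 3 * delta)]
                if not bad:
                    break
                A.discard(bad[0])
            # (c) Vertex Addition Step
            A |= {y for y in remaining if good(y, A, remaining, 7 * delta)}
            if A:
                chosen = A
                break
        if chosen is None:                            # every A(v) came out empty:
            clusters.extend([u] for u in remaining)   # output singletons
            break
        clusters.append(sorted(chosen))               # step 2: delete A(v), repeat
        remaining -= chosen
    return clusters
```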
Call the clusters output by algorithm Cautious A1, A2, . . . , and let Z be the set of singleton vertices created in the final step. Our main goal will be to show that the clusters output by our algorithm satisfy the property stated below.

Theorem 7. ∀j, ∃i such that Cj′ ⊆ Ai. Moreover, each Ai is 11δ-clean.
In order to prove this theorem, we need the following two lemmas.

Lemma 8. If v ∈ Ci′, where Ci′ is a δ-clean cluster in OPT′, then any vertex w ∈ Ci′ is 3δ-good w.r.t. N+(v).
Proof: As v ∈ Ci′, |N+(v) ∩ Ci′| ≥ (1 − δ)|Ci′| and |N+(v) \ Ci′| ≤ δ|Ci′|. So, (1 − δ)|Ci′| ≤ |N+(v)| ≤ (1 + δ)|Ci′|. The same holds for w. Thus, we get the following two conditions:

|N+(w) ∩ N+(v)| ≥ (1 − 2δ)|Ci′| ≥ (1 − 3δ)|N+(v)|
|N+(w) \ N+(v)| ≤ |Ci′ \ N+(v)| + |N+(w) \ Ci′| ≤ 2δ|Ci′| ≤ (2δ/(1 − δ))|N+(v)| ≤ 3δ|N+(v)|

Thus, w is 3δ-good w.r.t. N+(v).

Lemma 9. Given an arbitrary set X, if v1 ∈ Ci′ and v2 ∈ Cj′, i ≠ j, then v1 and v2 cannot both be 3δ-good w.r.t. X.

Proof: Suppose that v1 and v2 are both 3δ-good with respect to X. Then |N+(v1) ∩ X| ≥ (1 − 3δ)|X| and |N+(v2) ∩ X| ≥ (1 − 3δ)|X|, hence |N+(v1) ∩ N+(v2) ∩ X| ≥ (1 − 6δ)|X|, which implies that

|N+(v1) ∩ N+(v2)| ≥ (1 − 6δ)|X|   (1)

Also, since v1 and v2 lie in the δ-clean clusters Ci′ and Cj′ of OPT′ respectively, we have |N+(v1) \ Ci′| ≤ δ|Ci′|, |N+(v2) \ Cj′| ≤ δ|Cj′| and Ci′ ∩ Cj′ = ∅. It follows that

|N+(v1) ∩ N+(v2)| ≤ δ(|Ci′| + |Cj′|)   (2)

Now notice that |Ci′| ≤ |N+(v1) ∩ Ci′| + δ|Ci′| ≤ |N+(v1) ∩ X ∩ Ci′| + |N+(v1) \ X| + δ|Ci′| ≤ |N+(v1) ∩ X ∩ Ci′| + 3δ|X| + δ|Ci′| ≤ (1 + 3δ)|X| + δ|Ci′|. So, |Ci′| ≤ ((1 + 3δ)/(1 − δ))|X|. The same holds for Cj′. Using Eq. (2), |N+(v1) ∩ N+(v2)| ≤ 2δ·((1 + 3δ)/(1 − δ))|X|.

However, since δ < 1/9, we have 2δ(1 + 3δ) < (1 − 6δ)(1 − δ). Thus the above bound together with Eq. (1) gives a contradiction, and the result follows.

This gives us the following important corollary.

Corollary 10. After every application of the removal step 1b of the algorithm, no two vertices from distinct Ci′ and Cj′ can be present in A(v).

Now we go on to prove Theorem 7.
Proof of Theorem 7: We will first show that each Ai is either a subset of S or contains exactly one of the clusters Cj′. The first part of the theorem will follow.

We proceed by induction on i. Consider the inductive step. For a cluster Ai, let Ai′ be the set produced after the vertex removal phase, so that the cluster Ai is obtained by applying the vertex addition phase to Ai′. We have two cases. First, consider the case when Ai′ ⊆ S. During the vertex addition step, no vertex u ∈ Cj′ can enter Ai′ for any j. This follows because, since Cj′ is δ-clean and disjoint from Ai′, for u to enter we would need that δ|Cj′| ≥ (1 − 7δ)|Ai′| and (1 − δ)|Cj′| ≤ 7δ|Ai′|, and these two conditions cannot be satisfied simultaneously. Thus Ai ⊆ S.

In the second case, some u ∈ Cj′ is present in Ai′. However, in this case observe that by Corollary 10, no vertex from Ck′ can be present in Ai′ for any k ≠ j. Also, by the same reasoning as in the case Ai′ ⊆ S, no vertex from Ck′ will enter Ai′ in the vertex addition phase. Now it only remains to show that Cj′ ⊆ Ai. Note that all vertices of Cj′ are still present in the remaining graph G \ (∪_{ℓ<i} Aℓ).

Since u was not removed from Ai′, it follows that many vertices from Cj′ are present in Ai′. In particular, |N+(u) ∩ Ai′| ≥ (1 − 3δ)|Ai′| and |N+(u) \ Ai′| ≤ 3δ|Ai′|. Now (1 − δ)|Cj′| ≤ |N+(u)| implies that |Cj′| ≤ ((1 + 3δ)/(1 − δ))|Ai′| < 2|Ai′|. Also, |Ai′ ∩ Cj′| ≥ |Ai′ ∩ N+(u)| − |N+(u) \ Cj′| ≥ |Ai′ ∩ N+(u)| − δ|Cj′|. So we have |Ai′ ∩ Cj′| ≥ (1 − 5δ)|Ai′|.

We now show that all remaining vertices from Cj′ will enter Ai during the vertex addition phase. For w ∈ Cj′ such that w ∉ Ai′, the bounds |Ai′ \ Cj′| ≤ 5δ|Ai′| and |N+(w) \ Cj′| ≤ δ|Cj′| together imply that |Ai′ \ N+(w)| ≤ 5δ|Ai′| + δ|Cj′| ≤ 7δ|Ai′|. The same bound holds for |N+(w) \ Ai′|. So w is 7δ-good w.r.t. Ai′ and will be added in the Vertex Addition step. Thus we have shown that A(v) can contain Cj′ for at most one j, and in fact will then contain this set entirely.

Next, we show that for every j, ∃i such that Cj′ ⊆ Ai. Let the vertex v chosen in Step 1 of the algorithm be such that v ∈ Cj′. We show that during the vertex removal step, no vertex from N+(v) ∩ Cj′ is removed. The proof follows by an easy induction on the number r of vertices removed so far in the vertex removal step. The base case (r = 0) follows from Lemma 8, since every vertex in Cj′ is 3δ-good with respect to N+(v). For the induction step, observe that since no vertex from N+(v) ∩ Cj′ has been removed thus far, every vertex in Cj′ is still 3δ-good w.r.t. the intermediate A(v) (by mimicking the proof of Lemma 8 with N+(v) replaced by A(v)). Thus Ai′ contains at least (1 − δ)|Cj′| vertices of Cj′ at the end of the vertex removal phase, and hence by the second case above, Cj′ ⊆ Ai after the vertex addition phase.

Finally, we show that every non-singleton cluster Ai is 11δ-clean. We know that at the end of the vertex removal phase, every x ∈ Ai′ is 3δ-good w.r.t. Ai′. Thus |N+(x) \ Ai′| ≤ 3δ|Ai′|, so the total number of positive edges leaving Ai′ is at most 3δ|Ai′|^2. Since, in the vertex addition step, we add vertices that are 7δ-good w.r.t. Ai′, the number of these vertices can be at most 3δ|Ai′|^2/((1 − 7δ)|Ai′|) < 4δ|Ai′|. Thus |Ai| < (1 + 4δ)|Ai′|. Since all vertices v in Ai are at least 7δ-good w.r.t. Ai′, |N+(v) ∩ Ai| ≥ (1 − 7δ)|Ai′| ≥ ((1 − 7δ)/(1 + 4δ))|Ai| ≥ (1 − 11δ)|Ai|. Similarly, |N+(v) \ Ai| ≤ 7δ|Ai′| ≤ 11δ|Ai|. This gives us the result.

Now we are ready to bound the mistakes of A in terms of OPT and OPT′. Call mistakes that have both end points in some clusters Ai and Aj as internal
