EXHIBIT 4
PART 5 OF 8
`
`
`
`
`6,10.2.1 THE NEAREST-NEIGHBOR ALGORITHM
`Consider first the behavior when dmin is used.* Suppose that we think of the
`data points as being nodes of a graph, with edges forming a path between
`nodes in the same subset Xi.t When dmin is used to measure the distance
`between subsets, the nearest neighbors determine the nearest subsets. The
`merging of X i and X ; corresponds to adding an edge between the nearest
`pair of nodes in Xi and X;, Since edges linking clusters always go between
`distinct clusters, the resulting graph never has any closed loops or circuits;
`in the terminology of graph theory, this procedure generates a tree. If it is
`allowed to continue until all of the subsets are linked, the result is a spanning
`tree, a tree with a path from any node to any other node. Moreover, it can
`be shown that the sum of the edge lengths of the resulting tree will not exceed
`the sum of the edge lengths for any other spanning tree for that set of samples.
`Thus, with the use of dmiu as the distance measure, the agglomerative
`clustering procedure becomes an algorithm for generating a minimal spanning
`tree.
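The following Python sketch is purely illustrative and not part of the text: a brute-force agglomerative procedure using dmin that records the pair of points actually joined at each merge. The function name single_linkage and the O(n^3) search are conveniences of this sketch only.

import math

def single_linkage(points, c):
    """Agglomerative clustering with d_min (nearest-neighbor distance).

    points: list of coordinate tuples; c: desired number of clusters.
    Returns the clusters as lists of point indices, together with the
    edges added by the merges; with c = 1 the recorded edges form a
    minimal spanning tree of the points.
    """
    clusters = [[i] for i in range(len(points))]
    edges = []
    while len(clusters) > c:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # d_min: the closest pair of points lying in the two clusters
                d, i, j = min((math.dist(points[i], points[j]), i, j)
                              for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b, i, j)
        d, a, b, i, j = best
        edges.append((i, j, d))           # edge between the nearest pair of nodes
        clusters[a].extend(clusters[b])   # merge the two nearest subsets
        del clusters[b]
    return clusters, edges

For example, single_linkage(pts, 2) yields a two-cluster grouping analogous to those discussed below, while single_linkage(pts, 1)[1] yields the n - 1 edges of a minimal spanning tree.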
Figure 6.17 shows the results of applying this procedure to the data of Figure 6.16. In all cases the procedure was stopped at c = 2; a minimal spanning tree can be obtained by adding the shortest possible edge between the two clusters. In the first case where the clusters are compact and well separated, the obvious clusters are found. In the second case, the presence of a few points located so as to produce a bridge between the clusters results in a rather unexpected grouping into one large, elongated cluster, and one small, compact cluster. This behavior is often called the "chaining effect," and is sometimes considered to be a defect of this distance measure. To the extent that the results are very sensitive to noise or to slight changes in position of the data points, this is certainly a valid criticism. However, as the third case illustrates, this very tendency to form chains can be advantageous if the clusters are elongated or possess elongated limbs.
`
`6.10.2.2 THE FURTHEST-NEIGHBOR ALGORITHM
When dmax is used to measure the distance between subsets, the growth of
elongated clusters is discouraged.‡ Application of the procedure can be
`thought of as producing a graph in which edges connect all of the nodes in
`
* In the literature, the resulting procedure is often called the nearest-neighbor or the minimum algorithm. If it is terminated when the distance between nearest clusters exceeds an arbitrary threshold, it is called the single-linkage algorithm.
† Although we will not make deep use of graph theory, we assume that the reader has a general familiarity with the subject. A clear, rigorous treatment is given by O. Ore, Theory of Graphs (American Math. Soc. Colloquium Publ., Vol. 38, 1962).
‡ In the literature, the resulting procedure is often called the furthest-neighbor or the maximum algorithm. If it is terminated when the distance between nearest clusters exceeds an arbitrary threshold, it is called the complete-linkage algorithm.
`
`
`
FIGURE 6.18. Results of the furthest-neighbor algorithm.
`
`
`
`
a cluster. In the terminology of graph theory, every cluster constitutes a complete subgraph. The distance between two clusters is determined by the most distant nodes in the two clusters. When the nearest clusters are merged, the graph is changed by adding edges between every pair of nodes in the two clusters. If we define the diameter of a cluster as the largest distance between points in the cluster, then the distance between two clusters is merely the diameter of their union. If we define the diameter of a partition as the largest diameter for clusters in the partition, then each iteration increases the diameter of the partition as little as possible. As Figure 6.18 illustrates, this is advantageous when the true clusters are compact and roughly equal in size. However, when this is not the case, as happens with the two elongated clusters, the resulting groupings can be meaningless. This is another example of imposing structure on data rather than finding structure in it.
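A corresponding sketch for dmax (again illustrative only, reusing the same representation of points as coordinate tuples) replaces the nearest-pair search by a most-distant-pair search, so that each merge increases the diameter of the partition as little as possible.

import math

def complete_linkage(points, c):
    """Agglomerative clustering with d_max (furthest-neighbor distance)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > c:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # d_max: the most distant pair of points in the two clusters,
                # i.e. the diameter of the union of the two clusters
                d = max(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])   # merge the pair with the smallest d_max
        del clusters[b]
    return clusters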
`
`6.10.2.3 COMPROMISES
`The minimum and maximum measures represent two extremes in measuring
`the distance between clusters. Like all procedures that involve minima or
`maxima, they tend to be overly sensitive to "mavericks" or " sports" or
`"outliers" or "wildshots." The use of averaging is an obvious way to
`ameliorate these problems, and d e.vg and dmean are natural compromises
`between dmi n and dm e.x· Computationally, dm ea.n is the simplest of all of these
`measures, since the others require computing all nin; pairs of distances
`Ux - x' 11-However, a measure such as d 11,vg can be used when the distances
`!Ix - x'II are replaced by similarity measures, where the similarity between
`mean vectors may be difficult or impossible to define. We leave it to the
`reader to decide how the use of da vg or dmea.n might change the way that the
`points in Figure 6.16 are grouped.
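The computational point can be made concrete with a small sketch (hypothetical helper names, Euclidean points assumed): dmean needs only the two mean vectors, while davg averages all n_i n_j pairwise distances and therefore still applies when only pairwise dissimilarities, rather than mean vectors, are meaningful.

import math

def d_mean(ci, cj, points):
    """Distance between the mean vectors of clusters ci and cj (index lists)."""
    mi = [sum(v) / len(ci) for v in zip(*(points[k] for k in ci))]
    mj = [sum(v) / len(cj) for v in zip(*(points[k] for k in cj))]
    return math.dist(mi, mj)

def d_avg(ci, cj, points):
    """Average of all n_i * n_j pairwise distances between the two clusters."""
    return (sum(math.dist(points[i], points[j]) for i in ci for j in cj)
            / (len(ci) * len(cj)))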
`
`6.10.3 Stepwise-Optimal Hierarchical Clustering
`
We observed earlier that if clusters are grown by merging the nearest pair of clusters, then the results have a minimum variance flavor. However, when the measure of distance between clusters is chosen arbitrarily, one can rarely assert that the resulting partition extremizes any particular criterion function. In effect, hierarchical clustering defines a cluster as whatever results from applying the clustering procedure. However, with a simple modification it is possible to obtain a stepwise-optimal procedure for extremizing a criterion function. This is done merely by replacing Step 3 of the Basic Agglomerative Clustering Procedure (Section 6.10.2) by

3'. Find the pair of distinct clusters Xi and Xj whose merger would increase (or decrease) the criterion function as little as possible.
`
`
`
`
This assures us that at each iteration we have done the best possible thing, even if it does not guarantee that the final partition is optimal.

We saw earlier that the use of dmax causes the smallest possible stepwise increase in the diameter of the partition. Another simple example is provided by the sum-of-squared-error criterion function Je. By an analysis very similar to that used in Section 6.9, we find that the pair of clusters whose merger increases Je as little as possible is the pair for which the "distance"

d_e(X_i, X_j) = \sqrt{\frac{n_i n_j}{n_i + n_j}} \, \|m_i - m_j\|

is minimum. Thus, in selecting clusters to be merged, this criterion takes into account the number of samples in each cluster as well as the distance between clusters. In general, the use of de tends to favor growth by adding singletons or small clusters to large clusters over merging medium-sized clusters. While the final partition may not minimize Je, it usually provides a very good starting point for further iterative optimization.
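A brief sketch of Step 3' for Je (the names d_e and best_merge are chosen here for illustration, not taken from the text) evaluates the "distance" above for every pair of clusters and picks the pair whose merger would increase Je the least.

import math

def d_e(ci, cj, points):
    """The 'distance' sqrt(n_i n_j / (n_i + n_j)) * ||m_i - m_j||; merging the
    pair of clusters that minimizes it increases J_e as little as possible."""
    ni, nj = len(ci), len(cj)
    mi = [sum(v) / ni for v in zip(*(points[k] for k in ci))]
    mj = [sum(v) / nj for v in zip(*(points[k] for k in cj))]
    return math.sqrt(ni * nj / (ni + nj)) * math.dist(mi, mj)

def best_merge(clusters, points):
    """Step 3': the pair of distinct clusters whose merger increases J_e least."""
    pairs = [(a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))]
    return min(pairs, key=lambda ab: d_e(clusters[ab[0]], clusters[ab[1]], points))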
`
`6.10.4 Hierarchical Clustering and Induced Metrics
`
Suppose that we are unable to supply a metric for our data, but that we can measure a dissimilarity value δ(x, x') for every pair of samples, where δ(x, x') ≥ 0, equality holding if and only if x = x'. Then agglomerative clustering can still be used, with the understanding that the nearest pair of clusters is the least dissimilar pair. Interestingly enough, if we define the dissimilarity between two clusters by

\delta_{\min}(X_i, X_j) = \min_{x \in X_i,\, x' \in X_j} \delta(x, x')

or

\delta_{\max}(X_i, X_j) = \max_{x \in X_i,\, x' \in X_j} \delta(x, x')

then the hierarchical clustering procedure will induce a distance function for the given set of n samples. Furthermore, the ranking of the distances between samples will be invariant to any monotonic transformation of the dissimilarity values.

To see how this comes about, we begin by defining the value vk for the clustering at level k. For level 1, v1 = 0. For all higher levels, vk is the minimum dissimilarity between pairs of distinct clusters at level k - 1. A moment's reflection will make it clear that with both δmin and δmax the value vk either stays the same or increases as k increases. Moreover, we shall assume that no two of the n samples are identical, so that v2 > 0. Thus,

0 = v1 < v2 ≤ v3 ≤ ··· ≤ vn.
`
`
`
`
We can now define the distance d(x, x') between x and x' as the value of the lowest level clustering for which x and x' are in the same cluster. To show that this is a legitimate distance function, or metric, we need to show three things:

(1) d(x, x') = 0 ⇔ x = x'
(2) d(x, x') = d(x', x)
(3) d(x, x'') ≤ d(x, x') + d(x', x'').

It is easy to see that the first requirement is satisfied. The lowest level for which x and x are in the same cluster is level 1, so that d(x, x) = v1 = 0. Conversely, if d(x, x') = 0, the fact that v2 > 0 implies that x and x' must be in the same cluster at level 1, and hence that x = x'. The truth of the second requirement follows immediately from the definition of d(x, x'). This leaves the third requirement, the triangle inequality. Let d(x, x') = vi and d(x', x'') = vj, so that x and x' are in the same cluster at level i and x' and x'' are in the same cluster at level j. Because of the hierarchical nesting of clusters, one of these clusters includes the other. If k = max(i, j), it is clear that at level k x, x', and x'' are all in the same cluster, and hence that

d(x, x'') ≤ vk.

But since the values vk are monotonically nondecreasing, it follows that vk = max(vi, vj) and hence that

d(x, x'') ≤ max(d(x, x'), d(x', x'')).

This is known as the ultrametric inequality. It is even stronger than the triangle inequality, since max(d(x, x'), d(x', x'')) ≤ d(x, x') + d(x', x''). Thus, all the conditions are satisfied, and we have created a bona fide metric for comparing the n samples.
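The construction lends itself to a direct sketch (illustrative Python; it assumes the dissimilarities are supplied as a matrix, and induced_metric is a name chosen here, not from the text): run the agglomerative procedure with δmin and record, for each pair of samples, the level value vk at which they first fall into the same cluster. Because only the ordering of the merges matters, any monotonic transformation of the dissimilarity values leaves the ranking of the induced distances unchanged.

def induced_metric(delta):
    """Induce d(x, x') from agglomerative clustering with delta_min.

    delta: n-by-n dissimilarity matrix with delta[i][j] >= 0 and equality
    only for i == j.  Returns d, where d[i][j] is the level value v_k at
    which samples i and j are first placed in the same cluster.
    """
    n = len(delta)
    clusters = [{i} for i in range(n)]
    d = [[0.0] * n for _ in range(n)]
    while len(clusters) > 1:
        # v_k: the smallest delta_min between distinct clusters at this level
        v, a, b = min(
            (min(delta[i][j] for i in clusters[x] for j in clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters)))
        for i in clusters[a]:
            for j in clusters[b]:
                d[i][j] = d[j][i] = v   # first level at which i and j are merged
        clusters[a] |= clusters[b]
        del clusters[b]
    return d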
`
`6.11 GRAPH THEORETIC METHODS
`
`In two or three instances we have used linear graphs to add insight into the
`nature of certain clustering procedures. Where the mathematics of normal
`mixtures and minimum-variance partitions seems to keep returning us to the
`picture of clusters as isolated clumps of points, the language and concepts of
graph theory lead us to consider much more intricate structures. Unfortunately, few of these possibilities have been systematically explored, and there
`is no uniform way of posing clustering problems as problems in graph theory.
`Thus, the effective use of these ideas is still largely an art, and the reader who
`wants to explore the possibilities should be prepared to be creative.
`We begin our brief look into graph-theoretic methods by reconsidering
`the simple procedure that produced the graphs shown in Figure 6.8. Here a
`
`
`
`
threshold distance d0 was selected, and two points were said to be in the same cluster if the distance between them was less than d0. This procedure can easily be generalized to apply to arbitrary similarity measures. Suppose that we pick a threshold value s0 and say that x is similar to x' if s(x, x') > s0. This defines an n-by-n similarity matrix S = [sij], where

sij = 1 if s(xi, xj) > s0, and sij = 0 otherwise,  i, j = 1, ..., n.

This matrix defines a similarity graph in which nodes correspond to points and an edge joins node i and node j if and only if sij = 1.
The clusterings produced by the single-linkage algorithm and by a modified version of the complete-linkage algorithm are readily described in terms of this graph. With the single-linkage algorithm, two samples x and x' are in the same cluster if and only if there exists a chain x1, x2, ..., xk such that x is similar to x1, x1 is similar to x2, and so on for the whole chain. Thus, this clustering corresponds to the connected components of the similarity graph. With the complete-linkage algorithm, all samples in a given cluster must be similar to one another, and no sample can be in more than one cluster. If we drop this second requirement, then this clustering corresponds to the maximal complete subgraphs of the similarity graph, the "largest" subgraphs with edges joining all pairs of nodes. (In general, the clusters of the complete-linkage algorithm will be found among the maximal complete subgraphs, but they cannot be determined without knowing the unquantized similarity values.)
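As a concrete sketch (not from the text; the function name is hypothetical), thresholding a similarity matrix at s0 and extracting the connected components of the resulting graph by depth-first search yields exactly the single-linkage clusters just described.

def similarity_graph_clusters(S, s0):
    """Clusters = connected components of the graph with an edge (i, j)
    whenever S[i][j] > s0, where S is an n-by-n similarity matrix."""
    n = len(S)
    adj = [[j for j in range(n) if j != i and S[i][j] > s0] for i in range(n)]
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:                      # depth-first search from 'start'
            i = stack.pop()
            comp.append(i)
            for j in adj[i]:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        clusters.append(comp)
    return clusters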
In the preceding section we noted that the nearest-neighbor algorithm could be viewed as an algorithm for finding a minimal spanning tree. Conversely, given a minimal spanning tree we can find the clusterings produced by the nearest-neighbor algorithm. Removal of the longest edge produces the two-cluster grouping, removal of the next longest edge produces the three-cluster grouping, and so on. This amounts to an inverted way of obtaining a divisive hierarchical procedure, and suggests other ways of dividing the graph into subgraphs. For example, in selecting an edge to remove, we can compare its length to the lengths of other edges incident upon its nodes. Let us say that an edge is inconsistent if its length l is significantly larger than l̄, the average length of all other edges incident on its nodes. Figure 6.19 shows a minimal spanning tree for a two-dimensional point set and the clusters obtained by systematically removing all edges for which l > 2l̄. Note how the sensitivity of this criterion to local conditions gives results that are quite different from merely removing the two longest edges.
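A sketch of this edge-removal rule (illustrative only; it assumes the minimal spanning tree is given as (i, j, length) triples and uses the factor 2 of the example above) deletes every edge whose length exceeds twice the average length of the other edges incident on its two endpoints; the clusters are then the connected components of the edges that survive.

def remove_inconsistent_edges(n, mst_edges, factor=2.0):
    """Keep only the MST edges whose length l satisfies l <= factor * l_bar,
    where l_bar is the average length of the other edges incident on the
    edge's endpoints.  mst_edges: list of (i, j, length) over nodes 0..n-1."""
    incident = [[] for _ in range(n)]     # indices of edges touching each node
    for e, (i, j, l) in enumerate(mst_edges):
        incident[i].append(e)
        incident[j].append(e)
    kept = []
    for e, (i, j, l) in enumerate(mst_edges):
        others = [mst_edges[f][2] for f in incident[i] + incident[j] if f != e]
        if not others or l <= factor * (sum(others) / len(others)):
            kept.append((i, j, l))
    return kept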
FIGURE 6.19. Clusters formed by removing inconsistent edges: (a) point set; (b) minimal spanning tree; (c) clusters, with nodes on the diameter or a near-diameter path distinguished from nodes off the diameter path. (From C. T. Zahn, 1971. Copyright 1971, Institute of Electrical and Electronics Engineers, reprinted by permission.)

When the data points are strung out into long chains, a minimal spanning tree forms a natural skeleton for the chain. If we define the diameter path as
the longest path through the tree, then a chain will be characterized by the shallow depth of branching off the diameter path. In contrast, for a large, uniform cloud of data points, the tree will usually not have an obvious diameter path, but rather several distinct, near-diameter paths. For any of these, an appreciable number of nodes will be off the path. While slight changes in the locations of the data points can cause major rerouting of a minimal spanning tree, they typically have little effect on such statistics.
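For a tree given as (i, j, length) edges, a diameter path can be found with two farthest-node sweeps (a standard property of trees, sketched here for illustration rather than taken from the text); the depth of branching off this path is then easy to tabulate.

def tree_diameter_path(n, edges):
    """Longest weighted path through a tree on nodes 0..n-1.

    Two sweeps: the node farthest from an arbitrary start is one end of a
    diameter path, and the node farthest from that end is the other end.
    """
    adj = [[] for _ in range(n)]
    for i, j, l in edges:
        adj[i].append((j, l))
        adj[j].append((i, l))

    def farthest(src):
        dist, parent, stack = {src: 0.0}, {src: None}, [src]
        while stack:
            u = stack.pop()
            for v, l in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + l
                    parent[v] = u
                    stack.append(v)
        return max(dist, key=dist.get), parent

    a, _ = farthest(0)
    b, parent = farthest(a)
    path, u = [], b
    while u is not None:          # walk back from one end to the other
        path.append(u)
        u = parent[u]
    return path                   # node indices along a diameter path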
One of the useful statistics that can be obtained from a minimal spanning tree is the edge length distribution. Figure 6.20 shows a situation in which a dense cluster is embedded in a sparse one. The lengths of the edges of the minimal spanning tree exhibit two distinct clusters which would easily be detected by a minimum-variance procedure. By deleting all edges longer than some intermediate value, we can extract the dense cluster as the largest connected component of the remaining graph. While more complicated configurations cannot be disposed of this easily, the flexibility of the graph-theoretic approach suggests that it is applicable to a wide variety of clustering problems.
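One way to choose the intermediate value is a minimum-variance split of the edge lengths into two groups (an illustrative sketch; edge_length_cutoff is a name chosen here). Edges longer than the returned cutoff are deleted, and the dense cluster is read off as the largest connected component of what remains, for example with the depth-first search sketched earlier.

def edge_length_cutoff(lengths):
    """Split a (presumably bimodal) list of edge lengths into two groups so
    that the pooled within-group squared error is minimized, and return the
    midpoint between the two groups as the deletion threshold."""
    ls = sorted(lengths)

    def sse(group):
        if not group:
            return 0.0
        m = sum(group) / len(group)
        return sum((x - m) ** 2 for x in group)

    k = min(range(1, len(ls)), key=lambda k: sse(ls[:k]) + sse(ls[k:]))
    return (ls[k - 1] + ls[k]) / 2.0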
`
`6.12 THE PROBLEM OF VALIDITY
`
With almost all of the procedures we have considered thus far we have assumed that the number of clusters is known. That is a reasonable assumption if we are upgrading a classifier that has been designed on a small sample set, or if we are tracking slowly time-varying patterns. However, it is a very
`
`
`
FIGURE 6.20. A minimal spanning tree with a bimodal edge length distribution.
`
`