EXHIBIT 4
PART 5 OF 8
6.10.2.1 THE NEAREST-NEIGHBOR ALGORITHM

Consider first the behavior when d_min is used.* Suppose that we think of the
data points as being nodes of a graph, with edges forming a path between
nodes in the same subset X_i.† When d_min is used to measure the distance
between subsets, the nearest neighbors determine the nearest subsets. The
merging of X_i and X_j corresponds to adding an edge between the nearest
pair of nodes in X_i and X_j. Since edges linking clusters always go between
distinct clusters, the resulting graph never has any closed loops or circuits;
in the terminology of graph theory, this procedure generates a tree. If it is
allowed to continue until all of the subsets are linked, the result is a spanning
tree, a tree with a path from any node to any other node. Moreover, it can
be shown that the sum of the edge lengths of the resulting tree will not exceed
the sum of the edge lengths for any other spanning tree for that set of samples.
Thus, with the use of d_min as the distance measure, the agglomerative
clustering procedure becomes an algorithm for generating a minimal spanning
tree.

Figure 6.17 shows the results of applying this procedure to the data of
Figure 6.16. In all cases the procedure was stopped at c = 2; a minimal
spanning tree can be obtained by adding the shortest possible edge between
the two clusters. In the first case, where the clusters are compact and well
separated, the obvious clusters are found. In the second case, the presence
of a few points located so as to produce a bridge between the clusters results
in a rather unexpected grouping into one large, elongated cluster and one
small, compact cluster. This behavior is often called the "chaining effect,"
and is sometimes considered to be a defect of this distance measure. To the
extent that the results are very sensitive to noise or to slight changes in the
position of the data points, this is certainly a valid criticism. However, as
the third case illustrates, this very tendency to form chains can be advantageous
if the clusters are elongated or possess elongated limbs.
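As a concrete illustration (not part of the original text), the following Python sketch implements the nearest-neighbor merging rule directly from the definition of d_min; the sample points, the stopping value c = 2, and the helper names are assumptions made for the example. Each merger records the edge joining the nearest pair of points, so running the procedure down to a single cluster yields the edge set of a minimal spanning tree.

```python
# A minimal sketch of the nearest-neighbor (single-linkage) procedure.
# The point set and the stopping value c are illustrative assumptions.
from math import dist

def d_min(A, B):
    """Nearest-neighbor distance between clusters A and B."""
    return min(dist(a, b) for a in A for b in B)

def single_linkage(points, c):
    """Merge the two nearest clusters (by d_min) until c clusters remain.
    Every merger adds one edge; carried to a single cluster, the recorded
    edges form a minimal spanning tree of the sample points."""
    clusters = [[p] for p in points]              # start from n singletons
    edges = []                                    # tree edges added so far
    while len(clusters) > c:
        # pair of clusters with the smallest nearest-neighbor distance
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: d_min(clusters[ij[0]], clusters[ij[1]]))
        # the nearest pair of points supplies the new edge
        a, b = min(((a, b) for a in clusters[i] for b in clusters[j]),
                   key=lambda ab: dist(ab[0], ab[1]))
        edges.append((a, b))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters, edges

# Two compact, well-separated groups are recovered when stopped at c = 2.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups, tree_edges = single_linkage(pts, c=2)
print(groups)
print(tree_edges)
```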
6.10.2.2 THE FURTHEST-NEIGHBOR ALGORITHM

When d_max is used to measure the distance between subsets, the growth of
elongated clusters is discouraged.‡ Application of the procedure can be
thought of as producing a graph in which edges connect all of the nodes in
a cluster. In the terminology of graph theory, every cluster constitutes a
complete subgraph. The distance between two clusters is determined by the
most distant nodes in the two clusters. When the nearest clusters are merged,
the graph is changed by adding edges between every pair of nodes in the two
clusters. If we define the diameter of a cluster as the largest distance between
points in the cluster, then the distance between two clusters is merely the
diameter of their union. If we define the diameter of a partition as the
largest diameter for clusters in the partition, then each iteration increases
the diameter of the partition as little as possible. As Figure 6.18 illustrates,
this is advantageous when the true clusters are compact and roughly equal
in size. However, when this is not the case, as happens with the two elongated
clusters, the resulting groupings can be meaningless. This is another example
of imposing structure on data rather than finding structure in it.

* In the literature, the resulting procedure is often called the nearest-neighbor or the
minimum algorithm. If it is terminated when the distance between nearest clusters exceeds
an arbitrary threshold, it is called the single-linkage algorithm.
† Although we will not make deep use of graph theory, we assume that the reader has a
general familiarity with the subject. A clear, rigorous treatment is given by O. Ore, Theory
of Graphs (American Math. Soc. Colloquium Publ., Vol. 38, 1962).
‡ In the literature, the resulting procedure is often called the furthest-neighbor or the
maximum algorithm. If it is terminated when the distance between nearest clusters exceeds
an arbitrary threshold, it is called the complete-linkage algorithm.

[FIGURE 6.18. Results of the furthest-neighbor algorithm.]
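The following sketch (again not from the text) shows the same agglomerative loop with d_max in place of d_min, so that each merger is the one whose union has the smallest diameter; the sample points and the target number of clusters are illustrative assumptions.

```python
# A sketch of the furthest-neighbor (complete-linkage) procedure: the distance
# between two clusters is the diameter of their union, so each merger increases
# the diameter of the partition as little as possible.  Points are assumptions.
from math import dist

def d_max(A, B):
    """Furthest-neighbor distance: diameter of the union of A and B."""
    return max(dist(a, b) for a in A for b in B)

def complete_linkage(points, c):
    """Repeatedly merge the pair of clusters whose union has the smallest diameter."""
    clusters = [[p] for p in points]
    while len(clusters) > c:
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: d_max(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# Compact, roughly equal-sized groups are recovered cleanly at c = 2.
print(complete_linkage([(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)], c=2))
```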
6.10.2.3 COMPROMISES

The minimum and maximum measures represent two extremes in measuring
the distance between clusters. Like all procedures that involve minima or
maxima, they tend to be overly sensitive to "mavericks" or "sports" or
"outliers" or "wildshots." The use of averaging is an obvious way to
ameliorate these problems, and d_avg and d_mean are natural compromises
between d_min and d_max. Computationally, d_mean is the simplest of all of these
measures, since the others require computing all n_i n_j pairs of distances
||x - x'||. However, a measure such as d_avg can be used when the distances
||x - x'|| are replaced by similarity measures, where the similarity between
mean vectors may be difficult or impossible to define. We leave it to the
reader to decide how the use of d_avg or d_mean might change the way that the
points in Figure 6.16 are grouped.
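For reference, here is a hedged sketch of the four measures side by side; the planar points and helper names are assumptions made for the example. It makes the computational point above visible: d_mean needs only the two cluster means, while d_min, d_max, and d_avg all touch every one of the n_i n_j pairs.

```python
# A side-by-side sketch of the four inter-cluster measures.  The planar points
# and helper names are assumptions; only d_mean avoids the n_i * n_j pairwise
# distance computations, but it requires cluster means to exist.
from math import dist

def d_min(A, B):
    return min(dist(a, b) for a in A for b in B)

def d_max(A, B):
    return max(dist(a, b) for a in A for b in B)

def d_avg(A, B):
    # average over all pairs; still usable if the pairwise distances are
    # replaced by pairwise (dis)similarity values
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))

def mean(C):
    n = len(C)
    return tuple(sum(x[k] for x in C) / n for k in range(len(C[0])))

def d_mean(A, B):
    # distance between the cluster means; cheapest to compute, but undefined
    # when only similarities between samples (and no mean vectors) are available
    return dist(mean(A), mean(B))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]
for measure in (d_min, d_max, d_avg, d_mean):
    print(measure.__name__, measure(A, B))
```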
6.10.3 Stepwise-Optimal Hierarchical Clustering

We observed earlier that if clusters are grown by merging the nearest pair of
clusters, then the results have a minimum-variance flavor. However, when
the measure of distance between clusters is chosen arbitrarily, one can rarely
assert that the resulting partition extremizes any particular criterion function.
In effect, hierarchical clustering defines a cluster as whatever results from
applying the clustering procedure. However, with a simple modification it is
possible to obtain a stepwise-optimal procedure for extremizing a criterion
function. This is done merely by replacing Step 3 of the Basic Agglomerative
Clustering Procedure (Section 6.10.2) by

3'. Find the pair of distinct clusters X_i and X_j whose merger would
increase (or decrease) the criterion function as little as possible.
This assures us that at each iteration we have done the best possible thing,
even if it does not guarantee that the final partition is optimal.

We saw earlier that the use of d_max causes the smallest possible stepwise
increase in the diameter of the partition. Another simple example is provided
by the sum-of-squared-error criterion function J_e. By an analysis very
similar to that used in Section 6.9, we find that the pair of clusters whose
merger increases J_e as little as possible is the pair for which the "distance"

    d_e(X_i, X_j) = sqrt(n_i n_j / (n_i + n_j)) ||m_i - m_j||

is minimum. Thus, in selecting clusters to be merged, this criterion takes
into account the number of samples in each cluster as well as the distance
between clusters. In general, the use of d_e tends to favor growth by adding
singletons or small clusters to large clusters over merging medium-sized
clusters. While the final partition may not minimize J_e, it usually provides
a very good starting point for further iterative optimization.
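A short sketch of this stepwise-optimal rule is given below (not the authors' code); the sample points and helper names are assumptions. Minimizing d_e at each step selects the merger that increases J_e as little as possible, which is why singletons tend to be absorbed into larger clusters.

```python
# A sketch of the stepwise-optimal merger for the sum-of-squared-error
# criterion J_e: at each step merge the pair minimizing
#     d_e(X_i, X_j) = sqrt(n_i * n_j / (n_i + n_j)) * ||m_i - m_j||,
# the pair whose merger increases J_e as little as possible.
# The sample points and helper names are illustrative assumptions.
from math import dist, sqrt

def mean(C):
    n = len(C)
    return tuple(sum(x[k] for x in C) / n for k in range(len(C[0])))

def d_e(A, B):
    na, nb = len(A), len(B)
    return sqrt(na * nb / (na + nb)) * dist(mean(A), mean(B))

def stepwise_optimal(points, c):
    """Agglomerate by always making the merger that raises J_e the least."""
    clusters = [[p] for p in points]
    while len(clusters) > c:
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: d_e(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# Because d_e weights the mean separation by n_i * n_j / (n_i + n_j), singletons
# and small clusters are absorbed before two medium-sized clusters are merged.
print(stepwise_optimal([(0, 0), (0.5, 0), (1, 0), (8, 0), (8.5, 0), (4, 0)], c=2))
```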
6.10.4 Hierarchical Clustering and Induced Metrics

Suppose that we are unable to supply a metric for our data, but that we can
measure a dissimilarity value δ(x, x') for every pair of samples, where
δ(x, x') ≥ 0, equality holding if and only if x = x'. Then agglomerative
clustering can still be used, with the understanding that the nearest pair of
clusters is the least dissimilar pair. Interestingly enough, if we define the
dissimilarity between two clusters by

    δ_min(X_i, X_j) = min over x in X_i, x' in X_j of δ(x, x')

or

    δ_max(X_i, X_j) = max over x in X_i, x' in X_j of δ(x, x'),

then the hierarchical clustering procedure will induce a distance function
for the given set of n samples. Furthermore, the ranking of the distances
between samples will be invariant to any monotonic transformation of the
dissimilarity values.

To see how this comes about, we begin by defining the value v_k for the
clustering at level k. For level 1, v_1 = 0. For all higher levels, v_k is the
minimum dissimilarity between pairs of distinct clusters at level k - 1.
A moment's reflection will make it clear that with both δ_min and δ_max the
value v_k either stays the same or increases as k increases. Moreover, we shall
assume that no two of the n samples are identical, so that v_2 > 0. Thus,

    0 = v_1 < v_2 ≤ v_3 ≤ ... ≤ v_n.
`
`We can now define the distance d(x, x') between x and x' as the value of
`the lowest level clustering for which x and x' are in the same cluster. To
`show that this is a legitimate distance function, or metric, we need to show
`three things:
`(I) d(x, x') = 0-¢> x = x'
`(2) d(x, x') = d(x', x)
`(3) d(x, x") ::S: d(x, x') + d(x', x").
`It is easy to see that the first requirement is satisfied. The lowest level for
`which x and x are in the same cluster is level 1, so that d(x, x) = v1 = 0.
`Conversely, if d(x, x') = 0, the fact that v2 > 0 implies that x and x' must
`be in the same cluster at level 1, and hence that x = x'. The truth of the
`second requirement follows immediately from the definition of d(x, x'). This
`leaves the third requirement , the triangle inequality. Let d(x, x') = v, and
`d(x', x") = v;, so that x and x' are in the same cluster at level i and x' and
`x" are in the same cluster at level j. Because of the hierarch ical nesting of
`clusters, one of these clusters includes the other. If k = max(i,j),
`it is clear
`that at level k x, x', and x" are all in the same cluster , and hence that
`d(x, x") ::S: vk.
`But since the values vk are monotonically nondecreasing, it follows that
`vk = max(v;, v;) and hence that
`d(x, x") ::S: max(d(x, x'), d(x', x")).
`This is known as the ultrametric inequality. It is even stronger than the
`since max(d(x, x'), d(x', x")) ::s; d(x, x') + d(x', x").
`triangle
`inequality,
`Thus, all the conditions are satisfied, and we have created a bona fide metric
`for comparing the n samples.
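To illustrate the construction (this sketch is not from the text), the following code runs a δ_min-based agglomeration over an assumed dissimilarity table and records, for each pair of samples, the level value at which they first share a cluster; that recorded value is the induced distance d(x, x'), and it can be checked against the ultrametric inequality.

```python
# A sketch of the induced metric: agglomerate with delta_min over an assumed
# dissimilarity table, and record for every pair of samples the level value v_k
# at which they first fall into the same cluster.  That recorded value is the
# induced distance d(x, x'); it satisfies the ultrametric inequality.
def induced_metric(delta):
    """delta maps frozenset({i, j}) -> dissimilarity for n samples 0 .. n-1."""
    n = 1 + max(max(pair) for pair in delta)
    clusters = [{i} for i in range(n)]
    d = {}                                         # induced distances

    def delta_min(A, B):                           # least dissimilar pair of samples
        return min(delta[frozenset({a, b})] for a in A for b in B)

    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: delta_min(clusters[ij[0]], clusters[ij[1]]))
        v_k = delta_min(clusters[i], clusters[j])  # level value of this merger
        for a in clusters[i]:
            for b in clusters[j]:
                d[frozenset({a, b})] = v_k         # first level where a and b meet
        clusters[i] |= clusters[j]
        del clusters[j]
    return d

# Three samples: d(0,2) <= max(d(0,1), d(1,2)) holds, as the text proves.
delta = {frozenset({0, 1}): 1.0, frozenset({0, 2}): 5.0, frozenset({1, 2}): 4.0}
print(induced_metric(delta))
```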
6.11 GRAPH THEORETIC METHODS

In two or three instances we have used linear graphs to add insight into the
nature of certain clustering procedures. Where the mathematics of normal
mixtures and minimum-variance partitions seems to keep returning us to the
picture of clusters as isolated clumps of points, the language and concepts of
graph theory lead us to consider much more intricate structures. Unfortunately,
few of these possibilities have been systematically explored, and there
is no uniform way of posing clustering problems as problems in graph theory.
Thus, the effective use of these ideas is still largely an art, and the reader who
wants to explore the possibilities should be prepared to be creative.

We begin our brief look into graph-theoretic methods by reconsidering
the simple procedure that produced the graphs shown in Figure 6.8. Here a
threshold distance d_0 was selected, and two points were said to be in the same
cluster if the distance between them was less than d_0. This procedure can
easily be generalized to apply to arbitrary similarity measures. Suppose that
we pick a threshold value s_0 and say that x is similar to x' if s(x, x') > s_0.
This defines an n-by-n similarity matrix S = [s_ij], where

    s_ij = 1 if s(x_i, x_j) > s_0,
    s_ij = 0 otherwise,          i, j = 1, ..., n.

This matrix defines a similarity graph in which nodes correspond to points
and an edge joins node i and node j if and only if s_ij = 1.
The clusterings produced by the single-linkage algorithm and by a modified
version of the complete-linkage algorithm are readily described in terms of
this graph. With the single-linkage algorithm, two samples x and x' are in
the same cluster if and only if there exists a chain x_1, x_2, ..., x_k such that x
is similar to x_1, x_1 is similar to x_2, and so on for the whole chain. Thus, this
clustering corresponds to the connected components of the similarity graph.
With the complete-linkage algorithm, all samples in a given cluster must be
similar to one another, and no sample can be in more than one cluster. If
we drop this second requirement, then this clustering corresponds to the
maximal complete subgraphs of the similarity graph, the "largest" subgraphs
with edges joining all pairs of nodes. (In general, the clusters of the complete-linkage
algorithm will be found among the maximal complete subgraphs,
but they cannot be determined without knowing the unquantized similarity
values.)
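The following sketch (an illustration, not part of the text) builds the thresholded similarity matrix and then recovers the single-linkage clusters as the connected components of the resulting graph; the similarity function s, the threshold s_0, and the sample points are all assumptions made for the example.

```python
# A sketch of the thresholded similarity graph: s_ij = 1 exactly when
# s(x_i, x_j) > s_0.  The connected components of this graph are the
# single-linkage clusters.  The similarity function, the threshold s_0,
# and the points are assumptions made for the example.
from math import dist

def similarity_matrix(points, s, s0):
    n = len(points)
    return [[1 if i != j and s(points[i], points[j]) > s0 else 0
             for j in range(n)] for i in range(n)]

def connected_components(S):
    """Depth-first search over the graph whose adjacency matrix is S."""
    n = len(S)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            i = stack.pop()
            if i in seen:
                continue
            seen.add(i)
            comp.append(i)
            stack.extend(j for j in range(n) if S[i][j] == 1 and j not in seen)
        components.append(comp)
    return components

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10)]
s = lambda x, xp: 1.0 / (1.0 + dist(x, xp))     # assumed similarity measure
S = similarity_matrix(pts, s, s0=0.4)           # assumed threshold s_0
print(connected_components(S))                  # two components, i.e. two clusters
```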
In the preceding section we noted that the nearest-neighbor algorithm
could be viewed as an algorithm for finding a minimal spanning tree. Conversely,
given a minimal spanning tree we can find the clusterings produced
by the nearest-neighbor algorithm. Removal of the longest edge produces
the two-cluster grouping, removal of the next longest edge produces the
three-cluster grouping, and so on. This amounts to an inverted way of
obtaining a divisive hierarchical procedure, and suggests other ways of
dividing the graph into subgraphs. For example, in selecting an edge to
remove, we can compare its length to the lengths of other edges incident
upon its nodes. Let us say that an edge is inconsistent if its length l is
significantly larger than l̄, the average length of all other edges incident on
its nodes. Figure 6.19 shows a minimal spanning tree for a two-dimensional
point set and the clusters obtained by systematically removing all edges for
which l > 2l̄. Note how the sensitivity of this criterion to local conditions
gives results that are quite different from merely removing the two longest
edges.

[FIGURE 6.19. Clusters formed by removing inconsistent edges. (a) Point set; (b) minimal spanning tree; (c) clusters; legend distinguishes diameter or near-diameter path points from off-diameter path points. From C. T. Zahn, 1971. Copyright 1971, Institute of Electrical and Electronics Engineers; reprinted by permission.]
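Below is a sketch of the inconsistent-edge test just described; the tree is assumed to be given as (node, node, length) triples, and the factor 2 mirrors the l > 2l̄ rule used for Figure 6.19. Edges of identical length are handled loosely, which is adequate for an illustration.

```python
# A sketch of the inconsistent-edge test: delete a minimal-spanning-tree edge
# when its length l exceeds 2 * l_bar, where l_bar is the average length of the
# other edges incident on its two endpoints.  The tree is assumed to be given
# as (node, node, length) triples; edges of equal length are handled loosely.
def remove_inconsistent_edges(n, mst_edges, factor=2.0):
    """Return the MST edges that survive the l > factor * l_bar test."""
    incident = {i: [] for i in range(n)}          # lengths of edges at each node
    for (i, j, l) in mst_edges:
        incident[i].append(l)
        incident[j].append(l)
    kept = []
    for (i, j, l) in mst_edges:
        # average length of all OTHER edges incident on nodes i and j
        others = [x for x in incident[i] + incident[j] if x != l] or [l]
        l_bar = sum(others) / len(others)
        if l <= factor * l_bar:
            kept.append((i, j, l))
    return kept

# Two tight chains joined by one long bridge edge: only the bridge is removed,
# splitting the tree into two clusters.
tree = [(0, 1, 1.0), (1, 2, 1.2), (2, 3, 6.0), (3, 4, 1.1), (4, 5, 0.9)]
print(remove_inconsistent_edges(6, tree))
```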
When the data points are strung out into long chains, a minimal spanning
tree forms a natural skeleton for the chain. If we define the diameter path as
the longest path through the tree, then a chain will be characterized by the
shallow depth of branching off the diameter path. In contrast, for a large,
uniform cloud of data points, the tree will usually not have an obvious
diameter path, but rather several distinct, near-diameter paths. For any of
these, an appreciable number of nodes will be off the path. While slight
changes in the locations of the data points can cause major rerouting of a
minimal spanning tree, they typically have little effect on such statistics.

One of the useful statistics that can be obtained from a minimal spanning
tree is the edge length distribution. Figure 6.20 shows a situation in which a
dense cluster is embedded in a sparse one. The lengths of the edges of the
minimal spanning tree exhibit two distinct clusters which would easily be
detected by a minimum-variance procedure. By deleting all edges longer
than some intermediate value, we can extract the dense cluster as the largest
connected component of the remaining graph. While more complicated
configurations cannot be disposed of this easily, the flexibility of the
graph-theoretic approach suggests that it is applicable to a wide variety of
clustering problems.

[FIGURE 6.20. A minimal spanning tree with a bimodal edge length distribution.]
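As a final sketch (not from the text), the code below deletes every minimal-spanning-tree edge longer than an assumed intermediate cut value and returns the largest connected component of what remains, which is how the dense cluster of Figure 6.20 would be extracted; the tree and the cut value are illustrative assumptions.

```python
# A sketch of the edge-length heuristic: when the minimal spanning tree has a
# bimodal edge-length distribution, drop every edge longer than an intermediate
# cut value and take the largest connected component that remains as the dense
# cluster.  The tree and the cut value are illustrative assumptions.
def largest_component_after_cut(n, mst_edges, cut):
    """Drop MST edges longer than `cut`; return the largest remaining component."""
    adjacent = {i: [] for i in range(n)}
    for (i, j, l) in mst_edges:
        if l <= cut:                              # keep only the short edges
            adjacent[i].append(j)
            adjacent[j].append(i)
    seen, best = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            i = stack.pop()
            if i not in seen:
                seen.add(i)
                comp.append(i)
                stack.extend(adjacent[i])
        if len(comp) > len(best):
            best = comp
    return best

# Short edges span the dense cluster (nodes 0-3); long edges reach the sparse points.
tree = [(0, 1, 0.5), (1, 2, 0.4), (2, 3, 0.6), (3, 4, 3.0), (4, 5, 2.8)]
print(largest_component_after_cut(6, tree, cut=1.0))
```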
6.12 THE PROBLEM OF VALIDITY

With almost all of the procedures we have considered thus far we have
assumed that the number of clusters is known. That is a reasonable assumption
if we are upgrading a classifier that has been designed on a small sample
set, or if we are tracking slowly time-varying patterns. However, it is a very