Case 1:14-cv-02396-PGG-MHD Document 148-10 Filed 05/30/19

EXHIBIT 4
PART 4 OF 8
UNSUPERVISED LEARNING AND CLUSTERING
Note that the total scatter matrix does not depend on how the set of samples is partitioned into clusters. It depends only on the total set of samples. The within-cluster and between-cluster scatter matrices do depend on the partitioning, however. Roughly speaking, there is an exchange between these two matrices, the between-cluster scatter going up as the within-cluster scatter goes down. This is fortunate, since by trying to minimize the within-cluster scatter we will also tend to maximize the between-cluster scatter.

To be more precise in talking about the amount of within-cluster or between-cluster scatter, we need a scalar measure of the "size" of a scatter matrix. The two measures that we shall consider are the trace and the determinant. In the univariate case, these two measures are equivalent, and we can define an optimal partition as one that minimizes S_W or maximizes S_B. In the multivariate case things are somewhat more complicated, and a number of related but distinct optimality criteria have been suggested.
6.8.3.2 THE TRACE CRITERION

Perhaps the simplest scalar measure of a scatter matrix is its trace, the sum of its diagonal elements. Roughly speaking, the trace measures the square of the scattering radius, since it is proportional to the sum of the variances in the coordinate directions. Thus, an obvious criterion function to minimize is the trace of S_W. In fact, this criterion is nothing more or less than the sum-of-squared-error criterion, since Eqs. (33) and (34) yield
    tr S_W = \sum_{i=1}^{c} tr S_i = \sum_{i=1}^{c} \sum_{x \in \mathcal{X}_i} \|x - m_i\|^2 = J_e .        (38)
Since tr S_T = tr S_W + tr S_B and tr S_T is independent of how the samples are partitioned, we see that no new results are obtained by trying to maximize tr S_B. However, it is comforting to know that in trying to minimize the within-cluster criterion J_e = tr S_W we are also maximizing the between-cluster criterion

    tr S_B = \sum_{i=1}^{c} n_i \|m_i - m\|^2 .        (39)
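As a concrete illustration of Eqs. (38) and (39), the following sketch (not part of the original text; it assumes NumPy and invented two-dimensional samples) computes S_W, S_B, and S_T for a fixed partition and checks that tr S_W equals J_e and that tr S_T = tr S_W + tr S_B.

    import numpy as np

    # Hypothetical two-dimensional samples already partitioned into c = 2 clusters.
    clusters = [
        np.array([[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]),
        np.array([[3.0, 3.1], [2.9, 2.8], [3.2, 3.0], [3.1, 2.9]]),
    ]

    m = np.vstack(clusters).mean(axis=0)      # total sample mean
    S_W = np.zeros((2, 2))                    # within-cluster scatter matrix
    S_B = np.zeros((2, 2))                    # between-cluster scatter matrix
    J_e = 0.0                                 # sum-of-squared-error criterion
    for X in clusters:
        m_i = X.mean(axis=0)
        d = X - m_i
        S_W += d.T @ d                        # S_i, summed over clusters
        v = (m_i - m).reshape(-1, 1)
        S_B += len(X) * (v @ v.T)             # n_i (m_i - m)(m_i - m)^t
        J_e += (d ** 2).sum()

    S_T = S_W + S_B                           # total scatter matrix
    print(np.isclose(np.trace(S_W), J_e))                             # Eq. (38)
    print(np.isclose(np.trace(S_T), np.trace(S_W) + np.trace(S_B)))   # tr S_T = tr S_W + tr S_B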
6.8.3.3 THE DETERMINANT CRITERION

In Section 4.11 we used the determinant of the scatter matrix to obtain a scalar measure of scatter. Roughly speaking, this measures the square of the scattering volume, since it is proportional to the product of the variances in the directions of the principal axes. Since S_B will be singular if the number of clusters is less than or equal to the dimensionality, |S_B| is obviously a poor choice for a criterion function. S_W can also become singular, and will
certainly be so if n - c is less than the dimensionality d.* However, if we assume that S_W is nonsingular, we are led to consider the criterion function

    J_d = |S_W| = \left| \sum_{i=1}^{c} S_i \right| .        (40)
The partition that minimizes J_d is often similar to the one that minimizes J_e, but the two need not be the same. We observed before that the minimum-squared-error partition might change if the axes are scaled. This does not happen with J_d. To see why, let T be a nonsingular matrix and consider the change of variables x' = Tx. Keeping the partitioning fixed, we obtain new mean vectors m_i' = T m_i and new scatter matrices S_i' = T S_i T^t. Thus, J_d changes to

    J_d' = |S_W'| = |T S_W T^t| = |T|^2 J_d .

Since the scale factor |T|^2 is the same for all partitions, it follows that J_d and J_d' rank the partitions in the same way, and hence that the optimal clustering based on J_d is invariant to nonsingular linear transformations of the data.
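The invariance argument can also be checked numerically. The sketch below (illustrative only, with made-up data and an arbitrary nonsingular T) confirms that |T S_W T^t| = |T|^2 |S_W| for two different candidate partitions, so their J_d ranking is unchanged.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(12, 2))                      # hypothetical samples

    def within_scatter(X, labels, c):
        # S_W for a given hard partition of the rows of X.
        S_W = np.zeros((X.shape[1], X.shape[1]))
        for i in range(c):
            Xi = X[labels == i]
            d = Xi - Xi.mean(axis=0)
            S_W += d.T @ d
        return S_W

    labels_a = np.array([0] * 6 + [1] * 6)            # two candidate partitions
    labels_b = np.array([0, 1] * 6)
    T = np.array([[2.0, 0.5], [0.0, 1.5]])            # nonsingular change of variables x' = Tx

    for labels in (labels_a, labels_b):
        J_d = np.linalg.det(within_scatter(X, labels, 2))
        J_d_prime = np.linalg.det(within_scatter(X @ T.T, labels, 2))
        print(np.isclose(J_d_prime, np.linalg.det(T) ** 2 * J_d))   # only a |T|^2 scale factor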
6.8.3.4 INVARIANT CRITERIA

It is not hard to show that the eigenvalues λ_1, ..., λ_d of S_W^{-1} S_B are invariant under nonsingular linear transformations of the data. Indeed, these eigenvalues are the basic linear invariants of the scatter matrices. Their numerical values measure the ratio of between-cluster to within-cluster scatter in the direction of the eigenvectors, and partitions that yield large values are usually desirable. Of course, as we pointed out in Section 4.11, the fact that the rank of S_B cannot exceed c - 1 means that no more than c - 1 of these eigenvalues can be nonzero. Nevertheless, good partitions are ones for which the nonzero eigenvalues are large.

One can invent a great variety of invariant clustering criteria by composing appropriate functions of these eigenvalues. Some of these follow naturally from standard matrix operations. For example, since the trace of a matrix is the sum of its eigenvalues, one might elect to maximize the criterion function†

    tr S_W^{-1} S_B = \sum_{i=1}^{d} \lambda_i .        (41)
* This follows from the fact that the rank of S_i cannot exceed n_i - 1, and thus the rank of S_W cannot exceed \sum (n_i - 1) = n - c. Of course, if the samples are confined to a lower-dimensional subspace it is possible to have S_W be singular even though n - c ≥ d. In such cases, some kind of dimensionality-reduction procedure must be used before the determinant criterion can be applied (see Section 6.14).

† Another invariant criterion is

    |S_W^{-1} S_B| = \prod_{i=1}^{d} \lambda_i .

However, since its value is usually zero it is not very useful.
FIGURE 6.14. The effect of transforming to normalized principal components. (a) Unnormalized. (b) Normalized. (Note: the partition that minimizes tr S_T^{-1} S_W in (a) minimizes the sum of squared errors in (b).)
By using the relation S_T = S_W + S_B, one can derive the following invariant relatives of tr S_W and |S_W|:

    tr S_T^{-1} S_W = \sum_{i=1}^{d} \frac{1}{1 + \lambda_i}        (42)

    \frac{|S_W|}{|S_T|} = \prod_{i=1}^{d} \frac{1}{1 + \lambda_i} .        (43)
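In code, the criteria (41)-(43) can all be obtained from the eigenvalues λ_i of S_W^{-1} S_B. The sketch below (invented three-cloud data, not from the text) computes them and verifies the two identities just stated.

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
                   for c in ([0, 0], [3, 0], [0, 3])])        # hypothetical samples
    labels = np.repeat([0, 1, 2], 20)

    m = X.mean(axis=0)
    S_W = np.zeros((2, 2))
    S_B = np.zeros((2, 2))
    for i in range(3):
        Xi = X[labels == i]
        d = Xi - Xi.mean(axis=0)
        S_W += d.T @ d
        v = (Xi.mean(axis=0) - m).reshape(-1, 1)
        S_B += len(Xi) * (v @ v.T)
    S_T = S_W + S_B

    lam = np.linalg.eigvals(np.linalg.solve(S_W, S_B)).real   # eigenvalues of S_W^{-1} S_B

    crit_41 = lam.sum()                        # tr S_W^{-1} S_B,  Eq. (41)
    crit_42 = (1.0 / (1.0 + lam)).sum()        # tr S_T^{-1} S_W,  Eq. (42)
    crit_43 = (1.0 / (1.0 + lam)).prod()       # |S_W| / |S_T|,    Eq. (43)

    print(np.isclose(crit_41, np.trace(np.linalg.solve(S_W, S_B))))
    print(np.isclose(crit_42, np.trace(np.linalg.solve(S_T, S_W))))
    print(np.isclose(crit_43, np.linalg.det(S_W) / np.linalg.det(S_T)))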
Since all of these criterion functions are invariant to linear transformations, the same is true of the partitions that extremize them. In the special case of two clusters, only one eigenvalue is nonzero, and all of these criteria yield the same clustering. However, when the samples are partitioned into more than two clusters, the optimal partitions, though often similar, need not be the same.

With regard to the criterion functions involving S_T, note that S_T does not depend on how the samples are partitioned into clusters. Thus, the clusterings that minimize |S_W|/|S_T| are exactly the same as the ones that minimize |S_W|. If we rotate and scale the axes so that S_T becomes the identity matrix, we see that minimizing tr S_T^{-1} S_W is equivalent to minimizing the sum-of-squared-error criterion tr S_W after performing this normalization. Figure 6.14 illustrates the effects of this transformation graphically. Clearly, this criterion suffers from the very defects that we warned about in Section 6.7, and it is probably the least desirable of these criteria.

One final warning about invariant criteria is in order. If different apparent groupings can be obtained by scaling the axes or by applying any other linear transformation, then all of these groupings will be exposed by invariant procedures. Thus, invariant criterion functions are more likely to possess multiple local extrema, and are correspondingly more difficult to extremize.

The variety of the criterion functions we have discussed and the somewhat subtle differences between them should not be allowed to obscure their essential similarity. In every case the underlying model is that the samples form c fairly well separated clouds of points. The within-cluster scatter matrix S_W is used to measure the compactness of these clouds, and the basic goal is to find the most compact grouping. While this approach has proved useful for many problems, it is not universally applicable. For example, it will not extract a very dense cluster embedded in the center of a diffuse cluster, or separate intertwined line-like clusters. For such cases one must devise other criterion functions that are better matched to the structure present or being sought.
6.9 ITERATIVE OPTIMIZATION

Once a criterion function has been selected, clustering becomes a well-defined problem in discrete optimization: find those partitions of the set of samples
that extremize the criterion function. Since the sample set is finite, there are only a finite number of possible partitions. Thus, in theory the clustering problem can always be solved by exhaustive enumeration. However, in practice such an approach is unthinkable for all but the simplest problems. There are approximately c^n/c! ways of partitioning a set of n elements into c subsets,† and this exponential growth with n is overwhelming. For example, an exhaustive search for the best set of 5 clusters in 100 samples would require considering more than 10^67 partitionings. Thus, in most applications an exhaustive search is completely infeasible.
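To get a feeling for these numbers, the short sketch below (not in the original) compares the approximation c^n/c! with the exact count of partitions of n items into c nonempty subsets given in the footnote, for the 5-cluster, 100-sample example.

    from math import comb, factorial

    def exact_partitions(n, c):
        # Number of partitions of n items into c nonempty subsets
        # (the inclusion-exclusion formula quoted in the footnote).
        return sum((-1) ** (c - i) * comb(c, i) * i ** n
                   for i in range(1, c + 1)) // factorial(c)

    n, c = 100, 5
    approx = c ** n // factorial(c)            # dominant term c^n / c!
    exact = exact_partitions(n, c)

    print(f"approx c^n/c! = {float(approx):.3e}")
    print(f"exact count   = {float(exact):.3e}")
    print(exact > 10 ** 67)                    # more than 10^67 partitionings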
The approach most frequently used in seeking optimal partitions is iterative optimization. The basic idea is to find some reasonable initial partition and to "move" samples from one group to another if such a move will improve the value of the criterion function. Like hill-climbing procedures in general, these approaches guarantee local but not global optimization. Different starting points can lead to different solutions, and one never knows whether or not the best solution has been found. Despite these limitations, the fact that the computational requirements are bearable makes this approach significant.
Let us consider the use of iterative improvement to minimize the sum-of-squared-error criterion J_e, written as

    J_e = \sum_{i=1}^{c} J_i ,

where

    J_i = \sum_{x \in \mathcal{X}_i} \|x - m_i\|^2

and

    m_i = \frac{1}{n_i} \sum_{x \in \mathcal{X}_i} x .
Suppose that a sample x currently in cluster 𝒳_i is tentatively moved to 𝒳_j. Then m_j changes to

    m_j^* = m_j + \frac{x - m_j}{n_j + 1}
† The reader who likes combinatorial problems will enjoy showing that there are exactly

    \frac{1}{c!} \sum_{i=1}^{c} \binom{c}{i} (-1)^{c-i} i^n

partitions of n items into c nonempty subsets. (See W. Feller, An Introduction to Probability Theory and Its Applications, Vol. I, p. 58 (John Wiley, New York, Second Edition, 1959).) If n ≫ c, the last term is the most significant.
and J_j increases to

    J_j^* = \sum_{\hat{x} \in \mathcal{X}_j} \|\hat{x} - m_j^*\|^2 + \|x - m_j^*\|^2 = J_j + \frac{n_j}{n_j + 1} \|x - m_j\|^2 .
Under the assumption that n_i ≠ 1 (singleton clusters should not be destroyed), a similar calculation shows that m_i changes to

    m_i^* = m_i - \frac{x - m_i}{n_i - 1}

and J_i decreases to

    J_i^* = J_i - \frac{n_i}{n_i - 1} \|x - m_i\|^2 .
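These four update formulas are easy to verify numerically. The sketch below (illustrative data only, not from the text) moves one sample from 𝒳_i to 𝒳_j and checks that the incremental expressions agree with direct recomputation of the means and squared errors.

    import numpy as np

    rng = np.random.default_rng(2)
    Xi = rng.normal(size=(6, 2))          # cluster i (hypothetical samples)
    Xj = rng.normal(size=(4, 2)) + 5.0    # cluster j
    x = Xi[0]                             # the sample to be moved from i to j

    def mean_and_J(X):
        m = X.mean(axis=0)
        return m, ((X - m) ** 2).sum()

    m_i, J_i = mean_and_J(Xi)
    m_j, J_j = mean_and_J(Xj)
    n_i, n_j = len(Xi), len(Xj)

    # Incremental updates from the text.
    m_j_star = m_j + (x - m_j) / (n_j + 1)
    J_j_star = J_j + n_j / (n_j + 1) * np.sum((x - m_j) ** 2)
    m_i_star = m_i - (x - m_i) / (n_i - 1)
    J_i_star = J_i - n_i / (n_i - 1) * np.sum((x - m_i) ** 2)

    # Direct recomputation after actually moving x.
    m_j_new, J_j_new = mean_and_J(np.vstack([Xj, x]))
    m_i_new, J_i_new = mean_and_J(Xi[1:])

    print(np.allclose(m_j_star, m_j_new), np.isclose(J_j_star, J_j_new))
    print(np.allclose(m_i_star, m_i_new), np.isclose(J_i_star, J_i_new))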
These equations greatly simplify the computation of the change in the criterion function. The transfer of x from 𝒳_i to 𝒳_j is advantageous if the decrease in J_i is greater than the increase in J_j. This is the case if

    \frac{n_i}{n_i - 1} \|x - m_i\|^2 > \frac{n_j}{n_j + 1} \|x - m_j\|^2 ,

which typically happens whenever x is closer to m_j than to m_i. If reassignment is profitable, the greatest decrease in the sum of squared error is obtained by selecting the cluster for which (n_j/(n_j + 1)) \|x - m_j\|^2 is minimum. This leads to the following clustering procedure:
Procedure: Basic Minimum Squared Error
1. Select an initial partition of the n samples into clusters and compute J_e and the means m_1, ..., m_c.
Loop: 2. Select the next candidate sample x̂. Suppose that x̂ is currently in 𝒳_i.
3. If n_i = 1 go to Next; otherwise compute

    \rho_j = \frac{n_j}{n_j + 1} \|\hat{x} - m_j\|^2 ,   j ≠ i
    \rho_j = \frac{n_i}{n_i - 1} \|\hat{x} - m_i\|^2 ,   j = i .

4. Transfer x̂ to 𝒳_k if ρ_k ≤ ρ_j for all j.
5. Update J_e, m_i, and m_k.
Next: 6. If J_e has not changed in n attempts, stop; otherwise go to Loop.
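A possible Python rendering of this procedure is sketched below. It is not the authors' code; the visiting order, the pass limit, and the requirement that every initial label occur at least once are choices made here for concreteness. Means and counts are maintained incrementally with the update formulas above.

    import numpy as np

    def basic_min_squared_error(X, labels, max_passes=100):
        # X: (n, d) array of samples; labels: initial hard partition with values
        # 0..c-1 (numpy int array), each occurring at least once.
        labels = labels.copy()
        n, c = len(X), labels.max() + 1
        means = np.array([X[labels == j].mean(axis=0) for j in range(c)])
        counts = np.array([(labels == j).sum() for j in range(c)], dtype=float)

        unchanged = 0
        for _ in range(max_passes):
            for idx in range(n):                      # select the next candidate sample
                x, i = X[idx], labels[idx]
                if counts[i] == 1:                    # never destroy a singleton cluster
                    unchanged += 1
                else:
                    d2 = ((means - x) ** 2).sum(axis=1)
                    rho = counts / (counts + 1.0) * d2              # rho_j for j != i
                    rho[i] = counts[i] / (counts[i] - 1.0) * d2[i]  # rho_j for j = i
                    k = int(np.argmin(rho))
                    if k != i:                        # transfer x to cluster k
                        means[i] -= (x - means[i]) / (counts[i] - 1)
                        means[k] += (x - means[k]) / (counts[k] + 1)
                        counts[i] -= 1
                        counts[k] += 1
                        labels[idx] = k
                        unchanged = 0
                    else:
                        unchanged += 1
                if unchanged >= n:                    # no change in n attempts: stop
                    return labels
        return labels

A reasonable initial labeling is the minimum-distance partition around c randomly chosen samples, one of the starting-point strategies discussed next.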
If this procedure is compared to the Basic Isodata procedure described in Section 6.4.4, it is clear that the former is essentially a sequential version of the latter. Where the Basic Isodata procedure waits until all n samples have been reclassified before updating, the Basic Minimum Squared Error procedure updates after each sample is reclassified. It has been experimentally observed that this procedure is more susceptible to being trapped at a local minimum, and it has the further disadvantage of making the results depend on the order in which the candidates are selected. However, it is at least a stepwise optimal procedure, and it can be easily modified to apply to problems in which samples are acquired sequentially and clustering must be done in real time.
One question that plagues all hill-climbing procedures is the choice of the starting point. Unfortunately, there is no simple, universally good solution to this problem. One approach is to select c samples randomly for the initial cluster centers, using them to partition the data on a minimum-distance basis. Repetition with different random selections can give some indication of the sensitivity of the solution to the starting point. Another approach is to find the c-cluster starting point from the solution to the (c - 1)-cluster problem. The solution for the one-cluster problem is the total sample mean; the starting point for the c-cluster problem can be the final means for the (c - 1)-cluster problem plus the sample that is furthest from the nearest cluster center. This approach leads us directly to the so-called hierarchical clustering procedures, which are simple methods that can provide very good starting points for iterative optimization.
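The second strategy can be sketched as follows (a hypothetical helper, not from the text): the c-cluster starting means are the final (c - 1)-cluster means plus the sample furthest from its nearest center, and the data are then repartitioned on a minimum-distance basis. In practice one would rerun the iterative optimization between growth steps.

    import numpy as np

    def grow_starting_point(X, prev_means):
        # Build c-cluster starting means from the final means of the
        # (c - 1)-cluster solution, following the heuristic described above.
        d2 = ((X[:, None, :] - prev_means[None, :, :]) ** 2).sum(axis=2)
        new_center = X[np.argmax(d2.min(axis=1))]     # sample furthest from any center
        means = np.vstack([prev_means, new_center])
        labels = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        return means, labels                          # minimum-distance partition

    # Example: start from the one-cluster solution (the total sample mean)
    # and grow to a 3-cluster starting point.
    rng = np.random.default_rng(3)
    X = rng.normal(size=(30, 2))
    means = X.mean(axis=0, keepdims=True)
    for _ in range(2):
        means, labels = grow_starting_point(X, means)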
6.10 HIERARCHICAL CLUSTERING

6.10.1 Definitions
Let us consider a sequence of partitions of the n samples into c clusters. The first of these is a partition into n clusters, each cluster containing exactly one sample. The next is a partition into n - 1 clusters, the next a partition into n - 2, and so on until the nth, in which all the samples form one cluster. We shall say that we are at level k in the sequence when c = n - k + 1. Thus, level one corresponds to n clusters and level n to one. Given any two samples x and x', at some level they will be grouped together in the same cluster. If the sequence has the property that whenever two samples are in the same cluster at level k they remain together at all higher levels, then the sequence is said to be a hierarchical clustering. The classical examples of hierarchical clustering appear in biological taxonomy, where individuals are grouped into species, species into genera, genera into families, and so on.
In fact, this kind of clustering permeates classificatory activities in the sciences.

For every hierarchical clustering there is a corresponding tree, called a dendrogram, that shows how the samples are grouped. Figure 6.15 shows a dendrogram for a hypothetical problem involving six samples. Level 1 shows the six samples as singleton clusters. At level 2, samples x_3 and x_5 have been grouped to form a cluster, and they stay together at all subsequent levels. If it is possible to measure the similarity between clusters, then the dendrogram is usually drawn to scale to show the similarity between the clusters that are grouped. In Figure 6.15, for example, the similarity between the two groups of samples that are merged at level 6 has a value of 30. The similarity values are often used to help determine whether the groupings are natural or forced. For our hypothetical example, one would be inclined to say that the groupings at levels 4 or 5 are natural, but that the large reduction in similarity needed to go to level 6 makes that grouping forced. We shall see shortly how such similarity values can be obtained.

FIGURE 6.15. A dendrogram for hierarchical clustering. (Six samples are merged over levels 1 through 6; the vertical axis is a similarity scale running from 100 down to 10.)

Because of their conceptual simplicity, hierarchical clustering procedures are among the best-known methods. The procedures themselves can be divided into two distinct classes, agglomerative and divisive. Agglomerative (bottom-up, clumping) procedures start with n singleton clusters and form
the sequence by successively merging clusters. Divisive (top-down, splitting) procedures start with all of the samples in one cluster and form the sequence by successively splitting clusters. The computation needed to go from one level to another is usually simpler for the agglomerative procedures. However, when there are many samples and one is interested in only a small number of clusters, this computation will have to be repeated many times. For simplicity, we shall limit our attention to the agglomerative procedures, referring the reader to the literature for divisive methods.
6.10.2 Agglomerative Hierarchical Clustering

The major steps in agglomerative clustering are contained in the following procedure:

Procedure: Basic Agglomerative Clustering
1. Let ĉ = n and 𝒳_i = {x_i}, i = 1, ..., n.
Loop: 2. If ĉ ≤ c, stop.
3. Find the nearest pair of distinct clusters, say 𝒳_i and 𝒳_j.
4. Merge 𝒳_i and 𝒳_j, delete 𝒳_j, and decrement ĉ by one.
5. Go to Loop.
As described, this procedure terminates when the specified number of clusters has been obtained. However, if we continue until ĉ = 1 we can produce a dendrogram like that shown in Figure 6.15. At any level the distance between nearest clusters can provide the dissimilarity value for that level. The reader will note that we have not said how to measure the distance between two clusters. The considerations here are much like those involved in selecting a criterion function. For simplicity, we shall restrict our attention to the following distance measures, leaving extensions to other similarity measures to the reader's imagination:
    d_min(\mathcal{X}_i, \mathcal{X}_j) = \min_{x \in \mathcal{X}_i, x' \in \mathcal{X}_j} \|x - x'\|

    d_max(\mathcal{X}_i, \mathcal{X}_j) = \max_{x \in \mathcal{X}_i, x' \in \mathcal{X}_j} \|x - x'\|

    d_avg(\mathcal{X}_i, \mathcal{X}_j) = \frac{1}{n_i n_j} \sum_{x \in \mathcal{X}_i} \sum_{x' \in \mathcal{X}_j} \|x - x'\|

    d_mean(\mathcal{X}_i, \mathcal{X}_j) = \|m_i - m_j\| .
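A direct, if inefficient, Python sketch of the Basic Agglomerative Clustering procedure with these four distance measures is given below (illustrative only; the nearest-pair search is a naive scan over all pairs at each merge).

    import numpy as np

    def d_min(A, B):
        return min(np.linalg.norm(a - b) for a in A for b in B)

    def d_max(A, B):
        return max(np.linalg.norm(a - b) for a in A for b in B)

    def d_avg(A, B):
        return sum(np.linalg.norm(a - b) for a in A for b in B) / (len(A) * len(B))

    def d_mean(A, B):
        return np.linalg.norm(np.mean(A, axis=0) - np.mean(B, axis=0))

    def agglomerate(X, c, dist=d_min):
        # Start with n singleton clusters and successively merge the nearest
        # pair of distinct clusters until only c clusters remain.
        clusters = [[x] for x in X]
        merge_values = []                       # dissimilarity value at each level
        while len(clusters) > c:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = dist(clusters[i], clusters[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            d, i, j = best
            clusters[i].extend(clusters[j])     # merge X_i and X_j
            del clusters[j]                     # delete X_j, decrement c_hat
            merge_values.append(d)              # usable for drawing a dendrogram
        return clusters, merge_values

Substituting d_max, d_avg, or d_mean for d_min changes which pair is considered nearest and hence can change the resulting clusterings.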
All of these measures have a minimum-variance flavor, and they usually yield the same results if the clusters are compact and well separated. However, if the clusters are close to one another, or if their shapes are not basically hyperspherical, quite different results can be obtained. We shall use the two-dimensional point sets shown in Figure 6.16 to illustrate some of the differences.
FIGURE 6.16. Three illustrative examples.
FIGURE 6.17. Results of the nearest-neighbor algorithm.