overlap between the component densities, one can expect biased estimates and less than optimal results.

Despite these drawbacks, the simplicity of decision-directed procedures makes the Bayesian approach computationally feasible, and a flawed solution is often better than none. If conditions are favorable, performance that is nearly optimal can be achieved at far less computational expense. The literature contains a few rather complicated analyses of particular decision-directed procedures, and numerous reports of experimental results. The basic conclusions are that most of these procedures work well if the parametric assumptions are valid, if there is little overlap between the component densities, and if the initial classifier design is at least roughly correct.
`
`6.6 DATA DESCRIPTION AND CLUSTERING
`
Let us reconsider our original problem of learning something of use from a set of unlabelled samples. Viewed geometrically, these samples form clouds of points in a d-dimensional space. Suppose that we knew that these points came from a single normal distribution. Then the most we could learn from the data would be contained in the sufficient statistics: the sample mean and the sample covariance matrix. In essence, these statistics constitute a compact description of the data. The sample mean locates the center of gravity of the cloud. It can be thought of as the single point x that best represents all of the data in the sense of minimizing the sum of squared distances from x to the samples. The sample covariance matrix tells us how well the sample mean describes the data in terms of the amount of scatter that exists in various directions. If the data points are actually normally distributed, then the cloud has a simple hyperellipsoidal shape, and the sample mean tends to fall in the region where the samples are most densely concentrated.

Of course, if the samples are not normally distributed, these statistics can give a very misleading description of the data. Figure 6.7 shows four different data sets that all have the same mean and covariance matrix. Obviously, second-order statistics are incapable of revealing all of the structure in an arbitrary set of data.
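To make these sufficient statistics concrete, the following sketch (our own illustration in NumPy; none of the names or data come from the text) computes the sample mean and sample covariance matrix of a cloud of points and checks that the mean minimizes the sum of squared distances to the samples.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(loc=[2.0, -1.0], scale=[1.0, 0.5], size=(500, 2))  # a cloud of n points in d = 2 dimensions

    n, d = X.shape
    m = X.sum(axis=0) / n                 # sample mean: the center of gravity of the cloud
    C = (X - m).T @ (X - m) / n           # sample covariance matrix: scatter about the mean

    # The sample mean is the single point minimizing the sum of squared distances to the samples.
    def sum_sq_dist(x0):
        return float(np.sum((X - x0) ** 2))

    assert sum_sq_dist(m) <= sum_sq_dist(m + 0.1)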
By assuming that the samples come from a mixture of c normal distributions, we can approximate a greater variety of situations. In essence, this corresponds to assuming that the samples fall in hyperellipsoidally-shaped clouds of various sizes and orientations. If the number of component densities is not limited, we can approximate virtually any density function in this way, and use the parameters of the mixture to describe the data. Unfortunately, we have seen that the problem of estimating the parameters of a mixture density is not trivial.
`. ~ ... . . ..
`
`: !.···""l~.-;.: :;.· .:-•,
`. . . . · .. :~ .... •.-t-· ":- .
`.
`
`.. ~,::. ,.; :-".&.•~3~.:..:.;-. :
`1;;-.. ~~ • • •
`. ., •. ':<.•!l>t. ~•,;_ • .,,,... r.:, .. . . .
`• •
`I•"•
`. .. :i: .'lift--... -... ~~;,:::·.; I
`. -~ ~ ... :-· ..
`.. .. . ~~ ~~---•wt".
`• ·. •• ·. Ji·....
`
`: . . .. -1J;;..~.o.: •.. :·?"·: ·. :
`~
`.:··. \ .= :\:\:· .. f•~;-~. . ..
`· .. : . ·. ::,'; .. -.-_:.,,-•: •;.· .
`
`. '
`: ·:·
`• • • •i • '
`
`FIGURE 6.7. Data sets having identical second-order statistics.
`
Furthermore, in situations where we have relatively little a priori knowledge about the nature of the data, the assumption of particular parametric forms may lead to poor or meaningless results. Instead of finding structure in the data, we would be imposing structure on it.

One alternative is to use one of the nonparametric methods described in Chapter 4 to estimate the unknown mixture density. If accurate, the resulting estimate is certainly a complete description of what we can learn from the data. Regions of high local density, which might correspond to significant subclasses in the population, can be found from the peaks or modes of the estimated density.
`If the goal is to find subclasses, a more direct alternative is to use a
`clustering procedure. Roughly speaking, clustering procedures yield a data
`description in terms of clusters or groups of data points that possess strong
`internal similarities. The more formal procedures use a criterion function,
`such as the sum of the squared distances from the cluster centers, and seek
`the grouping that extremizes the criterion function. Because even this can
`lead to unmanageable computational problems, other procedures have been
`proposed that are intuitively appealing but that lead to solutions having no
`established properties. Their use is usually justified on the ground that they
`are easy to apply and often yield interesting results that may guide the
`application of more rigorous procedures.
`
`6.7 SIMILARITY MEASURES
`
Once we describe the clustering problem as one of finding natural groupings in a set of data, we are obliged to define what we mean by a natural grouping. In what sense are we to say that the samples in one cluster are more like one another than like samples in other clusters? This question actually involves two separate issues: how should one measure the similarity between samples, and how should one evaluate a partitioning of a set of samples into clusters? In this section we address the first of these issues.

The most obvious measure of the similarity (or dissimilarity) between two samples is the distance between them. One way to begin a clustering investigation is to define a suitable distance function and compute the matrix of distances between all pairs of samples. If distance is a good measure of dissimilarity, then one would expect the distance between samples in the same cluster to be significantly less than the distance between samples in different clusters.
Suppose for the moment that we say that two samples belong to the same cluster if the Euclidean distance between them is less than some threshold distance d0. It is immediately obvious that the choice of d0 is very important. If d0 is very large, all of the samples will be assigned to one cluster. If d0 is very small, each sample will form an isolated cluster. To obtain "natural" clusters, d0 will have to be greater than typical within-cluster distances and less than typical between-cluster distances (see Figure 6.8).
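A minimal sketch of this threshold rule, assuming the samples are the rows of a NumPy array (the function name and the depth-first-search formulation are ours, not the text's): two samples end up in the same cluster whenever they are joined by a chain of links shorter than d0.

    import numpy as np

    def threshold_clusters(X, d0):
        """Assign each sample a cluster label; two samples share a label whenever
        they are connected by a chain of pairwise Euclidean distances below d0."""
        n = len(X)
        labels = np.full(n, -1)
        current = 0
        for start in range(n):
            if labels[start] != -1:
                continue
            labels[start] = current
            stack = [start]
            while stack:                        # depth-first search over the "closer than d0" graph
                j = stack.pop()
                near = np.linalg.norm(X - X[j], axis=1) < d0
                for k in np.flatnonzero(near):
                    if labels[k] == -1:
                        labels[k] = current
                        stack.append(k)
            current += 1
        return labels

    # Very large d0: one cluster containing every sample.  Very small d0: each sample isolated.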
Less obvious perhaps is the fact that the results of clustering depend on the choice of Euclidean distance as a measure of dissimilarity. This choice implies that the feature space is isotropic. Consequently, clusters defined by Euclidean distance will be invariant to translations or rotations, that is, to rigid-body motions of the data points. However, they will not be invariant to linear transformations in general, or to other transformations that distort the distance relationships.
FIGURE 6.8. The effect of a distance threshold on clustering: (a) large d0, (b) intermediate d0, (c) small d0. (Lines are drawn between points closer than a distance d0 apart.)
`
Thus, as Figure 6.9 illustrates, a simple scaling of the coordinate axes can result in a different grouping of the data into clusters. Of course, this is of no concern for problems in which arbitrary rescaling is an unnatural or meaningless transformation. However, if clusters are to mean anything, they should be invariant to transformations natural to the problem.
One way to achieve invariance is to normalize the data prior to clustering. For example, to obtain invariance to displacement and scale changes, one might translate and scale the axes so that all of the features have zero mean and unit variance. To obtain invariance to rotation, one might rotate the axes so that they coincide with the eigenvectors of the sample covariance matrix. This transformation to principal components can be preceded and/or followed by normalization for scale.
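These two normalizations might be sketched as follows (an illustrative fragment using NumPy; the function names are ours):

    import numpy as np

    def standardize(X):
        """Translate and scale the axes so that every feature has zero mean and unit variance."""
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def rotate_to_principal_axes(X):
        """Rotate the axes so that they coincide with the eigenvectors of the sample covariance matrix."""
        Xc = X - X.mean(axis=0)
        eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
        return Xc @ eigenvectors                # coordinates along the principal components

    # The rotation can be preceded and/or followed by standardize() for scale invariance.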
However, the reader should not conclude that this kind of normalization is necessarily desirable. Consider, for example, the matter of translating and scaling the axes so that each feature has zero mean and unit variance. The rationale usually given for this normalization is that it prevents certain features from dominating distance calculations merely because they have large numerical values.
Subtracting the mean and dividing by the standard deviation is an appropriate normalization if this spread of values is due to normal random variation; however, it can be quite inappropriate if the spread is due to the presence of subclasses (see Figure 6.10). Thus, this routine normalization may be less than helpful in the cases of greatest interest. Section 6.8.3 describes some better ways to obtain invariance to scaling.
An alternative to normalizing the data and using Euclidean distance is to use some kind of normalized distance, such as the Mahalanobis distance. More generally, one can abandon the use of distance altogether and introduce a nonmetric similarity function s(x, x') to compare two vectors x and x'. Conventionally, this is a symmetric function whose value is large when x and x' are similar.
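As a hedged sketch of one such normalized distance, the squared Mahalanobis distance computed with a covariance matrix estimated from the data might look as follows (the function name is ours):

    import numpy as np

    def mahalanobis_sq(x, x_prime, cov):
        """Squared Mahalanobis distance (x - x')^t C^{-1} (x - x'), with C a covariance matrix."""
        diff = x - x_prime
        return float(diff @ np.linalg.inv(cov) @ diff)

    # When cov is estimated from the data themselves, this distance is unchanged by any
    # nonsingular linear transformation of the features, unlike the Euclidean distance.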
`
`4
`
`5
`
`6
`
`3
`2
`x1' = o.2x,
`
`4
`
`5
`
`6
`
`2
`
`3
`
`.,
`
`0
`
`0
`0
`
`6
`
`5
`
`4
`
`N
`,':I
`0
`•
`3
`"'
`
`X
`
`2
`
`FIGURE 6.9. The effect of scaling on the apparent clustering.
`
`
`
`Case 1:14-cv-02396-PGG-MHD Document 148-9 Filed 05/30/19 Page 7 of 12
`
`.216 UNSUPERVISED LEARNING AND CLUSTERING
`
`0
`
`0 0 0
`
`0
`
`Oo
`
`o o
`0
`
`0 0
`0
`0 0
`
`0
`
`0
`
`0
`
`00
`0
`
`0
`
`0 0
`0
`o O 00
`
`[)
`
`{b) NORMALIZED
`{a) UNNORMALIZED
`FIGURE 6.10. Undesirable effects of normalization.
`
For example, when the angle between two vectors is a meaningful measure of their similarity, then the normalized inner product

s(x, x') = \frac{x^t x'}{\|x\| \, \|x'\|}

may be an appropriate similarity function. This measure, which is the cosine of the angle between x and x', is invariant to rotation and dilation, though it is not invariant to translation and general linear transformations.
When the features are binary valued (0 or 1), this similarity function has a simple nongeometrical interpretation in terms of measuring shared features or shared attributes. Let us say that a sample x possesses the ith attribute if x_i = 1. Then x^t x' is merely the number of attributes possessed by x and x', and \|x\| \, \|x'\| = (x^t x \, x'^t x')^{1/2} is the geometric mean of the number of attributes possessed by x and the number possessed by x'.
Thus, s(x, x') is a measure of the relative possession of common attributes. Some simple variations are

s(x, x') = \frac{x^t x'}{d},

the fraction of attributes shared, and

s(x, x') = \frac{x^t x'}{x^t x + x'^t x' - x^t x'},

the ratio of the number of shared attributes to the number possessed by x or x'. This latter measure (sometimes known as the Tanimoto coefficient) is frequently encountered in the fields of information retrieval and biological taxonomy. Other measures of similarity arise in other applications, the variety of measures testifying to the diversity of problem domains.
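These measures are simple to compute for binary attribute vectors; the sketch below (our own code, with names of our choosing) implements the normalized inner product, the fraction of the d attributes shared, and the Tanimoto coefficient.

    import numpy as np

    def cosine_similarity(x, xp):
        return float(x @ xp) / (np.linalg.norm(x) * np.linalg.norm(xp))

    def fraction_shared(x, xp):
        return float(x @ xp) / len(x)           # number of shared attributes over d

    def tanimoto(x, xp):
        shared = float(x @ xp)                  # attributes possessed by both vectors
        return shared / (float(x @ x) + float(xp @ xp) - shared)

    x, xp = np.array([1, 1, 0, 1, 0]), np.array([1, 0, 0, 1, 1])
    print(cosine_similarity(x, xp), fraction_shared(x, xp), tanimoto(x, xp))  # approx. 0.667, 0.4, 0.5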
We feel obliged to mention that fundamental issues in measurement theory are involved in the use of any distance or similarity function. The calculation of the similarity between two vectors always involves combining the values of their components. Yet, in many pattern recognition applications the components of the feature vector measure seemingly noncomparable quantities. Using our early example of classifying lumber, how can one compare the brightness to the straightness-of-grain? Should the comparison depend on whether the brightness is measured in candles/m² or in foot-lamberts? How does one treat vectors whose components have a mixture of nominal, ordinal, interval, and ratio scales?* Ultimately, there is no methodological answer to these questions. When a user selects a particular similarity function or normalizes his data in a particular way, he introduces information that gives the procedure meaning. We have given examples of some alternatives that have proved to be useful. Beyond that we can do little more than alert the unwary to these pitfalls of clustering.
`
`6.8 CRITERION FUNCTIONS FOR CLUSTERING
`
Suppose that we have a set \mathcal{X} of n samples x_1, ..., x_n that we want to partition into exactly c disjoint subsets \mathcal{X}_1, ..., \mathcal{X}_c. Each subset is to represent a cluster, with samples in the same cluster being somehow more similar than samples in different clusters. One way to make this into a well-defined problem is to define a criterion function that measures the clustering quality of any partition of the data. Then the problem is one of finding the partition that extremizes the criterion function. In this section we examine the characteristics of several basically similar criterion functions, postponing until later the question of how to find an optimal partition.
`
`6.8.1 The Sum-of-Squared-Error Criterion
`
The simplest and most widely used criterion function for clustering is the sum-of-squared-error criterion. Let n_i be the number of samples in \mathcal{X}_i and let m_i be the mean of those samples,

m_i = \frac{1}{n_i} \sum_{x \in \mathcal{X}_i} x .   (25)

Then the sum of squared errors is defined by

J_e = \sum_{i=1}^{c} \sum_{x \in \mathcal{X}_i} \| x - m_i \|^2 .   (26)

This criterion function has a simple interpretation. For a given cluster \mathcal{X}_i, the mean vector m_i is the best representative of the samples in \mathcal{X}_i in the sense that it minimizes the sum of the squared lengths of the "error" vectors x - m_i. Thus, J_e measures the total squared error incurred in representing the n samples x_1, ..., x_n by the c cluster centers m_1, ..., m_c.
`* These fundamental considerations are by no means unique to clustering. They appear,
`for example, whenever one chooses a parametric form for an unknown probability density
`function, a metric for nonparametric density estimation, or scale factors for linear
`discriminant functions. Clustering problems merely expose them more clearly.
`
`(25)
`
`(26)
`
`C
`
`•
`
`
`
`Case 1:14-cv-02396-PGG-MHD Document 148-9 Filed 05/30/19 Page 9 of 12
`
`218
`
`UNSUPERVISED LEARNING AND CLUSTERING
`
The value of J_e depends on how the samples are grouped into clusters, and an optimal partitioning is defined as one that minimizes J_e. Clusterings of this type are often called minimum variance partitions.

What kind of clustering problems are well suited to a sum-of-squared-error criterion? Basically, J_e is an appropriate criterion when the clusters form essentially compact clouds that are rather well separated from one another. It should work well for the two or three clusters in Figure 6.11, but one would not expect reasonable results for the data in Figure 6.12.* A less obvious problem arises when there are great differences in the number of samples in different clusters.
`
`3.0 ,--
`
`---.------,----.------,---
`
`--~------,-------.
`
`2.5
`
`2.0
`
`E
`"
`... 1.5
`::c:
`0
`~
`..J
`~
`w
`a.
`
`1.0
`
`0.5
`
`0
`
`0
`
`t::. Iris virginica
`
`0 Iris versicolor
`
`0 Iris setosa
`
`t'-. M.
`~
`t:,.
`~ 6, t:,. t:,.
`
`t:,. c-..
`CIN:,6. {\
`
`M
`
`6
`
`t:,.
`
`N6.6
`~6.
`f!l,.t:,
`0
`6
`00
`0
`l;):o ~
`El cfu
`0
`0 111111
`0000
`0
`CD
`B El o OD
`
`0
`
`0
`
`r!(,.t:,.
`
`6
`
`t:,.
`
`0
`
`0
`0
`0 (}:x) 0
`
`00~0
`
`3
`cm
`PETAL LENGTH -
`FIGURE 6.11. A two-dimensional section of the Anderson iris data.
`
`4
`
`5
`
`6
`
`2
`
`t:,.
`
`6
`t:,.
`6
`
`7
`
* These two data sets are well known for quite different reasons. Figure 6.11 shows two of four measurements made by E. Anderson on 150 samples of three species of iris. These data were listed and used by R. A. Fisher in his classic paper on discriminant analysis (Fisher 1936), and have since become a favorite example for illustrating clustering procedures. Figure 6.12 is well known in astronomy as the Hertzsprung and Russell (or spectrum-luminosity) diagram, which led to the subdivision of stars into such categories as giants, supergiants, main sequence stars, and dwarfs. It was used by E. W. Forgey and again by D. Wishart (1969) to illustrate the limitations of simple clustering procedures.
`.,
`·S ..
`
`·J
`
`-2
`
`- I
`
`,1
`
`+5
`
`+7
`
`+6
`
`+I
`
`+10 .,,
`,,J
`,1J .,,,
`
`CRITERION FUNCTIONS FOR CLUSTERING
`
`219
`
`-5
`
`-J
`
`-i! _,
`
`+1
`
`,2
`
`,J
`
`+6
`
`+7
`
`+6
`
`+I
`
`+10
`
`+If
`
`;fJ
`
`+13
`
`+15
`+15
`FIGURE 6.12. The Herzsprung-Russell Diagram (Courtesy Lunds Universitet
`Institutionen fiir Astronomi).
`
In that case it can happen that a partition that splits a large cluster is favored over one that maintains the integrity of the clusters merely because the slight reduction in squared error achieved is multiplied by many terms in the sum (see Figure 6.13). This situation frequently arises because of the presence of "outliers" or "wild shots," and brings up the problem of interpreting and evaluating the results of clustering. Since little can be said about that problem, we shall merely observe that if additional considerations render the results of minimizing J_e unsatisfactory, then these considerations should be used, if possible, in formulating a better criterion function.
`
`6.8.2 Related Minimum Variance Criteria
`
By some simple algebraic manipulation we can eliminate the mean vectors from the expression for J_e and obtain the equivalent expression

J_e = \frac{1}{2} \sum_{i=1}^{c} n_i \bar{s}_i ,   (27)
`.,,,..,..--
`\ ----
`'
`o
`o I "'
`lP
`o,, 0 0
`O O
`ooo
`o ~o
`0
`011
`,, o
`o O o 1I
`IIOoO
`00
`\\
`0
`oto
`II
`Oo
`o llo
`o
`o
`....... __ ___
`/1,
`.... __
`
`......
`
`',
`
`',
`
`',
`
`'\
`o o \
`o
`I
`J
`
`/
`
`/
`
`/
`_.,,,..,,,.
`
`0
`
`O
`
`--
`
`(al
`
`/
`
`0
`
`0
`
`0
`
`/
`/
`I
`/
`Io
`/
`I
`\
`\
`\ o
`\
`\.
`'
`
`..,,,.,----
`' '
`'
`'
`0 o \
`o
`
`0
`
`0
`0
`
`0
`
`0
`
`0
`
`0 0
`0
`
`0
`
`0 o
`
`/
`
`/
`
`/
`
`0
`
`0
`
`0
`
`/
`I
`I
`I
`I
`I
`I
`\
`\
`\ 0
`
`o
`
`0
`
`0
`
`0
`0
`
`0
`
`0
`
`/
`
`/
`
`0 /
`/
`
`/
`
`0
`
`00
`
`0
`
`0
`
`0 0
`0
`
`0
`
`0
`
`----
`
`0
`' 0
`0
`'
`.....
`'
`
`,--....
`I o
`\
`I
`o
`I
`J
`0
`\
`...__/
`
`\
`I
`I
`I
`I
`I
`I
`
`(b)
`
`FIGURE 6.13. The problem of splitting large clus(cid:173)
`ters: the sum of squared error is smaller for (a) than
`for (b).
`
where

\bar{s}_i = \frac{1}{n_i^2} \sum_{x \in \mathcal{X}_i} \sum_{x' \in \mathcal{X}_i} \| x - x' \|^2 .   (28)

Eq. (28) leads us to interpret \bar{s}_i as the average squared distance between points in the ith cluster, and emphasizes the fact that the sum-of-squared-error criterion uses Euclidean distance as the measure of similarity. It also suggests an obvious way of obtaining other criterion functions. For example, one can replace \bar{s}_i by the average, the median, or perhaps the maximum distance between points in a cluster. More generally, one can introduce an appropriate similarity function s(x, x') and replace \bar{s}_i by functions such as

\bar{s}_i = \frac{1}{n_i^2} \sum_{x \in \mathcal{X}_i} \sum_{x' \in \mathcal{X}_i} s(x, x')   (29)

or

\bar{s}_i = \min_{x, x' \in \mathcal{X}_i} s(x, x') .   (30)

As before, we define an optimal partitioning as one that extremizes the criterion function. This creates a well-defined problem, and the hope is that its solution discloses the intrinsic structure of the data.
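To make the correspondence between Eq. (26) and Eqs. (27) and (28) concrete, the following sketch (our own code, using the same label-array encoding as before) computes J_e from the pairwise squared distances and also shows the minimum-similarity variant of Eq. (30):

    import numpy as np

    def average_sq_distance(cluster):
        """Eq. (28): the average squared distance between points in a cluster (s-bar_i)."""
        diffs = cluster[:, None, :] - cluster[None, :, :]
        return float(np.sum(diffs ** 2)) / len(cluster) ** 2

    def J_e_from_pairwise(X, labels, c):
        """Eq. (27): J_e = (1/2) * sum_i n_i * s-bar_i, which equals Eq. (26)."""
        total = 0.0
        for i in range(c):
            cluster = X[labels == i]
            if len(cluster) > 0:
                total += 0.5 * len(cluster) * average_sq_distance(cluster)
        return total

    def min_within_cluster_similarity(cluster, s):
        """Eq. (30): replace s-bar_i by the minimum similarity within a (non-empty) cluster."""
        return min(s(x, xp) for x in cluster for xp in cluster)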
`
`6.8.3 Scattering Criteria
`
`6.8.3.1 THE SCATTER MATRICES
`
`Another interesting class of criterion functions can be derived from the scatter
`matrices used in multiple discriminant analysis. The following definitions
`directly parallel the definitions given in Section 4.11.
`
Mean vector for the ith cluster:

m_i = \frac{1}{n_i} \sum_{x \in \mathcal{X}_i} x   (31)

Total mean vector:

m = \frac{1}{n} \sum_{x \in \mathcal{X}} x = \frac{1}{n} \sum_{i=1}^{c} n_i m_i   (32)

Scatter matrix for the ith cluster:

S_i = \sum_{x \in \mathcal{X}_i} (x - m_i)(x - m_i)^t   (33)

Within-cluster scatter matrix:

S_W = \sum_{i=1}^{c} S_i   (34)

Between-cluster scatter matrix:

S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t   (35)

Total scatter matrix:

S_T = \sum_{x \in \mathcal{X}} (x - m)(x - m)^t   (36)

As before, it follows from these definitions that the total scatter matrix is the sum of the within-cluster scatter matrix and the between-cluster scatter matrix:

S_T = S_W + S_B .   (37)
`
`