EXHIBIT 4
PART 3 OF 8
overlap between the component densities, one can expect biased estimates and less than optimal results.

Despite these drawbacks, the simplicity of decision-directed procedures makes the Bayesian approach computationally feasible, and a flawed solution is often better than none. If conditions are favorable, performance that is nearly optimal can be achieved at far less computational expense. The literature contains a few rather complicated analyses of particular decision-directed procedures, and numerous reports of experimental results. The basic conclusions are that most of these procedures work well if the parametric assumptions are valid, if there is little overlap between the component densities, and if the initial classifier design is at least roughly correct.

6.6 DATA DESCRIPTION AND CLUSTERING

Let us reconsider our original problem of learning something of use from a set of unlabelled samples. Viewed geometrically, these samples form clouds of points in a d-dimensional space. Suppose that we knew that these points came from a single normal distribution. Then the most we could learn from the data would be contained in the sufficient statistics: the sample mean and the sample covariance matrix. In essence, these statistics constitute a compact description of the data. The sample mean locates the center of gravity of the cloud. It can be thought of as the single point x that best represents all of the data in the sense of minimizing the sum of squared distances from x to the samples. The sample covariance matrix tells us how well the sample mean describes the data in terms of the amount of scatter that exists in various directions. If the data points are actually normally distributed, then the cloud has a simple hyperellipsoidal shape, and the sample mean tends to fall in the region where the samples are most densely concentrated.
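To make this compact description concrete, here is a minimal sketch (ours, not from the text; it assumes NumPy and uses made-up two-dimensional samples) that computes the sample mean and sample covariance matrix and reports the scatter along the principal axes of the cloud.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(loc=[2.0, -1.0], scale=[1.0, 0.5], size=(500, 2))  # toy unlabelled samples

    # Sample mean: the single point that minimizes the sum of squared
    # distances to the samples (the center of gravity of the cloud).
    mean = X.mean(axis=0)

    # Sample covariance matrix: describes the scatter about the mean
    # in the various directions of the d-dimensional space.
    cov = np.cov(X, rowvar=False)

    # Eigenvalues of the covariance matrix give the spread along the
    # principal axes of the (roughly hyperellipsoidal) cloud.
    eigvals, _ = np.linalg.eigh(cov)

    print("sample mean:", mean)
    print("sample covariance:\n", cov)
    print("scatter along principal axes:", eigvals)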
Of course, if the samples are not normally distributed, these statistics can give a very misleading description of the data. Figure 6.7 shows four different data sets that all have the same mean and covariance matrix. Obviously, second-order statistics are incapable of revealing all of the structure in an arbitrary set of data.

FIGURE 6.7. Data sets having identical second-order statistics.

By assuming that the samples come from a mixture of c normal distributions, we can approximate a greater variety of situations. In essence, this corresponds to assuming that the samples fall in hyperellipsoidally-shaped clouds of various sizes and orientations. If the number of component densities is not limited, we can approximate virtually any density function in this way, and use the parameters of the mixture to describe the data. Unfortunately, we have seen that the problem of estimating the parameters of a mixture density is not trivial. Furthermore, in situations where we have relatively little a priori knowledge about the nature of the data, the assumption of particular parametric forms may lead to poor or meaningless results. Instead of finding structure in the data, we would be imposing structure on it.

One alternative is to use one of the nonparametric methods described in Chapter 4 to estimate the unknown mixture density. If accurate, the resulting estimate is certainly a complete description of what we can learn from the data. Regions of high local density, which might correspond to significant subclasses in the population, can be found from the peaks or modes of the estimated density.
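As a rough illustration of this idea (ours, not from the text), the sketch below forms a Gaussian-kernel (Parzen-window) estimate of a one-dimensional density and reads off its local maxima as candidate subclasses; the bandwidth h is an arbitrary choice for the example.

    import numpy as np

    rng = np.random.default_rng(1)
    # Unlabelled samples drawn from a two-component mixture (unknown to the analyst).
    samples = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 0.8, 200)])

    def parzen_density(x, data, h):
        """Gaussian-kernel (Parzen-window) estimate of the density at the points x."""
        diffs = (x[:, None] - data[None, :]) / h
        return np.exp(-0.5 * diffs**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

    grid = np.linspace(samples.min() - 1, samples.max() + 1, 1000)
    dens = parzen_density(grid, samples, h=0.3)   # bandwidth chosen by hand for this example

    # Local maxima of the estimated density suggest significant subclasses.
    modes = grid[1:-1][(dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])]
    print("estimated modes near:", modes)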
If the goal is to find subclasses, a more direct alternative is to use a clustering procedure. Roughly speaking, clustering procedures yield a data description in terms of clusters or groups of data points that possess strong internal similarities. The more formal procedures use a criterion function, such as the sum of the squared distances from the cluster centers, and seek the grouping that extremizes the criterion function. Because even this can lead to unmanageable computational problems, other procedures have been proposed that are intuitively appealing but that lead to solutions having no established properties. Their use is usually justified on the ground that they are easy to apply and often yield interesting results that may guide the application of more rigorous procedures.

6.7 SIMILARITY MEASURES

Once we describe the clustering problem as one of finding natural groupings in a set of data, we are obliged to define what we mean by a natural grouping. In what sense are we to say that the samples in one cluster are more like one another than like samples in other clusters? This question actually involves two separate issues: how should one measure the similarity between samples, and how should one evaluate a partitioning of a set of samples into clusters? In this section we address the first of these issues.
The most obvious measure of the similarity (or dissimilarity) between two samples is the distance between them. One way to begin a clustering investigation is to define a suitable distance function and compute the matrix of distances between all pairs of samples. If distance is a good measure of dissimilarity, then one would expect the distance between samples in the same cluster to be significantly less than the distance between samples in different clusters.

Suppose for the moment that we say that two samples belong to the same cluster if the Euclidean distance between them is less than some threshold distance d0. It is immediately obvious that the choice of d0 is very important. If d0 is very large, all of the samples will be assigned to one cluster. If d0 is very small, each sample will form an isolated cluster. To obtain "natural" clusters, d0 will have to be greater than typical within-cluster distances and less than typical between-cluster distances (see Figure 6.8).

FIGURE 6.8. The effect of a distance threshold on clustering (lines are drawn between points closer than a distance d0 apart): (a) large d0, (b) intermediate d0, (c) small d0.
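A minimal sketch of this threshold rule (our own illustration, assuming NumPy; the function name is hypothetical): samples closer than d0 are linked, and clusters are taken to be the connected components of the resulting graph.

    import numpy as np

    def threshold_clusters(X, d0):
        """Group samples into the connected components of the 'closer than d0' graph."""
        n = len(X)
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        labels = -np.ones(n, dtype=int)
        current = 0
        for start in range(n):
            if labels[start] != -1:
                continue
            stack = [start]            # depth-first search over the threshold graph
            labels[start] = current
            while stack:
                i = stack.pop()
                for j in np.where((dist[i] < d0) & (labels == -1))[0]:
                    labels[j] = current
                    stack.append(j)
            current += 1
        return labels

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal([3, 3], 0.3, (30, 2))])
    for d0 in (0.2, 1.0, 10.0):        # small, intermediate, and large thresholds
        print(d0, "->", len(set(threshold_clusters(X, d0))), "clusters")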
Less obvious perhaps is the fact that the results of clustering depend on the choice of Euclidean distance as a measure of dissimilarity. This choice implies that the feature space is isotropic. Consequently, clusters defined by Euclidean distance will be invariant to translations or rotations, that is, to rigid-body motions of the data points. However, they will not be invariant to linear transformations in general, or to other transformations that distort the distance relationships. Thus, as Figure 6.9 illustrates, a simple scaling of the coordinate axes can result in a different grouping of the data into clusters. Of course, this is of no concern for problems in which arbitrary rescaling is an unnatural or meaningless transformation. However, if clusters are to mean anything, they should be invariant to transformations natural to the problem.

FIGURE 6.9. The effect of scaling on the apparent clustering.
One way to achieve invariance is to normalize the data prior to clustering. For example, to obtain invariance to displacement and scale changes, one might translate and scale the axes so that all of the features have zero mean and unit variance. To obtain invariance to rotation, one might rotate the axes so that they coincide with the eigenvectors of the sample covariance matrix. This transformation to principal components can be preceded and/or followed by normalization for scale.
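The sketch below (ours, on made-up correlated data) spells out both steps: translating and scaling so that every feature has zero mean and unit variance, then rotating onto the eigenvectors of the sample covariance matrix, that is, the transformation to principal components.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.3, 0.0],
                                              [0.0, 0.5, 0.1],
                                              [0.0, 0.0, 4.0]])   # correlated toy features

    # Invariance to displacement and scale: zero mean, unit variance per feature.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Invariance to rotation: rotate onto the eigenvectors of the sample covariance matrix.
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    P = Z @ eigvecs                     # principal-component coordinates

    # After the rotation the features are uncorrelated; a further rescaling by the
    # square roots of the eigenvalues would again give unit variance in each direction.
    print(np.round(np.cov(P, rowvar=False), 3))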
However, the reader should not conclude that this kind of normalization is necessarily desirable. Consider, for example, the matter of translating and scaling the axes so that each feature has zero mean and unit variance. The rationale usually given for this normalization is that it prevents certain features from dominating distance calculations merely because they have large numerical values. Subtracting the mean and dividing by the standard deviation is an appropriate normalization if this spread of values is due to normal random variation; however, it can be quite inappropriate if the spread is due to the presence of subclasses (see Figure 6.10). Thus, this routine normalization may be less than helpful in the cases of greatest interest. Section 6.8.3 describes some better ways to obtain invariance to scaling.

FIGURE 6.10. Undesirable effects of normalization: (a) unnormalized, (b) normalized.
An alternative to normalizing the data and using Euclidean distance is to use some kind of normalized distance, such as the Mahalanobis distance. More generally, one can abandon the use of distance altogether and introduce a nonmetric similarity function s(x, x') to compare two vectors x and x'. Conventionally, this is a symmetric function whose value is large when x and x' are similar. For example, when the angle between two vectors is a meaningful measure of their similarity, then the normalized inner product

    s(x, x') = \frac{x^t x'}{\|x\| \, \|x'\|}

may be an appropriate similarity function. This measure, which is the cosine of the angle between x and x', is invariant to rotation and dilation, though it is not invariant to translation and general linear transformations.
When the features are binary valued (0 or 1), this similarity function has a simple nongeometrical interpretation in terms of measuring shared features or shared attributes. Let us say that a sample x possesses the ith attribute if x_i = 1. Then x^t x' is merely the number of attributes possessed by x and x', and \|x\| \|x'\| = (x^t x \, x'^t x')^{1/2} is the geometric mean of the number of attributes possessed by x and the number possessed by x'. Thus, s(x, x') is a measure of the relative possession of common attributes. Some simple variations are

    s(x, x') = \frac{x^t x'}{d},

the fraction of attributes shared, and

    s(x, x') = \frac{x^t x'}{x^t x + x'^t x' - x^t x'},

the ratio of the number of shared attributes to the number possessed by x or x'. This latter measure (sometimes known as the Tanimoto coefficient) is frequently encountered in the fields of information retrieval and biological taxonomy. Other measures of similarity arise in other applications, the variety of measures testifying to the diversity of problem domains.
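For concreteness, a short sketch (ours, with made-up binary attribute vectors) of the normalized inner product, the fraction of shared attributes, and the Tanimoto coefficient.

    import numpy as np

    def cosine_similarity(x, xp):
        """Normalized inner product: the cosine of the angle between x and x'."""
        return x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp))

    def tanimoto(x, xp):
        """Shared attributes divided by attributes possessed by x or x'."""
        shared = x @ xp
        return shared / (x @ x + xp @ xp - shared)

    x  = np.array([1, 1, 0, 1, 0, 0, 1])   # hypothetical binary attribute vectors
    xp = np.array([1, 0, 0, 1, 1, 0, 1])

    print("cosine  :", cosine_similarity(x, xp))   # geometric-mean normalization
    print("fraction:", (x @ xp) / len(x))          # shared attributes / d
    print("tanimoto:", tanimoto(x, xp))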
We feel obliged to mention that fundamental issues in measurement theory are involved in the use of any distance or similarity function. The calculation of the similarity between two vectors always involves combining the values of their components. Yet, in many pattern recognition applications the components of the feature vector measure seemingly noncomparable quantities. Using our earlier example of classifying lumber, how can one compare the brightness to the straightness-of-grain? Should the comparison depend on whether the brightness is measured in candles/m2 or in foot-lamberts? How does one treat vectors whose components have a mixture of nominal, ordinal, interval, and ratio scales?* Ultimately, there is no methodological answer to these questions. When a user selects a particular similarity function or normalizes his data in a particular way, he introduces information that gives the procedure meaning. We have given examples of some alternatives that have proved to be useful. Beyond that we can do little more than alert the unwary to these pitfalls of clustering.

* These fundamental considerations are by no means unique to clustering. They appear, for example, whenever one chooses a parametric form for an unknown probability density function, a metric for nonparametric density estimation, or scale factors for linear discriminant functions. Clustering problems merely expose them more clearly.

6.8 CRITERION FUNCTIONS FOR CLUSTERING

Suppose that we have a set 𝒳 of n samples x_1, ..., x_n that we want to partition into exactly c disjoint subsets 𝒳_1, ..., 𝒳_c. Each subset is to represent a cluster, with samples in the same cluster being somehow more similar than samples in different clusters. One way to make this into a well-defined problem is to define a criterion function that measures the clustering quality of any partition of the data. Then the problem is one of finding the partition that extremizes the criterion function. In this section we examine the characteristics of several basically similar criterion functions, postponing until later the question of how to find an optimal partition.

6.8.1 The Sum-of-Squared-Error Criterion

The simplest and most widely used criterion function for clustering is the sum-of-squared-error criterion. Let n_i be the number of samples in 𝒳_i and let m_i be the mean of those samples,

    m_i = \frac{1}{n_i} \sum_{x \in \mathcal{X}_i} x.    (25)

Then the sum of squared errors is defined by

    J_e = \sum_{i=1}^{c} \sum_{x \in \mathcal{X}_i} \|x - m_i\|^2.    (26)

This criterion function has a simple interpretation. For a given cluster 𝒳_i, the mean vector m_i is the best representative of the samples in 𝒳_i in the sense that it minimizes the sum of the squared lengths of the "error" vectors x - m_i. Thus, J_e measures the total squared error incurred in representing the n samples x_1, ..., x_n by the c cluster centers m_1, ..., m_c. The value of J_e depends on how the samples are grouped into clusters, and an optimal partitioning is defined as one that minimizes J_e. Clusterings of this type are often called minimum variance partitions.
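A direct transcription of Eqs. (25) and (26) into code (our sketch, on made-up data) shows how J_e rewards a partition that matches compact, well-separated clouds.

    import numpy as np

    def sum_of_squared_error(X, labels):
        """J_e: sum over clusters of squared distances to each cluster mean (Eq. 26)."""
        Je = 0.0
        for k in np.unique(labels):
            cluster = X[labels == k]
            m = cluster.mean(axis=0)                 # m_i, the cluster mean (Eq. 25)
            Je += np.sum(np.linalg.norm(cluster - m, axis=1) ** 2)
        return Je

    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal([5, 5], 1, (50, 2))])
    good = np.repeat([0, 1], 50)                     # partition matching the two clouds
    bad = rng.integers(0, 2, size=100)               # random partition of the same data
    print("J_e (good partition):", sum_of_squared_error(X, good))
    print("J_e (random partition):", sum_of_squared_error(X, bad))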
What kind of clustering problems are well suited to a sum-of-squared-error criterion? Basically, J_e is an appropriate criterion when the clusters form essentially compact clouds that are rather well separated from one another. It should work well for the two or three clusters in Figure 6.11, but one would not expect reasonable results for the data in Figure 6.12.* A less obvious problem arises when there are great differences in the number of samples in different clusters. In that case it can happen that a partition that splits a large cluster is favored over one that maintains the integrity of the clusters merely because the slight reduction in squared error achieved is multiplied by many terms in the sum (see Figure 6.13). This situation frequently arises because of the presence of "outliers" or "wild shots," and brings up the problem of interpreting and evaluating the results of clustering. Since little can be said about that problem, we shall merely observe that if additional considerations render the results of minimizing J_e unsatisfactory, then these considerations should be used, if possible, in formulating a better criterion function.

FIGURE 6.11. A two-dimensional section of the Anderson iris data (petal width versus petal length, in cm, for Iris setosa, Iris versicolor, and Iris virginica).

FIGURE 6.12. The Hertzsprung-Russell diagram (Courtesy Lunds Universitet, Institutionen för Astronomi).

FIGURE 6.13. The problem of splitting large clusters: the sum of squared error is smaller for (a) than for (b).

* These two data sets are well known for quite different reasons. Figure 6.11 shows two of four measurements made by E. Anderson on 150 samples of three species of iris. These data were listed and used by R. A. Fisher in his classic paper on discriminant analysis (Fisher 1936), and have since become a favorite example for illustrating clustering procedures. Figure 6.12 is well known in astronomy as the Hertzsprung and Russell (or spectrum-luminosity) diagram, which led to the subdivision of stars into such categories as giants, supergiants, main sequence stars, and dwarfs. It was used by E. W. Forgy and again by D. Wishart (1969) to illustrate the limitations of simple clustering procedures.

6.8.2 Related Minimum Variance Criteria

By some simple algebraic manipulation we can eliminate the mean vectors from the expression for J_e and obtain the equivalent expression

    J_e = \frac{1}{2} \sum_{i=1}^{c} n_i \bar{s}_i,    (27)
where

    \bar{s}_i = \frac{1}{n_i^2} \sum_{x \in \mathcal{X}_i} \sum_{x' \in \mathcal{X}_i} \|x - x'\|^2.    (28)

Eq. (28) leads us to interpret s̄_i as the average squared distance between points in the ith cluster, and emphasizes the fact that the sum-of-squared-error criterion uses Euclidean distance as the measure of similarity. It also suggests an obvious way of obtaining other criterion functions. For example, one can replace s̄_i by the average, the median, or perhaps the maximum distance between points in a cluster. More generally, one can introduce an appropriate similarity function s(x, x') and replace s̄_i by functions such as

    \bar{s}_i = \frac{1}{n_i^2} \sum_{x \in \mathcal{X}_i} \sum_{x' \in \mathcal{X}_i} s(x, x')    (29)

or

    \bar{s}_i = \min_{x, x' \in \mathcal{X}_i} s(x, x').    (30)
As before, we define an optimal partitioning as one that extremizes the criterion function. This creates a well-defined problem, and the hope is that its solution discloses the intrinsic structure of the data.
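A quick numerical check (ours, on arbitrary made-up data) of the equivalence between Eq. (26) and Eqs. (27)-(28): for any partition, the sum of squared errors equals one half of the sum over clusters of n_i times the average squared distance between points in the ith cluster.

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(60, 3))
    labels = rng.integers(0, 3, size=60)        # an arbitrary 3-cluster partition

    Je_direct, Je_pairwise = 0.0, 0.0
    for k in np.unique(labels):
        C = X[labels == k]
        n_i = len(C)
        m_i = C.mean(axis=0)
        Je_direct += np.sum(np.linalg.norm(C - m_i, axis=1) ** 2)    # Eq. (26)
        sbar_i = np.sum(np.linalg.norm(C[:, None] - C[None, :], axis=-1) ** 2) / n_i**2  # Eq. (28)
        Je_pairwise += 0.5 * n_i * sbar_i                            # Eq. (27)

    print(Je_direct, Je_pairwise)   # the two values agree up to rounding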

6.8.3 Scattering Criteria

6.8.3.1 THE SCATTER MATRICES

Another interesting class of criterion functions can be derived from the scatter matrices used in multiple discriminant analysis. The following definitions directly parallel the definitions given in Section 4.11.
Mean vector for the ith cluster:

    m_i = \frac{1}{n_i} \sum_{x \in \mathcal{X}_i} x    (31)

Total mean vector:

    m = \frac{1}{n} \sum_{x \in \mathcal{X}} x = \frac{1}{n} \sum_{i=1}^{c} n_i m_i    (32)

Scatter matrix for the ith cluster:

    S_i = \sum_{x \in \mathcal{X}_i} (x - m_i)(x - m_i)^t    (33)

Within-cluster scatter matrix:

    S_W = \sum_{i=1}^{c} S_i    (34)

Between-cluster scatter matrix:

    S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t    (35)

Total scatter matrix:

    S_T = \sum_{x \in \mathcal{X}} (x - m)(x - m)^t    (36)

As before, it follows from these definitions that the total scatter matrix is the sum of the within-cluster scatter matrix and the between-cluster scatter matrix:

    S_T = S_W + S_B.    (37)
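The following sketch (ours, on made-up data) computes the quantities defined in Eqs. (31)-(36) and verifies the decomposition of Eq. (37).

    import numpy as np

    rng = np.random.default_rng(6)
    X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal([4, 0], 1, (40, 2))])
    labels = np.repeat([0, 1], 40)

    m = X.mean(axis=0)                                 # total mean vector, Eq. (32)
    d = X.shape[1]
    S_W = np.zeros((d, d))                             # within-cluster scatter, Eq. (34)
    S_B = np.zeros((d, d))                             # between-cluster scatter, Eq. (35)
    for k in np.unique(labels):
        C = X[labels == k]
        m_i = C.mean(axis=0)                           # cluster mean, Eq. (31)
        S_W += (C - m_i).T @ (C - m_i)                 # S_i, Eq. (33), summed over clusters
        S_B += len(C) * np.outer(m_i - m, m_i - m)

    S_T = (X - m).T @ (X - m)                          # total scatter matrix, Eq. (36)
    print(np.allclose(S_T, S_W + S_B))                 # Eq. (37): S_T = S_W + S_B -> True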