unnatural assumption if we are exploring an essentially unknown set of data. Thus, a constantly recurring problem in cluster analysis is that of deciding just how many clusters are present.

When clustering is done by extremizing a criterion function, a common approach is to repeat the clustering procedure for c = 1, c = 2, c = 3, etc., and to see how the criterion function changes with c. For example, it is clear that the sum-of-squared-error criterion J_e must decrease monotonically with c, since the squared error can be reduced each time c is increased merely by transferring a single sample to the new cluster. If the n samples are really grouped into ĉ compact, well separated clusters, one would expect to see J_e decrease rapidly until c = ĉ, decreasing much more slowly thereafter until it reaches zero at c = n. Similar arguments have been advanced for hierarchical clustering procedures, the usual assumption being that large disparities in the levels at which clusters merge indicate the presence of natural groupings.
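To make this informal strategy concrete, here is a short sketch (our illustration, not part of the original text) of how the sum-of-squared-error criterion behaves as c grows, using scikit-learn's KMeans as a stand-in minimizer of J_e; the synthetic three-cluster data and the range of c are arbitrary placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three compact, well separated clusters (placeholder data).
X = np.vstack([rng.normal(loc=center, scale=0.5, size=(100, 2))
               for center in ([0.0, 0.0], [5.0, 0.0], [0.0, 5.0])])

# J_e(c): the minimized sum of squared errors for a c-cluster description.
for c in range(1, 7):
    J_e = KMeans(n_clusters=c, n_init=10, random_state=0).fit(X).inertia_
    print(f"c = {c}:  J_e = {J_e:8.1f}")
# J_e decreases monotonically with c; the sharp knee near c = 3 is the kind
# of informal evidence for the "true" number of clusters discussed above.
```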
A more formal approach to this problem is to devise some measure of goodness of fit that expresses how well a given c-cluster description matches the data. The chi-square and Kolmogorov-Smirnov statistics are the traditional measures of goodness of fit, but the curse of dimensionality usually demands the use of simpler measures, such as a criterion function J(c). Since we expect a description in terms of c + 1 clusters to give a better fit than a description in terms of c clusters, we would like to know what constitutes a statistically significant improvement in J(c).

A formal way to proceed is to advance the null hypothesis that there are exactly c clusters present, and to compute the sampling distribution for J(c + 1) under this hypothesis. This distribution tells us what kind of apparent improvement to expect when a c-cluster description is actually correct. The decision procedure would be to accept the null hypothesis if the observed value of J(c + 1) falls within limits corresponding to an acceptable probability of false rejection.

Unfortunately, it is usually very difficult to do anything more than crudely estimate the sampling distribution of J(c + 1). The resulting solutions are not above suspicion, and the statistical problem of testing cluster validity is still essentially unsolved. However, under the assumption that a suspicious test is better than none, we include the following approximate analysis for the simple sum-of-squared-error criterion.

Suppose that we have a set 𝒳 of n samples and we want to decide whether or not there is any justification for assuming that they form more than one cluster. Let us advance the null hypothesis that all n samples come from a normal population with mean μ and covariance matrix σ²I. If this hypothesis were true, any clusters found would have to have been formed by chance, and any observed decrease in the sum of squared error obtained by clustering would have no significance.
The sum of squared error J_e(1) is a random variable, since it depends on the particular set of samples:

$$ J_e(1) = \sum_{\mathbf{x} \in \mathcal{X}} \lVert \mathbf{x} - \mathbf{m} \rVert^2, $$

where m is the mean of the n samples. Under the null hypothesis, the distribution for J_e(1) is approximately normal with mean ndσ² and variance 2ndσ⁴.

Suppose now that we partition the set of samples into two subsets 𝒳₁ and 𝒳₂ so as to minimize J_e(2), where

$$ J_e(2) = \sum_{i=1}^{2} \sum_{\mathbf{x} \in \mathcal{X}_i} \lVert \mathbf{x} - \mathbf{m}_i \rVert^2, $$

m_i being the mean of the samples in 𝒳_i. Under the null hypothesis, this partitioning is spurious, but it nevertheless results in a value for J_e(2) that is smaller than J_e(1). If we knew the sampling distribution for J_e(2), we could determine how small J_e(2) would have to be before we were forced to abandon a one-cluster null hypothesis. Lacking an analytical solution for the optimal partitioning, we cannot derive an exact solution for the sampling distribution. However, we can obtain a rough estimate by considering the suboptimal partition provided by a hyperplane through the sample mean. For large n, it can be shown that the sum of squared error for this partition is approximately normal with mean n(d − 2/π)σ² and variance 2n(d − 8/π²)σ⁴. This result agrees with our statement that J_e(2) is smaller than J_e(1), since the mean of J_e(2) for the suboptimal partition, n(d − 2/π)σ², is less than the mean for J_e(1), ndσ². To be considered significant, the reduction in the sum of squared error must certainly be greater than this. We can obtain an approximate critical value for J_e(2) by assuming that the suboptimal partition is nearly optimal, by using the normal approximation for the sampling distribution, and by estimating σ² by

$$ \hat{\sigma}^2 = \frac{1}{nd} \sum_{\mathbf{x} \in \mathcal{X}} \lVert \mathbf{x} - \mathbf{m} \rVert^2 = \frac{1}{nd}\, J_e(1). $$
The final result can be stated as follows: Reject the null hypothesis at the p-percent significance level if

$$ \frac{J_e(2)}{J_e(1)} < 1 - \frac{2}{\pi d} - \alpha \sqrt{\frac{2\bigl(1 - 8/(\pi^2 d)\bigr)}{nd}}, \tag{44} $$

where α is determined by

$$ p = 100 \int_{\alpha}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du. $$
Thus, this provides us with a test for deciding whether or not the splitting of a cluster is justified. Clearly, the c-cluster problem can be treated by applying the same test to all clusters found.
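To make the decision rule concrete, the following sketch (ours, not part of the original text) applies the test of Eq. (44): a crude 2-means split stands in for the optimal two-cluster partition, σ² is estimated from J_e(1), and the ratio J_e(2)/J_e(1) is compared with the critical value at a chosen significance level. The significance level, iteration count, and random seed are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

def one_cluster_test(X, p_percent=5.0):
    """Test the one-cluster null hypothesis via Eq. (44).

    Returns (ratio, critical_value, reject).
    """
    n, d = X.shape
    m = X.mean(axis=0)
    Je1 = np.sum((X - m) ** 2)                     # J_e(1)

    # Crude 2-means partition as a stand-in for the optimal J_e(2) partition.
    # (Using a suboptimal split only makes the test more conservative.)
    rng = np.random.default_rng(0)
    centers = X[rng.choice(n, size=2, replace=False)].copy()
    for _ in range(50):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for k in (0, 1):
            if np.any(labels == k):                # guard against an empty cluster
                centers[k] = X[labels == k].mean(axis=0)
    Je2 = sum(np.sum((X[labels == k] - centers[k]) ** 2) for k in (0, 1))

    # alpha satisfies p = 100 * integral from alpha to infinity of N(0,1) du.
    alpha = norm.ppf(1.0 - p_percent / 100.0)
    critical = 1.0 - 2.0 / (np.pi * d) - alpha * np.sqrt(
        2.0 * (1.0 - 8.0 / (np.pi ** 2 * d)) / (n * d))
    ratio = Je2 / Je1
    return ratio, critical, ratio < critical       # reject H0 if ratio < critical
```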

6.13 LOW-DIMENSIONAL REPRESENTATIONS AND MULTIDIMENSIONAL SCALING

Part of the problem of deciding whether or not a given clustering means anything stems from our inability to visualize the structure of multidimensional data. This problem is further aggravated when similarity or dissimilarity measures are used that lack the familiar properties of distance. One way to attack this problem is to try to represent the data points as points in some lower-dimensional space in such a way that the distances between points in the lower-dimensional space correspond to the dissimilarities between points in the original space. If acceptably accurate representations can be found in two or perhaps three dimensions, this can be an extremely valuable way to gain insight into the structure of the data. The general process of finding a configuration of points whose interpoint distances correspond to dissimilarities is often called multidimensional scaling.

Let us begin with the simpler case where it is meaningful to talk about the distances between the n samples x₁, ..., x_n. Let y_i be the lower-dimensional image of x_i, δ_ij be the distance between x_i and x_j, and d_ij be the distance between y_i and y_j. Then we are looking for a configuration of image points y₁, ..., y_n for which the n(n − 1)/2 distances d_ij between image points are as close as possible to the corresponding original distances δ_ij. Since it will usually not be possible to find a configuration for which d_ij = δ_ij for all i and j, we need some criterion for deciding whether or not one configuration is better than another. The following sum-of-squared-error functions are all reasonable candidates:
$$ J_{ee} = \frac{1}{\sum_{i<j} \delta_{ij}^2} \sum_{i<j} (d_{ij} - \delta_{ij})^2 \tag{45} $$

$$ J_{ff} = \sum_{i<j} \left( \frac{d_{ij} - \delta_{ij}}{\delta_{ij}} \right)^2 \tag{46} $$

$$ J_{ef} = \frac{1}{\sum_{i<j} \delta_{ij}} \sum_{i<j} \frac{(d_{ij} - \delta_{ij})^2}{\delta_{ij}} \tag{47} $$
Since these criterion functions involve only the distances between points, they are invariant to rigid-body motions of the configurations. Moreover,
they have all been normalized so that their minimum values are invariant to dilations of the sample points. J_ee emphasizes the largest errors, regardless of whether the distances δ_ij are large or small. J_ff emphasizes the largest fractional errors, regardless of whether the errors |d_ij − δ_ij| are large or small. J_ef is a useful compromise, emphasizing the largest product of error and fractional error.

Once a criterion function has been selected, an optimal configuration y₁, ..., y_n is defined as one that minimizes that criterion function. An optimal configuration can be sought by a standard gradient-descent procedure, starting with some initial configuration and changing the y_i's in the direction of greatest rate of decrease in the criterion function. Since
$$ d_{ij} = \lVert \mathbf{y}_i - \mathbf{y}_j \rVert, $$

the gradient of d_ij with respect to y_i is merely a unit vector in the direction of y_i − y_j. Thus, the gradients of the criterion functions are easy to compute:*

$$ \nabla_{\mathbf{y}_k} J_{ee} = \frac{2}{\sum_{i<j} \delta_{ij}^2} \sum_{j \ne k} (d_{kj} - \delta_{kj})\, \frac{\mathbf{y}_k - \mathbf{y}_j}{d_{kj}} $$

$$ \nabla_{\mathbf{y}_k} J_{ff} = 2 \sum_{j \ne k} \frac{d_{kj} - \delta_{kj}}{\delta_{kj}^2}\, \frac{\mathbf{y}_k - \mathbf{y}_j}{d_{kj}} $$

$$ \nabla_{\mathbf{y}_k} J_{ef} = \frac{2}{\sum_{i<j} \delta_{ij}} \sum_{j \ne k} \frac{d_{kj} - \delta_{kj}}{\delta_{kj}}\, \frac{\mathbf{y}_k - \mathbf{y}_j}{d_{kj}}. $$
The starting configuration can be chosen randomly, or in any convenient way that spreads the image points about. If the image points lie in a d̂-dimensional space, then a simple and effective starting configuration can be found by selecting those d̂ coordinates of the samples that have the largest variance.
The following example illustrates the kind of results that can be obtained by these techniques.† The data consist of thirty points spaced at unit intervals along a three-dimensional helix:

$$ x_1(k) = \cos x_3, \qquad x_2(k) = \sin x_3, \qquad x_3(k) = k/\sqrt{2}, \qquad k = 0, 1, \ldots, 29. $$
* Second partial derivatives can also be computed easily, so that Newton's algorithm can be used. Note that if y_i = y_j, the unit vector from y_i to y_j is undefined. Should that situation arise, (y_i − y_j)/d_ij can be replaced by an arbitrary unit vector.
† This example was taken from J. W. Sammon, Jr., "A nonlinear mapping for data structure analysis," IEEE Trans. Comp., C-18, 401-409 (May 1969).
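The following sketch, which is our illustration rather than part of the original text, reproduces the flavor of Sammon's experiment: it generates the thirty helix samples, starts the two-dimensional image from the two highest-variance coordinates, and performs plain gradient descent on the J_ef criterion of Eq. (47). The step size and iteration count are arbitrary and may need tuning; the constant in front of the gradient is folded into the step size since it only rescales the descent direction.

```python
import numpy as np

# Thirty samples on the helix: x1 = cos x3, x2 = sin x3, x3 = k / sqrt(2).
x3 = np.arange(30) / np.sqrt(2.0)
X = np.column_stack([np.cos(x3), np.sin(x3), x3])

def pairwise_dist(P):
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

delta = pairwise_dist(X)                     # original distances delta_ij
iu = np.triu_indices(len(X), k=1)            # index pairs with i < j

# Starting configuration: the two coordinates with the largest variance.
Y = X[:, np.argsort(X.var(axis=0))[-2:]].copy()

eta = 0.05                                   # step size (arbitrary choice)
for _ in range(300):
    d = pairwise_dist(Y)
    np.fill_diagonal(d, 1.0)                 # avoid 0/0; diagonal terms are zeroed below
    # The gradient of J_ef at y_k is proportional to
    # sum over j != k of [(d_kj - delta_kj)/delta_kj] (y_k - y_j)/d_kj;
    # the constant 2 / sum(delta_ij) merely rescales it, so it is folded into eta.
    w = (d - delta) / (np.where(delta > 0.0, delta, 1.0) * d)
    np.fill_diagonal(w, 0.0)
    grad = w.sum(axis=1)[:, None] * Y - w @ Y
    Y -= eta * grad

d = pairwise_dist(Y)
J_ef = np.sum((d[iu] - delta[iu]) ** 2 / delta[iu]) / delta[iu].sum()
print(f"J_ef after descent: {J_ef:.4f}")     # Y holds the 2-D image points
```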
Figure 6.21(a) shows a perspective representation of the three-dimensional data. When the J_ef criterion was used, twenty iterations of a gradient descent procedure produced the two-dimensional configuration shown in Figure 6.21(b). Of course, translations, rotations, and reflections of this configuration would be equally good solutions.

FIGURE 6.21. A two-dimensional representation of data points in three dimensions: (a) helix, (b) image points. (Adapted from J. W. Sammon, 1969.)

In nonmetric multidimensional scaling problems, the quantities δ_ij are dissimilarities whose numerical values are not as important as their rank order. An ideal configuration would be one for which the rank order of the distances d_ij is the same as the rank order of the dissimilarities δ_ij. Let us order the m = n(n − 1)/2 dissimilarities so that δ_{i_1 j_1} ≤ δ_{i_2 j_2} ≤ ⋯ ≤ δ_{i_m j_m}, and let d̂_ij be any m numbers satisfying the monotonicity constraint

$$ \hat{d}_{i_1 j_1} \le \hat{d}_{i_2 j_2} \le \cdots \le \hat{d}_{i_m j_m}. $$

In general, the distances d_ij will not satisfy this constraint, and the numbers d̂_ij will not be distances. However, the degree to which the d_ij satisfy this constraint is measured by

$$ J_{mon} = \min_{\hat{d}_{ij}} \sum_{i<j} (d_{ij} - \hat{d}_{ij})^2, $$

where it is always to be understood that the d̂_ij must satisfy the monotonicity constraint.
`points y1 , .
`Jmon can not
`, Yn represents the original data. Unfortunately,
`..
`be used to define an optimal configuration because it can be made to vanish
`by collapsing
`the configuration
`to a single point. However,
`this defect is
`easily removed by a normalization such as the following:
`
`Jmon
`
`lrnon = I a:, .
`
`(48)
`
`i<J
`Thus, J mon is invariant
`rotations, and dilations of the
`to translations,
`configuration, and an optimal configuration can be defined as one that
`minimizes ·this criterion function. It has been observed experimentally
`that
`when the number of points is larger than dimensionality of the image space,
`the monotonicity constraint
`is actually quite confining. This might be ex(cid:173)
`pected from the fact that the numbe r of constraints grows as the square of
`the number of points , and it is the basis for the frequently encountered
`statement that this procedure allows the recovery of metric information from
`nonmetric data. The quality of the representation generally improves as the
`dimensionality of the image space is increased, and it may be necessary to go
`beyond three dimensions to obtain an acceptably small value of lrn on· How(cid:173)
`ever, this may be a small price to pay to allow the use of the many clustering
`procedures available for data points in metric spaces.
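To illustrate how Ĵ_mon of Eq. (48) can be evaluated for one fixed configuration, the sketch below (our addition, not from the original text) finds the minimizing d̂_ij with a pool-adjacent-violators pass, that is, an isotonic regression of the distances d_ij onto the rank order of the dissimilarities δ_ij.

```python
import numpy as np

def normalized_J_mon(delta, d):
    """Normalized monotonicity criterion J_mon / sum_{i<j} d_ij**2 of Eq. (48).

    delta, d : symmetric (n, n) arrays holding the dissimilarities delta_ij
               and the image-space distances d_ij of a fixed configuration.
    """
    iu = np.triu_indices(delta.shape[0], k=1)
    order = np.argsort(delta[iu], kind="stable")   # rank order of the dissimilarities
    y = d[iu][order].astype(float)                 # distances listed in that order

    # Pool-adjacent-violators: the monotone nondecreasing d_hat minimizing
    # sum (d - d_hat)**2 subject to the monotonicity constraint.
    block_mean, block_size = [], []
    for v in y:
        block_mean.append(v)
        block_size.append(1)
        while len(block_mean) > 1 and block_mean[-2] > block_mean[-1]:
            m2, s2 = block_mean.pop(), block_size.pop()
            m1, s1 = block_mean.pop(), block_size.pop()
            block_mean.append((s1 * m1 + s2 * m2) / (s1 + s2))
            block_size.append(s1 + s2)
    d_hat = np.repeat(block_mean, block_size)

    J_mon = np.sum((y - d_hat) ** 2)
    return J_mon / np.sum(d[iu] ** 2)
```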

6.14 CLUSTERING AND DIMENSIONALITY REDUCTION

Because the curse of dimensionality plagues so many pattern recognition procedures, a variety of methods for dimensionality reduction have been proposed. Unlike the procedures that we have just examined, most of these methods provide a functional mapping, so that one can determine the image of an arbitrary feature vector. The classical procedures of statistics are principal components analysis and factor analysis, both of which reduce dimensionality by forming linear combinations of the features.* If we think of the problem as one of removing or combining (i.e., grouping) highly correlated features, then it becomes clear that the techniques of clustering are applicable to this problem.
* The object of principal components analysis (known in the communication theory literature as the Karhunen-Loève expansion) is to find a lower-dimensional representation that accounts for the variance of the features. The object of factor analysis is to find a lower-dimensional representation that accounts for the correlations among the features. For more information, see M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, Vol. 3, Chapter 43 (Hafner, New York, 1966) or H. H. Harman, Modern Factor Analysis (University of Chicago Press, Chicago and London, Second Edition, 1967).
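For concreteness, here is a minimal sketch of principal components analysis as characterized in the footnote (our addition): the data are projected onto the eigenvectors of the sample covariance matrix having the largest eigenvalues, so that the retained linear combinations account for as much of the feature variance as possible. The function name and the choice of two retained components are arbitrary.

```python
import numpy as np

def principal_components(X, d_prime=2):
    """Project the n-by-d data matrix X onto its d_prime principal components."""
    Xc = X - X.mean(axis=0)                 # center the features
    cov = Xc.T @ Xc / (len(X) - 1)          # sample covariance matrix
    evals, evecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    top = evecs[:, np.argsort(evals)[::-1][:d_prime]]
    return Xc @ top                         # n-by-d_prime representation
```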
In terms of the data matrix, whose n rows are the d-dimensional samples, ordinary clustering can be thought of as a grouping of the rows, with a smaller number of cluster centers being used to represent the data, whereas dimensionality reduction can be thought of as a grouping of the columns, with combined features being used to represent the data.
Let us consider a simple modification of hierarchical clustering to reduce dimensionality. In place of an n-by-n matrix of distances between samples, we consider a d-by-d correlation matrix R = [ρ_ij], where the correlation coefficient ρ_ij is related to the covariances (or sample covariances) by

$$ \rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\,\sigma_{jj}}}. $$

Since 0 ≤ ρ_ij² ≤ 1, with ρ_ij² = 0 for uncorrelated features and ρ_ij² = 1 for completely correlated features, ρ_ij² plays the role of a similarity function for features. Two features for which ρ_ij² is large are clearly good candidates to be merged into one feature, thereby reducing the dimensionality by one. Repetition of this process leads to the following hierarchical procedure:
Procedure: Hierarchical Dimensionality Reduction
1. Let d̂ = d and 𝓕_i = {x_i}, i = 1, ..., d.
Loop: 2. If d̂ = d′, stop.
3. Compute the correlation matrix and find the most correlated pair of distinct clusters of features, say 𝓕_i and 𝓕_j.
4. Merge 𝓕_i and 𝓕_j, delete 𝓕_j, and decrement d̂ by one.
5. Go to Loop.
Probably the simplest way to merge two groups of features is just to average them. (This tacitly assumes that the features have been scaled so that their numerical ranges are comparable.) With this definition of a new feature, there is no problem in defining the correlation matrix for groups of features. It is not hard to think of variations on this general theme, but we shall not pursue this topic further.
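The procedure lends itself to a direct implementation; the sketch below is our reading of it (merging by averaging, as just suggested), with the data matrix X and the target dimensionality d_prime as placeholders.

```python
import numpy as np

def hierarchical_dim_reduction(X, d_prime):
    """Merge the most correlated pair of feature groups until d_prime remain.

    X : n-by-d data matrix; the features are assumed to be comparably scaled.
    Returns the reduced n-by-d_prime matrix and the groups of original
    feature indices that were merged together.
    """
    groups = [[i] for i in range(X.shape[1])]   # F_i = {x_i}, i = 1, ..., d
    F = X.astype(float).copy()                  # one column per current feature group
    while F.shape[1] > d_prime:
        R = np.corrcoef(F, rowvar=False)        # correlation matrix of the groups
        np.fill_diagonal(R, 0.0)
        # Most correlated pair of distinct groups, judged by rho_ij squared.
        i, j = np.unravel_index(np.argmax(R ** 2), R.shape)
        i, j = min(i, j), max(i, j)
        F[:, i] = (F[:, i] + F[:, j]) / 2.0     # merge F_i and F_j by averaging
        F = np.delete(F, j, axis=1)             # delete F_j (decrement d_hat by one)
        groups[i].extend(groups.pop(j))
    return F, groups
```

For example, `hierarchical_dim_reduction(X, d_prime=4)` reduces a comparably scaled data matrix to four merged features and reports which original features were grouped together.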
For the purposes of pattern classification, the most serious criticism of all of the approaches to dimensionality reduction that we have mentioned is that they are overly concerned with faithful representation of the data. Greatest emphasis is usually placed on those features or groups of features that have the greatest variability. But for classification, we are interested in discrimination, not representation. Roughly speaking, the most interesting features are the ones for which the difference in the class means is large relative to the standard deviations, not the ones for which the standard deviations are large. In short, we are interested in something more like the method of multiple discriminant analysis described in Chapter 4.
There is a growing body of theory on methods of dimensionality reduction for pattern classification. Some of these methods seek to form new features out of linear combinations of old ones. Others seek merely a smaller subset of the original features. A major problem confronting this theory is that the division of pattern recognition into feature extraction followed by classification is theoretically artificial. A completely optimal feature extractor can never be anything but an optimal classifier. It is only when constraints are placed on the classifier or limitations are placed on the size of the set of samples that one can formulate nontrivial (and very complicated) problems. Various ways of circumventing this problem that may be useful under the proper circumstances can be found in the literature, and we have included a few entry points to this literature. When it is possible to exploit knowledge of the problem domain to obtain more informative features, that is usually a more profitable course of action. In the second half of this book we shall devote ourselves to a systematic examination of ways of extracting features from visual data, and to the larger problem of visual scene analysis.

6.15 BIBLIOGRAPHICAL AND HISTORICAL REMARKS

The literature on unsupervised learning and clustering is so large and is scattered across so many disciplines that the following references must be viewed as little more than a selective random sampling. Fortunately, several of the references we cite contain extensive bibliographies, relieving us of many scholarly burdens. Historically, the literature dates back at least to 1894, when Karl Pearson used sample moments to determine the parameters in a mixture of two univariate normal densities. Assuming exact knowledge of the values of the mixture density, Doetsch (1936) used Fourier transforms to decompose univariate normal mixtures. Medgyessy (1961) extended this approach to other classes of mixtures, in the process exposing the problem of identifiability. Teicher (1961, 1963) and later Yakowitz and Spragins (1968) demonstrated the identifiability of several families of mixtures, the latter authors showing the equivalence of identifiability and linear independence of the component densities.
The phrases "unsupervised learning" or "learning without a teacher" usually refer to estimation of parameters of the component densities from samples drawn from the mixture density. Spragins (1966) and Cooper (1969) give valuable surveys of this work, and its relation to compound sequential Bayes learning is clarified by Cover (1969). Some of this work is quite general, being primarily concerned with theoretical possibilities. Thus, Stanat (1968) shows how Doetsch's method can be applied to learn multivariate normal and multivariate Bernoulli mixtures, and Yakowitz (1970) demonstrates the possibility of learning virtually any identifiable mixture.
Surprisingly few papers treat maximum-likelihood estimates. Hasselblad (1966) derived maximum-likelihood formulas for estimating the parameters of univariate normal mixtures. Day (1969) derived the formulas for the multivariate, equal covariance matrix case, and pointed out the existence of singular solutions with general normal mixtures. Our treatment of the multivariate case is based directly on the exceptionally clear paper by Wolfe (1970), who also derived formulas for multivariate Bernoulli mixtures. The formulation of the Bayesian approach to unsupervised learning is usually attributed to Daly (1962); more general formulations have since been given by several authors (cf. Hilborn and Lainiotis 1968). Daly pointed out the exponential growth of the optimum system and the need for approximate solutions. Spragins' survey provides references to the literature on decision-directed approximations prior to 1966, with subsequent work being referenced by Patrick, Costello, and Monds (1970). Approximate solutions have also been obtained by the use of histograms (Patrick and Hancock 1966), quantized parameters (Fralick 1967), and randomized decisions (Agrawala 1970).
We have not mentioned all the ways that one might use to estimate unknown parameters. In particular, we have neglected the time-honored and robust method of sample moments, primarily because the situation becomes very complicated when there are more than two components in the mixture. However, some interesting solutions for special cases have been derived by David and Paul Cooper (1964) and elaborated further by Paul Cooper (1967). Because of its slow convergence, we have also omitted mention of the use of stochastic approximation; for the interested reader, the article by Young and Coraluppi (1970) can be recommended.
Much of the early work in clustering was done in the biological sciences, where it appears in studies of numerical taxonomy. Here the major concern is with hierarchical clustering. The influential book by Sokal and Sneath (1963) is an excellent source of references to this literature. Psychologists and sociologists have also contributed to clustering, although they are usually more concerned with clustering features than with clustering samples (Tryon 1939; Tryon and Bailey 1970). The advent of the digital computer made cluster analysis practical, and caused the literature on clustering to spread over many disciplines. The well known survey by Ball (1965) gives a comprehensive overview of this work and is highly recommended; Ball's insights have had a major influence on our treatment of the subject. We have also benefited from the dissertation by Ling (1971), which includes a list of 140 references. The surveys by Bolshev (1969) and Dorofeyuk (1971) give extensive references to the Russian literature on clustering.