Sokal and Sneath (1963) and Ball (1965) list many of the similarity measures and criterion functions that have seen use. The matters of measurement scales, invariance criteria, and appropriate statistical operations are illuminated by Stevens (1968), and related fundamental philosophical issues concerning clustering are treated by Watanabe (1969). The critique of clustering given by Fleiss and Zubin (1969) points out the unhappy consequences of being careless about such matters.

Jones (1968) credits Thorndike (1953) with being the first to use the sum-of-squared-error criterion, which appears so frequently in the literature. The invariant criteria we presented were derived from Friedman and Rubin (1967), who pointed out that these criteria are related to Hotelling's Trace Criterion and the F-ratio of classical statistics. The observation that all these criteria give the same optimal partitions in the two-cluster case is due to Fukunaga and Koontz (1970). Of the various criteria we did not mention, the "cohesion" criterion of Watanabe (1969, Chapter 8) is of particular interest since it involves more than pairwise similarity.
In the text we outlined the basic steps in a number of standard optimization and clustering programs. These descriptions were intentionally simplified, and even the more complete descriptions found in the literature do not always mention such matters as how ties are broken or how "wild shots" are rejected. The Isodata algorithm of Ball and Hall (1967) differs from our simplified description in several ways, most notably in the splitting of clusters that have too much within-cluster variability, and the merging of clusters that have too little between-cluster variability. Our description of the basic minimum-squared-error procedure is derived from an unpublished computer program developed by R. C. Singleton and W. H. Kautz at Stanford Research Institute in 1965. This procedure is also closely related to the adaptive sequential procedure of Sebestyen (1962), and to the so-called k-means procedure, whose convergence properties were studied by MacQueen (1967). Interesting applications of such procedures to character recognition are described by Andrews, Atrubin, and Hu (1968) and by Casey and Nagy (1968).
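As a concrete illustration of the k-means procedure mentioned above, here is a minimal sketch in Python. It is not from the original text; the function name, the random initialization, and the sample data are arbitrary assumptions made for the example.

```python
import numpy as np

def k_means(x, c, n_iter=100):
    """Minimal k-means sketch: alternately assign each sample to its
    nearest mean and recompute each mean, until the means stop moving.
    x is an (n, d) array of samples; c is the number of clusters."""
    means = x[np.random.choice(len(x), c, replace=False)]  # arbitrary init
    for _ in range(n_iter):
        # Assign each sample to the nearest mean (squared Euclidean distance).
        dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each mean; keep the old mean if a cluster goes empty.
        new_means = np.array([x[labels == i].mean(axis=0)
                              if np.any(labels == i) else means[i]
                              for i in range(c)])
        if np.allclose(new_means, means):  # converged
            break
        means = new_means
    return means, labels

# Example with arbitrary two-cluster data:
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
means, labels = k_means(x, c=2)
```

The splitting and merging steps that distinguish Isodata from this basic loop would be added between iterations, using within-cluster and between-cluster variability tests.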
Sokal and Sneath (1963) reference much of the early work on hierarchical clustering, and Wishart (1969) gives explicit references to the original sources for the single-linkage, nearest-neighbor, complete-linkage, furthest-neighbor, minimum-squared-error, and several other procedures. Lance and Williams (1967) show how most of these procedures can be obtained by specializing a general distance function in different ways; in addition, they reference the major papers on divisive hierarchical clustering. The relation between single-linkage procedures and minimal spanning trees was shown by Gower and Ross (1969), who recommended a simple, efficient algorithm for finding minimal spanning trees given by Prim (1957). The equivalence between hierarchical clustering and a distance function satisfying the ultrametric inequality was shown by Johnson (1967).
The great majority of papers on clustering have either explicitly or implicitly accepted some form of minimum-variance criterion. Wishart (1969) pointed out the serious limitations inherent in this approach, and as an alternative suggested a procedure resembling $k_n$-nearest-neighbor estimation of modes of the mixture density. Critiques of minimum-variance methods have also been given by Ling (1971) and Zahn (1971), both of whom favored graph-theoretic approaches to clustering. Zahn's work, though intended for data of any dimensionality, was motivated by a desire to find mathematical procedures that group sets of points in two dimensions in a way that seems visually natural. (Haralick and Kelly (1969) and Haralick and Dinstein (1971) also treat certain picture processing operations as clustering procedures, a viewpoint that applies to many of the procedures described in Part II of this book.)
Most of the early work on graph-theoretic methods was done for information retrieval purposes. Augustson and Minker (1970) credit Kuhns (1959) with the first application of graph theory to clustering. They give an experimental comparison of several graph-theoretic techniques intended for information retrieval applications, and give many references to work in this domain. It is interesting that among papers with a graph-theoretic orientation we find three that are concerned with statistical tests for cluster validity, viz., those by Bonner (1964), Hartigan (1967), and Ling (1971). Hall, Tepping, and Ball (1971) computed how the sum of squared error varies with the dimensionality of the data and the assumed number of clusters for both uniform and simplex data, and suggested these distributions as useful standards for comparison. Wolfe (1970) suggests a test for cluster validity based on an assumed chi-square distribution for the log-likelihood function.
Green and Carmone (1970), whose valuable monograph on multidimensional scaling contains an extensive bibliography, trace the origins of multidimensional scaling to a paper by Richardson (1938). Recent interest in the topic was stimulated by two developments, nonmetric multidimensional scaling and computer graphics applications. The nonmetric approach originated by Shepard (1962) and extended by Kruskal (1964a) is well suited to many problems in psychology and sociology. The computational aspects of minimizing the criterion $J_{mon}$ subject to a monotonicity constraint are described in detail by Kruskal (1964b). Calvert (1968) used a variation of Shepard's criterion to provide a two-dimensional computer display of multivariate data. The computationally simpler $J_{ef}$ criterion was proposed and used by Sammon (1969) to display data for interactive analysis.
The interest in man-machine systems stems partly from the difficulty of specifying criterion functions and clustering procedures that do what we really want them to do. Mattson and Dammann (1965) were one of the first to suggest a man-machine solution to this problem. The great potential of interactive systems is well described by Ball and Hall (1970) in a paper on their PROMENADE system. Other well-known systems include BC TRY (Tryon and Bailey 1966; 1970), SARF (Stanley, Lendaris, and Nienow 1967), INTERSPACE (Patrick 1969), and OLPARS (Sammon 1970).
Neither automatic nor man-machine systems for pattern recognition can escape the fundamental problems of high-dimensional data. Various procedures have been proposed for reducing the dimensionality, either by selecting the best subset of the available features or by combining the features, usually in a linear fashion. To avoid enormous computational problems, most of these procedures use some criterion other than probability of error in making the selection. For example, Miller (1962) used a $\operatorname{tr} S_W^{-1}S_B$ criterion, Lewis (1962) used an entropy criterion, and Marill and Green (1963) used a divergence criterion. In some cases one can bound the probability of error by more easily computed criterion functions, but the final test is always one of actual performance. In the text we restricted our attention to a simple procedure due to King (1967), selecting it primarily because of its close relation to clustering. An excellent presentation of mathematical methods for dimensionality reduction is given by Meisel (1972).
REFERENCES

1. Agrawala, A. K., "Learning with a probabilistic teacher," IEEE Trans. Info. Theory, IT-16, 373-379 (July 1970).
2. Andrews, D. R., A. J. Atrubin, and K. C. Hu, "The IBM 1975 Optical Page Reader. Part 3: Recognition logic development," IBM Journal, 12, 334-371 (September 1968).
3. Augustson, J. G. and J. Minker, "An analysis of some graph theoretical cluster techniques," J. ACM, 17, 571-588 (October 1970).
4. Ball, G. H., "Data analysis in the social sciences: what about the details?," Proc. FJCC, pp. 533-560 (Spartan Books, Washington, D.C., 1965).
5. Ball, G. H. and D. J. Hall, "A clustering technique for summarizing multivariate data," Behavioral Science, 12, 153-155 (March 1967).
6. Ball, G. H. and D. J. Hall, "Some implications of interactive graphic computer systems for data analysis and statistics," Technometrics, 12, 17-31 (February 1970).
7. Bolshev, L. N., "Cluster analysis," Bulletin, International Statistical Institute, 43, 411-425 (1969).
8. Bonner, R. E., "On some clustering techniques," IBM Journal, 8, 22-32 (January 1964).
9. Calvert, T. W., "Projections of multidimensional data for use in man computer graphics," Proc. FJCC, pp. 227-231 (Thompson Book Co., Washington, D.C., 1968).
10. Casey, R. G. and G. Nagy, "An autonomous reading machine," IEEE Trans. Comp., C-17, 492-503 (May 1968).
11. Cooper, D. B. and P. W. Cooper, "Nonsupervised adaptive signal detection and pattern recognition," Information and Control, 7, 416-444 (September 1964).
12. Cooper, P. W., "Some topics on nonsupervised adaptive detection for multivariate normal distributions," in Computer and Information Sciences-II, pp. 123-146, J. T. Tou, ed. (Academic Press, New York, 1967).
13. Cooper, P. W., "Nonsupervised learning in statistical pattern recognition," in Methodologies of Pattern Recognition, pp. 97-109, S. Watanabe, ed. (Academic Press, New York, 1969).
14. Cover, T. M., "Learning in pattern recognition," in Methodologies of Pattern Recognition, pp. 111-132, S. Watanabe, ed. (Academic Press, New York, 1969).
15. Daly, R. F., "The adaptive binary-detection problem on the real line," Technical Report 2003-3, Stanford University, Stanford, Calif. (February 1962).
16. Day, N. E., "Estimating the components of a mixture of normal distributions," Biometrika, 56, 463-474 (December 1969).
17. Doetsch, G., "Zerlegung einer Funktion in Gaußsche Fehlerkurven und zeitliche Zurückverfolgung eines Temperaturzustandes," Mathematische Zeitschrift, 41, 283-318 (1936).
18. Dorofeyuk, A. A., "Automatic classification algorithms (review)," Automation and Remote Control, 32, 1928-1958 (December 1971).
19. Fleiss, J. L. and J. Zubin, "On the methods and theory of clustering," Multivariate Behavioral Research, 4, 235-250 (April 1969).
20. Fralick, S. C., "Learning to recognize patterns without a teacher," IEEE Trans. Info. Theory, IT-13, 57-64 (January 1967).
21. Friedman, H. P. and J. Rubin, "On some invariant criteria for grouping data," J. American Statistical Assn., 62, 1159-1178 (December 1967).
22. Fukunaga, K. and W. L. G. Koontz, "A criterion and an algorithm for grouping data," IEEE Trans. Comp., C-19, 917-923 (October 1970).
23. Gower, J. C. and G. J. S. Ross, "Minimum spanning trees and single linkage cluster analysis," Appl. Statistics, 18, No. 1, 54-64 (1969).
24. Green, P. E. and F. J. Carmone, Multidimensional Scaling and Related Techniques in Marketing Analysis (Allyn and Bacon, Boston, Mass., 1970).
25. Hall, D. J., B. Tepping, and G. H. Ball, "Theoretical and experimental clustering characteristics for multivariate random and structured data," in "Applications of cluster analysis to Bureau of the Census data," Final Report, Contract Cco-9312, SRI Project 7600, Stanford Research Institute, Menlo Park, Calif. (1970).
26. Haralick, R. M. and G. L. Kelly, "Pattern recognition with measurement space and spatial clustering for multiple images," Proc. IEEE, 57, 654-665 (April 1969).
27. Haralick, R. M. and I. Dinstein, "An iterative clustering procedure," IEEE Transactions on Sys., Man, and Cyb., SMC-1, 275-289 (July 1971).
28. Hartigan, J. A., "Representation of similarity matrices by trees," J. American Statistical Assn., 62, 1140-1158 (December 1967).
29. Hasselblad, V., "Estimation of parameters for a mixture of normal distributions," Technometrics, 8, 431-444 (August 1966).
30. Hillborn, C. G., Jr. and D. G. Lainiotis, "Optimal unsupervised learning multicategory dependent hypotheses pattern recognition," IEEE Trans. Info. Theory, IT-14, 468-470 (May 1968).
31. Johnson, S. C., "Hierarchical clustering schemes," Psychometrika, 32, 241-254 (September 1967).
32. Jones, K. L., "Problems of grouping individuals and the method of modality," Behavioral Science, 13, 496-511 (November 1968).
33. King, B. P., "Stepwise clustering procedures," J. American Statistical Assn., 62, 86-101 (March 1967).
34. Kruskal, J. B., "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis," Psychometrika, 29, 1-27 (March 1964a).
35. Kruskal, J. B., "Nonmetric multidimensional scaling: a numerical method," Psychometrika, 29, 115-129 (June 1964b).
36. Kuhns, J. L., "Mathematical analysis of correlation clusters," in "Word correlation and automatic indexing," Progress Report No. 2, C 82-OUI, Ramo-Wooldridge Corporation, Canoga Park, Calif. (December 1959).
37. Lance, G. N. and W. T. Williams, "A general theory of classificatory sorting strategies. 1. Hierarchical systems," Computer Journal, 9, 373-380 (February 1967).
38. Lewis, P. M., "The characteristic selection problem in recognition systems," IRE Trans. Info. Theory, IT-8, 171-178 (February 1962).
39. Ling, R. F., "Cluster Analysis," Technical Report No. 18, Department of Statistics, Yale University, New Haven, Conn. (January 1971).
40. MacQueen, J., "Some methods for classification and analysis of multivariate observations," in Proc. Fifth Berkeley Symposium on Math. Stat. and Prob., I, 281-297, L. M. LeCam and J. Neyman, eds. (University of California Press, Berkeley and Los Angeles, Calif., 1967).
41. Marill, T. and D. M. Green, "On the effectiveness of receptors in recognition systems," IEEE Trans. Info. Theory, IT-9, 11-17 (January 1963).
42. Mattson, R. L. and J. E. Dammann, "A technique for detecting and coding subclasses in pattern recognition problems," IBM Journal, 9, 294-302 (July 1965).
43. Medgyessy, P., Decomposition of Superpositions of Distribution Functions (Plenum Press, New York, 1961).
44. Meisel, W. S., Computer-Oriented Approaches to Pattern Recognition (Academic Press, New York and London, 1972).
45. Miller, R. G., "Statistical prediction by discriminant analysis," Meteorological Monographs, 4, 25 (October 1962).
46. Patrick, E. A. and J. C. Hancock, "Nonsupervised sequential classification and recognition of patterns," IEEE Trans. Info. Theory, IT-12, 362-372 (July 1966).
47. Patrick, E. A., "(Interspace) Interactive system for pattern analysis, classification, and enhancement," paper presented at the Computers and Communications Conference, Rome, N.Y. (September 1969).
48. Patrick, E. A., J. P. Costello, and F. C. Monds, "Decision directed estimation of a two class decision boundary," IEEE Trans. Comp., C-19, 197-205 (March 1970).
49. Pearson, K., "Contributions to the mathematical theory of evolution," Philosophical Transactions of the Royal Society of London, 185, 71-110 (1894).
50. Prim, R. C., "Shortest connection networks and some generalizations," Bell System Technical Journal, 36, 1389-1401 (November 1957).
51. Richardson, M. W., "Multidimensional psychophysics," Psychological Bulletin, 35, 659-660 (1938).
52. Sammon, J. W., Jr., "A nonlinear mapping for data structure analysis," IEEE Trans. Comp., C-18, 401-409 (May 1969).
53. Sammon, J. W., Jr., "Interactive pattern analysis and classification," IEEE Trans. Comp., C-19, 594-616 (July 1970).
54. Sebestyen, G. S., "Pattern recognition by an adaptive process of sample set construction," IRE Trans. Info. Theory, IT-8, S82-S91 (September 1962).
55. Shepard, R. N., "The analysis of proximities: multidimensional scaling with an unknown distance function," Psychometrika, 27, 125-139, 219-246 (1962).
56. Sokal, R. R. and P. H. A. Sneath, Principles of Numerical Taxonomy (W. H. Freeman, San Francisco, Calif., 1963).
57. Spragins, J., "Learning without a teacher," IEEE Trans. Info. Theory, IT-12, 223-230 (April 1966).
58. Stanat, D. F., "Unsupervised learning of mixtures of probability functions," in Pattern Recognition, pp. 357-389, L. Kanal, ed. (Thompson Book Co., Washington, D.C., 1968).
59. Stanley, G. L., G. G. Lendaris, and W. C. Nienow, "Pattern Recognition Program," Technical Report 567-16, AC Electronics Defense Research Laboratories, Santa Barbara, Calif. (1967).
60. Stevens, S. S., "Measurement, statistics, and the schemapiric view," Science, 161, 849-856 (30 August 1968).
61. Teicher, H., "Identifiability of mixtures," Ann. Math. Stat., 32, 244-248 (March 1961).
62. Teicher, H., "Identifiability of finite mixtures," Ann. Math. Stat., 34, 1265-1269 (December 1963).
63. Thorndike, R. L., "Who belongs in the family?," Psychometrika, 18, 267-276 (1953).
64. Tryon, R. C., Cluster Analysis (Edwards Brothers, Ann Arbor, Mich., 1939).
65. Tryon, R. C. and D. E. Bailey, "The BC TRY computer system of cluster and factor analysis," Multivariate Behavioral Research, 1, 95-111 (January 1966).
66. Tryon, R. C. and D. E. Bailey, Cluster Analysis (McGraw-Hill, New York, 1970).
67. Watanabe, M. S., Knowing and Guessing (John Wiley, New York, 1969).
68. Wishart, D., "Mode analysis: a generalization of nearest neighbor which reduces chaining effects," in Numerical Taxonomy, pp. 282-308, A. J. Cole, ed. (Academic Press, London and New York, 1969).
69. Wolfe, J. H., "Pattern clustering by multivariate mixture analysis," Multivariate Behavioral Research, 5, 329-350 (July 1970).
70. Yakowitz, S. J. and J. D. Spragins, "On the identifiability of finite mixtures," Ann. Math. Stat., 39, 209-214 (February 1968).
71. Yakowitz, S. J., "Unsupervised learning and the identification of finite mixtures," IEEE Trans. Info. Theory, IT-16, 330-338 (May 1970).
72. Young, T. Y. and G. Coraluppi, "Stochastic estimation of a mixture of normal density functions using an information criterion," IEEE Trans. Info. Theory, IT-16, 258-263 (May 1970).
73. Zahn, C. T., "Graph-theoretical methods for detecting and describing gestalt clusters," IEEE Trans. Comp., C-20, 68-86 (January 1971).
PROBLEMS
1. Suppose that $x$ can assume the values $0, 1, \ldots, m$ and that $P(x \mid \boldsymbol\theta)$ is a mixture of $c$ binomial distributions,
$$P(x \mid \boldsymbol\theta) = \sum_{i=1}^{c} \binom{m}{x}\, \theta_i^x (1 - \theta_i)^{m-x}\, P(\omega_i).$$
Assuming that the a priori probabilities are known, explain why this mixture is not identifiable if $m < c$. How does this answer change if the a priori probabilities are also unknown?
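To make the identifiability issue concrete, here is a small worked case (an illustration, not part of the original problem): take $m = 1$ and $c = 2$ with known priors. Then
$$P(x \mid \boldsymbol\theta) = \theta_1^x(1-\theta_1)^{1-x}P(\omega_1) + \theta_2^x(1-\theta_2)^{1-x}P(\omega_2), \qquad x \in \{0, 1\},$$
so the distribution is determined entirely by the single number
$$P(x = 1 \mid \boldsymbol\theta) = \theta_1 P(\omega_1) + \theta_2 P(\omega_2).$$
Any pair $(\theta_1, \theta_2)$ giving the same weighted average yields the same mixture, so $\boldsymbol\theta$ cannot be recovered: one equation cannot determine two unknowns. In general the $m + 1$ cell probabilities supply only $m$ independent constraints, fewer than the $c$ unknown parameters whenever $m < c$.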
2. Let $\mathbf{x}$ be a binary vector and $P(\mathbf{x} \mid \boldsymbol\theta)$ be a mixture of $c$ multivariate Bernoulli distributions,
$$P(\mathbf{x} \mid \boldsymbol\theta) = \sum_{i=1}^{c} P(\mathbf{x} \mid \omega_i, \boldsymbol\theta_i) P(\omega_i),$$
where
$$P(\mathbf{x} \mid \omega_i, \boldsymbol\theta_i) = \prod_{j=1}^{d} \theta_{ij}^{x_j} (1 - \theta_{ij})^{1 - x_j}.$$
(a) Show that
$$\frac{\partial \log P(\mathbf{x} \mid \omega_i, \boldsymbol\theta_i)}{\partial \theta_{ij}} = \frac{x_j - \theta_{ij}}{\theta_{ij}(1 - \theta_{ij})}.$$
(b) Using the general equations for maximum likelihood estimates, show that the maximum likelihood estimate $\hat{\boldsymbol\theta}_i$ for $\boldsymbol\theta_i$ must satisfy
$$\hat{\boldsymbol\theta}_i = \frac{\sum_{k=1}^{n} \hat P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol\theta}_i)\, \mathbf{x}_k}{\sum_{k=1}^{n} \hat P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol\theta}_i)}.$$
3. Consider the univariate normal mixture
$$p(x \mid \boldsymbol\theta) = \sum_{i=1}^{c} \frac{P(\omega_i)}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{1}{2}\left(\frac{x - \mu_i}{\sigma_i}\right)^2\right].$$
Write a computer program that uses the general maximum likelihood equations of Section 6.4.3 iteratively to estimate the unknown means, variances, and a priori probabilities. Use this program to find maximum likelihood estimates of these parameters for the data in Table 6-1. (Answer: $\hat\mu_1 = -2.404$, $\hat\mu_2 = 1.491$, $\hat\sigma_1 = 0.577$, $\hat\sigma_2 = 1.338$, $\hat P(\omega_1) = 0.268$, $\hat P(\omega_2) = 0.732$.)
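The requested program might look like the following Python sketch. The updates are the standard iterative maximum likelihood (EM-style) equations for a normal mixture; the initialization scheme and the sample data below are arbitrary assumptions, since Table 6-1 is not reproduced in this excerpt.

```python
import numpy as np

def fit_normal_mixture(x, c, n_iter=200):
    """Iterative ML estimation for a univariate normal mixture:
    alternate between computing posteriors P(w_i | x_k) and
    re-estimating the priors, means, and variances from them."""
    n = len(x)
    # Arbitrary initialization: equal priors, spread-out means, pooled variance.
    priors = np.full(c, 1.0 / c)
    means = np.linspace(x.min(), x.max(), c)
    variances = np.full(c, x.var())
    for _ in range(n_iter):
        # Posterior probabilities P(w_i | x_k) for each component and sample.
        dens = (np.exp(-0.5 * (x[None, :] - means[:, None]) ** 2
                       / variances[:, None])
                / np.sqrt(2 * np.pi * variances[:, None]))
        joint = priors[:, None] * dens
        post = joint / joint.sum(axis=0, keepdims=True)
        # Maximum likelihood re-estimates from the posteriors.
        weights = post.sum(axis=1)
        priors = weights / n
        means = (post * x[None, :]).sum(axis=1) / weights
        variances = (post * (x[None, :] - means[:, None]) ** 2).sum(axis=1) / weights
    return priors, means, np.sqrt(variances)

# Stand-in data (Table 6-1 is not included in this excerpt):
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.4, 0.6, 7), rng.normal(1.5, 1.3, 18)])
print(fit_normal_mixture(x, c=2))
```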
4. Let $p(\mathbf{x} \mid \boldsymbol\theta)$ be a $c$-component normal mixture with $p(\mathbf{x} \mid \omega_i, \boldsymbol\theta_i) \sim N(\boldsymbol\mu_i, \sigma_i^2 I)$. Using the results of Section 6.3, show that the maximum likelihood estimate for $\sigma_i^2$ must satisfy
$$\hat\sigma_i^2 = \frac{1}{d}\, \frac{\sum_{k=1}^{n} \hat P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol\theta}_i)\, \|\mathbf{x}_k - \hat{\boldsymbol\mu}_i\|^2}{\sum_{k=1}^{n} \hat P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol\theta}_i)},$$
where $\hat{\boldsymbol\mu}_i$ and $\hat P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol\theta}_i)$ are given by Eqs. (15) and (17), respectively.
5. The derivation of the equations for maximum likelihood estimation of parameters of a mixture density was made under the assumption that the parameters in each component density are functionally independent. Suppose instead that
$$p(\mathbf{x} \mid \alpha) = \sum_{i=1}^{c} p(\mathbf{x} \mid \omega_i, \alpha) P(\omega_i),$$
where $\alpha$ is a parameter that appears in a number of the component densities. Let $l$ be the $n$-sample log-likelihood function, and show that
$$\frac{\partial l}{\partial \alpha} = \sum_{k=1}^{n} \sum_{i=1}^{c} P(\omega_i \mid \mathbf{x}_k, \alpha)\, \frac{\partial \log p(\mathbf{x}_k \mid \omega_i, \alpha)}{\partial \alpha},$$
where
$$P(\omega_i \mid \mathbf{x}_k, \alpha) = \frac{p(\mathbf{x}_k \mid \omega_i, \alpha) P(\omega_i)}{p(\mathbf{x}_k \mid \alpha)}.$$
6. Let $p(\mathbf{x} \mid \omega_i, \boldsymbol\theta_i) \sim N(\boldsymbol\mu_i, \Sigma)$, where $\Sigma$ is a common covariance matrix for the $c$ component densities. Let $\sigma_{pq}$ be the $pq$th element of $\Sigma$, $\sigma^{pq}$ be the $pq$th element of $\Sigma^{-1}$, $x_p(k)$ be the $p$th element of $\mathbf{x}_k$, and $\mu_p(i)$ be the $p$th element of $\boldsymbol\mu_i$.
(a) Show that
$$\frac{\partial \log p(\mathbf{x}_k \mid \omega_i, \boldsymbol\theta_i)}{\partial \sigma^{pq}} = \left(1 - \frac{\delta_{pq}}{2}\right)\left[\sigma_{pq} - (x_p(k) - \mu_p(i))(x_q(k) - \mu_q(i))\right],$$
where $\delta_{pq}$ is the Kronecker delta.
(b) Use this result and the results of Problem 5 to show that the maximum likelihood estimate for $\Sigma$ must satisfy
$$\hat\Sigma = \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k\mathbf{x}_k^t - \sum_{i=1}^{c} \hat P(\omega_i)\, \hat{\boldsymbol\mu}_i\hat{\boldsymbol\mu}_i^t,$$
where $\hat P(\omega_i)$ and $\hat{\boldsymbol\mu}_i$ are the maximum likelihood estimates given by Eqs. (14) and (15) in the text.
7. Show that the maximum likelihood estimate of an a priori probability can be zero by considering the following special case. Let $p(x \mid \omega_1) \sim N(0, 1)$ and $p(x \mid \omega_2) \sim N(0, 1/2)$, so that $P(\omega_1)$ is the only unknown parameter in the mixture
$$p(x) = \frac{P(\omega_1)}{\sqrt{2\pi}}\, e^{-(1/2)x^2} + \frac{1 - P(\omega_1)}{\sqrt{\pi}}\, e^{-x^2}.$$
Show that the maximum likelihood estimate $\hat P(\omega_1)$ of $P(\omega_1)$ is zero if one sample $x_1$ is observed and if $x_1^2 < \log 2$. What is the value of $\hat P(\omega_1)$ if $x_1^2 > \log 2$?
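A quick numerical check of this threshold (an illustration, not part of the original problem): the one-sample likelihood is linear in $P(\omega_1)$, so its maximum over $[0, 1]$ sits at an endpoint, and the endpoint flips exactly at $x_1^2 = \log 2$.

```python
import numpy as np

def likelihood(P, x):
    """One-sample mixture likelihood as a function of the prior P = P(w1)."""
    return (P * np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
            + (1 - P) * np.exp(-x**2) / np.sqrt(np.pi))

P = np.linspace(0.0, 1.0, 1001)
for x1 in (0.5, 1.0):  # x1^2 = 0.25 < log 2 ~ 0.693 < 1.0 = x1^2
    best = P[np.argmax(likelihood(P, x1))]
    print(f"x1 = {x1}: ML estimate of P(w1) is {best}")
# Prints 0.0 for x1 = 0.5 and 1.0 for x1 = 1.0, matching the threshold.
```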
8. Consider the univariate normal mixture
$$p(x \mid \mu_1, \ldots, \mu_c) = \sum_{i=1}^{c} \frac{P(\omega_i)}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu_i}{\sigma}\right)^2\right]$$
in which all the components have the same, known variance $\sigma^2$. Suppose that the means are so far apart compared to $\sigma$ that for any observed $x$ all but one of the terms in this sum are negligible. Use a heuristic argument to show that the value of
$$\max_{\mu_1, \ldots, \mu_c} \left\{\frac{1}{n} \log p(x_1, \ldots, x_n \mid \mu_1, \ldots, \mu_c)\right\}$$
ought to be approximately
$$\sum_{i=1}^{c} P(\omega_i) \log P(\omega_i) - \frac{1}{2} \log 2\pi e \sigma^2$$
when the number $n$ of independently drawn samples is large. Compare this with the value shown in Figure 6.1.
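The intended heuristic can be sketched as follows (an outline under the stated well-separated assumption, not the book's worked solution). Keeping only the one non-negligible component, labeled $i(k)$, for each sample,
$$\frac{1}{n}\log p(x_1,\ldots,x_n \mid \hat\mu_1,\ldots,\hat\mu_c) \approx \frac{1}{n}\sum_{k=1}^{n} \log\!\left[\frac{P(\omega_{i(k)})}{\sqrt{2\pi}\,\sigma}\, e^{-(x_k-\hat\mu_{i(k)})^2/2\sigma^2}\right].$$
For large $n$, the average of $\log P(\omega_{i(k)})$ tends to $\sum_i P(\omega_i)\log P(\omega_i)$, the average of $(x_k-\hat\mu_{i(k)})^2/2\sigma^2$ tends to $1/2$, and the normalizing constant contributes $-\frac{1}{2}\log 2\pi\sigma^2$; adding these terms gives $\sum_i P(\omega_i)\log P(\omega_i) - \frac{1}{2}\log 2\pi e\sigma^2$.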
9. Let $\theta_1$ and $\theta_2$ be unknown parameters for the component densities $p(x \mid \omega_1, \theta_1)$ and $p(x \mid \omega_2, \theta_2)$, respectively. Assume that $\theta_1$ and $\theta_2$ are initially statistically independent, so that $p(\theta_1, \theta_2) = p_1(\theta_1)p_2(\theta_2)$. Show that after one sample $x_1$ from the mixture density is observed, $p(\theta_1, \theta_2 \mid x_1)$ can no longer be factored as $p_1(\theta_1 \mid x_1)p_2(\theta_2 \mid x_1)$ if
$$\frac{\partial p(x \mid \omega_i, \theta_i)}{\partial \theta_i} \neq 0, \qquad i = 1, 2.$$
What does this imply in general about the statistical dependence of parameters in unsupervised learning?
10. Let $\mathbf{x}_1, \ldots, \mathbf{x}_n$ be $n$ $d$-dimensional samples and $\Sigma$ be any nonsingular $d$-by-$d$ matrix. Show that the vector $\mathbf{x}$ that minimizes
$$\sum_{k=1}^{n} (\mathbf{x}_k - \mathbf{x})^t \Sigma^{-1} (\mathbf{x}_k - \mathbf{x})$$
is the sample mean, $\frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k$.
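One route to this result (a sketch, not the book's solution, assuming $\Sigma^{-1}$ symmetric positive definite as for a covariance matrix): set the gradient with respect to $\mathbf{x}$ to zero,
$$\nabla_{\mathbf{x}} \sum_{k=1}^{n} (\mathbf{x}_k - \mathbf{x})^t \Sigma^{-1} (\mathbf{x}_k - \mathbf{x}) = -2\,\Sigma^{-1} \sum_{k=1}^{n} (\mathbf{x}_k - \mathbf{x}) = \mathbf{0}.$$
Since $\Sigma^{-1}$ is nonsingular, this forces $\sum_k (\mathbf{x}_k - \mathbf{x}) = \mathbf{0}$, i.e. $\mathbf{x} = \frac{1}{n}\sum_k \mathbf{x}_k$, and positive definiteness makes this critical point the minimum.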
11. Let $s(\mathbf{x}, \mathbf{x}') = \mathbf{x}^t\mathbf{x}' / (\|\mathbf{x}\| \cdot \|\mathbf{x}'\|)$. Interpret this similarity measure if the $d$ features have binary values, where $x_i = 1$ if $\mathbf{x}$ possesses the $i$th feature and $x_i = -1$ if it does not. Show that for this case
$$\|\mathbf{x} - \mathbf{x}'\|^2 = 2d\,(1 - s(\mathbf{x}, \mathbf{x}')).$$
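The identity follows in one line once the norms are evaluated (a sketch, not the book's solution): with $x_i = \pm 1$ for all $i$, $\|\mathbf{x}\|^2 = \|\mathbf{x}'\|^2 = d$, so $s(\mathbf{x}, \mathbf{x}') = \mathbf{x}^t\mathbf{x}'/d$ and
$$\|\mathbf{x} - \mathbf{x}'\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathbf{x}^t\mathbf{x}' = 2d - 2d\,s(\mathbf{x}, \mathbf{x}') = 2d\,(1 - s(\mathbf{x}, \mathbf{x}')).$$
Here $\mathbf{x}^t\mathbf{x}'$ counts the features on which the two vectors agree minus those on which they disagree, so $s$ is the fraction of agreements minus the fraction of disagreements.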
12. If a set of $n$ samples $\mathscr{X}$ is partitioned into $c$ disjoint subsets $\mathscr{X}_1, \ldots, \mathscr{X}_c$, the sample mean $\mathbf{m}_i$ for samples in $\mathscr{X}_i$ is undefined if $\mathscr{X}_i$ is empty. In such a case, the sum of squared errors involves only the nonempty subsets:
$$J_e = \sum_{\mathscr{X}_i \neq \emptyset}\; \sum_{\mathbf{x} \in \mathscr{X}_i} \|\mathbf{x} - \mathbf{m}_i\|^2.$$
Assuming that $n \geq c$, show that there are no empty subsets in a partition that minimizes $J_e$.
13. Consider a set of $n = 2k + 1$ samples, $k$ of which coincide at $x = -2$, $k$ at $x = 0$, and one at $x = a > 0$. Show that the two-cluster partitioning that minimizes $J_e$ groups the $k$ samples at $x = 0$ with the one at $x = a$ if $a^2 < 2(k + 1)$. What is the optimal grouping if $a^2 > 2(k + 1)$?
14. Let $\mathbf{x}_1 = (4\;5)^t$, $\mathbf{x}_2 = (1\;4)^t$, $\mathbf{x}_3 = (0\;1)^t$, and $\mathbf{x}_4 = (5\;0)^t$, and consider the following three partitions:
1. $\mathscr{X}_1 = \{\mathbf{x}_1, \mathbf{x}_2\}$, $\mathscr{X}_2 = \{\mathbf{x}_3, \mathbf{x}_4\}$
2. $\mathscr{X}_1 = \{\mathbf{x}_1, \mathbf{x}_4\}$, $\mathscr{X}_2 = \{\mathbf{x}_2, \mathbf{x}_3\}$
3. $\mathscr{X}_1 = \{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3\}$, $\mathscr{X}_2 = \{\mathbf{x}_4\}$.
Show that by the sum-of-squared-error (or $\operatorname{tr} S_W$) criterion, the third partition is favored, whereas by the invariant $|S_W|$ criterion the first two partitions are favored. (Numerical answers for the three partitions: (1) and (2), $\operatorname{tr} S_W = 18$, $|S_W| = 16$; (3), $\operatorname{tr} S_W = 52/3$, $|S_W| = 64/3$.)
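The stated numbers are easy to verify mechanically; a small NumPy check follows (an illustration, not part of the original problem):

```python
import numpy as np

x = {1: np.array([4., 5.]), 2: np.array([1., 4.]),
     3: np.array([0., 1.]), 4: np.array([5., 0.])}

partitions = [[(1, 2), (3, 4)], [(1, 4), (2, 3)], [(1, 2, 3), (4,)]]

for idx, part in enumerate(partitions, 1):
    # Within-cluster scatter S_W = sum over clusters of
    # sum over samples of (x - m_i)(x - m_i)^t.
    S_W = np.zeros((2, 2))
    for cluster in part:
        pts = np.array([x[i] for i in cluster])
        dev = pts - pts.mean(axis=0)
        S_W += dev.T @ dev
    print(f"partition {idx}: tr S_W = {np.trace(S_W):.4f}, "
          f"|S_W| = {np.linalg.det(S_W):.4f}")
# Prints tr = 18, det = 16 for partitions 1 and 2, and
# tr = 52/3 ~ 17.3333, det = 64/3 ~ 21.3333 for partition 3.
```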
15. Show that the eigenvalues $\lambda_1, \ldots, \lambda_d$ of $S_W^{-1}S_B$ are invariant to nonsingular linear transformations of the data. Show that the eigenvalues $\nu_1, \ldots, \nu_d$ of $S_T^{-1}S_W$ are related to those of $S_W^{-1}S_B$ by $\nu_i = 1/(1 + \lambda_i)$. How does this show that $J_d = |S_W|/|S_T|$ is invariant to nonsingular linear transformations of the data?
16. One way to generalize the basic minimum-squared-error procedure is to define the criterion function
$$J_T = \sum_{i=1}^{c}\; \sum_{\mathbf{x} \in \mathscr{X}_i} (\mathbf{x} - \mathbf{m}_i)^t S_T^{-1} (\mathbf{x} - \mathbf{m}_i),$$
where $\mathbf{m}_i$ is the mean of the $n_i$ samples in $\mathscr{X}_i$ and $S_T$ is the total scatter matrix.
(a) Show that $J_T$ is invariant to nonsingular linear transformations of the data.