BIBLIOGRAPHICAL REMARKS
`
Sokal and Sneath (1963) and Ball (1965) list many of the similarity measures and criterion functions that have seen use. The matters of measurement scales, invariance criteria, and appropriate statistical operations are illuminated by Stevens (1968), and related fundamental philosophical issues concerning clustering are treated by Watanabe (1969). The critique of clustering given by Fleiss and Zubin (1969) points out the unhappy consequences of being careless about such matters.
Jones (1968) credits Thorndike (1953) with being the first to use the sum-of-squared-error criterion, which appears so frequently in the literature. The invariant criteria we presented were derived from Friedman and Rubin (1967), who pointed out that these criteria are related to Hotelling's Trace Criterion and the F-ratio of classical statistics. The observation that all these criteria give the same optimal partitions in the two-cluster case is due to Fukunaga and Koontz (1970). Of the various criteria we did not mention, the "cohesion" criterion of Watanabe (1969, Chapter 8) is of particular interest since it involves more than pairwise similarity.
In the text we outlined the basic steps in a number of standard optimization and clustering programs. These descriptions were intentionally simplified, and even the more complete descriptions found in the literature do not always mention such matters as how ties are broken or how "wild shots" are rejected. The Isodata algorithm of Ball and Hall (1967) differs from our simplified description in several ways, most notably in the splitting of clusters that have too much within-cluster variability, and the merging of clusters that have too little between-cluster variability. Our description of the basic minimum-squared-error procedure is derived from an unpublished computer program developed by R. C. Singleton and W. H. Kautz at Stanford Research Institute in 1965. This procedure is also closely related to the adaptive sequential procedure of Sebestyen (1962), and to the so-called k-means procedure, whose convergence properties were studied by MacQueen (1967). Interesting applications of such procedures to character recognition are described by Andrews, Atrubin, and Hu (1968) and by Casey and Nagy (1968).
Sokal and Sneath (1963) reference much of the early work on hierarchical clustering, and Wishart (1969) gives explicit references to the original sources for the single-linkage, nearest-neighbor, complete-linkage, furthest-neighbor, minimum-squared-error, and several other procedures. Lance and Williams (1967) show how most of these procedures can be obtained by specializing a general distance function in different ways; in addition, they reference the major papers on divisive hierarchical clustering. The relation between single-linkage procedures and minimal spanning trees was shown by Gower and Ross (1969), who recommended a simple, efficient algorithm for finding minimal spanning trees given by Prim (1957). The equivalence between hierarchical clustering and a distance function satisfying the ultrametric inequality was shown by Johnson (1967).
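The Gower-Ross observation can be illustrated with a brief sketch: build a minimal spanning tree with Prim's algorithm on the matrix of pairwise distances, delete its c - 1 longest edges, and the connected components that remain are exactly the single-linkage clusters. The code below is our own minimal rendering of that idea (names and interfaces are ours), assuming a full distance matrix is available.

```python
import numpy as np

def single_linkage_via_mst(dist, c):
    """Cluster n points into c groups: grow a minimal spanning tree with
    Prim's algorithm, then delete the c-1 longest tree edges; the
    remaining components are the single-linkage clusters."""
    n = len(dist)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].astype(float)          # cheapest edge into the tree so far
    link = np.zeros(n, dtype=int)         # in-tree endpoint of that edge
    edges = []
    for _ in range(n - 1):
        fringe = np.where(in_tree, np.inf, best)
        v = int(np.argmin(fringe))        # nearest vertex not yet in the tree
        edges.append((best[v], link[v], v))
        in_tree[v] = True
        closer = dist[v] < best
        best[closer] = dist[v][closer]
        link[closer] = v
    edges.sort()                          # cut the c-1 longest MST edges
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for _, u, v in edges[: n - c]:        # union the endpoints of kept edges
        parent[find(u)] = find(v)
    return np.array([find(i) for i in range(n)])
```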
The great majority of papers on clustering have either explicitly or implicitly accepted some form of minimum-variance criterion. Wishart (1969) pointed out the serious limitations inherent in this approach, and as an alternative suggested a procedure resembling $k_n$-nearest-neighbor estimation of modes of the mixture density. Critiques of minimum-variance methods have also been given by Ling (1971) and Zahn (1971), both of whom favored graph-theoretic approaches to clustering. Zahn's work, though intended for data of any dimensionality, was motivated by a desire to find mathematical procedures that group sets of points in two dimensions in a way that seems visually natural. (Haralick and Kelly (1969) and Haralick and Dinstein (1971) also treat certain picture processing operations as clustering procedures, a viewpoint that applies to many of the procedures described in Part II of this book.)
Most of the early work on graph-theoretic methods was done for information retrieval purposes. Augustson and Minker (1970) credit Kuhns (1959) with the first application of graph theory to clustering. They give an experimental comparison of several graph-theoretic techniques intended for information retrieval applications, and give many references to work in this domain. It is interesting that among papers with a graph-theoretic orientation we find three that are concerned with statistical tests for cluster validity, viz., those by Bonner (1964), Hartigan (1967), and Ling (1971). Hall, Tepping, and Ball (1971) computed how the sum of squared error varies with the dimensionality of the data and the assumed number of clusters for both uniform and simplex data, and suggested these distributions as useful standards for comparison. Wolfe (1970) suggests a test for cluster validity based on an assumed chi-square distribution for the log-likelihood function.
Green and Carmone (1970), whose valuable monograph on multidimensional scaling contains an extensive bibliography, trace the origins of multidimensional scaling to a paper by Richardson (1938). Recent interest in the topic was stimulated by two developments, nonmetric multidimensional scaling and computer graphics applications. The nonmetric approach originated by Shepard (1962) and extended by Kruskal (1964a) is well suited to many problems in psychology and sociology. The computational aspects of minimizing the criterion $J_{mon}$ subject to a monotonicity constraint are described in detail by Kruskal (1964b). Calvert (1968) used a variation of Shepard's criterion to provide a two-dimensional computer display of multivariate data. The computationally simpler $J_{ef}$ criterion was proposed and used by Sammon (1969) to display data for interactive analysis.
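As a rough illustration of the kind of criterion involved, the sketch below minimizes a Sammon-style stress, the sum over pairs of $(d^*_{ij} - d_{ij})^2 / d^*_{ij}$, by plain gradient descent. This is not Sammon's program (his algorithm used a different, scaled descent step), and the function name, learning rate, and iteration count are our own assumptions.

```python
import numpy as np

def sammon_map(x, n_iter=500, lr=0.3, seed=0):
    """Map samples x (n x d) to the plane by gradient descent on a
    Sammon-style stress: sum_{i<j} (dstar_ij - d_ij)**2 / dstar_ij,
    where dstar are original distances and d the planar ones."""
    rng = np.random.default_rng(seed)
    n = len(x)
    i, j = np.triu_indices(n, k=1)
    dstar = np.linalg.norm(x[i] - x[j], axis=1) + 1e-12
    scale = dstar.sum()                       # constant normalizer, folded into the step
    y = rng.standard_normal((n, 2)) * 0.01    # random planar start
    for _ in range(n_iter):
        diff = y[i] - y[j]
        d = np.linalg.norm(diff, axis=1) + 1e-12
        # dE/dy_i for each pair: -2 (dstar - d) / (dstar * d) * (y_i - y_j)
        g = (-2.0 * (dstar - d) / (dstar * d))[:, None] * diff
        grad = np.zeros_like(y)
        np.add.at(grad, i, g)                 # accumulate pair gradients
        np.add.at(grad, j, -g)
        y -= lr * grad / scale
    return y
```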
The interest in man-machine systems stems partly from the difficulty of specifying criterion functions and clustering procedures that do what we really want them to do. Mattson and Dammann (1965) were one of the first to suggest a man-machine solution to this problem. The great potential of interactive systems is well described by Ball and Hall (1970) in a paper on their PROMENADE system. Other well-known systems include BC TRY (Tryon and Bailey 1966; 1970), SARF (Stanley, Lendaris, and Nienow 1967), INTERSPACE (Patrick 1969), and OLPARS (Sammon 1970).

Neither automatic nor man-machine systems for pattern recognition can escape the fundamental problems of high-dimensional data. Various procedures have been proposed for reducing the dimensionality, either by selecting the best subset of the available features or by combining the features, usually in a linear fashion. To avoid enormous computational problems, most of these procedures use some criterion other than probability of error in making the selection. For example, Miller (1962) used a $\mathrm{tr}\,S_T^{-1}S_B$ criterion, Lewis (1962) used an entropy criterion, and Marill and Green (1963) used a divergence criterion. In some cases one can bound the probability of error by more easily computed criterion functions, but the final test is always one of actual performance. In the text we restricted our attention to a simple procedure due to King (1967), selecting it primarily because of its close relation to clustering. An excellent presentation of mathematical methods for dimensionality reduction is given by Meisel (1972).
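As a small illustration of scoring feature subsets by a scatter-based criterion such as Miller's, the hypothetical sketch below evaluates $\mathrm{tr}\,S_T^{-1}S_B$ for each candidate subset, given class or cluster labels; the exhaustive search is for illustration only and would be impractical for many features. All names here are our own.

```python
import numpy as np
from itertools import combinations

def trace_criterion(x, labels, feats):
    """tr(S_T^{-1} S_B) for one feature subset: larger values mean the
    group means are better separated relative to the total scatter."""
    z = x[:, list(feats)]
    m = z.mean(axis=0)
    dev = z - m
    st = dev.T @ dev                          # total scatter S_T
    sb = np.zeros_like(st)
    for c in np.unique(labels):
        zc = z[labels == c]
        d = zc.mean(axis=0) - m
        sb += len(zc) * np.outer(d, d)        # between-group scatter S_B
    return float(np.trace(np.linalg.solve(st, sb)))

def best_subset(x, labels, k):
    """Exhaustively score every k-feature subset (illustration only)."""
    return max(combinations(range(x.shape[1]), k),
               key=lambda f: trace_criterion(x, labels, f))
```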
`
`REFERENCES
`
1. Agrawala, A. K., "Learning with a probabilistic teacher," IEEE Trans. Info. Theory, IT-16, 373-379 (July 1970).
2. Andrews, D. R., A. J. Atrubin, and K. C. Hu, "The IBM 1975 Optical Page Reader. Part 3: Recognition logic development," IBM Journal, 12, 334-371 (September 1968).
`3. Augustson, J. G. and J. Minker, "An analysis of some graph theoretical
`cluster techniques," J. ACM, 17, 571-588 (October 1970).
4. Ball, G. H., "Data analysis in the social sciences: what about the details?," Proc. FJCC, pp. 533-560 (Spartan Books, Washington, D.C., 1965).
5. Ball, G. H. and D. J. Hall, "A clustering technique for summarizing multivariate data," Behavioral Science, 12, 153-155 (March 1967).
6. Ball, G. H. and D. J. Hall, "Some implications of interactive graphic computer systems for data analysis and statistics," Technometrics, 12, 17-31 (February 1970).
7. Bolshev, L. N., "Cluster analysis," Bulletin, International Statistical Institute, 43, 411-425 (1969).
8. Bonner, R. E., "On some clustering techniques," IBM Journal, 8, 22-32 (January 1964).
`
`
`
`
9. Calvert, T. W., "Projections of multidimensional data for use in man computer graphics," Proc. FJCC, pp. 227-231 (Thompson Book Co., Washington, D.C., 1968).
10. Casey, R. G. and G. Nagy, "An autonomous reading machine," IEEE Trans. Comp., C-17, 492-503 (May 1968).
11. Cooper, D. B. and P. W. Cooper, "Nonsupervised adaptive signal detection and pattern recognition," Information and Control, 7, 416-444 (September 1964).
`12. Cooper, P. W., "Some topics on nonsupervised adaptive detection for
`multivariate normal distributions," in Computer and Information Sciences-II, pp.
`123-146, J. T. Tou, ed. (Academic Press, New York, 1967).
`13. Cooper, P. W., "Nonsupervised learning in statistical pattern recognition,"
`in Methodologies of Pattern Recognition, pp. 97-109, S. Watanabe, ed. (Academic
`Press, New York, 1969).
`14. Cover, T. M., "Learning in pattern recognition," in Methodologies of Pattern
`Recognition, pp. 111-132, S. Watanabe, ed. (Academic Press, New York, 1969).
`15. Daly, R. F., "The adaptive binary-detection problem on the real line,"
`Technical Report 2003-3, Stanford University, Stanford, Calif. (February 1962).
16. Day, N. E., "Estimating the components of a mixture of normal distributions," Biometrika, 56, 463-474 (December 1969).
17. Doetsch, G., "Zerlegung einer Funktion in Gaußsche Fehlerkurven und zeitliche Zurückverfolgung eines Temperaturzustandes," Mathematische Zeitschrift, 41, 283-318 (1936).
18. Dorofeyuk, A. A., "Automatic classification algorithms (review)," Automation and Remote Control, 32, 1928-1958 (December 1971).
`19. Fleiss, J. L. and J. Zubin, "On the methods and theory of clustering,"
`Multivariate Behavioral Research, 4, 235-250 (April 1969).
`20. Fralick, S. C., "Learning to recognize patterns without a teacher," IEEE
`Trans. Info. Theory, IT-13, 57-64 (January 1967).
21. Friedman, H. P. and J. Rubin, "On some invariant criteria for grouping data," J. American Statistical Assn., 62, 1159-1178 (December 1967).
22. Fukunaga, K. and W. L. G. Koontz, "A criterion and an algorithm for grouping data," IEEE Trans. Comp., C-19, 917-923 (October 1970).
23. Gower, J. C. and G. J. S. Ross, "Minimum spanning trees and single linkage cluster analysis," Appl. Statistics, 18, No. 1, 54-64 (1969).
`24. Green, P. E. and F. J. Carmone, Multidimensional Scaling and Related
`Techniques in Marketing Analysis (Allyn and Bacon, Boston, Mass., 1970).
25. Hall, D. J., B. Tepping, and G. H. Ball, "Theoretical and experimental clustering characteristics for multivariate random and structured data," in "Applications of cluster analysis to Bureau of the Census data," Final Report, Contract Cco-9312, SRI Project 7600, Stanford Research Institute, Menlo Park, Calif. (1970).
`
`
`
`
26. Haralick, R. M. and G. L. Kelly, "Pattern recognition with measurement space and spatial clustering for multiple images," Proc. IEEE, 57, 654-665 (April 1969).
27. Haralick, R. M. and I. Dinstein, "An iterative clustering procedure," IEEE Trans. Sys., Man, and Cyb., SMC-1, 275-289 (July 1971).
28. Hartigan, J. A., "Representation of similarity matrices by trees," J. American Statistical Assn., 62, 1140-1158 (December 1967).
29. Hasselblad, V., "Estimation of parameters for a mixture of normal distributions," Technometrics, 8, 431-444 (August 1966).
30. Hillborn, C. G., Jr. and D. G. Lainiotis, "Optimal unsupervised learning multicategory dependent hypotheses pattern recognition," IEEE Trans. Info. Theory, IT-14, 468-470 (May 1968).
`31. Johnson, S. C., "Hierarchical clustering schemes," Psychometrika, 32, 241-254
`(September 1967).
`32. Jones, K. L., "Problems of grouping individuals and the method of modality,"
`Behavioral Science, 13, 496-511 (November 1968).
`33. King, B. P., "Stepwise clustering procedures," J. American Statistical Assn.,
`62, 86-101 (March 1967).
`34. Kruskal, J. B., "Multidimensional scaling by optimizing goodness of fit to
`a nonmetric hypothesis," Psychometrika, 29, 1-27 (March 1964a).
35. Kruskal, J. B., "Nonmetric multidimensional scaling: a numerical method," Psychometrika, 29, 115-129 (June 1964b).
36. Kuhns, J. L., "Mathematical analysis of correlation clusters," in "Word correlation and automatic indexing," Progress Report No. 2, C 82-OUI, Ramo-Wooldridge Corporation, Canoga Park, Calif. (December 1959).
37. Lance, G. N. and W. T. Williams, "A general theory of classificatory sorting strategies. 1. Hierarchical systems," Computer Journal, 9, 373-380 (February 1967).
38. Lewis, P. M., "The characteristic selection problem in recognition systems," IRE Trans. Info. Theory, IT-8, 171-178 (February 1962).
`39. Ling, R. F., "Cluster Analysis," Technical Report No. 18, Department of
`Statistics, Yale University, New Haven, Conn. (January 1971).
`40. MacQueen, J., "Some methods for classification and analysis of multivariate
`observations," in Proc. Fifth Berkeley Symposium on Math. Stat. and Prob., I,
`281-297, L. M. LeCam and J. Neyman, eds. (University of California Press,
`Berkeley and Los Angeles, Calif., 1967).
41. Marill, T. and D. M. Green, "On the effectiveness of receptors in recognition systems," IEEE Trans. Info. Theory, IT-9, 11-17 (January 1963).
`42. Mattson, R. L. and J. E. Dammann, "A technique for detecting and coding
`subclasses in pattern recognition problems," IBM Journal, 9, 294-302 (July 1965).
`43. Medgyessy, P., Decomposition of Superpositions of Distribution Functions
`(Plenum Press, New York, 1961).
`
`
`
`
`44. Meisel, W. S., Computer-Oriented Approaches to Pattern Recognition
`(Academic Press, New York and London, 1972).
45. Miller, R. G., "Statistical prediction by discriminant analysis," Meteorological Monographs, 4, 25 (October 1962).
46. Patrick, E. A. and J. C. Hancock, "Nonsupervised sequential classification and recognition of patterns," IEEE Trans. Info. Theory, IT-12, 362-372 (July 1966).
47. Patrick, E. A., "(Interspace) Interactive system for pattern analysis, classification, and enhancement," paper presented at the Computers and Communications Conference, Rome, N.Y. (September 1969).
48. Patrick, E. A., J. P. Costello, and F. C. Monds, "Decision directed estimation of a two class decision boundary," IEEE Trans. Comp., C-19, 197-205 (March 1970).
49. Pearson, K., "Contributions to the mathematical theory of evolution," Philosophical Transactions of the Royal Society of London, 185, 71-110 (1894).
50. Prim, R. C., "Shortest connection networks and some generalizations," Bell System Technical Journal, 36, 1389-1401 (November 1957).
51. Richardson, M. W., "Multidimensional psychophysics," Psychological Bulletin, 35, 659-660 (1938).
`52. Sammon, J. W., Jr., "A nonlinear mapping for data structure analysis,"
`IEEE Trans. Comp., C-18, 401-409 (May 1969).
`53. Sammon, J. W., Jr., "Interactive pattern analysis and classification," IEEE
`Trans. Comp., C-19, 594-616 (July 1970).
54. Sebestyen, G. S., "Pattern recognition by an adaptive process of sample set construction," IRE Trans. Info. Theory, IT-8, S82-S91 (September 1962).
55. Shepard, R. N., "The analysis of proximities: multidimensional scaling with an unknown distance function," Psychometrika, 27, 125-139, 219-246 (1962).
`56. Sokal, R. R. and P. H. A. Sneath, Principles of Numerical Taxonomy (W. H.
`Freeman, San Francisco, Calif., 1963).
57. Spragins, J., "Learning without a teacher," IEEE Trans. Info. Theory, IT-12, 223-230 (April 1966).
58. Stanat, D. F., "Unsupervised learning of mixtures of probability functions," in Pattern Recognition, pp. 357-389, L. Kanal, ed. (Thompson Book Co., Washington, D.C., 1968).
`59. Stanley, G. L., G. G. Lendaris, and W. C. Nienow, "Pattern Recognition
`Program," Technical Report 567-16, AC Electronics Defense Research Laboratories,
`Santa Barbara, Calif. (1967).
60. Stevens, S. S., "Measurement, statistics, and the schemapiric view," Science, 161, 849-856 (30 August 1968).
`61. Teicher, H., "Identifiability of mixtures," Ann. Math. Stat., 32, 244-248
`(March 1961).
`
`
`
`
62. Teicher, H., "Identifiability of finite mixtures," Ann. Math. Stat., 34, 1265-1269 (December 1963).
`63. Thorndike, R. L., "Who belongs in the family?" Psychometrika, 18, 267-276
`(1953).
`64. Tryon, R. C., Cluster Analysis (Edwards Brothers, Ann Arbor, Mich., 1939).
65. Tryon, R. C. and D. E. Bailey, "The BC TRY computer system of cluster and factor analysis," Multivariate Behavioral Research, 1, 95-111 (January 1966).
`66. Tryon, R. C. and D. E. Bailey, Cluster Analysis (McGraw-Hill, New York,
`1970).
67. Watanabe, M. S., Knowing and Guessing (John Wiley, New York, 1969).
`68. Wishart, D., "Mode analysis: a generalization of nearest neighbor which
`reduces chaining effects," in Numerical Taxonomy, pp. 282-308, A. J. Cole, ed.
`(Academic Press, London and New York, 1969).
69. Wolfe, J. H., "Pattern clustering by multivariate mixture analysis," Multivariate Behavioral Research, 5, 329-350 (July 1970).
`70. Yakowitz, S. J. and J. D. Spragins, "On the identifiability of finite mixtures,"
`Ann. Math. Stat., 39, 209-214 (February 1968).
71. Yakowitz, S. J., "Unsupervised learning and the identification of finite mixtures," IEEE Trans. Info. Theory, IT-16, 330-338 (May 1970).
72. Young, T. Y. and G. Coraluppi, "Stochastic estimation of a mixture of normal density functions using an information criterion," IEEE Trans. Info. Theory, IT-16, 258-263 (May 1970).
73. Zahn, C. T., "Graph-theoretical methods for detecting and describing gestalt clusters," IEEE Trans. Comp., C-20, 68-86 (January 1971).
`
`PROBLEMS
`
1. Suppose that x can assume the values 0, 1, ..., m and that $P(x \mid \boldsymbol\theta)$ is a mixture of c binomial distributions,
$$P(x \mid \boldsymbol\theta) = \sum_{i=1}^{c} \binom{m}{x}\, \theta_i^x (1 - \theta_i)^{m-x} P(\omega_i).$$
Assuming that the a priori probabilities are known, explain why this mixture is not identifiable if m < c. How does this answer change if the a priori probabilities are also unknown?
2. Let $\mathbf{x}$ be a binary vector and $P(\mathbf{x} \mid \boldsymbol\theta)$ be a mixture of c multivariate Bernoulli distributions,
$$P(\mathbf{x} \mid \boldsymbol\theta) = \sum_{i=1}^{c} P(\mathbf{x} \mid \omega_i, \boldsymbol\theta_i) P(\omega_i),$$
where
$$P(\mathbf{x} \mid \omega_i, \boldsymbol\theta_i) = \prod_{j=1}^{d} \theta_{ij}^{x_j} (1 - \theta_{ij})^{1 - x_j}.$$
`
`
`
(a) Show that
$$\frac{\partial \log P(\mathbf{x} \mid \omega_i, \boldsymbol\theta_i)}{\partial \theta_{ij}} = \frac{x_j - \theta_{ij}}{\theta_{ij}(1 - \theta_{ij})}.$$
(b) Using the general equations for maximum likelihood estimates, show that the maximum likelihood estimate $\hat{\boldsymbol\theta}_i$ for $\boldsymbol\theta_i$ must satisfy
$$\hat{\boldsymbol\theta}_i = \frac{\sum_{k=1}^{n} \hat P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol\theta})\, \mathbf{x}_k}{\sum_{k=1}^{n} \hat P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol\theta})}.$$
3. Consider the univariate normal mixture
$$p(x \mid \boldsymbol\theta) = \sum_{i=1}^{c} \frac{P(\omega_i)}{\sqrt{2\pi}\,\sigma_i} \exp\!\left[-\frac{1}{2}\left(\frac{x - \mu_i}{\sigma_i}\right)^{2}\right].$$
Write a computer program that uses the general maximum likelihood equations of Section 6.4.3 iteratively to estimate the unknown means, variances, and a priori probabilities. Use this program to find maximum likelihood estimates of these parameters for the data in Table 6-1. (Answer: $\hat\mu_1 = -2.404$, $\hat\mu_2 = 1.491$, $\hat\sigma_1 = 0.577$, $\hat\sigma_2 = 1.338$, $\hat P(\omega_1) = 0.268$, $\hat P(\omega_2) = 0.732$.)
4. Let $p(\mathbf{x} \mid \boldsymbol\theta)$ be a c-component normal mixture with $p(\mathbf{x} \mid \omega_i, \boldsymbol\theta_i) \sim N(\boldsymbol\mu_i, \sigma_i^2 I)$. Using the results of Section 6.3, show that the maximum likelihood estimate for $\sigma_i^2$ must satisfy
$$\hat\sigma_i^2 = \frac{1}{d}\, \frac{\sum_{k=1}^{n} \hat P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol\theta})\, \|\mathbf{x}_k - \hat{\boldsymbol\mu}_i\|^2}{\sum_{k=1}^{n} \hat P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol\theta})},$$
where $\hat{\boldsymbol\mu}_i$ and $\hat P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol\theta})$ are given by Eqs. (15) and (17), respectively.
5. The derivation of the equations for maximum likelihood estimation of parameters of a mixture density was made under the assumption that the parameters in each component density are functionally independent. Suppose instead that
$$p(x \mid \alpha) = \sum_{i=1}^{c} p(x \mid \omega_i, \alpha) P(\omega_i),$$
where $\alpha$ is a parameter that appears in a number of the component densities. Let $l$ be the n-sample log-likelihood function, and show that
$$\frac{\partial l}{\partial \alpha} = \sum_{k=1}^{n} \sum_{i=1}^{c} P(\omega_i \mid x_k, \alpha)\, \frac{\partial \log p(x_k \mid \omega_i, \alpha)}{\partial \alpha},$$
where
$$P(\omega_i \mid x_k, \alpha) = \frac{p(x_k \mid \omega_i, \alpha) P(\omega_i)}{p(x_k \mid \alpha)}.$$
`
`
6. Let $p(\mathbf{x} \mid \omega_i, \boldsymbol\theta_i) \sim N(\boldsymbol\mu_i, \Sigma)$, where $\Sigma$ is a common covariance matrix for the c component densities. Let $\sigma_{pq}$ be the pqth element of $\Sigma$, $\sigma^{pq}$ be the pqth element of $\Sigma^{-1}$, $x_p(k)$ be the pth element of $\mathbf{x}_k$, and $\mu_p(i)$ be the pth element of $\boldsymbol\mu_i$.
(a) Show that
$$\frac{\partial \log p(\mathbf{x}_k \mid \omega_i, \boldsymbol\theta_i)}{\partial \sigma^{pq}} = \left(1 - \frac{\delta_{pq}}{2}\right)\left[\sigma_{pq} - (x_p(k) - \mu_p(i))(x_q(k) - \mu_q(i))\right].$$
(b) Use this result and the results of Problem 5 to show that the maximum likelihood estimate for $\Sigma$ must satisfy
$$\hat\Sigma = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k \mathbf{x}_k^t - \sum_{i=1}^{c} \hat P(\omega_i)\, \hat{\boldsymbol\mu}_i \hat{\boldsymbol\mu}_i^t,$$
where $\hat P(\omega_i)$ and $\hat{\boldsymbol\mu}_i$ are the maximum likelihood estimates given by Eqs. (14) and (15) in the text.
7. Show that the maximum likelihood estimate of an a priori probability can be zero by considering the following special case. Let $p(x \mid \omega_1) \sim N(0, 1)$ and $p(x \mid \omega_2) \sim N(0, 1/2)$, so that $P(\omega_1)$ is the only unknown parameter in the mixture
$$p(x) = \frac{P(\omega_1)}{\sqrt{2\pi}}\, e^{-(1/2)x^2} + \frac{1 - P(\omega_1)}{\sqrt{\pi}}\, e^{-x^2}.$$
Show that the maximum likelihood estimate $\hat P(\omega_1)$ of $P(\omega_1)$ is zero if one sample $x_1$ is observed and if $x_1^2 < \log 2$. What is the value of $\hat P(\omega_1)$ if $x_1^2 > \log 2$?
8. Consider the univariate normal mixture
$$p(x \mid \mu_1, \ldots, \mu_c) = \sum_{i=1}^{c} \frac{P(\omega_i)}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x - \mu_i}{\sigma}\right)^{2}\right]$$
in which all the components have the same, known variance $\sigma^2$. Suppose that the means are so far apart compared to $\sigma$ that for any observed $x$ all but one of the terms in this sum are negligible. Use a heuristic argument to show that the value of
$$\max_{\mu_1, \ldots, \mu_c} \left\{\frac{1}{n} \log p(x_1, \ldots, x_n \mid \mu_1, \ldots, \mu_c)\right\}$$
ought to be approximately
$$\sum_{i=1}^{c} P(\omega_i) \log P(\omega_i) - \frac{1}{2} \log 2\pi e \sigma^2$$
when the number n of independently drawn samples is large. Compare this with the value shown in Figure 6.1.
9. Let $\theta_1$ and $\theta_2$ be unknown parameters for the component densities $p(x \mid \omega_1, \theta_1)$ and $p(x \mid \omega_2, \theta_2)$, respectively. Assume that $\theta_1$ and $\theta_2$ are initially statistically independent, so that $p(\theta_1, \theta_2) = p_1(\theta_1)p_2(\theta_2)$. Show that after one sample $x_1$ from the mixture density is observed, $p(\theta_1, \theta_2 \mid x_1)$ can no longer be factored as
$$p_1(\theta_1 \mid x_1)\, p_2(\theta_2 \mid x_1)$$
if
$$\frac{\partial p(x \mid \omega_i, \theta_i)}{\partial \theta_i} \neq 0, \qquad i = 1, 2.$$
What does this imply in general about the statistical dependence of parameters in unsupervised learning?
10. Let $\mathbf{x}_1, \ldots, \mathbf{x}_n$ be n d-dimensional samples and $\Sigma$ be any nonsingular d-by-d matrix. Show that the vector $\mathbf{x}$ that minimizes
$$\sum_{k=1}^{n} (\mathbf{x}_k - \mathbf{x})^t \Sigma^{-1} (\mathbf{x}_k - \mathbf{x})$$
is the sample mean, $\frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k$.
11. Let $s(\mathbf{x}, \mathbf{x}') = \mathbf{x}^t\mathbf{x}' / (\|\mathbf{x}\| \cdot \|\mathbf{x}'\|)$. Interpret this similarity measure if the d features have binary values, where $x_i = 1$ if $\mathbf{x}$ possesses the ith feature and $x_i = -1$ if it does not. Show that for this case
$$\|\mathbf{x} - \mathbf{x}'\|^2 = 2d\,(1 - s(\mathbf{x}, \mathbf{x}')).$$
12. If a set of n samples $\mathcal{X}$ is partitioned into c disjoint subsets $\mathcal{X}_1, \ldots, \mathcal{X}_c$, the sample mean $\mathbf{m}_i$ for samples in $\mathcal{X}_i$ is undefined if $\mathcal{X}_i$ is empty. In such a case, the sum of squared errors involves only the nonempty subsets:
$$J_e = \sum_{\mathcal{X}_i \neq \varnothing} \sum_{\mathbf{x} \in \mathcal{X}_i} \|\mathbf{x} - \mathbf{m}_i\|^2.$$
Assuming that $n \geq c$, show that there are no empty subsets in a partition that minimizes $J_e$.
13. Consider a set of $n = 2k + 1$ samples, k of which coincide at $x = -2$, k at $x = 0$, and one at $x = a > 0$. Show that the two-cluster partitioning that minimizes $J_e$ groups the k samples at $x = 0$ with the one at $x = a$ if $a^2 < 2(k + 1)$. What is the optimal grouping if $a^2 > 2(k + 1)$?
14. Let $\mathbf{x}_1 = (4\;5)^t$, $\mathbf{x}_2 = (1\;4)^t$, $\mathbf{x}_3 = (0\;1)^t$, and $\mathbf{x}_4 = (5\;0)^t$, and consider the following three partitions:
1. $\mathcal{X}_1 = \{\mathbf{x}_1, \mathbf{x}_2\}$, $\mathcal{X}_2 = \{\mathbf{x}_3, \mathbf{x}_4\}$
2. $\mathcal{X}_1 = \{\mathbf{x}_1, \mathbf{x}_4\}$, $\mathcal{X}_2 = \{\mathbf{x}_2, \mathbf{x}_3\}$
3. $\mathcal{X}_1 = \{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3\}$, $\mathcal{X}_2 = \{\mathbf{x}_4\}$.
Show that by the sum-of-squared-error (or $\mathrm{tr}\,S_W$) criterion the third partition is favored, whereas by the invariant $|S_W|$ criterion the first two partitions are favored. (Numerical answers for the three partitions: (1) and (2), $\mathrm{tr}\,S_W = 18$, $|S_W| = 16$; (3), $\mathrm{tr}\,S_W = 52/3$, $|S_W| = 64/3$.)
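The quoted numbers are easy to check numerically; the short sketch below (our own) accumulates the within-cluster scatter matrix for each partition and prints its trace and determinant.

```python
import numpy as np

x = np.array([[4, 5], [1, 4], [0, 1], [5, 0]], dtype=float)
partitions = [([0, 1], [2, 3]), ([0, 3], [1, 2]), ([0, 1, 2], [3])]

for k, part in enumerate(partitions, 1):
    sw = np.zeros((2, 2))
    for idx in part:
        g = x[list(idx)]
        dev = g - g.mean(axis=0)        # deviations from the cluster mean
        sw += dev.T @ dev               # within-cluster scatter S_W
    # expected: 18, 16 for partitions 1 and 2; 52/3, 64/3 for partition 3
    print(k, np.trace(sw), np.linalg.det(sw))
```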
`, 2" of Sp;S B are invariant to nonsingular linear
`15. Show the eigenvalues Ai, ...
`, " " of Sx1Sw are
`transformations of the data. Show that the eigenvalues v1, •••
`by "; = 1 /(I + -l1). How does this show that J a =
`related
`to those of ~Sn
`ISwl/lSTI is invariant to nonsingular linear transformations of the data?
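The eigenvalue relation in Problem 15 can be spot-checked numerically with random stand-ins for the scatter matrices; a small sketch (all names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((3, 3))
sw = a @ a.T + 3 * np.eye(3)          # within scatter (positive definite)
b = rng.standard_normal((3, 2))
sb = b @ b.T                          # between scatter (rank at most c-1)
st = sw + sb                          # total scatter S_T = S_W + S_B
lam = np.linalg.eigvals(np.linalg.solve(sw, sb)).real
nu = np.linalg.eigvals(np.linalg.solve(st, sw)).real
print(np.sort(nu))                    # should match 1/(1 + lambda_i)
print(np.sort(1.0 / (1.0 + lam)))
```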
16. One way to generalize the basic minimum-squared-error procedure is to define the criterion function
$$J_T = \sum_{i=1}^{c} \sum_{\mathbf{x} \in \mathcal{X}_i} (\mathbf{x} - \mathbf{m}_i)^t S_T^{-1} (\mathbf{x} - \mathbf{m}_i),$$
where $\mathbf{m}_i$ is the mean of the $n_i$ samples in $\mathcal{X}_i$ and $S_T$ is the total scatter matrix.
(a) Show that $J_T$ is invariant to nonsingular linear transformations of the data.
`