equations for the local-maximum-likelihood estimates $\hat\mu_i$, $\hat\Sigma_i$, and $\hat P(\omega_i)$:

$$\hat P(\omega_i) = \frac{1}{n}\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta) \tag{14}$$

$$\hat\mu_i = \frac{\displaystyle\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)\,x_k}{\displaystyle\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)} \tag{15}$$

$$\hat\Sigma_i = \frac{\displaystyle\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)\,(x_k - \hat\mu_i)(x_k - \hat\mu_i)^t}{\displaystyle\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)} \tag{16}$$

where

$$\hat P(\omega_i \mid x_k, \hat\theta) = \frac{p(x_k \mid \omega_i, \hat\theta_i)\,\hat P(\omega_i)}{\displaystyle\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat\theta_j)\,\hat P(\omega_j)} = \frac{|\hat\Sigma_i|^{-1/2}\exp\bigl[-\tfrac12(x_k - \hat\mu_i)^t\hat\Sigma_i^{-1}(x_k - \hat\mu_i)\bigr]\hat P(\omega_i)}{\displaystyle\sum_{j=1}^{c}|\hat\Sigma_j|^{-1/2}\exp\bigl[-\tfrac12(x_k - \hat\mu_j)^t\hat\Sigma_j^{-1}(x_k - \hat\mu_j)\bigr]\hat P(\omega_j)}. \tag{17}$$
While the notation may make these equations appear to be rather formidable, their interpretation is actually quite simple. In the extreme case where $\hat P(\omega_i \mid x_k, \hat\theta)$ is one when $x_k$ is from class $\omega_i$ and zero otherwise, $\hat P(\omega_i)$ is the fraction of samples from $\omega_i$, $\hat\mu_i$ is the mean of those samples, and $\hat\Sigma_i$ is the corresponding sample covariance matrix. More generally, $\hat P(\omega_i \mid x_k, \hat\theta)$ is between zero and one, and all of the samples play some role in the estimates. However, the estimates are basically still frequency ratios, sample means, and sample covariance matrices.
The problems involved in solving these implicit equations are similar to the problems discussed in Section 6.4.1, with the additional complication of having to avoid singular solutions. Of the various techniques that can be used to obtain a solution, the most obvious approach is to use initial estimates to evaluate Eq. (17) for $\hat P(\omega_i \mid x_k, \hat\theta)$ and then to use Eqs. (14)-(16) to update these estimates. If the initial estimates are very good, having perhaps been obtained from a fairly large set of labelled samples, convergence can be quite rapid. However, the results do depend upon the starting point, and the problem of multiple solutions is always present. Furthermore, the repeated computation and inversion of the sample covariance matrices can be quite time consuming.
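The iteration just described can be sketched in a few lines of Python. The following fragment is not from the text; it applies Eqs. (14)-(17) to a one-dimensional normal mixture, and the synthetic data, starting values, and number of iterations are illustrative assumptions only.

```python
import numpy as np

def normal_mixture_mle(x, mu, sigma2, prior, n_iter=50):
    """Iterate Eqs. (14)-(17) for a one-dimensional normal mixture.
    x: (n,) samples; mu, sigma2, prior: (c,) initial estimates."""
    x = np.asarray(x, float)
    for _ in range(n_iter):
        # Eq. (17): posterior probability of each class for each sample.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
        post = dens * prior
        post /= post.sum(axis=1, keepdims=True)
        w = post.sum(axis=0)                                      # effective class counts
        prior = w / len(x)                                        # Eq. (14)
        mu = (post * x[:, None]).sum(axis=0) / w                  # Eq. (15)
        sigma2 = (post * (x[:, None] - mu) ** 2).sum(axis=0) / w  # Eq. (16)
    return mu, sigma2, prior

# Illustrative two-component data; the true means are -2 and 2 (values chosen arbitrarily).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(2.0, 1.0, 700)])
mu, sigma2, prior = normal_mixture_mle(x, np.array([-1.0, 1.0]),
                                        np.array([1.0, 1.0]), np.array([0.5, 0.5]))
```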
Considerable simplification can be obtained if it is possible to assume that the covariance matrices are diagonal. This has the added virtue of reducing the number of unknown parameters, which is very important when the number of samples is not large. If this assumption is too strong, it still may be possible to obtain some simplification by assuming that the c covariance matrices are equal, which also eliminates the problem of singular solutions. The derivation of the appropriate maximum likelihood equations for this case is treated in Problems 5 and 6.
`
6.4.4 A Simple Approximate Procedure

Of the various techniques that can be used to simplify the computation and accelerate convergence, we shall briefly consider one elementary, approximate method. From Eq. (17), it is clear that the probability $\hat P(\omega_i \mid x_k, \hat\theta)$ is large when the squared Mahalanobis distance $(x_k - \hat\mu_i)^t\hat\Sigma_i^{-1}(x_k - \hat\mu_i)$ is small. Suppose that we merely compute the squared Euclidean distance $\|x_k - \hat\mu_i\|^2$, find the mean $\hat\mu_m$ nearest to $x_k$, and approximate $\hat P(\omega_i \mid x_k, \hat\theta)$ as

$$\hat P(\omega_i \mid x_k, \hat\theta) \approx \begin{cases} 1 & i = m \\ 0 & \text{otherwise.} \end{cases}$$

Then the iterative application of Eq. (15) leads to the following procedure* for finding $\hat\mu_1, \ldots, \hat\mu_c$:
`
Procedure: Basic Isodata

1. Choose some initial values for the means $\hat\mu_1, \ldots, \hat\mu_c$.
Loop: 2. Classify the $n$ samples by assigning them to the class of the closest mean.
3. Recompute the means as the average of the samples in their class.
4. If any mean changed value, go to Loop; otherwise, stop.
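For concreteness, a minimal Python sketch of this procedure follows (it is not part of the text); the sample data and the choice of two one-dimensional clusters are assumptions made purely for illustration.

```python
import numpy as np

def basic_isodata(x, mu):
    """Basic Isodata: classify samples by the nearest mean, recompute the means,
    and repeat until no mean changes (steps 1-4 of the procedure above)."""
    x, mu = np.atleast_2d(x).astype(float), np.atleast_2d(mu).astype(float)
    while True:
        # Step 2: squared Euclidean distance from every sample to every mean.
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: recompute each mean as the average of the samples assigned to it.
        new_mu = np.array([x[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(len(mu))])
        # Step 4: stop when no mean changed value.
        if np.allclose(new_mu, mu):
            return new_mu, labels
        mu = new_mu

# Illustrative one-dimensional data with two clusters (all values are made up).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 50), rng.normal(2.0, 1.0, 50)]).reshape(-1, 1)
means, labels = basic_isodata(x, [[-1.0], [1.0]])
```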
`
This is typical of a class of procedures that are known as clustering procedures. Later on we shall place it in the class of iterative optimization procedures, since the means tend to move so as to minimize a squared-error criterion function. At the moment we view it merely as an approximate way to obtain maximum likelihood estimates for the means. The values obtained can be accepted as the answer, or can be used as starting points for the more exact computations.

* Throughout this chapter we shall name and describe various iterative procedures as if they were computer programs. All of these procedures have in fact been programmed, often with much more elaborate provisions for doing such things as breaking ties, avoiding trap states, and allowing more sophisticated terminating conditions. Thus, we occasionally include the word "basic" in their names to emphasize the fact that our interest is limited to explaining essential concepts.
It is interesting to see how this procedure behaves on the example data in Table 6-1. Figure 6.4 shows the sequence of values for $\hat\mu_1$ and $\hat\mu_2$ obtained for several different starting points. Since interchanging $\hat\mu_1$ and $\hat\mu_2$ merely interchanges the labels assigned to the data, the trajectories are symmetric about the line $\hat\mu_1 = \hat\mu_2$. The trajectories lead either to the point $\hat\mu_1 = -2.176$, $\hat\mu_2 = 1.684$ or to its image. This is close to the solution found by the maximum likelihood method (viz., $\hat\mu_1 = -2.130$ and $\hat\mu_2 = 1.668$), and the trajectories show a general resemblance to those shown in Figure 6.3. In general, when the overlap between the component densities is small, the maximum likelihood approach and the Isodata procedure can be expected to give similar results.

[FIGURE 6.4. Trajectories for the Basic Isodata Procedure.]
`
`
`
`
6.5 UNSUPERVISED BAYESIAN LEARNING

6.5.1 The Bayes Classifier

Maximum likelihood methods do not consider the parameter vector $\theta$ to be random; it is just unknown. Prior knowledge about likely values for $\theta$ is irrelevant, although in practice such knowledge may be used in choosing good starting points for hill-climbing procedures. In this section we shall take a Bayesian approach to unsupervised learning. We shall assume that $\theta$ is a random variable with a known a priori distribution $p(\theta)$, and we shall use the samples to compute the a posteriori density $p(\theta \mid \mathcal{X})$. Interestingly enough, the analysis will virtually parallel the analysis of supervised Bayesian learning, showing that the two problems are formally very similar.
We begin with an explicit statement of our basic assumptions. We assume that:

1. The number of classes is known.
2. The a priori probabilities $P(\omega_j)$ for each class are known, $j = 1, \ldots, c$.
3. The forms for the class-conditional probability densities $p(x \mid \omega_j, \theta_j)$ are known, $j = 1, \ldots, c$, but the parameter vector $\theta = (\theta_1, \ldots, \theta_c)$ is not known.
4. Part of our knowledge about $\theta$ is contained in a known a priori density $p(\theta)$.
5. The rest of our knowledge about $\theta$ is contained in a set $\mathcal{X}$ of $n$ samples $x_1, \ldots, x_n$ drawn independently from the mixture density

$$p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\,P(\omega_j). \tag{1}$$
`
At this point we could go directly to the calculation of $p(\theta \mid \mathcal{X})$. However, let us first see how this density is used to determine the Bayes classifier. Suppose that a state of nature is selected with probability $P(\omega_i)$ and a feature vector $x$ is selected according to the probability law $p(x \mid \omega_i, \theta_i)$. To derive the Bayes classifier we must use all of the information at our disposal to compute the a posteriori probability $P(\omega_i \mid x)$. We exhibit the role of the samples explicitly by writing this as $P(\omega_i \mid x, \mathcal{X})$. By Bayes rule,

$$P(\omega_i \mid x, \mathcal{X}) = \frac{p(x \mid \omega_i, \mathcal{X})\,P(\omega_i \mid \mathcal{X})}{\displaystyle\sum_{j=1}^{c} p(x \mid \omega_j, \mathcal{X})\,P(\omega_j \mid \mathcal{X})}.$$
`
`
`
Since the selection of the state of nature $\omega_i$ was done independently of the previously drawn samples, $P(\omega_i \mid \mathcal{X}) = P(\omega_i)$, and we obtain

$$P(\omega_i \mid x, \mathcal{X}) = \frac{p(x \mid \omega_i, \mathcal{X})\,P(\omega_i)}{\displaystyle\sum_{j=1}^{c} p(x \mid \omega_j, \mathcal{X})\,P(\omega_j)}. \tag{18}$$
We introduce the unknown parameter vector by writing

$$p(x \mid \omega_i, \mathcal{X}) = \int p(x, \theta \mid \omega_i, \mathcal{X})\, d\theta = \int p(x \mid \theta, \omega_i, \mathcal{X})\, p(\theta \mid \omega_i, \mathcal{X})\, d\theta.$$

Since the selection of $x$ is independent of the samples, $p(x \mid \theta, \omega_i, \mathcal{X}) = p(x \mid \omega_i, \theta_i)$. Similarly, since knowledge of the state of nature when $x$ is selected tells us nothing about the distribution of $\theta$, $p(\theta \mid \omega_i, \mathcal{X}) = p(\theta \mid \mathcal{X})$. Thus we obtain

$$p(x \mid \omega_i, \mathcal{X}) = \int p(x \mid \omega_i, \theta_i)\, p(\theta \mid \mathcal{X})\, d\theta. \tag{19}$$

That is, our best estimate of $p(x \mid \omega_i)$ is obtained by averaging $p(x \mid \omega_i, \theta_i)$ over $\theta_i$. Whether or not this is a good estimate depends on the nature of $p(\theta \mid \mathcal{X})$, and thus our attention turns at last to that density.
`
6.5.2 Learning the Parameter Vector

Using Bayes rule, we can write

$$p(\theta \mid \mathcal{X}) = \frac{p(\mathcal{X} \mid \theta)\, p(\theta)}{\int p(\mathcal{X} \mid \theta)\, p(\theta)\, d\theta}, \tag{20}$$

where the independence of the samples yields

$$p(\mathcal{X} \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta). \tag{21}$$

Alternatively, letting $\mathcal{X}^n$ denote the set of $n$ samples, we can write Eq. (20) in the recursive form

$$p(\theta \mid \mathcal{X}^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid \mathcal{X}^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid \mathcal{X}^{n-1})\, d\theta}. \tag{22}$$
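As a small check (this step is not spelled out in the text), Eq. (22) follows from Eqs. (20) and (21) by factoring the last sample out of the product:

$$p(\theta \mid \mathcal{X}^n) = \frac{p(x_n \mid \theta)\,p(\mathcal{X}^{n-1} \mid \theta)\,p(\theta)}{\int p(x_n \mid \theta)\,p(\mathcal{X}^{n-1} \mid \theta)\,p(\theta)\,d\theta} = \frac{p(x_n \mid \theta)\,p(\theta \mid \mathcal{X}^{n-1})}{\int p(x_n \mid \theta)\,p(\theta \mid \mathcal{X}^{n-1})\,d\theta},$$

where the second equality follows from dividing the numerator and denominator by $\int p(\mathcal{X}^{n-1} \mid \theta)\,p(\theta)\,d\theta$ and applying Eq. (20) with $n-1$ samples.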
`
These are the basic equations for unsupervised Bayesian learning. Eq. (20) emphasizes the relation between the Bayesian and the maximum likelihood solutions. If $p(\theta)$ is essentially uniform over the region where $p(\mathcal{X} \mid \theta)$ peaks, then $p(\theta \mid \mathcal{X})$ peaks at the same place. If the only significant peak occurs at $\theta = \hat\theta$ and if the peak is very sharp, then Eqs. (19) and (18) yield

$$p(x \mid \omega_i, \mathcal{X}) \approx p(x \mid \omega_i, \hat\theta_i)$$

and

$$P(\omega_i \mid x, \mathcal{X}) \approx \frac{p(x \mid \omega_i, \hat\theta_i)\,P(\omega_i)}{\displaystyle\sum_{j=1}^{c} p(x \mid \omega_j, \hat\theta_j)\,P(\omega_j)}.$$

That is, these conditions justify the use of the maximum likelihood estimate as if it were the true value of $\theta$ in designing the Bayes classifier.
Of course, if $p(\theta)$ has been obtained by supervised learning using a large set of labelled samples, it will be far from uniform, and it will have a dominant influence on $p(\theta \mid \mathcal{X}^n)$ when $n$ is small. Eq. (22) shows how the observation of an additional unlabelled sample modifies our opinion about the true value of $\theta$, and emphasizes the ideas of updating and learning. If the mixture density $p(x \mid \theta)$ is identifiable, then each additional sample tends to sharpen $p(\theta \mid \mathcal{X}^n)$, and under fairly general conditions $p(\theta \mid \mathcal{X}^n)$ can be shown to converge (in probability) to a Dirac delta function centered at the true value of $\theta$. Thus, even though we do not know the categories of the samples, identifiability assures us that we can learn the unknown parameter vector $\theta$, and thereby learn the component densities $p(x \mid \omega_i, \theta_i)$.
This, then, is the formal Bayesian solution to the problem of unsupervised learning. In retrospect, the fact that unsupervised learning of the parameters of a mixture density is so similar to supervised learning of the parameters of a component density is not at all surprising. Indeed, if the component density is itself a mixture, there would appear to be no essential difference between the two problems.

However, there are some significant differences between supervised and unsupervised learning. One of the major differences concerns the problem of identifiability. With supervised learning, lack of identifiability merely means that instead of obtaining a unique parameter vector we obtain an equivalence class of parameter vectors. However, since all of these yield the same component density, lack of identifiability presents no theoretical difficulty. With unsupervised learning, lack of identifiability is much more serious. When $\theta$ cannot be determined uniquely, the mixture cannot be decomposed into its true components. Thus, while $p(x \mid \mathcal{X}^n)$ may still converge to $p(x)$, $p(x \mid \omega_i, \mathcal{X}^n)$ given by Eq. (19) will not in general converge to $p(x \mid \omega_i)$, and a theoretical barrier to learning exists.

Another serious problem for unsupervised learning is computational complexity. With supervised learning, the possibility of finding sufficient statistics allows solutions that are analytically pleasing and computationally feasible. With unsupervised learning, there is no way to avoid the fact that the samples are obtained from a mixture density,

$$p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\,P(\omega_j), \tag{1}$$

and this gives us little hope of ever finding simple exact solutions for $p(\theta \mid \mathcal{X})$. Such solutions are tied to the existence of a simple sufficient statistic, and the factorization theorem requires the ability to factor $p(\mathcal{X} \mid \theta)$ as

$$p(\mathcal{X} \mid \theta) = g(s, \theta)\,h(\mathcal{X}).$$
`
But from Eqs. (21) and (1),

$$p(\mathcal{X} \mid \theta) = \prod_{k=1}^{n}\left[\sum_{j=1}^{c} p(x_k \mid \omega_j, \theta_j)\,P(\omega_j)\right].$$

Thus, $p(\mathcal{X} \mid \theta)$ is the sum of $c^n$ products of component densities. Each term in this sum can be interpreted as the joint probability of obtaining the samples $x_1, \ldots, x_n$ bearing a particular labelling, with the sum extending over all of the ways that the samples could be labelled. Clearly, this results in a thorough mixture of $\theta$ and the $x$'s, and no simple factoring should be expected. An exception to this statement arises if the component densities do not overlap, so that as $\theta$ varies only one term in the mixture density is non-zero. In that case, $p(\mathcal{X} \mid \theta)$ is the product of the $n$ nonzero terms, and may possess a simple sufficient statistic. However, since that case allows the class of any sample to be determined, it actually reduces the problem to one of supervised learning, and thus is not a significant exception.
Another way to compare supervised and unsupervised learning is to substitute the mixture density for $p(x_n \mid \theta)$ in Eq. (22) and obtain

$$p(\theta \mid \mathcal{X}^n) = \frac{\displaystyle\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\,P(\omega_j)}{\displaystyle\int \sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\,P(\omega_j)\, p(\theta \mid \mathcal{X}^{n-1})\, d\theta}\; p(\theta \mid \mathcal{X}^{n-1}). \tag{23}$$

If we consider the special case where $P(\omega_1) = 1$ and all the other a priori probabilities are zero, corresponding to the supervised case in which all samples come from class $\omega_1$, then Eq. (23) simplifies to

$$p(\theta \mid \mathcal{X}^n) = \frac{p(x_n \mid \omega_1, \theta_1)}{\displaystyle\int p(x_n \mid \omega_1, \theta_1)\, p(\theta \mid \mathcal{X}^{n-1})\, d\theta}\; p(\theta \mid \mathcal{X}^{n-1}). \tag{24}$$
`
`
`
`
Let us compare Eqs. (23) and (24) to see how observing an additional sample changes our estimate of $\theta$. In each case we can ignore the denominator, which is independent of $\theta$. Thus, the only significant difference is that in the supervised case we multiply the "a priori" density for $\theta$ by the component density $p(x_n \mid \omega_1, \theta_1)$, while in the unsupervised case we multiply it by the mixture density $\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\,P(\omega_j)$. Assuming that the sample really did come from class $\omega_1$, we see that the effect of not knowing this category membership in the unsupervised case is to diminish the influence of $x_n$ on changing $\theta$. Since $x_n$ could have come from any of the $c$ classes, we cannot use it with full effectiveness in changing the component(s) of $\theta$ associated with any one category. Rather, we must distribute its effect over the various categories in accordance with the probability that it arose from each category.
`
6.5.3 An Example

Consider the one-dimensional, two-component mixture with $p(x \mid \omega_1) \sim N(\mu, 1)$, $p(x \mid \omega_2, \theta) \sim N(\theta, 1)$, where $\mu$, $P(\omega_1)$, and $P(\omega_2)$ are known. Here

$$p(x \mid \theta) = \frac{P(\omega_1)}{\sqrt{2\pi}}\exp\bigl[-\tfrac12(x - \mu)^2\bigr] + \frac{P(\omega_2)}{\sqrt{2\pi}}\exp\bigl[-\tfrac12(x - \theta)^2\bigr].$$

Viewed as a function of $x$, this mixture density is a superposition of two normal densities, one peaking at $x = \mu$ and the other peaking at $x = \theta$. Viewed as a function of $\theta$, $p(x \mid \theta)$ has a single peak at $\theta = x$. Suppose that the a priori density $p(\theta)$ is uniform from $a$ to $b$. Then after one observation

$$p(\theta \mid x_1) = \alpha\, p(x_1 \mid \theta)\, p(\theta) = \begin{cases} \alpha'\Bigl\{P(\omega_1)\exp\bigl[-\tfrac12(x_1 - \mu)^2\bigr] + P(\omega_2)\exp\bigl[-\tfrac12(x_1 - \theta)^2\bigr]\Bigr\} & a \le \theta \le b \\ 0 & \text{otherwise,} \end{cases}$$
`
where $\alpha$ and $\alpha'$ are normalizing constants, independent of $\theta$. If the sample $x_1$ is in the range $a \le x_1 \le b$, then $p(\theta \mid x_1)$ peaks at $\theta = x_1$. Otherwise it peaks either at $\theta = a$ if $x_1 < a$ or at $\theta = b$ if $x_1 > b$. Note that the additive constant $\exp[-\tfrac12(x_1 - \mu)^2]$ is large if $x_1$ is near $\mu$, and thus the peak of $p(\theta \mid x_1)$ is less pronounced if $x_1$ is near $\mu$. This corresponds to the fact that if $x_1$ is near $\mu$, it is more likely to have come from the $p(x \mid \omega_1)$ component, and hence its influence on our estimate for $\theta$ is diminished.
`
`
`
With the addition of a second sample $x_2$, $p(\theta \mid x_1)$ changes to

$$p(\theta \mid x_1, x_2) = \beta\, p(x_2 \mid \theta)\, p(\theta \mid x_1) = \begin{cases} \beta'\bigl\{P(\omega_1)P(\omega_1)\exp[-\tfrac12(x_1-\mu)^2 - \tfrac12(x_2-\mu)^2] \\ \quad {}+ P(\omega_1)P(\omega_2)\exp[-\tfrac12(x_1-\mu)^2 - \tfrac12(x_2-\theta)^2] \\ \quad {}+ P(\omega_2)P(\omega_1)\exp[-\tfrac12(x_1-\theta)^2 - \tfrac12(x_2-\mu)^2] \\ \quad {}+ P(\omega_2)P(\omega_2)\exp[-\tfrac12(x_1-\theta)^2 - \tfrac12(x_2-\theta)^2]\bigr\} & a \le \theta \le b \\ 0 & \text{otherwise.} \end{cases}$$
`
Unfortunately, the primary thing we learn from this expression is that $p(\theta \mid \mathcal{X}^n)$ is already complicated when $n = 2$. The four terms in the sum correspond to the four ways in which the samples could have been drawn from the two component populations. With $n$ samples there will be $2^n$ terms, and no simple sufficient statistics can be found to facilitate understanding or to simplify computations.
It is possible to use the relation

$$p(\theta \mid \mathcal{X}^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid \mathcal{X}^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid \mathcal{X}^{n-1})\, d\theta}$$

and numerical integration to obtain an approximate numerical solution for $p(\theta \mid \mathcal{X}^n)$. This was done for the data in Table 6-1 using the values $\mu = -2$, $P(\omega_1) = 1/3$, and $P(\omega_2) = 2/3$. An a priori density $p(\theta)$ uniform from $-4$ to $4$ encompasses the data in that table. When this was used to start the recursive computation of $p(\theta \mid \mathcal{X}^n)$, the results shown in Figure 6.5 were obtained. As $n$ goes to infinity we can confidently expect $p(\theta \mid \mathcal{X}^n)$ to approach an impulse centered at $\theta = 2$. This graph gives some idea of the rate of convergence.

[FIGURE 6.5. Unsupervised Bayesian learning.]
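A computation of this kind can be sketched in a few lines. The Python fragment below is not from the text; it evaluates the recursive update on a discrete grid of $\theta$ values for the mixture of this example, and the unlabelled samples are generated synthetically, since Table 6-1 is not reproduced here.

```python
import numpy as np

# Example mixture: p(x | omega_1) ~ N(mu, 1) with mu known; p(x | omega_2, theta) ~ N(theta, 1).
mu, p1, p2 = -2.0, 1.0 / 3.0, 2.0 / 3.0
theta = np.linspace(-4.0, 4.0, 801)          # grid covering the support of the uniform prior
dtheta = theta[1] - theta[0]
posterior = np.ones_like(theta)              # p(theta) uniform on [-4, 4] (up to normalization)

def mixture(x, theta):
    """p(x | theta) for the two-component normal mixture."""
    return (p1 * np.exp(-0.5 * (x - mu) ** 2) +
            p2 * np.exp(-0.5 * (x - theta) ** 2)) / np.sqrt(2.0 * np.pi)

# Synthetic unlabelled samples standing in for Table 6-1 (true theta taken to be 2).
rng = np.random.default_rng(2)
from_first = rng.random(25) < p1
x = np.where(from_first, rng.normal(mu, 1.0, 25), rng.normal(2.0, 1.0, 25))

for xk in x:                                 # recursive application of Eq. (22)
    posterior *= mixture(xk, theta)          # multiply by p(x_n | theta)
    posterior /= posterior.sum() * dtheta    # normalize by numerical integration
# posterior now approximates p(theta | X^n); it should peak near theta = 2.
```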
One of the main differences between the Bayesian and the maximum likelihood approaches to unsupervised learning appears in the presence of the a priori density $p(\theta)$. Figure 6.6 shows how $p(\theta \mid \mathcal{X}^n)$ changes when $p(\theta)$ is assumed to be uniform from 1 to 3, corresponding to more certain initial knowledge about $\theta$. The results of this change are most pronounced when $n$ is small. It is here also that the differences between the Bayesian and the maximum likelihood solutions are most significant. As $n$ increases, the importance of prior knowledge diminishes, and in this particular case the curves for $n = 25$ are virtually identical. In general, one would expect the difference to be small when the number of unlabelled samples is several times the effective number of labelled samples used to determine $p(\theta)$.

[FIGURE 6.6. The effect of narrowing the a priori density.]
`
6.5.4 Decision-Directed Approximations

Although the problem of unsupervised learning can be stated as merely the problem of estimating parameters of a mixture density, neither the maximum likelihood nor the Bayesian approach yields analytically simple results. Exact solutions for even the simplest nontrivial examples lead to computational requirements that grow exponentially with the number of samples. The problem of unsupervised learning is too important to abandon just because exact solutions are hard to find, however, and numerous procedures for obtaining approximate solutions have been suggested.
Since the basic difference between supervised and unsupervised learning is the presence or absence of labels for the samples, an obvious approach to unsupervised learning is to use the a priori information to design a classifier and to use the decisions of this classifier to label the samples. This is called the decision-directed approach to unsupervised learning, and it is subject to many variations. It can be applied sequentially by updating the classifier each time an unlabelled sample is classified. Alternatively, it can be applied in parallel by waiting until all $n$ samples are classified before updating the classifier. If desired, this process can be repeated until no changes occur in the way the samples are labelled.* Various heuristics can be introduced to make the extent of any corrections depend upon the confidence of the classification decision.
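As a rough illustration of the sequential variant, the sketch below (not a procedure from the text) classifies each incoming sample with the current means and immediately updates the mean of the chosen class; the two-class normal setting, the pseudo-count initialization, and all numerical values are assumptions made only for this example.

```python
import numpy as np

def decision_directed(x, mu):
    """Sequential decision-directed updating: classify each unlabelled sample with the
    current classifier (here, the nearest mean) and immediately update that class's mean."""
    mu = np.array(mu, float)
    counts = np.ones(len(mu))                 # pseudo-counts keep the earliest updates stable
    for xk in x:
        i = int(np.argmin((xk - mu) ** 2))    # decision: label xk with the closest mean
        counts[i] += 1.0
        mu[i] += (xk - mu[i]) / counts[i]     # running-mean update for the chosen class only
    return mu

# Illustrative stream of unlabelled one-dimensional samples (all values are made up).
rng = np.random.default_rng(3)
x = rng.permutation(np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(2.0, 1.0, 100)]))
print(decision_directed(x, [-0.5, 0.5]))
```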
There are some obvious dangers associated with the decision-directed approach. If the initial classifier is not reasonably good, or if an unfortunate sequence of samples is encountered, the errors in classifying the unlabelled samples can drive the classifier the wrong way, resulting in a solution corresponding roughly to one of the lesser peaks of the likelihood function. Even if the initial classifier is optimal, the resulting labelling will not in general be the same as the true class membership; the act of classification will exclude samples from the tails of the desired distribution, and will include samples from the tails of the other distributions. Thus, if there is significant
`
* The Basic Isodata procedure described in Section 6.4.4 is essentially a decision-directed procedure of this type.
`
`