11 Clustering

Fig. 11-15 An example of the valley-seeking clustering.
into $\omega_6$ results in a shift of the boundary toward the $\omega_5$-side, as the arrow indicates. This is equivalent to saying that the direction of the density gradient is estimated by using the numbers of samples on both sides of the boundary, and the boundary is moved toward the lower density side. Applying this process to all boundaries repeatedly, the leftmost boundary of Fig. 11-15 moves out to $-\infty$, leaving the $\omega_1$-cluster empty, the second and third leftmost boundaries merge into one, making the $\omega_3$-cluster empty, and so on. As a result, at the end of iteration, only $\omega_2$, $\omega_4$, and $\omega_6$ keep many samples and the others become empty. The procedure works in the same way even in a high-dimensional space.
A number of comments can be made about this iterative valley-seeking procedure. The procedure is nonparametric, and divides samples according to the valleys of a density function. The density gradient is estimated, but in a crude way. The number of clusters must be preassigned, but we can always assign a larger number of clusters than we actually expect to have. Many of the initial clusters could become empty, and only the true clusters separated by the valleys will keep samples. As far as computation time is concerned, it takes a lot of computer time to find the neighbors of each sample and form the table of Fig. 11-14. However, this operation is common to all nonparametric clustering procedures, including the graph theoretic approach. The iterative process of this algorithm revises only the class assignment according to the majority of classes in $\Gamma(X)$. This operation does not take much computation time.
The volume of $\Gamma(X)$ affects the performance of this algorithm, just as it did in the graph theoretic approach. The optimum volume should be determined experimentally, as was done for the graph theoretic approach.
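As a concrete illustration, the reassignment step can be coded directly from this description. The following minimal Python sketch is not the book's implementation; it assumes $\Gamma(X)$ is a Euclidean hypersphere of radius r and that an initial assignment of integer cluster labels is given.

```python
import numpy as np

def valley_seek(X, labels, r, max_iter=100):
    """Iterative valley-seeking: reassign each sample to the majority
    class among the samples inside its local region Gamma(X_i), taken
    here as a hypersphere of radius r.  labels holds integer cluster
    indices; clusters separated by a valley gradually become empty."""
    N = len(X)
    # The costly step: build the neighbor table (cf. Fig. 11-14) once.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    nbrs = [np.flatnonzero((d2[i] <= r * r) & (np.arange(N) != i))
            for i in range(N)]
    labels = np.asarray(labels).copy()
    for _ in range(max_iter):
        new = labels.copy()
        for i in range(N):
            if nbrs[i].size:                 # majority vote within Gamma(X_i)
                new[i] = np.bincount(labels[nbrs[i]]).argmax()
        if np.array_equal(new, labels):      # assignment stable: done
            break
        labels = new
    return labels
```

Preassigning more clusters than needed is harmless here: labels that lose every majority vote simply disappear from the returned assignment.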
Experiment 4: Seventy-five samples per class were generated according to (11.82), where $n_1$ and $n_2$ are independent and normally distributed with zero mean and unit variance for both $\omega_1$ and $\omega_2$, $[m_1\ m_2]$ is $[0\ 0]$ for $\omega_1$ and $[0\ {-20}]$ for $\omega_2$, and $\theta$ is normally distributed with $E\{\theta|\omega_1\} = \pi$, $E\{\theta|\omega_2\} = 0$, and $\mathrm{Var}\{\theta|\omega_1\} = \mathrm{Var}\{\theta|\omega_2\} = (\pi/4)^2$. After the data was whitened with respect to the mixture covariance matrix, both the graph theoretic and iterative valley-seeking algorithms were applied, resulting in the same clustering result, as shown in Fig. 11-16 [13-14].
Fig. 11-16 Classification of a two-class example.
Table 11-3 shows the selected radius for $\Gamma(X)$, and the number of iterations required to reach the final clusters in the iterative valley-seeking algorithm [14].
TABLE 11-3

PERFORMANCE OF THE VALLEY-SEEKING ALGORITHM

Experiment   Number of samples   Number of clusters   Radius of Γ(X)   Number of iterations
    4               150                  2                  1.0                  8
    5               150                  3                  0.75                10
Experiment 5: Fifty samples per class were generated from three classes. Two of them were the same as in Experiment 4, except that $[m_1\ m_2] = [20\ 0]$ for $\omega_2$ instead of $[0\ {-20}]$. The third distribution was normal with the mean and covariance matrix of (11.83). Again, after the data was whitened with respect to the mixture covariance matrix, both the graph theoretic and iterative valley-seeking algorithms produced the same clustering result, as shown in Fig. 11-17 [13-14].
Fig. 11-17 Classification of a three-class example.
The radius of $\Gamma(X)$ and the number of iterations in the latter algorithm are listed in Table 11-3 [14].
General comments: We have discussed both parametric and nonparametric clustering techniques. The question here is which technique is better under what conditions. Generally speaking, nonparametric techniques do not require knowledge of the number of clusters beforehand. This is certainly true for the graph theoretic approach. Even for the iterative valley-seeking procedure, we can preassign a larger number of clusters than what is actually needed, and let the procedure select the necessary number of clusters automatically. Therefore, if we do not have any a priori knowledge about the data structure, it is most natural to adopt a nonparametric clustering technique and find out how the data is naturally divided into clusters.
However, nonparametric procedures are in general very sensitive to the control parameters, especially the size of the local region. Therefore, it is necessary to run experiments for a wide range of sizes, and the results must be carefully examined. Also, nonparametric procedures are normally very computationally intensive.
Furthermore, nonparametric clustering techniques have two fundamental flaws, described as follows.

(1) We cannot divide a distribution into a number of clusters unless the valley actually exists. For example, the $\omega_1$-distribution of Fig. 11-9 could be obtained as the result of the valley-seeking procedure, but the distribution cannot be divided into 2 or 3 clusters by any nonparametric method even if it is desirable to do so. When a distribution is wrapped like the $\omega_1$-distribution of Fig. 11-9, it is sometimes preferred to decompose the distribution into several normal distributions for further analysis of the data structure or for designing a classifier.
On the other hand, the parametric procedures do not depend on the natural boundary of clusters, but depend only on the criterion. With a preassigned number of clusters, the procedures seek the boundary to optimize the criterion value. Therefore, we can divide the $\omega_1$-distribution of Fig. 11-9 into 2, 3, or 4 clusters as we like. After examining the clustering results for various numbers of clusters, we may decide which number of clusters is most appropriate for a particular application. Previously, we stated that it is a disadvantage of parametric clustering techniques to have to know the number of clusters beforehand. But that property may sometimes work as an advantage, as discussed above.
(2) As seen in Fig. 11-7(b), the resulting clusters of nonparametric clustering procedures contain the tails of other distributions and do not contain their own tails. Thus, the mean and covariance matrix of each cluster do not represent the true ones of the underlying distribution. In order to obtain proper parameters of the underlying distributions, we need to impose a parametric structure such as a summation of normal distributions.
11.3 Selection of Representatives
In the previous sections, we have discussed algorithms to find clusters from a distribution. A similar but slightly different problem is to reduce the number of samples while maintaining the structure of the distribution. This may be viewed as selecting a small number of representatives from a distribution. Besides clustering, the reduction of sample size has a number of applications. A typical example is to divide a set of available samples into design and test sets. In this case, we must assure that the design and test distributions are similar.
Nonparametric Data Reduction
Generally, it is not appropriate to assume a mathematical structure when the sample size is reduced. Therefore, a nonparametric procedure must be adopted to guide the operation.
Data reduction algorithm [16-17]: Let us study the Parzen approach to the problem of sample size reduction. Given N samples drawn from p(X), we wish to select Q samples (Q < N) so that the Parzen density estimates for the N sample set and the Q sample set are close.
Let $\hat{p}_N(X)$ be the density estimate based on the N samples. Then

$$\hat{p}_N(X) = \frac{1}{N}\sum_{i=1}^{N}\kappa(X - X_i), \qquad(11.84)$$

where $\kappa(\cdot)$ is the kernel function. Similarly, when Q representatives, $Y_1,\ldots,Y_Q$, are selected, the density function is estimated by

$$\hat{p}_Q(X) = \frac{1}{Q}\sum_{i=1}^{Q}\kappa(X - Y_i). \qquad(11.85)$$
In order to measure the similarity between $\hat{p}_N(X)$ and $\hat{p}_Q(X)$, the entropy criterion of Chapter 9 is used. Replacing $p_1(X)$ and $p_2(X)$ of (9.53) by $\hat{p}_Q(X)$ and $\hat{p}_N(X)$ respectively, we obtain the entropy inequality as

$$\int[-\ln\hat{p}_Q(X)]\,\hat{p}_N(X)\,dX \;\ge\; \int[-\ln\hat{p}_N(X)]\,\hat{p}_N(X)\,dX, \qquad(11.86)$$

where the equality holds only for $\hat{p}_Q(X) = \hat{p}_N(X)$. Thus, the best $\hat{p}_Q(X)$ may be found by minimizing the lefthand side of (11.86). The criterion may be simplified by approximating the expectation with respect to $\hat{p}_N(X)$ in (11.86) by the sample mean over the existing samples $X_1,\ldots,X_N$, as

$$J = \frac{1}{N}\sum_{i=1}^{N}\left[-\ln\hat{p}_Q(X_i)\right]. \qquad(11.87)$$
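For concreteness, J of (11.87) is a one-line computation once the kernel values are available. The sketch below assumes a normal kernel with covariance $r^2 I$, an illustrative choice.

```python
import numpy as np

def entropy_J(X, reps, r=1.0):
    """J of (11.87): sample mean of -ln p_Q(X_i) over all N samples,
    with a normal kernel of covariance r^2 * I (illustrative choice)."""
    N, n = X.shape
    d2 = ((X[:, None, :] - reps[None, :, :]) ** 2).sum(axis=-1)
    const = (2.0 * np.pi * r * r) ** (-n / 2.0)
    pQ = const * np.exp(-d2 / (2.0 * r * r)).mean(axis=1)   # (11.85)
    return -np.log(pQ).mean()
```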
In order to find the best Q representatives from the existing samples $X_1,\ldots,X_N$, we would like to minimize J over all possible Q-element subsets of the original N-element set. Unfortunately, an exhaustive search of all $\binom{N}{Q}$ subsets is not computationally feasible. Instead, we will settle for the minimum J over subsets formed by replacing one element of the representative set by the best candidate not yet selected.
An iterative procedure is as follows:

(1) Select an initial assignment of Q samples from the N sample data set. Call the Q sample set STORE and the remaining N−Q samples TEST.

(2) For each element, $X_j$, in TEST, compute the change in J that results if the sample is transferred to STORE:

$$\Delta J_1(X_j) = \frac{1}{N}\sum_{i=1}^{N}\left[\ln\!\left\{\frac{1}{Q}\sum_{l=1}^{Q}\kappa(X_i - Y_l)\right\} - \ln\!\left\{\frac{1}{Q+1}\left[\sum_{l=1}^{Q}\kappa(X_i - Y_l) + \kappa(X_i - X_j)\right]\right\}\right]. \qquad(11.88)$$
(3) Pick the element, $X_j$, in TEST corresponding to the smallest $\Delta J_1$ (and call it $X_j'$).

(4) For each element, $X_l$, in STORE, compute the change in J, $\Delta J_2(X_l)$ of (11.89), that results if the sample is transferred to TEST.

(5) Find the element, $X_l$, corresponding to the smallest $\Delta J_2$ (and call it $X_l'$).

(6) The change of J due to these two operations is $\Delta J = \Delta J_1(X_j') + \Delta J_2(X_l')$. In order to minimize J, we would like to have $\Delta J < 0$. If $X_j'$ exists to satisfy $\Delta J < 0$, transfer $X_j'$ to STORE, transfer $X_l'$ to TEST, and go to Step (2).

(7) Otherwise, find the element, $X_j$, corresponding to the next smallest $\Delta J_1$ (and call it $X_j'$).

(8) If $X_j'$ exists, go to Step (4).

(9) Otherwise, stop.
Generally, this kind of iterative process produces a result which depends on the initial selection of the Q representatives in STORE. However, Steps (7) and (8) allow us to search more possible combinations of $X_j$ and $X_l$, and thus insure that the final representative set is less dependent on the initial assignment.
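The steps above translate almost line-for-line into code. The sketch below makes two assumptions not fixed by the text: a normal kernel (whose normalizing constant cancels inside the log-ratios of $\Delta J_1$ and $\Delta J_2$), and $\Delta J_2$ written as the removal analogue of (11.88); the function names are illustrative.

```python
import numpy as np

def kernel(X, r=1.0):
    """All pairwise kernel values kappa(X_i - X_j).  The Gaussian
    normalizing constant is omitted: it cancels in Delta-J below."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * r * r))

def select_representatives(X, store, r=1.0):
    """Steps (1)-(9): swap representatives while J of (11.87) decreases."""
    N = len(X)
    store = list(store)                      # Step (1): initial STORE
    test = [i for i in range(N) if i not in store]
    K = kernel(X, r)
    while True:
        Q = len(store)
        s = K[:, store].sum(axis=1)          # kernel sums to current STORE
        # Step (2): Delta-J1 of (11.88) for every TEST element.
        dJ1 = {j: np.mean(np.log(s / Q) - np.log((s + K[:, j]) / (Q + 1)))
               for j in test}
        for j in sorted(dJ1, key=dJ1.get):   # Steps (3), (7), (8)
            s1 = s + K[:, j]                 # kernel sums with X_j added
            # Step (4): Delta-J2 (removal analogue of (11.88)).
            dJ2 = {l: np.mean(np.log(s1 / (Q + 1)) - np.log((s1 - K[:, l]) / Q))
                   for l in store}
            l = min(dJ2, key=dJ2.get)        # Step (5)
            if dJ1[j] + dJ2[l] < 0:          # Step (6): J strictly decreases
                store.remove(l); store.append(j)
                test.remove(j); test.append(l)
                break
        else:                                # Step (9): no improving swap
            return store
```

The for/else construct returns as soon as a full pass over the $\Delta J_1$-ordered candidates produces no improving swap, which is exactly the stopping rule of Steps (7)-(9).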
The kNN approach also can be applied in a similar way as the Parzen approach. The kNN density estimate is

$$\hat{p}_N(X) = \frac{k-1}{N c\, d_N^n(X)}, \qquad(11.90)$$

where $d_N(X)$ is the distance from X to its kth NN among the N samples. The same formula is used when Q samples are used to estimate the density function. Thus, the entropy criterion of (11.87) becomes

$$J = \frac{1}{N}\sum_{i=1}^{N}\left[-\ln\hat{p}_Q(X_i)\right] = \frac{1}{N}\sum_{i=1}^{N}\left[n\ln d_Q(X_i) + \ln\frac{cQ}{k-1}\right]. \qquad(11.91)$$

The same iterative procedure can be used to find the best Q samples by minimizing J of (11.91).
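A minimal sketch of evaluating (11.91) for a candidate subset, assuming Euclidean distance and dropping the additive constant $\ln(cQ/(k-1))$, which is common to all subsets of the same size Q.

```python
import numpy as np

def knn_criterion(X, rep_idx, k=3):
    """J of (11.91) up to its additive constant: the mean of
    n*ln d_Q(X_i), where d_Q(X_i) is the distance from X_i to its
    kth NN among the Q representatives (requires Q >= k)."""
    n = X.shape[1]
    reps = X[rep_idx]
    d2 = ((X[:, None, :] - reps[None, :, :]) ** 2).sum(axis=-1)
    d2.sort(axis=1)
    dk = np.sqrt(d2[:, k - 1])       # kth-NN distance for every sample
    return np.mean(n * np.log(dk))
```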
As in most iterative techniques, a good initial guess usually is of considerable importance. The following is a method to perform the initial assignment when Q = N/2.
An initial assignment procedure: We discuss a procedure to generate an initial assignment for the case with Q = N/2. The basis of the procedure lies in an intuitive observation: if a subset of size N/2, called STORE, is optimally chosen, then we suspect that, on average, the NN of a sample, X, in STORE might be the second NN of X among all N samples.

The procedure is best explained with the aid of Fig. 11-18. In Fig. 11-18(a) the NN of $X_1$ is $X_{23}$, the NN of $X_{23}$ is $X_{17}$, and $X_{17}$ and $X_{40}$ are mutual NN's. Figures 11-18(b) and (c) represent two other possible NN progressions.
Fig. 11-18 Three examples of NN progressions.
We may form the initial assignment by simply assigning every other sample in a NN progression to the initial assignment of STORE. Thus for Fig. 11-18(a) the initial assignment for STORE would consist of either $X_1$ and $X_{17}$, or $X_{23}$ and $X_{40}$. We now state the procedure.
(1) Get the first sample, and follow the complete progression of its NN's, assigning alternate samples to STORE. Flag every sample in the progression.

(2) Find the next unflagged sample. Follow its NN progression, assigning alternate samples to STORE. This is done until the progression ends or leads to a previously flagged sample. Flag every sample in the progression.

(3) Repeat Step (2) until all samples are flagged.
If sufficient storage is available to store a table of kNN's for each sample (k = 4 or 5), the entire process of initial assignment and criterion minimization can be done very efficiently.
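The three steps might be coded as follows; the 1-NN table and Euclidean distance are illustrative choices, not requirements of the procedure.

```python
import numpy as np

def initial_assignment(X):
    """Initial STORE (about N/2 samples) by NN progressions: walk the
    1-NN chain from each unflagged sample, assigning alternate samples
    of the walk to STORE, until a flagged sample is reached."""
    N = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    nn = d2.argmin(axis=1)              # NN table (k = 1 suffices here)
    flagged = np.zeros(N, dtype=bool)
    store = []
    for start in range(N):              # Steps (1)-(3)
        i, take = start, True
        while not flagged[i]:           # follow the NN progression
            flagged[i] = True
            if take:
                store.append(i)
            take = not take
            i = nn[i]                   # stops at a flagged sample
    return store
```

For the progression of Fig. 11-18(a), the walk $X_1 \to X_{23} \to X_{17} \to X_{40}$ places $X_1$ and $X_{17}$ in STORE, matching the example in the text.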
Reduced Parzen classifier [17]: In pattern recognition, the quadratic classifier is very popular. However, in practice with non-normal distributions, it is frequently observed that the error of a quadratic classifier is much larger than the Bayes error estimated by a nonparametric technique. On the other hand, nonparametric classifiers are too complex and time-consuming for on-line operation. Thus, there is a need to fill the gap between these two kinds of classifiers.

One possible solution is to divide each class into several clusters, and design a piecewise quadratic classifier as in (4.149). This approach contains both quadratic and nonparametric classifiers as special cases. If each class has only one cluster, it becomes a quadratic classifier, while, if each sample is viewed as a cluster, it becomes a Parzen classifier.
The most elaborate procedure to design a piecewise quadratic classifier would be to decompose the distribution of each class into several normal distributions as shown in (11.45). Note that only this method gives reasonable estimates of the a priori probabilities, means, and covariance matrices. The other clustering procedures fail to do so, because they divide samples as in Fig. 11-7(b) instead of Fig. 11-7(a). The entire clustering operation must be repeated by preassigning various numbers of clusters. The resulting classification error for each preassigned number of clusters is estimated and compared with the Bayes error estimated by a nonparametric technique. The final number of clusters must be as small as possible while maintaining an error close to the Bayes error.
A somewhat simpler approach is to select Q representatives from each class as discussed in this section, and to set a kernel function at each representative to form the Parzen density estimate as in (11.85). The classification can be carried out by comparing the density estimates for $\omega_1$ and $\omega_2$ as

$$-\ln\hat{p}_1(X) + \ln\hat{p}_2(X) \;\lessgtr\; t, \qquad(11.92)$$

classifying X to $\omega_1$ if the left-hand side is smaller than t, and to $\omega_2$ otherwise. This classifier is called the reduced Parzen classifier. It is generally too complex to select a different kernel function for each representative. So, we may use a normal kernel function with the covariance matrix $r^2\Sigma_i$, where $\Sigma_i$ is the global covariance matrix of $\omega_i$ and r is a parameter controlling the size of the kernel. This approach is simpler because a common kernel function is used within the same class and is prespecified, instead of being estimated and varying locally, as in the approach of (11.45). As discussed in Chapter 7, the sensitive parameters are the size of the kernel, r, the decision threshold, t of (11.92), and the estimate of $\Sigma_i$.
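A sketch of the resulting decision rule, assuming normal kernels with covariance $r^2\Sigma_i$ and already-selected representatives; the threshold t would be tuned on the design samples, as in Experiment 6 below.

```python
import numpy as np

def log_parzen(X, reps, cov, r):
    """Log of the Parzen estimate (11.85) with a normal kernel of
    covariance r^2 * cov.  (A log-sum-exp would be safer numerically.)"""
    n = X.shape[1]
    S = r * r * cov
    Si = np.linalg.inv(S)
    _, logdet = np.linalg.slogdet(S)
    diff = X[:, None, :] - reps[None, :, :]            # (N, Q, n)
    m2 = np.einsum('ijk,kl,ijl->ij', diff, Si, diff)   # squared Mahalanobis
    logk = -0.5 * (m2 + n * np.log(2 * np.pi) + logdet)
    return np.log(np.exp(logk).mean(axis=1))

def reduced_parzen_classify(X, reps1, cov1, reps2, cov2, r, t=0.0):
    """Assign omega_1 where -ln p1(X) + ln p2(X) < t, as in (11.92)."""
    h = -log_parzen(X, reps1, cov1, r) + log_parzen(X, reps2, cov2, r)
    return np.where(h < t, 1, 2)
```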
In order to verify the above argument, two experimental results are presented.
Experiment 6: The reduced Parzen classifier for Data I-A was designed and tested as follows:

(1) One hundred samples per class were generated from Data I-A, and Experiment 7-6 was repeated. From Fig. 7-9, r = 2 was chosen as the kernel size.

(2) The sample reduction algorithm of this section was applied to select Q representatives from the 100.

(3) Using the Parzen density estimates of (11.85) for $\omega_1$ and $\omega_2$ with r = 2, the original 100 samples per class were classified as in (11.92). The threshold t was selected so as to minimize the classification error. The optimum t is different for each different value of Q.

(4) Independently, 100 samples per class were generated and tested by the classifier designed above.
Figure 11-19 shows the plot of the error vs. Q [17]. The error curve is the average of 10 trials, and the standard deviations are shown by vertical bars. Note that the above error estimation is based on the holdout method, in which design and test samples are independent, so that the error becomes larger than the Bayes error (1.9%). The error curve is almost flat up to one representative. For normal distributions, selecting the expected vector as the one representative from each class and the covariance matrix as the kernel covariance, the reduced Parzen classifier becomes quadratic, which is the Bayes classifier. So, the error curve of Fig. 11-19 should be flat down to Q = 1. However, since we select representatives from the existing design samples, they may or may not be close to the expected vector. If not, we see that the error curve is flat up to Q = 2 or 3 and starts to increase for smaller Q's.
Fig. 11-19 The error of the reduced Parzen classifier for a normal case.
Experiment 7: In order to test a non-normal case, the data of Experiment 7-7 was studied. The data is 8-dimensional, and each class consists of two normal distributions. From Fig. 7-10, r = 1.5 was chosen. The procedure of Experiment 6 was repeated for this data, and the result is shown in Fig. 11-20 [17]. Figure 11-20 shows that the error curve is flat for $Q \ge 6$. We found that, when Q = 6, three representatives are selected from each cluster.

These experiments suggest an interesting fact. It has been believed that a nonparametric procedure needs a large number of samples for high-dimensional data in order to reliably estimate the Bayes error. Any nonparametric operation with a large number of samples requires a large amount of computer time. The results of the experiments in this section contradict these common beliefs, and suggest that we may need only a relatively small number of samples (or representatives) after all. However, as Experiment 7-10 and Table 7-3(b) suggest, we may still need a large number of samples to estimate the covariance matrices, which are used to form the kernel functions.
Fig. 11-20 The error of the reduced Parzen classifier for a non-normal case.
Parametric Data Reduction
In parametric approaches, our main concern is to maintain the expected vector and autocorrelation (or covariance) matrix while reducing the number of samples.
Representation of a sample matrix: Let us express the N samples in matrix form as

$$U = [X_1 \ X_2 \ \ldots \ X_N], \qquad(11.93)$$

which is an $n\times N$ rectangular matrix, called the sample matrix. Then, the sample autocorrelation matrix can be expressed by

$$\hat{S} = \frac{1}{N}\sum_{i=1}^{N}X_iX_i^T = \frac{1}{N}UU^T. \qquad(11.94)$$

Since $\hat{S}$ is an $n\times n$ matrix, $\hat{S}$ has the $n\times n$ eigenvalue and eigenvector matrices, $\Lambda$ and $\Phi$, such that

$$\hat{S}\Phi = \Phi\Lambda. \qquad(11.95)$$
Or

$$(UU^T)\Phi = \Phi(N\Lambda) = \Phi\mu, \qquad(11.96)$$

where $\mu$ is the eigenvalue matrix of $UU^T$ and $\mu = N\Lambda$. Multiplying by $U^T$ from the left side, we can convert (11.96) to

$$(U^TU)(U^T\Phi) = (U^T\Phi)\mu, \qquad(11.97)$$

where $U^TU$ is an $N\times N$ matrix, $U^T\Phi$ consists of the n eigenvectors of $U^TU$, and the components of $\mu$ are the n eigenvalues of $U^TU$. Since $(U^T\Phi)^T(U^T\Phi) = \Phi^TUU^T\Phi = \mu \ne I$, we change the scales of the eigenvectors by multiplying by $\mu^{-1/2}$ from the right side, so that (11.97) can be rewritten as

$$(U^TU)\Psi = \Psi\mu, \qquad(11.98)$$

where

$$\Psi = U^T\Phi\mu^{-1/2}. \qquad(11.99)$$

That is, $\Psi$ consists of the n eigenvectors of $U^TU$, and satisfies the orthonormal condition

$$\Psi^T\Psi = \mu^{-1/2}\Phi^TUU^T\Phi\mu^{-1/2} = I. \qquad(11.100)$$

From (11.99), U can be expressed in terms of $\Phi$, $\Psi$, and $\Lambda$ as

$$U = \Phi\mu^{1/2}\Psi^T = \sqrt{N}\,\Phi\Lambda^{1/2}\Psi^T. \qquad(11.101)$$

This expression of a sample matrix is called the singular value decomposition [18]. Singular value decomposition is an effective technique to represent a rectangular (non-square) matrix by eigenvalues and eigenvectors.
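The chain (11.94)-(11.101) is easy to verify numerically; the sketch below follows it step by step with numpy, whose eigh supplies $\Lambda$ and $\Phi$ of (11.95).

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 3, 10
U = rng.standard_normal((n, N))     # sample matrix: one sample per column

S = U @ U.T / N                     # sample autocorrelation matrix (11.94)
lam, Phi = np.linalg.eigh(S)        # Lambda and Phi of (11.95)
mu = N * lam                        # mu = N * Lambda, as in (11.96)
Psi = U.T @ Phi / np.sqrt(mu)       # Psi = U^T Phi mu^(-1/2), (11.99)

print(np.allclose(Psi.T @ Psi, np.eye(n)))         # (11.100) holds
print(np.allclose((Phi * np.sqrt(mu)) @ Psi.T, U)) # (11.101) recovers U
```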
Data reduction: Now our problem is to reduce the number of samples while maintaining the expected vector and autocorrelation matrix. Let us introduce another sample matrix V which consists of Q sample vectors. In order that both U and V share the same autocorrelation matrix,

$$\hat{S} = \frac{1}{N}UU^T = \frac{1}{Q}VV^T. \qquad(11.102)$$

Using (11.101), both U and V can be expressed by
$$U = \sqrt{N}\,\Phi\Lambda^{1/2}\Psi^T, \qquad(11.103)$$

$$V = \sqrt{Q}\,\Phi\Lambda^{1/2}\Xi^T, \qquad(11.104)$$

where $\Psi$ ($N\times n$) and $\Xi$ ($Q\times n$) are the eigenvector matrices of $U^TU$ and $V^TV$ respectively, and $\Phi$ and $\Lambda$ are the eigenvector and eigenvalue matrices of $\hat{S}$. Note that, although $(UU^T)_{n\times n}$ and $(VV^T)_{n\times n}$ share the same eigenvector matrix, $\Phi$, $(U^TU)_{N\times N}$ and $(V^TV)_{Q\times Q}$ have different eigenvector matrices, $\Psi$ and $\Xi$. Furthermore, since $\sqrt{N}\,\Psi^T = \Lambda^{-1/2}\Phi^TU$ from (11.103), $\sqrt{N}\,\Psi^T$ is the sample matrix in the whitened space, where the sample autocorrelation matrix becomes I. Similarly, $\sqrt{Q}\,\Xi^T$ is the reduced sample matrix in that space. Then, (11.104) shows the transformation which converts $\sqrt{Q}\,\Xi^T$ back to the original space.

As in (11.100), the n column vectors of $\Xi$ are orthonormal, satisfying

$$\Xi^T\Xi = I. \qquad(11.105)$$
The second condition is that U and V must have the same mean vector. Without losing generality, let us assume that the mean is zero. Then,

$$\frac{1}{N}U\mathbf{1}_N = \frac{1}{Q}V\mathbf{1}_Q = 0, \qquad(11.106)$$

where $\mathbf{1}_k$ is a vector whose k components all equal 1, that is, $\mathbf{1}_k = [1 \ \ldots \ 1]^T$. Or, substituting (11.103) and (11.104) into (11.106),

$$\Xi^T\mathbf{1}_Q = 0. \qquad(11.107)$$
When the sample matrix U is given, we can compute $\Phi$, $\Psi$, and $\Lambda$. Therefore, our problem is to find $\Xi$ by solving (11.105) and (11.107); then V can be computed by (11.104). Now $\Xi$ consists of $n\times Q$ elements. To determine those $n\times Q$ unknowns we have $n\times n$ equations from (11.105) and n equations from (11.107). Therefore, in general when Q is larger than n+1, there are many possible solutions. But when Q = n+1, we have a unique solution as follows:

$$\Xi^T_{n\times(n+1)} = [\,F_{n\times n} \ \ G_{n\times 1}\,], \qquad(11.108)$$

$$F = I - \alpha\,\mathbf{1}_n\mathbf{1}_n^T, \qquad(11.109)$$

$$G = -\beta\,\mathbf{1}_n, \qquad(11.110)$$
where

$$\alpha = \frac{1}{n}\left(1 - \frac{1}{\sqrt{n+1}}\right), \qquad(11.111)$$

$$\beta = \frac{1}{\sqrt{n+1}}. \qquad(11.112)$$

The derivation of (11.108) through (11.112) has been omitted for brevity, but the reader may easily verify that $\Xi$ of (11.108) satisfies both (11.105) and (11.107).
Substituting (11.109) and (11.110) into (11.108),

$$\Xi^T = \begin{bmatrix} 1-\alpha & -\alpha & \cdots & -\alpha & -\beta \\ -\alpha & 1-\alpha & \cdots & -\alpha & -\beta \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ -\alpha & -\alpha & \cdots & 1-\alpha & -\beta \end{bmatrix}. \qquad(11.113)$$

Note that $\Xi^T$ of (11.113) is data independent, because $\sqrt{Q}\,\Xi^T$ is the reduced sample matrix of (n+1) samples for the mean 0 and the covariance matrix I. This sample matrix is transformed by (11.104) into the space where the original data and its sample covariance matrix are given.
Example 2: For n = 2, $\sqrt{3}\,\Xi^T$ becomes

$$\sqrt{3}\,\Xi^T = \begin{bmatrix} 1.366 & -0.366 & -1 \\ -0.366 & 1.366 & -1 \end{bmatrix}. \qquad(11.114)$$
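The construction is equally easy to check numerically; the sketch below reproduces (11.114) and verifies (11.105) and (11.107). V itself would then follow from (11.104).

```python
import numpy as np

def xi_T(n):
    """Xi^T of (11.108)-(11.113) for Q = n + 1 representatives."""
    beta = 1.0 / np.sqrt(n + 1)               # (11.112)
    alpha = (1.0 - beta) / n                   # (11.111)
    F = np.eye(n) - alpha * np.ones((n, n))    # (11.109)
    G = -beta * np.ones((n, 1))                # (11.110)
    return np.hstack([F, G])                   # n x (n+1), as in (11.108)

XiT = xi_T(2)
print(np.sqrt(3) * XiT)                        # reproduces (11.114)
print(np.allclose(XiT @ XiT.T, np.eye(2)))     # (11.105): Xi^T Xi = I
print(np.allclose(XiT @ np.ones(3), 0.0))      # (11.107): Xi^T 1 = 0
```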
Computer Projects

1. Repeat Experiment 1.

2. Repeat Experiments 2 and 3.

3. Apply the graph theoretic valley-seeking technique to Data I-A.

4. Apply the iterative valley-seeking technique to Data I-A.

5. Repeat Experiments 6 and 7.
Problems

1. A random variable x is uniformly distributed in [a, b]. From N samples $x_1,\ldots,x_N$ drawn from the distribution, find the maximum likelihood estimates of a and b.

2. Assume that a distribution can be decomposed into L normal clusters and that all clusters share the same covariance matrix. Find the recursive equations to compute the a priori probabilities, the mean vectors, and the common covariance matrix of the L clusters by using the maximum likelihood estimation technique. Point out the difference between this procedure and the nearest mean reclassification algorithm.

3. Assume that a distribution consists of L normal clusters and $P_1 = \ldots = P_L$. Find the recursive equations to compute the means and covariance matrices of the L clusters by using the maximum likelihood estimation technique.
4. A kurtosis matrix is defined by

$$K = E\{(X^TX)XX^T\} - (n+2)I.$$

When a density function consists of L normal clusters with the mean vectors $M_i$ and the common covariance matrix $\Sigma$, K can be rewritten as

$$K = \sum_{i=1}^{L} P_iM_iM_i^T\left[\mathrm{tr}(\Sigma + M_iM_i^T) - (n+2)\right].$$

Since the rank of the above equation is (L−1), only (L−1) eigenvalues of K are non-zero. Using this property, we can estimate the number of modes when the clusters are normal and share the same covariance matrix. Prove that the second equation is equal to the first one for $\Sigma = I$.
5. A density function of a random variable x can be estimated by the Parzen method with the kernel function

$$\kappa(y) = \begin{cases} \dfrac{3}{4}(1-y^2) & \text{for } y^2 < 1, \\ 0 & \text{for } y^2 \ge 1. \end{cases}$$

The derivative of the Parzen density estimate, $d\hat{p}(x)/dx$, can be used as an estimate of the density gradient. Confirm that this method of gradient estimation coincides with (11.64).
6. The product of two normal distributions, $N(M_1,\Sigma_1)$ and $N(M_2,\Sigma_2)$, gives another normal distribution, $N(M_0,\Sigma_0)$, multiplied by a constant, c.

(a) Derive the relations among the three means and covariance matrices.

(b) Samples are drawn from $N(M_1,\Sigma_1)$, and a normal window function $N(X,\Sigma_2)$ is set at X. Using these samples and the given window function, estimate the mean and covariance matrix of $N(M_0,\Sigma_0)$.

(c) Using $M_0$ and $\Sigma_0$ and the given window function, estimate $M_1$ and $\Sigma_1$. This procedure could be used to estimate a local covariance matrix even when the samples are drawn from a non-normal distribution.
7. Two density functions are estimated by the Parzen method. Find the expression for the class separability criterion $\int p_1(X)p_2(X)\,dX$.

(a) Use a normal kernel function with kernel covariance $r^2\Sigma_i$ for $\omega_i$.

(b) Use a hyperspherical uniform kernel with radius r for both $\omega_1$ and $\omega_2$.
8. Let class separability be measured by

$$J = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} f(X_i,X_j)\,g(c_i,c_j),$$

where

$$f(X_i,X_j) = \begin{cases} 1 & \text{for } \|X_i - X_j\| \le r, \\ 0 & \text{for } \|X_i - X_j\| > r, \end{cases}$$

$$g(c_i,c_j) = \begin{cases} 1 & \text{when } X_i \text{ and } X_j \text{ belong to different classes}, \\ 0 & \text{when } X_i \text{ and } X_j \text{ belong to the same class}. \end{cases}$$

That is, J counts the number of pairs whose members are close and yet come from different classes. Prove that the minimization of J by an iterative process leads to the iterative valley-seeking procedure of (11.81).
9. Formulate the selection of representatives by a branch and bound procedure.
10. By using the singular value decomposition technique, show how to decompose an $n\times m$ (m < n) two-dimensional image into m basis images, and how to evaluate the effectiveness of these basis images.
References

1. S. Watanabe, "Knowing and Guessing," Wiley, New York, 1969.

2. D. J. Hall and G. B. Ball, ISODATA: a novel method of data analysis and pattern classification, Technical report, Stanford Research Institute, Menlo Park, CA, 1965.

3. G. B. Ball, Data analysis in the social sciences: what about the details?, Proc. Fall Joint Computer Conference, pp. 533-559, Washington, DC, 1965.

4. J. MacQueen, Some methods for classification and analysis of multivariate observations, Proc. Fifth Berkeley Symposium on Math. Statist. and Prob., pp. 281-297, Berkeley, CA, 1967.

5. K. Fukunaga and W. L. G. Koontz, A criterion and an algorithm for grouping data, Trans. IEEE Computers, C-19, pp. 917-923, 1970.

6. W. L. G. Koontz, P. M. Narendra, and K. Fukunaga, A branch and bound clustering algorithm, Trans. IEEE Computers, C-24, pp. 908-915, 1975.

7. D. M. Titterington, A. F. M. Smith, and U. E. Makov, "Statistical Analysis of Finite Mixture Distributions," Wiley, New York, 1985.

8. K. Fukunaga and T. E. Flick, Estimation of the parameters of a Gaussian mixture using the method of moments, Trans. IEEE Pattern Anal. and Machine Intell., PAMI-5, pp. 410-416, 1983.

9. N. E. Day, Estimating the components of a mixture of normal distributions, Biometrika, 56, pp. 463-474, 1969.

10. J. H. Wolfe, Pattern clustering by multivariate mixture analysis, Multivar. Behav. Res., 5, pp. 329-350, 1970.

11. K. Fukunaga and L. D. Hostetler, The estimation of the gradient of a density function, with applications in pattern recognition, Trans. IEEE Inform. Theory, IT-21, pp. 32-40, 1975.

12. K. Fukunaga and T. E. Flick, A test of Gaussian-ness of a data set using clustering, Trans. IEEE Pattern Anal. and Machine Intell., PAMI-8, pp. 240-247, 1986.

13. W. L. G. Koontz, P. M. Narendra, and K. Fukunaga, A graph-theoretic approach to nonparametric cluster analysis, Trans. IEEE Computers, C-25, pp. 936-944, 1975.

14. W. L. G. Koontz and K. Fukunaga, A nonparametric