Case 1:14-cv-02396-PGG-MHD Document 148-15 Filed 05/30/19 Page 1 of 44
`
EXHIBIT 5

PART 1 OF 2
`
`
`
Keinosuke Fukunaga

Introduction to Statistical
Pattern
Recognition

Second Edition
`
`
`
`
`Keinosuke Fukunaga
`
`Introduction to
`Statistical
`Pattern
`Recognition
`
`Second Edition
`
This completely revised second edition
presents an introduction to statistical pattern
recognition. Pattern recognition in
general covers a wide range of problems: it
is applied to engineering problems, such as
character readers and waveform analysis,
as well as to brain modeling in biology and
psychology. Statistical decision and estimation,
which are the main subjects of this
book, are regarded as fundamental to the
study of pattern recognition. This book
is appropriate as a text for introductory
courses in pattern recognition and as a reference
book for people who work in the
field. Each chapter also contains computer
projects as well as exercises.
`
`
`
`
`Introduction to Statistical
`Pattern Recognition
`
`Second Edition
`
`
`
`
`This is a volume in
`COMPUTER SCIENCE AND SCIENTIFIC COMPUTING
`
`Editor: WERNER RHEINBOLDT
`
`
`
`
`Introduction to Statistical
`Pattern Recognition
`Second Edition
`
`Keinosuke Fukunaga
`School of Electrical Engineering
`Purdue University
West Lafayette, Indiana

Morgan Kaufmann is an imprint of Academic Press

A Harcourt Science and Technology Company

San Diego San Francisco New York Boston
London Sydney Tokyo
`
`
`
`
This book is printed on acid-free paper.
`
Copyright © 1990 by Academic Press
All rights reserved.
No part of this publication may be reproduced or
transmitted in any form or by any means, electronic
or mechanical, including photocopy, recording, or
any information storage and retrieval system, without
permission in writing from the publisher.
`
`ACADEMIC PRESS
`A Harcourt Science and Technology Company
525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
`http://www.academicpress.com
`
`Academic Press
24-28 Oval Road, London NW1 7DX, United Kingdom
http://www.hbuk.co.uk/ap/
`
`Morgan Kaufmann
`340 Pine Street, Sixth Floor, San Francisco, CA 94104-3205
`http://mkp.com
`
Library of Congress Cataloging-in-Publication Data
Fukunaga, Keinosuke.
Introduction to statistical pattern recognition / Keinosuke
Fukunaga. - 2nd ed.
p. cm.
Includes bibliographical references.
ISBN 0-12-269851-7
1. Pattern perception - Statistical methods. 2. Decision-making
- Mathematical models. 3. Mathematical statistics.
I. Title.
Q327.F85 1990
006.4 - dc20

89-18195
CIP
`
PRINTED IN THE UNITED STATES OF AMERICA
`
`
`
`
`Chapter 11
`
`CLUSTERING
`
In the preceding chapters, we have presented a considerable body of design theory for pattern recognition. Procedures for classifier design, parameter estimation, and density estimation have been discussed in detail. We have consistently assumed the existence of a training set of classified samples. In this chapter, we will focus our attention on the classification of samples without the aid of a training set. We will refer to this kind of classification as clustering or unsupervised classification.
`
There are many instances where classification must and can be performed without a priori knowledge. Consider, for example, the biological taxonomy problem. Over the years, all known living things have been classified according to certain observable characteristics. Of course, plants and animals have never borne labels indicating their kingdoms, phyla, and so on. Rather, they have been categorized according to their observable characteristics without outside supervision.
`
The clustering problem is not well defined unless the resulting classes of samples are required to exhibit certain properties. The choice of properties or, equivalently, the definition of a cluster, is the fundamental issue in the clustering problem. Given a suitable definition of a cluster, it is possible to distinguish between good and bad classifications of samples.
`
`
`
`
In this chapter, two approaches to clustering will be addressed. One is called the parametric approach and the other is the nonparametric approach.
`
In most parametric approaches, clustering criteria are defined, and given samples are classified into a number of clusters so as to optimize the criteria. The most commonly used criteria are the class separability measures which were introduced in Chapter 10. That is, the class assignment which maximizes the class separability measure is considered to be the best clustering result. In this approach, the structure (parametric form) of the classification boundary is determined by the criterion. The clustering algorithm, which determines efficiently the best classification with respect to the criterion, is normally an iterative algorithm. In another parametric approach, a mathematical form is assumed to express the distribution of the given data. A typical example is the summation of normal distributions. In this case, the clustering problem consists of finding the parameter values for this distribution which best fit the data.
`
On the other hand, neither clustering criteria nor assumed mathematical forms for the distribution are used in the nonparametric approach. Instead, samples are separated according to the valley of the density function. The valley may be considered as the natural boundary which separates the modes of the distributions. This boundary could be complex and not expressible by any parametric form.
`
In addition, clustering may be viewed as the selection of representatives. In general, a density function may be approximated by the Parzen density estimate around the representatives. Then, we may try to reduce the number of representatives while maintaining the degree of approximation. An iterative procedure to choose the representatives is discussed in this chapter.
`
`11.1 Parametric Clustering
`
In this section, we will present, first, a general-purpose clustering algorithm based on a generalized criterion. Then, the discussion for a specific criterion follows.
`
`
`
`
`General Clustering Algorithm
`
The clustering algorithm developed in this section applies to a wide range of criteria. However, it is necessary to specify the form of the criterion as well as some other details of the clustering problem at the outset.
`
Clustering criterion: Assume that we want to classify N samples, X_1, …, X_N. These vectors are not denoted as random vectors because, in the clustering problem, they are assumed to be fixed and known. Each sample is to be placed into one of L classes, ω_1, …, ω_L, where L is assumed to be given. The class to which the ith sample is assigned is denoted ω_{k_i} (i = 1, …, N). For convenience, let the value of k_i be an integer between 1 and L. A classification Ω is a vector made up of the ω_{k_i}'s, and a configuration X* is a vector made up of the X_i's, that is,

Ω = [ω_{k_1} … ω_{k_N}]^T    (11.1)

and

X* = [X_1^T … X_N^T]^T.    (11.2)

The clustering criterion J is a function of Ω and X*, and can be written as

J = J(ω_{k_1}, …, ω_{k_N}; X_1, …, X_N) = J(Ω; X*).    (11.3)
`
By definition, the best classification Ω_0 satisfies either

J(Ω_0; X*) = max_Ω J(Ω; X*)  or  min_Ω J(Ω; X*),    (11.4)
`
`depending on the criterion. For the remainder of this section, we will discuss
`only the minimization problem, since the maximization is similar.
`
Example 1: Six vectors, X_1, …, X_6, of Fig. 11-1 are given. The problem is to find class assignments of these vectors to one of two classes so as to minimize a criterion. Let J = tr(S_m^{-1} S_w) be the criterion, where S_w and S_m are the within-class and mixture scatter matrices defined in (10.1) and (10.4). For each classification, for example {X_1, X_2, X_5} ∈ ω_1 and {X_3, X_4, X_6} ∈ ω_2 as shown by dotted lines, or {X_1, X_2, X_3} ∈ ω_1 and {X_4, X_5, X_6} ∈ ω_2 as shown by solid lines, the mean vectors and covariance matrices for ω_1 and ω_2 are estimated, and S_w, S_m, and J can be computed.
`
`
`
Fig. 11-1 An example of clustering.
`
Note that S_w varies depending on the class assignment but S_m does not. The class assignment by the solid lines would give the minimum J (the smallest within-class scatter) among all possible combinations of class assignments.
`
Clustering algorithm: For a given clustering problem, the configuration X* is fixed. The clustering algorithm varies only the classification Ω. Ordinary steepest descent search techniques cannot be applied because of the discrete and unordered nature of Ω. Still, it is possible to define an iterative search algorithm based on variations in J with respect to variations in Ω. Suppose, at the ℓth iteration, the classification is Ω(ℓ), where

Ω(ℓ) = [ω_{k_1(ℓ)} … ω_{k_N(ℓ)}]^T.    (11.5)

If the ith sample is reclassified from its present class k_i(ℓ) to class j, the clustering criterion varies by an amount ΔJ(i, j, ℓ), which is given by

ΔJ(i, j, ℓ) = J(ω_{k_1(ℓ)}, …, ω_j, …, ω_{k_N(ℓ)}; X*) − J(Ω(ℓ); X*).    (11.6)

If ΔJ(i, j, ℓ) is negative, reclassification of the ith sample to class j yields a classification that is improved in terms of J. This fact is the basis of the following algorithm:
`
`
`
`
(1) Choose an initial classification Ω(0).

(2) Given the ℓth classification Ω(ℓ), calculate ΔJ(i, j, ℓ) for j = 1, 2, …, L and i = 1, 2, …, N.

(3) For i = 1, 2, …, N, reclassify the ith sample to class t, where

ΔJ(i, t, ℓ) = min_j ΔJ(i, j, ℓ).    (11.7)

Decide ties involving j = k_i(ℓ) in favor of the present classification. Decide other ties arbitrarily. This step forms Ω(ℓ+1).

(4) If Ω(ℓ+1) ≠ Ω(ℓ), return to Step (2). Otherwise the algorithm is complete.
`
The algorithm is simply the iterative application of a classification rule based on the clustering criterion. Here, we adopted a procedure in which all of the samples are reclassified simultaneously at each iteration. An alternative approach is to reclassify each sample one at a time, resulting in similar but slightly different clusters. In these iterative procedures, there is no guarantee that the algorithm converges. Even if it does converge, we cannot be certain that the absolute minimum of J has been obtained. Therefore, we must depend on empirical evidence to justify the algorithm.

In contrast to these potential weaknesses, the algorithm described above is very efficient. Like any good search algorithm, it surveys a small subset of classifications in a systematic and adaptive manner. It is easily programmable for any criterion of the form of (11.3).
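For concreteness, Steps (1) through (4) can be sketched in code. This is a minimal sketch, not the book's implementation: the names `cluster_by_criterion` and `within_scatter` are hypothetical, and the criterion shown is the simple average within-class scatter rather than the general J of (11.3).

```python
import numpy as np

def within_scatter(labels, X):
    """A simple criterion J: average within-class scatter of the samples."""
    return sum(np.sum((X[labels == r] - X[labels == r].mean(axis=0)) ** 2)
               for r in np.unique(labels)) / len(X)

def cluster_by_criterion(X, L, J, n_iter=100, seed=0):
    """Steps (1)-(4): reclassify all samples simultaneously to reduce J."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, L, size=len(X))        # (1) initial classification
    for _ in range(n_iter):
        new_labels = labels.copy()
        for i in range(len(X)):                     # (2) evaluate dJ(i, j, l) for every j
            deltas = []
            for j in range(L):
                trial = labels.copy()
                trial[i] = j
                deltas.append(J(trial, X) - J(labels, X))
            best = int(np.argmin(deltas))
            if deltas[best] < deltas[labels[i]]:    # (3) ties favor the present class
                new_labels[i] = best
        if np.array_equal(new_labels, labels):      # (4) stop when nothing changes
            return labels
        labels = new_labels
    return labels
```

Any criterion of the form (11.3) can be substituted for `within_scatter`; the search itself never needs the criterion's analytic form, only its value for a trial classification.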
`
Determining the number of clusters: So far, we have ignored the problem of determining the number of classes L, and have assumed that L is given. However, in practice, L is rarely known. We not only need to determine L but also the proper class assignment. For that purpose, we may run the clustering procedure for various values of L, and find the best classification for each value of L. Let J_L* be the optimal criterion value for a given L after the clustering procedure has converged. If J_L* decreases as L increases, and either reaches the minimum point at L_0 or becomes flat after L_0, then we may use L_0 as the proper number of classes. Unfortunately, many of the popular criteria do not have this favorable property. For example, consider J = tr(S_m^{-1} S_w) of Example 1. As L increases, samples are divided into smaller groups, and consequently the within-class scatter becomes smaller. This means that J might decrease
`
`
`
`
monotonically with L. Finally, when L becomes N, the total number of samples, each class consists of one sample only, and there is no within-class scatter (J = 0). Although L = N minimizes this criterion, this is obviously not the solution we want.
`
It appears, therefore, that some external method of controlling L is necessary. Unfortunately, no unified theory for determining L has been fully developed and accepted.
`
Merging and splitting: After a number of classes is obtained, we may consider the merging of two classes into a single class, or the splitting of a class into a number of classes.
`
Basically, merging is desirable in two instances. The first is when two classes are very similar. The similarity may be measured in a number of ways. The Euclidean distance between two mean vectors is the simplest measure but is not an accurate one. The Bhattacharyya distance of (3.152), based on the normal assumption, could be a reasonable compromise between simplicity and accuracy. The second instance in which merging may be appropriate is when the population of a class is very small. In this case, the class may be merged with the most similar class, even when the similarity is not very close.
`
Deciding whether or not a class is to be split is a more complex problem. Too large a population suggests that the class is a candidate for a split. Multimodal and nonsymmetric distributions, as well as distributions with large variances along one direction, are also candidates for splitting. In order to identify these characteristics, various tests are necessary. Splitting a class may be carried out by applying the clustering procedure to the samples in the class.
`
It goes without saying that this kind of merging and splitting is very heuristic. Its merit lies in the fact that it is efficient and requires a minimum of human interaction.
`
Multiple dichotomy: It is somewhat more satisfying to adopt an approach which depends entirely on the clustering criterion J. One such approach has been suggested [1] and is outlined as follows.

Suppose that for L = 2 there are several distinct classifications which yield a nearly minimal value of J. If these classifications differ only in the classification of a few samples, there is no reason to suppose the existence of
`
`
`
`
`more than two classes. If the classifications are grossly different, however,
`then it is evident that several classes are actually present.
`
Fig. 11-2 Multiple dichotomy of three classes of samples.
`
Figure 11-2 illustrates two possible dichotomies of a collection of samples apparently containing three classes A_1, A_2, and A_3. One dichotomy separates the samples into A_1 ∪ A_2 and A_3, while the other results in the two classes A_1 and A_2 ∪ A_3. Thus, A_3, A_1, and the complement of A_1 ∪ A_3, which is A_2, are seen to be distinct classes (Ā denotes the complement of the set A).
Now let us consider the more general case where there are k dichotomies of a collection of samples containing L classes. Each dichotomy separates these samples into two groups. Let S_{ij} be the set of all samples assigned to group j by the ith dichotomy, for j = 1, 2 and i = 1, …, k. Assume that the following two conditions hold.

(a) A dichotomy places each class into only one group; that is, classes are not split by dichotomies.

(b) For each pair of classes, there is at least one dichotomy which does not assign the two classes to the same group.

Select one group from each of the k dichotomies and form a subset C as the intersection of these k groups. By condition (a), if C contains one sample, then it must contain all of the samples of that class. By condition (b), for any other class, there is at least one of the k selected groups to which that class does not belong. Therefore, if C is nonempty, then C contains one and only one class. Hence, in order to construct all the L classes, we consider the 2^k subsets of the form
`
`
`
`
C(j_1, …, j_k) = ∩_{i=1}^{k} S_{i j_i},    (11.8)

where each j_i equals 1 or 2. Each nonempty C is a class. In our example, we have
`
S_11 = A_1 ∪ A_2,  S_12 = A_3,  S_21 = A_1,  S_22 = A_2 ∪ A_3,    (11.9)

so that

C(1, 1) = S_11 ∩ S_21 = A_1,
C(1, 2) = S_11 ∩ S_22 = A_2,
C(2, 1) = S_12 ∩ S_21 = ∅,
C(2, 2) = S_12 ∩ S_22 = A_3,    (11.10)
`
`which is in agreement with our earlier argument.
`
The multiple dichotomy approach has a stronger theoretical basis than the merging and splitting procedure. Further, it relies on no numerical criterion other than J. However, implementation of the multiple dichotomy approach can be difficult, especially when the true number of classes is large. In addition, the conditions (a) and (b) mentioned above are rarely satisfied in practice. These difficulties may be overcome somewhat by imposing a hierarchical structure on the classes. The samples are divided into a small number of classes, each class is divided further, and so on. Under this strategy, we need not find every possible dichotomy of the entire collection of samples.
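The construction of (11.8) through (11.10) is easy to mechanize. In the sketch below, the function name is hypothetical and the dichotomy groups are assumed to be available as sets of sample identifiers; the classes are exactly the nonempty intersections obtained by selecting one group from each dichotomy.

```python
from itertools import product

def classes_from_dichotomies(dichotomies):
    """Construct classes as the nonempty intersections C(j_1,...,j_k)
    of one group S_ij selected from each dichotomy (Eq. 11.8).

    dichotomies : list of (S_i1, S_i2) pairs, each group a set of samples.
    """
    classes = []
    for choice in product(range(2), repeat=len(dichotomies)):
        C = set.intersection(*(d[j] for d, j in zip(dichotomies, choice)))
        if C:                      # each nonempty intersection is one class
            classes.append(C)
    return classes

# The three-class example of Fig. 11-2, with A1, A2, A3 encoded as sample ids
A1, A2, A3 = {1, 2}, {3, 4}, {5, 6}
dichotomies = [(A1 | A2, A3),      # 1st dichotomy: S_11, S_12
               (A1, A2 | A3)]      # 2nd dichotomy: S_21, S_22
print(classes_from_dichotomies(dichotomies))   # recovers A1, A2, A3
```

The empty intersection S_12 ∩ S_21 of (11.10) is discarded automatically, mirroring the rule that only nonempty C's are classes.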
`
At this point, we depart from the general discussion of the clustering algorithm. Obviously, the discussion is incomplete. We have a basis, however, to develop and implement clustering procedures. Therefore, let us turn our attention to the detailed derivations of clustering procedures.
`
Nearest Mean Reclassification Algorithm [2-5]
`
In this section, we will discuss clustering procedures based on parameters such as mean vectors and covariance matrices. We will show that the criteria of class separability discussed in Chapter 10 play an important role, and that the iterative algorithms of the previous section take on simple forms.
`
`
`
`
Criteria: Clustering can be considered as a technique to group samples so as to maximize class separability. Then, all of the criteria which were discussed in Chapter 10 may be used as clustering criteria. In this section only functions of scatter matrices are discussed, due to the following reasons:

(1) The extension to multiclass problems is straightforward. In this respect, the Bhattacharyya distance has a severe disadvantage, since it can be applied only to two-class problems.

(2) Most clustering techniques are based on the scatter of mean vectors. Finding clusters based on covariance-differences is extremely difficult, unless some mathematical structure is imposed on the distribution. Therefore, the functions of scatter matrices fit well to clustering problems.
`
`(3) The simplicity of the criteria is a significant advantage, because in
`clustering we have the additional complexity of unknown class assignment.
`
For feature extraction, we could choose any combination of S_b, S_w, and S_m as S_1 and S_2 to form a criterion J = tr(S_2^{-1} S_1). However, for clustering it is preferable to use S_m as S_2, because S_m is independent of class assignment. It would be too complicated if we had to recompute the inverse of S_2 each time the class assignment was altered in iteration. Therefore, our choice is limited to either tr(S_m^{-1} S_b) or tr(S_m^{-1} S_w). These two criteria are the same, because tr(S_m^{-1} S_b) = tr{S_m^{-1}(S_m − S_w)} = n − tr(S_m^{-1} S_w). In this chapter, we will use J = tr(S_m^{-1} S_w).
Another important consideration in selecting criteria for clustering is to ensure that the clustering procedures give the same classification for a given set of samples regardless of the coordinate system of these samples. The chosen criterion, J = tr(S_m^{-1} S_w), satisfies this condition, since the criterion is invariant under any nonsingular linear transformation.
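The invariance claim can be spot-checked numerically. In the sketch below (the helper name `scatter_criterion` is hypothetical), J = tr(S_m^{-1} S_w) is computed from labeled data before and after a nonsingular linear transformation of the samples; the two values should agree up to round-off.

```python
import numpy as np

def scatter_criterion(X, labels):
    """J = tr(S_m^{-1} S_w) from samples X and a class assignment."""
    M0 = X.mean(axis=0)
    Sm = (X - M0).T @ (X - M0) / len(X)        # mixture scatter matrix
    Sw = np.zeros_like(Sm)                     # within-class scatter matrix
    for r in np.unique(labels):
        Xr = X[labels == r]
        Mr = Xr.mean(axis=0)
        Sw += (Xr - Mr).T @ (Xr - Mr) / len(X)
    return np.trace(np.linalg.inv(Sm) @ Sw)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
labels = rng.integers(0, 2, size=40)
A = rng.normal(size=(3, 3)) + 3 * np.eye(3)    # a (almost surely) nonsingular map
J1 = scatter_criterion(X, labels)
J2 = scatter_criterion(X @ A.T, labels)        # same samples, new coordinates
print(np.isclose(J1, J2))                      # invariant under the transformation
```

The reason is visible in the algebra: both S_w and S_m transform as S → A S A^T, so A cancels inside tr(S_m^{-1} S_w).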
`
Clustering algorithm: Let us assume that M_0 = 0 and S_m = I without loss of generality. If the given samples do not satisfy these conditions, we can shift the coordinate origin and whiten the data with respect to S_m. Then, using (10.1) the criterion is rewritten as
`
`
`
J = tr(S_w) = Σ_{r=1}^{L} (N_r/N) (1/N_r) Σ_{j=1}^{N_r} (X_j^{(r)} − M_r)^T (X_j^{(r)} − M_r)
  = (1/N) Σ_{r=1}^{L} Σ_{j=1}^{N_r} ||X_j^{(r)} − M_r||^2,    (11.11)

where X_j^{(r)} denotes a sample assigned to class r, N_r is the number of samples in class r, and M_r is the class-r mean.
`
Changing the class assignment of X_i from the current class k_i to class j at the ℓth iteration, we delete from (11.11) the term ||X_i − M_{k_i}(ℓ)||^2 and insert a new term ||X_i − M_j(ℓ)||^2. Thus,

ΔJ(i, j, ℓ) = (1/N) { ||X_i − M_j(ℓ)||^2 − ||X_i − M_{k_i}(ℓ)||^2 }.    (11.12)

Since the second term of (11.12) is independent of j, the reclassification of X_i at the ℓth iteration can be carried out by

||X_i − M_t(ℓ)|| = min_j ||X_i − M_j(ℓ)||  ⟹  X_i ∈ ω_t.    (11.13)
`
In words, the algorithm becomes:

(1) Choose an initial classification, Ω(0), and calculate M_1(0), …, M_L(0).

(2) Having calculated the sample mean vectors M_1(ℓ), …, M_L(ℓ) at the ℓth iteration, reclassify each X_i according to the nearest M_j(ℓ).

(3) If the classification of any X_i is changed, calculate the new sample mean vectors M_1(ℓ+1), …, M_L(ℓ+1) for the new class assignment, and repeat from Step (2). Otherwise, stop.

This algorithm is called the nearest mean reclassification rule.
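Steps (1) through (3) amount to the familiar k-means-style iteration. A minimal sketch follows; the function name is hypothetical, and it assumes no cluster becomes empty during the iteration.

```python
import numpy as np

def nearest_mean_reclassification(X, L, labels, max_iter=100):
    """Steps (1)-(3): reclassify each sample to its nearest class mean
    until the classification no longer changes."""
    for _ in range(max_iter):
        # Step (2): sample mean of each class under the current assignment
        means = np.array([X[labels == r].mean(axis=0) for r in range(L)])
        new_labels = np.argmin(
            np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2), axis=1)
        if np.array_equal(new_labels, labels):
            break                   # Step (3): no sample changed class
        labels = new_labels
    return labels
```

Because each sample goes to its nearest mean, the resulting boundary is exactly the piecewise linear bisector structure described below for Fig. 11-3.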
`
Figure 11-3 shows how the iterative process works. At the ℓth step, samples are divided into three clusters, and their sample means, the M_i(ℓ)'s, are computed. All samples are now reclassified according to the nearest means. That is, the new boundary is piecewise linear, bisecting each pair of M_i(ℓ)'s. In Fig. 11-3, there are three clearly separated clusters. We can see that the boundary is indeed improved by this operation.
`
From the above discussion, some properties of the nearest mean reclassification algorithm become evident. They are:

(1) Clusters are divided by piecewise linear bisectors. Only the means contribute to determining the boundary; covariance matrices do not affect the boundary.
`
`
`
Fig. 11-3 An example of the nearest mean reclassification algorithm.
`
(2) The number of clusters must be preassigned.

(3) The initial classification, Ω(0), may be given randomly. No matter how Ω(0) is given, the M_i(0)'s are computed and the reclassification of samples according to the nearest M_i(0)'s results in a piecewise linear boundary. This is equivalent to selecting the number of vectors, the M_i(0)'s, initially according to the number of clusters. Random class assignment does not impose any extra instability on the algorithm.
`
In order to verify the algorithm, the following experiment was conducted.
`
Experiment 1: One hundred samples were generated from each class of Data I-Λ, and mixed together to form 200 samples. Then, these samples were classified into two clusters. Table 11-1 shows the confusion matrix of this experiment [5]. All 100 samples from ω_1, together with 19 samples from ω_2, were assigned to one cluster, and 81 samples from ω_2 were assigned to the other cluster. The error rate is 19/200 = 0.095. Recall that we got 5% error for this data by designing the optimum linear classifier in Chapter 4. Considering the fact that no covariance information was used in this clustering algorithm, the error rate of 9.5% is reasonable. Furthermore, since all error samples came from one class, we could improve the error rate simply by adjusting the decision threshold.
`
Convergence [5]: The nearest mean reclassification algorithm is not guaranteed to converge. In this section, we will discuss the conditions under
`
`
`
TABLE 11-1

CONFUSION MATRIX FOR THE NEAREST MEAN
RECLASSIFICATION ALGORITHM

                    Assigned class
                      1      2
Actual     1        100      0
class      2         19     81
`
which the separating hyperplane converges for two normal distributions with equal covariance matrices.
`
Let us assume that the two normal distributions are N_X(M_1, Σ_1) and N_X(M_2, Σ_2) after normalization, and that Σ_1 = Σ_2 = Σ. The normalization makes S_m = I and M_0 = 0. The Bayes classifier in this case becomes linear as

(M_2 − M_1)^T Σ^{-1} X + c = 0,    (11.14)

where c is a constant. Since S_m = I = Σ + P_1 M_1 M_1^T + P_2 M_2 M_2^T and M_0 = 0 = P_1 M_1 + P_2 M_2,

Σ = I − (P_2/P_1) M_2 M_2^T,    (11.15)

M_2 − M_1 = (1/P_1) M_2.    (11.16)

Using (2.160),

Σ^{-1} = I + [(P_2/P_1) M_2 M_2^T] / [1 − (P_2/P_1) M_2^T M_2].    (11.17)

Substituting (11.16) and (11.17) into (11.14), the Bayes classifier is

[1 / (P_1 − P_2 M_2^T M_2)] M_2^T X + c = 0.    (11.18)
`
Thus, the optimum hyperplane for the equal covariance case is always in the direction of M_2, which is the same as that of the mean-difference vector M_2 − M_1. This property, which the original coordinate system does not have, is a significant advantage of the normalized coordinate system.
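This direction property can be spot-checked numerically. The numbers below are hypothetical; their only requirement is that Σ of (11.15) stays positive definite, i.e. (P_2/P_1)·M_2^T M_2 < 1.

```python
import numpy as np

P1, P2 = 0.6, 0.4
M2 = np.array([0.5, 0.3, 0.1])
M1 = -(P2 / P1) * M2                           # from M0 = P1 M1 + P2 M2 = 0
Sigma = np.eye(3) - (P2 / P1) * np.outer(M2, M2)   # Eq. (11.15)

# Normal of the Bayes hyperplane: Sigma^{-1} (M2 - M1)
direction = np.linalg.solve(Sigma, M2 - M1)

# If the hyperplane is in the direction of M2, the componentwise ratio
# direction / M2 is a constant vector.
print(direction / M2)
```

The constant ratio is exactly what (11.17) and (11.18) predict: the rank-one correction in Σ^{-1} maps M_2 back onto a multiple of itself.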
`
For the equal covariance case, we can show that the algorithm converges to M_2 − M_1 from a wide range of initial classifications. After a given iteration, samples are separated by a hyperplane whose direction is, say, V (||V|| = 1), as shown in Fig. 11-4. Also, the position of the hyperplane is specified by ℓ_1 and ℓ_2, which are the distances of M_1 and M_2 from the hyperplane.

Fig. 11-4 Separation of two distributions.

Let D_p and D_n be the centers of probability mass for the positive and negative sides of the hyperplane. Then, following the nearest mean reclassification rule, the direction of the succeeding hyperplane will be D_p − D_n. So, our convergence proof is to show that the angle between D_p − D_n and M_2 − M_1 is smaller than the angle between V and M_2 − M_1.

Since the hyperplane separates each distribution into two parts, the positive and negative sides, we have four probability masses R_{ij} (i = 1, 2; j = p, n), as shown in Fig. 11-4. Let D_{ij} and q_{ij} be the mean vectors and populations of these probability masses. Then,
`
`
`
D_{ip} = M_i + [a_i / (σ_i b_i)] Σ_i V,    (11.19)

D_{in} = M_i − [a_i / (σ_i (1 − b_i))] Σ_i V,    (11.20)

q_{ip} = P_i b_i,    (11.21)

q_{in} = P_i (1 − b_i),    (11.22)

where

a_i = (1/√(2π)) ∫_{±ℓ_i/σ_i}^{∞} ξ exp(−ξ^2/2) dξ = (1/√(2π)) exp(−ℓ_i^2 / 2σ_i^2),    (11.23)

b_i = (1/√(2π)) ∫_{±ℓ_i/σ_i}^{∞} exp(−ξ^2/2) dξ = 1 − Φ(±ℓ_i/σ_i),    (11.24)

σ_i^2 = V^T Σ_i V,    (11.25)
`
and Φ is the normal error function. The sign + or − is selected, depending on whether M_i is located in R_{in} or R_{ip}. The details of the derivation of (11.19) through (11.25) are left to the reader. However, the following information could be helpful in deriving (11.19) through (11.25). Since X is normal, y = V^T X is also normal with the variance of (11.25). In the y-space, the probability mass for the positive side, b_i, can be computed by (11.24). The vector D_{ij} − M_i has the direction of Σ_i V, and the projections of D_{ij} − M_i on V (j = p, n) are V^T(D_{ip} − M_i) = a_i σ_i / b_i and V^T(D_{in} − M_i) = −a_i σ_i / (1 − b_i). These are obtained by computing the expected values of y for the positive and negative sides. From the D_{ij} and q_{ij}, D_p and D_n are obtained as

D_p = (q_{1p} D_{1p} + q_{2p} D_{2p}) / (q_{1p} + q_{2p}),    (11.26)

D_n = (q_{1n} D_{1n} + q_{2n} D_{2n}) / (q_{1n} + q_{2n}).    (11.27)

Substituting (11.19) through (11.22) into (11.26) and (11.27),
`
`
`
`
D_p − D_n = [P_1 b_1 M_1 + P_2 b_2 M_2 + (P_1 a_1/σ_1) Σ_1 V + (P_2 a_2/σ_2) Σ_2 V] / {(P_1 b_1 + P_2 b_2)[1 − (P_1 b_1 + P_2 b_2)]}.    (11.28)

For the equal covariance case,

D_p − D_n = [P_1 b_1 M_1 + P_2 b_2 M_2 + (1/σ)(P_1 a_1 + P_2 a_2) Σ V] / {(P_1 b_1 + P_2 b_2)[1 − (P_1 b_1 + P_2 b_2)]}.    (11.29)
`
Furthermore, under the normalization of S_m = I and M_0 = 0, Σ and M_2 − M_1 are expressed as functions of M_2 as in (11.15) and (11.16). Therefore, (11.29) becomes

D_p − D_n = c_1 M_2 + c_2 V,    (11.30)

where

c_1 = [P_2(b_2 − b_1) − (1/σ)(P_2/P_1)(P_1 a_1 + P_2 a_2) M_2^T V] / {(P_1 b_1 + P_2 b_2)[1 − (P_1 b_1 + P_2 b_2)]},    (11.31)

c_2 = (1/σ)(P_1 a_1 + P_2 a_2).    (11.32)
`
The normal of the new hyperplane has a component in the direction of V and another in the direction of M_2. If the coefficient of M_2, c_1, has the same sign as M_2^T V, the successive hyperplane becomes more nearly parallel to M_2. Since c_2 and the denominator of c_1 are positive, we need to show that the numerator of c_1 and M_2^T V have the same sign. We examine only the case where M_2^T V > 0; the discussion for M_2^T V < 0 is similar. For M_2^T V > 0, we see from Fig. 11-4 that

ℓ_1 + ℓ_2 = (M_2 − M_1)^T V = (1/P_1) M_2^T V.    (11.33)

Using (11.33), the condition for convergence becomes

b_2 − b_1 − [(ℓ_1 + ℓ_2)/σ] (P_1 a_1 + P_2 a_2) > 0.    (11.34)
`
It is easily seen that the inequality of (11.34) is not satisfied for certain combinations of parameters. However, the region of parameters where (11.34) is satisfied can be calculated numerically. The result is shown in Fig. 11-5.

Fig. 11-5 Region of convergence.

Equations (11.23) and (11.24) with (11.34) show that we have three parameters in (11.34): ℓ_1/σ, ℓ_2/σ, and P_1 (P_2 = 1 − P_1), or, equivalently, the parameters γ and k defined in (11.35). In Fig. 11-5, the convergence regions of γ and P_1 are plotted for various values of k [5]. The figure shows that convergence is quite likely in practice, except for either extreme P_1's or γ's.
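The convergence behavior can also be examined by simulation rather than through (11.34). The sketch below draws two equal-covariance Gaussian clouds (all settings hypothetical), starts the hyperplane normal V at 45 degrees from M_2 − M_1, and applies the D_p − D_n update using sample means in place of the probability-mass centers.

```python
import numpy as np

rng = np.random.default_rng(7)
M1, M2 = np.array([-2.0, 0.0]), np.array([2.0, 0.0])
X = np.vstack([rng.normal(M1, 1.0, size=(500, 2)),    # class 1 samples
               rng.normal(M2, 1.0, size=(500, 2))])   # class 2 samples

def angle(u, v):
    return np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = M2 - M1
V = np.array([1.0, 1.0]) / np.sqrt(2)   # initial normal, 45 degrees off target
for _ in range(5):
    side = X @ V > 0                    # hyperplane through the mixture mean (~0)
    Dp, Dn = X[side].mean(axis=0), X[~side].mean(axis=0)
    V = (Dp - Dn) / np.linalg.norm(Dp - Dn)   # succeeding hyperplane normal
print(np.degrees(angle(V, target)))     # angle shrinks from 45 toward 0
```

With well-separated, balanced classes this sits deep inside the convergence region of Fig. 11-5; pushing P_1 toward 0 or 1 in the simulation is an easy way to see the failure cases the text warns about.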
`
Branch and bound procedure [6]: The nearest mean reclassification algorithm works fine for many applications. However, there is no guarantee of the convergence of the iterative process. Also, the process might stop at a locally minimum point and fail to find the globally minimum point.

Since assigning a class to each sample is a combinatorial problem, the branch and bound procedure discussed in Chapter 10 may be applied to find the optimum class assignment.
`
Figure 11-6 shows a solution tree for the clustering problem with four samples and three clusters. In general, there are L^N different Ω's for classifying N sample vectors into L clusters. However, since the label of each cluster may be chosen arbitrarily, each classification could have several different expressions for Ω. For example, Ω = [1 1 2 2]^T and Ω = [2 2 3 3]^T are the same classification, both indicating that X_1 and X_2 are in one cluster and X_3 and X_4 are in one of the other clusters. In order to eliminate this duplication, we assign the first sample X_1 to cluster 1, the second sample X_2 to either cluster 1 or 2, and so on, as shown in Fig. 11-6.

Fig. 11-6 A solution tree for clustering.
`
In order for the branch and bound procedure to work effectively, we need to have a criterion which satisfies the monotonicity condition. Let J_m(Ω_m) be the criterion to be minimized for Ω_m = [ω_{k_1} … ω_{k_m}]^T, where the subscript m indicates the number of sample vectors involved. Then, the monotonicity condition is stated as

J_m(Ω_m) ≤ J_{m+1}(Ω_{m+1}).    (11.36)
`
`That is, the J of a node is smaller than the J'