Case 1:14-cv-02396-PGG-MHD Document 148-15 Filed 05/30/19 Page 1 of 44
`
EXHIBIT 5

PART 1 OF 2
`
`
`
Keinosuke Fukunaga

Introduction to Statistical
Pattern
Recognition

Second Edition
`
`
`
`
`Keinosuke Fukunaga
`
`Introduction to
`Statistical
`Pattern
`Recognition
`
`Second Edition
`
This completely revised second edition
presents an introduction to statistical pattern
recognition. Pattern recognition in
general covers a wide range of problems: it
is applied to engineering problems, such as
character readers and waveform analysis,
as well as to brain modeling in biology and
psychology. Statistical decision and estimation,
which are the main subjects of this
book, are regarded as fundamental to the
study of pattern recognition. This book
is appropriate as a text for introductory
courses in pattern recognition and as a reference
book for people who work in the
field. Each chapter also contains computer
projects as well as exercises.
`
`
`
`
`Introduction to Statistical
`Pattern Recognition
`
`Second Edition
`
`
`
`
`This is a volume in
`COMPUTER SCIENCE AND SCIENTIFIC COMPUTING
`
`Editor: WERNER RHEINBOLDT
`
`
`
`
`Introduction to Statistical
`Pattern Recognition
`Second Edition
`
`Keinosuke Fukunaga
`School of Electrical Engineering
`Purdue University
West Lafayette, Indiana

Morgan Kaufmann is an imprint of Academic Press

A Harcourt Science and Technology Company

San Diego San Francisco New York Boston
London Sydney Tokyo
`
`
`
`
This book is printed on acid-free paper.
`
Copyright © 1990 by Academic Press
All rights reserved.
No part of this publication may be reproduced or
transmitted in any form or by any means, electronic
or mechanical, including photocopy, recording, or
any information storage and retrieval system, without
permission in writing from the publisher.
`
`ACADEMIC PRESS
`A Harcourt Science and Technology Company
525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
`http://www.academicpress.com
`
`Academic Press
24-28 Oval Road, London NW1 7DX, United Kingdom
http://www.hbuk.co.uk/ap/
`
`Morgan Kaufmann
`340 Pine Street, Sixth Floor, San Francisco, CA 94104-3205
`http://mkp.com
`
Library of Congress Cataloging-in-Publication Data
Fukunaga, Keinosuke.
Introduction to statistical pattern recognition / Keinosuke
Fukunaga. - 2nd ed.
p. cm.
Includes bibliographical references.
ISBN 0-12-269851-7
1. Pattern perception - Statistical methods. 2. Decision-making
- Mathematical models. 3. Mathematical statistics.
I. Title.
Q327.F85 1990
006.4 - dc20

89-18195
CIP
`
PRINTED IN THE UNITED STATES OF AMERICA
`
`
`
`
`Chapter 11
`
`CLUSTERING
`
In the preceding chapters, we have presented a considerable body of design theory for pattern recognition. Procedures for classifier design, parameter estimation, and density estimation have been discussed in detail. We have consistently assumed the existence of a training set of classified samples. In this chapter, we will focus our attention on the classification of samples without the aid of a training set. We will refer to this kind of classification as clustering or unsupervised classification.
`
There are many instances where classification must and can be performed without a priori knowledge. Consider, for example, the biological taxonomy problem. Over the years, all known living things have been classified according to certain observable characteristics. Of course, plants and animals have never borne labels indicating their kingdoms, phyla, and so on. Rather, they have been categorized according to their observable characteristics without outside supervision.
`
The clustering problem is not well defined unless the resulting classes of samples are required to exhibit certain properties. The choice of properties or, equivalently, the definition of a cluster, is the fundamental issue in the clustering problem. Given a suitable definition of a cluster, it is possible to distinguish between good and bad classifications of samples.
`
`
`
`
In this chapter, two approaches to clustering will be addressed. One is called the parametric approach and the other is the nonparametric approach.
`
In most parametric approaches, clustering criteria are defined, and given samples are classified into a number of clusters so as to optimize the criteria. The most commonly used criteria are the class separability measures which were introduced in Chapter 10. That is, the class assignment which maximizes the class separability measure is considered to be the best clustering result. In this approach, the structure (parametric form) of the classification boundary is determined by the criterion. The clustering algorithm, which determines efficiently the best classification with respect to the criterion, is normally an iterative algorithm. In another parametric approach, a mathematical form is assumed to express the distribution of the given data. A typical example is the summation of normal distributions. In this case, the clustering problem consists of finding the parameter values for this distribution which best fit the data.
`
On the other hand, neither clustering criteria nor assumed mathematical forms for the distribution are used in the nonparametric approach. Instead, samples are separated according to the valley of the density function. The valley may be considered as the natural boundary which separates the modes of the distributions. This boundary could be complex and not expressible by any parametric form.
`
In addition, clustering may be viewed as the selection of representatives. In general, a density function may be approximated by the Parzen density estimate around the representatives. Then, we may try to reduce the number of representatives while maintaining the degree of approximation. An iterative procedure to choose the representatives is discussed in this chapter.
`
`11.1 Parametric Clustering
`
In this section, we will present, first, a general-purpose clustering algorithm based on a generalized criterion. Then, the discussion for a specific criterion follows.
`
`
`
`
`General Clustering Algorithm
`
The clustering algorithm developed in this section applies to a wide range of criteria. However, it is necessary to specify the form of the criterion as well as some other details of the clustering problem at the outset.
`
Clustering criterion: Assume that we want to classify N samples, X_1, …, X_N. These vectors are not denoted as random vectors because, in the clustering problem, they are assumed to be fixed and known. Each sample is to be placed into one of L classes, ω_1, …, ω_L, where L is assumed to be given. The class to which the ith sample is assigned is denoted ω_{k_i} (i = 1, …, N). For convenience, let the value of k_i be an integer between 1 and L. A classification Ω is a vector made up of the ω_{k_i}'s, and a configuration X* is a vector made up of the X_i's, that is,

Ω = [ω_{k_1} … ω_{k_N}]^T    (11.1)

and

X* = [X_1^T … X_N^T]^T.    (11.2)

The clustering criterion J is a function of Ω and X*, and can be written as

J = J(ω_{k_1}, …, ω_{k_N}; X_1, …, X_N) = J(Ω; X*).    (11.3)
`
By definition, the best classification Ω_0 satisfies either

J(Ω_0; X*) = max_Ω J(Ω; X*)  or  min_Ω J(Ω; X*),    (11.4)
`
`depending on the criterion. For the remainder of this section, we will discuss
`only the minimization problem, since the maximization is similar.
`
Example 1: Six vectors, X_1, …, X_6, of Fig. 11-1 are given. The problem is to find class assignments of these vectors to one of two classes so as to minimize a criterion. Let J = tr(S_m^{-1} S_w) be the criterion, where S_w and S_m are the within-class and mixture scatter matrices defined in (10.1) and (10.4). For each classification, for example {X_1, X_2, X_5} ∈ ω_1 and {X_3, X_4, X_6} ∈ ω_2 as shown by dotted lines, or {X_1, X_2, X_3} ∈ ω_1 and {X_4, X_5, X_6} ∈ ω_2 as shown by solid lines, the mean vectors and covariance matrices for ω_1 and ω_2 are estimated, and S_w, S_m, and J can be computed.
`
`
`
Fig. 11-1 An example of clustering.
`
Note that S_w varies depending on the class assignment but S_m does not. The class assignment by the solid lines would give the minimum J (the smallest within-class scatter) among all possible combinations of class assignments.
`
Clustering algorithm: For a given clustering problem, the configuration X* is fixed. The clustering algorithm varies only the classification Ω. Ordinary steepest descent search techniques cannot be applied because of the discrete and unordered nature of Ω. Still, it is possible to define an iterative search algorithm based on variations in J with respect to variations in Ω. Suppose, at the ℓth iteration, the classification is Ω(ℓ), where

Ω(ℓ) = [ω_{k_1(ℓ)} … ω_{k_N(ℓ)}]^T.    (11.5)

If the ith sample is reclassified from its present class k_i(ℓ) to class j, the clustering criterion varies by an amount ΔJ(i, j, ℓ), which is given by

ΔJ(i, j, ℓ) = J(ω_{k_1(ℓ)}, …, ω_j, …, ω_{k_N(ℓ)}; X*) − J(Ω(ℓ); X*).    (11.6)

If ΔJ(i, j, ℓ) is negative, reclassification of the ith sample to class j yields a classification that is improved in terms of J. This fact is the basis of the following algorithm:
`
`
`
`
(1) Choose an initial classification Ω(0).

(2) Given the ℓth classification Ω(ℓ), calculate ΔJ(i, j, ℓ) for j = 1, 2, …, L and i = 1, 2, …, N.

(3) For i = 1, 2, …, N, reclassify the ith sample to class t, where

ΔJ(i, t, ℓ) = min_j ΔJ(i, j, ℓ).    (11.7)

Decide ties involving j = k_i(ℓ) in favor of the present classification. Decide other ties arbitrarily. This step forms Ω(ℓ+1).

(4) If Ω(ℓ+1) ≠ Ω(ℓ), return to Step (2). Otherwise the algorithm is complete.
`
The algorithm is simply the iterative application of a classification rule based on the clustering criterion. Here, we adopted a procedure in which all of the samples are reclassified simultaneously at each iteration. An alternative approach is to reclassify each sample one at a time, resulting in similar but slightly different clusters. In these iterative procedures, there is no guarantee that the algorithm converges. Even if it does converge, we cannot be certain that the absolute minimum of J has been obtained. Therefore, we must depend on empirical evidence to justify the algorithm.

In contrast to these potential weaknesses, the algorithm described above is very efficient. Like any good search algorithm, it surveys a small subset of classifications in a systematic and adaptive manner. It is easily programmable for any criterion of the form of (11.3).
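For concreteness, Steps (1) through (4) can be sketched in code. This is a minimal sketch, not the book's implementation: the names `cluster_by_criterion` and `within_scatter` are hypothetical, and the criterion shown is the simple average within-class scatter rather than the general J of (11.3).

```python
import numpy as np

def within_scatter(labels, X):
    """A simple criterion J: average within-class scatter of the samples."""
    return sum(np.sum((X[labels == r] - X[labels == r].mean(axis=0)) ** 2)
               for r in np.unique(labels)) / len(X)

def cluster_by_criterion(X, L, J, n_iter=100, seed=0):
    """Steps (1)-(4): reclassify all samples simultaneously to reduce J."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, L, size=len(X))        # (1) initial classification
    for _ in range(n_iter):
        new_labels = labels.copy()
        for i in range(len(X)):                     # (2) evaluate dJ(i, j, l) for every j
            deltas = []
            for j in range(L):
                trial = labels.copy()
                trial[i] = j
                deltas.append(J(trial, X) - J(labels, X))
            best = int(np.argmin(deltas))
            if deltas[best] < deltas[labels[i]]:    # (3) ties favor the present class
                new_labels[i] = best
        if np.array_equal(new_labels, labels):      # (4) stop when nothing changes
            return labels
        labels = new_labels
    return labels
```

Any criterion of the form (11.3) can be substituted for `within_scatter`; the search itself never needs the criterion's analytic form, only its value for a trial classification.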
`
Determining the number of clusters: So far, we have ignored the problem of determining the number of classes L, and have assumed that L is given. However, in practice, L is rarely known. We not only need to determine L but also the proper class assignment. For that purpose, we may run the clustering procedure for various values of L, and find the best classification for each value of L. Let J_L* be the optimal criterion value for a given L after the clustering procedure has converged. If J_L* decreases as L increases, and either reaches the minimum point at L_0 or becomes flat after L_0, then we may use L_0 as the proper number of classes. Unfortunately, many of the popular criteria do not have this favorable property. For example, consider J = tr(S_m^{-1} S_w) of Example 1. As L increases, samples are divided into smaller groups, and consequently the within-class scatter becomes smaller. This means that J might decrease
`
`
`
`
monotonically with L. Finally, when L becomes N, the total number of samples, each class consists of one sample only, and there is no within-class scatter (J = 0). Although L = N minimizes this criterion, this is obviously not the solution we want.
`
It appears, therefore, that some external method of controlling L is necessary. Unfortunately, no unified theory for determining L has been fully developed and accepted.
`
Merging and splitting: After a number of classes is obtained, we may consider the merging of two classes into a single class, or the splitting of a class into a number of classes.
`
Basically, merging is desirable in two instances. The first is when two classes are very similar. The similarity may be measured in a number of ways. The Euclidean distance between two mean vectors is the simplest measure but is not an accurate one. The Bhattacharyya distance of (3.152), based on the normal assumption, could be a reasonable compromise between simplicity and accuracy. The second instance in which merging may be appropriate is when the population of a class is very small. In this case, the class may be merged with the most similar class, even when the similarity is not very close.
`
Deciding whether or not a class is to be split is a more complex problem. Too large a population suggests that the class is a candidate for a split. Multimodal and nonsymmetric distributions, as well as distributions with large variances along one direction, are also candidates for splitting. In order to identify these characteristics, various tests are necessary. Splitting a class may be carried out by applying the clustering procedure to the samples in the class.
`
It goes without saying that this kind of merging and splitting is very heuristic. Its merit lies in the fact that it is efficient and requires a minimum of human interaction.
`
Multiple dichotomy: It is somewhat more satisfying to adopt an approach which depends entirely on the clustering criterion J. One such approach has been suggested [1] and is outlined as follows.

Suppose that for L = 2 there are several distinct classifications which yield a nearly minimal value of J. If these classifications differ only in the classification of a few samples, there is no reason to suppose the existence of
`
`
`
`
`more than two classes. If the classifications are grossly different, however,
`then it is evident that several classes are actually present.
`
Fig. 11-2 Multiple dichotomy of three classes of samples.
`
Figure 11-2 illustrates two possible dichotomies of a collection of samples apparently containing three classes A_1, A_2, and A_3. One dichotomy separates the samples into A_1 ∪ A_2 and A_3, while the other results in the two classes A_1 and A_2 ∪ A_3. Thus, A_3, A_1, and the complement of A_1 ∪ A_3, which is A_2, are seen to be distinct classes (Ā denotes the complement of the set A).
Now let us consider the more general case where there are k dichotomies of a collection of samples containing L classes. Each dichotomy separates these samples into two groups. Let S_{ij} be the set of all samples assigned to group j by the ith dichotomy, for j = 1, 2 and i = 1, …, k. Assume that the following two conditions hold.

(a) A dichotomy places each class into only one group; that is, classes are not split by dichotomies.

(b) For each pair of classes, there is at least one dichotomy which does not assign the two classes to the same group.

Select one group from each of the k dichotomies and form a subset C as the intersection of these k groups. By condition (a), if C contains one sample, then it must contain all of the samples of that class. By condition (b), for any other class, there is at least one of the k selected groups to which that class does not belong. Therefore, if C is nonempty, then C contains one and only one class. Hence, in order to construct all the L classes, we consider the 2^k subsets of the form
`
`
`
`
C(j_1, …, j_k) = ∩_{i=1}^{k} S_{i j_i},    (11.8)

where each j_i equals 1 or 2. Each nonempty C is a class. In our example, we have
`
S_11 = A_1 ∪ A_2,  S_12 = A_3,  S_21 = A_1,  S_22 = A_2 ∪ A_3,    (11.9)

so that

C(1, 1) = S_11 ∩ S_21 = A_1,
C(1, 2) = S_11 ∩ S_22 = A_2,
C(2, 1) = S_12 ∩ S_21 = ∅,
C(2, 2) = S_12 ∩ S_22 = A_3,    (11.10)
`
`which is in agreement with our earlier argument.
`
The multiple dichotomy approach has a stronger theoretical basis than the merging and splitting procedure. Further, it relies on no numerical criterion other than J. However, implementation of the multiple dichotomy approach can be difficult, especially when the true number of classes is large. In addition, the conditions (a) and (b) mentioned above are rarely satisfied in practice. These difficulties may be overcome somewhat by imposing a hierarchical structure on the classes. The samples are divided into a small number of classes, each class is divided further, and so on. Under this strategy, we need not find every possible dichotomy of the entire collection of samples.
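The construction of (11.8) through (11.10) is easy to mechanize. In the sketch below, the function name is hypothetical and the dichotomy groups are assumed to be available as sets of sample identifiers; the classes are exactly the nonempty intersections obtained by selecting one group from each dichotomy.

```python
from itertools import product

def classes_from_dichotomies(dichotomies):
    """Construct classes as the nonempty intersections C(j_1,...,j_k)
    of one group S_ij selected from each dichotomy (Eq. 11.8).

    dichotomies : list of (S_i1, S_i2) pairs, each group a set of samples.
    """
    classes = []
    for choice in product(range(2), repeat=len(dichotomies)):
        C = set.intersection(*(d[j] for d, j in zip(dichotomies, choice)))
        if C:                      # each nonempty intersection is one class
            classes.append(C)
    return classes

# The three-class example of Fig. 11-2, with A1, A2, A3 encoded as sample ids
A1, A2, A3 = {1, 2}, {3, 4}, {5, 6}
dichotomies = [(A1 | A2, A3),      # 1st dichotomy: S_11, S_12
               (A1, A2 | A3)]      # 2nd dichotomy: S_21, S_22
print(classes_from_dichotomies(dichotomies))   # recovers A1, A2, A3
```

The empty intersection S_12 ∩ S_21 of (11.10) is discarded automatically, mirroring the rule that only nonempty C's are classes.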
`
At this point, we depart from the general discussion of the clustering algorithm. Obviously, the discussion is incomplete. We have a basis, however, to develop and implement clustering procedures. Therefore, let us turn our attention to the detailed derivations of clustering procedures.
`
Nearest Mean Reclassification Algorithm [2-5]
`
In this section, we will discuss clustering procedures based on parameters such as mean vectors and covariance matrices. We will show that the criteria of class separability discussed in Chapter 10 play an important role, and that the iterative algorithms of the previous section take on simple forms.
`
`
`
`
Criteria: Clustering can be considered as a technique to group samples so as to maximize class separability. Then, all of the criteria which were discussed in Chapter 10 may be used as clustering criteria. In this section only functions of scatter matrices are discussed, due to the following reasons:

(1) The extension to multiclass problems is straightforward. In this respect, the Bhattacharyya distance has a severe disadvantage, since it can be applied only to two-class problems.

(2) Most clustering techniques are based on the scatter of mean vectors. Finding clusters based on covariance-differences is extremely difficult, unless some mathematical structure is imposed on the distribution. Therefore, the functions of scatter matrices fit well to clustering problems.
`
`(3) The simplicity of the criteria is a significant advantage, because in
`clustering we have the additional complexity of unknown class assignment.
`
For feature extraction, we could choose any combination of S_b, S_w, and S_m as S_1 and S_2 to form a criterion J = tr(S_2^{-1} S_1). However, for clustering it is preferable to use S_m as S_2, because S_m is independent of class assignment. It would be too complicated if we had to recompute the inverse of S_2 each time the class assignment was altered in iteration. Therefore, our choice is limited to either tr(S_m^{-1} S_b) or tr(S_m^{-1} S_w). These two criteria are the same, because tr(S_m^{-1} S_b) = tr{S_m^{-1}(S_m − S_w)} = n − tr(S_m^{-1} S_w). In this chapter, we will use J = tr(S_m^{-1} S_w).
Another important consideration in selecting criteria for clustering is to ensure that the clustering procedures give the same classification for a given set of samples regardless of the coordinate system of these samples. The chosen criterion, J = tr(S_m^{-1} S_w), satisfies this condition, since the criterion is invariant under any nonsingular linear transformation.
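The invariance claim can be spot-checked numerically. In the sketch below (the helper name `scatter_criterion` is hypothetical), J = tr(S_m^{-1} S_w) is computed from labeled data before and after a nonsingular linear transformation of the samples; the two values should agree up to round-off.

```python
import numpy as np

def scatter_criterion(X, labels):
    """J = tr(S_m^{-1} S_w) from samples X and a class assignment."""
    M0 = X.mean(axis=0)
    Sm = (X - M0).T @ (X - M0) / len(X)        # mixture scatter matrix
    Sw = np.zeros_like(Sm)                     # within-class scatter matrix
    for r in np.unique(labels):
        Xr = X[labels == r]
        Mr = Xr.mean(axis=0)
        Sw += (Xr - Mr).T @ (Xr - Mr) / len(X)
    return np.trace(np.linalg.inv(Sm) @ Sw)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
labels = rng.integers(0, 2, size=40)
A = rng.normal(size=(3, 3)) + 3 * np.eye(3)    # a (almost surely) nonsingular map
J1 = scatter_criterion(X, labels)
J2 = scatter_criterion(X @ A.T, labels)        # same samples, new coordinates
print(np.isclose(J1, J2))                      # invariant under the transformation
```

The reason is visible in the algebra: both S_w and S_m transform as S → A S A^T, so A cancels inside tr(S_m^{-1} S_w).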
`
Clustering algorithm: Let us assume that M_0 = 0 and S_m = I without loss of generality. If the given samples do not satisfy these conditions, we can shift the coordinate origin and whiten the data with respect to S_m. Then, using (10.1) the criterion is rewritten as
`
`
`
J = tr(S_w) = Σ_{r=1}^{L} (N_r/N) (1/N_r) Σ_{j=1}^{N_r} (X_j^{(r)} − M_r)^T (X_j^{(r)} − M_r)
  = (1/N) Σ_{r=1}^{L} Σ_{j=1}^{N_r} ||X_j^{(r)} − M_r||^2,    (11.11)

where X_j^{(r)} denotes a sample assigned to class r, N_r is the number of samples in class r, and M_r is the class-r mean.
`
Changing the class assignment of X_i from the current class k_i to class j at the ℓth iteration, we delete from (11.11) the term ||X_i − M_{k_i}(ℓ)||^2 and insert a new term ||X_i − M_j(ℓ)||^2. Thus,

ΔJ(i, j, ℓ) = (1/N) { ||X_i − M_j(ℓ)||^2 − ||X_i − M_{k_i}(ℓ)||^2 }.    (11.12)

Since the second term of (11.12) is independent of j, the reclassification of X_i at the ℓth iteration can be carried out by

||X_i − M_t(ℓ)|| = min_j ||X_i − M_j(ℓ)||  ⟹  X_i ∈ ω_t.    (11.13)
`
In words, the algorithm becomes:

(1) Choose an initial classification, Ω(0), and calculate M_1(0), …, M_L(0).

(2) Having calculated the sample mean vectors M_1(ℓ), …, M_L(ℓ) at the ℓth iteration, reclassify each X_i according to the nearest M_j(ℓ).

(3) If the classification of any X_i is changed, calculate the new sample mean vectors M_1(ℓ+1), …, M_L(ℓ+1) for the new class assignment, and repeat from Step (2). Otherwise, stop.

This algorithm is called the nearest mean reclassification rule.
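Steps (1) through (3) amount to the familiar k-means-style iteration. A minimal sketch follows; the function name is hypothetical, and it assumes no cluster becomes empty during the iteration.

```python
import numpy as np

def nearest_mean_reclassification(X, L, labels, max_iter=100):
    """Steps (1)-(3): reclassify each sample to its nearest class mean
    until the classification no longer changes."""
    for _ in range(max_iter):
        # Step (2): sample mean of each class under the current assignment
        means = np.array([X[labels == r].mean(axis=0) for r in range(L)])
        new_labels = np.argmin(
            np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2), axis=1)
        if np.array_equal(new_labels, labels):
            break                   # Step (3): no sample changed class
        labels = new_labels
    return labels
```

Because each sample goes to its nearest mean, the resulting boundary is exactly the piecewise linear bisector structure described below for Fig. 11-3.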
`
Figure 11-3 shows how the iterative process works. At the ℓth step, samples are divided into three clusters, and their sample means, the M_i(ℓ)'s, are computed. All samples are now reclassified according to the nearest means. That is, the new boundary is piecewise linear, bisecting each pair of M_i(ℓ)'s. In Fig. 11-3, there are three clearly separated clusters. We can see that the boundary is indeed improved by this operation.
`
From the above discussion, some properties of the nearest mean reclassification algorithm become evident. They are:

(1) Clusters are divided by piecewise linear bisectors. Only the means contribute to determining the boundary; covariance matrices do not affect the boundary.
`
`
`
Fig. 11-3 An example of the nearest mean reclassification algorithm.
`
(2) The number of clusters must be preassigned.

(3) The initial classification, Ω(0), may be given randomly. No matter how Ω(0) is given, the M_i(0)'s are computed and the reclassification of samples according to the nearest M_i(0)'s results in a piecewise linear boundary. This is equivalent to selecting the number of vectors, the M_i(0)'s, initially according to the number of clusters. Random class assignment does not impose any extra instability on the algorithm.
`
In order to verify the algorithm, the following experiment was conducted.
`
Experiment 1: One hundred samples were generated from each class of Data I-Λ, and mixed together to form 200 samples. Then, these samples were classified into two clusters. Table 11-1 shows the confusion matrix of this experiment [5]. All 100 samples from ω_1, together with 19 samples from ω_2, were assigned to one cluster, and 81 samples from ω_2 were assigned to the other cluster. The error rate is 19/200 = 0.095. Recall that we got 5% error for this data by designing the optimum linear classifier in Chapter 4. Considering the fact that no covariance information was used in this clustering algorithm, the error rate of 9.5% is reasonable. Furthermore, since all error samples came from one class, we could improve the error rate simply by adjusting the decision threshold.
`
Convergence [5]: The nearest mean reclassification algorithm is not guaranteed to converge. In this section, we will discuss the conditions under
`
`
`
TABLE 11-1

CONFUSION MATRIX FOR THE NEAREST MEAN
RECLASSIFICATION ALGORITHM

                    Assigned class
                      1      2
Actual     1        100      0
class      2         19     81
`
which the separating hyperplane converges for two normal distributions with equal covariance matrices.
`
Let us assume that the two normal distributions are N_X(M_1, Σ_1) and N_X(M_2, Σ_2) after normalization, and that Σ_1 = Σ_2 = Σ. The normalization makes S_m = I and M_0 = 0. The Bayes classifier in this case becomes linear as

(M_2 − M_1)^T Σ^{-1} X + c = 0,    (11.14)

where c is a constant. Since S_m = I = Σ + P_1 M_1 M_1^T + P_2 M_2 M_2^T and M_0 = 0 = P_1 M_1 + P_2 M_2,

Σ = I − (P_2/P_1) M_2 M_2^T,    (11.15)

M_2 − M_1 = (1/P_1) M_2.    (11.16)

Using (2.160),

Σ^{-1} = I + [(P_2/P_1) M_2 M_2^T] / [1 − (P_2/P_1) M_2^T M_2].    (11.17)

Substituting (11.16) and (11.17) into (11.14), the Bayes classifier is

[1 / (P_1 − P_2 M_2^T M_2)] M_2^T X + c = 0.    (11.18)
`
Thus, the optimum hyperplane for the equal covariance case is always in the direction of M_2, which is the same as that of the mean-difference vector M_2 − M_1. This property, which the original coordinate system does not have, is a significant advantage of the normalized coordinate system.
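This direction property can be spot-checked numerically. The numbers below are hypothetical; their only requirement is that Σ of (11.15) stays positive definite, i.e. (P_2/P_1)·M_2^T M_2 < 1.

```python
import numpy as np

P1, P2 = 0.6, 0.4
M2 = np.array([0.5, 0.3, 0.1])
M1 = -(P2 / P1) * M2                           # from M0 = P1 M1 + P2 M2 = 0
Sigma = np.eye(3) - (P2 / P1) * np.outer(M2, M2)   # Eq. (11.15)

# Normal of the Bayes hyperplane: Sigma^{-1} (M2 - M1)
direction = np.linalg.solve(Sigma, M2 - M1)

# If the hyperplane is in the direction of M2, the componentwise ratio
# direction / M2 is a constant vector.
print(direction / M2)
```

The constant ratio is exactly what (11.17) and (11.18) predict: the rank-one correction in Σ^{-1} maps M_2 back onto a multiple of itself.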
`
For the equal covariance case, we can show that the algorithm converges to M_2 − M_1 from a wide range of initial classifications. After a given iteration, samples are separated by a hyperplane whose direction is, say, V (||V|| = 1), as shown in Fig. 11-4. Also, the position of the hyperplane is specified by ℓ_1 and ℓ_2, which are the distances of M_1 and M_2 from the hyperplane.

Fig. 11-4 Separation of two distributions.

Let D_p and D_n be the centers of probability mass for the positive and negative sides of the hyperplane. Then, following the nearest mean reclassification rule, the direction of the succeeding hyperplane will be D_p − D_n. So, our convergence proof is to show that the angle between D_p − D_n and M_2 − M_1 is smaller than the angle between V and M_2 − M_1.

Since the hyperplane separates each distribution into two parts, the positive and negative sides, we have four probability masses R_{ij} (i = 1, 2; j = p, n), as shown in Fig. 11-4. Let D_{ij} and q_{ij} be the mean vectors and populations of these probability masses. Then,
`
`
`
D_{ip} = M_i + [a_i / (σ_i b_i)] Σ_i V,    (11.19)

D_{in} = M_i − [a_i / (σ_i (1 − b_i))] Σ_i V,    (11.20)

q_{ip} = P_i b_i,    (11.21)

q_{in} = P_i (1 − b_i),    (11.22)

where

a_i = (1/√(2π)) ∫_{±ℓ_i/σ_i}^{∞} ξ exp(−ξ^2/2) dξ = (1/√(2π)) exp(−ℓ_i^2 / 2σ_i^2),    (11.23)

b_i = (1/√(2π)) ∫_{±ℓ_i/σ_i}^{∞} exp(−ξ^2/2) dξ = 1 − Φ(±ℓ_i/σ_i),    (11.24)

σ_i^2 = V^T Σ_i V,    (11.25)
`
and Φ is the normal error function. The sign + or − is selected, depending on whether M_i is located in R_{in} or R_{ip}. The details of the derivation of (11.19) through (11.25) are left to the reader. However, the following information could be helpful in deriving (11.19) through (11.25). Since X is normal, y = V^T X is also normal with the variance of (11.25). In the y-space, the probability mass for the positive side, b_i, can be computed by (11.24). The vector D_{ij} − M_i has the direction of Σ_i V, and the projections of D_{ij} − M_i on V (j = p, n) are V^T(D_{ip} − M_i) = a_i σ_i / b_i and V^T(D_{in} − M_i) = −a_i σ_i / (1 − b_i). These are obtained by computing the expected values of y for the positive and negative sides. From the D_{ij} and q_{ij}, D_p and D_n are obtained as

D_p = (q_{1p} D_{1p} + q_{2p} D_{2p}) / (q_{1p} + q_{2p}),    (11.26)

D_n = (q_{1n} D_{1n} + q_{2n} D_{2n}) / (q_{1n} + q_{2n}).    (11.27)

Substituting (11.19) through (11.22) into (11.26) and (11.27),
`
`
`
`
D_p − D_n = [P_1 b_1 M_1 + P_2 b_2 M_2 + (P_1 a_1/σ_1) Σ_1 V + (P_2 a_2/σ_2) Σ_2 V] / {(P_1 b_1 + P_2 b_2)[1 − (P_1 b_1 + P_2 b_2)]}.    (11.28)

For the equal covariance case,

D_p − D_n = [P_1 b_1 M_1 + P_2 b_2 M_2 + (1/σ)(P_1 a_1 + P_2 a_2) Σ V] / {(P_1 b_1 + P_2 b_2)[1 − (P_1 b_1 + P_2 b_2)]}.    (11.29)
`
Furthermore, under the normalization of S_m = I and M_0 = 0, Σ and M_2 − M_1 are expressed as functions of M_2 as in (11.15) and (11.16). Therefore, (11.29) becomes

D_p − D_n = c_1 M_2 + c_2 V,    (11.30)

where

c_1 = [P_2(b_2 − b_1) − (1/σ)(P_2/P_1)(P_1 a_1 + P_2 a_2) M_2^T V] / {(P_1 b_1 + P_2 b_2)[1 − (P_1 b_1 + P_2 b_2)]},    (11.31)

c_2 = (1/σ)(P_1 a_1 + P_2 a_2).    (11.32)
`
The normal of the new hyperplane has a component in the direction of V and another in the direction of M_2. If the coefficient of M_2, c_1, has the same sign as M_2^T V, the successive hyperplane becomes more nearly parallel to M_2. Since c_2 and the denominator of c_1 are positive, we need to show that the numerator of c_1 and M_2^T V have the same sign. We examine only the case where M_2^T V > 0; the discussion for M_2^T V < 0 is similar. For M_2^T V > 0, we see from Fig. 11-4 that

ℓ_1 + ℓ_2 = (M_2 − M_1)^T V = (1/P_1) M_2^T V.    (11.33)

Using (11.33), the condition for convergence becomes

b_2 − b_1 − [(ℓ_1 + ℓ_2)/σ] (P_1 a_1 + P_2 a_2) > 0.    (11.34)
`
It is easily seen that the inequality of (11.34) is not satisfied for certain combinations of parameters. However, the region of parameters where (11.34) is satisfied can be calculated numerically. The result is shown in Fig. 11-5.

Fig. 11-5 Region of convergence.

Equations (11.23) and (11.24) with (11.34) show that we have three parameters in (11.34): ℓ_1/σ, ℓ_2/σ, and P_1 (P_2 = 1 − P_1), or, equivalently, the parameters γ and k defined in (11.35). In Fig. 11-5, the convergence regions of γ and P_1 are plotted for various values of k [5]. The figure shows that convergence is quite likely in practice, except for either extreme P_1's or γ's.
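The convergence behavior can also be examined by simulation rather than through (11.34). The sketch below draws two equal-covariance Gaussian clouds (all settings hypothetical), starts the hyperplane normal V at 45 degrees from M_2 − M_1, and applies the D_p − D_n update using sample means in place of the probability-mass centers.

```python
import numpy as np

rng = np.random.default_rng(7)
M1, M2 = np.array([-2.0, 0.0]), np.array([2.0, 0.0])
X = np.vstack([rng.normal(M1, 1.0, size=(500, 2)),    # class 1 samples
               rng.normal(M2, 1.0, size=(500, 2))])   # class 2 samples

def angle(u, v):
    return np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = M2 - M1
V = np.array([1.0, 1.0]) / np.sqrt(2)   # initial normal, 45 degrees off target
for _ in range(5):
    side = X @ V > 0                    # hyperplane through the mixture mean (~0)
    Dp, Dn = X[side].mean(axis=0), X[~side].mean(axis=0)
    V = (Dp - Dn) / np.linalg.norm(Dp - Dn)   # succeeding hyperplane normal
print(np.degrees(angle(V, target)))     # angle shrinks from 45 toward 0
```

With well-separated, balanced classes this sits deep inside the convergence region of Fig. 11-5; pushing P_1 toward 0 or 1 in the simulation is an easy way to see the failure cases the text warns about.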
`
Branch and bound procedure [6]: The nearest mean reclassification algorithm works fine for many applications. However, there is no guarantee of the convergence of the iterative process. Also, the process might stop at a locally minimum point and fail to find the globally minimum point.

Since assigning a class to each sample is a combinatorial problem, the branch and bound procedure discussed in Chapter 10 may be applied to find the optimum class assignment.
`
Figure 11-6 shows a solution tree for the clustering problem with four samples and three clusters. In general, there are L^N different Ω's for classifying N sample vectors into L clusters. However, since the label of each cluster may be chosen arbitrarily, each classification could have several different expressions for Ω. For example, Ω = [1 1 2 2]^T and Ω = [2 2 3 3]^T are the same classification, both indicating that X_1 and X_2 are in one cluster and X_3 and X_4 are in one of the other clusters. In order to eliminate this duplication, we assign the first sample X_1 to cluster 1, the second sample X_2 to either cluster 1 or 2, and so on, as shown in Fig. 11-6.

Fig. 11-6 A solution tree for clustering.
`
In order for the branch and bound procedure to work effectively, we need to have a criterion which satisfies the monotonicity condition. Let J_m(Ω_m) be the criterion to be minimized for Ω_m = [ω_{k_1} … ω_{k_m}]^T, where the subscript m indicates the number of sample vectors involved. Then, the monotonicity condition is stated as

J_m(Ω_m) ≤ J_{m+1}(Ω_{m+1}).    (11.36)
`
`That is, the J of a node is smaller than the J'