11 Clustering

Fig. 11-15 An example of the valley-seeking clustering.
into $\omega_6$ results in a shift of the boundary toward the $\omega_5$-side, as the arrow indicates. This is equivalent to saying that the direction of the density gradient is estimated by using the numbers of samples on both sides of the boundary, and the boundary is moved toward the lower density side. Applying this process to all boundaries repeatedly, the leftmost boundary of Fig. 11-15 moves out to $-\infty$, leaving the $\omega_1$-cluster empty, the second and third leftmost boundaries merge into one, making the $\omega_3$-cluster empty, and so on. As a result, at the end of iteration, only $\omega_2$, $\omega_4$, and $\omega_6$ keep many samples and the others become empty. The procedure works in the same way even in a high-dimensional space.
A number of comments can be made about this iterative valley-seeking procedure. The procedure is nonparametric, and divides samples according to the valleys of a density function. The density gradient is estimated, but in a crude way. The number of clusters must be preassigned, but we can always assign a larger number of clusters than we actually expect to have. Many of the initial clusters could become empty, and only the true clusters separated by the valleys will keep samples. As far as computation time is concerned, it takes a lot of computer time to find the neighbors of each sample and form the table of Fig. 11-14. However, this operation is common to all nonparametric clustering procedures, including the graph theoretic approach. The iterative process of this algorithm revises only the class assignment according to the majority of classes in $\Gamma(X)$. This operation does not take much computation time.
The volume of $\Gamma(X)$ affects the performance of this algorithm, just as it did in the graph theoretic approach. The optimum volume should be determined experimentally, as was done for the graph theoretic approach.
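As a concrete illustration, the reassignment step can be coded directly from this description. The following minimal Python sketch is not the book's implementation; it assumes $\Gamma(X)$ is a Euclidean hypersphere of radius r and that an initial assignment of integer cluster labels is given.

```python
import numpy as np

def valley_seek(X, labels, r, max_iter=100):
    """Iterative valley-seeking: reassign each sample to the majority
    class among the samples inside its local region Gamma(X_i), taken
    here as a hypersphere of radius r.  labels holds integer cluster
    indices; clusters separated by a valley gradually become empty."""
    N = len(X)
    # The costly step: build the neighbor table (cf. Fig. 11-14) once.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    nbrs = [np.flatnonzero((d2[i] <= r * r) & (np.arange(N) != i))
            for i in range(N)]
    labels = np.asarray(labels).copy()
    for _ in range(max_iter):
        new = labels.copy()
        for i in range(N):
            if nbrs[i].size:                 # majority vote within Gamma(X_i)
                new[i] = np.bincount(labels[nbrs[i]]).argmax()
        if np.array_equal(new, labels):      # assignment stable: done
            break
        labels = new
    return labels
```

Preassigning more clusters than needed is harmless here: labels that lose every majority vote simply disappear from the returned assignment.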
Experiment 4: Seventy-five samples per class were generated according to (11.82), where $n_1$ and $n_2$ are independent and normally distributed with zero mean and unit variance for both $\omega_1$ and $\omega_2$, $[m_1\ m_2]$ is $[0\ 0]$ for $\omega_1$ and $[0\ {-20}]$ for $\omega_2$, and $\theta$ is normally distributed with $E\{\theta|\omega_1\} = \pi$, $E\{\theta|\omega_2\} = 0$, and $\mathrm{Var}\{\theta|\omega_1\} = \mathrm{Var}\{\theta|\omega_2\} = (\pi/4)^2$. After the data was whitened with respect to the mixture covariance matrix, both the graph theoretic and iterative valley-seeking algorithms were applied, resulting in the same clustering result, as shown in Fig. 11-16 [13-14].
Fig. 11-16 Classification of a two-class example.
Table 11-3 shows the selected radius for $\Gamma(X)$, and the number of iterations required to reach the final clusters in the iterative valley-seeking algorithm [14].
TABLE 11-3

PERFORMANCE OF THE VALLEY-SEEKING ALGORITHM

Experiment   Number of samples   Number of clusters   Radius of Γ(X)   Number of iterations
    4               150                  2                  1.0                  8
    5               150                  3                  0.75                10
Experiment 5: Fifty samples per class were generated from three classes. Two of them were the same as in Experiment 4, except that $[m_1\ m_2] = [20\ 0]$ for $\omega_2$ instead of $[0\ {-20}]$. The third distribution was normal with the mean and covariance matrix of (11.83). Again, after the data was whitened with respect to the mixture covariance matrix, both the graph theoretic and iterative valley-seeking algorithms produced the same clustering result, as shown in Fig. 11-17 [13-14].
Fig. 11-17 Classification of a three-class example.
The radius of $\Gamma(X)$ and the number of iterations in the latter algorithm are listed in Table 11-3 [14].
General comments: We have discussed both parametric and nonparametric clustering techniques. The question here is which technique is better under what conditions. Generally speaking, nonparametric techniques do not require knowledge of the number of clusters beforehand. This is certainly true for the graph theoretic approach. Even for the iterative valley-seeking procedure, we can preassign a larger number of clusters than what is actually needed, and let the procedure select the necessary number of clusters automatically. Therefore, if we do not have any a priori knowledge about the data structure, it is most natural to adopt a nonparametric clustering technique and find out how the data is naturally divided into clusters.
However, nonparametric procedures are in general very sensitive to the control parameters, especially the size of the local region. Therefore, it is necessary to run experiments for a wide range of sizes, and the results must be carefully examined. Also, nonparametric procedures are normally very computationally intensive.
Furthermore, nonparametric clustering techniques have two fundamental flaws, described as follows.

(1) We cannot divide a distribution into a number of clusters unless the valley actually exists. For example, the $\omega_1$-distribution of Fig. 11-9 could be obtained as the result of the valley-seeking procedure, but the distribution cannot be divided into 2 or 3 clusters by any nonparametric method even if it is desirable to do so. When a distribution is wrapped like the $\omega_1$-distribution of Fig. 11-9, it is sometimes preferred to decompose the distribution into several normal distributions for further analysis of the data structure or for designing a classifier.
On the other hand, the parametric procedures do not depend on the natural boundary of clusters, but depend only on the criterion. With a preassigned number of clusters, the procedures seek the boundary to optimize the criterion value. Therefore, we can divide the $\omega_1$-distribution of Fig. 11-9 into 2, 3, or 4 clusters as we like. After examining the clustering results for various numbers of clusters, we may decide which number of clusters is most appropriate for a particular application. Previously, we stated that it is a disadvantage of parametric clustering techniques to have to know the number of clusters beforehand. But that property may sometimes work as an advantage, as discussed above.
(2) As seen in Fig. 11-7(b), the resulting clusters of nonparametric clustering procedures contain the tails of other distributions and do not contain their own tails. Thus, the mean and covariance matrix of each cluster do not represent the true ones of the underlying distribution. In order to obtain proper parameters of the underlying distributions, we need to impose a parametric structure such as a summation of normal distributions.
11.3 Selection of Representatives
In the previous sections, we have discussed algorithms to find clusters from a distribution. A similar but slightly different problem is to reduce the number of samples while maintaining the structure of the distribution. This may be viewed as selecting a small number of representatives from a distribution. Besides clustering, the reduction of sample size has a number of applications. A typical example is to divide a set of available samples into design and test sets. In this case, we must assure that the design and test distributions are similar.
Nonparametric Data Reduction
Generally, it is not appropriate to assume a mathematical structure when the sample size is reduced. Therefore, a nonparametric procedure must be adopted to guide the operation.
Data reduction algorithm [16-17]: Let us study the Parzen approach to the problem of sample size reduction. Given N samples drawn from p(X), we wish to select Q samples (Q < N) so that the Parzen density estimates for the N sample set and the Q sample set are close.
Let $\hat{p}_N(X)$ be the density estimate based on the N samples. Then

$$\hat{p}_N(X) = \frac{1}{N}\sum_{i=1}^{N}\kappa(X - X_i), \qquad(11.84)$$

where $\kappa(\cdot)$ is the kernel function. Similarly, when Q representatives, $Y_1,\ldots,Y_Q$, are selected, the density function is estimated by

$$\hat{p}_Q(X) = \frac{1}{Q}\sum_{i=1}^{Q}\kappa(X - Y_i). \qquad(11.85)$$
In order to measure the similarity between $\hat{p}_N(X)$ and $\hat{p}_Q(X)$, the entropy criterion of Chapter 9 is used. Replacing $p_1(X)$ and $p_2(X)$ of (9.53) by $\hat{p}_Q(X)$ and $\hat{p}_N(X)$ respectively, we obtain the entropy inequality as

$$\int[-\ln\hat{p}_Q(X)]\,\hat{p}_N(X)\,dX \;\ge\; \int[-\ln\hat{p}_N(X)]\,\hat{p}_N(X)\,dX, \qquad(11.86)$$

where the equality holds only for $\hat{p}_Q(X) = \hat{p}_N(X)$. Thus, the best $\hat{p}_Q(X)$ may be found by minimizing the lefthand side of (11.86). The criterion may be simplified by approximating the expectation with respect to $\hat{p}_N(X)$ in (11.86) by the sample mean over the existing samples $X_1,\ldots,X_N$, as

$$J = \frac{1}{N}\sum_{i=1}^{N}\left[-\ln\hat{p}_Q(X_i)\right]. \qquad(11.87)$$
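For concreteness, J of (11.87) is a one-line computation once the kernel values are available. The sketch below assumes a normal kernel with covariance $r^2 I$, an illustrative choice.

```python
import numpy as np

def entropy_J(X, reps, r=1.0):
    """J of (11.87): sample mean of -ln p_Q(X_i) over all N samples,
    with a normal kernel of covariance r^2 * I (illustrative choice)."""
    N, n = X.shape
    d2 = ((X[:, None, :] - reps[None, :, :]) ** 2).sum(axis=-1)
    const = (2.0 * np.pi * r * r) ** (-n / 2.0)
    pQ = const * np.exp(-d2 / (2.0 * r * r)).mean(axis=1)   # (11.85)
    return -np.log(pQ).mean()
```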
In order to find the best Q representatives from the existing samples $X_1,\ldots,X_N$, we would like to minimize J over all possible Q-element subsets of the original N-element set. Unfortunately, an exhaustive search of all $\binom{N}{Q}$ subsets is not computationally feasible. Instead, we will settle for the minimum J over subsets formed by replacing one element of the representative set by the best candidate not yet selected.
An iterative procedure is as follows:

(1) Select an initial assignment of Q samples from the N sample data set. Call the Q sample set STORE and the remaining N−Q samples TEST.

(2) For each element, $X_j$, in TEST, compute the change in J that results if the sample is transferred to STORE:

$$\Delta J_1(X_j) = \frac{1}{N}\sum_{i=1}^{N}\left[\ln\!\left\{\frac{1}{Q}\sum_{l=1}^{Q}\kappa(X_i - Y_l)\right\} - \ln\!\left\{\frac{1}{Q+1}\left[\sum_{l=1}^{Q}\kappa(X_i - Y_l) + \kappa(X_i - X_j)\right]\right\}\right]. \qquad(11.88)$$
(3) Pick the element, $X_j$, in TEST corresponding to the smallest $\Delta J_1$ (and call it $X_j'$).

(4) For each element, $X_l$, in STORE, compute the change in J, $\Delta J_2(X_l)$ of (11.89), that results if the sample is transferred to TEST.

(5) Find the element, $X_l$, corresponding to the smallest $\Delta J_2$ (and call it $X_l'$).

(6) The change of J due to these two operations is $\Delta J = \Delta J_1(X_j') + \Delta J_2(X_l')$. In order to minimize J, we would like to have $\Delta J < 0$. If $X_j'$ exists to satisfy $\Delta J < 0$, transfer $X_j'$ to STORE, transfer $X_l'$ to TEST, and go to Step (2).

(7) Otherwise, find the element, $X_j$, corresponding to the next smallest $\Delta J_1$ (and call it $X_j'$).

(8) If $X_j'$ exists, go to Step (4).

(9) Otherwise, stop.
Generally, this kind of iterative process produces a result which depends on the initial selection of the Q representatives in STORE. However, Steps (7) and (8) allow us to search more possible combinations of $X_j$ and $X_l$, and thus insure that the final representative set is less dependent on the initial assignment.
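The steps above translate almost line-for-line into code. The sketch below makes two assumptions not fixed by the text: a normal kernel (whose normalizing constant cancels inside the log-ratios of $\Delta J_1$ and $\Delta J_2$), and $\Delta J_2$ written as the removal analogue of (11.88); the function names are illustrative.

```python
import numpy as np

def kernel(X, r=1.0):
    """All pairwise kernel values kappa(X_i - X_j).  The Gaussian
    normalizing constant is omitted: it cancels in Delta-J below."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * r * r))

def select_representatives(X, store, r=1.0):
    """Steps (1)-(9): swap representatives while J of (11.87) decreases."""
    N = len(X)
    store = list(store)                      # Step (1): initial STORE
    test = [i for i in range(N) if i not in store]
    K = kernel(X, r)
    while True:
        Q = len(store)
        s = K[:, store].sum(axis=1)          # kernel sums to current STORE
        # Step (2): Delta-J1 of (11.88) for every TEST element.
        dJ1 = {j: np.mean(np.log(s / Q) - np.log((s + K[:, j]) / (Q + 1)))
               for j in test}
        for j in sorted(dJ1, key=dJ1.get):   # Steps (3), (7), (8)
            s1 = s + K[:, j]                 # kernel sums with X_j added
            # Step (4): Delta-J2 (removal analogue of (11.88)).
            dJ2 = {l: np.mean(np.log(s1 / (Q + 1)) - np.log((s1 - K[:, l]) / Q))
                   for l in store}
            l = min(dJ2, key=dJ2.get)        # Step (5)
            if dJ1[j] + dJ2[l] < 0:          # Step (6): J strictly decreases
                store.remove(l); store.append(j)
                test.remove(j); test.append(l)
                break
        else:                                # Step (9): no improving swap
            return store
```

The for/else construct returns as soon as a full pass over the $\Delta J_1$-ordered candidates produces no improving swap, which is exactly the stopping rule of Steps (7)-(9).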
The kNN approach also can be applied in a similar way as the Parzen approach. The kNN density estimate is

$$\hat{p}_N(X) = \frac{k-1}{N c\, d_N^n(X)}, \qquad(11.90)$$

where $d_N(X)$ is the distance from X to its kth NN among the N samples. The same formula is used when Q samples are used to estimate the density function. Thus, the entropy criterion of (11.87) becomes

$$J = \frac{1}{N}\sum_{i=1}^{N}\left[-\ln\hat{p}_Q(X_i)\right] = \frac{1}{N}\sum_{i=1}^{N}\left[n\ln d_Q(X_i) + \ln\frac{cQ}{k-1}\right]. \qquad(11.91)$$

The same iterative procedure can be used to find the best Q samples by minimizing J of (11.91).
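A minimal sketch of evaluating (11.91) for a candidate subset, assuming Euclidean distance and dropping the additive constant $\ln(cQ/(k-1))$, which is common to all subsets of the same size Q.

```python
import numpy as np

def knn_criterion(X, rep_idx, k=3):
    """J of (11.91) up to its additive constant: the mean of
    n*ln d_Q(X_i), where d_Q(X_i) is the distance from X_i to its
    kth NN among the Q representatives (requires Q >= k)."""
    n = X.shape[1]
    reps = X[rep_idx]
    d2 = ((X[:, None, :] - reps[None, :, :]) ** 2).sum(axis=-1)
    d2.sort(axis=1)
    dk = np.sqrt(d2[:, k - 1])       # kth-NN distance for every sample
    return np.mean(n * np.log(dk))
```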
As in most iterative techniques, a good initial guess usually is of considerable importance. The following is a method to perform the initial assignment when Q = N/2.
An initial assignment procedure: We discuss a procedure to generate an initial assignment for the case with Q = N/2. The basis of the procedure lies in an intuitive observation: if a subset of size N/2, called STORE, is optimally chosen, then we suspect that, on average, the NN of a sample, X, in STORE might be the second NN of X among all N samples.

The procedure is best explained with the aid of Fig. 11-18. In Fig. 11-18(a) the NN of $X_1$ is $X_{23}$, the NN of $X_{23}$ is $X_{17}$, and $X_{17}$ and $X_{40}$ are mutual NN's. Figures 11-18(b) and (c) represent two other possible NN progressions.
Fig. 11-18 Three examples of NN progressions.
We may form the initial assignment by simply assigning every other sample in a NN progression to the initial assignment of STORE. Thus for Fig. 11-18(a) the initial assignment for STORE would consist of either $X_1$ and $X_{17}$, or $X_{23}$ and $X_{40}$. We now state the procedure.
(1) Get the first sample, and follow the complete progression of its NN's, assigning alternate samples to STORE. Flag every sample in the progression.

(2) Find the next unflagged sample. Follow its NN progression, assigning alternate samples to STORE. This is done until the progression ends or leads to a previously flagged sample. Flag every sample in the progression.

(3) Repeat Step (2) until all samples are flagged.
If sufficient storage is available to store a table of kNN's for each sample (k = 4 or 5), the entire process of initial assignment and criterion minimization can be done very efficiently.
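The three steps might be coded as follows; the 1-NN table and Euclidean distance are illustrative choices, not requirements of the procedure.

```python
import numpy as np

def initial_assignment(X):
    """Initial STORE (about N/2 samples) by NN progressions: walk the
    1-NN chain from each unflagged sample, assigning alternate samples
    of the walk to STORE, until a flagged sample is reached."""
    N = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    nn = d2.argmin(axis=1)              # NN table (k = 1 suffices here)
    flagged = np.zeros(N, dtype=bool)
    store = []
    for start in range(N):              # Steps (1)-(3)
        i, take = start, True
        while not flagged[i]:           # follow the NN progression
            flagged[i] = True
            if take:
                store.append(i)
            take = not take
            i = nn[i]                   # stops at a flagged sample
    return store
```

For the progression of Fig. 11-18(a), the walk $X_1 \to X_{23} \to X_{17} \to X_{40}$ places $X_1$ and $X_{17}$ in STORE, matching the example in the text.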
Reduced Parzen classifier [17]: In pattern recognition, the quadratic classifier is very popular. However, in practice with non-normal distributions, it is frequently observed that the error of a quadratic classifier is much larger than the Bayes error estimated by a nonparametric technique. On the other hand, nonparametric classifiers are too complex and time-consuming for on-line operation. Thus, there is a need to fill the gap between these two kinds of classifiers.

One possible solution is to divide each class into several clusters, and design a piecewise quadratic classifier as in (4.149). This approach contains both quadratic and nonparametric classifiers as special cases. If each class has only one cluster, it becomes a quadratic classifier, while, if each sample is viewed as a cluster, it becomes a Parzen classifier.
The most elaborate procedure to design a piecewise quadratic classifier would be to decompose the distribution of each class into several normal distributions as shown in (11.45). Note that only this method gives reasonable estimates of the a priori probabilities, means, and covariance matrices. The other clustering procedures fail to do so, because they divide samples as in Fig. 11-7(b) instead of Fig. 11-7(a). The entire clustering operation must be repeated by preassigning various numbers of clusters. The resulting classification error for each preassigned number of clusters is estimated and compared with the Bayes error estimated by a nonparametric technique. The final number of clusters must be as small as possible while maintaining an error close to the Bayes error.
A somewhat simpler approach is to select Q representatives from each class as discussed in this section, and to set a kernel function at each representative to form the Parzen density estimate as in (11.85). The classification can be carried out by comparing the density estimates for $\omega_1$ and $\omega_2$ as

$$-\ln\hat{p}_1(X) + \ln\hat{p}_2(X) \;\lessgtr\; t, \qquad(11.92)$$

classifying X to $\omega_1$ if the left-hand side is smaller than t, and to $\omega_2$ otherwise. This classifier is called the reduced Parzen classifier. It is generally too complex to select a different kernel function for each representative. So, we may use a normal kernel function with the covariance matrix $r^2\Sigma_i$, where $\Sigma_i$ is the global covariance matrix of $\omega_i$ and r is a parameter controlling the size of the kernel. This approach is simpler because a common kernel function is used within the same class and is prespecified, instead of being estimated and varying locally, as in the approach of (11.45). As discussed in Chapter 7, the sensitive parameters are the size of the kernel, r, the decision threshold, t of (11.92), and the estimate of $\Sigma_i$.
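A sketch of the resulting decision rule, assuming normal kernels with covariance $r^2\Sigma_i$ and already-selected representatives; the threshold t would be tuned on the design samples, as in Experiment 6 below.

```python
import numpy as np

def log_parzen(X, reps, cov, r):
    """Log of the Parzen estimate (11.85) with a normal kernel of
    covariance r^2 * cov.  (A log-sum-exp would be safer numerically.)"""
    n = X.shape[1]
    S = r * r * cov
    Si = np.linalg.inv(S)
    _, logdet = np.linalg.slogdet(S)
    diff = X[:, None, :] - reps[None, :, :]            # (N, Q, n)
    m2 = np.einsum('ijk,kl,ijl->ij', diff, Si, diff)   # squared Mahalanobis
    logk = -0.5 * (m2 + n * np.log(2 * np.pi) + logdet)
    return np.log(np.exp(logk).mean(axis=1))

def reduced_parzen_classify(X, reps1, cov1, reps2, cov2, r, t=0.0):
    """Assign omega_1 where -ln p1(X) + ln p2(X) < t, as in (11.92)."""
    h = -log_parzen(X, reps1, cov1, r) + log_parzen(X, reps2, cov2, r)
    return np.where(h < t, 1, 2)
```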
In order to verify the above argument, two experimental results are presented.
Experiment 6: The reduced Parzen classifier for Data I-A was designed and tested as follows:

(1) One hundred samples per class were generated from Data I-A, and Experiment 7-6 was repeated. From Fig. 7-9, r = 2 was chosen as the kernel size.

(2) The sample reduction algorithm of this section was applied to select Q representatives from the 100.

(3) Using the Parzen density estimates of (11.85) for $\omega_1$ and $\omega_2$ with r = 2, the original 100 samples per class were classified as in (11.92). The threshold t was selected so as to minimize the classification error. The optimum t is different for each different value of Q.

(4) Independently, 100 samples per class were generated and tested by the classifier designed above.
Figure 11-19 shows the plot of the error vs. Q [17]. The error curve is the average of 10 trials, and the standard deviations are shown by vertical bars. Note that the above error estimation is based on the holdout method, in which design and test samples are independent, so that the error becomes larger than the Bayes error (1.9%). The error curve is almost flat up to one representative. For normal distributions, selecting the expected vector as the one representative from each class and the covariance matrix as the kernel covariance, the reduced Parzen classifier becomes quadratic, which is the Bayes classifier. So, the error curve of Fig. 11-19 should be flat down to Q = 1. However, since we select representatives from the existing design samples, they may or may not be close to the expected vector. If not, we see that the error curve is flat up to Q = 2 or 3 and starts to increase for smaller Q's.
Fig. 11-19 The error of the reduced Parzen classifier for a normal case.
Experiment 7: In order to test a non-normal case, the data of Experiment 7-7 was studied. The data is 8-dimensional, and each class consists of two normal distributions. From Fig. 7-10, r = 1.5 was chosen. The procedure of Experiment 6 was repeated for this data, and the result is shown in Fig. 11-20 [17]. Figure 11-20 shows that the error curve is flat for $Q \ge 6$. We found that, when Q = 6, three representatives are selected from each cluster.

These experiments suggest an interesting fact. It has been believed that a nonparametric procedure needs a large number of samples for high-dimensional data in order to reliably estimate the Bayes error. Any nonparametric operation with a large number of samples requires a large amount of computer time. The results of the experiments in this section contradict these common beliefs, and suggest that we may need only a relatively small number of samples (or representatives) after all. However, as Experiment 7-10 and Table 7-3(b) suggest, we may still need a large number of samples to estimate the covariance matrices, which are used to form the kernel functions.
Fig. 11-20 The error of the reduced Parzen classifier for a non-normal case.
Parametric Data Reduction
In parametric approaches, our main concern is to maintain the expected vector and autocorrelation (or covariance) matrix while reducing the number of samples.
Representation of a sample matrix: Let us express the N samples in matrix form as

$$U = [X_1 \ X_2 \ \ldots \ X_N], \qquad(11.93)$$

which is an $n\times N$ rectangular matrix, called the sample matrix. Then, the sample autocorrelation matrix can be expressed by

$$\hat{S} = \frac{1}{N}\sum_{i=1}^{N}X_iX_i^T = \frac{1}{N}UU^T. \qquad(11.94)$$

Since $\hat{S}$ is an $n\times n$ matrix, $\hat{S}$ has the $n\times n$ eigenvalue and eigenvector matrices, $\Lambda$ and $\Phi$, such that

$$\hat{S}\Phi = \Phi\Lambda. \qquad(11.95)$$
Or

$$(UU^T)\Phi = \Phi(N\Lambda) = \Phi\mu, \qquad(11.96)$$

where $\mu$ is the eigenvalue matrix of $UU^T$ and $\mu = N\Lambda$. Multiplying by $U^T$ from the left side, we can convert (11.96) to

$$(U^TU)(U^T\Phi) = (U^T\Phi)\mu, \qquad(11.97)$$

where $U^TU$ is an $N\times N$ matrix, $U^T\Phi$ consists of the n eigenvectors of $U^TU$, and the components of $\mu$ are the n eigenvalues of $U^TU$. Since $(U^T\Phi)^T(U^T\Phi) = \Phi^TUU^T\Phi = \mu \ne I$, we change the scales of the eigenvectors by multiplying by $\mu^{-1/2}$ from the right side, so that (11.97) can be rewritten as

$$(U^TU)\Psi = \Psi\mu, \qquad(11.98)$$

where

$$\Psi = U^T\Phi\mu^{-1/2}. \qquad(11.99)$$

That is, $\Psi$ consists of the n eigenvectors of $U^TU$, and satisfies the orthonormal condition

$$\Psi^T\Psi = \mu^{-1/2}\Phi^TUU^T\Phi\mu^{-1/2} = I. \qquad(11.100)$$

From (11.99), U can be expressed in terms of $\Phi$, $\Psi$, and $\Lambda$ as

$$U = \Phi\mu^{1/2}\Psi^T = \sqrt{N}\,\Phi\Lambda^{1/2}\Psi^T. \qquad(11.101)$$

This expression of a sample matrix is called the singular value decomposition [18]. Singular value decomposition is an effective technique to represent a rectangular (non-square) matrix by eigenvalues and eigenvectors.
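The chain (11.94)-(11.101) is easy to verify numerically; the sketch below follows it step by step with numpy, whose eigh supplies $\Lambda$ and $\Phi$ of (11.95).

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 3, 10
U = rng.standard_normal((n, N))     # sample matrix: one sample per column

S = U @ U.T / N                     # sample autocorrelation matrix (11.94)
lam, Phi = np.linalg.eigh(S)        # Lambda and Phi of (11.95)
mu = N * lam                        # mu = N * Lambda, as in (11.96)
Psi = U.T @ Phi / np.sqrt(mu)       # Psi = U^T Phi mu^(-1/2), (11.99)

print(np.allclose(Psi.T @ Psi, np.eye(n)))         # (11.100) holds
print(np.allclose((Phi * np.sqrt(mu)) @ Psi.T, U)) # (11.101) recovers U
```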
Data reduction: Now our problem is to reduce the number of samples while maintaining the expected vector and autocorrelation matrix. Let us introduce another sample matrix V which consists of Q sample vectors. In order that both U and V share the same autocorrelation matrix,

$$\hat{S} = \frac{1}{N}UU^T = \frac{1}{Q}VV^T. \qquad(11.102)$$

Using (11.101), both U and V can be expressed by
$$U = \sqrt{N}\,\Phi\Lambda^{1/2}\Psi^T, \qquad(11.103)$$

$$V = \sqrt{Q}\,\Phi\Lambda^{1/2}\Xi^T, \qquad(11.104)$$

where $\Psi$ ($N\times n$) and $\Xi$ ($Q\times n$) are the eigenvector matrices of $U^TU$ and $V^TV$ respectively, and $\Phi$ and $\Lambda$ are the eigenvector and eigenvalue matrices of $\hat{S}$. Note that, although $(UU^T)_{n\times n}$ and $(VV^T)_{n\times n}$ share the same eigenvector matrix, $\Phi$, $(U^TU)_{N\times N}$ and $(V^TV)_{Q\times Q}$ have different eigenvector matrices, $\Psi$ and $\Xi$. Furthermore, since $\sqrt{N}\,\Psi^T = \Lambda^{-1/2}\Phi^TU$ from (11.103), $\sqrt{N}\,\Psi^T$ is the sample matrix in the whitened space, where the sample autocorrelation matrix becomes I. Similarly, $\sqrt{Q}\,\Xi^T$ is the reduced sample matrix in that space. Then, (11.104) shows the transformation which converts $\sqrt{Q}\,\Xi^T$ back to the original space.

As in (11.100), the n column vectors of $\Xi$ are orthonormal, satisfying

$$\Xi^T\Xi = I. \qquad(11.105)$$
The second condition is that U and V must have the same mean vector. Without losing generality, let us assume that the mean is zero. Then,

$$\frac{1}{N}U\mathbf{1}_N = \frac{1}{Q}V\mathbf{1}_Q = 0, \qquad(11.106)$$

where $\mathbf{1}_k$ is a vector whose k components all equal 1, that is, $\mathbf{1}_k = [1 \ \ldots \ 1]^T$. Or, substituting (11.103) and (11.104) into (11.106),

$$\Xi^T\mathbf{1}_Q = 0. \qquad(11.107)$$
When the sample matrix U is given, we can compute $\Phi$, $\Psi$, and $\Lambda$. Therefore, our problem is to find $\Xi$ by solving (11.105) and (11.107); then V can be computed by (11.104). Now $\Xi$ consists of $n\times Q$ elements. To determine those $n\times Q$ unknowns we have $n\times n$ equations from (11.105) and n equations from (11.107). Therefore, in general when Q is larger than n+1, there are many possible solutions. But when Q = n+1, we have a unique solution as follows:

$$\Xi^T_{n\times(n+1)} = [\,F_{n\times n} \ \ G_{n\times 1}\,], \qquad(11.108)$$

$$F = I - \alpha\,\mathbf{1}_n\mathbf{1}_n^T, \qquad(11.109)$$

$$G = -\beta\,\mathbf{1}_n, \qquad(11.110)$$
where

$$\alpha = \frac{1}{n}\left(1 - \frac{1}{\sqrt{n+1}}\right), \qquad(11.111)$$

$$\beta = \frac{1}{\sqrt{n+1}}. \qquad(11.112)$$

The derivation of (11.108) through (11.112) has been omitted for brevity, but the reader may easily verify that $\Xi$ of (11.108) satisfies both (11.105) and (11.107).
Substituting (11.109) and (11.110) into (11.108),

$$\Xi^T = \begin{bmatrix} 1-\alpha & -\alpha & \cdots & -\alpha & -\beta \\ -\alpha & 1-\alpha & \cdots & -\alpha & -\beta \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ -\alpha & -\alpha & \cdots & 1-\alpha & -\beta \end{bmatrix}. \qquad(11.113)$$

Note that $\Xi^T$ of (11.113) is data independent, because $\sqrt{Q}\,\Xi^T$ is the reduced sample matrix of (n+1) samples for the mean 0 and the covariance matrix I. This sample matrix is transformed by (11.104) into the space where the original data and its sample covariance matrix are given.
Example 2: For n = 2, $\sqrt{3}\,\Xi^T$ becomes

$$\sqrt{3}\,\Xi^T = \begin{bmatrix} 1.366 & -0.366 & -1 \\ -0.366 & 1.366 & -1 \end{bmatrix}. \qquad(11.114)$$
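The construction is equally easy to check numerically; the sketch below reproduces (11.114) and verifies (11.105) and (11.107). V itself would then follow from (11.104).

```python
import numpy as np

def xi_T(n):
    """Xi^T of (11.108)-(11.113) for Q = n + 1 representatives."""
    beta = 1.0 / np.sqrt(n + 1)               # (11.112)
    alpha = (1.0 - beta) / n                   # (11.111)
    F = np.eye(n) - alpha * np.ones((n, n))    # (11.109)
    G = -beta * np.ones((n, 1))                # (11.110)
    return np.hstack([F, G])                   # n x (n+1), as in (11.108)

XiT = xi_T(2)
print(np.sqrt(3) * XiT)                        # reproduces (11.114)
print(np.allclose(XiT @ XiT.T, np.eye(2)))     # (11.105): Xi^T Xi = I
print(np.allclose(XiT @ np.ones(3), 0.0))      # (11.107): Xi^T 1 = 0
```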
Computer Projects

1. Repeat Experiment 1.

2. Repeat Experiments 2 and 3.

3. Apply the graph theoretic valley-seeking technique to Data I-A.

4. Apply the iterative valley-seeking technique to Data I-A.

5. Repeat Experiments 6 and 7.
Problems

1. A random variable x is uniformly distributed in [a, b]. From N samples $x_1,\ldots,x_N$ drawn from the distribution, find the maximum likelihood estimates of a and b.

2. Assume that a distribution can be decomposed into L normal clusters and that all clusters share the same covariance matrix. Find the recursive equations to compute the a priori probabilities, the mean vectors, and the common covariance matrix of the L clusters by using the maximum likelihood estimation technique. Point out the difference between this procedure and the nearest mean reclassification algorithm.

3. Assume that a distribution consists of L normal clusters and $P_1 = \ldots = P_L$. Find the recursive equations to compute the means and covariance matrices of the L clusters by using the maximum likelihood estimation technique.
4. A kurtosis matrix is defined by

$$K = E\{(X^TX)XX^T\} - (n+2)I.$$

When a density function consists of L normal clusters with the mean vectors $M_i$ and the common covariance matrix $\Sigma$, K can be rewritten as

$$K = \sum_{i=1}^{L} P_iM_iM_i^T\left[\mathrm{tr}(\Sigma + M_iM_i^T) - (n+2)\right].$$

Since the rank of the above equation is (L−1), only (L−1) eigenvalues of K are non-zero. Using this property, we can estimate the number of modes when the clusters are normal and share the same covariance matrix. Prove that the second equation is equal to the first one for $\Sigma = I$.
5. A density function of a random variable x can be estimated by the Parzen method with the kernel function

$$\kappa(y) = \begin{cases} \dfrac{3}{4}(1-y^2) & \text{for } y^2 < 1, \\ 0 & \text{for } y^2 \ge 1. \end{cases}$$

The derivative of the Parzen density estimate, $d\hat{p}(x)/dx$, can be used as an estimate of the density gradient. Confirm that this method of gradient estimation coincides with (11.64).
6. The product of two normal distributions, $N(M_1,\Sigma_1)$ and $N(M_2,\Sigma_2)$, gives another normal distribution, $N(M_0,\Sigma_0)$, multiplied by a constant, c.

(a) Derive the relations among the three means and covariance matrices.

(b) Samples are drawn from $N(M_1,\Sigma_1)$, and a normal window function $N(X,\Sigma_2)$ is set at X. Using these samples and the given window function, estimate the mean and covariance matrix of $N(M_0,\Sigma_0)$.

(c) Using $M_0$ and $\Sigma_0$ and the given window function, estimate $M_1$ and $\Sigma_1$. This procedure could be used to estimate a local covariance matrix even when the samples are drawn from a non-normal distribution.
7. Two density functions are estimated by the Parzen method. Find the expression for the class separability criterion $\int p_1(X)p_2(X)\,dX$.

(a) Use a normal kernel function with kernel covariance $r^2\Sigma_i$ for $\omega_i$.

(b) Use a hyperspherical uniform kernel with radius r for both $\omega_1$ and $\omega_2$.
8. Let class separability be measured by

$$J = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} f(X_i,X_j)\,g(c_i,c_j),$$

where

$$f(X_i,X_j) = \begin{cases} 1 & \text{for } \|X_i - X_j\| \le r, \\ 0 & \text{for } \|X_i - X_j\| > r, \end{cases}$$

$$g(c_i,c_j) = \begin{cases} 1 & \text{when } X_i \text{ and } X_j \text{ belong to different classes}, \\ 0 & \text{when } X_i \text{ and } X_j \text{ belong to the same class}. \end{cases}$$

That is, J counts the number of pairs whose members are close and yet come from different classes. Prove that the minimization of J by an iterative process leads to the iterative valley-seeking procedure of (11.81).
9. Formulate the selection of representatives by a branch and bound procedure.
10. By using the singular value decomposition technique, show how to decompose an $n\times m$ (m < n) two-dimensional image into m basis images, and how to evaluate the effectiveness of these basis images.
References

1. S. Watanabe, "Knowing and Guessing," Wiley, New York, 1969.

2. D. J. Hall and G. B. Ball, ISODATA: a novel method of data analysis and pattern classification, Technical report, Stanford Research Institute, Menlo Park, CA, 1965.

3. G. B. Ball, Data analysis in the social sciences: what about the details?, Proc. Fall Joint Computer Conference, pp. 533-559, Washington, DC, 1965.

4. J. MacQueen, Some methods for classification and analysis of multivariate observations, Proc. Fifth Berkeley Symposium on Math. Statist. and Prob., pp. 281-297, Berkeley, CA, 1967.

5. K. Fukunaga and W. L. G. Koontz, A criterion and an algorithm for grouping data, Trans. IEEE Computers, C-19, pp. 917-923, 1970.

6. W. L. G. Koontz, P. M. Narendra, and K. Fukunaga, A branch and bound clustering algorithm, Trans. IEEE Computers, C-24, pp. 908-915, 1975.

7. D. M. Titterington, A. F. M. Smith, and U. E. Makov, "Statistical Analysis of Finite Mixture Distributions," Wiley, New York, 1985.

8. K. Fukunaga and T. E. Flick, Estimation of the parameters of a Gaussian mixture using the method of moments, Trans. IEEE Pattern Anal. and Machine Intell., PAMI-5, pp. 410-416, 1983.

9. N. E. Day, Estimating the components of a mixture of normal distributions, Biometrika, 56, pp. 463-474, 1969.

10. J. H. Wolfe, Pattern clustering by multivariate mixture analysis, Multivar. Behav. Res., 5, pp. 329-350, 1970.

11. K. Fukunaga and L. D. Hostetler, The estimation of the gradient of a density function, with applications in pattern recognition, Trans. IEEE Inform. Theory, IT-21, pp. 32-40, 1975.

12. K. Fukunaga and T. E. Flick, A test of Gaussian-ness of a data set using clustering, Trans. IEEE Pattern Anal. and Machine Intell., PAMI-8, pp. 240-247, 1986.

13. W. L. G. Koontz, P. M. Narendra, and K. Fukunaga, A graph-theoretic approach to nonparametric cluster analysis, Trans. IEEE Computers, C-25, pp. 936-944, 1975.

14. W. L. G. Koontz and K. Fukunaga, A nonparametric