Estimating the Support of a High-Dimensional Distribution

Bernhard Schölkopf*, John C. Platt‡, John Shawe-Taylor†, Alex J. Smola§, Robert C. Williamson§

* Microsoft Research Ltd, 1 Guildhall Street, Cambridge CB2 3NH, UK
‡ Microsoft Research, 1 Microsoft Way, Redmond, WA, USA
† Royal Holloway, University of London, Egham, UK
§ Department of Engineering, Australian National University, Canberra 0200, Australia

bsc@scientist.com, jplatt@microsoft.com, john@dcs.rhbnc.ac.uk,
Alex.Smola@anu.edu.au, Bob.Williamson@anu.edu.au

27 November 1999; revised 18 September 2000

Technical Report
MSR-TR-99-87

Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052

An abridged version of this document will appear in Neural Computation.
Abstract

Suppose you are given some dataset drawn from an underlying probability distribution P and you want to estimate a "simple" subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1.
We propose a method to approach this problem by trying to estimate a function f which is positive on S and negative on the complement. The functional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. The expansion coefficients are found by solving a quadratic programming problem, which we do by carrying out sequential optimization over pairs of input patterns. We also provide a theoretical analysis of the statistical performance of our algorithm.
The algorithm is a natural extension of the support vector algorithm to the case of unlabelled data.

Keywords. Support Vector Machines, Kernel Methods, Density Estimation, Unsupervised Learning, Novelty Detection, Condition Monitoring, Outlier Detection
1 Introduction

During recent years, a new set of kernel techniques for supervised learning has been developed (Vapnik, 1995; Schölkopf et al., 1999a). Specifically, support vector (SV) algorithms for pattern recognition, regression estimation and solution of inverse problems have received considerable attention.
There have been a few attempts to transfer the idea of using kernels to compute inner products in feature spaces to the domain of unsupervised learning. The problems in that domain are, however, less precisely specified. Generally, they can be characterized as estimating functions of the data which tell you something interesting about the underlying distributions. For instance, kernel PCA can be characterized as computing functions which on the training data produce unit variance outputs while having minimum norm in feature space (Schölkopf et al., 1999b). Another kernel-based unsupervised learning technique, regularized principal manifolds (Smola et al., 2000), computes functions which give a mapping onto a lower-dimensional manifold minimizing a regularized quantization error. Clustering algorithms are further examples of unsupervised learning techniques which can be kernelized (Schölkopf et al., 1999b).
An extreme point of view is that unsupervised learning is about estimating densities. Clearly, knowledge of the density of P would then allow us to solve whatever problem can be solved on the basis of the data.
The present work addresses an easier problem: it proposes an algorithm which computes a binary function which is supposed to capture regions in input space where the probability density lives (its support), i.e. a function such that most of the data will live in the region where the function is nonzero (Schölkopf et al., 1999). In doing so, it is in line with Vapnik's principle never to solve a problem which is more general than the one we actually need to solve. Moreover, it is applicable also in cases where the density of the data's distribution is not even well-defined, e.g. if there are singular components.
The article is organized as follows. After a review of some previous work in Sec. 2, we propose SV algorithms for the considered problem. Sec. 4 gives details on the implementation of the optimization procedure, followed by theoretical results characterizing the present approach. In Sec. 6, we apply the algorithm to artificial as well as real-world data. We conclude with a discussion.
2 Previous Work

In order to describe some previous work, it is convenient to introduce the following definition of a (multi-dimensional) quantile function, introduced by Einmal and Mason (1992). Let x_1, ..., x_ℓ be i.i.d. random variables in a set X with distribution P. Let 𝒞 be a class of measurable subsets of X and let λ be a real-valued function defined on 𝒞. The quantile function with respect to (P, λ, 𝒞) is

U(\alpha) = \inf\{\lambda(C) : P(C) \ge \alpha,\ C \in \mathcal{C}\}, \qquad 0 < \alpha \le 1.

In the special case where P is the empirical distribution (P_ℓ(C) := (1/ℓ) Σ_{i=1}^{ℓ} 1_C(x_i)), we obtain the empirical quantile function. We denote by C(α) and C_ℓ(α) the (not necessarily unique) C ∈ 𝒞 that attains the infimum (when it is achievable). The most common choice of λ is Lebesgue measure, in which case C(α) is the minimum volume C ∈ 𝒞 that contains at least a fraction α of the probability mass. Estimators of the form C_ℓ(α) are called minimum volume estimators.
Observe that for 𝒞 being all Borel measurable sets, C(1) is the support of the density p corresponding to P, assuming it exists. (Note that C(1) is well defined even when p does not exist.) For smaller classes 𝒞, C(1) is the minimum volume C ∈ 𝒞 containing the support of p.
Turning to the case where α < 1, it seems the first work was reported by Sager (1979) and then Hartigan (1987), who considered X = R² with 𝒞 being the class of closed convex sets in X. (They actually considered density contour clusters; cf. Appendix A for a definition.) Nolan (1991) considered higher dimensions with 𝒞 being the class of ellipsoids. Tsybakov (1997) has studied an estimator based on piecewise polynomial approximation of C(α) and has shown it attains the asymptotically minimax rate for certain classes of densities p. Polonik (1997) has studied the estimation of C(α) by C_ℓ(α). He derived asymptotic rates of convergence in terms of various measures of richness of 𝒞. He considered both VC classes and classes with a log ε-covering number with bracketing of order O(ε^{-r}) for r > 0. More information on minimum volume estimators can be found in that work, and in Appendix A.
A number of applications have been suggested for these techniques. They include problems in medical diagnosis (Tarassenko et al., 1995), marketing (Ben-David and Lindenbaum, 1997), condition monitoring of machines (Devroye and Wise, 1980), estimating manufacturing yields (Stoneking, 1999), econometrics and generalized nonlinear principal curves (Tsybakov, 1997; Korostelev and Tsybakov, 1993), regression and spectral analysis (Polonik, 1997), tests for multimodality and clustering (Polonik, 1995b), and others (Müller, 1992).
Most of these papers, in particular those with a theoretical slant, do not go all the way in devising practical algorithms that work on high-dimensional real-world problems. A notable exception to this is the work of Tarassenko et al. (1995).
Polonik (1995a) has shown how one can use estimators of C(α) to construct density estimators. The point of doing this is that it allows one to encode a range of prior assumptions about the true density p that would be impossible to do within the traditional density estimation framework. He has shown asymptotic consistency and rates of convergence for densities belonging to VC-classes or with a known rate of growth of metric entropy with bracketing.
Let us conclude this section with a short discussion of how the present work relates to the above. The present paper describes an algorithm which finds regions close to C(α). Our class 𝒞 is defined implicitly via a kernel k as the set of half-spaces in a SV feature space. We do not try to minimize the volume of C in input space. Instead, we minimize a SV style regularizer which, using a kernel, controls the smoothness of the estimated function describing C. In terms of multi-dimensional quantiles, our approach can be thought of as employing λ(C_w) = ‖w‖², where C_w = {x : f_w(x) ≥ ρ}. Here, (w, ρ) are a weight vector and an offset parametrizing a hyperplane in the feature space associated with the kernel.
The main contribution of the present work is that we propose an algorithm that has tractable computational complexity, even in high-dimensional cases. Our theory, which uses very similar tools to those used by Polonik, gives results that we expect will be of more use in a finite sample size setting.
3 Algorithms

We first introduce terminology and notation conventions. We consider training data

x_1, \ldots, x_\ell \in X,   (1)

where ℓ ∈ N is the number of observations, and X is some set. For simplicity, we think of it as a compact subset of R^N. Let Φ be a feature map X → F, i.e. a map into an inner product space F such that the inner product in the image of Φ can be computed by evaluating some simple kernel (Boser et al., 1992; Vapnik, 1995; Schölkopf et al., 1999a)

k(x, y) = (\Phi(x) \cdot \Phi(y)),   (2)

such as the Gaussian kernel

k(x, y) = \mathrm{e}^{-\|x - y\|^2 / c}.   (3)

Indices i and j are understood to range over 1, ..., ℓ (in compact notation: i, j ∈ [ℓ]). Bold face Greek letters denote ℓ-dimensional vectors whose components are labelled using normal face type.
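As a concrete illustration of (2)-(3), the short sketch below computes a Gaussian Gram matrix; it assumes NumPy and a toy data matrix, neither of which is part of the original text.

```python
import numpy as np

def gaussian_kernel(X, Y, c):
    """Gram matrix K[i, j] = exp(-||X[i] - Y[j]||^2 / c), cf. Eq. (3)."""
    # squared distances via ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2
    sq = (X ** 2).sum(1)[:, None] - 2.0 * X @ Y.T + (Y ** 2).sum(1)[None, :]
    return np.exp(-sq / c)

X = np.random.randn(5, 2)          # five toy patterns in R^2
K = gaussian_kernel(X, X, c=0.5)
assert np.allclose(np.diag(K), 1)  # k(x, x) = 1 for the Gaussian kernel
```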
In the remainder of this section, we shall develop an algorithm which returns a function f that takes the value +1 in a "small" region capturing most of the data points, and -1 elsewhere. Our strategy is to map the data into the feature space corresponding to the kernel, and to separate them from the origin with maximum margin. For a new point x, the value f(x) is determined by evaluating which side of the hyperplane it falls on, in feature space. Via the freedom to utilize different types of kernel functions,
this simple geometric picture corresponds to a variety of nonlinear estimators in input space.
To separate the data set from the origin, we solve the following quadratic program:

\min_{w \in F,\ \xi \in \mathbb{R}^{\ell},\ \rho \in \mathbb{R}} \quad \frac{1}{2}\|w\|^{2} + \frac{1}{\nu\ell}\sum_{i}\xi_{i} - \rho   (4)

\text{subject to} \quad (w \cdot \Phi(x_{i})) \ge \rho - \xi_{i}, \qquad \xi_{i} \ge 0.   (5)

Here, ν ∈ (0, 1] is a parameter whose meaning will become clear later.
Since nonzero slack variables ξ_i are penalized in the objective function, we can expect that if w and ρ solve this problem, then the decision function

f(x) = \mathrm{sgn}((w \cdot \Phi(x)) - \rho)   (6)

will be positive for most examples x_i contained in the training set,¹ while the SV type regularization term ‖w‖ will still be small. The actual trade-off between these two goals is controlled by ν.
Using multipliers α_i, β_i ≥ 0, we introduce a Lagrangian

L(w, \xi, \rho, \alpha, \beta) = \frac{1}{2}\|w\|^{2} + \frac{1}{\nu\ell}\sum_{i}\xi_{i} - \rho - \sum_{i}\alpha_{i}\left((w \cdot \Phi(x_{i})) - \rho + \xi_{i}\right) - \sum_{i}\beta_{i}\xi_{i},   (7)

and set the derivatives with respect to the primal variables w, ξ, ρ equal to zero, yielding

w = \sum_{i}\alpha_{i}\Phi(x_{i}),   (8)

\alpha_{i} = \frac{1}{\nu\ell} - \beta_{i} \le \frac{1}{\nu\ell}, \qquad \sum_{i}\alpha_{i} = 1.   (9)

In (8), all patterns {x_i : i ∈ [ℓ], α_i > 0} are called Support Vectors. Together with (2), the SV expansion transforms the decision function (6) into a kernel expansion

f(x) = \mathrm{sgn}\left(\sum_{i}\alpha_{i}k(x_{i}, x) - \rho\right).   (10)

Substituting (8)-(9) into L (7), and using (2), we obtain the dual problem:

\min_{\alpha} \frac{1}{2}\sum_{ij}\alpha_{i}\alpha_{j}k(x_{i}, x_{j}) \quad \text{subject to} \quad 0 \le \alpha_{i} \le \frac{1}{\nu\ell}, \quad \sum_{i}\alpha_{i} = 1.   (11)

One can show that at the optimum, the two inequality constraints (5) become equalities if α_i and β_i are nonzero, i.e. if 0 < α_i < 1/(νℓ). Therefore, we can recover ρ by exploiting that for any such α_i, the corresponding pattern x_i satisfies

\rho = (w \cdot \Phi(x_{i})) = \sum_{j}\alpha_{j}k(x_{j}, x_{i}).   (12)

¹We use the convention that sgn(z) equals 1 for z ≥ 0 and -1 otherwise.
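For small problems, the dual (11) can be handed to an off-the-shelf solver and ρ recovered via (12). The sketch below does this with SciPy's SLSQP routine; the solver choice, the tolerances, and the use of the median over margin SVs are our assumptions, not part of the paper (Sec. 4 develops the dedicated SMO-style solver actually used here).

```python
import numpy as np
from scipy.optimize import minimize

def one_class_svm_dual(K, nu):
    """Solve the dual (11) for Gram matrix K, then recover rho via (12)."""
    l = K.shape[0]
    ub = 1.0 / (nu * l)
    x0 = np.full(l, 1.0 / l)                     # feasible start: sums to 1, within bounds
    res = minimize(lambda a: 0.5 * a @ K @ a,    # dual objective
                   x0, jac=lambda a: K @ a,
                   bounds=[(0.0, ub)] * l,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}],
                   method="SLSQP")
    alpha = res.x
    margin = (alpha > 1e-8) & (alpha < ub - 1e-8)   # SVs strictly inside the box, cf. (12)
    rho = np.median(K[margin] @ alpha) if margin.any() else np.max(K @ alpha)
    return alpha, rho

# For a new point x: f(x) = sgn(sum_i alpha_i k(x_i, x) - rho), cf. (10).
```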
Note that if ν approaches 0, the upper boundaries on the Lagrange multipliers tend to infinity, i.e. the second inequality constraint in (11) becomes void. The problem then resembles the corresponding hard margin algorithm, since the penalization of errors becomes infinite, as can be seen from the primal objective function (4). It is still a feasible problem, since we have placed no restriction on the offset ρ, so it can become a large negative number in order to satisfy (5). If we had required ρ ≥ 0 from the start, we would have ended up with the constraint Σ_i α_i ≥ 1 instead of the corresponding equality constraint in (11), and the multipliers α_i could have diverged.
It is instructive to compare (11) to a Parzen windows estimator. To this end, suppose we use a kernel which can be normalized as a density in input space, such as the Gaussian (3). If we use ν = 1, then the two constraints only allow the solution α_1 = ... = α_ℓ = 1/ℓ. Thus the kernel expansion in (10) reduces to a Parzen windows estimate of the underlying density. For ν < 1, the equality constraint in (11) still ensures that the decision function is a thresholded density; however, in that case, the density will only be represented by a subset of training examples (the SVs), namely those which are important for the decision (10) to be taken. Sec. 5 will explain the precise meaning of ν.
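The ν = 1 case is easy to reproduce numerically. The sketch below builds the thresholded Parzen-style estimator of (10) with uniform coefficients; the Gaussian data, the kernel width, and the choice of setting the offset so that roughly 10% of the points fall outside are illustrative assumptions (cf. the related construction in Figure 2).

```python
import numpy as np

def gaussian_kernel(X, Y, c):
    sq = (X ** 2).sum(1)[:, None] - 2.0 * X @ Y.T + (Y ** 2).sum(1)[None, :]
    return np.exp(-sq / c)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # training sample
K = gaussian_kernel(X, X, c=0.5)

# nu = 1 forces alpha_1 = ... = alpha_l = 1/l, so (10) thresholds a Parzen-type estimate.
alpha = np.full(len(X), 1.0 / len(X))
scores = K @ alpha                        # proportional to a Parzen windows density estimate
rho = np.quantile(scores, 0.1)            # offset chosen so roughly 10% of points fall outside
decision = np.sign(scores - rho)          # cf. (10)
```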
To conclude this section, we note that one can also use balls to describe the data in feature space, close in spirit to the algorithms of Schölkopf et al. (1995), with hard boundaries, and Tax and Duin (1999), with "soft margins." Again, we try to put most of the data into a small ball by solving, for ν ∈ (0, 1),

\min_{R \in \mathbb{R},\ \xi \in \mathbb{R}^{\ell},\ c \in F} \quad R^{2} + \frac{1}{\nu\ell}\sum_{i}\xi_{i} \quad \text{subject to} \quad \|\Phi(x_{i}) - c\|^{2} \le R^{2} + \xi_{i}, \quad \xi_{i} \ge 0 \ \text{ for } i \in [\ell].   (13)

This leads to the dual

\min_{\alpha} \quad \sum_{ij}\alpha_{i}\alpha_{j}k(x_{i}, x_{j}) - \sum_{i}\alpha_{i}k(x_{i}, x_{i})   (14)

\text{subject to} \quad 0 \le \alpha_{i} \le \frac{1}{\nu\ell}, \quad \sum_{i}\alpha_{i} = 1,   (15)

and the solution

c = \sum_{i}\alpha_{i}\Phi(x_{i}),   (16)

corresponding to a decision function of the form

f(x) = \mathrm{sgn}\left(R^{2} - \sum_{ij}\alpha_{i}\alpha_{j}k(x_{i}, x_{j}) + 2\sum_{i}\alpha_{i}k(x_{i}, x) - k(x, x)\right).   (17)

Similar to the above, R² is computed such that for any x_i with 0 < α_i < 1/(νℓ) the argument of the sgn is zero.
For kernels k(x, y) which only depend on x - y, k(x, x) is constant. In this case, the equality constraint implies that the linear term in the dual target function is constant, and the problem (14)-(15) turns out to be equivalent to (11). It can be shown that the same holds true for the decision function, hence the two algorithms coincide in that
case. This is geometrically plausible: for constant k(x, x), all mapped patterns lie on a sphere in feature space. Therefore, finding the smallest sphere (containing the points) really amounts to finding the smallest segment of the sphere that the data live on. The segment, however, can be found in a straightforward way by simply intersecting the data sphere with a hyperplane: the hyperplane with maximum margin of separation to the origin will cut off the smallest segment.
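The coincidence of the two decision functions for constant k(x, x) can be checked directly. In the sketch below the coefficients α are arbitrary feasible values rather than a dual optimum, since the sign identity only needs k(x, x) = 1 and the two offsets to be fixed at the same reference point; the data and the reference index are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, Y, c=0.5):
    sq = (X ** 2).sum(1)[:, None] - 2.0 * X @ Y.T + (Y ** 2).sum(1)[None, :]
    return np.exp(-sq / c)

rng = np.random.default_rng(1)
Xtr, Xte = rng.normal(size=(50, 2)), rng.normal(size=(20, 2))
K, Kte = gaussian_kernel(Xtr, Xtr), gaussian_kernel(Xte, Xtr)

alpha = rng.dirichlet(np.ones(len(Xtr)))    # any coefficients summing to 1
m = 0                                       # reference point used to fix both offsets
s_tr, s_te = K @ alpha, Kte @ alpha         # sum_i alpha_i k(x_i, .)

rho = s_tr[m]                               # hyperplane offset, cf. (12)
R2 = alpha @ K @ alpha - 2 * s_tr[m] + 1.0  # radius making the argument of (17) zero at x_m

hyperplane = np.sign(s_te - rho)                         # cf. (10)
ball = np.sign(R2 - alpha @ K @ alpha + 2 * s_te - 1.0)  # cf. (17) with k(x, x) = 1
assert np.array_equal(hyperplane, ball)                  # identical decisions
```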
4 Optimization

The last section has formulated quadratic programs (QPs) for computing regions that capture a certain fraction of the data. These constrained optimization problems can be solved via an off-the-shelf QP package to compute the solution. They do, however, possess features that set them apart from generic QPs, most notably the simplicity of the constraints. In the present section, we describe an algorithm which takes advantage of these features and empirically scales better to large data set sizes than a standard QP solver with time complexity of order O(ℓ³) (cf. Platt, 1999). The algorithm is a modified version of SMO (Sequential Minimal Optimization), an SV training algorithm originally proposed for classification (Platt, 1999), and subsequently adapted to regression estimation (Smola and Schölkopf, 2000).
The strategy of SMO is to break up the constrained minimization of (11) into the smallest optimization steps possible. Due to the constraint on the sum of the dual variables, it is impossible to modify individual variables separately without possibly violating the constraint. We therefore resort to optimizing over pairs of variables.
Elementary optimization step. For instance, consider optimizing over α_1 and α_2 with all other variables fixed. Using the shorthand K_{ij} := k(x_i, x_j), (11) then reduces to

\min_{\alpha_{1}, \alpha_{2}} \quad \frac{1}{2}\sum_{i,j=1}^{2}\alpha_{i}\alpha_{j}K_{ij} + \sum_{i=1}^{2}\alpha_{i}C_{i} + C,   (18)

with C_i := \sum_{j=3}^{\ell}\alpha_{j}K_{ij} and C := \sum_{i,j=3}^{\ell}\alpha_{i}\alpha_{j}K_{ij}, subject to

0 \le \alpha_{1}, \alpha_{2} \le \frac{1}{\nu\ell}, \qquad \sum_{i=1}^{2}\alpha_{i} = \Delta,   (19)

where \Delta := 1 - \sum_{i=3}^{\ell}\alpha_{i}.
We discard C, which is independent of α_1 and α_2, and eliminate α_1 to obtain

\min_{\alpha_{2}} \quad \frac{1}{2}(\Delta - \alpha_{2})^{2}K_{11} + (\Delta - \alpha_{2})\alpha_{2}K_{12} + \frac{1}{2}\alpha_{2}^{2}K_{22} + (\Delta - \alpha_{2})C_{1} + \alpha_{2}C_{2},   (20)

with the derivative

-(\Delta - \alpha_{2})K_{11} + (\Delta - 2\alpha_{2})K_{12} + \alpha_{2}K_{22} - C_{1} + C_{2}.   (21)

Setting this to zero and solving for α_2, we get

\alpha_{2} = \frac{\Delta(K_{11} - K_{12}) + C_{1} - C_{2}}{K_{11} + K_{22} - 2K_{12}}.   (22)
Once α_2 is found, α_1 can be recovered from α_1 = Δ - α_2. If the new point (α_1, α_2) is outside of [0, 1/(νℓ)], the constrained optimum is found by projecting α_2 from (22) into the region allowed by the constraints, and then re-computing α_1.
The offset ρ is recomputed after every such step.
Additional insight can be obtained by rewriting the last equation in terms of the outputs of the kernel expansion on the examples x_1 and x_2 before the optimization step. Let α_1*, α_2* denote the values of their Lagrange parameters before the step. Then the corresponding outputs (cf. (10)) read

O_{i} := K_{1i}\alpha_{1}^{*} + K_{2i}\alpha_{2}^{*} + C_{i}.   (23)

Using the latter to eliminate the C_i, we end up with an update equation for α_2 which does not explicitly depend on α_1*,

\alpha_{2} = \alpha_{2}^{*} + \frac{O_{1} - O_{2}}{K_{11} + K_{22} - 2K_{12}},   (24)

which shows that the update is essentially the fraction of first and second derivative of the objective function along the direction of ν-constraint satisfaction.
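A minimal sketch of this elementary step is given below; the clipping implements the projection onto [0, 1/(νℓ)] described above, and the variable names and the degenerate-pair fallback are our choices rather than part of the original text.

```python
import numpy as np

def elementary_step(alpha, K, i, j, nu):
    """One pairwise update of (alpha_i, alpha_j) for the dual (11), cf. (22)-(24)."""
    l = len(alpha)
    ub = 1.0 / (nu * l)
    delta = alpha[i] + alpha[j]                 # preserved by the equality constraint
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]     # second derivative along the pair direction
    if eta <= 0.0:
        return alpha                            # degenerate pair; pick another one instead
    O_i, O_j = K[i] @ alpha, K[j] @ alpha       # kernel expansion outputs, cf. (23)
    a_j = alpha[j] + (O_i - O_j) / eta          # unconstrained optimum, cf. (24)
    a_j = np.clip(a_j, max(0.0, delta - ub), min(ub, delta))  # project into the feasible box
    new = alpha.copy()
    new[i], new[j] = delta - a_j, a_j
    # The offset rho is then recomputed from the margin SVs, cf. (12).
    return new
```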
Clearly, the same elementary optimization step can be applied to any pair of two variables, not just α_1 and α_2. We next briefly describe how to do the overall optimization.
Initialization of the algorithm. We start by setting a fraction ν of all α_i, randomly chosen, to 1/(νℓ). If νℓ is not an integer, then one of the examples is set to a value in (0, 1/(νℓ)) to ensure that Σ_i α_i = 1. Moreover, we set the initial ρ to max{O_i : i ∈ [ℓ], α_i > 0}.

Optimization algorithm. We then select a first variable for the elementary optimization step in one of the two following ways. Here, we use the shorthand SVnb for the indices of variables which are not at bound, i.e. SVnb := {i : i ∈ [ℓ], 0 < α_i < 1/(νℓ)}. At the end, these correspond to points that will sit exactly on the hyperplane, and that will therefore have a strong influence on its precise position.

(i) We scan over the entire data set² until we find a variable violating a KKT condition (Bertsekas, 1995, e.g.), i.e. a point such that (O_i - ρ) · α_i > 0 or (ρ - O_i) · (1/(νℓ) - α_i) > 0. Once we have found one, say α_i, we pick α_j according to

j = \arg\max_{n \in \mathrm{SVnb}} |O_{i} - O_{n}|.   (25)

(ii) Same as (i), but the scan is only performed over SVnb.

In practice, one scan of type (i) is followed by multiple scans of type (ii), until there are no KKT violators in SVnb, whereupon the optimization goes back to a single scan of type (i). If the type (i) scan finds no KKT violators, the optimization terminates.
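The sketch below illustrates the selection of a working pair according to this scheme; the tolerance and the convention of returning None when a scan finds no violator are our assumptions.

```python
import numpy as np

def select_working_pair(alpha, K, rho, nu, tol=1e-3, scan_all=True):
    """Return (i, j): i is the first KKT violator in the scan, j maximizes
    |O_i - O_j| over the non-bound set, cf. (25); None if no violator is found."""
    l = len(alpha)
    ub = 1.0 / (nu * l)
    O = K @ alpha                                              # kernel expansion outputs
    svnb = [n for n in range(l) if tol < alpha[n] < ub - tol]  # variables not at bound
    for i in (range(l) if scan_all else svnb):                 # type (i) vs. type (ii) scan
        violated = ((O[i] - rho) * alpha[i] > tol or
                    (rho - O[i]) * (ub - alpha[i]) > tol)
        if violated and svnb:
            return i, max(svnb, key=lambda n: abs(O[i] - O[n]))
    return None
```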
In unusual circumstances, the choice heuristic (25) cannot make positive progress. Therefore, a hierarchy of other choice heuristics is applied to ensure positive progress.

²This scan can be accelerated by not checking patterns which are on the correct side of the hyperplane by a large margin, using the method of Joachims (1999).
These other heuristics are the same as in the case of pattern recognition, cf. (Platt, 1999), and have been found to work well in our experiments to be reported below.
In our experiments with SMO applied to distribution support estimation, we have always found it to converge. However, to ensure convergence even in rare pathological conditions, the algorithm can be modified slightly, cf. (Keerthi et al., 1999).
We end this section by stating a trick which is of importance in practical implementations. In practice, one has to use a nonzero accuracy tolerance when checking whether two quantities are equal. In particular, comparisons of this type are used in determining whether a point lies on the margin. Since we want the final decision function to evaluate to 1 for points which lie on the margin, we need to subtract this constant from the offset ρ at the end.
In the next section, it will be argued that subtracting something from ρ is actually advisable also from a statistical point of view.
5 Theory

We now analyse the algorithm theoretically, starting with the uniqueness of the hyperplane (Proposition 2). We then describe the connection to pattern recognition (Proposition 3), and show that the parameter ν characterizes the fractions of SVs and outliers (Proposition 4). Following that, we give a robustness result for the soft margin (Proposition 5), and finally we present a theoretical result on the generalization error (Theorem 7). The proofs are given in the Appendix.
In this section, we will use italic letters to denote the feature space images of the corresponding patterns in input space, i.e.

x_{i} := \Phi(x_{i}).   (26)

Definition 1 A data set

x_{1}, \ldots, x_{\ell}   (27)

is called separable if there exists some w ∈ F such that (w · x_i) > 0 for i ∈ [ℓ].
If we use a Gaussian kernel (3), then any data set x_1, ..., x_ℓ is separable after it is mapped into feature space: to see this, note that k(x_i, x_j) > 0 for all i, j, thus all inner products between mapped patterns are positive, implying that all patterns lie inside the same orthant. Moreover, since k(x_i, x_i) = 1 for all i, they all have unit length. Hence they are separable from the origin.
`
`Proposition 2 (Supporting Hyperplane) If the data set (27) is separable, then there
`exists a unique supporting hyperplane with the properties that (1) it separates all data
`from the origin, and (2) its distance to the origin is maximal among all such hyper-
`planes. For any , it is given by
`
`(28)
`
`kwk subject to w xi  ; i  ‘:
`
` 
`
`min
`wF
`
The following result elucidates the relationship between single-class classification and binary classification.

Proposition 3 (Connection to Pattern Recognition) (i) Suppose (w, ρ) parametrizes the supporting hyperplane for the data (27). Then (w, 0) parametrizes the optimal separating hyperplane for the labelled data set

\{(x_{1}, +1), \ldots, (x_{\ell}, +1), (-x_{1}, -1), \ldots, (-x_{\ell}, -1)\}.   (29)

(ii) Suppose (w, 0) parametrizes the optimal separating hyperplane passing through the origin for a labelled data set

\{(x_{1}, y_{1}), \ldots, (x_{\ell}, y_{\ell})\}, \qquad y_{i} \in \{\pm 1\} \text{ for } i \in [\ell],   (30)

aligned such that (w · x_i) is positive for y_i = 1. Suppose, moreover, that ρ/‖w‖ is the margin of the optimal hyperplane. Then (w, ρ) parametrizes the supporting hyperplane for the unlabelled data set

\{y_{1}x_{1}, \ldots, y_{\ell}x_{\ell}\}.   (31)

Note that the relationship is similar for nonseparable problems. In that case, margin errors in binary classification (i.e. points which are either on the wrong side of the separating hyperplane or which fall inside the margin) translate into outliers in single-class classification, i.e. into points which fall on the wrong side of the hyperplane. Proposition 3 then holds, cum grano salis, for the training sets with margin errors and outliers, respectively, removed.
The utility of Proposition 3 lies in the fact that it allows us to recycle certain results proven for binary classification (Schölkopf et al., 2000) for use in the single-class scenario. The following, explaining the significance of the parameter ν, is such a case.
Proposition 4 (ν-Property) Assume the solution of (4),(5) satisfies ρ ≠ 0. The following statements hold:
(i) ν is an upper bound on the fraction of outliers.
(ii) ν is a lower bound on the fraction of SVs.
(iii) Suppose the data (27) were generated independently from a distribution P(x) which does not contain discrete components. Suppose, moreover, that the kernel is analytic and non-constant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of outliers.

Note that this result also applies to the soft margin ball algorithm of Tax and Duin (1999), provided that it is stated in the ν-parameterization given in Sec. 3.
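The ν-property is easy to observe empirically. The sketch below uses scikit-learn's OneClassSVM, a later implementation of the algorithm of Sec. 3 rather than the authors' own code; note that its gamma parameter corresponds to 1/c in the Gaussian kernel (3), and the data and parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
nu = 0.2

clf = OneClassSVM(kernel="rbf", nu=nu, gamma=2.0).fit(X)   # gamma = 1/c with c = 0.5

frac_svs = len(clf.support_) / len(X)            # fraction of support vectors
frac_outliers = np.mean(clf.predict(X) == -1)    # fraction of training outliers

# Proposition 4: frac_outliers <= nu <= frac_svs (up to finite-sample effects).
print(f"outliers {frac_outliers:.3f} <= nu {nu} <= SVs {frac_svs:.3f}")
```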
Proposition 5 (Resistance) Local movements of outliers parallel to w do not change the hyperplane.
Note that although the hyperplane does not change, its parametrization in w and ρ does.
We now move on to the subject of generalization. Our goal is to bound the probability that a novel point drawn from the same underlying distribution lies outside of the estimated region. We present a "marginalised" analysis which in fact provides a bound on the probability that a novel point lies outside the region slightly larger than the estimated one.

Definition 6 Let f be a real-valued function on a space X. Fix θ ∈ R. For x ∈ X let d(x, f, θ) = max{0, θ - f(x)}. Similarly, for a training sequence X := (x_1, ..., x_ℓ), we define

D(X, f, \theta) = \sum_{x \in X} d(x, f, \theta).
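In code, D is simply a sum of hinge-type slacks of the function values below the threshold θ; a minimal sketch (our notation, not the authors') follows.

```python
import numpy as np

def D(scores, theta):
    """D(X, f, theta) = sum_x max(0, theta - f(x)); `scores` holds f(x) on the sample."""
    return np.maximum(0.0, theta - np.asarray(scores)).sum()
```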
In the following, log denotes logarithms to base 2 and ln denotes natural logarithms.

Theorem 7 (Generalization Error Bound) Suppose we are given a set of ℓ examples X ∈ X^ℓ generated i.i.d. from an unknown distribution P which does not contain discrete components. Suppose, moreover, that we solve the optimisation problem (4),(5) (or equivalently (11)) and obtain a solution f_w given explicitly by (10). Let R_{w,ρ} := {x : f_w(x) ≥ ρ} denote the induced decision region. With probability 1 - δ over the draw of the random sample X ∈ X^ℓ, for any γ > 0,

P\{x : x \notin R_{w,\rho-\gamma}\} \le \frac{2}{\ell}\left(k + \log_{2}\frac{\ell^{2}}{2\delta}\right),   (32)

where

k = \frac{c_{1}\log(c_{2}\hat{\gamma}^{2}\ell)}{\hat{\gamma}^{2}} + \frac{2D}{\hat{\gamma}}\log\!\left(\mathrm{e}\left(\frac{(2\ell - 1)\hat{\gamma}}{2D} + 1\right)\right) + 2,   (33)

c_1 = 16c², c_2 = ln(2)/(4c²) for an absolute constant c, γ̂ = γ/‖w‖, D = D(X, f_{w,0}, ρ) = D(X, f_{w,ρ}, 0), and ρ is given by (12).

The training sample X defines (via the algorithm) the decision region R_{w,ρ}. We expect that new points generated according to P will lie in R_{w,ρ}. The theorem gives a probabilistic guarantee that new points lie in the larger region R_{w,ρ-γ}.
The parameter ν can be adjusted when running the algorithm to trade off incorporating outliers versus minimizing the "size" of R_{w,ρ}. Adjusting ν will change the value of D. Note that since D is measured with respect to ρ while the bound applies to ρ - γ, any point which is outside of the region that the bound applies to will make a contribution to D which is bounded away from 0. Therefore, (32) does not imply that asymptotically, we will always estimate the complete support.
The parameter γ allows one to trade off the confidence with which one wishes the assertion of the theorem to hold against the size of the predictive region R_{w,ρ-γ}: one can see from (33) that k and hence the RHS of (32) scales inversely with γ. In fact, it scales inversely with γ̂, i.e. it increases with ‖w‖. This justifies measuring the complexity of the estimated region by the size of w, and minimizing ‖w‖ in order to find a region that will generalize well. In addition, the theorem suggests not to use the offset ρ returned by the algorithm, which would correspond to γ = 0, but a smaller value ρ - γ (with γ > 0).
ν, width c      0.5, 0.5     0.5, 0.5     0.1, 0.5     0.5, 0.1
frac. SVs/OLs   0.54, 0.43   0.59, 0.47   0.24, 0.03   0.65, 0.38
margin ρ/‖w‖    0.84         0.70         0.62         0.48

Figure 1: First two pictures: A single-class SVM applied to two toy problems; ν = c = 0.5, domain: [-1, 1]². Note how in both cases, at least a fraction of 1 - ν of all examples is in the estimated region (cf. table). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these 'outliers' (OLs) into account by changing the kernel width (3): in the fourth picture, using c = 0.1, ν = 0.5, the data is effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
We do not claim that using Theorem 7 directly is a practical means to determine the parameters ν and γ explicitly. It is loose in several ways. We suspect c is too large by a factor of more than 50. Furthermore, no account is taken of the smoothness of the kernel used. If that were done (by using refined bounds on the covering numbers of the induced class of functions as in Williamson et al. (1998)), then the first term in (33) would increase much more slowly when decreasing γ. The fact that the second term would not change indicates a different tradeoff point. Nevertheless, the theorem gives one some confidence that ν and γ are suitable parameters to adjust.
6 Experiments

We apply the method to artificial and real-world data. Figure 1 displays 2-D toy examples, and shows how the parameter settings influence the solution. Figure 2 shows a comparison to a Parzen windows estimator on a 2-D problem, along with a family of estimators which lie "in between" the present one and the Parzen one.
Figure 3 shows a plot of the outputs (w · Φ(x)) on training and test sets of the US postal service database of handwritten digits. The database contains 9298 digit images of size 16 × 16 = 256; the last 2007 constitute the test set. We used a Gaussian kernel (3), which has the advantage that the data are always separable from the origin in feature space (cf. the comment following Definition 1). For the kernel parameter c, we used 0.5 · 256. This value was chosen a priori, as it is a common value for SVM classifiers on that data set, cf. Schölkopf et al. (1995).³

³In (Hayton et al., 2001), the following procedure is used to determine a value of c. For small c, all training points will become SVs: the algorithm just memorizes the data, and will not generalize well. As c increases, the number of SVs drops. As a simple heuristic, one can thus start with a small value of c and increase it until the number of SVs does not decrease any further.
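A sketch of this heuristic is given below, again using scikit-learn's OneClassSVM as a stand-in implementation (gamma = 1/c for the Gaussian kernel (3)); the candidate grid and the stopping rule are one possible reading of the procedure, not the authors' code.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def choose_kernel_width(X, nu=0.5, widths=2.0 ** np.arange(-6.0, 7.0)):
    """Increase c until the number of SVs stops decreasing (cf. footnote 3)."""
    prev = None
    for c in widths:
        n_svs = len(OneClassSVM(kernel="rbf", nu=nu, gamma=1.0 / c).fit(X).support_)
        if prev is not None and n_svs >= prev:
            return c                       # no further decrease: keep this width
        prev = n_svs
    return widths[-1]

X = np.random.randn(500, 10)
print(choose_kernel_width(X))
```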
Figure 2: A single-class SVM applied to a toy problem; c = 0.5, domain: [-1, 1]², for various settings of the offset ρ. As discussed in Sec. 3, ν = 1 yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution's support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside of the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction ν′ = 0.1 of patterns would lie outside. From left to right, we show the results for increasing values of ν, from ν = 0.1 up to ν = 1. The rightmost picture corresponds to the Parzen estimator which utilizes all kernels; the other estimators use roughly a fraction of ν kernels. Note that as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.

We fed our algorithm with the training instances of digit 0 only. Testing was done on both digit 0 and on all other digits. We present results for two values of ν, one large, one small; for values in between, the results are qualitatively similar. In the first experiment, we used ν = 50%, thus aiming for a description of "0-ness" which only captures half of all zeros in the training set. As shown in figure 3, this leads to zero false positives (i.e. even though the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such), while still recognizing a sizeable fraction of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of ν:
