DIMACS Series in Discrete Mathematics
and Theoretical Computer Science

Locally Lifting the Curse of Dimensionality for
Nearest Neighbor Search

Peter N. Yianilos

ABSTRACT. Our work gives a positive result for nearest neighbor search in
high dimensions. It establishes that radius-limited search is, under particular
circumstances, free of the curse of dimensionality. It further illuminates the
nature of the curse, and may therefore someday contribute to improved general
purpose algorithms for high dimensions and for general metric spaces.
We consider the problem of nearest neighbor search in the Euclidean
hypercube [-1,+1]^d with uniform distributions, and the additional natural
assumption that the nearest neighbor is located within a constant fraction R
of the maximum interpoint distance in this space, i.e. within distance 2R√d
of the query.
We introduce the idea of aggressive pruning and give a family of practical
algorithms, an idealized analysis, and describe experiments. Our main result
is that search complexity, measured in terms of d-dimensional inner product
operations, is i) strongly sublinear with respect to the data set size n for
moderate R, and ii) asymptotically, and as a practical matter, independent of
dimension.
Given a random data set, a random query within distance 2R√d of some
database element, and a randomly constructed data structure, the search succeeds
with a specified probability, which is a parameter of the search algorithm.
On average a search performs ≈ n^γ distance computations, where n is the
number of points in the database, and γ < 1 is calculated in our analysis. Linear
and near-linear space structures are described, and our algorithms and analysis
are free of large hidden constants, i.e. the algorithms perform far less work
than exhaustive search - both in theory and practice.

1. Introduction

Finding nearest neighbors in Euclidean spaces of very low dimension is
theoretically fast, and practical, using the notion of a Voronoi diagram [Aur91]. In
moderate dimension, or in general metric spaces with intrinsically moderate dimension,
recursive projection-decomposition techniques such as kd-trees [FBS75, FBF77,
BF79, Ben80], and vantage-point techniques [Uhl91b, Uhl91a, RP92, Yia93]
for general metric spaces of intrinsically low dimension, are effective.

1991 Mathematics Subject Classification. 68W01, 68W05, 68W40, 68P10.
Key words and phrases. Nearest neighbor search, kd-tree, curse of dimensionality.
Completed during 1999 while visiting the Princeton University computer science department.
A preliminary version appeared as [Yia00].

©2002 American Mathematical Society

As dimension d grows large, these tree techniques perform significantly better
than brute-force search only if the number of dataset points n grows exponentially
in d - or, in the case of Voronoi diagrams, only if space grows exponentially in d.
The motivation for this work is the observation that, in practice, one is usually
interested in nearest neighbors only if they are somewhat close to the query. The
main contribution of this paper is the algorithmic idea of aggressive pruning and
its analysis for the uniformly distributed hypercube. In this setting, with a natural
definition of somewhat close, we show that the expected time complexity of finding
nearest neighbors is invariant with respect to dimension(1) - and the space cost(2) is
independent of d, and linear in n.
In a Euclidean hypercube the maximum distance between two points grows
with √d. By somewhat close we mean within a neighborhood whose radius is a
constant fraction R of this distance. A parameter 0 < p < 1 controls the probability
that a search will locate the nearest neighbor within this search domain. For each
choice of R and p we calculate an exponent γ < 1 such that the search will perform
on average ≈ n^γ distance computations, and this dominates the work performed.
Notice that γ is independent of d.
The practical significance of our work is that search time is strongly sublinear
given moderately large values for R and acceptable success probabilities. For
example, searching 1,000,000 points uniformly distributed in [-1,+1]^1000 with our
experimental software, given R = 0.1, requires on average ≈ 30,000 distance
computations, and succeeds with probability 0.9988. In this example γ ≈ 0.78 from
our analysis. Arbitrarily high success probabilities can be obtained at the expense
of additional distance computations.
We remark that this work was motivated by the author's recent work [Yia99],
where the objective is to build a data structure that provides worst-case
sublinear-time radius-limited nearest neighbor search (independent of query) for a given
dataset. With uniform distributions in Euclidean space, the resulting structures
support search neighborhoods of only O(1) size, in contrast with those of this paper,
which scale linearly with the maximum interpoint distance.
Early work in nearest neighbor search is surveyed in [Das91]. There is a large
literature on the search problem, much of it elaborating on a single fact: that certain
projections from a metric space to ℝ have the property that projected distances
are dominated by those in the original space.
The two most important such projections are i) inner product with a unit vector
in Euclidean space, and ii) distance from a chosen vantage point.(3) These ideas were
recognized early on in work including [BK73, Fuk75, Fuk90, FS82, Sha77].
Taking the inner product with a canonical basis element in Euclidean space
leads to the well-known kd-tree of Friedman and Bentley [FBS75, FBF77, BF79,
Ben80]. They recursively divide a pointset in ℝ^d by projecting each element onto a
distinguished coordinate. Improvements, distribution adaptation, and incremental
searches are described in [EW82], [KP86], and [Bro90] respectively.

(1) Measured as d-dimensional inner product operations.
(2) In addition to the space required to store the dataset itself.
(3) The first of these may be viewed as the second in the limit as a vantage point moves toward
infinity along the direction of the unit vector.
The search tree we build has essentially the same structure as a kd-tree built
given randomly pretransformed data(4), and constrained to use the same cut
coordinate for all nodes at a given tree level. Our criterion for pruning branches during
search, and its analysis, are the primary contributions of this paper and distinguish
our methods from kd-tree search.

(4) That is, where the dataset is first transformed to the coordinate system induced by a random
basis. When the kd-tree cuts space using a single coordinate in this transformed setting, it is then
the same as cutting based on the inner product of a data element with a particular basis vector.
More recently, the field of computational geometry has yielded many interesting
results such as those of [Vai89, Cla88, Cla87, FMT92] and earlier [DL76].
Very recently a number of papers have appeared striving for more efficient
algorithms with worst-case time bounds for the approximate nearest neighbor problem
[AMN+94, Kle97, KOR98, IM98]. These exploit properties of random projections
beyond the simple projection distance dominance fact mentioned above, and
additional ideas to establish worst-case bounds. Our work may be viewed as
exploiting the fact that random projections of uniformly random data in a hypercube,
and of neighborhood balls of radius proportional to √d, both have constant variance
with respect to d. See also [Cla97] for very recent work on search in general
metric spaces.
Several of the papers mentioned above include interesting theoretical
constructions that trade polynomial space, and in some cases expensive preprocessing, for
fast performance in the worst case. We remark that to be useful in practice, a
nearest neighbor algorithm must require very nearly linear space - a stringent
requirement. As datasets grow, even low-degree polynomial space rapidly becomes
unacceptable.
For completeness, early work dealing with two special cases should be mentioned.
Retrieval of similar binary keys is considered by Rivest in [Riv74] and the
L∞ setting is the focus of [Yun76]. Also see [BM80] for worst-case data structures
for the range search problem. Finally, recent works [BOR99] and [CCGL99]
establish nontrivial lower bounds for nearest neighbor search.
In the following section we give construction and search algorithms specialized
for our uniform setting. Section 3 gives a concrete analysis of these algorithms,
including calculations of the applicable search time complexity exponent and failure
probability. Experiments are presented in Section 4, which confirm in practice the
dimensional invariance established by analysis. Finally, some directions for further
work are mentioned in Section 5.

2. Algorithms

2.1. Construction: A search tree is built with the set of data points as its
leaves. It has essentially the same structure as a kd-tree built on data that has first
been transformed to a random coordinate system. Construction time is easily seen
to be O(n log n) and space is linear.
Construction proceeds recursively. Each interior node has as input a list of
points. The root's list is the entire set. Associated with each interior node is a
randomly selected unit vector u_i, where i denotes level (distance from the root).
The number of such vectors is then equal to the depth of the tree minus one (because
there is no vector associated with a leaf). This set is constructed so as to be
orthonormal. First, random vectors are drawn, then they are orthonormalized in
the usual way (Gram-Schmidt).
Construction of a node consists of reading its input list, computing the inner
product of each element with the node's associated u_i, and adding the element
to a left or right output list depending on its sign. In general the dividing point
is chosen more intelligently, e.g. by computing the median of the inner product
values. But in our uniform [-1,+1]^d setting we just use zero for simplicity. Left
and right children are then generated recursively. If a node's input list consists of
a single element, then that node becomes a leaf and stores the element within it.
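
The construction is compact enough to sketch directly. The following Python
fragment is our own illustration of the procedure just described (random orthonormal
projectors, one per level, cut point fixed at zero); names such as Node and
build_tree are ours and do not come from the paper's ANSI-C implementation.

    import numpy as np

    class Node:
        def __init__(self, point=None, left=None, right=None):
            self.point = point          # set only at leaves
            self.left = left
            self.right = right

    def random_orthonormal_vectors(k, d, rng):
        # Draw k random Gaussian vectors in R^d and orthonormalize them (QR is
        # equivalent to Gram-Schmidt here); rows of the result are u_0, u_1, ...
        q, _ = np.linalg.qr(rng.standard_normal((d, k)))
        return q.T

    def build_tree(points, projectors, level=0):
        if len(points) == 1:            # a leaf stores a single data point
            return Node(point=points[0])
        # Reuse projectors cyclically in the unlikely event the tree grows deeper
        # than the number of prepared vectors (an illustrative simplification).
        u = projectors[level % len(projectors)]
        left = [x for x in points if np.dot(x, u) < 0.0]
        right = [x for x in points if np.dot(x, u) >= 0.0]
        if not left or not right:       # degenerate split: halve arbitrarily so recursion terminates
            mid = len(points) // 2
            left, right = points[:mid], points[mid:]
        return Node(left=build_tree(left, projectors, level + 1),
                    right=build_tree(right, projectors, level + 1))

    # Example: n uniform points in [-1,+1]^d and roughly 2*log2(n) prepared projectors.
    rng = np.random.default_rng(0)
    d, n = 200, 10_000
    data = list(rng.uniform(-1.0, 1.0, size=(n, d)))
    projectors = random_orthonormal_vectors(2 * int(np.log2(n)) + 4, d, rng)
    root = build_tree(data, projectors)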

2.2. Search: The search is parameterized by a query q, a value R ∈ (0,1)
giving the proportional size of the search domain, and a probability 0 < p < 1 that
is related to the success rate we expect.
The inner product of q and each u_i is first computed. Next the positive threshold
distance ε = Φ_{4R^2}^{-1}(p) is computed, where Φ_{σ^2} denotes the cumulative
distribution function for a zero mean normal density with the variance indicated by
its subscript.
Search proceeds recursively from the root. For a node at level i the value
<q, u_i> is examined. If it is less than ε then the left child is recursively explored. If
it exceeds -ε then the right child is explored. Notice that when <q, u_i> ∈ (-ε, +ε)
both children are explored. When a leaf is encountered, the distance is computed
between the element it contains and q.
This decision rule is centered at zero because of our particular uniform [-1,+1]^d
setting, but is easily translated to an arbitrary cut point, e.g. the median of the
projected values.
After each distance computation d(q, x) is performed, the proportion d(q, x)/(2√d)
is computed. If smaller than R, then R is reduced to this value and ε is recomputed.
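
For concreteness, here is a matching search sketch (again our own illustration,
not the paper's code), using the Node and projector conventions of the construction
sketch above; the threshold ε = Φ_{4R^2}^{-1}(p) is obtained from the standard
library's NormalDist.

    import numpy as np
    from statistics import NormalDist

    def search(root, q, projectors, R, p):
        d = len(q)
        state = {"R": R,
                 "eps": NormalDist(0.0, 2.0 * R).inv_cdf(p),   # epsilon = Phi_{4R^2}^{-1}(p)
                 "best": None, "best_dist": float("inf")}
        proj = [float(np.dot(q, u)) for u in projectors]       # <q, u_i> for every level

        def visit(node, level):
            if node.point is not None:                          # leaf: one true distance computation
                dist = float(np.linalg.norm(q - node.point))
                if dist < state["best_dist"]:
                    state["best"], state["best_dist"] = node.point, dist
                    frac = dist / (2.0 * np.sqrt(d))             # proportion of the maximum distance
                    if frac < state["R"]:                        # shrink R and recompute epsilon
                        state["R"] = frac
                        state["eps"] = NormalDist(0.0, 2.0 * max(frac, 1e-12)).inv_cdf(p)
                return
            t = proj[level % len(proj)]
            if t < state["eps"]:                                 # left branch survives aggressive pruning
                visit(node.left, level + 1)
            if t > -state["eps"]:                                # right branch survives aggressive pruning
                visit(node.right, level + 1)

        visit(root, 0)
        return state["best"], state["best_dist"]

When <q, u_i> falls inside (-ε, +ε) both recursive calls fire, exactly as in the
decision rule described above.
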
This concludes the description of our search algorithm; we now briefly
discuss related issues and extensions.
An important idea in kd-tree search involves the computation of the minimum
distance from the query to the subspace corresponding to a node to be explored.
If this distance grows beyond the radius of interest, the node is pruned. We do
not, however, include it in our analysis or experimental implementation because in
our high dimensional setting, in the absence of exponentially many data elements,
this idea has vanishingly little effect. Intuitively, this is because the search tree is
not nearly deep enough for the minimum distance to grow larger than the search
radius.
The analogue of this kd-tree idea in our setting is an alternative version of our
algorithm above that slightly reduces ε while descending through interior nodes to
reflect the fact that the distribution of data elements within a ball about the query
is no longer uniform. But, again, this is a second order effect.
Finally we remark that the ε-cutoff approach taken above might be replaced
with an entirely probabilistic pruning scheme that passes probabilities from our
analysis down to each child during search. The probabilities along the path up to
the root are then multiplied, and search continues downward until the product falls
below a specified threshold.

3. Analysis

We assume both data points and queries are uniformly distributed within the
hypercube [-1,+1]^d. The Euclidean distance metric applies, and the maximum
distance between two points in this space is 2√d. We consider the problem of
finding the nearest neighbor of a query q within some distance δ that is a constant
proportion of the maximum interpoint distance, i.e. δ = 2R√d with 0 < R < 1.
Any unit vector u gives rise to a projection π_u mapping x ∈ ℝ^d into ℝ, defined
by <x, u>. It is immediate that distances in the range of this projection are
dominated by distances in the domain, since |π_u(x) - π_u(q)| = |<x - q, u>| ≤ ||x - q||
by the Cauchy-Schwarz inequality. It is this fact that kd-trees exploit to prune
branches during branch-and-bound search.
If r represents the distance to the nearest neighbor encountered so far during
a search, then this fact implies that every member of the ball of radius r centered
at the query q maps under any π_u into the interval [π_u(q) - r, π_u(q) + r]. So
kd-tree search may confidently disregard any data points that project outside of this
interval. Unlike the kd-tree, which finds nearest neighbors with certainty, we will
consider randomized constructions that succeed with some specified probability.
Since δ scales up linearly with √d, the interval [π_u(q) - δ, π_u(q) + δ] grows too,
and soon the kd-tree can confidently prune almost nothing, and performs a nearly
exhaustive search.
But in our uniformly random setting we will show that the δ ball about q
projects to a distribution about π_u(q) having constant variance. This is why it is
possible to continue to probabilistically prune just as effectively even as d → ∞.
Since dimension does not affect our ability to prune, we have lifted the curse of
dimensionality in our setting.
We now proceed with the description and idealized analysis of our algorithm
with the following proposition, which establishes two elementary but key
dimensional invariants.

PROPOSITION 3.1. i) Let u be a random unit vector, and let X denote a random
dataset in our setting; then the one dimensional set of values π_u(X) has mean zero
and variance 1/3 - independent of dimension - where both u and X are random
variables. ii) Consider any q ∈ ℝ^d and let r denote a random vector located on
the surface of a ball centered at q of radius 2R√d. Then for any unit vector u,
the distribution of values π_u(r) has mean π_u(q) and variance 4R^2 - independent of
dimension.

PROOF. The variance of each component of u is 1/d since <u, u> = 1 by
definition and the components are i.i.d. Consider a random element x ∈ [-1,+1]^d.
Here a simple integration establishes that the variance of each component is 1/3. So
the variance of each term of <u, x> is 1/(3d), and that of the entire inner product
is then 1/3 as required. Now each component of u and x has mean zero, so that
<u, x> has mean zero - and part i) is established.
π_u(r) is centered at π_u(q) by linearity of the inner product. So we may assume
without loss of generality that q = 0. Now the inner product of two random unit
vectors is easily seen to have variance 1/d. Scaling one of them by a factor of 2R√d
increases the variance by a factor of 4R^2·d, so that it becomes 4R^2 as required by
part ii). □
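
These invariants are easy to check numerically. The following small Monte Carlo
sketch in Python (our illustration, not part of the paper) estimates both variances
and shows that they stay near 1/3 and 4R^2 as d grows:

    import numpy as np

    def projection_variances(d, R, samples=10_000, seed=0):
        rng = np.random.default_rng(seed)
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                                   # random unit vector
        x = rng.uniform(-1.0, 1.0, size=(samples, d))            # uniform data in [-1,+1]^d
        r = rng.standard_normal((samples, d))
        r *= (2.0 * R * np.sqrt(d)) / np.linalg.norm(r, axis=1, keepdims=True)
        # r now lies on the sphere of radius 2R*sqrt(d) about the origin (q = 0 w.l.o.g.)
        return np.var(x @ u), np.var(r @ u)

    for d in (10, 100, 1000):
        v_data, v_ball = projection_variances(d, R=0.1)
        print(d, round(v_data, 3), round(v_ball, 3))   # ~0.333 and ~0.04 (= 4R^2) for every d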

Now fix some unit vector u and consider the query's projection π_u(q). Then by
Proposition 3.1, and ignoring hypercube corner effects, the projections of the data
points within the domain of search are distributed about π_u(q) with variance no
greater than 4R^2.
Because distances between projected points are dominated by their original
distances, it is clear that |π_u(x) - π_u(q)| < 2R√d for any x in the search domain.
So any data points with projections farther than 2R√d from π_u(q) can be
confidently ruled out during search. It is this hard pruning that kd-trees (and
many related structures) exploit to avoid considering every point during nearest
neighbor search.
But notice that this pruning cutoff distance 2R√d → ∞ with d while the
projection variance remains constant. As a result hard pruning is asymptotically
ineffective, and this gives rise to the curse of dimensionality for nearest neighbor
search using kd-trees in our setting.
Next observe that as a consequence of the central limit theorem the
distributions of Proposition 3.1 are asymptotically normal - and we remark that as a
practical matter this is a good approximation, even for moderate dimension. So
from this point on in our analysis we will simply assume normality, i.e. that d is
sufficiently large so that its deviation from normality is negligible.
We then arrive at one of the main observations of this paper: that the nearest
neighbor's projection is within constant distance of π_u(q) with arbitrarily high and
easily calculated probability p depending only on this constant, and not on d.
Recall that the tree we build recursively bisects the set of data points using a
randomly selected u by separating its projection into left and right branches at a
cut point c near the median. Given a query q we prune one of these branches if
π_u(q) is at least some cutoff distance ε from c. We refer to this idea as aggressive
pruning.
PROPOSITION 3.2. Using cutoff distance ε = Φ_{4R^2}^{-1}(p) to prune branches will
exclude the nearest neighbor (assuming one exists within distance 2R√d of the
query) with probability 1 - p at each node, where Φ_{σ^2}(x) denotes the cumulative
distribution function for the zero mean normal density with variance σ^2.

PROOF. Equating the target failure rate 1 - p with the mass of the distribution's
positive tail gives 1 - Φ_{4R^2}(ε) = 1 - p, so ε = Φ_{4R^2}^{-1}(p). □
Again, note that this cutoff distance ε is constant despite the fact that the
absolute size of our search domain is expanding with √d as d → ∞. We can now
establish a single tree's expected search complexity and success probability.

PROPOSITION 3.3. Given n data points, R, p, and ε from Proposition 3.2, and
assuming d = Ω(log n), then i) our search will visit ≈ n^γ leaf nodes, where
γ = log_2 2Φ_{1/3}(ε), and compute a d-dimensional Euclidean distance at each; and
ii) the search will succeed with probability no smaller than ≈ p^{log_2 n}.

PROOF. Recall that we expect that at each node of the tree the projection of
the remaining database elements will have mean/median close to zero and variance
of approximately 1/3. We fix our attention on one such node with corresponding
unit vector u and assume the cut point is exactly zero and the variance is exactly
1/3. Since we have assumed that the queries are also uniformly distributed, their
projection is identical. So we can easily calculate the probability b that a query q
lies within distance ε of zero. In this event both left and right branches must be
explored. Otherwise, with probability 1 - b, we can prune one so that only one
is explored.
The mass of one tail (beyond ε) is 1 - Φ_{1/3}(ε), from which it follows that 1 - b
is twice this value. So b = 2Φ_{1/3}(ε) - 1. Then the expected number F of branches
visited is 2·b + 1·(1 - b) = 2Φ_{1/3}(ε).
Since d = Ω(log n) it is possible to choose unit vectors from root to leaf that
are independent (e.g. the canonical basis), random vectors will be nearly so, and
our random orthonormalization process will certainly succeed. So except for corner
effects, which we disregard, it is reasonable to assume that query projections and
failure probabilities are independent along each root-leaf path.
Since the tree's depth is log_2 n (we'll assume perfect balance) the search will
visit F^{log_2 n} nodes. That is 2^{log_2 F · log_2 n}, or n^{log_2 F}, establishing i).
Now the nearest neighbor is located somewhere as a leaf of the tree. To succeed,
the search must not prune it at any of the log_2 n nodes along a path up to the root.
This happens with probability p^{log_2 n}, establishing ii) - where independence follows
from the orthogonality of the set of projectors and the assumption of a uniform
distribution.(5) □

Notice that the exponent of n in Proposition 3.3 is within (0, 1). Also, the
requirement that d = Ω(log n) is not very restrictive, since otherwise the number
of data points is exponential in d and earlier techniques can be used to efficiently
locate nearest neighbors.
The probability estimate of Proposition 3.3 is extremely conservative. The
actual failure probabilities are much smaller because: i) our analysis has implicitly
assumed that the query always projects to a point exactly ε from the cut point.
This is the worst case. If the distance is smaller, then nothing is pruned and no
error can be made; and if the distance is larger, the cutoff distance is effectively
increased; and ii) we have assumed that the nearest neighbor is always just within
the domain of search. In reality queries will be distributed within this domain.
Even without considering these factors our calculations suggest that efficient
search is possible for moderately large values of R.
Table 1 gives exponent values for selected values of R and p. For example,
if R = 0.05 and p = 1 - 10^-4 = 0.9999 then the exponent is ≈ 0.566. Given
n = 10,000,000 data points our results state that no more than n^0.566 ≈ 9,162
inner product operations will be performed, and that the nearest neighbor will be
located with probability 0.9999^{log_2 n} ≈ 0.9977. Choosing p = 1 - 10^-6 corresponds
to 48,194 inner products with the probability of success increasing to 0.999977. All
of these calculations are asymptotically independent of dimension.
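
These figures follow mechanically from Propositions 3.2 and 3.3. The following
standard-library Python fragment (an illustrative check of the arithmetic, not the
paper's software) reproduces the numbers above and can regenerate any entry of
Table 1:

    import math
    from statistics import NormalDist

    def gamma_exponent(R, p):
        # epsilon = Phi_{4R^2}^{-1}(p): p-quantile of a zero-mean normal with std 2R (Prop. 3.2).
        eps = NormalDist(0.0, 2.0 * R).inv_cdf(p)
        # gamma = log2(2 * Phi_{1/3}(epsilon)): projected data has variance 1/3 (Prop. 3.3).
        return math.log2(2.0 * NormalDist(0.0, math.sqrt(1.0 / 3.0)).cdf(eps))

    n = 10_000_000
    for p in (1 - 1e-4, 1 - 1e-6):
        g = gamma_exponent(0.05, p)
        print(round(g, 3),                        # 0.566 and 0.669
              round(n ** g),                      # ~9,162 and ~48,194 inner products
              round(p ** math.log2(n), 6))        # success bound ~0.9977 and ~0.999977
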
We continue this example to illustrate the use of forests to improve performance
in some settings. Later we will give a theoretical consequence of this idea. Consider
two independent trees with p = 1 - 10^-4; that is, the choice of random unit vectors
is independent. Both trees are searched for each query. We then expect a failure
rate of ≈ (1 - 0.9977)^2 = 0.00000529, and 2 · 9,162 = 18,324 inner products are
performed.(6) Now this failure rate is considerably better than the 1 - 0.999977 =
0.000023 resulting from a single tree using p = 1 - 10^-6, and far fewer inner products
are computed, but space consumption is doubled. This example illustrates the
space-time design tradeoff that arises when applying our ideas in practice.
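
The comparison in this example can be checked with a few lines (an illustrative
sketch reusing the same formulas as above; not from the paper):

    import math
    from statistics import NormalDist

    def gamma_exponent(R, p):                     # as in the previous sketch
        eps = NormalDist(0.0, 2.0 * R).inv_cdf(p)
        return math.log2(2.0 * NormalDist(0.0, math.sqrt(1.0 / 3.0)).cdf(eps))

    n, R = 10_000_000, 0.05
    depth = math.log2(n)

    def single_tree(p):                           # (expected inner products, success bound) for one tree
        return n ** gamma_exponent(R, p), p ** depth

    work4, succ4 = single_tree(1 - 1e-4)
    work6, succ6 = single_tree(1 - 1e-6)
    print(round(2 * work4), (1 - succ4) ** 2)     # forest of two trees: ~18,324 products, failure ~5.3e-6
    print(round(work6), 1 - succ6)                # single tree at 1 - 1e-6: ~48,194 products, failure ~2.3e-5
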
The natural generalization of this idea leads to a forest of independent trees.
Recall that the failure rate of a single tree grows, albeit slowly, with n. By using
enough trees one can force the forest's failure rate to zero as n increases while
preserving sublinear search times, at the expense of polynomial space to pay for
the additional trees. To analyze this let x denote the tree's depth and observe that
(1 - p^x)^y gives the probability of failure for a forest of y distinct random trees. It
can be shown that if y = p^{-(1+ε)x} for any ε > 0, then this probability tends to
zero as x (and therefore n) increases.(7) Since x = log_2 n, the number of trees is
y = n^{-(1+ε) log_2 p} in terms of n. Search time (the number of inner product
operations) also increases by a factor of y. Space increases from linear to O(yn).
Finally, to maintain independence we must increase the lower bound on d to
O(y log n).

(5) Negligible dependence results from the hypercube's corners, but for simplicity we ignore this
effect.
(6) We must have d large enough to allow independent unit vectors to be drawn, and will say
more about this later.
(7) However it may increase somewhat before eventually decreasing.

    R        1-p: 10^-2    10^-4    10^-6    10^-8    10^-10   10^-12
    0.01          0.090    0.141    0.177    0.207    0.232    0.254
    0.02          0.174    0.267    0.331    0.381    0.423    0.458
    0.03          0.252    0.379    0.463    0.526    0.577    0.618
    0.04          0.325    0.479    0.575    0.645    0.698    0.740
    0.05          0.393    0.566    0.669    0.739    0.790    0.829
    0.06          0.456    0.642    0.746    0.813    0.859    0.892
    0.07          0.513    0.707    0.808    0.869    0.908    0.935
    0.08          0.566    0.763    0.858    0.911    0.943    0.963
    0.09          0.615    0.810    0.897    0.941    0.965    0.979
    0.10          0.660    0.850    0.926    0.962    0.980    0.989
    0.11          0.700    0.882    0.949    0.976    0.989    0.995
    0.12          0.737    0.909    0.965    0.986    0.994    0.998
    0.13          0.770    0.931    0.977    0.992    0.997    0.999
    0.14          0.800    0.948    0.985    0.995    0.999    1.000
    0.15          0.826    0.961    0.990    0.997    0.999    1.000
    0.16          0.850    0.971    0.994    0.999    1.000    1.000
    0.17          0.871    0.979    0.996    0.999    1.000    1.000
    0.18          0.890    0.985    0.998    1.000    1.000    1.000
    0.19          0.906    0.990    0.999    1.000    1.000    1.000
    0.20          0.921    0.993    0.999    1.000    1.000    1.000

TABLE 1. Searches in our setting with a single tree require ≈ n^γ
inner product operations and linear space. This table gives γ values
for various choices of R and p from our analysis.

It is apparent from Table 1 that this idea works for many values of R and p. For
example, if R = 0.05 and p = 1 - 10^-6 then the original exponent 0.566 increases
by only ≈ 1.44 × 10^-6 (the increase is -(1+ε) log_2 p, which for small ε is about
10^-6 / ln 2) - far less than the table's precision. But it does not work for arbitrary
R and p. For large values of R the exponent is so close to 1 that the added
complexity makes the total superlinear.
Note that O(nd) space is required to store the data elements themselves, but
only O(n) space is needed for the interior tree nodes, which point to data elements.
So, in practice, we can afford O(d) trees while remaining within linear space. In
practice this means that in high dimension one can easily afford many trees without
materially affecting space consumption. In theory it means that if d is assumed to
grow as an appropriate small power of n as n → ∞, then the entire forest occupies
linear space.
Returning to our choice of the [-1,+1]^d hypercube, we observe that i) all of
our arguments apply immediately to [0,1]^d after compensating for the shifted mean,
and ii) our arguments apply as well to the d-dimensional hypersphere. To see this,
note that the dot product of a random unit vector and a random database element
has variance 1/d, and the maximum interpoint distance is constant, so the projected
variance of a proportional neighborhood ball also falls as 1/d.
Finally we remark that as an alternative to forests, one can build a single tree
in which each level is overlapped, i.e. items near the center are stored in both the
left and right branches. This effectively reduces ε and with it search time - at the
expense of polynomial space (given a constant fraction of overlap at each level).

4. Experiments

To confirm our analysis we wrote an ANSI-C program to build and search trees.
It accepts d, n, R, p as arguments, in addition to the number of random queries to
test, and a random seed.
The data points are distributed uniformly in [-1,+1]^d. Queries are generated
by choosing a data point x at random, then a random unit vector u, and forming
the sum x + (1 - ε)·2R√d·u - so that the query lies just inside of the search domain.
We use ε = 0.0001. This ensures that a nearest neighbor exists within the domain.
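
As an illustration (our own Python sketch, not the distribution's C code), this
query generator follows the recipe just described:

    import numpy as np

    def make_query(data, R, rng, slack=1e-4):
        # Choose a database point, then step (1 - slack) * 2R*sqrt(d) away along a
        # uniformly random direction, so the chosen point remains just inside the
        # search domain of radius 2R*sqrt(d) about the query.
        x = data[rng.integers(len(data))]
        d = x.shape[0]
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        return x + (1.0 - slack) * 2.0 * R * np.sqrt(d) * u
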
An indexed set of projection vectors is generated by starting with random
vectors and orthonormalizing them in the standard way. Each level in the tree uses
the corresponding vector in this set as its projector. So the size of the set is O(log n),
and in practice is ≈ 2 log_2 n because the resulting trees are not perfectly balanced.
For simplicity zero is always used as the cut point during tree construction.
In order to allow us to simultaneously explore large n and d without exceeding
available RAM, we represent the data points to be searched using a pseudorandom
generator seed that suffices to construct them. As a result we can easily study
n = 1,000,000 and higher in almost arbitrarily high dimension.
Figure 1 illustrates that search complexity, measured in the number of inner
product operations performed, is very nearly dimension-invariant - confirming our
analysis. So total work scales up only linearly with dimension, through the cost of
each d-dimensional inner product operation.
Note that for simplicity we regard an inner product operation and a Euclidean
distance computation as having identical complexity. Also, these results exclude
the roughly 2 log_2 n inner products required to compute the query's projection
at each level of the tree.
Recall that our analysis of failure rate was quite conservative, and experiments
confirm this. Actual failure rates for every case in Figure 1 were somewhat lower
than those from analysis. For example, in the R = 0.20 case, analysis predicts that
the search will succeed with probability ≈ 0.85, and the actual value is ≈ 0.97.
Note that one reason for using the moderate value p = 0.99 in our study is that we
have confirmed that values of 1 - p far closer to zero result in no observable
experimental error given the limited number of queries we have time to perform.
We compare performance in our setting with kd-tree search [FBS75, FBF77,
BF79, Ben80]. We are interested in neighbors within a specified distance of the
query. To reproduce this setting, kd-tree search is initialized as though an element
at the specified distance has already been encountered.
[Figure 1: plot of the number of leaves visited (log scale) versus dimension, one curve per value of R.]

FIGURE 1. Proportional radius search complexity is constant with
respect to dimension - confirming our analysis. We measure the
number of leaves visited. Each visit corresponds to a Euclidean
distance computation. In this experiment the number of uniformly
distributed data points n = 100,000, and the p from our analysis
is 0.99. Each query is generated so as to be slightly less than
distance 2R√d from some database element, so that our radius-limited
search always succeeds. The results of 1,000 such random
queries are averaged to produce each data point. Our analysis
predicts constant values of approximately 3, 92, 1987, 13,553, and
40,114 corresponding to R values of 0.01, 0.05, 0.10, 0.15, and 0.20
respectively. The experimental values compare well with these
estimates, and in general are dominated by them by a factor of
approximately two.

A recent kd-tree implementation [MA99] is modified for our experiments. Its
core routines are altered to support radius-limited search as described above, and
the distribution's sample program is adapted to report the average number of
distance computations performed in response to each query. As in our earlier
experiments, queries are generated by starting at a database element, and moving the
`specifi
