Locally Lifting the Curse of Dimensionality for
Nearest Neighbor Search

Peter N. Yianilos

ABSTRACT. Our work gives a positive result for nearest neighbor search in
high dimensions. It establishes that radius-limited search is, under particular
circumstances, free of the curse of dimensionality. It further illuminates the
nature of the curse, and may therefore someday contribute to improved general
purpose algorithms for high dimensions and for general metric spaces.

We consider the problem of nearest neighbor search in the Euclidean
hypercube $[-1,+1]^d$ with uniform distributions, and the additional natural
assumption that the nearest neighbor is located within a constant fraction $R$
of the maximum interpoint distance in this space, i.e. within distance $2R\sqrt{d}$
of the query.

We introduce the idea of aggressive pruning and give a family of practical
algorithms, an idealized analysis, and a description of experiments. Our main
result is that search complexity, measured in terms of $d$-dimensional inner
product operations, is i) strongly sublinear with respect to the data set size $n$
for moderate $R$, and ii) asymptotically, and as a practical matter, independent
of dimension.

Given a random data set, a random query within distance $2R\sqrt{d}$ of some
database element, and a randomly constructed data structure, the search succeeds
with a specified probability, which is a parameter of the search algorithm.
On average a search performs $\approx n^\gamma$ distance computations, where $n$ is the
number of points in the database, and $\gamma < 1$ is calculated in our analysis.
Linear and near-linear space structures are described, and our algorithms and
analysis are free of large hidden constants, i.e. the algorithms perform far less
work than exhaustive search, both in theory and practice.

1. Introduction

Finding nearest neighbors in Euclidean spaces of very low dimension is theoretically
fast, and practical, using the notion of a Voronoi diagram [Aur91]. In moderate
dimension, or in general metric spaces with intrinsically moderate dimension,
recursive projection-decomposition techniques such as kd-trees [FBS75, FBF77,
BF79, Ben80] and vantage-point techniques [Uhl91b, Uhl91a, RP92, Yia93]
for general metric spaces of intrinsically low dimension, are effective.

1991 Mathematics Subject Classification. 68W01, 68W05, 68W40, 68P10.
Key words and phrases. Nearest neighbor search, kd-tree, curse of dimensionality.
Completed during 1999 while visiting the Princeton University computer science department.
A preliminary version appeared as [Yia00].

©2002 American Mathematical Society

As dimension d grows large, these tree techniques perform significantly better
than brute-force search only if the number of dataset points n grows exponentially
in d. Or, in the case of Voronoi diagrams, if space increases exponentially.

The motivation for this work is the observation that, in practice, one is usually
interested in nearest neighbors only if they are somewhat close to the query. The
main contribution of this paper is the algorithmic idea of aggressive pruning and
its analysis for the uniformly distributed hypercube. In this setting, with a natural
definition of somewhat close, we show that the expected time complexity of finding
nearest neighbors is invariant with respect to dimension^1, and the space cost^2 is
independent of d, and linear in n.

In a Euclidean hypercube the maximum distance between two points grows
with $\sqrt{d}$. By somewhat close we mean within a neighborhood whose radius is a
constant fraction R of this distance. A parameter $0 < p < 1$ controls the probability
that a search will locate the nearest neighbor within this search domain. For each
choice of R and p we calculate an exponent $\gamma < 1$ such that the search will perform
on average $\approx n^\gamma$ distance computations, and this dominates the work performed.
Notice that $\gamma$ is independent of d.

The practical significance of our work is that search time is strongly sublinear
given moderately large values for R and acceptable success probabilities. For
example, searching 1,000,000 points uniformly distributed in $[-1,+1]^{1000}$ with our
experimental software, given R = 0.1, requires on average $\approx 30{,}000$ distance
computations, and succeeds with probability 0.9988. In this example $\gamma \approx 0.78$ from
our analysis. Arbitrarily high success probabilities can be obtained at the expense
of distance computations.

We remark that this work was motivated by the author's recent work of [Yia99],
where the objective is to build a data structure that provides worst-case sublinear-time
radius-limited nearest neighbor search (independent of query) for a given
dataset. With uniform distributions in Euclidean space, the resulting structures
support search neighborhoods of only O(1) size, in contrast with those of this paper,
which scale linearly with the maximum interpoint distance.

Early work in nearest neighbor search is surveyed in [Das91]. There is a large
literature on the search problem, much of it elaborating on a single fact: that certain
projections from a metric space to $\mathbb{R}$ have the property that projected distances
are dominated by those in the original space.

The two most important such projections are i) inner product with a unit vector
in Euclidean space, and ii) distance from a chosen vantage point.^3 These ideas were
recognized early on in work including [BK73, Fuk75, Fuk90, FS82, Sha77].

Taking the inner product with a canonical basis element in Euclidean space
leads to the well-known kd-tree of Friedman and Bentley [FBS75, FBF77, BF79,
Ben80]. They recursively divide a pointset in $\mathbb{R}^d$ by projecting each element onto a
distinguished coordinate. Improvements, distribution adaptation, and incremental
searches are described in [EW82], [KP86], and [Bro90] respectively.

^1 Measured as d-dimensional inner product operations.
^2 In addition to the space required to store the dataset itself.
^3 The first of these may be viewed as the second in the limit as a vantage point moves toward
infinity along the direction of the unit vector.

The search tree we build has essentially the same structure as a kd-tree built
on randomly pretransformed data^4, constrained to use the same cut coordinate
for all nodes at a given tree level. Our criterion for pruning branches during
search, and its analysis, are the primary contributions of this paper and distinguish
our methods from kd-tree search.

More recently, the field of computational geometry has yielded many interesting
results such as those of [Vai89, Cla88, Cla87, FMT92] and earlier [DL76].
Very recently a number of papers have appeared striving for more efficient algorithms
with worst-case time bounds for the approximate nearest neighbor problem
[AMN+94, Kle97, KOR98, IM98]. These exploit properties of random projections
beyond the simple projection distance dominance fact mentioned above, and
additional ideas to establish worst case bounds. Our work may be viewed as
exploiting the fact that random projections of uniformly random data in a hypercube,
and of neighborhood balls of radius proportional to $\sqrt{d}$, both have constant
variance with respect to d. See also [Cla97] for very recent work on search in general
metric spaces.

Several of the papers mentioned above include interesting theoretical constructions
that trade polynomial space, and in some cases expensive preprocessing, for
fast performance in the worst case. We remark that to be useful in practice, a
nearest neighbor algorithm must require very nearly linear space, a stringent
requirement. As datasets grow, even low-degree polynomial space becomes rapidly
unacceptable.

For completeness, early work dealing with two special cases should be mentioned.
Retrieval of similar binary keys is considered by Rivest in [Riv74] and the
$L_\infty$ setting is the focus of [Yun76]. Also see [BM80] for worst case data structures
for the range search problem. Finally, recent works [BOR99] and [CCGL99]
establish nontrivial lower bounds for nearest neighbor search.

In the following section we give construction and search algorithms specialized
for our uniform setting. Section 3 gives a concrete analysis of these algorithms that
includes calculation of the applicable search time complexity exponent and failure
probability. Experiments are presented in Section 4, which confirm in practice the
dimensional invariance established by analysis. Finally, some directions for further
work are mentioned in Section 5.

2. Algorithms

2.1. Construction: A search tree is built with the set of data points as its
leaves. It has essentially the same structure as a kd-tree built on data that has first
been transformed to a random coordinate system. Construction time is easily seen
to be $O(n \log n)$ and space is linear.

Construction proceeds recursively. Each interior node has as input a list of
points. The root's list is the entire set. Associated with each interior node is a
randomly selected unit vector $u_i$, where i denotes level (distance from the root).
The number of such vectors is then equal to the depth of the tree minus one (because
there is no vector associated with a leaf). This set is constructed so as to be
orthonormal: first, random vectors are drawn, then they are orthonormalized in
the usual way (Gram-Schmidt).

^4 That is, where the dataset is first transformed to the coordinate system induced by a random
basis. Cutting space using a single coordinate in this transformed setting is then the same as
cutting based on the inner product of a data element with a particular basis vector.

Construction of a node consists of reading its input list, computing the inner
product of each element with the node's associated $u_i$, and adding the element
to a left or right output list depending on its sign. In general the dividing point
is chosen more intelligently, e.g. by computing the median of the inner product
values. But in our uniform $[-1,+1]^d$ setting we just use zero for simplicity. Left
and right children are then generated recursively. If a node's input list consists of
a single element, then that node becomes a leaf and stores the element within it.

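To make the construction concrete, here is a minimal sketch in Python using numpy; the names (Node, build_node, random_orthonormal_projectors) are ours for illustration and are not taken from the paper's ANSI-C implementation.

```python
import numpy as np

class Node:
    def __init__(self, points=None, left=None, right=None):
        self.points = points            # leaf payload; None for interior nodes
        self.left, self.right = left, right

def random_orthonormal_projectors(levels, d, rng):
    """Draw `levels` random vectors in R^d and orthonormalize them (Gram-Schmidt via QR)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, levels)))
    return q.T                          # row i is the unit vector u_i shared by all nodes at level i

def build_node(points, level, projectors):
    """Recursively split `points` by the sign of their projection onto u_level."""
    if len(points) <= 1 or level == len(projectors):
        return Node(points=points)
    mask = points @ projectors[level] < 0.0     # zero cut point in the uniform [-1,+1]^d setting
    left, right = points[mask], points[~mask]
    if len(left) == 0 or len(right) == 0:       # degenerate split: stop and make a leaf
        return Node(points=points)
    return Node(left=build_node(left, level + 1, projectors),
                right=build_node(right, level + 1, projectors))

def build_tree(points, rng):
    n, d = points.shape
    levels = min(d, 2 * int(np.ceil(np.log2(n))) + 1)   # roughly 2 log2 n levels suffice in practice
    projectors = random_orthonormal_projectors(levels, d, rng)
    return build_node(points, 0, projectors), projectors
```
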
2.2. Search: The search is parameterized by a query q, a value $R \in (0,1)$
giving the proportional size of the search domain, and a probability $0 < p < 1$ that
is related to the success rate we expect.

The inner product of q and each $u_i$ is first computed. Next the positive threshold
distance $\epsilon = \Phi_{4R^2}^{-1}(p)$ is computed, where $\Phi_{\sigma^2}$ denotes the cumulative
distribution function for a zero-mean normal density with the variance indicated by
its subscript.

Search proceeds recursively from the root. For a node at level i the value
$\langle q, u_i \rangle$ is examined. If it is less than $\epsilon$ then the left child is recursively explored.
If it exceeds $-\epsilon$ then the right child is explored. Notice that when $\langle q, u_i \rangle \in (-\epsilon, +\epsilon)$
both children are explored. When a leaf is encountered, the distance is computed
between the element it contains and q.

This decision rule is centered at zero because of our particular uniform $[-1,+1]^d$
setting, but is easily translated to an arbitrary cut point, e.g. the median of the
projected values.

After each distance computation d(q, x) is performed, the proportion $d(q,x)/2\sqrt{d}$
is computed. If it is smaller than R, then R is reduced and $\epsilon$ is recomputed.

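Continuing the illustrative Python sketch above (and reusing its Node and projectors), the search with aggressive pruning and radius shrinking might look as follows; NormalDist supplies the inverse normal CDF, and $\Phi_{4R^2}^{-1}(p) = 2R\,\Phi^{-1}(p)$ for the standard normal CDF $\Phi$.

```python
from statistics import NormalDist
import numpy as np

def search(root, projectors, q, R, p):
    """Radius-limited nearest neighbor search with aggressive pruning (illustrative sketch)."""
    d = len(q)
    state = {"R": R, "eps": 2.0 * R * NormalDist().inv_cdf(p),   # eps = Phi^{-1}_{4R^2}(p)
             "best": None, "best_dist": float("inf")}
    proj = projectors @ q                        # <q, u_i> for every level, computed up front

    def visit(node, level):
        if node.points is not None:              # leaf: compute true distances
            for x in node.points:
                dist = np.linalg.norm(q - x)
                if dist < state["best_dist"]:
                    state["best"], state["best_dist"] = x, dist
                r = dist / (2.0 * np.sqrt(d))    # proportion d(q, x) / (2 sqrt(d))
                if r < state["R"]:               # shrink R and recompute eps
                    state["R"] = r
                    state["eps"] = 2.0 * r * NormalDist().inv_cdf(p)
            return
        t = proj[level]
        if t < state["eps"]:                     # explore left unless q projects far right
            visit(node.left, level + 1)
        if t > -state["eps"]:                    # explore right unless q projects far left
            visit(node.right, level + 1)

    visit(root, 0)
    return state["best"], state["best_dist"]
```
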
This concludes the description of our search algorithm; we now briefly
discuss related issues and extensions.

An important idea in kd-tree search involves the computation of the minimum
distance from the query to the subspace corresponding to a node to be explored.
If this distance grows beyond the radius of interest, the node is pruned. We do
not, however, include it in our analysis or experimental implementation because in
our high dimensional setting, in the absence of exponentially many data elements,
this idea has vanishingly little effect. Intuitively, this is because the search tree is
not nearly deep enough for the minimum distance to grow larger than the search
radius.

The analogue of this kd-tree idea in our setting is an alternative version of our
algorithm above that slightly reduces $\epsilon$ while descending through interior nodes to
reflect the fact that the distribution of data elements within a ball about the query
is no longer uniform. But, again, this is a second order effect.

Finally we remark that the $\epsilon$-cutoff approach taken above might be replaced
with an entirely probabilistic pruning scheme that passes probabilities from our
analysis down to each child during search. The probabilities along the path from
the root are then multiplied, and search continues downward until the result falls
below a specified threshold.

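One way such a scheme might be realized, sketched under the model of Section 3 in which the sought neighbor's projection at a node is approximately normal with mean $\langle q, u_i\rangle$ and variance $4R^2$; the function and threshold are our illustrative assumptions, not the paper's design.

```python
from statistics import NormalDist

def probabilistic_search(node, level, proj, R, acc, threshold, leaves_out):
    """Descend while the product of per-level 'neighbor lies on this side' probabilities
    stays above `threshold`; collect the leaves reached in `leaves_out`."""
    if node.points is not None:
        leaves_out.extend(node.points)                # leaf: candidates for a distance computation
        return
    p_left = NormalDist(mu=proj[level], sigma=2.0 * R).cdf(0.0)   # mass projecting left of the cut
    for child, p_side in ((node.left, p_left), (node.right, 1.0 - p_left)):
        if acc * p_side >= threshold:
            probabilistic_search(child, level + 1, proj, R,
                                 acc * p_side, threshold, leaves_out)
```
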
3. Analysis

We assume both data points and queries are uniformly distributed within the
hypercube $[-1,+1]^d$. The Euclidean distance metric applies and the maximum
distance between two points in this space is $2\sqrt{d}$. We consider the problem of
finding the nearest neighbor of a query q within some distance $\delta$ that is a constant
proportion of the maximum interpoint distance, i.e. $\delta = 2R\sqrt{d}$ with $0 < R < 1$.

Any unit vector u gives rise to a projection $\pi_u$ mapping $x \in \mathbb{R}^d$ into $\mathbb{R}$, defined
by $\langle x, u \rangle$. It is immediate that distances in the range of this projection are
dominated by distances in the domain. It is this fact that kd-trees exploit to prune
branches during branch-and-bound search.

If r represents the distance to the nearest neighbor encountered so far during
a search, then this fact implies that every member of the ball of radius r centered
at the query q maps under any $\pi_u$ into the interval $[\pi_u(q) - r, \pi_u(q) + r]$. So kd-tree
search may confidently disregard any data points that project outside of this
interval. Unlike the kd-tree, which finds nearest neighbors with certainty, we will
consider randomized constructions that succeed with some specified probability.

Since $\delta$ scales up linearly with $\sqrt{d}$, the interval $[\pi_u(q) - \delta, \pi_u(q) + \delta]$ grows too,
and soon the kd-tree can confidently prune almost nothing, and performs a nearly
exhaustive search.

But in our uniformly random setting we will show that the $\delta$-ball about q
projects to a distribution about $\pi_u(q)$ having constant variance. This is why it is
possible to continue to probabilistically prune just as effectively even as $d \to \infty$.
Since dimension does not affect our ability to prune, we have lifted the curse of
dimensionality in our setting.

We now proceed with the description and idealized analysis of our algorithm
with the following proposition, which establishes two elementary but key dimensional
invariants.

PROPOSITION 3.1. i) Let u be a random unit vector, and let X denote a random
dataset in our setting; then the one-dimensional set of values $\pi_u(X)$ has mean zero
and variance 1/3, independent of dimension, where both u and X are random
variables. ii) Consider any $q \in \mathbb{R}^d$ and let r denote a random vector located on
the surface of a ball centered at q of radius $2R\sqrt{d}$. Then for any unit vector u,
the distribution of values $\pi_u(r)$ has mean $\pi_u(q)$ and variance $4R^2$, independent of
dimension.

PROOF. The variance of each component of u is 1/d since $\langle u, u \rangle = 1$ by
definition and the components are i.i.d. Consider a random element $x \in [-1,+1]^d$.
Here a simple integration establishes that the variance of each component is 1/3. So
the variance of each term of $\langle u, x \rangle$ is $1/(3d)$, and that of the entire inner product
is then 1/3 as required. Now each component of u and x has mean zero so that
$\langle u, x \rangle$ has mean zero, and part i) is established.

$\pi_u(r)$ is centered at $\pi_u(q)$ by linearity of the inner product. So we may assume
without loss of generality that $q = 0$. Now the inner product of two random unit
vectors is easily seen to have variance 1/d. Scaling one of them by a factor of $2R\sqrt{d}$
increases the variance by a factor of $4R^2 d$, so that it becomes $4R^2$ as required by
part ii). $\square$

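Both invariants are easy to check numerically. A quick Monte Carlo sketch (numpy; the sample sizes are arbitrary) should print values near 1/3 and near $4R^2 = 0.04$ for every d:

```python
import numpy as np

rng = np.random.default_rng(0)
R, n = 0.1, 20000
for d in (10, 100, 1000):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                              # random unit vector
    X = rng.uniform(-1.0, 1.0, size=(n, d))             # uniform data in [-1,+1]^d
    r = rng.standard_normal((n, d))
    r *= 2 * R * np.sqrt(d) / np.linalg.norm(r, axis=1, keepdims=True)   # points on the 2R*sqrt(d) sphere
    print(d, np.var(X @ u).round(3), np.var(r @ u).round(3))             # expect about 0.333 and 0.040
```
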
Now fix some unit vector u and consider the query's projection $\pi_u(q)$. Then by
Proposition 3.1, and ignoring hypercube corner effects, the projections of the data
points within the domain of search are distributed about $\pi_u(q)$ with variance no
greater than $4R^2$.

Because distances between projected points are dominated by their original
distances it is clear that $|\pi_u(x) - \pi_u(q)| < 2R\sqrt{d}$ for any x in the search domain.

So any data points with projections farther than $2R\sqrt{d}$ from $\pi_u(q)$ can be
confidently ruled out during search. It is this hard pruning that kd-trees (and
many related structures) exploit to avoid considering every point during nearest
neighbor search.

But notice that this pruning cutoff distance $2R\sqrt{d} \to \infty$ with d while the
projection variance remains constant. As a result hard pruning is asymptotically
ineffective, and this gives rise to the curse of dimensionality for nearest neighbor
search using kd-trees in our setting.

Next observe that as a consequence of the central limit theorem the distributions
of Proposition 3.1 are asymptotically normal, and we remark that as a
practical matter this is a good approximation, even for moderate dimension. So
from this point on in our analysis we will simply assume normality, i.e. that d is
sufficiently large so that the deviation from normality is negligible.

We then arrive at one of the main observations of this paper: that the nearest
neighbor's projection is within constant distance of $\pi_u(q)$ with arbitrarily high and
easily calculated probability p depending only on this constant, and not on d.

Recall that the tree we build recursively bisects the set of data points using a
randomly selected u by separating its projection into left and right branches at a
cut point c near the median. Given a query q we prune one of these branches if
$\pi_u(q)$ is at least some cutoff distance $\epsilon$ from c. We refer to this idea as aggressive
pruning.

PROPOSITION 3.2. Using cutoff distance $\epsilon = \Phi_{4R^2}^{-1}(p)$ to prune branches will
exclude the nearest neighbor (assuming one exists within distance $2R\sqrt{d}$ of the
query) with probability $1 - p$ at each node, where $\Phi_{\sigma^2}(x)$ denotes the cumulative
distribution function for the zero-mean normal density with variance $\sigma^2$.

PROOF. Equating the target failure rate $1 - p$ with the mass of the distribution's
positive tail gives $1 - \Phi_{4R^2}(\epsilon) = 1 - p$, so $\epsilon = \Phi_{4R^2}^{-1}(p)$. $\square$

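Since $\Phi_{4R^2}$ is the distribution function of a zero-mean normal with standard deviation 2R, the cutoff can be obtained from the standard normal inverse CDF. A small sketch (ours, not the paper's code):

```python
from statistics import NormalDist

def cutoff(R, p):
    """epsilon = Phi^{-1}_{4R^2}(p) = 2R * Phi^{-1}(p), the aggressive-pruning threshold."""
    return 2.0 * R * NormalDist().inv_cdf(p)

print(round(cutoff(0.05, 1 - 1e-4), 3))   # about 0.372, used in the R = 0.05 example below
```
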
Again, note that this cutoff distance $\epsilon$ is constant despite the fact that the
absolute size of our search domain is expanding with $\sqrt{d}$ as $d \to \infty$. We can now
establish a single tree's expected search complexity and success probability.

PROPOSITION 3.3. Given n data points, R, p, and $\epsilon$ from Proposition 3.2, and
assuming $d = \Omega(\log n)$, then i) our search will visit $\approx n^\gamma$ leaf nodes, where
$\gamma = \log_2 2\Phi_{1/3}(\epsilon)$, and compute a d-dimensional Euclidean distance at each; and
ii) the search will succeed with probability no smaller than $\approx p^{\log_2 n}$.

PROOF. Recall that we expect that at each node of the tree the projection of
the remaining database elements will have mean/median close to zero and variance
of approximately 1/3. We fix our attention on one such node with corresponding
unit vector u and assume the cut point is exactly zero and the variance is exactly
1/3. Since we have assumed that the queries are also uniformly distributed, their
projection is identical. So we can easily calculate the probability b that a query q
lies within distance $\epsilon$ of zero. In this event both left and right branches must
be explored. Otherwise, with probability $1 - b$ we can prune one so that only one
is explored.

The mass of one tail (beyond $\epsilon$) is $1 - \Phi_{1/3}(\epsilon)$, from which it follows that $1 - b$
is twice this value. So $b = 2\Phi_{1/3}(\epsilon) - 1$. Then the expected number F of branches
visited is $2 \cdot b + 1 \cdot (1 - b) = 2\Phi_{1/3}(\epsilon)$.

Since $d = \Omega(\log n)$ it is possible to choose unit vectors from root to leaf that
are independent (e.g. the canonical basis), random vectors will be nearly so, and
our random orthonormalization process will certainly succeed. So except for corner
effects, which we disregard, it is reasonable to assume that query projections and
failure probabilities are independent along each root-leaf path.

Since the tree's depth is $\log_2 n$ (we'll assume perfect balance) the search will
visit $F^{\log_2 n}$ nodes. That is $2^{\log_2 F \cdot \log_2 n}$, or $n^{\log_2 F}$, establishing i).

Now the nearest neighbor is located somewhere as a leaf of the tree. To succeed,
the search must not prune it at any of the $\log_2 n$ nodes along a path up to the root.
This happens with probability $p^{\log_2 n}$, establishing ii), where independence follows
from the orthogonality of the set of projectors and the assumption of a uniform
distribution.^5 $\square$

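Combining Propositions 3.2 and 3.3, the exponent can be computed directly. The sketch below (illustrative) uses $\Phi_{1/3}(\epsilon) = \Phi(\epsilon\sqrt{3})$ for the standard normal CDF $\Phi$, and reproduces entries of Table 1 below.

```python
from math import log2, sqrt
from statistics import NormalDist

def gamma(R, p):
    """gamma = log2(2 * Phi_{1/3}(epsilon)) with epsilon = Phi^{-1}_{4R^2}(p) = 2R * Phi^{-1}(p)."""
    N = NormalDist()
    eps = 2.0 * R * N.inv_cdf(p)                 # Proposition 3.2
    return log2(2.0 * N.cdf(eps * sqrt(3.0)))    # Proposition 3.3 (projection variance 1/3)

print(round(gamma(0.05, 1 - 1e-4), 3))           # 0.566, matching Table 1
print(round(gamma(0.10, 1 - 1e-2), 3))           # 0.660, matching Table 1
```
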
Notice that the exponent of n in Proposition 3.3 is within (0, 1). Also, the
requirement that $d = \Omega(\log n)$ is not very restrictive since otherwise the number
of data points is exponential in d and earlier techniques can be used to efficiently
locate nearest neighbors.

The probability estimate of Proposition 3.3 is extremely conservative. The
actual failure probabilities are much smaller because: i) our analysis has implicitly
assumed that the query always projects to a point exactly $\epsilon$ from the cut point.
This is the worst case. If the distance is smaller, then nothing is pruned and no
error can be made; and if the distance is larger, the cutoff distance is effectively
increased; and ii) we have assumed that the nearest neighbor is always just within
the domain of search. In reality queries will be distributed within this domain.

Even without considering these factors our calculations suggest that efficient
search is possible for moderately large values of R.

Table 1 gives exponent values for selected values of R and p. For example,
if R = 0.05 and $p = 1 - 10^{-4} = 0.9999$ then the exponent is $\approx 0.566$. Given
$n = 10{,}000{,}000$ data points our results state that no more than $n^{0.566} \approx 9{,}162$
inner product operations will be performed, and that the nearest neighbor will be
located with probability $0.9999^{\log_2 n} \approx 0.9977$. Choosing $p = 1 - 10^{-6}$ corresponds
to 48,194 inner products with the probability of success increasing to 0.999977. All
of these calculations are asymptotically independent of dimension.

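The arithmetic of this example is easy to verify; a sketch (the exponents are the Table 1 entries for R = 0.05 quoted above):

```python
from math import log2

n = 10_000_000
depth = log2(n)                                  # about 23.25 tree levels
for g, p in ((0.566, 1 - 1e-4), (0.669, 1 - 1e-6)):
    print(round(n ** g), round(p ** depth, 6))   # roughly (9162, 0.9977) and (48195, 0.999977)
```
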
We continue this example to illustrate the use of forests to improve performance
in some settings. Later we will give a theoretical consequence of this idea. Consider
two independent trees with $p = 1 - 10^{-4}$; that is, the choice of random unit vectors
is independent. Both trees are searched for each query. We then expect a failure
rate of $\approx (1 - 0.9977)^2 = 0.00000529$, and $2 \cdot 9{,}162 = 18{,}324$ inner products are
performed.^6 Now this failure rate is considerably better than the $1 - 0.999977 = 0.000023$
resulting from a single tree using $p = 1 - 10^{-6}$, and far fewer inner products
are computed, but space consumption is doubled. This example illustrates the
space-time design tradeoff that arises when applying our ideas in practice.

The natural generalization of this idea leads to a forest of independent trees.
Recall that the failure rate of a single tree grows, albeit slowly, with n. By using
enough trees one can force the forest's failure rate to zero as n increases while
preserving sublinear search times, at the expense of polynomial space to pay for
the additional trees.

^5 Negligible dependence results from the hypercube's corners, but for simplicity we ignore this
effect.
^6 We must have d large enough to allow independent unit vectors to be drawn, and will say
more about this later.

                           1 - p
   R     10^-2   10^-4   10^-6   10^-8   10^-10  10^-12
  0.01   0.090   0.141   0.177   0.207   0.232   0.254
  0.02   0.174   0.267   0.331   0.381   0.423   0.458
  0.03   0.252   0.379   0.463   0.526   0.577   0.618
  0.04   0.325   0.479   0.575   0.645   0.698   0.740
  0.05   0.393   0.566   0.669   0.739   0.790   0.829
  0.06   0.456   0.642   0.746   0.813   0.859   0.892
  0.07   0.513   0.707   0.808   0.869   0.908   0.935
  0.08   0.566   0.763   0.858   0.911   0.943   0.963
  0.09   0.615   0.810   0.897   0.941   0.965   0.979
  0.10   0.660   0.850   0.926   0.962   0.980   0.989
  0.11   0.700   0.882   0.949   0.976   0.989   0.995
  0.12   0.737   0.909   0.965   0.986   0.994   0.998
  0.13   0.770   0.931   0.977   0.992   0.997   0.999
  0.14   0.800   0.948   0.985   0.995   0.999   1.000
  0.15   0.826   0.961   0.990   0.997   0.999   1.000
  0.16   0.850   0.971   0.994   0.999   1.000   1.000
  0.17   0.871   0.979   0.996   0.999   1.000   1.000
  0.18   0.890   0.985   0.998   1.000   1.000   1.000
  0.19   0.906   0.990   0.999   1.000   1.000   1.000
  0.20   0.921   0.993   0.999   1.000   1.000   1.000

TABLE 1. Searches in our setting with a single tree require $\approx n^\gamma$
inner product operations and linear space. This table gives $\gamma$ values
for various choices of R and p from our analysis.

To analyze this let x denote the tree's depth and observe that $(1 - p^x)^y$
gives the probability of failure for a forest of y distinct random trees. It
can be shown that if $y = p^{-(1+\varepsilon)x}$ for any $\varepsilon > 0$, then this probability tends to
zero as x (and therefore n) increases^7. Since $x = \log_2 n$, the number of trees is
$y = n^{-(1+\varepsilon)\log_2 p}$ in terms of n. Search time (the number of inner product
operations) also increases by a factor of y. Space increases from linear to $O(yn)$.
Finally, to maintain independence we must increase the lower bound on d to
$O(y \log n)$.

It is apparent from Table 1 that this idea works for many values of R and p. For
example, if R = 0.05 and $p = 1 - 10^{-6}$ then the original exponent 0.566 increases
by only $\approx 1.44 \times 10^{-6}$, far less than the table's precision. But it does not work
for arbitrary R and p. For large values of R the exponent is so close to 1 that the
added complexity makes the total superlinear.

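The quoted increase follows directly from the analysis above: searching $y = n^{-(1+\varepsilon)\log_2 p}$ trees adds $-(1+\varepsilon)\log_2 p$ to the exponent. A one-line check (sketch):

```python
from math import log2

p = 1 - 1e-6
print(-log2(p))   # about 1.44e-06; the added exponent is (1 + epsilon) times this quantity
```
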
Note that O(nd) space is required to store the data elements themselves, but
only O(n) space is needed for the interior tree nodes, which point to data elements.
So, in practice, we can afford O(d) trees while remaining in linear space. In practice
this means that in high dimension one can easily afford many trees without materially
affecting space consumption. In theory it means that if d is assumed to grow
as an appropriate small power of n as $n \to \infty$, then the entire forest occupies linear
space.

^7 However, it may increase somewhat before eventually decreasing.

Returning to our choice of the $[-1,+1]^d$ hypercube, we observe i) that all of
our arguments apply immediately to $[0,1]^d$ after compensating for the shifted mean,
and ii) that our arguments apply as well to the d-dimensional hypersphere. To see
this note that the dot product of a random unit vector and a random database
element has variance 1/d, and the maximum interpoint distance is constant, so the
projected variance of a proportional neighborhood ball also falls as 1/d.

Finally we remark that as an alternative to forests, one can build a single tree
in which each level is overlapped, i.e. items near the center are stored in both the
left and right branches. This effectively reduces $\epsilon$ and with it search time, at the
expense of polynomial space (given a constant fraction of overlap at each level).

4. Experiments

To confirm our analysis we wrote an ANSI-C program to build and search trees.
It accepts d, n, R, and p as arguments, in addition to the number of random queries
to test, and a random seed.

The data points are distributed uniformly in $[-1,+1]^d$. Queries are generated
by choosing a data point x at random, then a random unit vector u, and forming
the sum $x + (1 - e) \cdot 2R\sqrt{d} \cdot u$, so that the query lies just inside of the search
domain. We use $e = 0.0001$. This ensures that a nearest neighbor exists within
the domain.

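A sketch of this query generator (numpy; the function and argument names are ours, not the program's):

```python
import numpy as np

def make_query(points, R, rng, e=1e-4):
    """Step just under 2R*sqrt(d) away from a random data point, in a random direction."""
    d = points.shape[1]
    x = points[rng.integers(len(points))]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                              # random unit vector
    return x + (1.0 - e) * 2.0 * R * np.sqrt(d) * u
```
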
An indexed set of projection vectors is generated by starting with random
vectors and orthonormalizing them in the standard way. Each level in the tree uses
the corresponding vector in this set as its projector. So the size of the set is $O(\log n)$,
and in practice is $\approx 2\log_2 n$ because the resulting trees are not perfectly balanced.
For simplicity zero is always used as the cut point during tree construction.

In order to allow us to simultaneously explore large n and d without exceeding
available RAM, we represent the data points to be searched using a pseudorandom
generator seed that suffices to construct them. As a result we can easily study
$n = 1{,}000{,}000$ and higher in almost arbitrarily high dimension.

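One way to realize this (a sketch; not the paper's implementation) is to key a per-point pseudorandom stream on the master seed and the point's index, regenerating coordinates only when a distance must be computed:

```python
import numpy as np

def point_from_seed(master_seed, index, d):
    """Regenerate data point `index` of the uniform [-1,+1]^d dataset on demand."""
    rng = np.random.default_rng([master_seed, index])   # per-point stream keyed by (seed, index)
    return rng.uniform(-1.0, 1.0, size=d)

# A leaf then needs to store only `index`; a distance computation becomes
# np.linalg.norm(q - point_from_seed(master_seed, index, len(q))).
```
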
Figure 1 illustrates that search complexity, measured in the number of inner
product operations performed, is very nearly dimension-invariant, confirming our
analysis. So total work scales up only linearly with dimension, i.e. with the
complexity of a single inner product operation.

Note that for simplicity we regard an inner product operation and a Euclidean
distance computation as having identical complexity. Also, the table's results
exclude the roughly $2\log_2 n$ inner products required to compute the query's
projection at each level of the tree.

Recall that our analysis of failure rate was quite conservative, and experiments
confirm this. Actual failure rates for every case in Figure 1 were somewhat lower
than those from analysis. For example, in the R = 0.20 case, analysis predicts that
the search will succeed with probability $\approx 0.85$, and the actual value is $\approx 0.97$.
Note that one reason for using the moderate value p = 0.99 in our study is that we
have confirmed that values of $1 - p$ far closer to zero result in no observable
experimental error given the limited number of queries we have time to perform.

We compare performance in our setting with kd-tree search [FBS75, FBF77,
BF79, Ben80]. We are interested in neighbors within a specified distance of the
query. To reproduce this setting, kd-tree search is initialized as though an element
at the specified distance has already been encountered.

[Figure 1: number of leaves visited (log scale) versus dimension; one essentially
flat curve for each of R = 0.01, 0.05, 0.10, and 0.15.]

FIGURE 1. Proportional radius search complexity is constant with
respect to dimension, confirming our analysis. We measure the
number of leaves visited. Each visit corresponds to a Euclidean
distance computation. In this experiment the number of uniformly
distributed data points is n = 100,000, and the p from our analysis
is 0.99. Each query is generated so as to be slightly less than
distance $2R\sqrt{d}$ from some database element, so that our radius-limited
search always succeeds. The results of 1,000 such random
queries are averaged to produce each data point. Our analysis
predicts constant values of approximately 3, 92, 1987, 13,553, and
40,114 corresponding to R values of 0.01, 0.05, 0.10, 0.15, and 0.20
respectively. The experimental values compare well with these
estimates, and in general are dominated by them by a factor of
approximately two.

A recent kd-tree implementation [MA99] is modified for our experiments. Its
core routines are altered to support radius-limited search as described above, and
the distribution's sample program is adapted to report the average number of distance
computations performed in response to each query. As in our earlier experiments,
queries are generated by starting at a database element, and moving the
specifi