`
Aristides Gionis∗        Piotr Indyk†        Rajeev Motwani‡
Department of Computer Science
Stanford University
Stanford, CA
{gionis,indyk,rajeev}@cs.stanford.edu
`
`Abstract
`
The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the "curse of dimensionality." That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. We provide experimental evidence that our method gives significant improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition. Experimental results also indicate that our scheme scales well even for a relatively large number of dimensions (more than 50).
`
∗Supported by NAVY N00014-96-1-1221 grant and NSF Grant IIS-9811904.
†Supported by Stanford Graduate Fellowship and NSF NYI Award CCR-9357849.
‡Supported by ARO MURI Grant DAAH04-96-1-0007, NSF Grant IIS-9811904, and NSF Young Investigator Award CCR-9357849, with matching funds from IBM, Mitsubishi, Schlumberger Foundation, Shell Foundation, and Xerox Corporation.
`
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.
`
`
1 Introduction
`
A similarity search problem involves a collection of objects (e.g., documents, images) that are characterized by a collection of relevant features and represented as points in a high-dimensional attribute space; given queries in the form of points in this space, we are required to find the nearest (most similar) object to the query. The particularly interesting and well-studied case is the $d$-dimensional Euclidean space. The problem is of major importance to a variety of applications; some examples are: data compression […]; databases and data mining […]; information retrieval […]; image and video databases […]; machine learning […]; pattern recognition […]; and, statistics and data analysis […]. Typically, the features of the objects of interest are represented as points in $\mathbb{R}^d$ and a distance metric is used to measure similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands. For example, in multimedia applications such as IBM's QBIC (Query by Image Content), the number of features could be several hundreds […]. In information retrieval for text documents, vector-space representations involve several thousands of dimensions, and it is considered to be a dramatic improvement that dimension-reduction techniques, such as the Karhunen-Loève transform […] (also known as principal components analysis […] or latent semantic indexing […]), can reduce the dimensionality to a mere few hundreds!
`
`
`
`
The low-dimensional case (say, for $d$ equal to 2 or 3) is well-solved […], so the main issue is that of dealing with a large number of dimensions, the so-called "curse of dimensionality." Despite decades of intensive effort, the current solutions are not entirely satisfactory; in fact, for large enough $d$, in theory or in practice, they provide little improvement over a linear algorithm which compares a query to each point from the database. In particular, it was shown in […] that, both empirically and theoretically, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions. This situation poses a serious obstacle to the future development of large scale similarity search systems. Imagine for example a search engine which enables content-based image retrieval on the World-Wide Web. If the system was to index a significant fraction of the web, the number of images to index would be at least of the order of tens (if not hundreds) of millions. Clearly, no indexing method exhibiting linear (or close to linear) dependence on the data size could manage such a huge data set.
`
The premise of this paper is that in many cases it is not necessary to insist on the exact answer; instead, determining an approximate answer should suffice (refer to Section 2 for a formal definition). This observation underlies a large body of recent research in databases, including using random sampling for histogram estimation […] and median approximation […], using wavelets for selectivity estimation […] and approximate SVD […]. We observe that there are many applications of nearest neighbor search where an approximate answer is good enough. For example, it often happens (e.g., see […]) that the relevant answers are much closer to the query point than the irrelevant ones; in fact, this is a desirable property of a good similarity measure. In such cases, the approximate algorithm (with a suitable approximation factor) will return the same result as an exact algorithm. In other situations, an approximate algorithm provides the user with a time-quality tradeoff: the user can decide whether to spend more time waiting for the exact answer, or to be satisfied with a much quicker approximation (e.g., see […]).
`
The above arguments rely on the assumption that approximate similarity search can be performed much faster than the exact one. In this paper we show that this is indeed the case. Specifically, we introduce a new indexing method for approximate nearest neighbor with a truly sublinear dependence on the data size even for high-dimensional data. Instead of using space partitioning, it relies on a new method called locality-sensitive hashing (LSH). The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point. We provide such locality-sensitive hash functions that are simple and easy to implement; they can also be naturally extended to the dynamic setting, i.e., when insertion and deletion operations also need to be supported. Although in this paper we are focused on Euclidean spaces, different LSH functions can also be used for other similarity measures, such as dot product […].

Locality-Sensitive Hashing was introduced by Indyk and Motwani […] for the purposes of devising main memory algorithms for nearest neighbor search; in particular, it enabled us to achieve worst-case $O(dn^{1/\epsilon})$ time for approximate nearest neighbor query over an $n$-point database. In this paper we improve that technique and achieve a significantly improved query time of $O(dn^{1/(1+\epsilon)})$. This yields an approximate nearest neighbor algorithm running in sublinear time for any $\epsilon > 0$. Furthermore, we generalize the algorithm and its analysis to the case of external memory.
We support our theoretical arguments by empirical evidence. We performed experiments on two data sets. The first contains 20,000 histograms of color images, where each histogram was represented as a point in $d$-dimensional space, for $d$ up to 64. The second contains around 270,000 points representing texture information of blocks of large aerial photographs. All our tables were stored on disk. We compared the performance of our algorithm with the performance of the Sphere/Rectangle-tree (SR-tree) […], a recent data structure which was shown to be comparable to or significantly more efficient than other tree-decomposition-based indexing methods for spatial data. The experiments show that our algorithm is significantly faster than the earlier methods, in some cases even by several orders of magnitude. It also scales well as the data size and dimensionality increase. Thus, it enables a new approach to high-performance similarity search: fast retrieval of an approximate answer, possibly followed by a slower but more accurate computation in the few cases where the user is not satisfied with the approximate answer.

The rest of this paper is organized as follows. In Section 2 we introduce the notation and give formal definitions of the similarity search problems. Then in Section 3 we describe locality-sensitive hashing and show how to apply it to nearest neighbor search. In Section 4 we report the results of experiments with LSH. The related work is described in Section 5. Finally, in Section 6 we present conclusions and ideas for future research.
`
2 Preliminaries
We use $l_p^d$ to denote the Euclidean space $\mathbb{R}^d$ under the $l_p$ norm, i.e., when the length of a vector $(x_1, \dots, x_d)$ is defined as $(|x_1|^p + \dots + |x_d|^p)^{1/p}$. Further, $d_p(p, q) = \|p - q\|_p$ denotes the distance between the points $p$ and $q$ in $l_p^d$. We use $H^d$ to denote the Hamming metric space of dimension $d$, i.e., the space of binary vectors of length $d$ under the standard Hamming metric. We use $d_H(p, q)$ to denote the Hamming distance, i.e., the number of bits on which $p$ and $q$ differ.
The nearest neighbor search problem is defined as follows:

Definition 1 (Nearest Neighbor Search (NNS)) Given a set $P$ of $n$ objects represented as points in a normed space $l_p^d$, preprocess $P$ so as to efficiently answer queries by finding the point in $P$ closest to a query point $q$.
`
The definition generalizes naturally to the case where we want to return $K > 1$ points. Specifically, in the $K$-Nearest Neighbors Search ($K$-NNS), we wish to return the $K$ points in the database that are closest to the query point. The approximate version of the NNS problem is defined as follows:
`
Definition 2 ($\epsilon$-Nearest Neighbor Search ($\epsilon$-NNS)) Given a set $P$ of points in a normed space $l_p^d$, preprocess $P$ so as to efficiently return a point $p \in P$ for any given query point $q$, such that $d(q, p) \le (1 + \epsilon)\,d(q, P)$, where $d(q, P)$ is the distance of $q$ to its closest point in $P$.
`
Again, this definition generalizes naturally to finding $K > 1$ approximate nearest neighbors. In the Approximate $K$-NNS problem, we wish to find $K$ points $p_1, \dots, p_K$ such that the distance of $p_i$ to the query $q$ is at most $(1 + \epsilon)$ times the distance from the $i$th nearest point to $q$.
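These definitions translate directly into a brute-force reference implementation, useful as a correctness oracle when testing any approximate index. The sketch below is ours, not the paper's; it assumes numpy and hypothetical helper names:

```python
import numpy as np

def lp_dist(p, q, ord=1):
    """d_p(p, q) = ||p - q||_p."""
    return np.linalg.norm(np.asarray(p, float) - np.asarray(q, float), ord=ord)

def exact_nn(P, q, ord=1):
    """Brute-force NNS: the linear scan that indexing tries to beat."""
    dists = [lp_dist(p, q, ord) for p in P]
    i = int(np.argmin(dists))
    return i, dists[i]

def is_eps_nn(P, q, candidate, eps, ord=1):
    """Definition 2: accept candidate p iff d(q, p) <= (1 + eps) * d(q, P)."""
    _, d_star = exact_nn(P, q, ord)
    return lp_dist(candidate, q, ord) <= (1 + eps) * d_star
```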
`
3 The Algorithm
`
In this section we present efficient solutions to the approximate versions of the NNS problem. Without significant loss of generality, we will make the following two assumptions about the data:

1. the distance is defined by the $l_1$ norm (see comments below);

2. all coordinates of points in $P$ are positive integers.
`
The first assumption is not very restrictive, as usually there is no clear advantage in, or even difference between, using the $l_1$ or $l_2$ norm for similarity search. For example, the experiments done for the Webseek project […] (see […]) show that comparing color histograms using the $l_1$ and $l_2$ norms yields very similar results ($l_1$ is marginally better). Both our data sets (see Section 4) have a similar property. Specifically, we observed that a nearest neighbor of an average query point computed under the $l_1$ norm was also an $\epsilon$-approximate neighbor under the $l_2$ norm with a small average value of $\epsilon$ (this observation holds for both data sets). Moreover, in most cases the nearest neighbors under the $l_1$ and $l_2$ norms were exactly the same. This observation is interesting in its own right, and can be partially explained via the theorem by Figiel et al. (see […] and references therein). They showed analytically that by simply applying scaling and random rotation to the space $l_2$, we can make the distances induced by the $l_1$ and $l_2$ norms almost equal up to an arbitrarily small factor. It seems plausible that real data is already randomly rotated, thus the difference between the $l_1$ and $l_2$ norms is very small. Moreover, for the data sets for which this property does not hold, we are guaranteed that after performing scaling and random rotation our algorithms can be used for the $l_2$ norm with arbitrarily small loss of precision.
As far as the second assumption is concerned, clearly all coordinates can be made positive by properly translating the origin of $\mathbb{R}^d$. We can then convert all coordinates to integers by multiplying them by a suitably large number and rounding to the nearest integer. It can be easily verified that by choosing proper parameters, the error induced by rounding can be made arbitrarily small. Notice that after this operation the minimum interpoint distance is 1.
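A minimal sketch of this normalization, assuming numpy; the scale factor is an illustrative knob, not a value prescribed by the paper:

```python
import numpy as np

def to_positive_integer_grid(X, scale=1000):
    """Translate the origin so all coordinates are positive, then round to
    integers. Larger `scale` makes the rounding error smaller relative to
    interpoint distances."""
    X = np.asarray(X, dtype=float)
    shifted = X - X.min(axis=0)                            # all coordinates >= 0
    return np.rint(shifted * scale).astype(np.int64) + 1   # positive integers
```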
`
3.1 Locality-Sensitive Hashing
`
In this section we present locality-sensitive hashing (LSH). This technique was originally introduced by Indyk and Motwani […] for the purposes of devising main memory algorithms for the $\epsilon$-NNS problem. Here we give an improved version of their algorithm. The new algorithm is in many respects more natural than the earlier one: it does not require the hash buckets to store only one point; it has better running time guarantees; and, the analysis is generalized to the case of secondary memory.

Let $C$ be the largest coordinate in all points in $P$. Then, as per […], we can embed $P$ into the Hamming cube $H^{d'}$ with $d' = C \cdot d$, by transforming each point $p = (x_1, \dots, x_d)$ into a binary vector

$$v(p) = \mathrm{Unary}_C(x_1) \cdots \mathrm{Unary}_C(x_d),$$

where $\mathrm{Unary}_C(x)$ denotes the unary representation of $x$, i.e., a sequence of $x$ ones followed by $C - x$ zeroes.
Fact 1 For any pair of points $p, q$ with coordinates in the set $\{1, \dots, C\}$,

$$d_1(p, q) = d_H(v(p), v(q)).$$

That is, the embedding preserves the distances between the points. Therefore, in the sequel we can concentrate on solving $\epsilon$-NNS in the Hamming space $H^{d'}$. However, we emphasize that we do not need to actually convert the data to the unary representation, which could be expensive when $C$ is large; in fact, all our algorithms can be made to run in time independent of $C$. Rather, the unary representation provides us with a convenient framework for description of the algorithms, which would be more complicated otherwise.
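For illustration only (as just noted, the conversion is never materialized in practice), Fact 1 can be checked directly on toy data; the helper name below is ours:

```python
import numpy as np

def unary_embed(p, C):
    """v(p): each coordinate x in {1,...,C} becomes x ones then C - x zeroes."""
    bits = []
    for x in p:
        bits.extend([1] * x + [0] * (C - x))
    return np.array(bits, dtype=np.uint8)

# Fact 1 on a toy pair: d_1(p, q) == d_H(v(p), v(q)).
p, q, C = [3, 1, 4], [1, 2, 2], 5
vp, vq = unary_embed(p, C), unary_embed(q, C)
assert int(np.sum(vp != vq)) == sum(abs(a - b) for a, b in zip(p, q))
```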
`
`
`
Algorithm Preprocessing
Input: A set of points $P$, $l$ (number of hash tables)
Output: Hash tables $T_i$, $i = 1, \dots, l$
Foreach $i = 1, \dots, l$
    Initialize hash table $T_i$ by generating a random hash function $g_i(\cdot)$
Foreach $i = 1, \dots, l$
    Foreach $j = 1, \dots, n$
        Store point $p_j$ on bucket $g_i(p_j)$ of hash table $T_i$

Figure 1: Preprocessing algorithm for points already embedded in the Hamming cube.
`
Algorithm Approximate Nearest Neighbor Query
Input: A query point $q$, $K$ (number of appr. nearest neighbors)
Access: To hash tables $T_i$, $i = 1, \dots, l$, generated by the preprocessing algorithm
Output: $K$ (or less) appr. nearest neighbors
$S \leftarrow \emptyset$
Foreach $i = 1, \dots, l$
    $S \leftarrow S \cup \{$points found in $g_i(q)$ bucket of table $T_i\}$
Return the $K$ nearest neighbors of $q$ found in set $S$
/* Can be found by main memory linear search */

Figure 2: Approximate Nearest Neighbor query answering algorithm.
`
We define the hash functions as follows. For an integer $l$ to be specified later, choose $l$ subsets $I_1, \dots, I_l$ of $\{1, \dots, d'\}$. Let $p_{|I}$ denote the projection of vector $p$ on the coordinate set $I$, i.e., we compute $p_{|I}$ by selecting the coordinate positions as per $I$ and concatenating the bits in those positions. Denote $g_j(p) = p_{|I_j}$. For the preprocessing, we store each $p \in P$ in the bucket $g_j(p)$, for $j = 1, \dots, l$. As the total number of buckets may be large, we compress the buckets by resorting to standard hashing. Thus, we use two levels of hashing: the LSH function maps a point $p$ to bucket $g_j(p)$, and a standard hash function maps the contents of these buckets into a hash table of size $M$. The maximal bucket size of the latter hash table is denoted by $B$. For the algorithm's analysis, we will assume hashing with chaining, i.e., when the number of points in a bucket exceeds $B$, a new bucket (also of size $B$) is allocated and linked to and from the old bucket. However, our implementation does not employ chaining, but relies on a simpler approach: if a bucket in a given index is full, a new point cannot be added to it, since it will be added to some other index with high probability. This saves us the overhead of maintaining the link structure.

The number $n$ of points, the size $M$ of the hash table, and the maximum bucket size $B$ are related by the following equation:

$$M = \alpha \cdot \frac{n}{B},$$

where $\alpha$ is the memory utilization parameter, i.e., the ratio of the memory allocated for the index to the size of the data set.

To process a query $q$, we search all indices $g_1(q), \dots, g_l(q)$ until we either encounter at least $c \cdot l$ points (for $c$ specified later) or use all $l$ indices. Clearly, the number of disk accesses is always upper bounded by the number of indices, which is equal to $l$. Let $p_1, \dots, p_t$ be the points encountered in the process. For Approximate $K$-NNS, we output the $K$ points $p_i$ closest to $q$; in general, we may return fewer points if the number of points encountered is less than $K$.

It remains to specify the choice of the subsets $I_j$. For each $j \in \{1, \dots, l\}$, the set $I_j$ consists of $k$ elements from $\{1, \dots, d'\}$ sampled uniformly at random with replacement. The optimal value of $k$ is chosen to maximize the probability that a point $p$ "close" to $q$ will fall into the same bucket as $q$, and also to minimize the probability that a point $p$ "far away" from $q$ will fall into the same bucket. The choice of the values of $l$ and $k$ is deferred to the next section.
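The following sketch puts the pieces together in code. It is an in-memory illustration under our own naming (BitSamplingLSH, seed, and the default $c = 4$ taken from the analysis below), not the paper's disk-based implementation: each sampled position of the implicit unary embedding is stored as a pair (coordinate, threshold) whose bit equals 1 iff point[coordinate] >= threshold, so $v(p)$ is never built, and $M$ would be sized per $M = \alpha n / B$:

```python
import random
import numpy as np

class BitSamplingLSH:
    """Sketch of the two-level scheme; an in-memory stand-in for disk buckets."""

    def __init__(self, d, C, k, l, M, B, seed=0):
        rng = random.Random(seed)
        self.l, self.M, self.B = l, M, B
        # I_j: k positions of the implicit unary embedding, sampled uniformly
        # with replacement; a position is (coordinate, threshold in 1..C).
        self.I = [[(rng.randrange(d), rng.randrange(1, C + 1))
                   for _ in range(k)] for _ in range(l)]
        self.tables = [{} for _ in range(l)]

    def _g(self, j, p):
        # g_j(p) = p|I_j; the bit for (i, t) is 1 iff p[i] >= t (Fact 1),
        # then a standard hash maps the projection into a table of size M.
        key = tuple(int(p[i] >= t) for i, t in self.I[j])
        return hash(key) % self.M

    def insert(self, idx, p):
        for j in range(self.l):
            bucket = self.tables[j].setdefault(self._g(j, p), [])
            if len(bucket) < self.B:      # full bucket: drop point, no chaining
                bucket.append(idx)

    def query(self, q, points, K, c=4):
        found = []
        for j in range(self.l):
            found.extend(self.tables[j].get(self._g(j, q), []))
            if len(found) >= c * self.l:  # stop once c*l candidates are seen
                break
        ranked = sorted(set(found),
                        key=lambda i: int(np.abs(points[i] - q).sum()))
        return ranked[:K]

# Usage on toy data: n points with integer coordinates in {1, ..., C}.
rng = np.random.default_rng(1)
pts = rng.integers(1, 33, size=(1000, 16))        # n = 1000, d = 16, C = 32
index = BitSamplingLSH(d=16, C=32, k=64, l=20, M=500, B=10)
for i, p in enumerate(pts):
    index.insert(i, p)
print(index.query(pts[0], pts, K=5))
```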
`
`
`
`
Although we are mainly interested in the I/O complexity of our scheme, it is worth pointing out that the hash functions can be efficiently computed if the data set is obtained by mapping $l_1^d$ into $d'$-dimensional Hamming space. Let $p$ be any point from the data set and let $p'$ denote its image after the mapping. Let $I$ be the set of coordinates and recall that we need to compute $p'_{|I}$. For $i = 1, \dots, d$, let $I_{|i}$ denote, in sorted order, the coordinates in $I$ which correspond to the $i$th coordinate of $p$. Observe that projecting $p'$ on $I_{|i}$ results in a sequence of bits which is monotone, i.e., consists of a number, say $o_i$, of ones followed by zeros. Therefore, in order to represent $p'_{|I}$ it is sufficient to compute $o_i$ for $i = 1, \dots, d$. However, the latter task is equivalent to finding the number of elements in the sorted array $I_{|i}$ which are smaller than a given value, i.e., the $i$th coordinate of $p$. This can be done via binary search in $\log C$ time, or even in constant time using a precomputed array of $C$ bits. Thus, the total time needed to compute the function is either $O(d \log C)$ or $O(d)$, depending on resources used. In our experimental section, the value of $C$ can be made very small, and therefore we resort to the second method.

For quick reference we summarize the preprocessing and query answering algorithms in Figures 1 and 2.
`
3.2 Analysis of Locality-Sensitive Hashing
`
The principle behind our method is that the probability of collision of two points $p$ and $q$ is closely related to the distance between them. Specifically, the larger the distance, the smaller the collision probability. This intuition is formalized as follows […]. Let $D(\cdot, \cdot)$ be a distance function of elements from a set $S$, and for any $p \in S$ let $B(p, r)$ denote the set of elements from $S$ within the distance $r$ from $p$.
`
Definition 3 A family $H$ of functions from $S$ to $U$ is called $(r_1, r_2, p_1, p_2)$-sensitive for $D(\cdot, \cdot)$ if for any $q, p \in S$

• if $p \in B(q, r_1)$ then $\Pr_H[h(q) = h(p)] \ge p_1$,

• if $p \notin B(q, r_2)$ then $\Pr_H[h(q) = h(p)] \le p_2$.

In the above definition, probabilities are considered with respect to the random choice of a function $h$ from the family $H$. In order for a locality-sensitive family to be useful, it has to satisfy the inequalities $p_1 > p_2$ and $r_1 < r_2$.

Observe that if $D(\cdot, \cdot)$ is the Hamming distance $d_H(\cdot, \cdot)$, then the family of projections on one coordinate is locality-sensitive. More specifically:

Fact 2 Let $S$ be $H^{d'}$ (the $d'$-dimensional Hamming cube) and $D(p, q) = d_H(p, q)$ for $p, q \in H^{d'}$. Then for any $r$ and $\epsilon > 0$, the family $H_{d'} = \{ h_i : h_i((b_1, \dots, b_{d'})) = b_i,\ i = 1, \dots, d' \}$ is $\big(r,\ r(1 + \epsilon),\ 1 - \tfrac{r}{d'},\ 1 - \tfrac{r(1 + \epsilon)}{d'}\big)$-sensitive.
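As a quick empirical check of Fact 2 (our toy sizes, not from the paper): for a uniformly random coordinate projection $h_i$, the collision probability is exactly $1 - d_H(p, q)/d'$:

```python
import numpy as np

rng = np.random.default_rng(0)

d_prime = 1000
p = rng.integers(0, 2, d_prime)
q = p.copy()
flip = rng.choice(d_prime, size=100, replace=False)  # force d_H(p, q) = 100
q[flip] ^= 1

i = rng.integers(0, d_prime, size=200_000)           # sample h_i at random
print(np.mean(p[i] == q[i]))   # ~0.9 = 1 - 100/1000
```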
We now generalize the algorithm from the previous section to an arbitrary locality-sensitive family $H$. Thus, the algorithm is equally applicable to other locality-sensitive hash functions (e.g., see […]). The generalization is simple: the functions $g$ are now defined to be of the form

$$g_i(p) = (h_{i_1}(p), h_{i_2}(p), \dots, h_{i_k}(p)),$$

where the functions $h_{i_1}, \dots, h_{i_k}$ are randomly chosen from $H$ with replacement. As before, we choose $l$ such functions $g_1, \dots, g_l$. In the case when the family $H_{d'}$ is used, i.e., each function selects one bit of an argument, the resulting values of $g_j(p)$ are essentially equivalent to $p_{|I_j}$.

We now show that the LSH algorithm can be used to solve what we call the $(r, \epsilon)$-Neighbor problem: determine whether there exists a point $p$ within a fixed distance $r_1 = r$ of $q$, or whether all points in the database are at least a distance $r_2 = r(1 + \epsilon)$ away from $q$; in the first case, the algorithm is required to return a point $p'$ within distance at most $(1 + \epsilon) r$ from $q$. In particular, we argue that the LSH algorithm solves this problem for a proper choice of $k$ and $l$, depending on $r$ and $\epsilon$. Then we show how to apply the solution to this problem to solve $\epsilon$-NNS.

Denote by $P'$ the set of all points $p' \in P$ such that $d(q, p') > r_2$. We observe that the algorithm correctly solves the $(r, \epsilon)$-Neighbor problem if the following two properties hold:

P1 If there exists $p^*$ such that $p^* \in B(q, r_1)$, then $g_j(p^*) = g_j(q)$ for some $j = 1, \dots, l$.

P2 The total number of blocks pointed to by $q$ and containing only points from $P'$ is less than $cl$.
`
Assume that $H$ is a $(r_1, r_2, p_1, p_2)$-sensitive family; define $\rho = \frac{\ln 1/p_1}{\ln 1/p_2}$. The correctness of the LSH algorithm follows from the following theorem.
`
Theorem 1 Setting $k = \log_{1/p_2}(n/B)$ and $l = (n/B)^{\rho}$ guarantees that properties P1 and P2 hold with probability at least $\frac{1}{2} - \frac{1}{e} \ge 0.132$.

Remark 1 Note that by repeating the LSH algorithm $O(1/\delta)$ times, we can amplify the probability of success in at least one trial to $1 - \delta$, for any $\delta > 0$.
Proof: Let property P1 hold with probability $P_1$, and property P2 hold with probability $P_2$. We will show that both $P_1$ and $P_2$ are large. Assume that there exists a point $p^*$ within distance $r_1$ of $q$; the proof is quite similar otherwise. Set $k = \log_{1/p_2}(n/B)$. The probability that $g(p') = g(q)$ for $p' \in P - B(q, r_2)$ is at most $p_2^k = \frac{B}{n}$. Denote the set of all points $p' \notin B(q, r_2)$ by $P'$. The expected number of blocks allocated for $g_j$ which contain exclusively points from $P'$ does not exceed $2$. The expected number of such blocks allocated for all $g_j$ is at most $2l$. Thus, by the Markov inequality […], the probability that this number exceeds $4l$ is less than $1/2$. If we choose $c = 4$, the probability that property P2 holds is $P_2 \ge 1/2$.

Consider now the probability of $g_j(p^*) = g_j(q)$. Clearly, it is bounded from below by
`
$$p_1^k = p_1^{\log_{1/p_2}(n/B)} = (n/B)^{-\frac{\log 1/p_1}{\log 1/p_2}} = (n/B)^{-\rho}.$$
`
`
By setting $l = (n/B)^{\rho}$, we bound from above the probability that $g_j(p^*) \ne g_j(q)$ for all $j = 1, \dots, l$ by $\big(1 - (n/B)^{-\rho}\big)^{(n/B)^{\rho}} \le 1/e$. Thus the probability that one such $g_j$ exists is at least $P_1 \ge 1 - 1/e$.

Therefore, the probability that both properties P1 and P2 hold is at least $1 - [(1 - P_1) + (1 - P_2)] = P_1 + P_2 - 1 \ge \frac{1}{2} - \frac{1}{e}$. The theorem follows.
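To make the parameter choice concrete, here is a small calculator for $\rho$, $k$, and $l$ per Theorem 1; the numbers plugged in below are illustrative assumptions, not values from the paper's experiments:

```python
import math

def lsh_parameters(n, B, p1, p2):
    """Theorem 1: k = log_{1/p2}(n/B), l = (n/B)^rho, rho = ln(1/p1)/ln(1/p2)."""
    rho = math.log(1 / p1) / math.log(1 / p2)
    k = math.log(n / B) / math.log(1 / p2)     # log base 1/p2 of n/B
    l = (n / B) ** rho
    return math.ceil(k), math.ceil(l), rho

# Example: n = 100000 points, bucket size B = 100; for the Hamming family of
# Fact 2 with r/d' = 0.1 and eps = 1, we get p1 = 0.9 and p2 = 0.8.
print(lsh_parameters(100_000, 100, 0.9, 0.8))  # -> (31, 27, 0.472...)
```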
In the following we consider the LSH family for the Hamming