Similarity Search in High Dimensions via Hashing
Aristides Gionis∗    Piotr Indyk†    Rajeev Motwani‡
Department of Computer Science
Stanford University
Stanford, CA 94305
{gionis,indyk,rajeev}@cs.stanford.edu
Abstract

The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the "curse of dimensionality." That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. We provide experimental evidence that our method gives significant improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition. Experimental results also indicate that our scheme scales well even for a relatively large number of dimensions (more than 50).

∗ Supported by NAVY grant N… and NSF Grant IIS-….
† Supported by a Stanford Graduate Fellowship and NSF NYI Award CCR-….
‡ Supported by ARO MURI Grant DAAH…, NSF Grant IIS-…, and NSF Young Investigator Award CCR-…, with matching funds from IBM, Mitsubishi, Schlumberger Foundation, Shell Foundation, and Xerox Corporation.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.
1 Introduction
A similarity search problem involves a collection of objects (e.g., documents, images) that are characterized by a collection of relevant features and represented as points in a high-dimensional attribute space; given queries in the form of points in this space, we are required to find the nearest (most similar) object to the query. The particularly interesting and well-studied case is the d-dimensional Euclidean space. The problem is of major importance to a variety of applications; some examples are: data compression […]; databases and data mining […]; information retrieval […]; image and video databases […]; machine learning […]; pattern recognition […]; and statistics and data analysis […]. Typically, the features of the objects of interest are represented as points in R^d and a distance metric is used to measure similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands. For example, in multimedia applications such as IBM's QBIC (Query by Image Content), the number of features could be several hundreds […]. In information retrieval for text documents, vector-space representations involve several thousands of dimensions, and it is considered to be a dramatic improvement that dimension-reduction techniques, such as the Karhunen-Loève transform […] (also known as principal components analysis […] or latent semantic indexing […]), can reduce the dimensionality to a mere few hundreds!

The low-dimensional case (say, for d equal to 2 or 3) is well-solved […], so the main issue is that of dealing with a large number of dimensions, the so-called "curse of dimensionality." Despite decades of intensive effort, the current solutions are not entirely satisfactory; in fact, for large enough d, in theory or in practice, they provide little improvement over a linear algorithm which compares a query to each point from the database. In particular, it was shown in […] that, both empirically and theoretically, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions. This situation poses a serious obstacle to the future development of large-scale similarity search systems. Imagine for example a search engine which enables content-based image retrieval on the World-Wide Web. If the system were to index a significant fraction of the web, the number of images to index would be at least of the order of tens (if not hundreds) of millions. Clearly, no indexing method exhibiting linear (or close to linear) dependence on the data size could manage such a huge data set.
The premise of this paper is that in many cases it is not necessary to insist on the exact answer; instead, determining an approximate answer should suffice (refer to Section 2 for a formal definition). This observation underlies a large body of recent research in databases, including using random sampling for histogram estimation […] and median approximation […], and using wavelets for selectivity estimation […] and approximate SVD […]. We observe that there are many applications of nearest neighbor search where an approximate answer is good enough. For example, it often happens (e.g., see […]) that the relevant answers are much closer to the query point than the irrelevant ones; in fact, this is a desirable property of a good similarity measure. In such cases, the approximate algorithm (with a suitable approximation factor) will return the same result as an exact algorithm. In other situations, an approximate algorithm provides the user with a time-quality tradeoff: the user can decide whether to spend more time waiting for the exact answer, or to be satisfied with a much quicker approximation (e.g., see […]).
The above arguments rely on the assumption that approximate similarity search can be performed much faster than the exact one. In this paper we show that this is indeed the case. Specifically, we introduce a new indexing method for approximate nearest neighbor with a truly sublinear dependence on the data size even for high-dimensional data. Instead of using space partitioning, it relies on a new method called locality-sensitive hashing (LSH). The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point. We provide such locality-sensitive hash functions that are simple and easy to implement; they can also be naturally extended to the dynamic setting, i.e., when insertion and deletion operations also need to be supported. Although in this paper we are focused on Euclidean spaces, different LSH functions can also be used for other similarity measures, such as dot product […].

Locality-Sensitive Hashing was introduced by Indyk and Motwani […] for the purposes of devising main memory algorithms for nearest neighbor search; in particular, it enabled us to achieve worst-case sublinear time for approximate nearest neighbor queries over an n-point database. In this paper we improve that technique and achieve a significantly improved query time of O(dn^{1/(1+ε)}). This yields an approximate nearest neighbor algorithm running in sublinear time for any ε > 0. Furthermore, we generalize the algorithm and its analysis to the case of external memory.
We support our theoretical arguments by empirical evidence. We performed experiments on two data sets. The first contains 20,000 histograms of color images, where each histogram was represented as a point in d-dimensional space, for d up to 64. The second contains around 270,000 points representing texture information of blocks of large aerial photographs. All our tables were stored on disk. We compared the performance of our algorithm with that of the Sphere/Rectangle-tree (SR-tree) […], a recent data structure which was shown to be comparable to or significantly more efficient than other tree-decomposition-based indexing methods for spatial data. The experiments show that our algorithm is significantly faster than the earlier methods, in some cases even by several orders of magnitude. It also scales well as the data size and dimensionality increase. Thus, it enables a new approach to high-performance similarity search: fast retrieval of an approximate answer, possibly followed by a slower but more accurate computation in the few cases where the user is not satisfied with the approximate answer.

The rest of this paper is organized as follows. In Section 2 we introduce the notation and give formal definitions of the similarity search problems. Then in Section 3 we describe locality-sensitive hashing and show how to apply it to nearest neighbor search. In Section 4 we report the results of experiments with LSH. Related work is described in Section 5. Finally, in Section 6 we present conclusions and ideas for future research.
2 Preliminaries
We use l_p^d to denote the Euclidean space R^d under the l_p norm, i.e., when the length of a vector (x_1, ..., x_d) is defined as (|x_1|^p + ... + |x_d|^p)^{1/p}. Further, d_p(p, q) = ||p − q||_p denotes the distance between the points p and q in l_p^d. We use H^d to denote the Hamming metric space of dimension d, i.e., the space of binary vectors of length d under the standard Hamming metric. We use d_H(p, q) to denote the Hamming distance, i.e., the number of bits on which p and q differ.

The nearest neighbor search problem is defined as follows:
Definition 1 (Nearest Neighbor Search (NNS)) Given a set P of n objects represented as points in a normed space l_p^d, preprocess P so as to efficiently answer queries by finding the point in P closest to a query point q.
The definition generalizes naturally to the case where we want to return K > 1 points. Specifically, in the K-Nearest Neighbors Search (K-NNS), we wish to return the K points in the database that are closest to the query point. The approximate version of the NNS problem is defined as follows:
`
`De(cid:2)nition  (cid:4)(cid:2)(cid:7)Nearest Neighbor Search (cid:4)(cid:2)(cid:7)NNS(cid:5)(cid:5)
`Given a set P of points in a normed space ld
`p(cid:5) prepro(cid:2)
`cess P so as to e(cid:6)ciently return a point p  P for any
`given query point q(cid:5) such that d(cid:3)q(cid:4) p(cid:6) (cid:4) (cid:3) (cid:31) (cid:2)(cid:6)d(cid:3)q(cid:4) P (cid:6)(cid:5)
`where d(cid:3)q(cid:4) P (cid:6) is the distance of q to the its closest point
`in P (cid:8)
Again, this definition generalizes naturally to finding K > 1 approximate nearest neighbors. In the Approximate K-NNS problem, we wish to find K points p_1, ..., p_K such that the distance of p_i to the query q is at most (1 + ε) times the distance from the ith nearest point to q.
3 The Algorithm
In this section we present efficient solutions to the approximate versions of the NNS problem. Without significant loss of generality, we will make the following two assumptions about the data:

1. the distance is defined by the l_1 norm (see comments below);

2. all coordinates of points in P are positive integers.
The first assumption is not very restrictive, as usually there is no clear advantage in, or even difference between, using the l_1 or l_2 norm for similarity search. For example, the experiments done for the Webseek project […] (see […]) show that comparing color histograms using the l_1 and l_2 norms yields very similar results (l_1 is marginally better). Both our data sets (see Section 4) have a similar property. Specifically, we observed that a nearest neighbor of an average query point computed under the l_1 norm was also an ε-approximate neighbor under the l_2 norm with a small average value of ε (this observation holds for both data sets). Moreover, in most cases the nearest neighbors under the l_1 and l_2 norms were exactly the same. This observation is interesting in its own right, and can be partially explained via the theorem by Figiel et al. (see […] and references therein). They showed analytically that by simply applying scaling and random rotation to the space l_2, we can make the distances induced by the l_1 and l_2 norms almost equal up to an arbitrarily small factor. It seems plausible that real data is already randomly rotated, thus the difference between the l_1 and l_2 norms is very small. Moreover, for the data sets for which this property does not hold, we are guaranteed that after performing scaling and random rotation our algorithms can be used for the l_2 norm with an arbitrarily small loss of precision.
As far as the second assumption is concerned, clearly all coordinates can be made positive by properly translating the origin of R^d. We can then convert all coordinates to integers by multiplying them by a suitably large number and rounding to the nearest integer. It can be easily verified that by choosing proper parameters, the error induced by rounding can be made arbitrarily small. Notice that after this operation the minimum interpoint distance is 1.
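The following small sketch (our illustration; the scale factor is an arbitrary assumption) shows one way to carry out this translation and rounding step:

    def to_positive_integers(points, scale=1000):
        # Translate the origin so every coordinate becomes positive,
        # then multiply by a large constant and round to the nearest
        # integer. A larger `scale` makes the rounding error smaller.
        lo = min(x for p in points for x in p)
        return [tuple(round((x - lo) * scale) + 1 for x in p) for p in points]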
3.1 Locality-Sensitive Hashing
In this section we present locality-sensitive hashing (LSH). This technique was originally introduced by Indyk and Motwani […] for the purposes of devising main memory algorithms for the ε-NNS problem. Here we give an improved version of their algorithm. The new algorithm is in many respects more natural than the earlier one: it does not require the hash buckets to store only one point; it has better running time guarantees; and the analysis is generalized to the case of secondary memory.
Let C be the largest coordinate in all points in P. Then, as per […], we can embed P into the Hamming cube H^{d'} with d' = Cd, by transforming each point p = (x_1, ..., x_d) into a binary vector

v(p) = Unary_C(x_1) ... Unary_C(x_d),

where Unary_C(x) denotes the unary representation of x, i.e., a sequence of x ones followed by C − x zeroes.

Fact 1 For any pair of points p, q with coordinates in the set {1, ..., C},

d_1(p, q) = d_H(v(p), v(q)).
That is, the embedding preserves the distances between the points. Therefore, in the sequel we can concentrate on solving ε-NNS in the Hamming space H^{d'}. However, we emphasize that we do not need to actually convert the data to the unary representation, which could be expensive when C is large; in fact, all our algorithms can be made to run in time independent of C. Rather, the unary representation provides us with a convenient framework for the description of the algorithms, which would otherwise be more complicated.
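In code, the embedding and Fact 1 look as follows (a sketch of ours, for illustration only; as just noted, the unary vectors need never be materialized):

    def unary(x, C):
        # Unary_C(x): x ones followed by C - x zeroes
        return [1] * x + [0] * (C - x)

    def v(p, C):
        # Embed p = (x_1, ..., x_d), coordinates in {1, ..., C},
        # into the Hamming cube H^{d'} with d' = C * d.
        return [b for x in p for b in unary(x, C)]

    def d_hamming(u, w):
        return sum(a != b for a, b in zip(u, w))

    # Fact 1: d_1(p, q) == d_H(v(p), v(q))
    p, q, C = (3, 7, 2), (5, 4, 2), 8
    assert sum(abs(a - b) for a, b in zip(p, q)) == d_hamming(v(p, C), v(q, C))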

Algorithm Preprocessing
Input: A set of points P, l (number of hash tables)
Output: Hash tables T_i, i = 1, ..., l
Foreach i = 1, ..., l
    Initialize hash table T_i by generating a random hash function g_i(·)
Foreach i = 1, ..., l
    Foreach j = 1, ..., n
        Store point p_j in bucket g_i(p_j) of hash table T_i

Figure 1: Preprocessing algorithm for points already embedded in the Hamming cube.
Algorithm Approximate Nearest Neighbor Query
Input: A query point q, K (number of approximate nearest neighbors)
Access: Hash tables T_i, i = 1, ..., l, generated by the preprocessing algorithm
Output: K (or fewer) approximate nearest neighbors
S ← ∅
Foreach i = 1, ..., l
    S ← S ∪ {points found in bucket g_i(q) of table T_i}
Return the K nearest neighbors of q found in set S
/* Can be found by main memory linear search */

Figure 2: Approximate Nearest Neighbor query answering algorithm.
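The pseudocode of Figures 1 and 2 translates almost line for line into the runnable Python sketch below (our rendering, not the authors' implementation; it uses the bit-sampling functions g_i(p) = p|I_i defined in the text, points stored as integer tuples, and in-memory dictionaries in place of the disk-resident tables):

    import random

    def g(p, I, C):
        # g_i(p) = p|I_i: the bits of the unary embedding v(p) at the
        # positions listed in I. Bit j of v(p) equals 1 iff
        # p[j // C] > j % C, so the unary vector is never built.
        return tuple(1 if p[j // C] > j % C else 0 for j in I)

    def preprocess(P, d_prime, k, l, C):
        # Figure 1: draw l random index sets (one g_i per hash table)
        # and store every point in its bucket in each table.
        subsets = [[random.randrange(d_prime) for _ in range(k)] for _ in range(l)]
        tables = [{} for _ in range(l)]
        for I, T in zip(subsets, tables):
            for p in P:
                T.setdefault(g(p, I, C), []).append(p)
        return subsets, tables

    def lsh_query(q, subsets, tables, K, C):
        # Figure 2: union the buckets g_i(q) over all l tables, then
        # return the K candidates closest to q under the l_1 norm.
        S = set()
        for I, T in zip(subsets, tables):
            S.update(T.get(g(q, I, C), []))
        return sorted(S, key=lambda p: sum(abs(a - b) for a, b in zip(p, q)))[:K]

Here d_prime = C · d is the dimension of the Hamming cube, and k and l are the parameters whose choice is analyzed in Section 3.2.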
We define the hash functions as follows. For an integer l to be specified later, choose l subsets I_1, ..., I_l of {1, ..., d'}. Let p|I denote the projection of vector p on the coordinate set I, i.e., we compute p|I by selecting the coordinate positions as per I and concatenating the bits in those positions. Denote g_j(p) = p|I_j. For the preprocessing, we store each p ∈ P in the bucket g_j(p), for j = 1, ..., l. As the total number of buckets may be large, we compress the buckets by resorting to standard hashing. Thus, we use two levels of hashing: the LSH function maps a point p to bucket g_j(p), and a standard hash function maps the contents of these buckets into a hash table of size M. The maximal bucket size of the latter hash table is denoted by B. For the algorithm's analysis, we will assume hashing with chaining, i.e., when the number of points in a bucket exceeds B, a new bucket (also of size B) is allocated and linked to and from the old bucket. However, our implementation does not employ chaining, but relies on a simpler approach: if a bucket in a given index is full, a new point cannot be added to it, since it will be added to some other index with high probability. This saves us the overhead of maintaining the link structure.
The number n of points, the size M of the hash table, and the maximum bucket size B are related by the following equation:

M = α · n/B,

where α is the memory utilization parameter, i.e., the ratio of the memory allocated for the index to the size of the data set.
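For instance (our illustration, not a figure from the paper): with n = 1,000,000 points, bucket size B = 100, and memory utilization α = 2, each index is a hash table of M = α · n/B = 20,000 buckets, i.e., M · B = 2,000,000 slots, twice the size of the data set.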
To process a query q, we search all indices g_1(q), ..., g_l(q) until we either encounter at least c · l points (for c specified later) or use all l indices. Clearly, the number of disk accesses is always upper bounded by the number of indices, which is equal to l. Let p_1, ..., p_t be the points encountered in the process. For Approximate K-NNS, we output the K points p_i closest to q; in general, we may return fewer points if the number of points encountered is less than K.
It remains to specify the choice of the subsets I_j. For each j ∈ {1, ..., l}, the set I_j consists of k elements of {1, ..., d'} sampled uniformly at random with replacement. The optimal value of k is chosen to maximize the probability that a point p "close" to q will fall into the same bucket as q, and also to minimize the probability that a point p' "far away" from q will fall into the same bucket. The choice of the values of l and k is deferred to the next section.

Although we are mainly interested in the I/O complexity of our scheme, it is worth pointing out that the hash functions can be efficiently computed if the data set is obtained by mapping l_1^d into d'-dimensional Hamming space. Let p be any point from the data set and let p' denote its image after the mapping. Let I be the set of coordinates and recall that we need to compute p'|I. For i = 1, ..., d, let I|i denote, in sorted order, the coordinates in I which correspond to the ith coordinate of p. Observe that projecting p' on I|i results in a sequence of bits which is monotone, i.e., consists of a number, say o_i, of ones followed by zeros. Therefore, in order to represent p'|I it is sufficient to compute o_i for i = 1, ..., d. However, the latter task is equivalent to finding the number of elements in the sorted array I|i which are smaller than a given value, i.e., the ith coordinate of p. This can be done via binary search in log C time, or even in constant time using a precomputed array of C bits. Thus, the total time needed to compute the function is either O(d log C) or O(d), depending on the resources used. In our experimental section, the value of C can be made very small, and therefore we will resort to the second method.
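A sketch of this computation (ours; here thresholds[i] plays the role of the sorted array I|i, stored as the offsets r ∈ {0, ..., C − 1} of the sampled unary bits of coordinate i, so that bit r of Unary_C(x) is 1 iff r < x):

    from bisect import bisect_left

    def project_fast(p, thresholds):
        # Computes o_i for i = 1, ..., d: the number of ones in the
        # monotone bit sequence obtained by projecting p' on I|i,
        # via binary search in O(log C) per coordinate.
        # The tuple (o_1, ..., o_d) represents p'|I.
        return tuple(bisect_left(t, p[i]) for i, t in enumerate(thresholds))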

For quick reference we summarize the preprocessing and query answering algorithms in Figures 1 and 2.
3.2 Analysis of Locality-Sensitive Hashing
The principle behind our method is that the probability of collision of two points p and q is closely related to the distance between them. Specifically, the larger the distance, the smaller the collision probability. This intuition is formalized as follows […]. Let D(·, ·) be a distance function of elements from a set S, and for any p ∈ S let B(p, r) denote the set of elements of S within distance r of p.
Definition 3 A family H of functions from S to U is called (r_1, r_2, p_1, p_2)-sensitive for D(·, ·) if for any q, p ∈ S:

• if p ∈ B(q, r_1) then Pr_H[h(q) = h(p)] ≥ p_1,

• if p ∉ B(q, r_2) then Pr_H[h(q) = h(p)] ≤ p_2.

In the above definition, probabilities are considered with respect to the random choice of a function h from the family H. In order for a locality-sensitive family to be useful, it has to satisfy the inequalities p_1 > p_2 and r_1 < r_2.
Observe that if D(·, ·) is the Hamming distance d_H(·, ·), then the family of projections on one coordinate is locality-sensitive. More specifically:

Fact 2 Let S = H^{d'} (the d'-dimensional Hamming cube) and D(p, q) = d_H(p, q) for p, q ∈ H^{d'}. Then for any r, ε > 0, the family H_{d'} = {h_i : h_i((b_1, ..., b_{d'})) = b_i, for i = 1, ..., d'} is (r, r(1 + ε), 1 − r/d', 1 − r(1 + ε)/d')-sensitive.
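To see why Fact 2 holds, note that for a coordinate i chosen uniformly at random, h_i(p) = h_i(q) exactly when p and q agree on bit i, which happens for d' − d_H(p, q) of the d' coordinates. Hence Pr[h_i(p) = h_i(q)] = 1 − d_H(p, q)/d', which is at least 1 − r/d' when p ∈ B(q, r) and at most 1 − r(1 + ε)/d' when p ∉ B(q, r(1 + ε)).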
We now generalize the algorithm from the previous section to an arbitrary locality-sensitive family H. Thus, the algorithm is equally applicable to other locality-sensitive hash functions (e.g., see […]). The generalization is simple: the functions g are now defined to be of the form

g_i(p) = (h_{i1}(p), h_{i2}(p), ..., h_{ik}(p)),

where the functions h_{i1}, ..., h_{ik} are randomly chosen from H with replacement. As before, we choose l such functions g_1, ..., g_l. In the case when the family H_{d'} is used, i.e., each function selects one bit of an argument, the resulting values of g_j(p) are essentially equivalent to p|I_j.

We now show that the LSH algorithm can be used to solve what we call the (r, ε)-Neighbor problem: determine whether there exists a point p within a fixed distance r_1 = r of q, or whether all points in the database are at least a distance r_2 = r(1 + ε) away from q; in the first case, the algorithm is required to return a point p' within distance at most (1 + ε)r of q. In particular, we argue that the LSH algorithm solves this problem for a proper choice of k and l, depending on r and ε. Then we show how to apply the solution to this problem to solve ε-NNS.

Denote by P' the set of all points p' ∈ P such that d(q, p') > r_2. We observe that the algorithm correctly solves the (r, ε)-Neighbor problem if the following two properties hold:

P1 If there exists p* such that p* ∈ B(q, r_1), then g_j(p*) = g_j(q) for some j = 1, ..., l.

P2 The total number of blocks pointed to by q and containing only points from P' is less than cl.
Assume that H is a (r_1, r_2, p_1, p_2)-sensitive family; define ρ = ln(1/p_1) / ln(1/p_2). The correctness of the LSH algorithm follows from the following theorem.
Theorem 1 Setting k = log_{1/p_2}(n/B) and l = (n/B)^ρ guarantees that properties P1 and P2 hold with probability at least 1/2 − 1/e ≥ 0.132.
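As a numerical illustration (ours, with parameter values chosen arbitrarily): for p_1 = 0.9, p_2 = 0.5, and n/B = 10,000, we get k = log_2 10,000 ≈ 13.3, ρ = ln(1/0.9)/ln(1/0.5) ≈ 0.105/0.693 ≈ 0.152, and hence l = (n/B)^ρ ≈ 4, so only a handful of indices are needed.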
Remark 1 Note that by repeating the LSH algorithm O(1/δ) times, we can amplify the probability of success in at least one trial to 1 − δ, for any δ > 0.
Proof: Let property P1 hold with probability P_1, and property P2 hold with probability P_2. We will show that both P_1 and P_2 are large. Assume that there exists a point p* within distance r_1 of q; the proof is quite similar otherwise. Set k = log_{1/p_2}(n/B). The probability that g_j(p') = g_j(q) for a point p' ∈ P − B(q, r_2) is at most p_2^k = B/n. Denote the set of all points p' ∉ B(q, r_2) by P'. The expected number of blocks allocated for g_j which contain exclusively points from P' does not exceed 2. The expected number of such blocks allocated for all g_j is at most 2l. Thus, by the Markov inequality […], the probability that this number exceeds 4l is less than 1/2. If we choose c = 4, the probability that property P2 holds is P_2 ≥ 1/2.

Consider now the probability of g_j(p*) = g_j(q). Clearly, it is bounded from below by
p_1^k = p_1^{log_{1/p_2}(n/B)} = (n/B)^{−ln(1/p_1)/ln(1/p_2)} = (n/B)^{−ρ}.
By setting l = (n/B)^ρ, we bound from above the probability that g_j(p*) ≠ g_j(q) for all j = 1, ..., l by 1/e. Thus the probability that one such g_j exists is at least P_1 ≥ 1 − 1/e.
Therefore, the probability that both properties P1 and P2 hold is at least 1 − [(1 − P_1) + (1 − P_2)] = P_1 + P_2 − 1 ≥ 1/2 − 1/e. The theorem follows.

In the following we consider the LSH family for the Hamming …
