Case 1:14-cv-02396-PGG-MHD Document 153-16 Filed 06/28/19 Page 1 of 8
`
`EXHIBIT O
`
`
`Efficient Similarity Search in Digital Libraries
`
`Christian Böhm Bernhard Braunmüller Hans-Peter Kriegel Matthias Schubert
`
`Computer Science Institute, University of Munich
`E-mail: {boehm, braunmue, kriegel, schubera}@dbs.informatik.uni-muenchen.de
`
`Abstract
`Digital libraries are a core information technology.
`When the stored data is complex, e.g. high-resolution
`images or molecular protein structures, simple query types
`like the exact match query are hardly applicable. In such
`environments similarity queries, particularly range queries
`and k-nearest neighbor queries, turn out to be important
`query types. Numerous approaches have been proposed for
`the processing of similarity queries which mainly concen-
`trate on highly dynamic data sets where insertion, update,
and deletion operations permanently occur. However, little
effort has been devoted to the case of rather static data
sets, a case that frequently occurs in digital libraries. In this
`paper, we introduce a novel technique for efficient similarity
`search on top of static or rarely changing data sets. In par-
`ticular, we propose a special sorting order on the data
`objects which can be effectively exploited to substantially
`reduce the total query time of similarity queries. An exten-
`sive experimental evaluation with real-world data sets
`emphasizes the practical impact of our technique.
`
`1. Introduction
`In recent years, digital libraries have become a core
`information technology [10, 17]. Among the various impor-
`tant aspects of digital libraries the search for similar objects
`in the huge amount of digitized data has become an essential
`task. The QBIC system, for instance, contains a large image
`library which can be effectively searched for similar images
`by using similarity queries [22, 9]. The Brookhaven Protein
`Data Bank currently provides the atomic coordinates of sev-
`eral thousand proteins [2]. Here, the solution of the impor-
`tant molecular docking problem is supported by applying
similarity queries on docking segments [19]. Further examples
are industrial CAD repositories, which can effectively be
used to reduce the cost of developing and producing new
parts by maximizing the reuse of existing parts [21, 4].
`Again, similarity queries are the most important query type
`in this process. Several other examples exist, e.g. geograph-
`ical repositories, image collections and medical libraries.
`The most common approach for efficient similarity
`search is to map data objects into some high-dimensional
`vector space (the so-called feature space). Similarity
`
`between two data objects is assumed to correspond to the
`distance of their feature vectors. Thus, searching for similar
`objects to a given query object is transformed into the prob-
`lem of finding feature vectors which are close to the query
`feature vector. Popular examples of feature vectors are color
`histograms [8, 26], shape descriptors [15, 13], Fourier vec-
`tors [11, 1] and multi-parametric surface functions [18].
`
`Typically, range queries and k-nearest neighbor (k-nn)
`queries are applied for the process of finding feature vectors
`which are close to a given query feature vector. Range que-
`ries retrieve all feature objects within a given radius ε from
`the query feature vector. For k-nn queries, the user provides
`a number k and receives the k feature vectors which are clos-
`est to the query vector. In general, a similarity query is a
`CPU and I/O intensive task and the conventional approach
`to address this problem is to use some multidimensional
`index structure [25, 12]. Unfortunately, even specialized
`index structures like the TV-tree [20] or the X-tree [6] often
`fail to process similarity queries efficiently when the dimen-
`sionality d of the feature space is too high (d > 10). How-
`ever, this is a frequently encountered situation since the
`dimensionality of the feature space directly corresponds to
`the accuracy of the similarity search. In such environments,
`scan-based techniques like the VA-file [27] provide the most
`efficient query processing. These approaches mainly con-
`sider the case of highly dynamic environments with frequent
insertion, update, and deletion operations. Little effort
has been devoted to the important case of static data sets.
`Obviously, static data sets provide more optimization poten-
`tial than highly dynamic data sets. However, this optimiza-
`tion potential is not exploited by techniques which were
`developed for dynamic environments.
`
`In this paper, we concentrate on the situation of static or
`rarely changing data sets (where the data set is known in
`advance) as they frequently occur in digital libraries. In par-
`ticular, we introduce a novel technique (the landmark file) to
`significantly improve the query performance of scan
`approaches when storing static data sets. The main idea of
`the landmark file is to introduce a special order on the fea-
`ture objects which can be exploited to substantially reduce
`the total amount of data which has to be scanned during the
`similarity search process.
`
`0-7695-0659-3/00 $10.00 2000 IEEE
`
`
`
`
`The rest of our paper is organized as follows. In section 2,
`we review the problem of similarity search in the context of
`index-based and scan-based access methods. We present our
`approach in section 3 and section 4. We performed an exten-
`sive experimental evaluation in order to show the efficiency
`of our approach. The results are presented in section 5 and
`section 6 concludes the paper.
`
`2. Similarity search in feature spaces
`Range queries and k-nn queries are the most important
`similarity queries [7]. In the following, we formally define
`both query types.
`
Definition 1: Range Query
Let DB denote a set of feature vectors v ∈ ℜ^d and let dist: ℜ^d × ℜ^d → ℜ⁺₀ denote a distance function that measures the (dis-)similarity of two feature vectors v1, v2 ∈ ℜ^d. Then, for a query feature vector q ∈ ℜ^d and a query range ε ∈ ℜ⁺₀, the range query returns the set

RQ(q, ε) = {v ∈ DB | dist(q, v) ≤ ε}
`
Definition 2: k-Nearest Neighbor Query
For a query feature vector q ∈ ℜ^d and a query parameter k ≥ 1, the k-nearest neighbor query returns the set NNq(k) ⊆ DB that contains k feature vectors from DB and for which the following condition holds:

∀v ∈ NNq(k), ∀v' ∈ DB − NNq(k): dist(q, v) ≤ dist(q, v')
`
`Note that possibly several feature vectors exist which
`have the same distance to the query vector as the k-th feature
`vector in the answer set. In this case, the k-th feature vector
`in NNq(k) is a non-deterministic selection of one of those
`equally distanced feature vectors. Several distance functions
`for measuring the (dis-) similarity have been discussed. The
Euclidean distance metric L2, for instance, is one of the most
frequently used similarity distance functions, e.g. in the area
of images, protein structures, CAD objects, or stock data.
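As a concrete illustration of the two query types defined above, both can be answered by a simple linear scan over the feature set. This is a minimal sketch with the Euclidean distance, not the paper's optimized method; all function and variable names are our own:

```python
import heapq
import math

def dist(u, v):
    # Euclidean distance L2, the most frequently used similarity distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def range_query(db, q, eps):
    # Definition 1: RQ(q, eps) = {v in DB | dist(q, v) <= eps}
    return [v for v in db if dist(q, v) <= eps]

def knn_query(db, q, k):
    # Definition 2: the k feature vectors closest to q (ties broken arbitrarily)
    return heapq.nsmallest(k, db, key=lambda v: dist(q, v))

db = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
q = (0.0, 0.0)
print(range_query(db, q, 2.0))  # -> [(0.0, 0.0), (1.0, 0.0)]
print(knn_query(db, q, 3))      # -> [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
```

Note that, exactly as discussed below Definition 2, the k-th result is a non-deterministic choice among equally distant vectors.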
Many sophisticated index structures have been proposed
for efficiently processing similarity queries, but only a few
techniques consider the case of static feature sets. In [24], a
`compaction technique for “packing” and reducing dead-
`space on R-trees [14] has been proposed for static pictorial
`feature sets. Other approaches follow the concept of bulk-
`loading when the feature set is completely known in
`advance. The Hilbert R-tree [16], for instance, uses the Hil-
`bert space-filling curve to decompose the feature set into
`contiguous sequences which are stored in the data pages. In
`[5] a variant of the well-known Quicksort algorithm is used
`for a generic bulk loading method. Since index structures
`like the R-tree have been developed for low-dimensional
`feature spaces (d < 5) they offer only poor query perfor-
`mance for high-dimensional feature sets. In recent years,
`
`this problem has been addressed by developing high-dimen-
`sional index structures. The TV-tree [20], for example, uses
`so-called Telescope Vectors, i.e. feature vectors which may
`be dynamically shortened. The underlying assumption is
`that only dimensions with high variance are important for
`the query processing and therefore feature values of dimen-
`sions with low variance can be neglected. Another example
`is the X-tree [6] which uses the concept of directory supern-
`odes. Whenever the split of a directory node would lead to a
`high overlap of the resulting nodes or to overlap minimal but
`extremely unbalanced nodes, the overflowing node is trans-
`formed into a supernode, i.e. a node with a larger than usual
`block size.
`The main advantage of index-based access methods is the
`index selectivity during query processing, i.e. only a small
`fraction of the feature vectors has to be considered. This,
`however, induces costly random seek operations since the
`accessed index pages are generally not stored in contiguous
`disk blocks.
While these high-dimensional indexing techniques per-
form well for dimensions up to 10, their performance
often degenerates for higher dimensions. The reason for this
`effect is the so-called curse of dimensionality: Most of the
`measures one could define in a d-dimensional vector space,
`such as volume, area, or perimeter are exponentially depen-
`dent on the dimensionality of the space. Due to this effect,
`scan-based techniques, particularly the VA-file [27], turn
`out to provide better query performance when accessing
`high-dimensional feature spaces. The VA-file follows the
`idea of vector quantization. The feature space is divided into
`grid cells using α-quantiles in each dimension. Each feature
`vector is then represented by the address of the grid cell in
`which the feature vector lies. The main advantage of scan-
`based techniques is the sequential nature of their I/O opera-
`tions which results in the absence of expensive random
`seeks. However, there is no selectivity in the query process
`involved and therefore all feature vectors have to be consid-
`ered. Additionally, to the best of our knowledge, no optimi-
`zation techniques have been proposed for scan-based
`techniques when the stored feature set is static.
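The vector-quantization idea behind the VA-file can be sketched as follows: each dimension is partitioned into grid cells at α-quantile boundaries, and every feature vector is replaced by the tuple of cell numbers, a few bits per dimension. This is a simplified sketch, not the original VA-file implementation; all names are our own:

```python
import bisect

def quantile_boundaries(values, bits):
    # Split one dimension into 2**bits cells at alpha-quantiles, so that
    # each cell contains roughly the same number of coordinate values.
    cells = 2 ** bits
    ordered = sorted(values)
    n = len(ordered)
    # inner boundaries between consecutive cells
    return [ordered[(j * n) // cells] for j in range(1, cells)]

def approximate(vector, boundaries):
    # The approximation: per dimension, the number of the grid cell
    # in which the coordinate lies (a few bits per dimension).
    return tuple(bisect.bisect_right(b, x) for x, b in zip(vector, boundaries))

data = [(0.1, 9.0), (0.2, 1.0), (0.8, 5.0), (0.9, 7.0)]
bits = 1  # one bit per dimension -> 2 cells per dimension
bounds = [quantile_boundaries([v[d] for v in data], bits) for d in range(2)]
print([approximate(v, bounds) for v in data])  # -> [(0, 1), (0, 0), (1, 0), (1, 1)]
```

A scan then reads only these compact approximations; the exact vectors are fetched only for candidates that the approximation cannot rule out.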
`
`3. The landmark file
`As we have discussed in section 2, both query processing
`paradigms, indexing and sequentially scanning, have advan-
`tages and disadvantages. Our solution combines the advan-
`tages of both approaches and avoids their disadvantages. It
`adopts the idea of the scan, to read only from a single, con-
`tiguous interval of the file and avoids uncontrolled random
`seek operations. But our solution does not read and process
`the complete feature set. Thus, our solution inherits the
`selectivity from the indexing approach.
`For static data sets, we can achieve this goal by keeping
`the feature set in a sort order according to an appropriate
`
`
`
`
`
sorting key. For each query, the query processing algorithm determines the lower and upper bound of the sorting key and performs a sequential scan of the corresponding interval of the feature file. To choose the sorting key of the feature vectors, there are several possibilities, such as the projection to a single dimension. To assess the quality of a sorting key, recall the general objective: a substantial reduction of the number of scanned feature vectors. The interval into which the query is translated has to contain few feature vectors. In other words, the sorting key must distinguish between different feature vectors as much as possible. Obviously, the projection to a single dimension could fail to reach such a high distinguishing power. Therefore, our sorting key considers all relevant dimensions. A general approach is to use the Euclidean distance between a selected reference point and the feature vectors. If, for instance, the origin is selected as the reference point, the sorting key corresponds to the sum of the squared dimension values. We will show in section 4 how an optimal reference point can be chosen.

3.1 Structure of the landmark file

In the following, we show how the idea of sorting the feature vectors can be applied to establish selectivity in scan-based query processing. A point of the feature space (not necessarily contained in the data set) is chosen as the reference point. This point is called the landmark point (cf. figure 1 left). The feature vectors are sorted in ascending order of their distance to the landmark point (this sorting key is called the landmark distance).

Figure 1. Landmark approximation (left) and quantization (right)

We keep three representations of the feature vector file. In the exact representation, we store the full geometry of all feature vectors, i.e. the floating point values of the dimensions. In the compressed representation, we adopt the idea of the VA-file and store a quantized version of all feature vectors, i.e. the address (number) of the grid cell in which the corresponding feature vector lies (typically a few bits per dimension, cf. figure 1 right). The compressed representation requires substantially less space on secondary storage, and, therefore, the I/O cost is reduced by approximately the compression factor compared to the exact representation. On the other hand, the position of a feature vector is not exactly known, so a single access to the exact representation may be necessary.

The third representation of the feature vector file is the distance file. This file does not contain information about each feature vector of the data set, but only for one vector out of i, where i is a user-provided parameter, the chunk parameter. The landmark distance of each i-th vector of the ordered feature set is stored. This distance file partitions the feature space into spherical shells which contain a constant number i of feature vectors. The landmark shells for i = 2 are depicted in figure 1 left. There are no explicit pointers or links between the different representations of the files. As the sort order of the feature vectors is identical in all three representations, and since the length of each feature vector representation is constant, this reference is implicitly given by the position of the feature vector in the file (cf. figure 2).

Figure 2. The landmark file (distance file with i = 2, compressed representation, exact representation)

Every query processing algorithm first considers the distance file and determines the landmark shells which potentially include query results. The corresponding part of the file with the compressed representation is scanned and candidates are collected. For every candidate which cannot be definitely classified as a query result by the compressed representation, a lookup to the exact representation is performed.

To process rare insertions, deletions, and updates, we propose to store new or updated points in a separate file which is scanned sequentially without any optimizations for query processing. Whenever this overflow area becomes too large, the complete landmark file is reconstructed from scratch.

3.2 Query processing using landmarks

When the region of the query is exactly known in advance (such as in range queries), query processing is straightforward:
• The distance file is loaded.
• Using the distance file, those shells are determined which are intersected by the query (cf. figure 3).
• For the intersected shells, the compressed representation is scanned.
• If necessary, the exact representation is accessed for each feature vector for which a decision cannot be made using the compressed representation only.

Figure 3. Range queries on the landmark file
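The landmark-file construction described in this section (sorting by landmark distance and recording every i-th landmark distance in the distance file) and the resulting one-pass range scan can be sketched as follows. This is a simplified in-memory sketch that omits the compressed representation; all names are our own:

```python
import bisect
import math

def build_landmark_file(db, landmark, i):
    # Sort the feature vectors in ascending landmark distance and keep
    # the landmark distance of every i-th vector (the distance file).
    key = lambda v: math.dist(landmark, v)
    ordered = sorted(db, key=key)
    distance_file = [key(v) for v in ordered[::i]]
    return ordered, distance_file

def range_query(ordered, distance_file, landmark, i, q, eps):
    # By the triangle inequality, any answer v has a landmark distance
    # within [L(q) - eps, L(q) + eps]; only the shells intersecting this
    # interval are scanned, in one contiguous pass.
    lq = math.dist(landmark, q)
    lo = max(bisect.bisect_right(distance_file, lq - eps) - 1, 0)
    hi = bisect.bisect_right(distance_file, lq + eps)
    candidates = ordered[lo * i : hi * i]
    return [v for v in candidates if math.dist(q, v) <= eps]

db = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
ordered, dist_file = build_landmark_file(db, (0, 0), 2)
print(range_query(ordered, dist_file, (0, 0), 2, (2.5, 0), 0.6))
# -> [(2, 0), (3, 0)]
```

The final distance check plays the role of the lookup to the exact representation: the shells only bound the candidate set.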
`
`
`
`
For (1-)nearest neighbor queries, the query region is not known in advance. Such queries are usually evaluated by increasing the radius of the query sphere until the nearest neighbor is covered by the query sphere. We adapt this algorithm for the landmark file (cf. figure 4):
• The distance file is loaded.
• Using the distance file, the shell containing the query vector is determined. This is the start shell.
• Beginning with the start shell, the unprocessed shell which is closest to the query vector is scanned next, so the scanned part of the compressed representation is extended alternately at its upper and lower end. Whenever a feature vector is found which is closer than the current candidate, it is stored in a variable.
• This step is repeated until the distance between the next shell and the query vector is larger than the distance to the current candidate.
It is straightforward to extend this algorithm to k-nn search: instead of a single candidate, the algorithm has to maintain a list of k candidates, and the distance to the last candidate is used as the stopping condition. For both query types, the algorithm must perform lookups to the exact representation in tie situations.

Figure 4. Nearest neighbor queries

3.3 Parameter optimization

The chunk parameter i, which determines the size of the blocks for scanning in the nearest neighbor algorithm, is obviously a critical parameter. If it is chosen too small, the compressed representation is scanned in too small portions, which induces non-negligible I/O overhead. In contrast, if it is chosen too large, many compressed feature vectors may be scanned unnecessarily. To minimize the query processing time, the parameter i must be optimized.

Optimization problems like this are often solved by cost models such as [3]. In this special case, however, the estimation of the cost may be difficult because the intersection volumes between sphere shells and the query sphere must be computed. Therefore, we base our optimization on statistical information from previous runs of the algorithm or from tentative runs on a sample.

The information we gather is the mean µ and the standard deviation σ of the number a of feature vectors with a landmark distance in the range from L(q) − r to L(q) + r, where L(q) is the landmark distance of the query vector and r is the distance of the nearest neighbor of q. The number a corresponds to the number of vectors (= the number of shells) which must be processed by a query when i = 1. For a known parameter a, we obtain the following cost function:

t(i, a) = (a + i) · ttr + (a/i + 1) · (tpos + tlat)    (1)

where ttr is the transfer time per compressed feature vector, tpos is the positioning time and tlat is the latency time (rotational delay) of the disk drive. The first term of t(i, a) indicates that a compressed feature vectors are scanned during query processing, with an additional overhead of one block (i compressed feature vectors), because, on the average, half a block too much is scanned at the lower and the upper end of the scanned interval, respectively. In the second term of t(i, a), the fraction a/i + 1 represents the number of separate I/O requests during query processing, for each of which a seek operation with arm positioning and rotational delay is required. This cost t(i, a) can be minimized by setting the derivative to 0:

∂t(i, a)/∂i = 0  ⇒  iopt = ±√(a · (tpos + tlat) / ttr)    (2)

Only the positive solution is valid, and it corresponds to a minimum, because the second derivative is positive.

This solution optimizes the chunk parameter i for a known parameter a. However, this parameter is not constant for all queries. Therefore, we have to optimize i for the variety of different parameters a occurring in different queries. It turned out in our experiments that a is normally distributed. Under this assumption we can extend our model to optimize i for an a which is normally distributed with mean µ and standard deviation σ:

t̃(i) = ∫₀^∞ (1 / (σ·√(2π))) · e^(−(a − µ)² / (2σ²)) · t(i, a) da    (3)

This formula weights each cost t(i, a) with the probability of the corresponding a. The average over all possible a is determined by integration with a ranging from 0 to infinity. In eq. (3), i can again be optimized by setting the derivative ∂t̃(i)/∂i to zero:

iopt = √(µ · (tpos + tlat) / ttr)    (4)

The result is independent of the variance σ² and equal to the result of eq. (2) for a known a, where a is replaced by the mean µ.
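For illustration, eq. (4) can be evaluated directly. The disk parameters below are invented for the example and are not taken from the paper:

```python
import math

def optimal_chunk(mu, t_pos, t_lat, t_tr):
    # eq. (4): i_opt = sqrt(mu * (t_pos + t_lat) / t_tr)
    return math.sqrt(mu * (t_pos + t_lat) / t_tr)

# Illustrative values: mean a of 10,000 vectors, 10 ms positioning time,
# 5 ms rotational delay, 1 microsecond transfer time per compressed vector.
i_opt = optimal_chunk(10_000, 0.010, 0.005, 0.000001)
print(round(i_opt))  # -> 12247
```

The formula trades transfer cost (which grows with i) against seek cost (which shrinks with i), so slower seeks or larger workloads push the optimum toward larger chunks.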
`
`
`
`
• CAD repository: 16-dimensional Fourier points corresponding to contours of 1,200,000 industrial parts used in the S3-System [4].
• Image collection: 64-dimensional color histograms corresponding to 112,000 images from TV snapshots.
`The CAD repository we use is a collection of contours of
`industrial parts (e.g. clips holding cables) from a supplier for
`automobile companies. The repository is used to reduce the
`cost of developing and producing new parts by maximizing
`the reuse of existing parts. This is achieved by applying sim-
`ilarity queries, in particular k-nn queries. The repository can
`be seen as static since there are no update or deletion opera-
`tions and very few insertion operations (less than 5 per
month). The image collection is comparable to other image
collections, e.g. the Corel image collection. Each image was
reduced to 64 colors, and the number of pixels per color in
an image forms the 64-dimensional feature vector. Color his-
tograms are successfully used for similarity search in image
libraries, e.g. [22].
`In the experimental evaluation, we compare our tech-
`nique to the VA-file which has been shown to offer excellent
`performance for high-dimensional feature sets (cf. section
2). We focus on k-nn queries since they generally have more
practical impact. The reason is that, from a user's point of
view, range queries are often difficult to apply because a rea-
sonable value for ε is hard to determine. Additionally, the
total query time of a k-nn query is an upper bound for a range
query with an ε value equal to the distance of the k-th
nearest neighbor. Recall that the landmark file has no additional
seek cost for range queries since it scans the intersected
shells in one pass and does not alternate the scanning direc-
tion as in the case of k-nn queries. The overall overhead of
performing k-nn queries compared to range queries is there-
fore generally higher for the landmark file than for the VA-
file. Thus, we can safely assume that the performance gain
of the landmark file compared to the VA-file is even higher
`
Figure 6. Optimal chunk parameter i, image collection (total query time in seconds vs. chunk size, for 1-nn and 10-nn queries; the chunk parameter calculated by our formula is marked)
`
`4. Determining the landmark point
`In this section, we show how a suitable landmark point
`can be chosen in order to further optimize query processing
`using the landmark file. Figure 5 shows a feature set which
is skewed from northwest to southeast. On the left side, we
choose the landmark point L1, which is close to the origin.
Considering the landmark distances of the data points, we
recognize a very low variance. Even for a minimal chunk
parameter i = 1, the landmark shells which are intersected
by the depicted query cover almost the complete feature set
(up to 2 points). On the right side of figure 5, in contrast, the
landmark point L2 is chosen such that it lies on the
axis of the major skew direction of the feature set, in the
southeastern corner of the feature space. We recognize a
high variance of the landmark distances, and the landmark
shells intersected by the query region cover only a few points
for reasonable values of i (for the minimum chunk size i = 1,
the query region would cover only one vector, the actual
nearest neighbor). Obviously, the variance of the landmark
distances is inherited from the variance of the feature set:
whenever the landmark has a position on an axis of high
variance of the feature set, the variance of the landmark dis-
tances is high.
`
Figure 5. A good (L2) and a bad (L1) landmark point
This is captured mathematically by the concept of principal component analysis (PCA), which finds the direction in which a feature set exhibits the highest variance [23]. For PCA, the mean vector x̄ = (1/N) · Σi xi and the covariance matrix C = (1/N) · Σi (xi − x̄)(xi − x̄)^T are determined, and C is diagonalized by some orthonormal matrix Φ, i.e. Φ^T C Φ = diag(λ1, ..., λd). The matrix Φ can be determined by singular value decomposition [23]; the λi are the eigenvalues, and the column vectors of Φ = [φ1, ..., φd] are the eigenvectors of C. The eigenvector φi corresponding to the largest eigenvalue λi indicates the direction of the highest variance of the feature set. Therefore, the landmark point should be chosen from the straight line g = x̄ + ν · φi. To maximize the variance of the landmark distances, the landmark point should not be chosen inside the feature space.
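The PCA step described here can be sketched with a simple power iteration, which finds the eigenvector of the largest eigenvalue directly instead of the full singular value decomposition mentioned in the text. This is a simplified sketch; the data set and the scale ν below are invented for the example, and all names are our own:

```python
def principal_direction(points, iters=200):
    # Power iteration on the covariance matrix C = 1/N * sum (x - mean)(x - mean)^T
    # to find the eigenvector of the largest eigenvalue, i.e. the direction
    # of highest variance of the feature set.
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    v = [1.0] * d
    for _ in range(iters):
        # w = C v, computed as 1/n * sum_x (x . v) x without forming C explicitly
        w = [0.0] * d
        for x in centered:
            s = sum(xj * vj for xj, vj in zip(x, v))
            for j in range(d):
                w[j] += s * x[j] / n
        norm = sum(wj * wj for wj in w) ** 0.5
        v = [wj / norm for wj in w]
    return mean, v

# A feature set skewed along the diagonal, as in figure 5.
pts = [(t, t + 0.1 * ((-1) ** k)) for k, t in enumerate(range(10))]
mean, phi = principal_direction(pts)

# Landmark point on the line g = mean + nu * phi, placed far outside
# the feature space to maximize the variance of the landmark distances.
nu = 100.0
landmark = [m + nu * f for m, f in zip(mean, phi)]
```

For this diagonal data set, phi converges to roughly (0.71, 0.71), so the landmark lands on the extension of the major skew axis, as recommended above.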
`
`5. Experimental evaluation
`In order to demonstrate the practical relevance of our
`technique we performed an extensive experimental evalua-
`tion using the following real-world feature sets:
`
`
`
`
for range queries. We performed all experiments on an HP C160 under HP-UX 10.20. The implementation language for our technique and the VA-file was C++.

In our first experiment, we evaluate our solution for the important problem of optimizing the chunk parameter i (cf. section 3.3). We performed 1-nn queries and 10-nn queries on the image collection and varied the chunk size from 500 to 24,000 points. The total query times with respect to the chunk size are depicted in figure 6. Obviously, there exists an optimum chunk size for both query parameters, which lies at approximately iopt = 4350. Applying the formula derived in section 3.3 yields a chunk parameter of i = 4380, an almost perfect match. We observed the same effect when using the CAD repository (not depicted). Therefore, we applied our formula for the choice of the chunk parameter i in all following experiments.

Next, we investigate the performance of the landmark file and the VA-file with respect to the query parameter k. We increased k from 1 to 50 and measured the total query times. Figure 7 depicts the results for the CAD repository. We observe that the landmark file clearly outperforms the VA-file for all values of k. For small parameters, e.g. for k = 1, the landmark file requires only 22% of the query time of the VA-file. When k increases, the performance gain of the landmark file decreases. The reason is that with increasing query size, the distance to the k-th nearest neighbor increases and thus more and more landmark shells are intersected. However, even for high values, e.g. for k = 50, our approach saves 31% of the query time compared to the VA-file.

Figure 7. Query size on the CAD repository (total query time in seconds vs. query size k, for the VA-file and the landmark file)

We performed the same experiment on the image collection (cf. figure 8). Again, our approach substantially outperforms the VA-file for all values of k and saves between 45% and 73% of the total query time of the VA-file.

As we have discussed in section 4, the selection of the landmark point is a critical task. In order to evaluate our solution for this problem, we additionally performed k-nn queries on the image collection for randomly selected landmark points. In particular, we measured the total query times of the landmark file using 10 random landmark points and determined the average total query time by calculating the mean, as depicted in figure 8 (denoted as "lm file (random lm. pt.)"). We observe that the query performance of the landmark file increases substantially when using the formally derived landmark point compared to using a random landmark point. For k = 10, for instance, the total query time of the landmark file using a random landmark point is 2.22 times the total query time using our formally derived landmark point. Another important observation is the fact that even with a random landmark point the landmark file consistently outperforms the VA-file.

Figure 8. Query size on the image collection (total query time in seconds vs. query size k, for the VA-file, the landmark file with a random landmark point, and the landmark file)

Finally, we investigate the performance of our approach with respect to the size N of the feature set. We used the CAD repository and increased the number N of feature vectors from 100,000 to 1,200,000. We measured the total query time of 1-nn, 10-nn, and 50-nn queries for both access methods. The results are presented in figure 9. For small values of N, the landmark file and the VA-file perform almost equally. When the size of the feature set increases, however, this situation changes drastically. For N = 400,000 and k = 10, for instance, the landmark file already saves 33% of the total query time compared to the VA-file, and this ratio increases up to 46% for N = 1,200,000. For k = 1 and the complete CAD repository, the landmark file even saves 78% of the total query time of the VA-file, which corresponds to a speed-up factor of 4.53.
`
`
`
`
`[5] Böhm C., Kriegel H.-P.: ’Efficient Bulk Loading of Large High-Di-
`mensional Indexes’, Proc. Int. Conf. on Data Warehousing and Knowl-
`edge Discovery, 1999, pp. 251-260.
`[6] Berchtold S., Keim D., Kriegel H.-P.: ‘The X-tree: An Index Struc-
`ture for High-Dimensional Data’, Proc. Int. Conf. on Very Large Data
`Bases, 1996, pp. 28-39.
`[7] Böhm C.: ‘Efficient Indexing High-Dimensional Data Spaces’,
`Ph.D. thesis, University of Munich, 1998.
`[8] Faloutsos C., Barber R., Flickner M., Hafner J., et al.: ‘Efficient and
`Effective Querying by Image Content’, Journal of Intelligent Informa-
`tion Systems, 1994, Vol. 3, pp. 231-262.
`[9] Flickner M., Sawhney H., Niblack W., Ashley J., Huang Q., Dom
`B., Gorkani M., Hafner J., Lee D., Petkovic D., Steele D., Yanker P.:
`‘Query by image and video content: The QBIC system‘, Computer,
`Vol. 28, September 1995, pp. 23-32.
`[10] Fox E. A., Akscyn R. M., Furuta R. K., Leggett J. J.: ’Digital Li-
`braries-Introduction to the Special Section’, Communications of the
`ACM, 38(4), 1995, pp. 22-28.
`[11] Faloutsos C., Ranganathan M., Manolopoulos Y.: ‘Fast Subse-
`quence Matching in Time-Series Databases’, Proc. ACM SIGMOD
`Int. Conf. on Management of Data, 1994, pp. 419-429.
`[12] Gaede V., Günther O.:‘Multidimensional Access Methods’, ACM
`Computing Surveys, Vol. 30, No. 2, 1998, pp.170-231.
`[13] Gary J. E., Mehrotra R.: ‘Similar Shape Retrieval using a Struc-
`tural Feature Index’, Information Systems, Vol. 18, No. 7, 1993.
`[14] Guttman A.: ‘R-trees: A Dynamic Index Structure for Spatial
`Searching’, Proc. ACM SIGMOD Int. Conf. on Management of Data,
`1984, 47-57.
`[15] Jagadish H. V.: ‘A Retrieval Technique for Similar Shapes’, Proc.
`ACM SIGMOD Int. Conf. on Managem. of Data, 1991, 208-217.
[16] Kamel I., Faloutsos C.: ‘Hilbert R-tree: An Improved R-tree using
Fractals’, Proc. Int. Conf. on Very Large Data Bases, 1994, pp. 500-509.
`[17] Klavans J.: ’Data Bases in Digital Libraries: Where Computer
`Science and Information Management Meet’, Proc. ACM SIGACT-
`SIGMOD-SIGART Symposium on Principles of Database Systems,
`1998, 224-226.
`[18] Kriegel H.-P., Seidl T. : ’Approximation-Based Similarity Search
`for 3-D Surface Segments’, GeoInformatica Int. Journal on Advances
`of Computer Science for Geographic Information Systems, 1998,
`Vol.2, No.2, pp.113-147.
`[19] Kriegel H.-P., Schmidt T., Seidl T.: ’3D Similarity Search by
`Shape Approximation’, Proc. 5th Int. Symposium on Large Spatial Da-
`tabases, 1997, pp. 11-28.
`[20] Lin K., Jagadish H. V., Faloutsos C.: ‘The TV-tree: An Index
`Structure for High-Dimensional Data’, Journal of Very Large Data
`Bases, Vol. 3, pp. 517-542, 1994.
`[21] Mehrotra R., Gary J. E.: ‘Feature-Based Retrieval of Similar
`Shapes’, Proc. Int. Conf