Case 1:14-cv-02396-PGG-MHD Document 153-16 Filed 06/28/19

EXHIBIT O
Efficient Similarity Search in Digital Libraries

Christian Böhm, Bernhard Braunmüller, Hans-Peter Kriegel, Matthias Schubert
Computer Science Institute, University of Munich
E-mail: {boehm, braunmue, kriegel, schubert}@dbs.informatik.uni-muenchen.de
Abstract

Digital libraries are a core information technology. When the stored data is complex, e.g. high-resolution images or molecular protein structures, simple query types like the exact match query are hardly applicable. In such environments, similarity queries, particularly range queries and k-nearest neighbor queries, turn out to be important query types. Numerous approaches have been proposed for the processing of similarity queries; they mainly concentrate on highly dynamic data sets where insertion, update, and deletion operations occur permanently. However, little effort has been devoted to the case of rather static data sets, a case that frequently occurs in digital libraries. In this paper, we introduce a novel technique for efficient similarity search on top of static or rarely changing data sets. In particular, we propose a special sorting order on the data objects which can be effectively exploited to substantially reduce the total query time of similarity queries. An extensive experimental evaluation with real-world data sets emphasizes the practical impact of our technique.
1. Introduction

In recent years, digital libraries have become a core information technology [10, 17]. Among the various important aspects of digital libraries, the search for similar objects in the huge amount of digitized data has become an essential task. The QBIC system, for instance, contains a large image library which can be effectively searched for similar images by using similarity queries [22, 9]. The Brookhaven Protein Data Bank currently provides the atomic coordinates of several thousand proteins [2]. Here, the solution of the important molecular docking problem is supported by applying similarity queries on docking segments [19]. Another example is industrial CAD repositories, which can effectively be used to reduce the cost of developing and producing new parts by maximizing the reuse of existing parts [21, 4]. Again, similarity queries are the most important query type in this process. Several other examples exist, e.g. geographical repositories, image collections, and medical libraries.

The most common approach for efficient similarity search is to map data objects into some high-dimensional vector space (the so-called feature space). Similarity between two data objects is assumed to correspond to the distance of their feature vectors. Thus, searching for objects similar to a given query object is transformed into the problem of finding feature vectors which are close to the query feature vector. Popular examples of feature vectors are color histograms [8, 26], shape descriptors [15, 13], Fourier vectors [11, 1] and multi-parametric surface functions [18].
Typically, range queries and k-nearest neighbor (k-nn) queries are applied for the process of finding feature vectors which are close to a given query feature vector. Range queries retrieve all feature vectors within a given radius ε around the query feature vector. For k-nn queries, the user provides a number k and receives the k feature vectors which are closest to the query vector. In general, a similarity query is a CPU- and I/O-intensive task, and the conventional approach to address this problem is to use some multidimensional index structure [25, 12]. Unfortunately, even specialized index structures like the TV-tree [20] or the X-tree [6] often fail to process similarity queries efficiently when the dimensionality d of the feature space is too high (d > 10). However, this is a frequently encountered situation, since the dimensionality of the feature space directly corresponds to the accuracy of the similarity search. In such environments, scan-based techniques like the VA-file [27] provide the most efficient query processing. These approaches mainly consider the case of highly dynamic environments with frequent insertion, update, and deletion operations. Only little effort has been devoted to the important case of static data sets. Obviously, static data sets provide more optimization potential than highly dynamic data sets. However, this optimization potential is not exploited by techniques which were developed for dynamic environments.

In this paper, we concentrate on the situation of static or rarely changing data sets (where the data set is known in advance) as they frequently occur in digital libraries. In particular, we introduce a novel technique (the landmark file) to significantly improve the query performance of scan approaches when storing static data sets. The main idea of the landmark file is to introduce a special order on the feature objects which can be exploited to substantially reduce the total amount of data which has to be scanned during the similarity search process.
The rest of our paper is organized as follows. In section 2, we review the problem of similarity search in the context of index-based and scan-based access methods. We present our approach in sections 3 and 4. We performed an extensive experimental evaluation in order to show the efficiency of our approach; the results are presented in section 5, and section 6 concludes the paper.
2. Similarity search in feature spaces

Range queries and k-nn queries are the most important similarity queries [7]. In the following, we formally define both query types.
Definition 1: Range Query
Let DB denote a set of feature vectors v ∈ ℜ^d and let dist: ℜ^d × ℜ^d → ℜ₀⁺ denote a distance function that measures the (dis-)similarity of two feature vectors v₁, v₂ ∈ ℜ^d. Then, for a query feature vector q ∈ ℜ^d and a query range ε ∈ ℜ₀⁺, the range query returns the set

    RQ(q, ε) = {v ∈ DB | dist(q, v) ≤ ε}

Definition 2: k-Nearest Neighbor Query
For a query feature vector q ∈ ℜ^d and a query parameter k ≥ 1, the k-nearest neighbor query returns the set NN_q(k) ⊆ DB that contains k feature vectors from DB, for which the following condition holds:

    ∀v ∈ NN_q(k), ∀v' ∈ DB − NN_q(k): dist(q, v) ≤ dist(q, v')
Note that possibly several feature vectors exist which have the same distance to the query vector as the k-th feature vector in the answer set. In this case, the k-th feature vector in NN_q(k) is a non-deterministic selection of one of those equally distant feature vectors. Several distance functions for measuring the (dis-)similarity have been discussed. The Euclidean distance metric L₂, for instance, is one of the most frequently used similarity distance functions, e.g. in the area of images, protein structures, CAD objects, or stock data.
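To make Definitions 1 and 2 concrete, the following minimal sketch (our own illustration in Python, not code from the paper) evaluates both query types by a naive linear scan over the feature set:

```python
import heapq
import math

def dist(u, v):
    # Euclidean distance L2, the metric used as an example in the text
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def range_query(db, q, eps):
    # Definition 1: all vectors within distance eps of the query vector q
    return [v for v in db if dist(q, v) <= eps]

def knn_query(db, q, k):
    # Definition 2: the k vectors closest to q (ties broken arbitrarily,
    # matching the non-deterministic selection noted above)
    return heapq.nsmallest(k, db, key=lambda v: dist(q, v))
```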
Many sophisticated index structures have been proposed for efficiently processing similarity queries, but only few techniques consider the case of static feature sets. In [24], a compaction technique for "packing" R-trees [14] and reducing dead space has been proposed for static pictorial feature sets. Other approaches follow the concept of bulk-loading when the feature set is completely known in advance. The Hilbert R-tree [16], for instance, uses the Hilbert space-filling curve to decompose the feature set into contiguous sequences which are stored in the data pages. In [5], a variant of the well-known Quicksort algorithm is used for a generic bulk loading method. Since index structures like the R-tree have been developed for low-dimensional feature spaces (d < 5), they offer only poor query performance for high-dimensional feature sets. In recent years, this problem has been addressed by developing high-dimensional index structures. The TV-tree [20], for example, uses so-called telescope vectors, i.e. feature vectors which may be dynamically shortened. The underlying assumption is that only dimensions with high variance are important for the query processing, and therefore feature values of dimensions with low variance can be neglected. Another example is the X-tree [6], which uses the concept of directory supernodes. Whenever the split of a directory node would lead to a high overlap of the resulting nodes, or to overlap-minimal but extremely unbalanced nodes, the overflowing node is transformed into a supernode, i.e. a node with a larger than usual block size.

The main advantage of index-based access methods is their index selectivity during query processing, i.e. only a small fraction of the feature vectors has to be considered. This, however, induces costly random seek operations, since the accessed index pages are generally not stored in contiguous disk blocks.
While these high-dimensional indexing techniques perform well for dimensionalities up to about 10, their performance often degenerates for higher dimensions. The reason for this effect is the so-called curse of dimensionality: most of the measures one could define in a d-dimensional vector space, such as volume, area, or perimeter, are exponentially dependent on the dimensionality of the space. Due to this effect, scan-based techniques, particularly the VA-file [27], turn out to provide better query performance when accessing high-dimensional feature spaces. The VA-file follows the idea of vector quantization. The feature space is divided into grid cells using α-quantiles in each dimension. Each feature vector is then represented by the address of the grid cell in which the feature vector lies. The main advantage of scan-based techniques is the sequential nature of their I/O operations, which results in the absence of expensive random seeks. However, there is no selectivity involved in the query process, and therefore all feature vectors have to be considered. Additionally, to the best of our knowledge, no optimization techniques have been proposed for scan-based techniques when the stored feature set is static.
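The grid-cell approximation used by the VA-file can be sketched as follows (a simplified, hypothetical Python illustration under our own assumptions: we use equi-depth quantile boundaries per dimension and return the cell address as a tuple of cell numbers rather than a packed bit string):

```python
import bisect

def build_quantiles(db, bits_per_dim):
    # One sorted list of (2^bits - 1) alpha-quantile boundaries per dimension
    n_cells = 2 ** bits_per_dim
    d = len(db[0])
    bounds = []
    for dim in range(d):
        values = sorted(v[dim] for v in db)
        bounds.append([values[(len(values) * j) // n_cells]
                       for j in range(1, n_cells)])
    return bounds

def approximate(v, bounds):
    # Replace each coordinate by the number of the grid cell it falls into
    return tuple(bisect.bisect_right(b, x) for x, b in zip(v, bounds))
```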
3. The landmark file

As we have discussed in section 2, both query processing paradigms, indexing and sequentially scanning, have advantages and disadvantages. Our solution combines the advantages of both approaches and avoids their disadvantages. It adopts the idea of the scan to read only from a single, contiguous interval of the file, and thus avoids uncontrolled random seek operations. But our solution does not read and process the complete feature set; it thereby inherits the selectivity of the indexing approach.

For static data sets, we can achieve this goal by keeping the feature set in a sort order according to an appropriate sorting key. For each query, the query processing algorithm determines the lower and upper bound of the sorting key and performs a sequential scan of the corresponding interval of the feature file. To choose the sorting key of the feature vectors, there are several possibilities, such as the projection to a single dimension. To assess the quality of a sorting key, recall the general objective: a substantial reduction of the number of scanned feature vectors. The interval into which the query is translated has to contain few feature vectors. In other words, the sorting key must distinguish as much as possible between different feature vectors. Obviously, the projection to a single dimension could fail to reach such a high distinguishing power. Therefore, our sorting key considers all relevant dimensions. A general approach is to use the Euclidean distance between a selected reference point and the feature vectors. If, for instance, the origin is selected as reference point, the sorting key corresponds to the sum of the squared dimension values. We will show in section 4 how an optimal reference point can be chosen.
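To make the idea concrete, the following sketch (our own illustration, not code from the paper) sorts a feature set by its distance to a reference point and translates a range query into an interval of that sorting key. By the triangle inequality, every vector v with dist(q, v) ≤ ε satisfies L(q) − ε ≤ L(v) ≤ L(q) + ε, where L denotes the distance to the reference point, so only that interval has to be scanned:

```python
import bisect
import math

def landmark_distance(v, ref):
    # Euclidean distance to the reference point: the sorting key
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, ref)))

def build_sorted_file(db, ref):
    # Feature vectors in ascending order of their landmark distance
    return sorted(db, key=lambda v: landmark_distance(v, ref))

def landmark_range_query(sorted_db, ref, q, eps):
    keys = [landmark_distance(v, ref) for v in sorted_db]
    lq = landmark_distance(q, ref)
    lo = bisect.bisect_left(keys, lq - eps)   # lower bound of the interval
    hi = bisect.bisect_right(keys, lq + eps)  # upper bound of the interval
    # Only the candidates in the interval need an exact distance check
    return [v for v in sorted_db[lo:hi]
            if landmark_distance(v, q) <= eps]
```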
3.1 Structure of the landmark file

In the following, we show how the idea of sorting the feature vectors can be applied to establish selectivity in scan-based query processing. A point of the feature space (not necessarily contained in the data set) is chosen as reference point. This point is called the landmark point (cf. figure 1, left). The feature vectors are sorted in ascending distance to the landmark point (the sorting key is called the landmark distance).

We keep three representations of the feature vector file. In the exact representation, we store the full geometry of all feature vectors, i.e. the floating point values of the dimensions. In the compressed representation, we adopt the idea of the VA-file and store a quantized version of all feature vectors, i.e. the address (number) of the grid cell in which the corresponding feature vector lies (typically a few bits per dimension, cf. figure 1, right). The compressed representation requires substantially less space on secondary storage, and therefore the I/O cost compared to the exact representation is approximately reduced by the compression factor. On the other hand, the position of a feature vector is not exactly known, such that a single access to the exact representation may be necessary.

[Figure 1. Landmark approximation (left, showing the landmark point and landmark distance) and quantization (right, grid cells 0-3)]

The third representation of the feature vector file is the distance file. This file does not contain information about each feature vector of the data set, but only for one vector out of i, where i is a user-provided parameter, the chunk parameter. The landmark distance of each i-th vector of the ordered feature set is stored. This distance file partitions the feature space into spherical shells which contain a constant number i of feature vectors. The landmark shells for i = 2 are depicted in figure 1, left. There are no explicit pointers or links between the different representations of the files. As the sort order of the feature vectors is identical in all three representations, and since the length of each feature vector representation is constant, this reference is implicitly given by the position of the feature vector in the file (cf. figure 2).

[Figure 2. The landmark file: distance file (i = 2), compressed representation, and exact representation]

Every query processing algorithm first considers the distance file and determines the landmark shells which potentially include query results. The corresponding part of the file with the compressed representation is scanned and candidates are collected. For every candidate which cannot be definitely classified as a query result by the compressed representation, a lookup to the exact representation is performed.

To process rare insertions, deletions, and updates, we propose to store new or updated points in a separate file which is scanned sequentially without any optimizations for query processing. Whenever this overflow area becomes too large, the complete landmark file is reconstructed from scratch.
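A minimal sketch of building the three representations (again our own illustration; `landmark_distance`, `build_quantiles` and `approximate` are the helpers from the earlier sketches, and the on-disk files are abstracted as in-memory lists):

```python
def build_landmark_file(db, landmark, i, bounds):
    # Exact representation: full vectors, sorted by ascending landmark distance
    exact = sorted(db, key=lambda v: landmark_distance(v, landmark))
    # Compressed representation: one grid-cell address per vector,
    # in the same sort order (positions link the representations implicitly)
    compressed = [approximate(v, bounds) for v in exact]
    # Distance file: the landmark distance of every i-th vector,
    # i.e. the outer radius of each spherical shell of i vectors
    distances = [landmark_distance(exact[j], landmark)
                 for j in range(i - 1, len(exact), i)]
    return exact, compressed, distances
```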

3.2 Query processing using landmarks

When the region of the query is exactly known in advance (such as in range queries), then query processing is straightforward:
• The distance file is loaded.
• Using the distance file, those shells are determined which are intersected by the query (cf. figure 3).
• For the intersected shells, the compressed representation is scanned.
• If necessary, the exact representation is accessed for each feature vector for which a decision cannot be made using the compressed representation only.

[Figure 3. Range queries on the landmark file: the query sphere and the intersected shells]

For (1-)nearest neighbor queries, the query region is not known in advance. Such queries are usually evaluated by increasing the radius of the query sphere until the nearest neighbor is covered by the query sphere. We adapt this algorithm for the landmark file (cf. figure 4):
• The distance file is loaded.
• Using the distance file, the shell containing the query vector is determined. This is the start shell.
• Beginning with the start shell, the unprocessed shell which is closest to the query vector is scanned successively. The scanned part of the compressed representation is extended, alternating at its upper and lower end. Whenever a feature vector is found which is closer than the current candidate, it is stored in a variable.
• This step is repeated until the distance between the next shell and the query vector is larger than the distance to the current candidate.

[Figure 4. Nearest neighbor queries: the query point and the accessed shells]

It is straightforward to extend this algorithm to k-nn search: instead of a single candidate, the algorithm has to maintain a list of k candidates, and the distance to the last candidate is used as the stopping condition. For both query types, the algorithm must perform lookups to the exact representation in tie situations.
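The shell-expansion search for the nearest neighbor can be condensed into the following sketch (our own simplification in Python: it scans the exact representation directly, so the compressed-representation filtering and the tie handling described above are omitted; `landmark_distance` is the helper from the earlier sketch):

```python
import bisect

def nn_query(exact, distances, landmark, i, q):
    # exact: vectors sorted by landmark distance; distances: every i-th key
    lq = landmark_distance(q, landmark)
    n_shells = (len(exact) + i - 1) // i
    # Start shell: the shell whose key interval contains L(q)
    start = min(bisect.bisect_left(distances, lq), n_shells - 1)
    lo, hi = start, start + 1            # scanned shells: [lo, hi)
    best, best_dist = None, float("inf")

    def scan(j):
        nonlocal best, best_dist
        for v in exact[j * i:(j + 1) * i]:
            d = landmark_distance(v, q)  # exact Euclidean distance to q
            if d < best_dist:
                best, best_dist = v, d

    scan(start)
    while True:
        # Key gap between L(q) and the nearest unscanned shell on each side;
        # by the triangle inequality this lower-bounds the true distance
        gap_below = lq - distances[lo - 1] if lo > 0 else float("inf")
        gap_above = distances[hi - 1] - lq if hi < n_shells else float("inf")
        if min(gap_below, gap_above) > best_dist:
            return best                  # no closer vector can exist
        if gap_below <= gap_above:       # extend at the lower end
            lo -= 1
            scan(lo)
        else:                            # extend at the upper end
            scan(hi)
            hi += 1
```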
3.3 Parameter optimization

The chunk parameter i, which determines the size of the blocks for scanning in the nearest neighbor algorithm, is obviously a critical parameter. If it is chosen too small, the compressed representation is scanned in too small portions, which induces non-negligible I/O overhead. In contrast, if it is chosen too large, many compressed feature vectors may be scanned unnecessarily. To minimize the query processing time, the parameter i must be optimized.

Optimization problems like this are often solved by cost models such as [3]. In this special case, however, the estimation of the cost may be difficult, because the intersection volumes between sphere shells and the query sphere must be computed. Therefore, we base our optimization on statistical information from previous runs of the algorithm or from tentative runs on a sample.

The information we are gathering is the mean µ and the standard deviation σ of the number a of feature vectors with a landmark distance in the range from L(q) − r to L(q) + r, where L(q) is the landmark distance of the query vector and r is the distance of the nearest neighbor of q. The number a corresponds to the number of vectors (= the number of shells) which must be processed by a query when i = 1. For a known parameter a, we obtain the following cost function:

    t(i, a) = (a + i) · t_tr + (a/i + 1) · (t_pos + t_lat)    (1)

where t_tr is the transfer time per compressed feature vector, t_pos is the positioning time, and t_lat is the latency time (rotational delay) of the disk drive. The first term of t(i, a) indicates that a compressed feature vectors are scanned during query processing, with an additional overhead of one block (i compressed feature vectors), because, on the average, half a block too much is scanned at the lower and the upper end of the scanned interval, respectively. In the second term of t(i, a), the term a/i + 1 represents the number of separate I/O requests during query processing, for each of which a seek operation with arm positioning and rotational delay is required. This cost t(i, a) can be minimized by setting the derivative to zero:
    ∂t(i, a)/∂i = 0  ⇒  i_opt = ±√(a · (t_pos + t_lat) / t_tr)    (2)

Only the positive solution is valid, and it corresponds to a minimum, because the second derivative is (constantly) positive.

This solution optimizes the chunk parameter i for a known parameter a. However, this parameter is not constant for all queries. Therefore, we have to optimize i for a variety of different parameters a occurring in different queries. It turned out in our experiments that a is normally distributed. Under this assumption, we can extend our model to optimize i for an a which is normally distributed with mean µ and standard deviation σ:

    t̃(i) = ∫₀^∞ (1 / (σ·√(2π))) · e^(−(a − µ)² / (2σ²)) · t(i, a) da    (3)

This formula weights each cost t(i, a) with the probability of the corresponding a; the average over all possible a is determined by integration with a ranging from 0 to infinity. In eq. (3), i can again be optimized by setting the derivative ∂t̃(i)/∂i to zero:

    i_opt = √(µ · (t_pos + t_lat) / t_tr)    (4)

The result is independent of the variance σ² and equal to the result of eq. (2) for a known a, where a is replaced by the mean µ.
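Equations (1) and (4) translate directly into code (a small sketch of ours; the disk parameters are illustrative assumptions, not values from the paper):

```python
import math

# Illustrative disk parameters (assumptions, not values from the paper):
T_TR = 2e-6    # transfer time per compressed feature vector (s)
T_POS = 8e-3   # arm positioning time (s)
T_LAT = 4e-3   # rotational delay (s)

def cost(i, a, t_tr=T_TR, t_pos=T_POS, t_lat=T_LAT):
    # Eq. (1): scan cost plus seek cost for a/i + 1 separate I/O requests
    return (a + i) * t_tr + (a / i + 1) * (t_pos + t_lat)

def optimal_chunk(mu, t_tr=T_TR, t_pos=T_POS, t_lat=T_LAT):
    # Eq. (4): i_opt depends only on the mean of a, not on its variance
    return math.sqrt(mu * (t_pos + t_lat) / t_tr)
```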
4. Determining the landmark point

In this section, we show how a suitable landmark point can be chosen in order to further optimize query processing using the landmark file. Figure 5 shows a feature set which is skewed from northwest to southeast. On the left side, we choose the landmark point L1, which is close to the origin. Considering the landmark distances of the query point, we recognize a very low variance. Even for a minimal chunk parameter i = 1, the landmark shells which are intersected by the depicted query cover almost the complete feature set (up to 2 points). On the right side of figure 5, in contrast, the landmark point L2 is chosen such that it lies somewhere on the axis of the major skew direction of the feature set, in the southeastern corner of the feature space. We recognize a high variance of the landmark distances, and the landmark shells intersected by the query region cover only few points for reasonable values of i (for the minimum chunk size i = 1, the query region would cover only one vector, the actual nearest neighbor). Obviously, the variance of the landmark distances is inherited from the variance of the feature set: whenever the landmark has a position on an axis of high variance of the feature set, the variance of the landmark distances is high.

[Figure 5. A good (L2) and a bad (L1) landmark point]
This is captured mathematically by the concept of principal component analysis (PCA), which finds the direction in which a feature set exhibits the highest variance [23]. For PCA, the means vector x̄ = (1/N) · Σᵢ xᵢ and the covariance matrix C = (1/N) · Σᵢ (xᵢ − x̄)(xᵢ − x̄)ᵀ are determined, and C is diagonalized by some orthonormal matrix Φ, i.e. ΦᵀCΦ = diag(λ₁, ..., λ_d). The matrix Φ can be determined by singular value decomposition [23]; the λᵢ are the eigenvalues, and the column vectors of Φ = [φ₁, ..., φ_d] are the eigenvectors of C. The eigenvector φᵢ corresponding to the largest eigenvalue λᵢ indicates the direction of the highest variance of the feature set. Therefore, the landmark point should be chosen from the straight line g = x̄ + ν · φᵢ. To maximize the variance of the landmark distances, the landmark point should not be chosen inside the feature space.
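Under these definitions, determining the landmark direction reduces to an eigendecomposition of the covariance matrix. A sketch using NumPy, where the distance `nu` by which the landmark is pushed outside the feature space is our own illustrative choice:

```python
import numpy as np

def choose_landmark(X, nu=10.0):
    # X: N x d array of feature vectors
    mean = X.mean(axis=0)
    C = np.cov(X, rowvar=False, bias=True)   # covariance matrix C (1/N scaling)
    eigvals, eigvecs = np.linalg.eigh(C)     # C is symmetric
    phi = eigvecs[:, np.argmax(eigvals)]     # direction of highest variance
    # Walk far along the principal axis so the landmark lies
    # outside the feature space, as recommended in the text
    return mean + nu * phi
```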
5. Experimental evaluation

In order to demonstrate the practical relevance of our technique, we performed an extensive experimental evaluation using the following real-world feature sets:
• CAD repository: 16-dimensional Fourier points corresponding to contours of 1,200,000 industrial parts used in the S3-System [4].
• Image collection: 64-dimensional color histograms corresponding to 112,000 images from TV snapshots.

The CAD repository we use is a collection of contours of industrial parts (e.g. clips holding cables) from a supplier for automobile companies. The repository is used to reduce the cost of developing and producing new parts by maximizing the reuse of existing parts. This is achieved by applying similarity queries, in particular k-nn queries. The repository can be seen as static, since there are no update or deletion operations and very few insertion operations (less than 5 per month). The image collection is comparable to other image collections, e.g. the Corel image collection. Each image was reduced to 64 colors, and the numbers of pixels per color in an image build the 64-dimensional feature vector. Color histograms are successfully used for similarity search in image libraries, e.g. [22].

In the experimental evaluation, we compare our technique to the VA-file, which has been shown to offer excellent performance for high-dimensional feature sets (cf. section 2). We focus on k-nn queries since they generally have more practical impact. The reason is that, from a user's point of view, range queries are often difficult to apply when a reasonable value for ε is hard to determine. Additionally, the total query time of a k-nn query is an upper bound for a range query with an ε value that is equal to the distance of the k-th nearest neighbor. Recall that the landmark file has no additional seek cost for range queries, since it scans the intersected shells in one pass and does not alternate the scanning direction as in the case of k-nn queries. The overall overhead of performing k-nn queries compared to range queries is therefore generally higher for the landmark file than for the VA-file. Thus, we can safely assume that the performance gain of the landmark file compared to the VA-file is even higher for range queries.
We performed all experiments on an HP C160 under HP-UX 10.20. The implementation language for our technique and the VA-file was C++.

In our first experiment, we evaluate our solution for the important problem of optimizing the chunk parameter i (cf. section 3.3). We performed 1-nn queries and 10-nn queries on the image collection and varied the chunk size from 500 to 24,000 points. The total query times with respect to the chunk size are depicted in figure 6. Obviously, there exists an optimum of the chunk size for both query parameters, which lies at approximately i_opt = 4350. Applying the formula derived in section 3.3, we obtain the chunk parameter i = 4380, which matches almost perfectly. We observed the same effect when using the CAD repository (not depicted). Therefore, we applied our formula for the choice of the chunk parameter i in all following experiments.

[Figure 6. Optimal chunk parameter i, image collection: total query time (sec) vs. chunk size for 1-nn and 10-nn queries; the chunk parameter i calculated by our formula is marked]

Next, we investigate the performance of the landmark file and the VA-file with respect to the query parameter k. We increased k from 1 to 50 and measured the total query times. Figure 7 depicts the results for the CAD repository. We observe that the landmark file clearly outperforms the VA-file for all values of k. For small parameters, e.g. for k = 1, the landmark file exhibits only 22% of the query time of the VA-file. When k increases, the performance gain of the landmark file decreases. The reason is that with increasing query size, the distance to the k-th nearest neighbor increases, and thus more and more landmark shells are intersected. However, even for high values, e.g. for k = 50, our approach saves 31% of the query time compared to the VA-file.

[Figure 7. Query size on the CAD repository: total query time (sec) vs. query size k for the VA-file and the landmark file]

We performed the same experiment on the image collection (cf. figure 8). Again, our approach substantially outperforms the VA-file for all values of k and saves between 45% and 73% of the total query time of the VA-file.

As we have discussed in section 4, the selection of the landmark point is a critical task. In order to evaluate our solution for this problem, we additionally performed k-nn queries on the image collection for randomly selected landmark points. In particular, we measured the total query times of the landmark file using 10 random landmark points, and by calculating the mean we determined the average total query time as depicted in figure 8 (denoted as 'lm file (random lm. pt.)'). We observe that the query performance of the landmark file increases substantially when using the formally derived landmark point compared to using a random landmark point. For k = 10, for instance, the total query time of the landmark file using a random landmark point is 2.22 times the total query time using our formally derived landmark point. Another important observation is the fact that even with a random landmark point, the landmark file consistently outperforms the VA-file.

[Figure 8. Query size on the image collection: total query time (sec) vs. query size k for the VA-file, the landmark file with a random landmark point, and the landmark file]

Finally, we investigate the performance of our approach with respect to the size N of the feature set. We used the CAD repository and increased the number N of feature vectors from 100,000 to 1,200,000. We measured the total query time of 1-nn, 10-nn, and 50-nn queries for both access methods. The results are presented in figure 9. For small values of N, the landmark file and the VA-file perform almost equally. When the size of the feature set increases, however, this situation changes drastically. For N = 400,000 and k = 10, for instance, the landmark file already saves 33% of the total query time compared to the VA-file, and this ratio increases up to 46% for N = 1,200,000. For k = 1 and the complete CAD repository, the landmark file even saves 78% of the total query time of the VA-file, which corresponds to a speed-up factor of 4.53.
References

[5] Böhm C., Kriegel H.-P.: 'Efficient Bulk Loading of Large High-Dimensional Indexes', Proc. Int. Conf. on Data Warehousing and Knowledge Discovery, 1999, pp. 251-260.
[6] Berchtold S., Keim D., Kriegel H.-P.: 'The X-tree: An Index Structure for High-Dimensional Data', Proc. Int. Conf. on Very Large Data Bases, 1996, pp. 28-39.
[7] Böhm C.: 'Efficiently Indexing High-Dimensional Data Spaces', Ph.D. thesis, University of Munich, 1998.
[8] Faloutsos C., Barber R., Flickner M., Hafner J., et al.: 'Efficient and Effective Querying by Image Content', Journal of Intelligent Information Systems, 1994, Vol. 3, pp. 231-262.
[9] Flickner M., Sawhney H., Niblack W., Ashley J., Huang Q., Dom B., Gorkani M., Hafner J., Lee D., Petkovic D., Steele D., Yanker P.: 'Query by Image and Video Content: The QBIC System', Computer, Vol. 28, September 1995, pp. 23-32.
[10] Fox E. A., Akscyn R. M., Furuta R. K., Leggett J. J.: 'Digital Libraries: Introduction to the Special Section', Communications of the ACM, 38(4), 1995, pp. 22-28.
[11] Faloutsos C., Ranganathan M., Manolopoulos Y.: 'Fast Subsequence Matching in Time-Series Databases', Proc. ACM SIGMOD Int. Conf. on Management of Data, 1994, pp. 419-429.
[12] Gaede V., Günther O.: 'Multidimensional Access Methods', ACM Computing Surveys, Vol. 30, No. 2, 1998, pp. 170-231.
[13] Gary J. E., Mehrotra R.: 'Similar Shape Retrieval Using a Structural Feature Index', Information Systems, Vol. 18, No. 7, 1993.
[14] Guttman A.: 'R-trees: A Dynamic Index Structure for Spatial Searching', Proc. ACM SIGMOD Int. Conf. on Management of Data, 1984, pp. 47-57.
[15] Jagadish H. V.: 'A Retrieval Technique for Similar Shapes', Proc. ACM SIGMOD Int. Conf. on Management of Data, 1991, pp. 208-217.
[16] Kamel I., Faloutsos C.: 'Hilbert R-tree: An Improved R-tree Using Fractals', Proc. Int. Conf. on Very Large Data Bases, 1994, pp. 500-509.
[17] Klavans J.: 'Data Bases in Digital Libraries: Where Computer Science and Information Management Meet', Proc. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1998, pp. 224-226.
[18] Kriegel H.-P., Seidl T.: 'Approximation-Based Similarity Search for 3-D Surface Segments', GeoInformatica Int. Journal on Advances of Computer Science for Geographic Information Systems, 1998, Vol. 2, No. 2, pp. 113-147.
[19] Kriegel H.-P., Schmidt T., Seidl T.: '3D Similarity Search by Shape Approximation', Proc. 5th Int. Symposium on Large Spatial Databases, 1997, pp. 11-28.
[20] Lin K., Jagadish H. V., Faloutsos C.: 'The TV-tree: An Index Structure for High-Dimensional Data', Journal of Very Large Data Bases, Vol. 3, pp. 517-542, 1994.
[21] Mehrotra R., Gary J. E.: 'Feature-Based Retrieval of Similar Shapes', Proc. Int. Conf