`
`Computer Vision and Image Understanding 110 (2008) 346–359
`
`www.elsevier.com/locate/cviu
`
`Speeded-Up Robust Features (SURF)
`
`Herbert Bay a, Andreas Ess a,*, Tinne Tuytelaars b, Luc Van Gool a,b
`
`a ETH Zurich, BIWI, Sternwartstrasse 7, CH-8092 Zurich, Switzerland
`b K.U. Leuven, ESAT-PSI, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
`
`Received 31 October 2006; accepted 5 September 2007
`Available online 15 December 2007
`
`Abstract
`
`This article presents a novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features).
`SURF approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness,
`yet can be computed and compared much faster.
`This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors
`and descriptors (specifically, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by sim-
`plifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps.
`The paper encompasses a detailed description of the detector and descriptor and then explores the effects of the most important param-
`eters. We conclude the article with SURF’s application to two challenging, yet converse goals: camera calibration as a special case of image
`registration, and object recognition. Our experiments underline SURF’s usefulness in a broad range of topics in computer vision.
© 2007 Elsevier Inc. All rights reserved.
`
`Keywords: Interest points; Local features; Feature description; Camera calibration; Object recognition
`
`1. Introduction
`
`The task of finding point correspondences between two
`images of the same scene or object is part of many com-
`puter vision applications. Image registration, camera cali-
`bration, object recognition, and image retrieval are just a
`few.
`The search for discrete image point correspondences can
`be divided into three main steps. First, ‘interest points’ are
`selected at distinctive locations in the image, such as cor-
`ners, blobs, and T-junctions. The most valuable property
`of an interest point detector is its repeatability. The repeat-
`ability expresses the reliability of a detector for finding the
`same physical interest points under different viewing condi-
`tions. Next, the neighbourhood of every interest point is
`represented by a feature vector. This descriptor has to be
`distinctive and at the same time robust to noise, detection
`
`* Corresponding author.
`E-mail address: aess@vision.ee.ethz.ch (A. Ess).
`
`1077-3142/$ - see front matter Ó 2007 Elsevier Inc. All rights reserved.
`doi:10.1016/j.cviu.2007.09.014
`
`displacements and geometric and photometric deforma-
`tions. Finally, the descriptor vectors are matched between
`different images. The matching is based on a distance
`between the vectors, e.g. the Mahalanobis or Euclidean dis-
`tance. The dimension of the descriptor has a direct impact
on the time this takes, and fewer dimensions are desirable for
`fast interest point matching. However, lower dimensional
`feature vectors are in general less distinctive than their
`high-dimensional counterparts.
`It has been our goal to develop both a detector and
`descriptor that, in comparison to the state-of-the-art, are
`fast to compute while not sacrificing performance. In order
to succeed, one has to strike a balance between the above
requirements, such as simplifying the detection scheme while
`keeping it accurate, and reducing the descriptor’s size while
`keeping it sufficiently distinctive.
`A wide variety of detectors and descriptors have already
`been proposed in the literature (e.g. [21,24,27,39,25]). Also,
`detailed comparisons and evaluations on benchmarking
`datasets have been performed [28,30,31]. Our fast detector
`and descriptor,
`called SURF (Speeded-Up Robust
`
`
`Features), was introduced in [4]. It is built on the insights
`gained from this previous work. In our experiments on
`these benchmarking datasets, SURF’s detector and
`descriptor are not only faster, but the former is also more
`repeatable and the latter more distinctive.
`We focus on scale and in-plane rotation-invariant detec-
`tors and descriptors. These seem to offer a good compromise
`between feature complexity and robustness to commonly
`occurring photometric deformations. Skew, anisotropic
`scaling, and perspective effects are assumed to be second
`order effects, that are covered to some degree by the overall
`robustness of the descriptor. Note that the descriptor can
`be extended towards affine-invariant regions using affine
`normalisation of the ellipse (cf. [31]), although this will have
`an impact on the computation time. Extending the detector,
`on the other hand, is less straightforward. Concerning the
`photometric deformations, we assume a simple linear model
with a bias (offset) and contrast change (scale factor). Neither detector nor descriptor uses colour information.
`The article is structured as follows. In Section 2, we give
a review of previous work in interest point detection and
`description. In Section 3, we describe the strategy applied
`for fast and robust interest point detection. The input
`image is analysed at different scales in order to guarantee
`invariance to scale changes. The detected interest points
`are provided with a rotation and scale-invariant descriptor
`in Section 4. Furthermore, a simple and efficient first-line
`indexing technique, based on the contrast of the interest
`point with its surrounding, is proposed.
`In Section 5, some of the available parameters and their
`effects are discussed, including the benefits of an upright
`version (not invariant to image rotation). We also investi-
`gate SURF’s performance in two important application
`scenarios. First, we consider a special case of image regis-
`tration, namely the problem of camera calibration for 3D
`reconstruction. Second, we will explore SURF’s applica-
`tion to an object recognition experiment. Both applications
`highlight SURF’s benefits in terms of speed and robustness
`as opposed to other strategies. The article is concluded in
`Section 6.
`
`2. Related work
`
`2.1. Interest point detection
`
`The most widely used detector is probably the Harris
`corner detector [15], proposed back in 1988. It is based
`on the eigenvalues of the second moment matrix. However,
Harris corners are not scale invariant. Lindeberg [21] introduced the concept of automatic scale selection. This makes it
possible to detect interest points in an image, each with their own
`characteristic scale. He experimented with both the deter-
`minant of the Hessian matrix as well as the Laplacian
`(which corresponds to the trace of the Hessian matrix) to
`detect blob-like structures. Mikolajczyk and Schmid [26]
`refined this method, creating robust and scale-invariant
`feature detectors with high repeatability, which they coined
`
`Harris-Laplace and Hessian-Laplace. They used a (scale-
`adapted) Harris measure or the determinant of the Hessian
`matrix to select the location, and the Laplacian to select the
`scale. Focusing on speed, Lowe [23] proposed to approxi-
`mate the Laplacian of Gaussians (LoG) by a Difference
`of Gaussians (DoG) filter.
`Several other scale-invariant interest point detectors
`have been proposed. Examples are the salient region detec-
`tor, proposed by Kadir and Brady [17], which maximises
`the entropy within the region, and the edge-based region
`detector proposed by Jurie and Schmid [16]. They seem less
`amenable to acceleration though. Also several affine-invari-
`ant feature detectors have been proposed that can cope
`with wider viewpoint changes. However, these fall outside
`the scope of this article.
`From studying the existing detectors and from published
`comparisons [29,30], we can conclude that Hessian-based
`detectors are more stable and repeatable than their Harris-
`based counterparts. Moreover, using the determinant of
`the Hessian matrix rather than its trace (the Laplacian)
`seems advantageous, as it fires less on elongated, ill-localised
`structures. We also observed that approximations like the
`DoG can bring speed at a low cost in terms of lost accuracy.
`
`2.2. Interest point description
`
`An even larger variety of feature descriptors has been
proposed, like Gaussian derivatives [11], moment invariants [32], complex features [1], steerable filters [12],
phase-based local features [6], and descriptors representing
`the distribution of smaller-scale features within the interest
`point neighbourhood. The latter, introduced by Lowe [24],
`have been shown to outperform the others [28]. This can be
`explained by the fact that they capture a substantial
`amount of information about the spatial intensity patterns,
`while at the same time being robust to small deformations
`or localisation errors. The descriptor in [24], called SIFT
`for short, computes a histogram of local oriented gradients
`around the interest point and stores the bins in a 128D vec-
tor (8 orientation bins for each of 4 × 4 location bins).
`Various refinements on this basic scheme have been pro-
`posed. Ke and Sukthankar [18] applied PCA on the gradi-
`ent image around the detected interest point. This PCA-
`SIFT yields a 36D descriptor which is fast for matching,
`but proved to be less distinctive than SIFT in a second
`comparative study by Mikolajczyk and Schmid [30]; and
`applying PCA slows down feature computation. In the
`same paper [30], the authors proposed a variant of SIFT,
`called GLOH, which proved to be even more distinctive
`with the same number of dimensions. However, GLOH is
`computationally more expensive as it uses again PCA for
`data compression.
`The SIFT descriptor still seems the most appealing
`descriptor for practical uses, and hence also the most
`widely used nowadays. It is distinctive and relatively fast,
`which is crucial for on-line applications. Recently, Se
`et al. [37] implemented SIFT on a Field Programmable
`
`
`Gate Array (FPGA) and improved its speed by an order of
`magnitude. Meanwhile, Grabner et al. [14] also used inte-
`gral images to approximate SIFT. Their detection step is
`based on difference-of-mean (without interpolation), their
`description step on integral histograms. They achieve
`about the same speed as we do (though the description step
`is constant in speed), but at the cost of reduced quality
`compared to SIFT. Generally, the high dimensionality of
`the descriptor is a drawback of SIFT at the matching step.
`For on-line applications relying only on a regular PC, each
`one of the three steps (detection, description, matching) has
`to be fast.
`An entire body of work is available on speeding up the
matching step. All of them come at the expense of an
approximate matching. Methods include the best-bin-first proposed by Lowe [24], balltrees [35], vocabulary
`trees [34], locality sensitive hashing [9], or redundant bit
`vectors [13]. Complementary to this, we suggest the use
`of the Hessian matrix’s trace to significantly increase the
`matching speed. Together with the descriptor’s low dimen-
`sionality, any matching algorithm is bound to perform
`faster.
`
`3. Interest point detection
`
`Our approach for interest point detection uses a very
`basic Hessian matrix approximation. This lends itself to
`the use of integral images as made popular by Viola and
`Jones [41], which reduces the computation time drastically.
`Integral images fit in the more general framework of box-
`lets, as proposed by Simard et al. [38].
`
`3.1. Integral images
`
`Fig. 1. Using integral images, it takes only three additions and four
`memory accesses to calculate the sum of intensities inside a rectangular
`region of any size.
`
determinant of the Hessian also for the scale selection, as
done by Lindeberg [21].
Given a point x = (x, y) in an image I, the Hessian
matrix H(x, σ) in x at scale σ is defined as follows:

    H(x, σ) = [ Lxx(x, σ)  Lxy(x, σ) ]
              [ Lxy(x, σ)  Lyy(x, σ) ],        (2)

where Lxx(x, σ) is the convolution of the Gaussian second
order derivative ∂²g(σ)/∂x² with the image I in point x, and
similarly for Lxy(x, σ) and Lyy(x, σ).
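As a rough numerical illustration of the entries of Eq. (2) (not the paper's box-filter implementation, which is introduced below), one can use derivative-of-Gaussian filtering in SciPy; the function names here are our own:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_at_scale(img, sigma):
    """Entries of Eq. (2): convolutions of the image with Gaussian
    second order derivatives at scale sigma, via SciPy's per-axis
    derivative orders (axis 1 = x, axis 0 = y)."""
    Lxx = gaussian_filter(img, sigma, order=(0, 2))  # d^2/dx^2
    Lyy = gaussian_filter(img, sigma, order=(2, 0))  # d^2/dy^2
    Lxy = gaussian_filter(img, sigma, order=(1, 1))  # d^2/(dx dy)
    return Lxx, Lyy, Lxy

def det_hessian(img, sigma):
    """Blob response: determinant of the Hessian at scale sigma."""
    Lxx, Lyy, Lxy = hessian_at_scale(img, sigma)
    return Lxx * Lyy - Lxy ** 2
```

At the centre of an isolated bright blob both Lxx and Lyy are negative and Lxy vanishes, so the determinant is positive there.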
Gaussians are optimal for scale-space analysis [19,20],
but in practice they have to be discretised and cropped
(Fig. 2, left half). This leads to a loss in repeatability
under image rotations around odd multiples of π/4. This
weakness holds for Hessian-based detectors in general.
Fig. 3 shows the repeatability rate of two detectors
based on the Hessian matrix for pure image rotation.
The repeatability attains a maximum around multiples
of π/2. This is due to the square shape of the filter. Nevertheless, the detectors still perform well, and the slight
decrease in performance does not outweigh the advantage of fast convolutions brought by the discretisation
and cropping. As real filters are non-ideal in any case,
and given Lowe's success with his LoG approximations,
we push the approximation for the Hessian matrix even
further with box filters (right half of Fig. 2).
These approximate second order Gaussian derivatives
and can be evaluated at a very low computational cost
`
`Fig. 2. Left to right: The (discretised and cropped) Gaussian second order
`partial derivative in y- (Lyy) and xy-direction (Lxy), respectively; our
`approximation for the second order Gaussian partial derivative in y- (Dyy)
`and xy-direction (Dxy). The grey regions are equal to zero.
`
In order to make the article more self-contained, we
briefly discuss the concept of integral images. They allow
for fast computation of box type convolution filters. The
entry of an integral image I_Σ(x) at a location x = (x, y)ᵀ
represents the sum of all pixels in the input image I within
a rectangular region formed by the origin and x:

    I_Σ(x) = Σ_{i=0}^{i≤x} Σ_{j=0}^{j≤y} I(i, j).        (1)
`
`Once the integral image has been computed, it takes
`three additions to calculate the sum of the intensities over
`any upright, rectangular area (see Fig. 1). Hence, the calcu-
`lation time is independent of its size. This is important in
`our approach, as we use big filter sizes.
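A minimal NumPy sketch of this idea (the names `integral_image` and `box_sum` are illustrative, not from the paper):

```python
import numpy as np

def integral_image(img):
    """Entry (y, x) holds the sum of all pixels above and to the
    left of (y, x), inclusive -- Eq. (1)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1+1, x0:x1+1] from four lookups on the
    integral image and three additions/subtractions (Fig. 1)."""
    s = ii[y1, x1]
    if y0 > 0:
        s -= ii[y0 - 1, x1]
    if x0 > 0:
        s -= ii[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        s += ii[y0 - 1, x0 - 1]
    return s
```

The cost of `box_sum` is independent of the rectangle's size, which is exactly what makes the large filters of the following sections affordable.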
`
`3.2. Hessian matrix-based interest points
`
`We base our detector on the Hessian matrix because of
`its good performance in accuracy. More precisely, we
`detect blob-like structures at locations where the determi-
`nant is maximum. In contrast to the Hessian-Laplace
`detector by Mikolajczyk and Schmid [26], we rely on the
`
`
`3.3. Scale space representation
`
`Interest points need to be found at different scales, not
`least because the search of correspondences often requires
`their comparison in images where they are seen at different
`scales. Scale spaces are usually implemented as an image
`pyramid. The images are repeatedly smoothed with a
`Gaussian and then sub-sampled in order to achieve a
`higher level of the pyramid. Lowe [24] subtracts these pyr-
`amid layers in order to get the DoG (Difference of Gaussi-
`ans) images where edges and blobs can be found.
`Due to the use of box filters and integral images, we do
`not have to iteratively apply the same filter to the output of
`a previously filtered layer, but instead can apply box filters
`of any size at exactly the same speed directly on the original
`image and even in parallel (although the latter is not
`exploited here). Therefore, the scale space is analysed by
`up-scaling the filter size rather than iteratively reducing
the image size (Fig. 4). The output of the 9 × 9 filter, introduced in the previous section, is considered as the initial scale
layer, to which we will refer as scale s = 1.2 (approximating
Gaussian derivatives with σ = 1.2). The following layers
`are obtained by filtering the image with gradually bigger
`masks, taking into account the discrete nature of integral
`images and the specific structure of our filters.
`Note that our main motivation for this type of sampling
`is its computational efficiency. Furthermore, as we do not
`have to downsample the image, there is no aliasing. On
`the downside, box filters preserve high-frequency compo-
`nents that can get lost in zoomed-out variants of the same
`scene, which can limit scale-invariance. This was however
`not noticeable in our experiments.
`The scale space is divided into octaves. An octave repre-
`sents a series of filter response maps obtained by convolv-
`ing the same input image with a filter of increasing size. In
`total, an octave encompasses a scaling factor of 2 (which
`implies that one needs to more than double the filter size,
`see below). Each octave is subdivided into a constant num-
`ber of scale levels. Due to the discrete nature of integral
`images, the minimum scale difference between two subse-
`quent scales depends on the length l0 of the positive or neg-
`ative lobes of the partial second order derivative in the
`direction of derivation (x or y), which is set to a third of
the filter size length. For the 9 × 9 filter, this length l0 is
`3. For two successive levels, we must increase this size by
`
`Fig. 4. Instead of iteratively reducing the image size (left), the use of
`integral images allows the up-scaling of the filter at constant cost (right).
`
Fig. 3. Top: Repeatability score for image rotation of up to 180°. Hessian-
based detectors have in general a lower repeatability score for angles
around odd multiples of π/4. Bottom: Sample images from the sequence that
was used. Fast-Hessian is the more accurate version of our detector (FH-
15), as explained in Section 3.3.
`
using integral images. The calculation time therefore is
independent of the filter size. As shown in Section 5
and Fig. 3, the performance is comparable or better
than with the discretised and cropped Gaussians.
The 9 × 9 box filters in Fig. 2 are approximations of a
Gaussian with σ = 1.2 and represent the lowest scale (i.e.
highest spatial resolution) for computing the blob response
maps. We will denote them by Dxx, Dyy, and Dxy. The
`weights applied to the rectangular regions are kept simple
for computational efficiency. This yields

    det(H_approx) = Dxx Dyy − (w Dxy)².        (3)

The relative weight w of the filter responses is used to balance the expression for the Hessian's determinant. This is
needed for the energy conservation between the Gaussian
kernels and the approximated Gaussian kernels,

    w = ( |Lxy(1.2)|_F |Dyy(9)|_F ) / ( |Lyy(1.2)|_F |Dxy(9)|_F ) = 0.912… ≈ 0.9,        (4)

where |x|_F is the Frobenius norm. Notice that for theoretical correctness, the weighting changes depending on the
scale. In practice, we keep this factor constant, as this did
not have a significant impact on the results in our
experiments.
`Furthermore, the filter responses are normalised with
`respect to their size. This guarantees a constant Frobenius
`norm for any filter size, an important aspect for the scale
`space analysis as discussed in the next section.
`The approximated determinant of the Hessian repre-
`sents the blob response in the image at location x. These
`responses are stored in a blob response map over different
`scales, and local maxima are detected as explained in Sec-
`tion 3.4.
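The box-filter responses and the blob response of Eq. (3) can be sketched as follows. For clarity this uses plain convolution rather than integral images, and the exact lobe placement of the 9 × 9 kernels is our reading of Fig. 2, so treat it as a sketch:

```python
import numpy as np
from scipy.ndimage import convolve

def surf_box_kernels():
    """9x9 box-filter approximations Dyy and Dxy (Fig. 2).
    Grey regions are zero; lobe positions are our interpretation."""
    dyy = np.zeros((9, 9))
    dyy[0:3, 2:7] = 1      # top positive lobe (5 wide, 3 tall)
    dyy[3:6, 2:7] = -2     # central negative band
    dyy[6:9, 2:7] = 1      # bottom positive lobe
    dxy = np.zeros((9, 9))
    dxy[1:4, 1:4] = 1      # four 3x3 lobes around the centre
    dxy[1:4, 5:8] = -1
    dxy[5:8, 1:4] = -1
    dxy[5:8, 5:8] = 1
    return dyy, dxy

def hessian_response(img, w=0.9):
    """Blob response map det(H_approx) = Dxx*Dyy - (w*Dxy)^2, Eq. (3),
    with responses normalised by the filter area (Section 3.2)."""
    dyy_k, dxy_k = surf_box_kernels()
    dxx = convolve(img, dyy_k.T) / 81.0
    dyy = convolve(img, dyy_k) / 81.0
    dxy = convolve(img, dxy_k) / 81.0
    return dxx * dyy - (w * dxy) ** 2
```

In the actual detector each kernel reduces to a handful of `box_sum` calls on the integral image, so the cost stays constant for every filter size.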
`
`
`a minimum of 2 pixels (1 pixel on every side) in order to
`keep the size uneven and thus ensure the presence of the
`central pixel. This results in a total increase of the mask size
`by 6 pixels (see Fig. 5). Note that for dimensions different
`from l0 (e.g. the width of the central band for the vertical
`filter in Fig. 5), rescaling the mask introduces rounding-
`off errors. However, since these errors are typically much
`smaller than l0, this is an acceptable approximation.
The construction of the scale space starts with the 9 × 9
filter, which calculates the blob response of the image for
the smallest scale. Then, filters with sizes 15 × 15,
21 × 21, and 27 × 27 are applied, by which even more than
a scale change of two has been achieved. But this is needed,
`as a 3D non-maximum suppression is applied both spa-
`tially and over the neighbouring scales. Hence, the first
`and last Hessian response maps in the stack cannot contain
`such maxima themselves, as they are used for reasons of
comparison only. Therefore, after interpolation (see Section 3.4), the smallest possible scale is σ = 1.2 × (12/9) = 1.6, corresponding to a filter size of 12 × 12, and the highest is
σ = 1.2 × (24/9) = 3.2. For more details, we refer to [2].
`Similar considerations hold for the other octaves. For
`each new octave, the filter size increase is doubled (going
`from 6–12 to 24–48). At the same time, the sampling inter-
`vals for the extraction of the interest points can be doubled
`as well for every new octave. This reduces the computation
`time and the loss in accuracy is comparable to the image
`sub-sampling of the traditional approaches. The filter sizes
`for the second octave are 15, 27, 39, 51. A third octave is com-
`puted with the filter sizes 27, 51, 75, 99 and, if the original
`image size is still larger than the corresponding filter sizes,
`the scale space analysis is performed for a fourth octave,
`
`using the filter sizes 51, 99, 147, and 195. Fig. 6 gives an over-
`view of the filter sizes for the first three octaves. Further
`octaves can be computed in a similar way. In typical scale-
`space analysis however, the number of detected interest
`points per octave decays very quickly, cf. Fig. 7.
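The octave structure described above can be condensed into a small helper (hypothetical names; the sizes follow the paper's 9–15–21–27, 15–27–39–51, … progression):

```python
def surf_filter_sizes(n_octaves=4, base=9, levels=4):
    """Filter side lengths per octave: the size step doubles each
    octave (6, 12, 24, 48) and each octave starts at the second
    filter of the previous one, so the octaves overlap."""
    sizes, start, step = [], base, 6
    for _ in range(n_octaves):
        sizes.append([start + i * step for i in range(levels)])
        start = sizes[-1][1]   # overlap: next octave starts here
        step *= 2
    return sizes

def filter_scale(size):
    """Scale associated with a filter of given side length:
    s = 1.2 * size / 9 (the 9x9 filter corresponds to s = 1.2)."""
    return 1.2 * size / 9.0
```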
`The large scale changes, especially between the first fil-
`ters within these octaves (from 9 to 15 is a change of
`1.7), renders the sampling of scales quite crude. Therefore,
`we have also implemented a scale space with a finer sam-
`pling of the scales. This computes the integral image on
`the image up-scaled by a factor of 2, and then starts the
`first octave by filtering with a filter of size 15. Additional
`filter sizes are 21, 27, 33, and 39. Then a second octave
`starts, again using filters which now increase their sizes
`by 12 pixels, after which a third and fourth octave follow.
`Now the scale change between the first two filters is only
`1.4 (21/15). The lowest scale for the accurate version that
`can be detected through quadratic
`interpolation is
`9Þ=2 ¼ 1:2.
`s ¼ ð1:218
As the Frobenius norm remains constant for our filters
at any size, they are already scale normalised, and no further weighting of the filter response is required; for more
information on that topic, see [22].
`
`3.4. Interest point localisation
`
`In order to localise interest points in the image and over
scales, a non-maximum suppression in a 3 × 3 × 3 neighbourhood is applied. Specifically, we use a fast variant
`introduced by Neubeck and Van Gool [33]. The maxima
`of the determinant of the Hessian matrix are then interpo-
`lated in scale and image space with the method proposed
`by Brown and Lowe [5].
`Scale space interpolation is especially important in our
`case, as the difference in scale between the first layers of every
`octave is relatively large. Fig. 8 shows an example of the
`detected interest points using our ‘Fast-Hessian’ detector.
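A naive 3 × 3 × 3 non-maximum suppression (unoptimised, unlike the Neubeck–Van Gool variant used in the paper) might look like:

```python
import numpy as np

def nms_3x3x3(responses, threshold):
    """responses: stack of blob-response maps, shape (scale, y, x).
    Keep points that exceed the threshold and strictly dominate all
    26 neighbours in scale and space.  As in the paper, the first and
    last layers (and image borders) serve for comparison only."""
    peaks = []
    S, H, W = responses.shape
    for s in range(1, S - 1):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                v = responses[s, y, x]
                if v < threshold:
                    continue
                patch = responses[s-1:s+2, y-1:y+2, x-1:x+2]
                if v >= patch.max() and (patch == v).sum() == 1:
                    peaks.append((s, y, x))
    return peaks
```

The surviving maxima would then be interpolated in scale and image space as described above.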
`
`4. Interest point description and matching
`
`Our descriptor describes the distribution of the intensity
`content within the interest point neighbourhood, similar to
`
`Fig. 5. Filters Dyy (top) and Dxy (bottom) for two successive scale levels
(9 × 9 and 15 × 15). The length of the dark lobe can only be increased by
`an even number of pixels in order to guarantee the presence of a central
`pixel (top).
`
`Fig. 6. Graphical representation of the filter side lengths for three different
`octaves. The logarithmic horizontal axis represents the scales. Note that
`the octaves are overlapping in order to cover all possible scales seamlessly.
`
`
`4.1. Orientation assignment
`
`In order to be invariant to image rotation, we identify a
`reproducible orientation for the interest points. For that
`purpose, we first calculate the Haar wavelet responses in
`x and y direction within a circular neighbourhood of radius
`6s around the interest point, with s the scale at which the
`interest point was detected. The sampling step is scale
dependent and chosen to be s. In keeping with the rest, the
size of the wavelets is also scale dependent and set to a side
`length of 4s. Therefore, we can again use integral images
`for fast filtering. The used filters are shown in Fig. 9. Only
`six operations are needed to compute the response in x or y
`direction at any scale.
`Once the wavelet responses are calculated and weighted
`with a Gaussian (r ¼ 2s) centred at the interest point, the
`responses are represented as points in a space with the hor-
`izontal response strength along the abscissa and the vertical
response strength along the ordinate. The dominant orientation is estimated by calculating the sum of all responses
within a sliding orientation window of size π/3, see Fig. 10.
`The horizontal and vertical responses within the window
`are summed. The two summed responses then yield a local
`orientation vector. The longest such vector over all win-
`dows defines the orientation of the interest point. The size
`of the sliding window is a parameter which had to be cho-
`sen carefully. Small sizes fire on single dominating gradi-
`ents,
`large sizes tend to yield maxima in vector length
`that are not outspoken. Both result in a misorientation of
`the interest point.
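The sliding-window search can be sketched as follows, assuming the per-sample responses dx, dy (NumPy arrays) have already been Gaussian weighted; the 72 window start angles are an illustrative discretisation:

```python
import numpy as np

def dominant_orientation(dx, dy, window=np.pi / 3):
    """Slide an orientation window of size pi/3 over the sample
    responses, sum the responses inside each window, and return the
    angle of the longest summed vector (Section 4.1)."""
    angles = np.arctan2(dy, dx)
    best_len, best_ori = -1.0, 0.0
    for start in np.linspace(-np.pi, np.pi, 72, endpoint=False):
        # wrap-around membership test for [start, start + window)
        diff = (angles - start) % (2 * np.pi)
        mask = diff < window
        sx, sy = dx[mask].sum(), dy[mask].sum()
        length = sx * sx + sy * sy
        if length > best_len:
            best_len, best_ori = length, np.arctan2(sy, sx)
    return best_ori
```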
`Note that for many applications, rotation invariance is
not necessary. Experiments using the upright version
`of SURF (U-SURF, for short) for object detection can
`be found in [3,4]. U-SURF is faster to compute and can
`increase distinctivity, while maintaining a robustness to
`rotation of about ±15°.
`
`4.2. Descriptor based on sum of Haar wavelet responses
`
`For the extraction of the descriptor, the first step con-
`sists of constructing a square region centred around the
`interest point and oriented along the orientation selected
`in previous section. The size of this window is 20s. Exam-
`ples of such square regions are illustrated in Fig. 11.
The region is split up regularly into smaller 4 × 4 square
`sub-regions. This preserves important spatial information.
`For each sub-region, we compute Haar wavelet responses
`
Fig. 9. Haar wavelet filters to compute the responses in x (left) and y
direction (right). The dark parts have the weight −1 and the light parts +1.
`
`Fig. 7. Histogram of the detected scales. The number of detected interest
`points per octave decays quickly.
`
Fig. 8. Detected interest points for a Sunflower field. This kind of scene
shows the nature of the features obtained using Hessian-based detectors.
`
`the gradient information extracted by SIFT [24] and its
`variants. We build on the distribution of first order Haar
`wavelet responses in x and y direction rather than the gra-
`dient, exploit integral images for speed, and use only 64D.
`This reduces the time for feature computation and match-
`ing, and has proven to simultaneously increase the robust-
`ness. Furthermore, we present a new indexing step based
`on the sign of the Laplacian, which increases not only the
`robustness of the descriptor, but also the matching speed
`(by a factor of 2 in the best case). We refer to our detec-
`tor-descriptor
`scheme as SURF—Speeded-Up Robust
`Features.
`The first step consists of fixing a reproducible orienta-
`tion based on information from a circular region around
`the interest point. Then, we construct a square region
`aligned to the selected orientation and extract the SURF
`descriptor from it. Finally, features are matched between
`two images. These three steps are explained in the
`following.
`
`
Fig. 12. To build the descriptor, an oriented quadratic grid with 4 × 4
square sub-regions is laid over the interest point (left). For each square,
the wavelet responses are computed from 5 × 5 samples (for illustrative
purposes, we show only 2 × 2 sub-divisions here). For each field, we
collect the sums Σdx, Σ|dx|, Σdy, and Σ|dy|, computed relative to the
orientation of the grid (right).
`
feature vector. In order to bring in information about the
polarity of the intensity changes, we also extract the sums
of the absolute values of the responses, Σ|dx| and Σ|dy|.
Hence, each sub-region has a 4D descriptor vector
v = (Σdx, Σdy, Σ|dx|, Σ|dy|) for its underlying intensity
structure. Concatenating this for all 4 × 4 sub-regions
results in a descriptor vector of length 64.
`The wavelet responses are invariant to a bias in illumina-
`tion (offset). Invariance to contrast (a scale factor) is
`achieved by turning the descriptor into a unit vector.
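Given the 20 × 20 grid of oriented, Gaussian-weighted Haar responses (5 × 5 samples for each of the 4 × 4 sub-regions), the 64D descriptor of this section can be sketched as:

```python
import numpy as np

def surf_descriptor(dx, dy):
    """dx, dy: 20x20 arrays of oriented, weighted Haar responses.
    Collect (sum dx, sum dy, sum |dx|, sum |dy|) per 5x5 sub-region
    and normalise to a unit vector for contrast invariance."""
    desc = []
    for i in range(4):
        for j in range(4):
            sx = dx[5*i:5*i+5, 5*j:5*j+5]
            sy = dy[5*i:5*i+5, 5*j:5*j+5]
            desc += [sx.sum(), sy.sum(),
                     np.abs(sx).sum(), np.abs(sy).sum()]
    v = np.array(desc)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```

The sums are invariant to an intensity offset by construction of the Haar wavelets, and the final normalisation removes a global contrast factor.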
`Fig. 13 shows the properties of the descriptor for three
`distinctively different
`image-intensity patterns within a
`sub-region. One can imagine combinations of such local
`intensity patterns, resulting in a distinctive descriptor.
SURF is, up to some point, similar in concept to SIFT,
`in that they both focus on the spatial distribution of gradi-
`ent information. Nevertheless, SURF outperforms SIFT in
`practically all cases, as shown in Section 5. We believe this
`is due to the fact that SURF integrates the gradient infor-
`mation within a subpatch, whereas SIFT depends on the
`orientations of the individual gradients. This makes SURF
`
Fig. 10. Orientation assignment: a sliding orientation window of size π/3
detects the dominant orientation of the Gaussian weighted Haar wavelet
responses at every sample point within a circular neighbourhood around
the interest point.
`
`Fig. 11. Detail of the Graffiti scene showing the size of the oriented
`descriptor window at different scales.
`
at 5 × 5 regularly spaced sample points. For reasons of
`simplicity, we call dx the Haar wavelet response in horizon-
`tal direction and dy the Haar wavelet response in vertical
direction (filter size 2s), see Fig. 9 again. ‘‘Horizontal’’
and ‘‘vertical’’ here are defined in relation to the selected
interest point orientation (see Fig. 12).¹ To increase the
`robustness towards geometric deformations and localisa-
`tion errors, the responses dx and dy are first weighted with
`a Gaussian (r ¼ 3:3s) centred at the interest point.
`Then, the wavelet responses dx and dy are summed up
`over each sub-region and form a first set of entries in the
`
¹ For efficiency reasons, the Haar wavelets are calculated in the
unrotated image and the responses are then interpolated, instead of
actually rotating the image.
`
Fig. 13. The descriptor entries of a sub-region represent the nature of the
underlying intensity pattern. Left: In case of a homogeneous region, all
values are relatively low. Middle: In presence of frequencies in x direction,
the value of Σ|dx| is high, but all others remain low. If the intensity is
gradually increasing in x direction, both values Σdx and Σ|dx| are high.
`
`
`Fig. 15. If the contrast between two interest points is different (dark on
`light background vs. light on dark background), the candidate is not
`considered a valuable match.
`
`Fig. 14. Due to the global integration of SURF’s descriptor, it stays more
`robust to various image perturbations than the more locally operating
`SIFT descriptor.
`
`extra information defines a meaningful hyperplane for
`splitting the data, as opposed to randomly choosing an ele-
`ment or using feature statistics.
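A sketch of sign-of-Laplacian gated matching follows; the second-nearest-neighbour ratio test is a common addition, not part of the paper's indexing step itself:

```python
import numpy as np

def match(desc1, lap1, desc2, lap2, ratio=0.7):
    """Nearest-neighbour matching that only compares features whose
    Laplacian signs (trace of the Hessian, stored at detection time)
    agree -- in the best case this halves the number of distance
    computations.  A ratio test prunes ambiguous matches."""
    matches = []
    for i, (d1, s1) in enumerate(zip(desc1, lap1)):
        cand = [j for j, s2 in enumerate(lap2) if s2 == s1]
        if len(cand) < 2:
            continue
        dists = [np.linalg.norm(d1 - desc2[j]) for j in cand]
        order = np.argsort(dists)
        if dists[order[0]] < ratio * dists[order[1]]:
            matches.append((i, cand[order[0]]))
    return matches
```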
`
`less sensitive to noise, as illustrated in the example of
`Fig. 14.
`In order to arrive at these SURF descriptors, we exper-
`imented with fewer and more wavelet features, second
`order derivatives, higher-order wavelets, PCA, median val-
`ues, average values, etc. From a thorough evaluation, the
`proposed sets turned out to perform best. We then varied
the number of sample points and sub-regions. The 4 × 4
`sub-region division solution provided the best results, see
`also Section 5. Considering finer subdivisions appeared to
`be less robust and would increase matching times too
much. On the other hand, the short descriptor with 3 × 3
sub-regions (SURF-36) performs slightly worse, but allows
`for very fast matching and is still acceptable in comparison
`to other descriptors in the literature.
`We also tested an alternative version of the SURF
`descriptor that adds a couple of similar features (SURF-
`128). It again uses the same sums as before, but now splits
these values up further. The sums of dx and |dx| are computed separately for dy < 0 and dy ≥ 0. Similarly, the sums
of dy and |dy| are split up according to the sign of dx,
`thereby doubling the number of features. The descriptor
`is more distinctive