© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.
`
`Distinctive Image Features from Scale-Invariant Keypoints
`
`DAVID G. LOWE
`Computer Science Department, University of British Columbia, Vancouver, B.C., Canada
`Lowe@cs.ubc.ca
`
`Received January 10, 2003; Revised January 7, 2004; Accepted January 22, 2004
`
`Abstract. This paper presents a method for extracting distinctive invariant features from images that can be used
`to perform reliable matching between different views of an object or scene. The features are invariant to image scale
`and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in
`3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a
`single feature can be correctly matched with high probability against a large database of features from many images.
`This paper also describes an approach to using these features for object recognition. The recognition proceeds by
`matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm,
`followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification
`through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify
`objects among clutter and occlusion while achieving near real-time performance.
`
Keywords: invariant features, object recognition, scale invariance, image matching
`
`1.
`
`Introduction
`
`Image matching is a fundamental aspect of many prob-
`lems in computer vision, including object or scene
`recognition, solving for 3D structure from multiple im-
`ages, stereo correspondence, and motion tracking. This
`paper describes image features that have many prop-
`erties that make them suitable for matching differing
`images of an object or scene. The features are invariant
`to image scaling and rotation, and partially invariant to
`change in illumination and 3D camera viewpoint. They
`are well localized in both the spatial and frequency do-
`mains, reducing the probability of disruption by occlu-
`sion, clutter, or noise. Large numbers of features can be
`extracted from typical images with efficient algorithms.
`In addition, the features are highly distinctive, which
`allows a single feature to be correctly matched with
`high probability against a large database of features,
`providing a basis for object and scene recognition.
`The cost of extracting these features is minimized by
taking a cascade filtering approach, in which the more
expensive operations are applied only at locations that
`pass an initial test. Following are the major stages of
`computation used to generate the set of image features:
`
`1. Scale-space extrema detection: The first stage of
`computation searches over all scales and image lo-
`cations. It is implemented efficiently by using a
`difference-of-Gaussian function to identify poten-
`tial interest points that are invariant to scale and
`orientation.
2. Keypoint localization: At each candidate location, a
`detailed model is fit to determine location and scale.
`Keypoints are selected based on measures of their
`stability.
`3. Orientation assignment: One or more orientations
`are assigned to each keypoint location based on local
`image gradient directions. All future operations are
`performed on image data that has been transformed
`relative to the assigned orientation, scale, and loca-
`tion for each feature, thereby providing invariance
`to these transformations.
`
`4. Keypoint descriptor: The local image gradients are
`measured at the selected scale in the region around
`each keypoint. These are transformed into a repre-
`sentation that allows for significant levels of local
`shape distortion and change in illumination.
`
`This approach has been named the Scale Invariant Fea-
`ture Transform (SIFT), as it transforms image data into
`scale-invariant coordinates relative to local features.
`An important aspect of this approach is that it gen-
`erates large numbers of features that densely cover the
`image over the full range of scales and locations. A typ-
`ical image of size 500×500 pixels will give rise to about
`2000 stable features (although this number depends on
`both image content and choices for various parame-
`ters). The quantity of features is particularly important
`for object recognition, where the ability to detect small
`objects in cluttered backgrounds requires that at least
`3 features be correctly matched from each object for
`reliable identification.
`For image matching and recognition, SIFT features
`are first extracted from a set of reference images and
`stored in a database. A new image is matched by indi-
`vidually comparing each feature from the new image to
`this previous database and finding candidate matching
`features based on Euclidean distance of their feature
`vectors. This paper will discuss fast nearest-neighbor
`algorithms that can perform this computation rapidly
`against large databases.
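As a simple point of reference (a brute-force illustration, not the fast algorithm discussed later in the paper, and assuming NumPy with descriptors stored as rows of arrays), candidate matches by Euclidean distance can be found as follows:

import numpy as np

def nearest_neighbors(query_descriptors, database_descriptors):
    # Index of the closest database descriptor (Euclidean distance) for each query row.
    # Uses the expansion ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2 to avoid explicit loops.
    d2 = (np.sum(query_descriptors ** 2, axis=1)[:, None]
          - 2.0 * query_descriptors @ database_descriptors.T
          + np.sum(database_descriptors ** 2, axis=1)[None, :])
    return np.argmin(d2, axis=1)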
`The keypoint descriptors are highly distinctive,
`which allows a single feature to find its correct match
`with good probability in a large database of features.
`However, in a cluttered image, many features from
`the background will not have any correct match in
`the database, giving rise to many false matches in ad-
`dition to the correct ones. The correct matches can
`be filtered from the full set of matches by identify-
`ing subsets of keypoints that agree on the object and
`its location, scale, and orientation in the new image.
`The probability that several features will agree on
`these parameters by chance is much lower than the
`probability that any individual feature match will be
`in error. The determination of these consistent clus-
`ters can be performed rapidly by using an efficient
`hash table implementation of the generalized Hough
`transform.
`Each cluster of 3 or more features that agree on an
`object and its pose is then subject to further detailed
verification. First, a least-squares estimate is made for
an affine approximation to the object pose. Any other
image features consistent with this pose are identified,
`and outliers are discarded. Finally, a detailed compu-
`tation is made of the probability that a particular set of
`features indicates the presence of an object, given the
`accuracy of fit and number of probable false matches.
`Object matches that pass all these tests can be identified
`as correct with high confidence.
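The verification step can be illustrated with a small sketch (a simplification of the procedure developed later in the paper, assuming NumPy; the 5-pixel tolerance is an arbitrary illustrative value): an affine approximation [u, v]^T = M [x, y]^T + t is fit to the matched points by least squares, and matches that disagree with the fitted pose are treated as outliers.

import numpy as np

def fit_affine_pose(model_pts, image_pts):
    # Least-squares fit of [u, v]^T = M [x, y]^T + t from matched point pairs.
    x, y = model_pts[:, 0], model_pts[:, 1]
    ones, zeros = np.ones_like(x), np.zeros_like(x)
    rows_u = np.stack([x, y, ones, zeros, zeros, zeros], axis=1)   # equations for u
    rows_v = np.stack([zeros, zeros, zeros, x, y, ones], axis=1)   # equations for v
    A = np.concatenate([rows_u, rows_v])
    b = np.concatenate([image_pts[:, 0], image_pts[:, 1]])
    m11, m12, tx, m21, m22, ty = np.linalg.lstsq(A, b, rcond=None)[0]
    return np.array([[m11, m12], [m21, m22]]), np.array([tx, ty])

def consistent_matches(model_pts, image_pts, M, t, tol=5.0):
    # Keep matches whose projected model point lies within tol pixels of its image point.
    predicted = model_pts @ M.T + t
    return np.linalg.norm(predicted - image_pts, axis=1) < tol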
`
`2. Related Research
`
`The development of image matching by using a set of
`local interest points can be traced back to the work of
`Moravec (1981) on stereo matching using a corner de-
`tector. The Moravec detector was improved by Harris
`and Stephens (1988) to make it more repeatable un-
`der small image variations and near edges. Harris also
`showed its value for efficient motion tracking and 3D
`structure from motion recovery (Harris, 1992), and the
`Harris corner detector has since been widely used for
`many other image matching tasks. While these feature
`detectors are usually called corner detectors, they are
`not selecting just corners, but rather any image location
`that has large gradients in all directions at a predeter-
`mined scale.
`The initial applications were to stereo and short-
`range motion tracking, but the approach was later ex-
`tended to more difficult problems. Zhang et al. (1995)
`showed that it was possible to match Harris corners
over a large image range by using a correlation window
`around each corner to select likely matches. Outliers
`were then removed by solving for a fundamental ma-
`trix describing the geometric constraints between the
two views of a rigid scene and removing matches that did
`not agree with the majority solution. At the same time,
`a similar approach was developed by Torr (1995) for
`long-range motion matching, in which geometric con-
`straints were used to remove outliers for rigid objects
`moving within an image.
`The ground-breaking work of Schmid and Mohr
`(1997) showed that invariant local feature matching
`could be extended to general image recognition prob-
`lems in which a feature was matched against a large
`database of images. They also used Harris corners to
`select interest points, but rather than matching with
`a correlation window, they used a rotationally in-
`variant descriptor of the local image region. This al-
`lowed features to be matched under arbitrary orien-
`tation change between the two images. Furthermore,
`they demonstrated that multiple feature matches could
accomplish general recognition under occlusion and
`clutter by identifying consistent clusters of matched
`features.
`The Harris corner detector is very sensitive to
`changes in image scale, so it does not provide a good ba-
`sis for matching images of different sizes. Earlier work
`by the author (Lowe, 1999) extended the local feature
`approach to achieve scale invariance. This work also
`described a new local descriptor that provided more
`distinctive features while being less sensitive to local
`image distortions such as 3D viewpoint change. This
`current paper provides a more in-depth development
`and analysis of this earlier work, while also present-
`ing a number of improvements in stability and feature
`invariance.
`There is a considerable body of previous research on
`identifying representations that are stable under scale
`change. Some of the first work in this area was by
`Crowley and Parker (1984), who developed a repre-
`sentation that identified peaks and ridges in scale space
`and linked these into a tree structure. The tree structure
`could then be matched between images with arbitrary
`scale change. More recent work on graph-based match-
`ing by Shokoufandeh et al. (1999) provides more dis-
`tinctive feature descriptors using wavelet coefficients.
`The problem of identifying an appropriate and con-
`sistent scale for feature detection has been studied in
`depth by Lindeberg (1993, 1994). He describes this as
`a problem of scale selection, and we make use of his
`results below.
`Recently, there has been an impressive body of
`work on extending local features to be invariant to
`full affine transformations (Baumberg, 2000; Tuyte-
`laars and Van Gool, 2000; Mikolajczyk and Schmid,
`2002; Schaffalitzky and Zisserman, 2002; Brown and
`Lowe, 2002). This allows for invariant matching to fea-
`tures on a planar surface under changes in orthographic
`3D projection, in most cases by resampling the image in
`a local affine frame. However, none of these approaches
`are yet fully affine invariant, as they start with initial
`feature scales and locations selected in a non-affine-
`invariant manner due to the prohibitive cost of explor-
ing the full affine space. The affine frames are also
`more sensitive to noise than those of the scale-invariant
`features, so in practice the affine features have lower
`repeatability than the scale-invariant features unless the
`affine distortion is greater than about a 40 degree tilt of
`a planar surface (Mikolajczyk, 2002). Wider affine in-
`variance may not be important for many applications,
`as training views are best taken at least every 30 de-
`grees rotation in viewpoint (meaning that recognition
`
`is within 15 degrees of the closest training view) in or-
`der to capture non-planar changes and occlusion effects
`for 3D objects.
`While the method to be presented in this paper is not
`fully affine invariant, a different approach is used in
`which the local descriptor allows relative feature posi-
`tions to shift significantly with only small changes in
`the descriptor. This approach not only allows the de-
`scriptors to be reliably matched across a considerable
`range of affine distortion, but it also makes the features
`more robust against changes in 3D viewpoint for non-
`planar surfaces. Other advantages include much more
`efficient feature extraction and the ability to identify
`larger numbers of features. On the other hand, affine
`invariance is a valuable property for matching planar
`surfaces under very large view changes, and further
`research should be performed on the best ways to com-
`bine this with non-planar 3D viewpoint invariance in
`an efficient and stable manner.
`Many other feature types have been proposed for use
`in recognition, some of which could be used in addition
`to the features described in this paper to provide fur-
`ther matches under differing circumstances. One class
`of features are those that make use of image contours
`or region boundaries, which should make them less
`likely to be disrupted by cluttered backgrounds near
`object boundaries. Matas et al. (2002) have shown that
`their maximally-stable extremal regions can produce
`large numbers of matching features with good stabil-
`ity. Mikolajczyk et al. (2003) have developed a new
`descriptor that uses local edges while ignoring unre-
`lated nearby edges, providing the ability to find stable
`features even near the boundaries of narrow shapes su-
`perimposed on background clutter. Nelson and Selinger
`(1998) have shown good results with local features
`based on groupings of image contours. Similarly, Pope
`and Lowe (2000) used features based on the hierarchi-
`cal grouping of image contours, which are particularly
`useful for objects lacking detailed texture.
`The history of research on visual recognition con-
`tains work on a diverse set of other image properties
`that can be used as feature measurements. Carneiro and
`Jepson (2002) describe phase-based local features that
`represent the phase rather than the magnitude of local
`spatial frequencies, which is likely to provide improved
`invariance to illumination. Schiele and Crowley (2000)
`have proposed the use of multidimensional histograms
`summarizing the distribution of measurements within
`image regions. This type of feature may be particu-
larly useful for recognition of textured objects with
`deformable shapes. Basri and Jacobs (1997) have
`demonstrated the value of extracting local region
`boundaries for recognition. Other useful properties to
`incorporate include color, motion, figure-ground dis-
`crimination, region shape descriptors, and stereo depth
`cues. The local feature approach can easily incorporate
`novel feature types because extra features contribute
`to robustness when they provide correct matches, but
`otherwise do little harm other than their cost of compu-
`tation. Therefore, future systems are likely to combine
`many feature types.
`
`3. Detection of Scale-Space Extrema
`
`As described in the introduction, we will detect key-
`points using a cascade filtering approach that uses effi-
`cient algorithms to identify candidate locations that are
`then examined in further detail. The first stage of key-
`point detection is to identify locations and scales that
`can be repeatably assigned under differing views of
`the same object. Detecting locations that are invariant
`to scale change of the image can be accomplished by
`searching for stable features across all possible scales,
`using a continuous function of scale known as scale
`space (Witkin, 1983).
`It has been shown by Koenderink (1984) and
`Lindeberg (1994) that under a variety of reasonable
`assumptions the only possible scale-space kernel is
`the Gaussian function. Therefore, the scale space of
`an image is defined as a function, L(x, y, σ ), that
`is produced from the convolution of a variable-scale
`Gaussian, G(x, y, σ ), with an input image, I (x, y):
L(x, y, σ) = G(x, y, σ) ∗ I(x, y),

where ∗ is the convolution operation in x and y, and

G(x, y, σ) = (1/(2πσ²)) e^(−(x² + y²)/2σ²).
`To efficiently detect stable keypoint locations in scale
`space, we have proposed (Lowe, 1999) using scale-
`space extrema in the difference-of-Gaussian function
`convolved with the image, D(x, y, σ ), which can be
`computed from the difference of two nearby scales sep-
`arated by a constant multiplicative factor k:
D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y)
           = L(x, y, kσ) − L(x, y, σ).     (1)
`
`There are a number of reasons for choosing this
`function. First, it is a particularly efficient function to
`compute, as the smoothed images, L , need to be com-
`puted in any case for scale space feature description,
`and D can therefore be computed by simple image
`subtraction.
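As a minimal sketch of this computation (an illustration assuming NumPy/SciPy, not the author's implementation), L is produced with a Gaussian filter and D by subtracting two nearby scales:

import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussian(image, sigma, k):
    # D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma), computed by image subtraction.
    image = image.astype(np.float64)
    L_lo = gaussian_filter(image, sigma)       # L(x, y, sigma)
    L_hi = gaussian_filter(image, k * sigma)   # L(x, y, k*sigma)
    return L_hi - L_lo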
`In addition, the difference-of-Gaussian function pro-
`vides a close approximation to the scale-normalized
Laplacian of Gaussian, σ²∇²G, as studied by
Lindeberg (1994). Lindeberg showed that the normal-
ization of the Laplacian with the factor σ² is required
for true scale invariance. In detailed experimental com-
parisons, Mikolajczyk (2002) found that the maxima
and minima of σ²∇²G produce the most stable image
`features compared to a range of other possible image
`functions, such as the gradient, Hessian, or Harris cor-
`ner function.
The relationship between D and σ²∇²G can be un-
derstood from the heat diffusion equation (parameter-
ized in terms of σ rather than the more usual t = σ²):

∂G/∂σ = σ∇²G.

From this, we see that ∇²G can be computed from the
finite difference approximation to ∂G/∂σ, using the
difference of nearby scales at kσ and σ:

σ∇²G = ∂G/∂σ ≈ (G(x, y, kσ) − G(x, y, σ)) / (kσ − σ)

and therefore,

G(x, y, kσ) − G(x, y, σ) ≈ (k − 1)σ²∇²G.
`
`This shows that when the difference-of-Gaussian func-
`tion has scales differing by a constant factor it already
incorporates the σ² scale normalization required for
`the scale-invariant Laplacian. The factor (k − 1) in the
`equation is a constant over all scales and therefore does
`not influence extrema location. The approximation er-
`ror will go to zero as k goes to 1, but in practice we have
`found that the approximation has almost no impact on
the stability of extrema detection or localization for
even significant differences in scale, such as k = √2.
`An efficient approach to construction of D(x, y, σ )
`is shown in Fig. 1. The initial image is incrementally
`convolved with Gaussians to produce images separated
`by a constant factor k in scale space, shown stacked in
`the left column. We choose to divide each octave of
`scale space (i.e., doubling of σ ) into an integer num-
ber, s, of intervals, so k = 2^(1/s).
`
`Figure 1. For each octave of scale space, the initial image is repeatedly convolved with Gaussians to produce the set of scale space images
`shown on the left. Adjacent Gaussian images are subtracted to produce the difference-of-Gaussian images on the right. After each octave, the
`Gaussian image is down-sampled by a factor of 2, and the process repeated.
`
We must produce s + 3 images in the stack of blurred images for each octave,
`so that final extrema detection covers a complete oc-
`tave. Adjacent image scales are subtracted to produce
`the difference-of-Gaussian images shown on the right.
`Once a complete octave has been processed, we resam-
`ple the Gaussian image that has twice the initial value
`of σ (it will be 2 images from the top of the stack) by
`taking every second pixel in each row and column. The
`accuracy of sampling relative to σ is no different than
`for the start of the previous octave, while computation
`is greatly reduced.
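A minimal sketch of this construction (assuming NumPy/SciPy; the handling of the initial image blur from Section 3.3 is simplified here) builds the s + 3 blurred images per octave by incremental Gaussian smoothing, subtracts adjacent images, and then down-samples for the next octave:

import numpy as np
from scipy.ndimage import gaussian_filter

def build_pyramid(image, num_octaves=4, s=3, sigma=1.6):
    # Gaussian and difference-of-Gaussian images for each octave, with k = 2^(1/s).
    k = 2 ** (1.0 / s)
    gaussians, dogs = [], []
    base = gaussian_filter(image.astype(np.float64), sigma)   # simplified base image
    for _ in range(num_octaves):
        octave = [base]
        for i in range(1, s + 3):   # s + 3 blurred images per octave
            # Incremental blur so that image i has total smoothing sigma * k**i
            # (Gaussian blurs combine in quadrature).
            prev, total = sigma * k ** (i - 1), sigma * k ** i
            octave.append(gaussian_filter(octave[-1], np.sqrt(total ** 2 - prev ** 2)))
        gaussians.append(octave)
        dogs.append([hi - lo for lo, hi in zip(octave, octave[1:])])   # adjacent subtraction
        # The image with twice the initial sigma is 2 from the top of the stack;
        # down-sample it by taking every second pixel in each row and column.
        base = octave[-3][::2, ::2]
    return gaussians, dogs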
`
`3.1. Local Extrema Detection
`
`In order to detect the local maxima and minima of
`D(x, y, σ ), each sample point is compared to its eight
`neighbors in the current image and nine neighbors in
`the scale above and below (see Fig. 2). It is selected
`only if it is larger than all of these neighbors or smaller
`than all of them. The cost of this check is reasonably
`low due to the fact that most sample points will be
`eliminated following the first few checks.
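A direct version of this test can be sketched as follows (illustrative rather than the author's code), assuming the difference-of-Gaussian images of one octave are stacked in an array dog indexed as [scale, row, col], with (s, r, c) an interior sample:

import numpy as np

def is_extremum(dog, s, r, c):
    # True if dog[s, r, c] is larger than all 26 neighbors or smaller than all of them.
    value = dog[s, r, c]
    cube = dog[s - 1:s + 2, r - 1:r + 2, c - 1:c + 2]   # 3 x 3 x 3 neighborhood
    neighbors = np.delete(cube.ravel(), 13)             # index 13 is the center sample itself
    return bool(np.all(value > neighbors) or np.all(value < neighbors))

The early rejection noted above would come from comparing neighbors one at a time and stopping at the first failure, rather than forming the whole neighborhood as done here.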
Figure 2. Maxima and minima of the difference-of-Gaussian images are detected by comparing a pixel (marked with X) to its 26 neighbors in 3 × 3 regions at the current and adjacent scales (marked with circles).

An important issue is to determine the frequency
of sampling in the image and scale domains that is
needed to reliably detect the extrema. Unfortunately,
it turns out that there is no minimum spacing of sam-
`ples that will detect all extrema, as the extrema can be
`arbitrarily close together. This can be seen by consid-
`ering a white circle on a black background, which will
`have a single scale space maximum where the circular
`positive central region of the difference-of-Gaussian
`function matches the size and location of the circle.
`For a very elongated ellipse, there will be two max-
`ima near each end of the ellipse. As the locations of
`maxima are a continuous function of the image, for
`some ellipse with intermediate elongation there will
`be a transition from a single maximum to two, with
`the maxima arbitrarily close to each other near the
`transition.
`
`Figure 3. The top line of the first graph shows the percent of keypoints that are repeatably detected at the same location and scale in a
`transformed image as a function of the number of scales sampled per octave. The lower line shows the percent of keypoints that have their
`descriptors correctly matched to a large database. The second graph shows the total number of keypoints detected in a typical image as a function
`of the number of scale samples.
`
`Therefore, we must settle for a solution that trades
`off efficiency with completeness. In fact, as might be
`expected and is confirmed by our experiments, extrema
`that are close together are quite unstable to small per-
`turbations of the image. We can determine the best
`choices experimentally by studying a range of sampling
`frequencies and using those that provide the most reli-
`able results under a realistic simulation of the matching
`task.
`
`3.2. Frequency of Sampling in Scale
`
`The experimental determination of sampling frequency
`that maximizes extrema stability is shown in Figs. 3 and
`4. These figures (and most other simulations in this pa-
`per) are based on a matching task using a collection of
`32 real images drawn from a diverse range, including
`outdoor scenes, human faces, aerial photographs, and
`industrial images (the image domain was found to have
`almost no influence on any of the results). Each image
`was then subject to a range of transformations, includ-
`ing rotation, scaling, affine stretch, change in brightness
`and contrast, and addition of image noise. Because the
`changes were synthetic, it was possible to precisely
`predict where each feature in an original image should
`appear in the transformed image, allowing for measure-
`ment of correct repeatability and positional accuracy
`for each feature.
Figure 4. The top line in the graph shows the percent of keypoint locations that are repeatably detected in a transformed image as a function of the prior image smoothing for the first level of each octave. The lower line shows the percent of descriptors correctly matched against a large database.

Figure 3 shows these simulation results used to examine
the effect of varying the number of scales per octave
at which the image function is sampled prior to
`extrema detection. In this case, each image was resam-
`pled following rotation by a random angle and scaling
by a random amount between 0.2 and 0.9 times the orig-
`inal size. Keypoints from the reduced resolution image
`were matched against those from the original image so
that the scales for all keypoints would be present in
`the matched image. In addition, 1% image noise was
`added, meaning that each pixel had a random number
`added from the uniform interval [−0.01, 0.01] where
`pixel values are in the range [0, 1] (equivalent to pro-
`viding slightly less than 6 bits of accuracy for image
`pixels).
`
`The top line in the first graph of Fig. 3 shows the
`percent of keypoints that are detected at a matching
location and scale in the transformed image. For all
examples in this paper, we define a matching scale as
being within a factor of √2 of the correct scale, and
`a matching location as being within σ pixels, where
`σ is the scale of the keypoint (defined from Eq. (1) as
`the standard deviation of the smallest Gaussian used
`in the difference-of-Gaussian function). The lower line
`on this graph shows the number of keypoints that are
`correctly matched to a database of 40,000 keypoints
`using the nearest-neighbor matching procedure to be
`described in Section 6 (this shows that once the key-
`point is repeatably located, it is likely to be useful for
`recognition and matching tasks). As this graph shows,
`the highest repeatability is obtained when sampling
`3 scales per octave, and this is the number of scale
`samples used for all other experiments throughout this
`paper.
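For concreteness, the matching criterion used in these experiments can be written as a small predicate (an illustrative sketch; keypoints are taken as (x, y, σ) tuples already mapped into a common coordinate frame by the known synthetic transformation):

import numpy as np

def is_repeated(original_kp, transformed_kp):
    # Scale must agree within a factor of sqrt(2); location within sigma pixels,
    # where sigma is taken here as the scale of the original keypoint.
    (x0, y0, s0), (x1, y1, s1) = original_kp, transformed_kp
    scale_ok = max(s0, s1) / min(s0, s1) <= np.sqrt(2)
    location_ok = np.hypot(x1 - x0, y1 - y0) <= s0
    return scale_ok and location_ok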
`It might seem surprising that the repeatability does
`not continue to improve as more scales are sampled.
`The reason is that this results in many more local ex-
`trema being detected, but these extrema are on average
`less stable and therefore are less likely to be detected
`in the transformed image. This is shown by the second
`graph in Fig. 3, which shows the average number of
`keypoints detected and correctly matched in each im-
`age. The number of keypoints rises with increased sam-
`pling of scales and the total number of correct matches
`also rises. Since the success of object recognition often
`depends more on the quantity of correctly matched key-
`points, as opposed to their percentage correct match-
`ing, for many applications it will be optimal to use a
`larger number of scale samples. However, the cost of
`computation also rises with this number, so for the ex-
`periments in this paper we have chosen to use just 3
`scale samples per octave.
`To summarize, these experiments show that the
`scale-space difference-of-Gaussian function has a large
`number of extrema and that it would be very expensive
`to detect them all. Fortunately, we can detect the most
`stable and useful subset even with a coarse sampling
`of scales.
`
`3.3. Frequency of Sampling in the Spatial Domain
`
`Just as we determined the frequency of sampling per oc-
`tave of scale space, so we must determine the frequency
`of sampling in the image domain relative to the scale of
`smoothing. Given that extrema can be arbitrarily close
`
`together, there will be a similar trade-off between sam-
`pling frequency and rate of detection. Figure 4 shows
`an experimental determination of the amount of prior
`smoothing, σ , that is applied to each image level before
`building the scale space representation for an octave.
`Again, the top line is the repeatability of keypoint de-
`tection, and the results show that the repeatability con-
`tinues to increase with σ . However, there is a cost to
`using a large σ in terms of efficiency, so we have cho-
`sen to use σ = 1.6, which provides close to optimal
`repeatability. This value is used throughout this paper
`and was used for the results in Fig. 3.
`Of course, if we pre-smooth the image before ex-
`trema detection, we are effectively discarding the high-
`est spatial frequencies. Therefore, to make full use of
`the input, the image can be expanded to create more
`sample points than were present in the original. We
`double the size of the input image using linear inter-
`polation prior to building the first level of the pyra-
`mid. While the equivalent operation could effectively
`have been performed by using sets of subpixel-offset
`filters on the original image, the image doubling leads
`to a more efficient implementation. We assume that the
`original image has a blur of at least σ = 0.5 (the min-
`imum needed to prevent significant aliasing), and that
`therefore the doubled image has σ = 1.0 relative to
`its new pixel spacing. This means that little additional
`smoothing is needed prior to creation of the first oc-
`tave of scale space. The image doubling increases the
`number of stable keypoints by almost a factor of 4, but
`no significant further improvements were found with a
`larger expansion factor.
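A minimal sketch of this base-image preparation (assuming SciPy; the bilinear doubling via zoom and the quadrature rule for combining blurs are implementation choices here, not code from the paper):

import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def prepare_base_image(image, target_sigma=1.6, assumed_blur=0.5):
    # Double the image with linear interpolation, then add just enough smoothing
    # to reach target_sigma relative to the new pixel spacing.
    doubled = zoom(image.astype(np.float64), 2, order=1)   # order=1: linear interpolation
    doubled_blur = 2.0 * assumed_blur                      # sigma doubles with the resampling
    extra = np.sqrt(max(target_sigma ** 2 - doubled_blur ** 2, 0.0))
    return gaussian_filter(doubled, extra)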
`
`4. Accurate Keypoint Localization
`
`Once a keypoint candidate has been found by com-
`paring a pixel to its neighbors, the next step is to per-
`form a detailed fit to the nearby data for location, scale,
`and ratio of principal curvatures. This information al-
`lows points to be rejected that have low contrast (and
`are therefore sensitive to noise) or are poorly localized
`along an edge.
`The initial implementation of this approach (Lowe,
`1999) simply located keypoints at the location and scale
`of the central sample point. However, recently Brown
`has developed a method (Brown and Lowe, 2002) for
`fitting a 3D quadratic function to the local sample points
`to determine the interpolated location of the maximum,
`and his experiments showed that this provides a sub-
stantial improvement to matching and stability. His
`approach uses the Taylor expansion (up to the quadratic
`terms) of the scale-space function, D(x, y, σ ), shifted
`so that the origin is at the sample point:
`
D(x) = D + (∂D/∂x)^T x + (1/2) x^T (∂²D/∂x²) x     (2)
`
`where D and its derivatives are evaluated at the sample
point and x = (x, y, σ)^T is the offset from this point.
The location of the extremum, x̂, is determined by tak-
`ing the derivative of this function with respect to x and
`setting it to zero, giving
`
x̂ = −(∂²D/∂x²)⁻¹ ∂D/∂x.     (3)
`
`As suggested by Brown, the Hessian and derivative
`of D are approximated by using differences of neigh-
`boring sample points. The resulting 3×3 linear system
can be solved with minimal cost. If the offset x̂ is larger
than 0.5 in any dimension, then it means that the ex-
tremum lies closer to a different sample point. In this
case, the sample point is changed and the interpolation
performed instead about that point. The final offset x̂
`is added to the location of its sample point to get the
`interpolated estimate for the location of the extremum.
The function value at the extremum, D(x̂), is useful
for rejecting unstable extrema with low contrast. This
can be obtained by substituting Eq. (3) into (2), giving

D(x̂) = D + (1/2) (∂D/∂x)^T x̂.
`
`For the experiments in this paper, all extrema with
a value of |D(x̂)| less than 0.03 were discarded (as
`before, we assume image pixel values in the range
`[0, 1]).
Figure 5. This figure shows the stages of keypoint selection. (a) The 233 × 189 pixel original image. (b) The initial 832 keypoint locations at maxima and minima of the difference-of-Gaussian function. Keypoints are displayed as vectors indicating scale, orientation, and location. (c) After applying a threshold on minimum contrast, 729 keypoints remain. (d) The final 536 keypoints that remain following an additional threshold on ratio of principal curvatures.

Figure 5 shows the effects of keypoint selection on a
natural image. In order to avoid too much clutter, a
low-resolution 233 by 189 pixel image is used and key-
points are shown as vectors giving the location, scale,
and orientation of each keypoint (orientation assign-
ment is described below). Figure 5(a) shows the original
image, which is shown at reduced contrast behind
`the subsequent figures. Figure 5(b) shows the 832
`keypoints at all detected maxima and minima of the
`difference-of-Gaussian function, while (c) shows the
`729 keypoints that remain following removal of those
with a value of |D(x̂)| less than 0.03. Part (d) will be
`explained in the following section.
`
`4.1. Eliminating Edge Responses
`
`For stability, it is not sufficient to reject keypoints with
`low contrast. The difference-of-Gaussian function will
`have a strong response along edges, even if the loca-
`tion along the edge is poorly determined and therefore
`unstable to small amounts of noise.
`A poorly defined peak in the difference-of-Gaussian
`function will have a large principal curvature across
`the edge but a small one in the perpendicular direction.
The principal curvatures can be computed from a 2 × 2
Hessian matrix, H, computed at the location and scale
of the keypoint:

H = | Dxx  Dxy |
    | Dxy  Dyy |

where the derivatives are estimated by taking differences
of neighboring sample points. The eigenvalues of H are
proportional to the principal curvatures of D. Following
the approach used by Harris and Stephens (1988), we
can avoid explicitly computing the eigenvalues, as we are
only concerned with their ratio. Let r be the ratio between
the eigenvalue with the larger magnitude, α, and the
smaller one, β, so that α = rβ. Then,

Tr(H)²/Det(H) = (α + β)²/(αβ) = (rβ + β)²/(rβ²) = (r + 1)²/r,
`
which depends only on the ratio of the eigenval-
ues rather than their individual values. The quantity
(r + 1)²/r is at a minimum when the two eigenvalues
are equal and it increases with r. Therefore, to check
that the ratio of principal curvatures is below some
threshold, r, we only need to check

Tr(H)²/Det(H) < (r + 1)²/r.
`
`This is very efficient to compute, with less than 20
`floating point operations required to test each keypoint.
`The experiments in this paper use a value of r = 10,
`