`Feature Descriptors
`
`Duy-Nguyen Ta
`Georgia Institute of Technology
`duynguyen@gatech.edu
`
`Wei-Chao Chen Natasha Gelfand Kari Pulli
`Nokia Research Center, Palo Alto
`{wei-chao.chen,natasha.gelfand,kari.pulli}@nokia.com
`
`Abstract
`
`We present an efficient algorithm for continuous image
`recognition and feature descriptor tracking in video which
`operates by reducing the search space of possible interest
`points inside of the scale space image pyramid. Instead of
`performing tracking in 2D images, we search and match
`candidate features in local neighborhoods inside the 3D im-
`age pyramid without computing their feature descriptors.
`The candidates are further validated by fitting to a motion
`model. The resulting tracked interest points are more re-
`peatable and resilient to noise, and descriptor computa-
`tion becomes much more efficient because only those areas
`of the image pyramid that contain features are searched.
`We demonstrate our method on real-time object recognition
`and label augmentation running on a mobile device.
`
`1. Introduction
`Robust feature descriptors such as SIFT [12], SURF [2],
`and GLOH [16] have become a core component in applica-
`tions such as image recognition [8], multi-view stereo [25],
`and image registration [4, 26]. These descriptors are sta-
`ble under viewpoint and lighting changes, so they are able
`to cope with significant amounts of image variability. At
`the same time, discriminative power is achieved by rep-
`resenting feature points as high-dimensional vectors. The
`combination of robustness and discriminative power makes
`these methods ideally suited for searching large heteroge-
`neous image databases.
`We are interested in using robust feature descriptors
`in real-time object recognition and tracking applications.
`Given a database of images of labeled objects, such as build-
`ings in an outdoor environment, and an input video stream,
`we would like to recognize the objects present in the video,
`augment the video stream with labels indicating the ob-
`jects, and maintain and update those labels as the video pans
`across the scene. Figure 1 shows an example of such an ap-
`plication. This task can be thought of as having two com-
`
`ponents. Image matching against a database is performed
`to identify new objects that appear in the video and object
`tracking is performed to update the positions of the labels
`of recognized objects in the consecutive video frames. Ro-
`bust feature points with high dimensional descriptors per-
`form best for image recognition, therefore we would like to
`compute and track them at interactive frame rates.
`To track robust features across video frames, we can per-
`form pairwise image matching given any two consecutive
`video frames, as done by Battiato et al. [1] and Skrypnyk
`and Lowe [24]. Figure 2 describes the algorithm flow of this
`approach. The main drawback of frame-to-frame matching
`is the wasted computation as this approach does not exploit
`coherence in the video. Most of the time the features are
`used for tracking purposes only, as there is no need to per-
`form an image recognition step, unless a significant change
`is detected in the current frame. The expensive robustness
`properties of the descriptors are not needed for frame-to-
`frame matching, since consecutive video frames are likely
`to be very similar. Furthermore, image noise means that
`many of the descriptors are transient and will be thrown
`away when consecutive frames are matched. Therefore,
`performing detection and computation of robust descriptors
`for each frame of the video is unnecessary and can also be
`difficult to perform at interactive frame rates.
`Alternatively, one can imagine a hybrid algorithm where
`motion estimation is done through lightweight features such
`as FAST corners [19, 20] or Harris corners [9] and object
`recognition is done through robust features. When enough
`motion is accumulated to warrant a recognition step, one
`can run a standard image matching algorithm against the
`image database to detect what is present in the scene. How-
`ever, locations of the robust features such as SIFT and
`SURF are unlikely to coincide with the features used for
`tracking. This is a problem since the position of any aug-
`mentation that results from the matching step, such as build-
`ing labels, can not be transferred to successive frames with-
`out knowing the scene structure. A solution to this is to
`compute the robust descriptors at corner locations that are
`used for tracking [5, 29]. These approaches, even though
`
`12
`
`937
`
`978-1-4244-3991-1/09/$25.00 ©2009 IEEE
`
`Niantic's Exhibit No. 1027
`Page 001
`
`
`
`(b)
`(a)
`Figure 1. Our algorithm running in an outdoor environment. (a) Detected interest points and traced trajectories from a video sequence
`(Section 3). (b) Above: Video frames overlaid with their object labels zoomed in. Below: Matching images and their partial subgraph from
`the database. The solid and dashed edges indicate geometric and object ID relationships, respectively (Section 4)
`
`fairly efficient in terms of tracking, require extracting multi-
`ple descriptors per corner, because the corners are not scale-
`invariant. The SIFT Flow approach by Liu et al. [11] pro-
`duces a flow field of SIFT descriptors with a fairly high
`density, but this requires computing feature descriptors at
`a much higher density than produced by the difference-
`of-Gaussians algorithm of standard SIFT. Furthermore, the
`performance of SIFT and SURF is related to the distinctive-
`ness of the computed interest points, so using descriptors
`computed at alternative locations can negatively impact the
`recognition performance of the algorithm.
`
`Contributions. To enable tracking of robust descriptors
`at interactive frame rates, we propose to exploit coherency
`in video frames to detect interest points at locations in each
`frame where they are most likely to appear. Instead of track-
`ing features in 2D images, we instead perform our track-
`ing in the scale-space image pyramid, and we achieve the
`robustness of the direct frame-to-frame matching method
`while reducing computation significantly. The detected in-
`terest points are scale-invariant, and are inherently matched
`and tracked across video frames. We decouple the descrip-
`tor computation from the interest point detection so that fea-
`ture descriptors are computed only for the purpose of object
`recognition. We demonstrate that our algorithm runs in real-
`time on mobile phones, and we discuss applications such
`as real-time augmentation and object recognition. Figure 3
`shows a functional overview of our algorithm.
`
`2. Related Work
`Robust Feature Detectors and Descriptors. There are
`several approaches to scale-invariant
`interest point de-
`tection,
`including Difference-of-Gaussians in SIFT by
`Lowe [12], maximally stable extended regions (MSER) by
`Matas et al. [14], Harris-Affine and Hessian-Affine corners
`by Mikolajczyk and Schmid [15], and Hessians approxi-
`
`mated using Haar-basis by Bay et al. [2]. Mikolajczyk et
`al. [17] performed a comprehensive comparison study on
`the detectors.
`Robust descriptor algorithms take the output of region
`detectors, construct a canonical frame, and then extract a
`high-dimensional feature vector for each region. The de-
`scriptors are designed to be invariant to lighting, scale,
`and rotational changes. Examples include the very popular
`SIFT [12] and SURF [2]. Refer to Mikolajczyk et al. [16]
`for a comparative study on the descriptors.
`There are also many successful attempts to speed up the
`descriptor computation algorithm. Sinha et al. [22] describe
`an efficient SIFT implementation on graphics processing
`unit (GPU). Grabner et al. [7] propose to speed up SIFT
`computation using integral images. Our SURFTrac algo-
`rithm improves on baseline SURF algorithm which is our
`primary benchmark; additional speedups similar to these
`approaches can be applied to gain even more efficiency.
`
`Applications of Feature Descriptors. Many researchers
`have applied feature descriptors in object recognition and
`image retrieval. For example, Sivic and Zisserman [23] ap-
`plied SIFT for efficient keyframe retrieval in video. Grau-
`man and Darrell [8] proposed a fast kernel-based object
`identification method using SIFT features. Nist´er and
`Stew´enius [18] used hierarchical k-means to construct a
`search tree of local features for fast and efficient image re-
`trieval.
`
`Motion Estimation and Tracking. There is an abundant
`body of prior art on video tracking and motion estimation.
`Refer to Torr and Zisserman [28] for a summary of feature-
`based motion estimation methods. Optical flow algorithms
`such as the classic examples by Lucas and Kanade [13] and
`Berthold et al. [10] can also be used as input for a motion
`estimation algorithm. Zhang et al. [30] recover epipolar ge-
`
`2938
`
`Niantic's Exhibit No. 1027
`Page 002
`
`
`
`Figure 3. The SURFTrac algorithm overview. For the first image,
`the algorithm initializes a list of interest points by performing a
`full detection. The interest points are then updated and tracked
`upon receiving new images. The descriptors are for recognition
`purposes and the algorithm computes them as needed.
`
`(1)
`
`(2)
`
`To achieve scale-invariance, many interest point detec-
`tion algorithms use image pyramids during the detection
`phase. These algorithms include Hessian-Affine, Harris-
`Affine, and approximate Hessian. In this process, an image
`pyramid is formed by downsampling the input image to a
`progressively lower resolution. For the purpose of our dis-
`cussion, we will treat the image pyramid as a stack of same-
`sized images Sk(·), each filtered from the original image Ik
`with a different scale of zero-mean Gaussian as follows
`Sk(x, y, σ) = Ik(x, y) ∗ G(0, σ2).
`The interest point response R has the same image stack data
`layout, and is computed by applying the response computa-
`tion function f(·) over the stack of images S,
`Rk(x, y, σ) = f · Sk(x, y, σ).
`Local maxima in the function R represent relatively sta-
`ble regions and are extracted as interest points. Because the
`bandwidth of function R is lower at higher values of σ, the
`sampling rate for maxima computation is naturally reduced
`at higher σ to increase efficiency. The extracted interest
`points are then refined with smooth local interpolation [3].
`As described before, we plan to extend interest point de-
`tection algorithms such that the interest points are tracked
`across video frames efficiently. Although our proposal is
`adaptable to many algorithms using image pyramids, we
`will focus the remainder of our discussion on the approxi-
`mate Hessian detector in SURF [2] because of its efficiency
`and good interest point repeatability. More importantly, by
`using SURF we can compute part of Rk(·) directly without
`having to produce the intermediate Gaussian stack Sk(·).
`The algorithm computes the scale-normalized Hessian ma-
`trix
`
`(cid:35)
`
`,
`
`(cid:34) ∂2
`
`1 σ
`
`2
`
`Hk(x, y, σ) =
`
`2939
`
`∂2
`∂x∂y Sk(x, y, σ)
`∂x2 Sk(x, y, σ)
`∂2
`∂2
`∂y2 Sk(x, y, σ)
`∂x∂y Sk(x, y, σ)
`(3)
`and the response function is the determinant of H(·),
`Rk(x, y, σ) = det(Hk(x, y, σ)). In the following sections,
`the use of the Haar-wavelet approximation in SURF is im-
`plied when we refer to the Hessian matrix. In the remainder
`of the section we describe the algorithm shown in Figure 3.
`
`Figure 2. An algorithm using feature descriptor algorithms as-is
`for feature tracking.
`
`ometry between two images through correlation, relaxation,
`and robust model fitting using Harris corners. The feature-
`tracking aspect of our algorithm can be used as input to
`motion estimation algorithms, and therefore complements
`instead of competing with motion estimation research.
`There have been many successful attempts to track ro-
`bust descriptor features. Skrypnyk and Lowe [24] and Bat-
`tiato et al. [1] propose to track SIFT features similar to
`Figure 2. Other algorithms change the location of descrip-
`tor computation by using different interest point detectors.
`For example, Liu et al. [11] compute dense SIFT corre-
`spondences by detecting and matching the descriptors on
`a dense grid. Chehlov et al. [5] describe a Kalman filter
`based SLAM algorithm that uses a corner detector followed
`by SIFT computation on these corners. Similarly, Wagner
`et al. [29] compute FAST corners on the object database
`image and extract descriptors at different scale levels, and
`they also demonstrate an application that performs tracking
`and object recognition in real time.
`Our algorithm is related to these studies, but our ap-
`proach is quite different because we wish to retain all of the
`proven great attributes of robust feature descriptors. We aim
`to perform image recognition similar to Takacs et al. [27]
`where a large image and object database need to be matched
`against the input image. This means that we do not wish to
`replace scale-invariant interest points with generic corner
`detectors when the resulting descriptors would reduce the
`effectiveness of object recognition. From this perspective,
`the quality of object recognition is equally important to the
`real-time tracking requirement, and the tracking algorithm
`should not interfere with the recognition performance.
`
`3. The SURFTrac Algorithm
`Many feature descriptor algorithms consist of two con-
`secutive steps, namely interest point detection followed by
`descriptor computation. An interest point detection algo-
`rithm extracts regions of interest that tend to be repeatable
`and invariant under transformations such as brightness or
`perspective changes.
`In the descriptor computation step,
`each extracted interest point defines a circular or affine re-
`gion from which one descriptor is computed. It is often pos-
`sible to mix and match these two steps of the algorithm, for
`example, one can compute a SIFT descriptor using Hessian-
`Affine interest points.
`
`Image k-1
`
`Interest Point
`Detection
`
`Descriptor
`Computation
`
`Feature
`Descriptors
`
`Image k
`
`Interest Point
`Detection
`
`Descriptor
`Computation
`
`Feature
`Descriptors
`
`Feature
`Matching
` &
`Motion
`Estimation
`
`Estimated
`Motion
`
`Image 0
`
`Interest Point
`Detection
`
`Tracked
`Interest Points
`
`Descriptor
`Computation
`
`Tracked
`Feature
`Descriptors
`
`To Object
`Recognition
`
`Image k>0
`
`Incremental
`Interest Point
`Detection
`
`Tracking
`Interest
`Points
`
`Motion
`Estimation
`
`Estimated
`Motion
`
`Niantic's Exhibit No. 1027
`Page 003
`
`
`
`3.1. Incremental Interest Point Detection
`In order to compute interest points in each video frame
`incrementally, we can predict regions in the Gaussian im-
`age stack where useful features are most likely to appear,
`and compute the response R only in these regions. Let us
`denote the input video sequence as I = {I0, I1,··· IN−1}.
`Given image Ik−1 and one of its interest points pk−1 =
`(xk−1, yk−1, σk−1), assuming we know the relative motion
`M k−1
`(.) between Ik−1 and Ik as a homography, we can
`k
`simply transform pk−1 to its location in frame Ik with
`pk = (xk, yk, σk) = M k−1
`
`(pk−1).
`
`k
`
`(4)
`
`Figure 4. Incremental detection of an interest point. The predicted
`motion M k−1
`is used to transform any interest point pk−1 from
`k
`image Ik−1 to image Ik. This predicted interest point pk forms
`a volume neighborhood Pk in which the new interest point is ex-
`tracted. The image corners are similarly transformed to the new
`image (blue dashed lines).
`
`methods for computing lightweight signatures from the in-
`formation that is already present in the scale space.
`
`Local Curvature Method.
`In SURF, because the re-
`sponse function is an approximation to the determinant of
`the Hessian matrix, we can use the relationship between
`the two principal curvatures of each interest point as a sig-
`nature of the interest point. Given the scale-normalized
`Hessian matrix in Equation 3, we compute its eigenvec-
`tors λ1 and λ2, λ1 > λ2, and measure the curvature ratio
`r1 = λ1/λ2. This is directly related to the edge response
`detection method used in SIFT [12],
`r2 = trace(H)2
`det(H)
`
`.
`
`(6)
`
`(r1 + 1)2
`r1
`Because the components of H are already calculated, com-
`puting ratio r2 is more efficient than r1. We treat the fea-
`ture with the smallest difference in r2 as the matching inter-
`est point, if this difference does not exceed a user-defined
`threshold ∆r2.
`
`=
`
`Obviously, homography is insufficient to model the mo-
`tion of all tracked features. However, if the relative motion
`between Ik−1 and Ik is small, we can still use the homog-
`raphy and expand the point pk into a 3D volume search
`neighborhood Pk
`Pk = {(x(cid:48)
`
`
`
`
`k, y(cid:48)k, σ(cid:48)k) :
`
`(5)
`
`|σ(cid:48)k − σk| ≤ ∆σ,
`
`
`|x(cid:48)k − xk| ≤ γσ(cid:48)k,
`
`
`
`|y(cid:48)k − yk| ≤ γσ(cid:48)k},
`where ∆σ is the search range in the scale space, and γ is
`related to the motion prediction error, as well as disparity of
`the tracked point with respect to the primary planar structure
`of the scene. The search neighborhood Pk corresponds to a
`pyramidal frustum because the interest point sampling rate
`is reduced at higher scale levels. In practice, because of the
`high correlation between images, using a fixed-size search
`neighborhood works fairly well regardless of the actual mo-
`tion. Using Equation 4 and 5, the collection of tracked inter-
`est points {p0
`
`
`k−1, p1k−1,··· pm−1k−1 } from image Ik−1 forms
`
`
`} where
`a joint neighborhood Pk = {P0k ∪ P1k ∪···∪ Pm−1
`k
`useful interest points are most likely to appear. Figure 4
`illustrates the process.
`In addition to Pk, we also need to take into considera-
`tion parts of the image Ik that have not been seen before.
`To this end, we maintain a keyframe ID j, accumulate the
`motion between image Ij and Ik and transform the four cor-
`ners of the image Ij to Ik. When the overlap between the
`keyframe and the current image drops to a certain percent-
`age, we extract interest points from the part of image Ik that
`lies outside of the shrunken quadrilateral. The keyframe ID
`is then updated to the current frame k.
`
`3.2. Tracking Interest Points
`Next we need to match interest points between images
`Ik−1 and Ik. We first assume that if a feature pj
`k−1 in Ik−1
`is still present in Ik, it can only be in region Pj
`k. When
`more than one interest point are detected in this region, we
`need to choose the one that best matches pj
`k−1 without com-
`puting their SURF descriptors. Instead, we investigate two
`
`Normalized-Cross-Correlation (NCC). NCC is a clas-
`sic technique for matching image regions and normally op-
`erates in the pixel intensity domain. With blob-type features
`like the ones used by SURF, this technique is not accurate
`because the image intensities surrounding the neighborhood
`of an interest point may not vary much. In addition, because
`the features are detected in the scale level, the neighbor-
`hood for NCC needs to be adjusted according to the scale
`of the interest points. Therefore NCC in the pixel intensity
`domain is not a suitable algorithm in terms of both perfor-
`mance and match quality. On the other hand, the values of
`the Hessian determinant around each interest point can be
`used as a more stable signature, since the values in the Hes-
`sian domain are independent of scale and relative bright-
`ness changes. This is best illustrated in Figure 5, where
`
`2940
`
`Image Ik-1
`
`Image Ik
`
`pk
`
`Pk
`
`Mk-1
`k
`
`pk-1
`
` scale
`
`Larger
`
`Niantic's Exhibit No. 1027
`Page 004
`
`
`
`Figure 5. Approximate Hessian determinants. Each row represents the Hessian determinant values surrounding an interest point across five
`consecutive video frames. The patterns remain consistent and distinct, allowing easy and reliable matching.
`
`we show values of Hessian determinants over consecutive
`video frames. The consistency and distinctiveness of the
`two patterns are evident.
`
`To perform the NCC in the Hessian domain, for each
`possible pair of interest points, we construct a frustum
`around each interest point in domain R(·) corresponding to
`5 × 5 × 3 grid values, and compute the L2-norm between
`the two grids. This is much more efficient than a regular
`NCC because we only need to compute the dot-products at
`detected interest point locations. Similar to the local curva-
`ture method, we take the best matching interest point that
`passes a NCC threshold ∆N CC as the matching interest
`point. Comparisons between these two tracking measures
`can be found in Section 5.
`
`3.3. Motion Estimation
`
`The interest point detection algorithm described in Equa-
`tion 4 requires the estimation of relative motion. The esti-
`mation of motion M k−1
`consists of the prediction and cor-
`k
`rection steps. The correction step is fairly straightforward—
`given the matching pairs, we use RANSAC to fit a funda-
`mental matrix model and reject incorrect matches accord-
`ingly. The tracking process produces fairly accurate results
`and only a small number of iterations suffices to reject most
`of the false tracking results.
`To compute the joint neighborhood Pk we need to pre-
`dict the homography M k−1
`. Note that the model computed
`k
`in the correction step does not necessarily have to be the ho-
`mography M k−1
`; a more general model used in the correc-
`k
`tion stage allows more valid matches to go into the tracked
`interest point pool. In the simplest form, one can assume
`constant velocity motion and reuse the corrected motion, or
`assume no motion at all. If a valid model can not be found
`in the correction step, we do not have sufficient matches be-
`tween images and in this case Pk falls back to the entire
`scale space.
`
`3.4. Descriptor Computation
`Each tracked feature descriptor is computed from the
`current list of tracked interest points like in the SURF al-
`gorithm. Because of the decoupling, we can choose not to
`compute any descriptors at all and use the SURFTrac al-
`gorithm only as a tracker. When descriptors are required,
`we ensure a smooth frame rate by putting the new inter-
`est points in a priority queue and computing their descrip-
`tors when the time budget allows. In addition, because the
`tracked points may out-live the robustness of the descrip-
`tors, especially because the interest points are not affine-
`invariant, we invalidate old descriptors and place them in
`the priority queue to be refreshed.
`
`4. Real-Time Object Recognition and Tracking
`In order to recognize and label objects in real-time, we
`need to compute and maintain the descriptors from the set
`of tracked features and query the image database. Querying
`the whole database can be slow especially as the size of the
`database grows. In order to speed up the process, we pro-
`pose to organize the images in the database based on their
`spatial relationships, and query only subsets of the database
`that are more likely to contain matching objects. Figure 6
`shows the overview of this process, and Figure 1 shows an
`example output from our system.
`
`4.1. Image Graph Organization
`We propose to organize the database images as follows.
`Given a collection of database images V, we create an undi-
`rected graph G = (V, E) where images form the nodes in
`the graph, and the edges E = {EG ∪ EID} describe the re-
`lationships between the images. An edge eg ∈ EG between
`two images indicates a geometric relationship when these
`two images can be related through standard pairwise im-
`age matching. In this case, a geometric relationship is sim-
`ply the homography transform between these two images.
`Each image is also further identified with one or more ob-
`ject IDs, and two images sharing the same ID are also con-
`
`2941
`
`Niantic's Exhibit No. 1027
`Page 005
`
`
`
`if features corresponding to new object IDs fall onto the
`video frame, we add these new features to the list of tracked
`interest points, and run SURFTrac algorithm again before
`querying into Gk.
`
`5. Results and Discussions
`We now move on to discuss the performance and at-
`tributes of our implementation of SURFTrac algorithm. In
`the following experiments, unless otherwise noted, we pre-
`dicted the motion M k−1
`between consecutive frames using
`k
`constant velocity assumption, and rely on the SURFTrac lo-
`cal neighborhood to locate the updated interest points. We
`also chose to set ∆N CC and ∆r2 to infinity, which means
`we always select one best matching feature for each local
`neighborhood. We implement our methods on top of the
`SURF algorithm in [6] to obtain fair performance compar-
`isons between these two algorithms.
`
`5.1. Interest Point Tracking
`To evaluate the incremental interest point detection al-
`gorithm, we first produce a synthetic image sequence by
`rendering a rectangular texture in 3D and varying the vir-
`tual camera parameters to simulate the pure translation and
`rotation of the real camera. Using a synthetic sequence al-
`lows us to measure the impact of each individual extrinsic
`parameter independently. We use either the NCC or the Lo-
`cal Curvature method to track the interest points. Then, we
`apply homography RANSAC to filter out the false-positive
`matches. We measure the accuracy of the tracking method
`as a percentage of the matches that passed RANSAC. The
`accuracy under different camera movements are shown in
`Figure 7 where the camera moves in x direction with a con-
`stant velocity. For the local SURFTrac neighborhood, we
`always use ∆σ = 1 and γ = 6, which for the base level
`corresponds to 16× 16 pixels. With an accuracy up to 90%,
`the NCC method proved to be much better than the Local
`Curvature method.
`Figure 8 shows how the accuracy changes according
`to the amount of camera movement around different axes.
`SURFTrac appears to cope with most of the camera move-
`ments quite well within reasonable range of motion. How-
`ever, we can see that the amount of change in roll severely
`affects the SURFTrac accuracy in both NCC and Local Cur-
`vature methods.
`
`5.2. Mobile Phone Performance
`We implemented SURFTrac on a Nokia N95 mobile
`phone to analyze and compare the speed of SURFTrac
`against regular SURF in both interest point detection and
`tracking using a frame-by-frame matching method similar
`to Skrypnyk and Lowe [24] and Battiato et al. [1], as de-
`scribed in Figure 2. The N95 contains an ARM-11 based
`
`2942
`
`Figure 6. Real-time tracking and recognition pipeline.
`
`nected by an additional edge eid ∈ EID. This organization
`is similar to Simon et al. [21] where a graph of images is
`constructed for hierarchical browsing purposes. It is partic-
`ularly suitable for our purposes where the images are taken
`from real-world environments such as streets and buildings.
`An example of the image graph is shown in Figure 1(b).
`
`4.2. Initialization
`During initialization, we compute the full SURF fea-
`ture descriptors from the first video image and match them
`against images in G using a method similar to Takacs et
`al. [27]. This method constructs an approximate nearest
`neighbor tree for all the image features in the database
`followed by geometric verification (RANSAC). Upon suc-
`cessfully identifying the matching images, the best image
`vk ∈ V is marked as the current keynode, and the set of
`images in-play is reduced to only those images that are con-
`nected to vk by a path in G, as shown in Figure 1(b). Once
`a keynode image and its object ID are identified, we can
`continuously match and update the keynode at a relatively
`low cost since we are fairly confident all potentially relevant
`objects are included in the current database subgraph.
`The next task involves computing the placement of the
`labels. In this step, we group all the matching features ac-
`cording to their respective object IDs, and render one object
`ID label at the geometric centroid of each feature group.
`The locations of the labels remain static with respect to the
`feature group until a new keynode image is chosen.
`
`4.3. Object Tracking
`At every new video frame, we run SURFTrac to update
`the interest points, compute the homography against the
`previous video frame, and update the label location accord-
`ingly. Because SURFTrac continues to run in every input
`video frame, we only need to run the object recognition al-
`gorithm occasionally for newly revealed objects. In order
`to avoid running the matching algorithm frequently, we re-
`project all features in Vk to the current video frame, and
`
`Capture
`Image
`
`Detect Features
`& Query Image
`
`
`Initialization
`
`No
`
`Yes
`
`Found
`Matches?
`
`Determine
`Keynode
`
`
`
`Render
`Labels
`
`Object ID
`Database
`
`
`Image
`
`Database
`
`Object
`Tracking
`
`Project Subgraph
`
`Features to
`Current Image
`
`New Object
`
`ID in View?
`
`
`Yes
`
`No
`
`
`Update Tracked
`Interest Points
`
`
`
`SURFTrac and
`Query Image
`
`
`
`Switch to New
`Keynode
`
`
`
`Update Label
`Locations & Render
`
`
` Update Feature
`Locations
`
`
`Track Known Features
`with SURFTrac
`
`Capture
`Image
`
`Niantic's Exhibit No. 1027
`Page 006
`
`
`
`Figure 7. Accuracy comparison between NCC and Local Curva-
`ture (r2) matching methods.
`
`Texas Instrument OMAP2 processor running at 330MHz.
`Our test results use 256× 192 live video from the N95 cam-
`era. The local neighborhood Pk is chosen experimentally
`to balance the tradeoff between the computational cost and
`the tracking consistency, and varied from 12× 12× 3 in the
`lowest scale to 32 × 32 × 3 in the highest scale.
`Table 1 summarizes the average processing time per
`frame in different methods. Tracking features with the
`SURF algorithm takes 0.678 seconds even with fairly op-
`timized code. Detecting interest points alone with SURF
`takes 0.357 seconds, which is over 3× slower than SURF-
`Trac even though the latter provides additional tracking re-
`sults. Overall for tracking purposes, we achieve over 5×
`speedup compared to SURF.
`the Local Curvature
`Without geometric verification,
`method is slightly faster than the NCC method. However,
`when geometric verification is enabled, both methods are
`equally efficient because in the case of Local Curvature
`method, RANSAC requires more iterations to converge due
`to its much larger false-positive ratio.
`
`5.3. Outdoor tracking
`
`Figure 1 shows the outdoor tracking and label augmen-
`tation results on the N95 phone. We use a small set of
`keynodes in the database with about three thousand fea-
`tures matched across all keynodes. We use NCC for this
`purpose because it generally outperforms the Local Curva-
`ture method. We observe that, in this application context,
`the user tends to hold and move the mobile device paral-
`lel to the ground plane, and therefore the amount of camera
`roll is fairly insignificant. Our implementation reached 1.4
`FPS during initialization or recovery when the full-frame
`feature detection and querying with database is needed. Af-
`ter the first keynode is matched, the tracking and subgraph
`matching runs at about 6-7 FPS, including capturing and
`rendering of the image and label overlay.
`
`Figure 8. Impact of change in each camera parameter on neighbor-
`hood matching results of NCC and LC (r2).
`
`5.4. Limitations
`
`Our tracking method inherits similar limitations as the
`traditional active search strategy which depends very much
`on the search region’s size, the camera motion and the distri-
`bution of features in the scene. In the outdoor environment,
`the tracker fails when, for example, the camera pans to an
`empty space between two buildings. In addition, because
`the interest points are localized and interpolated quadrati-
`cally in the scale space, their positional accuracies do not
`rival traditional corner detectors where sub-pixel accuracies
`are commonly achieved.
`In terms of label placement, because we have multiple
`keynodes per building, a label may jitter when a new keyn-
`ode is selected. A label can also drift after a while, but its
`
`2943
`
`NCC
`
`r2
`
`1
`
`11
`
`21
`
`31
`
`41
`
`51
`
`61
`
`71
`
`81
`
`frame
`
`100
`
`90
`
`80
`
`70
`
`60
`
`50
`
`40
`
`30
`
`20
`
`10
`
`0
`
`accuracy (%)
`
`Impact of z-translation
`
`NCC
`
`r2
`
`1
`
`2
`
`3
`
`4
`
`5
`
`6
`
`7
`
`8
`
`9
`
`10 11 12
`
`∆distance
`(mm)
`
`100
`
`90
`
`80
`
`70
`
`60
`
`50
`
`40
`
`30
`
`20
`
`10
`
`0
`
`accuracy (%)
`
`Impact of panning
`
`NCC
`
`r2
`
`1
`
`2
`
`3
`
`4
`
`5
`
`6
`
`7
`
`8
`
`9
`
`10 11 12
`
`∆angle
`(degree)
`
`100
`
`90
`
`80
`
`70
`
`60
`
`50
`
`40
`
`30
`
`20
`
`10
`
`0
`
`accuracy (%)
`
`Impact of tilting
`
`NCC
`
`r2
`
`∆angle
`(degree)
`
`1
`
`2
`
`3
`
`4
`
`5
`
`6
`
`7
`
`8
`
`9
`
`10 11 12
`
`100
`
`90
`
`80
`
`70
`
`60
`
`50
`
`40
`
`30
`
`20
`
`10
`
`0
`
`accuracy (%)
`
`Impact of rolling
`
`NCC
`
`r2
`
`∆angle
`(degree)
`
`1
`
`2
`
`3
`
`4
`
`5
`
`6
`
`7
`
`8
`
`9 10 11 12 13
`
`90
`
`80
`
`70
`
`60
`
`50
`
`40
`
`30
`
`20
`
`10
`
`0
`
`accuracy (%)
`
`Niantic's Exhibit No. 1027
`Page 007
`
`
`
`SURF
`
`SURFTrac
`
`Time (sec)
`Methods
`0.678
`Matching and Tracking
`0.357
`Interest Point only
`0.115
`NCC only
`0.133
`NCC + RANSAC
`0.111
`Local Curvature only
`Local Curvature + RANSAC 0.134
`
`Table 1. Speed comparison of SURFTrac algorithm versus SURF
`algorithm. For tracking purposes we achieve over 5× speedup
`compared to SURF.
`
`position will be recovered appropriately when the tracker
`switches to a new keynode from the database.
`
`6. Conclusion
`We presented the SURFTrac algorithm, an efficient
`method for tracking scale-invariant interest points without
`computing their descriptors. We also demonstrate a frame-
`work for outdoor tracking using SURFTrac