SURFTrac: Efficient Tracking and Continuous Object Recognition using Local Feature Descriptors

Duy-Nguyen Ta
Georgia Institute of Technology
duynguyen@gatech.edu

Wei-Chao Chen    Natasha Gelfand    Kari Pulli
Nokia Research Center, Palo Alto
{wei-chao.chen,natasha.gelfand,kari.pulli}@nokia.com

Abstract
`
We present an efficient algorithm for continuous image
recognition and feature descriptor tracking in video, which
operates by reducing the search space of possible interest
points inside the scale-space image pyramid. Instead of
`performing tracking in 2D images, we search and match
`candidate features in local neighborhoods inside the 3D im-
`age pyramid without computing their feature descriptors.
`The candidates are further validated by fitting to a motion
`model. The resulting tracked interest points are more re-
`peatable and resilient to noise, and descriptor computa-
`tion becomes much more efficient because only those areas
`of the image pyramid that contain features are searched.
`We demonstrate our method on real-time object recognition
`and label augmentation running on a mobile device.
`
`1. Introduction
`Robust feature descriptors such as SIFT [12], SURF [2],
`and GLOH [16] have become a core component in applica-
`tions such as image recognition [8], multi-view stereo [25],
`and image registration [4, 26]. These descriptors are sta-
`ble under viewpoint and lighting changes, so they are able
`to cope with significant amounts of image variability. At
`the same time, discriminative power is achieved by rep-
`resenting feature points as high-dimensional vectors. The
`combination of robustness and discriminative power makes
`these methods ideally suited for searching large heteroge-
`neous image databases.
`We are interested in using robust feature descriptors
`in real-time object recognition and tracking applications.
`Given a database of images of labeled objects, such as build-
`ings in an outdoor environment, and an input video stream,
`we would like to recognize the objects present in the video,
`augment the video stream with labels indicating the ob-
`jects, and maintain and update those labels as the video pans
`across the scene. Figure 1 shows an example of such an ap-
plication. This task can be thought of as having two components. Image matching against a database is performed
`to identify new objects that appear in the video and object
`tracking is performed to update the positions of the labels
`of recognized objects in the consecutive video frames. Ro-
`bust feature points with high dimensional descriptors per-
form best for image recognition; therefore, we would like to
`compute and track them at interactive frame rates.
`To track robust features across video frames, we can per-
`form pairwise image matching given any two consecutive
`video frames, as done by Battiato et al. [1] and Skrypnyk
`and Lowe [24]. Figure 2 describes the algorithm flow of this
`approach. The main drawback of frame-to-frame matching
`is the wasted computation as this approach does not exploit
`coherence in the video. Most of the time the features are
`used for tracking purposes only, as there is no need to per-
`form an image recognition step, unless a significant change
`is detected in the current frame. The expensive robustness
`properties of the descriptors are not needed for frame-to-
`frame matching, since consecutive video frames are likely
`to be very similar. Furthermore, image noise means that
`many of the descriptors are transient and will be thrown
`away when consecutive frames are matched. Therefore,
`performing detection and computation of robust descriptors
`for each frame of the video is unnecessary and can also be
`difficult to perform at interactive frame rates.
`Alternatively, one can imagine a hybrid algorithm where
`motion estimation is done through lightweight features such
`as FAST corners [19, 20] or Harris corners [9] and object
`recognition is done through robust features. When enough
`motion is accumulated to warrant a recognition step, one
`can run a standard image matching algorithm against the
`image database to detect what is present in the scene. How-
`ever, locations of the robust features such as SIFT and
`SURF are unlikely to coincide with the features used for
`tracking. This is a problem since the position of any aug-
`mentation that results from the matching step, such as build-
`ing labels, can not be transferred to successive frames with-
`out knowing the scene structure. A solution to this is to
`compute the robust descriptors at corner locations that are
`used for tracking [5, 29]. These approaches, even though
`
Figure 1. Our algorithm running in an outdoor environment. (a) Detected interest points and traced trajectories from a video sequence (Section 3). (b) Above: video frames overlaid with their object labels, zoomed in. Below: matching images and their partial subgraph from the database. The solid and dashed edges indicate geometric and object ID relationships, respectively (Section 4).
`
`fairly efficient in terms of tracking, require extracting multi-
`ple descriptors per corner, because the corners are not scale-
`invariant. The SIFT Flow approach by Liu et al. [11] pro-
`duces a flow field of SIFT descriptors with a fairly high
`density, but this requires computing feature descriptors at
`a much higher density than produced by the difference-
`of-Gaussians algorithm of standard SIFT. Furthermore, the
`performance of SIFT and SURF is related to the distinctive-
`ness of the computed interest points, so using descriptors
`computed at alternative locations can negatively impact the
`recognition performance of the algorithm.
`
`Contributions. To enable tracking of robust descriptors
`at interactive frame rates, we propose to exploit coherency
`in video frames to detect interest points at locations in each
frame where they are most likely to appear. Instead of tracking features in 2D images, we perform our tracking in the scale-space image pyramid, and we achieve the
`robustness of the direct frame-to-frame matching method
`while reducing computation significantly. The detected in-
`terest points are scale-invariant, and are inherently matched
`and tracked across video frames. We decouple the descrip-
`tor computation from the interest point detection so that fea-
`ture descriptors are computed only for the purpose of object
`recognition. We demonstrate that our algorithm runs in real-
`time on mobile phones, and we discuss applications such
`as real-time augmentation and object recognition. Figure 3
`shows a functional overview of our algorithm.
`
`2. Related Work
Robust Feature Detectors and Descriptors. There are several approaches to scale-invariant interest point detection, including the Difference-of-Gaussians in SIFT by Lowe [12], maximally stable extremal regions (MSER) by Matas et al. [14], Harris-Affine and Hessian-Affine corners by Mikolajczyk and Schmid [15], and Hessians approximated using a Haar basis by Bay et al. [2]. Mikolajczyk et al. [17] performed a comprehensive comparison study of these detectors.
`Robust descriptor algorithms take the output of region
`detectors, construct a canonical frame, and then extract a
`high-dimensional feature vector for each region. The de-
`scriptors are designed to be invariant to lighting, scale,
`and rotational changes. Examples include the very popular
`SIFT [12] and SURF [2]. Refer to Mikolajczyk et al. [16]
`for a comparative study on the descriptors.
`There are also many successful attempts to speed up the
`descriptor computation algorithm. Sinha et al. [22] describe
`an efficient SIFT implementation on graphics processing
`unit (GPU). Grabner et al. [7] propose to speed up SIFT
computation using integral images. Our SURFTrac algorithm improves on the baseline SURF algorithm, which is our primary benchmark; additional speedups similar to these approaches can be applied to gain even more efficiency.
`
`Applications of Feature Descriptors. Many researchers
`have applied feature descriptors in object recognition and
`image retrieval. For example, Sivic and Zisserman [23] ap-
`plied SIFT for efficient keyframe retrieval in video. Grau-
`man and Darrell [8] proposed a fast kernel-based object
identification method using SIFT features. Nistér and Stewénius [18] used hierarchical k-means to construct a
`search tree of local features for fast and efficient image re-
`trieval.
`
`Motion Estimation and Tracking. There is an abundant
`body of prior art on video tracking and motion estimation.
`Refer to Torr and Zisserman [28] for a summary of feature-
`based motion estimation methods. Optical flow algorithms
`such as the classic examples by Lucas and Kanade [13] and
Horn and Schunck [10] can also be used as input for a motion
estimation algorithm. Zhang et al. [30] recover epipolar
`
`Figure 3. The SURFTrac algorithm overview. For the first image,
`the algorithm initializes a list of interest points by performing a
`full detection. The interest points are then updated and tracked
`upon receiving new images. The descriptors are for recognition
`purposes and the algorithm computes them as needed.
`
`
`Figure 2. An algorithm using feature descriptor algorithms as-is
`for feature tracking.
`
geometry between two images through correlation, relaxation,
`and robust model fitting using Harris corners. The feature-
`tracking aspect of our algorithm can be used as input to
`motion estimation algorithms, and therefore complements
rather than competes with, motion estimation research.
`There have been many successful attempts to track ro-
`bust descriptor features. Skrypnyk and Lowe [24] and Bat-
`tiato et al. [1] propose to track SIFT features similar to
`Figure 2. Other algorithms change the location of descrip-
`tor computation by using different interest point detectors.
`For example, Liu et al. [11] compute dense SIFT corre-
`spondences by detecting and matching the descriptors on
`a dense grid. Chehlov et al. [5] describe a Kalman filter
`based SLAM algorithm that uses a corner detector followed
`by SIFT computation on these corners. Similarly, Wagner
`et al. [29] compute FAST corners on the object database
`image and extract descriptors at different scale levels, and
`they also demonstrate an application that performs tracking
`and object recognition in real time.
`Our algorithm is related to these studies, but our ap-
`proach is quite different because we wish to retain all of the
proven strengths of robust feature descriptors. We aim
`to perform image recognition similar to Takacs et al. [27]
`where a large image and object database need to be matched
`against the input image. This means that we do not wish to
`replace scale-invariant interest points with generic corner
`detectors when the resulting descriptors would reduce the
`effectiveness of object recognition. From this perspective,
the quality of object recognition is as important as the
`real-time tracking requirement, and the tracking algorithm
`should not interfere with the recognition performance.
`
3. The SURFTrac Algorithm

Many feature descriptor algorithms consist of two consecutive steps, namely interest point detection followed by descriptor computation. An interest point detection algorithm extracts regions of interest that tend to be repeatable and invariant under transformations such as brightness or perspective changes. In the descriptor computation step, each extracted interest point defines a circular or affine region from which one descriptor is computed. It is often possible to mix and match these two steps of the algorithm; for example, one can compute a SIFT descriptor using Hessian-Affine interest points.

To achieve scale-invariance, many interest point detection algorithms use image pyramids during the detection phase; these include Hessian-Affine, Harris-Affine, and the approximate Hessian. In this process, an image pyramid is formed by downsampling the input image to progressively lower resolutions. For the purpose of our discussion, we treat the image pyramid as a stack of same-sized images Sk(·), each filtered from the original image Ik with a zero-mean Gaussian of a different scale,

Sk(x, y, σ) = Ik(x, y) ∗ G(0, σ²).    (1)

The interest point response R has the same image-stack layout, and is computed by applying the response function f(·) over the stack of images S,

Rk(x, y, σ) = f(Sk(x, y, σ)).    (2)

Local maxima of R represent relatively stable regions and are extracted as interest points. Because the bandwidth of R is lower at higher values of σ, the sampling rate for maxima computation is naturally reduced at higher σ to increase efficiency. The extracted interest points are then refined with smooth local interpolation [3].

As described before, we plan to extend interest point detection algorithms such that the interest points are tracked efficiently across video frames. Although our proposal is adaptable to many algorithms that use image pyramids, we focus the remainder of our discussion on the approximate Hessian detector in SURF [2] because of its efficiency and good interest point repeatability. More importantly, with SURF we can compute part of Rk(·) directly, without having to produce the intermediate Gaussian stack Sk(·). The algorithm computes the scale-normalized Hessian matrix

Hk(x, y, σ) = σ² [ ∂²Sk/∂x²    ∂²Sk/∂x∂y ]
                 [ ∂²Sk/∂x∂y   ∂²Sk/∂y²  ],    (3)

with every entry evaluated at (x, y, σ), and the response function is the determinant of H(·), Rk(x, y, σ) = det(Hk(x, y, σ)). In the following sections, the use of the Haar-wavelet approximation in SURF is implied when we refer to the Hessian matrix. In the remainder of the section we describe the algorithm shown in Figure 3.
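For concreteness, the following is a minimal NumPy/SciPy sketch of Equations 1–3 that builds a determinant-of-Hessian response stack with true Gaussian derivatives; SURF instead approximates these derivatives with box filters over an integral image, so the direct filtering and the function name here are illustrative only.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_response_stack(image, sigmas):
    """Eq. (1)-(3) sketch: a scale-normalized determinant-of-Hessian response stack.

    Each level filters the same image with a Gaussian of scale sigma and computes
    det(sigma^2 * H); SURF approximates the second derivatives with Haar-like box
    filters instead of exact Gaussian derivatives.
    """
    img = np.asarray(image, dtype=np.float64)
    levels = []
    for s in sigmas:
        d2x = gaussian_filter(img, s, order=(0, 2))   # second derivative along x (columns)
        d2y = gaussian_filter(img, s, order=(2, 0))   # second derivative along y (rows)
        dxy = gaussian_filter(img, s, order=(1, 1))   # mixed derivative
        # det(sigma^2 * H) = sigma^4 * (Sxx * Syy - Sxy^2)
        levels.append((s ** 4) * (d2x * d2y - dxy ** 2))
    return np.stack(levels)   # shape: (num_scales, height, width)

# Interest points would then be local maxima of this stack across x, y, and sigma.
```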
`
[Figure 2 diagram: for each image pair (Image k−1, Image k), Interest Point Detection → Descriptor Computation → Feature Descriptors, feeding Feature Matching & Motion Estimation → Estimated Motion.]

[Figure 3 diagram: Image 0 → Interest Point Detection; Image k>0 → Incremental Interest Point Detection → Tracked Interest Points → Motion Estimation → Estimated Motion; Tracked Interest Points → Descriptor Computation → Tracked Feature Descriptors → To Object Recognition.]
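Read alongside Figure 3, the per-frame control flow can be summarized by the schematic sketch below; the step implementations are supplied by the caller as callables, since they correspond to the procedures described in Sections 3.1–3.4 rather than to any existing library calls.

```python
def surftrac_loop(frames, full_detection, incremental_detection,
                  track, estimate_motion, compute_descriptors):
    """Schematic per-frame loop of Figure 3; the callables stand in for Sections 3.1-3.4."""
    tracked = full_detection(frames[0])          # full interest point detection on image 0
    motion = None                                # no motion estimate yet
    for frame in frames[1:]:
        # Section 3.1: detect only inside neighborhoods predicted from the tracked points
        candidates = incremental_detection(frame, tracked, motion)
        # Section 3.2: match candidates to the tracked interest points
        tracked, matches = track(tracked, candidates)
        # Section 3.3: RANSAC correction and motion prediction for the next frame
        motion = estimate_motion(matches, motion)
        # Section 3.4: descriptors are computed only as the recognition step needs them
        compute_descriptors(tracked)
        yield tracked, motion
```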
`
`

`

`3.1. Incremental Interest Point Detection
`In order to compute interest points in each video frame
`incrementally, we can predict regions in the Gaussian im-
`age stack where useful features are most likely to appear,
`and compute the response R only in these regions. Let us
`denote the input video sequence as I = {I0, I1, · · · IN−1}.
Given image Ik−1 and one of its interest points pk−1 = (xk−1, yk−1, σk−1), and assuming we know the relative motion M_k^{k−1}(·) between Ik−1 and Ik as a homography, we can simply transform pk−1 to its location in frame Ik with

pk = (xk, yk, σk) = M_k^{k−1}(pk−1).    (4)
`
Figure 4. Incremental detection of an interest point. The predicted motion M_k^{k−1} is used to transform an interest point pk−1 from image Ik−1 to image Ik. This predicted interest point pk forms a volume neighborhood Pk in which the new interest point is extracted. The image corners are similarly transformed to the new image (blue dashed lines).
`
`Obviously, homography is insufficient to model the mo-
`tion of all tracked features. However, if the relative motion
`between Ik−1 and Ik is small, we can still use the homog-
`raphy and expand the point pk into a 3D volume search
`neighborhood Pk
Pk = {(x′k, y′k, σ′k) : |x′k − xk| ≤ γσ′k, |y′k − yk| ≤ γσ′k, |σ′k − σk| ≤ ∆σ},    (5)
`where ∆σ is the search range in the scale space, and γ is
`related to the motion prediction error, as well as disparity of
`the tracked point with respect to the primary planar structure
`of the scene. The search neighborhood Pk corresponds to a
`pyramidal frustum because the interest point sampling rate
`is reduced at higher scale levels. In practice, because of the
`high correlation between images, using a fixed-size search
`neighborhood works fairly well regardless of the actual mo-
tion. Using Equations 4 and 5, the collection of tracked interest points {p^0_{k−1}, p^1_{k−1}, · · · , p^{m−1}_{k−1}} from image Ik−1 forms a joint neighborhood Pk = P^0_k ∪ P^1_k ∪ · · · ∪ P^{m−1}_k in which useful interest points are most likely to appear. Figure 4
`illustrates the process.
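As a concrete illustration of Equations 4 and 5, the sketch below transforms an interest point with the predicted homography using OpenCV and returns, for each nearby scale level, search bounds whose half-width γσ′ grows with scale, so the stacked boxes form the frustum of Figure 4. The function name and default parameter values are illustrative assumptions.

```python
import numpy as np
import cv2

def search_neighborhood(p_prev, H_pred, scale_levels, delta_sigma=1.0, gamma=6.0):
    """Eq. (4)-(5) sketch: predict (x_k, y_k) with the homography and expand the
    point into a frustum-shaped neighborhood over nearby scale levels."""
    x, y, sigma = p_prev
    xk, yk = cv2.perspectiveTransform(np.float32([[[x, y]]]), H_pred)[0, 0]   # Eq. (4)
    neighborhood = {}
    for s in scale_levels:
        if abs(s - sigma) <= delta_sigma:          # |sigma' - sigma| <= delta_sigma
            half = gamma * s                       # half-width grows with scale
            neighborhood[s] = (xk - half, xk + half, yk - half, yk + half)
    return (float(xk), float(yk), sigma), neighborhood
```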
`In addition to Pk, we also need to take into considera-
`tion parts of the image Ik that have not been seen before.
`To this end, we maintain a keyframe ID j, accumulate the
motion between images Ij and Ik, and transform the four corners of image Ij to Ik. When the overlap between the keyframe and the current image drops below a certain percent-
`age, we extract interest points from the part of image Ik that
`lies outside of the shrunken quadrilateral. The keyframe ID
`is then updated to the current frame k.
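The keyframe test can be sketched as follows, assuming the accumulated keyframe-to-current motion is available as a 3×3 homography; Shapely is used only for the polygon intersection, and the overlap threshold is an assumed value rather than the one used in our implementation.

```python
import numpy as np
import cv2
from shapely.geometry import Polygon, box

def needs_fresh_detection(H_key_to_cur, width, height, min_overlap=0.8):
    """Warp the keyframe corners into the current image and report whether the
    overlap has dropped enough to trigger detection in the newly revealed area.
    Assumes the warped quadrilateral remains a simple polygon."""
    corners = np.float32([[[0, 0]], [[width, 0]], [[width, height]], [[0, height]]])
    warped = cv2.perspectiveTransform(corners, H_key_to_cur).reshape(-1, 2)
    overlap = Polygon(warped).intersection(box(0, 0, width, height)).area
    return (overlap / (width * height)) < min_overlap
```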
`
`3.2. Tracking Interest Points
Next we need to match interest points between images Ik−1 and Ik. We first assume that if a feature p^j_{k−1} in Ik−1 is still present in Ik, it can only be in region P^j_k. When more than one interest point is detected in this region, we need to choose the one that best matches p^j_{k−1} without computing their SURF descriptors. Instead, we investigate two methods for computing lightweight signatures from the information that is already present in the scale space.

Local Curvature Method. In SURF, because the response function is an approximation to the determinant of the Hessian matrix, we can use the relationship between the two principal curvatures of each interest point as a signature of the interest point. Given the scale-normalized Hessian matrix in Equation 3, we compute its eigenvalues λ1 and λ2, λ1 > λ2, and measure the curvature ratio r1 = λ1/λ2. This is directly related to the edge-response detection method used in SIFT [12],

r2 = trace(H)² / det(H) = (r1 + 1)² / r1.    (6)

Because the components of H are already calculated, computing the ratio r2 is more efficient than r1. We treat the feature with the smallest difference in r2 as the matching interest point, if this difference does not exceed a user-defined threshold ∆r2.

Normalized-Cross-Correlation (NCC). NCC is a classic technique for matching image regions and normally operates in the pixel intensity domain. With blob-type features like the ones used by SURF, this technique is not accurate because the image intensities surrounding the neighborhood of an interest point may not vary much. In addition, because the features are detected in scale space, the neighborhood for NCC needs to be adjusted according to the scale of the interest points. Therefore NCC in the pixel intensity domain is not a suitable algorithm in terms of either performance or match quality. On the other hand, the values of the Hessian determinant around each interest point can be used as a more stable signature, since the values in the Hessian domain are independent of scale and relative brightness changes. This is best illustrated in Figure 5, where
`
we show values of Hessian determinants over consecutive video frames. The consistency and distinctiveness of the two patterns are evident.

Figure 5. Approximate Hessian determinants. Each row represents the Hessian determinant values surrounding an interest point across five consecutive video frames. The patterns remain consistent and distinct, allowing easy and reliable matching.
`
To perform the NCC in the Hessian domain, for each possible pair of interest points, we construct a frustum around each interest point in the domain R(·), corresponding to a 5 × 5 × 3 grid of values, and compute the L2 norm between the two grids. This is much more efficient than a regular NCC because we only need to compute the dot products at detected interest point locations. Similar to the Local Curvature method, we take the best-matching interest point that passes an NCC threshold ∆NCC as the matching interest point. Comparisons between these two tracking measures
`can be found in Section 5.
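Both signatures use only values the detector has already produced. The sketch below computes the curvature ratio of Equation 6 from the Hessian entries and compares 5×5×3 grids of Hessian-determinant responses; the (scale, y, x) indexing, the L2 distance, and the mean/std normalization are illustrative choices consistent with the description above, not a verbatim implementation.

```python
import numpy as np

def curvature_signature(sxx, syy, sxy):
    """Eq. (6): r2 = trace(H)^2 / det(H); det is positive at blob-like interest points."""
    trace = sxx + syy
    det = sxx * syy - sxy ** 2
    return trace ** 2 / det

def hessian_grid_distance(response_stack, p_a, p_b, half=2, depth=1):
    """Compare 5x5x3 grids of Hessian-determinant responses around two candidates."""
    def grid(p):
        s, y, x = p                                    # indices into (scale, y, x)
        return response_stack[s - depth:s + depth + 1,
                              y - half:y + half + 1,
                              x - half:x + half + 1].astype(np.float64).ravel()
    ga, gb = grid(p_a), grid(p_b)
    ga = (ga - ga.mean()) / (ga.std() + 1e-12)         # normalize each grid
    gb = (gb - gb.mean()) / (gb.std() + 1e-12)
    return float(np.linalg.norm(ga - gb))              # smaller distance means better match
```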
`
`3.3. Motion Estimation
`
The interest point detection algorithm described in Equation 4 requires an estimate of the relative motion. The estimation of M_k^{k−1} consists of a prediction and a correction step. The correction step is fairly straightforward: given the matching pairs, we use RANSAC to fit a fundamental-matrix model and reject incorrect matches accordingly. The tracking process produces fairly accurate results, and a small number of iterations suffices to reject most of the false tracking results.

To compute the joint neighborhood Pk we need to predict the homography M_k^{k−1}. Note that the model computed in the correction step does not necessarily have to be the homography M_k^{k−1}; a more general model used in the correction stage allows more valid matches to go into the tracked interest point pool. In the simplest form, one can assume constant-velocity motion and reuse the corrected motion, or assume no motion at all. If a valid model cannot be found in the correction step, we do not have sufficient matches between images, and in this case Pk falls back to the entire scale space.
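A minimal OpenCV sketch of the correction and prediction steps follows: RANSAC over a fundamental-matrix model rejects bad tracks, and the prediction simply reuses the previous corrected motion under the constant-velocity assumption. Fitting a homography to the inliers for the next prediction is a simplification made for this sketch, not something prescribed above.

```python
import numpy as np
import cv2

def correct_and_predict(prev_pts, cur_pts, prev_H=None):
    """Section 3.3 sketch: RANSAC correction followed by a simple motion prediction."""
    prev_pts = np.float32(prev_pts)
    cur_pts = np.float32(cur_pts)
    if len(prev_pts) < 8:
        # Too few matches for a model; the caller would fall back to the full scale space.
        return np.zeros(len(prev_pts), dtype=bool), np.eye(3)
    _, mask = cv2.findFundamentalMat(prev_pts, cur_pts, cv2.FM_RANSAC, 3.0, 0.99)
    inliers = mask.ravel().astype(bool) if mask is not None else np.zeros(len(prev_pts), dtype=bool)
    H_pred = prev_H if prev_H is not None else np.eye(3)
    if inliers.sum() >= 4:
        H, _ = cv2.findHomography(prev_pts[inliers], cur_pts[inliers], cv2.RANSAC, 3.0)
        if H is not None:
            H_pred = H          # reused as the constant-velocity prediction for the next frame
    return inliers, H_pred
```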
`
`3.4. Descriptor Computation
Each tracked feature descriptor is computed from the current list of tracked interest points as in the SURF algorithm. Because of the decoupling, we can choose not to
`compute any descriptors at all and use the SURFTrac al-
`gorithm only as a tracker. When descriptors are required,
`we ensure a smooth frame rate by putting the new inter-
`est points in a priority queue and computing their descrip-
`tors when the time budget allows. In addition, because the
`tracked points may out-live the robustness of the descrip-
`tors, especially because the interest points are not affine-
`invariant, we invalidate old descriptors and place them in
`the priority queue to be refreshed.
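The decoupled descriptor computation can be organized with a time-budgeted priority queue, as in the sketch below; the budget value, the priority convention (lower is more urgent), and the compute_descriptor callback are illustrative assumptions.

```python
import heapq
import time

class DescriptorScheduler:
    """Section 3.4 sketch: compute descriptors for queued interest points only while
    the per-frame time budget allows; expired descriptors are simply re-enqueued."""

    def __init__(self, compute_descriptor, budget_seconds=0.01):
        self.compute_descriptor = compute_descriptor   # e.g. a SURF descriptor routine
        self.budget_seconds = budget_seconds
        self._queue = []                               # heap of (priority, counter, point)
        self._counter = 0                              # tie-breaker for equal priorities

    def enqueue(self, point, priority=0.0):
        heapq.heappush(self._queue, (priority, self._counter, point))
        self._counter += 1

    def run(self, frame):
        start, computed = time.time(), []
        while self._queue and (time.time() - start) < self.budget_seconds:
            _, _, point = heapq.heappop(self._queue)
            computed.append((point, self.compute_descriptor(frame, point)))
        return computed
```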
`
`4. Real-Time Object Recognition and Tracking
`In order to recognize and label objects in real-time, we
`need to compute and maintain the descriptors from the set
`of tracked features and query the image database. Querying
`the whole database can be slow especially as the size of the
`database grows. In order to speed up the process, we pro-
`pose to organize the images in the database based on their
`spatial relationships, and query only subsets of the database
`that are more likely to contain matching objects. Figure 6
`shows the overview of this process, and Figure 1 shows an
`example output from our system.
`
`4.1. Image Graph Organization
`We propose to organize the database images as follows.
`Given a collection of database images V, we create an undi-
`rected graph G = (V, E) where images form the nodes in
`the graph, and the edges E = {EG ∪ EID} describe the re-
`lationships between the images. An edge eg ∈ EG between
`two images indicates a geometric relationship when these
`two images can be related through standard pairwise im-
`age matching. In this case, a geometric relationship is sim-
`ply the homography transform between these two images.
`Each image is also further identified with one or more ob-
`ject IDs, and two images sharing the same ID are also con-
nected by an additional edge eid ∈ EID. This organization is similar to Simon et al. [21], where a graph of images is constructed for hierarchical browsing purposes. It is particularly suitable for our purposes, where the images are taken from real-world environments such as streets and buildings. An example of the image graph is shown in Figure 1(b).

4.2. Initialization

During initialization, we compute the full SURF feature descriptors from the first video image and match them against the images in G using a method similar to Takacs et al. [27]. This method constructs an approximate nearest-neighbor tree for all the image features in the database, followed by geometric verification (RANSAC). Upon successfully identifying the matching images, the best image vk ∈ V is marked as the current keynode, and the set of images in play is reduced to only those images that are connected to vk by a path in G, as shown in Figure 1(b). Once a keynode image and its object ID are identified, we can continuously match and update the keynode at a relatively low cost, since we are fairly confident that all potentially relevant objects are included in the current database subgraph.

The next task involves computing the placement of the labels. In this step, we group all the matching features according to their respective object IDs, and render one object ID label at the geometric centroid of each feature group. The locations of the labels remain static with respect to the feature group until a new keynode image is chosen.

4.3. Object Tracking

At every new video frame, we run SURFTrac to update the interest points, compute the homography against the previous video frame, and update the label locations accordingly. Because SURFTrac continues to run on every input video frame, we only need to run the object recognition algorithm occasionally, for newly revealed objects. In order to avoid running the matching algorithm frequently, we re-project all features in Vk to the current video frame, and if features corresponding to new object IDs fall onto the video frame, we add these new features to the list of tracked interest points and run the SURFTrac algorithm again before querying Gk.

Figure 6. Real-time tracking and recognition pipeline.
[Figure 6 flowchart stages — Initialization: Capture Image, Detect Features & Query Image, Found Matches?, Determine Keynode, Render Labels; Object Tracking: Capture Image, Track Known Features with SURFTrac, Update Feature Locations, Project Subgraph Features to Current Image, New Object ID in View?, SURFTrac and Query Image, Switch to New Keynode, Update Tracked Interest Points, Update Label Locations & Render; supported by the Image Database and Object ID Database.]
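A sketch of the database organization and keynode bookkeeping described in Sections 4.1–4.3 is given below using networkx; the attribute names ('geometric', 'id', object_ids) and the helper functions are illustrative choices, not the actual implementation.

```python
import networkx as nx

def build_image_graph(homographies, image_object_ids):
    """Section 4.1 sketch: nodes are database images; 'geometric' edges carry a
    pairwise homography, 'id' edges link images that share an object ID."""
    G = nx.Graph()
    for image, ids in image_object_ids.items():
        G.add_node(image, object_ids=set(ids))
    for (img_a, img_b), H in homographies.items():
        G.add_edge(img_a, img_b, kind="geometric", H=H)
    by_id = {}
    for image, ids in image_object_ids.items():
        for oid in ids:
            by_id.setdefault(oid, []).append(image)
    for oid, images in by_id.items():
        for a, b in zip(images, images[1:]):       # chain images sharing an object ID
            G.add_edge(a, b, kind="id", object_id=oid)
    return G

def images_in_play(G, keynode):
    """Section 4.2 sketch: restrict queries to images connected to the current keynode."""
    return G.subgraph(nx.node_connected_component(G, keynode))
```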
`
`5. Results and Discussions
We now discuss the performance and attributes of our implementation of the SURFTrac algorithm. In the following experiments, unless otherwise noted, we predict the motion M_k^{k−1} between consecutive frames using a constant-velocity assumption, and rely on the SURFTrac local neighborhood to locate the updated interest points. We also set ∆NCC and ∆r2 to infinity, which means we always select the single best-matching feature for each local neighborhood. We implement our methods on top of the SURF algorithm in [6] to obtain a fair performance comparison between the two algorithms.
`
`5.1. Interest Point Tracking
`To evaluate the incremental interest point detection al-
`gorithm, we first produce a synthetic image sequence by
`rendering a rectangular texture in 3D and varying the vir-
`tual camera parameters to simulate the pure translation and
`rotation of the real camera. Using a synthetic sequence al-
`lows us to measure the impact of each individual extrinsic
`parameter independently. We use either the NCC or the Lo-
`cal Curvature method to track the interest points. Then, we
`apply homography RANSAC to filter out the false-positive
`matches. We measure the accuracy of the tracking method
as the percentage of matches that pass RANSAC. The accuracy under different camera movements is shown in Figure 7, where the camera moves in the x direction with constant velocity. For the local SURFTrac neighborhood, we always use ∆σ = 1 and γ = 6, which at the base level corresponds to 16 × 16 pixels. With an accuracy of up to 90%,
`the NCC method proved to be much better than the Local
`Curvature method.
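The accuracy measure used above can be reproduced with a few lines of OpenCV, as sketched below; the reprojection threshold is an assumed value.

```python
import numpy as np
import cv2

def tracking_accuracy(prev_pts, cur_pts, reproj_thresh=3.0):
    """Section 5.1 metric (sketch): fraction of tracked matches that survive
    homography RANSAC between consecutive frames."""
    if len(prev_pts) < 4:
        return 0.0
    _, mask = cv2.findHomography(np.float32(prev_pts), np.float32(cur_pts),
                                 cv2.RANSAC, reproj_thresh)
    return 0.0 if mask is None else float(mask.ravel().mean())
```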
`Figure 8 shows how the accuracy changes according
`to the amount of camera movement around different axes.
`SURFTrac appears to cope with most of the camera move-
ments quite well within a reasonable range of motion. How-
`ever, we can see that the amount of change in roll severely
`affects the SURFTrac accuracy in both NCC and Local Cur-
`vature methods.
`
`5.2. Mobile Phone Performance
`We implemented SURFTrac on a Nokia N95 mobile
`phone to analyze and compare the speed of SURFTrac
`against regular SURF in both interest point detection and
`tracking using a frame-by-frame matching method similar
`to Skrypnyk and Lowe [24] and Battiato et al. [1], as de-
`scribed in Figure 2. The N95 contains an ARM-11 based
`
`
`

`

`Figure 7. Accuracy comparison between NCC and Local Curva-
`ture (r2) matching methods.
`
Texas Instruments OMAP2 processor running at 330 MHz. Our test results use 256 × 192 live video from the N95 camera. The local neighborhood Pk is chosen experimentally to balance the tradeoff between computational cost and tracking consistency, and varies from 12 × 12 × 3 at the lowest scale to 32 × 32 × 3 at the highest scale.
`Table 1 summarizes the average processing time per
`frame in different methods. Tracking features with the
`SURF algorithm takes 0.678 seconds even with fairly op-
`timized code. Detecting interest points alone with SURF
`takes 0.357 seconds, which is over 3× slower than SURF-
`Trac even though the latter provides additional tracking re-
`sults. Overall for tracking purposes, we achieve over 5×
`speedup compared to SURF.
Without geometric verification, the Local Curvature method is slightly faster than the NCC method. However, when geometric verification is enabled, both methods are equally efficient, because in the case of the Local Curvature method RANSAC requires more iterations to converge due to its much larger false-positive ratio.
`
`5.3. Outdoor tracking
`
`Figure 1 shows the outdoor tracking and label augmen-
`tation results on the N95 phone. We use a small set of
`keynodes in the database with about three thousand fea-
`tures matched across all keynodes. We use NCC for this
`purpose because it generally outperforms the Local Curva-
`ture method. We observe that, in this application context,
`the user tends to hold and move the mobile device paral-
`lel to the ground plane, and therefore the amount of camera
roll is fairly insignificant. Our implementation reaches 1.4 FPS during initialization or recovery, when full-frame feature detection and querying against the database are needed. After the first keynode is matched, tracking and subgraph matching run at about 6–7 FPS, including capturing and rendering of the image and label overlay.
`
`Figure 8. Impact of change in each camera parameter on neighbor-
`hood matching results of NCC and LC (r2).
`
`5.4. Limitations
`
Our tracking method inherits limitations similar to those of the traditional active-search strategy, which depends heavily on the size of the search region, the camera motion, and the distribution of features in the scene. In the outdoor environment, the tracker fails when, for example, the camera pans to an empty space between two buildings. In addition, because the interest points are localized and interpolated quadratically in scale space, their positional accuracy does not rival that of traditional corner detectors, where sub-pixel accuracy is commonly achieved.
`In terms of label placement, because we have multiple
`keynodes per building, a label may jitter when a new keyn-
`ode is selected. A label can also drift after a while, but its
`
[Figure 7 plot: tracking accuracy (%) per frame for the NCC and r2 (Local Curvature) measures.]

[Figure 8 plots: accuracy (%) of the NCC and r2 measures under increasing z-translation (∆distance, mm), panning, tilting, and rolling (∆angle, degrees).]
`
`

`

Algorithm   Method                       Time (sec)
SURF        Matching and Tracking        0.678
SURF        Interest Point only          0.357
SURFTrac    NCC only                     0.115
SURFTrac    NCC + RANSAC                 0.133
SURFTrac    Local Curvature only         0.111
SURFTrac    Local Curvature + RANSAC     0.134

Table 1. Speed comparison of the SURFTrac algorithm versus the SURF algorithm. For tracking purposes we achieve over 5× speedup compared to SURF.
`
`position will be recovered appropriately when the tracker
`switches to a new keynode from the database.
`
`6. Conclusion
`We presented the SURFTrac algorithm, an efficient
`method for tracking scale-invariant interest points without
computing their descriptors. We also demonstrated a framework for outdoor tracking.
