Volume 110, Number 3, June 2008

ISSN 1077-3142

Computer Vision
and Image
Understanding

www.sciencedirect.com

Chief Editor
Avinash C. Kak

Area Editors
Narendra Ahuja
Yiannis Aloimonos
Robert Bergevin
Ruud M. Bolle
Kevin W. Bowyer
Kim L. Boyer
Larry S. Davis
Jan-Olof Eklundh
Adrian Hilton
Radu Horaud
Jonathan J. Hull
Katsushi Ikeuchi
Martin D. Levine
Chung-Sheng Li
Nikos Paragios
Arthur Pece
Anand Rangarajan
John K. Tsotsos
Jayaram K. Udupa
Harry Wechsler
Daphna Weinshall

Available online at ScienceDirect
`
`Computer Vision and Image Understanding
`Volume 110, Number 3, June 2008
`
`© 2008 Elsevier Inc. All rights reserved.
`
`
This journal and the individual contributions contained in it are protected under copyright by Elsevier Inc., and the following terms and conditions apply to their use:

Photocopying
Single photocopies of single articles may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for nonprofit educational classroom use.
Permissions may be sought directly from Elsevier's Rights Department in Oxford, UK; phone: (+44) 1865 843830; fax: (+44) 1865 853333; e-mail: permissions@elsevier.com. Requests may also be completed online via the Elsevier home page (http://www.elsevier.com/locate/permissions).
In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (978) 750-8400; fax: (978) 750-4744; and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: +44 (0)20 7631 5555; fax: +44 (0)20 7631 5500. Other countries may have a local reprographic rights agency for payments.

Derivative Works
Subscribers may reproduce tables of contents or prepare lists of articles including abstracts for internal circulation within their institutions. Permission of the Publisher is required for resale or distribution outside the institution.
Permission of the Publisher is required for all other derivative works, including compilations and translations.

Electronic Storage or Usage
Permission of the Publisher is required to store or use electronically any material contained in this journal, including any article or part of an article.
Except as outlined above, no part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior written permission of the Publisher.
Address permission requests to the Elsevier Rights Department at the fax and e-mail addresses noted above.

Notice
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of product liability, negligence, or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
Although all advertising material is expected to conform to ethical (medical) standards, inclusion in this publication does not constitute a guarantee or endorsement of the quality or value of such product or of the claims made of it by its manufacturer.
`
Publication information: Computer Vision and Image Understanding (ISSN 1077-3142). For 2008, Volumes 109–112 (12 issues) are scheduled for publication. Subscription prices are available upon request from the Publisher, from the Regional Sales Office nearest you, or from this journal's website (http://www.elsevier.com/locate/cviu). Further information is available on this journal and other Elsevier products through Elsevier's website (http://www.elsevier.com). Subscriptions are accepted on a prepaid basis only and are entered on a calendar year basis. Issues are sent by standard mail (surface within Europe, air delivery outside Europe). Priority rates are available upon request. Claims for missing issues should be made within six months of the date of dispatch.

USA mailing notice: Computer Vision and Image Understanding (ISSN 1077-3142) is published monthly by Elsevier Inc. (525 B Street, Suite 1900, San Diego, CA 92101-4495, USA). Annual subscription price in the USA $1343.00 (valid in North, Central, and South America), including air speed delivery. Periodicals postage paid at San Diego, CA 92199-9602, USA, and at additional mailing offices. USA POSTMASTER: Send change of address to Computer Vision and Image Understanding, Elsevier, Customer Service Department, 6277 Sea Harbor Drive, Orlando, FL 32887-4800, USA.

Orders, claims, and journal inquiries: Please contact the Customer Service Department at the Regional Sales Office nearest you. Orlando: Elsevier, Customer Service Department, 6277 Sea Harbor Drive, Orlando, FL 32887-4800, USA; phone: (+1) (877) 839 7126 or (+1) (800) 654 2452 [toll-free numbers for customers inside USA] or (+1) (407) 345 4020 or (+1) (407) 345 4000 [customers outside USA]; fax: (+1) (407) 363 1354 or (+1) (407) 363 9661; e-mail: usjcs@elsevier.com or elspcs@elsevier.com. Amsterdam: Elsevier, Customer Service Department, PO Box 211, 1000 AE Amsterdam, The Netherlands; phone: (+31) (20) 4853757; fax: (+31) (20) 4853432; e-mail: nlinfo-f@elsevier.com. Tokyo: Elsevier, Customer Service Department, 4F Higashi-Azabu, 1-Chome Bldg, 1-9-15 Higashi-Azabu, Minato-ku, Tokyo 106-0044, Japan; phone: (+81) (3) 5561 5037; fax: (+81) (3) 5561 5047; e-mail: jp.info@elsevier.com. Singapore: Elsevier, Customer Service Department, 3 Killiney Road, #08-01 Winsland House I, Singapore 239519; phone: (+65) 63490222; fax: (+65) 67331510; e-mail: asiainfo@elsevier.com.

Author inquiries: For inquiries relating to the submission of articles (including electronic submission where available), please visit this journal's homepage at http://www.elsevier.com/locate/cviu. You can track accepted articles at http://www.elsevier.com/trackarticle and set up e-mail alerts to inform you of when an article's status has changed. Also accessible from here is information on copyright, frequently asked questions, and more. For detailed instructions on the preparation of electronic artwork, please visit http://www.elsevier.com/artworkinstructions. Contact details for questions arising after acceptance of an article, especially those relating to proofs, will be provided by the publisher.

Advertising information: Advertising orders and inquiries should be sent to James Kenney, Advertising/Commercial Sales Department, Elsevier Ltd., 84 Theobald's Road, London WC1X 8RR, United Kingdom; phone: +44 (0) 20 7611 4494; fax: +44 (0) 20 7611 4463; e-mail: j.kenney@elsevier.com.

Printed by The Sheridan Press, Hanover, Pennsylvania, United States of America
The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper)
`
`
`
`
`
`
ELSEVIER

Computer Vision and Image Understanding

Volume 110, Number 3, June 2008

CONTENTS

Special issue on Similarity Matching in Computer Vision and Multimedia
Guest Editors: Nicu Sebe, Qi Tian, Michael S. Lew, Thomas S. Huang

Guest Editorial

Similarity Matching in Computer Vision and Multimedia
Nicu Sebe, Qi Tian, Michael S. Lew, Thomas S. Huang . . . . . . . . . . . . . . . . 309

Special Issue Articles

Indexing through Laplacian spectra
M. Fatih Demirci, Reinier H. van Leuken, Remco C. Veltkamp . . . . . . . . . . . . . . . . 312

Strategies for shape matching using skeletons
Wooi-Boon Goh . . . . . . . . . . . . . . . . 326

Speeded-Up Robust Features (SURF)
Herbert Bay, Andreas Ess, Tinne Tuytelaars, Luc Van Gool . . . . . . . . . . . . . . . . 346

Content based video matching using spatiotemporal volumes
Arslan Basharat, Yun Zhai, Mubarak Shah . . . . . . . . . . . . . . . . 360

Anytime similarity measures for faster alignment
Rupert Brooks, Tal Arbel, Doina Precup . . . . . . . . . . . . . . . . 378

Locally adaptive subspace and similarity metric learning for visual data clustering and retrieval
Yun Fu, Zhu Li, Thomas S. Huang, Aggelos K. Katsaggelos . . . . . . . . . . . . . . . . 390

Combining visual dictionary, kernel-based similarity and learning strategy for image category retrieval
Philippe Henri Gosselin, Matthieu Cord, Sylvie Philipp-Foliguet . . . . . . . . . . . . . . . . 403

Measuring novelty and redundancy with multiple modalities in cross-lingual broadcast news
Xiao Wu, Alexander G. Hauptmann, Chong-Wah Ngo . . . . . . . . . . . . . . . . 418

Abstracted/Indexed in abstract and citation database SCOPUS®. Full text available on ScienceDirect®.
`
`
`
`
`Available online at www.sciencedirect.com
`
`Computer Vision and Image Understanding 110 (2008) 346–359
`
`www.elsevier.com/locate/cviu
`
`Speeded-Up Robust Features (SURF)
`
`Herbert Bay a, Andreas Ess a,*, Tinne Tuytelaars b, Luc Van Gool a,b
`
`a ETH Zurich, BIWI, Sternwartstrasse 7, CH-8092 Zurich, Switzerland
`b K.U. Leuven, ESAT-PSI, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
`
`Received 31 October 2006; accepted 5 September 2007
`Available online 15 December 2007
`
`Abstract
`
`This article presents a novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features).
`SURF approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness,
`yet can be computed and compared much faster.
`This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors
`and descriptors (specifically, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by sim-
`plifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps.
`The paper encompasses a detailed description of the detector and descriptor and then explores the effects of the most important param-
`eters. We conclude the article with SURF’s application to two challenging, yet converse goals: camera calibration as a special case of image
`registration, and object recognition. Our experiments underline SURF’s usefulness in a broad range of topics in computer vision.
© 2007 Elsevier Inc. All rights reserved.
`
`Keywords: Interest points; Local features; Feature description; Camera calibration; Object recognition
`
`1. Introduction
`
`The task of finding point correspondences between two
`images of the same scene or object is part of many com-
`puter vision applications. Image registration, camera cali-
`bration, object recognition, and image retrieval are just a
`few.
`The search for discrete image point correspondences can
`be divided into three main steps. First, ‘interest points’ are
`selected at distinctive locations in the image, such as cor-
`ners, blobs, and T-junctions. The most valuable property
`of an interest point detector is its repeatability. The repeat-
`ability expresses the reliability of a detector for finding the
`same physical interest points under different viewing condi-
`tions. Next, the neighbourhood of every interest point is
`represented by a feature vector. This descriptor has to be
`distinctive and at the same time robust to noise, detection
`
`* Corresponding author.
`E-mail address: aess@vision.ee.ethz.ch (A. Ess).
`
1077-3142/$ - see front matter © 2007 Elsevier Inc. All rights reserved.
`doi:10.1016/j.cviu.2007.09.014
`
`displacements and geometric and photometric deforma-
`tions. Finally, the descriptor vectors are matched between
`different images. The matching is based on a distance
`between the vectors, e.g. the Mahalanobis or Euclidean dis-
`tance. The dimension of the descriptor has a direct impact
on the time this takes, and fewer dimensions are desirable for
`fast interest point matching. However, lower dimensional
`feature vectors are in general less distinctive than their
`high-dimensional counterparts.
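To make this matching step concrete, the following minimal sketch (our own illustration, not code from the paper; the function name, the NumPy interface, and the ratio-test threshold are assumptions) pairs each descriptor of one image with its nearest neighbour in the other image using the Euclidean distance mentioned above.

```python
import numpy as np

def match_descriptors(desc_a: np.ndarray, desc_b: np.ndarray, ratio: float = 0.8):
    """Nearest-neighbour matching of descriptor vectors by Euclidean distance.

    desc_a: (N, D) array, desc_b: (M, D) array with M >= 2.
    A pair is kept only if the best candidate is clearly better than the
    second best (a Lowe-style ratio test); the 0.8 threshold is illustrative.
    """
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = ((desc_a[:, None, :] - desc_b[None, :, :]) ** 2).sum(axis=-1)
    matches = []
    for i, row in enumerate(d2):
        j1, j2 = np.argsort(row)[:2]          # best and second-best candidates
        if row[j1] < (ratio ** 2) * row[j2]:  # ratio test on squared distances
            matches.append((i, int(j1)))
        # otherwise the match is considered ambiguous and discarded
    return matches
```

The cost of this brute-force comparison grows with the descriptor dimension D, which is exactly why lower-dimensional descriptors match faster.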
`It has been our goal to develop both a detector and
`descriptor that, in comparison to the state-of-the-art, are
`fast to compute while not sacrificing performance. In order
`to succeed, one has to strike a balance between the above
`requirements like simplifying the detection scheme while
`keeping it accurate, and reducing the descriptor’s size while
`keeping it sufficiently distinctive.
`A wide variety of detectors and descriptors have already
`been proposed in the literature (e.g. [21,24,27,39,25]). Also,
`detailed comparisons and evaluations on benchmarking
`datasets have been performed [28,30,31]. Our fast detector
`and descriptor,
`called SURF (Speeded-Up Robust
`Features), was introduced in [4]. It is built on the insights
`gained from this previous work. In our experiments on
`these benchmarking datasets, SURF’s detector and
`descriptor are not only faster, but the former is also more
`repeatable and the latter more distinctive.
`We focus on scale and in-plane rotation-invariant detec-
`tors and descriptors. These seem to offer a good compromise
`between feature complexity and robustness to commonly
`occurring photometric deformations. Skew, anisotropic
`scaling, and perspective effects are assumed to be second
order effects that are covered to some degree by the overall
`robustness of the descriptor. Note that the descriptor can
`be extended towards affine-invariant regions using affine
`normalisation of the ellipse (cf. [31]), although this will have
`an impact on the computation time. Extending the detector,
`on the other hand, is less straightforward. Concerning the
`photometric deformations, we assume a simple linear model
`with a bias (offset) and contrast change (scale factor). Nei-
`ther detector nor descriptor use colour information.
`The article is structured as follows. In Section 2, we give
a review of previous work in interest point detection and
`description. In Section 3, we describe the strategy applied
`for fast and robust interest point detection. The input
`image is analysed at different scales in order to guarantee
`invariance to scale changes. The detected interest points
`are provided with a rotation and scale-invariant descriptor
`in Section 4. Furthermore, a simple and efficient first-line
`indexing technique, based on the contrast of the interest
`point with its surrounding, is proposed.
`In Section 5, some of the available parameters and their
`effects are discussed, including the benefits of an upright
`version (not invariant to image rotation). We also investi-
`gate SURF’s performance in two important application
`scenarios. First, we consider a special case of image regis-
`tration, namely the problem of camera calibration for 3D
`reconstruction. Second, we will explore SURF’s applica-
`tion to an object recognition experiment. Both applications
`highlight SURF’s benefits in terms of speed and robustness
`as opposed to other strategies. The article is concluded in
`Section 6.
`
`2. Related work
`
`2.1. Interest point detection
`
`The most widely used detector is probably the Harris
`corner detector [15], proposed back in 1988. It is based
`on the eigenvalues of the second moment matrix. However,
`Harris corners are not scale invariant. Lindeberg [21] intro-
duced the concept of automatic scale selection. This makes it possible
to detect interest points in an image, each with its own
`characteristic scale. He experimented with both the deter-
`minant of the Hessian matrix as well as the Laplacian
`(which corresponds to the trace of the Hessian matrix) to
`detect blob-like structures. Mikolajczyk and Schmid [26]
`refined this method, creating robust and scale-invariant
`feature detectors with high repeatability, which they coined
`
`Harris-Laplace and Hessian-Laplace. They used a (scale-
`adapted) Harris measure or the determinant of the Hessian
`matrix to select the location, and the Laplacian to select the
`scale. Focusing on speed, Lowe [23] proposed to approxi-
`mate the Laplacian of Gaussians (LoG) by a Difference
`of Gaussians (DoG) filter.
`Several other scale-invariant interest point detectors
`have been proposed. Examples are the salient region detec-
`tor, proposed by Kadir and Brady [17], which maximises
`the entropy within the region, and the edge-based region
`detector proposed by Jurie and Schmid [16]. They seem less
`amenable to acceleration though. Also several affine-invari-
`ant feature detectors have been proposed that can cope
`with wider viewpoint changes. However, these fall outside
`the scope of this article.
`From studying the existing detectors and from published
`comparisons [29,30], we can conclude that Hessian-based
`detectors are more stable and repeatable than their Harris-
`based counterparts. Moreover, using the determinant of
`the Hessian matrix rather than its trace (the Laplacian)
`seems advantageous, as it fires less on elongated, ill-localised
`structures. We also observed that approximations like the
`DoG can bring speed at a low cost in terms of lost accuracy.
`
`2.2. Interest point description
`
An even larger variety of feature descriptors has been proposed, like Gaussian derivatives [11], moment invariants [32], complex features [1], steerable filters [12], phase-based local features [6], and descriptors representing the distribution of smaller-scale features within the interest point neighbourhood. The latter, introduced by Lowe [24], have been shown to outperform the others [28]. This can be explained by the fact that they capture a substantial amount of information about the spatial intensity patterns, while at the same time being robust to small deformations or localisation errors. The descriptor in [24], called SIFT for short, computes a histogram of local oriented gradients around the interest point and stores the bins in a 128D vector (8 orientation bins for each of 4 × 4 location bins).
`Various refinements on this basic scheme have been pro-
`posed. Ke and Sukthankar [18] applied PCA on the gradi-
`ent image around the detected interest point. This PCA-
`SIFT yields a 36D descriptor which is fast for matching,
`but proved to be less distinctive than SIFT in a second
`comparative study by Mikolajczyk and Schmid [30]; and
`applying PCA slows down feature computation. In the
`same paper [30], the authors proposed a variant of SIFT,
`called GLOH, which proved to be even more distinctive
`with the same number of dimensions. However, GLOH is
`computationally more expensive as it uses again PCA for
`data compression.
`The SIFT descriptor still seems the most appealing
`descriptor for practical uses, and hence also the most
`widely used nowadays. It is distinctive and relatively fast,
`which is crucial for on-line applications. Recently, Se
`et al. [37] implemented SIFT on a Field Programmable
`Gate Array (FPGA) and improved its speed by an order of
`magnitude. Meanwhile, Grabner et al. [14] also used inte-
`gral images to approximate SIFT. Their detection step is
`based on difference-of-mean (without interpolation), their
`description step on integral histograms. They achieve
`about the same speed as we do (though the description step
`is constant in speed), but at the cost of reduced quality
`compared to SIFT. Generally, the high dimensionality of
`the descriptor is a drawback of SIFT at the matching step.
`For on-line applications relying only on a regular PC, each
`one of the three steps (detection, description, matching) has
`to be fast.
`An entire body of work is available on speeding up the
`matching step. All of them come at the expense of getting
`an approximative matching. Methods include the best-
`bin-first proposed by Lowe [24], balltrees [35], vocabulary
`trees [34], locality sensitive hashing [9], or redundant bit
`vectors [13]. Complementary to this, we suggest the use
`of the Hessian matrix’s trace to significantly increase the
`matching speed. Together with the descriptor’s low dimen-
`sionality, any matching algorithm is bound to perform
`faster.
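As a minimal sketch of this idea (our own illustration, not the authors' implementation; the data layout is assumed), descriptors are compared only when their interest points have the same sign of the Laplacian, i.e. of the trace of the Hessian, so that dark blobs on bright backgrounds are never tested against bright blobs on dark backgrounds:

```python
import numpy as np

def match_with_laplacian_sign(desc_a, sign_a, desc_b, sign_b):
    """Match descriptors only across interest points whose Laplacian
    (trace of the Hessian) has the same sign.

    desc_a, desc_b: (N, D) / (M, D) descriptor arrays.
    sign_a, sign_b: (N,) / (M,) arrays holding +1 or -1 per interest point.
    Restricting comparisons to same-sign points roughly halves the number
    of distance computations without discarding correct matches.
    """
    matches = []
    for i, (d, s) in enumerate(zip(desc_a, sign_a)):
        candidates = np.flatnonzero(sign_b == s)   # same-contrast points only
        if candidates.size == 0:
            continue
        dists = np.linalg.norm(desc_b[candidates] - d, axis=1)
        matches.append((i, int(candidates[np.argmin(dists)])))
    return matches
```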
`
`3. Interest point detection
`
`Our approach for interest point detection uses a very
`basic Hessian matrix approximation. This lends itself to
`the use of integral images as made popular by Viola and
`Jones [41], which reduces the computation time drastically.
`Integral images fit in the more general framework of box-
`lets, as proposed by Simard et al. [38].
`
3.1. Integral images

In order to make the article more self-contained, we briefly discuss the concept of integral images. They allow for fast computation of box type convolution filters. The entry of an integral image $I_\Sigma(\mathbf{x})$ at a location $\mathbf{x} = (x, y)^\top$ represents the sum of all pixels in the input image $I$ within a rectangular region formed by the origin and $\mathbf{x}$,

$$I_\Sigma(\mathbf{x}) = \sum_{i=0}^{i \le x} \sum_{j=0}^{j \le y} I(i, j). \qquad (1)$$

Once the integral image has been computed, it takes three additions to calculate the sum of the intensities over any upright, rectangular area (see Fig. 1). Hence, the calculation time is independent of its size. This is important in our approach, as we use big filter sizes.

Fig. 1. Using integral images, it takes only three additions and four memory accesses to calculate the sum of intensities inside a rectangular region of any size.
`
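As a concrete illustration of Eq. (1) and of the three-addition box sum of Fig. 1, here is a minimal NumPy sketch; the function names and the (row, column) indexing convention are our own assumptions, not notation from the paper.

```python
import numpy as np

def integral_image(img: np.ndarray) -> np.ndarray:
    """ii[r, c] = sum of img[0:r+1, 0:c+1], i.e. Eq. (1) evaluated at every pixel."""
    return np.asarray(img, dtype=np.float64).cumsum(axis=0).cumsum(axis=1)

def box_sum(ii: np.ndarray, r0: int, c0: int, r1: int, c1: int) -> float:
    """Sum of img[r0:r1+1, c0:c1+1] from four lookups and three additions."""
    A = ii[r0 - 1, c0 - 1] if (r0 > 0 and c0 > 0) else 0.0
    B = ii[r0 - 1, c1] if r0 > 0 else 0.0
    C = ii[r1, c0 - 1] if c0 > 0 else 0.0
    D = ii[r1, c1]
    return D - B - C + A  # independent of the rectangle's size
```

Whatever the rectangle's size, the cost of box_sum stays constant, which is what makes the large filter sizes used below affordable.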
3.2. Hessian matrix-based interest points

We base our detector on the Hessian matrix because of its good performance in accuracy. More precisely, we detect blob-like structures at locations where the determinant is maximum. In contrast to the Hessian-Laplace detector by Mikolajczyk and Schmid [26], we rely on the determinant of the Hessian also for the scale selection, as done by Lindeberg [21]. Given a point $\mathbf{x} = (x, y)$ in an image $I$, the Hessian matrix $\mathcal{H}(\mathbf{x}, \sigma)$ in $\mathbf{x}$ at scale $\sigma$ is defined as follows

$$\mathcal{H}(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix}, \qquad (2)$$

where $L_{xx}(\mathbf{x}, \sigma)$ is the convolution of the Gaussian second order derivative $\frac{\partial^2}{\partial x^2} g(\sigma)$ with the image $I$ in point $\mathbf{x}$, and similarly for $L_{xy}(\mathbf{x}, \sigma)$ and $L_{yy}(\mathbf{x}, \sigma)$.

Gaussians are optimal for scale-space analysis [19,20], but in practice they have to be discretised and cropped (Fig. 2, left half). This leads to a loss in repeatability under image rotations around odd multiples of $\pi/4$. This weakness holds for Hessian-based detectors in general. Fig. 3 shows the repeatability rate of two detectors based on the Hessian matrix for pure image rotation. The repeatability attains a maximum around multiples of $\pi/2$. This is due to the square shape of the filter. Nevertheless, the detectors still perform well, and the slight decrease in performance does not outweigh the advantage of fast convolutions brought by the discretisation and cropping. As real filters are non-ideal in any case, and given Lowe's success with his LoG approximations, we push the approximation for the Hessian matrix even further with box filters (in the right half of Fig. 2). These approximate second order Gaussian derivatives and can be evaluated at a very low computational cost using integral images. The calculation time therefore is independent of the filter size. As shown in Section 5 and Fig. 3, the performance is comparable or better than with the discretised and cropped Gaussians.

Fig. 2. Left to right: the (discretised and cropped) Gaussian second order partial derivative in y- ($L_{yy}$) and xy-direction ($L_{xy}$), respectively; our approximation for the second order Gaussian partial derivative in y- ($D_{yy}$) and xy-direction ($D_{xy}$). The grey regions are equal to zero.
`
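For reference, Eq. (2) can be evaluated directly with discretised Gaussian derivative kernels, for instance via SciPy's gaussian_filter (an assumed tool used here purely for illustration; the detector described next replaces these kernels with box filters):

```python
import numpy as np
from scipy import ndimage

def hessian_responses(img: np.ndarray, sigma: float = 1.2):
    """Gaussian second-order derivative responses L_xx, L_yy, L_xy of Eq. (2).

    gaussian_filter with a per-axis `order` convolves the image with the
    corresponding Gaussian derivative; axis 0 is y (rows), axis 1 is x (cols).
    """
    img = np.asarray(img, dtype=np.float64)
    Lxx = ndimage.gaussian_filter(img, sigma, order=(0, 2))  # d^2/dx^2
    Lyy = ndimage.gaussian_filter(img, sigma, order=(2, 0))  # d^2/dy^2
    Lxy = ndimage.gaussian_filter(img, sigma, order=(1, 1))  # d^2/(dx dy)
    det_hessian = Lxx * Lyy - Lxy ** 2   # determinant of Eq. (2) per pixel
    return Lxx, Lyy, Lxy, det_hessian
```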
The $9 \times 9$ box filters in Fig. 2 are approximations of a Gaussian with $\sigma = 1.2$ and represent the lowest scale (i.e. highest spatial resolution) for computing the blob response maps. We will denote them by $D_{xx}$, $D_{yy}$, and $D_{xy}$. The weights applied to the rectangular regions are kept simple for computational efficiency. This yields

$$\det(\mathcal{H}_{\mathrm{approx}}) = D_{xx} D_{yy} - (w D_{xy})^2. \qquad (3)$$

The relative weight $w$ of the filter responses is used to balance the expression for the Hessian's determinant. This is needed for the energy conservation between the Gaussian kernels and the approximated Gaussian kernels,

$$w = \frac{|L_{xy}(1.2)|_F \, |D_{yy}(9)|_F}{|L_{yy}(1.2)|_F \, |D_{xy}(9)|_F} = 0.912\ldots \simeq 0.9, \qquad (4)$$

where $|x|_F$ is the Frobenius norm. Notice that for theoretical correctness, the weighting changes depending on the scale. In practice, we keep this factor constant, as this did not have a significant impact on the results in our experiments.

Furthermore, the filter responses are normalised with respect to their size. This guarantees a constant Frobenius norm for any filter size, an important aspect for the scale space analysis as discussed in the next section.

The approximated determinant of the Hessian represents the blob response in the image at location $\mathbf{x}$. These responses are stored in a blob response map over different scales, and local maxima are detected as explained in Section 3.4.

Fig. 3. Top: repeatability score for image rotation of up to 180°. Hessian-based detectors have in general a lower repeatability score for angles around odd multiples of $\pi/4$. Bottom: sample images from the sequence that was used. Fast-Hessian is the more accurate version of our detector (FH-15), as explained in Section 3.3.
`
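A minimal sketch of Eq. (3) follows; it combines box-filter response maps into the blob response using the constant weight w ≈ 0.9 and a normalisation by filter area, which is our reading of the size normalisation described above, not code from the paper.

```python
import numpy as np

def blob_response(Dxx: np.ndarray, Dyy: np.ndarray, Dxy: np.ndarray,
                  filter_size: int, w: float = 0.9) -> np.ndarray:
    """Approximated determinant of the Hessian, Eq. (3), for one filter size.

    Dxx, Dyy, Dxy are box-filter response maps (e.g. accumulated with box
    sums on an integral image). Dividing by the filter area keeps the
    responses comparable across filter sizes, as required for the scale
    space analysis of the next section.
    """
    inv_area = 1.0 / float(filter_size * filter_size)
    dxx, dyy, dxy = Dxx * inv_area, Dyy * inv_area, Dxy * inv_area
    return dxx * dyy - (w * dxy) ** 2
```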
3.3. Scale space representation

Interest points need to be found at different scales, not least because the search of correspondences often requires their comparison in images where they are seen at different scales. Scale spaces are usually implemented as an image pyramid. The images are repeatedly smoothed with a Gaussian and then sub-sampled in order to achieve a higher level of the pyramid. Lowe [24] subtracts these pyramid layers in order to get the DoG (Difference of Gaussians) images where edges and blobs can be found.

Due to the use of box filters and integral images, we do not have to iteratively apply the same filter to the output of a previously filtered layer, but instead can apply box filters of any size at exactly the same speed directly on the original image and even in parallel (although the latter is not exploited here). Therefore, the scale space is analysed by up-scaling the filter size rather than iteratively reducing the image size, Fig. 4. The output of the $9 \times 9$ filter, introduced in the previous section, is considered as the initial scale layer, to which we will refer as scale $s = 1.2$ (approximating Gaussian derivatives with $\sigma = 1.2$). The following layers are obtained by filtering the image with gradually bigger masks, taking into account the discrete nature of integral images and the specific structure of our filters.

Fig. 4. Instead of iteratively reducing the image size (left), the use of integral images allows the up-scaling of the filter at constant cost (right).

Note that our main motivation for this type of sampling is its computational efficiency. Furthermore, as we do not have to downsample the image, there is no aliasing. On the downside, box filters preserve high-frequency components that can get lost in zoomed-out variants of the same scene, which can limit scale-invariance. This was however not noticeable in our experiments.

The scale space is divided into octaves. An octave represents a series of filter response maps obtained by convolving the same input image with a filter of increasing size. In total, an octave encompasses a scaling factor of 2 (which implies that one needs to more than double the filter size, see below). Each octave is subdivided into a constant number of scale levels. Due to the discrete nature of integral images, the minimum scale difference between two subsequent scales depends on the length $l_0$ of the positive or negative lobes of the partial second order derivative in the direction of derivation (x or y), which is set to a third of the filter size length. For the $9 \times 9$ filter, this length $l_0$ is 3. For two successive levels, we must increase this size by
`a minimum of 2 pixels (1 pixel on every side) in order to
`keep the size uneven and thus ensure the presence of the
`central pixel. This results in a total increase of the mask size
`by 6 pixels (see Fig. 5). Note that for dimensions different
`from l0 (e.g. the width of the central band for the vertical
`filter in Fig. 5), rescaling the mask introduces rounding-
`off errors. However, since these errors are typically much
`smaller than l0, this is an acceptable approximation.
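The following sketch enumerates the resulting filter side lengths per octave. Within the first octave the mask grows by 6 pixels per level, as derived above; the doubling of this increment in later octaves follows the standard SURF construction and is an assumption relative to the excerpt shown here.

```python
def surf_filter_sizes(n_octaves: int = 3, levels_per_octave: int = 4):
    """Filter side lengths per octave, starting from the 9x9 filter.

    Within an octave the mask grows by a fixed step (6 pixels at first);
    each subsequent octave starts at the second level of the previous one
    and doubles the step, so that every octave spans roughly a factor of 2
    in scale. The scale of a filter of side length L is s = 1.2 * L / 9.
    """
    sizes, base, step = [], 9, 6
    for _ in range(n_octaves):
        octave = [base + i * step for i in range(levels_per_octave)]
        sizes.append(octave)
        base = octave[1]   # next octave starts one level up
        step *= 2          # and uses twice the filter-size increment
    return sizes

# First octave: [9, 15, 21, 27], matching the sizes listed in the text below.
print(surf_filter_sizes())
```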
The construction of the scale space starts with the 9 × 9 filter, which calculates the blob response of the image for the smallest scale. Then, filters with sizes 15 × 15, 21 × 21, and 27 × 27 are applied, by which even more than a scale change of two has been achi