`
Histograms of Oriented Gradients for Human Detection

Navneet Dalal and Bill Triggs
INRIA Rhône-Alps, 655 avenue de l'Europe, Montbonnot 38334, France
`{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr
`
`Abstract
`We study the question of feature sets for robust visual ob-
`ject recognition, adopting linear SVM based human detec-
`tion as a test case. After reviewing existing edge and gra-
`dient based descriptors, we show experimentally that grids
`of Histograms of Oriented Gradient (HOG) descriptors sig-
`nificantly outperform existing feature sets for human detec-
`tion. We study the influence of each stage of the computation
`on performance, concluding that fine-scale gradients, fine
`orientation binning, relatively coarse spatial binning, and
`high-quality local contrast normalization in overlapping de-
`scriptor blocks are all important for good results. The new
`approach gives near-perfect separation on the original MIT
`pedestrian database, so we introduce a more challenging
`dataset containing over 1800 annotated human images with
`a large range of pose variations and backgrounds.
`1 Introduction
`Detecting humans in images is a challenging task owing
`to their variable appearance and the wide range of poses that
`they can adopt. The first need is a robust feature set that
`allows the human form to be discriminated cleanly, even in
`cluttered backgrounds under difficult illumination. We study
`the issue of feature sets for human detection, showing that lo-
`cally normalized Histogram of Oriented Gradient (HOG) de-
`scriptors provide excellent performance relative to other ex-
`isting feature sets including wavelets [17,22]. The proposed
`descriptors are reminiscent of edge orientation histograms
`[4,5], SIFT descriptors [12] and shape contexts [1], but they
`are computed on a dense grid of uniformly spaced cells and
`they use overlapping local contrast normalizations for im-
`proved performance. We make a detailed study of the effects
`of various implementation choices on detector performance,
`taking “pedestrian detection” (the detection of mostly visible
`people in more or less upright poses) as a test case. For sim-
`plicity and speed, we use linear SVM as a baseline classifier
`throughout the study. The new detectors give essentially per-
`fect results on the MIT pedestrian test set [18,17], so we have
`created a more challenging set containing over 1800 pedes-
`trian images with a large range of poses and backgrounds.
`Ongoing work suggests that our feature set performs equally
`well for other shape-based object classes.
`
`We briefly discuss previous work on human detection in
§2, give an overview of our method in §3, describe our data
`sets in §4 and give a detailed description and experimental
`evaluation of each stage of the process in §5–6. The main
`conclusions are summarized in §7.
`2 Previous Work
`There is an extensive literature on object detection, but
`here we mention just a few relevant papers on human detec-
`tion [18,17,22,16,20]. See [6] for a survey. Papageorgiou et
`al [18] describe a pedestrian detector based on a polynomial
`SVM using rectified Haar wavelets as input descriptors, with
`a parts (subwindow) based variant in [17]. Depoortere et al
`give an optimized version of this [2]. Gavrila & Philomen
`[8] take a more direct approach, extracting edge images and
`matching them to a set of learned exemplars using chamfer
`distance. This has been used in a practical real-time pedes-
`trian detection system [7]. Viola et al [22] build an efficient
`moving person detector, using AdaBoost to train a chain of
`progressively more complex region rejection rules based on
`Haar-like wavelets and space-time differences. Ronfard et
`al [19] build an articulated body detector by incorporating
`SVM based limb classifiers over 1st and 2nd order Gaussian
`filters in a dynamic programming framework similar to those
`of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth
`[9]. Mikolajczyk et al [16] use combinations of orientation-
`position histograms with binary-thresholded gradient magni-
`tudes to build a parts based method containing detectors for
`faces, heads, and front and side profiles of upper and lower
`body parts. In contrast, our detector uses a simpler archi-
`tecture with a single detection window, but appears to give
`significantly higher performance on pedestrian images.
`3 Overview of the Method
`This section gives an overview of our feature extraction
`chain, which is summarized in fig. 1. Implementation details
`are postponed until §6. The method is based on evaluating
`well-normalized local histograms of image gradient orienta-
`tions in a dense grid. Similar features have seen increasing
`use over the past decade [4,5,12,15]. The basic idea is that
`local object appearance and shape can often be characterized
`rather well by the distribution of local intensity gradients or
`
[Figure 1 pipeline: Input image → Normalize gamma & colour → Compute gradients → Weighted vote into spatial & orientation cells → Contrast normalize over overlapping spatial blocks → Collect HOG's over detection window → Linear SVM → Person / non-person classification.]

Figure 1. An overview of our feature extraction and object detection chain. The detector window is tiled with a grid of overlapping blocks in which Histogram of Oriented Gradient feature vectors are extracted. The combined vectors are fed to a linear SVM for object/non-object classification. The detection window is scanned across the image at all positions and scales, and conventional non-maximum suppression is run on the output pyramid to detect object instances, but this paper concentrates on the feature extraction process.
`
In practice this is im-
`plemented by dividing the image window into small spatial
`regions (“cells”), for each cell accumulating a local 1-D his-
`togram of gradient directions or edge orientations over the
`pixels of the cell. The combined histogram entries form the
`representation. For better invariance to illumination, shad-
`owing, etc., it is also useful to contrast-normalize the local
`responses before using them. This can be done by accumu-
`lating a measure of local histogram “energy” over somewhat
`larger spatial regions (“blocks”) and using the results to nor-
`malize all of the cells in the block. We will refer to the nor-
`malized descriptor blocks as Histogram of Oriented Gradi-
`ent (HOG) descriptors. Tiling the detection window with
`a dense (in fact, overlapping) grid of HOG descriptors and
`using the combined feature vector in a conventional SVM
`based window classifier gives our human detection chain
`(see fig. 1).
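To make this concrete, here is a minimal numpy sketch (our illustration, not the paper's code) of the cell-histogram stage for one 64×128 window, using unsigned gradients, 8×8 pixel cells and 9 orientation bins; the vote interpolation, Gaussian weighting and block normalization described in §6 are omitted.

```python
import numpy as np

def cell_histograms(gray, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientation; each pixel
    votes with its gradient magnitude (interpolation omitted)."""
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]      # centred [-1, 0, 1] mask
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # "unsigned" gradient
    idx = np.minimum((ang * bins / 180.0).astype(int), bins - 1)
    ny, nx = gray.shape[0] // cell, gray.shape[1] // cell
    hist = np.zeros((ny, nx, bins))
    for i in range(ny):
        for j in range(nx):
            sl = np.s_[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            hist[i, j] = np.bincount(idx[sl].ravel(), weights=mag[sl].ravel(),
                                     minlength=bins)
    return hist

window = np.random.rand(128, 64)   # one 64x128 detection window (float)
h = cell_histograms(window)        # -> (16, 8, 9) array of cell histograms
```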
`The use of orientation histograms has many precursors
`[13,4,5], but it only reached maturity when combined with
`local spatial histogramming and normalization in Lowe’s
`Scale Invariant Feature Transformation (SIFT) approach to
`wide baseline image matching [12], in which it provides
`the underlying image patch descriptor for matching scale-
`invariant keypoints. SIFT-style approaches perform remark-
`ably well in this application [12,14]. The Shape Context
`work [1] studied alternative cell and block shapes, albeit ini-
`tially using only edge pixel counts without the orientation
`histogramming that makes the representation so effective.
`The success of these sparse feature based representations has
`somewhat overshadowed the power and simplicity of HOG’s
`as dense image descriptors. We hope that our study will help
`to rectify this. In particular, our informal experiments sug-
`gest that even the best current keypoint based approaches are
`likely to have false positive rates at least 1–2 orders of mag-
`nitude higher than our dense grid approach for human detec-
`tion, mainly because none of the keypoint detectors that we
`are aware of detect human body structures reliably.
`The HOG/SIFT representation has several advantages. It
`captures edge or gradient structure that is very characteristic
`of local shape, and it does so in a local representation with
`an easily controllable degree of invariance to local geometric
and photometric transformations: translations or rotations make little difference if they are much smaller than the local spatial or orientation bin size. For human detection, rather coarse spatial sampling, fine orientation sampling and strong local photometric normalization turns out to be the best strategy, presumably because it permits limbs and body segments to change appearance and move from side to side quite a lot provided that they maintain a roughly upright orientation.
`
`4 Data Sets and Methodology
`
`Datasets. We tested our detector on two different data sets.
`The first is the well-established MIT pedestrian database
`[18], containing 509 training and 200 test images of pedestri-
`ans in city scenes (plus left-right reflections of these). It con-
`tains only front or back views with a relatively limited range
`of poses. Our best detectors give essentially perfect results
`on this data set, so we produced a new and significantly more
`challenging data set, ‘INRIA’, containing 1805 64×128 im-
`ages of humans cropped from a varied set of personal pho-
`tos. Fig. 2 shows some samples. The people are usually
standing, but appear in any orientation and against a wide
variety of background images, including crowds. Many are
`bystanders taken from the image backgrounds, so there is no
`particular bias on their pose. The database is available from
`http://lear.inrialpes.fr/data for research purposes.
`Methodology. We selected 1239 of the images as positive
`training examples, together with their left-right reflections
`(2478 images in all). A fixed set of 12180 patches sampled
`randomly from 1218 person-free training photos provided
`the initial negative set. For each detector and parameter com-
`bination a preliminary detector is trained and the 1218 nega-
`tive training photos are searched exhaustively for false posi-
`tives (‘hard examples’). The method is then re-trained using
`this augmented set (initial 12180 + hard examples) to pro-
`duce the final detector. The set of hard examples is subsam-
`pled if necessary, so that the descriptors of the final training
`set fit into 1.7 Gb of RAM for SVM training. This retrain-
`ing process significantly improves the performance of each
`detector (by 5% at 10−4 False Positives Per Window tested
`(FPPW) for our default detector), but additional rounds of
`retraining make little difference so we do not use them.
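Schematically, one round of this bootstrapping looks as follows; `train_svm` and `scan_for_false_positives` are hypothetical placeholders and the memory budget is handled only roughly.

```python
import numpy as np

def bootstrap_round(pos_desc, neg_desc, neg_photos,
                    train_svm, scan_for_false_positives,
                    max_bytes=1.7e9):
    """Train a preliminary detector, mine 'hard examples' from the
    person-free photos, subsample to fit in RAM, then retrain once."""
    clf = train_svm(pos_desc, neg_desc)            # preliminary detector
    hard = np.vstack([scan_for_false_positives(clf, photo)
                      for photo in neg_photos])    # exhaustive search
    room = max_bytes - pos_desc.nbytes - neg_desc.nbytes
    budget = max(0, int(room // (hard.itemsize * hard.shape[1])))
    if len(hard) > budget:                         # subsample if necessary
        keep = np.random.choice(len(hard), budget, replace=False)
        hard = hard[keep]
    return train_svm(pos_desc, np.vstack([neg_desc, hard]))
```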
To quantify detector performance we plot Detection Error Tradeoff (DET) curves on a log-log scale, i.e. miss rate (1 − Recall, or equivalently FalseNeg/(TruePos + FalseNeg)) versus FPPW. Lower values are better.
DET plots are used extensively in speech and in NIST evaluations. They present the same information as Receiver Operating Characteristics (ROC's) but allow small probabilities to be distinguished more easily.
`Figure 2. Some sample images from our new human detection database. The subjects are always upright, but with some partial occlusions
`and a wide range of variations in pose, appearance, clothing, illumination and background.
`
We will often use miss rate at 10−4 FPPW as a reference point for results.
`This is arbitrary but no more so than, e.g. Area Under ROC.
`In a multiscale detector it corresponds to a raw error rate of
`about 0.8 false positives per 640×480 image tested. (The full
`detector has an even lower false positive rate owing to non-
`maximum suppression). Our DET curves are usually quite
`shallow so even very small improvements in miss rate are
`equivalent to large gains in FPPW at constant miss rate. For
example, for our default detector at 10−4 FPPW, every 1%
`absolute (9% relative) reduction in miss rate is equivalent to
`reducing the FPPW at constant miss rate by a factor of 1.57.
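As a sketch (ours), both DET quantities reduce to simple counts over labelled window scores:

```python
import numpy as np

def det_points(scores, labels, thresholds):
    """Return (FPPW, miss rate) pairs for a DET curve; 'labels' marks
    which scanned windows truly contain a person."""
    scores = np.asarray(scores)
    labels = np.asarray(labels, dtype=bool)
    points = []
    for t in thresholds:
        fired = scores >= t
        miss = 1.0 - fired[labels].mean()   # FalseNeg / (TruePos + FalseNeg)
        fppw = fired[~labels].mean()        # false positives per window
        points.append((fppw, miss))
    return points
```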
`
`5 Overview of Results
`
`Before presenting our detailed implementation and per-
`formance analysis, we compare the overall performance of
`our final HOG detectors with that of some other existing
`methods. Detectors based on rectangular (R-HOG) or cir-
`cular log-polar (C-HOG) blocks and linear or kernel SVM
`are compared with our implementations of the Haar wavelet,
`PCA-SIFT, and shape context approaches. Briefly, these ap-
`proaches are as follows:
Generalized Haar Wavelets. This is an extended set of oriented Haar-like wavelets similar to (but better than) that used in [17]. The features are rectified responses from 9×9 and 12×12 oriented 1st and 2nd derivative box filters at 45◦ intervals and the corresponding 2nd derivative xy filter.
`PCA-SIFT. These descriptors are based on projecting gradi-
`ent images onto a basis learned from training images using
`PCA [11]. Ke & Sukthankar found that they outperformed
`SIFT for key point based matching, but this is controversial
`[14]. Our implementation uses 16×16 blocks with the same
`derivative scale, overlap, etc., as our HOG descriptors. The
`PCA basis is calculated using the positive training images.
`Shape Contexts. The original Shape Contexts [1] used bi-
`nary edge-presence voting into log-polar spaced bins, irre-
`spective of edge orientation. We simulate this using our C-
`HOG descriptor (see below) with just 1 orientation bin. 16
`angular and 3 radial intervals with inner radius 2 pixels and
`outer radius 8 pixels gave the best results. Both gradient-
`strength and edge-presence based voting were tested, with
`
`the edge threshold chosen automatically to maximize detec-
`tion performance (the values selected were somewhat vari-
`able, in the region of 20–50 graylevels).
`Results. Fig. 3 shows the performance of the various detec-
`tors on the MIT and INRIA data sets. The HOG-based de-
`tectors greatly outperform the wavelet, PCA-SIFT and Shape
`Context ones, giving near-perfect separation on the MIT test
`set and at least an order of magnitude reduction in FPPW
`on the INRIA one. Our Haar-like wavelets outperform MIT
`wavelets because we also use 2nd order derivatives and con-
`trast normalize the output vector. Fig. 3(a) also shows MIT’s
`best parts based and monolithic detectors (the points are in-
`terpolated from [17]), however beware that an exact compar-
`ison is not possible as we do not know how the database in
`[17] was divided into training and test parts and the nega-
`tive images used are not available. The performances of the
`final rectangular (R-HOG) and circular (C-HOG) detectors
`are very similar, with C-HOG having the slight edge. Aug-
`menting R-HOG with primitive bar detectors (oriented 2nd
`derivatives – ‘R2-HOG’) doubles the feature dimension but
`further improves the performance (by 2% at 10−4 FPPW).
`Replacing the linear SVM with a Gaussian kernel one im-
`proves performance by about 3% at 10−4 FPPW, at the cost
`of much higher run times1. Using binary edge voting (EC-
`HOG) instead of gradient magnitude weighted voting (C-
`HOG) decreases performance by 5% at 10−4 FPPW, while
`omitting orientation information decreases it by much more,
`even if additional spatial or radial bins are added (by 33% at
`10−4 FPPW, for both edges (E-ShapeC) and gradients (G-
`ShapeC)). PCA-SIFT also performs poorly. One reason is
`that, in comparison to [11], many more (80 of 512) principal
`vectors have to be retained to capture the same proportion of
`the variance. This may be because the spatial registration is
`weaker when there is no keypoint detector.
`6
`Implementation and Performance Study
`We now give details of our HOG implementations and
`systematically study the effects of the various choices on de-
tector performance.
`
1 We use the hard examples generated by linear R-HOG to train the kernel R-HOG detector, as kernel R-HOG generates so few false positives that its hard example set is too sparse to improve the generalization significantly.
`
[Figure 3: DET curves (miss rate versus false positives per window) for Ker. R-HOG, Lin. R2-HOG, Lin. R-HOG, Lin. C-HOG, Lin. EC-HOG, Wavelet, PCA-SIFT, Lin. G-ShapeC and Lin. E-ShapeC; the MIT plot also shows MIT best (part) and MIT baseline.]

Figure 3. The performance of selected detectors on (left) MIT and (right) INRIA data sets. See the text for details.
`
Throughout this section we refer results to our default detector, which has the following properties, described below: RGB colour space with no gamma correction; [−1, 0, 1] gradient filter with no smoothing; linear gradient voting into 9 orientation bins in 0◦–180◦; 16×16 pixel blocks of four 8×8 pixel cells; Gaussian spatial window with σ = 8 pixels; L2-Hys (Lowe-style clipped L2 norm) block normalization; block spacing stride of 8 pixels (hence 4-fold coverage of each cell); 64×128 detection window; linear SVM classifier.
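For reference, the same defaults collected into a configuration sketch (field names are ours):

```python
from dataclasses import dataclass

@dataclass
class DefaultDetector:
    """Default R-HOG settings referred to throughout Section 6."""
    colour_space: str = "RGB"          # no gamma correction
    gradient_mask: tuple = (-1, 0, 1)  # 1-D centred filter, no smoothing
    orientation_bins: int = 9          # spaced over 0-180 deg (unsigned)
    cell_size: int = 8                 # pixels per cell side
    block_size: int = 2                # cells per block side (16x16 pixels)
    spatial_sigma: float = 8.0         # Gaussian window over each block
    block_norm: str = "L2-Hys"         # Lowe-style clipped L2 norm
    block_stride: int = 8              # pixels; 4-fold coverage of each cell
    window: tuple = (64, 128)          # detection window (width x height)
    classifier: str = "linear SVM"
```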
`Fig. 4 summarizes the effects of the various HOG param-
`eters on overall detection performance. These will be exam-
`ined in detail below. The main conclusions are that for good
`performance, one should use fine scale derivatives (essen-
`tially no smoothing), many orientation bins, and moderately
`sized, strongly normalized, overlapping descriptor blocks.
`6.1 Gamma/Colour Normalization
`We evaluated several input pixel representations includ-
`ing grayscale, RGB and LAB colour spaces optionally with
`power law (gamma) equalization. These normalizations have
`only a modest effect on performance, perhaps because the
`subsequent descriptor normalization achieves similar results.
`We do use colour information when available. RGB and
`LAB colour spaces give comparable results, but restricting
`to grayscale reduces performance by 1.5% at 10−4 FPPW.
`Square root gamma compression of each colour channel im-
`proves performance at low FPPW (by 1% at 10−4 FPPW)
`but log compression is too strong and worsens it by 2% at
`10−4 FPPW.
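As a sketch (ours), the compressions act per channel on a float RGB image scaled to [0, 1]; we use log1p here to avoid log(0), one plausible reading of "log compression":

```python
import numpy as np

def gamma_compress(img, mode="sqrt"):
    """Per-channel compression of a float RGB image scaled to [0, 1]."""
    if mode == "sqrt":                 # square-root gamma compression
        return np.sqrt(img)
    if mode == "log":                  # stronger compression; hurt results
        return np.log1p(img)
    return img                         # no compression
```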
`6.2 Gradient Computation
`Detector performance is sensitive to the way in which
`gradients are computed, but the simplest scheme turns out
`to be the best. We tested gradients computed using Gaus-
`sian smoothing followed by one of several discrete deriva-
tive masks. Several smoothing scales were tested, including σ=0 (none). Masks tested included various 1-D point derivatives (uncentred [−1, 1], centred [−1, 0, 1] and cubic-corrected [1, −8, 0, 8, −1]) as well as 3×3 Sobel masks and the 2×2 diagonal masks [0 1; −1 0] and [−1 0; 0 1] (the most compact centred 2-D derivative masks). Simple 1-D [−1, 0, 1] masks at
`σ=0 work best. Using larger masks always seems to de-
`crease performance, and smoothing damages it significantly:
`for Gaussian derivatives, moving from σ=0 to σ=2 reduces
`the recall rate from 89% to 80% at 10−4 FPPW. At σ=0,
`cubic corrected 1-D width 5 filters are about 1% worse than
`[−1, 0, 1] at 10−4 FPPW, while the 2×2 diagonal masks are
`1.5% worse. Using uncentred [−1, 1] derivative masks also
`decreases performance (by 1.5% at 10−4 FPPW), presum-
`ably because orientation estimation suffers as a result of the
`x and y filters being based at different centres.
`For colour images, we calculate separate gradients for
`each colour channel, and take the one with the largest norm
`as the pixel’s gradient vector.
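A sketch (ours) of this per-channel rule using the centred [−1, 0, 1] masks:

```python
import numpy as np

def colour_gradient(img):
    """Centred [-1, 0, 1] gradients per channel; each pixel keeps the
    gradient of whichever channel has the largest norm there.
    img: float array of shape (H, W, 3)."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1, :] = img[:, 2:, :] - img[:, :-2, :]
    gy[1:-1, :, :] = img[2:, :, :] - img[:-2, :, :]
    mag = np.hypot(gx, gy)                  # per-channel magnitudes (H, W, 3)
    best = mag.argmax(axis=2)               # winning channel per pixel
    ii, jj = np.indices(best.shape)
    return gx[ii, jj, best], gy[ii, jj, best], mag[ii, jj, best]
```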
`6.3 Spatial / Orientation Binning
`The next step is the fundamental nonlinearity of the de-
`scriptor. Each pixel calculates a weighted vote for an edge
`orientation histogram channel based on the orientation of the
`gradient element centred on it, and the votes are accumu-
`lated into orientation bins over local spatial regions that we
`call cells. Cells can be either rectangular or radial (log-polar
sectors). The orientation bins are evenly spaced over 0◦–180◦ ("unsigned" gradient) or 0◦–360◦ ("signed" gradient).
`To reduce aliasing, votes are interpolated bilinearly between
`the neighbouring bin centres in both orientation and posi-
`tion. The vote is a function of the gradient magnitude at the
`pixel, either the magnitude itself, its square, its square root,
`or a clipped form of the magnitude representing soft pres-
`ence/absence of an edge at the pixel. In practice, using the
magnitude itself gives the best results. Taking the square root reduces performance slightly, while using binary edge-presence voting decreases it significantly (by 5% at 10−4 FPPW).
`
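As a sketch (ours) of the anti-aliasing vote, in orientation only for brevity (the full scheme interpolates over x and y cell position as well):

```python
import numpy as np

def orientation_vote(hist, angle_deg, magnitude, bins=9):
    """Split one pixel's magnitude-weighted vote linearly between the two
    nearest bin centres of an unsigned (0-180 deg) orientation histogram."""
    width = 180.0 / bins                  # 20 deg bins for bins=9
    pos = angle_deg / width - 0.5         # position relative to bin centres
    lo = int(np.floor(pos)) % bins        # wraps around at 0/180 deg
    hi = (lo + 1) % bins
    frac = pos - np.floor(pos)
    hist[lo] += magnitude * (1.0 - frac)
    hist[hi] += magnitude * frac

h = np.zeros(9)
orientation_vote(h, 25.0, 1.0)  # splits 1.0 between the 10 and 30 deg centres
```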
[Figure 4: six DET plots showing (a) the effect of gradient scale σ; (b) the number of orientation bins β; (c) normalization methods; (d) block overlap (cell size 8, 2×2 cells); (e) window size; (f) Gaussian kernel width γ.]

Figure 4. For details see the text. (a) Using fine derivative scale significantly increases the performance ('c-cor' is the 1-D cubic-corrected point derivative). (b) Increasing the number of orientation bins increases performance significantly up to about 9 bins spaced over 0◦–180◦. (c) The effect of different block normalization schemes (see §6.4). (d) Using overlapping descriptor blocks decreases the miss rate by around 5%. (e) Reducing the 16 pixel margin around the 64×128 detection window decreases the performance by about 4%. (f) Using a Gaussian kernel SVM, exp(−γ‖x1 − x2‖²), improves the performance by about 3%.
`
`20
`
`15
`
`10
`
`05
`
`Miss Rate (%)
`
`12x12
`
`10x10
`
`4x4
`
`2x2
`8x8
`3x3
`6x6
`4x4 Block size (Cells)
`Cell size (pixels)
`Figure 5. The miss rate at 10−4 FPPW as the cell and block sizes
`change. The stride (block overlap) is fixed at half of the block size.
`3×3 blocks of 6×6 pixel cells perform best, with 10.4% miss rate.
`
`1x1
`
`
`Fine orientation coding turns out to be essential for good
`performance, whereas (see below) spatial binning can be
`rather coarse. As fig. 4(b) shows, increasing the number
`of orientation bins improves performance significantly up to
`about 9 bins, but makes little difference beyond this. This
is for bins spaced over 0◦–180◦, i.e. the 'sign' of the gradient is ignored. Including signed gradients (orientation range 0◦–360◦, as in the original SIFT descriptor) decreases the
`performance, even when the number of bins is also doubled
`to preserve the original orientation resolution. For humans,
`the wide range of clothing and background colours presum-
`ably makes the signs of contrasts uninformative. However
`note that including sign information does help substantially
`in some other object recognition tasks, e.g. cars, motorbikes.
`6.4 Normalization and Descriptor Blocks
Gradient strengths vary over a wide range owing to local variations in illumination and foreground-background contrast, so effective local contrast normalization turns out to be essential for good performance. We evaluated a number of different normalization schemes. Most of them are based on grouping cells into larger spatial blocks and contrast normalizing each block separately. The final descriptor is then the vector of all components of the normalized cell responses from all of the blocks in the detection window. In fact, we typically overlap the blocks so that each scalar
`cell response contributes several components to the final de-
`scriptor vector, each normalized with respect to a different
`block. This may seem redundant but good normalization is
`critical and including overlap significantly improves the per-
`formance. Fig. 4(d) shows that performance increases by 4%
`at 10−4 FPPW as we increase the overlap from none (stride
`16) to 16-fold area / 4-fold linear coverage (stride 4).
`We evaluated two classes of block geometries, square or
`rectangular ones partitioned into grids of square or rectangu-
`lar spatial cells, and circular blocks partitioned into cells in
`log-polar fashion. We will refer to these two arrangements
`as R-HOG and C-HOG (for rectangular and circular HOG).
`
`R-HOG. R-HOG blocks have many similarities to SIFT de-
`scriptors [12] but they are used quite differently. They are
`computed in dense grids at a single scale without dominant
`orientation alignment and used as part of a larger code vector
`that implicitly encodes spatial position relative to the detec-
`tion window, whereas SIFT’s are computed at a sparse set
`of scale-invariant key points, rotated to align their dominant
`orientations, and used individually. SIFT’s are optimized for
`sparse wide baseline matching, R-HOG’s for dense robust
`coding of spatial form. Other precursors include the edge
`orientation histograms of Freeman & Roth [4]. We usually
`use square R-HOG’s, i.e. ς×ς grids of η×η pixel cells each
`containing β orientation bins, where ς, η, β are parameters.
`Fig. 5 plots the miss rate at 10−4 FPPW w.r.t. cell size and
`block size in cells. For human detection, 3×3 cell blocks of
`6×6 pixel cells perform best, with 10.4% miss-rate at 10−4
FPPW. Our standard 2×2 blocks of 8×8 pixel cells are a close
`second. In fact, 6–8 pixel wide cells do best irrespective of
`the block size – an interesting coincidence as human limbs
`are about 6–8 pixels across in our images. 2×2 and 3×3 cell
`blocks work best. Adaptivity to local imaging conditions is
`weakened when the block becomes too big, and when it is
`too small (1×1 cell block, i.e. normalization over orientation
`alone) valuable spatial information is suppressed.
`As in [12], it is useful to downweight pixels near the edges
`of the block by applying a Gaussian spatial window to each
`pixel before accumulating orientation votes into cells. This
`improves performance by 1% at 10−4 FPPW for a Gaussian
`with σ = 0.5 ∗ block width.
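Assembling the pieces for the default detector, a sketch (ours) of the overlapping-block layout; for a 64×128 window of 8×8 pixel cells and 2×2 cell blocks at one-cell stride this gives 7×15 = 105 blocks of 36 values, a 3780-dimensional window descriptor:

```python
import numpy as np

def rhog_descriptor(cell_hist, block=2, stride=1, eps=1e-6):
    """Concatenate overlapping block x block groups of cell histograms,
    L2-normalizing each block separately (Gaussian pixel weighting is
    applied earlier, when the votes are accumulated)."""
    ny, nx, bins = cell_hist.shape
    out = []
    for i in range(0, ny - block + 1, stride):
        for j in range(0, nx - block + 1, stride):
            v = cell_hist[i:i + block, j:j + block].ravel()
            out.append(v / np.sqrt(np.sum(v ** 2) + eps ** 2))
    # each cell appears in up to block**2 blocks, normalized differently
    return np.concatenate(out)

feat = rhog_descriptor(np.random.rand(16, 8, 9))  # 105 blocks -> 3780 dims
```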
`We also tried including multiple block types with differ-
`ent cell and block sizes in the overall descriptor. This slightly
`improves performance (by around 3% at 10−4 FPPW), at the
`cost of greatly increased descriptor size.
`Besides square R-HOG blocks, we also tested vertical
`(2×1 cell) and horizontal (1×2 cell) blocks and a combined
`descriptor including both vertical and horizontal pairs. Verti-
`cal and vertical+horizontal pairs are significantly better than
`horizontal pairs alone, but not as good as 2×2 or 3×3 cell
`blocks (1% worse at 10−4 FPPW).
`C-HOG. Our circular block (C-HOG) descriptors are rem-
`
`iniscent of Shape Contexts [1] except that, crucially, each
`spatial cell contains a stack of gradient-weighted orienta-
`tion cells instead of a single orientation-independent edge-
`presence count. The log-polar grid was originally suggested
`by the idea that it would allow fine coding of nearby struc-
`ture to be combined with coarser coding of wider context,
`and the fact that the transformation from the visual field to
`the V1 cortex in primates is logarithmic [21]. However small
`descriptors with very few radial bins turn out to give the best
`performance, so in practice there is little inhomogeneity or
`context. It is probably better to think of C-HOG’s simply as
`an advanced form of centre-surround coding.
`We evaluated two variants of the C-HOG geometry,
`ones with a single circular central cell (similar to
`the GLOH feature of [14]), and ones whose cen-
`tral cell is divided into angular sectors as in shape
`contexts. We present results only for the circular-
`centre variants, as these have fewer spatial cells
`than the divided centre ones and give the same per-
`formance in practice. A technical report will provide fur-
`ther details. The C-HOG layout has four parameters: the
`numbers of angular and radial bins; the radius of the central
`bin in pixels; and the expansion factor for subsequent radii.
`At least two radial bins (a centre and a surround) and four
`angular bins (quartering) are needed for good performance.
`Including additional radial bins does not change the perfor-
`mance much, while increasing the number of angular bins
`decreases performance (by 1.3% at 10−4 FPPW when go-
`ing from 4 to 12 angular bins). 4 pixels is the best radius
`for the central bin, but 3 and 5 give similar results. Increas-
`ing the expansion factor from 2 to 3 leaves the performance
`essentially unchanged. With these parameters, neither Gaus-
`sian spatial weighting nor inverse weighting of cell votes by
cell area changes the performance, but combining the two
reduces it slightly. These values assume fine orientation sam-
`pling. Shape contexts (1 orientation bin) require much finer
`spatial subdivision to work well.
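A sketch (ours) of the log-polar cell lookup for the circular-centre layout with the parameters above (central radius 4 pixels, expansion factor 2, two radial and four angular bins):

```python
import numpy as np

def chog_cell(dx, dy, r0=4.0, expansion=2.0, radial_bins=2, angular_bins=4):
    """Map a pixel offset from the block centre to a (radial, angular)
    C-HOG cell; the circular central cell is radial bin 0 with no angular
    subdivision. Returns None outside the outer radius."""
    r = np.hypot(dx, dy)
    if r < r0:
        return (0, 0)                        # single central cell
    edge = r0
    for rb in range(1, radial_bins):
        edge *= expansion                    # radii grow geometrically
        if r < edge:
            theta = np.arctan2(dy, dx) % (2.0 * np.pi)
            return (rb, int(theta * angular_bins / (2.0 * np.pi)))
    return None                              # beyond the descriptor
```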
`
`Block Normalization schemes. We evaluated four differ-
`ent block normalization schemes for each of the above HOG
`geometries. Let v be the unnormalized descriptor vector,
‖v‖k be its k-norm for k = 1, 2, and ε be a small constant. The schemes are: (a) L2-norm, v → v/√(‖v‖₂² + ε²); (b) L2-Hys, L2-norm followed by clipping (limiting the maximum values of v to 0.2) and renormalizing, as in [12]; (c) L1-norm, v → v/(‖v‖₁ + ε); and (d) L1-sqrt, L1-norm followed by a square root, v → √(v/(‖v‖₁ + ε)), which amounts
`to treating the descriptor vectors as probability distributions
and using the Bhattacharyya distance between them. Fig. 4(c)
`shows that L2-Hys, L2-norm and L1-sqrt all perform equally
`well, while simple L1-norm reduces performance by 5%,
`and omitting normalization entirely reduces it by 27%, at
10−4 FPPW. Some regularization is needed as we evaluate descriptors densely, including on empty patches, but the results are insensitive to ε's value over a large range.
[Figure 6: panels (a)–(g), described in the caption.]

Figure 6. Our HOG detectors cue mainly on silhouette contours (especially the head, shoulders and feet). The most active blocks are centred on the image background just outside the contour. (a) The average gradient image over the training examples. (b) Each "pixel" shows the maximum positive SVM weight in the block centred on the pixel. (c) Likewise for the negative SVM weights. (d) A test image. (e) Its computed R-HOG descriptor. (f,g) The R-HOG descriptor weighted by respectively the positive and the negative SVM weights.
`
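The four schemes as a numpy sketch (ours), with ε and the 0.2 clipping threshold as in the text and a single renormalization pass for L2-Hys:

```python
import numpy as np

def block_normalize(v, scheme="L2-Hys", eps=1e-3):
    """The four block normalization schemes compared in Section 6.4."""
    v = np.asarray(v, dtype=float)
    if scheme == "L2-norm":
        return v / np.sqrt(np.sum(v ** 2) + eps ** 2)
    if scheme == "L2-Hys":                   # clipped L2, as in SIFT [12]
        u = v / np.sqrt(np.sum(v ** 2) + eps ** 2)
        u = np.minimum(u, 0.2)               # limit components to 0.2
        return u / np.sqrt(np.sum(u ** 2) + eps ** 2)
    if scheme == "L1-norm":
        return v / (np.sum(np.abs(v)) + eps)
    if scheme == "L1-sqrt":                  # Bhattacharyya-style distance
        return np.sqrt(v / (np.sum(np.abs(v)) + eps))
    raise ValueError(f"unknown scheme: {scheme}")
```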
`
`Centre-surround normalization. We also investigated an
`alternative centre-surround style cell normalization scheme,
`in which the image is tiled with a grid of cells and for
`each cell the total energy in the cell and its surrounding re-
`gion (summed over orientations and pooled using Gaussian
`weighting) is used to normalize the cell. However as fig. 4(c)
`(“window norm”) shows, this decreases performance relative
`to the corresponding block based scheme (by 2% at 10−4
`FPPW, for pooling with σ=1 cell widths). One reason is
`that there are no longer any overlapping blocks so each cell
`is coded only once in the final descriptor. Including several
`normalizations for each cell based on different pooling scales
`σ provides no perceptible change in performance, so it seems
`that it is the existence of several pooling regions with differ-
`ent spatial offsets relative to the cell that is important here,
`not the pooling scale.
`
`To clarify this point, consider the R-HOG detector with
`overlapping block