Histograms of Oriented Gradients for Human Detection

Navneet Dalal and Bill Triggs
INRIA Rhône-Alpes, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr
Abstract

We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
1 Introduction

Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt. The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) descriptors provide excellent performance relative to other existing feature sets including wavelets [17,22]. The proposed descriptors are reminiscent of edge orientation histograms [4,5], SIFT descriptors [12] and shape contexts [1], but they are computed on a dense grid of uniformly spaced cells and they use overlapping local contrast normalizations for improved performance. We make a detailed study of the effects of various implementation choices on detector performance, taking “pedestrian detection” (the detection of mostly visible people in more or less upright poses) as a test case. For simplicity and speed, we use linear SVM as a baseline classifier throughout the study. The new detectors give essentially perfect results on the MIT pedestrian test set [18,17], so we have created a more challenging set containing over 1800 pedestrian images with a large range of poses and backgrounds. Ongoing work suggests that our feature set performs equally well for other shape-based object classes.

We briefly discuss previous work on human detection in §2, give an overview of our method in §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.
2 Previous Work

There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18,17,22,16,20]. See [6] for a survey. Papageorgiou et al [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth [9]. Mikolajczyk et al [16] use combinations of orientation-position histograms with binary-thresholded gradient magnitudes to build a parts based method containing detectors for faces, heads, and front and side profiles of upper and lower body parts. In contrast, our detector uses a simpler architecture with a single detection window, but appears to give significantly higher performance on pedestrian images.
3 Overview of the Method

This section gives an overview of our feature extraction chain, which is summarized in fig. 1. Implementation details are postponed until §6. The method is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. Similar features have seen increasing use over the past decade [4,5,12,15]. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or edge directions, even without precise knowledge of the corresponding gradient or edge positions. In practice this is implemented by dividing the image window into small spatial regions (“cells”), for each cell accumulating a local 1-D histogram of gradient directions or edge orientations over the pixels of the cell. The combined histogram entries form the representation. For better invariance to illumination, shadowing, etc., it is also useful to contrast-normalize the local responses before using them. This can be done by accumulating a measure of local histogram “energy” over somewhat larger spatial regions (“blocks”) and using the results to normalize all of the cells in the block. We will refer to the normalized descriptor blocks as Histogram of Oriented Gradient (HOG) descriptors. Tiling the detection window with a dense (in fact, overlapping) grid of HOG descriptors and using the combined feature vector in a conventional SVM based window classifier gives our human detection chain (see fig. 1).

Figure 1. An overview of our feature extraction and object detection chain (input image → normalize gamma & colour → compute gradients → weighted vote into spatial & orientation cells → contrast normalize over overlapping spatial blocks → collect HOG’s over detection window → linear SVM → person/non-person classification). The detector window is tiled with a grid of overlapping blocks in which Histogram of Oriented Gradient feature vectors are extracted. The combined vectors are fed to a linear SVM for object/non-object classification. The detection window is scanned across the image at all positions and scales, and conventional non-maximum suppression is run on the output pyramid to detect object instances, but this paper concentrates on the feature extraction process.

The use of orientation histograms has many precursors [13,4,5], but it only reached maturity when combined with local spatial histogramming and normalization in Lowe’s Scale Invariant Feature Transformation (SIFT) approach to wide baseline image matching [12], in which it provides the underlying image patch descriptor for matching scale-invariant keypoints. SIFT-style approaches perform remarkably well in this application [12,14]. The Shape Context work [1] studied alternative cell and block shapes, albeit initially using only edge pixel counts without the orientation histogramming that makes the representation so effective. The success of these sparse feature based representations has somewhat overshadowed the power and simplicity of HOG’s as dense image descriptors. We hope that our study will help to rectify this. In particular, our informal experiments suggest that even the best current keypoint based approaches are likely to have false positive rates at least 1–2 orders of magnitude higher than our dense grid approach for human detection, mainly because none of the keypoint detectors that we are aware of detect human body structures reliably.

The HOG/SIFT representation has several advantages. It captures edge or gradient structure that is very characteristic of local shape, and it does so in a local representation with an easily controllable degree of invariance to local geometric and photometric transformations: translations or rotations make little difference if they are much smaller than the local spatial or orientation bin size. For human detection, rather coarse spatial sampling, fine orientation sampling and strong local photometric normalization turns out to be the best strategy, presumably because it permits limbs and body segments to change appearance and move from side to side quite a lot provided that they maintain a roughly upright orientation.
4 Data Sets and Methodology

Datasets. We tested our detector on two different data sets. The first is the well-established MIT pedestrian database [18], containing 509 training and 200 test images of pedestrians in city scenes (plus left-right reflections of these). It contains only front or back views with a relatively limited range of poses. Our best detectors give essentially perfect results on this data set, so we produced a new and significantly more challenging data set, ‘INRIA’, containing 1805 64×128 images of humans cropped from a varied set of personal photos. Fig. 2 shows some samples. The people are usually standing, but appear in any orientation and against a wide variety of background images including crowds. Many are bystanders taken from the image backgrounds, so there is no particular bias on their pose. The database is available from http://lear.inrialpes.fr/data for research purposes.

Figure 2. Some sample images from our new human detection database. The subjects are always upright, but with some partial occlusions and a wide range of variations in pose, appearance, clothing, illumination and background.

Methodology. We selected 1239 of the images as positive training examples, together with their left-right reflections (2478 images in all). A fixed set of 12180 patches sampled randomly from 1218 person-free training photos provided the initial negative set. For each detector and parameter combination a preliminary detector is trained and the 1218 negative training photos are searched exhaustively for false positives (‘hard examples’). The method is then re-trained using this augmented set (initial 12180 + hard examples) to produce the final detector. The set of hard examples is subsampled if necessary, so that the descriptors of the final training set fit into 1.7 GB of RAM for SVM training. This retraining process significantly improves the performance of each detector (by 5% at 10−4 False Positives Per Window tested (FPPW) for our default detector), but additional rounds of retraining make little difference so we do not use them.
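The retraining procedure just described is a bootstrapping loop over hard negatives. The following minimal sketch shows its structure under stated assumptions: scikit-learn’s LinearSVC stands in for the SVMLight-trained linear SVM, scikit-image’s hog for the descriptor, and the single-scale window scan, the soft-margin constant and the data-handling details are illustrative choices rather than the authors’ implementation.

```python
# Sketch of the bootstrapping ("hard example") retraining loop described above.
# `pos_desc` / `init_neg_desc` are 2-D arrays of window descriptors and
# `neg_photos` is a list of grayscale, person-free images supplied by the caller.
import numpy as np
from sklearn.svm import LinearSVC
from skimage.feature import hog

WIN_H, WIN_W = 128, 64                       # 64x128 detection window

def window_descriptor(window):
    """HOG descriptor with the default parameters of Section 6 (no Gaussian window)."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def hard_negatives(clf, photo, stride=8):
    """Scan one person-free photo exhaustively and return false-positive windows."""
    found = []
    for y in range(0, photo.shape[0] - WIN_H + 1, stride):
        for x in range(0, photo.shape[1] - WIN_W + 1, stride):
            d = window_descriptor(photo[y:y + WIN_H, x:x + WIN_W])
            if clf.decision_function([d])[0] > 0:    # detector fires on a non-person
                found.append(d)
    return found

def train_with_bootstrapping(pos_desc, init_neg_desc, neg_photos):
    # Preliminary detector on positives + random negatives.
    clf = LinearSVC(C=0.01).fit(
        np.vstack([pos_desc, init_neg_desc]),
        np.r_[np.ones(len(pos_desc)), np.zeros(len(init_neg_desc))])
    hard = [d for photo in neg_photos for d in hard_negatives(clf, photo)]
    # One retraining round on the augmented negative set (further rounds help little).
    neg = np.vstack([init_neg_desc] + ([np.array(hard)] if hard else []))
    return LinearSVC(C=0.01).fit(
        np.vstack([pos_desc, neg]),
        np.r_[np.ones(len(pos_desc)), np.zeros(len(neg))])
```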
To quantify detector performance we plot Detection Error Tradeoff (DET) curves on a log-log scale, i.e. miss rate (1 − Recall, or FalseNeg/(TruePos + FalseNeg)) versus FPPW. Lower values are better. DET plots are used extensively in speech and in NIST evaluations.
They present the same information as Receiver Operating Characteristics (ROC’s) but allow small probabilities to be distinguished more easily. We will often use miss rate at 10−4 FPPW as a reference point for results. This is arbitrary but no more so than, e.g., Area Under ROC. In a multiscale detector it corresponds to a raw error rate of about 0.8 false positives per 640×480 image tested. (The full detector has an even lower false positive rate owing to non-maximum suppression). Our DET curves are usually quite shallow so even very small improvements in miss rate are equivalent to large gains in FPPW at constant miss rate. For example, for our default detector at 10−4 FPPW, every 1% absolute (9% relative) reduction in miss rate is equivalent to reducing the FPPW at constant miss rate by a factor of 1.57.
5 Overview of Results

Before presenting our detailed implementation and performance analysis, we compare the overall performance of our final HOG detectors with that of some other existing methods. Detectors based on rectangular (R-HOG) or circular log-polar (C-HOG) blocks and linear or kernel SVM are compared with our implementations of the Haar wavelet, PCA-SIFT, and shape context approaches. Briefly, these approaches are as follows:

Generalized Haar Wavelets. This is an extended set of oriented Haar-like wavelets similar to (but better than) that used in [17]. The features are rectified responses from 9×9 and 12×12 oriented 1st and 2nd derivative box filters at 45◦ intervals and the corresponding 2nd derivative xy filter.

PCA-SIFT. These descriptors are based on projecting gradient images onto a basis learned from training images using PCA [11]. Ke & Sukthankar found that they outperformed SIFT for keypoint based matching, but this is controversial [14]. Our implementation uses 16×16 blocks with the same derivative scale, overlap, etc., as our HOG descriptors. The PCA basis is calculated using the positive training images.

Shape Contexts. The original Shape Contexts [1] used binary edge-presence voting into log-polar spaced bins, irrespective of edge orientation. We simulate this using our C-HOG descriptor (see below) with just 1 orientation bin. 16 angular and 3 radial intervals with inner radius 2 pixels and outer radius 8 pixels gave the best results. Both gradient-strength and edge-presence based voting were tested, with the edge threshold chosen automatically to maximize detection performance (the values selected were somewhat variable, in the region of 20–50 graylevels).

Results. Fig. 3 shows the performance of the various detectors on the MIT and INRIA data sets. The HOG-based detectors greatly outperform the wavelet, PCA-SIFT and Shape Context ones, giving near-perfect separation on the MIT test set and at least an order of magnitude reduction in FPPW on the INRIA one. Our Haar-like wavelets outperform MIT wavelets because we also use 2nd order derivatives and contrast normalize the output vector. Fig. 3(a) also shows MIT’s best parts based and monolithic detectors (the points are interpolated from [17]); however, beware that an exact comparison is not possible as we do not know how the database in [17] was divided into training and test parts and the negative images used are not available. The performances of the final rectangular (R-HOG) and circular (C-HOG) detectors are very similar, with C-HOG having the slight edge. Augmenting R-HOG with primitive bar detectors (oriented 2nd derivatives – ‘R2-HOG’) doubles the feature dimension but further improves the performance (by 2% at 10−4 FPPW). Replacing the linear SVM with a Gaussian kernel one improves performance by about 3% at 10−4 FPPW, at the cost of much higher run times¹. Using binary edge voting (EC-HOG) instead of gradient magnitude weighted voting (C-HOG) decreases performance by 5% at 10−4 FPPW, while omitting orientation information decreases it by much more, even if additional spatial or radial bins are added (by 33% at 10−4 FPPW, for both edges (E-ShapeC) and gradients (G-ShapeC)). PCA-SIFT also performs poorly. One reason is that, in comparison to [11], many more (80 of 512) principal vectors have to be retained to capture the same proportion of the variance. This may be because the spatial registration is weaker when there is no keypoint detector.

¹ We use the hard examples generated by linear R-HOG to train the kernel R-HOG detector, as kernel R-HOG generates so few false positives that its hard example set is too sparse to improve the generalization significantly.
6 Implementation and Performance Study

We now give details of our HOG implementations and systematically study the effects of the various choices on detector performance.
Figure 3. The performance of selected detectors on (left) MIT and (right) INRIA data sets. See the text for details.
Throughout this section we refer results to our default detector, which has the following properties, described below: RGB colour space with no gamma correction; [−1, 0, 1] gradient filter with no smoothing; linear gradient voting into 9 orientation bins in 0◦–180◦; 16×16 pixel blocks of four 8×8 pixel cells; Gaussian spatial window with σ = 8 pixels; L2-Hys (Lowe-style clipped L2 norm) block normalization; block spacing stride of 8 pixels (hence 4-fold coverage of each cell); 64×128 detection window; linear SVM classifier.
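For reference, a descriptor with essentially these default parameters can be computed with scikit-image’s hog routine. This is a sketch using a third-party implementation rather than the authors’ code; note that skimage’s hog does not apply the Gaussian spatial window mentioned above, and its block stride is fixed at one cell (8 pixels here), which happens to match the default.

```python
import numpy as np
from skimage.feature import hog

window = np.random.rand(128, 64)            # one 64x128 grayscale detection window
descriptor = hog(window,
                 orientations=9,            # 9 bins over 0-180 degrees (unsigned)
                 pixels_per_cell=(8, 8),    # 8x8 pixel cells
                 cells_per_block=(2, 2),    # 16x16 pixel blocks of 2x2 cells
                 block_norm='L2-Hys')       # Lowe-style clipped L2 normalization
# Blocks are placed every cell (8 pixel stride), giving 7x15 block positions and
# a 7*15*2*2*9 = 3780-dimensional feature vector for the 64x128 window.
assert descriptor.shape == (3780,)
```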
Fig. 4 summarizes the effects of the various HOG parameters on overall detection performance. These will be examined in detail below. The main conclusions are that for good performance, one should use fine scale derivatives (essentially no smoothing), many orientation bins, and moderately sized, strongly normalized, overlapping descriptor blocks.
6.1 Gamma/Colour Normalization

We evaluated several input pixel representations including grayscale, RGB and LAB colour spaces, optionally with power law (gamma) equalization. These normalizations have only a modest effect on performance, perhaps because the subsequent descriptor normalization achieves similar results. We do use colour information when available. RGB and LAB colour spaces give comparable results, but restricting to grayscale reduces performance by 1.5% at 10−4 FPPW. Square root gamma compression of each colour channel improves performance at low FPPW (by 1% at 10−4 FPPW) but log compression is too strong and worsens it by 2% at 10−4 FPPW.
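For illustration, the square root and log compressions compared here are simple per-channel maps; the sketch below assumes 8-bit input, and the exact scaling constants are our own choices rather than values from the paper.

```python
import numpy as np

def sqrt_gamma(img_uint8):
    """Square-root (power-law, gamma = 0.5) compression of each colour channel."""
    x = img_uint8.astype(np.float32) / 255.0
    return np.sqrt(x)

def log_compress(img_uint8, eps=1.0):
    """Log compression, found above to be too strong for this task."""
    x = img_uint8.astype(np.float32)
    return np.log(x + eps) / np.log(255.0 + eps)
```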
6.2 Gradient Computation

Detector performance is sensitive to the way in which gradients are computed, but the simplest scheme turns out to be the best. We tested gradients computed using Gaussian smoothing followed by one of several discrete derivative masks. Several smoothing scales were tested including σ=0 (none). Masks tested included various 1-D point derivatives (uncentred [−1, 1], centred [−1, 0, 1] and cubic-corrected [1, −8, 0, 8, −1]) as well as 3×3 Sobel masks and 2×2 diagonal ones, [0 1; −1 0] and [−1 0; 0 1] (the most compact centred 2-D derivative masks). Simple 1-D [−1, 0, 1] masks at σ=0 work best. Using larger masks always seems to decrease performance, and smoothing damages it significantly: for Gaussian derivatives, moving from σ=0 to σ=2 reduces the recall rate from 89% to 80% at 10−4 FPPW. At σ=0, cubic-corrected 1-D width-5 filters are about 1% worse than [−1, 0, 1] at 10−4 FPPW, while the 2×2 diagonal masks are 1.5% worse. Using uncentred [−1, 1] derivative masks also decreases performance (by 1.5% at 10−4 FPPW), presumably because orientation estimation suffers as a result of the x and y filters being based at different centres.

For colour images, we calculate separate gradients for each colour channel, and take the one with the largest norm as the pixel’s gradient vector.
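A sketch of this gradient stage with centred [−1, 0, 1] masks and per-channel selection by largest norm; the border handling and the use of plain numpy are our own choices, not prescribed by the paper.

```python
import numpy as np

def image_gradients(img):
    """img: float array, H x W x C (or H x W). Returns per-pixel gradient magnitude
    and unsigned orientation in degrees [0, 180), using [-1, 0, 1] masks and taking,
    at each pixel, the colour channel whose gradient has the largest norm."""
    if img.ndim == 2:
        img = img[:, :, None]
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1, :] = img[:, 2:, :] - img[:, :-2, :]    # centred horizontal derivative
    gy[1:-1, :, :] = img[2:, :, :] - img[:-2, :, :]    # centred vertical derivative
    mag = np.sqrt(gx**2 + gy**2)
    best = mag.argmax(axis=2)                          # channel with the largest norm
    i, j = np.indices(best.shape)
    gx, gy, mag = gx[i, j, best], gy[i, j, best], mag[i, j, best]
    ori = np.degrees(np.arctan2(gy, gx)) % 180.0       # fold to the unsigned range
    return mag, ori
```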
6.3 Spatial / Orientation Binning

The next step is the fundamental nonlinearity of the descriptor. Each pixel calculates a weighted vote for an edge orientation histogram channel based on the orientation of the gradient element centred on it, and the votes are accumulated into orientation bins over local spatial regions that we call cells. Cells can be either rectangular or radial (log-polar sectors). The orientation bins are evenly spaced over 0◦–180◦ (“unsigned” gradient) or 0◦–360◦ (“signed” gradient). To reduce aliasing, votes are interpolated bilinearly between the neighbouring bin centres in both orientation and position. The vote is a function of the gradient magnitude at the pixel, either the magnitude itself, its square, its square root, or a clipped form of the magnitude representing soft presence/absence of an edge at the pixel. In practice, using the magnitude itself gives the best results. Taking the square root reduces performance slightly, while using binary edge presence voting decreases it significantly (by 5% at 10−4 FPPW).

Fine orientation coding turns out to be essential for good performance, whereas (see below) spatial binning can be rather coarse. As fig. 4(b) shows, increasing the number of orientation bins improves performance significantly up to about 9 bins, but makes little difference beyond this. This is for bins spaced over 0◦–180◦, i.e. the ‘sign’ of the gradient is ignored. Including signed gradients (orientation range 0◦–360◦, as in the original SIFT descriptor) decreases the performance, even when the number of bins is also doubled to preserve the original orientation resolution. For humans, the wide range of clothing and background colours presumably makes the signs of contrasts uninformative. However, note that including sign information does help substantially in some other object recognition tasks, e.g. cars, motorbikes.

Figure 4. For details see the text. (a) Using fine derivative scale significantly increases the performance (‘c-cor’ is the 1-D cubic-corrected point derivative). (b) Increasing the number of orientation bins increases performance significantly up to about 9 bins spaced over 0◦–180◦. (c) The effect of different block normalization schemes (see §6.4). (d) Using overlapping descriptor blocks decreases the miss rate by around 5%. (e) Reducing the 16 pixel margin around the 64×128 detection window decreases the performance by about 4%. (f) Using a Gaussian kernel SVM, exp(−γ‖x1 − x2‖²), improves the performance by about 3%.

Figure 5. The miss rate at 10−4 FPPW as the cell and block sizes change. The stride (block overlap) is fixed at half of the block size. 3×3 blocks of 6×6 pixel cells perform best, with 10.4% miss rate.
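As an illustration of the voting scheme of §6.3, the sketch below accumulates magnitude-weighted votes for the pixels of a single cell, with linear interpolation between the two nearest orientation bin centres; the interpolation between neighbouring cells in position is omitted for brevity.

```python
import numpy as np

def cell_histogram(mag, ori, n_bins=9):
    """mag, ori: arrays for the pixels of one cell (orientation in degrees, [0, 180)).
    Each pixel votes its gradient magnitude, split linearly between the two nearest
    orientation bin centres to reduce aliasing."""
    bin_width = 180.0 / n_bins
    hist = np.zeros(n_bins)
    # Continuous bin coordinate; bin centres sit at (k + 0.5) * bin_width.
    b = ori.ravel() / bin_width - 0.5
    lo = np.floor(b).astype(int)
    frac = b - lo
    for l, f, m in zip(lo, frac, mag.ravel()):
        hist[l % n_bins] += (1.0 - f) * m        # vote for the lower bin centre
        hist[(l + 1) % n_bins] += f * m          # and for the neighbouring one
    return hist
```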
6.4 Normalization and Descriptor Blocks

Gradient strengths vary over a wide range owing to local variations in illumination and foreground-background contrast, so effective local contrast normalization turns out to be essential for good performance. We evaluated a number of different normalization schemes. Most of them are based on grouping cells into larger spatial blocks and contrast normalizing each block separately. The final descriptor is then the vector of all components of the normalized cell responses from all of the blocks in the detection window.
In fact, we typically overlap the blocks so that each scalar cell response contributes several components to the final descriptor vector, each normalized with respect to a different block. This may seem redundant but good normalization is critical and including overlap significantly improves the performance. Fig. 4(d) shows that performance increases by 4% at 10−4 FPPW as we increase the overlap from none (stride 16) to 16-fold area / 4-fold linear coverage (stride 4).

We evaluated two classes of block geometries, square or rectangular ones partitioned into grids of square or rectangular spatial cells, and circular blocks partitioned into cells in log-polar fashion. We will refer to these two arrangements as R-HOG and C-HOG (for rectangular and circular HOG).
R-HOG. R-HOG blocks have many similarities to SIFT descriptors [12] but they are used quite differently. They are computed in dense grids at a single scale without dominant orientation alignment and used as part of a larger code vector that implicitly encodes spatial position relative to the detection window, whereas SIFT’s are computed at a sparse set of scale-invariant key points, rotated to align their dominant orientations, and used individually. SIFT’s are optimized for sparse wide baseline matching, R-HOG’s for dense robust coding of spatial form. Other precursors include the edge orientation histograms of Freeman & Roth [4]. We usually use square R-HOG’s, i.e. ς×ς grids of η×η pixel cells each containing β orientation bins, where ς, η, β are parameters. Fig. 5 plots the miss rate at 10−4 FPPW w.r.t. cell size and block size in cells. For human detection, 3×3 cell blocks of 6×6 pixel cells perform best, with 10.4% miss rate at 10−4 FPPW. Our standard 2×2 cell blocks of 8×8 pixel cells are a close second. In fact, 6–8 pixel wide cells do best irrespective of the block size – an interesting coincidence as human limbs are about 6–8 pixels across in our images. 2×2 and 3×3 cell blocks work best. Adaptivity to local imaging conditions is weakened when the block becomes too big, and when it is too small (1×1 cell block, i.e. normalization over orientation alone) valuable spatial information is suppressed.

As in [12], it is useful to downweight pixels near the edges of the block by applying a Gaussian spatial window to each pixel before accumulating orientation votes into cells. This improves performance by 1% at 10−4 FPPW for a Gaussian with σ = 0.5 ∗ block width.
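A sketch of this spatial weighting for the default 16×16 pixel block (σ = 0.5 × block width = 8 pixels); the weights multiply each pixel’s gradient magnitude before its votes are accumulated into the block’s cells.

```python
import numpy as np

def gaussian_block_window(block_size=16, sigma=None):
    """2-D Gaussian weights centred on the block, with sigma = 0.5 * block width."""
    if sigma is None:
        sigma = 0.5 * block_size
    c = (block_size - 1) / 2.0
    y, x = np.mgrid[0:block_size, 0:block_size]
    return np.exp(-((x - c) ** 2 + (y - c) ** 2) / (2.0 * sigma ** 2))

# Usage: weighted_mag = gaussian_block_window(16) * mag_block
```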
We also tried including multiple block types with different cell and block sizes in the overall descriptor. This slightly improves performance (by around 3% at 10−4 FPPW), at the cost of greatly increased descriptor size.

Besides square R-HOG blocks, we also tested vertical (2×1 cell) and horizontal (1×2 cell) blocks and a combined descriptor including both vertical and horizontal pairs. Vertical and vertical+horizontal pairs are significantly better than horizontal pairs alone, but not as good as 2×2 or 3×3 cell blocks (1% worse at 10−4 FPPW).
C-HOG. Our circular block (C-HOG) descriptors are reminiscent of Shape Contexts [1] except that, crucially, each spatial cell contains a stack of gradient-weighted orientation cells instead of a single orientation-independent edge-presence count. The log-polar grid was originally suggested by the idea that it would allow fine coding of nearby structure to be combined with coarser coding of wider context, and the fact that the transformation from the visual field to the V1 cortex in primates is logarithmic [21]. However, small descriptors with very few radial bins turn out to give the best performance, so in practice there is little inhomogeneity or context. It is probably better to think of C-HOG’s simply as an advanced form of centre-surround coding.

We evaluated two variants of the C-HOG geometry, ones with a single circular central cell (similar to the GLOH feature of [14]), and ones whose central cell is divided into angular sectors as in shape contexts. We present results only for the circular-centre variants, as these have fewer spatial cells than the divided-centre ones and give the same performance in practice. A technical report will provide further details. The C-HOG layout has four parameters: the numbers of angular and radial bins; the radius of the central bin in pixels; and the expansion factor for subsequent radii. At least two radial bins (a centre and a surround) and four angular bins (quartering) are needed for good performance. Including additional radial bins does not change the performance much, while increasing the number of angular bins decreases performance (by 1.3% at 10−4 FPPW when going from 4 to 12 angular bins). 4 pixels is the best radius for the central bin, but 3 and 5 give similar results. Increasing the expansion factor from 2 to 3 leaves the performance essentially unchanged. With these parameters, neither Gaussian spatial weighting nor inverse weighting of cell votes by cell area changes the performance, but combining these two reduces performance slightly. These values assume fine orientation sampling. Shape contexts (1 orientation bin) require much finer spatial subdivision to work well.
Block Normalization schemes. We evaluated four different block normalization schemes for each of the above HOG geometries. Let v be the unnormalized descriptor vector, ‖v‖_k be its k-norm for k = 1, 2, and ε be a small constant. The schemes are: (a) L2-norm, v → v/√(‖v‖_2² + ε²); (b) L2-Hys, L2-norm followed by clipping (limiting the maximum values of v to 0.2) and renormalizing, as in [12]; (c) L1-norm, v → v/(‖v‖_1 + ε); and (d) L1-sqrt, L1-norm followed by square root, v → √(v/(‖v‖_1 + ε)), which amounts to treating the descriptor vectors as probability distributions and using the Bhattacharyya distance between them. Fig. 4(c) shows that L2-Hys, L2-norm and L1-sqrt all perform equally well, while simple L1-norm reduces performance by 5%, and omitting normalization entirely reduces it by 27%, at 10−4 FPPW. Some regularization is needed as we evaluate descriptors densely, including on empty patches, but the results are insensitive to ε’s value over a large range.

Figure 6. Our HOG detectors cue mainly on silhouette contours (especially the head, shoulders and feet). The most active blocks are centred on the image background just outside the contour. (a) The average gradient image over the training examples. (b) Each “pixel” shows the maximum positive SVM weight in the block centred on the pixel. (c) Likewise for the negative SVM weights. (d) A test image. (e) Its computed R-HOG descriptor. (f,g) The R-HOG descriptor weighted by respectively the positive and the negative SVM weights.
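Written out as vector operations, the four schemes look as follows (a sketch; the clipping threshold of 0.2 follows [12], and the value of ε is our own illustrative choice, which matters little as noted above).

```python
import numpy as np

EPS = 1e-3   # small regularization constant; results are insensitive to its value

def l2_norm(v, eps=EPS):
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)

def l2_hys(v, eps=EPS, clip=0.2):
    """L2-norm, clip components at 0.2, then renormalize (Lowe-style)."""
    v = np.minimum(l2_norm(v, eps), clip)
    return l2_norm(v, eps)

def l1_norm(v, eps=EPS):
    return v / (np.sum(np.abs(v)) + eps)

def l1_sqrt(v, eps=EPS):
    """L1-normalize then take square roots (Bhattacharyya-style)."""
    return np.sqrt(l1_norm(v, eps))
```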
Centre-surround normalization. We also investigated an alternative centre-surround style cell normalization scheme, in which the image is tiled with a grid of cells and for each cell the total energy in the cell and its surrounding region (summed over orientations and pooled using Gaussian weighting) is used to normalize the cell. However, as fig. 4(c) (“window norm”) shows, this decreases performance relative to the corresponding block based scheme (by 2% at 10−4 FPPW, for pooling with σ = 1 cell widths). One reason is that there are no longer any overlapping blocks so each cell is coded only once in the final descriptor. Including several normalizations for each cell based on different pooling scales σ provides no perceptible change in performance, so it seems that it is the existence of several pooling regions with different spatial offsets relative to the cell that is important here, not the pooling scale.

To clarify this point, consider the R-HOG detector with overlapping block
