`
`
`
the detector by focusing attention on promising regions of the image. The notion behind focus of attention approaches is that it is often possible to rapidly determine where in an image an object might occur [16, 7, 1]. More complex processing is reserved only for these promising regions. The key measure of such an approach is the "false negative" rate of the attentional process. It must be the case that all, or almost all, object instances are selected by the attentional filter.
`
We will describe a process for training an extremely simple and efficient classifier which can be used as a "supervised" focus of attention operator. The term supervised refers to the fact that the attentional operator is trained to detect examples of a particular class. In the domain of face detection it is possible to achieve fewer than 1% false negatives and 40% false positives using a classifier constructed from two Haar-like features. The effect of this filter is to reduce by over one half the number of locations where the final detector must be evaluated.
`
Those sub-windows which are not rejected by the initial classifier are processed by a sequence of classifiers, each slightly more complex than the last. If any classifier rejects the sub-window, no further processing is performed. The structure of the cascaded detection process is essentially that of a degenerate decision tree, and as such is related to the work of Geman and colleagues [1, 3].
An extremely fast face detector will have broad practical applications. These include user interfaces, image databases, and teleconferencing. In applications where rapid frame-rates are not necessary, our system will allow for significant additional post-processing and analysis. In addition our system can be implemented on a wide range of small low power devices, including hand-helds and embedded processors. In our lab we have implemented this face detector on the Compaq iPaq handheld and have achieved detection at two frames per second (this device has a low power 200 MIPS StrongARM processor which lacks floating point hardware).
The remainder of the paper describes our contributions and a number of experimental results, including a detailed description of our experimental methodology. Discussion of closely related work takes place at the end of each section.
`
`2. Features
`
Our object detection procedure classifies images based on the value of simple features. There are many motivations for using features rather than the pixels directly. The most common reason is that features can act to encode ad-hoc domain knowledge that is difficult to learn using a finite quantity of training data. For this system there is also a second critical motivation for features: the feature-based system operates much faster than a pixel-based system.
The simple features used are reminiscent of Haar basis functions which have been used by Papageorgiou et al. [9].
`
`
`
Figure 1: Example rectangle features shown relative to the enclosing detection window. The sum of the pixels which lie within the white rectangles is subtracted from the sum of pixels in the grey rectangles. Two-rectangle features are shown in (A) and (B). Figure (C) shows a three-rectangle feature, and (D) a four-rectangle feature.
`
More specifically, we use three kinds of features. The value of a two-rectangle feature is the difference between the sums of the pixels within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent (see Figure 1). A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally a four-rectangle feature computes the difference between diagonal pairs of rectangles.
Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite large, over 180,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete¹.
`
`2.1. Integral Image
Rectangle features can be computed very rapidly using an intermediate representation for the image which we call the integral image.² The integral image at location $x, y$ contains the sum of the pixels above and to the left of $x, y$, inclusive:

$$ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y'),$$

where $ii(x, y)$ is the integral image and $i(x, y)$ is the original image. Using the following pair of recurrences:

$$s(x, y) = s(x, y - 1) + i(x, y) \qquad (1)$$
$$ii(x, y) = ii(x - 1, y) + s(x, y) \qquad (2)$$

(where $s(x, y)$ is the cumulative row sum, $s(x, -1) = 0$, and $ii(-1, y) = 0$) the integral image can be computed in one pass over the original image.
`
¹A complete basis has no linear dependence between basis elements and has the same number of elements as the image space, in this case 576. The full set of 180,000 features is many times over-complete.
²There is a close relation to "summed area tables" as used in graphics [2]. We choose a different name here in order to emphasize its use for the analysis of images, rather than for texture mapping.
`
`1-512
`
`
`
`
`
Figure 2: The sum of the pixels within rectangle D can be computed with four array references. The value of the integral image at location 1 is the sum of the pixels in rectangle A. The value at location 2 is A + B, at location 3 is A + C, and at location 4 is A + B + C + D. The sum within D can be computed as 4 + 1 - (2 + 3).
`
Using the integral image any rectangular sum can be computed in four array references (see Figure 2). Clearly the difference between two rectangular sums can be computed in eight references. Since the two-rectangle features defined above involve adjacent rectangular sums they can be computed in six array references, eight in the case of the three-rectangle features, and nine for four-rectangle features.
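As a concrete illustration of these identities (a minimal NumPy sketch, not the original implementation; the function names are ours), the integral image and a rectangle sum can be computed as follows:

```python
import numpy as np

def integral_image(img):
    """ii(x, y): sum of all pixels above and to the left of (x, y), inclusive.
    Two cumulative sums implement the pair of recurrences in a single pass."""
    return img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y), using the
    four-array-reference identity of Figure 2: D = 4 + 1 - (2 + 3)."""
    p1 = ii[y - 1, x - 1] if (x > 0 and y > 0) else 0   # above and left of the rectangle
    p2 = ii[y - 1, x + w - 1] if y > 0 else 0           # above the rectangle
    p3 = ii[y + h - 1, x - 1] if x > 0 else 0           # left of the rectangle
    p4 = ii[y + h - 1, x + w - 1]                       # bottom-right corner
    return p4 + p1 - p2 - p3

def two_rect_feature(ii, x, y, w, h):
    """A horizontally adjacent two-rectangle feature: the difference of two
    equally sized rectangular sums (six references once corners are shared)."""
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h)
```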
`
`2.2. Feature Discussion
`
Rectangle features are somewhat primitive when compared with alternatives such as steerable filters [4, 6]. Steerable filters, and their relatives, are excellent for the detailed analysis of boundaries, image compression, and texture analysis. In contrast rectangle features, while sensitive to the presence of edges, bars, and other simple image structure, are quite coarse. Unlike steerable filters the only orientations available are vertical, horizontal, and diagonal. The set of rectangle features do however provide a rich image representation which supports effective learning. In conjunction with the integral image, the efficiency of the rectangle feature set provides ample compensation for their limited flexibility.
`
`3. Learning Classification Functions
`
Given a feature set and a training set of positive and negative images, any number of machine learning approaches could be used to learn a classification function. In our system a variant of AdaBoost is used both to select a small set of features and to train the classifier [5]. In its original form, the AdaBoost learning algorithm is used to boost the classification performance of a simple (sometimes called weak) learning algorithm. There are a number of formal guarantees provided by the AdaBoost learning procedure. Freund and Schapire proved that the training error of the strong classifier approaches zero exponentially in the number of rounds. More importantly a number of results were later proved about generalization performance [12]. The key insight is that generalization performance is related to the margin of the examples, and that AdaBoost achieves large margins rapidly.
Recall that there are over 180,000 rectangle features associated with each image sub-window, a number far larger than the number of pixels. Even though each feature can be computed very efficiently, computing the complete set is prohibitively expensive. Our hypothesis, which is borne out by experiment, is that a very small number of these features can be combined to form an effective classifier. The main challenge is to find these features.
In support of this goal, the weak learning algorithm is designed to select the single rectangle feature which best separates the positive and negative examples (this is similar to the approach of [15] in the domain of image database retrieval). For each feature, the weak learner determines the optimal threshold classification function, such that the minimum number of examples are misclassified. A weak classifier $h_j(x)$ thus consists of a feature $f_j$, a threshold $\theta_j$, and a polarity $p_j$ indicating the direction of the inequality sign:

$$h_j(x) = \begin{cases} 1 & \text{if } p_j f_j(x) < p_j \theta_j \\ 0 & \text{otherwise} \end{cases}$$

Here $x$ is a 24x24 pixel sub-window of an image. See Figure 3 for a summary of the boosting process.
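One standard way to realize this weak learner is a single pass over the examples sorted by feature value, keeping running sums of positive and negative weight. The sketch below is our own illustration (with hypothetical names, not the authors' code); it returns the weighted error, threshold, and polarity of the best thresholding of a single feature:

```python
import numpy as np

def train_stump(feature_values, labels, weights):
    """Pick the threshold and polarity that minimize the weighted error
    sum_i w_i |h(x_i) - y_i| for one feature (labels are 0/1)."""
    order = np.argsort(feature_values)
    f, y, w = feature_values[order], labels[order], weights[order]
    total_pos, total_neg = w[y == 1].sum(), w[y == 0].sum()
    seen_pos = seen_neg = 0.0
    best_err, best_theta, best_p = np.inf, f[0], 1
    for i in range(len(f)):
        # error if everything below f[i] is called positive (p = +1) or negative (p = -1)
        err_pos_below = seen_neg + (total_pos - seen_pos)
        err_neg_below = seen_pos + (total_neg - seen_neg)
        if min(err_pos_below, err_neg_below) < best_err:
            if err_pos_below < err_neg_below:
                best_err, best_theta, best_p = err_pos_below, f[i], 1
            else:
                best_err, best_theta, best_p = err_neg_below, f[i], -1
        seen_pos += w[i] * (y[i] == 1)
        seen_neg += w[i] * (y[i] == 0)
    # the resulting weak classifier predicts 1 iff best_p * f(x) < best_p * best_theta
    return best_err, best_theta, best_p
```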
`In practice no single feature can perform the classifica-
`tion task with low error. Features which are selected in early
rounds of the boosting process have error rates between 0.1
`and 0.3. Features selected in later rounds, as the task be-
`comes more difficult, yield error rates between 0.4 and 0.5.
`
`3.1. Learning Discussion
`
`Many general feature selection procedures have been pro-
`posed (see chapter 8 of [17] for a review). Our final appli-
`cation demanded a very aggressive approach which would
`discard the vast majority of features. For a similar recogni-
tion problem Papageorgiou et al. proposed a scheme for feature selection based on feature variance [9]. They demonstrated good results selecting 37 features out of a total of 1734 features.
`
`Roth et al. propose a feature selection process based
`on the Winnow exponential perceptron learning rule [10].
`The Winnow learning process converges to a solution where
`many of these weights are zero. Nevertheless a very large
`number of features are retained (perhaps a few hundred or
`thousand).
`
`3.2. Learning Results
`
`While details on the training and performance of the final
`system are presented in Section 5, several simple results
`merit discussion.
`
`
`
`
`
`
Figure 4: The first and second features selected by AdaBoost. The two features are shown in the top row and then overlayed on a typical training face in the bottom row. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose.
`
`4. The Attentional Cascade
`
This section describes an algorithm for constructing a cascade of classifiers which achieves increased detection performance while radically reducing computation time. The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances (i.e. the threshold of a boosted classifier can be adjusted so that the false negative rate is close to zero). Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates.
The overall form of the detection process is that of a degenerate decision tree, what we call a "cascade" (see Figure 5). A positive result from the first classifier triggers the evaluation of a second classifier which has also been adjusted to achieve very high detection rates. A positive result from the second classifier triggers a third classifier, and so on. A negative outcome at any point leads to the immediate rejection of the sub-window.
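The control flow of the cascade is simple; a minimal sketch (ours, with assumed stage objects that expose a boosted score function and a threshold) is:

```python
def cascade_accepts(stages, window):
    """Evaluate a sub-window against a cascade of boosted classifiers,
    ordered from cheapest to most complex.  A rejection at any stage ends
    the computation immediately; only survivors reach the later stages."""
    for score_fn, threshold in stages:      # each stage: (boosted score, threshold)
        if score_fn(window) < threshold:
            return False                    # negative outcome: reject and stop
    return True                             # passed every stage: report a detection
```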
`Stages in the cascade are constructed by training clas-
`sifiers using AdaBoost and then adjusting the threshold to
`minimize false negatives. Note that the default AdaBoost
`threshold is designed to yield a low error rate on the train-
`ing data. In general a lower threshold yields higher detec-
`tion rates and higher false positive rates.
For example an excellent first stage classifier can be constructed from a two-feature strong classifier by reducing the threshold to minimize false negatives. Measured against a validation training set, the threshold can be adjusted to detect 100% of the faces with a false positive rate of 40% (see Figure 4 for a description of the two features used in this classifier).
`
`
`
`
`
`
• Given example images $(x_1, y_1), \ldots, (x_n, y_n)$ where $y_i = 0, 1$ for negative and positive examples respectively.

• Initialize weights $w_{1,i} = \frac{1}{2m}, \frac{1}{2l}$ for $y_i = 0, 1$ respectively, where $m$ and $l$ are the number of negatives and positives respectively.

• For $t = 1, \ldots, T$:

  1. Normalize the weights, $w_{t,i} \leftarrow w_{t,i} / \sum_{j=1}^{n} w_{t,j}$, so that $w_t$ is a probability distribution.

  2. For each feature, $j$, train a classifier $h_j$ which is restricted to using a single feature. The error is evaluated with respect to $w_t$: $\epsilon_j = \sum_i w_i \, |h_j(x_i) - y_i|$.

  3. Choose the classifier, $h_t$, with the lowest error $\epsilon_t$.

  4. Update the weights: $w_{t+1,i} = w_{t,i} \beta_t^{1 - e_i}$, where $e_i = 0$ if example $x_i$ is classified correctly, $e_i = 1$ otherwise, and $\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}$.

• The final strong classifier is:

$$h(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise} \end{cases}$$

where $\alpha_t = \log \frac{1}{\beta_t}$.
`
`
`Figure 3: The AdaBoost algorithm for classifier learn-
`ing. Each round of boosting selects one feature from the
`180,000 potential features.
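Read as code, the loop of Figure 3 might look like the following sketch (our illustration rather than the authors' implementation; it reuses the hypothetical train_stump helper sketched earlier, and feature_matrix[j, i] holds the value of feature j on training example i):

```python
import numpy as np

def adaboost(feature_matrix, y, T):
    """Select T weak classifiers and their weights following Figure 3."""
    m, l = np.sum(y == 0), np.sum(y == 1)                 # negatives, positives
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))    # initial weights w_{1,i}
    strong = []                                           # (alpha_t, feature j, theta, p)
    for t in range(T):
        w = w / w.sum()                                   # 1. normalize the weights
        best = None
        for j in range(feature_matrix.shape[0]):          # 2. one stump per feature
            err, theta, p = train_stump(feature_matrix[j], y, w)
            if best is None or err < best[0]:
                best = (err, j, theta, p)                 # 3. keep the lowest error
        err, j, theta, p = best
        h = (p * feature_matrix[j] < p * theta).astype(float)   # h_t on every example
        beta = max(err, 1e-10) / (1.0 - err)
        w = w * np.power(beta, 1.0 - np.abs(h - y))       # 4. down-weight correct examples
        strong.append((np.log(1.0 / beta), j, theta, p))
    return strong   # classify x as 1 iff sum_t alpha_t h_t(x) >= 0.5 * sum_t alpha_t
```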
`
Initial experiments demonstrated that a frontal face classifier constructed from 200 features yields a detection rate of 95% with a false positive rate of 1 in 14084. These results are compelling, but not sufficient for many real-world tasks. In terms of computation, this classifier is probably faster than any other published system, requiring 0.7 seconds to scan a 384 by 288 pixel image. Unfortunately, the most straightforward technique for improving detection performance, adding features to the classifier, directly increases computation time.
`
For the task of face detection, the initial rectangle features selected by AdaBoost are meaningful and easily interpreted. The first feature selected seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks (see Figure 4). This feature is relatively large in comparison with the detection sub-window, and should be somewhat insensitive to size and location of the face. The second feature selected relies on the property that the eyes are darker than the bridge of the nose.
`
`
`
`
`4.2. Detector Cascade Discussion
`
The complete face detection cascade has 38 stages with over 6000 features. Nevertheless the cascade structure results in fast average detection times. On a difficult dataset, containing 507 faces and 75 million sub-windows, faces are detected using an average of 10 feature evaluations per sub-window. In comparison, this system is about 15 times faster than an implementation of the detection system constructed by Rowley et al.³ [11]
A notion similar to the cascade appears in the face detection system described by Rowley et al. in which two detection networks are used [11]. Rowley et al. used a faster yet less accurate network to prescreen the image in order to find candidate regions for a slower more accurate network. Though it is difficult to determine exactly, it appears that Rowley et al.'s two network face system is the fastest existing face detector.⁴
`The structure of the cascaded detection process is es-
`sentially that of a degenerate decision tree. and as such is
`related to the work of Amit and Geman [1]. Unlike tech-
`niques which use a fixed detector, Amit and Geman propose
`an alternative point of view where unusual co-occurrences
`of simple image features are used to trigger the evaluation
`of a more complex detection process.
`In this way the full
`detection process need not be evaluated at many of the po-
`tential image locations and scales. While this basic insight
`is very valuable, in their implementation it is necessary to
`first evaluate some feature detector at every location. These
`features are then grouped to find unusual co-occurrences. In
practice, since the form of our detector and the features that
`it uses are extremely efficient, the amortized cost of evalu-
`ating our detector at every scale and location is much faster
`than finding and grouping edges throughout the image.
In recent work Fleuret and Geman have presented a face detection technique which relies on a "chain" of tests in order to signify the presence of a face at a particular scale and location [3].
`
³Henry Rowley very graciously supplied us with implementations of his detection system for direct comparison. Reported results are against his fastest system. It is difficult to determine from the published literature, but the Rowley-Baluja-Kanade detector is widely considered the fastest detection system and has been heavily tested on real-world problems.
⁴Other published detectors have either neglected to discuss performance in detail, or have never published detection and false positive rates on a large and difficult training set.
`
Figure 5: Schematic depiction of the detection cascade. A series of classifiers are applied to every sub-window. The initial classifier eliminates a large number of negative examples with very little processing. Subsequent layers eliminate additional negatives but require additional computation. After several stages of processing the number of sub-windows has been reduced radically. Further processing can take any form such as additional stages of the cascade (as in our detection system) or an alternative detection system.
`
Computation of the two-feature classifier amounts to about 60 microprocessor instructions. It seems hard to imagine that any simpler filter could achieve higher rejection rates. By comparison, scanning a simple image template, or a single layer perceptron, would require at least 20 times as many operations per sub-window.
The structure of the cascade reflects the fact that within any single image an overwhelming majority of sub-windows are negative. As such, the cascade attempts to reject as many negatives as possible at the earliest stage possible. While a positive instance will trigger the evaluation of every classifier in the cascade, this is an exceedingly rare event.
`
`Much like a decision tree, subsequent classifiers are
`trained using those examples which pass through all the
previous stages. As a result, the second classifier faces a more difficult task than the first. The examples which make it through the first stage are "harder" than typical examples. The more difficult examples faced by deeper classifiers push the entire receiver operating characteristic (ROC) curve downward. At a given detection rate, deeper classifiers have correspondingly higher false positive rates.
`
`4.1. Training a Cascade of Classifiers
`
The cascade training process involves two types of trade-offs. In most cases classifiers with more features will achieve higher detection rates and lower false positive rates. At the same time classifiers with more features require more time to compute. In principle one could define an optimization framework in which: i) the number of classifier stages, ii) the number of features in each stage, and iii) the threshold of each stage, are traded off in order to minimize the expected number of evaluated features. Unfortunately, finding this optimum is a tremendously difficult problem.
In practice a very simple framework is used to produce an effective classifier which is highly efficient. Each stage in the cascade reduces the false positive rate and decreases the detection rate. A target is selected for the minimum reduction in false positives and the maximum decrease in detection. Each stage is trained by adding features until the target detection and false positive rates are met (these rates are determined by testing the detector on a validation set). Stages are added until the overall target for false positive and detection rate is met.
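A sketch of this stage-wise loop (ours; train_boosted_stage, lower_threshold, evaluate_on, and bootstrap_negatives are hypothetical helpers standing in for the AdaBoost training, threshold adjustment, validation-set measurement, and false-positive collection steps) is:

```python
def train_cascade(overall_fp_target, stage_fp_goal, stage_det_goal,
                  faces, non_face_images, validation_set):
    """Add boosted stages, each grown one feature at a time, until the
    cascade's overall false positive rate reaches the target."""
    stages, F, D = [], 1.0, 1.0                 # running false positive / detection rates
    negatives = bootstrap_negatives(stages, non_face_images)   # random windows at first
    while F > overall_fp_target:
        n_features = 0
        while True:
            n_features += 1                     # grow this stage one feature at a time
            stage = train_boosted_stage(faces, negatives, n_features)
            lower_threshold(stage, stage_det_goal, validation_set)
            f, d = evaluate_on(stages + [stage], validation_set)
            if f <= F * stage_fp_goal and d >= D * stage_det_goal:
                break                           # the stage meets its per-layer targets
        stages.append(stage)
        F, D = f, d
        # false positives of the partial cascade become the harder negatives
        # used to train the next stage
        negatives = bootstrap_negatives(stages, non_face_images)
    return stages
```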
`
`
`
`
`
`
`Figure 6: Example of frontal upright face images used for
`training.
`
The image properties measured by Fleuret and Geman, disjunctions of fine scale edges, are quite different than rectangle features which are simple, exist at all scales, and are somewhat interpretable. The two approaches also differ radically in their learning philosophy. The motivation for Fleuret and Geman's learning process is density estimation and density discrimination, while our detector is purely discriminative. Finally the false positive rate of Fleuret and Geman's approach appears to be higher than that of previous approaches like Rowley et al. and this approach. Unfortunately the paper does not report quantitative results of this kind. The included example images each have between 2 and 10 false positives.
`
5. Results
`
A 38 layer cascaded classifier was trained to detect frontal upright faces. To train the detector, a set of face and non-face training images were used. The face training set consisted of 4916 hand labeled faces scaled and aligned to a base resolution of 24 by 24 pixels. The faces were extracted from images downloaded during a random crawl of the world wide web. Some typical face examples are shown in Figure 6. The non-face sub-windows used to train the detector come from 9544 images which were manually inspected and found to not contain any faces. There are about 350 million sub-windows within these non-face images.
The number of features in the first five layers of the detector is 1, 10, 25, 25 and 50 features respectively. The remaining layers have increasingly more features. The total number of features in all layers is 6061.
Each classifier in the cascade was trained with the 4916 training faces (plus their vertical mirror images for a total of 9832 training faces) and 10,000 non-face sub-windows (also of size 24 by 24 pixels) using the AdaBoost training procedure. For the initial one feature classifier, the non-face training examples were collected by selecting random sub-windows from a set of 9544 images which did not contain faces. The non-face examples used to train subsequent layers were obtained by scanning the partial cascade across the non-face images and collecting false positives. A maximum of 10,000 such non-face sub-windows were collected for each layer.
`
`Speed of the Final Detector
The speed of the cascaded detector is directly related to the number of features evaluated per scanned sub-window. Evaluated on the MIT+CMU test set [11], an average of 10 features out of a total of 6061 are evaluated per sub-window. This is possible because a large majority of sub-windows are rejected by the first or second layer in the cascade. On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about .067 seconds (using a starting scale of 1.25 and a step size of 1.5 described below). This is roughly 15 times faster than the Rowley-Baluja-Kanade detector [11] and about 600 times faster than the Schneiderman-Kanade detector [13].
`
`Image Processing
All example sub-windows used for training were variance normalized to minimize the effect of different lighting conditions. Normalization is therefore necessary during detection as well. The variance of an image sub-window can be computed quickly using a pair of integral images. Recall that $\sigma^2 = \frac{1}{N}\sum x^2 - m^2$, where $\sigma$ is the standard deviation, $m$ is the mean, $N$ is the number of pixels, and $x$ ranges over the pixel values within the sub-window. The mean of a sub-window can be computed using the integral image. The sum of squared pixels is computed using an integral image of the image squared (i.e. two integral images are used in the scanning process). During scanning the effect of image normalization can be achieved by post-multiplying the feature values rather than pre-multiplying the pixels.
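A sketch of the normalization (ours; rect_sum is the integral-image helper sketched in Section 2.1, and the second integral image is built from the squared pixels):

```python
import numpy as np

def window_std(ii, ii_sq, x, y, w, h):
    """Standard deviation of a w-by-h sub-window from the integral image `ii`
    and the integral image of squared pixels `ii_sq`, via
    sigma^2 = (1/N) * sum(x^2) - m^2."""
    n = float(w * h)
    mean = rect_sum(ii, x, y, w, h) / n
    var = rect_sum(ii_sq, x, y, w, h) / n - mean * mean
    return np.sqrt(max(var, 1e-10))

# During scanning, a feature value can be normalized after the fact by dividing
# it by the window's standard deviation (post-multiplying the feature value
# rather than pre-normalizing every pixel of the sub-window).
```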
`
`Scanning the Detector
`The final detector is scanned across the image at multi-
`ple scales and locations. Scaling is achieved by scaling the
`detector itself, rather than scaling the image. This process
`makes sense because the features can be evaluated at any
`scale with the same cost. Good results were obtained using
`a set of scales a factor of 1.25 apart.
The detector is also scanned across location. Subsequent locations are obtained by shifting the window some number of pixels $\Delta$. This shifting process is affected by the scale of the detector: if the current scale is $s$ the window is shifted by $[s\Delta]$, where $[\cdot]$ is the rounding operation.
The choice of $\Delta$ affects both the speed of the detector as well as accuracy. The results we present are for $\Delta = 1.0$. We can achieve a significant speedup by setting $\Delta = 1.5$ with only a slight decrease in accuracy.
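The scanning scheme can be summarized by the following position generator (a sketch under our naming; the detector evaluation itself is omitted):

```python
def scan_positions(width, height, base=24, scale_step=1.25, delta=1.0):
    """Yield (x, y, window_size) for every location visited by the scan:
    the detector is scaled by factors of scale_step, and at scale s the
    window is shifted by round(s * delta) pixels."""
    s = 1.0
    while int(round(base * s)) <= min(width, height):
        size = int(round(base * s))
        step = max(1, int(round(s * delta)))      # [s * delta] with [] = rounding
        for y in range(0, height - size + 1, step):
            for x in range(0, width - size + 1, step):
                yield x, y, size
        s *= scale_step
```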
`
`
`
`
Table 1: Detection rates for various numbers of false positives on the MIT+CMU test set containing 130 images and 507 faces.

[Table 1 compares our detector (Viola-Jones), Viola-Jones (voting), Rowley-Baluja-Kanade, Schneiderman-Kanade, and Roth-Yang-Ahuja; the individual detection-rate entries are not recoverable from this copy.]
`
`Integration of Multiple Detections
Since the final detector is insensitive to small changes in translation and scale, multiple detections will usually occur around each face in a scanned image. The same is often true of some types of false positives. In practice it often makes sense to return one final detection per face. Toward this end it is useful to postprocess the detected sub-windows in order to combine overlapping detections into a single detection.
In these experiments detections are combined in a very simple fashion. The set of detections are first partitioned into disjoint subsets. Two detections are in the same subset if their bounding regions overlap. Each partition yields a single final detection. The corners of the final bounding region are the average of the corners of all detections in the set.
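In code, this merging step might look like the sketch below (ours, not the original post-processing code); boxes are (x1, y1, x2, y2) corners, overlapping boxes are pooled transitively, and each pool is replaced by the average of its corners:

```python
def merge_detections(boxes):
    """Partition detections into disjoint overlapping groups and return one
    averaged bounding box per group."""
    def overlaps(a, b):
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

    groups = []                                   # each group is a list of boxes
    for box in boxes:
        touching = [g for g in groups if any(overlaps(box, other) for other in g)]
        merged = [box] + [b for g in touching for b in g]
        groups = [g for g in groups if g not in touching] + [merged]

    # average the four corner coordinates within each group
    return [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]
```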
`
`Experiments on a Real-World Test Set
We tested our system on the MIT+CMU frontal face test set [11]. This set consists of 130 images with 507 labeled frontal faces. A ROC curve showing the performance of our detector on this test set is shown in Figure 7. To create the ROC curve the threshold of the final layer classifier is adjusted from $-\infty$ to $+\infty$. Adjusting the threshold to $+\infty$ will yield a detection rate of 0.0 and a false positive rate of 0.0. Adjusting the threshold to $-\infty$, however, increases both the detection rate and false positive rate, but only to a certain point. Neither rate can be higher than the rate of the detection cascade minus the final layer. In effect, a threshold of $-\infty$ is equivalent to removing that layer. Further increasing the detection and false positive rates requires decreasing the threshold of the next classifier in the cascade. Thus, in order to construct a complete ROC curve, classifier layers are removed. We use the number of false positives as opposed to the rate of false positives for the x-axis of the ROC curve to facilitate comparison with other systems. To compute the false positive rate, simply divide by the total number of sub-windows scanned. In our experiments, the number of sub-windows scanned is 75,081,800.
`
Unfortunately, most previous published results on face detection have only included a single operating regime (i.e. a single point on the ROC curve). To make comparison with our detector easier we have listed our detection rate for the same false positive rates reported by the other systems.
`
`
Figure 7: ROC curve for our face detector on the MIT+CMU test set. The detector was run using a step size of 1.0 and starting scale of 1.0 (75,081,800 sub-windows scanned).
`
Table 1 lists the detection rate for various numbers of false detections for our system as well as other published systems. For the Rowley-Baluja-Kanade results [11], a number of different versions of their detector were tested, yielding a number of different results; they are all listed under the same heading. For the Roth-Yang-Ahuja detector [10], they reported their result on the MIT+CMU test set minus 5 images containing line drawn faces.
`Figure 8 shows the output of our face detector on some
`test images from the MIT+CMU test set.
`
`A simple voting scheme to further improve results
In Table 1 we also show results from running three detectors (the 38 layer one described above plus two similarly trained detectors) and outputting the majority vote of the three detectors. This improves the detection rate as well as eliminating more false positives. The improvement would be greater if the detectors were more independent. The correlation of their errors results in a modest improvement over the best single detector.
`
`
`
`
`
`
Figure 8: Output of our face detector on a number of test images from the MIT+CMU test set.
`
6. Conclusions
`
`We have presented an approach for object detection which
`minimizes computation time while achieving high detection
`accuracy. The approach was used to construct a face detec-
`tion system which is approximately 15 times faster than any
`previous approach.
This paper brings together new algorithms, representations, and insights which are quite generic and may well have broader application in computer vision and image processing.
Finally this paper presents a set of detailed experiments on a difficult face detection dataset which has been widely studied. This dataset includes faces under a very wide range of conditions including: illumination, scale, pose, and camera variation. Experiments on such a large and complex dataset are difficult and time consuming. Nevertheless systems which work under these conditions are unlikely to be brittle or limited to a single set of conditions. More importantly conclusions drawn from this dataset are unlikely to be experimental artifacts.
`
`References
`
[1] Y. Amit, D. Geman, and K. Wilder. Joint induction of shape features and tree classifiers. 1997.

[2] F. Crow. Summed-area tables for texture mapping. In Proceedings of SIGGRAPH, volume 18(3), pages 207-212, 1984.

[3] F. Fleuret and D. Geman. Coarse-to-fine face detection. Int. J. Computer Vision, 2001.

[4] William T. Freeman and Edward H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9):891-906, 1991.

[5] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: Eurocolt '95, pages 23-37. Springer-Verlag, 1995.

[6] H. Greenspan, S. Belongie, R. Goodman, P. Perona, S. Rakshit, and C. Anderson. Overcomplete steerable pyramid filters and rotation invariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1994.

[7] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Patt. Anal. Mach. Intell., 20(11):1254-1259, November 1998.

[8] Edgar Osuna, Robert Freund, and Federico Girosi. Training support vector machines: an application to face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1997.

[9] C. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In International Conference on Computer Vision, 1998.

[10] D. Roth, M. Yang, and N. Ahuja. A SNoW-based face detector. In Neural Information Processing 12, 2000.

[11] H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Patt. Anal. Mach. Intell., volume 20, pages 22-38, 1998.

[12] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In Proceedings of the Fourteenth International Conference on Machine Learning, 1997.

[13] H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In International Conference on Computer Vision, 2000.

[14] K. Sung and T. Poggio. Example-based learning for view-based face detection. IEEE Patt. Anal. Mach. Intell., volume 20, pages 39-51, 1998.

[15] K. Tieu and P. Viola. Boosting image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000.

[16] J. K. Tsotsos, S. M. Culhane, W. Y. K. Wai, Y. H. Lai, N. Davis, and F. Nuflo. Modeling visual attention via selective tuning. Artificial Intelligence Journal, 78(1-2):507-545, October 1995.

[17] Andrew Webb. Statistical Pattern Recognition. Oxford University Press, New York, 1999.
`