A Survey of Recent Advances in Face Detection
`Cha Zhang and Zhengyou Zhang
`
`June 2010
`
`Technical Report
`MSR-TR-2010-66
`
`Microsoft Research
`Microsoft Corporation
`One Microsoft Way
`Redmond, WA 98052
`http://www.research.microsoft.com
`
`
`
`
`Abstract
`
`Face detection has been one of the most studied topics
`in the computer vision literature. In this technical report,
`we survey the recent advances in face detection for the past
`decade. The seminal Viola-Jones face detector is first re-
`viewed. We then survey the various techniques according to
`how they extract features and what learning algorithms are
`adopted. It is our hope that by reviewing the many existing
`algorithms, we will see even better algorithms developed to
`solve this fundamental computer vision problem. 1
`
`1. Introduction
`With the rapid increase of computational powers and
`availability of modern sensing, analysis and rendering
`equipment and technologies, computers are becoming more
`and more intelligent. Many research projects and commer-
`cial products have demonstrated the capability for a com-
puter to interact with humans in a natural way by looking at
`people through cameras, listening to people through micro-
`phones, understanding these inputs, and reacting to people
`in a friendly manner.
`One of the fundamental techniques that enables such nat-
`ural human-computer interaction (HCI) is face detection.
Face detection is the stepping stone to all facial analysis algo-
`rithms, including face alignment, face modeling, face re-
`lighting, face recognition, face verification/authentication,
`head pose tracking, facial expression tracking/recognition,
gender/age recognition, and many more. Only when
computers can understand faces well will they begin to truly
`understand people’s thoughts and intentions.
`Given an arbitrary image, the goal of face detection is to
`determine whether or not there are any faces in the image
`and, if present, return the image location and extent of each
face [112]. While this appears to be a trivial task for human
`beings, it is a very challenging task for computers, and has
`been one of the top studied research topics in the past few
`decades. The difficulty associated with face detection can
`be attributed to many variations in scale, location, orienta-
`tion (in-plane rotation), pose (out-of-plane rotation), facial
expression, lighting conditions, occlusions, etc., as seen in
`Fig. 1.
`There have been hundreds of reported approaches to
face detection. Early works (before the year 2000) were
`nicely surveyed in [112] and [30]. For instance, Yang et
`al. [112] grouped the various methods into four categories:
`knowledge-based methods, feature invariant approaches,
`template matching methods, and appearance-based meth-
`
`1This technical report is extracted from an early draft of the book
`“Boosting-Based Face Detection and Adaptation” by Cha Zhang and
`Zhengyou Zhang, Morgan & Claypool Publishers, 2010.
`
`Figure 1. Examples of face images. Note the huge variations in
`pose, facial expression, lighting conditions, etc.
`
`ods. Knowledge-based methods use pre-defined rules to de-
`termine a face based on human knowledge; feature invariant
`approaches aim to find face structure features that are robust
`to pose and lighting variations; template matching methods
`use pre-stored face templates to judge if an image is a face;
`appearance-based methods learn face models from a set of
`representative training face images to perform detection. In
general, appearance-based methods have shown superior
performance to the others, thanks to rapidly growing
computation power and data storage.
`The field of face detection has made significant progress
`in the past decade. In particular, the seminal work by Viola
`and Jones [92] has made face detection practically feasible
`in real world applications such as digital cameras and photo
`organization software.
`In this report, we present a brief
`survey on the latest development in face detection tech-
`niques since the publication of [112]. More attention will
`be given to boosting-based face detection schemes, which
`have evolved as the de-facto standard of face detection in
`real-world applications since [92].
`The rest of the paper is organized as follows. Section 2
`gives an overview of the Viola-Jones face detector, which
`also motivates many of the recent advances in face detec-
`tion. Solutions to two key issues for face detection: what
`features to extract, and which learning algorithm to apply,
`will be surveyed in Section 3 (feature extraction), Section 4
`(boosting learning algorithms) and Section 5 (other learn-
`ing algorithms). Conclusions and future work are given in
`Section 6.
`
`2. The Viola-Jones Face Detector
`
If one were asked to name a single face detection algo-
rithm that has had the most impact in the 2000s, it would most
likely be the seminal work by Viola and Jones [92]. The
`Viola-Jones face detector contains three main ideas that
`make it possible to build a successful face detector that can
`run in real time: the integral image, classifier learning with
`AdaBoost, and the attentional cascade structure.
`
`0002
`
`
`
and [19], we briefly present a generalized version of the Ad-
aBoost algorithm, usually referred to as RealBoost. It has been
`advocated in various works [46, 6, 101, 62] that RealBoost
`yields better performance than the original AdaBoost algo-
`rithm.
`Consider a set of training examples as S = {(xi, zi), i =
`1,··· , N}, where xi belongs to a domain or instance space
`X , and zi belongs to a finite label space Z. In binary classi-
`fication problems, Z = {1,−1}, where zi = 1 for positive
`examples and zi = −1 for negative examples. AdaBoost
produces an additive model F^T(x) = \sum_{t=1}^{T} f_t(x) to pre-
dict the label of an input example x, where F^T(x) is a real-
valued function of the form F^T: X → R. The predicted
label is \hat{z}_i = sign(F^T(x_i)), where sign(·) is the sign func-
`tion. From the statistical view of boosting [19], AdaBoost
`algorithm fits an additive logistic regression model by us-
`ing adaptive Newton updates for minimizing the expected
`exponential criterion:
`
L^T = \sum_{i=1}^{N} \exp\{-z_i F^T(x_i)\}.   (3)

The AdaBoost learning algorithm can be viewed as
finding the best additive base function f_{t+1}(x) once F^t(x)
`is given. For this purpose, we assume the base function pool
`{f(x)} is in the form of confidence rated decision stumps.
`That is, a certain form of real feature value h(x) is first ex-
`tracted from x, h : X → R. For instance, in the Viola-Jones
face detector, h(x) is a Haar-like feature computed with the
integral image, as shown in Fig. 2 (a-f). A decision
threshold H divides the output of h(x) into two subregions,
u_1 and u_2, with u_1 ∪ u_2 = R. The base function f(x) is thus:

f(x) = c_j, if h(x) \in u_j, j = 1, 2,   (4)

which is often referred to as the stump classifier; c_j is called
the confidence. The optimal values of the confidences
can be derived as follows. For j = 1, 2 and k = +1, -1, let

W_{kj} = \sum_{i: z_i = k, f(x_i) \in u_j} \exp\{-k F^t(x_i)\}.   (5)

The target criterion can thus be written as:

L^{t+1} = \sum_{j=1}^{2} \left[ W_{+1j} e^{-c_j} + W_{-1j} e^{c_j} \right].   (6)

Using standard calculus, we see L^{t+1} is minimized when

c_j = \frac{1}{2} \ln\left( \frac{W_{+1j}}{W_{-1j}} \right).   (7)

Plugging into (6), we have:

L^{t+1} = 2 \sum_{j=1}^{2} \sqrt{W_{+1j} W_{-1j}}.   (8)
`
`Figure 2. Illustration of the integral image and Haar-like rectangle
`features (a-f).
`
`2.1. The Integral Image
The integral image, also known as the summed area table, is
a technique for efficiently computing the sum
of values in a rectangular subset of a grid. It was first intro-
`duced to the computer graphics field by Crow [12] for use
`in mipmaps. Viola and Jones applied the integral image for
`rapid computation of Haar-like features, as detailed below.
`The integral image is constructed as follows:
ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y'),   (1)
`
`where ii(x, y) is the integral image at pixel location (x, y)
`and i(x(cid:48), y(cid:48)) is the original image. Using the integral image
`to compute the sum of any rectangular area is extremely
`efficient, as shown in Fig. 2. The sum of pixels in rectangle
`region ABCD can be calculated as:
\sum_{(x,y) \in ABCD} i(x, y) = ii(D) + ii(A) - ii(B) - ii(C),   (2)
`
`which only requires four array references.
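As an illustration, Eqs. (1) and (2) can be written in a few lines of Python (a minimal sketch; the zero padding and function names are our own, not from the report):

```python
import numpy as np

def integral_image(img):
    """Build ii per Eq. (1). A zero row and column are prepended so that
    rectangle sums need no boundary checks (an implementation detail,
    not part of Eq. (1))."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixels inside a rectangle using four array references,
    mirroring Eq. (2): ii(D) + ii(A) - ii(B) - ii(C)."""
    return (ii[top + height, left + width] + ii[top, left]
            - ii[top, left + width] - ii[top + height, left])
```

Padding the table with a zero row and column lets Eq. (2) be applied at the image border without special cases.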
`The integral image can be used to compute simple Haar-
`like rectangular features, as shown in Fig. 2 (a-f). The fea-
`tures are defined as the (weighted) intensity difference be-
`tween two to four rectangles. For instance, in feature (a),
`the feature value is the difference in average pixel value in
`the gray and white rectangles. Since the rectangles share
corners, the computation of the two-rectangle features (a and b)
requires six array references, the three-rectangle features (c
and d) require eight array references, and the four-rectangle
features (e and f) require nine array references.
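A two-rectangle feature such as Fig. 2 (a) then reduces to six references into the padded table (again a sketch; in practice the integral image would be computed once per frame rather than per feature):

```python
import numpy as np

def two_rect_feature(img, top, left, h, w):
    """Feature (a) of Fig. 2 sketched directly: the difference between the
    sums of an upper and a lower h-by-w rectangle. With the integral image
    zero-padded, the six corner references A..F below suffice."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    A, B = ii[top, left], ii[top, left + w]
    C, D = ii[top + h, left], ii[top + h, left + w]
    E, F = ii[top + 2 * h, left], ii[top + 2 * h, left + w]
    upper = D + A - B - C   # sum of the upper rectangle
    lower = F + C - D - E   # sum of the lower rectangle (reuses C and D)
    return upper - lower
```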
`
`2.2. AdaBoost Learning
`Boosting is a method of finding a highly accurate hy-
`pothesis by combining many “weak” hypotheses, each with
`moderate accuracy. For an introduction on boosting, we re-
`fer the readers to [59] and [19].
The AdaBoost (Adaptive Boosting) algorithm is gen-
erally considered the first step towards more practical
`boosting algorithms [17, 18]. In this section, following [80]
`
`
`
`
`Input
`• Training examples S = {(xi, zi), i = 1,··· , N}.
`• T is the total number of weak classifiers to be trained.
`Initialize
• Initialize example score F^0(x_i) = \frac{1}{2} \ln\left(\frac{N_+}{N_-}\right),
where N_+ and N_- are the numbers of positive and
negative examples in the training data set.
`
AdaBoost Learning
For t = 1, · · · , T :
1. For each Haar-like feature h(x) in the pool, find the
optimal threshold H and confidence scores c_1 and c_2
to minimize the Z score L^t (8).
2. Select the best feature with the minimum L^t.
3. Update F^t(x_i) = F^{t-1}(x_i) + f_t(x_i), i = 1, · · · , N.
4. Update W_{+1j}, W_{-1j}, j = 1, 2.
Output Final classifier F^T(x).
Figure 3. AdaBoost learning pseudocode.
`
`Figure 4. The attentional cascade.
`
Eq. (8) is referred to as the Z score in [80]. In practice, at
iteration t + 1, for every Haar-like feature h(x), we find the
optimal threshold H and confidence scores c_1 and c_2 in order
to minimize the Z score L^{t+1}. A simple pseudocode of the
AdaBoost algorithm is shown in Fig. 3.
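The loop in Fig. 3 can be sketched as follows, assuming the Haar-like feature responses have been pre-computed into a matrix; the brute-force threshold search and the smoothing constant `eps` are our simplifications, not part of [80]:

```python
import numpy as np

def train_realboost_stumps(X, z, T):
    """RealBoost with confidence-rated stumps, following Fig. 3.
    X: (N, d) array of feature responses h(x) for a small feature pool;
    z: labels in {+1, -1}; T: number of boosting rounds."""
    N, d = X.shape
    n_pos, n_neg = np.sum(z == 1), np.sum(z == -1)
    F = np.full(N, 0.5 * np.log(n_pos / n_neg))   # initial score, Fig. 3
    eps = 1e-9                                     # smoothing (our choice)
    classifiers = []
    for _ in range(T):
        w = np.exp(-z * F)                         # example weights
        best = None
        for j in range(d):                         # each feature in the pool
            for thr in np.unique(X[:, j]):         # candidate thresholds H
                in_u2 = X[:, j] >= thr
                # W_{kj} of Eq. (5) for the two sub-regions u1, u2
                Wp = [w[(z == 1) & ~in_u2].sum(), w[(z == 1) & in_u2].sum()]
                Wn = [w[(z == -1) & ~in_u2].sum(), w[(z == -1) & in_u2].sum()]
                Z = 2 * sum(np.sqrt(Wp[k] * Wn[k]) for k in (0, 1))  # Eq. (8)
                if best is None or Z < best[0]:
                    c = [0.5 * np.log((Wp[k] + eps) / (Wn[k] + eps))
                         for k in (0, 1)]                            # Eq. (7)
                    best = (Z, j, thr, c)
        _, j, thr, c = best
        F += np.where(X[:, j] >= thr, c[1], c[0])  # step 3 of Fig. 3
        classifiers.append((j, thr, c))
    return classifiers, F
```

A production detector would restrict the threshold search and cache sorted feature responses; the nested loops here simply make the correspondence to Eqs. (5)-(8) explicit.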
`
`2.3. The Attentional Cascade Structure
The attentional cascade is a critical component of the Viola-
Jones detector. The key insight is that smaller, and thus
`more efficient, boosted classifiers can be built which reject
`most of the negative sub-windows while keeping almost all
the positive examples. Consequently, the majority of the sub-
windows will be rejected in early stages of the detector,
`making the detection process extremely efficient.
`The overall process of classifying a sub-window thus
`forms a degenerate decision tree, which was called a “cas-
`cade” in [92]. As shown in Fig. 4, the input sub-windows
`pass a series of nodes during detection. Each node will
`make a binary decision whether the window will be kept
`for the next round or rejected immediately. The number of
`weak classifiers in the nodes usually increases as the num-
`ber of nodes a sub-window passes. For instance, in [92], the
`first five nodes contain 1, 10, 25, 25, 50 weak classifiers, re-
`
spectively. This is intuitive, since each node is trying to
reject a certain number of negative windows while keeping
all the positive examples, and the task becomes harder at
later stages. Having fewer weak classifiers at early stages
`also improves the speed of the detector.
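The rejection logic of the cascade is simple enough to sketch in a few lines (illustrative only; `score_fn` stands in for a boosted node classifier):

```python
def cascade_classify(window, nodes):
    """Minimal sketch of the attentional cascade of Fig. 4. Each node is a
    (score_fn, threshold) pair standing in for a boosted classifier; the
    names are illustrative, not from [92]."""
    for score_fn, threshold in nodes:
        if score_fn(window) < threshold:
            return False          # rejected immediately at this node
    return True                   # survived all nodes: face candidate
```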
`The cascade structure also has an impact on the training
`process. Face detection is a rare event detection task. Con-
sequently, billions of negative examples are usually needed
in order to train a high-performance face detector.
`To handle the huge amount of negative training examples,
`Viola and Jones [92] used a bootstrap process. That is, at
`each node, a threshold was manually chosen, and the par-
`tial classifier was used to scan the negative example set to
`find more unrejected negative examples for the training of
`the next node. Furthermore, each node is trained indepen-
dently, as if the previous nodes did not exist. One argu-
`ment behind such a process is to force the addition of some
`nonlinearity in the training process, which could improve
`the overall performance. However, recent works showed
`that it is actually beneficial not to completely separate the
`training process of different nodes, as will be discussed in
`Section 4.
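The bootstrap step described above might be sketched as follows (hypothetical helper names; the report does not give code):

```python
def bootstrap_hard_negatives(partial_cascade, negative_pool, n_needed):
    """Sketch of the bootstrap process (illustrative, not the authors'
    code): scan a pool of known non-faces with the cascade built so far,
    and keep the windows it wrongly accepts as training negatives for
    the next node."""
    hard = []
    for window in negative_pool:
        if partial_cascade(window):      # false positive -> hard negative
            hard.append(window)
            if len(hard) == n_needed:
                break
    return hard
```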
`In [92], the attentional cascade is constructed manually.
`That is, the number of weak classifiers and the decision
`threshold for early rejection at each node are both specified
`manually. This is a non-trivial task. If the decision thresh-
`olds were set too aggressively, the final detector will be
`very fast, but the overall detection rate may be hurt. On the
`other hand, if the decision thresholds were set very conser-
`vatively, most sub-windows will need to pass through many
`nodes, making the detector very slow. Combined with the
limited computational resources available in the early 2000s,
it is no wonder that training a good face detector could take
months of fine-tuning.
`
`3. Feature Extraction
`As mentioned earlier, thanks to the rapid expansion in
`storage and computation resources, appearance based meth-
`ods have dominated the recent advances in face detection.
`The general practice is to collect a large set of face and non-
`face examples, and adopt certain machine learning algo-
`rithms to learn a face model to perform classification. There
`are two key issues in this process: what features to extract,
`and which learning algorithm to apply. In this section, we
`first review the recent advances in feature extraction.
`The Haar-like rectangular features as in Fig. 2 (a-f) are
`very efficient to compute due to the integral image tech-
`nique, and provide good performance for building frontal
`face detectors. In a number of follow-up works, researchers
`extended the straightforward features with more variations
`in the ways rectangle features are combined.
For instance, as shown in Fig. 5, Lienhart and Maydt [49]
`generalized the feature set of [92] by introducing 45 degree
`
`
`Figure 5. The rotated integral image/summed area table.
`
`Figure 6. (a) Rectangular features with flexible sizes and distances
`introduced in [46]. (b) Diagonal filters in [38].
`
`rotated rectangular features (a-d), and center-surround fea-
`tures (e-f). In order to compute the 45 degree rotated rect-
`angular features, a new rotated summed area table was in-
`troduced as:
`
rii(x, y) = \sum_{x' \le x,\, |y - y'| \le x - x'} i(x', y').   (9)
`
`As seen in Fig. 5, rii(A) is essentially the sum of pixel in-
`tensities in the shaded area. The rotated summed area table
`can be calculated with two passes over all pixels.
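For reference, Eq. (9) can be evaluated directly, if inefficiently, as below (the O(1)-per-pixel two-pass recurrence of [49] is omitted; the row/column index convention is our assumption):

```python
import numpy as np

def rotated_integral_image(img):
    """Brute-force evaluation of Eq. (9), with x as the column index and
    y as the row index: each entry sums the pixels inside a 45-degree
    wedge opening to the left of (x, y)."""
    h, w = img.shape
    rii = np.zeros((h, w), dtype=img.dtype)
    for y in range(h):
        for x in range(w):
            for yp in range(h):
                for xp in range(x + 1):
                    if abs(y - yp) <= x - xp:  # inside the wedge
                        rii[y, x] += img[yp, xp]
    return rii
```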
`A number of researchers noted the limitation of the orig-
`inal Haar-like feature set in [92] for multi-view face detec-
`tion, and proposed to extend the feature set by allowing
`more flexible combination of rectangular regions. For in-
`stance, in [46], three types of features were defined in the
`detection sub-window, as shown in Fig. 6 (a). The rectan-
`gles are of flexible sizes x × y and they are at certain dis-
`tances of (dx, dy) apart. The authors argued that these fea-
`tures can be non-symmetrical to cater to non-symmetrical
`characteristics of non-frontal faces. Jones and Viola [38]
`also proposed a similar feature called diagonal filters, as
`shown in Fig. 6 (b). These diagonal filters can be computed
`with 16 array references to the integral image.
`Jones et al.
`[39] further extended the Haar-like fea-
`ture set to work on motion filtered images for video-based
`
`Figure 7. The joint Haar-like feature introduced in [62].
`
`pedestrian detection. Let the previous and current video
`frames be it−1 and it. Five motion filters are defined as:
Δ = |i_t − i_{t−1}|
U = |i_t − i_{t−1}↑|
L = |i_t − i_{t−1}←|
R = |i_t − i_{t−1}→|
D = |i_t − i_{t−1}↓|

where {↑, ←, →, ↓} are image shift operators; i_t↑ is i_t
shifted up by one pixel. In addition to the regular rectan-
`gular features (Fig. 2) on these additional motion filtered
`images, Jones et al. added single box rectangular sum fea-
`tures, and new features across two images. For instance:
f_i = r_i(Δ) − r_i(S),   (10)

where S ∈ {U, L, R, D} and r_i(·) is a single box rectangu-
lar sum within the detection window.
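The five filters can be sketched with simple array shifts (`np.roll` wraps at the borders, which is our assumption; the report does not specify border handling):

```python
import numpy as np

def motion_filters(curr, prev):
    """The five motion filters above: the frame difference Delta plus the
    differences against the previous frame shifted one pixel in each of
    the four directions."""
    shifted = {
        "U": np.roll(prev, -1, axis=0),   # prev shifted up by one pixel
        "D": np.roll(prev, 1, axis=0),
        "L": np.roll(prev, -1, axis=1),
        "R": np.roll(prev, 1, axis=1),
    }
    out = {"delta": np.abs(curr - prev)}
    out.update({k: np.abs(curr - v) for k, v in shifted.items()})
    return out
```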
`One must be careful that the construction of the motion
`filtered images {U, L, R, D} is not scale invariant. That is,
`when detecting pedestrians at different scales, these filtered
`images need to be recomputed. This can be done by first
`constructing a pyramid of images for it at different scales
`and computing the filtered images at each level of the pyra-
`mid, as was done in [39].
Mita et al. [62] proposed joint Haar-like features, which
are based on the co-occurrence of multiple Haar-like features.
`The authors claimed that feature co-occurrence can better
`capture the characteristics of human faces, making it pos-
`sible to construct a more powerful classifier. As shown
in Fig. 7, the joint Haar-like feature uses a similar feature
computation and thresholding scheme; however, only the
binary outputs of the Haar-like features are concatenated
into an index covering 2^F possible combinations, where F is the
`co-occurrences with limited computational complexity, the
`suboptimal sequential forward selection scheme was used in
`[62]. The number F was also heuristically limited to avoid
`statistical unreliability.
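The indexing scheme is a straightforward bit concatenation (a sketch; the bit ordering is our choice, [62] does not prescribe one):

```python
def joint_feature_index(feature_values, thresholds):
    """Index of a joint Haar-like feature: the binary outputs of the F
    thresholded features are concatenated into one of 2**F bins."""
    index = 0
    for value, threshold in zip(feature_values, thresholds):
        index = (index << 1) | int(value >= threshold)
    return index
```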
`To some degree, the above joint Haar-like features re-
`semble a CART tree, which was explored in [8].
`It was
`shown that CART tree based weak classifiers improved re-
`sults across various boosting algorithms with a small loss
`
`
`
`
`in speed. In another variation for improving the weak clas-
`sifier, [101] proposed to use a single Haar-like feature, and
`equally bin the feature values into a histogram to be used
`in a RealBoost learning algorithm. Similar to the number
`F in the joint Haar-like features, the number of bins for the
`histogram is vital to the performance of the final detector.
[101] proposed to use 64 bins. In their later work [32],
they specifically pointed out that too fine a granularity of the
histogram may cause overfitting, and suggested using fine
granularity in the first few layers of the cascade, and coarse
granularity in later layers. Another interesting recent work
`is [107], where the authors proposed a new weak classifier
called the Bayesian stump. The Bayesian stump is also a histogram-
based weak classifier; however, its split thresholds
are derived from iterative split and merge
operations instead of being fixed at equal distances. Ex-
`perimental results showed that such a flexible multi-split
`thresholding scheme is effective in improving the detector’s
`performance.
`Another limitation of the original Haar-like feature set is
`its lack of robustness in handling faces under extreme light-
ing conditions, even though the Haar features are usually
normalized by the test window's intensity variance [92].
`In [21] a modified census transform was adopted to gener-
`ate illumination-insensitive features for face detection. On
`each pixel’s 3×3 neighborhood, the authors applied a mod-
`ified census transform that compares the neighborhood pix-
`els with their intensity mean. The results are concatenated
`into an index number representing the pixel’s local struc-
`ture. During boosting, the weak classifiers are constructed
`by examining the distributions of the index numbers for the
`pixels. Another well-known feature set robust to illumina-
`tion variations is the local binary patterns (LBP) [65], which
`have been very effective for face recognition tasks [2, 117].
`In [37, 119], LBP was applied for face detection tasks under
`a Bayesian and a boosting framework, respectively. More
`recently, inspired by LBP, Yan et al. [110] proposed locally
`assembled binary feature, which showed great performance
`on standard face detection data sets.
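For reference, the basic 8-bit LBP code of [65] on a 3×3 neighborhood can be sketched as (the neighbor ordering is our choice; conventions vary):

```python
def lbp_code(patch):
    """8-bit local binary pattern of the center pixel of a 3x3 patch:
    each neighbor contributes one bit, set when it is >= the center."""
    center = patch[1][1]
    # clockwise from the top-left neighbor (an arbitrary convention)
    neighbors = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                 patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    code = 0
    for bit, value in enumerate(neighbors):
        code |= int(value >= center) << bit
    return code
```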
`To explore possibilities to further improve performance,
`more and more complex features were proposed in the lit-
erature. For instance, Liu and Shum [52] studied generic
linear features, defined by a mapping function
φ(): R^d → R^1, where d is the size of the test patch. For
linear features, φ(x) = φ^T x, φ ∈ R^d. The classification
function is of the following form:
`
F^T(x) = sign\left[ \sum_{t=1}^{T} \lambda_t(\phi_t^T x) \right],   (11)

where λ_t(): R → R are discriminating functions, such as
the conventional stump classifiers in AdaBoost. F^T(x)
shall be 1 for positive examples and −1 for negative exam-
ples. Note that the Haar-like feature set is a subset of linear fea-
tures. Another example is the anisotropic Gaussian filters
in [60]. In [10], the linear features were constructed by pre-
learning them using local non-negative matrix factorization
(LNMF), which is still sub-optimal. Instead, Liu and Shum
[52] proposed to search for the linear features by examining
the Kullback-Leibler (KL) divergence of the positive and
negative histograms projected on the feature during boost-
ing (hence the name Kullback-Leibler boosting). In [97],
the authors proposed to apply Fisher discriminant analysis
and, more generally, recursive nonparametric discriminant
analysis (RNDA) to find the linear projections φ_t. Linear
projection features are very powerful; the selected
features shown in [52] and [97] looked like face templates.
They may significantly improve the convergence speed of
the boosting classifier at early stages. However, caution
must be taken to avoid overfitting if these features are to
be used at the later stages of learning. In addition, the com-
putational load of linear features is generally much higher
than that of the traditional Haar-like features. Conversely, Baluja
et al. [4] proposed to use simple pixel pairs as features, and
Abramson and Steux [1] proposed to use the relative values
of a set of control points as features. Such pixel-based fea-
tures can be computed even faster than the Haar-like features;
however, their discrimination power is generally insufficient
for building high performance detectors.

Another popular class of complex features for face/object detec-
tion is based on regional statistics such as histograms. Levi
and Weiss [45] proposed local edge orientation histograms,
which compute the histogram of edge orientations in sub-
regions of the test windows. These features are then se-
lected by an AdaBoost algorithm to build the detector. The
orientation histogram is largely invariant to global illumina-
tion changes, and it is capable of capturing geometric prop-
erties of faces that are difficult to capture with linear edge
filters such as Haar-like features. However, similar to mo-
tion filters, edge based histogram features are not scale in-
variant, hence one must first scale the test images to form a
pyramid to make the local edge orientation histogram fea-
tures reliable. Later, Dalal and Triggs [13] proposed a sim-
ilar scheme called the histogram of oriented gradients (HoG),
which became a very popular feature for human/pedestrian
detection [120, 25, 88, 43, 15]. In [99], the authors pro-
posed spectral histogram features, which adopt a broader
set of filters before collecting the histogram features, in-
cluding gradient filters, Laplacian of Gaussian filters and
Gabor filters. Compared with [45], the histogram features
in [99] were based on the whole testing window rather than
local regions, and support vector machines (SVMs) were
used for classification. Zhang et al. [118] proposed another
histogram-based feature called spatial histograms, which is
based on local statistics of LBP. HoG and LBP were also
combined in [98], which achieved excellent performance on
human detection with partial occlusion handling. Region
`
`0006
`
`
`
covariance was another statistics-based feature, proposed
by Tuzel et al. [91] for generic object detection and tex-
ture classification tasks. Instead of using histograms, they
compute the covariance matrices among the color channels
and gradient images. Regional covariance features can also
be efficiently computed using integral images.

Huang et al. [33] proposed a sparse feature set in or-
der to strengthen the features' discrimination power with-
out incurring too much additional computational cost. Each
sparse feature can be represented as:

f(x) = \sum_i \alpha_i p_i(x; u, v, s),  \alpha_i \in \{-1, +1\},   (12)

where x is an image patch, and p_i is a granule of the sparse
feature. A granule is specified by 3 parameters: horizon-
tal offset u, vertical offset v and scale s. For instance, as
shown in Fig. 8, p_i(x; 5, 3, 2) is a granule with top-left cor-
ner (5, 3) and scale 2^2 = 4, and p_i(x; 9, 13, 3) is a granule
with top-left corner (9, 13) and scale 2^3 = 8. Granules can
be computed efficiently using pre-constructed image pyra-
mids, or through the integral image. In [33], the maximum
number of granules in a single sparse feature is 8. Since the
total number of granules is large, the search space is very
large and exhaustive search is infeasible. The authors pro-
posed a heuristic search scheme, where granules are added
to a sparse feature one-by-one, with an expansion opera-
tor that removes, refines and adds granules to a partially
selected sparse feature. To reduce the computation, the au-
thors further conducted a multi-scaled search, which uses a
small set of training examples to evaluate all features first
and rejects those that are unlikely to be good. The perfor-
mance of the multi-view face detector trained in [33] using
sparse features was very good.

Figure 8. The sparse feature set in granular space introduced in
[33].

As new features are composed in seeking the best dis-
crimination power, the feature pool becomes larger and
larger, which creates new challenges in the feature selec-
tion process. A number of recent works have attempted to
address this issue. For instance, [113] proposed to discover
compositional features using the classic frequent item-set
mining scheme in data mining. Instead of using the raw
feature values, they assume a collection of induced binary
features (e.g., decision stumps with known thresholds) is
already available. By partitioning the feature space into sub-
regions through these binary features, the training examples
can be indexed by the sub-regions in which they are located. The
algorithm then searches for a small subset of compositional
features that are frequent enough to have statistical significance
and accurate enough to be useful for label prediction. The final clas-
sifier is then learned based on the selected subset of compo-
sitional features through AdaBoost. In [26], the authors first
established an analogue between compositional feature se-
lection and generative image segmentation, and applied the
Swendsen-Wang Cut algorithm to generate n-partitions for
the individual feature set, where each subset of the parti-
tion corresponds to a compositional feature. This algorithm
re-runs for every weak classifier selected by the AdaBoost
learning framework. On the person detection task tested, the
composite features showed significant improvement, espe-
cially when the individual features were very weak (e.g.,
Haar-like features).

In some applications such as object tracking, even if the
number of possible features is not extensive, an exhaus-
tive feature selection is still impractical due to computa-
tional constraints. In [53], the authors proposed a gradient
based feature selection scheme for online boosting with pri-
mary applications in person detection and tracking. Their
work iteratively updates each feature using a gradient de-
scent algorithm, by minimizing the weighted least square
error between the estimated feature response and the true
label. This is particularly attractive for tracking and up-
dating schemes such as [25], where at any time instance,
the object's appearance is already represented by a boosted
classifier learned from previous frames. Assuming there is
no dramatic change in appearance, the gradient descent
based algorithm can refine the features in a very efficient
manner.

There have also been many features that attempt to
model the shape of objects. For instance, Opelt et al.
[66] composed multiple boundary fragments into weak clas-
sifiers and formed a strong "boundary-fragment-model" de-
tector using boosting. They ensure the feasibility of the fea-
ture selection process by limiting the number of boundary
fragments to 2-3 for each weak classifier. Shotton et al. [86]
learned their object detectors with a boosting algorithm and
a feature set consisting of a randomly chosen dictionary
of contour fragments. A very similar edgelet feature was
proposed in [102], and was used to learn human body part
detectors in order to handle multiple, partially occluded hu-
mans. In [79], shapelet features focusing on local regions
of the image were built from low-level gradient informa-
tion using AdaBoost for pedestrian detection. An interest-
ing side benefit of having contour/edgelet features is that
object detection and object segmentation can be performed
`
`0007
`
`
`
Table 1. Features for face/object detection.

Feature Type               | Representative Works
---------------------------|--------------------------------------------------
Haar-like features and     | Haar-like features [92]
its variations             | Rotated Haar-like features [49]
                           | Rectangular features with structure [46, 38]
                           | Haar-like features on motion filtered image [39]
Pixel-based features       | Pixel pairs [4]
                           | Control point set [1]
Binarized features         | Modified census transform [21]
                           | LBP features [37, 119]
                           | Locally assembled binary feature [110]
Generic linear features    | Anisotropic Gaussian filters [60]
                           | LNMF [10]
                           | Generic linear features with KL boosting [52]
                           | RNDA [97]
Statistics-based features  | Edge orientation histograms [45, 13] etc.
                           | Spectral histogram [99]
                           | Spatial histogram (LBP-based) [118]
                           | HoG and LBP [98]
                           | Region covariance [91]
Composite features         | Joint Haar-like features [62]
                           | Sparse feature set [33]
Shape features             | Boundary/contour fragments [66, 86]
                           | Edgelet [102]
                           | Shapelet [79]
`
`jointly, such as the work in [104] and [23].
We summarize the features presented in this section in
Table 1.
`
`4. Variations of the Boosting Learning Algo-
`rithm
In addition to exploring better features, another avenue
to improve the detector's performance is through improv-
`ing the boosting learning algorithm, particularly under the
`cascade decision structure.
`In the original face detection
`paper by Viola and Jones [92], the standard AdaBoost algo-
`rithm [17] was adopted. In a number of follow-up works
`[46, 6, 101, 62], researchers advocated the use of Real-
`Boost, which was explained in detail in Section 2.2. Both
`Lienhart et al.
`[48] and Brubaker et al.
`[8] compared
`three boosting algorithms: AdaBoost, RealBoost and Gen-
tleBoost, though they reached different conclusions: the for-
mer recommended GentleBoost, while the latter showed Re-
alBoost works slightly better when combined with CART-
`based weak classifiers. In the following, we describe a num-
`
`ber of recent works on boosting learning for face/object de-
`tection, with emphasis on adapting to the cascade structure,
`the training speed, multi-view face detection, etc.
`In [46],
`the authors proposed FloatBoost, which at-
`tempted to overcome the monotonicity problem of the se-
`quential AdaBoost Learning. Specifically, AdaBoost is a se-
`quential forward search procedure using a greedy selection
`strategy, which may be suboptimal. FloatBoost incorpo-
`rates the idea of floating search [73] into AdaBoost, which
not only adds features during training, but also backtracks
to examine the already selected features and remove those
that are least significant. The authors claimed that Float-
`Boost usually needs fewer weak classifiers than AdaBoost
`to achieve a given objective. Jang and Kim [36] proposed
to use evolutionary algorithms to minimize the number of
`classifiers without degrading the detection accuracy. They
`showed that such an algorithm can reduce the total number
`of weak classifiers by over 40%. Note in practice only the
`first few nodes are critical to the detection speed, since most
`testing windows are rejected by the first few weak classifiers
`in a cascade architecture.
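The backward step of such a floating search might be sketched as follows (illustrative only; `objective` stands in for the training criterion of [46]):

```python
def float_backtrack(selected, objective):
    """Sketch of FloatBoost's backward step: after a feature is added,
    repeatedly remove the already-selected feature whose removal most
    lowers the objective, as long as such a feature exists."""
    improved = True
    while improved and len(selected) > 1:
        improved = False
        best_drop, best_val = None, objective(selected)
        for i in range(len(selected)):
            rest = selected[:i] + selected[i + 1:]   # drop feature i
            val = objective(rest)
            if val < best_val:
                best_drop, best_val = i, val
        if best_drop is not None:
            selected.pop(best_drop)
            improved = True
    return selected
```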
`As mentioned in Section 2.3, Viola and Jones [92]
`trained each node independently. A number of follow-up
`works showed that there is indeed information in the results
`from the previous nodes, and it is best to reuse them instead
`of starting from scratch at each new node. For instance, in
`[108], the authors proposed to use a “cha