A Survey of Recent Advances in Face Detection

Cha Zhang and Zhengyou Zhang

June 2010

Technical Report
MSR-TR-2010-66

Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
http://www.research.microsoft.com
Abstract

Face detection has been one of the most studied topics in the computer vision literature. In this technical report, we survey the recent advances in face detection over the past decade. The seminal Viola-Jones face detector is first reviewed. We then survey the various techniques according to how they extract features and what learning algorithms are adopted. It is our hope that by reviewing the many existing algorithms, we will see even better algorithms developed to solve this fundamental computer vision problem.¹
1. Introduction
With the rapid increase of computational power and the availability of modern sensing, analysis and rendering equipment and technologies, computers are becoming more and more intelligent. Many research projects and commercial products have demonstrated the capability for a computer to interact with humans in a natural way by looking at people through cameras, listening to people through microphones, understanding these inputs, and reacting to people in a friendly manner.
One of the fundamental techniques that enables such natural human-computer interaction (HCI) is face detection. Face detection is the stepping stone to all facial analysis algorithms, including face alignment, face modeling, face relighting, face recognition, face verification/authentication, head pose tracking, facial expression tracking/recognition, gender/age recognition, and many more. Only when computers can understand faces well will they begin to truly understand people's thoughts and intentions.
Given an arbitrary image, the goal of face detection is to determine whether or not there are any faces in the image and, if present, return the image location and extent of each face [112]. While this appears to be a trivial task for human beings, it is a very challenging task for computers, and has been one of the most studied research topics of the past few decades. The difficulty associated with face detection can be attributed to the many variations in scale, location, orientation (in-plane rotation), pose (out-of-plane rotation), facial expression, lighting conditions, occlusion, etc., as seen in Fig. 1.

Figure 1. Examples of face images. Note the huge variations in pose, facial expression, lighting conditions, etc.

There have been hundreds of reported approaches to face detection. Early works (before the year 2000) were nicely surveyed in [112] and [30]. For instance, Yang et al. [112] grouped the various methods into four categories: knowledge-based methods, feature invariant approaches, template matching methods, and appearance-based methods. Knowledge-based methods use pre-defined rules to determine a face based on human knowledge; feature invariant approaches aim to find facial structure features that are robust to pose and lighting variations; template matching methods use pre-stored face templates to judge if an image is a face; appearance-based methods learn face models from a set of representative training face images to perform detection. In general, appearance-based methods have shown superior performance to the others, thanks to rapidly growing computation power and data storage.
The field of face detection has made significant progress in the past decade. In particular, the seminal work by Viola and Jones [92] has made face detection practically feasible in real-world applications such as digital cameras and photo organization software. In this report, we present a brief survey of the latest developments in face detection techniques since the publication of [112]. More attention will be given to boosting-based face detection schemes, which have evolved as the de facto standard of face detection in real-world applications since [92].
The rest of the paper is organized as follows. Section 2 gives an overview of the Viola-Jones face detector, which also motivates many of the recent advances in face detection. Solutions to the two key issues for face detection, what features to extract and which learning algorithm to apply, are surveyed in Section 3 (feature extraction), Section 4 (boosting learning algorithms) and Section 5 (other learning algorithms). Conclusions and future work are given in Section 6.

¹ This technical report is extracted from an early draft of the book "Boosting-Based Face Detection and Adaptation" by Cha Zhang and Zhengyou Zhang, Morgan & Claypool Publishers, 2010.
2. The Viola-Jones Face Detector

If one were asked to name the single face detection algorithm that has had the most impact in the 2000s, it would most likely be the seminal work by Viola and Jones [92]. The Viola-Jones face detector contains three main ideas that make it possible to build a successful face detector that can run in real time: the integral image, classifier learning with AdaBoost, and the attentional cascade structure.
Figure 2. Illustration of the integral image and Haar-like rectangle features (a-f).

2.1. The Integral Image
The integral image, also known as a summed area table, is an algorithm for quickly and efficiently computing the sum of values in a rectangular subset of a grid. It was first introduced to the computer graphics field by Crow [12] for use in mipmaps. Viola and Jones applied the integral image for rapid computation of Haar-like features, as detailed below. The integral image is constructed as follows:

    ii(x, y) = \sum_{x' \le x, y' \le y} i(x', y'),        (1)

where ii(x, y) is the integral image at pixel location (x, y) and i(x', y') is the original image. Using the integral image to compute the sum of any rectangular area is extremely efficient, as shown in Fig. 2. The sum of pixels in the rectangular region ABCD can be calculated as:

    \sum_{(x,y) \in ABCD} i(x, y) = ii(D) + ii(A) - ii(B) - ii(C),        (2)

which requires only four array references.
The integral image can be used to compute simple Haar-like rectangular features, as shown in Fig. 2 (a-f). The features are defined as the (weighted) intensity difference between two to four rectangles. For instance, in feature (a), the feature value is the difference in average pixel value between the gray and white rectangles. Since the rectangles share corners, the computation of the two-rectangle features (a and b) requires six array references, the three-rectangle features (c and d) require eight array references, and the four-rectangle features (e and f) require nine array references.

2.2. AdaBoost Learning
Boosting is a method of finding a highly accurate hypothesis by combining many "weak" hypotheses, each with moderate accuracy. For an introduction to boosting, we refer the readers to [59] and [19].
The AdaBoost (Adaptive Boosting) algorithm is generally considered the first step towards more practical boosting algorithms [17, 18]. In this section, following [80] and [19], we briefly present a generalized version of the AdaBoost algorithm, usually referred to as RealBoost. It has been advocated in various works [46, 6, 101, 62] that RealBoost yields better performance than the original AdaBoost algorithm.
Consider a set of training examples S = {(x_i, z_i), i = 1, ..., N}, where x_i belongs to a domain or instance space X, and z_i belongs to a finite label space Z. In binary classification problems, Z = {1, -1}, where z_i = 1 for positive examples and z_i = -1 for negative examples. AdaBoost produces an additive model F^T(x) = \sum_{t=1}^{T} f_t(x) to predict the label of an input example x, where F^T(x) is a real-valued function of the form F^T : X -> R. The predicted label is \hat{z}_i = sign(F^T(x_i)), where sign(.) is the sign function. From the statistical view of boosting [19], the AdaBoost algorithm fits an additive logistic regression model by using adaptive Newton updates to minimize the expected exponential criterion:

    L^T = \sum_{i=1}^{N} \exp\{-z_i F^T(x_i)\}.        (3)

The AdaBoost learning algorithm can be viewed as finding the best additive base function f_{t+1}(x) once F^t(x) is given. For this purpose, we assume the base function pool {f(x)} is in the form of confidence-rated decision stumps. That is, a certain form of real feature value h(x) is first extracted from x, h : X -> R. For instance, in the Viola-Jones face detector, h(x) is one of the Haar-like features computed with the integral image, as shown in Fig. 2 (a-f). A decision threshold H divides the output of h(x) into two subregions, u_1 and u_2, with u_1 \cup u_2 = R. The base function f(x) is thus:

    f(x) = c_j, if h(x) \in u_j, j = 1, 2,        (4)

which is often referred to as the stump classifier; c_j is called the confidence. The optimal confidence values can be derived as follows. For j = 1, 2 and k = 1, -1, let

    W_{kj} = \sum_{i: z_i = k, f(x_i) \in u_j} \exp\{-k F^t(x_i)\}.        (5)

The target criterion can thus be written as:

    L^{t+1} = \sum_{j=1}^{2} [ W_{+1j} e^{-c_j} + W_{-1j} e^{c_j} ].        (6)

Using standard calculus, we see that L^{t+1} is minimized when

    c_j = (1/2) \ln( W_{+1j} / W_{-1j} ).        (7)

Plugging this into (6), we have:

    L^{t+1} = 2 \sum_{j=1}^{2} \sqrt{ W_{+1j} W_{-1j} }.        (8)
Input
• Training examples S = {(x_i, z_i), i = 1, ..., N}.
• T is the total number of weak classifiers to be trained.
Initialize
• Initialize the example scores F^0(x_i) = (1/2) ln(N_+ / N_-), where N_+ and N_- are the numbers of positive and negative examples in the training data set.
AdaBoost Learning
For t = 1, ..., T:
1. For each Haar-like feature h(x) in the pool, find the optimal threshold H and confidence scores c_1 and c_2 to minimize the Z score L^t (8).
2. Select the best feature with the minimum L^t.
3. Update F^t(x_i) = F^{t-1}(x_i) + f_t(x_i), i = 1, ..., N.
4. Update W_{+1j}, W_{-1j}, j = 1, 2.
Output The final classifier F^T(x).
Figure 3. AdaBoost learning pseudo code.

Figure 4. The attentional cascade.

Eq. (8) is referred to as the Z score in [80]. In practice, at iteration t + 1, for every Haar-like feature h(x), we find the optimal threshold H and confidence scores c_1 and c_2 in order to minimize the Z score L^{t+1}. Simple pseudo code for the AdaBoost algorithm is shown in Fig. 3.
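As a rough, self-contained sketch of the loop in Fig. 3 (our own illustration: a random feature matrix stands in for a pool of Haar-like feature responses, and a coarse quantile grid stands in for the exhaustive threshold search):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_feat, T = 200, 20, 5
X = rng.normal(size=(N, n_feat))                  # stand-in feature responses h_j(x_i)
z = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)  # toy labels

def best_stump(h, z, F, eps=1e-9):
    """Return (Z, threshold, c1, c2) minimizing the Z score (8) for one feature."""
    w = np.exp(-z * F)
    best = (np.inf, 0.0, 0.0, 0.0)
    for thr in np.quantile(h, np.linspace(0.1, 0.9, 9)):
        Z, c = 0.0, []
        for mask in (h < thr, h >= thr):
            Wp = w[(z == 1) & mask].sum()
            Wn = w[(z == -1) & mask].sum()
            c.append(0.5 * np.log((Wp + eps) / (Wn + eps)))   # Eq. (7)
            Z += 2.0 * np.sqrt(Wp * Wn)                       # Eq. (8)
        if Z < best[0]:
            best = (Z, thr, c[0], c[1])
    return best

# Initialize: F^0(x_i) = (1/2) ln(N_+ / N_-)
F = np.full(N, 0.5 * np.log((z == 1).sum() / (z == -1).sum()))
for t in range(T):
    # Steps 1-2: optimal threshold/confidences per feature, keep the minimum Z.
    scores = [best_stump(X[:, j], z, F) for j in range(n_feat)]
    j = int(np.argmin([s[0] for s in scores]))
    _, thr, c1, c2 = scores[j]
    # Step 3: F^t(x_i) = F^{t-1}(x_i) + f_t(x_i)
    F = F + np.where(X[:, j] < thr, c1, c2)

train_acc = float(np.mean(np.sign(F) == z))
```

Each round the confidence-rated stump contributes a real-valued score rather than a hard vote, which is the RealBoost flavor described above.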
2.3. The Attentional Cascade Structure
The attentional cascade is a critical component of the Viola-Jones detector. The key insight is that smaller, and thus more efficient, boosted classifiers can be built that reject most of the negative sub-windows while keeping almost all the positive examples. Consequently, the majority of sub-windows are rejected in the early stages of the detector, making the detection process extremely efficient.
The overall process of classifying a sub-window thus forms a degenerate decision tree, which was called a "cascade" in [92]. As shown in Fig. 4, the input sub-windows pass through a series of nodes during detection. Each node makes a binary decision on whether the window will be kept for the next round or rejected immediately. The number of weak classifiers in the nodes usually increases with the number of nodes a sub-window has passed. For instance, in [92], the first five nodes contain 1, 10, 25, 25 and 50 weak classifiers, respectively. This is intuitive, since each node tries to reject a certain amount of negative windows while keeping all the positive examples, and the task becomes harder at later stages. Having fewer weak classifiers at the early stages also improves the speed of the detector.
The cascade structure also has an impact on the training process. Face detection is a rare event detection task; consequently, billions of negative examples are usually needed in order to train a high-performance face detector. To handle this huge number of negative training examples, Viola and Jones [92] used a bootstrap process. That is, at each node, a threshold was manually chosen, and the partial classifier was used to scan the negative example set to find more unrejected negative examples for the training of the next node. Furthermore, each node is trained independently, as if the previous nodes did not exist. One argument behind such a process is that it forces some nonlinearity into the training process, which could improve the overall performance. However, recent works have shown that it is actually beneficial not to completely separate the training processes of different nodes, as will be discussed in Section 4.
In [92], the attentional cascade is constructed manually. That is, the number of weak classifiers and the decision threshold for early rejection at each node are both specified manually. This is a non-trivial task. If the decision thresholds are set too aggressively, the final detector will be very fast, but the overall detection rate may suffer. On the other hand, if the decision thresholds are set very conservatively, most sub-windows will need to pass through many nodes, making the detector very slow. Combined with the limited computational resources available in the early 2000s, it is no wonder that training a good face detector could take months of fine-tuning.
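The cascade's early-rejection behavior can be sketched as follows (our own illustration; the node score functions and thresholds are hypothetical placeholders, not those of [92]):

```python
def classify_subwindow(window, cascade):
    """cascade: list of (score_fn, threshold) node pairs, cheapest first.
    A sub-window survives only if every node's boosted score clears that
    node's rejection threshold; most windows exit at the first nodes."""
    for score_fn, threshold in cascade:
        if score_fn(window) < threshold:
            return False      # rejected immediately, no further cost
    return True               # passed all nodes: candidate face

# Toy cascade on scalar "windows": two nodes with increasing strictness.
cascade = [(lambda w: w, 0.2), (lambda w: w, 0.8)]
hits = [w for w in (0.1, 0.5, 0.9) if classify_subwindow(w, cascade)]
```

Because most negative windows fail the first, cheapest node, the average cost per window is close to that node's cost alone.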
3. Feature Extraction
As mentioned earlier, thanks to the rapid expansion of storage and computation resources, appearance-based methods have dominated the recent advances in face detection. The general practice is to collect a large set of face and non-face examples, and adopt certain machine learning algorithms to learn a face model to perform classification. There are two key issues in this process: what features to extract, and which learning algorithm to apply. In this section, we first review the recent advances in feature extraction.
The Haar-like rectangular features as in Fig. 2 (a-f) are very efficient to compute thanks to the integral image technique, and provide good performance for building frontal face detectors. In a number of follow-up works, researchers extended these straightforward features with more variations in the ways rectangle features are combined.
For instance, as shown in Fig. 5, Lienhart and Maydt [49] generalized the feature set of [92] by introducing 45-degree
`

`
`Figure 5. The rotated integral image/summed area table.
`
`Figure 6. (a) Rectangular features with flexible sizes and distances
`introduced in [46]. (b) Diagonal filters in [38].
`
`rotated rectangular features (a-d), and center-surround fea-
`tures (e-f). In order to compute the 45 degree rotated rect-
`angular features, a new rotated summed area table was in-
`troduced as:
`
`(cid:88)
`
`rii(x, y) =
`
`x(cid:48)≤x,|y−y(cid:48)|≤x−x(cid:48)
`
`i(x(cid:48), y(cid:48)).
`
`(9)
`
`As seen in Fig. 5, rii(A) is essentially the sum of pixel in-
`tensities in the shaded area. The rotated summed area table
`can be calculated with two passes over all pixels.
`A number of researchers noted the limitation of the orig-
`inal Haar-like feature set in [92] for multi-view face detec-
`tion, and proposed to extend the feature set by allowing
`more flexible combination of rectangular regions. For in-
`stance, in [46], three types of features were defined in the
`detection sub-window, as shown in Fig. 6 (a). The rectan-
`gles are of flexible sizes x × y and they are at certain dis-
`tances of (dx, dy) apart. The authors argued that these fea-
`tures can be non-symmetrical to cater to non-symmetrical
`characteristics of non-frontal faces. Jones and Viola [38]
`also proposed a similar feature called diagonal filters, as
`shown in Fig. 6 (b). These diagonal filters can be computed
`with 16 array references to the integral image.
`Jones et al.
`[39] further extended the Haar-like fea-
`ture set to work on motion filtered images for video-based
`
`Figure 7. The joint Haar-like feature introduced in [62].
`
`pedestrian detection. Let the previous and current video
`frames be it−1 and it. Five motion filters are defined as:
`∆ = |it − it−1|
`U = |it − it−1 ↑ |
`L = |it − it−1 ← |
`R = |it − it−1 → |
`D = |it − it−1 ↓ |
`it ↑ is it
`where {↑,←,→,↓} are image shift operators.
`shifted up by one pixel. In addition to the regular rectan-
`gular features (Fig. 2) on these additional motion filtered
`images, Jones et al. added single box rectangular sum fea-
`tures, and new features across two images. For instance:
`fi = ri(∆) − ri(S),
`(10)
`where S ∈ {U, L, R, D} and ri(·) is a single box rectangu-
`lar sum within the detection window.
`One must be careful that the construction of the motion
`filtered images {U, L, R, D} is not scale invariant. That is,
`when detecting pedestrians at different scales, these filtered
`images need to be recomputed. This can be done by first
`constructing a pyramid of images for it at different scales
`and computing the filtered images at each level of the pyra-
`mid, as was done in [39].
`Mita et al. [62] proposed joint Haar-like features, which
`is based on co-occurrence of multiple Haar-like features.
`The authors claimed that feature co-occurrence can better
`capture the characteristics of human faces, making it pos-
`sible to construct a more powerful classifier. As shown
`in Fig. 7, the joint Haar-like feature uses a similar feature
`computation and thresholding scheme, however, only the
`binary outputs of the Haar-like features are concatenated
`into an index for 2F possible combinations, where F is the
`number of combined features. To find distinctive feature
`co-occurrences with limited computational complexity, the
`suboptimal sequential forward selection scheme was used in
`[62]. The number F was also heuristically limited to avoid
`statistical unreliability.
`To some degree, the above joint Haar-like features re-
`semble a CART tree, which was explored in [8].
`It was
`shown that CART tree based weak classifiers improved re-
`sults across various boosting algorithms with a small loss

in speed. In another variation for improving the weak classifier, [101] proposed to use a single Haar-like feature, and equally bin the feature values into a histogram to be used in a RealBoost learning algorithm. Similar to the number F in the joint Haar-like features, the number of bins of the histogram is vital to the performance of the final detector. [101] proposed to use 64 bins, and in their later work [32], they specifically pointed out that too fine a granularity of the histogram may cause overfitting, and suggested using fine granularity in the first few layers of the cascade and coarse granularity in later layers. Another interesting recent work is [107], where the authors proposed a new weak classifier called the Bayesian stump. The Bayesian stump is also a histogram-based weak classifier; however, its split thresholds are derived from iterative split and merge operations instead of being fixed at equal distances. Experimental results showed that such a flexible multi-split thresholding scheme is effective in improving the detector's performance.
Another limitation of the original Haar-like feature set is its lack of robustness in handling faces under extreme lighting conditions, despite the fact that the Haar features are usually normalized by the test window's intensity covariance [92]. In [21] a modified census transform was adopted to generate illumination-insensitive features for face detection. On each pixel's 3×3 neighborhood, the authors applied a modified census transform that compares the neighborhood pixels with their intensity mean. The results are concatenated into an index number representing the pixel's local structure. During boosting, the weak classifiers are constructed by examining the distributions of the index numbers of the pixels. Another well-known feature set robust to illumination variations is local binary patterns (LBP) [65], which have been very effective for face recognition tasks [2, 117]. In [37, 119], LBP was applied to face detection under a Bayesian and a boosting framework, respectively. More recently, inspired by LBP, Yan et al. [110] proposed the locally assembled binary feature, which showed great performance on standard face detection data sets.
To explore possibilities to further improve performance, more and more complex features were proposed in the literature. For instance, Liu and Shum [52] studied generic linear features, defined by a mapping function φ() : R^d -> R^1, where d is the size of the test patch. For linear features, φ(x) = φ^T x, φ ∈ R^d. The classification function is of the following form:

    F^T(x) = sign[ \sum_{t=1}^{T} λ_t(φ_t^T x) ],        (11)

where λ_t() are R -> R discriminating functions, such as the conventional stump classifiers in AdaBoost. F^T(x) shall be 1 for positive examples and -1 for negative examples. Note that the Haar-like feature set is a subset of linear features. Another example is the anisotropic Gaussian filters in [60]. In [10], the linear features were constructed by pre-learning them using local non-negative matrix factorization (LNMF), which is still sub-optimal. Instead, Liu and Shum [52] proposed to search for the linear features by examining the Kullback-Leibler (KL) divergence of the positive and negative histograms projected onto the feature during boosting (hence the name Kullback-Leibler boosting). In [97], the authors proposed to apply Fisher discriminant analysis and, more generally, recursive nonparametric discriminant analysis (RNDA) to find the linear projections φ_t. Linear projection features are very powerful features. The selected features shown in [52] and [97] were like face templates. They may significantly improve the convergence speed of the boosting classifier at early stages. However, caution must be taken to avoid overfitting if these features are to be used at the later stages of learning. In addition, the computational load of linear features is generally much higher than that of the traditional Haar-like features. In contrast, Baluja et al. [4] proposed to use simple pixel pairs as features, and Abramson and Steux [1] proposed to use the relative values of a set of control points as features. Such pixel-based features can be computed even faster than the Haar-like features; however, their discrimination power is generally insufficient to build high performance detectors.
Another popular class of complex features for face/object detection is based on regional statistics such as histograms. Levi and Weiss [45] proposed local edge orientation histograms, which compute the histogram of edge orientations in sub-regions of the test window. These features are then selected by an AdaBoost algorithm to build the detector. The orientation histogram is largely invariant to global illumination changes, and it is capable of capturing geometric properties of faces that are difficult to capture with linear edge filters such as Haar-like features. However, similar to motion filters, edge-based histogram features are not scale invariant, hence one must first scale the test images to form a pyramid to make the local edge orientation histogram features reliable. Later, Dalal and Triggs [13] proposed a similar scheme called the histogram of oriented gradients (HoG), which became a very popular feature for human/pedestrian detection [120, 25, 88, 43, 15]. In [99], the authors proposed spectral histogram features, which adopt a broader set of filters before collecting the histogram features, including gradient filters, Laplacian of Gaussian filters and Gabor filters. Compared with [45], the histogram features in [99] were based on the whole testing window rather than local regions, and support vector machines (SVMs) were used for classification. Zhang et al. [118] proposed another histogram-based feature called spatial histograms, which is based on local statistics of LBP. HoG and LBP were also combined in [98], which achieved excellent performance on human detection with partial occlusion handling. Region

covariance was another statistics-based feature, proposed by Tuzel et al. [91] for generic object detection and texture classification tasks. Instead of using histograms, they compute the covariance matrices among the color channels and gradient images. Regional covariance features can also be efficiently computed using integral images.
Huang et al. [33] proposed a sparse feature set in order to strengthen the features' discrimination power without incurring too much additional computational cost. Each sparse feature can be represented as:

    f(x) = \sum_i α_i p_i(x; u, v, s),  α_i ∈ {-1, +1},        (12)

where x is an image patch, and p_i is a granule of the sparse feature. A granule is specified by 3 parameters: horizontal offset u, vertical offset v and scale s. For instance, as shown in Fig. 8, p_i(x; 5, 3, 2) is a granule with top-left corner (5, 3) and scale 2^2 = 4, and p_i(x; 9, 13, 3) is a granule with top-left corner (9, 13) and scale 2^3 = 8. Granules can be computed efficiently using pre-constructed image pyramids, or through the integral image. In [33], the maximum number of granules in a single sparse feature is 8. Since the total number of granules is large, the search space is very large and exhaustive search is infeasible. The authors proposed a heuristic search scheme, where granules are added to a sparse feature one by one, with an expansion operator that removes, refines and adds granules to a partially selected sparse feature. To reduce the computation, the authors further conducted a multi-scaled search, which uses a small set of training examples to evaluate all features first and rejects those that are unlikely to be good. The performance of the multi-view face detector trained in [33] using sparse features was very good.

Figure 8. The sparse feature set in granular space introduced in [33].

As new features are composed in seeking the best discrimination power, the feature pool becomes larger and larger, which creates new challenges in the feature selection process. A number of recent works have attempted to address this issue. For instance, [113] proposed to discover compositional features using the classic frequent item-set mining scheme in data mining. Instead of using the raw feature values, they assume a collection of induced binary features (e.g., decision stumps with known thresholds) is already available. By partitioning the feature space into sub-regions through these binary features, the training examples can be indexed by the sub-regions in which they are located. The algorithm then searches for a small subset of compositional features that are both frequent enough to have statistical significance and accurate enough to be useful for label prediction. The final classifier is then learned based on the selected subset of compositional features through AdaBoost. In [26], the authors first established an analogy between compositional feature selection and generative image segmentation, and applied the Swendsen-Wang Cut algorithm to generate n-partitions of the individual feature set, where each subset of the partition corresponds to a compositional feature. This algorithm re-runs for every weak classifier selected by the AdaBoost learning framework. On a person detection task, the composite features showed significant improvement, especially when the individual features were very weak (e.g., Haar-like features).
In some applications such as object tracking, even if the number of possible features is not extensive, an exhaustive feature selection is still impractical due to computational constraints. In [53], the authors proposed a gradient-based feature selection scheme for online boosting with primary applications in person detection and tracking. Their work iteratively updates each feature using a gradient descent algorithm, by minimizing the weighted least square error between the estimated feature response and the true label. This is particularly attractive for tracking and updating schemes such as [25], where at any time instance, the object's appearance is already represented by a boosted classifier learned from previous frames. Assuming there is no dramatic change in the appearance, the gradient descent based algorithm can refine the features in a very efficient manner.
There have also been many features that attempt to model the shape of objects. For instance, Opelt et al. [66] composed multiple boundary fragments into weak classifiers and formed a strong "boundary-fragment-model" detector using boosting. They ensure the feasibility of the feature selection process by limiting the number of boundary fragments to 2-3 for each weak classifier. Shotton et al. [86] learned their object detectors with a boosting algorithm and a feature set consisting of a randomly chosen dictionary of contour fragments. A very similar edgelet feature was proposed in [102], and was used to learn human body part detectors in order to handle multiple, partially occluded humans. In [79], shapelet features focusing on local regions of the image were built from low-level gradient information using AdaBoost for pedestrian detection. An interesting side benefit of having contour/edgelet features is that object detection and object segmentation can be performed

jointly, such as the work in [104] and [23].
We summarize the features presented in this section in Table 1.

Table 1. Features for face/object detection.

Feature Type                 | Representative Works
-----------------------------|--------------------------------------------------
Haar-like features and       | Haar-like features [92]
their variations             | Rotated Haar-like features [49]
                             | Rectangular features with structure [46, 38]
                             | Haar-like features on motion filtered images [39]
Pixel-based features         | Pixel pairs [4]
                             | Control point set [1]
Binarized features           | Modified census transform [21]
                             | LBP features [37, 119]
                             | Locally assembled binary feature [110]
Generic linear features      | Anisotropic Gaussian filters [60]
                             | LNMF [10]
                             | Generic linear features with KL boosting [52]
                             | RNDA [97]
Statistics-based features    | Edge orientation histograms [45, 13], etc.
                             | Spectral histogram [99]
                             | Spatial histogram (LBP-based) [118]
                             | HoG and LBP [98]
                             | Region covariance [91]
Composite features           | Joint Haar-like features [62]
                             | Sparse feature set [33]
Shape features               | Boundary/contour fragments [66, 86]
                             | Edgelet [102]
                             | Shapelet [79]
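As a concrete example of the binarized features summarized in Table 1, a basic 3×3 LBP code can be computed as follows (our own minimal sketch of the standard LBP formulation, not code from any of the cited works; the bit ordering is an assumption):

```python
import numpy as np

def lbp_code(patch):
    """Return the 8-bit local binary pattern code of a 3x3 patch:
    each neighbor >= center contributes one bit, clockwise from top-left."""
    center = patch[1, 1]
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    code = 0
    for bit, value in enumerate(neighbors):
        if value >= center:
            code |= 1 << bit
    return code
```

A detector then typically histograms these codes over sub-regions of the window and feeds the histograms to the learner, as in the LBP-based detectors cited above.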
`
`4. Variations of the Boosting Learning Algo-
`rithm
In addition to exploring better features, another avenue to improve the detector's performance is to improve the boosting learning algorithm, particularly under the cascade decision structure. In the original face detection paper by Viola and Jones [92], the standard AdaBoost algorithm [17] was adopted. In a number of follow-up works [46, 6, 101, 62], researchers advocated the use of RealBoost, which was explained in detail in Section 2.2. Both Lienhart et al. [48] and Brubaker et al. [8] compared three boosting algorithms: AdaBoost, RealBoost and GentleBoost, though they reached different conclusions: the former recommended GentleBoost, while the latter showed that RealBoost works slightly better when combined with CART-based weak classifiers. In the following, we describe a number of recent works on boosting learning for face/object detection, with emphasis on adapting to the cascade structure, the training speed, multi-view face detection, etc.
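As a point of reference for the variants discussed below, the standard discrete AdaBoost loop with decision-stump weak classifiers can be sketched as follows (a minimal illustration, not the training code of any cited detector; all names are ours):

```python
import numpy as np

def adaboost_stumps(X, y, T):
    """Discrete AdaBoost with axis-aligned decision stumps.

    X: (n, d) array of feature values; y: labels in {-1, +1}.
    Returns a list of (feature index, threshold, polarity, alpha).
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)                       # sample weights
    ensemble = []
    for _ in range(T):
        best = None
        for j in range(d):                        # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = pol * np.where(X[:, j] >= thr, 1, -1)
                    err = w[pred != y].sum()      # weighted training error
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        err = max(err, 1e-12)                     # avoid log(0) on perfect stumps
        alpha = 0.5 * np.log((1 - err) / err)     # weak classifier weight
        w *= np.exp(-alpha * y * pred)            # up-weight misclassified samples
        w /= w.sum()
        ensemble.append((j, thr, pol, alpha))
    return ensemble

def predict(ensemble, X):
    """Sign of the weighted vote of all weak classifiers."""
    score = sum(alpha * pol * np.where(X[:, j] >= thr, 1, -1)
                for j, thr, pol, alpha in ensemble)
    return np.sign(score)
```

The greedy selection in the inner loop is exactly the "sequential forward search" that FloatBoost, discussed next, augments with backtracking.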
In [46], the authors proposed FloatBoost, which attempted to overcome the monotonicity problem of sequential AdaBoost learning. Specifically, AdaBoost is a sequential forward search procedure using a greedy selection strategy, which may be suboptimal. FloatBoost incorporates the idea of floating search [73] into AdaBoost: it not only adds features during training, but also backtracks and examines the already selected features to remove those that are least significant. The authors claimed that FloatBoost usually needs fewer weak classifiers than AdaBoost to achieve a given objective. Jang and Kim [36] proposed to use evolutionary algorithms to minimize the number of classifiers without degrading the detection accuracy. They showed that such an algorithm can reduce the total number of weak classifiers by over 40%. Note that in practice only the first few nodes are critical to the detection speed, since most testing windows are rejected by the first few weak classifiers in a cascade architecture.
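The backtracking idea can be sketched abstractly as follows. This is only an illustration of the conditional-exclusion step in the spirit of floating search, not the authors' exact procedure; `error_of` is a hypothetical callback that evaluates the training error of a candidate ensemble:

```python
def floatboost_prune(ensemble, error_of):
    """Conditional exclusion in the spirit of FloatBoost.

    Repeatedly remove the weak classifier whose removal lowers the
    ensemble's error, stopping when no single removal helps (or only
    one weak classifier remains).
    """
    improved = True
    while improved and len(ensemble) > 1:
        improved = False
        base = error_of(ensemble)
        for i in range(len(ensemble)):
            trial = ensemble[:i] + ensemble[i+1:]   # drop classifier i
            if error_of(trial) < base:
                ensemble = trial                    # keep the smaller ensemble
                improved = True
                break
    return ensemble
```

In the full algorithm this pruning step alternates with the usual forward feature-addition step, which is why the final ensemble can be smaller than a purely greedy one at the same error target.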
As mentioned in Section 2.3, Viola and Jones [92] trained each node independently. A number of follow-up works showed that there is indeed information in the results from the previous nodes, and it is best to reuse them instead of starting from scratch at each new node. For instance, in [108], the authors proposed to use a “cha
