(12) United States Patent
Luo et al.
(10) Patent No.: US 6,282,317 B1
(45) Date of Patent: Aug. 28, 2001
Inventors: Jiebo Luo, Pittsford; Stephen Etz,
Fairport; Amit Singhal, Rochester, all
of NY (US)
(73) Assignee: Eastman Kodak Company, Rochester,
NY (US)
(*) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 0 days.
(21) Appl. No.: 09/223,860
(22) Filed: Dec. 31, 1998
(51) Int. Cl.7: G06K 9/46
(52) U.S. Cl.: 382/203
(58) Field of Search: 382/203, 204,
382/205, 206, 207, 168, 169, 173, 218,
195, 155, 156, 159, 160, 190, 202, 227
Primary Examiner—Andrew W. Johns
Assistant Examiner—Seyed Azarian
(74) Attorney, Agent, or Firm—David M. Woods
A method for detecting a main subject in an image, the
method comprises: receiving a digital image; extracting
regions of arbitrary shape and size defined by actual objects
from the digital image; grouping the regions into larger
segments corresponding to physically coherent objects;
extracting for each of the regions at least one structural
saliency feature and at least one semantic saliency feature;
and integrating saliency features using a probabilistic
reasoning engine into an estimate of a belief that each region is
the main subject.
37 Claims, 9 Drawing Sheets
`The invention relates generally to the field of digital
`image processing and, more particularly, to locating main
`subjects, or equivalently, regions of photographic interest in
`a digital image.
`In photographic pictures, a main subject is defined as what
`the photographer tries to capture in the scene. The first-party
`truth is defined as the opinion of the photographer and the
`third-party truth is defined as the opinion from an observer
`other than the photographer and the subject (if applicable).
`In general, the first-party truth typically is not available due
`to the lack of specific knowledge that the photographer may
`have about the people, setting, event, and the like. On the
other hand, there is, in general, good agreement among
`third-party observers if the photographer has successfully
`used the picture to communicate his or her interest in the
`main subject to the viewers. Therefore, it is possible to
`design a method to automatically perform the task of detect-
`ing main subjects in images.
`Main subject detection provides a measure of saliency or
`relative importance for different regions that are associated
`with different subjects in an image. It enables a discrimina-
`tive treatment of the scene contents for a number of appli-
`cations. The output of the overall system can be modified
`versions of the image, semantic information, and action.
`The methods disclosed by the prior art can be put in two
`major categories. The first category is considered “pixel-
`based” because such methods were designed to locate inter-
`esting pixels or “spots” or “blocks”, which usually do not
`correspond to entities of objects or subjects in an image. The
`second category is considered “region-based” because such
`methods were designed to locate interesting regions, which
`correspond to entities of objects or subjects in an image.
Most pixel-based approaches to region-of-interest detection
are essentially edge detectors. V. D. Gesu, et al., “Local
operators to detect regions of interest,” Pattern Recognition
Letters, vol. 18, pp. 1077-1081, 1997, used two local
`operators based on the computation of local moments and
`symmetries to derive the selection. Arguing that the perfor-
`mance of a visual system is strongly influenced by infor-
`mation processing done at early vision stage, two transforms
`named the discrete moment transform (DMT) and discrete
`symmetry transform (DST) are computed to measure local
`central moments about each pixel and local radial symmetry.
`In order to exclude trivial symmetry cases, nonuniform
`region selection is needed. The specific DMT operator acts
`like a detector of prominent edges (occlusion boundaries)
`and the DST operator acts like a detector of symmetric
`blobs. The results from the two operators are combined via
`logic “AND” operation. Some morphological operations are
`needed to dilate the edge-like raw output map generated by
`the DMT operator.
`R. Milanese, Detecting salient regions in an image: From
`biology to implementation, PhD thesis, University of
`Geneva, Switzerland, 1993, developed a computational
`model of visual attention, which combines knowledge about
`the human visual system with computer vision techniques.
The model is structured into three major stages. First,
multiple feature maps are extracted from the input image
(for example, orientation, curvature, color contrast and the
like). Second, a corresponding number of “conspicuity”
maps are computed using a derivative of Gaussian model,
which enhance regions of interest in each feature map.
`Finally, a nonlinear relaxation process is used to integrate
`the conspicuity maps into a single representation by finding
`a compromise among inter-map and intra-map inconsisten-
`cies. The effectiveness of the approach was demonstrated
`using a few relatively simple images with remarkable
`regions of interest.
To determine an optimal tonal reproduction, J. R. Boyack,
et al., U.S. Pat. No. 5,724,456, developed a system that
`partitions the image into blocks, combines certain blocks
`into sectors, and then determines a difference between the
`maximum and minimum average block values for each
`sector. A sector is labeled an active sector if the difference
`exceeds a pre-determined threshold value. All weighted
`counts of active sectors are plotted versus the average
`luminance sector values in a histogram, which is then shifted
`via some predetermined criterion so that the average lumi-
`nance sector value of interest will fall within a destination
`window corresponding to the tonal reproduction capability
`of a destination application.
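The block-and-sector labeling described above can be sketched as follows. This is only an illustrative reading of the prior-art description, not the implementation of U.S. Pat. No. 5,724,456: the function names, the block size, the sector size, and the threshold value are all assumptions, and the subsequent histogram weighting and shifting steps are omitted.

```python
from statistics import mean

def block_means(lum, block):
    """Average luminance of each block x block tile of a 2-D list (ragged edges dropped)."""
    rows, cols = len(lum) // block, len(lum[0]) // block
    return [[mean(lum[r * block + i][c * block + j]
                  for i in range(block) for j in range(block))
             for c in range(cols)] for r in range(rows)]

def active_sectors(lum, block=8, sector=2, threshold=20.0):
    """Flag a sector (sector x sector blocks) as active when the spread
    between its largest and smallest block mean exceeds the threshold.
    The default values are hypothetical."""
    blocks = block_means(lum, block)
    rows, cols = len(blocks) // sector, len(blocks[0]) // sector
    active = []
    for r in range(rows):
        row = []
        for c in range(cols):
            vals = [blocks[r * sector + i][c * sector + j]
                    for i in range(sector) for j in range(sector)]
            row.append(max(vals) - min(vals) > threshold)
        active.append(row)
    return active

# A 32x32 dark image with one bright 8x8 block in the top-left corner.
image = [[100 if (y < 8 and x < 8) else 0 for x in range(32)] for y in range(32)]
flags = active_sectors(image)
```

Only the sector containing the bright block shows a large spread between block means, so only that sector is flagged as active.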
In summary, this type of pixel-based approach does not
explicitly detect regions of interest corresponding to
semantically meaningful subjects in the scene. Rather, these
methods attempt to detect regions where certain changes occur in
`order to direct attention or gather statistics about the scene.
`X. Marichal, et al., “Automatic detection of interest areas
`of an image or of a sequence of images,” in Proc. IEEE Int.
`Conf. Image Process., 1996, developed a fuzzy logic-based
`system to detect interesting areas in a video sequence. A
`number of subjective knowledge-based interest criteria were
`evaluated for segmented regions in an image. These criteria
include: (1) an interaction criterion (a window predefined by
a human operator); (2) a border criterion (rejecting
regions having a large number of pixels along the picture
border); (3) a face texture criterion (de-emphasizing
regions whose texture does not correspond to skin samples);
`(4) a motion criterion (rejecting regions with no motion and
`low gradient or regions with very large motion and high
`gradient); and (5) a continuity criterion (temporal stability in
motion). The main application of this method is for directing
the resources in video coding, in particular for videophone
or videoconference. It is clear that motion is the most
`effective criterion for this technique targeted at video instead
`of still images. Moreover, the fuzzy logic functions were
`designed in an ad hoc fashion. Lastly, this method requires
`a window predefined by a human operator, and therefore is
`not fully automatic.
`W. Osberger, et al., “Automatic identification of percep-
`tually important regions in an image,” in Proc. IEEE Int.
`Conf. Pattern Recognition, 1998, evaluated several features
`known to influence human visual attention for each region of
`a segmented image to produce an importance value for each
feature in each region. The features mentioned include
low-level factors (contrast, size, shape, color, motion) and
higher level factors (location, foreground/background,
people, context), but only contrast, size, shape, location and
`foreground/background (determining background by deter-
`mining the proportion of total image border that is contained
`in each region) were implemented. Moreover, this method
`chose to treat each factor as being of equal importance by
`arguing that (1) there is little quantitative data which indi-
`cates the relative importance of these different factors and
`(2) the relative importance is likely to change from one
`image to another. Note that segmentation was obtained using
`the split-and-merge method based on 8x8 image blocks and
`this segmentation method often results in over-segmentation
`and blotchiness around actual objects.
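The foreground/background cue mentioned above, namely the proportion of the total image border contained in each region, can be illustrated with a short sketch. The function name and the label-map representation (a 2-D list of integer region labels) are assumptions made for illustration; the cited paper may compute this quantity differently.

```python
from collections import Counter

def border_proportion(labels):
    """Fraction of the image border pixels that fall in each region label.

    Regions that do not touch the border are absent from the result.
    """
    h, w = len(labels), len(labels[0])
    border = [(y, x) for y in range(h) for x in range(w)
              if y in (0, h - 1) or x in (0, w - 1)]
    counts = Counter(labels[y][x] for y, x in border)
    return {label: n / len(border) for label, n in counts.items()}

# A 4x4 label map split into a left region (0) and a right region (1).
label_map = [[0, 0, 1, 1] for _ in range(4)]
proportions = border_proportion(label_map)
```

A region that holds a large share of the border is more likely to be background under this cue; here the two regions share the border equally.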
`Q. Huang, et al., “Foreground/background segmentation
`of color images by integration of multiple cues,” in Proc.
IEEE Int. Conf. Image Process., 1995, addressed automatic
`segmentation of color images into foreground and back-
`ground with the assumption that background regions are
`relatively smooth but may have gradually varying colors or
`be lightly textured. A multi-level segmentation scheme was
`devised that included color clustering, unsupervised seg-
`mentation based on MDL (Minimum Description Length)
`principle, edge-based foreground/background separation,
`and integration of both region and edge-based segmentation.
`In particular, the MDL-based segmentation algorithm was
`used to further group the regions from the initial color
`clustering, and the four corners of the image were used to
`adaptively determine an estimate of the background gradient
`magnitude. The method was tested on around 100 well-
`composed images with prominent main subject centered in
`the image against large area of the assumed type of unclut-
`tered background.
`T. F. Syeda-Mahmood, “Data and model-driven selection
`using color regions,” Int. J. Comput. Vision, vol. 21, no. 1,
pp. 9-36, 1997, proposed a data-driven region selection
`method using color region segmentation and region-based
`saliency measurement. A collection of 220 primary color
`categories was pre-defined in the form of a color LUT
`(look-up-table). Pixels are mapped to one of the color
`categories, grouped together through connected component
`analysis, and further merged according to compatible color
`categories. Two types of saliency measures, namely self-
`saliency and relative saliency, are linearly combined using
`heuristic weighting factors to determine the overall saliency.
`In particular, self-saliency included color saturation, bright-
`ness and size while relative saliency included color contrast
`(defined by CIE distance) and size contrast between the
`concerned region and the surrounding region that is ranked
`highest among neighbors by size, extent and contrast in
`successive order.
In summary, almost all of these reported methods have
been developed for targeted types of images: video-
conferencing or TV news broadcasting images, where the
main subject is a talking person against a relatively simple
static background (Osberger, Marichal); museum images,
where there is a prominent main subject centered in the
image against a large area of relatively clean background
(Huang); and toy-world images, where the main subjects are
a few distinctively colored and shaped objects (Milanese,
Syeda-Mahmood). These methods were either not designed for
unconstrained photographic images or, even if designed with
generic principles, were demonstrated only on rather simple
images. The criteria and reasoning processes used were
somewhat inadequate for less constrained images, such as
photographic images.
`It is an object of this invention to provide a method for
`detecting the location of main subjects within a digitally
`captured image and thereby overcoming one or more prob-
`lems set forth above.
`It is also an object of this invention to provide a measure
`of belief for the location of main subjects within a digitally
`captured image and thereby capturing the intrinsic degree of
`uncertainty in determining the relative importance of differ-
`ent subjects in an image. The output of the algorithm is in the
`form of a list of segmented regions ranked in a descending
`order of their likelihood as potential main subjects for a
`generic or specific application. Furthermore, this list can be
`converted into a map in which the brightness of a region is
`proportional to the main subject belief of the region.
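As a minimal sketch of the two output forms described above, the following assumes that the segmentation is available as a 2-D list of region labels and that each region already has a belief value between 0 and 1. The function names and the 8-bit gray-level scale are assumptions, not part of the disclosure.

```python
def rank_regions(beliefs):
    """Region ids sorted in descending order of main subject belief."""
    return sorted(beliefs, key=beliefs.get, reverse=True)

def belief_map(labels, beliefs, max_level=255):
    """Paint each pixel with a gray level proportional to its region's belief."""
    return [[round(max_level * beliefs[region]) for region in row]
            for row in labels]

# Hypothetical 2x3 label map with three regions and their beliefs.
labels = [[0, 0, 1],
          [0, 2, 1]]
beliefs = {0: 0.2, 1: 0.6, 2: 1.0}

ranking = rank_regions(beliefs)
gray = belief_map(labels, beliefs)
```

The ranked list and the map carry the same information; the map simply attaches each belief to the pixels of its region so that brighter areas indicate more likely main subjects.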
It is also an object of this invention to use ground truth
data. Ground truth, defined as human outlined main subjects,
is used for feature selection and for training the reasoning engine.
`It is also an object of this invention to provide a method
`of finding main subjects in an image in an automatic manner.
`It is also an object of this invention to provide a method
`of finding main subjects in an image with no constraints or
`assumptions on scene contents.
`It is further an object of the invention to use the main
`subject location and main subject belief to obtain estimates
`of the scene characteristics.
`The present invention comprises the steps of:
`a) receiving a digital image;
`b) extracting regions of arbitrary shape and size defined
`by actual objects from the digital image;
`c) grouping the regions into larger segments correspond-
`ing to physically coherent objects;
d) extracting for each of the regions at least one structural
saliency feature and at least one semantic saliency
feature; and,
`e) integrating saliency features using a probabilistic rea-
`soning engine into an estimate of a belief that each
`region is the main subject.
`The above and other objects of the present invention will
`become more apparent when taken in conjunction with the
`following description and drawings wherein identical refer-
`ence numerals have been used, where possible, to designate
`identical elements that are common to the figures.
The present invention has the following advantages:
a robust image segmentation method capable of identifying
object regions of arbitrary shapes and sizes, based
on physics-motivated adaptive Bayesian clustering and
non-purposive grouping;
emphasis on perceptual grouping capable of organizing
regions corresponding to different parts of physically
coherent subjects;
utilization of a non-binary representation of the ground
truth, which captures the inherent uncertainty in
determining the belief of main subject, to guide the design
of the system;
a rigorous, systematic statistical training mechanism to
determine the relative importance of different features
through ground truth collection and contingency table
building;
extensive, robust feature extraction and evidence
collection;
combination of structural saliency and semantic saliency,
the latter facilitated by explicit identification of key
foreground and background subject matters;
combination of self and relative saliency measures for
structural saliency features; and,
a robust Bayes net-based probabilistic inference engine
suitable for integrating incomplete information.
`FIG. 1 is a perspective view of a computer system for
`implementing the present invention;
`FIG. 2 is a block diagram illustrating a software program
`of the present invention;
FIG. 3 is an illustration of the sensitivity characteristic of
a belief sensor with sigmoidal shape used in the present
invention;
FIG. 4 is an illustration of the location PDF with
`unknown-orientation, FIG. 4(a) is an illustration of the PDF
`in the form of a 2D function, FIG. 4(b) is an illustration of
`the PDF in the form of its projection along the width
`direction, and FIG. 4(c) is an illustration of the PDF in the
`form of its projection along the height direction;
`FIG. 5 is an illustration of the location PDF with known-
`orientation, FIG. 5(a) is an illustration of the PDF in the
`form of a 2D function, FIG. 5(b) is an illustration of the PDF
`in the form of its projection along the width direction, and
`FIG. 5(c) is an illustration of the PDF in the form of its
`projection along the height direction;
`FIG. 6 is an illustration of the computation of relative
`saliency for the central circular region using an extended
`neighborhood as marked by the box of dotted line;
`FIG. 7 is an illustration of a two level Bayes net used in
`the present invention; and,
FIG. 8 is a block diagram of a preferred segmentation
method.
In the following description, the present invention will be
described in the preferred embodiment as a software
program. Those skilled in the art will readily recognize that the
equivalent of such software may also be constructed in
hardware.
Still further, as used herein, computer readable storage
medium may comprise, for example: magnetic storage
media such as a magnetic disk (such as a floppy disk) or
magnetic tape; optical storage media such as an optical disc,
optical tape, or machine readable bar code; solid state
`electronic storage devices such as random access memory
`(RAM), or read only memory (ROM); or any other physical
`device or medium employed to store a computer program.
`Referring to FIG. 1, there is illustrated a computer system
`10 for implementing the present invention. Although the
`computer system 10 is shown for the purpose of illustrating
`a preferred embodiment, the present invention is not limited
`to the computer system 10 shown, but may be used on any
`electronic processing system. The computer system 10
`includes a microprocessor based unit 20 for receiving and
`processing software programs and for performing other
`processing functions. A touch screen display 30 is electri-
`cally connected to the microprocessor based unit 20 for
`displaying user related information associated with the
`software, and for receiving user input via touching the
`screen. A keyboard 40 is also connected to the micropro-
`cessor based unit 20 for permitting a user to input informa-
`tion to the software. As an alternative to using the keyboard
`40 for input, a mouse 50 may be used for moving a selector
`52 on the display 30 and for selecting an item on which the
`selector 52 overlays, as is well known in the art.
`A compact disk-read only memory (CD-ROM) 55 is
`connected to the microprocessor based unit 20 for receiving
`software programs and for providing a means of inputting
`the software programs and other information to the micro-
`processor based unit 20 via a compact disk 57, which
`typically includes a software program. In addition, a floppy
`disk 61 may also include a software program, and is inserted
`into the microprocessor based unit 20 for inputting the
`software program. Still further, the microprocessor based
unit 20 may be programmed, as is well known in the art, for
`storing the software program internally. A printer 56 is
`connected to the microprocessor based unit 20 for printing
`a hardcopy of the output of the computer system 10.
`Images may also be displayed on the display 30 via a
`personal computer card (PC card) 62 or, as it was formerly
`known, a personal computer memory card international
`association card (PCMCIA card) which contains digitized
images electronically embodied in the card 62. The PC card 62
`is ultimately inserted into the microprocessor based unit 20
`for permitting visual display of the image on the display 30.
`Referring to FIG. 2, there is shown a block diagram of an
overview of the present invention. First, an input image of
a natural scene is acquired and stored S0 in a digital form.
Then, the image is segmented S2 into a few regions of
homogeneous properties. Next, the region segments are
grouped into larger regions based on similarity measures S4
through non-purposive perceptual grouping, and further
grouped into larger regions corresponding to perceptually
coherent objects S6 through purposive grouping (purposive
grouping concerns specific objects). The regions are
evaluated for their saliency S8 using two independent yet
complementary types of saliency features: structural saliency
features and semantic saliency features. The structural saliency
features, including a set of low-level early vision features
and a set of geometric features, are extracted S8a, which are
further processed to generate a set of self-saliency features
and a set of relative saliency features. Semantic saliency
features in the forms of key subject matters, which are likely
to be part of either foreground (for example, people) or
background (for example, sky, grass), are detected S8b to
provide semantic cues as well as scene context cues. The
evidences of both types are integrated S10 using a reasoning
engine based on a Bayes net to yield the final belief map of
the main subject S12.
`To the end of semantic interpretation of images, a single
`criterion is clearly insufficient. The human brain, furnished
`with its a priori knowledge and enormous memory of real
`world subjects and scenarios, combines different subjective
`criteria in order to give an assessment of the interesting or
primary subject(s) in a scene. The following extensive list of
features is believed to influence the human brain
in performing such a somewhat intangible task as main
subject detection: location, size, brightness, colorfulness,
texturefulness, key subject matter, shape, symmetry, spatial
relationship (surroundedness/occlusion), borderness, indoor/
outdoor, orientation, depth (when applicable), and motion
(when applicable for video sequence).
`In the present invention, the low-level early vision fea-
`tures include color, brightness, and texture. The geometric
`features include location (centrality), spatial relationship
`(borderness, adjacency, surroundedness, and occlusion),
`size, shape, and symmetry. The semantic features include
`flesh, face, sky, grass, and other green vegetation. Those
`skilled in the art can define more features without departing
`from the scope of the present invention.
S2: Region Segmentation
`The adaptive Bayesian color segmentation algorithm
`(Luo et al., “Towards physics-based segmentation of pho-
`tographic color images,” Proceedings of the IEEE Interna-
`tional Conference on Image Processing, 1997) is used to
`generate a tractable number of physically coherent regions
`of arbitrary shape. Although this segmentation method is
`preferred, it will be appreciated that a person of ordinary
`skill in the art can use a different segmentation method to
`obtain object regions of arbitrary shape without departing
`from the scope of the present invention. Segmentation of
`arbitrarily shaped regions provides the advantages of: (1)
`accurate measure of the size, shape, location of and spatial
`relationship among objects; (2) accurate measure of the
`color and texture of objects; and (3) accurate classification
`of key subject matters.
`Referring to FIG. 8, there is shown a block diagram of the
preferred segmentation algorithm. First, an initial
segmentation of the image into regions is obtained S50. A color
`histogram of the image is computed and then partitioned into
`a plurality of clusters that correspond to distinctive, promi-
`nent colors in the image. Each pixel of the image is classified
`to the closest cluster in the color space according to a
`preferred physics-based color distance metric with respect to
`the mean values of the color clusters (Luo et al., “Towards
`physics-based segmentation of photographic color images,”
`Proceedings of the IEEE International Conference on Image
`Processing, 1997). This classification process results in an
`initial segmentation of the image. A neighborhood window
`is placed at each pixel in order to determine what neighbor-
`hood pixels are used to compute the local color histogram
for this pixel. The window size is initially set at the size of
the entire image S52, so that the local color histogram is the
same as the one for the entire image and does not need to be
recomputed. Next, an iterative procedure is performed
between two alternating processes: re-computing S54 the
local mean values of each color class based on the current
segmentation, and re-classifying the pixels according to the
updated local mean values of color classes S56. This
iterative procedure is performed until a convergence is reached
S60. During this iterative procedure, the strength of the
spatial constraints can be adjusted in a gradual manner S58
(for example, the value of β, which indicates the strength of
the spatial constraints, is increased linearly with each
iteration). After the convergence is reached for a particular
`window size, the window used to estimate the local mean
`values for color classes is reduced by half in size S62. The
`iterative procedure is repeated for the reduced window size
`to allow more accurate estimation of the local mean values
`for color classes. This mechanism introduces spatial adap-
`tivity into the segmentation process. Finally, segmentation
`of the image is obtained when the iterative procedure
`reaches convergence for the minimum window size S64.
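The alternating procedure of FIG. 8 can be sketched in simplified form as follows. This is a rough illustration under several assumptions, not the preferred algorithm itself:

- Pixels are scalar intensities rather than colors.
- A squared difference stands in for the physics-based color distance metric.
- The initial class means are supplied by the caller instead of being derived from a color histogram.
- The spatial constraint is modeled as a penalty of β per disagreeing 4-neighbor.
- β is reset for each window size.

All function and parameter names are hypothetical.

```python
def local_means(img, labels, k, half):
    """Mean intensity of each class inside a (2*half+1)-wide window around every pixel."""
    h, w = len(img), len(img[0])
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sums, counts = [0.0] * k, [0] * k
            for yy in range(max(0, y - half), min(h, y + half + 1)):
                for xx in range(max(0, x - half), min(w, x + half + 1)):
                    c = labels[yy][xx]
                    sums[c] += img[yy][xx]
                    counts[c] += 1
            # None marks a class that is absent from this window.
            out[y][x] = [sums[c] / counts[c] if counts[c] else None for c in range(k)]
    return out

def reclassify(img, labels, means, k, beta):
    """Give each pixel the class with the lowest intensity distance plus neighbor penalty."""
    h, w = len(img), len(img[0])
    new = [row[:] for row in labels]
    for y in range(h):
        for x in range(w):
            nbrs = [labels[yy][xx]
                    for yy, xx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= yy < h and 0 <= xx < w]

            def cost(c):
                m = means[y][x][c]
                if m is None:
                    return float("inf")
                return (img[y][x] - m) ** 2 + beta * sum(n != c for n in nbrs)

            new[y][x] = min(range(k), key=cost)
    return new

def segment(img, init_means, min_half=1, beta_step=0.5, max_iter=20):
    """Alternate mean estimation (S54) and reclassification (S56), halving the window (S62)."""
    k = len(init_means)
    # Initial segmentation (S50): nearest global class mean.
    labels = [[min(range(k), key=lambda c: (v - init_means[c]) ** 2) for v in row]
              for row in img]
    half = max(len(img), len(img[0]))  # window initially covers the whole image (S52)
    while True:
        beta = 0.0  # assumption: the spatial constraint restarts for each window size
        for _ in range(max_iter):
            means = local_means(img, labels, k, half)
            new = reclassify(img, labels, means, k, beta)
            beta += beta_step  # strengthen the spatial constraint linearly (S58)
            if new == labels:  # convergence for this window size (S60)
                break
            labels = new
        if half <= min_half:  # converged at the minimum window size (S64)
            return labels
        half = max(min_half, half // 2)

# A 6x6 image whose left half is dark and right half is bright.
img = [[10, 12, 14, 200, 205, 210] for _ in range(6)]
result = segment(img, init_means=[0, 255])
```

Shrinking the window lets the class means follow gradual local changes in intensity, which is the spatial adaptivity referred to above. In this toy image the two halves are already well separated, so the labels do not change as the window shrinks.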
`S4 & S6: Perceptual Grouping
`The segmented regions may be grouped into larger seg-
`ments that consist of regions that belong to the same object.
`Perceptual grouping can be non-purposive and purposive.
`Referring to FIG. 2, non-purposive perceptual grouping S4
`can eliminate over-segmentation due to large illumination
`differences, for example, a table or wall with remarkable
`illumination falloff over a distance. Purposive perceptual
`grouping S6 is generally based on smooth, noncoincidental
`connection of joints between parts of the same object, and in
`certain cases models of typical objects (for example, a
`person has head, torso and limbs).
`Perceptual grouping facilitates the recognition of high-
`level vision features. Without proper perceptual grouping, it
`is difficult to perform object recognition and proper assess-
`ment of such properties as size and shape. Perceptual
`grouping includes: merging small regions into large regions
`based on similarity in properties and compactness of the
`would-be merged region (non-purposive grouping); and
`grouping parts that belong to the same object based on
`commonly shared background, compactness of the would-be
`merged region, smoothness in contour connection between
`regions, and model of specific object (purposive grouping).
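The similarity-based part of non-purposive grouping can be illustrated with the sketch below, which merges 4-adjacent regions whose mean intensities are close. It is a deliberately reduced example. Compactness of the would-be merged region is not evaluated, region means are not updated after a merge, and the function name, the intensity representation, and the `max_diff` threshold are assumptions.

```python
def merge_similar_regions(labels, img, max_diff=10.0):
    """Merge 4-adjacent regions whose mean intensities differ by at most max_diff."""
    h, w = len(labels), len(labels[0])
    sums, counts = {}, {}
    for y in range(h):
        for x in range(w):
            r = labels[y][x]
            sums[r] = sums.get(r, 0.0) + img[y][x]
            counts[r] = counts.get(r, 0) + 1
    mean = {r: sums[r] / counts[r] for r in sums}

    # Union-find over region ids; each merged group keeps its smallest id.
    parent = {r: r for r in mean}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    for y in range(h):
        for x in range(w):
            for yy, xx in ((y + 1, x), (y, x + 1)):
                if yy < h and xx < w:
                    ra, rb = labels[y][x], labels[yy][xx]
                    a, b = find(ra), find(rb)
                    if a != b and abs(mean[ra] - mean[rb]) <= max_diff:
                        parent[max(a, b)] = min(a, b)
    return [[find(r) for r in row] for row in labels]

# Regions 0 and 1 are similar in intensity; region 2 is much brighter.
labels = [[0, 0, 1, 1],
          [0, 0, 2, 2]]
img = [[10, 10, 14, 14],
       [10, 10, 200, 200]]
merged = merge_similar_regions(labels, img)
```

Regions 0 and 1 are merged, as would be desired for a single surface split by illumination falloff, while the much brighter region 2 stays separate.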
S8: Feature Extraction
`For each region, an extensive set of features, which are
`shown to contribute to visual attention, are extracted and
`associated evidences are then comput

