`
`
`
`Research Assignment
`
`Converting 2D to 3D: A Survey
`
`
`
`
`
`Supervisors: Assoc. Prof. Dr. Ir. E. A. Hendriks
`
`
`Dr. Ir. P. A. Redert
`
`Information and Communication Theory Group (ICT)
`Faculty of Electrical Engineering, Mathematics and Computer Science
`Delft University of Technology, the Netherlands
`
`
`Qingqing Wei
`Student Nr: 9936241
`Email: weiqingqing@yahoo.com
`
`
`December 2005
`
`
`
`Information and Communication Theory Group
`
`
`________________________________________________________________________
`
`
`Title:
`
`Converting 2D to 3D: A Survey
`
`
`
`
Author:

Q. Wei

Reviewers:

E. A. Hendriks (TU Delft)
P. A. Redert
`
`
`
`________________________________________________________________________
`
`
`
`
Customer:

The Digital Signal Processing Group of Philips Research
`_____________________________________________________________________
`
`Keywords:
`
`2D to 3D conversion, depth cue, depth map, survey, comparison,
`3D TV
`
`
`Abstract:
`
`
`
The survey investigates the existing 2D to 3D conversion algorithms developed in the
past 30 years by various computer vision research communities across the world.
According to the depth cues on which the algorithms rely, the algorithms are classified
into the following 12 categories: binocular disparity, motion, defocus, focus,
silhouette, atmosphere scattering, shading, linear perspective, patterned texture,
symmetric patterns, occlusion (curvature, simple transform) and statistical patterns.
The survey describes and analyzes algorithms that use a single depth cue as well as
several promising approaches using multiple cues, establishing an overview of the field
and evaluating the relative position of each algorithm within it.
`________________________________________________________________________
`.
Conclusion: The results of some 2D to 3D conversion algorithms are the 3D coordinates of
only a small set of points in the images. This group of algorithms is less suitable for
the 3D television application.

The depth cues based on multiple images yield, in general, more accurate results, while
the depth cues based on a single still image are more versatile.

A single solution that converts the entire class of 2D images to 3D models does not
exist. Combining depth cues enhances the accuracy of the results. It has been observed
that machine learning is a new and promising research direction in 2D to 3D conversion,
and it is also helpful to explore alternatives rather than to confine ourselves to the
conventional methods based on depth maps.
`
`
`Project:
`
`Research Assignment for Master Program Media and Knowledge
`Engineering of Delft University of Technology
`
`
`
`
Contents

1  Introduction ............................................................................ 1

2  2D to 3D Conversion Algorithms .......................................................... 3
   2.1   Binocular disparity ............................................................... 3
   2.2   Motion ............................................................................ 5
   2.3   Defocus using more than two images ................................................ 7
   2.4   Focus ............................................................................. 9
   2.5   Silhouette ........................................................................ 9
   2.6   Defocus using a single image ...................................................... 11
   2.7   Linear perspective ................................................................ 11
   2.8   Atmosphere scattering ............................................................. 12
   2.9   Shading ........................................................................... 13
   2.10  Patterned texture ................................................................. 14
   2.11  Bilateral symmetric pattern ....................................................... 16
   2.12  Occlusions ........................................................................ 18
   2.13  Statistical patterns .............................................................. 20
   2.14  Other depth cues .................................................................. 21
`
`3 Comparison ............................................................................................................. 22
`
`4 A new Trend: Pattern Recognition in Depth Estimation.................................... 28
`
`5 Discussion and Conclusion..................................................................................... 32
`
`6 Bibliography ............................................................................................................ 34
`
`
`
`
`
`
`
`
`
`
`• The picture on the cover is taken from http://www.ddd.com.
`
`
`
`
`
`
`
`
`
`
`1 Introduction
`
`Three-dimensional television (3D-TV) is nowadays often seen as the next major
`milestone in the ultimate visual experience of media. Although the concept of
`stereoscopy has existed for a long time, the breakthrough from conventional 2D
broadcasting to real-time 3D broadcasting is still pending. However, in recent years there
has been rapid progress in the fields of image capture, coding and display [1], which
brings the realm of 3D closer to reality than ever before.
`
The world of 3D incorporates the third dimension of depth, which can be perceived by
human vision in the form of binocular disparity. The human eyes are located at slightly
different positions and therefore perceive different views of the real world. The brain is then
`able to reconstruct the depth information from these different views. A 3D display takes
`advantage of this phenomenon, creating two slightly different images of every scene and
`then presenting them to the individual eyes. With an appropriate disparity and calibration
`of parameters, a correct 3D perception can be realized.
`
`An important step in any 3D system is the 3D content generation. Several special
cameras have been designed to generate 3D models directly. For example, a stereoscopic
`dual-camera makes use of a co-planar configuration of two separate, monoscopic
`cameras, each capturing one eye’s view, and depth information is computed using
`binocular disparity. A depth-range camera is another example. It is a conventional video
`camera enhanced with an add-on laser element, which captures a normal two-dimensional
`RGB image and a corresponding depth map. A depth map is a 2D function that gives the
`depth (with respect to the viewpoint) of an object point as a function of the image
`coordinates. Usually, it is represented as a gray level image with the intensity of each
pixel registering its depth. The laser element emits a light wall towards the real-world
scene; the light hits the objects in the scene and is reflected back. The reflection is
subsequently registered and used for the construction of a depth map.
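As a minimal illustration of this gray-level representation, the following sketch maps
metric depth values to 8-bit intensities; the working range and the near-is-bright
convention are assumptions of the sketch, not properties of any particular camera:

```python
import numpy as np

def depth_to_gray(depth_m, d_min=0.5, d_max=10.0):
    """Map metric depth values to an 8-bit gray-level image.

    d_min and d_max define an assumed working range; nearer points are
    rendered brighter, so intensity encodes inverse distance to the viewer.
    """
    clipped = np.clip(depth_m, d_min, d_max)
    gray = 255.0 * (d_max - clipped) / (d_max - d_min)
    return gray.astype(np.uint8)

# Toy 4x4 scene with a near object (1 m) in front of a far background (8 m).
depth = np.full((4, 4), 8.0)
depth[1:3, 1:3] = 1.0
print(depth_to_gray(depth))
```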
`
`
`
`
`
`
`
Figure 1: A 2D image and its depth map (Figure source: http://www.extra.research.philips.com/euprojects/attest/)
All the techniques described above are used to generate 3D content directly, which will
certainly contribute to the prevalence of 3D-TV. However, the tremendous amount of
`
`
`
`
`
`
`
`
`
current and past media data is in 2D format, and it should be possible to view this
material with a stereoscopic effect as well. This is where 2D to 3D conversion methods
come to the rescue. These methods recover the depth information by analyzing and
processing the 2D image structures. Figure 1 shows the typical product of a 2D to 3D
conversion algorithm: the
`corresponding depth map of a conventional 2D image. A diversity of 2D to 3D
`conversion algorithms has been developed by the computer vision community. Each
`algorithm has its own strengths and weaknesses. Most conversion algorithms make use of
certain depth cues to generate depth maps. Examples of such depth cues are the defocus
or the motion that may be present in the images.
`
This survey describes and analyzes algorithms that use a single depth cue as well as
several promising approaches using multiple cues, establishing an overview of the field
of conversion algorithms and evaluating the relative position of each algorithm within
it. This may contribute to the development of novel depth cues and help to build better
algorithms using combined depth cues.
`
`The structure of the survey is as follows. In Chapter 2, one or multiple representative
`algorithms for every individual depth cue are selected and their working principles are
`briefly reviewed. Chapter 3 gives a comparison of these algorithms in several aspects.
Taking this evaluation into consideration, one relatively promising approach based on the
investigated depth cues is chosen and described in more detail in Chapter 4. Finally,
Chapter 5 presents the conclusions of the survey.
`
`
`
`
`
`
`
`
`
`2 2D to 3D Conversion Algorithms
`
`Depending on the number of input images, we can categorize the existing conversion
`algorithms into two groups: algorithms based on two or more images and algorithms
`based on a single still image. In the first case, the two or more input images could be
`taken either by multiple fixed cameras located at different viewing angles or by a single
`camera with moving objects in the scenes. We call the depth cues used by the first group
`the multi-ocular depth cues. The second group of depth cues operates on a single still
image, and they are referred to as the monocular depth cues. Table 1 summarizes the
depth cues used in 2D to 3D conversion algorithms and their representative works. A
review of the algorithms using each depth cue is given below.
`Table 1: Depth Cues and Their Representative Algorithms
Two or more images (binocular or multi-ocular):
    Binocular disparity             Correlation-based and feature-based correspondence; triangulation [2][3]
    Motion                          Optical flow [2]; factorization [10]; Kalman filter [11]
    Defocus                         Local image decomposition using the Hermite polynomial basis [4]; inverse filtering [12]; S-Transform [13]
    Focus                           A set of images of different focus levels and sharpness estimation [5]
    Silhouette                      Voxel-based and deformable mesh model [6]

One single image (monocular):
    Defocus                         Second Gaussian derivative [7]
    Linear perspective              Vanishing line detection and gradient plane assignment [8]
    Atmosphere scattering           Light scattering model [15]
    Shading                         Energy minimization [17]
    Patterned texture               Frontal texel [19]
      (incorporates relative size)
    Symmetric patterns              Combination of photometric and geometric constraints [21]
    Occlusion (curvature)           Smoothing curvature and isophote [22]
    Occlusion (simple transform)    Shortest path [23]
    Statistical patterns            Color-based heuristics [8]; statistical estimators [25]
`
`2.1 Binocular disparity
`
With two images of the same scene captured from slightly different viewpoints, binocular
disparity can be utilized to recover the depth of an object; this is the main mechanism
of human depth perception. First, a set of corresponding points in the image pair is
found. Then, by means of the triangulation method, the depth information can be
`retrieved with a high degree of accuracy (see Figure 2) when all the parameters of the
`stereo system are known. When only intrinsic camera parameters are available, the depth
`can be recovered correctly up to a scale factor. In the case when no camera parameters
`are known, the resulting depth is correct up to a projective transformation [2].
`
`
`
`
`
`
`
`
Figure 2: Disparity

Assume p_l and p_r are the projections of the 3D point P on the left image and the right
image, and O_l and O_r are the origins of the camera coordinate systems of the left and
right cameras. Based on the relationship between the similar triangles (P, O_l, O_r) and
(P, p_l, p_r) shown in Figure 2, the depth value Z of the point P can be obtained:

    Z = \frac{fT}{d}                                                          (2.1)

where f is the focal length, T is the baseline distance between the two camera origins,
and d = x_l - x_r is the disparity, which measures the difference in retinal position
between corresponding image points. The disparity value of a point is often interpreted
as the inverse of the distance to the observed object. Therefore, finding the disparity
map is essential for the construction of the depth map.
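A minimal sketch of equation (2.1), assuming example values for the focal length (in
pixels) and the baseline T rather than calibrated ones:

```python
import numpy as np

def disparity_to_depth(disparity_px, f_px=700.0, baseline_m=0.1, eps=1e-6):
    """Apply Z = f*T/d per pixel, following equation (2.1).

    f_px (focal length in pixels) and baseline_m (baseline T) are assumed
    example values; in practice they come from camera calibration.
    Near-zero disparities are mapped to infinite depth.
    """
    d = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(d.shape, np.inf)
    valid = np.abs(d) > eps
    depth[valid] = f_px * baseline_m / d[valid]
    return depth

# 70 px -> 1 m, 35 px -> 2 m, 7 px -> 10 m for f = 700 px and T = 0.1 m.
print(disparity_to_depth(np.array([[70.0, 35.0], [7.0, 0.0]])))
```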
`
`The most time-consuming aspect of depth estimation algorithms based on binocular
`disparity is the stereo correspondence problem. Stereo correspondence, also known as
`stereo matching, is one of the most active research areas in computer vision. Given an
`image point on the left image, how can one find the matching image point in the right
image? Due to inherent ambiguities in the image pairs, such as occlusion, the general
stereo matching problem is hard to solve. Several constraints have been introduced to
`make the problem solvable. Epipolar geometry and camera calibration are the two most
`frequently used constraints. With these two constraints, image pairs can be rectified.
`Another widely accepted assumption is the photometric constraint, which states that the
`intensities of the corresponding pixels are similar to each other. The ordering constraint
`states that the order of points in the image pair is usually the same. The uniqueness
`constraint claims that each feature can have one match at most, and the smoothness
`constraint (also known as the continuity constraint) says that disparity changes smoothly
`almost everywhere. Some of these constraints are hard, like for example, the epipolar
geometry, while others, such as the smoothness constraint, are soft. The taxonomy of
Scharstein and Szeliski [3], together with their website, the "Middlebury Stereo Vision
Page" [9], has investigated the performance of approximately 40 stereo correspondence
algorithms running on pairs of rectified images. Different algorithms impose various sets of
`constraints.
`
`
`
`
`
`
`
`
`The current stereo correspondence algorithms are based on the correlation of local
`windows, on the matching of a sparse set of image features, or on global optimization.
`When comparing the correlation between windows in the two images, the corresponding
`element is given by the window where the correlation is maximized. A traditional
similarity measure is the sum of squared differences (SSD). The local algorithms
`generate a dense disparity map. Feature-based methods are conceptually very similar to
`correlation-based methods, but they only search for correspondences of a sparse set of
`image features. The similarity measure must be adapted to the type of feature used.
`Nowadays global optimization methods are becoming popular because of their good
`performance. They make explicit use of the smoothness constraints and try to find a
`disparity assignment that minimizes a global energy function. The global energy is
`typically a combination of the matching cost and the smoothness term, where the latter
usually measures the differences between the disparities of neighboring pixels. These
algorithms are differentiated from each other mainly by the minimization step that is
used, e.g. dynamic programming or graph cuts.
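A minimal sketch of the correlation-based (local window) approach with the SSD measure,
assuming a rectified gray-level image pair; the window size and disparity range are
arbitrary, and no sub-pixel refinement, consistency check or global smoothness term is
included:

```python
import numpy as np

def ssd_block_matching(left, right, max_disp=32, half_win=3):
    """Dense disparity map by SSD window matching along rectified scanlines.

    For each pixel of the left image, the surrounding window is compared
    with windows in the right image shifted by 0..max_disp pixels along the
    same scanline (epipolar constraint); the shift minimizing the SSD is
    taken as the disparity of that pixel.
    """
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    disparity = np.zeros((h, w), dtype=np.int32)
    for y in range(half_win, h - half_win):
        for x in range(half_win + max_disp, w - half_win):
            ref = left[y - half_win:y + half_win + 1,
                       x - half_win:x + half_win + 1]
            best_d, best_cost = 0, np.inf
            for d in range(max_disp + 1):
                cand = right[y - half_win:y + half_win + 1,
                             x - d - half_win:x - d + half_win + 1]
                cost = np.sum((ref - cand) ** 2)   # SSD similarity measure
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y, x] = best_d
    return disparity
```

The resulting disparity map can then be converted to depth with equation (2.1); global
methods instead minimize an energy that combines this matching cost with an explicit
smoothness term.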
`
`2.2 Motion
`
`The relative motion between the viewing camera and the observed scene provides an
`important cue to depth perception: near objects move faster across the retina than far
`objects do. The extraction of 3D structures and the camera motion from image sequences
`is termed as structure from motion. The motion may be seen as a form of “disparity over
`time”, represented by the concept of motion field. The motion field is the 2D velocity
`vectors of the image points, induced by the relative motion between the viewing camera
`and the observed scene. The basic assumptions for structure-from-motion are that the
objects do not deform and their movements are linear. Suppose that there is only one
rigid relative motion, denoted by V, between the camera and the scene. Let
P = [X, Y, Z]^T be a 3D point in the conventional camera reference frame. The relative
motion V between P and the camera can be described as [2]:

    V = -T - \omega \times P                                                  (2.2)

where T and ω are the translational velocity vector and the angular velocity of the
camera, respectively. The connection between the depth of a 3D point and its 2D motion
field is incorporated in the basic equations of the motion field, which combine equation
(2.2) with the knowledge of perspective projection:
`
`
    v_x = \frac{T_z x - T_x f}{Z} - \omega_y f + \omega_z y + \frac{\omega_x x y}{f} - \frac{\omega_y x^2}{f}        (2.3)

    v_y = \frac{T_z y - T_y f}{Z} + \omega_x f - \omega_z x - \frac{\omega_y x y}{f} + \frac{\omega_x y^2}{f}        (2.4)
`
`
`
`
`
`
`
`(
`∇
`
`
`)E v E
`T
`+
`t
`
`Where xv and yv are the components of motion field in x and y direction respectively; Z is
`the depth of the corresponding 3D point; and the subscripts x , y and z indicate the
`component of the x-axis, y-axis and z-axis directions. In order to solve this basic equation
`for depth values, various constraints and simplifications have been developed to lower
`the degree of freedom of the equation, which leads to the different algorithms for depth
`estimation, each suitable for solving problem in a specific domain. Some of them
`compute the motion field explicitly before recovering the depth information; others
`estimate the 3D structure directly with motion field integrated in the estimation process.
`An example of the latter is the factorization algorithm [10], where the registered
`measurement matrix, containing entries of the normalized image point coordinates over
`several video frames, is converted into a product of a shape matrix and motion matrix.
`The shape matrix registers the coordinates of the 3D object, and the motion matrix
`describes the rotation of a set of 3D points with respect to the camera. An introduction to
`explicit motion estimation methods is given below.
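Before turning to explicit motion estimation, a minimal sketch of the rank-3
factorization step just described may be useful, assuming the registered measurement
matrix has already been built from tracked, per-frame-centered image coordinates; the
metric upgrade of [10], which enforces orthonormal camera axes, is omitted:

```python
import numpy as np

def factorize(W):
    """Factor a 2F x P registered measurement matrix into motion and shape.

    W holds the image coordinates of P points tracked over F frames, with
    the per-frame mean subtracted. Under an affine camera model W is close
    to rank 3, so the three largest singular components give a motion
    matrix M (2F x 3) and a shape matrix S (3 x P) with W ~= M @ S,
    determined only up to an affine ambiguity.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(s[:3])
    M = U[:, :3] * sqrt_s          # camera motion rows
    S = sqrt_s[:, None] * Vt[:3]   # 3D shape (point coordinates)
    return M, S
```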
`
`Dominant algorithms of motion field estimation are either optical flow based or feature
`based. Optical flow, also known as apparent motion of the image brightness pattern, is
considered to be an approximation of the motion field. Optical flow is subject to the
constraint that the apparent brightness of moving objects remains constant, which is
described by the image brightness constancy equation:

    (\nabla E)^T v + E_t = 0                                                  (2.5)

where it is assumed that the image brightness E is a function of the image coordinates
and time; \nabla E is the spatial gradient and E_t denotes the partial derivative with
respect to time. After computing the spatial and temporal derivatives of the image
brightness for a small N × N patch, we can solve (2.5) to obtain the motion field for
that patch. This
`method is notorious for its noise sensitivity, which requires extra treatments such as
`tracking the motion across a long image sequence or imposing more constraints. In
`general, current optical flow methods yield dense but less accurate depth maps.
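A minimal least-squares sketch of solving equation (2.5) over one small patch, directly
following the description above; the simple derivative approximations and the patch size
are assumptions, and no multi-frame tracking or additional constraints are included:

```python
import numpy as np

def patch_flow(prev_frame, curr_frame, y, x, half_win=2):
    """Estimate the 2D motion vector (v_x, v_y) of one small patch.

    Stacks the brightness constancy equation (grad E)^T v + E_t = 0 for
    every pixel of the (2*half_win+1)^2 patch and solves the resulting
    overdetermined linear system in the least-squares sense.
    """
    prev_frame = np.asarray(prev_frame, dtype=np.float64)
    curr_frame = np.asarray(curr_frame, dtype=np.float64)
    Ey, Ex = np.gradient(prev_frame)        # spatial derivatives of brightness
    Et = curr_frame - prev_frame            # crude temporal derivative
    win = (slice(y - half_win, y + half_win + 1),
           slice(x - half_win, x + half_win + 1))
    A = np.stack([Ex[win].ravel(), Ey[win].ravel()], axis=1)
    b = -Et[win].ravel()
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v                                # approximate motion field vector
```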
`
`Another group of motion estimation algorithms is based on tracking separate features in
the image sequence, generating sparse depth maps. The Kalman filter [11], for example, is a
`frequently used technique. It is a recursive algorithm that estimates the position and
`uncertainty of moving feature points in the subsequent frame.
`
It is worth noting that a sufficiently small average spatial disparity of corresponding
points in consecutive frames is beneficial to the stability and robustness of 3D
reconstruction based on the time integration of long frame sequences. On the other hand,
when the average disparity between frames is large, the depth reconstruction can be done
in the same way as for binocular disparity (stereo). The motion field becomes equal to
the stereo disparity map only if the spatial and temporal variations between frames are
sufficiently small.
`
`
`=
`0
`
`(2.5)
`
`
`
`
`
`2.3 Defocus using more than two images
`
`2
`
`1
`u
`
`1
`+ =
`v
`
`1
`f
`
`
`
`(2.6)
`
`Depth-from-defocus methods generate a depth map from the degree of blurring present in
`the images. In a thin lens system, objects that are in-focus are clearly pictured whilst
`objects at other distances are defocused, i.e. blurred. Figure 3 shows a thin lens model of
an out-of-focus real-world point P projected onto the image plane. Its corresponding
projection is a circular blur patch with constant brightness, centered at P'' with a blur
radius of σ. The blur is caused by the convolution of the ideal projected image with the
camera point spread function (PSF) g(x, y, σ(x, y)), where (x, y) are the coordinates of
the image point P''. It is usually assumed that σ(x, y) = σ, where σ is a constant for a
given window, to simplify the system, and a Gaussian function is used to simulate the PSF:

    g(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}

In order to estimate the depth u, we need the following two equations. The fundamental
equation of thin lenses describes the relation between u, v and f as:

    \frac{1}{u} + \frac{1}{v} = \frac{1}{f}                                   (2.6)
`Pentland [12] has derived a relationship between the distance u (Figure 3) and the
blur σ in equation (2.7):

    u = \frac{fs}{s - f - k\sigma f}    if u > v,
    u = \frac{fs}{s - f + k\sigma f}    if u < v.                             (2.7)

where u is the depth, v is the distance between the lens and the position of perfect
focus, s is the distance between the lens and the image plane, f is the focal length of
the lens, and k is a constant determined by the lens system. Of these, s, f and k are
camera parameters, which can be determined by camera calibration. Please note that the
second case, u < v, can indeed occur: for example, when f < u < 2f, the fundamental
equation of thin lenses (2.6) yields v > 2f, and thus u < v.

With equation (2.7), the problem of computing the depth u is converted into the task of
estimating the camera parameters and the blur parameter σ. When the camera parameters
are obtained from camera calibration, the depth u can be computed from equation (2.7)
once the blur parameter σ is known. The depth-from-defocus algorithms therefore
concentrate on blur radius estimation techniques.
`
`
`
`
`
`
`
`
`Figure 3: Thin lens model2
Equation (2.7) also indicates that when the blur radius σ and all camera parameters
except the focal length f are known, the depth u cannot be exactly determined. With two
unknowns, u and f, equation (2.7) is under-constrained. In this case, the observed signal
can be the projection of an out-of-focus step edge, an in-focus smooth transition (e.g. a
smooth texture), or one of infinitely many situations in between these two extremes [7].
This causes ambiguity when estimating the blur parameter. To tackle the problem, most
depth-from-defocus algorithms rely on two or more images of the same scene, taken from
the same position with different camera focal settings, to determine the blur radius.
Once the blur radius is estimated and the camera parameters are obtained from camera
calibration, the depth can be computed with equation (2.7).
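A minimal sketch of this last step, assuming the blur radius is already estimated; the
values of s, f and k below are illustrative stand-ins for calibrated parameters, and the
flag selects between the two cases of equation (2.7):

```python
def depth_from_blur(sigma, u_greater_than_v, s=0.052, f=0.05, k=2.0):
    """Compute the depth u from the blur radius sigma via equation (2.7).

    s: lens-to-image-plane distance, f: focal length, k: lens constant
    (assumed example values; normally obtained by camera calibration).
    u_greater_than_v selects the first case (u > v) of equation (2.7).
    """
    if u_greater_than_v:
        return f * s / (s - f - k * sigma * f)
    return f * s / (s - f + k * sigma * f)

# A perfectly sharp point (sigma = 0) lies at the in-focus distance f*s/(s-f).
print(depth_from_blur(0.0, True))   # about 1.3 m for these example values
```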
`
`The blur radius estimation techniques are based on, for example, inverse filtering [12],
`where the blur is estimated by solving a linear regression problem, or on S-Transform
`[13], which involves spatial domain convolution/de-convolution transform. Another
`example is the approach proposed by Ziou, Wang and Vaillancourt. It relies on a local
`image decomposition technique using the Hermite polynomial basis [4]. It is based on the
`fact that the depth can be computed once the camera parameters are available and the blur
`difference between two images, taken with different focal lengths, is known. The blur
`difference is retrieved by solving a set of equations, derived from the observation that the
coefficients of the Hermite polynomial estimated from the more blurred image are a
function of the partial derivatives of the less blurred image and of the blur difference.
`
`
`2 Figure source: reference [7]
`
`
`
`
`2.4 Focus
`
`The depth-from-focus approach is closely related to the family of algorithms using depth
`from defocus. The main difference is that the depth-from-focus requires a series of
`images of the scene with different focus levels by varying and registering the distance
between the camera and the scene, while depth-from-defocus only needs two or more
images with fixed object and camera positions and uses different camera focal settings.
`Figure 4 illustrates the principle of the depth-from-focus approach [5]. An object with an
`arbitrary surface is placed at the translational stage, which moves towards the camera
`(optics) starting from the reference plane. The focused plane is defined by the optics. It is
`located at the position where all points on it are focused on the camera sensor plane. Let
`‘s’ be a surface point on the object. When moving the stage towards the focused plane,
`the images of ‘s’ become more and more focused and will obtain its maximum sharpness
`when ‘s’ reaches the focused plane. After this, moving ‘s’ furthermore makes its image
defocused again. During this process, the displacements of the translational stage are
registered. If we assume that the displacement is d_{focused} when 's' is maximally
focused, and that the distance between the focused plane and the reference plane is d_f,
then the depth value of 's' relative to the stage is determined as
d_s = d_f - d_{focused}. Applying this
`same procedure for all surface elements and interpolating the focus measures, a dense
`depth map can be constructed.
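A minimal sketch of this procedure, assuming a stack of registered gray-level images
taken at known stage displacements; the squared discrete Laplacian used as the focus
(sharpness) measure and the absence of interpolation are simplifications:

```python
import numpy as np

def depth_from_focus(stack, displacements):
    """Per pixel, pick the stage displacement at which the image is sharpest.

    stack: (N, H, W) array of N registered images of the scene,
    displacements: the N registered stage displacements.
    Returns d_focused per pixel; the depth relative to the stage then
    follows as d_s = d_f - d_focused, with d_f the focused-plane distance.
    """
    stack = np.asarray(stack, dtype=np.float64)
    sharpness = np.empty_like(stack)
    for i, img in enumerate(stack):
        lap = (-4.0 * img
               + np.roll(img, 1, axis=0) + np.roll(img, -1, axis=0)
               + np.roll(img, 1, axis=1) + np.roll(img, -1, axis=1))
        sharpness[i] = lap ** 2        # simple local focus (sharpness) measure
    best = np.argmax(sharpness, axis=0)
    return np.asarray(displacements, dtype=np.float64)[best]
```

Practical systems additionally interpolate the focus measure between neighboring
displacements to obtain sub-step accuracy, as mentioned above.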
`
Figure 4: Depth from focus (Figure source: reference [5])
`
`
`
`2.5 Silhouette
`
`A silhouette of an object in an image refers to the contour separating the object from the
`background. Shape-from-silhouette methods require multiple views of the scene taken by
`cameras from different viewpoints. Such a process together with correct texturing
`
`
`
`
`
`
`
`generates a full 3D model of the objects in the scene, allowing viewers to observe a live
`scene from an arbitrary viewpoint.
`
`Shape-from-silhouette requires accurate camera calibration. For each image, the
`silhouette of the target objects is segmented using background subtraction. The retrieved
`silhouettes are back projected to a common 3D space (see Figure 5) with projection
`centers equal to the camera locations. Back-projecting a silhouette produces a cone-like
`volume. The intersection of all the cones forms the visual hull of the target 3D object,
`which is often processed in the voxel representation. This 3D reconstruction procedure is
`referred to as shape-from-silhouette.
`
`
`
`
`Figure 5: Silhouette volume intersection4
`Matsuyama [6] proposed an approach using parallel computing via a PC cluster system.
`Instead of computing the intersection of 3D cones directly, the 3D voxel space is
`partitioned into a group of parallel planes. Each PC is assigned a task to compute the
`cross section of the 3D object volume on one specific plane. By stacking up such cross
`sections, the voxel representation of the 3D object shape is reconstructed. In this way, the
`3D volume intersection problem is decomposed into 2D intersection computation sub-
`problems which are concurrently carried out by all PCs. This leads to a promising speed
`gain. Furthermore, in order to capture the 3D object accurately, Matsuyama introduced a
`deformable mesh model, converting the 3D voxel volume into a surface mesh composed
`of triangular patches. According to a set of constraints, the surface mesh is deformed to
`fit the object surface. An example of the constraints is the 3D motion flow constraint,
`which requests that the mesh be adapted dynamically in conformity with object actions.
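A minimal voxel-carving sketch of the visual-hull construction (not Matsuyama's
plane-based parallel scheme), assuming calibrated 3x4 projection matrices and silhouette
masks from background subtraction:

```python
import numpy as np

def visual_hull(voxels, projections, silhouettes):
    """Keep only the voxels whose projections fall inside every silhouette.

    voxels: (N, 3) array of candidate 3D points from a discretized volume,
    projections: list of 3x4 camera projection matrices,
    silhouettes: list of boolean masks (H, W), one per camera view.
    """
    keep = np.ones(len(voxels), dtype=bool)
    hom = np.hstack([voxels, np.ones((len(voxels), 1))])    # homogeneous coords
    for P, mask in zip(projections, silhouettes):
        proj = hom @ P.T
        u = proj[:, 0] / proj[:, 2]
        v = proj[:, 1] / proj[:, 2]
        h, w = mask.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        ui = np.clip(u, 0, w - 1).astype(int)
        vi = np.clip(v, 0, h - 1).astype(int)
        keep &= inside & mask[vi, ui]    # voxel must lie inside this cone too
    return voxels[keep]
```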
`
A shape-from-silhouette algorithm is often followed by a texturing algorithm. The visual
`hull is a geometry that encloses the captured object, but it does not capture the concave
`portion of the object that is not visible on the silhouette. Moreover, the number of views
`is often limited to make the processing time reasonable. This leads to a coarse geometry
`of the visual hull. Texturing assigns colors to the voxels on the surface of the visual hull
`and is therefore an indispensable step in creating realistic renderings.
`
`
`
`
`4 Figure source: reference [6]
`
`
`
`
`2.6 Defocus using a single image
`
`In section 2.3, depth-from-defocus algorithms based on two or more images are
`introduced. The reason for using more images is to eliminate the ambiguity in blur radius
`estimation when the focal setting of the camera is unknown. The images, with which this
`group of algorithms works, are required to be taken from a fixed camera position and
object position but using different focal settings. However, only a small amount of 2D
video material satisfies this condition; for example, the focus settings are changed when
it is necessary to redirect the audience's attention from foreground to background or
vice versa. To make defocus a depth cue suitable for conventional video content, where we
have no control over the focal settings of the camera, Wong and Ernst [7] have proposed a
blur estimation technique using a single image, based on the second derivative of a
Gaussian filter [14]. When an edge of blur radius σ is filtered with a second derivative
of a Gaussian of standard deviation s, the response has a positive and a negative peak.
Denote the distance between the peaks as d, which can be measured directly from the
filtered image. The blur radius is then computed according to the formula

    \sigma^2 = \left(\frac{d}{2}\right)^2 - s^2

(see Figure 6). With the estimated blur radius and the camera parameters obtained from
`camera calibration, a depth map can be generated that is based on equation (2.7). When
`the camera parameters are unknown, we can still estimate the relative depth level of each
`pixel based on its estimated blur radius by mapping a large blur value into a higher depth
`level and a smaller blur value to a lower depth level.
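A minimal sketch of this estimation, assuming the peak distance d and the filter scale s
have been measured; the eight-level quantization used for the relative depth is an
arbitrary choice:

```python
import numpy as np

def blur_radius(peak_distance, s):
    """sigma^2 = (d/2)^2 - s^2, clamped to zero when the measured peak
    distance is not larger than twice the filter scale."""
    return np.sqrt(max((peak_distance / 2.0) ** 2 - s ** 2, 0.0))

def relative_depth_level(sigma, sigma_max=8.0, levels=8):
    """Map a blur radius to one of `levels` relative depth levels;
    larger blur -> higher (farther) depth level, as described in the text."""
    return int(round(min(sigma / sigma_max, 1.0) * (levels - 1)))

print(blur_radius(10.0, 3.0))       # sqrt(25 - 9) = 4.0
print(relative_depth_level(4.0))    # a mid-range depth level
```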
`
`
`
`Figure 6: Blur radius estimation: (a) Blurred edge; (b) Second derivative of Gaussian filter with
standard deviation s; (c) Filter response, the distance d between the peaks can be measured from the
`filtered image5
`
`2.7 Linear perspective
`
`Linear perspective refers to the fact that parallel lines, such as railroad tracks, appear to
`converge with distance, eventually reaching a vanishing point at the horizon. The more
`the lines converge, the farther away they appear to be. A recent representative work is the
gradient plane assignment approach proposed by Battiato, Curti et al. [8]. Their method
`performs well for single images containing sufficient objects of a rigid and geometric
`appearance. First, edge detection is employed to locate the predominant lines in the
`image. Then, the intersection points of these lines are determined. The intersection with
`
`
`5 Figure source: reference [7]