`Apple Inc. v. Corephotonics
`
`
`
`Texts in Computer Science
`
`Editors
`David Gries
`Fred B. Schneider
`
For further volumes:
`www.springer.com/series/3191
`
`
`
`
`
`
`Richard Szeliski
`
`Computer Vision
`
`Algorithms and Applications
`
Springer
`
`
`
`
`
`
`Dr. Richard Szeliski
`Microsoft Research
`
`One Microsoft Way
`98052-6399 Redmond
`Washington
`USA
szeliski@microsoft.com
`
`Series Editors
`David Gries
`Department of Computer Science
`Upson Hall
`Cornell University
`Ithaca, NY 14853-7501, USA
`
`Fred B. Schneider
`Department of Computer Science
Upson Hall
`Cornell University
`Ithaca, NY 14853-7501, USA
`
`ISSN 1868-0941
`ISBN 978-1-84882-934-3
`DOI 10.1007/978-1-84882-935-0
`Springer London Dordrecht Heidelberg New York
`
`e-ISSN 1868-095X
`e-ISBN 978-1-84882-935-0
`
`British Library Cataloguing in Publication Data
`A catalogue record for this book is available from the British Library
`
`Library of Congress Control Number: 2010936817
`
`© Springer-Verlag London Limited 2011
`Apart from any fair dealing for the purposes of research or private study, or criticism or review, as
`permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced,
`stored or transmitted,
`in any form or by any means, with the prior permission in writing of the
publishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by
`the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent
`to the publishers.
The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a
`specific statement, that such names are exempt from the relevant laws and regulations and therefore free
`for general use.
The publisher makes no representation, express or implied, with regard to the accuracy of the information
`contained in this book and cannot accept any legal responsibility or liability for any errors or omissions
`that may be made.
`
`Printed on acid-free paper
`
`Springer is part of Springer Science+Business Media (www.springer.com)
`
`
`
`
`
`
`
`
`
easier to express exact rotations. When the angle is in radians, the derivatives of R with respect to ω can easily be computed (2.36).
Quaternions, on the other hand, are better if you want to keep track of a smoothly moving camera, since there are no discontinuities in the representation. It is also easier to interpolate between rotations and to chain rigid transformations (Murray, Li, and Sastry 1994; Bregler and Malik 1998).
My usual preference is to use quaternions, but to update their estimates using an incremental rotation, as described in Section 6.2.2.
`
`2.1.5 3D to 2D projections
`
Now that we know how to represent 2D and 3D geometric primitives and how to transform them spatially, we need to specify how 3D primitives are projected onto the image plane. We can do this using a linear 3D to 2D projection matrix. The simplest model is orthography, which requires no division to get the final (inhomogeneous) result. The more commonly used model is perspective, since this more accurately models the behavior of real cameras.
`
`Orthography and para-perspective
`
An orthographic projection simply drops the z component of the three-dimensional coordinate p to obtain the 2D point x. (In this section, we use p to denote 3D points and x to denote 2D points.) This can be written as

$$x = \begin{bmatrix} I_{2\times 2} & 0 \end{bmatrix} p.$$

If we are using homogeneous (projective) coordinates, we can write
`
$$\tilde{x} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tilde{p},$$
`
i.e., we drop the z component but keep the w component. Orthography is an approximate model for long focal length (telephoto) lenses and objects whose depth is shallow relative to their distance to the camera (Sawhney and Hanson 1991). It is exact only for telecentric lenses (Baker and Nayar 1999, 2001).
In practice, world coordinates (which may measure dimensions in meters) need to be scaled to fit onto an image sensor (physically measured in millimeters, but ultimately measured in pixels). For this reason, scaled orthography is actually more commonly used,
`
$$x = \begin{bmatrix} s I_{2\times 2} & 0 \end{bmatrix} p.$$
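For concreteness, here is a minimal Python/NumPy sketch of the two projection models above, using made-up point coordinates and an assumed scale factor s:

import numpy as np

# Camera-centered 3D points, one per row (made-up values).
p = np.array([[0.2, 0.1, 5.0],
              [0.4, -0.3, 6.0]])

# Plain orthography: drop the z component, x = [I_2x2 | 0] p.
x_ortho = p[:, :2]

# Scaled orthography: x = [s I_2x2 | 0] p, with an assumed scale s
# (e.g., world units to pixels).
s = 100.0
x_scaled = s * p[:, :2]
print(x_ortho, x_scaled)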
`
`
`
`
Figure 2.7 Commonly used projection models: (a) 3D view of world, (b) orthography, (c) scaled orthography, (d) para-perspective, (e) perspective, (f) object-centered. Each diagram shows a top-down view of the projection. Note how parallel lines on the ground plane and box sides remain parallel in the non-perspective projections.
`
`
`
`
`
`
This model is equivalent to first projecting the world points onto a local fronto-parallel image plane and then scaling this image using regular perspective projection. The scaling can be the same for all parts of the scene (Figure 2.7b) or it can be different for objects that are being modeled independently (Figure 2.7c). More importantly, the scaling can vary from frame to frame when estimating structure from motion, which can better model the scale change that occurs as an object approaches the camera.
Scaled orthography is a popular model for reconstructing the 3D shape of objects far away from the camera, since it greatly simplifies certain computations. For example, pose (camera orientation) can be estimated using simple least squares (Section 6.2.1). Under orthography, structure and motion can simultaneously be estimated using factorization (singular value decomposition), as discussed in Section 7.3 (Tomasi and Kanade 1992).
`A closely related projection model is para-perspective (Aloimonos 1990; Poelman and
`Kanade 1997).
In this model, object points are again first projected onto a local reference parallel to the image plane. However, rather than being projected orthogonally to this plane, they are projected parallel to the line of sight to the object center (Figure 2.7d). This is followed by the usual projection onto the final image plane, which again amounts to a scaling. The combination of these two projections is therefore affine and can be written as
`
$$\tilde{x} = \begin{bmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ 0 & 0 & 0 & 1 \end{bmatrix} \tilde{p}.$$
`
Note how parallel lines in 3D remain parallel after projection in Figure 2.7b–d. Para-perspective provides a more accurate projection model than scaled orthography, without incurring the added complexity of per-pixel perspective division, which invalidates traditional factorization methods (Poelman and Kanade 1997).
`
`Perspective
`
The most commonly used projection in computer graphics and computer vision is true 3D perspective (Figure 2.7e). Here, points are projected onto the image plane by dividing them by their z component. Using inhomogeneous coordinates, this can be written as
`
$$\bar{x} = \mathcal{P}_z(p) = \begin{bmatrix} x/z \\ y/z \\ 1 \end{bmatrix}.$$
`
In homogeneous coordinates, the projection has a simple linear form,
`
$$\tilde{x} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \tilde{p},$$
i.e., we drop the w component of p. Thus, after projection, it is not possible to recover the distance of the 3D point from the image, which makes sense for a 2D imaging sensor.
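A small NumPy sketch of this projection (with made-up values; the homogeneous matrix below is the one just given):

import numpy as np

def project_perspective(p):
    """Perspective projection of a camera-centered point p = (x, y, z): divide by z."""
    return np.array([p[0] / p[2], p[1] / p[2], 1.0])

# Equivalent homogeneous form: drop the w component of p~ = (x, y, z, 1).
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
p_tilde = np.array([0.2, 0.1, 5.0, 1.0])    # made-up point
x_tilde = P @ p_tilde                       # homogeneous result (x, y, z)
print(project_perspective(p_tilde[:3]), x_tilde / x_tilde[2])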
A form often seen in computer graphics systems is a two-step projection that first projects 3D coordinates into normalized device coordinates in the range (x, y, z) ∈ [−1, 1] × [−1, 1] × [0, 1], and then rescales these coordinates to integer pixel coordinates using a viewport transformation (Watt 1995; OpenGL-ARB 1997). The (initial) perspective projection is then represented using a 4 × 4 matrix
`
$$\tilde{x} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -\dfrac{z_{\mathrm{far}}}{z_{\mathrm{range}}} & \dfrac{z_{\mathrm{near}} z_{\mathrm{far}}}{z_{\mathrm{range}}} \\ 0 & 0 & 1 & 0 \end{bmatrix} \tilde{p},$$
`
where z_near and z_far are the near and far z clipping planes and z_range = z_far − z_near. Note that the first two rows are actually scaled by the focal length and the aspect ratio so that visible rays are mapped to (x, y, z) ∈ [−1, 1]².
`
`
`
`
`
`
Figure 2.8 Projection of a 3D camera-centered point p_c onto the sensor plane at location p. O_c is the camera center (nodal point), c_s is the 3D origin of the sensor plane coordinate system, and s_x and s_y are the pixel spacings.
`
The reason for keeping the third row, rather than dropping it, is that visibility operations, such as z-buffering, require a depth for every graphical element that is being rendered.
`
If we set z_near = 1, z_far → ∞, and switch the sign of the third row, the third element of the normalized screen vector becomes the inverse depth, i.e., the disparity (Okutomi and Kanade 1993). This can be quite convenient in many cases since, for cameras moving around outdoors, the inverse depth to the camera is often a more well-conditioned parameterization than direct 3D distance.
`
`While a regular 2D image sensor has no way of measuring distance to a surface point,
`range sensors (Section 12.2) and stereo matching algorithms (Chapter 11) can compute such
`values. It is then convenient to be able to map from a sensor-based depth or disparity value d
`directly back to a 3D location using the inverse of a 4 x 4 matrix (Section 2.1.5). We can do
`this if we represent perspective projection using a full-rank 4 x 4 matrix, as in (2.64).
`
Camera intrinsics
`
Once we have projected a 3D point through an ideal pinhole using a projection matrix, we must still transform the resulting coordinates according to the pixel sensor spacing and the relative position of the sensor plane to the origin. Figure 2.8 shows an illustration of the geometry involved. In this section, we first present a mapping from 2D pixel coordinates to 3D rays using a sensor homography M_s, since this is easier to explain in terms of physically measurable quantities. We then relate these quantities to the more commonly used camera intrinsic matrix K, which is used to map 3D camera-centered points p_c to 2D pixel coordinates x_s.
`
Image sensors return pixel values indexed by integer pixel coordinates (x_s, y_s), often with the coordinates starting at the upper-left corner of the image and moving down and to the right. (This convention is not obeyed by all imaging libraries, but the adjustment for other coordinate systems is straightforward.) To map pixel centers to 3D coordinates, we first scale the (x_s, y_s) values by the pixel spacings (s_x, s_y) (sometimes expressed in microns for solid-state sensors) and then describe the orientation of the sensor array relative to the camera projection center O_c with an origin c_s and a 3D rotation R_s (Figure 2.8).
`
`
`
`
`The combined 2D to 3D projection can then be written as
`
$$p = \begin{bmatrix} R_s & c_s \end{bmatrix} \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix} = M_s \bar{x}_s.$$
`
The first two columns of the 3 × 3 matrix M_s are the 3D vectors corresponding to unit steps in the image pixel array along the x_s and y_s directions, while the third column is the 3D image array origin c_s.
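As a small illustrative sketch (with made-up sensor geometry), the sensor homography can be built directly from its definition above:

import numpy as np

def sensor_homography(R_s, c_s, s_x, s_y):
    """M_s maps homogeneous pixel coordinates (x_s, y_s, 1) to 3D sensor-plane points;
    its columns are s_x * (first column of R_s), s_y * (second column of R_s), and c_s."""
    Rc = np.hstack([R_s, c_s.reshape(3, 1)])     # [R_s | c_s], a 3 x 4 matrix
    A = np.array([[s_x, 0.0, 0.0],
                  [0.0, s_y, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    return Rc @ A                                # the 3 x 3 matrix M_s

# Made-up geometry: axis-aligned sensor and an assumed pixel spacing.
M_s = sensor_homography(np.eye(3), np.array([-0.5, -0.4, 1.0]), 2e-3, 2e-3)
print(M_s @ np.array([100.0, 50.0, 1.0]))        # 3D location of pixel (100, 50)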
The matrix M_s is parameterized by eight unknowns: the three parameters describing the rotation R_s, the three parameters describing the translation c_s, and the two scale factors (s_x, s_y). Note that we ignore here the possibility of skew between the two axes on the image plane, since solid-state manufacturing techniques render this negligible. In practice, unless we have accurate external knowledge of the sensor spacing or sensor orientation, there are only seven degrees of freedom, since the distance of the sensor from the origin cannot be teased apart from the sensor spacing, based on external image measurement alone.
However, estimating a camera model M_s with the required seven degrees of freedom (i.e., where the first two columns are orthogonal after an appropriate re-scaling) is impractical, so most practitioners assume a general 3 × 3 homogeneous matrix form.
The relationship between the 3D pixel center p and the 3D camera-centered point p_c is given by an unknown scaling s, p = s p_c. We can therefore write the complete projection between p_c and a homogeneous version of the pixel address x̃_s as

$$\tilde{x}_s = \alpha M_s^{-1} p_c = K p_c.$$
`
The 3 × 3 matrix K is called the calibration matrix and describes the camera intrinsics (as opposed to the camera's orientation in space, which are called the extrinsics).
From the above discussion, we see that K has seven degrees of freedom in theory and eight degrees of freedom (the full dimensionality of a 3 × 3 homogeneous matrix) in practice. Why, then, do most textbooks on 3D computer vision and multi-view geometry (Faugeras 1993; Hartley and Zisserman 2004; Faugeras and Luong 2001) treat K as an upper-triangular matrix with five degrees of freedom?
While this is usually not made explicit in these books, it is because we cannot recover the full K matrix based on external measurement alone. When calibrating a camera (Chapter 6) based on external 3D points or other measurements (Tsai 1987), we end up estimating the intrinsic (K) and extrinsic (R, t) camera parameters simultaneously using a series of measurements,
`
$$\tilde{x}_s \sim K \begin{bmatrix} R & t \end{bmatrix} p_w = P p_w,$$

where p_w are known 3D world coordinates and

$$P = K \begin{bmatrix} R & t \end{bmatrix}$$
`
is known as the camera matrix. Inspecting this equation, we see that we can post-multiply K by R_1 and pre-multiply [R | t] by R_1^T, and still end up with a valid calibration. Thus, it is impossible based on image measurements alone to know the true orientation of the sensor and the true camera intrinsics.
`
`
`
`
`
`
`
`
Figure 2.9 Simplified camera intrinsics showing the focal length f and the optical center (c_x, c_y). The image width and height are W and H.
`
The choice of an upper-triangular form for K seems to be conventional. Given a full 3 × 4 camera matrix P = K[R | t], we can compute an upper-triangular K matrix using QR factorization (Golub and Van Loan 1996). (Note the unfortunate clash of terminologies: In matrix algebra textbooks, R represents an upper-triangular (right of the diagonal) matrix; in computer vision, R is an orthogonal rotation.)
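As an illustrative sketch (assuming SciPy is available; the example camera matrix and the sign-fixing convention are assumptions, not the book's code), K and R can be recovered from the left 3 × 3 block of P with an RQ factorization:

import numpy as np
from scipy.linalg import rq

def decompose_camera(P):
    """Split a 3x4 camera matrix P = K [R | t] into K, R, t (one possible convention)."""
    M = P[:, :3]
    K, R = rq(M)                        # M = K @ R, K upper triangular, R orthogonal
    S = np.diag(np.sign(np.diag(K)))    # absorb signs so that diag(K) > 0
    K, R = K @ S, S @ R                 # (K S)(S R) = K R, since S S = I
    t = np.linalg.solve(K, P[:, 3])     # last column of P is K t
    return K / K[2, 2], R, t            # K rescaled so K[2, 2] = 1; R may need a det check

# Fabricated example: build P from a known K, R, t and check the round trip.
K0 = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
a = np.deg2rad(30.0)
R0 = np.array([[np.cos(a), -np.sin(a), 0.0], [np.sin(a), np.cos(a), 0.0], [0.0, 0.0, 1.0]])
t0 = np.array([0.1, -0.2, 2.0])
P = K0 @ np.hstack([R0, t0[:, None]])
K, R, t = decompose_camera(P)
print(np.allclose(K, K0), np.allclose(R, R0), np.allclose(t, t0))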
`There are several ways to write the upper-triangular form of K. One possibility is
`
$$K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad (2.57)$$
`
which uses independent focal lengths f_x and f_y for the sensor x and y dimensions. The entry s encodes any possible skew between the sensor axes due to the sensor not being mounted perpendicular to the optical axis and (c_x, c_y) denotes the optical center expressed in pixel coordinates. Another possibility is
`
$$K = \begin{bmatrix} f & s & c_x \\ 0 & a f & c_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad (2.58)$$
`
where the aspect ratio a has been made explicit and a common focal length f is used.
In practice, for many applications an even simpler form can be obtained by setting a = 1 and s = 0,
`
$$K = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}. \qquad (2.59)$$
`
Often, setting the origin at roughly the center of the image, e.g., (c_x, c_y) = (W/2, H/2), where W and H are the image width and height, can result in a perfectly usable camera model with a single unknown, i.e., the focal length f.
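A minimal sketch of this single-unknown model, with an assumed focal length in pixels:

import numpy as np

def simple_intrinsics(f, W, H):
    """Calibration matrix with unit aspect ratio, zero skew, and the optical
    center at the image center, as in (2.59)."""
    return np.array([[f, 0.0, W / 2.0],
                     [0.0, f, H / 2.0],
                     [0.0, 0.0, 1.0]])

K = simple_intrinsics(f=1000.0, W=640, H=480)    # assumed focal length in pixels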
Figure 2.9 shows how these quantities can be visualized as part of a simplified imaging model. Note that now we have placed the image plane in front of the nodal point (projection center of the lens). The sense of the y axis has also been flipped to get a coordinate system compatible with the way that most imaging libraries treat the vertical (row) coordinate. Certain graphics libraries, such as Direct3D, use a left-handed coordinate system, which can lead to some confusion.
`
`
`
`
`
`
Figure 2.10 Central projection, showing the relationship between the 3D and 2D coordinates, p and x, as well as the relationship between the focal length f, image width W, and the field of view θ.
`
`A note on focal lengths
`
`The issue of how to express focal lengths is one that often causes confusion in implementing
`computer vision algorithms and discussing their results. This is because the focal length
`depends on the units used to measure pixels.
If we number pixel coordinates using integer values, say [0, W) × [0, H), the focal length f and camera center (c_x, c_y) in (2.59) can be expressed as pixel values. How do these quantities relate to the more familiar focal lengths used by photographers?
Figure 2.10 illustrates the relationship between the focal length f, the sensor width W, and the field of view θ, which obey the formula

$$\tan\frac{\theta}{2} = \frac{W}{2f} \quad\text{or}\quad f = \frac{W}{2}\left[\tan\frac{\theta}{2}\right]^{-1}. \qquad (2.60)$$
For conventional film cameras, W = 35mm, and hence f is also expressed in millimeters. Since we work with digital images, it is more convenient to express W in pixels so that the focal length f can be used directly in the calibration matrix K as in (2.59).
Another possibility is to scale the pixel coordinates so that they go from [−1, 1) along the longer image dimension and [−a⁻¹, a⁻¹) along the shorter axis, where a ≥ 1 is the image aspect ratio (as opposed to the sensor cell aspect ratio introduced earlier). This can be accomplished using modified normalized device coordinates,

$$x'_s = (2x_s - W)/S \quad\text{and}\quad y'_s = (2y_s - H)/S, \quad\text{where } S = \max(W, H). \qquad (2.61)$$
`
This has the advantage that the focal length f and optical center (c_x, c_y) become independent of the image resolution, which can be useful when using multi-resolution, image-processing algorithms, such as image pyramids (Section 3.5).² The use of S instead of W also makes the focal length the same for landscape (horizontal) and portrait (vertical) pictures, as is the case in 35mm photography. (In some computer graphics textbooks and systems, normalized device coordinates go from [−1, 1] × [−1, 1], which requires the use of two different focal lengths to describe the camera intrinsics (Watt 1995; OpenGL-ARB 1997).) Setting S = W = 2 in (2.60), we obtain the simpler (unitless) relationship
`
`
`
$$f^{-1} = \tan\frac{\theta}{2}. \qquad (2.62)$$
`
² To make the conversion truly accurate after a downsampling step in a pyramid, floating point values of W and H would have to be maintained since they can become non-integral if they are ever odd at a larger resolution in the pyramid.
`
`
`
`
The conversion between the various focal length representations is straightforward, e.g., to go from a unitless f to one expressed in pixels, multiply by W/2, while to convert from an f expressed in pixels to the equivalent 35mm focal length, multiply by 35/W.
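These conversions are easy to wrap in small helper functions (a sketch with assumed example values):

import math

def f_pixels_from_fov(theta_deg, W):
    """f = (W / 2) / tan(theta / 2), with W the image width in pixels, as in (2.60)."""
    return (W / 2.0) / math.tan(math.radians(theta_deg) / 2.0)

def f_unitless(f_pix, W):
    """Unitless focal length: divide a pixel focal length by W / 2."""
    return f_pix / (W / 2.0)

def f_35mm(f_pix, W):
    """35mm-equivalent focal length: multiply a pixel focal length by 35 / W."""
    return f_pix * 35.0 / W

f_pix = f_pixels_from_fov(60.0, W=640)           # roughly 554 pixels for a 60 degree FOV
print(f_pix, f_unitless(f_pix, 640), f_35mm(f_pix, 640))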
`
`Camera matrix
`
`Now that we have shown how to parameterize the calibration matrix K, we can put the
`camera intrinsics and extrinsics together to obtain a single 3 x 4 camera matrix
`
$$P = K \begin{bmatrix} R & t \end{bmatrix}. \qquad (2.63)$$
`
`It is sometimes preferable to use an invertible 4 x 4 matrix, which can be obtained by not
`dropping the last row in the P matrix,
`
$$\tilde{P} = \begin{bmatrix} K & 0 \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} R & t \\ 0^T & 1 \end{bmatrix} = \tilde{K} E, \qquad (2.64)$$
`
where E is a 3D rigid-body (Euclidean) transformation and K̃ is the full-rank calibration matrix. The 4 × 4 camera matrix P̃ can be used to map directly from 3D world coordinates p̄_w = (x_w, y_w, z_w, 1) to screen coordinates (plus disparity), x_s = (x_s, y_s, 1, d),

$$x_s \sim \tilde{P} \bar{p}_w, \qquad (2.65)$$

where ∼ indicates equality up to scale. Note that after multiplication by P̃, the vector is divided by the third element of the vector to obtain the normalized form x_s = (x_s, y_s, 1, d).
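As a sketch of this pipeline (all numeric values are made up, and the last row of P̃ is left in its standard inverse-depth form), the following fragment composes P̃ = K̃E, projects a world point to (x_s, y_s, 1, d), and then inverts the mapping:

import numpy as np

# Made-up intrinsics and pose, for illustration only.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, 0.0, 2.0])

# Full-rank 4 x 4 camera matrix P~ = K~ E, as in (2.64).
K4 = np.eye(4)
K4[:3, :3] = K
E = np.eye(4)
E[:3, :3] = R
E[:3, 3] = t
P4 = K4 @ E

# Project a world point and normalize by the third element, as in (2.65);
# the fourth element then holds the disparity d (inverse depth here).
p_w = np.array([0.5, -0.2, 4.0, 1.0])
x_s = P4 @ p_w
x_s = x_s / x_s[2]

# Map the pixel-plus-disparity vector back to the 3D world point.
p_back = np.linalg.inv(P4) @ x_s
print(x_s, p_back / p_back[3])    # p_back / p_back[3] matches p_w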
`
`Plane plus parallax (projective depth)
`
In general, when using the 4 × 4 matrix P̃, we have the freedom to remap the last row to whatever suits our purpose (rather than just being the "standard" interpretation of disparity as inverse depth). Let us re-write the last row of P̃ as p_3 = s_3 [n̂_0 | c_0], where ‖n̂_0‖ = 1. We then have the equation

$$d = \frac{s_3}{z}\left(\hat{n}_0 \cdot p_w + c_0\right), \qquad (2.66)$$

where z = p_2 · p̄_w = r_z · (p_w − c) is the distance of p_w from the camera center c (2.25) along the optical axis z (Figure 2.11). Thus, we can interpret d as the projective disparity or projective depth of a 3D scene point p_w from the reference plane n̂_0 · p_w + c_0 = 0 (Szeliski and Coughlan 1997; Szeliski and Golland 1999; Shade, Gortler, He et al. 1998; Baker, Szeliski, and Anandan 1998). (The projective depth is also sometimes called parallax in reconstruction algorithms that use the term plane plus parallax (Kumar, Anandan, and Hanna 1994; Sawhney 1994).) Setting n̂_0 = 0 and c_0 = 1, i.e., putting the reference plane at infinity, results in the more standard d = 1/z version of disparity (Okutomi and Kanade 1993).
Another way to see this is to invert the P̃ matrix so that we can map pixels plus disparity directly back to 3D points,

$$\bar{p}_w \sim \tilde{P}^{-1} x_s. \qquad (2.67)$$
In general, we can choose P̃ to have whatever form is convenient, i.e., to sample space using an arbitrary projection. This can come in particularly handy when setting up multi-view
`
`
`
Figure 2.11 Regular disparity (inverse depth) and projective depth (parallax from a reference plane).
`
stereo reconstruction algorithms, since it allows us to sweep a series of planes (Section 11.1.2) through space with a variable (projective) sampling that best matches the sensed image motions (Collins 1996; Szeliski and Golland 1999; Saito and Kanade 1999).
`
`Mapping from one camera to another
`
What happens when we take two images of a 3D scene from different camera positions or orientations (Figure 2.12a)? Using the full-rank 4 × 4 camera matrix P̃ = K̃E from (2.64), we can write the projection from world to screen coordinates as

$$\tilde{x}_0 \sim \tilde{K}_0 E_0 p = \tilde{P}_0 p.$$
`
Assuming that we know the z-buffer or disparity value d_0 for a pixel in one image, we can compute the 3D point location p using

$$p \sim E_0^{-1} \tilde{K}_0^{-1} \tilde{x}_0$$
`
and then project it into another image yielding

$$\tilde{x}_1 \sim \tilde{K}_1 E_1 p = \tilde{K}_1 E_1 E_0^{-1} \tilde{K}_0^{-1} \tilde{x}_0 = \tilde{P}_1 \tilde{P}_0^{-1} \tilde{x}_0 = M_{10} \tilde{x}_0. \qquad (2.70)$$
`
Unfortunately, we do not usually have access to the depth coordinates of pixels in a regular photographic image. However, for a planar scene, as discussed above in (2.66), we can replace the last row of P̃_0 in (2.64) with a general plane equation, n̂_0 · p + c_0, that maps points on the plane to d_0 = 0 values (Figure 2.12b). Thus, if we set d_0 = 0, we can ignore the last column of M_10 in (2.70) and also its last row, since we do not care about the final z-buffer depth. The mapping equation (2.70) thus reduces to

$$\tilde{x}_1 \sim \tilde{H}_{10} \tilde{x}_0,$$

where H̃_10 is a general 3 × 3 homography matrix and x̃_1 and x̃_0 are now 2D homogeneous coordinates (i.e., 3-vectors) (Szeliski 1996). This justifies the use of the 8-parameter homography as a general alignment model for mosaics of planar scenes (Mann and Picard 1994; Szeliski 1996).
`
`
`
`
Figure 2.12 A point is projected into two images: (a) relationship between the 3D point coordinate (X, Y, Z, 1) and the 2D projected point (x, y, 1, d); (b) planar homography induced by points all lying on a common plane n̂_0 · p + c_0 = 0.
`
The other special case where we do not need to know depth to perform inter-camera mapping is when the camera is undergoing pure rotation (Section 9.1.3), i.e., when t_0 = t_1. In this case, we can write

$$\tilde{x}_1 \sim K_1 R_1 R_0^{-1} K_0^{-1} \tilde{x}_0 = K_1 R_{10} K_0^{-1} \tilde{x}_0, \qquad (2.72)$$
`
which again can be represented with a 3 × 3 homography. If we assume that the calibration matrices have known aspect ratios and centers of projection (2.59), this homography can be parameterized by the rotation amount and the two unknown focal lengths. This particular formulation is commonly used in image-stitching applications (Section 9.1.3).
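A small sketch of this pure-rotation mapping, with assumed focal lengths, principal points, and rotation:

import numpy as np

def intrinsics(f, cx, cy):
    return np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])

# Assumed intrinsics for both views and a small rotation about the y axis.
K0 = intrinsics(700.0, 320.0, 240.0)
K1 = intrinsics(700.0, 320.0, 240.0)
theta = np.deg2rad(5.0)
R10 = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                [0.0, 1.0, 0.0],
                [-np.sin(theta), 0.0, np.cos(theta)]])

# Homography induced by the pure rotation, H10 = K1 R10 K0^-1, as in (2.72).
H10 = K1 @ R10 @ np.linalg.inv(K0)

x0 = np.array([400.0, 260.0, 1.0])    # homogeneous pixel in image 0
x1 = H10 @ x0
print(x1 / x1[2])                     # corresponding pixel in image 1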
`
`Object-centered projection
`
When working with long focal length lenses, it often becomes difficult to reliably estimate the focal length from image measurements alone. This is because the focal length and the distance to the object are highly correlated and it becomes difficult to tease these two effects apart. For example, the change in scale of an object viewed through a zoom telephoto lens can either be due to a zoom change or a motion towards the user. (This effect was put to dramatic use in some scenes of Alfred Hitchcock's film Vertigo, where the simultaneous change of zoom and camera motion produces a disquieting effect.)
This ambiguity becomes clearer if we write out the projection equations corresponding to the simple calibration matrix K (2.59),

$$x_s = f \, \frac{r_x \cdot p + t_x}{r_z \cdot p + t_z} + c_x, \qquad (2.73)$$
$$y_s = f \, \frac{r_y \cdot p + t_y}{r_z \cdot p + t_z} + c_y, \qquad (2.74)$$
`
`
`
where r_x, r_y, and r_z are the three rows of R. If the distance to the object center t_z ≫ ‖p‖ (the size of the object), the denominator is approximately t_z and the overall scale of the projected object depends on the ratio of f to t_z. It therefore becomes difficult to disentangle these two quantities.
`
`
`
`
To see this more clearly, let η_z = t_z⁻¹ and s = η_z f. We can then re-write the above equations as

$$x_s = s \, \frac{r_x \cdot p + t_x}{1 + \eta_z (r_z \cdot p)} + c_x,$$
$$y_s = s \, \frac{r_y \cdot p + t_y}{1 + \eta_z (r_z \cdot p)} + c_y$$
(Szeliski and Kang 1994; Pighin, Hecker, Lischinski et al. 1998). The scale of the projection s can be reliably estimated if we are looking at a known object (i.e., the 3D coordinates p are known). The inverse distance η_z is now mostly decoupled from the estimates of s and can be estimated from the amount of foreshortening as the object rotates. Furthermore, as the lens becomes longer, i.e., the projection model becomes orthographic, there is no need to replace a perspective imaging model with an orthographic one, since the same equation can be used, with η_z → 0 (as opposed to f and t_z both going to infinity). This allows us to form a natural link between orthographic reconstruction techniques such as factorization and their projective/perspective counterparts (Section 7.3).
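A sketch of the object-centered projection equations above (identity rotation and made-up values for s, η_z, and the optical center):

import numpy as np

def object_centered_project(p, R, t_x, t_y, s, eta_z, c_x, c_y):
    """Object-centered projection with scale s = eta_z * f and inverse distance
    eta_z; as eta_z -> 0 this reduces to scaled orthography."""
    r_x, r_y, r_z = R                       # the three rows of the rotation matrix
    denom = 1.0 + eta_z * (r_z @ p)
    x_s = s * (r_x @ p + t_x) / denom + c_x
    y_s = s * (r_y @ p + t_y) / denom + c_y
    return x_s, y_s

# Made-up values: identity rotation, a small object, a distant camera.
print(object_centered_project(np.array([0.05, -0.02, 0.01]), np.eye(3),
                              t_x=0.0, t_y=0.0, s=500.0, eta_z=0.1,
                              c_x=320.0, c_y=240.0))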
`
`
`2.1.6 Lens distortions
`
The above imaging models all assume that cameras obey a linear projection model where straight lines in the world result in straight lines in the image. (This follows as a natural consequence of linear matrix operations being applied to homogeneous coordinates.) Unfortunately, many wide-angle lenses have noticeable radial distortion, which manifests itself as a visible curvature in the projection of straight lines. (See Section 2.2.3 for a more detailed discussion of lens optics, including chromatic aberration.) Unless this distortion is taken into account, it becomes impossible to create highly accurate photorealistic reconstructions. For example, image mosaics constructed without taking radial distortion into account will often exhibit blurring due to the mis-registration of corresponding features before pixel blending (Chapter 9).
Fortunately, compensating for radial distortion is not that difficult in practice. For most lenses, a simple quartic model of distortion can produce good results. Let (x_c, y_c) be the pixel coordinates obtained after perspective division but before scaling by focal length f and shifting by the optical center (c_x, c_y), i.e.,

$$x_c = \frac{r_x \cdot p + t_x}{r_z \cdot p + t_z},$$
$$y_c = \frac{r_y \cdot p + t_y}{r_z \cdot p + t_z}.$$
`
The radial distortion model says that coordinates in the observed images are displaced away (barrel distortion) or towards (pincushion distortion) the image center by an amount proportional to their radial distance (Figure 2.13a–b).³ The simplest radial distortion models use low-order polynomials, e.g.,

$$\hat{x}_c = x_c \left(1 + \kappa_1 r_c^2 + \kappa_2 r_c^4\right),$$
$$\hat{y}_c = y_c \left(1 + \kappa_1 r_c^2 + \kappa_2 r_c^4\right),$$
`
³ Anamorphic lenses, which are widely used in feature film production, do not follow this radial distortion model. Instead, they can be thought of, to a first approximation, as inducing different vertical and horizontal scalings, i.e., non-square pixels.
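A minimal sketch of this radial distortion model (the coefficient values are made up, and the usual definition r_c² = x_c² + y_c² of the radial distance is assumed, since the excerpt cuts off before stating it):

def radial_distort(x_c, y_c, k1, k2):
    """Apply the low-order radial distortion model to normalized coordinates
    (x_c, y_c), i.e., after perspective division and before applying the focal
    length and optical center.  Assumes r_c^2 = x_c^2 + y_c^2."""
    r2 = x_c ** 2 + y_c ** 2
    factor = 1.0 + k1 * r2 + k2 * r2 ** 2
    return x_c * factor, y_c * factor

# Made-up distortion coefficients.
print(radial_distort(0.3, -0.2, k1=0.05, k2=0.001))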
`
`
`
`
`
`
`
`
Humans perceive the three-dimensional structure of the world with apparent ease. However, despite all of the recent advances in computer vision research, the dream of having a computer interpret an image at the same level as a two-year old remains elusive. Why is computer vision such a challenging problem and what is the current state of the art?
`
Computer Vision: Algorithms and Applications explores the variety of techniques commonly used to analyze and interpret images. It also describes challenging real-world applications where vision is being successfully used, both for specialized applications such as medical imaging, and for fun, consumer-level tasks such as image editing and stitching, which students can apply to their own personal photos and videos.
More than just a source of "recipes," this exceptionally authoritative and comprehensive textbook/reference also takes a scientific approach to basic vision problems, formulating physical models of the imaging process before inverting them to produce descriptions of a scene. These problems are also analyzed using statistical models and solved using rigorous engineering techniques.
`
Topics and Features:

* Structured to support active curricula and project-oriented courses, with tips in the Introduction for using the book in a variety of customized courses

* Presents exercises at the end of each chapter with a heavy emphasis on testing algorithms and containing numerous suggestions for small mid-term projects
`
* Provides additional material and more detailed mathematical topics in the Appendices, which cover linear algebra, numerical techniques, and Bayesian estimation theory

* Suggests additional reading at the end of each chapter, including the latest research in each sub-field, in addition to a full Bibliography at the end of the book
`
* Supplies supplementary course material for students at the associated website, http://szeliski.org/Book/
`
Suitable for an upper-level undergraduate or graduate-level course in computer science or engineering, this textbook focuses on basic techniques that work under real-world conditions and encourages students to push their creative boundaries. Its design and exposition also make it eminently suitable as a unique reference to the fundamental techniques and current research literature in computer vision.
Dr. Richard Szeliski has more than 25 years' experience in computer vision research, most notably at Digital Equipment Corporation and Microsoft Research. This text draws on that experience, as well as on computer vision courses he has taught at the University of Washington and Stanford.
`
springer.com
`
`ISBN 978-1-84882-934-3
`
`
`