`
`Applications
`
`Richard Szeliski
`
`Digital Equipment Corporation
`
`Cambridge Research Lab
`
`CRL 94/2
`
`May, 1994
`
`VALEO EXHIBIT 1029
`Valeo v. Magna
`IPR2015-____
`(cid:57)(cid:36)(cid:47)(cid:40)(cid:50)(cid:3)(cid:40)(cid:59)(cid:17)(cid:3)(cid:20)(cid:19)(cid:21)(cid:28)(cid:66)(cid:19)(cid:19)(cid:20)
`
`
`
`Image Mosaicing for Tele-Reality
`
`Applications
`
`Richard Szeliski
`
`Digital Equipment Corporation
`
`Cambridge Research Lab
`
`CRL 94/2
`
`May, 1994
`
`(cid:57)(cid:36)(cid:47)(cid:40)(cid:50)(cid:3)(cid:40)(cid:59)(cid:17)(cid:3)(cid:20)(cid:19)(cid:21)(cid:28)(cid:66)(cid:19)(cid:19)(cid:21)
`
`
`
`Digital Equipment Corporation has four research facilities: the Systems Research Center and the
`Western Research Laboratory, both in Palo Alto, California; the Paris Research Laboratory, in
`Paris; and the Cambridge Research Laboratory, in Cambridge, Massachusetts.
`
`The Cambridge laboratory became operational in 1988 and is located at One Kendall Square,
`near MIT. CRL engages in computing research to extend the state of the computing art in areas
`likely to be important to Digital and its customers in future years. CRL’s main focus is
`applications technology; that is, the creation of knowledge and tools useful for the preparation of
`important classes of applications.
`
`CRL Technical Reports can be ordered by electronic mail. To receive instructions, send a mes-
`sage to one of the following addresses, with the word help in the Subject line:
`
`On Digital’s EASYnet:
`On the Internet:
`
`CRL::TECHREPORTS
`techreports@crl.dec.com
`
`This work may not be copied or reproduced for any commercial purpose. Permission to copy without payment is
`granted for non-profit educational and research purposes provided all such copies include a notice that such copy-
`ing is by permission of the Cambridge Research Lab of Digital Equipment Corporation, an acknowledgment of the
`authors to the work, and all applicable portions of the copyright notice.
`
`The Digital logo is a trademark of Digital Equipment Corporation.
`
`TM
`
`Cambridge Research Laboratory
`One Kendall Square
`Cambridge, Massachusetts 02139
`
`(cid:57)(cid:36)(cid:47)(cid:40)(cid:50)(cid:3)(cid:40)(cid:59)(cid:17)(cid:3)(cid:20)(cid:19)(cid:21)(cid:28)(cid:66)(cid:19)(cid:19)(cid:22)
`
`
`
`Image Mosaicing for Tele-Reality
`
`Applications
`
`Richard Szeliski
`
`Digital Equipment Corporation
`
`Cambridge Research Lab
`
`CRL 94/2
`
`May, 1994
`
`Abstract
`
`While a large number of virtual reality applications, such as fluid flow analysis and molecular
`
`modeling, deal with simulated data, many newer applications attempt to recreate true reality as
`
`convincingly as possible. Building detailed models for such applications, which we call tele-reality,
`
`is a major bottleneck holding back their deployment.
`
`In this paper, we present techniques for
`
`automatically deriving realistic 2-D scenes and 3-D texture-mapped models from video sequences,
`
`which can help overcome this bottleneck. The fundamental technique we use is image mosaicing,
`
`i.e., the automatic alignment of multiple images into larger aggregates which are then used to
`
`represent portions of a 3-D scene. We begin with the easiest problems, those of flat scene and
`
`panoramic scene mosaicing, and progress to more complicated scenes, culminating in full 3-D
`
`models. We also present a number of novel applications based on tele-reality technology.
`
`Keywords:
`
`image mosaics, image registration, image compositing, planar scenes, panoramic
`
`scenes, 3-D scene recovery, motion estimation, structure from motion, virtual reality, tele-reality.
`
`c Digital Equipment Corporation 1994. All rights reserved.
`
`(cid:57)(cid:36)(cid:47)(cid:40)(cid:50)(cid:3)(cid:40)(cid:59)(cid:17)(cid:3)(cid:20)(cid:19)(cid:21)(cid:28)(cid:66)(cid:19)(cid:19)(cid:23)
`
`
`
`Contents
`
`Contents
`
`1 Introduction ✁ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎
`
`2 Related work ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ✁ ✄ ✂ ☎
`
`3 Basic imaging equations
`
` ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ✁ ✄ ✂ ☎
`
`4 Planar image mosaicing ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎
`
`4.1 Local image registration ✁ ✄ ✂ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ✄
`
`4.2 Global image registration ✄ ✂ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ✄
`
`4.3 Results ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✂ ✄ ✁ ✄
`
`5 Panoramic image mosaicing ✂ ✄ ✁ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ☎
`
`5.1 Results ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✂ ✄ ✁ ✄
`
`6 Scenes with arbitrary depth ✂ ✄ ✁ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ☎
`
`6.1 Formulation ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ☎
`
`6.2 Results ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✂ ✄ ✁ ✄
`
`7 Full 3-D model recovery ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎
`
`8 Applications ✁ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎
`
`8.1 Planar mosaicing applications ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎
`
`8.2
`
`3-D model building applications ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎
`
`8.3 End-user applications
`
` ☎ ✂ ☎ ✂ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎
`
`9 Discussion ☎ ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✂ ✄ ✁ ✄
`
`10 Conclusions ✁ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎
`
`A 2-D projective transformations ☎ ✂ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎
`
`i
`
`1
`
`2
`
`3
`
`5
`
`6
`
`8
`
`8
`
`9
`
`11
`
`11
`
`13
`
`14
`
`14
`
`18
`
`18
`
`19
`
`19
`
`21
`
`22
`
`28
`
`(cid:57)(cid:36)(cid:47)(cid:40)(cid:50)(cid:3)(cid:40)(cid:59)(cid:17)(cid:3)(cid:20)(cid:19)(cid:21)(cid:28)(cid:66)(cid:19)(cid:19)(cid:24)
`
`
`
`ii
`
`List of Figures
`
`LISTOF FIGURES
`
`1
`
`Rigid, affine, and projective transformations ✄ ✁ ✄ ✂ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✂ ✄ ✁ ✄
`
`2 Whiteboard image mosaic example
`
` ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✂ ✄ ✁ ✄
`
`3
`
`4
`
`5
`
`6
`
`Panoramic image mosaic example (bookshelf and cluttered desk) ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎
`
`Depth recovery example: table with stacks of papers
`
` ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎
`
`Depth recovery example: outdoor scene with trees
`
` ✂ ☎ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ✁ ✄ ✂ ☎
`
`3-D model recovery example
`
` ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ☎ ✂ ✂ ☎ ✂ ☎ ✂ ✄ ✁ ✄ ✂ ☎ ✂ ✂ ☎ ✂ ☎
`
`4
`
`10
`
`12
`
`15
`
`16
`
`17
`
`(cid:57)(cid:36)(cid:47)(cid:40)(cid:50)(cid:3)(cid:40)(cid:59)(cid:17)(cid:3)(cid:20)(cid:19)(cid:21)(cid:28)(cid:66)(cid:19)(cid:19)(cid:25)
`
`
`
`1 Introduction
`
`1 Introduction
`
`1
`
`Virtual reality is currently creating a lot of excitement and interest in the computer graphics
`
`community. Typical virtual reality systems use immersive technologies such as head-mounted or
`
`stereo displays and data gloves. The intent is to convince users that they are interacting with an
`
`alternate physical world, and also often to give insight into computer simulations, e.g., for fluid
`
`flow analysis or molecular modeling [Earnshaw et al., 1993]. Realism and speed of rendering are
`
`therefore important issues.
`
`Another class of virtual reality applications attempts to recreate true reality as convincingly
`
`as possible. Examples of such applications include flight simulators (which were among the
`
`earliest uses of virtual reality), various interactive multi-player games, and medical simulators,
`
`and visualization tools. Since the reality being recreated is usually distant, we use the term tele-
`
`reality in this paper to refer to such virtual reality scenarios based on real imagery.1 Tele-reality
`
`applications which could be built using the techniques described in this paper include scanning
`
`a whiteboard in your office to obtain a high-resolution image, walkthroughs of existing buildings
`
`for re-modeling or selling, participating in a virtual classroom, and browsing through your local
`
`supermarket aisles from home (see Section 8.)
`
`While most focus in virtual reality today is on input/output devices and the quality and speed of
`
`rendering, perhaps the biggest bottleneck standing in the way of widespread tele-reality applications
`
`is the slow and tedious model-building process [Adam, 1993]. This also affects many other areas
`
`of computer graphics, such as computer animation, special effects, and CAD. For small objects,
`
`say 0.2–2m, laser-based scanners are a good solution. These scanners provide registered depth and
`
`colored texture maps, usually in either raster or cylindrical coordinates [Cyberware Laboratory Inc,
`
`1990; Rioux and Bird, 1993]. They have been used extensively in the entertainment industry for
`
`special effects and computer animation. Unfortunately, laser-based scanners are fairly expensive,
`
`limited in resolution (typically 512 256 pixels), and most importantly, limited in range (e.g., they
`
`cannot be used to scan in a building.)
`
`The image-based ranging techniques which we develop in this paper have the potential to
`
`overcome these limitations. Imagine walking through an environment such as a building interior
`
`and filming a video sequence of what you see. By registering and compositing the images in the
`
`1The term telepresence is also often used, especially for applications such as tele-medicine where two-way inter-
`
`activity (remote operation) is important [Earnshaw et al., 1993].
`
`(cid:57)(cid:36)(cid:47)(cid:40)(cid:50)(cid:3)(cid:40)(cid:59)(cid:17)(cid:3)(cid:20)(cid:19)(cid:21)(cid:28)(cid:66)(cid:19)(cid:19)(cid:26)
`
`
`
`2
`
`2 Related work
`
`video together into large mosaics of the scene, image-based ranging can achieve an essentially
`
`unlimited resolution.2 Since the images can be acquired using any optical technology (from
`
`microscopy, through hand-held videocams, to satellite photography), the range or scale of the
`
`scene being reconstructed is not an issue. Finally, as desktop video becomes ubiquitous in our
`
`computing environments—initially for videoconferencing, and later as an advanced user interface
`
`tool [Gold, 1993; Rehg and Kanade, 1994; Blake and Isard, 1994]—image-based scene and model
`
`acquisition will become accessible to all users.
`
`In this paper, we develop novel techniques for extracting large 2-D textures and 3-D models
`
`from image sequences based on image registration and compositing techniques and also present
`
`some potential applications. After a review of related work (Section 2) and of the basic image
`
`formation equations (Section 3), we present our technique for registering pieces of a flat (planar)
`
`scene, which is the simplest interesting image mosaicing problem (Section 4). We then show how
`
`the same technique can be used to mosaic panoramic scenes obtained by rotating the camera around
`
`its center of projection (Section 5). Section 6 discusses how to recover depth in scenes. Section
`
`7 discusses the most general and difficult problem, that of building full 3-D models from multiple
`
`images. Section 8 presents some novel applications of the tele-reality technology developed in this
`
`paper. Finally, we close with a comparison of our work with previous approaches, and a discussion
`
`of the significance of our results.
`
`2 Related work
`
`While 3-D models acquired with laser-based scanners are now commonly used in computer ani-
`
`mations and for special effects, 3-D models derived directly from still or video images are still a
`
`rarity. One example of physically-based models derived from images are the dancing vegetables in
`
`the “Cooking with Kurt” video [Terzopoulos and Witkin, 1988]. Another, more recent example, is
`
`the 3-D model of a building extracted from a video sequences in [Azarbayejani et al., 1993], which
`
`was then composited with a 3-D teapot. Somewhat related are graphics techniques which extract
`
`camera position information from images, such as [Gleicher and Witkin, 1992].
`
`2The traditional use of the term image compositing [Porter and Duff, 1984; Foley et al., 1990] is for blending
`
`images which are already registered. We use the term image mosaicing in this paper for our work to avoid confusion
`
`with this previous work, although compositing (as in composite sketches) also seems appropriate.
`
`(cid:57)(cid:36)(cid:47)(cid:40)(cid:50)(cid:3)(cid:40)(cid:59)(cid:17)(cid:3)(cid:20)(cid:19)(cid:21)(cid:28)(cid:66)(cid:19)(cid:19)(cid:27)
`
`
`
`3 Basic imagingequations
`
`3
`
`The extraction of geometric information from multiple images has long been one of the central
`
`problems in computer vision [Ballard and Brown, 1982; Horn, 1986] and photogrammetry [Moffitt
`
`and Mikhail, 1980]. However, many of the techniques used are based on feature extraction,
`
`produce sparse descriptions of shape, and often require manual assistance. Certain techniques,
`
`such as stereo, produce depth or elevation maps which are inadequate to model true 3-D objects.
`
`In computer vision, the recovered geometry is normally used either for object recognition or for
`
`grasping and navigation (robotics). Relatively little attention has been paid to building 3-D models
`
`registered with intensity (texture maps), which are necessary for computer graphics and virtual
`
`reality applications (but see [Mann, 1993] for recent work which addresses some of the same
`
`problems as this paper.)
`
`In computer graphics, compositing multiple image streams together to create larger format
`
`(Omnimax) images is discussed in [Greene and Heckbert, 1986]. However, in this application,
`
`the relative position of the cameras was known in advance. The registration techniques developed
`
`in this paper are related to image warping [Wolberg, 1990] since once the images are registered,
`
`they can be warped into a common reference frame before being composited.3 While most current
`
`techniques require the manual specification of feature correspondences, some recent techniques
`
`[Beymer et al., 1993] as well as the techniques developed in this paper can be used to automate
`
`this process. Combinations of local image warping and compositing are now commonly used for
`
`special effects under the general rubric of morphing [Beier and Neely, 1992]. Recently, morphing
`
`based on z-buffer depths and camera motion has been applied to view interpolation as a quick
`
`alternative to full 3-D rendering [Chen and Williams, 1993]. The techniques we develop in Section
`
`6 can be used to compute the depth maps necessary for this approach directly from real-world
`
`image sequences.
`
`3 Basic imaging equations
`
`Many of the ideas in this paper have their simplest expression using projective geometry [Semple
`
`and Kneebone, 1952]. However, rather than relying on these results, we will use more standard
`
`methods and notations from computer graphics [Foley et al., 1990], and prove whatever simple
`
`3One can view the image registration task as an inverse warping problem, since we are given two images and asked
`
`to recover the unknown warping function.
`
`(cid:57)(cid:36)(cid:47)(cid:40)(cid:50)(cid:3)(cid:40)(cid:59)(cid:17)(cid:3)(cid:20)(cid:19)(cid:21)(cid:28)(cid:66)(cid:19)(cid:19)(cid:28)
`
`
`
`4
`
`3 Basic imagingequations
`
`Figure 1: Rigid, affine, and projective transformations
`
`results we require in Appendix A. Throughout, we will use homogeneous coordinates to represent
`
`points, i.e., we denote 2-D points in the image plane as ✂✁☎✄✝✆✞✄✠✟☛✡ , with ☞✁✍✌✎✟✏✄✠✆✑✌✎✟☛✡ being the
`coordinates ✂✁☎✄✝✆✞✄✓✒✔✄✝✟✕✡ have Cartesian coordinates ✂✁✍✌✎✟✏✄✠✆✍✌✖✟✗✄✘✒✙✌✎✟☛✡ .
`
`corresponding Cartesian coordinates [Foley et al., 1990]. Similarly, 3-D points with homogeneous
`
`Using homogeneous coordinates, we can describe the class of 2-D planar transformations using
`
`matrix multiplication
`
`✚✛✛✛✜ ✁✔✢✆✔✢✟✣✢ ✤✥✥✥✦★✧ ✚✛✛✛✜ ✩
`
`00
`
`10
`
`20
`
`01
`
`11
`
`21
`
`✤✥✥✥✦ ✚✛✛✛✜ ✁✆✟ ✤✥✥✥✦
`
`02
`
`12
`
`22
`
`or
`
`✪ ✢ ✧✬✫
`
`2D✪
`
`(1)
`
`The simplest transformations in this general class are pure translations, followed by translations and
`
`rotations (rigid transforms), plus scaling (similarity transforms), affine transforms, and full projec-
`
`tive transforms. Figure 1 shows a square and possible rigid, affine, and projective transformations.
`
`0
`
`0
`
`1
`
`matrix with 8 degrees of freedom.4
`
`2D matrix,
`
`00
`
`10
`
`0
`
`01
`
`11
`
`02
`
`12
`
`0
`
`1
`
`2D
`
`Rigid and affine transformations have the following forms for the✫
`rigid-2D ✧ ✚✛✛✛✜
`affine-2D ✧ ✚✛✛✛✜ ✩
`✤✥✥✥✦ ✄
`✤✥✥✥✦ ✄ ✫
`cos✭ ✮
`sin✭ ✯✝✰
`sin✭
`cos✭ ✯✲✱
`with 3 and 6 degrees of freedom, respectively, while projective transformations have a general✫
`similarity (7 dof), affine (12 dof), and full projective (15 dof) transforms. The✫
`4Two✫
`22✴
`
`In 3-D, we have the same hierarchy of transformations, with rigid (6 degrees of freedom),
`
`3D matrices in this
`
`2D matrices are equivalent if they are scalar multiples of each other. We remove this redundancy by setting
`
`1.
`
`(cid:57)(cid:36)(cid:47)(cid:40)(cid:50)(cid:3)(cid:40)(cid:59)(cid:17)(cid:3)(cid:20)(cid:19)(cid:21)(cid:28)(cid:66)(cid:19)(cid:20)(cid:19)
`
`✩
`✩
`✩
`✩
`✩
`✩
`✩
`✩
`✫
`✩
`✩
`✩
`✩
`✩
`✳
`
`
` ✧ ✚✜ ✁ ✂✄✆☎
`✤✥✥✥✦ ✄
`ˆ✝ ✄✠✟ ✧ ✚✛✛✛✜
`✝ ✧ ✞
`0 0 1✌☛✡
`which projects 3-D points through the origin onto a 2-D projection plane a distance ✡ along the✒
`axis [Foley et al., 1990]. The 3 3 ˆ✝ matrix can be a general matrix in the case where the internal
`camera parameters are unknown. The last column of ✝
`☞ ✧ ☞✁✞✄✠✆ ✄✓✒✔✄✠✟☛✡ onto a 2-D screen
`
`1 0
`
`0 1
`
`0
`
`0
`
`0
`
`0
`
`0
`
`is always zero for central projection.
`
`The combined equations projecting a 3-D world coordinate
`
`location✪ ✧ ☞✁ ✢✄✠✆ ✢✄✠✟ ✢✡ can thus be written as
`✪ ✧ ✝✌ ☞ ✧ ✫
`where ✫
`
`(2)
`
`(3)
`
`(4)
`
`4 Planarimage mosaicing
`
`5
`
`case are 4 4. Of particular interest are the rigid (Euclidean) transformation,
`
`✤✦ ✄
`
`1
`
`where ✁
`
`is a 3 3 orthonormal rotation matrix and ✂
`
`viewing matrix
`
`is a 3-D translation vector, and the 3 4
`
`cam☞ ✄
`
`cam is a 3 4 camera matrix.
`
`This equation is valid even if the camera calibration
`
`parameters and/or the camera orientation are unknown.
`
`4 Planar image mosaicing
`
`The simplest possible set of images to mosaic are pieces of a planar scene such as a document,
`
`whiteboard, or flat desktop. Imagine that we have a camera fixed directly over our desk. As we
`
`slide a document under the camera, different portions of the document are visible. Any two such
`
`pieces are related to each other by a translation and a rotation (2-D rigid transformation).
`
`Now imagine that we are scanning a whiteboard with a hand-held video camera. The class of
`
`transformations relating two pieces of the board is more general in this case. It is easy to see that
`
`this class is the family of 2-D projective transformations (just imagine how a square or grid in one
`
`image can appear in another.) These transformations can be computed without any knowledge of
`
`the internal camera calibration parameters (such as focal length or optical center) or of the relative
`
`camera motion between frames. The fact that 2-D projective transformations capture all such
`
`possible mappings is a basic result of projective geometry (see Appendix A.)
`
`(cid:57)(cid:36)(cid:47)(cid:40)(cid:50)(cid:3)(cid:40)(cid:59)(cid:17)(cid:3)(cid:20)(cid:19)(cid:21)(cid:28)(cid:66)(cid:19)(cid:20)(cid:20)
`
`
`
`6
`
`4 Planar imagemosaicing
`
`Given this knowledge, how do we go about computing the transformations relating the various
`
`scene pieces so that we can paste them together? A variety of techniques are possible, some more
`
`automated than others. For example, we could manually identify four or more corresponding points
`
`between the two views, which is enough information to solve for the eight unknowns in the 2-D
`
`projective transformation. Or we could iteratively adjust the relative positions of input images
`
`using either a blink comparator or transparency [Carlbom et al., 1991]. These kinds of manual
`
`approaches are too tedious to be useful for large tele-reality applications.
`
`4.1 Local image registration
`
`The approach we use in this paper is to directly minimize the discrepancy in intensities between
`
`pairs of images after applying the transformation we are recovering. This has the advantage of not
`
`requiring any easily identifiable feature points, and of being statistically optimal once we are in the
`
`vicinity of the true solution [Szeliski and Coughlan, 1994]. More concretely, suppose we describe
`
`our 2-D transformations as
`
`Our technique minimizes the sum of the squared intensity errors
`
`2
`
`5
`1
`
`✁ ✢ ✧ ✩
`
`(5)
`
`(6)
`
`4✆ ✄✁ ✩
`3✁ ✄✁ ✩
`1✆ ✄✁ ✩
`0✁ ✂✁ ✩
`1 ✄☎✆ ✢ ✧ ✩
`7✆ ✄✁
`6✁ ✄✁ ✩
`7✆ ✂✁
`6✁ ✄✁ ✩
`☎ ✧ ✆ ✞✝✠✟ ✢ ☞✁ ✢ ✄✠✆ ✢ ✡ ✮ ✟ ☞✁ ✄✠✆ ✡☛✡ 2 ✧ ✆ ✌☞
`2
`✟ ✢ ☞✁✔✢✄✠✆ ✢✡ (pixels
`✟ ☞✁✞✄✠✆✑✡ and
`over all corresponding pairs of pixels✍ which are inside both images
`✟ ✢ into the reference frame of
`using ✫✏✎
`transformation✫
`☞ with respect to the unknown motion parameters✑ ✩
`7✒ . These are straightforward to
`7 ✧ ✮ ✆ ✔ ✕ ✁ ✢ ✓ ✟ ✢✓ ✁ ✢ ✁ ✆ ✢ ✓ ✟ ✢✓ ✆ ✢✗✖ ✄
`0 ✧ ✁ ✔ ✓ ✟ ✢✓ ✁ ✢ ✄ ✄ ✓ ☞ ✓ ✩
`✓ ☞ ✓ ✩
`
`which are mapped outside image boundaries do not contribute.) Once we have found the best
`
`2D, we can warp image
`
`1
`2D [Wolberg,
`
`1990] and then blend the two images together. To reduce visible artifacts, we weight images being
`
`blended together more heavily towards the center, using a bilinear weighting function.
`
`To perform the minimization, we use the Levenberg-Marquardt iterative non-linear minimiza-
`
`tion algorithm [Press et al., 1992]. This algorithm requires the computation of the partial derivatives
`
`of
`
`compute, i.e.,
`
`0 ✩
`
`(7)
`
`(cid:57)(cid:36)(cid:47)(cid:40)(cid:50)(cid:3)(cid:40)(cid:59)(cid:17)(cid:3)(cid:20)(cid:19)(cid:21)(cid:28)(cid:66)(cid:19)(cid:20)(cid:21)
`
`✩
`✩
`
`✟
`
`
`4.1 Local imageregistration
`
`7
`
`is the image intensity gradient of
`
`Hessian matrix
`
`and the weighted gradient vector
`
`with components
`
`(8)
`
`and then updates the motion parameter estimate
`
`1
`
`, where ☛
`
`is a time-varying stabilization parameter [Press et al., 1992]. The advantage of using Levenberg-
`
`Marquardt over straightforward gradient descent is that it converges in fewer iterations.
`
`The
`
`complete registration algorithm thus consists of the following steps:
`
`(b) compute the error in intensity between the corresponding pixels
`
`(c) compute the partial derivative of
`
`as in (7);
`(d) add the pixel’s contribution to
`
`and
`
`as in (8).
`
`.
`
`and update the motion estimate
`
`1
`
`✡✒✑✔✓✖✕
`
`is the denominator in (5), and ✓ ✟ ✢✌✓ ✁✔✢✄✓ ✟ ✢✌✓ ✆✔✢✡
`✟ ✢
`where✔
`at ✂✁ ✢ ✄✠✆✔✢ ✡ . From these partials, the Levenberg-Marquardt algorithm computes an approximate
`✂☎✄✝✆ ✧ ✆ ✓ ☞ ✓ ✩✞✄ ✓ ☞ ✓ ✩✟✆ ✄ ✠ ✄ ✧ ✮ 2✆ ☞ ✓ ☞ ✓ ✩✟✄ ✄
`by an amount ∆✡ ✧ ✁☞☛✍✌ ✡✎
`1. For each pixel✍ at location ☞✁ ✄✝✆ ✡ ,
`(a) compute its corresponding position in the other image ☞✁✍✢ ✄✠✆ ✢ ✡ using (5);
`☞ ✧ ✟ ✢ ☞✁ ✢ ✄✠✆ ✢ ✡ ✮ ✟ ☞✁ ✄✠✆ ✡
`and the intensity gradient ✓ ✟ ✢✌✓ ✁✔✢✄✓ ✟ ✢✌✓ ✆ ✢✡ ;
`☞ w.r.t. the✩✞✄ using
`✓ ☞ ✓ ✩✟✄ ✧ ✓ ✟ ✢✓ ✁ ✢ ✓ ✁✔✢✓ ✩✞✄ ✁ ✓ ✟ ✢✓ ✆ ✢ ✓ ✆✔✢✓ ✩✟✄
`2. Solve the system of equations ✁✎☛✏✌ ✡ ∆✡ ✧ ✁
`✗ ✧✡✘✑✙✓ ✗ ✁ ∆✡
`
`3. Check that the error in (6) has decreased; if not, increment ☛
`1992]) and compute a new ∆✡
`
`,
`
`(as described in [Press et al.,
`
`4. Continue iterating until the error is below a threshold or a fixed number of steps has been
`
`completed.
`
`The steps in this algorithm are similar to the operations performed when warping images [Wolberg,
`
`1990; Beier and Neely, 1992], with additional operations for correcting the current warping pa-
`
`rameters based on the local intensity error and its gradients (similar to the process in [Gleicher and
`
`Witkin, 1992]). For more details on the exact implementation, see [Szeliski and Coughlan, 1994].
`
`(cid:57)(cid:36)(cid:47)(cid:40)(cid:50)(cid:3)(cid:40)(cid:59)(cid:17)(cid:3)(cid:20)(cid:19)(cid:21)(cid:28)(cid:66)(cid:19)(cid:20)(cid:22)
`
`✁
`✡
`✁
`✁
`
`
`8
`
`4 Planar imagemosaicing
`
`4.2 Global image registration
`
`Unfortunately, both gradient descent and Levenberg-Marquardt only find locally optimal solutions.
`
`If the motion between successive frames is large, we must use a different strategy to find the best
`
`registration. We have implemented two different techniques for dealing with this problem. The
`
`first technique, which is commonly used in computer vision, is hierarchical matching, which first
`
`registers smaller, subsampled versions of the images where the apparent motion is smaller [Quam,
`
`1984; Witkin et al., 1987; Bergen et al., 1992]. Motion estimates from these smaller coarser levels
`
`are then used to initialize motion estimates at finer levels, thereby avoiding the local minimum
`
`problem (see [Szeliski and Coughlan, 1994] for details.) While this technique is not guaranteed
`
`to find the correct registration, we have observed empirically that it works well when the initial
`
`misregistration is only a few pixels (the exact domain of convergence depends on the intensity
`
`pattern in the image.)
`
`For larger displacements, we use phase correlation [Kuglin and Hines, 1975; Brown, 1992].
`
`This technique estimates the 2-D translation between a pair of images by taking 2-D Fourier
`
`Transforms of each image, computing the phase difference at each frequency, performing an
`
`inverse Fourier Transform, and searching for a peak in the magnitude image. In our experiments,
`
`this technique has proven to work remarkably well, providing good initial guesses for image pairs
`
`which overlap by as little as 50%. The technique will not work, however, if the inter-frame motion
`
`is not mostly translational (large rotations or zooms), but this does not occur in practice with our
`
`current image sequences.
`
`4.3 Results
`
`To demonstrate the performance of our algorithm, we digitized an image sequence with a camera
`
`panning over a whiteboard. Figure 2a shows the final mosaic of the whiteboard, with the locations
`
`of the constituent images in the mosaic shown as black outlines. Figure 2b is a portion of the final
`
`mosaic without these outlines. This mosaic is 1300 2046 pixels, based on compositing 39 NTSC
`
`(640 480) resolution images.
`
`To compute this mosaic, we developed an interactive image mosaicing tool which lets the user
`
`coarsely position successive frames relative to each other. The tool also has an automatic mosaicing
`
`option which computes the initial rough placement of each image with respect to the previous one
`
`(cid:57)(cid:36)(cid:47)(cid:40)(cid:50)(cid:3)(cid:40)(cid:59)(cid:17)(cid:3)(cid:20)(cid:19)(cid:21)(cid:28)(cid:66)(cid:19)(cid:20)(cid:23)
`
`
`
`5 Panoramicimage mosaicing
`
`9
`
`using phase correlation. Our algorithm then refines the location of each image by minimizing (6)
`
`using the current mosaic thus far as
`
`✟ ✢ ☞✁✔✢✄✠✆ ✢✡ . The
`
`images in Figure 2 were automatically composited without user intervention using the middle frame
`
`(center of the image) as the anchor image (no deformations.) As we can see, our technique works
`
`well on this example. Section 9 discusses some imaging artifacts which can cause difficulties.
`
`✟ ☞✁✞✄✠✆✑✡ and the input frame being adjusted as
`
`5 Panoramic image mosaicing
`
`Another image mosaicing problem which turns out to be equally simple to solve is panoramic
`
`image mosaicing. In this scenario, we rotate a camera around its optical center in order to create
`
`a full “viewing sphere” of images (in computer graphics, this representation is sometimes called
`
`an environment map [Greene, 1986].) This is similar to the action of panoramic still photographic
`
`cameras where the rotation of a camera on top of a tripod is mechanically coordinated with the
`
`film transport, the anamorphic lens used in [Lippman, 1980], and to “cinema in the round” movie
`
`theaters [Greene and Heckbert, 1986]. In our case, however, we can mosaic multiple 2-D images of
`
`arbitrary detail and resolution, and we need not know the camera motion. Examples of applications
`
`include constructing true scenic panoramas (say of the view at the rim of the Grand Canyon), or
`
`limited virtual environments (recreating a meeting room or office as seen from one location.)
`
`It turns out that images taken from the same viewpoint (stationary optical center) are related
`
`by 2-D projective transformations, just as in the planar scene case (see Appendix A for a quick
`
`proof.) Intuitively, we cannot tell the relative depth of points in the scene as we rotate (there is no
`
`motion parallax), so they may as well be located on any plane (in projective geometry, we could
`
`say they lie on the plane at infinity,
`
`
`
`ERROR: undefinedfilename
`OFFENDING COMMAND: findfont
`
`STACK:
`
`/
`/ISOUYZ+