Interactive 3-D Video Representation and Coding
Technologies

ALJOSCHA SMOLIĆ AND PETER KAUFF

Invited Paper

Interactivity, in the sense of being able to explore and navigate audio–visual scenes by freely choosing viewpoint and viewing direction, is an important key feature of new and emerging audio–visual media. This paper gives an overview of suitable technology for such applications, with a focus on international standards, which are beneficial for consumers, service providers, and manufacturers. We first give a general classification and overview of interactive scene representation formats as commonly used in the computer graphics literature. Then, we describe popular standard formats for interactive three-dimensional (3-D) scene representation and creation of virtual environments, the virtual reality modeling language (VRML) and the MPEG-4 BInary Format for Scenes (BIFS), with some examples. Recent extensions to MPEG-4 BIFS, the Animation Framework eXtension (AFX), providing advanced computer graphics tools, are explained and illustrated. New technologies mainly targeted at reconstruction, modeling, and representation of dynamic real world scenes are further studied. The user shall be able to navigate photorealistic scenes within certain restrictions, which can be roughly defined as 3-D video. Omnidirectional video is an extension of the planar two-dimensional (2-D) image plane to a spherical or cylindrical image plane. Any 2-D view in any direction can be rendered from this overall recording to give the user the impression of looking around. In interactive stereo, two views, one for each eye, are synthesized to provide the user with an adequate depth cue of the observed scene. Head motion parallax viewing can be supported in a certain operating range if sufficient depth or disparity data are delivered with the video data. In free viewpoint video, a dynamic scene is captured by a number of cameras. The input data are transformed into a special data representation that enables interactive navigation through the dynamic scene environment.

Keywords—Interactive media, MPEG standards, three-dimensional (3-D) video, video coding.

Manuscript received December 19, 2003; revised June 9, 2004.
The authors are with the Fraunhofer Institute for Telecommunications, Image Processing Department, Heinrich-Hertz-Institute (HHI), 10587 Berlin, Germany (e-mail: smolic@hhi.de; kauff@hhi.de).
Digital Object Identifier 10.1109/JPROC.2004.839608
I. INTRODUCTION

Interactivity is an important key feature of new and emerging audio–visual media where the user has the opportunity to be active in some way instead of just being a passive consumer. One kind of interactivity is the ability to freely choose viewpoint and viewing direction within an audio–visual scene. The first media that provided such functionality were based on textured three-dimensional (3-D) mesh models, as they are well known from computer graphics. For representation and exchange of such data, ISO/IEC has standardized a special language called the virtual reality modeling language (VRML) that is widely used on the Web. However, although VRML still has its merits, it is becoming dated due to the rapid progress in multimedia in general and virtual reality in particular. VRML is strongly graphics-oriented, and scene realism is therefore limited. Most of the scenes are either purely computer generated or contain static two-dimensional (2-D) views of real world objects represented by still pictures or moving textures. Hence, as a complement to VRML, new standardization efforts have been launched to advance the realism and functionality of 3-D representations in interactive audio–visual media. Most of them, such as the Web3D consortium, have collaborated closely with the Moving Picture Experts Group (MPEG) of ISO/IEC. The reason for this liaison is that the most recent standard, MPEG-4, is not only focused on audio–visual coding; indeed, it is much more multimedia-oriented than MPEG-1/2 and offers plenty of new functionalities like interactivity and 3-D scene representation.

Even in its basic version, MPEG-4 already supports interactivity with hybrid scenes containing both computer graphics and natural video objects. Furthermore, to be compatible with VRML, it includes all conventional tools and scene description formats for interactive 3-D graphics as a subset. In addition, it allows an easy integration of multiple audio and video streams compliantly coded with existing ISO or ITU standards. And the MPEG-4 scene description language called BInary Format for Scenes (BIFS) specifies a compressed binary format, which is suited for online transmission and streaming of 3-D scenes. The more recent versions of the MPEG-4 standard support further extensions toward advanced 3-D representation.
Fig. 1. Categorization of scene representations [27].
One example is the Animation Framework eXtension (AFX) that provides new 3-D formats for natural-looking scene objects. In this context, AFX also offers new tools such as surface light fields and depth image-based rendering for efficient 3-D modeling of scene objects captured by real imagery.

Further approaches for modeling and rendering real world scenes are investigated in the still ongoing activities of a new MPEG group, 3-D audio–visual (3DAV). This includes video-based rendering as well as advanced 3-D reconstruction methods. Application scenarios investigated in this context are omnidirectional video, interactive stereo video, and free viewpoint video.

Omnidirectional video is an extension of the planar 2-D image plane to a spherical or cylindrical image plane. Other kinds of planes (e.g., hyperbolic) are also possible. Video is captured at a certain viewpoint (which may move over time) in multiple directions. Any 2-D view in any direction can be rendered from this overall recording. Such an omnidirectional video can be displayed with a suitable player. Key functions of interaction are zoom and rotation of the viewing direction to give the user the impression of looking around.

In interactive stereo, two views, one for each eye, are synthesized to provide the user with an adequate depth cue of the observed scene. Head motion parallax viewing can be supported if sufficient depth or disparity data are delivered with the video data. In that case, it is possible to generate stereoscopic virtual views that correspond to various head positions around a given zero position. This is an important feature of envisaged immersive media, such as 3-D-TV, that seem to become feasible in the near future.

The most general case is free viewpoint video. A dynamic scene is captured by a number of cameras. In addition to the video signals, other information such as camera calibration and derived scene geometry is also acquired or estimated. These input data are transformed into a special data representation that enables interactive navigation through the dynamic scene environment. It is also possible to generate 3-D video objects that are composed from multiple view video information. Free viewpoint video representation formats can either be purely image-based or rely on a certain kind of 3-D reconstruction.
This paper gives an overview of available and emerging technology for modeling, coding, and rendering of dynamic real world scenes for interactive applications, at the convergence point of computer graphics, computer vision, and classical media. The main focus is on formats that are or will probably be available in open standards such as ISO/IEC MPEG. A general classification of interactive 3-D scene representation approaches is given in the next section. Section III describes the scene description language MPEG-4 BIFS in more detail. A few examples are given and the link to the common computer graphics format VRML is explained. Following these fundamentals, the recently established AFX extension of MPEG-4 is presented in Section IV. Then, Section V gives an overview of the new technology that is under investigation in the working group 3DAV of MPEG. Finally, Section VI summarizes the paper and gives an outlook to future developments in these fields.

II. CLASSIFICATION OF INTERACTIVE 3-D SCENE REPRESENTATION FORMATS

In the computer graphics literature, methods for scene representation are often classified as a continuum between two extremes, as illustrated in Fig. 1 [27]. The one extreme is represented by classical 3-D computer graphics. This approach can also be called geometry-based modeling. In most cases, scene geometry is described on the basis of 3-D meshes. Real world objects are reproduced using geometric 3-D surfaces with an associated texture mapped onto them. More sophisticated attributes can be assigned as well. For instance, appearance properties (opacity, reflectance, specular lights, etc.) can enhance the realism of the models significantly.

Everyone is familiar with this type of computer graphics from games, the Internet, TV, movies, etc. The achievable performance can be extremely good if the scenes are purely computer generated. The available technology for both production and rendering has been highly optimized over the last few years, especially in the case of common 3-D mesh representations. In addition, state-of-the-art PC graphics cards are able to render highly complex scenes with impressive quality in terms of refresh rate, levels of detail, spatial resolution, reproduction of motion, and accuracy of textures.

A drawback of this approach is the high cost of content creation.
Aiming at photorealism, 3-D scene and object modeling is complex and time consuming, and it becomes even more complex if a dynamically changing environment simulating real life is being created. Furthermore, automatic 3-D object and scene reconstruction implies an estimation of camera geometry, depth structures, and 3-D shapes. Inherently, all these processes tend to produce occasional errors. Therefore, high-quality production, e.g., for movies, has to be done user assisted, supervised by a skilled operator.

The other extreme is given by scene representations that do not use any 3-D geometry at all. This is usually called image-based modeling. In this case, virtual intermediate views are generated from available real views by interpolation. The main advantages are a high quality of virtual view synthesis and the avoidance of 3-D scene reconstruction. However, these benefits have to be paid for by dense sampling of the real world with plenty of original view images. In general, the synthesis quality increases with the number of available views. Hence, a large number of cameras has to be set up to achieve high-performance rendering, and a correspondingly large amount of image data needs to be processed. Conversely, if the number of cameras used is too low, interpolation and occlusion artifacts will appear in the synthesized images, possibly degrading the quality.
The image-based methods can be derived from the theory of the plenoptic function. The expression descends from the Latin root plenus, meaning complete or full, and optic, pertaining to vision [37]. This seven-dimensional function was initially postulated by Adelson and Bergen [1]:

P = P(x, y, z, θ, φ, λ, t).                                    (1)

It describes the intensity of every light ray at every position in space ((x, y, z), 3-D), in every direction ((θ, φ), 2-D), for every wavelength (λ, 1-D), at any time (t, 1-D). It represents everything that can be seen from all positions in space, in any direction, at any time. As such, it might be called the universal formula of vision, but it has only theoretical relevance. In practice, it is simplified by omitting dimensions, for example the wavelength, time, or some spatial dimension. Moreover, it is not possible to apply the plenoptic function in its continuous form, i.e., it has to be sampled. Views are taken at a number of discrete positions, possibly in some discrete directions. Against this background it is possible to formulate a complete plenoptic sampling theory in analogy to the common sampling theorem of signal theory [4].
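To make the idea of sampling the plenoptic function concrete, the following minimal sketch (not taken from the paper) illustrates a reduced, four-dimensional two-plane light field L(u, v, s, t): the continuous function is sampled on a regular camera grid, and a new ray is reconstructed by bilinear interpolation between the nearest recorded views. The class name and array layout are hypothetical.

```python
import numpy as np

# Illustrative sketch: a 4-D two-plane light field L(u, v, s, t) as a
# discretely sampled, reduced-dimension plenoptic function. Views are stored
# on a regular (u, v) camera grid; a new ray is synthesized by bilinear
# interpolation between the four nearest recorded views.

class LightField:
    def __init__(self, views, u_coords, v_coords):
        # views: array of shape (U, V, H, W, 3), one image per (u, v) grid point
        self.views = views
        self.u_coords = np.asarray(u_coords)
        self.v_coords = np.asarray(v_coords)

    def sample(self, u, v, s, t):
        """Color of the ray through camera-plane point (u, v) and integer
        image-plane pixel (s, t), bilinearly blended over the camera grid."""
        iu = np.clip(np.searchsorted(self.u_coords, u) - 1, 0, len(self.u_coords) - 2)
        iv = np.clip(np.searchsorted(self.v_coords, v) - 1, 0, len(self.v_coords) - 2)
        # interpolation weights along u and v
        du = (u - self.u_coords[iu]) / (self.u_coords[iu + 1] - self.u_coords[iu])
        dv = (v - self.v_coords[iv]) / (self.v_coords[iv + 1] - self.v_coords[iv])
        c00 = self.views[iu,     iv,     t, s]
        c10 = self.views[iu + 1, iv,     t, s]
        c01 = self.views[iu,     iv + 1, t, s]
        c11 = self.views[iu + 1, iv + 1, t, s]
        return ((1 - du) * (1 - dv) * c00 + du * (1 - dv) * c10 +
                (1 - du) * dv * c01 + du * dv * c11)
```

The denser the (u, v) grid of original views, the smaller the interpolation error, which mirrors the dependence of synthesis quality on the number of available views discussed above.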
Examples of image-based representations are ray-space [12]–[16] or light-field rendering [32] and panoramic configurations including concentric and cylindrical mosaics [5], [37], [51], [55]. None of these methods makes any use of geometry, but they either have to cope with an enormous complexity in terms of data acquisition or they apply simplifications that restrict the level of interactivity.

In between the two extremes there exists a continuum of methods that make more or less use of both approaches and combine their advantages in a particular manner. For instance, a Lumigraph [3], [18] uses a representation similar to a light field but adds a rough 3-D model. This provides information on the depth structure of the scene and therefore allows for reducing the number of views.

Other representations do not use explicit 3-D models but depth or disparity maps. Such maps assign a depth value to each pixel of an image. Together with the original 2-D image, the depth map builds a 3-D-like representation, often called 2.5-D. This can be extended to layered depth images [50], where multiple color and depth values are stored in consecutively ordered depth layers (see Section IV).
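As a simple illustration of the 2.5-D idea (not part of the paper), the sketch below backprojects a depth map into camera-space 3-D points using a pinhole model; the intrinsic parameters fx, fy, cx, cy are assumed example inputs.

```python
import numpy as np

# Illustrative sketch: turning a 2.5-D representation (2-D image + per-pixel
# depth map) into 3-D points by backprojecting each pixel through a pinhole
# camera model. Parameter names and layout are hypothetical.

def backproject(depth, fx, fy, cx, cy):
    """depth: (H, W) array of metric depth values Z per pixel.
    Returns an (H, W, 3) array of camera-space 3-D points (X, Y, Z)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth   # X = (u - cx) * Z / fx
    y = (v - cy) / fy * depth   # Y = (v - cy) * Z / fy
    return np.dstack([x, y, depth])
```

Reprojecting these points into a nearby virtual camera yields intermediate views; disoccluded regions then have to be filled from additional depth layers or views.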
Closer to the geometry-based end of the spectrum, we can find methods that use view-dependent geometry and/or view-dependent texture [7], [43]. Surface light fields combine the idea of light fields with an explicit 3-D model [6], [57]. Furthermore, volumetric representations such as voxels (from volume elements) can be used instead of a complete 3-D mesh model to describe 3-D geometry [8], [29], [41], [42], [49].
The complete processing chain of such systems can be divided into the parts of acquisition/capturing, processing, scene representation, coding, transmission/streaming/storage, interactive rendering, and 3-D displays. The design has to take all parts into account, since there are strong interrelations between all of them. For instance, an interactive display that requires random access to 3-D data will affect the performance of a coding scheme that is based on data prediction. A complete system for efficient representation and interactive streaming of high-resolution panoramic views has been presented in [20]. Other coding and transmission aspects of such data have also been studied, for example in [16], [21], [28], [31], [33], [39], [40], [46], [52], [60], and [61]. The European IST project ATTEST has studied a complete processing chain for interactive 3-D broadcast including 3-D-TV acquisition, data representation, joint coding of video and depth maps, auto-stereoscopic 3-D displays, and parallax viewing based on head tracking [10].
III. MPEG-4 BINARY FORMAT FOR SCENES (BIFS)

Classical 3-D computer graphics representations using textured 3-D mesh models are widespread across various applications. A lot of software tools and specialized hardware are available, and standard APIs such as OpenGL, DirectX, or Java3D provide developers with easy access to state-of-the-art functionalities of graphics cards. Due to historical implementation reasons, many applications (e.g., games) use proprietary formats for data representation. However, for the exchange of 3-D graphics data between different systems it is necessary to define standardized formats ensuring interoperability. For that purpose, ISO/IEC has specified VRML. It was mainly developed for the transmission of 3-D graphics over the Internet but can be used for other applications as well. VRML data can be visualized with an appropriate player and the user can navigate through the virtual environments. Apart from the formats and attributes for object description (geometric primitives, 3-D meshes, textures, appearance, etc.), VRML also contains other elements necessary to define interactive 3-D worlds (e.g., light sources, collision, sensors, interpolators, viewpoint).
Fig. 2. Example of MPEG-4 BIFS scene.
Later on, ISO/IEC issued a standard for multimedia data representation known as MPEG-4. It builds on BIFS, which is an extension of VRML. In addition to the functionality of VRML, BIFS provides, for instance, a better integration of natural audio and video, advanced 3-D audio features, a timing model, an update mechanism to modify the scene over time, a script to animate the scene temporally, new graphics elements (e.g., face and body animation), and an efficient binary encoding of the scene description. The last point is particularly important for streaming and broadcast applications, since scene description files in text format can become very large and are not well suited for real-time streaming.
As such, MPEG-4 is much more than just another video and audio codec, although advanced audio–visual coding is again an important part of MPEG-4. In fact, it is a real multimedia standard, combining all types of media within a standardized format.

An example image of a rendered MPEG-4 BIFS scene is shown in Fig. 2. The background is a still image coded in JPEG format. The foreground scene contains 3-D graphics elements (box, pawn in the game) that the user can interact with (move, rotate, change of object properties like shape, size, texture, opacity, etc.). In addition, the interaction elements at the bottom allow changing the image background and browsing from scene to scene. Such scene changes can be downloaded on demand and the scene graph is updated online while viewing the scene. Furthermore, a live video stream showing a person is decoded and mapped onto the surface of the 3-D box. This simple example illustrates some basic features of MPEG-4 BIFS and its potential for content creators. The possibility of online scene updates and the live streaming of audio–visual data and their seamless integration into virtual worlds represent a clear advance over VRML.

A further example, which efficiently takes advantage of BIFS, is the interactive streaming and rendering of high-resolution panoramic views [20]. Panoramic views are widely used on the Internet to provide complete views of real environments. Navigation is restricted to rotation and zoom. A panorama is a purely image-based representation, as explained in Section II.
Fig. 3. BIFS representation for high-resolution panoramic views.
Basically, only a small portion of the panorama is displayed at a certain time with an interactive player (if the video is not projected as a whole as done, e.g., in a dome application). This can, for instance, be a 60° field of view. A rough estimate for a minimum resolution to get a good quality for each rendered view is 600 × 600 pixels. This means that a full spherical panorama would need a total resolution of at least 3600 × 1800 pixels (six 60° segments of 600 pixels across the full 360° horizontally, and three such segments across 180° vertically). Transmitting such a large image (or even higher possible resolutions as reported in [20]) over the Internet prior to display causes unacceptably long download times. Furthermore, the usage of one large picture for the whole panorama is a heavy burden for the player, as it has to keep the whole image in memory for rendering purposes.
In this context, BIFS can be used for an efficient representation of interactive panoramas. Since the user only sees a portion of the panorama at a time, streaming and rendering can be limited to this particular image area. For this purpose, the panorama is divided into patches that can be decoded separately, as illustrated in Fig. 3. In practice, these patches are JPEG coded images arranged around a 3-D cylinder using the BIFS syntax. Each of the patches is assigned to a visibility sensor that is slightly bigger than the patch. While the user navigates over the panorama, the currently visible patches are streamed, loaded, and rendered. This process is started as soon as the corresponding visibility sensor enters the active window shown on the screen. A look-ahead and prefetching strategy guarantees that all image data are available on time. This is simply achieved by oversizing the visibility sensor. Conversely, a patch is unloaded if the corresponding sensor leaves the current view. In a streaming application, copies of the transmitted patches can be stored locally to avoid renewed transmission when the same areas of the panorama are revisited. This procedure allows smooth rendering of even very high-resolution panoramas.
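The patch-selection logic can be sketched as follows. This is an illustrative approximation of the oversized visibility sensor idea, not the BIFS mechanism itself; the patch layout, margin, and field of view are assumed example values.

```python
# Illustrative sketch: deciding which cylinder patches to stream for the
# current viewing direction, mimicking the oversized "visibility sensor".

PATCH_DEG = 60.0      # angular width of one cylinder patch (assumed)
MARGIN_DEG = 15.0     # oversizing margin used for prefetching (assumed)
FOV_DEG = 60.0        # horizontal field of view of the rendered window

def visible_patches(view_dir_deg, n_patches=int(360 / PATCH_DEG)):
    """Return indices of patches whose oversized sensor overlaps the view."""
    half = FOV_DEG / 2 + MARGIN_DEG
    active = []
    for i in range(n_patches):
        center = i * PATCH_DEG + PATCH_DEG / 2
        # smallest angular distance between patch center and view direction
        diff = (center - view_dir_deg + 180.0) % 360.0 - 180.0
        if abs(diff) <= half + PATCH_DEG / 2:
            active.append(i)
    return active

# Patches returned here would be requested (or taken from the local cache),
# while patches no longer in the list could be unloaded.
print(visible_patches(10.0))   # e.g. [0, 5] for a view near 10°
```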
In a complete interactive multimedia application, the interactive video must be accompanied by corresponding interactive audio. If the user rotates the viewpoint, the associated sound should follow the interaction by changing its direction of origin. If the user approaches a sound source, the associated sound should become louder. These functionalities are provided by MPEG-4 AudioBIFS [48], which provides the means for setting up 3-D audio scenes. Sound sources with various attributes and properties can be placed anywhere in a virtual 3-D space. With a suitable 3-D audio player the user can navigate arbitrarily within such a scene, and corresponding audio is rendered for every position and orientation.
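As a rough illustration of position-dependent audio rendering (not the AudioBIFS algorithm and not from the paper), the following sketch applies inverse-distance attenuation and a simple left/right pan derived from the angle between the listener's viewing direction and the source; all names are hypothetical.

```python
import numpy as np

# Illustrative sketch: minimal position-dependent rendering of one mono
# sound source with distance attenuation and a crude stereo pan.

def render_source(sample, src_pos, listener_pos, listener_dir):
    """sample: mono amplitude; positions/direction: 2-D numpy arrays (x, y)."""
    to_src = src_pos - listener_pos
    dist = np.linalg.norm(to_src) + 1e-6
    gain = 1.0 / dist                                   # louder when closer
    fwd = listener_dir / np.linalg.norm(listener_dir)
    right = np.array([fwd[1], -fwd[0]])                 # listener's right-hand side
    pan = np.dot(to_src / dist, right)                  # -1 (left) .. +1 (right)
    left_gain = gain * (1.0 - pan) / 2.0
    right_gain = gain * (1.0 + pan) / 2.0
    return sample * left_gain, sample * right_gain
```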
IV. MPEG-4 AFX

Computer graphics research has of course continued successfully since the initial version of MPEG-4 BIFS was finalized. Some of these developments have been integrated into an extension of MPEG-4 called AFX [25]. Two of the new tools are of specific interest in the scope of this paper: light-field mapping (LFM) and depth image-based representation (DIBR). The first addresses the concept of surface light fields and the second the concept of layered depth images.

As mentioned before, surface light fields combine a classical 3-D mesh representation with the concept of light-field rendering. A light-field representation is used as a texture that is mapped onto the 3-D mesh model. Such a 3-D model is typically built out of thousands of triangles that approximate the 3-D surface of an object.
Texture mapping means the assignment of a colored pixel map onto the 3-D surface. The simplest way is to assign a single still image as texture. However, this leads to poor rendering results. In reality, natural materials look different from changing view angles, depending on reflectance properties, micro-structures, and lighting conditions. It is not possible to reproduce these properties with a single texture that looks the same from any direction. Therefore, conventional computer graphics employs sophisticated tools to model these material properties as well as possible. The results might look fine for purely computer-generated objects. However, it is extremely difficult to set these parameters such that they precisely mirror the material properties of real world objects.

A promising solution to this problem is to incorporate ideas from image-based rendering. As explained in Section II, the idea of this method is to describe the real world by multiple view images instead of graphical 3-D models. As a consequence, view-dependent texture mapping assigns more than one single texture to a triangle [7], [43]. Depending on the actual view direction, a realistic texture reproducing the natural material appearance is calculated from the available original views. The same concept is exploited for surface light fields such as the LFM tool in AFX [6], [57]. The result is an extremely realistic rendering of static objects. However, data acquisition requires special equipment and a quite complex manual procedure of content creation. A reliable automatic generation of view-dependent textures for moving objects does not seem achievable in the near future. Therefore, applications to dynamic video objects are not realistic so far.
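To illustrate the principle of view-dependent texturing (a generic sketch, not the AFX LFM algorithm), the code below assigns each original camera a blending weight that grows as its viewing direction aligns with the current virtual viewing direction; the exponent and names are assumptions.

```python
import numpy as np

# Illustrative sketch: simple view-dependent texture weighting for one triangle.

def blend_weights(view_dir, camera_dirs, power=8.0):
    """view_dir: unit vector toward the virtual camera.
    camera_dirs: (N, 3) unit vectors toward the original cameras.
    Returns N nonnegative weights summing to 1."""
    cosines = np.clip(camera_dirs @ view_dir, 0.0, 1.0)  # ignore back-facing cameras
    w = cosines ** power                                  # sharpen toward the closest view
    s = w.sum()
    return w / s if s > 0 else np.full(len(w), 1.0 / len(w))

# The rendered texture for the triangle is then a weighted sum of the
# per-camera textures: texture = sum_i w[i] * texture_i.
```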
The AFX tool DIBR implements the concept of layered depth images [50]. In this case, a 3-D object or scene is represented by a number of views with associated depth maps, as shown in Fig. 4 [2]. The depth maps define a depth value for every single pixel of the 2-D images. Together with appropriate scaling and information from camera calibration, it is possible to render virtual intermediate views, as shown in the middle image of Fig. 4. The quality of the rendered views and the possible range of navigation depend on the number of original views and the setting of the cameras. A special case of this method is stereo vision, where two views are generated according to the geometry of the human eye basis. In this case, the depth is often calculated using disparity estimation. Provided that the capturing cameras are fully calibrated and their 3-D geometry is known, corresponding depth values can be recalculated one-to-one from the estimated disparities. In the case of simple camera configurations (such as a conventional stereo rig or a multibaseline video system), this disparity estimation can even be used for fully automatic real-time depth reconstruction in 3-D video or 3-D-TV applications, as explained in Section V-C.
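The disparity-to-depth relation referred to here is the standard one for a calibrated, rectified stereo pair; the sketch below states it explicitly, with assumed example values for focal length and baseline.

```python
# Illustrative sketch: Z = f * B / d for a rectified stereo pair.
# 'focal_px' is the focal length in pixels, 'baseline_m' the camera distance
# in meters; both are assumed example values, not taken from the paper.

def disparity_to_depth(disparity_px, focal_px=1000.0, baseline_m=0.065):
    """Return metric depth for a given pixel disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a visible point")
    return focal_px * baseline_m / disparity_px

# Example: a point with 20 px disparity lies at 1000 * 0.065 / 20 = 3.25 m.
print(disparity_to_depth(20.0))
```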
V. MPEG EXPLORATION ON 3DAV

Up to this point, the considerations have mainly concentrated on computer graphics with integrated 2-D video. Some of the concepts, like panoramic views or DIBR, can easily be extended toward 3-D video. In this context, the term 3-D video shall refer to interactive and navigable representations of real world dynamic scenes as captured by real imagery. A lot of research has been done in this field during the last few years, resulting in different types of formats and technology for different types of application scenarios. To investigate the needs for standardization in this area, MPEG has established a working group called 3DAV [53].

Three main application scenarios have been identified, omnidirectional video, interactive stereo video, and free viewpoint video, and related requirements for standardization have been derived [23]. Suitable technology for their realization has been reviewed and evaluated experimentally [24]. The following sections give an overview of the results.

A. Common Issues

It has been identified that the definition of suitable quality measures is a common problem of all technology investigated in the context of 3DAV. For example, there are no original data with which to assess algorithms that compute intermediate views. Therefore, the definition of suitable quality measures is a critical task for all 3DAV technology. In many cases, only subjective criteria can be used.

Another common issue of technology that makes use of interpolated views, such as interactive stereo and free viewpoint video, is the need for accurate 3-D camera calibration information. A suitable data format, which is based on Tsai's fundamental work [56], has already been proposed in 3DAV.
Fig. 4. Example of AFX DIBR [2].

This description includes a 3-D world coordinate system as well as extrinsic, intrinsic, and sensor parameters of the capturing devices.
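A minimal sketch of what such a per-camera parameter set might look like is given below; it is not the actual 3DAV/Tsai syntax, and the field names are hypothetical.

```python
from dataclasses import dataclass
import numpy as np

# Illustrative sketch: container for per-camera calibration data and the
# projection of a world point into that camera.

@dataclass
class CameraCalibration:
    K: np.ndarray   # 3x3 intrinsic matrix (focal lengths, principal point)
    R: np.ndarray   # 3x3 rotation, world -> camera (extrinsic)
    t: np.ndarray   # 3-vector translation, world -> camera (extrinsic)

    def project(self, X_world):
        """Project a 3-D world point to pixel coordinates (u, v)."""
        X_cam = self.R @ X_world + self.t
        x = self.K @ X_cam
        return x[:2] / x[2]
```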
One problem of interactive applications is that the user may choose a virtual viewpoint and direction for which the corresponding view, or parts of it, cannot be rendered from the available data. This happens, for instance, when areas are revealed that are occluded in all available views, or when the user navigates out of the scope covered by the original views. The user navigation can be restricted to a meaningful range; however, this does not always solve the problem of disocclusions. In such cases, the missing information has to be extrapolated from the original views.
B. Omnidirectional Video

Fig. 5 shows a camera suitable for capturing omnidirectional video (based on the Telemmersion system [22]), which has the shape of a dodecahedron and includes 11 separate cameras to capture a nearly complete spherical field of view in high resolution. Other available systems use mirrors, either planar or hyperbolic, to project an omnidirectional view of a smaller portion of a sphere onto one or more camera sensors (see, e.g., [46]). Such an omnidirectional video can be displayed in a suitable player, with the key types of interaction being zoom and rotation to give the effect of looking around. However, in contrast to free viewpoint video, the user is not able to change the position of the viewpoint interactively. Nevertheless, the viewpoint may change over time, but this presupposes that the camera was moved during capturing. Projected onto a dome or viewed with a head-mounted display, the user can get the impression of being part of the scene.

Fig. 5 also shows an example image generated with the Dodeca 1000 camera system and postprocessed with the corresponding Immersive Media technology [38] to get a single overall view of the motion picture panorama. MPEG-4 BIFS already provides the means to represent and encode such omnidirectional video in a standardized way, such that it can be understood and displayed interactively by any MPEG-4 3-D player. The simplest solution is to define a 3-D sphere and to map the omnidirectional video onto it, in analogy to the box with video texture in the example of Fig. 2. The user's viewpoint would be placed at the center of the sphere, similar to the example in Fig. 3.

More sophisticated solutions use knowledge from cartography for a more efficient representation of the spherical video texture [53]. Obviously, the content in Fig. 5 is heavily distorted in some areas because of the mapping of a sphere to a plane. The distortion increases with proximity to the poles, which leads to an inefficient distribution of spatial resolution. It is well known that no rectangular coordinate map can accurately depict a sphere.
1) Telemmersion is a registered trademark.
2) Dodeca 1000 is a registered trademark.
Fig. 5. (a) Dodeca 1000 camera for capturing spherical video. (b) Image from Telemmersion video showing a spherical view.
Therefore, different types of projections have been proposed, which preserve either distances, shape, or area while distorting the others. The projection in Fig. 5 preserves distances, which is the type most commonly used for cartographic maps. This has the effect that, e.g., Greenland and Africa appear with the same area on common world maps, although the real area of Africa is more than ten times larger.

From the point of view of coding efficiency, a projection preserving area (i.e., areas on the sphere and on the map have the same size) is more desirable because it keeps the original video resolution. Fig. 6 shows an example of such an equal-area projection. The video texture is mapped onto a 3-D mesh object approximating a sphere, as illustrated in Fig. 6. In this case, the resolution of the mapped video texture fits well to the resolution of the original camera signals.
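The difference between the two kinds of projection can be made concrete with the following sketch (not from the paper): it maps a spherical direction to normalized texture coordinates for an equirectangular (distance-preserving along meridians) and for a Lambert-style cylindrical equal-area mapping. The function names are hypothetical.

```python
import numpy as np

# Illustrative sketch: two cylindrical sphere-to-texture mappings.
# The equirectangular map spends many pixel rows near the poles; the
# equal-area map keeps pixel area proportional to area on the sphere.

def equirectangular_uv(lon, lat):
    u = (lon + np.pi) / (2 * np.pi)        # 0..1 across 360° of longitude
    v = (lat + np.pi / 2) / np.pi          # 0..1 across 180° of latitude
    return u, v

def equal_area_uv(lon, lat):
    u = (lon + np.pi) / (2 * np.pi)
    v = (np.sin(lat) + 1.0) / 2.0          # rows compressed toward the poles
    return u, v

# Example: a point 10° from the north pole lands near the top edge in both
# maps, but the equal-area map allocates it far fewer texture rows.
print(equirectangular_uv(0.0, np.radians(80.0)), equal_area_uv(0.0, np.radians(80.0)))
```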
The flexible syntax of MPEG-4 BIFS also enables a variety of other solutions to represent and encode omnidirectional video. Among those, the group of polygonal mappings (proposed by Yamazawa et al.) has received particular attention in 3DAV. Here, the 3-D geometry is represented by a regular polyhedron such as a hexahedron, octahedron, or icosahedron. The associated 2-D video texture has the shape of the unwrapped surface and is therefore not rectangular. It is thus less intuitively readable at a glance than the common cartographic mappings shown in Figs. 5 and 6.
