`(12) Patent Application Publication (10) Pub. No.: US 2006/0146143 A1
`Xin et al.
(43) Pub. Date: Jul. 6, 2006
`
`
`(54) METHOD AND SYSTEM FOR MANAGING
`REFERENCE PICTURES IN MULTIVIEW
`VIDEOS
`
`(76) Inventors: Jun Xin, Quincy, MA (US); Emin
`Martinian, Waltham, MA (US);
`Alexander Behrens, Bueckeburg (DE);
`Anthony Vetro, Arlington, MA (US);
`Huifang Sun, Billerica, MA (US)
`
`Correspondence Address:
`Patent Department
`Mitsubishi Electric Research Laboratories, Inc.
`201 Broadway
`Cambridge, MA 02139 (US)
`
(21) Appl. No.: 11/292,393
(22) Filed: Nov. 30, 2005
`
`Related U.S. Application Data
`
(63) Continuation-in-part of application No. 11/015,390,
filed on Dec. 17, 2004.
`Publication Classification
`
(51) Int. Cl.
     H04N 5/225 (2006.01)
(52) U.S. Cl. .......................................................... 348/218.1
`
`(57)
`
`ABSTRACT
`
A system and method manages multiview videos. A reference picture list is maintained for each current frame of multiple multiview videos. The reference picture list indexes temporal reference pictures, spatial reference pictures and synthesized reference pictures of the multiview videos. Then, each current frame of the multiview videos is predicted according to reference pictures indexed by the associated reference picture list during encoding and decoding.
`
[Representative drawing (FIG. 4): input videos 1-4 (401-404) are fed to the MCTF/DCVF decomposition, controlled by the prediction modes 410 (spatial, temporal, view synthesis, intra), producing low band frames 411, high band frames 412, and side information 413.]
`
`
`
[Drawing sheets 1-19 (FIGS. 1-19) are not reproduced here; the figures are listed in the Brief Description of the Drawings below.]
`
`
`
`
`METHOD AND SYSTEM FOR MANAGING
`REFERENCE PICTURES IN MULTIVIEW VIDEOS
`
`RELATED APPLICATIONS
0001 This application is a continuation-in-part of U.S. patent application Ser. No. 11/015,390, entitled "Multiview Video Decomposition and Encoding" and filed by Xin et al. on Dec. 17, 2004. This application is related to U.S. patent application Ser. No. ______, entitled "Method and System for Synthesizing Multiview Videos," and U.S. patent application Ser. No. ______, entitled "Method for Randomly Accessing Multiview Videos," both of which were co-filed with this application by Xin et al. on Nov. 30, 2005.
`
`FIELD OF THE INVENTION
0002 This invention relates generally to encoding and decoding multiview videos, and more particularly to managing reference pictures while encoding and decoding multiview videos.
`
`BACKGROUND OF THE INVENTION
0003 Multiview video encoding and decoding is essential for applications such as three-dimensional television (3DTV), free viewpoint television (FTV), and multi-camera surveillance. Multiview video encoding and decoding is also known as dynamic light field compression.
0004 FIG. 1 shows a prior art 'simulcast' system 100 for multiview video encoding. Cameras 1-4 acquire sequences of frames, or videos, 101-104 of a scene 5. Each camera has a different view of the scene. Each video is encoded 111-114 independently to corresponding encoded videos 121-124. That system uses conventional 2D video encoding techniques. Therefore, that system does not exploit correlations between the different videos acquired by the cameras from the different viewpoints while predicting frames of the encoded video. Independent encoding decreases compression efficiency, and thus network bandwidth and storage requirements are increased.
0005 FIG. 2 shows a prior art disparity compensated prediction system 200 that does use inter-view correlations. Videos 201-204 are encoded 211-214 to encoded videos 231-234. The videos 201 and 204 are encoded independently using a standard video encoder such as MPEG-2 or H.264, also known as MPEG-4 Part 10. These independently encoded videos are 'reference' videos. The remaining videos 202 and 203 are encoded using temporal prediction and inter-view predictions based on reconstructed reference videos 251 and 252 obtained from decoders 221 and 222. Typically, the prediction is determined adaptively on a per-block basis; see S. C. Chan et al., "The data compression of simplified dynamic light fields," Proc. IEEE Int. Acoustics, Speech, and Signal Processing Conf., April 2003.
0006 FIG. 3 shows prior art lifting-based wavelet decomposition; see W. Sweldens, "The lifting scheme: A custom-design construction of biorthogonal wavelets," J. Appl. Comp. Harm. Anal., vol. 3, no. 2, pp. 186-200, 1996. Wavelet decomposition is an effective technique for static light field compression. Input samples 301 are split 310 into odd samples 302 and even samples 303. The odd samples are predicted 320 from the even samples. A prediction error forms high band samples 304. The high band samples are used to update 330 the even samples and to form low band samples 305. That decomposition is invertible, so that linear or non-linear operations can be incorporated into the prediction and update steps.
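
The split/predict/update steps above can be summarized in a short sketch. The code below is illustrative only; it assumes a Haar-like kernel applied to a 1-D array of even length, and the function names and NumPy dependency are ours rather than part of the cited work.

import numpy as np

def lifting_decompose(samples):
    """One lifting stage: split input samples 301 into even samples 303
    and odd samples 302, predict the odd samples from the even ones,
    and update the even samples with the prediction error (FIG. 3)."""
    even = samples[0::2].astype(float)   # even samples 303
    odd = samples[1::2].astype(float)    # odd samples 302
    high = odd - even                    # predict: prediction error -> high band samples 304
    low = even + 0.5 * high              # update: -> low band samples 305
    return low, high

def lifting_reconstruct(low, high):
    """The decomposition is invertible, so the inverse steps recover the input."""
    even = low - 0.5 * high
    odd = high + even
    samples = np.empty(even.size + odd.size)
    samples[0::2] = even
    samples[1::2] = odd
    return samples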
0007 The lifting scheme enables a motion-compensated temporal transform, i.e., motion compensated temporal filtering (MCTF) which, for videos, essentially filters along a temporal motion trajectory. A review of MCTF for video coding is described by Ohm et al., "Interframe wavelet coding - motion picture representation for universal scalability," Signal Processing: Image Communication, vol. 19, no. 9, pp. 877-908, October 2004. The lifting scheme can be based on any wavelet kernel such as Haar or 5/3 Daubechies, and any motion model such as block-based translation or affine global motion, without affecting the reconstruction.
0008 For encoding, the MCTF decomposes the video into high band frames and low band frames. Then, the frames are subjected to spatial transforms to reduce any remaining spatial correlations. The transformed low and high band frames, along with associated motion information, are entropy encoded to form an encoded bitstream. MCTF can be implemented using the lifting scheme shown in FIG. 3 with the temporally adjacent videos as input. In addition, MCTF can be applied recursively to the output low band frames.
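
As a rough illustration of paragraph 0008, the sketch below applies one temporal lifting stage to the frames of a single view and then re-applies it recursively to the resulting low band frames. Motion compensation, the spatial transforms and entropy coding are omitted, an even number of frames is assumed at every level, and the function names are ours.

import numpy as np

def temporal_lift(frames):
    """One temporal lifting stage over a list of frames (no motion compensation)."""
    even, odd = frames[0::2], frames[1::2]
    high = [o.astype(float) - e.astype(float) for e, o in zip(even, odd)]  # high band frames
    low = [e.astype(float) + 0.5 * h for e, h in zip(even, high)]          # low band frames
    return low, high

def mctf_decompose(frames, levels):
    """Recursively re-apply the stage to the output low band frames."""
    low, high_bands = list(frames), []
    for _ in range(levels):
        low, high = temporal_lift(low)
        high_bands.append(high)
    return low, high_bands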
0009 MCTF-based videos have a compression efficiency comparable to that of video compression standards such as H.264/AVC. In addition, the videos have inherent temporal scalability. However, that method cannot be used for directly encoding multiview videos in which there is a correlation between videos acquired from multiple views because there is no efficient method for predicting views that accounts for correlation in time.
0010 The lifting scheme has also been used to encode static light fields, i.e., single multiview images. Rather than performing a motion-compensated temporal filtering, the encoder performs a disparity compensated inter-view filtering (DCVF) across the static views in the spatial domain; see Chang et al., "Inter-view wavelet compression of light fields with disparity compensated lifting," SPIE Conf. on Visual Communications and Image Processing, 2003. For encoding, DCVF decomposes the static light field into high and low band images, which are then subject to spatial transforms to reduce any remaining spatial correlations. The transformed images, along with the associated disparity information, are entropy encoded to form the encoded bitstream. DCVF is typically implemented using the lifting-based wavelet transform scheme as shown in FIG. 3 with the images acquired from spatially adjacent camera views as input. In addition, DCVF can be applied recursively to the output low band images. DCVF-based static light field compression provides a better compression efficiency than independently coding the multiple frames. However, that method also cannot encode multiview videos in which both temporal correlation and spatial correlation between views are used because there is no efficient method for predicting views that accounts for correlation in time.
`
`SUMMARY OF THE INVENTION
0011 A method and system to decompose multiview videos acquired of a scene by multiple cameras is presented.
`
0012 Each multiview video includes a sequence of frames, and each camera provides a different view of the scene.
`0013 A prediction mode is selected from a temporal,
`spatial, view synthesis, and intra-prediction mode.
`0014. The multiview videos are then decomposed into
`low band frames, high band frames, and side information
`according to the selected prediction mode.
`0015. A novel video reflecting a synthetic view of the
`scene can also be generated from one or more of the
`multiview videos.
0016 More particularly, one embodiment of the invention provides a system and method for managing multiview videos. A reference picture list is maintained for each current frame of multiple multiview videos. The reference picture list indexes temporal reference pictures, spatial reference pictures and synthesized reference pictures of the multiview videos. Then, each current frame of the multiview videos is predicted according to reference pictures indexed by the associated reference picture list during encoding and decoding.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
0017 FIG. 1 is a block diagram of a prior art system for encoding multiview videos;
0018 FIG. 2 is a block diagram of a prior art disparity compensated prediction system for encoding multiview videos;
0019 FIG. 3 is a flow diagram of a prior art wavelet decomposition process;
0020 FIG. 4 is a block diagram of a MCTF/DCVF decomposition according to an embodiment of the invention;
0021 FIG. 5 is a block diagram of low band frames and high band frames as a function of time and space after the MCTF/DCVF decomposition according to an embodiment of the invention;
0022 FIG. 6 is a block diagram of prediction of a high band frame from adjacent low band frames according to an embodiment of the invention;
0023 FIG. 7 is a block diagram of a multiview coding system using macroblock-adaptive MCTF/DCVF decomposition according to an embodiment of the invention;
0024 FIG. 8 is a schematic of video synthesis according to an embodiment of the invention;
0025 FIG. 9 is a block diagram of a prior art reference picture management;
0026 FIG. 10 is a block diagram of multiview reference picture management according to an embodiment of the invention;
0027 FIG. 11 is a block diagram of multiview reference pictures in a decoded picture buffer according to an embodiment of the invention;
0028 FIG. 12 is a graph comparing coding efficiencies of different multiview reference picture orderings;
0029 FIG. 13 is a block diagram of dependencies of view mode on the multiview reference picture list manager according to an embodiment of the invention;
0030 FIG. 14 is a diagram of a prior art reference picture management for single view coding systems that employ prediction from temporal reference pictures;
0031 FIG. 15 is a diagram of a reference picture management for multiview coding and decoding systems that employ prediction from multiview reference pictures according to an embodiment of the invention;
0032 FIG. 16 is a block diagram of view synthesis in a decoder using depth information encoded and received as side information according to an embodiment of the invention;
0033 FIG. 17 is a block diagram of cost calculations for selecting a prediction mode according to an embodiment of the invention;
0034 FIG. 18 is a block diagram of view synthesis in a decoder using depth information estimated by a decoder according to an embodiment of the invention; and
0035 FIG. 19 is a block diagram of multiview videos using V-frames to achieve spatial random access in the decoder according to an embodiment of the invention.
`
`DETAILED DESCRIPTION OF THE
`EMBODIMENTS OF THE INVENTION
0036 One embodiment of our invention provides a joint temporal/inter-view processing method for encoding and decoding frames of multiview videos. Multiview videos are videos that are acquired of a scene by multiple cameras having different poses. We define the pose of a camera as both its 3D (x, y, z) position and its 3D (θ, ρ, φ) orientation. Each pose corresponds to a view of the scene.
0037 The method uses temporal correlation between frames within each video acquired for a particular camera pose, as well as spatial correlation between synchronized frames in videos acquired from multiple camera views. In addition, 'synthetic' frames can be correlated, as described below.
`0038. In one embodiment, the temporal correlation uses
`motion compensated temporal filtering (MCTF), while the
`spatial correlation uses disparity compensated inter-view
`filtering (DCVF).
0039 In another embodiment of the invention, spatial correlation uses prediction of one view from synthesized frames that are generated from neighboring frames. Neighboring frames are temporally or spatially adjacent frames, for example, frames before or after a current frame in the temporal domain, or one or more frames acquired at the same instant in time but from cameras having different poses or views of the scene.
0040 Each frame of each video includes macroblocks of pixels. Therefore, the method of multiview video encoding and decoding according to one embodiment of the invention is macroblock adaptive. The encoding and decoding of a current macroblock in a current frame is performed using several possible prediction modes, including various forms of temporal, spatial, view synthesis, and intra prediction. To determine the best prediction mode on a macroblock basis, one embodiment of the invention provides a method for selecting a prediction mode. The method can be used for any number of camera arrangements.
0041 In order to maintain compatibility with existing single-view encoding and decoding systems, a method for managing a reference picture list is described. Specifically, we describe a method of inserting and removing reference pictures from a picture buffer according to the reference picture list. The reference pictures include temporal reference pictures, spatial reference pictures and synthesized reference pictures.
`0042. As used herein, a reference picture is defined as any
`frame that is used during the encoding and decoding to
`predict a current frame. Typically, reference pictures are
`spatially or temporally adjacent or neighboring to the
`current frame.
0043 It is important to note that the same operations are applied in both the encoder and decoder because the same set of reference pictures is used at any given time instant to encode and decode the current frame.
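
A minimal sketch of the kind of reference picture list described here is given below. The class and field names are hypothetical and are not taken from any standard or reference software; the sketch only illustrates that a single list indexes temporal, spatial and synthesized reference pictures, and that identical insert and remove operations must be applied in the encoder and the decoder.

from dataclasses import dataclass, field
from enum import Enum

class RefType(Enum):
    TEMPORAL = 1      # same view, different time instant
    SPATIAL = 2       # same time instant, different view
    SYNTHESIZED = 3   # rendered from neighboring views

@dataclass
class ReferencePicture:
    view: int
    time: int
    ref_type: RefType
    frame: object = None          # decoded or synthesized pixel data

@dataclass
class MultiviewRPL:
    """Reference picture list maintained for one current frame."""
    pictures: list = field(default_factory=list)

    def insert(self, pic):
        """Add a reference picture; the returned index is what gets signaled."""
        self.pictures.append(pic)
        return len(self.pictures) - 1

    def remove(self, view, time):
        """Drop pictures that are no longer needed for prediction."""
        self.pictures = [p for p in self.pictures
                         if not (p.view == view and p.time == time)]

    def get(self, idx):
        return self.pictures[idx]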
`0044 One embodiment of the invention enables random
`access to the frames of the multiview videos during encod
`ing and decoding. This improves coding efficiency.
`0045 MCTF/DCVF Decomposition
0046 FIG. 4 shows a MCTF/DCVF decomposition 400 according to one embodiment of the invention. Frames of input videos 401-404 are acquired of a scene 5 by cameras 1-4 having different poses. Note that, as shown in FIG. 8, some of the cameras 1a and 1b can be at the same locations but with different orientations. It is assumed that there is some amount of view overlap between any pair of cameras. The poses of the cameras can change while acquiring the multiview videos. Typically, the cameras are synchronized with each other. Each input video provides a different view of the scene. The input frames 401-404 are sent to a MCTF/DCVF decomposition 400. The decomposition produces encoded low band frames 411, encoded high band frames 412, and associated side information 413. The high band frames encode prediction errors using the low band frames as reference pictures. The decomposition is according to selected prediction modes 410. The prediction modes include spatial, temporal, view synthesis, and intra prediction modes. The prediction modes can be selected adaptively on a per-macroblock basis for each current frame. With intra prediction, the current macroblock is predicted from other macroblocks in the same frame.
0047 FIG. 5 shows a preferred alternating checkerboard pattern of the low band frames (L) 411 and the high band frames (H) 412 for a neighborhood of frames 510. The frames have a spatial (view) dimension 501 and a temporal dimension 502. Essentially, the pattern alternates low band frames and high band frames in the spatial dimension for a single instant in time, and additionally alternates temporally the low band frames and the high band frames for a single video.
0048 There are several advantages of this checkerboard pattern. The pattern distributes low band frames evenly in both the space and time dimensions, which achieves scalability in space and time when a decoder only reconstructs the low band frames. In addition, the pattern aligns the high band frames with adjacent low band frames in both the space and time dimensions. This maximizes the correlation between reference pictures from which the predictions of the errors in the current frame are made, as shown in FIG. 6.
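
The checkerboard assignment can be written down directly. The snippet below is only a sketch of one possible labeling convention consistent with the alternation described above; the opposite phase would work equally well.

def band_label(view, time):
    """Assign low band (L) or high band (H) frames in a checkerboard over
    the spatial (view) dimension 501 and the temporal dimension 502."""
    return "L" if (view + time) % 2 == 0 else "H"

# Example: 4 views by 8 time instants
for v in range(4):
    print(" ".join(band_label(v, t) for t in range(8)))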
0049 According to a lifting-based wavelet transform, the high band frames 412 are generated by predicting one set of samples from the other set of samples. The prediction can be achieved using a number of modes including various forms of temporal prediction, various forms of spatial prediction, and a view synthesis prediction according to the embodiments of the invention described below.
0050 The means by which the high band frames 412 are predicted and the necessary information required to make the prediction are referred to as the side information 413. If a temporal prediction is performed, then the temporal mode is signaled as part of the side information along with corresponding motion information. If a spatial prediction is performed, then the spatial mode is signaled as part of the side information along with corresponding disparity information. If view synthesis prediction is performed, then the view synthesis mode is signaled as part of the side information along with corresponding disparity, motion and depth information.
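
The per-macroblock side information can be pictured as a small record whose optional fields depend on the signaled mode; the layout below is purely illustrative and does not reflect an actual bitstream syntax.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MacroblockSideInfo:
    mode: str                                            # "temporal", "spatial", "view_synthesis" or "intra"
    motion_vector: Optional[Tuple[int, int]] = None      # present for temporal prediction
    disparity_vector: Optional[Tuple[int, int]] = None   # present for spatial prediction
    depth: Optional[float] = None                        # present for view synthesis prediction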
0051 As shown in FIG. 6, the prediction of each current frame 600 uses neighboring frames 510 in both the space and time dimensions. The frames that are used for predicting the current frame are called reference pictures. The reference pictures are maintained in the reference list, which is part of the encoded bitstream. The reference pictures are stored in the decoded picture buffer.
0052 In one embodiment of the invention, the MCTF and DCVF are applied adaptively to each current macroblock for each frame of the input videos to yield decomposed low band frames, as well as the high band frames and the associated side information. In this way, each macroblock is processed adaptively according to a best prediction mode. An optimal method for selecting the prediction mode is described below.
0053 In one embodiment of the invention, the MCTF is first applied to the frames of each video independently. The resulting frames are then further decomposed with the DCVF. In addition to the final decomposed frames, the corresponding side information is also generated. If performed on a macroblock basis, then the prediction mode selections for the MCTF and the DCVF are considered separately. As an advantage, this prediction mode selection inherently supports temporal scalability. In this way, lower temporal rates of the videos are easily accessed in the compressed bitstream.
0054 In another embodiment, the DCVF is first applied to the frames of the input videos. The resulting frames are then temporally decomposed with the MCTF. In addition to the final decomposed frames, the side information is also generated. If performed on a macroblock basis, then the prediction mode selections for the MCTF and DCVF are considered separately. As an advantage, this selection inherently supports spatial scalability. In this way, a reduced number of the views are easily accessed in the compressed bitstream.
0055 The decomposition described above can be applied recursively on the resulting set of low band frames from a previous decomposition stage. As an advantage, our MCTF/DCVF decomposition 400 effectively removes both temporal and spatial (inter-view) correlations, and can achieve a very high compression efficiency. The compression efficiency of our multiview video encoder outperforms conventional simulcast encoding, which encodes each video for each view independently.
`0056 Coding of MCTF/DCVF Decomposition
0057 As shown in FIG. 7, the outputs 411 and 412 of the decomposition 400 are fed to a signal encoder 710, and the output 413 is fed to a side information encoder 720. The signal encoder 710 performs a transform, quantization and entropy coding to remove any remaining correlations in the decomposed low band and high band frames 411-412. Such operations are well known in the art; see Netravali and Haskell, Digital Pictures: Representation, Compression and Standards, Second Edition, Plenum Press, 1995.
0058 The side information encoder 720 encodes the side information 413 generated by the decomposition 400. In addition to the prediction mode and the reference picture list, the side information 413 includes motion information corresponding to the temporal predictions, disparity information corresponding to the spatial predictions and view synthesis, and depth information corresponding to the view synthesis predictions.
0059 Encoding the side information can be achieved by known and established techniques, such as the techniques used in the MPEG-4 Visual standard, ISO/IEC 14496-2, "Information technology - Coding of audio-visual objects - Part 2: Visual," 2nd Edition, 2001, or the more recent H.264/AVC standard, ITU-T Recommendation H.264, "Advanced video coding for generic audiovisual services," 2004.
0060 For instance, motion vectors of the macroblocks are typically encoded using predictive methods that determine a prediction vector from vectors in macroblocks in reference pictures. The difference between the prediction vector and the current vector is then subject to an entropy coding process, which typically uses the statistics of the prediction error. A similar procedure can be used to encode disparity vectors.
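
The predictive vector coding described above can be sketched as follows. The median predictor and the entropy_encode callback are stand-ins chosen for illustration; the document does not specify which predictor or entropy coder is used.

def median_predictor(neighbor_mvs):
    """Component-wise median of the motion vectors of neighboring macroblocks."""
    xs = sorted(mv[0] for mv in neighbor_mvs)
    ys = sorted(mv[1] for mv in neighbor_mvs)
    mid = len(neighbor_mvs) // 2
    return xs[mid], ys[mid]

def encode_vector(current_mv, neighbor_mvs, entropy_encode):
    """Encode only the difference between the current and the predicted vector,
    so the entropy coder can exploit the statistics of the prediction error.
    The same procedure applies to disparity vectors."""
    px, py = median_predictor(neighbor_mvs)
    diff = (current_mv[0] - px, current_mv[1] - py)
    return entropy_encode(diff)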
`0061
`Furthermore, depth information for each macrob
`lock can be encoded using predictive coding methods in
`which a prediction from macroblocks in reference pictures is
`obtained, or by simply using a fixed length code to express
`the depth value directly. If pixel level accuracy for the depth
`is extracted and compressed, then texture coding techniques
`that apply transform, quantization and entropy coding tech
`niques can be applied.
`0062) The encoded signals 711–713 from the signal
`encoder 710 and side information encoder 720 can be
`multiplexed 730 to produce an encoded output bitstream
`T31.
`0063 Decoding of MCTF/DCVF Decomposition
0064 The bitstream 731 can be decoded 740 to produce output multiview videos 741 corresponding to the input multiview videos 401-404. Optionally, synthetic video can also be generated. Generally, the decoder performs the inverse operations of the encoder to reconstruct the multiview videos. If all low band and high band frames are decoded, then the full set of frames in both the space (view) dimension and time dimension at the encoded quality are reconstructed and available.
0065 Depending on the number of recursive levels of decomposition that were applied in the encoder and which type of decompositions were applied, a reduced number of videos and/or a reduced temporal rate can be decoded, as shown in FIG. 7.
0066 View Synthesis
0067 As shown in FIG. 8, view synthesis is a process by which frames 801 of a synthesized video are generated from frames 803 of one or more actual multiview videos. In other words, view synthesis provides a means to synthesize the frames 801 corresponding to a selected novel view 802 of the scene 5. This novel view 802 may correspond to a virtual camera 800 not present at the time the input multiview videos 401-404 were acquired, or the view can correspond to a camera view that is acquired, whereby the synthesized view will be used for prediction and encoding/decoding of this view as described below.
`0068 If one video is used, then the synthesis is based on
`extrapolation or warping, and if multiple videos are used,
`then the synthesis is based on interpolation.
`0069 Given the pixel values of frames 803 of one or
`more multiview videos and the depth values of points in the
`scene, the pixels in the frames 801 for the synthetic view 802
`can be synthesized from the corresponding pixel values in
`the frames 803.
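
As one way to make the preceding statement concrete, the sketch below forward-warps a single reference frame 803 into a synthetic view 802 given per-pixel depth and pinhole camera parameters (extrinsics R, t with camera coordinates R·X + t). It performs no hole filling or multi-view blending and is not the method prescribed here; all names and conventions are assumptions made for illustration.

import numpy as np

def synthesize_view(src_img, src_depth, K_src, R_src, t_src, K_dst, R_dst, t_dst):
    """Point-wise warp of one reference view into a synthetic view."""
    H, W = src_depth.shape
    out = np.zeros_like(src_img)
    K_src_inv = np.linalg.inv(K_src)
    for v in range(H):
        for u in range(W):
            z = src_depth[v, u]
            if z <= 0:
                continue
            # back-project pixel (u, v) with depth z into world coordinates
            p_cam = z * (K_src_inv @ np.array([u, v, 1.0]))
            p_world = R_src.T @ (p_cam - t_src)
            # project the 3D point into the target (synthetic) camera
            q = K_dst @ (R_dst @ p_world + t_dst)
            if q[2] <= 0:
                continue
            u2, v2 = int(round(q[0] / q[2])), int(round(q[1] / q[2]))
            if 0 <= u2 < W and 0 <= v2 < H:
                out[v2, u2] = src_img[v, u]
    return out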
0070 View synthesis is commonly used in computer graphics for rendering still images for multiple views; see Buehler et al., "Unstructured Lumigraph Rendering," Proc. ACM SIGGRAPH, 2001. That method requires extrinsic and intrinsic parameters for the cameras.
`0071 View synthesis for compressing multiview videos
`is novel. In one embodiment of our invention, we generate
`synthesized frames to be used for predicting the current
`frame. In one embodiment of the invention, synthesized
`frames are generated for designated high band frames. In
`another embodiment of the invention, synthesized frames
`are generated for specific views. The synthesized frames
`serve as reference pictures from which a current synthesized
`frame can be predicted.
`0072 One difficulty with this approach is that the depth
`values of the scene 5 are unknown. Therefore, we estimate
`the depth values using known techniques, e.g., based on
`correspondences of features in the multiview videos.
0073 Alternatively, for each synthesized video, we generate multiple synthesized frames, each corresponding to a candidate depth value. For each macroblock in the current frame, the best matching macroblock in the set of synthesized frames is determined. The synthesized frame from which this best match is found indicates the depth value of the macroblock in the current frame. This process is repeated for each macroblock in the current frame.
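
A sketch of this per-macroblock candidate-depth search is shown below, using a sum-of-squared-difference match; the dictionary layout and names are assumptions made for illustration.

import numpy as np

def best_depth_for_macroblock(current_mb, mb_pos, synth_frames_by_depth):
    """Pick the candidate depth whose synthesized frame best matches the
    current macroblock; returns (best_depth, best_matching_block)."""
    y, x = mb_pos
    h, w = current_mb.shape[:2]
    best_depth, best_block, best_cost = None, None, float("inf")
    for depth, frame in synth_frames_by_depth.items():
        block = frame[y:y + h, x:x + w]
        cost = float(np.sum((current_mb.astype(float) - block.astype(float)) ** 2))
        if cost < best_cost:
            best_depth, best_block, best_cost = depth, block, cost
    return best_depth, best_block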
0074 A difference between the current macroblock and the synthesized block is encoded and compressed by the signal encoder 710. The side information for this multiview mode is encoded by the side information encoder 720. The side information includes a signal indicating the view synthesis prediction mode, the depth value of the macroblock, and an optional displacement vector that compensates for any misalignments between the macroblock in the current frame and the best matching macroblock in the synthesized frame to be compensated.
`0075 Prediction Mode Selection
`0076). In the macroblock-adaptive MCTF/DCVF decom
`position, the prediction modem for each macroblock can be
`selected by minimizing a cost function adaptively on a per
`macroblock basis:
`
`n = argmin (n),
`
`where J(m)=D(m)+ R(m), and D is distortion, w is a weight
`ing parameter, R is rate, m indicates the set of candidate
`prediction modes, and m * indicates the optimal prediction
`mode that has been selected based on a minimum cost
`criteria.
0077 The candidate modes m include various modes of temporal, spatial, view synthesis, and intra prediction. The cost function J(m) depends on the rate and distortion resulting from encoding the macroblock using a specific prediction mode m.
0078 The distortion D measures a difference between a reconstructed macroblock and a source macroblock. The reconstructed macroblock is obtained by encoding and decoding the macroblock using the given prediction mode m. A common distortion measure is a sum of squared difference. The rate R corresponds to the number of bits needed to encode the macroblock, including the prediction error and the side information. The weighting parameter λ controls the rate-distortion tradeoff of the macroblock coding, and can be derived from a size of a quantization step.
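
The rate-distortion selection m* = arg min_m J(m) with J(m) = D(m) + λR(m) can be sketched directly. The encode_with_mode callback, assumed to return the reconstructed macroblock and the number of bits spent for a candidate mode, stands in for the encoder's temporal, spatial, view synthesis and intra prediction paths.

import numpy as np

def select_prediction_mode(macroblock, candidate_modes, encode_with_mode, lam):
    """Return the mode minimizing J(m) = D(m) + lam * R(m), where D is the
    sum of squared differences and R is the number of bits used."""
    best_mode, best_cost = None, float("inf")
    for m in candidate_modes:
        recon, bits = encode_with_mode(macroblock, m)
        dist = float(np.sum((recon.astype(float) - macroblock.astype(float)) ** 2))
        cost = dist + lam * bits
        if cost < best_cost:
            best_mode, best_cost = m, cost
    return best_mode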
0079 Detailed aspects of the encoding and decoding processes are described in further detail below. In particular, the various data structures that are used by the encoding and decoding processes are described. It should be understood that the data structures, as described herein, that are used in the encoder are identical to corresponding data structures used in the decoder. It should also be understood that the processing steps of the decoder essentially follow the same processing steps as the encoder, but in an inverse order.
`0080 Reference Picture Management
`0081
`FIG. 9 shows a reference picture management for
`prior art single-view encoding and decoding systems. Tem
`poral reference pictures 901 are managed by a single-view
`reference picture list (RPL) manager 910, whi