
US 20060146138 A1

(19) United States
(12) Patent Application Publication    (10) Pub. No.: US 2006/0146138 A1
     Xin et al.                        (43) Pub. Date: Jul. 6, 2006

(54) METHOD AND SYSTEM FOR SYNTHESIZING MULTIVIEW VIDEOS

(76) Inventors: Jun Xin, Quincy, MA (US); Emin Martinian, Waltham, MA (US); Alexander Behrens, Bueckeburg (DE); Anthony Vetro, Arlington, MA (US); Huifang Sun, Billerica, MA (US)

Correspondence Address:
Patent Department
Mitsubishi Electric Research Laboratories, Inc.
201 Broadway
Cambridge, MA 02139 (US)

(21) Appl. No.: 11/292,167

(22) Filed: Nov. 30, 2005

Related U.S. Application Data

(63) Continuation-in-part of application No. 11/015,390, filed on Dec. 17, 2004.

Publication Classification

(51) Int. Cl.
     H04N 5/225 (2006.01)
(52) U.S. Cl. ........ 348/207.99
(57) ABSTRACT

A system and method synthesizes multiview videos. Multiview videos are acquired of a scene with corresponding cameras arranged at poses such that there is view overlap between any pair of cameras. A synthesized multiview video is generated from the acquired multiview videos for a virtual camera. A reference picture list is maintained for each current frame of each of the multiview videos and the synthesized video. The reference picture list indexes temporal reference pictures and spatial reference pictures of the acquired multiview videos and the synthesized reference pictures of the synthesized multiview video. Then, each current frame of the multiview videos is predicted according to reference pictures indexed by the associated reference picture list during encoding and decoding.
[Cover figure: FIG. 4, the MCTF/DCVF decomposition 400. Input videos 1-4 (401-404) feed the MCTF/DCVF decomposition, which operates under the prediction modes 410 (spatial, temporal, view synthesis, intra) and produces low band frames 411, high band frames 412, and side information 413.]

[Drawing sheets 1 through 19 of 19. The figures themselves are not reproducible in this text extraction; only their captions and surviving labels are noted below.

FIG. 1 (PRIOR ART): prior art simulcast system 100 for encoding multiview videos; a scene is acquired as videos 101-104, and encoders produce encoded videos 121-124.

FIG. 2 (PRIOR ART): prior art disparity compensated prediction system 200; videos 201-204, encoders 211-214, encoded videos 231-234, reconstructed reference video 252.

FIG. 3 (PRIOR ART): prior art wavelet decomposition process; odd samples 302, high band samples 304, and low band samples.

FIG. 4: MCTF/DCVF decomposition 400; input videos 1-4 (401-404), prediction modes 410 (spatial, temporal, view synthesis, intra), low band frames 411, high band frames 412, side information 413.

FIG. 5: low band and high band frames as a function of space (501) and time (502).

FIG. 6: prediction of a high band frame from adjacent low band frames.

FIG. 7: multiview coding system 700; macroblock-adaptive MCTF/DCVF decomposition, signal encoder outputs 711-712, side information 413 and side information encoder 720, multiplexer 730 producing output bitstream 731, and decoder 740 producing output videos 741.

FIG. 8: schematic of video synthesis.

FIG. 9 (PRIOR ART): single-view RPL manager 910 managing temporal reference pictures; insert pictures and remove pictures operations (930), temporal prediction 960.

FIG. 10: multi-view RPL manager 1001 managing temporal, spatial, and synthesized reference pictures 1003; insert and remove pictures operations (1020, 1030, 1040), multiview prediction 1050.

FIG. 11: multiview reference pictures in a decoded picture buffer (1101, 1102, 1103).

FIG. 12: graph comparing coding efficiencies of different multiview reference picture orderings.

FIG. 13: dependencies of view mode on the multi-view RPL manager 1001; temporal, spatial, and synthesized reference pictures 1003; reference pictures with higher correlation versus reference pictures with lower correlation affect prediction efficiency; multiview prediction.

FIG. 14 (PRIOR ART): prior art reference picture management for single view coding systems employing prediction from temporal reference pictures.

FIG. 15: reference picture management for multiview coding and decoding systems employing prediction from multiview reference pictures.

FIG. 16: view synthesis in a decoder using depth received as side information; side information decoder 413, depth, view synthesis, decoded residual macroblock, synthesized macroblock, spatial reference pictures, reconstructed macroblock (reference numerals 1901-1905, 1920, 1930).

FIG. 17: cost calculations for selecting a prediction mode: determine coding cost using temporal prediction (m1, motion estimation and motion vectors, temporal reference pictures), spatial prediction (m2, disparity vectors, spatial reference pictures), view synthesis prediction (m3, depth estimation and view synthesis, synthesized view), and intra prediction (m4, neighboring pixels); the minimum cost 2070 selects the best mode 2091 (further reference numerals 2010, 2020-2021, 2030-2031, 2042, 2090).

FIG. 18: view synthesis in a decoder with depth estimation performed at the decoder; view synthesis, depth estimation, synthesized macroblock, decoded residual macroblock, spatial reference pictures, reconstructed macroblock (reference numerals 2101, 2120, 2121).

FIG. 19: multiview videos using V-frames to achieve spatial random access in the decoder.]

METHOD AND SYSTEM FOR SYNTHESIZING MULTIVIEW VIDEOS

RELATED APPLICATION

[0001] This application is a continuation-in-part of U.S. patent application Ser. No. 11/015,390 entitled "Multiview Video Decomposition and Encoding" and filed by Xin et al. on Dec. 17, 2004. This application is related to U.S. patent application Ser. No. XX/XXX,XXX entitled "Method and System for Managing Reference Pictures in Multiview Videos" and U.S. patent application Ser. No. XX/XXX,XXX entitled "Method for Randomly Accessing Multiview Videos", both of which were co-filed with this application by Xin et al. on Nov. 30, 2005.

FIELD OF THE INVENTION

[0002] This invention relates generally to encoding and decoding multiview videos, and more particularly to synthesizing multiview videos.

BACKGROUND OF THE INVENTION

[0003] Multiview video encoding and decoding is essential for applications such as three dimensional television (3DTV), free viewpoint television (FTV), and multicamera surveillance. Multiview video encoding and decoding is also known as dynamic light field compression.

[0004] FIG. 1 shows a prior art 'simulcast' system 100 for multiview video encoding. Cameras 1-4 acquire sequences of frames or videos 101-104 of a scene 5. Each camera has a different view of the scene. Each video is encoded 111-114 independently to corresponding encoded videos 121-124. That system uses conventional 2D video encoding techniques. Therefore, that system does not correlate between the different videos acquired by the cameras from the different viewpoints while predicting frames of the encoded video. Independent encoding decreases compression efficiency, and thus network bandwidth and storage are increased.
[0005] FIG. 2 shows a prior art disparity compensated prediction system 200 that does use inter-view correlations. Videos 201-204 are encoded 211-214 to encoded videos 231-234. The videos 201 and 204 are encoded independently using a standard video encoder such as MPEG-2 or H.264, also known as MPEG-4 Part 10. These independently encoded videos are 'reference' videos. The remaining videos 202 and 203 are encoded using temporal prediction and inter-view predictions based on reconstructed reference videos 251 and 252 obtained from decoders 221 and 222. Typically, the prediction is determined adaptively on a per block basis, S. C. Chan et al., "The data compression of simplified dynamic light fields," Proc. IEEE Int. Acoustics, Speech, and Signal Processing Conf., April, 2003.
[0006] FIG. 3 shows prior art 'lifting-based' wavelet decomposition, see W. Sweldens, "The data compression of simplified dynamic light fields," J. Appl. Comp. Harm. Anal., vol. 3, no. 2, pp. 186-200, 1996. Wavelet decomposition is an effective technique for static light field compression. Input samples 301 are split 310 into odd samples 302 and even samples 303. The odd samples are predicted 320 from the even samples. A prediction error forms high band samples 304. The high band samples are used to update 330 the even samples and to form low band samples 305. That decomposition is invertible so that linear or non-linear operations can be incorporated into the prediction and update steps.
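For concreteness, the split/predict/update steps above can be sketched in a few lines of Python. This is an illustrative sketch only, assuming a Haar-like predict and a half-weight update; the function names and the kernel choice are ours and are not prescribed by the reference:

```python
import numpy as np

def lifting_decompose(samples):
    """One level of lifting: split input samples into odd/even halves,
    predict the odd samples from the even ones (the prediction error
    forms the high band), then update the evens to form the low band."""
    samples = np.asarray(samples, dtype=float)
    even, odd = samples[0::2], samples[1::2]   # split step (310)
    high = odd - even[:len(odd)]               # predict step (320): high band (304)
    low = even.copy()
    low[:len(high)] += high / 2.0              # update step (330): low band (305)
    return low, high

def lifting_reconstruct(low, high):
    """The decomposition is invertible: undo the update, then the predict."""
    even = low.copy()
    even[:len(high)] -= high / 2.0
    odd = high + even[:len(high)]
    out = np.empty(len(even) + len(odd))
    out[0::2], out[1::2] = even, odd
    return out
```

Because each step is undone exactly, linear or non-linear predict/update operators can be substituted without breaking reconstruction, which is the property the lifting scheme trades on.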
[0007] The lifting scheme enables a motion-compensated temporal transform, i.e., motion compensated temporal filtering (MCTF) which, for videos, essentially filters along a temporal motion trajectory. A review of MCTF for video coding is described by Ohm et al., "Interframe wavelet coding - motion picture representation for universal scalability," Signal Processing: Image Communication, vol. 19, no. 9, pp. 877-908, October 2004. The lifting scheme can be based on any wavelet kernel such as Haar or 5/3 Daubechies, and any motion model such as block-based translation or affine global motion, without affecting the reconstruction.
[0008] For encoding, the MCTF decomposes the video into high band frames and low band frames. Then, the frames are subjected to spatial transforms to reduce any remaining spatial correlations. The transformed low and high band frames, along with associated motion information, are entropy encoded to form an encoded bitstream. MCTF can be implemented using the lifting scheme shown in FIG. 3 with the temporally adjacent videos as input. In addition, MCTF can be applied recursively to the output low band frames.

[0009] MCTF-based videos have a compression efficiency comparable to that of video compression standards such as H.264/AVC. In addition, the videos have inherent temporal scalability. However, that method cannot be used for directly encoding multiview videos in which there is a correlation between videos acquired from multiple views because there is no efficient method for predicting views that accounts for correlation in time.
[0010] The lifting scheme has also been used to encode static light fields, i.e., single multiview images. Rather than performing a motion-compensated temporal filtering, the encoder performs a disparity compensated inter-view filtering (DCVF) across the static views in the spatial domain, see Chang et al., "Inter-view wavelet compression of light fields with disparity compensated lifting," SPIE Conf. on Visual Communications and Image Processing, 2003. For encoding, DCVF decomposes the static light field into high and low band images, which are then subject to spatial transforms to reduce any remaining spatial correlations. The transformed images, along with the associated disparity information, are entropy encoded to form the encoded bitstream. DCVF is typically implemented using the lifting-based wavelet transform scheme as shown in FIG. 3 with the images acquired from spatially adjacent camera views as input. In addition, DCVF can be applied recursively to the output low band images. DCVF-based static light field compression provides a better compression efficiency than independently coding the multiple frames. However, that method also cannot encode multiview videos in which both temporal correlation and spatial correlation between views are used because there is no efficient method for predicting views that accounts for correlation in time.
SUMMARY OF THE INVENTION

[0011] A method and system to decompose multiview videos acquired of a scene by multiple cameras is presented.
[0012] Each multiview video includes a sequence of frames, and each camera provides a different view of the scene.

[0013] A prediction mode is selected from a temporal, spatial, view synthesis, and intra-prediction mode.

[0014] The multiview videos are then decomposed into low band frames, high band frames, and side information according to the selected prediction mode.

[0015] A novel video reflecting a synthetic view of the scene can also be generated from one or more of the multiview videos.

[0016] More particularly, one embodiment of the invention provides a system and method for synthesizing multiview videos. Multiview videos are acquired of a scene with corresponding cameras arranged at poses such that there is view overlap between any pair of cameras. A synthesized multiview video is generated from the acquired multiview videos for a virtual camera. A reference picture list is maintained for each current frame of each of the multiview videos and the synthesized video. The reference picture list indexes temporal reference pictures and spatial reference pictures of the acquired multiview videos and the synthesized reference pictures of the synthesized multiview video. Then, each current frame of the multiview videos is predicted according to reference pictures indexed by the associated reference picture list during encoding and decoding.

BRIEF DESCRIPTION OF THE DRAWINGS

[0017] FIG. 1 is a block diagram of a prior art system for encoding multiview videos;

[0018] FIG. 2 is a block diagram of a prior art disparity compensated prediction system for encoding multiview videos;

[0019] FIG. 3 is a flow diagram of a prior art wavelet decomposition process;

[0020] FIG. 4 is a block diagram of a MCTF/DCVF decomposition according to an embodiment of the invention;

[0021] FIG. 5 is a block diagram of low band frames and high band frames as a function of time and space after the MCTF/DCVF decomposition according to an embodiment of the invention;

[0022] FIG. 6 is a block diagram of prediction of a high band frame from adjacent low band frames according to an embodiment of the invention;

[0023] FIG. 7 is a block diagram of a multiview coding system using macroblock-adaptive MCTF/DCVF decomposition according to an embodiment of the invention;

[0024] FIG. 8 is a schematic of video synthesis according to an embodiment of the invention;

[0025] FIG. 9 is a block diagram of a prior art reference picture management;

[0026] FIG. 10 is a block diagram of multiview reference picture management according to an embodiment of the invention;

[0027] FIG. 11 is a block diagram of multiview reference pictures in a decoded picture buffer according to an embodiment of the invention;

[0028] FIG. 12 is a graph comparing coding efficiencies of different multiview reference picture orderings;

[0029] FIG. 13 is a block diagram of dependencies of view mode on the multiview reference picture list manager according to an embodiment of the invention;

[0030] FIG. 14 is a diagram of a prior art reference picture management for single view coding systems that employ prediction from temporal reference pictures;

[0031] FIG. 15 is a diagram of a reference picture management for multiview coding and decoding systems that employ prediction from multiview reference pictures according to an embodiment of the invention;

[0032] FIG. 16 is a block diagram of view synthesis in a decoder using depth information encoded and received as side information according to an embodiment of the invention;

[0033] FIG. 17 is a block diagram of cost calculations for selecting a prediction mode according to an embodiment of the invention;

[0034] FIG. 18 is a block diagram of view synthesis in a decoder using depth information estimated by a decoder according to an embodiment of the invention; and

[0035] FIG. 19 is a block diagram of multiview videos using V-frames to achieve spatial random access in the decoder according to an embodiment of the invention.
DETAILED DESCRIPTION OF THE EMBODIMENTS OF THE INVENTION

[0036] One embodiment of our invention provides a joint temporal/inter-view processing method for encoding and decoding frames of multiview videos. Multiview videos are videos that are acquired of a scene by multiple cameras having different poses. We define the pose of a camera as both its 3D (x, y, z) position and its 3D (θ, ρ, ψ) orientation. Each pose corresponds to a 'view' of the scene.
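Stated as data, a pose is just six scalars. A minimal sketch (the type and field names are hypothetical, chosen only to mirror the definition above):

```python
from dataclasses import dataclass

@dataclass
class CameraPose:
    """Pose of a camera: 3D position plus 3D orientation.
    Each distinct pose corresponds to one 'view' of the scene."""
    x: float
    y: float
    z: float
    theta: float  # orientation angles; the naming here is ours
    rho: float
    psi: float
```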
[0037] The method uses temporal correlation between frames within each video acquired for a particular camera pose, as well as spatial correlation between synchronized frames in videos acquired from multiple camera views. In addition, 'synthetic' frames can be correlated, as described below.

[0038] In one embodiment, the temporal correlation uses motion compensated temporal filtering (MCTF), while the spatial correlation uses disparity compensated inter-view filtering (DCVF).

[0039] In another embodiment of the invention, spatial correlation uses prediction of one view from synthesized frames that are generated from 'neighboring' frames. Neighboring frames are temporally or spatially adjacent frames, for example, frames before or after a current frame in the temporal domain, or one or more frames acquired at the same instant in time but from cameras having different poses or views of the scene.

[0040] Each frame of each video includes macroblocks of pixels. Therefore, the method of multiview video encoding and decoding according to one embodiment of the invention is macroblock adaptive. The encoding and decoding of a current macroblock in a current frame is performed using several possible prediction modes, including various forms of temporal, spatial, view synthesis, and intra prediction. To determine the best prediction mode on a macroblock basis, one embodiment of the invention provides a method for selecting a prediction mode. The method can be used for any number of camera arrangements.
[0041] In order to maintain compatibility with existing single-view encoding and decoding systems, a method for managing a reference picture list is described. Specifically, we describe a method of inserting and removing reference pictures from a picture buffer according to the reference picture list. The reference pictures include temporal reference pictures, spatial reference pictures and synthesized reference pictures.
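The insert/remove bookkeeping can be sketched as follows. This is a simplified illustration with hypothetical class and method names, not the H.264/AVC reference picture list syntax:

```python
class MultiviewRPLManager:
    """Maintains a reference picture list over a decoded picture buffer.
    Entries may be temporal, spatial, or synthesized reference pictures."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.rpl = []  # ordered reference picture list

    def insert_picture(self, picture, kind):
        """kind is 'temporal', 'spatial', or 'synthesized'."""
        self.rpl.append((kind, picture))
        if len(self.rpl) > self.buffer_size:
            self.remove_picture(0)  # evict the oldest entry

    def remove_picture(self, index):
        # The identical bookkeeping must run in encoder and decoder,
        # so both sides index the same set of reference pictures.
        del self.rpl[index]
```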
[0042] As used herein, a reference picture is defined as any frame that is used during the encoding and decoding to 'predict' a current frame. Typically, reference pictures are spatially or temporally adjacent or 'neighboring' to the current frame.

[0043] It is important to note that the same operations are applied in both the encoder and decoder because the same set of reference pictures are used at any given time instant to encode and decode the current frame.

[0044] One embodiment of the invention enables random access to the frames of the multiview videos during encoding and decoding. This improves coding efficiency.
[0045] MCTF/DCVF Decomposition

[0046] FIG. 4 shows a MCTF/DCVF decomposition 400 according to one embodiment of the invention. Frames of input videos 401-404 are acquired of a scene 5 by cameras 1-4 having different poses. Note, as shown in FIG. 8, some of the cameras 1a and 1b can be at the same locations but with different orientations. It is assumed that there is some amount of view overlap between any pair of cameras. The poses of the cameras can change while acquiring the multiview videos. Typically, the cameras are synchronized with each other. Each input video provides a different 'view' of the scene. The input frames 401-404 are sent to a MCTF/DCVF decomposition 400. The decomposition produces encoded low band frames 411, encoded high band frames 412, and associated side information 413. The high band frames encode prediction errors using the low band frames as reference pictures. The decomposition is according to selected prediction modes 410. The prediction modes include spatial, temporal, view synthesis, and intra prediction modes. The prediction modes can be selected adaptively on a per macroblock basis for each current frame. With intra prediction, the current macroblock is predicted from other macroblocks in the same frame.
[0047] FIG. 5 shows a preferred alternating 'checkerboard pattern' of the low band frames (L) 411 and the high band frames (H) 412 for a neighborhood of frames 510. The frames have a spatial (view) dimension 501 and a temporal dimension 502. Essentially, the pattern alternates low band frames and high band frames in the spatial dimension for a single instant in time, and additionally alternates temporally the low band frames and the high band frames for a single video.

[0048] There are several advantages of this checkerboard pattern. The pattern distributes low band frames evenly in both the space and time dimensions, which achieves scalability in space and time when a decoder only reconstructs the low band frames. In addition, the pattern aligns the high band frames with adjacent low band frames in both the space and time dimensions. This maximizes the correlation between reference pictures from which the predictions of the errors in the current frame are made, as shown in FIG. 6.
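The alternation can be stated precisely: with views indexed by v and time instants by t, a frame is a low band frame when v + t has one parity and a high band frame otherwise. A sketch under that assumption (the parity convention is our choice, not fixed by the text):

```python
def band_of(view, time):
    """Checkerboard assignment of low (L) and high (H) band frames:
    alternates across views at a fixed time, and across time for a
    fixed view, exactly as described for FIG. 5."""
    return 'L' if (view + time) % 2 == 0 else 'H'

# Print the pattern for 4 views over 8 time instants;
# adjacent frames in both dimensions always differ in band.
for v in range(4):
    print(' '.join(band_of(v, t) for t in range(8)))
```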
[0049] According to a lifting-based wavelet transform, the high band frames 412 are generated by predicting one set of samples from the other set of samples. The prediction can be achieved using a number of modes including various forms of temporal prediction, various forms of spatial prediction, and a view synthesis prediction according to the embodiments of the invention described below.

[0050] The means by which the high band frames 412 are predicted and the necessary information required to make the prediction are referred to as the side information 413. If a temporal prediction is performed, then the temporal mode is signaled as part of the side information along with corresponding motion information. If a spatial prediction is performed, then the spatial mode is signaled as part of the side information along with corresponding disparity information. If view synthesis prediction is performed, then the view synthesis mode is signaled as part of the side information along with corresponding disparity, motion and depth information.
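The side information is effectively a mode tag plus a mode-dependent payload. A sketch with illustrative field names (the record layout is ours; the actual bitstream syntax is not specified at this level):

```python
def side_information(mode, **data):
    """Bundle the signaled prediction mode with the data the decoder
    needs to repeat the same prediction, per paragraph [0050]."""
    required = {
        'temporal': ('motion_vectors',),
        'spatial': ('disparity_vectors',),
        'view_synthesis': ('disparity_vectors', 'motion_vectors', 'depth'),
        'intra': (),
    }
    missing = [field for field in required[mode] if field not in data]
    if missing:
        raise ValueError(f'mode {mode!r} requires {missing}')
    return {'mode': mode, **data}
```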
[0051] As shown in FIG. 6, the prediction of each current frame 600 uses neighboring frames 510 in both the space and time dimensions. The frames that are used for predicting the current frame are called reference pictures. The reference pictures are maintained in the reference list, which is part of the encoded bitstream. The reference pictures are stored in the decoded picture buffer.

[0052] In one embodiment of the invention, the MCTF and DCVF are applied adaptively to each current macroblock for each frame of the input videos to yield decomposed low band frames, as well as the high band frames and the associated side information. In this way, each macroblock is processed adaptively according to a 'best' prediction mode. An optimal method for selecting the prediction mode is described below.

[0053] In one embodiment of the invention, the MCTF is first applied to the frames of each video independently. The resulting frames are then further decomposed with the DCVF. In addition to the final decomposed frames, the corresponding side information is also generated. If performed on a macroblock basis, then the prediction mode selections for the MCTF and the DCVF are considered separately. As an advantage, this prediction mode selection inherently supports temporal scalability. In this way, lower temporal rates of the videos are easily accessed in the compressed bitstream.

[0054] In another embodiment, the DCVF is first applied to the frames of the input videos. The resulting frames are then temporally decomposed with the MCTF. In addition to the final decomposed frames, the side information is also generated. If performed on a macroblock basis, then the prediction mode selections for the MCTF and DCVF are considered separately. As an advantage, this selection inherently supports spatial scalability. In this way, a reduced number of the views are easily accessed in the compressed bitstream.
[0055] The decomposition described above can be applied recursively on the resulting set of low band frames from a previous decomposition stage. As an advantage, our MCTF/DCVF decomposition 400 effectively removes both temporal and spatial (inter-view) correlations, and can achieve a very high compression efficiency. The compression efficiency of our multiview video encoder outperforms conventional simulcast encoding, which encodes each video for each view independently.

[0056] Coding of MCTF/DCVF Decomposition

[0057] As shown in FIG. 7, the outputs 411 and 412 of decomposition 400 are fed to a signal encoder 710, and the output 413 is fed to a side information encoder 720. The signal encoder 710 performs a transform, quantization and entropy coding to remove any remaining correlations in the decomposed low band and high band frames 411-412. Such operations are well known in the art, Netravali and Haskell, Digital Pictures: Representation, Compression and Standards, Second Edition, Plenum Press, 1995.
[0058] The side information encoder 720 encodes the side information 413 generated by the decomposition 400. In addition to the prediction mode and the reference picture list, the side information 413 includes motion information corresponding to the temporal predictions, disparity information corresponding to the spatial predictions and view synthesis, and depth information corresponding to the view synthesis predictions.

[0059] Encoding the side information can be achieved by known and established techniques, such as the techniques used in the MPEG-4 Visual standard, ISO/IEC 14496-2, "Information technology - Coding of audio-visual objects - Part 2: Visual," 2nd Edition, 2001, or the more recent H.264/AVC standard, ITU-T Recommendation H.264, "Advanced video coding for generic audiovisual services," 2004.
[0060] For instance, motion vectors of the macroblocks are typically encoded using predictive methods that determine a prediction vector from vectors in macroblocks in reference pictures. The difference between the prediction vector and the current vector is then subject to an entropy coding process, which typically uses the statistics of the prediction error. A similar procedure can be used to encode disparity vectors.
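A sketch of this predictive coding follows. The component-wise median predictor is one common choice, not mandated by the text, and the entropy coding of the returned difference is omitted:

```python
def mv_residual(current_mv, neighbor_mvs):
    """Predict the current motion vector from vectors of nearby
    macroblocks, then return the difference to be entropy coded.
    Vectors are (x, y) pairs; a component-wise median predictor
    is used here as an illustrative choice."""
    xs = sorted(mv[0] for mv in neighbor_mvs)
    ys = sorted(mv[1] for mv in neighbor_mvs)
    pred = (xs[len(xs) // 2], ys[len(ys) // 2])
    return (current_mv[0] - pred[0], current_mv[1] - pred[1])

# The same routine applies unchanged to disparity vectors.
```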
[0061] Furthermore, depth information for each macroblock can be encoded using predictive coding methods in which a prediction from macroblocks in reference pictures is obtained, or by simply using a fixed length code to express the depth value directly. If pixel level accuracy for the depth is extracted and compressed, then texture coding techniques that apply transform, quantization and entropy coding can be applied.
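The fixed length code option is direct; a sketch assuming an 8-bit unsigned per-macroblock depth value (the bit width is our assumption, not stated in the text):

```python
def encode_depth_fixed(depth_value, bits=8):
    """Express a macroblock depth value directly as a fixed length code."""
    if not 0 <= depth_value < (1 << bits):
        raise ValueError('depth value out of range for the code')
    return format(depth_value, f'0{bits}b')  # e.g. 130 -> '10000010'
```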
[0062] The encoded signals 711-713 from the signal encoder 710 and side information encoder 720 can be multiplexed 730 to produce an encoded output bitstream 731.

[0063] Decoding of MCTF/DCVF Decomposition
[0064] The bitstream 731 can be decoded 740 to produce output multiview videos 741 corresponding to the input multiview videos 401-404. Optionally, synthetic video can also be generated. Generally, the decoder performs the inverse operations of the encoder to reconstruct the multiview videos. If all low band and high band frames are decoded, then the full set of frames in both the space (view) dimension and time dimension at the encoded quality are reconstructed and available.

[0065] Depending on the number of recursive levels of decomposition that were applied in the encoder and which type of decompositions were applied, a reduced number of videos and/or a reduced temporal rate can be decoded as shown in FIG. 7.
[0066] View Synthesis

[0067] As shown in FIG. 8, view synthesis is a process by which frames 801 of a synthesized video are generated from frames 803 of one or more actual multiview videos. In other words, view synthesis provides a means to synthesize the frames 801 corresponding to a selected novel view 802 of the scene 5. This novel view 802 may correspond to a 'virtual' camera 800 not present at the time the input multiview videos 401-404 were acquired, or the view can correspond to a camera view that is acquired, whereby the synthesized view will be used for prediction and encoding/decoding of this view as described below.

[0068] If one video is used, then the synthesis is based on extrapolation or warping, and if multiple videos are used, then the synthesis is based on interpolation.
[0069] Given the pixel values of frames 803 of one or more multiview videos and the depth values of points in the scene, the pixels in the frames 801 for the synthetic view 802 can be synthesized from the corresponding pixel values in the frames 803.
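This is the classic depth-based warp: back-project a reference-view pixel to a 3D point using its depth, then re-project into the virtual view. A sketch assuming pinhole cameras with intrinsics K and extrinsics R, t mapping world to camera coordinates; all names and the notation are ours, since the patent only assumes such parameters exist:

```python
import numpy as np

def synthesize_pixel(u, v, depth, K_ref, R_ref, t_ref, K_virt, R_virt, t_virt):
    """Warp pixel (u, v) of a reference view into the virtual view,
    given its scene depth and both cameras' parameters."""
    # Back-project to a 3D scene point in world coordinates.
    ray = np.linalg.inv(K_ref) @ np.array([u, v, 1.0])
    point_cam = depth * ray                     # point in reference camera frame
    point_world = R_ref.T @ (point_cam - t_ref)
    # Re-project into the virtual camera.
    p = K_virt @ (R_virt @ point_world + t_virt)
    return p[0] / p[2], p[1] / p[2]             # pixel in the synthesized frame
```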
[0070] View synthesis is commonly used in computer graphics for rendering still images for multiple views, see Buehler et al., "Unstructured Lumigraph Rendering," Proc. ACM SIGGRAPH, 2001. That method requires extrinsic and intrinsic parameters for the cameras.

[0071] View synthesis for compressing multiview videos is novel. In one embodiment of our invention, we generate synthesized frames to be used for predicting the current frame. In one embodiment of the invention, synthesized frames are generated for designated high band frames. In another embodiment of the invention, synthesized frames are generated for specific views. The synthesized frames serve as reference pictures from which a current synthesized frame can be predicted.

[0072] One difficulty with this approach is that the depth values of the scene 5 are unknown. Therefore, we estimate the depth values using known techniques, e.g., based on correspondences of features in the multiview videos.
[0073] Alternatively, for each synthesized video, we generate multiple synthesized frames, each corresponding to a candidate depth value. For each macroblock in the current frame, the best matching macroblock in the set of synthesized frames is determined. The synthesized frame from which this best match is found indicates the depth value of the macroblock in the current frame. This process is repeated for each macroblock in the current frame.
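The per-macroblock depth search can be sketched directly: synthesize one candidate macroblock per depth, then keep the depth whose synthesized macroblock best matches the current one. The function names and the SAD matching criterion are our assumptions:

```python
def best_depth_for_macroblock(current_mb, candidate_depths, synthesize_mb):
    """Try every candidate depth; the depth whose synthesized macroblock
    minimizes the sum of absolute differences (SAD) against the current
    macroblock is chosen. `synthesize_mb` is a hypothetical helper that
    performs view synthesis for one depth and returns pixel values."""
    best_depth, best_sad = None, float('inf')
    for depth in candidate_depths:
        synth_mb = synthesize_mb(depth)
        sad = sum(abs(a - b) for a, b in zip(current_mb, synth_mb))
        if sad < best_sad:
            best_depth, best_sad = depth, sad
    return best_depth
```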
[0074] A difference between the current macroblock and the synthesized block is encoded and compressed by the signal encoder 710. The side information for this multiview mode is encoded by the side information encoder 720. The side information includes a signal indicating the view synthesis prediction mode, the depth value of the macroblock, and an optional displacement vector that compensates for any misalignments between the macroblock in the current frame and the best matching macroblock in the synthesized frame.
[0075] Prediction Mode Selection

[0076] In the macroblock-adaptive MCTF/DCVF decomposition, the prediction mode m for each macroblock can be selected by minimizing a cost function adaptively on a per macroblock basis:

m* = arg min_m J(m),

where J(m) = D(m) + λR(m), and D is distortion, λ is a weighting parameter, R is rate, m indicates the set of candidate prediction modes, and m* indicates the optimal prediction mode that has been selected based on a minimum cost criterion.
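The minimization maps directly to code. A sketch, assuming per-mode distortion and rate estimates are available; the estimator callables and names are hypothetical:

```python
def select_prediction_mode(macroblock, lam, estimators):
    """Pick the mode minimizing J(m) = D(m) + lambda * R(m).
    `estimators` maps a mode name ('temporal', 'spatial',
    'view_synthesis', 'intra') to a callable returning (D, R)
    for the given macroblock."""
    best_mode, best_cost = None, float('inf')
    for mode, estimate in estimators.items():
        distortion, rate = estimate(macroblock)
        cost = distortion + lam * rate
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode
```

This is the structure of FIG. 17: four coding-cost evaluations (m1 through m4) feeding a minimum-cost selection of the best mode.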
[0077] The candidate modes m inclu
