Sawhney et al.

(10) Patent No.: US 6,907,073 B2
(45) Date of Patent: Jun. 14, 2005

US006907073B2
`
`(54)
`
`(75)
`
`TWEENING-BASED CODEC FOR
`SCALEABLE ENCODERS AND DECODERS
`WITH VARYING MOTION COMPUTATION
`CAPABILITY
`
`Inventors: Harpreet Singh Sawhney, West
`Windsor, NJ (US); Rakesh Kumar,
`Monmouth Junction, NJ (US); Keith
`Hanna, Princeton, NJ (US); Peter
`Burt, Princeton, NJ (US); Norman
`Winarsky, Princeton, NJ (US)
`
`(73)
`
`Assignee: Sarnofi' Corporation, Princeton, NJ
`(US)
`
`Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 596 days.
`
`(21)
`
`(22)
`
`(65)
`
`(60)
`
`(51)
`(52)
`(58)
`
`(56)
`
`Appl. No.: 09/731,194
`
`Filed:
`
`Dec. 6, 2000
`Prior Publication Data
`
`US 2001/0031003 A1 Oct. 18, 2001
`
`Related U.S. Application Data
`Provisional application No. 60/172,841, filed on Dec. 20,
`1999.
`
`Int. Cl.7 ................................................ .. H04B 1/66
`U.S. Cl.
`................................................ .. 375/240.14
`Field of Search ..................... .. 375/240.01, 240.03,
`375/240.15, 240.16, 240.12, 240.11, 240.14,
`240.09, 240.22, 240.23, 240.25, 240.13,
`382/234, 236, 238, 250, H04B 1/66
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`5,121,202 A *
`5,677,735 A
`
`.................. .. 375/240.16
`6/1992 Tanoi
`10/1997 Ueno et al.
`
`(Continued)
`FOREIGN PATENT DOCUMENTS
`
`EP
`
`0 305 127 A2
`
`3/1989
`
`OTHER PUBLICATIONS
`
"The Motion Transform: A New Motion Compensation Technique", by
Armitano, R. M. et al.; IEEE International Conference on Acoustics,
Speech, and Signal Processing Proceedings, vol. CONF. 21, May 7,
1996, pp. 2295-2298.
Patent Abstracts of Japan, vol. 14, Dec. 31, 1998, and JP 10 257502 A,
Matsushita Electric Ind Co Ltd, Sep. 25, 1998, abstract.
`
Primary Examiner—Tung Vo
`(74) Attorney, Agent, or Firm—William J. Burke
`
`(57)
`
`ABSTRACT
`
A scaleable video encoder has one or more encoding modes in which at
least some, and possibly all, of the motion information used during
motion-based predictive encoding of a video stream is excluded from
the resulting encoded video bitstream, where a corresponding video
decoder is capable of performing its own motion computation to
generate its own version of the motion information used to perform
motion-based predictive decoding in order to decode the bitstream to
generate a decoded video stream. All motion computation, whether at
the encoder or the decoder, is preferably performed on decoded data.
For example, frames may be encoded as either H, L, or B frames, where
H frames are intra-coded at full resolution and L frames are
intra-coded at low resolution. The motion information is generated by
applying motion computation to decoded L and H frames and used to
generate synthesized L frames. L-frame residual errors are generated
by performing inter-frame differencing between the synthesized and
original L frames and are encoded into the bitstream. In addition,
synthesized B frames are generated by tweening between the decoded H
and L frames, and B-frame residual errors are generated by performing
inter-frame differencing between the synthesized B frames and,
depending on the implementation, either the original B frames or
sub-sampled B frames. These B-frame residual errors are also encoded
into the bitstream. The ability of the decoder to perform motion
computation enables motion-based predictive encoding to be used to
generate an encoded bitstream without having to expend bits for
explicitly encoding any motion information.
`
`(Continued)
`
`35 Claims, 5 Drawing Sheets
`
` 11\6
`
`L FRAME
`* SYNTHESIS
`
`100
`
`To
`120
`"E5'°U"'-
`H ERROR
`
`ENCODING
`
`119 1
`|N'|'ER-FRAME
`DIFFERENCING
`
`
`
`RESIDUAL ‘
`eanon
`ENOOD|NG ,
`
`
`
`Google Inc.
`GOOG 1041
`
`IPR20l6-00212
`
`,
`
`
`
`
`
`IM=UT
`STREAM
`
`.
`‘-
`FRAMEI
`nseaou H
`' we 5
`semcnon
`
`BITSTREAM
`(ownom)
`
`0001
`
`INTEFLFRAME
`
`.
`' DIFFEHENCING
`
`
`
`Google Inc.
`GOOG 1041
`IPR2016-00212
`
`0001
`
`
`
U.S. PATENT DOCUMENTS

5,686,962 A  * 11/1997  Chung et al.  375/240.16
5,703,649 A  * 12/1997  Kondo  375/240.18
5,764,805 A  *  6/1998  Martucci et al.  382/238
5,852,469 A  * 12/1998  Nagai et al.  375/240.23
6,097,842 A  *  8/2000  Suzuki et al.  382/232
6,427,027 B1 *  7/2002  Suzuki et al.  382/236
6,490,705 B1 * 12/2002  Boyce  714/776
6,535,558 B1 *  3/2003  Suzuki et al.  375/240.12
6,563,549 B1 *  5/2003  Sethuraman  348/700
`
FOREIGN PATENT DOCUMENTS

EP  0 753 970 A2   1/1997
EP  0 920 214 A2   6/1999
WO  WO 93/02526    2/1993
WO  WO 99/57906   11/1999
`
`* cited by examiner
`
`0002
`
`0002
`
`
`
U.S. Patent    Jun. 14, 2005    Sheet 1 of 5    US 6,907,073 B2

[FIG. 1: block diagram of scaleable video encoder 100.]
`
`
[Sheet 2 of 5]

FIG. 2 (frame-type sequence): H B L B L B L B L B H B L B L B L B L B H

FIG. 3 (H-frame encoding):
  302: Intra-encode full-res H frame
  304: Include intra-frame coded H-frame data into bitstream
  306: Generate decoded full-res H frame for use as reference data
       for L and B frames

FIG. 5 (B-frame encoding):
  502: Sub-sample full-res B frame to generate low-res B frame
  504: Using motion info from H/L frame encoding, generate
       synthesized low-res B frame
  506: Generate B-frame residuals between low-res B frame and
       synthesized low-res B frame
  508: Encode B-frame residuals into bitstream
`
`
`
[Sheet 3 of 5]

FIG. 4 (L-frame encoding):
  402: Sub-sample full-res L frame to generate low-res L frame
  404: Intra-encode low-res L frame
  406: Include intra-coded L frame into bitstream
  408: Generate decoded low-res L frame
  410: Perform motion computation for decoded low-res L frame
       to generate motion info
  412: Optionally encode some/all motion info into bitstream
  414: Use motion info to synthesize full-res L frame from
       decoded H frame(s)
  416: Generate residual errors between synthesized full-res L frame
       and original full-res L frame
  418: Threshold residual errors to generate binary mask
  420: Based on binary mask, augment bitstream with encoded
       residual errors
`
`
`
[Sheet 4 of 5]

[FIG. 6: block diagram of a video decoder, according to one
embodiment of the present invention.]
`
`
[Sheet 5 of 5]

FIG. 7 (L-frame decoding, excerpt):
  Add L-frame residuals to synthesized full-res L frame to generate
  decoded full-res L frame

FIG. 8 (B-frame decoding):
  Decode bitstream to generate decoded B-frame residuals
  Synthesize low-res B frame
  Add B-frame residuals to synthesized low-res B frame to generate
  decoded low-res B frame
  Generate decoded full-res B frame from decoded low-res B frame

[FIG. 9: a basketball event covered by a ring of cameras (1002),
with views frozen in time along the viewpoint axis. FIG. 10: the
space-time continuum of views along the ring of cameras, with axes
of time and viewpoint.]
`
`
`
`1
`TWEENING-BASED CODEC FOR
`SCALEABLE ENCODERS AND DECODERS
`WITH VARYING MOTION COMPUTATION
`CAPABILITY
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`
`This application claims the benefit of the filing date of
`U.S. provisional application No. 60/172,841, filed on Dec.
`20, 1999.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`
The present invention relates to video compression/
decompression (codec) processing.
`2. Description of the Related Art
`Traditional video compression/decompression processing
`relies on asymmetric computation between the encoder and
`decoder. The encoder is used to do all the analysis of the
`video stream in terms of inter- and intra-frame components,
`including block-based motion computation, and also in
`terms of object-based components. The analysis is used to
`compress static and dynamic information in the video
`stream. The decoder simply decodes the encoded video
`bitstream by decompressing the intra- and block-based inter-
`frame information. No significant analysis is performed at
`the decoder end. Examples of such codecs include MPEG1,
`MPEG2, MPEG4, H.263, and related standards. The quality
`of “codeced” video using the traditional asymmetric
`approach is reasonably good for data rates above about 1.2
`megabits/second (Mbps). However, the typical quality of
`output video is significantly degraded at modem speeds of
56 kilobits/second (Kbps) and even at speeds as high as a
few hundred Kbps.
`SUMMARY OF THE INVENTION
`
`The present invention is related to video compression/
`decompression processing that
`involves analysis of the
`video stream (e.g., motion computation) at both the encoder
`end and the decoder end. With the rapid increase in pro-
`cessing power of commonly available platforms, and with
`the potential for dedicated video processing sub-systems
`becoming viable, the techniques of the present invention
`may significantly influence video delivery on the Internet
`and other media at low and medium bit-rate channels.
`
`In traditional video compression, any and all motion
`computation is performed by the encoder, and none by the
`decoder. For example, in a conventional MPEG-type video
`compression algorithm, for predictive frames, the encoder
`performs block-based motion estimation to identify motion
`vectors that relate blocks of data in a current frame to closely
`matching blocks of reference data for use in generating
`motion-compensated inter-frame differences. These inter-
`frame differences (also referred to as residual errors) along
`with the motion vectors themselves are explicitly encoded
`into the resulting encoded video bitstream. Under this codec
`paradigm, without having to perform any motion computa-
`tion itself, a decoder recovers the motion vectors and inter-
`frame differences from the bitstream and uses them to
`
`generate the corresponding frames of a decoded video
`stream. As used in this specification,
`the term “motion
`computation” refers to motion estimation and other types of
`analysis in which motion information for video streams is
`generated, as opposed to motion compensation, where
`already existing motion information is merely applied to
`video data.
`
`According to certain embodiments of the present
`invention, a video decoder is capable of performing at least
`some motion computation. As such, the video encoder can
`omit some or all of the motion information (e.g., motion
`vectors) from the encoded video bitstream, relying on the
`decoder to perform its own motion computation analysis to
`generate the equivalent motion information required to
`generate the decoded video stream. In this way, more of the
`available transmission and/or storage capacity (i.e., bit rate)
`can be allocated for encoding the residual errors (e.g.,
`inter-frame differences) rather than having to expend bits to
`encode motion information.
`
`According to one embodiment, the present invention is a
`method for encoding a video stream to generate an encoded
`video bitstream, comprising the steps of (a) encoding, into
`the encoded video bitstream, a first original frame/region in
`the video stream using intra-frame coding to generate an
`encoded first frame/region; and (b) encoding,
`into the
`encoded video bitstream, a second original frame/region in
`the video stream using motion-based predictive coding,
`wherein at least some motion information used during the
`motion-based predictive coding is excluded from the
`encoded video bitstream.
`
`10
`
`15
`
`20
`
`According to another embodiment, the present invention
`is a video encoder for encoding a video stream to generate
`an encoded video bitstream, comprising (a) a frame/region
`type selector configured for selecting different processing
`paths for encoding different frames/regions into the encoded
`video bitstream; (b) a first processing path configured for
`encoding, into the encoded video bitstream, a first original
`frame/region in the video stream using intra-frame coding to
`generate an encoded first frame/region; and (c) a second
`processing path configured for encoding, into the encoded
`video bitstream, a second original frame/region in the video
`stream using motion-based predictive coding, wherein the
`video encoder has an encoding mode in which at least some
`motion information used during the motion-based predictive
`coding is excluded from the encoded video bitstream.
`According to yet another embodiment, the present inven-
`tion is a method for decoding an encoded video bitstream to
`generate a decoded video stream, comprising the steps of (a)
`decoding, from the encoded video bitstream, an encoded
`first frame/region using intra-frame decoding to generate a
`decoded first frame/region; and (b) decoding, from the
`encoded video bitstream, an encoded second frame/region
`using motion-based predictive decoding, wherein at least
`some motion information used during the motion-based
`predictive decoding is generated by performing motion
`computation as part of the decoding method.
`According to yet another embodiment, the present inven-
`tion is a video decoder for decoding an encoded video
`bitstream to generate a decoded video stream, comprising (a)
`a frame/region type selector configured for selecting differ-
`ent processing paths for decoding different encoded frames/
`regions from the encoded video bitstream; (b) a first pro-
`cessing path configured for decoding, from the encoded
`video bitstream, an encoded first frame/region in the video
`stream using intra-frame decoding to generate a decoded
`first frame/region; and (c) a second processing path config-
`ured for decoding, from the encoded video bitstream, an
`encoded second frame/region in the video stream using
`motion-based predictive decoding, wherein the video
`decoder has a decoding mode in which at least some motion
`information used during the motion-based predictive decod-
`ing is generated by the video decoder performing motion
`computation.
`According to yet another embodiment, the present inven-
`tion is a method for decoding an encoded video bitstream to
`0008
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`0008
`
`
`
`US 6,907,073 B2
`
`3
`
`generate a decoded video stream, comprising the steps of (a)
`decoding, from the encoded video bitstream, a plurality of
`encoded frames/regions to generate a plurality of decoded
`frames/regions using motion information; and (b) perform-
`ing tweening based on the motion information to insert one
`or more additional frames/regions into the decoded video
`stream.
`
`According to yet another embodiment, the present inven-
`tion is a decoder for decoding an encoded video bitstream to
`generate a decoded video stream, comprising (a) one or
`more processing paths configured for decoding, from the
`encoded video bitstream, a plurality of encoded frames/
`regions to generate a plurality of decoded frames/regions
`using motion information; and (b) an additional processing
`path configured for performing tweening based on the
`motion information to insert one or more additional frames/
`regions into the decoded video stream.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`Other aspects, features, and advantages of the present
`invention will become more fully apparent from the follow-
`ing detailed description,
`the appended claims, and the
`accompanying drawings in which:
`FIG. 1 shows a block diagram of a scaleable video
`encoder, according to one embodiment of the present inven-
`tion;
`FIG. 2 shows a representation of the encoding of an input
`video stream by the video encoder of FIG. 1;
`FIG. 3 shows a flow diagram of the processing of each H
`frame by the video encoder of FIG. 1;
`FIG. 4 shows a flow diagram of the processing of each L
`frame by the video encoder of FIG. 1;
`FIG. 5 shows a flow diagram of the processing of each B
`frame by the video encoder of FIG. 1;
`FIG. 6 shows a block diagram of a video decoder,
`according to one embodiment of the present invention;
`FIG. 7 shows a flow diagram of the processing of each L
`frame by the video decoder of FIG. 6;
`FIG. 8 shows a flow diagram of the processing of each B
`frame by the video decoder of FIG. 6;
`FIG. 9 represents a basketball event being covered by a
`ring of cameras; and
`FIG. 10 represents a space-time continuum of views along
`the ring of cameras of FIG. 9.
`DETAILED DESCRIPTION
`
`In current state-of-the-art motion video encoding
`algorithms, like those of the MPEGx family, a large part of
`the bit budget and hence the bandwidth is consumed by the
`encoding of motion vectors and error images for the non-
`intra-coded frames.
`In a typical MPEG2 coded stream,
`approximately 5% of the bit budget is used for overhead,
`10-15% is for intra-coded frames (i.e., frames that are coded
`as stills), 20-30% is for motion vectors, and 50-65% of the
`budget is for error encoding. The relatively large budget for
`error encoding can be attributed to two main reasons. First,
`motion vectors are computed only as a translation vector for
`(8><8) blocks or (16><16) macroblocks, and, second,
`the
`resulting errors tend to be highly uncorrelated and non-
`smooth.
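Applying these typical percentages to a hypothetical 1.2 Mbps MPEG2 stream gives a rough sense of the split (the bit rate and the midpoints of the quoted ranges are illustrative assumptions):

```python
# Illustrative only: the typical MPEG2 bit-budget split quoted above,
# applied to an assumed 1.2 Mbps stream using range midpoints.
bitrate_kbps = 1200
budget = {
    "overhead": 0.05,        # ~5%
    "intra frames": 0.125,   # 10-15% (midpoint)
    "motion vectors": 0.25,  # 20-30% (midpoint)
    "error encoding": 0.575, # 50-65% (midpoint)
}
for part, frac in budget.items():
    print(f"{part}: {frac * bitrate_kbps:.0f} kbps")
```

Under this split, motion vectors alone consume on the order of 300 kbps, which is the portion a decoder-side motion computation scheme can reclaim for residual encoding.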
`
`According to certain embodiments of the present
`invention, motion computation is performed at both the
`encoder end and the decoder end. As such, motion informa-
`tion (e.g., motion vectors) need not be transmitted. Since
`
`4
`motion computation is performed at the decoder end, instead
`of limiting the representation of motion to block-based
`translations, motion fields can be computed with greater
`accuracy using a combination of parametric and non-
`parametric representations.
`Embodiments of the present invention enable the video
`stream to be sub-sampled both temporally and spatially at
`the encoder. The video stream can be sub-sampled in time so
`that not all of the frames are transmitted. In addition, some
`of the frames that are transmitted may be coded at a lower
`spatial resolution. Using dense and accurate motion com-
`putation at the decoder end, the decoded full-resolution and
low-resolution frames are used to recreate a full-resolution
decoded video stream with missing frames filled in using
`motion-compensated spatio-temporal
`interpolation (also
`referred to as “tweening”). This could result in large savings
`in compression while maintaining quality of service for a
`range of different bandwidth pipes.
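The motion-compensated spatio-temporal interpolation ("tweening") described above can be illustrated with a minimal sketch. This is a stand-in, not the patented method: it assumes grayscale numpy frames, a dense per-pixel flow field defined on the intermediate grid, linear motion, and nearest-neighbor sampling.

```python
import numpy as np

def tween(frame_a, frame_b, flow_ab, t):
    """Synthesize an in-between frame at fraction t (0 < t < 1).

    flow_ab[y, x] = (dx, dy), the displacement of the content at
    (x, y) from frame_a to frame_b, evaluated on the intermediate
    grid (a simplifying assumption). Motion is assumed linear;
    sampling is nearest-neighbor with border clamping.
    """
    h, w = frame_a.shape
    ys, xs = np.mgrid[0:h, 0:w]
    fx, fy = flow_ab[..., 0], flow_ab[..., 1]
    # Pull each intermediate pixel from frame_a (a short way back
    # along the flow) and from frame_b (the remaining way forward).
    ax = np.clip(np.rint(xs - t * fx).astype(int), 0, w - 1)
    ay = np.clip(np.rint(ys - t * fy).astype(int), 0, h - 1)
    bx = np.clip(np.rint(xs + (1 - t) * fx).astype(int), 0, w - 1)
    by = np.clip(np.rint(ys + (1 - t) * fy).astype(int), 0, h - 1)
    # Blend, weighting the temporally nearer frame more heavily.
    return (1 - t) * frame_a[ay, ax] + t * frame_b[by, bx]
```

A production tweener would also handle occlusions and alignment failures, which is exactly why the encoder described here checks the synthesized frames against the originals and encodes residuals.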
`In one embodiment of the present invention, a scaleable
`encoder is capable of encoding input video streams at a
`number of different encoding modes corresponding to dif-
`ferent types of decoders, e.g., having different levels of
`processing capacity.
`At one extreme class of encoding modes, the encoder
`generates an encoded video bitstream for a decoder that is
`capable of performing all of the motion computation per-
`formed by the encoder. In that case, the encoder encodes the
`video stream using an encoding mode in which motion-
`based predictive encoding is used to encode at least some of
`the frames in the video stream, but none of the motion
`information used during the video compression processing is
`explicitly included in the resulting encoded video bitstream.
`The corresponding decoder performs its own motion com-
`putation during video decompression processing to generate
`its own version of the motion information for use in gener-
`ating a decoded video stream from the encoded video
`bitstream, without having to rely on the bitstream explicitly
`carrying any motion information.
`At the other extreme class of encoding modes, the encoder
`encodes the video stream for a decoder that is incapable of
`performing any motion computation (as in conventional
`video codecs). In that case, if the encoder uses any motion
`information during encoding (e.g., for motion-compensated
`inter-frame differencing), then all of that motion information
`is explicitly encoded into the encoded video bitstream. The
`corresponding decoder recovers the encoded motion infor-
`mation from the encoded video bitstream to generate a
`decoded video stream without having to perform any motion
`computation on its own.
In between these two extremes are a number of different
encoding modes that are geared towards decoders that
`perform some, but not all of the motion computation per-
`formed by the encoder. In these situations,
`the encoder
`explicitly encodes some, but not all of the motion informa-
`tion used during its motion-based predictive encoding, into
`the resulting video bitstream. The corresponding decoder
recovers the encoded motion information from the bitstream
and performs its own version of motion computation to
`generate the rest of the motion information used to generate
`a decoded video stream.
`
`65
`
`Independent of how much motion information is to be
`encoded into the bitstream, a scaleable encoder of the
`present invention is also capable of skipping frames with the
`expectation that the decoder will be able to insert frames into
`the decoded video stream during playback. Depending on
`the implementation, frame skipping may involve providing
`0009
`
`0009
`
`
`
`US 6,907,073 B2
`
`5
`at least some header information for skipped frames in the
`encoded video bitstream or even no explicit information at
`all.
`
`Encoding
`FIG. 1 shows a block diagram of a scaleable video
`encoder 100, according to one embodiment of the present
`invention. Scaleable video encoder 100 will
first be described in the context of an extreme encoding mode in
`which none of the motion information used during motion-
`based predictive encoding is explicitly encoded into the
`resulting encoded video bitstream. Other encoding modes
`will then be described.
`
`According to this extreme encoding mode, each frame in
`an input video stream is encoded as either an H frame, an L
`frame, or a B frame. Each H frame is intra-encoded as a high
`spatial resolution (e.g., full-resolution) key frame, each L
frame is intra-encoded as a low spatial resolution (e.g., ¼×¼
resolution) key frame augmented by residual error encoding,
`and each B frame is inter-encoded as a low spatial resolution
`frame based on motion estimates between sets of H and/or
`L frames. Video encoder 100 encodes an input video stream
`as a sequence of H, L, and B frames to form a corresponding
`output encoded video bitstream.
`FIG. 2 shows a representation of a particular example of
`the encoding of an input video stream by video encoder 100
`of FIG. 1. In the example of FIG. 2, the input video stream
`is encoded using a repeating 10-frame sequence of
`(HBLBLBLBLB). In general, however, other fixed or even
`adaptive frame sequences are possible. For example, in one
`preferred fixed frame sequence, a 30 frame/second (fps)
`video stream is encoded using the fixed 30-frame sequence
`of:
`
`(HBBBBBLBBBBBLBBBBBLBBBBBLBBBBB).
`
`The generation of the frame-type sequence may also be
`performed adaptively, e.g., based on the amount of motion
present across frames, with fewer B frames between
consecutive H/L key frames and/or fewer L frames between
`consecutive H frames when motion is greater and/or less
`uniform, and vice versa.
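Both sequencing strategies can be sketched in a few lines; the fixed pattern is the 30-frame sequence quoted above, while the adaptive rule and its thresholds are purely hypothetical illustrations of the motion-dependent selection just described:

```python
def frame_types(n_frames, pattern="HBBBBBLBBBBBLBBBBBLBBBBBLBBBBB"):
    """Assign H/L/B types by cycling a fixed pattern, e.g. the
    30-frame sequence quoted above for a 30 fps stream."""
    return [pattern[i % len(pattern)] for i in range(n_frames)]

def adaptive_gap(motion_magnitude, low=0.2, high=0.8):
    """Hypothetical adaptive rule: fewer B frames between key
    frames when motion is high. Thresholds are illustrative."""
    if motion_magnitude > high:
        return 1  # high motion: key frames close together
    if motion_magnitude > low:
        return 3
    return 5      # low motion: many tweened B frames
```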
`Referring again to FIG. 1, type selection 102 is applied to
`the input video stream to determine which frames in the
`video stream are to be encoded as H, L, and B frames.
`(Although this type selection 102 will be described in the
`context of entire frames, this process may also be imple-
`mented based on regions within a frame, such as square
blocks, rectangular regions, or even arbitrarily shaped
`regions, with the corresponding estimation and encoding
`applied to each.) As mentioned above, depending on the
`particular implementation, the frame-type selection may be
`based on a fixed frame sequence or an appropriate adaptive
`selection algorithm, e.g., based on motion magnitude, spe-
`cial effects, scene cuts, and the like. Each of the different
`types of frames is then processed along a corresponding
`processing path represented in FIG. 1. As shown in FIG. 1,
`an option exists to drop one or more frames from the input
`video stream. This optional frame dropping may be incor-
`porated into a fixed frame sequence or adaptively selected,
`e.g., based on the amount of motion present or bit-rate
`considerations.
`
`FIG. 3 shows a flow diagram of the processing of each H
`frame by video encoder 100 of FIG. 1. Referring to the
`blocks in FIG. 1 and the steps in FIG. 3, the current H frame
`is intra-encoded at full resolution, e.g., using wavelet encod-
`ing (block 104 of FIG. 1 and step 302 of FIG. 3). As is
`known in the art, wavelet encoding typically involves the
application of wavelet transforms to different sets of pixel
data corresponding to regions within a current frame, followed
by quantization, run-length encoding, and variable-length
(Huffman-type) encoding to generate the current
`frame’s contribution to the encoded video bitstream.
`
`Typically, the sizes of the regions of pixel data (and therefore
`the sizes of the wavelet transforms) vary according to the
`pixel data itself. In general, the more uniform the pixel data,
`the larger the size of a region that is encoded with a single
`wavelet transform. Note that even though this encoding is
`referred to as “full resolution,” it may still involve sub-
sampling of the color components (e.g., 4:1:1 YUV sub-
`sampling).
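As an illustrative sketch of such an intra-coding pipeline (transform, quantization, run-length coding): a one-level 1-D Haar transform stands in here for whatever wavelet transform the encoder actually uses, the quantizer step is an assumed parameter, and the final Huffman-type stage is omitted.

```python
import itertools

def haar_1d(x):
    """One level of a 1-D Haar transform: pairwise averages followed
    by pairwise differences. A stand-in for the encoder's wavelet."""
    avg = [(a + b) / 2 for a, b in zip(x[0::2], x[1::2])]
    diff = [(a - b) / 2 for a, b in zip(x[0::2], x[1::2])]
    return avg + diff

def quantize(coeffs, step=2.0):
    """Uniform quantization; the step size is an illustrative choice."""
    return [round(c / step) for c in coeffs]

def run_length(symbols):
    """Run-length encode the quantized coefficients (zero runs
    dominate in smooth regions, which is what makes this pay off)."""
    return [(s, len(list(g))) for s, g in itertools.groupby(symbols)]
```

A real encoder would apply a 2-D transform recursively, with region sizes chosen by the uniformity of the pixel data as the text notes, and would follow the run-length stage with variable-length entropy coding.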
`The resulting intra-encoded full-resolution H-frame data
`is incorporated into the encoded video bitstream (step 304).
`The same intra-encoded H-frame data is also decoded (block
`106 and step 306), e.g., using wavelet decoding, to generate
a full-resolution decoded H frame for use as reference data
for encoding L and B frames.
`FIG. 4 shows a flow diagram of the processing of each L
`frame by video encoder 100 of FIG. 1. Referring to the
`blocks in FIG. 1 and the steps in FIG. 4,
`the current
`full-resolution L frame is spatially sub-sampled (e.g., by a
`factor of 4 in each direction) to generate a corresponding
`low-resolution L frame (block 108 and step 402). Depending
`on the particular implementation, this spatial sub-sampling
`may be based on any suitable technique, such as simple
`decimation or more complicated averaging.
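Both options can be sketched directly; the factor of 4 matches the example above, and grayscale numpy frames with dimensions divisible by the factor are assumed:

```python
import numpy as np

def decimate(frame, factor=4):
    """Simple decimation: keep every factor-th pixel per direction."""
    return frame[::factor, ::factor]

def block_average(frame, factor=4):
    """Averaging: each output pixel is the mean of a factor x factor
    block. Assumes frame dimensions are multiples of factor."""
    h, w = frame.shape
    blocks = frame.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))
```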
`The low-resolution L frame is then intra-encoded (block
`110 and step 404), e.g., using wavelet encoding, and the
`resulting intra-encoded low-resolution L-frame data is incor-
`porated into the encoded video bitstream (step 406). The
`same intra-encoded L-frame data is also decoded to generate
`a decoded low-resolution L frame (block 112 and step 408).
`Motion computation analysis is then performed compar-
`ing the decoded low-resolution L-frame data to one or more
`other sets of decoded data (e.g., decoded full-resolution data
`corresponding to the previous and/or subsequent H frames
`and/or decoded low-resolution data corresponding to the
`previous and/or subsequent L frames) to generate motion
`information for the current L frame (block 114 and step 410).
`In this particular “extreme” encoding mode, none of this
`L-frame motion information is explicitly encoded into the
`encoded video bitstream.
`In other encoding modes
`(including the opposite “extreme” encoding mode), some or
all of the motion information is encoded into the bitstream
(step 412).
`The exact type of motion computation analysis performed
`depends on the particular implementation of video encoder
`100. For example, motion may be computed for each L
`frame based on either the previous H frame, the closest H
`frame, or the previous key (H or L) frame. Moreover, this
`motion computation may range from conventional MPEG-
`like block-based or macroblock-based algorithms to any of
`a combination of optical flow, layered motion, and/or multi-
`frame parametric/non-parametric algorithms.
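By way of illustration, a conventional exhaustive-search block matcher of the MPEG-like kind mentioned above might look as follows; the block size, search range, and full-search strategy are illustrative choices, and real encoders use faster search patterns:

```python
import numpy as np

def block_motion(ref, cur, block=8, search=7):
    """For each block x block block of cur, find the displacement
    into ref (within +/- search pixels) minimizing the sum of
    absolute differences (SAD). Returns (dy, dx) vectors."""
    h, w = cur.shape
    vectors = np.zeros((h // block, w // block, 2), dtype=int)
    for by in range(h // block):
        for bx in range(w // block):
            y0, x0 = by * block, bx * block
            target = cur[y0:y0 + block, x0:x0 + block]
            best, best_v = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = y0 + dy, x0 + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue  # candidate falls outside ref
                    cand = ref[y:y + block, x:x + block]
                    sad = np.abs(cand - target).sum()
                    if sad < best:
                        best, best_v = sad, (dy, dx)
            vectors[by, bx] = best_v
    return vectors
```

Per-pixel optical flow or layered parametric models, as discussed above, replace the one-vector-per-block output here with denser or more structured motion fields.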
`For example, in one implementation, video encoder 100
`may perform conventional forward, backward, and/or
bi-directional block-based motion estimation in which a
motion vector is generated for each (8×8) block or (16×16)
`macroblock of pixels in the current frame. In alternative
`embodiments, other types of motion computation analysis
`may be performed, including optical flow analysis in which
`a different motion vector is generated for each pixel in the
`current frame. (For those encoding modes in which some or
`all of the motion information is encoded into the bitstream,
`the optical flow can be compactly represented using either
`wavelet encoding or region-based parametric plus residual
`0010
`
`65
`
`0010
`
`
`
`US 6,907,073 B2
`
`8
`to
`errors are then encoded, e.g., using wavelet encoding,
`generate encoded B-frame residual data for inclusion in the
`encoded video bitstream (block 128 and step 508). Depend-
`ing on the particular implementation,
`the residual error
`encoding of block 128 may rely on a thresholding of
`B-frame inter-frame differences to determine which residu-
`als to encode, similar to that described previously with
`regard to block 120 for the L-frame residual errors. Note
`that, since B frames are never used to generate reference
`data for encoding other frames, video encoder 100 does not
`have to decode the encoded B-frame residual data.
`
`In an alternative implementation of video encoder 100,
`instead of synthesizing low-resolution B frames,
`full-
`resolution B frames can be synthesized by tweening between
`pairs of decoded full-resolution H frames generated by block
`106 and synthesized full-resolution L frames generated by
`block 116. Inter-frame differencing can then be applied
`between the original full-resolution B frames and the syn-
`thesized full-resolution B frames to generate residual errors
`that can be encoded, e.g., using wavelet encoding, into the
`encoded video bitstream.
`In that case,
`the spatial sub-
`sampling of block 122 can be omitted.
`As mentioned earlier, the processing in FIGS. 3-5 corre-
`spond to the extreme encoding mode in which video encoder
`100 performs motion-based predictive encoding, but none of
`the corresponding motion information is explicitly encoded
`into the resulting encoded video bitstream, where the
`decoder performs its own motion computation to generate its
`own version of the motion information for use in generating
`the corresponding decoded video stream. As mentioned
`earlier, video encoder 100 is preferably a scaleable video
`encoder that can encode video streams at a variety of
`different encoding modes. Some of the encoding options
`available in video encoder 100 include:
`
`10
`
`15
`
`20
`
`25
`
`30
`
`7
`
`flow encoding.) Still other implementations may rely on
`hierarchical or layered motion analysis in which a number of
`different motion vectors are generated at different
`resolutions, where finer motion information (e.g., corre-
`sponding to smaller sets of pixels) provide corrections to
`coarser motion information (e.g., corresponding to larger
`sets of pixels). In any case, the resulting motion information
`characterizes the motion between the current L frame and
`
`corresponding H/L frames.
`No matter what type of analysis is performed, the motion
`information generated during the motion computation is
`then used to synthesize a full-resolution L frame (block 116
`and step 414). In particular, the motion information is used
`to warp (i.e., motion compensate)
`the corresponding
`decoded full-resolution H frame to generate a synthesized
`full-resolution frame corresponding to the current L frame.
`Note that the synthesized full-resolution L frame may be
`generated using forward, backward, or even bi-directional
`warping based on more than one decoded full-resolution H
`frame. This would require computation of motion informa-
`tion relative to two different decoded full-resolution H
`
`frames, but will typically reduce even further the corre-
`sponding residuals that need to be compressed.
`In general, the synthesized full-resolution L frame may
`have artifacts due to various errors in motion computation
`due to occlusions, mismatches, and the like. As such, a
`quality of alignment metric (e.g., based on pixel-to-pixel
`absolute differences) is generated between the synthesized
`full-resolution L frame and the original full-resolution L
`frame (block 118 and step 416). The quality of alignment
`metrics form an image of residual errors that represent the
`quality of alignment at each pixel.
`The residual errors are then encoded for inclusion into the
`
`In one
`encoded video bitstream (block 120).
`implementation, the image of residual errors is thresholded
`at an appropriate level to form a binary mask (step 418) that
`identifies those regions of pixels for whom the residual error
`should be encoded, e.g., using a wavelet
`transform, for
`inclusion into the encoded video bitstream (step 420). For
`typical video processing, the residual errors for only about
`10% of the pixels will be encoded