`
`1103
`
`Overview of the Scalable Video Coding
`Extension of the H.264/AVC Standard
`
`Heiko Schwarz, Detlev Marpe, Member, IEEE, and Thomas Wiegand, Member, IEEE
`
`(Invited Paper)
`
`Abstract—With the introduction of the H.264/AVC video
`coding standard, significant improvements have recently been
`demonstrated in video compression capability. The Joint Video
`Team of the ITU-T VCEG and the ISO/IEC MPEG has now also
`standardized a Scalable Video Coding (SVC) extension of the
`H.264/AVC standard. SVC enables the transmission and decoding
`of partial bit streams to provide video services with lower tem-
`poral or spatial resolutions or reduced fidelity while retaining a
`reconstruction quality that is high relative to the rate of the partial
`bit streams. Hence, SVC provides functionalities such as graceful
`degradation in lossy transmission environments as well as bit
`rate, format, and power adaptation. These functionalities provide
`enhancements to transmission and storage applications. SVC has
`achieved significant improvements in coding efficiency with an
`increased degree of supported scalability relative to the scalable
`profiles of prior video coding standards. This paper provides an
`overview of the basic concepts for extending H.264/AVC towards
`SVC. Moreover, the basic tools for providing temporal, spatial,
`and quality scalability are described in detail and experimentally
`analyzed regarding their efficiency and complexity.
`
`Index Terms—H.264/AVC, MPEG-4, Scalable Video Coding
`(SVC), standards, video.
`
I. INTRODUCTION
`
ADVANCES in video coding technology and standardization [1]-[6] along with the rapid developments and
`improvements of network infrastructures, storage capacity, and
`computing power are enabling an increasing number of video
`applications. Application areas today range from multimedia
`messaging, video telephony, and video conferencing over mo-
`bile TV, wireless and wired Internet video streaming, standard-
`and high-definition TV broadcasting to DVD, Blu-ray Disc,
`and HD DVD optical storage media. For these applications,
`a variety of video transmission and storage systems may be
`employed.
`Traditional digital video transmission and storage systems
are based on H.222.0 | MPEG-2 systems [7] for broadcasting
`services over satellite, cable, and terrestrial transmission chan-
`nels, and for DVD storage, or on H.320 [8] for conversational
`video conferencing services. These channels are typically char-
`
Manuscript received October 6, 2006; revised July 15, 2007. This paper was recommended by Guest Editor T. Wiegand.
The authors are with the Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, 10587 Berlin, Germany (e-mail: hschwarz@hhi.fhg.de; marpe@hhi.fhg.de; wiegand@hhi.fhg.de).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2007.905532
`
`acterized by a fixed spatio-temporal format of the video signal
(SDTV or HDTV or CIF for H.320 video telephony). The application behavior in such systems typically falls into one of two categories: it works or it does not work.
`Modern video transmission and storage systems using the In-
`ternet and mobile networks are typically based on RTP/IP [9] for
`real-time services (conversational and streaming) and on com-
`puter file formats like mp4 or 3gp. Most RTP/IP access networks
`are typically characterized by a wide range of connection quali-
ties and receiving devices. The varying connection quality results from adaptive resource-sharing mechanisms of these networks addressing the time-varying data throughput requirements of a varying number of users. The variety of devices with different capabilities, ranging from cell phones with small screens and restricted processing power to high-end PCs with high-definition displays, results from the continuous evolution of these endpoints.
`Scalable Video Coding (SVC) is a highly attractive solution
`to the problems posed by the characteristics of modern video
transmission systems. The term "scalability" in this paper refers to the removal of parts of the video bit stream in order to adapt it to the various needs or preferences of end users as well as to varying terminal capabilities or network conditions. The term SVC is used interchangeably in this paper for both the concept of SVC in general and for the particular new design that has been standardized as an extension of the H.264/AVC standard.
`
`The objective of the SVC standardization has been to enable the
`encoding of a high-quality video bit stream that contains one or
`more subset bit streams that can themselves be decoded with a
`
`complexity and reconstruction quality similar to that achieved
`using the existing H.264/AVC design with the same quantity of
`data as in the subset bit stream.
`SVC has been an active research and standardization area for
`
at least 20 years. The prior international video coding standards H.262 | MPEG-2 Video [3], H.263 [4], and MPEG-4 Visual [5]
`already include several tools by which the most important scala-
`bility modes can be supported. However, the scalable profiles of
those standards have rarely been used. Reasons for that include
`the characteristics of traditional video transmission systems as
`well as the fact that the spatial and quality scalability features
`came along with a significant loss in coding efficiency as well
`as a large increase in decoder complexity as compared to the
`corresponding nonscalable profiles. It should be noted that two
`or more single-layer streams, i.e., nonscalable streams, can al-
`ways be transmitted by the method of simulcast, which in prin-
`ciple provides similar functionalities as a scalable bit stream,
`
`1051-8215/$25.00 © 2007 IEEE
`
`Authorized licensed use limited to: New York University. Downloaded on April 13,2010 at 13:04:43 UTC from IEEE Xplore. Restrictions apply.
`
`
`
`
`1104
`
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 17, NO. 9, SEPTEMBER 2007
`
although typically at the cost of a significant increase in bit rate.
`Moreover, the adaptation of a single stream can be achieved
`through transcoding, which is currently used in multipoint con-
`trol units in video conferencing systems or for streaming ser-
`vices in 3G systems. Hence, a scalable video codec has to com-
`pete against these alternatives.
This paper describes the SVC extension of H.264/AVC and is
`organized as follows. Section II explains the fundamental scal-
`ability types and discusses some representative applications of
`SVC as well as their implications in terms of essential require-
`ments. Section III gives the history of SVC. Section IV briefly
`reviews basic design concepts of H.264/AVC. In Section V,
the concepts for extending H.264/AVC toward an SVC standard are described in detail and analyzed regarding effectiveness and complexity. The SVC high-level design is summarized
`in Section VI. For more detailed information about SVC, the
`reader is referred to the draft standard [10].
`
II. TYPES OF SCALABILITY, APPLICATIONS, AND REQUIREMENTS
`
`In general, a video bit stream is called scalable when parts of
the stream can be removed in a way that the resulting substream
`forms another valid bit stream for some target decoder, and the
`substream represents the source content with a reconstruction
`quality that is less than that of the complete original bit stream
`but is high when considering the lower quantity of remaining
`data. Bit streams that do not provide this property are referred
to as single-layer bit streams. The usual modes of scalability are
`temporal, spatial, and quality scalability. Spatial scalability and
`temporal scalability describe cases in which subsets of the bit
`stream represent the source content with a reduced picture size
`(spatial resolution) or frame rate (temporal resolution), respec-
`tively. With quality scalability, the substream provides the same
`spatio-temporal resolution as the complete bit stream, but with
a lower fidelity—where fidelity is often informally referred to
`as signal-to-noise ratio (SNR). Quality scalability is also com-
`monly referred to as fidelity or SNR scalability. More rarely
`required scalability modes are region-of-interest (ROI) and ob-
`ject-based scalability, in which the substreams typically repre-
`sent spatially contiguous regions of the original picture area.
`The different types of scalability can also be combined, so that a
`multitude of representations with different spatio-temporal res-
`olutions and bit rates can be supported within a single scalable
`bit stream.
`
Efficient SVC provides a number of benefits in terms of applications [11]-[13]—a few of which will be briefly discussed in the following. Consider, for instance, the scenario of a video
transmission service with heterogeneous clients, where multiple bit streams of the same source content differing in coded picture
`size, frame rate, and bit rate should be provided simultaneously.
`With the application of a properly configured SVC scheme, the
source content has to be encoded only once—for the highest required resolution and bit rate, resulting in a scalable bit stream
`from which representations with lower resolution and/or quality
`can be obtained by discarding selected data. For instance, a
`client with restricted resources (display resolution, processing
`
`power, or battery power) needs to decode only a part of the de-
`livered bit stream. Similarly, in a multicast scenario, terminals
`with different capabilities can be served by a single scalable bit
stream. In an alternative scenario, an existing video format (like QVGA) can be extended in a backward-compatible way by an enhancement video format (like VGA).
`Another benefit of SVC is that a scalable bit stream usually
`contains parts with different importance in terms of decoded
`video quality. This property in conjunction with unequal error
`protection is especially useful in any transmission scenario
`with unpredictable throughput variations and/or relatively high
`packet loss rates. By using a stronger protection of the more
`important information, error resilience with graceful degra-
`dation can be achieved up to a certain degree of transmission
errors. Media-Aware Network Elements (MANEs), which receive feedback messages about the terminal capabilities and/or
`channel conditions, can remove the nonrequired parts from
`a scalable bit stream, before forwarding it. Thus, the loss of
`important transmission units due to congestion can be avoided
`and the overall error robustness of the video transmission
`
`service can be substantially improved.
SVC is also highly desirable for surveillance applications, in
`which video sources not only need to be viewed on multiple
`devices ranging from high-definition monitors to videophones
`or PDAs, but also need to be stored and archived. With SVC,
`for instance, high-resolution/high-quality parts of a bit stream
`can ordinarily be deleted after some expiration time, so that only
`low-quality copies of the video are kept for long-term archival.
`The latter approach may also become an interesting feature in
`personal video recorders and home networking.
`Even though SVC schemes offer such a variety of valuable
`functionalities, the scalable profiles of existing standards have
`rarely been used in the past, mainly because spatial and quality
`scalability have historically come at the price of increased de-
coder complexity and significantly decreased coding efficiency.
`In contrast to that, temporal scalability is often supported, e.g.,
`in H.264/AVC-based applications, but mainly because it comes
`along with a substantial coding efficiency improvement (cf.
`Section V-A.2).
`H.264/AVC is the most recent international video coding
`standard. It provides significantly improved coding efficiency
`in comparison to all prior standards [14]. H.264/AVC has
`attracted a lot of attention from industry and has been adopted
`by various application standards and is increasingly used in a
broad variety of applications. It is expected that in the near-term future H.264/AVC will be commonly used in most video applications. Given this high degree of adoption and deployment of the new standard and taking into account the large investments that have already taken place for preparing and developing H.264/AVC-based products, it is quite natural to now build an SVC scheme as an extension of H.264/AVC and to reuse its
`
`key features.
`Considering the needs of today’s and future video applica-
`tions as well as the experiences with scalable profiles in the past,
`the success of any future SVC standard critically depends on the
`following essential requirements.
`* Similar coding efficiency compared to single-layer
coding—for each subset of the scalable bit stream.
`
`
* Little increase in decoding complexity compared to single-layer decoding that scales with the decoded spatio-temporal resolution and bit rate.
`* Support of temporal, spatial, and quality scalability.
* Support of a backward-compatible base layer (H.264/AVC
`in this case).
`* Support of simple bit stream adaptations after encoding.
In any case, the coding efficiency of scalable coding should be clearly superior to that of "simulcasting" the supported spatio-temporal resolutions and bit rates in separate bit streams. In comparison to single-layer coding, bit rate increases of 10% to 50% for the same fidelity might be tolerable depending on the specific needs of an application and the supported degree of scalability.
This paper provides an overview of how these requirements have been addressed in the design of the SVC extension of H.264/AVC.
`
III. HISTORY OF SVC
`
`Hybrid video coding, as found in H.264/AVC [6] and all past
`video coding designs that are in widespread application use,
`is based on motion-compensated temporal differential pulse
`code modulation (DPCM) together with spatial decorrelating
`transformations [15]. DPCM is characterized by the use of
`synchronous prediction loops at the encoder and decoder.
Differences between these prediction loops lead to a "drift" that can accumulate over time and produce annoying artifacts. However, the scalability bit stream adaptation operation, i.e., the removal of parts of the video bit stream, can produce such differences.
`
Subband or transform coding does not have the drift property of DPCM. Therefore, video coding techniques based on
`motion-compensated 3-D wavelet transforms have been studied
`extensively for use in SVC [16]-[19]. The progress in wavelet-
`based video coding caused MPEG to start an activity on ex-
`ploring this technology. As a result, MPEG issued a call for
proposals for efficient SVC technology in October 2003 with
`the intention to develop a new SVC standard. Twelve of the
`14 submitted proposals in response to this call [20] represented
`scalable video codecs based on 3-D wavelet transforms, while
`the remaining two proposals were extensions of H.264/AVC
`[6]. After a six-month evaluation phase, in which several sub-
`jective tests for a variety of conditions were carried out and
`the proposals were carefully analyzed regarding their poten-
`tial for a successful future standard, the scalable extension of
H.264/AVC as proposed in [21] was chosen as the starting point [22] of MPEG's SVC project in October 2004. In January 2005, MPEG and VCEG agreed to jointly finalize the SVC project as an Amendment of H.264/AVC within the Joint Video Team.
`
Although the initial design [21] included a wavelet-like decomposition structure in the temporal direction, it was later removed from the SVC specification [10]. Reasons for that removal included drastically reduced encoder and decoder complexity and improvements in coding efficiency. It was shown that an adjustment of the DPCM prediction structure can lead to significantly improved drift control, as will be shown in this paper. Despite this change, most components of the proposal in [21] remained unchanged from the first model [22] to the latest draft [10], being augmented by methods for nondyadic scalability and interlaced processing which were not included in the initial design.
`
IV. H.264/AVC BASICS
`
`SVC was standardized as an extension of H.264/AVC. In
`
`order to keep the paper self-contained, the following brief
`description of H.264/AVC is limited to those key features
`that are relevant for understanding the concepts of extending
H.264/AVC towards SVC. For more detailed information about H.264/AVC, the reader is referred to the standard [6] or corresponding overview papers [23]-[26].
Conceptually, the design of H.264/AVC covers a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). While the VCL creates a coded representation of the source content, the NAL formats these data and provides header information in a way that enables simple and effective customization of the use of VCL data for a broad variety of systems.
`
`A. Network Abstraction Layer (NAL)
`
The coded video data are organized into NAL units, which are packets that each contain an integer number of bytes. A
`NAL unit starts with a one-byte header, which signals the type
`of the contained data. The remaining bytes represent payload
`data. NAL units are classified into VCL NAL units, which con-
`tain coded slices or coded slice data partitions, and non-VCL
`NAL units, which contain associated additional information.
`The most important non-VCL NAL units are parameter sets and
`Supplemental Enhancement Information (SEI). The sequence
`and picture parameter sets contain infrequently changing infor-
`mation for a video sequence. SEI messages are not required for
decoding the samples of a video sequence. They provide additional information which can assist the decoding process or related processes like bit stream manipulation or display. A set of consecutive NAL units with specific properties is referred to as an access unit. The decoding of an access unit results in exactly one decoded picture. A set of consecutive access units with certain properties is referred to as a coded video sequence. A coded video sequence represents an independently decodable part of a NAL unit bit stream. It always starts with an instantaneous decoding refresh (IDR) access unit, which signals that the IDR access unit and all following access units can be decoded without decoding any previous pictures of the bit stream.
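The one-byte NAL unit header described above packs three fixed-width fields. A minimal sketch of splitting such a byte, following the H.264/AVC syntax (1-bit forbidden zero bit, 2-bit nal_ref_idc, 5-bit nal_unit_type):

```python
def parse_nal_header(first_byte: int) -> dict:
    """Split the one-byte H.264 NAL unit header into its three fields."""
    return {
        "forbidden_zero_bit": (first_byte >> 7) & 0x1,  # must be 0 in a valid stream
        "nal_ref_idc": (first_byte >> 5) & 0x3,         # 0 means not used for reference
        "nal_unit_type": first_byte & 0x1F,             # e.g., 5 = IDR slice, 7 = SPS, 8 = PPS
    }

# 0x67 is the header byte of a sequence parameter set (nal_unit_type 7)
print(parse_nal_header(0x67))
```

In a real bit stream the header byte follows a start code or length prefix; the payload bytes after it would then be unescaped and parsed according to the signaled type.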
`
`B. Video Coding Layer (VCL)
`The VCL of H.264/AVC follows the so-called block-based
hybrid video coding approach. Although its basic design is very
`similar to that of prior video coding standards such as H.261,
`MPEG-1 Video, H.262|MPEG-2 Video, H.263, or MPEG-4
`Visual, H.264/AVC includes new features that enable it to
`achieve a significant improvement in compression efficiency
`relative to any prior video coding standard [14]. The main dif-
`ference to previous standards is the largely increased flexibility
`and adaptability of H.264/AVC.
`The way pictures are partitioned into smaller coding units in
`H.264/AVC, however, follows the rather traditional concept of
`subdivision into macroblocks and slices. Each picture is par-
titioned into macroblocks that each cover a rectangular picture area of 16 x 16 luma samples and, in the case of video in
`4:2:0 chroma sampling format, 8 x 8 samples of each of the two
`chroma components. The samples of a macroblock are either
`spatially or temporally predicted, and the resulting prediction
`residual signal is represented using transform coding. The mac-
roblocks of a picture are organized in slices, each of which can be parsed independently of other slices in a picture. Depending
`on the degree of freedom for generating the prediction signal,
`H.264/AVC supports three basic slice coding types.
`1) I-slice: intra-picture predictive coding using spatial predic-
`tion from neighboring regions,
`2) P-slice: intra-picture predictive coding and inter-picture
`predictive coding with one prediction signal for each pre-
`dicted region,
`3) B-slice: intra-picture predictive coding, inter-picture pre-
`dictive coding, and inter-picture bipredictive coding with
`two prediction signals that are combined with a weighted
`average to form the region prediction.
`For I-slices, H.264/AVC provides several directional spatial
`intra-prediction modes, in which the prediction signal is gener-
`ated by using neighboring samples of blocks that precede the
`block to be predicted in coding order. For the luma component,
the intra-prediction is either applied to 4 x 4, 8 x 8, or 16 x 16 blocks, whereas for the chroma components, it is always applied on a macroblock basis.¹
`
`For P- and B-slices, H.264/AVC additionally permits variable
`block size motion-compensated prediction with multiple refer-
`ence pictures [27]. The macroblock type signals the partitioning
of a macroblock into blocks of 16 x 16, 16 x 8, 8 x 16, or 8 x 8
`luma samples. When a macroblock type specifies partitioning
`into four 8 x 8 blocks, each of these so-called submacroblocks
`can be further split into 8 x 4,4 x 8, or4 x 4blocks, whichis in-
`dicated through the submacroblock type. For P-slices, one mo-
`tion vector is transmitted for each block. In addition, the used
`reference picture can be independently chosen for each 16 x 16,
16 x 8, or 8 x 16 macroblock partition or 8 x 8 submacroblock.
`It is signaled via a reference index parameter, which is an index
`into a list of reference pictures that is replicated at the decoder.
In B-slices, two distinct reference picture lists are utilized, and for each 16 x 16, 16 x 8, or 8 x 16 macroblock partition or 8 x 8 submacroblock, the prediction method can be selected between list 0, list 1, or biprediction. While list 0 and list 1 prediction refer to unidirectional prediction using a reference picture of reference picture list 0 or 1, respectively, in the bipredictive mode, the prediction signal is formed by a weighted sum of a list 0 and list 1 prediction signal. In addition, special modes such as the so-called direct modes in B-slices and skip modes in P- and B-slices are provided, in which such data as motion vectors and reference indexes are derived from previously transmitted information.
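The default (non-weighted) bipredictive combination can be sketched per sample as a rounded average of the two prediction signals; H.264/AVC additionally supports implicit and explicit weighted prediction, which is omitted here:

```python
def bipredict(list0_block, list1_block):
    """Default (non-weighted) biprediction: per-sample average of the
    list 0 and list 1 prediction signals, with rounding toward +infinity."""
    return [(p0 + p1 + 1) >> 1 for p0, p1 in zip(list0_block, list1_block)]

print(bipredict([100, 101, 102], [104, 102, 103]))  # -> [102, 102, 103]
```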
`
For transform coding, H.264/AVC specifies a set of integer transforms of different block sizes. While for intra-macroblocks the transform size is directly coupled to the intra-prediction block size, the luma signal of motion-compensated macroblocks that do not contain blocks smaller than 8 x 8 can be coded by
`
¹Some details of the profiles of H.264/AVC that were designed primarily to serve the needs of professional application environments are neglected in this description, particularly in relation to chroma processing and range of step sizes.
`
using either a 4 x 4 or 8 x 8 transform. For the chroma components, a two-stage transform, consisting of 4 x 4 transforms and a Hadamard transform of the resulting DC coefficients, is employed.¹ A similar hierarchical transform is also used for the luma component of macroblocks coded in intra 16 x 16 mode.
`All inverse transforms are specified by exact integer operations,
`so that inverse-transform mismatches are avoided. H.264/AVC
`
uses uniform reconstruction quantizers. One of 52 quantization step sizes¹ can be selected for each macroblock by the quantization parameter QP. The scaling operations for the quantization
`step sizes are arranged with logarithmic step size increments,
`such that an increment of the QP by 6 corresponds to a dou-
`bling of quantization step size.
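The logarithmic QP-to-step-size relation can be sketched as follows. The nominal step sizes for QP 0 to 5 are commonly cited approximate values; the standard itself realizes quantization through integer scaling tables rather than these floating-point numbers:

```python
# Nominal quantization step sizes for QP 0..5 (commonly cited values).
QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]

def qstep(qp: int) -> float:
    """Step size doubles for every increment of QP by 6."""
    return QSTEP_BASE[qp % 6] * (1 << (qp // 6))

# An increment of QP by 6 doubles the step size:
assert qstep(10) == 2 * qstep(4)
```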
For reducing blocking artifacts, which are typically the most disturbing artifacts in block-based coding, H.264/AVC specifies an adaptive deblocking filter, which operates within the motion-
`compensated prediction loop.
`H.264/AVC supports two methods of entropy coding, which
`both use context-based adaptivity to improve performance rel-
`ative to prior standards. While context-based adaptive variable-
length coding (CAVLC) uses variable-length codes and its adap-
`tivity is restricted to the coding of transform coefficient levels,
`context-based adaptive binary arithmetic coding (CABAC) uti-
`lizes arithmetic coding and a more sophisticated mechanism for
`employing statistical dependencies, which leads to typical bit
`rate savings of 10%—15% relative to CAVLC.
`In addition to the increased flexibility on the macroblock
`level, H.264/AVC also allows much more flexibility on a picture
`and sequence level compared to prior video coding standards.
`Here we mainly refer to reference picture memory control.
`In H.264/AVC, the coding and display order of pictures is
`completely decoupled. Furthermore, any picture can be marked
`as reference picture for use in motion-compensated prediction
`of following pictures, independent of the slice coding types.
`The behavior of the decoded picture buffer (DPB), which can
`hold up to 16 frames (depending on the used conformance
`point and picture size), can be adaptively controlled by memory
`management control operation (MMCO) commands, and the
`reference picture lists that are used for coding of P- or B-slices
`can be arbitrarily constructed from the pictures available in the
DPB via reference picture list reordering (RPLR) commands.
In order to enable a flexible partitioning of a picture into slices, the concept of slice groups was introduced in H.264/AVC. The macroblocks of a picture can be arbitrarily partitioned into slice groups via a slice group map. The slice group map, which is specified by the content of the picture parameter set and some slice header information, assigns a unique slice group identifier to each macroblock of a picture. Each slice is obtained by scanning the macroblocks of a picture that have the same slice group identifier as the first macroblock of the slice in raster-scan order. Similar to prior
`video coding standards, a picture comprises the set of slices
`representing a complete frame or one field of a frame (such
`that, e.g., an interlaced-scan picture can be either coded as a
`single frame picture or two separate field pictures). Addition-
`ally, H.264/AVC supports a macroblock-adaptive switching
`between frame and field coding. For that, a pair of vertically
`adjacent macroblocks is considered as a single coding unit,
`
`
`
`
`
`
`
which can be either transmitted as two spatially neighboring frame macroblocks, or as interleaved top and bottom field macroblocks.
`
`V. BASIC CONCEPTS FOR EXTENDING H.264/AVC
`TOWARDS AN SVC STANDARD
`
Apart from the required support of all common types of scalability, the most important design criteria for a successful SVC standard are coding efficiency and complexity, as was noted in Section II. Since SVC was developed as an extension of H.264/AVC with all of its well-designed core coding tools being inherited, one of the design principles of SVC was that new tools should only be added if necessary for efficiently supporting the required types of scalability.
`
`A. Temporal Scalability
`
A bit stream provides temporal scalability when the set of corresponding access units can be partitioned into a temporal base layer and one or more temporal enhancement layers with the following property. Let the temporal layers be identified by a temporal layer identifier T, which starts from 0 for the base layer and is increased by 1 from one temporal layer to the next. Then for each natural number k, the bit stream that is obtained by removing all access units of all temporal layers with a temporal layer identifier T greater than k forms another valid bit stream for the given decoder.
For hybrid video codecs, temporal scalability can generally be enabled by restricting motion-compensated prediction to reference pictures with a temporal layer identifier that is less than or equal to the temporal layer identifier of the picture to be predicted. The prior video coding standards MPEG-1 [2], H.262 | MPEG-2 Video [3], H.263 [4], and MPEG-4 Visual [5] all support temporal scalability to some degree. H.264/AVC [6] provides a significantly increased flexibility for temporal scalability because of its reference picture memory control. It allows the coding of picture sequences with arbitrary temporal dependencies, which are only restricted by the maximum usable DPB size. Hence, for supporting temporal scalability with a reasonable number of temporal layers, no changes to the design of H.264/AVC were required. The only related change in SVC refers to the signaling of temporal layers, which is described in Section VI.
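The bit stream adaptation described here amounts to a simple filter over access units. A sketch, assuming a hypothetical in-memory representation of access units as (coding_order, temporal_id, payload) tuples:

```python
def extract_temporal_substream(access_units, k):
    """Keep only the access units whose temporal layer identifier T
    is less than or equal to k; all higher layers are simply dropped."""
    return [au for au in access_units if au[1] <= k]

# Hypothetical stream: base layer (T0) plus two enhancement layers.
stream = [(0, 0, b'I'), (1, 1, b'B'), (2, 2, b'B'), (3, 2, b'B'), (4, 0, b'P')]
print(extract_temporal_substream(stream, 1))  # keeps only the T0 and T1 units
```

Because prediction is restricted to equal or lower temporal layers, the retained units still form a valid, decodable bit stream.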
`
1) Hierarchical Prediction Structures: Temporal scalability with dyadic temporal enhancement layers can be very efficiently provided with the concept of hierarchical B-pictures [28], [29] as illustrated in Fig. 1(a).² The enhancement layer pictures are typically coded as B-pictures, where the reference picture lists 0 and 1 are restricted to the temporally preceding and succeeding picture, respectively, with a temporal layer identifier less than the temporal layer identifier of the predicted picture. Each set of temporal layers {T0, T1, ..., Tk} can be decoded independently of all layers with a temporal layer identifier T > k. In the following, the set of pictures between two successive pictures of
`
²As described above, neither P- nor B-slices are directly coupled with the management of reference pictures in H.264/AVC. Hence, backward prediction is not necessarily coupled with the use of B-slices, and the temporal coding structure of Fig. 1(a) can also be realized using P-slices, resulting in a structure that is often called hierarchical P-pictures.
`
`
`
Fig. 1. Hierarchical prediction structures for enabling temporal scalability. (a) Coding with hierarchical B-pictures. (b) Nondyadic hierarchical prediction structure. (c) Hierarchical prediction structure with a structural encoding/decoding delay of zero. The numbers directly below the pictures specify the coding order; the symbols Tk specify the temporal layers, with k representing the corresponding temporal layer identifier.
`
the temporal base layer together with the succeeding base layer picture is referred to as a group of pictures (GOP).
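For the dyadic hierarchy of Fig. 1(a), the temporal layer identifier of a picture follows directly from its display position within a GOP. A sketch of this derivation (illustrative only; the standard signals temporal layers explicitly, as noted above):

```python
def temporal_id(display_pos: int, gop_size: int) -> int:
    """Temporal layer of a picture in a dyadic hierarchy (gop_size = 2^L).
    Base layer pictures (multiples of gop_size) get T = 0; each halving of
    the picture spacing adds one enhancement layer."""
    L = gop_size.bit_length() - 1
    p = display_pos % gop_size
    if p == 0:
        return 0
    # L minus the number of trailing zero bits of p
    return L - (p & -p).bit_length() + 1

print([temporal_id(p, 8) for p in range(9)])  # -> [0, 3, 2, 3, 1, 3, 2, 3, 0]
```

For a GOP of 8 pictures this reproduces the four dyadic layers: the middle picture forms T1, the quarter positions T2, and the odd positions T3.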
`Although the described prediction structure with hierarchical
`B-pictures provides temporal scalability and also shows excel-
`lent coding efficiency as will be demonstrated later, it repre-
`sents a special case. In general, hierarchical prediction struc-
`tures for enabling temporal scalability can always be combined
`with the multiple reference picture concept of H.264/AVC. This
`means that the reference picture lists can be constructed by using
`more than one reference picture, and they can also include pic-
`tures with the same temporal level as the picture to be pre-
`dicted. Furthermore, hierarchical prediction structures are not
`restricted to the dyadic case. As an example, Fig. 1(b) illustrates
a nondyadic hierarchical prediction structure, which provides 2 independently decodable subsequences with 1/9th and 1/3rd of the full frame rate. It should further be noted that it is possible to arbitrarily modify the prediction structure of the temporal base layer, e.g., in order to increase the coding efficiency. The chosen temporal prediction structure does not need to be constant over time.
`
Note that it is possible to arbitrarily adjust the structural delay between encoding and decoding a picture by restricting motion-compensated prediction from pictures that follow the picture to be predicted in display order. As an example, Fig. 1(c)
`shows a hierarchical prediction structure, which does not em-
`ploy motion-compensated prediction from pictures in the future.
`Although this structure provides the same degree of temporal
`scalability as the prediction structure of Fig. 1(a), its structural
`delay is equal to zero compared to 7 pictures for the prediction
`structure in Fig. 1(a). However, such low-delay structures typi-
`cally decrease coding efficiency.
`