Spatial Scalability Within the H.264/AVC Scalable Video Coding Extension

C. Andrew Segall, Member, IEEE, and Gary J. Sullivan, Fellow, IEEE

(Invited Paper)

Abstract—A scalable extension to the H.264/AVC video coding standard has been developed within the Joint Video Team (JVT), a joint organization of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). The extension allows multiple resolutions of an image sequence to be contained in a single bit stream. In this paper, we introduce the spatially scalable extension within the resulting Scalable Video Coding standard. The high-level design is described and individual coding tools are explained. Additionally, encoder issues are identified. Finally, the performance of the design is reported.

Index Terms—H.264/AVC, Scalable Video Coding (SVC), spatial scalability.

I. INTRODUCTION

WITH the expectation that future applications will support a diverse range of display resolutions and transmission channel capacities, the Joint Video Team (JVT) has developed a scalable extension [1], [2] to the state-of-the-art H.264/AVC video coding standard [3]-[6]. This extension is commonly known as Scalable Video Coding (SVC), and it provides support for multiple display resolutions within a single compressed bit stream (or in hierarchically related bit streams), which is referred to here as spatial scalability. Additionally, the SVC extension supports combinations of temporal scalability (frame rate enhancement) and quality scalability (fidelity enhancement for pictures of the same resolution) with the spatial scalability feature [2]. This is achieved while balancing both decoder complexity and coding efficiency.
The resolution diversity of current display devices motivates the need for spatial scalability. Specifically, larger format, high definition displays are becoming common in consumer applications, with displays containing over two million pixels readily available. By contrast, lower resolution displays with between ten thousand and one hundred thousand pixels are also popular in applications constrained by size, power, and weight. Unfortunately, transmitting a single representation of a video sequence to the range of display resolutions available in the market is impractical. For example, it is rarely justifiable to design a device

Manuscript received January 7, 2007; revised July 24, 2007. This paper was recommended by Guest Editor T. Wiegand.
C. A. Segall is with Sharp Laboratories of America, Camas, WA 98607 USA (e-mail: asegall@sharplabs.com).
G. J. Sullivan is with Microsoft Corporation, Redmond, WA 98052 USA (e-mail: garysull@microsoft.com).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2007.906824

with low display resolution with the capacity for decoding and down-sampling high-resolution video material. Such a requirement could increase the cost and power of the device to the point of exceeding the very constraints that determined its display resolution. In addition, sending the high-resolution details that are ultimately not shown on the display of such a device is a waste of its receiving channel bit rate.

Diverse, limited, and time-varying channel capacity provides a second motivation for spatial scalability. Here, the concern is that channel capacity may preclude the reliable transmission of high-resolution video to specific devices or at specific time instances. Spatial scalability allows for the rapid bit rate adaptation that can be a necessity in such scenarios. This bit rate adaptation is achieved without transcoding operations or feedback to a complex real-time encoding process, both of which can introduce unacceptable complexity and delay.
The purpose of this paper is to discuss key concepts of spatial scalability within the SVC extension. This project is the fourth in a historical series of efforts to standardize spatially scalable video coding schemes (after prior efforts in MPEG-2 [7], [8], H.263 Annex O [9], and MPEG-4 part 2 [10]), although the prior designs were basically not successful in terms of industry adoption. This paper points out several ways in which the new design addresses the problems of those prior approaches.

The rest of this paper is organized as follows. Section II provides an overview of H.264/AVC spatially scalable coding and compares it to alternative scalable approaches. In Section III, the specific coding tools within the spatial SVC design are described. In Section IV, encoder issues related to spatial SVC are considered. In Section V, the performance of the spatial SVC extension is presented. Finally, conclusions are provided in Section VI.

II. OVERVIEW

The SVC extension of H.264/AVC provides a mechanism for reusing an encoded lower resolution version of an image sequence for the coding of a corresponding higher resolution sequence. This is shown in Fig. 1, where a diagram of a hypothetical SVC encoder is provided. Subsequent sections discuss the specific tools introduced in the SVC extension. However, to better aid in the understanding of the SVC design, this section focuses on higher level concepts. We begin by identifying basic concepts and definitions necessary for discussion of the SVC design. Then, we consider the high-level spatial relationship between resolutions in a bit stream. Finally, we summarize
1051-8215/$25.00 © 2007 IEEE

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 17, NO. 9, SEPTEMBER 2007

[Fig. 1 (block diagram): an enhancement layer encoder processes the high resolution sequence (separate frame into macroblocks, transform and quantize, scaling and inverse transform, deblocking operation, intra-prediction, motion compensation, encoder control); a parallel base layer encoder processes the low resolution sequence; inter-layer prediction connects the two encoding paths.]

Fig. 1. High-level diagram of spatial scalability in the SVC design. The "base layer" encoder takes a lower resolution video sequence as input and encodes it with the H.264/AVC video coding standard while conforming to a legacy profile. The enhancement layer encoder takes a higher resolution sequence as input. The higher resolution sequence can be encoded with ordinary H.264/AVC technologies. Moreover, inter-layer prediction can be used to provide additional coding choices. For the case of intra-picture coded blocks in the base layer, reconstructed intensities provide a prediction for the enhancement layer. For the case of inter-picture coded blocks in the base layer, enhancement layer motion vectors and residual difference information can be predicted from the base layer. Further resolution layers can be added in an analogous fashion and can utilize either the base layer or previously transmitted enhancement layers for inter-layer prediction. Moreover, other forms of SVC (temporal or quality) enhancement may also be present.

two key design concepts in the SVC extension—image pyramids and single-loop decoding.

A. Basic Concepts

The basic mission of a scalable design is two-fold: 1) to minimize the coding efficiency loss relative to single-layer coding; and 2) to minimize the complexity increase (especially for decoders) relative to single-layer coding. By single-layer coding, we refer to the coding of a video sequence without providing the scalability functionality. Unless a result with coding efficiency significantly superior to a simulcast solution can be obtained, a scalable solution with any complexity penalty is useless. By simulcast, we refer to the coding of both source video sequences of a scalable scenario as entirely separate single-layer bit streams and transmitting them using the sum of the two bit rates. The challenge here is considerable—among the three typical basic forms of bit stream scalability, i.e., spatial, temporal, and quality, the spatial form seems to be the most difficult in which to achieve significant superiority to a simulcast solution. One dominant reason for this is the focus of the JVT on supporting lower resolution versions of image sequences with high visual quality, as opposed to lower resolution representations that provide high coding efficiency.

The lowest resolution video data in a spatially scalable system is sometimes referred to as the base layer (especially when it is decodable by an ordinary nonscalable single-layer decoder), and the higher resolution video data is often referred to as the enhancement layer. Processes that determine or predict the value of enhancement layer data from previously reconstructed data of a lower resolution layer at the same time instance are referred to as inter-layer prediction processes, and the source for the prediction is referred to as the reference layer. Other forms of prediction include inter-picture prediction, involving prediction operating temporally between different pictures of the same resolution layer, and intra-picture prediction, involving prediction operating spatially within the same picture of one particular resolution layer.

From a video coding specification perspective, the set of data comprising an SVC representation is treated as a single bit stream. However, from a systems multiplex or file storage perspective, the data might often be handled differently—as distinct hierarchically related streams of content that are coordinated using decoding timestamps or other such mechanisms. In this fashion, a system can ease the handling of the data, such as enabling channel bit rate adaptation or ensuring that legacy decoders that do not support scalability are presented with only the base layer for decoding.

Some degree of familiarity with the concepts of the original H.264/AVC standard is assumed in the presentation provided herein, such as the concepts of macroblocks, motion partitions, biprediction, inter-picture prediction using multiple reference pictures, and reference picture lists. Readers unfamiliar with this background information may benefit from referring to [3]-[6]. Moreover, one topic that is somewhat neglected in this presentation is that of interlaced-scan video content. Herein the principles of the spatial SVC design are explained under the assumption of frame-structured progressive-scan pictures, so that the concepts can be described without the need to consider the details of the handling of interlaced fields and frames. The application of these SVC concepts to interlaced video is straightforward for those familiar with interlaced video coding using H.264/AVC. For further information about interlaced video support in the SVC context, the reader is referred to [11]. The overview of SVC in general that is found in [2] will also be of interest to many readers.

An additional simplification used in much of the discussion for this overview paper is to primarily consider a bit stream containing only two layers—a lower resolution base layer and a higher resolution spatial scalability enhancement layer. In fact, the SVC design fully supports multilayer scenarios including multiple spatial scalability layers and the mixing of spatial scalability layers with other layers that provide temporal or quality scalability. Considerable flexibility is also provided in regard to the selection of the reference layer for each enhancement layer, such that a bit stream can contain branching dependency structures.

B. Inter-Layer Spatial Relationships and Profile Constraints

An important feature of the SVC design, from a high-level functionality perspective, is the ability for the lower resolution and higher resolution pictures in a spatially scalable bit stream to represent different regions of a video scene. For example, a system may transmit standard definition television content, with a picture aspect ratio of 4:3, as a base layer and high definition television content, with a picture aspect ratio of 16:9, in a higher resolution enhancement layer. Such a use case requires cropping and offsetting the origins of the picture regions in addition to scaling, as the lower resolution layer signal may not represent the entire extent of the higher resolution sequence (and vice versa). The common "pan and scan" technique used on standard-definition DVDs for converting wide screen data for display on a 4:3 display is an example of a more limited form of such display adaptation. The SVC extension supports such capability in a flexible but straightforward manner. Relative positioning and windowing parameters are provided in picture-level syntax structures, so that flexible cropping, scaling, and alignment relationships can not only be supported but may be varied on a picture-by-picture basis.

However, such flexibility can be constrained to simplify the use cases for particular applications. In particular, the SVC extension includes the definition of three profiles of the design. These are the "Scalable Baseline" profile, the "Scalable High" profile, and the "Scalable High Intra" profile. While the latter two profiles support full spatial SVC flexibility, the Scalable Baseline profile imposes the following constraints to enable simplified application scenarios.
* The width and height of the scaled regions of lower resolution and higher resolution pictures must have the same scaling ratio, and this ratio can only have the value 1.5 or 2.
* The spatial offsets specifying the relative location of the upper left corner of the lower and higher resolution picture regions must be multiples of 16 both horizontally and vertically (i.e., they must be in units of macroblocks).
The case using a scaling ratio of 2 with spatial offset constraints as noted above is often referred to as dyadic spatial scalability, whereas the more general case is known as extended spatial scalability [12].

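As an illustration, the Scalable Baseline constraints listed above might be checked as follows. This is only a sketch; the function and parameter names are ours and are not part of the SVC syntax.

```python
def scalable_baseline_ok(base_w, base_h, scaled_w, scaled_h,
                         offset_x, offset_y):
    """Check the Scalable Baseline profile constraints on the
    inter-layer spatial relationship (illustrative sketch):
    - a single scaling ratio of 1.5 or 2 in both dimensions, and
    - spatial offsets in whole macroblock (16-sample) units."""
    ratio_w = scaled_w / base_w
    ratio_h = scaled_h / base_h
    if ratio_w != ratio_h or ratio_w not in (1.5, 2.0):
        return False
    return offset_x % 16 == 0 and offset_y % 16 == 0

# Dyadic spatial scalability: ratio 2, macroblock-aligned offsets.
print(scalable_baseline_ok(704, 576, 1408, 1152, 0, 0))   # True
# Ratio 1.5 with macroblock-aligned offsets is also permitted.
print(scalable_baseline_ok(704, 576, 1056, 864, 16, 32))  # True
# An offset of 8 samples is not a whole macroblock.
print(scalable_baseline_ok(704, 576, 1408, 1152, 8, 0))   # False
```

The Scalable High and Scalable High Intra profiles would accept the more general (extended spatial scalability) relationships that this check rejects.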
C. Image Pyramids and Related Coarse-to-Fine Hierarchies

Image pyramids describe a relationship between lower resolution and higher resolution versions of an image.¹ This relationship is found in a variety of image and video processing scenarios, and image pyramids have been incorporated into a variety of applications, e.g., [13]-[19], as well as previous scalable efforts in the video coding standards [7]-[10]. In the SVC extension, a coarse-to-fine hierarchy of images is also used for spatial scalability. The original high-resolution image sequence is converted to lower resolutions by filtering and decimating. Then, the sequence of pictures at the lowest of these resolutions is coded in a manner such that it can be decoded independently. Each higher resolution video sequence is coded relative to a decoded lower resolution sequence.

As with prior international standards for video coding (scalable and nonscalable), the scope of the standard is limited to specifying the decoding process and the format of the syntax. Encoder designers are free to use any encoding algorithms they wish, so long as the bit stream they produce conforms to the format specification. Any kind of preprocessing is also allowed prior to encoding, and decoding devices are allowed to contain any sort of post-processing, error and loss concealment techniques, and display-related customization.

¹The terms picture and image are used interchangeably herein.

The use of an image pyramid for video coding does not come without penalties. Specifically, an image pyramid is an overcomplete decomposition. In other words, the number of image samples in the entire pyramid structure is larger than the number of samples in an original high-resolution image. This is in contrast to embedded representations that use critically sampled decompositions. For example, wavelet decompositions are well known to provide inherent scalability and viable image coding designs [20]-[24]. In the development of the SVC extension, such critically sampled decompositions were also considered [25]-[28]. However, the aliasing introduced by these decompositions, while tolerable for still image coding, was deemed problematic for video. Specifically, the aliasing can make effective motion compensated inter-picture prediction more difficult, as well as lead to objectionable temporal artifacts. Additionally, a wavelet design may be likely to require more computational resources than the traditional block-based coding approach.
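As a worked example of this overcompleteness, under the assumption of dyadic (2:1) down-sampling in each dimension, each coarser level holds one quarter the samples of the level above it, so the whole pyramid never exceeds 4/3 the samples of the original image (the geometric series 1 + 1/4 + 1/16 + ...). The helper below is ours, for illustration only:

```python
def pyramid_sample_count(width, height, levels):
    """Total samples in a dyadic image pyramid with `levels` levels,
    level 0 being the original image (hypothetical helper)."""
    total = 0
    w, h = width, height
    for _ in range(levels):
        total += w * h
        # Dyadic down-sampling halves each dimension (rounding up).
        w, h = (w + 1) // 2, (h + 1) // 2
    return total

# A three-level pyramid over a 1920x1080 image:
orig = 1920 * 1080
print(pyramid_sample_count(1920, 1080, 3) / orig)  # 1.3125, below 4/3
```

By contrast, a critically sampled decomposition of the same image would contain exactly `orig` samples across all subbands.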
The decision to use an image pyramid in the SVC project provides flexibility for encoder and application designers. The down-sampling operation is not defined in the standard, so that encoder designers are free to employ the down-sampler that they consider most suitable. For example, applications that are sensitive to encoder hardware costs would select a down-sampler with minimum complexity for the specific implementation architecture. Alternatively, in other applications, additional computational complexity may be acceptable in order to achieve a higher quality result. These applications would choose a more sophisticated, and likely more complex, down-sampling method.
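Since the down-sampling filter is left to the encoder designer, the following is only an illustrative sketch of the filter-and-decimate pyramid construction, using a simple 2x2 box average as the filter. The filter choice and function names are ours; the standard neither mandates nor suggests them.

```python
def downsample_2x(frame):
    """Filter and decimate a sample plane by 2 in each dimension using
    a 2x2 box average (an illustrative filter choice only; the SVC
    standard does not specify the down-sampling operation)."""
    h, w = len(frame), len(frame[0])
    out = []
    for y in range(0, h - 1, 2):
        row = []
        for x in range(0, w - 1, 2):
            s = (frame[y][x] + frame[y][x + 1]
                 + frame[y + 1][x] + frame[y + 1][x + 1])
            row.append((s + 2) // 4)  # rounded average
        out.append(row)
    return out

def build_pyramid(frame, levels):
    """Coarse-to-fine hierarchy: element 0 is the original image,
    each following element is a filtered and decimated version."""
    pyramid = [frame]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid
```

A hardware-cost-sensitive encoder might replace the box average with plain decimation, while a quality-oriented encoder might use a longer polyphase filter; the pyramid structure is unchanged either way.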
D. Single-Loop Decoding Concept

The concept of image pyramids describes the relationship between images of different resolution. However, image pyramids do not capture the evolution of that relationship between compressed images through time in sequences of such images. To understand this relationship, we need to consider the concepts of multiloop and single-loop decoding. Single-loop decoding, also called constrained inter-layer prediction [29]-[32], is a fundamental property of the new SVC design, and it is described in the remainder of the section.

In the family of ITU-T and ISO/IEC video coding standards, which includes H.264/AVC, block-wise motion compensated inter-picture prediction plays a critical role in improving coding efficiency. This is accomplished by transmitting (or having the decoder infer) one or more motion vectors to predict a block in the current picture from the content of previously decoded reference pictures. Then, additional information about the residual difference between the prediction and the actual image data may be sent. For natural image sequences, which often contain slowly evolving features, the motion compensation process exploits the inherent characteristics of the image sequence.

In designing the SVC extension of the H.264/AVC standard, a fundamental question was how to use the motion compensation process within the context of spatial scalability. One potential approach (used in all previous standardized designs) would be to perform multiloop decoding. In this scenario, each low-resolution picture is completely decoded, including low-resolution motion-compensation prediction operations in particular. Then, the coarse-to-fine relationship of the image pyramid is used to predict the lower frequency components of a higher resolution enhancement-layer picture using up-sampling of the decoded lower resolution picture. Additionally, motion compensated inter-picture prediction is performed again at the enhancement layer. This predicts the high frequency components of the enhancement layer.

Using a decoder with multiple motion compensation loops does improve the coding efficiency of a scalable video codec, but the benefit in coding efficiency turns out to be minimal when all available coded data is used effectively in other ways [29]-[32]. Moreover, the multiloop decoding scheme increases decoding complexity. Motion compensation is performed at each resolution, and the reconstructed pictures of all levels of the pyramid are stored for each time instant. This becomes problematic in practice, as motion compensation requires high memory bandwidth for many processing architectures [33], and the extra decoding processes involved in multiloop decoding add undesirable sequential dependencies to the decoding process as well as require extra encoder and decoder implementation and debugging efforts.

In the SVC design, a lower complexity approach is adopted. Motion compensation is performed only at the target decoded resolution (e.g., the displayed resolution). Thus, the decoding structure of the SVC design is referred to as a single-loop design, which simply means that only the operation of a single motion compensation loop is necessary to reconstruct the image sequence for any resolution layer. This provides an important feature, as it reduces the complexity of motion compensation to that of a single-layer decoder—eliminating the major source of complexity penalty in prior SVC designs. As will be seen in the next section, good coding efficiency can still be achieved without requiring multiloop decoding, by effectively propagating the information found in the coded motion vectors, mode information, and residual difference data from each lower resolution layer to each next higher resolution layer. This propagation employs the previously described image pyramid concept.

To further ease implementation, the syntax of the SVC extension has been designed in a way that allows the separate parsing of each layer of the syntax (without parsing other layers and without operating the decoding processes of lower layers) [34], [35]. Completing the full decoding process, of course, requires further processing of some parsed data of each layer up to the target decoded layer (but not full multilayer decoding, due to the single-loop nature of the design).

III. CODING TOOLS

SVC introduces several design features to enable spatial scalability. These tools include the calculation of corresponding positions in different resolution layers, methods for inter-layer prediction of various data such as macroblock prediction modes and motion vectors, an "I_BL" macroblock type that uses inter-layer up-sampled image prediction, and a residual difference signal prediction technique that uses inter-layer up-sampled residual difference prediction. These tools are provided in addition to the original single-layer coding tools, such as (spatial) intra-picture and (temporal) inter-picture coding techniques, and an encoder must determine when each tool is most appropriate. Describing the new tools and how they are combined effectively with the original H.264/AVC single-layer design features is the focus of this section.

A. Calculation of Corresponding Spatial Positions

The first design feature that we will discuss in detail is the calculation of corresponding positions in adjacent levels of the pyramid hierarchy. This concept is used in several ways in the spatial SVC extension.

Identifying sample locations in a lower resolution layer that correspond to sample locations in the enhancement layer is performed at fractional-sample accuracy. Specifically, sample positions are calculated to 1/16th sample position increments and derived using fixed-point operations as

B_x = Round((E_x * D_x + R_x) / 2^(S-4))
B_y = Round((E_y * D_y + R_y) / 2^(S-4))                    (1)

where B_x and B_y are, respectively, the horizontal and vertical sample coordinates in the lower resolution (e.g., base layer) picture array, expressed in units of 1/16th of a sample, E_x and E_y are the horizontal and vertical sample coordinates in the high-resolution (enhancement-layer) picture array, R_x and R_y are higher precision (1/2^S sample position) reference offset locations for grid reference position alignment, and D_x and D_y are scaled inverses of the horizontal and vertical resampling ratios. D_x and D_y are specified as

D_x = Round(2^S * BaseWidth / ScaledBaseWidth)
D_y = Round(2^S * BaseHeight / ScaledBaseHeight)            (2)

where BaseWidth and BaseHeight denote the width and height of the rectangular region of the lower resolution picture array to be up-sampled, respectively, and ScaledBaseWidth and ScaledBaseHeight denote the width and height of the corresponding region of the up-sampled lower resolution picture array, respectively. The precision control parameter S has been chosen to trade off between precision and ease of computation; S is specified to be 16 for most uses to enable the use of 16-bit word-length arithmetic, and to be a somewhat larger number optimized for 32-bit arithmetic for enhanced-capability decoders that support very large picture sizes. The basic design of these formulas was proposed in [36], and some later refinements were subsequently applied. The formulas are designed for computational simplicity as follows.
* The above formulas are specified for implementation using two's complement integer operations, most of which require at most 16 bits of dynamic range (for example, noting that BaseWidth is always less than ScaledBaseWidth, D_x requires no more than S bits).
* Multiplication and division scale factors that are powers of two are specified to be performed using left and right binary arithmetic shifts.
* Rounding of a ratio is accomplished by adding half of the value of the denominator prior to right shifting.
* The D_x and D_y computations only need to be performed once, with the results reused repeatedly for computations of B_x and B_y for the entire image (or sequence of video images).
* When moving from position to position from left to right or top to bottom in computing B_x and B_y for a series of values of E_x and E_y, a multiplication operation can be converted to an addition so that computation of each B_x and B_y can be performed incrementally, requiring only one addition and one right shift operation to obtain the result of each formula.

This design supports essentially arbitrary resizing ratios (except in constrained applications using the Scalable Baseline profile), and the position calculation equations have low complexity regardless of the ratio, in contrast to some prior standardized designs in which only relatively simple rational ratios were practical due to the way the position calculations were specified.

B. Coarse-to-Fine Projection of Macroblock Modes, Motion Partitioning, Reference Picture Indices, and Motion Vectors

In the enhancement layer syntax, for areas of the enhancement layer that correspond to areas within the lower resolution picture, a flag, called the base mode flag, can be sent for each nonskipped macroblock² to determine whether the macroblock mode, motion segmentation, reference picture indices, and motion vectors are to be inferred from the data at corresponding positions in the lower resolution layer. The basic concepts of this inference process were proposed for use with dyadic spatial scalability in [37] and were extended to arbitrary spatial scalability relationships in [38]-[40]. In some sense, the projection consists of first projecting the sample grid of the finer level to the coarser level of the pyramid and then using this projection to propagate data from the coarser level to the finer level.

When the base mode flag is equal to 0, the macroblock prediction mode is sent within the enhancement layer macroblock-level syntax. Then, within each motion partition, a flag can be sent for each reference picture list, called the motion prediction flag,³ to determine whether reference picture indexes will be sent in the enhancement layer or not and whether the motion vectors are to be predicted within the enhancement layer or using inter-layer prediction from the lower resolution layer motion data.

²To save the need to repeatedly send the base mode flag in cases when an encoder will not vary its value in applicable macroblocks, a default value for the flag can alternatively be sent at the slice header level.
³To save the need to repeatedly send the motion prediction flag in cases when an encoder will not vary its value in applicable macroblocks, a default value for the flag can alternatively be sent at the slice header level.
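To make the fixed-point arithmetic of Section III-A concrete, the following sketch implements the form of (1) and (2) given there, including the half-denominator rounding and the incremental row evaluation noted in the bullets. It assumes S = 16; the function names are ours, and the code is illustrative rather than a reproduction of the normative derivation.

```python
S = 16  # precision control parameter (the 16-bit arithmetic case)

def scaled_inverse_ratio(base_extent, scaled_base_extent):
    """D = Round(2^S * BaseExtent / ScaledBaseExtent), per (2).
    Rounding adds half the denominator before the integer division."""
    return ((base_extent << S)
            + (scaled_base_extent >> 1)) // scaled_base_extent

def base_layer_position(e, d, r):
    """B = Round((E*D + R) / 2^(S-4)), per (1); the result is in
    units of 1/16th of a lower resolution layer sample."""
    return (e * d + r + (1 << (S - 5))) >> (S - 4)

def base_positions_for_row(e_start, count, d, r):
    """Incremental evaluation along a row: after the first position,
    each result needs only one addition and one right shift."""
    acc = e_start * d + r
    half = 1 << (S - 5)
    out = []
    for _ in range(count):
        out.append((acc + half) >> (S - 4))
        acc += d  # replaces the per-position multiplication
    return out

# Dyadic 2:1 example: D = 2^15, so enhancement samples 0..3 map to
# base positions 0, 0.5, 1.0, 1.5 (i.e., 0, 8, 16, 24 in 1/16 units).
d = scaled_inverse_ratio(8, 16)
print(base_positions_for_row(0, 4, d, 0))
```

Because D_x and D_y are computed once per picture (or sequence), the per-sample cost is dominated by the loop body above, which matches the "one addition and one right shift" claim in the text.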
When the base mode flag is equal to 1, since the finest granularity of H.264/AVC coding decisions is at the 4 x 4 level, the inference process is performed based on 4 x 4 luma block structures. For each 4 x 4 luma block, the process begins by identifying a corresponding block in the lower resolution layer. Numbering the samples of the luma block from 0 to 3 both horizontally and vertically, the luma sample at position (1,1) is used to determine the block's associated data. A corresponding sample in the lower resolution layer for this sample is identified in a similar manner as described in Section III-A, but with nearest sample precision instead of 1/16th sample precision. The prediction type (intra-picture, inter-picture predictive, or inter-picture bipredictive), reference picture indices, and motion vectors associated with the prediction block containing the corresponding lower resolution layer position are then assigned to the 4 x 4 enhancement layer block. Motion vectors are scaled by the resampling ratio and offset by any relative picture grid spatial offset so that they become relevant to the enhancement layer picture coordinates. Then a merging process takes place to determine the final mode and motion segmentation in the enhancement layer macroblock.
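The per-block lookup described above might be sketched as follows. Here `base_motion` is a hypothetical accessor mapping a lower resolution layer luma sample position to the (prediction type, reference index, motion vector) data of the prediction block covering it; the rounding convention shown is our assumption, chosen to match the nearest-sample variant of the Section III-A calculation.

```python
def infer_4x4_block_data(bx4, by4, d_x, d_y, r_x, r_y, base_motion):
    """For the enhancement layer 4x4 luma block with block coordinates
    (bx4, by4), locate the lower resolution layer sample corresponding
    to the block's (1,1) luma sample and return the motion data of the
    base layer prediction block covering that sample (sketch only)."""
    S = 16
    ex, ey = 4 * bx4 + 1, 4 * by4 + 1  # sample (1,1) of the 4x4 block
    # Nearest-sample precision: round to whole samples (shift by S),
    # rather than to 1/16th samples as in Section III-A.
    px = (ex * d_x + r_x + (1 << (S - 1))) >> S
    py = (ey * d_y + r_y + (1 << (S - 1))) >> S
    return base_motion(px, py)
```

In a full implementation, the motion vector returned here would then be scaled by the resampling ratio and shifted by the relative picture grid offset before the merging step.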
If all 4 x 4 luma blocks of the enhancement macroblock correspond to intra-picture coded lower resolution layer blocks, the inferred macroblock type is considered to be "I_BL," a macroblock type that is described in the following section; otherwise, motion segmentation, reference picture indices, and motion vectors then need to be inferred. (It should be noted that because the prediction mode is determined from only one position in each 4 x 4 block, it is possible that a few samples in an enhancement layer I_BL macroblock may have corresponding locations in the lower resolution layer picture that lie in inter-picture predicted regions of the lower resolution layer.)
In H.264/AVC, reference picture indexes have an 8 x 8 luma granularity. To achieve this granularity, for each 8 x 8 luma region of the enhancement layer, the reference picture index is set to the minimum of the reference picture indexes inferred from the corresponding constituent 4 x 4 blocks when performing inter-layer motion prediction [38]-[40]. When some lower resolution layer blocks are in a B-slice, the minimum is computed separately for each of the two reference picture lists, and biprediction is inferred if both lists were used in the set of 4 x 4 blocks. For 4 x 4 regions that did not use the selected reference picture index (or indices, in the case of biprediction), the motion vector is set to that of a neighboring block that did (so that some motion vector value is assigned that is relevant to the selected reference picture index).
Then the values of motion vectors are inspected to determine the final motion partitioning of the enhancement layer macroblock (4 x 4, 4 x 8, 8 x 8, 8 x 16, 16 x 8, or 16 x 16). Partitions with identical reference picture indexes and similar or identical motion vectors are merged to make the final predicted motion more coherent and reduce the complexity of the associated inter-picture prediction processing [41].

The result is predicted mode and motion data that fits with the same basic structure of ordinary single-layer H.264/AVC prediction.
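The minimum-index rule for one reference picture list can be sketched as below. The grid layout and the use of `None` to mark 4 x 4 blocks that do not use the list are our illustrative conventions.

```python
def merge_reference_indices(ref_idx_4x4):
    """Per-8x8 reference index selection (sketch): given a 4x4 grid
    (four rows of four entries) of reference picture indexes inferred
    for the sixteen 4x4 luma blocks of a macroblock for one reference
    picture list (None = list not used by that block), set each 8x8
    region's index to the minimum over its four constituent blocks."""
    out = [[None, None], [None, None]]
    for ry in range(2):        # 8x8 region row
        for rx in range(2):    # 8x8 region column
            vals = [ref_idx_4x4[2 * ry + dy][2 * rx + dx]
                    for dy in range(2) for dx in range(2)
                    if ref_idx_4x4[2 * ry + dy][2 * rx + dx] is not None]
            if vals:
                out[ry][rx] = min(vals)
    return out

grid = [[0, 1, 2, 2],
        [1, 0, 2, 3],
        [None, 1, 0, 0],
        [1, 1, 0, 0]]
print(merge_reference_indices(grid))  # [[0, 2], [1, 0]]
```

Running this once per reference picture list, and inferring biprediction when both lists yield a usable index, mirrors the B-slice handling described in the text; the subsequent partition-merging step would then combine 8 x 8 regions with identical indexes and similar motion.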

C. I_BL Macroblock Type and Inter-Layer Texture Prediction

the prediction. Finally, a deblocking filter is applied to the resulting picture.

It is important to understand that I_BL macroblocks cannot occur at arbitrary locations in the enhancement layer. Instead, the I_BL macroblock type is only available whe