`Overview of the High Efficiency Video Coding
`(HEVC) Standard
`
`Gary J. Sullivan, Fellow, IEEE, Jens-Rainer Ohm, Member, IEEE, Woo-Jin Han, Member, IEEE, and
`Thomas Wiegand, Fellow, IEEE
`
`Abstract—High Efficiency Video Coding (HEVC) is currently
`being prepared as the newest video coding standard of the
`ITU-T Video Coding Experts Group and the ISO/IEC Moving
`Picture Experts Group. The main goal of the HEVC standard-
`ization effort is to enable significantly improved compression
`performance relative to existing standards—in the range of 50%
`bit-rate reduction for equal perceptual video quality. This paper
`provides an overview of the technical features and characteristics
`of the HEVC standard.
`
`Index Terms—Advanced video coding (AVC), H.264, High
`Efficiency Video Coding (HEVC), Joint Collaborative Team
`on Video Coding (JCT-VC), Moving Picture Experts Group
`(MPEG), MPEG-4, standards, Video Coding Experts Group
`(VCEG), video compression.
`
I. Introduction

The High Efficiency Video Coding (HEVC) standard is the most recent joint video project of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) standardization organizations, working together in a partnership known as the Joint Collaborative Team on Video Coding (JCT-VC) [1]. The first edition of the HEVC standard is expected to be finalized in January 2013, resulting in an aligned text that will be published by both ITU-T and ISO/IEC. Additional work is planned to extend the standard to support several additional application scenarios, including extended-range uses with enhanced precision and color format support, scalable video coding, and 3-D/stereo/multiview video coding. In ISO/IEC, the HEVC standard will become MPEG-H Part 2 (ISO/IEC 23008-2) and in ITU-T it is likely to become ITU-T Recommendation H.265.
`Video coding standards have evolved primarily through the
`development of the well-known ITU-T and ISO/IEC standards.
`The ITU-T produced H.261 [2] and H.263 [3], ISO/IEC
`produced MPEG-1 [4] and MPEG-4 Visual [5], and the two
`organizations jointly produced the H.262/MPEG-2 Video [6]
`and H.264/MPEG-4 Advanced Video Coding (AVC) [7] stan-
`dards. The two standards that were jointly produced have had a
`particularly strong impact and have found their way into a wide
`variety of products that are increasingly prevalent in our daily
`lives. Throughout this evolution, continued efforts have been
`made to maximize compression capability and improve other
`characteristics such as data loss robustness, while considering
`the computational resources that were practical for use in prod-
`ucts at the time of anticipated deployment of each standard.
`The major video coding standard directly preceding the
`HEVC project was H.264/MPEG-4 AVC, which was initially
`developed in the period between 1999 and 2003, and then
`was extended in several important ways from 2003–2009.
`H.264/MPEG-4 AVC has been an enabling technology for dig-
`ital video in almost every area that was not previously covered
`by H.262/MPEG-2 Video and has substantially displaced the
`older standard within its existing application domains. It is
`widely used for many applications, including broadcast of high
`definition (HD) TV signals over satellite, cable, and terrestrial
`transmission systems, video content acquisition and editing
`systems, camcorders, security applications, Internet and mo-
`bile network video, Blu-ray Discs, and real-time conversa-
`tional applications such as video chat, video conferencing, and
`telepresence systems.
However, an increasing diversity of services, the growing popularity of HD video, and the emergence of beyond-
`HD formats (e.g., 4k×2k or 8k×4k resolution) are creating
`even stronger needs for coding efficiency superior to H.264/
`MPEG-4 AVC’s capabilities. The need is even stronger when
`higher resolution is accompanied by stereo or multiview
`capture and display. Moreover, the traffic caused by video
`applications targeting mobile devices and tablet PCs, as well
`as the transmission needs for video-on-demand services, are
`imposing severe challenges on today’s networks. An increased
`desire for higher quality and resolutions is also arising in
`mobile applications.
`HEVC has been designed to address essentially all existing
`applications of H.264/MPEG-4 AVC and to particularly focus
`on two key issues: increased video resolution and increased
use of parallel processing architectures. The syntax of HEVC is generic and should also be generally suited for other applications that are not specifically mentioned above.
`
Manuscript received May 25, 2012; revised August 22, 2012; accepted August 24, 2012. Date of publication October 2, 2012; date of current version January 8, 2013. This paper was recommended by Associate Editor H. Gharavi. (Corresponding author: W.-J. Han.)
G. J. Sullivan is with Microsoft Corporation, Redmond, WA 98052 USA (e-mail: garysull@microsoft.com).
J.-R. Ohm is with the Institute of Communication Engineering, RWTH Aachen University, Aachen 52056, Germany (e-mail: ohm@ient.rwth-aachen.de).
W.-J. Han is with the Department of Software Design and Management, Gachon University, Seongnam 461-701, Korea (e-mail: hurumi@gmail.com).
T. Wiegand is with the Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, Berlin 10587, Germany, and also with the Berlin Institute of Technology, Berlin 10587, Germany (e-mail: twiegand@ieee.org).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2012.2221191
`
`As has been the case for all past ITU-T and ISO/IEC video
`coding standards, in HEVC only the bitstream structure and
`syntax is standardized, as well as constraints on the bitstream
`and its mapping for the generation of decoded pictures. The
`mapping is given by defining the semantic meaning of syntax
`elements and a decoding process such that every decoder
`conforming to the standard will produce the same output
`when given a bitstream that conforms to the constraints of the
`standard. This limitation of the scope of the standard permits
`maximal freedom to optimize implementations in a manner
`appropriate to specific applications (balancing compression
`quality, implementation cost, time to market, and other con-
`siderations). However, it provides no guarantees of end-to-
`end reproduction quality, as it allows even crude encoding
`techniques to be considered conforming.
`To assist the industry community in learning how to use the
`standard, the standardization effort not only includes the de-
`velopment of a text specification document, but also reference
`software source code as an example of how HEVC video can
`be encoded and decoded. The draft reference software has been
`used as a research tool for the internal work of the committee
`during the design of the standard, and can also be used as a
`general research tool and as the basis of products. A standard
`test data suite is also being developed for testing conformance
`to the standard.
`This paper is organized as follows. Section II highlights
`some key features of the HEVC coding design. Section III
`explains the high-level syntax and the overall structure of
`HEVC coded data. The HEVC coding technology is then
`described in greater detail in Section IV. Section V explains
the profile, tier, and level design of HEVC, and Section VI discusses the history of the HEVC standardization effort. Since writing an overview of a technology as substantial as HEVC involves a significant amount of summarization, the reader is referred to [1] for any omitted details.
`
`II. HEVC Coding Design and Feature Highlights
`
`The HEVC standard is designed to achieve multiple goals,
`including coding efficiency, ease of transport system integra-
`tion and data loss resilience, as well as implementability using
`parallel processing architectures. The following subsections
`briefly describe the key elements of the design by which
`these goals are achieved, and the typical encoder operation
`that would generate a valid bitstream. More details about the
`associated syntax and the decoding process of the different
`elements are provided in Sections III and IV.
`
`A. Video Coding Layer
`The video coding layer of HEVC employs the same hy-
`brid approach (inter-/intrapicture prediction and 2-D transform
`coding) used in all video compression standards since H.261.
`Fig. 1 depicts the block diagram of a hybrid video encoder,
`which could create a bitstream conforming to the HEVC
standard.

Fig. 1. Typical HEVC video encoder (with decoder modeling elements shaded in light gray).
`
`An encoding algorithm producing an HEVC compliant
`bitstream would typically proceed as follows. Each picture
`is split into block-shaped regions, with the exact block par-
`titioning being conveyed to the decoder. The first picture
`of a video sequence (and the first picture at each clean
`random access point into a video sequence) is coded using
`only intrapicture prediction (that uses some prediction of data
`spatially from region-to-region within the same picture, but has
`no dependence on other pictures). For all remaining pictures
`of a sequence or between random access points, interpicture
`temporally predictive coding modes are typically used for
`most blocks. The encoding process for interpicture prediction
`consists of choosing motion data comprising the selected
`reference picture and motion vector (MV) to be applied for
`predicting the samples of each block. The encoder and decoder
`generate identical interpicture prediction signals by applying
`motion compensation (MC) using the MV and mode decision
`data, which are transmitted as side information.
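To make the prediction step concrete, the following Python sketch performs purely translational motion compensation with an integer-sample MV. It is a deliberate simplification: HEVC MVs have quarter-sample precision with 7-tap or 8-tap interpolation filters (see item 6 below), and the function and array names here are illustrative assumptions, not part of the standard.

```python
import numpy as np

def motion_compensate(reference, x, y, size, mv_x, mv_y):
    """Translational MC sketch with integer-sample MVs only.
    (HEVC additionally interpolates fractional positions; omitted here.)"""
    return reference[y + mv_y : y + mv_y + size,
                     x + mv_x : x + mv_x + size]

ref = np.arange(64, dtype=np.int32).reshape(8, 8)   # a tiny "reference picture"
# Predict the 4x4 block at (x=2, y=2) from a position shifted by the MV.
print(motion_compensate(ref, x=2, y=2, size=4, mv_x=-1, mv_y=1))
```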
`The residual signal of the intra- or interpicture prediction,
`which is the difference between the original block and its pre-
`diction, is transformed by a linear spatial transform. The trans-
`form coefficients are then scaled, quantized, entropy coded,
`and transmitted together with the prediction information.
`The encoder duplicates the decoder processing loop (see
`gray-shaded boxes in Fig. 1) such that both will generate
`identical predictions for subsequent data. Therefore, the quan-
`tized transform coefficients are constructed by inverse scaling
`and are then inverse transformed to duplicate the decoded
`approximation of the residual signal. The residual is then
`added to the prediction, and the result of that addition may
`then be fed into one or two loop filters to smooth out artifacts
`induced by block-wise processing and quantization. The final
`picture representation (that is a duplicate of the output of the
`decoder) is stored in a decoded picture buffer to be used for
`the prediction of subsequent pictures. In general, the order of
`encoding or decoding processing of pictures often differs from
the order in which they arrive from the source, necessitating a
`distinction between the decoding order (i.e., bitstream order)
`and the output order (i.e., display order) for a decoder.
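The reconstruction rule described above can be summarized in a few lines. In the following Python sketch, a plain uniform quantizer applied directly to the residual stands in for HEVC's transform, scaling, and quantization stages; the quantizer step and the function name are illustrative assumptions. The point is that the encoder itself performs the decoder-side inverse operations so that both sides hold identical reference pictures.

```python
import numpy as np

def encode_block(original, prediction, qstep=10.0):
    """Sketch of the encoder-side reconstruction loop: quantize the
    residual, then reconstruct it exactly as a decoder would."""
    residual = original.astype(np.int32) - prediction.astype(np.int32)
    # Stand-in for transform + quantization (HEVC uses integer DCT/DST
    # approximations; a plain uniform quantizer suffices to show the loop).
    levels = np.round(residual / qstep).astype(np.int32)
    # Decoder-side processing, duplicated in the encoder (gray boxes in Fig. 1).
    reconstructed_residual = levels * qstep
    reconstruction = np.clip(prediction + reconstructed_residual, 0, 255)
    return levels, reconstruction.astype(np.uint8)

pred = np.full((4, 4), 120, dtype=np.uint8)                 # intra/inter prediction
orig = pred + np.arange(16, dtype=np.uint8).reshape(4, 4)   # "original" samples
levels, recon = encode_block(orig, pred)
print(levels)  # what would be entropy coded
print(recon)   # what is stored in the decoded picture buffer
```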
`Video material to be encoded by HEVC is generally ex-
`pected to be input as progressive scan imagery (either due to
`the source video originating in that format or resulting from
`deinterlacing prior to encoding). No explicit coding features
`are present in the HEVC design to support the use of interlaced
`scanning, as interlaced scanning is no longer used for displays
`and is becoming substantially less common for distribution.
`However, a metadata syntax has been provided in HEVC to
`allow an encoder to indicate that interlace-scanned video has
`been sent by coding each field (i.e., the even or odd numbered
`lines of each video frame) of interlaced video as a separate
`picture or that it has been sent by coding each interlaced frame
`as an HEVC coded picture. This provides an efficient method
`of coding interlaced video without burdening decoders with a
`need to support a special decoding process for it.
The various features involved in hybrid video coding using HEVC are highlighted as follows.
`1) Coding tree units and coding tree block (CTB) structure:
`The core of the coding layer in previous standards was
`the macroblock, containing a 16×16 block of luma sam-
`ples and, in the usual case of 4:2:0 color sampling, two
`corresponding 8×8 blocks of chroma samples; whereas
`the analogous structure in HEVC is the coding tree unit
`(CTU), which has a size selected by the encoder and
`can be larger than a traditional macroblock. The CTU
`consists of a luma CTB and the corresponding chroma
`CTBs and syntax elements. The size L×L of a luma
`CTB can be chosen as L = 16, 32, or 64 samples, with
`the larger sizes typically enabling better compression.
HEVC then supports a partitioning of the CTBs into smaller blocks using a tree structure and quadtree-like signaling [8]; a sketch of such recursive splitting is given after this list.
`2) Coding units (CUs) and coding blocks (CBs): The
`quadtree syntax of the CTU specifies the size and
`positions of its luma and chroma CBs. The root of the
`quadtree is associated with the CTU. Hence, the size of
`the luma CTB is the largest supported size for a luma
`CB. The splitting of a CTU into luma and chroma CBs
`is signaled jointly. One luma CB and ordinarily two
`chroma CBs, together with associated syntax, form a
`coding unit (CU). A CTB may contain only one CU or
`may be split to form multiple CUs, and each CU has an
`associated partitioning into prediction units (PUs) and a
`tree of transform units (TUs).
`3) Prediction units and prediction blocks (PBs): The de-
`cision whether to code a picture area using interpicture
`or intrapicture prediction is made at the CU level. A
`PU partitioning structure has its root at the CU level.
`
`Depending on the basic prediction-type decision, the
`luma and chroma CBs can then be further split in size
`and predicted from luma and chroma prediction blocks
`(PBs). HEVC supports variable PB sizes from 64×64
`down to 4×4 samples.
`4) TUs and transform blocks: The prediction residual is
`coded using block transforms. A TU tree structure has
`its root at the CU level. The luma CB residual may be
`identical to the luma transform block (TB) or may be
`further split into smaller luma TBs. The same applies to
`the chroma TBs. Integer basis functions similar to those
`of a discrete cosine transform (DCT) are defined for the
`square TB sizes 4×4, 8×8, 16×16, and 32×32. For the
`4×4 transform of luma intrapicture prediction residuals,
`an integer transform derived from a form of discrete sine
`transform (DST) is alternatively specified.
`5) Motion vector signaling: Advanced motion vector pre-
`diction (AMVP) is used, including derivation of several
`most probable candidates based on data from adjacent
`PBs and the reference picture. A merge mode for MV
`coding can also be used, allowing the inheritance of
`MVs from temporally or spatially neighboring PBs.
`Moreover, compared to H.264/MPEG-4 AVC, improved
`skipped and direct motion inference are also specified.
`6) Motion compensation: Quarter-sample precision is used
`for the MVs, and 7-tap or 8-tap filters are used for
`interpolation of fractional-sample positions (compared
`to six-tap filtering of half-sample positions followed
`by linear interpolation for quarter-sample positions in
`H.264/MPEG-4 AVC). Similar to H.264/MPEG-4 AVC,
`multiple reference pictures are used. For each PB, either
`one or two motion vectors can be transmitted, resulting
`either in unipredictive or bipredictive coding, respec-
`tively. As in H.264/MPEG-4 AVC, a scaling and offset
`operation may be applied to the prediction signal(s) in
`a manner known as weighted prediction.
`7) Intrapicture prediction: The decoded boundary samples
`of adjacent blocks are used as reference data for spa-
`tial prediction in regions where interpicture prediction
`is not performed. Intrapicture prediction supports 33
`directional modes (compared to eight such modes in
`H.264/MPEG-4 AVC), plus planar (surface fitting) and
`DC (flat) prediction modes. The selected intrapicture
`prediction modes are encoded by deriving most probable
`modes (e.g., prediction directions) based on those of
`previously decoded neighboring PBs.
`8) Quantization control: As in H.264/MPEG-4 AVC, uni-
form reconstruction quantization (URQ) is used in HEVC, with quantization scaling matrices supported for the various transform block sizes.
`9) Entropy coding: Context adaptive binary arithmetic cod-
`ing (CABAC) is used for entropy coding. This is sim-
ilar to the CABAC scheme in H.264/MPEG-4 AVC, but has undergone several improvements to increase its throughput speed (especially for parallel-processing architectures) and its compression performance, and to reduce its context memory requirements.
`10) In-loop deblocking filtering: A deblocking filter similar
`to the one used in H.264/MPEG-4 AVC is operated
`within the interpicture prediction loop. However, the
`design is simplified in regard to its decision-making and
`filtering processes, and is made more friendly to parallel
`processing.
`11) Sample adaptive offset (SAO): A nonlinear amplitude
`mapping is introduced within the interpicture prediction
`loop after the deblocking filter. Its goal is to better
`reconstruct the original signal amplitudes by using a
look-up table that is described by a few additional
`parameters that can be determined by histogram analysis
`at the encoder side.
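As referenced in item 1 above, the following Python sketch illustrates the recursive quadtree splitting of a 64×64 luma CTB into smaller coding blocks. The variance threshold is a hypothetical stand-in for a real encoder's rate-distortion-based split decision; only the one-split-flag-per-node recursion mirrors the standard's signaling structure.

```python
import numpy as np

def split_ctb(block, x, y, size, min_size=8, threshold=20.0):
    """Recursive quadtree partitioning sketch. HEVC signals one split
    flag per node; a variance heuristic stands in for the encoder's
    rate-distortion decision here."""
    region = block[y:y + size, x:x + size]
    if size <= min_size or np.var(region) < threshold:
        return [(x, y, size)]  # leaf node: one coding block
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += split_ctb(block, x + dx, y + dy, half,
                                min_size, threshold)
    return leaves

rng = np.random.default_rng(0)
ctb = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
ctb[:32, :32] = 128.0  # a flat quadrant that should remain unsplit
leaves = split_ctb(ctb, 0, 0, 64)
print(len(leaves), leaves[:3])  # flat quadrant appears as one 32x32 leaf
```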
`
`B. High-Level Syntax Architecture
`A number of design aspects new to the HEVC standard
`improve flexibility for operation over a variety of applications
`and network environments and improve robustness to data
`losses. However, the high-level syntax architecture used in
`the H.264/MPEG-4 AVC standard has generally been retained,
`including the following features.
`1) Parameter set structure: Parameter sets contain informa-
`tion that can be shared for the decoding of several re-
`gions of the decoded video. The parameter set structure
`provides a robust mechanism for conveying data that are
`essential to the decoding process. The concepts of se-
`quence and picture parameter sets from H.264/MPEG-4
`AVC are augmented by a new video parameter set (VPS)
`structure.
`
`2) NAL unit syntax structure: Each syntax structure is
`placed into a logical data packet called a network
`abstraction layer (NAL) unit. Using the content of a two-
`byte NAL unit header, it is possible to readily identify
`the purpose of the associated payload data.
`3) Slices: A slice is a data structure that can be decoded
`independently from other slices of the same picture, in
`terms of entropy coding, signal prediction, and residual
`signal reconstruction. A slice can either be an entire
`picture or a region of a picture. One of the main
`purposes of slices is resynchronization in the event of
`data losses. In the case of packetized transmission, the
`maximum number of payload bits within a slice is
`typically restricted, and the number of CTUs in the slice
is often varied to minimize the packetization overhead while keeping the size of each packet within this bound; a greedy sketch of this slice sizing appears after this list.
`4) Supplemental enhancement information (SEI) and video
`usability information (VUI) metadata: The syntax in-
`cludes support for various types of metadata known as
`SEI and VUI. Such data provide information about the
`timing of the video pictures, the proper interpretation of
`the color space used in the video signal, 3-D stereoscopic
`frame packing information, other display hint informa-
`tion, and so on.
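The slice-sizing behavior mentioned in item 3 can be viewed as a greedy packing problem. In the following sketch, the per-CTU bit costs and the payload budget are hypothetical values; real encoders would additionally account for slice headers and entropy-coder state.

```python
def pack_ctus_into_slices(ctu_bit_costs, max_payload_bits):
    """Greedy sketch of slice sizing: accumulate CTUs in raster order
    until the next one would overflow the packet payload budget."""
    slices, current, used = [], [], 0
    for index, cost in enumerate(ctu_bit_costs):
        if current and used + cost > max_payload_bits:
            slices.append(current)  # close the slice, start a new one
            current, used = [], 0
        current.append(index)
        used += cost
    if current:
        slices.append(current)
    return slices

costs = [900, 1200, 400, 2500, 800, 700, 3000, 200]  # hypothetical bits per CTU
print(pack_ctus_into_slices(costs, max_payload_bits=4000))
# -> [[0, 1, 2], [3, 4, 5], [6, 7]]
```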
`
`C. Parallel Decoding Syntax and Modified Slice Structuring
`Finally, four new features are introduced in the HEVC stan-
`dard to enhance the parallel processing capability or modify
`the structuring of slice data for packetization purposes. Each
`of them may have benefits in particular application contexts,
`and it is generally up to the implementer of an encoder or
`decoder to determine whether and how to take advantage of
`these features.
`1) Tiles: The option to partition a picture into rectangular
`regions called tiles has been specified. The main pur-
`pose of tiles is to increase the capability for parallel
`processing rather than provide error resilience. Tiles are
`independently decodable regions of a picture that are
`encoded with some shared header information. Tiles can
`additionally be used for the purpose of spatial random
`access to local regions of video pictures. A typical
`tile configuration of a picture consists of segmenting
`the picture into rectangular regions with approximately
`equal numbers of CTUs in each tile. Tiles provide
`parallelism at a more coarse level of granularity (pic-
`ture/subpicture), and no sophisticated synchronization of
`threads is necessary for their use.
`2) Wavefront parallel processing: When wavefront parallel
`processing (WPP) is enabled, a slice is divided into
`rows of CTUs. The first row is processed in an ordinary
`way, the second row can begin to be processed after
`only two CTUs have been processed in the first row,
`the third row can begin to be processed after only
`two CTUs have been processed in the second row,
`and so on. The context models of the entropy coder
`in each row are inferred from those in the preceding
`row with a two-CTU processing lag. WPP provides a
`form of processing parallelism at a rather fine level of
`granularity, i.e., within a slice. WPP may often provide
better compression performance than tiles (and avoid some visual artifacts that may be induced by using tiles); a sketch of the resulting dependency schedule follows this list.
`3) Dependent slice segments: A structure called a de-
`pendent slice segment allows data associated with a
`particular wavefront entry point or tile to be carried in
`a separate NAL unit, and thus potentially makes that
`data available to a system for fragmented packetization
`with lower latency than if it were all coded together in
`one slice. A dependent slice segment for a wavefront
`entry point can only be decoded after at least part of
`the decoding process of another slice segment has been
`performed. Dependent slice segments are mainly useful
`in low-delay encoding, where other parallel tools might
`penalize compression performance.
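The two-CTU lag of WPP (item 2) induces a simple dependency schedule across CTU rows. The following sketch computes the earliest processing step of each CTU under idealized assumptions (one CTU per step, as many threads as rows); it illustrates only the wavefront shape, not an actual decoder implementation.

```python
def wpp_wavefront(ctu_cols, ctu_rows, lag=2):
    """Earliest processing step per CTU under WPP: a CTU needs its left
    neighbor done, plus the first `lag` CTUs of the row above (entropy
    contexts are inherited with a two-CTU lag)."""
    step = [[0] * ctu_cols for _ in range(ctu_rows)]
    for r in range(ctu_rows):
        for c in range(ctu_cols):
            deps = []
            if c > 0:
                deps.append(step[r][c - 1])
            if r > 0:
                deps.append(step[r - 1][min(c + lag - 1, ctu_cols - 1)])
            step[r][c] = 1 + max(deps, default=-1)
    return step

for row in wpp_wavefront(ctu_cols=6, ctu_rows=4):
    print(row)
# Each row can start two steps after the row above it, as described.
```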
`In the following two sections, a more detailed description
`of the key features is given.
`
`III. High-Level Syntax
`
`The high-level syntax of HEVC contains numerous elements
`that have been inherited from the NAL of H.264/MPEG-4
`AVC. The NAL provides the ability to map the video coding
`layer (VCL) data that represent the content of the pictures
`onto various transport layers, including RTP/IP, ISO MP4,
`and H.222.0/MPEG-2 Systems, and provides a framework
`for packet loss resilience. For general concepts of the NAL
`design such as NAL units, parameter sets, access units, the
`byte stream format, and packetized formatting, please refer
`to [9]–[11].
`NAL units are classified into VCL and non-VCL NAL
`units according to whether they contain coded pictures or
`other associated data, respectively. In the HEVC standard,
`several VCL NAL unit types identifying categories of pictures
`for decoder initialization and random-access purposes are
`included. Table I lists the NAL unit types and their associated
meanings and type classes in the HEVC standard.

TABLE I
NAL Unit Types, Meanings, and Type Classes

Type    Meaning                                       Class
0, 1    Slice segment of ordinary trailing picture    VCL
2, 3    Slice segment of TSA picture                  VCL
4, 5    Slice segment of STSA picture                 VCL
6, 7    Slice segment of RADL picture                 VCL
8, 9    Slice segment of RASL picture                 VCL
10–15   Reserved for future use                       VCL
16–18   Slice segment of BLA picture                  VCL
19, 20  Slice segment of IDR picture                  VCL
21      Slice segment of CRA picture                  VCL
22–31   Reserved for future use                       VCL
32      Video parameter set (VPS)                     non-VCL
33      Sequence parameter set (SPS)                  non-VCL
34      Picture parameter set (PPS)                   non-VCL
35      Access unit delimiter                         non-VCL
36      End of sequence                               non-VCL
37      End of bitstream                              non-VCL
38      Filler data                                   non-VCL
39, 40  SEI messages                                  non-VCL
41–47   Reserved for future use                       non-VCL
48–63   Unspecified (available for system use)        non-VCL
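As an illustration of the two-byte NAL unit header mentioned in Section II-B, the following minimal Python sketch extracts the header fields defined in the HEVC syntax (forbidden_zero_bit, nal_unit_type, nuh_layer_id, nuh_temporal_id_plus1) and classifies the type using the ranges of Table I. The function name is an assumption for illustration.

```python
def parse_nal_header(byte0, byte1):
    """Parse the two-byte HEVC NAL unit header."""
    forbidden_zero_bit = byte0 >> 7
    nal_unit_type = (byte0 >> 1) & 0x3F           # 6 bits; see Table I
    nuh_layer_id = ((byte0 & 0x01) << 5) | (byte1 >> 3)
    temporal_id = (byte1 & 0x07) - 1              # nuh_temporal_id_plus1 - 1
    is_vcl = nal_unit_type <= 31                  # types 0-31 carry slice data
    return nal_unit_type, nuh_layer_id, temporal_id, is_vcl

# Example: 0x40 0x01 is the header of a VPS NAL unit (type 32, non-VCL).
print(parse_nal_header(0x40, 0x01))  # -> (32, 0, 0, False)
```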
`The following subsections present a description of the new
`capabilities supported by the high-level syntax.
`
`A. Random Access and Bitstream Splicing Features
`The new design supports special features to enable random
`access and bitstream splicing. In H.264/MPEG-4 AVC, a
`bitstream must always start with an IDR access unit. An
`IDR access unit contains an independently coded picture—
`i.e., a coded picture that can be decoded without decoding
`any previous pictures in the NAL unit stream. The presence
`of an IDR access unit indicates that no subsequent picture
`in the bitstream will require reference to pictures prior to the
`picture that it contains in order to be decoded. The IDR picture
`is used within a coding structure known as a closed GOP (in
`which GOP stands for group of pictures).
`The new clean random access (CRA) picture syntax speci-
`fies the use of an independently coded picture at the location
`of a random access point (RAP), i.e., a location in a bitstream
`at which a decoder can begin successfully decoding pictures
`without needing to decode any pictures that appeared earlier
in the bitstream, which supports an efficient temporal coding order known as open GOP operation. Good support of random access is critical for enabling channel switching, seek operations, and dynamic streaming services. Some pictures that follow a CRA picture in decoding order and precede it in display
`order may contain interpicture prediction references to pictures
`that are not available at the decoder. These nondecodable
`pictures must therefore be discarded by a decoder that starts
`its decoding process at a CRA point. For this purpose, such
`nondecodable pictures are identified as random access skipped
`leading (RASL) pictures. The location of splice points from
`different original coded bitstreams can be indicated by broken
`link access (BLA) pictures. A bitstream splicing operation
`can be performed by simply changing the NAL unit type of
`a CRA picture in one bitstream to the value that indicates
`a BLA picture and concatenating the new bitstream at the
`position of a RAP picture in the other bitstream. A RAP
`picture may be an IDR, CRA, or BLA picture, and both
`CRA and BLA pictures may be followed by RASL pictures
`in the bitstream (depending on the particular value of the
`NAL unit type used for a BLA picture). Any RASL pictures
`associated with a BLA picture must always be discarded by
`the decoder, as they may contain references to pictures that
`are not actually present in the bitstream due to a splicing
`operation. The other type of picture that can follow a RAP
`picture in decoding order and precede it in output order is
`the random access decodable leading (RADL) picture, which
`cannot contain references to any pictures that precede the
`RAP picture in decoding order. RASL and RADL pictures
`are collectively referred to as leading pictures (LPs). Pictures
`that follow a RAP picture in both decoding order and output
`order, which are known as trailing pictures, cannot contain
`references to LPs for interpicture prediction.
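The discarding rules for leading pictures described above can be condensed into a small decision function. The following sketch is a simplified behavioral model; the function and argument names are illustrative and are not syntax elements of the standard.

```python
def is_decodable(picture_type, decoding_started_at_rap, rap_is_bla):
    """Leading-picture handling sketch: RASL pictures are discarded when
    decoding starts at their associated RAP, or always when that RAP is
    a BLA (splice point); RADL and trailing pictures never reference
    data from before their RAP, so they remain decodable."""
    if picture_type == "RASL":
        return not (decoding_started_at_rap or rap_is_bla)
    return True  # "RADL" and trailing pictures

# Tuning in mid-stream at a CRA: its RASL pictures must be skipped.
for p in ("RADL", "RASL", "TRAIL"):
    print(p, is_decodable(p, decoding_started_at_rap=True, rap_is_bla=False))
```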
`
`B. Temporal Sublayering Support
`Similar to the temporal scalability feature in the H.264/
`MPEG-4 AVC scalable video coding (SVC) extension [12],
HEVC specifies a temporal identifier in the NAL unit header, which indicates a level in a hierarchical temporal prediction structure. This was introduced to achieve temporal scalability without the need to parse parts of the bitstream other than the NAL unit header.

Under certain circumstances, the number of decoded temporal sublayers can be adjusted during the decoding process of one coded video sequence. The location of a point in the bitstream at which sublayer switching is possible to begin decoding some higher temporal layers can be indicated by the presence of temporal sublayer access (TSA) pictures and stepwise TSA (STSA) pictures. At the location of a TSA picture, it is possible to switch from decoding a lower temporal sublayer to decoding any higher temporal sublayer, and at the location of an STSA picture, it is possible to switch from decoding a lower temporal sublayer to decoding only one particular higher temporal sublayer (but not the further layers above that, unless they also contain STSA or TSA pictures).

C. Additional Parameter Sets

The VPS has been added as metadata to describe the overall characteristics of coded video sequences, including the dependences between temporal sublayers. The primary purpose of this is to enable the compatible extensibility of the standard in terms of signaling at the systems layer, e.g., when the base layer of a future extended scalable or multiview bitstream would need to be decodable by a legacy decoder, but for which additional information about the bitstream structure that is only relevant for the advanced decoder would be ignored.

D. Reference Picture Sets and Reference Picture Lists

For multiple-reference picture management, a particular set of previously decoded pictures needs to be present in the decoded picture buffer (DPB) for the decoding of the remainder of the pictures in the bitstream.

Fig. 2. Example of a temporal prediction structure and the POC values, decoding order, and RPS content for each picture.

For interpicture prediction, the pictures retained in the DPB are organized into two lists, known as reference picture list 0 and list 1. An index called a reference picture index is used to identify a particular picture in one of these lists. For uniprediction, a picture can be selected from either of these lists. For biprediction, two pictures are selected, one from each list. When a list contains only one picture, the reference picture index implicitly has the value 0 and does not need to be transmitted in the bitstream.

The high-level syntax for identifying the RPS and establishing the reference picture lists for interpicture prediction is more robust to data losses than in the prior H.264/MPEG-4 AVC design, and is more amenable to such operations as random access and trick mode operation (e.g., fast-forward, smooth rewind, seeking, and adaptive bitstream switching). A key aspect of this improvement is that the syntax is more explicit, rather than depending on inferences from the stored internal state of the decoding process as it decodes the bitstream picture by picture. Moreover, the associated syntax for these aspects of the design is actually simpler than it had been for H.264/MPEG-4 AVC.
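To illustrate how the two lists are typically used, the following sketch builds nearest-first candidate lists from the picture order counts (POCs) of the DPB contents, then averages one prediction block from each list for biprediction. The ordering heuristic mimics common default behavior for illustration only; it is not the normative HEVC list-construction process.

```python
import numpy as np

def build_reference_lists(dpb_pocs, current_poc):
    """Hypothetical default ordering: list 0 prefers past pictures
    (closest first), list 1 prefers future pictures."""
    past = sorted((p for p in dpb_pocs if p < current_poc),
                  key=lambda p: current_poc - p)
    future = sorted((p for p in dpb_pocs if p > current_poc),
                    key=lambda p: p - current_poc)
    return past + future, future + past  # list 0, list 1

list0, list1 = build_reference_lists(dpb_pocs=[0, 2, 4, 8], current_poc=3)
print(list0)  # [2, 0, 4, 8]
print(list1)  # [4, 8, 2, 0]

# Biprediction combines one prediction block fetched from each list.
pred0 = np.full((4, 4), 100)     # block obtained via MV into list0[ref_idx0]
pred1 = np.full((4, 4), 110)     # block obtained via MV into list1[ref_idx1]
print((pred0 + pred1 + 1) // 2)  # rounded average, as in typical codecs
```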
`
`IV. HEVC Video Coding Techniques
`
`As in all prior ITU-T and ISO/IEC JTC 1 video coding
`standards since H.261 [2],
`the HEVC design follows the
`classic block-based hybrid video coding approach (as depicted
`in Fig. 1). The basic source-coding algorithm is a hybrid
`of interpicture prediction to exploit temporal statistical de-
`pendences, intrapicture prediction to exploit spatial statistical
`dependences, and transform coding of the prediction residual
`signals to further exploit spatial statistical dependences. There
`is no single coding element in the HEVC design that provides
`the majority of its significant improvement in compression
`efficiency in relation to prior video coding standards. It is,
`rather, a plurality of smaller improvements that add up to the
`significant gain.
`
`A. Sampled Representation of Pictures
`For representing color video signals, HEVC typically uses a
`tristimulus YCbCr color space with 4:2:0 sampling (although
`extension to other sampling formats is straightforward, and is
`planned to be defined in a subsequent version). This separates
`a color representation into three components called Y, Cb,
`and Cr. The Y component is also called luma, and represents
`brightness. The two chroma components Cb and Cr represent
`the extent to which the color deviates from gray toward blue
`and red, respectively. Because the human visual system is more
`sensitive to luma than chroma, the 4:2:0 sampling structure
`is typically used, in which each chroma component has one
`fourth of the number of samples of the luma component (half
`the number of samples in both the horizontal and vertical
`dimensions). Each sample for each component is typically
`represented with 8 or 10 b of precision, and the 8-b case is the
`more typical one. In the remainder of this paper, we focus our
`attention on the typical use: YCbCr components with 4:2:0
`sampling and 8 b per sample for the representation of the
`encoded input and decoded output video signal.
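The sample-count arithmetic implied by 4:2:0 sampling is easy to verify directly. The following sketch computes the plane sizes and raw buffer size of a picture for 8-bit or 10-bit samples; the function name is an illustrative assumption.

```python
def picture_buffer_sizes(width, height, bit_depth=8):
    """Sample counts for a 4:2:0 picture: each chroma plane has half the
    luma resolution both horizontally and vertically, i.e., one quarter
    of the luma sample count per chroma component."""
    luma = width * height
    chroma = (width // 2) * (height // 2)     # per component (Cb or Cr)
    bytes_per_sample = 1 if bit_depth <= 8 else 2
    total_bytes = (luma + 2 * chroma) * bytes_per_sample
    return luma, chroma, total_bytes

print(picture_buffer_sizes(1920, 1080))  # -> (2073600, 518400, 3110400)
```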
`The video pictures are typically progressively sampled with
`rectangular picture sizes W×H, where W is the width and
`
`