`
`849
`
`H.263+: Video Coding at Low Bit Rates
`
Guy Côté, Student Member, IEEE, Berna Erol, Michael Gallant, Student Member, IEEE,
`and Faouzi Kossentini, Member, IEEE
`
`Abstract—In this tutorial paper, we discuss the ITU-T H.263+
`(or H.263 Version 2) low-bit-rate video coding standard. We
`first describe, briefly, the H.263 standard including its optional
`modes. We then address the 12 new negotiable modes of
`H.263+. Next, we present experimental results for these modes,
`based on our public-domain implementation (see our Web
`site at http://spmg.ece.ubc.ca). Tradeoffs among compression
`performance, complexity, and memory requirements for the
`H.263+ optional modes are discussed. Finally, results for mode
`combinations are presented.
`
`Index Terms— H.263, H.263+, video compression standards,
`video compression and coding, video conferencing, video tele-
`phony.
`
`I. INTRODUCTION
`
In the past few years, there has been significant interest
in digital video applications. Consequently, academia and
`industry have worked toward developing video compression
`techniques [1]–[5], and several successful standards have
`emerged, e.g., ITU-T H.261, H.263, ISO/IEC MPEG-1, and
`MPEG-2. These standards address a wide range of applications
`having different requirements in terms of bit rate, picture
`quality, complexity, error resilience, and delay.
`While the demand for digital video communication ap-
`plications such as videoconferencing, video e-mailing, and
`video telephony has increased considerably, transmission rates
`over public switched telephone networks (PSTN) and wireless
`networks are still very limited. This requires compression
`performance and channel error robustness levels that cannot be
`achieved by previous block-based video coding standards such
`as H.261. Version 1 of the international standard ITU-T H.263,
`entitled “Video Coding for Low Bit Rate Communications”
[6], addresses the above requirements and, as a result, has become
`the new low-bit-rate video coding standard.
`Although its coding structure is based on that of H.261,
`H.263 provides better picture quality at low bit rates with little
`additional complexity. It also includes four optional modes
`aimed at improving compression performance. H.263 has been
`adopted in several videophone terminal standards, notably
`ITU-T H.324 (PSTN), H.320 (ISDN), and H.310 (B-ISDN).
`
`Manuscript received October 26, 1997; revised April 24, 1998. This work
`was supported by the Natural Sciences and Engineering Research Council
`of Canada and by AVT Audio Visual Telecommunications Corporation. This
`paper was recommended by Associate Editor M.-T. Sun.
`The authors are with the Department of Electrical and Computer Engineer-
`ing, University of British Columbia, Vancouver, B.C., V6T 1Z4, Canada.
`Publisher Item Identifier S 1051-8215(98)06325-3.
`
`H.263 Version 2, also known as H.263+ in the standards
`community, was officially approved as a standard in January
`1998 [7]. H.263+ is an extension of H.263, providing 12
`new negotiable modes and additional features. These modes
`and features improve compression performance, allow the use
`of scalable bit streams, enhance performance over packet-
`switched networks, support custom picture size and clock
`frequency, and provide supplemental display and external
`usage capabilities.
`
`II. THE ITU-T H.263 STANDARD
`The H.263 video standard is based on techniques common
`to many current video coding standards. In this section, we
`describe the source coding framework of H.263.
`
`A. Baseline H.263 Video Coding
`Fig. 1 shows a block diagram of an H.263 baseline encoder.
`Motion-compensated prediction first reduces temporal redun-
`dancies. Discrete cosine transform (DCT)-based algorithms are
`then used for encoding the motion-compensated prediction
`difference frames. The quantized DCT coefficients, motion
`vectors, and side information are entropy coded using variable-
`length codes (VLC’s).
`1) Video Frame Structure: H.263 supports five standard-
`ized picture formats: sub-QCIF, QCIF, CIF, 4CIF, and 16CIF.
`The luminance component of the picture is sampled at these
resolutions, while the chrominance components, Cb and Cr,
are downsampled by two in both the horizontal and vertical
directions. The picture structure is shown in Fig. 2 for the
QCIF resolution. Each picture in the input video sequence is
divided into macroblocks, consisting of four luminance blocks
of 8 pixels × 8 lines followed by one Cb block and one Cr
block, each consisting of 8 pixels × 8 lines. A group of blocks
`(GOB) is defined as an integer number of macroblock rows, a
`number that is dependent on picture resolution. For example, a
`GOB consists of a single macroblock row at QCIF resolution.
`2) Video Coding Tools: H.263 supports interpicture predic-
`tion that is based on motion estimation and compensation. The
`coding mode where temporal prediction is used is called an
`inter mode. In this mode, only the prediction error frames—the
`difference between original frames and motion-compensated
`predicted frames—need be encoded. If temporal prediction is
`not employed, the corresponding coding mode is called an
`intra mode.
Fig. 1. H.263 video encoder block diagram.

Fig. 2. H.263 picture structure at QCIF resolution.

Fig. 3. H.263 source coding algorithm: motion compensation.

a) Motion estimation and compensation: Motion-compensated
prediction assumes that the pixels within the
`
`1051–8215/98$10.00 ª
`
`1998 IEEE
`
`Realtime Adaptive Streaming LLC
`Exhibit 2011
`IPR2019-01035
`Page 1
`
`
`
`850
`
`IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 8, NO. 7, NOVEMBER 1998
`
current picture can be modeled as a translation of those
within a previous picture, as shown in Fig. 3. In baseline
H.263, each macroblock is predicted from the previous
frame. This implies an assumption that each pixel within
the macroblock undergoes the same amount of translational
motion. This motion information is represented by two-
dimensional displacement vectors or motion vectors. Due
to the block-based picture representation, many motion
estimation algorithms employ block-matching techniques,
where the motion vector $(u,v)$ is obtained by minimizing a
cost function measuring the mismatch between a candidate
macroblock and the current macroblock. Although several cost
measures have been introduced, the most widely used one is
the sum-of-absolute-differences (SAD) defined by

$$\mathrm{SAD}(u,v) = \sum_{i=1}^{16} \sum_{j=1}^{16} \left| A(i,j) - B(i+u,\, j+v) \right|$$

where $A(i,j)$ represents the $(i,j)$th pixel of a 16 × 16
macroblock from the current picture at the spatial location
$(x,y)$, and $B(i+u, j+v)$ represents the $(i,j)$th pixel of a
candidate macroblock from a reference picture at the spatial
location $(x,y)$ displaced by the vector $(u,v)$. To find the
macroblock producing the minimum mismatch error, we need
to calculate the SAD at several locations within a search
window. The simplest, but the most compute-intensive search
`
`
`method, known as the full search or exhaustive search method,
`evaluates the SAD at every possible pixel location in the
`search area. To lower the computational complexity, several
`algorithms that restrict the search to a few points have been
`proposed [8]. In baseline H.263, one motion vector per mac-
`roblock is allowed for motion compensation. Both horizontal
`and vertical components of the motion vectors may be of half
pixel accuracy, but their values may lie only in the [−16, 15.5]
`range, limiting the search window used in motion estimation.
`A positive value of the horizontal or vertical component of the
`motion vector represents a macroblock spatially to the right or
`below the macroblock being predicted, respectively.
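For illustration, the following sketch shows a full-search block-matching routine (a hypothetical example using NumPy, not taken from our implementation); the half-pixel refinement and the [−16, 15.5] range restriction of baseline H.263 are omitted for brevity:

import numpy as np

def sad(current_mb, candidate_mb):
    # Sum of absolute differences between two 16x16 blocks.
    return int(np.abs(current_mb.astype(np.int32) - candidate_mb.astype(np.int32)).sum())

def full_search(current, reference, mb_x, mb_y, search_range=15):
    # Exhaustive integer-pel block matching for the 16x16 macroblock whose
    # top-left corner is (mb_x, mb_y) in the current picture. Returns the
    # displacement (u, v) minimizing the SAD over the search window.
    height, width = reference.shape
    current_mb = current[mb_y:mb_y + 16, mb_x:mb_x + 16]
    best_cost, best_mv = None, (0, 0)
    for v in range(-search_range, search_range + 1):
        for u in range(-search_range, search_range + 1):
            x, y = mb_x + u, mb_y + v
            if x < 0 or y < 0 or x + 16 > width or y + 16 > height:
                continue  # candidate must lie inside the reference picture
            cost = sad(current_mb, reference[y:y + 16, x:x + 16])
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (u, v)
    return best_mv, best_cost

# Toy example: a QCIF-sized luminance frame shifted globally by (u, v) = (3, -2).
rng = np.random.default_rng(0)
previous = rng.integers(0, 256, size=(144, 176), dtype=np.uint8)
current = np.roll(previous, shift=(2, -3), axis=(0, 1))
print(full_search(current, previous, mb_x=48, mb_y=48))   # ((3, -2), 0)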
b) Transform: The purpose of the 8 × 8 DCT specified
by H.263 is to decorrelate the 8 × 8 blocks of original pixels
or motion-compensated difference pixels, and to compact
their energy into as few coefficients as possible. Besides its
relatively high decorrelation and energy compaction capa-
bilities, the 8 × 8 DCT is simple, efficient, and amenable
to software and hardware implementations [9]. The most
common algorithm for implementing the 8 × 8 DCT is that
which consists of eight-point DCT transformation of the rows
and the columns, respectively. The 8 × 8 DCT is defined by

$$F(u,v) = \frac{C(u)\,C(v)}{4} \sum_{i=0}^{7} \sum_{j=0}^{7} f(i,j) \cos\frac{(2i+1)u\pi}{16} \cos\frac{(2j+1)v\pi}{16}$$

where $C(u), C(v) = 1/\sqrt{2}$ for $u, v = 0$, and $C(u), C(v) = 1$
otherwise. Here, $f(i,j)$ denotes the $(i,j)$th pixel of the 8 × 8
original block, and $F(u,v)$ denotes the coefficients of the 8 × 8
DCT transformed block. The original 8 × 8 block of pixels can be
recovered using an 8 × 8 inverse DCT (IDCT) given by

$$f(i,j) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u)\,C(v)\,F(u,v) \cos\frac{(2i+1)u\pi}{16} \cos\frac{(2j+1)v\pi}{16}$$
`
`Although exact reconstruction can be theoretically achieved,
`it is often not possible using finite-precision arithmetic. While
`forward DCT errors can be tolerated, inverse DCT errors must
remain within the accuracy requirements specified by the H.263 standard if compliance is to be achieved.
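For illustration, the two equations above can be evaluated with a separable, floating-point implementation; the following sketch (our own illustrative code, not the integer-exact transforms used in practice) verifies that the IDCT recovers the original block:

import numpy as np

def dct_basis():
    # 8-point DCT basis matrix T, so that F = T f T^T for an 8x8 block f.
    T = np.zeros((8, 8))
    for u in range(8):
        c = np.sqrt(0.5) if u == 0 else 1.0
        for i in range(8):
            T[u, i] = 0.5 * c * np.cos((2 * i + 1) * u * np.pi / 16)
    return T

T = dct_basis()

def dct2(block):
    # Forward 8x8 DCT: eight-point transform of the rows, then the columns.
    return T @ block @ T.T

def idct2(coefficients):
    # Inverse 8x8 DCT; T is orthonormal, so its transpose is its inverse.
    return T.T @ coefficients @ T

block = np.arange(64, dtype=float).reshape(8, 8)     # a toy 8x8 block of pixels
coefficients = dct2(block)
print(np.allclose(idct2(coefficients), block))        # True, up to rounding error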
`c) Quantization: The human viewer is more sensitive to
`reconstruction errors related to low spatial frequencies than
`those related to high frequencies [10]. Slow linear changes in
`intensity or color (low-frequency information) are important
`to the eye. Quick, high-frequency changes can often not be
`seen, and may be discarded. For every element position in
`the DCT output matrix, a corresponding quantization value is
`computed using the equation
`
$$Q(u,v) = \frac{COF(u,v)}{2 \cdot QUANT}$$

where $COF(u,v)$ is the $(u,v)$th DCT coefficient, $QUANT$ is the
quantizer parameter, and $Q(u,v)$ is the $(u,v)$th quantization
value. The resulting real numbers are then rounded to their
nearest integer values. The net effect is usually a reduced
variance between quantized coefficients as compared to the
variance between the original DCT coefficients, as well as a
reduction of the number of nonzero coefficients.

Fig. 4. Zigzag scan pattern to reorder DCT coefficients from low to high
frequencies.
`In H.263, quantization is performed using the same step
`size within a macroblock (i.e., using a uniform quantization
`matrix). Even quantization levels in the range from 2 to 62 are
`allowed, except for the first coefficient (DC coefficient) of an
`intra block, which is uniformly quantized using a step size of
`eight. The quantizers consist of equally spaced reconstruction
`levels with a dead zone centered at zero. After the quantization
`process, the reconstructed picture is stored so that it can be
`later used for prediction of the future picture.
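As a rough illustration of this step, the sketch below assumes simple truncation toward zero, which produces the dead zone; the exact rounding used by a particular encoder may differ:

import numpy as np

def quantize(coefficients, qp, intra=False):
    # Uniform quantization with step size 2*QP (QP in 1..31) and a dead zone
    # around zero obtained by truncation toward zero. The intra DC coefficient
    # uses a fixed step size of 8. Rounding details vary between encoders.
    levels = np.fix(coefficients / (2 * qp)).astype(int)
    if intra:
        levels[0, 0] = int(round(coefficients[0, 0] / 8.0))
    return levels

def dequantize(levels, qp, intra=False):
    # Reconstruct each coefficient near the midpoint of its quantizer interval.
    rec = np.where(levels == 0, 0, 2 * qp * levels + qp * np.sign(levels)).astype(float)
    if intra:
        rec[0, 0] = 8.0 * levels[0, 0]
    return rec

coefficients = np.zeros((8, 8))
coefficients[0, 0] = 345.0                            # large DC value
coefficients[0, 1], coefficients[1, 0] = -60.0, 14.0
levels = quantize(coefficients, qp=10, intra=True)
print(levels[0, :2], levels[1, 0])                    # [43 -3] 0 (14 falls in the dead zone)
print(dequantize(levels, qp=10, intra=True)[0, :2])   # [344. -70.]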
`d) Entropy coding: Entropy coding is performed by
`means of variable-length codes (VLC’s). Motion vectors
`are first predicted by setting their component’s values to
`median values of those of neighboring motion vectors already
`transmitted: the motion vectors of the macroblocks to the
`left, above, and above right of the current macroblock. The
`difference motion vectors are then VLC coded.
`Prior to entropy coding, the quantized DCT coefficients are
`arranged into a one-dimensional array by scanning them in
`zigzag order. This rearrangement places the DC coefficient
`first
`in the array, and the remaining AC coefficients are
`ordered from low to high frequency. This scan pattern is
`illustrated in Fig. 4. The rearranged array is coded using a
`three-dimensional run-length VLC table, representing the triple
`(LAST, RUN, LEVEL). The symbol RUN is defined as the
`distance between two nonzero coefficients in the array. The
`symbol LEVEL is the nonzero value immediately following a
`sequence of zeros. The symbol LAST replaces the H.261 end-
of-block flag, where “LAST = 1” means that the current code
corresponds to the last coefficient in the coded block. This
coding method produces a compact representation of the 8 ×
8 DCT coefficients, as a large number of the coefficients are
`normally quantized to zero and the reordering results (ideally)
`in the grouping of long runs of consecutive zero values. Other
`information such as prediction types and quantizer indication
`is also entropy coded by means of VLC’s.
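The following sketch illustrates the zigzag scan and the (LAST, RUN, LEVEL) representation for one quantized 8 × 8 block (the VLC table lookup itself is omitted):

import numpy as np

def zigzag_positions(n=8):
    # Zigzag scan order for an n x n block, from low to high frequency:
    # diagonals of constant i + j, traversed in alternating direction.
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]))

def run_level_triples(block):
    # Convert a quantized 8x8 block into (LAST, RUN, LEVEL) triples, where RUN
    # is the number of zeros preceding a nonzero LEVEL and LAST = 1 marks the
    # final nonzero coefficient of the block.
    triples, run = [], 0
    for i, j in zigzag_positions():
        value = int(block[i, j])
        if value == 0:
            run += 1
        else:
            triples.append([0, run, value])
            run = 0
    if triples:
        triples[-1][0] = 1
    return [tuple(t) for t in triples]

block = np.zeros((8, 8), dtype=int)
block[0, 0], block[0, 1], block[2, 0] = 43, -3, 2
print(run_level_triples(block))    # [(0, 0, 43), (0, 0, -3), (1, 1, 2)]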
`
`
Fig. 5. Improved PB frames. (a) Structure. (b) Forward prediction. (c) Backward prediction. (d) Bidirectional prediction.
`
`3) Coding Control: The two switches in Fig. 1 represent
`the intra/inter mode selection, which is not specified in the
`standard. Such a selection is made at the macroblock level.
`The performance of the motion estimation process, usually
`measured in terms of the associated SAD values, can be used
`to select the coding mode (intra or inter). If a macroblock
`does not change significantly with respect to the reference
`picture, an encoder can also choose not to encode it, and
`the decoder will simply repeat the macroblock located at the
`subject macroblock’s spatial location in the reference picture.
`
`B. Optional Modes
`In addition to the core encoding and decoding algorithms
`described above, H.263 includes four negotiable advanced
`coding modes: unrestricted motion vectors, advanced predic-
`tion, PB frames, and syntax-based arithmetic coding. The first
`two modes are used to improve inter picture prediction. The
`PB-frames mode improves temporal resolution with little bit
`rate increase. When the syntax-based arithmetic coding mode
`is enabled, arithmetic coding replaces the default VLC coding.
`These optional modes allow developers to trade off between
`compression performance and complexity. We next provide
`a brief description of each of these modes. A more detailed
`description of such modes can be found in [11] and [12].
`1) Unrestricted Motion Vector Mode (Annex D): In base-
`line H.263, motion vectors can only reference pixels that are
`within the picture area. Because of this, macroblocks at the
`border of a picture may not be well predicted. When the
`unrestricted motion vector mode is used, motion vectors can
take on values in the range [−31.5, 31.5] instead of [−16,
15.5], and are allowed to point outside the picture boundaries.
`
`The longer motion vectors improve coding efficiency for larger
`picture formats, i.e., 4CIF or 16CIF. Moreover, by allowing
`motion vectors to point outside the picture, a significant
`gain is achieved if there is movement along picture edges.
`This is especially useful in the case of camera movement or
`background movement.
`2) Syntax-Based Arithmetic Coding Mode (Annex E):
`Baseline H.263 employs variable-length coding as a means of
`entropy coding. In this mode, syntax-based arithmetic coding
`is used. Since VLC and arithmetic coding are both lossless
`coding schemes, the resulting picture quality is not affected,
`yet the bit rate can be reduced by approximately 5% due to
`the more efficient arithmetic codes. It is worth noting that use
`of this annex is not widespread.
`3) Advanced Prediction Mode (Annex F): This mode al-
lows for the use of four motion vectors per macroblock, one
for each of the four 8 × 8 luminance blocks. Furthermore,
overlapped block motion compensation is used for the
luminance macroblocks, and motion vectors are allowed to
`point outside the picture as in the unrestricted motion vector
`mode. Use of this mode improves inter picture prediction, and
`yields a significant improvement in subjective picture quality
`for the same bit rate by reducing blocking artifacts.
`4) PB-Frames Mode (Annex G): In this mode, the frame
`structure consists of a P picture and a B picture, as illustrated
`in Fig. 5(a). The quantized DCT coefficients of the B and P
`pictures are interleaved at the macroblock layer such that a
`P-picture macroblock is immediately followed by a B-picture
`macroblock. Therefore, the maximum number of blocks trans-
`mitted at the macroblock layer is 12 rather than 6. The P
`picture is forward predicted from the previously decoded P
`
`
`picture. The B picture is bidirectionally predicted from the
`previously decoded P picture and the P picture currently being
`decoded. The forward and backward motion vectors for a B
`macroblock are calculated by scaling the motion vector from
`the current P-picture macroblock using the temporal resolution
`of the P and B pictures with respect to the previous P picture.
`If this motion vector does not yield a good prediction, it can
`be enhanced by a delta vector. The delta vector is obtained by
`performing motion estimation, within a small search window,
`around the calculated motion vectors.
`When decoding a PB-frame macroblock, the P macroblock
`is reconstructed first, followed by the B macroblock since
`the information from the P macroblock is needed for B-
`macroblock prediction. When using the PB-frames mode, the
`picture rate can be doubled without a significant increase in
`bit rate.
`
`III. THE ITU-T H.263+ STANDARD
`The objective of H.263+ is to broaden the range of ap-
`plications and to improve compression efficiency. H.263+, or
`H.263 version 2, is backward compatible with H.263. Not only
`is this critical due to the large number of video applications
`currently using the H.263 standard, but it is also required by
`ITU-T rules.
`H.263+ offers many improvements over H.263. It allows the
`use of a wide range of custom source formats, as opposed to
`H.263, wherein only five video source formats defining picture
`size, picture shape, and clock frequency can be used. This
`added flexibility opens H.263+ to a broader range of video
`scenes and applications, such as wide format pictures, resize-
`able computer windows, and higher refresh rates. Moreover,
`picture size, aspect ratio, and clock frequency can be specified
`as part of the H.263+ bit stream. Another major improvement
`of H.263+ over H.263 is scalability, which can improve the
`delivery of video information in error-prone, packet-lossy,
`or heterogeneous environments by allowing multiple display
`rates, bit rates, and resolutions to be available at the decoder.
`Furthermore, picture segment1 dependencies may be limited,
`likely reducing error propagation.
`
`A. H.263+ Optional Modes
`Next, we describe each of the 12 new optional coding
`modes of the H.263+ video coding standard, including the
`modification of H.263’s unrestricted motion vector mode when
`used within an H.263+ framework.
`1) Unrestricted Motion Vector Mode (Annex D): The defi-
`nition of the unrestricted motion vector mode in H.263+ is
`different from that of H.263. When this mode is employed
`within an H.263+ framework, new reversible VLC’s (RVLC’s)
`are used for encoding the difference motion vectors. These
`codes are single valued, as opposed to the earlier H.263 VLC’s
`which were double valued. The double-valued codes were not
`popular due to limitations in their extendibility, and also to
their high implementation cost. Reversible VLC’s are easy to
implement, as a simple state machine can be used to generate
and decode them.

1 A picture segment is defined as a slice or any number of GOB’s preceded
by a GOB header.

Fig. 6. Neighboring blocks used for intra prediction in the advanced intra
coding mode.
`More importantly, reversible VLC’s can be used to increase
`resilience to channel errors. The idea behind RVLC’s is that
`decoding can be performed by processing the received motion
`vector part of the bit stream in the forward and reverse
`directions. If an error is detected while decoding in the forward
`direction, motion vector data are not completely lost as the
`decoder can proceed in the reverse direction; this improves
`error resilience of the bit stream [13].2 Furthermore, the motion
vector range is extended to up to [−256, 255.5] depending
`on the picture size, as depicted in Table I. This is very useful
`given the wide range of new picture formats available in
`H.263+.
`2) Advanced Intra Coding Mode (Annex I): This mode im-
`proves compression performance when coding intra mac-
`roblocks. In this mode, inter block prediction from neighboring
`intra coded blocks, a modified inverse quantization of intra
`DCT coefficients, and a separate VLC table for intra coded
`coefficients are employed. Block prediction is performed using
`data from the same luminance or chrominance components
`(
`or
`). As illustrated in Fig. 6, one of three different
`prediction options can be signaled: DC only, vertical DC and
`AC, or horizontal DC and AC. In the DC only option, only the
`DC coefficient is predicted, usually from both the block above
`and the block to the left, unless one of these blocks is not in the
`same picture segment or is not an intra block. In the vertical
`DC and AC option, the DC and first row of AC coefficients
`are vertically predicted from those of the block above. Finally,
`in the horizontal DC and AC option, the DC and first column
of AC coefficients are horizontally predicted from those of the
block to the left. The option that yields the best prediction is
applied to all blocks of the subject intra macroblock.

2 To exploit the full error resilience potential of RVLC’s, the motion vector
bits should be blocked into one stream for each video frame, concatenating a
large number of RVLC’s. This can be performed by data partitioning, which
is currently being proposed in H.263++.
`
`
TABLE I
MOTION VECTOR RANGE IN H.263+’S
UNRESTRICTED MOTION VECTOR RANGE MODE
`The difference coefficients, obtained by subtracting the
`predicted DCT coefficients from the original ones, are then
`quantized and scanned differently, depending on the selected
`prediction option. Three scanning patterns are used: the basic
`zigzag scan for DC only prediction,
`the alternate-vertical
`scan (as in MPEG-2) for horizontally predicted blocks, or
`the alternate-horizontal scan for vertically predicted blocks.
`The main part of the standard employs the same VLC table
`for coding all quantized coefficients. However, this table is
`designed for inter macroblocks and is not very effective
`for coding intra macroblocks. In intra macroblocks, larger
`coefficients with smaller runs of zeros are more common.
`Thus, the advanced intra coding mode employs a new VLC
`table for encoding the quantized coefficients, a table that is
`optimized to global statistics of intra macroblocks.
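A simplified sketch of the three prediction options follows (our own illustration; the standard’s handling of unavailable neighbors, inverse-quantization details, and the subsequent choice of scan pattern are omitted):

import numpy as np

def intra_predict(block, above, left, option):
    # Difference coefficients for one 8x8 block of intra DCT coefficients.
    # `above` and `left` are the co-located coefficient blocks of previously
    # coded neighbors; unavailable-neighbor rules are ignored in this sketch.
    prediction = np.zeros_like(block)
    if option == "dc":                      # DC predicted from both neighbors
        prediction[0, 0] = (above[0, 0] + left[0, 0]) // 2
    elif option == "vertical":              # DC and first row from the block above
        prediction[0, :] = above[0, :]
    elif option == "horizontal":            # DC and first column from the block to the left
        prediction[:, 0] = left[:, 0]
    return block - prediction

def best_option(block, above, left):
    # Choose the option leaving the smallest difference coefficients
    # (an encoder applies one choice to all blocks of the macroblock).
    return min(("dc", "vertical", "horizontal"),
               key=lambda opt: int(np.abs(intra_predict(block, above, left, opt)).sum()))

above = np.zeros((8, 8), dtype=int)
above[0, :5] = [43, -3, 2, 0, 1]
left = np.zeros((8, 8), dtype=int)
left[:2, 0] = [40, 5]
block = np.zeros((8, 8), dtype=int)
block[0, :5] = [44, -3, 2, 0, 1]
print(best_option(block, above, left))      # "vertical": the block resembles its upper neighbor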
3) Deblocking Filter Mode (Annex J): This mode introduces
a deblocking filter inside the coding loop. Unlike in
postfiltering, predicted pictures are computed based on filtered
versions of the previous ones. A filter is applied to the edge
boundaries of the four luminance and two chrominance 8 × 8
blocks. The filter is applied to a window of four edge pixels
in the horizontal direction, and it is then similarly applied in
the vertical direction. The weights of the filter coefficients
depend on the quantizer step size for a given macroblock,
`where stronger coefficients are used for a coarser quantizer.
`This mode also allows the use of four motion vectors per
`macroblock, as specified in the advanced prediction mode of
`H.263, and also allows motion vectors to point outside picture
`boundaries, as in the unrestricted motion vector mode. The
`above techniques, as well as filtering, result in better prediction
`and a reduction in blocking artifacts. The computationally
`expensive overlapping motion compensation operation of the
`advanced prediction mode is not used here in order to keep
`the additional complexity of this mode minimal.
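The normative filter is fully specified in Annex J; purely to illustrate the principle of in-loop filtering whose strength follows the quantizer, consider the following sketch (a hypothetical filter of our own, not the one defined by the standard):

import numpy as np

def filter_edge(a, b, c, d, qp):
    # a, b | c, d are the four pixels straddling a block boundary (b and c are
    # adjacent across the edge). A correction proportional to the discontinuity
    # is clipped to a range that grows with QP, so coarser quantizers receive
    # stronger filtering. This is NOT the normative Annex J filter.
    delta = np.clip((a - 4 * b + 4 * c - d) / 8.0, -qp, qp)
    return a, b + delta / 2, c - delta / 2, d

print(filter_edge(100, 100, 140, 140, qp=8))   # (100, 104.0, 136.0, 140): edge softened
print(filter_edge(100, 100, 140, 140, qp=2))   # (100, 101.0, 139.0, 140): weaker correction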
`4) Slice Structured Mode (Annex K): A slice structure, in-
`stead of a GOB structure, is employed in this mode. This
`allows the subdivision of the picture into segments containing
`variable numbers of macroblocks. The slice structure consists
`of a slice header followed by consecutive complete mac-
`roblocks. Two additional submodes can be signaled to reflect
`the order of transmission, sequential or arbitrary, and the
`shape of the slices, rectangular or not. These add flexibility
`to the slice structure so that it can be designed for different
`environments and applications. For example, rectangular slices
`can be used to subdivide a picture into rectangular regions of
`interest for region-based coding. The slice header locations
`
`within the bit stream act as resynchronization points, which
`help the decoder recover from bit errors and packet losses.
`They also allow slices to be decoded in an arbitrary order.
`5) Supplemental Enhancement Information Mode (Annex
`L): In this mode, supplemental information is included in
`the bit stream in order to offer display capabilities within
`the coding framework. This supplemental information includes
`support for picture freeze, picture snapshot, video segmenta-
`tion, progressive refinement, and chroma keying. These added
`functionalities are externally negotiated at the system layer
`(using H.245 for example) to ensure picture synchronization.
`The picture freeze option allows the encoder to signal a
`complete or partial freeze of a picture. Rectangular areas of
`a picture can be frozen while the rest of the picture is still
`being updated. A picture freeze release code is explicitly sent
`to the decoder. The picture snapshot option allows part of
`or the full picture to be used as a still image snapshot by
`an external application. When video subsequences can be
used by an external application, this can be signaled by
`the video segmentation option of this mode. The progressive
`refinement option signals to the decoder that the following
`pictures represent a refinement in quality of the subject picture,
`as opposed to pictures at different times. The chroma keying
`option indicates that transparent or semitransparent pixels can
`be employed during the video decoding process. When set
`on, transparent pixels are not displayed. Instead, a background
`picture that is externally controlled is displayed.
All of the above options are aimed at providing decoder-support
features and functionalities within the video bit
`stream. For example, such options will facilitate interoper-
`ability between different applications within the context of
`windows-based environments.
`6) Improved PB-Frames Mode (Annex M): This mode is
`an enhanced version of the H.263 PB-frames mode. The
`main difference is that the H.263 PB-frames mode allows
`only bidirectional prediction to predict B pictures in a
`PB frame, whereas the improved PB-frames mode permits
`forward, backward, and bidirectional prediction as illustrated
`in Fig. 5. Bidirectional prediction methods, as illustrated in
`Fig. 5(d), are the same in both modes, except that, in the
`improved PB-frames mode, no delta vector is transmitted. In
`forward prediction, as shown in Fig. 5(b), the B macroblock
`is predicted from the previous P macroblock, and a separate
`motion vector is then transmitted. In backward prediction,
`as illustrated in Fig. 5(c), the predicted macroblock is equal
`to the future P macroblock, and therefore no motion vector
`is transmitted. Use of the additional forward and backward
`predictors makes the improved PB frames less susceptible to
`significant changes that may occur between pictures.
`7) Reference Picture Selection Mode (Annex N): In H.263,
`a picture is predicted from the previous picture. If a part of the
`subject picture is lost due to channel errors or packet loss, the
`quality of future pictures can be severely degraded. Using this
`mode, it is possible to select the reference picture for prediction
`in order to suppress temporal error propagation due to inter
`coding. Multiple pictures must be stored at the decoder, and
`the encoder should signal the necessary amount of additional
`picture memory by external means. The information which
`
`
`specifies the selected picture for prediction is included in the
`encoded bit stream.
`If a back-channel is employed, two back-channel mode
`switches define four messaging methods (NEITHER, ACK,
`NACK, and ACK+NACK) that
`the encoder and decoder
`employ to determine which picture segment will be used for
`prediction. For example, a NACK sent to the encoder from
`the decoder signals that a given picture has been degraded by
`errors. Thus, the encoder may choose not to use this picture for
`future prediction, and instead employ a different, unaffected,
`reference picture. This mode reduces error propagation, thus
`maintaining good picture reproduction quality in error-prone
`environments.
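Conceptually, an encoder operating with NACK back-channel messages might manage its references as in the following sketch (illustrative only; actual Annex N operation works at the picture-segment level, and its message syntax is defined by the standard):

class ReferencePictureSelector:
    # Keeps a window of recently coded pictures and avoids, as a prediction
    # reference, any picture the decoder has reported as damaged via NACK.
    # Illustrative only; segment granularity and message syntax are omitted.

    def __init__(self, window=5):
        self.window = window
        self.stored = []        # picture numbers held in additional picture memory
        self.damaged = set()    # picture numbers NACKed by the decoder

    def picture_encoded(self, number):
        self.stored.append(number)
        self.stored = self.stored[-self.window:]

    def nack_received(self, number):
        self.damaged.add(number)

    def choose_reference(self):
        # Most recent stored picture not known to be damaged; a real encoder
        # would fall back to intra coding if none is available.
        for number in reversed(self.stored):
            if number not in self.damaged:
                return number
        return None

selector = ReferencePictureSelector()
for n in range(5):
    selector.picture_encoded(n)
selector.nack_received(4)            # decoder reports picture 4 as degraded
print(selector.choose_reference())   # 3: the next picture is predicted from picture 3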
`8) Temporal, SNR, and Spatial Scalability Mode (Annex
`O): This mode specifies syntax to support temporal, SNR,
`and spatial scalability capabilities. Scalability is a desirable
`property for error-prone and heterogeneous environments. It
`implies that the encoder’s output bit stream can be manipulated
`any time after it has been generated. This property is desirable
`in order to counter limitations such as constraints on bit
`rate, display resolution, network throughput, and decoder
`complexity. In multipoint and broadcast video applications,
`such constraints cannot be foreseen at the time of encoding.
`Temporal scalability provides a mechanism for enhancing
`perceptual quality by increasing the picture display rate. This
`is achieved via bidirectionally predicted pictures,
`inserted
`between anchor picture pairs and predicted from either one or
`both of these anchor pictures, as illustrated in Fig. 7(a). Thus,
`for the same quantization level, B pictures yield increased
`compression as compared to forward predicted P pictures. B
`pictures are not used as anchor pictures, i.e., other pictures are
`never predicted from them. Therefore, they can be discarded
`without impacting picture quality of future pictures; hence,
`the name temporal scalability. Note that, while B pictures
`improve compression performance as compared to P pictures,
`they increase encoder complexity and memory requirements
`and introduce additional delays.
`Spatial scalability and SNR scalability are closely related,
`the only difference being the increased spatial resolution
`provided by spatial scalability. An example of SNR scalable
`pictures is shown in Fig. 7(b). SNR scalability implies the
`creation of multirate bit streams. It allows for the recovery of
`coding error, or the difference, between an original picture and
`its reconstruction. This is achieved by using a finer quantizer
`to encode the difference picture in an enhancement layer.
`This additional information increases the SNR of the overall
`reproduced picture; hence, the name SNR scalability.
`Spatial scalability allows for the creation of multiresolution
`bit streams to meet varying display requirements/constraints
`for a wide range of clients. A spatial scalable structure
`is illustrated in Fig. 7(c). It
`is essentially the same as in
`SNR scalability, except that a spatial enhancement layer here
`attempts to recover the coding loss between an upsampled ver-
`sion of the reconstructed reference layer picture and a higher
`resolution version of the original picture. For example, if the
`reference layer has a QCIF resolution, and the enhancement
`layer has a CIF resolution, the reference layer picture must
`be scaled accordingly such that the enhancement layer picture
`
can be appropriately predicted from it.

Fig. 7. Illustration of scalability features. (a) Temporal. (b) SNR. (c) Spatial.

The standard allows the
`resolution to be increased by a factor of 2 in the vertical only,
`horizontal only, or both the vertical and horizontal directions
`for a single enhancement layer. There can be multiple enhance-
`ment layers, each increasing picture resolution over that of the
`previous layer. The interpolation filters used to upsample the
`reference layer picture are explicitly defined in the standard.
`Aside from the upsampling process from the reference to the
`enhancement layer, the processing and syntax of a spatially
`scaled picture are identical to those of an SNR scaled picture.
`In either SNR or spatial scalability, the enhancement layer
`pictures are referred to as EI or EP pictures. If the enhancement
`layer picture is upward predicted, from a picture in the
`reference layer, then the enhancement layer picture is referred
`to as an enhancement-I (EI) picture. In some cases, when
`reference layer pictures are coarsely represented, over coding
`of static parts of the picture can occur in the enhancement
`
`
`layer, requiring an unnecessarily excessive bit rate. To avoid
`this problem, forward prediction is permitted in the enhance-
`ment layer. A picture that can be forward predicted from a
`previous enhancement layer picture or upward predicted from
`the reference layer picture is referred to as an enhancement-P
`(EP) picture. Note that computing the average of the upward
`and forward predicted pictures can provide a bidirectional
`prediction option for EP pictures. For both EI and EP pictures,
`upward prediction from the reference layer picture implies
`that no motion vectors are required. In the case of forward
`prediction for EP pictures, motion vectors are required.
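To make the layered prediction concrete, the following sketch forms an upward prediction for a spatial enhancement layer and the residual that would be coded there (pixel replication is used here only for illustration; the standard defines the actual upsampling filters):

import numpy as np

def upsample_by_two(picture):
    # Upsample a reconstructed reference-layer picture by 2 in both directions.
    # Pixel replication is used for illustration only; H.263+ specifies the
    # interpolation filters to be used for spatial scalability.
    return np.repeat(np.repeat(picture, 2, axis=0), 2, axis=1)

def enhancement_residual(original_high, reconstructed_low):
    # Difference coded in the enhancement layer: the higher resolution original
    # minus the upward prediction from the reconstructed reference layer.
    return original_high.astype(np.int16) - upsample_by_two(reconstructed_low).astype(np.int16)

reconstructed_low = np.array([[100, 110],
                              [120, 130]], dtype=np.uint8)   # base-layer reconstruction
original_high = np.array([[101, 100, 111, 112],
                          [ 99, 102, 109, 110],
                          [121, 119, 131, 129],
                          [120, 122, 130, 132]], dtype=np.uint8)
print(enhancement_residual(original_high, reconstructed_low))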
`9) Reference Picture Resampling Mode (Annex P): Th