`
MPEG-4 natural video coding – An overview
`
Touradj Ebrahimi^a,*, Caspar Horne^b

^a Signal Processing Laboratory, Swiss Federal Institute of Technology – EPFL, 1015 Lausanne, Switzerland
^b Mediamatics Inc., 48430 Lakeview Blvd., Fremont, CA 94538, USA
`
Abstract
`
This paper describes the MPEG-4 standard, as defined in ISO/IEC 14496-2. The MPEG-4 visual standard is developed to provide users a new level of interaction with visual contents. It provides technologies to view, access and manipulate objects rather than pixels, with great error robustness at a large range of bit-rates. Application areas range from digital television and streaming video to mobile multimedia and games. The MPEG-4 natural video standard consists of a collection of tools that support these application areas. The standard provides tools for shape coding, motion estimation and compensation, texture coding, error resilience, sprite coding and scalability. Conformance points in the form of object types, profiles and levels provide the basis for interoperability. Shape coding can be performed in binary mode, where the shape of each object is described by a binary mask, or in gray scale mode, where the shape is described in a form similar to an alpha channel, allowing transparency and reducing aliasing. Motion compensation is block based, with appropriate modifications for object boundaries. The block size can be 16×16 or 8×8, with half pixel resolution. MPEG-4 also provides a mode for overlapped motion compensation. Texture coding is based on the 8×8 DCT, with appropriate modifications for object boundary blocks. Coefficient prediction is possible to improve coding efficiency. Static textures can be encoded using a wavelet transform. Error resilience is provided by resynchronization markers, data partitioning, header extension codes, and reversible variable length codes. Scalability is provided for both spatial and temporal resolution enhancement. MPEG-4 provides scalability on an object basis, with the restriction that the object shape has to be rectangular. MPEG-4 conformance points are defined at the Simple Profile, the Core Profile, and the Main Profile. The Simple and Core Profiles address typical scene sizes of QCIF and CIF, with bit-rates of 64, 128 and 384 kbit/s, and 2 Mbit/s. The Main Profile addresses typical scene sizes of CIF, ITU-R 601 and HD, with bit-rates at 2, 15 and 38.4 Mbit/s. © 2000 Elsevier Science B.V. All rights reserved.
`
Keywords: MPEG-4 visual; Video coding; Object based coding; Streaming; Interactivity
`
1. Introduction
`
Multimedia commands the growing attention of the telecommunications, consumer electronics, and computer industries. In a broad sense, multimedia is assumed to be a general framework for interaction with information available from different sources, including video.

* Corresponding author.
E-mail address: touradj.ebrahimi@epfl.ch (T. Ebrahimi)
A multimedia standard is expected to provide support for a large number of applications. These applications translate into specific sets of requirements which may be very different from each other. One theme common to most applications is the need for supporting interactivity with different kinds of data. Applications related to visual information can be grouped together on the basis of several features:
- type of data (still images, stereo images, video, etc.),
- type of source (natural images, computer generated images, text/graphics, medical images, etc.),
- type of communication (ranging from point-to-point to multipoint-to-multipoint),
- type of desired functionalities (object manipulation, on-line editing, progressive transmission, error resilience, etc.).

0923-5965/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved.
PII: S0923-5965(99)00054-5

T. Ebrahimi, C. Horne / Signal Processing: Image Communication 15 (2000) 365–385
`
Video compression standards MPEG-1 [5] and MPEG-2 [6], although perfectly well suited to the environments for which they were designed, are not necessarily flexible enough to efficiently address the requirements of multimedia applications. Hence, MPEG (Moving Picture Experts Group) committed itself to the development of the MPEG-4 standard, providing a common platform for a wide range of multimedia applications [3]. MPEG has been working on the development of the MPEG-4 standard since 1993, and finally, after about 6 years of effort, an International Standard covering the first version of MPEG-4 has recently been adopted [7].
This paper provides an overview of the major natural video coding tools and algorithms as defined in the International Standard ISO/IEC 14496-2 (MPEG-4). Additional features and improvements that will be defined in an amendment, due in 2000, are not described here.
`
2. MPEG-4 visual overview
`
2.1. Motivation
`
Digital video is replacing analog video in many existing applications. A prime example is the introduction of digital television, which is starting to see wide deployment. Another example is the progressive replacement of analog video cassettes by DVD as the preferred medium to watch movies. MPEG-2 has been one of the key technologies that enabled the acceptance of these new media. In these existing applications, digital video will initially provide similar functionalities to analog video, i.e. the content is represented in digital form instead of analog, with obvious direct benefits such as improved quality and reliability, but the content remains the same to the user. However, once the content is in the digital domain, new functionalities can easily be added that will allow the user to view, access, and manipulate the content in completely new ways. The MPEG-4 standard provides key technologies that will enable such functionalities.
`
2.2. Application areas
`
2.2.1. Digital TV
With the phenomenal growth of the Internet and the World Wide Web, interest in advanced interactivity with content provided by digital television is increasing. Additional text, pictures, audio or graphics that can be controlled by the user can add to the entertainment value of certain programs, or provide valuable information unrelated to the current program but of interest to the viewer. TV station logos, customized advertising, and multi-window screen formats allowing display of sports statistics or stock quotes using data-casting are some examples of increased functionalities. Providing the capability to link and to synchronize certain events with video would further improve the experience. Coding and representation of not only frames of video, but also individual objects in the scene (video objects), can open the door for completely new ways of television programming.
`
2.2.2. Mobile multimedia
The enormous popularity of cellular phones and palm computers indicates the interest in mobile communications and computing. Using multimedia in these areas would enhance the user's experience and improve the usability of these devices. Narrow bandwidth, limited computational capacity and the reliability of the transmission media are limitations that currently hamper widespread use of multimedia here. Providing improved error resilience, improved coding efficiency and flexibility in assigning computational resources would bring mobile multimedia applications closer to reality.
`
`
2.2.3. TV production
Content creation is increasingly turning to virtual production techniques as extensions of the well-known chroma keying. The scene and the actors are recorded separately, and can be mixed with additional computer generated special effects. By coding video objects instead of rectangular linear video frames, and allowing access to the video objects, the scene can be rendered with higher quality and with more flexibility. Television programs consisting of composited video objects, and additional graphics and audio, can then be transmitted directly to the viewer, with the additional advantage of allowing the user to control the programming in a more sophisticated way. In addition, depending on the targeted viewers, local TV stations could inject regional advertisement video objects, better suited when international programs are broadcast.
`
2.2.4. Games
The popularity of games on stand-alone game machines and on PCs clearly indicates the interest in user interaction. Most games currently use three-dimensional graphics, both for the environment and for the objects that are controlled by the players. The addition of video objects to these games would make them even more realistic, and using overlay techniques, the objects could be made more life-like. Access to individual video objects is essential, and using standards-based technology would make it possible to personalize games by using personal video databases linked in real-time into the games.
`
2.2.5. Streaming video
Streaming video over the Internet is becoming very popular, using viewing tools as software plug-ins for a Web browser. News updates and live music shows are just examples of many possible video streaming applications. Here, bandwidth is limited due to the use of modems, and transmission reliability is an issue, as packet loss may occur. Increased error resilience and improved coding efficiency will improve the experience of streaming video. In addition, scalability of the bitstream, in terms of temporal and spatial resolution, but also in terms of video objects, under the control of the viewer, will further enhance the experience and also the use of streaming video.
`
2.3. Features and functionalities
`
The MPEG-4 visual standard consists of a set of tools that enable applications by supporting several classes of functionalities. The most important features covered by the MPEG-4 standard can be clustered in three categories (see Fig. 1) and summarized as follows:
`
1. Compression efficiency: Compression efficiency has been the leading principle for MPEG-1 and MPEG-2, and in itself has enabled applications such as Digital TV and DVD. Improved coding efficiency and coding of multiple concurrent data streams will increase acceptance of applications based on the MPEG-4 standard.
2. Content-based interactivity: Coding and representing video objects rather than video frames enables content-based applications. It is one of the most important novelties offered by MPEG-4. Based on efficient representation of objects, object manipulation, bitstream editing, and object-based scalability allow new levels of content interactivity.
3. Universal access: Robustness in error-prone environments allows MPEG-4 encoded content to be accessible over a wide range of media, such as mobile networks as well as wired connections. In addition, object-based temporal and spatial scalability allow the user to decide where to use scarce resources, which can be the available bandwidth, but also the computing capacity or power consumption.
`
Fig. 1. Functionalities offered by the MPEG-4 visual standard.
`
`
To support some of these functionalities, MPEG-4 should provide the capability to represent arbitrarily shaped video objects. Each object can be encoded with different parameters, and at different qualities. The shape of a video object can be represented in MPEG-4 by a binary or a gray-level (alpha) plane. The texture is coded separately from its shape. For low-bit-rate applications, frame based coding of texture can be used, similar to MPEG-1 and MPEG-2. To increase robustness to errors, special provisions are taken into account at the bitstream level to allow fast resynchronization and efficient error recovery.
The MPEG-4 visual standard has been explicitly optimized for three bit-rate ranges:
(1) below 64 kbit/s,
(2) 64–384 kbit/s,
(3) 384 kbit/s – 4 Mbit/s.
For high-quality applications, higher bit-rates are also supported, using the same set of tools and the same bitstream syntax as those available at the lower bit-rates.
MPEG-4 provides support for both interlaced and progressive material. The chrominance format that is supported is 4:2:0. In this format the number of Cb and Cr samples is half the number of luminance samples in both the horizontal and vertical directions. Each component can be represented by a number of bits ranging from 4 to 12.
`
2.4. Structure and syntax
`
The central concept defined by the MPEG-4 standard is the audio-visual object, which forms the foundation of the object-based representation. Such a representation is well suited for interactive applications and gives direct access to the scene contents. Here we will limit ourselves mainly to natural video objects. However, the discussion remains quite valid for other types of audio-visual objects. A video object may consist of one or more layers to support scalable coding. The scalable syntax allows the reconstruction of video in a layered fashion, starting from a stand-alone base layer and adding a number of enhancement layers. This allows applications to generate a single MPEG-4 video bitstream for a variety of bandwidth and/or computational complexity requirements. A special case where a high degree of scalability is needed is when static image data is mapped onto two- or three-dimensional objects. To address this functionality, MPEG-4 provides a special mode for encoding static textures using a wavelet transform.
An MPEG-4 visual scene may consist of one or more video objects. Each video object is characterized by temporal and spatial information in the form of shape, motion and texture. For certain applications video objects may not be desirable, because of either the associated overhead or the difficulty of generating video objects. For those applications, MPEG-4 video allows coding of rectangular frames, which represent a degenerate case of an arbitrarily shaped object.
An MPEG-4 visual bitstream provides a hierarchical description of a visual scene, as shown in Fig. 2. Each level of the hierarchy can be accessed in the bitstream by special code values called start codes. The hierarchical levels that describe the scene most directly are:
- Visual Object Sequence (VS): The complete MPEG-4 scene, which may contain any 2-D or 3-D natural or synthetic objects and their enhancement layers.
- Video Object (VO): A video object corresponds to a particular (2-D) object in the scene. In the simplest case this can be a rectangular frame, or it can be an arbitrarily shaped object corresponding to an object or background of the scene.
- Video Object Layer (VOL): Each video object can be encoded in scalable (multi-layer) or non-scalable form (single layer), depending on the application, represented by the video object layer (VOL). The VOL provides support for scalable coding. A video object can be encoded using spatial or temporal scalability, going from coarse to fine resolution. Depending on parameters such as available bandwidth, computational power, and user preferences, the desired resolution can be made available to the decoder.
`
There are two types of video object layers: the video object layer that provides full MPEG-4 functionality, and a reduced-functionality video object layer, the video object layer with short headers. The latter provides bitstream compatibility with baseline H.263 [4].
`
`
Fig. 2. Example of an MPEG-4 video bitstream logical structure.
`
Each video object is sampled in time; each time sample of a video object is a video object plane. Video object planes can be grouped together to form a group of video object planes:
- Group of Video Object Planes (GOV): The GOV groups together video object planes. GOVs can provide points in the bitstream where video object planes are encoded independently from each other, and can thus provide random access points into the bitstream. GOVs are optional.
- Video Object Plane (VOP): A VOP is a time sample of a video object. VOPs can be encoded independently of each other, or dependent on each other by using motion compensation. A conventional video frame can be represented by a VOP with rectangular shape.
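The hierarchy above can be pictured as nested containers. The following sketch is illustrative only: the class and field names are ours, not normative syntax elements of the standard.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VOP:            # time sample of a video object
    coding_type: str  # 'I', 'P' or 'B'

@dataclass
class GOV:            # optional random-access grouping of VOPs
    vops: List[VOP] = field(default_factory=list)

@dataclass
class VOL:                         # one (scalability) layer of a video object
    short_header: bool = False     # True => H.263 baseline-compatible layer
    govs: List[GOV] = field(default_factory=list)

@dataclass
class VO:             # a single (2-D) object in the scene
    layers: List[VOL] = field(default_factory=list)

@dataclass
class VisualObjectSequence:  # VS: the complete MPEG-4 scene
    objects: List[VO] = field(default_factory=list)
```

For example, a single-layer, rectangular-frame sequence would be a VS with one VO containing one VOL.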
`
A video object plane can be used in several different ways. In the most common way, the VOP contains the encoded video data of a time sample of a video object. In that case it contains motion parameters, shape information and texture data, which are encoded using macroblocks. A VOP can also be used to code a sprite. A sprite is a video object that is usually larger than the displayed video, and is persistent over time. There are ways to slightly modify a sprite, by changing its brightness, or by warping it to take into account spatial deformation. It is used to represent large, more or less static areas, such as backgrounds. Sprites are also encoded using macroblocks.
A macroblock contains a section of the luminance component and the spatially subsampled chrominance components. The MPEG-4 visual standard supports only one chrominance format for a macroblock, the 4:2:0 format. In this format, each macroblock contains 4 luminance blocks and 2 chrominance blocks.
`
`
Fig. 3. General block diagram of MPEG-4 video.

Fig. 4. Example of VOP-based decoding in MPEG-4.
`
Each block contains 8×8 pixels and is encoded using the DCT transform. A macroblock carries the shape information, motion information, and texture information.
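Under the 4:2:0 format, a macroblock covers a 16×16 luminance area and an 8×8 area in each half-resolution chrominance plane. As a small illustrative sketch (function and variable names are ours), collecting the six 8×8 blocks of one macroblock might look like:

```python
def macroblock_blocks(y, cb, cr, r, c):
    """Collect the six 8x8 blocks of the macroblock whose top-left
    luminance sample is (16*r, 16*c).  y is a 2-D list of luma samples;
    cb and cr are half-resolution 2-D lists (4:2:0 subsampling)."""
    def block8(plane, top, left):
        return [row[left:left + 8] for row in plane[top:top + 8]]
    # Four 8x8 luminance blocks covering the 16x16 luma area.
    luma = [block8(y, 16 * r + dy, 16 * c + dx)
            for dy in (0, 8) for dx in (0, 8)]
    # One 8x8 block from each chrominance plane (Cb and Cr).
    chroma = [block8(cb, 8 * r, 8 * c), block8(cr, 8 * r, 8 * c)]
    return luma + chroma  # 6 blocks total
```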
Fig. 3 illustrates the general block diagram of MPEG-4 encoding and decoding based on the notion of video objects. Each video object is coded separately. For reasons of efficiency and backward compatibility, video objects are coded via their corresponding video object planes in a hybrid coding scheme somewhat similar to previous MPEG standards. Fig. 4 shows an example of decoding of a VOP.
`
3. Shape coding tools
`
In this section, we discuss the tools offered by the MPEG-4 standard for explicit coding of shape information for arbitrarily shaped VOs. Besides the shape information available for the VOP in question, the shape coding scheme also relies on motion estimation to compress the shape information even further. A general description of shape coding techniques would be out of the scope of this paper. Therefore, we will only describe the scheme adopted by the MPEG-4 natural video standard for shape coding. Interested readers are referred to [2]
`
`
for information on other shape coding techniques.
In the MPEG-4 visual standard, two kinds of shape information are considered as inherent characteristics of a video object. These are referred to as binary and gray scale shape information. By binary shape information, one means label information that defines which portions (pixels) of the support of the object belong to the video object at a given time. The binary shape information is most commonly represented as a matrix with the same size as that of the bounding box of a VOP. Every element of the matrix can take one of two possible values, depending on whether the pixel is inside or outside the video object. Gray scale shape is a generalization of the concept of binary shape, providing the possibility to represent transparent objects and reduce aliasing effects. Here the shape information is represented by 8 bits, instead of a binary value.
`
3.1. Binary shape coding
`
In the past, the problem of shape representation and coding has been thoroughly investigated in the fields of computer vision, image understanding, image compression and computer graphics. However, this is the first time that a video standardization effort has adopted a shape representation and coding technique within its scope. In its canonical form, a binary shape is represented as a matrix of binary values called a bitmap. However, for the purpose of compression, manipulation, or a more semantic description, one may choose to represent the shape in other forms, such as using geometric representations or by means of its contour. Since its beginning, MPEG adopted a bitmap-based compression technique for the shape information. This is mainly due to the relative simplicity and higher maturity of such techniques. Experiments have shown that bitmap-based techniques offer good compression efficiency with relatively low computational complexity. This section describes the coding methods for binary shape information. Binary shape information is encoded by a motion compensated block based technique allowing both lossless and lossy coding of such data. In the MPEG-4 video compression algorithm, the shape of every VOP is coded along with its other properties (texture and motion). To this end, the shape of a VOP is bounded by a rectangular window with a size that is a multiple of 16 pixels in the horizontal and vertical directions. The position of the bounding rectangle could be chosen such that it contains the minimum number of blocks of size 16×16 with nontransparent pixels. The samples in the bounding box that lie outside of the VOP are set to 0 (transparent). The rectangular bounding box is then partitioned into blocks of 16×16 samples, and the encoding/decoding process is performed block by block.
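The construction of the bounding window can be sketched as follows. This is illustrative only: it simply extends the tight bounding box to multiples of 16 samples, whereas an encoder may additionally shift the window to minimize the number of non-transparent blocks.

```python
def vop_window(mask):
    """Bounding window of a binary VOP mask, extended to multiples of
    16 samples; samples outside the VOP stay 0 (transparent).
    Returns the window and its 16x16 blocks."""
    rows = [r for r, line in enumerate(mask) if any(line)]
    cols = [c for c in range(len(mask[0])) if any(line[c] for line in mask)]
    top, left = rows[0], cols[0]
    h, w = rows[-1] - top + 1, cols[-1] - left + 1
    # Round the window dimensions up to multiples of 16 samples.
    H, W = -(-h // 16) * 16, -(-w // 16) * 16
    window = [[mask[top + r][left + c] if r < h and c < w else 0
               for c in range(W)] for r in range(H)]
    # Partition into 16x16 blocks, processed one at a time.
    blocks = [[row[c:c + 16] for row in window[r:r + 16]]
              for r in range(0, H, 16) for c in range(0, W, 16)]
    return window, blocks
```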
The binary matrix representing the shape of a VOP is referred to as the binary mask. In this mask every pixel belonging to the VOP is set to 255, and all other pixels are set to 0. The mask is then partitioned into binary alpha blocks (BABs) of size 16×16, and each BAB is encoded separately. Starting from rectangular frames, it is common to have BABs in which all pixels have the same value, either 0 (in which case the BAB is called a transparent block) or 255 (in which case the block is said to be an opaque block). The shape compression algorithm provides several modes for coding a BAB. The basic tools for encoding BABs are the Context-based Arithmetic Encoding (CAE) algorithm [1] and motion compensation. InterCAE and IntraCAE are the variants of the CAE algorithm used with and without motion compensation, respectively. Each shape coding mode supported by the standard is a combination of these basic tools. Motion vectors can be computed by searching for a best match position (given by the minimum sum of absolute differences). The motion vectors themselves are differentially coded. Every BAB can be coded in one of the following modes:
`
1. The block is flagged transparent. In this case no coding is necessary. Texture information is not coded for such blocks either.
2. The block is flagged opaque. Again, shape coding is not necessary for such blocks, but texture information needs to be coded (since they belong to the VOP).
3. The block is coded using IntraCAE, without use of past information.
4. The motion vector difference (MVD) is zero, but the block is not updated.
5. The MVD is zero, and the block is updated. InterCAE is used for coding the block update.
`
`
Fig. 5. Context number selected for InterCAE (a) and IntraCAE (b) shape coding. In each case, the pixel to be encoded is marked by a circle, and the context pixels are marked with crosses. In InterCAE, part of the context pixels are taken from the co-located block in the previous frame.
`
6. The MVD is non-zero, but the block is not coded.
7. The MVD is non-zero, and the block is coded.

The CAE algorithm is used to code the pixels in BABs. The arithmetic encoder is initialized at the beginning of the process. Each pixel is encoded as follows:

1. Compute a context number according to the definition of Fig. 5.
2. Index a probability table using this context number.
3. Use the retrieved probability to drive the arithmetic encoder for codeword assignment.
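As a toy illustration of these steps, the sketch below packs four causal neighbours into a context number and looks up a probability for each pixel. Note that this is a simplification: the standard's intra template actually uses ten context pixels, and the inter template additionally draws on the motion-compensated previous shape.

```python
def intra_context(bab, r, c):
    """Toy context number for context-based arithmetic encoding:
    pack four causal neighbours of pixel (r, c) into a 4-bit index.
    (The standard's intra template uses 10 pixels.)"""
    def px(rr, cc):
        inside = 0 <= rr < len(bab) and 0 <= cc < len(bab[0])
        return bab[rr][cc] if inside else 0  # outside counts as transparent
    ctx = 0
    for bit in (px(r - 1, c - 1), px(r - 1, c), px(r - 1, c + 1), px(r, c - 1)):
        ctx = (ctx << 1) | bit
    return ctx

def encode_bab(bab, prob_table):
    """Drive a (stub) arithmetic coder: for every pixel, look up the
    probability under its context.  Here we only collect the
    (pixel, probability) pairs a real arithmetic coder would consume."""
    pairs = []
    for r in range(len(bab)):
        for c in range(len(bab[0])):
            pairs.append((bab[r][c], prob_table[intra_context(bab, r, c)]))
    return pairs
```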
`
3.2. Gray scale shape coding
`
The gray scale shape information has a structure similar to that of binary shape, with the difference that every pixel (element of the matrix) can take on a range of values (usually 0 to 255) representing the degree of transparency of that pixel. The gray scale shape corresponds to the notion of the alpha plane used in computer graphics, in which 0 corresponds to a completely transparent pixel and 255 to a completely opaque pixel. Intermediate values of the pixel correspond to intermediate degrees of transparency of that pixel. By convention, binary shape information corresponds to gray scale shape information with values of 0 and 255. Gray scale shape information is encoded using a block based motion compensated DCT similar to that of texture coding, allowing lossy coding only. Gray scale shape coding also makes use of binary shape coding for coding of its support.
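The alpha-plane semantics can be illustrated by a compositing sketch (this is how a decoder or compositor might use the decoded alpha values; it is not part of the standard itself):

```python
def composite(obj, alpha, bg):
    """Blend an object's pixels over a background using an 8-bit alpha
    plane: 0 = fully transparent, 255 = fully opaque."""
    return [[(a * o + (255 - a) * b) // 255
             for o, a, b in zip(orow, arow, brow)]
            for orow, arow, brow in zip(obj, alpha, bg)]
```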
`
4. Motion estimation and compensation tools
`
Motion estimation and compensation are commonly used to compress video sequences by exploiting temporal redundancies between frames. The approaches for motion compensation in the MPEG-4 standard are similar to those used in other video coding standards. The main difference is that the block-based techniques used in the other standards have been adapted to the VOP structure used in MPEG-4. MPEG-4 provides three modes for encoding an input VOP, as shown in Fig. 6, namely:
`
1. A VOP may be encoded independently of any other VOP. In this case the encoded VOP is called an Intra VOP (I-VOP).
2. A VOP may be predicted (using motion compensation) based on another previously decoded VOP. Such VOPs are called Predicted VOPs (P-VOPs).
3. A VOP may be predicted based on past as well as future VOPs. Such VOPs are called Bidirectional Interpolated VOPs (B-VOPs). B-VOPs may only be interpolated based on I-VOPs or P-VOPs.
`
Fig. 6. The three modes of VOP coding. I-VOPs are coded without any information from other VOPs. P- and B-VOPs are predicted based on I- or other P-VOPs.

Obviously, motion estimation is necessary only for coding P-VOPs and B-VOPs. Motion estimation (ME) is performed only for macroblocks in the
bounding box of the VOP in question. If a macroblock lies entirely within a VOP, motion estimation is performed in the usual way, based on block matching of 16×16 macroblocks as well as 8×8 blocks (in advanced prediction mode). This results in one motion vector for the entire macroblock, and one for each of its blocks. Motion vectors are computed to half-sample precision.
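Half-sample positions are obtained by interpolating between integer reference pixels. A rough sketch using bilinear averaging (the rounding convention here is illustrative, not the normative one):

```python
def half_pel(ref, r2, c2):
    """Sample a reference frame at half-pel position (r2/2, c2/2) by
    averaging the surrounding integer pixels (rounded average)."""
    r, c = r2 // 2, c2 // 2
    pts = [ref[r][c]]
    if r2 % 2:                       # vertical half position
        pts.append(ref[r + 1][c])
    if c2 % 2:                       # horizontal half position
        pts.append(ref[r][c + 1])
    if r2 % 2 and c2 % 2:            # diagonal half position
        pts.append(ref[r + 1][c + 1])
    return (sum(pts) + len(pts) // 2) // len(pts)
```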
For macroblocks that only partially belong to the VOP, motion vectors are estimated using the modified block (polygon) matching technique. Here, the discrepancy of the match is given by the sum of absolute differences (SAD), computed for only those pixels in the macroblock that belong to the VOP. In case the reference block lies on the VOP boundary, a repetitive padding technique assigns values to pixels outside the VOP. The SAD is then computed using these padded pixels as well. This improves efficiency by allowing more possibilities when searching for candidate pixels for prediction at the boundary of the reference VOP.
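The polygon matching criterion can be sketched as follows: the SAD is accumulated only over pixels whose shape mask marks them as belonging to the VOP. Names are ours, and the sketch assumes the caller has padded the reference so that displaced blocks stay in bounds:

```python
def polygon_sad(cur, ref, mask, dr, dc):
    """SAD between the current macroblock and the reference region
    displaced by (dr, dc), counted only over VOP pixels (mask == 1)."""
    sad = 0
    for r in range(len(cur)):
        for c in range(len(cur[0])):
            if mask[r][c]:
                sad += abs(cur[r][c] - ref[r + dr][c + dc])
    return sad

def best_vector(cur, ref, mask, search):
    """Exhaustive search over displacements in [-search, search]."""
    candidates = ((polygon_sad(cur, ref, mask, dr, dc), (dr, dc))
                  for dr in range(-search, search + 1)
                  for dc in range(-search, search + 1))
    return min(candidates)[1]
```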
For P- and B-VOPs, motion vectors are encoded as follows. The motion vectors are first differentially coded, based on up to three vectors of previously transmitted blocks. The exact number depends on the allowed range of the vectors. The maximum range is selected by the encoder and transmitted to the decoder, in a fashion similar to MPEG-2. Variable length coding is then used to encode the motion vectors.
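As a hedged sketch of the differential step, assuming the commonly described median predictor over the candidate vectors (a simplification of the normative rules):

```python
def mv_difference(mv, candidates):
    """Differential coding of a motion vector: predict each component
    as the median of the candidate vectors' components (up to three
    previously transmitted neighbours) and transmit the difference."""
    def median(vals):
        return sorted(vals)[len(vals) // 2]
    px = median([c[0] for c in candidates])
    py = median([c[1] for c in candidates])
    return mv[0] - px, mv[1] - py
```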
MPEG-4 also supports overlapped motion compensation, similar to the one used in H.263 [4]. This usually results in better prediction quality at lower bit-rates. Here, for each block of the macroblock, the motion vectors of neighboring blocks are considered. This includes the motion vector of the current block and those of its four neighbors. Each vector provides an estimate of the pixel value, and the actual predicted value is then a weighted average of all these estimates.
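A per-pixel sketch of that weighted averaging (in H.263-style overlapping the weights are fixed per-position matrices; here they are free parameters for illustration):

```python
def obmc_pixel(p_cur, p_neighbors, w_cur, w_neighbors):
    """Overlapped motion compensation for one pixel: a weighted,
    rounded average of the predictions obtained with the current
    block's vector and with its neighbours' vectors."""
    num = w_cur * p_cur + sum(w * p for w, p in zip(w_neighbors, p_neighbors))
    den = w_cur + sum(w_neighbors)
    return (num + den // 2) // den
```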
`
5. Texture coding tools
`
The texture information of a video object plane is present in the luminance, Y, and two chrominance components, Cb and Cr, of the video signal. In the case of an I-VOP, the texture information resides directly in the luminance and chrominance components. In the case of motion compensated VOPs, the texture information represents the residual error remaining after motion compensation. For encoding the texture information, the standard 8×8 block-based DCT is used. To encode an arbitrarily shaped VOP, an 8×8 grid is superimposed on the VOP. Using this grid, 8×8 blocks that are internal to the VOP are encoded without modifications. Blocks that straddle the VOP are called boundary blocks, and are treated differently from internal blocks. The transformed blocks are quantized, and individual coefficient prediction from neighboring blocks can be used to further reduce the entropy of the coefficients. This is followed by a scanning of the coefficients, to reduce the average run length between two coded coefficients. Then, the coefficients are encoded by variable length encoding. This process is illustrated in the block diagram of Fig. 7.
`
5.1. Boundary macroblocks
`
Macroblocks that straddle VOP boundaries, called boundary macroblocks, contain arbitrarily shaped
`
Fig. 7. VOP texture coding process.
`
`
texture data. A padding process is used to extend these shapes into rectangular macroblocks. The luminance component is padded on a 16×16 basis, while the chrominance blocks are padded on an 8×8 basis. Repetitive padding consists in assigning a value to the pixels of the macroblock that lie outside of the VOP. When the texture data is the residual error after motion compensation, the blocks are padded with zero values. Padded macroblocks are then coded using the technique described above.
For intra coded blocks, the padding is performed in a two-step procedure called Low Pass Extrapolation (LPE). This procedure is as follows:
1. Compute the mean of the pixels of the block that belong to the VOP, and use this mean value as the padding value, that is,
`
$$ f_{r,c}\,\big|_{(r,c)\notin\mathrm{VOP}} = \frac{1}{N}\sum_{(x,y)\in\mathrm{VOP}} f_{x,y}\,, \qquad (1) $$
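Both steps of the LPE procedure, the mean fill of Eq. (1) and the neighbour averaging of Eq. (2) below, can be sketched as follows. This is a loose sketch: the standard's exact neighbour-availability and rounding rules differ slightly.

```python
def lpe_pad(block, mask):
    """Two-step Low Pass Extrapolation padding of an intra boundary
    block.  Step 1: set every pixel outside the VOP to the mean of the
    VOP pixels.  Step 2: rescan row by row, replacing each exterior
    pixel with the average of its in-bounds 4-neighbours."""
    h, w = len(block), len(block[0])
    vop = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    mean = sum(block[r][c] for r, c in vop) / len(vop)
    out = [[block[r][c] if mask[r][c] else mean for c in range(w)]
           for r in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                continue
            neigh = [out[rr][cc]
                     for rr, cc in ((r, c - 1), (r - 1, c), (r, c + 1), (r + 1, c))
                     if 0 <= rr < h and 0 <= cc < w]
            out[r][c] = sum(neigh) / len(neigh)
    return out
```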
`
where N is the number of pixels of the macroblock in the VOP. This is also known as mean-repetition DCT.
2. Use the average operation given in Eq. (2) for each pixel f_{r,c}, where r and c are the row and column position of each pixel in the macroblock outside the VOP boundary. Start from the top left corner f_{0,0} of the macroblock and proceed row by row to the bottom right pixel:

$$ f_{r,c}\,\big|_{(r,c)\notin\mathrm{VOP}} = \frac{f_{r,c-1} + f_{r-1,c} + f_{r,c+1} + f_{r+1,c}}{4}\,. \qquad (2) $$

The pixels considered in the right-hand side of Eq. (2) should lie within the VOP; otherwise they are not considered and the denominator is adjusted accordingly.
Once the block has been padded, it is coded in a similar fashion to an internal block.

5.2. DCT

Internal video texture blocks and padded boundary blocks are encoded using a 2-D 8×8 block-based DCT. The accuracy of the implementation of the 8×8 inverse transform is specified by the IEEE 1180 standard, to minimize the accumulation of mismatch errors. The DCT transform is followed by a quantization process.

5.3. Quantization

The DCT coefficients are quantized as a lossy compression step. There are two types of quantization available. Both are essentially a division of the coefficient by a quantization step size. The first method uses one of two available quantization matrices to modify the quantization step size depending on the spatial frequency of the coefficient. The second method uses the same quantization step size for all coefficients. MPEG-4 also allows for a nonlinear quantization of DC values.

5.4. Coefficient prediction

The average energy of the quantized coefficients can be further reduced by using prediction from neighboring blocks. The prediction can be performed from either the block above, the block to the left, or the block above left, as illustrated in Fig. 8. The direction of the prediction is adaptive, and is selected based on a comparison of the horizontal and vertical DC gradients (the increase or reduction in value) of the surrounding blocks A, B and C.
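The adaptive selection of the prediction direction can be sketched as follows, using the DC values of the neighbouring blocks A (left), B (above-left) and C (above); this is a simplified form of the normative rule:

```python
def dc_predictor(dc_a, dc_b, dc_c):
    """Choose the DC predictor for the current block from its
    neighbours A (left), B (above-left), C (above).  If the horizontal
    gradient |A - B| is smaller than the vertical gradient |B - C|,
    the content varies more vertically, so predict from the block
    above; otherwise predict from the block to the left."""
    if abs(dc_a - dc_b) < abs(dc_b - dc_c):
        return 'above', dc_c
    return 'left', dc_a
```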
`
Fig. 8. Candidate blocks for coefficient prediction.
`
`
There are two types of prediction possible, DC prediction and AC prediction:
- DC prediction: The prediction is performed for the DC coefficient only, and is