`Robert J. Gove
`
`
`
Texas Instruments, Inc.
`
`
`
`Dallas, Texas 75265
`
`
`
`ABSTRACT
`
`
We introduce a new highly integrated processing chip that performs a variety of functions
and is particularly well suited to video compression algorithms. Applications include
multimedia PCs, virtual reality 3D graphics, full-duplex videoconferencing, HDTV, and color
hardcopy. We have architected the Multimedia Video Processor, or MVP, to provide a
previously unattainable level of performance from a single chip, yet with the
programmability typically found in today's general-purpose computers. While advanced
semiconductor design and process techniques have been used in its design, the key advantage
of this component lies in the optimization of its architecture for real-time video and
graphics processing. This paper analyzes video compression application requirements,
describes the MVP architecture, and presents its potential as a very capable solution for a
wide range of markets.
`
`
`
`
`
`
`
`
`INTRODUCTION
`
`
The computer and consumer video industries are pursuing varied paths to offer cost-effective
computing products that provide new forms of information and entertainment. Products are
emerging that range from cable TV delivery of interactive digital movies to digital mobile
offices. Digital compression and video processing at a reasonable cost are spurring this
revolution. While algorithm developments have been important, most of the enabling advances
lie in the availability of high-density memory and high-performance processing ICs. With the
pending general availability of the Multimedia Video Processor, or MVP, in 1994, a
previously unattained level of digital signal processing performance will be available, with
all the flexibility of present-day programmable computers. Standards-based videoconferencing
and playback of compressed digital video and audio (using Px64, JPEG, or MPEG
"multi-standard" codec systems) will be possible with a single MVP processor, as will codecs
using yet-to-be-defined algorithms such as model-based compression.
`
`
`
`
`
`
`
`
`
The MVP will support not only compression but also processing of high-resolution video,
full-motion video from sources like camcorders, digital audio processing, hardcopy raster
image processing, and 3D graphics, all under software control and generation. From this wide
range of functions, we calculated that several billion operations per second are required to
provide video-based applications on the desktop. Current and soon-to-appear desktop host
processors like x86, Pentium, Alpha, and MIPS do not have the computational power to meet
these demands.
`
`
`
`
`
`
`
`
`
`
`
`
`
`KEYS TO THE MVP ARCHITECTURE
`
`
`
`
`
`
`
The MVP's unique architecture and computational power enable users to integrate these varied
functions on a single processing component. The keys to obtaining both exceptional
processing speeds and fully programmable features with the MVP include the use of:
`
`
`
`
`
`
`
`
`
`
`
`
`
(1) an efficient parallel processing architecture,
(2) fast pixel processing tuned to image, video, and graphics processing,
(3) [item illegible in the source scan],
(4) single-chip integration without slower chip-to-chip communications.
`
`
`
`
`
`
`
`
`
`
`
`1068-0314/94 $3.00 © 1994 IEEE
`
`
`
`
`
`
`
`
`Samsung Exhibit 1006
`
`
`
`
`
multiply and accumulate for various types of filtering operations. To obtain good image
quality at any channel rate and 30 frames per second, the image processor must compute over
1.2 billion operations per second (BOPS).
`
`
`
`
`
`
`
`
`
`
`
The addition of audio compression (which requires higher-precision integer and possibly
floating-point algorithms) and network communication, necessary for videoconferencing
(G.728 or G.711, H.242, H.230, H.221), further increases the scope of computational
requirements. To reduce system cost, we propose to include support in the architecture for
the required non-standard functions such as color space conversion (YCrCb to RGB),
decimation of the source image to CIF resolution, and variable scaling of the decompressed
sequence. Complete implementation of compression applications such as videoconferencing
requires over 2 BOPS from the programmable image processor.
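As a rough illustration of two of the supporting functions named above, the sketch below
converts one YCrCb sample to RGB and decimates a row of samples 2:1. The coefficients are the
full-range ITU-R BT.601 (JFIF-style) values, which is an assumption: the paper does not say
which variant is intended, and the function names are hypothetical.

```python
def clamp8(x):
    """Round and clamp a value to the 8-bit range 0..255."""
    return max(0, min(255, int(round(x))))


def ycrcb_to_rgb(y, cr, cb):
    """Convert one 8-bit YCrCb sample to 8-bit RGB.

    ASSUMPTION: full-range BT.601 coefficients with chroma centered at 128.
    """
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp8(r), clamp8(g), clamp8(b)


def decimate_2to1(row):
    """Halve a row of samples by averaging adjacent pairs (simple box filter)."""
    return [(row[i] + row[i + 1] + 1) // 2 for i in range(0, len(row) - 1, 2)]
```

Each output pixel costs a handful of multiplies and adds, so at video rates these "minor"
functions contribute a noticeable share of the total operation budget.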
`
`
`
`
`
`
`
`
`
`
`ARCHITECTURE CHOICES
`
`
`
`
We considered several candidate parallel architectures for implementation of this
single-chip video processor [Gove-92, Guttag-92]. An architecture with a mix of dedicated and
programmable processors was initially evaluated, then discounted when no single dominant
function was found that was necessary almost all of the time. Besides, we predicted that by
the time the chip was completed, a new important algorithm would emerge. Given the loss of
silicon efficiency that comes from dedicating resources to any one function (like a DCT), we
felt compelled to seek a general-purpose, well-balanced system solution. Several other
candidates existed; however, the mix of algorithms and practical implementation limitations
focused us on SIMD and MIMD architectures. These differ in the autonomy of the processors:
with MIMD each processor functions independently, a desirable feature for any data-dependent
algorithm operating in parallel.
`
`
`
`
`
`
With MIMD desirable, the choice of a processor and memory interconnection architecture
remained. Pipelined, shared-bus memory, communication port (mesh/array/hypercube), and
crossbar fully shared memory were considered. Pipelined memory and processors (systolic
arrays) are typically used for video; however, they are too restrictive in the sense that
one must know a priori the size of the memory and the dynamics of the algorithm to prevent
data contention and processor stalls. With our varied needs, this would lead to
inefficiencies. A shared-bus memory structure would also have bottleneck problems with
highly variable instruction and data streams and with moving results from one processor to
another. The n-way connected communication port requires a very ordered flow of data, like a
systolic or wavefront flow, or the application of a pixel per processor (not practical in a
single chip). This approach works for large arrays of simple processors that can operate
uniformly on images; however, we wanted more complex processors that could adapt to varying
types of data, from bit graphics to floating-point representations. The crossbar fully
shared memory is ideally suited to these needs, minimizing contention and data movement and
providing flexibility for many types of algorithms. In fact, since the crossbar operates at
the processor instruction rate, this architecture can functionally emulate the other
approaches (pipeline, shared bus, ...).
`
`
`
`
`
`
`
`
`
`
`
We wanted not only to provide this order-of-magnitude performance increase, but also to
apply a traditional computer model of programmable processing and a large memory to
applications with integrated image, graphics, video, and audio processing, or image
computing. As shown in Figure #2, titled "MVP System Architecture", replacing the processing
and memory pipeline of conventional video systems with the single video processor and large
memory system model yields tremendous application flexibility. In effect, the system can
reconfigure itself with software from videoconferencing to playing CD movies, just as a PC
would reconfigure from a spreadsheet to a video game.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
dependent algorithms like object feature matching or Huffman coding and also supporting
traditional data-independent SIMD operations like convolution.
`
`
`
`
`
`
`
`
`
`
To prevent contention for memory or register access, a very wide instruction set in the
[text illegible in the source scan] the programmer to produce highly parallel, optimized
code. A performance penalty may result if only one highly serial task is performed
continuously; however, the very nature of image, video, graphics, and audio processing, with
varied concurrent and complex processing, prevents this from occurring. The MVP integrates
more functions than ever before into one chip while avoiding the compromises of other
architectures.
`
`
`
`
`
`
`
`
`
`
`
`
`Detailed Architecture Description:
`
`
`
Figure #1, titled "MVP Block Diagram", shows the MVP chip architecture. The Master Processor
(MP) provides a RISC processor for simple user interface, sequential processing, and
orchestration of multiple concurrent tasks operating on the entire MVP. The DSP Parallel
Processors (PP), of which four are included in the first version of the MVP, provide highly
optimized image/video/graphics/audio processing capabilities. The Transfer Controller (TC)
intelligently moves data and instructions on and off the MVP. All of these processors are
locally interconnected through a crossbar to 25 on-chip 2-KByte SRAM modules. Other features
include dual video frame timing generators (VC) and JTAG test and emulation circuits.
`
`
`
`
With five 32-bit programmable processors operating at a targeted clock rate of 50 MHz and
numerous parallel operations performed in each processor, over 2 billion operations per
second result. In addition, 100 MFLOPS (fully IEEE-754) can occur. The peak data transfer
rate is then 400 MBytes/second, adequate for many video applications. The internal bandwidth
over the crossbar between on-chip memory and processors is 2.4 GBytes/second.
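The headline figures can be sanity-checked with simple arithmetic. The sketch below is a
back-of-the-envelope calculation, not a statement of how the numbers were derived: it takes
the clock rate as 50 MHz, uses the paper's figure of up to 15 RISC-equivalent operations per
PP cycle, and assumes a 64-bit external data bus, which the text does not state.

```python
CLOCK_HZ = 50_000_000      # targeted rate, read here as 50 MHz
NUM_PP = 4                 # parallel processors in the first MVP version
OPS_PER_PP_CYCLE = 15      # "as many as 15 RISC operations" per PP cycle
BUS_BYTES = 8              # ASSUMPTION: 64-bit external data bus


def peak_pp_ops_per_second():
    """Upper bound on operations per second from the PPs alone."""
    return NUM_PP * OPS_PER_PP_CYCLE * CLOCK_HZ


def peak_external_bytes_per_second():
    """Peak off-chip transfer rate if one bus word moves every cycle."""
    return BUS_BYTES * CLOCK_HZ


def crossbar_bytes_per_cycle(crossbar_bytes_per_second=2.4e9):
    """Bytes the crossbar must carry each cycle to reach the quoted bandwidth."""
    return crossbar_bytes_per_second / CLOCK_HZ
```

Under these assumptions the PPs alone peak at 3 billion operations per second, consistent
with the "over 2 billion" claim; the external bus gives 400 MBytes/second; and the crossbar
figure corresponds to 48 bytes moved per cycle across all ports.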
`
`
`
`
`
`
`
`
`DSP PARALLEL PROCESSORS (PP)
`
`
`
`
`
`
The PP has many powerful features beyond those found in conventional DSPs. Practically all
video algorithms benefit from these features. Most of the features were added to permit
scalability within the PP: the same hardware supports either many simple functions (like bit
operations) in one cycle or fewer operations at higher precision (like 32 bits). The
following list gives each feature and its advantage:
`
`
`
`
`
- 44 user registers:
    - ease of programming/compiling and fast parallel functions.
- Single-cycle access into crossbar memory expands effective registers to 34K:
    - flexibility.
- Three-level, no-overhead instruction looping:
    - programming flexibility and faster tight loops (usually 20-30%).
- Double parallel transfer from memory with address update:
    - most algorithms need two pixels loaded per cycle.
- Three-operand ALU arithmetic and logical operations:
    - double-speed correlation and windows support.
- Splittable multiply (8x8=16 or 16x16=32):
    - double-speed pixel operations.
- Word/Halfword/Byte multiple arithmetic:
    - 4x on algorithms like motion estimation and 2x on fast DCTs.
- Flexible data path:
    - masking, merging, rotating, ... for bit stream coding (like Huffman).
- General-purpose use of address adders:
    - up to 6x number of adds in one cycle.
- Conditional operations prevent need for branching (and possible pipeline stalls):
    - adaptive algorithms will operate faster (like adaptive thresholding).
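The "multiple arithmetic" feature, in which one 32-bit ALU operates on four bytes at once, can
be imitated in software to show the idea. The sketch below is only an analogy for what the PP
does in hardware: it adds four 8-bit lanes packed in a 32-bit word while stopping carries at
lane boundaries. The helper names are hypothetical and do not come from the MVP tools.

```python
LOW7 = 0x7F7F7F7F    # low seven bits of every byte lane
HIGH1 = 0x80808080   # top bit of every byte lane


def packed_add_bytes(a, b):
    """Add four 8-bit lanes of two 32-bit words, wrapping modulo 256 per lane."""
    low = (a & LOW7) + (b & LOW7)           # carries cannot leave a lane
    return (low ^ ((a ^ b) & HIGH1)) & 0xFFFFFFFF


def pack(b3, b2, b1, b0):
    """Pack four bytes into one 32-bit word, b3 in the most significant lane."""
    return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0


def unpack(w):
    """Split a 32-bit word into its four byte lanes."""
    return ((w >> 24) & 0xFF, (w >> 16) & 0xFF, (w >> 8) & 0xFF, w & 0xFF)
```

One word-wide operation here replaces four byte-wide additions, which is the source of the
4x figure quoted for byte-oriented kernels such as motion estimation.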
`
`
`
`
`
`
`
`
`
`
`
As a result, as many as 15 RISC operations can be performed in one PP cycle. When multiplied
by the number of PPs and added to the MP and FPU operations, a formidable number results. In
addition, since the C compiler also influenced the architecture of the PP, many of these
features will automatically compile into fast code; many users of the MVP will not need to
understand the PP architecture to take advantage of its performance.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`MASTER PROCESSOR (MP)
`
`
`
`
`
The MP is a general-purpose RISC processor with an integral IEEE-compatible floating-point
unit. A 32-bit instruction is accessed from a 4-KByte instruction cache. Data loads can be 8,
16, 32, or 64 bits from a 4-KByte data cache or from any data module via the crossbar. The MP
has thirty-one usable 32-bit registers. Uncommon features include:
`
`
`
`
`
`
`
`
`
`
`
- Register files common to floating-point and integer operations.
- Scoreboard keeps track of the results of loads and FPU operations, preventing use until
  updated.
- Addressing modes support optional updating of the base-address register with the result of
  the address computation.
- Special FPU instruction permits a new multiply, add/subtract, and increment each cycle.
- Left-most and right-most one logic.
- Both endians supported.
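The "left-most and right-most one" logic in the list above is useful for bitstream work such
as normalizing values or skipping runs of zero bits in variable-length codes. The paper does
not define the exact semantics of the MP instructions, so the sketch below simply assumes
they return the bit position of the highest or lowest set bit in a 32-bit word; the function
names are hypothetical.

```python
def leftmost_one(x):
    """Index (0 = least significant) of the highest set bit, or -1 if x is 0."""
    x &= 0xFFFFFFFF
    return x.bit_length() - 1


def rightmost_one(x):
    """Index (0 = least significant) of the lowest set bit, or -1 if x is 0."""
    x &= 0xFFFFFFFF
    if x == 0:
        return -1
    return (x & -x).bit_length() - 1   # x & -x isolates the lowest set bit
```

In software each of these would otherwise take a loop or table lookup; a single-cycle
hardware version removes that cost from inner loops of entropy decoders.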
`
`
`
`
`
`
`
`
`
`
`
`
Since the MP was designed to execute C programs efficiently and has added hardware for
bitstream processing, it performs exceptionally well as the controller and data
interpretation processor. The floating-point capability accelerates and simplifies
programming of high-precision applications like medical imaging and 3D graphics.
`
`
`
`
`
`
`
`
`SHARED MEMORY & TRANSFER CONTROLLER (TC)
`
`
`
`
`
`
`
`
`
`
`
`
`
`
Much of the advantage of the MVP architecture lies in the memory and data I/O architecture.
Each processor and memory is fully interconnected through the crossbar and switchable at
instruction rates. With greater than 500 signal lines switching at nanosecond speeds, the
crossbarred memory architecture is only possible with single-chip implementation. With
adequate on-chip memory and the ability to reconnect the next processor to the data memory,
rather than moving the data to another memory, the on-chip data does not need to move as
often. In effect, the original requirement of billions of bytes/second of data transfer is
reduced to only hundreds of MBytes/second. This model works well as long as the algorithm
uses localized regions of data (patches, blocks, neighborhoods, rows, ...), each of which
fits into the on-chip memory and is accessed in repeated or predictable patterns. While this
usually occurs with image processing, an extremely intelligent transfer controller was
architected to help ensure the validity of this assumption. The TC has numerous modes of
transferring data on or off chip, each optimized for a particular type of dataflow (block,
patch, fat line, indexed or guided patches, ...). Most importantly, the on-chip SRAM was
architected with sufficient size and modularity to permit double-buffering of data I/O on and
off the chip while the on-chip processors access the other on-chip memory modules. In effect,
practically no overhead is required for video I/O. Many convenient methods were designed into
the TC to prioritize these accesses. In addition, we included support for most commodity
memory components (VRAM, SRAM, DRAM). Finally, we devised several methods to mitigate any
contention between the processors for a particular memory module. Both round-robin and
fixed-priority schemes are available to give developers flexibility in structuring their
algorithms to reduce contention. For the many image and video algorithms developed for the
MVP thus far, contention has not been a problem.
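The double-buffering arrangement described above can be pictured with a small simulation. The
sketch below is a sequential model, written under the assumption of just two buffers and one
consumer: in each step the "transfer" side fills one buffer with the next block while the
"compute" side works on the other, and the roles then swap. On the real chip the two
activities run concurrently; the function and variable names are hypothetical.

```python
def run_double_buffered(blocks, kernel):
    """Process blocks using two alternating buffers.

    Returns (results, log), where each log entry is
    (step, buffer_being_filled_or_None, buffer_being_computed).
    """
    buffers = [None, None]
    results, log = [], []
    fill = 0
    if blocks:
        buffers[fill] = list(blocks[0])          # prime the first buffer
    for step in range(len(blocks)):
        compute = fill                           # most recently filled buffer
        fill = 1 - fill                          # the other buffer is now free
        nxt = step + 1
        if nxt < len(blocks):
            buffers[fill] = list(blocks[nxt])    # transfer side loads next block
            log.append((step, fill, compute))
        else:
            log.append((step, None, compute))    # nothing left to load
        results.append(kernel(buffers[compute])) # compute side uses the other
    return results, log
```

For example, `run_double_buffered([[1, 2], [3, 4], [5, 6]], sum)` processes the blocks in
order while the log shows the fill and compute buffers alternating, which is why the transfer
time can be hidden behind computation when both fit in on-chip memory.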
`
`
`
`
`
`
`
`
`
`
`
`
`
`
Another advantage of the crossbar architecture is expandability. We can design many
different MVP chips as a function of the number of PPs. We simply slice the architecture,
cutting or adding PPs and memory modules. Conceptually, the advantage of this approach is
that, with the same package and pin-out, several different performance and price points can
be offered. A range of applications may require a range of different MVP chips. Applications
that require CCIR 601 studio-quality video and/or multifunction processing
(graphics/audio/video) would most likely require an MVP with 4 processors. On the other
hand, a more dedicated or single-function application like graphics may require fewer PPs.
In addition, if only limited-resolution video (QCIF) processing is necessary, again a small
number of PPs could suffice. We anticipate various versions of the MVP in the future.
`
`
`
`VIDEO CONTROLLER (VC)
`
`
`
`
`
In addition, the MVP has two programmable timing controllers for generation of video and
other timing signals. As an example, video frame grabbing and display require many pixel,
horizontal, and vertical signals for synchronization of the external logic in the system.
The MVP has internal logic to generate those signals under program control, relieving the
system designer of designing external logic to perform those functions.
`
`
`
`
`
`
`
`
`
`
`
`NEW DCT ALGORITHMS FOR COMPRESSION
`
`
`
`
`
`SPEED-UP WITH USE OF A PROGRAMMABLE ARCHITECTURE
`
`
`
`
`
`
`
`
`
One advantage of a programmable compression chip is the optimization possible by selecting
the least computationally demanding DCT algorithm that meets the accuracy required by the
application. For example, fast DCT algorithms like those of Lee and Chen [Lee-84, Chen-77]
have a considerable advantage over traditional matrix-multiply approaches (a speedup of a
factor of 5 or more). Separability of the 2D DCT is generally used to decompose DCTs (with
successive processing of the individual rows and columns of an image). The size of the DCT
directly influences the benefit of separability; however, with the 8x8 DCTs of most
standards, a definite speedup results. The Lee algorithm tends to be easy to implement and
faster to compute, although it has accuracy issues. The Chen algorithm is harder to implement
and computationally slower, but has good accuracy. Depending on the available processing
bandwidth, the encoder can select an appropriate DCT or IDCT algorithm to perform the task.
If errors result, different coding decisions follow, and either lower SNRs or lower
compression ratios occur.
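The benefit of separability can be made concrete with a small reference implementation. The
sketch below is not the Lee or Chen fast algorithm; it uses plain direct summation with the
orthonormal DCT-II definition (an assumption about normalization) to show that eight row
transforms followed by eight column transforms give the same 8x8 result as the direct 2D
definition.

```python
import math

N = 8  # block size used by most standards mentioned in the text


def scale(k):
    """Orthonormal DCT-II scale factor."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def dct_1d(v):
    """8-point DCT-II by direct summation (64 multiply-accumulates)."""
    return [
        scale(k) * sum(v[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                       for n in range(N))
        for k in range(N)
    ]


def dct_2d_separable(block):
    """Row transforms followed by column transforms: 16 one-dimensional DCTs."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d([rows[i][j] for i in range(N)]) for j in range(N)]
    return [[cols[j][i] for j in range(N)] for i in range(N)]  # transpose back


def dct_2d_direct(block):
    """Direct 2D definition: 64 outputs, each a 64-term sum."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for i in range(N):
                for j in range(N):
                    s += (block[i][j]
                          * math.cos((2 * i + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * j + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * s
    return out
```

With precomputed cosine tables, the direct form needs about 64 x 64 = 4096 multiply-accumulates
per block, while the separable form needs 16 x 64 = 1024; fast 1D algorithms then reduce the
cost of each of the 16 transforms further.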
`
`
`
`
`
`
`
`
`
`
`
`
`
In addition, we devised a "Truncated-IDCT" algorithm to exploit the advantages of a
programmable architecture. Since the DCT, quantizer, and thresholding operations seek to
minimize the population of selective frequency coefficients (for high compression),
statistically most of the high-frequency coefficients are zero valued. Therefore, the
conventional IDCT acts on 8x8 matrices with a high percentage of zero-valued inputs. We can
then significantly reduce the number of IDCT operations performed by not executing the
zero-valued multiplies and adds (similar work has been reported [McMillan-92]). This is only
possible with software-based IDCTs. In implementation, the program adaptively truncates the
8x1 IDCT summation in the vertical direction based on the run-length encoded input values.
Selecting a 4x1 summation when appropriate shortens the process further, although this
applies less frequently. With this approach, a speedup of a factor of 3 or more on IDCTs will
usually occur.
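The truncation idea can be sketched for a single 8-point column. The paper gives no
implementation details, so the code below is one plausible reading rather than the author's
method: it finds the last nonzero coefficient and sums only up to that index, reporting how
many multiplies were performed. The orthonormal basis and all names are assumptions.

```python
import math

N = 8

# BASIS[k][n] = c(k) * cos((2n + 1) * k * pi / 16), the orthonormal IDCT basis
BASIS = [
    [(math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N))
     * math.cos((2 * n + 1) * k * math.pi / (2 * N))
     for n in range(N)]
    for k in range(N)
]


def idct_1d_full(coeffs):
    """Conventional 8-point IDCT: always 64 multiplies."""
    return [sum(coeffs[k] * BASIS[k][n] for k in range(N)) for n in range(N)]


def idct_1d_truncated(coeffs):
    """IDCT that skips trailing zero coefficients.

    Returns (samples, multiplies_performed).
    """
    last = -1
    for k in range(N):
        if coeffs[k] != 0:
            last = k                      # remember the last nonzero index
    if last < 0:
        return [0.0] * N, 0               # all-zero column: nothing to compute
    terms = last + 1
    samples = [sum(coeffs[k] * BASIS[k][n] for k in range(terms))
               for n in range(N)]
    return samples, terms * N
```

For a column such as `[10, -3, 0, 2, 0, 0, 0, 0]` the truncated version performs 32 multiplies
instead of 64 and returns the same samples; a decoder could obtain the last nonzero index
directly from the run-length data instead of scanning for it.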
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`TOOLS
`
`
`
`
`
`
`
`
`
`
`
Advances in video compression have been limited by the availability of tools to develop
software and hardware. With the MVP, TI offers a range of software tools and direct on-chip
support for in-circuit debug. A real-time executive, C++ compilers, an algebraic assembler, a
windowed high-level language debugger (with JTAG emulation hardware on chip), and a library
of primitives/applications, all tools familiar to computer application developers, will now
be available for development of video applications.
`
`
`
`
`
`
`
`
`
`
The software model for the MVP is based on two levels. The primary level consists of the
Master Processor acting as a director and scheduler of the MVP's parallelism. The Executive
runs on the MP, performing those supervisory tasks. The Executive can dispatch tasks for
operation in pipeline, parallel, or any other arrangement on any processor within the MVP.
Under that is a level that actually performs the tasks on each processor, accessed through
either (1) a library of primitives, (2) application tools for programming in assembly, or
(3) a high-level language compiler. Each of these methods has advantages, with the
performance and the skill level required to code the chip varying with the particular
application. Nothing but software convention restricts the use of any processor as the
master or a slave processor.
`
`
`
`
`
`COMPETING VIDEO COMPRESSION CHIP ARCHITECTURES
`
`
`
`
`
`
`
Several semiconductor companies have reported activity in video compression chip or chip set
solutions [Bolton-93] [Konstantinides-92]. Most chip manufacturers are proposing hardwired or
parameterized architectures, without C-level programmability. Our MVP is an exception. In
addition, most of the other "programmable" approaches are based on an architecture that
integrates dedicated logic modules, like DCTs and motion estimators, with only the controller
programmable. This limits their efficiency, since the silicon devoted to those functions must
always be kept busy with those functions to justify its cost. In contrast, the MVP
architecture has no dedicated logic, permitting the user to balance silicon use according to
the varied and dynamic computational demands of compression. Other researchers have recently
described simulations that support our position that universally programmable architectures
are competitive solutions when compared with dedicated or hybrid architectures for video
decompression [Mayer-93].
`
`
`
`
`
`
`
Many different architectures are proposed, in development, or currently available; however,
none except the MVP has the flexibility and computational performance to meet the complete
demands of truly integrated digital video on the desktop, including the complete concert of
real-time video and audio compression with image and 3D graphics processing. This covers not
only compression and decompression, but also system-level bit-stream control, video scaling,
error correction, and even audio echo cancellation. Architecture limitations and transistor
counts limit other chips to subsets of these functions.
`
`
`
`
`
`
`
`
`
`
`
`CONCLUSION
`
`
`
The MVP is a monolithic single-chip parallel processor that performs compression processing,
audio and video processing, 3D graphics, and other functions, even at the same time. Over 2
billion operations are performed per second. This dramatic performance boost will enable a
wide range of new applications, including desktop interactive digital video. Integrating
fully programmable parallel DSP processors with a RISC processor on one chip provides
software flexibility and system adaptability. A new parallel architecture, using a crossbar
network to couple the processors and large on-chip SRAMs, and with MIMD (Multiple Instruction
Multiple Data) operation, yields extremely high efficiency for most image, graphics, and
video algorithms. Software tools like real-time executives, assemblers, and compilers all
help bring a familiar computer programming model to multidimensional signal processing. This
new technology frees developers of compression algorithms to optimize implementations of
standard video and audio compression algorithms without the restrictions found in today's
compression chips (those which are limited to current interpretations or versions of the
standards). In addition, algorithm developers can implement future compression algorithms
without the difficulties of developing new chips or adapting existing chips.
`
`
`
`
`
The MVP supports a wide range of open standards for video compression and image computing.
The variation within each standard, which promotes creative and distinguishing advantages in
the marketplace, and the constant urge to optimize the standard for a particular range of
markets both work against fixed hardware solutions. This programmable, integrated solution
gives system designers the flexibility to develop competitive algorithms as well as adapt to
emerging standards.
`
`
`
`
`
`
`
`ACKNOWLEDGMENTS
`
`
The author wishes to thank Jeremiah Golston, Dr. Chris Read, and Dr. V. Venkateswar for
compression algorithm work relating to the MVP. In addition, thanks to the MVP Program
Manager, Walt Bonneau, for developing and motivating a "world-class" team. A special thanks
to my original co-MVP-architects, Keith Balmer, Karl Guttag, and Nick Ing-Simmons. Finally,
thanks to the entire "Team-MVP"!
`
`
`
`
`
`
`
`REFERENCES
`
`
`
[Bolton-93] Bolton, M., "A Family of MPEG Video Encoder and Decoder Chips", IEEE Proceedings
of Conference on Hot Chips, 1993.

[Chen-77] Chen, W.H., C.H. Smith, and S.C. Fralick, "A Fast Computational Algorithm for the
Discrete Cosine Transform", IEEE Transactions on Communications, Vol. 25, pp. 1004-1009,
Sept. 1977.

[Gove-92] Gove, R.J., "Architectures for Single-Chip Image Computing", SPIE Proceedings of
Conf. on Image Processing and Interchange, San Jose, CA, Feb. 1992.

[Guttag-92] Guttag, K.M., R.J. Gove, and J.R. Van Aken, "A Single-Chip Multiprocessor for
Multimedia: The MVP", IEEE Computer Graphics & Applications, pp. 53-64, Nov. 1992.

[Konstantinides-92] Konstantinides, K., and V. Bhaskaran, "Monolithic Architectures for Image
Processing & Compression", IEEE Computer Graphics & Applications, pp. 75-86, Nov. 1992.

[Lee-84] Lee, B.G., "A New Algorithm for the Discrete Cosine Transform", IEEE Trans. on
Acoustics, Speech, and Signal Processing, Vol. 32, pp. 1243-1245, 1984.

[Mayer-93] Mayer, A.C., "The Architecture of a Processor Array for Video Decompression", IEEE
Trans. on Consumer Electronics, Vol. 39, No. 3, pp. 565-569, Aug. 1993.

[McMillan-92] McMillan, L., and L. Westover, "A Forward-Mapping Realization of the Inverse
Discrete Cosine Transform", IEEE Proc. of the Data Compression Conference, pp. 219-228, 1992.
`
`
`