The MVP: A Highly-Integrated Video Compression Chip
Robert J. Gove
Texas Instruments, Inc.
Dallas, Texas 75265

ABSTRACT

We introduce a new highly integrated processing chip that performs a variety of functions and is particularly well suited to video compression algorithms. Applications include multimedia PCs, virtual reality 3D graphics, full-duplex videoconferencing, HDTV, and color hardcopy. We have architected the Multimedia Video Processor, or MVP, to provide a previously unattainable level of performance from a single chip, yet with the programmability typically found in today's general-purpose computers. While advanced semiconductor design and process techniques have been used in its design, the key advantage of this component lies in the optimization of its architecture for real-time video and graphics processing. This paper analyzes video compression application requirements, describes the MVP architecture, and presents its potential as a very capable solution for a wide range of markets.
`
INTRODUCTION

The computer and consumer video industries are pursuing varied paths to offer cost-effective computing products which provide new forms of information and entertainment. Products are emerging that range from cable TV delivery of interactive digital movies to digital mobile offices. Digital compression and video processing at a reasonable cost are spurring this revolution. While algorithm developments have been important, most of the enabling advances lie in the availability of high-density memory and high-performance processing ICs. With the pending general availability of the Multimedia Video Processor, or MVP, in 1994, a previously unattained level of digital signal processing performance will be available, with all the flexibility of present-day programmable computers. Standards-based videoconferencing and playback of compressed digital video and audio (using Px64, JPEG, or MPEG "multi-standard" codec systems) with a single MVP processor will be possible, as well as codecs with yet-to-be-defined algorithms like model-based compression. However, the MVP will not only support compression; it will also handle processing of high-resolution video, full-motion video processing from sources like camcorders, digital audio processing, hardcopy raster image processing, and 3D graphics, all under software control and generation. From this wide range of functions, we calculated that several billion operations per second are required to provide video-based applications on the desktop. Current and soon-to-appear desktop host processors like x86, Pentium, Alpha, and MIPS do not have the computational power to meet these demands.
`
KEYS TO THE MVP ARCHITECTURE

The MVP's unique architecture and computational power enable users to integrate these varied functions on a single processing component. The keys to obtaining both exceptional processing speeds and fully programmable features with the MVP include the use of:

(1) an efficient parallel processing architecture,
(2) fast pixel processing tuned to image, video, and graphics processing,
(3) intelligent control [remainder of this item illegible in the source scan], and
(4) single-chip integration without slower chip-to-chip communications.
`
1068-0314/94 $3.00 © 1994 IEEE
`
multiply and accumulate for various types of filtering operations. To obtain good image quality at any channel rate and 30 frames per second, the image processor must compute over 1.2 billion operations per second (BOPS).

The addition of audio compression (which requires higher-precision integer and possibly floating-point algorithms) and network communication, both necessary for videoconferencing (G.728 or G.711, H.242, H.230, H.221), further increases the scope of computational requirements. To reduce system cost, we propose to include support in the architecture for the required non-standard functions like color space conversion (YCrCb to RGB), decimation of the source image to CIF resolution, and variable scaling of the decompressed sequence. Complete implementation of compression applications such as videoconferencing requires over 2 BOPS from the programmable image processor.
`
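The YCrCb-to-RGB color space conversion named among the non-standard functions can be illustrated with a short sketch. The Python below converts one pixel using fixed-point integer arithmetic; the BT.601-style coefficients, the full-range (0..255) assumption, and the function names are illustrative assumptions and are not taken from the paper or from any MVP software.

```python
def clamp8(x):
    """Clamp an integer to the 8-bit range 0..255."""
    return 0 if x < 0 else 255 if x > 255 else x

def ycrcb_to_rgb(y, cr, cb):
    """Convert one full-range YCrCb pixel to RGB (assumed BT.601 weights).

    Coefficients are scaled by 2**16 so only integer multiplies, adds,
    and shifts are needed, as on a fixed-point pixel processor.
    """
    c = cr - 128  # centered red-difference chroma
    d = cb - 128  # centered blue-difference chroma
    r = y + ((91881 * c + 32768) >> 16)              # 1.402 * Cr
    g = y - ((22554 * d + 46802 * c + 32768) >> 16)  # 0.344 * Cb + 0.714 * Cr
    b = y + ((116130 * d + 32768) >> 16)             # 1.772 * Cb
    return clamp8(r), clamp8(g), clamp8(b)
```

A neutral gray pixel (Cr = Cb = 128) maps to equal R, G, and B values, which is a quick sanity check on any such conversion.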
`
ARCHITECTURE CHOICES

We considered several candidate parallel architectures for implementation of this single-chip video processor [Gove-92, Guttag-92]. An architecture with a mix of dedicated and programmable processors was initially evaluated, then discounted when no single dominant function was found that was necessary almost all of the time. Besides, we predicted that by the time the chip was completed, a new important algorithm would emerge. Given the loss of silicon efficiency from dedicating resources to any one function (like a DCT), we felt compelled to seek a general-purpose, well-balanced system solution. Several other candidates existed; however, the mix of algorithms and practical implementation limitations focused us on SIMD and MIMD architectures. These differ in the autonomy of the processors' functions; MIMD provides that autonomy, a desirable feature for any data-dependent algorithm operating in parallel.

With MIMD desirable, the choice of a processor and memory interconnection architecture remained. Pipelined, shared-bus memory, communication port (mesh/array/hypercube), and crossbar fully-shared memory architectures were considered. Pipelined memory and processors (systolic arrays) are typically used for video; however, they are too restrictive in that one must know a priori the size of the memory and the dynamics of the algorithm to prevent data contention and processor stalls. With our varied needs, this would lead to inefficiencies. A shared-bus memory structure would also have bottleneck problems with highly variable instruction and data streams and with moving results from one processor to another. The n-way connected communication port requires a very ordered flow of data, like a systolic or wavefront flow, or the application of a pixel per processor (not practical in a single chip). This approach works for large arrays of simple processors which can operate uniformly on images; however, we wanted more complex processors which could adapt to varying types of data, from bit graphics to floating-point representations. The crossbar fully-shared memory is ideally suited to these needs, minimizing contention and data movement and providing flexibility for many types of algorithms. In fact, since the crossbar operates at the processor instruction rate, this architecture can functionally emulate the other approaches (pipeline, shared bus...).
`
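The contention argument for a crossbar over a shared bus can be illustrated with a toy model. The sketch below counts the cycles needed to serve a fixed set of memory requests under two assumptions that are ours, not the paper's: a shared bus completes one access per cycle in total, while a crossbar lets each memory module serve one processor per cycle, with losers retrying. The function names and request pattern are hypothetical and do not describe the MVP's actual arbitration.

```python
def bus_cycles(requests):
    """Shared bus: only one access (to any module) completes per cycle."""
    return sum(len(queue) for queue in requests)

def crossbar_cycles(requests):
    """Crossbar: each memory module serves one processor per cycle.

    requests[p] is the ordered list of module numbers processor p needs.
    A processor that loses arbitration retries the same access next cycle.
    """
    pending = [list(q) for q in requests]
    cycles = 0
    while any(pending):
        busy = set()                  # modules already granted this cycle
        for q in pending:             # lower index = higher priority here
            if q and q[0] not in busy:
                busy.add(q[0])
                q.pop(0)
        cycles += 1
    return cycles
```

When four processors each touch different modules in every cycle, the crossbar finishes in as many cycles as the longest request list, whereas the bus serializes every access. When all processors target the same module, the two models take the same time, which is why spreading data across modules matters.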
`
We not only wanted to provide this order-of-magnitude performance increase; the goal was also to apply a traditional computer model of programmable processing and a large memory to applications with integrated image, graphics, video, and audio processing, or image computing. As shown in Figure #2, titled "MVP System Architecture", replacing the processing and memory pipeline of conventional video systems with the single video processor and large memory system model yields tremendous application flexibility. In effect, the system can reconfigure itself with software from videoconferencing to playing CD movies, just as a PC would reconfigure from a spreadsheet to a video game.
`
dependent algorithms like object feature matching or Huffman coding and also supporting traditional data-independent SIMD operations like convolution.

To prevent contention for memory or register access, a very wide instruction set [remainder of this sentence illegible in the source scan]. This flexibility permits the programmer to produce highly parallel, optimized code. A performance penalty may result if only one highly serial task is performed continuously; however, the very nature of image, video, graphics, and audio processing, with varied concurrent and complex processing, prevents this from occurring. The MVP integrates more functions than ever before into one chip, while avoiding the compromises of other architectures.
`
Detailed Architecture Description:

Figure #1, titled "MVP Block Diagram", shows the MVP chip architecture. The Master Processor (MP) provides a RISC processor for simple user interface, sequential processing, and orchestration of multiple concurrent tasks operating on the entire MVP. The DSP Parallel Processors (PP), of which four will be designed into the first version of the MVP, provide highly optimized image/video/graphics/audio processing capabilities. The Transfer Controller (TC) intelligently moves data and instructions on and off the MVP. All of these processors are locally interconnected with a crossbar to 25 on-chip 2 KByte SRAM modules. Other features include dual video frame timing generators (VC) and JTAG test and emulation circuits.

With five 32-bit programmable processors operating at a targeted rate of 50 MHz and numerous parallel operations performed in each processor, over 2 billion operations per second result. In addition, 100 MFLOPS (fully IEEE-754) can occur. The peak data transfer rate is then 400 MBytes/second, adequate for many video applications. The internal bandwidth over the crossbar between on-chip memory and processors is 2.4 GBytes/second.
`
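The headline figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the paper (50 MHz, four PPs, the "as many as 15 RISC operations" per PP cycle stated in the PP section, 400 MBytes/second, 2.4 GBytes/second, and 25 modules of 2 KBytes). Reading the 400 MBytes/second figure as the off-chip rate, and reading the per-cycle results as data-path widths, are assumptions rather than statements from the source.

```python
CLOCK_HZ = 50_000_000        # targeted rate quoted in the text
NUM_PP = 4                   # parallel processors in the first MVP
OPS_PER_PP_CYCLE = 15        # "as many as 15 RISC operations" per PP cycle

peak_pp_ops = NUM_PP * OPS_PER_PP_CYCLE * CLOCK_HZ      # peak, PPs only
external_bytes_per_cycle = 400_000_000 // CLOCK_HZ      # assumed off-chip
crossbar_bytes_per_cycle = 2_400_000_000 // CLOCK_HZ    # on-chip crossbar
on_chip_sram_bytes = 25 * 2 * 1024                      # 25 modules x 2 KBytes

print(peak_pp_ops / 1e9, "billion PP operations/second (peak)")
print(external_bytes_per_cycle, "bytes per cycle for the 400 MBytes/second rate")
print(crossbar_bytes_per_cycle, "bytes per cycle across the crossbar")
print(on_chip_sram_bytes // 1024, "KBytes of on-chip SRAM")
```

The PP-only peak of 3 billion operations per second is consistent with the "over 2 billion" claim, and 8 bytes per cycle would correspond to a 64-bit data path, although the paper doesn't state a bus width in this section.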
`
DSP PARALLEL PROCESSORS (PP)

The PP has many powerful features beyond those found in conventional DSPs. Practically all video algorithms benefit from these features. Most of the features were added to permit scalability within the PP, supporting either many simple functions (like bit operations) in one cycle or fewer operations with the same hardware at higher precision (like 32 bits). The following list gives each feature and its advantage:

- 44 user registers:
  - ease of programming/compiling and fast parallel functions.
- Single-cycle access into crossbar memory expands effective registers to 34K:
  - flexibility.
- Three-level, no-overhead instruction looping:
  - programming flexibility and faster tight loops (usually 20-30%).
- Double parallel transfer from memory with address update:
  - most algorithms need two pixels loaded per cycle.
- Three-operand ALU arithmetic and logical operations:
  - double-speed correlation and windows support.
- Splittable multiply (8x8=16 or 16x16=32):
  - double-speed pixel operations.
- Word/halfword/byte multiple arithmetic:
  - 4x on algorithms like motion estimation and 2x on fast DCTs.
- Flexible data path:
  - masking, merging, rotating... for bit-stream coding (like Huffman).
- General-purpose use of address adders:
  - up to 6x number of adds in one cycle.
- Conditional operations prevent need for branching (and possible pipeline stalls):
  - adaptive algorithms will operate faster (like adaptive thresholding).

As a result, as many as 15 RISC operations will be performed in one PP cycle. When multiplied by the number of PPs and added to the MP and FPU operations, a formidable number results. In addition, since the C compiler also influenced the architecture of the PP, many of these features will automatically compile into fast code; many users of the MVP will not need to understand the PP architecture to take advantage of its performance.
`
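Word/halfword/byte "multiple arithmetic" treats one 32-bit word as several independent narrow lanes. As a rough software analogy, the sketch below computes a sum of absolute differences, the core of block-matching motion estimation, over pixels packed four per word. The helper names are hypothetical, and the inner lane loop only emulates what split-ALU hardware would do in one operation; it is not a model of the PP instruction set.

```python
def unpack4(word):
    """Split a 32-bit word into four 8-bit lanes (least significant first)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def pack4(lanes):
    """Pack four 8-bit values into one 32-bit word."""
    return sum((v & 0xFF) << (8 * i) for i, v in enumerate(lanes))

def sad_packed(words_a, words_b):
    """Sum of absolute differences over pixels packed four per word."""
    total = 0
    for wa, wb in zip(words_a, words_b):
        # One iteration stands in for one split-ALU operation on 4 pixels.
        total += sum(abs(a - b) for a, b in zip(unpack4(wa), unpack4(wb)))
    return total
```

Processing four 8-bit pixels per 32-bit word is one plausible reading of the "4x on algorithms like motion estimation" advantage listed above.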
`
MASTER PROCESSOR (MP)

The MP is a general-purpose RISC processor with an integral IEEE-compatible floating-point unit. A 32-bit instruction is accessed from a 4 KByte instruction cache. Data loads can be 8, 16, 32, or 64 bits from a 4 KByte data cache or from any data module via the crossbar. The MP has thirty-one 32-bit usable registers. Uncommon features include:

- Register files common to floating-point and integer operations.
- Scoreboard keeps track of results of loads and FPU operations, preventing use until updated.
- Addressing modes support optional updating of the base-address register with results of the address computation.
- Special FPU instruction permits a new multiply, add/subtract, and increment each cycle.
- Left-most and right-most one logic.
- Both endians supported.

Since the MP was designed to efficiently execute C programs and has added hardware for bitstream processing, it performs exceptionally well as the controller and data interpretation processor. The floating-point capability accelerates and simplifies programming of high-precision applications like medical imaging and 3D graphics.
`
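Left-most-one and right-most-one detection is useful in bitstream processing, for example to count leading zeros while parsing variable-length codes. A minimal Python sketch of the two operations on 32-bit values follows; the function names and the convention of returning -1 for a zero input are assumptions and do not describe the MP's actual instruction semantics.

```python
def leftmost_one(x):
    """Bit index (0 = LSB) of the most significant set bit, or -1 if x == 0."""
    x &= 0xFFFFFFFF
    return x.bit_length() - 1

def rightmost_one(x):
    """Bit index (0 = LSB) of the least significant set bit, or -1 if x == 0."""
    x &= 0xFFFFFFFF
    return (x & -x).bit_length() - 1
```

With these helpers, the number of leading zeros in a nonzero 32-bit word is `31 - leftmost_one(word)`.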
`
SHARED MEMORY & TRANSFER CONTROLLER (TC)

Much of the advantage of the MVP architecture lies in the memory and data I/O architecture. Each processor and memory is fully interconnected through the crossbar and switchable at instruction rates. With greater than 500 signal lines switching at nanosecond speeds, the crossbarred memory architecture is only possible with single-chip implementation. With adequate on-chip memory and the ability to reconnect the next processor to the data memory, rather than moving the data to another memory, on-chip data does not need to move as often. In effect, the original requirement of billions of bytes/second of data transfer is reduced to only hundreds of MBytes/second. This model works well as long as the algorithm uses localized regions of data (patches, blocks, neighborhoods, rows...), each of which fits into the on-chip memory and is accessed in repeated or predictable patterns. While this usually occurs with image processing, an extremely intelligent transfer controller was architected to help ensure the validity of this assumption. The TC has numerous modes of transferring data on- or off-chip, each optimized for a particular type of dataflow (block, patch, fat line, indexed or guided patches...). Most importantly, the on-chip SRAM was architected with sufficient size and modularity to permit double-buffering of data I/O on and off the chip while the on-chip processors access the other on-chip memory modules. In effect, practically no overhead is required for video I/O. Many convenient methods were designed into the TC to prioritize these accesses. In addition, we included support for most commodity memory components (VRAM, SRAM, DRAM). Finally, we devised several methods to mitigate any contention between the processors for a particular memory module. Both round-robin and fixed priority schemes are available to give developers flexibility in structuring their algorithms to reduce contention. For the many image and video algorithms developed for the MVP thus far, contention has not been a problem.
`
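The two arbitration styles named above, round-robin and fixed priority, can be contrasted with a small sketch. It is a generic illustration of the two policies for processors competing for one memory module; the class and function names are hypothetical, and nothing here is taken from the MVP's actual crossbar logic.

```python
def fixed_priority(requesting):
    """Grant the lowest-numbered requesting processor, or None if idle."""
    return min(requesting) if requesting else None

class RoundRobinArbiter:
    """Rotate priority so the most recent winner becomes lowest priority."""

    def __init__(self, num_processors):
        self.n = num_processors
        self.last = self.n - 1        # so processor 0 has priority first

    def grant(self, requesting):
        """Return the next requester after the previous winner, or None."""
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if candidate in requesting:
                self.last = candidate
                return candidate
        return None
```

Under sustained contention, fixed priority always serves the same processor first, which suits a task with a hard deadline, while round-robin shares the module evenly among requesters.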
`
Another advantage of the crossbar architecture is expandability. We can design many different MVP chips as a function of the number of PPs. We simply slice the architecture, cutting or adding PPs and memory modules. Conceptually, the advantage of this approach is that, with the same package and pin-out, several different performance and price points can be offered. A range of applications may require a range of different MVP chips. Applications which require CCIR 601 studio-quality video and/or multifunction processing (graphics/audio/video) would most likely require an MVP with four processors. On the other hand, a more dedicated or single-function application like graphics may require fewer PPs. In addition, if only limited-resolution video (QCIF) processing is necessary, again a small number of PPs could suffice. We anticipate various versions of the MVP in the future.

VIDEO CONTROLLER (VC)

In addition, the MVP has two programmable timing controllers for generation of video and other timing signals. As an example, video frame grabbing and display require many pixel, horizontal, and vertical signals for synchronization of the external logic in the system. The MVP has internal logic to generate those signals under program control, relieving the system designer from designing external logic to perform those functions.
`
NEW DCT ALGORITHMS FOR COMPRESSION
SPEED-UP WITH USE OF A PROGRAMMABLE ARCHITECTURE

One advantage of the programmable compression chip is the optimization possible by selecting the least computationally demanding DCT algorithm that will meet the accuracy required by the application. For example, fast DCT algorithms like those of Lee and Chen [Lee-84, Chen-77] have a considerable advantage over traditional matrix-multiply approaches (a speedup of a factor of 5 or more). Separability of the 2D DCT is generally used for decomposition of DCTs (with successive processing of the individual rows and columns of an image). The size of the DCT directly influences the benefit of separability; however, with the 8x8 DCTs of most standards, a definite speedup results. The Lee algorithm tends to be easy to implement and achieves faster computation, although it has accuracy issues. The Chen algorithm is harder to implement and is computationally slower, but has good accuracy. Depending on the available processing bandwidth, the encoder can select an appropriate DCT or IDCT algorithm to perform the task. If errors result, different coding decisions result and either lower SNRs or lower compression ratios occur.
`
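The row-column separability described above can be checked numerically. The sketch below computes an 8x8 DCT two ways in plain Python: by applying an 8-point 1D DCT to every row and then every column, and directly from the 2D definition. It uses the orthonormal DCT-II straight from its textbook definition; it is not the Lee or Chen fast algorithm, and the function names are illustrative assumptions.

```python
import math

N = 8

def dct_1d(x):
    """Orthonormal 8-point DCT-II computed directly from its definition."""
    out = []
    for k in range(N):
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        out.append(scale * sum(
            x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
            for n in range(N)))
    return out

def dct_2d_separable(block):
    """Row-column decomposition: 1D DCT on each row, then on each column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d([rows[r][c] for r in range(N)]) for c in range(N)]
    return [[cols[c][r] for c in range(N)] for r in range(N)]

def dct_2d_direct(block):
    """Direct 2D definition, for comparison only (far more multiplies)."""
    def a(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    return [[a(u) * a(v) * sum(
                block[y][x]
                * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                * math.cos((2 * x + 1) * v * math.pi / (2 * N))
                for y in range(N) for x in range(N))
             for v in range(N)] for u in range(N)]
```

For an 8x8 block, the direct form sums 64 products for each of 64 outputs (4096 terms), while the row-column form runs sixteen 8-point transforms of 64 multiply-adds each (1024 in total), before any fast 1D algorithm is applied.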
`
In addition, we devised a "Truncated-IDCT" algorithm to utilize the advantages of a programmable architecture. Since the DCT, quantizer, and thresholding operations seek to minimize the population of selective frequency coefficients (for high compression), statistically most of the high-frequency coefficients are zero-valued. Therefore, the conventional IDCT will act on 8x8 matrices with a high percentage of zero-valued inputs. We can then significantly reduce the number of IDCT operations performed by not executing the zero-valued multiplies and adds (similar work has been reported [McMillan-92]). This is only possible with software-based IDCTs. In implementation, the program adaptively truncates the 8x1 IDCT summation in the vertical direction based on the run-length encoded input values. Further reduction by selecting a 4x1 summation when appropriate also shortens the process, although not as frequently. With this approach, a speedup of a factor of 3 or more on IDCTs will usually occur.
`
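The idea of truncating the IDCT summation can be sketched for a single 8-point column. The Python below stops the inverse-DCT sum after the last nonzero coefficient, so trailing zero terms are never multiplied or added. It is a minimal illustration of the principle under our own assumptions (orthonormal scaling, a simple scan for the last nonzero coefficient); it is not the author's implementation, which the paper says works from the run-length encoded input values.

```python
import math

N = 8

def idct_1d(coeffs, terms=N):
    """Orthonormal 8-point inverse DCT using only the first `terms` coefficients."""
    out = []
    for n in range(N):
        acc = 0.0
        for k in range(terms):
            scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
            acc += scale * coeffs[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
        out.append(acc)
    return out

def idct_1d_truncated(coeffs):
    """Stop the summation after the last nonzero coefficient."""
    last = max((k for k, c in enumerate(coeffs) if c != 0), default=-1)
    return idct_1d(coeffs, terms=last + 1)
```

For a column such as `[80, -12, 0, 3, 0, 0, 0, 0]`, the truncated version evaluates four terms per output sample instead of eight and produces the same result as the full summation.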
`
TOOLS

Advances in video compression have been limited by the availability of tools to develop software and hardware. With the MVP, TI offers a range of software tools and direct on-chip support for in-circuit debug. A real-time executive, C++ compilers, an algebraic assembler, a windowed high-level language debugger (with JTAG emulation hardware on chip), and a library of primitives/applications, all the tools familiar to computer application developers, will now be available for development of video applications.

The software model for the MVP is based on two levels. The primary level includes the Master Processor acting as a director and scheduler of the MVP's parallelism. The Executive operates on the MP, performing those supervisory tasks. The Executive can dispatch tasks for operation in pipeline, parallel, or any other arrangement on any processor within the MVP. Under that, a level which actually performs the tasks on each processor is accessed by either (1) a library of primitives, (2) application tools for programming in assembly, or (3) a high-level language compiler. Each of these methods has advantages, with varying performance and skill level required to code the chip, as a function of the particular application. Nothing but software convention restricts the use of any processor as the master or slave processor.
`
COMPETING VIDEO COMPRESSION CHIP ARCHITECTURES

Several semiconductor companies have reported activity in video compression chip or chip-set solutions [Bolton-93, Konstantinides-92]. Most chip manufacturers are proposing hardwired or parameterized architectures without C-level programmability. Our MVP is an exception. In addition, most of the other "programmable" approaches are based on an architecture which integrates dedicated logic modules, like DCTs and motion estimators, with only the controller programmable. This limits their efficiency, since the silicon devoted to those functions must always be kept busy with those functions to justify its cost. In contrast, the MVP architecture has no dedicated logic, permitting the user to balance silicon usage based on the varied and dynamic computational demands of compression. Other researchers have recently described simulations which support our position that universally programmable architectures are competitive solutions when compared with dedicated or hybrid architectures for video decompression [Mayer-93].

Many different architectures are proposed, in development, or currently available; however, none except the MVP has the flexibility or computational performance to meet the complete demands of truly integrated digital video on the desktop, including the complete concert of real-time video and audio compression with image and 3D graphics processing. These demands cover not only compression and decompression, but also system-level bit-stream control, video scaling, error correction, and even audio echo cancellation. Architecture limitations and transistor counts limit other chips to subsets of these functions.
`
CONCLUSION

The MVP is a monolithic single-chip parallel processor that performs compression processing, audio and video processing, 3D graphics, and other functions, even at the same time. Over 2 billion operations are performed per second. This dramatic performance boost will enable a wide range of new applications, including desktop interactive digital video. Integrating fully programmable parallel DSP processors with a RISC processor on one chip provides software flexibility and system adaptability. A new parallel architecture, using a crossbar network to couple the processors and large on-chip SRAMs, and with MIMD (Multiple Instruction Multiple Data) operations, yields extremely high efficiency for most image, graphics, and video algorithms. Software tools like real-time executives, assemblers, and compilers all help bring a familiar computer programming model to multidimensional signal processing. This new technology frees developers of compression algorithms to optimize implementations of standard video and audio compression algorithms without the restrictions found in today's compression chips (those which are limited to current interpretations or versions of the standards). In addition, algorithm developers can implement future compression algorithms without the difficulties of developing new chips or adapting existing chips.

The MVP supports a wide range of open standards for video compression and image computing. The variation within each standard to promote creative and distinguishing advantages in the marketplace, and the constant urge to optimize the standard for a particular range of markets, each work against fixed hardware solutions. This programmable, integrated solution gives system designers the flexibility to develop competitive algorithms as well as adapt to emerging standards.
`
ACKNOWLEDGMENTS

The author wishes to thank Jeremiah Golston, Dr. Chris Read, and Dr. V. Venkateswar for compression algorithm work relating to the MVP. In addition, thanks to the MVP Program Manager, Walt Bonneau, for developing and motivating a "world-class" team. A special thanks to my original co-MVP-architects, Keith Balmer, Karl Guttag, and Nick Ing-Simmons. Finally, thanks to the entire "Team-MVP"!
`
REFERENCES

[Bolton-93] Bolton, M., "A Family of MPEG Video Encoder and Decoder Chips", IEEE Proceedings of Conference on Hot Chips, 1993.

[Chen-77] Chen, W.H., C.H. Smith, and S.C. Fralick, "A Fast Computational Algorithm for the Discrete Cosine Transform", IEEE Transactions on Communications, Vol. 25, pp. 1004-1009, Sept. 1977.

[Gove-92] Gove, R.J., "Architectures for Single-Chip Image Computing", SPIE Proceedings of Conf. on Image Processing and Interchange, San Jose, CA, Feb. 1992.

[Guttag-92] Guttag, K.M., R.J. Gove, and J.R. Van Aken, "A Single-Chip Multiprocessor for Multimedia: The MVP", IEEE Computer Graphics & Applications, pp. 53-64, Nov. 1992.

[Konstantinides-92] Konstantinides, K. and V. Bhaskaran, "Monolithic Architectures for Image Processing & Compression", IEEE Computer Graphics & Applications, pp. 75-86, Nov. 1992.

[Lee-84] Lee, B.G., "A New Algorithm for the Discrete Cosine Transform", IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 32, pp. 1243-1245, 1984.

[Mayer-93] Mayer, A.C., "The Architecture of a Processor Array for Video Decompression", IEEE Trans. on Consumer Electronics, Vol. 39, No. 3, pp. 565-569, Aug. 1993.

[McMillan-92] McMillan, L. and L. Westover, "A Forward-Mapping Realization of the Inverse Discrete Cosine Transform", IEEE Proc. of the Data Compression Conference, pp. 219-228, 1992.