`
`
`
`
`
`
`
`The MVP: A Highly-Integrated Video Compression Chip
`Robert I. Gove
`
`
`
Texas Instruments, Inc.
Dallas, Texas 75265
`
`
`
`ABSTRACT
`
`
We introduce a new highly-integrated processing chip that performs a variety of
functions and is particularly well suited to video compression algorithms.
Applications include multimedia PCs, virtual reality 3D graphics, full-duplex
videoconferencing, HDTV, and color hardcopy. We have architected the Multimedia Video
Processor, or MVP, to provide a previously unattainable level of performance from a single
chip, yet with the programmability typically found in today's general-purpose computers.
While advanced semiconductor design and process techniques have been used for its
design, the key to the advantage of this component lies in optimization of the architecture
for real-time video and graphics processing. This paper analyzes video compression
application requirements, describes the MVP architecture, and presents its potential as a
very capable solution for a wide range of markets.
`
`
`
`
`
`
`
`
`INTRODUCTION
`
`
The computer and consumer video industries are pursuing varied paths to offer
cost-effective computing products which provide new forms of information and entertainment.
Products are emerging that range from cable TV delivery of interactive digital movies to
digital mobile offices. Digital compression and video processing at a reasonable cost are
spurring this revolution. While algorithm developments have been important, most of the
enabling advances lie in the availability of high-density memory and high-performance
processing ICs. With the pending general availability of the Multimedia Video Processor,
or MVP, in 1994, a previously unattained level of digital signal processing performance
will be available, with all the flexibility of present-day programmable computers.
Standards-based video-conferencing and playback of compressed digital video and audio
(using Px64, JPEG or MPEG "multi-standard" codec systems) with a single MVP processor
will be possible, as well as codecs with yet-to-be-defined algorithms like model-based
compression.
`
`
`
`
`
`
`
`
`
However, not only will the MVP support compression, it will also handle processing of
high-resolution video, full-motion video processing from sources like camcorders, digital
audio processing, hardcopy raster image processing, and 3D graphics, all under
software control and generation. From this wide range of functions, we calculated that
several billion operations per second are required to provide video-based applications on
the desktop. Current and soon-to-appear desktop host processors like X86, Pentium,
Alpha, and MIPS do not have the computational power to meet these demands.
`
`
`
`
`
`
`
`
`
`
`
`
`
`KEYS TO THE MVP ARCHITECTURE
`
`
`
`
`
`
`
The MVP's unique architecture and computational power enable users to integrate these
varied functions on a single processing component. The keys to obtaining both exceptional
processing speeds and fully-programmable features with the MVP include the use of:
`
`
`
`
`
`
`
`
`
`
`
`
`
(1) an efficient parallel processing architecture,
(2) fast pixel processing tuned to image, video, and graphics processing,
(3) intelligent control [...],
(4) single-chip integration without slower chip-to-chip communications.
`
`
`
`
`
`
`
`
`
`
`
`1068-0314/94 $3.00 © 1994 IEEE
`
`
`
`
`
`
`
`215
`
`
`
`Page 1 0f 10
`
`HTC-LG-SAMSUNG EXHIBIT 1006
`
`
`
`
[Figure 1: MVP Block Diagram (A Single-Chip Parallel Processor). Labeled blocks include
the DSP Parallel Processors (PPn), which are advanced DSP cores; the Master Processor (MP),
which is an advanced RISC; and the on-chip data RAM and instruction cache modules.]
`
`
`ALGORITHM-DIRECTED ARCHITECTURE DEFINITION
`
`Processing Requirements
Today's proposed international video compression standards use common frequency
domain, quantization, and entropy coding techniques to (de)compress small portions (8x8)
of each image. While these functions demand a great deal from the encoder/decoder, many
other varied functions remain, each with dynamic requirements which vary based on the
type of image compressed as well as the channel rate required to maintain real-time
operation. For optimal efficiency a processor must adapt to these dynamic needs. A
typical average of the processing demands of the Px64 video-conferencing standard
appears in the following table.
`
`RISC vs. MVP-PP Processing Requirements for Px64
`
[Table: processing requirements of each Px64 (H.261) function at full CIF resolution, with
RISC and MVP-PP estimates and the resulting speed-up per function. The individual entries
are illegible in this copy.]
`
* Multiply counted as one instruction even though most RISCs require many cycles.
** If the "Truncated-IDCT" algorithm were used, IDCTs would speed up again (see later).
*** The total is equivalent to 3 MVP-PP processors (see the PP section below).
**** Audio standards concurrently execute on the MVP-MP (see the MP section below).
`
As we studied the computational requirements for motion estimation (51%) and DCTs
(22%) it became quite apparent that a programmable image processor must excel at these
functions. It is important to recognize that what is done poorly in a processor can dominate
its performance. Since most architectural improvements would not accelerate all
functions uniformly, we looked for special architectural features for these critical functions,
while maintaining enough flexibility to benefit a larger class of algorithms. In the final
analysis, a much more uniform distribution of computational loading resulted after the
changes.
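
Motion estimation dominates the load because block matching is a nested search over
candidate displacements. The sketch below is only an illustration of that cost structure.
It assumes an exhaustive (full) search with a sum-of-absolute-differences (SAD) cost,
neither of which is specified in this paper, and `sad` and `full_search` are hypothetical
helper names.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between the n x n block of `cur` at (cx, cy)
    and the n x n block of `ref` at (rx, ry). Frames are lists of pixel rows."""
    total = 0
    for j in range(n):
        for i in range(n):
            total += abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
    return total


def full_search(cur, ref, cx, cy, n, search):
    """Test every displacement within +/- `search` pixels (assumed strategy) and
    return (best_dx, best_dy, best_sad). Candidates outside `ref` are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            # Strict comparison keeps the first of several equally good matches.
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best
```

Under these assumptions each block costs (2 * search + 1)^2 * n^2 absolute differences.
For an illustrative 16x16 block with search = 7 that is 225 * 256 = 57,600 per block,
which suggests why byte-parallel arithmetic pays off for this function.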
`
As seen in the table, the programmable image processor must perform many other
functions well, including bit manipulation and table look-ups for entropy encoding, and
multiply and accumulate for various types of filtering operations. To obtain good image
quality at any channel rate and 30 frames per second, the image processor must compute
over 1.2 billion operations per second (BOPS).
`
`
`
`
`
`
`
`
`
`
`
The addition of audio compression (which requires higher precision integer and possibly
floating point algorithms) and network communication, necessary for video conferencing
(G.728 or G.711, H.242, H.230, H.221), further increases the scope of computational
requirements. To reduce the system cost, we propose to include support in the architecture
for the required non-standard functions like color space conversion (YCrCb to RGB),
decimation of the source image to CIF resolution and variable scaling of the decompressed
sequence. Complete implementation of compression applications such as video-conferencing
requires over 2 BOPS of the programmable image processor.
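
As a concrete picture of one such non-standard function, the sketch below converts a
single pixel from luma/chroma to RGB using integer-only arithmetic, the kind of operation
an integer pixel processor would run per pixel. The coefficients are the widely used
ITU-R BT.601 full-range values and the 16-bit fixed-point scaling is an assumption; the
paper does not state which conversion matrix or precision the MVP software uses.
`ycbcr_to_rgb` and `clamp8` are hypothetical names.

```python
SHIFT = 16                      # fixed-point fraction bits (assumed)
HALF = 1 << (SHIFT - 1)         # rounding term added before the shift

# BT.601 full-range coefficients scaled to fixed point (assumed, not from the paper).
CR_TO_R = round(1.402 * (1 << SHIFT))
CB_TO_G = round(0.344136 * (1 << SHIFT))
CR_TO_G = round(0.714136 * (1 << SHIFT))
CB_TO_B = round(1.772 * (1 << SHIFT))


def clamp8(v):
    """Saturate a value to the 8-bit pixel range 0..255."""
    return 0 if v < 0 else 255 if v > 255 else v


def ycbcr_to_rgb(y, cb, cr):
    """Convert one 8-bit (Y, Cb, Cr) sample to 8-bit (R, G, B) with integer math."""
    cb -= 128                   # chroma is stored with a +128 offset
    cr -= 128
    r = y + ((CR_TO_R * cr + HALF) >> SHIFT)
    g = y - ((CB_TO_G * cb + CR_TO_G * cr + HALF) >> SHIFT)
    b = y + ((CB_TO_B * cb + HALF) >> SHIFT)
    return clamp8(r), clamp8(g), clamp8(b)
```

Neutral chroma (Cb = Cr = 128) leaves the grey level unchanged, which is a quick sanity
check for any variant of this routine.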
`
`
`
`
`
`
`
`
`
`
`ARCHITECTURE CHOICES
`
`
`
`
We considered several candidate parallel architectures for implementation of this
single-chip video processor [Gove-92, Guttag-92]. An architecture with a mix of dedicated
and programmable processors was initially evaluated, then subsequently discounted when no
single dominant function was found that was necessary almost all of the time. Besides, we
predicted that by the time the chip was completed, a new important algorithm would
emerge. Given the loss of silicon efficiency from dedicating resources to any one
function (like a DCT), we felt compelled to seek a general-purpose, well-balanced system
solution. Several other candidates existed; however, the mix of algorithms and practical
implementation limitations focused us on SIMD and MIMD architectures. These differ in
the autonomy of the processors' functions; MIMD processors operate autonomously, a
desirable feature for any data-dependent algorithm operating in parallel.
`
`
`
`
`
`
With MIMD desirable, the choice of a processor and memory interconnection architecture
remained. Pipelined, shared-bus memory, communication port (mesh/array/hypercube),
and crossbar fully-shared memory were considered. Pipelined memory and processors
(systolic arrays) are typically used for video; however, they are too restrictive in the
sense that one must know a priori the size of the memory and the dynamics of the algorithm
to prevent data contention and processor stalls. With our varied needs, this would lead to
inefficiencies. A shared-bus memory structure would also have bottleneck problems with
highly variable instruction and data streams and with moving results from one processor to
another. The n-way connected communication port requires a very ordered flow of data,
like a systolic or wavefront flow of data, or the application of a pixel per processor (not
practical in a single chip). This approach works for large arrays of simple processors
which can operate uniformly on images; however, we wanted more complex processors
which could adapt to varying types of data, from bit graphics to floating-point
representations. The crossbar fully-shared memory is ideally suited to these needs,
minimizing contention and data movement and providing flexibility for many types of
algorithms. In fact, since the crossbar operates at the processor instruction rate, this
architecture can functionally emulate the other approaches (pipeline, shared bus...).
`
`
`
`
`
`
`
`
`
`
`
We not only wanted to provide this order of magnitude performance increase; the goal
was also to apply a traditional computer model of programmable processing and a large
memory to applications with integrated image, graphics, video and audio processing, or
image computing. As shown in Figure 2, titled "MVP System Architecture", replacing the
processing and memory pipeline of conventional video systems with the single video
processor and large memory system model yields tremendous application flexibility. In
effect the system can re-configure itself with software from video conferencing to playing
CD movies, just as a PC would re-configure from a spreadsheet to a video game.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
Figure 2: The MVP "System" Architecture.
[Diagram: an input/output interface connects the processor and a shared memory (holding
instructions, multiple images, audio...) to computer memory (disk...), network data
(phone or local area), displays and TV monitors, and live video and audio sources.]
`
`THE MVP ARCHITECTURE
`
The Multimedia Video Processor, or MVP, represents the next generation of digital signal
processors. The MVP can be technically described as a single-chip, crossbar shared-memory,
heterogeneous MIMD multiprocessor. It combines RISC and advanced DSP
processing in one parallel architecture with unique features for each. Current RISC
processors typically use instruction pipelining, numerous registers and a detached
floating-point processor. On the other hand, current DSPs are optimized for one-dimensional
multiply-accumulate functions. Newer DSPs have floating-point capabilities, yet most
imaging and video only needs integer operations. DSPs usually have fewer registers than
RISC and have direct memory accesses (DMA) with limited capabilities.
`
The MVP combines the best features of RISC and DSP in parallel and adds other features
to offer unprecedented power and flexibility. The heart of an image or video chip is its
capability to process 2D signals. The MVP has features for 2D DSP-like processing,
including multiply-accumulate operations. The on-chip memory and register characteristics
of the MVP were optimized for image computing algorithms, preventing time-consuming
cache misses or swapping of register contents. Multidimensional external memory access
and double buffering minimize the typical memory bottleneck of current DSP solutions.
An internal memory crossbar provides extremely efficient synchronization and
communication among multiple processors. A very high-performance RISC processor is
integrated on the chip, providing intelligent control of the DSP-like processors. Also
integrated into the chip is a new floating-point architecture that can act as a co-processor
to any of the DSP-like processors or the RISC processor. By analysis of the algorithms, the
required mix of integer ops to floating-point ops was found to be somewhere between 8:1
and 4:1, a balance which the MVP supports. The entire collection of processors and memory
is configured as a MIMD architecture for ease of programming and high performance for all
image and video computing applications. This MIMD data and control supports both
data-dependent algorithms like object feature matching or Huffman coding and traditional
data-independent SIMD operations like convolution.
`
`
`
`
`
`
`
`
`
`
To prevent contention for memory or register access, a very wide instruction set in the
PPs and a large on-chip crossbar memory are used in the MVP. This flexibility permits
the programmer to produce highly-parallel optimized code. A performance penalty may
result if only one highly-serial task is performed continuously; however, the very nature
of image, video, graphics, and audio processing, with varied concurrent and complex
processing, prevents this from occurring. The MVP integrates more functions than ever
before into one chip, while avoiding the compromises of other architectures.
`
`
`
`
`
`
`
`
`
`
`
`
`Detailed Architecture Description:
`
`
`
Figure 1, titled "MVP Block Diagram", shows the MVP chip architecture. The Master
Processor (MP) provides a RISC processor for simple user interface, sequential
processing, and orchestration of multiple concurrent tasks operating on the entire MVP.
The DSP Parallel Processors (PP), of which 4 will be designed in the first version of the
MVP, provide highly-optimized image/video/graphics/audio processing capabilities. The
Transfer Controller (TC) intelligently moves data and instructions on and off the MVP. All
of these processors are locally interconnected with a crossbar to 25 on-chip 2 Kbyte SRAM
modules. Other features include dual video frame timing generators (VC) and JTAG test
and emulation circuits.
`
`
`
`
With five 32-bit programmable processors operating at a targeted rate of 50 MHz
and numerous parallel operations performed in each processor, over 2 billion operations
per second result. In addition, 100 MFLOPS (fully IEEE-754) can occur. The peak data
transfer rate is 400 MBytes/second, adequate for many video applications. The
internal bandwidth over the crossbar between on-chip memory and processors is 2.4
GBytes/second.
`
`
`
`
`
`
`
`
`DSP PARALLEL PROCESSORS (PP)
`
`
`
`
`
`
The PP has many powerful features beyond those found in conventional DSPs. Practically
all video algorithms benefit from these features. Most of the features were added to permit
scalability within the PP, supporting either many simple functions (like bit ops) in one
cycle or fewer operations with the same hardware at higher precision (like 32 bits). The
following list describes each feature and its advantage:
`
`
`
`
`
- 44 user registers:
    - ease of programming/compiling and fast parallel functions.
- Single-cycle access into crossbar memory expands effective registers to 34K:
    - flexibility.
- Three-level, no-overhead instruction looping:
    - programming flexibility and faster tight loops (usually 20-30%).
- Double parallel transfer from memory with address update:
    - most algorithms need two pixels loaded per cycle.
- Three-operand ALU arithmetic and logical operations:
    - double speed correlation and windows support.
- Splittable multiply (8x8=16 or 16x16=32):
    - double speed pixel operations.
- Word/Halfword/Byte multiple arithmetic:
    - 4x on algorithms like motion estimation and 2x on fast DCTs.
- Flexible data path:
    - masking, merging, rotating... for bit stream coding (like Huffman).
- General-purpose use of address adders:
    - up to 6x number of adds in one cycle.
- Conditional operations prevent need for branching (and possible pipeline stalls):
    - adaptive algorithms will operate faster (like adaptive thresholding).
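
The byte-wise multiple arithmetic in the list above can be pictured in software as four
independent 8-bit additions carried out inside one 32-bit word. The sketch below emulates
that behavior with masking so that no carry crosses a byte boundary. It illustrates the
idea only: the PP does this in its ALU hardware, whose exact mechanism is not described
here, and `packed_add_u8`, `pack4` and `unpack4` are hypothetical names.

```python
MASK32 = 0xFFFFFFFF
LOW7 = 0x7F7F7F7F    # low seven bits of every byte lane
HIGH1 = 0x80808080   # top bit of every byte lane


def packed_add_u8(a, b):
    """Add four unsigned bytes packed in `a` to the four packed in `b`,
    wrapping each lane modulo 256 with no carry into the neighboring lane."""
    # Adding only the low 7 bits cannot overflow out of a lane.
    low = (a & LOW7) + (b & LOW7)
    # The top bit of each lane is the XOR of both top bits and the carry
    # already sitting in bit 7 of `low`.
    return (low ^ ((a ^ b) & HIGH1)) & MASK32


def pack4(b3, b2, b1, b0):
    """Pack four bytes into one 32-bit word, b3 most significant."""
    return ((b3 & 0xFF) << 24) | ((b2 & 0xFF) << 16) | ((b1 & 0xFF) << 8) | (b0 & 0xFF)


def unpack4(word):
    """Split a 32-bit word into its four bytes, most significant first."""
    return ((word >> 24) & 0xFF, (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF)
```

One call replaces four separate byte additions, which is the source of the "4x" figure
quoted for byte-oriented algorithms such as motion estimation.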
`
`
`
`
`
`
`
`
`
`
`
As a result, as many as 15 RISC operations will be performed in one PP cycle. When
multiplied by the number of PPs and added to the MP and FPU operations, a formidable
number results. In addition, since the C compiler also influenced the architecture of the
PP, many of these features will automatically compile into fast code; many users of the
MVP will not need to understand the PP architecture to take advantage of its performance.
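
The multiplication described above can be written out directly. The sketch below uses
only figures quoted in this paper (four PPs, up to 15 RISC-equivalent operations per PP
cycle) and reads the targeted rate as 50 MHz. The result is a peak bound for the PPs
alone, not a measured figure, and the function name is hypothetical.

```python
CLOCK_HZ = 50_000_000        # targeted processor rate, read as 50 MHz
NUM_PP = 4                   # PPs in the first version of the MVP
PEAK_OPS_PER_PP_CYCLE = 15   # "as many as 15 RISC operations" per PP cycle


def peak_pp_ops_per_second(clock_hz=CLOCK_HZ, num_pp=NUM_PP,
                           ops_per_cycle=PEAK_OPS_PER_PP_CYCLE):
    """Upper bound on RISC-equivalent operations per second from the PPs alone."""
    return clock_hz * num_pp * ops_per_cycle


peak = peak_pp_ops_per_second()   # 50e6 * 4 * 15 = 3,000,000,000
```

A peak of 3 billion operations per second from the PPs is consistent with the claim of
over 2 billion operations per second for the whole chip, since real code will not fill
every operation slot in every cycle.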
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`MASTER PROCESSOR (MP)
`
`
`
`
`
The MP is a general-purpose RISC processor with an integral IEEE-compatible
floating-point unit. A 32-bit instruction is accessed from a 4 KByte instruction cache.
Data loads can be 8, 16, 32, or 64 bits from a 4 KByte data cache or from any data module
via the crossbar. The MP has thirty-one 32-bit usable registers. Uncommon features include:
`
`
`
`
`
`
`
`
`
`
`
- Register files common to floating-point & integer operations.
- Scoreboard keeps track of result of loads and FPU, preventing use until updated.
- Addressing modes support optional updating of base-address register with
  results of the address computation.
- Special FPU instruction permits new multiply, add/subt, & increment each cycle.
- Left-most and right-most one logic.
- Both endians supported.
`
`
`
`
`
`
`
`
`
`
`
`
Since the MP was designed to efficiently execute C programs and has added hardware for
bitstream processing, it performs exceptionally well as the controller and data
interpretation processor. The floating point capability accelerates and simplifies
programming of high precision applications like medical imaging and 3D graphics.
`
`
`
`
`
`
`
`
`SHARED MEMORY & TRANSFER CONTROLLER (TC)
`
`
`
`
`
`
`
`
`
`
`
`
`
`
Much of the advantage of the MVP architecture lies in the memory and data I/O
architecture. Each processor and memory is fully interconnected through the crossbar and
switchable at instruction rates. With greater than 500 signal lines switching at nanosecond
speeds, the crossbarred memory architecture is only possible with single-chip
implementation. With adequate on-chip memory and the ability to reconnect the next
processor to the data memory, rather than moving the data to another memory, the data
on-chip is not required to move as often. In effect, the original requirement of billions of
bytes/second data transfer is reduced to only 100's of Mbytes/second. This model works
well as long as the algorithm uses localized regions of data (patches, blocks,
neighborhoods, rows...), each of which "fits" into the on-chip memory and is accessed in
repeated or predictable patterns. While this usually occurs with image processing, an
extremely intelligent transfer controller was architected to aid in ensuring the validity
of this assumption. The TC has numerous modes of transferring data on- or off-chip, each
optimized for a particular type of dataflow (block, patch, fat line, indexed or guided
patches...). Most importantly, the on-chip SRAM memory was architected with sufficient
size and modularity to permit double-buffering of data I/O on and off the chip, while the
on-chip processors access the other on-chip memory modules. In effect, practically no
overhead is required for video I/O. Many convenient methods were designed into the TC
to prioritize these accesses. In addition, we included support for most commodity memory
components (VRAM, SRAM, DRAM). Finally, we devised several methods to mitigate
any contention between the processors for a particular memory module. Both round-robin
and fixed priority schemes are available to permit developers flexibility in structuring
their algorithms to reduce contention. For the many image and video algorithms
developed for the MVP thus far, contention has not been a problem.
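
To make the two contention policies concrete, the sketch below models arbitration for a
single memory module when several processors request it in the same cycle. It is a
behavioral illustration under assumed rules (one grant per cycle, requesters numbered
from 0, and lowest number winning under fixed priority). The actual MVP arbitration logic
is not detailed in this paper, and `RoundRobinArbiter` and `fixed_priority_grant` are
hypothetical names.

```python
class RoundRobinArbiter:
    """Grants one requester per cycle, rotating priority after each grant."""

    def __init__(self, num_requesters):
        if num_requesters <= 0:
            raise ValueError("num_requesters must be positive")
        self.n = num_requesters
        # Start so that requester 0 has the highest priority on the first cycle.
        self.last = num_requesters - 1

    def grant(self, requests):
        """Return the winning requester index for this cycle, or None if idle."""
        pending = set(requests)
        # Scan starting just after the most recent winner and wrap around.
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if candidate in pending:
                self.last = candidate
                return candidate
        return None


def fixed_priority_grant(requests):
    """Fixed priority (assumed ordering): the lowest-numbered requester wins."""
    pending = set(requests)
    return min(pending) if pending else None
```

With requesters 0 and 2 both asking every cycle, the round-robin arbiter alternates
between them, while the fixed scheme always serves requester 0. The first is fairer; the
second gives a critical task predictable latency.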
`
`
`
`
`
`
`
`
`
`
`
`
`
`
Another advantage of the crossbar architecture is expandability. We can design many
different MVP chips, as a function of the number of PPs. We simply slice the architecture,
cutting or adding PPs and memory modules. Conceptually, the advantage of this approach
is that, with the same package and pin-out, several different performance and price points
can be offered. A range of applications may require a range of different MVP chips.
Applications which require CCIR 601 studio quality video and/or multifunction processing
(graphics / audio / video) would most likely require an MVP with 4 processors. On the
other hand, a more dedicated or single-function application like graphics may require
fewer PPs. In addition, if only limited resolution video (QCIF) processing is necessary,
again a small number of PPs could suffice. We anticipate various versions of the MVP in
the future.
`
`
`
`VIDEO CONTROLLER (VC)
`
`
`
`
`
In addition, the MVP has two programmable timing controllers for generation of video and
other timing signals. As an example, video frame grabbing and display require many pixel,
horizontal, and vertical signals for synchronization of the external logic in the system.
The MVP has internal logic to generate those signals under program control, relieving the
system designer from design of external logic to perform those functions.
`
`
`
`
`
`
`
`
`
`
`
`NEW DCT ALGORITHMS FOR COMPRESSION
`
`
`
`
`
`SPEED-UP WITH USE OF A PROGRAMMABLE ARCHITECTURE
`
`
`
`
`
`
`
`
`
One advantage of the programmable compression chip is the optimization possible by
selecting the least computationally demanding DCT algorithm that will meet the accuracy
required of the application. For example, fast DCT algorithms like those of Lee and Chen
[Lee-84, Chen-77] have considerable advantage with respect to traditional matrix multiply
approaches (with a factor of 5 or more speedup). Separability of the 2D DCT is generally
used for decomposition of DCTs (with successive processing of the individual rows &
columns of an image). The size of the DCT directly influences the benefit of separability;
however, with the 8x8 DCTs of most standards, a definite speedup results. The Lee
algorithm tends to be easy to implement and achieves faster computation, although it has
accuracy issues. The Chen algorithm is harder to implement and is computationally slower,
but has good accuracy. Depending on the available processing bandwidth, the encoder
can select an appropriate DCT or IDCT algorithm to perform the task. If errors result,
different coding decisions result and either lower SNRs or lower compression ratios occur.
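
The benefit of separability can be checked numerically. The sketch below computes an
8x8 DCT two ways: directly from the 2D definition, and as 1D transforms over the rows
followed by 1D transforms over the columns. It uses the plain matrix-multiply form of the
orthonormal DCT-II as an assumed baseline, not the Lee or Chen fast algorithms, and all
function names are hypothetical.

```python
import math


def dct_1d(x):
    """Direct (matrix-multiply) orthonormal DCT-II of a sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        acc = 0.0
        for i in range(n):
            acc += x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(scale * acc)
    return out


def dct_2d_separable(block):
    """2D DCT as row transforms followed by column transforms."""
    n = len(block)
    rows = [dct_1d(row) for row in block]
    # cols[u][v]: horizontal frequency u, vertical frequency v.
    cols = [dct_1d([rows[y][u] for y in range(n)]) for u in range(n)]
    return [[cols[u][v] for u in range(n)] for v in range(n)]


def dct_2d_direct(block):
    """2D DCT straight from the double-sum definition (reference only)."""
    n = len(block)
    out = [[0.0] * n for _ in range(n)]
    for v in range(n):
        for u in range(n):
            cu = math.sqrt((1.0 if u == 0 else 2.0) / n)
            cv = math.sqrt((1.0 if v == 0 else 2.0) / n)
            acc = 0.0
            for y in range(n):
                for x in range(n):
                    acc += (block[y][x]
                            * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                            * math.cos(math.pi * (2 * y + 1) * v / (2 * n)))
            out[v][u] = cu * cv * acc
    return out


def macs_direct(n):
    """Multiply-accumulates for the direct n x n form: n^2 outputs, n^2 terms each."""
    return n ** 4


def macs_separable(n):
    """Multiply-accumulates for row-column form: 2n one-dimensional transforms of n^2."""
    return 2 * n ** 3
```

For n = 8 the counts are 4096 versus 1024 multiply-accumulates, a factor of n/2 = 4
before any fast 1D algorithm is applied, which is why the gain grows with the DCT size.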
`
`
`
`
`
`
`
`
`
`
`
`
`
In addition, we devised a "Truncated-IDCT" algorithm to utilize the advantages of a
programmable architecture. Since the DCT, quantizer and thresholding operations seek to
minimize the population of selective frequency coefficients (for high compression),
statistically, most of the high frequency coefficients are zero valued. Therefore, the
conventional IDCT will act on 8x8 matrices with a high percentage of zero valued inputs.
We can then significantly reduce the number of IDCT operations performed by not
executing the zero valued multiplies and adds (similar work has been reported
[McMillan-92]). This is only possible with software-based IDCTs. In implementation, the
program adaptively truncates the 8x1 IDCT summation in the vertical direction based on the
run-length encoded input values. Further reduction by selecting a 4x1 summation when
appropriate also shortens the process, although not as frequently. With this approach a
factor of 3 or more speed-up on IDCTs will usually occur.
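
The truncation idea can be sketched for a single 8x1 column. The version below finds the
last non-zero coefficient by scanning the input and stops each summation there, counting
the multiply-accumulates it actually performs. This is an assumed simplification: in the
scheme described above the cut-off comes from the run-length encoded input rather than a
scan, and the function name `idct_1d` and its `truncate` flag are hypothetical.

```python
import math


def idct_1d(coeffs, truncate=True):
    """Orthonormal inverse DCT (DCT-III) of one column of coefficients.

    Returns (samples, macs), where `macs` counts the multiply-accumulates
    executed. With truncate=True, trailing zero coefficients are skipped."""
    n = len(coeffs)
    last = n
    if truncate:
        # One past the last non-zero coefficient; later terms contribute nothing.
        while last > 0 and coeffs[last - 1] == 0:
            last -= 1
    samples = []
    macs = 0
    for i in range(n):
        acc = 0.0
        for k in range(last):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            acc += scale * coeffs[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            macs += 1
        samples.append(acc)
    return samples, macs
```

A column holding only its two lowest-frequency coefficients needs 16 multiply-accumulates
instead of 64, and an all-zero column needs none, while the output matches the full
summation.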
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`TOOLS
`
`
`
`223
`
`
`
`
`
`
`
`
Advances in video compression have been limited by the availability of tools to develop
software and hardware. With the MVP, TI offers a range of software tools and direct
on-chip support for in-circuit debug. A real-time executive, C++ compilers, algebraic
assembler, windowed high-level language debugger (with JTAG emulation hardware on
chip) and library of primitives/applications, all the tools familiar to computer application
developers, will now be available for development of video applications.
`
`
`
`
`
`
`
`
`
`
The software model for the MVP is based on two levels. The primary level includes the
Master Processor acting as a director and scheduler of the MVP's parallelism. The
Executive operates on the MP, performing those supervisory tasks. The Executive can
dispatch tasks for operation in pipeline, parallel or any other arrangement on any processor
within the MVP. Under that, a level which actually performs the tasks on each processor is
accessed by either: (1) a library of primitives, (2) application tools for programming in
assembly or (3) a high-level language compiler. Each of these methods has advantages,
with varying performance and skill level required to code the chip, as a function of the
particular application. Nothing restricts the use of any processor as the master or
slave processor; the assignment is only a software convention.
`
`
`
`
`
`COMPETING VIDEO COMPRESSION CHIP ARCHITECTURES
`
`
`
`
`
`
`
Several semiconductor companies have reported activity in video compression chip or chip
set solutions ([Bolton-93][Konstantinides-92]). Most chip manufacturers are proposing
hardwired or parameterized architectures, without C-level programmability. Our MVP is
an exception. In addition, most of the other "programmable" approaches are based on an
architecture which integrates dedicated logic modules, like DCTs and Motion Estimators,
with only the controller programmable. This limits their efficiency since the silicon devoted
to those functions must always keep busy with those functions to justify its cost. On the
contrary, the MVP architecture has no dedicated logic, permitting user balancing of silicon
based on the varied and dynamic computational demands of compression. Other
researchers have recently described simulations which support our position that universally
programmable architectures are competitive solutions when compared with dedicated or
hybrid architectures for video decompression [Mayer-93].
`
`
`
`
`
`
`
Many different architectures are proposed, in development, or currently available; however,
none except the MVP has the flexibility or computational performance to meet the
complete demands of truly integrated digital video on the desktop, including the complete
concert of real-time video & audio compression, with image & 3D graphics processing.
This covers not only compression and decompression, but also system-level bit-stream
control, video scaling, error correction and even audio echo cancellation. Architecture
limitations and transistor counts limit other chips to subsets of these functions.
`
`
`
`
`
`
`
`
`
`
`
`CONCLUSION
`
`
`
The MVP is a monolithic single-chip parallel processor that performs compression
processing, audio & video processing, 3D graphics and others, and even at the same time.
Over 2 billion operations are performed per second. This dramatic performance boost will