`
`Robert J. Gove
`
`Texas Instruments, Inc.
`Dallas, Texas 75265
`
`ABSTRACT
`
`We introduce a new highly integrated processing chip that performs a variety of
`functions but is particularly well suited to video compression algorithms.
`Applications include multimedia PCs, virtual reality 3D graphics, full-duplex
`videoconferencing, HDTV, and color hardcopy. We have architected the Multimedia Video
`Processor, or MVP, to provide a previously unattainable level of performance from a single
`chip, yet with the programmability typically found in today's general-purpose computers.
`While advanced semiconductor design and process techniques have been used for its
`design, the key advantage of this component lies in the optimization of its architecture
`for real-time video and graphics processing. This paper analyzes video compression
`application requirements, describes the MVP architecture, and presents its potential as a
`very capable solution for a wide range of markets.
`
`INTRODUCTION
`
`The computer and consumer video industries are pursuing varied paths to offer
`cost-effective computing products which provide new forms of information and entertainment.
`Products are emerging that range from cable TV delivery of interactive digital movies to
`digital mobile offices. Digital compression and video processing at a reasonable cost are
`spurring this revolution. While algorithm developments have been important, most of the
`enabling advances lie in the availability of high-density memory and high-performance
`processing ICs. With the pending general availability of the Multimedia Video Processor, or
`MVP, in 1994, a previously unattained level of digital signal processing performance will be
`available, with all the flexibility of present-day programmable computers. Standards-based
`videoconferencing and playback of compressed digital video and audio (using Px64, JPEG, or
`MPEG "multi-standard" codec systems) will be possible with a single MVP processor, as
`well as codecs with yet-to-be-defined algorithms like model-based compression.
`However, the MVP will not only support compression; it will also handle processing of
`high-resolution video, full-motion video from sources like camcorders, digital
`audio processing, hardcopy raster image processing, and 3D graphics, all under
`software control. From this wide range of functions, we calculated that
`several billion operations per second are required to provide video-based applications on
`the desktop. Current and soon-to-appear desktop host processors like X86, Pentium,
`Alpha, and MIPS do not have the computational power to meet these demands.
`
`KEYS TO THE MVP ARCHITECTURE
`
`The MVP's unique architecture and computational power enable users to integrate these
`varied functions on a single processing component. The keys to obtaining both exceptional
`processing speeds and fully programmable features with the MVP include the use of:
`
`(1) an efficient parallel processing architecture,
`(2) fast pixel processing tuned to image, video, and graphics processing,
`(3) intelligent control of image data flow throughout the architecture,
`(4) single-chip integration without slower chip-to-chip communications.
`
`1068-0314/94 $3.00 © 1994 IEEE
`
`215
`
`PRIOR-ART _001 0815
`
`Page 1 of 10
`
` ZTE EXHIBIT 1006
`
`
`
`[Figure 1 block diagram, not recoverable from the extracted text. Legible labels: DSP Parallel
`Processors (PPn): Advanced DSP Cores; Master Processor (MP): Advanced RISC; PP0; PP1; TC; VC; JTAG.]
`
`Figure 1: MVP Block Diagram (A Single-Chip Parallel Processor)
`
`
`ALGORITHM-DIRECTED ARCHITECTURE DEFINITION
`
`Processing Requirements
`Today's proposed international video compression standards use common frequency
`domain, quantization, and entropy coding techniques to (de)compress small portions (8x8)
`of each image. While these functions demand a great deal from the encoder/decoder, many
`other varied functions remain, each with dynamic requirements which vary based on the
`type of image compressed as well as the channel rate required to maintain real-time
`operation. For optimal efficiency a processor must adapt to these dynamic needs. A
`typical average of the processing demands of the Px64 video-conferencing standard
`appears in the following table.
`
`RISC vs. MVP-PP Processing Requirements for Px64
`
`Px64 (H.261), full-duplex, full-CIF               RISC execution      MVP execution       Speed-up of
`                                                  speed (fraction     speed (fraction     MVP-PP vs. RISC
`                                                  of time)            of time)
`
`Motion Estimation, Block Matching (encode)        0.51                0.29                14
`Encoding Decisions: (1) Inter with motion,        0.004               0.009               7
`  (2) [illegible], (3) Intra
`Loop Filtering (both)                             0.092               0.116               6
`Difference image (current - predicted)            0.10                0.013               9
`Fast DCT (encode)                                 0.062               0.077               6
`Threshold/Zig-Zag [partly illegible]              0.042               0.071               5
`[illegible] Encode                                0.014               0.045               3
`IDCT (both)                                       0.161               0.226               6
`Reconstruction (both)                             0.062               0.077               5
`  (predicted + diff. image)
`[illegible] Decode & Dequantization (decode)      0.018               0.045               3
`
`TOTAL CYCLES (MIPS)                               1.00 = [illegible]  1.00 = 155          AVERAGE SPEED-UP: 7.7
`
`* Multiply counted as one instruction even though most RISCs require many cycles.
`** If the "Truncated-IDCT" algorithm were used, the IDCT would speed up again (see later).
`*** The total is equivalent to 3 MVP-PP processors (see the PP section below).
`**** Audio standards concurrently execute on the MVP-MP (see the MP section below).
`
`As we studied the computational requirements for motion estimation (51%) and DCTs
`(22%), it became quite apparent that a programmable image processor must excel at these
`functions. It is important to recognize that what is done poorly in a processor can dominate
`its performance. Since most architectural improvements would not accelerate all
`functions uniformly, we looked for special architectural features for these critical functions,
`while maintaining enough flexibility to benefit a larger class of algorithms. In the final
`analysis, a much more uniform distribution of computational loading resulted after the
`changes.
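
The observation that poorly handled functions dominate performance can be illustrated with a simple Amdahl-style estimate. The sketch below is not part of the original analysis; it assumes the RISC time fractions quoted above (51% and 22%) and borrows per-function speed-ups of 14 and 6 from the table purely as example inputs. The function name `overall_speedup` is hypothetical.

```python
def overall_speedup(accelerated):
    """Amdahl-style estimate of total speed-up.

    accelerated: list of (fraction_of_original_time, local_speedup) pairs.
    Any time not listed is assumed to run at its original speed.
    """
    untouched = 1.0 - sum(fraction for fraction, _ in accelerated)
    new_time = untouched + sum(fraction / s for fraction, s in accelerated)
    return 1.0 / new_time


# Accelerating only motion estimation (51% of the time) by 14x:
motion_only = overall_speedup([(0.51, 14.0)])

# Accelerating motion estimation by 14x and the DCTs (22%) by 6x:
motion_and_dct = overall_speedup([(0.51, 14.0), (0.22, 6.0)])

print(f"motion estimation only: {motion_only:.2f}x")
print(f"motion estimation + DCT: {motion_and_dct:.2f}x")
```

Even a 14x gain on the largest function alone yields less than a 2x overall gain under these assumptions, which is why the remaining functions must also be accelerated.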
`
`As seen in the table, the programmable image processor must perform many other
`functions well, including: bit manipulation and table look ups for entropy encoding, and
`multiply and accumulate for various types of filtering operations. To obtain good image
`quality at any channel rate and 30 frames per second, the image processor must compute
`over 1.2 billion operations per second (BOPS).
`
`The addition of audio compression (which requires higher-precision integer and possibly
`floating-point algorithms) and network communication, necessary for video conferencing
`(G.728 or G.711, H.242, H.230, H.221), further increases the scope of computational
`requirements. To reduce system cost, we propose to include support in the architecture
`for the required non-standard functions like color space conversion (YCrCb to RGB),
`decimation of the source image to CIF resolution, and variable scaling of the decompressed
`sequence. Complete implementation of compression applications such as
`videoconferencing requires over 2 BOPS from the programmable image processor.
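
As an illustration of one such non-standard function, the following minimal sketch converts a single YCrCb pixel to RGB. It assumes full-range 8-bit samples and the widely used BT.601-derived coefficients (as in JFIF); the paper does not state which variant of the conversion the MVP software uses, and the function names are hypothetical.

```python
def clamp8(value):
    """Round to the nearest integer and clamp to the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))


def ycrcb_to_rgb(y, cr, cb):
    """Convert one full-range 8-bit YCrCb pixel to RGB.

    Assumption: BT.601-derived, full-range coefficients with chroma
    centered on 128. Other variants scale and offset the inputs differently.
    """
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp8(r), clamp8(g), clamp8(b)


# Neutral chroma (128, 128) leaves a gray pixel unchanged.
print(ycrcb_to_rgb(128, 128, 128))
```

Each pixel costs a handful of multiply-accumulates plus a clamp, which is the kind of per-pixel work that adds up to the BOPS totals quoted above.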
`
`ARCHITECTURE CHOICES
`
`We considered several candidate parallel architectures for implementation of this single-chip
`video processor [Gove-92, Guttag-92]. An architecture with a mix of dedicated and
`programmable processors was initially evaluated, then discounted when no
`single dominant function was found that was necessary almost all of the time. Besides, we
`predicted that by the time the chip was completed, a new important algorithm would
`emerge. Because dedicating resources to any one function (like a DCT) costs silicon
`efficiency, we felt compelled to seek a general-purpose, well-balanced system
`solution. Several other candidates existed; however, the mix of algorithms and practical
`implementation limitations focused us on SIMD and MIMD architectures. These differ in
`the autonomy of their processors: MIMD processors each execute their own instruction
`stream, a desirable feature for any data-dependent algorithm operating in parallel.
`
`With MIMD desirable, the choice of a processor and memory interconnection architecture
`remained. Pipelined, shared-bus memory, communication port (mesh/array/hypercube),
`and crossbar fully-shared memory were considered. Pipelined memory and processors
`(systolic arrays) are typically used for video; however, they are too restrictive in the sense
`that one must know a priori the size of the memory and dynamics of the algorithm to
`prevent data contention and processor stalls. With our varied needs, this would lead to
`inefficiencies. A shared-bus memory structure would also have bottleneck problems with
`highly variable instruction and data streams and with moving results from one processor to
`another. The n-way connected communication port requires a very ordered flow of data,
`like a systolic or wavefront flow, or the application of a pixel per processor (not
`practical in a single chip). This approach works for large arrays of simple processors
`which can operate uniformly on images; however, we wanted more complex processors
`which could adapt to varying types of data, from bit graphics to floating-point
`representations. The crossbar fully-shared memory is ideally suited to these needs,
`minimizing contention and data movement and providing flexibility for many types of
`algorithms. In fact, since the crossbar operates at the processor instruction rate, this
`architecture can functionally emulate the other approaches (pipeline, shared bus ...).
`
`We not only wanted to provide this order-of-magnitude performance increase; the goal
`was also to apply a traditional computer model of programmable processing and a large memory
`to applications with integrated image, graphics, video, and audio processing, or image
`computing. As shown in Figure #2, titled "MVP System Architecture", replacing the
`processing and memory pipeline of conventional video systems with the single video
`processor and large memory system model yields tremendous application flexibility. In
`effect, the system can reconfigure itself with software from video conferencing to playing
`CD movies, just as a PC would reconfigure from a spreadsheet to a video game.
`
`
`
`
`Figure 2: The MVP "System" Architecture.
`[Figure not recoverable from the extracted text. Legible labels: interfaces to image, audio, and
`data from computer memory (disk, photo-CD ...); data from networks (phone or local digital);
`image/video display on workstation monitor; live video & audio (cameras, VCRs); display on
`TV monitors.]
`
`THE MVP ARCHITECTURE
`
`The Multimedia Video Processor, or MVP, represents the next generation of digital signal
`processors. The MVP can be technically described as a single-chip, crossbar shared-memory,
`heterogeneous MIMD multiprocessor. It combines RISC and advanced DSP
`processing in one parallel architecture with unique features for each. Current RISC
`processors typically use instruction pipelining, numerous registers, and a detached
`floating-point processor. On the other hand, current DSPs are optimized for one-dimensional
`multiply-accumulate functions. Newer DSPs have floating-point capabilities, yet most
`imaging and video needs only integer operations. DSPs usually have fewer registers than
`RISC processors and have direct memory access (DMA) with limited capabilities.
`
`The MVP combines the best features of RISC and DSP in parallel and adds other features
`to offer unprecedented power and flexibility. The heart of an image or video chip is its
`capability to process 2D signals. The MVP has features for 2D DSP-like processing,
`including multiply-accumulate operations. The on-chip memory and register characteristics
`of the MVP were optimized for image computing algorithms, preventing time-consuming
`cache misses or swapping of register contents. Multidimensional external memory access
`and double buffering minimize the typical memory bottleneck of current DSP solutions.
`An internal memory crossbar provides extremely efficient synchronization and
`communication among multiple processors. A very high-performance RISC processor is
`integrated on the chip, providing intelligent control of the DSP-like processors. Also
`integrated into the chip, a new floating-point architecture can act as a co-processor to any of
`the DSP-like processors or the RISC processor. By analysis of the algorithms, the
`required mix of integer ops to floating-point ops was somewhere between 8:1 and 4:1, a
`balance which the MVP supports. The entire collection of processors and memory is
`configured as a MIMD architecture for ease of programming and high performance for all
`image and video computing applications. This MIMD data and control supports both data
`dependent algorithms like object feature matching or Huffman coding and
`traditional data-independent SIMD operations like convolution.
`
`To prevent contention for memory or register access, a very wide instruction set in the
`DSPs and a large on-chip crossbarred memory are used in the MVP. This flexibility permits
`the programmer to produce highly parallel optimized code. A performance penalty may
`result if only one highly serial task is performed continuously; however, the very nature of
`image, video, graphics, and audio processing, with varied concurrent and complex
`processing, prevents this from occurring. The MVP integrates more functions than ever
`before into one chip, while avoiding the compromises of other architectures.
`
`Detailed Architecture Description:
`Figure #1, titled "MVP Block Diagram", shows the MVP chip architecture. The Master
`Processor (MP) provides a RISC processor for simple user interface, sequential
`processing, and orchestration of multiple concurrent tasks operating on the entire MVP.
`The DSP Parallel Processors (PP), of which 4 will be designed in the first version of the
`MVP, provide highly optimized image/video/graphics/audio processing capabilities. The
`Transfer Controller (TC) intelligently moves data and instructions on and off the MVP. All
`of these processors are locally interconnected with a crossbar to 25 on-chip 2 KByte SRAM
`modules. Other features include dual video frame timing generators (VC) and JTAG test
`and emulation circuits.
`
`With five 32-bit programmable processors operating at a targeted rate of 50 MHz
`and numerous parallel operations performed in each processor, over 2 billion operations
`per second result. In addition, 100 MFLOPS (fully IEEE-754) can occur. The peak data
`transfer rate is 400 MBytes/second, adequate for many video applications. The
`internal bandwidth over the crossbar between on-chip memory and processors is 2.4
`GBytes/second.
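
The figures above can be cross-checked with simple arithmetic. The sketch below derives per-cycle quantities that the paper does not state directly; it assumes 1 KByte = 1024 bytes, 1 MByte = 10^6 bytes, and 1 GByte = 10^9 bytes, and all constant and function names are illustrative.

```python
CLOCK_HZ = 50_000_000                  # targeted 50 MHz rate
SRAM_MODULES = 25                      # on-chip SRAM modules
SRAM_MODULE_BYTES = 2 * 1024           # 2 KByte each (assuming 1024-byte KBytes)
PEAK_TRANSFER_BYTES_PER_S = 400_000_000     # 400 MBytes/second
CROSSBAR_BYTES_PER_S = 2_400_000_000        # 2.4 GBytes/second


def total_on_chip_sram_bytes():
    """Total crossbar-connected SRAM on the chip."""
    return SRAM_MODULES * SRAM_MODULE_BYTES


def bytes_per_cycle(bytes_per_second, clock_hz=CLOCK_HZ):
    """Average bytes moved per clock cycle at the given rate."""
    return bytes_per_second / clock_hz


print(total_on_chip_sram_bytes())                  # total SRAM in bytes
print(bytes_per_cycle(PEAK_TRANSFER_BYTES_PER_S))  # bytes per cycle at peak transfer rate
print(bytes_per_cycle(CROSSBAR_BYTES_PER_S))       # bytes per cycle over the crossbar
```

Under these assumptions the on-chip SRAM totals 50 KBytes, the peak transfer rate corresponds to 8 bytes (64 bits) per cycle, and the crossbar carries 48 bytes per cycle, six times the peak transfer figure.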
`
`DSP PARALLEL PROCESSORS (PP)
`
`The PP has many powerful features beyond those found in conventional DSPs. Practically
`all video algorithms benefit from these features. Most of the features were added to permit
`scalability within the PP, supporting either many simple functions (like bit ops) in one cycle
`or fewer operations at higher precision (like 32 bits) with the same hardware. Each feature
`is listed below with its advantage:
`• 44 user registers:
`- ease of programming/compiling and fast parallel functions.
`• Single-cycle access into crossbar memory expands effective registers to 34K:
`- flexibility.
`• Three-level, no overhead instruction looping:
`- programming flexibility and faster tight loops (usually 20-30%)
`• Double parallel transfer from memory with address update:
`- most algorithms need two pixels loaded per cycle.
`• Three-operand ALU arithmetic and logical operations:
`- double speed correlation and windows support.
`• Splittable multiply (8x8=16 or 16x16=32):
`- double speed pixel operations.
`• Word/Halfword/Byte multiple arithmetic:
`- 4x on algorithms like motion estimation and 2x on fast DCTs.
`• Flexible data path:
`- masking, merging, rotating ... for bit stream coding (like Huffman).
`• General-purpose use of address adders:
`- up to 6x number of adds in one cycle.
`• Conditional operations prevent need for branching (and possible pipeline stalls):
`- adaptive algorithms will operate faster (like adaptive thresholding).
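
The "multiple arithmetic" feature in the list above, where one 32-bit operation acts on four independent bytes, can be emulated in software to show the idea. The sketch below is only an illustration of packed byte addition on an ordinary 32-bit word; it is not the PP's instruction set, and the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
LOW7 = 0x7F7F7F7F    # low seven bits of each byte lane
HIGH1 = 0x80808080   # top bit of each byte lane


def pack_bytes(values):
    """Pack four 8-bit values (most significant first) into one 32-bit word."""
    word = 0
    for v in values:
        word = (word << 8) | (v & 0xFF)
    return word


def unpack_bytes(word):
    """Split a 32-bit word into four 8-bit values (most significant first)."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def packed_add_bytes(a, b):
    """Add four byte lanes at once, wrapping modulo 256 in each lane.

    The low seven bits are added normally, so a carry can reach bit 7 of
    its own lane but never the next lane. Bit 7 is then fixed up with XOR.
    """
    low_sum = (a & LOW7) + (b & LOW7)
    return (low_sum ^ ((a ^ b) & HIGH1)) & MASK32


a = pack_bytes([250, 1, 128, 255])
b = pack_bytes([10, 2, 128, 1])
print(unpack_bytes(packed_add_bytes(a, b)))
```

One word-wide operation replaces four byte-wide additions here, which is the source of the 4x figure quoted for pixel-oriented algorithms when hardware provides the lane separation directly.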
`
`As a result, as many as 15 RISC operations will be performed in one PP cycle. When
`multiplied by the number of PPs and added to the MP and FPU operations, a formidable
`number results. In addition, since the C compiler also influenced the architecture of the
`PP, many of these features will automatically compile into fast code; many users of the
`MVP will not need to understand the PP architecture to take advantage of its performance.
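
The multiplication described above can be written out as a rough upper bound. The sketch assumes the figures quoted in this paper (4 PPs, a 50 MHz rate, and at most 15 RISC-equivalent operations per PP cycle) and ignores the MP and FPU contributions; the names are illustrative, and real code would rarely sustain the maximum in every cycle.

```python
PP_COUNT = 4                     # PPs in the first MVP version
CLOCK_HZ = 50_000_000            # targeted 50 MHz rate
MAX_OPS_PER_PP_CYCLE = 15        # "as many as 15 RISC operations" per cycle


def peak_pp_ops_per_second(pp_count=PP_COUNT,
                           ops_per_cycle=MAX_OPS_PER_PP_CYCLE,
                           clock_hz=CLOCK_HZ):
    """Upper bound on PP operations per second, excluding the MP and FPU."""
    return pp_count * ops_per_cycle * clock_hz


print(peak_pp_ops_per_second())
```

Under these assumptions the PPs alone peak at 3 billion operations per second, consistent with the "over 2 billion" figure quoted for the chip.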
`
`MASTER PROCESSOR (MP)
`
`The MP is a general-purpose RISC processor with an integral IEEE-compatible
`floating-point unit. A 32-bit instruction is accessed from a 4 KByte instruction cache. Data loads
`can be 8, 16, 32, or 64 bits from a 4KByte data cache or from any data module via the
`crossbar. The MP has thirty-one 32-bit usable registers. Uncommon features include:
`• Register files common to floating-point & integer operations.
`• Scoreboard keeps track of result of loads and FPU, preventing use until updated.
`• Addressing modes support optional updating of base-address register with
`results of the address computation.
`• Special FPU instruction permits new multiply, add/subt, & increment each cycle.
`• Left-most and Right-most one logic.
`• Both endians supported.
`
`Since the MP was designed to efficiently execute C programs and has added hardware for
`bitstream processing, it performs exceptionally well as the controller and data interpretation
`processor. The floating point capability accelerates and simplifies programming of high
`precision applications like medical imaging and 3D graphics.
`
`SHARED MEMORY & TRANSFER CONTROLLER (TC)
`
`Much of the advantage of the MVP architecture lies in the memory and data I/O
`architecture. Each processor and memory is fully interconnected through the crossbar and
`switchable at instruction rates. With greater than 500 signal lines switching at nanosecond
`speeds, the crossbarred memory architecture is only possible with single-chip
`implementation. With adequate on-chip memory and the ability to reconnect the next
`processor to the data memory, rather than moving the data to another memory, the on-chip
`data is not required to move as often. In effect, the original requirement of billions of
`bytes/second of data transfer is reduced to only hundreds of MBytes/second. This model works
`well as long as the algorithm uses localized regions of data (patches, blocks,
`neighborhoods, rows ...), each of which "fits" into the on-chip memory and is accessed in
`repeated or predictable patterns. While this usually occurs with image processing, an
`extremely intelligent transfer controller was architected to help ensure the validity of this
`assumption. The TC has numerous modes of transferring data on- or off-chip, each
`optimized for a particular type of dataflow (block, patch, fat line, indexed or guided
`patches ...). Most importantly, the on-chip SRAM memory was architected with sufficient
`size and modularity to permit double-buffering of data I/O on and off the chip, while the
`on-chip processors access the other on-chip memory modules. In effect, practically no
`overhead is required for video I/O. Many convenient methods were designed into the TC
`to prioritize these accesses. In addition, we included support for most commodity memory
`components (VRAM, SRAM, DRAM). Finally, we devised several methods to mitigate
`any contention between the processors for a particular memory module. Both round-robin
`and fixed-priority schemes are available to permit developers flexibility in structuring
`their algorithms to reduce contention. For the many image and video algorithms
`developed for the MVP thus far, contention has not been a problem.
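
The double-buffering scheme described above can be sketched as a ping-pong schedule: while a processor works on one buffer, the next block is loaded into the other. The code below is a sequential simulation under that assumption, not the TC's actual programming interface; the function name and the schedule format are hypothetical.

```python
def run_double_buffered(blocks, kernel):
    """Simulate ping-pong buffering over a sequence of data blocks.

    Returns (results, schedule). Each schedule entry is a pair
    (index of block being loaded, index of block being computed);
    None means that side is idle in that step. On real hardware the two
    activities in one step would proceed concurrently.
    """
    buffers = [None, None]
    results = []
    schedule = []
    if not blocks:
        return results, schedule

    # Initial fill: nothing is available to compute on yet.
    buffers[0] = list(blocks[0])
    schedule.append((0, None))

    for i in range(len(blocks)):
        compute_buf = i % 2
        load_buf = 1 - compute_buf
        next_index = i + 1 if i + 1 < len(blocks) else None
        if next_index is not None:
            # Stand-in for the transfer controller filling the idle buffer.
            buffers[load_buf] = list(blocks[next_index])
        # Stand-in for a processor working on the other buffer.
        results.append(kernel(buffers[compute_buf]))
        schedule.append((next_index, i))
    return results, schedule


results, schedule = run_double_buffered([[1, 2], [3, 4], [5, 6]], sum)
print(results)
print(schedule)
```

Only the first load and the last computation are not overlapped, so for long streams the transfer time is almost entirely hidden behind computation, provided each block fits in one buffer.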
`
`Another advantage of the crossbar architecture is expandability. We can design many
`different MVP chips, as a function of the number of PPs. We simply slice the architecture,
`cutting or adding PPs and memory modules. Conceptually, the advantage of this approach
`is that, with the same package and pin-out, several different performance and price points
`can be offered. A range of applications may require a range of different MVP chips.
`Applications which require CCIR 601 studio-quality video and/or multifunction processing
`(graphics / audio / video) would most likely require an MVP with 4 processors. On the
`other hand, a more dedicated or single-function application like graphics may require
`fewer PPs. In addition, if only limited-resolution video (QCIF) processing is necessary,
`again a small number of PPs could suffice. We anticipate various versions of the MVP in
`the future.
`
`VIDEO CONTROLLER (VC)
`
`In addition, the MVP has two programmable timing controllers for generation of video and
`other timing signals. As an example, video frame grabbing and display require many pixel,
`horizontal, and vertical signals for synchronization of the external logic in the system. The
`MVP has internal logic to generate those signals under program control, relieving the
`system designer from designing external logic to perform those functions.
`
`NEW DCT ALGORITHMS FOR COMPRESSION
`SPEED-UP WITH USE OF A PROGRAMMABLE ARCHITECTURE
`
`One advantage of the programmable compression chip is the optimization possible by
`selecting the least computationally demanding DCT algorithm that will meet the accuracy
`required of the application. For example, fast DCT algorithms like those of Lee and Chen
`[Lee-84, Chen-77] have considerable advantage with respect to traditional matrix multiply
`approaches (with a factor of 5 or more speedup). Separability of the 2D DCT is generally
`used for decomposition of DCTs (with successive processing of the individual rows and
`columns of an image). The size of the DCT directly influences the benefit of separability;
`however, with the 8x8 DCTs of most standards, a definite speedup results. The Lee
`algorithm tends to be easy to implement and achieves faster computation, although it has
`accuracy issues. The Chen algorithm is harder to implement and is computationally slower,
`but has good accuracy. Depending on the available processing bandwidth, the encoder
`can select an appropriate DCT or IDCT algorithm to perform the task. If errors result,
`different coding decisions result and either lower SNRs or compression ratios occur.
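
The separability property mentioned above can be demonstrated with a short reference implementation. The sketch below uses the plain textbook DCT-II with orthonormal scaling for each 1D transform, not the Lee or Chen fast algorithms, and the function names are illustrative. It checks that row-then-column processing matches the direct 2D definition.

```python
import math


def basis(n, k, i):
    """Orthonormal DCT-II basis value for frequency k at sample i."""
    scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    return scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))


def dct_1d(v):
    """Direct (matrix-multiply style) 1D DCT-II of a sequence."""
    n = len(v)
    return [sum(v[i] * basis(n, k, i) for i in range(n)) for k in range(n)]


def dct_2d_separable(block):
    """2D DCT computed as 1D DCTs over rows, then over columns."""
    n = len(block)
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d([rows[r][c] for r in range(n)]) for c in range(n)]
    # cols[c][k] holds vertical frequency k of column c; transpose back.
    return [[cols[c][k] for c in range(n)] for k in range(n)]


def dct_2d_direct(block):
    """2D DCT from the four-nested-loop definition, for comparison."""
    n = len(block)
    return [[sum(block[r][c] * basis(n, u, r) * basis(n, v, c)
                 for r in range(n) for c in range(n))
             for v in range(n)]
            for u in range(n)]


ramp = [[r * 8 + c for c in range(8)] for r in range(8)]
separable = dct_2d_separable(ramp)
direct = dct_2d_direct(ramp)
max_diff = max(abs(separable[u][v] - direct[u][v])
               for u in range(8) for v in range(8))
print(max_diff < 1e-9)
```

For an 8x8 block, the direct definition needs 64 multiply-accumulates for each of 64 outputs (4096 in total), while the row-column form needs 16 one-dimensional transforms of 64 multiply-accumulates each (1024 in total), before any fast 1D algorithm is applied.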
`
`In addition, we devised a "Truncated-IDCT" algorithm to utilize the advantages of a
`programmable architecture. Since the DCT, quantizer, and thresholding operations seek to
`minimize the population of selective frequency coefficients (for high compression),
`statistically, most of the high-frequency coefficients are zero valued. Therefore, the
`conventional IDCT will act on 8x8 matrices with a high percentage of zero-valued inputs.
`We can then significantly reduce the number of IDCT operations performed by not
`executing the zero-valued multiplies and adds (similar work has been reported [McMillan-92]).
`This is only possible with software-based IDCTs. In implementation, the program
`adaptively truncates the 8x1 IDCT summation in the vertical direction based on the
`run-length encoded input values. Further reduction by selecting a 4x1 summation when
`appropriate also shortens the process, although not as frequently. With this approach a
`factor of 3 or more speed-up on IDCTs will usually occur.
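
The idea of truncating the IDCT summation can be sketched for a single 8x1 column. The code below is an illustration only: it assumes the position of the last nonzero coefficient is found by scanning the column (the paper derives it from the run-length encoded input), uses the orthonormal textbook IDCT, and counts one multiply-accumulate per term. All names are hypothetical.

```python
import math


def idct_1d(coeffs, terms=None):
    """Inverse of the orthonormal DCT-II, summing only the first `terms` coefficients."""
    n = len(coeffs)
    if terms is None:
        terms = n
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(terms):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            acc += scale * coeffs[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(acc)
    return out


def last_nonzero_index(coeffs):
    """Index of the last nonzero coefficient, or -1 if all are zero."""
    for k in range(len(coeffs) - 1, -1, -1):
        if coeffs[k] != 0:
            return k
    return -1


def truncated_idct_1d(coeffs):
    """IDCT that stops the summation after the last nonzero coefficient.

    Returns (samples, multiply_accumulate_count).
    """
    terms = last_nonzero_index(coeffs) + 1
    return idct_1d(coeffs, terms), terms * len(coeffs)


column = [80, -12, 5, 0, 0, 0, 0, 0]   # typical shape: energy in low frequencies
full = idct_1d(column)
short, macs = truncated_idct_1d(column)
print(macs, "multiply-accumulates instead of", len(column) ** 2)
```

The truncated result is identical to the full IDCT because the skipped terms contribute exactly zero; only the amount of work changes, and it varies with the data, which is why a programmable implementation is needed.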
`
`TOOLS
`
`Advances in video compression have been limited by the availability of tools to develop
`software and hardware. With the MVP, TI offers a range of software tools and direct
`on-chip support for in-circuit debug. A real-time executive, C++ compilers, an algebraic
`assembler, a windowed high-level language debugger (with JTAG emulation hardware on
`chip), and a library of primitives/applications, all the tools familiar to computer application
`developers, will now be available for development of video applications.
`
`The software model for the MVP is based on two levels. The primary level includes the
`Master Processor acting as a director and scheduler of the MVP's parallelism. The
`Executive operates on the MP, performing those supervisory tasks. The Executive can
`dispatch tasks for operation in pipeline, parallel, or any other arrangement on any processor
`within the MVP. Under that, a level which actually performs the tasks on each processor is
`accessed by either: (1) a library of primitives, (2) application tools for programming in
`assembly, or (3) a high-level language compiler. Each of these methods has advantages,
`with varying performance and skill level required to code the chip, as a function of the
`particular application. Only software convention, not the hardware, restricts the use of any
`processor as the master or slave processor.
`
`COMPETING VIDEO COMPRESSION CHIP ARCHITECTURES
`
`Several semiconductor companies have reported activity in video compression chip or chip
`set solutions [Bolton-93, Konstantinides-92]. Most chip manufacturers are proposing
`hardwired or parameterized architectures, without C-level programmability. Our MVP is
`an exception. In addition, most of the other "programmable" approaches are based on an
`architecture which integrates dedicated logic modules, like DCTs and Motion Estimators,
`with only the controller programmable. This limits their efficiency, since the silicon devoted
`to those functions must always be kept busy with those functions to justify its cost. On the
`contrary, the MVP architecture has no dedicated logic, permitting user balancing of silicon
`based on the varied and dynamic computational demands of compression. Other
`researchers have recently described simulations which support our position that universally
`programmable architectures are competitive solutions when compared with dedicated or
`hybrid architectures for video decompression [Mayer-93].
`
`Many different architectures are proposed, in development, or currently available; however,
`none except the MVP has the flexibility and computational performance to meet the
`complete demands of truly integrated digital video on the desktop, including the complete
`concert of real-time video & audio compression with image & 3D graphics processing.
`These demands include not only compression and decompression, but also system-level
`bit-stream control, video scaling, error correction, and even audio echo cancellation.
`Architecture limitations and transistor counts limit other chips to subsets of these functions.
`
`CONCLUSION
`
`The MVP is a monolithic single-chip parallel processor that performs compression
`processing, audio & video processing, 3D graphics, and other functions, even at the same time.
`Over 2 billion operations are performed per second. This dramatic performance boost will
`enable a wide range of new applications, including desktop interactive digital video.
`Integrating fully-programmable parallel DSP processors with a RISC processor on one
`chip provides software flexibility and system adaptability. A new parallel architecture,
`using a crossbar network to couple the processors and large on-chip SRAMs, and with
`MIMD (Multiple Instruction Multiple Data) operations, yields extremely high efficiency for
`
`most image, graphics, and video algorithms. Software tools like real-time executives,
`assemblers and compilers all help bring a familiar computer programming model to
`multidimensional signal processing. This new technology frees developers of compression
`algorithms to optimize implementations of standard video & audio compression algorithms,
`without the restrictions found in today's compression chips (those which are limited to
`current interpretations or versions of the standards). In addition, algorithm developers can
`implement future compression algorithms, without the difficulties of developing new chips
`or adapting existing chips.
`
`The MVP supports a wide range of open standards for video compression and image
`computing. The variation permitted within each standard to promote creative and distinguishing
`advantages in the marketplace, and the constant urge to optimize the standard for a particular
`range of markets, both work against fixed hardware solutions. This programmable,
`integrated solution gives system designers the flexibility to develop competitive algorithms
`as well as adapt to emerging standards.
`
`ACKNOWLEDGMENTS
`
`The author wishes to thank Jeremiah Golston, Dr. Chris Read, and Dr. V. Venkateswar for
`compression algorithm work relating to the MVP. In addition, thanks to the MVP Program
`Manager, Walt Bonneau, for developing and motivating a "world-class" team. A special
`thanks to my original co-MVP-architects, Keith Balmer, Karl Guttag, and Nick
`Ing-Simmons. Finally, thanks to the entire "Team-MVP"!
`
`REFERENCES
`
`[Bolton-93] Bolton, M. "A Family of MPEG Video Encoder and Decoder Chips", IEEE
`Proceeding of Conference on Hot Chips, 1993.
`
`[Chen-77] Chen, W.H., C.H. Smith, and S.C. Fralick, "A Fast Computational Algorithm
`for the Discrete Cosine Transform", IEEE Transactions on Communications, Vol. 25, pp.
`1004-1009, Sept. 1977.
`
`[Gove-92] Gove, R.J., "Architectures for Single-Chip Image Computing", SPIE
`Proceedings of Conf. on Image Processing and Interchange, San Jose, Ca., Feb 1992.
`
`[Guttag-92] Guttag, K.M., R.J. Gove, & J.R. VanAken, "A Single-Chip Multiprocessor
`For Multimedia: The MVP", IEEE Computer Graphics & Applications, pp.53-64, 11/92.
`
`[Konstantinides-92] K. Konstantinides & V. Bhaskaran, "Monolithic Architectures for
`Image Processing & Compression", IEEE Computer Graphics & Applications, pp 75-86,
`Nov. 1992.
`
`[Lee-84] B.G. Lee, "A New Algorithm for the Discrete Cosine Transform", IEEE Trans.
`on Acoustics, Speech, and Signal Processing, Vol. 32, pp. 1243-1245, 1984.
`
`[Mayer-93] A. C. Mayer, "The Architecture of a Processor Array for Video
`Decompression", IEEE Trans. on Consumer Elect., Vol. 39, No. 3, pp. 565-569, Aug. 1993.
`
`[McMillan-92] McMillan, L.& L. Westover, "A Forward-Mapping Realization of the
`Inverse Discrete Cosine Transform", IEEE Proc. of the Data Compression Conference, pp
`219-228, 1992.
`