The MVP: A Highly-Integrated Video Compression Chip
Robert I. Gove
Texas Instruments, Inc.
Dallas, Texas 75265

ABSTRACT

We introduce a new highly-integrated processing chip for performing a variety of functions; however, this chip is particularly well suited for video compression algorithms. Applications include multimedia PCs, virtual reality 3D graphics, full-duplex videoconferencing, HDTV, and color hardcopy. We have architected the Multimedia Video Processor, or MVP, to provide a yet unattainable level of performance from a single chip, although with the programmability typically found in today's general-purpose computers. While advanced semiconductor design and process techniques have been used for its design, the key to the advantage of this component lies in optimization of the architecture for real-time video and graphics processing. This paper will analyze video compression application requirements, describe the MVP architecture, and pose its potential as a very capable solution for a wide range of markets.

INTRODUCTION

The computer and consumer video industries are pursuing varied paths to offer cost-effective computing products which provide new forms of information and entertainment. Products are emerging that range from cable TV delivery of interactive digital movies to digital mobile offices. Digital compression and video processing at a reasonable cost are spurring this revolution. While algorithm developments have been important, most of the enabling advances lie in the availability of high-density memory and high-performance processing ICs. With the pending general availability of the Multimedia Video Processor, or MVP, in 1994, a yet unattained level of digital signal processing performance will be available, with all the flexibility of present-day programmable computers. Standards-based video-conferencing and playback of compressed digital video and audio (using Px64, JPEG or MPEG "multi-standard" codec systems) with a single MVP processor will be possible, as well as codecs with yet-to-be-defined algorithms like model-based compression. However, not only will the MVP support compression, it will also handle processing of high-resolution video, full-motion video processing from sources like camcorders, digital audio processing, hardcopy raster image processing, and 3D graphics, all under software control and generation. From this wide range of functions, we calculated that several billion operations per second are required to provide video-based applications on the desktop. Current and soon-to-appear desktop host processors like X86, Pentium, Alpha, and MIPS do not have the computational power to meet these demands.

KEYS TO THE MVP ARCHITECTURE

The MVP's unique architecture and computational power enable users to integrate these varied functions on a single processing component. The keys to obtaining both exceptional processing speeds and fully-programmable features with the MVP include the use of:

(1) an efficient parallel processing architecture,
(2) fast pixel processing tuned to image, video, and graphics processing,
(3) [text illegible in the source scan],
(4) single-chip integration without slower chip-to-chip communications.

1068-0314/94 $3.00 © 1994 IEEE
`HTC-LG-SAMSUNG EXHIBIT 1006
Figure 1: MVP Block Diagram (A Single-Chip Parallel Processor). [Block diagram; legible labels: DSP Parallel Processors (PPn): Advanced DSP Cores; Master Processor (MP): Advanced RISC; on-chip RAM modules.]

ALGORITHM-DIRECTED ARCHITECTURE DEFINITION

Processing Requirements
Today's proposed international video compression standards use common frequency domain, quantization, and entropy coding techniques to (de)compress small portions (8x8) of each image. While these functions demand a great deal from the encoder/decoder, many other varied functions remain, each with dynamic requirements which vary based on the type of image compressed as well as the channel rate required to maintain real-time operation. For optimal efficiency a processor must adapt to these dynamic needs. A typical average of the processing demands of the Px64 video-conferencing standard appears in the following table.

RISC vs. MVP-PP Processing Requirements for Px64

[Table body illegible in the source scan; only the footnotes are recoverable.]

* Multiply counted as one instruction even though most RISCs require many cycles.
** If the "Truncated-IDCT" algorithm was used, IDCTs speed up again (see later).
*** The total is equivalent to 3 MVP-PP processors (see the PP section below).
**** Audio standards concurrently execute on the MVP-MP (see the MP section below).

As we studied the computational requirements for motion estimation (51%) and DCTs (22%), it became quite apparent that a programmable image processor must excel at these functions. It is important to recognize that what's done poorly in a processor can dominate its performance. Since most architectural improvements would not accelerate all functions uniformly, we looked for special architectural features for these critical functions, while maintaining enough flexibility to benefit a larger class of algorithms. In final analysis, a much more uniform distribution of computational loading resulted after the changes.

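The remark that what is done poorly can dominate performance can be illustrated with an Amdahl-style estimate. The sketch below is illustrative only: the 51% and 22% workload shares come from the text, while the 4x and 2x acceleration factors and the name `overall_speedup` are assumptions chosen for the example, not values taken from the table.

```python
def overall_speedup(fractions, speedups):
    """Amdahl-style estimate of total speed-up.

    fractions: share of the original run time spent in each function.
    speedups:  assumed acceleration factor for each accelerated function.
    Any share not listed in `speedups` is left at its original speed.
    """
    accelerated = sum(fractions[name] / speedups[name] for name in speedups)
    untouched = 1.0 - sum(fractions[name] for name in speedups)
    return 1.0 / (accelerated + untouched)


# Shares quoted in the text for Px64 encoding.
px64_fractions = {"motion_estimation": 0.51, "dct": 0.22}

# Hypothetical acceleration factors (assumption, for illustration only).
assumed_speedups = {"motion_estimation": 4.0, "dct": 2.0}

total = overall_speedup(px64_fractions, assumed_speedups)
print(f"overall speed-up: {total:.2f}x")
```

Under these assumed factors the overall gain is only about 2x, and the functions that were not accelerated grow from 27% to roughly 53% of the remaining run time, which is why the remaining functions also need to run well.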
As seen in the table, the programmable image processor must perform many other functions well, including bit manipulation and table look-ups for entropy encoding, and multiply and accumulate for various types of filtering operations. To obtain good image quality at any channel rate and 30 frames per second, the image processor must compute over 1.2 billion operations per second (BOPS).

The addition of audio compression (which requires higher precision integer and possibly floating-point algorithms) and network communication, necessary for video conferencing (G.728 or G.711, H.242, H.230, H.221), further increases the scope of computational requirements. To reduce the system cost, we propose to include support in the architecture for the required non-standard functions like color space conversion (YCbCr to RGB), decimation of the source image to CIF resolution, and variable scaling of the decompressed sequence. Complete implementation of compression applications such as video-conferencing requires over 2 BOPS of the programmable image processor.

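As an illustration of the color space conversion mentioned above, the following sketch converts one YCbCr pixel to RGB. The text does not say which variant of the conversion is intended; the full-range coefficients used here (the common JPEG/JFIF form of the CCIR 601 matrix) and the helper names `clamp8` and `ycbcr_to_rgb` are assumptions made for the example.

```python
def clamp8(value):
    """Round to the nearest integer and clamp to the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))


def ycbcr_to_rgb(y, cb, cr):
    """Convert one full-range YCbCr pixel to RGB.

    Assumes 8-bit samples with chroma centered on 128 (JFIF convention).
    """
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp8(r), clamp8(g), clamp8(b)


# Neutral chroma leaves a gray pixel unchanged.
print(ycbcr_to_rgb(128, 128, 128))
```

Each output pixel costs a handful of multiply-accumulate operations, so at video rates this kind of per-pixel function adds noticeably to the operation budget discussed above.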
ARCHITECTURE CHOICES

We considered several candidate parallel architectures for implementation of this single-chip video processor [Gove-92, Guttag-92]. An architecture with a mix of dedicated and programmable processors was initially evaluated, then subsequently discounted when no single dominant function was found that was necessary almost all of the time. Besides, we predicted that by the time the chip was completed, a new important algorithm would emerge. Given the loss of silicon efficiency from dedicating resources to any one function (like a DCT), we felt compelled to seek a general-purpose, well-balanced system solution. Several other candidates existed; however, the mix of algorithms and practical implementation limitations focused us on SIMD and MIMD architectures. These differ in the autonomy of the processors' functions, with MIMD autonomy being a desirable feature for any data-dependent algorithm operating in parallel.

With MIMD desirable, the choice of a processor and memory interconnection architecture remained. Pipelined, shared bus memory, communication port (mesh/array/hypercube), and crossbar fully-shared memory were considered. Pipelined memory and processors (systolic arrays) are typically used for video; however, they're too restrictive in the sense that one must a priori know the size of the memory and dynamics of the algorithm to prevent data contention and processor stalls. With our varied needs, this would lead to inefficiencies. A shared-bus memory structure would also have bottleneck problems with highly variable instruction and data streams and moving of results from one processor to the other. The n-way connected communication port requires a very ordered flow of data, like a systolic or wavefront flow of data, or the application of a pixel per processor (not practical in a single chip). This approach works for large arrays of simple processors which can operate uniformly on images; however, we wanted more complex processors which could adapt to varying types of data, from bit graphics to floating-point representations. The crossbar fully-shared memory is ideally suited to these needs, minimizing contention and data movement and providing flexibility for many types of algorithms. In fact, since the crossbar operates at the processor instruction rates, this architecture can functionally emulate the other approaches (pipeline, shared bus...).

We not only wanted to provide this order of magnitude performance increase, but also to apply a traditional computer model of programmable processing and a large memory to applications with integrated image, graphics, video and audio processing, or image computing. As shown in Figure #2, titled "MVP System Architecture", replacing the processing and memory pipeline of conventional video systems with the single video processor and large memory system model yields tremendous application flexibility. In effect, the system can re-configure itself with software from video conferencing to playing CD movies, just as a PC would re-configure from a spreadsheet to a video game.

Figure 2: The MVP "System" Architecture. [System diagram; partially legible labels: input/output interface; memory holding instructions, multiple images, and audio; display on TV or monitors; live video and audio input.]

THE MVP ARCHITECTURE

The Multimedia Video Processor, or MVP, represents the next generation of digital signal processors. The MVP can be technically described as a single-chip crossbar shared memory heterogeneous MIMD multiprocessor. It combines RISC and advanced DSP processing in one parallel architecture with unique features for each. Current RISC processors typically use instruction pipelining, numerous registers and a detached floating-point processor. On the other hand, current DSPs are optimized for one-dimensional multiply-accumulate functions. Newer DSPs have floating-point capabilities, yet most imaging and video only needs integer operations. DSPs usually have fewer registers than RISC and have direct memory accesses (DMA) with limited capabilities.

The MVP combines the best features of RISC and DSP in parallel and adds other features to offer unprecedented power and flexibility. The heart of an image or video chip is its capability to process 2D signals. The MVP has features for 2D DSP-like processing, including multiply-accumulate operations. The on-chip memory and register characteristics of the MVP were optimized for image computing algorithms, preventing time-consuming cache misses or swapping of register contents. Multidimensional external memory access and double buffering minimize the typical memory bottleneck of current DSP solutions. An internal memory crossbar provides extremely efficient synchronization and communication of multiple processors. A very high-performance RISC processor is integrated on the chip, providing intelligent control of the DSP-like processors. Also integrated into the chip is a new floating-point architecture that can act as a co-processor to any of the DSP-like processors or the RISC processor. By analysis of the algorithms, the required mix of integer ops to floating-point ops was somewhere between 8:1 and 4:1, a balance which the MVP supports. The entire collection of processors and memory is configured as a MIMD architecture for ease of programming and high performance for all image and video computing applications. This MIMD data and control supports both data-dependent algorithms like object feature matching or Huffman coding and traditional data-independent SIMD operations like convolution.

To prevent contention for memory or register access, a very wide instruction set in the [text illegible in the source scan] the programmer to produce highly-parallel optimized code. A performance penalty may result if only one highly-serial task is performed continuously; however, the very nature of image, video, graphics, and audio processing, with varied concurrent and complex processing, prevents this from occurring. The MVP integrates more functions than ever before into one chip, while avoiding the compromises of other architectures.

Detailed Architecture Description:

Figure #1, titled "MVP Block Diagram", shows the MVP chip architecture. The Master Processor (MP) provides a RISC processor for simple user interface, sequential processing, and orchestration of multiple concurrent tasks operating on the entire MVP. The DSP Parallel Processors (PP), of which 4 will be designed in the first version of the MVP, provide highly-optimized image/video/graphics/audio processing capabilities. The Transfer Controller (TC) intelligently moves data and instructions on and off the MVP. All of these processors are locally interconnected with a crossbar to 25 on-chip 2Kbyte SRAM modules. Other features include dual video frame timing generators (VC) and JTAG test and emulation circuits.

With five 32-bit programmable processors operating at a targeted rate of 50 MHz and numerous parallel operations performed in each processor, over 2 billion operations per second result. In addition, 100 MFLOPS (fully IEEE-754) can occur. The peak data transfer rate is then 400 MBytes/second, adequate for many video applications. The internal bandwidth over the crossbar between on-chip memory and processors is 2.4 GBytes/second.

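The figures quoted above can be related to one another with simple arithmetic. The sketch below derives total on-chip memory and bytes moved per clock cycle from the stated numbers; it assumes that 2Kbyte means 2048 bytes and that MBytes and GBytes are decimal (10^6 and 10^9), neither of which the text states explicitly.

```python
# Figures taken from the text.
CLOCK_HZ = 50_000_000                    # targeted 50 MHz rate
SRAM_MODULES = 25                        # on-chip SRAM modules
PEAK_TRANSFER_BYTES_PER_S = 400_000_000  # peak data transfer rate
CROSSBAR_BYTES_PER_S = 2_400_000_000     # internal crossbar bandwidth

# Assumption: "2Kbyte" means 2 * 1024 bytes.
SRAM_MODULE_BYTES = 2 * 1024

# Total on-chip SRAM across all modules.
on_chip_bytes = SRAM_MODULES * SRAM_MODULE_BYTES

# Bytes moved per clock cycle at the quoted peak rates.
transfer_bytes_per_cycle = PEAK_TRANSFER_BYTES_PER_S / CLOCK_HZ
crossbar_bytes_per_cycle = CROSSBAR_BYTES_PER_S / CLOCK_HZ

print(on_chip_bytes, transfer_bytes_per_cycle, crossbar_bytes_per_cycle)
```

Under these assumptions the on-chip memory totals 50 KBytes, the peak transfer rate corresponds to 8 bytes per cycle, and the crossbar carries 48 bytes per cycle, six times the peak transfer rate.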
DSP PARALLEL PROCESSORS (PP)

The PP has many powerful features beyond those found in conventional DSPs. Practically all video algorithms benefit from these features. Most of the features were added to permit scalability within the PP to support many simple functions (like bit ops) in one cycle or fewer operations with the same hardware at higher precision (like 32-bits). The following describes each feature and its advantage:

- 44 user registers:
    - ease of programming/compiling and fast parallel functions.
- Single-cycle access into crossbar memory expands effective registers to 34K:
    - flexibility.
- Three-level, no overhead instruction looping:
    - programming flexibility and faster tight loops (usually 20-30%).
- Double parallel transfer from memory with address update:
    - most algorithms need two pixels loaded per cycle.
- Three-operand ALU arithmetic and logical operations:
    - double speed correlation and windows support.
- Splittable multiply (8x8=16 or 16x16=32):
    - double speed pixel operations.
- Word/Halfword/Byte multiple arithmetic:
    - 4x on algorithms like motion estimation and 2x on fast DCTs.
- Flexible data path:
    - masking, merging, rotating... for bit stream coding (like Huffman).
- General-purpose use of address adders:
    - up to 6x number of adds in one cycle.
- Conditional operations prevent need for branching (and possible pipeline stalls):
    - adaptive algorithms will operate faster (like adaptive thresholding).

As a result, as many as 15 RISC operations will be performed in one PP cycle. When multiplied by the number of PPs and added to the MP and FPU operations, a formidable number results. In addition, since the C compiler also influenced the architecture of the PP, many of these features will automatically compile into fast code; many users of the MVP will not need to understand the PP architecture to take advantage of its performance.

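The byte-level multiple arithmetic listed above is what lets one 32-bit operation act on four 8-bit pixels, which is where the motion estimation gain comes from. The sketch below only emulates that idea in software: it packs four pixels per word and accumulates a sum of absolute differences (SAD), the usual block-matching cost. The function names and the choice of SAD are assumptions for illustration, not a description of the actual PP instructions.

```python
def pack_pixels(p0, p1, p2, p3):
    """Pack four 8-bit pixels into one 32-bit word (p0 in the low byte)."""
    return ((p0 & 0xFF) | ((p1 & 0xFF) << 8)
            | ((p2 & 0xFF) << 16) | ((p3 & 0xFF) << 24))


def packed_sad(word_a, word_b):
    """Sum of absolute differences over the four byte lanes of two words.

    On split-ALU hardware the four lanes would be handled in parallel;
    here they are handled one after another.
    """
    total = 0
    for shift in (0, 8, 16, 24):
        lane_a = (word_a >> shift) & 0xFF
        lane_b = (word_b >> shift) & 0xFF
        total += abs(lane_a - lane_b)
    return total


def row_sad(row_a, row_b):
    """SAD of two pixel rows whose length is a multiple of four."""
    total = 0
    for i in range(0, len(row_a), 4):
        total += packed_sad(pack_pixels(*row_a[i:i + 4]),
                            pack_pixels(*row_b[i:i + 4]))
    return total


print(row_sad([10, 20, 30, 40, 50, 60, 70, 80],
              [12, 18, 30, 45, 50, 61, 60, 80]))
```

An eight-pixel row needs only two packed operations here instead of eight scalar ones, which mirrors the 4x figure quoted for motion estimation.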
MASTER PROCESSOR (MP)

The MP is a general-purpose RISC processor with an integral IEEE-compatible floating-point unit. A 32-bit instruction is accessed from a 4KByte instruction cache. Data loads can be 8, 16, 32, or 64 bits from a 4KByte data cache or from any data module via the crossbar. The MP has thirty-one 32-bit usable registers. Uncommon features include:

- Register files common to floating-point & integer operations.
- Scoreboard keeps track of results of loads and FPU operations, preventing use until updated.
- Addressing modes support optional updating of the base-address register with results of the address computation.
- Special FPU instruction permits a new multiply, add/subtract, & increment each cycle.
- Left-most and right-most one logic.
- Both endians supported.

Since the MP was designed to efficiently execute C programs and has added hardware for bitstream processing, it performs exceptionally well as the controller and data interpretation processor. The floating-point capability accelerates and simplifies programming of high-precision applications like medical imaging and 3D graphics.

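The left-most and right-most one logic is the kind of operation that helps bitstream parsing, for example when locating the first set bit of a variable-length code. The sketch below shows what such an operation computes, using Python integers; the function names and the convention of returning -1 for a zero word are assumptions for the example, not the MP's actual instruction semantics.

```python
def leftmost_one(word):
    """Bit index (0 = least significant) of the highest set bit, or -1 if word is 0."""
    return word.bit_length() - 1


def rightmost_one(word):
    """Bit index (0 = least significant) of the lowest set bit, or -1 if word is 0."""
    # word & -word isolates the lowest set bit for non-negative integers.
    return (word & -word).bit_length() - 1


def leading_zeros_32(word):
    """Number of zero bits above the highest set bit in a 32-bit word."""
    return 31 - leftmost_one(word & 0xFFFFFFFF)


print(leftmost_one(0b00101000), rightmost_one(0b00101000))
```

In software without such support, the same result requires a loop or a table look-up per word, so a single-cycle version is helpful in entropy decoding loops.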
SHARED MEMORY & TRANSFER CONTROLLER (TC)

Much of the advantage of the MVP architecture lies in the memory and data I/O architecture. Each processor and memory is fully interconnected through the crossbar and switchable at instruction rates. With greater than 500 signal lines switching at nanosecond speeds, the crossbarred memory architecture is only possible with single-chip implementation. With adequate on-chip memory and the ability to reconnect the next processor to the data memory, rather than moving the data to another memory, the data on-chip is not required to move as often. In effect, the original requirement of billions of bytes/second data transfer is reduced to only 100's of Mbytes/second. This model works well as long as the algorithm uses localized regions of data (patches, blocks, neighborhoods, rows...), each of which "fits" into the on-chip memory and is accessed in repeated or predictable patterns. While this usually occurs with image processing, an extremely intelligent transfer controller was architected to aid in ensuring the validity of this assumption. The TC has numerous modes of transferring data on- or off-chip, each optimized for a particular type of dataflow (block, patch, fat line, indexed or guided patches...). Most importantly, the on-chip SRAM memory was architected with sufficient size and modularity to permit double-buffering of data I/O on and off the chip, while the on-chip processors access the other on-chip memory modules. In effect, practically no overhead is required for video I/O. Many convenient methods were designed into the TC to prioritize these accesses. In addition, we included support for most commodity memory components (VRAM, SRAM, DRAM). Finally, we devised several methods to mitigate any contention between the processors for a particular memory module. Both round-robin and fixed priority schemes are available to permit developers flexibility in structuring their algorithms to reduce contention. For the many image and video algorithms developed for the MVP thus far, contention has not been a problem.

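The double-buffering scheme described above can be sketched as a ping-pong loop: while a processor works on one memory module, the next block is loaded into the other. The code below is a sequential simulation under that assumption; on the real chip the load and the computation would overlap in time. The names `process_stream`, `buffers` and `kernel` are hypothetical.

```python
def process_stream(blocks, kernel):
    """Process a sequence of data blocks using two alternating buffers.

    blocks: list of lists standing in for off-chip image blocks.
    kernel: function applied to the block held in the active buffer.
    """
    results = []
    if not blocks:
        return results

    buffers = [None, None]
    buffers[0] = list(blocks[0])  # initial transfer into buffer 0

    for index in range(len(blocks)):
        active = index % 2        # buffer the processor reads this step
        idle = 1 - active         # buffer the transfer engine may fill

        # Stand-in for the transfer controller: prefetch the next block
        # into the idle buffer (would run concurrently in hardware).
        if index + 1 < len(blocks):
            buffers[idle] = list(blocks[index + 1])

        # Stand-in for a parallel processor working on the active buffer.
        results.append(kernel(buffers[active]))

    return results


print(process_stream([[1, 2], [3, 4], [5, 6]], sum))
```

Because the processor never waits on the buffer being filled, transfer time is hidden whenever a block's computation takes at least as long as loading the next block.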
Another advantage of the crossbar architecture is expandability. We can design many different MVP chips, as a function of the number of PPs. We simply slice the architecture, cutting or adding PPs and memory modules. Conceptually, the advantage of this approach is that, with the same package and pin-out, several different performance and price points can be used. A range of applications may require a range of different MVP chips. Applications which require CCIR 601 studio quality video and/or multifunction processing (graphics/audio/video) would most likely require an MVP with 4 processors. On the other hand, a more dedicated or single-function application like graphics may require fewer PPs. In addition, if only limited resolution video (QCIF) processing is necessary, again a small number of PPs could suffice. We anticipate various versions of the MVP in the future.

VIDEO CONTROLLER (VC)

In addition, the MVP has two programmable timing controllers for generation of video and other timing signals. As an example, video frame grabbing and display requires many pixel, horizontal, and vertical signals for synchronization of the external logic in the system. The MVP has internal logic to generate those signals under program control, relieving the system designer from design of external logic to perform those functions.

NEW DCT ALGORITHMS FOR COMPRESSION
SPEED-UP WITH USE OF A PROGRAMMABLE ARCHITECTURE

One advantage of the programmable compression chip is the optimization possible by selecting the least computationally demanding DCT algorithm that will meet the accuracy required of the application. For example, fast DCT algorithms like those of Lee and Chen [Lee-84, Chen-77] have considerable advantage with respect to traditional matrix multiply approaches (with a factor of 5 or more speedup). Separability of the 2D DCT is generally used for decomposition of DCTs (with successive processing of the individual rows & columns of an image). The size of the DCT directly influences the benefit of separability; however, with the 8x8 DCTs of most standards, a definite speedup results. The Lee algorithm tends to be easy to implement and achieves faster computation, although it has accuracy issues. The Chen algorithm is harder to implement and is computationally slower, but has good accuracy. Depending on the available processing bandwidth, the encoder can select an appropriate DCT or IDCT algorithm to perform the task. If errors result, different coding decisions result and either lower SNRs or compression ratios occur.

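The row-column decomposition mentioned above can be sketched as follows: a 1D DCT is applied to every row, then to every column of the result. This is a plain textbook DCT-II with orthonormal scaling, chosen for clarity; it is not the Lee or Chen fast algorithm, and the function names are assumptions for the example.

```python
import math


def dct_1d(vector):
    """Direct orthonormal DCT-II of a sequence (O(n^2) multiplies)."""
    n = len(vector)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(vector[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n))
        out.append(scale * acc)
    return out


def dct_2d_separable(block):
    """2D DCT computed as 1D DCTs over rows, then over columns."""
    row_pass = [dct_1d(row) for row in block]
    # zip(*row_pass) walks the columns of the row-transformed block.
    col_pass = [dct_1d(list(col)) for col in zip(*row_pass)]
    # Transpose back so result[u][v] has u = vertical, v = horizontal frequency.
    return [list(row) for row in zip(*col_pass)]


flat_block = [[1.0] * 8 for _ in range(8)]
coefficients = dct_2d_separable(flat_block)
print(round(coefficients[0][0], 6))
```

For an 8x8 block, the direct 2D sum needs 64 x 64 = 4096 multiplies, while the separable form needs 16 one-dimensional transforms of 64 multiplies each, or 1024, before any fast 1D algorithm is applied.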
In addition, we devised a "Truncated-IDCT" algorithm to utilize the advantages of a programmable architecture. Since the DCT, quantizer and thresholding operations seek to minimize the population of selective frequency coefficients (for high compression), statistically, most of the high frequency coefficients are zero valued. Therefore, the conventional IDCT will act on 8x8 matrices with a high percentage of zero valued inputs. We can then significantly reduce the amount of IDCT operations performed by not executing the zero valued multiplies and adds (similar work has been reported [McMillan-92]). This is only possible with software-based IDCTs. In implementation, the program adaptively truncates the 8x1 IDCT summation in the vertical direction based on the run-length encoded input values. Further reduction by selecting a 4x1 summation when appropriate also shortens the process, although not as frequently. With this approach a factor of 3 or more speed-up on IDCTs will usually occur.

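The idea of truncating the IDCT summation can be sketched for a single 8x1 column. The code below stops the sum at the last non-zero coefficient, so trailing zeros cost nothing. It is a minimal illustration under assumed details: the text does not give the actual implementation, and the function name, the orthonormal scaling and the returned term count are choices made for this example.

```python
import math


def idct_1d_truncated(coeffs):
    """Inverse orthonormal DCT-II that skips trailing zero coefficients.

    Returns (samples, terms_used), where terms_used is the number of
    coefficients actually included in each output sum.
    """
    n = len(coeffs)
    # Index of the last non-zero coefficient; -1 if all are zero.
    last = max((k for k, c in enumerate(coeffs) if c != 0), default=-1)
    terms_used = last + 1

    samples = []
    for i in range(n):
        total = 0.0
        for k in range(terms_used):  # truncated summation
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * coeffs[k] * math.cos(
                math.pi * (2 * i + 1) * k / (2 * n))
        samples.append(total)
    return samples, terms_used


# A column with only a DC coefficient needs one term per output sample.
samples, terms = idct_1d_truncated([math.sqrt(8.0), 0, 0, 0, 0, 0, 0, 0])
print(terms, [round(s, 6) for s in samples])
```

A fixed-function IDCT would perform all eight terms regardless of content, whereas a program can pick the shorter sum per column, which is the flexibility the paragraph above attributes to software-based IDCTs.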
TOOLS

Advances in video compression have been limited by the availability of tools to develop software and hardware. With the MVP, TI offers a range of software tools and direct on-chip support for in-circuit debug. A real-time executive, C++ compilers, algebraic assembler, windowed high-level language debugger (with JTAG emulation hardware on chip) and library of primitives/applications, all the tools familiar to computer application developers, will now be available for development of video applications.

The software model for the MVP is based on two levels. The primary level includes the Master Processor acting as a director and scheduler of the MVP's parallelism. The Executive operates on the MP, performing those supervisory tasks. The Executive can dispatch tasks for operation in pipeline, parallel or any other arrangement on any processor within the MVP. Under that, a level which actually performs the tasks on each processor is accessed by either: (1) a library of primitives, (2) application tools for programming in assembly or (3) a high-level language compiler. Each of these methods has advantages, with varying performance and skill level required to code the chip, as a function of the particular application. Nothing restricts the use of any processor as the master or slave processor except software convention.

COMPETING VIDEO COMPRESSION CHIP ARCHITECTURES

Several semiconductor companies have reported activity in video compression chip or chip set solutions ([Bolton-93][Konstantinides-92]). Most chip manufacturers are proposing hardwired or parameterized architectures, without C-level programmability. Our MVP is an exception. In addition, most of the other "programmable" approaches are based on an architecture which integrates dedicated logic modules, like DCTs and Motion Estimators, with only the controller programmable. This limits their efficiency since the silicon devoted to those functions must always keep busy with those functions to justify their cost. On the contrary, the MVP architecture has no dedicated logic, permitting user balancing of silicon based on the varied and dynamic computational demands of compression. Other researchers have recently described simulations which support our position that universally programmable architectures are competitive solutions when compared with dedicated or hybrid architectures for video decompression [Mayer-93].

Many different architectures are proposed, in development, or currently available; however, none except the MVP has the flexibility or computational performance to meet the complete demands of truly integrated digital video on the desktop, including the complete concert of real-time video & audio compression with image & 3D graphics processing. This includes not only compression and decompression, but also system-level bit-stream control, video scaling, error correction and even audio echo cancellation. Architecture limitations and transistor counts limit other chips to subsets of these functions.

CONCLUSION

The MVP is a monolithic single-chip parallel processor that performs compression processing, audio & video processing, 3D graphics and other functions, even at the same time. Over 2 billion operations are performed per second. This dramatic performance boost will