`ELECTRONICS ENGINEERS, INC.
`
`Apple Exhibit 1009
`Page 1 of 14
`
`
`
`
`Proceedings
`
`DCC ’94
`
`DATA COMPRESSION CONFERENCE
`
`
`
`
`
`
`March 29-31, 1994
`
`Snowbird, Utah
`
`Edited by
`James A. Storer
`Martin Cohn
`
`Sponsored by
`IEEE Computer Society Technical Committee
`on Computer Communications
`
`In cooperation with
`NASA/CESDIS
`
`IEEE Computer Society Press
`Los Alamitos, California
`
Washington • Brussels • Tokyo
`
`
`
`
`
The papers in this book comprise the proceedings of the meeting mentioned on
the cover and title page. They reflect the authors' opinions and, in the interests
of timely dissemination, are published as presented and without change. Their
inclusion in this publication does not necessarily constitute endorsement by the
editors, the IEEE Computer Society Press, or the Institute of Electrical and
Electronics Engineers, Inc.
`
`
`
`Published by the
`IEEE Computer Society Press
`10662 Los Vaqueros Circle
`P.O. Box 3014
`Los Alamitos, CA 90720-1264
`
`
`© 1994 by the Institute of Electrical and Electronics Engineers, Inc. All rights reserved.
`
`
`
`Copyright and Reprint Permissions: Abstracting is permitted with credit to the source.
`Libraries are permitted to photocopy beyond the limits of US copyright law, for private
`use of patrons, those articles in this volume that carry a code at the bottom of the first
`page, provided that
`the per-copy fee indicated in the code is paid through the
`Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For other
`copying, reprint, or republication permission, write to IEEE Copyrights Manager, IEEE
`Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.
`
`IEEE Computer Society Press Order Number 5637-02
`IEEE Catalog Number 93THO626-2
`ISBN 0-8186-5636-0 (microfiche)
`ISBN 0-8186-5637-9 (case)
`ISSN 1068-0314
`
`Additional copies can be ordered from
`
`IEEE Service Center
`445 Hoes Lane
`P.O. Box 1331
`Piscataway, NJ 08855-1331
`Tel: (908) 981-1393
`Fax: (908) 981-9667
`
`IEEE Computer Society
`13, avenue de l'Aquilon
`B-1200 Brussels
`BELGIUM
`Tel: +32-2-770-2198
`Fax: +32-2-770-8505
`
`IEEE Computer Society
`Ooshima Building
`2-19-1 Minami-Aoyama
`Minato-ku, Tokyo 107
`JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
`
`IEEE Computer Society Press
`Customer Service Center
`10662 Los Vaqueros Circle
`P.O. Box 3014
`Los Alamitos, CA 90720-1264
Tel: (714) 821-8380
Fax: (714) 821-4641
Email: cs.books@computer.org
`
`Production Editor: Lisa O’Conner
`Printed in the United States of America by Bookcrafters
`
The Institute of Electrical and Electronics Engineers, Inc.
`
`
`
`
`The MVP: A Highly-Integrated Video Compression Chip
`Robert J. Gove
`
Texas Instruments, Inc.
Dallas, Texas 75265
`
`ABSTRACT
`
We introduce a new highly-integrated processing chip that performs a variety of
functions and is particularly well suited to video compression algorithms.
Applications include multimedia PCs, virtual reality 3D graphics, full-duplex
videoconferencing, HDTV, and color hardcopy. We have architected the Multimedia Video
Processor, or MVP, to provide a previously unattainable level of performance from a single
chip, together with the programmability typically found in today's general-purpose computers.
While advanced semiconductor design and process techniques have been used for its
design, the key advantage of this component lies in the optimization of the architecture
for real-time video and graphics processing. This paper analyzes video compression
application requirements, describes the MVP architecture, and presents its potential as a very
capable solution for a wide range of markets.
`
`INTRODUCTION
`
The computer and consumer video industries are pursuing varied paths to offer
cost-effective computing products which provide new forms of information and entertainment.
Products are emerging that range from cable TV delivery of interactive digital movies to
digital mobile offices. Digital compression and video processing at a reasonable cost are
spurring this revolution. While algorithm developments have been important, most of the
enabling advances lie in the availability of high-density memory and high-performance
processing ICs. With the pending general availability of the Multimedia Video Processor, or
MVP, in 1994, a previously unattained level of digital signal processing performance will be
available, with all the flexibility of present-day programmable computers. Standards-based
videoconferencing and playback of compressed digital video and audio (using Px64, JPEG or
MPEG "multi-standard" codec systems) with a single MVP processor will be possible, as
well as codecs with yet-to-be-defined algorithms like model-based compression.
The MVP will not only support compression; it will also handle processing of
high-resolution video, full-motion video processing from sources like camcorders, digital
audio processing, hardcopy raster image processing, and 3D graphics, all under
software control and generation. From this wide range of functions, we calculated that
several billion operations per second are required to provide video-based applications on
the desktop. Current and soon-to-appear desktop host processors like x86, Pentium,
Alpha, and MIPS do not have the computational power to meet these demands.
`
`KEYS TO THE MVP ARCHITECTURE
`
The MVP's unique architecture and computational power enable users to integrate these
`varied functions on a single processing component. The keys to obtaining both exceptional
`processing speeds and fully-programmable features with the MVP include the use of:
`
(1) an efficient parallel processing architecture,
`(2) fast pixel processing tuned to image, video, and graphics processing,
`(3) intelligent control of image dataflow throughout the architecture,
`(4) single-chip integration without slower chip-to-chip communications.
`
`1068-0314/94 $3.00 © 1994 IEEE
`
`
`
`
Figure 1: MVP Block Diagram (A Single-Chip Parallel Processor)
[Block labels legible in the figure: DSP Parallel Processors (PPn), Advanced DSP Cores; Master Processor (MP), Advanced RISC.]
`
`
`ALGORITHM-DIRECTED ARCHITECTURE DEFINITION
`
`Processing Requirements
`Today's proposed international video compression standards use common frequency
`domain, quantization, and entropy coding techniques to (de)compress small portions (8x8)
`of each image. While these functions demand a great deal from the encoder/decoder, many
other varied functions remain, each with dynamic requirements which vary based on the
`type of image compressed as well as the channel rate required to maintain real-time
`operation. For optimal efficiency a processor must adapt to these dynamic needs. A
`typical average of the processing demands of the Px64 video-conferencing standard
`appears in the following table.
`
RISC vs. MVP-PP Processing Requirements for Px64

Px64 (H.261), full-duplex, full-CIF, 30 Hz. Columns: function; RISC execution
speed (average % of time) *; speed-up of MVP-PP vs. RISC processor.

Functions:
  Motion Estimation - Block Matching (encode)
  Encoding Decisions - (1) Inter w/motion vectors, (2) Inter w/coded diff., (3) Intra
  Loop Filtering (both)
  Difference image (current - predicted)
  Fast DCT (encode)
  Threshold/Quantization/Zig-Zag Run-length
  IDCT (both)
  Reconstruction (both) (predicted + diff. image)
  Bitstream Decode & Dequantization (decode)

TOTAL CYCLES (MIPS): RISC 1.00 = 1,193 MIPS; MVP-PP 1.00 = 155 PP MIPS ***
AVERAGE SPEED-UP = 7.7

[Per-function values are not legible in the source.]

* Multiply counted as one instruction even though most RISCs require many cycles.
** If the "Truncated-IDCT" algorithm was used, IDCTs speed up again (see later).
*** The total is equivalent to 3 MVP-PP processors (see below PP section).
**** Audio standards concurrently execute on the MVP-MP (see below MP section).
`
As we studied the computational requirements for motion estimation (51%) and DCTs
(22%), it became quite apparent that a programmable image processor must excel at these
functions. It is important to recognize that what is done poorly in a processor can dominate
its performance. Since most architectural improvements would not accelerate all
functions uniformly, we looked for special architectural features for these critical functions
while maintaining enough flexibility to benefit a larger class of algorithms. In the final
analysis, a much more uniform distribution of computational loading resulted after the
changes.
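
The observation that a poorly handled function can dominate performance can be illustrated with a short calculation. The sketch below takes the 51% and 22% time shares quoted above and lumps the remaining 27% into "other"; the per-function speed-up factors are hypothetical values chosen only for illustration and are not measurements from the MVP.

```python
def overall_speedup(shares, speedups):
    """Amdahl-style estimate: new time is the sum of each share divided by its speed-up."""
    new_time = sum(shares[name] / speedups[name] for name in shares)
    return 1.0 / new_time

# Time shares on the baseline RISC: 51% and 22% are from the text, 27% is the remainder.
shares = {"motion_estimation": 0.51, "dct": 0.22, "other": 0.27}

# Hypothetical case 1: only the two dominant functions are accelerated (10x each).
lopsided = {"motion_estimation": 10.0, "dct": 10.0, "other": 1.0}
# Hypothetical case 2: the remaining functions are also accelerated (4x).
balanced = {"motion_estimation": 10.0, "dct": 10.0, "other": 4.0}

print(round(overall_speedup(shares, lopsided), 2))  # 2.92
print(round(overall_speedup(shares, balanced), 2))  # 7.12
```

With these assumed factors, leaving 27% of the work unaccelerated caps the overall gain below 3x, which is why the remaining functions also need architectural support.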
`
`As seen in the table, the programmable image processor must perform many other
`functions well, including: bit manipulation and table look ups for entropy encoding, and
`
`
`multiply and accumulate for various types of filtering operations. To obtain good image
`quality at any channel rate and 30 frames per second, the image processor must compute
`over 1.2 billion operations per second (BOPS).
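
As a rough cross-check of this figure, the operation budget per pixel can be worked out. The sketch assumes the CIF luminance frame size of 352 x 288 pixels, which is not stated in the text, and counts luminance pixels only.

```python
CIF_WIDTH, CIF_HEIGHT = 352, 288   # assumed CIF luminance resolution (not from the text)
FRAME_RATE_HZ = 30                 # from the text
OPS_PER_SECOND = 1.2e9             # "over 1.2 billion operations per second"

def ops_per_pixel(ops_per_second, width, height, frame_rate_hz):
    """Operations available per luminance pixel at the given frame rate."""
    pixels_per_second = width * height * frame_rate_hz
    return ops_per_second / pixels_per_second

budget = ops_per_pixel(OPS_PER_SECOND, CIF_WIDTH, CIF_HEIGHT, FRAME_RATE_HZ)
print(f"{budget:.0f} operations per luminance pixel")  # 395
```

Under these assumptions the budget is a few hundred operations per pixel position, shared between the encode and decode directions of a full-duplex session.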
`
The addition of audio compression (which requires higher precision integer and possibly
floating point algorithms) and network communication, necessary for video conferencing
(G.728 or G.711, H.242, H.230, H.221), further increases the scope of computational
requirements. To reduce system cost, we propose to include support in the architecture
for the required non-standard functions like color space conversion (YCrCb to RGB),
decimation of the source image to CIF resolution, and variable scaling of the decompressed
sequence. Complete implementation of compression applications such as video
conferencing requires over 2 BOPS from the programmable image processor.
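
To give a sense of the per-pixel work behind one of these non-standard functions, here is a minimal sketch of a color space conversion. It assumes full-range 8-bit samples and the JFIF-style BT.601 coefficients; the text does not say which variant or numeric range the MVP software uses, and the function names are hypothetical.

```python
def clamp8(value):
    """Round and clamp to the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))

def ycbcr_to_rgb(y, cb, cr):
    """Convert one full-range 8-bit YCbCr sample to RGB (assumed JFIF-style coefficients)."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp8(r), clamp8(g), clamp8(b)

print(ycbcr_to_rgb(128, 128, 128))  # mid-gray: both chroma offsets are zero
```

Each pixel costs four multiplies plus a handful of adds and clamps, which is the kind of multiply-accumulate load that adds to the compression kernels themselves.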
`
`ARCHITECTURE CHOICES
`
We considered several candidate parallel architectures for implementation of this single-chip
video processor [Gove-92, Guttag-92]. An architecture with a mix of dedicated and
programmable processors was initially evaluated, then discounted when no
single dominant function was found that was necessary almost all of the time. Besides, we
predicted that by the time the chip was completed, a new important algorithm would
emerge. Because dedicating resources to any one function (like a DCT) sacrifices silicon
efficiency, we felt compelled to seek a general-purpose, well-balanced system
solution. Several other candidates existed; however, the mix of algorithms and practical
implementation limitations focused us on SIMD and MIMD architectures. These differ in
the autonomy of the processors: MIMD lets each processor execute its own instruction
stream, a desirable feature for any data-dependent algorithm operating in parallel.
`
With MIMD desirable, the choice of a processor and memory interconnection architecture
remained. Pipelined, shared-bus memory, communication port (mesh/array/hypercube),
and crossbar fully-shared memory were considered. Pipelined memory and processors
(systolic arrays) are typically used for video; however, they are too restrictive in the sense
that one must know a priori the size of the memory and the dynamics of the algorithm to
prevent data contention and processor stalls. With our varied needs, this would lead to
inefficiencies. A shared-bus memory structure would also have bottleneck problems with
highly variable instruction and data streams and with moving results from one processor to
another. The n-way connected communication port requires a very ordered flow of data,
like a systolic or wavefront flow, or the application of a pixel per processor (not
practical in a single chip). This approach works for large arrays of simple processors
which can operate uniformly on images; however, we wanted more complex processors
which could adapt to varying types of data, from bit graphics to floating-point
representations. The crossbar fully-shared memory is ideally suited to these needs,
minimizing contention and data movement and providing flexibility for many types of
algorithms. In fact, since the crossbar operates at the processor instruction rate, this
architecture can functionally emulate the other approaches (pipeline, shared bus...).
`
We wanted not only to provide this order-of-magnitude performance increase, but also
to apply a traditional computer model of programmable processing and a large memory
to applications with integrated image, graphics, video and audio processing, or image
computing. As shown in Figure 2, titled "MVP System Architecture", replacing the
processing and memory pipeline of conventional video systems with the single video
processor and large memory system model yields tremendous application flexibility. In
effect, the system can reconfigure itself with software from video conferencing to playing
CD movies, just as a PC would reconfigure from a spreadsheet to a video game.
`
`
Figure 2: The MVP "System" Architecture.
[Diagram labels legible in the figure:
HOST COMPUTER INTERFACE. Interface for:
- image, audio data from computer memory (disk, Photo-CD...),
- data from networks (phone or local digital),
- image/video display on workstation monitor.
MEMORY:
- application memory,
- instructions,
- multiple images, audio...
INPUT/OUTPUT INTERFACE. Interface for:
- live video & audio (cameras, VCRs),
- display on TV monitors.]
`
`THE MVP ARCHITECTURE
`
The Multimedia Video Processor, or MVP, represents the next generation of digital signal
processors. The MVP can be technically described as a single-chip, crossbar shared-memory,
heterogeneous MIMD multiprocessor. It combines RISC and advanced DSP
processing in one parallel architecture with unique features for each. Current RISC
processors typically use instruction pipelining, numerous registers and a detached
floating-point processor. On the other hand, current DSPs are optimized for one-dimensional
multiply-accumulate functions. Newer DSPs have floating-point capabilities, yet most
imaging and video processing needs only integer operations. DSPs usually have fewer
registers than RISC processors and have direct memory access (DMA) with limited capabilities.
`
The MVP combines the best features of RISC and DSP in parallel and adds other features
to offer unprecedented power and flexibility. The heart of an image or video chip is its
capability to process 2D signals. The MVP has features for 2D DSP-like processing,
including multiply-accumulate operations. The on-chip memory and register characteristics
of the MVP were optimized for image computing algorithms, preventing time-consuming
cache misses or swapping of register contents. Multidimensional external memory access
and double buffering minimize the typical memory bottleneck of current DSP solutions.
An internal memory crossbar provides extremely efficient synchronization and
communication among multiple processors. A very high-performance RISC processor is
integrated on the chip, providing intelligent control of the DSP-like processors. Also
integrated into the chip, a new floating-point architecture can act as a co-processor to any of
the DSP-like processors or the RISC processor. Our analysis of the algorithms showed that the
required mix of integer ops to floating-point ops was somewhere between 8:1 and 4:1, a
balance which the MVP supports. The entire collection of processors and memory is
configured as a MIMD architecture for ease of programming and high performance for all
image and video computing applications. This MIMD data and control supports both data
`
`
dependent algorithms like object feature matching or Huffman coding and
traditional data-independent SIMD operations like convolution.
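
The double buffering mentioned above can be sketched as a ping-pong scheme: while the processor works on one buffer, the transfer side fills the other. The model below is a simplified, sequential illustration with hypothetical names; on real hardware the fill and the processing of each step would overlap in time, and nothing here reflects the actual MVP memory map.

```python
def process_stream(blocks, process):
    """Ping-pong (double) buffering over a sequence of data blocks.

    Runs sequentially; it only models which buffer each side owns at each step.
    """
    buffers = [None, None]               # two on-chip buffers (hypothetical model)
    results = []
    source = iter(blocks)
    fill = 0                             # buffer the transfer side writes next
    buffers[fill] = next(source, None)   # prime the first buffer
    while buffers[fill] is not None:
        work = fill                      # processor takes the buffer just filled
        fill = 1 - fill                  # transfer side switches to the other buffer
        buffers[fill] = next(source, None)   # "transfer" the next block, if any
        results.append(process(buffers[work]))
    return results

print(process_stream([[1, 2], [3, 4], [5, 6]], sum))  # [3, 7, 11]
```

The processing side never waits for a block to arrive except at start-up, which is the property that lets I/O cost be hidden behind computation.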
`
To prevent contention for memory or register access, a very wide instruction set in the
DSPs and a large on-chip crossbarred memory are used in the MVP. This flexibility permits
the programmer to produce highly-parallel optimized code. A performance penalty may
result if only one highly-serial task is performed continuously; however, the very nature of
image, video, graphics, and audio processing, with varied concurrent and complex
processing, prevents this from occurring. The MVP integrates more functions than ever
before into one chip, while avoiding the compromises of other architectures.
`
`Detailed Architecture Description:
`Figure #1, titled "MVP Block Diagram", shows the MVP chip architecture. The Master
`Processor (MP) provides a RISC processor for simple user interface, sequential
`processing, and orchestration of multiple concurrent tasks operating on the entire MVP.
`The DSP Parallel Processors (PP), of which 4 will be designed in the first version of the
`MVP, provide highly-optimized image/video/graphics/audio processing capabilities. The
`Transfer Controller (TC) intelligently moves data and instructions on and off the MVP. All
`of these processors are locally interconnected with a crossbar to 25 on-chip 2Kbyte SRAM
`modules. Other features include dual video frame timing generators (VC) and JTAG test
`and emulation circuits.
`
With five 32-bit programmable processors operating at one targeted state rate of 50MHz
`and numerous parallel operations performed in each processor, over 2 billion operations
`per second result. In addition, 100 MFLOPS (fully IEEE-754) can occur. The peak data
`transfer rate is then 400 MBytes/second, adequate for many video applications. The
`internal bandwidth over the crossbar between on-chip memory and processors is 2.4
`GBytes/second.
`
`DSP PARALLEL PROCESSORS (PP)
`
`The PP has many powerful features beyond those found in conventional DSPs. Practically
`all video algorithms benefit from these features. Most of the features were added to permit
`scalability within the PP to support many simple functions (like bit ops) in one cycle or
fewer operations with the same hardware at higher precision (like 32-bits). The following
list gives each feature and its advantage:
• 44 user registers:
  - ease of programming/compiling and fast parallel functions.
• Single-cycle access into crossbar memory expands effective registers to 34K:
  - flexibility.
• Three-level, no-overhead instruction looping:
  - programming flexibility and faster tight loops (usually 20-30%).
• Double parallel transfer from memory with address update:
  - most algorithms need two pixels loaded per cycle.
• Three-operand ALU arithmetic and logical operations:
  - double speed correlation and windows support.
• Splittable multiply (8x8=16 or 16x16=32):
  - double speed pixel operations.
• Word/Halfword/Byte multiple arithmetic:
  - 4x on algorithms like motion estimation and 2x on fast DCTs.
• Flexible data path:
  - masking, merging, rotating... for bit stream coding (like Huffman).
• General-purpose use of address adders:
`Apple Exhibit 1009
`Page 10 of 14
`
`Apple Exhibit 1009
`Page 10 of 14
`
`
`
`221
`
  - up to 6x number of adds in one cycle.
• Conditional operations prevent need for branching (and possible pipeline stalls):
  - adaptive algorithms will operate faster (like adaptive thresholding).
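
To illustrate why byte-wise multiple arithmetic helps block matching, consider the sum of absolute differences (SAD), a common block-matching cost. SAD is an assumption here, since the text does not name the cost function, and all function names are hypothetical. A 32-bit word holds four 8-bit pixels; this Python sketch only emulates the four lanes in a loop, whereas hardware with a split ALU could handle them in a single operation.

```python
def pack4(pixels):
    """Pack four 8-bit pixels into one 32-bit word, first pixel in the low byte."""
    word = 0
    for i, p in enumerate(pixels):
        word |= (p & 0xFF) << (8 * i)
    return word

def sad_word(a, b):
    """Sum of absolute differences over the four byte lanes of two 32-bit words."""
    total = 0
    for lane in range(4):
        pa = (a >> (8 * lane)) & 0xFF
        pb = (b >> (8 * lane)) & 0xFF
        total += abs(pa - pb)
    return total

def sad_row(row_a, row_b):
    """SAD of two equal-length pixel rows (length a multiple of 4), one word at a time."""
    return sum(
        sad_word(pack4(row_a[i:i + 4]), pack4(row_b[i:i + 4]))
        for i in range(0, len(row_a), 4)
    )

current = [10, 20, 30, 40, 50, 60, 70, 80]
reference = [12, 18, 30, 45, 50, 61, 65, 80]
print(sad_row(current, reference))  # 15
```

An 8-pixel row needs two word-level steps instead of eight pixel-level steps, which is the source of the 4x figure quoted for motion estimation.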
`
As a result, as many as 15 RISC operations will be performed in one PP cycle. When
multiplied by the number of PPs and added to the MP and FPU operations, a formidable
number results. In addition, since the C compiler also influenced the architecture of the
PP, many of these features will automatically compile into fast code, so many users of the
MVP will not need to understand the PP architecture to take advantage of its performance.
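
The multiplication described above can be written out explicitly. The sketch uses the figures given in the text (four PPs, 50 MHz, up to 15 operations per PP cycle, 100 MFLOPS) and assumes one integer operation per MP cycle, which the text does not state. The result is a peak upper bound, not a sustained rate.

```python
CLOCK_HZ = 50e6          # targeted rate from the text
NUM_PPS = 4              # first version of the MVP
PP_OPS_PER_CYCLE = 15    # "as many as 15 RISC operations" per PP cycle
MP_OPS_PER_CYCLE = 1     # assumption: one integer operation per MP cycle
FLOPS = 100e6            # 100 MFLOPS from the text

def peak_ops_per_second():
    """Upper bound obtained by multiplying out the per-cycle figures."""
    pp_ops = NUM_PPS * PP_OPS_PER_CYCLE * CLOCK_HZ
    mp_ops = MP_OPS_PER_CYCLE * CLOCK_HZ
    return pp_ops + mp_ops + FLOPS

print(f"{peak_ops_per_second() / 1e9:.2f} billion operations per second (peak)")  # 3.15
```

This peak is consistent with the "over 2 billion operations per second" quoted elsewhere in the paper, since not every cycle can use all 15 operation slots.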
`
`MASTER PROCESSOR (MP)
`
`The MP is a general-purpose RISC processor with an integral IEEE-compatible floating-
`point unit. A 32-bit instruction is accessed from a 4KByte instruction cache. Data loads
`can be 8, 16, 32, or 64 bits from a 4KByte data cache or from any data module via the
`crossbar. The MP has thirty-one 32-bit usable registers. Uncommon features include:
• Register files common to floating-point & integer operations.
• Scoreboard keeps track of results of loads and FPU operations, preventing use until updated.
• Addressing modes support optional updating of base-address register with
  results of the address computation.
• Special FPU instruction permits new multiply, add/subtract, & increment each cycle.
• Left-most and right-most one logic.
• Both endians supported.
`
`Since the MP was designed to efficiently execute C programs and has added hardware for
`bitstream processing, it performs exceptionally well as the controller and data interpretation
`processor. The floating point capability accelerates and simplifies programming of high
`precision applications like medical imaging and 3D graphics.
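
As an illustration of the left-most and right-most one logic listed above, which is useful in bitstream processing (for example, when counting leading zeros in variable-length codes), the two operations can be expressed as follows. This is a sketch for non-negative integers with bit 0 as the least significant bit; the bit numbering used by the MP's own instructions is not given in the text.

```python
def leftmost_one(x):
    """Bit index (0 = least significant) of the most significant set bit, or -1 if x == 0."""
    return x.bit_length() - 1

def rightmost_one(x):
    """Bit index of the least significant set bit, or -1 if x == 0."""
    return (x & -x).bit_length() - 1

word = 0b0001_0110_0000
print(leftmost_one(word), rightmost_one(word))  # 8 5
```

In software these take a few instructions or a loop; dedicated logic returns the answer in one step.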
`
`SHARED MEMORY & TRANSFER CONTROLLER (TC)
`
Much of the advantage of the MVP architecture lies in the memory and data I/O
architecture. Each processor and memory is fully interconnected through the crossbar and
switchable at instruction rates. With greater than 500 signal lines switching at nanosecond
speeds, the crossbarred memory architecture is only possible with single-chip
implementation. With adequate on-chip memory and the ability to reconnect the next
processor to the data memory, rather than moving the data to another memory, the data
on-chip is not required to move as often. In effect, the original requirement of billions of
bytes/second data transfer is reduced to only 100's of Mbytes/second. This model works
well as long as the algorithm uses localized regions of data (patches, blocks,
neighborhoods, rows...), each of which "fits" into the on-chip memory and is accessed in
repeated or predictable patterns. While this usually occurs with image processing, an
extremely intelligent transfer controller was architected to help ensure the validity of this
assumption. The TC has numerous modes of transferring data on- or off-chip, each
optimized for a particular type of dataflow (block, patch, fat line, indexed or guided
patches...). Most importantly, the on-chip SRAM memory was architected with sufficient
size and modularity to permit double-buffering of data I/O on and off the chip, while the
on-chip processors access the other on-chip memory modules. In effect, practically no
overhead is required for video I/O. Many convenient methods were designed into the TC
to prioritize these accesses. In addition, we included support for most commodity memory
components (VRAM, SRAM, DRAM). Finally, we devised several methods to mitigate
any contention between the processors for a particular memory module. Both round-robin
and fixed priority schemes are available to permit developers flexibility in structuring
`
their algorithms to reduce contention. For the many image and video algorithms
developed for the MVP thus far, contention has not been a problem.
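
The difference between the two priority schemes can be sketched with a small arbiter model. This is a generic illustration of fixed-priority and round-robin arbitration with hypothetical function names, not a description of the MVP's crossbar logic.

```python
def fixed_priority(requests):
    """Grant the lowest-numbered requester; index 0 always wins a conflict."""
    for idx, wants in enumerate(requests):
        if wants:
            return idx
    return None

def round_robin(requests, last_granted):
    """Grant the first requester after the one granted most recently."""
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last_granted + offset) % n
        if requests[idx]:
            return idx
    return None

# Three processors contend for the same memory module on every cycle.
requests = [True, True, True]
last = -1
grants = []
for _ in range(6):
    last = round_robin(requests, last)
    grants.append(last)
print(grants)                      # [0, 1, 2, 0, 1, 2]
print(fixed_priority(requests))    # 0
```

Round-robin shares a contended module evenly, while fixed priority guarantees latency for one requester at the expense of the others; which is preferable depends on how the algorithm is structured.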
`
Another advantage of the crossbar architecture is expandability. We can design many
different MVP chips, as a function of the number of PPs. We simply slice the architecture,
cutting or adding PPs and memory modules. Conceptually, the advantage of this approach
is that, with the same package and pin-out, several different performance and price points
can be offered. A range of applications may require a range of different MVP chips.
Applications which require CCIR 601 studio quality video and/or multifunction processing
(graphics / audio / video) would most likely require an MVP with 4 processors. On the
other hand, a more dedicated or single-function application like graphics may require
fewer PPs. In addition, if only limited resolution video (QCIF) processing is necessary,
again a small number of PPs could suffice. We anticipate various versions of the MVP in
the future.
`
`VIDEO CONTROLLER (VC)
`
In addition, the MVP has two programmable timing controllers for generation of video and
other timing signals. As an example, video frame grabbing and display require many pixel,
horizontal, and vertical signals for synchronization of the external logic in the system. The
MVP has internal logic to generate those signals under program control, relieving the
system designer from design of external logic to perform those functions.
`
`NEW DCT ALGORITHMS FOR COMPRESSION
`SPEED-UP WITH USE OF A PROGRAMMABLE ARCHITECTURE
`
One advantage of the programmable compression chip is the optimization possible by
selecting the least computationally demanding DCT algorithm that will meet the accuracy
required of the application. For example, fast DCT algorithms like those of Lee and Chen
[Lee-84, Chen-77] have considerable advantage with respect to traditional matrix multiply
approaches (with a factor of 5 or more speedup). Separability of the 2D DCT is generally
used for decomposition of DCTs (with successive processing of the individual rows &
columns of an image). The size of the DCT directly influences the benefit of separability;
however, with the 8x8 DCTs of most standards, a definite speedup results. The Lee
algorithm tends to be easy to implement and achieves faster computation, although it has
accuracy issues. The Chen algorithm is harder to implement and is computationally slower,
but has good accuracy. Depending on the available processing bandwidth, the encoder
can select an appropriate DCT or IDCT algorithm to perform the task. If errors result,
different coding decisions result and either lower SNRs or lower compression ratios occur.
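
The benefit of separability can be made concrete with a small sketch. It uses a direct (non-fast) 8-point DCT-II with orthonormal scaling, which is an assumed normalization, applies it to rows and then columns, and compares multiply counts. It does not implement the Lee or Chen fast algorithms.

```python
import math

N = 8

def dct_1d(vec):
    """Direct N-point DCT-II with orthonormal scaling (N products per output)."""
    out = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(scale * sum(
            vec[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N)))
    return out

def dct_2d_separable(block):
    """2D DCT as 1D DCTs over the rows, then over the columns of the result."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d([rows[r][c] for r in range(N)]) for c in range(N)]
    # cols[c][k] holds vertical frequency k of column c; transpose back to [k][c].
    return [[cols[c][k] for c in range(N)] for k in range(N)]

# Multiply counts without any fast algorithm:
direct_2d_multiplies = (N * N) * (N * N)   # each of 64 outputs sums 64 products
separable_multiplies = 2 * N * (N * N)     # 16 one-dimensional DCTs, 64 products each
print(direct_2d_multiplies, separable_multiplies)  # 4096 1024

flat = [[100] * N for _ in range(N)]
print(round(dct_2d_separable(flat)[0][0], 6))  # DC term of a constant block
```

For an 8x8 block, separability alone reduces the multiply count by a factor of four; a fast 1D algorithm then lowers the cost of each of the sixteen row and column transforms further.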
`
In addition, we devised a "Truncated-IDCT" algorithm to utilize the advantages of a
programmable architecture. Since the DCT, quantizer and thresholding operations seek to
minimize the population of selective frequency coefficients (for high compression),
statistically, most of the high frequency coefficients are zero valued. Therefore, the
conventional IDCT will act on 8x8 matrices with a high percentage of zero valued inputs.
We can then significantly reduce the number of IDCT operations performed by not
executing the zero valued multiplies and adds (similar work has been reported [McMillan-92]).
This is only possible with software-based IDCTs. In implementation, the program
adaptively truncates the 8x1 IDCT summation in the vertical direction based on the
run-length encoded input values. Further reductions by selecting a 4x1 summation when
appropriate also shorten the process, although not as frequently. With this approach a
factor of 3 or more speed-up on IDCTs will usually occur.
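
The idea of truncating the summation can be sketched as follows. This is an illustrative reading of the description above, not the author's implementation: it scans for trailing zero coefficients directly rather than using run-length information, uses a direct 8-point inverse DCT with an assumed orthonormal scaling, and counts one multiply-accumulate per summation term.

```python
import math

N = 8

def idct_1d_truncated(coeffs):
    """8-point inverse DCT that skips trailing zero coefficients.

    Returns (samples, multiplies) so the saving can be inspected.
    """
    # Index one past the last non-zero coefficient; 0 means the input is all zero.
    last = N
    while last > 0 and coeffs[last - 1] == 0:
        last -= 1
    samples, multiplies = [], 0
    for n in range(N):
        acc = 0.0
        for k in range(last):          # summation truncated at 'last'
            scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
            acc += scale * coeffs[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
            multiplies += 1
        samples.append(acc)
    return samples, multiplies

# A column with only two low-frequency coefficients left after quantization.
samples, multiplies = idct_1d_truncated([80, -12, 0, 0, 0, 0, 0, 0])
print(multiplies, "multiply-accumulates instead of", N * N)  # 16 instead of 64
```

The result is identical to the full summation because the skipped terms are exactly zero; only the amount of work changes, which is why this optimization needs a software-based IDCT.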
`
`
`TOOLS
`
`Advances in video compression have been limited by the availability of tools to develop
`software and hardware. With the MVP, TI offers a range of software tools and direct on-
`chip support for in-circuit debug. A real-time executive, C++ compilers, algebraic
`assembler, windowed high-level language debugger (with JTAG emulation hardware on
`chip) and library of primitives/applications, all the tools familiar to computer application
`developers, will now be available for development of video applications.
`
The software model for the MVP is based on two levels. The primary level includes the
Master Processor acting as a director and scheduler of the MVP's parallelism. The
Executive operates on the MP, performing those supervisory tasks. The Executive can
dispatch tasks for operation in pipeline, parallel or any other arrangement on any processor
within the MVP. Under that, a level which actually performs the tasks on each processor is
accessed by either: (1) a library of primitives, (2) application tools for programming in
assembly or (3) a high-level language compiler. Each of these methods has advantages,
with varying performance and skill level required to code the chip, as a function of the
particular application. Nothing but software convention restricts the use of any processor
as the master or slave processor.
`
`COMPETING VIDEO COMPRESSION CHIP ARCHITECTURES
`
Several semiconductor companies have reported activity in video compression chip or chip
set solutions [Bolton-93, Konstantinides-92]. Most chip manufacturers are proposing
hardwired or parameterized architectures, without C-level programmability. Our MVP is
an exception. In addition, most of the other "programmable" approaches are based on an
architecture which integrates dedicated logic modules, like DCTs and Motion Estimators,
with only the controller programmable. This limits their efficiency since the silicon devoted
to those functions must always be kept busy with those functions to justify its cost. In
contrast, the MVP architecture has no dedicated logic, permitting user balancing of silicon
based on the varied and dynamic computational demands of compression. Other
researchers have recently described simulations which support our position that universally
programmable architectures are competitive solutions when compared with dedicated or
hybrid architectures for video decompression [Mayer-93].
`
Many different architectures are proposed, in development, or currently available; however,
none except the MVP has the flexibility and computational performance to meet the
complete demands of truly integrated digital video on the desktop, including the complete
concert of real-time video & audio compression with image & 3D graphics processing.
These demands cover not only compression and decompression, but also system-level
bit-stream control, video scaling, error correction and even audio echo cancellation.
Architecture limitations and transistor counts limit other chips to subsets of these functions.
`
`CONCLUSION
`
The MVP is a monolithic single-chip parallel processor that performs compression
processing, audio & video processing, 3D graphics and other functions, even at the same time.
Over 2 billion operations are performed per second. This dramatic performance boost will
enable a wide range of new applications, including desktop interactive digital video.
Integrating fully-programmable parallel DSP processors with a RISC processor on one
chip provides software flexibility and system adaptability. A new parallel architecture,
using a crossbar network to couple the processors and large on-chip SRAMs, and with
MIMD (Multiple Instruction Multiple Data) operations, yields extremely high efficiency for
`
`most image, graphics, and video algorithms. Software tools like real-time executives,
`assemblers and compilers all help bring a familiar computer programming model to
`multidimensional signal processing. This new technology frees developers of compression
`algorithms to optimize implementations of standard video & audio compression algorithms,
`without the restrictions found in today's compression chips (those which are limited to
`current interpretations or versions of the standards). In addition, algorithm developers can
`implement future compression algorithms, without the difficulties of developing new chips
`or adapting existing chips.
`
The MVP supports a wide range of open standards for video compression and image
computing. The variation within each standard, which promotes creative and distinguishing
advantages in the marketplace, and the constant urge to optimize the standard for a particular
range of markets both work against fixed hardware solutions. This programmable,
integrated solution gives system designers the flexibility to develop competitive algorithms
as well as adapt to emerging standards.
`
`ACKNOWLEDGMENTS
`
The author wishes to thank Jeremiah Golston, Dr. Chris Read, and Dr. V. Venkateswar for
compression algorithm work relating to the MVP. In addition, thanks to the MVP Program
Manager, Walt Bonneau, for developing and motivating a "world-class" team. A special
thanks to my original co-MVP-architects, Keith Balmer, Karl Guttag, and Nick
Ing-Simmons. Finally, thanks to the entire "Team-MVP"!
`
`REFERENCES
`
`[Bolton-93] Bolton, M. "A Family of MPEG Video Encoder and Decoder Chips", IEEE
`Proceeding of Conference on Hot Chips, 1993.
`
[Chen-77] Chen, W.H., C.H. Smith, and S.C. Fralick, "A Fast Computational Algorithm
for the Discrete Cosine Transform", IEEE Transactions on Communications, Vol. 25, pp.
1004-1009, Sept. 1977.
`
`[Gove-92] Gove, R.J., "Architectures for Single-Chip Image Computing", SPIE
`Proceedings of Conf. on Image Processing and Interchange, San Jose, Ca., Feb 1992.
`
`[Guttag-92] Guttag, K.M., R.J. Gove, & J.R. Van Aken, "A Single-Chip Multiprocessor
`For Multimedia: The MVP", IEEE Computer Graphics & Applications, pp.53-64, 11/92.
`
[Konstantinides-92] K. Konstantinides & V. Bhaskaran, "Monolithic Architectures for
`Image Processing & C