`Jay Torborg
`James T. Kajiya
`Microsoft Corporation
`
`ABSTRACT
`A new 3D graphics and multimedia hardware architecture, code-
`named Talisman, is described which exploits both spatial and
`temporal coherence to reduce the cost of high quality animation.
`Individually animated objects are rendered into independent
`image layers which are composited together at video refresh rates
`to create the final display. During the compositing process, a full
`affine transformation is applied to the layers to allow translation,
`rotation, scaling and skew to be used to simulate 3D motion of
`objects, thus providing a multiplier on 3D rendering performance
`and exploiting temporal image coherence. Image compression is
`broadly exploited for textures and image layers to reduce image
`capacity and bandwidth requirements. Performance rivaling high-
`end 3D graphics workstations can be achieved at a cost point of
`two to three hundred dollars.
`CR Categories and Subject Descriptors: B.2.1 [Arithmetic and
`Logic Structures]: Design Styles - Parallel, Pipelined; C.1.2
`[Processor Architectures]: Multiprocessors - Parallel processors,
`Pipelined processors; I.3.1 [Computer Graphics]: Hardware
`Architecture - Raster display devices; I.3.3 [Computer Graphics]:
`Picture/Image Generation - Display algorithms.
`
`INTRODUCTION
`The central problem we are seeking to solve is that of attaining
`ubiquity for 3D graphics. Why ubiquity? Traditionally, the
`purpose of computer graphics has been as a tool. For example,
mechanical CAD enhances the designer's ability to imagine
`complex three dimensional shapes and how they fit together.
`Scientific visualization seeks to translate complex abstract
`relationships into perspicuous spatial relationships. Graphics in
film-making is a tool that realizes the vision of a creative
`imagination. Today, computer graphics has thrived on being the
`tool of choice for augmenting the human imagination.
`However, the effect of ubiquity is to promote 3D graphics from a
`tool to a medium. Without ubiquity, graphics will remain as it
`does today, a tool for those select few whose work justifies
investment in exotic and expensive hardware. With ubiquity,
graphics can be used as a true medium. As such, graphics can be
used to record ideas and experiences, to transmit them across
space, and to serve as a technological substrate for people to
communicate within and communally experience virtual worlds.
But before it can become a successful medium, 3D graphics must
be universally available: the breadth and depth of the potential
audience must be large enough to sustain interesting and varied
content.

jaytor@microsoft.com, kajiya@microsoft.com

Permission to make digital or hard copies of part or all of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers, or to redistribute to lists, requires prior
specific permission and/or a fee.
© 1996 ACM-0-89791-746-4/96/008...$3.50
`How can we achieve ubiquity? There are a few criteria: 1)
`hardware must be so inexpensive that anyone who wants it can
`afford it, 2) there must be a minimum level of capability and
`quality to carry a wide range of applications, and 3) the offering
must carry compelling content. This paper treats the first two
criteria and presents a novel hardware approach to solving them.
`There are two approaches to making inexpensive graphics
`hardware. One approach is to make an attenuated version of
`conventional hardware. In the next section we make an analysis
`of the forces driving the cost of conventional graphics
`architectures. By mitigating some of these costs, one may obtain
`cheaper implementations with more modest performance. Over a
`dozen manufacturers are currently exploring this approach by
`cutting down on one or another cost factor. The risk of this
`approach, of course, is that each time one cuts cost, one also cuts
`performance or quality.
`An alternative approach is to look to new architectures that have a
`fundamentally different character than the conventional graphics
`pipeline. This is an approach pioneered at the high end by the
`Pixel Planes project [Fuc89], PixelFlow [Mol92], and various
`parallel ray tracing machines [Nis83, Pot89]. At the low end,
`Nvidia [Nvi95] is offering such a different architecture. We
present an architecture that is very much in the spirit of this latter
`path, delivering a high performance, high quality graphics system
`for a parts cost of $200 to $300.
`The second criterion, quality, must be evaluated in terms of the
`applications and content to be executed by the machine. Here we
`make a fundamentally different assumption from that underlying
the conventional graphics pipeline. We believe that the
requirements and metrics of performance for a ubiquitous graphics
system are much different from those for a system designed
primarily for mechanical CAD. In MCAD the ability to
`accurately and faithfully display the shape of the part is a strict
requirement. The metric of performance is often polygons per
second, but the ultimate result is frame rate: a low-cost system will
display at a much slower rate than a high-performance system, but
both will be able to display the shape accurately with exactly
the same image. One of our central assumptions is that in
`applications and content for ubiquitous graphics this situation is
`reversed. In a system to be used as a medium, rather than as a
`tool, the ability to smoothly convey motion, to be synchronized
`with sound and video, and to achieve low-latency interaction are
critical requirements. We believe the fidelity of the shapes, the
precise nature of their geometric relationships, and image quality
are performance metrics rather than strict requirements. In our
architecture we have striven to make it possible always to
interact in real time at video frame rates (e.g., 72-85 Hz). The
difference between high-cost and low-cost systems will be in the
fidelity and quality of the images.

FUNDAMENTAL FORCES
A graphics system designer struggles with two fundamental
forces: memory bandwidth and system latency. To achieve low
cost, a third force looms large: memory cost.
Space considerations do not allow us to detail all the bandwidth
requirements for a conventional graphics pipeline. The
considerations are straightforward: for example, simple
multiplication shows that display refresh bandwidth for a 75 Hz,
640x480x8 frame buffer requires 23 MB per second, while that for
1024x768x24 requires 169 MB per second. If we add the
requirements for z-buffering (average depth complexity of 3 with
random z-order), texture map reads with various antialiasing
schemes (point sampled, bilinear, trilinear, anisotropic), and
additional factors imposed by anti-aliasing, we obtain the
following chart:

Memory Bandwidth Requirements for Conventional Graphics
Pipeline for various 3D Graphics Performance, Quality, and
Resolutions (Mbytes/sec):
  640x480x16 bit, 30 Hz update, 16 bit Z,
    8 bit palettized point sampled texture ............... 190
  640x480x16 bit, 30 Hz update, 24 bit Z,
    16 bit bilinear filtered texture ..................... 340
  800x600x16 bit, 30 Hz update, 24 bit Z,
    16 bit trilinear filtered texture .................... 690
  800x600x24 bit, 45 Hz update, 24 bit Z,
    16 bit trilinear filtered texture .................... 1100
  1024x768x24 bit, 45 Hz update, 24 bit Z,
    16 bit trilinear filtered texture .................... 1800
  1024x768x24 bit, 75 Hz update, 24 bit Z,
    32 bit trilinear filtered texture .................... 4300
  1024x768x32 bit, 75 Hz update, 24 bit Z,
    32 bit anisotropic filtered texture .................. 6900
  1024x768x32 bit, 75 Hz update, 24 bit Z,
    32 bit anisotropic filtered texture,
    anti-aliased polygon edges ........................... 12,000

Memory bandwidth is a key indicator of system cost. The first
two entries indicate where current 3D accelerators for the PC are
falling. A fully configured SGI RE2, a truly impressive machine,
boasts a memory bandwidth of well over 10,000 MB per second.
It is quite clear that SGI has nothing to fear from evolving PC 3D
accelerators, which utilize traditional 3D pipelines, for some time
to come.
The second force, system latency, is handled mainly through
careful design of the basic algorithms of the architecture, as well
as careful pipelining to mask memory latencies.
The third force, memory cost, traditionally has not been of great
concern to high-end systems because achieving the aggregate
bandwidth has required large amounts of memory. The next chart
shows the results of calculating memory requirements for a
conventional graphics pipeline with different levels of
performance.

Memory Capacity Requirements for Conventional Graphics
Pipeline for various 3D Graphics Performance, Quality, and
Resolutions (Mbytes):
  640x480x16 bit, 16 bit Z, 2 texels/pixel,
    8 bit palettized point sampled texture ............... 2.6
  640x480x16 bit, 24 bit Z, 3 texels/pixel,
    16 bit bilinear filtered texture ..................... 4
  800x600x16 bit, 24 bit Z, 3 texels/pixel,
    16 bit trilinear filtered texture .................... 7
  800x600x24 bit, 24 bit Z, 3 texels/pixel,
    16 bit trilinear filtered texture .................... 8
  1024x768x24 bit, 24 bit Z, 3 texels/pixel,
    16 bit trilinear filtered texture .................... 13
  1024x768x24 bit, 24 bit Z, 3 texels/pixel,
    32 bit trilinear filtered texture .................... 19
  1024x768x32 bit, 24 bit Z, 3 texels/pixel,
    32 bit anisotropic filtered texture .................. 20
  1024x768x32 bit, 24 bit Z, 3 texels/pixel,
    32 bit anisotropic filtered texture,
    anti-aliased polygon edges ........................... 45

Over the last two decades, the drop in price per bit of
semiconductor memory has been phenomenal. A look at an early
DRAM versus today's reveals interesting trends.

DRAM Technology Improvements
                            1976       1995       Change   Change/Year
  Access Time               350 ns     50 ns      7X       10%
  Bandwidth (per data pin)  2 Mb/sec   22 Mb/sec  11X      12%
  Capacity                  4 Kbit     16 Mbit    4096X    50%
  Cost per MByte            $16,500    $23        720X     40%

Note that although capacity has improved tremendously, latency
and bandwidth have not made similar improvements. There is
every indication that these trends will continue to hold.
These charts suggest that achieving high-quality imagery using
the conventional graphics pipeline is an inherently expensive
enterprise. Those who maintain that improvements in CPU and
VLSI technology are sufficient to produce low-cost hardware, or
even software systems, that we would consider high-performance
today have not carefully analyzed the nature of the fundamental
forces at work.

IMAGE PROCESSING AND 3D GRAPHICS
Although the conventional graphics pipeline uses massive
amounts of memory bandwidth to do its job, it is equally clear
that much of this bandwidth is creating unused, if not unusable,
capacity. For example, the conventional pipeline is fully capable
of making every frame a display of a completely different
geometric model at full performance. The viewpoint may skip
about completely at random with no path coherence at all. Every
possible pixel pattern may serve as a texture map, even though the
vast majority of them are perceptually indistinguishable from
random noise. A frame may be completed in any pixel order even
though polygons tend to occupy adjacent pixels.
In our architecture we have sought to employ temporal coherence
of models, of motion, and of viewpoint, and spatial coherence of
texture and display. We have found that this approach greatly
mitigates the need for large memory bandwidths and capacities in
high-quality systems.
A fundamental technique we have used repeatedly is to replace
image synthesis with image processing. That image processing
`
`
`
`
and 3D graphics have always had an intimate theoretical
relationship is evident to anyone perusing the contents of a
typical SIGGRAPH proceedings. Even in high-quality off-line
rendering, image processing and composition have served essential
functions for many years. But, with a few exceptions like the
Pixar Image Computer [Lev84], Regan's image remapping system
[Reg94], and the PixelFlow architecture [Mol92], this relationship
has not extended into the physical embodiment of hardware.
In a sense, one can view texture mapping as an example of
marrying images and 3D graphics early in the pipeline. Segal et
al. [Seg92] have shown that texture mapping, especially when
considered in the context of multiple renderings, can simulate
many lighting effects. We have adopted this idea for the real-time
context, calling it multi-pass rendering.
Image compositing and image morphing have long been used to
exploit temporal coherence, at least in software systems [Coo87,
Che94, Che95, McM95]. Our architecture extends these ideas into
the real-time hardware domain for the case of affine image
transformations.
`
`HARDWARE ARCHITECTURE
There are four major concepts utilized in Talisman:
• Composited image layers with full affine transformations.
• Image compression.
• Chunking.
• Multi-pass rendering.
`Composited Image Layers
`The Talisman hardware does not incorporate a frame buffer in the
`traditional sense. Instead, multiple independent image layers are
`composited together at video rates to create the output video
`signal. These image layers can be rendered into and manipulated
`independently. The graphics system will generally use an
`independent image layer for each non-interpenetrating object in
`the scene. This allows each object to be updated independently so
`that object update rates can be optimized based on scene
`priorities. For example, an object that is moving in the distant
`background may not need to be updated as often, or with as much
`accuracy, as a foreground object.
Image layers can be of arbitrary size and shape, although the first
implementation of the system software uses only rectangular
layers. Each pixel in a layer has color and alpha (opacity)
`information associated with it so that multiple layers can be
`composited together to create the overall scene.
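To make the per-pixel operation concrete, here is a minimal sketch (ours, not the hardware datapath) of back-to-front "over" compositing of layer pixels carrying color and alpha; premultiplied-alpha colors in the 0.0-1.0 range are assumed:

```python
# Illustrative sketch only: compositing multiple image layers at one pixel.
# Colors are assumed premultiplied by alpha, components in 0.0-1.0.

def over(front, back):
    """Composite one premultiplied RGBA pixel over another."""
    fr, fg, fb, fa = front
    br, bg, bb, ba = back
    return (fr + br * (1.0 - fa),
            fg + bg * (1.0 - fa),
            fb + bb * (1.0 - fa),
            fa + ba * (1.0 - fa))

def composite(layers):
    """Composite a back-to-front list of per-layer pixels."""
    result = (0.0, 0.0, 0.0, 0.0)   # start fully transparent
    for pixel in layers:
        result = over(pixel, result)
    return result
```

A half-transparent foreground layer over an opaque background blends both contributions while the accumulated alpha saturates at 1.0.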
`Several different operations can be performed on these image
`layers at video rates, including scaling, rotation, subpixel
`positioning, and skews (i.e., full affine transformations). So, while
`image layer update rates are variable, image layer transformations
`(motion, etc.) occur at full video rates (e.g. 72 to 85 Hz),
`resulting in much more fluid dynamics than can be achieved by a
`conventional 3D graphics system that has no update rate
`guarantees.
`Many 3D transformations can be simulated by 2D imaging
`operations. For example, a receding object can be simulated by
`scaling the size of the image. By utilizing 2D transformations on
`previously rendered images for intermediate frames, overall
`processing requirements are significantly reduced, and 3D
`rendering power can be applied where it is needed to yield the
`highest quality results. Thus, the system software can employ
`
`temporal level of detail management and utilize frame-to-frame
`temporal coherence.
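The affine reuse described above can be sketched in a few lines. This is an illustrative software model (function names are ours, and the hardware applies much higher-quality filtering than the nearest-neighbor lookup shown): for each output pixel, the inverse affine matrix locates the source texel in the previously rendered layer.

```python
# Sketch of compositing-time affine transformation of an image layer via
# inverse mapping (nearest-neighbor sampling; illustrative only).

def invert_affine(m):
    """Invert a 2x3 affine matrix [[a, b, tx], [c, d, ty]]."""
    (a, b, tx), (c, d, ty) = m
    det = a * d - b * c
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return [[ia, ib, -(ia * tx + ib * ty)],
            [ic, id_, -(ic * tx + id_ * ty)]]

def warp_layer(layer, m, out_w, out_h):
    """Apply affine matrix m to `layer`, a dict of (x, y) -> RGBA tuple."""
    inv = invert_affine(m)
    out = {}
    for y in range(out_h):
        for x in range(out_w):
            sx = inv[0][0] * x + inv[0][1] * y + inv[0][2]
            sy = inv[1][0] * x + inv[1][1] * y + inv[1][2]
            texel = layer.get((round(sx), round(sy)))
            if texel is not None:          # outside the layer -> transparent
                out[(x, y)] = texel
    return out

# A receding object simulated by a 0.5x scale instead of re-rendering:
half_size = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
```

Here an intermediate frame reuses the rendered layer at half scale; only when the accumulated error grows too large does the object need re-rendering in 3D.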
`By using image layer scaling, the level of spatial detail can also be
`adjusted to match scene priorities. For example, background
`objects (e.g., cloudy sky) can be rendered into a small image layer
`(low resolution) which is then scaled to the appropriate size for
`display. By utilizing high quality filtering, the typical low
`resolution artifacts are reduced.
`A typical 3D graphics application (particularly an interactive
`game) trades off geometric level of detail to achieve higher
animation rates. The use of composited image layers allows the
Talisman system to utilize two additional scene parameters,
temporal level of detail and spatial level of detail, to optimize the
`effective performance as seen by the user. Further, the Talisman
`system software can manage these trade-offs automatically
`without requiring application support.
`
`Image Compression
Talisman broadly applies image compression technology to attack
the bandwidth and capacity problems outlined above. Image
compression has traditionally not been
`used in graphics systems because of the computational complexity
`required for high quality, and because it does not easily fit into a
`conventional graphics architecture. By using a concept we call
`chunking (described below), we are able to effectively apply
`compression to images and textures, achieving a significant
`improvement in price-performance.
In one respect, graphics systems have applied compression to
frame buffer memory. High end systems utilize eight bits (or
`more) for each of three color components, and often also include
`an eight bit alpha value. Low end systems compress these 32 bits
`per pixel to as few as four bits by discarding information and/or
`using a color palette to reduce the number of simultaneously
`displayable colors. This compression results in very noticeable
`artifacts, does not achieve a significant reduction in data
`requirements, and forces applications and/or drivers to deal with a
`broad range of pixel formats.
The compression used in Talisman is much more sophisticated,
using an algorithm similar to JPEG, which we refer to as TREC, to
achieve very high image quality yet still provide compression
`ratios of 10:1 or better. Another benefit of this approach is that a
`single high quality image format (32 bit true color) can be used
`for all applications.
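The storage arithmetic behind this choice is simple; assuming only the 10:1 figure and the 32-bit format above:

```python
# Back-of-the-envelope saving from ~10:1 compression of a 32-bit-per-pixel
# image layer (illustrative figures; a 1024x768 layer is assumed).
width, height, bytes_per_pixel = 1024, 768, 4
uncompressed = width * height * bytes_per_pixel   # 3,145,728 bytes
compressed = uncompressed // 10                   # roughly 307 KB at 10:1
```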
`
`Chunking
A traditional 3D graphics system, or any frame buffer for that
matter, can be, and usually is, accessed randomly: arbitrary pixels
on the screen can be accessed in any order. Compression
algorithms, however, rely on having access to a fairly large
number of neighboring pixels, in order to take advantage of
spatial coherence, and only after all pixel updates have been
made. The random access patterns utilized by conventional
graphics algorithms therefore make the application of
compression technology to display buffers impractical.
`This random access pattern also means that per-pixel hidden
`surface removal and anti-aliasing algorithms must maintain
`additional information for every pixel on the screen. This
`dramatically increases the memory size requirements, and adds
`another performance bottleneck.
`Talisman takes a different approach. Each image layer is divided
`into pixel regions (32 x 32 pixels in our reference
`implementation) called chunks. The geometry is presorted into
`
`
`
`
`bins based on which chunk (or chunks) the geometry will be
`rendered into. This process is referred to as chunking. Geometry
`that overlaps a chunk boundary is referenced in each chunk it is
`visible in. As the scene is animated, the data structure is modified
`to adjust for geometry that moves from one chunk to another.
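A minimal sketch of the presorting step, assuming only the 32 x 32 chunk size given above (the names and the conservative bounding-box test are ours, not the actual implementation):

```python
# Sketch of chunking: sort geometry into bins keyed by the 32x32-pixel
# chunks its screen-space bounding box overlaps. A triangle that crosses a
# chunk boundary is referenced in every chunk it may touch.

CHUNK = 32

def chunks_overlapped(tri):
    """Yield (cx, cy) chunk coordinates overlapped by a triangle's bounding
    box; tri is a sequence of (x, y) screen-space vertices."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    x0, x1 = int(min(xs)) // CHUNK, int(max(xs)) // CHUNK
    y0, y1 = int(min(ys)) // CHUNK, int(max(ys)) // CHUNK
    for cy in range(y0, y1 + 1):
        for cx in range(x0, x1 + 1):
            yield (cx, cy)

def bin_geometry(triangles):
    """Return {chunk: [triangle indices]} for a list of triangles."""
    bins = {}
    for i, tri in enumerate(triangles):
        for c in chunks_overlapped(tri):
            bins.setdefault(c, []).append(i)
    return bins
```

As the scene animates, only the bins of triangles whose bounding boxes cross into new chunks need updating.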
`While chunking adds some upstream overhead, it provides several
`significant advantages. Since all the geometry in one chunk is
`rendered before proceeding to the next, the depth buffer need only
`be as large as a single chunk. With a chunk size of 32 x 32, the
`depth buffer is implemented directly on the graphics rendering
`chip. This eliminates a considerable amount of memory, and also
`allows the depth buffer to be implemented using a specialized
`memory architecture which can be accessed with very high
`bandwidth and cleared instantly from one chunk to the next,
`eliminating the overhead between frames.
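The saving is easy to quantify (our arithmetic, assuming the 24-bit depth values discussed earlier and a 1024x768 screen):

```python
# On-chip depth buffer for one 32x32 chunk vs. a full-screen depth buffer.
BYTES_PER_DEPTH = 3                               # 24-bit Z assumed
chunk_depth = 32 * 32 * BYTES_PER_DEPTH           # 3,072 bytes: fits on-chip
screen_depth = 1024 * 768 * BYTES_PER_DEPTH       # 2,359,296 bytes off-chip
reduction = screen_depth // chunk_depth           # 768x less depth storage
```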
`Anti-aliasing is also considerably easier since each 32 x 32 chunk
`can be dealt with independently. Most high-end graphics systems
`which implement anti-aliasing utilize a great deal of additional
memory, and still perform relatively simplistic filtering. By using
chunking, the amount of data required is considerably reduced (by
roughly a factor of 1000), allowing practical implementation of a much
`more sophisticated anti-aliasing algorithm.
`The final advantage is that chunking enables block oriented image
`compression. Once each 32 x 32 chunk has been rendered (and
`anti-aliased), it can then be compressed with the TREC block
`transform compression algorithm.
`
`Multi-pass Rendering
`One of the major attractions of the Talisman architecture is the
`opportunity for 3D interactive applications to break out of the late
1970's look of CAD graphics systems: boring Lambertian
Gouraud-shaded polygons with Phong highlights. Texture
`mapping of color improves this look but imposes another
characteristic appearance on applications. In the 1980's, the idea
of programmable shaders and procedural texture maps [Coo84,
Han90] opened a new versatility to the rendering process. These
`ideas swept the off-line rendering world to create the high-quality
`images that we see today in film special effects.
`By reducing the bandwidth requirements using the techniques
`outlined above, Talisman can use a single shared memory system
`for all memory requirements including compressed texture storage
`and compressed image layer storage. This architecture allows data
`created by the rendering process to be fed back through the
`texture processor to be used as data in the rendering of a new
`image layer. This feedback allows rendering algorithms which
`require multiple passes to be implemented.
By coupling multi-pass rendering with a variety of compositing
modes, texture mapping techniques [Seg92], and a flexible
shading language, Talisman provides a variety of rendering effects
`that have previously been the domain of off-line software
renderers. These include effects such as shadows
`(including shadows from multiple light sources), environment
`mapped reflective objects, spot lights, fog, ground fog, lens flare,
`underwater simulation, waves, clouds, etc.
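As an illustration of the feedback idea, here is a software sketch (ours; the actual system expresses such effects through its shading language and compositing modes) of two-pass shadowing: pass 1 renders depth from the light's viewpoint, and pass 2 consumes that result as a texture-like lookup, mirroring how rendered layers are fed back through the texture processor.

```python
# Two-pass shadow sketch. Scene points are (x, y, depth_from_light)
# triples; all names are illustrative.

def render_light_depth(points):
    """Pass 1: keep the nearest depth to the light at each (x, y)."""
    depth_map = {}
    for x, y, d in points:
        if d < depth_map.get((x, y), float("inf")):
            depth_map[(x, y)] = d
    return depth_map

def shade(points, depth_map, eps=1e-3):
    """Pass 2: a point is lit only if it is the closest surface to the
    light; anything behind it at the same (x, y) is in shadow."""
    return [(x, y, d <= depth_map[(x, y)] + eps) for x, y, d in points]
```

Repeating pass 1 once per light source gives shadows from multiple lights, as mentioned above.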
`
`REFERENCE HARDWARE IMPLEMENTATION
`The Talisman architecture supports a broad range of
`implementations which provide different performance, features,
`rendering quality, etc. The reference implementation is targeted at
`the high-end of the consumer PC market and is designed to plug
`into personal computers using the PCI expansion bus. This board
`replaces functionality that is typically provided by a Windows
`accelerator board, a 3D accelerator board, an MPEG playback
`board, a video conferencing board, a sound board, and a modem.
`
[Figure: System HW Partitioning. Talisman VLSI components
(Media DSP, Polygon Object Processor, Image Layer Compositor,
Compositing Buffer, Media DAC) sit alongside standard
components (commodity 2Mx8 RDRAM DRAM memory, an audio
chip, and a modem), with external interfaces for the PCI bus,
RGB video, IEEE 1394, USB, and 2-channel audio.]
`
`The reference hardware consists of a combination of proprietary
`VLSI devices and commercially available components. The
`VLSI components have been developed using a top-down
`modular design approach allowing various aspects of the
`reference implementation to be readily used to create derivative
`designs.
`The reference implementation uses 4 Mbytes of shared memory
`implemented using two 8-bit Rambus channels. The Rambus
`memory provides higher bandwidth than traditional DRAM at
`
`near commodity DRAM pricing. This shared memory is used to
`store image layers and texture data in compressed form, DSP
`code and data, and various buffers used to transfer data between
`processing subsystems. A 2MB configuration is also possible,
`although such a configuration would have lower display
`resolution and would have other resource limitations.
`The Media DSP Processor is responsible for video codecs, audio
`processing, and front-end graphics processing (transformations,
`lighting, etc.). The reference HW implementation uses the
`
`
`
`
`Samsung MSP to perform these functions. The DSP combines a
`RISC processor with a specialized SIMD processor capable of
`providing high performance floating point and integer
processing (~1000 MFLOPS/MOPS). A real-time kernel and
`resource manager deals with allocating the DSP to the various
`graphics and multimedia tasks which are performed by this
`system.
`The Polygon Object Processor is a proprietary VLSI chip which
`performs scan-conversion, shading, texturing, hidden-surface
`
`removal, and anti-aliasing. The resulting rendered image layer
`chunks are stored in compressed form in the shared memory.
`The Image Layer Compositor operates at video rates to access
`the image layer chunk information from the shared memory,
`decompress the chunks, and process the images to perform
`general affine transformations (which include scaling,
`translation with subpixel accuracy, rotation, and skew). The
resulting pixels (with alpha) are sent to the Compositing Buffer.
`
Memory Use - Typical Scenario

Image Layer Data Storage
  Display Resolution: 1024 x 768
  Average Image Layer Size: 128 x 128
  Average Image Layer Depth Complexity: 1.7
  Image Layer Data Compression Factor: 5
  Image Layer Memory Management Overhead: 51 bytes per 32x32 chunk
  Memory Allocation Overhead: 4 bytes per 128 bytes
  Total Image Layer Data Storage Requirements: 1,171,637 bytes
Display Memory Management: 64 bytes per image layer, 5,222 bytes
Texture Data Storage
  Number of Texels: 4,000,000 texels
  Percent Texels with Alpha: 30%
  Avg. Number of Texture LODs: 6
  Texture Data Compression Factor: 15
  Total Texture Data Storage Requirements: 1,415,149 bytes
Command Buffers: 53,248 bytes
Audio Output Buffer: 2,450 bytes
Audio Synthesis Data: 32,768 bytes
Wav Table Buffer: 524,800 bytes
Media DSP Program and Scratch Mem: 524,288 bytes
Total Net Memory Requirements: 3,729,563 bytes

Image layer chunk data is processed 32 scan lines at a time for
display. The Compositing Buffer contains two 32 scan line
buffers which are toggled between display and compositing
activities. Each chip also contains a 32 scan line alpha buffer
which is used to accumulate alpha for each pixel. The Video
DAC includes a USB serial channel (for joysticks, etc.), and an
IEEE 1394 media channel (up to 400 Mbits/sec for connection
to an optional break-out box and external A/V equipment), as
well as standard palette DAC features.
A separate chip is used to handle audio digital-to-analog and
analog-to-digital conversion.
The table above indicates the total memory usage for a typical
3D application scenario. For the same scenario, the memory
bandwidth requirements are shown in the following table.

Memory Bandwidth - Typical Scenario

Pixel Rendering (avg. depth complexity 2.5): 32.4 Mbytes/sec
Display Bandwidth: 130.0 Mbytes/sec
Texture Reads
  Texels per Pixel (anisotropic filtering): 16
  Texture Cache Multiplier (avg. texel reuse): 2.5
  Texture Read Bandwidth: 58 Mbytes/sec
Polygon Command (30,000 polygons/scene): 61.0 Mbytes/sec
Total 3D Pipeline Bandwidth: 281.4 Mbytes/sec
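The component figures in the two typical-scenario tables can be cross-checked by summation (the values below come from the tables; the one-byte difference in the memory total appears to be rounding in the original):

```python
# Cross-check of the typical-scenario memory and bandwidth tables.
memory_use = {                       # bytes, from the memory-use table
    "image layer data": 1_171_637,
    "display memory management": 5_222,
    "texture data": 1_415_149,
    "command buffers": 53_248,
    "audio output buffer": 2_450,
    "audio synthesis data": 32_768,
    "wav table buffer": 524_800,
    "media dsp program/scratch": 524_288,
}
total_bytes = sum(memory_use.values())   # ~3.7 MB of the 4 MB board memory

bandwidth = {                        # Mbytes/sec, from the bandwidth table
    "pixel rendering": 32.4,
    "display": 130.0,
    "texture reads": 58.0,
    "polygon commands": 61.0,
}
total_bandwidth = round(sum(bandwidth.values()), 1)   # 281.4 Mbytes/sec
```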
`
`POLYGON OBJECT PROCESSOR
`The Polygon Object Processor is one of the two primary VLSI
`chips that are being developed for the reference HW
`implementation.
`Unique Functional Blocks
`Many of the functional blocks in the Polygon Object Processor
`will be recognized as being common in traditional 3D graphics
`pipelines. Some of the unique blocks are described here. The
operation of this chip is described later in the paper.
`Initial Evaluation - Since polygons are processed in 32 x 32
`chunks, triangle processing will typically not start at a triangle
`vertex. This block computes the intersection of the chunk with
`the triangle and computes the values for color, transparency,
`depth, and texture coordinates for the starting point of the
`triangle within the chunk.
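A minimal model of this step (our construction; the chip's actual interpolator is not described here) fits a plane to each attribute over the triangle and then evaluates it at an arbitrary starting pixel inside the chunk rather than at a vertex:

```python
# Sketch of initial evaluation: fit a plane value = a*x + b*y + c to an
# attribute (color, depth, texture coordinate, ...) over a triangle, then
# evaluate it at any starting point. Names are illustrative.

def attribute_plane(v0, v1, v2):
    """Each v is (x, y, value). Returns (a, b, c)."""
    (x0, y0, q0), (x1, y1, q1), (x2, y2, q2) = v0, v1, v2
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    a = ((q1 - q0) * (y2 - y0) - (q2 - q0) * (y1 - y0)) / det
    b = ((q2 - q0) * (x1 - x0) - (q1 - q0) * (x2 - x0)) / det
    c = q0 - a * x0 - b * y0
    return a, b, c

def eval_at(plane, x, y):
    """Evaluate the interpolant at, e.g., a chunk's first covered pixel."""
    a, b, c = plane
    return a * x + b * y + c
```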
`Pixel Engine - performs pixel level calculations including
`compositing, depth buffering, and fragment generation for pixels
`which are only partially covered. The pixel engine also handles
`z-comparison operations required for shadows.
`
`
`
`
[Figure: Polygon Object Processor block diagram. Major blocks:
Command and Memory Control (connecting to the RAMBUS
channels, Media DSP, and Image Layer Compositor), Primitive
Queue, Primitive Register, Initial Evaluation, Pre-Rasterizer,
Rasterizer, Cache Address Map, Texture Read Queue, Compressed
Texture Cache, Decompress, Texture Cache, Texture Cache
Control, Texture Filter Engine, Pixel Queue, Pixel Engine,
Depth/Stencil/Priority Buffer, Fragment Buffer, Color Buffers,
Fragment Resolve, and Compress.]
`
`Fragment Resolve - performs the final anti-aliasing step by
`resolving depth sorted pixel fragments with partial coverage or
`transparency.
`
`Coping with Latency
`One of the most challenging aspects of this design was coping
`with the long latency to memory for fetching texture data. Not
`only do we need to cope with a decompression step which takes
well over one hundred 12.5-ns cycles, but we are also using Rambus
`memory devices which need to be accessed using large blocks to
`achieve adequate bandwidth. This results in a total latency of
`several hundred cycles.
`Maintaining the full pixel rendering rate was a high priority in
`the design, so a mechanism that could ensure that texels were
`available for the texture filter engine when needed was required.
`The basic solution to this problem is to have two rasterizers -
`one calculating texel addresses and making sure that they are
`available in time, and the other performing color, depth, and
`pixel address interpolation for rendering. While these rasterizers
`both calculate information for the same pixels, they are
`separated by up to several hundred cycles.
Two solutions were considered for this mechanism: one was to
duplicate the address calculations in both rasterizers; the other
was to pass the texture addresses from the first rasterizer (called
the Pre-Rasterizer in the block diagram) to the second rasterizer
using a FIFO.
`In this case, texture address calculation logic in the rasterizers is
`fairly complex to deal with perspective divides and anisotropic
`texture filtering (discussed later). To duplicate this logic in both
`rasterizers required more silicon area than using the pixel queue,
`so the latter approach was chosen.
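The chosen scheme can be modeled as a producer and a consumer separated by a queue (a toy software model, ours; in hardware both rasterizers run concurrently, separated by several hundred cycles):

```python
from collections import deque

# Toy model: the pre-rasterizer walks the same pixels ahead of the
# rasterizer, issuing texel fetches and queuing the computed texture
# addresses; the rasterizer later pops each address and finds the texel
# already resident, with no duplicated address calculation.

def texel_address(pixel):
    """Stand-in for the complex per-pixel texture address calculation."""
    return pixel * 2            # illustrative only

def run_pipeline(pixels, prefetch_distance=4):
    queue = deque()
    cache = set()
    shaded = []
    stream = iter(pixels)
    for _ in range(prefetch_distance):   # pre-rasterizer runs ahead
        p = next(stream, None)
        if p is not None:
            addr = texel_address(p)
            cache.add(addr)              # fetch issued early
            queue.append(addr)
    for pixel in pixels:                 # rasterizer consumes
        addr = queue.popleft()           # address arrives via the FIFO
        assert addr in cache             # texel arrived in time
        shaded.append((pixel, addr))
        p = next(stream, None)
        if p is not None:
            a = texel_address(p)
            cache.add(a)
            queue.append(a)
    return shaded
```

The FIFO depth stands in for the several-hundred-cycle separation between the two rasterizers.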
`
Die Area and Packaging
The total die area of the Polygon Object Processor is shown in
the following table. The die area figures shown here are
estimates, since the layout of this part was not complete at the
time of paper submission.
The Polygon Object Processor is implemented using an
advanced 0.35 micron four-layer-metal 3.3 volt CMOS process.
The die is mounted in a 304-pin thermally-enhanced plastic
package.

POP Area Calculation (0.35 micron process)
  Functional Block           Gates     RAM bits   Area (sq. mm)
  RAC Cell                   -         -          5.17
  Memory Interface           12,288    4,500      1.77
  Input Logic                0         10,044     1.09
  Setup Logic                0         30,920     3.92
  Scan Convert               57,760    125,510    18.38
  Texture Lookup             0         83,450     8.87
  Pixel Logic                137,216   86,090     20.03
  Cache Logic                71,680    42,000     10.91
  Compression Logic          32,896    33,120     14.62
  Decompression Logic        16,000    47,000     6.02
  Functional blocks subtotal                      90.77
  Testability Gates          50,000               6.55
  Interblock Routing Area                         9.73
  Core Area                                       107.05
  I/O Cells Area                                  21.69
  Total Area                                      128.75
`
`IMAGE LAYER COMPOSITOR
`The Image Layer Compositor is the other custom VLSI chip that
`is being developed for the reference HW implementation. This
`part is responsible for generating the graphics output from a
`collection of depth sorted image layers.
`
`
`
`
[Figure: Image Layer Compositor block diagram. Major blocks:
Interface Control (connecting to the Polygon Object Processor),
Image Layer Queue, Image Layer Header Registers, Initial
Evaluation, Pre-Rasterizer, Rasterizer, Cache Address Map,
Image Layer Read Queue, Compressed Image Layer Cache,
Decompress, Image Layer Cache, Image Layer Cache Control,
Image Layer Filter Engine, and Compositing Buffer Controller
(connecting to the Compositing Buffer).]
`
`Comparison with Polygon Object Processor
You will notice that this block diagram is similar in many ways
to the Polygon Object Processor. In fact, many of the blocks are
`identical to reduce design time. In many ways, the Image Layer
`Compositor performs the same operations as triangle
`rasterization with texture mapping.
`In addition to the obvious differences (no depth buffering, anti-
`aliasing, image compression, etc.) there are a couple of key
`differences which significantly affect the design:
`Rendering Rate - the Image Layer Compositor must composite
`the images of multiple objects at full video rates with multiple
`objects overlapping each other. To support this, the rendering
`rate of the Image Layer Compositor is eight times higher than
`the Polygon Object Processor.
`Texture/Image Processing - the sophistication of the image
`processing used by the Image Layer Compositor is significantly
`reduced in order to keep silicon area to a reasonable level.
`Instead of performing perspective correct anisotropic filtering,
`this chip performs simple bi-linear filtering and requires only
`linear address calculations (since perspective transforms are not
`supported).
`These differences significantly affect the approach used to deal
`with memory latency. The rasterizer in the Image Layer
`Compositor is significantly simpler due to the simplified image
`processing, and the higher pixel rate requires the pre-rasterizer
`to be much further ahead of the rasterizer. As a result, the Image
Layer Compositor eliminates the Pixel Queue and simply
`reca