Talisman: Commodity Realtime 3D Graphics for the PC
Jay Torborg
James T. Kajiya
Microsoft Corporation

ABSTRACT
A new 3D graphics and multimedia hardware architecture, code-named Talisman, is described which exploits both spatial and temporal coherence to reduce the cost of high quality animation. Individually animated objects are rendered into independent image layers which are composited together at video refresh rates to create the final display. During the compositing process, a full affine transformation is applied to the layers to allow translation, rotation, scaling and skew to be used to simulate 3D motion of objects, thus providing a multiplier on 3D rendering performance and exploiting temporal image coherence. Image compression is broadly exploited for textures and image layers to reduce memory capacity and bandwidth requirements. Performance rivaling high-end 3D graphics workstations can be achieved at a cost point of two to three hundred dollars.
CR Categories and Subject Descriptors: B.2.1 [Arithmetic and Logic Structures]: Design Styles - Parallel, Pipelined; C.1.2 [Processor Architectures]: Multiprocessors - Parallel processors, Pipelined processors; I.3.1 [Computer Graphics]: Hardware Architecture - Raster display devices; I.3.3 [Computer Graphics]: Picture/Image Generation - Display algorithms.

INTRODUCTION
The central problem we are seeking to solve is that of attaining ubiquity for 3D graphics. Why ubiquity? Traditionally, the purpose of computer graphics has been as a tool. For example, mechanical CAD enhances the designer's ability to imagine complex three-dimensional shapes and how they fit together. Scientific visualization seeks to translate complex abstract relationships into perspicuous spatial relationships. Graphics in film-making is a tool that realizes the vision of a creative imagination. Today, computer graphics has thrived on being the tool of choice for augmenting the human imagination.
However, the effect of ubiquity is to promote 3D graphics from a tool to a medium. Without ubiquity, graphics will remain what it is today: a tool for those select few whose work justifies investment in exotic and expensive hardware. With ubiquity, graphics can be used as a true medium. As such, graphics can be used to record ideas and experiences, to transmit them across space, and to serve as a technological substrate for people to communicate within and communally experience virtual worlds. But before it can become a successful medium, 3D graphics must be universally available: the breadth and depth of the potential audience must be large enough to sustain interesting and varied content.

jaytor@microsoft.com, kajiya@microsoft.com

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 1996 ACM-0-89791-746-4/96/008...$3.50
How can we achieve ubiquity? There are a few criteria: 1) hardware must be so inexpensive that anyone who wants it can afford it, 2) there must be a minimum level of capability and quality to carry a wide range of applications, and 3) the offering must carry compelling content. This paper treats the first two criteria and a novel hardware approach to satisfying them.
There are two approaches to making inexpensive graphics hardware. One approach is to make an attenuated version of conventional hardware. In the next section we analyze the forces driving the cost of conventional graphics architectures. By mitigating some of these costs, one may obtain cheaper implementations with more modest performance. Over a dozen manufacturers are currently exploring this approach by cutting down on one or another cost factor. The risk of this approach, of course, is that each time one cuts cost, one also cuts performance or quality.
An alternative approach is to look to new architectures that have a fundamentally different character than the conventional graphics pipeline. This approach was pioneered at the high end by the Pixel Planes project [Fuc89], PixelFlow [Mol92], and various parallel ray tracing machines [Nis83, Pot89]. At the low end, Nvidia [Nvi95] is offering such a different architecture. We present an architecture very much in the spirit of this latter path, delivering a high-performance, high-quality graphics system for a parts cost of $200 to $300.
The second criterion, quality, must be evaluated in terms of the applications and content to be executed by the machine. Here we make a fundamentally different assumption from that underlying the conventional graphics pipeline. We believe that the requirements and metric of performance for a ubiquitous graphics system are much different from those for a system designed primarily for mechanical CAD. In MCAD the ability to accurately and faithfully display the shape of the part is a strict requirement. The metric of performance is often polygons per second, but the ultimate result is frame rate: a low-cost system will display at a much slower rate than a high-performance system, but both will be able to display the shape accurately with exactly the same image. One of our central assumptions is that in applications and content for ubiquitous graphics this situation is reversed. In a system to be used as a medium, rather than as a tool, the ability to smoothly convey motion, to synchronize with sound and video, and to achieve low-latency interaction are the critical requirements. We believe the fidelity of the shapes, the
precise nature of their geometric relationships, and image quality are performance metrics. In our architecture we have striven to make it possible always to interact in real time, at video frame rates (e.g., 72-85 Hz). The difference between high-cost and low-cost systems will be in the fidelity and quality of the images.

FUNDAMENTAL FORCES
A graphics system designer struggles with two fundamental forces: memory bandwidth and system latency. To achieve low cost, a third force looms large: memory cost.
Space considerations do not allow us to detail all the bandwidth requirements for a conventional graphics pipeline. The considerations are straightforward: for example, simple multiplication shows that display refresh for a 75 Hz, 640x480x8 frame buffer requires 23 MB per second, while that for 1024x768x24 requires 169 MB per second. If we add the requirements for z-buffering (average depth complexity of 3 with random z-order), texture map reads with various antialiasing schemes (point sample, bilinear, trilinear, anisotropic), and additional factors imposed by anti-aliasing, we obtain the following chart:

[Figure: Memory Bandwidth Requirements for Conventional Graphics Pipeline for various 3D Graphics Performance, Quality, and Resolutions. Bar chart; vertical axis "Memory Bandwidth Requirement (Mbytes/sec)" from 0 to 13,000; bar values (in sorted order) of 190, 340, 690, 1100, 1800, 4300, 6900, and 12,000 Mbytes/sec across eight configurations: 640x480x16 bit, 30 Hz update, 16 bit Z, 8 bit palletized point sampled texture; 640x480x16 bit, 30 Hz update, 24 bit Z, 16 bit bilinear filtered texture; 800x600x16 bit, 30 Hz update, 24 bit Z, 16 bit trilinear filtered texture; 800x600x24 bit, 45 Hz update, 24 bit Z, 16 bit trilinear filtered texture; 1024x768x24 bit, 45 Hz update, 24 bit Z, 16 bit trilinear filtered texture; 1024x768x24 bit, 75 Hz update, 24 bit Z, 32 bit trilinear filtered texture; 1024x768x32 bit, 75 Hz update, 24 bit Z, 32 bit anisotropic filtered texture; 1024x768x32 bit, 75 Hz update, 24 bit Z, 32 bit anisotropic filtered texture with anti-aliased polygon edges.]

Memory bandwidth is a key indicator of system cost. The two left-hand columns indicate where current 3D accelerators for the PC fall. A full-up SGI RE2, a truly impressive machine, boasts a memory bandwidth of well over 10,000 MB per second. It is quite clear that SGI has nothing to fear for some time to come from evolving PC 3D accelerators that utilize traditional 3D pipelines.
The second force, system latency, is handled mainly through careful design of the basic algorithms of the architecture, as well as careful pipelining to mask memory latencies.
The third force, memory cost, has traditionally not been of great concern to high-end systems because achieving the aggregate bandwidth has required large amounts of memory. The next chart shows the results of calculating memory requirements for a conventional graphics pipeline at different levels of performance.

[Figure: Memory Capacity Requirements for Conventional Graphics Pipeline for various 3D Graphics Performance, Quality, and Resolutions. Bar chart; vertical axis "Memory Capacity Requirement (Mbytes)"; bar values (in sorted order) of 2.6, 4, 7, 8, 13, 19, 20, and 45 Mbytes across eight configurations: 640x480x16 bit, 16 bit Z, 2 texels/pixel, 8 bit palletized point sampled texture; 640x480x16 bit, 24 bit Z, 3 texels/pixel, 16 bit bilinear filtered texture; 800x600x16 bit, 24 bit Z, 3 texels/pixel, 16 bit trilinear filtered texture; 800x600x24 bit, 24 bit Z, 3 texels/pixel, 16 bit trilinear filtered texture; 1024x768x24 bit, 24 bit Z, 3 texels/pixel, 16 bit trilinear filtered texture; 1024x768x24 bit, 24 bit Z, 3 texels/pixel, 32 bit trilinear filtered texture; 1024x768x32 bit, 24 bit Z, 3 texels/pixel, 32 bit anisotropic filtered texture; 1024x768x32 bit, 24 bit Z, 3 texels/pixel, 32 bit anisotropic filtered texture with anti-aliased polygon edges.]

Over the last two decades, the drop in price per bit of semiconductor memory has been phenomenal. A look at an early DRAM versus today's reveals interesting trends.

DRAM Technology Improvements

                            1976        1995         Change    Change per Year
Access Time                 350 ns      50 ns        7X        10%
Bandwidth (per data pin)    2 Mb/sec    22 Mb/sec    11X       12%
Capacity                    4 Kbit      16 Mbit      4096X     50%
Cost per MByte              $16,500     $23          720X      40%

Note that although capacity has improved tremendously, latency and bandwidth have not made similar improvements. There is every indication that these trends will continue to hold.
These charts suggest that achieving high-quality imagery using the conventional graphics pipeline is an inherently expensive enterprise. Those who maintain that improvements in CPU and VLSI technology are sufficient to produce low-cost hardware, or even software systems, that we would consider high-performance today have not carefully analyzed the nature of the fundamental forces at work.

IMAGE PROCESSING AND 3D GRAPHICS
Although the conventional graphics pipeline uses massive amounts of memory bandwidth to do its job, it is equally clear that much of this bandwidth is creating unused, if not unusable, capacity. For example, the conventional pipeline is fully capable of making every frame a display of a completely different geometric model at full performance. The viewpoint may skip about completely at random with no path coherence at all. Every possible pixel pattern may serve as a texture map, even though the vast majority of them are perceptually indistinguishable from random noise. A frame may be completed in any pixel order even though polygons tend to occupy adjacent pixels.
In our architecture we have sought to employ temporal coherence of models, of motion, and of viewpoint, and spatial coherence of texture and display. We have found that this approach greatly mitigates the need for large memory bandwidths and capacities in high-quality systems.
A fundamental technique we have used repeatedly is to replace image synthesis with image processing. That image processing
and 3D graphics have always had an intimate theoretical relationship is evident to anyone perusing the contents of a typical SIGGRAPH proceedings. Even in high-quality off-line rendering, image processing and composition have served essential functions for many years. But, with a few exceptions like the Pixar Image Computer [Lev84], Regan's image remapping system [Reg94], and the PixelFlow architecture [Mol92], this relationship has not extended into the physical embodiment of hardware.
In a sense, one can view texture mapping as an example of marrying images and 3D graphics early in the pipeline. Segal et al. [Seg92] have shown that texture mapping, especially when considered in the context of multiple renderings, can simulate many lighting effects. We have adopted this idea for the real-time context, calling it multi-pass rendering.
Image compositing and image morphing have long been used to exploit temporal coherence, at least in software systems [Coo87, Che94, Che95, McM95]. Our architecture extends these ideas into the real-time hardware domain for the case of affine image transformations.

HARDWARE ARCHITECTURE
Talisman utilizes four major concepts:
• Composited image layers with full affine transformations.
• Image compression.
• Chunking.
• Multi-pass rendering.
Composited Image Layers
The Talisman hardware does not incorporate a frame buffer in the traditional sense. Instead, multiple independent image layers are composited together at video rates to create the output video signal. These image layers can be rendered into and manipulated independently. The graphics system will generally use an independent image layer for each non-interpenetrating object in the scene. This allows each object to be updated independently, so that object update rates can be optimized based on scene priorities. For example, an object that is moving in the distant background may not need to be updated as often, or with as much accuracy, as a foreground object.
Image layers can be of arbitrary size and shape, although the first implementation of the system software uses only rectangular layers. Each pixel in a layer has color and alpha (opacity) information associated with it so that multiple layers can be composited together to create the overall scene.
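Because every layer pixel carries color and alpha, combining layers is a standard compositing operation. The following is a minimal sketch, not Talisman's actual datapath: it applies the conventional front-to-back "over" operator to premultiplied-alpha layers.

```python
import numpy as np

def composite_layers(layers):
    """Composite premultiplied-alpha RGBA layers, nearest layer first.

    Each layer is an (H, W, 4) float array whose color channels are
    premultiplied by alpha, as is conventional for the 'over' operator.
    """
    out = np.zeros_like(layers[0])
    for layer in layers:
        # front-to-back 'over': out = out + (1 - out_alpha) * layer
        out = out + (1.0 - out[..., 3:4]) * layer
    return out

# Two 1x1 layers: a half-opaque blue layer in front of an opaque red one.
blue = np.array([[[0.0, 0.0, 0.5, 0.5]]])   # premultiplied: 50% blue
red  = np.array([[[1.0, 0.0, 0.0, 1.0]]])   # opaque red
result = composite_layers([blue, red])
# result[0, 0] is [0.5, 0.0, 0.5, 1.0]: half red showing through half blue
```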
Several different operations can be performed on these image layers at video rates, including scaling, rotation, subpixel positioning, and skews (i.e., full affine transformations). So, while image layer update rates are variable, image layer transformations (motion, etc.) occur at full video rates (e.g., 72 to 85 Hz), resulting in much more fluid dynamics than can be achieved by a conventional 3D graphics system that has no update rate guarantees.
Many 3D transformations can be simulated by 2D imaging operations. For example, a receding object can be simulated by scaling down the size of its image. By utilizing 2D transformations of previously rendered images for intermediate frames, overall processing requirements are significantly reduced, and 3D rendering power can be applied where it is needed to yield the highest quality results. Thus, the system software can employ temporal level of detail management and utilize frame-to-frame temporal coherence.
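As an illustrative sketch of this idea, the following resamples a previously rendered layer under a 2D affine map using inverse mapping. The function name and the nearest-neighbor sampling are our own simplifications; the real compositor applies high-quality filtering.

```python
import numpy as np

def warp_affine(layer, A, t, out_shape):
    """Resample `layer` under the affine map p' = A @ p + t.

    Inverse mapping: for each output pixel we apply the inverse affine
    map and point-sample the source layer (a sketch only; hardware
    would filter rather than round to the nearest texel).
    """
    Ainv = np.linalg.inv(A)
    h, w = out_shape
    ys, xs = np.mgrid[0:h, 0:w]
    # map output coordinates back into source-layer coordinates
    src = np.einsum('ij,jhw->ihw', Ainv, np.stack([xs - t[0], ys - t[1]]))
    sx = np.round(src[0]).astype(int)
    sy = np.round(src[1]).astype(int)
    inside = (0 <= sx) & (sx < layer.shape[1]) & (0 <= sy) & (sy < layer.shape[0])
    out = np.zeros(out_shape + layer.shape[2:], dtype=layer.dtype)
    out[inside] = layer[sy[inside], sx[inside]]
    return out

# Simulate a receding object by scaling a 4x4 layer down by 2x:
layer = np.ones((4, 4))
half = np.diag([0.5, 0.5])                       # A: uniform scale by 0.5
small = warp_affine(layer, half, t=(0, 0), out_shape=(4, 4))
# the object now covers only the top-left 2x2 quadrant of the output
```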
By using image layer scaling, the level of spatial detail can also be adjusted to match scene priorities. For example, background objects (e.g., a cloudy sky) can be rendered into a small (low-resolution) image layer which is then scaled to the appropriate size for display. By utilizing high quality filtering, the typical low resolution artifacts are reduced.
A typical 3D graphics application (particularly an interactive game) trades off geometric level of detail to achieve higher animation rates. The use of composited image layers allows the Talisman system to utilize two additional scene parameters, temporal level of detail and spatial level of detail, to optimize the effective performance as seen by the user. Further, the Talisman system software can manage these trade-offs automatically without requiring application support.
Image Compression
Talisman broadly applies image compression technology to attack the memory bandwidth and capacity problems outlined above. Image compression has traditionally not been used in graphics systems because of the computational complexity required for high quality, and because it does not easily fit into a conventional graphics architecture. By using a concept we call chunking (described below), we are able to apply compression effectively to images and textures, achieving a significant improvement in price-performance.
In one respect, graphics systems have long employed compression for frame buffer memory. High-end systems utilize eight bits (or more) for each of three color components, and often also include an eight-bit alpha value. Low-end systems compress these 32 bits per pixel to as few as four bits by discarding information and/or using a color palette to reduce the number of simultaneously displayable colors. This compression results in very noticeable artifacts, does not achieve a significant reduction in data requirements, and forces applications and/or drivers to deal with a broad range of pixel formats.
The compression used in Talisman is much more sophisticated, using an algorithm similar to JPEG, which we refer to as TREC, to achieve very high image quality while still providing compression ratios of 10:1 or better. Another benefit of this approach is that a single high quality image format (32 bit true color) can be used for all applications.
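TREC itself is not specified here, so the following is a hedged illustration of the JPEG-like block-transform idea only: transform an 8x8 block with an orthonormal DCT, quantize the coefficients uniformly, and reconstruct. A real codec would use per-frequency quantization tables and entropy coding; the point is that smooth blocks survive coarse quantization with small error.

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix: C @ block @ C.T is the 2-D DCT.
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C *= np.sqrt(2.0 / n)
    C[0] /= np.sqrt(2.0)
    return C

def roundtrip(block, q=16.0):
    """Transform, uniformly quantize, and reconstruct one 8x8 block."""
    C = dct_matrix()
    coeffs = C @ block @ C.T
    quantized = np.round(coeffs / q)          # the lossy step
    return C.T @ (quantized * q) @ C          # inverse 2-D DCT

block = np.tile(np.linspace(0, 255, 8), (8, 1))   # smooth horizontal ramp
rec = roundtrip(block)
err = np.abs(rec - block).max()
# err stays small relative to the 0..255 range despite coarse quantization
```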

Chunking
A traditional 3D graphics system, or any frame buffer for that matter, can be, and usually is, accessed randomly: arbitrary pixels on the screen can be accessed in any order. Compression algorithms rely on having access to a fairly large number of neighboring pixels, in order to take advantage of spatial coherence, and only after all pixel updates have been made. The random access patterns of conventional graphics algorithms therefore make the application of compression technology to display buffers impractical.
This random access pattern also means that per-pixel hidden surface removal and anti-aliasing algorithms must maintain additional information for every pixel on the screen. This dramatically increases the memory size requirements and adds another performance bottleneck.
Talisman takes a different approach. Each image layer is divided into pixel regions (32 x 32 pixels in our reference implementation) called chunks. The geometry is presorted into
bins based on which chunk (or chunks) the geometry will be rendered into. This process is referred to as chunking. Geometry that overlaps a chunk boundary is referenced in each chunk it is visible in. As the scene is animated, the data structure is modified to adjust for geometry that moves from one chunk to another.
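A minimal sketch of this presorting step might look as follows. The bounding-box overlap test and the dictionary-of-bins data structure are our own assumptions for illustration, not the actual Talisman sorter, which would test true triangle/chunk overlap.

```python
def bin_triangles(triangles, chunk=32):
    """Presort screen-space triangles into 32x32-pixel chunk bins.

    Each triangle is ((x0, y0), (x1, y1), (x2, y2)). A triangle whose
    bounding box overlaps several chunks is referenced in every bin it
    may touch, as the paper describes.
    """
    bins = {}
    for tri in triangles:
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        for cy in range(int(min(ys)) // chunk, int(max(ys)) // chunk + 1):
            for cx in range(int(min(xs)) // chunk, int(max(xs)) // chunk + 1):
                bins.setdefault((cx, cy), []).append(tri)
    return bins

tri = ((30.0, 10.0), (40.0, 10.0), (35.0, 20.0))  # straddles two chunks in x
bins = bin_triangles([tri])
# the triangle is referenced in both chunk (0, 0) and chunk (1, 0)
```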
While chunking adds some upstream overhead, it provides several significant advantages. Since all the geometry in one chunk is rendered before proceeding to the next, the depth buffer need only be as large as a single chunk. With a chunk size of 32 x 32, the depth buffer is implemented directly on the graphics rendering chip. This eliminates a considerable amount of memory, and also allows the depth buffer to be implemented using a specialized memory architecture which can be accessed with very high bandwidth and cleared instantly from one chunk to the next, eliminating the overhead between frames.
Anti-aliasing is also considerably easier, since each 32 x 32 chunk can be dealt with independently. Most high-end graphics systems that implement anti-aliasing utilize a great deal of additional memory and still perform relatively simplistic filtering. By using chunking, the amount of data required is considerably reduced (by a factor of 1000), allowing practical implementation of a much more sophisticated anti-aliasing algorithm.
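The rough arithmetic behind that reduction, assuming the reference 1024x768 display and 32x32 chunks (the per-pixel fragment state is the same in both cases, so it cancels out of the ratio):

```python
# Fragment/depth state is needed only for one chunk at a time instead of
# for the whole screen, so the working-set reduction is the pixel ratio.
screen_pixels = 1024 * 768
chunk_pixels = 32 * 32
reduction = screen_pixels / chunk_pixels
print(reduction)  # 768.0 -- on the order of the factor-of-1000 reduction
```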
The final advantage is that chunking enables block-oriented image compression. Once each 32 x 32 chunk has been rendered (and anti-aliased), it can be compressed with the TREC block transform compression algorithm.
Multi-pass Rendering
One of the major attractions of the Talisman architecture is the opportunity for 3D interactive applications to break out of the late-1970s look of CAD graphics systems: boring lambertian Gouraud-shaded polygons with Phong highlights. Texture mapping of color improves this look but imposes another characteristic appearance on applications. In the 1980s, the idea of programmable shaders and procedural texture maps [Coo84, Han90] brought a new versatility to the rendering process. These ideas swept the off-line rendering world to create the high-quality images that we see today in film special effects.
By reducing the bandwidth requirements using the techniques outlined above, Talisman can use a single shared memory system for all memory requirements, including compressed texture storage and compressed image layer storage. This architecture allows data created by the rendering process to be fed back through the texture processor to be used as data in the rendering of a new image layer. This feedback allows rendering algorithms which require multiple passes to be implemented.
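A toy sketch of this feedback loop follows. The pass names, the shared-memory dictionary, and the shading functions are hypothetical stand-ins for Talisman's actual shading language; the point is only that the output of one pass is bound as a texture input to a later one.

```python
import numpy as np

# Each pass renders into shared memory; a later pass may read any
# earlier result back through the texture path.
shared_memory = {}

def run_pass(name, shade, texture_name=None):
    """Run one rendering pass, optionally sampling an earlier result."""
    tex = shared_memory.get(texture_name)
    shared_memory[name] = shade(tex)
    return shared_memory[name]

# Pass 1: render a 4x4 light map (no texture input).
run_pass('light_map', lambda _: np.linspace(0.0, 1.0, 16).reshape(4, 4))
# Pass 2: render the object, modulating a base color by the pass-1
# result fed back as a texture.
lit = run_pass('object', lambda tex: 0.8 * tex, texture_name='light_map')
```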
By coupling multi-pass rendering with a variety of compositing modes, texture mapping techniques [Seg92], and a flexible shading language, Talisman provides a variety of rendering effects that have previously been the domain of off-line software renderers. This includes support for functions such as shadows (including shadows from multiple light sources), environment-mapped reflective objects, spot lights, fog, ground fog, lens flare, underwater simulation, waves, clouds, etc.
REFERENCE HARDWARE IMPLEMENTATION
The Talisman architecture supports a broad range of implementations which provide different performance, features, rendering quality, etc. The reference implementation is targeted at the high end of the consumer PC market and is designed to plug into personal computers using the PCI expansion bus. This board replaces functionality that is typically provided by a Windows accelerator board, a 3D accelerator board, an MPEG playback board, a video conferencing board, a sound board, and a modem.
[Figure: System HW Partitioning. Talisman VLSI components (Polygon Object Processor, Image Layer Compositor, Compositing Buffer, Media DAC) and standard components (Media DSP, Audio Chip, Modem) share commodity DRAM memory (two 2Mx8 RDRAMs); external interfaces include the PCI bus, RGB video, IEEE 1394, USB, and two-channel audio.]
The reference hardware consists of a combination of proprietary VLSI devices and commercially available components. The VLSI components have been developed using a top-down modular design approach, allowing various aspects of the reference implementation to be readily used to create derivative designs.
The reference implementation uses 4 Mbytes of shared memory implemented using two 8-bit Rambus channels. The Rambus memory provides higher bandwidth than traditional DRAM at near commodity DRAM pricing. This shared memory is used to store image layers and texture data in compressed form, DSP code and data, and various buffers used to transfer data between processing subsystems. A 2 MB configuration is also possible, although such a configuration would have lower display resolution and other resource limitations.
The Media DSP Processor is responsible for video codecs, audio processing, and front-end graphics processing (transformations, lighting, etc.). The reference HW implementation uses the
Samsung MSP to perform these functions. The DSP combines a RISC processor with a specialized SIMD processor capable of providing high-performance floating point and integer processing (~1000 MFLOPS/MOPS). A real-time kernel and resource manager handles allocating the DSP to the various graphics and multimedia tasks performed by this system.
The Polygon Object Processor is a proprietary VLSI chip which performs scan-conversion, shading, texturing, hidden-surface removal, and anti-aliasing. The resulting rendered image layer chunks are stored in compressed form in the shared memory.
The Image Layer Compositor operates at video rates to access the image layer chunk information from the shared memory, decompress the chunks, and process the images to perform general affine transformations (which include scaling, translation with subpixel accuracy, rotation, and skew). The resulting pixels (with alpha) are sent to the Compositing Buffer.
Memory Use - Typical Scenario

                                                            Net Memory Requirements
Image Layer Data Storage
  Display Resolution                       1024 x 768
  Average Image Layer Size                 128 x 128
  Average Image Layer Depth Complexity     1.7
  Image Layer Data Compression Factor      5
  Image Layer Memory Management Overhead   51 bytes per 32x32 chunk
  Memory Allocation Overhead               4 bytes per 128 bytes
  Total Image Layer Data Storage Requirements                  1,171,637 bytes
Display Memory Management                  64 bytes per image layer
                                                                    5,222 bytes
Texture Data Storage
  Number of Texels                         4,000,000 texels
  Percent Texels with Alpha                30%
  Avg. Number of Texture LODs              6
  Texture Data Compression Factor          15
  Total Texture Data Storage Requirements                      1,415,149 bytes
Command Buffers                                                   53,248 bytes
Audio Output Buffer                                                2,450 bytes
Audio Synthesis Data                                              32,768 bytes
Wav Table Buffer                                                 524,800 bytes
Media DSP Program and Scratch Mem                                524,288 bytes
Total                                                          3,729,563 bytes

Image layer chunk data is processed 32 scan lines at a time for display. The Compositing Buffer contains two 32 scan line buffers which are toggled between display and compositing activities. Each chip also contains a 32 scan line alpha buffer which is used to accumulate alpha for each pixel. The Video DAC includes a USB serial channel (for joysticks, etc.) and an IEEE 1394 media channel (up to 400 Mbits/sec, for connection to an optional break-out box and external A/V equipment), as well as standard palette DAC features.
A separate chip is used to handle audio digital-to-analog and analog-to-digital conversion.
The table above indicates the total memory usage for a typical 3D application scenario. For the same scenario, the memory bandwidth requirements are shown in the following table.

Memory Bandwidth - Typical Scenario

Pixel Rendering (avg. depth complexity 2.5)        32.4 Mbytes/sec
Display Bandwidth                                 130.0 Mbytes/sec
Texture Reads
  Texels per Pixel (anisotropic filtering)         16
  Texture Cache Multiplier (avg. texel reuse)      2.5
  Texture Read Bandwidth                           58.0 Mbytes/sec
Polygon Command (30,000 polygons/scene)            61.0 Mbytes/sec
Total 3D Pipeline Bandwidth                       281.4 Mbytes/sec
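As a cross-check, the four component bandwidth figures given for this scenario (32.4, 130.0, 58, and 61.0 Mbytes/sec) sum exactly to the stated 3D pipeline total:

```python
# Typical-scenario bandwidth components (Mbytes/sec) and their total.
pixel_rendering = 32.4
display = 130.0
texture_reads = 58.0
polygon_commands = 61.0
total = pixel_rendering + display + texture_reads + polygon_commands
# total is 281.4 Mbytes/sec, matching the stated 3D pipeline bandwidth
```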

POLYGON OBJECT PROCESSOR
The Polygon Object Processor is one of the two primary VLSI chips being developed for the reference HW implementation.
Unique Functional Blocks
Many of the functional blocks in the Polygon Object Processor will be recognized as common in traditional 3D graphics pipelines. Some of the unique blocks are described here. The operation of this chip is described later in the paper.
Initial Evaluation - Since polygons are processed in 32 x 32 chunks, triangle processing will typically not start at a triangle vertex. This block computes the intersection of the chunk with the triangle and computes the values for color, transparency, depth, and texture coordinates at the starting point of the triangle within the chunk.
Pixel Engine - Performs pixel-level calculations including compositing, depth buffering, and fragment generation for pixels which are only partially covered. The pixel engine also handles z-comparison operations required for shadows.
[Figure: Polygon Object Processor block diagram, showing the Command and Memory Control block (connected to the RAMBUS channels, the Media DSP, and the Image Layer Compositor), Primitive Queue, Initial Evaluation, Primitive Register, Pre-Rasterizer, Rasterizer, Pixel Queue, Texture Read Queue, Compressed Texture Cache, Decompress, Texture Cache, Cache Address Map, Texture Cache Control, Texture Filter Engine, Pixel Engine, Depth/Stencil/Priority Buffer, Fragment Buffer, Color Buffers, Fragment Resolve, and Compress blocks.]
Fragment Resolve - Performs the final anti-aliasing step by resolving depth-sorted pixel fragments with partial coverage or transparency.

Coping with Latency
One of the most challenging aspects of this design was coping with the long latency to memory for fetching texture data. Not only do we need to cope with a decompression step which takes well over 100 12.5-ns cycles, but we are also using Rambus memory devices which need to be accessed in large blocks to achieve adequate bandwidth. This results in a total latency of several hundred cycles.
Maintaining the full pixel rendering rate was a high priority in the design, so a mechanism was required that could ensure that texels were available for the texture filter engine when needed. The basic solution to this problem is to have two rasterizers: one calculating texel addresses and making sure that they are available in time, and the other performing color, depth, and pixel address interpolation for rendering. While these rasterizers both calculate information for the same pixels, they are separated by up to several hundred cycles.
Two solutions were considered for this mechanism: one was to duplicate the address calculations in both rasterizers; the other was to pass the texture addresses from the first rasterizer (called the Pre-Rasterizer in the block diagram) to the second rasterizer using a FIFO.
In this case, the texture address calculation logic in the rasterizers is fairly complex, to deal with perspective divides and anisotropic texture filtering (discussed later). Duplicating this logic in both rasterizers required more silicon area than using the pixel queue, so the latter approach was chosen.
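A toy cycle-level model of this scheme shows how the FIFO hides all but the first fetch's latency. The latency, queue depth, and one-pixel-per-cycle rates below are illustrative assumptions, not the chip's actual parameters.

```python
from collections import deque

def render_span(pixels, fetch_latency=300, queue_capacity=512):
    """Model the two-rasterizer scheme: a pre-rasterizer walks the same
    pixels hundreds of cycles ahead, issuing texture fetches and pushing
    addresses into a FIFO (the Pixel Queue); the main rasterizer pops
    each address once its texels have arrived."""
    queue = deque()            # entries: (address, cycle when texels arrive)
    results, cycle = [], 0
    ahead = iter(pixels)
    while len(results) < len(pixels):
        # pre-rasterizer: run ahead, issuing one fetch per cycle while
        # the FIFO has room
        if len(queue) < queue_capacity:
            nxt = next(ahead, None)
            if nxt is not None:
                queue.append((nxt, cycle + fetch_latency))
        # main rasterizer: consume a pixel only when its fetch has landed
        if queue and queue[0][1] <= cycle:
            addr, _ = queue.popleft()
            results.append(addr)
        cycle += 1
    return results, cycle

results, cycles = render_span(list(range(1000)))
# 1000 pixels finish in 1000 + 300 cycles: only the first fetch's
# latency is exposed; the remaining 999 fetches overlap rendering
```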

Die Area and Packaging
The total die area of the Polygon Object Processor is shown in the following table. The die area figures shown here are estimates, since the layout of this part was not complete at the time of paper submission.

POP Area Calculation (0.35 Micron)

Functional Block          Gates      RAM bits    Total Area
RAC Cell                                             5.17
Memory Interface           12,288      4,500         1.77
Input Logic                     0     10,044         1.09
Setup Logic                     0     30,920         3.92
Scan Convert               57,760    125,510        18.38
Texture Lookup                  0     83,450         8.87
Pixel Logic               137,216     86,090        20.03
Cache Logic                71,680     42,000        10.91
Compression Logic          32,896     33,120        14.62
Decompression Logic        16,000     47,000         6.02
Functional blocks total                             90.77
Testability Gates          50,000                    6.55
Interblock Routing Area                              9.73
Core Area                                          107.05
I/O Cells Area                                      21.69
Total Area                                         128.75

The Polygon Object Processor is implemented in an advanced 0.35-micron, four-layer-metal, 3.3-volt CMOS process. The die is mounted in a 304-pin thermally-enhanced plastic package.

IMAGE LAYER COMPOSITOR
The Image Layer Compositor is the other custom VLSI chip being developed for the reference HW implementation. This part is responsible for generating the graphics output from a collection of depth-sorted image layers.
[Figure: Image Layer Compositor block diagram, showing the Interface Control block (connected to the Polygon Object Processor), Image Layer Queue, Initial Evaluation, Image Layer Header Registers, Pre-Rasterizer, Rasterizer, Image Layer Read Queue, Compressed Image Layer Cache, Decompress, Image Layer Cache, Cache Address Map, Image Layer Cache Control, Image Layer Filter Engine, and Compositing Buffer Controller (feeding the Compositing Buffer).]
`
Comparison with Polygon Object Processor
You will notice that this block diagram is similar in many ways to the Polygon Object Processor; in fact, many of the blocks are identical to reduce design time. In many ways, the Image Layer Compositor performs the same operations as triangle rasterization with texture mapping.
In addition to the obvious differences (no depth buffering, anti-aliasing, image compression, etc.), there are a couple of key differences which significantly affect the design:
Rendering Rate - the Image Layer Compositor must composite the images of multiple objects at full video rates, with multiple objects overlapping each other. To support this, the rendering rate of the Image Layer Compositor is eight times higher than that of the Polygon Object Processor.
Texture/Image Processing - the sophistication of the image processing used by the Image Layer Compositor is significantly reduced in order to keep silicon area to a reasonable level. Instead of performing perspective-correct anisotropic filtering, this chip performs simple bi-linear filtering and requires only linear address calculations (since perspective transforms are not supported).
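A minimal sketch of what "bi-linear filtering with linear address calculations" means: because the screen-to-layer mapping is affine rather than perspective, the layer address is a linear function of screen position (two multiply-adds here, incremental adds in hardware), and each sample blends the four neighboring texels. The function names and row-major image representation are assumptions for illustration.

```python
def bilinear(image, u, v):
    """Bi-linear filter of a row-major image at continuous (u, v)."""
    x0, y0 = int(u), int(v)
    fx, fy = u - x0, v - y0                  # fractional position in the cell
    x1 = min(x0 + 1, len(image[0]) - 1)      # clamp at the layer edge
    y1 = min(y0 + 1, len(image) - 1)
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bot = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bot * fy

def sample_affine(image, a, b, c, d, tx, ty, x, y):
    """Affine screen->layer address: (u, v) = (a*x + b*y + tx, c*x + d*y + ty).
    No per-pixel divide is needed, unlike a perspective-correct mapping."""
    return bilinear(image, a * x + b * y + tx, c * x + d * y + ty)
```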
These differences significantly affect the approach used to deal with memory latency. The rasterizer in the Image Layer Compositor is significantly simpler due to the simplified image processing, and the higher pixel rate requires the pre-rasterizer to run much further ahead of the rasterizer. As a result, the Image Layer Compositor eliminates the Pixel Queue and simply recalculates the image layer addresses in the rasterizer.
