`
`John Eyles*, Steven Molnar*, John Poulton†, Trey Greer*, Anselmo Lastra†, Nick England†, Lee Westover*
`
`* Hewlett-Packard Company
`Chapel Hill Graphics Lab
`
`† Department of Computer Science
`University of North Carolina
`
`ABSTRACT
`
`PixelFlow is an architecture for high-speed, highly realistic image
`generation, based on the techniques of object-parallelism and
`image composition. Its initial architecture was described in
[MOLN92]. After development by the original team of
`researchers at the University of North Carolina, and co-
`development with industry partners, Division Ltd. and Hewlett-
`Packard, PixelFlow now is a much more capable system than
`initially conceived and its hardware and software systems have
`evolved considerably. This paper describes the final realization
`of PixelFlow, along with hardware and software enhancements
`heretofore unpublished.
`
CR Categories and Subject Descriptors: C.5.4 [Computer System
Implementation]: VLSI Systems; I.3.1 [Computer Graphics]:
Hardware Architecture; I.3.3 [Computer Graphics]:
Picture/Image Generation; I.3.7 [Computer Graphics]:
Three-Dimensional Graphics and Realism.
`Additional Key Words and Phrases: object-parallel, rendering,
`compositing, deferred shading, scalable.
`
1 INTRODUCTION
`
PixelFlow is an architecture for high-speed image generation that
was designed to be linearly scalable to unprecedented levels of
performance and to implement realistic rendering techniques such
as user-programmable shading, texturing, antialiasing, and
shadows. Achieving these goals required a new architecture that
was substantially different from the interleaved screen-subdivision
approach that is nearly universal in today’s commercial graphics
architectures (e.g. [E&S96; SGI97]).
`_______________________
`Author addresses:
`* 1512 East Franklin St., Suite 200, Chapel Hill, NC 27514
`{jge,molnar,greer,lee}@chapelhill.hp.com
`† Sitterson Hall C.B #3175, Chapel Hill, NC 27599-3175
`{jp,lastra,nick}@cs.unc.edu
`
PixelFlow uses an object-parallel approach called image-composition
to achieve its high speed. Display primitives are
`distributed over an array of identical renderers, each of which
`computes a full-screen image of its fraction of the primitives. A
`dedicated, high-speed communication network called the Image-
`Composition Network merges these images in real time, based on
`visibility information, to produce an image of the entire scene
`[MOLN92].
`
`The PixelFlow architecture is extremely flexible, allowing
`configurations from deskside systems, drawing tens of millions of
`triangles per second, to multiple-rack systems, drawing hundreds
`of millions of triangles per second. Near-linear performance
`increases are obtained by adding renderers.
`
`1.1 System Overview
`A PixelFlow system consists of one or more chassis, each
`containing up to 9 Flow Units (the PixelFlow name for renderer).
`Each Flow Unit consists of: a Geometry Processor Board (GP), a
`conventional floating-point microprocessor with DRAM memory;
`and a Rasterizer Board (RB), a SIMD array of 8,192 byte-serial
`processing elements, each with 384 bytes of local memory.
`
`A Flow Unit is a powerful graphics engine in itself, capable of
`rendering up to 3 million antialiased polygons per second and
`performing complex shading calculations (such as bump
`mapping, shadows, and user-programmable shading) in real time.
`Geometry Processor Boards provide the front-end floating-point
`computation needed for transforming primitives and generating
`rendering instructions for the Rasterizer Boards. Rasterizer
`Boards turn screen-space descriptions of primitives into pixel
`values and perform sophisticated shading calculations.
`
`The Image-Composition Network is implemented as a daisy-
`chained connection between Rasterizer Boards of neighboring
`Flow Units. A second communication network, the Geometry
`Network, is a packet-routing network which connects Geometry
`Processor Boards.
`
`Any subset of Flow Units can be provided with I/O or video
`adapter daughter-cards that provide host-interface or video (frame
`buffer or frame grabber) capabilities. It is possible to build very
`large systems with multiple host interfaces attached to a parallel
`host, and multiple displays to support multiple-user applications.
`Such a system also can be re-configured by software into several
`smaller systems. Figure 1 shows a typical two-chassis PixelFlow
`system configuration.
`
`
`
`
[Diagram: two PixelFlow chassis (up to 9 Flow Units per chassis),
each Flow Unit shown as a GP and RB pair, joined by the Geometry
Network and the Image Composition Network; other labels: io, vid,
Host, Display.]
`
`Figure 1: Typical PixelFlow System.
`
`1.2 System Operation
`Individual Flow Units can be designated, by software, as one of
`three types:
`
• Renderers (not to be confused with the general term
  renderer above) process a portion of the database to
  generate regions of pixel data ready for shading. The
  Geometry Processor Board transforms primitives to
  screen-space and sorts them into bins according to
  screen region. The Rasterizer Board rasterizes primitives
  one region at a time. After all renderers have processed
  a given region, the region is composited across the
  Image-Composition Network and the composited pixel
  values are deposited onto one or more shaders.
• Shaders apply texture and lighting models to regions of
  raw pixel data, producing RGB color values that are
  forwarded to the frame buffer.
• Frame buffers send or receive video data via an
  attached video adapter card.

`To compute a frame, the GPs on each renderer first transform
`their fraction of the primitives into screen coordinates and sort
`them into bins corresponding to regions of the screen. The
`renderers then process the regions one at a time, rasterizing all of
`the primitives that affect the current region before moving on to
`the next.
`
`Once a given region has been rasterized on all of the renderers,
`the composition network merges the pixel data together and loads
`the region of composited pixel data onto a shader. Regions are
`assigned to shaders in round-robin fashion, with each shader
`processing every nth region. Shaders operate on entire regions in
`parallel, to convert raw pixel attributes into final RGB values,
`blend multiple samples together for antialiasing, and forward final
`color values to the frame buffer for display.
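
The per-frame flow described above can be sketched in a few lines of Python. This is an illustrative model only, not PixelFlow software: the function names, the bounding-box representation of a primitive, and the 128 x 64 region size used in the example are assumptions made for the sketch.

```python
# Illustrative sketch only: names and data layout are assumptions.

def bin_primitives(prims, region_w, region_h):
    """Sort primitives into bins keyed by (column, row) of the screen region.

    Each primitive is a screen-space bounding box (xmin, ymin, xmax, ymax)
    in pixels, inclusive. A primitive that spans several regions is placed
    in every bin it touches.
    """
    bins = {}
    for prim in prims:
        xmin, ymin, xmax, ymax = prim
        for ry in range(ymin // region_h, ymax // region_h + 1):
            for rx in range(xmin // region_w, xmax // region_w + 1):
                bins.setdefault((rx, ry), []).append(prim)
    return bins


def shader_for_region(region_index, num_shaders):
    """Round-robin assignment: with n shaders, each gets every nth region."""
    return region_index % num_shaders


prims = [(10, 10, 20, 20), (120, 60, 140, 70)]
bins = bin_primitives(prims, region_w=128, region_h=64)
```

In this example the second primitive straddles a region corner, so it lands in four bins, which is the kind of duplication that raises the effective primitive count in a region-based renderer.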
`
`1.3 Design Evolution
`The PixelFlow architecture has evolved considerably since its
`initial conception, described in [MOLN92]. PixelFlow was
`initially developed at the University of North Carolina at Chapel
`Hill as an NSF- and DARPA-sponsored research project. In 1994,
`Division Ltd. of Bristol, UK acquired commercial rights to the
technology and established a laboratory in Chapel Hill to
complete development of the project. In mid-1996, this
laboratory and rights to the PixelFlow technology were acquired
by Hewlett-Packard. The final design is significantly faster, more
complex, and technically more aggressive than originally
conceived.
`
`The following sections describe the final PixelFlow architecture
`in more detail. Special attention will be given to aspects of the
`architecture that have not been described before.
`
`2 ARCHITECTURAL FEATURES
`
`PixelFlow was designed to demonstrate the advantages of image-
`composition architectures, to provide a research platform for real-
`time 3D graphics algorithms and applications, and to provide
`workstation graphics capability with unprecedented levels of
performance and realism. In this section we describe its major
architectural features and the rationale for choosing them.
`
`2.1 Image Composition Architecture
`PixelFlow’s most characteristic feature is that it is an image-
`composition architecture. Image-composition is an object-parallel
`rendering approach in which the primitives in the scene are
`distributed over a parallel array of renderers, each of which is
`responsible for generating a full-screen image of its fraction of
the primitives (Figure 2). To compute a frame, each renderer first
generates its full-screen image. It then feeds color and
visibility information for its pixels into a
`local compositor (C), which also receives a stream of pixels from
`the compositors (and renderers) upstream. The compositor
`selects the visible pixel from its two input ports and forwards it to
`the compositors downstream. The compositors together form an
`Image-Composition Network, the output of which contains the
`pixels of the final image. The Image-Composition Network can
`be built as a pipeline, as shown in Figure 2, or as a binary tree;
`PixelFlow uses a pipeline, since it is easier to implement and the
`additional latency is negligible.
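
A minimal software model may help make the pipeline concrete. The sketch below is an assumption-laden illustration, not the hardware algorithm: pixels are modeled as (z, attributes) tuples, a smaller z is taken to be nearer the viewer, and ties are resolved in favor of the upstream pixel, a choice the text does not specify.

```python
# Illustrative sketch only: a software model of a compositor pipeline.
# Pixels are (z, attributes) tuples; smaller z is nearer the viewer.

def composite(upstream, local):
    """One compositor stage: per pixel, keep whichever input is visible."""
    return [u if u[0] <= loc[0] else loc for u, loc in zip(upstream, local)]


def run_pipeline(renderer_images):
    """Chain the compositors: each stage merges its renderer's image
    with the stream arriving from the stages upstream."""
    stream = renderer_images[0]
    for image in renderer_images[1:]:
        stream = composite(stream, image)
    return stream


FAR = float("inf")  # depth of a pixel that no primitive covers
images = [
    [(5.0, "a0"), (FAR, None), (2.0, "a2")],   # renderer 0
    [(3.0, "b0"), (7.0, "b1"), (FAR, None)],   # renderer 1
    [(4.0, "c0"), (6.0, "c1"), (9.0, "c2")],   # renderer 2
]
final = run_pipeline(images)
```

Each stage handles the same number of pixels regardless of how many primitives its renderer drew, so the per-link traffic in this model does not grow with scene size.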
`
`The bandwidth in every link of the Image-Composition Network
`is identical; it is determined by frame rate, screen size, and the
`amount of data per pixel or sample, and is independent of the
`number of polygons in the scene. Thus the network (and system)
`can be extended to incorporate an arbitrary number of renderers
`
`
`
`custom rasterizer chips also allowed a low-cost implementation of
`the Image-Composition Network (see below).
`
`2.4 Region-Based Rendering
`The logic-enhanced memory approach has one big disadvantage:
`it is not feasible (today) to implement enough image memory on
`custom chips to provide a full-screen image. This means that a
`full-screen image must be generated in multiple steps.
`
`PixelFlow renderers operate by sequentially processing small
regions of the screen. The region size is determined by the
number of samples per pixel and ranges from 32x32 to 64x128 pixels.
`After each renderer rasterizes a given region, the renderers scan
`out that region's rasterized pixels over the Image-Composition
`Network in synchrony with the other renderers. This compositing
`of regions is the “heartbeat” of the system.
`
`Before rasterization can begin, the GP must sort primitives into
`bins corresponding to the screen regions. This extra step requires
`memory on the GP and adds latency. Also, some primitives,
`particularly those of large screen extent, fall into more than one
`region, increasing the effective polygon count by a factor equal to
`the average number of regions per primitive. This number can
`range from 1.3 to 1.7 for typical datasets [MOLN94]. Region-
`based rendering algorithms also may suffer from load imbalances
`when primitives clump into regions; a particular danger is that
primitives may clump into different regions on different
renderers, potentially starving the compositing network. PixelFlow
`mitigates these problems by providing buffering in the logic-
`enhanced memory rasterizer for several regions of pixel data.
`
`2.5 Deferred Shading
`PixelFlow uses deferred shading, an approach that reduces the
`calculations required for complex shading models by factoring
`them out of the rasterization step [DEER88; ELLS91]. PixelFlow
`rasterizers do not compute pixel colors directly; instead, they
`compute geometric and intrinsic pixel attributes, such as surface-
`normal vectors and surface color; these attributes, not pixel
`colors, are composited. The composited pixels (or samples),
`containing these shading attributes, are deposited onto designated
`renderer boards called shaders. The shaders look up texture
`values for the pixels and compute final pixel color values, based
`on surface normal, light sources, etc. Shading information is
`shared among subpixel samples that hit the same surface, up to a
`maximum of three surfaces per pixel. For ultimate quality
`rendering, every subpixel sample can be shaded independently.
`After shading, regions of shaded pixels are forwarded to a frame
`buffer for display.
`
`The advantage of this approach is that a bounded number of
`shading calculations are performed per pixel, no matter what the
`depth complexity of the scene is or how many renderers are in the
system. Thus, shading performance is decoupled from
rasterization performance: the number of shaders required is determined
`only by the resolution of the image, the number of surfaces
`shaded per pixel, the complexity of the shading model, and the
`frame rate.
`
`PixelFlow’s SIMD rasterizer is an ideal processor for deferred
`shading. Shading calculations can be performed for many pixels
`simultaneously; if all pixels are shaded with the same algorithm,
`the SIMD rasterizer achieves near 100% processor utilization.
`This allows up to 800 billion byte-operations per second of
`
[Diagram: four Renderers, each feeding a compositor (C); the
compositors are chained in a pipeline.]
`
`Figure 2: Object-parallel rendering by image composition.
`
`into the system. This gives the architecture its unique and most
`important property: linear scalability to arbitrarily high levels of
`performance.
`
`2.2 Supersampling Antialiasing with Z-buffer Visibility
`PixelFlow performs antialiasing by supersampling. Each renderer
`computes its full-screen image with 4, 8, or more samples per
pixel. Samples are computed in parallel at jittered, subpixel
locations. Up to eight samples per pixel are computed
simultaneously, each with independent colors (or shading attributes) and z
`values. The compositors perform a simple z comparison for each
`sample, forwarding the appropriate sample downstream. After
`composition, the samples are blended together to form the final
`image.
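
The composite-then-blend sequence can be illustrated as follows. This is a sketch under stated assumptions rather than the PixelFlow implementation: each sample is modeled as a (z, color) tuple, and the final blend is an equal-weight average, whereas the text says only that the samples are blended together.

```python
# Illustrative sketch only; sample layout and the box filter are assumptions.

FAR = float("inf")  # depth of an empty (background) sample


def composite_samples(upstream, local):
    """Per-sample z comparison: forward whichever sample is nearer."""
    return [u if u[0] <= loc[0] else loc for u, loc in zip(upstream, local)]


def blend(samples):
    """Average the colors of one pixel's samples (equal weights assumed)."""
    n = len(samples)
    return tuple(sum(color[i] for _, color in samples) / n for i in range(3))


RED, BLUE, BLACK = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)

# One pixel, four samples, two renderers.
renderer_a = [(1.0, RED), (1.0, RED), (FAR, BLACK), (FAR, BLACK)]
renderer_b = [(2.0, BLUE)] * 4

pixel = blend(composite_samples(renderer_a, renderer_b))
```

Here the red surface covers half of the pixel's samples in front of the blue one, so the blended pixel is an even mix of the two colors.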
`
`Supersampling was chosen because it is general, the compositor
`hardware is simple (therefore fast), and the number of samples
can be varied to trade speed for image quality. It has two
disadvantages: first, the composition network must support the
worst-case bandwidth (that of every sample within a pixel hitting a
different surface), and this bandwidth is large; second, rendering
`transparent surfaces requires screen-door or multipass algorithms
`[MAMM89].
`
`2.3 Logic-Enhanced Memory Rasterizer
`The PixelFlow Rasterizer Board uses the logic-enhanced memory
`approach used in our earlier Pixel-Planes designs [EYLE88,
`FUCH89], in which a large SIMD array of pixel processors is
implemented using custom VLSI chips that integrate simple
processing elements with high-speed, low-latency image memory.
This approach eliminates the traditional bandwidth bottleneck
between rasterizer and image memory, permitting more
sophisticated rasterization algorithms and higher polygon rates [POUL92,
`TORB96]. PixelFlow’s enhanced memory chips have byte-wide
`ALUs and operate at 100 MHz, enabling a 3 million triangle-per-
`second rasterizer to be built on a single circuit board. Building
`
`
`
`
`shading performance on a single board. The rasterizer’s texture
`subsystem, using commodity SDRAM texture memory, supports
texturing, environment mapping, shadows, and so forth
[MOLN95].
`
`Deferred shading requires higher bandwidth in the Image-
`Composition Network, since pixel shading attributes require more
`data than the three to four bytes required for RGB color values.
`Transparent surfaces, including texture-modulated transparency,
`are handled by sending transparent polygons to shaders and
`accumulating transparent layers using Mammen’s algorithm
`[MAMM89]. The performance impact is determined by the
`number of transparent primitives and the number of transparency
`layers.
`
`2.6 Ultra-Fast Image-Composition Network
To support high-resolution displays, fast frame rates, and deferred
shading, the Image-Composition Network must provide enormous
bandwidth: tens of Gbytes/second. We accomplish
this in a cost-effective way by integrating the compositors onto
the logic-enhanced memory chips that implement the SIMD
rasterizer array. The network is formed by daisy-chaining
connections between the logic-enhanced memories on
neighboring boards. Hence the network consists entirely of
`point-to-point communication between identical custom chips on
`neighboring boards, so state-of-the-art techniques for high-speed,
`low-power interconnect can be employed to provide the necessary
`bandwidth.
`
The Image-Composition Network on PixelFlow is 256 wires wide
and runs at 200 MHz, with data traveling in both directions on the
same wire at the same time. The total bandwidth, therefore, is
2 × 200 MHz × 256 wires = 102.4 Gbit/second, or roughly
100 Gbit/second. This is sufficient to
render 1280x1024-pixel images with sophisticated shading and 4
samples per pixel at greater than 60 frames per second.
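
The bandwidth arithmetic can be checked directly. The short calculation below uses only the figures quoted in the text; the last step, which divides one direction's bandwidth by the sample rate to get a per-sample byte budget, rests on the assumption that composition traffic for a frame travels in a single direction, which the text does not state.

```python
# Worked arithmetic from the quoted figures; the variable names are ours.

WIRES = 256
CLOCK_HZ = 200_000_000   # 200 MHz
DIRECTIONS = 2           # simultaneous bidirectional signalling

total_bits_per_sec = DIRECTIONS * CLOCK_HZ * WIRES   # both directions
bytes_per_sec_one_way = CLOCK_HZ * WIRES // 8        # one direction

# Sample rate for 1280x1024 pixels, 4 samples per pixel, 60 frames/second.
samples_per_sec = 1280 * 1024 * 4 * 60

# Assumption: one direction carries all composition traffic.
byte_budget_per_sample = bytes_per_sec_one_way / samples_per_sec

print(total_bits_per_sec / 1e9, "Gbit/s total")
print(bytes_per_sec_one_way / 1e9, "Gbytes/s each direction")
print(round(byte_budget_per_sample, 1), "bytes per sample at most")
```

Under that assumption, each sample could carry on the order of 20 bytes of shading attributes and depth at the quoted resolution and frame rate.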
`
`2.7 System Configurability
`Because renderers, shaders, and frame buffer boards all share the
`same underlying hardware (with the exception of I/O and video
`daughter-cards), applications can tune the number of renderers
`and shaders to achieve optimum speed, based on the number of
primitives and the complexity of the shading. Also, a large
machine can be partitioned into smaller machines to support
multiple users, or used as a single, large machine when ultimate
performance is desired.
`
`3 HARDWARE COMPONENTS
`
`PixelFlow is a modular graphics system, composed of one or
`more chassis, each containing up to 9 Flow Units (the maximum
`configuration is 256 Flow Units or more than 28 chassis!).
`Figure 3 shows a one-chassis PixelFlow configuration.
`
`The system is built around a horizontal midplane. Geometry
`Processor Boards plug into the underside of the midplane.
`Rasterizer Boards plug into the top of the midplane. The
`midplane contains the daisy-chain wiring for the Geometry and
Image-Composition Networks, as well as clock and power
distribution. Figure 4 shows the components and interconnections of a
`Flow Unit.
`
[Diagram labels: I/O Daughter-Card; Video Daughter-Card; Texture
Memory; EMCs; Image Composition Network; Geometry Network;
Multiple Chassis; Frame Buffer; Shader; Renderers; PA-8000s;
Program/Data Memory.]
`
`Figure 3: One-chassis PixelFlow system.
`
`3.1 Geometry Processor
`The Geometry Processor (GP) is a fast floating-point processor
`that may be configured with one or two CPUs and up to 256
`Mbytes of memory.
`
`CPUs. The CPUs are Hewlett-Packard PA-RISC PA-8000
`modules. In a dual processor GP, the two processors are cache
`coherent. The PA-8000 runs at 180 MHz, issuing a peak of two
`floating point multiply-accumulates and two integer ops per
`cycle. The processor modules include large instruction and data
`caches.
`
`Memory. GP memory consists of 64 to 512 Mbytes of SDRAM
`memory, serving both as main memory for the GP and as a large
`FIFO queue for buffering commands for the rasterizer.
`
`RHInO. A custom ASIC, the RHInO (Runway Host and I/O)
`connects the processors with memory, the Geometry Network,
and the Rasterizer Board. Its primary function is to service
memory requests from the two processors and its various I/O ports. It
`also contains two DMA engines, one for transmitting rendering
`commands from SDRAM memory to the Rasterizer Board, and
`one for sending and receiving data from the Geometry Network.
`
`Geometry Network. The Geometry Network is a high-speed
`packet-routing network that connects the GPs to each other. This
is particularly useful for connecting the host to Flow Units that do
not include an I/O daughter-card. It is implemented using a bit-slice
`pair of Geometry Network Interface (GeNIe) ASICs; they
`physically reside on the Rasterizer Board.
`
`The GeNIe provides three ports onto the Geometry Network for
`each Flow Unit. One port goes to the GP itself (via the RHInO);
`one port goes to the optional I/O adapter; a third goes to the Inter-
`TASIC Ring on the Rasterizer Board for loading textures and
`reading frame-buffer contents. Each port supports I/O traffic of
`up to 240 Mbytes/second. The Geometry Network supports
`broadcasts to groups of receivers. The overall Geometry Network
`bandwidth is 800 Mbytes/sec in each direction. Non-overlapping
`transfers may occur simultaneously.
`
`
`
`
[Diagram labels:
Geometry Processor Board: two PA-8000 CPUs (2 MBytes cache each);
Runway Bus, 768 Mbytes/sec; RHInO; Main Memory, 64-512 Mbytes,
850 Mbytes/sec.
Rasterizer Board: GeNIe; EIGC; TIGC; 32 EMCs (SIMD 8,192 PE array);
8 TASICs; Texture and Frame Buffer Store, four stores of 16-64 MB
each; I/O or Video Daughter Card.
Links: rasterizing or shading instructions, 400 Mbytes/sec; GP to
Geometry Network, 240 Mbytes/sec; Adapter to Geometry Network, up
to 240 Mbytes/sec; Geometry Network, 800 Mbytes/sec each direction;
Image Composition Network, 6.4 Gbytes/sec each direction; TASIC
Ring, 800 Mbytes/sec; Video Interface, 400 Mpixels/sec in or out.]
`
`Figure 4: PixelFlow Flow Unit with CPUs and IO/Video adapter.
`
`3.2 SIMD Pixel Processor Array
`The heart of the Rasterizer Board is a SIMD array of 8,192
`processing elements (PEs). This array is mapped to screen
`regions of different sizes, depending on the number of samples
`per pixel, as follows:
`
Samples per Pixel    Region Size (pixels)
1                    128 x 64
4                    32 x 64
8                    32 x 32
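
The three rows are consistent with one PE per sample: in each case the region's pixel count times the samples per pixel equals the 8,192 PEs on the board. The sketch below checks this and, as an added illustration, counts how many regions tile a 1280x1024 screen at each setting. The function names are ours, and reading each region size as width x height is an assumption (the counts come out the same either way for this screen size).

```python
# Consistency check on the region-size table; names are illustrative.

TOTAL_PES = 32 * 256  # 32 EMCs x 256 PEs each = 8,192

# samples per pixel -> region size, read as (width, height) in pixels
REGION_SIZE = {1: (128, 64), 4: (32, 64), 8: (32, 32)}


def pes_used(samples):
    """PEs needed to hold one region with one PE per sample."""
    w, h = REGION_SIZE[samples]
    return w * h * samples


def regions_per_frame(samples, screen_w=1280, screen_h=1024):
    """Number of regions needed to tile the screen (ceiling division)."""
    w, h = REGION_SIZE[samples]
    cols = -(-screen_w // w)
    rows = -(-screen_h // h)
    return cols * rows


usage = {s: pes_used(s) for s in REGION_SIZE}
counts = {s: regions_per_frame(s) for s in REGION_SIZE}
```

Going from 1 to 8 samples per pixel multiplies the number of regions per frame by eight in this model, which is one way to see the speed-for-quality trade of supersampling.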
`
`The PE array is divided into four modules, each tightly coupled to
`a texture/video subsystem.
`
The SIMD array and texture/video subsystem operate under the
control of a pair of Image-Generation Controller chips (IGCs),
`which perform cycle-by-cycle sequencing of the SIMD array and
`provide data for the EMCs’ linear expression evaluator.
`
`The PE array is implemented on 32 logic-enhanced memory chips
`(EMCs), each containing 256 PEs. Figure 5 shows a block
`diagram of an EMC.
`
`Each PE consists of an arithmetic/logical unit (ALU) and 384
`bytes of local memory. This includes 256 bytes of main memory,
`and four 32-byte partitions associated with two I/O ports, the
`Local Port and the Image Composition port. A linear expression
`evaluator computes values of the bilinear expression Ax+By+C in
`parallel for every PE; the pair (x,y) is the address of each PE on a
`subpixel resolution grid, and A, B, and C are user-specified as part
`of the SIMD instruction stream. The ALU performs arithmetic
`
`
`
`
[Diagram labels: Linear Expression Evaluator data in; Processing
Element ALU with M, R, and S registers, MUX, and adder; pixel shift
paths to and from the previous and next PE of this panel; Main
Pixel Memory (256 bytes); Local Port output and input buffers
(data and addresses out to TASIC, data in from TASIC); IC Port
R-to-L and L-to-R buffers with compositors (C); right-to-left image
composition from this PE of the next Rasterizer; left-to-right
image composition to this PE of the next Rasterizer.]
`
`Figure 6: Functional diagram of Processing Element.
`
`may be read from or written to memory on each clock cycle. The
`384 bytes of local memory for each PE are arranged as:
`
• 256 bytes of “main” memory
• 32 bytes of Local Port input buffer
• 32 bytes of Local Port output buffer
• 32 bytes of Image-Composition Network left-to-right transfer buffer
• 32 bytes of Image-Composition Network right-to-left transfer buffer
`
`The four 32-byte partitions of memory are used for I/O
`operations, using the two communication ports described below.
`These partitions are part of the same address space as the 256
`bytes of main memory, and all 384 bytes can be accessed by the
`ALU. While communication port operations are in progress, the
`
`
[Diagram labels: instruction, address, and data inputs; 256 PEs,
each with 256 bytes of PE Main Memory and four 32-byte buffers
(Local Port Output Buffer, Local Port Input Buffer, Right->Left
Transfer Buffer, Left->Right Transfer Buffer); 256 Pixel ALUs;
Linear Expression Evaluator with inputs A, B, C; EOrL; Local Port;
Image Composition Port with “left” data and “right” data.]
`
`Figure 5: Block diagram of Enhanced Memory Chip.
`
`and logical operations on local memory and on the local value of
`the bilinear expression.
`
`Figure 6 shows a functional diagram of one PE. The major
`components are described in the following sections.
`
`ALU. The ALU implements an 8-bit add and a full range of
`bitwise logical functions. There are three 8-bit registers: the R,
`S, and M registers. The R and S registers can be loaded with the
`core result. The R register can be fed back to the core, and either
`register can be written to memory. The M register is loaded with
`a byte read from memory; it also can be loaded with the R or S
`register value. The R and S registers can be combined into a
`single 16-bit accumulator, to accelerate multiplies. A carry
`register is provided for multi-byte computations. Each PE
`includes an enable register. PEs may be disabled, by clearing this
`register, on the basis of computation results; memory writes do
`not occur at PEs that are disabled.
`
`Linear Expression Evaluator. The linear expression evaluator
`operates byte-serially to provide each processor with one byte of
`the bilinear expression on every clock cycle; this can be thought
`of as an immediate operand. The result for each set of
`coefficients generally must be preceded by two guard bytes, since
`A and B are multiplied by 14-bit numbers.
`
`The PEs are assigned x,y addresses on a subpixel grid with
`resolution of 1/8th pixel. The PEs are grouped into sets of 1, 4 or
`8, each group corresponding to a pixel. The PEs in each group
`are assigned x,y subpixel addresses in a 2-pixel-wide box about
`the pixel center; the pattern of subpixel addresses is the same for
`each group, and this pattern defines the antialiasing kernel.
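
As an illustration of how the evaluator and the enable register can work together, the sketch below evaluates Ax+By+C at every PE address and disables the PEs where any expression is negative. Using this to test whether a sample lies inside a triangle is our example of the mechanism, not a procedure taken from this text; the 4 x 4 PE block, the pixel-center addresses, and all names are assumptions, and the byte-serial timing of the hardware is not modeled.

```python
# Illustrative sketch only: models the evaluator in software.

def evaluate(a, b, c, addresses):
    """Value of A*x + B*y + C at every PE address (one value per PE)."""
    return [a * x + b * y + c for x, y in addresses]


def apply_edges(edges, addresses):
    """Clear a PE's enable flag wherever any expression is negative."""
    enable = [True] * len(addresses)
    for a, b, c in edges:
        values = evaluate(a, b, c, addresses)
        enable = [e and v >= 0 for e, v in zip(enable, values)]
    return enable


# A 4 x 4 block of PEs, one per pixel, addressed at pixel centers (assumed).
addresses = [(x + 0.5, y + 0.5) for y in range(4) for x in range(4)]

# Triangle (0,0), (4,0), (0,4) written as three half-plane tests:
# x >= 0, y >= 0, and 4 - x - y >= 0.
edges = [(1, 0, 0), (0, 1, 0), (-1, -1, 4)]

enable = apply_edges(edges, addresses)
```

Because the same A, B, and C are broadcast to every PE, one instruction stream tests all PEs at once; only the PEs left enabled would go on to write memory.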
`
`Inter-PE Communication. The PEs on each EMC are connected
`by a shift path that allows each ALU to use the R register of
`either of its neighbors as an operand. When antialiasing, the 4 or
`8 samples for a single pixel are mapped to contiguous PEs; the
`shift path is used to combine these samples into a single PE,
`where they can be filtered into an aggregate display value. This is
`usually done on a shader, after composition.
`
`Local Memory. An 8-bit wide memory data bus connects the M,
`R, and S registers to the 384 bytes of local memory; a byte of data
`
`
`
`ALU cannot access these addresses; this lockout is accomplished
`using semaphores in the control processors. The ALU may
`continue to access the main memory and any of the 32-byte
`buffers not involved in I/O operations; this allows I/O to occur
`simultaneously with normal pixel computations.
`
`Image-Composition Port. The Image-Composition Port consists
`of 8 left pins and 8 right pins per EMC. The left pins are
`connected to the right pins of the corresponding EMC on the
`adjacent board, forming a 256-bit wide daisy-chained point-to-
`point connection along the midplane. These pins operate at
`200 MHz (double the system clock rate), with simultaneous bi-
`directional data flow (each pin has an input data stream, and a
`simultaneous output data stream). The Image-Composition
`Network consists of two pathways superimposed onto this bi-
`directional interconnect: on the left-to-right pathway, each PE
`synchronously receives pixel data from the board to the left,
`combines this data with the data in the 32-byte left-to-right
`transfer buffer, and forwards the result to the board to the right;
`similarly, the right-to-left pathway combines data from right to
`left, using the right-to-left transfer buffer. The two pathways can
`be formed into a loop on a set of adjacent boards; in this way,
`large systems can be configured as multiple small systems, each
`with its own independent Image-Composition Network.
`
`The Image-Composition Network operates on one screen region
`of pixel data at a time. Its primary function is the real-time
`compositing operation required to combine the partial images
`from the multiple renderers. The basic composite operation is a z-
`compare (up to 8 bytes) between the incoming pixel data and the
`pixel data in the local transfer buffer; the composited pixel (or
`sample), with the smaller z value, is forwarded. More generally,
`the network is used for rapidly moving pixel data, including
`writing data back into the transfer buffer. For each region
`transfer, a compositor mode is specified for each direction; the
`forwarded pixel is (1) the composited pixel, (2) the incoming
`pixel, or (3) the local pixel, and the pixel written back into the
`transfer buffer is (1) nothing, (2) the incoming pixel, or (3) the
`composited pixel. Thus, there are 9 modes; the four used in the
`basic rendering algorithm are shown in Figure 7.
`
[Diagram: four compositor modes, each shown acting on a transfer
buffer: Composite local pixels with upstream pixels; Load upstream
pixels into memory; Unload local pixels downstream; Forward
upstream pixels downstream.]
`
`Figure 7: Compositor modes.
`
Composite mode is used by renderer boards as regions are
composited together to form a final pre-shading image. Load
mode is used to deposit this composited image into a shader.
Unload mode is used to dump final shaded pixels out of the
shader (to be received by the frame buffer using load mode).
Forward mode allows data to pass through any boards not
participating in a given transfer operation.
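
The mode space can be written down as two independent three-way choices. The sketch below enumerates the nine combinations and models one compositor step. The mapping of the four named modes onto (forward, write-back) pairs is our reading of the descriptions above, not something the text spells out; in particular, which pixel Load mode forwards is an assumption. Pixels are modeled as (z, data) tuples, and a tie in z keeps the local pixel, which is also an assumption.

```python
# Illustrative sketch only: the enum names and the mode mapping are assumptions.
from enum import Enum
from itertools import product


class Forward(Enum):
    COMPOSITED = 1
    INCOMING = 2
    LOCAL = 3


class WriteBack(Enum):
    NOTHING = 1
    INCOMING = 2
    COMPOSITED = 3


ALL_MODES = list(product(Forward, WriteBack))  # 3 x 3 combinations


def z_composite(incoming, local):
    """Keep the pixel with the smaller z value."""
    return incoming if incoming[0] < local[0] else local


def step(incoming, local, forward, write_back):
    """Return (pixel forwarded downstream, new transfer-buffer contents)."""
    comp = z_composite(incoming, local)
    out = {Forward.COMPOSITED: comp,
           Forward.INCOMING: incoming,
           Forward.LOCAL: local}[forward]
    kept = {WriteBack.NOTHING: local,
            WriteBack.INCOMING: incoming,
            WriteBack.COMPOSITED: comp}[write_back]
    return out, kept


# Assumed mapping of the four named modes onto the two choices.
COMPOSITE = (Forward.COMPOSITED, WriteBack.NOTHING)
LOAD = (Forward.INCOMING, WriteBack.INCOMING)
UNLOAD = (Forward.LOCAL, WriteBack.NOTHING)
FORWARD = (Forward.INCOMING, WriteBack.NOTHING)

upstream, here = (1.0, "up"), (2.0, "loc")
loaded = step(upstream, here, *LOAD)
```

Treating the forward and write-back choices separately makes it easy to see why there are nine modes, and that the four named ones are points in the same small table.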
`
Local Port. The Local Port consists of 4 bi-directional pins per
EMC. Data in the 32-byte Local Port output buffer is output
`nibble-serially on these pins. The input data stream from the pins
`is written into the Local Port input buffer. The Local Port is
`connected to the texture/video subsystem. Typically, the output
`buffer is loaded with texture-memory addresses; these are output
`to the texture/video subsystem, which looks up the texels in
`texture maps, and returns texel data to the input buffer.
`
`The local input port and local output port operate independently,
`although they share the same communications substrate. Each
`port can access all PEs, or a subset of the PEs defined by loading
`a memory-mapped mark register. A content-dependent decoder
`gives the local port access to only the marked PEs. This
`substantially reduces texture-lookup time when only a subset of
`pixels in a region needs texturing.
`
`3.3 Texture / Video Subsystem
`The texture/video subsystem consists of 8 texture-datapath ASICs
`(TASICs) and 64 to 256 Mbytes of SDRAM memory. The
TASIC chips provide the interface between the Local Ports of the
EMCs and texture/image memory; they transfer addresses
computed in the PE array to the SDRAMs and transfer
texture-lookup data back to the PE array. The SDRAM memory is used
`as a texture store on shader boards and as a frame store on frame-
`buffer boards. To provide sufficient bandwidth for Mip-map
`texture lookups, texture memory is replicated on 4 separate
`modules. Each module consists of 8 EMCs, one copy of the
`texture memory (16 to 64 Mbytes), and 2 TASIC chips.
`
`Each copy of the texture memory is divided into eight banks. The
`texture memory is designed to simultaneously read eight texels
`when each of the eight texels comes from a different bank.
`Prefiltered (Mip-map) texture maps can be interleaved across the
`banks so that the eight texels required for one pixel are stored in
`the eight separate banks.
`
`To read from texture memory, the participating PEs each write 8
`addresses into their Local Port output buffers. The texture read
`operation takes this set of eight addresses from each PE in turn,
`applying the addresses to the eight banks of memory and
`returning eight 4-byte results to the PE’s Local Port input buffer.
`
The time required for the texture read operation is 0.9 μsec +
0.64 μsec × the number of PEs participating in the worst-case
`EMC (that is, the EMC with the most PEs marked). For full-
`screen texture operations, all 32 EMCs will have all 256 PEs
`marked, so the time is 165 μsec. Pixels are interleaved across the
`EMCs so that the pixels of a small screen area will be evenly
`distributed across the EMCs. A 30% speedup is available by
`replicating textures within each module, halving the effective
`texture store.
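
The timing rule above is easy to evaluate for different marking patterns. The function below simply restates the quoted formula; its name and the list-of-counts input format are assumptions made for illustration.

```python
# Restates the quoted timing formula; names and input format are ours.

def texture_read_time_us(marked_pes_per_emc):
    """Texture read time in microseconds.

    marked_pes_per_emc: number of marked PEs on each EMC. The EMC with
    the most marked PEs sets the time for the whole operation.
    """
    worst_case = max(marked_pes_per_emc)
    return 0.9 + 0.64 * worst_case


# Full-screen texturing: all 256 PEs marked on all 32 EMCs.
full = texture_read_time_us([256] * 32)

# Sparse texturing: at most 16 marked PEs on any one EMC.
sparse = texture_read_time_us([16] * 8 + [4] * 24)

print(f"full region: {full:.2f} usec, sparse: {sparse:.2f} usec")
```

The full-region case comes to about 165 μsec, matching the figure in the text, and the sparse case shows why spreading a small screen area evenly across the EMCs keeps lookups short: only the busiest EMC matters.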
`
`Texture memory writes proceed similarly to reads, except that the
`texture memory addresses can either come from the PEs or can be
`generated locally on the TASICs.
`
`Generalized table-lookup operations are supported, allowing
`functions such as bump mapping, environment mapping, and
`image warping. The shader can be loaded with an image, from
`which it computes a Mip-map that can then be loaded into texture
`memory.
`
`Inter-TASIC Ring. For texture reads, each module needs
`independent access to its local copy of texture data; for texture
`writes, each module needs write access to all four copies of the
`texture data. The Inter-TASIC Ring provides each module’s PEs
`with read and write access to the texture memory on all four
`modules. This enables a 1-to-4 write mode for efficiently writing
`texture data to all four modules at once.
`
`can provide a stream of up to 400 million 32-bit pixels per second
`from the texture/image memory store. This pixel rate supports
`very-high resolution displays. Alternatively, the TASIC video
`port can support up to 8 independent video channels, so multi-
`channel (stereo, etc.) frame buffers are possible as well.
`
`The Inter-TASIC Ring also connects to the GeNIe chips, allowing
`the rasterizer to send and receive texel or pixe