John Eyles*, Steven Molnar*, John Poulton†, Trey Greer*, Anselmo Lastra†, Nick England†, Lee Westover*

* Hewlett-Packard Company, Chapel Hill Graphics Lab

† Department of Computer Science, University of North Carolina
`
ABSTRACT

PixelFlow is an architecture for high-speed, highly realistic image generation, based on the techniques of object-parallelism and image composition. Its initial architecture was described in [MOLN92]. After development by the original team of researchers at the University of North Carolina, and co-development with industry partners, Division Ltd. and Hewlett-Packard, PixelFlow now is a much more capable system than initially conceived, and its hardware and software systems have evolved considerably. This paper describes the final realization of PixelFlow, along with hardware and software enhancements heretofore unpublished.
`
CR Categories and Subject Descriptors: C.5.4 [Computer System Implementation]: VLSI Systems; I.3.1 [Computer Graphics]: Hardware Architecture; I.3.3 [Computer Graphics]: Picture/Image Generation; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism.

Additional Key Words and Phrases: object-parallel, rendering, compositing, deferred shading, scalable.
`
1 INTRODUCTION
`
PixelFlow is an architecture for high-speed image generation that was designed to be linearly scalable to unprecedented levels of performance and to implement realistic rendering techniques such as user-programmable shading, texturing, antialiasing, and shadows. Achieving these goals required a new architecture that is substantially different from the interleaved screen-subdivision approach that is nearly universal in today's commercial graphics architectures (e.g. [E&S96; SGI97]).
`
Author addresses:

* 1512 East Franklin St., Suite 200, Chapel Hill, NC 27514
{jge,molnar,greer,lee}@chapelhill.hp.com

† Sitterson Hall C.B. #3175, Chapel Hill, NC 27599-3175
{jp,lastra,nick}@cs.unc.edu
`
`
PixelFlow uses an object-parallel approach called image composition to achieve its high speed. Display primitives are distributed over an array of identical renderers, each of which computes a full-screen image of its fraction of the primitives. A dedicated, high-speed communication network called the Image-Composition Network merges these images in real time, based on visibility information, to produce an image of the entire scene [MOLN92].
`
The PixelFlow architecture is extremely flexible, allowing configurations from deskside systems, drawing tens of millions of triangles per second, to multiple-rack systems, drawing hundreds of millions of triangles per second. Near-linear performance increases are obtained by adding renderers.
`
1.1 System Overview

A PixelFlow system consists of one or more chassis, each containing up to 9 Flow Units (the PixelFlow name for renderer). Each Flow Unit consists of: a Geometry Processor Board (GP), a conventional floating-point microprocessor with DRAM memory; and a Rasterizer Board (RB), a SIMD array of 8,192 byte-serial processing elements, each with 384 bytes of local memory.
`
A Flow Unit is a powerful graphics engine in itself, capable of rendering up to 3 million antialiased polygons per second and performing complex shading calculations (such as bump mapping, shadows, and user-programmable shading) in real time. Geometry Processor Boards provide the front-end floating-point computation needed for transforming primitives and generating rendering instructions for the Rasterizer Boards. Rasterizer Boards turn screen-space descriptions of primitives into pixel values and perform sophisticated shading calculations.
`
The Image-Composition Network is implemented as a daisy-chained connection between Rasterizer Boards of neighboring Flow Units. A second communication network, the Geometry Network, is a packet-routing network which connects Geometry Processor Boards.
`
Any subset of Flow Units can be provided with I/O or video adapter daughter-cards that provide host-interface or video (frame buffer or frame grabber) capabilities. It is possible to build very large systems with multiple host interfaces attached to a parallel host, and multiple displays to support multiple-user applications. Such a system also can be re-configured by software into several smaller systems. Figure 1 shows a typical two-chassis PixelFlow system configuration.
`
`
`
`
Figure 1: Typical PixelFlow System. (Two PixelFlow chassis, with up to 9 Flow Units per chassis, connected by the Geometry Network and the Image-Composition Network.)
`
1.2 System Operation

Individual Flow Units can be designated, by software, as one of three types:

• Renderers (not to be confused with the general term renderer above) process a portion of the database to generate regions of pixel data ready for shading. The Geometry Processor Board transforms primitives to screen-space and sorts them into bins according to screen region. The Rasterizer Board rasterizes primitives one region at a time. After all renderers have processed a given region, the region is composited across the Image-Composition Network and the composited pixel values are deposited onto one or more shaders.

• Shaders apply texture and lighting models to regions of raw pixel data, producing RGB color values that are forwarded to the frame buffer.

• Frame buffers send or receive video data via an attached video adapter card.
To compute a frame, the GPs on each renderer first transform their fraction of the primitives into screen coordinates and sort them into bins corresponding to regions of the screen. The renderers then process the regions one at a time, rasterizing all of the primitives that affect the current region before moving on to the next.

Once a given region has been rasterized on all of the renderers, the composition network merges the pixel data together and loads the region of composited pixel data onto a shader. Regions are assigned to shaders in round-robin fashion, with each shader processing every nth region. Shaders operate on entire regions in parallel, to convert raw pixel attributes into final RGB values, blend multiple samples together for antialiasing, and forward final color values to the frame buffer for display.
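To make this per-frame flow concrete, the control loop on a single renderer can be sketched in C as follows. All names and counts here are illustrative assumptions; they are not part of any published PixelFlow interface.

#include <stddef.h>

#define NUM_REGIONS 64 /* placeholder count of screen regions */
#define NUM_SHADERS  4 /* placeholder count of shader boards  */

typedef struct Primitive Primitive;          /* opaque display primitive */
typedef struct { int x0, y0, w, h; } Region; /* one screen region        */

extern Region region_bounds(int region_index);
extern void transform_and_bin(Primitive *prims, size_t n,
                              Primitive **bins);           /* GP front end */
extern void rasterize_region(const Region *r, Primitive *bin);
extern void composite_region(const Region *r, int shader); /* network op  */
extern void shade_and_forward(const Region *r, int shader);

/* One frame on one renderer: transform and bin the local primitives,
 * then rasterize, composite, and shade one region at a time. */
void render_frame(Primitive *local_prims, size_t n)
{
    Primitive *bins[NUM_REGIONS];
    transform_and_bin(local_prims, n, bins);

    for (int r = 0; r < NUM_REGIONS; r++) {
        Region region = region_bounds(r);
        rasterize_region(&region, bins[r]);
        /* Regions go to shaders round-robin: each shader
         * processes every NUM_SHADERS-th region. */
        int shader = r % NUM_SHADERS;
        composite_region(&region, shader);
        shade_and_forward(&region, shader);
    }
}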
`
1.3 Design Evolution

The PixelFlow architecture has evolved considerably since its initial conception, described in [MOLN92]. PixelFlow was initially developed at the University of North Carolina at Chapel Hill as an NSF- and DARPA-sponsored research project. In 1994, Division Ltd. of Bristol, UK acquired commercial rights to the technology and established a laboratory in Chapel Hill to complete development of the project. In mid-1996 this laboratory and rights to the PixelFlow technology were acquired by Hewlett-Packard. The final design is significantly faster, more complex, and technically more aggressive than originally conceived.
`
The following sections describe the final PixelFlow architecture in more detail. Special attention will be given to aspects of the architecture that have not been described before.
`
2 ARCHITECTURAL FEATURES

PixelFlow was designed to demonstrate the advantages of image-composition architectures, to provide a research platform for real-time 3D graphics algorithms and applications, and to provide workstation graphics capability with unprecedented levels of performance and realism. In this section we describe its major architectural features and the rationale under which they were chosen.
`
2.1 Image Composition Architecture

PixelFlow's most characteristic feature is that it is an image-composition architecture. Image composition is an object-parallel rendering approach in which the primitives in the scene are distributed over a parallel array of renderers, each of which is responsible for generating a full-screen image of its fraction of the primitives (Figure 2). To compute a frame, each renderer computes a full-screen image of its fraction of the primitives. It then feeds color and visibility information for its pixels into a local compositor (C), which also receives a stream of pixels from the compositors (and renderers) upstream. The compositor selects the visible pixel from its two input ports and forwards it to the compositors downstream. The compositors together form an Image-Composition Network, the output of which contains the pixels of the final image. The Image-Composition Network can be built as a pipeline, as shown in Figure 2, or as a binary tree; PixelFlow uses a pipeline, since it is easier to implement and the additional latency is negligible.
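The compositor's per-sample decision is a simple depth comparison. A minimal C sketch, assuming a conventional smaller-z-wins compare and an illustrative payload size:

/* Model of one compositor: select the visible sample from the
 * upstream stream and the local renderer's stream, and forward the
 * winner downstream. Field sizes are our assumptions. */
typedef struct {
    unsigned char attrs[16]; /* color or shading attributes */
    unsigned int  z;         /* depth value                 */
} Sample;

static Sample composite(Sample upstream, Sample local)
{
    return (local.z < upstream.z) ? local : upstream; /* smaller z wins */
}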
`
The bandwidth in every link of the Image-Composition Network is identical; it is determined by frame rate, screen size, and the amount of data per pixel or sample, and is independent of the number of polygons in the scene. Thus the network (and system) can be extended to incorporate an arbitrary number of renderers into the system. This gives the architecture its unique and most important property: linear scalability to arbitrarily high levels of performance.
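A back-of-envelope calculation illustrates this independence; the bytes-per-sample figure below is an assumption for illustration, not a PixelFlow specification:

#include <stdio.h>

/* Link bandwidth depends only on screen size, samples per pixel,
 * data per sample, and frame rate; polygon count never appears. */
int main(void)
{
    double pixels  = 1280.0 * 1024.0; /* screen size               */
    double samples = 4.0;             /* samples per pixel         */
    double bytes   = 16.0;            /* data per sample (assumed) */
    double fps     = 60.0;            /* frame rate                */

    double link = pixels * samples * bytes * fps;
    printf("per-link bandwidth: %.2f Gbytes/sec\n", link / 1e9);
    return 0;
}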
`
`
Figure 2: Object-parallel rendering by image composition.
`
`
2.2 Supersampling Antialiasing with Z-buffer Visibility

PixelFlow performs antialiasing by supersampling. Each renderer computes its full-screen image with 4, 8, or more samples per pixel. Samples are computed in parallel at jittered, subpixel locations. Up to eight samples per pixel are computed simultaneously, each with independent colors (or shading attributes) and z values. The compositors perform a simple z comparison for each sample, forwarding the appropriate sample downstream. After composition, the samples are blended together to form the final image.

Supersampling was chosen because it is general, the compositor hardware is simple (therefore fast), and the number of samples can be varied to trade speed for image quality. It has two disadvantages: first, the composition network must support the worst-case bandwidth (that of every sample within a pixel hitting a different surface), and this bandwidth is large; second, rendering transparent surfaces requires screen-door or multipass algorithms [MAMM89].
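For concreteness, a box-filter resolve of the composited samples might look like the following in software. The uniform filter is an illustrative choice; PixelFlow's actual antialiasing kernel is defined by the subpixel address pattern described in Section 3.2.

/* Average the composited samples of one pixel into a display value. */
typedef struct { float r, g, b; } Color;

static Color resolve_pixel(const Color *samples, int n /* 4 or 8 */)
{
    Color acc = { 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < n; i++) {
        acc.r += samples[i].r;
        acc.g += samples[i].g;
        acc.b += samples[i].b;
    }
    acc.r /= n;  acc.g /= n;  acc.b /= n;
    return acc;
}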
`
2.3 Logic-Enhanced Memory Rasterizer

The PixelFlow Rasterizer Board uses the logic-enhanced memory approach used in our earlier Pixel-Planes designs [EYLE88, FUCH89], in which a large SIMD array of pixel processors is implemented using custom VLSI chips that integrate simple processing elements with high-speed, low-latency image memory. This approach eliminates the traditional bandwidth bottleneck between rasterizer and image memory, permitting more sophisticated rasterization algorithms and higher polygon rates [POUL92, TORB96]. PixelFlow's enhanced memory chips have byte-wide ALUs and operate at 100 MHz, enabling a 3 million triangle-per-second rasterizer to be built on a single circuit board. Building custom rasterizer chips also allowed a low-cost implementation of the Image-Composition Network (see below).
`
`
2.4 Region-Based Rendering

The logic-enhanced memory approach has one big disadvantage: it is not feasible (today) to implement enough image memory on custom chips to provide a full-screen image. This means that a full-screen image must be generated in multiple steps.

PixelFlow renderers operate by sequentially processing small regions of the screen. The region size is determined by the number of samples per pixel and ranges from 32x32 to 64x128 pixels. After each renderer rasterizes a given region, the renderers scan out that region's rasterized pixels over the Image-Composition Network in synchrony with the other renderers. This compositing of regions is the "heartbeat" of the system.

Before rasterization can begin, the GP must sort primitives into bins corresponding to the screen regions. This extra step requires memory on the GP and adds latency. Also, some primitives, particularly those of large screen extent, fall into more than one region, increasing the effective polygon count by a factor equal to the average number of regions per primitive. This number can range from 1.3 to 1.7 for typical datasets [MOLN94]. Region-based rendering algorithms also may suffer from load imbalances when primitives clump into regions; a particular danger is that primitives may clump into different regions on different renderers, potentially starving the compositing network. PixelFlow mitigates these problems by providing buffering in the logic-enhanced memory rasterizer for several regions of pixel data.

2.5 Deferred Shading

PixelFlow uses deferred shading, an approach that reduces the calculations required for complex shading models by factoring them out of the rasterization step [DEER88; ELLS91]. PixelFlow rasterizers do not compute pixel colors directly; instead, they compute geometric and intrinsic pixel attributes, such as surface-normal vectors and surface color. These attributes, not pixel colors, are composited. The composited pixels (or samples), containing these shading attributes, are deposited onto designated renderer boards called shaders. The shaders look up texture values for the pixels and compute final pixel color values, based on surface normal, light sources, etc. Shading information is shared among subpixel samples that hit the same surface, up to a maximum of three surfaces per pixel. For ultimate-quality rendering, every subpixel sample can be shaded independently. After shading, regions of shaded pixels are forwarded to a frame buffer for display.

The advantage of this approach is that a bounded number of shading calculations are performed per pixel, no matter what the depth complexity of the scene is or how many renderers are in the system. Thus, shading performance is decoupled from rasterization performance: the number of shaders required is determined only by the resolution of the image, the number of surfaces shaded per pixel, the complexity of the shading model, and the frame rate.

PixelFlow's SIMD rasterizer is an ideal processor for deferred shading. Shading calculations can be performed for many pixels simultaneously; if all pixels are shaded with the same algorithm, the SIMD rasterizer achieves near 100% processor utilization. This allows up to 800 billion byte-operations per second of shading performance on a single board. The rasterizer's texture subsystem, using commodity SDRAM texture memory, supports texturing, environment mapping, shadows, and so forth [MOLN95].
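As a much-simplified software model of the deferred-shading pass, the loop below converts composited attributes to colors. The attribute layout and the single-light Lambertian model are our assumptions for illustration; PixelFlow shaders are user-programmable.

/* Shade n composited samples with one directional light (lx,ly,lz).
 * On PixelFlow this loop runs in parallel, one sample per SIMD PE. */
typedef struct { float nx, ny, nz, r, g, b; } Attrs; /* assumed layout */
typedef struct { unsigned char r, g, b; } RGB8;

static float clamp01(float x) { return x < 0 ? 0 : (x > 1 ? 1 : x); }

void shade_region(const Attrs *in, RGB8 *out, int n,
                  float lx, float ly, float lz)
{
    for (int i = 0; i < n; i++) {
        float d = clamp01(in[i].nx * lx + in[i].ny * ly + in[i].nz * lz);
        out[i].r = (unsigned char)(255.0f * clamp01(in[i].r * d));
        out[i].g = (unsigned char)(255.0f * clamp01(in[i].g * d));
        out[i].b = (unsigned char)(255.0f * clamp01(in[i].b * d));
    }
}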
`
Deferred shading requires higher bandwidth in the Image-Composition Network, since pixel shading attributes require more data than the three to four bytes required for RGB color values. Transparent surfaces, including texture-modulated transparency, are handled by sending transparent polygons to shaders and accumulating transparent layers using Mammen's algorithm [MAMM89]. The performance impact is determined by the number of transparent primitives and the number of transparency layers.
`
2.6 Ultra-Fast Image-Composition Network

To support high-resolution displays, fast frame rates, and deferred shading, the Image-Composition Network must provide enormous bandwidth: tens of Gbytes/second. We accomplish this in a cost-effective way by integrating the compositors onto the logic-enhanced memory chips that implement the SIMD rasterizer array. The network is formed by daisy-chaining connections between the logic-enhanced memories on neighboring boards. Hence the network consists entirely of point-to-point communication between identical custom chips on neighboring boards, so state-of-the-art techniques for high-speed, low-power interconnect can be employed to provide the necessary bandwidth.

The Image-Composition Network on PixelFlow is 256 wires wide and runs at 200 MHz, with data traveling in both directions on the same wire at the same time. The total bandwidth, therefore, is 2 x 200 MHz x 256 wires = 100 Gbit/second. This is sufficient to render 1280x1024-pixel images with sophisticated shading and 4 samples per pixel at greater than 60 frames per second.
`
2.7 System Configurability

Because renderers, shaders, and frame buffer boards all share the same underlying hardware (with the exception of I/O and video daughter-cards), applications can tune the number of renderers and shaders to achieve optimum speed, based on the number of primitives and the complexity of the shading. Also, a large machine can be partitioned into smaller machines to support multiple users, or used as a single, large machine when ultimate performance is desired.
`
3 HARDWARE COMPONENTS

PixelFlow is a modular graphics system, composed of one or more chassis, each containing up to 9 Flow Units (the maximum configuration is 256 Flow Units, or more than 28 chassis!). Figure 3 shows a one-chassis PixelFlow configuration.
`
The system is built around a horizontal midplane. Geometry Processor Boards plug into the underside of the midplane. Rasterizer Boards plug into the top of the midplane. The midplane contains the daisy-chain wiring for the Geometry and Image-Composition Networks, as well as clock and power distribution. Figure 4 shows the components and interconnections of a Flow Unit.
`
`
Figure 3: One-chassis PixelFlow system.
`
3.1 Geometry Processor

The Geometry Processor (GP) is a fast floating-point processor that may be configured with one or two CPUs and up to 256 Mbytes of memory.

CPUs. The CPUs are Hewlett-Packard PA-RISC PA-8000 modules. In a dual-processor GP, the two processors are cache coherent. The PA-8000 runs at 180 MHz, issuing a peak of two floating-point multiply-accumulates and two integer ops per cycle. The processor modules include large instruction and data caches.
`
Memory. GP memory consists of 64 to 512 Mbytes of SDRAM, serving both as main memory for the GP and as a large FIFO queue for buffering commands for the rasterizer.
`
RHInO. A custom ASIC, the RHInO (Runway Host and I/O), connects the processors with memory, the Geometry Network, and the Rasterizer Board. Its primary function is to service memory requests from the two processors and its various I/O ports. It also contains two DMA engines, one for transmitting rendering commands from SDRAM memory to the Rasterizer Board, and one for sending and receiving data from the Geometry Network.
`
Geometry Network. The Geometry Network is a high-speed packet-routing network that connects the GPs to each other. This is particularly useful for connecting the host to Flow Units that do not include an I/O daughter-card. It is implemented using a bit-slice pair of Geometry Network Interface (GeNIe) ASICs; they physically reside on the Rasterizer Board.

The GeNIe provides three ports onto the Geometry Network for each Flow Unit. One port goes to the GP itself (via the RHInO); one port goes to the optional I/O adapter; a third goes to the Inter-TASIC Ring on the Rasterizer Board for loading textures and reading frame-buffer contents. Each port supports I/O traffic of up to 240 Mbytes/second. The Geometry Network supports broadcasts to groups of receivers. The overall Geometry Network bandwidth is 800 Mbytes/sec in each direction. Non-overlapping transfers may occur simultaneously.
`
`
`
`
Figure 4: PixelFlow Flow Unit with CPUs and IO/Video adapter. (The figure shows the Geometry Processor Board, with two PA-8000 CPUs, each with 2 Mbytes of cache, on the Runway bus at 768 Mbytes/sec; the Rasterizer Board, with the GeNIe and the EMC/TASIC rasterizer array; and the I/O or Video Daughter Card. Labeled link rates include rasterizing/shading instructions at 400 Mbytes/sec, the Geometry Network at 800 Mbytes/sec in each direction, the Image-Composition Network at 6.4 Gbytes/sec in each direction, and the video interface at 400 Mpixels/sec in or out.)
`
3.2 SIMD Pixel Processor Array

The heart of the Rasterizer Board is a SIMD array of 8,192 processing elements (PEs). This array is mapped to screen regions of different sizes, depending on the number of samples per pixel, as follows:
`
Samples per Pixel    Region Size (pixels)
        1                 128x64
        4                  32x64
        8                  32x32
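The table reflects a fixed budget of one PE per sample: region width x region height x samples per pixel always equals 8,192. A small helper (ours, for illustration) makes the invariant explicit:

#include <assert.h>

typedef struct { int w, h; } RegionSize;

static RegionSize region_for_samples(int spp) /* spp = 1, 4, or 8 */
{
    RegionSize r = (spp == 1) ? (RegionSize){128, 64}
                 : (spp == 4) ? (RegionSize){ 32, 64}
                 :              (RegionSize){ 32, 32};
    assert(r.w * r.h * spp == 8192); /* one PE per sample */
    return r;
}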
`
The PE array is divided into four modules, each tightly coupled to a texture/video subsystem.

The SIMD array and texture/video subsystem operate under the control of a pair of Image-Generation Controller chips (IGCs), which perform cycle-by-cycle sequencing of the SIMD array and provide data for the EMCs' linear expression evaluator.

The PE array is implemented on 32 logic-enhanced memory chips (EMCs), each containing 256 PEs. Figure 5 shows a block diagram of an EMC.
`
Each PE consists of an arithmetic/logical unit (ALU) and 384 bytes of local memory. This includes 256 bytes of main memory, and four 32-byte partitions associated with two I/O ports, the Local Port and the Image-Composition Port. A linear expression evaluator computes values of the bilinear expression Ax+By+C in parallel for every PE; the pair (x,y) is the address of each PE on a subpixel-resolution grid, and A, B, and C are user-specified as part of the SIMD instruction stream. The ALU performs arithmetic and logical operations on local memory and on the local value of the bilinear expression.
`
`
`
`
Figure 5: Block diagram of Enhanced Memory Chip. (The figure shows the instruction, address, and data inputs; the linear expression evaluator with its A, B, C inputs; the PE array with 256 bytes of main pixel memory per PE; the pixel-shift path linking neighboring PEs of a panel; the Local Port; and the left/right image-composition connections to the corresponding EMCs on neighboring boards.)
`
`
Figure 6 shows a functional diagram of one PE. The major components are described in the following sections.
`
ALU. The ALU implements an 8-bit add and a full range of bitwise logical functions. There are three 8-bit registers: the R, S, and M registers. The R and S registers can be loaded with the core result. The R register can be fed back to the core, and either register can be written to memory. The M register is loaded with a byte read from memory; it also can be loaded with the R or S register value. The R and S registers can be combined into a single 16-bit accumulator, to accelerate multiplies. A carry register is provided for multi-byte computations. Each PE includes an enable register. PEs may be disabled, by clearing this register, on the basis of computation results; memory writes do not occur at PEs that are disabled.
`
Linear Expression Evaluator. The linear expression evaluator operates byte-serially to provide each processor with one byte of the bilinear expression on every clock cycle; this can be thought of as an immediate operand. The result for each set of coefficients generally must be preceded by two guard bytes, since A and B are multiplied by 14-bit numbers.

The PEs are assigned x,y addresses on a subpixel grid with resolution of 1/8th pixel. The PEs are grouped into sets of 1, 4, or 8, each group corresponding to a pixel. The PEs in each group are assigned x,y subpixel addresses in a 2-pixel-wide box about the pixel center; the pattern of subpixel addresses is the same for each group, and this pattern defines the antialiasing kernel.
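In software terms, the evaluator hands every PE the value of the bilinear expression at that PE's own subpixel address. A sketch (the function is ours; the 1/8th-pixel grid and 14-bit addresses are from the text):

/* Value of Ax+By+C at PE address (x8, y8), where x8 and y8 are
 * integer coordinates on the 1/8th-pixel grid. In hardware this is
 * computed byte-serially for all 256 PEs on an EMC at once; because
 * A and B multiply 14-bit addresses, results carry two guard bytes. */
static long eval_expression(long A, long B, long C, int x8, int y8)
{
    return A * x8 + B * y8 + C;
}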
`
Inter-PE Communication. The PEs on each EMC are connected by a shift path that allows each ALU to use the R register of either of its neighbors as an operand. When antialiasing, the 4 or 8 samples for a single pixel are mapped to contiguous PEs; the shift path is used to combine these samples into a single PE, where they can be filtered into an aggregate display value. This is usually done on a shader, after composition.
`
Local Memory. An 8-bit-wide memory data bus connects the M, R, and S registers to the 384 bytes of local memory; a byte of data may be read from or written to memory on each clock cycle.
`
Figure 6: Functional diagram of Processing Element. (The figure shows the ALU and its registers; data and addresses out to, and data in from, the TASICs via the Local Port; the pixel-shift connections to the previous and next PEs of the panel; and the left-to-right and right-to-left image-composition connections to the corresponding PE of the next rasterizer.)
`
The 384 bytes of local memory for each PE are arranged as:

• 256 bytes of "main" memory
• 32 bytes of Local Port input buffer
• 32 bytes of Local Port output buffer
• 32 bytes of Image-Composition Network left-to-right transfer buffer
• 32 bytes of Image-Composition Network right-to-left transfer buffer
`
The four 32-byte partitions of memory are used for I/O operations, using the two communication ports described below. These partitions are part of the same address space as the 256 bytes of main memory, and all 384 bytes can be accessed by the ALU. While communication port operations are in progress, the ALU cannot access these addresses; this lockout is accomplished using semaphores in the control processors. The ALU may continue to access the main memory and any of the 32-byte buffers not involved in I/O operations; this allows I/O to occur simultaneously with normal pixel computations.
`
Image-Composition Port. The Image-Composition Port consists of 8 left pins and 8 right pins per EMC. The left pins are connected to the right pins of the corresponding EMC on the adjacent board, forming a 256-bit-wide daisy-chained point-to-point connection along the midplane. These pins operate at 200 MHz (double the system clock rate), with simultaneous bi-directional data flow (each pin has an input data stream, and a simultaneous output data stream). The Image-Composition Network consists of two pathways superimposed onto this bi-directional interconnect: on the left-to-right pathway, each PE synchronously receives pixel data from the board to the left, combines this data with the data in the 32-byte left-to-right transfer buffer, and forwards the result to the board to the right; similarly, the right-to-left pathway combines data from right to left, using the right-to-left transfer buffer. The two pathways can be formed into a loop on a set of adjacent boards; in this way, large systems can be configured as multiple small systems, each with its own independent Image-Composition Network.
`
The Image-Composition Network operates on one screen region of pixel data at a time. Its primary function is the real-time compositing operation required to combine the partial images from the multiple renderers. The basic composite operation is a z-compare (up to 8 bytes) between the incoming pixel data and the pixel data in the local transfer buffer; the composited pixel (or sample), with the smaller z value, is forwarded. More generally, the network is used for rapidly moving pixel data, including writing data back into the transfer buffer. For each region transfer, a compositor mode is specified for each direction; the forwarded pixel is (1) the composited pixel, (2) the incoming pixel, or (3) the local pixel, and the pixel written back into the transfer buffer is (1) nothing, (2) the incoming pixel, or (3) the composited pixel. Thus, there are 9 modes; the four used in the basic rendering algorithm are shown in Figure 7.
`
Figure 7: Compositor modes. (Composite: combine local pixels with upstream pixels. Load: load upstream pixels into memory. Unload: send local pixels downstream. Forward: pass upstream pixels downstream.)
`
Composite mode is used by renderer boards as regions are composited together to form a final pre-shading image. Load mode is used to deposit this composited image into a shader. Unload mode is used to dump final shaded pixels out of the shader (to be received by the frame buffer using load mode). Forward mode allows data to pass through any boards not participating in a given transfer operation.
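A compositor mode can thus be modeled as two independent selections, one for the forwarded pixel and one for the write-back (3 x 3 = 9 modes). The sketch below is our reading of Figure 7, not published mnemonics; the field marked (?) is left open by the text.

typedef struct { unsigned char data[8]; unsigned int z; } Pixel;

typedef enum { FWD_COMPOSITED, FWD_INCOMING, FWD_LOCAL } Forward;
typedef enum { WB_NOTHING, WB_INCOMING, WB_COMPOSITED } WriteBack;
typedef struct { Forward fwd; WriteBack wb; } Mode;

static const Mode MODE_COMPOSITE = { FWD_COMPOSITED, WB_NOTHING  };
static const Mode MODE_LOAD      = { FWD_INCOMING,   WB_INCOMING }; /* fwd (?) */
static const Mode MODE_UNLOAD    = { FWD_LOCAL,      WB_NOTHING  };
static const Mode MODE_FORWARD   = { FWD_INCOMING,   WB_NOTHING  };

/* One pathway step: z-compare (smaller z wins), then apply the
 * mode's forward and write-back selections. */
static Pixel step(Mode m, Pixel in, Pixel *local)
{
    Pixel loc  = *local;
    Pixel comp = (loc.z < in.z) ? loc : in;
    if (m.wb == WB_INCOMING)   *local = in;
    if (m.wb == WB_COMPOSITED) *local = comp;
    return (m.fwd == FWD_COMPOSITED) ? comp
         : (m.fwd == FWD_LOCAL)      ? loc : in;
}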
`
`
Local Port. The Local Port consists of 4 bi-directional pins per EMC. Data in the 32-byte Local Port output buffer is output nibble-serially on these pins. The input data stream from the pins is written into the Local Port input buffer. The Local Port is connected to the texture/video subsystem. Typically, the output buffer is loaded with texture-memory addresses; these are output to the texture/video subsystem, which looks up the texels in texture maps and returns texel data to the input buffer.

The local input port and local output port operate independently, although they share the same communications substrate. Each port can access all PEs, or a subset of the PEs defined by loading a memory-mapped mark register. A content-dependent decoder gives the local port access to only the marked PEs. This substantially reduces texture-lookup time when only a subset of pixels in a region needs texturing.
`
3.3 Texture/Video Subsystem

The texture/video subsystem consists of 8 texture-datapath ASICs (TASICs) and 64 to 256 Mbytes of SDRAM memory. The TASIC chips provide the interface between the Local Ports of the EMCs and texture/image memory; they transfer addresses computed in the PE array to the SDRAMs and transfer texture-lookup data back to the PE array. The SDRAM memory is used as a texture store on shader boards and as a frame store on frame-buffer boards. To provide sufficient bandwidth for Mip-map texture lookups, texture memory is replicated on 4 separate modules. Each module consists of 8 EMCs, one copy of the texture memory (16 to 64 Mbytes), and 2 TASIC chips.
`
Each copy of the texture memory is divided into eight banks. The texture memory is designed to simultaneously read eight texels when each of the eight texels comes from a different bank. Prefiltered (Mip-map) texture maps can be interleaved across the banks so that the eight texels required for one pixel are stored in the eight separate banks.
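One plausible interleaving (ours; the paper does not give the exact mapping) assigns the eight texels of a trilinear Mip-map lookup, a 2x2 patch on each of two adjacent levels, to eight distinct banks:

/* With this mapping, the 2x2 texel patches at (x..x+1, y..y+1) on
 * levels L and L+1 always occupy eight different banks, so all
 * eight texels can be read simultaneously. */
static int texel_bank(int x, int y, int level)
{
    return (x & 1) | ((y & 1) << 1) | ((level & 1) << 2); /* 0..7 */
}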
`
To read from texture memory, the participating PEs each write 8 addresses into their Local Port output buffers. The texture-read operation takes this set of eight addresses from each PE in turn, applying the addresses to the eight banks of memory and returning eight 4-byte results to the PE's Local Port input buffer.
`
The time required for the texture-read operation is 0.9 µsec + 0.64 µsec x the number of PEs participating in the worst-case EMC (that is, the EMC with the most PEs marked). For full-screen texture operations, all 32 EMCs will have all 256 PEs marked, so the time is 165 µsec. Pixels are interleaved across the EMCs so that the pixels of a small screen area will be evenly distributed across the EMCs. A 30% speedup is available by replicating textures within each module, halving the effective texture store.
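The timing model can be checked directly (constants from the text; the function itself is ours):

/* Texture-read time: 0.9 usec fixed cost plus 0.64 usec per
 * participating PE in the most heavily marked EMC. */
static double texture_read_usec(int max_marked_pes_per_emc)
{
    return 0.9 + 0.64 * max_marked_pes_per_emc;
}
/* texture_read_usec(256) = 164.74, matching the ~165 usec quoted
 * above for a full-screen operation with all 256 PEs marked. */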
`
Texture-memory writes proceed similarly to reads, except that the texture-memory addresses can either come from the PEs or be generated locally on the TASICs.
`
Generalized table-lookup operations are supported, allowing functions such as bump mapping, environment mapping, and image warping. The shader can be loaded with an image, from which it computes a Mip-map that can then be loaded into texture memory.
`
Inter-TASIC Ring. For texture reads, each module needs independent access to its local copy of texture data; for texture
`
`
`
`
`