PixelFlow: The Realization

John Eyles*, Steven Molnar*, John Poulton†, Trey Greer*, Anselmo Lastra†, Nick England†, Lee Westover*

* Hewlett-Packard Company
Chapel Hill Graphics Lab

† Department of Computer Science
University of North Carolina
ABSTRACT

PixelFlow is an architecture for high-speed, highly realistic image generation, based on the techniques of object-parallelism and image composition. Its initial architecture was described in [MOLN92]. After development by the original team of researchers at the University of North Carolina, and co-development with industry partners, Division Ltd. and Hewlett-Packard, PixelFlow now is a much more capable system than initially conceived and its hardware and software systems have evolved considerably. This paper describes the final realization of PixelFlow, along with hardware and software enhancements heretofore unpublished.
CR Categories and Subject Descriptors: C.5.4 [Computer System Implementation]: VLSI Systems; I.3.1 [Computer Graphics]: Hardware Architecture; I.3.3 [Computer Graphics]: Picture/Image Generation; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism.

Additional Key Words and Phrases: object-parallel, rendering, compositing, deferred shading, scalable.
1 INTRODUCTION
PixelFlow is an architecture for high-speed image generation that was designed to be linearly scalable to unprecedented levels of performance and to implement realistic rendering techniques such as user-programmable shading, texturing, antialiasing, and shadows. Achieving these goals required a new architecture that was substantially different from the interleaved screen-subdivision approach that is nearly universal in today's commercial graphics architectures (e.g. [E&S96; SGI97]).
Author addresses:

* 1512 East Franklin St., Suite 200, Chapel Hill, NC 27514
{jge,molnar,greer,lee}@chapelhill.hp.com

† Sitterson Hall, C.B. #3175, Chapel Hill, NC 27599-3175
{jp,lastra,nick}@cs.unc.edu
PixelFlow uses an object-parallel approach called image-composition to achieve its high speed. Display primitives are distributed over an array of identical renderers, each of which computes a full-screen image of its fraction of the primitives. A dedicated, high-speed communication network called the Image-Composition Network merges these images in real time, based on visibility information, to produce an image of the entire scene [MOLN92].
The PixelFlow architecture is extremely flexible, allowing configurations from deskside systems, drawing tens of millions of triangles per second, to multiple-rack systems, drawing hundreds of millions of triangles per second. Near-linear performance increases are obtained by adding renderers.
1.1 System Overview

A PixelFlow system consists of one or more chassis, each containing up to 9 Flow Units (the PixelFlow name for renderer). Each Flow Unit consists of: a Geometry Processor Board (GP), a conventional floating-point microprocessor with DRAM memory; and a Rasterizer Board (RB), a SIMD array of 8,192 byte-serial processing elements, each with 384 bytes of local memory.
A Flow Unit is a powerful graphics engine in itself, capable of rendering up to 3 million antialiased polygons per second and performing complex shading calculations (such as bump mapping, shadows, and user-programmable shading) in real time. Geometry Processor Boards provide the front-end floating-point computation needed for transforming primitives and generating rendering instructions for the Rasterizer Boards. Rasterizer Boards turn screen-space descriptions of primitives into pixel values and perform sophisticated shading calculations.
The Image-Composition Network is implemented as a daisy-chained connection between Rasterizer Boards of neighboring Flow Units. A second communication network, the Geometry Network, is a packet-routing network that connects the Geometry Processor Boards.
Any subset of Flow Units can be provided with I/O or video adapter daughter-cards that provide host-interface or video (frame buffer or frame grabber) capabilities. It is possible to build very large systems with multiple host interfaces attached to a parallel host, and multiple displays to support multiple-user applications. Such a system also can be re-configured by software into several smaller systems. Figure 1 shows a typical two-chassis PixelFlow system configuration.
Figure 1: Typical PixelFlow System.
1.2 System Operation

Individual Flow Units can be designated, by software, as one of three types:

• Renderers (not to be confused with the general term renderer above) process a portion of the database to generate regions of pixel data ready for shading. The Geometry Processor Board transforms primitives to screen-space and sorts them into bins according to screen region. The Rasterizer Board rasterizes primitives one region at a time. After all renderers have processed a given region, the region is composited across the Image-Composition Network and the composited pixel values are deposited onto one or more shaders.

• Shaders apply texture and lighting models to regions of raw pixel data, producing RGB color values that are forwarded to the frame buffer.

• Frame buffers send or receive video data via an attached video adapter card.

To compute a frame, the GPs on each renderer first transform their fraction of the primitives into screen coordinates and sort them into bins corresponding to regions of the screen. The renderers then process the regions one at a time, rasterizing all of the primitives that affect the current region before moving on to the next.

Once a given region has been rasterized on all of the renderers, the composition network merges the pixel data together and loads the region of composited pixel data onto a shader. Regions are assigned to shaders in round-robin fashion, with each shader processing every nth region. Shaders operate on entire regions in parallel, to convert raw pixel attributes into final RGB values, blend multiple samples together for antialiasing, and forward final color values to the frame buffer for display.
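This region-at-a-time flow can be summarized in a few lines of C. The sketch below is purely illustrative: the names, counts, and function boundaries are ours, not the PixelFlow API, and the stages run here sequentially where the hardware pipelines them.

/* Illustrative sketch of the per-frame region pipeline (not PixelFlow code). */
#include <stdio.h>

enum { NUM_RENDERERS = 4, NUM_SHADERS = 2, NUM_REGIONS = 8 };

static void rasterize(int renderer, int region)
{ (void)renderer; (void)region; /* binned primitives -> pixel attributes */ }

static void composite(int region)
{ (void)region; /* z-merge the renderers' partial images over the network */ }

static void shade(int shader, int region)
{ printf("region %d -> shader %d\n", region, shader); }

int main(void)
{
    for (int r = 0; r < NUM_REGIONS; r++) {
        for (int i = 0; i < NUM_RENDERERS; i++)
            rasterize(i, r);            /* every renderer rasterizes region r */
        composite(r);                   /* merge partial images by visibility */
        shade(r % NUM_SHADERS, r);      /* round-robin: every nth region      */
    }
    return 0;
}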
1.3 Design Evolution

The PixelFlow architecture has evolved considerably since its initial conception, described in [MOLN92]. PixelFlow was initially developed at the University of North Carolina at Chapel Hill as an NSF- and DARPA-sponsored research project. In 1994, Division Ltd. of Bristol, UK acquired commercial rights to the technology and established a laboratory in Chapel Hill to complete development of the project. In mid-1996 this laboratory and rights to the PixelFlow technology were acquired by Hewlett-Packard. The final design is significantly faster, more complex, and technically more aggressive than originally conceived.
The following sections describe the final PixelFlow architecture in more detail. Special attention will be given to aspects of the architecture that have not been described before.
2 ARCHITECTURAL FEATURES

PixelFlow was designed to demonstrate the advantages of image-composition architectures, to provide a research platform for real-time 3D graphics algorithms and applications, and to provide workstation graphics capability with unprecedented levels of performance and realism. In this section we describe its major architectural features and the rationale under which they were chosen.
2.1 Image Composition Architecture

PixelFlow's most characteristic feature is that it is an image-composition architecture. Image-composition is an object-parallel rendering approach in which the primitives in the scene are distributed over a parallel array of renderers, each of which is responsible for generating a full-screen image of its fraction of the primitives (Figure 2). To compute a frame, each renderer computes a full-screen image of its fraction of the primitives. It then feeds color and visibility information for its pixels into a local compositor (C), which also receives a stream of pixels from the compositors (and renderers) upstream. The compositor selects the visible pixel from its two input ports and forwards it to the compositors downstream. The compositors together form an Image-Composition Network, the output of which contains the pixels of the final image. The Image-Composition Network can be built as a pipeline, as shown in Figure 2, or as a binary tree; PixelFlow uses a pipeline, since it is easier to implement and the additional latency is negligible.

Figure 2: Object-parallel rendering by image composition.

The bandwidth in every link of the Image-Composition Network is identical; it is determined by frame rate, screen size, and the amount of data per pixel or sample, and is independent of the number of polygons in the scene. Thus the network (and system) can be extended to incorporate an arbitrary number of renderers into the system. This gives the architecture its unique and most important property: linear scalability to arbitrarily high levels of performance.
2.2 Supersampling Antialiasing with Z-Buffer Visibility

PixelFlow performs antialiasing by supersampling. Each renderer computes its full-screen image with 4, 8, or more samples per pixel. Samples are computed in parallel at jittered, subpixel locations. Up to eight samples per pixel are computed simultaneously, each with independent colors (or shading attributes) and z values. The compositors perform a simple z comparison for each sample, forwarding the appropriate sample downstream. After composition, the samples are blended together to form the final image.

Supersampling was chosen because it is general, the compositor hardware is simple (therefore fast), and the number of samples can be varied to trade speed for image quality. It has two disadvantages: first, the composition network must support the worst-case bandwidth (that of every sample within a pixel hitting a different surface), and this bandwidth is large; second, rendering transparent surfaces requires screen-door or multipass algorithms [MAMM89].
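A software model of this scheme stores a fixed jitter pattern per pixel group and box-filters the composited samples into one display value. The offsets below are invented for the example; PixelFlow's actual kernel is loaded per group, as described in Section 3.2.

#include <stdint.h>

enum { SAMPLES = 4 };

/* Illustrative jittered offsets on the 1/8-pixel subpixel grid. */
static const int jitter_x[SAMPLES] = { 1, 5, 3, 7 };
static const int jitter_y[SAMPLES] = { 6, 2, 7, 3 };

typedef struct { uint8_t r, g, b; uint32_t z; } SampleRGBZ;

/* Subpixel position of sample i of pixel (px,py), in 1/8-pixel units. */
void sample_pos(int px, int py, int i, int *sx, int *sy)
{
    *sx = px * 8 + jitter_x[i];
    *sy = py * 8 + jitter_y[i];
}

/* After composition: blend a pixel's samples into one color (box filter). */
void resolve(const SampleRGBZ s[SAMPLES], uint8_t out[3])
{
    unsigned r = 0, g = 0, b = 0;
    for (int i = 0; i < SAMPLES; i++) { r += s[i].r; g += s[i].g; b += s[i].b; }
    out[0] = (uint8_t)(r / SAMPLES);
    out[1] = (uint8_t)(g / SAMPLES);
    out[2] = (uint8_t)(b / SAMPLES);
}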
2.3 Logic-Enhanced Memory Rasterizer

The PixelFlow Rasterizer Board uses the logic-enhanced memory approach used in our earlier Pixel-Planes designs [EYLE88, FUCH89], in which a large SIMD array of pixel processors is implemented using custom VLSI chips that integrate simple processing elements with high-speed, low-latency image memory. This approach eliminates the traditional bandwidth bottleneck between rasterizer and image memory, permitting more sophisticated rasterization algorithms and higher polygon rates [POUL92, TORB96]. PixelFlow's enhanced memory chips have byte-wide ALUs and operate at 100 MHz, enabling a 3 million triangle-per-second rasterizer to be built on a single circuit board. Building custom rasterizer chips also allowed a low-cost implementation of the Image-Composition Network (see below).

2.4 Region-Based Rendering

The logic-enhanced memory approach has one big disadvantage: it is not feasible (today) to implement enough image memory on custom chips to provide a full-screen image. This means that a full-screen image must be generated in multiple steps.

PixelFlow renderers operate by sequentially processing small regions of the screen. The region size is determined by the number of samples per pixel and ranges from 32x32 to 64x128 pixels. After each renderer rasterizes a given region, the renderers scan out that region's rasterized pixels over the Image-Composition Network in synchrony with the other renderers. This compositing of regions is the "heartbeat" of the system.

Before rasterization can begin, the GP must sort primitives into bins corresponding to the screen regions. This extra step requires memory on the GP and adds latency. Also, some primitives, particularly those of large screen extent, fall into more than one region, increasing the effective polygon count by a factor equal to the average number of regions per primitive. This number can range from 1.3 to 1.7 for typical datasets [MOLN94]. Region-based rendering algorithms also may suffer from load imbalances when primitives clump into regions; a particular danger is that primitives may clump into different regions on different renderers, potentially starving the compositing network. PixelFlow mitigates these problems by providing buffering in the logic-enhanced memory rasterizer for several regions of pixel data.
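The GP's binning step amounts to mapping each primitive's screen bounding box to the regions it touches; a primitive that spans several regions enters several bins, which is the source of the 1.3 to 1.7 overlap factor cited above. A minimal sketch (region dimensions here match the 4-sample case; types and names are ours):

enum { REGION_W = 32, REGION_H = 64 };

typedef struct { int x0, y0, x1, y1; } BBox;   /* screen-space bounds */

/* Call visit(rx, ry) for every region the primitive's bbox overlaps. */
void bin_primitive(BBox b, void (*visit)(int rx, int ry))
{
    int rx0 = b.x0 / REGION_W, rx1 = b.x1 / REGION_W;
    int ry0 = b.y0 / REGION_H, ry1 = b.y1 / REGION_H;
    for (int ry = ry0; ry <= ry1; ry++)
        for (int rx = rx0; rx <= rx1; rx++)
            visit(rx, ry);                     /* append primitive to bin */
}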
2.5 Deferred Shading

PixelFlow uses deferred shading, an approach that reduces the calculations required for complex shading models by factoring them out of the rasterization step [DEER88; ELLS91]. PixelFlow rasterizers do not compute pixel colors directly; instead, they compute geometric and intrinsic pixel attributes, such as surface-normal vectors and surface color. These attributes, not pixel colors, are composited. The composited pixels (or samples), containing these shading attributes, are deposited onto designated renderer boards called shaders. The shaders look up texture values for the pixels and compute final pixel color values, based on surface normal, light sources, etc. Shading information is shared among subpixel samples that hit the same surface, up to a maximum of three surfaces per pixel. For ultimate-quality rendering, every subpixel sample can be shaded independently. After shading, regions of shaded pixels are forwarded to a frame buffer for display.

The advantage of this approach is that a bounded number of shading calculations are performed per pixel, no matter what the depth complexity of the scene is or how many renderers are in the system. Thus, shading performance is decoupled from rasterization performance: the number of shaders required is determined only by the resolution of the image, the number of surfaces shaded per pixel, the complexity of the shading model, and the frame rate.

PixelFlow's SIMD rasterizer is an ideal processor for deferred shading. Shading calculations can be performed for many pixels simultaneously; if all pixels are shaded with the same algorithm, the SIMD rasterizer achieves near 100% processor utilization. This allows up to 800 billion byte-operations per second of shading performance on a single board. The rasterizer's texture subsystem, using commodity SDRAM texture memory, supports texturing, environment mapping, shadows, and so forth [MOLN95].
Deferred shading requires higher bandwidth in the Image-Composition Network, since pixel shading attributes require more data than the three to four bytes required for RGB color values. Transparent surfaces, including texture-modulated transparency, are handled by sending transparent polygons to shaders and accumulating transparent layers using Mammen's algorithm [MAMM89]. The performance impact is determined by the number of transparent primitives and the number of transparency layers.
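In software terms, a shader runs one uniform pass of the following flavor over each composited region; because every pixel executes the same code, the pass maps directly onto the SIMD array. The attribute layout and the single diffuse light below are illustrative, not PixelFlow's shading model.

#include <stdint.h>

typedef struct {
    float   nx, ny, nz;   /* composited surface normal (unit length) */
    uint8_t r, g, b;      /* composited intrinsic surface color      */
} PixelAttr;

/* Deferred pass: attributes in, final colors out, one unit light (lx,ly,lz). */
void shade_region(const PixelAttr *in, uint8_t (*out)[3], int n,
                  float lx, float ly, float lz)
{
    for (int i = 0; i < n; i++) {
        float d = in[i].nx * lx + in[i].ny * ly + in[i].nz * lz;
        if (d < 0.0f) d = 0.0f;               /* clamp backfacing term */
        out[i][0] = (uint8_t)(in[i].r * d);
        out[i][1] = (uint8_t)(in[i].g * d);
        out[i][2] = (uint8_t)(in[i].b * d);
    }
}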
2.6 Ultra-Fast Image-Composition Network

To support high-resolution displays, fast frame rates, and deferred shading, the Image-Composition Network must provide enormous bandwidth: tens of Gbytes/second. We accomplish this in a cost-effective way by integrating the compositors onto the logic-enhanced memory chips that implement the SIMD rasterizer array. The network is formed by daisy-chaining connections between the logic-enhanced memories on neighboring boards. Hence the network consists entirely of point-to-point communication between identical custom chips on neighboring boards, so state-of-the-art techniques for high-speed, low-power interconnect can be employed to provide the necessary bandwidth.

The Image-Composition Network on PixelFlow is 256 wires wide and runs at 200 MHz, with data traveling in both directions on the same wire at the same time. The total bandwidth, therefore, is 2 × 200 MHz × 256 wires = 100 Gbit/second. This is sufficient to render 1280x1024-pixel images with sophisticated shading and 4 samples per pixel at greater than 60 frames per second.
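The arithmetic behind these figures is easy to recheck. In the sketch below the 40-byte sample payload is our assumption for a deferred-shading pixel, not a published figure.

#include <stdio.h>

int main(void)
{
    double bits_per_sec  = 2.0 * 200e6 * 256;     /* ~102.4 Gbit/s total */
    double bytes_per_sec = bits_per_sec / 8.0;    /* ~12.8 Gbytes/s      */
    double samples       = 1280.0 * 1024 * 4;     /* 4 samples per pixel */
    double frame_bytes   = samples * 40.0;        /* assumed 40 B/sample */
    printf("network: %.1f Gbit/s\n", bits_per_sec / 1e9);
    printf("frame rate: %.1f Hz\n", bytes_per_sec / frame_bytes);  /* ~61 */
    return 0;
}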
2.7 System Configurability

Because renderers, shaders, and frame buffer boards all share the same underlying hardware (with the exception of I/O and video daughter-cards), applications can tune the number of renderers and shaders to achieve optimum speed, based on the number of primitives and the complexity of the shading. Also, a large machine can be partitioned into smaller machines to support multiple users, or used as a single, large machine when ultimate performance is desired.
3 HARDWARE COMPONENTS

PixelFlow is a modular graphics system, composed of one or more chassis, each containing up to 9 Flow Units (the maximum configuration is 256 Flow Units, or more than 28 chassis!). Figure 3 shows a one-chassis PixelFlow configuration.

The system is built around a horizontal midplane. Geometry Processor Boards plug into the underside of the midplane. Rasterizer Boards plug into the top of the midplane. The midplane contains the daisy-chain wiring for the Geometry and Image-Composition Networks, as well as clock and power distribution. Figure 4 shows the components and interconnections of a Flow Unit.
Figure 3: One-chassis PixelFlow system.
3.1 Geometry Processor

The Geometry Processor (GP) is a fast floating-point processor that may be configured with one or two CPUs and up to 256 Mbytes of memory.

CPUs. The CPUs are Hewlett-Packard PA-RISC PA-8000 processor modules. In a dual-processor GP, the two processors are cache coherent. The PA-8000 runs at 180 MHz, issuing a peak of two floating-point multiply-accumulates and two integer ops per cycle. The processor modules include large instruction and data caches.

Memory. GP memory consists of 64 to 512 Mbytes of SDRAM memory, serving both as main memory for the GP and as a large FIFO queue for buffering commands for the rasterizer.

RHInO. A custom ASIC, the RHInO (Runway Host and I/O), connects the processors with memory, the Geometry Network, and the Rasterizer Board. Its primary function is to service memory requests from the two processors and its various I/O ports. It also contains two DMA engines, one for transmitting rendering commands from SDRAM memory to the Rasterizer Board, and one for sending and receiving data from the Geometry Network.

Geometry Network. The Geometry Network is a high-speed packet-routing network that connects the GPs to each other. This is particularly useful for connecting the host to Flow Units that do not include an I/O daughter-card. It is implemented using a bit-slice pair of Geometry Network Interface (GeNIe) ASICs; they physically reside on the Rasterizer Board.

The GeNIe provides three ports onto the Geometry Network for each Flow Unit. One port goes to the GP itself (via the RHInO); one port goes to the optional I/O adapter; a third goes to the Inter-TASIC Ring on the Rasterizer Board for loading textures and reading frame-buffer contents. Each port supports I/O traffic of up to 240 Mbytes/second. The Geometry Network supports broadcasts to groups of receivers. The overall Geometry Network bandwidth is 800 Mbytes/sec in each direction. Non-overlapping transfers may occur simultaneously.
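The SDRAM command FIFO can be pictured as an ordinary ring buffer with the GP as producer and the RHInO's DMA engine as consumer. The sketch below is a generic single-producer ring, not the RHInO's actual mechanism.

#include <stdint.h>

enum { FIFO_BYTES = 1 << 20 };   /* power of two, so wrap math is exact */

typedef struct {
    uint8_t  buf[FIFO_BYTES];
    uint32_t head;               /* advanced by the GP (producer)   */
    uint32_t tail;               /* advanced by the DMA engine      */
} CmdFifo;

/* GP side: append one rendering command; returns 0 if the FIFO is full. */
int fifo_push(CmdFifo *f, const void *cmd, uint32_t len)
{
    if (FIFO_BYTES - (f->head - f->tail) < len)
        return 0;                /* rasterizer has fallen behind */
    for (uint32_t i = 0; i < len; i++)
        f->buf[(f->head + i) % FIFO_BYTES] = ((const uint8_t *)cmd)[i];
    f->head += len;
    return 1;
}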
Figure 4: PixelFlow Flow Unit with CPUs and I/O/Video adapter. (Block diagram: two PA-8000 CPUs with 2 Mbytes of cache each on the Runway Bus; the GeNIe and EMC array on the Rasterizer Board; the Geometry Network at 800 Mbytes/sec in each direction; the Image-Composition Network at 6.4 Gbytes/sec in each direction; a video interface at 400 Mpixels/sec in or out; and an I/O or Video Daughter Card.)
3.2 SIMD Pixel Processor Array

The heart of the Rasterizer Board is a SIMD array of 8,192 processing elements (PEs). This array is mapped to screen regions of different sizes, depending on the number of samples per pixel, as follows:

Samples per Pixel    Region Size (pixels)
        1                 128x64
        4                 32x64
        8                 32x32
The PE array is divided into four modules, each tightly coupled to a texture/video subsystem.

The SIMD array and texture/video subsystem operate under the control of a pair of Image-Generation Controller chips (IGCs), which perform cycle-by-cycle sequencing of the SIMD array and provide data for the EMCs' linear expression evaluator.

The PE array is implemented on 32 logic-enhanced memory chips (EMCs), each containing 256 PEs. Figure 5 shows a block diagram of an EMC.
Each PE consists of an arithmetic/logical unit (ALU) and 384 bytes of local memory. This includes 256 bytes of main memory, and four 32-byte partitions associated with two I/O ports, the Local Port and the Image-Composition Port. A linear expression evaluator computes values of the bilinear expression Ax+By+C in parallel for every PE; the pair (x,y) is the address of each PE on a subpixel-resolution grid, and A, B, and C are user-specified as part of the SIMD instruction stream. The ALU performs arithmetic and logical operations on local memory and on the local value of the bilinear expression.
Figure 5: Block diagram of Enhanced Memory Chip.
Figure 6 shows a functional diagram of one PE. The major components are described in the following sections.
ALU. The ALU implements an 8-bit add and a full range of bitwise logical functions. There are three 8-bit registers: the R, S, and M registers. The R and S registers can be loaded with the core result. The R register can be fed back to the core, and either register can be written to memory. The M register is loaded with a byte read from memory; it also can be loaded with the R or S register value. The R and S registers can be combined into a single 16-bit accumulator, to accelerate multiplies. A carry register is provided for multi-byte computations. Each PE includes an enable register. PEs may be disabled, by clearing this register, on the basis of computation results; memory writes do not occur at PEs that are disabled.
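The enable register's effect can be modeled for a single PE as below: a scalar sketch of a byte-serial multi-byte add, with illustrative types. Only the memory write is suppressed at disabled PEs.

#include <stdint.h>

typedef struct {
    uint8_t mem[384];   /* local memory               */
    uint8_t carry;      /* carry register             */
    uint8_t enable;     /* cleared to disable this PE */
} PE;

/* dst <- dst + src over 'bytes' bytes, least-significant byte first. */
void pe_add_multibyte(PE *pe, int dst, int src, int bytes)
{
    pe->carry = 0;
    for (int i = 0; i < bytes; i++) {
        unsigned sum = pe->mem[dst + i] + pe->mem[src + i] + pe->carry;
        pe->carry = (uint8_t)(sum >> 8);
        if (pe->enable)                    /* disabled PEs skip the write */
            pe->mem[dst + i] = (uint8_t)sum;
    }
}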
Linear Expression Evaluator. The linear expression evaluator operates byte-serially to provide each processor with one byte of the bilinear expression on every clock cycle; this can be thought of as an immediate operand. The result for each set of coefficients generally must be preceded by two guard bytes, since A and B are multiplied by 14-bit numbers.

The PEs are assigned x,y addresses on a subpixel grid with resolution of 1/8th pixel. The PEs are grouped into sets of 1, 4, or 8, each group corresponding to a pixel. The PEs in each group are assigned x,y subpixel addresses in a 2-pixel-wide box about the pixel center; the pattern of subpixel addresses is the same for each group, and this pattern defines the antialiasing kernel.
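In effect, the hardware evaluates Ax+By+C at every PE's subpixel address in one pass; evaluating a triangle's three edge expressions this way yields an inside/outside test at every sample simultaneously. A software model (array sizes and types are illustrative):

#include <stdint.h>

enum { NUM_PE = 256 };                 /* PEs per EMC */

/* Subpixel addresses assigned to the PEs, in 1/8-pixel units. */
static int pe_x[NUM_PE], pe_y[NUM_PE];

/* Broadcast coefficients A, B, C; each PE receives its own value. */
void eval_linear(int32_t A, int32_t B, int32_t C, int32_t out[NUM_PE])
{
    for (int i = 0; i < NUM_PE; i++)
        out[i] = A * pe_x[i] + B * pe_y[i] + C;   /* parallel in hardware */
}

/* A sample lies inside a triangle if all three edge expressions >= 0. */
int inside(int32_t e0, int32_t e1, int32_t e2)
{
    return e0 >= 0 && e1 >= 0 && e2 >= 0;
}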
Inter-PE Communication. The PEs on each EMC are connected by a shift path that allows each ALU to use the R register of either of its neighbors as an operand. When antialiasing, the 4 or 8 samples for a single pixel are mapped to contiguous PEs; the shift path is used to combine these samples into a single PE, where they can be filtered into an aggregate display value. This is usually done on a shader, after composition.
Figure 6: Functional diagram of Processing Element.
Local Memory. An 8-bit-wide memory data bus connects the M, R, and S registers to the 384 bytes of local memory; a byte of data may be read from or written to memory on each clock cycle. The 384 bytes of local memory for each PE are arranged as:

• 256 bytes of "main" memory
• 32 bytes of Local Port input buffer
• 32 bytes of Local Port output buffer
• 32 bytes of Image-Composition Network left-to-right transfer buffer
• 32 bytes of Image-Composition Network right-to-left transfer buffer

The four 32-byte partitions of memory are used for I/O operations, using the two communication ports described below. These partitions are part of the same address space as the 256 bytes of main memory, and all 384 bytes can be accessed by the ALU. While communication port operations are in progress, the ALU cannot access these addresses; this lockout is accomplished using semaphores in the control processors. The ALU may continue to access the main memory and any of the 32-byte buffers not involved in I/O operations; this allows I/O to occur simultaneously with normal pixel computations.
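The address map can be written out as constants, as below. The ordering of the four I/O partitions after main memory is our assumption for illustration; only the sizes are given above.

enum {
    PE_MAIN_BASE   = 0,   PE_MAIN_SIZE   = 256, /* "main" memory          */
    PE_LP_IN_BASE  = 256, PE_LP_IN_SIZE  = 32,  /* Local Port input       */
    PE_LP_OUT_BASE = 288, PE_LP_OUT_SIZE = 32,  /* Local Port output      */
    PE_IC_LR_BASE  = 320, PE_IC_LR_SIZE  = 32,  /* left-to-right transfer */
    PE_IC_RL_BASE  = 352, PE_IC_RL_SIZE  = 32,  /* right-to-left transfer */
    PE_MEM_BYTES   = 384                        /* one flat address space */
};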
Image-Composition Port. The Image-Composition Port consists of 8 left pins and 8 right pins per EMC. The left pins are connected to the right pins of the corresponding EMC on the adjacent board, forming a 256-bit-wide daisy-chained point-to-point connection along the midplane. These pins operate at 200 MHz (double the system clock rate), with simultaneous bi-directional data flow (each pin has an input data stream and a simultaneous output data stream). The Image-Composition Network consists of two pathways superimposed onto this bi-directional interconnect: on the left-to-right pathway, each PE synchronously receives pixel data from the board to the left, combines this data with the data in the 32-byte left-to-right transfer buffer, and forwards the result to the board to the right; similarly, the right-to-left pathway combines data from right to left, using the right-to-left transfer buffer. The two pathways can be formed into a loop on a set of adjacent boards; in this way, large systems can be configured as multiple small systems, each with its own independent Image-Composition Network.

The Image-Composition Network operates on one screen region of pixel data at a time. Its primary function is the real-time compositing operation required to combine the partial images from the multiple renderers. The basic composite operation is a z-compare (up to 8 bytes) between the incoming pixel data and the pixel data in the local transfer buffer; the composited pixel (or sample), with the smaller z value, is forwarded. More generally, the network is used for rapidly moving pixel data, including writing data back into the transfer buffer. For each region transfer, a compositor mode is specified for each direction; the forwarded pixel is (1) the composited pixel, (2) the incoming pixel, or (3) the local pixel, and the pixel written back into the transfer buffer is (1) nothing, (2) the incoming pixel, or (3) the composited pixel. Thus, there are 9 modes; the four used in the basic rendering algorithm are shown in Figure 7.
Figure 7: Compositor modes. (Composite: combine local pixels with upstream pixels; Load: load upstream pixels into memory; Unload: send local pixels downstream; Forward: forward upstream pixels downstream.)
Composite mode is used by renderer boards as regions are composited together to form a final pre-shading image. Load mode is used to deposit this composited image into a shader. Unload mode is used to dump final shaded pixels out of the shader (to be received by the frame buffer using Load mode). Forward mode allows data to pass through any boards not participating in a given transfer operation.
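One way to express the mode table in code is to pair a forward selection with a write-back selection, giving the 3 x 3 = 9 combinations. The mapping of the four named modes below is one plausible reading of Figure 7, not a documented encoding.

typedef enum { FWD_COMPOSITED, FWD_INCOMING, FWD_LOCAL } Forward;
typedef enum { WB_NOTHING, WB_INCOMING, WB_COMPOSITED } WriteBack;

typedef struct { Forward fwd; WriteBack wb; } CompositorMode;

static const CompositorMode MODE_COMPOSITE = { FWD_COMPOSITED, WB_NOTHING  };
static const CompositorMode MODE_LOAD      = { FWD_INCOMING,   WB_INCOMING };
static const CompositorMode MODE_UNLOAD    = { FWD_LOCAL,      WB_NOTHING  };
static const CompositorMode MODE_FORWARD   = { FWD_INCOMING,   WB_NOTHING  };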
Local Port. The Local Port consists of 4 bi-directional pins per EMC. Data in the 32-byte Local Port output buffer is output nibble-serially on these pins. The input data stream from the pins is written into the Local Port input buffer. The Local Port is connected to the texture/video subsystem. Typically, the output buffer is loaded with texture-memory addresses; these are output to the texture/video subsystem, which looks up the texels in texture maps and returns texel data to the input buffer.

The local input port and local output port operate independently, although they share the same communications substrate. Each port can access all PEs, or a subset of the PEs defined by loading a memory-mapped mark register. A content-dependent decoder gives the local port access to only the marked PEs. This substantially reduces texture-lookup time when only a subset of pixels in a region needs texturing.
3.3 Texture/Video Subsystem

The texture/video subsystem consists of 8 texture-datapath ASICs (TASICs) and 64 to 256 Mbytes of SDRAM memory. The TASIC chips provide the interface between the Local Ports of the EMCs and texture/image memory; they transfer addresses computed in the PE array to the SDRAMs and transfer texture-lookup data back to the PE array. The SDRAM memory is used as a texture store on shader boards and as a frame store on frame-buffer boards. To provide sufficient bandwidth for Mip-map texture lookups, texture memory is replicated on 4 separate modules. Each module consists of 8 EMCs, one copy of the texture memory (16 to 64 Mbytes), and 2 TASIC chips.

Each copy of the texture memory is divided into eight banks. The texture memory is designed to simultaneously read eight texels when each of the eight texels comes from a different bank. Prefiltered (Mip-map) texture maps can be interleaved across the banks so that the eight texels required for one pixel are stored in the eight separate banks.
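One interleaving with this property assigns a bank by the parities of u, v, and the Mip level: the 2x2 quad at one level covers all four (u,v) parity combinations, and the adjacent level flips the level parity, so the eight texels of a trilinear fetch land in eight distinct banks. This particular bank function is our assumption for illustration, not the documented TASIC mapping.

/* Bank (0..7) for texel (u,v) at a given Mip level. */
unsigned bank_of(unsigned u, unsigned v, unsigned level)
{
    return (u & 1) | ((v & 1) << 1) | ((level & 1) << 2);
}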
To read from texture memory, the participating PEs each write 8 addresses into their Local Port output buffers. The texture read operation takes this set of eight addresses from each PE in turn, applying the addresses to the eight banks of memory and returning eight 4-byte results to the PE's Local Port input buffer.

The time required for the texture read operation is 0.9 μsec + 0.64 μsec × the number of PEs participating in the worst-case EMC (that is, the EMC with the most PEs marked). For full-screen texture operations, all 32 EMCs will have all 256 PEs marked, so the time is 165 μsec. Pixels are interleaved across the EMCs so that the pixels of a small screen area will be evenly distributed across the EMCs. A 30% speedup is available by replicating textures within each module, halving the effective texture store.

Texture memory writes proceed similarly to reads, except that the texture memory addresses can either come from the PEs or can be generated locally on the TASICs.
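The 165 μsec figure follows directly from the formula:

#include <stdio.h>

int main(void)
{
    double usec = 0.9 + 0.64 * 256;   /* all 256 PEs marked on the worst EMC */
    printf("full-screen texture read: %.0f usec\n", usec);   /* prints 165 */
    return 0;
}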
Generalized table-lookup operations are supported, allowing functions such as bump mapping, environment mapping, and image warping. The shader can be loaded with an image, from which it computes a Mip-map that can then be loaded into texture memory.
Inter-TASIC Ring. For texture reads, each module needs independent access to its local copy of texture data; for texture