processors execute these instructions at once, the lower execution
`times of some of the integer operations make them very attractive.
`delivers it to the shading node that has been assigned to process
`that region.
`Deferred shading provides an additional computational advantage
`on PixelFlow because of the SIMD nature of the pixel processors.
`Consider how a SIMD machine might behave if shading is
`performed during rasterization (immediate shading—Figure 3a).
`For each primitive, the processors representing the pixels within
the primitive are enabled, while all of the others are disabled. The
`subsequent shading computations are performed only for the
`enabled pixels. The processors representing pixels outside of the
`primitive are disabled, so no useful work is performed.
`Since most primitives cover only a small area of the screen, we
`would make very poor use of the processor array. The key to
`making effective use of the SIMD array is to have every processor
`do useful work as much of the time as possible.
`With deferred shading, all of the pixels in a region that require the
`same shader can be shaded at one time, even if they came from
`different primitives. This is especially useful when tessellated
surfaces are used as modeling primitives. These can be rasterized
`as numerous small polygons but shaded as a single unit. In fact,
`disjoint surfaces can be shaded at once if they use the same
`shading function.
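The utilization argument above can be made concrete with a small back-of-the-envelope model. This is only a sketch under stated assumptions: each shading pass is assumed to cost one unit of time on the whole 128x64 array, and the function names and the example region (64 triangles of 128 pixels each) are hypothetical, not taken from the PixelFlow software.

```python
# Sketch: fraction of SIMD processors doing useful work in one region.
# Assumption: every shading pass occupies the whole array for one time unit.

REGION_PIXELS = 128 * 64  # one pixel processor per pixel in a region


def immediate_utilization(primitive_pixel_counts):
    """Immediate shading: one pass per primitive, only covered pixels enabled."""
    passes = len(primitive_pixel_counts)
    useful = sum(primitive_pixel_counts)
    return useful / (passes * REGION_PIXELS)


def deferred_utilization(pixels_per_shader):
    """Deferred shading: one pass per distinct shader, all of its pixels enabled."""
    passes = len(pixels_per_shader)
    useful = sum(pixels_per_shader.values())
    return useful / (passes * REGION_PIXELS)


# Hypothetical region: 64 small triangles of 128 pixels each, one shared shader.
triangles = [128] * 64
print(immediate_utilization(triangles))                   # 0.015625
print(deferred_utilization({"pins": REGION_PIXELS}))      # 1.0
```

In this toy case the same pixels are shaded either way, but immediate shading needs 64 passes with 1/64 of the array enabled, while deferred shading needs one pass with every processor enabled.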
`Factoring out common calculations. We can go even further
`than executing shading functions only once per region. Shading
`functions tend to be fairly similar. Even at a coarse level, most
`shading functions at least execute the same code for the lights in
`the scene even if their other computations differ. All of this
`common code need only be done once for all of the pixels that
`require it. As illustrated in Figure 7, if each shading function is
`executed to the point where it is ready to do lighting computations,
`the lighting computations for all of them can be performed at
`once. The remainder of each shading function can then be
`executed in turn.
// Shader-specific code
for each surface shader
    pre-light shading;
// Common code
for each light source
    accumulate illumination;
// Shader-specific code
for each surface shader
    post-light shading;
`Figure 7: Factoring out common operations for multiple
`shading functions.
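The structure in Figure 7 can be sketched in ordinary sequential code. The following is a minimal illustration, not the PixelFlow implementation: pixels are modeled as dictionaries, each shader is a hypothetical (pre-light, post-light) pair of functions, and the lighting term is reduced to a single scalar multiply for brevity.

```python
def shade_region(pixels, shaders, lights):
    """pixels: list of dicts with a 'shader' id; shaders: id -> (pre, post)."""
    # Shader-specific code: each shader's pre-light part runs on its own pixels.
    for shader_id, (pre_light, _) in shaders.items():
        for p in pixels:
            if p["shader"] == shader_id:
                pre_light(p)
    # Common code: one lighting loop shared by every pixel in the region.
    for light in lights:
        for p in pixels:
            p["illum"] = p.get("illum", 0.0) + light * p["n_dot_l"]
    # Shader-specific code: each shader finishes using the shared illumination.
    for shader_id, (_, post_light) in shaders.items():
        for p in pixels:
            if p["shader"] == shader_id:
                post_light(p)
    return pixels


# Hypothetical shaders for illustration only.
def ball_pre(p):
    p["albedo"], p["n_dot_l"] = 0.5, 1.0


def alley_pre(p):
    p["albedo"], p["n_dot_l"] = 0.25, 0.5


def apply_albedo(p):
    p["color"] = p["albedo"] * p["illum"]


pixels = [{"shader": "ball"}, {"shader": "alley"}]
shaders = {"ball": (ball_pre, apply_albedo), "alley": (alley_pre, apply_albedo)}
result = shade_region(pixels, shaders, lights=[1.0, 1.0])
print([p["color"] for p in result])  # [1.0, 0.25]
```

The lighting loop runs once per light regardless of how many shaders are present, which is the saving the figure describes.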
`Currently, we code this manually, but this is yet another reason to
`have a high-level compiler. A suitably intelligent compiler can
`identify expensive operations (such as lighting and texture
`lookups) among several shading functions and automatically
`schedule them for co-execution.
`Table lookup memory. Each shader node has its own table-
`lookup memory for textures but, since it is not possible to know
`which textures may be needed in the regions assigned to a particu-
`lar node, the table memory of each must contain every texture. For
interactive use, this not only limits the size of the textures to the
maximum that can be stored at one node, but it also presents a
`problem for shadow map and environment map algorithms that
`may generate new textures every frame. After a new map is
`computed, it must be loaded into the table-lookup memories of
`every shader node. This aspect of system performance does not
`scale with the number of nodes:
`a maximum of 100 512x512
`Figure 6: Execution time of integer versus floating-point
Conversion from floating point to 4-byte integer format takes
1.35 µs, and from 4-byte integer to floating point takes 1.57 µs. This
makes it feasible to convert representations to use whichever is
more advantageous. Whenever possible, we use fixed-point or
`integer representations.
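As an illustration of why fixed-point arithmetic maps well onto integer hardware, here is a minimal sketch of a 16.16 fixed-point format. The 16.16 layout and the helper names are assumptions chosen for the example; the text does not say which fixed-point layouts PixelFlow shaders actually use.

```python
FRAC_BITS = 16          # assumed 16.16 layout, for illustration only
ONE = 1 << FRAC_BITS    # fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert a float to 16.16 fixed point (round to nearest)."""
    return int(round(x * ONE))


def to_float(f: int) -> float:
    """Convert 16.16 fixed point back to a float."""
    return f / ONE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two 16.16 values using only an integer multiply and a shift."""
    # The raw product carries 32 fractional bits; shift back down to 16.
    return (a * b) >> FRAC_BITS


a, b = to_fixed(1.5), to_fixed(2.25)
print(to_float(a + b))            # 3.75
print(to_float(fixed_mul(a, b)))  # 3.375
```

Addition is a plain integer add and multiplication is one integer multiply plus a shift, so a shader that stays in this representation uses only the faster integer operations.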
`Memory. Each processor has 256 bytes of local memory and 128
`bytes of communication register that may also be used as local
`memory. Each node can store 16MB of texture information in
`table lookup memory. This memory may be read or written from
`each of the pixel processors, thus serving as global storage.
`3.2 Achieving interactive shading
Each PixelFlow node possesses an enormous amount of
`computational power—over 40 billion integer operations or 2
`billion floating-point operations per second. In addition,
`processors are programmable in a very general way, and we
`believe that the 256 (+128) bytes of local memory at each
`processor is sufficient to implement many interesting shading
algorithms. However, even this amount of computational power is
`not enough to achieve our goal of real-time shading. We must
`harness multiple PixelFlow nodes in an efficient manner to
`multiply the power available for shading.
`PixelFlow rasterizes images using a screen-subdivision approach,
sometimes called a virtual buffer [11]. The screen is divided into
`128x64-pixel regions, and the regions are processed one at a time.
`When the rasterizers have finished with a particular region, they
`send appearance parameters and depth values for each pixel onto
`the image-composition network, where they are merged and
`loaded into a shader.
If there are s shaders, each shader receives one of every s regions.
While it shades the region, it has full use of the local memory at
each pixel processor. With this method of rendering, even a small
machine can support an arbitrarily sized screen. Of course, the more
complex the problem, the more nodes are needed to achieve
interactive performance.
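The region bookkeeping can be sketched as follows. The round-robin policy (region i goes to shader i mod s) is an assumption that matches "one of every s regions"; the text does not say how the actual assignment is made, and the function names are hypothetical.

```python
REGION_W, REGION_H = 128, 64  # region size from the text


def regions(screen_w, screen_h):
    """Return the top-left corner of every region covering the screen."""
    cols = -(-screen_w // REGION_W)  # ceiling division
    rows = -(-screen_h // REGION_H)
    return [(c * REGION_W, r * REGION_H) for r in range(rows) for c in range(cols)]


def assign_round_robin(region_list, num_shaders):
    """Assumed policy: region i is shaded by shader node i mod s."""
    assignment = {s: [] for s in range(num_shaders)}
    for i, region in enumerate(region_list):
        assignment[i % num_shaders].append(region)
    return assignment


regs = regions(640, 512)
plan = assign_round_robin(regs, 12)
print(len(regs))                           # 40
print(max(len(v) for v in plan.values()))  # 4
```

For a 640x512 screen and twelve shaders, no shader handles more than four regions per sample pass under this policy.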
`Deferred shading. As stated in Section 2.3, deferred shading is a
`powerful optimization for scenes of high depth complexity. It has
an even bigger payoff for a SIMD architecture such as PixelFlow.
`We implement deferred shading on a machine-wide basis by
`giving each node a designated function: rasterization or shading.
`The rasterization nodes implement the first loop in Figure 3b,
`while the shading nodes implement the second.
As specified in Figure 3b, the rasterization nodes scan convert the
`geometric primitives in order to generate the necessary appearance
`parameters. Multiple rasterization nodes can work on a single
region of the screen as described by Molnar et al. [12]. The
composition network collects the rasterized pixels for a given
region (including all necessary appearance parameters), and

texture maps can be loaded into table-lookup memory per second
`(2-3 in a 33 ms frame time).
`linked with the auxiliary shading function library and finally with
`the application program.
`Uniform and varying expressions. For efficiency, expressions
`containing only uniform shader variables (those that are constant
`over all of the pixels being shaded) are computed only once on the
`RISC GP. Varying expressions (those that vary across the pixels),
`or those containing a mix of uniform and varying variables, are
`executed on the pixel-processor array.
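The uniform/varying split can be illustrated with a toy specular term. This is a sketch only: the shading formula and the `roughness` parameter are hypothetical stand-ins, and the single Python function plays the role of both the GP (uniform part) and the pixel-processor array (varying part).

```python
def specular_term(pixels_n_dot_h, roughness):
    """Toy specular term showing where each kind of expression is evaluated."""
    # Uniform expression: depends only on a uniform parameter, so it is
    # evaluated once for the whole region (on the GP, in the paper's terms).
    exponent = 1.0 / (roughness * roughness)
    # Varying expression: mixes uniform and varying values, so it is
    # evaluated once per pixel (on the pixel-processor array).
    return [max(n_dot_h, 0.0) ** exponent for n_dot_h in pixels_n_dot_h]


print(specular_term([1.0, 0.5, -0.2], roughness=0.5))  # [1.0, 0.0625, 0.0]
```

Hoisting the exponent out of the per-pixel expression is the kind of work the compiler can move off the SIMD array.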
`Shader parameters. There are two ways to communicate
`parameters to a shader node. One is to send the parameters over
`the composition network. The other is to send the parameters over
`the front-end geometry network. Obviously, a varying parameter
`that must be interpolated over the pixels, such as color or surface
`normal, is produced on a rasterization node, and should be sent
`over the composition network.
`A uniform parameter that is used at the GP and does not vary from
`primitive to primitive should be sent over the geometry network
`because composition network bandwidth is a valuable resource.
`An example is something like the roughness of a surface which is
`a fixed parameter for a particular material. If the parameter is
`needed in the local memory of the pixel processors, it can be
`broadcast locally at a shading node. We allow the programmer to
`choose the best way to transmit each parameter.
`Shader programming model
`Low-level model. Since instructions for the pixel processors are
`generated by the GP on a PixelFlow node, the code that a user
writes is actually C or C++ code that executes on the GP. The
low-level programming model for the pixel processors (called
IGCStream) consists of inline functions in C++ that generate code
`for the SIMD array. Some of these functions generate the basic
`integer operations; others, however, generate sequences of
`instructions to perform higher-level commands, such as floating-
`point arithmetic.
`We have written a library of auxiliary shading functions to use
`with this programming model. It provides basic vector operations,
functions to support procedural texturing [5, 13], basic lighting
functions, image-based texture mapping [14], bump mapping [15],
and higher-level procedures for generating and using reflection
maps [16] and shadow maps [17, 18]. It is perfectly feasible to
`program at this level. In fact, we currently use this programming
`model to write code for testing, and to produce images such as
`those in the example video. We would prefer, however, to work at
`a higher, more abstract level.
High-level model. We are implementing a version of the
`RenderMan shading language that is modified to suit our needs.
`Our goal in using a higher-level language is not solely to provide
`architecture independence. That may be useful to us in the future,
`of course, but since PixelFlow is an architectural prototype it is
`not necessary. We are more interested in the shading language as a
`way to demonstrate feasibility and to provide our users with a
higher-level interface than they've had [19] in order to encourage
wide use of the shading capabilities of our system. Also, as
mentioned earlier in this section, a high-level shading language
provides opportunities for compiler optimization, such as
co-executing portions of several shader functions.
`The RenderMan specification has only float, point, and color
`arithmetic data types. Since we need to be frugal in our use of
floating-point arithmetic, we have added integers and fixed-radix-point
numbers to the data types of our language. A compiler for
`the shading language will accept shader code as input, and emit
`C++ with SIMD processor commands as output. This code will be
`API support. We also need some way for graphics applications to
`access our shading capability. Since one of our main goals for
PixelFlow is interactive visualization of computations as they are
`executing on a supercomputer, we have chosen an immediate-
`mode application programmer's interface (API) similar to OpenGL
[20]. An advantage of choosing OpenGL and extending it to meet
our needs is that students and collaborators are likely to be
familiar with its basic concepts. Also, this will make it easier to
`port software between PixelFlow and other machines.
`The current specification of OpenGL only incorporates the limited
`set of shading models commonly found on current graphics
`workstations: flat and linearly interpolated shading with image-
`based textures. We have extended the specification to allow users
to select arbitrary shaders.
`We do not plan to implement an official, complete OpenGL for
`two reasons. One is that some of the specifications of OpenGL
`conflict with our parallel model of generating graphics. The
`second is that we lack the resources to implement features that we
`do not use. Consequently, though our functions are similar to
`OpenGL, we use a pxgl prefix instead of OpenGL’s gl prefix.
`Within these constraints, we have attempted to stick as closely as
`possible to the OpenGL philosophy. We intend to describe this
`API, and the problems involved in implementing it on PixelFlow,
`in a future publication.
`Limitations. Although the PixelFlow shading architecture
supports most of the techniques common in "photorealistic
rendering" (at least in RenderMan's use of the term), it has a few
limitations. Because PixelFlow uses deferred shading, shaders
`normally do not affect visibility. Special shaders can be defined
`that run at rasterization time to compute opacity values. However,
`these shaders poorly utilize the SIMD array and slow rasterization.
`A second limitation is that shaders cannot affect geometry.
`RenderMan, for example, defines a type of shader called a
`displacement shader, which displaces the actual surface of a
`primitive, rather than simply manipulating its surface-normal
`vector, as is done in bump mapping. This is incompatible with the
`rendering pipeline in PixelFlow, as well as that of virtually all
`other high-performance graphics systems.
In this section, we present a detailed example of real-time
high-quality shading on PixelFlow. The example, bowling pins being
scattered by a bowling ball, was inspired by the well-known
`“Textbook Strike” cover image of the RenderMan Companion [6].
`We cannot guarantee that the dynamics of motion are computable
`in real-time, but we are confident that a modest-sized PixelFlow
`system (less than one card cage) can render the images at 30
`frames per second.
`The accompanying video was rendered on the PixelFlow
`functional simulator. The execution times are estimates based on
`the times of rasterization and shading of regions, using worst-case
`assumptions about overlap. We simulated a PixelFlow machine
`containing three rasterizer nodes, twelve shading nodes, and a
`frame-buffer node. There are 10,700 triangles in the model. The
images were rendered at a resolution of 640x512 pixels with
five-sample-per-pixel antialiasing.

4.1 Shading functions
`Three shading functions are used to render these images, one for
`the bowling pins, one for the alley, and one for the bowling ball.
`Two light sources illuminate the scene, an ambient light and the
`main point-light source which casts shadows in the environment.
Figure 8: Appearance parameters used in bowling

Figure 8 shows the data for each pixel that is sent from a
PixelFlow rasterizer node to a shader node, a total of 34 bytes. We
`actually plan to use 10 bits of color per channel on most PixelFlow
`applications, but 8 bits were used for this simulation. In addition
`to the appearance parameters used by the shaders, two other
`parameters are necessary, the depth and a shader identification
`number for each pixel. The shader ID is used by the shading
`control program to select the shader code for each pixel.
`The bowling ball has a shadow-mapped light source with a Phong
`shader. The alley has a shadow-mapped light source, reflection
`map, mip-mapped wood texture, and a Phong shader. The pins
`have a shadow-mapped light source, procedural crown texture,
mip-mapped label, bump-mapped scuffs, mip-mapped dirt, and
`finally a simple Phong shader. We factor out common lighting
`computations as described in Section 3.2. Each shader is divided
`into three parts, the part before the lighting computation, the
`common lighting computation, and the part after the lighting
`4.2 Multiple-pass rendering
`The shadow and reflection maps are obtained during separate
`rendering passes. When each of these 512x512 images has been
`computed and stored, rendering of the final image begins. In this
`section we describe, in detail, the steps necessary to render and
`store the shadow map and to render the final camera-view image.
`Since computation of the reflection map is similar, we do not
`describe it in detail.
`Shadow map. A shadow map is a set of depth values rendered
`from the point of view of the light source. We use three rasterizer
`nodes to rasterize all the primitives and compute the depth at each
`sample point. Since we do not need to calculate colors or other
`parameters, this is a simple computation. The worst-case time for
this step is approximately 100 µs, although many map regions
`have very few polygons and take less time to rasterize.
`The depth values are then z-composited over the composition
`network, and the resulting depth is sent to all of the shaders.
Composition time is only 5 µs per region. Notice that data transfer
`and computation can proceed simultaneously.
`As mentioned in Section 3.2, storing tables for shadow or
reflection mapping is a point of serialization on our system.
Storing both the shadow and reflection maps takes
almost half the time for each frame. Since the hardware can store
`four values into table memory at one time, we take advantage of
`this intra-node parallelism by storing the depth map in units of
`four regions each. Thus, the shader nodes accept four regions of
`data before storing them.
`The total time to complete the shadow map pass is the time
`consumed by eight table writes, 6.08 ms, plus the time to rasterize
`the first four regions, for a total time of less than 7 ms.
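The figures quoted for the shadow map pass can be cross-checked with a few lines of arithmetic. This is a sketch under stated assumptions: the start-up cost is taken as four regions at the worst-case 100 µs each, and the 5 µs composition time is ignored because it overlaps with computation; the function name is hypothetical.

```python
REGION_PIXELS = 128 * 64  # pixels per region


def shadow_pass_ms(map_size=512, regions_per_write=4,
                   total_write_ms=6.08, raster_us_per_region=100.0):
    """Return (regions, table writes, ms per write, total pass time in ms)."""
    regions = (map_size * map_size) // REGION_PIXELS
    writes = regions // regions_per_write
    # Assumed start-up cost: worst-case rasterization of the first four regions.
    startup_ms = regions_per_write * raster_us_per_region / 1000.0
    return regions, writes, total_write_ms / writes, total_write_ms + startup_ms


regions, writes, ms_per_write, total_ms = shadow_pass_ms()
print(regions, writes)          # 32 8
print(round(ms_per_write, 2))   # 0.76
print(round(total_ms, 2))       # 6.48
```

A 512x512 map is 32 regions, so grouping four regions per write gives the eight table writes mentioned above, and the estimated total stays under the stated 7 ms.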
`Reflection Map. Rasterization for the reflection map can begin as
`soon as enough buffer space is available at the rasterization nodes.
`Shading for the reflection map can begin as soon as the last table
`write for the shadow map has begun. The reflection map can be
`generated and stored in less than 7 ms.
`Final Image. The rendering time for the final image is a function
`of both the rasterization time and the shading time. If the time to
rasterize a region is longer than the time to shade it, the shading
nodes will be idle waiting for appearance parameters from the
rasterizer nodes. The worst-case time will then be the total
rasterization time plus the time to shade the final region. If the
`time to rasterize a region is less than the time to shade it, the
`shading nodes will always have regions waiting to be shaded. We
`will see that for this scene shading is the bottleneck, so the
`rendering time will be the total shading time plus the time to
`rasterize the first few regions (to get the shading nodes started).
`First, consider the rasterization time. With all of the appearance
`parameters, each of the front-facing triangles in the model takes
approximately 0.85 µs to rasterize. One of the busiest frames, with
`all of the pins visible, contains just under 6400 front-facing
`triangles (this includes the additional triangles that have to be
`rendered when triangles cross region boundaries). This total takes
5.4 ms to complete on one rasterizer node. If we also do five-sample
antialiasing, this becomes 27 ms. To achieve our
`performance goal we divide the polygons over 3 rasterizers to
`decrease the time to a little over 9 ms. Details on the use of
multiple rasterizers in PixelFlow can be found in [12].
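The rasterization estimate can be reproduced directly from the numbers in the text. The sketch below assumes the triangles are divided evenly over the rasterizers, which is an idealization; the function name is hypothetical.

```python
def rasterization_ms(triangles=6400, us_per_triangle=0.85,
                     samples=5, rasterizers=3):
    """Return (one-sample time, all-samples time, per-node time) in ms."""
    one_pass = triangles * us_per_triangle / 1000.0
    all_samples = one_pass * samples
    # Assumes perfect load balance across the rasterizer nodes.
    return one_pass, all_samples, all_samples / rasterizers


one_pass, all_samples, per_node = rasterization_ms()
print(f"{one_pass:.2f} {all_samples:.2f} {per_node:.2f}")  # 5.44 27.20 9.07
```

These values match the 5.4 ms, 27 ms, and "a little over 9 ms" quoted above.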
Figure 9: Shading times (1 node, 1 sample, 1 region), excluding table lookup.
[Table: section of code (pre-light, post-light) by shading function; values not recoverable.]
`Now, consider the shading time. In PixelFlow, the table lookup
`time is proportional to the number of pixels that need data, so it is
`not constant for a region but depends on how many total values
`are actually needed. The worst case for table lookup will occur if
`all of the pixels in a region use the bowling pin shading function
`since it needs to look up four different values: two mip-mapped
`image textures, one bump map, and one shadow map. To do one
table lookup for all 8K pixels on a node takes 190 µs, so looking
up four values for a full region requires 760 µs.
`The worst-case time for the rest of the shader processing occurs
for regions that require all three shading functions: bowling pin,
alley, and ball. For regions without all of these elements, only

`some of the shading functions need to be run. Figure 9 shows the
`processing time for the shading functions excluding the table
`lookup times. Note, however, that the time setting up for a
`lookup and using the results is included. The slowest time for a
region is the sum of all the times in the figure, or 150 µs.
`This time is for only one sample of one region. Since we are
`doing five samples and a 640x512 video image has 40 regions,
`there are really 200 regions to shade. The total time comes to
`182 ms. By distributing the shading among twelve shading nodes,
we can cut the worst-case shading time to about 15.2 ms.
`The 9 ms spent rasterizing is less than the shading time.
`Therefore, the shading time dominates. The total time to compute
`the final camera view is the shading time plus the time to rasterize
`the first regions, or about 15.7 ms.
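The shading estimate follows the same pattern and can be checked in a few lines. The sketch uses only figures from the text (four lookups at 190 µs, 150 µs of other shading work, five samples, twelve shading nodes) and assumes the work divides evenly over the shading nodes; the function name is hypothetical.

```python
def shading_ms(screen_w=640, screen_h=512, samples=5, lookups=4,
               lookup_us=190.0, shade_us=150.0, shading_nodes=12):
    """Return (region passes, worst-case µs per region, total ms, ms per node)."""
    regions = (screen_w // 128) * (screen_h // 64)
    region_passes = regions * samples
    per_region_us = lookups * lookup_us + shade_us
    total_ms = region_passes * per_region_us / 1000.0
    # Assumes the region passes are spread evenly over the shading nodes.
    return region_passes, per_region_us, total_ms, total_ms / shading_nodes


passes, per_region_us, total_ms, per_node_ms = shading_ms()
print(passes, per_region_us, total_ms)  # 200 910.0 182.0
print(f"{per_node_ms:.1f}")             # 15.2
```

The quoted 15.7 ms for the final camera view then implies roughly 0.5 ms of start-up rasterization before the shading nodes have work to do.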
`Total frame time. A complete image can be rendered in under
`29.7 ms. This includes 7 ms to generate a shadow map, 7 ms to
`generate a reflection map, and 15.7 ms for the final camera image.
`These times were computed with pessimistic assumptions and
`without considering the pipelining that occurs between the
`rendering phases. This results in a frame rate faster than 30 Hz.
`With more hardware it will be possible to run even faster.
`Additional hardware will not significantly speed the shadow or
`reflection map computations since they are dominated by the
`serial time spent writing the lookup tables. But rendering time of
`the camera image is inversely proportional to the number of
`rasterization and shading nodes. For more complex geometry, we
`add rasterization nodes. For more complex shading, we add
`shading nodes. Note that the hardware for both of these tasks is
`identical. The balance between them can be decided at run time.
In this paper, we described the resources required to achieve
real-time programmable shading (programmability, memory, and
computational power), requirements that many graphics hardware
systems are close to meeting. We explained how this shading
power can be realized in our experimental graphics system,
`PixelFlow. And we showed with an example, simulations, and
timing analysis that a modest-size PixelFlow system will be able
`to run programmable shaders at video rates. We have
`demonstrated that it is now possible to perform, in real time,
`complex programmable shading that was previously only possible
`in software renderers. We hope that programmable shading will
`become a common feature in future commercial systems.
`We would like to acknowledge the help of Lawrence Kesteloot
and Fredrik Fatemi for the bowling simulation dynamics, Krish
Ponamgi for the PixelFlow simulator, Jon Leech for his work on
the PixelFlow API design, Nick England for his comments on the
`paper, and Tony Apodaca of Pixar for RenderMan help and
`advice. Thanks to Hewlett-Packard for their generous donations of
`This research is supported in part by the Advanced Research
`Projects Agency, ARPA ISTO Order No. A410 and the National
`Science Foundation, Grant No. MIP-9306208.
`Interactive Full Spectral Rendering
`Mark S. Peercy
`Benjamin M. Zhu
`Daniel R. Baum
`Silicon Graphics Computer Systems
`The scattering of light within a scene is a complicated process that
`one seeks to simulate when performing photorealistic image syn-
`thesis. Much research on this problem has been devoted to the
`geometric interaction between light and surfaces, but considerably
`less effort has been focused on methods for accurately representing
`and computing the corresponding color information. Yet the ef-
`fectiveness of computer image synthesis for many applications also
`depends on how accurately spectral information can be simulated.
`Consider applications such as architectural and interior design, prod-
`uct styling, and visual simulation where the role of the computer is
to show how objects would appear in the real world. If the color
information is rendered inaccurately, the computer simulation may
`serve as a starting point; but in the long run, its usefulness will be
`Correctly handling color during image synthesis requires preserving
`the wavelength dependence of the light that ultimately reaches the
`eye, a task we refer to as full spectral rendering. Full spectral
`rendering has been primarily in the purview of global illumination
`algorithms as they strive for the highest degree of photorealism.
`In contrast, commercially available interactive computer graphics
`systems exclusively use the RGB model, which describes the lights
`and surfaces in a scene with their respective RGB values on a given
`monitor. The light scattered from a surface is given by the products
of the red, green, and blue values of the light and surface, and these
`values are directly displayed. Unfortunately, the RGB model does
`a poor job of representing the wide spectral variation of spectral
`power distributions and surface scattering properties that is present
`in the real world [4], and it is strongly dependent on the choice
`of RGB monitor. As a result, the colors in an RGB image can be
`severely shifted from the correct colors.
`These drawbacks have frequently been overlooked in interactive
`graphics applications because the demand for interactivity has tra-
`ditionally overwhelmed the demand for photorealism. However,
`graphics workstations with real-time texture mapping and antialias-
`ing are overcoming this dichotomy [1]. Many applications that had
Address: Silicon Graphics, Inc., 2011 N. Shoreline Blvd., Mountain View, CA 94040
Permission to copy without fee all or part of this material is
granted provided that the copies are not made or distributed for
direct commercial advantage, the ACM copyright notice and the
title of the publication and its date appear, and notice is given
that copying is by permission of the Association of Computing
Machinery. To copy otherwise, or to republish, requires a fee
and/or specific permission.
1995 Symposium on Interactive 3D Graphics, Monterey CA USA
© 1995 ACM 0-89791-736-7/95/0004...$3.50
`previously bypassed photorealism for interactivity now are capa-
`ble of having some measure of both. The best current example is
`the blending of visual simulation technology and mechanical com-
`puter aided design for the visualization of complex objects such
`as automobiles. And as workstation technology continues to ad-
`vance, interactive rendering quality and speed will both increase.
`Consequently, utilizing interactive full spectral rendering will have
`significant benefits.
`We present an approach to and implementation of hardware-assisted
`full spectral rendering that yields interactive performance. We
demonstrate its use in an interactive walkthrough of an architectural
`model while changing time of day and interior lighting. Other ex-
`amples include the accurate simulation of Fresnel effects at smooth
`interfaces, thin film colors, and fluorescence.
`Generalized Linear Color Representations
`The architecture uses generalized linear color representations based
`upon those presented in [7] and [8]. The representations are obtained
`by considering scattering events as consisting of three distinct el-
`ements: a light source, a surface, and a viewer. Light from the
`source reflects from the surface to the viewer, where it is detected.
We use the term viewer to apply to a set of implicit or explicit
`linear sensors that extract color information from the scattered light.
`This information might be directly displayed, or it might be used
`again as input to another scattering event.
To derive the representations we expand the light source spectral
power distribution in a weighted sum over a set of $m$ basis
functions. The light is represented by a light vector, $\vec{\varepsilon}$, that contains the
corresponding weights. The surface is then described by a set of
$m$ sensor vectors, where the $i$-th vector gives the viewer response
to the $i$-th basis function scattered from the surface. If we collect
the sensor vectors in the columns of a surface matrix, $S$, the viewer
response to the total scattered light reduces to matrix multiplication:
$\vec{v} = S\vec{\varepsilon}$. The effect of geometry on light scattering is incorporated
in the chosen illumination model.
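The matrix form of the viewer response can be sketched in a few lines. The numbers below are hypothetical (three sensors and two basis functions), and the surface matrix is stored as a list of rows, one per sensor, so that its columns are the sensor vectors described above.

```python
def viewer_response(S, light):
    """Multiply the surface matrix S (n sensors x m basis functions)
    by the light vector (m basis-function weights)."""
    return [sum(s_ij * e_j for s_ij, e_j in zip(row, light)) for row in S]


# Hypothetical surface: column j holds the sensor response to basis function j.
S = [[0.5, 0.25],
     [0.25, 0.5],
     [0.0, 1.0]]
light = [2.0, 1.0]  # weights of the two basis functions in this light

print(viewer_response(S, light))  # [1.25, 1.0, 1.0]
```

Because the operation is linear, the response to the sum of two lights is the sum of their individual responses, which is what allows contributions from several sources to be accumulated.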
The principal advantage of these representations comes when every
light source in the scene is described by the same set of basis
functions. The light vectors and surface matrices can then be
precomputed, and the rendering computation reduces to inexpensive
and straightforwardly implemented matrix multiplication.
Additionally, the freedom to select appropriate basis functions and sensor
responsivities opens wide the applications of this approach.

Selection of Basis Functions: The basis functions are chosen to
`capture the spectral power distributions of all light sources in the
`scene. For a small number of independent lights, one could simply
`choose as basis functions the spectral curves of the lights. However,
`if the number of spectral power distributions for the lights is large,
`as, for example, when the sun rises and sets, the dimensionality can
`be reduced through various techniques, including point sampling
and characteristic vector analysis [6] [7] [5].
`Selection of Sensor Responsivities: If the scattered light is to be
viewed directly, as is typically the case in interactive graphics, the
`sensor responsivities should be the human color matching functions
`based on the monitor RGB primaries. For an application such as
merging computer graphics and live action film, the sensors can
`be chosen as the response curves of the camera used. The final
`image would consist of color values of the synthetic objects as if
`they were actually filmed on the set, so the image could be blended
`more easily into the live action. Similarly, the sensor values might
`be chosen to simulate the shift of non-visible radiation into visible
light in, for example, radio astronomy or night vision goggles. If,
alternatively, the scattering is only an intermediate step in a multiple
reflection path, as when computing environment maps, the sensor
responsivities can be chosen as the basis functions of the next event.
`Hardware Implementation
`Current Capabilities: Current workstations can employ the gener-
`alized linear color representations in special circumstances. When
`a scene contains a set of lights with identical spectral power dis-
`tributions, only a single basis function is required. Light vectors
then have only one component that modulates single-column surface
matrices. If the viewer has three or fewer sensors, RGB hardware
can perform this modulation. For scenes illuminated by multiple
sources, a natural implementation is via the accumulation buffer
`[3]. Pixels from the framebuffer can be added to the accumulation
`buffer with an arbitrary we

