`
`The Intel i860 64-Bit Processor:
`A General-purpose CPU with 3D Graphics
Capabilities
`
`
`Jack Grimes
`MASS Microsystems *
`Les Kohn and Rajeev Bharadhwaj
`Intel
`
The shipment of graphics superworkstations from
Apollo, Ardent, Silicon Graphics, Stellar, and other
manufacturers represents a new focus on scientific
visualization that is much more cost effective than a
five-million-dollar supercomputer for many applications.
One key is the integration of high numeric performance
with the ability to visualize the results of the
computations interactively using 3D graphics (scientific
visualization) while the computations are being performed.
The Intel i860 processor enables a new class of
workstations to be constructed. Computation rates previously
associated with supercomputers, along with the
visualization capabilities of superworkstations, will be available
`
`* This work was done while Grimes was at Intel.
`
`July 1989
`
0272-1716/89/0700-0085$01.00 © 1989 IEEE
`
`
`SAMSUNG-1029
`Page 1 of 10
`
`
`
`at the price of 2D workstations.
`The i860 processor represents the first single-chip
`device to be based on supercomputer design principles.
Supercomputers have been defined as
• equal to or faster than the Cray-3
• the fastest current machine
or as one generation behind what’s needed. Since the
introduction of the Cray-1 supercomputer in 1976, the
highest performance levels have been equated with
pipelined vector processing. The i860 processor is similar
in architecture to the Cray-1.
`The one-million-transistor, 64-bit i860 processor con-
`tains integer-control, paging, and bus units; floating-
`point-control, adder/subtracter, multiplier, and 3D
`graphics units; and instruction and data caches. The
`processor provides high levels of balanced performance.
`At 40 MHz, the Dhrystone 2.1 performance is 78,100
`Dhrystones per second. Peak execution rates of 80
`Mflops can be achieved using fine-grained parallelism.
`This floating-point performance provides over 500,000
`transforms per second, including 4x4 3D matrix multi-
`plies, clipping tests, and perspective calculations. In
`addition, special 3D graphics hardware provides rates
`of over 40,000 triangles per second, where the 100-pixel
`triangles are Gouraud shaded, transformed, z-buffered,
`and include one light source.
The i860 processor is a general-purpose CPU, and it is
being used as the main processor in high-performance
workstation designs. This article will focus
`on two aspects of the CPU that are particularly impor-
`tant for 3D graphics applications: transforms on vertex
`lists of floating-point values and image rendering involv-
`ing depth and color interpolation, with hidden-surface
`removal using a z-buffer.
`For a long time, the performance of 3D computer
`graphics systems was limited by their floating-point com-
`putation rate. The introduction of high-speed floating-
`point math coprocessors has largely eliminated this bot-
`tleneck. Recently, however, increasing emphasis has
`been placed on realism of the imagery. This means that
`the performance of the 3D system may now be limited
`by the ability of the hardware to shade the polygons that
`represent the surface of the object.
`The next section describes the performance features
`of the overall chip and later sections describe its appli-
`cation to 3D graphics.
`High performance
`The processor shown in Figure 1 provides the follow-
`ing functions on a single chip:
`
• RISC core unit
• Floating-point control unit
• Adder and multiplier floating-point units
• Graphics unit
• Memory-management unit
• Data and instruction caches
• Bus control unit
`
`The reduced-instruction-set core unit fetches both
`integer and floating-point instructions. It contains the
`32 x 32-bit integer register file, and decodes and executes
`load, store, integer, bit, and control-transfer instructions.
`Its pipelined organization with extensive bypassing and
`scoreboarding maximizes performance.
`There are separate floating-point adder and multiplier
`units. Each unit uses pipelining to deliver up to one
`result per clock (40 Mflops). The units can operate in par-
`allel, providing up to two results per clock (80 Mflops).
`Both units support 64- and 32-bit floating-point values
`in IEEE standard 754 format. Furthermore, the floating-
`point unit can operate in parallel with the integer unit.
`This means the two-result-per-clock rate can be sus-
`tained by overlapping overhead functions such as data
`fetching and storing and loop control with floating-point
`operations. The floating-point control unit contains a
separate five-port register file. This file can be accessed
as 8 x 128-bit registers, 16 x 64-bit registers, or 32 x 32-bit
registers. Information can be transferred between the file
`and data cache at the same time as operands are trans-
`ferred between the file and the floating-point or graphics
`units.
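The peak figures quoted above follow directly from the clock rate, and the three register-file views describe the same storage. A minimal sketch of that arithmetic, assuming the 40-MHz clock mentioned earlier; the function and constant names are illustrative only:

```python
CLOCK_HZ = 40_000_000  # 40-MHz clock, as quoted in the article


def peak_mflops(results_per_clock, clock_hz=CLOCK_HZ):
    """Peak floating-point rate in Mflops for a given number of results per clock."""
    return results_per_clock * clock_hz / 1e6


# The three views of the floating-point register file: (registers, bits each).
REGISTER_VIEWS = [(8, 128), (16, 64), (32, 32)]


def register_file_bits(view):
    """Total storage implied by one (count, width) view of the register file."""
    count, width = view
    return count * width
```

Here `peak_mflops(1)` corresponds to one unit delivering a result every clock and `peak_mflops(2)` to the adder and multiplier running together; every entry in `REGISTER_VIEWS` describes the same 1,024 bits.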
`The graphics unit contains the hardware for pixel-
`intensity interpolation, depth (z) interpolation, and z-
`buffer check. Peak rates are 16 million pixels per second,
`assuming Gouraud shading, 16-bit pixels, and a 16-bit z-
`buffer. The operation of the graphics unit is described
`in more detail in a later section.
`The memory-management unit translates addresses
`from the linear logical address space to the linear phys-
`ical address for both instruction and data accesses.
`Address translation is optional, and uses a two-level
`structure.
`Information from the translation tables is cached in a
`64-entry, four-way associative memory. The processor
`provides the basic features to implement a paged, virtual
`memory, and user/supervisor protection. The page
`tables are format-compatible with the 386 architecture.
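A two-level translation of this kind can be sketched as follows. The 10/10/12-bit split of the 32-bit address is the 386 layout for 4-Kbyte pages and is assumed here because the text says the page tables are format-compatible; the dictionary-based tables and the function names are hypothetical stand-ins, not the processor's data structures.

```python
PAGE_SHIFT = 12                      # 4-Kbyte pages (386 layout, assumed)
INDEX_BITS = 10                      # 1,024 entries per level (386 layout, assumed)
INDEX_MASK = (1 << INDEX_BITS) - 1
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def split_linear(addr):
    """Split a 32-bit linear address into (directory index, table index, offset)."""
    dir_index = (addr >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK
    table_index = (addr >> PAGE_SHIFT) & INDEX_MASK
    return dir_index, table_index, addr & OFFSET_MASK


def translate(addr, page_directory):
    """Walk a toy two-level table: directory -> page table -> physical page base."""
    dir_index, table_index, offset = split_linear(addr)
    page_table = page_directory[dir_index]   # first level
    frame_base = page_table[table_index]     # second level
    return frame_base | offset
```

For instance, with `page_directory = {1: {3: 0x7F000}}`, the address `0x00403025` selects directory entry 1 and table entry 3 and translates to `0x7F025`. Caching such lookups is the job of the 64-entry associative memory described above.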
`One third of the chip area is used for data and instruc-
`tion caches. The I-cache is a two-way, set-associative
`memory of 4 Kbytes, with 32-byte blocks. The D-cache
`is a writeback cache, composed of a two-way, set-
`associative memory of 8 Kbytes, with 32-byte blocks. A
`block is written for each cache load, and the 64-bit word
`that contained the addressed item is loaded first. The
`aggregate bandwidth out of the on-chip caches is 0.96
`Gbytes per second. This rate would not have been pos-
`sible with off-chip caches.
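The cache geometry and the quoted bandwidth can be checked with a few lines of arithmetic. The article does not spell out how the 0.96-Gbyte figure is derived; the sketch below assumes it is the sum of the 128-bit data-cache bus and the 64-bit instruction bus listed in this section, each delivering once per clock at 40 MHz. All names are illustrative.

```python
CLOCK_HZ = 40_000_000  # 40-MHz clock, as quoted in the article


def num_sets(size_bytes, ways, block_bytes):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * block_bytes)


def bus_bytes_per_second(bus_bits, clock_hz=CLOCK_HZ):
    """Bandwidth of a bus that transfers its full width once per clock."""
    return (bus_bits // 8) * clock_hz


ICACHE_SETS = num_sets(4 * 1024, ways=2, block_bytes=32)
DCACHE_SETS = num_sets(8 * 1024, ways=2, block_bytes=32)

# Assumed decomposition: data-cache path plus instruction path.
AGGREGATE_BYTES_PER_SECOND = bus_bytes_per_second(128) + bus_bytes_per_second(64)
```

Under these assumptions the I-cache has 64 sets, the D-cache has 128 sets, and the two paths together move 960 million bytes per second, which matches the 0.96 Gbytes per second stated in the text.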
The bus unit is designed to provide maximum performance
from conventional, static-column DRAM. A pin
indicates whether the next memory cycle is within the
same page. The bus unit supports both pipelined and
nonpipelined operation. High transfer rates (166 Mbytes
per second) to large memory systems are supported by
two-level pipelining, where up to three bus cycles can be
in progress at once. Pipelining enables a new 64-bit word
to be transferred every two clocks, even though the total
cycle time might be up to six clocks.

IEEE Computer Graphics & Applications

Figure 1. Diagram of the single-chip processor showing
the functional blocks and data paths. (Blocks labeled in
the diagram include the instruction cache, data cache,
memory management, RISC integer core unit, and
floating-point control unit.)

The high performance is supported by wide information
paths, also shown in Figure 1. They include the
following:

• 64-bit external data bus
• 32-bit external address bus
• 128-bit on-chip data bus from the data cache
• Three 64-bit on-chip data buses for floating-point
  operands
• 64-bit on-chip instruction bus

When pipeline designs are used within many parallel
functional units, the overall performance is often limited
by data-path bandwidth. The i860 processor designers
paid careful attention to data paths to obtain sustained
high levels of performance for computationally intensive
applications.
`Fine-grained parallelism
`Parallelism exists between the integer unit and the
`floating-point and graphics units. Up to three operations
`can be performed each clock. The integer unit and the
`floating-point units can execute in parallel, supported by
`the 64-bit instruction bus (two instructions per clock). A
`special dual-execution instruction mode is used for these
`operations. The data cache can supply two 32-bit or two
`64-bit operands per clock. The bus unit will supply up
`to one 64-bit or two 32-bit operands every two clocks to
`the data cache or directly to the floating-point register
`file.
`The graphics unit in Figure 1 is connected to the 64-bit
`data paths used by the floating-point units. The graphics
`instructions use the floating-point registers and can be
`executed in parallel with the integer unit instructions.
`
This dual-instruction feature is especially important to
the inner loop of the rendering code, as we shall see.
In addition to the pipelined floating-point operations,
where a single-precision floating-point add and/or
multiply result can be completed every clock, there is a more
conventional, scalar mode of operation. In this mode, the
floating-point results are interlocked so that the pipeline
operation is hidden from the program, and data
dependencies are enforced by the hardware.
Additional details are contained in the references.
The rest of this article focuses on the 3D graphics
capabilities of the processor.

3D graphics pipeline
The i860 processor contains very little specialized
graphics hardware (pixel and z interpolation, and
z-buffer checking). Almost all of the graphics functionality
is implemented in software. Since the base machine
is so fast, the performance of the resulting graphics
functions is very good. This approach provides maximum
flexibility (e.g., sharing the floating-point hardware
between simulation and transformations) and allows
lower cost systems to be constructed.
Several necessary graphics functions are sequential
and have come to be commonly called the “graphics
pipeline.” One possible pipeline, used for an early
demonstration implementation with Dore (the 3D
interactive visualization software developed by Ardent
Computer), is shown in Figure 2.

Figure 2. Example of a 3D graphics pipeline implemented
in software in the i860 processor for Ardent’s
Dore visualization software. Stages, in order: database of
geometry information; transform/project; light source
calculation; clip to view volume; color the pixels; display
frame buffer.

The prototype implementation of this pipeline for
triangle meshes requires about 20 Kbytes of assembly
language code. Assembly language was used to optimize the
low-level parallelism in the hardware and the graphics
features.
Floating-point matrix transforms in 3D graphics
Historically, one of the most computationally intensive
operations is the 3D transformation of geometric objects
at the top of the graphics pipeline. The data types are
normally 32-bit floating-point values. A basic transform
operation first combines all of the translate, scale, and
rotation operations into a single 4 x 4 matrix. Then all of
the x, y, z, and w values in the list are sequentially
multiplied by the transform matrix. If the transform and
other calculations can be performed at rates of 10 times
per second or faster, then the results can be interactively
viewed on a graphics display. The transform matrix
computation is of the form

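The matrix can be written out from the description that follows (a 3 x 3 r submatrix, translation in the first three elements of the last row, and 12 values in total). The sketch below assumes a row-vector convention and a fixed last column of (0, 0, 0, 1); the symbols r_ij and t_x, t_y, t_z are labels chosen here for illustration.

```latex
\begin{bmatrix} x' & y' & z' & w' \end{bmatrix}
=
\begin{bmatrix} x & y & z & w \end{bmatrix}
\begin{bmatrix}
r_{11} & r_{12} & r_{13} & 0 \\
r_{21} & r_{22} & r_{23} & 0 \\
r_{31} & r_{32} & r_{33} & 0 \\
t_x    & t_y    & t_z    & 1
\end{bmatrix}
```

Under this assumed layout each of x', y', and z' costs four multiplies and three adds, which gives 12 multiplies and 9 adds per vertex, and w' = w.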
The 3 x 3 r submatrix contains the rotation and scaling
information. The first three elements of the last row
contain the translation information. This 4 x 4 matrix has 12
values, normally single-precision. These values, plus the
last, current, and next transformed four-element
vectors, can be held in the register file, enabling the 12
multiplies and 9 adds to execute without any memory
load or store delays. The loading and storing of the other
operands are executed in parallel by the integer unit,
which also provides loop control.
`The object information is commonly stored as a list of
`vertices, where each vertex has an x, y, and z value. For
`example, this list of vertices may represent coordinates
`from a triangle mesh. In this compact data structure, an
`additional triangle can be represented by one additional
`vertex. This vertex list can easily contain thousands of
vertices. Additional information (for example, RGB
color values, a vertex normal, a surface color, and a
surface normal) can also be stored in the same list, as
shown in Figure 3.
`These vector-oriented computations can be converted
`into a program that takes full advantage of the pipelin-
`ing, dual-instruction mode, dual operations, and mem-
`ory hierarchy of the processor. Complete 3D transform
`rates (including 4x4 3D matrices, clipping tests, and per-
`spective) of over 500,000 transformations per second
`have been achieved.
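The per-vertex work described in this section can be sketched in a few lines. This is an illustration of the arithmetic only, not the pipelined assembly code the article refers to; it assumes row vectors (x, y, z, w) and a matrix whose last column is (0, 0, 0, 1), and the function names are hypothetical.

```python
def transform_vertex(v, m):
    """Multiply row vector v = (x, y, z, w) by a 4 x 4 matrix m.

    The last column of m is assumed to be (0, 0, 0, 1), so only the
    first three columns need arithmetic: 12 multiplies and 9 adds.
    """
    x, y, z, w = v
    out = []
    for col in range(3):  # x', y', z': four multiplies and three adds each
        out.append(x * m[0][col] + y * m[1][col] + z * m[2][col] + w * m[3][col])
    out.append(w)         # w' = w under the assumed last column
    return tuple(out)


def transform_list(vertices, m):
    """Apply the same transform matrix to every vertex in the list."""
    return [transform_vertex(v, m) for v in vertices]


# Example matrix: uniform scale by 2, then translate by (10, 20, 30).
SCALE_AND_TRANSLATE = [
    [2, 0, 0, 0],
    [0, 2, 0, 0],
    [0, 0, 2, 0],
    [10, 20, 30, 1],
]
```

With this matrix the vertex (1, 2, 3, 1) maps to (12, 24, 36, 1). On the processor the matrix stays in the register file while the integer unit streams vertices in and out, which this sequential sketch does not model.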
`
`
`
`
Figure 3. The list of vertices to be transformed normally
contains other information such as color and a normal
vector.
`
`Figure 4 shows the hardware configuration during one
`pipelined instruction (every instruction produces a
`result every clock), which is also a dual-operation
`instruction (the floating-point add unit and the floating-
`point multiply unit each produce a result every clock).
`While this is executing, the integer unit loads and stores
`operands and provides loop control. Thus, the inner loop
`can operate at the peak rate of three operations per clock
`(two floating-point operations and one integer operation)
`and can average more than two operations per clock.
`To execute two operations per instruction, four source
`and two destination operands must be specified. Two of
`the sources and one destination are floating-point
`registers, and are specified by the instruction. The other
`operands come from internal, loadable constant
`registers, and the output of one pipeline can be con-
`nected to the input of another. Figure 4 shows only one
`of the cases, the operation Destination = KR x
`Source 2 + Source 1. In this case, a constant (loaded into
`KR) is multiplied by Source 2. Three clocks later, this
`result is added to Source 1. Three clocks later, the over-
`all result is stored in the Destination register. Once the
`pipelines are full, one result (two floating-point opera-
`tions) is computed each clock. At the same time, integer
`instructions execute in parallel, providing operand load-
`ing and storing and loop control, giving a peak rate of
`three operations per clock.
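The timing of the chained operation can be illustrated with a small simulation. It models only what the paragraph states: a multiply issued on one clock reaches the adder three clocks later, and the sum is stored three clocks after that. The deque-based pipeline model and all names are assumptions made for illustration.

```python
from collections import deque

STAGES = 3  # three-clock latency per unit, as described in the text


def run_chained_pipeline(kr, src1, src2):
    """Simulate Destination = KR x Source 2 + Source 1 over lists of operands.

    Returns a list of (clock, value) pairs, one per stored result.
    """
    mul_pipe = deque([None] * STAGES)  # products in flight
    add_pipe = deque([None] * STAGES)  # sums in flight
    stored = []
    issued = 0
    clock = 0
    while len(stored) < len(src2):
        # The oldest entry of each pipeline completes on this clock.
        done_sum = add_pipe.popleft()
        done_product = mul_pipe.popleft()
        if done_sum is not None:
            stored.append((clock, done_sum))
        # A finished product enters the adder together with its Source 1.
        if done_product is not None:
            index, product = done_product
            add_pipe.append(product + src1[index])
        else:
            add_pipe.append(None)
        # One new multiply is issued per clock while operands remain.
        if issued < len(src2):
            mul_pipe.append((issued, kr * src2[issued]))
            issued += 1
        else:
            mul_pipe.append(None)
        clock += 1
    return stored
```

Calling `run_chained_pipeline(2, [1, 1, 1, 1], [1, 2, 3, 4])` stores the values 3, 5, 7, and 9 on clocks 6, 7, 8, and 9: six clocks to fill both pipelines, then one result per clock.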
Figure 4. Internal data-path configuration for one
dual-operation instruction: Destination = KR x
Source 2 + Source 1.

Pixel data
A pixel may be 8, 16, or 32 bits long, depending on
color and intensity resolution requirements (see Table 1).
Regardless of pixel size, the graphics unit always
operates on as many pixels at a time as will fit in one 64-bit
word. To perform color-intensity shading efficiently in
different applications, the processor defines the three
pixel formats shown in Figure 5. The pixel data type is
used by only two kinds of instructions:

• The selective pixel store instruction for hidden-surface
  elimination.
• The pixel-add instruction that helps implement 3D
  color-intensity shading.

Figure 5 shows one way of assigning meaning to the
fields. These assignments are for illustration only. The
graphics unit defines only the field sizes and the
operations on each field, not the specific uses of each field.

Table 1. Number of bits allocated to various fields for
the three pixel sizes.

Pixel size  Bits of color 1*  Bits of color 2*  Bits of color 3*  Bits of other
(in bits)   intensity         intensity         intensity         attribute (texture)
8           N (≤8) bits of intensity**                            8 - N
16          6                 6                 4                 0
32          8                 8                 8                 8

* The intensity attribute field may be assigned to colors
in any order convenient to the application.
** With 8-bit pixels, up to 8 bits can be used for
intensity; the remaining bits can be used for any other
attribute, such as color. The intensity bits must be the
low-order bits of the pixel.

Figure 5. Pixel formats for 8-, 16- and 32-bit pixels.
(C = color, I = intensity, R = red intensity, G = green
intensity, B = blue intensity, T = texture.)

3D graphics rendering instructions
Almost all processor instructions are used in graphics
applications, but six instruction types have been added
to assist specifically in 3D rendering operations. Five are
executed by the graphics unit:

• 16-bit and 32-bit z-buffer check
• 64-bit integer add and subtract
• Add with pixel merge
• Add with z merge
• OR with merge register

One additional instruction, the pixel store, is executed
by the integer unit.
A z-buffer aids hidden-surface elimination by
associating a z value with each pixel. This z value represents the
distance of that pixel from the viewer. When coloring a
specific pixel, a 3D algorithm calculates the distance of
the position on the surface from the viewer. If the point
is farther from the viewer than the point that is
represented by the pixel already in the frame buffer, then the
pixel value is not updated. If the new pixel is closer, then
it replaces the pixel previously stored in the frame buffer.
The graphics unit supports either 16- or 32-bit z values.
The z value size is independent of the pixel size.
For example, in Figure 6, scan line #20 of the gray
triangle is entirely “in front,” while the left end of scan
line #30 is “behind” the white triangle. As the depth and
color of each pixel are calculated, the depth must be
checked with the depth value already stored in the
z-buffer to determine whether the pixel values should
replace those already in the frame buffer. When the z
values for the left end of scan line #30 are compared, they
are greater than the values in the z-buffer (the new points
are farther from the viewer), and the corresponding
pixels are not drawn into the frame buffer.

Figure 6. Before each pixel is drawn, the current z
value of the pixel must be checked. If the new pixel is
“closer,” the new pixel is drawn.

The z-value calculations are interpolated as indicated in
Figure 7.

Figure 7. Example of depth (z) interpolation for a
triangle with different z values for each vertex.

For the three z values shown in Figure 7, the z value
for each scan-line edge is interpolated (e.g., 2750 and
3500 for the scan line shown). The value of Δz is
calculated as shown. As the scan line is rasterized from left
to right, this Δz value is added to the running z value for
each pixel. This calculation must be done with
double-integer precision to control rounding errors (32-bit math
for 16-bit z values, and 64-bit math for 32-bit z values).
Finally, for each pixel, the newly calculated z values are
compared with the values in the z-buffer to determine
pixel replacement.
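The interpolation and check for one scan line can be sketched as follows, using the edge values 2750 and 3500 from the example. The 16.16 fixed-point split is an assumption that stands in for "32-bit math for 16-bit z values," treating a tie as "closer" is also an assumption, and the function names are hypothetical.

```python
FRAC_BITS = 16  # 16-bit z carried in 32-bit fixed point (assumed 16.16 split)


def interpolate_z(z_left, z_right, num_pixels):
    """Return the 16-bit z value at each pixel of a scan line (num_pixels >= 2)."""
    z = z_left << FRAC_BITS
    dz = ((z_right - z_left) << FRAC_BITS) // (num_pixels - 1)
    values = []
    for _ in range(num_pixels):
        values.append(z >> FRAC_BITS)  # only the integer part is compared and stored
        z += dz                        # the running sum keeps the extra precision
    return values


def zbuffer_scanline(new_z, new_pixels, zbuf, frame):
    """Replace a pixel only where the new z is not farther than the stored z."""
    for i, (z, pixel) in enumerate(zip(new_z, new_pixels)):
        if z <= zbuf[i]:               # smaller z means closer to the viewer
            zbuf[i] = z
            frame[i] = pixel
```

For a four-pixel scan line, `interpolate_z(2750, 3500, 4)` steps by a Δz of 250 and yields 2750, 3000, 3250, and 3500. Against a z-buffer holding 3000 everywhere, only the first two pixels would be drawn.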
`The simple 3D case in Figure 7 is for flat shading
`where all pixels are set, for example, to the average of the
vertex colors. Several faddz (add with z merge)
instructions add Δz to the currently interpolated z value and
place the results into a merge register. The 16-bit and
`32-bit z-buffer check instructions perform multiple
`unsigned-integer comparisons, either four or two at a
`time, depending on how many z values fit in one 64-bit
`word. One set of operands comes from those just com-
`puted by the z-interpolation calculation, and the other
`operands come from the z values (in a z-buffer) that cor-
`respond to the points already drawn. Bits are set in the
`pixel mask register representing the pixels to be updated.
`The pixel store instruction uses this register to determine
`which pixels to update.
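The packed comparison and the masked store can be modeled in software. This is a behavioral sketch for four 16-bit z values in one 64-bit word, not a description of the exact hardware semantics: the field order, the bit order of the mask, and the treatment of equal values as "closer" are all assumptions, and the function names are hypothetical.

```python
def unpack16(word):
    """Split a 64-bit word into four 16-bit fields, field 0 in the low bits."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]


def pack16(fields):
    """Pack four 16-bit fields into one 64-bit word."""
    word = 0
    for i, field in enumerate(fields):
        word |= (field & 0xFFFF) << (16 * i)
    return word


def zcheck16(old_word, new_word):
    """Compare four packed unsigned z values.

    Returns (mask, merged): bit i of mask is set where new field i is
    not farther than old field i; merged holds the smaller of each pair.
    """
    mask = 0
    merged = []
    for i, (old, new) in enumerate(zip(unpack16(old_word), unpack16(new_word))):
        if new <= old:
            mask |= 1 << i
        merged.append(min(old, new))
    return mask, pack16(merged)


def masked_pixel_store(frame_word, pixel_word, mask):
    """Replace only the 16-bit pixels whose mask bit is set."""
    out = unpack16(frame_word)
    new = unpack16(pixel_word)
    for i in range(4):
        if mask & (1 << i):
            out[i] = new[i]
    return pack16(out)
```

The merged word returned by `zcheck16` is what would be written back to the z-buffer, and the mask plays the role of the pixel mask register consulted by the pixel store.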
`The faddp (add with pixel merge) instruction imple-
`ments interpolation of color intensities in the same way
`as the z interpolation, allowing for the different data
`types. The 8- and 16-bit formats shown in Figure 5 use
`16-bit intensity interpolation. One instruction does four
`or two interpolations at a time (one color in each of four
`or two pixels). For 16- and 32-bit pixels, the instruction
`is executed three times, once for each color component,
`to interpolate and merge the pixels into the merge reg-
`ister. The result is moved to a floating-point register with
`the form (OR with merge register) instruction. The OR
`operation is sometimes necessary when additional bits
`in the pixel are calculated separately, for example, alpha
`channel information.
`The 64-bit integer add and subtract instructions oper-
`ate on values in the floating-point register file and are
`used for 32-bit z-buffers where 64-bit precision interpo-
`lation is needed.
`Once the objects in the display list have been trans-
`formed, clipped, and backface culled, and the necessary
`normals and vertex or surface colors calculated, the
`objects are ready to be rendered. The simplest shading
`is flat shading, where all of the pixels in the triangle, for
`example, are colored with the same value (the surface
color). More realistic imagery is obtained with Gouraud
shading, where each pixel is colored with a value that
is linearly interpolated from values calculated at the ends
of the scan line. (These scan-line endpoint values are in
turn linearly interpolated between the vertex colors.)
Even more realistic images with specular effects can be
obtained with approximations to Phong shading.
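The two levels of linear interpolation in Gouraud shading can be sketched directly. Floating-point values are used here for clarity, whereas the hardware described in this article interpolates in fixed point; the function names and the example colors are illustrative.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def lerp_color(c0, c1, t):
    """Interpolate each component of two RGB tuples."""
    return tuple(lerp(a, b, t) for a, b in zip(c0, c1))


def gouraud_scanline(left_color, right_color, num_pixels):
    """Color of each pixel on a scan line, interpolated between its two ends."""
    if num_pixels == 1:
        return [left_color]
    return [
        lerp_color(left_color, right_color, i / (num_pixels - 1))
        for i in range(num_pixels)
    ]
```

The scan-line endpoints themselves come from the same helper applied along the triangle edges. For example, `lerp_color((0, 0, 0), (60, 60, 60), 0.5)` gives the color halfway along an edge, and that value then serves as `left_color` or `right_color` for the scan line crossing the edge at that point.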
Simple 2D fill is usually fast. However, even flat
shading of 3D surfaces is much slower than shading 2D
surfaces, because of the z-buffer checking on each pixel.
Without dedicated hardware, the z-buffer check and the
Gouraud shading calculations can easily bring the
performance of a graphics pipeline to a virtual halt, because
the check and shading calculations are performed for
each pixel, rather than for each vertex, and may be 10 or
100 times slower than the transform rate.
Next we will treat a specific assembly language
example with 16-bit pixels (6,6,4-R,G,B) and a 16-bit-deep
z-buffer.
`
`A specific example
`Table 2 lists the six steps in the inner loop for loading
`the z values, depth interpolation, Gouraud interpolation,
`z-buffer check, pixel store, and z-buffer update. First we
`will look at the sequential form, then show the code
`embedded in the parallel loop. Note that some steps are
`executed by the integer unit and others are executed by
`the graphics unit.
`
Table 2. Six steps in the inner loop for Gouraud shading
a scan line.

Step  Execution unit  Function
1     integer         Load oldZ values from the z-buffer.
2     graphics        Interpolate the newZ values by adding deltaZ
                      to the last computed values. Save the
                      extra-precision intermediate results.
3     graphics        Interpolate the colors for the next set of
                      pixels by adding the deltaBlue, deltaGreen,
                      and deltaRed to the last computed RGB values.
                      Save the extra-precision intermediate results.
4     graphics        Check the newZ values with the oldZ values,
                      setting pixel mask bits for the pixels that
                      are to be stored into the frame buffer. Save
                      the newZ values to put back into the z-buffer.
5     integer         Store the pixels using the pixel mask (PM)
                      register.
6     integer         Store the newZ values into the z-buffer.
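The sequential form of the six steps can be sketched in ordinary code, handling four pixels per pass as the 16-bit example does. Plain lists stand in for the packed 64-bit words, and integer running sums stand in for the extra-precision intermediates; the function name, its parameters, and the tie handling are assumptions made for illustration.

```python
def shade_scanline(zbuf, frame, start, count, z0, dz, rgb0, drgb):
    """Gouraud shade `count` pixels beginning at index `start`.

    z0 and rgb0 are the values at the first pixel; dz and drgb are the
    per-pixel increments already computed for the scan line.
    """
    z = z0
    rgb = list(rgb0)
    for base in range(start, start + count, 4):
        n = min(4, start + count - base)                # leftover pixels at the edge
        old_z = zbuf[base:base + n]                     # step 1: load oldZ
        new_z, pixels = [], []
        for _ in range(n):
            new_z.append(int(z))                        # step 2: interpolate newZ
            z += dz
            pixels.append(tuple(int(c) for c in rgb))   # step 3: interpolate colors
            rgb = [c + d for c, d in zip(rgb, drgb)]
        mask = [nz <= oz for nz, oz in zip(new_z, old_z)]       # step 4: z check
        kept_z = [min(nz, oz) for nz, oz in zip(new_z, old_z)]  # lower value kept
        for i in range(n):
            if mask[i]:
                frame[base + i] = pixels[i]             # step 5: masked pixel store
            zbuf[base + i] = kept_z[i]                  # step 6: store newZ
```

The parallel version interleaves these same steps across the integer and graphics units so that step 1 for the next group overlaps the arithmetic for the current one.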
`
Since the integer unit can execute in parallel with the
graphics unit, the six steps in Table 2 plus the loop
control can be “folded” into a parallel program, as shown
in Figure 8.
It is beyond the scope of this article to describe all of
the details of the graphics hardware.6 However, the code
fragment in Figure 8 shows the important inner loop:
Gouraud shading one scan line of a 3D triangle. Each
pass through the loop computes 64 bits’ worth of z values
(i.e., four 16-bit z values) and computes 64 bits’ worth of
pixels (i.e., four 16-bit pixels).

// Some boundary cases taken care of by this point in the code.
// Scan line end point z and pixel color values already computed.
// Color Δblue, Δgreen, Δred values already computed for the scan line.
// Depth Δz value already computed for the scan line.
// First set of oldZ values already loaded.
// The "d." in front of the graphics instructions indicates dual
//   instruction execution mode.

        Graphics Unit instructions            Integer Unit instructions
loop:
        d.faddz  z, Δz1, z                    fld.d  8(zbuffer), oldZ
        // interpolate 1st two z values       // load oldZ values for next iter.
        // (double precision math)            // cache hit 3 out of 4 times

        d.faddz  z, Δz2, z                    andh   0x8000, pixelcount
        // interpolate 2nd two z values       // pixelcount negative?

        d.form   f0, newZ                     bnc    extrapixels
        // collect the newZ's.                // leftover 1, 2 or 3 pixels
        // OR into MERGE register

        d.fzchks oldZ, newZ, newZ             nop
        // compare newZ with oldZ.            // spare instruction slot
        // put lower value into newZ.
        // set pixel mask register.

        d.fzchks f0, f0, f0                   xor    r0, pixelcount, r0
        // shift Pixel Mask reg right by 4    // count=0?

        d.faddp  blue, Δblue, blue            fst.d  newZ, 8(zbuffer)++
        // interpolate 4 blue values          // store newZ values (cache hit).
                                              // increment zbuffer pointer by 8

        d.faddp  green, Δgreen, green         bc     alignededge
        // interpolate 4 green values         // pixels came out even

        d.faddp  red, Δred, red               nop
        // interpolate 4 red values           // spare instruction slot

        d.form   f0, newi                     bla    neg4, pixelcount, loop
        // collect the pixels                 // decrement count by 4 pixels,
        // OR into MERGE register and         // sets cond code "lcc"
        // put result into newi               // conditionally go to "loop"

        d.fnop                                pst.d  newi, 8(fbuffer)++
        // spare instruction slot             // store pixels (under PM mask).
                                              // increment fbuffer pointer by 8.
                                              // cache miss with no cache update.
                                              // (delay slot instruction)

// End of the loop, now take care of ending boundary cases.

Figure 8. Inner loop for Gouraud shading a scan line,
including z-buffer hidden-surface removal.

Because of the parallel nature of the code, the loop
executes twice to complete all of the calculations for a
particular set of pixels. The first integer instruction loads the
values from the z-buffer to be used for comparison with
the new interpolated z values on the next iteration. The
graphics instruction stream then interpolates the new z
values and checks the old z values against the
interpolated z values (fzchks), setting the pixel mask (PM)
register as appropriate. Then the graphics instruction
stream interpolates the color values and computes the new
intensities (newi). Meanwhile, the integer instruction stream
checks to see if the count (pixelcount) has gone negative.
A negative count means that there are one, two, or three
pixels left over at the right edge. Next the integer stream
stores the updated z values, and exits if the count has
reached zero (even number of pixels/64 bits). If not, the
loop count is decremented by four and the pixel values are
stored (pst.d) using the PM mask register. Full details of
these and other instruction operations are available
elsewhere.6
`This loop takes 11 clocks to interpolate four 16-bit
`pixels of color (6,6,4-R,G,B), interpolate four 16-bit z
`values, and check these four z values with the ones cur-
`rently in the z-buffer. This results in a peak rate of 14.5
`million pixels per second at 40 MHz. The average rate
`varies widely, depending on the efficiency of the setup
`code and the size of the triangles. Rates of 40,000
`100-pixel triangles per second are expected. The perfor-
`mance would be higher for 8-bit pixels and lower for
32-bit pixels. The z-buffer depth can be 16 or 32 bits. The
8-, 16-, or 32-bit pixels and 16- or 32-bit z depth can be
used in any combination.
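The peak pixel rate follows from the loop length and the clock rate. A short worked check, with names chosen here for illustration; the reading that a 10-clock pass would correspond to the 16-million-pixel peak quoted earlier is an inference, not something the article states.

```python
CLOCK_HZ = 40_000_000  # 40-MHz clock, as quoted in the article


def peak_pixel_rate(pixels_per_pass, clocks_per_pass, clock_hz=CLOCK_HZ):
    """Pixels per second when every pass through the loop takes the same time."""
    return pixels_per_pass * clock_hz / clocks_per_pass


RATE_11_CLOCKS = peak_pixel_rate(4, 11)  # about 14.5 million pixels per second
RATE_10_CLOCKS = peak_pixel_rate(4, 10)  # 16 million pixels per second

# The expected 40,000 triangles of 100 pixels each is 4 million pixels per
# second, a fraction of the peak that leaves room for setup code.
TRIANGLE_PIXELS_PER_SECOND = 40_000 * 100
FRACTION_OF_PEAK = TRIANGLE_PIXELS_PER_SECOND / RATE_11_CLOCKS
```

Four pixels every 11 clocks gives roughly 14.5 million pixels per second, and the expected triangle rate uses about 27.5 percent of that peak, which is consistent with the remark that the average depends heavily on setup code and triangle size.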
`The graphics unit shown in Figure 1 provides an
`extremely high leveraged use of silicon by reusing the
`data paths for the floating-point hardware and adding
`only about 3 percent to the total die area. The very high
`speed comes from the dual-instruction execution and
`from the multiple pixels and z interpolations performed
`by the graphics instructions.
`
`System configurations
`There are many ways to use the i860 processor. Figure
`9 shows how a PC or workstation can be upgraded for
`technical applications through add-in cards. The high-
`lighted area shows the processor with a 2 to 8 Mbyte or
`larger private memory and a frame buffer. This config-
`uration, with the host processor operating system, will
`support scientific visualization such as interactive com-
`putation and viewing of particle flow over airfoils (com-
`putational fluid dynamics). In a smaller memory
`configuration, the graphics pipeline, including the pixel
`interpolation, can be done on the add-in cards, provid-
`ing reasonable 3D performance at a lower system cost.
The application can run on the host CPU. This
configuration provides the lowest incremental cost for an
existing workstation or PC.

Figure 9. A cost-effective graphics subsystem for
adding 3D imaging and numeric computation
capability to a PC or workstation platform.

Figure 10. 3D technical workstation based on a single
CPU.

`With more memory, this configuration also supports
`application acceleration, where the application is
`numerically intensive. Because of the high level of com-
`ponent integration, application acceleration can be
`accomplished by one or two add-in cards.
`Figure 10 diagrams a 3D workstation where the only
`processor runs the operating system, the computation-
`ally intensive application, and the 3D imaging. This pro-
`vides the lowest cost workstation configuration and is
`similar in philosophy to the Stellar GS1000. This config-
`uration has the advantage of making the floating-point
`performance generally available to all applications run
`on the workstation. A second advantage comes from the
`ability to share the often substantial memory between the
`application execution and the graphics tasks.
`
`
`
`
Conclusion
This article has briefly described the general-purpose
i860 processor, specifically highlighting the features that
apply to 3D graphics applications. Very high speed
floating-point performance is obtained from concurrently
executing add and multiply units, supported with
very wide data paths and on-chip caches. This processor
is unique because of special graphics instructions
that provide realistic, interactive image rendering using
Gouraud shading and z-buffering.
The i860 processor provides new opportunities for
innovation in products and software. The capability it
offers on a single chip means that a new generation of
3D workstations can be built and that high-performance
add-in boards with a small form factor are feasible. The
result is the potential for a new level of affordable
performance on the desktop.

References
1. M.D. Ercegovac and T. Lang, “Vector Processing,” in
Supercomputers, Class IV Systems, Hardware and Software,
S. Fernbach, ed., Elsevier, New York, 1986, pp. 29-57.
2. L. Kohn and J. Grimes, “A New Microprocessor with Vector
Processing Capabilities,” Professional Program Session Record Electro
89, Session No. 16, IEEE, New York, 1989, pp. 4/14/13.
`3. L. Kohn and S.-W. Fu, “A 1,000,000 Transistor Microprocessor,”
`Digest of Tech. Papers Int’l Solid State