`The Intel i860 64-Bit Processor:
`A General-purpose CPU with 3D Graphics
Capabilities
`
`Jack Grimes
`MASS Microsystems *
`Les Kohn and Rajeev Bharadhwaj
`Intel
`
The shipment of graphics superworkstations from
`Apollo, Ardent, Silicon Graphics, Stellar, and other
`manufacturers represents a new focus on scientific
`visualization that is much more cost effective than a five-
`million-dollar supercomputer for many applications.
One key is the integration of high numeric performance
with the ability to visualize interactively the results of the
computations using 3D graphics (scientific visualization)
while the computations are being performed.
`The Intel i860 processor enables a new class of work-
`stations to be constructed. Computation rates previously
`associated with supercomputers, along with the visuali-
`zation capabilities of superworkstations, will be available
`
`* This work was done while Grimes was at Intel.
`
`July 1989
`
0272-1716/89/0700-0085$01.00 © 1989 IEEE
`
`SAMSUNG-1029
`Page 1 of 10
`
`

`
`at the price of 2D workstations.
`The i860 processor represents the first single-chip
`device to be based on supercomputer design principles.
Supercomputers have been defined as
• equal to or faster than the Cray-3
• the fastest current machine
or as one generation behind what’s needed. Since the
introduction of the Cray-1 supercomputer in 1976, the
highest performance levels have been equated with pipelined
vector processing.¹ The i860 processor is similar
in architecture to the Cray-1.²
`The one-million-transistor, 64-bit i860 processor con-
`tains integer-control, paging, and bus units; floating-
`point-control, adder/subtracter, multiplier, and 3D
`graphics units; and instruction and data caches. The
`processor provides high levels of balanced performance.
`At 40 MHz, the Dhrystone 2.1 performance is 78,100
`Dhrystones per second. Peak execution rates of 80
`Mflops can be achieved using fine-grained parallelism.
`This floating-point performance provides over 500,000
`transforms per second, including 4x4 3D matrix multi-
`plies, clipping tests, and perspective calculations. In
`addition, special 3D graphics hardware provides rates
`of over 40,000 triangles per second, where the 100-pixel
`triangles are Gouraud shaded, transformed, z-buffered,
`and include one light source.
The i860 processor is a general-purpose CPU, and
it is being used as the main processor in high-performance
workstation designs. This article will focus
`on two aspects of the CPU that are particularly impor-
`tant for 3D graphics applications: transforms on vertex
`lists of floating-point values and image rendering involv-
`ing depth and color interpolation, with hidden-surface
`removal using a z-buffer.
`For a long time, the performance of 3D computer
`graphics systems was limited by their floating-point com-
`putation rate. The introduction of high-speed floating-
`point math coprocessors has largely eliminated this bot-
`tleneck. Recently, however, increasing emphasis has
`been placed on realism of the imagery. This means that
`the performance of the 3D system may now be limited
`by the ability of the hardware to shade the polygons that
`represent the surface of the object.
`The next section describes the performance features
`of the overall chip and later sections describe its appli-
`cation to 3D graphics.
`High performance
`The processor shown in Figure 1 provides the follow-
`ing functions on a single chip:
`
• RISC core unit
• Floating-point control unit
• Adder and multiplier floating-point units
• Graphics unit
• Memory-management unit
• Data and instruction caches
• Bus control unit
`
`The reduced-instruction-set core unit fetches both
`integer and floating-point instructions. It contains the
`32 x 32-bit integer register file, and decodes and executes
`load, store, integer, bit, and control-transfer instructions.
`Its pipelined organization with extensive bypassing and
`scoreboarding maximizes performance.
`There are separate floating-point adder and multiplier
`units. Each unit uses pipelining to deliver up to one
`result per clock (40 Mflops). The units can operate in par-
`allel, providing up to two results per clock (80 Mflops).
`Both units support 64- and 32-bit floating-point values
`in IEEE standard 754 format. Furthermore, the floating-
`point unit can operate in parallel with the integer unit.
`This means the two-result-per-clock rate can be sus-
`tained by overlapping overhead functions such as data
`fetching and storing and loop control with floating-point
`operations. The floating-point control unit contains a
`separate five-port register file. This file can be accessed
as 8 x 128-bit registers, 16 x 64-bit registers, or 32 x 32-bit
`registers. Information can be transferred between the file
`and data cache at the same time as operands are trans-
`ferred between the file and the floating-point or graphics
`units.
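The peak rates and register-file views quoted above reduce to simple arithmetic. The following Python sketch is illustrative only; the function and constant names are hypothetical and do not come from any Intel tool. It checks that one result per clock at 40 MHz corresponds to 40 Mflops, that two results per clock correspond to 80 Mflops, and that the three views of the floating-point register file all cover the same number of bits.

```python
CLOCK_HZ = 40_000_000  # 40 MHz clock rate quoted in the article


def peak_mflops(results_per_clock: int, clock_hz: int = CLOCK_HZ) -> float:
    """Peak floating-point rate in Mflops for a given number of results per clock."""
    return results_per_clock * clock_hz / 1e6


# The three register-file views named in the text: (register count, width in bits).
REGISTER_VIEWS = [(8, 128), (16, 64), (32, 32)]


def view_bits(count: int, width: int) -> int:
    """Total number of bits covered by one view of the register file."""
    return count * width


if __name__ == "__main__":
    print(peak_mflops(1))  # adder or multiplier alone
    print(peak_mflops(2))  # adder and multiplier operating in parallel
    print([view_bits(count, width) for count, width in REGISTER_VIEWS])
```

Each view spans 1,024 bits, which is why the same file can be addressed at any of the three widths.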
`The graphics unit contains the hardware for pixel-
`intensity interpolation, depth (z) interpolation, and z-
`buffer check. Peak rates are 16 million pixels per second,
`assuming Gouraud shading, 16-bit pixels, and a 16-bit z-
`buffer. The operation of the graphics unit is described
`in more detail in a later section.
The memory-management unit translates addresses
from the linear logical address space to the linear physical
address space for both instruction and data accesses.
`Address translation is optional, and uses a two-level
`structure.
`Information from the translation tables is cached in a
`64-entry, four-way associative memory. The processor
`provides the basic features to implement a paged, virtual
`memory, and user/supervisor protection. The page
`tables are format-compatible with the 386 architecture.
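Because the page tables are format-compatible with the 386 architecture, a two-level lookup can be sketched under the assumption of the 386 layout: 4-Kbyte pages, a 10-bit directory index, a 10-bit table index, and a 12-bit offset. The article does not spell out this split, so it is an assumption here, and the nested dictionaries standing in for the directory and page tables are hypothetical.

```python
def split_linear_address(addr: int) -> tuple[int, int, int]:
    """Split a 32-bit linear address into (directory index, table index, offset).

    Assumes the 386-style 10/10/12 bit split.
    """
    if not 0 <= addr < 1 << 32:
        raise ValueError("address must fit in 32 bits")
    directory_index = (addr >> 22) & 0x3FF
    table_index = (addr >> 12) & 0x3FF
    offset = addr & 0xFFF
    return directory_index, table_index, offset


def translate(addr: int, page_directory: dict) -> int:
    """Walk a two-level table: directory entry -> page table -> page frame base.

    page_directory is a hypothetical stand-in: it maps a directory index to a
    dict that maps a table index to the physical base address of a 4-Kbyte page.
    """
    directory_index, table_index, offset = split_linear_address(addr)
    page_table = page_directory[directory_index]   # first level
    frame_base = page_table[table_index]           # second level
    return frame_base | offset


if __name__ == "__main__":
    tables = {1: {3: 0x0009A000}}
    print(hex(translate(0x00403025, tables)))
```

A translation cache such as the 64-entry associative memory described next exists to avoid repeating this two-step walk on every access.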
`One third of the chip area is used for data and instruc-
`tion caches. The I-cache is a two-way, set-associative
`memory of 4 Kbytes, with 32-byte blocks. The D-cache
`is a writeback cache, composed of a two-way, set-
`associative memory of 8 Kbytes, with 32-byte blocks. A
`block is written for each cache load, and the 64-bit word
`that contained the addressed item is loaded first. The
`aggregate bandwidth out of the on-chip caches is 0.96
`Gbytes per second. This rate would not have been pos-
`sible with off-chip caches.
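The cache figures can be cross-checked with a few lines of arithmetic. This sketch assumes that the aggregate bandwidth is the sum of the 64-bit on-chip instruction bus and the 128-bit on-chip data bus from the data cache (both listed among the chip's data paths), each delivering one transfer per 40-MHz clock. The article does not state how the 0.96-Gbyte figure was derived, so that decomposition is an assumption, and the function names are hypothetical.

```python
CLOCK_HZ = 40_000_000  # 40 MHz


def cache_sets(size_bytes: int, ways: int, block_bytes: int) -> int:
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * block_bytes)


def bandwidth_bytes_per_second(path_bits: int, clock_hz: int = CLOCK_HZ) -> int:
    """Bandwidth of a data path that delivers one transfer per clock."""
    return (path_bits // 8) * clock_hz


icache_sets = cache_sets(4 * 1024, ways=2, block_bytes=32)
dcache_sets = cache_sets(8 * 1024, ways=2, block_bytes=32)

# Assumed decomposition: instruction path plus data-cache path.
aggregate = bandwidth_bytes_per_second(64) + bandwidth_bytes_per_second(128)

if __name__ == "__main__":
    print(icache_sets, dcache_sets)
    print(aggregate / 1e9, "Gbytes per second")
```

Under these assumptions the two paths together supply 24 bytes per clock, which matches the quoted 0.96 Gbytes per second.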
[Figure 1. Diagram of the single-chip processor showing the functional blocks and data paths. Labels recoverable from the figure: external address, external data (64 bits), instruction cache, memory management, RISC integer unit (core unit), floating-point control unit, data cache.]

The bus unit is designed to provide maximum performance
from conventional, static-column DRAM. A pin
indicates whether the next memory cycle is within the
same page. The bus unit supports both pipelined and
nonpipelined operation. High transfer rates (166 Mbytes
per second) to large memory systems are supported by
two-level pipelining, where up to three bus cycles can be
in progress at once. Pipelining enables a new 64-bit word
to be transferred every two clocks, even though the total
cycle time might be up to six clocks.
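The relationship between the six-clock cycle time, the two-clock transfer interval, and the three bus cycles in progress can be illustrated with a small timing model. This is a simplified sketch, not a description of the actual bus protocol: it assumes every bus cycle takes exactly six clocks and that a new one starts every two clocks. The function name is hypothetical.

```python
def simulate_bus(num_words: int, cycle_clocks: int = 6, issue_interval: int = 2):
    """Return (clock at which the last word completes, peak cycles in flight)."""
    starts = [i * issue_interval for i in range(num_words)]
    finishes = [start + cycle_clocks for start in starts]
    peak_in_flight = max(
        sum(1 for start, finish in zip(starts, finishes) if start <= t < finish)
        for t in range(finishes[-1])
    )
    return finishes[-1], peak_in_flight


if __name__ == "__main__":
    total_clocks, in_flight = simulate_bus(8)
    print(total_clocks, in_flight)
    print(8 * 6)  # clocks needed for the same eight words without pipelining
```

With a six-clock cycle and a new cycle every two clocks, exactly 6 / 2 = 3 cycles overlap in steady state, which agrees with the "up to three bus cycles" in the text.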
The high performance is supported by wide information
paths, also shown in Figure 1. They include the following:
`
• 64-bit external data bus
• 32-bit external address bus
• 128-bit on-chip data bus from the data cache
• Three 64-bit on-chip data buses for floating-point operands
• 64-bit on-chip instruction bus
`
`When pipeline designs are used within many parallel
`functional units, the overall performance is often limited
by data-path bandwidth. The i860 processor designers
paid careful attention to data paths to obtain sustained
`high levels of performance for computationally intensive
`applications.
`Fine-grained parallelism
`Parallelism exists between the integer unit and the
`floating-point and graphics units. Up to three operations
`can be performed each clock. The integer unit and the
`floating-point units can execute in parallel, supported by
`the 64-bit instruction bus (two instructions per clock). A
`special dual-execution instruction mode is used for these
`operations. The data cache can supply two 32-bit or two
`64-bit operands per clock. The bus unit will supply up
`to one 64-bit or two 32-bit operands every two clocks to
`the data cache or directly to the floating-point register
`file.
`The graphics unit in Figure 1 is connected to the 64-bit
`data paths used by the floating-point units. The graphics
`instructions use the floating-point registers and can be
`executed in parallel with the integer unit instructions.
`
This dual-instruction feature is especially important to
the inner loop of the rendering code, as we shall see.
In addition to the pipelined floating-point operations,
where a single-precision floating-point add and/or multiply
result can be completed every clock, there is a more
conventional, scalar mode of operation. In this mode, the
floating-point results are interlocked so that the pipeline
operation is hidden from the program, and data dependencies
are enforced by the hardware.
Additional details are contained in the references.
The rest of this article focuses on the 3D graphics capabilities
of the processor.

3D graphics pipeline
The i860 processor contains very little specialized
graphics hardware (pixel and z interpolation, and z-buffer
checking). Almost all of the graphics functionality
is implemented in software. Since the base machine
is so fast, the performance of the resulting graphics functions
is very good. This approach provides maximum
flexibility (e.g., sharing the floating-point hardware
between simulation and transformations) and allows
lower cost systems to be constructed.
Several necessary graphics functions are sequential
and have come to be commonly called the “graphics
pipeline.” One possible pipeline, used for an early
demonstration implementation with Dore (the 3D interactive
visualization software developed by Ardent Computer),
is shown in Figure 2.

[Figure 2. Example of a 3D graphics pipeline implemented in software in the i860 processor for Ardent’s Dore visualization software. Stages shown: database of geometry information; transform/project; light source calc; clip to view volume; color the pixels; display frame buffer.]

The prototype implementation of this pipeline for triangle
meshes requires about 20 Kbytes of assembly language
code. Assembly language was used to optimize the
low-level parallelism in the hardware and the graphics
features.

Floating-point matrix transforms in 3D graphics
Historically, one of the most computationally intensive
operations is the 3D transformation of geometric objects
at the top of the graphics pipeline. The data types are
normally 32-bit floating-point values. A basic transform
operation first combines all of the translate, scale, and
rotation operations into a single 4 x 4 matrix. Then all of
the x, y, z, and w values in the list are sequentially multiplied
by the transform matrix. If the transform and
other calculations can be performed at rates of 10 times
per second or faster, then the results can be interactively
viewed on a graphics display. The transform matrix computation
is of the form

\[
\begin{bmatrix} x' & y' & z' & w' \end{bmatrix}
=
\begin{bmatrix} x & y & z & w \end{bmatrix}
\begin{bmatrix}
r_{11} & r_{12} & r_{13} & 0 \\
r_{21} & r_{22} & r_{23} & 0 \\
r_{31} & r_{32} & r_{33} & 0 \\
t_x & t_y & t_z & 1
\end{bmatrix}
\]

`The 3 x 3 r submatrix contains the rotation and scaling
`information. The first three elements of the last row con-
`tain the translation information. This 4 x 4 matrix has 12
normally single-precision values. These values, plus the
last, current, and next transformed four-element
vectors, can be held in the register file, enabling the 12
`multiplies and 9 adds to execute without any memory
`load or store delays. The loading and storing of the other
`operands are executed in parallel by the integer unit,
`which also provides loop control.
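The count of 12 multiplies and 9 adds can be checked with a short Python sketch. It assumes a row vector multiplied by a 4 x 4 matrix whose last column is (0, 0, 0, 1), so that w passes through unchanged; that layout is inferred from the description of the matrix (a 3 x 3 rotation and scaling submatrix, translation in the last row, 12 stored values) rather than stated outright. The function name is hypothetical.

```python
def transform_vertex(vertex, matrix):
    """Multiply the row vector (x, y, z, w) by a 4x4 row-major matrix.

    Assumes the last matrix column is (0, 0, 0, 1), so only the first three
    columns (12 values) take part in the arithmetic.
    Returns (transformed vertex, multiply count, add count).
    """
    x, y, z, w = vertex
    multiplies = adds = 0
    out = []
    for col in range(3):
        acc = x * matrix[0][col]
        acc += y * matrix[1][col]
        acc += z * matrix[2][col]
        acc += w * matrix[3][col]   # translation term
        multiplies += 4
        adds += 3
        out.append(acc)
    out.append(w)                   # w is copied, not computed
    return tuple(out), multiplies, adds


if __name__ == "__main__":
    # Scale by 2, then translate by (10, 20, 30).
    m = [
        [2.0, 0.0, 0.0, 0.0],
        [0.0, 2.0, 0.0, 0.0],
        [0.0, 0.0, 2.0, 0.0],
        [10.0, 20.0, 30.0, 1.0],
    ]
    print(transform_vertex((1.0, 2.0, 3.0, 1.0), m))
```

Each of the three output components costs four multiplies and three adds, which gives the 12 and 9 quoted in the text.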
`The object information is commonly stored as a list of
`vertices, where each vertex has an x, y, and z value. For
`example, this list of vertices may represent coordinates
`from a triangle mesh. In this compact data structure, an
`additional triangle can be represented by one additional
`vertex. This vertex list can easily contain thousands of
vertices. Additional information (for example, RGB
color values, a vertex normal, a surface color, and a surface
normal) can also be stored in the same list, as
shown in Figure 3.
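The compactness of the mesh representation can be seen in a few lines of Python. This sketch assumes a strip-style ordering, in which each vertex after the first two forms a triangle with the two vertices before it; the article does not specify the ordering, and the alternating winding direction of real strips is ignored here. The function name is hypothetical.

```python
def mesh_triangles(vertices):
    """Enumerate triangles from a strip-ordered vertex list.

    Each vertex after the first two closes one new triangle with its two
    predecessors, so n vertices describe n - 2 triangles.
    """
    return [
        (vertices[i], vertices[i + 1], vertices[i + 2])
        for i in range(len(vertices) - 2)
    ]


if __name__ == "__main__":
    strip = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 2, 0)]
    print(len(mesh_triangles(strip)))
    print(len(mesh_triangles(strip + [(1, 2, 0)])))  # one more vertex, one more triangle
```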
`These vector-oriented computations can be converted
`into a program that takes full advantage of the pipelin-
`ing, dual-instruction mode, dual operations, and mem-
`ory hierarchy of the processor. Complete 3D transform
`rates (including 4x4 3D matrices, clipping tests, and per-
`spective) of over 500,000 transformations per second
`have been achieved.
`
`IEEE Computer Graphics & Applications
`
`
`

`
Figure 3. The list of vertices to be transformed normally
contains other information such as color and a normal vector.
`
`Figure 4 shows the hardware configuration during one
`pipelined instruction (every instruction produces a
`result every clock), which is also a dual-operation
`instruction (the floating-point add unit and the floating-
`point multiply unit each produce a result every clock).
`While this is executing, the integer unit loads and stores
`operands and provides loop control. Thus, the inner loop
`can operate at the peak rate of three operations per clock
`(two floating-point operations and one integer operation)
`and can average more than two operations per clock.
`To execute two operations per instruction, four source
`and two destination operands must be specified. Two of
`the sources and one destination are floating-point
`registers, and are specified by the instruction. The other
`operands come from internal, loadable constant
`registers, and the output of one pipeline can be con-
`nected to the input of another. Figure 4 shows only one
`of the cases, the operation Destination = KR x
`Source 2 + Source 1. In this case, a constant (loaded into
`KR) is multiplied by Source 2. Three clocks later, this
`result is added to Source 1. Three clocks later, the over-
`all result is stored in the Destination register. Once the
`pipelines are full, one result (two floating-point opera-
`tions) is computed each clock. At the same time, integer
`instructions execute in parallel, providing operand load-
`ing and storing and loop control, giving a peak rate of
`three operations per clock.
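The timing of the chained operation Destination = KR x Source 2 + Source 1 can be modeled clock by clock. The sketch below is a simplified software model, not the hardware: it assumes a three-stage multiplier feeding a three-stage adder, as the "three clocks later" wording suggests, and it reads Source 1 at the moment the product leaves the multiplier. The function name and the list-based pipelines are hypothetical.

```python
def run_dual_operation(kr, source1, source2, mul_stages=3, add_stages=3):
    """Model Destination = KR * Source2 + Source1 on chained pipelines.

    Returns a list of (clock, value) pairs, one per retired result.
    """
    mul_pipe = [None] * mul_stages   # products in flight, newest first
    add_pipe = [None] * add_stages   # sums in flight, newest first
    results = []
    n = len(source2)
    clock = 0
    while len(results) < n:
        clock += 1
        # The oldest entry of each pipeline leaves at the start of the clock.
        done_sum = add_pipe.pop()
        done_product = mul_pipe.pop()
        if done_sum is not None:
            results.append((clock, done_sum))
        # The multiplier output is chained into the adder input.
        if done_product is not None:
            index, product = done_product
            add_pipe.insert(0, product + source1[index])
        else:
            add_pipe.insert(0, None)
        # Issue one new multiply per clock while operands remain.
        issue = clock - 1
        if issue < n:
            mul_pipe.insert(0, (issue, kr * source2[issue]))
        else:
            mul_pipe.insert(0, None)
    return results


if __name__ == "__main__":
    for clock, value in run_dual_operation(
        2.0, [10.0, 20.0, 30.0, 40.0], [1.0, 2.0, 3.0, 4.0]
    ):
        print(clock, value)
```

In this model the first operand pair is issued in clock 1 and its result retires in clock 7, six clocks later; after that, one result (two floating-point operations) retires every clock.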
[Figure 4. Internal data-path configuration for one dual-operation instruction: Destination = KR x Source 2 + Source 1.]

Pixel data
A pixel may be 8, 16, or 32 bits long, depending on
color and intensity resolution requirements (see Table 1).
Regardless of pixel size, the graphics unit always operates
on as many pixels at a time as will fit in one 64-bit
word. To perform color-intensity shading efficiently in
different applications, the processor defines the three
pixel formats shown in Figure 5. The pixel data type is
used by only two kinds of instructions:

• The selective pixel store instruction for hidden-surface elimination.
• The pixel-add instruction that helps implement 3D color-intensity shading.

Figure 5 shows one way of assigning meaning to the
fields. These assignments are for illustration only. The
graphics unit defines only the field sizes and the operations
on each field, not the specific uses of each field.

Table 1. Number of bits allocated to various fields for the three pixel sizes.

Pixel size   Bits of color 1*   Bits of color 2*   Bits of color 3*   Bits of other
(in bits)    intensity          intensity          intensity          attribute (texture)
8            N (≤ 8) bits of intensity**                              8 - N
16           6                  6                  4                  0
32           8                  8                  8                  8

* The intensity attribute fields may be assigned to colors in any order convenient to the application.
** With 8-bit pixels, up to 8 bits can be used for intensity; the remaining bits can be used for any other attribute, such as color. The intensity bits must be the low-order bits of the pixel.

[Figure 5. Pixel formats for 8-, 16-, and 32-bit pixels. 8-bit pixel: C and I fields, with I in the low-order bits. 16-bit pixel: R in bits 15-10, G in bits 9-4, B in bits 3-0. 32-bit pixel: R in bits 31-24, G in bits 23-16, B in bits 15-8, T in bits 7-0. C = color, I = intensity, R = red intensity, G = green intensity, B = blue intensity, T = texture.]

3D graphics rendering instructions
Almost all processor instructions are used in graphics
applications, but six instruction types have been added
to assist specifically in 3D rendering operations. Five are
executed by the graphics unit:

• 16-bit and 32-bit z-buffer check
• 64-bit integer add and subtract
• Add with pixel merge
• Add with z merge
• OR with merge register

One additional instruction, the pixel store, is executed
by the integer unit.
A z-buffer aids hidden-surface elimination by associating
a z value with each pixel. This z value represents the
distance of that pixel from the viewer. When coloring a
specific pixel, a 3D algorithm calculates the distance of
the position on the surface from the viewer. If the point
is farther from the viewer than the point that is represented
by the pixel already in the frame buffer, then the
pixel value is not updated. If the new pixel is closer, then
it replaces the pixel previously stored in the frame buffer.
The graphics unit supports either 16- or 32-bit z values.
The z value size is independent of the pixel size.
For example, in Figure 6, scan line #20 of the gray triangle
is entirely “in front,” while the left end of scan
line #30 is “behind” the white triangle. As the depth and
color of each pixel are calculated, the depth must be
checked with the depth value already stored in the z-buffer
to determine whether the pixel values should
replace those already in the frame buffer. When the z
values for the left end of scan line #30 are compared, they
are greater than the values in the z-buffer (that is, farther
from the viewer), and the corresponding pixels are not
drawn into the frame buffer.
The z-value calculations are interpolated as indicated in
Figure 7.
For the three z values shown in Figure 7, the z value
for each scan-line edge is interpolated (e.g., 2750 and
3500 for the scan line shown). The value of Δz is calculated
as shown. As the scan line is rasterized from left
to right, this Δz value is added to the running z value for
each pixel. This calculation must be done with double-integer
precision to control rounding errors (32-bit math
for 16-bit z values, and 64-bit math for 32-bit z values).
Finally, for each pixel, the newly calculated z values are
compared with the values in the z-buffer to determine
pixel replacement.

[Figure 6. Before each pixel is drawn, the current z value of the pixel must be checked. If the new pixel is “closer,” the new pixel is drawn.]

[Figure 7. Example of depth (z) interpolation for a triangle with different z values for each vertex.]
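The need for the extra precision can be shown with the two edge values quoted for Figure 7, 2750 and 3500. The Python sketch below interpolates a 16-bit z value with 32-bit fixed-point arithmetic (16 integer bits and 16 fraction bits). The span of eight steps is a hypothetical choice for illustration, since the scan-line length in the figure is not recoverable, and the function name is likewise hypothetical.

```python
FRACTION_BITS = 16  # 16-bit z values carried in 32-bit fixed point


def interpolate_z(z_left: int, z_right: int, steps: int) -> list:
    """Return the 16-bit z value at each pixel from z_left to z_right inclusive."""
    z = z_left << FRACTION_BITS
    delta_z = ((z_right - z_left) << FRACTION_BITS) // steps
    values = []
    for _ in range(steps + 1):
        values.append((z >> FRACTION_BITS) & 0xFFFF)  # integer part only
        z = (z + delta_z) & 0xFFFFFFFF                # 32-bit running sum
    return values


def naive_endpoint(z_left: int, z_right: int, steps: int) -> int:
    """Endpoint reached when delta z is truncated to a 16-bit integer."""
    return z_left + steps * ((z_right - z_left) // steps)


if __name__ == "__main__":
    print(interpolate_z(2750, 3500, 8))
    print(naive_endpoint(2750, 3500, 8))
```

With the fraction bits retained, the running value lands exactly on 3500 at the right edge. With an integer-only increment of 93 instead of 93.75, the same eight steps end at 3494, and the error grows with the length of the scan line.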
`The simple 3D case in Figure 7 is for flat shading
`where all pixels are set, for example, to the average of the
`vertex colors. Several faddz (add with z merge) instruc-
`tions add Az to the currently interpolated z value and
`place the results into a merge register. The 16-bit and
`32-bit z-buffer check instructions perform multiple
`unsigned-integer comparisons, either four or two at a
`time, depending on how many z values fit in one 64-bit
`word. One set of operands comes from those just com-
`puted by the z-interpolation calculation, and the other
`operands come from the z values (in a z-buffer) that cor-
`respond to the points already drawn. Bits are set in the
`pixel mask register representing the pixels to be updated.
`The pixel store instruction uses this register to determine
`which pixels to update.
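A software model of the 16-bit z-buffer check helps make the data flow concrete. The sketch below compares four unsigned 16-bit z values packed into one 64-bit word, keeps the smaller (closer) value in each position, and sets one mask bit per pixel to be updated. It is a simplified illustration: whether equal z values count as an update, and how the bits are ordered within the pixel mask register, are assumptions here. All function names are hypothetical.

```python
def unpack16(word: int) -> list:
    """Split a 64-bit word into four 16-bit fields, lowest field first."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]


def pack16(values) -> int:
    """Pack four 16-bit values into a 64-bit word, lowest field first."""
    word = 0
    for i, value in enumerate(values):
        word |= (value & 0xFFFF) << (16 * i)
    return word


def z_check16(new_word: int, old_word: int):
    """Return (word of kept z values, 4-bit mask of pixels to update)."""
    mask = 0
    kept = []
    pairs = zip(unpack16(new_word), unpack16(old_word))
    for i, (new_z, old_z) in enumerate(pairs):
        if new_z <= old_z:          # assumed: ties are treated as "closer"
            mask |= 1 << i
            kept.append(new_z)
        else:
            kept.append(old_z)
    return pack16(kept), mask


if __name__ == "__main__":
    new = pack16([100, 500, 300, 900])
    old = pack16([200, 400, 300, 0xFFFF])
    kept, mask = z_check16(new, old)
    print(unpack16(kept), bin(mask))
```

The kept word is what would be written back to the z-buffer, and the mask is what a masked pixel store would consult.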
`The faddp (add with pixel merge) instruction imple-
`ments interpolation of color intensities in the same way
`as the z interpolation, allowing for the different data
`types. The 8- and 16-bit formats shown in Figure 5 use
`16-bit intensity interpolation. One instruction does four
`or two interpolations at a time (one color in each of four
`or two pixels). For 16- and 32-bit pixels, the instruction
`is executed three times, once for each color component,
`to interpolate and merge the pixels into the merge reg-
`ister. The result is moved to a floating-point register with
`the form (OR with merge register) instruction. The OR
`operation is sometimes necessary when additional bits
`in the pixel are calculated separately, for example, alpha
`channel information.
`The 64-bit integer add and subtract instructions oper-
`ate on values in the floating-point register file and are
`used for 32-bit z-buffers where 64-bit precision interpo-
`lation is needed.
`Once the objects in the display list have been trans-
`formed, clipped, and backface culled, and the necessary
`normals and vertex or surface colors calculated, the
`objects are ready to be rendered. The simplest shading
`is flat shading, where all of the pixels in the triangle, for
`example, are colored with the same value (the surface
color). More realistic imagery is obtained with Gouraud
shading, where each pixel is colored with a value that
is linearly interpolated from values calculated at the ends
of the scan line. (These scan-line endpoint values are in
turn linearly interpolated between the vertex colors.)
Even more realistic images with specular effects can be
obtained with approximations to Phong shading.
`Simple 2D fill is usually fast. However, even flat shad-
`ing of 3D surfaces is much slower than shading 2D sur-
`faces, because of the z-buffer checking on each pixel.
`Without dedicated hardware, the z-buffer check and the
`Gouraud shading calculations can easily bring the per-
`formance of a graphics pipeline to a virtual halt, because
`the check and shading calculations are performed for
`each pixel, rather than for each vertex, and may be 10 or
`100 times slower than the transform rate.
Next we will treat a specific assembly language example
with 16-bit pixels (6,6,4-R,G,B) and a 16-bit-deep z-buffer.
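The 6,6,4 pixel format can be illustrated with a pair of packing helpers. This Python sketch assumes red occupies the high-order six bits, green the next six, and blue the low-order four. The article notes that such field assignments are for illustration only, so the ordering is an assumption, and the function names are hypothetical.

```python
def pack_664(red: int, green: int, blue: int) -> int:
    """Pack 6-bit red, 6-bit green, and 4-bit blue into one 16-bit pixel."""
    if not (0 <= red < 64 and 0 <= green < 64 and 0 <= blue < 16):
        raise ValueError("component out of range for the 6,6,4 format")
    return (red << 10) | (green << 4) | blue


def unpack_664(pixel: int):
    """Recover (red, green, blue) from a 16-bit 6,6,4 pixel."""
    return (pixel >> 10) & 0x3F, (pixel >> 4) & 0x3F, pixel & 0xF


PIXELS_PER_WORD = 64 // 16  # the graphics unit handles one 64-bit word at a time

if __name__ == "__main__":
    print(hex(pack_664(63, 0, 15)))
    print(unpack_664(pack_664(1, 2, 3)))
    print(PIXELS_PER_WORD)
```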
`
`A specific example
`Table 2 lists the six steps in the inner loop for loading
`the z values, depth interpolation, Gouraud interpolation,
`z-buffer check, pixel store, and z-buffer update. First we
`will look at the sequential form, then show the code
`embedded in the parallel loop. Note that some steps are
`executed by the integer unit and others are executed by
`the graphics unit.
`
Table 2. Six steps in the inner loop for Gouraud shading a scan line.

Step  Execution unit  Function
1     integer         Load oldZ values from the z buffer.
2     graphics        Interpolate the newZ values by adding deltaZ to the last computed values. Save the extra-precision intermediate results.
3     graphics        Interpolate the colors for the next set of pixels by adding deltaBlue, deltaGreen, and deltaRed to the last computed RGB values. Save the extra-precision intermediate results.
4     graphics        Check the newZ values with the oldZ values, setting pixel mask bits for the pixels that are to be stored into the frame buffer. Save the newZ values to put back into the z buffer.
5     integer         Store the pixels using the pixel mask (PM) register.
6     integer         Store the newZ values into the z buffer.
`
Since the integer unit can execute in parallel with the
graphics unit, the six steps in Table 2 plus the loop control
can be “folded” into a parallel program, as shown
in Figure 8.
It is beyond the scope of this article to describe all of
the details of the graphics hardware.⁶ However, the code
fragment in Figure 8 shows the important inner loop:
Gouraud shading one scan line of a 3D triangle. Each
pass through the loop computes 64 bits’ worth of z values
(i.e., four 16-bit z values) and computes 64 bits’ worth of
pixels (i.e., four 16-bit pixels).

// Some boundary cases taken care of by this point in the code.
// Scan line end point z and pixel color values already computed.
// Color Δblue, Δgreen, Δred values already computed for the scan line.
// Depth Δz value already computed for the scan line.
// First set of oldz values already loaded.
// The "d." in front of the graphics instructions indicates dual
// instruction execution mode.
// The original figure sets the two instruction streams side by side.
// Here each pair is listed as graphics unit instruction first,
// integer unit instruction second.

loop:
    d.faddz   z, Δz1, z               // interpolate 1st two z values
                                      // (double precision math)
    fld.d     8(zbuffer), oldz        // load oldz values for next iteration
                                      // cache hit 3 out of 4 times

    d.faddz   z, Δz2, z               // interpolate 2nd two z values
    andh      0x8000, pixelcount      // pixelcount negative?

    d.form    f0, newz                // collect the newz's.
                                      // OR into MERGE register
    bnc       extrapixels             // leftover 1, 2 or 3 pixels

    d.fzchks  oldz, newz, newz        // compare newz with oldz.
                                      // put lower value into newz.
                                      // set pixel mask register.
    nop                               // spare instruction slot

    d.fzchks  f0, f0, f0              // shift Pixel Mask reg right by 4
    xor       r0, pixelcount, r0      // count=0?

    d.faddp   blue, Δblue, blue       // interpolate 4 blue values
    fst.d     newz, 8(zbuffer)++      // store newz values (cache hit).
                                      // increment zbuffer pointer by 8

    d.faddp   green, Δgreen, green    // interpolate 4 green values
    bc        alignededge             // pixels came out even

    d.faddp   red, Δred, red          // interpolate 4 red values
    nop                               // spare instruction slot

    d.form    f0, newi                // collect the pixels
                                      // OR into MERGE register and
                                      // put result into newi
    bla       neg4, pixelcount, loop  // decrement count by 4 pixels,
                                      // sets cond code "lcc"
                                      // conditionally go to "loop"

    d.fnop                            // spare instruction slot
    pst.d     newi, 8(fbuffer)++      // store pixels (under PM mask).
                                      // increment fbuffer pointer by 8.
                                      // cache miss with no cache update.
                                      // (delay slot instruction)

// End of the loop, now take care of ending boundary cases.

Figure 8. Inner loop for Gouraud shading a scan line, including z-buffer hidden-surface removal.
Because of the parallel nature of the code, the loop executes
twice to complete all of the calculations for a particular
set of pixels. The first integer instruction loads the
values from the z-buffer to be used for comparison with
the new interpolated z values on the next iteration. The
graphics instruction stream then interpolates the new z
values and checks the old z values with the interpolated z
values (fzchks), setting the pixel mask (PM) register as
appropriate. Then the graphics instruction stream interpolates
the color values and computes the new intensities
(newi). Meanwhile, the integer instruction stream
checks to see if the count (pixelcount) has gone negative.
This means that there are one, two, or three pixels left
over at the right edge. Next the integer stream stores the
updated z values, and exits if the count has reached zero
(even number of pixels/64 bits). If not, the loop count is
decremented by four and the pixel values are stored
(pst.d) using the PM mask register. Full details of these
and other instruction operations are available
elsewhere.⁶
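For readers who do not follow i860 assembly, the six steps of Table 2 can be written as a sequential Python model. This is a sketch of the algorithm only, not of the folded, dual-instruction code: it processes four pixels per pass, uses 16 fraction bits for the interpolated z and color values, and packs pixels as 6,6,4 with red in the high-order bits. The buffer layout, parameter names, and function name are hypothetical.

```python
FRAC = 16  # fraction bits carried by the interpolated z and color values


def shade_scan_line(zbuf, fbuf, x0, count, z, dz, color, dcolor):
    """Gouraud shade `count` pixels starting at x0, with z-buffer checking.

    z, dz, color, and dcolor are fixed-point integers with FRAC fraction bits;
    color and dcolor are [red, green, blue] lists.
    """
    color = list(color)
    for base in range(x0, x0 + count, 4):
        group = range(base, min(base + 4, x0 + count))
        old_z = [zbuf[x] for x in group]                    # step 1: load old z
        new_z, pixels = [], []
        for _ in group:
            new_z.append(z >> FRAC)                         # step 2: interpolate z
            r, g, b = (c >> FRAC for c in color)            # step 3: interpolate colors
            pixels.append((r << 10) | (g << 4) | b)
            z += dz
            color = [c + d for c, d in zip(color, dcolor)]
        mask = [n <= o for n, o in zip(new_z, old_z)]       # step 4: z check
        for x, visible, pixel, nz in zip(group, mask, pixels, new_z):
            if visible:
                fbuf[x] = pixel                             # step 5: masked pixel store
                zbuf[x] = nz                                # step 6: store new z


if __name__ == "__main__":
    zbuf = [0xFFFF] * 8
    zbuf[2] = 100                      # something closer is already drawn here
    fbuf = [0] * 8
    shade_scan_line(zbuf, fbuf, 0, 8, 1000 << FRAC, 0,
                    [10 << FRAC, 0, 0], [1 << FRAC, 0, 0])
    print(fbuf)
    print(zbuf)
```

In the example, the pixel at position 2 keeps its earlier contents because the stored z value of 100 is closer than the new value of 1000.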
`This loop takes 11 clocks to interpolate four 16-bit
`pixels of color (6,6,4-R,G,B), interpolate four 16-bit z
`values, and check these four z values with the ones cur-
`rently in the z-buffer. This results in a peak rate of 14.5
`million pixels per second at 40 MHz. The average rate
`varies widely, depending on the efficiency of the setup
`code and the size of the triangles. Rates of 40,000
`100-pixel triangles per second are expected. The perfor-
`mance would be higher for 8-bit pixels and lower for
32-bit pixels. The z-buffer depth can be 16 or 32 bits. The
8-, 16-, or 32-bit pixels and 16- or 32-bit z depth can be
used in any combination.
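The peak and expected rates follow from the loop timing. The short Python calculation below is illustrative, with hypothetical names; it uses only the figures quoted above: four pixels per 11-clock pass at 40 MHz, and 40,000 triangles of 100 pixels each per second.

```python
CLOCK_HZ = 40_000_000  # 40 MHz


def peak_pixels_per_second(pixels_per_pass: int, clocks_per_pass: int,
                           clock_hz: int = CLOCK_HZ) -> float:
    """Peak pixel rate of an inner loop that shades a fixed number of pixels per pass."""
    return pixels_per_pass * clock_hz / clocks_per_pass


peak = peak_pixels_per_second(4, 11)
triangle_pixels = 40_000 * 100   # expected triangle rate times pixels per triangle

if __name__ == "__main__":
    print(round(peak / 1e6, 1), "million pixels per second peak")
    print(triangle_pixels / 1e6, "million pixels per second at the expected triangle rate")
```

The expected triangle rate corresponds to about 4 million pixels per second, well under the peak, which is consistent with the remark that setup code and triangle size dominate the average.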
The graphics unit shown in Figure 1 makes highly
leveraged use of silicon by reusing the
data paths for the floating-point hardware and adding
only about 3 percent to the total die area. The very high
`speed comes from the dual-instruction execution and
`from the multiple pixels and z interpolations performed
`by the graphics instructions.
`
`System configurations
There are many ways to use the i860 processor. Figure
9 shows how a PC or workstation can be upgraded for
technical applications through add-in cards. The highlighted
area shows the processor with a 2 to 8 Mbyte or
larger private memory and a frame buffer. This configuration,
with the host processor operating system, will
support scientific visualization such as interactive computation
and viewing of particle flow over airfoils (computational
fluid dynamics). In a smaller memory
configuration, the graphics pipeline, including the pixel
interpolation, can be done on the add-in cards, providing
reasonable 3D performance at a lower system cost.
The application can run on the host CPU. This configuration
provides the lowest incremental cost for an existing
workstation or PC.

[Figure 9. A cost-effective graphics subsystem for adding 3D imaging and numeric computation capability to a PC or workstation platform.]

With more memory, this configuration also supports
application acceleration, where the application is
numerically intensive. Because of the high level of component
integration, application acceleration can be
accomplished by one or two add-in cards.
Figure 10 diagrams a 3D workstation where the only
processor runs the operating system, the computationally
intensive application, and the 3D imaging. This provides
the lowest cost workstation configuration and is
similar in philosophy to the Stellar GS1000. This configuration
has the advantage of making the floating-point
performance generally available to all applications run
on the workstation. A second advantage comes from the
ability to share the often substantial memory between the
application execution and the graphics tasks.

[Figure 10. 3D technical workstation based on a single CPU.]
`
`
`

`
Conclusion
This article has briefly described the general-purpose
i860 processor, specifically highlighting the features that
apply to 3D graphics applications. Very high speed
floating-point performance is obtained from concurrently
executing add and multiply units, supported with
very wide data paths and on-chip caches. This processor
is unique because of special graphics instructions
that provide realistic, interactive image rendering using
Gouraud shading and z-buffering.
The i860 processor provides new opportunities for
innovation in products and software. The capability it
offers on a single chip means that a new generation of
3D workstations can be built and that high-performance
add-in boards with a small form factor are feasible. The
result is the potential for a new level of affordable performance
on the desktop.

References
1. M.D. Ercegovac and T. Lang, “Vector Processing,” in Supercomputers, Class VI Systems, Hardware and Software, S. Fernbach, ed., Elsevier, New York, 1986, pp. 29-57.
2. L. Kohn and J. Grimes, “A New Microprocessor with Vector Processing Capabilities,” Professional Program Session Record Electro 89, Session No. 16, IEEE, New York, 1989, pp. 4/14/13.
3. L. Kohn and S.-W. Fu, “A 1,000,000 Transistor Microprocessor,” Digest of Tech. Papers Int’l Solid State
