The single-chip i860 CPU executes parallel instructions, drawing mainly on RISC architectural concepts. The ... mm x 15 mm processor (see Figure 1) delivers balanced integer, floating-point, and graphics performance, and was designed with next-generation CAD tools and 1-micrometer semiconductor technology. To accommodate our performance goals, we partitioned the chip between blocks for integer operations, floating-point operations, and instruction and data cache memories. Inclusion of the RISC (reduced instruction set computing) core, floating-point units, and caches on one chip lets us design wider internal buses, eliminate interchip communication overhead, and offer higher performance. As a result, the i860 avoids off-chip delays and allows users to scale the clock beyond the current 33- and 40-MHz speeds.
We designed the i860 for performance-driven applications such as workstations, minicomputers, application accelerators for existing processors, and parallel supercomputers. The i860 CPU design began with the specification of a general-purpose RISC integer core. However, we felt it necessary to go beyond the traditional 32-bit, one-instruction-per-clock RISC processor. A 64-bit architecture provides the data and instruction bandwidth needed to support multiple operations in each clock cycle. The balanced performance between integer and floating-point computations produces the raw computing power required to support demanding applications such as modeling and simulations.
Finally, we recognized a synergistic opportunity to incorporate a 3D graphics unit that supports interactive visualization of results. The architecture of the i860 CPU provides a complete platform for software vendors developing i860 applications.

A million-transistor budget helps this RISC deliver balanced MIPS, Mflops, and graphics performance with no data bottlenecks.

Architecture overview. The i860 CPU includes the following units on one chip (see Figure 2):

the RISC integer core,
a memory management unit with paging,
a floating-point control unit,
a floating-point adder unit,
a floating-point multiplier unit,
a 3D graphics unit,
a 4-Kbyte instruction cache,
an 8-Kbyte data cache, and
a bus control unit.

Les Kohn and Neal Margulis
Intel Corp.

IEEE Micro, August 1989. 0272-1732/89/0800-0015$01.00 © 1989 IEEE

Figure 1. Die photograph of the i860 CPU.

Parallel execution. To support the performance available from multiple functional units, the i860 CPU issues up to three operations each clock cycle. In single-instruction mode, the processor issues either a RISC core instruction or a floating-point instruction each cycle. This mode is useful for code dominated by scalar operations, such as operating system routines.
In dual-instruction mode, the RISC core fetches two 32-bit instructions each clock cycle using the 64-bit-wide instruction cache. One 32-bit instruction moves to the RISC core, and the other moves to the floating-point section for parallel execution. This mode allows the RISC core to keep the floating-point units fed, by fetching and storing data and performing loop control, while the floating-point section operates on the data.
The floating-point instructions include a set of operations that initiate both an add and a multiply. The add and multiply, combined with the integer operation, result in three operations each clock cycle. With this fine-grained parallelism, the architecture can support traditional vector processing through software libraries that implement a vector instruction set. The inner loops of the software vector routines operate up to the peak floating-point hardware rate of 80 million floating-point operations per second. Consistent with RISC philosophy, the i860 CPU achieves the performance of hardware vector instructions without their complex control logic. The fine-grained parallelism can also be used in other parallel algorithms that cannot be vectorized.

Register and addressing model. The i860 microprocessor contains separate register files for the integer and floating-point units to support parallel execution. In addition to these register files, as can be seen in Figure 3, there are six control registers and four special-purpose registers. The RISC core contains the integer register file of thirty-two 32-bit registers, designated R0 through R31 and used for storing addresses or data. The floating-point control unit contains a separate set of thirty-two 32-bit floating-point registers, designated F0 through F31. These registers can be addressed individually, as sixteen 64-bit registers, or as eight 128-bit registers. The integer register file has three ports. The floating-point register file has five ports, which allow it to serve as a data staging area for performing loads and stores in parallel with floating-point operations.
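The three views of the floating-point register file can be pictured as aliased storage. A minimal C sketch (illustrative only, not i860 code; it ignores the alignment rules the hardware imposes on register numbers):

```c
#include <stdint.h>

/* One block of storage, addressable three ways, mirroring the
 * F0-F31 file's 32-, 64-, and 128-bit register views. */
typedef union {
    uint32_t f32[32];                /* thirty-two 32-bit registers */
    uint64_t f64[16];                /* sixteen 64-bit registers    */
    struct { uint64_t lo, hi; } f128[8]; /* eight 128-bit registers */
} fp_regs;
```

All three arrays overlay the same 128 bytes, so a 128-bit load fills four 32-bit registers at once.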
The i860 operates on standard integer and floating-point data, as well as on pixel data formats for graphics operations. All operations on the integer registers execute on 32-bit data, as signed or unsigned operations; additional add and subtract instructions operate on 64-bit long words. All 64-bit operations occur in the floating-point registers.
The i860 microprocessor supports a paged virtual address space of four gigabytes. Data and instructions can be stored anywhere in that space, and multibyte data values are addressed by specifying their lowest addressed byte. Data must be accessed on boundaries that are multiples of their size. For example, two-byte data must be aligned to an address divisible by two, four-byte data to an address divisible by four, and so on, up to 16-byte data values. Data in memory can be stored in either little-endian or big-endian format. (Little-endian format stores the least significant byte, D7-D0, at the lowest memory address, while big-endian stores the most significant byte there.) Code is always stored in little-endian format. Support for big-endian data allows the processor to operate on data produced by a big-endian processor without performing a lengthy data conversion.
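The alignment rule and the byte-order conversion that big-endian support avoids can be sketched in C (illustrative code, not i860-specific):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Natural alignment rule from the text: an N-byte value must sit
 * at an address divisible by N (N = 2, 4, 8, 16). */
bool is_naturally_aligned(uintptr_t addr, size_t size) {
    return (addr % size) == 0;
}

/* Byte-swap a 32-bit word: the software conversion that hardware
 * big-endian support makes unnecessary. */
uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}
```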
Figure 2. Functional units and data paths of the i860 microprocessor. (The block diagram shows the 4-Kbyte instruction cache, the 8-Kbyte data cache, the memory management unit, and the bus control unit with its 32-bit external address path; the RISC core with the core registers; and the floating-point control unit with the floating-point registers feeding the adder and multiplier units and the KR, KL, and Merge special registers. The core and floating-point instruction paths are 32 bits wide, the data paths are 64 bits wide, and the cache data path is 128 bits wide.)

RISC core
The RISC core fetches both integer and floating-point instructions. It executes load, store, integer, bit, and control transfer instructions. Table 1 lists the full instruction set, with the 42 core unit instructions and their mnemonics in the left column. All instructions are 32 bits long and follow the load/store, three-operand style of traditional RISC designs. Only load and store instructions operate on memory; all other instructions operate on registers. Most instructions allow users to specify two source registers and a third register for storing the results.
A key feature of the core unit is its ability to execute most instructions in one clock cycle. The RISC core contains a pipeline consisting of four stages: fetch, decode, execute, and write. We used several techniques to hide clock cycles of instructions that may take more
time to complete. Integer register loads from memory take one execution cycle, and the next instruction can begin on the following cycle.
The processor uses a scoreboarding technique to guarantee proper operation of the code and allow the highest possible performance. The scoreboard keeps a history of which registers await data from memory. The actual loading of data takes one clock cycle if the data is held in the cache and ready for access, but several cycles if it is in main memory. Using scoreboarding, the i860 microprocessor continues execution unless a subsequent instruction attempts to use the data before it is loaded. That condition causes execution to freeze. An optimizing compiler can organize the code so that freezing rarely occurs, by not referencing the load data in the following cycle. Because the hardware implements scoreboarding, it is never necessary to insert NO-OP instructions.
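The freeze test can be modeled as a toy in C (illustrative only; the real scoreboard is hardware, and the register encodings here are arbitrary):

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy load scoreboard: a bitmask marks registers still awaiting
 * load data. Execution freezes only when an instruction reads a
 * pending register; otherwise it proceeds, with no NO-OPs needed. */
typedef struct { uint32_t pending; } Scoreboard;

void issue_load(Scoreboard *sb, int reg)     { sb->pending |=  (1u << reg); }
void load_completes(Scoreboard *sb, int reg) { sb->pending &= ~(1u << reg); }

bool must_freeze(const Scoreboard *sb, int src1, int src2) {
    return (sb->pending & ((1u << src1) | (1u << src2))) != 0;
}
```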

We included several control flow optimizations in the core instruction set. The conditional branch instructions have variations with and without a delay slot. A delay slot allows the processor to execute an instruction following a branch while it is fetching from the branch target. Having both delayed and nondelayed variations of branch instructions allows the compiler to optimize the code easily, whether a branch is likely to be taken or not. Test-and-branch instructions execute in one clock cycle, a savings of one cycle when testing special cases. Finally, another one-cycle loop-control instruction efficiently handles tight loops, such as those in vector routines.
Instead of providing a limited set of locked operations, the RISC core provides lock and unlock instructions. With these two instructions, a sequence of up to 32 instructions can be interlocked for multiprocessor synchronization. Thus, traditional test-and-set operations as well as more sophisticated operations, such as compare and swap, can be performed.

Figure 3. Register set. (The diagram shows the thirty-two 32-bit integer registers R0-R31, the thirty-two 32-bit floating-point registers F0-F31, the special-purpose floating-point registers KR, KL, and Merge, and the control registers, including the page directory base, data breakpoint, and floating-point status registers.)

Table 1.
Instruction-set summary.

Core unit

Load and store instructions
LD.X       Load integer
ST.X       Store integer
FLD.Y      F-P load
PFLD.Z     Pipelined F-P load
FST.Y      F-P store
PST.D      Pixel store

Register-to-register moves
IXFR       Transfer integer to F-P register
FXFR       Transfer F-P to integer register

Integer arithmetic instructions
ADDU       Add unsigned
ADDS       Add signed
SUBU       Subtract unsigned
SUBS       Subtract signed

Shift instructions
SHL        Shift left
SHR        Shift right
SHRA       Shift right arithmetic
SHRD       Shift right double

Logical instructions
AND        Logical AND
ANDH       Logical AND high
ANDNOT     Logical AND NOT
ANDNOTH    Logical AND NOT high
OR         Logical OR
ORH        Logical OR high
XOR        Logical exclusive OR
XORH       Logical exclusive OR high

Control-transfer instructions
TRAP       Software trap
INTOVR     Software trap on integer overflow
BR         Branch direct
BRI        Branch indirect
BC         Branch on CC
BC.T       Branch on CC taken
BNC        Branch on not CC
BNC.T      Branch on not CC taken
BTE        Branch if equal
BTNE       Branch if not equal
BLA        Branch on LCC and add
CALL       Subroutine call
CALLI      Indirect subroutine call

System control instructions
FLUSH      Cache flush
LD.C       Load from control register
ST.C       Store to control register
LOCK       Begin interlocked sequence
UNLOCK     End interlocked sequence

Floating-point unit

Floating-point multiplier instructions
FMUL.P     F-P multiply
PFMUL.P    Pipelined F-P multiply
PFMUL3.DD  Three-stage pipelined F-P multiply
FMLOW.P    F-P multiply low
FRCP.P     F-P reciprocal
FRSQR.P    F-P reciprocal square root

Floating-point adder instructions
FADD.P     F-P add
PFADD.P    Pipelined F-P add
FSUB.P     F-P subtract
PFSUB.P    Pipelined F-P subtract
PFGT.P     Pipelined F-P greater-than compare
PFEQ.P     Pipelined F-P equal compare
FIX.P      F-P to integer conversion
PFIX.P     Pipelined F-P to integer conversion
FTRUNC.P   F-P to integer truncation
PFTRUNC.P  Pipelined F-P to integer truncation
PFLE.P     Pipelined F-P less-than-or-equal compare
FAMOV      F-P adder move
PFAMOV     Pipelined F-P adder move

Dual-operation instructions
PFAM.P     Pipelined F-P add and multiply
PFSM.P     Pipelined F-P subtract and multiply
PFMAM      Pipelined F-P multiply with add
PFMSM      Pipelined F-P multiply with subtract

Long-integer instructions
FLADD.Z    Long-integer add
PFLADD.Z   Pipelined long-integer add
FLSUB.Z    Long-integer subtract
PFLSUB.Z   Pipelined long-integer subtract

Graphics instructions
FZCHKS     16-bit z-buffer check
PFZCHKS    Pipelined 16-bit z-buffer check
FZCHKL     32-bit z-buffer check
PFZCHKL    Pipelined 32-bit z-buffer check
FADDP      Add with pixel merge
PFADDP     Pipelined add with pixel merge
FADDZ      Add with z merge
PFADDZ     Pipelined add with z merge
FORM       OR with merge register
PFORM      Pipelined OR with merge register

Assembler pseudo-operations
MOV        Integer register-register move
FMOV.Q     F-P register-register move
PFMOV.Q    Pipelined F-P register-register move
NOP        Core no-operation
FNOP       F-P no-operation

Abbreviations: CC = condition code; F-P = floating-point; LCC = load condition code.
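What a LOCK/UNLOCK-bracketed sequence provides can be illustrated with C11 atomics (an analogue only, not i860 code): a compare-and-swap built from an interlocked read-modify-write.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Atomically: if *addr holds expected, store desired and report
 * success; otherwise leave *addr unchanged and report failure. */
bool compare_and_swap(atomic_int *addr, int expected, int desired) {
    return atomic_compare_exchange_strong(addr, &expected, desired);
}
```

On the i860 the same effect would come from a load, compare, and conditional store bracketed by LOCK and UNLOCK.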
The RISC core also executes a pixel store instruction. This instruction operates in conjunction with the graphics unit to eliminate hidden surfaces. Other instructions transfer data between the integer and floating-point registers, examine and modify the control registers, and flush the data cache.
The six control registers accessible by core instructions are the

PSR (processor status),
EPSR (extended processor status),
DB (data breakpoint),
FIR (fault instruction),
Dirbase (directory base), and
FSR (floating-point status) registers.

The PSR contains state information relevant to the current process, such as trap-related and pixel information. The EPSR contains additional state information for the current process, such as the processor type, stepping, and cache size. The DB register generates data breakpoints when the breakpoint is enabled and the address matches. The FIR stores the address of the instruction that causes a trap. The Dirbase register contains the control information for caching, address translation, and bus options. Finally, the FSR contains the floating-point trap and rounding-mode status for the current process. The four special-purpose registers are used with the dual-operation floating-point instructions (described later).
The core unit executes all loads and stores, including those to the floating-point registers. Two types of floating-point loads are available: FLD (floating-point load) and PFLD (pipelined floating-point load). The FLD instruction loads the floating-point register from the cache, or loads the data from memory and fills the cache line if the data is not in the cache. Up to four floating-point registers can be loaded from the cache in one clock cycle. This ability to perform 128-bit loads or stores in one clock cycle is crucial to supplying data at the rate needed to keep the floating-point units executing. The FLD instruction serves scalar floating-point routines, vector data that fits entirely in the cache, and sections of large data structures that are going to be reused.
For accessing data structures too large to fit into the on-chip cache, the core uses the PFLD instruction. On a cache miss, the pipelined load places data directly into the floating-point registers without placing it in the data cache. This avoids displacing data already in the cache that will be reused. Similarly, on a store miss, the data writes through to memory without allocating a cache block. Thus, we avoid data cache thrashing, a crucial factor in achieving high sustained performance in large vector calculations.
PFLD also allows up to three accesses to be issued on the pipelined external bus before the data from the first cache miss is returned. The pipelined loads occur directly from memory and do not cause extra bus cycles to fill the cache line, avoiding bus accesses for data that is not needed. The full bus bandwidth of the external bus can be used even while cache misses are being processed. Autoincrement addressing, with an arbitrary increment, increases the flexibility and performance of accesses to data structures.
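The caching policy this implies can be sketched as a decision rule (illustrative C; the names and the policy function are ours, not Intel's):

```c
#include <stddef.h>

#define DCACHE_BYTES (8 * 1024)  /* i860 on-chip data cache size */

/* Use cacheable loads (FLD) for data that fits in, and will be
 * reused from, the 8-Kbyte cache; use pipelined cache-bypassing
 * loads (PFLD) to stream through larger structures, avoiding
 * data cache thrashing. */
typedef enum { USE_FLD, USE_PFLD } load_policy;

load_policy choose_load(size_t working_set_bytes, int will_reuse) {
    if (will_reuse && working_set_bytes <= DCACHE_BYTES)
        return USE_FLD;
    return USE_PFLD;   /* streaming: leave the cache untouched */
}
```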

Memory management
The i860's on-chip memory management unit implements the basic features needed for paged virtual memory management and page-level protection. We intentionally duplicated the memory management technique of the 386 and 486 microprocessors' paging system. In this way we can be sure that the processors coexist easily in a common operating environment. The similar MMUs also make it possible to reuse paging and virtual memory software written in C.
The address translation process maps virtual address space onto physical address space in fixed-size blocks called pages. While paging is enabled, the processor translates a linear address to a physical address using page tables. As in mainframes, the i860 CPU page tables are arranged in a two-level hierarchy (see Figure 4). The directory table base (DTB), which is part of the Dirbase register, points to the page directory. This one-page-long directory contains address entries for 1,024 page tables. The page tables are also one page long, and their entries describe 1,024 pages. Each page is 4 Kbytes in size.
Figure 4 also shows the translation from a virtual address to a physical address. The processor uses the upper 10 bits of the linear address as an index into the directory. Each directory entry contains 20 bits of addressing information, part of which is the address of a page table. The processor uses these 20 bits and the middle 10 bits of the linear address to form the page table address. The address contents of the page table entry and the lower 12 bits (nine address bits and the byte enables) of the linear address form the 32-bit physical address.
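The field arithmetic can be sketched in C (a hedged sketch of the indexing only; the actual table walk and entry formats are omitted):

```c
#include <stdint.h>

/* Split a 32-bit linear address into the three fields the text
 * describes: 10 directory-index bits, 10 page-table-index bits,
 * and a 12-bit offset within the 4-Kbyte page. */
typedef struct { uint32_t dir, table, offset; } linear_fields;

linear_fields split_linear(uint32_t la) {
    linear_fields f;
    f.dir    = la >> 22;            /* bits 31..22 */
    f.table  = (la >> 12) & 0x3FF;  /* bits 21..12 */
    f.offset = la & 0xFFF;          /* bits 11..0  */
    return f;
}

/* Combine a 4-Kbyte-aligned page frame address with the offset. */
uint32_t make_physical(uint32_t page_frame_base, uint32_t offset) {
    return (page_frame_base & 0xFFFFF000u) | (offset & 0xFFFu);
}
```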
The operating system creates the paging tables and stores them in memory when it creates the process. If the processor had to access these page tables in memory each time a reference was made, performance would suffer greatly. To save the overhead of the page table lookups, the processor automatically caches mapping information for the 64 most recently used pages in an on-chip, four-way, set-associative translation lookaside buffer (TLB). Each of the TLB's 64 entries covers 4 Kbytes, for a total coverage of 256 Kbytes of memory addresses. The TLB can be flushed by setting a bit in the Dirbase register.
`
`

`

Figure 4. Virtual-to-physical address translation. (The linear address splits into directory, page, and offset fields; the directory field indexes the page directory, the page field indexes the selected page table, and the page table entry combined with the offset forms the physical address.)

Figure 5. Format of a page table entry. (The entry holds the page frame address in bits 31 . . . 12, bits available to the systems programmer, and the Dirty, Accessed, Cache disable, Write-through, User, Writable, and Present flag bits. X indicates Intel reserved; do not use.)


Only when the processor does not find the mapping information for a page in the TLB does it perform a page table lookup from information stored in memory. When a TLB miss does occur, the processor performs the TLB entry replacement entirely in hardware. The hardware reads the virtual-to-physical mapping information from the page directory and page table entries, and caches this information in the TLB.
Figure 5 shows the format of a page table entry. Paging protects supervisor memory from user accesses and also permits write protection of pages. The U (user) and W (write) bits control the access rights. The operating system can allow a user program to have read and write, read-only, or no access to a given page or page group. If a memory access violates the page protection attributes, such as U-level code writing a
read-only page, the system generates an exception. While at the user level, the system ignores store instructions to certain control registers.
The U bit of the PSR is set to 0 when executing at the supervisor level, at which all present pages are readable. Normally, at this level, all pages are also writable. To support a memory management optimization called copy-on-write, the processor provides the write-protection (WP) bit of the EPSR. With WP set, any write to a page whose W bit is not set causes a trap, allowing an operating system to share pages between tasks without making a new copy of a page until it is written.
Of the two remaining control bits, cache disable (CD) and write-through (WT), one is reflected on the page table bit (PTB) output pin, depending on the setting of the page table bit mode (PBM) bit in the EPSR. The WT bit, CD bit, and KEN# cache enable pin are internally NORed to determine cachability. If either of these bits is set to one, the processor will not cache that page of data. For systems that use a second-level cache, these bits can be used to manage a second-level coherent cache, with no shared data cached on chip. In addition to software control of cachability, the KEN# hardware signal can be used to disable cache reads.
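The cachability decision can be sketched as a boolean rule (illustrative C; KEN# is an active-low pin, modeled here simply as an "enabled" flag):

```c
#include <stdbool.h>

/* A page is cacheable only if the page table's CD and WT bits are
 * both clear and the KEN# cache enable input is asserted. */
bool page_cacheable(bool cd_bit, bool wt_bit, bool ken_enabled) {
    if (cd_bit || wt_bit)   /* either bit set: do not cache the page */
        return false;
    return ken_enabled;
}
```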

Floating-point unit
Floating-point unit instructions, as listed in Table 1, support both single-precision real and double-precision real data. Both types follow the ANSI/IEEE 754 standard.¹ The i860 CPU hardware implements all four IEEE rounding modes. The special values infinity, NaN (not a number), indefinite, and denormal generate a trap when encountered, and the trap handler produces an IEEE-standard result. Double-precision real data occupies two adjacent floating-point registers, with bits 31 . . . 0 stored in an even-numbered register and bits 63 . . . 32 stored in the adjacent, higher odd-numbered register.
The floating-point unit includes three-stage pipelined add and multiply units. For single-precision data, each unit can produce one result per clock cycle, for a peak rate of 80 Mflops at a 40-MHz clock speed. For double-precision data, the multiplier can produce a result every other cycle, while the adder produces a result every cycle, for a peak rate of 60 million floating-point operations per second. The double-precision peak is 40 Mflops if an algorithm has an even distribution of multiplies and adds. Reducing the double-precision multiply rate saves half of the multiplier tree and is consistent with the data bandwidth available for double-precision operations.
To save silicon area, we did not include a floating-point divide unit. Instead, software performs floating-point divide and square-root operations. Newton-Raphson algorithms use an 8-bit seed provided by a

(a) Scalar mode, data-dependent code:

        DO 10, I = 1, 100
10        X = X * A + C

        FMUL X, A, temp
        FADD temp, C, X

    One result per 6 clock cycles.

(b) Pipelined mode, vector code:

        DO 10, I = 1, 100
10        X[I] = A[I] * B[I] + C

        M12TPM A[I], B[I], X[I - 6]

    One result per clock cycle.

Figure 6. Floating-point execution models: data-dependent code in scalar mode (a) and vector code in pipelined mode (b).

Figure 7. Dual-operation data paths. (The register file supplies SRC1 and SRC2; operands and results are routed among the multiplier unit, the adder unit, the special registers, and RDEST.)
hardware lookup table. Full IEEE rounding can be implemented by using an instruction that returns the low-order bits of a floating-point multiply. These algorithms can take advantage of the pipeline, allowing the 16-bit reciprocals used in many graphics calculations to be performed either in 10 clock cycles or in four pipelined cycles.
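The software-divide approach can be sketched in C (a hedged sketch: the i860's 8-bit hardware seed is replaced here by a crude linear approximation, and only positive inputs are handled):

```c
#include <math.h>

/* Newton-Raphson reciprocal: start from a low-precision seed and
 * refine with x = x * (2 - a*x); each iteration roughly doubles
 * the number of correct bits. Valid for a > 0. */
double recip_newton(double a, int iters) {
    int e;
    double m = frexp(a, &e);                   /* a = m * 2^e, 0.5 <= m < 1 */
    double x = ldexp(2.9142135 - 2.0 * m, -e); /* seed: linear approx of 1/a */
    for (int i = 0; i < iters; i++)
        x = x * (2.0 - a * x);                 /* Newton step */
    return x;
}
```

A divide a/b then becomes a * recip_newton(b, n), with the final rounding handled by the multiply-low instruction the text mentions.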
The floating-point instruction set supports two computation models, scalar and pipelined. In scalar mode, a new floating-point instruction does not start processing until the previous floating-point instruction completes. This mode is used when a data dependency exists between the operations or when a compiler does not schedule for the pipeline. In the scalar-mode example of Figure 6, each iteration of the DO loop requires the result of the previous iteration and takes six cycles to execute.
In pipelined mode, the same operation can produce a result every clock cycle, and the CPU pipeline stages are exposed to software. The software issues a new floating-point operation to the first stage of the pipeline and gets back the result of the last stage of the pipeline. Destination registers are specified not when the operation begins but when the result becomes available. This explicit pipelining avoids tying up valuable floating-point registers for results, so the registers can still be used while the pipeline is busy. Implicit pipelining, using scoreboarding, would cause the registers to become the bottleneck in the floating-point unit.
Pipelining also takes place in a dual-operation mode in which an add and a multiply process in parallel. Figure 7 shows the adder unit, the multiplier unit, the special registers, and the dual-operation data paths. Dual-operation instructions require six operands. The register file provides three of the operands, and the special registers and the interunit bypasses provide the remaining three. The instruction encodings specify the source and destination paths for the units.
Referring back to the pipelined-mode example of Figure 6, note that we show the dual-operation instruction M12TPM SRC1, SRC2, RDEST as M12TPM A[I], B[I], X[I - 6]. (The M12TPM mnemonic is a variation of the PFAM instruction.) This instruction specifies that the multiply is initiated with SRC1 and SRC2 as the operands. It also specifies that the add is initiated with the result from the multiply and the T register as the operands, and that RDEST stores the result from the add. Because of the three stages each of the add and multiply pipelines, the available result comes from the operation that started six clock cycles previously.
There are 32 variations of dual-operation instructions. Applications such as fast Fourier transforms, graphics transforms, and matrix operations can be implemented efficiently with these instructions. Some apparently scalar operations, such as adding a series of numbers, can also take advantage of the pipelining capability.
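That last case works by keeping several independent partial results in flight. A C analogue (illustrative only) of summing a series without serializing every add on one running total:

```c
/* Sum with three partial accumulators, the software analogue of
 * keeping the i860's three-stage adder pipeline full instead of
 * making each add wait for the previous one. */
double pipelined_sum(const double *a, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0;
    int i = 0;
    for (; i + 2 < n; i += 3) {   /* three independent add chains */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
    }
    for (; i < n; i++)            /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2;          /* drain the "pipeline" */
}
```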

Figure 8. Dual-instruction-mode transitions. (A floating-point instruction with the D-bit set causes the processor to execute one more 32-bit instruction and then begin issuing 64-bit CORE-OP/FP-OP pairs; a floating-point instruction with the D-bit clear initiates the exit, and the processor executes one more instruction pair before returning to single-instruction mode. A single d.FP-OP produces a temporary dual-instruction mode.)

The i860 microprocessor can provide its fast floating-point hardware with the necessary data bandwidth to achieve peak performance in the inner loops of common routines. The dual-instruction mode allows the processor to perform up to 128-bit data loads and stores at the same time it executes a multiply and an add. Figure 8 shows the dual-instruction-mode transitions for an extended sequence of instruction pairs and for a single instruction pair. Programs specify dual-instruction mode in two ways. They can either include a "d." prefix in the mnemonic of a floating-point instruction or use the assembler directives .dual and .enddual. Either method causes the dual bit, or D-bit, of the floating-point instruction to be set. If the processor, while executing in single-instruction mode, encounters a floating-point instruction with the D-bit set, it executes one more 32-bit instruction before beginning dual-instruction execution. In dual-instruction mode, a floating-point instruction could encounter a clear D-bit. The processor would then execute one more instruction pair before returning to single-instruction mode.
The floating-point hardware also performs integer multiplies and long-integer adds and subtracts. Integer multiplies by constants can be performed in the RISC core using shift instructions. To perform a full integer multiply, the processor transfers two integer registers with IXFR instructions. The FMLOW instruction performs the actual multiplication, and the FXFR instruction transfers the result back to the core. The total operation takes from four to nine clock cycles, depending on what other instructions can be overlapped.
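The constant-multiply path through the RISC core is just shifts and adds. For example (illustrative C; the decomposition of 10 into 8 + 2 is ours):

```c
#include <stdint.h>

/* Multiply by the constant 10 using only shifts and an add,
 * the RISC-core technique the text describes:
 * 10x = 8x + 2x = (x << 3) + (x << 1). */
uint32_t mul10(uint32_t x) {
    return (x << 3) + (x << 1);
}
```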

Graphics
The floating-point hardware of the CPU efficiently performs the transformation and advanced lighting calculations required for 3D graphics. The processor performs 500K transforms/second for 3 x 4 3D matrices, including the trivial-reject clipping and perspective calculations. Displaying a 3D image requires integer operations for shading and hidden-surface removal. The graphics unit hardware speeds these back-end rendering operations and writes directly into screen buffer memory. It uses the floating-point registers and operates in parallel with the core.
Graphics instructions take advantage of the 64-bit data paths and can operate on multiple pixels simultaneously, realizing 10 times the speed of the RISC core when performing shading. Instructions support 8-, 16-, and 24/32-bit pixels, operating respectively on eight, four, or two pixels simultaneously.
In 3D graphics, polygons generally represent the set of points on the surface of a solid object. During transformation, the graphics unit calculates only the vertices of the polygons. The unit knows the locations and color intensities of the vertices, but points between the vertices must be calculated. These points, along with their associated data, are called pixels. If a figure is displayed with only the vertices and simple lines, it appears as a wireframe drawing. The simplest wireframe drawing typically shows all vertices, even the ones that should be hidden from view by an overlapping polygon. To show shaded 3D images, the graphics unit must display the surfaces of the polygons. Where polygons overlap, it must display the polygon closest to the viewer.
In graphics calculations, the z value represents the distance of a pixel from the viewer. Although the depth of each polygon's vertices is known, to overlay polygons at points other than a vertex, the graphics unit must interpolate the depths from the bordering vertices. This step is called z interpolation. In this step the depths of all points of a polygon can be determined. For overlapping points, the z values of different polygons can be checked and only the pixel data of the polygon closest to the viewer displayed.
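A scalar sketch of that hidden-surface test (the FZCHKS hardware checks four 16-bit z values per 64-bit word at once; this one-pixel C version, with smaller z meaning closer, is an illustration, not the instruction's exact semantics):

```c
#include <stdint.h>

/* Classic z-buffer update: keep the incoming pixel only if it is
 * nearer the viewer than what the buffer already holds. */
void z_check(uint16_t *zbuf, uint16_t *fb,
             uint16_t z_new, uint16_t color_new, int i) {
    if (z_new < zbuf[i]) {   /* new pixel is closer to the viewer */
        zbuf[i] = z_new;
        fb[i]   = color_new;
    }
}
```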
To perform the procedure just described, the graphics instructions include intensity interpolation, z interpolation, and z-buffer checks. Intensity interpolation allows smooth linear changes in pixel intensity and color between vertices. This capability provides a smoother appearance than does flat shading of the polygons. The more data bits per pixel, the smoother the interpolation becomes. The i860 CPU graphics instructions support both Gouraud and higher-order shading techniques. Gouraud shading interpolates intensities along the scan lines. Figure 9 illustrates pixel interpolation for Gouraud shading of a
