The single-chip i860 executes parallel instructions, applying … architectural concepts. The … mm × 15 mm processor (see Figure 1) delivers integer, floating-point, and graphics performance, made possible by new-generation CAD tools and 1-micrometer semiconductor technology. To accommodate our performance goals, we partitioned the chip between blocks for integer operations, floating-point operations, and in-
`struction and data cache memories. Inclusion of the RISC (reduced instruc-
`tion set computing) core, floating-point units, and caches on one chip lets
`us design wider internal buses, eliminate interchip communication over-
`head, and offer higher performance. As a result, the i860 avoids off-chip
`delays and allows users to scale the clock beyond the current 33- and 40-
`MHz speeds.
`We designed the i860 for performance-driven applications such as work-
`stations, minicomputers, application accelerators for existing processors,
`and parallel supercomputers. The i860 CPU design began with the specifi-
`cation of a general-purpose RISC integer core. However, we felt it neces-
`sary to go beyond the traditional 32-bit, one-instruction-per-clock RISC
`processor. A 64-bit architecture provides the data and instruction band-
`width needed to support multiple operations in each clock cycle. The
`balanced performance between integer and floating-point computations
`produces the raw computing power required to support demanding applica-
`tions such as modeling and simulations.
`Finally, we recognized a synergistic opportunity to incorporate a 3D
`graphics unit that supports interactive visualization of results. The architec-
`ture of the i860 CPU provides a complete platform for software vendors
`developing i860 applications.
`
A million-transistor budget helps this RISC deliver balanced MIPS, Mflops, and graphics performance with no data bottlenecks.
`
`Architecture overview. The i860 CPU includes the following units on
`one chip (see Figure 2):
`the RISC integer core,
`a memory management unit with paging,
`a floating-point control unit,
`a floating-point adder unit,
`a floating-point multiplier unit,
`a 3D graphics unit,
`
`Les Kohn
`Neal Margulis
`
`Intel Corp.
`
0272-1732/89/0800-0015$01.00 © 1989 IEEE
`
`August 1989
`
`
`Authorized licensed use limited to: IEEE Staff. Downloaded on May 11,2023 at 12:54:01 UTC from IEEE Xplore. Restrictions apply.
`
`Realtek Ex. 1008
`Case No. IPR2023-00922
`Page 1 of 16
`
`
`
Intel i860
`
`Figure 1. Die photograph of the i860 CPU.
`
`a 4-Kbyte instruction cache,
`an 8-Kbyte data cache, and
`a bus control unit.
`
`Parallel execution. To support the performance
`available from multiple functional units, the i860 CPU
`issues up to three operations each clock cycle. In single-
`instruction mode, the processor issues either a RISC
`core instruction or a floating-point instruction each
`cycle. This mode is useful when the instruction per-
`forms scalar operations such as operating system
`routines.
`In dual-instruction mode, the RISC core fetches two
`32-bit instructions each clock cycle using the 64-bit-
`wide instruction cache. One 32-bit instruction moves to
`the RISC core, and the other moves to the floating-point
`section for parallel execution. This mode allows the
`RISC core to keep the floating-point units fed by fetch-
`ing and storing information and performing loop con-
`trol, while the floating-point section operates on the
`data.
`
IEEE Micro
`
`The floating-point instructions include a set of op-
`erations that initiate both an add and a multiply. The
add and multiply, combined with the integer operation,
`result in three operations each clock cycle. With this
`fine-grained parallelism, the architecture can support
`traditional vector processing by software libraries that
`implement a vector instruction set. The inner loops of
`the software vector routines operate up to the peak
`floating-point hardware rate of 80 million floating-
`point operations per second. Consistent with RISC
`philosophy, the i860 CPU achieves the performance of
`hardware vector instructions without the complex
`control logic of hardware vector instructions. The fine-
`grained parallelism can also be used in other parallel
`algorithms that cannot be vectorized.
`
`Register and addressing model. The i860 micro-
`processor contains separate register files for the integer
`and floating-point units to support parallel execution.
`In addition to these register files, as can be seen in
`Figure 3 on page 18, are six control registers and four
`special-purpose registers. The RISC core contains the
integer register file of thirty-two 32-bit registers, designated R0 through R31 and used for storing addresses
`or data. The floating-point control unit contains a sepa-
`rate set of thirty-two 32-bit floating-point registers
`designated FO through F31. These registers can be
`addressed individually, as sixteen 64-bit registers, or as
eight 128-bit registers. The integer register file has three ports. The floating-point register file has five ports, allowing it to serve as a data staging area for perform-
`ing loads and stores in parallel with floating-point
`operations.
`The i860 operates on standard integer and floating-
`point data, as well as pixel data formats for graphics
operations. All operations on the integer registers execute on 32-bit data as signed or unsigned operations; additional add and subtract instructions operate on 64-bit long words. All 64-bit operations occur in the
`floating-point registers.
`The i860 microprocessor supports a paged virtual
`address space of four gigabytes. Therefore, data and
`instructions can be stored anywhere in that space, and
`multibyte data values are addressed by specifying their
`lowest addressed byte. Data must be accessed on
`boundaries that are multiples of their size. For example,
two-byte data must be aligned to an address divisible by
`two, four-byte data on an address divisible by four, and
`so on, up to 16-byte data values. Data in memory can be
`stored in either little-endian or big-endian format.
(Little-endian format places the least significant byte, D7-D0, at the lowest memory address, while big-endian places the most significant byte there.) Code is
`always stored in little-endian format. Support for big-
`endian data allows the processor to operate on data
`produced by a big-endian processor, without perform-
`ing a lengthy data conversion.
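A short sketch can make the alignment and byte-order rules concrete. Python is used here purely as illustration (the name `check_alignment` is ours, not an i860 facility); the i860 itself enforces these rules in hardware:

```python
import struct

def check_alignment(address, size):
    """i860 rule: data must lie on a boundary that is a multiple of its size."""
    return address % size == 0

# A 32-bit value stored in the two byte orders the i860 supports for data.
value = 0x12345678
little = struct.pack("<I", value)  # least significant byte at lowest address
big = struct.pack(">I", value)     # most significant byte at lowest address

assert little == b"\x78\x56\x34\x12"
assert big == b"\x12\x34\x56\x78"

# Four-byte data must sit on an address divisible by four.
assert check_alignment(0x1000, 4)
assert not check_alignment(0x1002, 4)
```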
`
`
`
`
[Figure 2 block diagram: a 32-bit external address path and 64-bit data path connect through the bus control unit to the instruction cache (4 Kbytes), the data cache (8 Kbytes), and the memory management unit; 32-bit core and floating-point instruction paths and a 128-bit cache data path feed the RISC core (core registers) and the floating-point control unit (floating-point registers), which drives the adder and multiplier units and the KL, KR, and Merge registers over 64-bit SRC1/SRC2 paths.]
`Figure 2. Functional units and data paths of the i860 microprocessor.
`
`RISC core
`The RISC core fetches both integer and floating-
`point instructions. It executes load, store, integer, bit,
`and control transfer instructions. Table 1 on page 19
`lists the full instruction set with the 42 core unit instruc-
`tions and their mnemonics in the left column. All in-
`structions are 32 bits long and follow the load/store,
`three-operand style of traditional RISC designs. Only
`
`load and store instructions operate on memory; all other
`instructions operate on registers. Most instructions
`allow users to specify two source registers and a third
`register for storing the results.
`A key feature of the core unit is its ability to execute
`most instructions in one clock cycle. The RISC core
`contains a pipeline consisting of four stages: fetch,
`decode, execute, and write. We used several techniques
`to hide clock cycles of instructions that may take more
`
`
`
`
`
`time to complete. Integer register loads from memory
`take one execution cycle, and the next instruction can
`begin on the following cycle.
`The processor uses a scoreboarding technique to
`guarantee proper operation of the code and allow the
`highest possible performance. The scoreboard keeps a
`history of which registers await data from memory. The
`actual loading of data takes one clock cycle if it is held
`in the cache memory buffer available for ready access,
but several cycles if it is in main memory. Using scoreboarding, the i860 microprocessor continues execution unless a subsequent instruction attempts to
`use the data before it is loaded. This condition would
`cause execution to freeze. An optimizing compiler can
`organize the code so that freezing rarely occurs by not
`referencing the load data in the following cycle. Be-
`cause the hardware implements scoreboarding, it is
`never necessary to insert NO-OP instructions.
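The freeze rule can be sketched as a toy model. This is our illustration only (class and method names are invented); the real i860 tracks pending loads entirely in hardware:

```python
class Scoreboard:
    """Toy model of load scoreboarding: execution continues past a load,
    and freezes only if an instruction reads a register still in flight."""
    def __init__(self):
        self.pending = {}  # register -> cycle at which its load completes

    def issue_load(self, reg, now, latency):
        """Record that `reg` awaits data arriving `latency` cycles from `now`."""
        self.pending[reg] = now + latency

    def read(self, reg, now):
        """Return the number of freeze cycles a read of `reg` costs at `now`."""
        ready = self.pending.get(reg, now)
        return max(0, ready - now)

sb = Scoreboard()
sb.issue_load("r4", now=0, latency=3)   # load r4; data arrives at cycle 3
assert sb.read("r5", now=1) == 0        # unrelated register: no freeze
assert sb.read("r4", now=1) == 2        # use r4 too early: freeze 2 cycles
assert sb.read("r4", now=3) == 0        # scheduled late enough: no freeze
```

This is why a compiler that simply avoids referencing load data in the following cycle sees no freezes, and why explicit NO-OPs are never required.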
`
`We included several control flow optimizations in
`the core instruction set. The conditional branch instruc-
`tions have variations with and without a delay slot. A
`delay slot allows the processor to execute an instruction
`following a branch while it is fetching from the branch
`target. Having both delayed and nondelayed variations
`of branch instructions allows the compiler to optimize
`the code easily, whether a branch is likely to be taken or
`not. Test and branch instructions execute in one clock
`cycle, a savings of one cycle when testing special cases.
`Finally, another one-cycle loop control instruction
`usefully handles tight loops, such as those in vector
`routines.
`Instead of providing a limited set of locked opera-
`tions, the RISC core provides lock and unlock instruc-
`tions. With these two instructions a sequence of up to
`32 instructions can be interlocked for multiprocessor
`synchronization. Thus, traditional test and set opera-
`
[Figure 3 diagram: the thirty-two 32-bit integer registers R0-R31; the floating-point registers F0-F31, shown paired as 64-bit registers; the special-purpose floating-point registers KR, KL, and Merge; and the control registers, including the page directory base, data breakpoint, and floating-point status registers.]
`
`Figure 3. Register set.
`
`
`
`
Table 1.
Instruction-set summary.

Mnemonic    Description

Core unit

Load and store instructions
LD.X        Load integer
ST.X        Store integer
FLD.Y       F-P load
PFLD.Z      Pipelined F-P load
FST.Y       F-P store
PST.D       Pixel store

Register-to-register moves
IXFR        Transfer integer to F-P register
FXFR        Transfer F-P to integer register

Integer arithmetic instructions
ADDU        Add unsigned
ADDS        Add signed
SUBU        Subtract unsigned
SUBS        Subtract signed

Shift instructions
SHL         Shift left
SHR         Shift right
SHRA        Shift right arithmetic
SHRD        Shift right double

Logical instructions
AND         Logical AND
ANDH        Logical AND high
ANDNOT      Logical AND NOT
ANDNOTH     Logical AND NOT high
OR          Logical OR
ORH         Logical OR high
XOR         Logical exclusive OR
XORH        Logical exclusive OR high

Control-transfer instructions
TRAP        Software trap
INTOVR      Software trap on integer overflow
BR          Branch direct
BRI         Branch indirect
BC          Branch on CC
BC.T        Branch on CC taken
BNC         Branch on not CC
BNC.T       Branch on not CC taken
BTE         Branch if equal
BTNE        Branch if not equal
BLA         Branch on LCC and add
CALL        Subroutine call
CALLI       Indirect subroutine call

System control instructions
FLUSH       Cache flush
LD.C        Load from control register
ST.C        Store to control register
LOCK        Begin interlocked sequence
UNLOCK      End interlocked sequence

Floating-point unit

Floating-point multiplier instructions
FMUL.P      F-P multiply
PFMUL.P     Pipelined F-P multiply
PFMUL3.DD   Three-stage pipelined F-P multiply
FMLOW.P     F-P multiply low
FRCP.P      F-P reciprocal
FRSQR.P     F-P reciprocal square root

Floating-point adder instructions
FADD.P      F-P add
PFADD.P     Pipelined F-P add
FSUB.P      F-P subtract
PFSUB.P     Pipelined F-P subtract
PFGT.P      Pipelined F-P greater-than compare
PFEQ.P      Pipelined F-P equal compare
FIX.P       F-P to integer conversion
PFIX.P      Pipelined F-P to integer conversion
FTRUNC.P    F-P to integer truncation
PFTRUNC.P   Pipelined F-P to integer truncation
PFLE.P      Pipelined F-P less than or equal
FAMOV       F-P adder move
PFAMOV      Pipelined F-P adder move

Dual-operation instructions
PFAM.P      Pipelined F-P add and multiply
PFSM.P      Pipelined F-P subtract and multiply
PFMAM       Pipelined F-P multiply with add
PFMSM       Pipelined F-P multiply with subtract

Long integer instructions
FLSUB.Z     Long-integer subtract
PFLSUB.Z    Pipelined long-integer subtract
FLADD.Z     Long-integer add
PFLADD.Z    Pipelined long-integer add

Graphics instructions
FZCHKS      16-bit z-buffer check
PFZCHKS     Pipelined 16-bit z-buffer check
FZCHKL      32-bit z-buffer check
PFZCHKL     Pipelined 32-bit z-buffer check
FADDP       Add with pixel merge
PFADDP      Pipelined add with pixel merge
FADDZ       Add with z merge
PFADDZ      Pipelined add with z merge
FORM        OR with merge register
PFORM       Pipelined OR with merge register

Assembler pseudo-operations
MOV         Integer register-register move
FMOV.Q      F-P register-register move
PFMOV.Q     Pipelined F-P register-register move
NOP         Core no-operation
FNOP        F-P no-operation

CC    Condition code
F-P   Floating-point
LCC   Load condition code
`
`
`
`
`tions as well as more sophisticated operations, such as
`compare and swap, can be performed.
`The RISC core also executes a pixel store instruc-
`tion. This instruction operates in conjunction with the
`graphics unit to eliminate hidden surfaces. Other in-
`structions transfer integer and floating-point registers,
`examine and modify the control registers, and flush the
`data cache.
`The six control registers accessible by core instruc-
`tions are the
`PSR (processor status),
`EPSR (extended processor status),
`DB (data breakpoint),
`FIR (fault instruction),
`Dirbase (directory base), and
`FSR (floating-point status) registers.
`
`The PSR contains state information relevant to the
`current process, such as trap-related and pixel informa-
`tion. The EPSR contains additional state information
`for the current process and information such as the
`processor type, stepping, and cache size. The DB reg-
`ister generates data breakpoints when the breakpoint is
enabled and the address matches. The FIR stores the
`address of the instruction that causes a trap. The Dir-
`base register contains the control information for cach-
`ing, address translation, and bus options. Finally, the
`FSR contains the floating-point trap and rounding-
`mode status for the current process. The four special-
`purpose registers are used with the dual-operation
`floating-point instructions (described later).
`The core unit executes all loads and stores, including
`those to the floating-point registers. Two types of float-
`ing-point loads are available: FLD (floating-point load)
`and PFLD (pipelined floating-point load). The FLD
`instruction loads the floating-point register from the
`cache, or loads the data from memory and fills the cache
`line if the data is not in the cache. Up to four floating-
`point registers can be loaded from the cache in one
`clock cycle. This ability to perform 128-bit loads or
`stores in one clock cycle is crucial to supplying the data
`at the rate needed to keep the floating-point units
executing. The FLD instruction processes scalar
`floating-point routines, vector data that can fit entirely
`in the cache, or sections of large data structures that are
`going to be reused.
`For accessing data structures too large to fit into the
`on-chip cache, the core uses the PFLD instruction. The
`pipelined load places data directly into the floating-
`point registers without placing it in the data cache on a
`cache miss. This operation avoids displacing the data
`already in the cache that will be reused. Similarly on a
`store miss, the data writes through to memory without
`allocating a cache block. Thus, we avoid data cache
`thrashing, a crucial factor in achieving high sustained
`performance in large vector calculations.
`PFLD also allows up to three accesses to be issued on
`
`
`the pipelined external bus before the data from the first
`cache miss is returned. The pipelined loads occur di-
`rectly from memory and do not cause extra bus cycles
`to fill the cache line, avoiding bus accesses to data that
`is not needed. The full bus bandwidth of the external
`bus can be used even though cache misses are being
`processed. Autoincrement addressing, with an arbi-
`trary increment, increases the flexibility and perform-
`ance for accessing data structures.
`
`Memory management
`The i860’s on-chip memory management unit imple-
`ments the basic features needed for paged virtual
`memory management and page-level protection. We
`intentionally duplicated the memory management tech-
`nique in the 386 and 486 microprocessors’ paging
system. In this way we can be sure that the processors coexist easily in a common operating environment. The
`similar MMUs are also useful for reusing paging and
`virtual memory software that is written in C.
`The address translation process maps virtual address
`space onto actual address space in fixed-size blocks
`called pages. While paging is enabled, the processor
`translates a linear address to a physical address using
page tables. As in mainframe designs, the i860 CPU page
`tables are arranged in a two-level hierarchy. (See Fig-
`ure 4.) The directory table base (DTB), which is part of
`the Dirbase register, points to the page directory. This
`one-page-long directory contains address entries for
`1,024 page tables. The page tables are also one page
`long, and their entries describe 1,024 pages. Each page
`is 4 Kbytes in size.
`Figure 4 also shows the translation from a virtual
`address to a physical address. The processor uses the
`upper 10 bits of the linear address as an index into the
`directory. Each directory entry contains 20 bits of
`addressing information, part of which contains the
`address of a page table. The processor uses these 20 bits
`and the middle 10 bits of the linear address to form the
`page table address. The address contents of the page
`table entry and the lower 12 bits (nine address bits and
`the byte enables) of the linear address form the 32-bit
`physical address.
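The two-level walk just described can be sketched in a few lines. This is an illustration only (the function name and dict-based tables are ours; the real tables live in memory and their entries carry protection bits as well as addresses):

```python
def translate(linear, page_directory):
    """Two-level translation: the upper 10 bits of the linear address index
    the page directory, the middle 10 bits index a page table, and the low
    12 bits are the offset within the 4-Kbyte page. `page_directory` is a
    dict of dicts standing in for the in-memory tables."""
    dir_index = (linear >> 22) & 0x3FF     # upper 10 bits
    page_index = (linear >> 12) & 0x3FF    # middle 10 bits
    offset = linear & 0xFFF                # low 12 bits
    page_table = page_directory[dir_index] # directory entry selects a table
    frame = page_table[page_index]         # 20-bit page frame address
    return (frame << 12) | offset          # 32-bit physical address

# One directory entry pointing at one page table that maps page 5 -> frame 0x2A.
tables = {1: {5: 0x2A}}
linear = (1 << 22) | (5 << 12) | 0x123
assert translate(linear, tables) == (0x2A << 12) | 0x123
```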
`The processor creates the paging tables and stores
`them in memory when it creates the process. If the
`processor had to access these page tables in memory
`each time that a reference was made, performance
`would suffer greatly. To save the overhead of the page
`table lookups, the processor automatically caches
`mapping information for the 64 recently used pages
`in an on-chip, four-way, set-associative translation
lookaside buffer. The TLB's 64 entries each cover 4 Kbytes, providing a total coverage of 256 Kbytes of memory addresses. The TLB can be flushed by setting a bit in the
`Dirbase register.
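The TLB geometry stated above works out as follows (a sketch; indexing the set by the low bits of the page number is an assumption of ours, not a documented detail of the i860):

```python
# TLB geometry from the text: 64 entries, four-way set associative,
# each entry mapping one 4-Kbyte page.
NUM_ENTRIES, WAYS, PAGE_BYTES = 64, 4, 4096
NUM_SETS = NUM_ENTRIES // WAYS          # 16 sets of 4 entries each

def tlb_set(virtual_page_number):
    """Which set a virtual page falls into (hypothetical indexing scheme)."""
    return virtual_page_number % NUM_SETS

# 64 entries x 4 Kbytes = 256 Kbytes of address space covered at once.
assert NUM_ENTRIES * PAGE_BYTES == 256 * 1024
assert tlb_set(16) == 0 and tlb_set(17) == 1
```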
`
`
`
`
[Figure 4 diagram: the linear address split into Dir, Page, and Offset fields, with the Dir field indexing the page directory, the Page field indexing a page table, and the table entry plus Offset forming the physical address.]
`
`Figure 4. Virtual-to-physical address translation.
`
[Figure 5 diagram: a page table entry holds the page frame address in bits 31 . . . 12, bits available to the systems programmer, Intel-reserved bits (X), and the Dirty (D), Accessed (A), Cache disable, Write-through, User (U), Writable (W), and Present (P) flags.]
`Figure 5. Format of a page table entry. (X indicates Intel reserved; do not use.)
`
`
`Only when the processor does not find the mapping
`information for a page in the TLB does it perform a
`page table lookup from information stored in memory.
`When a TLB miss does occur, the processor performs
`the TLB entry replacement entirely in hardware. The
`hardware reads the virtual-to-physical mapping infor-
`mation from the page directory and the page table
`entries, and caches this information in the TLB.
`
`The format of a page table entry can be seen in Figure
`5. Paging protects supervisor memory from user ac-
`cesses and also permits write protection of pages. The
`U (user) and W (write) bits control the access rights.
`The operating system can allow a user program to have
`read and write, read-only, or no access to a given page
`or page group. If a memory access violates the page
`protection attributes, such as U-level code writing a
`
`
`
`
`
`read-only page, the system generates an exception.
`While at the user level, the system ignores store control
`instructions to certain control registers.
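The U/W protection check can be sketched as follows. This is our condensed reading of the rules above (the function name is invented, and the supervisor case ignores the copy-on-write WP refinement discussed next):

```python
def access_allowed(user_bit, write_bit, user_mode, is_write):
    """Page-protection sketch from the U and W bits: supervisor code can
    read any present page (and normally write it); user code needs U set,
    and user writes additionally need W set."""
    if not user_mode:
        return True          # supervisor level: pages readable/writable
    if not user_bit:
        return False         # supervisor page, user access -> exception
    return write_bit or not is_write

# U-level code writing a read-only page generates an exception.
assert access_allowed(user_bit=1, write_bit=0, user_mode=True, is_write=False)
assert not access_allowed(user_bit=1, write_bit=0, user_mode=True, is_write=True)
assert not access_allowed(user_bit=0, write_bit=1, user_mode=True, is_write=False)
```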
`The U bit of the PSR is set to 0 when executing at the
`supervisor level, in which all present pages are read-
`able. Normally, at this level, all pages are also writable.
`To support a memory management optimization called
`copy-on-write, the processor sets the write-protection
`(WP) bit of the EPSR. With WP set, any write to a page
`whose W bit is not set causes a trap, allowing an
`operating system to share pages between tasks without
`making a new copy of the page until it is written.
`Of the two remaining control bits, cache disable
`(CD) and write through (WT), one is reflected on the
`output pin for a page table bit (PTB), dependent on the
`setting of the page table bit mode (PBM) in EPSR. The
`WT bit, CD bit, and KEN# cache enable pin are inter-
`nally NORed to determine “cachability.” If either of
`these bits is set to one, the processor will not cache that
`page of data. For systems that use a second-level cache,
`these bits can be used to manage a second-level coher-
`ent cache, with no shared data cached on chip. In
`addition to controlling cachability with software, the
`KEN# hardware signal can be used to disable cache
`reads.
`
`Floating-point unit
`Floating-point unit instructions, as listed in Table 1,
`support both single-precision real and double-preci-
`sion real data. Both types follow the ANSI/IEEE 754
`standard.’ The i860 CPU hardware implements all four
`modes of IEEE rounding. The special values infinity,
`NaN (not a number), indefinite, and denormal generate
`a trap when encountered; and the trap handler produces
`an IEEE-standard result. The double-precision real
`data occupies two adjacent floating-point registers with
`bits 31 . . . 0 stored in an even-numbered register and
`bits 63 . . . 32 stored in the adjacent, higher odd-
`numbered register.
`The floating-point unit includes three-stage-pipe-
`lined add and multiply units. For single-precision data
`each unit can produce one result per clock cycle for a
`peak rate of 80 Mflops at a 40-MHz clock speed. For
`double-precision data, the multiplier can produce a
`result every other cycle. The adder produces a result
`every cycle, for a peak rate of 60 million floating-point
`operations per second. The double-precision peak
`number is 40 Mflops if an algorithm has an even
`distribution of multiplies and adds. Reducing the
`double-precision multiply rate saves half of the multi-
`plier tree and is consistent with the data bandwidth
`available for double-precision operations.
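The peak figures quoted above follow from simple arithmetic (a sketch; the variable names are ours):

```python
CLOCK_MHZ = 40

# Single precision: the adder and the multiplier each deliver one result
# per cycle, so two floating-point operations per clock.
single_peak_mflops = 2 * CLOCK_MHZ            # 80 Mflops

# Double precision: the adder delivers one result per cycle (40M ops/s),
# the multiplier one result every other cycle (20M ops/s).
double_peak_ops = CLOCK_MHZ + CLOCK_MHZ // 2  # 60 million operations/s

# With an even mix, the half-rate multiplier paces the loop:
# 20M multiplies/s paired with 20M adds/s.
double_even_mix_mflops = 2 * (CLOCK_MHZ // 2) # 40 Mflops

assert single_peak_mflops == 80
assert double_peak_ops == 60
assert double_even_mix_mflops == 40
```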
`To save silicon area, we did not include a floating-
`point divide unit. Instead, software performs floating-
`point divide and square-root operations. Newton-Ra-
`phson algorithms use an 8-bit seed provided by a
`
`
(a)
      DO 10, I = 1, 100
10    X = X * A + C

      FMUL  X, A, temp
      FADD  temp, C, X

      (1 result per 6 clock cycles)

(b)
      DO 10, I = 1, 100
10    X[I] = A[I] * B[I] + C

      M12TPM  A[I], B[I], X[I - 6]

      (1 result per clock cycle)
`
`Figure 6. Floating-point execution models: data-de-
`pendent code in scalar mode (a) and vector code in
`pipeline mode (b).
`
[Figure 7 diagram: SRC1 and SRC2 source paths and the RDEST destination path connecting the register file, the special registers, the multiplier unit, and the adder unit; each unit's result can feed the other.]
`
`Figure 7. Dual-operation data paths.
`
`
`
`
`hardware lookup table. Full IEEE rounding can be
`implemented by using an instruction that returns the
`low-order bits of a floating-point multiply. Therefore
`these algorithms can take advantage of the pipeline and
`allow 16-bit reciprocals used in many graphics calcula-
`tions to be performed either in 10 clock cycles or four
`pipelined cycles.
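The Newton-Raphson scheme can be sketched in software. The seed computation below merely stands in for the hardware's 8-bit lookup table (it is not the real table), and the function name is ours:

```python
from math import frexp, ldexp

def reciprocal(d, iterations=3):
    """Newton-Raphson reciprocal, x' = x * (2 - d * x): each step roughly
    doubles the number of correct bits, so an ~8-bit seed reaches single
    precision in two iterations and double precision in three.
    Positive d only in this sketch."""
    # Software stand-in for the hardware's 8-bit seed lookup.
    mantissa, exp = frexp(d)                        # d = mantissa * 2**exp
    seed = ldexp(round(256 / mantissa) / 256, -exp) # ~8 significant bits
    x = seed
    for _ in range(iterations):
        x = x * (2.0 - d * x)   # uses multiply and multiply-low-style ops
    return x

assert abs(reciprocal(3.0) - 1.0 / 3.0) < 1e-12
```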
`The floating-point instruction set supports two
`computation models, scalar and pipelined. In scalar
`mode new floating-point instructions do not start proc-
`essing until the previous floating-point instruction
`completes. This mode is used when a data dependency
`exists between the operations or when a compiler ig-
nores pipeline scheduling. In the scalar-mode example of Figure 6, each iteration of the DO loop requires the result of the previous iteration and executes in six cycles.
`In pipelined mode the same operation can produce a
`result every clock cycle, and the CPU pipeline stages
`are exposed to software. The software issues a new
`floating-point operation to the first stage of the pipeline
`and gets back the result of the last stage of the pipeline.
`Destination registers are not specified when the opera-
`tion begins, rather when the result is available. This
`explicit pipelining avoids tying up valuable floating-
`point registers for results, so the registers can still be
`used in the pipeline. Implicit pipelining, using score-
`boarding, would cause the registers to become the
`bottleneck in the floating-point unit.
`Pipelining also takes place in a dual-operation mode
`in which an add and a multiply process in parallel.
`Figure 7 shows the adder unit, the multiplier unit, the
`special registers, and the dual-operation data paths.
`Dual-operation instructions require six operands. The
`register file provides three of the operands, and the
`special registers and the interunit bypasses provide the
`remaining three. The instruction encodings specify the
`source and destination paths for the units.
Referring back to the pipeline-mode example of Figure 6, note that we show the dual-operation instruction M12TPM SRC1, SRC2, RDEST as M12TPM A[I], B[I], X[I - 6]. (The M12TPM mnemonic is a variation of the PFAM instruction.) This instruction specifies that the multiply is initiated with SRC1 and SRC2 as the
`operands. It also specifies that the add is initiated with
`the result from the multiply and the T register as the
`operands, and RDEST stores the result from the add.
`Because of the three stages of the add and multiply
`pipelines, the available result comes from the operation
`that started six clock cycles previously.
`There are 32 variations of dual-operation instruc-
`tions. Applications such as fast Fourier transforms,
`graphics transforms, and matrix operations can be
`implemented efficiently with these instructions. Some
`apparently scalar operations, such as adding a series of
`numbers, can also take advantage of the pipelining
`capability.
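The six-cycle result delay of the chained pipelines can be modeled with a short simulation. This is our toy model only: it treats the multiply and add pipelines as a single six-slot queue and ignores the per-cycle staging of the real hardware:

```python
from collections import deque

def dual_op_loop(a, b, c):
    """Toy model of pipelined dual-operation mode: a 3-stage multiply
    pipeline chained into a 3-stage add pipeline, so each issue returns
    the result of the operation issued six cycles earlier
    (cf. M12TPM A[I], B[I], X[I - 6] in Figure 6)."""
    pipeline = deque([None] * 6)        # six results in flight
    results = []
    for x, y in zip(a, b):
        pipeline.append(x * y + c)      # value that emerges 6 issues later
        out = pipeline.popleft()
        if out is not None:
            results.append(out)
    while pipeline:                     # drain the pipeline after the loop
        out = pipeline.popleft()
        if out is not None:
            results.append(out)
    return results

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
assert dual_op_loop(a, b, 10.0) == [14.0, 20.0, 28.0]
```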
`
`63
`
`CORE-OP
`CORE-OP
`CORE-OP
`
`31
`
`0
`
`OP
`d.FP-OP
`d.FP-OP or CORE-OP
`d.FP-OP
`FP-OP
`FP-OP
`OP
`OP
`
`I
`
`I
`
`31
`
`0
`
`63
`
`L
`
`CORE-Of‘
`
`OP
`d.FP-OP
`FP-OP
`FP-OP
`OP
`OP
`
`I
`
`f
`
`Enter dual-
`instruction mode.
`Initiate exit from dual-
`instruction mode.
`
`Leave dual-
`instruction mode.
`
`.c +
`I
`
`Temporary dual-
`instruction mode
`
`I
`
`Figure 8. Dual-instruction-mode transitions.
`
The i860 microprocessor can provide its fast float-
`ing-point hardware with the necessary data bandwidth
`to achieve peak performance for the inner loops of
`common routines. The dual-instruction mode allows
`the processor to perform up to 128-bit data loads and
`stores at the same time it executes a multiply and an
`add. Figure 8 shows the dual-instruction-mode transi-
`tions for an extended sequence of instruction pairs and
`for a single instruction pair. Programs specify dual-
instruction mode in two ways. They can either include a “d.” prefix in the mnemonic of a floating-point instruction or use the assembler directives .dual ... enddual.
`Either of these methods causes the dual or D-bit of the
`floating-point instruction to be set. If the processor
`while executing in single-instruction mode encounters
`a floating-point instruction with the D-bit set, it exe-
`cutes one more 32-bit instruction before beginning
`dual-instruction execution. In dual-instruction mode, a
`floating-point instruction could encounter a clear D-
`bit. The processor would then execute one more in-
`struction pair before returning to single-instruction
`mode.
`The floating-point hardware also performs integer
`multiplies and long integer adds or subtracts. Integer
`multiplies by constants can be performed in the RISC
`core using shift instructions. To perform a full integer
`multiply, the processor transfers two integer registers
`by using IXFR instructions. The FMLOW instruction
`performs the actual multiplication, and the FXFR in-
`struction transfers the results back to the core. The total
`operation takes from four to nine clock cycles, depend-
`ing on what other instructions can be overlapped.
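The IXFR/FMLOW/FXFR sequence can be mimicked in a few lines. Only the arithmetic is modeled here (the function name is ours; FMLOW is the i860 instruction that returns the low-order bits of the product):

```python
MASK32 = 0xFFFFFFFF

def fmlow_style_multiply(a, b):
    """Sketch of the integer-multiply sequence: operands move to the
    floating-point registers (IXFR), FMLOW produces the low-order bits
    of the product, and FXFR moves the result back to an integer
    register. Only the 32-bit arithmetic is modeled."""
    return (a * b) & MASK32

assert fmlow_style_multiply(3, 7) == 21
assert fmlow_style_multiply(100000, 100000) == (100000 * 100000) & MASK32
```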
`
`
`
`
`
`Graphics
`The floating-point hardware of the CPU efficiently
`performs the transformation calculations and advanced
`lighting calculations required for 3D graphics. The
`processor performs 500K transforms/second for 3 x 4
`3D matrices, including the trivial reject clipping and
`perspective calculations. A 3D image display requires
`the use of integer operations for shading and hidden-
`surface removal. The graphics unit hardware speeds
`these back-end rendering operations and operates di-
`rectly into screen buffer memory. It uses the floating-
`point registers and operates in parallel with the core.
`Graphics instructions take advantage of the 64-bit
`data paths and can operate on multiple pixels simulta-
neously, realizing 10 times the speed of the RISC
`core when performing shading. Instructions support
`8-, 16-, and 24/32-bit pixels, operating respectively
`on eight, four, or two pixels simultaneously.
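The pixel packing behind these counts can be sketched as follows (our helper names; the i860 does this within its 64-bit data paths rather than with explicit pack/unpack steps):

```python
def unpack_pixels(word, bits):
    """Split a 64-bit word into 64 // bits pixels (8-, 16-, or 32-bit)."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(64 // bits)]

def pack_pixels(pixels, bits):
    """Pack a list of pixels back into one 64-bit word."""
    word = 0
    for i, p in enumerate(pixels):
        word |= (p & ((1 << bits) - 1)) << (i * bits)
    return word

# Eight 8-bit pixels travel in one 64-bit word.
pixels = [10, 20, 30, 40, 50, 60, 70, 80]
word = pack_pixels(pixels, 8)
assert unpack_pixels(word, 8) == pixels
assert len(unpack_pixels(word, 16)) == 4   # four 16-bit pixels per word
assert len(unpack_pixels(word, 32)) == 2   # two 32-bit pixels per word
```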
`In 3D graphics, polygons generally represent the set
`of points on the surface of a solid object. During
`transformation, the graphics unit calculates only the
`vertices of the polygons. The unit knows the locations
and color intensities of the vertices of the polygons, but
`points between these vertices must be calculated. These
`points, along with their associated data, are called
`pixels. If a figure is displayed with only the vertices and
`simple lines, it appears as a wireframe drawing. The
`simplest wireframe drawing typically shows all verti-
`ces, even the ones that should be hidden from view by
`an overlapping polygon. To show shaded 3D images,
`the graphics unit must display the surface of the poly-
`gons. Where polygons overlap, it must display the
`polygon closest to the viewer.
`In graphics calculations the z value represents the
`distance of a pixel from the viewer. Although the depth
`of each polygon’s vertices is known, to overlay poly-
`gons not on a vertex, the graphics unit must interpolate
`the depths from the bordering vertices. This step is
`called z interpolation. In this step the depths of all
`points of a polygon can be determined. For overlapping
`points, the z values of different polygons can be checked
`and only the pixel data of the polygon closest to the
`viewer displayed.
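The interpolate-and-compare procedure can be sketched as follows. The function names are ours, and treating smaller z as nearer is a convention of this sketch, not a statement about the i860's z-buffer-check instructions:

```python
def z_interpolate(z0, z1, steps):
    """Linearly interpolate depth between two vertex z values."""
    return [z0 + (z1 - z0) * i / steps for i in range(steps + 1)]

def z_check(zbuffer, color_buffer, x, z, color):
    """Keep the pixel only if it is closer to the viewer than what is
    already stored (smaller z means nearer in this sketch)."""
    if z < zbuffer[x]:
        zbuffer[x] = z
        color_buffer[x] = color

zbuf = [1.0, 1.0, 1.0]
cbuf = [0, 0, 0]
z_check(zbuf, cbuf, 1, 0.5, 7)   # nearer pixel: drawn
z_check(zbuf, cbuf, 1, 0.8, 9)   # farther pixel: rejected
assert cbuf == [0, 7, 0] and zbuf[1] == 0.5
assert z_interpolate(0.0, 1.0, 4) == [0.0, 0.25, 0.5, 0.75, 1.0]
```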
`To perform the procedure just described, the graph-
`ics instructions include intensity interpolation, z inter-
`polation, and z-buffer checks. Intensity interpolation
`allows smooth linear changes in pixel intensity and
`color between verti