`
`Exceptions and Interrupts
`
`Pipeline External
`Stage
`Interrupts
`
`Software
`Interrupts
`
`Internal
`Interrupts
`
`IF
`IS
`RF
`
`EX
`DF
`DS
`TC
`
`WB
`
`Bus error Breakpoint
`Illegal instruction
`Syscall
`Instruction translation
`ECC instruction
`Coprocessor is
`Virtual coherency unusable
`instruction
`
`Interrupt
`
`Floating point
`
`Overflow
`TLB modified
`Data translation
`Virtual coherency Bus error data
`
`Data
`Watch
`NMI
`Reset
`
`TABLE 9.2 R4000 stage designation of interrupted instruction.
`
`stages are discussed in Chapter 10. However, note that a bus error on
`an instruction fetch from the cache is not signaled until the RF stage.
`The reason for this is that the data cache is pipelined and the checks of
`the cache tags occur in the RF stage as shown in Figure 2.31. For
`concurrent interrupts, the R4000 gives priority to the interrupted in(cid:173)
`struction furthest down the pipeline. This processor illustrates the
`need to recognize interrupts and save the program counter value over
`most of the stages of the pipeline.
`
`9.1.2 Handling Preceding Instructions
`The methods for handling preceding instructions, described below, are
`for systems that issue one instruction at a time. Systems that issue more
`than one instruction at a time are described in Chapter 10. Preceding
`instructions have the potential to modify the processor state in registers
`and/or memory. Recall that Smith's Condition #1 states that: All in(cid:173)
`structions preceding the instruction indicated by the saved program
`counter have been executed and have modified the process (processor)
`state correctly. This sequential execution model condition is defined as:
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 451
`
`
`
`9. 1 Precise Interrupts
`
`435
`
`A processor that satisfies the condition that "the result of an ex(cid:173)
`ecution is the same as if the operations had been executed in the
`order specified by the program" [LAMP79]. Note that a processor
`that executes the serial execution model also executes the sequential
`execution model.
`
`The sequential execution model is enforced by one of the three design
`options for handling preceding instructions shown in Table 9.1. The
`options are:
`1. to flush the pipeline, retiring all issued instructions;
`2. to take steps to ensure that all issued instructions retire in-order;
`3. to undo the processor state changes of any instructions that have
`been retired out-of-order.
`The design issues involved in selecting the option for handling the
`preceding instructions are whether or not to (1) penalize the normal
`performance of the pipeline, as measured by CPI, at the expense of a
`longer interrupt latency when there is an interrupt, or (2) have a smaller
`CPI and a longer latency. Cost and complexity are also design consider(cid:173)
`ations.
`Only a Type A instruction window (described in Chapter 8) with in(cid:173)
`order release will always have in-order retirements. The other types of
`windows have the potential for out-of-order completions and out-of-order
`retirements. Figure 9.3 shows windows with two execution units that
`can produce out-of-order retirements. Multiple execution units can have
`a different number of stages, such as an integer unit and a floating point
`unit as found in many microprocessors. Multiple register files, for integer
`and floating points, are not shown but have the same out-of-order retire(cid:173)
`ment problem. Systems with a common result bus that store integer and
`floating point values in the same register file can have structural haz(cid:173)
`ards on the output bus and the register file.
`The operation of these windows and pipelines is illustrated with two
`examples of a four-instruction sequence using the long and the short
`execution units. For the Type C window, if an interrupt occurs at the
`end of t 4 , i2 has already been retired out-of-order by writing to the
`register file. In addition, even if the interrupt had not occurred, there
`will be a structural hazard on the result bus and register file at t6 that
`must be resolved.
`For the Type D or E window, the instructions issue in-order to the
`reservation stations. For this example, instruction i3 is delayed two
`clocks because of a dependency on i2 (without forwarding). If an interrupt
`occurs at t5 , i2 will have been retired out-of-order. Notice that the release
`is out-of-order; i4 releases before i3 due to the dependency. The character(cid:173)
`istics of these two configurations follow.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 452
`
`
`
`436
`
`Exceptions and Interrupts
`
`Release on Dependency Resolution
`EU #1
`
`Type C Window
`
`il Long
`i2 Short
`i3 Short
`i4 Long
`
`Release on Dependency Resolution
`
`#1
`
`Type Dor E Wmdow
`
`5
`
`4
`
`i4
`
`2 S
`
`il
`
`i1
`
`il
`
`il □
`
`il Structunl
`Hazard
`
`2
`EU#l 3
`
`4
`
`WB
`
`EU#2 l~ " i3
`i2
`2
`i:l
`
`WB
`
`R.S.
`
`i1
`
`EU#I
`
`2
`
`3
`
`4
`
`WB
`
`R.S.
`
`i2
`
`i8
`
`t Inte1TUpt
`
`il
`
`il
`
`i4
`
`il
`
`i4
`
`il
`
`i3
`
`i3
`
`i2
`
`i3
`
`i2
`
`EU#2
`
`il Long
`i2 Short
`i3 Short
`i4 Long
`True dependency on i3, i2
`
`2
`
`i2
`1---+---+-+---+---+---f
`i2
`WB
`
`t Intern.pl
`
`FIGURE 9.3 Multiple execution units.
`
`Type C window (Cray-1, MIPS, Pentium integer) characteristics in(cid:173)
`clude
`
`1. in-order issue,
`2. dependencies resolved before release,
`3. in-order release,
`4. deterministic time from issue to completion.
`
`Type Dor E window (CDC 6600, IBM S/36O/91 Flt. Pt., MC68110•
`characteristics include
`
`1. in-order issue,
`2. dependencies resolved in the reservation stations,
`3. out-of-order release,
`4. nondeterministic time from issue to completion.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 453
`
`
`
`9.1 Precise Interrupts
`
`437
`
`Resolving Structural Hazards
`For systems with multiple-variable length execution units, the potential
`exists for a structural hazard on the result bus and register file even if
`instructions are released in-order. There are two methods for resolving
`this structural hazard.
`
`1. Deterministically schedule the "issue or release of instruction to
`eliminate the hazard.
`2. Resolve the hazard with priority logic.
`
`Consider the first method. Structural hazards can be resolved by
`delaying an instruction release until the hazard is eliminated; the result
`bus is scheduled so that hazards will not occur. Figure 9.4 shows a result
`shift register [SMITH85], called a RSR(a), that schedules the result bus
`and writes to a common register file.
`The reservation table of a shift register, shown on the right of the
`figure, is the same length as the longest execution unit pipeline; in this
`case four stages, and the stages are numbered in reverse order. This
`result shift register reserves a time slot for the result bus and the path
`to the register file but does not ensure in-order retirement. Scheduling of
`instructions into the execution units is illustrated by a three-instruction
`sequence. Instruction il, because it uses the long pipeline, places its
`destination address in stage four at t 1 . Then i2 releases and places its
`destination address into stage 2 at t 2 • Instruction i3 cannot release into
`EU #2 at t3 because the bus will be blocked by the result of il. Thus, i3
`is delayed one clock and releases at t4 • When a result is ready to exit
`the execution unit, the result bus is gated on.
`The TI ASC uses a Type C window and a RSR(a) to schedule its four
`execution units into the result bus and register file. Because of the
`number of data types, the RSR required seven bits to indicate the ad(cid:173)
`dress and type of register. For example, an operation with an upper half-
`
`EU#l
`
`2 345 678
`
`i-o
`Release
`
`EU#2
`
`Write to
`Registers
`
`il: Long, EU #1
`i2: Short, EU #2
`i3: Short, EU #2
`
`FIGURE 9.4 Result shift register (a).
`
`4
`
`3
`
`2
`
`1
`WB
`
`il
`
`il
`
`i2
`
`il
`
`i3
`
`i2
`
`il i3
`
`i2
`
`il
`
`i3
`
`• Reserves Bus and Register
`• Does Not Preserve Order
`• Requires 6 Clocks
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 454
`
`
`
`438
`
`Exceptions and Interrupts
`
`word result will not have a structural hazard with a result with a lower
`half-word result.
`The second method for resolving the structural hazard employs pri(cid:173)
`ority logic. Type C, D, and E windows cannot easily use a RSR(a) to
`schedule the result bus because of the possible release delays while
`waiting for dependencies to be resolved. The CDC 6600 scoreboard and
`CDB must resolve these structural hazards on their result bus(ses) and
`move the delays to the completion end of the execution uni ts. In the case
`of the scoreboard, each of the busses, called trunks, has priority hardware
`to delay lower priority bus requests. I am not certain, but I believe that
`some form of priority scheme is used with the CDB as well. The two
`floating point execution units probably receive COB access priority based
`on a random selection.
`
`Retirement by Flushing
`After ensuring that there are no structural hazards, the requirements
`for properly handling the preceding instructions must be satisfied. A
`common method, used in a number of processors, is a simple flush of the
`pipelines; that is, the issued instructions in the pipelines are run to
`completion and all results are retired. After flushing, the processor state
`is correct, the context switch can be performed, and the interrupt ser(cid:173)
`viced.
`Flushing is illustrated with the reservation table in Figure 9.5.
`which shows an instruction sequence: long, short, long, short. The v.in(cid:173)
`dow can be Type C, D, or E.
`
`12345678
`
`il
`
`il
`
`i3
`
`il
`
`i3
`
`il
`
`i3
`
`il
`
`i3
`
`i3
`
`2
`
`3
`
`4
`
`WB
`
`il Long
`i2 Short
`i3 Long
`i4 Short
`
`EU #1
`Reservation Table
`
`l
`
`2
`
`3 4 Is 6
`
`7
`
`ii lllll..l +
`
`Interrupt
`
`8
`
`I
`
`EU #2
`Reservation Table
`
`I
`..
`
`Context Switch
`
`FIGURE 9.5 Flush pipelines.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 455
`
`
`
`9. 1 Precise Interrupts
`
`439
`
`If an interrupt occurs at t 4 , instruction release is halted, the pipelines
`flush and perform their write backs to the ·register file out-of-order. The
`context switch does not occur until after the flush operation is complete
`and the register file is in the correct state. The flush time, in this exam(cid:173)
`ple, is three clocks. The time to flush, or the latency of the context switch,
`can be long and indeterminate. For example, all dependencies must be
`resolved before the instructions release, and there could be a major
`delay if there is an address translation miss on a released load or store
`instruction.
`The flush latency causes another problem: the instructions being
`flushed can themselves be interrupted. The general sol1.1.tion to this
`problem is to mask off all interrupts after the first has been recognized.
`Because interrupts will be masked off, unpredictable results can result
`and· an error condition is signaled by the processor.
`This strategy is useful when an internal interrupt or exception is
`detected prior to release and before any side effects can occur. Examples
`are page faults on instruction fetch or the detection of an illegal op-code.
`This simple flush scheme is called interrupts prior to instruction issue in
`[SMIT85].
`The PowerPC 601 and the Intel Pentium both use pipeline flushing
`to handle preceding instructions. Both of these processors have rela(cid:173)
`tively short execution unit pipelines, and released instructions are
`guaranteed not to produce interrupts.
`With the Pentium, the processor checks for pending interrupts before
`each instruction is issued. If an interrupt is pending, fetching instruc(cid:173)
`tions is halted and the pipeline is flushed [INTE93]. It is not clear from
`the Pentium documentation what happens if a second interrupt occurs
`during the pipeline flush operation. I assume that this is prevented by
`control logic in the processor. Some exception conditions are not handled
`by interrupts with the Pentium. Internal exceptions, such as register file
`parity errors, indicate that the processor can no longer be trusted, and
`the pipeline is shutdown with no hope of recovery.
`
`In-Order Retirement
`Because of problems with flushing, it is desirable to provide a method
`that will ensure that the processor state will be updated in-order regard(cid:173)
`less of the order of release. Design solutions to providing in-order-retire(cid:173)
`ment are discussed in the following sections and require in-order release,
`Type C windows, except where noted. Some of the solutions to this
`problem have hardware at the end of the pipelines that restore results
`into order before retirement; this hardware is called the result window
`by [LAIR92].
`Recall that the MIPS concept is based on software managing the
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 456
`
`
`
`440
`
`Exceptions and Interrupts
`
`pipeline without interlocks; the only interlock provided in the hardware
`is the load true dependency. The MIPS idea completely breaks down
`when a second execution unit-for instance, a floating point copro(cid:173)
`cessor-is added. Sohi [SOHl87) posits, "We are unaware of any software
`solutions to the imprecise interrupt problem for multiple functional unit
`computers." To cope with the problem of out-of-order-retirement, tech(cid:173)
`niques have been developed to coerce the processor to behave as if there
`are in-order-retirements even though completions are out-of-order.
`
`Equal Length Execution Unit Pipelines
`
`A simple strategy for Type C windows that ensures in-order retirement
`is to make all execution pipelines the same length. In other words, if the
`floating point pipeline is four stages, then the integer pipeline is four
`stages as well. As an integer pipeline can usually be one stage, three
`stages will simply be delays. This strategy will provide in-order retire(cid:173)
`ment with in-order release and will eliminate any structural hazards on
`the result bus or register file.
`The problem with padding out short pipelines is that other oper(cid:173)
`ations may be adversely affected. For example, the delays for true depen(cid:173)
`dencies can be increased. In addition, as the integer pipeline is frequently
`used for branch calculations. branch performance may be adversely af(cid:173)
`fected . The cost in execution time and hardware of implementing equal(cid:173)
`length pipelines is excessive, and I know of no system that uses this
`technique.
`I take issue with the proposition that there is no software solution
`to precise interrupts on multiple execution unit processors with a Type
`C window. No-ops can be used to delay instruction release to short
`pipelines, giving the same results as equal-length pipelines. There \\ill
`be a significant performance loss. I know of no published information on
`no-op scheduling for in-order retirement.
`
`Result Shift Register
`Smith [SMIT85] describes another type of result shift register, RSR(bJ.
`shown in Figure 9.6, that can be used with Type C windows. For RSR(b i.
`when instruction 1 is released to the four-stage execution unit, its destin(cid:173)
`ation address is placed in the stage of the result shift register along with
`X's in stages 3, 2, and 1. These time slots are thus blocked out or
`reserved. Instruction 2 would like to be released at t 2 , but stage 2 has
`been reserved and this instruction must wait until t 4 , when it can be
`released and its destination address placed in the result shift register.
`Instruction 3, because it is released in sequence, can now be released at
`t6 • There are no X's to block its release, as is the case with instruction
`2. As we can see, the results are retired in-order.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 457
`
`
`
`EU#l
`
`In -o r d e r~ Write to
`
`Release lffi.----J -Registern
`
`Type C Window
`
`ii: Long, EU#l
`i2: Short, EU#2
`i3: Short, EU#2
`
`9. 1 Precise Interrupts
`
`441
`
`1234 56 789
`
`4
`
`il
`
`3
`
`X
`
`i1
`
`2
`
`X
`
`X
`
`X
`
`i1
`
`X X
`
`i2
`
`i1
`
`WB
`
`i3
`
`i2
`
`il
`
`i3
`
`i2
`
`i3
`
`• Preserves Order
`Type b) • Requires 7 Clocks
`
`j EU# !Dest. Address Iv!
`FIGURE 9.6 Result shift register (b).
`
`PC
`
`In summary, when an instruction is issued, (1) a time slot is reserved
`on the single bus that connects the execution units to the register file
`and (2) the X's reserve time slots so that an instruction that uses a
`shorter pipeline will not get ahead of the issuing instruction. Delays are
`dynamically inserted, only when needed, compared to static introduction
`of no-ops into all slots whether or not delays are needed. As with the
`RSR(a), the time through the execution units must be deterministic;
`there can be no delays after the instruction is released.
`The RSR(b) entries have a number of fields; the execution unit
`number and destination address provide routing information to write
`the result into the correct register file location. A valid field is needed
`to distinguish between reserved slots and the slots that are only blocked
`out. The program counter value associated with each entry is required
`so that an instruction with an exception can be uniquely identified. The
`field for the program counter is an extension of the program counter
`pipeline of Figure 9.2.
`Because results are retired in-order, an interrupt at any time will
`find the register file processor state correct; thus, it is not necessary to
`flush the pipelines. Because the processor state is correct, the context
`switch can start immediately. The price for zero-context switch latency
`is that the normal processing CPI has been increased. For out-of-order
`results scheduled with a RSR(a), of Figure 9.4, six clocks are required
`for the last instruction write to the register file. For the RSR(b), seven
`clocks are required, a significant performance loss of 16%, to achieve in(cid:173)
`order retirements.
`In the examples to follow, two program fragments will be used to
`illustrate the operation of result shift registers handling preceding in(cid:173)
`structions. One program fragment has a true dependency while the other
`does not.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 458
`
`
`
`442
`
`Exceptions and Interrupts
`
`i1 RI - R2 + R3
`i2 R4 - Rl x R2
`i3 R7 - R6 + R8
`
`4
`
`3
`
`2
`
`1
`Write
`Back
`
`Type (a)
`4 5
`
`6
`
`7 8
`
`I
`
`2
`
`3
`
`R4
`
`Rl
`
`R4
`
`R7
`
`R4
`
`RI
`
`R7 R4
`
`RI
`
`R7 R4
`
`j
`
`Interrupt
`
`4
`
`3
`
`2
`
`Write
`Back
`
`Type (b)
`2 345 6789
`R4
`
`RI
`X RI
`
`R4
`
`X R4
`X X
`
`X
`
`X
`
`X
`
`RI
`
`R7
`
`R4 R7
`R4 R7
`
`+
`
`Interrupt
`
`FIGURE 9.7 Result shift register-with a true dependency.
`
`W True Dep.
`
`W /0 True Dep.
`
`i1 Rl +-R2 + R3
`i2 R4 +-Rl X R2
`i3 R7 +-R6 + R8
`
`i1 Rl +-R2 x R3
`i2 R4,(----R5 + R8
`i3 R7,(----R8 x R9
`
`The operation of a RSR(a) and (b) in the presence of an interrupt i.5
`illustrated with these two program fragments. The first example, shown
`in Figure 9.7, is the three-instruction program fragment with a true
`dependency on Rl between instructions il and i2, assuming forwarding
`around the write back stage. The add pipeline is assumed to have h.o
`stages while the multiply pipeline has four stages. The destination regis(cid:173)
`ter address is shown in the RSRs rather than the instruction number as
`in Figure 9.6.
`For the RSR(a), the instructions are released in-order but are retired
`out-of-order: il, i3, i2. If an interrupt occurs a time t 6 , the register file
`has been updated out-of-order. Note that if there is an i4, an add instruc(cid:173)
`tion for example, it could not release until t6 because of the reservation
`placed on the result bus and register file. There is no structural hazard
`with this example.
`Because of the introduced delays, the RSR(b) ensures in-order retire(cid:173)
`ment of results, permitting an interrupt at any time to be properly
`handled. Instruction i2 will not release until t3 because of the true
`dependency (bypassed around the register), and i3 will not release until
`t6 because of the reservation that has been placed on the result bus by
`i2. Because i2 and i3 have not modified the processor state by writing
`into the register file, they can be abandoned. The RSR(a) required seven
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 459
`
`
`
`9.1
`
`Precise Interrupts
`
`443
`
`i1Rl-R2+R3
`i2R4-R5·R8
`i3 R7 -RS+ R9
`
`Type (a)
`4 6
`
`2
`
`3
`
`6 7
`
`8
`
`Rl
`
`R7
`
`Rl
`
`R7
`
`R4
`
`Rl
`
`R7
`
`R4
`
`R7
`
`Rl
`R4 Rl
`
`R7
`
`4
`
`3
`
`2
`
`Write
`Back
`
`4
`
`3
`
`2
`
`Rl
`X
`
`X
`
`X
`
`Write
`Back
`
`2
`
`6 7 8
`
`Type (b)
`3 4 5
`R7
`X R7
`Rl
`X Rl R4 X X R7
`X X Rl R4 X X R7
`Rl R4
`
`9
`
`R7
`
`J~
`Interrupt
`FIGURE 9.8 Result shift register-without a true dependency.
`
`~
`
`Interrupt
`
`clocks to retire the sequence, while the RSR (b) required eight clocks.
`The time penalty is approximately 14% to ensure in-order retirement.
`The second example of RSR operation is shown in Figure 9.8 for the
`code fragment without a true dependency. Instructions issue and release
`in-order. With the RSR(a), there are no delays to resolve true dependen(cid:173)
`cies or structural hazards. Seven clocks are required to complete this
`sequence and, if an interrupt should occur at t 6 , i3 can be abandoned
`with the register file in the correct state-purely accidental due to the
`sequence of instructions. Such would not be the case if the interrupt had
`occurred at t4.
`The RSR(b) schedules the release of instructions, i2 at t4 and i3 at
`t5 , so that an interrupt at any time, not just at t6 as shown, will find the
`register file in the correct state. Nine clocks are required for this same
`instruction sequence-a penalty of approximately 28% to ensure in-order
`retirement. The latency to start the context switch is zero because all of
`the instructions that have not been retired are abandoned.
`
`Reorder Buffer
`The reorder buffer [SMIT85] assumes a Type C window that releases
`instructions in-order as their dependencies are resolved and ensures in(cid:173)
`order retirement to the register file without the performance penalty of
`the RSR(b). As the operations complete, the results are posted to the
`reorder buffet, which then retires the instructions, in-order. The reorder
`buffer scheme, shown in Figure 9.9, is controlled by a RSR(a) that re(cid:173)
`serves a time slot on the result bus for posting completed results to the
`reorder buffer.
`The reorder buffer is shown, and implemented, as a circular buffer
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 460
`
`
`
`444
`
`Exceptions and Interrupts
`
`Register File
`
`ln•Order
`Release
`
`Execution Unit
`
`: ............ Rb~fkt~f§~o/::1 .......... ·
`
`Reorder Buffor Enbies
`
`I DesAd~~· 1
`
`.
`
`un:= . Result
`
`I I
`
`I Program I
`
`Valid Except.ions Count.e~
`
`FIGURE 9.9 Reorder buffer.
`
`with head and tail pointers into a register file. The pointers increment
`modulo the size of the register file. The information carried in each of th€
`reorder buffer entries includes destination register address, the result of
`the operation, a valid bit that is set when the results are posted, any
`exceptions that occur during execution such as overflow, and the pro(cid:173)
`gram counter value. The instructions are released into the pipelined
`execution units in-order "With the information shown in the figure bei.::g
`placed in the reorder buffer location at the head pointer in-order. If the
`reorder buffer is full, the window stalls until there is a vacant location.
`As instructions complete (perhaps out-of-order) the results are
`written into their reserved slot in the reorder buffer and the valid bit
`is set. Rotation of the reorder buffer then permits the results to be retired
`in-order into the register file. In other words, a series of in-order slots
`rotate around the reorder buffer and are posted out-of-order by the
`execution units and then are retired in-order into the register file.
`The operation of the reorder buffer follows these steps.
`
`RELEASE. Write the following into the reorder buffer at the head
`pointer location:
`
`■ Destination Register Name
`■ PC value
`■ Increment the tail pointer.
`
`In addition, the head pointer value is written into the RSR(a).
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 461
`
`
`
`9. 1 Precise Interrupts
`
`445
`
`COMPLETION. Write the following to the reorder buffer associated
`with the correct PC.
`
`• Result
`• Set Valid bit
`■ Post any exception information.
`
`TESTS AT THE TAIL POINTER. If Valid= 1, the head pointer
`value is either in the last stage or is not present in the RSR(a), and
`there are no exceptions
`
`Then: retire the result and increment the head pointer
`Else: test on the next clock.
`
`Because completed results are delayed to ensure in-order retirement
`to the register file, a subsequent instruction with a true dependency on
`that result would be delayed until the result is retired. However, if the
`reorder buff er can be associatively searched on the destination register
`address by the window release logic, the needed result can be forwarded
`into the dependent instruction execution. Handling true dependencies
`with a reorder buffer is discussed in Section 8.2.3 and in [SMIT85].
`Notice that the reorder buffer, because it holds results waiting for
`in-order retirement to the register file, is another form of renaming that
`provides more implemented registers than architected registers.
`The operation of a reorder buffer is illustrated in Figure 9.10 using
`the same execution pipelines and the program fragment with a true
`dependency on Rl used in the previous examples. The contents of the
`reorder buffer and the RSR are shown for each clock period. The head
`and tail pointers point to location 2 at Time 0. Instruction il releases at
`Time 1; the destination register address and the PC value are inserted
`in the reorder buffer and the head pointer value is placed in the RSR(a).
`The tail pointer is incremented to indicate the location to receive infor(cid:173)
`mation from the next released instruction.
`As with Figure 9. 7, i2 releases at t3 , having been delayed to resolve
`the true dependency, and i3 releases at t 4 • At t 2 , the value of the head
`pointer (2) is found in the last stage of the RSR and the valid bit is set
`at the head pointer entry. Thus, i1 will write to Rl at t3 • The head
`pointer is incremented and the valid bit is reset because that entry is
`now vacant. For this example, i3 must be delayed before it is written
`into the register file to retire in-order. This is accomplished as R7 is set
`up for writing in t1 with the write being accomplished in t8 , just as is
`done with the RSR(b) of Figure 9.7.
`What happens if there is an interrupt, at say t6? Even if the write
`to R4 is accomplished, all results will have been retired in-order to the
`register file. The write to R7 will not have taken place and the processor
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 462
`
`
`
`446
`
`Exceptions and Interrupts
`
`Dest.Reg.
`
`TimeO
`Result V
`
`P.C.
`
`Timel
`Dest.Reg. Result V
`
`I
`2H,T
`
`Rt
`
`-WriteRl
`
`"
`
`p~
`I
`2 H
`T 8
`4
`6
`
`Dest.Reg.
`
`RI
`
`Time2
`Result V
`
`V
`-1/1
`
`Dest. Reg.
`
`Time4
`Result V
`
`P.C.
`
`" "
`
`Time6
`Result V
`
`P.C. /
`/
`.,,7
`"
`,1/1
`"
`
`" I
`
`R4
`R7
`
`Des t. Reg.
`
`R4
`R7
`
`3H
`4
`ST
`
`I
`2
`3 R
`
`' 5 T
`
`I~ value inserted
`
`P.C.
`
`"
`
`I
`2H
`3T
`
`P.C.
`
`"
`
`P.C.
`
`.,
`"
`
`2
`3H
`4T
`5
`
`3H
`
`ST
`
`Time3
`Dest. Reg. Result V
`
`R4
`
`Time5
`Dest. Reg. Result V
`
`R4
`R7
`
`..,
`
`1
`
`Time7
`Dest. Reg. Result V
`
`/
`
`V
`1
`
`-1./
`
`R7
`
`P.C/
`/
`
`"
`
`3
`48
`ST
`
`FIGURE 9.10 Reorder buffer operation-with true dependency.
`
`state is correct. Thus, all of the instructions in execution can be a.han(cid:173)
`doned and there is a zero-latency interrupt to start the context switch.
`The reorder buffer provides the instruction release rate and low CPI
`of an RSR(a) with the zero latency of a RSR(b).
`Figure 9.11 shows the operation of the reorder buffer with the exam(cid:173)
`ple program fragment without the true dependency.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 463
`
`
`
`9. 1 Precise Interrupts
`
`44 7
`
`OcsL Reg.
`
`Time0
`Rcault V
`
`P.C.
`
`Time!
`DesL Reg. Result V
`
`2H,T
`3
`
`RI
`
`Deel Reg.
`
`Time2
`Result V
`
`RI
`R4
`
`Ea
`~
`
`P.C.
`.,
`.,
`
`I
`2H
`
`4T
`
`Time3
`DesL Reg. Result V
`
`RSR
`
`;
`
`I
`.
`
`RI
`R4
`R7
`
`.,
`
`1
`
`P.C.
`
`-I
`
`P.C.
`.,
`.,
`.,
`
`2H
`3T
`4
`6
`
`2H
`
`ST
`
`Time5
`Desl Reg. Result V
`
`.,, 1
`
`/
`
`R4
`R7
`
`P.9,/
`
`/
`
`.,
`.,
`
`WriteR4
`
`3H
`4
`ST
`
`- I
`
`Write RI
`
` 2
`
` H
`3
`4
`6 T
`
`De.L Reg.
`
`Tirne4
`Result V
`
`@
`@
`
`RI
`R4
`R7
`
`/
`
`I
`
`y /
`.,
`
`P.V
`.,
`.,
`.,
`
`DesL Reg.
`
`Time6
`Result V
`
`P.C. /
`/
`
`/
`
`R7
`
`/
`, / / I
`
`.,
`
`.,
`WriteR7
`
`1
`2
`3
`H
`4
`T
`5
`
`FIGURE 9.11 Reorder buffer without a true dependency.
`
`Because there is no true dependency in this example, the reorder
`buffer only must restore order while using the RSR(a) for reserving time
`slots on the result bus. For this example without the true dependency,
`eight clocks are required to retire the last result into the register file(cid:173)
`the same as that for the RSR(a).
`The performance of the RSR(b) and the reorder buffer schemes for
`ensuring in-order retirement is compared to the RSR(a) and tabulated
`in Table 9.3. For the program fragment examples, the RSR(b) and the
`reorder buffer require the same number of clocks when there is a true
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 464
`
`
`
`448
`
`Exceptions and Interrupts
`
`Program
`Fragment
`
`With true
`dependencies
`
`Without true
`dependencies
`
`RSR(a)
`
`RSR(b)
`
`Reorder Buff er
`
`8 Clocks
`8 Clocks
`7 Clocks
`Normalized= 1 Normalized = Normalized=
`0.875
`0.875
`7 Clocks
`9 clocks
`7 Clocks
`Normalized= 1 Normalized= Normalized= 1
`0.77
`
`TABLE 9.3 Performance comparisons.
`
`dependency. However, without a true dependency, the reorder buffer
`requires two fewer clocks for a performance improvement of 22q over
`a RSR(b). For the program fragments without true dependencies, the
`RSR(a) and the reorder buffer give equal performance.
`These simple examples show that there can be a cost in performance
`for ensuring in-order retirement: speedup is in the range of 0.77 to 1.0
`Fractional speedup means that the performance is reduced. Smith and
`Pleszkun [SMIT85) simulated a RSRlb) and a reorder buffer added to a
`Cray-like processor having a Type C window to evaluate their perfor(cid:173)
`mance impact. In general, adding protection for precise interrupts to th:.:o
`processor degrades the performance depending on (1) the strategy usec
`and (2) how writes and true dependencies are handled. Recall that pro(cid:173)
`cessors such as the Cray do not provide support for precise interrup~
`so as not to degrade performance.
`The results of the simulation are shown in Table 9.4. The work load
`is the Lawrence Livermore Loops, and the comparison is to the Cray(cid:173)
`like processor. The table also shows the effect of the size of the reorder
`buffer.
`Another evaluation of the speedup penalty of providing in-order
`
`Number of Entries
`in the Buffer
`
`Result Shift Register (b)
`
`Reorder Buff er
`
`3
`4
`5
`8
`10
`
`0.81
`0.81
`0.81
`0.81
`0.81
`
`0.75
`0.82
`0.84
`0.85
`0.85
`
`TABLE 9.4 Result shift register and reorder buffer performance impact,
`Cray-1.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 465
`
`
`
`9. 1 Precise Interrupts
`
`449
`
`retirements is given by Wang and Emnett [WANG93]. Their results
`show that a result shift register has S = 0.81 while a reorder buffer has
`S = 0.83 compared to a Cray-like base machine. Their results will be
`discussed further in Section 9.5.
`The conclusion from the simple examples given above and the
`simulations of (SOHI87] and (WANG93] are clear. Providing RSRs
`and/or reorder buffers to ensure precise interrupts, even in a processor
`without early side effects to undo, is an expensive (in terms of perfor(cid:173)
`mance) design decision. The performance cost is in the range of 10% to
`20% in addition to the hardware cost of reorder buffers and result shift
`registers. Even in the face of hardware complexity and a loss in perfor(cid:173)
`mance, the AMD K5 (HALF94] and the Intel P6 [HALF95] processors
`are reported to use reorder buffers. With both processors, the reorder
`buffer performs the additional functions of renaming and undoing state
`changes for miss-predicted branches.
`
`Register Update Unit
`Sohi and Vajapeyam [SOHI87] describe a system called a register update
`unit (RUU), which overcomes some of the performance problems associ(cid:173)
`ated with a reorder buffer. In-order retirement is assured by using the
`register update unit upstream from the execution units, as the window,
`rather than down stream, as with the buffer functions.
`The root idea of the register update unit is found in a move to consoli(cid:173)
`date the Tomasulo CDB reservation stations into a pool instead of associ(cid:173)
`ating them with each of the function units. The RUU is similar to a Type
`C window except that instructions can be released out-of-order to