throbber
434
`
`Exceptions and Interrupts
`
`Pipeline External
`Stage
`Interrupts
`
`Software
`Interrupts
`
`Internal
`Interrupts
`
`IF
`IS
`RF
`
`EX
`DF
`DS
`TC
`
`WB
`
`Bus error Breakpoint
`Illegal instruction
`Syscall
`Instruction translation
`ECC instruction
`Coprocessor is
`Virtual coherency unusable
`instruction
`
`Interrupt
`
`Floating point
`
`Overflow
`TLB modified
`Data translation
`Virtual coherency Bus error data
`
`Data
`Watch
`NMI
`Reset
`
`TABLE 9.2 R4000 stage designation of interrupted instruction.
`
`stages are discussed in Chapter 10. However, note that a bus error on
`an instruction fetch from the cache is not signaled until the RF stage.
`The reason for this is that the data cache is pipelined and the checks of
`the cache tags occur in the RF stage as shown in Figure 2.31. For
`concurrent interrupts, the R4000 gives priority to the interrupted in(cid:173)
`struction furthest down the pipeline. This processor illustrates the
`need to recognize interrupts and save the program counter value over
`most of the stages of the pipeline.
`
`9.1.2 Handling Preceding Instructions
`The methods for handling preceding instructions, described below, are
`for systems that issue one instruction at a time. Systems that issue more
`than one instruction at a time are described in Chapter 10. Preceding
`instructions have the potential to modify the processor state in registers
`and/or memory. Recall that Smith's Condition #1 states that: All in(cid:173)
`structions preceding the instruction indicated by the saved program
`counter have been executed and have modified the process (processor)
`state correctly. This sequential execution model condition is defined as:
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 451
`
`

`

`9. 1 Precise Interrupts
`
`435
`
`A processor that satisfies the condition that "the result of an ex(cid:173)
`ecution is the same as if the operations had been executed in the
`order specified by the program" [LAMP79]. Note that a processor
`that executes the serial execution model also executes the sequential
`execution model.
`
`The sequential execution model is enforced by one of the three design
`options for handling preceding instructions shown in Table 9.1. The
`options are:
`1. to flush the pipeline, retiring all issued instructions;
`2. to take steps to ensure that all issued instructions retire in-order;
`3. to undo the processor state changes of any instructions that have
`been retired out-of-order.
`The design issues involved in selecting the option for handling the
`preceding instructions are whether or not to (1) penalize the normal
`performance of the pipeline, as measured by CPI, at the expense of a
`longer interrupt latency when there is an interrupt, or (2) have a smaller
`CPI and a longer latency. Cost and complexity are also design consider(cid:173)
`ations.
`Only a Type A instruction window (described in Chapter 8) with in(cid:173)
`order release will always have in-order retirements. The other types of
`windows have the potential for out-of-order completions and out-of-order
`retirements. Figure 9.3 shows windows with two execution units that
`can produce out-of-order retirements. Multiple execution units can have
`a different number of stages, such as an integer unit and a floating point
`unit as found in many microprocessors. Multiple register files, for integer
`and floating points, are not shown but have the same out-of-order retire(cid:173)
`ment problem. Systems with a common result bus that store integer and
`floating point values in the same register file can have structural haz(cid:173)
`ards on the output bus and the register file.
`The operation of these windows and pipelines is illustrated with two
`examples of a four-instruction sequence using the long and the short
`execution units. For the Type C window, if an interrupt occurs at the
`end of t 4 , i2 has already been retired out-of-order by writing to the
`register file. In addition, even if the interrupt had not occurred, there
`will be a structural hazard on the result bus and register file at t6 that
`must be resolved.
`For the Type D or E window, the instructions issue in-order to the
`reservation stations. For this example, instruction i3 is delayed two
`clocks because of a dependency on i2 (without forwarding). If an interrupt
`occurs at t5 , i2 will have been retired out-of-order. Notice that the release
`is out-of-order; i4 releases before i3 due to the dependency. The character(cid:173)
`istics of these two configurations follow.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 452
`
`

`

`436
`
`Exceptions and Interrupts
`
`Release on Dependency Resolution
`EU #1
`
`Type C Window
`
`il Long
`i2 Short
`i3 Short
`i4 Long
`
`Release on Dependency Resolution
`
`#1
`
`Type Dor E Wmdow
`
`5
`
`4
`
`i4
`
`2 S
`
`il
`
`i1
`
`il
`
`il □
`
`il Structunl
`Hazard
`
`2
`EU#l 3
`
`4
`
`WB
`
`EU#2 l~ " i3
`i2
`2
`i:l
`
`WB
`
`R.S.
`
`i1
`
`EU#I
`
`2
`
`3
`
`4
`
`WB
`
`R.S.
`
`i2
`
`i8
`
`t Inte1TUpt
`
`il
`
`il
`
`i4
`
`il
`
`i4
`
`il
`
`i3
`
`i3
`
`i2
`
`i3
`
`i2
`
`EU#2
`
`il Long
`i2 Short
`i3 Short
`i4 Long
`True dependency on i3, i2
`
`2
`
`i2
`1---+---+-+---+---+---f
`i2
`WB
`
`t Intern.pl
`
`FIGURE 9.3 Multiple execution units.
`
`Type C window (Cray-1, MIPS, Pentium integer) characteristics in(cid:173)
`clude
`
`1. in-order issue,
`2. dependencies resolved before release,
`3. in-order release,
`4. deterministic time from issue to completion.
`
`Type Dor E window (CDC 6600, IBM S/36O/91 Flt. Pt., MC68110•
`characteristics include
`
`1. in-order issue,
`2. dependencies resolved in the reservation stations,
`3. out-of-order release,
`4. nondeterministic time from issue to completion.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 453
`
`

`

`9.1 Precise Interrupts
`
`437
`
`Resolving Structural Hazards
`For systems with multiple-variable length execution units, the potential
`exists for a structural hazard on the result bus and register file even if
`instructions are released in-order. There are two methods for resolving
`this structural hazard.
`
`1. Deterministically schedule the "issue or release of instruction to
`eliminate the hazard.
`2. Resolve the hazard with priority logic.
`
`Consider the first method. Structural hazards can be resolved by
`delaying an instruction release until the hazard is eliminated; the result
`bus is scheduled so that hazards will not occur. Figure 9.4 shows a result
`shift register [SMITH85], called a RSR(a), that schedules the result bus
`and writes to a common register file.
`The reservation table of a shift register, shown on the right of the
`figure, is the same length as the longest execution unit pipeline; in this
`case four stages, and the stages are numbered in reverse order. This
`result shift register reserves a time slot for the result bus and the path
`to the register file but does not ensure in-order retirement. Scheduling of
`instructions into the execution units is illustrated by a three-instruction
`sequence. Instruction il, because it uses the long pipeline, places its
`destination address in stage four at t 1 . Then i2 releases and places its
`destination address into stage 2 at t 2 • Instruction i3 cannot release into
`EU #2 at t3 because the bus will be blocked by the result of il. Thus, i3
`is delayed one clock and releases at t4 • When a result is ready to exit
`the execution unit, the result bus is gated on.
`The TI ASC uses a Type C window and a RSR(a) to schedule its four
`execution units into the result bus and register file. Because of the
`number of data types, the RSR required seven bits to indicate the ad(cid:173)
`dress and type of register. For example, an operation with an upper half-
`
`EU#l
`
`2 345 678
`
`i-o
`Release
`
`EU#2
`
`Write to
`Registers
`
`il: Long, EU #1
`i2: Short, EU #2
`i3: Short, EU #2
`
`FIGURE 9.4 Result shift register (a).
`
`4
`
`3
`
`2
`
`1
`WB
`
`il
`
`il
`
`i2
`
`il
`
`i3
`
`i2
`
`il i3
`
`i2
`
`il
`
`i3
`
`• Reserves Bus and Register
`• Does Not Preserve Order
`• Requires 6 Clocks
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 454
`
`

`

`438
`
`Exceptions and Interrupts
`
`word result will not have a structural hazard with a result with a lower
`half-word result.
`The second method for resolving the structural hazard employs pri(cid:173)
`ority logic. Type C, D, and E windows cannot easily use a RSR(a) to
`schedule the result bus because of the possible release delays while
`waiting for dependencies to be resolved. The CDC 6600 scoreboard and
`CDB must resolve these structural hazards on their result bus(ses) and
`move the delays to the completion end of the execution uni ts. In the case
`of the scoreboard, each of the busses, called trunks, has priority hardware
`to delay lower priority bus requests. I am not certain, but I believe that
`some form of priority scheme is used with the CDB as well. The two
`floating point execution units probably receive COB access priority based
`on a random selection.
`
`Retirement by Flushing
`After ensuring that there are no structural hazards, the requirements
`for properly handling the preceding instructions must be satisfied. A
`common method, used in a number of processors, is a simple flush of the
`pipelines; that is, the issued instructions in the pipelines are run to
`completion and all results are retired. After flushing, the processor state
`is correct, the context switch can be performed, and the interrupt ser(cid:173)
`viced.
`Flushing is illustrated with the reservation table in Figure 9.5.
`which shows an instruction sequence: long, short, long, short. The v.in(cid:173)
`dow can be Type C, D, or E.
`
`12345678
`
`il
`
`il
`
`i3
`
`il
`
`i3
`
`il
`
`i3
`
`il
`
`i3
`
`i3
`
`2
`
`3
`
`4
`
`WB
`
`il Long
`i2 Short
`i3 Long
`i4 Short
`
`EU #1
`Reservation Table
`
`l
`
`2
`
`3 4 Is 6
`
`7
`
`ii lllll..l +
`
`Interrupt
`
`8
`
`I
`
`EU #2
`Reservation Table
`
`I
`..
`
`Context Switch
`
`FIGURE 9.5 Flush pipelines.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 455
`
`

`

`9. 1 Precise Interrupts
`
`439
`
`If an interrupt occurs at t 4 , instruction release is halted, the pipelines
`flush and perform their write backs to the ·register file out-of-order. The
`context switch does not occur until after the flush operation is complete
`and the register file is in the correct state. The flush time, in this exam(cid:173)
`ple, is three clocks. The time to flush, or the latency of the context switch,
`can be long and indeterminate. For example, all dependencies must be
`resolved before the instructions release, and there could be a major
`delay if there is an address translation miss on a released load or store
`instruction.
`The flush latency causes another problem: the instructions being
`flushed can themselves be interrupted. The general sol1.1.tion to this
`problem is to mask off all interrupts after the first has been recognized.
`Because interrupts will be masked off, unpredictable results can result
`and· an error condition is signaled by the processor.
`This strategy is useful when an internal interrupt or exception is
`detected prior to release and before any side effects can occur. Examples
`are page faults on instruction fetch or the detection of an illegal op-code.
`This simple flush scheme is called interrupts prior to instruction issue in
`[SMIT85].
`The PowerPC 601 and the Intel Pentium both use pipeline flushing
`to handle preceding instructions. Both of these processors have rela(cid:173)
`tively short execution unit pipelines, and released instructions are
`guaranteed not to produce interrupts.
`With the Pentium, the processor checks for pending interrupts before
`each instruction is issued. If an interrupt is pending, fetching instruc(cid:173)
`tions is halted and the pipeline is flushed [INTE93]. It is not clear from
`the Pentium documentation what happens if a second interrupt occurs
`during the pipeline flush operation. I assume that this is prevented by
`control logic in the processor. Some exception conditions are not handled
`by interrupts with the Pentium. Internal exceptions, such as register file
`parity errors, indicate that the processor can no longer be trusted, and
`the pipeline is shutdown with no hope of recovery.
`
`In-Order Retirement
`Because of problems with flushing, it is desirable to provide a method
`that will ensure that the processor state will be updated in-order regard(cid:173)
`less of the order of release. Design solutions to providing in-order-retire(cid:173)
`ment are discussed in the following sections and require in-order release,
`Type C windows, except where noted. Some of the solutions to this
`problem have hardware at the end of the pipelines that restore results
`into order before retirement; this hardware is called the result window
`by [LAIR92].
`Recall that the MIPS concept is based on software managing the
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 456
`
`

`

`440
`
`Exceptions and Interrupts
`
`pipeline without interlocks; the only interlock provided in the hardware
`is the load true dependency. The MIPS idea completely breaks down
`when a second execution unit-for instance, a floating point copro(cid:173)
`cessor-is added. Sohi [SOHl87) posits, "We are unaware of any software
`solutions to the imprecise interrupt problem for multiple functional unit
`computers." To cope with the problem of out-of-order-retirement, tech(cid:173)
`niques have been developed to coerce the processor to behave as if there
`are in-order-retirements even though completions are out-of-order.
`
`Equal Length Execution Unit Pipelines
`
`A simple strategy for Type C windows that ensures in-order retirement
`is to make all execution pipelines the same length. In other words, if the
`floating point pipeline is four stages, then the integer pipeline is four
`stages as well. As an integer pipeline can usually be one stage, three
`stages will simply be delays. This strategy will provide in-order retire(cid:173)
`ment with in-order release and will eliminate any structural hazards on
`the result bus or register file.
`The problem with padding out short pipelines is that other oper(cid:173)
`ations may be adversely affected. For example, the delays for true depen(cid:173)
`dencies can be increased. In addition, as the integer pipeline is frequently
`used for branch calculations. branch performance may be adversely af(cid:173)
`fected . The cost in execution time and hardware of implementing equal(cid:173)
`length pipelines is excessive, and I know of no system that uses this
`technique.
`I take issue with the proposition that there is no software solution
`to precise interrupts on multiple execution unit processors with a Type
`C window. No-ops can be used to delay instruction release to short
`pipelines, giving the same results as equal-length pipelines. There \\ill
`be a significant performance loss. I know of no published information on
`no-op scheduling for in-order retirement.
`
`Result Shift Register
`Smith [SMIT85] describes another type of result shift register, RSR(bJ.
`shown in Figure 9.6, that can be used with Type C windows. For RSR(b i.
`when instruction 1 is released to the four-stage execution unit, its destin(cid:173)
`ation address is placed in the stage of the result shift register along with
`X's in stages 3, 2, and 1. These time slots are thus blocked out or
`reserved. Instruction 2 would like to be released at t 2 , but stage 2 has
`been reserved and this instruction must wait until t 4 , when it can be
`released and its destination address placed in the result shift register.
`Instruction 3, because it is released in sequence, can now be released at
`t6 • There are no X's to block its release, as is the case with instruction
`2. As we can see, the results are retired in-order.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 457
`
`

`

`EU#l
`
`In -o r d e r~ Write to
`
`Release lffi.----J -Registern
`
`Type C Window
`
`ii: Long, EU#l
`i2: Short, EU#2
`i3: Short, EU#2
`
`9. 1 Precise Interrupts
`
`441
`
`1234 56 789
`
`4
`
`il
`
`3
`
`X
`
`i1
`
`2
`
`X
`
`X
`
`X
`
`i1
`
`X X
`
`i2
`
`i1
`
`WB
`
`i3
`
`i2
`
`il
`
`i3
`
`i2
`
`i3
`
`• Preserves Order
`Type b) • Requires 7 Clocks
`
`j EU# !Dest. Address Iv!
`FIGURE 9.6 Result shift register (b).
`
`PC
`
`In summary, when an instruction is issued, (1) a time slot is reserved
`on the single bus that connects the execution units to the register file
`and (2) the X's reserve time slots so that an instruction that uses a
`shorter pipeline will not get ahead of the issuing instruction. Delays are
`dynamically inserted, only when needed, compared to static introduction
`of no-ops into all slots whether or not delays are needed. As with the
`RSR(a), the time through the execution units must be deterministic;
`there can be no delays after the instruction is released.
`The RSR(b) entries have a number of fields; the execution unit
`number and destination address provide routing information to write
`the result into the correct register file location. A valid field is needed
`to distinguish between reserved slots and the slots that are only blocked
`out. The program counter value associated with each entry is required
`so that an instruction with an exception can be uniquely identified. The
`field for the program counter is an extension of the program counter
`pipeline of Figure 9.2.
`Because results are retired in-order, an interrupt at any time will
`find the register file processor state correct; thus, it is not necessary to
`flush the pipelines. Because the processor state is correct, the context
`switch can start immediately. The price for zero-context switch latency
`is that the normal processing CPI has been increased. For out-of-order
`results scheduled with a RSR(a), of Figure 9.4, six clocks are required
`for the last instruction write to the register file. For the RSR(b), seven
`clocks are required, a significant performance loss of 16%, to achieve in(cid:173)
`order retirements.
`In the examples to follow, two program fragments will be used to
`illustrate the operation of result shift registers handling preceding in(cid:173)
`structions. One program fragment has a true dependency while the other
`does not.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 458
`
`

`

`442
`
`Exceptions and Interrupts
`
`i1 RI - R2 + R3
`i2 R4 - Rl x R2
`i3 R7 - R6 + R8
`
`4
`
`3
`
`2
`
`1
`Write
`Back
`
`Type (a)
`4 5
`
`6
`
`7 8
`
`I
`
`2
`
`3
`
`R4
`
`Rl
`
`R4
`
`R7
`
`R4
`
`RI
`
`R7 R4
`
`RI
`
`R7 R4
`
`j
`
`Interrupt
`
`4
`
`3
`
`2
`
`Write
`Back
`
`Type (b)
`2 345 6789
`R4
`
`RI
`X RI
`
`R4
`
`X R4
`X X
`
`X
`
`X
`
`X
`
`RI
`
`R7
`
`R4 R7
`R4 R7
`
`+
`
`Interrupt
`
`FIGURE 9.7 Result shift register-with a true dependency.
`
`W True Dep.
`
`W /0 True Dep.
`
`i1 Rl +-R2 + R3
`i2 R4 +-Rl X R2
`i3 R7 +-R6 + R8
`
`i1 Rl +-R2 x R3
`i2 R4,(----R5 + R8
`i3 R7,(----R8 x R9
`
`The operation of a RSR(a) and (b) in the presence of an interrupt i.5
`illustrated with these two program fragments. The first example, shown
`in Figure 9.7, is the three-instruction program fragment with a true
`dependency on Rl between instructions il and i2, assuming forwarding
`around the write back stage. The add pipeline is assumed to have h.o
`stages while the multiply pipeline has four stages. The destination regis(cid:173)
`ter address is shown in the RSRs rather than the instruction number as
`in Figure 9.6.
`For the RSR(a), the instructions are released in-order but are retired
`out-of-order: il, i3, i2. If an interrupt occurs a time t 6 , the register file
`has been updated out-of-order. Note that if there is an i4, an add instruc(cid:173)
`tion for example, it could not release until t6 because of the reservation
`placed on the result bus and register file. There is no structural hazard
`with this example.
`Because of the introduced delays, the RSR(b) ensures in-order retire(cid:173)
`ment of results, permitting an interrupt at any time to be properly
`handled. Instruction i2 will not release until t3 because of the true
`dependency (bypassed around the register), and i3 will not release until
`t6 because of the reservation that has been placed on the result bus by
`i2. Because i2 and i3 have not modified the processor state by writing
`into the register file, they can be abandoned. The RSR(a) required seven
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 459
`
`

`

`9.1
`
`Precise Interrupts
`
`443
`
`i1Rl-R2+R3
`i2R4-R5·R8
`i3 R7 -RS+ R9
`
`Type (a)
`4 6
`
`2
`
`3
`
`6 7
`
`8
`
`Rl
`
`R7
`
`Rl
`
`R7
`
`R4
`
`Rl
`
`R7
`
`R4
`
`R7
`
`Rl
`R4 Rl
`
`R7
`
`4
`
`3
`
`2
`
`Write
`Back
`
`4
`
`3
`
`2
`
`Rl
`X
`
`X
`
`X
`
`Write
`Back
`
`2
`
`6 7 8
`
`Type (b)
`3 4 5
`R7
`X R7
`Rl
`X Rl R4 X X R7
`X X Rl R4 X X R7
`Rl R4
`
`9
`
`R7
`
`J~
`Interrupt
`FIGURE 9.8 Result shift register-without a true dependency.
`
`~
`
`Interrupt
`
`clocks to retire the sequence, while the RSR (b) required eight clocks.
`The time penalty is approximately 14% to ensure in-order retirement.
`The second example of RSR operation is shown in Figure 9.8 for the
`code fragment without a true dependency. Instructions issue and release
`in-order. With the RSR(a), there are no delays to resolve true dependen(cid:173)
`cies or structural hazards. Seven clocks are required to complete this
`sequence and, if an interrupt should occur at t 6 , i3 can be abandoned
`with the register file in the correct state-purely accidental due to the
`sequence of instructions. Such would not be the case if the interrupt had
`occurred at t4.
`The RSR(b) schedules the release of instructions, i2 at t4 and i3 at
`t5 , so that an interrupt at any time, not just at t6 as shown, will find the
`register file in the correct state. Nine clocks are required for this same
`instruction sequence-a penalty of approximately 28% to ensure in-order
`retirement. The latency to start the context switch is zero because all of
`the instructions that have not been retired are abandoned.
`
`Reorder Buffer
`The reorder buffer [SMIT85] assumes a Type C window that releases
`instructions in-order as their dependencies are resolved and ensures in(cid:173)
`order retirement to the register file without the performance penalty of
`the RSR(b). As the operations complete, the results are posted to the
`reorder buffet, which then retires the instructions, in-order. The reorder
`buffer scheme, shown in Figure 9.9, is controlled by a RSR(a) that re(cid:173)
`serves a time slot on the result bus for posting completed results to the
`reorder buffer.
`The reorder buffer is shown, and implemented, as a circular buffer
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 460
`
`

`

`444
`
`Exceptions and Interrupts
`
`Register File
`
`ln•Order
`Release
`
`Execution Unit
`
`: ............ Rb~fkt~f§~o/::1 .......... ·
`
`Reorder Buffor Enbies
`
`I DesAd~~· 1
`
`.
`
`un:= . Result
`
`I I
`
`I Program I
`
`Valid Except.ions Count.e~
`
`FIGURE 9.9 Reorder buffer.
`
`with head and tail pointers into a register file. The pointers increment
`modulo the size of the register file. The information carried in each of th€
`reorder buffer entries includes destination register address, the result of
`the operation, a valid bit that is set when the results are posted, any
`exceptions that occur during execution such as overflow, and the pro(cid:173)
`gram counter value. The instructions are released into the pipelined
`execution units in-order "With the information shown in the figure bei.::g
`placed in the reorder buffer location at the head pointer in-order. If the
`reorder buffer is full, the window stalls until there is a vacant location.
`As instructions complete (perhaps out-of-order) the results are
`written into their reserved slot in the reorder buffer and the valid bit
`is set. Rotation of the reorder buffer then permits the results to be retired
`in-order into the register file. In other words, a series of in-order slots
`rotate around the reorder buffer and are posted out-of-order by the
`execution units and then are retired in-order into the register file.
`The operation of the reorder buffer follows these steps.
`
`RELEASE. Write the following into the reorder buffer at the head
`pointer location:
`
`■ Destination Register Name
`■ PC value
`■ Increment the tail pointer.
`
`In addition, the head pointer value is written into the RSR(a).
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 461
`
`

`

`9. 1 Precise Interrupts
`
`445
`
`COMPLETION. Write the following to the reorder buffer associated
`with the correct PC.
`
`• Result
`• Set Valid bit
`■ Post any exception information.
`
`TESTS AT THE TAIL POINTER. If Valid= 1, the head pointer
`value is either in the last stage or is not present in the RSR(a), and
`there are no exceptions
`
`Then: retire the result and increment the head pointer
`Else: test on the next clock.
`
`Because completed results are delayed to ensure in-order retirement
`to the register file, a subsequent instruction with a true dependency on
`that result would be delayed until the result is retired. However, if the
`reorder buff er can be associatively searched on the destination register
`address by the window release logic, the needed result can be forwarded
`into the dependent instruction execution. Handling true dependencies
`with a reorder buffer is discussed in Section 8.2.3 and in [SMIT85].
`Notice that the reorder buffer, because it holds results waiting for
`in-order retirement to the register file, is another form of renaming that
`provides more implemented registers than architected registers.
`The operation of a reorder buffer is illustrated in Figure 9.10 using
`the same execution pipelines and the program fragment with a true
`dependency on Rl used in the previous examples. The contents of the
`reorder buffer and the RSR are shown for each clock period. The head
`and tail pointers point to location 2 at Time 0. Instruction il releases at
`Time 1; the destination register address and the PC value are inserted
`in the reorder buffer and the head pointer value is placed in the RSR(a).
`The tail pointer is incremented to indicate the location to receive infor(cid:173)
`mation from the next released instruction.
`As with Figure 9. 7, i2 releases at t3 , having been delayed to resolve
`the true dependency, and i3 releases at t 4 • At t 2 , the value of the head
`pointer (2) is found in the last stage of the RSR and the valid bit is set
`at the head pointer entry. Thus, i1 will write to Rl at t3 • The head
`pointer is incremented and the valid bit is reset because that entry is
`now vacant. For this example, i3 must be delayed before it is written
`into the register file to retire in-order. This is accomplished as R7 is set
`up for writing in t1 with the write being accomplished in t8 , just as is
`done with the RSR(b) of Figure 9.7.
`What happens if there is an interrupt, at say t6? Even if the write
`to R4 is accomplished, all results will have been retired in-order to the
`register file. The write to R7 will not have taken place and the processor
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 462
`
`

`

`446
`
`Exceptions and Interrupts
`
`Dest.Reg.
`
`TimeO
`Result V
`
`P.C.
`
`Timel
`Dest.Reg. Result V
`
`I
`2H,T
`
`Rt
`
`-WriteRl
`
`"
`
`p~
`I
`2 H
`T 8
`4
`6
`
`Dest.Reg.
`
`RI
`
`Time2
`Result V
`
`V
`-1/1
`
`Dest. Reg.
`
`Time4
`Result V
`
`P.C.
`
`" "
`
`Time6
`Result V
`
`P.C. /
`/
`.,,7
`"
`,1/1
`"
`
`" I
`
`R4
`R7
`
`Des t. Reg.
`
`R4
`R7
`
`3H
`4
`ST
`
`I
`2
`3 R
`
`' 5 T
`
`I~ value inserted
`
`P.C.
`
`"
`
`I
`2H
`3T
`
`P.C.
`
`"
`
`P.C.
`
`.,
`"
`
`2
`3H
`4T
`5
`
`3H
`
`ST
`
`Time3
`Dest. Reg. Result V
`
`R4
`
`Time5
`Dest. Reg. Result V
`
`R4
`R7
`
`..,
`
`1
`
`Time7
`Dest. Reg. Result V
`
`/
`
`V
`1
`
`-1./
`
`R7
`
`P.C/
`/
`
`"
`
`3
`48
`ST
`
`FIGURE 9.10 Reorder buffer operation-with true dependency.
`
`state is correct. Thus, all of the instructions in execution can be a.han(cid:173)
`doned and there is a zero-latency interrupt to start the context switch.
`The reorder buffer provides the instruction release rate and low CPI
`of an RSR(a) with the zero latency of a RSR(b).
`Figure 9.11 shows the operation of the reorder buffer with the exam(cid:173)
`ple program fragment without the true dependency.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 463
`
`

`

`9. 1 Precise Interrupts
`
`44 7
`
`OcsL Reg.
`
`Time0
`Rcault V
`
`P.C.
`
`Time!
`DesL Reg. Result V
`
`2H,T
`3
`
`RI
`
`Deel Reg.
`
`Time2
`Result V
`
`RI
`R4
`
`Ea
`~
`
`P.C.
`.,
`.,
`
`I
`2H
`
`4T
`
`Time3
`DesL Reg. Result V
`
`RSR
`
`;
`
`I
`.
`
`RI
`R4
`R7
`
`.,
`
`1
`
`P.C.
`
`-I
`
`P.C.
`.,
`.,
`.,
`
`2H
`3T
`4
`6
`
`2H
`
`ST
`
`Time5
`Desl Reg. Result V
`
`.,, 1
`
`/
`
`R4
`R7
`
`P.9,/
`
`/
`
`.,
`.,
`
`WriteR4
`
`3H
`4
`ST
`
`- I
`
`Write RI
`
` 2
`
` H
`3
`4
`6 T
`
`De.L Reg.
`
`Tirne4
`Result V
`
`@
`@
`
`RI
`R4
`R7
`
`/
`
`I
`
`y /
`.,
`
`P.V
`.,
`.,
`.,
`
`DesL Reg.
`
`Time6
`Result V
`
`P.C. /
`/
`
`/
`
`R7
`
`/
`, / / I
`
`.,
`
`.,
`WriteR7
`
`1
`2
`3
`H
`4
`T
`5
`
`FIGURE 9.11 Reorder buffer without a true dependency.
`
`Because there is no true dependency in this example, the reorder
`buffer only must restore order while using the RSR(a) for reserving time
`slots on the result bus. For this example without the true dependency,
`eight clocks are required to retire the last result into the register file(cid:173)
`the same as that for the RSR(a).
`The performance of the RSR(b) and the reorder buffer schemes for
`ensuring in-order retirement is compared to the RSR(a) and tabulated
`in Table 9.3. For the program fragment examples, the RSR(b) and the
`reorder buffer require the same number of clocks when there is a true
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 464
`
`

`

`448
`
`Exceptions and Interrupts
`
`Program
`Fragment
`
`With true
`dependencies
`
`Without true
`dependencies
`
`RSR(a)
`
`RSR(b)
`
`Reorder Buff er
`
`8 Clocks
`8 Clocks
`7 Clocks
`Normalized= 1 Normalized = Normalized=
`0.875
`0.875
`7 Clocks
`9 clocks
`7 Clocks
`Normalized= 1 Normalized= Normalized= 1
`0.77
`
`TABLE 9.3 Performance comparisons.
`
`dependency. However, without a true dependency, the reorder buffer
`requires two fewer clocks for a performance improvement of 22q over
`a RSR(b). For the program fragments without true dependencies, the
`RSR(a) and the reorder buffer give equal performance.
`These simple examples show that there can be a cost in performance
`for ensuring in-order retirement: speedup is in the range of 0.77 to 1.0
`Fractional speedup means that the performance is reduced. Smith and
`Pleszkun [SMIT85) simulated a RSRlb) and a reorder buffer added to a
`Cray-like processor having a Type C window to evaluate their perfor(cid:173)
`mance impact. In general, adding protection for precise interrupts to th:.:o
`processor degrades the performance depending on (1) the strategy usec
`and (2) how writes and true dependencies are handled. Recall that pro(cid:173)
`cessors such as the Cray do not provide support for precise interrup~
`so as not to degrade performance.
`The results of the simulation are shown in Table 9.4. The work load
`is the Lawrence Livermore Loops, and the comparison is to the Cray(cid:173)
`like processor. The table also shows the effect of the size of the reorder
`buffer.
`Another evaluation of the speedup penalty of providing in-order
`
`Number of Entries
`in the Buffer
`
`Result Shift Register (b)
`
`Reorder Buff er
`
`3
`4
`5
`8
`10
`
`0.81
`0.81
`0.81
`0.81
`0.81
`
`0.75
`0.82
`0.84
`0.85
`0.85
`
`TABLE 9.4 Result shift register and reorder buffer performance impact,
`Cray-1.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 465
`
`

`

`9. 1 Precise Interrupts
`
`449
`
`retirements is given by Wang and Emnett [WANG93]. Their results
`show that a result shift register has S = 0.81 while a reorder buffer has
`S = 0.83 compared to a Cray-like base machine. Their results will be
`discussed further in Section 9.5.
`The conclusion from the simple examples given above and the
`simulations of (SOHI87] and (WANG93] are clear. Providing RSRs
`and/or reorder buffers to ensure precise interrupts, even in a processor
`without early side effects to undo, is an expensive (in terms of perfor(cid:173)
`mance) design decision. The performance cost is in the range of 10% to
`20% in addition to the hardware cost of reorder buffers and result shift
`registers. Even in the face of hardware complexity and a loss in perfor(cid:173)
`mance, the AMD K5 (HALF94] and the Intel P6 [HALF95] processors
`are reported to use reorder buffers. With both processors, the reorder
`buffer performs the additional functions of renaming and undoing state
`changes for miss-predicted branches.
`
`Register Update Unit
`Sohi and Vajapeyam [SOHI87] describe a system called a register update
`unit (RUU), which overcomes some of the performance problems associ(cid:173)
`ated with a reorder buffer. In-order retirement is assured by using the
`register update unit upstream from the execution units, as the window,
`rather than down stream, as with the buffer functions.
`The root idea of the register update unit is found in a move to consoli(cid:173)
`date the Tomasulo CDB reservation stations into a pool instead of associ(cid:173)
`ating them with each of the function units. The RUU is similar to a Type
`C window except that instructions can be released out-of-order to

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket