Instruction type        Pipe stages
Integer instruction     IF   ID   EX   MEM  WB
FP instruction          IF   ID   EX   MEM  WB
Integer instruction          IF   ID   EX   MEM  WB
FP instruction               IF   ID   EX   MEM  WB
Integer instruction               IF   ID   EX   MEM  WB
FP instruction                    IF   ID   EX   MEM  WB
Integer instruction                    IF   ID   EX   MEM  WB
FP instruction                         IF   ID   EX   MEM  WB

FIGURE 6.46 Superscalar pipeline in operation. The integer and floating-point instructions are issued at the same time, and each executes at its own pace through the pipeline. This scheme will only improve the performance of programs with a fair amount of floating point.

By issuing an integer and a floating-point operation in parallel, the need for additional hardware is minimized: integer and floating-point operations use different register sets and different functional units. The only conflict arises when the integer instruction is a floating-point load, store, or move. This creates contention for the floating-point register ports and may also create a hazard if the floating-point operation uses the result of a floating-point load issued at the same time. Both problems could be solved by detecting this contention as a structural hazard and delaying the issue of the floating-point instruction. The contention could also be eliminated by providing two additional ports, a read and a write, on the floating-point register file. We would also need to add several additional bypass paths to avoid performance loss.

There is another difficulty that may limit the effectiveness of a superscalar pipeline. In our simple DLX pipeline, loads had a latency of one clock cycle; this prevented the instruction immediately following a load from using its result without stalling. In the superscalar pipeline, the result of a load instruction cannot be used on the same clock cycle or on the next clock cycle. This means that the next three instructions cannot use the load result without stalling; without extra ports, moves between the register sets are similarly affected. The branch delay also becomes three instructions. To effectively exploit the parallelism available in a superscalar machine, more ambitious compiler-scheduling techniques, as well as more complex instruction decoding, will need to be implemented. Loop unrolling helps generate larger straightline fragments for scheduling; more powerful compiler techniques are discussed near the end of this section.

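The arithmetic behind those counts generalizes with the issue width. A minimal sketch (the formula and the parameter names are illustrative assumptions, not from the text):

    # Hedged sketch: how many instructions after a load cannot use its result
    # on a statically scheduled multiple-issue machine.
    def blocked_instructions(issue_width: int, load_delay_cycles: int) -> int:
        # The load's companions in the same issue packet, plus every instruction
        # issued during the delay cycles, are too early to see the loaded value.
        return issue_width * (load_delay_cycles + 1) - 1

    # Plain DLX: 1 instruction per cycle, 1 delay cycle -> 1 blocked instruction.
    assert blocked_instructions(1, 1) == 1
    # Superscalar DLX: 2 instructions per cycle, 1 delay cycle -> 3 blocked
    # instructions, matching the "next three instructions" claim above.
    assert blocked_instructions(2, 1) == 3
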
Let's see how well loop unrolling and scheduling work on a superscalar version of DLX with the same delays in clock cycles.

Example

How would the unrolled loop on page 317 be scheduled on a superscalar pipeline for DLX? To schedule it without any delays, we will need to unroll it to make five copies of the body.

Answer

The resulting code is shown in Figure 6.47.

Integer instruction          FP instruction          Clock cycle
Loop:  LD   F0,0(R1)                                      1
       LD   F6,-8(R1)                                     2
       LD   F10,-16(R1)     ADDD F4,F0,F2                 3
       LD   F14,-24(R1)     ADDD F8,F6,F2                 4
       LD   F18,-32(R1)     ADDD F12,F10,F2               5
       SD   0(R1),F4        ADDD F16,F14,F2               6
       SD   -8(R1),F8       ADDD F20,F18,F2               7
       SD   -16(R1),F12                                   8
       SD   -24(R1),F16                                   9
       SUB  R1,R1,#40                                    10
       BNEZ R1,LOOP                                      11
       SD   8(R1),F20                                    12

FIGURE 6.47 The unrolled and scheduled code as it would look on a superscalar DLX.

This unrolled superscalar loop now runs in 12 clock cycles per iteration, or 2.4 clock cycles per element, versus 3.5 for the scheduled and unrolled loop on the ordinary DLX pipeline. In this example, the performance of the superscalar DLX is limited by the balance between integer and floating-point computation. Every floating-point instruction is issued together with an integer instruction, but there are not enough floating-point instructions to keep the floating-point pipeline full. When scheduled, the original loop ran in 6 clock cycles per iteration. We have improved on that by a factor of 2.5, more than half of which came from loop unrolling, which took us from 6 to 3.5, with the rest coming from issuing more than one instruction per clock cycle.

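The ratios quoted above are easy to verify; a quick arithmetic check (a sketch added here, not part of the original example):

    # Cycles per element for the three schedules discussed in the text.
    superscalar = 12 / 5     # unrolled superscalar loop: 12 cycles, 5 elements -> 2.4
    unrolled_dlx = 3.5       # scheduled and unrolled loop on the ordinary DLX pipeline
    scheduled_dlx = 6.0      # original scheduled loop: 6 cycles per iteration

    print(superscalar)                    # 2.4
    print(scheduled_dlx / superscalar)    # 2.5x overall improvement
    print(scheduled_dlx / unrolled_dlx)   # ~1.7x of that comes from unrolling alone
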
Ideally, our superscalar machine will pick up two instructions and issue them both if the first is an integer and the second is a floating-point instruction. If they do not fit this pattern, which can be quickly detected, then they are issued sequentially. This points to one of the major advantages of a general superscalar machine: There is little impact on code density, and even unscheduled programs can be run. The number of issues and classes of instructions that can be issued together are the major factors that differentiate superscalar processors.

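That issue rule can be sketched in a few lines (an illustrative model only; instructions are reduced to their class):

    # Hedged sketch of the dual-issue rule described above.  Instructions are
    # modeled only by their class ('int' or 'fp'); everything else is assumed.
    def issue_packets(instr_classes):
        packets, i = [], 0
        while i < len(instr_classes):
            pair = instr_classes[i:i + 2]
            if pair == ['int', 'fp']:          # the one pattern that dual-issues
                packets.append(tuple(pair))
                i += 2
            else:                              # otherwise issue sequentially
                packets.append((instr_classes[i],))
                i += 1
        return packets

    # An int/fp/int/int stream: only the first two go out together.
    print(issue_packets(['int', 'fp', 'int', 'int']))
    # [('int', 'fp'), ('int',), ('int',)]
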
Multiple Instruction Issue with Dynamic Scheduling

Multiple instruction issue can also be applied to dynamically scheduled machines. We could start with either the scoreboard scheme or Tomasulo's algorithm. Let's assume we want to extend Tomasulo's algorithm to support issuing two instructions per clock cycle, one integer and one floating point. We do not want to issue instructions in the queue out of order, since this makes the bookkeeping in the register file impossible. Rather, by employing data structures for the integer and floating-point registers, both types of instructions can be issued to their respective reservation stations, as long as the two instructions at the head of the instruction queue do not access the same register set. Unfortunately, this approach bars issuing two instructions with a dependence in the same clock cycle. This is, of course, true in the superscalar case, where it is clearly the compiler's problem. There are three approaches that can be used to achieve dual issue. First, we could use software scheduling to ensure that dependent instructions do not appear adjacent. However, this would require pipeline-scheduling software, thereby defeating one of the advantages of dynamically scheduled pipelines.

A second approach is to pipeline the instruction-issue stage so that it runs twice as fast as the basic clock rate. This permits updating the tables before processing the next instruction; then the two instructions can begin execution at once.

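A minimal sketch of that double-pumped issue stage (illustrative; the table layout and field names are assumptions): the register-status table is updated after the first instruction of the pair, so a dependent second instruction picks up the producer's reservation-station tag.

    # Hedged model of a double-pumped issue stage for a Tomasulo-style machine.
    # 'register_status' maps a register name to the reservation station that will
    # produce it (absent means the value is already in the register file).
    def issue_pair(first, second, register_status):
        for instr in (first, second):          # two phases within one clock cycle
            # Phase: rename sources using the current table ...
            instr['sources'] = [register_status.get(r, r) for r in instr['src']]
            # ... then claim the destination before the next instruction looks.
            register_status[instr['dst']] = instr['station']

    load = {'dst': 'F0', 'src': ['R1'],       'station': 'LOAD1'}
    add  = {'dst': 'F4', 'src': ['F0', 'F2'], 'station': 'ADD1'}
    status = {}
    issue_pair(load, add, status)
    print(add['sources'])   # ['LOAD1', 'F2'] -- the ADDD waits on the load's station
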
The third approach is based on the observation that if multiple instructions are not being issued to the same functional unit, then it will only be loads and stores that will create dependences among instructions that we wish to issue together. The need for reservation tables for loads and stores can be eliminated by using queues for the result of a load and for the source operand of a store. Since dynamic scheduling is most effective for loads and stores, while static scheduling is highly effective in register-register code sequences, we could use static scheduling to eliminate reservation stations completely and rely only on the queues for loads and stores. This style of machine organization has been called a decoupled architecture.

For simplicity, let us assume that we have pipelined the instruction issue logic so that we can issue two operations that are dependent but use different functional units. Let's see how this would work with our example.

Example

Consider the execution of our simple loop on a DLX pipeline extended with Tomasulo's algorithm and with multiple issue. Assume that both a floating-point and an integer operation can be issued on every clock cycle, even if they are related. The number of cycles of latency per instruction is the same. Assume that issue and write results take one cycle each, and that there is dynamic branch-prediction hardware. Create a table showing when each instruction issues, begins execution, and writes its result, for the first two iterations of the loop. Here is the original loop:

Loop:  LD    F0,0(R1)
       ADDD  F4,F0,F2
       SD    0(R1),F4
       SUB   R1,R1,#8
       BNEZ  R1,LOOP

Answer

The loop will be dynamically unwound and, whenever possible, instructions will be issued in pairs. The result is shown in Figure 6.48. In steady state each iteration takes 4 clock cycles, and the small start-up cost is amortized over the n iterations, so for large n the loop approaches 4 clock cycles per result.

Iteration                          Issues at       Executes at     Writes result at
number     Instructions            clock-cycle     clock-cycle     clock-cycle
                                   number          number          number
1          LD   F0,0(R1)           1               2               4
1          ADDD F4,F0,F2           1               5               8
1          SD   0(R1),F4           2               9
1          SUB  R1,R1,#8           3               4               5
1          BNEZ R1,LOOP            4               5
2          LD   F0,0(R1)           5               6               8
2          ADDD F4,F0,F2           5               9               12
2          SD   0(R1),F4           6               13
2          SUB  R1,R1,#8           7               8               9
2          BNEZ R1,LOOP            8               9

FIGURE 6.48 The time of issue, execution, and writing result for a dual-issue version of our Tomasulo pipeline. The write-result stage does not apply to either stores or branches, since they do not write any registers.

The number of dual issues is small because there is only one floating-point operation per iteration. The relative number of dual-issued instructions would be helped by the compiler partially unwinding the loop to reduce the instruction count by eliminating loop overhead. With that transformation, the loop would run as fast as on a superscalar machine. We will return to this transformation in Exercises 6.16 and 6.17.

The VLIW Approach

Our superscalar DLX machine can issue two instructions per clock cycle. That could perhaps be extended to three or at most four, but it becomes difficult to determine whether three or four instructions can all issue simultaneously without knowing what order the instructions could be in when fetched and what dependencies might exist among them. An alternative is an LIW (Long Instruction Word) or VLIW (Very Long Instruction Word) architecture. VLIWs use multiple, independent functional units. Rather than attempting to issue multiple, independent instructions to the units, a VLIW packages the multiple operations into one very long instruction, hence the name. A VLIW instruction might include two integer operations, two floating-point operations, two memory references, and a branch. An instruction would have a set of fields for each functional unit, perhaps 16 to 24 bits per unit, yielding an instruction length of between 112 and 168 bits. To keep the functional units busy there must be enough work in a straightline code sequence to keep the instructions scheduled. This is accomplished by unrolling loops and scheduling code across basic blocks using a technique called trace scheduling. In addition to eliminating branches by unrolling loops, trace scheduling provides a method to move instructions across branch points. We will discuss trace scheduling more in the next section. For now, let's assume we have a technique to generate long, straightline code sequences for building up VLIW instructions.

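To make the format concrete, here is a small sketch of a VLIW word and a purely structural packer (the slot names and packing rule are illustrative assumptions; data dependences are deliberately not checked):

    # Hedged sketch: one VLIW instruction word with a fixed slot per functional
    # unit, plus a greedy packer that fills slots from a list of (unit, op) pairs.
    SLOTS = ['mem1', 'mem2', 'fp1', 'fp2', 'int_or_branch']

    def pack(operations):
        """Pack operations into VLIW words; start a new word when a slot is taken."""
        words, current = [], dict.fromkeys(SLOTS)
        for unit, op in operations:
            if current[unit] is not None:      # slot already used in this word
                words.append(current)
                current = dict.fromkeys(SLOTS)
            current[unit] = op
        words.append(current)
        return words

    ops = [('mem1', 'LD F0,0(R1)'), ('mem2', 'LD F6,-8(R1)'),
           ('fp1', 'ADDD F12,F10,F2'), ('mem1', 'SD -16(R1),F20')]
    for word in pack(ops):
        print(word)   # two words; the second holds only the store
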
Example

Suppose we have a VLIW that could issue two memory references, two FP operations, and one integer operation or branch in every clock cycle. Show an unrolled version of the vector sum loop for such a machine. Unroll as many times as necessary to eliminate any stalls. Ignore the branch-delay slot.

Answer

The code is shown in Figure 6.49. The loop has been unrolled 7 times, which eliminates stalls, and runs in 9 cycles. This yields a running rate of 7 results in 9 cycles, or about 1.29 clock cycles per result.

Memory             Memory             FP                  FP                  Integer operation
reference 1        reference 2        operation 1         operation 2         / branch
LD F0,0(R1)        LD F6,-8(R1)
LD F10,-16(R1)     LD F14,-24(R1)
LD F18,-32(R1)     LD F22,-40(R1)     ADDD F4,F0,F2       ADDD F8,F6,F2
LD F26,-48(R1)                        ADDD F12,F10,F2     ADDD F16,F14,F2
                                      ADDD F20,F18,F2     ADDD F24,F22,F2
SD 0(R1),F4        SD -8(R1),F8       ADDD F28,F26,F2
SD -16(R1),F12     SD -24(R1),F16
SD -32(R1),F20     SD -40(R1),F24                                             SUB R1,R1,#48
SD 0(R1),F28                                                                  BNEZ R1,LOOP

FIGURE 6.49 VLIW instructions that occupy the inner loop and replace the unrolled sequence. This code takes nine cycles assuming no branch delay; normally the branch would also be scheduled. The issue rate is 23 operations in 9 clock cycles, or 2.5 operations per cycle. The efficiency, the percentage of available slots that contained an operation, is about 60%. To achieve this issue rate requires a much larger number of registers than DLX would normally use in this loop.

What are the limitations and costs of a VLIW approach? If we can issue 5 operations per clock cycle, why not 50? Three different limitations are encountered: limited parallelism, limited hardware resources, and code size explosion. The first is the simplest: there is a limited amount of parallelism available in instruction sequences. Unless loops are unrolled very large numbers of times, there may not be enough operations to fill the instructions. At first glance, it might appear that 5 instructions that could be executed in parallel would be sufficient to keep our VLIW completely busy. This, however, is not the case. Several of these functional units, the memory, the branch, and the floating-point units, will be pipelined, requiring a much larger number of operations that can be executed in parallel. For example, if the floating-point pipeline has 8 steps, the 2 operations being issued on a clock cycle cannot depend on any of the 14 operations already in the floating-point pipeline. Thus, we need to find a number of independent operations roughly equal to the average pipeline depth times the number of functional units. This means about 15 to 20 operations would be needed to keep a VLIW with 5 functional units busy.

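A back-of-the-envelope check of that estimate (a sketch; the per-unit depths below are assumed for illustration only):

    # Independent operations needed ~ average pipeline depth x number of units.
    pipeline_depth = {'mem1': 2, 'mem2': 2, 'fp1': 7, 'fp2': 7, 'int_or_branch': 1}
    units = len(pipeline_depth)                            # 5 functional units
    average_depth = sum(pipeline_depth.values()) / units   # 3.8 with these assumed depths
    print(round(average_depth * units))                    # 19 -- in the 15-to-20 range above
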
The second cost, the hardware resources for a VLIW, seems quite straightforward: duplicating the floating-point and integer functional units is easy, and cost scales linearly. However, there is a large increase in the memory- and register-file bandwidth. Even with a split floating-point and integer register file, our VLIW will require 5 read ports and 2 write ports on the integer register file and 4 read ports and 2 write ports on the floating-point register file. This bandwidth cannot be supported without some substantial cost in the size of the register file and possible degradation of clock speed. Our 5-unit VLIW also has 2 data memory ports. Furthermore, if we wanted to expand it, we would need to continue adding memory ports. Adding only arithmetic units would not help, since the machine would be starved for memory bandwidth. As the number of data memory ports grows, so does the complexity of the memory system. To allow multiple memory accesses in parallel, the memory must be broken into banks containing different addresses with the hope that the operations in a single instruction do not have conflicting accesses. A conflict will cause the entire machine to stall, since all the functional units must be kept synchronized. This same factor makes it extremely difficult to use data caches in a VLIW.

Finally, there is the problem of code size. There are two different elements that combine to increase code size substantially. First, generating enough operations in a straightline code fragment requires ambitiously unrolling loops, which increases code size. Second, whenever instructions are not full, the unused functional units translate to wasted bits in the instruction encoding. In Figure 6.49, we saw that only about 60% of the functional units were used; almost half of each instruction was empty. To combat this problem, clever encodings are sometimes used. For example, there may be only one large immediate field for use by any functional unit. Another technique is to compress the instructions in main memory and expand them when they are read into the cache or are decoded.

The major challenge for these machines is to try to exploit large amounts of instruction-level parallelism. When the parallelism comes from unrolling simple loops, the original loop probably could have been run efficiently on a vector machine (see the next chapter). It is not clear that a VLIW is preferred over a vector machine for such applications; the costs are similar, and the vector machine is typically the same speed or faster. The open question in 1990 is whether there are large classes of applications that are not suitable for vector machines, but still offer enough parallelism to justify the VLIW approach rather than a simpler one, such as a superscalar machine.

Increasing Instruction-Level Parallelism with Software Pipelining and Trace Scheduling

We have already seen that one compiler technique, loop unrolling, is used to help exploit parallelism among instructions. Loop unrolling creates longer sequences of straightline code, which can be used to exploit more instruction-level parallelism. There are two other more general techniques that have been developed for this purpose: software pipelining and trace scheduling.

Software pipelining is a technique for reorganizing loops such that each iteration in the software-pipelined code is made from instruction sequences chosen from different iterations in the original code segment. This is most easily understood by looking at the scheduled code for the superscalar version of DLX. The scheduler essentially interleaves instructions from different loop iterations, putting together all the loads, then all the adds, then all the stores. A software-pipelined loop interleaves instructions from different iterations without unrolling the loop. This technique is the software counterpart to what Tomasulo's algorithm does in hardware. The software-pipelined loop would contain one load, one add, and one store, each from a different iteration. There is also some start-up code that is needed before the loop begins as well as code to finish up after the loop is completed. We will ignore these in this discussion.

Example

Show a software-pipelined version of this loop:

Loop:  LD    F0,0(R1)
       ADDD  F4,F0,F2
       SD    0(R1),F4
       SUB   R1,R1,#8
       BNEZ  R1,LOOP

You may omit the start-up and clean-up code.

Answer

Given the vector M in memory, and ignoring the start-up and finishing code, we have:

Loop:  SD    0(R1),F4      ;stores into M[i]
       ADDD  F4,F0,F2      ;adds to M[i-1]
       LD    F0,-16(R1)    ;loads M[i-2]
       BNEZ  R1,LOOP
       SUB   R1,R1,#8      ;subtract in delay slot

This loop can be run at a rate of 5 cycles per result, ignoring the start-up and clean-up portions. Because the load fetches two array elements beyond the element count, the loop should run for two fewer iterations. This would be accomplished by decrementing R1 by 16 prior to the loop.

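The reorganization can be sketched symbolically (an illustrative model, not a production software pipeliner): stage k of the steady-state kernel comes from iteration i - k.

    # Hedged sketch of symbolic software pipelining.  'stages' lists the loop-body
    # operations in original order; the kernel pairs stage k with iteration i - k.
    def software_pipeline_kernel(stages):
        # Emit stages in reverse order, each tagged with the iteration it serves,
        # mirroring the SD / ADDD / LD ordering in the answer above.
        return [(op, f'iteration i-{k}')
                for k, op in enumerate(reversed(stages))]

    for op, iteration in software_pipeline_kernel(['LD', 'ADDD', 'SD']):
        print(op, '<-', iteration)
    # SD <- iteration i-0
    # ADDD <- iteration i-1
    # LD <- iteration i-2
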
Software pipelining can be thought of as symbolic loop unrolling. Indeed, some of the algorithms for software pipelining use loop unrolling to figure out how to software pipeline the loop. The major advantage of software pipelining over straight loop unrolling is that software pipelining consumes less code space. Software pipelining and loop unrolling, in addition to yielding a better scheduled inner loop, each reduce a different type of overhead. Loop unrolling reduces the overhead of the loop: the branch and counter-update code. Software pipelining reduces the time when the loop is not running at peak speed to once per loop, at the beginning and end. If we unroll a loop that does 100 iterations a constant number of times, say 4, we pay the overhead 100/4 = 25 times, every time the inner unrolled loop is reinitiated. Figure 6.50 shows this behavior graphically. Because these techniques attack two different types of overhead, the best performance comes from doing both.

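Counting those overhead events directly (a small sketch using the numbers above):

    # Loop overhead paid under unrolling vs. software pipelining.
    iterations, unroll_factor = 100, 4
    unrolling_overhead_events = iterations // unroll_factor   # 25 re-entries of the unrolled loop
    software_pipelining_events = 2                            # one start-up, one wind-down
    print(unrolling_overhead_events, software_pipelining_events)   # 25 2
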
The other technique used to generate additional parallelism is trace scheduling. This is particularly useful for VLIWs, for which the technique was originally developed. Trace scheduling is a combination of two separate processes. The first process, called trace selection, tries to find the most likely sequence of operations to put together into a small number of instructions; this sequence is called a trace. Loop unrolling is used to generate long traces, since loop branches are taken with high probability. Once a trace is selected, the second process, called trace compaction, tries to squeeze the trace into a small number of wide instructions. Trace compaction attempts to move operations as early as it can in a sequence (trace), packing the operations into as few wide instructions as possible.

There are two different considerations in compacting a trace: data dependences, which force a partial order on operations, and branch points, which create places across which code cannot be easily moved. In essence, the code wants to be compacted into the shortest possible sequence that preserves the data dependences; branches are the main impediment to this process. The major advantage of trace scheduling over simpler pipeline-scheduling techniques is that it includes a method to move code across branches. Figure 6.51 shows a code fragment, which may be thought of as an iteration of an unrolled loop, and the trace selected.

Once the trace is selected as shown in Figure 6.51, it must be compacted so as to fill the wide instruction word. Compacting the trace involves moving the assignments to variables B and C up to the block before the branch decision. Let's first consider the problem of moving the assignment to B. If the assignment to B is moved above the branch (and thus out of the trace), the code in X would be affected if it used B, since moving the assignment would change the value of B. Thus, to move the assignment to B, B must not be read in X. One could imagine more clever schemes if B were read in X, for example, making a shadow copy and updating B later. Such schemes are generally not used, both because they are complex to implement and because they will slow down the program if the trace selected is not optimal and the operations end up requiring additional instructions. Also, because the assignment to B is moved before the if test, for this schedule to be valid either X also assigns to B or B is not read after the if statement.

Moving the assignment to C up to before the first branch requires first moving it over the branch from X into the trace. To do this, a copy is made of the assignment to C on the branch into the trace. A check must still be done, as was done for B, to make sure that the assignment can be moved over the branch out of the trace. If C is successfully moved to before the first branch and the "false" direction of the branch, the branch off the trace, is taken, the assignment to C will have been done twice. This may be slower than the original code, depending on whether this operation or other moved operations create additional work in the main trace. Ironically, the more successful the trace-scheduling algorithm is in moving code across the branch, the higher the penalty for misprediction.

Loop unrolling, trace scheduling, and software pipelining all aim at trying to increase the amount of local instruction parallelism that can be exploited by a machine issuing more than one instruction on every clock cycle. The effectiveness of each of these techniques and their suitability for various architectural approaches are among the most significant open research areas in pipelined-processor design.

6.9 Putting It All Together: A Pipelined VAX

In this section we will examine the pipeline of the VAX 8600, a macropipelined VAX. This machine is described in detail by DeRosa et al. [1985] and Troiani et al. [1985]. The 8600 pipeline is a more dynamic structure than the DLX integer pipeline. This is because the processing steps may take multiple cycles in one stage of the pipeline. Additionally, the hazard detection is more complicated because of the possibility that stages progress independently and because instructions may modify registers before they complete. Techniques similar to those used in the DLX FP pipeline to handle variable-length instructions are used in the 8600 pipeline.

The 8600 is macropipelined: the pipeline understands the structure of VAX instructions and overlaps their execution, checking the hazards on the instruction

Step             Function                                       Located in
1. Ifetch        Prefetch instruction bytes and decode them     IBox
2. Opfetch       Operand address calculation and fetch          IBox
3. Execution     Execute opcode and write result                EBox, FBox
4. Result store  Write result to memory or registers            EBox, IBox

FIGURE 6.53 The basic structure of the 8600 pipeline has four stages, each taking from 1 to a large number of clock cycles. Up to four VAX instructions are being processed at once.

stage may cause a backup in the pipeline; this backup may eventually reach the Ifetch step, where it will cause the pipeline to simply stop fetching instructions. Additionally, several resources (e.g., the W Bus and GPR ports) are contended for by multiple stages in the pipeline. In general, these problems are resolved on the fly using a fixed-priority scheme.

Operand Decode and Fetch

Much of the work in interpreting a VAX instruction is in the operand specifier and decode process, and this is the heart of the IBox. Substantial effort is devoted to decoding and fetching operands as fast as possible to keep instructions flowing through the pipeline. Figure 6.54 shows the number of cycles spent in Opfetch under ideal conditions (no cache misses or other stalls from the memory hierarchy) for each operand specifier.

Specifier                    Cycles
Literal or immediate         1
Register                     1
Deferred                     1
Displacement                 1
PC-relative and absolute     1
Autodecrement                1
Autoincrement                2
Autoincrement deferred       5
Displacement deferred        4
PC-relative deferred         4

FIGURE 6.54 The minimum number of cycles spent in Opfetch by operand specifier. This shows the data for an operand of type byte, word, or longword that is read. Modified and written operands take an additional cycle, except for register mode and immediate or literal, where writes are not allowed. Quadword and octaword operands may take much longer. If any stalls are encountered, the cycle count will increase.

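As a rough illustration of how these per-specifier costs add up, here is a sketch (the cycle values mirror Figure 6.54 and the extra cycle for written memory operands follows the caption; everything else, including the helper name, is an assumption):

    # Minimum Opfetch cycles per operand specifier, taken from Figure 6.54.
    SPECIFIER_CYCLES = {
        'literal': 1, 'register': 1, 'deferred': 1, 'displacement': 1,
        'pc_relative': 1, 'autodecrement': 1, 'autoincrement': 2,
        'autoincrement_deferred': 5, 'displacement_deferred': 4,
        'pc_relative_deferred': 4,
    }

    def min_opfetch_cycles(specifiers):
        """Specifiers are (mode, is_written) pairs; written memory operands cost one extra cycle."""
        total = 0
        for mode, is_written in specifiers:
            total += SPECIFIER_CYCLES[mode]
            if is_written and mode not in ('register', 'literal'):
                total += 1
        return total

    # ADDL3 R1,R2,56(R3): two register reads plus a written displacement operand.
    print(min_opfetch_cycles([('register', False), ('register', False),
                              ('displacement', True)]))   # 4
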
If the result is a register, the EBox stores the result. If the result is a memory operand, Opfetch calculates the address and waits for the EBox to signal ready; then the IBox stores the result during the Result store step. If an instruction result is to be stored in memory, the EBox signals to the IBox when it enters the last cycle of execution for the instruction. This allows Opfetch to overlap the first cycle of a two-cycle memory write with the last cycle of execution (even if the operation only takes one cycle).

To maximize the performance of the machine, there are three copies of the GPRs, one each in the IBox, EBox, and FBox. A write is broadcast from the FBox, EBox, or IBox (in the case of autoincrement or autodecrement addressing) to the other two units, so that their copies of the registers can be updated.

Handling Data Dependences

Register hazards are tracked in Opfetch by maintaining a small table of registers that will be written. Whenever an instruction passes through Opfetch, its result register is marked as busy. If an instruction that uses that register arrives in Opfetch and sees the busy flag set, it stalls until the flag is cleared. This prevents RAW hazards. The busy flag is cleared when the register is written. Because there are only two stages after Opfetch (execute and write memory result), the busy flag can be implemented as a two-entry associative memory. Writes are maintained in order and always at the end of the pipeline, and all reads are done in Opfetch. This eliminates all explicit WAW and WAR hazards. The only possible remaining hazards are those that can occur on implicit operands, such as the registers written by a MOVC3. Hazards on implicit operands are prevented by explicit control in the microcode.

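A minimal sketch of that busy-register bookkeeping (illustrative; the class and method names are assumptions based on the description above):

    # Hedged model of Opfetch's busy-register table (RAW hazard check).
    class BusyTable:
        def __init__(self, capacity=2):          # two-entry associative memory
            self.capacity = capacity
            self.busy = set()

        def can_issue(self, source_regs):
            """Stall (return False) if any source register is still to be written."""
            return not (self.busy & set(source_regs))

        def mark_result(self, dest_reg):
            assert len(self.busy) < self.capacity, "only two instructions past Opfetch"
            self.busy.add(dest_reg)

        def write_back(self, dest_reg):
            self.busy.discard(dest_reg)          # busy flag cleared when register is written

    table = BusyTable()
    table.mark_result('R2')                      # an earlier instruction will write R2
    print(table.can_issue(['R2', 'R3']))         # False -> the reader of R2 stalls
    table.write_back('R2')
    print(table.can_issue(['R2', 'R3']))         # True  -> it may proceed now
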
Opfetch optimizes the case when the last operand specifier is a register by processing the register operand specifier at the same time as the next-to-last specifier. In addition, when the result register of an instruction is the source operand of the next instruction, rather than stall the dependent instruction, Opfetch merely signals this relationship to the EBox, allowing execution to proceed without a stall. This is like the bypassing in our DLX pipeline.

Memory hazards between reads and writes are easily resolved because there is a single memory port, and the IBox decodes all operand addresses.

Handling Control Dependences

There are two aspects to handling branches in a VAX: synchronizing on the condition code and dealing with the branch hazard. Most of the branch processing is handled by the IBox. A predict-taken strategy is used; the following steps are taken when the IBox sees a branch (a sketch of the flow follows the list):

1. Compute the branch target address, send it to the MBox, and initiate a fetch from the target address. Wait for the EBox to issue CCSYNC, which indicates that the condition codes will be available in the next clock cycle.

2. Evaluate the condition codes from the EBox to check the prediction. If the prediction was incorrect, the access initiated in the MBox is aborted. The current PC points at the next instruction or its first operand specifier.

3. Assuming the branch was taken, the IBox flushes the prefetch and decode stages and begins loading the instruction register and processing the new target stream. If the branch was not taken, the access to the potential target has already been killed and the pipeline can continue just using what is in the prefetch and decode stages.

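The control flow of those three steps can be sketched as follows (purely illustrative; the function and signal names are assumptions, not 8600 interfaces):

    # Hedged sketch of the predict-taken branch flow described in steps 1-3.
    def handle_branch(target_pc, fall_through_pc, branch_taken):
        # Step 1: start fetching down the predicted (taken) path right away.
        prefetched = {'pc': target_pc, 'valid': True}      # fetch initiated in the MBox
        # (wait here for CCSYNC: condition codes valid in the next cycle)

        # Step 2: check the prediction against the real condition-code outcome.
        if not branch_taken:
            prefetched['valid'] = False                    # abort the target access
            return fall_through_pc                         # keep prefetch/decode contents

        # Step 3: prediction was right; flush prefetch/decode and switch streams.
        return prefetched['pc']

    print(handle_branch(target_pc=0x2000, fall_through_pc=0x1004, branch_taken=True))
    # 8192 (0x2000): execution continues down the taken path
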
Simple conditional branches (BEQL, BNEQ), the unconditional branches (BRB, BRW), and the computed branches (e.g., AOBLEQ) are handled by the IBox. The EBox handles more complex branches and also the instructions used for calls and returns.

An Example

To really understand how this pipeline works, let's look at how a code sequence executes. This example is somewhat simplified, but is sufficient to demonstrate the major pipeline interactions. The code sequence we will consider is as follows (remember that for consistency the result of the ADDL3 is given first):

        ADDL3   R1,R2,56(R3)
        CMPL    45(R1),@54(R2)
        BEQL    target

        MOVL
target: SUBL3

Figure 6.55 shows an annotated pipeline diagram of how these instructions would progress through the 8600 pipeline.

Dealing with Interrupts

The 8600 maintains three program counters so that instruction interruption and restart are possible. These program counters and what they designate are:

  • Current Program Counter: points to the next byte to be processed and consumed in Opfetch.
  • IBox Starting Address: points to the instruction currently in Opfetch.
  • EBox Starting Address: points to the instruction executing in the EBox or FBox.

In addition, the prefetch unit keeps an address to prefetch from (the VIBA, Virtual Instruction Buffer Address), but this does not affect interrupt handling. When an exception is caused by a prefetch operation, the byte in the instruction buffer is marked. When Opfetch eventually asks for the byte, it will see the exception, and the Current Program Counter will have the address of the byte that caused the exception.

FIGURE 6.55 How the example code sequence progresses through the 8600 pipeline, clock cycle by clock cycle (cycles 1 through 9), with the Ifetch, Opfetch, Execution, and Result store activity annotated for each instruction.
