`
`319
`
Instruction type      Pipe stages

Integer instruction   IF   ID   EX   MEM  WB
FP instruction        IF   ID   EX   MEM  WB
Integer instruction        IF   ID   EX   MEM  WB
FP instruction             IF   ID   EX   MEM  WB
Integer instruction             IF   ID   EX   MEM  WB
FP instruction                  IF   ID   EX   MEM  WB
Integer instruction                  IF   ID   EX   MEM  WB
FP instruction                       IF   ID   EX   MEM  WB
FIGURE 6.46 Superscalar pipeline in operation. The integer and floating-point instructions are issued at the same
time, and each executes at its own pace through the pipeline. This scheme will only improve the performance of programs
with a fair amount of floating point.
`
By issuing an integer and a floating-point operation in parallel, the need for
additional hardware is minimized: integer and floating-point operations use different
register sets and different functional units. The only conflict arises when
the integer instruction is a floating-point load, store, or move. This creates contention
for the floating-point register ports and may also create a hazard if the
floating-point operation uses the result of a floating-point load issued at the same
time. Both problems could be solved by detecting this contention as a structural
hazard and delaying the issue of the floating-point instruction. The contention
could also be eliminated by providing two additional ports, a read and a write,
on the floating-point register file. We would also need to add several additional
bypass paths to avoid performance loss.
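As a rough sketch of the issue check this implies (my own illustration in C; the record fields and the delay decision are assumptions, not DLX hardware), the pair could be examined as follows:

    #include <stdbool.h>

    /* Hypothetical decoded-instruction record; field names are assumptions. */
    typedef struct {
        bool is_fp_load_store_move;  /* integer-pipe op that touches FP registers */
        int  fp_dest;                /* FP register written (-1 if none)          */
        int  fp_src1, fp_src2;       /* FP registers read   (-1 if none)          */
    } Inst;

    /* Returns true if the FP instruction of the pair must be delayed one cycle:
     * either the pair contends for FP register ports, or the FP operation reads
     * the result of an FP load issued in the same cycle.                         */
    bool delay_fp_issue(const Inst *intop, const Inst *fpop)
    {
        if (!intop->is_fp_load_store_move)
            return false;                 /* disjoint register sets: no conflict  */
        bool port_conflict = true;        /* extra read/write ports would remove this */
        bool raw_on_load   = (intop->fp_dest >= 0) &&
                             (intop->fp_dest == fpop->fp_src1 ||
                              intop->fp_dest == fpop->fp_src2);
        return port_conflict || raw_on_load;
    }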
There is another difficulty that may limit the effectiveness of a superscalar
pipeline. In our simple DLX pipeline, loads had a latency of one clock cycle;
this meant the instruction immediately after a load could not use the result without stalling. In the
superscalar pipeline, the result of a load instruction cannot be used on the same
clock cycle or on the next clock cycle. This means that the next three instructions
cannot use the load result without stalling; without extra ports, moves
between the register sets are similarly affected. The branch delay also becomes
three instructions. To effectively exploit the parallelism available in a superscalar
machine, more ambitious compiler-scheduling techniques, as well as more
complex instruction decoding, will need to be implemented. Loop unrolling
helps generate larger straightline fragments for scheduling; more powerful
compiler techniques are discussed near the end of this section.
Let's see how well loop unrolling and scheduling work on a superscalar version
of DLX with the same delays in clock cycles.
`
`
`Example
`
How would the unrolled loop on page 317 be scheduled on a superscalar pipeline
for DLX? To schedule it without any delays, we will need to unroll it to
make five copies of the body.
`
`Answer
`
`The resulting code is shown in Figure 6.47.
`
       Integer instruction      FP instruction           Clock cycle

Loop:  LD   F0,0(R1)                                     1
       LD   F6,-8(R1)                                    2
       LD   F10,-16(R1)         ADDD F4,F0,F2            3
       LD   F14,-24(R1)         ADDD F8,F6,F2            4
       LD   F18,-32(R1)         ADDD F12,F10,F2          5
       SD   0(R1),F4            ADDD F16,F14,F2          6
       SD   -8(R1),F8           ADDD F20,F18,F2          7
       SD   -16(R1),F12                                  8
       SD   -24(R1),F16                                  9
       SUB  R1,R1,#40                                    10
       BNEZ R1,LOOP                                      11
       SD   8(R1),F20                                    12
`
`FIGURE 6.47 The unrolled and scheduled code as it would look on a superscalar
`DLX.
`
`This unrolled superscalar loop now runs in 12 clock cycles per iteration, or 2.4
`clock cycles per element, versus 3.5 for the scheduled and unrolled loop on the
`ordinary DLX pipeline. In this example, the performance of the superscalar
`DLX is limited by the balance between integer and floating-point computation.
`Every floating-point instruction is issued together with an integer instruction, but
`there are not enough floating-point instructions to keep the floating-point
`pipeline full. When scheduled, the original loop ran in 6 clock cycles per
`iteration. We have improved on that by a factor of 2.5, more than half of which
`came from loop unrolling, which took us from 6 to 3.5, with the rest coming
`from issuing more than one instruction per clock cycle.
`
`Ideally, our superscalar machine will pick up two instructions and issue them
`both if the first is an integer and the second is a floating-point instruction. If they
`do not fit this pattern, which can be quickly detected, then they are issued
`sequentially. This points to one of the major advantages of a general superscalar
`machine: There is little impact on code density, and even unscheduled programs
`can be run. The number of issues and classes of instructions that can be issued
`together are the major factors that differentiate superscalar processors.
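A minimal sketch of that pattern check, assuming a hypothetical two-entry fetch buffer whose entries have already been classified as integer or floating point (none of these names come from the text):

    typedef enum { INT_OP, FP_OP } Class;

    typedef struct { Class cls; /* ... other decoded fields ... */ } Inst;

    /* Issue the pair i0,i1 together only when it matches the
     * integer-then-floating-point pattern; otherwise issue i0 alone
     * and let i1 try again in the next clock cycle.                  */
    int instructions_issued(const Inst *i0, const Inst *i1)
    {
        if (i0->cls == INT_OP && i1->cls == FP_OP)
            return 2;          /* dual issue */
        return 1;              /* sequential issue */
    }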
`
`
`Multiple Instruction Issue with
`Dynamic Scheduling
`
Multiple instruction issue can also be applied to dynamically scheduled
machines. We could start with either the scoreboard scheme or Tomasulo's
algorithm. Let's assume we want to extend Tomasulo's algorithm to support
issuing two instructions per clock cycle, one integer and one floating point. We
do not want to issue instructions in the queue out of order, since this makes the
bookkeeping in the register file impossible. Rather, by employing separate data structures
for the integer and floating-point registers, both types of instructions can be
issued to their respective reservation stations, as long as the two instructions at
the head of the instruction queue do not access the same register set.
Unfortunately, this approach bars issuing two instructions with a dependence in
the same clock cycle. The same is, of course, true in the superscalar case, where it is
clearly the compiler's problem. There are three approaches that can be used to
achieve dual issue. First, we could use software scheduling to ensure that dependent
instructions do not appear adjacent. However, this would require pipeline-scheduling
software, thereby defeating one of the advantages of dynamically
scheduled pipelines.
A second approach is to pipeline the instruction-issue stage so that it runs
twice as fast as the basic clock rate. This permits updating the tables before processing
the next instruction; then the two instructions can begin execution at
once.
The third approach is based on the observation that if multiple instructions
are not being issued to the same functional unit, then it will only be loads and
stores that will create dependences among instructions that we wish to issue
together. The need for reservation stations for loads and stores can be eliminated
by using queues for the result of a load and for the source operand of a store.
Since dynamic scheduling is most effective for loads and stores, while static
scheduling is highly effective in register-register code sequences, we could use
static scheduling to eliminate reservation stations completely and rely only on
the queues for loads and stores. This style of machine organization has been
called a decoupled architecture.
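For illustration, a load-result or store-data queue of the kind mentioned above could be as simple as the following C FIFO; this is a sketch of the idea only, not the organization of any particular decoupled machine:

    #include <assert.h>

    #define QLEN 8

    /* Simple FIFO used for load results or for store source operands. */
    typedef struct {
        double buf[QLEN];
        int head, tail, count;
    } Queue;

    void q_push(Queue *q, double v)
    {
        assert(q->count < QLEN);
        q->buf[q->tail] = v;
        q->tail = (q->tail + 1) % QLEN;
        q->count++;
    }

    double q_pop(Queue *q)
    {
        assert(q->count > 0);
        double v = q->buf[q->head];
        q->head = (q->head + 1) % QLEN;
        q->count--;
        return v;
    }

    /* A load deposits its result in a load queue and a later operation
     * consumes it; a store takes its data from a store queue when the
     * operand becomes available, so the two can issue together.        */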
`For simplicity, let us assume that we have pipelined the instruction issue logic
`so that we can issue two operations that are dependent but use different
`functional units. Let's see how this would work with our example.
`
`Example
`
Consider the execution of our simple loop on a DLX pipeline extended with
Tomasulo's algorithm and with multiple issue. Assume that both a floating-point
and an integer operation can be issued on every clock cycle, even if they are
related. The number of cycles of latency per instruction is the same. Assume that
issue and write results take one cycle each, and that there is dynamic branch-prediction
hardware. Create a table showing when each instruction issues, begins
execution, and writes its result, for the first two iterations of the loop. Here is the
original loop:
`
`
Loop:  LD   F0,0(R1)
       ADDD F4,F0,F2
       SD   0(R1),F4
       SUB  R1,R1,#8
       BNEZ R1,LOOP
`
`Answer
`
The loop will be dynamically unwound and, whenever possible, instructions will
be issued in pairs. The result is shown in Figure 6.48. The loop runs in 4 clock
cycles per result, plus a small start-up overhead divided among the n iterations;
for large n this approaches 4 clock cycles per result.
`
Iteration   Instruction         Issues at      Executes at    Writes result at
number                          clock-cycle    clock-cycle    clock-cycle
                                number         number         number

1           LD   F0,0(R1)       1              2              4
1           ADDD F4,F0,F2       1              5              8
1           SD   0(R1),F4       2              9
1           SUB  R1,R1,#8       3              4              5
1           BNEZ R1,LOOP        4              5
2           LD   F0,0(R1)       5              6              8
2           ADDD F4,F0,F2       5              9              12
2           SD   0(R1),F4       6              13
2           SUB  R1,R1,#8       7              8              9
2           BNEZ R1,LOOP        8              9
`
`FIGURE 6.48 The time of issue, execution, and writing result for a dual-issue
`version of our Tomasulo pipeline. The write-result stage does not apply to either stores
`or branches, since they do not write any registers.
`
`The number of dual issues is small because there is only one floating-point
`operation per iteration. The relative number of dual-issued instructions would be
`helped by the compiler partially unwinding the loop to reduce the instruction
`count by eliminating loop overhead. With that transformation, the loop would
`run as fast as on a superscalar machine. We will return to this transformation in
`Exercises 6.16 and 6.17.
`
`The VLIW Approach
`
`Our superscalar DLX machine can issue two instructions per clock cycle. That
`could perhaps be extended to three or at most four, but it becomes difficult to
`
`
determine whether three or four instructions can all issue simultaneously without
knowing what order the instructions could be in when fetched and what dependencies
might exist among them. An alternative is an LIW (Long Instruction
Word) or VLIW (Very Long Instruction Word) architecture. VLIWs use multiple,
independent functional units. Rather than attempting to issue multiple, independent
instructions to the units, a VLIW packages the multiple operations into
one very long instruction, hence the name. A VLIW instruction might include
two integer operations, two floating-point operations, two memory references,
and a branch. An instruction would have a set of fields for each functional
unit, perhaps 16 to 24 bits per unit, yielding an instruction length of between
112 and 168 bits. To keep the functional units busy there must be enough work
in a straightline code sequence to keep the instructions scheduled. This is
accomplished by unrolling loops and scheduling code across basic blocks using
a technique called trace scheduling. In addition to eliminating branches by unrolling
loops, trace scheduling provides a method to move instructions across
branch points. We will discuss trace scheduling more in the next section. For
now, let's assume we have a technique to generate long, straightline code
sequences for building up VLIW instructions.
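As a rough picture of such a format, here is a C sketch of one possible instruction-word layout; the field split within a slot and the 24-bit slot width are assumptions chosen to match the 16-to-24-bit range mentioned above, not a real VLIW encoding:

    #include <stdint.h>

    /* One 24-bit operation slot: an opcode plus three register fields.
     * The exact field split is illustrative only.                      */
    typedef struct {
        uint32_t op   : 6;
        uint32_t dest : 6;
        uint32_t src1 : 6;
        uint32_t src2 : 6;
    } Slot;                       /* 24 bits of payload per functional unit */

    /* A very long instruction: one slot per functional unit.
     * 7 slots * 24 bits = 168 bits, the upper bound given in the text.     */
    typedef struct {
        Slot int_op[2];           /* two integer operations      */
        Slot fp_op[2];            /* two floating-point ops      */
        Slot mem_ref[2];          /* two memory references       */
        Slot branch;              /* one branch                  */
    } VLIWInstruction;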
`
`Example
`
`Suppose we have a VLIW that could issue two memory references, two FP
`operations, and one integer operation or branch in every clock cycle. Show an
`unrolled version of the vector sum loop for such a machine. Unroll as many
`times as necessary to eliminate any stalls. Ignore the branch-delay slot.
`
`Answer
`
The code is shown in Figure 6.49. The loop has been unrolled to make seven copies of the body, which
eliminates stalls, and runs in 9 cycles. This yields a running rate of 7 results in 9
cycles, or about 1.29 cycles per result.
`
Memory             Memory             FP                 FP                 Integer operation
reference 1        reference 2        operation 1        operation 2        / branch

LD F0,0(R1)        LD F6,-8(R1)
LD F10,-16(R1)     LD F14,-24(R1)
LD F18,-32(R1)     LD F22,-40(R1)     ADDD F4,F0,F2      ADDD F8,F6,F2
LD F26,-48(R1)                        ADDD F12,F10,F2    ADDD F16,F14,F2
                                      ADDD F20,F18,F2    ADDD F24,F22,F2
SD 0(R1),F4        SD -8(R1),F8       ADDD F28,F26,F2
SD -16(R1),F12     SD -24(R1),F16
SD -32(R1),F20     SD -40(R1),F24                                           SUB R1,R1,#48
SD -0(R1),F28                                                               BNEZ R1,LOOP
`
`FIGURE 6.49 VLIW instructions that occupy the inner loop and replace the unrolled sequence. This code takes
`nine cycles assuming no branch delay; normally the branch would also be scheduled. The issue rate is 23 operations in 9
`clock cycles, or 2.5 operations per cycle. The efficiency, the percentage of available slots that contained an operation, is
`about 60%. To achieve this issue rate requires a much larger number of registers than DLX would normally use in this
`loop.
`
`
What are the limitations and costs of a VLIW approach? If we can issue 5
operations per clock cycle, why not 50? Three different limitations are encountered:
limited parallelism, limited hardware resources, and code size explosion.
The first is the simplest: there is a limited amount of parallelism available in instruction
sequences. Unless loops are unrolled very large numbers of times,
there may not be enough operations to fill the instructions. At first glance, it
might appear that 5 instructions that could be executed in parallel would be sufficient
to keep our VLIW completely busy. This, however, is not the case. Several
of these functional units (the memory, the branch, and the floating-point
units) will be pipelined, requiring a much larger number of operations that can
be executed in parallel. For example, if the floating-point pipeline has 8 steps,
the 2 operations being issued on a clock cycle cannot depend on any of the 14
operations already in the floating-point pipeline. Thus, we need to find a number
of independent operations roughly equal to the average pipeline depth times the
number of functional units. This means about 15 to 20 operations would be
needed to keep a VLIW with 5 functional units busy.
The second cost, the hardware resources for a VLIW, seems quite straightforward:
duplicating the floating-point and integer functional units is easy, and
cost scales linearly. However, there is a large increase in the memory- and
register-file bandwidth. Even with a split floating-point and integer register file,
our VLIW will require 5 read ports and 2 write ports on the integer register file
and 4 read ports and 2 write ports on the floating-point register file. This
bandwidth cannot be supported without some substantial cost in the size of the
register file and possible degradation of clock speed. Our 5-unit VLIW also has
2 data memory ports. Furthermore, if we wanted to expand it, we would need to
continue adding memory ports. Adding only arithmetic units would not help,
since the machine would be starved for memory bandwidth. As the number of
data memory ports grows, so does the complexity of the memory system. To
allow multiple memory accesses in parallel, the memory must be broken into
banks containing different addresses with the hope that the operations in a single
instruction do not have conflicting accesses. A conflict will cause the entire
machine to stall, since all the functional units must be kept synchronized. This
same factor makes it extremely difficult to use data caches in a VLIW.
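To make the banking point concrete, here is a minimal C sketch of the conflict check, assuming low-order interleaving across a power-of-two number of banks (both assumptions are mine, not a description of any particular machine):

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_BANKS 8   /* assumed power-of-two bank count */

    /* Low-order word-address bits select the bank (interleaving). */
    static unsigned bank_of(uint32_t addr) { return (addr >> 2) % NUM_BANKS; }

    /* A VLIW with two memory slots must stall the whole machine when
     * both references in one instruction map to the same bank.        */
    bool memory_slots_conflict(uint32_t addr1, uint32_t addr2)
    {
        return bank_of(addr1) == bank_of(addr2);
    }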
Finally, there is the problem of code size. There are two different elements
that combine to increase code size substantially. First, generating enough operations
in a straightline code fragment requires ambitiously unrolling loops, which
increases code size. Second, whenever instructions are not full, the unused functional
units translate to wasted bits in the instruction encoding. In Figure 6.49,
we saw that only about 60% of the functional units were used; almost half of
each instruction was empty. To combat this problem, clever encodings are
sometimes used. For example, there may be only one large immediate field for
use by any functional unit. Another technique is to compress the instructions in
main memory and expand them when they are read into the cache or are
decoded.
`
`
The major challenge for these machines is to try to exploit large amounts of
instruction-level parallelism. When the parallelism comes from unrolling simple
loops, the original loop probably could have been run efficiently on a vector
machine (see the next chapter). It is not clear that a VLIW is preferred over a
vector machine for such applications; the costs are similar, and the vector
machine is typically the same speed or faster. The open question in 1990 is
whether there are large classes of applications that are not suitable for vector
machines, but still offer enough parallelism to justify the VLIW approach rather
than a simpler one, such as a superscalar machine.
`
`Increasing Instruction-Level Parallelism with
`Software Pipelining and Trace Scheduling
`
We have already seen that one compiler technique, loop unrolling, is used to
help exploit parallelism among instructions. Loop unrolling creates longer
sequences of straightline code, which can be used to exploit more instruction-level
parallelism. There are two other, more general techniques that have been
developed for this purpose: software pipelining and trace scheduling.
Software pipelining is a technique for reorganizing loops such that each iteration
in the software-pipelined code is made from instruction sequences chosen
from different iterations in the original code segment. This is most easily understood
by looking at the scheduled code for the superscalar version of DLX. The
scheduler essentially interleaves instructions from different loop iterations,
putting together all the loads, then all the adds, then all the stores. A software-pipelined
loop interleaves instructions from different iterations without unrolling
the loop. This technique is the software counterpart to what Tomasulo's algorithm
does in hardware. The software-pipelined loop would contain one load,
one add, and one store, each from a different iteration. There is also some start-up
code that is needed before the loop begins, as well as code to finish up after the
loop is completed. We will ignore these in this discussion.
`
`Example
`
`Show a software-pipelined version of this loop:
`
Loop:  LD   F0,0(R1)
       ADDD F4,F0,F2
       SD   0(R1),F4
       SUB  R1,R1,#8
       BNEZ R1,LOOP
`
`You may omit the start-up and clean-up code.
`
`
`Answer
`
Given the vector M in memory, and ignoring the start-up and finishing code, we
have:

Loop:  SD   0(R1),F4        ;stores into M[i]
       ADDD F4,F0,F2        ;adds to M[i-1]
       LD   F0,-16(R1)      ;loads M[i-2]
       BNEZ R1,LOOP
       SUB  R1,R1,#8        ;subtract in delay slot
This loop can be run at a rate of 5 cycles per result, ignoring the start-up and
clean-up portions. Because the load fetches two array elements beyond the
element count, the loop should run for two fewer iterations. This would be
accomplished by decrementing R1 by 16 prior to the loop.
Software pipelining can be thought of as symbolic loop unrolling. Indeed,
some of the algorithms for software pipelining use loop unrolling to figure out
how to software pipeline the loop. The major advantage of software pipelining
over straight loop unrolling is that software pipelining consumes less code space.
Software pipelining and loop unrolling, in addition to yielding a better scheduled
inner loop, each reduce a different type of overhead. Loop unrolling reduces the
overhead of the loop: the branch and counter-update code. Software pipelining
reduces the time when the loop is not running at peak speed to once per loop, at
the beginning and end. If we unroll a loop that does 100 iterations a constant
number of times, say 4, we pay the overhead 100/4 = 25 times, every time the
inner unrolled loop is reinitiated. Figure 6.50 shows this behavior graphically.
Because these techniques attack two different types of overhead, the best
performance comes from doing both.
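As a purely illustrative example of paying the loop overhead once per unrolled-loop entry, the following C loop of 100 iterations unrolled by 4 executes its test-and-branch overhead only 100/4 = 25 times:

    void add_constant(double *m, double c)
    {
        /* Original loop: 100 iterations, 100 tests and branches.      */
        /* Unrolled by 4: the loop control below runs only 25 times.   */
        for (int i = 0; i < 100; i += 4) {
            m[i]     += c;
            m[i + 1] += c;
            m[i + 2] += c;
            m[i + 3] += c;
        }
    }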
The other technique used to generate additional parallelism is trace scheduling.
This is particularly useful for VLIWs, for which the technique was originally
developed. Trace scheduling is a combination of two separate processes.
The first process, called trace selection, tries to find the most likely sequence of
operations to put together into a small number of instructions; this sequence is
called a trace. Loop unrolling is used to generate long traces, since loop
branches are taken with high probability. Once a trace is selected, the second
process, called trace compaction, tries to squeeze the trace into a small number
of wide instructions. Trace compaction attempts to move operations as early as it
can in a sequence (trace), packing the operations into as few wide instructions as
possible.
There are two different considerations in compacting a trace: data dependences,
which force a partial order on operations, and branch points, which create
places across which code cannot be easily moved. In essence, the code wants
to be compacted into the shortest possible sequence that preserves the data
dependences; branches are the main impediment to this process. The major
advantage of trace scheduling over simpler pipeline-scheduling techniques is
that it includes a method to move code across branches. Figure 6.51 shows a
code fragment, which may be thought of as an iteration of an unrolled loop, and
the trace selected.
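To suggest what trace selection might look like, here is a small greedy sketch in C; the data structures and the profile-count heuristic are my own assumptions, not the algorithm of any particular trace-scheduling compiler. Starting from a seed basic block, it follows the most frequently taken successor until it reaches a block already placed on a trace:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Block {
        struct Block *succ[2];        /* fall-through and taken successors (may be NULL) */
        int           succ_count[2];  /* profile counts for each successor               */
        bool          on_trace;       /* already placed in some trace?                   */
    } Block;

    /* Greedy trace selection: follow the most frequently taken edge. */
    size_t select_trace(Block *seed, Block **trace, size_t max_len)
    {
        size_t n = 0;
        Block *b = seed;
        while (b != NULL && !b->on_trace && n < max_len) {
            trace[n++] = b;
            b->on_trace = true;
            if (b->succ[0] == NULL && b->succ[1] == NULL)
                break;                /* end of the control-flow graph */
            /* Pick the successor with the larger profile count. */
            if (b->succ[1] == NULL || b->succ_count[0] >= b->succ_count[1])
                b = b->succ[0];
            else
                b = b->succ[1];
        }
        return n;                     /* number of blocks in the trace */
    }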
`
`
Once the trace is selected as shown in Figure 6.51, it must be compacted so as
to fill the wide instruction word. Compacting the trace involves moving the
assignments to variables B and C up to the block before the branch decision.
Let's first consider the problem of moving the assignment to B. If the assignment
to B is moved above the branch (and thus out of the trace), the code in X
would be affected if it used B, since moving the assignment would change the
value of B. Thus, to move the assignment to B, B must not be read in X. One
could imagine more clever schemes if B were read in X, for example, making a
shadow copy and updating B later. Such schemes are generally not used, both
because they are complex to implement and because they will slow down the
program if the trace selected is not optimal and the operations end up requiring
additional instructions. Also, because the assignment to B is moved before the if
test, for this schedule to be valid either X also assigns to B or B is not read after
the if statement.
Moving the assignment to C up to before the first branch requires first moving
it over the branch from X into the trace. To do this, a copy is made of the
assignment to C on the branch into the trace. A check must still be done, as was
done for B, to make sure that the assignment can be moved over the branch out
of the trace. If C is successfully moved to before the first branch and the "false"
direction of the branch (the branch off the trace) is taken, the assignment to C
will have been done twice. This may be slower than the original code, depending
on whether this operation or other moved operations create additional work in
the main trace. Ironically, the more successful the trace-scheduling algorithm is
in moving code across the branch, the higher the penalty for misprediction.
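Since Figure 6.51 is not reproduced here, the following schematic C fragment (a hypothetical reconstruction of the kind of code involved, not the book's figure) illustrates the two moves: B is hoisted above the branch, which is legal only because block X never reads B, and C is hoisted as well, with a compensation copy left on the off-trace path:

    /* Schematic "before": the trace is the taken path plus the join. */
    void before(int a, int *B, int *C)
    {
        int t = a * 2;
        if (a > 0) {
            *B = t + 1;       /* on-trace block */
        } else {
            t = t - 1;        /* block X: assumed not to read *B */
        }
        *C = t + 2;           /* assigned at the join, on the trace */
    }

    /* Schematic "after" compaction: B and C hoisted above the branch. */
    void after(int a, int *B, int *C)
    {
        int t = a * 2;
        *B = t + 1;           /* legal only because X never reads *B        */
        *C = t + 2;           /* speculative: uses the on-trace value of t  */
        if (a > 0) {
            /* trace continues; B and C already computed */
        } else {
            t = t - 1;        /* block X */
            *C = t + 2;       /* compensation copy on the branch back into
                                 the trace; C is computed twice on this path */
        }
    }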
Loop unrolling, trace scheduling, and software pipelining all aim at trying to
increase the amount of local instruction parallelism that can be exploited by a
machine issuing more than one instruction on every clock cycle. The effectiveness
of each of these techniques and their suitability for various architectural
approaches are among the most significant open research areas in pipelined-processor
design.
`
6.9 Putting It All Together: A Pipelined VAX
`
`In this section we will examine the pipeline of the VAX 8600, a macropipelined
`VAX. This machine is described in detail by DeRosa et al. [1985] and Troiani et
`al. [1985]. The 8600 pipeline is a more dynamic structure than the DLX integer
`pipeline. This is because the processing steps may take multiple cycles in one
`stage of the pipeline. Additionally, the hazard detection is more complicated
`because of the possibility that stages progress independently and because
`instructions may modify registers before they complete. Techniques similar to
`those used in the DLX FP pipeline to handle variable-length instructions are
`used in the 8600 pipeline.
The 8600 is macropipelined: the pipeline understands the structure of VAX
instructions and overlaps their execution, checking the hazards on the instruction
`
`
Step             Function                                      Located in

1. Ifetch        Prefetch instruction bytes and decode them    IBox
2. Opfetch       Operand address calculation and fetch         IBox
3. Execution     Execute opcode and write result               EBox, FBox
4. Result store  Write result to memory or registers           EBox, IBox
`
`FIGURE 6.53 The basic structure of the 8600 pipeline has four stages, each taking
`from 1 to a large number of clock cycles. Up to four VAX instructions are being
`processed at once.
`
stage may cause a backup in the pipeline; this backup may eventually reach the
Ifetch step, where it will cause the pipeline to simply stop fetching instructions.
Additionally, several resources (e.g., the W Bus and GPR ports) are contended
for by multiple stages in the pipeline. In general, these problems are resolved on
the fly using a fixed-priority scheme.
`
`Operand Decode and Fetch
`
Much of the work in interpreting a VAX instruction is in the operand specifier
and decode process, and this is the heart of the IBox. Substantial effort is devoted
to decoding and fetching operands as fast as possible to keep instructions
flowing through the pipeline. Figure 6.54 shows the number of cycles spent in
Opfetch under ideal conditions (no cache misses or other stalls from the memory
hierarchy) for each operand specifier.
`
Specifier                     Cycles

Literal or immediate          1
Register                      1
Deferred                      1
Displacement                  1
PC-relative and absolute      1
Autodecrement                 1
Autoincrement                 2
Autoincrement deferred        5
Displacement deferred         4
PC-relative deferred          4
`
`FIGURE 6.54 The minimum number of cycles spent in Opfetch by operand specifier.
`This shows the data for an operand of type byte, word, or longword that is read. Modified
`and written operands take an additional cycle, except for register mode and immediate or
`literal, where writes are not allowed. Quadword and octaword operands may take much
`longer. If any stalls are encountered, the cycle count will increase.
`
`
If the result is a register, the EBox stores the result. If the result is a memory
operand, Opfetch calculates the address and waits for the EBox to signal ready;
then the IBox stores the result during the Result store step. If an instruction
result is to be stored in memory, the EBox signals to the IBox when it enters the
last cycle of execution for the instruction. This allows Opfetch to overlap the
first cycle of a two-cycle memory write with the last cycle of execution (even if
the operation only takes one cycle).
To maximize the performance of the machine, there are three copies of the
GPRs, in the IBox, EBox, and FBox. A write is broadcast from the FBox,
EBox, or IBox (in the case of autoincrement or autodecrement addressing) to the
other two units, so that their copies of the registers can be updated.
`
`Handling Data Dependences
`
Register hazards are tracked in Opfetch by maintaining a small table of registers
that will be written. Whenever an instruction passes through Opfetch, its result
register is marked as busy. If an instruction that uses that register arrives in
Opfetch and sees the busy flag set, it stalls until the flag is cleared. This prevents
RAW hazards. The busy flag is cleared when the register is written. Because
there are only two stages after Opfetch (execute and write memory result), the
busy flag can be implemented as a two-entry associative memory. Writes are
maintained in order and always occur at the end of the pipeline, and all reads are
done in Opfetch. This eliminates all explicit WAW and WAR hazards. The only
possible remaining hazards are those that can occur on implicit operands, such
as the registers written by a MOVC3. Hazards on implicit operands are prevented
by explicit control in the microcode.
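A minimal C sketch of such a two-entry busy table (my own illustration of the mechanism described, not the 8600's actual logic):

    #include <stdbool.h>

    #define BUSY_ENTRIES 2   /* only two stages follow Opfetch */

    static int  busy_reg[BUSY_ENTRIES];    /* register numbers pending a write */
    static bool busy_valid[BUSY_ENTRIES];

    /* When an instruction leaves Opfetch, mark its result register busy. */
    void mark_busy(int slot, int reg)  { busy_reg[slot] = reg; busy_valid[slot] = true; }

    /* When the register is actually written, clear the flag. */
    void clear_busy(int slot)          { busy_valid[slot] = false; }

    /* Opfetch stalls a reader while its source register is still busy (RAW hazard). */
    bool must_stall(int src_reg)
    {
        for (int i = 0; i < BUSY_ENTRIES; i++)
            if (busy_valid[i] && busy_reg[i] == src_reg)
                return true;
        return false;
    }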
`Opfetch optimizes the case when the last operand specifier is a register by
`processing the register operand specifier at the same time as the next-to-last
`specifier. In addition, when the result register of an instruction is the source
`operand of the next instruction, rather than stall the dependent instruction,
`Opfetch merely signals this relationship to the EBox, allowing execution to
`proceed without a stall. This is like the bypassing in our DLX pipeline.
`Memory hazards between reads and writes are easily resolved because there
`is a single memory port, and the IBox decodes all operand addresses.
`
`Handling Control Dependences
`
There are two aspects to handling branches in a VAX: synchronizing on the
condition code and dealing with the branch hazard. Most of the branch processing
is handled by the IBox. A predict-taken strategy is used; the following steps
are taken when the IBox sees a branch:

1. Compute the branch target address, send it to the MBox, and initiate a fetch
from the target address. Wait for the EBox to issue CCSYNC, which indicates
that the condition codes will be available in the next clock cycle.
`
`
`2. Evaluate the condition codes from the EBox to check the prediction. If the
`prediction was incorrect, the access initiated in the MBox is aborted. The
`current PC points at the next instruction or its first operand specifier.
`
3. Assuming the branch was taken, the IBox flushes the prefetch and decode
stages and begins loading the instruction register and processing the new target
stream. If the branch was not taken, the access to the potential target has
already been killed and the pipeline can continue just using what is in the
prefetch and decode stages.
`
`Simple conditional branches (BEQL, BNEQ), the unconditional branches
`(BRB, BRW), and the computed branches (e.g., AOBLEQ) are handled by the
`IBox. The EBox handles more complex branches and also the instructions used
`for calls and returns.
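A rough C sketch of this predict-taken sequence follows; the helper routines are hypothetical stand-ins for IBox, EBox, and MBox signals and simply restate the three steps above:

    #include <stdio.h>
    #include <stdbool.h>

    /* Stubbed-out hardware actions; in the real machine these are unit signals. */
    static void start_fetch(unsigned a)         { printf("fetch target %#x\n", a); }
    static void wait_for_ccsync(void)           { /* wait for EBox CCSYNC */ }
    static bool branch_condition_true(void)     { return true; /* placeholder */ }
    static void abort_target_fetch(void)        { printf("abort target access\n"); }
    static void flush_prefetch_and_decode(void) { printf("switch to target stream\n"); }

    /* Predict-taken branch handling, following the three steps above. */
    static void handle_branch(unsigned target)
    {
        start_fetch(target);              /* step 1: fetch from the target immediately */
        wait_for_ccsync();
        if (!branch_condition_true()) {   /* step 2: check the prediction */
            abort_target_fetch();         /* not taken: kill the target access */
            return;                       /* keep using the prefetched fall-through */
        }
        flush_prefetch_and_decode();      /* step 3: taken: start the target stream */
    }

    int main(void) { handle_branch(0x1000u); return 0; }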
`
`An Example
`
`To really understand how this pipeline works, let's look at how a code sequence
`executes. This example is somewhat simplified, but is sufficient to demonstrate
`the major pipeline interactions. The code sequence we will consider is as follows
`(remember that for consistency the result of the ADDL3 is given first):
        ADDL3  R1,R2,56(R3)
        CMPL   45(R1),@54(R2)
        BEQL   target
        MOVL
target: SUBL3
`
`Figure 6.55 shows an annotated pipeline diagram of how these instructions
`would progress through the 8600 pipeline.
`
`Dealing with Interrupts
`
The 8600 maintains three program counters so that instruction interruption and
restart are possible. These program counters and what they designate are:
• Current Program Counter: points to the next byte to be processed and
consumed in Opfetch.
• IBox Starting Address: points to the instruction currently in Opfetch.
• EBox Starting Address: points to the instruction executing in the EBox or
FBox.
In addition, the prefetch unit keeps an address to prefetch from (the VIBA,
Virtual Instruction Buffer Address), but this does not affect interrupt handling.
When an exception is caused by a prefetch operation, the byte in the instruction
buffer is marked. When Opfetch eventually asks for the byte, it will see the
exception, and the Current Program Counter will have the address of the byte
that caused the exception.
`
`
`1
`
`2
`
`3
`
`Clock Cycle
`5
`4
`
`6
`
`7
`
`8
`
`9
`
`IF: Fetch
`ADDL.
`
`IF:
`IF:
`Continue Decode
`pre fetch
`Rl.
`if space
`andMBox
`available.
`
`IF:
`IF:
`OP: Start WR:
`OP:
`Compute write.
`Decode
`Decode
`Store.
`56 (R3). 56+ (R3) ·EX: Add.
`R2.
`OP: Fetch OP:
`EX: get
`Fetch R2.
`first
`Rl.
`operand.
`IF:
`Decode
`45 (Rl).
`
`IF: De-
`code
`@54 (R2).
`OP: Fetch
`45 (Rl).
`
`OP: Fetch
`54 (R2).