`SAMSUNG-1007
`
`Page 1 of 82
`
`
`
`
`Sponsoring Editor Bruce M. Spatz
`Production Manager Yonie Overton
`Production Editor Julie Pabst
`Text Design Gary Head, with second edition
`modifications by Carron Design
`Cover Art Direction David Lance Goines
`Cover Design Carron Design
`
Illustration Illustrious, Inc.
`Copyeditors Ken DellaPenta, Gary Morris,
`Sharilyn Hovind, Jessie Wood
`Proofreader Jeff Van Bueren
`Composition Nancy Logan
`Printer Courier Corporation
`
`Morgan Kaufmann Publishers, Inc.
`Editorial and Sales Office
`340 Pine Street, Sixth Floor
`San Francisco, CA 94104-3205 USA
Telephone: 415/392-2665, Facsimile: 415/982-2665, Internet: mkp@mkp.com
Order toll free: 800/745-7323
`
`© 1990, 1996 by Morgan Kaufmann Publishers, Inc.
`All rights reserved
`
`Published 1990. Second Edition 1996
`Printed in the United States of America
`00 99 98 97 96
`5 4 3 2
`
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, recording, or otherwise) without the prior written permission of the publisher.
`
`All instruction sets and other design information of the DLX computer system contained herein is copyrighted by the
`publisher and may not be incorporated in other publications or distributed by media without formal acknowledgment
`and written consent from the publisher. Use of the DLX in other publications for educational purposes is encouraged
`and application for permission is welcomed.
`
`ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or intended for the authors should
be addressed to the Editorial and Sales Office of Morgan Kaufmann Publishers, Inc., Dept. P&H APE. Information
`regarding error sightings is encouraged. Any error sightings that are accepted for correction in subsequent printings
`will be rewarded by the authors with a payment of $1.00 (U.S.) per correction upon availability of the new printing.
`Electronic mail can be sent to arc2bugs@mkp.com. (Please include your full name and permanent mailing address.)
`
`INSTRUCTOR SUPPORT: For information on classroom software and other instructor materials available to
adopters, please contact the Editorial and Sales Office of Morgan Kaufmann Publishers, Inc. (415/392-2665) or
visit our web site at http://mkp.com.
`
`Library of Congress Cataloging-in-Publication Data is available for this book.
`ISBN 1-55860-329-8
`
`
`
`
`It is quite a three-pipe problem.
`
`Sir Arthur Conan Doyle
`The Adventures of Sherlock Holmes
`
`
`
`
3.1 What Is Pipelining?
`
`Pipelining is an implementation technique whereby multiple instructions are
`overlapped in execution. Today, pipelining is the key implementation technique
`used to make fast CPUs.
`A pipeline is like an assembly line. In an automobile assembly line, there are
`many steps, each contributing something to the construction of the car. Each step
`operates in parallel with the other steps, though on a different car. In a computer
`pipeline, each step in the pipeline completes a part of an instruction. Like the
assembly line, different steps are completing different parts of different instructions in parallel. Each of these steps is called a pipe stage or a pipe segment. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end, just as cars would in an assembly line.
`In our automobile assembly line, throughput is defined as the number of cars
`per hour and is determined by how often a completed car exits the assembly line.
`Likewise, the throughput of an instruction pipeline is determined by how often an
`instruction exits the pipeline. Because the pipe stages are hooked together, all the
`
`
`
`
`126
`
`Chapter 3 Pipelining
`
`stages must be ready to proceed at the same time, just as we would require in an
`assembly line. The time required between moving an instruction one step down
`the pipeline is a machine cycle. Because all stages proceed at the same time, the
`length of a machine cycle is determined by the time required for the slowest pipe
`stage, just as in an auto assembly line, the longest step would determine the time
`between advancing the line. In a computer, this machine cycle is usually one
`clock cycle (sometimes it is two, rarely more), although the clock may have
`multiple phases.
`The pipeline designer's goal is to balance the length of each pipeline stage,
`just as the designer of the assembly line tries to balance the time for each step in
`the process. If the stages are perfectly balanced, then the time per instruction on
the pipelined machine, assuming ideal conditions, is equal to
`
`Time per instruction on unpipelined machine
`Number of pipe stages
`
`Under these conditions, the speedup from pipelining equals the number of pipe
`stages, just as an assembly line with n stages can ideally produce cars n times as
`fast. Usually, however, the stages will not be perfectly balanced; furthermore,
`pipelining does involve some overhead. Thus, the time per instruction on the
`pipelined machine will not have its minimum possible value, yet it can be close.
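The effect of imbalance and overhead on the equation above can be checked numerically. The following Python sketch uses invented stage times and latch overhead purely for illustration; the function name and the numbers are ours, not the book's:

```python
def pipelined_time_per_instruction(stage_times_ns, latch_overhead_ns=0.0):
    """The machine cycle is set by the slowest pipe stage, plus any
    per-stage pipelining overhead (latch delay, clock skew)."""
    return max(stage_times_ns) + latch_overhead_ns

# Hypothetical stage times for a 5-stage pipeline, in nanoseconds.
stages = [10, 8, 10, 10, 7]
unpipelined_ns = sum(stages)             # one long cycle: 45 ns
ideal_ns = unpipelined_ns / len(stages)  # perfectly balanced: 9 ns
actual_ns = pipelined_time_per_instruction(stages, latch_overhead_ns=1.0)

print(unpipelined_ns, ideal_ns, actual_ns)  # 45 9.0 11.0
```

The speedup here is 45/11, about 4.1 rather than the ideal 5, because one stage dominates and the latches add a nanosecond to every cycle.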
`Pipelining yields a reduction in the average execution time per instruction.
`Depending on what you consider as the base line, the reduction can be viewed as
`decreasing the number of clock cycles per instruction (CPI), as decreasing the
`clock cycle time, or as a combination. If the starting point is a machine that takes
multiple clock cycles per instruction, then pipelining is usually viewed as reducing the CPI. This is the primary view we will take. If the starting point is a machine that takes one (long) clock cycle per instruction, then pipelining decreases the clock cycle time.
`Pipelining is an implementation technique that exploits parallelism among the
`instructions in a sequential instruction stream. It has the substantial advantage
`that, unlike some speedup techniques (see Chapter 8 and Appendix B), it is not
visible to the programmer. In this chapter we will first cover the concept of pipelining using DLX and a simple version of its pipeline. We use DLX because its simplicity makes it easy to demonstrate the principles of pipelining. In addition, to simplify the diagrams we do not include the jump instructions of DLX; adding them does not involve new concepts, only bigger diagrams. The principles of pipelining in this chapter apply to more complex instruction sets than DLX or its RISC relatives, although the resulting pipelines are more complex. Using the DLX example, we will look at the problems pipelining introduces and the performance attainable under typical situations. Section 3.9 examines the MIPS R4000
`pipeline, which is similar to other recent machines with extensive pipelining.
`Chapter 4 looks at more advanced pipelining techniques being used in the
`highest-performance processors.
`
`
`
`
`
Before we proceed to basic pipelining, we need to review a simple implementation of an unpipelined version of DLX.
`
`A Simple Implementation of DLX
`
To understand how DLX can be pipelined, we need to understand how it is implemented without pipelining. This section shows a simple implementation where every instruction takes at most five clock cycles. We will extend this basic implementation to a pipelined version, resulting in a much lower CPI. Our unpipelined implementation is not the most economical or the highest-performance implementation without pipelining. Instead, it is designed to lead naturally to a pipelined implementation. We will indicate where the implementation could be improved later in this section. Implementing the instruction set requires the introduction of several temporary registers that are not part of the architecture; these are introduced in this section to simplify pipelining.
`Every DLX instruction can be implemented in at most five clock cycles. The
`five clock cycles are as follows.
`
1. Instruction fetch cycle (IF):
`
IR ← Mem[PC];
NPC ← PC + 4;
`
Operation: Send out the PC and fetch the instruction from memory into the instruction register (IR); increment the PC by 4 to address the next sequential instruction. The IR is used to hold the instruction that will be needed on subsequent clock cycles; likewise the register NPC is used to hold the next sequential PC.
`
`2. Instruction decode/register fetch cycle (ID):
`
A ← Regs[IR6..10];
B ← Regs[IR11..15];
Imm ← ((IR16)^16 ## IR16..31);
`
Operation: Decode the instruction and access the register file to read the registers. The outputs of the general-purpose registers are read into two temporary registers (A and B) for use in later clock cycles. The lower 16 bits of the IR are also sign-extended and stored into the temporary register Imm, for use in the next cycle.
Decoding is done in parallel with reading registers, which is possible because these fields are at a fixed location in the DLX instruction format (see Figure 2.21 on page 99). This technique is known as fixed-field decoding.
`Note that we may read a register we don't use, which doesn't help but also
`
`
`
`
`
doesn't hurt. Because the immediate portion of an instruction is located in an identical place in every DLX format, the sign-extended immediate is also calculated during this cycle in case it is needed in the next cycle.
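The sign extension computed during ID can be sketched in a few lines of Python. The helper name and sample values are ours, not the book's; it treats the low 16 bits of the 32-bit instruction word as a two's-complement immediate:

```python
def sign_extend16(imm):
    """Sign-extend a 16-bit immediate (the low 16 bits of the IR),
    as the ID stage does when it forms Imm."""
    imm &= 0xFFFF
    # If the sign bit (bit 15 of the field) is set, the value is negative.
    return imm - 0x10000 if imm & 0x8000 else imm

print(sign_extend16(0x0007))  # 7
print(sign_extend16(0xFFFC))  # -4
```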
`
`3. Execution/effective address cycle (EX):
`
`The ALU operates on the operands prepared in the prior cycle, performing one
`of four functions depending on the DLX instruction type.
`
`• Memory reference:
`
ALUOutput ← A + Imm;
`
`Operation: The ALU adds the operands to form the effective address and
`places the result into the register ALUOutput.
`
`• Register-Register ALU instruction:
`
ALUOutput ← A op B;
`
`Operation: The ALU performs the operation specified by the opcode on the
`value in register A and on the value in register B. The result is placed in the
`temporary register ALUOutput.
`
`• Register-Immediate ALU instruction:
`
ALUOutput ← A op Imm;
`
`Operation: The ALU performs the operation specified by the opcode on the
`value in register A and on the value in register Imm. The result is placed in the
`temporary register ALUOutput.
`
`• Branch:
`
ALUOutput ← NPC + Imm;
Cond ← (A op 0);
`
`Operation: The ALU adds the NPC to the sign-extended immediate value in
`Imm to compute the address of the branch target. Register A, which has been
`read in the prior cycle, is checked to determine whether the branch is taken.
`The comparison operation op is the relational operator determined by the
branch opcode; for example, op is "==" for the instruction BEQZ.
`
The load-store architecture of DLX means that effective address and execution cycles can be combined into a single clock cycle, since no instruction needs to simultaneously calculate a data address, calculate an instruction target address,
`
`
`
`
`
`and perform an operation on the data. The other integer instructions not included
`above are jumps of various forms, which are similar to branches.
`
`4. Memory access/branch completion cycle (MEM):
`
`The only DLX instructions active in this cycle are loads, stores, and branches.
`
`• Memory reference:
`
LMD ← Mem[ALUOutput]; or
Mem[ALUOutput] ← B;
`
Operation: Access memory if needed. If the instruction is a load, data returns
`from memory and is placed in the LMD (load memory data) register; if it is a
`store, then the data from the B register is written into memory. In either case
`the address used is the one computed during the prior cycle and stored in the
`register ALUOutput.
`
`• Branch:
`
if (cond) PC ← ALUOutput else PC ← NPC
`
Operation: If the instruction branches, the PC is replaced with the branch destination address in the register ALUOutput; otherwise, it is replaced with the incremented PC in the register NPC.
`
`5. Write-back cycle (WB):
`
`• Register-Register ALU instruction:
`
Regs[IR16..20] ← ALUOutput;
`
`• Register-Immediate ALU instruction:
`
Regs[IR11..15] ← ALUOutput;
`
`• Load instruction:
`
Regs[IR11..15] ← LMD;
`
Operation: Write the result into the register file, whether it comes from the memory system (which is in LMD) or from the ALU (which is in ALUOutput); the register destination field is also in one of two positions depending on the opcode.
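The five cycles above can be mirrored in a small unpipelined interpreter. This is our own sketch, not the book's implementation: the instruction "encoding" is a Python dict rather than DLX bit fields, and only four operations are modeled, but each line is annotated with the cycle it corresponds to:

```python
def step(inst, regs, mem, pc):
    """Execute one instruction through the IF/ID/EX/MEM/WB structure,
    returning the next PC. regs is a 32-entry list; mem maps addresses
    to words; inst stands in for Mem[PC] (the IF fetch is implied)."""
    npc = pc + 4                          # IF:  NPC <- PC + 4
    a = regs[inst.get("rs1", 0)]          # ID:  A <- Regs[rs1]
    b = regs[inst.get("rs2", 0)]          # ID:  B <- Regs[rs2]
    imm = inst.get("imm", 0)              # ID:  sign-extended immediate
    op = inst["op"]
    if op == "add":                       # register-register ALU
        regs[inst["rd"]] = a + b          # EX: ALUOutput; WB: write rd
    elif op == "lw":                      # load
        alu_out = a + imm                 # EX:  effective address
        regs[inst["rd"]] = mem[alu_out]   # MEM: LMD; WB: write rd
    elif op == "sw":                      # store
        alu_out = a + imm                 # EX:  effective address
        mem[alu_out] = b                  # MEM: Mem[ALUOutput] <- B
    elif op == "beqz":                    # branch if equal zero
        if a == 0:                        # EX:  Cond <- (A == 0)
            npc = npc + imm               # MEM: PC <- NPC + Imm
    return npc

regs, mem = [0] * 32, {}
regs[1], regs[2] = 5, 7
pc = step({"op": "add", "rd": 3, "rs1": 1, "rs2": 2}, regs, mem, 0)
pc = step({"op": "sw", "rs1": 0, "rs2": 3, "imm": 100}, regs, mem, pc)
print(regs[3], mem[100])  # 12 12
```

Note that the branch resolves its condition in EX and updates the PC in MEM, exactly the ordering that will complicate the pipelined version later in the chapter.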
`
`Figure 3.1 shows how an instruction flows through the datapath. At the end of
`each clock cycle, every value computed during that clock cycle and required on a
`later clock cycle (whether for this instruction or the next) is written into a storage
`
`
`
`
`
`device, which may be memory, a general-purpose register, the PC, or a temporary
`register (i.e., LMD, Imm, A, B, IR, NPC, ALUOutput, or Cond). The temporary
`registers hold values between clock cycles for one instruction, while the other
`storage elements are visible parts of the state and hold values between successive
`instructions.
`
[Figure 3.1 datapath diagram; only the stage labels survive extraction: instruction fetch, instruction decode/register fetch, and execute/address calculation]
`
`FIGURE 3.1 The implementation of the DLX datapath allows every instruction to be executed in four or five clock
`cycles. Although the PC is shown in the portion of the datapath that is used in instruction fetch and the registers are shown
`in the portion of the datapath that is used in instruction decode/register fetch, both of these functional units are read as well
`as written by an instruction. Although we show these functional units in the cycle corresponding to where they are read, the
`PC is written during the memory access clock cycle (as well as in instruction fetch) and the registers are written during the
write back clock cycle. In both cases, the writes in later pipe stages are indicated by the multiplexer output (in memory access or write back) that carries a value back to the PC or registers. These backward-flowing signals introduce much of the
`complexity of pipelining, and we will look at them more carefully in the next few sections.
`
`In this implementation, branch instructions require four cycles and all other
`instructions require five cycles. Assuming the branch frequency of 12% from the
`last chapter, this leads to an overall CPI of 4.88. This implementation, however,
`is not optimal either in achieving the best performance or in using the minimal
`amount of hardware given the performance level. The CPI could be improved
`
`SAMSUNG-1007
`Page 9 of 82
`
`
`
`3.1 What Is Pipelining?
`
`131
`
without affecting the clock rate by completing ALU instructions during the MEM cycle, since those instructions are idle during that cycle. Assuming ALU instructions occupy 44% of the instruction mix, as we measured in Chapter 2, this improvement would lead to a CPI of 4.44, or an improvement of 4.88/4.44 = 1.1. Beyond this simple change, any other attempts to decrease the CPI may increase the clock cycle time, since such changes would need to put more activity into a clock cycle. Of course, it may still be beneficial to trade an increase in the clock cycle time for a decrease in the CPI, but this requires a detailed analysis and is unlikely to produce large improvements, especially if the initial distribution of work among the clock cycles is reasonably balanced.
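The CPI figures quoted above follow directly from the instruction mix. A quick Python check, using the 12% branch and 44% ALU frequencies from the text:

```python
branch_frac, alu_frac = 0.12, 0.44
other_frac = 1.0 - branch_frac - alu_frac   # loads, stores, etc.

# Branches take 4 cycles, everything else takes 5.
cpi_before = branch_frac * 4 + (alu_frac + other_frac) * 5

# Let ALU instructions also complete during MEM, i.e., in 4 cycles.
cpi_after = branch_frac * 4 + alu_frac * 4 + other_frac * 5

print(cpi_before, cpi_after, cpi_before / cpi_after)  # ~4.88, ~4.44, ~1.1
```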
Although all machines today are pipelined, this multicycle implementation is a reasonable approximation of how most machines would have been implemented in earlier times. A simple finite-state machine could be used to implement the control following the five-cycle structure shown above. For a much more complex machine, microcode control could be used. In either event, an instruction sequence like that above would determine the structure of the control.
`In addition to these CPI improvements, there are some hardware redundancies
`that could be eliminated in this multicycle implementation. For example, there
`are two ALUs: one to increment the PC and one used for effective address and
`ALU computation. Since they are not needed on the same clock cycle, we could
merge them by adding additional multiplexers and sharing the same ALU. Likewise, instructions and data could be stored in the same memory, since the data
`and instruction accesses happen on different clock cycles.
Rather than optimize this simple implementation, we will leave the design as it is in Figure 3.1, since this provides us with a better base for the pipelined implementation.
`As an alternative to the multicycle design discussed in this section, we could
`also have implemented the machine so that every instruction takes one long clock
`cycle. In such cases, the temporary registers would be deleted, since there would
not be any communication across clock cycles within an instruction. Every instruction would execute in one long clock cycle, writing the result into the data memory, registers, or PC at the end of the clock cycle. The CPI would be one for such a machine. However, the clock cycle would be roughly equal to five times the clock cycle of the multicycle machine, since every instruction would need to traverse all the functional units. Designers would never use this single-cycle implementation for two reasons. First, a single-cycle implementation would be very inefficient for most machines that have a reasonable variation among the amount of work, and hence in the clock cycle time, needed for different instructions. Second, a single-cycle implementation requires the duplication of functional units that could be shared in a multicycle implementation. Nonetheless, this single-cycle datapath allows us to illustrate how pipelining can improve the clock cycle time, as opposed to the CPI, of a machine.
`
`
`
`
`
3.2 The Basic Pipeline for DLX
`
`We can pipeline the datapath of Figure 3.1 with almost no changes by starting a
`new instruction on each clock cycle. (See why we chose that design!) Each of the
clock cycles from the previous section becomes a pipe stage: a cycle in the pipeline. This results in the execution pattern shown in Figure 3.2, which is the typical way a pipeline structure is drawn. While each instruction takes five clock cycles to complete, during each clock cycle the hardware will initiate a new instruction and will be executing some part of the five different instructions.
`
                                  Clock number
Instruction number    1    2    3    4    5    6    7    8    9
Instruction i         IF   ID   EX   MEM  WB
Instruction i + 1          IF   ID   EX   MEM  WB
Instruction i + 2               IF   ID   EX   MEM  WB
Instruction i + 3                    IF   ID   EX   MEM  WB
Instruction i + 4                         IF   ID   EX   MEM  WB
`
FIGURE 3.2 Simple DLX pipeline. On each clock cycle, another instruction is fetched and begins its five-cycle execution. If an instruction is started every clock cycle, the performance will be up to five times that of a machine that is not pipelined. The names for the stages in the pipeline are the same as those used for the cycles in the implementation on pages 127-129: IF = instruction fetch, ID = instruction decode, EX = execution, MEM = memory access, and WB = write back.
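A diagram like Figure 3.2 follows mechanically from one rule: instruction i + k occupies, in clock cycle c, whatever stage instruction i occupied in cycle c - k. A short Python sketch (the function name is ours) that regenerates the table:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_in_cycle(instruction, cycle):
    """Stage occupied by instruction number `instruction` (0-based)
    during clock `cycle` (1-based), or "" if it is not in the pipe."""
    idx = cycle - 1 - instruction   # shift each instruction one cycle
    return STAGES[idx] if 0 <= idx < len(STAGES) else ""

for i in range(5):
    cells = [stage_in_cycle(i, c).ljust(4) for c in range(1, 10)]
    print(f"Instruction i + {i}:  " + "".join(cells))
```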
`
`Your instinct is right if you find it hard to believe that pipelining is as simple
`as this, because it's not. In this and the following sections, we will make our DLX
`pipeline "real" by dealing with problems that pipelining introduces.
`To begin with, we have to determine what happens on every clock cycle of the
`machine and make sure we don't try to perform two different operations with the
same datapath resource on the same clock cycle. For example, a single ALU cannot be asked to compute an effective address and perform a subtract operation at the same time. Thus, we must ensure that the overlap of instructions in the pipeline cannot cause such a conflict. Fortunately, the simplicity of the DLX instruction set makes resource evaluation relatively easy. Figure 3.3 shows a simplified version of the DLX datapath drawn in pipeline fashion. As you can see, the major functional units are used in different cycles and hence overlapping the execution of multiple instructions introduces relatively few conflicts. There are three observations on which this fact rests.
First, the basic datapath of the last section already used separate instruction and data memories, which we would typically implement with separate instruction and data caches (discussed in Chapter 5). The use of separate caches eliminates a conflict for a single memory that would arise between instruction fetch
`
`
`
`
`
[Figure 3.3 diagram: the DLX datapath repeated and shifted one clock cycle per instruction, with time in clock cycles (CC 1 through CC 9) across the page and program execution order down the page]
`
`FIGURE 3.3 The pipeline can be thought of as a series of datapaths shifted in time. This shows the overlap among
`the parts of the datapath, with clock cycle 5 (CC 5) showing the steady state situation. Because the register file is used as
`a source in the EX stage and as a destination in the WB stage, it appears twice. We show that it is read in one stage and
`written in another by using a solid line, on the right or left, respectively, and a dashed line on the other side. The abbreviation
`IM is used for instruction memory, DM for data memory, and CC for clock cycle.
`
`and data memory access. Notice that if our pipelined machine has a clock cycle
`that is equal to that of the unpipelined version, the memory system must deliver
`five times the bandwidth. This is one cost of higher performance.
`Second, the register file is used in the two stages: for reading in ID and for
`writing in WB. These uses are distinct, so we simply show the register file in two
`places. This does mean that we need to perform two reads and one write every
`clock cycle. What if a read and write are to the same register? For now, we ignore
`this problem, but we will focus on it in the next section.
`Third, Figure 3.3 does not deal with the PC. To start a new instruction every
`clock, we must increment and store the PC every clock, and this must be done
`during the IF stage in preparation for the next instruction. The problem arises
`
`
`
`
`
`when we consider the effect of branches, which change the PC also, but not until
`the MEM stage. This is not a problem in our multicycle, unpipelined datapath,
`since the PC is written once in the MEM stage. For now, we will organize our
`pipelined datapath to write the PC in IF, and write either the incremented PC or
`the value of the branch target of an earlier branch. This introduces a problem in
`how branches are handled that we will explain in the next section and explore in
`detail in section 3.5.
`Because every pipe stage is active on every clock cycle, all operations in a
`pipe stage must complete in one clock cycle and any combination of operations
`must be able to occur at once. Furthermore, pipelining the datapath requires that
`values passed from one pipe stage to the next must be placed in registers.
`Figure 3.4 shows the DLX pipeline with the appropriate registers, called pipeline
`registers or pipeline latches, between each pipeline stage. The registers are
`labeled with the names of the stages they connect. Figure 3.4 is drawn so that
`connections through the pipeline registers from one stage to another are clear.
`
[Figure 3.4 diagram: the pipelined DLX datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages; visible labels include the branch-taken signal, the data memory, and the register-specifier fields IR6..10, IR11..15, and MEM/WB.IR]
`
FIGURE 3.4 The datapath is pipelined by adding a set of registers, one between each pair of pipe stages. The registers serve to convey values and control information from one stage to the next. We can also think of the PC as a pipeline register, which sits before the IF stage of the pipeline, leading to one pipeline register for each pipe stage. The selection multiplexer for the PC has been moved so that the PC is written in exactly one stage (IF). If we didn't move it there would be a conflict when a branch occurred, since two instructions would try to write different values into the PC. Most of the datapaths flow from left to right, which is from earlier in time to later. The paths flowing from right to left (which carry the register write-back information and PC information on a branch) introduce complications into our pipeline, which we will spend much of this chapter overcoming.
`
`
`
`
`
`All of the registers needed to hold values temporarily between clock cycles
`within one instruction are subsumed into these pipeline registers. The fields of the
`instruction register (IR), which is part of the IF/ID register, are labeled when they
`are used to supply register names. The pipeline registers carry both data and control
`from one pipeline stage to the next. Any value needed on a later pipeline stage must
`be placed in such a register and copied from one pipeline register to the next, until it
is no longer needed. If we tried to just use the temporary registers we had in our earlier unpipelined datapath, values could be overwritten before all uses were completed. For example, the field of a register operand used for a write on a load or ALU operation is supplied from the MEM/WB pipeline register rather than from the IF/ID register. This is because we want a load or ALU operation to write the register designated by that operation, not the register field of the instruction currently transitioning from IF to ID! This destination register field is simply copied from one pipeline register to the next, until it is needed during the WB stage.
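The point about carrying the destination field forward can be made concrete with a toy model in which each pipeline register is a Python dict. The field names here are ours, not the book's; the idea shown is that WB reads the register specifier from MEM/WB, three cycles after ID read it out of the instruction:

```python
# An ALU instruction's trip through the pipeline registers. The "rd"
# field (destination register specifier) is copied stage to stage so
# that WB names the right register even though IF/ID now holds a
# different, younger instruction.
id_ex = {"rd": 3, "a": 5, "b": 7}               # after ID
ex_mem = {"rd": id_ex["rd"],                    # copy the specifier
          "alu_out": id_ex["a"] + id_ex["b"]}   # after EX
mem_wb = {"rd": ex_mem["rd"],                   # copy it again
          "alu_out": ex_mem["alu_out"]}         # after MEM

regs = [0] * 32
regs[mem_wb["rd"]] = mem_wb["alu_out"]          # WB uses MEM/WB's field
print(regs[3])  # 12
```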
Any instruction is active in exactly one stage of the pipeline at a time; therefore, any actions taken on behalf of an instruction occur between a pair of pipeline registers. Thus, we can also look at the activities of the pipeline by examining what has to happen on any pipeline stage depending on the instruction type. Figure 3.5 shows this view. Fields of the pipeline registers are named so as to show the flow of data from one stage to the next. Notice that the actions in the first two stages are independent of the current instruction type; they must be independent because the instruction is not decoded until the end of the ID stage. The IF activity depends on whether the instruction in EX/MEM is a taken branch. If so, then the branch target is used both to fetch the instruction and to compute the next PC. Otherwise, the current PC is used for both of these actions. (As we said earlier, this effect of branches leads to complications in the pipeline that we deal with in the next few sections.) The fixed-position encoding of the register source operands is critical to allowing the registers to be fetched during ID.
To control this simple pipeline we need only determine how to set the control for the four multiplexers in the datapath of Figure 3.4. The two multiplexers in the ALU stage are set depending on the instruction type, which is dictated by the IR field of the ID/EX register. The top ALU input multiplexer is set by whether the instruction is a branch or not, and the bottom multiplexer is set by whether the instruction is a register-register ALU operation or any other type of operation. The multiplexer in the IF stage chooses whether to use the current PC or the value of EX/MEM.NPC (the branch target) as the instruction address. This multiplexer is controlled by the field EX/MEM.cond. The fourth multiplexer is controlled by whether the instruction in the WB stage is a load or an ALU operation. In addition to these four multiplexers, there is one additional multiplexer needed that is not drawn in Figure 3.4, but whose existence is clear from looking at the WB stage of an ALU operation. The destination register field is in one of two different places depending on the instruction type (register-register ALU versus either ALU immediate or load). Thus, we will need a multiplexer to choose the correct portion of the IR in the MEM/WB register to specify the register destination field.
`
`
`
`
`
Stage: IF (any instruction)
  IF/ID.IR ← Mem[PC];
  IF/ID.NPC, PC ← (if EX/MEM.cond {EX/MEM.NPC} else {PC + 4});

Stage: ID (any instruction)
  ID/EX.A ← Regs[IF/ID.IR6..10]; ID/EX.B ← Regs[IF/ID.IR11..15];
  ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR;
  ID/EX.Imm ← (IR16)^16 ## IR16..31;

Stage: EX
  ALU instruction:
    EX/MEM.IR ← ID/EX.IR;
    EX/MEM.ALUOutput ← ID/EX.A op ID/EX.B;
      or EX/MEM.ALUOutput ← ID/EX.A op ID/EX.Imm;
    EX/MEM.cond ← 0;
  Load or store instruction:
    EX/MEM.IR ← ID/EX.IR;
    EX/MEM.ALUOutput ← ID/EX.A + ID/EX.Imm;
    EX/MEM.cond ← 0;
    EX/MEM.B ← ID/EX.B;
  Branch instruction:
    EX/MEM.ALUOutput ← ID/EX.NPC + ID/EX.Imm;
    EX/MEM.cond ← (ID/EX.A op 0);

Stage: MEM
  ALU instruction:
    MEM/WB.IR ← EX/MEM.IR;
    MEM/WB.ALUOutput ← EX/MEM.ALUOutput;
  Load or store instruction:
    MEM/WB.IR ← EX/MEM.IR;
    MEM/WB.LMD ← Mem[EX/MEM.ALUOutput];
      or Mem[EX/MEM.ALUOutput] ← EX/MEM.B;

Stage: WB
  ALU instruction:
    Regs[MEM/WB.IR16..20] ← MEM/WB.ALUOutput;
      or Regs[MEM/WB.IR11..15] ← MEM/WB.ALUOutput;
  Load instruction:
    Regs[MEM/WB.IR11..15] ← MEM/WB.LMD;
`
FIGURE 3.5 Events on every pipe stage of the DLX pipeline. Let's review the actions in the stages that are specific to the pipeline organization. In IF, in addition to fetching the instruction and computing the new PC, we store the incremented PC (NPC) into a pipeline register for later use in computing the branch target address. This structure is the same as the organization in Figure 3.4, where the PC is updated in IF from one of two sources. In ID, we fetch the registers, extend the sign of the lower 16 bits of the IR, and pass along the IR and NPC. During EX, we perform an ALU operation or an address calculation; we pass along the IR and the B register (if the instruction is a store). We also set the value of cond to 1 if the instruction is a taken branch. During the MEM phase, we cycle the memory, make a branch decision, write the PC if needed, and pass along values needed in the final pipe stage. Finally, during WB, we update the register file from either the ALU output or the loaded value. For simplicity we always pass the entire IR from one stage to the next, though as an instruction proceeds down the pipeline, less and less of the IR is needed.
`
`Basic Performance Issues in Pipelining
`
Pipelining increases the CPU instruction throughput (the number of instructions completed per unit of time) but it does not reduce the execution time of an individual instruction. In fact, it usually slightly increases the execution time of each instruction due to overhead in the control of the pipeline. The increase in instruction throughput means that a program runs faster and has lower total execution time, even though no single instruction runs faster!
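The throughput/latency distinction is easy to quantify: with k stages and one instruction started per cycle, n instructions finish in k + (n - 1) cycles, yet each individual instruction still spends k cycles in flight. A sketch with assumed numbers (k = 5, n = 1000):

```python
def total_cycles(n_instructions, stages=5):
    """Cycles for n instructions on an ideal pipeline: the first takes
    `stages` cycles to drain, then one instruction completes per cycle."""
    return stages + (n_instructions - 1)

n = 1000
unpipelined = n * 5          # 5 cycles each, strictly one at a time
pipelined = total_cycles(n)  # 1004 cycles
print(pipelined, unpipelined / pipelined)  # 1004, speedup ~4.98
```

The speedup approaches the stage count of 5 as n grows, even though every instruction's own latency is unchanged (or slightly worse, once latch overhead is counted).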
`
`
`
`
`
The fact that the execution time of each instruction does not decrease puts limits on the practical depth of a pipeline, as we will see in the next section. In addition to limitations arising from pipeline latency, limits arise from imbalance among the pipe stages and from pipelining overhead. Imbalance among the pipe
`stages reduces performance since the clock can run no faster than the time needed
`for the slowest pipeline stage. Pipeline overhead arises from the combination of
`pipeline register delay and clock skew. The pipeline registers or latches add setup
`time plus propagation delay to the clock cycle. Clock skew also contributes to the
`lower limit on the clock cycle. Once the clock cycle is as small as the sum of the
`clock skew and latch overhead, no further p