SAMSUNG-1007
Page 1 of 82

Sponsoring Editor Bruce M. Spatz
Production Manager Yonie Overton
Production Editor Julie Pabst
Text Design Gary Head, with second edition modifications by Carron Design
Cover Art Direction David Lance Goines
Cover Design Carron Design
Illustration Illustrious, Inc.
Copyeditors Ken DellaPenta, Gary Morris, Sharilyn Hovind, Jessie Wood
Proofreader Jeff Van Bueren
Composition Nancy Logan
Printer Courier Corporation

Morgan Kaufmann Publishers, Inc.
Editorial and Sales Office
340 Pine Street, Sixth Floor
San Francisco, CA 94104-3205 USA
Telephone: 415 / 392-2665, Facsimile: 415 / 982-2665, Internet: mkp@mkp.com
Order toll free: 800 / 745-7323

© 1990, 1996 by Morgan Kaufmann Publishers, Inc.
All rights reserved

Published 1990. Second Edition 1996
Printed in the United States of America
00 99 98 97 96
5 4 3 2

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, recording, or otherwise) without the prior written permission of the publisher.

All instruction sets and other design information of the DLX computer system contained herein are copyrighted by the publisher and may not be incorporated in other publications or distributed by media without formal acknowledgment and written consent from the publisher. Use of the DLX in other publications for educational purposes is encouraged and application for permission is welcomed.

ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or intended for the authors should be addressed to the Editorial and Sales Office of Morgan Kaufmann Publishers, Inc., Dept. P&H APE. Information regarding error sightings is encouraged. Any error sightings that are accepted for correction in subsequent printings will be rewarded by the authors with a payment of $1.00 (U.S.) per correction upon availability of the new printing. Electronic mail can be sent to arc2bugs@mkp.com. (Please include your full name and permanent mailing address.)

INSTRUCTOR SUPPORT: For information on classroom software and other instructor materials available to adopters, please contact the Editorial and Sales Office of Morgan Kaufmann Publishers, Inc. (415 / 392-2665) or visit our web site at http://mkp.com.

Library of Congress Cataloging-in-Publication Data is available for this book.
ISBN 1-55860-329-8

It is quite a three-pipe problem.

Sir Arthur Conan Doyle
The Adventures of Sherlock Holmes

Chapter 3 Pipelining

3.1 What Is Pipelining?

Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. Today, pipelining is the key implementation technique used to make fast CPUs.

A pipeline is like an assembly line. In an automobile assembly line, there are many steps, each contributing something to the construction of the car. Each step operates in parallel with the other steps, though on a different car. In a computer pipeline, each step in the pipeline completes a part of an instruction. Like the assembly line, different steps are completing different parts of different instructions in parallel. Each of these steps is called a pipe stage or a pipe segment. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end, just as cars would in an assembly line.

In our automobile assembly line, throughput is defined as the number of cars per hour and is determined by how often a completed car exits the assembly line. Likewise, the throughput of an instruction pipeline is determined by how often an instruction exits the pipeline. Because the pipe stages are hooked together, all the
`
`
stages must be ready to proceed at the same time, just as we would require in an assembly line. The time required between moving an instruction one step down the pipeline is a machine cycle. Because all stages proceed at the same time, the length of a machine cycle is determined by the time required for the slowest pipe stage, just as in an auto assembly line, the longest step would determine the time between advancing the line. In a computer, this machine cycle is usually one clock cycle (sometimes it is two, rarely more), although the clock may have multiple phases.

The pipeline designer's goal is to balance the length of each pipeline stage, just as the designer of the assembly line tries to balance the time for each step in the process. If the stages are perfectly balanced, then the time per instruction on the pipelined machine (assuming ideal conditions) is equal to

    Time per instruction on unpipelined machine
    -------------------------------------------
              Number of pipe stages

Under these conditions, the speedup from pipelining equals the number of pipe stages, just as an assembly line with n stages can ideally produce cars n times as fast. Usually, however, the stages will not be perfectly balanced; furthermore, pipelining does involve some overhead. Thus, the time per instruction on the pipelined machine will not have its minimum possible value, yet it can be close.

Pipelining yields a reduction in the average execution time per instruction. Depending on what you consider as the baseline, the reduction can be viewed as decreasing the number of clock cycles per instruction (CPI), as decreasing the clock cycle time, or as a combination. If the starting point is a machine that takes multiple clock cycles per instruction, then pipelining is usually viewed as reducing the CPI. This is the primary view we will take. If the starting point is a machine that takes one (long) clock cycle per instruction, then pipelining decreases the clock cycle time.

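As a quick check of these relationships, the arithmetic can be sketched in a few lines of Python. The stage times below are invented for illustration; only the formulas come from the text.

```python
# Ideal case: perfectly balanced stages. The time per instruction drops by
# the number of pipe stages, so the speedup equals the stage count.
unpipelined_ns = 50.0          # hypothetical time per instruction
num_stages = 5
ideal_pipelined_ns = unpipelined_ns / num_stages      # 10.0 ns
ideal_speedup = unpipelined_ns / ideal_pipelined_ns   # 5.0

# Unbalanced case: the machine cycle is set by the slowest stage, so the
# realized speedup falls short of the stage count.
stage_ns = [10, 8, 10, 12, 10]     # hypothetical per-stage times
cycle_ns = max(stage_ns)           # 12 ns machine cycle
real_speedup = sum(stage_ns) / cycle_ns   # 50/12, about 4.2 rather than 5

print(ideal_speedup, round(real_speedup, 2))
```

The second calculation shows why balancing the stages matters: the four faster stages sit idle for part of each 12 ns cycle.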
Pipelining is an implementation technique that exploits parallelism among the instructions in a sequential instruction stream. It has the substantial advantage that, unlike some speedup techniques (see Chapter 8 and Appendix B), it is not visible to the programmer. In this chapter we will first cover the concept of pipelining using DLX and a simple version of its pipeline. We use DLX because its simplicity makes it easy to demonstrate the principles of pipelining. In addition, to simplify the diagrams we do not include the jump instructions of DLX; adding them does not involve new concepts, only bigger diagrams. The principles of pipelining in this chapter apply to more complex instruction sets than DLX or its RISC relatives, although the resulting pipelines are more complex. Using the DLX example, we will look at the problems pipelining introduces and the performance attainable under typical situations. Section 3.9 examines the MIPS R4000 pipeline, which is similar to other recent machines with extensive pipelining. Chapter 4 looks at more advanced pipelining techniques being used in the highest-performance processors.
`
Before we proceed to basic pipelining, we need to review a simple implementation of an unpipelined version of DLX.

A Simple Implementation of DLX

To understand how DLX can be pipelined, we need to understand how it is implemented without pipelining. This section shows a simple implementation where every instruction takes at most five clock cycles. We will extend this basic implementation to a pipelined version, resulting in a much lower CPI. Our unpipelined implementation is not the most economical or the highest-performance implementation without pipelining. Instead, it is designed to lead naturally to a pipelined implementation. We will indicate where the implementation could be improved later in this section. Implementing the instruction set requires the introduction of several temporary registers that are not part of the architecture; these are introduced in this section to simplify pipelining.

Every DLX instruction can be implemented in at most five clock cycles. The five clock cycles are as follows.

1. Instruction fetch cycle (IF):

IR ← Mem[PC];
NPC ← PC + 4;

Operation: Send out the PC and fetch the instruction from memory into the instruction register (IR); increment the PC by 4 to address the next sequential instruction. The IR is used to hold the instruction that will be needed on subsequent clock cycles; likewise the register NPC is used to hold the next sequential PC.

2. Instruction decode/register fetch cycle (ID):

A ← Regs[IR6..10];
B ← Regs[IR11..15];
Imm ← ((IR16)^16 ## IR16..31);

Operation: Decode the instruction and access the register file to read the registers. The outputs of the general-purpose registers are read into two temporary registers (A and B) for use in later clock cycles. The lower 16 bits of the IR are also sign-extended and stored into the temporary register Imm, for use in the next cycle.

Decoding is done in parallel with reading registers, which is possible because these fields are at a fixed location in the DLX instruction format (see Figure 2.21 on page 99). This technique is known as fixed-field decoding. Note that we may read a register we don't use, which doesn't help but also
doesn't hurt. Because the immediate portion of an instruction is located in an identical place in every DLX format, the sign-extended immediate is also calculated during this cycle in case it is needed in the next cycle.

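Fixed-field decoding and this sign extension are easy to mimic in code. The sketch below assumes the bit numbering used by the IR subscripts in the text (bit 0 is the most significant bit of the 32-bit instruction word); the example instruction encoding is made up for illustration.

```python
def ir_field(ir, hi, lo):
    # Extract IR bits hi..lo, numbering bits 0..31 from the most
    # significant end, matching subscripts such as IR6..10 in the text.
    width = lo - hi + 1
    return (ir >> (31 - lo)) & ((1 << width) - 1)

def sign_extend_16(imm):
    # Replicate the immediate's sign bit into the upper 16 bits,
    # returning a Python signed integer.
    return imm - (1 << 16) if imm & 0x8000 else imm

# Hypothetical I-type word: rs1 in bits 6..10, rd in bits 11..15,
# immediate in bits 16..31 (here 0xFFFF, i.e. -1 after sign extension).
ir = 0x20A1FFFF
rs1 = ir_field(ir, 6, 10)                     # 5
rd = ir_field(ir, 11, 15)                     # 1
imm = sign_extend_16(ir_field(ir, 16, 31))    # -1
```

Because the register fields sit at fixed positions, `ir_field` can run before the opcode is even examined, which is exactly what lets the register read proceed in parallel with decode.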
3. Execution/effective address cycle (EX):

The ALU operates on the operands prepared in the prior cycle, performing one of four functions depending on the DLX instruction type.

• Memory reference:

ALUOutput ← A + Imm;

Operation: The ALU adds the operands to form the effective address and places the result into the register ALUOutput.

• Register-Register ALU instruction:

ALUOutput ← A op B;

Operation: The ALU performs the operation specified by the opcode on the value in register A and on the value in register B. The result is placed in the temporary register ALUOutput.

• Register-Immediate ALU instruction:

ALUOutput ← A op Imm;

Operation: The ALU performs the operation specified by the opcode on the value in register A and on the value in register Imm. The result is placed in the temporary register ALUOutput.

• Branch:

ALUOutput ← NPC + Imm;
Cond ← (A op 0);

Operation: The ALU adds the NPC to the sign-extended immediate value in Imm to compute the address of the branch target. Register A, which has been read in the prior cycle, is checked to determine whether the branch is taken. The comparison operation op is the relational operator determined by the branch opcode; for example, op is "==" for the instruction BEQZ.

The load-store architecture of DLX means that effective address and execution cycles can be combined into a single clock cycle, since no instruction needs to simultaneously calculate a data address, calculate an instruction target address,
and perform an operation on the data. The other integer instructions not included above are jumps of various forms, which are similar to branches.

4. Memory access/branch completion cycle (MEM):

The only DLX instructions active in this cycle are loads, stores, and branches.

• Memory reference:

LMD ← Mem[ALUOutput] or
Mem[ALUOutput] ← B;

Operation: Access memory if needed. If the instruction is a load, data returns from memory and is placed in the LMD (load memory data) register; if it is a store, then the data from the B register is written into memory. In either case the address used is the one computed during the prior cycle and stored in the register ALUOutput.

• Branch:

if (cond) PC ← ALUOutput else PC ← NPC;

Operation: If the instruction branches, the PC is replaced with the branch destination address in the register ALUOutput; otherwise, it is replaced with the incremented PC in the register NPC.

5. Write-back cycle (WB):

• Register-Register ALU instruction:

Regs[IR16..20] ← ALUOutput;

• Register-Immediate ALU instruction:

Regs[IR11..15] ← ALUOutput;

• Load instruction:

Regs[IR11..15] ← LMD;

Operation: Write the result into the register file, whether it comes from the memory system (which is in LMD) or from the ALU (which is in ALUOutput); the register destination field is also in one of two positions depending on the opcode.

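The five cycles above can be strung together into a toy interpreter. This is a sketch under simplifying assumptions, not the DLX datapath itself: instructions are plain dicts rather than encoded words, and only an add, a load, and a store are modeled.

```python
def run_instruction(instr, regs, mem, pc):
    """Run one instruction through the five conceptual cycles."""
    # IF: fetch (elided in this model) and compute the next sequential PC.
    npc = pc + 4
    # ID: read the source registers into the temporaries A and B; the
    # immediate is already a signed integer here, so no sign extension.
    a = regs[instr["rs1"]]
    b = regs[instr["rs2"]] if "rs2" in instr else 0
    imm = instr.get("imm", 0)
    # EX: ALU operation or effective-address calculation.
    if instr["op"] == "add":
        alu_out = a + b
    else:                       # "lw" or "sw": effective address = A + Imm
        alu_out = a + imm
    # MEM: only loads and stores touch the data memory.
    if instr["op"] == "lw":
        lmd = mem[alu_out]
    elif instr["op"] == "sw":
        mem[alu_out] = b
    # WB: write back either the ALU result or the loaded value.
    if instr["op"] == "add":
        regs[instr["rd"]] = alu_out
    elif instr["op"] == "lw":
        regs[instr["rd"]] = lmd
    return npc

regs = {1: 8, 2: 5, 3: 0, 4: 0}
mem = {12: 99}
pc = 0
pc = run_instruction({"op": "add", "rd": 3, "rs1": 1, "rs2": 2}, regs, mem, pc)
pc = run_instruction({"op": "lw", "rd": 4, "rs1": 1, "imm": 4}, regs, mem, pc)
# regs[3] == 13, regs[4] == 99, pc == 8
```

Each comment marks the cycle boundary where, in the real implementation, a value would land in a temporary register (NPC, A, B, Imm, ALUOutput, LMD) at the end of the clock cycle.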
Figure 3.1 shows how an instruction flows through the datapath. At the end of each clock cycle, every value computed during that clock cycle and required on a later clock cycle (whether for this instruction or the next) is written into a storage
device, which may be memory, a general-purpose register, the PC, or a temporary register (i.e., LMD, Imm, A, B, IR, NPC, ALUOutput, or Cond). The temporary registers hold values between clock cycles for one instruction, while the other storage elements are visible parts of the state and hold values between successive instructions.

[Figure 3.1 diagram: the unpipelined DLX datapath, drawn left to right through the instruction fetch, instruction decode/register fetch, and execute/address calculation portions]

FIGURE 3.1 The implementation of the DLX datapath allows every instruction to be executed in four or five clock cycles. Although the PC is shown in the portion of the datapath that is used in instruction fetch and the registers are shown in the portion of the datapath that is used in instruction decode/register fetch, both of these functional units are read as well as written by an instruction. Although we show these functional units in the cycle corresponding to where they are read, the PC is written during the memory access clock cycle (as well as in instruction fetch) and the registers are written during the write back clock cycle. In both cases, the writes in later pipe stages are indicated by the multiplexer output (in memory access or write back) that carries a value back to the PC or registers. These backward-flowing signals introduce much of the complexity of pipelining, and we will look at them more carefully in the next few sections.

In this implementation, branch instructions require four cycles and all other instructions require five cycles. Assuming the branch frequency of 12% from the last chapter, this leads to an overall CPI of 4.88. This implementation, however, is not optimal either in achieving the best performance or in using the minimal amount of hardware given the performance level. The CPI could be improved
without affecting the clock rate by completing ALU instructions during the MEM cycle, since those instructions are idle during that cycle. Assuming ALU instructions occupy 44% of the instruction mix, as we measured in Chapter 2, this improvement would lead to a CPI of 4.44, or an improvement of 4.88/4.44 = 1.1. Beyond this simple change, any other attempts to decrease the CPI may increase the clock cycle time, since such changes would need to put more activity into a clock cycle. Of course, it may still be beneficial to trade an increase in the clock cycle time for a decrease in the CPI, but this requires a detailed analysis and is unlikely to produce large improvements, especially if the initial distribution of work among the clock cycles is reasonably balanced.

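The CPI figures quoted above can be reproduced directly. The 12% branch and 44% ALU frequencies are the ones the text takes from the instruction-mix measurements of Chapter 2.

```python
# Base machine: branches (12%) take 4 cycles, everything else takes 5.
cpi_base = 0.12 * 4 + 0.88 * 5                   # 4.88

# Letting ALU instructions (44%) complete during MEM makes them 4-cycle
# instructions too; the remaining 44% (loads and stores) still take 5.
cpi_improved = 0.12 * 4 + 0.44 * 4 + 0.44 * 5    # 4.44

speedup = cpi_base / cpi_improved                # 4.88 / 4.44, about 1.1
```

Working the sums by instruction class makes it clear why the gain is modest: only the 44% ALU slice loses a cycle, so the average drops by 0.44 cycles.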
Although all machines today are pipelined, this multicycle implementation is a reasonable approximation of how most machines would have been implemented in earlier times. A simple finite-state machine could be used to implement the control following the five-cycle structure shown above. For a much more complex machine, microcode control could be used. In either event, an instruction sequence like that above would determine the structure of the control.

In addition to these CPI improvements, there are some hardware redundancies that could be eliminated in this multicycle implementation. For example, there are two ALUs: one to increment the PC and one used for effective address and ALU computation. Since they are not needed on the same clock cycle, we could merge them by adding additional multiplexers and sharing the same ALU. Likewise, instructions and data could be stored in the same memory, since the data and instruction accesses happen on different clock cycles.

Rather than optimize this simple implementation, we will leave the design as it is in Figure 3.1, since this provides us with a better base for the pipelined implementation.

As an alternative to the multicycle design discussed in this section, we could also have implemented the machine so that every instruction takes one long clock cycle. In such cases, the temporary registers would be deleted, since there would not be any communication across clock cycles within an instruction. Every instruction would execute in one long clock cycle, writing the result into the data memory, registers, or PC at the end of the clock cycle. The CPI would be one for such a machine. However, the clock cycle would be roughly equal to five times the clock cycle of the multicycle machine, since every instruction would need to traverse all the functional units. Designers would never use this single-cycle implementation for two reasons. First, a single-cycle implementation would be very inefficient for most machines that have a reasonable variation among the amount of work, and hence in the clock cycle time, needed for different instructions. Second, a single-cycle implementation requires the duplication of functional units that could be shared in a multicycle implementation. Nonetheless, this single-cycle datapath allows us to illustrate how pipelining can improve the clock cycle time, as opposed to the CPI, of a machine.

3.2 The Basic Pipeline for DLX

We can pipeline the datapath of Figure 3.1 with almost no changes by starting a new instruction on each clock cycle. (See why we chose that design!) Each of the clock cycles from the previous section becomes a pipe stage: a cycle in the pipeline. This results in the execution pattern shown in Figure 3.2, which is the typical way a pipeline structure is drawn. While each instruction takes five clock cycles to complete, during each clock cycle the hardware will initiate a new instruction and will be executing some part of the five different instructions.

                       Clock number
Instruction number     1    2    3    4    5    6    7    8    9

Instruction i          IF   ID   EX   MEM  WB
Instruction i + 1           IF   ID   EX   MEM  WB
Instruction i + 2                IF   ID   EX   MEM  WB
Instruction i + 3                     IF   ID   EX   MEM  WB
Instruction i + 4                          IF   ID   EX   MEM  WB

FIGURE 3.2 Simple DLX pipeline. On each clock cycle, another instruction is fetched and begins its five-cycle execution. If an instruction is started every clock cycle, the performance will be up to five times that of a machine that is not pipelined. The names for the stages in the pipeline are the same as those used for the cycles in the implementation on pages 127-129: IF = instruction fetch, ID = instruction decode, EX = execution, MEM = memory access, and WB = write back.

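The staircase pattern of Figure 3.2 can be generated mechanically, which is a handy sanity check when reasoning about longer pipelines or more instructions. A small sketch:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_rows(n):
    # Instruction i enters IF in clock cycle i + 1 and then advances one
    # stage per clock, producing the diagonal pattern of Figure 3.2.
    rows = []
    for i in range(n):
        cells = ["    "] * i + [f"{s:<4}" for s in STAGES]
        rows.append(f"Instruction i + {i}  " + "".join(cells))
    return rows

for row in pipeline_rows(5):
    print(row)
```

Each successive row is shifted right by one clock-cycle column, so in the steady state (cycle 5 here) every one of the five stages is busy with a different instruction.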
Your instinct is right if you find it hard to believe that pipelining is as simple as this, because it's not. In this and the following sections, we will make our DLX pipeline "real" by dealing with problems that pipelining introduces.

To begin with, we have to determine what happens on every clock cycle of the machine and make sure we don't try to perform two different operations with the same datapath resource on the same clock cycle. For example, a single ALU cannot be asked to compute an effective address and perform a subtract operation at the same time. Thus, we must ensure that the overlap of instructions in the pipeline cannot cause such a conflict. Fortunately, the simplicity of the DLX instruction set makes resource evaluation relatively easy. Figure 3.3 shows a simplified version of the DLX datapath drawn in pipeline fashion. As you can see, the major functional units are used in different cycles and hence overlapping the execution of multiple instructions introduces relatively few conflicts. There are three observations on which this fact rests.

First, the basic datapath of the last section already used separate instruction and data memories, which we would typically implement with separate instruction and data caches (discussed in Chapter 5). The use of separate caches eliminates a conflict for a single memory that would arise between instruction fetch
[Figure 3.3 diagram: the datapath repeated for successive instructions and shifted one clock cycle to the right per instruction, with time in clock cycles CC 1 through CC 9 running across the page]

FIGURE 3.3 The pipeline can be thought of as a series of datapaths shifted in time. This shows the overlap among the parts of the datapath, with clock cycle 5 (CC 5) showing the steady-state situation. Because the register file is used as a source in the ID stage and as a destination in the WB stage, it appears twice. We show that it is read in one stage and written in another by using a solid line, on the right or left, respectively, and a dashed line on the other side. The abbreviation IM is used for instruction memory, DM for data memory, and CC for clock cycle.

and data memory access. Notice that if our pipelined machine has a clock cycle that is equal to that of the unpipelined version, the memory system must deliver five times the bandwidth. This is one cost of higher performance.

Second, the register file is used in the two stages: for reading in ID and for writing in WB. These uses are distinct, so we simply show the register file in two places. This does mean that we need to perform two reads and one write every clock cycle. What if a read and write are to the same register? For now, we ignore this problem, but we will focus on it in the next section.

Third, Figure 3.3 does not deal with the PC. To start a new instruction every clock, we must increment and store the PC every clock, and this must be done during the IF stage in preparation for the next instruction. The problem arises
when we consider the effect of branches, which change the PC also, but not until the MEM stage. This is not a problem in our multicycle, unpipelined datapath, since the PC is written once in the MEM stage. For now, we will organize our pipelined datapath to write the PC in IF, and write either the incremented PC or the value of the branch target of an earlier branch. This introduces a problem in how branches are handled that we will explain in the next section and explore in detail in section 3.5.

Because every pipe stage is active on every clock cycle, all operations in a pipe stage must complete in one clock cycle and any combination of operations must be able to occur at once. Furthermore, pipelining the datapath requires that values passed from one pipe stage to the next must be placed in registers. Figure 3.4 shows the DLX pipeline with the appropriate registers, called pipeline registers or pipeline latches, between each pipeline stage. The registers are labeled with the names of the stages they connect. Figure 3.4 is drawn so that connections through the pipeline registers from one stage to another are clear.

[Figure 3.4 diagram: the pipelined datapath with the ID/EX, EX/MEM, and MEM/WB pipeline registers between stages; a "branch taken" signal selects the next PC, the data memory sits in the memory access stage, and the IR fields IR6..10 and IR11..15 and the MEM/WB.IR field supply register numbers]

FIGURE 3.4 The datapath is pipelined by adding a set of registers, one between each pair of pipe stages. The registers serve to convey values and control information from one stage to the next. We can also think of the PC as a pipeline register, which sits before the IF stage of the pipeline, leading to one pipeline register for each pipe stage. The selection multiplexer for the PC has been moved so that the PC is written in exactly one stage (IF). If we didn't move it there would be a conflict when a branch occurred, since two instructions would try to write different values into the PC. Most of the datapaths flow from left to right, which is from earlier in time to later. The paths flowing from right to left (which carry the register write-back information and PC information on a branch) introduce complications into our pipeline, which we will spend much of this chapter overcoming.

All of the registers needed to hold values temporarily between clock cycles within one instruction are subsumed into these pipeline registers. The fields of the instruction register (IR), which is part of the IF/ID register, are labeled when they are used to supply register names. The pipeline registers carry both data and control from one pipeline stage to the next. Any value needed on a later pipeline stage must be placed in such a register and copied from one pipeline register to the next, until it is no longer needed. If we tried to just use the temporary registers we had in our earlier unpipelined datapath, values could be overwritten before all uses were completed. For example, the field of a register operand used for a write on a load or ALU operation is supplied from the MEM/WB pipeline register rather than from the IF/ID register. This is because we want a load or ALU operation to write the register designated by that operation, not the register field of the instruction currently transitioning from IF to ID! This destination register field is simply copied from one pipeline register to the next, until it is needed during the WB stage.

Any instruction is active in exactly one stage of the pipeline at a time; therefore, any actions taken on behalf of an instruction occur between a pair of pipeline registers. Thus, we can also look at the activities of the pipeline by examining what has to happen on any pipeline stage depending on the instruction type. Figure 3.5 shows this view. Fields of the pipeline registers are named so as to show the flow of data from one stage to the next. Notice that the actions in the first two stages are independent of the current instruction type; they must be independent because the instruction is not decoded until the end of the ID stage. The IF activity depends on whether the instruction in EX/MEM is a taken branch. If so, then the branch target is used both to fetch the instruction and to compute the next PC. Otherwise, the current PC is used for both of these actions. (As we said earlier, this effect of branches leads to complications in the pipeline that we deal with in the next few sections.) The fixed-position encoding of the register source operands is critical to allowing the registers to be fetched during ID.

To control this simple pipeline we need only determine how to set the control for the four multiplexers in the datapath of Figure 3.4. The two multiplexers in the ALU stage are set depending on the instruction type, which is dictated by the IR field of the ID/EX register. The top ALU input multiplexer is set by whether the instruction is a branch or not, and the bottom multiplexer is set by whether the instruction is a register-register ALU operation or any other type of operation. The multiplexer in the IF stage chooses whether to use the current PC or the value of EX/MEM.NPC (the branch target) as the instruction address. This multiplexer is controlled by the field EX/MEM.cond. The fourth multiplexer is controlled by whether the instruction in the WB stage is a load or an ALU operation. In addition to these four multiplexers, there is one additional multiplexer needed that is not drawn in Figure 3.4, but whose existence is clear from looking at the WB stage of an ALU operation. The destination register field is in one of two different places depending on the instruction type (register-register ALU versus either ALU immediate or load). Thus, we will need a multiplexer to choose the correct portion of the IR in the MEM/WB register to specify the register destination field.

Stage  Any instruction

IF     IF/ID.IR ← Mem[PC];
       IF/ID.NPC, PC ← (if EX/MEM.cond {EX/MEM.NPC} else {PC + 4});

ID     ID/EX.A ← Regs[IF/ID.IR6..10]; ID/EX.B ← Regs[IF/ID.IR11..15];
       ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR;
       ID/EX.Imm ← (IR16)^16 ## IR16..31;

Stage  ALU instruction              Load or store instruction      Branch instruction

EX     EX/MEM.IR ← ID/EX.IR;        EX/MEM.IR ← ID/EX.IR;          EX/MEM.ALUOutput ←
       EX/MEM.ALUOutput ←           EX/MEM.ALUOutput ←               ID/EX.NPC + ID/EX.Imm;
         ID/EX.A op ID/EX.B;          ID/EX.A + ID/EX.Imm;         EX/MEM.cond ←
       or                           EX/MEM.cond ← 0;                 (ID/EX.A op 0);
       EX/MEM.ALUOutput ←           EX/MEM.B ← ID/EX.B;
         ID/EX.A op ID/EX.Imm;
       EX/MEM.cond ← 0;

MEM    MEM/WB.IR ← EX/MEM.IR;       MEM/WB.IR ← EX/MEM.IR;
       MEM/WB.ALUOutput ←           MEM/WB.LMD ←
         EX/MEM.ALUOutput;            Mem[EX/MEM.ALUOutput];
                                    or
                                    Mem[EX/MEM.ALUOutput] ←
                                      EX/MEM.B;

WB     Regs[MEM/WB.IR16..20] ←      Regs[MEM/WB.IR11..15] ←
         MEM/WB.ALUOutput;            MEM/WB.LMD;
       or
       Regs[MEM/WB.IR11..15] ←
         MEM/WB.ALUOutput;

FIGURE 3.5 Events on every pipe stage of the DLX pipeline. Let's review the actions in the stages that are specific to the pipeline organization. In IF, in addition to fetching the instruction and computing the new PC, we store the incremented PC (NPC) into a pipeline register for later use in computing the branch target address. This structure is the same as the organization in Figure 3.4, where the PC is updated in IF from one of two sources. In ID, we fetch the registers, extend the sign of the lower 16 bits of the IR, and pass along the IR and NPC. During EX, we perform an ALU operation or an address calculation; we pass along the IR and the B register (if the instruction is a store). We also set the value of cond to 1 if the instruction is a taken branch. During the MEM phase, we cycle the memory, make a branch decision, write the PC if needed, and pass along values needed in the final pipe stage. Finally, during WB, we update the register field from either the ALU output or the loaded value. For simplicity we always pass the entire IR from one stage to the next, though as an instruction proceeds down the pipeline, less and less of the IR is needed.

Basic Performance Issues in Pipelining

Pipelining increases the CPU instruction throughput, the number of instructions completed per unit of time, but it does not reduce the execution time of an individual instruction. In fact, it usually slightly increases the execution time of each instruction due to overhead in the control of the pipeline. The increase in instruction throughput means that a program runs faster and has lower total execution time, even though no single instruction runs faster!

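The throughput-versus-latency distinction is worth a quick back-of-the-envelope calculation; the numbers below are invented for illustration.

```python
n = 1000          # instructions in a program
stage_ns = 10     # per-stage time of a balanced five-stage pipeline
depth = 5

# Each instruction still passes through all five stages, so its own
# latency is unchanged by pipelining.
latency_ns = depth * stage_ns                 # 50 ns per instruction

# Unpipelined: instructions run back to back, 50 ns apiece.
unpipelined_ns = n * latency_ns               # 50,000 ns

# Pipelined: after the pipe fills (depth cycles), one instruction
# completes every cycle.
pipelined_ns = (depth + (n - 1)) * stage_ns   # 10,040 ns

speedup = unpipelined_ns / pipelined_ns       # approaches 5 as n grows
```

The program speedup comes entirely from completing an instruction every cycle once the pipe is full; no individual instruction finishes any sooner than before.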
The fact that the execution time of each instruction does not decrease puts limits on the practical depth of a pipeline, as we will see in the next section. In addition to limitations arising from pipeline latency, limits arise from imbalance among the pipe stages and from pipelining overhead. Imbalance among the pipe stages reduces performance since the clock can run no faster than the time needed for the slowest pipeline stage. Pipeline overhead arises from the combination of pipeline register delay and clock skew. The pipeline registers or latches add setup time plus propagation delay to the clock cycle. Clock skew also contributes to the lower limit on the clock cycle. Once the clock cycle is as small as the sum of the clock skew and latch overhead, no further pipelining is useful, since there is no time left in the cycle for useful work.

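The effect of latch overhead and clock skew on achievable speedup can be sketched numerically. The 50 ns of total logic delay and 1.5 ns of per-stage overhead here are invented; the point is only that deeper pipelines see diminishing returns.

```python
total_work_ns = 50.0    # logic delay of the whole unpipelined datapath
overhead_ns = 1.5       # latch setup + propagation delay + clock skew

def pipelined_speedup(depth):
    # The clock can be no shorter than the per-stage share of the work
    # plus the fixed per-stage overhead, so the speedup saturates.
    clock_ns = total_work_ns / depth + overhead_ns
    return total_work_ns / clock_ns

print(round(pipelined_speedup(5), 2))    # 4.35 rather than the ideal 5
print(round(pipelined_speedup(50), 2))   # 20.0, far below the ideal 50
```

As the depth grows, the useful work per stage shrinks toward zero while the overhead stays fixed, so the speedup approaches total_work_ns / overhead_ns and no further.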
