`
`135
`
that would pass back and forth between the processor and disk system that
enables the disk system to implement the managing strategy that you outline.
In your explanation describe each command by listing its name and operands,
followed by a description of what the command does.
`
Example: READ RECORD <track number> <sector number>. This command
tells the disk system to obtain the specified record and transmit the
record to the processor.
`
b) Assume that the processor that uses the disk system as a paging system for
virtual memory must read pages during page faults and must write back pages
that have been altered. What should the disk system do to manage this type of
access? To implement this management policy, what should be the commands
and replies between processor and disk system?
c) Assume that the processor is using the disk system to store the database for a
banking center that serves customers and tellers through on-line queries. How
might this database be managed by the disk cache? What should the command
and reply interface be for this type of access to implement your suggested
management algorithm?
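By way of illustration only (this is my sketch, not the book's interface;
all message names and fields are hypothetical), the command/reply traffic
of part (a) could be modeled as tagged messages:

    from dataclasses import dataclass

    # Hypothetical commands between processor and disk system.  The
    # names, operands, and reply format are illustrative assumptions.

    @dataclass
    class ReadRecord:
        track: int
        sector: int            # disk system returns the record's bytes

    @dataclass
    class WriteRecord:
        track: int
        sector: int
        data: bytes            # disk system stores the record, then acks

    @dataclass
    class Reply:
        ok: bool
        data: bytes = b""      # payload for READ RECORD, empty otherwise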
`2.16 The object of this exercise is to confirm the footprint model.
`a) Measure the footprints for Trace 1 and Trace 2. The footprint is the number of
`distinct lines touched.
`b) Now measure the cache-reload transient under two different conditions. Run
`Trace 2, then Trace 1, then Trace 2. The reload transient for the second running
`of Trace 2 is equal to the size of its footprint minus the number of lines of Trace
`2 that are resident in the cache when Trace 2 begins to run the second time.
`Measure the reload transient for one-way and two-way set associative caches of
`size 32, 64, 128, 256, 512, and 1024. (Twelve different caches.)
`c) Calculate the reload transient by using the size of the footprints and the cache
`structures in the mathematical model for the reload transient as described in the
`textbook.
`d) Compare your answers.
`e) Repeat the three parts of this question assuming that you run Trace 1, then Trace
`2, then Trace 1. Measure and calculate the reload transient for the second running
`of Trace 1.
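A minimal measurement harness for this exercise (my own scaffolding; it
assumes each trace is a simple list of line addresses):

    from collections import OrderedDict

    def footprint(trace):
        # Footprint = number of distinct lines touched (part a).
        return len(set(trace))

    class SetAssocCache:
        # Minimal LRU set-associative cache; 'size' is in lines.
        def __init__(self, size, ways):
            self.sets = size // ways
            self.ways = ways
            self.data = [OrderedDict() for _ in range(self.sets)]

        def touch(self, line):
            s = self.data[line % self.sets]
            if line in s:
                s.move_to_end(line)          # refresh LRU position
                return True
            if len(s) >= self.ways:
                s.popitem(last=False)        # evict least recently used
            s[line] = True
            return False

        def resident(self, lines):
            # How many of the given lines are currently in the cache.
            return sum(1 for ln in set(lines)
                       if ln in self.data[ln % self.sets])

    def reload_transient(trace1, trace2, size, ways):
        # Part (b): run Trace 2, then Trace 1; the reload transient for
        # the second run of Trace 2 is its footprint minus the lines of
        # Trace 2 still resident when it starts again.
        cache = SetAssocCache(size, ways)
        for t in (trace2, trace1):
            for line in t:
                cache.touch(line)
        return footprint(trace2) - cache.resident(trace2)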
`2.17 This problem concerns footprints in caches and in virtual memories.
a) The footprint model developed in the text appears for caches. What are the
appropriate values for the cache-footprint model that describe a virtual-memory
system in which two programs, A and B, execute alternately? Define the
parameters you need to make the virtual-memory problem correspond to the
cache-footprint model.
b) Now consider a collection of programs to execute concurrently in a
virtual-memory system. How can the cache-footprint model help you determine
which subset of programs to run together?
`
`
2.18 The purpose of this question is to analyze the relative advantages and disadvantages
of two cache designs. In a computer system that uses both virtual memory and a
high-speed cache, a cache tag can be either a virtual address or a physical address.
If the tag is a virtual address, it is compared to the virtual address produced by a
program before the virtual address is mapped to a real address by a segment and
page transformation. If the tag is a real address, it is compared to the address
of a reference after that virtual address has been mapped to a real address.
`These are the basic cache schemes mentioned in the following parts of this exercise.
`In response to the questions, you are asked to add more capability or other functions
`to one or both caches to gain higher performance.
a) Consider the relative performance of the two basic approaches. Which of the
two, if any, is the faster? Explain your answer.
b) Consider the problem of handling references to main memory by an input/output
processor while maintaining the cache to reflect changes made by the input/
output operations. Also, consider how changes made by the central processor
to data in the cache can be made available to the input/output processor so that
the input/output processor always accesses fresh data when it reads from
memory. Which of the two basic schemes, if any, leads to higher performance? Explain
your answer. If you have found one scheme to be slower than the other, find a
way to bring it as close as possible to the performance of the faster scheme
by augmenting the cache structure, the control of the cache, or its
implementation.
c) A cache flush is a purging process that is performed in some virtual-memory
systems. When a process in a virtual-memory system relinquishes the processor,
and a new process takes its place, the second process may generate the same
virtual addresses as the former process, but the second process refers to totally
different items. We assume that when a process relinquishes the processor, all
of its data held in the cache that might be accessed by another process are purged
from the cache. With regard to the cache flush to purge these data, which of
the two schemes (real or virtual address tags) leads to higher performance, or
are they about equal in performance? Explain your answer. How can the slower
of the two schemes be augmented to improve its performance in handling the
cache flush? In answering these questions, assume that the caches are four-way
set associative.
d) It may be possible to reduce the overhead of cache flushes. Consider how you
might augment both of the cache designs (real-address tags and virtual-address
tags) so that data resident in the cache for a process need not be purged each
time the process relinquishes the processor. Discuss how both cache schemes
can be modified to support this behavior.
2.19 Having studied the design of cache-memory systems, consider the parts from which
caches are made. Each memory chip in a cache is a standard random-access memory
(RAM) chip that contains M bits of information, where M is a power of 2. The
memory chip can be designed to have any one of several different organizations.
It can, for example, have 1 bit at each of M different addresses, or with a slightly
different design, have 2 bits at each of M/2 different addresses, or 4 bits at each of
M/4 addresses, and so forth. A single cache line made up of, for example, 16 bytes
(128 bits) can be built from 128 M x 1 chips or from 64 M/2 x 2 chips, and so forth.
If M x 1 chips are used, the chips create not just one cache line, but a total of M
sets of cache lines.
a) Show a scheme for organizing M x 1 chips to form a memory with 1024 sets,
four-way set associativity, and 16 bytes per cache line. For what value or values
of M do you achieve the minimum number of chips in the memory? For this
scheme how many address and data bits have to be supplied to a memory chip
for each access?
`b) Show a scheme for organizing M/4 x 4 chips to form the memory described in
`the previous part of this exercise. How many address and data bits have to be
`supplied to each chip?
c) Suppose that M = 1024K (= 2^20). What size cache would you design, and how
would you organize the memory chips to achieve this size?
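For a feel of the arithmetic in part (a), a small sketch of mine (it counts
only the data-store chips and ignores tag storage):

    # Cache data store from 2.19(a): 1024 sets, 4-way, 16-byte lines.
    SETS, WAYS, LINE_BITS = 1024, 4, 16 * 8
    total_bits = SETS * WAYS * LINE_BITS      # 2**19 bits of data

    for M in (2**10, 2**12, 2**14, 2**16, 2**20):
        # With M x 1 chips, one chip supplies one bit position of the
        # 128-bit line, so at least 128 chips are required; the cache
        # holds SETS * WAYS = 4096 lines, so each bit position needs
        # 4096 bits spread across its chips.
        chips = max(LINE_BITS, total_bits // M)
        print(f"M = 2**{M.bit_length() - 1}: {chips} chips")

Under these assumptions the chip count bottoms out at 128 once a single
chip can hold all 4096 bits of one bit position, that is, for M >= 4096.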
2.20 We want to explore the behavior of a multilevel memory hierarchy. Let Level 1 be
small and very fast, Level 2 be much larger than Level 1 and somewhat slower, and
Level 3 be a very large and very slow memory. The objective is to retain information
in Level 2 that will be needed in the near future. All transfers occur on demand,
and no prefetching is used.
Assume that both Level 1 and Level 2 are maintained as LRU caches. When a miss
occurs, an item is moved immediately to Level 1 from main memory or from Level
2, wherever it is found. If an item is moved from Level 2 to Level 1, no copy of the
item remains in Level 2. When an item ages out of Level 2, it is discarded, and
future accesses for the item are made to main memory.
a) Assume that Level 1 is organized as N sets and is K-way set associative. Assume
that Level 2 is organized as N sets and is J-way set associative. Consider a situation
in which a Process A runs, then Process B runs, and then A is to be resumed.
Find expressions that show the expected number of lines of A in Level 1, in
Level 2, and in main memory at the time that A is resumed.
b) Repeat the previous part of this exercise under the assumption that Level 2 is
organized as a 2N-set cache, J-way set associative.
c) Repeat the first part of this exercise assuming that Level 2 is an RN-set cache,
J-way set associative, where R is some power of 2.
2.21 This problem examines models for preallocating cache to different functions.
a) Assume that a particular computer architecture produces one instruction
reference followed by one data reference for each instruction executed. The stream
of instruction references has a miss rate M_I(x) in a cache of size x. Similarly, the
stream of data references has a miss-rate function M_D(x) for a cache of size x.
(Ignore the exact structure of the cache, and just assume that the miss rate
depends on the total number of bytes available.) Assume that both of these
functions are known to you. Given C bytes to use for a cache, how should it be
partitioned into a data cache and an instruction cache in a way that minimizes
the overall miss rate? (Assume that the partitioning can be of any size, not
necessarily into pieces that are good for cache structures.)
`
`
b) Now assume that r data references occur on the average for each instruction
reference, for some real number r. What partitioning of cache produces the least
miss ratio in this case?
c) Now assume that the program in the first part is running without having
instructions and data prepartitioned. Assume that the cache allocation between
data and instructions varies randomly as misses occur and cache replacements
change the allocation between data and instructions. Show that when the cache
reaches a state in which the number of bytes holding data is equal to the optimum
number of bytes to assign to data in a partition, then the cache is in equilibrium.
That is, the rate at which bytes change from holding instructions to holding data
is equal to the rate at which bytes change from holding data to holding
instructions.
d) Assume that the unpartitioned cache in the previous part is in equilibrium.
Assume that the program changes so that its M_I and M_D functions change as well.
Construct a mathematical model whose solution describes the number of data
lines and instruction lines in cache as a function of time. (This model could be
very complex, so simply construct the model, but make no attempt to solve it.)
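For part (a), one numerical way to search for the best split (a sketch of
mine; the power-law miss-rate functions below are made-up placeholders,
not data from the traces):

    def best_partition(M_I, M_D, C):
        # Try every byte-level split of C between the instruction cache
        # (x bytes) and the data cache (C - x bytes); with one instruction
        # and one data reference per instruction, total misses are
        # proportional to M_I(x) + M_D(C - x).
        best = min(range(1, C), key=lambda x: M_I(x) + M_D(C - x))
        return best, C - best

    # Example with hypothetical power-law miss-rate functions:
    M_I = lambda x: 0.5 * x ** -0.7
    M_D = lambda x: 0.8 * x ** -0.5
    instr_bytes, data_bytes = best_partition(M_I, M_D, 16 * 1024)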
2.22 For this problem assume that two different programs are sharing a computer. Each
program takes over the processor, runs for a fixed quantum of time Q, and then
relinquishes control while the other program takes over. The programs have identical
cache footprints. Their streams of address references produce miss rates of
M_1(x) and M_2(x), respectively, in caches of size x. As above, assume that you
know both of these functions. Compare two different ways of using C bytes of
cache storage for this situation.
The first method is to use cache in a conventional way so that each program runs,
uses all cache available as best it can, and then yields the processor to the other
program. The second method preallocates a fixed amount of cache storage to each
program. The cache storage allocated to a program is private to that program, and
is inaccessible to the other program. All storage is allocated to one program or the
other.
To answer the following questions, develop the mathematical models required for
the comparisons assuming that the miss-rate functions and footprint functions have
been given to you.
a) For the scheme that uses a fixed cache allocation, what fixed allocation of cache
produces the lowest composite miss rate? (Hint: Express the composite miss rate
as a function of the miss-rate functions and the cache allocations. Then find the
minimum of that function.)
b) The characteristics of working sets allow some simplification to the answer of a
in some cases. Consider three cases: Neither footprint fits fully in cache, both
footprints fit fully in cache, and one footprint fits in cache together with a fraction
α of the second footprint. In which cases can you give simple allocations and
what are these allocations?
c) Now compare the fixed-allocation strategy to conventional allocation as described
in the opening discussion. Which of the two schemes produces the lower miss
rate in each of the three cases of b? Indicate for which of the three cases one
scheme is clearly better than another and in which cases you cannot tell.
`
`
`2.23 The purpose of this question is to examine slightly less expensive implementations
`of replacement policies than conventional LRU replacement for set-associative caches.
`For 4-way set-associative caches, at least 5 bits per set are required to keep track of
`the order of the cache entries in each set from most recently used entry to least
`recently used entry.
`a) Construct some means for approximating LRU replacement that uses only 4 bits
`per set for LRU purposes. Describe your replacement algorithm and describe
`how you use the 4 bits to keep track of which item to replace.
`b) Simulate the 4-way set associative caches of Exercise 2.5 on Trace 1 and Trace
`2, and compare the effectiveness of your replacement algorithm with true LRU
`replacement. Your algorithm should do about as well as LRU, possibly marginally
`better or marginally worse, provided that your algorithm never replaces the most
`recently used item when it does not replace the least recently used item.
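One classic approximation that fits the budget is tree-based pseudo-LRU,
which needs only 3 of the 4 allowed bits per 4-way set. The sketch below
is my illustration of that idea, not the book's required answer:

    class TreePLRU4:
        # Tree-based pseudo-LRU for one 4-way set, using only 3 bits.
        # bits[0] selects the half holding the next victim (0 = ways
        # 0/1, 1 = ways 2/3); bits[1] and bits[2] select the victim
        # within the left and right pairs.  An access flips bits so
        # that they point away from the way just used.
        def __init__(self):
            self.bits = [0, 0, 0]

        def access(self, way):
            if way < 2:
                self.bits[0] = 1          # next victim in right half
                self.bits[1] = 1 - way    # ... and away from this way
            else:
                self.bits[0] = 0          # next victim in left half
                self.bits[2] = 3 - way    # way 2 -> 1, way 3 -> 0

        def victim(self):
            if self.bits[0] == 0:
                return 0 if self.bits[1] == 0 else 1
            return 2 if self.bits[2] == 0 else 3

Note that tree pseudo-LRU never selects the most recently used way as
the victim, which satisfies the proviso stated in part (b).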
2.24 The purpose of this problem is to explore cache-memory allocations based on
miss-rate derivatives analogous to the main-memory allocation strategy based on
fault-rate derivatives.

Assume that you must design separate caches to hold instructions and data. The
total cache size you have available is 16K bytes. Assume that data fetches and
instruction fetches occur with equal frequency on your computer system. Assume
that the misses as a function of cache size and structure for data fetches is the
function that you measure for Trace 1, and that the misses for instruction fetches
is the function that you measure for Trace 2.
`
a) Use your trace data to fit a smooth continuous curve through your data points
for caches up to size 16K. You may find that straight lines give reasonably good
fits to the log/log plots of these functions. For both instruction and data caches,
assume caches are 4-way set-associative. Find where to divide 16K bytes between
the two caches so that the number of misses is minimized. The division point
can be any integer, and need not be a power of 2 for your answer. The line sizes
are 8 bytes for your data.
b) Working analytically, take the derivative of the number of misses as a function
of cache size for each of the two analytic functions you created. Calculate the
value of the derivatives at the division point you have chosen. Are the derivatives
nearly equal?
c) Prove that the division point that gives the fewest misses is the division point
at which the miss-rate derivatives are equal. That is, if you take away a little
cache from the instructions and give it to the data accesses, then the extra misses
created for instructions will be approximately equal to the extra misses saved for
data fetches. (Hint: find an expression for miss rate as a function of cache
allocations, take the derivative of this expression, and find which allocations
produce a value of zero for the derivative.)
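The hint in part (c) amounts to a short calculus argument. As a sketch
(my notation, reusing M_I and M_D from Exercise 2.21: x bytes go to data,
C - x to instructions, with equal fetch frequencies):

    f(x)  = M_D(x) + M_I(C - x)          total misses, up to a constant
    f'(x) = M_D'(x) - M_I'(C - x) = 0    condition for an interior minimum

    so M_D'(x*) = M_I'(C - x*): the two miss-rate derivatives are equal
    at the best division point.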
2.25 The purpose of this problem is to practice calculations of cycles-per-instruction
measures from performance data.

Assume that you are investigating two different cache memory designs, one of
which uses separate caches for data and instructions, and one of which uses a
combined cache to hold both data and instructions. The cycle time is the same for
both designs, so that the CPI for each design determines performance. The size and
cost of the combined cache is equal to the sum of the sizes and costs of the data
and instruction caches, so that the system cost-performance rating is determined
totally by CPI. Assume that the miss rate per reference in the individual caches is
3 percent. In the combined cache the miss rate is also 3 percent when the instruction
and data references have equal frequency. For the following requested responses,
assume the miss rate for the more frequent reference stream drops to 2 percent and
the miss rate for the less frequent stream increases to 3.5 percent when the
frequencies of access are unequal. Assume that the penalty for a miss is 10 cycles,
and that the processor waits for a miss to be completed before continuing execution.
(Ignore the fact that such a wait is not necessary for WRITE misses.)
a) Compute the CPI contributed by the finite-cache effect for both cache designs
assuming that instructions and data are issued with equal frequency and that
there are 2.0 references per instruction on the average.
b) Repeat part a under the assumption that there are 1.5 references per instruction,
of which 1.0 references are instruction references and 0.5 references are data
fetches. Note that the CPI is the same for 1.0 data references per instruction and
0.5 instruction references per instruction.
c) Repeat part a for a design with 3.0 references per instruction, of which 2.0 are
data references and 1.0 are instruction references.
d) Compare your results and comment on your findings. Is the potential miss-rate
reduction for the combined cache sufficient to offset the increased bandwidth
of two caches?
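The calculations reduce to references/instruction x miss rate x miss
penalty; a small sketch of mine that covers parts a through c:

    PENALTY = 10  # cycles per miss

    def cpi_split(instr_refs, data_refs, miss=0.03):
        # Separate caches: both streams see the 3 percent miss rate.
        return (instr_refs + data_refs) * miss * PENALTY

    def cpi_combined(instr_refs, data_refs):
        # Combined cache: 3 percent at equal frequency; otherwise
        # 2 percent for the more frequent stream, 3.5 for the other.
        if instr_refs == data_refs:
            return (instr_refs + data_refs) * 0.03 * PENALTY
        hi, lo = max(instr_refs, data_refs), min(instr_refs, data_refs)
        return (hi * 0.02 + lo * 0.035) * PENALTY

    for i, d in [(1.0, 1.0), (1.0, 0.5), (1.0, 2.0)]:   # parts a, b, c
        print(i, d, cpi_split(i, d), cpi_combined(i, d))

Under these assumptions the finite-cache CPI is 0.6 versus 0.6 in part a,
0.45 versus 0.375 in part b, and 0.9 versus 0.75 in part c.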
`2.26 Repeat Exercise 2.25 for a system in which the cache-miss penalty is 100 cycles
`instead of 10 cycles.
2.27 The purpose of this exercise is to examine the memory traffic generated by different
write policies in cache.
Calculate the main-memory traffic expressed in cycles per instruction for a
write-through cache and a write-back cache assuming the following:
a) 20 percent of instructions are WRITEs.
b) Each WRITE modifies an average of 7.5 bytes. The operand can cross a
cache-line boundary, and if so, two cache lines are modified.
c) The bus width is 4 bytes.
d) For a write-back cache, 80 percent of the changes recorded in cache are not
written back because they are superseded by subsequent WRITEs to cache.
e) You should find the relative traffic generated for line sizes of 4, 8, 16, 32, 64,
and 128 bytes.
`
Comment on what combinations of parameters favor the write-through policy
and what combinations favor the write-back policy.
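As one possible starting model (the bus accounting and the
random-alignment line-crossing estimate are my assumptions, not the
book's):

    import math

    BUS = 4            # bytes per bus cycle
    WRITES = 0.2       # WRITEs per instruction
    OP_BYTES = 7.5     # average bytes modified per WRITE
    KEPT = 0.2         # fraction of dirty lines eventually written back

    for line in (4, 8, 16, 32, 64, 128):
        # Write-through: assume only the modified bytes travel on the bus.
        wt = WRITES * math.ceil(OP_BYTES / BUS)
        # Chance a 7.5-byte operand crosses a line boundary, assuming
        # random alignment: roughly (OP_BYTES - 1) / line, capped at 1.
        cross = min(1.0, (OP_BYTES - 1) / line)
        # Write-back: each surviving dirty line costs line/BUS bus cycles,
        # and a crossing WRITE dirties two lines.
        wb = WRITES * KEPT * (1 + cross) * (line / BUS)
        print(f"line={line:3d}: write-through={wt:.2f}, "
              f"write-back={wb:.2f} cycles/instr")

In this model write-through traffic is flat while write-back traffic
grows with line size, which previews the comparison the exercise asks
you to make.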
`2.28 The purpose of this exercise is to examine the trailing-edge effect.
Assume that the leading-edge penalty for a cache miss is 10 cycles, and each portion
of a cache line to be returned takes 1 additional cycle. A processor can restart
immediately after receiving the first segment of a line. The processor generates two
requests per cycle to different areas of memory. One request issued each cycle has
a probability of 10 percent of being in the same cache line as the last miss when
the line size is twice the bus width, and the probability increases by 2.5 percent to
12.5 percent, 15 percent, etc., for each doubling of line size above two bus widths.
The probability that a second request to the same line as the last miss is a hit in the
cache is proportional to the fraction of the cache line that has reached the
processor at the time the request is issued. For example, if a cache line takes two
memory cycles to reload, when the processor issues a request after the first half of
the line reaches the processor, there is a 50 percent probability that a second request
to the same line is a hit. Since there is a 10 percent probability of the request being
made to the same line, the net probability of a trailing-edge miss is 5 percent.
What is the CPI contribution from the trailing-edge effect under these assumptions
for line sizes equal to 2, 4, 8, and 16 times the bus width?
`
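One possible way to organize the bookkeeping (a sketch of mine; the stall
cost of a trailing-edge miss and the uniform spread of same-line requests
are my simplifying assumptions, not the book's):

    # Line size in bus widths -> probability a request goes to the same
    # line as the last miss (10%, 12.5%, 15%, 17.5% as stated above).
    SAME_LINE_P = {2: 0.10, 4: 0.125, 8: 0.15, 16: 0.175}

    for k, p in SAME_LINE_P.items():
        # The line streams back over k cycles after the leading edge.
        # If same-line requests fall uniformly in that window, the
        # average fraction of the line still missing is about 1/2, so
        # the net trailing-edge miss probability is roughly p / 2
        # (generalizing the 50% x 10% = 5% example above for k = 2).
        miss_p = p / 2
        # Assumed cost of such a miss: on average half the remaining
        # transfer time, about k/4 cycles (my simplification).
        stall = k / 4
        print(f"line = {k:2d} bus widths: "
              f"~{miss_p:.3f} trailing misses/request, "
              f"~{miss_p * stall:.3f} stall cycles/request")

Converting stall cycles per request into a CPI contribution requires the
instruction issue rate, which the exercise leaves for you to supply.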
`
`3
`
`Comparisons do ofttime great
`grievance.
`-John Lydgate, c. 1440
`
`Pipeline Design Techniques
`
`3.1 Principles of Pipeline Design
`3.2 Memory Structures in Pipeline Computers
`3.3 Performance of Pipelined Computers
`3.4 Control of Pipeline Stages
`3.5 Exploiting Pipeline Techniques
`3.6 Historical References
`
We learn in Chapter 2 that memory is a major bottleneck in high-speed
computers, and that the bottleneck can be relieved somewhat by taking advantage
of the characteristics of typical programs. The objective has been to store the
most-frequently referenced items in fast memory and less-frequently referenced
items in slower memory. It is not necessary to make all memory equally fast;
we need use only as much fast memory as necessary to hold the active regions
of a program and data.
This chapter concerns a different approach to relieving bottlenecks. The idea
is to use parallelism at the point of the bottleneck to improve performance
globally. If the design techniques are successful, then the extra hardware devoted
to performance enhancement is present in only a small portion of a computer
system, yet its effect is to increase performance as if the full computer system
were replicated.
To contrast the approaches of the last chapter and the present one, in one
case the speed differential is due to faster hardware whereas in the second case
`
the speed increase is obtained by replicating slower hardware. In both cases,
clever architecture is required to create efficient computer systems in which
enhancements of a relatively small cost have a global impact on performance.
`techniques described in this chapter are by far the most
`Pipeline computer
`popular mean~ for ~nh~ndng p~rformance through parallel hardware. The basic
`ideas from whJCh pipeline techniques developed are apparent in von Neumann's
`In Burks et al. [1946], a
`proposal to b~ild the first stor~d-program. computer.
`discussion on input/output
`techruques descnbes a buffer arrangement that would
`to be ove_rla~ped with i~putloutput operations and there(cid:173)
`permit computation
`by provide a crude form of p1pelme processing
`that is used widely in today's
`machines.
Although von Neumann did not build the input/output capability into his
first machine, the basic ideas for pipelined computer design evolved rapidly
after the first appearance of magnetic-core memory as the primary storage
medium for main memory. This storage was roughly a factor of 10 or more slower
per cycle than the transistor technology used in high-speed registers and control
logic. Designers quickly conceived of a variety of techniques to initiate one or
more concurrent accesses to memory while executing instructions in the central
processor. This body of techniques eventually evolved and is exemplified in the
pipeline-processing structures described in this chapter.
In the 1960s, when hardware costs were relatively high, pipelined computers
were the supercomputers. IBM's STRETCH and CDC's 6600 were two such
designs from the early 1960s that made extensive use of pipelining, and these
designs strongly influenced the structure of supercomputers that followed. By
the 1980s, hardware costs had diminished to the extent that pipeline techniques
could be implemented across the entire range of performance, and indeed, even
the Intel 8086 microprocessor that costs just a few dollars uses pipelined accesses
to memory while performing on-chip computation.
`of the principles of pipe(cid:173)
`Our approach
`is to develop a basic understanding
`line design in the next section. In subsequent
`sections we observe where it can
`be used and how to design effective pipelines.
`
3.1 Principles of Pipeline Design
The basic idea behind pipeline design is quite natural; it is not specific to
computer technology. In fact, the name pipeline stems from the analogy with
petroleum pipelines in which a sequence of hydrocarbon products is pumped through
a pipeline. The last product might well be entering the pipeline before the first
product has been removed from the terminus. In the remainder of this section
we treat pipeline design first in abstract terms, and then follow with concrete
examples.
`
`
The key contribution of pipelining is that it provides a way to start a new
task before an old one has been completed. Hence the completion rate is not a
function of the total processing time, but rather of how soon a new process can
be introduced.
Consider Fig. 3.1, which shows a sequential process being done step-by-step
over a period of time. The total time required to complete this process is
N units, assuming that each step takes one time unit. In the figure, each box
denotes the execution of a single step, and the label in the box gives the number
of the step being executed.
To perform this same process using pipeline techniques, consider Fig. 3.2,
which shows a continuous stream of jobs going through the N sequential steps
of the process. In this case each horizontal row of boxes represents the time
history of one job. Each vertical column of boxes represents the activity at a
specific time. Note that up to N different jobs may be active at any time in this
example, assuming that we have N independent stations to perform the sequence
of steps in the process.
The pipeline timing of Fig. 3.2 is characteristic of assembly lines and
maintenance depots as well as oil pipelines. The total time to perform one process
does not change between Fig. 3.1 and Fig. 3.2, and it may actually be longer in
Fig. 3.2 if the pipeline structure forces some processing overhead in moving
from station to station. But the completion rate of tasks in Fig. 3.2 is one per
cycle as opposed to one task every N cycles in Fig. 3.1.
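To make the completion-rate argument concrete, here is a small worked
equation (my own restatement of the figures' point). For J jobs, each
requiring N one-unit steps:

    T_sequential = J * N           one job finishes every N units
    T_pipelined  = N + (J - 1)     N units to fill the pipe, then one
                                   completion every unit thereafter

For large J, the pipelined arrangement approaches one completion per
time unit, roughly an N-fold improvement in throughput.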
Figure 3.3(a) shows a box that represents a server that can perform any of
the N processing steps in a single unit of time. If the job stream is processed by
this one server, then the rate of completion is one job every N steps, and the
time behavior of the job stream is as described in Fig. 3.1.
Compare Fig. 3.3(a) with Fig. 3.3(b), which shows N servers concatenated
in a sequence. A task flows through the collection of servers by visiting Server
1, then Server 2, and so on, and finally emerging from Server N after N steps.
The time behavior of this system is described by Fig. 3.2. Figure 3.3(b) is an
ideal model of a constant-speed assembly line, such as an automobile assembly
plant.
Now let's relate the general ideas presented in Figs. 3.1-3.3 to computer
design. Where can we find an N-step task that can conveniently be broken up,
as shown in Fig. 3.2? Consider the steps required to execute a single instruction.
`
`---~1 2
`Fig. 3.1 An N-step sequential process.
`
`
Fig. 3.2 Pipelined execution of an N-step process. (Time runs left to
right; each row shows one job's passage through steps 1, 2, 3, ..., N,
staggered one step behind the job above it.)
`
This sequence has traditionally been implemented using pipeline design. A
typical instruction-execution sequence might be:

1. Instruction fetch: obtain a copy of the instruction from memory.
2. Instruction decode: examine the instruction and prepare to initialize the control
signals required to execute the instruction in subsequent steps.
3. Address generation: compute the effective address of the operands by
performing indexing or indirection as specified by the instruction.
4. Operand fetch: for READ operations, obtain the operand from central memory.
`
Fig. 3.3 Two ways to execute N-unit jobs in a stream: (a) sequential
execution with a single N-unit server, giving one completion every N
units; and (b) pipelined execution with N 1-unit servers, giving one
completion every unit.
`
`
`
`
`146
`
`Plpellnc Desfgn Techniques
`
`Chapter 3
`
5. Instruction execution: execute the instruction in the processor.
6. Operand store: for WRITE operations, return the resulting data to central
memory.
7. Update program counter: generate the address of the next instruction.
If we simply map these steps onto the model of Fig. 3.3(b), we obtain a block
diagram of a pipeline computer as shown in Fig. 3.4. For reasons discussed later
this structure might have to be tuned somewhat to obtain a good balance between
cost and performance. More important, the structure has to be designed to work
correctly and efficiently in the face of difficulties caused by interactions between
events at different stages in the pipeline.
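A toy simulation of this seven-stage organization (a sketch of mine that
ignores all hazards) exhibits the fill-then-one-per-cycle behavior of
Fig. 3.2:

    STAGES = ["fetch", "decode", "addr-gen", "operand-fetch",
              "execute", "operand-store", "update-PC"]

    def run(num_instructions):
        # One instruction may occupy each stage; no hazards modeled.
        pipe = [None] * len(STAGES)
        cycle, issued, completed = 0, 0, 0
        while completed < num_instructions:
            cycle += 1
            new = issued if issued < num_instructions else None
            pipe = [new] + pipe[:-1]     # everything advances one stage
            issued += 1
            if pipe[-1] is not None:
                completed += 1           # finished the last stage
        return cycle

    # The pipe fills once, then completes one instruction per cycle:
    assert run(100) == len(STAGES) + 100 - 1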
Fig. 3.4 Pipelined execution of a single instruction. (Stages, in order:
Instruction Fetch, Instruction Decode, Address Generate, Operand Fetch,
Execute, Operand Store, Update Program Counter.)

In the normal mode of operation, the first stage of the pipeline in Fig. 3.4
continuously fetches instructions, even though the address of a fetch has not
been produced by the last stage of the pipeline, the stage that updates the
program counter. Thus, the first stage is operating with a new program-counter
value several cycles before the counter value is produced in the last stage. Since
most changes to the program counter are increments, the first stage estimates
the future value of the program counter quite easily by successive increments
of the program counter. The interaction between the first and last stages occurs
when a branch alters the program counter nonsequentially.
The address of the instruction that follows a conditional branch might not
be known until the branch instruction reaches the last stage of the pipeline, but
`
`
Fig. 3.4 suggests that subsequent instructions have followed the conditional
branch down the pipeline and may have altered the state of one or more machine
registers even before the outcome of the conditional branch is known. The danger
is that we cannot be sure what to execute until the condition has been evaluated.
One way to assure correct execution of conditional branches is to interlock
the instruction-fetch stage with the program-counter update stage. Immediately
after a conditional branch is fetched by the first stage, no further fetches take
place until the branch reaches the last stage. At this point the last stage produces
the correct target address and removes the interlock, allowing the first stage to
continue instruction fetches at the new address.
The problem with this solution is the lost performance caused by the inactive
pipeline during the period a conditional branch is pending. Conditional branches
occur quite frequently in most conventional computers, possibly once every five
to ten cycles on the average. If a pipeline were idled when each such branch is
encountered, then every five to ten instructions a few cycles would potentially
be lost waiting for branches. This is a fairly hefty penalty for the inefficiency of
a pipeline implementation.
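As a rough illustration of the size of that penalty (the stall count is my
own assumption, based on the seven-stage pipeline of Fig. 3.4, where a
branch fetched in the first stage resolves in the seventh and so idles
fetching for about 6 cycles):

    CPI penalty = stall cycles per branch / instructions per branch
                = 6 / 5  = 1.2 extra cycles per instruction
                           (a branch every five instructions)
                = 6 / 10 = 0.6 extra cycles per instruction
                           (a branch every ten instructions)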
There are other interactions of concern as well, all of which tend to force
the designer to add complexity to ensure correct pipeline operation. These
interactions could severely impede performance