Exercises
that would pass back and forth between the processor and disk system that enables the disk system to implement the managing strategy that you outline. In your explanation describe each command by listing its name and operands, followed by a description of what the command does.

Example: READ RECORD <track number> <sector number>. This command tells the disk system to obtain the specified record and transmit the record to the processor.

b) Assume that the processor that uses the disk system as a paging system for virtual memory must read pages during page faults and must write back pages that have been altered. What should the disk system do to manage this type of access? To implement this management policy, what should be the commands and replies between processor and disk system? (One illustrative command set is sketched after part c.)
c) Assume that the processor is using the disk system to store the database for a banking center that serves customers and tellers through on-line queries. How might this database be managed by the disk cache? What should the command and reply interface be for this type of access to implement your suggested management algorithm?
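As a starting point for part (b), the following minimal sketch (in Python) shows one possible command and reply vocabulary between the processor and the disk system for paging traffic. The command names, operand fields, and the prefetch/purge hints are illustrative assumptions only; they are not the required answer, and the disk_system object referenced at the end is hypothetical.

    from dataclasses import dataclass
    from enum import Enum, auto

    class Command(Enum):
        # Commands the processor might send to the disk system (hypothetical names).
        READ_PAGE = auto()       # fetch a page during a page fault
        WRITE_PAGE = auto()      # write back an altered (dirty) page
        PREFETCH_PAGE = auto()   # hint: stage a page into the disk cache
        PURGE_PAGE = auto()      # hint: the page was freed; its cached copy may be dropped

    @dataclass
    class Request:
        command: Command
        page_number: int         # operand: which page
        data: bytes = b""        # payload for WRITE_PAGE

    @dataclass
    class Reply:
        ok: bool
        data: bytes = b""        # payload returned for READ_PAGE
        cached: bool = False     # whether the request was satisfied from the disk cache

    # Example exchange during a page fault:
    fault = Request(Command.READ_PAGE, page_number=42)
    # reply = disk_system.submit(fault)   # disk_system is assumed, not defined here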
2.16 The object of this exercise is to confirm the footprint model.
a) Measure the footprints for Trace 1 and Trace 2. The footprint is the number of distinct lines touched.
b) Now measure the cache-reload transient under two different conditions. Run Trace 2, then Trace 1, then Trace 2. The reload transient for the second running of Trace 2 is equal to the size of its footprint minus the number of lines of Trace 2 that are resident in the cache when Trace 2 begins to run the second time. Measure the reload transient for one-way and two-way set associative caches of size 32, 64, 128, 256, 512, and 1024. (Twelve different caches.) (A measurement sketch follows part e.)
c) Calculate the reload transient by using the size of the footprints and the cache structures in the mathematical model for the reload transient as described in the textbook.
d) Compare your answers.
e) Repeat the three parts of this question assuming that you run Trace 1, then Trace 2, then Trace 1. Measure and calculate the reload transient for the second running of Trace 1.
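For parts (a) and (b), a minimal measurement sketch, assuming each trace is available as a list of line addresses, that lines map to sets by simple modulo, that the listed cache sizes are counted in lines, and that the caches are LRU set associative; the function names and the toy traces at the end are placeholders for your own Trace 1 and Trace 2 data.

    from collections import OrderedDict

    def footprint(trace):
        # Footprint: number of distinct cache lines touched by the trace.
        return len(set(trace))

    def run(trace, cache, sets, ways):
        # Simulate an LRU set-associative cache; cache maps set index -> OrderedDict of lines.
        for line in trace:
            s = line % sets
            if line in cache[s]:
                cache[s].move_to_end(line)        # refresh LRU position on a hit
            else:
                if len(cache[s]) >= ways:
                    cache[s].popitem(last=False)  # evict the least recently used line
                cache[s][line] = True

    def reload_transient(trace1, trace2, sets, ways):
        cache = {s: OrderedDict() for s in range(sets)}
        run(trace2, cache, sets, ways)            # first running of Trace 2
        run(trace1, cache, sets, ways)            # Trace 1 disturbs the cache
        t2_lines = set(trace2)
        resident = sum(1 for s in cache for line in cache[s] if line in t2_lines)
        return footprint(trace2) - resident       # lines Trace 2 must refetch on its second run

    # Placeholder traces; substitute the real Trace 1 and Trace 2 line streams.
    t1, t2 = [1, 2, 3, 4, 5, 6], [10, 11, 12, 13]
    for size in (32, 64, 128, 256, 512, 1024):    # cache size in lines (assumed)
        for ways in (1, 2):
            print(size, ways, reload_transient(t1, t2, sets=size // ways, ways=ways))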
2.17 This problem concerns footprints in caches and in virtual memories.
a) The footprint model in the text is developed for caches. What are the appropriate values for the cache-footprint model that describe a virtual-memory system in which two programs, A and B, execute alternately? Define the parameters you need to make the virtual-memory problem correspond to the cache-footprint model.
b) Now consider a collection of programs to execute concurrently in a virtual-memory system. How can the cache-footprint model help you determine which subset of programs to run together?
2.18 The purpose of this question is to analyze the relative advantages and disadvantages of two cache designs. In a computer system that uses both virtual memory and a high-speed cache, a cache tag can be either a virtual address or a physical address. If the tag is a virtual address, it is compared to the virtual address produced by a program before the virtual address is mapped to a real address by a segment and page transformation. If the tag is a real address, it is compared to the address of a reference after that virtual address has been mapped to a real address.
These are the basic cache schemes mentioned in the following parts of this exercise. In response to the questions, you are asked to add more capability or other functions to one or both caches to gain higher performance.
a) Consider the relative performance of the two basic approaches. Which of the two, if any, is the faster? Explain your answer.
b) Consider the problem of handling references to main memory by an input/output processor while maintaining the cache to reflect changes made by the input/output operations. Also, consider how changes made by the central processor to data in the cache can be made available to the input/output processor so that the input/output processor always accesses fresh data when it reads from memory. Which of the two basic schemes, if any, leads to higher performance? Explain your answer. If you have found one scheme to be slower than the other, find a way to bring the performance of the slower scheme as close as possible to the performance of the faster scheme by augmenting the cache structure, the control of the cache, or its implementation.
c) A cache flush is a purging process that is performed in some virtual-memory systems. When a process in a virtual-memory system relinquishes the processor, and a new process takes its place, the second process may generate the same virtual addresses as the former process, but the second process refers to totally different items. We assume that when a process relinquishes the processor, all of its data held in the cache that might be accessed by another process are purged from the cache. With regard to the cache flush to purge these data, which of the two schemes (real or virtual address tags) leads to higher performance, or are they about equal in performance? Explain your answer. How can the slower of the two schemes be augmented to improve its performance in handling the cache flush? In answering these questions, assume that the caches are four-way set associative.
d) It may be possible to reduce the overhead of cache flushes. Consider how you might augment both of the cache designs (real-address tags and virtual-address tags) so that data resident in the cache for a process need not be purged each time the process relinquishes the processor. Discuss how both cache schemes can be modified to support this behavior.
2.19 Having studied the design of cache-memory systems, consider the parts from which caches are made. Each memory chip in a cache is a standard random-access memory (RAM) chip that contains M bits of information, where M is a power of 2. The memory chip can be designed to have any one of several different organizations. It can, for example, have 1 bit at each of M different addresses, or with a slightly different design, have 2 bits at each of M/2 different addresses, or 4 bits at each of
M/4 addresses, and so forth. A single cache line made up of, for example, 16 bytes (128 bits) can be built from 128 M x 1 chips or from 64 M/2 x 2 chips, and so forth. If M x 1 chips are used, the chips create not just one cache line, but a total of M sets of cache lines.
a) Show a scheme for organizing M x 1 chips to form a memory with 1024 sets, four-way set associativity, and 16 bytes per cache line. For what value or values of M do you achieve the minimum number of chips in the memory? For this scheme how many address and data bits have to be supplied to a memory chip for each access? (The chip-count arithmetic is sketched after part c.)
b) Show a scheme for organizing M/4 x 4 chips to form the memory described in the previous part of this exercise. How many address and data bits have to be supplied to each chip?
c) Suppose that M = 1024K (= 2^20). What size cache would you design, and how would you organize the memory chips to achieve this size?
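For parts (a) and (b), one possible organization is to make the data array one full set wide (4 lines x 128 bits = 512 bits) and 1024 entries deep, so that all four ways of a set are read in parallel and the tag match selects among them. The sketch below counts only data-array chips under that assumed organization (tag storage is ignored), so it illustrates the arithmetic rather than the unique answer.

    import math

    SETS, WAYS, LINE_BYTES = 1024, 4, 16
    WIDTH_BITS = WAYS * LINE_BYTES * 8        # bits delivered per access: 512
    DEPTH = SETS                              # one entry per set: 1024

    def chips_needed(M, bits_per_access):
        # A chip organized as (M / b) x b supplies b bits at M / b addresses.
        chip_depth = M // bits_per_access
        banks = math.ceil(DEPTH / chip_depth)                # stacked to reach the needed depth
        per_bank = math.ceil(WIDTH_BITS / bits_per_access)   # side by side to reach the width
        return banks * per_bank

    for M in (1024, 4096, 16384, 65536):
        print(f"M = {M:6d}   M x 1 chips: {chips_needed(M, 1):4d}   "
              f"(M/4) x 4 chips: {chips_needed(M, 4):4d}")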
2.20 We want to explore the behavior of a multilevel memory hierarchy. Let Level 1 be small and very fast, Level 2 be much larger than Level 1 and somewhat slower, and Level 3 be a very large and very slow memory. The objective is to retain information in Level 2 that will be needed in the near future. All transfers occur on demand, and no prefetching is used.
Assume that both Level 1 and Level 2 are maintained as LRU caches. When a miss occurs, an item is moved immediately to Level 1 from main memory or from Level 2, wherever it is found. If an item is moved from Level 2 to Level 1, no copy of the item remains in Level 2. When an item ages out of Level 2, it is discarded, and future accesses for the item are made to main memory.
a) Assume that Level 1 is organized as N sets and is K-way set associative. Assume that Level 2 is organized as N sets and is J-way set associative. Consider a situation in which a process A runs, then process B runs, and then A is to be resumed. Find expressions that show the expected number of lines of A in Level 1, in Level 2, and in main memory at the time that A is resumed.
b) Repeat the previous part of this exercise under the assumption that Level 2 is organized as a 2N-set cache, J-way set associative.
c) Repeat the first part of this exercise assuming that Level 2 is an RN-set cache, J-way set associative, where R is some power of 2.
2.21 This problem examines models for preallocating cache to different functions.
a) Assume that a particular computer architecture produces one instruction reference followed by one data reference for each instruction executed. The stream of instruction references has a miss rate M_I(x) in a cache of size x. Similarly, the stream of data references has a miss-rate function M_D(x) for a cache of size x. (Ignore the exact structure of the cache, and just assume that the miss rate depends on the total number of bytes available.) Assume that both of these functions are known to you. Given C bytes to use for a cache, how should it be partitioned into a data cache and an instruction cache in a way that minimizes the overall miss rate? (Assume that the partitioning can be of any size, not necessarily into pieces that are good for cache structures. A brute-force search is sketched after part d.)
b) Now assume that r data references occur on the average for each instruction reference, for some real number r. What partitioning of cache produces the least miss ratio in this case?
c) Now assume that the program in the first part is running without having instructions and data prepartitioned. Assume that the cache allocation between data and instructions varies randomly as misses occur and cache replacements change the allocation between data and instructions. Show that when the cache reaches a state in which the number of bytes holding data is equal to the optimum number of bytes to assign to data in a partition, then the cache is in equilibrium. That is, the rate at which bytes change from holding instructions to holding data is equal to the rate at which bytes change from holding data to holding instructions.
d) Assume that the unpartitioned cache in the previous part is in equilibrium. Assume that the program changes so that its M_I and M_D functions change as well. Construct a mathematical model whose solution describes the number of data lines and instruction lines in cache as a function of time. (This model could be very complex, so simply construct the model, but make no attempt to solve it.)
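A numerical sketch for part (a): with assumed power-law forms for M_I(x) and M_D(x) (illustrative stand-ins, not derived from any trace), sweep the split of C bytes and keep the split with the lowest combined miss rate. Because instruction and data references are equally frequent here, the combined rate is just the average of the two stream rates.

    C = 16 * 1024                          # total cache bytes to divide (example value)

    def M_I(x):                            # assumed instruction miss-rate function
        return 0.5 * (x + 1) ** -0.6

    def M_D(x):                            # assumed data miss-rate function
        return 0.8 * (x + 1) ** -0.5

    # One instruction reference and one data reference per instruction, so the
    # overall miss rate is the mean of the two per-stream miss rates.
    best = min(range(0, C + 1, 64),
               key=lambda d: 0.5 * M_I(C - d) + 0.5 * M_D(d))
    print("bytes for the data cache:", best, " bytes for instructions:", C - best)

For part (b) the same search applies with weights 1/(1 + r) and r/(1 + r) in place of the two 0.5 factors.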
2.22 For this problem assume that two different programs are sharing a computer. Each program takes over the processor, runs for a fixed quantum of time Q, and then relinquishes control while the other program takes over. The programs have identical cache footprints. Their streams of address references produce miss rates M_1(x) and M_2(x), respectively, in caches of size x. As above, assume that you know both of these functions. Compare two different ways of using C bytes of cache storage for this situation.
The first method is to use cache in a conventional way so that each program runs, uses all cache available as best it can, and then yields the processor to the other program. The second method preallocates a fixed amount of cache storage to each program. The cache storage allocated to a program is private to that program, and is inaccessible to the other program. All storage is allocated to one program or the other.
To answer the following questions, develop the mathematical models required for the comparisons assuming that the miss-rate functions and footprint functions have been given to you.
a) For the scheme that uses a fixed cache allocation, what fixed allocation of cache produces the lowest composite miss rate? (Hint: Express the composite miss rate as a function of the miss-rate functions and the cache allocations. Then find the minimum of that function.)
b) The characteristics of working sets allow some simplification of the answer to a) in some cases. Consider three cases: Neither footprint fits fully in cache, both footprints fit fully in cache, and one footprint fits in cache together with a fraction α of the second footprint. In which cases can you give simple allocations, and what are these allocations?
c) Now compare the fixed-allocation strategy to conventional allocation as described in the opening discussion. Which of the two schemes produces the lower miss rate in each of the three cases of b)? Indicate for which of the three cases one scheme is clearly better than another and in which cases you cannot tell.
2.23 The purpose of this question is to examine slightly less expensive implementations of replacement policies than conventional LRU replacement for set-associative caches. For 4-way set-associative caches, at least 5 bits per set are required to keep track of the order of the cache entries in each set from most recently used entry to least recently used entry.
a) Construct some means for approximating LRU replacement that uses only 4 bits per set for LRU purposes. Describe your replacement algorithm and describe how you use the 4 bits to keep track of which item to replace. (One such scheme is sketched after part b.)
b) Simulate the 4-way set-associative caches of Exercise 2.5 on Trace 1 and Trace 2, and compare the effectiveness of your replacement algorithm with true LRU replacement. Your algorithm should do about as well as LRU, possibly marginally better or marginally worse, provided that your algorithm never replaces the most recently used item when it does not replace the least recently used item.
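One well-known scheme that fits the 4-bit budget (it actually needs only 3 of the 4 bits per set) is tree pseudo-LRU, sketched below for a single 4-way set. Because every bit on the accessed way's path is pointed away from that way, the victim is never the most recently used entry, which is the property part (b) asks you to preserve. This is only one possible answer, and the class and method names are not prescribed by the exercise.

    class PseudoLRUSet:
        """Tree pseudo-LRU for one 4-way set: three state bits within the 4-bit budget."""

        def __init__(self):
            self.lines = [None, None, None, None]   # tag stored in each way
            self.b0 = self.b1 = self.b2 = 0         # each bit points toward the colder half

        def _touch(self, way):
            # Point every bit on the accessed way's path away from that way.
            if way < 2:
                self.b0, self.b1 = 1, 1 - way       # right half is now colder; so is the sibling
            else:
                self.b0, self.b2 = 0, 3 - way       # left half is now colder; so is the sibling

        def _victim(self):
            # Follow the bits toward the pseudo-least-recently-used way.
            return 2 + self.b2 if self.b0 else self.b1

        def access(self, tag):
            if tag in self.lines:
                self._touch(self.lines.index(tag))
                return True                          # hit
            way = self._victim()                     # never the most recently used way
            self.lines[way] = tag
            self._touch(way)
            return False                             # miss

    # Toy usage; the caches of Exercise 2.5 would wrap many such sets and replay a trace.
    s = PseudoLRUSet()
    for t in [1, 2, 3, 4, 1, 5]:
        print(t, "hit" if s.access(t) else "miss")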
2.24 The purpose of this problem is to explore cache-memory allocations based on miss-rate derivatives, analogous to the main-memory allocation strategy based on fault-rate derivatives.

Assume that you must design separate caches to hold instructions and data. The total cache size you have available is 16K bytes. Assume that data fetches and instruction fetches occur with equal frequency on your computer system. Assume that the number of misses as a function of cache size and structure for data fetches is the function that you measure for Trace 1, and that the misses for instruction fetches are given by the function that you measure for Trace 2.

a) Use your trace data to fit a smooth continuous curve through your data points for caches up to size 16K. You may find that straight lines give reasonably good fits to the log/log plots of these functions. For both instruction and data caches, assume caches are 4-way set-associative. Find where to divide 16K bytes between the two caches so that the number of misses is minimized. The division point can be any integer, and need not be a power of 2 for your answer. The line sizes are 8 bytes for your data. (A fitting-and-sweep sketch follows part c.)
b) Working analytically, take the derivative of the number of misses as a function of cache size for each of the two analytic functions you created. Calculate the value of the derivatives at the division point you have chosen. Are the derivatives nearly equal?
c) Prove that the division point that gives the fewest misses is the division point at which the miss-rate derivatives are equal. That is, if you take away a little cache from the instructions and give it to the data accesses, then the extra misses created for instructions will be approximately equal to the extra misses saved for data fetches. (Hint: find an expression for miss rate as a function of cache allocations, take the derivative of this expression, and find which allocations produce a value of zero for the derivative.)
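A numerical sketch of parts (a) and (b): fit a power law (a straight line on the log/log plot, as suggested above) to each measured miss-count curve, sweep the division point, and check that the two derivatives are nearly equal there. The coefficients below are placeholders to be replaced by fits to your Trace 1 and Trace 2 measurements.

    TOTAL = 16 * 1024

    def misses_data(x):                     # placeholder power-law fit for Trace 1 (data)
        return 5.0e6 * x ** -0.55

    def misses_inst(x):                     # placeholder power-law fit for Trace 2 (instructions)
        return 3.0e6 * x ** -0.70

    best = min(range(512, TOTAL - 511),     # bytes given to the data cache
               key=lambda d: misses_data(d) + misses_inst(TOTAL - d))

    def deriv(f, x, h=1.0):                 # central-difference derivative
        return (f(x + h) - f(x - h)) / (2 * h)

    print("best split (bytes for data):", best)
    print("slope of the data-miss curve there:       ", deriv(misses_data, best))
    print("slope of the instruction-miss curve there:", deriv(misses_inst, TOTAL - best))

At the minimizing split the two printed slopes should agree closely, which is the equality that part (c) asks you to prove in general.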
2.25 The purpose of this problem is to practice calculations of cycles per instruction measures from performance data.

Assume that you are investigating two different cache memory designs, one of which uses separate caches for data and instructions, and one of which uses a
combined cache to hold both data and instructions. The cycle time is the same for both designs, so that the CPI for each design determines performance. The size and cost of the combined cache is equal to the sum of the sizes and costs of the data and instruction caches, so that the system cost-performance rating is determined totally by CPI. Assume that the miss rate per reference in the individual caches is 3 percent. In the combined cache the miss rate is also 3 percent when the instruction and data references have equal frequency. For the following requested responses, assume the miss rate for the more frequent process drops to 2 percent and the miss rate for the less frequent process increases to 3.5 percent when the frequencies of access are unequal. Assume that the penalty for a miss is 10 cycles, and that the processor waits for a miss to be completed before continuing execution. (Ignore the fact that such a wait is not necessary for WRITE misses.)
a) Compute the CPI contributed by the finite-cache effect for both cache designs assuming that instructions and data are issued with equal frequency and that there are 2.0 references per instruction on the average. (The arithmetic for parts a through c is sketched after part d.)
b) Repeat part a under the assumption that there are 1.50 references per instruction, of which 1.0 references are instruction references and 0.5 references are data fetches. Note that the CPI is the same for 1.0 data references per instruction and 0.5 instruction references per instruction.
c) Repeat part a for a design with 3.0 references per instruction, of which 2.0 are data references and 1.0 are instruction references.
d) Compare your results and comment on your findings. Is the potential miss-rate reduction for the combined cache sufficient to offset the increased bandwidth of two caches?
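The arithmetic for parts (a) through (c) can be organized as in the following sketch: the finite-cache CPI contribution is (references per instruction) x (miss rate per reference) x (miss penalty), summed over the instruction and data streams. The helper names are arbitrary, and the handling of the unequal-frequency case follows the 2 percent / 3.5 percent assumption stated above.

    PENALTY = 10                      # cycles per miss (use 100 for Exercise 2.26)

    def cpi_split(i_refs, d_refs):
        # Separate instruction and data caches, each with a 3 percent miss rate.
        return (i_refs * 0.03 + d_refs * 0.03) * PENALTY

    def cpi_combined(i_refs, d_refs):
        # Combined cache: 3 percent when the streams are equally frequent; otherwise
        # the more frequent stream sees 2 percent and the less frequent one 3.5 percent.
        if i_refs == d_refs:
            return (i_refs + d_refs) * 0.03 * PENALTY
        hi, lo = max(i_refs, d_refs), min(i_refs, d_refs)
        return (hi * 0.02 + lo * 0.035) * PENALTY

    cases = [("a) 1.0 instruction + 1.0 data", 1.0, 1.0),
             ("b) 1.0 instruction + 0.5 data", 1.0, 0.5),
             ("c) 1.0 instruction + 2.0 data", 1.0, 2.0)]
    for name, i_refs, d_refs in cases:
        print(f"{name}:  split caches {cpi_split(i_refs, d_refs):.3f}   "
              f"combined cache {cpi_combined(i_refs, d_refs):.3f}")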
2.26 Repeat Exercise 2.25 for a system in which the cache-miss penalty is 100 cycles instead of 10 cycles.

2.27 The purpose of this exercise is to examine the memory traffic generated by different write policies in cache.
Calculate the main memory traffic expressed in cycles per instruction for a write-through cache and a write-back cache assuming the following:
a) 20 percent of instructions are WRITEs.
b) Each WRITE modifies an average of 7.5 bytes. The operand can cross a cache-line boundary, and if so, two cache lines are modified.
c) The bus width is 4 bytes.
d) For a write-back cache, 80 percent of the changes recorded in cache are not written back because they are superseded by subsequent WRITEs to cache.
e) You should find the relative traffic generated for line sizes of 4, 8, 16, 32, 64, and 128 bytes.

Comment on what combinations of parameters favor the write-through policy and what combinations favor the write-back policy.
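A bookkeeping sketch of one way to set up the calculation: write-through sends the touched bus words of each 7.5-byte operand to memory on every WRITE, while write-back eventually writes whole dirty lines, except for the 80 percent that are superseded in cache. The boundary-crossing probability is estimated here as 7.5 divided by the line size, which is one modeling assumption among several you could defend.

    import math

    BUS = 4                         # bytes per bus cycle
    WRITES_PER_INSTR = 0.20
    BYTES_PER_WRITE = 7.5
    SURVIVING = 0.20                # fraction of dirtied lines eventually written back

    def write_through_cpi():
        # Every WRITE goes to memory; a 7.5-byte operand occupies about two bus words.
        return WRITES_PER_INSTR * math.ceil(BYTES_PER_WRITE / BUS)

    def write_back_cpi(line):
        # Lines dirtied per WRITE: one, plus the chance the operand straddles a line
        # boundary (estimated as BYTES_PER_WRITE / line, an assumption of this sketch).
        p_cross = min(1.0, BYTES_PER_WRITE / line)
        dirty_lines_per_instr = WRITES_PER_INSTR * (1 + p_cross)
        return dirty_lines_per_instr * SURVIVING * (line / BUS)   # a write-back moves a whole line

    for line in (4, 8, 16, 32, 64, 128):
        print(f"line {line:3d} bytes   write-through {write_through_cpi():.2f}   "
              f"write-back {write_back_cpi(line):.2f}")

The trend the numbers show (write-back traffic growing with line size, write-through traffic independent of it) is the kind of comparison the closing comment asks for.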
2.28 The purpose of this exercise is to examine the trailing-edge effect.
Assume that the leading-edge penalty for a cache miss is 10 cycles, and each portion of a cache line to be returned takes 1 additional cycle.
A processor can restart immediately after receiving the first segment of a line. The processor generates two requests per cycle to different areas of memory. One request issued each cycle has a probability of 10 percent of being in the same cache line as the last miss when the line size is twice the bus width, and the probability increases by 2.5 percent to 12.5 percent, 15 percent, etc., for each doubling of line size above two bus widths.
The probability that a second request to the same line as the last miss is a hit in the cache is proportional to the fraction of the cache line that has reached the processor at the time the request is issued. For example, if a cache line takes two memory cycles to reload, when the processor issues a request after the first half reaches the processor, there is a 50 percent probability that a second request to the same line is a hit. Since there is a 10 percent probability of the request being made to the same line, the net probability of a trailing-edge miss is 5 percent.
What is the CPI contribution from the trailing-edge effect under these assumptions for line sizes equal to 2, 4, 8, and 16 times the bus width?
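One possible way to organize the calculation, with the stall model as an explicit assumption: after a miss, portion k of the R = line/bus portions arrives one cycle after portion k - 1, a same-line request occurs in each of those cycles with the stated probability, it misses with probability equal to the fraction of the line still outstanding, and (this sketch's assumption) a trailing-edge miss stalls the processor until the whole line is present. Multiply the per-miss stall by your own misses-per-instruction figure to get the CPI contribution; the 3 percent below is only a placeholder.

    import math

    def p_same(R):
        # 10 percent when the line is two bus widths, plus 2.5 percent per further doubling.
        return 0.10 + 0.025 * (math.log2(R) - 1)

    def trailing_stall_per_miss(R):
        # After the leading edge, portion k (of R) is present at cycle k, and the
        # processor restarts after portion 1.  In each remaining cycle k a same-line
        # request arrives with probability p_same(R) and misses with probability
        # 1 - k/R; assume the stall then lasts the R - k cycles until the line is full.
        return sum(p_same(R) * (1 - k / R) * (R - k) for k in range(1, R))

    MISSES_PER_INSTRUCTION = 0.03            # placeholder; substitute your own value
    for R in (2, 4, 8, 16):
        stall = trailing_stall_per_miss(R)
        print(f"line = {R:2d} x bus width: {stall:.3f} extra cycles per miss, "
              f"CPI contribution about {MISSES_PER_INSTRUCTION * stall:.4f}")

For R = 2 the sketch reproduces the 5 percent figure in the example: 0.05 extra cycles per miss, since only one cycle of fill remains.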
3
Pipeline Design Techniques

Comparisons do ofttime great grievance.
-John Lydgate, c. 1440

3.1 Principles of Pipeline Design
3.2 Memory Structures in Pipeline Computers
3.3 Performance of Pipelined Computers
3.4 Control of Pipeline Stages
3.5 Exploiting Pipeline Techniques
3.6 Historical References
We learn in Chapter 2 that memory is a major bottleneck in high-speed computers, and that the bottleneck can be relieved somewhat by taking advantage of the characteristics of typical programs. The objective has been to store the most-frequently referenced items in fast memory and less-frequently referenced items in slower memory. It is not necessary to make all memory equally fast; we need use only as much fast memory as necessary to hold the active regions of a program and data.
This chapter concerns a different approach to relieving bottlenecks. The idea is to use parallelism at the point of the bottleneck to improve performance globally. If the design techniques are successful, then the extra hardware devoted to performance enhancement is present in only a small portion of a computer system, yet its effect is to increase performance as if the full computer system were replicated.
To contrast the approaches of the last chapter and the present one, in one case the speed differential is due to faster hardware, whereas in the second case
the speed increase is obtained by replicating slower hardware. In both cases, clever architecture is required to create efficient computer systems in which enhancements of a relatively small cost have a global impact on performance.
Pipeline computer techniques described in this chapter are by far the most popular means for enhancing performance through parallel hardware. The basic ideas from which pipeline techniques developed are apparent in von Neumann's proposal to build the first stored-program computer. In Burks et al. [1946], a discussion on input/output techniques describes a buffer arrangement that would permit computation to be overlapped with input/output operations and thereby provide a crude form of pipeline processing that is used widely in today's machines.
Although von Neumann did not build the input/output capability into his first machine, the basic ideas for pipelined computer design evolved rapidly after the first appearance of magnetic-core memory as the primary storage medium for main memory. This storage was roughly a factor of 10 or more slower per cycle than the transistor technology used in high-speed registers and control logic. Designers quickly conceived of a variety of techniques to initiate one or more concurrent accesses to memory while executing instructions in the central processor. This body of techniques eventually evolved and is exemplified in the pipeline-processing structures described in this chapter.
In the 1960s, when hardware costs were relatively high, pipelined computers were the supercomputers. IBM's STRETCH and CDC's 6600 were two such designs from the early 1960s that made extensive use of pipelining, and these designs strongly influenced the structure of supercomputers that followed. By the 1980s, hardware costs had diminished to the extent that pipeline techniques could be implemented across the entire range of performance, and indeed, even the Intel 8086 microprocessor that costs just a few dollars uses pipeline accesses to memory while performing on-chip computation.
Our approach is to develop a basic understanding of the principles of pipeline design in the next section. In subsequent sections we observe where it can be used and how to design effective pipelines.
3.1 Principles of Pipeline Design

The basic idea behind pipeline design is quite natural; it is not specific to computer technology. In fact the name pipeline stems from the analogy with petroleum pipelines, in which a sequence of hydrocarbon products is pumped through a pipeline. The last product might well be entering the pipeline before the first product has been removed from the terminus. In the remainder of this section we treat pipeline design first in abstract terms, and then follow with concrete examples.
The key contribution of pipelining is that it provides a way to start a new task before an old one has been completed. Hence the completion rate is not a function of the total processing time, but rather of how soon a new process can be introduced.
Consider Fig. 3.1, which shows a sequential process being done step-by-step over a period of time. The total time required to complete this process is N units, assuming that each step takes one time unit. In the figure, each box denotes the execution of a single step, and the label in the box gives the number of the step being executed.
To perform this same process using pipeline techniques, consider Fig. 3.2, which shows a continuous stream of jobs going through the N sequential steps of the process. In this case each horizontal row of boxes represents the time history of one job. Each vertical column of boxes represents the activity at a specific time. Note that up to N different jobs may be active at any time in this example, assuming that we have N independent stations to perform the sequence of steps in the process.
The pipeline timing of Fig. 3.2 is characteristic of assembly lines and maintenance depots as well as oil pipelines. The total time to perform one process does not change between Fig. 3.1 and Fig. 3.2, and it may actually be longer in Fig. 3.2 if the pipeline structure forces some processing overhead in moving from station to station. But the completion rate of tasks in Fig. 3.2 is one per cycle as opposed to one task every N cycles in Fig. 3.1.
Figure 3.3(a) shows a box that represents a server that can perform any of the N processing steps in a single unit of time. If the job stream is processed by this one server, then the rate of completion is one job every N steps, and the time behavior of the job stream is as described in Fig. 3.1.
Compare Fig. 3.3(a) with Fig. 3.3(b), which shows N servers concatenated in a sequence. A task flows through the collection of servers by visiting Server 1, then Server 2, and so on, and finally emerging from Server N after N steps. The time behavior of this system is described by Fig. 3.2. Figure 3.3(b) is an ideal model of a constant-speed assembly line, such as an automobile assembly plant.

Fig. 3.3 Two ways to execute N-unit jobs in a stream: (a) sequential execution with an N-unit server; and (b) pipelined execution with 1-unit servers.
Now let's relate the general ideas presented in Figs. 3.1-3.3 to computer design. Where can we find an N-step task that can conveniently be broken up, as shown in Fig. 3.2? Consider the steps required to execute a single instruction.

Fig. 3.1 An N-step sequential process.
Fig. 3.2 Pipelined execution of an N-step process. (Time runs left to right.)
This sequence has traditionally been implemented using pipeline design. A typical instruction-execution sequence might be:
1. Instruction fetch: obtain a copy of the instruction from memory.
2. Instruction decode: examine the instruction and prepare to initialize the control signals required to execute the instruction in subsequent steps.
3. Address generation: compute the effective address of the operands by performing indexing or indirection as specified by the instruction.
4. Operand fetch: for READ operations, obtain the operand from central memory.
5. Instruction execution: execute the instruction in the processor.
6. Operand store: for WRITE operations, return the resulting data to central memory.
7. Update program counter: generate the address of the next instruction.
If we simply map these steps onto the model of Fig. 3.3(b), we obtain a block diagram of a pipeline computer as shown in Fig. 3.4. For reasons discussed later this structure might have to be tuned somewhat to obtain a good balance between cost and performance. More important, the structure has to be designed to work correctly and efficiently in the face of difficulties caused by interactions between events at different stages in the pipeline.

Fig. 3.4 Pipelined execution of a single instruction (stages: instruction fetch, instruction decode, address generate, operand fetch, execute, operand store, update program counter).
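To make the overlap concrete, here is a minimal timing sketch for an ideal version of this pipeline with no stalls or hazards: instruction i enters stage s during cycle i + s + 1 (numbering instructions and stages from 0 and cycles from 1), so the first instruction completes after seven cycles but each later one completes just one cycle after its predecessor, exactly the behavior of Fig. 3.2. The cycle-numbering convention and the absence of interlocks are simplifying assumptions of the sketch.

    STAGES = ["instruction fetch", "instruction decode", "address generation",
              "operand fetch", "execute", "operand store", "update program counter"]

    def schedule(n_instructions):
        # Ideal pipeline: instruction i occupies stage s during cycle i + s + 1.
        return [[i + s + 1 for s in range(len(STAGES))] for i in range(n_instructions)]

    for i, row in enumerate(schedule(4)):
        print(f"instruction {i}: occupies cycles {row[0]} through {row[-1]}, "
              f"completes in cycle {row[-1]}")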
In the normal mode of operation, the first stage of the pipeline in Fig. 3.4 continuously fetches instructions, even though the address of a fetch has not been produced by the last stage of the pipeline, the stage that updates the program counter. Thus, the first stage is operating with a new program counter value several cycles before the counter value is produced in the last stage. Since most changes to the program counter are increments, the first stage estimates the future value of the program counter quite easily by successive increments of the program counter. The interaction between the first and last stages occurs when a branch alters the program counter nonsequentially.
The address of the instruction that follows a conditional branch might not be known until the branch instruction reaches the last stage of the pipeline, but
Fig. 3.4 suggests that subsequent instructions have followed the conditional branch down the pipeline and may have altered the state of one or more machine registers even before the outcome of the conditional branch is known. The danger is that we cannot be sure what to execute until the condition has been evaluated.
One way to assure correct execution of conditional branches is to interlock the instruction-fetch stage with the program-counter update stage. Immediately after a conditional branch is fetched by the first stage, no further fetches take place until the branch reaches the last stage. At this point the last stage produces the correct target address and removes the interlock, allowing the first stage to continue instruction fetches at the new address.
The problem with this solution is the lost performance caused by the inactive pipeline during the period a conditional branch is pending. Conditional branches occur quite frequently in most conventional computers, possibly once every five to ten cycles on the average. If a pipeline were idled when each such branch is encountered, then every five to ten instructions a few cycles would potentially be lost waiting for branches. This is a fairly hefty penalty for the inefficiency of a pipeline implementation.
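A rough sketch of the cost of that interlock, treating the branch frequency and the number of stalled cycles as free parameters (the values below simply span the once-every-five-to-ten range mentioned above and assume the seven-stage pipeline of Fig. 3.4):

    DEPTH = 7                                # stages in the pipeline of Fig. 3.4

    def branch_penalty_cpi(instructions_per_branch, stall_cycles=DEPTH - 1):
        # With the simple interlock, each conditional branch idles instruction fetch
        # until the branch reaches the last stage: roughly DEPTH - 1 lost cycles.
        return stall_cycles / instructions_per_branch

    for n in (5, 10):
        print(f"a conditional branch every {n} instructions adds about "
              f"{branch_penalty_cpi(n):.1f} cycles per instruction")

Under these assumptions the loss ranges from roughly half a cycle to more than a full cycle per instruction, which is the fairly hefty penalty the text refers to.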
There are other interactions of concern as well, all of which tend to force the designer to add complexity to ensure correct pipeline operation. These interactions could severely impede performance
