We learn in Chapter 2 that memory is a major bottleneck in high-speed computers, and that the bottleneck can be relieved somewhat by taking advantage of the characteristics of typical programs. The objective has been to store the most-frequently referenced items in fast memory and less-frequently referenced items in slower memory. It is not necessary to make all memory equally fast; we need use only as much fast memory as necessary to hold the active regions of a program and data.

This chapter concerns a different approach to relieving bottlenecks. The idea is to use parallelism at the point of the bottleneck to improve performance globally. If the design techniques are successful, then the extra hardware devoted to performance enhancement is present in only a small portion of a computer system, yet its effect is to increase performance as if the full computer system were replicated.

To contrast the approaches of the last chapter and the present one, in one case the speed differential is due to faster hardware, whereas in the second
case the speed increase is obtained by replicating slower hardware. In both cases, clever architecture is required to create efficient computer systems in which enhancements of a relatively small cost have a global impact on performance.

Pipeline computer techniques described in this chapter are by far the most popular means for enhancing performance through parallel hardware. The basic ideas from which pipeline techniques developed are apparent in von Neumann's proposal to build the first stored-program computer. In Burks et al. [1946], a discussion on input/output techniques describes a buffer arrangement that would permit computation to be overlapped with input/output operations and thereby provide a crude form of pipeline processing that is used widely in today's machines.

Although von Neumann did not build the input/output capability into his first machine, the basic ideas for pipelined computer design evolved rapidly after the first appearance of magnetic-core memory as the primary storage medium for main memory. This storage was roughly a factor of 10 or more slower per cycle than the transistor technology used in high-speed registers and control logic. Designers quickly conceived of a variety of techniques to initiate one or more concurrent accesses to memory while executing instructions in the central processor. This body of techniques eventually evolved and is exemplified in the pipeline-processing structures described in this chapter.

In the 1960s, when hardware costs were relatively high, pipelined computers were the supercomputers. IBM's STRETCH and CDC's 6600 were two such designs from the early 1960s that made extensive use of pipelining, and these designs strongly influenced the structure of supercomputers that followed. By the 1980s, hardware costs had diminished to the extent that pipeline techniques could be implemented across the entire range of performance, and indeed, even the Intel 8086 microprocessor that costs just a few dollars uses pipelined accesses to memory while performing on-chip computation.

Our approach is to develop a basic understanding of the principles of pipeline design in the next section. In subsequent sections we observe where it can be used and how to design effective pipelines.
3.1 Principles of Pipeline Design

The basic idea behind pipeline design is quite natural; it is not specific to computer technology. In fact, the name pipeline stems from the analogy with petroleum pipelines, in which a sequence of hydrocarbon products is pumped through a pipeline. The last product might well be entering the pipeline before the first product has been removed from the terminus. In the remainder of this section we treat pipeline design first in abstract terms, and then follow with concrete examples.
[Fig. 3.2 Pipelined execution of an N-step process. Time runs along the horizontal axis; each row shows one task passing through steps 1, 2, 3, ..., overlapped with the tasks above and below it.]
`
...struction. This sequence has traditionally been implemented using pipeline design. A typical instruction-execution sequence might be:

1. Instruction fetch: obtain a copy of the instruction from memory.
2. Instruction decode: examine the instruction and prepare to initialize the control signals required to execute the instruction in subsequent steps.
3. Address generation: compute the effective address of the operands by performing indexing or indirection as specified by the instruction.
4. Operand fetch: for READ operations, obtain the operand from central memory.
[Fig. 3.3 Two ways to execute N-unit jobs in a stream: (a) sequential execution with an N-unit server, which yields one completion every N units; and (b) pipelined execution with a chain of 1-unit servers, which yields one completion every unit.]
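To make the overlap concrete, the following minimal sketch (in Python; the stage names are simply the first four steps listed above, and the simulation itself is an illustration rather than part of the original text) advances one instruction into the first stage on every cycle, so that once the pipeline fills, one instruction completes per cycle, as in Fig. 3.3(b):

    # A toy model of overlapped (pipelined) instruction processing.  Each
    # stage takes one machine cycle, and a new instruction enters the first
    # stage on every cycle, so one instruction completes per cycle once the
    # pipeline is full.
    STAGES = ["fetch", "decode", "address generation", "operand fetch"]

    def simulate(instructions, stages=STAGES):
        depth = len(stages)
        in_flight = [None] * depth        # in_flight[i] = instruction now in stage i
        completed, cycle = [], 0
        pending = list(instructions)
        while len(completed) < len(instructions):
            # Every instruction advances one stage; the next one enters stage 1.
            in_flight = [pending.pop(0) if pending else None] + in_flight[:-1]
            cycle += 1
            if in_flight[-1] is not None:  # this one finishes at the end of the cycle
                completed.append(in_flight[-1])
            busy = ", ".join(f"{s}={i}" for s, i in zip(stages, in_flight) if i)
            print(f"cycle {cycle}: {busy}")
        return cycle

    # Five 4-step instructions finish in 5 + (4 - 1) = 8 cycles instead of 20.
    print("total cycles:", simulate([f"I{k}" for k in range(1, 6)]))

Five instructions finish in 8 cycles instead of 20, and the gain grows with the length of the job stream.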
...line, but Fig. 3.4 suggests that subsequent instructions have followed the conditional branch down the pipeline and may have altered the state of one or more machine registers even before the outcome of the conditional branch is known. The danger is that we cannot be sure what to execute until the condition has been evaluated.

One way to assure correct execution of conditional branches is to interlock the instruction-fetch stage with the program-counter update stage. Immediately after a conditional branch is fetched by the first stage, no further fetches take place until the branch reaches the last stage. At this point the last stage produces the correct target address and removes the interlock, allowing the first stage to continue instruction fetches at the new address.
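The interlock can be sketched in a few lines of code (a schematic model only, not the hardware; the interlock flag, the instruction records, and the four-stage depth are illustrative assumptions): the fetch stage simply refuses to supply new instructions while an unresolved branch is anywhere in the pipeline, and the branch supplies the next fetch address when it reaches the last stage.

    # Sketch of a fetch interlock for conditional branches (illustrative only).
    # While an unresolved branch is anywhere in the pipeline, the fetch stage
    # idles; when the branch reaches the last stage, it supplies the next
    # fetch address and the interlock is removed.
    def run(program, depth=4):
        pipeline = [None] * depth          # pipeline[0] is the fetch stage
        pc, cycle, interlocked = 0, 0, False
        while pc < len(program) or any(pipeline):
            cycle += 1
            leaving = pipeline[-1]
            if leaving and leaving["kind"] == "branch":
                pc = leaving["target"]     # branch resolves in the last stage
                interlocked = False        # remove the interlock
            if interlocked or pc >= len(program):
                incoming = None            # fetch stage sits idle
            else:
                incoming = program[pc]
                pc += 1
                if incoming["kind"] == "branch":
                    interlocked = True     # set the interlock
            pipeline = [incoming] + pipeline[:-1]
        return cycle

    program = [{"kind": "alu"}, {"kind": "branch", "target": 3},
               {"kind": "alu"},            # skipped when the branch is taken
               {"kind": "alu"}, {"kind": "alu"}]
    print("cycles with interlock:", run(program))   # the stall adds idle cycles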
The problem with this solution is the performance lost while the pipeline is inactive during the period a conditional branch is pending. Conditional branches occur quite frequently in most conventional computers, possibly once every five to ten cycles on the average. If a pipeline were idled when each such branch is encountered, then from one-fifth to one-tenth of the machine cycles would be lost waiting for branches. This is a fairly hefty inefficiency penalty for a pipeline implementation.

There are other interactions of concern as well, all of which tend to force the designer to add complexity to ensure correct pipeline operation. These interactions could severely impede performance if the resulting implementation is idle a significant fraction of the time because of the stage-to-stage interlocks. Consequently, the designer is concerned with building a pipeline that is simultaneously correct, efficient, and fast. How this is done is the subject of the material that follows.
If we assume that each of the steps in the execution sequence is a single cycle, then a typical instruction takes seven cycles to execute. In several computer systems the pipeline takes advantage of the fact that the majority of instructions do not perform both a READ and a WRITE operation, and the number of stages is shortened to six. For our discussion we use the simpler, but possibly less realistic, pipeline shown in Fig. 3.4.

The structure of Fig. 3.4 produces one instruction completed during each machine cycle, for an improvement in performance of a factor of 7 over a purely sequential implementation (or a factor of 6 faster than a sequential implementation that provides for only a READ or a WRITE cycle, but not both, for each instruction).

In actual practice the performance improvement depends on the ability to split the processes into steps of equal duration. Suppose, for example, that a memory access takes four machine cycles, not just one, as we have presumed thus far. Then the three different steps in Fig. 3.4 that refer to memory accesses take four cycles each, whereas all other steps take just one cycle. Now the average rate of instruction completion depends on the slowest process in the sequence, which in this case is a memory operation that produces one result every four cycles. No matter how fast the other processes are, they
cannot produce results any faster on the average than the slowest processor in the chain.

In effect, we can run all processes in Fig. 3.4 at a rate that produces one result every four cycles at the output of each process. The original instruction execution requires three memory accesses (an instruction fetch, a READ, and a WRITE) and four single-cycle steps, for a total of 16 cycles. The structure in the figure produces one result every 4 cycles, for a speedup of 4. This is somewhat lower than our original estimate of a speedup of 7.
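The arithmetic generalizes: when the stages have unequal durations, the completion rate is set by the slowest stage, so the speedup over purely sequential execution is the total sequential time divided by the slowest stage time. A small sketch of this calculation follows (in Python; the names of the last three steps are assumptions based on the seven-step sequence, since only the first four steps are listed above):

    # Pipeline completion rate when stage durations are unequal: the slowest
    # stage sets the pace.  Three memory steps take 4 cycles each and four
    # processor steps take 1 cycle each, as in the example above.
    stage_cycles = {
        "instruction fetch": 4,
        "instruction decode": 1,
        "address generation": 1,
        "operand fetch (READ)": 4,
        "execute": 1,                       # assumed name for a later step
        "operand store (WRITE)": 4,         # assumed name for a later step
        "program counter update": 1,        # assumed name for a later step
    }

    sequential_time = sum(stage_cycles.values())    # 16 cycles per instruction
    cycles_per_result = max(stage_cycles.values())  # 4 cycles (slowest stage)
    speedup = sequential_time / cycles_per_result   # 16 / 4 = 4

    print(sequential_time, cycles_per_result, speedup)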
By adding some delays to the pipeline of Fig. 3.4, we can create a longer pipeline that produces one result every cycle. In Fig. 3.5 we have added three stages of delay to every memory operation, making the pipeline a total of 16 units long. But in this pipeline new instructions can be initiated every cycle rather than every four cycles, so that it produces results at a rate about 16 times faster than a purely sequential implementation of a processing unit.

Although Fig. 3.5 seems to make the speedup deceptively simple to obtain, actually the memory system must be able to support the processing rate, and this may be quite a feat to accomplish. For the structure shown running at its full processing rate, there are 12 active stages with different memory operations at any given time. So the memory system must be able to support 12 independent concurrent accesses in a nonconflicting way.

In this case, a pipelined structure in one subsystem shifts the processing bottleneck to a different subsystem. Obviously, to make the best use of pipelining for a processor structured in a pipeline fashion as shown in Fig. 3.5, the architect should extend pipeline techniques or other high-speed techniques to the memory system as well, so that the maximum processing capacities of the various subsystems are matched to each other.
Up to this point the design principles have been very simple. We identify a task that performs a sequence of operations. Then we build a chain of processors and perform one step of the sequence on each processor in the chain. We have just observed that we may have to break some of the steps into smaller steps to create a pipeline in which all stages produce results at the same rate. Lengthy steps may have to be broken into two or more smaller steps to meet this requirement.
The design would be satisfactory at this point if there were no interactions among two or more tasks that were at different stages of the pipe. If such interactions are present, then the pipeline must account for these in some way, which considerably increases the difficulty of producing an efficient design and adds to the complexity of the final result.

For the moment, however, let us explore a practical application of the structure we have outlined. One way to guarantee that there are no interactions among tasks in a pipeline is to make sure that any two or more tasks concurrently in the pipeline are totally independent and have no conflicts. The 16 cycles allocated in Fig. 3.5 could be allocated to 16 different processors
on a round-robin basis. And this is the thinking that guided the design of the CDC 6600 peripheral controllers.

Figure 3.6(a) shows the pipeline structure of the peripheral-controller "barrel" for the CDC 6600. Each stage takes one cycle, and ten cycles are required to execute one instruction, including a fetch or store to local memory. In this particular design all events take place at one of the ten steps in the process cycle. The other nine steps are simple delays that hold the machine state while awaiting a memory access. The state of each task held in a CDC 6600 peripheral processor is very small, only 51 bits, and consists of:

• The accumulator (18 bits);
• The program counter (12 bits);
• The operand address or other instruction-specific data (12 bits); and
• The instruction code and trip counter (9 bits).

During one traversal of the ten stages, a peripheral processor performs one READ or WRITE to local memory, together with related actions. If more traversals are required, the instruction passes through the stages multiple times. The trip counter indicates which traversal is currently in progress, and it thereby controls how each instruction is executed as a function of its trip number through the barrel.
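A rough software picture of the barrel follows (a toy sketch, not the actual hardware: the field layout mirrors the 51-bit state listed above, but the execute_station placeholder and what it does are illustrative assumptions). Ten processor states circulate through ten stage positions, and all real work happens as a state passes the single execution station:

    # Toy picture of the CDC 6600 peripheral-processor "barrel": ten small
    # processor states circulate through ten stage positions, and a state does
    # real work only as it wraps around past the single execution station.
    from dataclasses import dataclass

    @dataclass
    class PPUState:                  # the 51 bits of state per peripheral processor
        accumulator: int = 0         # 18 bits
        program_counter: int = 0     # 12 bits
        operand_address: int = 0     # 12 bits
        opcode_and_trip: int = 0     # 9 bits (instruction code + trip counter)

    barrel = [PPUState(program_counter=p) for p in range(10)]   # one state per stage

    def execute_station(state):
        # Placeholder for the one step done per traversal (a fetch, an operand
        # access, or an ALU operation, selected by the opcode and trip counter).
        state.program_counter = (state.program_counter + 1) % 4096
        return state

    for cycle in range(20):          # two full traversals of the barrel
        # Every state advances one position; the state wrapping around from the
        # last position back to the first passes through the execution station.
        barrel = [execute_station(barrel[-1])] + barrel[:-1]

    print([s.program_counter for s in barrel])   # each state advanced twice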
All processor activity occurs between Stages 9 and 10 shown in Fig. 3.6(a). In one trip through the ten stages, the processor can do an instruction fetch, an operand fetch together with an instruction execution, or one fetch of an indirect address. (The processor can also fetch long operands from the main memory. This takes five trips through the pipeline, since the operand width of the pipeline is only 12 bits but operands in main memory contain 60 bits.)

In this architecture, the first trip through the ten-stage pipeline is an instruction fetch, which starts at the clock that latches a processor state in Stage 10. At this time the program counter is sent to memory to initiate a READ cycle to obtain the next instruction. That instruction reaches the processor at the clock time that latches the corresponding processor state into Stage 9, and is ready to join that processor state as the state moves from Stage 9 to Stage 10 at the end of the next trip through the pipeline.

Since each instruction in this processor contains two fields, one each for the instruction code and operand address, the instruction fetch concludes by loading the instruction code and address registers with the instruction obtained by the memory fetch. The trip counter is also initialized and the instruction decode takes place at this point, so the processor is prepared for the next cycle to read or write an operand or to read an indirect address as required by the instruction.

Assume that the instruction is a LOAD or an ADD and that the addressing mode is direct. Execution of this instruction is completed during the second trip through the pipeline. In that trip, the operand from memory is fetched,
and it arrives at the processor just as the corresponding processor state reaches Stage 9 and is ready to move to Stage 10.

Note that Fig. 3.6(a) shows a data path for the accumulator that passes through an adder/shifter at the execution stage. As the accumulator data propagates on this path, the operand data joins it. The instruction sets the adder/shifter to add, subtract, logical-AND, or logical-OR its two operands, or to shift or copy one of the two operands into the accumulator in Stage 10. Hence, the logic to perform the full repertoire of instructions is in the path between Stages 9 and 10. At the same stage, the program counter is adjusted by incrementing it (for sequential execution) or by loading it from the address register (for branches).

It is clear from Fig. 3.6 why this design is both fast and efficient. The basic logic for one processor is placed only in the execution stage of the pipeline. All other stages of the pipeline contain state storage only. Consequently, this implementation of ten processors contains one execution unit (but ten full register sets and ten memory systems), so the cost is less than the cost of ten individual machines. Yet the performance capacity is equal to the capacity of ten machines, provided that the ten machines operate independently and do not contend for shared data, nor interfere mutually in other ways. The cost benefit of this approach was very high in the mid-1960s, when the cost of the logic saved was quite substantial. The reader should consult Thornton [1970] for a detailed description of the implementation.
Figure 3.6(a) illustrates an actual implementation of the idealized pipeline design for the processor shown in Fig. 3.6(b). Each stage in this design consists of a set of latches (data registers) that hold data in transit through the pipeline.

Between stages is combinational logic through which data propagates. The idea is that all activity occurs at successive clock times. At a clock tick, each bank of latches reads its input data and captures that data in its internal storage. After a short propagation delay, the data appear at the latch outputs and begin propagating to the next stage in the pipeline. The logic between stages transforms the data, for example by shifting or adding, or by some other elementary operation that implements one stage of the pipeline function. Eventually the transformed data reach the input pins of the next rank of latches. The next clock tick is delayed long enough with respect to the preceding clock tick to ensure that all data in the pipeline have propagated to the input pins of the next stage. Only then are the stage outputs allowed to change.

The purpose of the clock is to synchronize the data at the input to each stage. The clock is set as fast as possible within the normal constraints of reliability, power consumption, and engineering tolerances. It must be slow enough to accommodate the slowest propagation path.
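The clocking discipline just described can be expressed compactly in code (a schematic sketch only; the three stage functions, shift, add, and mask, are arbitrary examples rather than operations from the text). The essential point is that every rank of latches captures its new value at the same clock tick, so each new value is computed from the old latch contents:

    # Schematic model of ranks of latches separated by combinational logic.
    # On every clock tick each rank captures its new value simultaneously, so
    # the new values are computed from the *old* latch contents.
    def clock_tick(latches, logic_between_stages, new_input):
        inputs = [new_input] + latches[:-1]          # what each stage sees
        return [logic(x) for logic, x in zip(logic_between_stages, inputs)]

    # Example: a three-stage pipeline whose stage logic shifts, adds, and masks.
    logic = [lambda x: x << 1, lambda x: x + 3, lambda x: x & 0xFF]
    latches = [0, 0, 0]

    for tick, value in enumerate([5, 6, 7, 0, 0, 0]):
        latches = clock_tick(latches, logic, value)
        print(f"after tick {tick}: {latches}")       # data marches one stage per tick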
The simplicity of the design in Fig. 3.6(a) is somewhat deceptive. Because the ten processors are totally separate, there are no interactions in the
pipeline, and no extra hardware is required to deal with unusual conditions. If we consider the design of a processing unit for a single stream of instructions, the situation is complicated by several factors. The ones that normally require attention are:

• Conditional branches must complete their execution before the address of the next instruction is known. Therefore, when a conditional branch is encountered at Stage 1 of a pipeline, subsequent data that reaches Stage 1 must be treated tentatively until the target of the conditional branch is known.
• An instruction in the pipeline might produce a result that is used by a later instruction in the pipeline. The second instruction must be delayed somehow until the result is available.
• An instruction in the pipeline might produce a result for a register whose previous contents must be read by an earlier instruction in the pipeline. The second operation must not destroy the current contents of that register until that register is free to be rewritten.
• Two different instructions in the pipeline may write data into a common location, but the pipeline may reverse the order in which the location is updated.
We have already mentioned the problem concerning conditional branches. The other three points are instances of a more general problem of concurrent operations on shared resources. In the specific cases here, the shared resources are registers or data in main memory.

If the basic operations are READ and WRITE, we have four ways that two processes can interact, namely READ/READ, READ/WRITE, WRITE/READ, and WRITE/WRITE, where the first label indicates the earlier operation and the second label indicates the later operation. The idea is that a program is written for correct execution under the assumption that instructions are executed sequentially, with one instruction fully completed before the next is started. The pipeline implementation violates this assumption, but it must give the appearance that the assumption holds.

For READ/READ interactions, we have a situation in which a datum is to be read first by Instruction 1 and then by Instruction 2. Can we permit these operations to be done out of order? The answer is yes, so this conflict is not of concern. However, if done out of order, WRITE/WRITE interactions leave the shared cell in the wrong state, and subsequent READs to the shared item will obtain the wrong value. Similarly, if the order of a WRITE/READ pair is reversed, the READ obtains old data, not the new data as intended. And if a READ/WRITE pair is reversed, the READ obtains new data, not the old data as intended. Each of these three operation pairs needs to be detected and interlocked when it occurs on shared data.
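A compact way to state the detection rule appears below (an illustrative sketch; instructions are reduced to the sets of registers they read and write, and the register names are made up for the example):

    # Classify how an earlier and a later instruction interact on registers.
    # Each instruction is reduced to the set of registers it reads and the set
    # it writes; only pairs that touch a common register matter.
    def conflicts(earlier, later):
        found = []
        if earlier["writes"] & later["writes"]:
            found.append("WRITE/WRITE")    # the later write must happen last
        if earlier["writes"] & later["reads"]:
            found.append("WRITE/READ")     # the later read needs the new value
        if earlier["reads"] & later["writes"]:
            found.append("READ/WRITE")     # the earlier read needs the old value
        # READ/READ on a shared register is harmless and is not reported.
        return found

    first = {"reads": {"R1", "R2"}, "writes": {"R3"}}
    later = {"reads": {"R3"}, "writes": {"R1"}}
    print(conflicts(first, later))         # ['WRITE/READ', 'READ/WRITE']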
The WRITE/WRITE conflict appears to be unusual because it requires a program to write a datum to memory and then immediately overwrite that datum. Such sequences could happen, however, if a WRITE is followed by a conditional branch, one of whose outcomes immediately performs the second WRITE.

A common WRITE/WRITE conflict on a shared variable involves the updating of a processor's state bits (often called the condition codes), which are written as a consequence of instruction execution. Avoiding WRITE/WRITE conflicts on these variables requires that they be updated in the order in which they would be written on a purely sequential processor without pipelined execution. If a processor can detect that no intervening READ occurs between two WRITEs to a shared variable, the processor can execute the second WRITE first and then abort the first WRITE when its point of execution occurs.
READ/WRITE and WRITE/READ conflicts are a fact of life and the bane of pipeline designs. The CDC 6600 design faces this problem in a central processor that has ten independent functional units dedicated to such functions as add, multiply, divide, shift, normalize, branch, increment, and logic (Boolean). The divide is rather lengthy compared to the shift and Boolean functions. Consequently, it is quite possible to encounter pipeline conflicts resulting from instructions completing execution out of order.

The approach used in the CDC 6600 is to implement instructions with three addresses to support overlapped execution and to use a hardware structure that the designers called a scoreboard to search for conflicts. The three-address format of instructions takes the general form:

Ri := Rj OP Rk

where OP designates one of the functional units. In this fashion it becomes possible to take advantage of pipeline execution when evaluating such functions as (A × B) + (C × D), which can be encoded as:

R3 := R1 × R2
R6 := R4 × R5
R1 := R3 + R6

The second instruction uses a distinct set of registers from the first instruction, so that it can be executed in a pipelined fashion, provided that two distinct multipliers are available. The third instruction is interlocked with the first two instructions, and must be delayed for their completion. Note the WRITE/READ conflicts for Registers 3 and 6 and the READ/WRITE conflict for Register 1, any of which would result in deferring the execution of the addition instruction.
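The deferral decision can be sketched as follows (illustrative only; the real scoreboard is a hardware structure and also tracks functional-unit availability, which this sketch omits). An instruction is held back while an earlier, unfinished instruction is still due to write a register it reads, or still needs a register it writes:

    # Sketch of scoreboard-style issue logic for three-address instructions.
    # An instruction is deferred while any earlier, unfinished instruction is
    # still due to write one of its source registers, or still reads or writes
    # its destination register.
    def can_issue(instr, unfinished):
        for earlier in unfinished:
            if earlier["dest"] in instr["sources"]:   # WRITE/READ conflict
                return False
            if instr["dest"] in earlier["sources"]:   # READ/WRITE conflict
                return False
            if instr["dest"] == earlier["dest"]:      # WRITE/WRITE conflict
                return False
        return True

    program = [
        {"name": "R3 := R1 * R2", "dest": "R3", "sources": {"R1", "R2"}},
        {"name": "R6 := R4 * R5", "dest": "R6", "sources": {"R4", "R5"}},
        {"name": "R1 := R3 + R6", "dest": "R1", "sources": {"R3", "R6"}},
    ]

    unfinished = []
    for instr in program:
        verdict = "issues" if can_issue(instr, unfinished) else "is deferred"
        print(f"{instr['name']} {verdict}")
        unfinished.append(instr)           # pretend nothing has completed yet
    # The two multiplies issue; the addition is deferred because it reads R3
    # and R6 (WRITE/READ) and writes R1 (READ/WRITE).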
The scoreboard keeps track of conflicts and defers instructions for any of the following conditions:

• WRITE/WRITE conflict;
• WRITE/READ conflict;
• READ/WRITE conflict; and
• Functional-unit conflict.

In the case of WRITE/READ and READ/WRITE conflicts, the conflicting instructions are passed through the early stages of the pipeline, in spite of the conflicts, so that they can be decoded and their operands can be requested. They are moved to their respective functional units and deferred there to wait until the conflicting conditions are cleared. In so doing, the early stages of the pipeline are cleared and made available for new instructions. Instructions with WRITE/WRITE or functional-unit conflicts are deferred before being issued to the functional units, which causes the pipeline to freeze until the conflicting condition is cleared.
3.2 Memory Structures in Pipeline Computers

The previous section indicates a general structure for pipeline computers and illustrates how pipeline execution has been implemented in an early supercomputer. The major innovation of pipeline design is the ability to complete instructions at a rate much faster than that achievable by executing instructions without overlapping.

We have learned that one major obstacle to continuous execution at the maximum pipeline rate is the requirement that pipeline execution must behave exactly as if the same instructions were executed without overlap. This gives rise to conflicts that are detectable by hardware and results in the freezing of some operations pending the completion of others.
Let us reconsider how pipelined processors work in the context of memory systems. In the previous section, an ideal pipelined processor decodes one instruction in each machine cycle. To maintain that rate, we clearly need to supply instructions and data to the processor at a rate consistent with the needs of the pipeline. No single large physical memory available today can meet such demands because, at best, a memory cycle usually runs from four to 20 processor cycles. Moreover, the ratios are not likely to change in the future. However, cache memory can be implemented to deliver one result per cycle, and therefore cache memory can potentially supply the demands of a pipelined processor.

The actual requirements for a pipelined processor depend specifically on ...
The timing for accessing the banks is shown in Fig. 3.8. Note that the banks are active in an overlapped fashion, with each bank starting its access one machine cycle behind the previous bank. So even though cycle times are long, the processor can request a vector of data and will receive one datum per cycle after a startup transient. The key to this technique is that successive elements of the vector of data must lie in successive memory banks.
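The addressing rule and the resulting overlap can be sketched as follows (the bank count and access time are example values chosen to match the four-bank timing of Fig. 3.8, not parameters of any particular machine):

    # Interleaved memory: element i of a vector lives in bank i mod NBANKS, so
    # consecutive elements hit different banks and their long access times
    # overlap, with a new access starting every cycle.
    NBANKS = 4      # number of banks (example value)
    ACCESS = 4      # bank access time in processor cycles (example value)

    def fetch_schedule(n_elements):
        """Return (element, bank, cycle at which the datum is delivered)."""
        schedule = []
        for i in range(n_elements):
            bank = i % NBANKS
            start = i               # assumes NBANKS >= ACCESS, so the bank is
            schedule.append((i, bank, start + ACCESS))   # free again in time
        return schedule

    for element, bank, ready_at in fetch_schedule(8):
        print(f"element {element}: bank {bank}, delivered at cycle {ready_at}")
    # After a startup transient of ACCESS cycles, one datum arrives per cycle.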
3.3 Performance of Pipelined Computers

The major performance characteristic of pipelined computers is that the rate of computation can be very high even though the elapsed time of individual operations might be very long. Since performance is the rate at which operations are completed, pipeline-design techniques produce high performance in spite of the long delays associated with specific operations.

To the extent that a high rate of completion can be maintained, pipeline design is an effective method for increasing performance. But when the rate of completion drops off, possibly because of the inability to keep data flowing through a pipeline on a continuous basis, the pipeline becomes rather costly, and its cost may outweigh the benefits of using this design technique.

A very interesting way to view the efficiency of pipelined computers is to compare them to highly parallel array (vector) computers. The analysis for the array computer is very easy to derive and understand, and it corresponds directly to the analysis for pipelined computers. This discussion is originally due to observations by T. C. Chen [1980].

Consider a computation that is natural to execute on a parallel computer because of its highly repetitive structure. A typical computation of this type is
[Fig. 3.8 Timing for pipelined fetches for the memory shown in Fig. 3.7. Four banks are plotted on the vertical axis against time on the horizontal axis: bank 1 delivers data 0, 4, 8, bank 2 delivers data 1, 5, 9, bank 3 delivers data 2, 6, 10, and bank 4 delivers data 3, 7, 11, with each bank starting one cycle after the one below it.]
a calculation to be made concurrently (in the time domain of the program) at every point in a mesh of points. This structure is typical of partial differential equation programs that solve such problems as heat transfer, the strength of mechanical structures, turbulent air flow across aircraft and spacecraft wings, and weather models. A reasonable means for solving such problems is to build a computer composed of many processors and assign the computations at each node in the mesh to individual processors.

The timing diagram for such a processor is shown in Fig. 3.9, where time appears on the horizontal axis, and the processors are plotted on the vertical axis. For a short period of time, only one processor is busy. During this period, the processor initializes variables for the computation, parcels out the work to be done, and performs other similar overhead computations that are serial in nature and not replicated at other processors in the system. Following the initialization, all processors perform the computation in parallel. The process repeats as shown in the figure, with each new parallel operation being preceded by a serial initialization section.

The diagram in Fig. 3.9 shows both the advantage of parallelism and the degradation in performance due to serial computations embedded in the program. The use of many processors provides a potential for high performance, but the serial operations reduce the actual effective parallelism below the maximum attainable. To obtain a reasonable measure of performance for the situation depicted in Fig. 3.9, one interesting parameter to study is the efficiency of a computation. This is a measure of the proportion of time that the processors are busy. Therefore it measures the degradation in peak performance due to serial operations and other effects.
In Fig. 3.9, let the fraction of time spent in serial code be α and the fraction spent in parallel code be 1 - α. With N processors available, the fraction of time that the processors are busy is given by:

Efficiency = α/N + (1 - α) = 1 - α(1 - 1/N)
As the number of processors increases, the limiting efficiency is 1 - α. Compare Fig. 3.9 with Fig. 3.10, which shows the timing diagram for the same problem with twice as many processors used. The serial code is the same length, but the time devoted to parallel code has been shortened, and the fraction α has increased.

In fact, α for any specific problem is a function of N, the number of processors used to solve that problem in parallel. To show this relation, let us normalize the computation times so that the serial execution time takes one unit, and the code that can be run in parallel takes time p on a single processor. The parallel time shrinks to p/N when run on N processors.
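With that normalization, α itself becomes α(N) = 1 / (1 + p/N), and the efficiency formula above can be evaluated directly. A short calculation makes the dependence on N explicit (a sketch; p = 100 is an arbitrary example value):

    # Efficiency of an N-processor computation containing a serial section.
    # Normalized times: the serial part takes 1 unit; the parallelizable part
    # takes p units on one processor and p/N units on N processors.
    def efficiency(N, p):
        serial, parallel = 1.0, p / N
        alpha = serial / (serial + parallel)     # fraction of time in serial code
        return 1.0 - alpha * (1.0 - 1.0 / N)     # the formula given above

    p = 100.0       # arbitrary example: 100 units of parallelizable work
    for N in (1, 2, 10, 100, 1000):
        alpha = 1.0 / (1.0 + p / N)
        print(f"N = {N:5d}: alpha = {alpha:.3f}, efficiency = {efficiency(N, p):.3f}")
    # As N grows, alpha itself grows, so the efficiency keeps falling; this is
    # why doubling the processors in Fig. 3.10 raises alpha.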
...64 on the Cray II, so long vectors have to be treated as a sequence of short vectors. So the vector equations

A := B × C
D := E + A

actually can be implemented as a sequence of vector operations of the form

A0 := B0 × C0
D0 := E0 + A0
A1 := B1 × C1
D1 := E1 + A1

where each subscripted variable denotes a vector of length 64.
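In software terms this splitting of a long vector into register-length pieces looks roughly like the sketch below (the function names and the fused loop structure are illustrative assumptions; only the piece length of 64 comes from the text):

    # Strip mining: operate on a long vector as a sequence of 64-element
    # pieces, matching the machine's vector-register length.
    VLEN = 64

    def vector_multiply(b, c):
        return [x * y for x, y in zip(b, c)]      # stands in for one vector op

    def vector_add(e, a):
        return [x + y for x, y in zip(e, a)]

    def fused_loop(b, c, e):
        """Compute A := B x C and D := E + A, 64 elements at a time."""
        a, d = [], []
        for start in range(0, len(b), VLEN):
            piece = slice(start, start + VLEN)
            a_piece = vector_multiply(b[piece], c[piece])   # A_k := B_k x C_k
            d_piece = vector_add(e[piece], a_piece)         # D_k := E_k + A_k
            a.extend(a_piece)
            d.extend(d_piece)
        return a, d

    n = 200                                        # any length, not a multiple of 64
    b, c, e = [1.0] * n, [2.0] * n, [3.0] * n
    a, d = fused_loop(b, c, e)
    print(len(a), len(d), a[0], d[0])              # 200 200 2.0 5.0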
Potential overlap is available either by overlapping the add and the multiply operating on the same set of data, or by overlapping an add with an add or a multiply with a multiply where the operations are applied to different sets of data. The overlap of addition with multiplication involves chaining the output of one computation to the input of the next. But the overlap of two vector addition or two vector multiplication operations on independent data requires no chaining of this type.

The architecture of the Cray II is such that both ways of obtaining overlap yield roughly equal speed. Hence, some programs that might require chaining on other architectures receive relatively less benefit from chaining on the Cray II when there are other viable alternatives for overlapping computations.

In actual practice, pipelines rarely exceed ten to 20 stages, and systems with ...
