`
We learn in Chapter 2 that memory is a major bottleneck in high-speed computers, and that the bottleneck can be relieved somewhat by taking advantage of the characteristics of typical programs. The objective has been to store the most-frequently referenced items in fast memory and less-frequently referenced items in slower memory. It is not necessary to make all memory equally fast; we need use only as much fast memory as necessary to hold the active regions of a program and data.

This chapter concerns a different approach to relieving bottlenecks. The idea is to use parallelism at the point of the bottleneck to improve performance globally. If the design techniques are successful, then the extra hardware devoted to performance enhancement is present in only a small portion of a computer system, yet its effect is to increase performance as if the full computer system were replicated.
`To contrast the approaches of the last chapter and the present one, in one
`case the speed differential is due to faster hardware, whereas in the second
case the speed increase is obtained by replicating slower hardware. In both cases, clever architecture is required to create efficient computer systems in which enhancements of a relatively small cost have a global impact on performance.
Pipeline computer techniques described in this chapter are by far the most popular means for enhancing performance through parallel hardware. The basic ideas from which pipeline techniques developed are apparent in von Neumann's proposal to build the first stored-program computer. In Burks et al. [1946], a discussion on input/output techniques describes a buffer arrangement that would permit computation to be overlapped with input/output operations and thereby provide a crude form of pipeline processing that is used widely in today's machines.
Although von Neumann did not build the input/output capability into his first machine, the basic ideas for pipelined computer design evolved rapidly after the first appearance of magnetic-core memory as the primary storage medium for main memory. This storage was roughly a factor of 10 or more slower per cycle than the transistor technology used in high-speed registers and control logic. Designers quickly conceived of a variety of techniques to initiate one or more concurrent accesses to memory while executing instructions in the central processor. This body of techniques eventually evolved and is exemplified in the pipeline-processing structures described in this chapter.
In the 1960s, when hardware costs were relatively high, pipelined computers were the supercomputers. IBM's STRETCH and CDC's 6600 were two such designs from the early 1960s that made extensive use of pipelining, and these designs strongly influenced the structure of supercomputers that followed. By the 1980s, hardware costs had diminished to the extent that pipeline techniques could be implemented across the entire range of performance, and indeed, even the Intel 8086 microprocessor, which costs just a few dollars, uses pipelined accesses to memory while performing on-chip computation.
`Our approach is to develop a basic understanding of the principles of
`pipeline design in the next section. In subsequent sections we observe where
`it can be used and how to design effective pipelines.
`
`3.1 Principles of Pipeline Design
`
The basic idea behind pipeline design is quite natural; it is not specific to computer technology. In fact, the name pipeline stems from the analogy with petroleum pipelines, in which a sequence of hydrocarbon products is pumped through a pipeline. The last product might well be entering the pipeline before the first product has been removed from the terminus. In the remainder of this section we treat pipeline design first in abstract terms, and then follow with concrete examples.
`
`2
`
`3 r---~
`
`2 I 3 ~---~
`,~ -1----.-2--r,~3~r---~
`._____._I _2___,_I _3___/r-- -~
`'---'--2~1~3~r---~
`
`Fig. 3.2 Pipelined execution of an N-step process.
`
`Time---
`
`struction. This sequence has traditionally been implemented using pipeline
`design. A typical instruction-execution sequence might be:
`
`1. Instruction fetch: obtain a copy of the instruction from memory.
`2. Instruction decode: examine the instruction and prepare to initialize the
`control signals required to execute the instruction in subsequent steps.
3. Address generation: compute the effective address of the operands by performing indexing or indirection as specified by the instruction.
`4. Operand fetch: for READ operations, obtain the operand from central
`memory.
`
[Fig. 3.3 Two ways to execute N-unit jobs in a stream: (a) sequential execution with a single N-unit server, yielding one completion every N time units; and (b) pipelined execution with a chain of 1-unit servers, yielding one completion every time unit.]
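
The completion rates in the two organizations are easy to check. The short Python sketch below (ours, not from the text) lists when each job finishes under the two arrangements of Fig. 3.3:

def sequential_completion_times(num_jobs, n):
    # One N-unit server: job k finishes at time (k + 1) * n.
    return [(k + 1) * n for k in range(num_jobs)]

def pipelined_completion_times(num_jobs, n):
    # A chain of n 1-unit servers: n units to fill the pipe,
    # then one completion every unit; job k finishes at n + k.
    return [n + k for k in range(num_jobs)]

print(sequential_completion_times(8, 4))  # [4, 8, 12, ..., 32]
print(pipelined_completion_times(8, 4))   # [4, 5, 6, ..., 11]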
`
`line, but Fig. 3.4 suggests that subsequent instructions have followed the
`conditional branch down the pipeline and may have altered the state of one or
`more machine registers even before the outcome of the conditional branch is
`known. The danger is that we cannot be sure what to execute until the
`condition has been evaluated.
One way to assure correct execution of conditional branches is to interlock the instruction-fetch stage with the program-counter update stage. Immediately after a conditional branch is fetched by the first stage, no further fetches take place until the branch reaches the last stage. At this point the last stage produces the correct target address and removes the interlock, allowing the first stage to continue instruction fetches at the new address.
The problem with this solution is the lost performance caused by the inactive pipeline during the period a conditional branch is pending. Conditional branches occur quite frequently in most conventional computers, possibly once every five to ten cycles on the average. If a pipeline were idled when each such branch is encountered, then from one-fifth to one-tenth of the machine cycles would be lost waiting for branches, a fairly hefty penalty for a pipeline implementation.
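
The penalty arithmetic can be made explicit with a simple model (the model is ours; the stall counts are illustrative assumptions, not measurements). If one instruction in every k is a conditional branch and each branch stalls the pipeline for s cycles, the fraction of machine cycles lost is:

def lost_fraction(k, s):
    # Each instruction costs 1 cycle plus, for branches, s stall cycles;
    # the lost fraction is the stall cycles over the total cycles.
    f = 1.0 / k
    return (f * s) / (1.0 + f * s)

print(lost_fraction(5, 1))   # about 0.17, near the one-fifth bound above
print(lost_fraction(10, 1))  # about 0.09, near the one-tenth bound
print(lost_fraction(5, 6))   # about 0.55 if each branch drains a 7-stage pipe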
There are other interactions of concern as well, all of which tend to force the designer to add complexity to ensure correct pipeline operation. These interactions could severely impede performance if the resulting implementation is idle a significant fraction of time because of the stage-to-stage interlocks. Consequently, the designer is concerned with building a pipeline that is simultaneously correct, efficient, and fast. How this is done is the subject of the material that follows.
`If we assume that each of the steps in the execution sequence is a single
`cycle, then a typical instruction takes seven cycles to execute. In several
`computer systems the pipeline takes advantage of the fact that the majority of
`instructions do not perform both a READ and a WRITE operation, and the
`number of stages is shortened to six. For our discussion we use the simpler,
`but possibly less realistic, pipeline shown in Fig. 3.4.
`The structure of Fig. 3.4 produces one instruction completed during each
`machine cycle, for an improvement in performance of a factor of 7 over a
`purely sequential implementation (or a factor of 6 faster than a sequential
`implementation that provides for only a READ or a WRITE cycle, but not
`both, for each instruction).
In actual practice the performance improvement depends on the ability to split the processes into steps of equal duration. Suppose, for example, that a memory access takes four machine cycles, not just one, as we have presumed thus far. Then the three different steps in Fig. 3.4 that refer to memory accesses take four cycles each, whereas all other steps take just one cycle. Now the average rate of instruction completion depends on the slowest process in the sequence, which in this case is a memory operation that produces one result every four cycles. No matter how fast the other processes are, they
`cannot produce results any faster on the average than the slowest processor in
`the chain.
In effect, we can run all processes in Fig. 3.4 at a rate that produces one result every four cycles at the output of each process. The original instruction execution requires three memory accesses (an instruction fetch, a READ, and a WRITE) and four single-cycle steps, for a total of 16 cycles. The structure in the figure produces one result every 4 cycles, for a speedup of 4. This is somewhat lower than our original estimate of a speedup of 7.
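
This arithmetic is captured by the sketch below (ours; the assignment of cycle counts to the seven steps is assumed, and only the multiset of stage times matters):

# Three memory steps (instruction fetch, READ, WRITE) at four cycles each,
# and four single-cycle steps, as in the example above.
stage_cycles = [4, 1, 1, 4, 1, 4, 1]

sequential_time = sum(stage_cycles)    # 16 cycles per instruction
pipeline_interval = max(stage_cycles)  # one completion every 4 cycles
speedup = sequential_time / pipeline_interval

print(sequential_time, pipeline_interval, speedup)  # 16 4 4.0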
`By adding some delays to the pipeline of Fig. 3.4, we can create a longer
`pipeline that produces one result every cycle. In Fig. 3.5 we have added three
`stages of delay to every memory operation, making the pipeline a total of 16
`units long. But in this pipeline new instructions can be initiated every cycle
`rather than every four cycles, so that it produces results at a rate about 16
`times faster than a purely sequential implementation of a processing unit.
Although Fig. 3.5 seems to make the speedup deceptively simple to obtain, actually the memory system must be able to support the processing rate,
`and this may be quite a feat to accomplish. For the structure shown running
`at its full processing rate, there are 12 active stages with different memory
`operations at any given time. So the memory system must be able to support
`12 independent concurrent accesses in a nonconflicting way.
`In this case, a pipelined structure in one subsystem shifts the processing
bottleneck to a different subsystem. Obviously, to make the best use of pipelining for a processor structured in a pipeline fashion as shown in Fig. 3.5, the
`architect should extend pipeline techniques or other high-speed techniques
`to the memory system as well so that the maximum processing capacities of
`various subsystems are matched to each other.
`Up to this point the design principles have been very simple. We identify a
task that performs a sequence of operations. Then we build a chain of processors and perform one step of the sequence on each processor in the chain.
`We have just observed that we may have to break some of the steps into
`smaller steps to create a pipeline in which all stages produce results at the
`same rate. Lengthy steps may have to be broken into two or more smaller
`steps to meet this requirement.
The design would be satisfactory at this point if there were no interactions among two or more tasks that were at different stages of the pipe. If such interactions are present, then the pipeline must account for these in some way, which considerably increases the difficulty of producing an efficient design and adds to the complexity of the final result.
`For the moment, however, let us explore a practical application of the
structure we have outlined. One way to guarantee that there are no interactions among tasks in a pipeline is to make sure that any two or more tasks
`concurrently in the pipeline are totally independent and have no conflicts.
`The 16 cycles allocated in Fig. 3.5 could be allocated to 16 different processors
`on a round-robin basis. And this is the thinking that guided the design of the
`CDC 6600 peripheral controllers.
Figure 3.6(a) shows the pipeline structure of the peripheral controller
`"barrel" for the CDC 6600. Each stage takes one cycle, and ten cycles are
`required to execute one instruction, including a fetch or store to local
`memory. In this particular design all events take place at one of the ten steps
`in the process cycle. The other nine steps are simple delays that hold the
`machine state while awaiting a memory access. The state of each task held in
`a CDC 6600 peripheral processor is very small, only 51 bits, and consists of:
`
`• The accumulator (18 bits);
`• The program counter (12 bits);
`• The operand address or other instruction-specific data (12 bits); and
`• The instruction code and trip counter (9 bits).
`
`During one traversal of ten stages, a peripheral processor performs one READ
or WRITE to local memory, together with related actions. If more traversals are required, the instruction circulates through the stages multiple times.
`The trip counter indicates which traversal is currently in progress, and it
`thereby controls how each instruction is executed as a function of its trip
`number through the barrel.
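
The rotation of processor states through the barrel can be sketched in a few lines (a simplified model of ours, not CDC documentation; the execution step shown is a stand-in for the real instruction logic):

from collections import deque

class PPUState:
    # The 51 bits of peripheral-processor state listed above.
    def __init__(self, pid):
        self.pid = pid
        self.accumulator = 0   # 18 bits
        self.pc = 0            # 12 bits
        self.operand = 0       # 12 bits
        self.opcode_trip = 0   # 9 bits: instruction code and trip counter

def run_barrel(cycles):
    barrel = deque(PPUState(i) for i in range(10))
    for _ in range(cycles):
        state = barrel[0]                 # state now at the execution stage
        state.pc = (state.pc + 1) % 4096  # stand-in for one execution step
        barrel.rotate(-1)                 # all ten states advance one stage

run_barrel(100)   # each state visits the execution stage every ten cycles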
`All processor activity occurs between Stages 9 and 10 shown in Fig. 3.6(a).
`In one trip through the ten stages, the processor can do an instruction fetch,
`an operand fetch together with an instruction execution, or one fetch of an
`indirect address. (The processor can also fetch long operands from the main
`memory. This takes five trips through the pipeline since the operand width of
`the pipeline is only 12 bits, but operands in main memory contain 60 bits.)
`In this architecture, the first trip through the ten-stage pipeline is an
`instruction fetch, which starts at the clock that latches a processor state in
`Stage 10. At this time the program counter is sent to memory to initiate a
READ cycle to obtain the next instruction. That instruction reaches the processor at the clock time that latches the corresponding processor state into
`Stage 9, and is ready to join that processor state as the state moves from Stage
`9 to Stage 10 at the end of the next trip through the pipeline.
`Since each instruction in this processor contains two fields, one each for
the instruction code and operand address, the instruction fetch concludes by loading the instruction code and address registers with the instruction obtained by the memory fetch. The trip counter is also initialized and the
`instruction decode takes place at this point, so the processor is prepared for
`the next cycle to read or write an operand or to read an indirect address as
`required by the instruction.
`Assume that the instruction is a LOAD or an ADD and that the addressing
`mode is direct. Execution of this instruction is completed during the second
`trip through the pipeline. In that trip, the operand from memory is fetched,
`and it arrives at the processor just as the corresponding processor state
`reaches Stage 9 and is ready to move to Stage 10.
`Note that Fig. 3.6(a) shows a data path for the accumulator that passes
`through an adder/shifter at the execution stage. As the accumulator data
`propagates on this path, the operand data joins it. The instruction sets the
`adder/shifter to add, subtract, logical AND, or logical OR its two operands, or
`shift or copy one of the two operands into the accumulator in Stage 10. Hence,
`the logic to perform the full repertoire of instructions is in the path between
`Stages 9 and 10. At the same stage, the program counter is adjusted by
`incrementing it (for sequential execution) or by loading it from the address
`register (for branches).
`It is clear from Fig. 3.6 why this design is both fast and efficient. The basic
`logic for one processor is placed only in the execution stage of the pipeline. All
`other stages of the pipeline contain state storage only. Consequently, this
`implementation of ten processors contains one execution unit (but ten full
`register sets and ten memory systems), so the cost is less than the cost of ten
`individual machines. Yet the performance capacity is equal to the capacity of
`ten machines, provided that the ten machines operate independently and do
`not contend for shared data, nor interfere mutually in other ways. The cost
`benefit of this approach was very high in the mid-1960s when the cost of the
`logic saved was quite substantial. The reader should consult Thornton [1970]
`for a detailed description of the implementation.
`Figure 3.6(a) illustrates an actual implementation of the idealized
`pipeline design for the processor shown in Fig. 3.6(b). Each stage in this
`design consists of a set of latches (data registers) that hold data in transit
`through the pipeline.
`Between stages is combinational logic through which data propagates.
`The idea is that all activity occurs at successive clock times. At a clock tick,
`each bank of latches reads its input data and captures that data in its internal
`storage. After a short propagation delay, the data appear at the latch outputs
`and begin propagating to the next stage in the pipeline.
`The logic between stages transforms the data, for example, by shifting or
`adding, or by some other elementary operation that implements one stage of
`the pipeline function. Eventually the transformed data reach the input pins of
`the next rank of latches. The next clock tick is delayed long enough with
`respect to the preceding clock tick to ensure that all data in the pipeline have
`propagated to the input pins of the next stage. Only then are the stage outputs
`allowed to change.
`The purpose of the clock is to synchronize the data at the input to each
`stage. The clock is set as fast as possible within the normal constraints of
`reliability, power consumption, and engineering tolerances. It must be slow
`enough to accommodate the slowest propagation path.
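
The discipline just described can be modeled directly (a sketch of ours; the stage functions are arbitrary placeholders for real combinational logic):

def tick(latches, logic, new_input):
    # One clock tick of a synchronous pipeline. latches[i] models the data
    # register at stage i; logic[i] is the combinational block between
    # stage i and stage i+1. All latches capture simultaneously, so the
    # old values are read first, as edge-triggered registers do.
    old = list(latches)
    latches[0] = new_input
    for i in range(1, len(latches)):
        latches[i] = logic[i - 1](old[i - 1])
    return old[-1]   # the datum emerging from the final stage

# Example: each datum is incremented, then doubled, as it flows through.
latches = [0, 0, 0]
logic = [lambda x: x + 1, lambda x: 2 * x]
outputs = [tick(latches, logic, n) for n in range(6)]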
The simplicity of the design in Fig. 3.6(a) is somewhat deceptive. Because the ten processors are totally separate, there are no interactions in the pipeline,
and no extra hardware is required to deal with unusual conditions. If we
`consider the design of a processing unit for a single stream of instructions, the
`situation is complicated by several factors. The ones that normally require
`attention are:
`
`• Conditional branches must complete their execution before the address of
`the next instruction is known. Therefore, when a conditional branch is
`encountered at Stage 1 of a pipeline, subsequent data that reaches Stage 1
`must be treated tentatively until the target of the conditional branch is
`known.
`• An instruction in the pipeline might produce a result that is used by a
`later instruction in the pipeline. The second instruction must be delayed
`somehow until the result is available.
• An instruction in the pipeline might produce a result for a register whose previous contents must be read by an earlier instruction in the pipeline. The second operation must not destroy the current contents of that register until that register is free to be rewritten.
`• Two different instructions in the pipeline may write data into a common
`location, but the pipeline may reverse the order in which the location is
`updated.
`
`We have already mentioned the problem concerning conditional
`branches. The other three points are instances of a more general problem of
`concurrent operations on shared resources. In the specific cases here, the
`shared resources are registers or data in main memory.
`If the basic operations are READ and WRITE, we have four ways that two
`processes can interact, namely READ/READ, READ/WRITE, WRITE/READ,
`and WRITE/WRITE, where the first label indicates the earlier operation and
`the second label indicates the second operation. The idea is that a program is
`written for correct execution under the assumption that instructions are
`executed sequentially, with one instruction fully completed before the next is
`started. The pipeline implementation violates this assumption, but it must
`give the appearance that the assumption holds.
`For READ/READ interactions, we have a situation in which a datum is to
`be read first by Instruction 1 and then by Instruction 2. Can we permit these
`operations to be done out of order? The answer is yes, so this conflict is not of
`concern. However, if done out of order, WRITE/WRITE interactions leave the
`shared cell in the wrong state, and subsequent READs to the shared item will
`obtain the wrong value. Similarly, if the order of WRITE/READ is reversed,
`the READ obtains old data, not new data as intended. And if READ/WRITE is
`reversed, the READ obtains new data, not the old data as intended. Each of
`these three operation pairs needs to be detected and interlocked when they
`occur on shared data.
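
A sketch of the detection step (ours, using sets of register names; RAW, WAR, and WAW are the modern names for the WRITE/READ, READ/WRITE, and WRITE/WRITE pairs):

from dataclasses import dataclass

@dataclass
class Instr:
    reads: set
    writes: set

def conflicts(first, second):
    # first precedes second in program order.
    found = []
    if first.writes & second.reads:
        found.append("WRITE/READ")   # second needs a result of the first
    if first.reads & second.writes:
        found.append("READ/WRITE")   # second must not overwrite data the first reads
    if first.writes & second.writes:
        found.append("WRITE/WRITE")  # the two updates must not be reordered
    # READ/READ needs no interlock: the reads may occur in either order.
    return found

i1 = Instr(reads={"R1", "R2"}, writes={"R3"})
i2 = Instr(reads={"R3", "R6"}, writes={"R1"})
print(conflicts(i1, i2))   # ['WRITE/READ', 'READ/WRITE']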
`
`The WRITE/WRITE conflict appears to be unusual because it requires a
`program to write a datum to memory and then immediately overwrite that
`datum. Such sequences could happen, however, if a WRITE is followed by a
`conditional branch, one of whose outcomes immediately performs the second
`WRITE.
A common WRITE/WRITE conflict on a shared variable is the updating of a processor's state bits (often called the condition codes) that are written as a consequence of instruction execution. The need to avoid WRITE/WRITE conflicts on these variables requires that they be updated in the order in which they would be written on a purely sequential processor without pipelined execution. If a
`processor can detect that no intervening READ occurs between two WRITEs
`to a shared variable, the processor can execute the second WRITE first, then
`abort the first WRITE when its point of execution occurs.
`READ/WRITE and WRITE/READ conflicts are a fact of life and the bane
`of pipeline designs. The CDC 6600 design faces this problem in a central
processor that has ten independent functional units dedicated to such functions as add, multiply, divide, shift, normalize, branch, increment, and logic
`(Boolean). The divide is rather lengthy compared to the shift and Boolean
`functions. Consequently, it is quite possible to encounter pipeline conflicts
`resulting from instructions completing execution out of order.
The approach used in the CDC 6600 is to implement instructions with three addresses to support overlapped execution and to use a hardware structure that the designers called a scoreboard to search for conflicts. The three-address format of instructions takes the general form:

Ri := Rj OP Rk

where OP designates one of the functional units. In this fashion it becomes possible to take advantage of pipeline execution when evaluating such functions as (A × B) + (C × D), which can be encoded as:

R3 := R1 × R2
R6 := R4 × R5
R1 := R3 + R6
`
The second instruction uses a distinct set of registers from the first instruction so that it can be executed in a pipelined fashion, provided that two distinct multipliers are available. The third instruction is interlocked with the first two instructions, and must be delayed for their completion. Note the WRITE/READ conflicts for Registers 3 and 6 and a READ/WRITE conflict for Register 1, any of which would result in deferring the execution of the addition instruction.
`
`The scoreboard keeps track of conflicts and defers instructions for any of
`the following conditions:
`
`• WRITE/WRITE conflict;
`• WRITE/READ conflict;
`• READ/WRITE conflict; and
`• Function unit conflict.
`
`In the case of WRITE/READ and READ/WRITE conflicts, the conflicting
`instructions are passed through the early stages of the pipeline, in spite of the
`conflicts, so that they can be decoded and their operands can be requested.
`They are moved to their respective function units and deferred at those places
`to wait until the conflicting conditions are cleared. In so doing, the early
`stages of the pipeline are cleared and made available for new instructions.
`The WRITE/WRITE and function unit conflicts are deferred before being
`issued to the function units, which causes the pipeline to freeze until the
`conflicting condition is cleared.
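
The two-level treatment of the four conditions can be sketched as follows (a drastic simplification of ours, not the actual 6600 logic; the names are illustrative):

from dataclasses import dataclass

@dataclass
class Op:
    unit: str           # functional unit name, e.g. "multiply"
    dest: str           # register written
    sources: frozenset  # registers read

def can_issue(op, busy_units, pending_writes):
    # Checked before issue; failure here freezes the pipeline.
    if op.unit in busy_units:
        return False                 # function-unit conflict
    if op.dest in pending_writes:
        return False                 # WRITE/WRITE conflict
    return True

def can_start(op, pending_writes, pending_reads):
    # Checked at the function unit; failure defers only this instruction,
    # leaving the early pipeline stages free for new instructions.
    if op.sources & pending_writes:
        return False                 # WRITE/READ: an operand is not ready
    if op.dest in pending_reads:
        return False                 # READ/WRITE: an earlier reader is not done
    return True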
`
3.2 Memory Structures in Pipeline Computers
`
The previous section indicates a general structure for pipeline computers and illustrates how pipeline execution has been implemented in an early supercomputer. The major innovation of pipeline design is the ability to complete instructions at a rate much faster than that achievable by executing instructions without overlapping.
`We have learned that one major obstacle to continuous execution at the
`maximum pipeline rate is the requirement that pipeline execution must
`behave exactly as if the same instructions were executed without overlap.
`This gives rise to conflicts that are detectable by hardware and results in
`freezing of some operations pending the completion of others.
`Let us reconsider how pipelined processors work in the context of
memory systems. In the previous section, an ideal pipelined processor decodes one instruction in each machine cycle. To maintain that rate, we clearly
`need to supply instructions and data to the processor at a rate consistent with
`the needs of the pipeline. No single large physical memory available today
`can meet such demands because, at best, a memory cycle usually runs from
`four to 20 processor cycles. Moreover, the ratios are not likely to change in the
`future. However, cache memory can be implemented to deliver one result per
`cycle, and therefore cache memory can potentially supply the demands of a
`pipelined processor.
`The actual requirements for a pipelined processor depend specifically on
`
`The timing for accessing the banks is shown in Fig. 3.8. Note that the
`banks are active in an overlapped fashion, with the second bank starting its
`access one machine cycle behind the previous bank. So even though cycle
`times are large, the processor can request a vector of data and will receive one
`datum per cycle after a startup transient. The key to this technique is that
`successive elements of the vector of data must lie in successive memory
`banks.
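
The placement rule and the resulting timing can be sketched as follows (ours; the bank count and cycle time are assumptions chosen to match the example):

def fetch_schedule(num_elements, num_banks, bank_cycle):
    # Element i lives in bank i mod num_banks; one new access starts each
    # processor cycle, so a bank is revisited exactly when it goes idle.
    return [(i, i % num_banks, i, i + bank_cycle)
            for i in range(num_elements)]

for elem, bank, start, ready in fetch_schedule(8, 4, 4):
    print(elem, bank, start, ready)   # data arrive one per cycle from cycle 4 on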
`
3.3 Performance of Pipelined Computers
`
The major performance characteristic of pipelined computers is that the rate of computation can be very high even though the elapsed time of individual operations might be very long. Since performance is measured by the number of completed operations per unit time, pipeline-design techniques produce high performance in spite of long delays associated with specific operations.
To the extent that a high rate of completion can be maintained, pipeline design is an effective method for increasing performance. But when the rate of completion drops off, possibly because of the inability to keep data flowing through a pipeline on a continuous basis, the pipeline becomes rather costly, and its cost may outweigh the benefits of using this design technique.
`A very interesting way to view the efficiency of pipelined computers is to
`compare them to highly parallel array (vector) computers. The analysis for
`the array computer is very easy to derive and understand, and it corresponds
`directly to the analysis for pipelined computers. This discussion is originally
`due to observations by T. C. Chen [1980].
`Consider a computation that is natural to execute on a parallel computer
`because of its highly repetitive structure. A typical computation of this type is
`
`4
`
`-E 3
`ca
`co 2
`
`1
`
`I DATUM 3 /DATUM 7 /DATUM 111
`I DATUM 2 I DATUM 6 [DATUM 101
`[DATUM 1 [DATUM s [DATUM 91
`[DATUM o I DATUM 4 [DATUM a I
`
`Time - - -
`Fig. 3.8 Timing for pipelined fetches for the memory shown in Fig. 3.37.
`
`a calculation to be made concurrently (in the time domain of the program) at
every point in a mesh of points. This structure is typical of partial-differential-equation programs that solve such problems as heat transfer, the strength of mechanical structures, turbulent air flow across aircraft and spacecraft wings, and
`weather models. A reasonable means for solving such problems is to build a
`computer composed of many processors and assign the computations at each
`node in the mesh to individual processors.
`The timing diagram for such a processor is shown in Fig. 3.9, where time
`appears on the horizontal axis, and the processors are plotted on the vertical
`axis. For a short period of time, only one processor is busy. During this period,
`the processor initializes variables for the computation, parcels out the work
`to be done, and performs other similar overhead computations that are serial
`in nature and not replicated at other processors in the system. Following the
`initialization, all processors perform the computation in parallel. The
`process repeats as shown in the figure, with a new parallel operation being
`preceded by a serial initialization section.
The diagram in Fig. 3.9 shows both the advantage of parallelism and the degradation in performance due to serial computations embedded in the program. The use of many processors provides a potential for high performance, but the serial operations reduce the actual effective parallelism below the maximum attainable. To obtain a reasonable measure of performance for the situation depicted in Fig. 3.9, one interesting parameter to study is the efficiency of a computation. This is a measure of the proportion of time that the processors are busy. Therefore it measures the degradation in peak performance due to serial operations and other effects.
In Fig. 3.9, let the fraction of time spent in serial code be α and the fraction spent in parallel code be 1 - α. With N processors available, the fraction of time that processors are busy is given by:

Efficiency = α/N + (1 - α)
           = 1 - α(1 - 1/N)
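
As a function (a direct transcription of the formula, with α and N as above):

def efficiency(alpha, n):
    # alpha/n for the serial section plus (1 - alpha) for the parallel one.
    return 1.0 - alpha * (1.0 - 1.0 / n)

for n in (1, 2, 16, 1024):
    print(n, efficiency(0.1, n))   # approaches 1 - alpha = 0.9 as N grows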
`
As the number of processors increases, the limiting efficiency is 1 - α. Compare Fig. 3.9 with Fig. 3.10, which shows the timing diagram for the same problem with twice as many processors used. The serial code is the same length, but the time devoted to parallel code has been shortened, and the fraction α has increased.
In fact, α for any specific problem is a function of N, the number of processors used to solve that problem in parallel. To show this relation, let us normalize the computation times so that the serial execution time takes one unit, and the code that can be run in parallel takes time p on a single processor. The parallel time shrinks to p/N when run on N processors. The
`
64 on the Cray II, so long vectors have to be treated as a sequence of short vectors. So the vector equations

A := B × C
D := E + A

actually can be implemented as a sequence of vector operations of the form

A0 := B0 × C0
D0 := E0 + A0
A1 := B1 × C1
D1 := E1 + A1

where each subscripted variable denotes a vector of length 64.
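
The splitting itself is the strip-mining pattern, sketched below (ours; scalar loops stand in for the machine's vector instructions):

SECTION = 64   # the vector-register length

def multiply_add(b, c, e):
    # A := B × C and D := E + A, carved into 64-element sections.
    n = len(b)
    a = [0] * n
    d = [0] * n
    for lo in range(0, n, SECTION):
        hi = min(lo + SECTION, n)
        for i in range(lo, hi):
            a[i] = b[i] * c[i]    # A_k := B_k × C_k
        for i in range(lo, hi):
            d[i] = e[i] + a[i]    # D_k := E_k + A_k
    return a, d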
Potential overlap is available by overlapping either the add and multiply operating on the same set of data, or by overlapping an add with an add or a multiply with a multiply where the operations are applied to different sets of data. The overlap of addition with multiplication involves chaining the output of one computation to the input of the next. But the overlap of two vector addition or two vector multiplication operations on independent data requires no chaining of this type.

The architecture of the Cray II is such that both ways of obtaining overlap yield roughly equal speed. Hence, some programs that might require chaining on other architectures receive relatively less benefit from chaining on the Cray II when there are other viable alternatives for overlapping computations.
`In actual practice, pipelines rarely exceed ten to 20 stages, and systems
`with