© 1993 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
`
`Instruction-Level Parallel Processing:
`History, Overview, and Perspective
`
`B. RAMAKRISHNA RAU AND JOSEPH A. FISHER
Hewlett-Packard Laboratories, 1501 Page Mill Road, Bldg. 3U, Palo Alto, CA 94304
`
`(October 20, 1992)
`
Abstract. Instruction-level parallelism (ILP) is a family of processor and compiler design techniques that speed
up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the
highest performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force
in computer design. Several systems were built and sold commercially, which pushed ILP far beyond where it
had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design
of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had
incorporated ILP, and new techniques for ILP had become a popular topic at academic conferences. This article
provides an overview and historical perspective of the field of ILP and its development over the past three decades.

Keywords. Instruction-level parallelism, VLIW processors, superscalar processors, pipelining, multiple operation
issue, speculative execution, scheduling, register allocation.
`
1. Introduction
`
Instruction-level parallelism (ILP) is a family of processor and compiler design techniques
that speed up execution by causing individual machine operations, such as memory loads
and stores, integer additions, and floating point multiplications, to execute in parallel. The
operations involved are normal RISC-style operations, and the system is handed a single
program written with a sequential processor in mind. Thus an important feature of these
techniques is that, like circuit speed improvements, but unlike traditional multiprocessor
parallelism and massively parallel processing, they are largely transparent to users. VLIWs
and superscalars are examples of processors that derive their benefit from instruction-level
parallelism, and software pipelining and trace scheduling are example software techniques
that expose the parallelism that these processors can use.
Although small amounts of ILP have been present in the highest performance uniprocessors
of the past 30 years, the 1980s saw it become a much more significant force in computer
design. Several systems were built and sold commercially, which pushed ILP far beyond
where it had been before, both in terms of the amount of ILP offered and in the central
role ILP played in the design of the system. By the early 1990s, advanced microprocessor
design at all major CPU manufacturers incorporated ILP, and new techniques for ILP became
a popular topic at academic conferences. With all of this activity we felt that, in contrast
to a report on suggested future techniques, there would be great value in gathering, in an
archival reference, reports on experience with real ILP systems and reports on the measured
potential of ILP. Thus this special issue of The Journal of Supercomputing.
`
1.1. ILP Execution
`
A typical ILP processor has the same type of execution hardware as a normal RISC machine.
The difference between a machine with ILP and one without is that there may be more
of that hardware, for example, several integer adders instead of just one, and that the control
will allow, and possibly arrange, simultaneous access to whatever execution hardware
is present.
Consider the execution hardware of a simplified ILP processor consisting of four functional
units and a branch unit connected to a common register file (Table 1). Typically
ILP execution hardware allows multiple-cycle operations to be pipelined, so we may assume
that a total of four operations can be initiated each cycle. If in each cycle the longest latency
operation is issued, this hardware could have ten operations "in flight" at once, which
would give it a maximum possible speedup of a factor of ten over a sequential processor
with similar execution hardware. As the papers in this issue show, this execution hardware
resembles that of several VLIW processors that have been built and used commercially,
though it is more limited in its amount of ILP. Several superscalar processors now being
built also offer a similar amount of ILP.
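To see where the figure of ten comes from (a quick check of the arithmetic, assuming that the longest-latency operation from Table 1 is issued on each unit every cycle): the two integer units each sustain a two-cycle load or multiply, and the two floating point units each sustain a three-cycle operation, so

\[ 2 + 2 + 3 + 3 = 10 \]

operations can be in flight simultaneously.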
There is a large amount of parallelism available even in this simple processor. The challenge
is to make good use of it; we will see that with the technology available today, an
ILP processor is unlikely to achieve nearly as much as a factor of ten on many classes
of programs, though scientific programs and others can yield far more than that on a processor
that has more functional units. The first question that comes to mind is whether enough
ILP exists in programs to make this possible. Then, if this is so, what must the compiler
and hardware do to successfully exploit it? In reality, as we shall see in Section 4, the
two questions have to be reversed; in the absence of techniques to find and exploit ILP,
it remains hidden, and we are left with a pessimistic answer.
Figure 1a shows a very large expression taken from the inner loop of a compute-intensive
program. It is presented cycle by cycle as it might execute on a processor with functional
units similar to those shown in Table 1, but capable of having only one operation in flight
at a time.
`
Table 1. Execution hardware for a simplified ILP processor.

Functional Unit                  Operations Performed         Latency
Integer unit 1                   Integer ALU operations       1
                                 Integer multiplication      2
                                 Loads                        2
                                 Stores                       1
Integer unit 2 / branch unit     Integer ALU operations       1
                                 Integer multiplication      2
                                 Loads                        2
                                 Stores                       1
                                 Test-and-branch              1
Floating point unit 1            Floating point operations    3
Floating point unit 2            Floating point operations    3
`
`
`
`
`
[Figure 1 appears here: (a) a cycle-by-cycle sequential record of execution for the loop, with nops filling the latency cycles; (b) the same loop scheduled across two integer ALUs and two floating point ALUs.]

Figure 1. (a) An example of the sequential record of execution for a loop. (b) The instruction-level parallel record of execution for the same loop.
`
Figure 1b shows the same program fragment as it might be executed on the hardware
indicated in Table 1.
Note that several of the cycles in Figure 1a contain no-ops. This is because the sequential
processor must await the completion of the three-cycle latency multiply issued in cycle 1
before issuing the next operation. (These no-ops would not appear in the text of a program,
but are shown here as the actual record of what is executed each cycle.) Most instruction-level
parallel processors can issue operations during these no-op cycles, when previous
operations are still in flight, and many can issue more than one operation in a given cycle.
In our ILP record of execution (Figure 1b), both effects are evident: In cycle 1, four operations
are issued; in cycle 2, two more operations are issued even though neither multiply
in cycle 1 has yet completed execution.
This special issue of The Journal of Supercomputing concerns itself with the technology
of systems that try to attain the kind of record of execution in Figure 1b, given a program
written with the record of execution in Figure 1a in mind.
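The contrast between the two records of execution can be reproduced with a small simulation. The sketch below (Python; the four-operation dependence chain, the latencies, and the issue width of 4 are illustrative assumptions patterned on Table 1, not the actual code of Figure 1) issues one operation at a time in the sequential case and up to four ready operations per cycle in the ILP case:

# Sketch: sequential vs. ILP records of execution for a toy dependence graph.
# Each operation has a latency and a list of operations whose results it needs.
OPS = {
    "t1": {"latency": 2, "deps": []},            # e.g., an integer multiply
    "t2": {"latency": 2, "deps": []},            # independent of t1
    "t3": {"latency": 1, "deps": ["t1", "t2"]},  # needs both multiplies
    "t4": {"latency": 2, "deps": ["t3"]},
}
ORDER = ["t1", "t2", "t3", "t4"]                 # the sequential program order

def sequential_record():
    """One operation in flight at a time: each op waits for the previous one to finish."""
    record, cycle = [], 0
    for name in ORDER:
        record.append((cycle, name))
        record += [(cycle + i, "nop") for i in range(1, OPS[name]["latency"])]
        cycle += OPS[name]["latency"]
    return record

def ilp_record(width=4):
    """Issue up to `width` ready operations per cycle; operations overlap in flight."""
    finish, record, cycle = {}, [], 0
    remaining = list(ORDER)
    while remaining:
        ready = [n for n in remaining
                 if all(finish.get(d, 10**9) <= cycle for d in OPS[n]["deps"])][:width]
        for n in ready:
            finish[n] = cycle + OPS[n]["latency"]
            remaining.remove(n)
        record.append((cycle, ready or ["nop"]))
        cycle += 1
    return record

print(sequential_record())  # 7 cycles, with nops filling the multiply latencies
print(ilp_record())         # t1 and t2 issue together; the chain needs only 4 issue cycles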
`
1.2. Early History of Instruction-Level Parallelism
`
In small ways, instruction-level parallelism factored into the thinking of machine designers
in the 1940s and 1950s. Parallelism that would today be called horizontal microcode appeared
in Turing's 1946 design of the Pilot ACE [Carpenter and Doran 1986] and was carefully
described by Wilkes [1951]. Indeed, in 1953 Wilkes and Stringer wrote, "In some
cases it may be possible for two or more micro-operations to take place at the same time"
[Wilkes and Stringer 1953].
The 1960s saw the appearance of transistorized computers. One effect of this revolution
was that it became practical to build reliable machines with far more gates than was necessary
to build a general-purpose CPU. This led to commercially successful machines that used
this available hardware to provide instruction-level parallelism at the machine-language
level. In 1963 Control Data Corporation started delivering its CDC 6600 [Thornton 1964,
1970], which had ten functional units: integer add, shift, increment (2), multiply (2), logical,
branch, floating point add, and divide. Any one of these could start executing in a given
cycle whether or not others were still processing data-independent earlier operations. In
this machine the hardware decided, as the program executed, which operation to issue in
a given cycle; its model of execution was well along the way toward what we would today
call superscalar. Indeed, in many ways it strongly resembled its direct descendant, the scalar
portion of the CRAY-1. The CDC 6600 was the scientific supercomputer of its day.
Also during the 1960s, IBM introduced, and in 1967-68 delivered, the 360/91 [IBM 1967].
This machine, based partly on IBM's instruction-level parallel experimental Stretch processor,
offered less instruction-level parallelism than the CDC 6600, having only a single
integer adder, a floating point adder, and a floating point multiply/divide. But it was far
more ambitious than the CDC 6600 in its attempt to rearrange the instruction stream to
keep the functional units busy, a key technology in today's superscalar designs. For various
nontechnical reasons the 360/91 was not as commercially successful as it might have been,
with only about 20 machines delivered [Bell and Newell 1971]. But its CPU architecture was
the start of a long line of successful high-performance processors. As with the CDC 6600,
this ILP pioneer started a chain of superscalar architectures that has lasted into the 1990s.
In the 1960s, research into "parallel processing" often was concerned with the ILP found
in these processors. By the mid-1970s the term was used more often for multiple processor
parallelism and for regular array and vector parallelism. In part, this was due to some very
pessimistic results about the availability of ILP in ordinary programs, which we discuss
below.
`
`
`
`
`
`
1.3. Modern Instruction-Level Parallelism
`
In the late 1970s the beginnings of a new style of ILP, called very long instruction word
(VLIW), emerged on several different fronts. In many ways VLIWs were a natural outgrowth
of horizontal microcode, the first ILP technology, and they were triggered, in the 1980s,
by the same changes in semiconductor technology that had such a profound impact upon
the entire computer industry.
For sequential processors, as the speed gap between writeable and read-only memory
narrowed, the advantages of a small, dedicated, read-only control store began to disappear.
One natural effect of this was to diminish the advantage of microcode; it no longer made
as much sense to define a complex language as a compiler target and then interpret this
in very fast read-only microcode. Instead, the vertical microcode interface was presented
as a clean, simple compiler target. This concept was called RISC [Hennessy, Jouppi, Baskett
et al. 1982; Patterson and Sequin 1981; Radin 1982]. In the 1980s the general movement
of microprocessor products was towards the RISC concept, and instruction-level parallel
techniques fell out of favor. In the minisupercomputer price bracket, though, one innovative
superscalar product, the ZS-1, which could issue up to two instructions each cycle, was
built and marketed by Astronautics [Smith et al. 1987].
The same changes in memory technology were having a somewhat different effect upon
horizontally microcoded processors. During the 1970s a large market had grown in specialized
signal processing computers. Not aimed at general-purpose use, these CPUs hardwired
FFTs and other important algorithms directly into the horizontal control store, gaining
tremendous advantages from the instruction-level parallelism available there. When fast,
writeable memory became available, some of these manufacturers, most notably Floating
Point Systems [Charlesworth 1981], replaced the read-only control store with writeable memory,
giving users access to instruction-level parallelism in far greater amounts than the early
superscalar processors had. These machines were extremely fast, the fastest processors
by far in their price ranges, for important classes of scientific applications. However, despite
attempts on the part of several manufacturers to market their products for more general,
everyday use, they were almost always restricted to a narrow class of applications. This
was caused by the lack of good system software, which in turn was caused by the idiosyncratic
architecture of processors built for a single application, and by the lack at that time
of good code generation algorithms for ILP machines with that much parallelism.
As with RISC, the crucial step was to present a simple, clean interface to the compiler.
However, in this case the clean interface was horizontal, not vertical, so as to afford greater
ILP [Fisher 1983; Rau, Glaeser, and Greenawalt 1982]. This style of architecture was dubbed
VLIW [Fisher 1983]. Code generation techniques, some of which had been developed for
generating horizontal microcode, were extended to these general-purpose VLIW machines
so that the compiler could specify the parallelism directly [Fisher 1981; Rau and Glaeser
1981].
In the 1980s VLIW CPUs were offered commercially in the form of capable, general-purpose
machines. Three computer start-ups, Culler, Multiflow, and Cydrome, built
VLIWs with varying degrees of parallelism [Colwell et al. 1988; Rau et al. 1989]. As a
group these companies were able to demonstrate that it was possible to build practical machines
that achieved large amounts of ILP on scientific and engineering codes. Although,
for various reasons, none was a lasting business success, several major computer manufacturers
acquired access to the technologies developed at these start-ups and there are several
active VLIW design efforts underway. Furthermore, many of the compiler techniques developed
with VLIWs in mind, and reported upon in this issue, have been used to compile
for superscalar machines as well.
`
1.3.1. ILP in the 1990s. Just as had happened 30 years ago when the transistor became
available, CPU designers in the 1990s now have offered to them more silicon space on
a single chip than a RISC processor requires. Virtually all designers have begun to add
some degree of superscalar capability, and some are investigating VLIWs as well. It is a
safe bet that by 1995 virtually all new CPUs will embody some degree of ILP.
Partly as a result of this commercial resurgence of interest in ILP, research into that area
has become a dominant feature of architecture and systems conferences of the 1990s. Unfortunately,
those researchers who found themselves designing state-of-the-art products at computer
start-ups did not have the time to document the progress that was made and the large
amount that was learned. Virtually everything that was done by these groups was relevant
to what designers wrestle with today.
`
2. ILP Architectures
`
The end result of instruction-level parallel execution is that multiple operations are simultaneously
in execution, either as a result of having been issued simultaneously or because
the time to execute an operation is greater than the interval between the issuance of successive
operations. How exactly are the necessary decisions made as to when an operation
should be executed and whether an operation should be speculatively executed? The alternatives
can be broken down depending on the extent to which these decisions are made
by the compiler rather than by the hardware and on the manner in which information regarding
parallelism is communicated by the compiler to the hardware via the program.
A computer architecture is a contract between the class of programs that are written for
the architecture and the set of processor implementations of that architecture. Usually this
contract is concerned with the instruction format and the interpretation of the bits that constitute
an instruction, but in the case of ILP architectures it extends to information embedded
in the program pertaining to the available parallelism between the instructions or operations
in the program. With this in mind, ILP architectures can be classified as follows.
`
• Sequential architectures: architectures for which the program is not expected to convey
any explicit information regarding parallelism. Superscalar processors are representative
of ILP processor implementations for sequential architectures [Anderson et al. 1967;
Apollo Computer 1988; Bahr et al. 1991; Blanck and Krueger 1992; DeLano et al. 1992;
Diefendorff and Allen 1992; IBM 1990; Intel 1989b; Keller et al. 1975; Popescu et al.
1991; Smith et al. 1987; Thornton 1964].
• Dependence architectures: architectures for which the program explicitly indicates the
dependences that exist between operations. Dataflow processors [Arvind and Gostelow
1982; Arvind and Kathail 1981; Gurd et al. 1985] are representative of this class.
• Independence architectures: architectures for which the program provides information
as to which operations are independent of one another. Very long instruction word (VLIW)
processors [Charlesworth 1981; Colwell et al. 1988; Rau et al. 1989] are examples of
the class of independence architectures.
`
In the context of this taxonomy, vector processors [Hintz and Tate 1972; Russell 1978;
Watson 1972] are best thought of as processors for a sequential, CISC (complex instruction
set computer) architecture. The complex instructions are the vector instructions that do
possess a stylized form of instruction-level parallelism internal to each vector instruction.
Attempting to execute multiple instructions in parallel, whether scalar or vector, incurs
all of the same problems that are faced by a superscalar processor. Because of their stylized
approach to parallelism, vector processors are less general in their ability to exploit all
forms of instruction-level parallelism. Nevertheless, vector processors have enjoyed great
commercial success over the past decade. Not being true ILP processors, vector processors
are outside the scope of this special issue. (Vector processors have received a great deal
of attention elsewhere over the past decade and have been treated extensively in many books
and articles, for instance, the survey by Dongarra [1986] and the book by Schneck [1987].)
Certain hybrid architectures [Danelutto and Vanneschi 1990; Franklin and Sohi 1992;
Wolfe and Shen 1991], which combine some degree of multithreading with ILP, also fall
outside of this taxonomy for uniprocessors.
If ILP is to be achieved, between the compiler and the run-time hardware, the following
functions must be performed:

1. The dependences between operations must be determined.
2. The operations that are independent of any operation that has not as yet completed must
be determined.
3. These independent operations must be scheduled to execute at some particular time,
on some specific functional unit, and must be assigned a register into which the result
may be deposited.
`
`Figure 2 shows the breakdown of these three tasks, between the compiler and run-time
`hardware, for the three classes of architecture.
`
2.1. Sequential Architectures and Superscalar Processors
`
The program for a sequential architecture contains no explicit information regarding the
dependences that exist between instructions. Consequently, the compiler need neither identify
parallelism nor make scheduling decisions since there is no explicit way to communicate
this information to the hardware. (It is true, nevertheless, that there is value in the compiler
performing these functions and ordering the instructions so as to facilitate the hardware's
task of extracting parallelism.) In any event, if instruction-level parallelism is to
be employed, the dependences that exist between instructions must be determined by the
hardware. It is only necessary to determine dependences with sequentially preceding operations
that are in flight, that is, those that have been issued but have not yet completed.
`
[Figure 2 appears here: a block diagram showing, for each class of architecture (sequential/superscalar, dependence/dataflow, independence/Horizon, independence/VLIW), which of the tasks front end and optimizer, determine dependences, determine independences, bind resources, and execute are performed by the compiler and which by the hardware.]

Figure 2. Division of responsibilities between the compiler and the hardware for the three classes of architecture.
`
When the operation is independent of all other operations it may begin execution. At this
point the hardware must make the scheduling decision of when and where this operation
is to execute.
A superscalar processor strives to issue an instruction every cycle so as to execute many
instructions in parallel, even though the hardware is handed a sequential program. The
problem is that a sequential program is constructed with the assumption only that it will
execute correctly when each instruction waits for the previous one to finish, and that is
the only order that the architecture guarantees to be correct. The first task, then, for a
superscalar processor is to understand, for each instruction, which other instructions it
actually is dependent upon. With every instruction that a superscalar processor issues, it
must check whether the instruction's operands (registers or memory locations that the instruction
uses or modifies) interfere with the operands of any other instruction in flight,
that is, one that is either

• already in execution or
• has been issued but is waiting for the completion of interfering instructions that would
have been executed earlier in a sequential execution of the program.

If either of these conditions is true, the instruction in question must be delayed until
the instructions on which it is dependent have completed execution. For each waiting operation,
these dependences must be monitored to determine the point at which neither condition
is true. When this happens, the instruction is independent of all other uncompleted
instructions and can be allowed to begin executing at any time thereafter. In the meantime
the processor may begin execution of subsequent instructions that prove to be independent
of all sequentially preceding instructions in flight. Once an instruction is independent of
all other ones in flight, the hardware must also decide exactly when and on which available
functional unit to execute the instruction. The Control Data CDC 6600 used a mechanism,
called the scoreboard, to perform these functions [Thornton 1964]. The IBM System/360
Model 91, built in the early 1960s, used an even more sophisticated method known as
Tomasulo's algorithm to carry out these functions [Tomasulo 1967].
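The basic check that both of these mechanisms must perform can be sketched as follows (Python; the three-field instruction format, the register names, and the in-flight list are assumptions for illustration, and neither the CDC 6600 scoreboard nor Tomasulo's algorithm is reproduced here, only the underlying read/write interference test):

from dataclasses import dataclass

@dataclass
class Instr:
    dest: str      # register written by the instruction
    srcs: tuple    # registers read by the instruction

def interferes(new: Instr, in_flight: list) -> bool:
    """True if `new` must wait for some uncompleted, sequentially earlier instruction."""
    for earlier in in_flight:
        if earlier.dest in new.srcs:    # flow (read-after-write) dependence
            return True
        if new.dest in earlier.srcs:    # anti (write-after-read) dependence
            return True
        if new.dest == earlier.dest:    # output (write-after-write) dependence
            return True
    return False

in_flight = [Instr("r3", ("r1", "r2"))]                   # r3 = r1 * r2, still executing
print(interferes(Instr("r4", ("r3", "r5")), in_flight))   # True: must wait for r3
print(interferes(Instr("r6", ("r1", "r5")), in_flight))   # False: may issue immediately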
The further goal of a superscalar processor is to issue multiple instructions every cycle.
The most problematic aspect of doing so is determining the dependences between the operations
that one wishes to issue simultaneously. Since the semantics of the program, and
in particular the essential dependences, are specified by the sequential ordering of the operations,
the operations must be processed in this order to determine the essential dependences.
This constitutes an unacceptable performance bottleneck in a machine that is attempting
parallel execution. On the other hand, eliminating this bottleneck can be very expensive,
as is always the case when attempting to execute an inherently sequential task in parallel.
An excellent reference on superscalar processor design and its complexity is the book by
Johnson [1991].
A number of superscalar processors have been built during the past decade including
the Astronautics' ZS-1 decoupled access minisupercomputer [Smith 1989; Smith et al. 1987],
Apollo's DN10000 personal supercomputer [Apollo 1988; Bahr et al. 1991], and, most recently,
a number of microprocessors [Blanck and Krueger 1992; DeLano et al. 1992;
Diefendorff and Allen 1992; IBM 1990; Intel 1989b; Popescu et al. 1991].
Note that an ILP processor need not issue multiple operations per cycle in order to achieve
a certain level of performance. For instance, instead of a processor capable of issuing five
instructions per cycle, the same performance could be achieved by pipelining the functional
units and instruction issue hardware five times as deeply, speeding up the clock rate by
a factor of five but issuing only one instruction per cycle. This strategy, which has been
termed superpipelining [Jouppi 1989], goes full circle back to the single-issue, superscalar
processing of the 1960s. Superpipelining may result in some parts of the processor (such
as the instruction unit and communications buses) being less expensive and better utilized
and other parts (such as the execution hardware) being more costly and less well used.
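In rough terms (a back-of-the-envelope comparison, not from the original text), peak issue bandwidth is the product of issue width and clock frequency, so the two designs are nominally equivalent:

\[ 5 \times f \;=\; 1 \times 5f \ \text{operations per second.} \]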
`
`2.2. Dependence Architectures and Dataflow Processors
`
In the case of dependence architectures the compiler or the programmer identifies the parallelism
in the program and communicates it to the hardware by specifying, in the executable
program, the dependences between operations. The hardware must still determine, at run
time, when each operation is independent of all other operations and then perform the
scheduling. However, the inherently sequential task of scanning the sequential program
in its original order to determine the dependences has been eliminated.
The objective of a dataflow processor is to execute an instruction at the earliest possible
time subject only to the availability of the input operands and a functional unit upon which
to execute the instruction [Arvind and Gostelow 1982; Arvind and Kathail 1981]. To do
so, it counts on the program to provide information about the dependences between instructions.
Typically, this is accomplished by including in each instruction a list of successor
instructions. (An instruction is a successor of another instruction if it uses as one of its
input operands the result of that other instruction.) Each time an instruction completes,
it creates a copy of its result for each of its successor instructions. As soon as all of the
input operands of an instruction are available, the hardware fetches the instruction, which
specifies the operation to be performed and the list of successor instructions. The instruction
is then executed as soon as a functional unit of the requisite type is available. This
property, whereby the availability of the data triggers the fetching and execution of an instruction,
is what gives rise to the name of this type of processor. Because of this property,
it is redundant for the instruction to specify its input operands. Rather, the input operands
specify the instruction! If there is always at least one instruction ready to execute on every
functional unit, the dataflow processor achieves peak performance.
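A minimal sketch of this firing rule (Python; the instruction names, the two-input format, and the simple counter per instruction are illustrative assumptions; real dataflow machines use tagged tokens and matching hardware rather than this bookkeeping):

# Sketch: execution driven by operand arrival, as in a dataflow processor.
# Each instruction lists its operation, how many inputs it needs, and its successors.
INSTRS = {
    "i1": {"op": "mul", "need": 2, "succ": ["i3"]},
    "i2": {"op": "add", "need": 2, "succ": ["i3"]},
    "i3": {"op": "add", "need": 2, "succ": []},
}
arrived = {name: 0 for name in INSTRS}    # input operands received so far
ready = []                                # instructions whose inputs are all present

def deliver(target):
    """A result token arrives at `target`; once all inputs are present, it may fire."""
    arrived[target] += 1
    if arrived[target] == INSTRS[target]["need"]:
        ready.append(target)              # data availability triggers fetch and execution

for instr in ("i1", "i1", "i2", "i2"):    # initial operands arrive from outside the fragment
    deliver(instr)

while ready:
    name = ready.pop(0)
    print("executing", name, INSTRS[name]["op"])
    for succ in INSTRS[name]["succ"]:     # copy the result to each successor instruction
        deliver(succ)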
Computation within a basic block typically does not provide adequate levels of parallelism.
`Superscalar and VLIW processors use control parallelism and speculative execution to keep
`the hardware fully utilized. (This is discussed in greater detail in Sections 3 and 4.) Dataflow
`processors have traditionally counted on using control parallelism alone to fully utilize the
`functional units. A dataflow processor is more successful than the others at looking far
`down the execution path to find abundant control parallelism. When successful, this is a
`better strategy than speculative execution since every instruction executed is a useful one
`and the processor does not have to deal with error conditions raised by speculative operations.
`As far as the authors are aware, there have been no commercial products built based on
`the dataflow architecture, except in a limited sense [Schmidt and Caesar 1991]. There have,
`however, been a number of research prototypes built, for instance, the ones built at the
`University of Manchester [Gurd et al. 1985] and at MIT [Papadopoulos and Culler 1990].
`
2.3. Independence Architectures and VLIW Processors
`
In order to execute operations in parallel, the system must determine that the operations
are independent of one another. Superscalar processors and dataflow processors represent
two ways of deriving this information at run time. In the case of the dataflow processor
the explicitly provided dependence information is used to determine when an instruction
may be executed so that it is independent of all other concurrently executing instructions.
The superscalar processor must do the same, but since programs for it lack any explicit
information, it must also first determine the dependences between instructions. In contrast,
for an independence architecture the compiler identifies the parallelism in the program
and communicates it to the hardware by specifying which operations are independent of
one another. This information is of direct value to the hardware, since it knows with no
further checking which operations it can execute in the same cycle. Unfortunately, for any
given operation, the number of operations of which it is independent is far greater than
the number of operations on which it is dependent, so it is impractical to specify all independences.
Instead, for each operation, independences with only a subset of all independent
operations (those operations that the compiler thinks are the best candidates to execute
concurrently) are specified.
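What such a specification might look like can be sketched as follows (Python; the wide-instruction format, the unit names, and the particular operations are hypothetical and do not correspond to any real VLIW encoding; the compiler is assumed to have already verified independence and accounted for the two-cycle multiply latency of Table 1):

# Sketch: each wide instruction names, per functional unit, an operation that the
# compiler asserts may execute in that cycle; empty slots are explicit no-ops.
WIDE_PROGRAM = [
    {"int1": ("mul", "t1", "a", "b"), "int2": ("mul", "t2", "c", "d"),
     "fp1": None, "fp2": None},
    {"int1": None, "int2": None, "fp1": None, "fp2": None},  # waiting out the multiplies
    {"int1": ("add", "t3", "t1", "t2"), "int2": None,
     "fp1": None, "fp2": None},
]

def execute(program):
    """The hardware needs no dependence checking: it simply issues every filled slot."""
    for cycle, word in enumerate(program):
        issued = [f"{unit}: {op[0]} {op[1]}" for unit, op in word.items() if op]
        print(f"cycle {cycle}: " + (", ".join(issued) or "nop"))

execute(WIDE_PROGRAM)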
By listing operations that could be executed simultaneously, code for an independence
architecture may be very close to the record of execution produced by an implementation
`of that architecture. If the architecture additionally requires that programs specify where
`(on which functional unit) and when (in which cycle) the operations are executed, then
`the hardware makes no run time decisions at all and the code is virtually identical to the
`desired record of execution. The VLIW processors that have been built to date are of th