The Journal of Supercomputing, 7, 9-50 (1993)
© 1993 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Instruction-Level Parallel Processing:
History, Overview, and Perspective

B. RAMAKRISHNA RAU AND JOSEPH A. FISHER
Hewlett-Packard Laboratories, 1501 Page Mill Road, Bldg. 3U, Palo Alto, CA 94304

(October 20, 1992)
Abstract. Instruction-level parallelism (ILP) is a family of processor and compiler design techniques that speed
up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the
highest performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force
in computer design. Several systems were built and sold commercially, which pushed ILP far beyond where it
had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design
of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had
incorporated ILP, and new techniques for ILP had become a popular topic at academic conferences. This article
provides an overview and historical perspective of the field of ILP and its development over the past three decades.

Keywords. Instruction-level parallelism, VLIW processors, superscalar processors, pipelining, multiple operation
issue, speculative execution, scheduling, register allocation.
1. Introduction

Instruction-level parallelism (ILP) is a family of processor and compiler design techniques
that speed up execution by causing individual machine operations, such as memory loads
and stores, integer additions, and floating point multiplications, to execute in parallel. The
operations involved are normal RISC-style operations, and the system is handed a single
program written with a sequential processor in mind. Thus an important feature of these
techniques is that, like circuit speed improvements, but unlike traditional multiprocessor
parallelism and massively parallel processing, they are largely transparent to users. VLIWs
and superscalars are examples of processors that derive their benefit from instruction-level
parallelism, and software pipelining and trace scheduling are example software techniques
that expose the parallelism that these processors can use.
Although small amounts of ILP have been present in the highest performance uniprocessors
of the past 30 years, the 1980s saw it become a much more significant force in computer
design. Several systems were built and sold commercially, which pushed ILP far beyond
where it had been before, both in terms of the amount of ILP offered and in the central
role ILP played in the design of the system. By the early 1990s, advanced microprocessor
design at all major CPU manufacturers incorporated ILP, and new techniques for ILP became
a popular topic at academic conferences. With all of this activity we felt that, in contrast
to a report on suggested future techniques, there would be great value in gathering, in an
archival reference, reports on experience with real ILP systems and reports on the measured
potential of ILP. Thus this special issue of The Journal of Supercomputing.

1.1. ILP Execution

A typical ILP processor has the same type of execution hardware as a normal RISC machine.
The difference between a machine with ILP and one without is that there may be more
of that hardware, for example, several integer adders instead of just one, and that the
control will allow, and possibly arrange, simultaneous access to whatever execution hardware
is present.
Consider the execution hardware of a simplified ILP processor consisting of four functional
units and a branch unit connected to a common register file (Table 1). Typically
ILP execution hardware allows multiple-cycle operations to be pipelined, so we may assume
that a total of four operations can be initiated each cycle. If in each cycle the longest latency
operation is issued, this hardware could have ten operations "in flight" at once, which
would give it a maximum possible speedup of a factor of ten over a sequential processor
with similar execution hardware. As the papers in this issue show, this execution hardware
resembles that of several VLIW processors that have been built and used commercially,
though it is more limited in its amount of ILP. Several superscalar processors now being
built also offer a similar amount of ILP.
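
For concreteness, the following small C sketch (an illustration only, not taken from the
article) encodes the four functional units of Table 1 together with the longest latency each
can incur, and computes the number of operations that can be in flight at once, which is
the source of the factor-of-ten figure quoted above.

    /* A minimal sketch: encode the functional units of Table 1 and compute how
     * many operations can be "in flight" when the longest-latency operation is
     * issued on every unit in every cycle.  Unit names and latencies follow Table 1. */
    #include <stdio.h>

    struct unit {
        const char *name;
        int max_latency;   /* longest latency among the operations the unit performs */
    };

    int main(void)
    {
        struct unit units[] = {
            { "integer unit 1",             2 },  /* multiplies and loads take 2 cycles */
            { "integer unit 2/branch unit", 2 },
            { "floating point unit 1",      3 },
            { "floating point unit 2",      3 },
        };
        int n = sizeof units / sizeof units[0];
        int in_flight = 0;

        /* One operation can be initiated per unit per cycle, so at steady state
         * each unit can hold max_latency operations in flight. */
        for (int i = 0; i < n; i++)
            in_flight += units[i].max_latency;

        printf("operations in flight at once: %d\n", in_flight);                /* prints 10 */
        printf("maximum speedup over a one-op-in-flight machine: %dx\n", in_flight);
        return 0;
    }
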
There is a large amount of parallelism available even in this simple processor. The challenge
is to make good use of it; we will see that with the technology available today, an
ILP processor is unlikely to achieve nearly as much as a factor of ten on many classes
of programs, though scientific programs and others can yield far more than that on a processor
that has more functional units. The first question that comes to mind is whether enough
ILP exists in programs to make this possible. Then, if this is so, what must the compiler
and hardware do to successfully exploit it? In reality, as we shall see in Section 4, the
two questions have to be reversed; in the absence of techniques to find and exploit ILP,
it remains hidden, and we are left with a pessimistic answer.
Figure 1a shows a very large expression taken from the inner loop of a compute-intensive
program. It is presented cycle by cycle as it might execute on a processor with functional
units similar to those shown in Table 1, but capable of having only one operation in flight

Table 1. Execution hardware for a simplified ILP processor.

Functional Unit               Operations Performed          Latency

Integer unit 1                Integer ALU operations        1
                              Integer multiplication        2
                              Loads                         2
                              Stores                        1

Integer unit 2/branch unit    Integer ALU operations        1
                              Integer multiplication        2
                              Loads                         2
                              Stores                        1
                              Test-and-branch               1

Floating point unit 1         Floating point operations     3
Floating point unit 2
[Figure 1. (a) An example of the sequential record of execution for a loop. (b) The
instruction-level parallel record of execution for the same loop.]

at a time. Figure 1b shows the same program fragment as it might be executed on the hardware
indicated in Table 1.
Note that several of the cycles in Figure 1a contain no-ops. This is because the sequential
processor must await the completion of the three-cycle latency multiply issued in cycle 1
before issuing the next operation. (These no-ops would not appear in the text of a program,
but are shown here as the actual record of what is executed each cycle.) Most instruction-level
parallel processors can issue operations during these no-op cycles, when previous
operations are still in flight, and many can issue more than one operation in a given cycle.
In our ILP record of execution (Figure 1b), both effects are evident: In cycle 1, four operations
are issued; in cycle 2, two more operations are issued even though neither multiply
in cycle 1 has yet completed execution.
This special issue of The Journal of Supercomputing concerns itself with the technology
of systems that try to attain the kind of record of execution in Figure 1b, given a program
written with the record of execution in Figure 1a in mind.
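
Because the original figure reproduces poorly here, the following C program (a constructed
illustration; the operations, names, dependences, and latencies are hypothetical stand-ins
for the seed-update expressions of Figure 1) prints the two kinds of record of execution:
a sequential record in which no-ops fill out the latency of each operation, and an ILP record
in which up to four independent operations are issued per cycle.

    /* A constructed illustration, not code from the article: print a sequential
     * and an instruction-level parallel record of execution for a handful of
     * hypothetical operations.  Latencies echo the three-cycle multiply
     * mentioned in the text; functional-unit assignment is ignored. */
    #include <stdio.h>

    #define NOPS        9
    #define ISSUE_WIDTH 4   /* the simplified processor can initiate four operations per cycle */

    struct op {
        const char *text;
        int latency;
        int dep[2];         /* indices of producer operations, -1 if none */
    };

    static const struct op ops[NOPS] = {
        { "xseed1 = xseed * 1309",    3, { -1, -1 } },
        { "yseed1 = yseed * 1308",    3, { -1, -1 } },
        { "tseed1 = tseed * 1307",    3, { -1, -1 } },
        { "vseed1 = vseed * 1306",    3, { -1, -1 } },
        { "ptr    = ptr + 1",         1, { -1, -1 } },
        { "tp     = tp + 1",          1, { -1, -1 } },
        { "xsq    = xseed1 * xseed1", 3, {  0, -1 } },
        { "ysq    = yseed1 * yseed1", 3, {  1, -1 } },
        { "xysq   = xsq + ysq",       1, {  6,  7 } },
    };

    /* Sequential record (as in Figure 1a): one operation in flight, so every
     * multi-cycle latency shows up as explicit no-op cycles. */
    static void sequential_record(void)
    {
        int cycle = 1;
        for (int i = 0; i < NOPS; i++) {
            printf("CYCLE %2d  %s\n", cycle, ops[i].text);
            for (int pad = 1; pad < ops[i].latency; pad++)
                printf("CYCLE %2d  nop\n", cycle + pad);
            cycle += ops[i].latency;
        }
    }

    /* ILP record (as in Figure 1b): issue each operation as soon as its producers
     * have delivered their results, up to ISSUE_WIDTH operations per cycle. */
    static void ilp_record(void)
    {
        int issue_cycle[NOPS] = {0}, done_cycle[NOPS] = {0};
        int remaining = NOPS;

        for (int cycle = 1; remaining > 0; cycle++) {
            int issued = 0;
            for (int i = 0; i < NOPS && issued < ISSUE_WIDTH; i++) {
                if (issue_cycle[i])
                    continue;                        /* already issued */
                int ready = 1;
                for (int d = 0; d < 2; d++) {
                    int p = ops[i].dep[d];
                    if (p >= 0 && (issue_cycle[p] == 0 || done_cycle[p] > cycle))
                        ready = 0;                   /* an input is not yet available */
                }
                if (!ready)
                    continue;
                issue_cycle[i] = cycle;
                done_cycle[i]  = cycle + ops[i].latency;
                printf("CYCLE %2d  %s\n", cycle, ops[i].text);
                issued++;
                remaining--;
            }
        }
    }

    int main(void)
    {
        printf("(a) sequential record of execution\n");
        sequential_record();
        printf("\n(b) instruction-level parallel record of execution\n");
        ilp_record();
        return 0;
    }

On this made-up fragment the sequential record takes roughly three times as many cycles
as the parallel one, which is the effect Figure 1 is meant to convey.
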

1.2. Early History of Instruction-Level Parallelism

In small ways, instruction-level parallelism factored into the thinking of machine designers
in the 1940s and 1950s. Parallelism that would today be called horizontal microcode appeared
in Turing's 1946 design of the Pilot ACE [Carpenter and Doran 1986] and was carefully
described by Wilkes [1951]. Indeed, in 1953 Wilkes and Stringer wrote, "In some
cases it may be possible for two or more micro-operations to take place at the same time"
[Wilkes and Stringer 1953].
The 1960s saw the appearance of transistorized computers. One effect of this revolution
was that it became practical to build reliable machines with far more gates than was necessary
to build a general-purpose CPU. This led to commercially successful machines that used
this available hardware to provide instruction-level parallelism at the machine-language
level. In 1963 Control Data Corporation started delivering its CDC 6600 [Thornton 1964,
1970], which had ten functional units: integer add, shift, increment (2), multiply (2), logical,
branch, floating point add, and divide. Any one of these could start executing in a given
cycle whether or not others were still processing data-independent earlier operations. In
this machine the hardware decided, as the program executed, which operation to issue in
a given cycle; its model of execution was well along the way toward what we would today
call superscalar. Indeed, in many ways it strongly resembled its direct descendant, the scalar
portion of the CRAY-1. The CDC 6600 was the scientific supercomputer of its day.
Also during the 1960s, IBM introduced, and in 1967-68 delivered, the 360/91 [IBM 1967].
This machine, based partly on IBM's instruction-level parallel experimental Stretch processor,
offered less instruction-level parallelism than the CDC 6600, having only a single
integer adder, a floating point adder, and a floating point multiply/divide. But it was far
more ambitious than the CDC 6600 in its attempt to rearrange the instruction stream to
keep the functional units busy, a key technology in today's superscalar designs. For various
nontechnical reasons the 360/91 was not as commercially successful as it might have been,
with only about 20 machines delivered [Bell and Newell 1971]. But its CPU architecture was
the start of a long line of successful high-performance processors. As with the CDC 6600,
this ILP pioneer started a chain of superscalar architectures that has lasted into the 1990s.
In the 1960s, research into "parallel processing" often was concerned with the ILP found
in these processors. By the mid-1970s the term was used more often for multiple processor
parallelism and for regular array and vector parallelism. In part, this was due to some very
pessimistic results about the availability of ILP in ordinary programs, which we discuss
below.

1.3. Modern Instruction-Level Parallelism

In the late 1970s the beginnings of a new style of ILP, called very long instruction word
(VLIW), emerged on several different fronts. In many ways VLIWs were a natural outgrowth
of horizontal microcode, the first ILP technology, and they were triggered, in the 1980s,
by the same changes in semiconductor technology that had such a profound impact upon
the entire computer industry.
For sequential processors, as the speed gap between writeable and read-only memory
narrowed, the advantages of a small, dedicated, read-only control store began to disappear.
One natural effect of this was to diminish the advantage of microcode; it no longer made
as much sense to define a complex language as a compiler target and then interpret this
in very fast read-only microcode. Instead, the vertical microcode interface was presented
as a clean, simple compiler target. This concept was called RISC [Hennessy, Jouppi, Baskett
et al. 1982; Patterson and Sequin 1981; Radin 1982]. In the 1980s the general movement
of microprocessor products was towards the RISC concept, and instruction-level parallel
techniques fell out of favor. In the minisupercomputer price bracket, though, one innovative
superscalar product, the ZS-1, which could issue up to two instructions each cycle, was
built and marketed by Astronautics [Smith et al. 1987].
The same changes in memory technology were having a somewhat different effect upon
horizontally microcoded processors. During the 1970s a large market had grown in specialized
signal processing computers. Not aimed at general-purpose use, these CPUs hardwired
FFTs and other important algorithms directly into the horizontal control store, gaining
tremendous advantages from the instruction-level parallelism available there. When fast,
writeable memory became available, some of these manufacturers, most notably Floating
Point Systems [Charlesworth 1981], replaced the read-only control store with writeable memory,
giving users access to instruction-level parallelism in far greater amounts than the early
superscalar processors had. These machines were extremely fast, the fastest processors
by far in their price ranges, for important classes of scientific applications. However, despite
attempts on the part of several manufacturers to market their products for more general,
everyday use, they were almost always restricted to a narrow class of applications. This
was caused by the lack of good system software, which in turn was caused by the idiosyncratic
architecture of processors built for a single application, and by the lack at that time
of good code generation algorithms for ILP machines with that much parallelism.
As with RISC, the crucial step was to present a simple, clean interface to the compiler.
However, in this case the clean interface was horizontal, not vertical, so as to afford greater
ILP [Fisher 1983; Rau, Glaeser, and Greenawalt 1982]. This style of architecture was dubbed
VLIW [Fisher 1983]. Code generation techniques, some of which had been developed for
generating horizontal microcode, were extended to these general-purpose VLIW machines
so that the compiler could specify the parallelism directly [Fisher 1981; Rau and Glaeser
1981].
In the 1980s VLIW CPUs were offered commercially in the form of capable, general-purpose
machines. Three computer start-ups, Culler, Multiflow, and Cydrome, built
VLIWs with varying degrees of parallelism [Colwell et al. 1988; Rau et al. 1989]. As a
group these companies were able to demonstrate that it was possible to build practical machines
that achieved large amounts of ILP on scientific and engineering codes. Although,
for various reasons, none was a lasting business success, several major computer manufacturers
acquired access to the technologies developed at these start-ups and there are several
active VLIW design efforts underway. Furthermore, many of the compiler techniques developed
with VLIWs in mind, and reported upon in this issue, have been used to compile
for superscalar machines as well.

1.3.1. ILP in the 1990s. Just as had happened 30 years ago when the transistor became
available, CPU designers in the 1990s now have offered to them more silicon space on
a single chip than a RISC processor requires. Virtually all designers have begun to add
some degree of superscalar capability, and some are investigating VLIWs as well. It is a
safe bet that by 1995 virtually all new CPUs will embody some degree of ILP.
Partly as a result of this commercial resurgence of interest in ILP, research into that area
has become a dominant feature of architecture and systems conferences of the 1990s. Unfortunately,
those researchers who found themselves designing state-of-the-art products at computer
start-ups did not have the time to document the progress that was made and the large
amount that was learned. Virtually everything that was done by these groups was relevant
to what designers wrestle with today.

2. ILP Architectures

The end result of instruction-level parallel execution is that multiple operations are simultaneously
in execution, either as a result of having been issued simultaneously or because
the time to execute an operation is greater than the interval between the issuance of successive
operations. How exactly are the necessary decisions made as to when an operation
should be executed and whether an operation should be speculatively executed? The alternatives
can be broken down depending on the extent to which these decisions are made
by the compiler rather than by the hardware and on the manner in which information regarding
parallelism is communicated by the compiler to the hardware via the program.
A computer architecture is a contract between the class of programs that are written for
the architecture and the set of processor implementations of that architecture. Usually this
contract is concerned with the instruction format and the interpretation of the bits that constitute
an instruction, but in the case of ILP architectures it extends to information embedded
in the program pertaining to the available parallelism between the instructions or operations
in the program. With this in mind, ILP architectures can be classified as follows.

• Sequential architectures: architectures for which the program is not expected to convey
any explicit information regarding parallelism. Superscalar processors are representative
of ILP processor implementations for sequential architectures [Anderson et al. 1967;
Apollo Computer 1988; Bahr et al. 1991; Blanck and Krueger 1992; DeLano et al. 1992;
Diefendorff and Allen 1992; IBM 1990; Intel 1989b; Keller et al. 1975; Popescu et al.
1991; Smith et al. 1987; Thornton 1964].
• Dependence architectures: architectures for which the program explicitly indicates the
dependences that exist between operations. Dataflow processors [Arvind and Gostelow
1982; Arvind and Kathail 1981; Gurd et al. 1985] are representative of this class.
• Independence architectures: architectures for which the program provides information
as to which operations are independent of one another. Very long instruction word (VLIW)
processors [Charlesworth 1981; Colwell et al. 1988; Rau et al. 1989] are examples of
the class of independence architectures.

In the context of this taxonomy, vector processors [Hintz and Tate 1972; Russell 1978;
Watson 1972] are best thought of as processors for a sequential, CISC (complex instruction
set computer) architecture. The complex instructions are the vector instructions that do
possess a stylized form of instruction-level parallelism internal to each vector instruction.
Attempting to execute multiple instructions in parallel, whether scalar or vector, incurs
all of the same problems that are faced by a superscalar processor. Because of their stylized
approach to parallelism, vector processors are less general in their ability to exploit all
forms of instruction-level parallelism. Nevertheless, vector processors have enjoyed great
commercial success over the past decade. Not being true ILP processors, vector processors
are outside the scope of this special issue. (Vector processors have received a great deal
of attention elsewhere over the past decade and have been treated extensively in many books
and articles, for instance, the survey by Dongarra [1986] and the book by Schneck [1987].)
Also, certain hybrid architectures [Danelutto and Vanneschi 1990; Franklin and Sohi 1992;
Wolfe and Shen 1991], which combine some degree of multithreading with ILP, fall
outside of this taxonomy for uniprocessors.
If ILP is to be achieved, between the compiler and the run-time hardware, the following
functions must be performed:

1. The dependences between operations must be determined.
2. The operations that are independent of any operation that has not as yet completed must
be determined.
3. These independent operations must be scheduled to execute at some particular time,
on some specific functional unit, and must be assigned a register into which the result
may be deposited.

Figure 2 shows the breakdown of these three tasks, between the compiler and run-time
hardware, for the three classes of architecture.
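
As a compact restatement of that breakdown, the following C fragment (an illustrative
encoding only, not from the article; the Horizon entry reflects a label visible in the figure)
tabulates, for each class of architecture, whether each of the three functions is performed
by the compiler or left to the run-time hardware.

    /* An illustrative encoding of the division of responsibilities shown in
     * Figure 2: each architecture class assigns the three functions either to
     * the compiler or to the run-time hardware. */
    #include <stdio.h>

    enum who { COMPILER, HARDWARE };

    struct division {
        const char *architecture;
        enum who determine_dependences;
        enum who determine_independences;
        enum who bind_resources;       /* pick the cycle, functional unit, and register */
    };

    static const struct division classes[] = {
        { "sequential (superscalar)",     HARDWARE, HARDWARE, HARDWARE },
        { "dependence (dataflow)",        COMPILER, HARDWARE, HARDWARE },
        { "independence (Horizon-style)", COMPILER, COMPILER, HARDWARE },
        { "independence (VLIW)",          COMPILER, COMPILER, COMPILER },
    };

    int main(void)
    {
        static const char *name[] = { "compiler", "hardware" };
        for (unsigned i = 0; i < sizeof classes / sizeof classes[0]; i++)
            printf("%-28s dependences: %-9s independences: %-9s binding: %s\n",
                   classes[i].architecture,
                   name[classes[i].determine_dependences],
                   name[classes[i].determine_independences],
                   name[classes[i].bind_resources]);
        return 0;
    }
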

2.1. Sequential Architectures and Superscalar Processors

The program for a sequential architecture contains no explicit information regarding the
dependences that exist between instructions. Consequently, the compiler need neither identify
parallelism nor make scheduling decisions since there is no explicit way to communicate
this information to the hardware. (It is true, nevertheless, that there is value in the compiler
performing these functions and ordering the instructions so as to facilitate the hardware's
task of extracting parallelism.) In any event, if instruction-level parallelism is to
be employed, the dependences that exist between instructions must be determined by the
hardware. It is only necessary to determine dependences with sequentially preceding operations
that are in flight, that is, those that have been issued but have not yet completed.

[Figure 2. Division of responsibilities between the compiler and the hardware for the three
classes of architecture. The tasks shown are: front end and optimizer; determine dependences;
determine independences; bind resources; execute. The columns are labeled sequential
architecture (superscalar), dependence architecture (dataflow), independence architecture
(Horizon), and independence architecture (VLIW).]

When the operation is independent of all other operations it may begin execution. At this
point the hardware must make the scheduling decision of when and where this operation
is to execute.
A superscalar processor strives to issue an instruction every cycle so as to execute many
instructions in parallel, even though the hardware is handed a sequential program. The
problem is that a sequential program is constructed with the assumption only that it will
execute correctly when each instruction waits for the previous one to finish, and that is
the only order that the architecture guarantees to be correct. The first task, then, for a
superscalar processor is to understand, for each instruction, which other instructions it
actually is dependent upon. With every instruction that a superscalar processor issues, it
must check whether the instruction's operands (registers or memory locations that the instruction
uses or modifies) interfere with the operands of any other instruction in flight,
that is, one that is either

• already in execution or
• has been issued but is waiting for the completion of interfering instructions that would
have been executed earlier in a sequential execution of the program.

If either of these conditions is true, the instruction in question must be delayed until
the instructions on which it is dependent have completed execution. For each waiting operation,
these dependences must be monitored to determine the point at which neither condition
is true. When this happens, the instruction is independent of all other uncompleted
instructions and can be allowed to begin executing at any time thereafter. In the meantime
the processor may begin execution of subsequent instructions that prove to be independent
of all sequentially preceding instructions in flight. Once an instruction is independent of
all other ones in flight, the hardware must also decide exactly when and on which available
functional unit to execute the instruction. The Control Data CDC 6600 used a mechanism,
called the scoreboard, to perform these functions [Thornton 1964]. The IBM System/360
Model 91, built in the early 1960s, used an even more sophisticated method known as
Tomasulo's algorithm to carry out these functions [Tomasulo 1967].
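
To make the interference test concrete, here is a small C sketch (an illustration with
invented structure names; it considers register operands only and ignores memory
disambiguation) of the check that must hold before an instruction is allowed to issue: the
candidate may not read a result that an in-flight instruction has yet to produce, nor
overwrite a register that an in-flight instruction still reads or writes.

    /* An illustrative sketch, not a description of any real pipeline: registers
     * only, two sources and one destination per instruction.  An instruction may
     * issue only if its operands do not interfere with those of any instruction
     * still in flight. */
    #include <stdbool.h>

    #define NSRC 2

    struct instr {
        int src[NSRC];   /* source register numbers, -1 if unused */
        int dest;        /* destination register number, -1 if none */
    };

    /* True if the candidate depends on, or conflicts with, an in-flight instruction. */
    static bool interferes(const struct instr *cand, const struct instr *in_flight)
    {
        for (int i = 0; i < NSRC; i++) {
            /* Read-after-write: candidate needs a result not yet produced. */
            if (cand->src[i] != -1 && cand->src[i] == in_flight->dest)
                return true;
            /* Write-after-read: candidate would overwrite a value still needed. */
            if (cand->dest != -1 && cand->dest == in_flight->src[i])
                return true;
        }
        /* Write-after-write: both target the same register. */
        if (cand->dest != -1 && cand->dest == in_flight->dest)
            return true;
        return false;
    }

    /* The candidate may begin execution only when it interferes with no
     * instruction that has been issued but has not yet completed. */
    static bool may_issue(const struct instr *cand,
                          const struct instr in_flight[], int n_in_flight)
    {
        for (int i = 0; i < n_in_flight; i++)
            if (interferes(cand, &in_flight[i]))
                return false;
        return true;
    }

The scoreboard of the CDC 6600 and Tomasulo's algorithm in the 360/91 can be viewed
as hardware structures for evaluating and tracking exactly these conditions for every
instruction in flight.
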
The further goal of a superscalar processor is to issue multiple instructions every cycle.
The most problematic aspect of doing so is determining the dependences between the operations
that one wishes to issue simultaneously. Since the semantics of the program, and
in particular the essential dependences, are specified by the sequential ordering of the operations,
the operations must be processed in this order to determine the essential dependences.
This constitutes an unacceptable performance bottleneck in a machine that is attempting
parallel execution. On the other hand, eliminating this bottleneck can be very expensive,
as is always the case when attempting to execute an inherently sequential task in parallel.
An excellent reference on superscalar processor design and its complexity is the book by
Johnson [1991].
A number of superscalar processors have been built during the past decade including
the Astronautics ZS-1 decoupled access minisupercomputer [Smith 1989; Smith et al. 1987],
Apollo's DN10000 personal supercomputer [Apollo 1988; Bahr et al. 1991], and, most recently,
a number of microprocessors [Blanck and Krueger 1992; DeLano et al. 1992;
Diefendorff and Allen 1992; IBM 1990; Intel 1989b; Popescu et al. 1991].
Note that an ILP processor need not issue multiple operations per cycle in order to achieve
a certain level of performance. For instance, instead of a processor capable of issuing five
instructions per cycle, the same performance could be achieved by pipelining the functional
units and instruction issue hardware five times as deeply, speeding up the clock rate by
a factor of five but issuing only one instruction per cycle. This strategy, which has been
termed superpipelining [Jouppi 1989], goes full circle back to the single-issue, superscalar
processing of the 1960s. Superpipelining may result in some parts of the processor (such
as the instruction unit and communications buses) being less expensive and better utilized
and other parts (such as the execution hardware) being more costly and less well used.

2.2. Dependence Architectures and Dataflow Processors

In the case of dependence architectures the compiler or the programmer identifies the parallelism
in the program and communicates it to the hardware by specifying, in the executable
program, the dependences between operations. The hardware must still determine, at run
time, when each operation is independent of all other operations and then perform the
scheduling. However, the inherently sequential task of scanning the sequential program
in its original order to determine the dependences has been eliminated.
The objective of a dataflow processor is to execute an instruction at the earliest possible
time subject only to the availability of the input operands and a functional unit upon which
to execute the instruction [Arvind and Gostelow 1982; Arvind and Kathail 1981]. To do
so, it counts on the program to provide information about the dependences between instructions.
Typically, this is accomplished by including in each instruction a list of successor
instructions. (An instruction is a successor of another instruction if it uses as one of its
input operands the result of that other instruction.) Each time an instruction completes,
it creates a copy of its result for each of its successor instructions. As soon as all of the
input operands of an instruction are available, the hardware fetches the instruction, which
specifies the operation to be performed and the list of successor instructions. The instruction
is then executed as soon as a functional unit of the requisite type is available. This
property, whereby the availability of the data triggers the fetching and execution of an instruction,
is what gives rise to the name of this type of processor. Because of this property,
it is redundant for the instruction to specify its input operands. Rather, the input operands
specify the instruction! If there is always at least one instruction ready to execute on every
functional unit, the dataflow processor achieves peak performance.
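
The firing rule just described can be sketched in a few lines of C. This is an illustration
only, not a description of any particular dataflow machine: each instruction carries a count
of input operands still outstanding and a list of successors; delivering the last operand
makes the instruction ready to be fetched and executed as soon as a unit of the requisite
type is free.

    /* A minimal sketch of the dataflow firing rule: an instruction holds a list
     * of successors and a count of input operands still missing; delivering the
     * last operand makes it ready to fire. */
    #include <stdio.h>

    #define MAX_SUCC 4

    struct dataflow_instr {
        const char *opcode;
        int operands_missing;                    /* inputs not yet delivered */
        struct dataflow_instr *succ[MAX_SUCC];   /* consumers of this result */
        int nsucc;
    };

    /* Called when a producer delivers a result token to this instruction. */
    static void deliver_operand(struct dataflow_instr *in)
    {
        if (--in->operands_missing == 0)
            printf("ready to fire: %s\n", in->opcode);   /* fetch and execute when a unit is free */
    }

    /* Called when an instruction completes: send a copy of its result to each
     * successor, which is exactly the information the program encodes. */
    static void complete(struct dataflow_instr *in)
    {
        for (int i = 0; i < in->nsucc; i++)
            deliver_operand(in->succ[i]);
    }

    int main(void)
    {
        struct dataflow_instr add = { "add",    2, { 0 },    0 };
        struct dataflow_instr a   = { "load a", 0, { &add }, 1 };
        struct dataflow_instr b   = { "load b", 0, { &add }, 1 };

        complete(&a);   /* delivers one operand of the add */
        complete(&b);   /* delivers the second; the add becomes ready to fire */
        return 0;
    }
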
Computation within a basic block typically does not provide adequate levels of parallelism.
Superscalar and VLIW processors use control parallelism and speculative execution to keep
the hardware fully utilized. (This is discussed in greater detail in Sections 3 and 4.) Dataflow
processors have traditionally counted on using control parallelism alone to fully utilize the
functional units. A dataflow processor is more successful than the others at looking far
down the execution path to find abundant control parallelism. When successful, this is a
better strategy than speculative execution since every instruction executed is a useful one
and the processor does not have to deal with error conditions raised by speculative operations.
As far as the authors are aware, there have been no commercial products built based on
the dataflow architecture, except in a limited sense [Schmidt and Caesar 1991]. There have,
however, been a number of research prototypes built, for instance, the ones built at the
University of Manchester [Gurd et al. 1985] and at MIT [Papadopoulos and Culler 1990].

2.3. Independence Architectures and VLIW Processors

In order to execute operations in parallel, the system must determine that the operations
are independent of one another. Superscalar processors and dataflow processors represent
two ways of deriving this information at run time. In the case of the dataflow processor
the explicitly provided dependence information is used to determine when an instruction
may be executed so that it is independent of all other concurrently executing instructions.
The superscalar processor must do the same, but since programs for it lack any explicit
information, it must also first determine the dependences between instructions. In contrast,
for an independence architecture the compiler identifies the parallelism in the program
and communicates it to the hardware by specifying which operations are independent of
one another. This information is of direct value to the hardware, since it knows with no
further checking which operations it can execute in the same cycle. Unfortunately, for any
given operation, the number of operations of which it is independent is far greater than
the number of operations on which it is dependent, so it is impractical to specify all independences.
Instead, for each operation, independences with only a subset of all independent
operations (those operations that the compiler thinks are the best candidates to execute
concurrently) are specified.
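
As an illustration of how such independence information might be conveyed, the following
C fragment (a constructed sketch with invented field names, not the instruction format of
any real VLIW) groups operations that the compiler has determined to be mutually
independent into a single wide instruction word; the hardware can then issue every
operation in the word in the same cycle without any further checking.

    /* A constructed sketch of a very long instruction word: one slot per
     * functional unit of Table 1, and every operation placed in the same word
     * has been determined by the compiler to be independent of the others. */
    #include <stdio.h>

    enum slot { INT1, INT2_BRANCH, FLOAT1, FLOAT2, NSLOTS };

    struct operation {
        const char *text;    /* e.g. "xseed1 = xseed * 1309"; NULL means an explicit no-op */
    };

    struct vliw_word {
        struct operation slot[NSLOTS];   /* issued together, one per unit, in one cycle */
    };

    /* One cycle of a parallel record of execution might be expressed to the
     * hardware as a single word like this (the operations are hypothetical): */
    static const struct vliw_word cycle1 = {
        .slot = {
            [INT1]        = { "xseed1 = xseed * 1309" },
            [INT2_BRANCH] = { "yseed1 = yseed * 1308" },
            [FLOAT1]      = { NULL },   /* no floating point work this cycle */
            [FLOAT2]      = { NULL },
        },
    };

    int main(void)
    {
        static const char *unit[NSLOTS] = {
            "integer unit 1", "integer unit 2/branch",
            "floating point unit 1", "floating point unit 2"
        };
        for (int s = 0; s < NSLOTS; s++)
            printf("%-24s %s\n", unit[s],
                   cycle1.slot[s].text ? cycle1.slot[s].text : "nop");
        return 0;
    }

Which operations may share a word is precisely the independence information the compiler
is responsible for providing under this class of architecture.
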
By listing operations that could be executed simultaneously, code for an independence
architecture may be very close to the record of execution produced by an implementation
of that architecture. If the architecture additionally requires that programs specify where
(on which functional unit) and when (in which cycle) the operations are executed, then
the hardware makes no run time decisions at all and the code is virtually identical to the
desired record of execution. The VLIW processors that have been built to date are of th
