`
Merrimac: Supercomputing with Streams

William J. Dally    Patrick Hanrahan    Mattan Erez    Timothy J. Knight
François Labonte    Jung-Ho Ahn    Nuwan Jayasena    Ujval J. Kapasi
Abhishek Das    Jayanth Gummaraju    Ian Buck
`ABSTRACT
Merrimac uses stream architecture and advanced interconnection networks to give an order of magnitude more performance per unit cost than cluster-based scientific computers built from the same technology. Organizing the computation into streams and exploiting the resulting locality using a register hierarchy enables a stream architecture to reduce the memory bandwidth required by representative applications by an order of magnitude or more. Hence a processing node with a fixed bandwidth (expensive) can support an order of magnitude more arithmetic units (inexpensive). This in turn allows a given level of performance to be achieved with fewer nodes (a 1-PFLOPS machine, for example, with just 8,192 nodes), resulting in greater reliability and simpler system management. We sketch the design of Merrimac, a streaming scientific computer that can be scaled from a $20K 2-TFLOPS workstation to a $20M 2-PFLOPS supercomputer, and present the results of some initial application experiments on this architecture.
`
1. Introduction
Modern semiconductor technology makes arithmetic inexpensive and bandwidth expensive. To exploit this shift in cost, a high-performance computer system must exploit locality, to raise the arithmetic intensity (the ratio of arithmetic to bandwidth) of the application, as well as parallelism, to keep a large number of arithmetic units busy. Expressing an application as a stream program fulfills both of these requirements. It exposes large amounts of parallelism across stream elements and reduces global bandwidth by expressing locality within and between kernels.
A stream processor exploits the parallelism exposed by a stream program by providing 100s of arithmetic units, and exploits the locality of a stream program by providing a deep register hierarchy.
This work was supported in part by the Department of Energy, NNSA, under the ASCI Alliances program (contract LLL-B341491), in part by National Science Foundation Fellowships, in part by Stanford Graduate Fellowships, in part by the DARPA Smart Memories Project (contract MDA904-98-R-S855), and in part by the DARPA Polymorphous Computing Architectures Project (contract F29601-00-2-0085).
`
`
In particular, memory bandwidth is reduced by capturing short-term producer-consumer locality in large local register files, and long-term producer-consumer locality in a stream register file. This locality might not be captured by a reactive cache. More importantly, the stream register file is aligned with individual ALUs and requires only local on-chip communication, while a cache requires global on-chip communication.
We are designing Merrimac¹, a scientific computer system tailored to exploit the parallelism and locality of streams. The core of Merrimac is a single-chip (90nm CMOS) stream processor that is expected to have 128 GFLOPS peak performance. This processor chip along with 16 high-bandwidth DRAM chips (2 GBytes of memory) forms a single Merrimac node. Application experiments suggest that this single-node Merrimac will sustain up to half of peak performance on a range of scientific applications. With an estimated parts cost of less than $1K per 128 GFLOPS node (including network), we expect a Merrimac machine to provide both capability and capacity, being more cost effective than machines based on commodity microprocessors.
Merrimac employs a high-radix interconnection network to connect 16 nodes (2 TFLOPS) on a single board, 512 nodes (64 TFLOPS) in a cabinet, and 8K nodes (1 PFLOPS) in 16 cabinets. The network provides a flat shared address space across the multi-cabinet system, with flat bandwidth across a board (16 nodes) and a global bandwidth of 1/8 the local bandwidth anywhere in the system.
We have coded three representative scientific applications as stream programs and measured their performance on a simulated Merrimac node. These initial experiments show that typical scientific applications cast as stream programs maintain a high arithmetic-to-memory-bandwidth ratio and achieve a high fraction of peak performance. The applications simulated have computation-to-memory ratios in the range of 7:1 to 50:1, achieving between 18% and 52% of the peak performance of the machine, with less than 1.5% of data references traveling off-chip.
The remainder of this paper describes stream processors and the Merrimac project in more detail. In Section 2 we see that modern VLSI technology makes arithmetic cheap and bandwidth expensive. Section 3 shows how a stream processor exploits application locality using a bandwidth hierarchy and application parallelism by using large numbers of ALUs. Merrimac, a supercomputer based on streams, is described in Section 4. We show the performance of a simulated stream processor on a number of applications in Section 5. Issues related to scientific computing with streams are discussed in Section 6.
`
`
¹Merrimac is a Native American word meaning "fast moving stream".
`
`
`
`
2. VLSI enables inexpensive arithmetic, making bandwidth the limiting factor
Modern VLSI fabrication processes make it very inexpensive, in terms of both area and power, to put large amounts of arithmetic capability on a chip. With arithmetic almost free, global bandwidth, both on-chip and off-chip, becomes the factor limiting performance.
In 0.13µm CMOS technology, a 64-bit floating-point unit (FPU) (multiplier and adder) has an area of less than 1 mm² and dissipates about 50pJ of energy per operation [1]. Over 200 such FPUs can fit on a 14mm x 14mm chip that can be manufactured in volume (including testing and packaging) for less than $100. Even at a conservative operating frequency of 500MHz, this gives a cost of 64-bit floating-point arithmetic of less than $1 per GFLOPS and a power of less than 50mW per GFLOPS. Even though one cannot completely fill a chip with FPUs, modern graphics chips come close to realizing these cost-performance levels. For example, the nVidia NV30 sustains 100 GFLOPS (32-bit floating point) [2].
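As a rough check on these figures (the FPU count and clock rate above are representative, not exact, and we count one operation per FPU per cycle):

200 FPUs x 0.5 GHz = 100 GFLOPS per chip, so a $100 chip costs about $1 per GFLOPS,
and 50 pJ per operation x 10⁹ operations/s per GFLOPS gives 50 mW per GFLOPS.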
The already low cost of arithmetic is decreasing rapidly as technology improves. We describe a CMOS technology by its drawn gate length L. Most chips today are manufactured with L = 0.13µm. Historical trends show that L decreases at about 14% per year [3]. The cost of a GFLOPS of arithmetic scales as L³ and hence decreases at a rate of about 35% per year [4]. Every five years, L is halved, four times as many FPUs fit on a chip of a given area, and they operate twice as fast, giving a total of eight times the performance for the same cost. Of equal importance, the switching energy also scales as L³, so every five years we get eight times the arithmetic performance for the same power.
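Stated as a back-of-the-envelope calculation from the 14% per year trend quoted above: after five years L shrinks by a factor of 0.86 compounded five times, roughly one half, so

FPUs per unit area grow as 1/L² (4x), clock frequency grows as 1/L (2x),
and performance per chip grows as 1/L³ (8x) every five years, at constant cost and, since switching energy falls as L³, at constant power.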
Global bandwidth, not arithmetic, is the factor limiting the performance and dominating the power of modern processors. The cost of bandwidth grows at least linearly with distance in terms of both availability and power [4]. To keep distances constant across technology generations, we express distance in units of tracks. One track (or 1X) is the distance between two minimum-width wires on a chip. In 0.13µm technology, 1X ≈ 0.5µm. We can put ten times as many 10³X wires on a chip as we can 10⁴X wires. More importantly, moving a bit of information over a 10³X wire takes only 1/10th the energy of moving a bit over a 10⁴X wire. In 0.13µm technology, for example, transporting the three 64-bit operands for a 50pJ floating-point operation over global 3 x 10⁴X wires consumes about 1nJ, 20 times the energy required to do the operation. In contrast, transporting these operands on local wires with an average length of 3 x 10²X takes only 10pJ, much less than the cost of the operation.
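These two data points are consistent with transport energy that grows linearly with wire length; as a rough check:

3 operands x 64 bits = 192 bits moved per operation
192 bits over 3 x 10⁴X wires ≈ 1 nJ, about 20x the 50 pJ operation
192 bits over 3 x 10²X wires (100x shorter) ≈ 1 nJ / 100 = 10 pJ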
Contemporary architectures are not yet tuned to these developing VLSI constraints. These architectures are unable to use more than a few arithmetic units because they are designed for applications with limited parallelism and are hindered by a low-bandwidth memory system. Their main goal is to provide high performance for mostly serial code that is highly sensitive to memory latency and not bandwidth. To exploit the capabilities of today's VLSI technology requires an architecture that can exploit parallelism, to keep large numbers of arithmetic units busy while hiding the ever increasing latency to memory, and locality, to increase the ratio of arithmetic, which is inexpensive, to global bandwidth, which is the limiting factor.
`
3. Stream Architecture exploits the characteristics of VLSI
A stream processor is able to take advantage of the large number of arithmetic units that VLSI technology enables, without exceeding the bandwidth limitations of the technology, by using a register hierarchy to exploit locality in the application. This greatly reduces the average distance an operand must travel to reach an FPU. As shown in Figure 1, a stream architecture consists of an array of clusters, each with a set of FPUs, a set of local register files (LRFs), and a bank of a stream register file (SRF). Each FPU in a cluster reads its operands out of an adjacent LRF over very short (≈ 100X) wires. FPU results are distributed to the other LRFs in a cluster and accesses to the local SRF bank are made via the cluster switch over short (≈ 1,000X) wires. While the SRF is similar in size to a cache, SRF accesses are much less expensive than cache accesses because they are aligned and do not require a tag lookup. Each cluster accesses its own bank of the SRF over short wires. In contrast, accessing a cache requires a global communication over long (≈ 10,000X) wires. The SRF also plays another crucial role in keeping the arithmetic units busy by allowing the software to hide long memory latencies. An entire stream is transferred between the SRF and the memory with a single instruction. These stream memory operations generate a large number of memory references to fill the very deep pipeline between processor and memory, allowing memory bandwidth to be maintained in the presence of latency. Arithmetic units are kept busy by overlapping the execution of arithmetic kernels with these stream memory operations.
To see how a stream processor exploits locality, consider a simple application expressed as a stream program (Figure 2). This figure shows a synthetic application that is designed to have the same bandwidth demands as the StreamFEM application (Section 5). Each iteration, the application streams a set of 5-word grid cells into a series of four kernels. The kernels operate on the data, performing the number of operations indicated, and pass intermediate results on to the next kernel. To perform a table lookup, kernel K1 generates an index stream that is used to reference a table in memory, generating a 3-word-per-element stream into kernel K3.
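The shape of this pipeline can be written down as the following C-style sketch. It is only an illustration: Merrimac applications are written in a stream language rather than plain C, and the record layouts, kernel bodies, and the process_strip name are placeholders invented here. Only the structure, four kernels, an index stream produced by K1, and a gather of 3-word table values consumed by K3, comes from Figure 2.

#include <stddef.h>

enum { STRIP = 1024 };                        /* records per strip (Section 3)      */

typedef struct { double w[5]; } Cell;         /* 5-word grid cell                   */
typedef struct { double w[3]; } TableVal;     /* 3-word table entry                 */

/* K1: arithmetic on each cell; also emits the index stream used for the gather.    */
static void K1(const Cell *in, double *out, unsigned *idx, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i].w[1] + in[i].w[2];     /* stand-in for K1's real arithmetic  */
        idx[i] = (unsigned)in[i].w[0];        /* index used to reference the table  */
    }
}

static void K2(const double *in, double *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = 2.0 * in[i];            /* stand-in     */
}

/* K3 consumes K2's results together with the gathered table values.                */
static void K3(const double *in, const TableVal *tbl, double *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = in[i] * tbl[i].w[0];    /* stand-in     */
}

static void K4(const double *in, Cell *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i].w[0] = in[i];             /* updated cell */
}

/* One strip: kernels K1..K4, with a gather of table values between K2 and K3.      */
void process_strip(const Cell cells[STRIP], Cell updated[STRIP], const TableVal *table) {
    unsigned idx[STRIP];
    double r1[STRIP], r2[STRIP], r3[STRIP];   /* intermediate streams (held in the SRF) */
    TableVal gathered[STRIP];

    K1(cells, r1, idx, STRIP);
    K2(r1, r2, STRIP);
    for (size_t i = 0; i < STRIP; i++)        /* stream gather: table lookup        */
        gathered[i] = table[idx[i]];
    K3(r2, gathered, r3, STRIP);
    K4(r3, updated, STRIP);
}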
Figure 3 shows how the stream program of Figure 2 maps to the register hierarchy of a stream processor. The grid cells start in memory and are read a strip at a time into a buffer in the SRF. A typical strip might be 1024 5-word records.² Once a strip of cells is in the SRF, kernel K1 is run, generating a strip of indices and a strip of intermediate results in the SRF. Kernel K2 is run on the results, generating a second set of intermediate results, while the indices are applied to memory to read a strip of table values into the SRF. Table values that are repeatedly accessed are provided by the cache. The process continues until the updates to the strip of grid cells, generated by kernel K4, are written back to memory. Each strip is software pipelined so that the loading of one strip of cells is overlapped with the execution of the four kernels on the previous strip of cells and the storing of the strip before that.
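A sketch of that outer loop, reusing the types and process_strip from the sketch above, is shown below. The double-buffered strips and explicit memcpy calls standing in for the stream loads and stores are invented for illustration; on Merrimac the load, kernel, and store phases are issued as asynchronous stream instructions and execute concurrently rather than back-to-back as written here.

#include <string.h>

/* Strip-mined outer loop with a three-stage software pipeline:
 *   stage 1 loads strip s, stage 2 runs K1..K4 on strip s-1, stage 3 stores strip s-2.
 * Double buffering keeps the stages independent of each other.                     */
void process_grid(const Cell *grid_in, Cell *grid_out, size_t ncells, const TableVal *table) {
    static Cell in_buf[2][STRIP], out_buf[2][STRIP];      /* stand-ins for SRF buffers */
    size_t nstrips = ncells / STRIP;

    for (size_t s = 0; s < nstrips + 2; s++) {
        if (s < nstrips)                                  /* memory -> SRF (strip s)   */
            memcpy(in_buf[s % 2], &grid_in[s * STRIP], sizeof in_buf[0]);
        if (s >= 1 && s - 1 < nstrips)                    /* kernels on strip s-1      */
            process_strip(in_buf[(s - 1) % 2], out_buf[(s - 1) % 2], table);
        if (s >= 2)                                       /* SRF -> memory (strip s-2) */
            memcpy(&grid_out[(s - 2) * STRIP], out_buf[s % 2], sizeof out_buf[0]);
    }
}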
This synthetic application shows how the stream architecture exploits locality. In Section 5 we shall see that actual applications exploit locality in a similar manner. Kernels K1...K4 perform all of their 300 operations out of LRFs, performing 900 LRF accesses per grid point. The streams between the kernels are passed through the SRF, generating 58 words of SRF bandwidth per grid point. Finally, memory accesses total 12 words. This gives us a bandwidth ratio of 75:5:1, 75 LRF references and 5 SRF references for every memory reference.
`
²The strip size is chosen by the compiler to use the entire SRF without any spilling.
`
`
`
`
Figure 1: A stream processor consists of an array of clusters, each having a number of functional units with local register files and a stream register file bank connected by a cluster switch. The clusters are connected to each other and to cache banks by a global switch. At each level of this hierarchy - local register, intra-cluster, and inter-cluster - the wires get an order of magnitude longer.
`
Figure 2: A synthetic stream application, modeled after StreamFEM (Section 5), consists of a set of kernels K1...K4 that pass streams of data between them.
`
Figure 3: The stream program of Figure 2 is mapped to the bandwidth hierarchy of a stream processor.
`
`
`
`
Put differently, 93% of all references are made from the LRFs, where bandwidth is very inexpensive, and only 1.2% of references are made from the memory system, where bandwidth is expensive for cache hits and very expensive for misses.³
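These percentages follow directly from the per-grid-point traffic quoted above:

900 LRF words : 58 SRF words : 12 memory words ≈ 75 : 4.8 : 1, i.e. roughly 75:5:1
LRF share = 900 / (900 + 58 + 12) = 900 / 970 ≈ 93%; memory share = 12 / 970 ≈ 1.2%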
A stream processor executes a stream instruction set. This instruction set includes scalar instructions, which are executed on a conventional scalar processor; stream execution instructions, which each trigger the execution of a kernel on one or more strips in the SRF; and stream memory instructions, which load and store (possibly with gather and scatter) a stream of records between memory and the SRF. This stream instruction set closely follows that of the Imagine streaming media processor [5, 6].
Merrimac also provides hardware support for a scatter-add instruction. This instruction is an example of a new architectural feature that is enabled by programming in streams. A scatter-add acts as a regular scatter, but adds each value to the data already at each specified memory address rather than simply overwriting the data. This type of operation was discussed from a parallel algorithm perspective in [7].
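Functionally, scatter-add behaves like the loop below. This is a sketch of the semantics only; on Merrimac the read-modify-write at each address is performed by the memory system, and the function name and types are illustrative.

#include <stddef.h>

/* Scatter-add: like a scatter, but each value is added to the word already at the
 * target address instead of overwriting it.  Because the hardware performs the
 * read-modify-write atomically, duplicate indices accumulate correctly.            */
void scatter_add(double *mem, const size_t *index, const double *value, size_t n) {
    for (size_t i = 0; i < n; i++)
        mem[index[i]] += value[i];
}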
`
4. Sketch of Merrimac: a Streaming Scientific Computer
`
Figure 4: Floorplan of a Merrimac cluster.
`
Each Merrimac node contains a stream processor (as illustrated in Figure 1) with 16 arithmetic clusters. Each cluster contains four floating-point multiply-add (MADD) units, 768 64-bit words of local registers, and 8K words of stream register file. The entire stream register file has a capacity of 128K 64-bit words, distributed across the 16 clusters. A floorplan for one cluster is shown in Figure 4. Each MADD unit measures 0.9mm x 0.6mm and the entire cluster measures 2.3mm x 1.6mm. We conservatively plan to operate with a clock cycle of 1ns (37 FO4 inverters in 90nm [3]), giving a performance of 8 GFLOPS per cluster and 128 GFLOPS across the 16 clusters.
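These peak numbers follow directly from the MADD units and the clock:

4 MADD units x 2 FLOPs per multiply-add x 1 GHz (1 ns cycle) = 8 GFLOPS per cluster
16 clusters x 8 GFLOPS = 128 GFLOPS per chip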
A floorplan of the entire Merrimac stream processor chip is shown in Figure 5. The bulk of the chip is occupied by the 16 clusters.
`
³Many of our applications have very large kernels that in effect combine several smaller kernels, passing intermediate results through LRFs rather than SRFs. While this increases the fraction of LRF accesses, it also stresses LRF capacity. Ideally, the compiler will partition large kernels and combine small kernels to balance these two effects. We have not yet implemented this optimization.
`
Figure 5: Floorplan of a Merrimac stream processor chip.
`
The left edge of the chip holds the remainder of the node. A scalar processor [8] fetches all instructions, executes the scalar instructions itself, and dispatches stream execution instructions to the clusters (under control of the microcontroller) and stream memory instructions to the memory system. The node memory system consists of a set of address generators (not shown), a line-interleaved eight-bank 64K-word (512 KByte) cache, and interfaces for 16 external DRAM chips. A network interface directs off-node memory references to the routers. We estimate that each Merrimac processor will cost about $200 to manufacture and will dissipate a maximum of 31 W of power. Area and power estimates in a standard-cell process in 90nm technology are derived from models based on a previous implementation of a stream processor [1].
Figure 6 illustrates a single Merrimac board containing 16 nodes - 16 128-GFLOPS stream processors (Figure 5), each with 2 GBytes of DRAM - and four router chips. The router chips interconnect the 16 processors on the board, providing flat memory bandwidth on board of 20 GBytes/s per node. The routers also provide a gateway to the inter-board network, with a 4:1 reduction in memory bandwidth (to 5 GBytes/s per node) for inter-board references.
Larger Merrimac systems are interconnected by a five-stage folded-Clos [9] network⁴ using high-radix routers, as illustrated in Figure 7. The routers on each 16-node board serve as the first and last stage of this network. The basic building block of this network is a 48-input x 48-output router chip. Each bidirectional router channel (one input and one output) has a bandwidth of 2.5 GBytes/s (four 5Gb/s differential signals) in each direction. On each 16-processor board, each of four routers has two 2.5 GBytes/s channels to/from each of the 16 processor chips and eight ports to/from the backplane switch. The remaining eight ports are unused. Thus each board provides a total of 32 channels to the backplane. At the backplane level, 32 routers connect one channel to each of the 32 boards and connect 16 channels to the system-level switch. A total of 512 2.5 GBytes/s channels traverse optical links to the system-level switch, where 512 routers connect all 48 ports to up to 48 backplanes (the figure shows just 32 backplanes).
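The bandwidth figures quoted above can be checked against the channel counts (reading the 32 backplane channels as being supplied per board, as assumed here):

per node, on board: 4 routers x 2 channels x 2.5 GBytes/s = 20 GBytes/s
per board, to the backplane: 4 routers x 8 ports x 2.5 GBytes/s = 80 GBytes/s, or 5 GBytes/s per node (the 4:1 reduction)
per backplane, to the system switch: 32 routers x 16 channels = 512 channels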
Table 1 shows the estimated cost of a streaming supercomputer. The processor and router chips are modest-sized (10mm x 11mm) ASICs in 1000-pin flip-chip BGA packages that are expected to

⁴This topology is sometimes called a Fat Tree [10].
`
`
`
`
Figure 7: A 2 PFLOPS Merrimac system uses a high-radix interconnection network.
`
Figure 6: Sixteen 128 GFLOPS stream processors, each with 2 GBytes of DRAM memory, can be packaged on a single board. The board has a total of 2 TFLOPS of arithmetic and 32 GBytes of memory. Such a board is useful as a stand-alone scientific computer and as a building-block for larger systems.
`
Item                 | Cost ($) | Per Node Cost ($)
Processor Chip       |      200 |               200
Router Chip          |      200 |                69
Memory Chip          |       20 |               320
Board                |     1000 |                63
Router Board         |     1000 |                 2
Backplane            |     5000 |                10
Global Router Board  |     5000 |                 5
Power                |        1 |                50
Per Node Cost        |          |               718
$/GFLOPS (128/Node)  |          |                 6
$/M-GUPS (250/Node)  |          |                 3

Table 1: Rough per-node budget. Parts cost only; does not include I/O.
`
cost $200 each in moderate quantities (1000s). DRAM chips are projected to cost $20 each, making DRAM, at $320, the largest single cost item. Board and backplane costs, including connectors, capacitors, regulators, and other components, are amortized over the 16 nodes on each board and the 512 nodes in each backplane. The router board and global router board costs reflect the costs of the intra-cabinet and inter-cabinet networks respectively. Supplying and removing power costs about $1 per W, or about $50 per 50W node. Overall cost is less than $1K per node, which translates into $6 per GFLOPS of peak performance and $3 per M-GUPS⁵.
`
⁵GUPS, or global updates per second, is a measure of global unstructured memory bandwidth. It is the number of single-word read-modify-write operations a machine can perform to memory locations randomly selected from the entire address space.
`
`
`
`
Application                  | Sustained GFLOPS | FP Ops / Mem Ref | LRF Refs       | SRF Refs     | Mem. Refs
StreamFEM (Euler, quadratic) |             32.2 |             23.5 | 169.5M (93.6%) | 10.3M (5.7%) | 1.4M (0.7%)
StreamFEM (MHD, cubic)       |             33.5 |             50.6 | 733.3M (94.0%) | 43.5M (5.6%) | 3.2M (0.4%)
StreamMD                     |             14.2 |             12.1 | 90.2M (97.5%)  | 1.6M (1.7%)  | 0.7M (0.8%)
StreamFLO                    |             11.4 |              7.4 | 234.3M (95.7%) | 7.2M (2.9%)  | 3.4M (1.4%)

Table 2: Performance measurements of streaming scientific applications.
`
5. Applications exploit the locality of a stream processor
Three scientific applications were used to evaluate the single-node performance of the Merrimac stream processor: StreamFEM, StreamMD, and StreamFLO. These applications feature a number of characteristics which are common in scientific applications in general, including regular and irregular multidimensional meshes, multigrid techniques, and particle-in-cell computations.
StreamFEM is a finite element application designed to solve systems of first-order conservation laws on general unstructured meshes. The StreamFEM implementation has the capability of solving systems of 2D conservation laws corresponding to scalar transport, compressible gas dynamics, and magnetohydrodynamics (MHD), using element approximation spaces ranging from piecewise constant to piecewise cubic polynomials. StreamFEM uses the discontinuous Galerkin (DG) method developed by Reed and Hill [11] and later popularized by Cockburn, Hou and Shu [12]. In the present StreamFEM implementation, the limiting procedure of Cockburn et al. has been replaced by variational discontinuity capturing terms as discussed in Jaffre, Johnson and Szepessy [13], with further overall algorithmic simplifications as discussed in Barth [14].
StreamMD is a molecular dynamics solver [15, 16] that is based on solving Newton's equations of motion. The velocity Verlet (or leap-frog) method is used to integrate the equations of motion in time; using this method, it is possible to simulate the complex trajectories of atoms and molecules for very long periods of time. The present StreamMD implementation simulates a box of water molecules, with the potential energy function defined as the sum of two terms: the electrostatic potential and the Van der Waals potential. A cutoff is applied so that all particles which are at a distance greater than r_cutoff do not interact. A 3D gridding structure is used to accelerate the determination of which particles are close enough to interact: each grid cell contains a list of the particles within that cell, and each timestep particles may move between grid cells. StreamMD makes use of the scatter-add functionality of Merrimac by computing the pairwise particle forces in parallel and accumulating the forces on each particle by scattering them to memory.
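A sketch of this accumulation pattern is shown below. The 1D placeholder force kernel and the pair-list layout are invented for illustration; the real StreamMD kernels compute 3D forces for the water model described above.

#include <stddef.h>

typedef struct { size_t i, j; } Pair;          /* one interacting particle pair     */

static double force(double xi, double xj) {    /* placeholder pair-force kernel     */
    return xj - xi;
}

/* Pairwise forces are computed independently across pairs; the per-particle sums
 * are then formed with scatter-add, so contributions from different pairs to the
 * same particle accumulate rather than overwrite one another.                       */
void accumulate_forces(double *f, const double *x, const Pair *pairs, size_t npairs) {
    for (size_t p = 0; p < npairs; p++) {
        double fij = force(x[pairs[p].i], x[pairs[p].j]);
        f[pairs[p].i] += fij;                  /* scatter-add contribution to i     */
        f[pairs[p].j] -= fij;                  /* equal and opposite contribution to j */
    }
}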
StreamFLO [17] is a finite volume 2D Euler solver that uses a non-linear multigrid algorithm. It is based on the FLO82 code [18][19], which influenced many industrial and research codes. The choice of the code is motivated by the need for an application that is representative of a typical computational fluid dynamics application, without unnecessary complexity. A cell-centered finite-volume formulation is used to solve the fluid equations together with multigrid acceleration. Time integration is performed using a five-stage Runge-Kutta scheme.
Table 2 presents measurements from running these three applications on a cycle-accurate simulator of one Merrimac node. These simulations were run on a version of the simulator that included four 2-input multiply/add units per cluster (for a peak performance of 64 GFLOPS/node) rather than the four integrated 3-input MADD units (128 GFLOPS/node) of the current design.
The Sustained GFLOPS and FP Ops / Mem Ref columns illustrate the arithmetic intensity of the applications; they are able to sustain from 18% to 52% of the node's peak arithmetic performance by performing from 7 to 50 floating-point operations for each global memory access. Note that only "real" ops are counted in this figure, such as floating-point add/mul/compare instructions, and not non-arithmetic ops such as branches. Divides are counted as single floating-point operations, even though each divide requires several multiplication and addition operations when executed on the hardware. This leads to the lower performance numbers for StreamMD and StreamFLO; for example, the sustained performance of StreamFLO would double if we counted all the multiplies and adds required for divisions as well.
The right-most three columns list the respective numbers of LRF, SRF, and memory references made by the program, along with the percentage of references satisfied by each level. Note that only a small fraction of references, usually less than 1%, require communication over global (>10,000X or off-chip) wires, and that over 95% of all data movement is on local (100X) wires (at the LRF level). The register hierarchy of a stream processor exposes costly global communication and allows the locality inherent in applications to be exploited to keep communications local.
Exploiting locality using a register hierarchy increases performance and reduces power dissipation. By performing less data movement per arithmetic operation, we can support a much larger number of arithmetic units before saturating the limited global bandwidth. At the same time, power per operation is dramatically reduced by eliminating much of the global communication that dominates power.
`
`6. Discussion
`
`6.1 Streams vs Vectors
Stream processors share with vector processors, like the Cray-1 through Cray C90 [20][21], the ability to hide latency, amortize instruction overhead, and expose data parallelism by operating on large aggregates of data. In a similar manner, a stream processor such as Merrimac hides memory latency by fetching a stream of records with a single stream load instruction. A kernel is performed on one or more streams of records in the stream register file (SRF) with a single operate instruction. This both amortizes the overhead of the operate instruction and exposes data parallelism.
Stream processors extend the capabilities of vector processors by adding a layer to the register hierarchy, and adding a layer of instruction sequencing that enables them to operate in record (rather than operation) order. The function of the vector register file (VRF) of a vector processor is split between the local register files (LRFs) and the stream register file (SRF) of a stream processor.
`
`
`
The LRFs stage data between ALU operations to exploit fine-grained producer-consumer locality (sometimes called kernel locality). To support a large number of ALUs, they have a very high aggregate bandwidth. Because they exploit only kernel locality, their capacity can be modest, a few thousand words, about the same size as a modern VRF. The stream register file (SRF) of a stream processor stages data to and from memory and stages data to and from the LRFs to exploit coarse-grained (sometimes called outer-loop) producer-consumer locality. Because it is relieved of the task of forwarding data to/from the ALUs, its bandwidth is modest (an order of magnitude less than the LRFs), which makes it economical to build SRFs large enough to exploit coarse-grained locality.
6.2 Balance
The ratios between arithmetic rate, memory bandwidth, and memory capacity on Merrimac are balanced based on cost and utility, so that the last dollar spent on each returns the same incremental improvement in performance. This balancing by diminishing returns gives ratios quite different from the common approach of fixing the ratio of GFLOPS to GBytes irrespective of cost. If we took this approach with Merrimac, we would have to provide 128 GBytes of memory (costing about $20K) for each $200 processor chip, making our processor-to-memory cost ratio 1:100. If one needs 128 GBytes of memory, it is more efficient to provide 64 nodes, even if the additional processors are not required; their cost is small compared to the memory.
A similar argument applies to the ratio of arithmetic to memory bandwidth. Merrimac provides only 20 GBytes/s (2.5 GWords/s) of memory bandwidth for 128 GFLOPS, a FLOP/Word ratio of over 50:1. Many vector machines have FLOP/Word ratios of 1:1 [21], and conventional microprocessors have ratios between 4:1 and 12:1 [22][23]. Providing even a 10:1 ratio on Merrimac would be prohibitively expensive. We would need 80 external DRAMs rather than 16. Interfacing to this large number of DRAMs would require at least 5 external memory interface chips (pin expanders). As with memory capacity, taking this fixed-balance approach to memory bandwidth causes the cost of bandwidth to dominate the cost of processing. It is more efficient to just use Merrimac processor chips to directly interface to 16 DRAMs each. For memory-bandwidth-dominated computations (e.g., sparse vector-matrix product), most of the arithmetic will be idle. However, even for such computations the Merrimac approach is more cost effective than trying to provide a much larger memory bandwidth for a single node.
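The numbers behind this argument:

128 GFLOPS / 2.5 GWords/s ≈ 51 FLOPs per word of memory bandwidth (the ratio of over 50:1)
a 10:1 ratio would require 12.8 GWords/s ≈ 100 GBytes/s, roughly five times the bandwidth of the 16-DRAM interface, hence the 80 external DRAMs mentioned above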
6.3 High-Radix Routers
In the 1980s and early 90s, when routers had pin bandwidth in the range of 1-10 Gb/s, torus networks gave