Merrimac: Supercomputing with Streams

William J. Dally    François Labonte    Abhishek Das    Patrick Hanrahan
Jung-Ho Ahn    Jayanth Gummaraju    Mattan Erez    Nuwan Jayasena
Ian Buck    Timothy J. Knight    Ujval J. Kapasi
ABSTRACT

Merrimac uses stream architecture and advanced interconnection networks to give an order of magnitude more performance per unit cost than cluster-based scientific computers built from the same technology. Organizing the computation into streams and exploiting the resulting locality using a register hierarchy enables a stream architecture to reduce the memory bandwidth required by representative applications by an order of magnitude or more. Hence a processing node with a fixed bandwidth (expensive) can support an order of magnitude more arithmetic units (inexpensive). This in turn allows a given level of performance to be achieved with fewer nodes (a 1-PFLOPS machine, for example, with just 8,192 nodes), resulting in greater reliability and simpler system management. We sketch the design of Merrimac, a streaming scientific computer that can be scaled from a $20K 2-TFLOPS workstation to a $20M 2-PFLOPS supercomputer, and present the results of some initial application experiments on this architecture.
1. Introduction

Modern semiconductor technology makes arithmetic inexpensive and bandwidth expensive. To exploit this shift in cost, a high-performance computer system must exploit locality, to raise the arithmetic intensity (the ratio of arithmetic to bandwidth) of the application, as well as parallelism, to keep a large number of arithmetic units busy. Expressing an application as a stream program fulfills both of these requirements. It exposes large amounts of parallelism across stream elements and reduces global bandwidth by expressing locality within and between kernels.
A stream processor exploits the parallelism exposed by a stream program, by providing 100s of arithmetic units, and exploits the locality of a stream program, by providing a deep register hierarchy. In particular, memory bandwidth is reduced by capturing short-term producer-consumer locality in large local register files, and long-term producer-consumer locality in a stream register file. This locality might not be captured by a reactive cache. More importantly, the stream register file is aligned with individual ALUs and requires only local on-chip communication, while a cache requires global on-chip communication.

This work was supported in part by the Department of Energy, NNSA, under the ASCI Alliances program (contract LLL-B341491), in part by National Science Foundation Fellowships, in part by Stanford Graduate Fellowships, in part by the DARPA Smart Memories Project (contract MDA904-98-R-S855), and in part by the DARPA Polymorphous Computing Architectures Project (contract F29601-00-2-0085).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
We are designing Merrimac¹, a scientific computer system tailored to exploit the parallelism and locality of streams. The core of Merrimac is a single-chip (90nm CMOS) stream processor that is expected to have 128 GFLOPS peak performance. This processor chip, along with 16 high-bandwidth DRAM chips (2 GBytes of memory), forms a single Merrimac node. Application experiments suggest that this single-node Merrimac will sustain up to half of peak performance on a range of scientific applications. With an estimated parts cost of less than $1K per 128 GFLOPS node (including network), we expect a Merrimac machine to provide both capability and capacity, being more cost effective than machines based on commodity microprocessors.

Merrimac employs a high-radix interconnection network to connect 16 nodes (2 TFLOPS) on a single board, 512 nodes (64 TFLOPS) in a cabinet, and 8K nodes (1 PFLOPS) in 16 cabinets. The network provides a flat shared address space across the multi-cabinet system, with flat bandwidth across a board (16 nodes) and a global bandwidth of 1/8 the local bandwidth anywhere in the system.

We have coded three representative scientific applications as stream programs and measured their performance on a simulated Merrimac node. These initial experiments show that typical scientific applications cast as stream programs maintain a high arithmetic-to-memory-bandwidth ratio and achieve a high fraction of peak performance. The applications simulated have computation-to-memory ratios in the range of 7:1 to 50:1, achieving between 18% and 52% of the peak performance of the machine, with less than 1.5% of data references traveling off-chip.

The remainder of this paper describes stream processors and the Merrimac project in more detail. In Section 2 we see that modern VLSI technology makes arithmetic cheap and bandwidth expensive. Section 3 shows how a stream processor exploits application locality using a bandwidth hierarchy and application parallelism by using large numbers of ALUs. Merrimac, a supercomputer based on streams, is described in Section 4. We show the performance of a simulated stream processor on a number of applications in Section 5. Issues related to scientific computing with streams are discussed in Section 6.
SC'03, November 15-21, 2003, Phoenix, Arizona, USA
Copyright 2003 ACM 1-58113-695-1/03/0011...$5.00

¹Merrimac is a Native American word meaning "fast moving stream".
2. VLSI enables inexpensive arithmetic, making bandwidth the limiting factor

Modern VLSI fabrication processes make it very inexpensive, in terms of both area and power, to put large amounts of arithmetic capability on a chip. With arithmetic almost free, global bandwidth, both on-chip and off-chip, becomes the factor limiting performance.
In 0.13 µm CMOS technology, a 64-bit floating-point unit (FPU) (multiplier and adder) has an area of less than 1 mm² and dissipates about 50 pJ of energy per operation [1]. Over 200 such FPUs can fit on a 14mm x 14mm chip that can be manufactured in volume (including testing and packaging) for less than $100. Even at a conservative operating frequency of 500 MHz, this gives a cost of 64-bit floating-point arithmetic of less than $1 per GFLOPS and a power of less than 50 mW per GFLOPS. Even though one cannot completely fill a chip with FPUs, modern graphics chips come close to realizing these cost-performance levels. For example, the nVidia NV30 sustains 100 GFLOPS (32-bit floating point) [2].
The already low cost of arithmetic is decreasing rapidly as technology improves. We describe a CMOS technology by its drawn gate length L. Most chips today are manufactured with L = 0.13 µm. Historical trends show that L decreases at about 14% per year [3]. The cost of a GFLOPS of arithmetic scales as L³ and hence decreases at a rate of about 35% per year [4]. Every five years, L is halved, four times as many FPUs fit on a chip of a given area, and they operate twice as fast, giving a total of eight times the performance for the same cost. Of equal importance, the switching energy also scales as L³, so every five years we get eight times the arithmetic performance for the same power.
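As a rough illustration of this scaling argument, the following C++ sketch steps through three five-year generations; the starting FPU count and frequency are the approximate figures quoted above, and each multiply-add is counted as two floating-point operations (these are simplifying assumptions, not design data).

// Illustrative only: L shrinks, FPU count grows as 1/L^2, frequency as 1/L,
// so GFLOPS per chip (at roughly constant cost and power) grows as 1/L^3.
#include <cstdio>

int main() {
    double L_um     = 0.13;  // drawn gate length (um)
    double fpus     = 200;   // 64-bit FPUs per chip (order of magnitude)
    double freq_ghz = 0.5;   // conservative operating frequency
    for (int gen = 0; gen < 3; ++gen) {        // three five-year generations
        double gflops = fpus * freq_ghz * 2;   // 2 flops per multiply-add
        std::printf("L = %5.3f um: ~%4.0f FPUs @ %.1f GHz -> ~%5.0f GFLOPS/chip\n",
                    L_um, fpus, freq_ghz, gflops);
        L_um     *= 0.5;   // L roughly halves every five years
        fpus     *= 4;     // FPU area scales as L^2
        freq_ghz *= 2;     // gate delay scales roughly as L
    }
    return 0;
}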
Global bandwidth, not arithmetic, is the factor limiting the performance and dominating the power of modern processors. The cost of bandwidth grows at least linearly with distance in terms of both availability and power [4]. To keep distances constant across technology generations, we express distance in units of tracks. One track (or 1X) is the distance between two minimum-width wires on a chip. In 0.13 µm technology, 1X ≈ 0.5 µm. We can put ten times as many 10³X wires on a chip as we can 10⁴X wires. More importantly, moving a bit of information over a 10³X wire takes only 1/10th the energy of moving a bit over a 10⁴X wire. In 0.13 µm technology, for example, transporting the three 64-bit operands for a 50 pJ floating-point operation over global 3 x 10⁴X wires consumes about 1 nJ, 20 times the energy required to do the operation. In contrast, transporting these operands on local wires with an average length of 3 x 10²X takes only 10 pJ, much less than the cost of the operation.
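The comparison can be reconstructed numerically. In the sketch below, the per-bit, per-track wire energy is inferred from the 1 nJ figure quoted above rather than taken from a process model, so it is an illustrative assumption only.

// Back-of-the-envelope reconstruction of the wire-energy comparison above.
#include <cstdio>

int main() {
    const double bits          = 3 * 64;  // three 64-bit operands
    const double e_global_J    = 1e-9;    // ~1 nJ over 3x10^4-track wires
    const double tracks_global = 3e4;
    const double e_bit_track   = e_global_J / (bits * tracks_global);

    const double tracks_local  = 3e2;     // average local wire length
    const double e_local_J     = e_bit_track * bits * tracks_local;

    std::printf("inferred wire energy: %.2f fJ per bit per track\n",
                e_bit_track * 1e15);
    std::printf("operand transport: global ~%.0f pJ, local ~%.0f pJ "
                "(FP op ~50 pJ)\n", e_global_J * 1e12, e_local_J * 1e12);
    return 0;
}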
Contemporary architectures are not yet tuned to these developing VLSI constraints. These architectures are unable to use more than a few arithmetic units because they are designed for applications with limited parallelism and are hindered by a low-bandwidth memory system. Their main goal is to provide high performance for mostly serial code that is highly sensitive to memory latency and not bandwidth. Exploiting the capabilities of today's VLSI technology requires an architecture that can exploit parallelism, to keep large numbers of arithmetic units busy while hiding the ever increasing latency to memory, and locality, to increase the ratio of arithmetic, which is inexpensive, to global bandwidth, which is the limiting factor.
3. Stream Architecture exploits the characteristics of VLSI

A stream processor is able to take advantage of the large number of arithmetic units that VLSI technology enables, without exceeding the bandwidth limitations of the technology, by using a register hierarchy to exploit locality in the application. This greatly reduces the average distance an operand must travel to reach an FPU. As shown in Figure 1, a stream architecture consists of an array of clusters, each with a set of FPUs, a set of local register files (LRFs), and a bank of a stream register file (SRF). Each FPU in a cluster reads its operands out of an adjacent LRF over very short (≈100X) wires. FPU results are distributed to the other LRFs in a cluster, and accesses to the local SRF bank are made via the cluster switch over short (≈1,000X) wires. While the SRF is similar in size to a cache, SRF accesses are much less expensive than cache accesses because they are aligned and do not require a tag lookup. Each cluster accesses its own bank of the SRF over short wires. In contrast, accessing a cache requires a global communication over long (≈10,000X) wires. The SRF also plays another crucial role in keeping the arithmetic units busy by allowing the software to hide long memory latencies. An entire stream is transferred between the SRF and the memory with a single instruction. These stream memory operations generate a large number of memory references to fill the very deep pipeline between processor and memory, allowing memory bandwidth to be maintained in the presence of latency. Arithmetic units are kept busy by overlapping the execution of arithmetic kernels with these stream memory operations.
To see how a stream processor exploits locality, consider a simple application expressed as a stream program (Figure 2). This figure shows a synthetic application that is designed to have the same bandwidth demands as the StreamFEM application (Section 5). Each iteration, the application streams a set of 5-word grid cells into a series of four kernels. The kernels operate on the data, performing the number of operations indicated, and pass intermediate results on to the next kernel. To perform a table lookup, kernel K1 generates an index stream that is used to reference a table in memory, generating a 3-word-per-element stream into kernel K3.
Figure 3 shows how the stream program of Figure 2 maps to the register hierarchy of a stream processor. The grid cells start in memory and are read a strip at a time into a buffer in the SRF. A typical strip might be 1024 5-word records.² Once a strip of cells is in the SRF, kernel K1 is run, generating a strip of indices and a strip of intermediate results in the SRF. Kernel K2 is run on the results, generating a second set of intermediate results, while the indices are applied to memory to read a strip of table values into the SRF. Table values that are repeatedly accessed are provided by the cache. The process continues until the updates to the strip of grid cells, generated by kernel K4, are written back to memory. Each strip is software pipelined so that the loading of one strip of cells is overlapped with the execution of the four kernels on the previous strip of cells and the storing of the strip before that.

This synthetic application shows how the stream architecture exploits locality. In Section 5 we shall see that actual applications exploit locality in a similar manner. Kernels K1...K4 perform all of their 300 operations out of LRFs, performing 900 LRF accesses per grid point. The streams between the kernels are passed through the SRF, generating 58 words of SRF bandwidth per grid point. Finally, memory accesses total 12 words. This gives us a bandwidth ratio of 75:5:1, that is, 75 LRF references and 5 SRF references for every memory reference.
²The strip size is chosen by the compiler to use the entire SRF without any spilling.
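The following C++ sketch illustrates the strip-mined execution just described. The kernel bodies are trivial placeholders and the names are illustrative; on Merrimac the strip loads, kernel invocations, and stores would be issued as stream instructions and software pipelined rather than run sequentially as here.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Cell { double v[5]; };   // 5-word grid cell
struct Rec  { double v[3]; };   // 3-word intermediate record

// Placeholder kernels with the data shapes of Figure 2.
static void K1(const std::vector<Cell>& in, std::vector<int>& idx,
               std::vector<Rec>& out) {
    for (std::size_t i = 0; i < in.size(); ++i) {
        idx.push_back(static_cast<int>(i % 16));              // index stream
        out.push_back({in[i].v[0], in[i].v[1], in[i].v[2]});  // results 1
    }
}
static void K2(const std::vector<Rec>& in, std::vector<Rec>& out) {
    for (const Rec& r : in) out.push_back({r.v[0] + 1.0, r.v[1], r.v[2]});
}
static void K3(const std::vector<Rec>& in, const std::vector<Rec>& tbl,
               std::vector<Rec>& out) {
    for (std::size_t i = 0; i < in.size(); ++i)
        out.push_back({in[i].v[0] * tbl[i].v[0], in[i].v[1], in[i].v[2]});
}
static void K4(const std::vector<Rec>& in, std::vector<Cell>& out) {
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = {in[i].v[0], in[i].v[1], in[i].v[2], 0.0, 0.0};
}

// One pass over the grid, one strip at a time (assumes a non-empty table).
void process(const std::vector<Cell>& grid, const std::vector<Rec>& table,
             std::vector<Cell>& updates, std::size_t strip = 1024) {
    for (std::size_t base = 0; base < grid.size(); base += strip) {
        std::size_t n = std::min(strip, grid.size() - base);
        // Stream-load one strip of cells into the SRF (here, a local buffer).
        std::vector<Cell> cells(grid.begin() + base, grid.begin() + base + n);

        std::vector<int> idx;
        std::vector<Rec> r1, r2, r3, tblvals;
        std::vector<Cell> out(n);

        K1(cells, idx, r1);                    // produces indices + results 1
        for (int i : idx)                      // gather of table values
            tblvals.push_back(table[static_cast<std::size_t>(i) % table.size()]);
        K2(r1, r2);
        K3(r2, tblvals, r3);
        K4(r3, out);                           // updated cells for this strip

        std::copy(out.begin(), out.end(), updates.begin() + base);
        // On Merrimac the load of strip i+1 and the store of strip i-1 are
        // software pipelined with the kernels running on strip i.
    }
}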
[Figure 1 diagram: clusters, each with a stream register file bank, connect through a global switch to cache banks and DRAM main memory; wires crossing to the global switch and chip boundary are on the order of 10,000X long.]
Figure 1: A stream processor consists of an array of clusters, each having a number of functional units with local register files and a stream register file bank connected by a cluster switch. The clusters are connected to each other and to cache banks by a global switch. At each level of this hierarchy (local register, intra-cluster, and inter-cluster) the wires get an order of magnitude longer.
[Figure 2 diagram: a grid of cells streams through the kernels and an updated grid of cells is produced.]
Figure 2: A synthetic stream application, modeled after StreamFEM (Section 5), consists of a set of kernels K1...K4 that pass streams of data between them.
[Figure 3 diagram: streams (Cells, Indices, Results 1-4, Table Values) flow between Memory, the Stream Cache, the Stream Register File, and the Local Registers as kernels K1 (50 ops), K2 (100 ops), K3 (70 ops), and K4 (80 ops) execute; per grid point the program performs 300 operations and moves 12 words at the memory level, 9.5 words at the stream cache, 58 words at the SRF, and 900 words at the LRFs.]
Figure 3: The stream program of Figure 2 is mapped to the bandwidth hierarchy of a stream processor.
Put differently, 93% of all references are made from the LRFs, where bandwidth is very inexpensive, and only 1.2% of references are made from the memory system, where bandwidth is expensive for cache hits and very expensive for misses.³
A stream processor executes a stream instruction set. This instruction set includes scalar instructions, which are executed on a conventional scalar processor; stream execution instructions, which each trigger the execution of a kernel on one or more strips in the SRF; and stream memory instructions, which load and store (possibly with gather and scatter) a stream of records between memory and the SRF. This stream instruction set closely follows that of the Imagine streaming media processor [5, 6].
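The sketch below is a schematic of these instruction classes and of the sequence a scalar processor might dispatch for one strip of the Figure 2 program; it illustrates the structure of the stream instruction set, not Merrimac's actual encoding.

#include <string>
#include <vector>

// The scalar / stream-execution / stream-memory classes described above
// (loads and stores are both stream memory instructions).
enum class Op {
    ScalarOp,     // executed on the scalar processor itself
    StreamLoad,   // memory -> SRF, optionally a gather
    StreamStore,  // SRF -> memory, optionally a scatter (or scatter-add)
    KernelExec    // run a kernel over one or more strips in the SRF
};

struct StreamInstr {
    Op op;
    std::string arg;   // kernel or stream name (illustrative)
};

// Dispatch sequence for one strip of the Figure 2 program.
std::vector<StreamInstr> one_strip() {
    return {
        {Op::StreamLoad,  "cells[strip]"},
        {Op::KernelExec,  "K1"},
        {Op::StreamLoad,  "table values (gather via index stream)"},
        {Op::KernelExec,  "K2"},
        {Op::KernelExec,  "K3"},
        {Op::KernelExec,  "K4"},
        {Op::StreamStore, "updated cells[strip]"},
    };
}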
Merrimac also provides hardware support for a scatter-add instruction. This instruction is an example of a new architectural feature that is enabled by programming in streams. A scatter-add acts as a regular scatter, but adds each value to the data already at each specified memory address rather than simply overwriting the data. This type of operation was discussed from a parallel-algorithm perspective in [7].
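The reference semantics of scatter-add can be stated in a few lines. The sketch below is sequential; the hardware is presumed to combine colliding updates so that parallel clusters can scatter to overlapping addresses.

#include <cstddef>
#include <vector>

// mem[index[i]] += values[i] for every element of the value stream.
void scatter_add(std::vector<double>& mem,
                 const std::vector<std::size_t>& index,
                 const std::vector<double>& values) {
    for (std::size_t i = 0; i < values.size(); ++i)
        mem[index[i]] += values[i];   // add to, rather than overwrite, memory
}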
4. Sketch of Merrimac: a Streaming Scientific Computer
[Figure 4 diagram: floorplan of one cluster, with four 64-bit FP/INT MADD units and banks of 64-word register files; the cluster is approximately 2.3 mm wide.]

Figure 4: Floorplan of a Merrimac cluster.
Each Merrimac node contains a stream processor (as illustrated in Figure 1) with 16 arithmetic clusters. Each cluster contains four floating-point multiply-add (MADD) units, 768 64-bit words of local registers, and 8K words of stream register file. The entire stream register file has a capacity of 128K 64-bit words, distributed across the 16 clusters. A floorplan for one cluster is shown in Figure 4. Each MADD unit measures 0.9mm x 0.6mm and the entire cluster measures 2.3mm x 1.6mm. We conservatively plan to operate with a clock cycle of 1 ns (37 FO4 inverters in 90nm [3]), giving a performance of 8 GFLOPS per cluster and 128 GFLOPS across the 16 clusters.
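The peak-performance and capacity figures follow directly from these parameters (assuming each MADD contributes two floating-point operations per cycle):

#include <cstdio>

int main() {
    const int    clusters       = 16;
    const int    madds          = 4;     // MADD units per cluster
    const double clock_ghz      = 1.0;   // 1 ns cycle
    const int    flops_per_madd = 2;     // one multiply + one add

    const double gflops_cluster = madds * flops_per_madd * clock_ghz;
    const double gflops_chip    = gflops_cluster * clusters;

    const int lrf_words = 768;        // 64-bit words of LRF per cluster
    const int srf_words = 8 * 1024;   // 64-bit words of SRF per cluster

    std::printf("%.0f GFLOPS per cluster, %.0f GFLOPS per chip\n",
                gflops_cluster, gflops_chip);
    std::printf("LRF: %d words/cluster; SRF: %dK words total\n",
                lrf_words, srf_words * clusters / 1024);
    return 0;
}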
A floorplan of the entire Merrimac stream processor chip is shown in Figure 5. The bulk of the chip is occupied by the 16 clusters.

³Many of our applications have very large kernels that in effect combine several smaller kernels, passing intermediate results through LRFs rather than SRFs. While this increases the fraction of LRF accesses, it also stresses LRF capacity. Ideally, the compiler will partition large kernels and combine small kernels to balance these two effects. We have not yet implemented this optimization.
[Figure 5 diagram: chip floorplan with 16 clusters, a MIPS64 20Kc scalar processor, a microcontroller, eight cache banks, memory interfaces, and the network interface; the die is approximately 9.8 mm across.]

Figure 5: Floorplan of a Merrimac stream processor chip.
The left edge of the chip holds the remainder of the node. A scalar processor [8] fetches all instructions, executes the scalar instructions itself, and dispatches stream execution instructions to the clusters (under control of the microcontroller) and stream memory instructions to the memory system. The node memory system consists of a set of address generators (not shown), a line-interleaved eight-bank 64K-word (512 KByte) cache, and interfaces for 16 external DRAM chips. A network interface directs off-node memory references to the routers. We estimate that each Merrimac processor will cost about $200 to manufacture and will dissipate a maximum of 31 W of power. Area and power estimates in a standard-cell process in 90nm technology are derived from models based on a previous implementation of a stream processor [1].
Figure 6 illustrates a single Merrimac board containing 16 nodes (16 128-GFLOPS stream processors, shown in Figure 5, each with 2 GBytes of DRAM) and four router chips. The router chips interconnect the 16 processors on the board, providing flat memory bandwidth on board of 20 GBytes/s per node. The routers also provide a gateway to the inter-board network, with a 4:1 reduction in memory bandwidth (to 5 GBytes/s per node) for inter-board references.
Larger Merrimac systems are interconnected by a five-stage folded-Clos [9] network⁴ using high-radix routers, as illustrated in Figure 7. The routers on each 16-node board serve as the first and last stage of this network. The basic building block of this network is a 48-input x 48-output router chip. Each bidirectional router channel (one input and one output) has a bandwidth of 2.5 GBytes/s (four 5 Gb/s differential signals) in each direction. On each 16-processor board, each of four routers has two 2.5 GBytes/s channels to/from each of the 16 processor chips and eight ports to/from the backplane switch. The remaining eight ports are unused. Thus each board provides a total of 32 channels to the backplane. At the backplane level, 32 routers connect one channel to each of the 32 boards and connect 16 channels to the system-level switch. A total of 512 2.5 GBytes/s channels traverse optical links to the system-level switch, where 512 routers connect all 48 ports to up to 48 backplanes (the figure shows just 32 backplanes).
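The per-node bandwidth figures follow from the channel counts above; the short check below assumes the stated 2.5 GBytes/s per channel per direction.

#include <cstdio>

int main() {
    const double ch_gb_s     = 2.5;          // GBytes/s per channel, each way
    const int    routers     = 4;            // routers per 16-node board
    const int    nodes       = 16;
    const int    ch_per_node = 2 * routers;  // two channels to each router
    const int    ch_to_bkpln = 8 * routers;  // eight backplane ports per router

    const double local_bw  = ch_per_node * ch_gb_s;          // per node
    const double global_bw = ch_to_bkpln * ch_gb_s / nodes;   // per node

    std::printf("on-board memory bandwidth:  %.0f GBytes/s per node\n", local_bw);
    std::printf("off-board memory bandwidth: %.0f GBytes/s per node (%.0f:1)\n",
                global_bw, local_bw / global_bw);
    return 0;
}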
Table 1 shows the estimated cost of a streaming supercomputer. The processor and router chips are modest-sized (10mm x 11mm) ASICs in 1000-pin flip-chip BGA packages that are expected to cost $200 each in moderate quantities (1000s).

⁴This topology is sometimes called a Fat Tree [10].
[Figure 7 diagram: boards 0-31 (16 processors, 2 TFLOPS, 32 GBytes each) connect through backplane routers (64 TFLOPS per backplane) and 512 x 2.5 GB/s optical (O/E) links to 512 system-level routers spanning the backplanes.]

Figure 7: A 2 PFLOPS Merrimac system uses a high-radix interconnection network.
[Figure 6 diagram: a 16-node board; each node pairs a 128 GFLOPS stream processor with 2 GBytes of DRAM over a 20 GB/s channel, for 2 TFLOPS and 32 GBytes per board; router links lead to the backplane.]

Figure 6: Sixteen 128 GFLOPS stream processors, each with 2 GBytes of DRAM memory, can be packaged on a single board. The board has a total of 2 TFLOPS of arithmetic and 32 GBytes of memory. Such a board is useful as a stand-alone scientific computer and as a building block for larger systems.
Item                    Cost ($)   Per Node Cost ($)
Processor Chip               200                 200
Router Chip                  200                  69
Memory Chip                   20                 320
Board                       1000                  63
Router Board                1000                   2
Backplane                   5000                  10
Global Router Board         5000                   5
Power                          1                  50
Per Node Cost                                    718
$/GFLOPS (128/Node)                                6
$/M-GUPS (250/Node)                                3

Table 1: Rough Per-Node Budget. Parts cost only; does not include I/O.
DRAM chips are projected to cost $20 each, making DRAM, at $320, the largest single cost item. Board and backplane costs, including connectors, capacitors, regulators, and other components, are amortized over the 16 nodes on each board and the 512 nodes in each backplane. The router board and global router board costs reflect the costs of the intra-cabinet and inter-cabinet networks respectively. Supplying and removing power costs about $1 per W, or about $50 per 50 W node. The overall cost is less than $1K per node, which translates into $6 per GFLOPS of peak performance and $3 per M-GUPS⁵.
⁵GUPS, or global updates per second, is a measure of global unstructured memory bandwidth. It is the number of single-word read-modify-write operations a machine can perform to memory locations randomly selected from the entire address space.
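Summing the per-node column of Table 1 (as reconstructed above) reproduces the roughly $718 total and the derived $/GFLOPS and $/M-GUPS figures, up to rounding:

#include <cstdio>

int main() {
    struct Item { const char* name; double per_node; };
    const Item items[] = {
        {"Processor chip",      200},
        {"Router chips",         69},
        {"16 DRAM chips",       320},
        {"Board",                63},
        {"Router board",          2},
        {"Backplane",            10},
        {"Global router board",   5},
        {"Power delivery",       50},
    };
    double total = 0;
    for (const Item& it : items) total += it.per_node;
    std::printf("per-node parts cost: ~$%.0f (vs. $718 in Table 1)\n", total);
    std::printf("$/GFLOPS: %.1f   $/M-GUPS: %.1f\n", total / 128, total / 250);
    return 0;
}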
Application                    Sustained GFLOPS   FP Ops / Mem Ref   LRF Refs          SRF Refs        Mem Refs
StreamFEM (Euler, quadratic)               32.2               23.5   169.5M (93.6%)    10.3M (5.7%)    1.4M (0.7%)
StreamFEM (MHD, cubic)                     33.5               50.6   733.3M (94.0%)    43.8M (5.6%)    3.2M (0.4%)
StreamMD                                   14.2               12.1    90.2M (97.5%)     1.6M (1.7%)    0.7M (0.8%)
StreamFLO                                  11.4                7.4   234.3M (95.7%)     7.2M (2.9%)    3.4M (1.4%)

Table 2: Performance measurements of streaming scientific applications.
5. Applications exploit the locality of a stream processor

Three scientific applications were used to evaluate the single-node performance of the Merrimac stream processor: StreamFEM, StreamMD, and StreamFLO. These applications feature a number of characteristics which are common in scientific applications in general, including regular and irregular multidimensional meshes, multigrid techniques, and particle-in-cell computations.
StreamFEM is a finite element application designed to solve systems of first-order conservation laws on general unstructured meshes. The StreamFEM implementation has the capability of solving systems of 2D conservation laws corresponding to scalar transport, compressible gas dynamics, and magnetohydrodynamics (MHD), using element approximation spaces ranging from piecewise constant to piecewise cubic polynomials. StreamFEM uses the discontinuous Galerkin (DG) method developed by Reed and Hill [11] and later popularized by Cockburn, Hou and Shu [12]. In the present StreamFEM implementation, the limiting procedure of Cockburn et al. has been replaced by variational discontinuity capturing terms as discussed in Jaffre, Johnson and Szepessy [13], with further overall algorithmic simplifications as discussed in Barth [14].
StreamMD is a molecular dynamics solver [15, 16] that is based on solving Newton's equations of motion. The velocity Verlet (or leap-frog) method is used to integrate the equations of motion in time; using this method, it is possible to simulate the complex trajectories of atoms and molecules for very long periods of time. The present StreamMD implementation simulates a box of water molecules, with the potential energy function defined as the sum of two terms: the electrostatic potential and the Van der Waals potential. A cutoff is applied so that all particles which are at a distance greater than rcutoff do not interact. A 3D gridding structure is used to accelerate the determination of which particles are close enough to interact: each grid cell contains a list of the particles within that cell, and each timestep particles may move between grid cells. StreamMD makes use of the scatter-add functionality of Merrimac by computing the pairwise particle forces in parallel and accumulating the forces on each particle by scattering them to memory.
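A highly simplified sketch of this interaction pattern is shown below: a cell list limits the candidate pairs, a cutoff test rejects distant pairs, and per-particle forces are accumulated with scatter-add semantics. The force expression is a placeholder, not the water potential StreamMD actually evaluates, and the cell list here covers only pairs within a single cell.

#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// Accumulate pairwise forces for the particles listed in each grid cell.
void accumulate_forces(const std::vector<Vec3>& pos,
                       const std::vector<std::vector<std::size_t>>& cells,
                       double rcut,
                       std::vector<Vec3>& force) {
    const double rcut2 = rcut * rcut;
    for (const auto& members : cells) {                // particles in one cell
        for (std::size_t a = 0; a < members.size(); ++a) {
            for (std::size_t b = a + 1; b < members.size(); ++b) {
                const std::size_t i = members[a], j = members[b];
                const Vec3 d{pos[i].x - pos[j].x, pos[i].y - pos[j].y,
                             pos[i].z - pos[j].z};
                const double r2 = d.x * d.x + d.y * d.y + d.z * d.z;
                if (r2 > rcut2) continue;              // beyond the cutoff
                const double s = 1.0 / (r2 * std::sqrt(r2));  // placeholder law
                // On Merrimac these per-particle updates are scatter-adds, so
                // many pairs can be evaluated in parallel across clusters.
                force[i].x += s * d.x; force[i].y += s * d.y; force[i].z += s * d.z;
                force[j].x -= s * d.x; force[j].y -= s * d.y; force[j].z -= s * d.z;
            }
        }
    }
}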
StreamFLO [17] is a finite volume 2D Euler solver that uses a non-linear multigrid algorithm. It is based on the FLO82 code [18][19], which influenced many industrial and research codes. The choice of the code is motivated by the need for an application that is representative of a typical computational fluid dynamics application, without unnecessary complexity. A cell-centered finite-volume formulation is used to solve the fluid equations together with multigrid acceleration. Time integration is performed using a five-stage Runge-Kutta scheme.
Table 2 presents measurements from running these three applications on a cycle-accurate simulator of one Merrimac node. These simulations were run on a version of the simulator that included four 2-input multiply/add units per cluster (for a peak performance of 64 GFLOPS/node) rather than the four integrated 3-input MADD units (128 GFLOPS/node) of the current design.
The Sustained GFLOPS and FP Ops / Mem Ref columns illustrate the arithmetic intensity of the applications; they are able to sustain from 18% to 52% of the node's peak arithmetic performance, by performing from 7 to 50 floating-point operations for each global memory access. Note that only "real" ops are counted in this figure, such as floating-point add/multiply/compare instructions, and not non-arithmetic ops such as branches. Divides are counted as single floating-point operations, even though each divide requires several multiplication and addition operations when executed on the hardware. This leads to the lower performance numbers for StreamMD and StreamFLO; for example, the sustained performance of StreamFLO would double if we counted all the multiplies and adds required for divisions as well.
The right-most three columns list the respective numbers of LRF, SRF, and memory references made by the program, along with the percentage of references satisfied by each level. Note that only a small fraction of references, usually less than 1%, require communication over global (>10,000X or off-chip) wires, and that over 95% of all data movement is on local (100X) wires (at the LRF level). The register hierarchy of a stream processor exposes costly global communication and allows the locality inherent in applications to be exploited to keep communications local.
Exploiting locality using a register hierarchy increases performance and reduces power dissipation. By performing less data movement per arithmetic operation, we can support a much larger number of arithmetic units before saturating the limited global bandwidth. At the same time, power per operation is dramatically reduced by eliminating much of the global communication that dominates power.
6. Discussion

6.1 Streams vs Vectors

Stream processors share with vector processors, like the Cray-1 through Cray C90 [20][21], the ability to hide latency, amortize instruction overhead, and expose data parallelism by operating on large aggregates of data. In a similar manner, a stream processor such as Merrimac hides memory latency by fetching a stream of records with a single stream load instruction. A kernel is performed on one or more streams of records in the stream register file (SRF) with a single operate instruction. This both amortizes the overhead of the operate instruction and exposes data parallelism.
Stream processors extend the capabilities of vector processors by adding a layer to the register hierarchy, and adding a layer of instruction sequencing that enables them to operate in record (rather than operation) order. The functions of the vector register file (VRF) of a vector processor are split between the local register files (LRFs)
and the stream register file (SRF) of a stream processor. The LRFs stage data between ALU operations to exploit fine-grained producer-consumer locality (sometimes called kernel locality). To support a large number of ALUs, they have a very high aggregate bandwidth. Because they exploit only kernel locality, their capacity can be modest, a few thousand words, about the same size as a modern VRF. The stream register file (SRF) of a stream processor stages data to and from memory and stages data to and from the LRFs to exploit coarse-grained (sometimes called outer-loop) producer-consumer locality. Because it is relieved of the task of forwarding data to/from the ALUs, its bandwidth is modest (an order of magnitude less than the LRFs), which makes it economical to build SRFs large enough to exploit coarse-grained locality.
6.2 Balance

The ratios between arithmetic rate, memory bandwidth, and memory capacity on Merrimac are balanced based on cost and utility, so that the last dollar spent on each returns the same incremental improvement in performance. This balancing by diminishing returns gives ratios quite different from the common approach of fixing the ratio of GFLOPS to GBytes irrespective of cost. If we took this approach with Merrimac, we would have to provide 128 GBytes of memory (costing about $20K) for each $200 processor chip, making our processor-to-memory cost ratio 1:100. If one needs 128 GBytes of memory, it is more efficient to provide 64 nodes, even if the additional processors are not required; their cost is small compared to the memory.
A similar argument applies to the ratio of arithmetic to memory bandwidth. Merrimac provides only 20 GBytes/s (2.5 GWords/s) of memory bandwidth for 128 GFLOPS, a FLOP/Word ratio of over 50:1. Many vector machines have FLOP/Word ratios of 1:1 [21], and conventional microprocessors have ratios between 4:1 and 12:1 [22][23]. Providing even a 10:1 ratio on Merrimac would be prohibitively expensive. We would need 80 external DRAMs rather than 16. Interfacing to this large number of DRAMs would require at least 5 external memory interface chips (pin expanders). As with memory capacity, taking this fixed-balance approach to memory bandwidth causes the cost of bandwidth to dominate the cost of processing. It is more efficient to just use Merrimac processor chips, each directly interfacing to 16 DRAMs. For memory-bandwidth-dominated computations (e.g., sparse vector-matrix product), most of the arithmetic will be idle. However, even for such computations the Merrimac approach is more cost effective than trying to provide a much larger memory bandwidth for a single node.
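The arithmetic behind this balance argument is summarized below; the DRAM-count estimate assumes memory bandwidth scales linearly with the number of DRAM chips, which gives roughly the 80 chips quoted above.

#include <cstdio>

int main() {
    const double gflops       = 128.0;
    const double mem_gbytes_s = 20.0;
    const double mem_gwords_s = mem_gbytes_s / 8.0;  // 64-bit words
    const int    dram_chips   = 16;

    std::printf("FLOP:Word ratio = %.0f:1\n", gflops / mem_gwords_s);

    const double target_ratio  = 10.0;               // hypothetical 10:1 design
    const double needed_gwords = gflops / target_ratio;
    const double needed_drams  = dram_chips * needed_gwords / mem_gwords_s;
    std::printf("a 10:1 ratio needs %.1f GWords/s, i.e. roughly %.0f DRAMs\n",
                needed_gwords, needed_drams);
    return 0;
}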
6.3 High-Radix Routers

In the 1980s and early 90s, when routers had pin bandwidth in the range of 1-10 Gb/s, torus networks gave
