David E. Culler, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Brent Chun,
Steven Lumetta, Alan Mainwaring, Richard Martin, Chad Yoshikawa, Frederick Wong

Computer Science Division
University of California, Berkeley
`
Abstract: The UC Berkeley Network of Workstations (NOW) project demonstrates a new approach to large-scale system design enabled by technology advances that provide inexpensive, low latency, high bandwidth, scalable interconnection networks. This paper provides an overview of the hardware and software architecture of NOW and reports on the performance obtained at each layer of the system: Active Messages, MPI message passing, and benchmark parallel applications.
`
1 Introduction

In the early 1990s it was often said that the "Killer Micro" had attacked the supercomputer market, much as it had the minicomputer and mainframe markets earlier. This attack came in the form of massively parallel processors (MPPs), which repackaged the single-chip microprocessor, cache, DRAM, and system chip-set of workstations and PCs in a dense configuration to construct very large parallel computing systems. However, another technological revolution was brewing in these MPP systems - the single-chip switch - which enabled building inexpensive, low latency, high bandwidth, scalable interconnection networks. As with other important technologies, this "killer switch" has taken on a role far beyond its initial conception. Emerging from the esoteric confines of MPP backplanes, it has become available in a form that is readily deployed with commodity workstations and PCs. This switch is the basis for system area networks, which have the performance and scalability of MPP interconnects and the flexibility of a local area network, but operate on a somewhat restricted physical scale.
`
The Berkeley NOW project seeks to demonstrate that it is viable to build large parallel computing systems that are fast, inexpensive, and highly available, by simply snapping these switches together with the latest commodity components. Such cost-effective, incrementally scalable systems provide a basis for traditional parallel computing, but also for novel applications, such as Internet services [Brew96].
`
This paper provides an overview of the Berkeley NOW as a parallel computing system. Section 2 gives a description of the NOW hardware configuration and its layered software architecture. In the following sections, the layers are described from the bottom up. Section 3 describes the Active Message layer and compares its performance to what has been achieved on MPPs. Section 4 shows the performance achieved through MPI, built on top of Active Messages. Section 5 illustrates the application performance of NOW using the NAS Parallel Benchmarks in MPI. Section 6 provides a more detailed discussion of the world's leading disk-to-disk sort, which brings out a very important property of this class of system: the ability to perform I/O concurrently to disks on every node.
`
2 Berkeley NOW System

The hardware configuration of the Berkeley NOW system consists of one hundred and five Sun Ultra 170 workstations, connected by a large Myricom network [Bode95], and packaged into 19-inch racks. Each workstation contains a 167 MHz UltraSPARC microprocessor with a 512 KB level-2 cache, 128 MB of memory, two 2.3 GB disks, Ethernet, and a Myricom "Lanai" network interface card (NIC) on the SBus. The NIC has a 37.5 MHz embedded processor and three DMA engines, which compete for bandwidth to 256 KB of embedded SRAM. The node architecture is shown in Figure 1.

The network uses multiple stages of Myricom switches, each with eight 160 MB/s bidirectional ports, in a variant of a fat-tree topology.
`
2.1 Packaging

We encountered a number of interesting engineering issues in assembling a cluster of this size that are not so apparent in smaller clusters, such as our earlier 32-node prototype. This rack-and-stack style of packaging is extremely scalable, both in the number of nodes and in the ability to upgrade nodes over time. However, structured cable management is critical. In tightly packaged systems the interconnect is hidden in the center of the machine. When multiple systems are placed in a machine room, all the interconnect is hidden under the floor in an indecipherable mess. In clusters, however, the interconnect is a clearly exposed part of the design (a bit like the service conduits in deconstructionist buildings). Having the interconnect exposed is valuable for working on the system, but it must stay orderly and well structured, or it becomes both unsightly and difficult to manage.
Figure 1. NOW Node Configuration: UltraSPARC and L2 cache on the UPA, memory, and an SBus-attached Myrinet NIC; 160 MB/s bidirectional links connect to the Myricom network of 8-port wormhole switches.
`
`
The Berkeley NOW has four distinct interconnection networks. First, the Myrinet provides high-speed communication within the cluster; we discuss this in detail below. Second, switched Ethernet into an ATM backbone provides scalable external access to the cluster. The need for an external network that scales with the size of the cluster was not apparent when we began the project, but the traffic between the cluster and other servers, especially file servers, is an important design consideration. Third, a terminal concentrator provides direct console access to all the nodes via the serial port. This is needed only in situations when a node cannot be rebooted through the network, or during system development and debugging. Fourth, conventional AC lines provide a power distribution network. As clusters transition to the commercial mainstream, one engineering element will be to consolidate these layers of interconnect into a clean, modular design. Figure 2 shows a picture of the NOW system.

Figure 2. NOW System
`
2.2 Network topology

The Myrinet switches that form the high-speed interconnect use source routing and can be configured in arbitrary topologies. The NOW automatic mapping software can handle arbitrary interconnects [Mai*97]; however, we wire the machine as a variant of a fat-tree to create a system with more uniform bandwidth between nodes, thereby minimizing the impact of process placement. The topology is constrained by the use of 8-port (bidirectional) switches and by wiring density concerns. Initially we planned to run cables from all the nodes to a central rack of switches; however, the cable cross-sectional area near the switches became unmanageable as a result of bulky, heavily-shielded copper network cables. Using fiber-optic cables that are now available, the cable density may be reduced enough to centrally locate the switches.
`
In building an indirect network out of fixed-degree switches, the number of upward links depends on the number of downward links. We elected to attach five hosts to each first-level switch, which eliminates 40% of the cable mass. As shown in Figure 3, groups of seven of these switches are treated as a 35-node subcluster, with the 21 upward links spread over four level-two switches. Three of these subclusters are wired together to comprise the NOW. We have found, as a rule of thumb, that adding 10% extra nodes and extra ports greatly simplifies system administration, allowing for node failures, software or hardware upgrades, and system expansion. The arithmetic behind this wiring is sketched below.
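
As a sanity check on these counts, the following C fragment (our own illustration, not part of the NOW software) recomputes the subcluster numbers from the switch degree:

    #include <stdio.h>

    /* Illustrative check of the wiring arithmetic above: 8-port switches
     * with five hosts attached leave three upward links each. */
    int main(void) {
        const int ports = 8, hosts_per_switch = 5;
        const int up_per_switch = ports - hosts_per_switch;   /* 3 */
        const int level1_switches = 7;
        printf("hosts per subcluster:   %d\n",
               level1_switches * hosts_per_switch);           /* 35 */
        printf("uplinks per subcluster: %d\n",
               level1_switches * up_per_switch);              /* 21 */
        return 0;
    }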
`
Figure 3. NOW Myrinet Network Topology

2.3 Software Architecture

The system software for NOW employs a layered architecture, as illustrated in Figure 4. Each node runs a complete, independent Solaris Unix with the associated process management, memory management, file system, thread support, scheduler, and device drivers. We extend Solaris at each of these interfaces to support global operations over the NOW.
`
Process Management: A global OS layer, called GLUnix, provides NOW-wide process management as a layer on top of Solaris (via sockets, daemons, and signals). Using either a global shell, glush, or the glurun command, sequential processes can be started anywhere on the NOW, or parallel processes can be started on multiple nodes. Local pids are elevated to global pids, and the familiar process control operations, such as ctrl-C or ctrl-Z, work on global processes. The Unix process information and control utilities, such as ps and kill, are globalized as well.
`
Figure 4. NOW software architecture: parallel applications and demanding sequential applications run over the global OS layer, GLUnix (global process management, resource management, file system, and scheduling); each node is a commodity workstation with a full OS, a communication driver, and an intelligent NIC, joined by a scalable, low-latency network.
`
File System: A prototype file system, xFS, extends Solaris at the vnode interface to provide a global, high-performance file system [And*95b]. Files are striped over nodes in a RAID-like fashion so that each node can read file data at the bandwidth of its interface into the network. The aggregate bandwidth available to the nodes is that of all the disks. xFS uses a log-structured approach, much like Zebra [HaOu95], to minimize the cost of parity calculations. A single node accumulates enough of the log that it can write a block to each disk in a stripe group. Before writing the blocks, it calculates a parity block locally and then writes it along with the data blocks.
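
That parity step is small enough to sketch, assuming fixed-size blocks and XOR parity (our illustration, not xFS source):

    #include <stddef.h>
    #include <string.h>

    /* Compute a RAID-style parity block as the XOR of the data blocks
     * in a stripe group; the stripe can then be written without any
     * remote reads for parity. */
    void compute_parity(const unsigned char **blocks, int nblocks,
                        size_t blocksz, unsigned char *parity) {
        memset(parity, 0, blocksz);
        for (int b = 0; b < nblocks; b++)
            for (size_t i = 0; i < blocksz; i++)
                parity[i] ^= blocks[b][i];
    }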
`
An update-based file cache-coherence strategy is used, and the caches are managed cooperatively to increase the population of blocks covered by the collection of nodal caches. If a block about to be discarded is the last copy in the system, then it is cast off to a random remote node. Nodes take mercy on this block until it has aged to the point where it appears pointless to keep it in memory. This policy has the attractive property that actively used nodes behave like traditional clients while idle nodes behave like servers, so the cooperative file cache adapts dynamically to system usage.
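
The eviction decision can be pictured with a short sketch; the names, the recirculation counter, and the aging limit are our own assumptions, not xFS internals:

    #include <stdbool.h>
    #include <stdlib.h>

    struct cache_block {
        bool last_copy;        /* no other node caches this block */
        int  recirculations;   /* times this block has been cast off */
    };

    #define AGE_LIMIT 2        /* assumed aging threshold */

    void forward_to_node(int node, struct cache_block *blk);  /* hypothetical */

    /* On eviction, cast the last copy off to a random remote node
     * until it has aged; otherwise simply drop it. */
    void evict(struct cache_block *blk, int nnodes, int self) {
        if (blk->last_copy && blk->recirculations < AGE_LIMIT) {
            int target;
            do { target = rand() % nnodes; } while (target == self);
            blk->recirculations++;
            forward_to_node(target, blk);
        }
    }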
`
Virtual Memory: Two prototype global virtual memory systems have been developed to allow sequential processes to page to the memory of remote idle nodes, since communication within the NOW has higher bandwidth and much lower latency than access to local disks. One of these uses a custom Solaris segment driver to implement an external user-level pager which exchanges pages with remote page daemons. The other provides similar operation on specially mapped regions using only signals.
`
3 Active Messages

Active Messages are the basic communication primitives in NOW. This work continues our investigation of implementation trade-offs for fast communication layers [vE*92, Gol*96, Kri*96], and on NOW we have sought to generalize the approach and take full advantage of the complete OS on every node. The segment driver and device driver interface is used to provide applications with direct, protected user-level access to the network. Active Messages map to simple operations on queues and buffers that are shared between the user process and the communication firmware, which is executed on a dedicated processor embedded in the network interface card.
`
We have built two Active Message layers. The first, Generic Active Messages (gam), is oriented toward the traditional single-parallel-program-at-a-time style of parallel machines, and provides exactly the same API across a wide range of platforms [Cul*95]. This serves as a valuable basis for comparison.
`
The newer AM layer, AM-II [Main95], provides a much more general-purpose communication environment, which allows many simultaneous parallel programs, as well as client/server and system use. It is closely integrated with POSIX threads. The AM implementation is extremely versatile. It provides error detection and retry at the NIC-to-NIC level and allows the network to be reconfigured in a running system. A privileged mapper daemon explores the physical interconnection, derives deadlock-free routes, and distributes routes periodically [Mai*97]. AM-II provides a clean return-to-sender error model to support highly available applications.
`
The Active Messages communication model is essentially a simplified remote procedure call that can be implemented efficiently on a wide range of hardware. Three classes of messages are supported. Short messages pass eight 32-bit arguments to a handler on a destination node, which executes with the message data as arguments. Medium messages treat one of the arguments as a pointer to a 128-byte to 8 KB data buffer and invoke the handler with a pointer to a temporary data buffer at the destination. Bulk messages perform a memory-to-memory copy before invoking the handler. A request handler issues replies to the source node.
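
To make the request/handler/reply pattern concrete, here is a sketch in C. The interface below is hypothetical, written in the spirit of the model; it is not the actual gam or AM-II API:

    #include <stdint.h>

    typedef void (*am_handler_t)(int src, uint32_t *args, int nargs);

    /* assumed primitives: a short request carries up to eight 32-bit
     * arguments to a handler on the destination; a reply may only be
     * issued from inside a request handler */
    void am_request(int dest, am_handler_t h, uint32_t *args, int nargs);
    void am_reply(int src, am_handler_t h, uint32_t *args, int nargs);

    uint32_t counter;

    /* runs back on the requesting node with the reply data */
    void ack_handler(int src, uint32_t *args, int nargs) {
        (void)src; (void)args; (void)nargs;
    }

    /* runs on the destination node with the message data as arguments,
     * then issues a reply to the source */
    void incr_handler(int src, uint32_t *args, int nargs) {
        (void)nargs;
        counter += args[0];
        uint32_t ack = counter;
        am_reply(src, ack_handler, &ack, 1);
    }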
`
We have developed a microbenchmarking tool to characterize empirically the performance of Active Messages in terms of the LogP model [Cul*93, Cul*95]. Figure 5 compares the gam short-message LogP parameters on NOW with the best implementations on a range of parallel machines. The bars on the left show the one-way message time broken down into three components: send overhead (os), receive overhead (or), and the remaining latency (L). The bars on the right show the time per message (g = 1/MessageRate) for a sequence of messages. NOW obtains communication performance competitive with or superior to the more tightly integrated, albeit older, designs.
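
The two quantities plotted follow from simple timing loops; a sketch of the idea (ours, not the actual tool; ping_pong and burst stand in for Active Message round-trip and back-to-back send loops):

    #include <stdio.h>

    double ping_pong(int iterations);  /* hypothetical: seconds for n round trips */
    double burst(int messages);        /* hypothetical: seconds to issue n sends */

    /* The one-way time approximates os + L + or; the gap g is the
     * inverse of the sustained message rate. */
    void measure(void) {
        int n = 100000;
        double one_way = ping_pong(n) / (2.0 * n);
        double g = burst(n) / n;
        printf("one-way: %.2f us  gap: %.2f us\n",
               one_way * 1e6, g * 1e6);
    }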
`
The overhead on NOW is dominated by the time to write and read data across the I/O bus. The Paragon has a dedicated message processor and network interface on the memory bus; however, there is considerable overhead in the processor-to-processor transfer due to the cache coherence protocol, and the latency is large because the message processors must write the data to the NI and read it from the NI. The actual time on the wire is quite small. The Meiko has a dedicated message processor on the memory bus with a direct connection to the network, but the overhead is dominated by the exchange instruction that queues a message descriptor for the message processor, and the latency is dominated by the slow message processor accessing the data from host memory. Medium and bulk messages achieve 38 MB/s on NOW, limited primarily by the SBus.
`
Figure 5. Active Messages LogP Performance: per-platform bars of L, or, and os (left) and the gap g (right) for NOW and MPP platforms including the Paragon.
`
Traditional communication APIs and programming models are built upon the Active Message layer. We have built a version of the MPI message passing standard for parallel programs in this fashion, as well as a version of the Berkeley Sockets API, called Fast Sockets [Rod*97]. A shared address space parallel C, called Split-C [Cul*93], compiles directly to Active Messages, whereas HPF [PGI] compiles down to the MPI layer.
`
4 MPI

Our implementation of MPI is based on the MPICH reference implementation, but realizes the abstract device interface (ADI) through Active Message operations. This approach achieves good performance and yet is portable across Active Message platforms. The MPI communicator and related information occupy a full short message. Thus, a zero-byte control message is implemented as a single small-message request-response, with the handler performing the match operation against a receive table. The one-way time for an echo test is 15 µs. MPI messages of less than 8 KB use an adaptive protocol implemented with medium Active Messages. Each node maintains a temporary input buffer for each sender, and senders keep track of whether their buffers are available on the destination nodes. If the buffer is available, the send issues the data without handshaking. Buffer availability is conveyed back to the source through the response, if the match succeeds, or via a request issued by the later matching receive. Large messages perform a handshake to do the tag match and convey the destination address to the source. A bulk operation moves the message data directly into the user buffer.
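
A sketch of that send-side decision (our illustration of the protocol described above, not the MPICH/ADI source; the am_* calls, constants, and table are assumed names):

    #include <stdbool.h>
    #include <stddef.h>

    #define EAGER_LIMIT (8 * 1024)
    #define MAX_NODES   128

    /* sender's view of whether its input buffer on each destination
     * is currently free */
    bool buffer_free[MAX_NODES];

    /* hypothetical Active Message operations */
    void  am_send_medium(int dest, const void *buf, size_t len);
    void *am_handshake(int dest, size_t len);  /* tag match; returns dest addr */
    void  am_send_bulk(int dest, void *remote, const void *buf, size_t len);

    void mpi_send(int dest, const void *buf, size_t len) {
        if (len < EAGER_LIMIT && buffer_free[dest]) {
            buffer_free[dest] = false;   /* becomes true again via the response */
            am_send_medium(dest, buf, len);
        } else {
            void *remote = am_handshake(dest, len);
            am_send_bulk(dest, remote, buf, len);
        }
    }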
`
Figure 6 shows the bandwidth obtained as a function of message size using Dongarra's echo test on NOW and on recent MPP platforms [DoDu95]. The NOW version has lower start-up cost than the other distributed memory platforms and has intermediate peak bandwidth. The T3D/pvm version does well for small messages, but has trouble with cache effects. Newer MPI implementations on the T3D should perform better than the T3D/pvm in the figure, but data is not available in the Dongarra report.

Figure 6. MPI bandwidth versus message size (Dongarra echo test) for the Meiko CS2, IBM SP2, Cray T3D, and NOW
`
5 NAS Parallel Benchmarks

An application-level comparison of NOW with recent parallel machines on traditional scientific codes can be obtained with the NAS MPI-based parallel benchmarks in the NPB2 suite [NPB]. We report briefly on two applications. The LU benchmark solves a finite difference discretization of the 3-D compressible Navier-Stokes equations. A 2-D partitioning of the 3-D data grid onto a power-of-two number of processors is obtained by halving the grid repeatedly in the first two dimensions, alternating between x and y, resulting in vertical pencil-like grid partitions. The ordering of point-based operations constituting the SSOR procedure proceeds on diagonals which progressively sweep from one corner on a given z plane to the opposite corner of the same z plane, thereupon proceeding to the next z plane. This constitutes a diagonal pipelining method and is called a "wavefront" method by its authors [Bar*93]. The software pipeline spends relatively little time filling and emptying and is perfectly load-balanced. Communication of partition boundary data occurs after completion of computation on all diagonals that contact an adjacent partition.
`
The BT algorithm solves three sets of uncoupled systems of equations, first in the x, then in the y, and finally in the z direction. These systems are block tridiagonal with 5x5 blocks and are solved using a multi-partition scheme [Bru88]. The multi-partition approach provides good load balance and uses coarse-grained communication. Each processor is responsible for several disjoint sub-blocks of points ("cells") in the grid. The cells are arranged such that, for each direction in the line-solve phase, the cells belonging to a certain processor are evenly distributed along the direction of solution. This allows each processor to perform useful work throughout a line-solve, instead of being forced to wait for the partial solution to a line from another processor before beginning work. Additionally, the information from a cell is not sent to the next processor until all sections of the linear equation systems handled in this cell have been solved. Therefore the granularity of communication is kept large and fewer messages are sent. The BT code requires a square number of processors.
`
Figure 7 shows the speedups obtained on sparse LU with the Class A input (200 iterations on a 64x64x64 grid) for the IBM SP-2 (wide nodes), Cray T3D, and NOW. The speedup is normalized to the performance on four processors, shown in Table 1, because this is the smallest number of nodes on which the problem can be run on the T3D. NOW achieves the scalability of the T3D with much higher per-node performance. It scales better than the SP-2, although it has only two-thirds the node performance. These results are consistent with the ratio of the processor performance to the performance of small message transfers.
`
TABLE 1. NPB baseline MFLOPS

               T3D    SP-2    NOW
LU @ 4 proc     74     236    152
BT @ 25 proc   378    1375    714
`
Figure 7. Speedup on NPB Sparse LU (speedup versus number of processors, up to 125)
`
Figure 8 shows the scalability of the IBM SP2, Cray T3D, and NOW on the Class A problem of the BT benchmark. Here the speedup is normalized to the performance on 25 processors, the smallest T3D configuration for which performance data is available. The scalability comparison is similar to that with LU, with the SP-2 lagging somewhat more, but having higher per-node performance.
`
Figure 8. Speedup on NPB BT (speedup versus number of processors for the T3D, SP2, and NOW)

6 Disk-to-Disk Sort

The real capability of NOW, with a complete operating system per node and I/O capability spread across the nodes, is best illustrated through I/O-intensive applications, such as commercial workloads and scientific applications on very large data sets. To evaluate this aspect of the system, we have developed a high-performance disk-to-disk sort. Sorting is an important application in many fields, especially the database community; it stresses both the I/O subsystem and the operating system, and it has a set of well-known benchmarks allowing comparison with other systems. We describe our experience with the Datamation benchmark, proposed in 1985 by a group of database experts [Anon85].
`
By utilizing the full operating system capability, local disks, and fast communication, NOW-Sort has broken all previous records on the Datamation benchmark, which measures the elapsed time to sort one million 100-byte records with 10-byte keys from disk to disk, and MinuteSort, which measures the amount of record data sorted in under a minute [Nyb*94].

We briefly describe our parallel algorithm for NOW-Sort, which is given in complete detail in [Arp*97]. One of the goals of NOW-Sort is to dynamically adapt to a variety of cluster configurations: namely, differing numbers of disks, amounts of memory, and numbers of workstations. A primary variation is the size of the data set as compared to the size of available memory. A one-pass version is used when all records fit into main memory; otherwise a two-pass version is used. Due to space limitations, we describe only the one-pass implementation here.
`
6.1 Algorithm Description

In the Datamation sort benchmark the keys are from a uniform distribution and begin evenly distributed across all workstations. At the highest level, the single-pass parallel sort of N records on P processors is a generalized bucket sort and contains four steps:

1. Read: Each processor reads N/P 100-byte records from its local disks into memory.
`
2. Communicate: Keys are examined and the records are sent to the appropriate destination processor. The destination processor copies a portion of the key to a local bucket, keeping a pointer to the full key and record; thus the keys are partially sorted according to their most-significant bits.

3. Sort: Each processor sorts its keys in memory using a partial-radix sort on each bucket.

4. Write: Each processor gathers and writes its sorted records to local disks.
`
At the end of the algorithm, the data is sorted across the disks of the workstations, with the lowest-valued keys on processor 0 and the highest-valued keys on processor P-1. The number of records per node will only be approximately equal, and depends upon the actual distribution of key values.
`
6.2 Local disk performance

A key advantage of a NOW is that the performance of each node can be studied and optimized in isolation before considering the interactions between nodes. For NOW-Sort, we needed to understand how best to utilize the disks, the memory, and the operating system substrate of each node.

To fully utilize the aggregate bandwidth of multiple disks per machine, we implemented a user-level library for file striping on top of each local Solaris file system. We have two disk configurations to consider: two 5400 rpm disks on a fast-narrow SCSI bus, and an additional two 7200 rpm disks on a fast-wide SCSI bus. Table 2 shows the performance of the striped file system for several configurations.
`
In the first two rows, we see that the two 5400 rpm disks saturate the fast-narrow SCSI bus, which has a peak bandwidth of 10 MB/s. We measure 8.3 MB/s from two disks capable of a total of 11 MB/s. The full NOW cluster has two disks per node, providing a potential file I/O bandwidth of 830 MB/s on 100 nodes.
`
A subset of the nodes have additional external disks. The second two rows indicate that the (external) fast-wide SCSI bus adequately handles the two faster disks. Finally, the last rows show we achieve 20.5 MB/s, or 96% of the peak aggregate bandwidth from the four disks. To harness this bandwidth in the sort, we need to adjust the blocking factor on the two kinds of disks to balance the transfer times.

TABLE 2. Bandwidths of disk configurations

Seagate Disks           SCSI Bus   Read (MB/s)   Write (MB/s)
1 5400 rpm Hawk         narrow     5.5           5.2
2 5400 rpm Hawk         narrow     8.3           8.0
1 7200 rpm Barracuda    wide       6.5           6.2
2 7200 rpm Barracuda    wide       13.0          12.1
2 of each               both       20.5          19.1
2 of each (peak)        both       21.3          20.1
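
One way to balance the transfer times, sketched below under our own assumptions (NOW-Sort's actual policy may differ), is to scale each disk's block size in proportion to its measured bandwidth:

    #include <stdio.h>

    /* Per-disk bandwidths from Table 2: each Hawk sustains about
     * 8.3/2 = 4.15 MB/s in a pair; each Barracuda about 6.5 MB/s.
     * The base block size is an assumed tuning parameter. */
    int main(void) {
        double bw[4] = { 4.15, 4.15, 6.5, 6.5 };   /* MB/s */
        double base_kb = 64.0;   /* block size for the slowest disk */
        for (int d = 0; d < 4; d++)
            printf("disk %d: %.0f KB blocks\n",
                   d, base_kb * bw[d] / bw[0]);
        return 0;
    }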
`
6.3 OS Interfaces for Buffer Management

Given a near-peak-bandwidth local file system, the sorting performance of each node depends critically on how effectively memory is used. With a general-purpose OS, there is no simple way for the application to determine how much memory is actually available to it. Depending upon the system interface used to read data from disk, the application may or may not be able to effectively control its memory usage.
`
We compare two approaches for reading records from disk: read, and mmap with madvise. For demonstration purposes, we use a simple implementation of the sorting algorithm on one UltraSPARC with one disk and 64 MB of DRAM; it quicksorts all of the keys in memory. The upper graph of Figure 9 shows that when the application uses the read call to read records into memory from disk, the total execution time increases severely when more than 20 MB of records are sorted, even though 64 MB of physical memory are in the machine. The operating system uses roughly 20 MB, and the file system performs its own buffering, which effectively doubles the application footprint. This performance degradation occurs because the system starts paging out the sorting data.
`
To avoid double-buffering while leveraging the convenience of the file system, we use memory-mapped files by opening the file, calling mmap to bind the file into a memory segment of the address space, and accessing the memory region as desired. The auxiliary system call, madvise, informs the operating system of the intended access pattern. For example, one call to madvise notifies the kernel that a region will be accessed sequentially, thus allowing the OS to fetch ahead of the current page and to throw away pages that have already been accessed. The lower graph of Figure 9 shows that with mmap and madvise the sorting program has linear performance up to roughly 40 MB, when it has used all the memory that the OS makes available. For larger data sets, the two-pass algorithm is used, which places greater demand on the file system and requires multiple threads to maintain the disk bandwidth [Arp*97].

Figure 9. Sensitivity to OS Interface: execution time (seconds) for the total, write, sort, and read phases versus data set size (MB), using read() (upper graph) and mmap() with madvise() (lower graph)
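
A minimal sketch of this interface usage (our illustration; error handling omitted for brevity):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map an input file and declare sequential access, avoiding the
     * double buffering that read() incurs through the file cache. */
    char *map_input(const char *path, size_t *len) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        *len = (size_t)st.st_size;
        char *base = mmap(NULL, *len, PROT_READ, MAP_SHARED, fd, 0);
        /* prefetch ahead of the scan; discard pages already touched */
        madvise(base, *len, MADV_SEQUENTIAL);
        close(fd);   /* the mapping remains valid after close */
        return base;
    }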
6.4 Using Active Message communication

The optimizations for striping data across local disks and using operating system interfaces for memory management apply when running on a single node or multiple nodes. The impact of parallelization is isolated to the communication phase.

After each node has memory-mapped its local input files, it calculates, for each key, the processor which should contain that key in the final sorted order. Using the assumption that the keys are from a uniform distribution, we determine the destination processor with a simple bucket function (i.e., the top log P bits of each key) and copy
`
the record from the input file to a 4 KB send-buffer allocated for each destination processor. When a send-buffer is filled, it is sent to its destination processor.

Upon arrival of a message, an Active Message handler is executed. The handler moves the full record into main memory and copies a portion of the key into one of B = 2^b buckets based on the high-order b bits of the key. The number of buckets is calculated at run-time such that the average number of keys per bucket fits into the second-level cache.
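
The bucket functions implied here can be written directly from the key's high-order bits; a sketch (ours, not NOW-Sort source), treating the leading 32 bits of the 10-byte key as an unsigned integer:

    #include <stdint.h>

    /* assumes 1 <= log_p and log_p + b <= 32 */

    /* top log2(P) bits select the destination processor */
    int dest_proc(uint32_t key_hi, int log_p) {
        return (int)(key_hi >> (32 - log_p));
    }

    /* the next b bits select the bucket on that node */
    int bucket_of(uint32_t key_hi, int log_p, int b) {
        return (int)((key_hi >> (32 - log_p - b)) & ((1u << b) - 1));
    }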
`
The read and communication phases are easily overlapped due to the interfaces provided by mmap and Active Messages. Copying keys into send-buffers is completely hidden under the disk transfer time. Obtaining this same performance with the read system call would require more programming complexity; because the cost of issuing each read is high, threads must be used to prefetch data in large chunks.
`
Measurements on a cluster with two disks per workstation show that the communication time is mostly hidden by the read time. However, with four disks per workstation, very little communication is overlapped with reading, and the algorithm is actually slower than with only two disks. This penalty occurs because the UltraSPARC I/O bus, the SBus, saturates long before its theoretical peak of 80 MB/s. Since almost all records are sent to another processor, the SBus must transfer three times the I/O rate: once for reading, once for sending, and once for receiving.
`
6.5 Sorting and Writing

The sort and write phases on each node are straightforward. After synchronizing across processors to ensure that all records have been sent and received, each node performs a partial-radix sort on the set of keys in each bucket, very similar to the approach described in [Agar96]. The partial-radix sort orders keys using the top 22 bits after stripping off the log P + b bits used to determine the destination processor and bucket. At this point, with high probability, most keys are correctly sorted, and a simple bubble sort cleans up the misordered keys. A pointer is kept to the full 100-byte record so that only keys and pointers are swapped. The sorted records are then gathered and written to local disk using the write interface.
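
The cleanup pass is small enough to sketch; the entry layout is our assumption, but it shows why only keys and pointers move:

    #include <stdint.h>

    struct entry {
        uint32_t key;    /* the 22 ordering bits, right-justified */
        void    *record; /* pointer to the full 100-byte record */
    };

    /* Bubble the few misordered keys into place after the
     * partial-radix pass; each swap touches only a key and a pointer,
     * never the 100-byte record itself. */
    void cleanup(struct entry *e, int n) {
        int swapped = 1;
        while (swapped) {
            swapped = 0;
            for (int i = 1; i < n; i++) {
                if (e[i - 1].key > e[i].key) {
                    struct entry t = e[i - 1];
                    e[i - 1] = e[i];
                    e[i] = t;
                    swapped = 1;
                }
            }
        }
    }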
`
6.6 Performance Measurements

Our performance on the Datamation benchmark is shown in Figure 10 for two NOW configurations. For each configuration, the lower curve is the sort time and the upper curve gives the additional GLUnix start-up time. The previous record holder on this benchmark, indicated by the horizontal line, was a 12-processor SGI PowerChallenge with 96 disks and 2.25 GB of main memory [Swe96]. In NOW-Sort, each processor sorts an equal por