`
`Jeffrey S. Chase, Darrell C. Anderson, Andrew J. Gallatin, Alvin R. Lebeck, and Kenneth G. Yocum
`Department of Computer Science
`Duke University
{chase, anderson, gallatin, alvy, grant}@cs.duke.edu
`
`Abstract
`
`Recent gains in communication speeds motivate the de-
`sign of network storage systems whose performance
`tracks the rapid advances in network technology rather
`than the slower rate of advances in disk technology.
`Viewing the network as the primary access path to I/O
`is an attractive approach to building incrementally scal-
`able, cost-effective, and easy-to-administer storage sys-
`tems that move data at network speeds.
`This paper gives an overview of research on high-
`speed network storage in the Trapeze project. Our work
`is directed primarily at delivering gigabit-per-second
`performance for network storage access, using custom
`firmware for Myrinet networks, a lightweight messaging
`system optimized for block I/O traffic, and a new kernel
`storage layer incorporating network memory and paral-
`lel disks. Our current prototype is capable of client file
`access bandwidths approaching 100 MB/s, with network
memory fetch latencies below 150 µs for 8KB blocks.
`
`1 Introduction
`
`Storage access is a driving application for high-speed
`LAN interconnects. Over the next few years, new high-
`speed network standards — primarily Gigabit Ethernet
`— will consolidate an order-of-magnitude gain in LAN
`performance already achieved with specialized cluster
`interconnects such as Myrinet and SCI. Combined with
`faster I/O bus standards, these networks greatly expand
`the capacity of even inexpensive PCs to handle large
`amounts of data for scalable computing, network ser-
`vices, multimedia and visualization.
This work is supported by the National Science Foundation (CCR-96-24857, CDA-95-12356, and EIA-9870724) and equipment grants from Intel Corporation and Myricom. Anderson is supported by a U.S. Department of Education GAANN fellowship.

These gains in communication speed enable a new
generation of network storage systems whose performance
tracks the rapid advances in network technology
`rather than the slower rate of advances in disk technol-
`ogy. With gigabit-per-second networks, a fetch request
`for a faulted page or file block can complete up to two
`orders of magnitude faster from remote memory than
`from a local disk (assuming a seek). Moreover, a storage
`system built from disks distributed through the network
`(e.g., attached to dedicated servers [11, 12, 10], cooper-
`ating peers [3, 13], or the network itself [8]) can be made
`incrementally scalable, and can source and sink data to
`and from individual clients at network speeds.
`The Trapeze project is an effort to harness the power
`of gigabit-per-second networks to “cheat” the disk I/O
`bottleneck for I/O-intensive applications. We use the
`network as the sole access path to external storage, push-
`ing all disk storage out into the network. This network-
`centric approach to I/O views the client’s file system and
`virtual memory system as extensions of the network pro-
`tocol stack. The key elements of our approach are:
`
• Emphasis on communication performance. Our
`system is based on custom Myrinet firmware and
`a lightweight kernel-kernel messaging layer opti-
`mized for block I/O traffic. The firmware includes
`features for zero-copy block movement, and uses
`an adaptive message pipelining strategy to reduce
`block fetch latency while delivering high band-
`width under load.
`
• Integration of network memory as an interme-
`diate layer of the storage hierarchy. The Trapeze
`project originated with communication support for
`a network memory service [6], which stresses net-
`work performance by removing disks from the crit-
`ical path of I/O. We are investigating techniques
`to manage network memory as a distributed, low-
`overhead, “smart” file buffer cache between local
`memory and disks, to exploit its potential to mask
`disk access latencies.
`
`
`
`
• Parallel block-oriented I/O storage. We are de-
`veloping a new scalable storage layer, called Slice,
`that partitions file data and metadata across a col-
`lection of I/O servers. While the I/O nodes in our
`design could be network storage appliances, we
`have chosen to use generic PCs because they are
`cheap, fast, and programmable.
`
`This paper is organized as follows. Section 2 gives a
`broad overview of the Trapeze project elements, with a
`focus on the features relevant to high-speed block I/O.
`Section 3 presents more detail on the adaptive message
`pipelining scheme implemented in the Trapeze/Myrinet
`firmware. Section 4 presents some experimental results
`showing the network storage access performance cur-
`rently achievable with Slice and Trapeze. We conclude
`in Section 5.
`
`2 Overview of Trapeze and Slice
`
`The Trapeze messaging system consists of two compo-
`nents: a messaging library that is linked into the kernel
`or user programs, and a firmware program that runs on
`the Myrinet network interface card (NIC). The firmware
`defines the interface between the host CPU and the net-
`work device; it interprets commands issued by the host
`and masters DMA transactions to move data between
`host memory and the network link. The host accesses
`the network using the Trapeze library, which defines
`the lowest-level API for network communication. Since
`Myrinet firmware is customer-loadable, any Myrinet site
`can use Trapeze.
`Trapeze was designed primarily to support fast kernel-
`to-kernel messaging alongside conventional TCP/IP net-
`working. Figure 1 depicts the structure of our current
`prototype client based on FreeBSD 4.0. The Trapeze
`library is linked into the kernel along with a network de-
`vice driver that interfaces to the TCP/IP protocol stack.
`Network storage access bypasses the TCP/IP stack, in-
`stead using NetRPC, a lightweight communication layer
`that supports an extended Remote Procedure Call (RPC)
`model optimized for block I/O traffic. Since copying
`overhead can consume a large share of CPU cycles at
`gigabit-per-second bandwidths, Trapeze is designed to
`allow copy-free data movement, which is supported by
`page-oriented buffering strategies in the socket layer,
`network device driver, and NetRPC.
`We are experimenting with Slice, a new scalable net-
`work I/O service based on Trapeze. The current Slice
`prototype is implemented as a set of loadable kernel
`modules for FreeBSD. The client side consists of 3000
`lines of code interposed as a stackable file system layer
`above the Network File System (NFS) protocol stack.
`This module intercepts read and write operations on
`
`file vnodes and redirects them to an array of block I/O
`servers using NetRPC. It incorporates a simple striping
`layer and cacheable block maps that track the location
`of blocks in the storage system. Name space operations
`are handled by a file manager using the NFS protocol,
`decoupling name space management (and access con-
`trol) from block management. This structure is similar
`to other systems that use independent file managers, in-
`cluding Swift [4], Zebra [10], and Cheops/NASD [9].
`To scale the file manager service, Slice uses a hashing
`scheme that partitions the name space across an array of
`file managers, implemented in a packet filter that redi-
`rects NFS requests to the appropriate server.
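
To make the striping layer concrete, the following sketch shows how a simple round-robin policy of the kind described above can map a file offset to a block I/O server and an offset within that server. It is only an illustration: the structure and function names are hypothetical, the real client also consults its cacheable block maps, and the 32KB stripe grain is simply the value used by our current prototype (Section 4).

    /* Hypothetical sketch of round-robin striping in the Slice client.
     * The real module is a stackable FreeBSD file system layer; names
     * here are illustrative only. */
    #include <stdint.h>

    #define STRIPE_GRAIN (32 * 1024)    /* stripe unit used by our prototype */

    struct stripe_loc {
        int      server;                /* which block I/O server */
        uint64_t srv_offset;            /* byte offset within that server */
    };

    static struct stripe_loc
    stripe_lookup(uint64_t file_offset, int nservers)
    {
        uint64_t stripe = file_offset / STRIPE_GRAIN;   /* stripe unit index */
        struct stripe_loc loc;

        loc.server = (int)(stripe % nservers);          /* round-robin placement */
        loc.srv_offset = (stripe / nservers) * STRIPE_GRAIN
                         + (file_offset % STRIPE_GRAIN);
        return loc;
    }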
`
`2.1 The Slice Block I/O Service
`
`The Slice block I/O service is built from a collection of
`PCs, each with a handful of disks and a high-speed net-
`work interface. We call this approach to network storage
`“PCAD” (PC-attached disks), indicating an intermedi-
`ate approach between network-attached disks (NASD)
`and conventional file servers (server-attached disks, or
`SAD). While the CMU NASD group has determined that
`SAD can add up to 80% to the cost of disk capacity [9],
`it is interesting to note that the cost of the CPU, mem-
`ory, and network interface in PCAD is comparable to the
`price differential between IDE and SCSI storage today.
`Our current IDE PCAD nodes serve 88 GB of storage
`on four IBM DeskStar 22GXP drives at a cost under
`$60/GB, including a PC tower, a separate Promise Ul-
`tra/33 IDE channel for each drive, and a Myrinet NIC
`and switch port. With the right software, a collection
`of PCAD nodes can act as a unified network storage
`volume with incrementally scalable bandwidth and ca-
`pacity, at a per-gigabyte cost equivalent to a medium-
`range raw SCSI disk system.1 Moreover, the PCAD
`nodes feature a 450 MHz Pentium-III CPU and 256MB
`of DRAM, and are sharable on the network.
`The chief drawback of the PCAD architecture is that
`the I/O bus in the storage nodes limits the number of
`disks that can be used effectively on each node, pre-
`senting a fundamental obstacle to lowering the price per
`gigabyte without also compromising the bandwidth per
`unit of capacity. Our current IDE/PCAD configurations
`use a single 32-bit 33 MHz PCI bus, which is capable
`of streaming data between the network and disks at 40
`MB/s. Thus the PCAD/IDE network storage service de-
`livers only about 30% of the bandwidth per gigabyte of
`capacity as the SCSI disks of equivalent cost. Even so,
`the bus bandwidth limitation is a cost issue rather than a
`fundamental limit to performance, since bandwidth can
`
1 Seagate Barracuda 9GB drives were priced between $50/GB and
`$80/GB in April 1999, with an average price of $60/GB.
`
`
`
`
[Figure 1 (diagram): user applications and user data pages in user space; the system page frame pool and the raw Trapeze message layer in the host kernel; and the NIC's send and receive rings, reached via payload buffer pointers across the PCI bus.]

Figure 1: A view of the Slice/Trapeze prototype.
`
`be expanded by adding more I/O nodes, and bus laten-
`cies are insignificant where disk accesses are involved.
`
In our Slice prototype, the block I/O servers run
FreeBSD 4.0 kernels supplemented with a loadable
`module that maps incoming requests to collections of
`files in dedicated local file systems. Slice includes fea-
`tures that enable I/O servers to act as caches over NFS
`
`file servers, including tertiary storage servers or Internet
gateways supporting the NFS protocol [2]. In other re-
`spects, the block I/O protocol is compatible with NASD,
`which is emerging as a promising storage architecture
`that would eliminate the I/O server bus bottleneck.
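
As a rough illustration of the server side, the sketch below shows how the loadable module described above might map an incoming block read onto a file in a dedicated local file system. This is a deliberately simplified, user-level rendering with hypothetical names and paths; the actual module runs inside the FreeBSD kernel and receives its requests over NetRPC.

    /* Simplified, hypothetical sketch of a Slice block I/O server handler:
     * satisfy a block read from a backing file in a local file system. */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192

    static ssize_t
    serve_block_read(uint32_t volume, uint64_t object, uint64_t blockno, void *buf)
    {
        char path[128];
        int fd;
        ssize_t n;

        /* Map (volume, object) to a file in a dedicated local file system. */
        snprintf(path, sizeof(path), "/slice/vol%u/obj%llu",
                 (unsigned)volume, (unsigned long long)object);
        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        n = pread(fd, buf, BLOCK_SIZE, (off_t)(blockno * BLOCK_SIZE));
        close(fd);
        return n;
    }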
`
`One benefit of PCAD I/O nodes is that they support
`flexible use of network memory as a shared high-speed
`I/O cache integrated with the storage service. Trapeze
`was originally designed as a messaging substrate for the
`Global Memory Service (GMS) [6], which supports re-
`mote paging and cooperative caching [5] of file blocks
`and virtual memory pages, unified at a low level of the
`operating system kernel. The GMS work showed sig-
`nificant benefits for network memory as a fast tempo-
`rary backing store for virtual memory or scratch files.
`The Slice block I/O service retains the performance em-
`phasis of the network memory orientation, and the basic
`protocol and mechanisms for moving, caching, and lo-
`cating blocks are derived from GMS. We are investigat-
`ing techniques for using the I/O server CPU to actively
`manage server memory as a prefetch buffer, using spec-
`ulative prediction or hinting directives from clients [13].
`
`2.2 Trapeze Messaging and NetRPC
`
`Trapeze messages are short (128-byte) control messages
with optional attached payloads typically containing ap-
`plication data not interpreted by the messaging system,
`e.g., file blocks, virtual memory pages, or TCP seg-
`ments. The data structures in NIC memory include two
`message rings, one for sending and one for receiving.
`Each message ring is a circular producer/consumer array
`of 128-byte control message buffers and related state,
shown in Figure 1. The host attaches a payload buffer to
`a message by placing its DMA address in a designated
`field of the control message header.
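
The layout below is an illustrative sketch of a send-ring entry as just described: a 128-byte control message plus per-entry state, with an optional payload attached by its DMA address. The field and type names are ours, not the actual Trapeze definitions.

    /* Illustrative sketch of a Trapeze message ring entry; not the actual
     * firmware data structures. */
    #include <stdint.h>

    #define TPZ_CTRL_BYTES 128          /* fixed control message size */

    struct tpz_ring_entry {
        volatile uint32_t state;        /* producer/consumer handshake (free, ready, done) */
        uint32_t payload_dma;           /* DMA address of attached payload buffer, 0 if none */
        uint32_t payload_len;           /* payload length in bytes */
        uint8_t  ctrl[TPZ_CTRL_BYTES];  /* control message body, uninterpreted by Trapeze */
    };

    struct tpz_ring {
        uint32_t prod, cons;                /* producer/consumer indices */
        struct tpz_ring_entry entry[128];   /* circular array in NIC memory (size illustrative) */
    };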
`The Trapeze messaging system has several features
`useful for high-speed network storage access:
`
• Separation of header and payload. A Trapeze
control message and its payload (if any) are sent as
a single packet on the network, but they are handled
separately by the message system, and the separa-
tion is preserved at the receiver. This enables the
TCP/IP socket layer and NetRPC to avoid copying,
e.g., by remapping aligned payload buffers. To sim-
plify zero-copy block fetches, the NIC can demulti-
plex incoming payloads into a specific frame, based
on a token in the message that indirects through an
incoming payload table on the NIC (see the sketch
after this list).
`
• Large MTUs with scatter/gather DMA. Since
Myrinet has no fixed maximum packet size (MTU),
the maximum payload size of a Trapeze network is
easily configurable. Trapeze supports scatter/gather
DMA so that payloads may span multiple noncon-
tiguous page frames. Scatter/gather is useful for
deep prefetching and write bursts, reducing per-
packet overheads for high-volume data access.

Host                  fixed pipeline      store & forward      adaptive pipeline
450 MHz Pentium-III   124 µs / 77 MB/s    217 µs / 110 MB/s    112 µs / 110 MB/s
500 MHz Alpha Miata   129 µs / 71 MB/s    236 µs / 93 MB/s     119 µs / 93 MB/s

Table 1: Latency and bandwidth of NIC DMA pipelining alternatives for 8KB payloads.
`
• Adaptive message pipelining. The Trapeze
`firmware adaptively pipelines DMA transfers on
`the I/O bus and network link to minimize the la-
`tency of I/O block transfers, while delivering peak
`bandwidth under load. Section 3 discusses adaptive
`message pipelining in more detail.
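
The sketch referenced in the first bullet above outlines how a client might use the NIC's incoming payload table for a zero-copy block fetch: it posts the destination frame, obtains a token, and carries the token in its request so that the reply payload is DMAed straight into that frame. The helper names and token format are hypothetical; only the overall mechanism follows the description above.

    /* Hypothetical sketch of a zero-copy block fetch via the incoming
     * payload table; helper functions and names are illustrative only. */
    #include <stdint.h>

    int  tpz_post_payload(void *frame, uint32_t len);   /* reserve a table slot, return token */
    void tpz_send_request(uint64_t block_id, int reply_token);

    struct pending_fetch {
        uint64_t block_id;
        void    *frame;          /* page frame that will receive the block */
        int      token;          /* slot in the NIC's incoming payload table */
    };

    void
    fetch_block(struct pending_fetch *f, uint64_t block_id, void *frame, uint32_t len)
    {
        f->block_id = block_id;
        f->frame = frame;
        f->token = tpz_post_payload(frame, len);    /* NIC records the frame's DMA address */
        tpz_send_request(block_id, f->token);
        /* When the reply arrives, its token indirects through the payload
         * table and the NIC DMAs the block directly into frame: no copy. */
    }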
`
`The NetRPC package based on Trapeze is derived
`from the original RPC package for the Global Memory
Service (gms_net), which was extended to use Trapeze
`with zero-copy block handling and support for asyn-
`chronous prefetching at high bandwidth [1].
`To complement the zero-copy features of Trapeze,
`the socket layer, TCP/IP driver, and NetRPC share a
`common pool of aligned network payload buffers allo-
`cated from the virtual memory page frame pool. Since
`FreeBSD exchanges file block buffers between the vir-
`tual memory page pool and the file cache, this allows
`unified buffering among the network, file, and VM sys-
`tems. For example, NetRPC can send any virtual mem-
`ory page or cached file block out to the network by at-
`taching it as a payload to an outgoing message. Simi-
`larly, every incoming payload is deposited in an aligned
physical frame that can be mapped into a user process or
`hashed into the file cache or VM page cache. This uni-
`fied buffering also enables the socket layer to reduce
`copying by remapping pages, which significantly re-
`duces overheads for TCP streams [7].
`High-bandwidth network I/O requires support for
`asynchronous block operations for prefetching or write-
`behind. NFS clients typically support this asynchrony by
`handing off outgoing RPC calls to a system I/O daemon
`that can wait for RPC replies, allowing the user process
`that originated the request to continue. NetRPC supports
`a lower-overhead alternative using nonblocking RPC, in
`which the calling thread or process supplies a contin-
`uation procedure to be executed — typically from the
`receiver interrupt handler — when the reply arrives. The
`issuing thread may block at a later time, e.g., if it refer-
`ences a page that is marked in the I/O cache for a pend-
`ing prefetch. In this case, the thread sleeps and is awak-
`ened directly from the receiver interrupt handler. Non-
`blocking RPCs are a simple extension of kernel facilities
`
`already in place for asynchronous I/O on disks; each net-
`work I/O operation applies to a buffer in the I/O cache,
`which acts as a convenient point for synchronizing with
`the operation or retrieving its status.
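
The shape of such a nonblocking call is sketched below with a hypothetical interface: the caller names the I/O cache buffer that will receive the reply payload and supplies a continuation that NetRPC runs from the receive interrupt handler. This is our illustration of the idea, not the NetRPC source.

    /* Hypothetical sketch of a nonblocking RPC with an interrupt-time
     * continuation, in the style described above. */
    #include <stdint.h>

    struct iocache_buf;                     /* a buffer in the client's I/O cache */

    typedef void (*rpc_continuation)(struct iocache_buf *buf, int status);

    struct rpc_call {
        uint64_t            block_id;       /* block being prefetched */
        struct iocache_buf *buf;            /* reply payload lands here, zero-copy */
        rpc_continuation    done;           /* run from the receive interrupt handler */
    };

    void netrpc_call_nonblocking(struct rpc_call *call);   /* send and return at once */

    static void
    prefetch_done(struct iocache_buf *buf, int status)
    {
        /* At interrupt time: mark the buffer valid (or failed) and wake any
         * thread that has since blocked on this pending prefetch. */
        (void)buf; (void)status;
    }

    static void
    start_prefetch(struct rpc_call *call, uint64_t block_id, struct iocache_buf *buf)
    {
        call->block_id = block_id;
        call->buf = buf;
        call->done = prefetch_done;
        netrpc_call_nonblocking(call);      /* issuing thread continues immediately */
    }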
`
`3 Balancing Latency and Bandwidth
`
`From a network perspective, storage access presents
`challenges that are different from other driving applica-
`tions of high-speed networks, such as parallel computing
`or streaming media. While small-message latency is im-
`portant, server throughput and client I/O stall times are
`determined primarily by the latency and bandwidth of
`messages carrying file blocks or memory pages in the 4
`KB to 16 KB range. The relative importance of latency
`and bandwidth varies with workload. A client issuing
`unpredicted fetch requests requires low latency; other
`clients may be bandwidth-limited due to multithreading,
`prefetching, or write-behind.
`Reconciling these conflicting demands requires care-
`ful attention to data movement through the messaging
`system and network interface. One way to achieve
`high bandwidth is to use large transfers, reducing per-
`transfer overheads. On the other hand, a key technique
`for achieving low latency for large packets is to frag-
`ment each message and pipeline the fragments through
`the network, overlapping transfers on the network links
`and I/O buses [16, 14]. Since it is not possible to do
`both at once, systems must select which strategy to use.
`Table 1 shows the effect of this choice on Trapeze la-
`tency and bandwidth for 8KB payloads, which are typi-
`cal of block I/O traffic. The first two columns show mea-
`sured one-way latency and bandwidth using fixed-size
`1408-byte DMA transfers and 8KB store-and-forward
`transfers. These experiments use raw Trapeze messag-
`ing over LANai-4 Myrinet NICs with firmware config-
`ured for each DMA policy. Fixed pipelining reduces la-
`tency by up to 45% relative to store-and-forward DMA
`through the NIC, but the resulting per-transfer overheads
`on the NIC and I/O bus reduce delivered bandwidth by
`up to 30%.
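
A rough first-order model of our own (ignoring DMA startup costs and the 128-byte control message) makes the tradeoff explicit. For a payload of size P crossing three stages of comparable bandwidth B (sender I/O bus, network link, receiver I/O bus), store-and-forward delivery and pipelined delivery with fragment size f and per-fragment overhead o cost roughly

    T_{sf} \approx 3P/B
    T_{pipe}(f) \approx P/B + 2f/B + \lceil P/f \rceil \, o

Smaller fragments shrink the 2f/B term and thus the latency, but the per-fragment overhead term grows and eats into delivered bandwidth. Taking P = 8 KB, f = 1408 bytes, and B near the 110 MB/s of Table 1 gives about 223 µs for store-and-forward and about 100 µs plus overheads for the pipelined case, roughly in line with the measured 217 µs and 124 µs.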
`To balance latency and bandwidth, Trapeze uses an
`adaptive strategy that pipelines individual messages au-
`tomatically for lowest latency, while dynamically adjust-
`ing the degree of pipelining to traffic patterns and con-
`gestion. The third column in Table 1 shows that this
`
`
`
`
do forever
    for each idle DMA sink in {NetTx, HostRcv}
        waiting = words awaiting transfer to sink
        if (waiting > MINPULSE)
            initiate transfer of waiting words to sink
    end
    for each idle DMA source in {HostTx, NetRcv}
        if (waiting transfer and buffer available)
            initiate transfer from source
    end
loop

[Timing diagram: overlapped HostTx, NetTx, and HostRcv transfers for a single packet over roughly 100 microseconds.]

Figure 2: Adaptive message pipelining policy and resulting pipeline transfers.
`
`yields both low latency and high bandwidth. Adaptive
`message pipelining in Trapeze is implemented in the
`NIC firmware, eliminating host overheads for message
`fragmentation and reassembly.
`Figure 2 outlines the message pipelining policy and
`the resulting overlapped transfers of a single 8KB packet
`across the sender’s I/O bus, network link, and receiver’s
`I/O bus. The basic function of the firmware running in
`each NIC is to move packet data from a source to a sink,
`in both sending and receiving directions. Data flows into
`the NIC from the source and accumulates in NIC buffers;
`the firmware ultimately moves the data to the sink by
`scheduling a transfer on the NIC DMA engine for the
`sink. When sending, the NIC’s source is the host I/O bus
`(hostTx) and the sink is the network link (netTx). When
`receiving, the source is the network link (netRcv) and
`the sink is the I/O bus (hostRcv). The Trapeze firmware
`issues large transfers from each source as soon as data
`is available and there is sufficient buffer space to accept
`it. Each NIC makes independent choices about when to
`move data from its local buffers to its sinks.
`The policy behind the Trapeze pipelining strategy is
`simple: if a sink is idle, initiate a transfer of all buffered
`data to the sink if and only if the amount of data exceeds
a configurable threshold (minpulse). This policy pro-
`duces near-optimal pipeline schedules automatically be-
`cause it naturally adapts to speed variations between the
`source and the sink. For example, if a fast source feeds
`a slow sink, data builds up in the NIC buffers behind
`the sink, triggering larger transfers through the bottle-
`neck to reduce the total per-transfer overhead. Similarly,
`if a slow source feeds a fast sink, the policy produces a
`sequence of small transfers that use the idle sink band-
`width to reduce latency.
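
Rendered as code, the policy of Figure 2 is little more than the loop below. This is our sketch, not the shipping firmware, which must also manage the LANai DMA engines, buffer descriptors, and the send and receive rings; MINPULSE is the configurable threshold of Figure 2.

    /* Sketch of the adaptive pipelining loop of Figure 2; illustrative only. */
    #include <stdint.h>

    #define MINPULSE 256    /* threshold, in words; configurable */

    enum dma_chan { HOST_TX, NET_TX, NET_RCV, HOST_RCV };

    int      dma_idle(enum dma_chan c);            /* is this DMA engine idle? */
    uint32_t words_waiting(enum dma_chan sink);    /* words buffered for this sink */
    int      source_has_data(enum dma_chan src);   /* packet data waiting at the source? */
    int      buffer_available(enum dma_chan src);  /* NIC buffer space to receive it? */
    void     start_dma(enum dma_chan c, uint32_t words);  /* 0 = as much as fits */

    void
    pipeline_loop(void)
    {
        const enum dma_chan sinks[]   = { NET_TX, HOST_RCV };
        const enum dma_chan sources[] = { HOST_TX, NET_RCV };
        int i;

        for (;;) {
            /* Push toward each idle sink once enough data has accumulated. */
            for (i = 0; i < 2; i++) {
                uint32_t waiting = words_waiting(sinks[i]);
                if (dma_idle(sinks[i]) && waiting > MINPULSE)
                    start_dma(sinks[i], waiting);   /* move all buffered words */
            }
            /* Keep pulling from each idle source while buffer space remains. */
            for (i = 0; i < 2; i++) {
                if (dma_idle(sources[i]) && source_has_data(sources[i])
                    && buffer_available(sources[i]))
                    start_dma(sources[i], 0);
            }
        }
    }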
`The adaptive message pipelining strategy falls back
`to larger transfers during bursts or network congestion,
`because buffer queues on the NICs allow the adaptive
`behavior to carry over to multiple packets headed for the
`same sink. Even if the speeds and overheads at each
`pipeline stage are evenly matched, the higher overhead
`of initial small transfers on the downstream links quickly
`causes data to build up in the buffers of the sending and
`
`receiving NICs, triggering larger transfers.
`Figure 3 illustrates the adaptive pipelining behavior
`for a one-way burst of packets with 8KB payloads. This
`packet flow graph was generated from logs of DMA ac-
`tivity taken by an instrumented version of the Trapeze
`firmware on the sending and receiving NICs. The trans-
`fers for successive packets are shown in alternating shad-
`ings; all consecutive stripes with the same shading are
`from the same packet. The width of each stripe shows
`the duration of the transfer, measured by a cycle counter
`on the NIC. This duration is proportional to the transfer
`size in the absence of contention. Contention effects can
`be seen in the long first transfer on the sender’s I/O bus,
`which results from the host CPU contending for the bus
`as it initiates send requests for the remaining packets.
`Figure 3 shows that both the sending and receiving
`NICs automatically drop out of pipelining and fall back
`to full 8KB transfers about one millisecond into the
`packet burst. While the pipelining yields low latency for
`individual packets at low utilization, the adaptive behav-
`ior yields peak bandwidth for streams of packets. The
`policy is automatic and self-tuning, and requires no di-
`rection from the host software. Experiments have shown
`that the policy is robust, and responds well to a range of
`congestion conditions [15].
`
`4 Performance
`
`Our goal with Trapeze and Slice is to push the per-
`formance bounds for network storage systems using
`Myrinet and similar networks. Although our work with
`Slice is preliminary, our initial prototype shows the per-
`formance that can be achieved with network storage sys-
`tems using today’s technology and the right network
`support.
`Figure 4 shows read and write bandwidths from
`disk for high-volume sequential file access through the
`FreeBSD read and write system call interface using the
`current Slice prototype. For these tests, the client was
`a DEC Miata (Personal Workstation 500au) with a 500
`MHz Alpha 21164 CPU and a 32-bit 33 MHz PCI bus
`
`
`
`
[Packet flow diagram: HostTx, NetTx, and HostRcv transfers over 0 to 1000 microseconds.]
`
`Figure 3: Adaptive message pipelining reverts to larger transfer sizes for a stream of 8K payloads. The fixed-size
`transfer at the start of each packet on NetTx and HostRcv is the control message data, which is always handled as a
`separate transfer. Control messages do not appear on HostTx because they are sent using programmed I/O rather than
`DMA on this platform (300 MHz Pentium-II/440LX).
`
[Two plots: write bandwidth (MB/s, 0 to 70) and read bandwidth (MB/s, 0 to 100) versus total disks (0 to 16), with one line each for two, three, four, five, and six servers, plus two IDE servers.]
`
`Figure 4: Single-client sequential disk read and write bandwidths with Slice/Trapeze.
`
`using the Digital 21174 “Pyxis” chipset. The block I/O
`servers have a 450 MHz Pentium-III on an Asus P2B
`motherboard with an Intel 440BX chipset, and either of
`two disk configurations: (1) four Seagate Medalist disks
`on two separate Ultra-Wide SCSI channels, or (2) four
`IBM DeskStar 22GXP drives on separate Promise Ul-
`tra/33 IDE channels. All machines are equipped with
`Myricom LANai 4.1 SAN adapters, and run kernels built
`from the same FreeBSD 4.0 source pool. The Slice
`client uses a simple round-robin striping policy with a
`stripe grain of 32KB. The test program (dd) reads or
`writes 1.25 GB in 64K chunks, but it does not touch the
`data. Client and server I/O caches are flushed before
`each run.
`
`Each line in Figure 4 shows the measured I/O band-
`width delivered to a single client using a fixed number of
`block I/O servers. The four points on each line represent
`the number of disks on each server; the x-axis gives the
`total number of disks used for each point. The peak write
`
`bandwidth is 66 MB/s; at this point the client CPU sat-
`urates due to copying at the system call layer. The peak
`read bandwidth is 97 MB/s. While reading at 97 MB/s
`the client CPU utilization is only 28%, since FreeBSD
`4.0 avoids most copying for large reads by page remap-
`ping at the system call layer. In this configuration, an un-
`predicted 8KB page fault from I/O server memory com-
pletes in under 150 µs.
`
`We have also experimented with TCP/IP communi-
`cation using Trapeze. Recent point-to-point TCP band-
`width tests (netperf) yielded a peak bandwidth of 956
`Mb/s through the socket interface on a pair of Com-
`paq XP 1000 workstations using Myricom’s LANai-5
`NICs. These machines have a 500 MHz Alpha 21264
`CPU and a 64-bit 33 MHz PCI bus with a Digital 21272
`“Tsunami” chipset, and were running FreeBSD 4.0 aug-
`mented with page remapping for sockets [7].
`
`
`
`
5 Conclusion

This paper summarizes recent and current research in
the Trapeze project at Duke University. The goal of our
work has been to push the performance bounds for net-
work storage access using Myrinet networks, by explor-
ing new techniques and optimizations in the network in-
terface and messaging system, and in the operating sys-
tem kernel components for file systems and virtual mem-
ory. Key elements of our approach include:

• An adaptive message pipelining strategy implemented in custom firmware for Myrinet NICs. The Trapeze firmware combines low-latency transfers of I/O blocks with high bandwidth under load.

• A lightweight kernel-kernel RPC layer optimized for network I/O traffic. NetRPC allows zero-copy sends and receives directly from the I/O cache, and supports nonblocking RPCs with interrupt-time reply handling for high-bandwidth prefetching.

• A scalable storage system incorporating network memory and parallel disks on a collection of PC-based I/O servers. The system achieves high bandwidth and capacity by using many I/O nodes in tandem; it can scale incrementally with demand by adding more I/O nodes to the network.

Slice uses Trapeze and NetRPC to access network
storage at speeds close to the limit of the network and
client I/O bus. We expect that this level of performance
can also be achieved with new network standards that
are rapidly gaining commodity status, such as Gigabit
Ethernet.

References

[1] D. Anderson, J. S. Chase, S. Gadde, A. J. Gallatin, K. G. Yocum, and M. J. Feeley. Cheating the I/O bottleneck: Network storage with Trapeze/Myrinet. In 1998 Usenix Technical Conference, June 1998.

[2] D. C. Anderson, K. G. Yocum, and J. S. Chase. A case for buffer servers. In IEEE Workshop on Hot Topics in Operating Systems (HOTOS), Apr. 1999.

[3] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang. Serverless network file systems. In Proceedings of the ACM Symposium on Operating Systems Principles, pages 109–126, Dec. 1995.

[4] L.-F. Cabrera and D. D. E. Long. Swift: Using distributed disk striping to provide high I/O data rates. Computing Systems, 4(4):405–436, Fall 1991.

[5] M. D. Dahlin, R. Y. Wang, and T. E. Anderson. Cooperative caching: Using remote client memory to improve file system performance. In Proceedings of the First Symposium on Operating System Design and Implementation, pages 267–280, Nov. 1994.

[6] M. J. Feeley, W. E. Morgan, F. H. Pighin, A. R. Karlin, and H. M. Levy. Implementing global memory management in a workstation cluster. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, 1995.

[7] A. J. Gallatin, J. S. Chase, and K. G. Yocum. Trapeze/IP: TCP/IP at near-gigabit speeds. In 1999 Usenix Technical Conference (Freenix track), June 1999.

[8] G. A. Gibson, D. F. Nagle, K. Amiri, F. W. Chang, E. M. Feinberg, H. Gobioff, C. Lee, B. Ozceri, E. Riedel, D. Rochberg, and J. Zelenka. File server scaling with network-attached secure disks. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '97), June 1997.

[9] G. A. Gibson, D. F. Nagle, K. Amiri, F. W. Chang, H. Gobioff, C. Hardin, E. Riedel, D. Rochberg, and J. Zelenka. A cost-effective, high-bandwidth storage architecture. In Proceedings of the Eighth Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 1998.

[10] J. H. Hartman and J. K. Ousterhout. The Zebra striped network file system. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, pages 29–43, 1993.

[11] E. K. Lee and C. A. Thekkath. Petal: Distributed virtual disks. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 84–92, Cambridge, MA, Oct. 1996.

[12] C. A. Thekkath, T. Mann, and E. K. Lee. Frangipani: A scalable distributed file system. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles (SOSP), Oct. 1997.

[13] G. M. Voelker, E. J. Anderson, T. Kimbrel, M. J. Feeley, J. S. Chase, A. R. Karlin, and H. M. Levy. Implementing cooperative prefetching and caching in a globally-managed memory system. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '98), June 1998.

[14] R. Y. Wang, A. Krishnamurthy, R. P. Martin, T. E. Anderson, and D. E. Culler. Modeling and optimizing communication pipelines. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '98), June 1998.

[15] K. G. Yocum, D. C. Anderson, J. S. Chase, S. Gadde, A. J. Gallatin, and A. R. Lebeck. Adaptive message pipelining for network memory and network storage. Technical Report CS-1998-10, Duke University Department of Computer Science, Apr. 1998.

[16] K. G. Yocum, J. S. Chase, A. J. Gallatin, and A. R. Lebeck. Cut-through delivery in Trapeze: An exercise in low-latency messaging. In Sixth IEEE International Symposium on High Performance Distributed Computing (HPDC-6), pages 243–252, Aug. 1997.
`
`