`
Darrell Anderson and Jeff Chase
Department of Computer Science
Duke University
{anderson, chase}@cs.duke.edu

This work is supported by the National Science Foundation (CCR-9624857,
EIA-9870724, and EIA-9972879) and by equipment grants from Intel
Corporation and Myricom.
`
`Abstract
`
`This paper presents a recovery protocol for block I/O
`operations in Slice, a storage system architecture for high-
`speed LANs incorporating network-attached block storage.
`The goal of the Slice architecture is to provide a network
`file service with scalable bandwidth and capacity while
`preserving compatibility with off-the-shelf clients and file
`server appliances. The Slice prototype “virtualizes” the
`Network File System (NFS) protocol by interposing a re-
`quest switching filter at the client’s interface to the network
`storage system (e.g., in a network adapter or switch).
`The distributed Slice architecture separates functions
`typically combined in central file servers, introducing new
`challenges for failure atomicity. This paper presents a pro-
`tocol for atomic file operations and recovery in the Slice ar-
`chitecture, and related support for reliable file storage using
`mirrored striping. Experimental results from the Slice pro-
`totype show that the protocol has low cost in the common
`case, allowing the system to deliver client file access band-
`widths approaching gigabit-per-second network speeds.
`
`1 Introduction
`
`Faster I/O interconnect standards and the arrival of Gi-
`gabit Ethernet greatly expand the capacity of inexpensive
`commodity computers to handle large amounts of data for
`scalable computing, network services, multimedia and vi-
`sualization. These advances and the growing demand for
`storage increase the need for network storage systems that
`are incrementally scalable, reliable, and easy to administer,
`while serving the needs of diverse workloads running on a
`variety of client platforms.
Commercial systems increasingly provide scalable
shared storage by interconnecting storage devices and
servers with dedicated Storage Area Networks (SANs),
e.g., FibreChannel. Yet recent order-of-magnitude improve-
`ments in LAN performance have narrowed the bandwidth
`gap between SANs and LANs. This creates an opportu-
`nity to deliver competitive storage solutions by aggregating
`low-cost storage nodes and servers, using a general-purpose
`LAN as the storage backplane. In such a system it is pos-
`sible to incrementally scale either capacity or bandwidth of
`the shared storage resource by attaching additional storage
`to the network.
`A variety of commercial products and research proposals
`pursue this vision by layering device protocols (e.g., SCSI)
`over IP networks, building cluster file systems that manage
`distributed block storage as a shared disk volume, or in-
`stalling large server appliances to export SAN storage to a
`LAN using network file system protocols. Section 2.1 sur-
`veys some of these systems.
`This paper deals with a network storage architecture —
`called Slice — that takes an alternative approach. Slice
`places a request switching filter at the client’s interface to
`the network storage system; the role of the filter is to “wrap”
`a standard IP-based client/server file system protocol, ex-
`tending it to incorporate an incrementally expandable array
`of network-attached block storage nodes. The Slice pro-
`totype implements the architecture by virtualizing the Net-
`work File System version 3 protocol (NFS V3). The request
`switching filter intercepts and rewrites a subset of the NFS
`V3 packet stream, directing I/O requests to the network
`storage array and associated servers that make up a Slice
`ensemble appearing to the client as a unified NFS volume.
`The system is compatible with off-the-shelf NFS clients and
`servers, in order to leverage the large installed base of NFS
`clients and the high-quality NFS server appliances now on
`the market.
`The Slice architecture assumes a block storage model
`loosely based on a proposal in the National Storage In-
`dustry Consortium (NSIC) for object-based storage devices
`(OBSD) [2]. Key elements of the OBSD proposal were
`in turn inspired by research on Network Attached Secure
`Disks (NASD) [8, 9]. Storage nodes are “object-based”
`rather than sector-based, meaning that requesters address
data on each storage node as logical offsets within storage
objects. A storage object is an ordered sequence of bytes
with a unique identifier. The NASD work and the OBSD
proposal allow for cryptographic protection of object iden-
tifiers if the network is insecure [8].

The Slice architecture separates functions that are com-
bined in central file servers. The contribution of this paper
is to present a simple solution to the coordination and recov-
ery issues raised by this structure. Our approach introduces
a coordinator responsible for preserving atomicity of key
NFS operations, including file truncate/remove, extending
writes, and write commitment. The coordinators use a sim-
ple intention logging protocol, with variants for each oper-
ation type that minimize the common-case costs. We also
show how the protocol supports failure-atomic write com-
mitment for mirrored files in the Slice prototype. Mirroring
consumes more storage and network bandwidth than strip-
ing with RAID redundancy, but it is simple and reliable,
avoids the overhead of computing and updating parity, and
allows load-balanced reads [4, 12].

This paper is structured as follows. Section 2 summa-
rizes the Slice architecture. Section 3 describes mecha-
nisms for operation atomicity and failure handling. Sec-
tion 4 presents experimental results from the Slice prototype
on a Myrinet network, showing that the Slice architecture
and recovery protocols achieve file access performance ap-
proaching gigabit-per-second network speeds, limited pri-
marily by the client NFS implementation. Section 5 con-
cludes.

2 Overview

Figure 1. The Slice distributed storage architecture.

Figure 1 depicts the Slice architecture with NFS clients
and servers. The architecture interposes a “microproxy”
(µproxy) between the client IP stack and the Slice server en-
semble. The µproxy examines NFS requests and responses,
redirecting requests and transforming responses as neces-
sary to represent the distributed storage service as a uni-
fied NFS service to its client. For some operations, the
µproxy must generate new requests and pair responses with
requests. The µproxy may reside within the client itself,
or in a network element along the communication path be-
tween the client and the servers. In our current prototype
the µproxy is implemented as a packet filter installed on the
client below the NFS/UDP/IP stack.

The µproxy is a simple state machine with minimal
buffering requirements. It uses only soft state; the µproxy
may fail without compromising correctness. The µproxy
may reside outside of the trust boundary, although it may
damage the contents of specific files by misusing the au-
thority of users whose requests are routed through it. In this
paper we limit our focus to aspects of the µproxy internals
and policies that are directly related to operation atomicity
and the recovery protocol.

The coordinator plays an important role in managing
global recovery of operations involving multiple sites. A
Slice configuration may contain any number of coordina-
tors, with each coordinator managing operations for some
subset of files. The functions of the coordinator may be
combined with the file server, but we consider them sepa-
rately to emphasize that the architecture is compatible with
standard file servers.

Our implementation combines the coordinator with a
map service responsible for tracking file block location. The
coordinator servers maintain a global block map for each
file giving the storage site for each block. The µproxies
read, cache, modify, and write back fragments of the global
maps as they execute read and write operations on files. The
global maps allow flexible per-file policies for block place-
ment and striping in the network storage array; although the
system may use deterministic block placement functions as
an alternative to the global maps, this paper includes a dis-
cussion of the maps to show how the recovery protocol in-
corporates them.

The µproxy intercepts read and write operations targeted
at file regions beyond a configurable threshold offset. Log-
ical file offsets beyond the threshold are referred to as the
striping zone; the µproxy redirects all reads and writes cov-
ering offsets in the striping zone to an array of block storage
nodes according to system striping policies and the block
maps maintained by the coordinators. The policies and pro-
tocols include support for mirrored striping (“RAID-10”)
for redundancy to protect against storage node failures, as
described in Section 3.2. The Slice storage nodes export
object-based block storage to the network; our prototype
storage nodes accept NFS read and write operations on a
flat space of storage objects uniquely identified by NFS file
handles. Although NFS file handles provide only a weak
`form of protection in our prototype, the architecture is com-
`patible with proposals for cryptographic protection of stor-
`age object identifiers for insecure networks [8].
The µproxy identifies read and write operations in the
`striping zone by examining the request offset and length.
`Small files are not striped; these are files whose logical size
`is below the threshold offset, i.e., that have never received
`a write in the striping zone. Note that even large files are
`not striped in their entirety; data written below the thresh-
`old offset of a large file is stored along with the small files.
`File regions outside the striping zone do not benefit from
`striping, but the performance cost becomes progressively
`less significant as file sizes grow.
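
To make the routing rule concrete, the following minimal sketch (ours,
not code from the Slice prototype; the threshold constant and route
labels are hypothetical) shows the offset/length test a µproxy-style
filter could apply to each NFS read or write:

```python
# Hedged sketch of the striping-zone routing test. A request whose byte
# range lies entirely below the threshold offset passes through to the
# NFS server; any request covering striping-zone offsets is redirected
# to the block storage array.

THRESHOLD = 64 * 1024  # hypothetical 64 KB threshold offset

def route_io(offset: int, length: int) -> str:
    """Classify one NFS read or write by the file region it covers."""
    if offset + length <= THRESHOLD:
        return "nfs-server"     # small-file region: pass through
    return "storage-array"      # striping zone: redirect via block maps
```

A request that straddles the threshold covers striping-zone offsets and
is therefore redirected like any other striped I/O.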
In addition to the interactions required for I/O requests,
the µproxies cooperate with the network storage nodes and
the file’s coordinator to allocate global maps for extending
write operations, and to release storage on remove and trun-
cate operations. These multisite operations introduce recov-
ery issues described in the next section. All other file oper-
ations pass through the µproxy to the NFS server as they
`did before, and incur no additional overhead for managing
`distributed storage.
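
As a rough illustration of the global block maps these exchanges
manipulate, the sketch below (our own rendering; all field names are
invented rather than taken from the prototype) shows one plausible
shape for a map fragment that a µproxy might read, cache, modify, and
write back:

```python
# Hedged sketch of a per-file block map fragment: for each logical
# block it records which storage node holds the block and which storage
# object contains it, mirroring the map service described in Section 2.

from dataclasses import dataclass, field

@dataclass
class MapEntry:
    storage_node: str   # network address or id of the storage node
    object_id: int      # storage object holding this block's data

@dataclass
class MapFragment:
    file_handle: bytes                  # NFS file handle naming the file
    first_block: int                    # first logical block covered
    entries: list = field(default_factory=list)  # one MapEntry per block

    def lookup(self, block: int) -> MapEntry:
        """Map a logical block number to its storage location."""
        return self.entries[block - self.first_block]
```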
`This architecture scales to higher bandwidth and capac-
`ity by adding storage nodes, since the NFS server is outside
`the critical path of reads and writes handled by the block
`storage nodes. It is also possible to scale or replicate other
`file service functions within the context of the Slice request
`switching architecture. For simplicity this paper assumes
`that a single standard NFS file server manages the entire
`volume name space.
`The goal of the mechanisms described in this paper is to
deliver consistency and failure properties no weaker than
those of commercial NFS implementations. While the basic ap-
`proach is quite similar to write-ahead logging that might
`be taken on a journaling central file server with distributed
`disks, we extend it to support multisite operations without
`the awareness of the client, NFS file server, or the storage
`nodes. Our approach to committing writes assumes use of
`the NFS V3 asynchronous writes and write commitment
`protocol, as described below. This paper does not address
`the issue of concurrent write sharing of files, and Slice as
`defined may provide weaker concurrent write sharing guar-
`antees than some NFS implementations. However, the ar-
`chitecture is compatible with NFS file leasing extensions
`for consistent concurrent write sharing, as defined in NQ-
`NFS [13] and early IETF draft proposals for the NFS V4
`protocol.
`
`2.1 Related Work
`
`The Cambridge Universal File Server [5] proposed struc-
`turing a distributed file system as a separate name service
`and file block storage service. One system to take this
`
`approach was Swift [6]. Slice is similar to Swift in that
`each client reads or writes data directly to block storage
`sites on the network, choreographed by a client distribution
`agent using maps provided by a third-party storage media-
`tor. Another system derived from the Swift architecture is
`Cheops, a striping file system for CMU NASD storage sys-
`tems [9, 8]. The Swift and Cheops work did not directly
`address atomicity or recovery issues.
Amiri et al. [1] show how to preserve read and write
`atomicity in a shared storage array using RAID striping with
`parity. This work focuses primarily on safe concurrent ac-
`cesses to a fixed space of blocks. It does not address file
`system consistency in the presence of host failures.
`A number of scalable file systems separate some strip-
`ing functions from other file system code by building the
`file system above a striped network storage volume using
`a shared disk model. This approach has been used with
`both log-structured [11, 3] and conventional [15, 14] file
`system structures. In these systems, multisite operations in-
`cluding truncate and remove are made failure-atomic using
`write-ahead metadata logging on the file server. The log-
`structured approach also relies in part on a separate cleaner
`process to reclaim space.
`Relative to these systems, this paper shows how to fac-
`tor out recovery functions so that multisite recovery may
`be interposed in the context of a standard client/server file
`system protocol, without modifying the client or server.
`
`3 Atomic Operations on Network Storage
`
A multisite operation begins when the µproxy intercepts
an NFS V3 write, remove, truncate (setattr) or commit re-
quest from a client. To handle the request, the µproxy may
redirect the request or generate additional request messages
to nodes in the Slice ensemble, including storage nodes, the
coordinator for the target file, and the NFS server. Figure 2
illustrates the message exchanges for the multisite opera-
tions discussed in this section.
When the operation is complete at all sites, the µproxy
passes through an NFS V3 response to the client. If any
participant fails during this sequence — the µproxy, a stor-
age node, the coordinator, or the file server — a recovery
protocol is initiated. The recovery protocol is specific to the
particular operation in progress, and it may either complete
the operation (roll forward) or abort it (roll back). If the sys-
tem aborts the operation or delays the response, a standard
NFS client may reinitiate the operation by retransmitting the
request after a timeout, unless the client itself has failed.
The basic protocol is as follows. At the start of the op-
eration, the µproxy sends to the coordinator an intention to
perform the operation (Figure 2).
The coordinator logs the intention to stable disk storage and
responds, authorizing the µproxy to carry out the operation.
`
`Figure 2. Message exchanges for multisite Slice/NFS operations. Dotted line message exchanges
`are avoided in common cases. Square endpoints represent synchronous storage writes.
`
When the operation is complete, the µproxy notifies the co-
ordinator with a completion message, asynchronously clear-
ing the intention (Figure 2). If the coordinator
does not receive the completion within a specified period, it
probes one or more participants to determine if the opera-
tion completed, and initiates recovery if necessary. A failed
coordinator recovers by scanning its intentions log, com-
pleting or aborting operations in progress at the time of the
failure.
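
The sketch below (ours; the method names, in-memory log, and timeout
value are assumptions rather than the prototype's interfaces) captures
the coordinator's side of this discipline: a synchronous intention
write, an asynchronous completion, a probe timer, and a recovery scan:

```python
# Hedged sketch of the coordinator's intention log. A real coordinator
# forces each intention to stable disk storage before responding; here a
# dict stands in for the log so the control flow is visible.

import time

class Coordinator:
    def __init__(self, timeout: float = 30.0):
        self.log = {}           # op_id -> (operation record, start time)
        self.timeout = timeout  # probe period for silent operations

    def intend(self, op_id, record):
        """Log an intention (synchronously, in a real system) and
        authorize the µproxy to carry out the operation."""
        self.log[op_id] = (record, time.time())
        return "authorized"

    def complete(self, op_id):
        """Clear an intention; may be applied lazily/asynchronously."""
        self.log.pop(op_id, None)

    def expired(self):
        """Intentions that outlived the timeout: probe participants."""
        now = time.time()
        return [op for op, (_, t) in self.log.items()
                if now - t > self.timeout]

    def recover(self):
        """After a crash: probe participants for each logged intention,
        then roll the operation forward or back and clear it."""
        for op_id, (record, _) in list(self.log.items()):
            # operation-specific probe and roll-forward/back elided
            self.complete(op_id)
```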
`
`This is a variant of the standard two-phase commit pro-
`tocol [10] adapted to a file system context with idempotent
`operations. The details for each operation vary significantly.
`In particular, each operation allows optimizations to avoid
`most messaging and logging delays in common cases, as
`described below. Slice further improves performance by
`avoiding multisite operations for small files stored entirely
`on the file server, i.e., files that have never received writes
`beyond the configurable threshold offset. In this way, the
`system amortizes the costs of the protocol across a larger
`number of bytes and operations, since it incurs these costs
`only to create and truncate/remove large files, and to com-
`mit groups of writes to large files.
`
`The following subsections describe the protocol as it ap-
`plies to each type of multisite operation. We then set the
`protocol in context with conventional two-phase commit.
`
`3.1 Write Commitment
`
`An NFS V3 commit operation stabilizes pending or un-
stable writes on a given file. The NFS V3 protocol allows a
failed server to discard any subset of the uncommit-
`ted writes and associated metadata, provided that the client
`can detect any loss by comparing verifier values returned
`by the file service in its responses to write and commit op-
`erations. NFS V3 clients buffer uncommitted writes locally
`so that they may re-execute these writes after a server fail-
`ure. Clients may safely discard their buffered writes after a
`successful commit. Note that the verifier value returned by
`write and commit is not itself significant; the service guar-
`antees only that the verifier changes after a failure.
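
For reference, the client-side contract the protocol builds on can be
sketched in a few lines (ours; an actual NFS V3 client's buffering code
is more involved): the client holds its unstable writes until a commit
succeeds, and re-sends them whenever the verifier changes:

```python
# Hedged sketch of NFS V3 verifier handling on the client. A changed
# verifier means the service may have lost unstable writes, so every
# buffered uncommitted write is retransmitted.

class ClientWriteCache:
    def __init__(self):
        self.verifier = None     # last verifier seen from the service
        self.uncommitted = []    # buffered writes awaiting commit

    def on_response(self, verifier: bytes, resend):
        """Check the verifier in a write/commit response."""
        if self.verifier is not None and verifier != self.verifier:
            for w in self.uncommitted:
                resend(w)        # server lost state: replay writes
        self.verifier = verifier

    def on_commit_ok(self):
        """A successful commit lets the client discard its buffers."""
        self.uncommitted.clear()
```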
To handle a commit on a file that has unstable writes in
the striping zone, the µproxy executes a message exchange
with each storage node that owns uncommitted writes on
the file (Figure 2). The µproxy also completes
the writes, which may involve an exchange with the coor-
dinator (map service) and/or the NFS server. The µproxy
pushes any updates to the file’s map back to the coordinator
and, if the write enlarged the file, pushes the new
file size to the NFS server via a setattr. When
all operations have completed successfully, the µproxy re-
sponds to the client with a valid verifier.
The µproxy detects any failures by comparing response
verifiers against a stored copy of the previous verifier re-
turned by each participant. If any participant fails, the
µproxy reports the failure by changing the response veri-
fier to the client. If the µproxy itself loses its state, it may
report failure for a commit that has successfully completed
at all sites. This forces the client to reinitiate writes unnec-
essarily, but is otherwise harmless.
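
Putting the pieces of this section together, a minimal sketch of the
µproxy's commit path might look as follows (ours; the participant
objects, their commit() call, and the verifier bookkeeping are
structural assumptions, not the prototype's code):

```python
# Hedged sketch of commit fan-out at the µproxy. Any participant whose
# verifier changed since the last exchange is treated as failed, and the
# µproxy hands the client a fresh verifier to force a rewrite.

import os

def handle_commit(file, participants, last_verifier):
    """participants: storage nodes with uncommitted writes, plus the
    coordinator and NFS server exchanges where needed. last_verifier
    maps participant names (and "client") to stored verifiers."""
    failed = False
    for p in participants:
        v = p.commit(file)  # synchronous exchange with the participant
        if last_verifier.get(p.name) not in (None, v):
            failed = True   # participant restarted; writes may be lost
        last_verifier[p.name] = v
    if failed:
        return os.urandom(8)  # changed verifier: client must re-send
    return last_verifier.setdefault("client", os.urandom(8))
```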
`Intention logging is unnecessary for write and commit
`on unmirrored files. This is because the file service remains
`in a legal state throughout the write sequence and commit.
`The exact ordering of operations is not strictly important;
the commit is complete only when the client has discarded its
buffered writes after receiving a valid response. If a failure
`occurs, the client itself is responsible for restarting the write
`sequence after receiving a negative response or no response
`to its commit request.
`
`3.2 Mirrored Writes
`
`Writes to a mirrored file are replicated using a read-any-
`write-all model. Without loss of generality we assume that
`the replication degree is two. A replication degree of two
`guarantees that a file is available unless two or more stor-
`age nodes fail concurrently, or the file’s coordinator fails to-
`gether with one storage node and a client who was actively
`writing the file.
`Block maps for a mirrored file have dual entries for each
`logical block, with one entry for each block replica. The
`proxy writes each block to a pair of storage nodes selected
`according to some placement policy, which is not important
`for the purposes of this paper. A mirrored write is consid-
`ered complete only after it has committed; i.e., both storage
`nodes confirm that the block is stable, and (if applicable) the
`file’s coordinator (map service) confirms that the covering
`map fragment is stable.
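
Stated as a predicate (our formulation, with invented parameter names),
the completion rule for a replication degree of two is:

```python
# Hedged sketch of the mirrored-write completion rule: a write commits
# only when both replicas are stable and, if the block map changed, the
# coordinator has stabilized the covering map fragment.

def mirrored_write_complete(replica_stable, map_dirty, map_stable):
    """replica_stable: acknowledgements from both storage nodes."""
    return all(replica_stable) and (map_stable or not map_dirty)

# Both replicas stable, no map change: complete.
assert mirrored_write_complete([True, True], False, False)
# One replica unconfirmed: not complete, even with a stable map.
assert not mirrored_write_complete([True, False], True, True)
```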
`Mirrored writes use the intention protocol to reconcile
`replicas in the event of a failure. If a participant fails while
`there are incomplete mirrored writes, then it is possible that
`the write executed at one replica but not the other. In prac-
`tice, this does not occur unless a client fails concurrently
`with one or more server failures, since an NFS V3 client
`retransmits all uncommitted writes after a server failure, as
`described in Section 3.1.
The mirrored write protocol piggybacks intention mes-
sages for mirrored writes on the µproxy’s request for the
map fragment covering the write. Before returning the re-
quested map fragment, the coordinator logs the intention
record and updates a conservative in-memory active region
list of offset ranges or map fragments that might be held by
each µproxy, and that may have incomplete writes. These
intentions are cleared implicitly by a commit request cov-
ering the region; commit causes the µproxy to discard all
covered map fragments for a mirrored file.
If a client (or its µproxy) fails, any uncommitted mir-
rored writes are guaranteed to be covered by the coordina-
tor’s active region list. The coordinator can reconcile the
`replicas for these regions by traversing the region list; any
`conflict within the active regions may be resolved by select-
`ing one replica to dominate. In principle, the system can
`serve one copy of the file concurrently with reconciliation,
`even if a storage node fails. If the coordinator fails, it re-
`covers a conservative approximation of its active region list
`from its intentions log.
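
A sketch of that reconciliation pass is shown below (ours; read_block
and write_block stand in for whatever replica I/O interface the
coordinator would use, and, as Section 4 notes, this pass is not
implemented in the prototype):

```python
# Hedged sketch of replica reconciliation over the active region list.
# Within an active region, either replica of a block is a legal value;
# one copy is simply chosen to dominate and copied over the other.

def reconcile(active_regions, read_block, write_block):
    """active_regions: iterable of (file, first_block, last_block)."""
    for file, lo, hi in active_regions:
        for block in range(lo, hi + 1):
            primary = read_block(file, block, replica=0)
            mirror = read_block(file, block, replica=1)
            if primary != mirror:
                # Pick replica 0 to dominate; any consistent choice works.
                write_block(file, block, replica=1, data=primary)
```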
`In practice, most intention logging activity for mirrored
`writes may be optimized away. Slice logs these intentions
`only when a mirrored file first comes into active write use,
e.g., when a µproxy first requests map fragments with in-
`tent to write. If a file falls out of write use (no map frag-
`ment requests received since the last commit completion),
`the coordinator marks the file as inactive by logging a write-
`complete entry. This protocol adds a synchronous log write
`to the write-open path for mirrored files, but this cost is
`amortized over all writes on the file. It allows a recover-
`ing coordinator to identify a superset of the mirrored files
`that may need reconciliation after a multiple failure.
`One drawback of the protocol is that a buggy or mali-
`cious client might cause the active region list to grow with-
`out bound by issuing large numbers of writes and never
`committing them. This is not a problem with clients that
`correctly buffer their uncommitted writes, since the num-
`ber of writes is limited by available memory; in any case,
`standard clients commit writes at regular intervals under the
`control of a system update daemon. For malicious clients,
`the system may avoid this problem by weakening replica
`consistency guarantees for mirrored files with writes left un-
`committed for unreasonably long periods.
`
`3.3 Truncate and Remove
`
The protocol for truncate and remove relies on the NFS
server to maintain an authoritative record of the file length
and link count. The µproxy first consults a set of attributes
for the target file (Figure 2); the attributes must
be current up to the “three second window” defined by NFS
implementations (see Section 3.4). If the target file’s log-
ical size shows that it has data in the striping zone, the
µproxy issues an intention to the coordinator before
issuing the NFS operation to the file server. Once the
operation has committed at the NFS server, the protocol
contacts the storage nodes and coordinator (map service)
to release storage, then registers a completion with the
coordinator. In our current prototype the µproxy executes
the entire protocol, but it could be done directly by the
coordinator, simplifying the µproxy and saving one message
exchange (the intention response and the completion).
`If the intention expires, the coordinator probes the NFS
`server (using a getattr) to determine the status of the op-
`eration. If the operation completed on the NFS server, the
`
`coordinator rolls the operation forward by contacting the
`storage nodes to release any orphaned storage.
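
The ordering that makes this safe can be summarized in a short sketch
(ours; the coord, nfs, and storage-node calls are placeholders for the
message exchanges of Figure 2): the intention reaches stable storage
before the NFS server commits, and storage is released only afterward,
so recovery can always decide the outcome by probing the server:

```python
# Hedged sketch of the truncate/remove protocol, with the NFS server as
# the primary commit site.

def remove_large_file(coord, nfs, storage_nodes, fh):
    op = coord.intend("remove", fh)  # 1. intention to stable storage
    nfs.remove(fh)                   # 2. commit at the NFS server
    for node in storage_nodes:       # 3. release striped storage
        node.release(fh)
    coord.complete(op)               # 4. clear the intention (async)

def recover_remove(coord, nfs, storage_nodes, fh, op):
    """Run by the coordinator when the intention expires."""
    if not nfs.exists(fh):           # probe with a getattr
        for node in storage_nodes:   # committed: roll forward
            node.release(fh)
    coord.complete(op)               # otherwise roll back: just clear
```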
`
`3.4 Enlarging Writes
`
`The truncate/remove protocol in Section 3.3 must avoid
`a race with an enlarging write, a special case of extending
`write that extends a “small” file beyond the threshold offset
`and into the striping zone, making it a “large” file. The dan-
`ger is that another client will complete an enlarging write
after the µproxy consults the file’s logical size, recognizing
it as a small file, and before the µproxy issues the trun-
cate/remove operation to the NFS server. If this occurs, the
µproxy could fail to notify the coordinator of the need to
`release network storage allocated to the newly enlarged file,
`leaving it orphaned by the truncate/remove.
`One way to prevent the race is to conservatively notify
`the coordinator of all truncate/remove operations, even for
`small files. However, this imposes an extra message latency
`and perhaps a disk fault on truncates and removes of small
`files. We prefer instead to shift the costs to the enlarging
`write operation, increasing the creation cost of large files.
`The enlarging write cost is incurred once for each large file,
`and is amortized over all I/O operations on the file.
`Our approach uses a variant of the basic intention pro-
`tocol to detect the race when it occurs, and to release any
orphaned storage. The trick is for the coordinator to de-
tect that a µproxy has executed a truncate/remove opera-
`tion based on attributes that were fetched before the com-
`pletion of an enlarging write. After an enlarging write has
`completed, the file’s coordinator contacts the NFS server
`to validate the file’s existence and logical size. The coor-
`dinator delays this validation until a fixed waiting period
`has elapsed. The waiting period is chosen to exceed the
`time bound on the staleness of cached attributes in NFS
`(the three second rule) with ample slack time to account
`for clock skew and operation latencies.
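
A sketch of the validation delay (ours; the scheduler interface and the
slack value are assumptions, since the paper fixes only the requirement
that the wait exceed the three-second attribute window plus slack for
skew and latency):

```python
# Hedged sketch of delayed validation after an enlarging write. The
# coordinator waits out the NFS attribute-staleness window before asking
# the server whether the newly enlarged file still exists.

STALENESS_BOUND = 3.0  # seconds: the NFS "three second rule"
SLACK = 2.0            # hypothetical allowance for skew and latency

def schedule_validation(scheduler, nfs, coord, fh):
    def validate():
        if nfs.getattr(fh) is None:
            # The file vanished: a racing truncate/remove used stale
            # attributes, so release the orphaned network storage.
            coord.release_orphans(fh)
    scheduler.call_later(STALENESS_BOUND + SLACK, validate)
```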
`
`3.5 Comparison to Two-Phase Commit
`
`The basic intention logging protocol used in Slice is sim-
`ilar to conventional two-phase commit [10], but there are
`several key differences. These are brought about by the
`simple nature of the file system operations, which tends to
`make the protocol more efficient than a general two-phase
`commit in the common cases.
`
- For simplicity, the µproxy assumes most of the func-
tions of the traditional commit coordinator: it trans-
mits requests to participants and gathers commit votes.
However, it never actually performs a commit since it
has no stable storage.
`
- Participants execute their portion of the operation in a
`fixed partial order, with one participant acting as the
`primary commit site. The purpose of the intention pro-
`tocol is to detect and recover from failures that inter-
`rupt the sequence before the primary commit site exe-
`cutes its part of the operation. For example, the NFS
`server itself unwittingly acts as the primary commit
`site for removes, truncates, and extending writes (or
`extending write commits). For truncate and remove,
`a failure after the NFS server commits causes the re-
`covery protocol to roll forward by releasing orphaned
`storage, similar to a conventional journaling file sys-
`tem or a file system scavenger (fsck).

- There is no need to notify participants other than the
`coordinator that the operation committed. The pre-
`commit is sufficient to stabilize the data, and the par-
`ticipants do not hold locks on the committed state. File
`operations are serialized (when necessary) at the NFS
`server (for name space operations) or at the coordina-
`tor (for reads and writes of shared files).
`
`4 Prototype and Experimental Results
`
`We have implemented the Slice prototype as a set of
`loadable kernel modules for the FreeBSD 4.0 operating
`system.
`The network storage nodes in our prototype
`are FreeBSD PCs serving blocks from local disks using
`UFS/FFS as a storage manager, with an external hash to
`map opaque NFS file handles to local files. The coordi-
`nator is implemented as an extension to the storage node
`module, consisting of a total of about 1400 lines of code.
In our prototype, the µproxy is an IP filter between the IP
stack and the network driver. The µproxy may rewrite or
consume packets, and it may also generate new IP packets.
The µproxy is a non-blocking state machine consisting of
about 2500 lines of code. An overarching goal is to keep
the µproxy simple, small, and fast.
`This section presents experimental results from inter-
`posed file striping as implemented in the Slice prototype.
The intent is to show the costs of the interposed µproxy
architecture, and the effect of these costs on delivered file
access bandwidths. The prototype µproxy, coordinator, and
`storage service implement mechanisms needed for recov-
`ery during normal operation, including the coordinator in-
`tentions log. Thus they reflect the costs of recovery as de-
`scribed in Section 3. However, reconciliation of active re-
`gions for mirrored replicas is not implemented.
`In these experiments, clients are 450 MHz Pentium-III
`PCs using the Asus P2B motherboard with a 32-bit, 33 MHz
PCI bus and an Intel 440BX chipset. The NFS server and Slice
`storage nodes are Dell 4400 systems each with one 733
`MHz Pentium-III Xeon using a ServerWorks chipset. The
`
server network adapter and disk controllers are on indepen-
dent peer 64-bit, 66 MHz PCI busses. Each has four 18 GB
Seagate Ultra-2 Cheetah disks. All machines are equipped
with Myricom LANai 4 or 7 adapters, with kernels built
from the same FreeBSD 4.0 source pool.

All network communication in these experiments uses
Trapeze, a Myrinet messaging system optimized for net-
work I/O traffic [7]. In this configuration, Trapeze/Myrinet
provides 130 MB/s of point-to-point bandwidth with a 32
KB transfer size. NFS traffic uses UDP/IP with a 32 KB
MTU, routed through a Trapeze device driver.

Figure 3. File read, write, and remove timings using a 64 KB threshold offset.

Figure 3 shows the total time to read, write, and remove a
file, varying file size from 8 KB to 232 KB, with the striping
zone threshold set to 64 KB. All tests start with cold client
and storage node [...] blocks on the NFS server, bounding
deletion time.

When file size exceeds the striping zone threshold, la-
tencies jump as operations begin to involve multiple sites
and incur costs of the intention logging protocol. For ex-
ample, read and write costs increase as the µproxy faults
block maps from the coordinator before issuing I/O beyond
the threshold. Writes and removes register an intent with
the coordinator before performing the first extending write
into the striping zone or before issuing the remove to the
NFS server, respectively. The resulting discontinuities are
clearly shown in the graph; however, the cost becomes pro-
gressively less significant as file sizes grow.

Both read and write times increase linearly with file size,
and remove time remains constant. The prototype serializes
some sub-operations of commit for simplicity, compromis-
ing write latency slightly. At these sizes, mirroring has a
negligible effect on both read and write times.

The architecture allows very high bandwidth for large
files. Figure 4 shows I/O bandwidth delivered to a single
client and a group of clients, varying the number of storage
nodes. Bandwidths are measured using dd to read or write a
1.25 GB file in 32 KB chunks, with a Slice striping grain of
32 KB. Each graph gives both non-redundant and mirrored
storage results.

Figure 4. Single-client and saturation bandwidth for sequential read and write.

The left-hand graph shows the measured I/O bandwidth
delivered to a single client with a Lanai-7 adapter. We mod-
ified the FreeBSD NFS client for zero-copy reads; however,
a copy remains in the write path. Single client read bandwidth
scales with the number of storage nodes until the client CPU
saturates at 110 MB/s. The copy in the write path saturates
a client writing at 53 MB/s.

Mirrored read bandwidth is roughly half that of non-
mirrored, due to an artifact of the striping policy and our
use of UFS/FFS as the block storage manager in the proto-
type. UFS/FFS aggressively prefetches from local disk into
local memory when it detects sequential or near-sequential
accesses. In this case, this policy consumes storage band-
width to load data that the client chooses to read from an-
other node. With a replication degree of two, clients read