Failure-Atomic File Access in an Interposed Network Storage System
Darrell Anderson and Jeff Chase
Department of Computer Science
Duke University
{anderson, chase}@cs.duke.edu

This work is supported by the National Science Foundation (CCR-96-24857, EIA-9870724, and EIA-9972879) and by equipment grants from Intel Corporation and Myricom.

Abstract

This paper presents a recovery protocol for block I/O operations in Slice, a storage system architecture for high-speed LANs incorporating network-attached block storage. The goal of the Slice architecture is to provide a network file service with scalable bandwidth and capacity while preserving compatibility with off-the-shelf clients and file server appliances. The Slice prototype "virtualizes" the Network File System (NFS) protocol by interposing a request switching filter at the client's interface to the network storage system (e.g., in a network adapter or switch). The distributed Slice architecture separates functions typically combined in central file servers, introducing new challenges for failure atomicity. This paper presents a protocol for atomic file operations and recovery in the Slice architecture, and related support for reliable file storage using mirrored striping. Experimental results from the Slice prototype show that the protocol has low cost in the common case, allowing the system to deliver client file access bandwidths approaching gigabit-per-second network speeds.
1 Introduction

Faster I/O interconnect standards and the arrival of Gigabit Ethernet greatly expand the capacity of inexpensive commodity computers to handle large amounts of data for scalable computing, network services, multimedia and visualization. These advances and the growing demand for storage increase the need for network storage systems that are incrementally scalable, reliable, and easy to administer, while serving the needs of diverse workloads running on a variety of client platforms.

Commercial systems increasingly provide scalable shared storage by interconnecting storage devices and servers with dedicated Storage Area Networks (SANs), e.g., FibreChannel. Yet recent order-of-magnitude improvements in LAN performance have narrowed the bandwidth gap between SANs and LANs. This creates an opportunity to deliver competitive storage solutions by aggregating low-cost storage nodes and servers, using a general-purpose LAN as the storage backplane. In such a system it is possible to incrementally scale either capacity or bandwidth of the shared storage resource by attaching additional storage to the network.
A variety of commercial products and research proposals pursue this vision by layering device protocols (e.g., SCSI) over IP networks, building cluster file systems that manage distributed block storage as a shared disk volume, or installing large server appliances to export SAN storage to a LAN using network file system protocols. Section 2.1 surveys some of these systems.

This paper deals with a network storage architecture — called Slice — that takes an alternative approach. Slice places a request switching filter at the client's interface to the network storage system; the role of the filter is to "wrap" a standard IP-based client/server file system protocol, extending it to incorporate an incrementally expandable array of network-attached block storage nodes. The Slice prototype implements the architecture by virtualizing the Network File System version 3 protocol (NFS V3). The request switching filter intercepts and rewrites a subset of the NFS V3 packet stream, directing I/O requests to the network storage array and associated servers that make up a Slice ensemble appearing to the client as a unified NFS volume. The system is compatible with off-the-shelf NFS clients and servers, in order to leverage the large installed base of NFS clients and the high-quality NFS server appliances now on the market.

The Slice architecture assumes a block storage model loosely based on a proposal in the National Storage Industry Consortium (NSIC) for object-based storage devices (OBSD) [2]. Key elements of the OBSD proposal were in turn inspired by research on Network Attached Secure Disks (NASD) [8, 9]. Storage nodes are "object-based" rather than sector-based, meaning that requesters address data on each storage node as logical offsets within storage objects.
A storage object is an ordered sequence of bytes with a unique identifier. The NASD work and the OBSD proposal allow for cryptographic protection of object identifiers if the network is insecure [8].

The Slice architecture separates functions that are combined in central file servers. The contribution of this paper is to present a simple solution to the coordination and recovery issues raised by this structure. Our approach introduces a coordinator responsible for preserving atomicity of key NFS operations, including file truncate/remove, extending writes, and write commitment. The coordinators use a simple intention logging protocol, with variants for each operation type that minimize the common-case costs. We also show how the protocol supports failure-atomic write commitment for mirrored files in the Slice prototype. Mirroring consumes more storage and network bandwidth than striping with RAID redundancy, but it is simple and reliable, avoids the overhead of computing and updating parity, and allows load-balanced reads [4, 12].

This paper is structured as follows. Section 2 summarizes the Slice architecture. Section 3 describes mechanisms for operation atomicity and failure handling. Section 4 presents experimental results from the Slice prototype on a Myrinet network, showing that the Slice architecture and recovery protocols achieve file access performance approaching gigabit-per-second network speeds, limited primarily by the client NFS implementation. Section 5 concludes.

2 Overview

Figure 1. The Slice distributed storage architecture.

Figure 1 depicts the Slice architecture with NFS clients and servers. The architecture interposes a "microproxy" (μproxy) between the client IP stack and the Slice server ensemble.
The μproxy examines NFS requests and responses, redirecting requests and transforming responses as necessary to represent the distributed storage service as a unified NFS service to its client. For some operations, the μproxy must generate new requests and pair responses with requests. The μproxy may reside within the client itself, or in a network element along the communication path between the client and the servers. In our current prototype the μproxy is implemented as a packet filter installed on the client below the NFS/UDP/IP stack.

The μproxy is a simple state machine with minimal buffering requirements. It uses only soft state; the μproxy may fail without compromising correctness. The μproxy may reside outside of the trust boundary, although it may damage the contents of specific files by misusing the authority of users whose requests are routed through it. In this paper we limit our focus to aspects of the μproxy internals and policies that are directly related to operation atomicity and the recovery protocol.

The coordinator plays an important role in managing global recovery of operations involving multiple sites. A Slice configuration may contain any number of coordinators, with each coordinator managing operations for some subset of files. The functions of the coordinator may be combined with the file server, but we consider them separately to emphasize that the architecture is compatible with standard file servers.

Our implementation combines the coordinator with a map service responsible for tracking file block location. The coordinator servers maintain a global block map for each file giving the storage site for each block. The μproxies read, cache, modify, and write back fragments of the global maps as they execute read and write operations on files. The global maps allow flexible per-file policies for block placement and striping in the network storage array; although the system may use deterministic block placement functions as an alternative to the global maps, this paper includes a discussion of the maps to show how the recovery protocol incorporates them.

The μproxy intercepts read and write operations targeted at file regions beyond a configurable threshold offset. Logical file offsets beyond the threshold are referred to as the striping zone; the μproxy redirects all reads and writes covering offsets in the striping zone to an array of block storage nodes according to system striping policies and the block maps maintained by the coordinators. The policies and protocols include support for mirrored striping ("RAID-10") for redundancy to protect against storage node failures, as described in Section 3.2. The Slice storage nodes export object-based block storage to the network; our prototype storage nodes accept NFS read and write operations on a flat space of storage objects uniquely identified by NFS file handles.
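To make the request-switching rule concrete, the sketch below (hypothetical Python, not the in-kernel packet filter) shows how an interposed filter might classify an NFS read or write against the striping-zone threshold and look up target storage nodes in a cached block map fragment. The names, data structures, and the fallback on a map miss are illustrative assumptions rather than the Slice implementation; the threshold and striping grain values follow the prototype configuration reported in Section 4.

```python
# Illustrative sketch of the request-switching decision; not the Slice kernel code.
from dataclasses import dataclass, field

THRESHOLD = 64 * 1024      # striping-zone threshold offset (64 KB in the prototype's tests)
STRIPE_GRAIN = 32 * 1024   # striping grain (32 KB in the reported experiments)

@dataclass
class MapFragment:
    # logical block number -> list of storage node ids (two entries when mirrored)
    entries: dict = field(default_factory=dict)

def route_request(offset, length, fragment, nfs_server="nfs0"):
    """Return a list of (targets, block) pairs describing where the filter would
    forward an NFS read/write covering [offset, offset + length)."""
    if offset + length <= THRESHOLD:
        # Entirely below the threshold: pass through to the central NFS server.
        return [(nfs_server, None)]
    targets = []
    first = max(offset, THRESHOLD) // STRIPE_GRAIN
    last = (offset + length - 1) // STRIPE_GRAIN
    for block in range(first, last + 1):
        replicas = fragment.entries.get(block)
        if replicas is None:
            # Map miss: the proxy would first fetch this fragment from the coordinator.
            replicas = ["<fetch map fragment from coordinator>"]
        targets.append((replicas, block))
    return targets

if __name__ == "__main__":
    frag = MapFragment(entries={2: ["node0", "node1"], 3: ["node2", "node3"]})
    print(route_request(16 * 1024, 8 * 1024, frag))    # small-file region -> NFS server
    print(route_request(80 * 1024, 48 * 1024, frag))   # striping zone -> storage nodes
```

A request that straddles the threshold would presumably be split between the file server and the storage nodes; for simplicity the sketch only routes the blocks above the threshold.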
Although NFS file handles provide only a weak form of protection in our prototype, the architecture is compatible with proposals for cryptographic protection of storage object identifiers for insecure networks [8].

The μproxy identifies read and write operations in the striping zone by examining the request offset and length. Small files are not striped; these are files whose logical size is below the threshold offset, i.e., that have never received a write in the striping zone. Note that even large files are not striped in their entirety; data written below the threshold offset of a large file is stored along with the small files. File regions outside the striping zone do not benefit from striping, but the performance cost becomes progressively less significant as file sizes grow.

In addition to the interactions required for I/O requests, the μproxies cooperate with the network storage nodes and the file's coordinator to allocate global maps for extending write operations, and to release storage on remove and truncate operations. These multisite operations introduce recovery issues described in the next section. All other file operations pass through the μproxy to the NFS server as they did before, and incur no additional overhead for managing distributed storage.

This architecture scales to higher bandwidth and capacity by adding storage nodes, since the NFS server is outside the critical path of reads and writes handled by the block storage nodes. It is also possible to scale or replicate other file service functions within the context of the Slice request switching architecture. For simplicity this paper assumes that a single standard NFS file server manages the entire volume name space.

The goal of the mechanisms described in this paper is to deliver consistency and failure properties that are no weaker than commercial NFS implementations. While the basic approach is quite similar to the write-ahead logging that might be employed on a journaling central file server with distributed disks, we extend it to support multisite operations without the awareness of the client, the NFS file server, or the storage nodes. Our approach to committing writes assumes use of the NFS V3 asynchronous write and write commitment protocol, as described below. This paper does not address the issue of concurrent write sharing of files, and Slice as defined may provide weaker concurrent write sharing guarantees than some NFS implementations. However, the architecture is compatible with NFS file leasing extensions for consistent concurrent write sharing, as defined in NQ-NFS [13] and early IETF draft proposals for the NFS V4 protocol.
2.1 Related Work

The Cambridge Universal File Server [5] proposed structuring a distributed file system as a separate name service and file block storage service. One system to take this approach was Swift [6]. Slice is similar to Swift in that each client reads or writes data directly to block storage sites on the network, choreographed by a client distribution agent using maps provided by a third-party storage mediator. Another system derived from the Swift architecture is Cheops, a striping file system for CMU NASD storage systems [9, 8]. The Swift and Cheops work did not directly address atomicity or recovery issues.

Amiri et al. [1] show how to preserve read and write atomicity in a shared storage array using RAID striping with parity. This work focuses primarily on safe concurrent accesses to a fixed space of blocks. It does not address file system consistency in the presence of host failures.

A number of scalable file systems separate some striping functions from other file system code by building the file system above a striped network storage volume using a shared disk model. This approach has been used with both log-structured [11, 3] and conventional [15, 14] file system structures. In these systems, multisite operations including truncate and remove are made failure-atomic using write-ahead metadata logging on the file server. The log-structured approach also relies in part on a separate cleaner process to reclaim space.

Relative to these systems, this paper shows how to factor out recovery functions so that multisite recovery may be interposed in the context of a standard client/server file system protocol, without modifying the client or server.
3 Atomic Operations on Network Storage

A multisite operation begins when the μproxy intercepts an NFS V3 write, remove, truncate (setattr), or commit request from a client. To handle the request, the μproxy may redirect the request or generate additional request messages to nodes in the Slice ensemble, including storage nodes, the coordinator for the target file, and the NFS server. Figure 2 illustrates the message exchanges for the multisite operations discussed in this section.

When the operation is complete at all sites, the μproxy passes through an NFS V3 response to the client. If any participant fails during this sequence — the μproxy, a storage node, the coordinator, or the file server — a recovery protocol is initiated. The recovery protocol is specific to the particular operation in progress, and it may either complete the operation (roll forward) or abort it (roll back). If the system aborts the operation or delays the response, a standard NFS client may reinitiate the operation by retransmitting the request after a timeout, unless the client itself has failed.

The basic protocol is as follows. At the start of the operation, the μproxy sends to the coordinator an intention to perform the operation (e.g., Figure 2, messages … and …). The coordinator logs the intention to stable disk storage and responds, authorizing the μproxy to carry out the operation.
Figure 2. Message exchanges for multisite Slice/NFS operations. Dotted line message exchanges are avoided in common cases. Square endpoints represent synchronous storage writes.

When the operation is complete, the μproxy notifies the coordinator with a completion message, asynchronously clearing the intention (e.g., messages … and …). If the coordinator does not receive the completion within a specified period, it probes one or more participants to determine if the operation completed, and initiates recovery if necessary. A failed coordinator recovers by scanning its intentions log, completing or aborting operations in progress at the time of the failure.

This is a variant of the standard two-phase commit protocol [10] adapted to a file system context with idempotent operations. The details for each operation vary significantly. In particular, each operation allows optimizations to avoid most messaging and logging delays in common cases, as described below. Slice further improves performance by avoiding multisite operations for small files stored entirely on the file server, i.e., files that have never received writes beyond the configurable threshold offset. In this way, the system amortizes the costs of the protocol across a larger number of bytes and operations, since it incurs these costs only to create and truncate/remove large files, and to commit groups of writes to large files.

The following subsections describe the protocol as it applies to each type of multisite operation. We then set the protocol in context with conventional two-phase commit.
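The following sketch illustrates the intention/completion cycle and the two recovery paths (timeout probing and post-crash log replay) in hypothetical Python. It is a simplified model of the protocol just described, not the Slice coordinator: the log is an in-memory list standing in for stable disk storage, the probe callback abstracts the operation-specific checks, and the names and timeout value are assumptions.

```python
# Minimal sketch of the coordinator's intention log and recovery paths (Section 3).
import time
import uuid

class IntentionLog:
    """Append-only record list; a real coordinator forces each record to disk."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)          # stand-in for a synchronous disk write

class Coordinator:
    def __init__(self, timeout=30.0):
        self.log = IntentionLog()
        self.pending = {}                    # intent_id -> (op, deadline)
        self.timeout = timeout

    def register_intention(self, op):
        """Phase 1: log the intention and authorize the proxy to proceed."""
        intent_id = str(uuid.uuid4())
        self.log.append((intent_id, op, "INTENT"))
        self.pending[intent_id] = (op, time.time() + self.timeout)
        return intent_id                     # returned to the proxy as authorization

    def complete(self, intent_id):
        """Phase 2: the proxy's asynchronous completion clears the intention."""
        if intent_id in self.pending:
            self.log.append((intent_id, self.pending.pop(intent_id)[0], "DONE"))

    def check_expired(self, probe):
        """Probe participants for intentions whose completion never arrived, then
        roll each operation forward or back based on what the probe reports."""
        now = time.time()
        for intent_id, (op, deadline) in list(self.pending.items()):
            if now >= deadline:
                committed = probe(op)        # e.g., a getattr at the NFS server
                status = "ROLLED-FORWARD" if committed else "ABORTED"
                self.log.append((intent_id, op, status))
                del self.pending[intent_id]

    def recover(self):
        """After a coordinator crash: rescan the log and rebuild the pending set."""
        open_intents = {}
        for intent_id, op, status in self.log.records:
            if status == "INTENT":
                open_intents[intent_id] = (op, time.time() + self.timeout)
            else:
                open_intents.pop(intent_id, None)
        self.pending = open_intents
```

A μproxy-like caller would register an intention before acting on a multisite operation and send the completion asynchronously afterward; check_expired would run periodically at the coordinator.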
3.1 Write Commitment

An NFS V3 commit operation stabilizes pending or unstable writes on a given file. The NFS V3 protocol allows a server failure to legally discard any subset of the uncommitted writes and associated metadata, provided that the client can detect any loss by comparing verifier values returned by the file service in its responses to write and commit operations. NFS V3 clients buffer uncommitted writes locally so that they may re-execute these writes after a server failure. Clients may safely discard their buffered writes after a successful commit. Note that the verifier value returned by write and commit is not itself significant; the service guarantees only that the verifier changes after a failure.

To handle a commit on a file that has unstable writes in the striping zone, the μproxy executes a message exchange with each storage node that owns uncommitted writes on the file (Figure 2, message …). The μproxy also completes the writes, which may involve an exchange with the coordinator (map service) and/or the NFS server. The μproxy pushes any updates to the file's map back to the coordinator (message …). If the write enlarged the file, it pushes the new file size to the NFS server via a setattr (message …). When all operations have completed successfully, the μproxy responds to the client with a valid verifier.

The μproxy detects any failures by comparing response verifiers against a stored copy of the previous verifier returned by each participant. If any participant fails, the μproxy reports the failure by changing the response verifier to the client.
If the μproxy itself loses its state, it may report failure for a commit that has successfully completed at all sites. This forces the client to reinitiate writes unnecessarily, but is otherwise harmless.

Intention logging is unnecessary for write and commit on unmirrored files. This is because the file service remains in a legal state throughout the write sequence and commit. The exact ordering of operations is not strictly important; the commit is complete only when the client has discarded its buffered writes after receiving a valid response. If a failure occurs, the client itself is responsible for restarting the write sequence after receiving a negative response or no response to its commit request.
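As a concrete illustration of this commit path, the hedged sketch below models the μproxy-side steps for an unmirrored striped file in Python: stabilize writes at each dirty storage node, push map updates and the new file size, and fall back to changing the client-visible verifier if any participant's verifier changed. The participant objects, method names, and verifier bookkeeping are assumptions for illustration, not the prototype's interfaces.

```python
# Sketch of the proxy-side commit path for an unmirrored striped file (Section 3.1).
# Participant stubs (storage nodes, coordinator, NFS server) are assumed interfaces.

def handle_commit(fh, dirty_nodes, map_updates, new_size,
                  storage, coordinator, nfs_server, saved_verifiers):
    """Stabilize unstable writes in the striping zone, then return a verifier for
    the client: unchanged on success, changed if any participant may have failed
    (forcing the NFS V3 client to retransmit its buffered writes)."""
    failed = False

    # 1. Ask each storage node holding uncommitted writes on this file to commit
    #    them, comparing its verifier against the last one we saw from that node.
    for node in dirty_nodes:
        verifier = storage[node].commit(fh)
        if node in saved_verifiers and verifier != saved_verifiers[node]:
            failed = True                      # node restarted and may have lost writes
        saved_verifiers[node] = verifier

    # 2. Push modified block-map fragments back to the coordinator (map service).
    if map_updates:
        coordinator.writeback_map(fh, map_updates)

    # 3. If the writes enlarged the file, push the new logical size to the NFS server.
    if new_size is not None:
        nfs_server.setattr(fh, size=new_size)

    # 4. Return the client-visible verifier, changing it to signal a failure.
    client_verifier = saved_verifiers.setdefault("client", 0)
    if failed:
        client_verifier += 1
        saved_verifiers["client"] = client_verifier
    return client_verifier
```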
3.2 Mirrored Writes

Writes to a mirrored file are replicated using a read-any-write-all model. Without loss of generality we assume that the replication degree is two. A replication degree of two guarantees that a file is available unless two or more storage nodes fail concurrently, or the file's coordinator fails together with one storage node and a client who was actively writing the file.

Block maps for a mirrored file have dual entries for each logical block, with one entry for each block replica. The μproxy writes each block to a pair of storage nodes selected according to some placement policy, which is not important for the purposes of this paper. A mirrored write is considered complete only after it has committed; i.e., both storage nodes confirm that the block is stable, and (if applicable) the file's coordinator (map service) confirms that the covering map fragment is stable.

Mirrored writes use the intention protocol to reconcile replicas in the event of a failure. If a participant fails while there are incomplete mirrored writes, then it is possible that the write executed at one replica but not the other. In practice, this does not occur unless a client fails concurrently with one or more server failures, since an NFS V3 client retransmits all uncommitted writes after a server failure, as described in Section 3.1.

The mirrored write protocol piggybacks intention messages for mirrored writes on the μproxy's request for the map fragment covering the write. Before returning the requested map fragment, the coordinator logs the intention record and updates a conservative in-memory active region list of offset ranges or map fragments that might be held by each μproxy, and that may have incomplete writes. These intentions are cleared implicitly by a commit request covering the region; commit causes the μproxy to discard all covered map fragments for a mirrored file.

If a client (or its μproxy) fails, any uncommitted mirrored writes are guaranteed to be covered by the coordinator's active region list. The coordinator can reconcile the replicas for these regions by traversing the region list; any conflict within the active regions may be resolved by selecting one replica to dominate. In principle, the system can serve one copy of the file concurrently with reconciliation, even if a storage node fails. If the coordinator fails, it recovers a conservative approximation of its active region list from its intentions log.

In practice, most intention logging activity for mirrored writes may be optimized away. Slice logs these intentions only when a mirrored file first comes into active write use, e.g., when a μproxy first requests map fragments with intent to write. If a file falls out of write use (no map fragment requests received since the last commit completion), the coordinator marks the file as inactive by logging a write-complete entry. This protocol adds a synchronous log write to the write-open path for mirrored files, but this cost is amortized over all writes on the file. It allows a recovering coordinator to identify a superset of the mirrored files that may need reconciliation after a multiple failure.

One drawback of the protocol is that a buggy or malicious client might cause the active region list to grow without bound by issuing large numbers of writes and never committing them. This is not a problem with clients that correctly buffer their uncommitted writes, since the number of writes is limited by available memory; in any case, standard clients commit writes at regular intervals under the control of a system update daemon. For malicious clients, the system may avoid this problem by weakening replica consistency guarantees for mirrored files with writes left uncommitted for unreasonably long periods.
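To illustrate the bookkeeping described above, here is a minimal Python model of the coordinator's active region list with the write-active/write-complete optimization. The class and method names, the log representation, and the timing of the write-complete entry (logged at commit rather than after an idle period) are simplifying assumptions, not the prototype's behavior.

```python
# Sketch of the coordinator's in-memory active region list for mirrored writes
# (Section 3.2); a simplified model, not the prototype's data structure.

class ActiveRegions:
    def __init__(self):
        # file handle -> list of (start_offset, end_offset) ranges that may have
        # mirrored writes outstanding at some proxy
        self.regions = {}
        self.intent_logged = set()   # files currently marked write-active in the log

    def map_request(self, fh, start, end, log):
        """Called when a proxy requests a map fragment with intent to write.
        Log an intention only on the first write-activation of the file; after
        that, just extend the conservative in-memory active region list."""
        if fh not in self.intent_logged:
            log.append((fh, "WRITE-ACTIVE"))      # one synchronous log write per write-open
            self.intent_logged.add(fh)
        self.regions.setdefault(fh, []).append((start, end))

    def commit_covering(self, fh, log):
        """A covering commit clears the intentions implicitly: the proxy discards
        its map fragments and the coordinator drops the active regions. Here the
        write-complete entry is logged immediately, a simplification of the
        fall-out-of-write-use rule described in the text."""
        self.regions.pop(fh, None)
        if fh in self.intent_logged:
            log.append((fh, "WRITE-COMPLETE"))    # marks the file inactive for recovery
            self.intent_logged.discard(fh)

    def needs_reconciliation(self, fh):
        """After a client failure, replicas need reconciling only within the active
        regions recorded for the file (pick one replica to dominate)."""
        return self.regions.get(fh, [])
```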
3.3 Truncate and Remove

The protocol for truncate and remove relies on the NFS server to maintain an authoritative record of the file length and link count. The μproxy first consults a set of attributes for the target file (Figure 2, message …); the attributes must be current up to the "three second window" defined by NFS implementations (see Section 3.4). If the target file's logical size shows that it has data in the striping zone, the μproxy issues an intention to the coordinator (message …) before issuing the NFS operation to the file server (message …). Once the operation has committed at the NFS server, the protocol contacts the storage nodes and coordinator (map service) to release storage (message …), then registers a completion with the coordinator (message …). In our current prototype the μproxy executes the entire protocol, but it could be done directly by the coordinator, simplifying the μproxy and saving one message exchange (the intention response and the completion).

If the intention expires, the coordinator probes the NFS server (using a getattr) to determine the status of the operation. If the operation completed on the NFS server, the
coordinator rolls the operation forward by contacting the storage nodes to release any orphaned storage.
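The sequence just described can be summarized in a short sketch. The following hypothetical Python walks through the remove case with the participants stubbed out; the function and method names, the attribute check, and the (absent) error handling are illustrative assumptions, and the real μproxy performs these steps on intercepted NFS packets rather than through such an interface.

```python
# Sketch of the multisite remove sequence from Section 3.3 (truncate is analogous).
# The nfs_server, coordinator, and storage node objects are assumed stub interfaces.

THRESHOLD = 64 * 1024   # striping-zone threshold offset used in the prototype's tests

def remove_file(path, fh, nfs_server, coordinator, storage_nodes):
    # Consult attributes; NFS guarantees these are at most ~3 seconds stale.
    attrs = nfs_server.getattr(fh)
    if attrs["size"] <= THRESHOLD:
        # Small file: single-site operation, no intention or storage release needed.
        return nfs_server.remove(path)

    # 1. Log an intention at the file's coordinator before touching the NFS server.
    intent = coordinator.register_intention(("remove", fh))

    # 2. The NFS server acts as the primary commit site: once the remove succeeds
    #    there, recovery must roll forward by releasing the file's network storage.
    reply = nfs_server.remove(path)

    # 3. Release the striped blocks on the storage nodes and drop the block map.
    for node in storage_nodes:
        node.release(fh)
    coordinator.release_map(fh)

    # 4. Register completion so the coordinator can clear the intention.
    coordinator.complete(intent)
    return reply
```

If the μproxy fails between steps 2 and 4, the coordinator's intention eventually expires; it then probes the NFS server with a getattr and, if the remove committed, performs steps 3 and 4 itself.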
3.4 Enlarging Writes

The truncate/remove protocol in Section 3.3 must avoid a race with an enlarging write, a special case of extending write that extends a "small" file beyond the threshold offset and into the striping zone, making it a "large" file. The danger is that another client will complete an enlarging write after the μproxy consults the file's logical size, recognizing it as a small file, and before the μproxy issues the truncate/remove operation to the NFS server. If this occurs, the μproxy could fail to notify the coordinator of the need to release network storage allocated to the newly enlarged file, leaving it orphaned by the truncate/remove.

One way to prevent the race is to conservatively notify the coordinator of all truncate/remove operations, even for small files. However, this imposes an extra message latency and perhaps a disk fault on truncates and removes of small files. We prefer instead to shift the costs to the enlarging write operation, increasing the creation cost of large files. The enlarging write cost is incurred once for each large file, and is amortized over all I/O operations on the file.

Our approach uses a variant of the basic intention protocol to detect the race when it occurs, and to release any orphaned storage. The trick is for the coordinator to detect that a μproxy has executed a truncate/remove operation based on attributes that were fetched before the completion of an enlarging write. After an enlarging write has completed, the file's coordinator contacts the NFS server to validate the file's existence and logical size. The coordinator delays this validation until a fixed waiting period has elapsed. The waiting period is chosen to exceed the time bound on the staleness of cached attributes in NFS (the three second rule) with ample slack time to account for clock skew and operation latencies.
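A minimal sketch of that validation step, assuming a threaded Python coordinator: after an enlarging write completes, the coordinator waits out the attribute-staleness bound plus slack and then rechecks the file at the NFS server, releasing the newly allocated storage if a racing truncate/remove won. The timer-based scheduling, the slack value, and all object interfaces are illustrative assumptions.

```python
# Sketch of the coordinator-side check that closes the enlarging-write race (Section 3.4).
import threading

ATTR_STALENESS_BOUND = 3.0   # NFS "three second window" for cached attributes
SLACK = 2.0                  # assumed extra margin for clock skew and operation latency
THRESHOLD = 64 * 1024        # striping-zone threshold offset

def schedule_enlarge_validation(fh, coordinator, nfs_server, storage_nodes):
    """Called by the coordinator after an enlarging write completes."""
    def validate():
        attrs = nfs_server.getattr(fh)
        if attrs is None or attrs["size"] <= THRESHOLD:
            # A truncate/remove based on stale attributes slipped in: roll forward
            # by releasing the storage that the enlarging write allocated.
            for node in storage_nodes:
                node.release(fh)
            coordinator.release_map(fh)

    timer = threading.Timer(ATTR_STALENESS_BOUND + SLACK, validate)
    timer.daemon = True
    timer.start()
    return timer
```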
3.5 Comparison to Two-Phase Commit

The basic intention logging protocol used in Slice is similar to conventional two-phase commit [10], but there are several key differences. These are brought about by the simple nature of the file system operations, which tends to make the protocol more efficient than a general two-phase commit in the common cases.

- For simplicity, the μproxy assumes most of the functions of the traditional commit coordinator: it transmits requests to participants and gathers commit votes. However, it never actually performs a commit since it has no stable storage.

- Participants execute their portion of the operation in a fixed partial order, with one participant acting as the primary commit site. The purpose of the intention protocol is to detect and recover from failures that interrupt the sequence before the primary commit site executes its part of the operation. For example, the NFS server itself unwittingly acts as the primary commit site for removes, truncates, and extending writes (or extending write commits). For truncate and remove, a failure after the NFS server commits causes the recovery protocol to roll forward by releasing orphaned storage, similar to a conventional journaling file system or a file system scavenger (fsck).

- There is no need to notify participants other than the coordinator that the operation committed. The precommit is sufficient to stabilize the data, and the participants do not hold locks on the committed state. File operations are serialized (when necessary) at the NFS server (for name space operations) or at the coordinator (for reads and writes of shared files).
4 Prototype and Experimental Results

We have implemented the Slice prototype as a set of loadable kernel modules for the FreeBSD 4.0 operating system. The network storage nodes in our prototype are FreeBSD PCs serving blocks from local disks using UFS/FFS as a storage manager, with an external hash to map opaque NFS file handles to local files. The coordinator is implemented as an extension to the storage node module, consisting of a total of about 1400 lines of code. In our prototype, the μproxy is an IP filter between the IP stack and the network driver. The μproxy may rewrite or consume packets, and it may also generate new IP packets. The μproxy is a non-blocking state machine consisting of about 2500 lines of code. An overarching goal is to keep the μproxy simple, small, and fast.

This section presents experimental results from interposed file striping as implemented in the Slice prototype. The intent is to show the costs of the interposed μproxy architecture, and the effect of these costs on delivered file access bandwidths. The prototype μproxy, coordinator, and storage service implement mechanisms needed for recovery during normal operation, including the coordinator intentions log. Thus they reflect the costs of recovery as described in Section 3. However, reconciliation of active regions for mirrored replicas is not implemented.

In these experiments, clients are 450 MHz Pentium-III PCs using the Asus P2B motherboard with a 32-bit, 33 MHz PCI bus and an Intel 440BX chipset. The NFS server and Slice storage nodes are Dell 4400 systems, each with one 733 MHz Pentium-III Xeon using a ServerWorks chipset. The
server network adapter and disk controllers are on independent peer 64-bit, 66 MHz PCI busses. Each has four 18 GB Seagate Ultra-2 Cheetah disks. All machines are equipped with Myricom LANai 4 or 7 adapters, with kernels built from the same FreeBSD 4.0 source pool.

All network communication in these experiments uses Trapeze, a Myrinet messaging system optimized for network I/O traffic [7]. In this configuration, Trapeze/Myrinet provides 130 MB/s of point-to-point bandwidth with a 32 KB transfer size. NFS traffic uses UDP/IP with a 32 KB MTU, routed through a Trapeze device driver.

Figure 3. File read, write, and remove timings using a 64 KB threshold offset.

Figure 3 shows the total time to read, write, and remove a file, varying file size from 8 KB to 232 KB, with the striping zone threshold set to 64 KB. All tests start with cold client and storage node caches.
… blocks on the NFS server, bounding deletion time.

When file size exceeds the striping zone threshold, latencies jump as operations begin to involve multiple sites and incur costs of the intention logging protocol. For example, read and write costs increase as the μproxy faults block maps from the coordinator before issuing I/O beyond the threshold. Writes and removes register an intent with the coordinator before performing the first extending write into the striping zone or before issuing the remove to the NFS server, respectively. The resulting discontinuities are clearly shown in the graph; however, the cost becomes progressively less significant as file sizes grow.

Both read and write times increase linearly with file size, and remove time remains constant. The prototype serializes some sub-operations of commit for simplicity, compromising write latency slightly. At these sizes, mirroring has a negligible effect on both read and write times.

Figure 4. Single-client and saturation bandwidth for sequential read and write.
The architecture allows very high bandwidth for large files. Figure 4 shows I/O bandwidth delivered to a single client and a group of clients, varying the number of storage nodes. Bandwidths are measured using dd to read or write a 1.25 GB file in 32 KB chunks, with a Slice striping grain of 32 KB. Each graph gives both non-redundant and mirrored storage results.

The left-hand graph shows the measured I/O bandwidth delivered to a single client with a LANai-7 adapter. We modified the FreeBSD NFS client for zero-copy reads; however, a copy remains in the write path. Single-client read bandwidth scales with the number of storage nodes until the client CPU saturates at 110 MB/s. The copy in the write path saturates a client writing at 53 MB/s.

Mirrored read bandwidth is roughly half that of non-mirrored, due to an artifact of the striping policy and our use of UFS/FFS as the block storage manager in the prototype. UFS/FFS aggressively prefetches from local disk into local memory when it detects sequential or near-sequential accesses. In this case, this policy consumes storage bandwidth to load data that the client chooses to read from another node. With a replication degree of two, clients read …
