Failure-Atomic File Access in an Interposed Network Storage System
Darrell Anderson and Jeff Chase
Department of Computer Science
Duke University
{anderson, chase}@cs.duke.edu

This work is supported by the National Science Foundation (CCR-96-24857, EIA-9870724, and EIA-9972879) and by equipment grants from Intel Corporation and Myricom.

Abstract

This paper presents a recovery protocol for block I/O operations in Slice, a storage system architecture for high-speed LANs incorporating network-attached block storage. The goal of the Slice architecture is to provide a network file service with scalable bandwidth and capacity while preserving compatibility with off-the-shelf clients and file server appliances. The Slice prototype "virtualizes" the Network File System (NFS) protocol by interposing a request switching filter at the client's interface to the network storage system (e.g., in a network adapter or switch). The distributed Slice architecture separates functions typically combined in central file servers, introducing new challenges for failure atomicity. This paper presents a protocol for atomic file operations and recovery in the Slice architecture, and related support for reliable file storage using mirrored striping. Experimental results from the Slice prototype show that the protocol has low cost in the common case, allowing the system to deliver client file access bandwidths approaching gigabit-per-second network speeds.
1 Introduction

Faster I/O interconnect standards and the arrival of Gigabit Ethernet greatly expand the capacity of inexpensive commodity computers to handle large amounts of data for scalable computing, network services, multimedia and visualization. These advances and the growing demand for storage increase the need for network storage systems that are incrementally scalable, reliable, and easy to administer, while serving the needs of diverse workloads running on a variety of client platforms.

Commercial systems increasingly provide scalable shared storage by interconnecting storage devices and servers with dedicated Storage Area Networks (SANs), e.g., FibreChannel. Yet recent order-of-magnitude improvements in LAN performance have narrowed the bandwidth gap between SANs and LANs. This creates an opportunity to deliver competitive storage solutions by aggregating low-cost storage nodes and servers, using a general-purpose LAN as the storage backplane. In such a system it is possible to incrementally scale either capacity or bandwidth of the shared storage resource by attaching additional storage to the network.
A variety of commercial products and research proposals pursue this vision by layering device protocols (e.g., SCSI) over IP networks, building cluster file systems that manage distributed block storage as a shared disk volume, or installing large server appliances to export SAN storage to a LAN using network file system protocols. Section 2.1 surveys some of these systems.

This paper deals with a network storage architecture — called Slice — that takes an alternative approach. Slice places a request switching filter at the client's interface to the network storage system; the role of the filter is to "wrap" a standard IP-based client/server file system protocol, extending it to incorporate an incrementally expandable array of network-attached block storage nodes. The Slice prototype implements the architecture by virtualizing the Network File System version 3 protocol (NFS V3). The request switching filter intercepts and rewrites a subset of the NFS V3 packet stream, directing I/O requests to the network storage array and associated servers that make up a Slice ensemble appearing to the client as a unified NFS volume. The system is compatible with off-the-shelf NFS clients and servers, in order to leverage the large installed base of NFS clients and the high-quality NFS server appliances now on the market.

The Slice architecture assumes a block storage model loosely based on a proposal in the National Storage Industry Consortium (NSIC) for object-based storage devices (OBSD) [2]. Key elements of the OBSD proposal were in turn inspired by research on Network Attached Secure Disks (NASD) [8, 9]. Storage nodes are "object-based" rather than sector-based, meaning that requesters address data on each storage node as logical offsets within storage objects.
A storage object is an ordered sequence of bytes with a unique identifier. The NASD work and the OBSD proposal allow for cryptographic protection of object identifiers if the network is insecure [8].

The Slice architecture separates functions that are combined in central file servers. The contribution of this paper is to present a simple solution to the coordination and recovery issues raised by this structure. Our approach introduces a coordinator responsible for preserving atomicity of key NFS operations, including file truncate/remove, extending writes, and write commitment. The coordinators use a simple intention logging protocol, with variants for each operation type that minimize the common-case costs. We also show how the protocol supports failure-atomic write commitment for mirrored files in the Slice prototype. Mirroring consumes more storage and network bandwidth than striping with RAID redundancy, but it is simple and reliable, avoids the overhead of computing and updating parity, and allows load-balanced reads [4, 12].

This paper is structured as follows. Section 2 summarizes the Slice architecture. Section 3 describes mechanisms for operation atomicity and failure handling. Section 4 presents experimental results from the Slice prototype on a Myrinet network, showing that the Slice architecture and recovery protocols achieve file access performance approaching gigabit-per-second network speeds, limited primarily by the client NFS implementation. Section 5 concludes.

2 Overview

Figure 1. The Slice distributed storage architecture.

Figure 1 depicts the Slice architecture with NFS clients and servers. The architecture interposes a "microproxy" (μproxy) between the client IP stack and the Slice server ensemble.
The μproxy examines NFS requests and responses, redirecting requests and transforming responses as necessary to represent the distributed storage service as a unified NFS service to its client. For some operations, the μproxy must generate new requests and pair responses with requests. The μproxy may reside within the client itself, or in a network element along the communication path between the client and the servers. In our current prototype the μproxy is implemented as a packet filter installed on the client below the NFS/UDP/IP stack.

The μproxy is a simple state machine with minimal buffering requirements. It uses only soft state; the μproxy may fail without compromising correctness. The μproxy may reside outside of the trust boundary, although it may damage the contents of specific files by misusing the authority of users whose requests are routed through it. In this paper we limit our focus to aspects of the μproxy internals and policies that are directly related to operation atomicity and the recovery protocol.

The coordinator plays an important role in managing global recovery of operations involving multiple sites. A Slice configuration may contain any number of coordinators, with each coordinator managing operations for some subset of files. The functions of the coordinator may be combined with the file server, but we consider them separately to emphasize that the architecture is compatible with standard file servers.

Our implementation combines the coordinator with a map service responsible for tracking file block location. The coordinator servers maintain a global block map for each file giving the storage site for each block. The μproxies read, cache, modify, and write back fragments of the global maps as they execute read and write operations on files. The global maps allow flexible per-file policies for block placement and striping in the network storage array; although the system may use deterministic block placement functions as an alternative to the global maps, this paper includes a discussion of the maps to show how the recovery protocol incorporates them.

The μproxy intercepts read and write operations targeted at file regions beyond a configurable threshold offset. Logical file offsets beyond the threshold are referred to as the striping zone; the μproxy redirects all reads and writes covering offsets in the striping zone to an array of block storage nodes according to system striping policies and the block maps maintained by the coordinators. The policies and protocols include support for mirrored striping ("RAID-10") for redundancy to protect against storage node failures, as described in Section 3.2. The Slice storage nodes export object-based block storage to the network; our prototype storage nodes accept NFS read and write operations on a flat space of storage objects uniquely identified by NFS file handles.
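To make the request-switching rule concrete, the sketch below (hypothetical Python, not the in-kernel packet filter) shows how an interposed filter might classify an NFS read or write against the striping-zone threshold and look up target storage nodes in a cached block map fragment. The names, data structures, and the fallback on a map miss are illustrative assumptions rather than the Slice implementation; the threshold and striping grain values follow the prototype configuration reported in Section 4.

```python
# Illustrative sketch of the request-switching decision; not the Slice kernel code.
from dataclasses import dataclass, field

THRESHOLD = 64 * 1024      # striping-zone threshold offset (64 KB in the prototype's tests)
STRIPE_GRAIN = 32 * 1024   # striping grain (32 KB in the reported experiments)

@dataclass
class MapFragment:
    # logical block number -> list of storage node ids (two entries when mirrored)
    entries: dict = field(default_factory=dict)

def route_request(offset, length, fragment, nfs_server="nfs0"):
    """Return a list of (targets, block) pairs describing where the filter would
    forward an NFS read/write covering [offset, offset + length)."""
    if offset + length <= THRESHOLD:
        # Entirely below the threshold: pass through to the central NFS server.
        return [(nfs_server, None)]
    targets = []
    first = max(offset, THRESHOLD) // STRIPE_GRAIN
    last = (offset + length - 1) // STRIPE_GRAIN
    for block in range(first, last + 1):
        replicas = fragment.entries.get(block)
        if replicas is None:
            # Map miss: the proxy would first fetch this fragment from the coordinator.
            replicas = ["<fetch map fragment from coordinator>"]
        targets.append((replicas, block))
    return targets

if __name__ == "__main__":
    frag = MapFragment(entries={2: ["node0", "node1"], 3: ["node2", "node3"]})
    print(route_request(16 * 1024, 8 * 1024, frag))    # small-file region -> NFS server
    print(route_request(80 * 1024, 48 * 1024, frag))   # striping zone -> storage nodes
```

A request that straddles the threshold would presumably be split between the file server and the storage nodes; for simplicity the sketch only routes the blocks above the threshold.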
Although NFS file handles provide only a weak form of protection in our prototype, the architecture is compatible with proposals for cryptographic protection of storage object identifiers for insecure networks [8].

The μproxy identifies read and write operations in the striping zone by examining the request offset and length. Small files are not striped; these are files whose logical size is below the threshold offset, i.e., that have never received a write in the striping zone. Note that even large files are not striped in their entirety; data written below the threshold offset of a large file is stored along with the small files. File regions outside the striping zone do not benefit from striping, but the performance cost becomes progressively less significant as file sizes grow.

In addition to the interactions required for I/O requests, the μproxies cooperate with the network storage nodes and the file's coordinator to allocate global maps for extending write operations, and to release storage on remove and truncate operations. These multisite operations introduce recovery issues described in the next section. All other file operations pass through the μproxy to the NFS server as they did before, and incur no additional overhead for managing distributed storage.

This architecture scales to higher bandwidth and capacity by adding storage nodes, since the NFS server is outside the critical path of reads and writes handled by the block storage nodes. It is also possible to scale or replicate other file service functions within the context of the Slice request switching architecture. For simplicity this paper assumes that a single standard NFS file server manages the entire volume name space.

The goal of the mechanisms described in this paper is to deliver consistency and failure properties that are no weaker than commercial NFS implementations. While the basic approach is quite similar to the write-ahead logging that might be employed on a journaling central file server with distributed disks, we extend it to support multisite operations without the awareness of the client, the NFS file server, or the storage nodes. Our approach to committing writes assumes use of the NFS V3 asynchronous write and write commitment protocol, as described below. This paper does not address the issue of concurrent write sharing of files, and Slice as defined may provide weaker concurrent write sharing guarantees than some NFS implementations. However, the architecture is compatible with NFS file leasing extensions for consistent concurrent write sharing, as defined in NQ-NFS [13] and early IETF draft proposals for the NFS V4 protocol.
2.1 Related Work

The Cambridge Universal File Server [5] proposed structuring a distributed file system as a separate name service and file block storage service. One system to take this approach was Swift [6]. Slice is similar to Swift in that each client reads or writes data directly to block storage sites on the network, choreographed by a client distribution agent using maps provided by a third-party storage mediator. Another system derived from the Swift architecture is Cheops, a striping file system for CMU NASD storage systems [9, 8]. The Swift and Cheops work did not directly address atomicity or recovery issues.

Amiri et al. [1] show how to preserve read and write atomicity in a shared storage array using RAID striping with parity. This work focuses primarily on safe concurrent accesses to a fixed space of blocks. It does not address file system consistency in the presence of host failures.

A number of scalable file systems separate some striping functions from other file system code by building the file system above a striped network storage volume using a shared disk model. This approach has been used with both log-structured [11, 3] and conventional [15, 14] file system structures. In these systems, multisite operations including truncate and remove are made failure-atomic using write-ahead metadata logging on the file server. The log-structured approach also relies in part on a separate cleaner process to reclaim space.

Relative to these systems, this paper shows how to factor out recovery functions so that multisite recovery may be interposed in the context of a standard client/server file system protocol, without modifying the client or server.
3 Atomic Operations on Network Storage

A multisite operation begins when the μproxy intercepts an NFS V3 write, remove, truncate (setattr), or commit request from a client. To handle the request, the μproxy may redirect the request or generate additional request messages to nodes in the Slice ensemble, including storage nodes, the coordinator for the target file, and the NFS server. Figure 2 illustrates the message exchanges for the multisite operations discussed in this section.

When the operation is complete at all sites, the μproxy passes through an NFS V3 response to the client. If any participant fails during this sequence — the μproxy, a storage node, the coordinator, or the file server — a recovery protocol is initiated. The recovery protocol is specific to the particular operation in progress, and it may either complete the operation (roll forward) or abort it (roll back). If the system aborts the operation or delays the response, a standard NFS client may reinitiate the operation by retransmitting the request after a timeout, unless the client itself has failed.

The basic protocol is as follows. At the start of the operation, the μproxy sends to the coordinator an intention to perform the operation (e.g., Figure 2, messages … and …). The coordinator logs the intention to stable disk storage and responds, authorizing the μproxy to carry out the operation.
Figure 2. Message exchanges for multisite Slice/NFS operations. Dotted line message exchanges are avoided in common cases. Square endpoints represent synchronous storage writes.

When the operation is complete, the μproxy notifies the coordinator with a completion message, asynchronously clearing the intention (e.g., messages … and …). If the coordinator does not receive the completion within a specified period, it probes one or more participants to determine if the operation completed, and initiates recovery if necessary. A failed coordinator recovers by scanning its intentions log, completing or aborting operations in progress at the time of the failure.

This is a variant of the standard two-phase commit protocol [10] adapted to a file system context with idempotent operations. The details for each operation vary significantly. In particular, each operation allows optimizations to avoid most messaging and logging delays in common cases, as described below. Slice further improves performance by avoiding multisite operations for small files stored entirely on the file server, i.e., files that have never received writes beyond the configurable threshold offset. In this way, the system amortizes the costs of the protocol across a larger number of bytes and operations, since it incurs these costs only to create and truncate/remove large files, and to commit groups of writes to large files.

The following subsections describe the protocol as it applies to each type of multisite operation. We then set the protocol in context with conventional two-phase commit.
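The following sketch illustrates the intention/completion cycle and the two recovery paths (timeout probing and post-crash log replay) in hypothetical Python. It is a simplified model of the protocol just described, not the Slice coordinator: the log is an in-memory list standing in for stable disk storage, the probe callback abstracts the operation-specific checks, and the names and timeout value are assumptions.

```python
# Minimal sketch of the coordinator's intention log and recovery paths (Section 3).
import time
import uuid

class IntentionLog:
    """Append-only record list; a real coordinator forces each record to disk."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)          # stand-in for a synchronous disk write

class Coordinator:
    def __init__(self, timeout=30.0):
        self.log = IntentionLog()
        self.pending = {}                    # intent_id -> (op, deadline)
        self.timeout = timeout

    def register_intention(self, op):
        """Phase 1: log the intention and authorize the proxy to proceed."""
        intent_id = str(uuid.uuid4())
        self.log.append((intent_id, op, "INTENT"))
        self.pending[intent_id] = (op, time.time() + self.timeout)
        return intent_id                     # returned to the proxy as authorization

    def complete(self, intent_id):
        """Phase 2: the proxy's asynchronous completion clears the intention."""
        if intent_id in self.pending:
            self.log.append((intent_id, self.pending.pop(intent_id)[0], "DONE"))

    def check_expired(self, probe):
        """Probe participants for intentions whose completion never arrived, then
        roll each operation forward or back based on what the probe reports."""
        now = time.time()
        for intent_id, (op, deadline) in list(self.pending.items()):
            if now >= deadline:
                committed = probe(op)        # e.g., a getattr at the NFS server
                status = "ROLLED-FORWARD" if committed else "ABORTED"
                self.log.append((intent_id, op, status))
                del self.pending[intent_id]

    def recover(self):
        """After a coordinator crash: rescan the log and rebuild the pending set."""
        open_intents = {}
        for intent_id, op, status in self.log.records:
            if status == "INTENT":
                open_intents[intent_id] = (op, time.time() + self.timeout)
            else:
                open_intents.pop(intent_id, None)
        self.pending = open_intents
```

A μproxy-like caller would register an intention before acting on a multisite operation and send the completion asynchronously afterward; check_expired would run periodically at the coordinator.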
3.1 Write Commitment

An NFS V3 commit operation stabilizes pending or unstable writes on a given file. The NFS V3 protocol allows a server failure to legally discard any subset of the uncommitted writes and associated metadata, provided that the client can detect any loss by comparing verifier values returned by the file service in its responses to write and commit operations. NFS V3 clients buffer uncommitted writes locally so that they may re-execute these writes after a server failure. Clients may safely discard their buffered writes after a successful commit. Note that the verifier value returned by write and commit is not itself significant; the service guarantees only that the verifier changes after a failure.

To handle a commit on a file that has unstable writes in the striping zone, the μproxy executes a message exchange with each storage node that owns uncommitted writes on the file (Figure 2, message …). The μproxy also completes the writes, which may involve an exchange with the coordinator (map service) and/or the NFS server. The μproxy pushes any updates to the file's map back to the coordinator (message …). If the write enlarged the file, it pushes the new file size to the NFS server via a setattr (message …). When all operations have completed successfully, the μproxy responds to the client with a valid verifier.

The μproxy detects any failures by comparing response verifiers against a stored copy of the previous verifier returned by each participant. If any participant fails, the μproxy reports the failure by changing the response verifier to the client.
If the μproxy itself loses its state, it may report failure for a commit that has successfully completed at all sites. This forces the client to reinitiate writes unnecessarily, but is otherwise harmless.

Intention logging is unnecessary for write and commit on unmirrored files. This is because the file service remains in a legal state throughout the write sequence and commit. The exact ordering of operations is not strictly important; the commit is complete only when the client has discarded its buffered writes after receiving a valid response. If a failure occurs, the client itself is responsible for restarting the write sequence after receiving a negative response or no response to its commit request.
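As a concrete illustration of this commit path, the hedged sketch below models the μproxy-side steps for an unmirrored striped file in Python: stabilize writes at each dirty storage node, push map updates and the new file size, and fall back to changing the client-visible verifier if any participant's verifier changed. The participant objects, method names, and verifier bookkeeping are assumptions for illustration, not the prototype's interfaces.

```python
# Sketch of the proxy-side commit path for an unmirrored striped file (Section 3.1).
# Participant stubs (storage nodes, coordinator, NFS server) are assumed interfaces.

def handle_commit(fh, dirty_nodes, map_updates, new_size,
                  storage, coordinator, nfs_server, saved_verifiers):
    """Stabilize unstable writes in the striping zone, then return a verifier for
    the client: unchanged on success, changed if any participant may have failed
    (forcing the NFS V3 client to retransmit its buffered writes)."""
    failed = False

    # 1. Ask each storage node holding uncommitted writes on this file to commit
    #    them, comparing its verifier against the last one we saw from that node.
    for node in dirty_nodes:
        verifier = storage[node].commit(fh)
        if node in saved_verifiers and verifier != saved_verifiers[node]:
            failed = True                      # node restarted and may have lost writes
        saved_verifiers[node] = verifier

    # 2. Push modified block-map fragments back to the coordinator (map service).
    if map_updates:
        coordinator.writeback_map(fh, map_updates)

    # 3. If the writes enlarged the file, push the new logical size to the NFS server.
    if new_size is not None:
        nfs_server.setattr(fh, size=new_size)

    # 4. Return the client-visible verifier, changing it to signal a failure.
    client_verifier = saved_verifiers.setdefault("client", 0)
    if failed:
        client_verifier += 1
        saved_verifiers["client"] = client_verifier
    return client_verifier
```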
3.2 Mirrored Writes

Writes to a mirrored file are replicated using a read-any-write-all model. Without loss of generality we assume that the replication degree is two. A replication degree of two guarantees that a file is available unless two or more storage nodes fail concurrently, or the file's coordinator fails together with one storage node and a client who was actively writing the file.

Block maps for a mirrored file have dual entries for each logical block, with one entry for each block replica. The μproxy writes each block to a pair of storage nodes selected according to some placement policy, which is not important for the purposes of this paper. A mirrored write is considered complete only after it has committed; i.e., both storage nodes confirm that the block is stable, and (if applicable) the file's coordinator (map service) confirms that the covering map fragment is stable.

Mirrored writes use the intention protocol to reconcile replicas in the event of a failure. If a participant fails while there are incomplete mirrored writes, then it is possible that the write executed at one replica but not the other. In practice, this does not occur unless a client fails concurrently with one or more server failures, since an NFS V3 client retransmits all uncommitted writes after a server failure, as described in Section 3.1.

The mirrored write protocol piggybacks intention messages for mirrored writes on the μproxy's request for the map fragment covering the write. Before returning the requested map fragment, the coordinator logs the intention record and updates a conservative in-memory active region list of offset ranges or map fragments that might be held by each μproxy, and that may have incomplete writes. These intentions are cleared implicitly by a commit request covering the region; commit causes the μproxy to discard all covered map fragments for a mirrored file.

If a client (or its μproxy) fails, any uncommitted mirrored writes are guaranteed to be covered by the coordinator's active region list. The coordinator can reconcile the replicas for these regions by traversing the region list; any conflict within the active regions may be resolved by selecting one replica to dominate. In principle, the system can serve one copy of the file concurrently with reconciliation, even if a storage node fails. If the coordinator fails, it recovers a conservative approximation of its active region list from its intentions log.

In practice, most intention logging activity for mirrored writes may be optimized away. Slice logs these intentions only when a mirrored file first comes into active write use, e.g., when a μproxy first requests map fragments with intent to write. If a file falls out of write use (no map fragment requests received since the last commit completion), the coordinator marks the file as inactive by logging a write-complete entry. This protocol adds a synchronous log write to the write-open path for mirrored files, but this cost is amortized over all writes on the file. It allows a recovering coordinator to identify a superset of the mirrored files that may need reconciliation after a multiple failure.

One drawback of the protocol is that a buggy or malicious client might cause the active region list to grow without bound by issuing large numbers of writes and never committing them. This is not a problem with clients that correctly buffer their uncommitted writes, since the number of writes is limited by available memory; in any case, standard clients commit writes at regular intervals under the control of a system update daemon. For malicious clients, the system may avoid this problem by weakening replica consistency guarantees for mirrored files with writes left uncommitted for unreasonably long periods.
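To illustrate the bookkeeping described above, here is a minimal Python model of the coordinator's active region list with the write-active/write-complete optimization. The class and method names, the log representation, and the timing of the write-complete entry (logged at commit rather than after an idle period) are simplifying assumptions, not the prototype's behavior.

```python
# Sketch of the coordinator's in-memory active region list for mirrored writes
# (Section 3.2); a simplified model, not the prototype's data structure.

class ActiveRegions:
    def __init__(self):
        # file handle -> list of (start_offset, end_offset) ranges that may have
        # mirrored writes outstanding at some proxy
        self.regions = {}
        self.intent_logged = set()   # files currently marked write-active in the log

    def map_request(self, fh, start, end, log):
        """Called when a proxy requests a map fragment with intent to write.
        Log an intention only on the first write-activation of the file; after
        that, just extend the conservative in-memory active region list."""
        if fh not in self.intent_logged:
            log.append((fh, "WRITE-ACTIVE"))      # one synchronous log write per write-open
            self.intent_logged.add(fh)
        self.regions.setdefault(fh, []).append((start, end))

    def commit_covering(self, fh, log):
        """A covering commit clears the intentions implicitly: the proxy discards
        its map fragments and the coordinator drops the active regions. Here the
        write-complete entry is logged immediately, a simplification of the
        fall-out-of-write-use rule described in the text."""
        self.regions.pop(fh, None)
        if fh in self.intent_logged:
            log.append((fh, "WRITE-COMPLETE"))    # marks the file inactive for recovery
            self.intent_logged.discard(fh)

    def needs_reconciliation(self, fh):
        """After a client failure, replicas need reconciling only within the active
        regions recorded for the file (pick one replica to dominate)."""
        return self.regions.get(fh, [])
```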
3.3 Truncate and Remove

The protocol for truncate and remove relies on the NFS server to maintain an authoritative record of the file length and link count. The μproxy first consults a set of attributes for the target file (Figure 2, message …); the attributes must be current up to the "three second window" defined by NFS implementations (see Section 3.4). If the target file's logical size shows that it has data in the striping zone, the μproxy issues an intention to the coordinator (message …) before issuing the NFS operation to the file server (message …). Once the operation has committed at the NFS server, the protocol contacts the storage nodes and coordinator (map service) to release storage (message …), then registers a completion with the coordinator (message …). In our current prototype the μproxy executes the entire protocol, but it could be done directly by the coordinator, simplifying the μproxy and saving one message exchange (the intention response and the completion).

If the intention expires, the coordinator probes the NFS server (using a getattr) to determine the status of the operation. If the operation completed on the NFS server, the
coordinator rolls the operation forward by contacting the storage nodes to release any orphaned storage.
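The sequence just described can be summarized in a short sketch. The following hypothetical Python walks through the remove case with the participants stubbed out; the function and method names, the attribute check, and the (absent) error handling are illustrative assumptions, and the real μproxy performs these steps on intercepted NFS packets rather than through such an interface.

```python
# Sketch of the multisite remove sequence from Section 3.3 (truncate is analogous).
# The nfs_server, coordinator, and storage node objects are assumed stub interfaces.

THRESHOLD = 64 * 1024   # striping-zone threshold offset used in the prototype's tests

def remove_file(path, fh, nfs_server, coordinator, storage_nodes):
    # Consult attributes; NFS guarantees these are at most ~3 seconds stale.
    attrs = nfs_server.getattr(fh)
    if attrs["size"] <= THRESHOLD:
        # Small file: single-site operation, no intention or storage release needed.
        return nfs_server.remove(path)

    # 1. Log an intention at the file's coordinator before touching the NFS server.
    intent = coordinator.register_intention(("remove", fh))

    # 2. The NFS server acts as the primary commit site: once the remove succeeds
    #    there, recovery must roll forward by releasing the file's network storage.
    reply = nfs_server.remove(path)

    # 3. Release the striped blocks on the storage nodes and drop the block map.
    for node in storage_nodes:
        node.release(fh)
    coordinator.release_map(fh)

    # 4. Register completion so the coordinator can clear the intention.
    coordinator.complete(intent)
    return reply
```

If the μproxy fails between steps 2 and 4, the coordinator's intention eventually expires; it then probes the NFS server with a getattr and, if the remove committed, performs steps 3 and 4 itself.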
3.4 Enlarging Writes

The truncate/remove protocol in Section 3.3 must avoid a race with an enlarging write, a special case of extending write that extends a "small" file beyond the threshold offset and into the striping zone, making it a "large" file. The danger is that another client will complete an enlarging write after the μproxy consults the file's logical size, recognizing it as a small file, and before the μproxy issues the truncate/remove operation to the NFS server. If this occurs, the μproxy could fail to notify the coordinator of the need to release network storage allocated to the newly enlarged file, leaving it orphaned by the truncate/remove.

One way to prevent the race is to conservatively notify the coordinator of all truncate/remove operations, even for small files. However, this imposes an extra message latency and perhaps a disk fault on truncates and removes of small files. We prefer instead to shift the costs to the enlarging write operation, increasing the creation cost of large files. The enlarging write cost is incurred once for each large file, and is amortized over all I/O operations on the file.

Our approach uses a variant of the basic intention protocol to detect the race when it occurs, and to release any orphaned storage. The trick is for the coordinator to detect that a μproxy has executed a truncate/remove operation based on attributes that were fetched before the completion of an enlarging write. After an enlarging write has completed, the file's coordinator contacts the NFS server to validate the file's existence and logical size. The coordinator delays this validation until a fixed waiting period has elapsed. The waiting period is chosen to exceed the time bound on the staleness of cached attributes in NFS (the three second rule) with ample slack time to account for clock skew and operation latencies.
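A minimal sketch of that validation step, assuming a threaded Python coordinator: after an enlarging write completes, the coordinator waits out the attribute-staleness bound plus slack and then rechecks the file at the NFS server, releasing the newly allocated storage if a racing truncate/remove won. The timer-based scheduling, the slack value, and all object interfaces are illustrative assumptions.

```python
# Sketch of the coordinator-side check that closes the enlarging-write race (Section 3.4).
import threading

ATTR_STALENESS_BOUND = 3.0   # NFS "three second window" for cached attributes
SLACK = 2.0                  # assumed extra margin for clock skew and operation latency
THRESHOLD = 64 * 1024        # striping-zone threshold offset

def schedule_enlarge_validation(fh, coordinator, nfs_server, storage_nodes):
    """Called by the coordinator after an enlarging write completes."""
    def validate():
        attrs = nfs_server.getattr(fh)
        if attrs is None or attrs["size"] <= THRESHOLD:
            # A truncate/remove based on stale attributes slipped in: roll forward
            # by releasing the storage that the enlarging write allocated.
            for node in storage_nodes:
                node.release(fh)
            coordinator.release_map(fh)

    timer = threading.Timer(ATTR_STALENESS_BOUND + SLACK, validate)
    timer.daemon = True
    timer.start()
    return timer
```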
3.5 Comparison to Two-Phase Commit

The basic intention logging protocol used in Slice is similar to conventional two-phase commit [10], but there are several key differences. These are brought about by the simple nature of the file system operations, which tends to make the protocol more efficient than a general two-phase commit in the common cases.

- For simplicity, the μproxy assumes most of the functions of the traditional commit coordinator: it transmits requests to participants and gathers commit votes. However, it never actually performs a commit since it has no stable storage.

- Participants execute their portion of the operation in a fixed partial order, with one participant acting as the primary commit site. The purpose of the intention protocol is to detect and recover from failures that interrupt the sequence before the primary commit site executes its part of the operation. For example, the NFS server itself unwittingly acts as the primary commit site for removes, truncates, and extending writes (or extending write commits). For truncate and remove, a failure after the NFS server commits causes the recovery protocol to roll forward by releasing orphaned storage, similar to a conventional journaling file system or a file system scavenger (fsck).

- There is no need to notify participants other than the coordinator that the operation committed. The precommit is sufficient to stabilize the data, and the participants do not hold locks on the committed state. File operations are serialized (when necessary) at the NFS server (for name space operations) or at the coordinator (for reads and writes of shared files).
4 Prototype and Experimental Results

We have implemented the Slice prototype as a set of loadable kernel modules for the FreeBSD 4.0 operating system. The network storage nodes in our prototype are FreeBSD PCs serving blocks from local disks using UFS/FFS as a storage manager, with an external hash to map opaque NFS file handles to local files. The coordinator is implemented as an extension to the storage node module, consisting of a total of about 1400 lines of code. In our prototype, the μproxy is an IP filter between the IP stack and the network driver. The μproxy may rewrite or consume packets, and it may also generate new IP packets. The μproxy is a non-blocking state machine consisting of about 2500 lines of code. An overarching goal is to keep the μproxy simple, small, and fast.

This section presents experimental results from interposed file striping as implemented in the Slice prototype. The intent is to show the costs of the interposed μproxy architecture, and the effect of these costs on delivered file access bandwidths. The prototype μproxy, coordinator, and storage service implement mechanisms needed for recovery during normal operation, including the coordinator intentions log. Thus they reflect the costs of recovery as described in Section 3. However, reconciliation of active regions for mirrored replicas is not implemented.

In these experiments, clients are 450 MHz Pentium-III PCs using the Asus P2B motherboard with a 32-bit, 33 MHz PCI bus and an Intel 440BX chipset. The NFS server and Slice storage nodes are Dell 4400 systems, each with one 733 MHz Pentium-III Xeon using a ServerWorks chipset. The
server network adapter and disk controllers are on independent peer 64-bit, 66 MHz PCI busses. Each has four 18 GB Seagate Ultra-2 Cheetah disks. All machines are equipped with Myricom LANai 4 or 7 adapters, with kernels built from the same FreeBSD 4.0 source pool.

All network communication in these experiments uses Trapeze, a Myrinet messaging system optimized for network I/O traffic [7]. In this configuration, Trapeze/Myrinet provides 130 MB/s of point-to-point bandwidth with a 32 KB transfer size. NFS traffic uses UDP/IP with a 32 KB MTU, routed through a Trapeze device driver.

Figure 3. File read, write, and remove timings using a 64 KB threshold offset.

Figure 3 shows the total time to read, write, and remove a file, varying file size from 8 KB to 232 KB, with the striping zone threshold set to 64 KB. All tests start with cold client and storage node caches.
… blocks on the NFS server, bounding deletion time.

When file size exceeds the striping zone threshold, latencies jump as operations begin to involve multiple sites and incur costs of the intention logging protocol. For example, read and write costs increase as the μproxy faults block maps from the coordinator before issuing I/O beyond the threshold. Writes and removes register an intent with the coordinator before performing the first extending write into the striping zone or before issuing the remove to the NFS server, respectively. The resulting discontinuities are clearly shown in the graph; however, the cost becomes progressively less significant as file sizes grow.

Both read and write times increase linearly with file size, and remove time remains constant. The prototype serializes some sub-operations of commit for simplicity, compromising write latency slightly. At these sizes, mirroring has a negligible effect on both read and write times.

Figure 4. Single-client and saturation bandwidth for sequential read and write.
The architecture allows very high bandwidth for large files. Figure 4 shows I/O bandwidth delivered to a single client and a group of clients, varying the number of storage nodes. Bandwidths are measured using dd to read or write a 1.25 GB file in 32 KB chunks, with a Slice striping grain of 32 KB. Each graph gives both non-redundant and mirrored storage results.

The left-hand graph shows the measured I/O bandwidth delivered to a single client with a LANai-7 adapter. We modified the FreeBSD NFS client for zero-copy reads; however, a copy remains in the write path. Single-client read bandwidth scales with the number of storage nodes until the client CPU saturates at 110 MB/s. The copy in the write path saturates a client writing at 53 MB/s.

Mirrored read bandwidth is roughly half that of non-mirrored, due to an artifact of the striping policy and our use of UFS/FFS as the block storage manager in the prototype. UFS/FFS aggressively prefetches from local disk into local memory when it detects sequential or near-sequential accesses. In this case, this policy consumes storage bandwidth to load data that the client chooses to read from another node. With a replication degree of two, clients read …
