Interposed Request Routing for Scalable Network Storage

Darrell C. Anderson, Jeffrey S. Chase, Amin M. Vahdat*
Department of Computer Science
Duke University
{anderson,chase,vahdat}@cs.duke.edu

Abstract. This paper explores interposed request routing in Slice, a new storage system architecture for high-speed networks incorporating network-attached block storage. Slice interposes a request switching filter, called a µproxy, along each client's network path to the storage service (e.g., in a network adapter or switch). The µproxy intercepts request traffic and distributes it across a server ensemble. We propose request routing schemes for I/O and file service traffic, and explore their effect on service structure.

The Slice prototype uses a packet filter µproxy to virtualize the standard Network File System (NFS) protocol, presenting to NFS clients a unified shared file volume with scalable bandwidth and capacity. Experimental results from the industry-standard SPECsfs97 workload demonstrate that the architecture enables construction of powerful network-attached storage services by aggregating cost-effective components on a switched Gigabit Ethernet LAN.

1 Introduction

Demand for large-scale storage services is growing rapidly. A prominent factor driving this growth is the concentration of storage in data centers hosting Web-based applications that serve large client populations through the Internet. At the same time, storage demands are increasing for scalable computing, multimedia and visualization.

A successful storage system architecture must scale to meet these rapidly growing demands, placing a premium on the costs (including human costs) to administer and upgrade the system. Commercial systems increasingly interconnect storage devices and servers with dedicated Storage Area Networks (SANs), e.g., FibreChannel, to enable incremental scaling of bandwidth and capacity by attaching more storage to the network. Recent advances in LAN performance have narrowed the bandwidth gap between SANs and LANs, creating an opportunity to take a similar approach using a general-purpose LAN as the storage backplane. A key challenge is to devise a distributed software layer to unify the decentralized storage resources.

* This work is supported by the National Science Foundation (EIA-9972879 and EIA-9870724), Intel, and Myricom. Anderson is supported by a U.S. Department of Education GAANN fellowship. Chase and Vahdat are supported by NSF CAREER awards (CCR-9624857 and CCR-9984328).

This paper explores interposed request routing in Slice, a new architecture for network storage. Slice interposes a request switching filter, called a µproxy, along each client's network path to the storage service. The µproxy may reside in a programmable switch or network adapter, or in a self-contained module at the client's or server's interface to the network. We show how a simple µproxy can virtualize a standard network-attached storage protocol incorporating file services as well as raw device access. The Slice µproxy distributes request traffic across a collection of storage and server elements that cooperate to present a uniform view of a shared file volume with scalable bandwidth and capacity.

This paper makes the following contributions:

• It outlines the architecture and its implementation in the Slice prototype, which is based on a µproxy implemented as an IP packet filter. We explore the impact on service structure, reconfiguration, and recovery.

• It proposes and evaluates request routing policies within the architecture. In particular, we introduce two policies for transparent scaling of the name space of a unified file volume. These techniques complement simple grouping and striping policies to distribute file access load.

• It evaluates the prototype using synthetic benchmarks including SPECsfs97, an industry-standard workload for network-attached storage servers. The results demonstrate that the system is scalable and that it complies with the Network File System (NFS) V3 standard, a popular protocol for network-attached storage.

[Figure 1: Combining functional decomposition and data decomposition in the Slice architecture. The µproxy partitions each client's request stream among directory servers (name space requests), small-file servers (small-file reads and writes), and the network storage array (bulk I/O), guided by file placement policies.]

This paper is organized as follows. Section 2 outlines the architecture and sets Slice in context with related work. Section 3 discusses the role of the µproxy, defines the request routing policies, and discusses service structure. Section 4 describes the Slice prototype, and Section 5 presents experimental results. Section 6 concludes.

2 Overview

The Slice file service consists of a collection of servers cooperating to serve an arbitrarily large virtual volume of files and directories. To a client, the ensemble appears as a single file server at some virtual network address. The µproxy intercepts and transforms packets to redirect requests and to represent the ensemble as a unified file service.

Figure 1 depicts the structure of a Slice ensemble. Each client's request stream is partitioned into three functional request classes corresponding to the major file workload components: (1) high-volume I/O to large files, (2) I/O on small files, and (3) operations on the name space or file attributes. The µproxy switches on the request type and arguments to redirect requests to a selected server responsible for handling a given class of requests. Bulk I/O operations route directly to an array of storage nodes, which provide block-level access to raw storage objects. Other operations are distributed among specialized file managers responsible for small-file I/O and/or name space requests.

This functional decomposition diverts high-volume data flow to bypass the managers, while allowing specialization of the servers for each workload component, e.g., by tailoring the policies for disk layout, caching and recovery. A single server node could combine the functions of multiple server classes; we separate them to highlight the opportunities to distribute requests across more servers.

The µproxy selects a target server by switching on the request type and the identity of the target file, name entry, or block, using a separate routing function for each request class. Thus the routing functions induce a data decomposition of the volume data across the ensemble, with the side effect of creating or caching data items on the selected managers. Ideally, the request routing scheme spreads the data and request workload in a balanced fashion across all servers. The routing functions may adapt to system conditions, e.g., to use new server sites as they become available. This allows each workload component to scale independently by adding resources to its server class.

2.1 The µproxy

An overarching goal is to keep the µproxy simple, small, and fast. The µproxy may (1) rewrite the source address, destination address, or other fields of request or response packets, (2) maintain a bounded amount of soft state, and (3) initiate or absorb packets to or from the Slice ensemble. The µproxy does not require any state that is shared across clients, so it may reside on the client host or network interface, or in a network element close to the server ensemble. The µproxy is not a barrier to scalability because its functions are freely replicable, with the constraint that each client's request stream passes through a single µproxy.

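To make this concrete, the following minimal Python sketch (our illustration, not code from the prototype; all names and addresses are hypothetical) shows the two rewriting steps: a request aimed at the virtual server is re-addressed to a physical server chosen from the request content, and the matching response is rewritten so it appears to come from the virtual address. The only state is the externally supplied routing table, which is soft state.

    # Minimal sketch of the uproxy's address rewriting (hypothetical names;
    # the prototype operates on raw IP/UDP headers inside a packet filter).
    import zlib
    from dataclasses import dataclass, replace as pkt_replace

    @dataclass(frozen=True)
    class Packet:
        src: str          # "ip:port" of the sender
        dst: str          # "ip:port" of the receiver
        payload: bytes    # NFS/RPC message body

    VIRTUAL_SERVER = "10.0.0.100:2049"                   # address the clients mount
    ROUTING_TABLE = ["10.0.1.1:2049", "10.0.1.2:2049"]   # soft state, set externally

    def route(payload: bytes) -> str:
        # Placeholder policy: derive a physical server from request content.
        return ROUTING_TABLE[zlib.crc32(payload) % len(ROUTING_TABLE)]

    def rewrite_request(pkt: Packet) -> Packet:
        # Redirect a request addressed to the virtual server.
        assert pkt.dst == VIRTUAL_SERVER
        return pkt_replace(pkt, dst=route(pkt.payload))

    def rewrite_response(pkt: Packet) -> Packet:
        # Make the reply appear to originate from the virtual server.
        return pkt_replace(pkt, src=VIRTUAL_SERVER)

    if __name__ == "__main__":
        req = Packet(src="10.0.2.7:700", dst=VIRTUAL_SERVER, payload=b"GETATTR fh=42")
        fwd = rewrite_request(req)
        rsp = Packet(src=fwd.dst, dst=req.src, payload=b"OK")
        print(fwd.dst, rewrite_response(rsp).src)
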
The µproxy functions as a network element within the Internet architecture. It is free to discard its state and/or pending packets without compromising correctness. End-to-end protocols (in this case NFS/RPC/UDP or TCP) retransmit packets as necessary to recover from drops in the µproxy. Although the µproxy resides "within the network", it acts as an extension of the service. For example, since the µproxy is a layer-5 protocol component, it must reside (logically) at one end of the connection or the other; it cannot reside in the "middle" of the connection where end-to-end encryption might hide layer-5 protocol fields.

2.2 Network Storage Nodes

A shared array of network storage nodes provides all disk storage used in a Slice ensemble. The µproxy routes bulk I/O requests directly to the network storage array, without intervention by a file manager. More storage nodes may be added to incrementally scale bandwidth, capacity, and disk arms.

The Slice block storage prototype is loosely based on a proposal in the National Storage Industry Consortium (NSIC) for object-based storage devices (OBSD) [3]. Key elements of the OBSD proposal were in turn inspired by the CMU research on Network Attached Secure Disks (NASD) [8, 9]. Slice storage nodes are "object-based" rather than sector-based, meaning that requesters address data as logical offsets within storage objects. A storage object is an ordered sequence of bytes with a unique identifier. The placement policies of the file service are responsible for distributing data among storage objects so as to benefit fully from all of the resources in the network storage array.

A key advantage of OBSDs and NASDs is that they allow for cryptographic protection of storage object identifiers if the network is insecure [9]. This protection allows the µproxy to reside outside of the server ensemble's trust boundary. In this case, the damage from a compromised µproxy is limited to the files and directories that its client(s) had permission to access. However, the Slice request routing architecture is compatible with conventional sector-based storage devices if every µproxy resides inside the service trust boundary.

This storage architecture is orthogonal to the question of which level arranges redundancy to tolerate disk failures. One alternative is to provide redundancy of disks and other vulnerable components internally to each storage node. A second option is for the file service software to mirror data or maintain parity across the storage nodes. In Slice, the choice to employ extra redundancy across storage nodes may be made on a per-file basis through support for mirrored striping in our prototype's I/O routing policies. For stronger protection, a Slice configuration could employ redundancy at both levels.

The Slice block service includes a coordinator module for files that span multiple storage nodes. The coordinator manages optional block maps (Section 3.1) and preserves atomicity of multisite operations (Section 3.3.2). A Slice configuration may include any number of coordinators, each managing a subset of the files (Section 4.2).

2.3 File Managers

File management functions above the network storage array are split across two classes of file managers. Each class governs functions that are common to any file server; the architecture separates them to distribute the request load and allow implementations specialized for each request class.

• Directory servers handle name space operations, e.g., to create, remove, or lookup files and directories by symbolic name; they manage directories and mappings from names to identifiers and attributes for each file or directory.

• Small-file servers handle read and write operations on small files and the initial segments of large files (Section 3.1).

Slice file managers are dataless; all of their state is backed by the network storage array. Their role is to aggregate their structures into larger storage objects backed by the storage nodes, and to provide memory and CPU resources to cache and manipulate those structures. In this way, the file managers can benefit from the parallel disk arms and high bandwidth of the storage array as more storage nodes are added.

The principle of dataless file managers also plays a key role in recovery. In addition to its backing objects, each manager journals its updates in a write-ahead log [10]; the system can recover the state of any manager from its backing objects together with its log. This allows fast failover, in which a surviving site assumes the role of a failed server, recovering its state from shared storage [12, 4, 24].

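The recovery idea can be illustrated with a small sketch (ours, with a hypothetical log API standing in for a log object on the storage array): each update is appended to the write-ahead log before the manager's volatile structures change, so a surviving site can reconstruct a failed manager's state by replaying the log over its backing objects.

    # Sketch of write-ahead journaling for a dataless file manager (hypothetical API).
    import json

    class MemoryLog:
        # Stand-in for an append-only log object on the network storage array.
        def __init__(self): self._recs = []
        def append(self, rec): self._recs.append(rec)
        def records(self): return list(self._recs)

    class FileManager:
        def __init__(self, log):
            self.log = log       # the write-ahead journal
            self.names = {}      # volatile structure backed by storage objects

        def apply(self, op, args):
            if op == "create":
                self.names[args["name"]] = args["fileid"]
            elif op == "remove":
                self.names.pop(args["name"], None)

        def update(self, op, **args):
            # 1. Append the record to the shared log (write-ahead).
            self.log.append(json.dumps({"op": op, "args": args}).encode())
            # 2. Only then mutate the volatile structure.
            self.apply(op, args)

    def recover(log):
        # A surviving site replays the log to take over a failed manager's role.
        mgr = FileManager(log)
        for rec in log.records():
            r = json.loads(rec)
            mgr.apply(r["op"], r["args"])
        return mgr

    if __name__ == "__main__":
        log = MemoryLog()
        m = FileManager(log)
        m.update("create", name="a.txt", fileid=17)
        assert recover(log).names == {"a.txt": 17}
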
2.4 Summary

Interposed request routing in the Slice architecture yields three fundamental benefits:

• Scalable file management with content-based request switching. Slice distributes file service requests across a server ensemble. A good request switching scheme induces a balanced distribution of file objects and requests across servers, and improves locality in the request stream.

• Direct storage access for high-volume I/O. The µproxy routes bulk I/O traffic directly to the network storage array, removing the file managers from the critical path. Separating requests in this fashion eliminates a key scaling barrier for conventional file services [8, 9]. At the same time, the small-file servers absorb and aggregate I/O operations on small files, so there is no need for the storage nodes to handle small objects efficiently.

• Compatibility with standard file system clients. The µproxy factors request routing policies out of the client-side file system code. This allows the architecture to leverage a minimal computing capability within the network elements to virtualize the storage protocol.

2.5 Related Work

A large number of systems have interposed new system functionality by "wrapping" an existing interface, including kernel system calls [14], internal interfaces [13], communication bindings [11], or messaging endpoints. The concept of a proxy mediating between clients and servers [23] is now common in distributed systems. We propose to mediate some storage functions by interposing on standard storage access protocols within the network elements. Network file services can benefit from this technique because they have well-defined protocols and a large installed base of clients and applications, many of which face significant scaling challenges today.

The Slice µproxy routes file service requests based on their content. This is analogous to the HTTP content switching features offered by some network switch vendors (e.g., Alteon, Arrowpoint, F5), based in part on research demonstrating improved locality and load balancing for large Internet server sites [20]. Slice extends the content switching concept to a file system context.

A number of recent commercial and research efforts investigate techniques for building scalable storage systems for high-speed switched LAN networks. These systems are built from disks distributed through the network, and attached to dedicated servers [16, 24, 12], cooperating peers [4, 26], or the network itself [8, 9]. We separate these systems into two broad groups.

The first group separates file managers (e.g., the name service) from the block storage service, as in Slice. This separation was first proposed for the Cambridge Universal File Server [6]. Subsequent systems adopted this separation to allow bulk I/O to bypass file managers [7, 12], and it is now a basic tenet of research in network-attached storage devices including the CMU NASD work on devices for secure storage objects [8, 9]. Slice shows how to incorporate placement and routing functions essential for this separation into a new filesystem structure for network-attached storage. The CMU NASD project integrated similar functions into network file system clients [9]; the Slice model decouples these functions, preserving compatibility with existing clients. In addition, Slice extends the NASD project approach to support scalable file management as well as high-bandwidth I/O for large files.

A second group of scalable storage systems layers the file system functions above a network storage volume using a shared disk model. Policies for striping, redundancy, and storage site selection are specified on a volume basis; cluster nodes coordinate their accesses to the shared storage blocks using an ownership protocol. This approach has been used with both log-structured (Zebra [12] and xFS [4]) and conventional (Frangipani/Petal [16, 24] and GFS [21]) file system organizations. The cluster may be viewed as "serverless" if all nodes are trusted and have direct access to the shared disk, or alternatively the entire cluster may act as a file server to untrusted clients using a standard network file protocol, with all I/O passing through the cluster nodes as they mediate access to the disks.

The key benefits of Slice request routing apply equally to these shared disk systems when untrusted clients are present. First, request routing is a key to incorporating secure network-attached block storage, which allows untrusted clients to address storage objects directly without compromising the integrity of the file system. That is, a µproxy could route bulk I/O requests directly to the devices, yielding a more scalable system that preserves compatibility with standard clients and allows per-file policies for block placement, parity or replication, prefetching, etc. Second, request routing enhances locality in the request stream to the file servers, improving cache effectiveness and reducing block contention among the servers.

The shared disk model is used in many commercial systems, which increasingly interconnect storage devices and servers with dedicated Storage Area Networks (SANs), e.g., FibreChannel. This paper explores storage request routing for Internet networks, but the concepts are equally applicable in SANs.

Our proposal to separate small-file I/O from the request stream is similar in concept to the Amoeba Bullet Server [25], a specialized file server that optimizes small files. As described in Section 4.4, the prototype small-file server draws on techniques from the Bullet Server, FFS fragments [19], and SquidMLA [18], a Web proxy server that maintains a user-level "filesystem" of small cached Web pages.

3 Request Routing Policies

This section explains the structure of the µproxy and the request routing schemes used in the Slice prototype. The purpose is to illustrate concretely the request routing policies enabled by the architecture, and the implications of those policies for the way the servers interact to maintain and recover consistent file system states. We use the NFS V3 protocol as a reference point because it is widely understood and our prototype supports it.

The µproxy intercepts NFS requests addressed to virtual NFS servers, and routes each request to a physical server by applying a function to the request type and arguments. It then rewrites the IP address and port to redirect the request to the selected server. When a response arrives, the µproxy rewrites the source address and port before forwarding it to the client, so the response appears to originate from the virtual NFS server.

The request routing functions must permit reconfiguration to add or remove servers, while minimizing state requirements in the µproxy. The µproxy directs most requests by extracting relevant fields from the request, perhaps hashing to combine multiple fields, and interpreting the result as a logical server site ID for the request. It then looks up the corresponding physical server in a compact routing table. Multiple logical sites may map to the same physical server, leaving flexibility for reconfiguration (Section 3.3.1). The routing tables constitute soft state; the mapping is determined externally, so the µproxy never modifies the tables.

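The table-driven lookup can be sketched as follows (a hypothetical illustration; the table size and server names are made up): request fields are hashed into a logical site ID, which indexes a compact table mapping logical sites to physical servers.

    # Sketch of table-driven request routing (hypothetical field names and table sizes).
    import zlib

    NUM_LOGICAL_SITES = 64                      # granularity of rebalancing
    ROUTING_TABLE = ["server%d" % (i % 4) for i in range(NUM_LOGICAL_SITES)]
    # ROUTING_TABLE is soft state: it is installed externally and never modified here.

    def logical_site(*fields: bytes) -> int:
        # Combine one or more request fields into a logical site ID.
        return zlib.crc32(b"|".join(fields)) % NUM_LOGICAL_SITES

    def route_by_fileid(fileid: int) -> str:
        return ROUTING_TABLE[logical_site(fileid.to_bytes(8, "big"))]

    def route_by_name(parent_fhandle: bytes, name: str) -> str:
        return ROUTING_TABLE[logical_site(parent_fhandle, name.encode())]

    if __name__ == "__main__":
        print(route_by_fileid(12345))
        print(route_by_name(b"\x00\x01parent", "notes.txt"))
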
The µproxy examines up to four fields of each request, depending on the policies configured:

• Request type. Routing policies are keyed by the NFS request type, so the µproxy may employ different policies for different functions. Table 1 lists the important NFS request groupings discussed in this paper.

• File handle. Each NFS request targets a specific file or directory, named by a unique identifier called a file handle (or fhandle). Although NFS fhandles are opaque to the client, their structure can be known to the µproxy, which acts as an extension of the service. Directory servers encode a fileID in each fhandle, which the µproxies extract as a routing key.

• Read/write offset. NFS I/O operations specify the range of offsets covered by each read and write. The µproxy uses these fields to select the server or storage node for the data.

• Name component. NFS name space requests include a symbolic name component in their arguments (see Table 1). A key challenge for scaling file management is to obtain a balanced distribution of these requests. This is particularly important for name-intensive workloads with small files and heavy create/lookup/remove activity, as often occurs in Internet services for mail, news, message boards, and Web access.

We now outline some µproxy policies that use these fields to route specific request groups.

3.1 Block I/O

Request routing for read/write requests has two goals: separate small-file read/write traffic from bulk I/O, and decluster the blocks of large files across the storage nodes for the desired access properties (e.g., high bandwidth or a specified level of redundancy). We address each in turn.

When small-file servers are configured, the prototype's routing policy defines a fixed threshold offset (e.g., 64KB); the µproxy directs I/O requests below the threshold to a small-file server selected from the request fhandle. The threshold offset is necessary because the size of each file may change at any time. Thus the small-file servers also receive a subset of the I/O requests on large files; they receive all I/O below the threshold, even if the target file is large. In practice, large files have little impact on the small-file servers because there tends to be a small number of these files, even if they make up a large share of the stored bytes. Similarly, large file I/O below the threshold is limited by the bandwidth of the small-file server, but this affects only the first threshold bytes, and becomes progressively less significant as the file grows.

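A minimal sketch of this rule, assuming a 64KB threshold, hypothetical server names, and the simplification that the request's starting offset decides the split:

    # Sketch of the small-file/bulk I/O split (illustrative constants and names).
    import zlib

    THRESHOLD = 64 * 1024                # fixed threshold offset, e.g., 64KB
    SMALL_FILE_SERVERS = ["sfs0", "sfs1"]

    def route_io(fhandle: bytes, offset: int) -> str:
        if offset < THRESHOLD:
            # Below the threshold: a small-file server selected from the fhandle,
            # even if the target file is large.
            return SMALL_FILE_SERVERS[zlib.crc32(fhandle) % len(SMALL_FILE_SERVERS)]
        # At or above the threshold: route directly to the network storage array.
        return "storage-array"

    if __name__ == "__main__":
        fh = b"\x07fileid=42"
        print(route_io(fh, 4096), route_io(fh, 1 << 20))
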
The µproxy redirects I/O traffic above the threshold directly to the network storage array, using some placement policy to select the storage site(s) for each block. A simple option is to employ static striping and placement functions that compute on the block offset and/or fileID. More flexible placement policies would allow the µproxy to consider other factors, e.g., load conditions on the network or storage nodes, or file attributes encoded in the fhandle. To generalize to more flexible placement policies, Slice optionally records block locations in per-file block maps managed by the block service coordinators. The µproxies interact with the coordinators to fetch and cache fragments of the block maps as they handle I/O operations on files.

Name Space Operations
  lookup(dir, name) returns (fhandle, attr)
      Look up a name in dir; return handle and attributes.
  create(dir, name) returns (fhandle, attr)
  mkdir(dir, name) returns (fhandle, attr)
      Create a file/directory and update the parent entry/link
      count and modify timestamp.
  remove(dir, name), rmdir(dir, name)
      Remove a file/directory or hard link and update the parent
      entry/link count and modify timestamp.
  link(olddir, oldname, newdir, newname) returns (fhandle, attr)
      Create a new name for a file, update the file link count,
      and update modify timestamps on the file and newdir.
  rename(olddir, oldname, newdir, newname) returns (fhandle, attr)
      Rename an existing file or hard link; update the link count
      and modify timestamp on both the old and new parent.

Attribute Operations
  getattr(object) returns (attr)
      Retrieve the attributes of a file or directory.
  setattr(object, attr)
      Modify the attributes of a file or directory, and update its
      modify timestamp.

I/O Operations
  read(file, offset, len) returns (data, attr)
      Read data from a file, updating its access timestamp.
  write(file, offset, len) returns (data, attr)
      Write data to a file, updating its modify timestamp.

Directory Retrieval
  readdir(dir, cookie) returns (entries, cookie)
      Read some or all of the entries in a directory.

Table 1: Some important Network File System (NFS) protocol operations.

As one example of an attribute-based policy, Slice supports a mirrored striping policy that replicates each block of a mirrored file on multiple storage nodes, to tolerate failures up to the replication degree. Mirroring consumes more storage and network bandwidth than striping with parity, but it is simple and reliable, avoids the overhead of computing and updating parity, and allows load-balanced reads [5, 16].

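A static placement function of the kind described above might be sketched as follows (ours; the stripe unit, node count, and mirroring degree are illustrative): the storage site is computed from the fileID and block offset, and a mirrored file replicates each block on the next few nodes, up to the replication degree.

    # Sketch of static striping and mirrored placement over storage nodes
    # (illustrative parameters; real policies may also consult block maps or load).
    NODES = ["store%d" % i for i in range(8)]
    STRIPE_UNIT = 64 * 1024              # bytes per stripe unit

    def stripe_site(fileid: int, offset: int) -> int:
        stripe_index = offset // STRIPE_UNIT
        return (fileid + stripe_index) % len(NODES)

    def placement(fileid: int, offset: int, mirrors: int = 1) -> list[str]:
        # Return the storage node(s) holding the block at this offset.
        base = stripe_site(fileid, offset)
        return [NODES[(base + k) % len(NODES)] for k in range(mirrors)]

    if __name__ == "__main__":
        print(placement(fileid=42, offset=0))                      # striped, one copy
        print(placement(fileid=42, offset=256 * 1024, mirrors=2))  # mirrored file
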
3.2 Name Space Operations

Effectively distributing name space requests presents different challenges from I/O request routing. Name operations involve more computation, and name entries may benefit more from caching because they tend to be relatively small and fragmented. Moreover, directories are frequently shared. Directory servers act as synchronization points to preserve integrity of the name space, e.g., to prevent clients from concurrently creating a file with the same name, or removing a directory while a name create is in progress.

A simple approach to scaling a file service is to partition the name space into a set of volumes, each managed by a single server. Unfortunately, this VOLUME PARTITIONING strategy compromises transparency and increases administrative overhead in two ways. First, volume boundaries are visible to clients as mount points, and naming operations such as link and rename cannot cross volume boundaries. Second, the system develops imbalances if volume loads grow at different rates, requiring intervention to repartition the name space. This may be visible to users through name changes to existing directories.

An important goal of name management in Slice is to automatically distribute the load of a single file volume across multiple servers, without imposing user-visible volume boundaries. We propose two alternative name space routing policies to achieve this goal. MKDIR SWITCHING yields balanced distributions when the average number of active directories is large relative to the number of directory server sites, but it binds large directories to a single server. For workloads with very large directories, NAME HASHING yields probabilistically balanced request distributions independent of workload. The cost of this effectiveness is that more operations cross server boundaries, increasing the cost and complexity of coordination among the directory servers (Section 4.3).

MKDIR SWITCHING works as follows. In most cases, the µproxy routes name space operations to the directory server that manages the parent directory; the µproxy identifies this server by indexing its routing table with the fileID from the parent directory fhandle in the request (refer to Table 1). On a mkdir request, the µproxy decides with probability p to redirect the request to a different directory server, placing the new directory, and its descendants, on a different site from the parent directory. The policy uniquely selects the new server by hashing on the parent fhandle and the symbolic name of the new directory; this guarantees that races over name manipulation involve at most two sites. Reducing directory affinity by increasing p makes the policy more aggressive in distributing name entries across sites; this produces a more balanced load, but more operations involve multiple sites. Section 5 presents experimental data illustrating this tradeoff.

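The following sketch (ours; the server list and the deterministic realization of the probability-p coin are assumptions) captures the policy: name operations follow the parent directory's server, except that a mkdir is redirected with probability p to a server chosen by hashing the parent directory's fileID and the new name.

    # Sketch of MKDIR SWITCHING (illustrative names; p controls directory affinity).
    import zlib

    DIRECTORY_SERVERS = ["dir0", "dir1", "dir2", "dir3"]
    P_SWITCH = 0.25      # probability of placing a new directory on another site

    def server_for_fileid(fileid: int) -> str:
        # Default: the server that manages the parent directory.
        return DIRECTORY_SERVERS[fileid % len(DIRECTORY_SERVERS)]

    def route_name_op(op: str, parent_fileid: int, name: str) -> str:
        h = zlib.crc32(parent_fileid.to_bytes(8, "big") + name.encode())
        if op == "mkdir" and (h % 1000) < int(P_SWITCH * 1000):
            # Redirect with probability p; realizing the coin as a hash test is our
            # assumption, so retransmissions make the same choice. The new server is
            # also selected by the hash, so a race involves at most two sites.
            return DIRECTORY_SERVERS[h % len(DIRECTORY_SERVERS)]
        return server_for_fileid(parent_fileid)

    if __name__ == "__main__":
        print(route_name_op("create", parent_fileid=7, name="a.txt"))
        print(route_name_op("mkdir", parent_fileid=7, name="projects"))
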
NAME HASHING extends this approach by routing all name space operations using a hash on the name component and its position in the directory tree, as given by the parent directory fhandle. This approach represents the entire volume name space as a unified global hash table distributed among the directory servers. It views directories as distributed collections of name entries, rather than as files accessed as a unit. Conflicting operations on any given name entry (e.g., create/create, create/remove, remove/lookup) always hash to the same server, where they serialize on the shared hash chain. Operations on different entries in the same directory (e.g., create, remove, lookup) may proceed in parallel at multiple sites. For good performance, NAME HASHING requires sufficient memory to keep the hash chains memory-resident, since the hashing function sacrifices locality in the hash chain accesses. Also, readdir operations span multiple sites; this is the right behavior for large directories, but it increases readdir costs for small directories.

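NAME HASHING can be sketched similarly (again with made-up server names): every name operation, not just mkdir, routes on a hash of the parent directory fhandle and the name component, so conflicting operations on one entry always meet at one server, while entries of the same directory spread across servers.

    # Sketch of NAME HASHING: all name space operations route on (parent, name).
    import zlib

    DIRECTORY_SERVERS = ["dir0", "dir1", "dir2", "dir3"]

    def name_hash_route(parent_fhandle: bytes, name: str) -> str:
        h = zlib.crc32(parent_fhandle + b"/" + name.encode())
        return DIRECTORY_SERVERS[h % len(DIRECTORY_SERVERS)]

    if __name__ == "__main__":
        # create/remove/lookup of the same entry always meet at one server ...
        print(name_hash_route(b"dirA", "report.txt"))
        # ... while different entries in the same directory may land on different
        # servers, so a readdir must visit multiple sites.
        print(name_hash_route(b"dirA", "draft.txt"))
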
3.3 Storage Service Structure

Request routing policies impact storage service structure. The primary challenges are coordination and recovery to maintain a consistent view of the file volume across all servers, and reconfiguration to add or remove servers within each class.

Most of the routing policies outlined above are independent of whether small files and name entries are bound to the server sites that create them. One option is for the servers to share backing objects from a shared disk using a block ownership protocol (see Section 2.5); in this case, the role of the µproxy is to enhance locality in the request stream to each server. Alternatively, the system may use fixed placement in which items are controlled by their create sites unless reconfiguration or failover causes them to move; with this approach backing storage objects may be private to each site, even if they reside on shared network storage. Fixed placement stresses the role of the request routing policy in the placement of new name entries or data items. The next two subsections discuss reconfiguration and recovery issues for the Slice architecture with respect to these structural alternatives.

3.3.1 Reconfiguration

Consider the problem of reconfiguration to add or remove file managers, i.e., directory servers, small-file servers, or map coordinators. For requests routed by keying on the fileID, the system updates µproxy routing tables to change the binding from fileIDs to physical servers if servers join or depart the ensemble. To keep the tables compact, Slice maps the fileID to a smaller logical server ID before indexing the table. The number of logical servers defines the size of the routing tables and the minimal granularity for rebalancing. The µproxy's copy of the routing table is a "hint" that may become stale during reconfiguration; the µproxy may load new tables lazily from an external source, assuming that servers can identify misdirected requests.

This approach generalizes to policies in which the logical server ID is derived from a hash that includes other request arguments, as in the NAME HASHING approach. For NAME HASHING systems and other systems with fixed placement, the reconfiguration procedure must move logical servers from one physical server to another. One approach is for each physical server to use multiple backing objects, one for each hosted logical server, and reconfigure by reassigning the binding of physical servers to backing objects in the shared network storage array. Otherwise, reconfiguration must copy data from one backing object to another. In general, an ensemble with N servers must move 1/Nth of its data to rebalance after adding or losing a physical server [15].

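A small sketch of this table update (ours; sizes and names are illustrative): because the fileID-to-logical mapping never changes, adding a physical server only reassigns a share of the logical server IDs, moving about 1/N of the data.

    # Sketch of rebalancing by reassigning logical server IDs (illustrative sizes).
    NUM_LOGICAL = 64     # granularity of rebalancing; fileID -> logical ID is fixed

    def initial_table(servers):
        # Logical site i is hosted by one physical server.
        return [servers[i % len(servers)] for i in range(NUM_LOGICAL)]

    def add_server(table, new_server):
        # Reassign roughly 1/N of the logical sites to the new physical server.
        table = list(table)
        n = len(set(table)) + 1           # physical servers after the addition
        for i in range(NUM_LOGICAL // n):
            table[(i * n) % NUM_LOGICAL] = new_server
        return table

    old = initial_table(["mgr0", "mgr1", "mgr2"])
    new = add_server(old, "mgr3")
    moved = sum(1 for a, b in zip(old, new) if a != b)
    print(f"{moved}/{NUM_LOGICAL} logical sites reassigned")  # about 1/N of the data
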
3.3.2 Atomicity and Recovery

File systems have strong integrity requirements and frequent updates; the system must preserve their integrity through failures and concurrent operations. The focus on request routing naturally implies that the multiple servers must manage distributed state.

File managers prepare for recovery by generating a write-ahead log in shared storage. For systems that use the shared-disk model without fixed placement, all operations execute at a single manager site, and it is necessary and sufficient for the system to provide locking and recovery procedures for the shared disk blocks [24]. For systems with fixed placement, servers do not share blocks directly, but some operations must update state at multiple sites through a peer-peer protocol. Thus there is no need for distribut
