`
`Darrell C. Anderson, Jeffrey S. Chase, Amin M. Vahdat *
`Department of Computer Science
`Duke University
`{anderson,chase,vahdat}@cs.duke.edu
`
Abstract. This paper explores interposed request routing in Slice, a new storage system architecture for high-speed networks incorporating network-attached block storage. Slice interposes a request switching filter, called a μproxy, along each client's network path to the storage service (e.g., in a network adapter or switch). The μproxy intercepts request traffic and distributes it across a server ensemble. We propose request routing schemes for I/O and file service traffic, and explore their effect on service structure.
`
The Slice prototype uses a packet filter μproxy to virtualize the standard Network File System (NFS) protocol, presenting to NFS clients a unified shared file volume with scalable bandwidth and capacity. Experimental results from the industry-standard SPECsfs97 workload demonstrate that the architecture enables construction of powerful network-attached storage services by aggregating cost-effective components on a switched Gigabit Ethernet LAN.
`
`1
`
`Introduction
`
Demand for large-scale storage services is growing rapidly. A prominent factor driving this growth is the concentration of storage in data centers hosting Web-based applications that serve large client populations through the Internet. At the same time, storage demands are increasing for scalable computing, multimedia and visualization.
`
A successful storage system architecture must scale to meet these rapidly growing demands, placing a premium on the costs (including human costs) to administer and upgrade the system. Commercial systems increasingly interconnect storage devices and servers with dedicated Storage Area Networks (SANs), e.g., FibreChannel, to enable incremental scaling of bandwidth and capacity by attaching more storage to the network. Recent advances in LAN performance have narrowed the bandwidth gap between SANs and LANs, creating an opportunity to take a similar approach using a general-purpose LAN as the storage backplane. A key challenge is to devise a distributed software layer to unify the decentralized storage resources.

* This work is supported by the National Science Foundation (EIA-9972879 and EIA-9870724), Intel, and Myricom. Anderson is supported by a U.S. Department of Education GAANN fellowship. Chase and Vahdat are supported by NSF CAREER awards (CCR-9624857 and CCR-9984328).
`
This paper explores interposed request routing in Slice, a new architecture for network storage. Slice interposes a request switching filter, called a μproxy, along each client's network path to the storage service. The μproxy may reside in a programmable switch or network adapter, or in a self-contained module at the client's or server's interface to the network. We show how a simple μproxy can virtualize a standard network-attached storage protocol incorporating file services as well as raw device access. The Slice μproxy distributes request traffic across a collection of storage and server elements that cooperate to present a uniform view of a shared file volume with scalable bandwidth and capacity.
`
`This paper makes the following contributions:
`
- It outlines the architecture and its implementation in the Slice prototype, which is based on a μproxy implemented as an IP packet filter. We explore the impact on service structure, reconfiguration, and recovery.
`
- It proposes and evaluates request routing policies within the architecture. In particular, we introduce two policies for transparent scaling of the name space of a unified file volume. These techniques complement simple grouping and striping policies to distribute file access load.
`
- It evaluates the prototype using synthetic benchmarks including SPECsfs97, an industry-standard workload for network-attached storage servers. The results demonstrate that the system is scalable and that it complies with the Network File System (NFS) V3 standard, a popular protocol for network-attached storage.
`
`
`
Figure 1: Combining functional decomposition and data decomposition in the Slice architecture. (Figure labels: client; name space requests; directory servers; small file read/write; small-file servers; file placement policy; network storage array.)
`
This paper is organized as follows. Section 2 outlines the architecture and sets Slice in context with related work. Section 3 discusses the role of the μproxy, defines the request routing policies, and discusses service structure. Section 4 describes the Slice prototype, and Section 5 presents experimental results. Section 6 concludes.
`
`2 Overview
`
The Slice file service consists of a collection of servers cooperating to serve an arbitrarily large virtual volume of files and directories. To a client, the ensemble appears as a single file server at some virtual network address. The μproxy intercepts and transforms packets to redirect requests and to represent the ensemble as a unified file service.
`
Figure 1 depicts the structure of a Slice ensemble. Each client's request stream is partitioned into three functional request classes corresponding to the major file workload components: (1) high-volume I/O to large files, (2) I/O on small files, and (3) operations on the name space or file attributes. The μproxy switches on the request type and arguments to redirect requests to a selected server responsible for handling a given class of requests. Bulk I/O operations route directly to an array of storage nodes, which provide block-level access to raw storage objects. Other operations are distributed among specialized file managers responsible for small-file I/O and/or name space requests.
`
This functional decomposition diverts high-volume data flow to bypass the managers, while allowing specialization of the servers for each workload component, e.g., by tailoring the policies for disk layout, caching and recovery. A single server node could combine the functions of multiple server classes; we separate them to highlight the opportunities to distribute requests across more servers.
`
The μproxy selects a target server by switching on the request type and the identity of the target file, name entry, or block, using a separate routing function for each request class. Thus the routing functions induce a data decomposition of the volume data across the ensemble, with the side effect of creating or caching data items on the selected managers. Ideally, the request routing scheme spreads the data and request workload in a balanced fashion across all servers. The routing functions may adapt to system conditions, e.g., to use new server sites as they become available. This allows each workload component to scale independently by adding resources to its server class.
`
2.1 The μproxy
`
An overarching goal is to keep the μproxy simple, small, and fast. The μproxy may (1) rewrite the source address, destination address, or other fields of request or response packets, (2) maintain a bounded amount of soft state, and (3) initiate or absorb packets to or from the Slice ensemble. The μproxy does not require any state that is shared across clients, so it may reside on the client host or network interface, or in a network element close to the server ensemble. The μproxy is not a barrier to scalability because its functions are freely replicable, with the constraint that each client's request stream passes through a single μproxy.
`
The μproxy functions as a network element within the Internet architecture. It is free to discard its state and/or pending packets without compromising correctness. End-to-end protocols (in this case NFS/RPC/UDP or TCP) retransmit packets as necessary to recover from drops in the μproxy. Although the μproxy resides "within the network", it acts as an extension of the service. For example, since the μproxy is a layer-5 protocol component, it must reside (logically) at one end of the connection or the other; it cannot reside in the "middle" of the connection, where end-to-end encryption might hide layer-5 protocol fields.
`
`
`
`
2.2 Network Storage Nodes
`
A shared array of network storage nodes provides all disk storage used in a Slice ensemble. The μproxy routes bulk I/O requests directly to the network storage array, without intervention by a file manager. More storage nodes may be added to incrementally scale bandwidth, capacity, and disk arms.
`
The Slice block storage prototype is loosely based on a proposal in the National Storage Industry Consortium (NSIC) for object-based storage devices (OBSD) [3]. Key elements of the OBSD proposal were in turn inspired by the CMU research on Network Attached Secure Disks (NASD) [8, 9]. Slice storage nodes are "object-based" rather than sector-based, meaning that requesters address data as logical offsets within storage objects. A storage object is an ordered sequence of bytes with a unique identifier. The placement policies of the file service are responsible for distributing data among storage objects so as to benefit fully from all of the resources in the network storage array.
`
A key advantage of OBSDs and NASDs is that they allow for cryptographic protection of storage object identifiers if the network is insecure [9]. This protection allows the μproxy to reside outside of the server ensemble's trust boundary. In this case, the damage from a compromised μproxy is limited to the files and directories that its client(s) had permission to access. However, the Slice request routing architecture is compatible with conventional sector-based storage devices if every μproxy resides inside the service trust boundary.
`
This storage architecture is orthogonal to the question of which level arranges redundancy to tolerate disk failures. One alternative is to provide redundancy of disks and other vulnerable components internally to each storage node. A second option is for the file service software to mirror data or maintain parity across the storage nodes. In Slice, the choice to employ extra redundancy across storage nodes may be made on a per-file basis through support for mirrored striping in our prototype's I/O routing policies. For stronger protection, a Slice configuration could employ redundancy at both levels.
The Slice block service includes a coordinator module for files that span multiple storage nodes. The coordinator manages optional block maps (Section 3.1) and preserves atomicity of multisite operations (Section 3.3.2). A Slice configuration may include any number of coordinators, each managing a subset of the files (Section 4.2).
`
2.3 File Managers

File management functions above the network storage array are split across two classes of file managers. Each class governs functions that are common to any file server; the architecture separates them to distribute the request load and allow implementations specialized for each request class.
`
- Directory servers handle name space operations, e.g., to create, remove, or look up files and directories by symbolic name; they manage directories and mappings from names to identifiers and attributes for each file or directory.

- Small-file servers handle read and write operations on small files and the initial segments of large files (Section 3.1).
`
Slice file managers are dataless; all of their state is backed by the network storage array. Their role is to aggregate their structures into larger storage objects backed by the storage nodes, and to provide memory and CPU resources to cache and manipulate those structures. In this way, the file managers can benefit from the parallel disk arms and high bandwidth of the storage array as more storage nodes are added.
`
The principle of dataless file managers also plays a key role in recovery. In addition to its backing objects, each manager journals its updates in a write-ahead log [10]; the system can recover the state of any manager from its backing objects together with its log. This allows fast failover, in which a surviving site assumes the role of a failed server, recovering its state from shared storage [12, 4, 24].
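
To make the recovery idea concrete, here is a minimal Python sketch of log-replay recovery for a dataless manager. The record format, helper names, and state representation are invented for illustration only; they are not the prototype's actual on-disk structures.

```python
import json

def recover_manager_state(backing_object: dict, log_records: list) -> dict:
    """Rebuild a manager's state from its backing objects plus its log."""
    state = dict(backing_object)      # last state captured in backing storage
    for record in log_records:        # redo logged updates in order
        op = json.loads(record)
        if op["type"] == "put":
            state[op["key"]] = op["value"]
        elif op["type"] == "delete":
            state.pop(op["key"], None)
    return state

# A surviving site can run this against the failed manager's shared backing
# objects and log to assume its role (fast failover).
log = ['{"type": "put", "key": "entry", "value": 1}',
       '{"type": "delete", "key": "stale"}']
print(recover_manager_state({"stale": 0}, log))   # {'entry': 1}
```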
`
`2.4 Summary
`
`Interposed request routing in the Slice architecture
`yields three fundamental benefits:
`
- Scalable file management with content-based request switching. Slice distributes file service requests across a server ensemble. A good request switching scheme induces a balanced distribution of file objects and requests across servers, and improves locality in the request stream.
`
- Direct storage access for high-volume I/O. The μproxy routes bulk I/O traffic directly to the network storage array, removing the file managers from the critical path. Separating requests in this fashion eliminates a key scaling barrier for conventional file services [8, 9]. At the same time, the small-file servers absorb and aggregate I/O operations on small files, so there is no need for the storage nodes to handle small objects efficiently.
`
- Compatibility with standard file system clients. The μproxy factors request routing policies out of the client-side file system code. This allows the architecture to leverage a minimal computing capability within the network elements to virtualize the storage protocol.
`
`2.5 Related Work
`
A large number of systems have interposed new system functionality by "wrapping" an existing interface, including kernel system calls [14], internal interfaces [13], communication bindings [11], or messaging endpoints. The concept of a proxy mediating between clients and servers [23] is now common in distributed systems. We propose to mediate some storage functions by interposing on standard storage access protocols within the network elements. Network file services can benefit from this technique because they have well-defined protocols and a large installed base of clients and applications, many of which face significant scaling challenges today.
`
The Slice μproxy routes file service requests based on their content. This is analogous to the HTTP content switching features offered by some network switch vendors (e.g., Alteon, Arrowpoint, F5), based in part on research demonstrating improved locality and load balancing for large Internet server sites [20]. Slice extends the content switching concept to a file system context.

A number of recent commercial and research efforts investigate techniques for building scalable storage systems for high-speed switched LAN networks. These systems are built from disks distributed through the network, and attached to dedicated servers [16, 24, 12], cooperating peers [4, 26], or the network itself [8, 9]. We separate these systems into two broad groups.
`
The first group separates file managers (e.g., the name service) from the block storage service, as in Slice. This separation was first proposed for the Cambridge Universal File Server [6]. Subsequent systems adopted this separation to allow bulk I/O to bypass file managers [7, 12], and it is now a basic tenet of research in network-attached storage devices including the CMU NASD work on devices for secure storage objects [8, 9]. Slice shows how to incorporate placement and routing functions essential for this separation into a new file system structure for network-attached storage. The CMU NASD project integrated similar functions into network file system clients [9]; the Slice model decouples these functions, preserving compatibility with existing clients. In addition, Slice extends the NASD project approach to support scalable file management as well as high-bandwidth I/O for large files.
`
A second group of scalable storage systems layers the file system functions above a network storage volume using a shared disk model. Policies for striping, redundancy, and storage site selection are specified on a volume basis; cluster nodes coordinate their accesses to the shared storage blocks using an ownership protocol. This approach has been used with both log-structured (Zebra [12] and xFS [4]) and conventional (Frangipani/Petal [16, 24] and GFS [21]) file system organizations. The cluster may be viewed as "serverless" if all nodes are trusted and have direct access to the shared disk, or alternatively the entire cluster may act as a file server to untrusted clients using a standard network file protocol, with all I/O passing through the cluster nodes as they mediate access to the disks.
`
The key benefits of Slice request routing apply equally to these shared disk systems when untrusted clients are present. First, request routing is a key to incorporating secure network-attached block storage, which allows untrusted clients to address storage objects directly without compromising the integrity of the file system. That is, a μproxy could route bulk I/O requests directly to the devices, yielding a more scalable system that preserves compatibility with standard clients and allows per-file policies for block placement, parity or replication, prefetching, etc. Second, request routing enhances locality in the request stream to the file servers, improving cache effectiveness and reducing block contention among the servers.
`
The shared disk model is used in many commercial systems, which increasingly interconnect storage devices and servers with dedicated Storage Area Networks (SANs), e.g., FibreChannel. This paper explores storage request routing for Internet networks, but the concepts are equally applicable in SANs.
`
Our proposal to separate small-file I/O from the request stream is similar in concept to the Amoeba Bullet Server [25], a specialized file server that optimizes small files. As described in Section 4.4, the prototype small-file server draws on techniques from the Bullet Server, FFS fragments [19], and SquidMLA [18], a Web proxy server that maintains a user-level "filesystem" of small cached Web pages.
`
`
`
`
`3 Request Routing Policies
`
This section explains the structure of the μproxy and the request routing schemes used in the Slice prototype. The purpose is to illustrate concretely the request routing policies enabled by the architecture, and the implications of those policies for the way the servers interact to maintain and recover consistent file system states. We use the NFS V3 protocol as a reference point because it is widely understood and our prototype supports it.
`
The μproxy intercepts NFS requests addressed to virtual NFS servers, and routes the request to a physical server by applying a function to the request type and arguments. It then rewrites the IP address and port to redirect the request to the selected server. When a response arrives, the μproxy rewrites the source address and port before forwarding it to the client, so the response appears to originate from the virtual NFS server.
`
The request routing functions must permit reconfiguration to add or remove servers, while minimizing state requirements in the μproxy. The μproxy directs most requests by extracting relevant fields from the request, perhaps hashing to combine multiple fields, and interpreting the result as a logical server site ID for the request. It then looks up the corresponding physical server in a compact routing table. Multiple logical sites may map to the same physical server, leaving flexibility for reconfiguration (Section 3.3.1). The routing tables constitute soft state; the mapping is determined externally, so the μproxy never modifies the tables.
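
To illustrate this table-driven routing, the following Python sketch shows how a μproxy-like component might hash request fields into a logical site ID and map it to a physical server. The hash function, table size, and addresses are assumptions for exposition; the actual prototype is an in-kernel IP packet filter, not Python code.

```python
import zlib

NUM_LOGICAL_SITES = 64  # hypothetical: table size and rebalancing granularity

# routing_table[logical_site_id] -> (server_ip, port); soft state loaded from
# an external source and never modified by the proxy itself.
routing_table = {i: ("10.0.0.%d" % (1 + (i % 4)), 2049)
                 for i in range(NUM_LOGICAL_SITES)}

def logical_site(*fields: bytes) -> int:
    """Hash one or more request fields into a logical server site ID."""
    h = 0
    for f in fields:
        h = zlib.crc32(f, h)
    return h % NUM_LOGICAL_SITES

def route(file_id: bytes) -> tuple:
    """Map a request keyed on its fileID to a physical server address."""
    return routing_table[logical_site(file_id)]

# The proxy would then rewrite the packet's destination IP and port to the
# selected server, and rewrite the source of the reply on its way back so the
# response appears to come from the virtual NFS server.
dest_ip, dest_port = route(b"example-fileID")
```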
`
The μproxy examines up to four fields of each request, depending on the policies configured:
`
- Request type. Routing policies are keyed by the NFS request type, so the μproxy may employ different policies for different functions. Table 1 lists the important NFS request groupings discussed in this paper.
`
- File handle. Each NFS request targets a specific file or directory, named by a unique identifier called a file handle (or fhandle). Although NFS fhandles are opaque to the client, their structure can be known to the μproxy, which acts as an extension of the service. Directory servers encode a fileID in each fhandle, which the μproxies extract as a routing key.
`
- Read/write offset. NFS I/O operations specify the range of offsets covered by each read and write. The μproxy uses these fields to select the server or storage node for the data.
`
- Name component. NFS name space requests include a symbolic name component in their arguments (see Table 1). A key challenge for scaling file management is to obtain a balanced distribution of these requests. This is particularly important for name-intensive workloads with small files and heavy create/lookup/remove activity, as often occurs in Internet services for mail, news, message boards, and Web access.
`
We now outline some μproxy policies that use these fields to route specific request groups.
`
3.1 Block I/O
`
Request routing for read/write requests has two goals: separate small-file read/write traffic from bulk I/O, and decluster the blocks of large files across the storage nodes for the desired access properties (e.g., high bandwidth or a specified level of redundancy). We address each in turn.
`
When small-file servers are configured, the prototype's routing policy defines a fixed threshold offset (e.g., 64KB); the μproxy directs I/O requests below the threshold to a small-file server selected from the request fhandle. The threshold offset is necessary because the size of each file may change at any time. Thus the small-file servers also receive a subset of the I/O requests on large files; they receive all I/O below the threshold, even if the target file is large. In practice, large files have little impact on the small-file servers because there tends to be a small number of these files, even if they make up a large share of the stored bytes. Similarly, large-file I/O below the threshold is limited by the bandwidth of the small-file server, but this affects only the first threshold bytes, and becomes progressively less significant as the file grows.
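
A minimal sketch of this threshold split, assuming a hypothetical 64KB threshold, server count, and hash (the real μproxy selects the small-file server through its routing table):

```python
import zlib

THRESHOLD = 64 * 1024        # hypothetical threshold offset (e.g., 64KB)
NUM_SMALL_FILE_SERVERS = 4   # hypothetical

def route_io(file_id: bytes, offset: int) -> str:
    """Split read/write traffic at the threshold offset."""
    if offset < THRESHOLD:
        # All I/O below the threshold goes to a small-file server selected
        # from the request fhandle, even if the target file is large.
        server = zlib.crc32(file_id) % NUM_SMALL_FILE_SERVERS
        return "small-file-server-%d" % server
    # I/O above the threshold bypasses the file managers and routes
    # directly to the network storage array.
    return "storage-array"

print(route_io(b"example-fileID", 16 * 1024))    # -> a small-file server
print(route_io(b"example-fileID", 1024 * 1024))  # -> storage-array
```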
`
The μproxy redirects I/O traffic above the threshold directly to the network storage array, using some placement policy to select the storage site(s) for each block. A simple option is to employ static striping and placement functions that compute on the block offset and/or fileID. More flexible placement policies would allow the μproxy to consider other factors, e.g., load conditions on the network or storage nodes, or file attributes encoded in the fhandle. To generalize to more flexible placement policies, Slice optionally records block locations in per-file block maps managed by the block service coordinators. The μproxies interact with the coordinators to fetch and cache fragments of the block maps as they handle I/O operations on files.
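
As an illustration of the static option, the following sketch computes a storage node from the block offset and fileID. The block size, array size, and hash are assumptions; a configuration using block maps would substitute a map lookup here.

```python
import zlib

BLOCK_SIZE = 64 * 1024    # hypothetical block size
NUM_STORAGE_NODES = 8     # hypothetical storage array size

def place_block(file_id: bytes, offset: int) -> int:
    """Static striping: pick a storage node from the block offset and fileID."""
    block_number = offset // BLOCK_SIZE
    # Hash the fileID so different files start their stripes on different
    # nodes, then stride across the array by block number.
    start = zlib.crc32(file_id) % NUM_STORAGE_NODES
    return (start + block_number) % NUM_STORAGE_NODES
```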
`
`
`
`
`
`
Name Space Operations
  lookup(dir, name) returns (fhandle, attr): Look up a name in dir; return handle and attributes.
  create(dir, name), mkdir(dir, name) returns (fhandle, attr): Create a file/directory and update the parent entry/link count and modify timestamp.
  remove(dir, name), rmdir(dir, name): Remove a file/directory or hard link and update the parent entry/link count and modify timestamp.
  link(olddir, oldname, newdir, newname) returns (fhandle, attr): Create a new name for a file, update the file link count, and update modify timestamps on the file and newdir.
  rename(olddir, oldname, newdir, newname) returns (fhandle, attr): Rename an existing file or hard link; update the link count and modify timestamp on both the old and new parent.

Attribute Operations
  getattr(object) returns (attr): Retrieve the attributes of a file or directory.
  setattr(object, attr): Modify the attributes of a file or directory, and update its modify timestamp.

I/O Operations
  read(file, offset, len) returns (data, attr): Read data from a file, updating its access timestamp.
  write(file, offset, len) returns (data, attr): Write data to a file, updating its modify timestamp.

Directory Retrieval
  readdir(dir, cookie) returns (entries, cookie): Read some or all of the entries in a directory.

Table 1: Some important Network File System (NFS) protocol operations.
`
As one example of an attribute-based policy, Slice supports a mirrored striping policy that replicates each block of a mirrored file on multiple storage nodes, to tolerate failures up to the replication degree. Mirroring consumes more storage and network bandwidth than striping with parity, but it is simple and reliable, avoids the overhead of computing and updating parity, and allows load-balanced reads [5, 16].
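
A sketch of how such a per-file mirrored policy might choose a replica set; the parameters and hash are illustrative only, not the prototype's actual placement function.

```python
import zlib

BLOCK_SIZE = 64 * 1024
NUM_STORAGE_NODES = 8

def place_mirrored(file_id: bytes, offset: int, degree: int = 2) -> list:
    """Return `degree` distinct storage nodes holding replicas of one block."""
    block_number = offset // BLOCK_SIZE
    start = (zlib.crc32(file_id) + block_number) % NUM_STORAGE_NODES
    # Neighboring nodes hold the extra replicas; a read may be served by any
    # member of the set, allowing load-balanced reads.
    return [(start + i) % NUM_STORAGE_NODES for i in range(degree)]
```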
`
`3.2 Name Space Operations
`
Effectively distributing name space requests presents different challenges from I/O request routing. Name operations involve more computation, and name entries may benefit more from caching because they tend to be relatively small and fragmented. Moreover, directories are frequently shared. Directory servers act as synchronization points to preserve integrity of the name space, e.g., to prevent clients from concurrently creating a file with the same name, or removing a directory while a name create is in progress.
`
A simple approach to scaling a file service is to partition the name space into a set of volumes, each managed by a single server. Unfortunately, this VOLUME PARTITIONING strategy compromises transparency and increases administrative overhead in two ways. First, volume boundaries are visible to clients as mount points, and naming operations such as link and rename cannot cross volume boundaries. Second, the system develops imbalances if volume loads grow at different rates, requiring intervention to repartition the name space. This may be visible to users through name changes to existing directories.
`
An important goal of name management in Slice is to automatically distribute the load of a single file volume across multiple servers, without imposing user-visible volume boundaries. We propose two alternative name space routing policies to achieve this goal. MKDIR SWITCHING yields balanced distributions when the average number of active directories is large relative to the number of directory server sites, but it binds large directories to a single server. For workloads with very large directories, NAME HASHING yields probabilistically balanced request distributions independent of workload. The cost of this effectiveness is that more operations cross server boundaries, increasing the cost and complexity of coordination among the directory servers (Section 4.3).
MKDIR SWITCHING works as follows. In most cases, the μproxy routes name space operations to the directory server that manages the parent directory; the μproxy identifies this server by indexing its routing table with the fileID from the parent directory fhandle in the request (refer to Table 1). On a mkdir request, the μproxy decides with probability p to redirect the request to a different directory server, placing the new directory (and its descendants) on a different site from the parent directory. The policy uniquely selects the new server by hashing on the parent fhandle and the symbolic name of the new directory; this guarantees that races over name manipulation involve at most two sites. Reducing directory affinity by increasing p makes the policy more aggressive in distributing name entries across sites; this produces a more balanced load, but more operations involve multiple sites. Section 5 presents experimental data illustrating this tradeoff.
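
The following Python sketch captures the MKDIR SWITCHING decision as described above. The server count, probability value, and hash function are placeholder assumptions, and for brevity the default path uses a direct hash in place of the fileID-indexed routing table.

```python
import random
import zlib

NUM_DIRECTORY_SERVERS = 4   # hypothetical
P_SWITCH = 0.25             # the probability p discussed above

def route_name_op(op: str, parent_file_id: bytes, name: str) -> int:
    """Choose a directory server index for a name space operation."""
    if op == "mkdir" and random.random() < P_SWITCH:
        # Redirect: the new directory (and its descendants) land on a server
        # chosen deterministically from (parent fhandle, name), so any race
        # over this name involves at most two sites.
        return zlib.crc32(name.encode(), zlib.crc32(parent_file_id)) % NUM_DIRECTORY_SERVERS
    # Default: follow the server that manages the parent directory.
    return zlib.crc32(parent_file_id) % NUM_DIRECTORY_SERVERS
```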
`
NAME HASHING extends this approach by routing all name space operations using a hash on the name component and its position in the directory tree, as given by the parent directory fhandle. This approach represents the entire volume name space as a unified global hash table distributed among the directory servers. It views directories as distributed collections of name entries, rather than as files accessed as a unit. Conflicting operations on any given name entry (e.g., create/create, create/remove, remove/lookup) always hash to the same server, where they serialize on the shared hash chain. Operations on different entries in the same directory (e.g., create, remove, lookup) may proceed in parallel at multiple sites. For good performance, NAME HASHING requires sufficient memory to keep the hash chains memory-resident, since the hashing function sacrifices locality in the hash chain accesses. Also, readdir operations span multiple sites; this is the right behavior for large directories, but it increases readdir costs for small directories.
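
A corresponding sketch of the NAME HASHING routing function, again with placeholder parameters and hash:

```python
import zlib

NUM_DIRECTORY_SERVERS = 4   # hypothetical

def name_hash_route(parent_file_id: bytes, name: str) -> int:
    """Route a name entry by hashing (parent directory, name component)."""
    h = zlib.crc32(name.encode(), zlib.crc32(parent_file_id))
    return h % NUM_DIRECTORY_SERVERS

# Conflicting operations on the same entry always meet at one server, while
# entries in the same directory spread across servers; readdir must therefore
# contact every server that may hold entries for the directory.
```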
`
`3.3 Storage Service Structure
`
Request routing policies impact storage service structure. The primary challenges are coordination and recovery to maintain a consistent view of the file volume across all servers, and reconfiguration to add or remove servers within each class.
`
Most of the routing policies outlined above are independent of whether small files and name entries are bound to the server sites that create them. One option is for the servers to share backing objects from a shared disk using a block ownership protocol (see Section 2.5); in this case, the role of the μproxy is to enhance locality in the request stream to each server. Alternatively, the system may use fixed placement in which items are controlled by their create sites unless reconfiguration or failover causes them to move; with this approach, backing storage objects may be private to each site, even if they reside on shared network storage. Fixed placement stresses the role of the request routing policy in the placement of new name entries or data items. The next two subsections discuss reconfiguration and recovery issues for the Slice architecture with respect to these structural alternatives.
`
`3.3.1 Reconfiguration
`
Consider the problem of reconfiguration to add or remove file managers, i.e., directory servers, small-file servers, or map coordinators. For requests routed by keying on the fileID, the system updates μproxy routing tables to change the binding from fileIDs to physical servers if servers join or depart the ensemble. To keep the tables compact, Slice maps the fileID to a smaller logical server ID before indexing the table. The number of logical servers defines the size of the routing tables and the minimal granularity for rebalancing. The μproxy's copy of the routing table is a "hint" that may become stale during reconfiguration; the μproxy may load new tables lazily from an external source, assuming that servers can identify misdirected requests.
`
This approach generalizes to policies in which the logical server ID is derived from a hash that includes other request arguments, as in the NAME HASHING approach. For NAME HASHING systems and other systems with fixed placement, the reconfiguration procedure must move logical servers from one physical server to another. One approach is for each physical server to use multiple backing objects, one for each hosted logical server, and reconfigure by reassigning the binding of physical servers to backing objects in the shared network storage array. Otherwise, reconfiguration must copy data from one backing object to another. In general, an ensemble with N servers must move 1/Nth of its data to rebalance after adding or losing a physical server [15].
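
The following sketch illustrates rebalancing at logical-server granularity when a physical server is added; the table contents, server names, and quota computation are assumptions for exposition. Only the reassigned logical IDs (roughly 1/Nth of them) change their binding, so only that share of the data moves.

```python
NUM_LOGICAL_SERVERS = 64   # fixed; defines the granularity of rebalancing

def add_physical_server(table: dict, new_server: str) -> dict:
    """Rebalance by reassigning roughly 1/Nth of the logical server IDs."""
    n = len(set(table.values())) + 1          # server count after the addition
    quota = NUM_LOGICAL_SERVERS // n          # target share for the newcomer
    new_table = dict(table)
    for lsid in sorted(table)[:quota]:
        new_table[lsid] = new_server          # only these bindings change
    return new_table

table = {lsid: "server-%d" % (lsid % 3) for lsid in range(NUM_LOGICAL_SERVERS)}
table = add_physical_server(table, "server-3")   # ~1/4 of the IDs move
```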
`
`3.3.2 Atomicity and Recovery
`
File systems have strong integrity requirements and frequent updates; the system must preserve their integrity through failures and concurrent operations. The focus on request routing naturally implies that the multiple servers must manage distributed state.
`
File managers prepare for recovery by generating a write-ahead log in shared storage. For systems that use the shared-disk model without fixed placement, all operations execute at a single manager site, and it is necessary and sufficient for the system to provide locking and recovery pro