`
OPERATING
`SYSTEMS
`REVIEW
`
`A Publication of the
`Association for Computing Machinery
`Special Interest Group on Operating Systems
`
`Volume 35, Number 5
`
`December, 2001
`
`Proceedings of the
`
`18th ACM Symposium on
`Operating Systems Principles
(SOSP'01)
`
`October 21-24, 2001
`Chateau Lake Louise, Banff, Alberta, Canada
`
`
`
`
`A Publication of the ACM Special Interest Group on Operating Systems
`
OPERATING SYSTEMS REVIEW
`
`Chairperson
`William E. Weihl
`Akamai Technologies, Inc.
`1400 Fashion Island Blvd.
San Mateo, CA 94404 USA
+1-650-627-5259
`(b-weihl@akamai.com)
`
`Newsletter Editor
`William M. Waite
`Dept. of Electrical and
`Computer Engineering
`University of Colorado
`Boulder, CO 80309-0425 USA
`+ 1-303-492-7204
`(William. Waite@Colorado.edu)
`
`V. Chairperson
`Valerie Issarny
`INRIA - Rennes
IRISA
`Campus de Beaulieu
`35042 Rennes Cedex France
`+33-2-99-84-7462
(issarny@irisa.fr)
`
`Secretary-Treasurer
`David Kotz
`Dept. of Computer Science
`Dartmouth College
`6211 Sudikoff Laboratory
`Hanover, NH 03755 USA
`+ 1-603-646-1439
`(dfk@cs.dartmouth.edu)
`Information Director
`Michael Dahlin
`Department of Computer Sciences
`Campus Mail Code: C0500
`University of Texas
`Austin, TX 78712 USA
+1-512-471-9549
`(dahlin@cs.utexas.edu)
`
`OPERATING SYSTEMS REVIEW is an informal publication of the ACM Special Interest Group on
Operating Systems (SIGOPS), whose scope of interest includes: Computer operating systems and architecture
for multiprogramming, multiprocessing, and time sharing; resource management; evaluation and
`simulation; reliability, integrity, and security of data; communications among computing processors; and
`computer system modeling analysis.
`
Membership in SIGOPS (at $15 per annum) is open to ACM members and associate members, and student
members (at $8 per annum); non-ACM members may join also (at $42 per annum). Non-members of
`SIGOPS may subscribe to OPERATING SYSTEMS REVIEW at $30 per year. All SIGOPS members
`receive OPERATING SYSTEMS REVIEW.
`
`SIGOPS membership application forms are available from ACM Headquarters (or see back cover of this
issue), 1515 Broadway, 17th Floor, New York, New York 10036, telephone (212) 869-7440. Changes of
`address and other matters pertaining to the SIGOPS mailing list should be directed to ACM Headquarters,
not to any of the SIGOPS officers.
`
Contributions to OSR are sent to the editor and should be single spaced and printed on one side of the
paper only. Text and diagrams must be contained within a 6.5 inch wide by 8.5 inch high (165mm by
216mm) area with at least one inch (25mm) top and side margins, 1.5 inch (38mm) bottom margin, and be
camera-ready (i.e., be an original having sufficient contrast and density for reproduction). All letters to
the editor will be considered for publication unless accompanied by a request to the contrary. Except for
editorial items, all sources of material appearing in OPERATING SYSTEMS REVIEW will be clearly
identified. Items attributed to individuals will ordinarily be interpreted as personal rather than organizational
opinions. The technical material in this newsletter is not refereed. Deadlines for submitting
material are the 15th of February, May, August, and November. The SIGOPS conference proceedings
will be mailed as the December issue in odd numbered calendar years, and the ASPLOS conference
proceedings will be mailed in even numbered calendar years.
OPERATING SYSTEMS REVIEW (ISSN 0163-5980) is published five times a year (January, April,
July, October and December) by the Association for Computing Machinery Inc., 1515 Broadway, New
York, NY 10036. Periodicals postage paid at New York, NY 10001, and at additional mailing offices.
POSTMASTER: Send address changes to OPERATING SYSTEMS REVIEW, ACM, 1515 Broadway,
New York, NY 10036.
`
`Subscriptions: Annual subscription cost of $12.53 is included in the member dues of $15.00 (for students
cost is included in $8.00); the non-member annual subscription is $42.00.
`
`
`
`
`Proceedings of the
`18th ACM Symposium on
`Operating Systems Principles
(SOSP'01)
`
`October 21-24, 2001
`Chateau Lake Louise, Banff, Alberta, Canada
`
`Sponsored by ACM SIGOPS
(Association for Computing Machinery Special Interest Group on Operating Systems).
`
`Supported by Microsoft Research, Intel Corporation, QUALCOMM, Mercury Computer Systems,
`Hewlett-Packard Labs, Akamai, and Inktomi.
`
`
`
`
`The Association for Computing Machinery
`1515 Broadway
New York, N.Y. 10036
`
`Copyright© 2001 by the Association for Computing Machinery, Inc. (ACM). Permission to
`make digital or hard copies of portions of this work for personal or classroom use is granted
`without fee provided that the copies are not made or distributed for profit or commercial
`advantage and that copies bear this notice and the full citation on the first page. Copyright for
`components of this work owned by others than ACM must be honored. Abstracting with credit is
`permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires
`prior specific permission and/or a fee. Request permission to republish from: Publications Dept.
ACM, Inc. Fax +1 (212) 869-0481 or E-mail permissions@acm.org.
`
`For other copying of articles that carry a code at the bottom of the first or last page, copying is
`permitted provided that the per-copy fee indicated in the code is paid through the Copyright
`Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, +1-978-750-8400, +1-978-750-
`4470 (fax).
`
`Notice to Past Authors of ACM-Published Articles
`
`ACM intends to create a complete electronic archive of all articles and/or other material
`previously published by ACM. If you have written a work that was
`previously published by ACM in any journal or conference proceedings prior to
`1978, or any SIG Newsletter at any time, and you do NOT want this work to
`appear in the ACM Digital Library, please inform permissions@acm.org, stating
`the title of the work, the author(s), and where and when published.
`
`ACM ISBN: 1-58113-389-8
`
`Additional copies may be ordered prepaid from:
`
`ACM Order Department
`P.O. Box 11405
`New York, N.Y. 10286-1405
`
`Phone: 1-800-342-6626
`(U.S.A. and Canada)
`+ 1-212-626-0500
`(All other countries)
`Fax: +1-212-944-1318
`E-mail: acmhelp@acm.org
`
`ACM Order Number: 534010
`
`Printed in the U.S.A.
`
`
`
`
`Wide-area cooperative storage with CFS
`
Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, Ion Stoica*
MIT Laboratory for Computer Science
`chord@lcs.mit.edu
`http://pdos.lcs.mit.edu/chord/
`
`Abstract
The Cooperative File System (CFS) is a new peer-to-peer read-only
storage system that provides provable guarantees for the efficiency,
robustness, and load-balance of file storage and retrieval.
`CFS does this with a completely decentralized architecture that can
`scale to large systems. CFS servers provide a distributed hash table
(DHash) for block storage. CFS clients interpret DHash blocks as
a file system. DHash distributes and caches blocks at a fine granularity
to achieve load balance, uses replication for robustness, and
`decreases latency with server selection. DHash finds blocks using
`the Chord location protocol, which operates in time logarithmic in
`the number of servers.
`CFS is implemented using the SFS file system toolkit and runs
on Linux, OpenBSD, and FreeBSD. Experience on a globally deployed
prototype shows that CFS delivers data to clients as fast
`as FTP. Controlled tests show that CFS is scalable: with 4,096
`servers, looking up a block of data involves contacting only seven
`servers. The tests also demonstrate nearly perfect robustness and
`unimpaired performance even when as many as half the servers
`fail.
`
`1.
`Introduction
Existing peer-to-peer systems (such as Napster [20], Gnutella [11],
and Freenet [6]) demonstrate the benefits of cooperative
`storage and serving: fault tolerance, load balance, and the ability to
`harness idle storage and network resources. Accompanying these
benefits are a number of design challenges. A peer-to-peer architecture
should be symmetric and decentralized, and should operate
`well with unmanaged volunteer participants. Finding desired data
`in a large system must be fast; servers must be able to join and leave
`the system frequently without affecting its robustness or efficiency;
`and load must be balanced across the available servers. While the
`peer-to-peer systems in common use solve some of these problems,
`
*University of California, Berkeley
`
`This research was sponsored by the Defense Advanced Research
Projects Agency (DARPA) and the Space and Naval Warfare Systems
Center, San Diego, under contract N66001-00-1-8933.
`
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage
and that copies bear this notice and the full citation on the first page.
To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
SOSP'01 Banff, Canada
© 2001 ACM ISBN 1-58113-389-8-1/01/10...$5.00
`
`none solves all of them. CFS (the Cooperative File System) is a
`new design that meets all of these challenges.
`A CFS file system exists as a set of blocks distributed over the
`available CFS servers. CFS client software interprets the stored
`blocks as file system data and meta-data and presents an ordinary
`read-only file-system interface to applications. The core of the CFS
`software consists of two layers, DHash and Chord. The DHash
(distributed hash) layer performs block fetches for the client, distributes
the blocks among the servers, and maintains cached and
replicated copies. DHash uses the Chord [31] distributed lookup
`system to locate the servers responsible for a block. This table
`summarizes the CFS software layering:
`
Layer    Responsibility
FS       Interprets blocks as files; presents a file system interface to applications.
DHash    Stores unstructured data blocks reliably.
Chord    Maintains routing tables used to find blocks.
`
`Chord implements a hash-like operation that maps from block
`identifiers to servers. Chord assigns each server an identifier drawn
`from the same 160-bit identifier space as block identifiers. These
`identifiers can be thought of as points on a circle. The mapping
`that Chord implements takes a block's ID and yields the block's
`successor, the server whose ID most closely follows the block's ID
on the identifier circle. To implement this mapping, Chord maintains
at each server a table with information about O(log N) other
`servers, where N is the total number of servers. A Chord lookup
`sends messages to O(log N) servers to consult their tables. Thus
`CFS can find data efficiently even with a large number of servers,
`and servers can join and leave the system with few table updates.
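To make the successor mapping concrete, the following sketch (in Python, which is not the language CFS is built in) computes a block's successor given a full list of server IDs. The names node_id and successor are illustrative, and the sketch sidesteps Chord's distributed O(log N) routing by assuming global knowledge of the membership.

    import hashlib

    def node_id(name: str) -> int:
        # 160-bit identifier, e.g. obtained by hashing a server's address.
        return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

    def successor(server_ids, key):
        # The server whose ID most closely follows the key on the circle:
        # the smallest ID >= key, wrapping around to the lowest ID.
        ids = sorted(server_ids)
        for sid in ids:
            if sid >= key:
                return sid
        return ids[0]

    servers = [node_id(f"server{i}.example.org:0") for i in range(8)]
    block = node_id("contents of some block")
    print(hex(successor(servers, block)))
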
DHash layers block management on top of Chord. DHash provides
load balance for popular large files by arranging to spread the
`blocks of each file over many servers. To balance the load imposed
`by popular small files, DHash caches each block at servers likely to
be consulted by future Chord lookups for that block. DHash supports
pre-fetching to decrease download latency. DHash replicates
`each block at a small number of servers, to provide fault tolerance.
`DHash enforces weak quotas on the amount of data each server can
inject, to deter abuse. Finally, DHash allows control over the number
of virtual servers per server, to provide control over how much
`data a server must store on behalf of others.
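The sketch below is a toy, in-memory illustration of the DHash behaviors just listed (content-hashed blocks, replication on a few servers, and per-publisher quotas). The replica count, quota size, and the modular responsible() placement rule are assumptions made for illustration, not the paper's actual parameters or placement policy.

    import hashlib

    NUM_REPLICAS = 3          # "small number of servers" per block (assumed value)
    QUOTA_BYTES = 1 << 20     # per-publisher quota (assumed value)

    servers = {i: {} for i in range(8)}   # server index -> {block_id: data}
    usage = {}                            # publisher address -> bytes inserted

    def responsible(block_id):
        # Stand-in for a Chord lookup: pick NUM_REPLICAS servers for the block.
        start = int.from_bytes(block_id, "big") % len(servers)
        return [(start + k) % len(servers) for k in range(NUM_REPLICAS)]

    def put(publisher, data):
        # Enforce a weak per-publisher quota, then replicate the block.
        if usage.get(publisher, 0) + len(data) > QUOTA_BYTES:
            raise PermissionError("quota exceeded")
        block_id = hashlib.sha1(data).digest()   # content-hash identifier
        for s in responsible(block_id):
            servers[s][block_id] = data
        usage[publisher] = usage.get(publisher, 0) + len(data)
        return block_id

    def get(block_id):
        # Any surviving replica suffices; server selection and caching omitted.
        for s in responsible(block_id):
            if block_id in servers[s]:
                return servers[s][block_id]
        raise KeyError("block not found")

    bid = put("198.51.100.7", b"one block of a file")
    assert get(bid) == b"one block of a file"
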
CFS has been implemented using the SFS toolkit [16]. This
paper reports experimental results from a small international deployment
of CFS servers and from a large-scale controlled test-bed.
`These results confirm the contributions of the CFS design:
`
`• an aggressive approach to load balance by spreading file
`blocks randomly over servers;
`
`202
`
`Page 6 of 19
`
`
`
• download performance on an Internet-wide prototype deployment
as fast as standard FTP;
`
`• provable efficiency and provably fast recovery times after
`failure;
`
`• simple algorithms to achieve the above results.
`
`CFS is not yet in operational use, and such use will likely prompt
`refinements to its design. One potential area for improvement is
`the ability of the Chord lookup algorithm to tolerate malicious
`participants, by verifying the routing information received from
`other servers. Another area that CFS does not currently address
`is anonymity; it is expected that anonymity, if needed, would be
`layered on top of the basic CFS system.
`The remainder of this paper is organized as follows. Section 2
`discusses related work. Section 3 outlines the design and design
goals of CFS. Section 4 describes the Chord location protocol. Section 5
presents the detailed design of CFS. Section 6 describes implementation
details, and Section 7 presents experimental results.
`Section 8 discusses open issues and future work. Finally, Section 9
`summarizes our findings and concludes the paper.
`
`2. Related Work
CFS was inspired by Napster [20], Gnutella [11], and particularly
Freenet [6]. CFS uses peer-to-peer distributed hashing similar
in spirit to a number of ongoing research projects [26, 29, 35]. In
comparison to existing peer-to-peer file sharing systems, CFS offers
simplicity of implementation and high performance without
`compromising correctness. CFS balances server load, finds data
`quickly for clients, and guarantees data availability in the face of
server failures with very high probability. CFS, as a complete system,
has individual aspects in common with many existing systems.
`The major relationships are summarized below.
`2.1 Naming and Authentication
CFS authenticates data by naming it with public keys or content-hashes,
as do many other distributed storage systems [9, 6, 7, 13,
29, 34]. The use of content-hashes to securely link together different
pieces of data is due to Merkle [18]; the use of public keys to
authentically name data is due to the SFS system [17].
`CFS adopts naming, authentication, and file system structure
ideas from SFSRO [9], which implements a secure distributed read-only
file system, that is, a file system in which files can be modified
only by their owner, and only through complete replacement of
`the file. However, SFSRO and CFS have significant differences at
`the architectural and mechanism levels. SFSRO defines protocols
`and authentication mechanisms which a client can use to retrieve
`data from a given server. CFS adds the ability to dynamically find
`the server currently holding the desired data, via the Chord location
`service. This increases the robustness and the availability of CFS,
`since changes in the set of servers are transparent to clients.
`2.2 Peer-to-Peer Search
Napster [20] and Gnutella [11] are arguably the most widely used
peer-to-peer file systems today. They present a keyword search interface
to clients, rather than retrieving uniquely identified data. As
a result they are more like search engines than distributed hash tables,
and they trade scalability for this power: Gnutella broadcasts
`search queries to many machines, and Napster performs searches
`at a central facility. CFS as described in this paper doesn't provide
`search, but we are developing a scalable distributed search engine
`for CFS.
`
Mojo Nation [19] is a broadcast query peer-to-peer storage system
which divides files into blocks and uses a secret sharing algorithm
to distribute the blocks to a number of hosts. CFS also divides
`files into blocks but does not use secret sharing.
`
`2.3 Anonymous Storage
Freenet [5] uses probabilistic routing to preserve the anonymity
of clients, publishers, and servers. The anonymity requirement
limits Freenet's reliability and performance. Freenet avoids associating
a document with any predictable server, and avoids forming
`any globally coherent topology among servers. The former means
`that unpopular documents may simply disappear from the system,
`since no server has the responsibility for maintaining replicas. The
`latter means that a search may need to visit a large fraction of the
`Freenet network. As an example, Hong shows in his Figure 14-12
`that in a network with 1,000 servers, the lookup path length can
exceed 90 hops [23]. This means that if the hop count is limited
`to 90, a lookup may fail even though the document is available.
`Because CFS does not try to provide anonymity, it can guarantee
`much tighter bounds on lookup cost; for example, in a 4,096-node
system, lookups essentially never exceed 10 hops.
`CFS's caching scheme is similar to Freenet's in the sense that
`both leave cached copies of data along the query path from client
to where the data was found. Because CFS finds data in significantly
fewer hops than Freenet, and CFS' structured lookup paths
`are more likely to overlap than Freenet's, CFS can make better use
`of a given quantity of cache space.
Like Freenet, Publius [34] focuses on anonymity, but achieves
it with encryption and secret sharing rather than routing. Publius
requires a static, globally-known list of servers; it stores each
`share at a fixed location that is predictable from the file name. Free
`Haven [7] uses both cryptography and routing (using re-mailers [4])
`to provide anonymity; like Gnutella, Free Haven finds data with a
`global search.
`CFS does not attempt to provide anonymity, focusing instead on
`efficiency and robustness. We believe that intertwining anonymity
`with the basic data lookup mechanism interferes with correctness
`and performance. On the other hand, given a robust location and
`storage layer, anonymous client access to CFS could be provided
by separate anonymizing proxies, using techniques similar to those
proposed by Chaum [4] or Reiter and Rubin [27].
`
`2.4 Peer-to-Peer Hash Based Systems
`CFS layers storage on top of an efficient distributed hash lookup
algorithm. A number of recent peer-to-peer systems use similar
approaches with similar scalability and performance, including
CAN [26], PAST [28, 29], OceanStore [13, 35], and Ohaha [22]. A
detailed comparison of these algorithms can be found in [31].
The PAST [29] storage system differs from CFS in its approach
`to load balance. Because a PAST server stores whole files, a server
`might not have enough disk space to store a large file even though
`the system as a whole has sufficient free space. A PAST server
`solves this by offloading files it is responsible for to servers that do
`have spare disk space. PAST handles the load of serving popular
`files by caching them along the lookup path.
`CFS stores blocks, rather than whole files, and spreads blocks
`evenly over the available servers; this prevents large files from
`causing unbalanced use of storage. CFS solves the related problem
`of different servers having different amounts of storage space with
`a mechanism called virtual servers, which gives server managers
control over disk space consumption. CFS' block storage granularity
helps it handle the load of serving popular large files, since
`the serving load is spread over many servers along with the blocks.
`
`203
`
`Page 7 of 19
`
`
`
Figure 1: CFS software structure. Vertical links are local APIs;
horizontal links are RPC APIs. (The figure shows a CFS client with
FS, DHash, and Chord layers, and two CFS servers each with DHash
and Chord layers.)
`
`This is more space-efficient, for large files, than whole-file caching.
`CFS relies on caching only for files small enough that distributing
`blocks is not effective. Evaluating the performance impact of block
`storage granularity is one of the purposes of this paper.
OceanStore [13] aims to build a global persistent storage utility.
It provides data privacy, allows client updates, and guarantees
durable storage. However, these features come at a price: complexity.
For example, OceanStore uses a Byzantine agreement protocol
for conflict resolution, and a complex protocol based on Plaxton
trees [24] to implement the location service [35]. OceanStore
assumes that the core system will be maintained by commercial
providers.
Ohaha [22] uses consistent hashing to map files and keyword
queries to servers, and a Freenet-like routing algorithm to locate
files. As a result, it shares some of the same weaknesses as Freenet.
`2.5 Web Caches
Content distribution networks (CDNs), such as Akamai [1],
handle high demand for data by distributing replicas on multiple
`servers. CDNs are typically managed by a central entity, while CFS
`is built from resources shared and owned by a cooperative group of
`users.
There are several proposed scalable cooperative Web caches [3,
8, 10, 15]. To locate data, these systems either multicast queries or
`require that some or all servers know about all other servers. As
`a result, none of the proposed methods is both highly scalable and
`robust. In addition, load balance is hard to achieve as the content
`of each cache depends heavily on the query pattern.
`Cache Resolver [30], like CFS, uses consistent hashing to evenly
map stored data among the servers [12, 14]. However, Cache Resolver
assumes that clients know the entire set of servers; maintaining
an up-to-date server list is likely to be difficult in a large
`peer-to-peer system where servers join and depart at unpredictable
`times.
`
`3. Design Overview
`CFS provides distributed read-only file storage. It is structured as
`a collection of servers that provide block-level storage. Publishers
`(producers of data) and clients (consumers of data) layer file system
`semantics on top of this block store much as an ordinary file system
is layered on top of a disk. Many unrelated publishers may store
`separate file systems on a single CFS system; the CFS design is
`intended to support the possibility of a single world-wide system
`consisting of millions of servers.
`3.1 System Structure
`Figure 1 illustrates the structure of the CFS software. Each CFS
`client contains three software layers: a file system client, a DHash
`
Figure 2: A simple CFS file system structure example. The
root-block is identified by a public key and signed by the corresponding
private key. The other blocks are identified by cryptographic
hashes of their contents.
`
`storage layer, and a Chord lookup layer. The client file system uses
`the DHash layer to retrieve blocks. The client DHash layer uses the
`client Chord layer to locate the servers that hold desired blocks.
`Each CFS server has two software layers: a DHash storage layer
and a Chord layer. The server DHash layer is responsible for storing
keyed blocks, maintaining proper levels of replication as servers
come and go, and caching popular blocks. The server DHash and
Chord layers interact in order to integrate looking up a block identifier
with checking for cached copies of the block. CFS servers
are oblivious to file system semantics: they simply provide a distributed
block store.
`CFS clients interpret DHash blocks in a file system format
`adopted from SFSRO [9]; the format is similar to that of the UNIX
`V7 file system, but uses DHash blocks and block identifiers in place
`of disk blocks and disk addresses. As shown in Figure 2, each block
`is either a piece of a file or a piece of file system meta-data, such as
`a directory. The maximum size of any block is on the order of tens
`of kilobytes. A parent block contains the identifiers of its children.
The publisher inserts the file system's blocks into the CFS system,
using a hash of each block's content (a content-hash) as its
identifier. Then the publisher signs the root block with his or her
private key, and inserts the root block into CFS using the corresponding
public key as the root block's identifier. Clients name a
file system using the public key; they can check the integrity of
the root block using that key, and the integrity of blocks lower in
the tree with the content-hash identifiers that refer to those blocks.
This approach guarantees that clients see an authentic and internally
consistent view of each file system, though under some circumstances
a client may see an old version of a recently updated
file system.
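As a rough illustration of this publishing scheme, the sketch below splits a file into blocks named by content hash, builds a parent block listing the children, and stores a root block under the publisher's public key. The block format and the stubbed-out signature are invented for the example; CFS's real format follows SFSRO and signs the root with the publisher's private key.

    import hashlib, json

    store = {}                                # identifier -> block bytes

    def put_content(data: bytes) -> str:
        # Insert a block under the hash of its contents (its content-hash).
        block_id = hashlib.sha1(data).hexdigest()
        store[block_id] = data
        return block_id

    # 1. Split a file into blocks and insert each one.
    file_bytes = b"x" * 50_000
    chunk_ids = [put_content(file_bytes[i:i + 8192])
                 for i in range(0, len(file_bytes), 8192)]

    # 2. A parent block holds the identifiers of its children.
    inode_id = put_content(json.dumps({"type": "file",
                                       "blocks": chunk_ids}).encode())

    # 3. The root block names the top-level directory and is stored under the
    #    publisher's public key; the signature here is a stub, whereas CFS
    #    signs the root block with the corresponding private key.
    public_key = "publisher-public-key"       # placeholder, not a real key
    root = {"dir": {"README": inode_id}, "timestamp": 1}
    store[public_key] = json.dumps(root).encode() + b"||stub-signature"
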
`A CFS file system is read-only as far as clients are concerned.
However, a file system may be updated by its publisher. This involves
updating the file system's root block in place, to make it
point to the new data. CFS authenticates updates to root blocks by
checking that the new block is signed by the same key as the old
block. A timestamp prevents replays of old updates. CFS allows
file systems to be updated without changing the root block's identifier
so that external references to data need not be changed when
the data is updated.
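A minimal sketch of that update check might look as follows; signed_by_same_key is a placeholder for a real public-key signature verification routine.

    def accept_root_update(old_root, new_root, signed_by_same_key):
        # Accept the new root block only if it carries a valid signature from
        # the same public key as the old root and a strictly newer timestamp
        # (the timestamp check rejects replayed old updates).
        if not signed_by_same_key(new_root):
            return False
        return new_root["timestamp"] > old_root["timestamp"]

    old = {"timestamp": 100}
    new = {"timestamp": 101}
    print(accept_root_update(old, new, lambda blk: True))   # True
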
`CFS stores data for an agreed-upon finite interval. Publishers
`that want indefinite storage periods can periodically ask CFS for an
extension; otherwise, a CFS server may discard data whose guaranteed
period has expired. CFS has no explicit delete operation:
`instead, a publisher can simply stop asking for extensions. In this
`area, as in its replication and caching policies, CFS relies on the
`assumption that large amounts of spare disk space are available.
`
`3.2 CFS Properties
CFS provides consistency and integrity of file systems by adopting
the SFSRO file system format. CFS extends SFSRO by providing
the following desirable properties:
`
• Decentralized control. CFS servers need not share any administrative
relationship with publishers. CFS servers could
be ordinary Internet hosts whose owners volunteer spare storage
and network resources.
`
`• Scalability. CFS lookup operations use space and messages
`at most logarithmic in the number of servers.
`
• Availability. A client can always retrieve data as long as
it is not trapped in a small partition of the underlying network,
and as long as one of the data's replicas is reachable
`using the underlying network. This is true even if servers are
`constantly joining and leaving the CFS system. CFS places
`replicas on servers likely to be at unrelated network locations
`to ensure independent failure.
`
`• Load balance. CFS ensures that the burden of storing and
`serving data is divided among the servers in rough proportion
`to their capacity. It maintains load balance even if some data
`are far more popular than others, through a combination of
`caching and spreading each file's data over many servers.
`
`• Persistence. Once CFS commits to storing data, it keeps it
`available for at least an agreed-on interval.
`
`• Quotas. CFS limits the amount of data that any particular IP
`address can insert into the system. This provides a degree of
`protection against malicious attempts to exhaust the system's
`storage.
`
• Efficiency. Clients can fetch CFS data with delay comparable
to that of FTP, due to CFS' use of efficient lookup algorithms,
caching, pre-fetching, and server selection.
`
`The next two sections present Chord and DHash, which together
`provide these properties.
`
`4. Chord Layer
CFS uses the Chord protocol to locate blocks [31]. Chord supports
just one operation: given a key, it will determine the node responsible
for that key. Chord does not itself store keys and values,
but provides primitives that allow higher-layer software to build a
wide variety of storage systems; CFS is one such use of the Chord
primitive. The rest of this section summarizes Chord and describes
new algorithms for robustness and server selection to support applications
such as CFS.
`4.1 Consistent Hashing
Each Chord node has a unique m-bit node identifier (ID), obtained
by hashing the node's IP address and a virtual node index.
Chord views the IDs as occupying a circular identifier space. Keys
are also mapped into this ID space, by hashing them to m-bit key
IDs. Chord defines the node responsible for a key to be the successor
of that key's ID. The successor of an ID j is the node with the
smallest ID that is greater than or equal to j (with wrap-around),
much as in consistent hashing [12].
Consistent hashing lets nodes enter and leave the network with
minimal movement of keys. To maintain correct successor mappings
when a node n joins the network, certain keys previously assigned
to n's successor become assigned to n. When node n leaves
the network, all of n's assigned keys are reassigned to its successor.
No other changes in the assignment of keys to nodes need occur.
`
`Consistent hashing is straightforward to implement, with
`constant-time lookups, if all nodes have an up-to-date list of all
other nodes. However, such a system does not scale; Chord provides
a scalable, distributed version of consistent hashing.
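The following toy sketch shows the minimal-key-movement property in the simple full-membership setting just described: every node knows all other node IDs, and when one node joins, only the keys between its predecessor and itself change owner. The addresses and key names are made up for the example.

    import hashlib, bisect

    def ident(name: str) -> int:
        return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

    def owner(node_ids, key):
        # Successor of the key: first node ID >= key, wrapping around.
        ids = sorted(node_ids)
        i = bisect.bisect_left(ids, key)
        return ids[i % len(ids)]

    nodes = [ident(f"10.0.0.{i}:0") for i in range(1, 6)]   # IP + virtual index
    keys = [ident(f"block-{i}") for i in range(1000)]
    before = {k: owner(nodes, k) for k in keys}

    nodes.append(ident("10.0.0.99:0"))                      # a new node joins
    moved = sum(before[k] != owner(nodes, k) for k in keys)
    print(f"{moved} of {len(keys)} keys changed owner")     # a small fraction
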
`
`4.2 The Chord Lookup Algorithm
`A Chord node uses two data structures to perform lookups: a
`successor list and a finger table. Only the successor list is required
`for correctness, so Chord is careful to maintain its accuracy. The
`finger table accelerates lookups, but does not need to be accurate,
`so Chord is less aggressive about maintaining it. The following
`discussion first describes how to perform correct (but slow) lookups
with the successor list, and then describes how to accelerate them
with the finger table. This discussion assumes that there are no
`malicious participants in the Chord protocol; while we believe that
`it should be possible for nodes to verify the routing information that
`other Chord participants send them, the algorithms to do so are left
`for future work.
Every Chord node maintains a list of the identities and IP addresses
of its r immediate successors on the Chord ring. The fact
that every node knows its own successor means that a node can always
process a lookup correctly: if the desired key is between the
node and its successor, the latter node is the key's successor; otherwise
the lookup can be forwarded to the successor, which moves
the lookup strictly closer to its destination.
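A sketch of this correct-but-slow lookup, with the successor ring represented as a local dictionary instead of per-node RPCs, is shown below; between and slow_lookup are illustrative names, not CFS's API.

    def between(x, a, b):
        # True if x lies in the half-open interval (a, b] on the ID circle.
        if a < b:
            return a < x <= b
        return x > a or x <= b      # the interval wraps past zero

    def slow_lookup(start, key, succ):
        # Forward the query along successor pointers until the key falls
        # between a node and its successor; that successor is responsible.
        n = start
        while not between(key, n, succ[n]):
            n = succ[n]
        return succ[n]

    ring = {10: 20, 20: 30, 30: 10}   # toy ring of three nodes
    print(slow_lookup(10, 25, ring))  # -> 30
    print(slow_lookup(10, 35, ring))  # -> 10 (wraps around)
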
`A new node n learns of its successors when it first joins the
`Chord ring, by asking an existing node to perform a lookup for
`n's successor; n then asks that successor for its successor list. The
r entries in the list provide fault tolerance: if a node's immediate
successor does not respond, the node can substitute the second
entry in its successor list. All r successors would have to simultaneously
fail in order to disrupt the Chord ring, an event that can be
made very improbable with modest values of r. An implementation
should use a fixed r, chosen to be 2 log2 N for the foreseeable
maximum number of nodes N.
The main complexity involved with successor lists is in notifying
an existing node when a new node should be its successor. The
stabilization procedure described in [31] does this in a way that
guarantees to preserve the connectivity of the Chord ring's successor
pointers.
Lookups performed only with successor lists would require an
average of N/2 message exchanges, where N is the number of
servers. To reduce the number of messages required to O(log N),
each node maintains a finger table with m entries. The i-th
entry in the table at node n contains the identity of the first node
that succeeds n by at least 2^(i-1) on the ID circle. Thus every node
knows the identities of nodes at power-of-two intervals on the ID
circle from its own position. A new node in