OPERATING SYSTEMS REVIEW

A Publication of the
Association for Computing Machinery
Special Interest Group on Operating Systems

Volume 35, Number 5

December, 2001

Proceedings of the
18th ACM Symposium on
Operating Systems Principles
(SOSP'01)

October 21-24, 2001
Chateau Lake Louise, Banff, Alberta, Canada
A Publication of the ACM Special Interest Group on Operating Systems

OPERATING SYSTEMS REVIEW

Chairperson
William E. Weihl
Akamai Technologies, Inc.
1400 Fashion Island Blvd.
San Mateo, CA 94404 USA
+1-650-627-5259
(b-weihl@akamai.com)

Newsletter Editor
William M. Waite
Dept. of Electrical and Computer Engineering
University of Colorado
Boulder, CO 80309-0425 USA
+1-303-492-7204
(William.Waite@Colorado.edu)

V. Chairperson
Valerie Issarny
INRIA - Rennes
IRISA
Campus de Beaulieu
35042 Rennes Cedex France
+33-2-99-84-7462
(issarny@irisa.fr)

Secretary-Treasurer
David Kotz
Dept. of Computer Science
Dartmouth College
6211 Sudikoff Laboratory
Hanover, NH 03755 USA
+1-603-646-1439
(dfk@cs.dartmouth.edu)

Information Director
Michael Dahlin
Department of Computer Sciences
Campus Mail Code: C0500
University of Texas
Austin, TX 78712 USA
+1-512-471-9549
(dahlin@cs.utexas.edu)

OPERATING SYSTEMS REVIEW is an informal publication of the ACM Special Interest Group on Operating Systems (SIGOPS), whose scope of interest includes: computer operating systems and architecture for multiprogramming, multiprocessing, and time sharing; resource management; evaluation and simulation; reliability, integrity, and security of data; communications among computing processors; and computer system modeling analysis.

Membership in SIGOPS (at $15 per annum) is open to ACM members and associate members, and student members (at $8 per annum); non-ACM members may also join (at $42 per annum). Non-members of SIGOPS may subscribe to OPERATING SYSTEMS REVIEW at $30 per year. All SIGOPS members receive OPERATING SYSTEMS REVIEW.

SIGOPS membership application forms are available from ACM Headquarters (or see the back cover of this issue), 1515 Broadway, 17th Floor, New York, New York 10036, telephone (212) 869-7440. Changes of address and other matters pertaining to the SIGOPS mailing list should be directed to ACM Headquarters, not to any of the SIGOPS officers.

Contributions to OSR are sent to the editor and should be single spaced and printed on one side of the paper only. Text and diagrams must be contained within a 6.5 inch wide by 8.5 inch high (165mm by 216mm) area with at least one inch (25mm) top and side margins and a 1.5 inch (38mm) bottom margin, and be camera-ready (i.e., be an original having sufficient contrast and density for reproduction). All letters to the editor will be considered for publication unless accompanied by a request to the contrary. Except for editorial items, all sources of material appearing in OPERATING SYSTEMS REVIEW will be clearly identified. Items attributed to individuals will ordinarily be interpreted as personal rather than organizational opinions. The technical material in this newsletter is not refereed. Deadlines for submitting material are the 15th of February, May, August, and November. The SIGOPS conference proceedings will be mailed as the December issue in odd numbered calendar years, and the ASPLOS conference proceedings will be mailed in even numbered calendar years.

OPERATING SYSTEMS REVIEW (ISSN 0163-5980) is published five times a year (January, April, July, October and December) by the Association for Computing Machinery Inc., 1515 Broadway, New York, NY 10036. Periodicals postage paid at New York, NY 10001, and at additional mailing offices. POSTMASTER: Send address changes to OPERATING SYSTEMS REVIEW, ACM, 1515 Broadway, New York, NY 10036.

Subscriptions: Annual subscription cost of $12.53 is included in the member dues of $15.00 (for students the cost is included in $8.00); the non-member annual subscription is $42.00.
Proceedings of the
18th ACM Symposium on
Operating Systems Principles
(SOSP'01)

October 21-24, 2001
Chateau Lake Louise, Banff, Alberta, Canada

Sponsored by ACM SIGOPS
(Association for Computing Machinery Special Interest Group on Operating Systems).

Supported by Microsoft Research, Intel Corporation, QUALCOMM, Mercury Computer Systems, Hewlett-Packard Labs, Akamai, and Inktomi.
The Association for Computing Machinery
1515 Broadway
New York, N.Y. 10036

Copyright © 2001 by the Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of portions of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permission to republish from: Publications Dept. ACM, Inc. Fax +1 (212) 869-0481 or E-mail permissions@acm.org.

For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, +1-978-750-8400, +1-978-750-4470 (fax).

Notice to Past Authors of ACM-Published Articles

ACM intends to create a complete electronic archive of all articles and/or other material previously published by ACM. If you have written a work that was previously published by ACM in any journal or conference proceedings prior to 1978, or any SIG Newsletter at any time, and you do NOT want this work to appear in the ACM Digital Library, please inform permissions@acm.org, stating the title of the work, the author(s), and where and when published.

ACM ISBN: 1-58113-389-8

Additional copies may be ordered prepaid from:

ACM Order Department
P.O. Box 11405
New York, N.Y. 10286-1405

Phone: 1-800-342-6626 (U.S.A. and Canada)
+1-212-626-0500 (All other countries)
Fax: +1-212-944-1318
E-mail: acmhelp@acm.org

ACM Order Number: 534010

Printed in the U.S.A.
Wide-area cooperative storage with CFS

Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, Ion Stoica*
MIT Laboratory for Computer Science
chord@lcs.mit.edu
http://pdos.lcs.mit.edu/chord/

*University of California, Berkeley

Abstract

The Cooperative File System (CFS) is a new peer-to-peer read-only storage system that provides provable guarantees for the efficiency, robustness, and load-balance of file storage and retrieval. CFS does this with a completely decentralized architecture that can scale to large systems. CFS servers provide a distributed hash table (DHash) for block storage. CFS clients interpret DHash blocks as a file system. DHash distributes and caches blocks at a fine granularity to achieve load balance, uses replication for robustness, and decreases latency with server selection. DHash finds blocks using the Chord location protocol, which operates in time logarithmic in the number of servers.

CFS is implemented using the SFS file system toolkit and runs on Linux, OpenBSD, and FreeBSD. Experience on a globally deployed prototype shows that CFS delivers data to clients as fast as FTP. Controlled tests show that CFS is scalable: with 4,096 servers, looking up a block of data involves contacting only seven servers. The tests also demonstrate nearly perfect robustness and unimpaired performance even when as many as half the servers fail.

This research was sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Space and Naval Warfare Systems Center, San Diego, under contract N66001-00-1-8933.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP'01 Banff, Canada
© 2001 ACM ISBN 1-58113-389-8-1/01/10 ... $5.00

1. Introduction

Existing peer-to-peer systems (such as Napster [20], Gnutella [11], and Freenet [6]) demonstrate the benefits of cooperative storage and serving: fault tolerance, load balance, and the ability to harness idle storage and network resources. Accompanying these benefits are a number of design challenges. A peer-to-peer architecture should be symmetric and decentralized, and should operate well with unmanaged volunteer participants. Finding desired data in a large system must be fast; servers must be able to join and leave the system frequently without affecting its robustness or efficiency; and load must be balanced across the available servers. While the peer-to-peer systems in common use solve some of these problems, none solves all of them. CFS (the Cooperative File System) is a new design that meets all of these challenges.

A CFS file system exists as a set of blocks distributed over the available CFS servers. CFS client software interprets the stored blocks as file system data and meta-data and presents an ordinary read-only file-system interface to applications. The core of the CFS software consists of two layers, DHash and Chord. The DHash (distributed hash) layer performs block fetches for the client, distributes the blocks among the servers, and maintains cached and replicated copies. DHash uses the Chord [31] distributed lookup system to locate the servers responsible for a block. This table summarizes the CFS software layering:

Layer   Responsibility
FS      Interprets blocks as files; presents a file system interface to applications.
DHash   Stores unstructured data blocks reliably.
Chord   Maintains routing tables used to find blocks.
Chord implements a hash-like operation that maps from block identifiers to servers. Chord assigns each server an identifier drawn from the same 160-bit identifier space as block identifiers. These identifiers can be thought of as points on a circle. The mapping that Chord implements takes a block's ID and yields the block's successor, the server whose ID most closely follows the block's ID on the identifier circle. To implement this mapping, Chord maintains at each server a table with information about O(log N) other servers, where N is the total number of servers. A Chord lookup sends messages to O(log N) servers to consult their tables. Thus CFS can find data efficiently even with a large number of servers, and servers can join and leave the system with few table updates.

DHash layers block management on top of Chord. DHash provides load balance for popular large files by arranging to spread the blocks of each file over many servers. To balance the load imposed by popular small files, DHash caches each block at servers likely to be consulted by future Chord lookups for that block. DHash supports pre-fetching to decrease download latency. DHash replicates each block at a small number of servers, to provide fault tolerance. DHash enforces weak quotas on the amount of data each server can inject, to deter abuse. Finally, DHash allows control over the number of virtual servers per server, to provide control over how much data a server must store on behalf of others.

CFS has been implemented using the SFS toolkit [16]. This paper reports experimental results from a small international deployment of CFS servers and from a large-scale controlled test-bed. These results confirm the contributions of the CFS design:

• an aggressive approach to load balance by spreading file blocks randomly over servers;
• download performance on an Internet-wide prototype deployment as fast as standard FTP;
• provable efficiency and provably fast recovery times after failure;
• simple algorithms to achieve the above results.

CFS is not yet in operational use, and such use will likely prompt refinements to its design. One potential area for improvement is the ability of the Chord lookup algorithm to tolerate malicious participants, by verifying the routing information received from other servers. Another area that CFS does not currently address is anonymity; it is expected that anonymity, if needed, would be layered on top of the basic CFS system.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 outlines the design and design goals of CFS. Section 4 describes the Chord location protocol. Section 5 presents the detailed design of CFS. Section 6 describes implementation details, and Section 7 presents experimental results. Section 8 discusses open issues and future work. Finally, Section 9 summarizes our findings and concludes the paper.

2. Related Work

CFS was inspired by Napster [20], Gnutella [11], and particularly Freenet [6]. CFS uses peer-to-peer distributed hashing similar in spirit to a number of ongoing research projects [26, 29, 35]. In comparison to existing peer-to-peer file sharing systems, CFS offers simplicity of implementation and high performance without compromising correctness. CFS balances server load, finds data quickly for clients, and guarantees data availability in the face of server failures with very high probability. CFS, as a complete system, has individual aspects in common with many existing systems. The major relationships are summarized below.

2.1 Naming and Authentication

CFS authenticates data by naming it with public keys or content-hashes, as do many other distributed storage systems [9, 6, 7, 13, 29, 34]. The use of content-hashes to securely link together different pieces of data is due to Merkle [18]; the use of public keys to authentically name data is due to the SFS system [17].

CFS adopts naming, authentication, and file system structure ideas from SFSRO [9], which implements a secure distributed read-only file system - that is, a file system in which files can be modified only by their owner, and only through complete replacement of the file. However, SFSRO and CFS have significant differences at the architectural and mechanism levels. SFSRO defines protocols and authentication mechanisms which a client can use to retrieve data from a given server. CFS adds the ability to dynamically find the server currently holding the desired data, via the Chord location service. This increases the robustness and the availability of CFS, since changes in the set of servers are transparent to clients.

2.2 Peer-to-Peer Search

Napster [20] and Gnutella [11] are arguably the most widely used peer-to-peer file systems today. They present a keyword search interface to clients, rather than retrieving uniquely identified data. As a result they are more like search engines than distributed hash tables, and they trade scalability for this power: Gnutella broadcasts search queries to many machines, and Napster performs searches at a central facility. CFS as described in this paper doesn't provide search, but we are developing a scalable distributed search engine for CFS.

Mojo Nation [19] is a broadcast query peer-to-peer storage system which divides files into blocks and uses a secret sharing algorithm to distribute the blocks to a number of hosts. CFS also divides files into blocks but does not use secret sharing.

2.3 Anonymous Storage

Freenet [5] uses probabilistic routing to preserve the anonymity of clients, publishers, and servers. The anonymity requirement limits Freenet's reliability and performance. Freenet avoids associating a document with any predictable server, and avoids forming any globally coherent topology among servers. The former means that unpopular documents may simply disappear from the system, since no server has the responsibility for maintaining replicas. The latter means that a search may need to visit a large fraction of the Freenet network. As an example, Hong shows in his Figure 14-12 that in a network with 1,000 servers, the lookup path length can exceed 90 hops [23]. This means that if the hop count is limited to 90, a lookup may fail even though the document is available. Because CFS does not try to provide anonymity, it can guarantee much tighter bounds on lookup cost; for example, in a 4,096-node system, lookups essentially never exceed 10 hops.

CFS's caching scheme is similar to Freenet's in the sense that both leave cached copies of data along the query path from client to where the data was found. Because CFS finds data in significantly fewer hops than Freenet, and CFS' structured lookup paths are more likely to overlap than Freenet's, CFS can make better use of a given quantity of cache space.

Like Freenet, Publius [34] focuses on anonymity, but achieves it with encryption and secret sharing rather than routing. Publius requires a static, globally-known list of servers; it stores each share at a fixed location that is predictable from the file name. Free Haven [7] uses both cryptography and routing (using re-mailers [4]) to provide anonymity; like Gnutella, Free Haven finds data with a global search.

CFS does not attempt to provide anonymity, focusing instead on efficiency and robustness. We believe that intertwining anonymity with the basic data lookup mechanism interferes with correctness and performance. On the other hand, given a robust location and storage layer, anonymous client access to CFS could be provided by separate anonymizing proxies, using techniques similar to those proposed by Chaum [4] or Reiter and Rubin [27].

2.4 Peer-to-Peer Hash Based Systems

CFS layers storage on top of an efficient distributed hash lookup algorithm. A number of recent peer-to-peer systems use similar approaches with similar scalability and performance, including CAN [26], PAST [28, 29], OceanStore [13, 35], and Ohaha [22]. A detailed comparison of these algorithms can be found in [31].

The PAST [29] storage system differs from CFS in its approach to load balance. Because a PAST server stores whole files, a server might not have enough disk space to store a large file even though the system as a whole has sufficient free space. A PAST server solves this by offloading files it is responsible for to servers that do have spare disk space. PAST handles the load of serving popular files by caching them along the lookup path.

CFS stores blocks, rather than whole files, and spreads blocks evenly over the available servers; this prevents large files from causing unbalanced use of storage. CFS solves the related problem of different servers having different amounts of storage space with a mechanism called virtual servers, which gives server managers control over disk space consumption. CFS' block storage granularity helps it handle the load of serving popular large files, since the serving load is spread over many servers along with the blocks.
Figure 1: CFS software structure. Vertical links are local APIs; horizontal links are RPC APIs.
`
`This is more space-efficient, for large files, than whole-file caching.
`CFS relies on caching only for files small enough that distributing
`blocks is not effective. Evaluating the performance impact of block
`storage granularity is one of the purposes of this paper.
`OceanStore [13] aims to build a global persistent storage util(cid:173)
`ity. It provides data privacy, allows client updates, and guarantees
`durable storage. However, these features come at a price: com(cid:173)
`plexity. For example, OceanStore uses a Byzantine agreement pro(cid:173)
`tocol for conflict resolution, and a complex protocol based on Plax(cid:173)
`ton trees [24) to implement the location service [35). OceanStore
`assumes that the core system will be maintained by commercial
`providers.
`Ohaha [22] uses consistent hashing to map files and keyword
`queries to servers, and a Freenet-Iike routing algorithm to locate
`files. As a result, it shares some of the same weaknesses as Freenet.
`2.5 Web Caches
`Content distribution networks (CDNs), such as Akarnai [l],
`handle high demand for data by distributing replicas on multiple
`servers. CDNs are typically managed by a central entity, while CFS
`is built from resources shared and owned by a cooperative group of
`users.
`There are several proposed scalable cooperative Web caches [3,
`8, 10, 15). To locate data, these systems either multicast queries or
`require that some or all servers know about all other servers. As
`a result, none of the proposed methods is both highly scalable and
`robust. In addition, load balance is hard to achieve as the content
`of each cache depends heavily on the query pattern.
`Cache Resolver [30], like CFS, uses consistent hashing to evenly
`map stored data among the servers [12, 14]. However, Cache Re(cid:173)
`solver assumes that clients know the entire set of servers; main(cid:173)
`taining an up-to-date server list is likely to be difficult in a large
`peer-to-peer system where servers join and depart at unpredictable
`times.
`
`3. Design Overview
`CFS provides distributed read-only file storage. It is structured as
`a collection of servers that provide block-level storage. Publishers
`(producers of data) and clients (consumers of data) layer file system
`semantics on top of this block store much as an ordinary file system
`is layered on top of a disk. Many umelated publishers may store
`separate file systems on a single CFS system; the CFS design is
`intended to support the possibility of a single world-wide system
`consisting of millions of servers.
3.1 System Structure

Figure 1 illustrates the structure of the CFS software. Each CFS client contains three software layers: a file system client, a DHash storage layer, and a Chord lookup layer. The client file system uses the DHash layer to retrieve blocks. The client DHash layer uses the client Chord layer to locate the servers that hold desired blocks.

Figure 2: A simple CFS file system structure example. The root-block is identified by a public key and signed by the corresponding private key. The other blocks are identified by cryptographic hashes of their contents.

Each CFS server has two software layers: a DHash storage layer and a Chord layer. The server DHash layer is responsible for storing keyed blocks, maintaining proper levels of replication as servers come and go, and caching popular blocks. The server DHash and Chord layers interact in order to integrate looking up a block identifier with checking for cached copies of the block. CFS servers are oblivious to file system semantics: they simply provide a distributed block store.

CFS clients interpret DHash blocks in a file system format adopted from SFSRO [9]; the format is similar to that of the UNIX V7 file system, but uses DHash blocks and block identifiers in place of disk blocks and disk addresses. As shown in Figure 2, each block is either a piece of a file or a piece of file system meta-data, such as a directory. The maximum size of any block is on the order of tens of kilobytes. A parent block contains the identifiers of its children.

The publisher inserts the file system's blocks into the CFS system, using a hash of each block's content (a content-hash) as its identifier. Then the publisher signs the root block with his or her private key, and inserts the root block into CFS using the corresponding public key as the root block's identifier. Clients name a file system using the public key; they can check the integrity of the root block using that key, and the integrity of blocks lower in the tree with the content-hash identifiers that refer to those blocks. This approach guarantees that clients see an authentic and internally consistent view of each file system, though under some circumstances a client may see an old version of a recently updated file system.
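As a rough illustration of this naming scheme (a sketch under assumptions, not the CFS code: the choice of SHA-1, the `dhash.put` call, and the `sign` callback are stand-ins), a publisher might derive identifiers and insert a file system like this:

    import hashlib

    def content_hash(block: bytes) -> int:
        """Identifier for an ordinary block: a cryptographic hash of its contents."""
        return int.from_bytes(hashlib.sha1(block).digest(), "big")

    def root_block_id(public_key: bytes) -> int:
        """Identifier for a root block, derived from the publisher's public key.
        (Hashing the key into the 160-bit identifier space is an assumption of this sketch.)"""
        return int.from_bytes(hashlib.sha1(public_key).digest(), "big")

    def publish(dhash, content_blocks, root_block: bytes, public_key: bytes, sign):
        """Insert content blocks under their content-hashes, then the signed root block
        under the identifier derived from the publisher's public key."""
        for block in content_blocks:
            dhash.put(content_hash(block), block)
        dhash.put(root_block_id(public_key), root_block + sign(root_block))

A client that knows the public key can locate and verify the root block with that key, and then verify every other block it fetches by rehashing its contents and comparing the result against the identifier stored in the parent block.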
A CFS file system is read-only as far as clients are concerned. However, a file system may be updated by its publisher. This involves updating the file system's root block in place, to make it point to the new data. CFS authenticates updates to root blocks by checking that the new block is signed by the same key as the old block. A timestamp prevents replays of old updates. CFS allows file systems to be updated without changing the root block's identifier so that external references to data need not be changed when the data is updated.
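A minimal sketch of that acceptance rule follows; the field names and the `verify` callback are placeholders for whatever signature scheme and block encoding an implementation chooses:

    def accept_root_update(old_root, new_root, verify) -> bool:
        """Accept a replacement root block only if it is signed by the same public key
        as the current one and carries a strictly newer timestamp (replay defense)."""
        if new_root.public_key != old_root.public_key:
            return False                  # a different key may not update this root block
        if not verify(new_root.public_key, new_root.payload, new_root.signature):
            return False                  # the signature must check out under that key
        return new_root.timestamp > old_root.timestamp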
CFS stores data for an agreed-upon finite interval. Publishers that want indefinite storage periods can periodically ask CFS for an extension; otherwise, a CFS server may discard data whose guaranteed period has expired. CFS has no explicit delete operation: instead, a publisher can simply stop asking for extensions. In this area, as in its replication and caching policies, CFS relies on the assumption that large amounts of spare disk space are available.
3.2 CFS Properties

CFS provides consistency and integrity of file systems by adopting the SFSRO file system format. CFS extends SFSRO by providing the following desirable properties:

• Decentralized control. CFS servers need not share any administrative relationship with publishers. CFS servers could be ordinary Internet hosts whose owners volunteer spare storage and network resources.

• Scalability. CFS lookup operations use space and messages at most logarithmic in the number of servers.

• Availability. A client can always retrieve data as long as it is not trapped in a small partition of the underlying network, and as long as one of the data's replicas is reachable using the underlying network. This is true even if servers are constantly joining and leaving the CFS system. CFS places replicas on servers likely to be at unrelated network locations to ensure independent failure.

• Load balance. CFS ensures that the burden of storing and serving data is divided among the servers in rough proportion to their capacity. It maintains load balance even if some data are far more popular than others, through a combination of caching and spreading each file's data over many servers.

• Persistence. Once CFS commits to storing data, it keeps it available for at least an agreed-on interval.

• Quotas. CFS limits the amount of data that any particular IP address can insert into the system. This provides a degree of protection against malicious attempts to exhaust the system's storage.

• Efficiency. Clients can fetch CFS data with delay comparable to that of FTP, due to CFS' use of efficient lookup algorithms, caching, pre-fetching, and server selection.

The next two sections present Chord and DHash, which together provide these properties.
4. Chord Layer

CFS uses the Chord protocol to locate blocks [31]. Chord supports just one operation: given a key, it will determine the node responsible for that key. Chord does not itself store keys and values, but provides primitives that allow higher-layer software to build a wide variety of storage systems; CFS is one such use of the Chord primitive. The rest of this section summarizes Chord and describes new algorithms for robustness and server selection to support applications such as CFS.

4.1 Consistent Hashing

Each Chord node has a unique m-bit node identifier (ID), obtained by hashing the node's IP address and a virtual node index. Chord views the IDs as occupying a circular identifier space. Keys are also mapped into this ID space, by hashing them to m-bit key IDs. Chord defines the node responsible for a key to be the successor of that key's ID. The successor of an ID j is the node with the smallest ID that is greater than or equal to j (with wrap-around), much as in consistent hashing [12].

Consistent hashing lets nodes enter and leave the network with minimal movement of keys. To maintain correct successor mappings when a node n joins the network, certain keys previously assigned to n's successor become assigned to n. When node n leaves the network, all of n's assigned keys are reassigned to its successor. No other changes in the assignment of keys to nodes need occur.

Consistent hashing is straightforward to implement, with constant-time lookups, if all nodes have an up-to-date list of all other nodes. However, such a system does not scale; Chord provides a scalable, distributed version of consistent hashing.
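In the centralized variant just described, where every node knows all the others, the successor function reduces to a search over the sorted node IDs with wrap-around. A small sketch of that idea (illustrative only, not Chord itself):

    import bisect

    def successor(node_ids, key_id, m=160):
        """Return the node responsible for key_id: the smallest node ID greater than
        or equal to key_id, wrapping around to the lowest node ID if necessary."""
        ring = sorted(node_ids)
        i = bisect.bisect_left(ring, key_id % (1 << m))
        return ring[i] if i < len(ring) else ring[0]

    # With nodes at IDs 10, 80, and 200 on the circle, key 150 maps to node 200,
    # and key 250 wraps around the circle to node 10.
    assert successor([80, 10, 200], 150) == 200
    assert successor([80, 10, 200], 250) == 10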
4.2 The Chord Lookup Algorithm

A Chord node uses two data structures to perform lookups: a successor list and a finger table. Only the successor list is required for correctness, so Chord is careful to maintain its accuracy. The finger table accelerates lookups, but does not need to be accurate, so Chord is less aggressive about maintaining it. The following discussion first describes how to perform correct (but slow) lookups with the successor list, and then describes how to accelerate them with the finger table. This discussion assumes that there are no malicious participants in the Chord protocol; while we believe that it should be possible for nodes to verify the routing information that other Chord participants send them, the algorithms to do so are left for future work.

Every Chord node maintains a list of the identities and IP addresses of its r immediate successors on the Chord ring. The fact that every node knows its own successor means that a node can always process a lookup correctly: if the desired key is between the node and its successor, the latter node is the key's successor; otherwise the lookup can be forwarded to the successor, which moves the lookup strictly closer to its destination.
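This correct-but-slow procedure can be sketched as follows. The helper names are invented for the illustration: `successor_of(n)` stands for asking node n for its immediate successor, and the interval test handles wrap-around on the circular ID space.

    def in_interval(x, a, b, m=160):
        """True if x lies in the half-open circular interval (a, b]."""
        size = 1 << m
        return x != a and (x - a) % size <= (b - a) % size

    def successor_only_lookup(start_node, key_id, successor_of):
        """Forward the lookup one successor at a time until the node whose range
        covers key_id is reached, and return the key's successor node."""
        n = start_node
        while True:
            s = successor_of(n)
            if s == n:                      # degenerate single-node ring
                return n
            if in_interval(key_id, n, s):   # key falls between n and its successor,
                return s                    # so s is the key's successor
            n = s                           # otherwise forward the lookup along the ring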
A new node n learns of its successors when it first joins the Chord ring, by asking an existing node to perform a lookup for n's successor; n then asks that successor for its successor list. The r entries in the list provide fault tolerance: if a node's immediate successor does not respond, the node can substitute the second entry in its successor list. All r successors would have to simultaneously fail in order to disrupt the Chord ring, an event that can be made very improbable with modest values of r. An implementation should use a fixed r, chosen to be 2 log2 N for the foreseeable maximum number of nodes N.

The main complexity involved with successor lists is in notifying an existing node when a new node should be its successor. The stabilization procedure described in [31] does this in a way that guarantees to preserve the connectivity of the Chord ring's successor pointers.
Lookups performed only with successor lists would require an average of N/2 message exchanges, where N is the number of servers. To reduce the number of messages required to O(log N), each node maintains a finger table with m entries. The i-th entry in the table at node n contains the identity of the first node that succeeds n by at least 2^(i-1) on the ID circle. Thus every node knows the identities of nodes at power-of-two intervals on the ID circle from its own position. A new node in
