Robust and Efficient Data Management for a Distributed Hash Table
`
`by
`
`Josh Cates
`
`Submitted to the Department of Electrical Engineering and Computer Science
`in partial fulfillment of the requirements for the degree of
`
`Master of Engineering in Computer Science and Engineering
`
`at the
`
`MASSACHUSETTS INSTITUTE OF TECHNOLOGY
`
`June 2003
`
`@2003 Josh Cates. All rights reserved.
`
The author hereby grants M.I.T. permission to reproduce and to distribute publicly
paper and electronic copies of this thesis and to grant others the right to do so.
`
Author
Department of Electrical Engineering and Computer Science
16 May, 2003

Certified by
M. Frans Kaashoek
Professor of Computer Science and Engineering
Thesis Supervisor
`
Certified by
`
`Robert Morris
`Assistant Professor of Computer Science and Engineering
`Thesis Supervisor
`
Accepted by
`
`Arthur C. Smith
`Chairman, Department Committee on Graduate Students
`
`
`Robust and Efficient Data Management for a Distributed Hash Table
`
`by
`
`Josh Cates
`
`Submitted to the Department of Electrical Engineering and Computer Science
`on 16 May, 2003, in partial fulfillment of the
`requirements for the degree of
`Master of Engineering in Computer Science and Engineering
`
`Abstract
`
`This thesis presents a new design and implementation of the DHash distributed hash table
`based on erasure encoding. This design is both more robust and more efficient than the
`previous replication-based implementation [15].
DHash uses erasure coding to store each block as a set of fragments. Erasure coding
increases availability while saving storage and communication costs compared to a
replication-based design. DHash combines Chord's synthetic coordinates with the set of
fragments to implement server selection on block retrieval.
`DHash enhances robustness by implementing efficient fragment maintenance protocols.
`These protocols restore missing or misplaced fragments caused by hosts joining and leaving
`the system.
Experiments with a 270-node DHash system running on the PlanetLab [1] and RON [4]
testbeds show that the changes to DHash increase the rate at which the system can fetch
data by a factor of six, and decrease the latency of a single fetch by more than a factor of two.
The maintenance protocols ensure that DHash is robust without penalizing performance.
Even at large database sizes, the per-host memory footprint is less than 10 MB and the
per-host network bandwidth is under 2 KB/sec over a wide range of system half-lives.
`
`Thesis Supervisor: M. Frans Kaashoek
`Title: Professor of Computer Science and Engineering
`
`Thesis Supervisor: Robert Morris
`Title: Assistant Professor of Computer Science and Engineering
`
`3
`
`Page 3 of 64
`
`
`
`4
`
`Page 4 of 64
`
`Page 4 of 64
`
`
`
`"Of the gladdest moments in human life, methinks, is the departure upon a distant
`journey into unknown lands. Shaking off with one mighty effort, the fetters of Habit, the
`leaden weight of Routine, the cloak of many cares and the slavery of Home, one feels once
`more happy."
`
`Sir Richard Burton - Journal Entry - 1856
`
`
`
`
`Acknowledgments
`
`I would like to thank Prof. Frans Kaashoek for his enduring support and guidance. From
when I was an undergraduate UROPer until today, he has been a constant source of encouragement. His efforts have afforded me the opportunity to work with an unforgettable
`group of faculty and students. In particular, this thesis benefits greatly from his insights,
`suggestions, and patient revisions.
`This thesis is based on joint work with the members of the Chord project. The primary
`contribution of this thesis is the use of erasure coding and fragment repair within DHash,
which couldn't have been designed and implemented without all of the work of Russ Cox,
`Frank Dabek, Prof. Frans Kaashoek, Prof. Robert Morris, James Robertson, Emil Sit and
`Jacob Strauss. Discussions with them have improved the system considerably. Prof. Morris
`deserves to be thanked for his especially lucid design feedback. Emil Sit coded significant
`parts of the work described in this thesis.
`A huge thanks is due to all the members of the PDOS research group for making work
`a lot of fun. They have made for a collection of great memories. I'm very grateful to Chris
`Zimman who opened my eyes to MIT in the first place. Finally, I'd like to thank my parents
`for their constant encouragement.
This research was partially supported by the IRIS project (http://project-iris.net/),
by the National Science Foundation under Cooperative Agreement No. ANI-0225660,
the Oxygen project (http://oxygen.lcs.mit.edu/), by the RON project (http://ron.lcs.mit.edu/),
and by the Defense Advanced Research Projects Agency (DARPA).
`
`7
`
`Page 7 of 64
`
`
`
`8
`
`Page 8 of 64
`
`Page 8 of 64
`
`
`
`Contents
`
`1
`
`Introduction
`
`1.1 Peer-to-peer Off-site Backup .
`
`1.2 Background: Chord
`
`1.2.1
`
`Chord API
`
`.
`
`1.2.2
`
`Synthetic Coordinates
`
`1.3 Thesis Overview
`
`. . . . . . .
`
`2 DHash: a distributed hash table
`
`2.1 DHash API . . . .
`
`2.2 Block Availability .
`
`2.3 Block Insert: put Ck, b)
`
`2.4 Block Fetch: get Ck)
`
`3 Fragment Maintenance
`
`3.1
`
`Global DHash Maintenance
`
`3.2
`
`Local DHash Maintenance .
`
`4 Database Synchronization
`
`4.1 Approach . . . . . . . .
`
`9
`
`11
`
`13
`
`15
`
`16
`
`17
`
`18
`
`19
`
`20
`
`20
`
`22
`
`22
`
`25
`
`26
`
`29
`
`31
`
`31
`
`Page 9 of 64
`
`
`
`4.2
`
`Database Properties
`
`. .
`
`4.3
`
`Database Index Format
`
`4.3.1
`
`Index Insertion
`
`4.4
`
`Network Protocol . . .
`
`5 Evaluation
`
`5.1
`
`Basic Performance
`
`5.2
`
`Fetch Latency . . .
`
`5.3
`
`Synchronization Dynamics .
`
`5.4
`
`Ideal State Maintenance
`
`5.5
`
`Effect of Half-Life . . . .
`
`5.6
`
`Synchronization Overhead
`
`5.7
`
`Memory Usage
`
`5.8
`
`Fault Tolerance
`
`6 Related Work
`
`6.1
`
`Cooperative Backup
`
`6.2
`
`Server Selection and Spreading
`
`6.3
`
`Coding . . . . . . . . . .
`
`6.4
`
`Replica Synchronization
`
`7 Summary
`
`7.1
`
`Conclusions
`
`7.2
`
`Future Work
`
`10
`
`;33
`
`33
`
`35
`
`36
`
`41
`
`41
`
`42
`
`44
`
`45
`
`47
`
`47
`
`49
`
`50
`
`53
`
`54
`
`54
`
`55
`
`55
`
`57
`
`57
`
`57
`
`Page 10 of 64
`
`
`
`Chapter 1
`
`Introduction
`
`DHTs have been proposed as a way to simplify the construction of large-scale distributed
`
`applications (e.g., [2, 26]). DHTs1 store blocks of data on a collection of nodes spread
`
`throughout the Internet. Each block is identified by a unique key. The goals of these DHTs
`
`are to spread the load of storing and serving data across all the nodes and to keep the data
`
`available as nodes join and leave the system.
`
`This thesis presents a new design, based on erasure coding, for distributing and storing
`
blocks within DHash, an existing DHT implementation. These changes make DHash a robust,
`
`efficient and practical DHT for demanding applications such as cooperative backup [13, 18,
`
`24]. Such an application requires that the DHT keep data available despite faults and that
`
`the DHT efficiently serve bulk data (unlike, for example, a naming system).
`
`Like many fault-tolerant systems, DHash uses erasure coding to increase availability
`
`with relatively little cost in extra storage and communication. DHash stores each block as
`
`a set of erasure encoded fragments; the block itself is never stored by the system. This work
`
`1 The DHT community has some disagreement about what the best API for a DHT is [16, 19] and the
`OceanStore project uses a DOLR instead of a DHT [33], but for most of the contributions of this thesis
`these distinctions are not very significant.
`
`11
`
`Page 11 of 64
`
`
`
`extends the previous DHash system, which replicated blocks for fault-tolerance [15].
`
The main contribution of this thesis is the way DHash combines erasure-encoded storage
`
`with the other techniques and properties of Chord to provide robust and efficient operation.
`
`These other techniques include proximity routing, server selection, and successor lists. As
`
we demonstrate through experiments with the DHash implementation, the synergy between the
`
`techniques makes them more effective as a collection than individually.
`
`The second contribution of this thesis is the design and implementation of a pair of
`
`fragment maintenance protocols that ensure DHash is robust: i.e., that each inserted block
`
`can subsequently be retrieved, even if nodes join and leave the system. These protocols
`
`restore any fragments which get destroyed or misplaced when hosts join or leave the system.
`
`The challenge is to make these protocols as efficient as possible given a very general failure
`
`model.
`
`The final contribution of this thesis is a detailed evaluation of DHash. Our evaluation of
`
the previous, replication-based DHash running on the PlanetLab [1] and RON [4] testbeds
makes clear that it is inadequate to support data-intensive applications which require high
`
`data availability. DHash, with our changes, offers increased block-download throughput,
`
`reduced block fetch latency, and improved availability for a given space replication budget.
`
`12
`
`Page 12 of 64
`
`
`
`1.1 Peer-to-peer Off-site Backup
`
`In order to help guide design decisions for DHash, we implemented a cooperative off-site
`
backup system. The off-site backups are intended to complement conventional tape or disk-to-disk backups by adding an extra level of availability and providing a browseable archive
`
`of backups. The off-site backup system can be used alone if desired.
`
`The off-site backup system's goals are to support recovery after a disaster by keeping
`
`snapshots of file systems at other Internet sites. The system spreads the data over many
`
`sites in order to balance storage and network load. This striping also allows very large
`
`file systems to be backed up onto a set of hosts with individually limited disk space. The
`
`backup system performs daily incremental backups; each new backup shares storage with
`
`the unchanged part of the previous backups.
`
The intended users of the backup system are informal groups of people at geographically distributed sites who know each other; for example, colleagues at different university
`
`computer science departments. Each site is expected to make available spare disk space on
`
`workstations. These workstations are likely to be reasonably reliable and have fairly fast
`
`network connections.
`
`Since the backup system sends file system copies over the Internet, communication
`
performance is important; it must be possible to back up a full day's incremental changes
`
`to a typical server file system in a few hours. Performance is more sensitive to network
`
`throughput than to latency, since the backup system usually has large quantities of data
`
`that can be sent concurrently. Storing and fetching data are both important because the
`
`backup system allows users to browse the archive of old snapshots.
`
`At a low level, the backup system works in units of 8192-byte disk blocks; it scans
`
`the disk to be backed up, looking for blocks that have changed since the last backup and
`
`13
`
`Page 13 of 64
`
`
`
`ignoring blocks that are not allocated. It inserts each new block into DHash, using the hash
`
of the block's contents as its key. The system builds a tree with the blocks as leaves in
`
`order to be able to map disk block numbers to block contents; the tree's interior nodes are
`
`also stored in DHash.
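
For concreteness, the sketch below shows the shape of this block-oriented scheme: fixed-size blocks keyed by their SHA-1 content hash, plus a single interior node listing the leaf keys. It is a simplified illustration only; the dhash_put callback, the one-level tree, and the omission of change detection and unallocated-block skipping are assumptions, not the backup system's actual code.

# Illustrative sketch only: content-hash keys for 8192-byte blocks plus a
# one-level "tree" of leaf keys.  dhash_put() stands in for the DHash
# put(k, b) call; the real system builds a deeper tree and skips unchanged
# and unallocated blocks.
import hashlib

BLOCK_SIZE = 8192

def backup_image(image: bytes, dhash_put) -> bytes:
    leaf_keys = []
    for off in range(0, len(image), BLOCK_SIZE):
        block = image[off:off + BLOCK_SIZE]
        key = hashlib.sha1(block).digest()    # identical blocks coalesce
        dhash_put(key, block)
        leaf_keys.append(key)
    interior = b"".join(leaf_keys)            # maps block numbers to leaf keys
    root_key = hashlib.sha1(interior).digest()
    dhash_put(root_key, interior)
    return root_key                           # handle for this snapshot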
`
`Each daily backup has its own tree, though in practice each tree shares most of its DHash
`
`storage with previous backups. The block orientation of this scheme is a good match for
`
`the block interface of DHash. Because the backup system performs sharing at the block
`
`level, it is not necessary to search for a machine whose file system is similar overall; while
`
`such a file system may exist when a workstation is being backed up [13], the backup of a
`
`large server holding home directories is unlikely to find such a partner.
`
`In summary, the backup system places the following requirements on DHash: 1) high
`
`availability, 2) high throughput for bulk transfers, 3) low latency for interactive browsing
`
`of the archived backups, and 4) good support for block-size operations.
`
`14
`
`Page 14 of 64
`
`
`
`1.2 Background: Chord
`
`DHash uses Chord [38] to help determine on which host to store each piece of data. Chord
`
`implements a hash-like lookup operation that maps 160-bit data keys to hosts. Chord
`
`assigns each host an identifier drawn from the same 160-bit space as the keys. This identifier
`
`space can be viewed as a circle, in which the highest identifier is followed by zero. Chord
`
`maps each key to the host whose identifier most closely follows the key.
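
A minimal sketch of this key-to-host mapping is shown below; it is our own illustration, and the helper names are not part of Chord's API.

# Sketch: map a 160-bit key to the node whose identifier most closely
# follows it on the identifier circle (wrapping past zero).
import hashlib

def chord_id(name: str) -> int:
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def successor_of(key: int, node_ids: list) -> int:
    ids = sorted(node_ids)
    for ident in ids:
        if ident >= key:
            return ident
    return ids[0]            # wrapped past the highest identifier to zero

nodes = [chord_id("host%d.example.org" % i) for i in range(8)]
print(hex(successor_of(chord_id("some-block-key"), nodes)))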
`
`Each Chord host maintains information about a number of other hosts, to allow it to
`
`efficiently map keys to hosts and to allow it to tolerate failures. Chord ensures that each
`
`host knows the identity (IP address, Chord identifier, and synthetic coordinates) of its
`
successor, the host with the next highest identifier. This knowledge organizes the hosts
`
`into a circular linked list sorted by identifier.
`
`In order to maintain the integrity of this organization if nodes fail, each node actually
`
maintains a successor list, which contains the identities of the r hosts that immediately
`
`follow the host in the identifier circle. If a node's successor is not responsive, the node
`
`replaces it with the next entry in its successor list. The Chord implementation used in
`
`this paper uses successor lists of length r = 16; this is 2 log2 N for the system evaluated in
`
`Section 5, as recommended in the Chord design [38].
`
`The lookup to map a key to a host could in principle be implemented in linear time
`
`by following the successor pointers. Chord builds a routing table, called the finger table,
`
`that allows it to perform lookups in O(log N) time, where N is the number of hosts in the
`
`Chord system. The finger table for a node n contains log N entries that point to hosts at
`
`power-of-two distances ahead of n along the identifier circle.
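
For illustration, the finger targets of a node n can be computed as below; entry i is the point n + 2^i on the circle, and the finger itself is the successor of that point. This is a sketch of the idea, not Chord's routing code.

# Sketch: the finger-table target points for node identifier n.  A lookup
# step jumps to the finger closest to (but not past) the key, halving the
# remaining identifier-space distance.
ID_SPACE = 1 << 160

def finger_targets(n: int) -> list:
    return [(n + (1 << i)) % ID_SPACE for i in range(160)]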
`
`A Chord host periodically checks the validity of its finger table and successor list entries
`
`in a process called stabilization. This process allows Chord to adapt to failed hosts and to
`
`15
`
`Page 15 of 64
`
`
`
`newly joining hosts. Chord also periodically tries to contact hosts that were alive in the
`
`past, but are no longer reachable; this allows Chord to notice when a network partition has
`
`healed.
`
`1.2.1 Chord API
`
Table 1.1 shows the external API Chord exposes to DHash.

Function                 Description
get_successor_list(n)    Contacts Chord node n and returns n's successor list. Each node in
                         the list includes its Chord ID, IP address and synthetic coordinates.
lookup(k, m)             Returns a list of at least m successors of key k. Each node in the
                         list includes its Chord ID, IP address and synthetic coordinates.

Table 1.1: Chord API
`
get_successor_list(n) is a simple accessor method for the Chord node n. It is implemented as a single network RPC call.
`
lookup(k, m), on the other hand, must send O(log N) RPCs in order to determine the m successors of key k. The value of m affects the latency of the lookup. Higher values of m constrain the lookup routing to travel through specific - potentially high latency - nodes. For example, when m = 16, the lookup must find the exact predecessor of k and request its successor list.
`
Smaller values of m permit flexibility in the routing, which allows high-latency nodes to be avoided. For example, the two Chord nodes preceding key k both have successor lists which contain at least 15 successors of k. So for m = 15, the lookup routing can choose the predecessor with the lower estimated latency. A node uses synthetic coordinates to estimate the latency to another node.
`
`16
`
`Page 16 of 64
`
`
`
`1.2.2 Synthetic Coordinates
`
`Synthetic coordinates allow Chord to predict the latency between nodes. The predicted
`
`latency in microseconds is equal to the Euclidean distance between the nodes' coordinates.
`
The main advantage of synthetic coordinates is that nodes can predict the latency to nodes with which they have never communicated directly: node X need only know node Y's coordinates to estimate the latency.
`
`When a Chord node first joins the system it chooses random coordinates for itself.
`
`Each Chord node continually improves its own coordinates by participating in a distributed
`
`machine learning algorithm, called Vivaldi, based on [14]. Each time a Chord node makes
`
`an RPC request to another node, it measures the network latency to the node. All RPC
`
`responses include the responding node's current coordinates. The requesting node refines
`
`its coordinates based on the latency measurement and the responding node's coordinates.
`
Synthetic coordinates do not generate any probe traffic themselves, but merely piggyback
`
`on existing Chord stabilization traffic.
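
The per-RPC refinement can be pictured as a small spring-relaxation step like the sketch below; the two-dimensional coordinates, step size, and function name are illustrative assumptions rather than the Vivaldi implementation described in [14].

# Sketch: nudge our coordinates so the predicted distance (in microseconds)
# moves toward the RTT just measured to the responding node.
def refine(my_xy, remote_xy, measured_rtt_us, step=0.05):
    dx = [m - r for m, r in zip(my_xy, remote_xy)]
    predicted = sum(d * d for d in dx) ** 0.5
    if predicted == 0.0:
        return list(my_xy)                  # degenerate case; real Vivaldi jitters
    error = measured_rtt_us - predicted     # > 0: predicted too small, push apart
    unit = [d / predicted for d in dx]
    return [m + step * error * u for m, u in zip(my_xy, unit)]

coords = [0.0, 0.0]
coords = refine(coords, [100.0, 0.0], measured_rtt_us=30000.0)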
`
`Both Chord and DHash use synthetic coordinates to reduce latency by performing server
`
`selection. When performing a lookup(), Chord uses coordinates to avoid routing through
`
`high latency nodes. DHash preferentially fetches data from low latency nodes when there
`
`are many possible nodes holding the data (see Section 2.4).
`
`17
`
`Page 17 of 64
`
`
`
`1.3 Thesis Overview
`
`This thesis starts by presenting the redesigned DHash using erasure encoding in Chapter 2.
`
`Chapter 3 describes the maintenance of the erasure encoded fragments and Chapter 4 details
`
`the database synchronization algorithms used by the maintenance protocols. In Chapter 5,
`
`an implementation of the system is evaluated for robustness and performance. Chapter 6
`
reviews related work. Finally, Chapter 7 concludes and suggests possible future
`
`work for DHash.
`
`18
`
`Page 18 of 64
`
`
`
`Chapter 2
`
`DHash: a distributed hash table
`
`The DHash servers form a distributed hash table, storing opaque blocks of data named by
`
`the SHA-1 hash of their contents. Clients can insert and retrieve blocks from this hash table.
`
`The storage required scales as the number of unique blocks, since identical blocks hash to
`
`the same server, where they are coalesced.
`
`The hash table is implemented as a collection of symmetric nodes (i.e., each node is no
`
`more special in function than any other node). Clients inserting blocks into DHash need
`
`not share any administrative relationship with servers storing blocks. DHash servers could
`
`be ordinary Internet hosts whose owners volunteer spare storage and network resources.
`
`DHash allows nodes to enter or leave the system at any time and divides the burden of
`
`storing and serving blocks among the servers.
`
`To increase data availability, DHash splits each block into 14 fragments using the IDA
`
`erasure code. Any 7 of these fragments are sufficient to reconstruct the block. DHash
`
`stores a block's fragments on the 14 Chord nodes immediately following the block's key. To
`
`maintain this proper placement of fragments, DHash transfers fragments between nodes as
`
`nodes enter and leave the system.
`
`19
`
`Page 19 of 64
`
`
`
`2.1 DHash API
`
`Table 2.1 shows that the external API exposed by DHash is minimal. There are calls to
`
`insert and retrieve a block.
`
Function     Description
put(k, b)    Stores the block b under the key k, where k = SHA-1(b).
get(k)       Fetches and returns the block associated with the key k.

Table 2.1: DHash API
`
`2.2 Block Availability
`
`Like many fault-tolerant storage systems, DHash uses erasure coding to increase availability
`
`with relatively little cost in extra storage and communication. DHash uses the IDA erasure
`
`code [31]. Given an original block of size s, IDA splits the block into f fragments of size
`
`s / k. Any k distinct fragments are sufficient to reconstruct the original block. Fragments
`
`are distinct if, in an information theoretic sense, they contain unique information.
`
`IDA has the ability to randomly generate new, probabilistically distinct fragments from
`
the block alone; it does not need to know which fragments already exist. From f randomly
generated fragments, any k are distinct with probability greater than (p-1)/p, where p is the
characteristic prime of the IDA implementation.
`
`This ability to easily generate new fragments contrasts with Reed-Solomon codes, which
`
`generate only a small set of fragments for a given rate. Other codes, such as Tornado codes
`
`and On-line codes, are targeted to provide efficiency asymptotically and do not perform
`
`well with small blocks.
`
`DHash leverages IDA's ability to generate probabilistically distinct random fragments
`
`to easily and efficiently reconstruct a missing fragment (for example, after a machine crash).
`
`20
`
`Page 20 of 64
`
`
`
`Instead of needing to find all existing fragments to ensure that the new fragment is distinct,
`
`DHash must only find enough fragments to reconstruct the block, and can then generate a
`
`new random fragment to replace the missing fragment.
`
`DHash implements IDA with f = 14, k = 7 and p = 65537. DHash stores a block's
`
`fragments on the f = 14 immediate successors of the block's key. When a block is originally
`
`inserted, the DHash code on the inserting client creates the f fragments and sends them
`
`to the first 14 successors (Section 2.3). When a client fetches a block, the client contacts
`
`enough successors to find k = 7 distinct fragments (Section 2.4). These fragments have
`
`a 65536-in-65537 chance of being able to reconstruct the original block. If reconstruction
`
`fails, DHash keeps trying with different sets of 7 fragments.
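
To make the encoding concrete, the following is a rough sketch of an IDA-style dispersal over GF(p) with the same parameters (f = 14, k = 7, p = 65537). It is an illustration only, not DHash's implementation of [31]: it ignores padding bookkeeping (a real implementation records the original block length) and simply assumes the k chosen coefficient vectors are independent, which per the text holds with probability greater than (p-1)/p.

# Rough sketch of IDA-style dispersal over GF(p), p = 65537, with DHash's
# parameters f = 14, k = 7.  Our own illustration, not the thesis's code.
import random

P, F, K = 65537, 14, 7

def ida_encode(block: bytes, nfrags: int = F):
    words = [int.from_bytes(block[i:i + 2], "big") for i in range(0, len(block), 2)]
    words += [0] * (-len(words) % K)                 # pad to a multiple of K words
    chunks = [words[i:i + K] for i in range(0, len(words), K)]
    frags = []
    for _ in range(nfrags):                          # each fragment: a random
        a = [random.randrange(P) for _ in range(K)]  # coefficient vector plus its
        frags.append((a, [sum(c * w for c, w in zip(a, ch)) % P for ch in chunks]))
    return frags                                     # dot product with every chunk

def ida_decode(frags):
    A = [list(f[0]) for f in frags[:K]]              # k coefficient rows
    B = [list(f[1]) for f in frags[:K]]              # k payload rows
    ncols = len(B[0])
    for col in range(K):                             # Gauss-Jordan elimination mod P;
        piv = next(r for r in range(col, K) if A[r][col])  # fails if rows not independent
        A[col], A[piv], B[col], B[piv] = A[piv], A[col], B[piv], B[col]
        inv = pow(A[col][col], P - 2, P)             # Fermat inverse of the pivot
        A[col] = [x * inv % P for x in A[col]]
        B[col] = [x * inv % P for x in B[col]]
        for r in range(K):
            if r != col and A[r][col]:
                f_ = A[r][col]
                A[r] = [(x - f_ * y) % P for x, y in zip(A[r], A[col])]
                B[r] = [(x - f_ * y) % P for x, y in zip(B[r], B[col])]
    words = [B[row][c] for c in range(ncols) for row in range(K)]
    return b"".join(w.to_bytes(2, "big") for w in words)

frags = ida_encode(b"some 8192-byte block would go here")
assert ida_decode(random.sample(frags, K)).startswith(b"some 8192-byte block")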
`
`A node may find that it holds a fragment for a block even though it is beyond the 14th
`
`successor. If it is the 15th or 16th successor, the node holds onto the fragment in case
`
`failures cause it to become one of the 14. Otherwise the node tries to send the fragment to
`
`one of the successors (Section 3).
`
The choices of f and k are selected to optimize for 8192-byte blocks in our system, which has a successor list length of r = 16. A setting of k = 7 creates 1170-byte fragments, which fit inside a single IP packet when combined with RPC overhead. Similarly, f = 14 interacts well with r = 16 by giving the lookups needed for store and fetch the flexibility to terminate at a low-latency node close to the key, not necessarily exactly at the key's predecessor.
`
`This choice of IDA parameters also gives reasonable fault tolerance: in a system with
`
`a large number of nodes and independent failures, the probability that seven or more of a
`
block's fragments will survive after 10% of the nodes fail is 0.99998 [40]. If two complete
`
`copies of each block were stored instead, using the same amount of space, the probability
`
`would be only 0.99.
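
These figures can be sanity-checked with a direct binomial calculation, assuming independent node failures with probability 0.10 (our own arithmetic, shown only to make the comparison concrete).

# A block survives if at least k = 7 of its f = 14 fragments remain.
from math import comb

survive = 0.90                                   # per-node survival probability
p_block = sum(comb(14, i) * survive**i * (1 - survive)**(14 - i) for i in range(7, 15))
print(round(p_block, 5))                         # -> 0.99998

# Same storage as two full replicas: the block is lost only if both copies fail.
print(1 - (1 - survive)**2)                      # -> 0.99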
`
`21
`
`Page 21 of 64
`
`
`
2.3 Block Insert: put (k, b)
`
`When an application wishes to insert a new block, it calls the DHash put (k, b) procedure.
`
`The DHash code running on the application's node implements put as follows:
`
void
put (k, b)
// place one fragment on each successor
{
    frags = IDAencode (b)
    succs = lookup (k, 14)
    for i in (0 .. 13)
        send (succs[i].ipaddr, k, frags[i])
}
`
`Figure 2-1: An implementation of DHash's put(k, b) procedure.
`
`The latency of the complete put() operation is likely to be dominated by the maximum
`
`round trip time to any of the 14 successors. The Chord lookup is likely to be relatively low
`
`latency: proximity routing allows it to contact nearby nodes, and the lookup can stop as
`
`soon as it gets to any of the three nodes preceding key k, since the 16-entry successor list
`
`of any of those nodes will contain the desired 14 successors of k. The cost of the Chord
`
`lookup is likely to be dominated by the latency to the nearest of the three predecessors.
`
`2.4 Block Fetch: get (k)
`
In order to fetch a block, a client must locate and retrieve enough IDA fragments to reassemble the original block. The interesting details are in how to avoid communicating with
`
`high-latency nodes and how to proceed when some fragments are not available.
`
`When a client application calls get (k), its local DHash first initiates a Chord call to
`
lookup (k, 7), in order to find the list of nodes likely to hold the block's fragments. The
`
`22
`
`Page 22 of 64
`
`
`
`lookup call will result in a list of between 7 and 16 of the nodes immediately succeeding key
`
`k.
`
`get() then chooses the seven of these successors with the lowest latency, estimated from
`
`their synthetic coordinates. It sends each of them an RPC to request a fragment of key k,
`
`in parallel. For each RPC that times out or returns an error reply, get() sends a fragment
`
request RPC to an as-yet-uncontacted successor from the list returned by lookup(). If the
`
`original call to lookup() returned fewer than 7 successors with distinct fragments, get()
`
`asks one of the successors it knows about for the complete list if it needs to. get() asks
`
`more successors for fragments if IDA fails to reconstruct the block because the fragments
`
`found were not distinct. If it cannot reconstruct the block after talking to the first 14
`
`successors, get() returns failure to the application.
`
`Before returning a reconstructed block to the application, get() checks that the SHA-1
`
`hash of the block's data is equal to the block's key. If it is not, get() returns an error.
`
`An application may occasionally need to repeatedly invoke get (k) to successfully fetch
`
`a given key. When nodes join or leave the system, fragments need to be transferred to the
`
`correct successor nodes. If the join or leave rate is high enough fragments may become
`
`misplaced and cause a block fetch to fail. This transient situation is repaired by the DHash
`
`maintenance algorithm presented in the next section and can be masked by retrying the
`
`get (k) on failure. By retrying, a client will see the semantics that DHash never loses a
`
`block and that all blocks are always available except those that have expired.
`
A persistent retry strategy reflects the assumption that a key that is retrieved is actually stored in DHash. The client using DHash can easily ensure this by recording keys in metadata and only retrieving keys recorded in this metadata.
`
`23
`
`Page 23 of 64
`
`
`
block get (k)
{
    // Collect fragments from the successors.
    frags = []                  // empty array
    succs = lookup (k, 7)
    sort_by_latency (succs)

    for (i = 0; i < #succs && i < 14; i++) {
        // download fragment
        <ret, data> = download (k, succs[i])
        if (ret == OK)
            frags.push (data)

        // decode fragments to recover block
        <ret, block> = IDAdecode (frags)
        if (ret == OK)
            return (SHA-1(block) != k) ? FAILURE : block

        if (i == #succs - 1) {
            newsuccs = get_successor_list (succs[i])
            sort_by_latency (newsuccs)
            succs.append (newsuccs)
        }
    }

    return FAILURE
}

Figure 2-2: An implementation of DHash's get (k) procedure.
`
`24
`
`Page 24 of 64
`
`
`
`Chapter 3
`
`Fragment Maintenance
`
`A DHash system is in the ideal state when three conditions hold for each inserted block:
`
`1. multiplicity: 14, 15, or 16 fragments exist.
`
`2. distinctness: All fragments are distinct with high probability.
`
3. location: Each of the 14 nodes succeeding the block's key stores a fragment; the

following two nodes optionally store a fragment; and no other nodes store fragments.
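
As a small illustration of the location condition, the check below takes a block's first 16 successors and the set of nodes currently holding one of its fragments; the names here are hypothetical, not DHash's.

# Sketch: the location condition holds when every one of the first 14
# successors holds a fragment and nobody outside the first 16 does.
def location_condition_holds(fragment_holders: set, successors: list) -> bool:
    required = set(successors[:14])
    allowed = set(successors[:16])
    return required <= fragment_holders and fragment_holders <= allowed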
`
The ideal state is attractive since it ensures all block fetches succeed, with high probability. Block inserts preserve the ideal state, since put (k, b) stores 14 distinct fragments of block b at the 14 Chord nodes succeeding key k.
`
`Chord membership changes, such as node joins and node failures, perturb DHash from
`
`the ideal state, and can cause block fetches to fail. The location condition is violated when
`
a new node storing no fragments joins within the set of 14 successor nodes of a block's key,
`
`since it does not store a fragment of the block. The multiplicity condition can be violated
`
`when nodes fail since fragments are lost from the system. The distinctness condition is not
`
`affected by node joins or failures.
`
`25
`
`Page 25 of 64
`
`
`
`To restore DHash to the ideal state, DHash runs two maintenance protocols: a local
`
`and a global protocol. The local maintenance protocol restores the multiplicity condition by
`
`recreating missing fragments. The global maintenance protocol moves misplaced fragments
`
(those that violate the location condition) to the correct nodes, restoring the location condition. The global maintenance protocol also restores the multiplicity condition: it detects and deletes extra fragments when more than 16 fragments exist for a block.
`
`Since membership changes happen continuously, DHash is rarely or never in the ideal
`
`state, but always tends toward it by continually running these maintenance protocols.
`
The maintenance protocols can restore DHash to its ideal state if there are at least 7 distinct fragments for each block located anywhere in the system, barring any more membership changes. First, the global maintenance protocol will move the misplaced fragments
`
`bership changes. First, the global maintenance protocol will move the misplaced fragments
`
`back to their successors, then the local maintenance protocol will recreate missing fragments
`
`until there are 14 fragments. If there are fewer than 7 distinct fragments for a block, that
`
`block is lost irrevocably.
`
`3.1 Global DHash Maintenance
`
`The global maintenance protocol pushes misplaced fragments to the correct nodes. Each
`
`DHash node scans its database of fragments and pushes any fragment that it stores, but
`
which fails the location condition, to one of the fragment's 14 successor hosts. For efficiency,
`
`the algorithm processes contiguous ranges of keys at once.
`
`Each DHash host continuously iterates through its fragment database in sorted order.
`
`It performs a Chord lookup() to learn the 16 successor nodes of a fragment's key. These
`
`nodes are the only hosts which should be storing the fragment. If the DHash host is one of
`
`these nodes, the host continues on to the next key in the database, as it should be storing
`
`26
`
`Page 26 of 64
`
`
`
global_maintenance (void)
{
    a = myID
    while (1) {
        <key, frag> = database.next (a)
        succs = lookup (key, 16)
        if (myID isbetween succs[0] and succs[15]) {
            // we should be storing key
            a = myID
        } else {
            // key is misplaced
            for each s in succs[0 .. 13] {
                response = send_db_keys (s, database[key .. succs[0]])
                for each key in response.desired_keys
                    if (database.contains (key)) {
                        upload (s, database.lookup (key))
                        database.delete (key)
                    }
            }
            database.delete_range ([key .. succs[0]])
            a = succs[0]
        }
    }
}
`
`Figure 3-1: An implementation of the global maintenance protocol.
`
`the key. Otherwise, the DHash host is storing a misplaced fragment and needs to push it
`
`to one of the fragment's 14 successors, in order to restore the location condition.
`
`The fragment's 14 successors should also store all keys ranging from the fragment's
`
`key up to the Chord ID of the key's immediate successor. Consequently, the DHash host
`
`processes this entire key range at once by sending all the database keys in this range to
`
each of the 14 successors. A successor responds with a message that it desires a key if it is missing that key, in which case the DHash host sends that fragment to that successor.
`
`It also deletes the fragment from its database to ensure that the fragment is only sent to
`
`exactly one other host, with high probability; otherwise the distinctness condition would
`
`be violated.
`
`27
`
`Page 27 of 64
`
`
`
`After offering all the keys to all the successors, the DHash node deletes all remaining
`
keys in the specified range from the database. These keys are safe to delete because they
`
`have been offered to all the successors, but were already present on each successor. The
`
`DHash node continues sweeping the database from the end of the key range just processed.
`
The call to database.next(a) skips the ranges of the key space for which no blocks
`
`are in the database. This causes the sweep (i.e., the number of iterations of the while loop
`
`needed to process the entire key space) to scale as the number of misplaced blocks in the
`
`database, not as the number of nodes in the entire Chord ring.
`
The global maintenance protocol adopts a push-based strategy where nodes push misplaced fragments to the correct locations. This strategy contrasts with pull-based strategies [15] where each node contacts other nodes and pulls fragments which it should store.
`
`In general, we've found that push-based strategies are more robust under highly dynamic
`
`membership change rates, where pull-based s