Robust and Efficient Data Management for a Distributed Hash Table

by

Josh Cates

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of

Master of Engineering in Computer Science and Engineering

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2003

© 2003 Josh Cates. All rights reserved.

The author hereby grants M.I.T. permission to reproduce and distribute publicly
paper and electronic copies of this thesis and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, 16 May 2003

Certified by: M. Frans Kaashoek, Professor of Computer Science and Engineering, Thesis Supervisor

Certified by: Robert Morris, Assistant Professor of Computer Science and Engineering, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students
Robust and Efficient Data Management for a Distributed Hash Table

by

Josh Cates

Submitted to the Department of Electrical Engineering and Computer Science
on 16 May, 2003, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Computer Science and Engineering

Abstract

This thesis presents a new design and implementation of the DHash distributed hash table
based on erasure encoding. This design is both more robust and more efficient than the
previous replication-based implementation [15].

DHash uses erasure coding to store each block as a set of fragments. Erasure coding
increases availability while saving storage and communication costs compared to a
replication-based design. DHash combines Chord's synthetic coordinates with the set of
fragments to implement server selection on block retrieval.

DHash enhances robustness by implementing efficient fragment maintenance protocols.
These protocols restore missing or misplaced fragments caused by hosts joining and leaving
the system.

Experiments with a 270-node DHash system running on the PlanetLab [1] and RON [4]
testbeds show that the changes to DHash increase the rate at which the system can fetch
data by a factor of six, and decrease the latency of a single fetch by more than a factor of
two. The maintenance protocols ensure that DHash is robust without penalizing performance.
Even at large database sizes, the per-host memory footprint is less than 10 MB and the
per-host network bandwidth is under 2 KB/sec over a wide range of system half-lives.

Thesis Supervisor: M. Frans Kaashoek
Title: Professor of Computer Science and Engineering

Thesis Supervisor: Robert Morris
Title: Assistant Professor of Computer Science and Engineering
`"Of the gladdest moments in human life, methinks, is the departure upon a distant
`journey into unknown lands. Shaking off with one mighty effort, the fetters of Habit, the
`leaden weight of Routine, the cloak of many cares and the slavery of Home, one feels once
`more happy."
`
`Sir Richard Burton - Journal Entry - 1856
`
Acknowledgments

I would like to thank Prof. Frans Kaashoek for his enduring support and guidance. From
when I was an undergraduate UROPer until today, he has been a constant source of
encouragement. His efforts have afforded me the opportunity to work with an unforgettable
group of faculty and students. In particular, this thesis benefits greatly from his insights,
suggestions, and patient revisions.

This thesis is based on joint work with the members of the Chord project. The primary
contribution of this thesis is the use of erasure coding and fragment repair within DHash,
which couldn't have been designed and implemented without all of the work of Russ Cox,
Frank Dabek, Prof. Frans Kaashoek, Prof. Robert Morris, James Robertson, Emil Sit and
Jacob Strauss. Discussions with them have improved the system considerably. Prof. Morris
deserves to be thanked for his especially lucid design feedback. Emil Sit coded significant
parts of the work described in this thesis.

A huge thanks is due to all the members of the PDOS research group for making work
a lot of fun. They have made for a collection of great memories. I'm very grateful to Chris
Zimman who opened my eyes to MIT in the first place. Finally, I'd like to thank my parents
for their constant encouragement.

This research was partially supported by the IRIS project (http://project-iris.net/),
by the National Science Foundation under Cooperative Agreement No. ANI-0225660, by
the Oxygen project (http://oxygen.lcs.mit.edu/), by the RON project (http://ron.lcs.mit.edu/),
and by the Defense Advanced Research Projects Agency (DARPA).
Contents

1 Introduction
  1.1 Peer-to-peer Off-site Backup
  1.2 Background: Chord
      1.2.1 Chord API
      1.2.2 Synthetic Coordinates
  1.3 Thesis Overview

2 DHash: a distributed hash table
  2.1 DHash API
  2.2 Block Availability
  2.3 Block Insert: put(k, b)
  2.4 Block Fetch: get(k)

3 Fragment Maintenance
  3.1 Global DHash Maintenance
  3.2 Local DHash Maintenance

4 Database Synchronization
  4.1 Approach
  4.2 Database Properties
  4.3 Database Index Format
      4.3.1 Index Insertion
  4.4 Network Protocol

5 Evaluation
  5.1 Basic Performance
  5.2 Fetch Latency
  5.3 Synchronization Dynamics
  5.4 Ideal State Maintenance
  5.5 Effect of Half-Life
  5.6 Synchronization Overhead
  5.7 Memory Usage
  5.8 Fault Tolerance

6 Related Work
  6.1 Cooperative Backup
  6.2 Server Selection and Spreading
  6.3 Coding
  6.4 Replica Synchronization

7 Summary
  7.1 Conclusions
  7.2 Future Work
Chapter 1

Introduction

DHTs have been proposed as a way to simplify the construction of large-scale distributed
applications (e.g., [2, 26]). DHTs¹ store blocks of data on a collection of nodes spread
throughout the Internet. Each block is identified by a unique key. The goals of these DHTs
are to spread the load of storing and serving data across all the nodes and to keep the data
available as nodes join and leave the system.

¹The DHT community has some disagreement about what the best API for a DHT is [16, 19] and the
OceanStore project uses a DOLR instead of a DHT [33], but for most of the contributions of this thesis
these distinctions are not very significant.

This thesis presents a new design, based on erasure coding, for distributing and storing
blocks within DHash, an existing DHT implementation. These changes make DHash a robust,
efficient and practical DHT for demanding applications such as cooperative backup [13, 18,
24]. Such an application requires that the DHT keep data available despite faults and that
the DHT efficiently serve bulk data (unlike, for example, a naming system).

Like many fault-tolerant systems, DHash uses erasure coding to increase availability
with relatively little cost in extra storage and communication. DHash stores each block as
a set of erasure encoded fragments; the block itself is never stored by the system. This work
extends the previous DHash system, which replicated blocks for fault-tolerance [15].

The main contribution of this thesis is the way DHash combines erasure encoded storage
with the other techniques and properties of Chord to provide robust and efficient operation.
These other techniques include proximity routing, server selection, and successor lists. As
we demonstrate through experiments with the DHash implementation, the synergy between
the techniques makes them more effective as a collection than individually.

The second contribution of this thesis is the design and implementation of a pair of
fragment maintenance protocols that ensure DHash is robust: i.e., that each inserted block
can subsequently be retrieved, even if nodes join and leave the system. These protocols
restore any fragments which get destroyed or misplaced when hosts join or leave the system.
The challenge is to make these protocols as efficient as possible given a very general failure
model.

The final contribution of this thesis is a detailed evaluation of DHash. Our evaluation of
the previous, replication-based DHash running on the PlanetLab [1] and RON [4] test-beds
makes clear that it is inadequate to support data-intensive applications which require high
data availability. DHash, with our changes, offers increased block-download throughput,
reduced block fetch latency, and improved availability for a given space replication budget.
1.1 Peer-to-peer Off-site Backup

In order to help guide design decisions for DHash, we implemented a cooperative off-site
backup system. The off-site backups are intended to complement conventional tape or
disk-to-disk backups by adding an extra level of availability and providing a browseable
archive of backups. The off-site backup system can be used alone if desired.

The off-site backup system's goals are to support recovery after a disaster by keeping
snapshots of file systems at other Internet sites. The system spreads the data over many
sites in order to balance storage and network load. This striping also allows very large
file systems to be backed up onto a set of hosts with individually limited disk space. The
backup system performs daily incremental backups; each new backup shares storage with
the unchanged part of the previous backups.

The intended users of the backup system are informal groups of people at geographically
distributed sites who know each other; for example, colleagues at different university
computer science departments. Each site is expected to make available spare disk space on
workstations. These workstations are likely to be reasonably reliable and have fairly fast
network connections.

Since the backup system sends file system copies over the Internet, communication
performance is important; it must be possible to back up a full day's incremental changes
to a typical server file system in a few hours. Performance is more sensitive to network
throughput than to latency, since the backup system usually has large quantities of data
that can be sent concurrently. Storing and fetching data are both important because the
backup system allows users to browse the archive of old snapshots.

At a low level, the backup system works in units of 8192-byte disk blocks; it scans
the disk to be backed up, looking for blocks that have changed since the last backup and
ignoring blocks that are not allocated. It inserts each new block into DHash, using the hash
of the block's contents as its key. The system builds a tree with the blocks as leaves in
order to be able to map disk block numbers to block contents; the tree's interior nodes are
also stored in DHash.
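To make the block-level scheme concrete, the following sketch (in C++) shows how a backup
pass might hash each changed block, insert it under its content hash, and record the
resulting keys in an interior node that is itself stored in DHash. This is an illustration
only: the dhash_put and sha1 helpers, the in-memory table, and the flat interior-node
encoding are assumptions standing in for the real backup system and DHash client
interfaces, and std::hash substitutes for SHA-1 purely to keep the example self-contained.

// Sketch only: content-addressed insertion of changed disk blocks.
// dhash_put() and sha1() are hypothetical stand-ins for the real interfaces.
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Key = uint64_t;                       // stand-in for a 160-bit SHA-1 key

// Stand-in for SHA-1; the real backup system keys each block by SHA-1(contents).
Key sha1(const std::string& data) { return std::hash<std::string>{}(data); }

// Stand-in for DHash put(k, b): here just an in-memory table.
std::map<Key, std::string> dhash;
void dhash_put(Key k, const std::string& b) { dhash[k] = b; }

int main() {
    // Pretend these are the 8192-byte blocks that changed since the last backup.
    std::vector<std::pair<uint64_t, std::string>> changed = {
        {17, std::string(8192, 'a')}, {42, std::string(8192, 'b')}};

    // Interior node: maps disk block numbers to the keys of their contents.
    std::map<uint64_t, Key> interior;
    for (const auto& [blockno, contents] : changed) {
        Key k = sha1(contents);             // the content hash is the DHash key
        dhash_put(k, contents);             // identical blocks coalesce to one key
        interior[blockno] = k;
    }

    // The interior node is itself serialized and inserted into DHash, so a whole
    // snapshot is reachable from a single root key.
    std::string serialized;
    for (const auto& [blockno, k] : interior)
        serialized += std::to_string(blockno) + ":" + std::to_string(k) + ";";
    dhash_put(sha1(serialized), serialized);
    return 0;
}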
Each daily backup has its own tree, though in practice each tree shares most of its DHash
storage with previous backups. The block orientation of this scheme is a good match for
the block interface of DHash. Because the backup system performs sharing at the block
level, it is not necessary to search for a machine whose file system is similar overall; while
such a file system may exist when a workstation is being backed up [13], the backup of a
large server holding home directories is unlikely to find such a partner.

In summary, the backup system places the following requirements on DHash: 1) high
availability, 2) high throughput for bulk transfers, 3) low latency for interactive browsing
of the archived backups, and 4) good support for block-size operations.
1.2 Background: Chord

DHash uses Chord [38] to help determine on which host to store each piece of data. Chord
implements a hash-like lookup operation that maps 160-bit data keys to hosts. Chord
assigns each host an identifier drawn from the same 160-bit space as the keys. This identifier
space can be viewed as a circle, in which the highest identifier is followed by zero. Chord
maps each key to the host whose identifier most closely follows the key.

Each Chord host maintains information about a number of other hosts, to allow it to
efficiently map keys to hosts and to allow it to tolerate failures. Chord ensures that each
host knows the identity (IP address, Chord identifier, and synthetic coordinates) of its
successor, the host with the next highest identifier. This knowledge organizes the hosts
into a circular linked list sorted by identifier.

In order to maintain the integrity of this organization if nodes fail, each node actually
maintains a successor list, which contains the identities of the r hosts that immediately
follow the host in the identifier circle. If a node's successor is not responsive, the node
replaces it with the next entry in its successor list. The Chord implementation used in
this paper uses successor lists of length r = 16; this is 2 log2 N for the system evaluated in
Section 5, as recommended in the Chord design [38].

The lookup to map a key to a host could in principle be implemented in linear time
by following the successor pointers. Chord builds a routing table, called the finger table,
that allows it to perform lookups in O(log N) time, where N is the number of hosts in the
Chord system. The finger table for a node n contains log N entries that point to hosts at
power-of-two distances ahead of n along the identifier circle.
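As an illustration of this routing state, the sketch below computes a node's finger targets
at power-of-two distances around the identifier circle and finds the successor of a key by
scanning a sorted set of node identifiers. It is a simplification for illustration: a 64-bit
identifier space stands in for Chord's 160-bit space, and the node set is a plain in-memory
container rather than distributed routing state.

// Sketch: finger-table targets and successor lookup on the identifier circle.
// Uses a 64-bit ID space for brevity; Chord's real space is 160 bits.
#include <cstdint>
#include <cstdio>
#include <set>

using ID = uint64_t;

// Successor of key k: the first node ID >= k, wrapping past the highest ID to 0.
ID successor(const std::set<ID>& nodes, ID k) {
    auto it = nodes.lower_bound(k);
    return (it != nodes.end()) ? *it : *nodes.begin();
}

int main() {
    std::set<ID> nodes = {10, 900, 4000, 70000, ID(1) << 20, ID(1) << 40};
    ID n = 900;

    // Finger i of node n points at the successor of n + 2^i (mod 2^64),
    // which is what gives lookups their O(log N) hop count.
    for (int i = 0; i < 64; i += 8) {
        ID target = n + (ID(1) << i);     // unsigned overflow wraps, like the circle
        std::printf("finger %2d -> target %llu, node %llu\n",
                    i, (unsigned long long)target,
                    (unsigned long long)successor(nodes, target));
    }

    // A key is stored at the node whose identifier most closely follows it.
    std::printf("successor(5000) = %llu\n",
                (unsigned long long)successor(nodes, 5000));
    return 0;
}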
A Chord host periodically checks the validity of its finger table and successor list entries
in a process called stabilization. This process allows Chord to adapt to failed hosts and to
newly joining hosts. Chord also periodically tries to contact hosts that were alive in the
past, but are no longer reachable; this allows Chord to notice when a network partition has
healed.
1.2.1 Chord API

Table 1.1 shows the external API Chord exposes to DHash.

Function                 Description
get_successor_list(n)    Contacts Chord node n and returns n's successor list. Each node in
                         the list includes its Chord ID, IP address and synthetic coordinates.
lookup(k, m)             Returns a list of at least m successors of key k. Each node in the
                         list includes its Chord ID, IP address and synthetic coordinates.

Table 1.1: Chord API

get_successor_list(n) is a simple accessor method for the Chord node n. It is implemented
as a single network RPC call.

lookup(k, m), on the other hand, must send O(log N) RPCs in order to determine
the m successors of key k. The value of m affects the latency of the lookup. Higher values
of m constrain the lookup routing to travel through specific - potentially high latency -
nodes. For example, when m = 16, the lookup finds the exact predecessor of k and requests
its successor list.

Smaller values of m permit flexibility in the routing, which allows high latency nodes to
be avoided. For example, the two Chord nodes preceding key k both have successor lists
which contain at least 15 successors of k. So for m = 15, the lookup routing can choose the
predecessor with lower estimated latency. A node uses synthetic coordinates to estimate
the latency to another node.
1.2.2 Synthetic Coordinates

Synthetic coordinates allow Chord to predict the latency between nodes. The predicted
latency in microseconds is equal to the Euclidean distance between the nodes' coordinates.
The main advantage of synthetic coordinates is that a node can predict the latency to nodes
with which it has never communicated directly: node X need only know node Y's
coordinates to estimate the latency.

When a Chord node first joins the system it chooses random coordinates for itself.
Each Chord node continually improves its own coordinates by participating in a distributed
machine learning algorithm, called Vivaldi, based on [14]. Each time a Chord node makes
an RPC request to another node, it measures the network latency to the node. All RPC
responses include the responding node's current coordinates. The requesting node refines
its coordinates based on the latency measurement and the responding node's coordinates.
Synthetic coordinates do not generate any probe traffic themselves, but merely piggyback
on existing Chord stabilization traffic.
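The sketch below illustrates the flavor of such a coordinate refinement after one RPC: the
node moves its own coordinates along the line to the responder in proportion to the error
between the measured and predicted latency. The 2-D coordinates and the fixed step size
are simplifying assumptions; they are not the exact Vivaldi update rule or parameters used
by Chord.

// Sketch of a Vivaldi-style coordinate update: after measuring the RTT to a
// peer, move our coordinates so that the Euclidean distance better predicts
// the measured latency.
#include <cmath>
#include <cstdio>

struct Coord { double x, y; };

double predicted_latency(Coord a, Coord b) {
    return std::hypot(a.x - b.x, a.y - b.y);            // latency in microseconds
}

// delta is the step size; real Vivaldi adapts it, here it is simply fixed.
void vivaldi_update(Coord& self, Coord peer, double measured_us, double delta = 0.25) {
    double dist = predicted_latency(self, peer);
    if (dist == 0) { self.x += 1e-3; return; }          // avoid a zero-length direction
    double error = measured_us - dist;                  // > 0 means we predict too close
    double ux = (self.x - peer.x) / dist;               // unit vector away from the peer
    double uy = (self.y - peer.y) / dist;
    self.x += delta * error * ux;                       // push away if too close,
    self.y += delta * error * uy;                       // pull toward the peer if too far
}

int main() {
    Coord self = {0, 0}, peer = {3000, 4000};           // peer predicted 5000 us away
    for (int i = 0; i < 20; i++)
        vivaldi_update(self, peer, 8000 /* measured RTT, us */);
    std::printf("predicted after updates: %.0f us\n", predicted_latency(self, peer));
    return 0;
}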
Both Chord and DHash use synthetic coordinates to reduce latency by performing server
selection. When performing a lookup(), Chord uses coordinates to avoid routing through
high latency nodes. DHash preferentially fetches data from low latency nodes when there
are many possible nodes holding the data (see Section 2.4).
1.3 Thesis Overview

This thesis starts by presenting the redesigned DHash using erasure encoding in Chapter 2.
Chapter 3 describes the maintenance of the erasure encoded fragments and Chapter 4 details
the database synchronization algorithms used by the maintenance protocols. In Chapter 5,
an implementation of the system is evaluated for robustness and performance. Chapter 6
reviews the related work. Finally, Chapter 7 concludes and suggests possible future work
for DHash.
Chapter 2

DHash: a distributed hash table

The DHash servers form a distributed hash table, storing opaque blocks of data named by
the SHA-1 hash of their contents. Clients can insert and retrieve blocks from this hash table.
The storage required scales as the number of unique blocks, since identical blocks hash to
the same server, where they are coalesced.

The hash table is implemented as a collection of symmetric nodes (i.e., each node is no
more special in function than any other node). Clients inserting blocks into DHash need
not share any administrative relationship with servers storing blocks. DHash servers could
be ordinary Internet hosts whose owners volunteer spare storage and network resources.
DHash allows nodes to enter or leave the system at any time and divides the burden of
storing and serving blocks among the servers.

To increase data availability, DHash splits each block into 14 fragments using the IDA
erasure code. Any 7 of these fragments are sufficient to reconstruct the block. DHash
stores a block's fragments on the 14 Chord nodes immediately following the block's key. To
maintain this proper placement of fragments, DHash transfers fragments between nodes as
nodes enter and leave the system.
2.1 DHash API

Table 2.1 shows that the external API exposed by DHash is minimal. There are calls to
insert and retrieve a block.

Function     Description
put(k, b)    Stores the block b under the key k, where k = SHA-1(b).
get(k)       Fetches and returns the block associated with the key k.

Table 2.1: DHash API
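A minimal usage sketch of this two-call interface follows, assuming a hypothetical
single-node client in which an in-memory table stands in for the distributed system and
std::hash stands in for SHA-1; it only illustrates the put/get contract that the key equals
the content hash of the block.

// Usage sketch of the DHash API; the table and sha1() are local stand-ins.
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <optional>
#include <string>

using Key = uint64_t;
Key sha1(const std::string& b) { return std::hash<std::string>{}(b); }  // stand-in

std::map<Key, std::string> table;

void put(Key k, const std::string& b) { table[k] = b; }   // k must equal SHA-1(b)

std::optional<std::string> get(Key k) {
    auto it = table.find(k);
    if (it == table.end()) return std::nullopt;
    return it->second;
}

int main() {
    std::string block(8192, 'x');
    Key k = sha1(block);
    put(k, block);

    auto fetched = get(k);
    assert(fetched && sha1(*fetched) == k);   // content-hash keys are self-certifying
    return 0;
}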
2.2 Block Availability

Like many fault-tolerant storage systems, DHash uses erasure coding to increase availability
with relatively little cost in extra storage and communication. DHash uses the IDA erasure
code [31]. Given an original block of size s, IDA splits the block into f fragments of size
s/k. Any k distinct fragments are sufficient to reconstruct the original block. Fragments
are distinct if, in an information theoretic sense, they contain unique information.

IDA has the ability to randomly generate new, probabilistically distinct fragments from
the block alone; it does not need to know which fragments already exist. From f randomly
generated fragments, any k are distinct with probability greater than (p - 1)/p, where p is
the characteristic prime of the IDA implementation.

This ability to easily generate new fragments contrasts with Reed-Solomon codes, which
generate only a small set of fragments for a given rate. Other codes, such as Tornado codes
and On-line codes, are targeted to provide efficiency asymptotically and do not perform
well with small blocks.

DHash leverages IDA's ability to generate probabilistically distinct random fragments
to easily and efficiently reconstruct a missing fragment (for example, after a machine crash).
Instead of needing to find all existing fragments to ensure that the new fragment is distinct,
DHash must only find enough fragments to reconstruct the block, and can then generate a
new random fragment to replace the missing fragment.
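The sketch below illustrates the idea behind this style of coding (it is not the thesis's IDA
implementation): working over GF(p) with p = 65537 and k = 7, each fragment consists of a
freshly drawn random coefficient vector together with its dot products against successive
k-word chunks of the block, so new fragments can be generated from the block alone; any k
fragments whose coefficient vectors are linearly independent recover the block by solving a
k-by-k linear system per chunk.

// Illustrative IDA-style encode/decode over GF(65537) with k = 7.
#include <cassert>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

using u64 = uint64_t;
static const u64 P = 65537;            // characteristic prime
static const int K = 7;                // fragments needed to reconstruct

u64 powmod(u64 b, u64 e) {             // b^e mod P (P is prime)
    u64 r = 1; b %= P;
    for (; e; e >>= 1, b = b * b % P) if (e & 1) r = r * b % P;
    return r;
}
u64 inv(u64 a) { return powmod(a, P - 2); }   // Fermat inverse

struct Fragment { std::vector<u64> coeff, data; };

// One randomly generated fragment; no knowledge of existing fragments needed.
Fragment make_fragment(const std::vector<u64>& words, std::mt19937& rng) {
    Fragment f;
    std::uniform_int_distribution<u64> d(0, P - 1);
    for (int i = 0; i < K; i++) f.coeff.push_back(d(rng));
    for (size_t c = 0; c < words.size(); c += K) {
        u64 s = 0;
        for (int i = 0; i < K; i++) s = (s + f.coeff[i] * words[c + i]) % P;
        f.data.push_back(s);           // dot product of coeff with this chunk
    }
    return f;
}

// Decode from any K fragments: Gauss-Jordan elimination mod P, one system
// shared by every chunk (the coefficient matrix is the same for all chunks).
std::vector<u64> decode(std::vector<Fragment> fr, size_t nwords) {
    size_t chunks = nwords / K;
    std::vector<std::vector<u64>> m(K, std::vector<u64>(K + chunks));
    for (int r = 0; r < K; r++) {
        for (int c = 0; c < K; c++) m[r][c] = fr[r].coeff[c];
        for (size_t c = 0; c < chunks; c++) m[r][K + c] = fr[r].data[c];
    }
    for (int col = 0; col < K; col++) {
        int piv = col;
        while (piv < K && m[piv][col] == 0) piv++;
        assert(piv < K && "coefficient vectors not independent");
        std::swap(m[col], m[piv]);
        u64 ic = inv(m[col][col]);
        for (auto& v : m[col]) v = v * ic % P;
        for (int r = 0; r < K; r++) {
            if (r == col || m[r][col] == 0) continue;
            u64 f = m[r][col];
            for (size_t c = col; c < m[r].size(); c++)
                m[r][c] = (m[r][c] + (P - f) * m[col][c]) % P;
        }
    }
    std::vector<u64> out(nwords);
    for (size_t c = 0; c < chunks; c++)
        for (int i = 0; i < K; i++) out[c * K + i] = m[i][K + c];
    return out;
}

int main() {
    std::mt19937 rng(1);
    std::vector<u64> block(7 * 4);                 // 4 chunks of K 16-bit words
    for (size_t i = 0; i < block.size(); i++) block[i] = (i * 12345) % 65536;

    std::vector<Fragment> frags;
    for (int j = 0; j < 14; j++) frags.push_back(make_fragment(block, rng));   // f = 14

    std::vector<Fragment> any7(frags.begin() + 3, frags.begin() + 10);         // any K
    assert(decode(any7, block.size()) == block);
    return 0;
}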
DHash implements IDA with f = 14, k = 7 and p = 65537. DHash stores a block's
fragments on the f = 14 immediate successors of the block's key. When a block is originally
inserted, the DHash code on the inserting client creates the f fragments and sends them
to the first 14 successors (Section 2.3). When a client fetches a block, the client contacts
enough successors to find k = 7 distinct fragments (Section 2.4). These fragments have
a 65536-in-65537 chance of being able to reconstruct the original block. If reconstruction
fails, DHash keeps trying with different sets of 7 fragments.

A node may find that it holds a fragment for a block even though it is beyond the 14th
successor. If it is the 15th or 16th successor, the node holds onto the fragment in case
failures cause it to become one of the 14. Otherwise the node tries to send the fragment to
one of the successors (Section 3).

The values of f and k were chosen to optimize for 8192-byte blocks in our system, which
has a successor list length of r = 16. A setting of k = 7 creates 1170-byte fragments, which
fit inside a single IP packet when combined with RPC overhead. Similarly, f = 14 interacts
well with r = 16 by giving the lookups needed for store and fetch the flexibility to terminate
at a low latency node close to the key, not necessarily exactly at the key's predecessor.

This choice of IDA parameters also gives reasonable fault tolerance: in a system with
a large number of nodes and independent failures, the probability that seven or more of a
block's fragments will survive after 10% of the nodes fail is 0.99998 [40]. If two complete
copies of each block were stored instead, using the same amount of space, the probability
would be only 0.99.
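Those two availability figures can be checked with a simple binomial calculation, assuming
each node fails independently with probability 0.1: a block survives if at least 7 of its 14
fragments remain, while a block stored as two full replicas survives only if at least one of
the 2 copies remains. A small sketch of that arithmetic:

// Check of the availability comparison: at least 7 of 14 fragments surviving,
// versus at least 1 of 2 full replicas, with each node failing independently
// with probability 0.1.
#include <cmath>
#include <cstdio>

double choose(int n, int k) {
    double c = 1;
    for (int i = 1; i <= k; i++) c = c * (n - k + i) / i;
    return c;
}

// P[at least `need` of `n` pieces survive], each surviving with prob 1 - pfail.
double survives(int n, int need, double pfail) {
    double p = 0;
    for (int s = need; s <= n; s++)
        p += choose(n, s) * std::pow(1 - pfail, s) * std::pow(pfail, n - s);
    return p;
}

int main() {
    std::printf("14 fragments, need 7: %.5f\n", survives(14, 7, 0.1));  // ~0.99998
    std::printf("2 replicas,  need 1: %.5f\n", survives(2, 1, 0.1));    // 0.99
    return 0;
}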
2.3 Block Insert: put(k, b)

When an application wishes to insert a new block, it calls the DHash put(k, b) procedure.
The DHash code running on the application's node implements put as follows:

void
put (k, b)
{
    // place one fragment on each successor
    frags = IDAencode (b)
    succs = lookup (k, 14)
    for i in (0 .. 13)
        send (succs[i].ipaddr, k, frags[i])
}

Figure 2-1: An implementation of DHash's put(k, b) procedure.
The latency of the complete put() operation is likely to be dominated by the maximum
round trip time to any of the 14 successors. The Chord lookup is likely to be relatively low
latency: proximity routing allows it to contact nearby nodes, and the lookup can stop as
soon as it gets to any of the three nodes preceding key k, since the 16-entry successor list
of any of those nodes will contain the desired 14 successors of k. The cost of the Chord
lookup is likely to be dominated by the latency to the nearest of the three predecessors.
2.4 Block Fetch: get(k)

In order to fetch a block, a client must locate and retrieve enough IDA fragments to
reassemble the original block. The interesting details are in how to avoid communicating
with high-latency nodes and how to proceed when some fragments are not available.

When a client application calls get(k), its local DHash first initiates a Chord call to
lookup(k, 7), in order to find the list of nodes likely to hold the block's fragments. The
lookup call will result in a list of between 7 and 16 of the nodes immediately succeeding
key k.

get() then chooses the seven of these successors with the lowest latency, estimated from
their synthetic coordinates. It sends each of them an RPC to request a fragment of key k,
in parallel. For each RPC that times out or returns an error reply, get() sends a fragment
request RPC to an as-yet-uncontacted successor from the list returned by lookup(). If the
original call to lookup() returned fewer than 7 successors with distinct fragments, get()
asks one of the successors it knows about for the complete list if it needs to. get() asks
more successors for fragments if IDA fails to reconstruct the block because the fragments
found were not distinct. If it cannot reconstruct the block after talking to the first 14
successors, get() returns failure to the application.

Before returning a reconstructed block to the application, get() checks that the SHA-1
hash of the block's data is equal to the block's key. If it is not, get() returns an error.

An application may occasionally need to repeatedly invoke get(k) to successfully fetch
a given key. When nodes join or leave the system, fragments need to be transferred to the
correct successor nodes. If the join or leave rate is high enough, fragments may become
misplaced and cause a block fetch to fail. This transient situation is repaired by the DHash
maintenance algorithm presented in the next section and can be masked by retrying the
get(k) on failure. By retrying, a client will see the semantics that DHash never loses a
block and that all blocks are always available except those that have expired.

A persistent retry strategy reflects the assumption that a key that is retrieved is actually
stored in DHash. The client using DHash can easily ensure this by recording keys in
metadata and only retrieving keys recorded in this metadata.
block get (k)
{
    // Collect fragments from the successors.
    frags = []                          // empty array
    succs = lookup (k, 7)
    sort_by_latency (succs)

    for (i = 0; i < #succs && i < 14; i++) {
        // download fragment
        <ret, data> = download (k, succs[i])
        if (ret == OK)
            frags.push (data)

        // decode fragments to recover block
        <ret, block> = IDAdecode (frags)
        if (ret == OK)
            return (SHA-1(block) != k) ? FAILURE : block

        if (i == #succs - 1) {
            newsuccs = get_successor_list (succs[i])
            sort_by_latency (newsuccs)
            succs.append (newsuccs)
        }
    }

    return FAILURE
}

Figure 2-2: An implementation of DHash's get(k) procedure.
Chapter 3

Fragment Maintenance

A DHash system is in the ideal state when three conditions hold for each inserted block:

1. multiplicity: 14, 15, or 16 fragments exist.

2. distinctness: All fragments are distinct with high probability.

3. location: Each of the 14 nodes succeeding the block's key stores a fragment; the
   following two nodes optionally store a fragment; and no other nodes store fragments.

The ideal state is attractive since it ensures all block fetches succeed, with high
probability. Block inserts preserve the ideal state, since put(k, b) stores 14 distinct fragments
of block b at the 14 Chord nodes succeeding key k.
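To make the location condition concrete, the sketch below tests whether a given node is
among the 14 nodes required to store a fragment of a key (the 15th and 16th successors may
optionally hold one as well, which this simple check does not model). The 64-bit identifier
space and the explicit sorted node set are simplifications for illustration.

// Sketch: does node n satisfy the location condition for key k, i.e. is it one
// of the 14 successor nodes of k on the identifier circle?
#include <cstdint>
#include <cstdio>
#include <set>
#include <vector>

using ID = uint64_t;

// The m nodes that immediately follow key k, wrapping around the circle.
std::vector<ID> successors(const std::set<ID>& nodes, ID k, int m) {
    std::vector<ID> out;
    auto it = nodes.lower_bound(k);
    while ((int)out.size() < m) {
        if (it == nodes.end()) it = nodes.begin();    // wrap past the highest ID
        out.push_back(*it++);
    }
    return out;
}

bool satisfies_location(const std::set<ID>& nodes, ID k, ID n) {
    for (ID s : successors(nodes, k, 14))
        if (s == n) return true;
    return false;
}

int main() {
    std::set<ID> nodes;
    for (int i = 1; i <= 32; i++) nodes.insert(ID(i) * 1000);   // 32 nodes on the ring

    ID key = 4500;
    std::printf("node 5000 may hold a fragment of key 4500: %d\n",
                satisfies_location(nodes, key, 5000));    // 1: among the 14 successors
    std::printf("node 25000 may hold it: %d\n",
                satisfies_location(nodes, key, 25000));   // 0: beyond the 14th successor
    return 0;
}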
Chord membership changes, such as node joins and node failures, perturb DHash from
the ideal state, and can cause block fetches to fail. The location condition is violated when
a new node storing no fragments joins within the set of 14 successor nodes of a block's key,
since it does not store a fragment of the block. The multiplicity condition can be violated
when nodes fail, since fragments are lost from the system. The distinctness condition is not
affected by node joins or failures.
To restore DHash to the ideal state, DHash runs two maintenance protocols: a local
and a global protocol. The local maintenance protocol restores the multiplicity condition by
recreating missing fragments. The global maintenance protocol moves misplaced fragments
(those that violate the location condition) to the correct nodes, restoring the location
condition. The global maintenance protocol also restores the multiplicity condition: it detects
and deletes extra fragments when more than 16 fragments exist for a block.

Since membership changes happen continuously, DHash is rarely or never in the ideal
state, but always tends toward it by continually running these maintenance protocols.

The maintenance protocols can restore DHash to its ideal state if there are at least 7
distinct fragments for each block located anywhere in the system, barring any more
membership changes. First, the global maintenance protocol will move the misplaced fragments
back to their successors, then the local maintenance protocol will recreate missing fragments
until there are 14 fragments. If there are fewer than 7 distinct fragments for a block, that
block is lost irrevocably.
3.1 Global DHash Maintenance

The global maintenance protocol pushes misplaced fragments to the correct nodes. Each
DHash node scans its database of fragments and pushes any fragment that it stores, but
which fails the location condition, to one of the fragment's 14 successor hosts. For efficiency,
the algorithm processes contiguous ranges of keys at once.
Each DHash host continuously iterates through its fragment database in sorted order.
It performs a Chord lookup() to learn the 16 successor nodes of a fragment's key. These
nodes are the only hosts which should be storing the fragment. If the DHash host is one of
these nodes, the host continues on to the next key in the database, as it should be storing
the key. Otherwise, the DHash host is storing a misplaced fragment and needs to push it
to one of the fragment's 14 successors, in order to restore the location condition.

global_maintenance (void)
{
    a = myID
    while (1) {
        <key, frag> = database.next (a)
        succs = lookup (key, 16)
        if (myID isbetween succs[0] and succs[15]) {
            // we should be storing key
            a = myID
        } else {
            // key is misplaced
            for each s in succs[0 .. 13] {
                response = send_db_keys (s, database[key .. succs[0]])
                for each key in response.desired_keys
                    if (database.contains (key)) {
                        upload (s, database.lookup (key))
                        database.delete (key)
                    }
            }
            database.delete_range ([key .. succs[0]])
            a = succs[0]
        }
    }
}

Figure 3-1: An implementation of the global maintenance protocol.
The fragment's 14 successors should also store all keys ranging from the fragment's
key up to the Chord ID of the key's immediate successor. Consequently, the DHash host
processes this entire key range at once by sending all the database keys in this range to
each of the 14 successors. A successor responds with a message naming the keys it desires,
i.e., the keys it is missing; in that case, the DHash host sends the corresponding fragment
to that successor. It also deletes the fragment from its database to ensure that the fragment
is only sent to exactly one other host, with high probability; otherwise the distinctness
condition would be violated.
After offering all the keys to all the successors, the DHash node deletes all remaining
keys in the specified range from the database. These keys are safe to delete because they
have been offered to all the successors, but were already present on each successor. The
DHash node continues sweeping the database from the end of the key range just processed.

The call to database.next(a) skips the ranges of the key space for which no blocks
are in the database. This causes the sweep (i.e., the number of iterations of the while loop
needed to process the entire key space) to scale as the number of misplaced blocks in the
database, not as the number of nodes in the entire Chord ring.
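The skipping behavior of database.next(a) falls out naturally if the fragment database keeps
a sorted key index. The sketch below illustrates one way to do it with an ordered map: the
cursor jumps directly to the next stored key at or after a, wrapping around the circular key
space; the cursor advance shown (key + 1) is a simplification of the protocol's jump to
succs[0].

// Sketch of database.next(a): return the first stored key at or after `a`,
// wrapping around the circular key space, so the sweep visits only keys that
// are actually present rather than every possible identifier.
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

using ID = uint64_t;

std::map<ID, std::string> database;   // fragment store, sorted by key

std::pair<ID, std::string> next_entry(ID a) {
    auto it = database.lower_bound(a);                  // skips empty ranges in O(log n)
    if (it == database.end()) it = database.begin();    // wrap around the circle
    return *it;
}

int main() {
    database[100] = "frag-a";
    database[7000] = "frag-b";
    database[900000] = "frag-c";

    ID cursor = 101;
    for (int i = 0; i < 4; i++) {
        auto [key, frag] = next_entry(cursor);
        std::printf("cursor %llu -> key %llu (%s)\n",
                    (unsigned long long)cursor, (unsigned long long)key, frag.c_str());
        cursor = key + 1;                               // continue the sweep past this key
    }
    return 0;
}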
The global maintenance protocol adopts a push-based strategy where nodes push
misplaced fragments to the correct locations. This strategy contrasts with pull-based
strategies [15] where each node contacts other nodes and pulls fragments which it should store.
In general, we've found that push-based strategies are more robust under highly dynamic
membership change rates, where pull-based s
