`by
`Frank Dabek
`
`S.B., Computer Science (2000); S.B. Literature (2000)
`M.Eng., Computer Science (2001)
`Massachusetts Institute of Technology
`
`Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
`
`Doctor of Philosophy
`
`at the
`
`Massachusetts Institute of Technology
`
`September 2005
© Massachusetts Institute of Technology 2005. All rights reserved.
`
`Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
`Department of Electrical Engineering and Computer Science
`November 4, 2005
`
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
`M. Frans Kaashoek
`Professor
`Thesis Supervisor
`
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
`Robert T. Morris
`Associate Professor
`Thesis Supervisor
`
`Accepted by. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
`Arthur C. Smith
`Chairman, Department Committee on Graduate Students
`
`A Distributed Hash Table
`Frank Dabek
`
`Submitted to the Department of Electrical Engineering and Computer Science
on November 4, 2005, in partial fulfillment of the
`requirements for the degree of
`Doctor of Philosophy
`
`Abstract
`
DHash is a new system that harnesses the storage and network resources of computers distributed across the Internet to provide a wide-area storage service. DHash frees applications from re-implementing mechanisms common to any system that stores data on a collection of machines: it maintains a mapping of objects to servers, replicates data for durability, and balances load across participating servers. Applications access data stored in DHash through a familiar hash-table interface: put stores data in the system under a key; get retrieves the data.
DHash has proven useful to a number of application builders and has been used to build a content-distribution system [31], a Usenet replacement [115], and new Internet naming architectures [130, 129]. These applications demand low-latency, high-throughput access to durable data. Meeting this demand is challenging in the wide-area environment. The geographic distribution of nodes means that latencies between nodes are likely to be high: to provide a low-latency get operation the system must locate a nearby copy of the data without traversing high-latency links. Also, wide-area network links are likely to be less reliable and have lower capacities than local-area network links: to provide durability efficiently the system must minimize the number of copies of data items it sends over these limited-capacity links in response to node failure.
This thesis describes the design and implementation of the DHash distributed hash table and presents algorithms and techniques that address these challenges. DHash provides low-latency operations by using a synthetic network coordinate system (Vivaldi) to find nearby copies of data without sending messages over high-latency links. A network transport (STP), designed for applications that contact a large number of nodes, lets DHash provide high throughput by striping a download across many servers without causing high packet loss or exhausting local resources. Sostenuto, a data maintenance algorithm, lets DHash maintain data durability while minimizing the number of copies of data that the system sends over limited-capacity links.
`
`Thesis Supervisor: M. Frans Kaashoek
`Title: Professor
`
`Thesis Supervisor: Robert T. Morris
`Title: Associate Professor
`
`3
`
`HPE Ex. 2011
`Page 3 of 135
`
`
`
`A Distributed Hash Table
`Frank Dabek
`
`Submitted to the Department of Electrical Engineering and Computer Science
`on November 4, 2005, in partial ful(cid:2)llment of the
`requirements for the degree of
`Doctor of Philosophy
`
`Abstract
`
`DHash is a new system that harnesses the storage and network resources of computers
`distributed across the Internet by providing a wide›area storage service, DHash. DHash frees
`applications from re›implementing mechanisms common to any system that stores data on
`a collection of machines:
`it maintains a mapping of objects to servers, replicates data for
`durability, and balances load across participating servers. Applications access data stored in
`DHash through a familiar hash›table interface: put stores data in the system under a key;
`get retrieves the data.
`DHash has proven useful to a number of application builders and has been used to build
`a content›distribution system [31], a Usenet replacement [115], and new Internet naming
`architectures [130, 129]. These applications demand low›latency, high›throughput access
`to durable data. Meeting this demand is challenging in the wide›area environment. The
`geographic distribution of nodes means that latencies between nodes are likely to be high: to
`provide a low›latency get operation the system must locate a nearby copy of the data without
`traversing high›latency links. Also, wide›area network links are likely to be less reliable and
`have lower capacities than local›area network links:
`to provide durability ef(cid:2)ciently the
`system must minimize the number of copies of data items it sends over these limited capacity
`links in response to node failure.
`This thesis describes the design and implementation of the DHash distributed hash table
`and presents algorithms and techniques that address these challenges. DHash provides low›
`latency operations by using a synthetic network coordinate system (Vivaldi) to (cid:2)nd nearby
`copies of data without sending messages over high›latency links. A network transport (STP),
`designed for applications that contact a large number of nodes, lets DHash provide high
`throughput by striping a download across many servers without causing high packet loss or
`exhausting local resources. Sostenuto, a data maintenance algorithm, lets DHash maintain
`data durability while minimizing the number of copies of data that the system sends over
`limited›capacity links.
`
`Thesis Supervisor: M. Frans Kaashoek
`Title: Professor
`
`Thesis Supervisor: Robert T. Morris
`Title: Associate Professor
`
`3
`
`HPE Ex. 2011
`Page 4 of 135
`
`
`
`4
`
`HPE Ex. 2011
`
`Page 5 of 135
`
`HPE Ex. 2011
`Page 5 of 135
`
`
`
`Acknowledgments
`
This thesis is the product of a close collaboration with my colleagues in PDOS and Project IRIS. Russ Cox was instrumental in the development of Vivaldi. David Karger provided a great deal of much-needed assistance with the theoretical aspects of Chord and DHash. Latency optimizations to DHash are the result of collaboration with Jinyang Li and Chuck Blake. Emil Sit, Frans Kaashoek, James Robertson, and Josh Cates helped build and deploy the implementation of DHash and Chord, without which many of the ideas presented here would never have been developed. Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, and Byung-Gon Chun helped develop the ideas that led to Sostenuto and assisted with the presentation of Sostenuto in Chapter 6.
Some of the material presented in this thesis has been published elsewhere. The CFS filesystem is described in an SOSP publication [31]. Emil Sit authored an IPTPS paper on UsenetDHT [115]. Work on Vivaldi was done in collaboration with Russ Cox. The work began with a class project [27]; a simple version of the algorithm was presented at HotNets [28]. The height model and adaptive timestep were first discussed in a SIGCOMM publication [30]. The analysis of DHash latency and the STP transport were first published at NSDI [32]. The material in Chapter 6 (data maintenance) is described in a paper currently under submission.
`I am extremely grateful for the chance to have worked with my advisers, Frans Kaashoek
`and Robert Morris. I can only aspire to bring to my work the same level of clarity, rigor, and
`honesty that they have demonstrated during my time at MIT.
The research described in this thesis is part of the IRIS project and was funded by a grant from the National Science Foundation under NSF Cooperative Agreement No. ANI-0225660. IBM provided support in the form of a fellowship.
`
Frans Kaashoek, Robert Morris, Dan Aguayo, Timo Burkhard, Douglas De Couto, Kevin Fu, Michael Kaminsky, David Mazieres, James Robertson, Jeremy Stribling, John Bicket, Ben Chambers, Bryan Ford, Thomer Gil, Eddie Kohler, Athicha Muthitacharoen, Rodrigo Rodrigues, Emmett Witchel, Sanjit Biswas, Benjie Chen, Michael Freedman, Kyle Jamieson, Max Krohn, Eric Peterson, Emil Sit, Alex Yip, Chuck Blake, Russ Cox, Cliff Frey, John Jannotti, Chris Laas, Sean Rhea, Alex Snoeren, Nickolai Zeldovich, Hari Balakrishnan, Peter Druschel, David Karger, Ion Stoica, Jinyang Li
`
`5
`
`HPE Ex. 2011
`Page 6 of 135
`
`
`
Dedicated to the memory of Joshua Cates (1977-2004).
`
`Where are the songs of Spring? Ay, where are they?
Think not of them, thou hast thy music too...
`
`6
`
`HPE Ex. 2011
`Page 7 of 135
`
`
`
`Contents
`
1 Introduction
  1.1 DHash overview
  1.2 Using DHash
  1.3 Problem statement
  1.4 Solutions
  1.5 Contributions
  1.6 Rest of thesis

2 System environment
  2.1 Evaluation setup
  2.2 Latency
  2.3 Node reliability
  2.4 Implementation

3 Vivaldi
  3.1 Vivaldi algorithm
  3.2 Experimental setup
  3.3 Evaluation
  3.4 Model selection
  3.5 Theoretical results

4 Latency
  4.1 Chord background
  4.2 Recursive or iterative?
  4.3 Effect of choice
  4.4 Proximity neighbor selection
  4.5 Coding versus replication
  4.6 Integrating lookup and fetching
  4.7 Summary

5 Throughput
  5.1 Problem
  5.2 STP
  5.3 Performance comparison

6 Replica maintenance
  6.1 System overview
  6.2 rL and the probability of data loss
  6.3 Improving durability
  6.4 Reducing the cost of temporary failure
  6.5 Evaluation
  6.6 Implementation
  6.7 Conclusion

7 Related work
  7.1 Distributed storage
  7.2 Object location
  7.3 Network location
  7.4 Replica maintenance

8 Conclusion
  8.1 Summary
  8.2 Tradeoffs
  8.3 Future work
`
`
1 Introduction
`
A distributed hash table (DHT) is a reliable, scalable, wide-area data storage system that frees programmers from many of the complications of building a distributed system. DHTs store blocks of data on hundreds or thousands of machines connected to the Internet, replicate the data for reliability, and quickly locate data despite running over high-latency, wide-area links. The DHT addresses problems of locating data and replicating it for reliability, which are common to many distributed systems, without additional work by the application. The DHT provides a generic interface, which makes it easy for a wide variety of applications to adopt DHTs as a storage substrate: put stores data in the system under a key; get retrieves the data.
Distributed hash tables fill a gap in the design space of storage systems. DHTs occupy a middle ground between small systems with strong guarantees on the service they provide (such as distributed file systems) and large, unorganized, best-effort systems (such as the world wide web or file-sharing systems). DHTs are able to operate over a large and previously unoccupied area of the design space; the existence of a single, practical system that operates throughout this regime will make it easier to write new distributed applications.
In filling this design-space gap, DHTs attempt to combine the two strands of systems research that inspired them and provide the best features of both. Inspired by small, LAN-based systems (which we will call transparent distributed systems), DHTs provide probabilistic guarantees on the success of a get or put operation. At the same time, DHTs can operate on the same scales as large systems designed to run on the wide area (which we will call Internet systems).
Transparent distributed systems, such as Harp [72] and DDS [46], combine the resources of a number of machines while attempting to isolate the user from any evidence that the system is distributed. Users of a distributed file system such as Harp would be hard-pressed to prove that their requests were being served by a cluster of machines rather than a single NFS server. To obtain this transparency these systems provide strong guarantees on the results of operations. DDS, for example, guarantees that a get operation will always see the results of the last put operation. Providing this guarantee is possible because transparent distributed systems are designed to run on a small number of machines connected to a local-area network and assume a high-bandwidth, low-latency, reliable interconnect. On the wide area, where these assumptions are violated, DDS could not make the same guarantees. DDS assumes, for instance, that the network is never partitioned, a common occurrence on the wide area. Also, the techniques that these systems use to provide consistency (two-phase commit or the Paxos agreement protocol) are expensive to run over unreliable, high-latency links.
The other research ancestor of DHTs, Internet systems, shares with transparent distributed systems the goal of aggregating the resources of multiple machines. However, Internet systems focus not on transparency but on maximizing scale. These systems are designed to run on a large number of geographically distributed nodes on the wide-area network. The mechanism of these systems is often visible to the user and only best-effort guarantees are offered. The scalability of these systems comes at the cost of lax guarantees: a file-sharing service like Gnutella offers a keyword search interface but may fail, even in normal operation, to return the location of a file even if it does exist in the system [11].
DHTs are a result of the desire for a system with the scale (large number of machines, large amount of data, large inter-node latency) of Internet systems and the graceful failure masking and strong guarantees of transparent distributed systems. DHTs sacrifice some of the guarantees of transparent distributed systems and cannot scale as well as loosely organized systems like the web, but the result is a system that will run in a wide range of environments, from machine-room clusters to a collection of cable modem users spread across the Internet, and provide a generic interface that is a suitable substrate for a large number of distributed storage applications that can operate with eventual consistency. The lack of strong consistency has not limited the usefulness of DHash: scenarios that necessitate strong consistency, such as write sharing, are rare, and applications designed to run on DHash can be designed to minimize the amount of mutable data in the system and eliminate the possibility of multiple writers [83].
Because DHTs work in a variety of deployment scenarios, they provide the possibility of a near-universal storage infrastructure. Any wide-area distributed system must cope with the challenges of node scale, network delay, and node failure. DHTs implement mechanisms (such as replication and efficient location algorithms) once, in the infrastructure, rather than as part of applications. The DHT interface may prove to be the storage equivalent of the IP abstraction for networks. This implementation convenience has spurred the early adoption of DHTs even in small systems where the ability of the system to route queries without maintaining full membership information is unnecessary. DHTs are attractive in these scenarios because they free the application programmer from the burden of implementing features necessary to cope with failure and also because they give the system the potential for dramatic expansion.
By designing a storage system to run on the wide-area network, we hope to take advantage of the abundance of network and disk resources connected to the Internet in the same way that file-sharing networks such as Napster and Gnutella have. While decentralized, wide-area deployment provides the tantalizing potential of gaining inexpensive access to a great deal of storage and network resources, there are challenges facing systems that take this approach.
Any wide-area storage system will have to cope with the sheer scale of thousands of participating nodes: a large number of nodes makes it hard to locate the data item requested by a get operation. Nodes are unreliable and may fail or leave the system at unpredictable times: data must be continually replicated as nodes fail to make the effect of a put operation permanent. Wide-area network delays and loss rates could make the latency of a get operation large or limit how fast bulk data can be returned to the application. In addition, in some deployment scenarios, the infrastructure for such a system may depend on volunteers to donate resources: these volunteers could be malicious or uncooperative and could deny access to stored data, delete it, or return incorrect values.
This thesis describes the DHash DHT, which addresses many of the challenges of building a distributed storage system that runs on wide-area nodes and provides a generic interface, and useful guarantees, to applications. We will first describe the base DHash system and then demonstrate improvements to the system's performance and reliability. These improvements make DHash a practical platform for building applications: DHash has proven useful to a number of application builders [130, 115, 129, 31]. The rest of this chapter provides an overview of DHash and describes how it can be used by applications. We outline the challenges facing DHash in more detail and sketch how the thesis addresses them.
`10
`
`HPE Ex. 2011
`Page 11 of 135
`
`
`
Function               Description
put_h(block)           Computes the block's key by hashing its contents, and sends it to the key's successor server for storage.
put_s(block, pubkey)   Stores or updates a signed block; used for root blocks. The block must be signed with the given public key. The block's Chord key will be the hash of pubkey.
get(key)               Fetches and returns the block associated with the specified Chord key.

Table 1.1: DHash client API
`
`
`1.1 DHash overview
`
This thesis will describe one DHT (DHash) in detail. Many of the algorithms and optimizations implemented by DHash can be (and have been [89]) adopted in other DHTs. Here we give an overview of a basic implementation of DHash (base DHash). Later chapters present modifications to the basic design that improve performance and reliability. For example, the invariant governing data placement will be relaxed to reduce maintenance costs.
`
`1.1.1 DHash interface
`
Applications using DHash link against a DHash library, which exports a durable hash-table interface consisting of two operations: put and get. The put operation stores a piece of data in the hash table under a key; keys are drawn from a large (160-bit) key space. The get operation returns the block associated with the supplied key. Table 1.1 shows the API that DHash presents to applications.
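To make the interface concrete, the following is a minimal sketch, in Python, of how an application might use these operations. The DHashClient class and its in-memory dictionary are hypothetical stand-ins for the real client library and the network protocol; only the operation names (put_h, get) come from Table 1.1.

    # Illustrative sketch only: DHashClient and its in-memory store are
    # hypothetical; a real client sends the block to the key's successor server.
    import hashlib

    class DHashClient:
        def __init__(self):
            self.store = {}                     # key (bytes) -> block (bytes)

        def put_h(self, block):
            key = hashlib.sha1(block).digest()  # 160-bit content-hash key
            self.store[key] = block
            return key

        def get(self, key):
            return self.store[key]

    dhash = DHashClient()
    key = dhash.put_h(b"a block of application data")
    assert dhash.get(key) == b"a block of application data"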
The keys used by the DHash system could, in principle, be any 160-bit number, but they are usually the hash of the data item being stored. Using the content hash allows the system to certify the data received: if the content hash of the returned data matches the key, the data must not have been tampered with. We use a 160-bit hash function so that collisions are overwhelmingly unlikely. Content-hash blocks are immutable; by nature, they cannot be altered without changing the key under which they are stored. Because the blocks are immutable, DHash does not need to implement any additional mechanism (e.g., two-phase commit) for providing a consistent view of data stored in content-hash blocks.
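As a minimal sketch of this certification step, the check below accepts a fetched block only if its SHA-1 digest matches the 160-bit key it was requested under; fetch_from_network is a hypothetical placeholder for the remote fetch.

    # Sketch of the content-hash check: reject data whose hash does not match the key.
    import hashlib

    def certified_get(key, fetch_from_network):
        data = fetch_from_network(key)          # bytes returned by some remote node
        if hashlib.sha1(data).digest() != key:  # mismatch means the data was altered
            raise ValueError("block failed content-hash verification")
        return data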
The other key type supported by DHash is the hash of a public key. The data stored under a public key is signed by the corresponding private key; DHash verifies this signature when the data is retrieved to certify the data. Unlike content-hash blocks, the application can modify the data stored under the hash of a public key without changing the name of the block. DHash does not provide strong guarantees on the behavior of public-key blocks under simultaneous operations. For instance, while a write is in progress a simultaneous reader might see the newly-written value before the write operation returns to the writer. Simultaneous writes could result in either, or both, values being stored on the replicas initially. DHash runs a stabilization protocol that guarantees that eventually one of the values will be copied to all replicas. Because the guarantees on public-key blocks are weak, systems using mutable data in DHash usually construct single-writer data structures [83]. Public-key blocks are often used to name the root of a tree-like structure of content-hash blocks: this allows the system to give a single stable name to a large amount of mutable data while using only a single mutable block.
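The sketch below illustrates this root-block pattern, assuming the put_h and put_s operations of Table 1.1: mutable data is split into immutable content-hash blocks and a single signed root block lists their keys, so readers only ever need the hash of the public key. The 8 KB block size matches the discussion later in this chapter; the flat, concatenated root layout is an illustrative assumption, not DHash's actual on-disk format.

    # Hedged sketch of publishing mutable data under one stable public-key name.
    import hashlib

    BLOCK_SIZE = 8192                                # DHash is optimized for blocks of 8 KB or less

    def publish(data, pubkey, dhash):
        child_keys = []
        for off in range(0, len(data), BLOCK_SIZE):
            chunk = data[off:off + BLOCK_SIZE]
            child_keys.append(dhash.put_h(chunk))    # immutable content-hash blocks
        root = b"".join(child_keys)                  # new root contents: the list of child keys
        dhash.put_s(root, pubkey)                    # mutable root block, stable name
        return hashlib.sha1(pubkey).digest()         # the one key readers need to remember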
`
`11
`
`HPE Ex. 2011
`Page 12 of 135
`
`
`
`{
`
`GET
`PUT
`
`}
`
`UsenetDHT
`
`UsenetDHT
`
`{ LOOKUP}
`
`DHash
`
`Chord
`
`{
`
`GETKEYS
`FETCH
`
`}
`
`{
`
`STABILIZE
`FORWARD
`
`}
`
`DHash
`
`Chord
`
Figure 1-1: Overview of the DHash system. Applications (such as UsenetDHT) issue GET commands to DHash to find the value associated with a key; these local calls are indicated by dotted lines in the figure. DHash, in turn, issues LOOKUP requests to Chord to locate the node responsible for the key. The Chord layer coordinates nodes to answer the lookup query by forwarding the request (solid black lines indicate remote communication via RPC); the Chord layer also exchanges periodic stabilization messages to keep routing structures up to date. The DHash layer causes nodes to exchange messages to download block data (FETCH) and to synchronize replica sets (GETKEYS).
`
An application uses DHash by linking against a library that exports the DHash API. The library sends messages to a DHash server that coordinates with other DHash servers running on the wide-area network to resolve the application's request. The organization of the DHash system is shown in Figure 1-1.
`
`1.1.2 Chord
`
DHash is built as a layer over the Chord distributed lookup system [120]. Each node participating in the Chord ring is assigned an identifier from a 160-bit circular key space; data items are assigned keys from the same space. Chord provides a scalable lookup service that maps each 160-bit key to a node. The lookup(k) function returns the ID and IP address of the node currently responsible for the key k. The node responsible for a key is the node with the identifier that most closely follows the key in the circular key space; we refer to this node as the successor of k and to the several nodes after k as the successor list of k. In Figure 1-2, the successor of key K19 is node N24 and the successor list of the key is nodes N24, N34, and N41. Note that N16 (the predecessor of the key) maintains pointers to all of these nodes and can definitively return K19's successor list.
Chord maintains a routing table of log N pointers to other nodes in the system and can resolve a mapping by sending log N messages, where N is the number of nodes in the system. Because Chord keeps only a small amount of state, it is able to maintain that state efficiently in large or unstable systems. Section 4.1 explains in more detail how Chord implements the lookup operation using this routing table.
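The following is a minimal sketch of the key-to-node mapping that lookup(k) resolves, using the 8-bit identifier space and the node IDs of Figure 1-2. It simply scans a sorted list of node IDs with global knowledge; real Chord reaches the same answer with O(log N) messages over its routing table.

    # Successor mapping on a circular identifier space (illustrative only).
    import bisect

    ID_BITS = 8
    ID_SPACE = 2 ** ID_BITS

    def successor(key, nodes):
        # First node ID clockwise from the key on the ring.
        ids = sorted(n % ID_SPACE for n in nodes)
        i = bisect.bisect_left(ids, key % ID_SPACE)
        return ids[i % len(ids)]                     # wrap around the circular space

    def successor_list(key, nodes, count=3):
        ids = sorted(n % ID_SPACE for n in nodes)
        i = bisect.bisect_left(ids, key % ID_SPACE)
        return [ids[(i + j) % len(ids)] for j in range(count)]

    nodes = [16, 24, 34, 41]
    assert successor(19, nodes) == 24                # N24 is the successor of K19
    assert successor_list(19, nodes) == [24, 34, 41]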
`
`1.1.3 Base DHash
`
`While Chord provides a location service, DHash is responsible for actually storing the data,
`creating replicas, and maintaining those replicas as disks fail. DHash has considerable leeway
`in deciding how to accomplish these tasks; this section examines how base DHash chooses
`to address these tasks along with alternative choices.
`
`12
`
`HPE Ex. 2011
`Page 13 of 135
`
`
`
`
Figure 1-2: Successor relationship. The successor of a key is the first node clockwise after that key in the circular identifier space. In this example, N24 (shown as a filled circle) is the immediate successor of K19 (shown as a hash). Nodes N34 and N41 are in the successor list of key K19. Node N16 (white circle) is able to authoritatively determine the successors of any key in the range [16, 24) by maintaining pointers to the nodes immediately following it on the ring. In this example and the following, an 8-bit identifier space is used.
`
DHash is designed to store a large number of small blocks spread approximately uniformly across participating servers. DHash uses the Chord mapping to determine which node stores a given key: the data associated with a key k is stored on the node returned by lookup(k) and replicated on that node's successors. By storing replicas on the successor list, base DHash distributes data uniformly at random over all servers: a block's key is essentially random (the SHA-1 of the block's value) and node IDs are random. The result is that blocks (and load) are uniformly spread over the DHT nodes and that a block's replicas or fragments are widely scattered to avoid correlated failure. An alternative to placing replicas on the successor list would be to randomly select nodes to serve as the replica set for a key; this approach requires that the system store some metadata that serves as a level of indirection [7].
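Building on the successor sketch in Section 1.1.2, base DHash's placement rule can be expressed in a few lines: a block's key is the SHA-1 of its value, and the block is stored on the key's successor and replicated on the following nodes of the successor list. The replication factor of three is an illustrative assumption, not necessarily the value DHash uses.

    # Sketch of base DHash placement; reuses ID_SPACE and successor_list
    # from the earlier sketch.
    import hashlib

    def replica_set(block, nodes, replicas=3):
        key = int.from_bytes(hashlib.sha1(block).digest(), "big") % ID_SPACE
        return successor_list(key, nodes, count=replicas)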
Some DHTs provide only a key location service and let each application decide where (or even whether) to store data [57]. By using a semi-structured name space [53, 54], a system could give applications control over where a block is stored. Applications might choose to give hints about the geographic or administrative domains in which data should reside. DHash, and other DHTs that store data uniformly across all nodes in the system, are appropriate for applications that wish to view the DHT as a network storage system that automatically provides replication and location services. Our motivating applications, OverCite [122] and UsenetDHT [115], are examples of this class. We will describe these applications in more detail in Section 1.2.
DHTs that store data must decide on the size of the units of data to store. A DHT key could refer to a disk-sector-like block of data [31], to a complete file [109], or to an entire file system image [26]. Large values reduce the number of lookups required to fetch a large amount of data. Small blocks spread the load of serving popular large files. For these reasons, and because some applications (such as SFR [129]) require the DHT to store small blocks, DHash is optimized to store blocks of 8 KB or less. The main advantage of large blocks is potentially higher throughput; one might expect that the need to perform a DHT lookup for each 8 KB block would lead to poor performance. However, as we will show in Chapter 5, DHash is able to take advantage of available network capacity even when using small (8 KB) blocks.
Storing data on the successor list ensures good load balance but could limit the throughput that an application obtains when it downloads a large file that has been broken down into blocks. Because base DHash rigidly maps a given block to some node, the slowest node in the system will be the owner of some blocks and therefore must be periodically contacted. The slowest node dictates how fast the entire download can progress: on average, at most N times the throughput of the slowest node if there are N nodes in the system. Note that if the slowest node is a bottleneck, this implies that the access link of the node performing the download is N times faster than the access link of the slowest node.
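A back-of-the-envelope calculation makes this bound concrete; the node count and link speed below are made-up numbers for illustration, not measurements from this thesis.

    # Illustrative calculation of the average-throughput bound described above.
    n_nodes = 100                    # assumed system size
    slowest_mbps = 1.0               # assumed upload capacity of the slowest node

    # Roughly 1/N of the blocks live on the slowest node, so the average
    # download rate cannot exceed N times that node's rate -- and reaching
    # it requires the downloader's access link to be at least this fast.
    max_avg_download_mbps = n_nodes * slowest_mbps   # 100 Mbit/s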
The system might achieve a higher initial rate by downloading from higher-capacity hosts at those hosts' full rate; however, such a strategy does not change when the last block will be received from the slowest host. We might also relax the requirement on where blocks are stored so that slow hosts do not limit system performance. This quickly leads to a question of access control: should all hosts be allowed to participate fully in the system? If the system membership is open, but the system favors high-performance hosts, it is not clear what benefit the overall system derives from allowing the slower hosts to participate (the penalty for allowing such hosts is clear, however). One possibility is that hosts with slow access links can be used as "emergency" replicas: normally they are not contacted during a download, but if the better-equipped replicas are not available, the resource-constrained hosts can service downloads slowly to maintain block availability.
Data stored in the DHT must be replicated for reliability. Any storage system should provide two distinct but related properties: durability and availability. The definition of availability is straightforward: a block is available at some point in time if an application can access the block through the system at that time. The definition of durability is a bit more difficult: we define a block to be durable if it is stored on some medium and will be available through the system at some point in the future. DHash's main goal is data durability; however, since the system, in practice, can only observe availability failures, it also maintains data availability.
`The number of copies of each block that are maintained helps to determine the probability
`that a data block will be lost; increasing the replication factor reduces the probability of
`loss but increases the amount of data that must be sent over the network when a block
`is initially stored and during the ongoing process of replacing failed replicas. DHash is
`design