A Distributed Hash Table

by

Frank Dabek

S.B., Computer Science (2000); S.B. Literature (2000)
M.Eng., Computer Science (2001)
Massachusetts Institute of Technology

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

at the

Massachusetts Institute of Technology

September 2005

© Massachusetts Institute of Technology 2005. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, November 4, 2005

Certified by: M. Frans Kaashoek, Professor, Thesis Supervisor

Certified by: Robert T. Morris, Associate Professor, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students
A Distributed Hash Table

Frank Dabek

Submitted to the Department of Electrical Engineering and Computer Science
on November 4, 2005, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy

Abstract

DHash is a new system that harnesses the storage and network resources of computers distributed across the Internet by providing a wide-area storage service. DHash frees applications from re-implementing mechanisms common to any system that stores data on a collection of machines: it maintains a mapping of objects to servers, replicates data for durability, and balances load across participating servers. Applications access data stored in DHash through a familiar hash-table interface: put stores data in the system under a key; get retrieves the data.

DHash has proven useful to a number of application builders and has been used to build a content-distribution system [31], a Usenet replacement [115], and new Internet naming architectures [130, 129]. These applications demand low-latency, high-throughput access to durable data. Meeting this demand is challenging in the wide-area environment. The geographic distribution of nodes means that latencies between nodes are likely to be high: to provide a low-latency get operation, the system must locate a nearby copy of the data without traversing high-latency links. Also, wide-area network links are likely to be less reliable and have lower capacities than local-area network links: to provide durability efficiently, the system must minimize the number of copies of data items it sends over these limited-capacity links in response to node failure.

This thesis describes the design and implementation of the DHash distributed hash table and presents algorithms and techniques that address these challenges. DHash provides low-latency operations by using a synthetic network coordinate system (Vivaldi) to find nearby copies of data without sending messages over high-latency links. A network transport (STP), designed for applications that contact a large number of nodes, lets DHash provide high throughput by striping a download across many servers without causing high packet loss or exhausting local resources. Sostenuto, a data maintenance algorithm, lets DHash maintain data durability while minimizing the number of copies of data that the system sends over limited-capacity links.

Thesis Supervisor: M. Frans Kaashoek
Title: Professor

Thesis Supervisor: Robert T. Morris
Title: Associate Professor
Acknowledgments

This thesis is the product of a close collaboration with my colleagues in PDOS and Project IRIS. Russ Cox was instrumental in the development of Vivaldi. David Karger provided a great deal of much-needed assistance with the theoretical aspects of Chord and DHash. Latency optimizations to DHash are the result of collaboration with Jinyang Li and Chuck Blake. Emil Sit, Frans Kaashoek, James Robertson, and Josh Cates helped build and deploy the implementation of DHash and Chord without which many of the ideas presented here would never have been developed. Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, and Byung-Gon Chun helped develop the ideas that led to Sostenuto and assisted with the presentation of Sostenuto in Chapter 6.

Some of the material presented in this thesis has been published elsewhere. The CFS filesystem is described in an SOSP publication [31]. Emil Sit authored an IPTPS paper on UsenetDHT [115]. Work on Vivaldi was done in collaboration with Russ Cox. The work began with a class project [27]; a simple version of the algorithm was presented at HotNets [28]. The height model and adaptive timestep were first discussed in a SIGCOMM publication [30]; the analysis of DHash latency and the STP transport were first published at NSDI [32]. The material in Chapter 6 (data maintenance) is described in a paper currently under submission.

I am extremely grateful for the chance to have worked with my advisers, Frans Kaashoek and Robert Morris. I can only aspire to bring to my work the same level of clarity, rigor, and honesty that they have demonstrated during my time at MIT.

The research described in this thesis is part of the IRIS project and was funded by a grant from the National Science Foundation under NSF Cooperative Agreement No. ANI-0225660. IBM provided support in the form of a fellowship.
Frans Kaashoek, Robert Morris, Dan Aguayo, Timo Burkhard, Douglas De Couto, Kevin Fu, Michael Kaminsky, David Mazières, James Robertson, Jeremy Stribling, John Bicket, Ben Chambers, Bryan Ford, Thomer Gil, Eddie Kohler, Athicha Muthitacharoen, Rodrigo Rodrigues, Emmett Witchel, Sanjit Biswas, Benjie Chen, Michael Freedman, Kyle Jamieson, Max Krohn, Eric Peterson, Emil Sit, Alex Yip, Chuck Blake, Russ Cox, Cliff Frey, John Jannotti, Chris Laas, Sean Rhea, Alex Snoeren, Nickolai Zeldovich, Hari Balakrishnan, Peter Druschel, David Karger, Ion Stoica, Jinyang Li
Dedicated to the memory of Joshua Cates (1977-2004).

Where are the songs of Spring? Ay, where are they?
Think not of them, thou hast thy music too...
Contents

1  Introduction   9
   1.1  DHash overview   11
   1.2  Using DHash   15
   1.3  Problem statement   17
   1.4  Solutions   19
   1.5  Contributions   22
   1.6  Rest of thesis   22

2  System environment   25
   2.1  Evaluation setup   25
   2.2  Latency   26
   2.3  Node reliability   29
   2.4  Implementation   30

3  Vivaldi   33
   3.1  Vivaldi algorithm   34
   3.2  Experimental setup   39
   3.3  Evaluation   41
   3.4  Model selection   47
   3.5  Theoretical results   55

4  Latency   59
   4.1  Chord background   59
   4.2  Recursive or iterative?   61
   4.3  Effect of choice   63
   4.4  Proximity neighbor selection   65
   4.5  Coding versus replication   69
   4.6  Integrating lookup and fetching   71
   4.7  Summary   72

5  Throughput   75
   5.1  Problem   75
   5.2  STP   76
   5.3  Performance comparison   79

6  Replica maintenance   85
   6.1  System overview   86
   6.2  rL and the probability of data loss   88
   6.3  Improving durability   95
   6.4  Reducing the cost of temporary failure   98
   6.5  Evaluation   101
   6.6  Implementation   105
   6.7  Conclusion   108

7  Related work   109
   7.1  Distributed storage   109
   7.2  Object location   110
   7.3  Network location   112
   7.4  Replica maintenance   115

8  Conclusion   117
   8.1  Summary   117
   8.2  Tradeoffs   118
   8.3  Future work   118
1  Introduction

A distributed hash table (DHT) is a reliable, scalable, wide-area data storage system that frees programmers from many of the complications of building a distributed system. DHTs store blocks of data on hundreds or thousands of machines connected to the Internet, replicate the data for reliability, and quickly locate data despite running over high-latency, wide-area links. The DHT addresses problems of locating data and replicating it for reliability, which are common to many distributed systems, without additional work by the application. The DHT provides a generic interface, which makes it easy for a wide variety of applications to adopt DHTs as a storage substrate: put stores data in the system under a key; get retrieves the data.

Distributed hash tables fill a gap in the design space of storage systems. DHTs occupy a middle ground between small systems with strong guarantees on the service they provide (such as distributed file systems) and large, unorganized, best-effort systems (such as the world wide web or file-sharing systems). DHTs are able to operate over a large and previously unoccupied area of the design space; the existence of a single, practical system that operates throughout this regime will make it easier to write new distributed applications.

In filling this design-space gap, DHTs attempt to combine the two strands of systems research that inspired them and provide the best features of both. Inspired by small, LAN-based systems (which we will call transparent distributed systems), DHTs provide probabilistic guarantees on the success of a get or put operation. At the same time, DHTs can operate on the same scales as large systems designed to run on the wide area (which we will call Internet systems).

Transparent distributed systems, such as Harp [72] and DDS [46], combine the resources of a number of machines while attempting to isolate the user from any evidence that the system is distributed. Users of a distributed file system such as Harp would be hard-pressed to prove that their requests were being served by a cluster of machines rather than a single NFS server. To obtain this transparency these systems provide strong guarantees on the results of operations. DDS, for example, guarantees that a get operation will always see the results of the last put operation. Providing this guarantee is possible because transparent distributed systems are designed to run on a small number of machines connected to a local-area network and assume a high-bandwidth, low-latency, reliable interconnect. On the wide area, where these assumptions are violated, DDS could not make the same guarantees. For instance, DDS assumes that the network is never partitioned, yet partitions are a common occurrence on the wide area. Also, the techniques that these systems use to provide consistency (two-phase commit or the Paxos agreement protocol) are expensive to run over unreliable, high-latency links.

The other research ancestor of DHTs, Internet systems, shares with transparent distributed systems the goal of aggregating the resources of multiple machines. However, Internet systems focus not on transparency, but on maximizing scale. These systems are designed to run on a
large number of geographically distributed nodes on the wide-area network. The mechanism of these systems is often visible to the user and only best-effort guarantees are offered. The scalability of these systems comes at the cost of lax guarantees: a file-sharing service like Gnutella offers a keyword search interface but may fail, even in normal operation, to return the location of a file even if it does exist in the system [11].

DHTs are a result of the desire for a system with the scale (large number of machines, large amount of data, large inter-node latency) of Internet systems and the graceful failure masking and strong guarantees of transparent distributed systems. DHTs sacrifice some of the guarantees of transparent distributed systems and cannot scale as well as loosely organized systems like the web, but the result is a system that will run in a wide range of environments, from machine-room clusters to a collection of cable modem users spread across the Internet, and provide a generic interface that is a suitable substrate for a large number of distributed storage applications that can operate with eventual consistency. The lack of strong consistency has not limited the usefulness of DHash: scenarios that necessitate strong consistency, such as write sharing, are rare, and applications designed to run on DHash can be designed to minimize the amount of mutable data in the system and eliminate the possibility of multiple writers [83].

Because DHTs work in a variety of deployment scenarios, they provide the possibility of a near-universal storage infrastructure. Any wide-area distributed system must cope with the challenges of node scale, network delay, and node failure. DHTs implement mechanisms (such as replication and efficient location algorithms) once, in the infrastructure, rather than as part of applications. The DHT interface may prove to be the storage equivalent of the IP abstraction for networks. This implementation convenience has spurred the early adoption of DHTs even in small systems where the ability of the system to route queries without maintaining full membership information is unnecessary. DHTs are attractive in these scenarios because they free the application programmer from the burden of implementing features necessary to cope with failure and also because they give the system the potential for dramatic expansion.

By designing a storage system to run on the wide-area network, we hope to take advantage of the abundance of network and disk resources connected to the Internet in the same way that file-sharing networks such as Napster and Gnutella have. While decentralized, wide-area deployment provides the tantalizing potential of gaining inexpensive access to a great deal of storage and network resources, there are challenges facing systems that take this approach.

Any wide-area storage system will have to cope with the sheer scale of thousands of participating nodes: a large number of nodes makes it hard to locate the data item requested by a get operation. Nodes are unreliable and may fail or leave the system at unpredictable times: data must be continually replicated as nodes fail to make the effect of a put operation permanent. Wide-area network delays and loss rates could make the latency of a get operation large or limit how fast bulk data can be returned to the application. In addition, in some deployment scenarios, the infrastructure for such a system may depend on volunteers to donate resources: these volunteers could be malicious or uncooperative and could deny access to stored data, delete it, or return incorrect values.

This thesis describes the DHash DHT, which addresses many of the challenges of building a distributed storage system that runs on wide-area nodes and provides a generic interface, and useful guarantees, to applications. We will first describe the base DHash system and then demonstrate improvements to the system's performance and reliability. These improvements make DHash a practical platform for building applications: DHash has proven useful to a number of application builders [130, 115, 129, 31]. The rest of this chapter provides an overview of DHash and describes how it can be used by applications. We outline the challenges facing DHash in more detail and sketch how the thesis addresses them.
    Function               Description
    put_h(block)           Computes the block's key by hashing its contents, and
                           sends it to the key's successor server for storage.
    put_s(block, pubkey)   Stores or updates a signed block; used for root blocks.
                           The block must be signed with the given public key.
                           The block's Chord key will be the hash of pubkey.
    get(key)               Fetches and returns the block associated with the
                           specified Chord key.

    Table 1.1: DHash client API
1.1  DHash overview

This thesis will describe one DHT (DHash) in detail. Many of the algorithms and optimizations implemented by DHash can be (and have been [89]) adopted in other DHTs. Here we give an overview of a basic implementation of DHash (base DHash). Later chapters present modifications to the basic design that improve performance and reliability. For example, the invariant governing data placement will be relaxed to improve maintenance costs.

1.1.1  DHash interface

Applications using DHash link against a DHash library which exports a durable hash-table interface consisting of two operations: put and get. The put operation stores a piece of data into the hash table associated with a key; keys are drawn from a large (160-bit) key space. The get operation returns the block associated with the supplied key. Table 1.1 shows the API that DHash presents to applications.
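To make the interface concrete, the sketch below shows the shape of a client session against this API. The DHashClient class and its in-memory dictionary are hypothetical stand-ins used only for illustration; the method names follow Table 1.1, but this is not the actual DHash library.

    import hashlib

    # Illustrative sketch only: the dictionary stands in for the wide-area system.
    class DHashClient:
        def __init__(self):
            self.store = {}                         # placeholder for the DHT itself

        def put_h(self, block: bytes) -> bytes:
            key = hashlib.sha1(block).digest()      # 160-bit content-hash key
            self.store[key] = block                 # really: sent to the key's successor
            return key

        def get(self, key: bytes) -> bytes:
            return self.store[key]                  # really: locate and fetch the block

    dhash = DHashClient()
    key = dhash.put_h(b"an 8 KB or smaller block of application data")
    assert dhash.get(key) == b"an 8 KB or smaller block of application data"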
The keys used by the DHash system could, in principle, be any 160-bit number, but they are usually the hash of the data item being stored. Using the content hash allows the system to certify the data received: if the content hash of the returned data matches the key, the data must not have been tampered with. We use a 160-bit hash function so that collisions are overwhelmingly unlikely. Content-hash blocks are immutable; by nature, they cannot be altered without changing the key under which they are stored. Because the blocks are immutable, DHash does not need to implement any additional mechanism (e.g., two-phase commit) for providing a consistent view of data stored in content-hash blocks.
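As an illustration of this certification step, the client-side check on fetched data might look like the following sketch; the function name and error handling are ours, not part of the DHash API.

    import hashlib

    def certify_content_hash(key: bytes, data: bytes) -> bytes:
        # A content-hash block is named by the SHA-1 of its contents, so the
        # fetched bytes can be checked against the key without trusting the
        # server that returned them.
        if hashlib.sha1(data).digest() != key:
            raise ValueError("returned block does not match its content-hash key")
        return data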
The other key type supported by DHash is the hash of a public key. The data stored under a public key is signed by the corresponding private key; DHash verifies this signature when the data is retrieved to certify the data. Unlike content-hash blocks, the application can modify the data stored under the hash of a public key without changing the name of the block. DHash does not provide strong guarantees on the behavior of public-key blocks under simultaneous operations. For instance, while a write is in progress a simultaneous reader might see the newly-written value before the write operation returns to the writer. Simultaneous writes could result in either, or both, values being stored on the replicas initially. DHash runs a stabilization protocol that guarantees that eventually one of the values will be copied to all replicas. Because the guarantees on public-key blocks are weak, systems using mutable data in DHash usually construct single-writer data structures [83].
Figure 1-1: Overview of the DHash system. Applications (such as UsenetDHT) issue GET and PUT calls to DHash to store and to find the value associated with a key; these calls are local to the node. DHash, in turn, issues LOOKUP requests to Chord to locate the node responsible for the key. The Chord layer coordinates nodes to answer the lookup query by forwarding the request via RPC, and it exchanges periodic STABILIZE messages to keep routing structures up to date. The DHash layer causes nodes to exchange messages to download block data (FETCH) and to synchronize replica sets (GETKEYS).
Public-key blocks are often used to name the root of a tree-like structure of content-hash blocks: this allows the system to give a single stable name to a large amount of mutable data while using only a single mutable block.
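The sketch below illustrates the naming and certification rules for public-key blocks: the block is stored under the hash of the public key, and a reader checks the signature before accepting the data. The thesis does not prescribe a particular signature scheme; Ed25519 (via the Python cryptography package) is used here only to make the example concrete.

    import hashlib
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
    from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

    priv = Ed25519PrivateKey.generate()
    pub = priv.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)

    # put_s: the Chord key is the hash of the public key, so the name stays
    # stable even as the signed contents (e.g., a root block) change.
    chord_key = hashlib.sha1(pub).digest()
    root = b"root block: list of content-hash keys for the current file tree"
    signature = priv.sign(root)

    # get: a reader certifies the block by checking that the public key hashes
    # to the Chord key and that the signature over the data verifies.
    assert hashlib.sha1(pub).digest() == chord_key
    priv.public_key().verify(signature, root)   # raises InvalidSignature if tampered with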
An application uses DHash by linking against a library that exports the DHash API. The library sends messages to a DHash server that coordinates with other DHash servers running on the wide-area network to resolve the application's request. The organization of the DHash system is shown in Figure 1-1.
1.1.2  Chord

DHash is built as a layer over the Chord distributed lookup system [120]. Each node participating in the Chord ring is assigned an identifier from a 160-bit circular key space; data items are assigned keys from the same space. Chord provides a scalable lookup service that maps each 160-bit key to a node. The lookup(k) function returns the ID and IP address of the node currently responsible for the key k. The node responsible for a key is the node with the identifier which most closely follows the key in the circular key space; we refer to this node as the successor of k and to the several nodes after k as the successor list of k. In Figure 1-2, the successor of key K19 is node N24 and the successor list of the key is nodes N24, N34, and N41. Note that N16 (the predecessor of the key) maintains pointers to all of these nodes and can definitively return K19's successor list.

Chord maintains a routing table of log N pointers to other nodes in the system and can resolve a mapping by sending log N messages, where N is the number of nodes in the system. Because Chord keeps only a small amount of state, it is able to maintain that state efficiently in large or unstable systems. Section 4.1 explains in more detail how Chord implements the lookup operation using this routing table.
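The successor mapping itself is simple to state; the sketch below computes it directly from a full, sorted list of node IDs. This only illustrates what lookup(k) returns: the point of Chord is that a node finds this answer with about log N state and log N messages rather than a complete membership list, and the node names and port numbers here are made up.

    import bisect
    import hashlib

    def node_id(address: str) -> int:
        # Chord assigns each node a 160-bit identifier; hashing the node's
        # address is one conventional way to choose it.
        return int.from_bytes(hashlib.sha1(address.encode()).digest(), "big")

    def successor(key: int, ring: list[int]) -> int:
        # The node responsible for a key is the first node ID at or after the
        # key on the circular 160-bit space, wrapping around past zero.
        i = bisect.bisect_left(ring, key)
        return ring[i % len(ring)]

    ring = sorted(node_id(f"node{i}.example.org:4000") for i in range(8))
    key = int.from_bytes(hashlib.sha1(b"some block").digest(), "big")
    owner = successor(key, ring)    # lookup(k) also returns this node's IP address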
1.1.3  Base DHash

While Chord provides a location service, DHash is responsible for actually storing the data, creating replicas, and maintaining those replicas as disks fail. DHash has considerable leeway in deciding how to accomplish these tasks; this section examines how base DHash chooses to address them, along with alternative choices.
Figure 1-2: Successor relationship. The successor of a key is the first node clockwise after that key in the circular identifier space. In this example, node N24 is the immediate successor of key K19; nodes N34 and N41 are in the successor list of K19. Node N16 is able to authoritatively determine the successors of any key in the range [16, 24) by maintaining pointers to the nodes immediately following it on the ring. In this example and the following, an 8-bit identifier space is used.
DHash is designed to store a large number of small blocks spread approximately uniformly across participating servers. DHash uses the Chord mapping to determine which node stores a given key: the data associated with a key k is stored on the node returned by lookup(k) and replicated on that node's successors. By storing replicas on the successor list, base DHash distributes data uniformly at random over all servers: a block's key is essentially random (the SHA-1 of the block's value) and node IDs are random. The result is that blocks (and load) are uniformly spread over the DHT nodes and that a block's replicas or fragments are widely scattered to avoid correlated failure. An alternative to placing replicas on the successor list would be to randomly select nodes to serve as the replica set for a key; this approach requires that the system store some meta-data that serves as a level of indirection [7].
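Continuing the lookup sketch above, base DHash's placement rule can be written in a few lines: the block lands on the key's successor, and the next few nodes in identifier order hold the replicas. The ring representation and the replication factor of three are assumptions for illustration, not values taken from the thesis.

    import bisect

    def replica_set(key: int, ring: list[int], r: int = 3) -> list[int]:
        # Base DHash placement: the key's successor stores the block and the
        # next r - 1 nodes on the ring hold its replicas.
        i = bisect.bisect_left(ring, key)
        return [ring[(i + j) % len(ring)] for j in range(r)]

In the 8-bit example of Figure 1-2, replica_set(19, [16, 24, 34, 41]) returns [24, 34, 41], matching the successor list of key K19.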
Some DHTs provide only a key location service and let each application decide where (or even whether) to store data [57]. By using a semi-structured name space [53, 54], a system could give applications control over where a block is stored. Applications might choose to give hints about the geographic or administrative domains in which data should reside. DHash, and other DHTs that store data uniformly across all nodes in the system, are appropriate for applications that wish to view the DHT as a network storage system that automatically provides replication and location services. Our motivating applications, OverCite [122] and UsenetDHT [115], are examples of this class. We will describe these applications in more detail in Section 1.2.
DHTs that store data must decide on the size of the units of data to store. A DHT key could refer to a disk-sector-like block of data [31], to a complete file [109], or to an entire file system image [26]. Large values reduce the number of lookups required to fetch a large amount of data. Small blocks spread the load of serving popular large files. For these reasons, and because some applications (such as SFR [129]) require the DHT to store small blocks, DHash is optimized to store blocks of 8 KB or less. The main advantage of large blocks is potentially higher throughput; one might expect that the need to perform a DHT lookup for each 8 KB block would lead to poor performance. However, as we will show in Chapter 5, DHash is able to take advantage of available network capacity even when using small (8 KB) blocks.
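To show how an application layers a larger object over small blocks, the sketch below splits data into 8 KB content-hash blocks and keeps the resulting key list; storing that key list in a further (root) block is the tree-of-blocks approach used by systems such as CFS [31]. The function names and the put_h/get callables are illustrative assumptions, not the DHash API.

    BLOCK_SIZE = 8 * 1024    # DHash is optimized for blocks of 8 KB or less

    def store_large(data: bytes, put_h) -> list[bytes]:
        # Split the object into 8 KB pieces and store each as a content-hash
        # block; the returned key list can itself be stored in a root block.
        return [put_h(data[off:off + BLOCK_SIZE])
                for off in range(0, len(data), BLOCK_SIZE)]

    def fetch_large(keys: list[bytes], get) -> bytes:
        # Each fetch may be served by a different node, which is what spreads
        # the load of a popular large file across the system.
        return b"".join(get(k) for k in keys)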
Storing data on the successor list ensures good load balance but could limit the throughput that an application obtains when it downloads a large file that has been broken into blocks. Because base DHash rigidly maps a given block to some node, the slowest node in the system will be the owner of some blocks and therefore must be periodically contacted. The slowest node dictates how fast the entire download can progress: on average, N times the throughput of the slowest node if there are N nodes in the system. Note that if the slowest node is a bottleneck, this implies that the access link of the node performing the download is N times faster than the access link of the slowest node.
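A back-of-the-envelope version of this argument, with made-up capacities:

    # Blocks are spread uniformly, so each of the N nodes serves roughly 1/N of
    # a large download; the transfer cannot finish before the slowest node has
    # delivered its share, capping the average rate at N times that node's rate.
    upload_kbps = [1000, 800, 600, 400, 100]     # hypothetical node capacities
    n = len(upload_kbps)
    average_rate_cap = n * min(upload_kbps)      # 5 * 100 = 500 kbps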
The system might achieve a higher initial rate by downloading from higher-capacity hosts at those hosts' full rate; however, such a strategy does not change when the last block will be received from the slowest host. We also might relax the requirement on where blocks are stored to avoid allowing slow hosts to limit the system's performance. This quickly leads to a question of access control: should all hosts be allowed to participate fully in the system? If the system membership is open, but the system favors high-performance hosts, it is not clear what benefit the overall system derives from allowing the slower hosts to participate (the penalty for allowing such hosts is clear, however). One possibility is that hosts with slow access links can be used as "emergency" replicas: normally they are not contacted during a download, but if the better-equipped replicas are not available, the resource-constrained hosts can service downloads slowly to maintain block availability.
Data stored in the DHT must be replicated for reliability. Any storage system should provide two distinct but related properties: durability and availability. The definition of availability is straightforward: a block is available at some point in time if an application can access the block through the system at that time. The definition of durability is a bit more difficult: we'll define a block to be durable if it is stored on some media and will be available through the system at some point in the future. DHash's main goal is data durability; however, since the system, in practice, can only observe availability failures, it also maintains data availability.
The number of copies of each block that are maintained helps to determine the probability that a data block will be lost; increasing the replication factor reduces the probability of loss but increases the amount of data that must be sent over the network when a block is initially stored and during the ongoing process of replacing failed replicas. DHash is design
