`
`Proceedings of the First Symposium on
`
`Networked Systems Design and Implementation
`
`San Francisco, CA, USA
`
March 29–31, 2004
`
`USENIX
`
THE ADVANCED COMPUTING SYSTEMS ASSOCIATION
`
© 2004 by The USENIX Association. All Rights Reserved.

For more information about the USENIX Association:
Phone: 1 510 528 8649
FAX: 1 510 548 5738
Email: office@usenix.org
WWW: http://www.usenix.org

Rights to individual papers remain with the author or the author's employer.
Permission is granted for noncommercial reproduction of the work for educational or research purposes.
This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.
`
`
`
`
`Designing a DHT for low latency and high throughput
`
`Frank Dabek, Jinyang Li, Emil Sit, James Robertson, M. Frans Kaashoek, Robert Morris *
`MIT Computer Science and Artificial Intelligence Laboratory
{fdabek, jinyang, sit, jsr, kaashoek, rtm}@csail.mit.edu
`
`Abstract
`
`Designing a wide-area distributed hash table (DHT) that
`provides high-throughput and low-latency network stor-
`age is a challenge. Existing systems have explored a range
`of solutions, including iterative routing, recursive routing,
`proximity routing and neighbor selection, erasure coding,
`replication, and server selection.
`This paper explores the design of these techniques and
`their interaction in a complete system, drawing on the
`measured performance of a new DHT implementation and
`results from a simulator with an accurate Internet latency
`model. New techniques that resulted from this exploration
`include use of latency predictions based on synthetic co-
`ordinates, efficient integration of lookup routing and data
`fetching, and a congestion control mechanism suitable for
`fetching data striped over large numbers of servers.
`Measurements with 425 server instances running on
`150 PlanetLab and RON hosts show that the latency opti-
`mizations reduce the time required to locate and fetch data
`by a factor of two. The throughput optimizations result
`in a sustainable bulk read throughput related to the num-
`ber of DHT hosts times the capacity of the slowest access
link; with 150 selected PlanetLab hosts, the peak aggre-
`gate throughput over multiple clients is 12.8 megabytes
`per second.
`
`1
`
`Introduction
`
The Internet has transformed communication for dis-
tributed applications: each new system need not imple-
ment its own network, but can simply assume a shared
global communication infrastructure. A similar transfor-
mation might be possible for storage, allowing distributed
applications to assume a shared global storage infrastruc-
ture. Such an infrastructure would have to name and find
data, assure high availability, balance load across avail-
able servers, and move data with high throughput and low
latency.
`Distributed hash tables (DHTs) are a promising path to-
`wards a global storage infrastructure, and have been used
`
*This research was conducted as part of the IRIS project
(http://project-iris.net/), supported by the National Sci-
ence Foundation under Cooperative Agreement No. ANI-0225660.
`
`as the basis for a variety of wide-area file and content pub-
`lishing systems [13, 26, 34, 38]. Good performance, how-
`ever, is a challenge: the DHT nodes holding the data may
`be far away in the network, may have access link capaci-
`ties that vary by orders of magnitude, and may experience
`varying degrees of congestion and packet loss.
`This paper explores design choices for DHT read and
`write algorithms. Existing work has investigated how to
make the lookup of keys in DHTs scalable, low-latency,
fault-tolerant, and secure, but less attention has been paid
to the efficiency and robustness with which DHTs read
and store data. This paper considers a range of design op-
tions for efficient data handling in the context of a single
DHT, DHash++. The decisions are evaluated in simula-
tion and in an implementation of DHash++ on the Planet-
Lab [29] and RON [2] test-beds.
`To bound the discussion of design decisions, we have
`made a number of assumptions. First, we assume that all
`nodes cooperate; the algorithms for reading and writing
`are likely to be more expensive if they have to defend
`against malicious nodes. Second, we assume that lookups
are routed using one of the O(log N)-style schemes, in-
stead of using the recently proposed O(1) schemes [14,
`17, 18, 44]. Finally, we assume that the DHT stores small
`blocks (on the order of 8192 bytes). Relaxing these as-
`sumptions will result in different DHT designs with dif-
`ferent latency and throughput properties, which we hope
`to explore in the future.
`The paper makes the following contributions. Recur-
`sive lookups take about 0.6 times as long as iterative; the
`reason why the reduction is not a factor of two is the cost
`of the final return trip. The latency of the last few hops
`in a lookup acts as a lower bound on the performance
`of Proximity Neighbor Selection [37, 16], which approxi-
mates 1.5 times the average round trip time in the underly-
`ing network. This result holds regardless of the number of
`DHT nodes (and thus regardless of the number of hops).
`Replicated data allows for low-latency reads because there
`are many choices for server selection, while erasure-coded
`data reduces bandwidth consumption for writes at the ex-
`pense of increased read latency. Integration of key lookup
`and data fetch reduces the lower bound imposed by the
`last few lookup hops. Finally, using an integrated trans-
`
`
`
`
`
`
`
`
`
`
`
`
port protocol rather than TCP provides opportunities for
efficiency in alternate routing after timeouts and allows
the DHT freedom to efficiently contact many nodes.

Figure 1: DHash++ system overview.
`The rest of this paper is structured as follows. Section 2
`outlines the complete system that surrounds the specific
`mechanisms detailed in the paper. Section 3 describes the
`methods behind the paper’s measurements and quantita-
`tive evaluations. Section 4 discusses design decisions that
`affect latency, and Section 5 discusses throughput. Sec-
`tion 6 describes related work. We conclude in Section 7.
`
`2 Background
`
For concreteness, this paper evaluates design decisions in the
`context of a complete DHT called DHash++. This section
describes the parts of DHash++ that are needed to under-
`stand the rest of the paper.
`
`2.1 Chord
`
`DHash++ uses the Chord lookup algorithm to help it find
data [42]. Chord provides a function lookup(key) →
set-of-IP, which maps a 160-bit key to the set of IP
addresses of the nodes responsible for that key. Each node
has a 160-bit identifier, and Chord designates the s nodes
whose identifiers immediately follow a key as responsible
for that key; these are the key's successors. To provide
reliable lookup even if half of the nodes fail in a 2^16-node
network, the number of successors, s, is 16 in the Chord
implementation. The ID space wraps around, so that zero
immediately follows 2^160 − 1.
`The base Chord lookup algorithm (which will be modi-
`fied in subsequent sections) works as follows. Each Chord
`node maintains a finger table, consisting of the IP ad-
`dresses and IDs of nodes that follow it at power-of-two
`distances in the identifier space. Each node also maintains
a successor list referring to its s immediate successors.
`When a node originates a lookup, it consults a sequence
`of other nodes, asking each in turn which node to talk to
`
`Figure 2: An illustration of a Chord identifier ring. The
`tick mark denotes the position of a key in ID space. The
`square shows the key’s successor node, and the circles
show the nodes in the successor's successor list. The trian-
gles and arrows show a lookup path. The last node before
`the tick mark is the key’s predecessor.
`
`next. Each node in this sequence answers with the node
`from its finger table with highest ID still less than the de-
`sired key. The originating node will find the key’s prede-
`cessor node after O(log N) consultations; it then asks the
`predecessor for its successor list, which is the result of the
`lookup. This style of lookup is called iterative, since the
`originating node controls each step of the lookup. All of
`the communication uses UDP RPCs.
`
`Figure 2 shows a Chord ring with a key, its successor,
`the successor’s successor list, and a lookup path; this pic-
`ture is helpful to keep in mind since much of the discus-
`sion appeals to the ring geometry. Although this paper
`explores optimizations over base Chord, we believe that
`these optimizations also apply to other DHTs that route in
`ID spaces using an O(log N) protocol.
`
`2.2 DHash++
`
`DHash++ stores key/value pairs (called blocks) on a set
of servers. The DHash++ client API consists of key ←
put(value) and value ← get(key). DHash++
calculates the key to be the SHA-1 hash of the value, and
uses Chord to decide which server should store a given
block; each server runs both Chord and DHash++ soft-
ware. As well as finding and moving data for client ap-
`plications, DHash++ authenticates the data and moves it
`from server to server as nodes join, leave, and fail [7].
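As a concrete illustration of this interface, the sketch
below shows put/get with content-hashed keys. A local dict
stands in for the servers, so it captures only the API
contract, not the distributed system.

import hashlib

DHT = {}                                # stand-in for the servers

def put(value: bytes) -> bytes:
    key = hashlib.sha1(value).digest()  # 160-bit content hash
    DHT[key] = value                    # really: stored at the key's successors
    return key

def get(key: bytes) -> bytes:
    value = DHT[key]                    # really: fetched from a successor
    # Content hashing lets the client authenticate fetched data.
    assert hashlib.sha1(value).digest() == key
    return value

k = put(b"a small block")
assert get(k) == b"a small block"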
`
`2.3 Synthetic coordinates
`
`Many of the techniques described in this paper use syn-
`thetic coordinates to predict inter-node latencies without
`having to perform an explicit measurement to determine
`the latency. A number of synthetic coordinate systems
`have been proposed [10, 24, 27, 30, 33, 39]. We chose
`
`
`
`
`to use Vivaldi [12], because its algorithm is decentralized,
which makes it suitable for use in peer-to-peer systems.
`Furthermore, the Vivaldi algorithm is lightweight, since it
`can piggy-back on DHash++’s communication patterns to
`compute coordinates.
`Whenever one Chord or DHash++ node communicates
`
`directly with another, they exchange Vivaldi coordinates.
`Nodes store these coordinates along with IP addresses in
`routing tables and successor lists. The result of a lookup
`for a key carries the coordinates of the nodes responsible
`for the key as well as their IP addresses. Thus the request-
`ing node can predict the latencies to each of the responsi-
`ble nodes without having to first communicate with them.
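A minimal sketch of how such piggybacked coordinates might
be used, assuming two-dimensional Euclidean Vivaldi
coordinates (the node names and coordinate values below are
illustrative):

import math

def predicted_latency_ms(a, b):
    """Coordinate distance serves as the latency prediction."""
    return math.dist(a, b)

# Coordinates carried in a lookup result for a key's successors:
successors = {"node A": (12.0, -40.0), "node B": (85.0, 30.0)}
origin = (10.0, -35.0)

# Choose the responsible node predicted to be nearest, without
# first communicating with either candidate:
best = min(successors,
           key=lambda n: predicted_latency_ms(origin, successors[n]))
print(best)   # -> node A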
`
`3 Evaluation methods
`
`The results in this paper are obtained through simulations
`and measurements on the PlanetLab and RON test-beds.
`
`The measurements focus on DHT operations that require
`low latency or high throughput.
`
`3.1 Evaluation infrastructure
`
`DHT performance depends on the detailed behavior of
`the servers and the underlying network. The test-bed mea-
surements in Section 4 were taken from a DHash++ im-
plementation deployed on the PlanetLab and RON test-
beds. 180 test-bed hosts were used, of which 150 were
in the United States and 30 elsewhere. 105 of the hosts
are on the Internet2 network; the rest have connections
via DSL, cable modem, commercial T1 service, or are
`at co-location centers. Each host runs three independent
`DHash++ processes, or virtual nodes, in order to improve
`load balance and to ensure that the total number of nodes
`
`is large compared to the size of the Chord successor list.
`The measurements in Section 5 were taken on the 27-node
`RON test-bed alone.
`
`The test-bed measurements are augmented with simu-
`lation results to explore large configurations, to allow easy
`testing of alternate designs, and to allow analytic explana-
`tions of behavior in a controlled environment. The simu-
`
`lated network models only packet delay. One input to the
simulator is a full matrix of the round-trip delays between
each pair of simulated hosts. This approach avoids hav-
ing to simulate the Internet's topology, a currently open
`area of research; it requires only the measurement of ac-
`tual pair-wise delays among a set of hosts. The simulator
`can produce useful speed-of-light delay results, but cannot
`be used to predict throughput or queuing delay.
`The simulator’s delay matrix is derived from Internet
`measurements using techniques similar to those described
by Gummadi et al. [15]. The measurements involved 2048
`DNS servers found with inverse DNS lookups on a trace
`of over 20,000 Gnutella clients. For each pair of these
`servers, a measuring node sends a query to one server that
`
` 1
`0.9 -
`0.8 —
`0.7 -
`0.6 -
`0.5 -
`0.4 -
`
`0.2 r
`
`probability 0.3 -
`0.1 ~
`
`Cumulative
`
`0
`
`PlanetLab
`
`‘King w; -------
`.
`.
`.
`i
`i
`l
`l
`400
`450
`500
`50
`100
`150
`200
`250
`300
`350
`0
`Latency (ms)
`
`Figure 3: Round—trip latency distribution over all pairs of
`PlanetLab and King dataset hosts. The median and aver-
`age King dataset latencies are 134 and 154 milliseconds
`respectively. The median and average PlanetLab latencies
`are 76 and 90 milliseconds respectively.
`
`requires it to contact the other server. Subtracting the de-
`lay between the measuring node and the first server from
`the total delay yields the delay between the two servers.
`In order to reduce the effects of queuing delay, the min-
`imum delay from five experiments is used. In this paper
`the results are called the King data-set. All the simula-
tions in this paper involve 2048 DHT nodes using the King
delay matrix unless otherwise mentioned. Figure 3 shows
the CDF of the King data-set round-trip times; the median
`is 134 milliseconds, while the average is 154 millisec-
`onds. The graph also shows the minimum delay of five
`pings between each pair of PlanetLab hosts for compar-
`ison. The main difference between the two curves is the
`
`longer tail on the King distribution, which is likely caused
by the larger sample of nodes.

[Figure 3 plot: cumulative probability versus round-trip latency in ms for the PlanetLab and King datasets.]

Figure 3: Round-trip latency distribution over all pairs of
PlanetLab and King dataset hosts. The median and aver-
age King dataset latencies are 134 and 154 milliseconds
respectively. The median and average PlanetLab latencies
are 76 and 90 milliseconds respectively.
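The subtraction at the heart of this King-style estimate
reduces to simple arithmetic. Below is a small sketch with
illustrative delay values, reading the five-experiment rule
as a minimum over per-trial differences:

def king_delay_ms(total_delays, first_hop_delays):
    """Server-to-server delay: recursive-query delay minus the
    measurer-to-first-server delay, minimized to shed queuing."""
    return min(t - f for t, f in zip(total_delays, first_hop_delays))

# Five trials of (measurer -> A -> B) and (measurer -> A), in ms:
total = [183.0, 175.2, 190.4, 177.8, 181.1]
first_hop = [41.3, 40.8, 44.9, 41.0, 42.2]
print(king_delay_ms(total, first_hop))   # -> 134.4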
`
`3.2 Application workload
`
`The design of a DHT must incorporate assumptions about
`probable application behavior, and a DHT evaluation must
`also involve either applications or models of application
`behavior. The application aspects that most affect perfor-
`mance are the mix of read and write operations, the degree
`to which operations can be pipelined, and the size of the
`data records.
`
DHash++ is designed to support read-heavy applica-
tions that demand low-latency and high-throughput reads
as well as reasonably high-throughput writes. Examples
`of such applications might include the Semantic Free Ref-
`erencing system (SFR) [45] and UsenetDHT [40].
`SFR is a naming system designed to replace the use
`of DNS as a content location system. SFR uses a DHT
`to store small data records representing name bindings.
`
`
`
`
`Reads are frequent and should complete with low latency.
`Writes are relatively infrequent and thus need not be as
high performance. SFR data blocks are likely to be on the
order of hundreds of bytes.
`UsenetDHT is a service aiming to reduce the total stor-
`age dedicated to Usenet by storing all Usenet articles in a
shared DHT. UsenetDHT splits large binary articles (av-
eraging 100 KB) into small blocks for load balance, but
smaller text articles (typically 5 KB or less) are stored as
single blocks. While readership patterns vary, UsenetDHT
must support low-latency single article reads, as well as
`high-throughput pipelined article fetches.
These systems are unlikely to be deployed on high-churn
networks; these systems are all server-class. The
`target environment for them is a network with relatively
`reliable nodes that have good Internet access.
`
`4 Designing for low latency
`
`This section investigates five design choices that affect
`DHT get latency. The naive algorithm against which
`these choices are judged, called base DHash++, operates
as follows. Each 8192-byte block is stored as 14 1171-
byte erasure-coded fragments, any seven of which are suf-
ficient to reconstruct the block, using the IDA coding al-
gorithm [31]. The 14 fragments are stored at the 14 imme-
diate successors of the block's key. When an application
calls get(key), the originating node performs an itera-
tive Chord lookup, which ends when the key's predecessor
node returns the key's 16 successors; the originating node
then sends seven parallel requests to the first seven succes-
sors asking them each to return one fragment.
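The base fetch strategy can be sketched as follows (an
illustration, not the DHash++ code; a stand-in that splits
a block into contiguous chunks plays the role of the IDA
erasure code):

from concurrent.futures import ThreadPoolExecutor, as_completed

M = 7                                # fragments needed to reconstruct

def fetch_fragment(server, key):
    return server[key]               # really a UDP RPC to a successor

def get_block(successors, key):
    got = []
    with ThreadPoolExecutor(max_workers=M) as pool:
        futs = [pool.submit(fetch_fragment, s, key)
                for s in successors[:M]]
        for f in as_completed(futs): # gather the parallel replies
            got.append(f.result())   # each is (fragment index, bytes)
            if len(got) == M:        # any M fragments suffice
                break
    return b"".join(data for _, data in sorted(got))

block = b"abcdefghijklmnopqrstuvwxyz12"
n = len(block) // M
servers = [{"k": (i, block[i * n:(i + 1) * n])} for i in range(M)]
assert get_block(servers, "k") == block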
`Figure 4 gives a preview of the results of this sec-
`tion. Each pair of bars shows the median time to fetch a
`block on the PlanetLab test-bed after cumulatively apply-
`ing each design improvement. The design improvements
`shown are recursive rather than iterative routing, proxim-
`ity neighbor selection, fetching of data from the closest
`copy, and integration of lookup routing and data fetching.
`These design improvements together reduce the total fetch
`latency by nearly a factor of two.
`
`This paper uses a log(N) protocol for routing lookups.
`An optimization that isn’t explored in this paper is an in-
crease in the base to reduce the number of hops, or the use
of a constant-hop protocol. These optimizations would
`reduce latency under low churn, because each node would
`know about many other nodes. On the other hand, in high
`churn networks, these optimizations might require more
`bandwidth to keep routing tables up to date or experi-
`ence more timeouts because routing tables might contain
`recently-failed nodes. The paper’s evaluation infrastruc—
`ture isn’t adequate to explore this design decision in de-
`tail. We hope to explore this issue in future work. We do
explore the extent to which proximity routing can reduce
the impact of the number of hops on the lookup latency.

[Figure 4 plot: median latency in ms, shown as stacked lookup and fetch bars, for each cumulative optimization: Base, Recursive lookup, Proximity routing, Server selection, Integration.]

Figure 4: The cumulative effect of successive optimiza-
tions on the latency of a DHash++ data fetch. Each bar
shows the median time of 1,000 fetches of a randomly
chosen 8192-byte data block from a randomly chosen
host. The dark portion of each bar shows the lookup time,
and the light portion shows the time taken to fetch the
data. These data are from the implementation running on
PlanetLab.
`
`4.1 Data layout
`The first decision to be made about where a DHT should
`store data is whether it should store data at all. A num—
`
`ber of DHTs provide only a key location service, perhaps
`with a layer of indirection, and let each application de-
`cide where (or even whether) to store data [20, 28]. The
`choice is a question of appropriate functionality rather
`than performance, though Section 4.5 describes some per-
`formance benefits of integrating the DHT lookup and data
storage functions. The approach taken by DHash++ is ap-
`propriate for applications that wish to view the DHT as a
`network storage system, such as our motivating examples
`SFR and UsenetDHT.
`
`For DHTs that store data, a second layout decision is
`the size of the units of data to store. A DHT key could re-
fer to a disk-sector-like block of data [13], to a complete
file [38], or to an entire file system image [11]. Large val-
`ues reduce the amortized cost of each DHT lookup. Small
`blocks spread the load of serving popular large files. For
`these reasons, and because some applications such as SFR
`require the DHT to store small blocks, DHash++ is opti-
`mized with blocks of 8 KB or less in mind.
`
`A third layout decision is which server should store
`each block of data (or each replica or coded fragment).
`If a given block is likely to be read mostly by hosts in
`a particular geographic area, then it would make sense
`to store the data on DHT servers in that area. Caching
`is one way to achieve this kind of layout. On the other
`hand, geographic concentration may make the data more
`
`
`vulnerable to network and power failures, it may cause
`the load to be less evenly balanced across all nodes, and
`is difficult to arrange in general without application hints.
At the other extreme, the DHT could distribute data uni-
`formly at random over the available servers; this design
`would be reasonable if there were no predictable geo-
`graphic locality in the originators of requests for the data,
`or if fault-tolerance were important. DHash++ uses the
`latter approach: a block’s key is essentially random (the
SHA-1 of the block's value), node IDs are random, and a
`block’s replicas or fragments are placed at its key’s suc-
`cessor nodes. The result is that blocks (and load) are uni-
`formly spread over the DHT nodes, and that a block’s
`replicas or fragments are widely scattered to avoid cor-
`related failure.
`
`Given a DHT design that stores blocks on randomly
`chosen servers, one can begin to form some expectations
`about fetch latency. The lower bound on the total time to
`find and fetch a block is the round trip time from the orig-
`inator to the nearest replica of the block, or the time to
`the most distant of the closest set of fragments required
`to reconstruct the block. For the typical block this time is
`determined by the distribution of inter-host delays in the
`Internet, and by the number of choices of replicas or frag-
`ments. The DHT lookup required to find the replicas or
`fragments will add to this lower bound, as will mistakes
`in predicting which replica or fragments are closest.
`Most of the design choices described in subsequent
`subsections have to do with taking intelligent advantage
of choices in order to reduce lookup and data fetch la-
`tency.
`
`4.2 Recursive or iterative?
`
`The base Chord and Kademlia algorithms are iterative: the
`originator sends an RPC to each successive node in the
lookup path, and waits for the response before proceed-
`ing [25, 42]. Another possibility is recursive lookup [6,
`47]: each node in the lookup path directly forwards the
`query to the next node, and when the query reaches the
`key’s predecessor, the predecessor sends its successor list
`directly back to the originator [42]. Recursive lookup,
`which many DHTs use, might eliminate half the latency
`of each hop since each intermediate node can immedi-
`ately forward the lookup before acknowledging the pre-
`vious hop.
`
[Figure 5 plot: cumulative probability versus lookup latency in ms, with curves for iterative and recursive lookup.]

Figure 5: The cumulative distributions of lookup time for
Chord with recursive and iterative lookup. The recursive
median and average are 461 and 489 milliseconds; the it-
erative median and average are 720 and 822 milliseconds.
The numbers are from simulations.

Figure 5 shows the effect of using recursive rather than
iterative lookup in the simulator with the 2048-node King
data set. For each technique, 20,000 lookups were per-
formed, each from a random host for a random key. The
average number of hops is 6.3. Recursive lookup takes on
average 0.6 times as long as iterative. This decrease is not
quite the expected factor of two: the difference is due to
the extra one-way hop of (on average) 77 milliseconds to
return the result to the originator.
`
`While recursive lookup has lower latency than iterative,
`iterative is much easier for a client to manage. If a recur-
`sive lookup elicits no response, the originator has no in-
`formation about what went wrong and how to re-try in
`a way that is more likely to succeed. Sometimes a simple
re-try may work, as in the case of lost packets. If the prob-
lem is that each successive node can talk to the next node,
`but that Internet routing anomalies prevent the last node
`from replying to the originator, then re-tries won’t work
`because only the originator realizes a problem exists. In
`contrast, the originator knows which hop of an iterative
`lookup failed to respond, and can re-try that hop through
`a different node in the same region of the identifier space.
On the other hand, recursive communication may
make congestion control easier (that is, it may make it
more feasible to rely on TCP). We will show in Section 5
that the performance of a naive TCP transport can be quite
poor.
`
`DHash++ uses recursive lookups by default since they
`are faster, but falls back on iterative lookups after persis-
`tent failures.
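That policy amounts to the control flow sketched below.
Both lookup styles are stubbed out, and the retry budget is
an assumed value, not one taken from the paper:

RETRIES = 3      # assumed budget before declaring persistent failure

def recursive_lookup(key):
    """Stub: query forwarded hop by hop; None models no response."""
    return None

def iterative_lookup(key):
    """Stub: the originator drives each hop itself and can re-route
    around a failed node, so it serves as the robust fallback."""
    return ["10.0.0.1", "10.0.0.2"]      # the key's successors

def robust_lookup(key):
    for _ in range(RETRIES):         # simple re-tries handle lost packets
        result = recursive_lookup(key)
        if result is not None:
            return result
    return iterative_lookup(key)     # fall back after persistent failure

print(robust_lookup("some-key"))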
`
`4.3 Proximity neighbor selection
`
`
`Many DHTs decrease lookup latency by choosing nearby
`nodes as routing table entries [6, 16, 25, 42, 43, 47],
`a technique often called proximity neighbor selection
`(PNS). The reason this is possible is that there are usually
`few constraints in the choice of routing entries: any node
`in the relevant portion of the identifier space is eligible.
`A DHT design must include an algorithm to search for
`nearby nodes; an exhaustive search may improve lookup
`
`
`
`
`800
`
`300
`PNS(16) —+-—
`PNS(N) ----->(-
`
`250 -
`
`m ME 200 - >‘""""""""""""""""""""""""""""""",6.rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
`
`..
`
`’6D)
`
`3C
`93
`ca. 150
`3xO
`
`2 3
`
`33» 100 »
`a>
`<
`
`50
`
`
`
`
`O
`100
`
`i
`1000
`
`Network Size
`
`700-1
`
`6:Oo
`
`100 -
`
`
`
`
`
`Averagelookuplatency(msec)
`
`
`
`
`
`:3orOOOC
`o:oO I.u
`
`Noo
`
`4__.,
`
`A
`
`.74
`
`’1
`
`._.
`
`,
`
`
`
`10
`
`100
`Number of PNS samples
`
`III
`
`4———<
`
`H—k4—<
`
`1000
`
`Figure 6: Average lookup latency as a function of the
number of PNS samples. The bar at each x value shows
the 10th, average, and 90th percentile of the latencies ob-
served by 20,000 recursive lookups of random keys from
random nodes using PNS(x). The measurements are from
`the simulator with 2048 nodes.
`
`Figure 7: Average lookup latency of PNS(16) and
`PNS(N) as a function of the number of nodes in the sys-
`tem, N. The simulated network sizes consist of 128, 256,
512, 1024, and 2048 nodes.
`
`latency, but also consume network resources. This sub-
`section builds on the work of Gummadi et al. [16] in two
`ways: it explains why PNS approximates 1.5 times the av-
`erage round trip time in the underlying network and shows
`that this result holds regardless of the number of DHT
`nodes (and thus regardless of the number of hops).
Following Gummadi et al. [16], define PNS(x) as fol-
lows. The ith Chord finger table entry of the node with ID
a properly refers to the first node in the ID-space range
a + 2^i to a + 2^(i+1) − 1. The PNS(x) algorithm considers
up to the first x nodes in that range (there may be fewer
than x), and routes lookups through the node with lowest
latency. Ideal PNS refers to PNS(x) with x equal to the to-
tal number of nodes, so that every finger table entry points
to the lowest-latency node in the entire allowed ID-space
range. The simulator simply chooses the lowest-latency
of the x nodes, while the real implementation asks each
proper finger entry for its successor list and uses Vivaldi
to select the closest node. This means that the real imple-
mentation requires that x ≤ s (the number of successors).
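The selection rule just defined might be sketched as
follows; in the real implementation the latencies would be
Vivaldi predictions, and the values below are illustrative:

def pns_finger(candidates, x, latency_to):
    """candidates: nodes in the finger's allowed ID-space range,
    in ID order. Route through the lowest-latency of the first x."""
    sample = candidates[:x]              # there may be fewer than x
    return min(sample, key=latency_to)

nodes = ["n1", "n2", "n3", "n4", "n5"]
latency = {"n1": 120.0, "n2": 35.0, "n3": 80.0, "n4": 15.0, "n5": 60.0}
print(pns_finger(nodes, 3, latency.get))   # PNS(3) routes via n2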
`
What is a suitable value for x in PNS(x)? Figure 6
shows the simulated effect of varying x on lookup la-
tency. For each x value, 20,000 lookups were issued by
`randomly selected hosts for random keys. Each lookup is
`recursive, goes to the key’s predecessor node (but not suc-
`cessor), and then directly back to the originator. The graph
`plots the median, 10th percentile, and 90th percentile of
`latency.
`
Figure 6 shows that PNS(1) has a simulated average
latency of 489 ms, PNS(16) has an average latency of 224
`ms, and PNS(2048) has an average latency of 201 ms. The
`
`latter is ideal PNS, since the neighbor choice is over all
`nodes in the simulation. PNS(16) comes relatively close
`to the ideal, and is convenient to implement in the real
`system with successor lists.
`
Why does ideal PNS show the particular improvement
that it does? The return trip from the predecessor to the
originator has the same median as the one-way delay dis-
tribution of the nodes in the network, δ. For the King
data set, δ = 67 ms. The last hop (to the predecessor)
has only one candidate, so its median latency is also δ.
Each preceding hop has twice as many candidate nodes
to choose from on average, since the finger-table interval
involved is twice as large in ID space. So the second-to-
last hop is the smaller of two randomly chosen latencies,
the third-to-last is the smallest of four, etc. The minimum
of x samples has its median at the 1 − 0.5^(1/x) percentile
of the original distribution, which can be approximated
as the 0.69/x percentile for large x. Doubling the sample
size will halve the percentile of the best sample. Assum-
ing a uniform latency distribution, doubling the sample
size halves the best sampled latency. Therefore, the laten-
cies incurred at successive lookup hops with ideal PNS
can be approximated by a geometric series with the fi-
nal lookup hop to the key's predecessor being the longest
hop. The lookup process includes an additional final hop
to the originator. If we use the per-hop median latency
as a gross approximation of the average per-hop latency,
the total average lookup latency is thus approximated as
δ + (δ + δ/2 + δ/4 + ...) = δ + 2δ = 3δ. For the King
data set, this gives 201 ms. This coincidentally matches
the ideal PNS simulation result of 201 ms.
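The approximation can be checked numerically. The sketch
below assumes, as the argument does, a uniform latency
distribution, and sums the exact per-hop medians rather than
the geometric series:

DELTA = 67.0     # median one-way King delay, ms
HOPS = 11        # roughly log2(2048) lookup hops

# Hop i (counting back from the predecessor) picks the best of 2^i
# candidates; the median minimum of x uniform samples on [0, 2*delta]
# is 2*delta*(1 - 0.5**(1/x)).
hop_medians = [2 * DELTA * (1 - 0.5 ** (1.0 / 2 ** i))
               for i in range(HOPS)]
total = DELTA + sum(hop_medians)      # the return trip adds one delta
print(round(total, 1), "ms vs 3*delta =", 3 * DELTA, "ms")
# -> about 217 ms versus 201 ms: the geometric series slightly
#    underestimates, since halving a sample's percentile only
#    roughly halves its latency.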
`
`The fact that the average lookup latency of PNS(N)
`can be approximated as an infinite geometric series whose
`
`
`
`
`
`.
`.
`.
`i
`.
`,
`.
`.
`i
`
`05
`0.45 >
`
`0.4 1
`
`Q3
`
`0.25 -
`0.2-
`
`.2
`E2
`0 0.15 —
`
`a3
`
`0.1 - E
`0.05 -15
`,
`o
`
`o
`
`.
`10
`
`.
`20
`
`
`
`.
`30
`
`—
`
`100
`
`
`
`K'n
`.
`'9 .
`so
`90
`
`.
`40
`
`.
`so
`Latency (ms)
`
`.
`60
`
`.
`70
`
Figure 8: The median of the minimum latency taken
from x samples out of the all-pairs empirical latency dis-
tribution of the King dataset. The boxes correspond to
2, 4, 8, 16, 32 samples starting from the right.
`
`sum converges quickly suggests that despite the fact that
the number of lookup hops scales as log(N), the total av-
erage lookup latency will stay close to 3δ. Figure 7 shows
`the simulated average lookup latency as a function of the
`number of nodes in the system. As we can see, there is
`indeed little increase in average lookup latency as the net-
`work grows.
`
`Why are there diminishing returns in Figure 6 beyond
`roughly PNS(16)? First, the King delay distribution is not
`uniform, but has a flat toe. Thus increasing the number of
`samples produces smaller and smaller decreases in mini-
`mum latency. Figure 8 shows this effect for various sam-
ple sizes. Second, for large x, the number of samples is
often limited by the allowed ID-space range for the finger
in question, rather than by x; this effect is more important
in the later hops of a lookup.
`One lesson from this analysis is that the last few hops of
`a lookup dominate the total latency. As a lookup gets close
`to the target key in ID space, the number of remaining
`nodes that are closer in ID space to the key decreases, and
`thus the latency to the nearest one increases on average.
`Section 4.5 shows how to avoid this problem.
`
`4.4 Coding versus replication
`
Once the node originating a fetch acquires the key's pre-
`decessor’s successor list, it knows which nodes hold the
`block’s replicas [13, 38] or fragments of an erasure-coded
`block [8, 3, 22, 19]. In the case of replication, the origi-
`nator’s strategy should be to fetch the required data from
`the successor with lowest latency. The originator has more
`options in the case of coded fragments, but a reasonable
`approach is to fetch the minimum required number of
`fragments from the closest successors. The technique of
`
Second, increasing r is likely to decrease fetch latency,
since that provides the originator more choices from
which to pick a nearby node. Third, increasing r increases
the amount of communication required to write a block to
the DHT. These performance aspects of erasure coding
`have not been considered previously.
`Figure 9 illustrates the relationship between total fetch
`latency and block availability. The probability p0 that each
`node is available is kept constant at 0.9. Each line repre-
sents a different rate r, and the points on the line are ob-
tained by varying m and setting l = r × m. Each point's
`x-axis value indicates the probability that a block is avail-
`able as calculated by Equation 1. Each point’s y-axis value
`is the average latency from 20,000 simulations of fetch-
`ing a random block from a random originating node. The
`originator performs a lookup to obtain the list of the de-
`sired key’s successors, then issues parallel RPCs to the
`m of those successors that have lowest latency, and waits
for the last of the RPCs to complete. The y-axis values
`include only the data fetch time.
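Equation 1 itself falls on a page not reproduced in this
excerpt; the sketch below assumes the standard binomial form
for the probability that at least m of the l fragments sit
on available nodes, which is consistent with how the figure
is described:

from math import comb

def block_availability(l, m, p0=0.9):
    """Assumed form of Equation 1: P(at least m of l fragment
    holders are up), each node up independently with probability p0."""
    return sum(comb(l, i) * p0**i * (1 - p0)**(l - i)
               for i in range(m, l + 1))

# Rate r = l/m = 2, with redundancy spread over more nodes as we
# move left to right along a line in Figure 9:
for m in (1, 2, 4, 8):
    print(m, 2 * m, round(block_availability(2 * m, m), 6))
# Replication (m=1, l=2) yields the lowest availability on the
# line; spreading the same redundancy over more nodes raises it.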
The left-most point on each line corresponds to repli-
cation; that point on the different lines corresponds to
2, 3, and 4 replicas. For each line, the points farther to
the right indicate coding schemes in which smaller-sized
fragments are placed onto larger numbers of nodes. For
each redundancy rate r, replication provides the lowest
latency by a small margin. The reason is easiest to see
for r = 2: choosing the nearest k of 2k fragments ap-
proaches the median as k grows, while choosing the near-
est replica of two yields a latency considerably below the
median. Replication also provides the least availability
because the redundant information is spread over fewer
nodes. The lower lines correspond to larger amounts of re-
dundant information on more nodes; this pr