`
`Proceedings of the First Symposium on
`Networked Systems Design and Implementation
`
`San Francisco, CA, USA
`March 29-31, 2004
`
`THE ADVANCED COMPUTING SYSTEMS ASSOCIATION
`
`For more information about the USENIX Association:
`All Rights Reserved
`© 2004 by The USENIX Association
`Phone: 1 510 528 8649
`FAX: 1 510 548 5738
`Email: office@usenix.org
WWW: http://www.usenix.org
Rights to individual papers remain with the author or the author's employer.
Permission is granted for noncommercial reproduction of the work for educational or research purposes.
This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.
`
`
`
`
Designing a DHT for low latency and high throughput
`
`Frank Dabek, Jinyang Li, Emil Sit, James Robertson, M. Frans Kaashoek, Robert Morris *
`MIT Computer Science and Artificial Intelligence Laboratory
fdabek, jinyang, sit, jsr, kaashoek, rtm@csail.mit.edu
`
`Abstract
`
Designing a wide-area distributed hash table (DHT) that provides high-throughput and low-latency network storage is a challenge. Existing systems have explored a range of solutions, including iterative routing, recursive routing, proximity routing and neighbor selection, erasure coding, replication, and server selection.

This paper explores the design of these techniques and their interaction in a complete system, drawing on the measured performance of a new DHT implementation and results from a simulator with an accurate Internet latency model. New techniques that resulted from this exploration include use of latency predictions based on synthetic coordinates, efficient integration of lookup routing and data fetching, and a congestion control mechanism suitable for fetching data striped over large numbers of servers.

Measurements with 425 server instances running on 150 PlanetLab and RON hosts show that the latency optimizations reduce the time required to locate and fetch data by a factor of two. The throughput optimizations result in a sustainable bulk read throughput related to the number of DHT hosts times the capacity of the slowest access link; with 150 selected PlanetLab hosts, the peak aggregate throughput over multiple clients is 12.8 megabytes per second.
`
`1
`
`Introduction
`
The Internet has transformed communication for distributed applications: each new system need not implement its own network, but can simply assume a shared global communication infrastructure. A similar transformation might be possible for storage, allowing distributed applications to assume a shared global storage infrastructure. Such an infrastructure would have to name and find data, assure high availability, balance load across available servers, and move data with high throughput and low latency.

Distributed hash tables (DHTs) are a promising path towards a global storage infrastructure, and have been used as the basis for a variety of wide-area file and content publishing systems [13, 26, 34, 38]. Good performance, however, is a challenge: the DHT nodes holding the data may be far away in the network, may have access link capacities that vary by orders of magnitude, and may experience varying degrees of congestion and packet loss.

This paper explores design choices for DHT read and write algorithms. Existing work has investigated how to make the lookup of keys in DHTs scalable, low-latency, fault-tolerant, and secure, but less attention has been paid to the efficiency and robustness with which DHTs read and store data. This paper considers a range of design options for efficient data handling in the context of a single DHT, DHash++. The decisions are evaluated in simulation and in an implementation of DHash++ on the PlanetLab [29] and RON [2] test-beds.

*This research was conducted as part of the IRIS project (http://project-iris.net/), supported by the National Science Foundation under Cooperative Agreement No. ANI-0225660.
To bound the discussion of design decisions, we have made a number of assumptions. First, we assume that all nodes cooperate; the algorithms for reading and writing are likely to be more expensive if they have to defend against malicious nodes. Second, we assume that lookups are routed using one of the O(log N)-style schemes, instead of using the recently proposed O(1) schemes [14, 17, 18, 44]. Finally, we assume that the DHT stores small blocks (on the order of 8192 bytes). Relaxing these assumptions will result in different DHT designs with different latency and throughput properties, which we hope to explore in the future.
The paper makes the following contributions. Recursive lookups take about 0.6 times as long as iterative; the reason why the reduction is not a factor of two is the cost of the final return trip. The latency of the last few hops in a lookup acts as a lower bound on the performance of Proximity Neighbor Selection [37, 16], which approximates 1.5 times the average round-trip time in the underlying network. This result holds regardless of the number of DHT nodes (and thus regardless of the number of hops). Replicated data allows for low-latency reads because there are many choices for server selection, while erasure-coded data reduces bandwidth consumption for writes at the expense of increased read latency. Integration of key lookup and data fetch reduces the lower bound imposed by the last few lookup hops. Finally, using an integrated transport protocol rather than TCP provides opportunities for efficiency in alternate routing after timeouts and allows the DHT freedom to efficiently contact many nodes.
`
`
`
`
`;
`|
`Ee
`
`__
`
`
`
`
`
`DHT —J |
`
`Chord
`—}+—1 Vivaldi|
`ool
`1
`
`STP |
`
`| DHasht+
`
`Figure 1: DHash++ system overview.
`
The rest of this paper is structured as follows. Section 2 outlines the complete system that surrounds the specific mechanisms detailed in the paper. Section 3 describes the methods behind the paper's measurements and quantitative evaluations. Section 4 discusses design decisions that affect latency, and Section 5 discusses throughput. Section 6 describes related work. We conclude in Section 7.
`
2 Background

For concreteness, this paper evaluates design decisions in the context of a complete DHT called DHash++. This section describes the parts of DHash++ that are needed to understand the rest of the paper.
`
2.1 Chord

DHash++ uses the Chord lookup algorithm to help it find data [42]. Chord provides a function lookup(key) → set-of-IP, which maps a 160-bit key to the set of IP addresses of the nodes responsible for that key. Each node has a 160-bit identifier, and Chord designates the s nodes whose identifiers immediately follow a key as responsible for that key; these are the key's successors. To provide reliable lookup even if half of the nodes fail in a 2^16-node network, the number of successors, s, is 16 in the Chord implementation. The ID space wraps around, so that zero immediately follows 2^160 − 1.
The base Chord lookup algorithm (which will be modified in subsequent sections) works as follows. Each Chord node maintains a finger table, consisting of the IP addresses and IDs of nodes that follow it at power-of-two distances in the identifier space. Each node also maintains a successor list referring to its s immediate successors. When a node originates a lookup, it consults a sequence of other nodes, asking each in turn which node to talk to next. Each node in this sequence answers with the node from its finger table with the highest ID still less than the desired key. The originating node will find the key's predecessor node after O(log N) consultations; it then asks the predecessor for its successor list, which is the result of the lookup. This style of lookup is called iterative, since the originating node controls each step of the lookup. All of the communication uses UDP RPCs.

Figure 2: An illustration of a Chord identifier ring. The tick mark denotes the position of a key in ID space. The square shows the key's successor node, and the circles show the nodes in the successor's successor list. The triangles and arrows show a lookup path. The last node before the tick mark is the key's predecessor.
Figure 2 shows a Chord ring with a key, its successor, the successor's successor list, and a lookup path; this picture is helpful to keep in mind since much of the discussion appeals to the ring geometry. Although this paper explores optimizations over base Chord, we believe that these optimizations also apply to other DHTs that route in ID spaces using an O(log N) protocol.
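
To make the lookup structure concrete, here is a minimal, self-contained Python sketch of an iterative lookup over a toy in-memory ring. It is an illustration only, not the DHash++ sources: the tiny 8-bit ID space, the Node class, and the helper names are all invented for this example (real Chord uses 160-bit IDs and UDP RPCs between separate machines).

    BITS = 8              # toy ID space; Chord uses 160-bit IDs
    MOD = 1 << BITS

    def between(x, a, b):
        """True if identifier x lies in the half-open ring interval (a, b]."""
        return (a < x <= b) if a < b else (x > a or x <= b)

    class Node:
        def __init__(self, ident, all_ids):
            self.id = ident
            self.all = sorted(all_ids)
            # finger[i] points at the first node following id + 2^i on the ring
            self.fingers = [self._succ_of((ident + (1 << i)) % MOD) for i in range(BITS)]

        def _succ_of(self, point):
            return min(self.all, key=lambda n: (n - point) % MOD)

        def successor(self):
            return self._succ_of((self.id + 1) % MOD)

        def closest_preceding(self, key):
            # Answer a consultation: the finger with the highest ID still preceding the key.
            for f in reversed(self.fingers):
                if between(f, self.id, (key - 1) % MOD):
                    return f
            return self.id

    def iterative_lookup(nodes, origin_id, key):
        cur, hops = nodes[origin_id], 0
        while not between(key, cur.id, cur.successor()):
            cur = nodes[cur.closest_preceding(key)]   # the originator issues each RPC itself
            hops += 1
        # cur is the key's predecessor; the key's successor(s) are the lookup result.
        return cur.successor(), hops

    ids = [5, 20, 60, 100, 140, 180, 220, 250]
    nodes = {i: Node(i, ids) for i in ids}
    print(iterative_lookup(nodes, 5, 130))   # -> (140, 1)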
`
2.2 DHash++

DHash++ stores key/value pairs (called blocks) on a set of servers. The DHash++ client API consists of put(value) → key and get(key) → value. DHash++ calculates the key to be the SHA-1 hash of the value, and uses Chord to decide which server should store a given block; each server runs both Chord and DHash++ software. As well as finding and moving data for client applications, DHash++ authenticates the data and moves it from server to server as nodes join, leave, and fail [7].
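
A toy sketch of the put/get interface just described (illustration only; it uses a single in-memory dictionary in place of the DHash++ servers and ignores lookup, coding, and replication):

    import hashlib

    store = {}   # stands in for the set of DHash++ servers

    def block_key(value: bytes) -> bytes:
        # DHash++ names a block by the SHA-1 hash of its contents (a 160-bit key).
        return hashlib.sha1(value).digest()

    def put(value: bytes) -> bytes:
        key = block_key(value)
        store[key] = value
        return key

    def get(key: bytes) -> bytes:
        value = store[key]
        assert block_key(value) == key   # content-hash keys let the reader authenticate the data
        return value

    k = put(b"hello, dht")
    print(get(k))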
`
2.3 Synthetic coordinates

Many of the techniques described in this paper use synthetic coordinates to predict inter-node latencies without having to perform an explicit measurement to determine the latency. A number of synthetic coordinate systems have been proposed [10, 24, 27, 30, 33, 39]. We chose to use Vivaldi [12], because its algorithm is decentralized, which makes it suitable for use in peer-to-peer systems. Furthermore, the Vivaldi algorithm is lightweight, since it can piggy-back on DHash++'s communication patterns to compute coordinates.
Whenever one Chord or DHash++ node communicates directly with another, they exchange Vivaldi coordinates. Nodes store these coordinates along with IP addresses in routing tables and successor lists. The result of a lookup for a key carries the coordinates of the nodes responsible for the key as well as their IP addresses. Thus the requesting node can predict the latencies to each of the responsible nodes without having to first communicate with them.
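
A simplified sketch of how such coordinates are used: the predicted round-trip time between two nodes is the distance between their coordinates, and each measurement nudges a node's own coordinate to reduce prediction error. Vivaldi [12] defines the actual algorithm (including adaptive timesteps and per-node error estimates); the two-dimensional coordinates and fixed step size below are illustrative assumptions.

    import math

    def predicted_rtt(a, b):
        # Predicted round-trip time (ms) is the distance between the two coordinates.
        return math.dist(a, b)

    def vivaldi_step(my_coord, peer_coord, measured_rtt, step=0.05):
        """Nudge our coordinate toward or away from the peer to shrink prediction error."""
        d = predicted_rtt(my_coord, peer_coord) or 1e-6
        err = measured_rtt - d
        # unit vector from the peer toward us; move along it in proportion to the error
        ux = (my_coord[0] - peer_coord[0]) / d
        uy = (my_coord[1] - peer_coord[1]) / d
        return (my_coord[0] + step * err * ux, my_coord[1] + step * err * uy)

    me, peer = (0.0, 0.0), (30.0, 40.0)      # predicted RTT = 50 ms
    me = vivaldi_step(me, peer, measured_rtt=80.0)
    print(predicted_rtt(me, peer))           # prediction creeps toward 80 ms over repeated samples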
`
3 Evaluation methods

The results in this paper are obtained through simulations and measurements on the PlanetLab and RON test-beds. The measurements focus on DHT operations that require low latency or high throughput.
`
3.1 Evaluation infrastructure

DHT performance depends on the detailed behavior of the servers and the underlying network. The test-bed measurements in Section 4 were taken from a DHash++ implementation deployed on the PlanetLab and RON test-beds. 180 test-bed hosts were used, of which 150 were in the United States and 30 elsewhere. 105 of the hosts are on the Internet2 network; the rest have connections via DSL, cable modem, commercial T1 service, or are at co-location centers. Each host runs three independent DHash++ processes, or virtual nodes, in order to improve load balance and to ensure that the total number of nodes is large compared to the size of the Chord successor list. The measurements in Section 5 were taken on the 27-node RON test-bed alone.
The test-bed measurements are augmented with simulation results to explore large configurations, to allow easy testing of alternate designs, and to allow analytic explanations of behavior in a controlled environment. The simulated network models only packet delay. One input to the simulator is a full matrix of the round-trip delays between each pair of simulated hosts. This approach avoids having to simulate the Internet's topology, a currently open area of research; it requires only the measurement of actual pair-wise delays among a set of hosts. The simulator can produce useful speed-of-light delay results, but cannot be used to predict throughput or queuing delay.
The simulator's delay matrix is derived from Internet measurements using techniques similar to those described by Gummadi et al. [15]. The measurements involved 2048 DNS servers found with inverse DNS lookups on a trace of over 20,000 Gnutella clients. For each pair of these servers, a measuring node sends a query to one server that requires it to contact the other server. Subtracting the delay between the measuring node and the first server from the total delay yields the delay between the two servers. In order to reduce the effects of queuing delay, the minimum delay from five experiments is used. In this paper the results are called the King data-set. All the simulations in this paper involve 2048 DHT nodes using the King delay matrix unless otherwise mentioned. Figure 3 shows the CDF of the King data-set round-trip times; the median is 134 milliseconds, while the average is 154 milliseconds. The graph also shows the minimum delay of five pings between each pair of PlanetLab hosts for comparison. The main difference between the two curves is the longer tail on the King distribution, which is likely caused by the larger sample of nodes.

Figure 3: Round-trip latency distribution over all pairs of PlanetLab and King dataset hosts. The median and average King dataset latencies are 134 and 154 milliseconds respectively. The median and average PlanetLab latencies are 76 and 90 milliseconds respectively.
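
A toy illustration of the delay derivation described above, with made-up measurements; the real data set applies this subtraction to DNS query timings following Gummadi et al. [15].

    # Made-up timings (ms); the minimum of five trials damps queuing delay.
    trials_measurer_to_a = [32.1, 30.8, 31.0, 30.9, 33.5]        # direct query to server A
    trials_measurer_via_a_to_b = [95.4, 92.2, 91.8, 93.0, 96.1]  # query that A must forward to B

    d_ma = min(trials_measurer_to_a)
    d_mab = min(trials_measurer_via_a_to_b)
    d_ab = d_mab - d_ma      # estimated round-trip delay between servers A and B
    print(round(d_ab, 1))    # 61.0 ms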
`
3.2 Application workload

The design of a DHT must incorporate assumptions about probable application behavior, and a DHT evaluation must also involve either applications or models of application behavior. The application aspects that most affect performance are the mix of read and write operations, the degree to which operations can be pipelined, and the size of the data records.

DHash++ is designed to support read-heavy applications that demand low-latency and high-throughput reads as well as reasonably high-throughput writes. Examples of such applications might include the Semantic Free Referencing system (SFR) [45] and UsenetDHT [40].

SFR is a naming system designed to replace the use of DNS as a content location system. SFR uses a DHT to store small data records representing name bindings.
`
`
`
Reads are frequent and should complete with low latency. Writes are relatively infrequent and thus need not be as high performance. SFR data blocks are likely to be on the order of hundreds of bytes.

UsenetDHT is a service aiming to reduce the total storage dedicated to Usenet by storing all Usenet articles in a shared DHT. UsenetDHT splits large binary articles (averaging 100 KB) into small blocks for load balance, but smaller text articles (typically 5 KB or less) are stored as single blocks. While readership patterns vary, UsenetDHT must support low-latency single-article reads, as well as high-throughput pipelined article fetches.

These systems are unlikely to be deployed on high-churn networks: these systems are all server-class. The target environment for them is a network with relatively reliable nodes that have good Internet access.
`
4 Designing for low latency

This section investigates five design choices that affect DHT get latency. The naive algorithm against which these choices are judged, called base DHash++, operates as follows. Each 8192-byte block is stored as 14 1171-byte erasure-coded fragments, any seven of which are sufficient to reconstruct the block, using the IDA coding algorithm [31]. The 14 fragments are stored at the 14 immediate successors of the block's key. When an application calls get(key), the originating node performs an iterative Chord lookup, which ends when the key's predecessor node returns the key's 16 successors; the originating node then sends seven parallel requests to the first seven successors asking them each to return one fragment.
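
The control flow of this base get path can be sketched as follows (hypothetical Python; the fragment-fetch RPC and the IDA decoder [31] are stubbed out, so only the lookup-then-seven-parallel-requests structure is meant to be faithful):

    from concurrent.futures import ThreadPoolExecutor

    M = 7   # any 7 of the 14 stored fragments suffice (7-out-of-14 IDA coding)

    def fetch_fragment(server, key):
        # stand-in for the RPC that returns one 1171-byte coded fragment
        return ("fragment of %s from %s" % (key, server)).encode()

    def ida_decode(fragments):
        # stand-in for IDA reconstruction of the 8192-byte block [31]
        assert len(fragments) >= M
        return b"<reconstructed 8192-byte block>"

    def get(key, lookup):
        successors = lookup(key)     # the key's 16 successors, returned by its predecessor
        targets = successors[:M]     # base design: ask the first seven successors
        with ThreadPoolExecutor(max_workers=M) as pool:
            fragments = list(pool.map(lambda s: fetch_fragment(s, key), targets))
        return ida_decode(fragments)

    print(get("some-key", lookup=lambda k: ["node%d" % i for i in range(16)]))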
Figure 4 gives a preview of the results of this section. Each pair of bars shows the median time to fetch a block on the PlanetLab test-bed after cumulatively applying each design improvement. The design improvements shown are recursive rather than iterative routing, proximity neighbor selection, fetching of data from the closest copy, and integration of lookup routing and data fetching. These design improvements together reduce the total fetch latency by nearly a factor of two.
This paper uses a log(N) protocol for routing lookups. An optimization that isn't explored in this paper is an increase in the base to reduce the number of hops, or the use of a constant-hop protocol. These optimizations would reduce latency under low churn, because each node would know about many other nodes. On the other hand, in high-churn networks, these optimizations might require more bandwidth to keep routing tables up to date or experience more timeouts because routing tables might contain recently-failed nodes. The paper's evaluation infrastructure isn't adequate to explore this design decision in detail. We hope to explore this issue in future work. We do explore the extent to which proximity routing can reduce the impact of the number of hops on the lookup latency.

Figure 4: The cumulative effect of successive optimizations on the latency of a DHash++ data fetch. The optimizations are applied cumulatively along the x-axis: base, recursive lookup, proximity routing, server selection, and integration. Each bar shows the median time of 1,000 fetches of a randomly chosen 8192-byte data block from a randomly chosen host. The dark portion of each bar shows the lookup time, and the light portion shows the time taken to fetch the data. These data are from the implementation running on PlanetLab.
`
4.1 Data layout

The first decision to be made about where a DHT should store data is whether it should store data at all. A number of DHTs provide only a key location service, perhaps with a layer of indirection, and let each application decide where (or even whether) to store data [20, 28]. The choice is a question of appropriate functionality rather than performance, though Section 4.5 describes some performance benefits of integrating the DHT lookup and data storage functions. The approach taken by DHash++ is appropriate for applications that wish to view the DHT as a network storage system, such as our motivating examples SFR and UsenetDHT.
`
For DHTs that store data, a second layout decision is the size of the units of data to store. A DHT key could refer to a disk-sector-like block of data [13], to a complete file [38], or to an entire file system image [11]. Large values reduce the amortized cost of each DHT lookup. Small blocks spread the load of serving popular large files. For these reasons, and because some applications such as SFR require the DHT to store small blocks, DHash++ is optimized with blocks of 8 KB or less in mind.
`
A third layout decision is which server should store each block of data (or each replica or coded fragment). If a given block is likely to be read mostly by hosts in a particular geographic area, then it would make sense to store the data on DHT servers in that area. Caching is one way to achieve this kind of layout. On the other hand, geographic concentration may make the data more vulnerable to network and power failures, it may cause the load to be less evenly balanced across all nodes, and is difficult to arrange in general without application hints. At the other extreme, the DHT could distribute data uniformly at random over the available servers; this design would be reasonable if there were no predictable geographic locality in the originators of requests for the data, or if fault-tolerance were important. DHash++ uses the latter approach: a block's key is essentially random (the SHA-1 of the block's value), node IDs are random, and a block's replicas or fragments are placed at its key's successor nodes. The result is that blocks (and load) are uniformly spread over the DHT nodes, and that a block's replicas or fragments are widely scattered to avoid correlated failure.
`
Given a DHT design that stores blocks on randomly chosen servers, one can begin to form some expectations about fetch latency. The lower bound on the total time to find and fetch a block is the round-trip time from the originator to the nearest replica of the block, or the time to the most distant of the closest set of fragments required to reconstruct the block. For the typical block this time is determined by the distribution of inter-host delays in the Internet, and by the number of choices of replicas or fragments. The DHT lookup required to find the replicas or fragments will add to this lower bound, as will mistakes in predicting which replicas or fragments are closest.
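
A small worked example of this lower bound, using made-up round-trip times from the originator to the nodes holding a block: with replication the bound is the RTT to the nearest copy, while with m-of-l coding it is the m-th smallest RTT among the fragment holders.

    # Made-up RTTs (ms) from the originator to the key's successors holding the data.
    rtts = [22, 35, 41, 70, 88, 95, 110, 130, 150, 160, 180, 210, 250, 300]

    # Replication: only the nearest copy matters (e.g., 3 replicas on the first 3 successors).
    replication_bound = min(rtts[:3])

    # 7-of-14 coding: the fetch completes when the 7th-closest fragment arrives.
    m = 7
    coding_bound = sorted(rtts)[m - 1]

    print(replication_bound, coding_bound)   # 22 ms vs 110 ms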
Most of the design choices described in subsequent subsections have to do with taking intelligent advantage of these choices in order to reduce lookup and data fetch latency.
`
4.2 Recursive or iterative?

The base Chord and Kademlia algorithms are iterative: the originator sends an RPC to each successive node in the lookup path, and waits for the response before proceeding [25, 42]. Another possibility is recursive lookup [6, 47]: each node in the lookup path directly forwards the query to the next node, and when the query reaches the key's predecessor, the predecessor sends its successor list directly back to the originator [42]. Recursive lookup, which many DHTs use, might eliminate half the latency of each hop since each intermediate node can immediately forward the lookup before acknowledging the previous hop.
Figure 5 shows the effect of using recursive rather than iterative lookup in the simulator with the 2048-node King data set. For each technique, 20,000 lookups were performed, each from a random host for a random key. The average number of hops is 6.3. Recursive lookup takes on average 0.6 times as long as iterative. This decrease is not quite the expected factor of two: the difference is due to the extra one-way hop of (on average) 77 milliseconds to return the result to the originator.

Figure 5: The cumulative distributions of lookup time for Chord with recursive and iterative lookup. The recursive median and average are 461 and 489 milliseconds; the iterative median and average are 720 and 822 milliseconds. The numbers are from simulations.
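
A back-of-the-envelope check of this factor (a simple model, not the simulation): if each of the 6.3 hops covers an average-length path, iterative lookup pays a full round trip per consulted node, while recursive lookup pays roughly a one-way delay per forwarded hop plus one final one-way reply.

    hops = 6.3      # average number of lookup hops reported above
    rtt = 154.0     # average King round-trip time in ms (Section 3.1)

    iterative = hops * rtt                   # the originator waits a full RTT per consulted hop
    recursive = hops * (rtt / 2) + rtt / 2   # one-way delay per forwarded hop, plus the reply

    print(round(iterative), round(recursive), round(recursive / iterative, 2))
    # -> roughly 970 vs 562 ms, a ratio of about 0.58, close to the measured 0.6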
While recursive lookup has lower latency than iterative, iterative is much easier for a client to manage. If a recursive lookup elicits no response, the originator has no information about what went wrong and how to re-try in a way that is more likely to succeed. Sometimes a simple re-try may work, as in the case of lost packets. If the problem is that each successive node can talk to the next node, but that Internet routing anomalies prevent the last node from replying to the originator, then re-tries won't work because only the originator realizes a problem exists. In contrast, the originator knows which hop of an iterative lookup failed to respond, and can re-try that hop through a different node in the same region of the identifier space. On the other hand, recursive communication may make congestion control easier (that is, it may make it more feasible to rely on TCP). We will show in Section 5 that the performance of a naive TCP transport can be quite poor.

DHash++ uses recursive lookups by default since they are faster, but falls back on iterative lookups after persistent failures.
`
4.3 Proximity neighbor selection

Many DHTs decrease lookup latency by choosing nearby nodes as routing table entries [6, 16, 25, 42, 43, 47], a technique often called proximity neighbor selection (PNS). The reason this is possible is that there are usually few constraints in the choice of routing entries: any node in the relevant portion of the identifier space is eligible.
A DHT design must include an algorithm to search for nearby nodes; an exhaustive search may improve lookup latency, but also consumes network resources. This subsection builds on the work of Gummadi et al. [16] in two ways: it explains why PNS approximates 1.5 times the average round-trip time in the underlying network and shows that this result holds regardless of the number of DHT nodes (and thus regardless of the number of hops).

Figure 6: Average lookup latency as a function of the number of PNS samples. The bar at each x value shows the 10th, average, and 90th percentile of the latencies observed by 20,000 recursive lookups of random keys from random nodes using PNS(x). The measurements are from the simulator with 2048 nodes.

Figure 7: Average lookup latency of PNS(16) and PNS(N) as a function of the number of nodes in the system, N. The simulated network sizes consist of 128, 256, 512, 1024, and 2048 nodes.
Following Gummadi et al. [16], define PNS(x) as follows. The ith Chord finger table entry of the node with ID a properly refers to the first node in the ID-space range a + 2^i to a + 2^(i+1) − 1. The PNS(x) algorithm considers up to the first x nodes in that range (there may be fewer than x), and routes lookups through the node with lowest latency. Ideal PNS refers to PNS(x) with x equal to the total number of nodes, so that every finger table entry points to the lowest-latency node in the entire allowed ID-space range. The simulator simply chooses the lowest-latency of the x nodes, while the real implementation asks each proper finger entry for its successor list and uses Vivaldi to select the closest node. This means that the real implementation requires that x ≤ s (the number of successors).
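
A small sketch of the PNS(x) choice just defined (hypothetical data structures; in the real implementation the candidates come from a proper finger's successor list and the latencies are Vivaldi predictions):

    def pns_finger(candidates_in_range, predicted_latency, x):
        """candidates_in_range: nodes ordered by ID within the finger's allowed range."""
        sample = candidates_in_range[:x]          # there may be fewer than x candidates
        if not sample:
            return None
        return min(sample, key=predicted_latency) # keep the lowest-latency candidate

    # Example: five candidate nodes with predicted RTTs in milliseconds.
    latency = {"a": 120, "b": 45, "c": 200, "d": 30, "e": 90}
    print(pns_finger(list("abcde"), latency.get, x=16))   # 'd' (30 ms)
    print(pns_finger(list("abcde"), latency.get, x=1))    # 'a' (PNS(1): no choice at all)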
What is a suitable value for x in PNS(x)? Figure 6 shows the simulated effect of varying x on lookup latency. For each x value, 20,000 lookups were issued by randomly selected hosts for random keys. Each lookup is recursive, goes to the key's predecessor node (but not successor), and then directly back to the originator. The graph plots the median, 10th percentile, and 90th percentile of latency.

Figure 6 shows that PNS(1) has a simulated average latency of 489 ms, PNS(16) has an average latency of 224 ms, and PNS(2048) has an average latency of 201 ms. The latter is ideal PNS, since the neighbor choice is over all nodes in the simulation. PNS(16) comes relatively close to the ideal, and is convenient to implement in the real system with successor lists.
Why does ideal PNS show the particular improvement that it does? The return trip from the predecessor to the originator has the same median as the one-way delay distribution of the nodes in the network, δ. For the King data set, δ = 67 ms. The last hop (to the predecessor) has only one candidate, so its median latency is also δ. Each preceding hop has twice as many candidate nodes to choose from on average, since the finger-table interval involved is twice as large in ID space. So the second-to-last hop is the smaller of two randomly chosen latencies, the third-to-last is the smallest of four, etc. The minimum of x samples has its median at the 1 − 0.5^(1/x) percentile of the original distribution, which is approximately proportional to 1/x for large x. Doubling the sample size x will halve the percentile of the best sample. Assuming a uniform latency distribution, doubling the sample size halves the best sampled latency. Therefore, the latencies incurred at successive lookup hops with ideal PNS can be approximated by a geometric series with the final lookup hop to the key's predecessor being the longest hop. The lookup process includes an additional final hop to the originator. If we use the per-hop median latency as a gross approximation of the average per-hop latency, the total average lookup latency is thus approximated as: δ + (δ + δ/2 + δ/4 + ...) = δ + 2δ = 3δ. For the King data set, this gives 201 ms. This is coincidentally the ideal PNS simulation result of 201 ms.
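
A quick numeric check of this approximation (the halving-per-hop behavior is the gross approximation described above, not an exact model):

    delta = 67.0                    # median one-way delay of the King data set (ms)

    hops = [delta / (2 ** i) for i in range(20)]   # delta + delta/2 + delta/4 + ...
    lookup_path = sum(hops)                        # converges to 2 * delta
    total = delta + lookup_path                    # plus the return trip to the originator

    print(round(lookup_path, 1), round(total, 1))  # ~134.0 and ~201.0 ms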
`
The fact that the average lookup latency of PNS(N) can be approximated as an infinite geometric series whose sum converges quickly suggests that, despite the fact that the number of lookup hops scales as log(N), the total average lookup latency will stay close to 3δ. Figure 7 shows the simulated average lookup latency as a function of the number of nodes in the system. As we can see, there is indeed little increase in average lookup latency as the network grows.

Why are there diminishing returns in Figure 6 beyond roughly PNS(16)? First, the King delay distribution is not uniform, but has a flat toe. Thus increasing the number of samples produces smaller and smaller decreases in minimum latency. Figure 8 shows this effect for various sample sizes. Second, for large x, the number of samples is often limited by the allowed ID-space range for the finger in question, rather than by x; this effect is more important in the later hops of a lookup.

Figure 8: The median of the minimum latency taken from x samples out of the all-pairs empirical latency distribution of the King dataset. The boxes correspond to 2, 4, 8, 16, and 32 samples starting from the right.

One lesson from this analysis is that the last few hops of a lookup dominate the total latency. As a lookup gets close to the target key in ID space, the number of remaining nodes that are closer in ID space to the key decreases, and thus the latency to the nearest one increases on average. Section 4.5 shows how to avoid this problem.

4.4 Coding versus replication

Once the node originating a fetch acquires the key's predecessor's successor list, it knows which nodes hold the block's replicas [13, 38] or fragments of an erasure-coded block [8, 3, 22, 19]. In the case of replication, the originator's strategy should be to fetch the required data from the successor with lowest latency. The originator has more options in the case of coded fragments, but a reasonable approach is to fetch the minimum required number of fragments from the closest successors. The technique of fetching the data from the nearest of a set of candidate nodes is typically called server selection.

The design choice here can be framed as choosing the coding parameters l and m, where l is the total number of fragments stored on successors and m is the number required to reconstruct the block. Replication is the special case in which m = 1, and l is the number of replicas. The rate of coding, r = l/m, expresses the amount of redundancy. A replication scheme with three replicas has m = 1, l = 3, and r = 3, while a 7-out-of-14 IDA coding scheme has m = 7, l = 14, and r = 2.

The choice of parameters m and l has three main effects. First, it determines a block's availability when nodes fail [46]. If the probability that any given DHT node is available is p0, the probability that a block is still available is [4]:

    p_avail = Σ_{i=m}^{l} (l choose i) · p0^i · (1 − p0)^(l−i)    (1)
`
Second, increasing r is likely to decrease fetch latency, since that provides the originator more choices from which to pick a nearby node. Third, increasing r increases the amount of communication required to write a block to the DHT. These performance aspects of erasure coding have not been considered previously.
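
A small sketch that evaluates Equation (1) for the two example configurations given above (three replicas versus 7-out-of-14 IDA coding), assuming each node is available with probability p0 = 0.9 as in the simulations discussed below:

    # A block is available if at least m of its l fragments (or replicas, when m = 1)
    # reside on currently-available nodes.
    from math import comb

    def p_avail(m, l, p0):
        return sum(comb(l, i) * p0**i * (1 - p0)**(l - i) for i in range(m, l + 1))

    print(p_avail(1, 3, 0.9))    # 3 replicas (r = 3): ~0.999
    print(p_avail(7, 14, 0.9))   # 7-of-14 IDA coding (r = 2): ~0.99998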
Figure 9 illustrates the relationship between total fetch latency and block availability. The probability p0 that each node is available is kept constant at 0.9. Each line represents a different rate r, and the points on the line are obtained by varying m and setting l = r × m. Each point's x-axis value indicates the probability that a block is available as calculated by Equation 1. Each point's y-axis value is the average latency from 20,000 simulations of fetching a random block from a random originating node. The originator performs a lookup to obtain the list of the desired key's successors, then issues parallel RPCs to the m of those successors that have lowest latency, and waits for the last of the RPCs to complete. The y-axis values include only the data fetch time.
The left-most point on each line corresponds to replication; that point on the different lines corresponds to 2, 3, and 4 replicas. For each line, the points farther to the right indicate coding schemes in which smaller-sized fragments are placed onto larger numbers of nodes. For each redundancy rate r, replication provides the lowest latency by a small margin. The reaso