USENIX Association

Proceedings of the First Symposium on
Networked Systems Design and Implementation

San Francisco, CA, USA
March 29-31, 2004

THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

For more information about the USENIX Association:
Phone: 1 510 528 8649
FAX: 1 510 548 5738
Email: office@usenix.org
WWW: http://www.usenix.org

© 2004 by The USENIX Association. All Rights Reserved.
Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.
Designing a DHT for low latency and high throughput

Frank Dabek, Jinyang Li, Emil Sit, James Robertson, M. Frans Kaashoek, Robert Morris*
MIT Computer Science and Artificial Intelligence Laboratory
{fdabek, jinyang, sit, jsr, kaashoek, rtm}@csail.mit.edu

*This research was conducted as part of the IRIS project (http://project-iris.net/), supported by the National Science Foundation under Cooperative Agreement No. ANI-0225660.

Abstract
Designing a wide-area distributed hash table (DHT) that provides high-throughput and low-latency network storage is a challenge. Existing systems have explored a range of solutions, including iterative routing, recursive routing, proximity routing and neighbor selection, erasure coding, replication, and server selection.

This paper explores the design of these techniques and their interaction in a complete system, drawing on the measured performance of a new DHT implementation and results from a simulator with an accurate Internet latency model. New techniques that resulted from this exploration include use of latency predictions based on synthetic coordinates, efficient integration of lookup routing and data fetching, and a congestion control mechanism suitable for fetching data striped over large numbers of servers.

Measurements with 425 server instances running on 150 PlanetLab and RON hosts show that the latency optimizations reduce the time required to locate and fetch data by a factor of two. The throughput optimizations result in a sustainable bulk read throughput related to the number of DHT hosts times the capacity of the slowest access link; with 150 selected PlanetLab hosts, the peak aggregate throughput over multiple clients is 12.8 megabytes per second.
1 Introduction
The Internet has transformed communication for distributed applications: each new system need not implement its own network, but can simply assume a shared global communication infrastructure. A similar transformation might be possible for storage, allowing distributed applications to assume a shared global storage infrastructure. Such an infrastructure would have to name and find data, assure high availability, balance load across available servers, and move data with high throughput and low latency.

Distributed hash tables (DHTs) are a promising path towards a global storage infrastructure, and have been used as the basis for a variety of wide-area file and content publishing systems [13, 26, 34, 38]. Good performance, however, is a challenge: the DHT nodes holding the data may be far away in the network, may have access link capacities that vary by orders of magnitude, and may experience varying degrees of congestion and packet loss.
This paper explores design choices for DHT read and write algorithms. Existing work has investigated how to make the lookup of keys in DHTs scalable, low-latency, fault-tolerant, and secure, but less attention has been paid to the efficiency and robustness with which DHTs read and store data. This paper considers a range of design options for efficient data handling in the context of a single DHT, DHash++. The decisions are evaluated in simulation and in an implementation of DHash++ on the PlanetLab [29] and RON [2] test-beds.

To bound the discussion of design decisions, we have made a number of assumptions. First, we assume that all nodes cooperate; the algorithms for reading and writing are likely to be more expensive if they have to defend against malicious nodes. Second, we assume that lookups are routed using one of the O(log N)-style schemes, instead of using the recently proposed O(1) schemes [14, 17, 18, 44]. Finally, we assume that the DHT stores small blocks (on the order of 8192 bytes). Relaxing these assumptions will result in different DHT designs with different latency and throughput properties, which we hope to explore in the future.
The paper makes the following contributions. Recursive lookups take about 0.6 times as long as iterative; the reason why the reduction is not a factor of two is the cost of the final return trip. The latency of the last few hops in a lookup acts as a lower bound on the performance of Proximity Neighbor Selection [37, 16], which approximates 1.5 times the average round-trip time in the underlying network. This result holds regardless of the number of DHT nodes (and thus regardless of the number of hops). Replicated data allows for low-latency reads because there are many choices for server selection, while erasure-coded data reduces bandwidth consumption for writes at the expense of increased read latency. Integration of key lookup and data fetch reduces the lower bound imposed by the last few lookup hops. Finally, using an integrated transport protocol rather than TCP provides opportunities for efficiency in alternate routing after timeouts and allows the DHT freedom to efficiently contact many nodes.
Figure 1: DHash++ system overview. (Component labels visible in the figure: DHash++, DHT, Chord, Vivaldi, STP.)
The rest of this paper is structured as follows. Section 2 outlines the complete system that surrounds the specific mechanisms detailed in the paper. Section 3 describes the methods behind the paper's measurements and quantitative evaluations. Section 4 discusses design decisions that affect latency, and Section 5 discusses throughput. Section 6 describes related work. We conclude in Section 7.
2 Background

For concreteness, this paper evaluates design decisions in the context of a complete DHT called DHash++. This section describes the parts of DHash++ that are needed to understand the rest of the paper.
2.1 Chord

DHash++ uses the Chord lookup algorithm to help it find data [42]. Chord provides a function Lookup(key) → set-of-IP, which maps a 160-bit key to the set of IP addresses of the nodes responsible for that key. Each node has a 160-bit identifier, and Chord designates the s nodes whose identifiers immediately follow a key as responsible for that key; these are the key's successors. To provide reliable lookup even if half of the nodes fail in a 2^10-node network, the number of successors, s, is 16 in the Chord implementation. The ID space wraps around, so that zero immediately follows 2^160 − 1.

The base Chord lookup algorithm (which will be modified in subsequent sections) works as follows. Each Chord node maintains a finger table, consisting of the IP addresses and IDs of nodes that follow it at power-of-two distances in the identifier space. Each node also maintains a successor list referring to its s immediate successors. When a node originates a lookup, it consults a sequence of other nodes, asking each in turn which node to talk to next. Each node in this sequence answers with the node from its finger table with highest ID still less than the desired key. The originating node will find the key's predecessor node after O(log N) consultations; it then asks the predecessor for its successor list, which is the result of the lookup. This style of lookup is called iterative, since the originating node controls each step of the lookup. All of the communication uses UDP RPCs.

Figure 2: An illustration of a Chord identifier ring. The tick mark denotes the position of a key in ID space. The square shows the key's successor node, and the circles show the nodes in the successor's successor list. The triangles and arrows show a lookup path. The last node before the tick mark is the key's predecessor.

Figure 2 shows a Chord ring with a key, its successor, the successor's successor list, and a lookup path; this picture is helpful to keep in mind since much of the discussion appeals to the ring geometry. Although this paper explores optimizations over base Chord, we believe that these optimizations also apply to other DHTs that route in ID spaces using an O(log N) protocol.
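To make the base lookup concrete, the following sketch builds a small in-memory Chord ring and performs an iterative lookup over it. It is an illustration only, not the authors' implementation: the `make_id` helper, the `between` interval test, the `build_ring` construction, and the toy key are all assumptions, and real Chord issues these queries as UDP RPCs rather than local method calls.

```python
import hashlib

ID_BITS = 160
MOD = 1 << ID_BITS

def make_id(text: str) -> int:
    # 160-bit identifiers, as in Chord; hashing a string stands in for
    # hashing a block's value or a node's address.
    return int.from_bytes(hashlib.sha1(text.encode()).digest(), "big")

def between(x, a, b):
    """True if x lies in the half-open ring interval (a, b], wrapping at 2^160."""
    a, b, x = a % MOD, b % MOD, x % MOD
    return (a < x <= b) if a < b else (x > a or x <= b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.fingers = []          # nodes following self at power-of-two distances
        self.successor_list = []   # the s immediate successors

    def closest_preceding(self, key):
        """Answer a lookup query: the finger with the highest ID still less
        than the desired key (a UDP RPC in the real system)."""
        for f in reversed(self.fingers):
            if between(f.id, self.id, key - 1):
                return f
        return self

def build_ring(ids, s=16):
    """Construct successor lists and finger tables for an in-memory ring."""
    nodes = sorted((Node(i) for i in ids), key=lambda n: n.id)
    count = len(nodes)
    for idx, node in enumerate(nodes):
        node.successor_list = [nodes[(idx + k) % count] for k in range(1, s + 1)]
        for i in range(ID_BITS):
            target = (node.id + (1 << i)) % MOD
            # finger i: the first node whose ID follows node.id + 2^i
            node.fingers.append(min(nodes, key=lambda m: (m.id - target) % MOD))
    return nodes

def iterative_lookup(origin, key):
    """The originator consults a sequence of nodes; when it reaches the key's
    predecessor, that node's successor list is the lookup result."""
    current = origin
    while not between(key, current.id, current.successor_list[0].id):
        current = current.closest_preceding(key)
    return current.successor_list

# Toy run: 64 nodes, look up an arbitrary key.
ring = build_ring([make_id(f"node-{i}") for i in range(64)])
successors = iterative_lookup(ring[0], make_id("some block key"))
```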
2.2 DHash++

DHash++ stores key/value pairs (called blocks) on a set of servers. The DHash++ client API consists of put(value) → key and get(key) → value. DHash++ calculates the key to be the SHA-1 hash of the value, and uses Chord to decide which server should store a given block; each server runs both Chord and DHash++ software. As well as finding and moving data for client applications, DHash++ authenticates the data and moves it from server to server as nodes join, leave, and fail [7].
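The put/get interface above is simple enough to sketch. The following is a minimal illustration (not the DHash++ code) of a content-hashed client API; `lookup`, `store`, and `fetch` are assumed stand-ins for Chord's lookup and the DHT's storage RPCs, and erasure coding, replica maintenance, and failure handling are omitted.

```python
import hashlib

def block_key(value: bytes) -> bytes:
    # A block's key is the SHA-1 hash of its value, so any fetched block can
    # be authenticated by re-hashing it and comparing against the key.
    return hashlib.sha1(value).digest()

class DHashClient:
    def __init__(self, lookup, store, fetch):
        # lookup(key) -> list of responsible servers (from Chord)
        # store(server, key, value); fetch(server, key) -> value or None
        self.lookup, self.store, self.fetch = lookup, store, fetch

    def put(self, value: bytes) -> bytes:
        key = block_key(value)
        for server in self.lookup(key):       # the key's successor nodes
            self.store(server, key, value)    # fragments, in the real system
        return key

    def get(self, key: bytes) -> bytes:
        for server in self.lookup(key):
            value = self.fetch(server, key)
            if value is not None and block_key(value) == key:
                return value
        raise KeyError("block not available")
```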
2.3 Synthetic coordinates

Many of the techniques described in this paper use synthetic coordinates to predict inter-node latencies without having to perform an explicit measurement to determine the latency. A number of synthetic coordinate systems have been proposed [10, 24, 27, 30, 33, 39]. We chose to use Vivaldi [12], because its algorithm is decentralized, which makes it suitable for use in peer-to-peer systems. Furthermore, the Vivaldi algorithm is lightweight, since it can piggy-back on DHash++'s communication patterns to compute coordinates.
Whenever one Chord or DHash++ node communicates directly with another, they exchange Vivaldi coordinates. Nodes store these coordinates along with IP addresses in routing tables and successor lists. The result of a lookup for a key carries the coordinates of the nodes responsible for the key as well as their IP addresses. Thus the requesting node can predict the latencies to each of the responsible nodes without having to first communicate with them.
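As an illustration of how these stored coordinates are used, the sketch below ranks a key's successors by predicted round-trip time without contacting them. Plain Euclidean coordinates are assumed here for simplicity (Vivaldi's coordinate-update rule is not shown), and the entry format and example values are hypothetical.

```python
import math

def predicted_rtt_ms(a, b):
    """Predicted round-trip time between two synthetic coordinate vectors."""
    return math.dist(a, b)

def rank_by_predicted_latency(my_coords, entries):
    """Order routing-table or successor-list entries, each carrying the
    (ip, coords) learned during earlier communication, by predicted RTT."""
    return sorted(entries, key=lambda e: predicted_rtt_ms(my_coords, e["coords"]))

# Hypothetical successor list returned by a lookup, with piggy-backed coordinates.
successors = [
    {"ip": "198.51.100.7", "coords": (12.0, -3.5, 40.2)},
    {"ip": "203.0.113.9",  "coords": (80.1, 22.0, -5.7)},
    {"ip": "192.0.2.44",   "coords": (15.3, 1.0, 37.9)},
]
closest_first = rank_by_predicted_latency((10.0, 0.0, 35.0), successors)
```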
3 Evaluation methods

The results in this paper are obtained through simulations and measurements on the PlanetLab and RON test-beds. The measurements focus on DHT operations that require low latency or high throughput.

3.1 Evaluation infrastructure

DHT performance depends on the detailed behavior of the servers and the underlying network. The test-bed measurements in Section 4 were taken from a DHash++ implementation deployed on the PlanetLab and RON test-beds. 180 test-bed hosts were used, of which 150 were in the United States and 30 elsewhere. 105 of the hosts are on the Internet2 network; the rest have connections via DSL, cable modem, commercial T1 service, or are at co-location centers. Each host runs three independent DHash++ processes, or virtual nodes, in order to improve load balance and to ensure that the total number of nodes is large compared to the size of the Chord successor list. The measurements in Section 5 were taken on the 27-node RON test-bed alone.
The test-bed measurements are augmented with simulation results to explore large configurations, to allow easy testing of alternate designs, and to allow analytic explanations of behavior in a controlled environment. The simulated network models only packet delay. One input to the simulator is a full matrix of the round-trip delays between each pair of simulated hosts. This approach avoids having to simulate the Internet's topology, a currently open area of research; it requires only the measurement of actual pair-wise delays among a set of hosts. The simulator can produce useful speed-of-light delay results, but cannot be used to predict throughput or queuing delay.
The simulator's delay matrix is derived from Internet measurements using techniques similar to those described by Gummadi et al. [15]. The measurements involved 2048 DNS servers found with inverse DNS lookups on a trace of over 20,000 Gnutella clients. For each pair of these servers, a measuring node sends a query to one server that requires it to contact the other server. Subtracting the delay between the measuring node and the first server from the total delay yields the delay between the two servers. In order to reduce the effects of queuing delay, the minimum delay from five experiments is used. In this paper the results are called the King data-set. All the simulations in this paper involve 2048 DHT nodes using the King delay matrix unless otherwise mentioned. Figure 3 shows the CDF of the King data-set round-trip times; the median is 134 milliseconds, while the average is 154 milliseconds. The graph also shows the minimum delay of five pings between each pair of PlanetLab hosts for comparison. The main difference between the two curves is the longer tail on the King distribution, which is likely caused by the larger sample of nodes.

Figure 3: Round-trip latency distribution over all pairs of PlanetLab and King dataset hosts. The median and average King dataset latencies are 134 and 154 milliseconds respectively. The median and average PlanetLab latencies are 76 and 90 milliseconds respectively.
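The King-style estimate used to build the delay matrix can be sketched as below. The two timing helpers are assumptions standing in for the actual DNS measurements: one times a query that forces server a to contact server b, the other times a query answered by a alone.

```python
def king_delay_ms(a, b, timed_indirect_query, timed_direct_query, trials=5):
    """Estimate the RTT between DNS servers a and b from a measuring node."""
    estimates = []
    for _ in range(trials):
        total = timed_indirect_query(a, b)   # measurer -> a -> b -> a -> measurer
        direct = timed_direct_query(a)       # measurer -> a -> measurer
        estimates.append(total - direct)     # leaves the a <-> b round trip
    # The minimum over several trials suppresses queuing delay.
    return min(estimates)
```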
3.2 Application workload

The design of a DHT must incorporate assumptions about probable application behavior, and a DHT evaluation must also involve either applications or models of application behavior. The application aspects that most affect performance are the mix of read and write operations, the degree to which operations can be pipelined, and the size of the data records.
DHash++ is designed to support read-heavy applications that demand low-latency and high-throughput reads as well as reasonably high-throughput writes. Examples of such applications might include the Semantic Free Referencing system (SFR) [45] and UsenetDHT [40].

SFR is a naming system designed to replace the use of DNS as a content location system. SFR uses a DHT to store small data records representing name bindings. Reads are frequent and should complete with low latency. Writes are relatively infrequent and thus need not be as high performance. SFR data blocks are likely to be on the order of hundreds of bytes.

UsenetDHT is a service aiming to reduce the total storage dedicated to Usenet by storing all Usenet articles in a shared DHT. UsenetDHT splits large binary articles (averaging 100 KB) into small blocks for load balance, but smaller text articles (typically 5 KB or less) are stored as single blocks. While readership patterns vary, UsenetDHT must support low-latency single article reads, as well as high-throughput pipelined article fetches.

These systems are unlikely to be deployed on high-churn networks—these systems are all server-class. The target environment for them is a network with relatively reliable nodes that have good Internet access.
4 Designing for low latency

This section investigates five design choices that affect DHT get latency. The naive algorithm against which these choices are judged, called base DHash++, operates as follows. Each 8192-byte block is stored as 14 1171-byte erasure-coded fragments, any seven of which are sufficient to reconstruct the block, using the IDA coding algorithm [31]. The 14 fragments are stored at the 14 immediate successors of the block's key. When an application calls get(key), the originating node performs an iterative Chord lookup, which ends when the key's predecessor node returns the key's 16 successors; the originating node then sends seven parallel requests to the first seven successors, asking them each to return one fragment.
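The base fetch path can be summarized as the sketch below. It is illustrative only: `iterative_lookup`, `fetch_fragment`, and `ida_decode` are assumed stand-ins for the Chord lookup, the fragment-fetch RPC, and the IDA decoder; the requests are shown sequentially although the real system issues them in parallel, and continuing past the first seven successors on failure is an assumption, not part of the base algorithm as stated.

```python
FRAGMENTS_TOTAL = 14    # fragments stored on the key's 14 immediate successors
FRAGMENTS_NEEDED = 7    # any seven reconstruct the 8192-byte block

def base_get(key, iterative_lookup, fetch_fragment, ida_decode):
    successors = iterative_lookup(key)        # the key's 16 successors
    fragments = []
    # Base DHash++ asks the first seven successors, ignoring proximity.
    for server in successors[:FRAGMENTS_TOTAL]:
        if len(fragments) == FRAGMENTS_NEEDED:
            break
        fragment = fetch_fragment(server, key)
        if fragment is not None:
            fragments.append(fragment)
    if len(fragments) < FRAGMENTS_NEEDED:
        raise IOError("too few fragments available to reconstruct the block")
    return ida_decode(fragments)
```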
Figure 4 gives a preview of the results of this section. Each pair of bars shows the median time to fetch a block on the PlanetLab test-bed after cumulatively applying each design improvement. The design improvements shown are recursive rather than iterative routing, proximity neighbor selection, fetching of data from the closest copy, and integration of lookup routing and data fetching. These design improvements together reduce the total fetch latency by nearly a factor of two.

Figure 4: The cumulative effect of successive optimizations on the latency of a DHash++ data fetch. Each bar shows the median time of 1,000 fetches of a randomly chosen 8192-byte data block from a randomly chosen host. The dark portion of each bar shows the lookup time, and the light portion shows the time taken to fetch the data. These data are from the implementation running on PlanetLab.

This paper uses a log(N) protocol for routing lookups. An optimization that isn't explored in this paper is an increase in the base to reduce the number of hops, or the use of a constant-hop protocol. These optimizations would reduce latency under low churn, because each node would know about many other nodes. On the other hand, in high-churn networks, these optimizations might require more bandwidth to keep routing tables up to date or experience more timeouts because routing tables might contain recently-failed nodes. The paper's evaluation infrastructure isn't adequate to explore this design decision in detail. We hope to explore this issue in future work. We do explore the extent to which proximity routing can reduce the impact of the number of hops on the lookup latency.
4.1 Data layout

The first decision to be made about where a DHT should store data is whether it should store data at all. A number of DHTs provide only a key location service, perhaps with a layer of indirection, and let each application decide where (or even whether) to store data [20, 28]. The choice is a question of appropriate functionality rather than performance, though Section 4.5 describes some performance benefits of integrating the DHT lookup and data storage functions. The approach taken by DHash++ is appropriate for applications that wish to view the DHT as a network storage system, such as our motivating examples SFR and UsenetDHT.

For DHTs that store data, a second layout decision is the size of the units of data to store. A DHT key could refer to a disk-sector-like block of data [13], to a complete file [38], or to an entire file system image [11]. Large values reduce the amortized cost of each DHT lookup. Small blocks spread the load of serving popular large files. For these reasons, and because some applications such as SFR require the DHT to store small blocks, DHash++ is optimized with blocks of 8 KB or less in mind.
A third layout decision is which server should store each block of data (or each replica or coded fragment). If a given block is likely to be read mostly by hosts in a particular geographic area, then it would make sense to store the data on DHT servers in that area. Caching is one way to achieve this kind of layout. On the other hand, geographic concentration may make the data more vulnerable to network and power failures, it may cause the load to be less evenly balanced across all nodes, and is difficult to arrange in general without application hints. At the other extreme, the DHT could distribute data uniformly at random over the available servers; this design would be reasonable if there were no predictable geographic locality in the originators of requests for the data, or if fault-tolerance were important. DHash++ uses the latter approach: a block's key is essentially random (the SHA-1 of the block's value), node IDs are random, and a block's replicas or fragments are placed at its key's successor nodes. The result is that blocks (and load) are uniformly spread over the DHT nodes, and that a block's replicas or fragments are widely scattered to avoid correlated failure.
Given a DHT design that stores blocks on randomly chosen servers, one can begin to form some expectations about fetch latency. The lower bound on the total time to find and fetch a block is the round-trip time from the originator to the nearest replica of the block, or the time to the most distant of the closest set of fragments required to reconstruct the block. For the typical block this time is determined by the distribution of inter-host delays in the Internet, and by the number of choices of replicas or fragments. The DHT lookup required to find the replicas or fragments will add to this lower bound, as will mistakes in predicting which replica or fragments are closest.

Most of the design choices described in subsequent subsections have to do with taking intelligent advantage of choices in order to reduce lookup and data fetch latency.
4.2 Recursive or iterative?

The base Chord and Kademlia algorithms are iterative: the originator sends an RPC to each successive node in the lookup path, and waits for the response before proceeding [25, 42]. Another possibility is recursive lookup [6, 47]: each node in the lookup path directly forwards the query to the next node, and when the query reaches the key's predecessor, the predecessor sends its successor list directly back to the originator [42]. Recursive lookup, which many DHTs use, might eliminate half the latency of each hop since each intermediate node can immediately forward the lookup before acknowledging the previous hop.
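For contrast with the iterative sketch in Section 2.1, a recursive lookup can be sketched as a per-node handler: each hop forwards the query toward the key, and only the key's predecessor replies, directly to the originator. This is a simplified illustration; `send` is an assumed one-way message primitive, and the `between` test and node fields are those of the earlier sketch.

```python
def handle_recursive_lookup(node, key, originator, send):
    """Runs at every node a query passes through (one message per hop)."""
    successor = node.successor_list[0]
    if between(key, node.id, successor.id):
        # This node is the key's predecessor: reply straight to the originator
        # rather than unwinding back through the intermediate hops.
        send(originator, ("lookup_done", key, node.successor_list))
    else:
        next_hop = node.closest_preceding(key)
        send(next_hop, ("lookup", key, originator))   # forward; no per-hop reply
```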
Figure 5 shows the effect of using recursive rather than iterative lookup in the simulator with the 2048-node King data set. For each technique, 20,000 lookups were performed, each from a random host for a random key. The average number of hops is 6.3. Recursive lookup takes on average 0.6 times as long as iterative. This decrease is not quite the expected factor of two: the difference is due to the extra one-way hop of (on average) 77 milliseconds to return the result to the originator.

Figure 5: The cumulative distributions of lookup time for Chord with recursive and iterative lookup. The recursive median and average are 461 and 489 milliseconds; the iterative median and average are 720 and 822 milliseconds. The numbers are from simulations.
While recursive lookup has lower latency than iterative, iterative is much easier for a client to manage. If a recursive lookup elicits no response, the originator has no information about what went wrong and how to re-try in a way that is more likely to succeed. Sometimes a simple re-try may work, as in the case of lost packets. If the problem is that each successive node can talk to the next node, but that Internet routing anomalies prevent the last node from replying to the originator, then re-tries won't work because only the originator realizes a problem exists. In contrast, the originator knows which hop of an iterative lookup failed to respond, and can re-try that hop through a different node in the same region of the identifier space.

On the other hand, recursive communication may make congestion control easier (that is, it may make it more feasible to rely on TCP). We will show in Section 5 that the performance of a naive TCP transport can be quite poor.

DHash++ uses recursive lookups by default since they are faster, but falls back on iterative lookups after persistent failures.
4.3 Proximity neighbor selection

Many DHTs decrease lookup latency by choosing nearby nodes as routing table entries [6, 16, 25, 42, 43, 47], a technique often called proximity neighbor selection (PNS). The reason this is possible is that there are usually few constraints in the choice of routing entries: any node in the relevant portion of the identifier space is eligible. A DHT design must include an algorithm to search for nearby nodes; an exhaustive search may improve lookup latency, but also consumes network resources. This subsection builds on the work of Gummadi et al. [16] in two ways: it explains why PNS approximates 1.5 times the average round-trip time in the underlying network and shows that this result holds regardless of the number of DHT nodes (and thus regardless of the number of hops).

Figure 6: Average lookup latency as a function of the number of PNS samples. The bar at each x value shows the 10th, average, and 90th percentile of the latencies observed by 20,000 recursive lookups of random keys from random nodes using PNS(x). The measurements are from the simulator with 2048 nodes.

Figure 7: Average lookup latency of PNS(16) and PNS(N) as a function of the number of nodes in the system, N. The simulated network sizes consist of 128, 256, 512, 1024, and 2048 nodes.
Following Gummadi et al. [16], define PNS(x) as follows. The ith Chord finger table entry of the node with ID a properly refers to the first node in the ID-space range a + 2^i to a + 2^(i+1) − 1. The PNS(x) algorithm considers up to the first x nodes in that range (there may be fewer than x), and routes lookups through the node with lowest latency. Ideal PNS refers to PNS(x) with x equal to the total number of nodes, so that every finger table entry points to the lowest-latency node in the entire allowed ID-space range. The simulator simply chooses the lowest-latency of the x nodes, while the real implementation asks each proper finger entry for its successor list and uses Vivaldi to select the closest node. This means that the real implementation requires that x ≤ s (the number of successors).
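A minimal sketch of PNS(x) finger construction under the definition above: `candidates_per_range` is an assumed input giving, for each finger, the nodes in its allowed ID-space range in ID order, and predicted latencies come from synthetic coordinates as in Section 2.3.

```python
def pns_finger(my_coords, range_candidates, x, predicted_rtt_ms):
    """Pick one finger-table entry: among up to the first x nodes in the
    finger's allowed ID-space range, keep the one with lowest predicted latency."""
    sampled = range_candidates[:x]            # there may be fewer than x
    if not sampled:
        return None
    return min(sampled, key=lambda n: predicted_rtt_ms(my_coords, n["coords"]))

def pns_finger_table(my_coords, candidates_per_range, x, predicted_rtt_ms):
    """candidates_per_range[i]: nodes in [a + 2^i, a + 2^(i+1) - 1], in ID order."""
    return [pns_finger(my_coords, candidates, x, predicted_rtt_ms)
            for candidates in candidates_per_range]
```

With x set to the number of successors (16 in DHash++), the candidate set for each finger is simply the proper finger's successor list, matching the implementation strategy described above.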
What is a suitable value for x in PNS(x)? Figure 6 shows the simulated effect of varying x on lookup latency. For each x value, 20,000 lookups were issued by randomly selected hosts for random keys. Each lookup is recursive, goes to the key's predecessor node (but not successor), and then directly back to the originator. The graph plots the median, 10th percentile, and 90th percentile of latency.

Figure 6 shows that PNS(1) has a simulated average latency of 489 ms, PNS(16) has an average latency of 224 ms, and PNS(2048) has an average latency of 201 ms. The latter is ideal PNS, since the neighbor choice is over all nodes in the simulation. PNS(16) comes relatively close to the ideal, and is convenient to implement in the real system with successor lists.
Why does ideal PNS show the particular improvement that it does? The return trip from the predecessor to the originator has the same median as the one-way delay distribution of the nodes in the network, δ. For the King data set, δ = 67 ms. The last hop (to the predecessor) has only one candidate, so its median latency is also δ. Each preceding hop has twice as many candidate nodes to choose from on average, since the finger-table interval involved is twice as large in ID space. So the second-to-last hop is the smaller of two randomly chosen latencies, the third-to-last is the smallest of four, etc. The minimum of x samples has its median at the 1 − 0.5^(1/x) percentile of the original distribution, which for large x is approximately proportional to 1/x. Doubling the sample size x will halve the percentile of the best sample. Assuming a uniform latency distribution, doubling the sample size halves the best sampled latency. Therefore, the latencies incurred at successive lookup hops with ideal PNS can be approximated by a geometric series with the final lookup hop to the key's predecessor being the longest hop. The lookup process includes an additional final hop to the originator. If we use the per-hop median latency as a gross approximation of the average per-hop latency, the total average lookup latency is thus approximated as: δ + (δ + δ/2 + δ/4 + ...) = δ + 2δ = 3δ. For the King data set, this gives 3 × 67 = 201 ms; this coincidentally matches the ideal PNS simulation result of 201 ms.
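The order-statistic step in this argument is easy to check numerically. The short snippet below (an illustration only, not the paper's simulator) prints the percentile at which the median of the minimum of x samples falls; for large x it tends to (ln 2)/x, so doubling x roughly halves it.

```python
import math

# Median of the minimum of x samples falls at the (1 - 0.5**(1/x)) percentile
# of the original distribution; for large x this approaches (ln 2)/x.
for x in (1, 2, 4, 8, 16, 32, 64):
    exact = 1 - 0.5 ** (1 / x)
    approx = math.log(2) / x
    print(f"x={x:3d}  exact={exact:.4f}  ln2/x={approx:.4f}")
```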
The fact that the average lookup latency of PNS(N) can be approximated as an infinite geometric series whose sum converges quickly suggests that, despite the fact that the number of lookup hops scales as log(N), the total average lookup latency will stay close to 3δ.
Figure 8: The median of the minimum latency taken from x samples out of the all-pairs empirical latency distribution of the King dataset. The boxes correspond to 2, 4, 8, 16, and 32 samples, starting from the right.
Figure 7 shows the simulated average lookup latency as a function of the number of nodes in the system. As we can see, there is indeed little increase in average lookup latency as the network grows.
Why are there diminishing returns in Figure 6 beyond roughly PNS(16)? First, the King delay distribution is not uniform, but has a flat toe. Thus increasing the number of samples produces smaller and smaller decreases in minimum latency. Figure 8 shows this effect for various sample sizes. Second, for large x, the number of samples is often limited by the allowed ID-space range for the finger in question, rather than by x; this effect is more important in the later hops of a lookup.

One lesson from this analysis is that the last few hops of a lookup dominate the total latency. As a lookup gets close to the target key in ID space, the number of remaining nodes that are closer in ID space to the key decreases, and thus the latency to the nearest one increases on average. Section 4.5 shows how to avoid this problem.
4.4 Coding versus replication

Once the node originating a fetch acquires the key's predecessor's successor list, it knows which nodes hold the block's replicas [13, 38] or fragments of an erasure-coded block [8, 3, 22, 19]. In the case of replication, the originator's strategy should be to fetch the required data from the successor with lowest latency. The originator has more options in the case of coded fragments, but a reasonable approach is to fetch the minimum required number of fragments from the closest successors. The technique of fetching the data from the nearest of a set of candidate nodes is typically called server selection.

The design choice here can be framed as choosing the coding parameters l and m, where l is the total number of fragments stored on successors and m is the number required to reconstruct the block. Replication is the special case in which m = 1, and l is the number of replicas. The rate of coding, r = l/m, expresses the amount of redundancy. A replication scheme with three replicas has m = 1, l = 3, and r = 3, while a 7-out-of-14 IDA coding scheme has m = 7, l = 14, and r = 2.

The choice of parameters m and l has three main effects. First, it determines a block's availability when nodes fail [46]. If the probability that any given DHT node is available is p_0, the probability that a block is still available is [4]:

p_avail = \sum_{i=m}^{l} \binom{l}{i} p_0^i (1 - p_0)^{l-i}    (1)
Second, increasing r is likely to decrease fetch latency, since that provides the originator more choices from which to pick a nearby node. Third, increasing r increases the amount of communication required to write a block to the DHT. These performance aspects of erasure coding have not been considered previously.
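Equation (1) is straightforward to evaluate; the snippet below compares the two example schemes from the text at the node availability used in Figure 9 (p_0 = 0.9). This is only an illustrative check of the formula, not part of the paper's evaluation.

```python
from math import comb

def block_availability(l, m, p0):
    """Equation (1): probability that at least m of the l fragments are held
    by available nodes."""
    return sum(comb(l, i) * p0**i * (1 - p0)**(l - i) for i in range(m, l + 1))

p0 = 0.9
print(block_availability(3, 1, p0))    # three replicas (m=1, l=3, r=3): ~0.9990
print(block_availability(14, 7, p0))   # 7-out-of-14 IDA (m=7, l=14, r=2): ~0.99998
```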
Figure 9 illustrates the relationship between total fetch latency and block availability. The probability p_0 that each node is available is kept constant at 0.9. Each line represents a different rate r, and the points on the line are obtained by varying m and setting l = r × m. Each point's x-axis value indicates the probability that a block is available as calculated by Equation 1. Each point's y-axis value is the average latency from 20,000 simulations of fetching a random block from a random originating node. The originator performs a lookup to obtain the list of the desired key's successors, then issues parallel RPCs to the m of those successors that have lowest latency, and waits for the last of the RPCs to complete. The y-axis values include only the data fetch time.

The left-most point on each line corresponds to replication; that point on the different lines corresponds to 2, 3, and 4 replicas. For each line, the points farther to the right indicate coding schemes in which smaller-sized fragments are placed onto larger numbers of nodes. For each redundancy rate r, replication provides the lowest latency by a small margin. The reaso
