USENIX Association

Proceedings of the First Symposium on Networked Systems Design and Implementation

San Francisco, CA, USA
March 29-31, 2004

USENIX: The Advanced Computing Systems Association

© 2004 by The USENIX Association. All Rights Reserved.
For more information about the USENIX Association:
Phone: 1 510 528 8649    FAX: 1 510 548 5738
Email: office@usenix.org    WWW: http://www.usenix.org
Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.
Designing a DHT for low latency and high throughput

Frank Dabek, Jinyang Li, Emil Sit, James Robertson, M. Frans Kaashoek, Robert Morris*
MIT Computer Science and Artificial Intelligence Laboratory
{fdabek, jinyang, sit, jsr, kaashoek, rtm}@csail.mit.edu
Abstract

Designing a wide-area distributed hash table (DHT) that provides high-throughput and low-latency network storage is a challenge. Existing systems have explored a range of solutions, including iterative routing, recursive routing, proximity routing and neighbor selection, erasure coding, replication, and server selection.

This paper explores the design of these techniques and their interaction in a complete system, drawing on the measured performance of a new DHT implementation and results from a simulator with an accurate Internet latency model. New techniques that resulted from this exploration include use of latency predictions based on synthetic coordinates, efficient integration of lookup routing and data fetching, and a congestion control mechanism suitable for fetching data striped over large numbers of servers.

Measurements with 425 server instances running on 150 PlanetLab and RON hosts show that the latency optimizations reduce the time required to locate and fetch data by a factor of two. The throughput optimizations result in a sustainable bulk read throughput related to the number of DHT hosts times the capacity of the slowest access link; with 150 selected PlanetLab hosts, the peak aggregate throughput over multiple clients is 12.8 megabytes per second.
1 Introduction
The Internet has transformed communication for distributed applications: each new system need not implement its own network, but can simply assume a shared global communication infrastructure. A similar transformation might be possible for storage, allowing distributed applications to assume a shared global storage infrastructure. Such an infrastructure would have to name and find data, assure high availability, balance load across available servers, and move data with high throughput and low latency.

Distributed hash tables (DHTs) are a promising path towards a global storage infrastructure, and have been used as the basis for a variety of wide-area file and content publishing systems [13, 26, 34, 38]. Good performance, however, is a challenge: the DHT nodes holding the data may be far away in the network, may have access link capacities that vary by orders of magnitude, and may experience varying degrees of congestion and packet loss.

*This research was conducted as part of the IRIS project (http://project-iris.net/), supported by the National Science Foundation under Cooperative Agreement No. ANI-0225660.
This paper explores design choices for DHT read and write algorithms. Existing work has investigated how to make the lookup of keys in DHTs scalable, low-latency, fault-tolerant, and secure, but less attention has been paid to the efficiency and robustness with which DHTs read and store data. This paper considers a range of design options for efficient data handling in the context of a single DHT, DHash++. The decisions are evaluated in simulation and in an implementation of DHash++ on the PlanetLab [29] and RON [2] test-beds.

To bound the discussion of design decisions, we have made a number of assumptions. First, we assume that all nodes cooperate; the algorithms for reading and writing are likely to be more expensive if they have to defend against malicious nodes. Second, we assume that lookups are routed using one of the O(log N)-style schemes, instead of using the recently proposed O(1) schemes [14, 17, 18, 44]. Finally, we assume that the DHT stores small blocks (on the order of 8192 bytes). Relaxing these assumptions will result in different DHT designs with different latency and throughput properties, which we hope to explore in the future.
The paper makes the following contributions. Recursive lookups take about 0.6 times as long as iterative; the reason why the reduction is not a factor of two is the cost of the final return trip. The latency of the last few hops in a lookup acts as a lower bound on the performance of Proximity Neighbor Selection [37, 16], which approximates 1.5 times the average round trip time in the underlying network. This result holds regardless of the number of DHT nodes (and thus regardless of the number of hops). Replicated data allows for low-latency reads because there are many choices for server selection, while erasure-coded data reduces bandwidth consumption for writes at the expense of increased read latency. Integration of key lookup and data fetch reduces the lower bound imposed by the last few lookup hops. Finally, using an integrated transport protocol rather than TCP provides opportunities for efficiency in alternate routing after timeouts and allows the DHT freedom to efficiently contact many nodes.
Figure 1: DHash++ system overview.
The rest of this paper is structured as follows. Section 2 outlines the complete system that surrounds the specific mechanisms detailed in the paper. Section 3 describes the methods behind the paper's measurements and quantitative evaluations. Section 4 discusses design decisions that affect latency, and Section 5 discusses throughput. Section 6 describes related work. We conclude in Section 7.
2 Background

For concreteness, this paper evaluates design decisions in the context of a complete DHT called DHash++. This section describes the parts of DHash++ that are needed to understand the rest of the paper.
2.1 Chord

DHash++ uses the Chord lookup algorithm to help it find data [42]. Chord provides a function lookup(key) -> set-of-IP, which maps a 160-bit key to the set of IP addresses of the nodes responsible for that key. Each node has a 160-bit identifier, and Chord designates the s nodes whose identifiers immediately follow a key as responsible for that key; these are the key's successors. To provide reliable lookup even if half of the nodes fail in a 2^16-node network, the number of successors, s, is 16 in the Chord implementation. The ID space wraps around, so that zero immediately follows 2^160 - 1.
The base Chord lookup algorithm (which will be modified in subsequent sections) works as follows. Each Chord node maintains a finger table, consisting of the IP addresses and IDs of nodes that follow it at power-of-two distances in the identifier space. Each node also maintains a successor list referring to its s immediate successors. When a node originates a lookup, it consults a sequence of other nodes, asking each in turn which node to talk to next. Each node in this sequence answers with the node from its finger table with highest ID still less than the desired key. The originating node will find the key's predecessor node after O(log N) consultations; it then asks the predecessor for its successor list, which is the result of the lookup. This style of lookup is called iterative, since the originating node controls each step of the lookup. All of the communication uses UDP RPCs.

Figure 2: An illustration of a Chord identifier ring. The tick mark denotes the position of a key in ID space. The square shows the key's successor node, and the circles show the nodes in the successor's successor list. The triangles and arrows show a lookup path. The last node before the tick mark is the key's predecessor.

Figure 2 shows a Chord ring with a key, its successor, the successor's successor list, and a lookup path; this picture is helpful to keep in mind since much of the discussion appeals to the ring geometry. Although this paper explores optimizations over base Chord, we believe that these optimizations also apply to other DHTs that route in ID spaces using an O(log N) protocol.
2.2 DHash++

DHash++ stores key/value pairs (called blocks) on a set of servers. The DHash++ client API consists of key <- put(value) and get(key) -> value. DHash++ calculates the key to be the SHA-1 hash of the value, and uses Chord to decide which server should store a given block; each server runs both Chord and DHash++ software. As well as finding and moving data for client applications, DHash++ authenticates the data and moves it from server to server as nodes join, leave, and fail [7].
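A minimal sketch of this content-hashed interface, with a local dictionary standing in for the Chord-routed server set; the authentication step is simply checking that the SHA-1 of the fetched value matches the requested key:

    import hashlib

    class DHashClient:
        def __init__(self):
            self.store = {}  # stand-in for Chord routing plus remote servers

        def put(self, value: bytes) -> bytes:
            key = hashlib.sha1(value).digest()  # 160-bit content hash
            self.store[key] = value
            return key

        def get(self, key: bytes) -> bytes:
            value = self.store[key]
            # Self-certifying data: any node can verify it got the right block.
            assert hashlib.sha1(value).digest() == key
            return value

    client = DHashClient()
    k = client.put(b"a small block")
    assert client.get(k) == b"a small block"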
2.3 Synthetic coordinates

Many of the techniques described in this paper use synthetic coordinates to predict inter-node latencies without having to perform an explicit measurement to determine the latency. A number of synthetic coordinate systems have been proposed [10, 24, 27, 30, 33, 39]. We chose
to use Vivaldi [12], because its algorithm is decentralized, which makes it suitable for use in peer-to-peer systems. Furthermore, the Vivaldi algorithm is lightweight, since it can piggy-back on DHash++'s communication patterns to compute coordinates.

Whenever one Chord or DHash++ node communicates directly with another, they exchange Vivaldi coordinates. Nodes store these coordinates along with IP addresses in routing tables and successor lists. The result of a lookup for a key carries the coordinates of the nodes responsible for the key as well as their IP addresses. Thus the requesting node can predict the latencies to each of the responsible nodes without having to first communicate with them.
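For illustration, here is a sketch of the prediction step only (the Vivaldi update rule itself is a spring-relaxation algorithm, not shown): if each node carries a synthetic coordinate, the predicted latency between two nodes is the distance between their coordinates, so a successor list can be ranked without any probes. The node names and coordinates are hypothetical.

    import math

    def predicted_latency_ms(coord_a, coord_b):
        """Distance between two synthetic coordinates, in milliseconds."""
        return math.dist(coord_a, coord_b)

    # Ranking a successor list by predicted latency, with no network probes:
    successors = {"node1": (12.0, 40.5), "node2": (3.1, 8.2), "node3": (55.0, 9.9)}
    me = (5.0, 10.0)
    closest = min(successors, key=lambda n: predicted_latency_ms(me, successors[n]))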
3 Evaluation methods

The results in this paper are obtained through simulations and measurements on the PlanetLab and RON test-beds. The measurements focus on DHT operations that require low latency or high throughput.
3.1 Evaluation infrastructure

DHT performance depends on the detailed behavior of the servers and the underlying network. The test-bed measurements in Section 4 were taken from a DHash++ implementation deployed on the PlanetLab and RON test-beds. 180 test-bed hosts were used, of which 150 were in the United States and 30 elsewhere. 105 of the hosts are on the Internet2 network; the rest have connections via DSL, cable modem, commercial T1 service, or are at co-location centers. Each host runs three independent DHash++ processes, or virtual nodes, in order to improve load balance and to ensure that the total number of nodes is large compared to the size of the Chord successor list. The measurements in Section 5 were taken on the 27-node RON test-bed alone.
The test-bed measurements are augmented with simulation results to explore large configurations, to allow easy testing of alternate designs, and to allow analytic explanations of behavior in a controlled environment. The simulated network models only packet delay. One input to the simulator is a full matrix of the round-trip delays between each pair of simulated hosts. This approach avoids having to simulate the Internet's topology, a currently open area of research; it requires only the measurement of actual pair-wise delays among a set of hosts. The simulator can produce useful speed-of-light delay results, but cannot be used to predict throughput or queuing delay.

The simulator's delay matrix is derived from Internet measurements using techniques similar to those described by Gummadi et al. [15]. The measurements involved 2048 DNS servers found with inverse DNS lookups on a trace of over 20,000 Gnutella clients. For each pair of these servers, a measuring node sends a query to one server that requires it to contact the other server.
Figure 3: Round-trip latency distribution over all pairs of PlanetLab and King dataset hosts. The median and average King dataset latencies are 134 and 154 milliseconds, respectively. The median and average PlanetLab latencies are 76 and 90 milliseconds, respectively.
Subtracting the delay between the measuring node and the first server from the total delay yields the delay between the two servers. In order to reduce the effects of queuing delay, the minimum delay from five experiments is used. In this paper the results are called the King data-set. All the simulations in this paper involve 2048 DHT nodes using the King delay matrix unless otherwise mentioned. Figure 3 shows the CDF of the King data-set round-trip times; the median is 134 milliseconds, while the average is 154 milliseconds. The graph also shows the minimum delay of five pings between each pair of PlanetLab hosts for comparison. The main difference between the two curves is the longer tail on the King distribution, which is likely caused by the larger sample of nodes.
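A toy rendering of that measurement arithmetic, with measure_direct and measure_recursive as hypothetical probe functions standing in for real DNS queries:

    def estimate_pair_delay(measure_direct, measure_recursive, trials=5):
        # King-style estimate: the delay between servers A and B is the
        # recursive query time through A minus the direct time to A,
        # taking the minimum over several trials to suppress queuing delay.
        samples = []
        for _ in range(trials):
            total = measure_recursive()   # measurer -> A -> B -> A -> measurer
            to_first = measure_direct()   # measurer -> A -> measurer
            samples.append(total - to_first)
        return min(samples)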
3.2 Application workload

The design of a DHT must incorporate assumptions about probable application behavior, and a DHT evaluation must also involve either applications or models of application behavior. The application aspects that most affect performance are the mix of read and write operations, the degree to which operations can be pipelined, and the size of the data records.
DHash++ is designed to support read-heavy applications that demand low-latency and high-throughput reads as well as reasonably high-throughput writes. Examples of such applications might include the Semantic Free Referencing system (SFR) [45] and UsenetDHT [40].

SFR is a naming system designed to replace the use of DNS as a content location system. SFR uses a DHT to store small data records representing name bindings.
Reads are frequent and should complete with low latency. Writes are relatively infrequent and thus need not be as high performance. SFR data blocks are likely to be on the order of hundreds of bytes.

UsenetDHT is a service aiming to reduce the total storage dedicated to Usenet by storing all Usenet articles in a shared DHT. UsenetDHT splits large binary articles (averaging 100 KB) into small blocks for load balance, but smaller text articles (typically 5 KB or less) are stored as single blocks. While readership patterns vary, UsenetDHT must support low-latency single article reads, as well as high-throughput pipelined article fetches.

These systems are unlikely to be deployed on high-churn networks, since they are all server-class. The target environment for them is a network with relatively reliable nodes that have good Internet access.
4 Designing for low latency

This section investigates five design choices that affect DHT get latency. The naive algorithm against which these choices are judged, called base DHash++, operates as follows. Each 8192-byte block is stored as 14 1171-byte erasure-coded fragments, any seven of which are sufficient to reconstruct the block, using the IDA coding algorithm [31]. The 14 fragments are stored at the 14 immediate successors of the block's key. When an application calls get(key), the originating node performs an iterative Chord lookup, which ends when the key's predecessor node returns the key's 16 successors; the originating node then sends seven parallel requests to the first seven successors, asking them each to return one fragment.
Figure 4 gives a preview of the results of this section. Each pair of bars shows the median time to fetch a block on the PlanetLab test-bed after cumulatively applying each design improvement. The design improvements shown are recursive rather than iterative routing, proximity neighbor selection, fetching of data from the closest copy, and integration of lookup routing and data fetching. These design improvements together reduce the total fetch latency by nearly a factor of two.
This paper uses a log(N) protocol for routing lookups. An optimization that isn't explored in this paper is an increase in the base to reduce the number of hops, or the use of a constant-hop protocol. These optimizations would reduce latency under low churn, because each node would know about many other nodes. On the other hand, in high-churn networks, these optimizations might require more bandwidth to keep routing tables up to date or experience more timeouts because routing tables might contain recently-failed nodes. The paper's evaluation infrastructure isn't adequate to explore this design decision in detail. We hope to explore this issue in future work. We do explore the extent to which proximity routing can reduce the impact of the number of hops on the lookup latency.
Figure 4: The cumulative effect of successive optimizations on the latency of a DHash++ data fetch. Each bar shows the median time of 1,000 fetches of a randomly chosen 8192-byte data block from a randomly chosen host. The dark portion of each bar shows the lookup time, and the light portion shows the time taken to fetch the data. These data are from the implementation running on PlanetLab.
4.1 Data layout

The first decision to be made about where a DHT should store data is whether it should store data at all. A number of DHTs provide only a key location service, perhaps with a layer of indirection, and let each application decide where (or even whether) to store data [20, 28]. The choice is a question of appropriate functionality rather than performance, though Section 4.5 describes some performance benefits of integrating the DHT lookup and data storage functions. The approach taken by DHash++ is appropriate for applications that wish to view the DHT as a network storage system, such as our motivating examples SFR and UsenetDHT.

For DHTs that store data, a second layout decision is the size of the units of data to store. A DHT key could refer to a disk-sector-like block of data [13], to a complete file [38], or to an entire file system image [11]. Large values reduce the amortized cost of each DHT lookup. Small blocks spread the load of serving popular large files. For these reasons, and because some applications such as SFR require the DHT to store small blocks, DHash++ is optimized with blocks of 8 KB or less in mind.
A third layout decision is which server should store each block of data (or each replica or coded fragment). If a given block is likely to be read mostly by hosts in a particular geographic area, then it would make sense to store the data on DHT servers in that area. Caching is one way to achieve this kind of layout. On the other hand, geographic concentration may make the data more
vulnerable to network and power failures, it may cause the load to be less evenly balanced across all nodes, and is difficult to arrange in general without application hints. At the other extreme, the DHT could distribute data uniformly at random over the available servers; this design would be reasonable if there were no predictable geographic locality in the originators of requests for the data, or if fault-tolerance were important. DHash++ uses the latter approach: a block's key is essentially random (the SHA-1 of the block's value), node IDs are random, and a block's replicas or fragments are placed at its key's successor nodes. The result is that blocks (and load) are uniformly spread over the DHT nodes, and that a block's replicas or fragments are widely scattered to avoid correlated failure.

Given a DHT design that stores blocks on randomly chosen servers, one can begin to form some expectations about fetch latency. The lower bound on the total time to find and fetch a block is the round trip time from the originator to the nearest replica of the block, or the time to the most distant of the closest set of fragments required to reconstruct the block. For the typical block this time is determined by the distribution of inter-host delays in the Internet, and by the number of choices of replicas or fragments. The DHT lookup required to find the replicas or fragments will add to this lower bound, as will mistakes in predicting which replica or fragments are closest.

Most of the design choices described in subsequent subsections have to do with taking intelligent advantage of choices in order to reduce lookup and data fetch latency.
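As a small worked example of the lower bounds just described, under purely hypothetical RTTs to a key's successors: with replication the bound is the minimum RTT, while with a k-of-n code it is the k-th smallest, since the fetch completes only when the slowest needed fragment arrives.

    rtts_ms = [23.0, 41.0, 55.0, 80.0, 95.0, 110.0, 150.0]  # hypothetical RTTs

    replication_bound = min(rtts_ms)        # nearest of the replicas
    k = 3                                   # fragments needed (e.g., 3-of-7)
    coding_bound = sorted(rtts_ms)[k - 1]   # k-th closest fragment holder

    print(replication_bound, coding_bound)  # 23.0 vs 55.0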
4.2 Recursive or iterative?

The base Chord and Kademlia algorithms are iterative: the originator sends an RPC to each successive node in the lookup path, and waits for the response before proceeding [25, 42]. Another possibility is recursive lookup [6, 47]: each node in the lookup path directly forwards the query to the next node, and when the query reaches the key's predecessor, the predecessor sends its successor list directly back to the originator [42]. Recursive lookup, which many DHTs use, might eliminate half the latency of each hop since each intermediate node can immediately forward the lookup before acknowledging the previous hop.
Figure 5: The cumulative distributions of lookup time (in milliseconds) for Chord with recursive and iterative lookup. The recursive median and average are 461 and 489 milliseconds; the iterative median and average are 720 and 822 milliseconds. The numbers are from simulations.
Figure 5 shows the effect of using recursive rather than iterative lookup in the simulator with the 2048-node King data set. For each technique, 20,000 lookups were performed, each from a random host for a random key. The average number of hops is 6.3. Recursive lookup takes on average 0.6 times as long as iterative. This decrease is not quite the expected factor of two: the difference is due to the extra one-way hop of (on average) 77 milliseconds to return the result to the originator.

While recursive lookup has lower latency than iterative, iterative is much easier for a client to manage. If a recursive lookup elicits no response, the originator has no information about what went wrong and how to re-try in a way that is more likely to succeed. Sometimes a simple re-try may work, as in the case of lost packets. If the problem is that each successive node can talk to the next node, but Internet routing anomalies prevent the last node from replying to the originator, then re-tries won't work because only the originator realizes a problem exists. In contrast, the originator knows which hop of an iterative lookup failed to respond, and can re-try that hop through a different node in the same region of the identifier space. On the other hand, recursive communication may make congestion control easier (that is, it may make it more feasible to rely on TCP). We will show in Section 5 that the performance of a naive TCP transport can be quite poor.

DHash++ uses recursive lookups by default since they are faster, but falls back on iterative lookups after persistent failures.
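A minimal sketch of the recursive variant, reusing the hypothetical Node class from the Section 2.1 sketch and assuming a deliver() callback on the originator; each hop forwards the query onward, so it costs one one-way delay, and only the key's predecessor replies, directly to the originator:

    def recursive_lookup(node, key, originator):
        # Node and closest_preceding_finger are from the earlier sketch;
        # deliver() is an assumed callback for receiving the final result.
        nxt = node.closest_preceding_finger(key)
        if nxt is node:
            # node is the key's predecessor: reply straight to the originator
            originator.deliver(key, node.successor_list)
        else:
            recursive_lookup(nxt, key, originator)  # forward without waiting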
4.3 Proximity neighbor selection
Many DHTs decrease lookup latency by choosing nearby nodes as routing table entries [6, 16, 25, 42, 43, 47], a technique often called proximity neighbor selection (PNS). The reason this is possible is that there are usually few constraints in the choice of routing entries: any node in the relevant portion of the identifier space is eligible. A DHT design must include an algorithm to search for nearby nodes; an exhaustive search may improve lookup latency, but also consumes network resources.
Figure 6: Average lookup latency as a function of the number of PNS samples. The bar at each x value shows the 10th percentile, average, and 90th percentile of the latencies observed by 20,000 recursive lookups of random keys from random nodes using PNS(x). The measurements are from the simulator with 2048 nodes.
Figure 7: Average lookup latency of PNS(16) and PNS(N) as a function of the number of nodes in the system, N. The simulated network sizes are 128, 256, 512, 1024, and 2048 nodes.
This subsection builds on the work of Gummadi et al. [16] in two ways: it explains why PNS approximates 1.5 times the average round trip time in the underlying network and shows that this result holds regardless of the number of DHT nodes (and thus regardless of the number of hops).

Following Gummadi et al. [16], define PNS(x) as follows. The ith Chord finger table entry of the node with ID a properly refers to the first node in the ID-space range a + 2^i to a + 2^(i+1) - 1. The PNS(x) algorithm considers up to the first x nodes in that range (there may be fewer than x), and routes lookups through the node with lowest latency. Ideal PNS refers to PNS(x) with x equal to the total number of nodes, so that every finger table entry points to the lowest-latency node in the entire allowed ID-space range. The simulator simply chooses the lowest-latency of the x nodes, while the real implementation asks each proper finger entry for its successor list and uses Vivaldi to select the closest node. This means that the real implementation requires that x <= s (the number of successors).
What is a suitable value for x in PNS(x)? Figure 6 shows the simulated effect of varying x on lookup latency. For each x value, 20,000 lookups were issued by randomly selected hosts for random keys. Each lookup is recursive, goes to the key's predecessor node (but not successor), and then directly back to the originator. The graph plots the median, 10th percentile, and 90th percentile of latency.
Figure 6 shows that PNS(1) has a simulated average latency of 489 ms, PNS(16) has an average latency of 224 ms, and PNS(2048) has an average latency of 201 ms. The latter is ideal PNS, since the neighbor choice is over all nodes in the simulation. PNS(16) comes relatively close to the ideal, and is convenient to implement in the real system with successor lists.
Why does ideal PNS show the particular improvement that it does? The return trip from the predecessor to the originator has the same median as the one-way delay distribution of the nodes in the network, δ. For the King data set, δ = 67 ms. The last hop (to the predecessor) has only one candidate, so its median latency is also δ. Each preceding hop has twice as many candidate nodes to choose from on average, since the finger-table interval involved is twice as large in ID space. So the second-to-last hop is the smaller of two randomly chosen latencies, the third-to-last is the smallest of four, etc. The minimum of x samples has its median at the 1 - 0.5^(1/x) percentile of the original distribution, which can be approximated as the 0.7/x percentile for large x. Doubling the sample size x will halve the percentile of the best sample. Assuming a uniform latency distribution, doubling the sample size halves the best sampled latency. Therefore, the latencies incurred at successive lookup hops with ideal PNS can be approximated by a geometric series with the final lookup hop to the key's predecessor being the longest hop. The lookup process includes an additional final hop to the originator. If we use the per-hop median latency as a gross approximation of the average per-hop latency, the total average lookup latency is thus approximated as: δ + (δ + δ/2 + δ/4 + ...) = δ + 2δ = 3δ. For the King data set, this gives 201 ms. This is coincidentally the ideal PNS simulation result of 201 ms.
The fact that the average lookup latency of PNS(N) can be approximated as an infinite geometric series whose sum converges quickly suggests that, despite the fact that the number of lookup hops scales as log(N), the total average lookup latency will stay close to 3δ. Figure 7 shows the simulated average lookup latency as a function of the number of nodes in the system. As we can see, there is indeed little increase in average lookup latency as the network grows.
Figure 8: The median of the minimum latency taken from x samples out of the all-pairs empirical latency distribution of the King dataset. The boxes correspond to 2, 4, 8, 16, and 32 samples, starting from the right.
Why are there diminishing returns in Figure 6 beyond roughly PNS(16)? First, the King delay distribution is not uniform, but has a flat toe. Thus increasing the number of samples produces smaller and smaller decreases in minimum latency. Figure 8 shows this effect for various sample sizes. Second, for large x, the number of samples is often limited by the allowed ID-space range for the finger in question, rather than by x; this effect is more important in the later hops of a lookup.

One lesson from this analysis is that the last few hops of a lookup dominate the total latency. As a lookup gets close to the target key in ID space, the number of remaining nodes that are closer in ID space to the key decreases, and thus the latency to the nearest one increases on average. Section 4.5 shows how to avoid this problem.
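The min-of-x effect behind this analysis is easy to reproduce. The sketch below draws the minimum of x samples from a synthetic delay distribution (an exponential scaled so the median is near the King data set's 67 ms one-way delay, purely an assumption for illustration) and shows the per-hop median roughly halving per doubling of x before flattening:

    import random

    random.seed(1)
    # 2048 synthetic one-way delays; mean ~97 ms gives a median near 67 ms.
    delays = [random.expovariate(1 / 97.0) for _ in range(2048)]

    def median_min_of(x, trials=5000):
        # Median over many trials of the minimum of x sampled delays.
        return sorted(min(random.sample(delays, x))
                      for _ in range(trials))[trials // 2]

    for x in [1, 2, 4, 8, 16, 32, 64]:
        print(x, round(median_min_of(x), 1))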
4.4 Coding versus replication

Once the node originating a fetch acquires the key's predecessor's successor list, it knows which nodes hold the block's replicas [13, 38] or fragments of an erasure-coded block [8, 3, 22, 19]. In the case of replication, the originator's strategy should be to fetch the required data from the successor with lowest latency. The originator has more options in the case of coded fragments, but a reasonable approach is to fetch the minimum required number of
fragments from the closest successors. The technique of fetching from the nearest candidate nodes in this way is a form of server selection.
Second, increasing r is likely to decrease fetch latency, since that provides the originator more choices from which to pick a nearby node. Third, increasing r increases the amount of communication required to write a block to the DHT. These performance aspects of erasure coding have not been considered previously.

Figure 9 illustrates the relationship between total fetch latency and block availability. The probability p0 that each node is available is kept constant at 0.9. Each line represents a different rate r, and the points on the line are obtained by varying m and setting l = r × m. Each point's x-axis value indicates the probability that a block is available as calculated by Equation 1. Each point's y-axis value is the average latency from 20,000 simulations of fetching a random block from a random originating node. The originator performs a lookup to obtain the list of the desired key's successors, then issues parallel RPCs to the m of those successors that have lowest latency, and waits for the last of the RPCs to complete. The y-axis values include only the data fetch time.
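Equation 1 itself does not survive in this copy of the paper; the standard block-availability formula consistent with the surrounding description, which we assume here, says a block is available if at least m of its l stored fragments sit on nodes that are each independently available with probability p0:

    from math import comb

    def block_availability(l: int, m: int, p0: float = 0.9) -> float:
        """P(at least m of l fragments available), fragments independent."""
        return sum(comb(l, i) * p0**i * (1 - p0)**(l - i)
                   for i in range(m, l + 1))

    # Replication is the m = 1 case: e.g., 2 replicas vs. a 7-of-14 code,
    # both with redundancy rate r = 2:
    print(block_availability(2, 1))    # ~0.99
    print(block_availability(14, 7))   # considerably closer to 1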
The left-most point on each line corresponds to replication; that point on the different lines corresponds to 2, 3, and 4 replicas. For each line, the points farther to the right indicate coding schemes in which smaller-sized fragments are placed onto larger numbers of nodes. For each redundancy rate r, replication provides the lowest latency by a small margin. The reason is easiest to see for r = 2: choosing the nearest k of 2k fragments approaches the median as k grows, while choosing the nearest replica of two yields a latency considerably below the median. Replication also provides the least availability because the redundant information is spread over fewer nodes. The lower lines correspond to larger amounts of redundant information on more nodes; this pr
