Efficient Replica Maintenance for Distributed Storage Systems

Byung-Gon Chun,† Frank Dabek,* Andreas Haeberlen,‡ Emil Sit,* Hakim Weatherspoon,†
M. Frans Kaashoek,* John Kubiatowicz,† and Robert Morris*

* MIT Computer Science and Artificial Intelligence Laboratory,
‡ Rice University/MPI-SWS, † University of California, Berkeley
Abstract

This paper considers replication strategies for storage systems that aggregate the disks of many nodes spread over the Internet. Maintaining replication in such systems can be prohibitively expensive, since every transient network or host failure could potentially lead to copying a server's worth of data over the Internet to maintain replication levels.

The following insights in designing an efficient replication algorithm emerge from the paper's analysis. First, durability can be provided separately from availability; the former is less expensive to ensure and a more useful goal for many wide-area applications. Second, the focus of a durability algorithm must be to create new copies of data objects faster than permanent disk failures destroy the objects; careful choice of policies for what nodes should hold what data can decrease repair time. Third, increasing the number of replicas of each data object does not help a system tolerate a higher disk failure probability, but does help tolerate bursts of failures. Finally, ensuring that the system makes use of replicas that recover after temporary failure is critical to efficiency.

Based on these insights, the paper proposes the Carbonite replication algorithm for keeping data durable at a low cost. A simulation of Carbonite storing 1 TB of data over a 365-day trace of PlanetLab activity shows that Carbonite is able to keep all data durable and uses 44% more network traffic than a hypothetical system that only responds to permanent failures. In comparison, Total Recall and DHash require almost a factor of two more network traffic than this hypothetical system.
1 Introduction

Wide-area distributed storage systems typically use replication to provide two related properties: durability and availability. Durability means that objects that an application has put into the system are not lost due to disk failure, whereas availability means that get will be able to return the object promptly. Objects can be durably stored but not immediately available: if the only copy of an object is on the disk of a node that is currently powered off, but will someday re-join the system with disk contents intact, then that object is durable but not currently available. The paper's goal is to develop an algorithm to store immutable objects durably and at a low bandwidth cost in a system that aggregates the disks of many Internet nodes.

[Footnote: This research was supported by the National Science Foundation under Cooperative Agreement No. ANI-0225660, http://project-iris.net/. Andreas Haeberlen was supported in part by the Max Planck Society. Emil Sit was supported in part by the Cambridge-MIT Institute. Hakim Weatherspoon was supported by an Intel Foundation PhD Fellowship.]
The threat to durability is losing the last copy of an object due to permanent failures of disks. Efficiently countering this threat to durability involves three main challenges. First, network bandwidth is a scarce resource in a wide-area distributed storage system. To store objects durably, there must be enough network capacity to create copies of objects faster than they are lost due to disk failure. Second, a system cannot always distinguish between transient failures and permanent disk failures: it may waste network bandwidth by creating new copies during transient failures. Third, after recovery from transient failures, some replicas may be on nodes that the replica lookup algorithm does not query and are thus effectively lost.

Since transient failures are common in wide-area systems, replication algorithms can waste bandwidth by making unneeded replicas. For example, the initial replication algorithm [6] used by the DHash distributed hash table (DHT) [9] turned out to be inadequate for building storage applications such as UsenetDHT [34], Antiquity [11], and OverCite [35, 36].

A problem with DHash was that its design was driven by the goal of achieving 100% availability; this decision caused it to waste bandwidth by creating new replicas in response to temporary failures. Its design and similar ones (such as Total Recall [3]) are overkill for durability. Furthermore, users of many Internet applications can tolerate some unavailability. For example, Usenet readers will see all articles eventually, as long as they are stored durably. Our experience with these DHT applications has led us to the following insights:
• Durability is a more practical and useful goal than availability for applications that store objects (as opposed to caching objects).
• The main goal of a durability algorithm should be to create new copies of an object faster than they are destroyed by disk failures; the choice of how replicas are distributed among nodes can make this task easier.

• Increasing the replication level does not help tolerate a higher average permanent failure rate, but it does help cope with bursts of failures.

• Reintegrating returning replicas is key to avoiding unnecessary copying.
Using these insights we have developed Carbonite, an efficient wide-area replication algorithm for keeping objects durable. After inserting a set of initial replicas, Carbonite begins by creating new replicas mostly in response to transient failures. However, over time it is increasingly able to ignore transient failures and approaches the goal of only producing replicas in response to permanent failures.

Carbonite's design assumes that the disks in the distributed storage system fail independently of each other: failures of geographically distributed hard drives from different manufacturers are likely to be uncorrelated. In a year-long PlanetLab failure trace, however, we observe some correlated failures because of coordinated reinstalls of the PlanetLab software. Despite this, an evaluation using the PlanetLab failure trace shows that Carbonite is able to keep 1 TB of data durable, and consumes only 44% more network traffic than a hypothetical system that only responds to permanent failures. In comparison, Total Recall and DHash require almost a factor of two more network traffic than this hypothetical system.

The rest of this paper explains our durability models and algorithms, interleaving evaluation results into the explanation. Section 2 describes the simulated evaluation environment. Section 3 presents a model of the relationship between network capacity, amount of replicated data, number of replicas, and durability. Section 4 explains how to decrease repair time, and thus increase durability, by proper placement of replicas on servers. Section 5 presents an algorithm that reduces the bandwidth wasted making copies due to transient failures. Section 6 outlines some of the challenges that face practical implementations of these ideas, Section 7 discusses related work, and Section 8 concludes.
2 System environment

The behavior of a replication algorithm depends on the environment in which it is used: high disk failure rates or low network access link speeds make it difficult for any system to maintain durability. We will use the characteristics of the PlanetLab testbed as a representative environment when evaluating wide-area replication techniques.
`
`Dates
`Number of hosts
`Number of transient failures
`Number of disk failures
`Transient host downtime (s)
`Any failure interarrival (s)
`Disk failures interarrival (s)
`(Median/Mean/90th percentile)
`
`1 March 2005 – 28 Feb 2006
`632
`21255
`219
`1208, 104647, 14242
`305, 1467, 3306
`54411, 143476, 490047
`
`Table 1: CoMon+PLC trace characteristics.
`
`For explanatory purposes, we will also use a synthetic
`trace that makes some of the underlying trends more vis-
`ible. This section describes both environments, as well as
`the simulator we used to evaluate our algorithm.
2.1 PlanetLab characteristics

PlanetLab is a large (> 600 node) research testbed [28] with nodes located around the world. We chose this testbed as our representative environment mainly because it is a large, distributed collection of machines that has been monitored for long periods; we use this monitoring data to construct a realistic trace of failures in a mostly managed environment.

The main characteristics of PlanetLab that interest us are the rates of disk and transient failures. We use historical data collected by the CoMon project [25] to identify transient failures. CoMon has archival records collected on average every 5 minutes that include the uptime as reported by the system uptime counter on each node. We use resets of this counter to detect reboots, and we estimate the time when the node became unreachable based on the last time CoMon was able to successfully contact the node. This allows us to pinpoint failures without depending on the reachability of the node from the CoMon monitoring site.

We define a disk failure to be any permanent loss of disk contents, due to disk hardware failure or because its contents are erased accidentally or intentionally. In order to identify disk failures, the CoMon measurements were supplemented with event logs from PlanetLab Central [28]. This database automatically records each time a PlanetLab node is reinstalled (e.g., for an upgrade, or after a disk is replaced following a failure). The machine is then considered offline until the machine is assigned a regular boot state in the database. Table 1 summarizes the statistics of this trace. Figure 7(a) visualizes how transient and disk failures accumulate over time in this network.
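The reboot-detection step described above lends itself to a compact illustration. The sketch below is our own, not the paper's measurement tooling; the sample format and field names are assumptions made for the example. It shows how uptime-counter resets in periodic polls can be turned into estimated transient-failure intervals:

```python
# Sketch (not the authors' tooling): recovering transient-failure intervals from
# periodic uptime samples, in the spirit of the CoMon-based method described above.

def transient_failures(samples):
    """samples: list of (poll_time, uptime) pairs for one node, in seconds,
    sorted by poll_time. Returns (estimated_failure_time, recovery_time) pairs."""
    failures = []
    last_ok = None            # time of the last poll that reached the node
    prev_uptime = None
    for poll_time, uptime in samples:
        if prev_uptime is not None and uptime < prev_uptime:
            # Uptime counter reset => the node rebooted between polls.
            # Estimate the failure time as the last successful contact and the
            # recovery time as the current poll time minus the new uptime.
            recovery = poll_time - uptime
            failures.append((last_ok, recovery))
        last_ok = poll_time
        prev_uptime = uptime
    return failures

# Example: a node polled every 300 s that reboots once.
trace = [(0, 1000), (300, 1300), (600, 1600), (1200, 50)]
print(transient_failures(trace))   # [(600, 1150)]
```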
2.2 Synthetic trace

We also generated synthetic traces of failures by drawing failure inter-arrival times from exponential distributions. Synthetic traces have two benefits. First, they let us simulate longer time periods, and second, they allow us to increase the failure density, which makes the basic underlying trends more visible. We conjecture that exponential inter-failure times are a good model for disks that are independently acquired and operated at geographically separated sites; exponential intervals are possibly not so well justified for transient failures due to network problems.

Each synthetic trace contains 632 nodes, just like the PlanetLab trace. The mean session time and downtime match the values shown in Table 1; however, in order to increase the failure density, we extended the length to two years and reduced the average node lifetime to one year. Each experiment was run with ten different traces; the figures show the averages from these experiments.
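As a concrete illustration of this kind of trace generation, the following minimal sketch draws exponential inter-arrival times per node. It is not the generator used in the paper, and the per-node means in the example call are purely illustrative (Table 1 reports testbed-wide interarrival times, not per-node means):

```python
import random

def synthetic_node_trace(mean_session_s, mean_downtime_s, mean_lifetime_s, length_s, seed=0):
    """Generate one node's events: alternating transient failures and recoveries drawn
    from exponential distributions, plus a permanent disk failure at an exponentially
    distributed lifetime. Times are seconds from the start of the trace."""
    rng = random.Random(seed)
    disk_failure = rng.expovariate(1.0 / mean_lifetime_s)
    events, t, up = [], 0.0, True
    while True:
        t += rng.expovariate(1.0 / (mean_session_s if up else mean_downtime_s))
        if t >= min(length_s, disk_failure):
            break
        events.append((t, "transient_failure" if up else "recovery"))
        up = not up
    if disk_failure < length_s:
        events.append((disk_failure, "disk_failure"))
    return events

# Illustrative call: ~10-day sessions, ~1-day downtimes, one-year mean lifetime,
# over a two-year trace (all values are assumptions for this sketch).
print(synthetic_node_trace(9e5, 1e5, 3.15e7, 6.3e7)[:5])
```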
2.3 Simulation

We use the failure traces to drive an event-based simulator. In the simulator, each node has unlimited disk capacity, but limited link bandwidth. However, it assumes that all network paths are independent so that there are no shared bottlenecks. Further, it assumes that if a node is available, it is reachable from all other nodes. This is occasionally not the case on PlanetLab [14]; however, techniques do exist to mask the effects of partially unreachable nodes [1].

The simulator takes as input a trace of transient and disk failure events, node repairs and object insertions. It simulates the behavior of nodes under different protocols and produces a trace of the availability of objects and the amount of data sent and stored by each node for each hour of simulated time. Each simulation calls put with 50,000 data objects, each of size 20 MB. Unless otherwise noted, each node is configured with an access link capacity of 150 KBytes/s, roughly corresponding to the throughput achievable under the bandwidth cap imposed by PlanetLab. The goal of the simulations is to show the percentage of objects lost and the amount of bandwidth needed to sustain objects over time.
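To make the shape of such a trace-driven simulation concrete, here is a minimal discrete-event skeleton. It is our own sketch rather than the simulator evaluated in the paper; the event names and handlers are placeholders:

```python
import heapq

class Simulator:
    """Minimal discrete-event skeleton: a priority queue of timestamped events,
    dispatched in time order to per-event handlers supplied by a protocol."""
    def __init__(self):
        self.queue = []          # heap of (time, seq, kind, payload)
        self.now = 0.0
        self._seq = 0            # tie-breaker so heap entries always compare

    def schedule(self, time, kind, payload=None):
        heapq.heappush(self.queue, (time, self._seq, kind, payload))
        self._seq += 1

    def run(self, handlers, until):
        while self.queue and self.queue[0][0] <= until:
            self.now, _, kind, payload = heapq.heappop(self.queue)
            handlers[kind](self, payload)

# Example: replay a tiny failure trace and count events per node.
counts = {}
def on_failure(sim, node):
    counts[node] = counts.get(node, 0) + 1

sim = Simulator()
for t, node in [(10.0, "n1"), (25.0, "n2"), (40.0, "n1")]:
    sim.schedule(t, "transient_failure", node)
sim.run({"transient_failure": on_failure}, until=3600.0)
print(counts)   # {'n1': 2, 'n2': 1}
```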
3 Understanding durability

We consider the problem of providing durability for a storage system composed of a large number of nodes spread over the Internet, each contributing disk space. The system stores a large number of independent pieces of data. Each piece of data is immutable. The system must have a way to name and locate data; the former is beyond the scope of this work, while the latter may affect the possible policies for placing replicas. While parts of the system will suffer temporary failures, such as network partitions or power failures, the focus of this section is on failures that result in permanent loss of data. Section 5 shows how to efficiently manage transient failures; this section describes some fundamental constraints and challenges in providing durability.
3.1 Challenges to durability

It is useful to view permanent disk and node failures as having an average rate and a degree of burstiness. To provide high durability, a system must be able to cope with both.

In order to handle some average rate of failure, a high-durability system must have the ability to create new replicas of objects faster than replicas are destroyed. Whether the system can do so depends on the per-node network access link speed, the number of nodes (and hence access links) that help perform each repair, and the amount of data stored on each failed node. When a node n fails, the other nodes holding replicas of the objects stored on n must generate replacements: objects will remain durable if there is sufficient bandwidth available on average for the lost replicas to be recreated. For example, in a symmetric system each node must have sufficient bandwidth to copy the equivalent of all data it stores to other nodes during its lifetime.

If nodes are unable to keep pace with the average failure rate, no replication policy can prevent objects from being lost; such systems are infeasible. An infeasible system will eventually "adapt" to the failure rate by discarding objects until it becomes feasible to store the remaining amount of data. A system designer may not have control over access link speeds and the amount of data to be stored; fortunately, the choice of object placement can improve the speed at which a system can create new replicas, as discussed in Section 4.

If the creation rate is only slightly above the average failure rate, then a burst of failures may destroy all of an object's replicas before a new replica can be made; a subsequent lull in failures below the average rate will not help replace replicas if no replicas remain. For our purposes, these failures are simultaneous: they occur closer together in time than the time required to create new replicas of the data that was stored on the failed disk. Simultaneous failures pose a constraint tighter than just meeting the average failure rate: every object must have more replicas than the largest expected burst of failures. We study systems that aim to maintain a target number of replicas in order to survive bursts of failure; we call this target rL.

Higher values of rL do not allow the system to survive a higher average failure rate. For example, if failures were to arrive at fixed intervals, then either rL = 2 would always be sufficient, or no amount of replication would ensure durability. If rL = 2 is sufficient, there will always be time to create a new replica of the objects on the most recently failed disk before their remaining replicas fail. If creating new replicas takes longer than the average time between failures, no fixed replication level will make the system feasible; setting a replication level higher than two would only increase the number of bytes each node must copy in response to failures, which is already infeasible at rL = 2.
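The average-rate condition above amounts to a simple bandwidth budget. The following sketch is our own back-of-the-envelope check, with illustrative parameters drawn from the PlanetLab-style numbers used later in Section 3.2; it tests whether a symmetric system can copy each node's data to other nodes within that node's expected lifetime:

```python
def is_feasible(stored_bytes_per_node, link_bytes_per_s, mean_disk_lifetime_s):
    """A symmetric system keeps up with the average failure rate only if each node
    can push a full copy of everything it stores within its own expected lifetime."""
    seconds_to_copy = stored_bytes_per_node / link_bytes_per_s
    return seconds_to_copy < mean_disk_lifetime_s

# Illustrative numbers: 1.5 TB stored per node, a 150 KB/s access link,
# and a 2.23-year mean disk lifetime (decimal units assumed for this sketch).
YEAR = 365 * 24 * 3600
print(is_feasible(1.5e12, 150e3, 2.23 * YEAR))   # True: ~116 days to copy < 2.23 years
```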
3.2 Creation versus failure rate

It might seem that any creation rate higher than the average failure rate will lead to an unbounded number of replicas, thus satisfying the burst constraint. However, this intuition is false. To see why, let us model the number of replicas of an object as a birth-death process using a continuous time Markov chain, which assumes independent exponential inter-failure and inter-repair times. This assumption is reasonable for independent disk failures.

An object is in state i when i disks hold a replica of the object. There are thus rL + 1 possible states, as we start with rL replicas and only create new replicas in response to failures. From a given state i, there is a transition to state i + 1 with rate µi corresponding to repair, except for state 0, which corresponds to loss of durability, and state rL, which does not need repair. The actual rate µi depends on how bandwidth is allocated to repair and may change depending on the replication level of an object. There is a transition to the next lower state i − 1 with rate iλf because each of the i nodes holding an existing replica might fail. Figure 1 shows this model for the case where rL = 3.

Figure 1: A continuous time Markov model for the process of replica failure and repair for a system that maintains three replicas (rL = 3). Numbered states correspond to the number of replicas of each object that are durable. Transitions to the left occur at the rate at which replicas are lost; right-moving transitions happen at the replica creation rate.

This model can be analyzed numerically to shed light on the impact of rL on the probability of data loss; we will show this in Section 3.3. However, to gain some intuition about the relationship between creation and failure rates and the impact this has on the number of replicas that can be supported, we consider a simplification of Figure 1 that uses a fixed µ but repairs constantly, even allowing for transitions out of state 0. While these changes make the model less realistic, they turn the model into an M/M/∞ queue [19] where the "arrival rate" is the repair rate and the "service rate" is the per-replica failure rate. The "number of busy servers" is the number of replicas: the more replicas an object has, the more probable it is that one of them will fail.

This simplification allows us to estimate the equilibrium number of replicas: it is µ/λf. Given λf and µ, a system cannot expect to support more than this number of replicas. For example, if the system must handle coincidental bursts of five failures, it must be able to support at least six replicas and hence the replica creation rate must be at least 6 times higher than the average replica failure rate. We will refer to µ/λf as θ.

Choices for rL are effectively limited by θ. It is not the case that durability increases continuously with rL; rather, when using rL > θ, the system provides the best durability it can, given its resource constraints. Higher values of θ decrease the time it takes to repair an object, and thus the 'window of vulnerability' during which additional failures can cause the object to be destroyed.

To get an idea of a real-world value of θ, we estimate λf and µ from the historical failure record for disks on PlanetLab. From Table 1, the average disk failure inter-arrival time for the entire testbed is 39.85 hours. On average, there were 490 nodes in the system, so we can estimate the mean time between failures for a single disk as 490 · 39.85 hours or 2.23 years. This translates to λf ≈ 0.439 disk failures per year.

The replica creation rate µ depends on the achievable network throughput per node, as well as the amount of data that each node has to store (including replication). PlanetLab currently limits the available network bandwidth to 150 KB/s per node, and if we assume that the system stores 500 GB of unique data per node with rL = 3 replicas each, then each of the 490 nodes stores 1.5 TB. This means that one node's data can be recreated in 121 days, or approximately three times per year. This yields µ ≈ 3 disk copies per year.

In a system with these characteristics, we can estimate θ = µ/λf ≈ 6.85, though the actual value is likely to be lower. Note that this ratio represents the equilibrium number of disks worth of data that can be supported; if a disk is lost, all replicas on that disk are lost. When viewed in terms of disk failures and copies, θ depends on the value of rL: as rL increases, the total amount of data stored per disk (assuming available capacity) increases proportionally and reduces µ. If θ = µ/λf ≥ 1, the system can in fact maintain rL replicas of each object.

To show the impact of θ, we ran an experiment with the synthetic trace (i.e., with 632 nodes, a failure rate of λf = 1 per year and a storage load of 1 TB), varying the available bandwidth per node. In this case, 100 B/s corresponds to θ = 1.81/rL. Figure 2 shows that, as θ drops below one, the system can no longer maintain full replication and starts operating in a 'best effort' mode, where higher values of rL do not give any benefit. The exception is if some of the initial rL replicas survive through the entire trace, which explains the small differences on the left side of the graph.
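The arithmetic behind these estimates is easy to reproduce. The short sketch below is ours, using the same illustrative parameters as above; small differences from the 6.85 quoted in the text come down to unit conventions and rounding:

```python
# Reproducing the back-of-the-envelope estimate of theta = mu / lambda_f
# for the PlanetLab-like configuration described above (our sketch).

HOURS_PER_YEAR = 365 * 24

nodes = 490
testbed_failure_interarrival_h = 39.85        # from Table 1 (mean, in hours)
disk_mtbf_years = nodes * testbed_failure_interarrival_h / HOURS_PER_YEAR
lam_f = 1.0 / disk_mtbf_years                 # disk failures per disk per year

link_Bps = 150e3                              # PlanetLab bandwidth cap, bytes/s
stored_per_node_B = 1.5e12                    # 500 GB unique data with r_L = 3
seconds_to_recreate = stored_per_node_B / link_Bps
mu = (365 * 24 * 3600) / seconds_to_recreate  # disk copies per year

theta = mu / lam_f
print(round(disk_mtbf_years, 2), round(lam_f, 3), round(mu, 2), round(theta, 2))
# roughly 2.23 years, lambda_f ~ 0.45/year, mu ~ 3 copies/year, theta ~ 7
```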

Figure 2: Average number of replicas per object at the end of a two-year synthetic trace for varying values of θ, which varies with bandwidth per node (on the x-axis) and total data stored (rL). Where θ < 1, the system cannot maintain the full replication level; increasing rL further does not have any effect.

Figure 3: Frequency of "simultaneous" failures in the PlanetLab trace. These counts are derived from breaking the trace into non-overlapping 24 and 72 hour periods and noting the number of permanent failures that occur in each period. If there are x replicas of an object, there were y chances in the trace for the object to be lost; this would happen if the remaining replicas were not able to respond quickly enough to create new replicas of the object.

3.3 Choosing rL

A system designer must choose an appropriate value of rL to meet a target level of durability. That is, for a given deployment environment, rL must be high enough so that a burst of rL failures is sufficiently rare.

One approach is to set rL to one more than the maximum burst of simultaneous failures in a trace of a real system. For example, Figure 3 shows the burstiness of permanent failures in the PlanetLab trace by counting the number of times that a given number of failures occurs in disjoint 24 hour and 72 hour periods. If the size of a failure burst exceeds the number of replicas, some objects may be lost. From this, one might conclude that 12 replicas are needed to maintain the desired durability. This value would likely provide durability but at a high cost. If a lower value of rL would suffice, the bandwidth spent maintaining the extra replicas would be wasted.

There are several factors to consider in choosing rL to provide a certain level of durability. First, even if failures are independent, there is a non-zero (though small) probability for every burst size up to the total number of nodes. Second, a burst may arrive while there are fewer than rL replicas. One could conclude from these properties that the highest possible value of rL is desirable. On the other hand, the simultaneous failure of even a large fraction of nodes may not destroy any objects, depending on how the system places replicas (see Section 4). Also, the workload may change over time, affecting µ and thus θ.

The continuous time Markov model described in Figure 1 reflects the distributions of both burst size and object replication level. The effect of these distributions is significant. An analysis of the governing differential equations can be used to derive the probability that an object will be at a given replication level after a given amount of time. In particular, we can determine the probability that the chain is in state 0, corresponding to a loss of durability.

We show the results of such an analysis in Figure 4; for details, see [7]. To explore different workloads, we consider different amounts of data per node. The graph shows the probability that an object will survive after four years as a function of rL and data stored per node (which affects the repair rate and hence θ).

Figure 4: Analytic prediction for object durability after four years on PlanetLab. The x-axis shows the initial number of replicas for each object: as the number of replicas is increased, object durability also increases. Each curve plots a different per-node storage load; as load increases, it takes longer to copy objects after a failure and it is more likely that objects will be lost due to simultaneous failures.

As rL increases, the system can tolerate more simultaneous failures and objects are more likely to survive. The probability of object loss at rL = 1 corresponds to using no replication. This value is the same for all curves since it depends only on the lifetime of a disk; no new replicas can be created once the only replica of the object is lost. To store 50 GB durably, the system must use an rL of at least 3. As the total amount of data increases, the rL required to attain a given survival probability also increases. Experiments confirm that data is lost on the PlanetLab trace only when maintaining fewer than three replicas.
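The numerical analysis referred to above can be sketched directly from the chain in Figure 1. The fragment below is our illustration, not the authors' code from [7]: it uses a fixed repair rate µ for every intermediate state (the paper's µi may be state-dependent), treats state 0 as absorbing, and plugs in the rough per-year rates estimated in Section 3.2:

```python
import numpy as np
from scipy.linalg import expm

def p_loss(r_L, lam_f, mu, years):
    """Probability that the birth-death chain of Figure 1 is in state 0 after
    `years`, starting from r_L replicas. State 0 is absorbing (no repair);
    state i fails at rate i*lam_f and repairs at rate mu for 0 < i < r_L."""
    n = r_L + 1
    Q = np.zeros((n, n))                     # generator matrix, rows sum to zero
    for i in range(1, n):
        Q[i, i - 1] = i * lam_f              # lose one of the i replicas
        if i < r_L:
            Q[i, i + 1] = mu                 # repair back toward r_L
        Q[i, i] = -Q[i].sum()
    p0 = np.zeros(n)
    p0[r_L] = 1.0                            # start with a full set of replicas
    p_t = p0 @ expm(Q * years)               # transient distribution at time t
    return p_t[0]

lam_f, mu = 0.44, 3.0                        # illustrative per-year rates (Section 3.2)
for r_L in (1, 2, 3, 4):
    print(r_L, round(p_loss(r_L, lam_f, mu, 4.0), 4))
```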
4 Improving repair time

This section explores how the system can increase durability by replacing replicas from a failed disk in parallel. In effect, this reduces the time needed to repair the disk and increases θ.

Each node, n, designates a set of other nodes that can potentially hold copies of the objects that n is responsible for. We will call the size of that set the node's scope, and consider only system designs in which every node has the same scope.
Scope can range from a minimum of rL to a maximum of the number of nodes in the system.

A small scope means that all the objects stored on node n have copies on nodes chosen from the same restricted set of other nodes. The advantage of a small scope is that it makes it easier to keep track of the copies of each object. For example, DHash stores the copies of all the objects with keys in a particular range on the successor nodes of that key range; the result is that those nodes store similar sets of objects, and can exchange compressed summaries of the objects they store when they want to check that each object is replicated a sufficient number of times [6].

The disadvantage of a small scope is that the effort of creating new copies of objects stored on a failed disk falls on the small set of nodes in that disk's scope. The time required to create the new copies is proportional to the amount of data on one disk divided by the scope. Thus a small scope results in a long recovery time. Another problem with a small scope, when coupled with consistent hashing, is that the addition of a new node may cause needless copying of objects: the small scope may dictate that the new node replicate certain objects, forcing the previous replicas out of scope and thus preventing them from contributing to durability.

Larger scopes spread the work of making new copies of objects on a failed disk over more access links, so that the copying can be completed faster. In the extreme of a scope of N (the number of nodes in the system), the remaining copies of the objects on a failed disk would be spread over all nodes, assuming that there are many more objects than nodes. Furthermore, the new object copies created after the failure would also be spread over all the nodes. Thus the network traffic sources and destinations are spread over all the access links, and the time to recover from the failure is short (proportional to the amount of data on one disk divided by N).

A larger scope also means that a temporary failure will be noticed by a larger number of nodes. Thus, more access links are available to create additional replicas while the failure lasts. Unless these links are already fully utilized, this increases the effective replica creation rate, and thus improves durability.

Figure 5 shows how scope (and thus repair time) affects object durability in a simulation on a synthetic trace. To reduce θ, we limit the bandwidth per node to 1000 B/s in this experiment. We vary the repair threshold and the scope, and measure durability after two years of simulated time. Increasing the scope from 5 to 25 nodes reduces the fraction of lost objects by an order of magnitude, independent of rL. By including more nodes (and thus more network connections) in each repair effort, the work is spread over more access links and completes faster, limiting the window of time in which the system is vulnerable to another disk failure. Ideally, by doubling the scope, the window of vulnerability can be cut in half.

Figure 5: Durability for different scopes in a synthetic trace with low θ. Larger scopes spread the repair work over more access links and improve the nodes' ability to monitor replicas and temporary failures, which results in higher durability.

A large scope reduces repair time and increases durability; however, implementing a large scope presents two trade-offs. First, the system must monitor each node in the scope to determine the replication levels; when using a large scope, the system must monitor many nodes. This increased monitoring traffic limits scalability. Second, in some instances, a large scope can increase the likelihood that a simultaneous failure of multiple disks will cause some object to be lost.

If objects are placed randomly with scope N and there are many objects, then it is likely that all (N choose rL) potential
replica sets are used. In this scenario, the simultaneous failure of any rL disks is likely to cause data loss: there is likely to be at least one object replicated on exactly those disks. A small scope limits placement possibilities that are used, concentrating objects into common replica sets. As a result, it is less likely that a given set of rL failures will affect a replica set, but when data loss does occur, many more objects will be lost. These effects exactly balance: the expected number of objects lost during a large failure event is identical for both strategies. It is the variance that differs between the two strategies.
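The expected-loss-versus-variance argument can be checked with a small Monte Carlo experiment. The sketch below is ours, with a deliberately tiny, illustrative deployment: it places objects either on arbitrary rL-subsets of N nodes or on a few fixed replica sets, fails rL random disks per trial, and compares the mean and variance of the number of objects lost:

```python
import random
from statistics import mean, pvariance

def place(N, r_L, n_objects, strategy, rng):
    """Assign each object a replica set of r_L node ids."""
    nodes = list(range(N))
    if strategy == "random":                        # any r_L-subset may be used
        return [frozenset(rng.sample(nodes, r_L)) for _ in range(n_objects)]
    groups = [frozenset(range(i, i + r_L)) for i in range(0, N, r_L)]
    return [rng.choice(groups) for _ in range(n_objects)]   # few fixed replica sets

def losses_per_burst(placements, N, r_L, trials, rng):
    """For each trial, fail r_L random disks and count objects losing every replica."""
    out = []
    for _ in range(trials):
        failed = set(rng.sample(range(N), r_L))
        out.append(sum(1 for rs in placements if rs <= failed))
    return out

N, r_L, n_objects = 12, 3, 2000                     # small, purely illustrative system
rng = random.Random(1)
for strategy in ("random", "fixed"):
    losses = losses_per_burst(place(N, r_L, n_objects, strategy, rng), N, r_L, 5000, rng)
    print(strategy, round(mean(losses), 1), round(pvariance(losses), 1))
# Expected losses per burst are comparable; the variance is far larger for fixed sets.
```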
5 Reducing transient costs

The possibility of transient failures complicates providing durability efficiently: we do not want to make new copies in response to transient failures, but it is impossible to distinguish between disk failures and transient failures using only remote network measurements. This section focuses on minimizing the amount of network traffic sent in response to transient failures.

The key technique needed to achieve this is to ensure that the system reintegrates object replicas stored on nodes after transient failures; this means the system must be able to track more than rL replicas of each object. The number of replicas that the system must remember turns out to depend on a, the average fraction of time that a node is available. However, we show that the correct number of extra replicas can be determined without estimating a, by tracking the location of all replicas, including those that are offline. We introduce the Carbonite algorithm that uses this technique and demonstrate its effectiveness using simulations.
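As a concrete, simplified illustration of this idea, the sketch below shows a per-object maintenance loop in the spirit of what this section describes: it creates a new replica only when fewer than rL replicas are currently reachable, but it never forgets replica locations, so copies that return after a transient failure are reintegrated rather than recreated. This is our own sketch, not the Carbonite pseudocode from the paper; the names and helper callbacks are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectState:
    """All replica locations ever created for one object, online or not."""
    replicas: set = field(default_factory=set)      # node ids believed to hold a copy

def maintain(obj, r_L, reachable, create_replica, choose_new_node):
    """One maintenance round for a single object.

    reachable:        set of node ids currently responding to probes
    create_replica:   callback(node_id) that copies the object to a node
    choose_new_node:  callback(exclude=...) returning a node id not in `exclude`
    """
    live = obj.replicas & reachable                  # replicas reachable right now
    # Key point: replicas that are currently unreachable stay in obj.replicas;
    # if they come back, `live` grows again without any copying.
    while len(live) < r_L:
        target = choose_new_node(exclude=obj.replicas)
        create_replica(target)
        obj.replicas.add(target)                     # track the extra copy from now on
        live.add(target)

# Tiny usage example with stub callbacks:
obj = ObjectState(replicas={"n1", "n2", "n3"})
spare = iter(["n4", "n5", "n6"])
maintain(obj, r_L=3, reachable={"n1", "n3"},         # n2 looks down (transient)
         create_replica=lambda node: None,           # stand-in for the actual copy
         choose_new_node=lambda exclude: next(n for n in spare if n not in exclude))
print(sorted(obj.replicas))   # ['n1', 'n2', 'n3', 'n4'] -- n2 stays tracked
```

Over time, enough extra replicas accumulate that transient failures rarely leave fewer than rL copies reachable, which is how the cost can approach that of a system responding only to permanent failures.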
We additionally consider two other techniques for limiting response to transient failures: creating extra replicas in batches and using timeouts as a heuristic for distinguishing transient from disk failures. Both are of limited value: batching is best able to save bandwidth when using erasure codes and, in the presence of reintegration, timeouts work well.
