`
Byung-Gon Chun,† Frank Dabek,⋆ Andreas Haeberlen,‡ Emil Sit,⋆ Hakim Weatherspoon,†
M. Frans Kaashoek,⋆ John Kubiatowicz,† and Robert Morris⋆

⋆ MIT Computer Science and Artificial Intelligence Laboratory,
‡ Rice University/MPI-SWS, † University of California, Berkeley
`
`Abstract
`This paper considers replication strategies for storage systems
`that aggregate the disks of many nodes spread over the Internet.
`Maintaining replication in such systems can be prohibitively ex-
`pensive, since every transient network or host failure could po-
`tentially lead to copying a server’s worth of data over the Internet
`to maintain replication levels.
`The following insights in designing an efficient replication al-
`gorithm emerge from the paper’s analysis. First, durability can
`be provided separately from availability; the former is less ex-
`pensive to ensure and a more useful goal for many wide-area ap-
`plications. Second, the focus of a durability algorithm must be
`to create new copies of data objects faster than permanent disk
`failures destroy the objects; careful choice of policies for what
`nodes should hold what data can decrease repair time. Third,
`increasing the number of replicas of each data object does not
`help a system tolerate a higher disk failure probability, but does
`help tolerate bursts of failures. Finally, ensuring that the system
`makes use of replicas that recover after temporary failure is crit-
`ical to efficiency.
`Based on these insights, the paper proposes the Carbonite
`replication algorithm for keeping data durable at a low cost. A
`simulation of Carbonite storing 1 TB of data over a 365 day
`trace of PlanetLab activity shows that Carbonite is able to keep
`all data durable and uses 44% more network traffic than a hy-
`pothetical system that only responds to permanent failures. In
`comparison, Total Recall and DHash require almost a factor of
`two more network traffic than this hypothetical system.
`1 Introduction
`Wide-area distributed storage systems typically use repli-
`cation to provide two related properties: durability and
`availability. Durability means that objects that an applica-
`tion has put into the system are not lost due to disk failure
`whereas availability means that get will be able to return
`the object promptly. Objects can be durably stored but not
`
`This research was supported by the National Science Founda-
`tion under Cooperative Agreement No. ANI-0225660, http://
`project-iris.net/. Andreas Haeberlen was supported in part
`by the Max Planck Society. Emil Sit was supported in part by the
`Cambridge-MIT Institute. Hakim Weatherspoon was supported by an
`Intel Foundation PhD Fellowship.
`
`immediately available: if the only copy of an object is on
`the disk of a node that is currently powered off, but will
`someday re-join the system with disk contents intact, then
`that object is durable but not currently available. The pa-
`per’s goal is to develop an algorithm to store immutable
`objects durably and at a low bandwidth cost in a system
`that aggregates the disks of many Internet nodes.
`The threat to durability is losing the last copy of an ob-
`ject due to permanent failures of disks. Efficiently coun-
`tering this threat to durability involves three main chal-
`lenges. First, network bandwidth is a scarce resource in
`a wide-area distributed storage system. To store objects
`durably, there must be enough network capacity to cre-
`ate copies of objects faster than they are lost due to disk
`failure. Second, a system cannot always distinguish be-
`tween transient failures and permanent disk failures: it
`may waste network bandwidth by creating new copies
`during transient failures. Third, after recovery from tran-
`sient failures, some replicas may be on nodes that the
`replica lookup algorithm does not query and are thus ef-
`fectively lost.
`Since transient failures are common in wide-area sys-
`tems, replication algorithms can waste bandwidth by mak-
`ing unneeded replicas. For example, the initial replica-
`tion algorithm [6] that the DHash distributed hash table
`(DHT) [9] turned out to be inadequate to build storage ap-
`plications such as UsenetDHT [34], Antiquity [11], and
`OverCite [35, 36].
`A problem with DHash was that its design was driven
`by the goal of achieving 100% availability; this decision
`caused it to waste bandwidth by creating new replicas in
`response to temporary failures. Its design and similar ones
`(such as Total Recall [3]) are overkill for durability. Fur-
`thermore, users of many Internet applications can tolerate
`some unavailability. For example, Usenet readers will see
`all articles eventually, as long as they are stored durably.
`Our experience with these DHT applications has led us to
`the following insights:
`
• Durability is a more practical and useful goal than
availability for applications that store objects (as opposed
to caching objects).
`
`• The main goal of a durability algorithm should be to
`create new copies of an object faster than they are
`destroyed by disk failures; the choice of how repli-
`cas are distributed among nodes can make this task
`easier.
`
`• Increasing the replication level does not help tolerate
`a higher average permanent failure rate, but it does
`help cope with bursts of failures.
`
`• Reintegrating returning replicas is key to avoiding
`unnecessary copying.
`Using these insights we have developed Carbonite, an
`efficient wide-area replication algorithm for keeping ob-
`jects durable. After inserting a set of initial replicas, Car-
`bonite begins by creating new replicas mostly in response
`to transient failures. However, over time it is increasingly
`able to ignore transient failures and approaches the goal of
`only producing replicas in response to permanent failures.
`Carbonite’s design assumes that the disks in the dis-
`tributed storage system fail independently of each other:
`failures of geographically distributed hard drives from dif-
`ferent manufacturers are likely to be uncorrelated.
`In a year-long PlanetLab failure trace, however, we ob-
`serve some correlated failures because of coordinated re-
`installs of the PlanetLab software. Despite this, an evalua-
`tion using the PlanetLab failure trace shows that Carbonite
`is able to keep 1 TB of data durable, and consumes only
`44% more network traffic than a hypothetical system that
`only responds to permanent failures. In comparison, To-
`tal Recall and DHash require almost a factor of two more
`network traffic than this hypothetical system.
`The rest of this paper explains our durability models
`and algorithms, interleaving evaluation results into the ex-
`planation. Section 2 describes the simulated evaluation
`environment. Section 3 presents a model of the relation-
`ship between network capacity, amount of replicated data,
`number of replicas, and durability. Section 4 explains
`how to decrease repair time, and thus increase durabil-
`ity, by proper placement of replicas on servers. Section 5
`presents an algorithm that reduces the bandwidth wasted
`making copies due to transient failures. Section 6 outlines
`some of the challenges that face practical implementations
`of these ideas, Section 7 discusses related work, and Sec-
`tion 8 concludes.
`2 System environment
`The behavior of a replication algorithm depends on the
`environment in which it is used: high disk failure rates or
`low network access link speeds make it difficult for any
`system to maintain durability. We will use the character-
`istics of the PlanetLab testbed as a representative environ-
`ment when evaluating wide-area replication techniques.
`
Dates                              1 March 2005 – 28 Feb 2006
Number of hosts                    632
Number of transient failures       21255
Number of disk failures            219
Transient host downtime (s)        1208 / 104647 / 14242
Any failure interarrival (s)       305 / 1467 / 3306
Disk failures interarrival (s)     54411 / 143476 / 490047
(The three duration rows list the median / mean / 90th percentile.)

Table 1: CoMon+PLC trace characteristics.
`
`For explanatory purposes, we will also use a synthetic
`trace that makes some of the underlying trends more vis-
`ible. This section describes both environments, as well as
`the simulator we used to evaluate our algorithm.
`2.1 PlanetLab characteristics
`PlanetLab is a large (> 600 node) research testbed [28]
`with nodes located around the world. We chose this
`testbed as our representative environment mainly because
`it is a large, distributed collection of machines that has
`been monitored for long periods; we use this monitoring
`data to construct a realistic trace of failures in a mostly
`managed environment.
`The main characteristics of PlanetLab that interest us
`are the rates of disk and transient failures. We use histor-
`ical data collected by the CoMon project [25] to identify
`transient failures. CoMon has archival records collected
`on average every 5 minutes that include the uptime as re-
`ported by the system uptime counter on each node. We
`use resets of this counter to detect reboots, and we esti-
`mate the time when the node became unreachable based
`on the last time CoMon was able to successfully contact
`the node. This allows us to pinpoint failures without de-
`pending on the reachability of the node from the CoMon
`monitoring site.
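For illustration, the sketch below shows one way reboots could be recovered from such uptime samples. It is a minimal sketch that assumes each node's CoMon records have been reduced to (poll_time, uptime) pairs in seconds; the function name and record format are ours, not CoMon's.

    def find_transient_failures(samples):
        # samples: [(poll_time, uptime_seconds), ...] for one node, sorted by
        # poll_time.  Returns estimated (went_down, came_back_up) pairs.
        failures = []
        prev_poll = samples[0][0]
        for poll_time, uptime in samples[1:]:
            boot_time = poll_time - uptime   # when this uptime counter started
            if boot_time > prev_poll:        # counter reset => node rebooted
                # the node went down some time after its last successful poll
                failures.append((prev_poll, boot_time))
            prev_poll = poll_time
        return failures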
`We define a disk failure to be any permanent loss of
`disk contents, due to disk hardware failure or because its
`contents are erased accidentally or intentionally. In or-
`der to identify disk failures, the CoMon measurements
`were supplemented with event logs from PlanetLab Cen-
`tral [28]. This database automatically records each time
`a PlanetLab node is reinstalled (e.g., for an upgrade, or
`after a disk is replaced following a failure). The machine
`is then considered offline until the machine is assigned a
`regular boot state in the database. Table 1 summarizes the
`statistics of this trace. Figure 7(a) visualizes how transient
`and disk failures accumulate over time in this network.
`2.2 Synthetic trace
`We also generated synthetic traces of failures by drawing
`failure inter-arrival times from exponential distributions.
`Synthetic traces have two benefits. First, they let us sim-
`ulate longer time periods, and second, they allow us to
`increase the failure density, which makes the basic under-
`lying trends more visible. We conjecture that exponential
`inter-failure times are a good model for disks that are in-
`dependently acquired and operated at geographically sep-
`arated sites; exponential intervals are possibly not so well
`justified for transient failures due to network problems.
`Each synthetic trace contains 632 nodes, just like the
`PlanetLab trace. The mean session time and downtime
`match the values shown in Table 1; however, in order to
`increase the failure density, we extended the length to two
`years and reduced the average node lifetime to one year.
`Each experiment was run with ten different traces; the fig-
`ures show the averages from these experiments.
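A minimal sketch of how such a trace could be drawn follows; the alternating session/downtime structure and the parameter names are our assumptions, with the mean values intended to be set from Table 1 rather than hard-coded.

    import random

    def synthetic_trace(nodes, length, mean_session, mean_downtime, mean_lifetime):
        # Each node alternates exponentially distributed sessions and downtimes
        # (transient failures); its disk is permanently lost after an
        # exponentially distributed lifetime.  All times are in seconds.
        # Boundary effects at the end of the trace are ignored for brevity.
        events = []
        for n in range(nodes):
            death = random.expovariate(1.0 / mean_lifetime)
            t = random.expovariate(1.0 / mean_session)
            while t < min(length, death):
                events.append((t, n, "transient_failure"))
                t += random.expovariate(1.0 / mean_downtime)
                events.append((t, n, "recovery"))
                t += random.expovariate(1.0 / mean_session)
            if death < length:
                events.append((death, n, "disk_failure"))
        return sorted(events)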
`2.3 Simulation
`We use the failure traces to drive an event-based simu-
`lator. In the simulator, each node has unlimited disk ca-
`pacity, but limited link bandwidth. However, it assumes
`that all network paths are independent so that there are
`no shared bottlenecks. Further it assumes that if a node is
`available, it is reachable from all other nodes. This is oc-
`casionally not the case on PlanetLab [14]; however, tech-
`niques do exist to mask the effects of partially unreachable
`nodes [1].
`The simulator takes as input a trace of transient and
`disk failure events, node repairs and object insertions. It
`simulates the behavior of nodes under different protocols
`and produces a trace of the availability of objects and the
`amount of data sent and stored by each node for each hour
`of simulated time. Each simulation calls put with 50,000
`data objects, each of size 20 MB. Unless otherwise noted,
`each node is configured with an access link capacity of
`150 KBytes/s, roughly corresponding to the throughput
`achievable under the bandwidth cap imposed by Planet-
`Lab. The goal of the simulations is to show the percent-
`age of objects lost and the amount of bandwidth needed
`to sustain objects over time.
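The sketch below indicates the general shape of such a trace-driven event loop; it is only illustrative — the handler name, event kinds, and the single-transfer bandwidth accounting are our simplifications, not the simulator's actual implementation.

    import heapq

    OBJECT_SIZE = 20 * 1024 * 1024        # 20 MB objects, 50,000 of them
    LINK_BPS    = 150 * 1024              # per-node access link capacity

    def simulate(trace_events, protocol):
        # trace_events: (time, node, kind) tuples covering transient failures,
        # recoveries, disk failures, and object insertions.
        queue = list(trace_events)
        heapq.heapify(queue)
        while queue:
            time, node, kind = heapq.heappop(queue)
            # the replication protocol decides which repairs this event triggers
            for src, dst in protocol.handle(time, node, kind):
                done = time + OBJECT_SIZE / LINK_BPS   # ignores link sharing
                heapq.heappush(queue, (done, dst, "replica_created"))
            # hourly accounting of bytes sent/stored and availability omitted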
`3 Understanding durability
`We consider the problem of providing durability for a stor-
`age system composed of a large number of nodes spread
`over the Internet, each contributing disk space. The sys-
`tem stores a large number of independent pieces of data.
`Each piece of data is immutable. The system must have
`a way to name and locate data; the former is beyond the
`scope of this work, while the latter may affect the possi-
`ble policies for placing replicas. While parts of the system
`will suffer temporary failures, such as network partitions
`or power failures, the focus of this section is on failures
`that result in permanent loss of data. Section 5 shows how
`to efficiently manage transient failures; this section de-
`scribes some fundamental constraints and challenges in
`providing durability.
`
`3.1 Challenges to durability
`It is useful to view permanent disk and node failures as
`having an average rate and a degree of burstiness. To pro-
`vide high durability, a system must be able to cope with
`both.
`In order to handle some average rate of failure, a high-
`durability system must have the ability to create new repli-
`cas of objects faster than replicas are destroyed. Whether
`the system can do so depends on the per-node network
`access link speed, the number of nodes (and hence ac-
`cess links) that help perform each repair, and the amount
`of data stored on each failed node. When a node n fails,
`the other nodes holding replicas of the objects stored on n
`must generate replacements: objects will remain durable
`if there is sufficient bandwidth available on average for the
`lost replicas to be recreated. For example, in a symmetric
`system each node must have sufficient bandwidth to copy
`the equivalent of all data it stores to other nodes during its
`lifetime.
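As a rough illustration of this constraint (our own arithmetic, using the PlanetLab-like numbers that reappear in Section 3.2):

    def is_feasible(bytes_stored_per_node, link_bytes_per_sec, disk_lifetime_sec):
        # A node must be able to push everything it stores (replicas included)
        # across its access link within one average disk lifetime.
        return bytes_stored_per_node / link_bytes_per_sec < disk_lifetime_sec

    # 1.5 TB per node over a 150 KB/s link takes roughly four months to copy,
    # comfortably less than a disk lifetime of a couple of years:
    print(is_feasible(1.5 * 2**40, 150 * 2**10, 2.23 * 365 * 24 * 3600))  # True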
`If nodes are unable to keep pace with the average fail-
`ure rate, no replication policy can prevent objects from
`being lost. These systems are infeasible. If the system is
`infeasible, it will eventually “adapt” to the failure rate by
`discarding objects until it becomes feasible to store the re-
`maining amount of data. A system designer may not have
`control over access link speeds and the amount of data to
`be stored; fortunately, choice of object placement can im-
`prove the speed that a system can create new replicas as
`discussed in Section 4.
`If the creation rate is only slightly above the average
`failure rate, then a burst of failures may destroy all of an
`object’s replicas before a new replica can be made; a sub-
`sequent lull in failures below the average rate will not help
`replace replicas if no replicas remain. For our purposes,
`these failures are simultaneous: they occur closer together
`in time than the time required to create new replicas of
`the data that was stored on the failed disk. Simultaneous
`failures pose a constraint tighter than just meeting the av-
`erage failure rate: every object must have more replicas
`than the largest expected burst of failures. We study sys-
`tems that aim to maintain a target number of replicas in
`order to survive bursts of failure; we call this target rL.
`Higher values of rL do not allow the system to survive a
higher average failure rate. For example, if failures were
`to arrive at fixed intervals, then either rL = 2 would always
`be sufficient, or no amount of replication would ensure
`durability. If rL = 2 is sufficient, there will always be time
`to create a new replica of the objects on the most recently
`failed disk before their remaining replicas fail. If creating
`new replicas takes longer than the average time between
`failures, no fixed replication level will make the system
`feasible; setting a replication level higher than two would
`only increase the number of bytes each node must copy in
`response to failures, which is already infeasible at rL = 2.
`
`
`
`
Figure 1: A continuous time Markov model for the process of replica failure and repair for a system that maintains three replicas (rL = 3). Numbered states correspond to the number of replicas of each object that are durable. Transitions to the left occur at the rate at which replicas are lost; right-moving transitions happen at the replica creation rate.

3.2 Creation versus failure rate

It might seem that any creation rate higher than the average failure rate will lead to an unbounded number of replicas, thus satisfying the burst constraint. However, this intuition is false. To see why, let us model the number of replicas of an object as a birth-death process using a continuous time Markov chain, which assumes independent exponential inter-failure and inter-repair times. This assumption is reasonable for independent disk failures.

An object is in state i when i disks hold a replica of the object. There are thus rL + 1 possible states, as we start with rL replicas and only create new replicas in response to failures. From a given state i, there is a transition to state i + 1 with rate µi corresponding to repair, except for state 0, which corresponds to loss of durability, and state rL, which does not need repair. The actual rate µi depends on how bandwidth is allocated to repair and may change depending on the replication level of an object. There is a transition to the next lower state i − 1 with rate iλf, because each of the i nodes holding an existing replica might fail. Figure 1 shows this model for the case where rL = 3.

This model can be analyzed numerically to shed light on the impact of rL on the probability of data loss; we will show this in Section 3.3. However, to gain some intuition about the relationship between creation and failure rates and the impact this has on the number of replicas that can be supported, we consider a simplification of Figure 1 that uses a fixed µ but repairs constantly, even allowing for transitions out of state 0. While these changes make the model less realistic, they turn the model into an M/M/∞ queue [19] where the "arrival rate" is the repair rate and the "service rate" is the per-replica failure rate. The "number of busy servers" is the number of replicas: the more replicas an object has, the more probable it is that one of them will fail.

This simplification allows us to estimate the equilibrium number of replicas: it is µ/λf. Given µ and λf, a system cannot expect to support more than this number of replicas. For example, if the system must handle coincidental bursts of five failures, it must be able to support at least six replicas, and hence the replica creation rate must be at least six times higher than the average replica failure rate. We will refer to µ/λf as θ. Choices for rL are effectively limited by θ. It is not the case that durability increases continuously with rL; rather, when using rL > θ, the system provides the best durability it can, given its resource constraints. Higher values of θ decrease the time it takes to repair an object, and thus the "window of vulnerability" during which additional failures can cause the object to be destroyed.

To get an idea of a real-world value of θ, we estimate λf and µ from the historical failure record for disks on PlanetLab. From Table 1, the average disk failure inter-arrival time for the entire testbed is 39.85 hours. On average, there were 490 nodes in the system, so we can estimate the mean time between failures for a single disk as 490 · 39.85 hours, or 2.23 years. This translates to λf ≈ 0.439 disk failures per year.

The replica creation rate µ depends on the achievable network throughput per node, as well as the amount of data that each node has to store (including replication). PlanetLab currently limits the available network bandwidth to 150 KB/s per node, and if we assume that the system stores 500 GB of unique data per node with rL = 3 replicas each, then each of the 490 nodes stores 1.5 TB. This means that one node's data can be recreated in 121 days, or approximately three times per year. This yields µ ≈ 3 disk copies per year.

In a system with these characteristics, we can estimate θ = µ/λf ≈ 6.85, though the actual value is likely to be lower. Note that this ratio represents the equilibrium number of disks' worth of data that can be supported; if a disk is lost, all replicas on that disk are lost. When viewed in terms of disk failures and copies, θ depends on the value of rL: as rL increases, the total amount of data stored per disk (assuming available capacity) increases proportionally and reduces µ. If θ ≥ rL, the system can in fact maintain rL replicas of each object.

To show the impact of θ, we ran an experiment with the synthetic trace (i.e., with 632 nodes, a failure rate of λf = 1 per year, and a storage load of 1 TB), varying the available bandwidth per node. In this case, 100 B/s corresponds to θ = 1.81/rL. Figure 2 shows that, as θ drops below one, the system can no longer maintain full replication and starts operating in a "best effort" mode, where higher values of rL do not give any benefit. The exception is if some of the initial rL replicas survive through the entire trace, which explains the small differences on the left side of the graph.
`
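To make these estimates concrete, the sketch below reproduces the θ arithmetic and then numerically evaluates the chain of Figure 1 (via the matrix exponential of its generator) for the probability of having reached state 0, i.e. of having lost durability, after a given time. Using a single fixed repair rate for every state is our simplification; in the paper's model the rate µi may vary with the replication level.

    import numpy as np
    from scipy.linalg import expm

    lam_f = 0.439          # disk failures per year per node (Section 3.2 estimate)
    mu    = 3.0            # disk copies per year at 150 KB/s and 1.5 TB per node
    theta = mu / lam_f     # roughly 6.8, matching the estimate above

    def p_durability_lost(r_L, mu, lam_f, years):
        # Probability that the birth-death chain of Figure 1, started with
        # r_L replicas, is in the absorbing state 0 after `years`.
        n = r_L + 1
        Q = np.zeros((n, n))
        for i in range(1, n):
            Q[i, i - 1] = i * lam_f        # each of i replicas can fail
            if i < r_L:
                Q[i, i + 1] = mu           # repair (simplified: fixed rate)
            Q[i, i] = -Q[i].sum()
        start = np.zeros(n)
        start[r_L] = 1.0
        return (start @ expm(Q * years))[0]

    for r in (1, 2, 3, 4):
        print(r, p_durability_lost(r, mu, lam_f, 4.0))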
`
Figure 2: Average number of replicas per object at the end of a two-year synthetic trace for varying values of θ, which varies with the bandwidth per node (on the x-axis) and the total data stored (rL). Where θ < 1, the system cannot maintain the full replication level; increasing rL further does not have any effect.

Figure 3: Frequency of "simultaneous" failures in the PlanetLab trace. These counts are derived from breaking the trace into non-overlapping 24 and 72 hour periods and noting the number of permanent failures that occur in each period. If there are x replicas of an object, there were y chances in the trace for the object to be lost; this would happen if the remaining replicas were not able to respond quickly enough to create new replicas of the object.

3.3 Choosing rL

A system designer must choose an appropriate value of rL to meet a target level of durability. That is, for a given deployment environment, rL must be high enough so that a burst of rL failures is sufficiently rare.

One approach is to set rL to one more than the maximum burst of simultaneous failures in a trace of a real system. For example, Figure 3 shows the burstiness of permanent failures in the PlanetLab trace by counting the number of times that a given number of failures occurs in disjoint 24 hour and 72 hour periods. If the size of a failure burst exceeds the number of replicas, some objects may be lost. From this, one might conclude that 12 replicas are needed to maintain the desired durability. This value would likely provide durability but at a high cost. If a lower value of rL would suffice, the bandwidth spent maintaining the extra replicas would be wasted.

There are several factors to consider in choosing rL to provide a certain level of durability. First, even if failures are independent, there is a non-zero (though small) probability for every burst size up to the total number of nodes. Second, a burst may arrive while there are fewer than rL replicas. One could conclude from these properties that the highest possible value of rL is desirable. On the other hand, the simultaneous failure of even a large fraction of nodes may not destroy any objects, depending on how the system places replicas (see Section 4). Also, the workload may change over time, affecting λf and µ and thus θ.

The continuous time Markov model described in Figure 1 reflects the distributions of both burst size and object replication level. The effect of these distributions is significant. An analysis of the governing differential equations can be used to derive the probability that an object will be at a given replication level after a given amount of time. In particular, we can determine the probability that the chain is in state 0, corresponding to a loss of durability.

We show the results of such an analysis in Figure 4; for details, see [7]. To explore different workloads, we consider different amounts of data per node. The graph shows the probability that an object will survive after four years as a function of rL and data stored per node (which affects the repair rate and hence θ).

As rL increases, the system can tolerate more simultaneous failures and objects are more likely to survive. The probability of object loss at rL = 1 corresponds to using no replication. This value is the same for all curves since it depends only on the lifetime of a disk; no new replicas can be created once the only replica of the object is lost. To store 50 GB durably, the system must use an rL of at least 3. As the total amount of data increases, the rL required to attain a given survival probability also increases. Experiments confirm that data is lost on the PlanetLab trace only when maintaining fewer than three replicas.
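The window-counting step behind Figure 3 is simple to reproduce; a sketch follows, assuming the permanent failures are available as a sorted list of timestamps in seconds (the function name is ours). One more than the largest observed count gives the conservative choice of rL discussed above.

    from collections import Counter

    def burst_histogram(disk_failure_times, window=24 * 3600):
        # Count permanent failures falling into disjoint fixed-size windows,
        # then histogram the per-window counts (burst size -> occurrences).
        per_window = Counter()
        start = disk_failure_times[0]
        for t in disk_failure_times:
            per_window[int((t - start) // window)] += 1
        return Counter(per_window.values())

    # hist = burst_histogram(failures, window=72 * 3600)
    # conservative_r_L = max(hist) + 1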
`
Figure 4: Analytic prediction for object durability after four years on PlanetLab. The x-axis shows the initial number of replicas for each object: as the number of replicas is increased, object durability also increases. Each curve plots a different per-node storage load; as load increases, it takes longer to copy objects after a failure and it is more likely that objects will be lost due to simultaneous failures.

4 Improving repair time

This section explores how the system can increase durability by replacing replicas from a failed disk in parallel. In effect, this reduces the time needed to repair the disk and increases θ.

Each node, n, designates a set of other nodes that can potentially hold copies of the objects that n is responsible for. We will call the size of that set the node's scope, and consider only system designs in which every node has the same scope. Scope can range from a minimum of rL to a maximum of the number of nodes in the system.

A small scope means that all the objects stored on node n have copies on nodes chosen from the same restricted set of other nodes. The advantage of a small scope is that it makes it easier to keep track of the copies of each object. For example, DHash stores the copies of all the objects with keys in a particular range on the successor nodes of that key range; the result is that those nodes store similar sets of objects, and can exchange compressed summaries of the objects they store when they want to check that each object is replicated a sufficient number of times [6].

The disadvantage of a small scope is that the effort of creating new copies of objects stored on a failed disk falls on the small set of nodes in that disk's scope. The time required to create the new copies is proportional to the amount of data on one disk divided by the scope. Thus a small scope results in a long recovery time. Another problem with a small scope, when coupled with consistent hashing, is that the addition of a new node may cause needless copying of objects: the small scope may dictate that the new node replicate certain objects, forcing the previous replicas out of scope and thus preventing them from contributing to durability.

Larger scopes spread the work of making new copies of objects on a failed disk over more access links, so that the copying can be completed faster. In the extreme of a scope of N (the number of nodes in the system), the remaining copies of the objects on a failed disk would be spread over all nodes, assuming that there are many more objects than nodes. Furthermore, the new object copies created after the failure would also be spread over all the nodes. Thus the network traffic sources and destinations are spread over all the access links, and the time to recover from the failure is short (proportional to the amount of data on one disk divided by N).

A larger scope also means that a temporary failure will be noticed by a larger number of nodes. Thus, more access links are available to create additional replicas while the failure lasts. Unless these links are already fully utilized, this increases the effective replica creation rate, and thus improves durability.

Figure 5: Durability for different scopes in a synthetic trace with low θ. Larger scopes spread the repair work over more access links and improve the nodes' ability to monitor replicas and temporary failures, which results in higher durability.

Figure 5 shows how scope (and thus repair time) affects object durability in a simulation on a synthetic trace. To reduce θ, we limit the bandwidth per node to 1000 B/s in this experiment. We vary the repair threshold and the scope, and measure durability after two years of simulated time. Increasing the scope from 5 to 25 nodes reduces the fraction of lost objects by an order of magnitude, independent of rL. By including more nodes (and thus more network connections) in each repair effort, the work is spread over more access links and completes faster, limiting the window of time in which the system is vulnerable to another disk failure. Ideally, by doubling the scope, the window of vulnerability can be cut in half.

A large scope reduces repair time and increases durability; however, implementing a large scope presents two trade-offs. First, the system must monitor each node in the scope to determine the replication levels; when using a large scope, the system must monitor many nodes. This increased monitoring traffic limits scalability. Second, in some instances, a large scope can increase the likelihood that a simultaneous failure of multiple disks will cause some object to be lost.

If objects are placed randomly with scope N and there are many objects, then it is likely that all (N choose rL) potential replica sets are used. In this scenario, the simultaneous failure of any rL disks is likely to cause data loss: there is likely to be at least one object replicated on exactly those disks. A small scope limits the placement possibilities that are used, concentrating objects into common replica sets. As a result, it is less likely that a given set of rL failures will affect a replica set, but when data loss does occur, many more objects will be lost. These effects exactly balance: the expected number of objects lost during a large failure event is identical for both strategies. It is the variance that differs between the two strategies.
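This balance of expectations is easy to check with a small Monte Carlo experiment (ours, not the paper's): place objects either on uniformly random rL-subsets of the N disks or on a handful of fixed replica sets, fail rL random disks, and compare how many objects are lost.

    import random

    def objects_lost(placements, failed):
        return sum(1 for replica_set in placements if replica_set <= failed)

    def trial(N=50, r_L=3, objects=10000, scope_sets=50):
        disks = range(N)
        random_place = [frozenset(random.sample(disks, r_L)) for _ in range(objects)]
        fixed_sets   = [frozenset(random.sample(disks, r_L)) for _ in range(scope_sets)]
        small_place  = [random.choice(fixed_sets) for _ in range(objects)]
        failed = set(random.sample(disks, r_L))
        return objects_lost(random_place, failed), objects_lost(small_place, failed)

    # Averaged over many trials the two placements lose the same number of
    # objects, but the small-scope placement usually loses none and
    # occasionally loses a large batch: the variance is much higher.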
`5 Reducing transient costs
`The possibility of transient failures complicates providing
`durability efficiently: we do not want to make new copies
`in response to transient failures, but it is impossible to dis-
`tinguish between disk failures and transient failures using
only remote network measurements. This section focuses on
minimizing the amount of network traffic sent in response
`to transient failures.
`The key technique needed to achieve this is to en-
`sure that the system reintegrates object replicas stored on
`nodes after transient failures; this means the system must
`be able to track more than rL replicas of each object. The
`number of replicas that the system must remember turns
`out to be dependent on a, the average fraction of time that
`a node is available. However, we show that the correct
`number of extra replicas can be determined without esti-
`mating a by tracking the location of all replicas, including
`those that are offline. We introduce the Carbonite algo-
`rithm that uses this technique and demonstrate its effec-
`tiveness using simulations.
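As a preview, the sketch below paraphrases the maintenance rule implied by this paragraph; it is our simplification, not Carbonite's actual pseudocode: remember every replica ever created, count only the currently reachable ones against rL, and reintegrate replicas on nodes that return.

    def maintain(object_id, known_replicas, reachable, r_L, create_replica):
        # known_replicas: every node that has ever held a copy, kept even
        # while the node is offline so it can be reintegrated on return.
        # reachable: the set of nodes currently believed to be up.
        live = [n for n in known_replicas if n in reachable]
        while len(live) < r_L:
            node = create_replica(object_id, exclude=set(known_replicas))
            known_replicas.append(node)
            live.append(node)
        # nothing is ever deleted here: replicas that reappear after a
        # transient failure count toward r_L again without new copying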
`We additionally consider two other techniques for lim-
`iting response to transient failures: creating extra repli-
`cas in batches and using timeouts as a heuristic for distin-
`guishing transient from disk failures. Both are of limited
`value: batching is best able to save bandwidth when using
`erasure codes and, in the presence of reintegration, time-
`outs work wel