`
`Disconnected Operation in the Coda File System
`
`James J. Kistler and M. Satyanarayanan
`
`School of Computer Science
`Carnegie Mellon University
`Pittsburgh, PA 15213
`
`Abstract
`Disconnected operation is a mode of operation that enables
`a client to continue accessing critical data during temporary
`failures of a shared data repository. An important, though
`not exclusive, application of disconnected operation is in
`supporting portable computers. In this paper, we show that
`disconnected operation is feasible, efficient and usable by
`describing its design and implementation in the Coda File
`System. The central idea behind our work is that caching of
`data, now widely used for performance, can also be
`exploited to improve availability.
`
`1. Introduction
`Every serious user of a distributed system has faced
`situations where critical work has been impeded by a
`remote failure. His frustration is particularly acute when
`his workstation is powerful enough to be used standalone,
`but has been configured to be dependent on remote
`resources. An important instance of such dependence is the
`use of data from a distributed file system.
`
Placing data in a distributed file system simplifies
collaboration between users, and allows them to delegate
`the administration of that data. The growing popularity of
`distributed file systems such as NFS [15] and AFS [18]
`attests to the compelling nature of these considerations.
`Unfortunately, the users of these systems have to accept the
`fact that a remote failure at a critical juncture may seriously
`inconvenience them.
`
`This work was supported by the Defense Advanced Research Projects Agency
`(Avionics Lab, Wright Research and Development Center, Aeronautical Systems
`Division (AFSC), U.S. Air Force, Wright-Patterson AFB, Ohio, 45433-6543 under
`Contract F33615-90-C-1465, ARPA Order No. 7597), National Science Foundation
`(PYI Award and Grant No. ECD 8907068), IBM Corporation (Faculty Development
`Award, Graduate Fellowship, and Research Initiation Grant), Digital Equipment
`Corporation (External Research Project Grant), and Bellcore (Information
`Networking Research Grant).
`
`How can we improve this state of affairs? Ideally, we
`would like to enjoy the benefits of a shared data repository,
`but be able to continue critical work when that repository is
`inaccessible. We call the latter mode of operation
`disconnected operation, because it represents a temporary
`deviation from normal operation as a client of a shared
`repository.
`
`In this paper we show that disconnected operation in a file
`system is indeed feasible, efficient and usable. The central
`idea behind our work is that caching of data, now widely
`used to improve performance, can also be exploited to
`enhance availability. We have implemented disconnected
`operation in the Coda File System at Carnegie Mellon
`University.
`
`Our initial experience with Coda confirms the viability of
`disconnected operation. We have successfully operated
`disconnected for periods lasting four to five hours. For a
`disconnection of this duration, the process of reconnecting
`and propagating changes typically takes about a minute. A
`local disk of 100MB has been adequate for us during these
`periods of disconnection.
`Trace-driven simulations
`indicate that a disk of about half that size should be
`adequate for disconnections lasting a typical workday.
`
`2. Design Overview
Coda is designed for an environment consisting of a large
collection of untrusted Unix(1) clients and a much smaller
`number of trusted Unix file servers. The design is
`optimized for the access and sharing patterns typical of
`academic and research environments. It is specifically not
`intended for applications that exhibit highly concurrent,
`fine granularity data access.
`
`Each Coda client has a local disk and can communicate
`with the servers over a high bandwidth network. At certain
`times, a client may be temporarily unable to communicate
`with some or all of the servers. This may be due to a server
`or network failure, or due to the detachment of a portable
`client from the network.
`
(1) Unix is a trademark of AT&T.
`
[Figure 1 appears here: six panels, (a) through (f), showing the value of x
stored at the servers mahler, vivaldi, and ravel and cached at the clients
flute, viola, and harp as the connectivity of the system changes.]
Three servers (mahler, vivaldi, and ravel) have replicas of the volume
containing file x. This file is potentially of interest to users at three
clients (flute, viola, and harp). Flute is capable of wireless communication
(indicated by a dotted line) as well as regular network communication.
Proceeding clockwise, steps (a) through (f) show the value of x seen by each
node as the connectivity of the system changes. Note that in step (d), flute
is operating disconnected.
Figure 1: How Disconnected Operation Relates to Server Replication
`
Clients view Coda as a single, location-transparent shared Unix file system.
The Coda namespace is mapped to individual file servers at the granularity of
subtrees called volumes. At each client, a cache manager (Venus) dynamically
obtains and caches volume mappings.
`
`Coda uses two distinct, but complementary, mechanisms to
`achieve high availability. The first mechanism, server
`replication, allows volumes to have read-write replicas at
`more than one server. The set of replication sites for a
`volume is its volume storage group (VSG). The subset of a
`VSG that is currently accessible is a client’s accessible
`VSG (AVSG). The performance cost of server replication is
`kept low by caching on disks at clients and through the use
`of parallel access protocols. Venus uses a cache coherence
`protocol based on callbacks [9] to guarantee that an open of
`a file yields its latest copy in the AVSG. This guarantee is
`provided by servers notifying clients when their cached
`copies are no longer valid, each notification being referred
`to as a "callback break." Modifications in Coda are
`propagated in parallel to all AVSG sites, and eventually to
`missing VSG sites.
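To make the VSG/AVSG distinction concrete, the following is a minimal sketch,
in C, of how a client might represent them. The structure and names are
illustrative assumptions, not the actual Venus data structures.

    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_VSG 8                    /* assumed bound on replication sites */

    struct volume {
        unsigned vsg[MAX_VSG];           /* servers holding first class replicas */
        size_t   vsg_count;
        bool     reachable[MAX_VSG];     /* is each VSG member currently accessible? */
    };

    /* The AVSG is the subset of the VSG that the client can currently reach. */
    static size_t avsg_size(const struct volume *v)
    {
        size_t n = 0;
        for (size_t i = 0; i < v->vsg_count; i++)
            if (v->reachable[i])
                n++;
        return n;
    }

    /* Disconnected operation for this volume takes effect when the AVSG is empty. */
    static bool is_disconnected(const struct volume *v)
    {
        return avsg_size(v) == 0;
    }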
`
Disconnected operation, the second high availability mechanism used by Coda,
takes effect when the AVSG becomes empty. While disconnected, Venus services
file system requests by relying solely on the contents of its cache. Since
cache misses cannot be serviced or masked, they appear as failures to
application programs and users. When disconnection ends, Venus propagates
modifications and reverts to server replication. Figure 1 depicts a typical
scenario involving transitions between server replication and disconnected
operation.

Earlier Coda papers [17, 18] have described server replication in depth. In
contrast, this paper restricts its attention to disconnected operation. We
discuss server replication only in those areas where its presence has
significantly influenced our design for disconnected operation.

3. Design Rationale
At a high level, two factors influenced our strategy for high availability.
First, we wanted to use conventional, off-the-shelf hardware throughout our
system. Second, we wished to preserve transparency by seamlessly integrating
the high availability mechanisms of Coda into a normal Unix environment.
At a more detailed level, other considerations influenced our design. These
include the need to scale gracefully, the advent of portable workstations,
the very different resource, integrity, and security assumptions made about
clients and servers, and the need to strike a balance between availability
and consistency. We examine each of these issues in the following sections.
`
`3.1. Scalability
`Successful distributed systems tend to grow in size. Our
`experience with Coda’s ancestor, AFS, had impressed upon
`us the need to prepare for growth a priori, rather than
`treating it as an afterthought [16]. We brought this
`experience to bear upon Coda in two ways. First, we
`adopted certain mechanisms that enhance scalability.
`Second, we drew upon a set of general principles to guide
`our design choices.
`
`An example of a mechanism we adopted for scalability is
`callback-based cache coherence. Another such mechanism,
`whole-file caching, offers the added advantage of a much
`simpler failure model: a cache miss can only occur on an
`open, never on a read, write, seek, or close. This,
`in turn, substantially simplifies the implementation of
`disconnected operation. A partial-file caching scheme such
`as that of AFS-4 [21], Echo [8] or MFS [1] would have
`complicated our implementation and made disconnected
`operation less transparent.
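As a minimal sketch of why whole-file caching simplifies the failure model,
consider the following C fragment; the cache representation, the connected
flag, and the example pathname are illustrative assumptions, not Coda code.
The point is that a miss can surface only at open time, since read, write,
seek, and close subsequently operate on the local copy.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Toy stand-ins for the client cache state. */
    static const char *cached_paths[] = { "/coda/usr/jjk/papers/sosp/draft.tex" };
    static bool connected = false;               /* i.e., the AVSG is empty */

    static bool in_cache(const char *path)
    {
        for (size_t i = 0; i < sizeof cached_paths / sizeof cached_paths[0]; i++)
            if (strcmp(cached_paths[i], path) == 0)
                return true;
        return false;
    }

    /* 0: served from the local copy; 1: whole file fetched; -1: miss while disconnected. */
    static int open_whole_file(const char *path)
    {
        if (in_cache(path))
            return 0;
        if (connected)
            return 1;
        return -1;    /* the miss cannot be serviced or masked; it appears as a failure */
    }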
`
`A scalability principle that has had considerable influence
`on our design is the placing of functionality on clients
`rather than servers. Only if integrity or security would
`have been compromised have we violated this principle.
`Another scalability principle we have adopted is the
`avoidance of system-wide rapid change. Consequently, we
`have rejected strategies that require election or agreement
`by large numbers of nodes. For example, we have avoided
`algorithms such as that used in Locus [22] that depend on
`nodes achieving consensus on the current partition state of
`the network.
`
`3.2. Portable Workstations
`Powerful, lightweight and compact laptop computers are
`commonplace today.
`It is instructive to observe how a
`person with data in a shared file system uses such a
`machine. Typically, he identifies files of interest and
`downloads them from the shared file system into the local
`name space for use while isolated. When he returns, he
`copies modified files back into the shared file system.
`Such a user is effectively performing manual caching, with
`write-back upon reconnection!
`
`Early in the design of Coda we realized that disconnected
`operation could substantially simplify the use of portable
`clients. Users would not have to use a different name space
`while isolated, nor would they have to manually propagate
`changes upon reconnection. Thus portable machines are a
`champion application for disconnected operation.
`
`The use of portable machines also gave us another insight.
`The fact that people are able to operate for extended
`periods in isolation indicates that they are quite good at
`predicting their future file access needs. This, in turn,
`suggests that it is reasonable to seek user assistance in
`augmenting the cache management policy for disconnected
`operation.
`
`Functionally, involuntary disconnections caused by failures
`are no different from voluntary disconnections caused by
`unplugging portable computers. Hence Coda provides a
single mechanism to cope with all disconnections. Of course, there may be
qualitative differences: user expectations as well as the extent of user
cooperation are likely to be different in the two cases.
`
`3.3. First vs Second Class Replication
`If disconnected operation is feasible, why is server
`replication needed at all? The answer to this question
`depends critically on the very different assumptions made
`about clients and servers in Coda.
`
`Clients are like appliances: they can be turned off at will
`and may be unattended for long periods of time. They have
`limited disk storage capacity, their software and hardware
`may be tampered with, and their owners may not be
`diligent about backing up the local disks. Servers are like
`public utilities: they have much greater disk capacity, they
`are physically secure, and they are carefully monitored and
`administered by professional staff.
`
`It is therefore appropriate to distinguish between first class
`replicas on servers, and second class replicas (i.e., cache
`copies) on clients. First class replicas are of higher quality:
`they are more persistent, widely known, secure, available,
`complete and accurate. Second class replicas, in contrast,
`are inferior along all these dimensions. Only by periodic
`revalidation with respect to a first class replica can a
`second class replica be useful.
`
`The function of a cache coherence protocol is to combine
`the performance and scalability advantages of a second
`class replica with the quality of a first class replica. When
`disconnected, the quality of the second class replica may be
`degraded because the first class replica upon which it is
`contingent is inaccessible. The longer the duration of
`disconnection, the greater the potential for degradation.
`Whereas server replication preserves the quality of data in
`the face of failures, disconnected operation forsakes quality
for availability. Hence server replication is important because it reduces
the frequency and duration of disconnected operation, which is properly
viewed as a measure of last resort.
`
Server replication is expensive because it requires additional hardware.
Disconnected operation, in contrast,
costs little. Whether to use server replication or not is thus
`a tradeoff between quality and cost. Coda does permit a
`volume to have a sole server replica. Therefore, an
`installation can rely exclusively on disconnected operation
`if it so chooses.
`
`3.4. Optimistic vs Pessimistic Replica Control
`By definition, a network partition exists between a
`disconnected second class replica and all its first class
associates. The choice between two families of replica control strategies,
pessimistic and optimistic [5], is therefore central to the design of
disconnected operation.
`A pessimistic strategy avoids conflicting operations by
`disallowing all partitioned writes or by restricting reads and
`writes to a single partition. An optimistic strategy provides
`much higher availability by permitting reads and writes
`everywhere, and deals with the attendant danger of
`conflicts by detecting and resolving them after their
occurrence.
`
`A pessimistic approach towards disconnected operation
`would require a client to acquire shared or exclusive
`control of a cached object prior to disconnection, and to
`retain such control until reconnection. Possession of
`exclusive control by a disconnected client would preclude
`reading or writing at all other replicas. Possession of
`shared control would allow reading at other replicas, but
`writes would still be forbidden everywhere.
`
`Acquiring control prior to voluntary disconnection is
`relatively simple. It is more difficult when disconnection is
`involuntary, because the system may have to arbitrate
`among multiple requestors. Unfortunately, the information
`needed to make a wise decision is not readily available.
`For example, the system cannot predict which requestors
`will actually use the object, when they will release control,
`or what the relative costs of denying them access would be.
`
`Retaining control until reconnection is acceptable in the
`case of brief disconnections. But it is unacceptable in the
`case of extended disconnections. A disconnected client
`with shared control of an object would force the rest of the
`system to defer all updates until it reconnected. With
`exclusive control, it would even prevent other users from
`making a copy of the object. Coercing the client to
`reconnect may not be feasible, since its whereabouts may
`not be known. Thus, an entire user community could be at
`the mercy of a single errant client for an unbounded
`amount of time.
`
`Placing a time bound on exclusive or shared control, as
`done in the case of leases [7], avoids this problem but
`introduces others. Once a lease expires, a disconnected
client loses the ability to access a cached object, even if no one else in
the system is interested in it. This, in turn, defeats the purpose of
disconnected operation, which is to provide high availability. Worse, updates
already made while disconnected have to be discarded.
`
`An optimistic approach has its own disadvantages. An
`update made at one disconnected client may conflict with
`an update at another disconnected or connected client. For
`optimistic replication to be viable, the system has to be
`more sophisticated. There needs to be machinery in the
`system for detecting conflicts, for automating resolution
`when possible, and for confining damage and preserving
`evidence for manual repair. Having to repair conflicts
`manually violates transparency, is an annoyance to users,
`and reduces the usability of the system.
`
`We chose optimistic replication because we felt that its
`strengths and weaknesses better matched our design goals.
`The dominant influence on our choice was the low degree
`of write-sharing typical of Unix. This implied that an
`optimistic strategy was likely to lead to relatively few
conflicts. An optimistic strategy was also consistent with our overall goal
of providing the highest possible availability of data.
`
`In principle, we could have chosen a pessimistic strategy
`for server replication even after choosing an optimistic
`strategy for disconnected operation. But that would have
`reduced transparency, because a user would have faced the
`anomaly of being able to update data when disconnected,
`but being unable to do so when connected to a subset of the
`servers. Further, many of the previous arguments in favor
`of an optimistic strategy also apply to server replication.
`
`Using an optimistic strategy throughout presents a uniform
`model of the system from the user’s perspective. At any
`time, he is able to read the latest data in his accessible
`universe and his updates are immediately visible to
`everyone else in that universe. His accessible universe is
`usually the entire set of servers and clients. When failures
`occur, his accessible universe shrinks to the set of servers
`he can contact, and the set of clients that they, in turn, can
`contact. In the limit, when he is operating disconnected,
`his accessible universe consists of just his machine. Upon
`reconnection, his updates become visible throughout his
`now-enlarged accessible universe.
`
`4. Detailed Design and Implementation
In describing our implementation of disconnected operation, we focus on the
client since this is where much
`of the complexity lies. Section 4.1 describes the physical
`structure of a client, Section 4.2 introduces the major states
`of Venus, and Sections 4.3 to 4.5 discuss these states in
`detail. A description of the server support needed for
`disconnected operation is contained in Section 4.5.
`
4.1. Client Structure
Because of the complexity of Venus, we made it a user-level process rather
than part of the kernel. The latter approach may have yielded better
performance, but would have been less portable and considerably more
difficult to debug. Figure 2 illustrates the high-level structure of a Coda
client.

[Figure 2 appears here: an application and Venus run on the client; the
kernel's System Call Interface and Vnode Interface forward requests to the
Coda MiniCache, the MiniCache interacts with Venus, and Venus communicates
with the Coda servers.]
Figure 2: Structure of a Coda Client
`
`Venus intercepts Unix file system calls via the widely-used
`Sun Vnode interface [10]. Since this interface imposes a
`heavy performance overhead on user-level cache managers,
`we use a tiny in-kernel MiniCache to filter out many
`kernel-Venus interactions. The MiniCache contains no
`support for remote access, disconnected operation or server
`replication; these functions are handled entirely by Venus.
`
`A system call on a Coda object is forwarded by the Vnode
`interface to the MiniCache. If possible, the call is serviced
`by the MiniCache and control is returned to the application.
`Otherwise, the MiniCache contacts Venus to service the
`call. This, in turn, may involve contacting Coda servers.
`Control returns from Venus via the MiniCache to the
`application program, updating MiniCache state as a side
`effect. MiniCache state changes may also be initiated by
`Venus on events such as callback breaks from Coda
`servers. Measurements from our implementation confirm
`that the MiniCache is critical for good performance [20].
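A rough sketch of this call path, with illustrative predicates in place of
the real MiniCache and Venus logic, is shown below; it conveys only how the
MiniCache filters kernel-Venus interactions.

    enum handler { BY_MINICACHE, BY_VENUS, BY_VENUS_AND_SERVERS };

    /* Illustrative placeholders; the real criteria live inside the MiniCache and Venus. */
    static int minicache_can_service(int op) { return op == 0; }
    static int venus_needs_servers(int op)   { return op == 2; }

    /* A system call on a Coda object, as forwarded by the Vnode interface. */
    static enum handler service_call(int op)
    {
        if (minicache_can_service(op))
            return BY_MINICACHE;            /* no kernel-Venus interaction at all */
        if (venus_needs_servers(op))
            return BY_VENUS_AND_SERVERS;    /* Venus contacts Coda servers on the client's behalf */
        return BY_VENUS;                    /* Venus services it from its cache; MiniCache
                                               state is updated as a side effect */
    }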
`
`4.2. Venus States
`Logically, Venus operates in one of three states: hoarding,
`emulation, and reintegration. Figure 3 depicts these states
`and the transitions between them. Venus is normally in the
`hoarding state, relying on server replication but always on
`the alert for possible disconnection. Upon disconnection, it
`enters the emulation state and remains there for the
`duration of disconnection. Upon reconnection, Venus
`enters the reintegration state, resynchronizes its cache with
`its AVSG, and then reverts to the hoarding state. Since all
`volumes may not be replicated across the same set of
servers, Venus can be in different states with respect to different volumes,
depending on failure conditions in the system.
`
[Figure 3 appears here: a state diagram whose states are Hoarding, Emulation,
and Reintegration, with transitions labeled disconnection, physical
reconnection, and logical reconnection.]
When disconnected, Venus is in the emulation state. It transits to
reintegration upon successful reconnection to an AVSG member, and thence to
hoarding, where it resumes connected operation.
Figure 3: Venus States and Transitions
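The states and transitions of Figure 3 can be summarized by a small
per-volume state machine; the sketch below is a schematic rendering only, and
it omits any transitions not described in the text (such as failures during
reintegration).

    /* Venus is in one of these states with respect to each volume. */
    enum venus_state { HOARDING, EMULATION, REINTEGRATION };

    enum venus_event { DISCONNECTION, RECONNECTION, REINTEGRATION_COMPLETE };

    static enum venus_state next_state(enum venus_state s, enum venus_event e)
    {
        switch (s) {
        case HOARDING:
            /* Upon disconnection (the AVSG becomes empty), begin emulating servers. */
            return e == DISCONNECTION ? EMULATION : HOARDING;
        case EMULATION:
            /* Reconnection to an AVSG member triggers reintegration. */
            return e == RECONNECTION ? REINTEGRATION : EMULATION;
        case REINTEGRATION:
            /* After resynchronizing the cache with the AVSG, resume hoarding. */
            return e == REINTEGRATION_COMPLETE ? HOARDING : REINTEGRATION;
        }
        return s;
    }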
`
`4.3. Hoarding
The hoarding state is so named because a key responsibility of Venus in this
state is to hoard useful data in anticipation of disconnection. However, this
is not its only responsibility. Rather, Venus must manage its cache in a
manner that balances the needs of connected and disconnected operation. For
instance, a user may have indicated that a certain set of files is critical
but may currently be using other files. To provide good performance, Venus
must cache the latter files. But to be prepared for disconnection, it must
also cache the former set of files.
`
Many factors complicate the implementation of hoarding:
• File reference behavior, especially in the distant future, cannot be
predicted with certainty.
• Disconnections and reconnections are often unpredictable.
• The true cost of a cache miss while disconnected is highly variable and
hard to quantify.
• Activity at other clients must be accounted for, so that the latest version
of an object is in the cache at disconnection.
• Since cache space is finite, the availability of less critical objects may
have to be sacrificed in favor of more critical objects.
To address these concerns, we manage the cache using a prioritized algorithm,
and periodically reevaluate which objects merit retention in the cache via a
process known as hoard walking.
`
`
`
`
`# Personal files
`a /coda/usr/jjk d+
`a /coda/usr/jjk/papers 100:d+
`a /coda/usr/jjk/papers/sosp 1000:d+
`
`# System files
`a /usr/bin 100:d+
`a /usr/etc 100:d+
`a /usr/include 100:d+
`a /usr/lib 100:d+
`a /usr/local/gnu d+
`a /usr/local/rcs d+
`a /usr/ucb d+
`
`(a)
`
`# X11 files
`# (from X11 maintainer)
`a /usr/X11/bin/X
`a /usr/X11/bin/Xvga
`a /usr/X11/bin/mwm
`a /usr/X11/bin/startx
`a /usr/X11/bin/xclock
`a /usr/X11/bin/xinit
`a /usr/X11/bin/xterm
`a /usr/X11/include/X11/bitmaps c+
`a /usr/X11/lib/app-defaults d+
`a /usr/X11/lib/fonts/misc c+
`a /usr/X11/lib/system.mwmrc
`
`(b)
`
`# Venus source files
`# (shared among Coda developers)
`a /coda/project/coda/src/venus 100:c+
`a /coda/project/coda/include 100:c+
`a /coda/project/coda/lib c+
`
`(c)
`
`These are typical hoard profiles provided by a Coda user, an application maintainer, and a group of project developers. Each profile is
`interpreted separately by the HDB front-end program. The ’a’ at the beginning of a line indicates an add-entry command. Other
`commands are delete an entry, clear all entries, and list entries. The modifiers following some pathnames specify non-default priorities
`(the default is 10) and/or meta-expansion for the entry. Note that the pathnames beginning with ’/usr’ are actually symbolic links into
`’/coda’.
`
`Figure 4: Sample Hoard Profiles
`
`4.3.1. Prioritized Cache Management
Venus combines implicit and explicit sources of information in its
priority-based cache management algorithm. The implicit information consists
of recent reference history, as in traditional caching algorithms. Explicit
information takes the form of a per-workstation hoard database (HDB), whose
entries are pathnames identifying objects of interest to the user at that
workstation.
`
`A simple front-end program allows a user to update the
`HDB using command scripts called hoard profiles, such as
`those shown in Figure 4. Since hoard profiles are just files,
`it is simple for an application maintainer to provide a
`common profile for his users, or for users collaborating on
`a project to maintain a common profile. A user can
`customize his HDB by specifying different combinations of
`profiles or by executing front-end commands interactively.
`To facilitate construction of hoard profiles, Venus can
`record all file references observed between a pair of start
`and stop events indicated by a user.
`
`To reduce the verbosity of hoard profiles and the effort
`needed to maintain them, Venus supports meta-expansion
`of HDB entries. As shown in Figure 4, if the letter ’c’ (or
`’d’) follows a pathname, the command also applies to
`immediate children (or all descendants). A ’+’ following
`the ’c’ or ’d’ indicates that the command applies to all
future as well as present children or descendants. A hoard
`entry may optionally indicate a hoard priority, with higher
`priorities indicating more critical objects.
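For concreteness, a minimal sketch of parsing one add-entry line in the
format of Figure 4 is given below; it covers only the 'a' command with an
optional "priority:" prefix and 'c'/'d'/'+' modifiers, and is a simplified
reading of the format rather than the actual HDB front-end.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct hdb_entry {
        char path[256];
        int  priority;        /* default hoard priority is 10 */
        char expansion;       /* 'c' = children, 'd' = all descendants, 0 = none */
        int  future_too;      /* 1 if a '+' extends the command to future children */
    };

    /* Parse a line such as "a /coda/usr/jjk/papers/sosp 1000:d+"; returns 0 on success. */
    static int parse_add(const char *line, struct hdb_entry *e)
    {
        char modifier[32] = "";
        e->priority = 10;
        e->expansion = 0;
        e->future_too = 0;
        if (sscanf(line, "a %255s %31s", e->path, modifier) < 1)
            return -1;                       /* not an add-entry command */
        const char *flags = modifier;
        const char *colon = strchr(modifier, ':');
        if (colon != NULL) {                 /* leading "priority:" portion */
            e->priority = atoi(modifier);
            flags = colon + 1;
        }
        if (*flags == 'c' || *flags == 'd')
            e->expansion = *flags++;
        if (*flags == '+')
            e->future_too = 1;
        return 0;
    }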
`
`The current priority of a cached object is a function of its
`hoard priority as well as a metric representing recent usage.
`The latter is updated continuously in response to new
`references, and serves to age the priority of objects no
`longer in the working set. Objects of the lowest priority
`are chosen as victims when cache space has to be
`reclaimed.
`
To resolve the pathname of a cached object while disconnected, it is
imperative that all the ancestors of the object also be cached. Venus must
therefore ensure that a cached directory is not purged before any of its
descendants. This hierarchical cache management is not needed in traditional
file caching schemes because cache misses during name translation can be
serviced, albeit at a performance cost. Venus performs hierarchical cache
management by assigning infinite priority to directories with cached
children. This automatically forces replacement to occur bottom-up.
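The two rules just described (priority as a combination of hoard priority and
recent usage, and infinite priority for directories with cached children) can
be combined into a simple victim-selection sketch. The particular weighting
below is an assumption for illustration; the paper does not give the exact
priority function.

    #include <limits.h>

    struct cached_object {
        int hoard_priority;    /* from the HDB; 0 if not mentioned there */
        int recency;           /* usage metric that decays as references age */
        int is_directory;
        int cached_children;   /* number of children currently in the cache */
    };

    static int current_priority(const struct cached_object *o)
    {
        /* Directories with cached children get "infinite" priority, forcing
           replacement to proceed bottom-up through the name hierarchy. */
        if (o->is_directory && o->cached_children > 0)
            return INT_MAX;
        return o->hoard_priority + o->recency;       /* assumed combination */
    }

    /* The lowest priority object is chosen as the victim when space is reclaimed. */
    static int pick_victim(const struct cached_object *objs, int n)
    {
        int victim = -1, lowest = INT_MAX;
        for (int i = 0; i < n; i++) {
            int p = current_priority(&objs[i]);
            if (p < lowest) {
                lowest = p;
                victim = i;
            }
        }
        return victim;                               /* -1 if nothing is evictable */
    }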
`
`4.3.2. Hoard Walking
`We say that a cache is in equilibrium, signifying that it
`meets user expectations about availability, when no
`uncached object has a higher priority than a cached object.
`Equilibrium may be disturbed as a result of normal activity.
`For example, suppose an object, A, is brought into the
`cache on demand, replacing an object, B. Further suppose
`that B is mentioned in the HDB, but A is not. Some time
`after activity on A ceases, its priority will decay below the
`hoard priority of B. The cache is no longer in equilibrium,
`since the cached object A has lower priority than the
`uncached object B.
`
`Venus periodically restores equilibrium by performing an
`operation known as a hoard walk. A hoard walk occurs
`every 10 minutes in our current implementation, but one
`may be explicitly requested by a user prior to voluntary
`disconnection. The walk occurs in two phases. First, the
`name bindings of HDB entries are reevaluated to reflect
`update activity by other Coda clients. For example, new
`children may have been created in a directory whose
`pathname is specified with the ’+’ option in the HDB.
`Second, the priorities of all entries in the cache and HDB
`are reevaluated, and objects fetched or evicted as needed to
`restore equilibrium.
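A schematic of the walk, assuming illustrative stub helpers (they are not
Venus interfaces), might look as follows; the loop simply fetches the highest
priority uncached objects, evicting as needed, until the equilibrium
condition defined above holds again.

    struct hdb;    /* hoard database */
    struct cache;  /* file cache with per-object priorities */

    /* Stub helpers standing in for the real machinery. */
    static void rebind_hdb_entries(struct hdb *h)                       { (void)h; }
    static void recompute_priorities(struct cache *c)                   { (void)c; }
    static int  highest_uncached_priority(const struct hdb *h)          { (void)h; return 0; }
    static int  lowest_cached_priority(const struct cache *c)           { (void)c; return 1; }
    static void fetch_highest_uncached(struct hdb *h, struct cache *c)  { (void)h; (void)c; }

    /* Performed every 10 minutes, or on request before a voluntary disconnection. */
    static void hoard_walk(struct hdb *h, struct cache *c)
    {
        rebind_hdb_entries(h);        /* phase 1: reevaluate name bindings of HDB entries */
        recompute_priorities(c);      /* phase 2: reevaluate priorities of cache and HDB  */
        while (highest_uncached_priority(h) > lowest_cached_priority(c))
            fetch_highest_uncached(h, c);   /* restore equilibrium */
    }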
`
`
`
`
`Hoard walks also address a problem arising from callback
`breaks.
`In traditional callback-based caching, data is
`refetched only on demand after a callback break. But in
`Coda, such a strategy may result in a critical object being
`unavailable should a disconnection occur before the next
`reference to it. Refetching immediately upon callback
`break avoids this problem, but ignores a key characteristic
`of Unix environments: once an object is modified, it is
`likely to be modified many more times by the same user
`within a short interval [14, 6]. An immediate refetch
`policy would increase client-server traffic considerably,
`thereby reducing scalability.
`
Our strategy is a compromise that balances availability, consistency, and
scalability. For files and symbolic links, Venus purges the object on
callback break, and refetches it on demand or during the next hoard walk,
whichever occurs earlier. If a disconnection were to occur before refetching,
the object would be unavailable. For directories, Venus does not purge on
callback break, but marks the cache entry suspicious. A stale cache entry is
thus available should a disconnection occur before the next hoard walk or
reference. The acceptability of stale directory data follows from its
particular callback semantics. A callback break on a directory typically
means that an entry has been added to or deleted from the directory. It is
often the case that other directory entries and the objects they name are
unchanged. Therefore, saving the stale copy and using it in the event of
untimely disconnection causes consistency to suffer only a little, but
increases availability considerably.

4.4. Emulation
In the emulation state, Venus performs many actions normally handled by
servers. For example, Venus now assumes full responsibility for access and
semantic checks. It is also responsible for generating temporary file
identifiers (fids) for new objects, pending the assignment of permanent fids
at reintegration. But although Venus is functioning as a pseudo-server,
updates accepted by it have to be revalidated with respect to integrity and
protection by real servers. This follows from the Coda policy of trusting
only servers, not clients. To minimize unpleasant delayed surprises for a
disconnected user, it behooves Venus to be as faithful as possible in its
emulation.

Cache management during emulation is done with the same priority algorithm
used during hoarding. Mutating operations directly update the cache entries
of the objects involved. Cache entries of deleted objects are freed
immediately, but those of other modified objects assume infinite priority so
that they are not purged before reintegration. On a cache miss, the default
behavior of Venus is to return an error code. A user may optionally request
Venus to block his processes until cache misses can be serviced.

4.4.1. Logging
During emulation, Venus records sufficient information to replay update
activity when it reintegrates. It maintains this information in a per-volume
log of mutating operations called a replay log. Each log entry contains a
copy of the corresponding system call arguments as well as the version state
of all objects referenced by the call.

Venus uses a number of optimizations to reduce the length of the replay log,
resulting in a log size that is typically a few percent of cache size. A
small log conserves disk space, a critical resource during periods of
disconnection. It also improves reintegration performance by reducing latency
and server load.

One important optimization to reduce log length pertains to write operations
on files. Since Coda uses whole-file caching, the close after an open of a
file for modification installs a completely new copy of the file. Rather than
logging the open, close, and intervening write operations individually, Venus
logs a single store record during the handling of a close.

Another optimization consists of Venus discarding a previous store record for
a file when a new one is appended to the log. This follows from the fact that
a store renders all previous versions of a file superfluous. The store record
does not contain a copy of the file's contents, but merely points to the copy
in the cache.

We are currently implementing two further optimizations to reduce the length
of the replay log. The first generalizes the optimization described in the
previous paragraph such that any operation which overwrites the effect of
earlier operations may cancel the corresponding log records. An example would
be the cancelling of a store by a subsequent unlink or truncate. The second
optimization exploits knowledge of inverse operations to cancel both the
inverting and inverted log records. For example, a rmdir may cancel its own
log record as well as that of the corresponding mkdir.
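A compact sketch of the replay log and the cancellation optimizations just
described appears below; the record layout is illustrative only (the real log
stores full system call arguments and version state), and only the
store/store and mkdir/rmdir cases are shown.

    #include <string.h>

    enum log_op { OP_STORE, OP_MKDIR, OP_RMDIR, OP_CANCELLED };

    struct replay_record {
        enum log_op op;
        char        path[256];   /* a store record points at the cached copy, not the data */
    };

    struct replay_log {
        struct replay_record rec[1024];
        int n;
    };

    /* Append a record, cancelling records it supersedes or inverts. */
    static void log_append(struct replay_log *log, enum log_op op, const char *path)
    {
        for (int i = 0; i < log->n; i++) {
            struct replay_record *r = &log->rec[i];
            if (r->op == OP_CANCELLED || strcmp(r->path, path) != 0)
                continue;
            if (op == OP_STORE && r->op == OP_STORE)
                r->op = OP_CANCELLED;       /* a new store supersedes the previous one */
            if (op == OP_RMDIR && r->op == OP_MKDIR) {
                r->op = OP_CANCELLED;       /* rmdir cancels the earlier mkdir...       */
                return;                     /* ...and its own record is never appended  */
            }
        }
        if (log->n < 1024) {
            struct replay_record *r = &log->rec[log->n++];
            r->op = op;
            strncpy(r->path, path, sizeof r->path - 1);
            r->path[sizeof r->path - 1] = '\0';
        }
    }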
`
`4.4.2. Persistence
`A disconnected user must be able to restart his machine
`after a shutdown and continue where he left off. In case of
`a crash, the amount of data lost should be no greater than if
`the same failure occurred during connected operation. To
`provide these guarantees, Venus must keep its cache and
`related data structures in non-volatile storage.
`
`Meta-data, consisting of cached directory and symbolic
`link contents, status b