ACM Transactions on Computer Systems

February 1992
Volume 10 Number 1

A publication of the
Association for Computing Machinery

`Preface to the Special Issue on Operating Systems Principles
`by Anita Jones
`
`3 Disconnected Operation in the Coda File System
`by James J. Kistler and M. Satyanarayanan
`
`26 The Design and Implementation of a Log-Structured File System
`by Mendel Rosenblum and John K. Ousterhout
`
53 Scheduler Activations: Effective Kernel Support for the User-Level
Management of Parallelism
by Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and
Henry M. Levy
`
`
`
`
`
`
`
`
`
`
`
`
`Disconnected Operation in the Coda File
`
`System
`
`
`
JAMES J. KISTLER and M. SATYANARAYANAN
`
`
`
`
`
`
`
`
`Carnegie Mellon University
`
`
`
`
Disconnected operation is a mode of operation that enables a client to continue accessing critical
data during temporary failures of a shared data repository. An important, though not exclusive,
application of disconnected operation is in supporting portable computers. In this paper, we show
that disconnected operation is feasible, efficient and usable by describing its design and imple-
mentation in the Coda File System. The central idea behind our work is that caching of data,
now widely used for performance, can also be exploited to improve availability.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
Categories and Subject Descriptors: D.4.3 [Operating Systems]: File Systems
Management—distributed file systems; D.4.5 [Operating Systems]: Reliability—fault tolerance;
D.4.8 [Operating Systems]: Performance—measurements

General Terms: Design, Experimentation, Measurement, Performance, Reliability

Additional Key Words and Phrases: Disconnected operation, hoarding, optimistic replication,
reintegration, second-class replication, server emulation
`
`
`
`
`
`
`
`
`
`
`
This work was supported by the Defense Advanced Research Projects Agency (Avionics Lab,
Wright Research and Development Center, Aeronautical Systems Division (AFSC), U.S. Air
Force, Wright-Patterson AFB, Ohio 45433-6543 under Contract F33615-90-C-1465, ARPA Order
7597), National Science Foundation (PYI Award and Grant ECD 8907068), IBM Corporation
(Faculty Development Award, Graduate Fellowship, and Research Initiation Grant), Digital
Equipment Corporation (External Research Project Grant), and Bellcore (Information Network-
ing Research Grant).

Authors' address: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
15213.

Permission to copy without fee all or part of this material is granted provided that the copies are
not made or distributed for direct commercial advantage, the ACM copyright notice and the title
of the publication and its date appear, and notice is given that copying is by permission of the
Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or
specific permission.

© 1992 ACM 0734-2071/92/0200-0003 $01.50

ACM Transactions on Computer Systems, Vol. 10, No. 1, February 1992, Pages 3-25.

1. INTRODUCTION
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Every serious user of a distributed system has faced situations where critical
`
`
`
`
`
`
`
`
`
`
`
`
`work has been impeded by a remote failure. His frustration is particularly
`
`
`
`
`
`
`
`
`
`
`
`
`acute when his workstation is powerful enough to be used standalone, but
`
`
`
`
`
`
`
`
`
`
`
`has been configured to be dependent on remote resources. An important
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`instance of such dependence is the use of data from a distributed file system.
`
`
`
`
`
`
`
`
`
`Placing data in a distributed file system simplifies collaboration between
`
`
`
`
`
`
`
`
`
`
`
`
users, and allows them to delegate the administration of that data. The
`
`
`
`
`
`
`
`
`
`
`
`
`
`growing popularity of distributed file systems such as NFS [16] and AFS [19]
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`attests to the compelling nature of these considerations. Unfortunately, the
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
users of these systems have to accept the fact that a remote failure at a
critical juncture may seriously inconvenience them.
`
`
`
`
`
`
`
`
`
`
`
`
`
`How can we improve this state of affairs? Ideally, we would like to enjoy
`
`
`
`
`
`
`
`
`
`
`
`
`the benefits of a shared data repository, but be able to continue critical work
`
`
`
`
`
`
`
`
`
`
`
when that repository is inaccessible. We call the latter mode of operation
`
`
`
`
`
`
`
`
`disconnected operation, because it represents a temporary deviation from
`
`
`
`
`
`
`
`normal operation as a client of a shared repository.
`
`
`
`
`
`
`
`
`
`
`
`
`
`In this paper we show that disconnected operation in a file system is indeed
`
`
`
`
`
`
`
`
`
`
`
`
feasible, efficient and usable. The central idea behind our work is that
caching of data, now widely used to improve performance, can also be
`
`
`
`
`
`
`
`
`exploited to enhance availability. We have implemented disconnected opera-
`
`
`
`
`
`
`
`
`
`
`tion in the Coda File System at Carnegie Mellon University.
`
`
`
`
`
`
`
`
`
`
`Our initial experience with Coda confirms the viability of disconnected
`
`
`
`
`
`
`
`
`
`
`operation. We have successfully operated disconnected for periods lasting one
`
`
`
`
`
`
`
`
`
`
`
`
`
`to two days. For a disconnection of this duration, the process of reconnecting
`
`
`
`
`
`
`
`
`
`
`
`
`and propagating changes typically takes about a minute. A local disk of
`
`
`
`
`
`
`
`
`
`
`
100MB has been adequate for us during these periods of disconnection.
`Trace-driven simulations indicate that a disk of about half that size should be
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`adequate for disconnections lasting a typical workday.
`
`2. DESIGN OVERVIEW
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Coda is designed for an environment consisting of a large collection of
`untrusted Unix1 clients and a much smaller number of trusted Unix file
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`servers. The design is optimized for the access and sharing patterns typical of
`
`
`
`
`
`
`
`
`
`
academic and research environments. It is specifically not intended for
applications that exhibit highly concurrent, fine granularity data access.
`Each Coda client has a local disk and can communicate with the servers
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`over a high bandwidth network. At certain times, a client may be temporar-
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`ily unable to communicate with some or all of the servers. This may be due to
`
`
`
`
`
`
`
`
`
`
`
`
`
a server or network failure, or due to the detachment of a portable client
from the network.
`
`
`
`
`
`
`
`
`
`
`
`
`
`Clients view Coda as a single, location-transparent shared Unix file sys-
`
`
`
`
`
`
`
`
`
`
`
`
`tem. The Coda namespace is mapped to individual file servers at the granu-
`
`
`
`
`
`
`
`
`
`
`
larity of subtrees called volumes. At each client, a cache manager (Venus)
`
`
`
`
`
`
`dynamically obtains and caches volume mappings.
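To make this mapping concrete, the following sketch (in C, with hypothetical names; it is not the actual Venus code) shows one way a client-side cache of volume mappings could be consulted and refilled on a miss.

    /* Illustrative sketch of a volume-mapping cache (hypothetical names,
     * not the real Coda implementation): each volume id maps to the set
     * of servers that store replicas of that volume. */

    #include <stddef.h>

    #define MAXVSG 8

    struct volent {
        unsigned int   vid;              /* volume id */
        int            nservers;         /* number of servers with a replica */
        unsigned int   server[MAXVSG];   /* ids of those servers */
        struct volent *next;
    };

    static struct volent *volcache;      /* cached volume mappings */

    /* Assumed RPC that asks a Coda server for a volume's location. */
    extern struct volent *fetch_volume_info(unsigned int vid);

    /* Resolve a volume id, contacting a server only on a cache miss. */
    struct volent *getvolume(unsigned int vid)
    {
        struct volent *v;

        for (v = volcache; v != NULL; v = v->next)
            if (v->vid == vid)
                return v;                /* hit: reuse the cached mapping */

        v = fetch_volume_info(vid);      /* miss: obtain the mapping remotely */
        if (v != NULL) {
            v->next = volcache;          /* remember it for future lookups */
            volcache = v;
        }
        return v;
    }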
`
`
`
`
`
`
`
`
`
`
`Coda uses two distinct, but complementary, mechanisms to achieve high
`
`
`
`
`
`
`
`
`
`
`availability. The first mechanism, server replication, allows volumes to have
`
`
`
`
`
`
`
`
`
`
`
`
`
`
read-write replicas at more than one server. The set of replication sites for a
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`volume is its volume storage group (VSG). The subset of a VSG that is
`
`
`
`
`
`
`
`
`
`
`
`currently accessible is a client's accessible VSG (AVSG). The performance
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`cost of server replication is kept low by caching on disks at clients and
`
`
`
`
`
`
`
`
`
`
`
`
`through the use of parallel access protocols. Venus uses a cache coherence
`
`
`
`
`
`
`
`
`
`
`
`
`
`
protocol based on callbacks [9] to guarantee that an open file yields its latest
`
`
`
`
`
`
`
`
`
`
`1 Unix is a trademark of AT&T Bell Telephone Labs.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`copy in the AVSG. This guarantee is provided by servers notifying clients
`
`
`
`
`
`
`
`
`
`
when their cached copies are no longer valid, each notification being referred
to as a 'callback break'. Modifications in Coda are propagated in parallel to
`
`
`
`
`
`
`
`
`
`
`all AVSG sites, and eventually to missing VSG sites.
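The sketch below (hypothetical names and types, not the actual implementation) illustrates the relationship between a volume's VSG and AVSG and the effect of a callback break on a cached object.

    /* Sketch of VSG/AVSG bookkeeping and callback handling; the names are
     * illustrative assumptions, not the real Venus data structures. */

    #include <stdbool.h>

    #define MAXVSG 8

    struct replvol {
        int          nvsg;                /* servers in the volume storage group */
        unsigned int server[MAXVSG];      /* ids of the VSG members */
        bool         reachable[MAXVSG];   /* which members are currently reachable */
    };

    struct fsobj {
        unsigned int fid;                 /* file identifier */
        bool         cached;              /* a copy is present in the local cache */
        bool         has_callback;        /* server promise that the copy is valid */
    };

    /* The AVSG is simply the reachable subset of the VSG. */
    int avsg_size(const struct replvol *v)
    {
        int i, n = 0;
        for (i = 0; i < v->nvsg; i++)
            if (v->reachable[i])
                n++;
        return n;
    }

    /* A callback break invalidates the cached copy; the next open must
     * refetch the latest copy from the AVSG. */
    void callback_break(struct fsobj *f)
    {
        f->has_callback = false;
    }

    /* Disconnected operation begins for this volume when its AVSG is empty. */
    bool is_disconnected(const struct replvol *v)
    {
        return avsg_size(v) == 0;
    }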
`
`
`
`
`
`
`
`
`
Disconnected operation, the second high availability mechanism used by
Coda, takes effect when the AVSG becomes empty. While disconnected,
`
`
`
`
`
`
`
`
`
`
`
`
`
`Venus services file system requests by relying solely on the contents of its
`
`
`
`
`
`
`
`
`
`
`
`cache. Since cache misses cannot be serviced or masked,
`they appear
`
`
`
`
`
`
`
`
`
`
`as failures to application programs and users. When disconnection ends,
`
`
`
`
`
`
`
`
`
`
`Venus propagates modifications and reverts to server replication. Figure 1
`
`
`
`
`
`
`
`
`
`depicts a typical scenario involving transitions between server replication
`
`
`
`and disconnected operation.
`
`
`
`
`
`
`
`
`
`
`
`Earlier Coda papers [18, 19] have described server replication in depth. In
`
`
`
`
`
`
`
`
`
`contrast,
`this paper restricts its attention to disconnected operation. We
`
`
`
`
`
`
`
`
`
`
`
`discuss server replication only in those areas where its presence has signifi-
`
`
`
`
`
`
`
`cantly influenced our design for disconnected operation.
`
`
`
`
`3. DESIGN RATIONALE
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`At a high level, two factors influenced our strategy for high availability.
`
`
`
`
`
`
`
`
`
First, we wanted to use conventional, off-the-shelf hardware throughout
`
`
`
`
`
`
`
`
`
`
`our system. Second, we wished to preserve transparency by seamlessly inte-
`
`
`
`
`
`
`
`
`
`
`grating the high availability mechanisms of Coda into a normal Unix
`environment.
`
`
`
`
`
`
`
`
`
`
`At a more detailed level, other considerations influenced our design. These
`
`
`
`
`
`
`
`
`
`
`
`include the need to scale gracefully, the advent of portable workstations, the
`
`
`
`
`
`
`
`
very different resource, integrity, and security assumptions made about
`
`
`
`
`
`
`
`
`
`
`
`clients and servers, and the need to strike a balance between availability and
`
`
`
`
`
`
`
`
`
`
`
`consistency. We examine each of these issues in the following sections.
`
`
`
`
`
`
`
`
`
`
`3.1 Scalability
`
`
`
`
`
`
`
`
`
`
`
`Successful distributed systems tend to grow in size. Our experience with
`
`
`
`
`
`
`
`
`
`
`
`
`Coda's ancestor, AFS, had impressed upon us the need to prepare for growth
`
`
`
`
`
`
`
`
`
`
`
`
a priori, rather than treating it as an afterthought [17]. We brought this
`
`
`
`
`
`
`
`
`
`
`
`
`experience to bear upon Coda in two ways. First, we adopted certain mecha-
`
`
`
`
`
`
`
`
`
`
`
`nisms that enhance scalability. Second, we drew upon a set of general
`
`
`
`
`
`
`principles to guide our design choices.
`
`
`
`
`
`
`
`
`
`
`An example of a mechanism we adopted for scalability is callback-based
`
`
`
`
`
`
`
`
cache coherence. Another such mechanism, whole-file caching, offers the
`
`
`
`
`
`
`
`
`
`
`
`
`added advantage of a much simpler failure model: a cache miss can only
`
`
`
`
`
`
`
`
`
`
`
`
`
`
occur on an open, never on a read, write, seek, or close. This, in turn,
`
`
`
`
`
`
`
`substantially simplifies the implementation of disconnected operation. A
`
`
`
`
`
`
`
`
`
`
`
`
partial-file caching scheme such as that of AFS-4 [22], Echo [8] or MFS
`
`
`
`
`
`
`
`
`[1] would have complicated our implementation and made disconnected
`
`
`
`operation less transparent.
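As a rough sketch of why whole-file caching confines the failure model to open (hypothetical names; the real interface differs), consider:

    /* Sketch: with whole-file caching, the only point at which a cache miss
     * (and hence a failure while disconnected) can occur is open; reads,
     * writes, seeks and close all operate on the local container file.
     * Names are illustrative assumptions, not the actual Venus code. */

    #include <stdbool.h>

    struct cfile {
        char local_path[256];       /* container file holding the whole file */
        bool cached;                /* whole file is present locally */
    };

    /* Assumed helper that fetches the entire file from the AVSG. */
    extern int fetch_whole_file(struct cfile *cf);

    int coda_open(struct cfile *cf, bool connected)
    {
        if (!cf->cached) {
            if (!connected)
                return -1;          /* disconnected miss: surfaces as a failure */
            if (fetch_whole_file(cf) != 0)
                return -1;          /* fetch from the servers failed */
            cf->cached = true;
        }
        return 0;                   /* subsequent operations are purely local */
    }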
`
`
`
`
`
`
`
`
`
`
`
A scalability principle that has had considerable influence on our design is
the placing of functionality on clients rather than servers. Only if integrity or
security would have been compromised have we violated this principle.
Another scalability principle we have adopted is the avoidance of system-wide
rapid change. Consequently, we have rejected strategies that require election
or agreement by large numbers of nodes. For example, we have avoided
algorithms such as that used in Locus [23] that depend on nodes achieving
consensus on the current partition state of the network.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
Fig. 1. How disconnected operation relates to server replication. Three servers (mahler, vivaldi, and ravel) have replicas of the
volume containing file x. This file is potentially of interest to users at three clients (flute, viola, and harp). Flute is capable of
wireless communication (indicated by a dotted line) as well as regular network communication. Proceeding clockwise, the steps
above show the value of x seen by each node as the connectivity of the system changes. Note that in step (d), flute is operating
disconnected.
`
`
`
`
`
`
`
`
3.2 Portable Workstations

Powerful, lightweight and compact laptop computers are commonplace today.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`It is instructive to observe how a person with data in a shared file system
`
`
`
`
`
`
`
`
`
`
`
`
`uses such a machine. Typically, he identifies files of interest and downloads
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`them from the shared file system into the local name space for use while
`
`
`
`
`
`
`
`
`
`
`
`
`
`isolated. When he returns, he copies modified files back into the shared file
`
`
`
`
`
`
`
`
`
`
`system. Such a user is effectively performing manual caching, with write-back
`
`
`upon reconnection!
`
`
`
`
`
`
`
`
`
`
`
`Early in the design of Coda we realized that disconnected operation could
`
`
`
`
`
`
`
`
`
`
`
`substantially simplify the use of portable clients. Users would not have to
`
`
`
`
`
`
`
`
`
`
`
`
`use a different name space while isolated, nor would they have to man-
`
`
`
`
`
`
`
`
`
`
`ually propagate changes upon reconnection. Thus portable machines are a
`
`
`
`
`
`champion application for disconnected operation.
`
`
`
`
`
`
`
`
`
`
`
`
`The use of portable machines also gave us another insight. The fact that
`
`
`
`
`
`
`
`
`
`
`
`
`people are able to operate for extended periods in isolation indicates that they
`
`
`
`
`
`
`
`
`
`
`
`
`are quite good at predicting their future file access needs. This, in turn,
`
`
`
`
`
`
`
`
`
`
`
`suggests that it is reasonable to seek user assistance in augmenting the
`
`
`
`
`
`
`cache management policy for disconnected operation.
`
`
`
`
`
`
`
`
`
`Functionally, involuntary disconnections caused by failures are no different
`
`
`
`
`
`
`
`
`from voluntary disconnections caused by unplugging portable computers.
`
`
`
`
`
`
`
`
`
`
`
`
`Hence Coda provides a single mechanism to cope with all disconnections. Of
`
`
`
`
`
`
`
`
`
`
`
`
`course, there may be qualitative differences: user expectations as well as the
`
`
`
`
`
`
`
`
`
`
`
`
`
`extent of user cooperation are likely to be different in the two cases.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`3.3 First- vs. Second-Class Replication
`
`
`
`
`
`
`
`
`
`
`If disconnected operation is feasible, why is server replication needed at
`
`
`
`
`
`
`
`
`
`
`
`all? The answer to this question depends critically on the very different
`
`
`
`
`
`
`
`
`assumptions made about clients and servers in Coda.
`
`
`
`
`
`
`
`
`
`
`
`
`
`Clients are like appliances:
`they can be turned off at will and may be
`
`
`
`
`
`
`
`
`
`
`
`unattended for long periods of time. They have limited disk storage capacity,
`
`
`
`
`
`
`
`
`
`
`
`their software and hardware may be tampered with, and their owners may
`
`
`
`
`
`
`
`
`
`
`
`
`not be diligent about backing up the local disks. Servers are like public
`
`
`
`
`
`
`
`
`
`
`utilities: they have much greater disk capacity, they are physically secure,
`
`
`
`
`
`
`
`
`
`
`and they are carefully monitored and administered by professional staff.
`
`
`
`
`
`
`
`
`
It is therefore appropriate to distinguish between first-class replicas on
`
`
`
`
`
`
`
`
`
`servers, and second-class replicas (i.e., cache copies) on clients. First-class
`
`
`
`
`
`
`
`
`
`
`replicas are of higher quality:
`they are more persistent, widely known,
`
`
`
`
`
`
`
`
`secure, available, complete and accurate. Second-class replicas, in contrast,
`
`
`
`
`
`
`
`
`
`
`are inferior along all these dimensions. Only by periodic revalidation with
`
`
`
`
`
`
`
`
`
`
`respect to a first-class replica can a second-class replica be useful.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`The function of a cache coherence protocol is to combine the performance
`
`
`
`
`
`
`
`
`
`
`
`
`and scalability advantages of a second-class replica with the quality of a
`
`
`
`
`
`
`
`
`
`
`first-class replica. When disconnected, the quality of the second-class replica
`
`
`
`
`
`
`
`
`
`
`
`
`
`may be degraded because the first-class replica upon which it is contingent is
`
`
`
`
`
`
`
`
`
`
inaccessible. The longer the duration of disconnection, the greater the poten-
`
`
`
`
`
`
`
`
`
`
`tial for degradation. Whereas server replication preserves the quality of data
`
`
`
`
`
`
`
`
`
`
`in the face of failures, disconnected operation forsakes quality for avail-
`
`
`
`
`
`
`
`
`
`
`ability. Hence server replication is important because it reduces the fre-
`
`
`
`
`
`
`
`
`
`
`quency and duration of disconnected operation, which is properly viewed as
`a measure of last resort.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Server replication is expensive because it requires additional hardware.
`
`
`
`
`
`
`
`
`
`
Disconnected operation, in contrast, costs little. Whether to use server
`
`
`
`
`
`
`
`
`
`
`
`
`
`replication or not is thus a trade-off between quality and cost. Coda does
`
`
`
`
`
`
`
`
`
`
`
`
`permit a volume to have a sole server replica. Therefore, an installation can
`
`
`
`
`
`
`
`
`
rely exclusively on disconnected operation if it so chooses.
`
`
`
`
`
`
`
`
`
`
`
3.4 Optimistic vs. Pessimistic Replica Control
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
By definition, a network partition exists between a disconnected second-class
`
`
`
`
`
`
`
`
`
`
`
`
`replica and all its first-class associates. The choice between two families of
`
`
`
`
`
`
`
`
`
`
`replica control strategies, pessimistic and optimistic [5], is therefore central
`
`
`
`
`
`
`
`
`
`
`to the design of disconnected operation. A pessimistic strategy avoids conflict-
`
`
`
`
`
`
`
`
`
`
ing operations by disallowing all partitioned writes or by restricting reads
`
`
`
`
`
`
`
`
`
`
`and writes to a single partition. An optimistic strategy provides much higher
`
`
`
`
`
`
`
`
`
`
availability by permitting reads and writes everywhere, and deals with the
`
`
`
`
`
`
`
`
`
`
`attendant danger of conflicts by detecting and resolving them after their
`occurrence.
`
`
`
`
`
`
`
`
`
`
`A pessimistic approach towards disconnected operation would require a
`
`
`
`
`
`
`
`
`
`
`
`
`
client to acquire shared or exclusive control of a cached object prior to
`
`
`
`
`
`
`
`
`
`
`disconnection, and to retain such control until reconnection. Possession of
`
`
`
`
`
`
`
`
`
`
`
`exclusive control by a disconnected client would preclude reading or writing
`
`
`
`
`
`
`
`
`
`
`
`
`at all other replicas. Possession of shared control would allow reading at
`
`
`
`
`
`
`
`
`
`other replicas, but writes would still be forbidden everywhere.
`
`
`
`
`
`
`
`
`
`
Acquiring control prior to voluntary disconnection is relatively simple. It is
`
`
`
`
`
`
`
`
`
`
`more difficult when disconnection is involuntary, because the system may
`
`
`
`
`
`
`
`
`
`have to arbitrate among multiple requestors. Unfortunately, the information
`
`
`
`
`
`
`
`
`
`
`
`
`
`needed to make a wise decision is not readily available. For example, the
`
`
`
`
`
`
`
`
`
`
`
system cannot predict which requestors would actually use the object, when
`
`
`
`
`
`
`
`
`
`
`
`
`
`they would release control, or what the relative costs of denying them access
would be.
`
`
`
`
`
`
`
`
`
`
`
`
`
`Retaining control until reconnection is acceptable in the case of brief
`
`
`
`
`
`
`
`
`
`
`
`disconnections. But it is unacceptable in the case of extended disconnections.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`A disconnected client with shared control of an object would force the rest of
`
`
`
`
`
`
`
`
`
`
`
`
`
`the system to defer all updates until it reconnected. With exclusive control, it
`
`
`
`
`
`
`
`
`
`
`
`
would even prevent other users from making a copy of the object. Coercing
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`the client to reconnect may not be feasible, since its whereabouts may not be
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`known. Thus, an entire user community could be at the mercy of a single
`errant client for an unbounded amount of time.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Placing a time bound on exclusive or shared control, as done in the case of
`
`
`
`
`
`
`
`
`
`
`
`
`
leases [7], avoids this problem but introduces others. Once a lease expires, a
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`disconnected client loses the ability to access a cached object, even if no one
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`else in the system is interested in it. This, in turn, defeats the purpose of
`
`
`
`
`
`
`
`
`
`
`disconnected operation which is to provide high availability. Worse, updates
`
`
`
`
`
`
`
`
`already made while disconnected have to be discarded.
`
`
`
`
`
`
`
`
`
`
`
`An optimistic approach has its own disadvantages. An update made at one
`
`
`
`
`
`
`
`
`
`
`disconnected client may conflict with an update at another disconnected or
`
`
`
`
`
`
`
`
`
`
`
`
`connected client. For optimistic replication to be viable, the system has to be
`
`
`
`
`
`
`
`
`
`
`
`more sophisticated. There needs to be machinery in the system for detecting
`
`
`
`
`
`
`
`
`
`conflicts, for automating resolution when possible, and for confining dam-
`
`
`
`
`
`
`
`
`
`
`age and preserving evidence for manual repair. Having to repair conflicts
`
`
`
`
`
`
`
`
`
`
`manually violates transparency, is an annoyance to users, and reduces the
`
`
`
`
`usability of the system.
`
`
`
`
`
`
`
`
`
`
We chose optimistic replication because we felt that its strengths and
`
`
`
`
`
`
`
`
`
`
`weaknesses better matched our design goals. The dominant influence on our
`
`
`
`
`
`
`
`
`
`
`
`
`choice was the low degree of write-sharing typical of Unix. This implied that
`
`
`
`
`
`
`
`
`
`
`
`an optimistic strategy was likely to lead to relatively few conflicts. An
`
`
`
`
`
`
`
`
`
`
`
`optimistic strategy was also consistent with our overall goal of providing the
`
`
`
`
`
`highest possible availability of data.
`
`
`
`
`
`
`
`
`
`
`In principle, we could have chosen a pessimistic strategy for server replica-
`
`
`
`
`
`
`
`
`
`
`tion even after choosing an optimistic strategy for disconnected operation.
`
`
`
`
`
`
`
`
`
`
`
`
`But that would have reduced transparency, because a user would have faced
`
`
`
`
`
`
`
`
`
`
`
`
`the anomaly of being able to update data when disconnected, but being
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`unable to do so when connected to a subset of the servers. Further, many of
`
`
`
`
`
`
`
`
`
`
`
`
`
`the previous arguments in favor of an optimistic strategy also apply to server
`
`replication.
`
`
`
`
`
`
`
`
`
`
`
`Using an optimistic strategy throughout presents a uniform model of the
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`system from the user’s perspective. At any time, he is able to read the latest
`
`
`
`
`
`
`
`
`
`
`
`data in his accessible universe and his updates are immediately visible to
`
`
`
`
`
`
`
`
`
`
`
`everyone else in that universe. His accessible universe is usually the entire
`
`
`
`
`
`
`
`
`
`
`set of servers and clients. When failures occur, his accessible universe
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`shrinks to the set of servers he can contact, and the set of clients that they, in
`
`
`
`
`
`
`
`
`
`
`
`turn, can contact.
`In the limit, when he is operating disconnected, his
`
`
`
`
`
`
`
`
`
`accessible universe consists of just his machine. Upon reconnection, his
`
`
`
`
`
`
`
`
`updates become visible throughout his now-enlarged accessible universe.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`4. DETAILED DESIGN AND IMPLEMENTATION
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`In describing our implementation of disconnected operation, we focus on the
`
`
`
`
`
`
`
`
`
`
`
`
`
`client since this is where much of the complexity lies. Section 4.1 describes
`
`
`
`
`
`
`
`
`
`
`
`
`the physical structure of a client, Section 4.2 introduces the major states
`
`
`
`
`
`
`
`
`
`
`
`
`
`of Venus, and Sections 4.3 to 4.5 discuss these states in detail. A descrip-
`
`
`
`
`
`
`
`
`
`
`
`tion of the server support needed for disconnected operation is contained in
`Section 4.5.
`
`
`
`
`
`
Fig. 2. Structure of a Coda client.
`
4.1 Client Structure
`
Because of the complexity of Venus, we made it a user-level process rather
than part of the kernel. The latter approach may have yielded better perfor-
mance, but would have been less portable and considerably more difficult to
debug. Figure 2 illustrates the high-level structure of a Coda client.
Venus intercepts Unix file system calls via the widely used Sun Vnode
interface [10]. Since this interface imposes a heavy performance overhead on
user-level cache managers, we use a tiny in-kernel MiniCache to filter out
many kernel-Venus interactions. The MiniCache contains no support for
remote access, disconnected operation or server replication; these functions
are handled entirely by Venus.
A system call on a Coda object is forwarded by the Vnode interface to the
MiniCache. If possible, the call is serviced by the MiniCache and control is
returned to the application. Otherwise, the MiniCache contacts Venus to
service the call. This, in turn, may involve contacting Coda servers. Control
returns from Venus via the MiniCache to the application program, updating
MiniCache state as a side effect. MiniCache state changes may also be
initiated by Venus on events such as callback breaks from Coda servers.
Measurements from our implementation confirm that the MiniCache is
critical for good performance [21].
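The following sketch (hypothetical names; the actual Vnode and MiniCache interfaces are more elaborate) outlines the forwarding path just described.

    /* Sketch of the call path described above: a system call on a Coda
     * object is handled by the in-kernel MiniCache when possible and is
     * otherwise passed up to the user-level Venus process. The names are
     * illustrative assumptions, not the real interface. */

    struct vnode;            /* kernel file object */
    struct vfs_request;      /* description of the operation being performed */

    extern int  minicache_lookup(struct vnode *vp, struct vfs_request *req);
    extern int  venus_upcall(struct vnode *vp, struct vfs_request *req);
    extern void minicache_update(struct vnode *vp, const struct vfs_request *req);

    /* Entry point invoked by the Vnode interface for a Coda object. */
    int coda_vnodeop(struct vnode *vp, struct vfs_request *req)
    {
        int err;

        /* Fast path: the MiniCache can satisfy the request itself. */
        if (minicache_lookup(vp, req) == 0)
            return 0;

        /* Slow path: upcall to Venus, which may contact Coda servers. */
        err = venus_upcall(vp, req);
        if (err == 0)
            minicache_update(vp, req);   /* cache the result as a side effect */

        return err;
    }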
`
`4.2 Venus States
`
`Logically, Venus operates in one of three states: hoarding, emulation, and
`reintegration. Figure 3 depicts these states and the transitions between them.
Venus is normally in the hoarding state, relying on server replication but
`always on the alert for possible disconnection. Upon disconnection, it enters
`the emulation state and remains there for the duration of disconnection.
`
`Upon reconnection, Venus enters the reintegration state, resynchronizes its
`
cache with its AVSG, and then reverts to the hoarding state. Since
all volumes may not be replicated across the same set of servers, Venus can
be in different states with respect to different volumes, depending on failure
conditions in the system.

Fig. 3. Venus states and transitions. When disconnected, Venus is in the emulation state. It
transits to reintegration upon successful reconnection to an AVSG member, and thence to
hoarding, where it resumes connected operation.
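A minimal sketch of these per-volume states and the transitions between them, under assumed names, is:

    /* Sketch of the three Venus states and the transitions between them,
     * kept per volume (hypothetical names; illustrative only). */

    enum venus_state {
        HOARDING,       /* connected: cache useful data ahead of disconnection */
        EMULATION,      /* disconnected: service requests from the cache alone */
        REINTEGRATION   /* reconnected: propagate changes back to the AVSG */
    };

    struct volstate {
        unsigned int     vid;     /* volume id */
        enum venus_state state;
    };

    /* Called when the AVSG for this volume becomes empty. */
    void on_disconnection(struct volstate *v)
    {
        v->state = EMULATION;
    }

    /* Called when some AVSG member becomes reachable again. */
    void on_reconnection(struct volstate *v)
    {
        v->state = REINTEGRATION;   /* resynchronize the cache with the AVSG... */
    }

    /* Called when reintegration of this volume completes successfully. */
    void on_reintegration_done(struct volstate *v)
    {
        v->state = HOARDING;        /* ...then resume connected operation */
    }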
`
`4.3 Hoarding
`
`
`
`The hoarding state is so named because a key responsibility of Venus in this
`state is to hoard useful data in anticipation of disconnection. However, this is
`not its only responsibility. Rather, Venus must manage its cache in a manner
`that balances the needs of connected and disconnected operation. For in-
`stance, a user may have indicated that a certain set of files is critical but may
`currently be using other files. To provide good performance, Venus must
`cache the latter files. But to be prepared for disconnection, it must also cache
`the former set of files.
`
Many factors complicate the implementation of hoarding:

—File reference behavior, especially in the distant future, cannot be
predicted with certainty.

—Disconnections and reconnections are often unpredictable.

—The true cost of a cache miss while disconnected is highly variable and
hard to quantify.

—Activity at other clients must be accounted for, so that the latest version of
an object is in the cache at disconnection.

—Since cache space is finite, the availability of less critical objects may have
to be sacrificed in favor of more critical objects.
`
`
`
`
`
`
`
[Figure 4 listings, not reproduced here: hoard profile text for personal files, system files, X11 files (from the X11 maintainer), and Venus source files (shared among Coda developers).]
`
Fig. 4. Sample hoard profiles. These are typical hoard profiles provided by a Coda user, an
application maintainer, and a group of project developers. Each profile is interpreted separately
by the hoard front-end program. The 'a' at the beginning of a line indicates an add-entry
command. Other commands are delete an entry, clear all entries, and list entries. The modifiers
following some pathnames specify nondefault priorities (the default is 10) and/or meta-expansion
for the entry. Note that the pathnames beginning with '/usr' are actually symbolic links into
'/coda'.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`To address these concerns, we manage the cache using a prioritized
`
`
`
`
`
`
`
`
`
`
`algorithm, and periodically reevaluate which objects merit retention in the
`
`
`
`
`
`
`
`cache via a process known as hoard walking.
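The sketch below illustrates the flavor of such prioritized management and of the eviction decision made during a hoard walk; the names and the equal weighting of the two information sources are assumptions for illustration, not the precise algorithm Venus uses.

    /* Illustrative sketch of prioritized cache management and hoard walking
     * (names and weighting are assumptions, not the exact Venus algorithm). */

    struct cobj {
        unsigned int fid;
        int hoard_priority;     /* explicit: from the hoard database, 0 if absent */
        int recency;            /* implicit: higher means referenced more recently */
    };

    /* Combine explicit (HDB) and implicit (reference history) information. */
    static int priority(const struct cobj *o)
    {
        /* Equal weighting is an arbitrary choice for illustration. */
        return o->hoard_priority + o->recency;
    }

    /* One step of a hoard walk: find the lowest-priority cached object,
     * which is the candidate for eviction when space is needed. */
    static int eviction_candidate(const struct cobj *cache, int n)
    {
        int i, victim;

        if (n <= 0)
            return -1;          /* nothing cached */

        victim = 0;
        for (i = 1; i < n; i++)
            if (priority(&cache[i]) < priority(&cache[victim]))
                victim = i;
        return victim;
    }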
`
`
`
`
`
`
`
`
`
4.3.1 Prioritized Cache Management. Venus combines implicit and ex-
plicit sources of information in its priority-based cache management algo-
rithm. The implicit information consists of recent reference history, as in
traditional caching algorithms. Explicit information takes the form of a
per-workstation hoard database (HDB), whose entries a