Disconnected Operation in the Coda File System

JAMES J. KISTLER and M. SATYANARAYANAN
Carnegie Mellon University

Disconnected operation is a mode of operation that enables a client to continue accessing critical data during temporary failures of a shared data repository. An important, though not exclusive, application of disconnected operation is in supporting portable computers. In this paper, we show that disconnected operation is feasible, efficient and usable by describing its design and implementation in the Coda File System. The central idea behind our work is that caching of data, now widely used for performance, can also be exploited to improve availability.

Categories and Subject Descriptors: D.4.3 [Operating Systems]: File Systems Management-distributed file systems; D.4.5 [Operating Systems]: Reliability-fault tolerance; D.4.8 [Operating Systems]: Performance-measurements

General Terms: Design, Experimentation, Measurement, Performance, Reliability

Additional Key Words and Phrases: Disconnected operation, hoarding, optimistic replication, reintegration, second-class replication, server emulation

This work was supported by the Defense Advanced Research Projects Agency (Avionics Lab, Wright Research and Development Center, Aeronautical Systems Division (AFSC), U.S. Air Force, Wright-Patterson AFB, Ohio, 45433-6543 under Contract F33615-90-C-1465, ARPA Order 7597), National Science Foundation (PYI Award and Grant ECD 8907068), IBM Corporation (Faculty Development Award, Graduate Fellowship, and Research Initiation Grant), Digital Equipment Corporation (External Research Project Grant), and Bellcore (Information Networking Research Grant).
Authors' address: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1992 ACM 0734-2071/92/0200-0003 $01.50
ACM Transactions on Computer Systems, Vol. 10, No. 1, February 1992, Pages 3-25.

1. INTRODUCTION

Every serious user of a distributed system has faced situations where critical work has been impeded by a remote failure. His frustration is particularly acute when his workstation is powerful enough to be used standalone, but has been configured to be dependent on remote resources. An important instance of such dependence is the use of data from a distributed file system.

Placing data in a distributed file system simplifies collaboration between users, and allows them to delegate the administration of that data. The growing popularity of distributed file systems such as NFS [16] and AFS [19] attests to the compelling nature of these considerations. Unfortunately, the users of these systems have to accept the fact that a remote failure at a critical juncture may seriously inconvenience them.

How can we improve this state of affairs? Ideally, we would like to enjoy the benefits of a shared data repository, but be able to continue critical work when that repository is inaccessible. We call the latter mode of operation disconnected operation, because it represents a temporary deviation from normal operation as a client of a shared repository.

In this paper we show that disconnected operation in a file system is indeed feasible, efficient and usable. The central idea behind our work is that caching of data, now widely used to improve performance, can also be exploited to enhance availability. We have implemented disconnected operation in the Coda File System at Carnegie Mellon University.

Our initial experience with Coda confirms the viability of disconnected operation. We have successfully operated disconnected for periods lasting one to two days. For a disconnection of this duration, the process of reconnecting and propagating changes typically takes about a minute. A local disk of 100MB has been adequate for us during these periods of disconnection. Trace-driven simulations indicate that a disk of about half that size should be adequate for disconnections lasting a typical workday.
2. DESIGN OVERVIEW

Coda is designed for an environment consisting of a large collection of untrusted Unix¹ clients and a much smaller number of trusted Unix file servers. The design is optimized for the access and sharing patterns typical of academic and research environments. It is specifically not intended for applications that exhibit highly concurrent, fine granularity data access.

¹Unix is a trademark of AT&T Bell Telephone Labs.

Each Coda client has a local disk and can communicate with the servers over a high bandwidth network. At certain times, a client may be temporarily unable to communicate with some or all of the servers. This may be due to a server or network failure, or due to the detachment of a portable client from the network.

Clients view Coda as a single, location-transparent shared Unix file system. The Coda namespace is mapped to individual file servers at the granularity of subtrees called volumes. At each client, a cache manager (Venus) dynamically obtains and caches volume mappings.

Coda uses two distinct, but complementary, mechanisms to achieve high availability. The first mechanism, server replication, allows volumes to have read-write replicas at more than one server. The set of replication sites for a volume is its volume storage group (VSG). The subset of a VSG that is currently accessible is a client's accessible VSG (AVSG). The performance cost of server replication is kept low by caching on disks at clients and through the use of parallel access protocols. Venus uses a cache coherence protocol based on callbacks [9] to guarantee that an open file yields its latest copy in the AVSG. This guarantee is provided by servers notifying clients when their cached copies are no longer valid, each notification being referred to as a 'callback break'. Modifications in Coda are propagated in parallel to all AVSG sites, and eventually to missing VSG sites.
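The relationship between a volume's VSG and a client's AVSG can be made concrete with a small sketch. The C fragment below is illustrative only and is not drawn from the Coda sources; the struct layout and the reachability probe are assumptions. It shows a client recomputing the accessible subset of a volume's replication sites; an empty result is the condition under which disconnected operation takes effect.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_VSG 8                    /* hypothetical bound on replication sites */

struct server { bool up; };          /* stand-in for a real server handle */

/* Stand-in reachability probe; a real client would use an RPC ping. */
static bool server_reachable(const struct server *s) { return s->up; }

struct volume {
    const struct server *vsg[MAX_VSG];   /* volume storage group (VSG) */
    size_t               vsg_len;
    const struct server *avsg[MAX_VSG];  /* accessible VSG (AVSG) */
    size_t               avsg_len;
};

/* Recompute the AVSG as the currently reachable subset of the VSG.
 * An empty AVSG is what triggers disconnected operation for this volume. */
static void refresh_avsg(struct volume *v)
{
    v->avsg_len = 0;
    for (size_t i = 0; i < v->vsg_len; i++)
        if (server_reachable(v->vsg[i]))
            v->avsg[v->avsg_len++] = v->vsg[i];
}
```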
Disconnected operation, the second high availability mechanism used by Coda, takes effect when the AVSG becomes empty. While disconnected, Venus services file system requests by relying solely on the contents of its cache. Since cache misses cannot be serviced or masked, they appear as failures to application programs and users. When disconnection ends, Venus propagates modifications and reverts to server replication. Figure 1 depicts a typical scenario involving transitions between server replication and disconnected operation.

Earlier Coda papers [18, 19] have described server replication in depth. In contrast, this paper restricts its attention to disconnected operation. We discuss server replication only in those areas where its presence has significantly influenced our design for disconnected operation.

3. DESIGN RATIONALE

At a high level, two factors influenced our strategy for high availability. First, we wanted to use conventional, off-the-shelf hardware throughout our system. Second, we wished to preserve transparency by seamlessly integrating the high availability mechanisms of Coda into a normal Unix environment.

At a more detailed level, other considerations influenced our design. These include the need to scale gracefully, the advent of portable workstations, the very different resource, integrity, and security assumptions made about clients and servers, and the need to strike a balance between availability and consistency. We examine each of these issues in the following sections.

3.1 Scalability

Successful distributed systems tend to grow in size. Our experience with Coda's ancestor, AFS, had impressed upon us the need to prepare for growth a priori, rather than treating it as an afterthought [17]. We brought this experience to bear upon Coda in two ways. First, we adopted certain mechanisms that enhance scalability. Second, we drew upon a set of general principles to guide our design choices.

An example of a mechanism we adopted for scalability is callback-based cache coherence. Another such mechanism, whole-file caching, offers the added advantage of a much simpler failure model: a cache miss can only occur on an open, never on a read, write, seek, or close. This, in turn, substantially simplifies the implementation of disconnected operation. A partial-file caching scheme such as that of AFS-4 [22], Echo [8] or MFS [1] would have complicated our implementation and made disconnected operation less transparent.
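The simpler failure model of whole-file caching can be sketched as follows. This is a schematic illustration under assumed names, not the actual Venus code path: the only place a miss can be felt is at open time, because after a successful open all I/O is served from the complete local copy.

```c
#include <stdbool.h>

/* Schematic whole-file open. With whole-file caching, a cache miss (and thus
 * a failure while disconnected) can surface only here; subsequent read,
 * write, seek, and close operations touch only the local copy. */

struct cached_file { bool whole_file_present; };

static bool fetch_whole_file(struct cached_file *f, bool connected)
{
    if (!connected)
        return false;                 /* disconnected miss: nothing to mask  */
    f->whole_file_present = true;     /* stand-in for copying from a server  */
    return true;
}

static bool open_whole_file(struct cached_file *f, bool connected)
{
    if (f->whole_file_present)
        return true;                        /* hit: no server involvement     */
    return fetch_whole_file(f, connected);  /* the only point a miss can fail */
}
```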
A scalability principle that has had considerable influence on our design is the placing of functionality on clients rather than servers. Only if integrity or security would have been compromised have we violated this principle. Another scalability principle we have adopted is the avoidance of system-wide rapid change. Consequently, we have rejected strategies that require election or agreement by large numbers of nodes. For example, we have avoided algorithms such as that used in Locus [23] that depend on nodes achieving consensus on the current partition state of the network.

3.2 Portable Workstations

Powerful, lightweight and compact laptop computers are commonplace today. It is instructive to observe how a person with data in a shared file system uses such a machine. Typically, he identifies files of interest and downloads them from the shared file system into the local name space for use while isolated. When he returns, he copies modified files back into the shared file system. Such a user is effectively performing manual caching, with write-back upon reconnection!

Early in the design of Coda we realized that disconnected operation could substantially simplify the use of portable clients. Users would not have to use a different name space while isolated, nor would they have to manually propagate changes upon reconnection. Thus portable machines are a champion application for disconnected operation.

The use of portable machines also gave us another insight. The fact that people are able to operate for extended periods in isolation indicates that they are quite good at predicting their future file access needs. This, in turn, suggests that it is reasonable to seek user assistance in augmenting the cache management policy for disconnected operation.

Functionally, involuntary disconnections caused by failures are no different from voluntary disconnections caused by unplugging portable computers. Hence Coda provides a single mechanism to cope with all disconnections. Of course, there may be qualitative differences: user expectations as well as the extent of user cooperation are likely to be different in the two cases.

3.3 First- vs. Second-Class Replication

If disconnected operation is feasible, why is server replication needed at all? The answer to this question depends critically on the very different assumptions made about clients and servers in Coda. Clients are like appliances: they can be turned off at will and may be unattended for long periods of time. They have limited disk storage capacity, their software and hardware may be tampered with, and their owners may not be diligent about backing up the local disks. Servers are like public utilities: they have much greater disk capacity, they are physically secure, and they are carefully monitored and administered by professional staff.

It is therefore appropriate to distinguish between first-class replicas on servers, and second-class replicas (i.e., cache copies) on clients. First-class replicas are of higher quality: they are more persistent, widely known, secure, available, complete and accurate. Second-class replicas, in contrast, are inferior along all these dimensions. Only by periodic revalidation with respect to a first-class replica can a second-class replica be useful.

The function of a cache coherence protocol is to combine the performance and scalability advantages of a second-class replica with the quality of a first-class replica. When disconnected, the quality of the second-class replica may be degraded because the first-class replica upon which it is contingent is inaccessible. The longer the duration of disconnection, the greater the potential for degradation. Whereas server replication preserves the quality of data in the face of failures, disconnected operation forsakes quality for availability. Hence server replication is important because it reduces the frequency and duration of disconnected operation, which is properly viewed as a measure of last resort.

Server replication is expensive because it requires additional hardware. Disconnected operation, in contrast, costs little. Whether to use server replication or not is thus a trade-off between quality and cost. Coda does permit a volume to have a sole server replica. Therefore, an installation can rely exclusively on disconnected operation if it so chooses.

3.4 Optimistic vs. Pessimistic Replica Control

By definition, a network partition exists between a disconnected second-class replica and all its first-class associates. The choice between two families of replica control strategies, pessimistic and optimistic [5], is therefore central to the design of disconnected operation. A pessimistic strategy avoids conflicting operations by disallowing all partitioned writes or by restricting reads and writes to a single partition. An optimistic strategy provides much higher availability by permitting reads and writes everywhere, and deals with the attendant danger of conflicts by detecting and resolving them after their occurrence.

A pessimistic approach towards disconnected operation would require a client to acquire shared or exclusive control of a cached object prior to disconnection, and to retain such control until reconnection. Possession of exclusive control by a disconnected client would preclude reading or writing at all other replicas. Possession of shared control would allow reading at other replicas, but writes would still be forbidden everywhere.

Acquiring control prior to voluntary disconnection is relatively simple. It is more difficult when disconnection is involuntary, because the system may have to arbitrate among multiple requesters. Unfortunately, the information needed to make a wise decision is not readily available. For example, the system cannot predict which requesters would actually use the object, when they would release control, or what the relative costs of denying them access would be.

Retaining control until reconnection is acceptable in the case of brief disconnections. But it is unacceptable in the case of extended disconnections. A disconnected client with shared control of an object would force the rest of the system to defer all updates until it reconnected. With exclusive control, it would even prevent other users from making a copy of the object. Coercing the client to reconnect may not be feasible, since its whereabouts may not be known. Thus, an entire user community could be at the mercy of a single errant client for an unbounded amount of time.

Placing a time bound on exclusive or shared control, as done in the case of leases [7], avoids this problem but introduces others. Once a lease expires, a disconnected client loses the ability to access a cached object, even if no one else in the system is interested in it. This, in turn, defeats the purpose of disconnected operation, which is to provide high availability. Worse, updates already made while disconnected have to be discarded.

An optimistic approach has its own disadvantages. An update made at one disconnected client may conflict with an update at another disconnected or connected client. For optimistic replication to be viable, the system has to be more sophisticated. There needs to be machinery in the system for detecting conflicts, for automating resolution when possible, and for confining damage and preserving evidence for manual repair. Having to repair conflicts manually violates transparency, is an annoyance to users, and reduces the usability of the system.
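The kind of machinery needed for conflict detection can be illustrated with a deliberately simplified sketch. The per-replica counters and the rule below are our own illustration, not Coda's actual mechanism; they only convey the idea of comparing version state from the two sides of a partition after it heals.

```c
#include <stdbool.h>

/* Simplified per-replica version state: the number of updates a replica has
 * applied since the two replicas were last known to be identical. */
struct version_state { unsigned updates_since_sync; };

/* Illustrative rule: if both sides updated the object while partitioned,
 * the updates may conflict and need automated resolution or manual repair;
 * if only one side updated it, that side's changes can simply be adopted. */
static bool partitioned_updates_may_conflict(struct version_state a,
                                             struct version_state b)
{
    return a.updates_since_sync > 0 && b.updates_since_sync > 0;
}
```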
We chose optimistic replication because we felt that its strengths and weaknesses better matched our design goals. The dominant influence on our choice was the low degree of write-sharing typical of Unix. This implied that an optimistic strategy was likely to lead to relatively few conflicts. An optimistic strategy was also consistent with our overall goal of providing the highest possible availability of data.

In principle, we could have chosen a pessimistic strategy for server replication even after choosing an optimistic strategy for disconnected operation. But that would have reduced transparency, because a user would have faced the anomaly of being able to update data when disconnected, but being unable to do so when connected to a subset of the servers. Further, many of the previous arguments in favor of an optimistic strategy also apply to server replication.

Using an optimistic strategy throughout presents a uniform model of the system from the user's perspective. At any time, he is able to read the latest data in his accessible universe, and his updates are immediately visible to everyone else in that universe. His accessible universe is usually the entire set of servers and clients. When failures occur, his accessible universe shrinks to the set of servers he can contact, and the set of clients that they, in turn, can contact. In the limit, when he is operating disconnected, his accessible universe consists of just his machine. Upon reconnection, his updates become visible throughout his now-enlarged accessible universe.

4. DETAILED DESIGN AND IMPLEMENTATION

In describing our implementation of disconnected operation, we focus on the client since this is where much of the complexity lies. Section 4.1 describes the physical structure of a client, Section 4.2 introduces the major states of Venus, and Sections 4.3 to 4.5 discuss these states in detail. A description of the server support needed for disconnected operation is contained in Section 4.5.

Fig. 2. Structure of a Coda client.

4.1 Client Structure

Because of the complexity of Venus, we made it a user-level process rather than part of the kernel. The latter approach may have yielded better performance, but would have been less portable and considerably more difficult to debug. Figure 2 illustrates the high-level structure of a Coda client. Venus intercepts Unix file system calls via the widely used Sun Vnode interface [10]. Since this interface imposes a heavy performance overhead on user-level cache managers, we use a tiny in-kernel MiniCache to filter out many kernel-Venus interactions. The MiniCache contains no support for remote access, disconnected operation or server replication; these functions are handled entirely by Venus.

A system call on a Coda object is forwarded by the Vnode interface to the MiniCache. If possible, the call is serviced by the MiniCache and control is returned to the application. Otherwise, the MiniCache contacts Venus to service the call. This, in turn, may involve contacting Coda servers. Control returns from Venus via the MiniCache to the application program, updating MiniCache state as a side effect. MiniCache state changes may also be initiated by Venus on events such as callback breaks from Coda servers. Measurements from our implementation confirm that the MiniCache is critical for good performance [21].
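The division of labor between the MiniCache and Venus amounts to a simple routing decision. The sketch below is only a schematic of that decision, with invented types; the real path crosses the kernel boundary through the Sun Vnode interface, which is elided here.

```c
#include <stdbool.h>

enum serviced_by { BY_MINICACHE, BY_VENUS, SERVICE_FAILED };

struct vnode_op { int opcode; };     /* stand-in for a Vnode-level operation */

/* Assumed predicates: the MiniCache handles only what it can resolve from
 * its own state; Venus may in turn contact Coda servers when connected. */
static bool minicache_can_service(const struct vnode_op *op) { return op->opcode == 0; }
static bool venus_service(const struct vnode_op *op)         { (void)op; return true; }

static enum serviced_by dispatch(const struct vnode_op *op)
{
    if (minicache_can_service(op))   /* filtered out: no kernel-Venus crossing */
        return BY_MINICACHE;
    if (venus_service(op))           /* upcall to the user-level cache manager */
        return BY_VENUS;
    return SERVICE_FAILED;
}
```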

4.2 Venus States

Logically, Venus operates in one of three states: hoarding, emulation, and reintegration. Figure 3 depicts these states and the transitions between them. Venus is normally in the hoarding state, relying on server replication but always on the alert for possible disconnection. Upon disconnection, it enters the emulation state and remains there for the duration of disconnection. Upon reconnection, Venus enters the reintegration state, resynchronizes its cache with its AVSG, and then reverts to the hoarding state. Since all volumes may not be replicated across the same set of servers, Venus can be in different states with respect to different volumes, depending on failure conditions in the system.

Fig. 3. Venus states and transitions. When disconnected, Venus is in the emulation state. It transits to reintegration upon successful reconnection to an AVSG member, and thence to hoarding, where it resumes connected operation.
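The three states and their transitions can be summarized as a small per-volume state machine. The event names below are ours; the transitions follow Figure 3.

```c
enum venus_state { HOARDING, EMULATION, REINTEGRATION };
enum venus_event { DISCONNECTION, RECONNECTION, REINTEGRATION_COMPLETE };

/* One state machine per volume, since a client may be disconnected from the
 * servers of one volume while still connected to those of another. */
static enum venus_state next_state(enum venus_state s, enum venus_event e)
{
    switch (s) {
    case HOARDING:      return e == DISCONNECTION          ? EMULATION     : s;
    case EMULATION:     return e == RECONNECTION           ? REINTEGRATION : s;
    case REINTEGRATION: return e == REINTEGRATION_COMPLETE ? HOARDING      : s;
    }
    return s;
}
```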

4.3 Hoarding

The hoarding state is so named because a key responsibility of Venus in this state is to hoard useful data in anticipation of disconnection. However, this is not its only responsibility. Rather, Venus must manage its cache in a manner that balances the needs of connected and disconnected operation. For instance, a user may have indicated that a certain set of files is critical but may currently be using other files. To provide good performance, Venus must cache the latter files. But to be prepared for disconnection, it must also cache the former set of files.

Many factors complicate the implementation of hoarding:

—File reference behavior, especially in the distant future, cannot be predicted with certainty.
—Disconnections and reconnections are often unpredictable.
—The true cost of a cache miss while disconnected is highly variable and hard to quantify.
—Activity at other clients must be accounted for, so that the latest version of an object is in the cache at disconnection.
—Since cache space is finite, the availability of less critical objects may have to be sacrificed in favor of more critical objects.

# Personal files
a /coda/usr/jjk d+
a /coda/usr/jjk/papers 100:d+
a /coda/usr/jjk/papers/sosp 1000:d+

# System files
a /usr/bin 100:d+
a /usr/etc 100:d+
a /usr/include 100:d+
a /usr/lib 100:d+
a /usr/local/gnu d+
a /usr/local/rcs d+
a /usr/ucb d+

(a)

# X11 files
# (from X11 maintainer)
a /usr/X11/bin/X
a /usr/X11/bin/Xvga
a /usr/X11/bin/mwm
a /usr/X11/bin/startx
a /usr/X11/bin/xclock
a /usr/X11/bin/xinit
a /usr/X11/bin/xterm
a /usr/X11/include/X11/bitmaps c+
a /usr/X11/lib/app-defaults d+
a /usr/X11/lib/fonts/misc c+
a /usr/X11/lib/system.mwmrc

(b)

# Venus source files
# (shared among Coda developers)
a /coda/project/coda/src/venus 100:c+
a /coda/project/coda/include 100:c+
a /coda/project/coda/lib c+

(c)

Fig. 4. Sample hoard profiles. These are typical hoard profiles provided by a Coda user, an application maintainer, and a group of project developers. Each profile is interpreted separately by the HDB front-end program. The 'a' at the beginning of a line indicates an add-entry command. Other commands are delete an entry, clear all entries, and list entries. The modifiers following some pathnames specify nondefault priorities (the default is 10) and/or meta-expansion for the entry. Note that the pathnames beginning with '/usr' are actually symbolic links into '/coda'.

To address these concerns, we manage the cache using a prioritized algorithm, and periodically reevaluate which objects merit retention in the cache via a process known as hoard walking.

4.3.1 Prioritized Cache Management. Venus combines implicit and explicit sources of information in its priority-based cache management algorithm. The implicit information consists of recent reference history, as in traditional caching algorithms. Explicit information takes the form of a per-workstation hoard database (HDB), whose entries are pathnames identifying objects of interest to the user at the workstation.

A simple front-end program allows a user to update the HDB using command scripts called hoard profiles, such as those shown in Figure 4. Since hoard profiles are just files, it is simple for an application maintainer to provide a common profile for his users, or for users collaborating on a project to maintain a common profile. A user can customize his HDB by specifying different combinations of profiles or by executing front-end commands interactively. To facilitate construction of hoard profiles, Venus can record all file references observed between a pair of start and stop events indicated by a user.

To reduce the verbosity of hoard profiles and the effort needed to maintain them, Venus supports meta-expansion of HDB entries. As shown in Figure 4, if the letter 'c' (or 'd') follows a pathname, the command also applies to immediate children (or all descendants). A '+' following the 'c' or 'd' indicates that the command applies to all future as well as present children or descendants. A hoard entry may optionally indicate a hoard priority, with higher priorities indicating more critical objects.

The current priority of a cached object is a function of its hoard priority as well as a metric representing recent usage. The latter is updated continuously in response to new references, and serves to age the priority of objects no longer in the working set. Objects of the lowest priority are chosen as victims when cache space has to be reclaimed.
To resolve the pathname of a cached object while disconnected, it is imperative that all the ancestors of the object also be cached. Venus must therefore ensure that a cached directory is not purged before any of its descendants. This hierarchical cache management is not needed in traditional file caching schemes, because cache misses during name translation can be serviced, albeit at a performance cost. Venus performs hierarchical cache management by assigning infinite priority to directories with cached children. This automatically forces replacement to occur bottom-up.
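A minimal sketch of the hierarchical rule, under assumed data structures: a directory with cached children is given effectively infinite priority, which is exactly what forces replacement to proceed bottom-up.

```c
#include <math.h>

struct cache_node {
    struct cache_node *parent;            /* NULL for the root               */
    int                cached_children;   /* children currently in the cache */
    double             priority;          /* current priority (see above)    */
};

/* Directories that still have cached children can never be chosen as
 * victims, so their descendants must be evicted first. */
static double effective_priority(const struct cache_node *n)
{
    return n->cached_children > 0 ? INFINITY : n->priority;
}

/* Keep the child counts consistent as objects enter and leave the cache. */
static void on_cache_insert(struct cache_node *n) { if (n->parent) n->parent->cached_children++; }
static void on_cache_evict (struct cache_node *n) { if (n->parent) n->parent->cached_children--; }
```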

4.3.2 Hoard Walking. We say that a cache is in equilibrium, signifying that it meets user expectations about availability, when no uncached object has a higher priority than a cached object. Equilibrium may be disturbed as a result of normal activity. For example, suppose an object, A, is brought into the cache on demand, replacing an object, B. Further suppose that B is mentioned in the HDB, but A is not. Some time after activity on A ceases, its priority will decay below the hoard priority of B. The cache is no longer in equilibrium, since the cached object A has lower priority than the uncached object B.

Venus periodically restores equilibrium by performing an operation known as a hoard walk. A hoard walk occurs every 10 minutes in our current implementation, but one may be explicitly required by a user prior to voluntary disconnection. The walk occurs in two phases. First, the name bindings of HDB entries are reevaluated to reflect update activity by other Coda clients. For example, new children may have been created in a directory whose pathname is specified with the '+' option in the HDB. Second, the priorities of all entries in the cache and HDB are reevaluated, and objects fetched or evicted as needed to restore equilibrium.
Hoard walks also address a problem arising from callback breaks. In traditional callback-based caching, data is refetched only on demand after a callback break. But in Coda, such a strategy may result in a critical object being unavailable should a disconnection occur before the next reference to it. Refetching immediately upon callback break avoids this problem, but ignores a key characteristic of Unix environments: once an object is modified, it is likely to be modified many more times by the same user within a short interval [15, 6]. An immediate refetch policy would increase client-server traffic considerably, thereby reducing scalability.

Our strategy is a compromise that balances availability, consistency and scalability. For files and symbolic links, Venus purges the object on callback break, and refetches it on demand or during the next hoard walk, whichever occurs earlier. If a disconnection were to occur before refetching, the object would be unavailable. For directories, Venus does not purge on callback break, but marks the cache entry suspicious. A stale cache entry is thus available should a disconnection occur before the next hoard walk or reference. The acceptability of stale directory data follows from its particular callback semantics. A callback break on a directory typically means that an entry has been added to or deleted from the directory. It is often the case that the other directory entries and the objects they name are unchanged. Therefore, saving the stale copy and using it in the event of untimely disconnection causes consistency to suffer only a little, but increases availability considerably.
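The compromise can be stated compactly. The sketch below uses our own names for the cache entry fields; only the policy itself (purge files and symbolic links, mark directories suspicious) comes from the description above.

```c
#include <stdbool.h>

enum obj_type { OBJ_FILE, OBJ_SYMLINK, OBJ_DIRECTORY };

struct cache_entry {
    enum obj_type type;
    bool contents_cached;   /* data present in the local cache                */
    bool suspicious;        /* stale, but still usable if disconnection hits  */
};

static void on_callback_break(struct cache_entry *e)
{
    if (e->type == OBJ_DIRECTORY) {
        /* Keep the stale directory data: usually only one entry changed,
         * and keeping it greatly improves availability if a disconnection
         * occurs before the next hoard walk or reference. */
        e->suspicious = true;
    } else {
        /* Files and symbolic links are purged, to be refetched on demand or
         * during the next hoard walk, whichever comes first. */
        e->contents_cached = false;
    }
}
```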
4.4 Emulation

In the emulation state, Venus performs many actions normally handled by servers. For example, Venus now assumes full responsibility for access and semantic checks. It is also responsible for generating temporary file identifiers (fids) for new objects, pending the assignment of permanent fids at reintegration. But although Venus is functioning as a pseudo-server, updates accepted by it have to be revalidated with respect to integrity and protection by real servers. This follows from the Coda policy of trusting only servers, not clients. To minimize unpleasant delayed surprises for a disconnected user, it behooves Venus to be as faithful as possible in its emulation.

Cache management during emulation is done with the same priority algorithm used during hoarding. Mutating operations directly update the cache entries of the objects involved. Cache entries of deleted objects are freed immediately, but those of other modified objects assume infinite priority so that they are not purged before reintegration. On a cache miss, the default behavior of Venus is to return an error code. A user may optionally request Venus to block his processes until cache misses can be serviced.
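The actions just described can be summarized in a sketch. Everything here is schematic: the fid layout, the counter, and the function names are ours, and the replay log append is only indicated by a comment (it is the subject of Section 4.4.1).

```c
#include <stdbool.h>
#include <stdint.h>
#include <math.h>

/* Stand-in fid: temporary fids are assigned locally during emulation and
 * replaced by permanent, server-assigned fids at reintegration. */
struct fid { uint32_t volume, vnode, uniquifier; bool temporary; };

struct cached_obj { struct fid fid; double priority; bool deleted; };

static uint32_t next_temp_vnode = 1;

static struct fid make_temporary_fid(uint32_t volume)
{
    struct fid f = { volume, next_temp_vnode++, 0, true };
    return f;
}

/* Apply a mutating operation while disconnected (access checks elided). */
static void apply_disconnected_mutation(struct cached_obj *o, bool deletes_object)
{
    if (deletes_object)
        o->deleted = true;        /* entry can be freed immediately              */
    else
        o->priority = INFINITY;   /* pin the modified object until reintegration */
    /* ... a record of the operation would also be appended to the volume's
     * replay log here (Section 4.4.1). */
}
```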
4.4.1 Logging. During emulation, Venus records sufficient information to replay update activity when it reintegrates. It maintains this information in a per-volume log of mutating operations called a replay log. Each log entry contains a copy of the corresponding system call arguments as well as the version state of all objects referenced by the call.
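One plausible shape for such a log entry is sketched below. The actual record layout used by Venus is not given at this level of detail in the text, so the fields and sizes here are assumptions; the point is only that each record carries the operation, copies of its system call arguments, and the version state of every object it references.

```c
#include <stdint.h>

enum replay_opcode { OP_STORE, OP_CREATE, OP_REMOVE, OP_RENAME, OP_MKDIR, OP_RMDIR };

struct obj_version { uint64_t version; };       /* version state of one object */

struct replay_record {
    enum replay_opcode    op;
    char                  path[1024];           /* copy of system call arguments */
    uint32_t              mode;
    struct obj_version    objects[4];           /* version state of the objects  */
    unsigned              nobjects;             /* referenced by the call        */
    struct replay_record *next;                 /* per-volume, append-only list  */
};
```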
Venus uses a number of optimizations to reduce the length of the replay log, resulting in a log size that is typically a few percent of cache size. A small log conserves disk space, a critical resource during periods of disconnection. It also improves reintegration performance by reducing latency and server load.

One important optimization to reduce log length pertains to write operations on files. Since Coda uses whole-file caching, the close after an open of a file for modification installs a completely new copy of the file. Rather than logging the open, close, and intervening write operations individually, Venus logs a single store record during the handling of a close.

Another optimization consists of Venus discarding a previous store record for a file when a new one is appended to the log. This follows from the fact that a store renders all previous versions of a file superfluous. The store record does not contain a copy of the file's contents, but merely