On Incremental File System Development
EREZ ZADOK, RAKESH IYER, NIKOLAI JOUKOV, GOPALAN SIVATHANU, AND
CHARLES P. WRIGHT
Developing file systems from scratch is difficult and error prone. Layered, or stackable, file
systems are a powerful technique to incrementally extend the functionality of existing file systems
on commodity OSes at runtime. In this paper, we analyze the evolution of layering from historical
models to what is found in four different present-day commodity OSes: Solaris, FreeBSD, Linux,
and Microsoft Windows. We classify layered file systems into five types based on their functionality
and identify the requirements that each class imposes on the OS. We then present five major design
issues that we encountered during our experience of developing over twenty layered file systems on
four OSes. We discuss how we have addressed each of these issues on current OSes, and present
insights into useful OS and VFS features that would provide future developers more versatile
solutions for incremental file system development.
Categories and Subject Descriptors: D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement–Portability; D.2.13 [Software Engineering]: Reusable Software–Reuse models; D.4.3 [Operating Systems]: File Systems Management–File organization; D.4.7 [Operating Systems]: Organization and Design
General Terms: Design
Additional Key Words and Phrases: Layered File Systems, Stackable File Systems, VFS, Vnode,
I/O Manager, IRP, Extensibility

1. INTRODUCTION
Data management is a fundamental facility provided by the operating system (OS). File
systems are tasked with the bulk of data management, including storing data on disk (or
over the network) and naming (i.e., translating a user-visible name such as /usr/src
into an on-disk object). File systems are complex, and it is difficult to enhance them.
Furthermore, OS vendors are reluctant to make major changes to a file system, because
file system bugs have the potential to corrupt all data on a machine. Because file system
development is so difficult, extending file system functionality in an incremental manner
is valuable. Incremental development also makes it possible for a third-party software
developer to release file system improvements without developing a whole file system
from scratch.
Originally, file systems were thoroughly integrated into the OS, and system calls directly
invoked file system methods. This architecture made it difficult to add multiple file
systems. The introduction of the virtual node, or vnode, provided a layer of abstraction
that separates the core of the OS from file systems [Kleiman 1986].
Each file is represented in memory by a vnode. A vnode has an operations vector that
defines several operations that the OS can call, thereby allowing the OS to add and remove
types of file systems at runtime. Most current OSes use something similar to the vnode
interface, and the number of file systems supported by the OS has grown accordingly.
For example, Linux 2.6 supports over 30 file systems, and many more are maintained
outside of the official kernel tree.
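To make the mechanism concrete, the following minimal C sketch shows a vnode carrying an operations vector; the names (vnodeops, v_op, v_data, VOP_READ) are illustrative rather than taken from any particular kernel.

    /* Minimal sketch of the classic vnode abstraction.  All names here
     * (vnodeops, v_op, v_data, VOP_READ) are illustrative, not from any
     * particular kernel. */
    #include <stddef.h>
    #include <sys/types.h>

    struct vnode;

    struct vnodeops {
        int (*vop_open)(struct vnode *vp, int flags);
        int (*vop_read)(struct vnode *vp, void *buf, size_t len, off_t off);
        int (*vop_write)(struct vnode *vp, const void *buf, size_t len, off_t off);
        /* ...one entry for each operation the OS may invoke... */
    };

    struct vnode {
        const struct vnodeops *v_op; /* this file system's operations vector */
        void *v_data;                /* file-system-private per-file state   */
    };

    /* The OS always calls through the vector, so a file system registered
     * at runtime only has to supply its own vnodeops table. */
    static inline int VOP_READ(struct vnode *vp, void *buf, size_t len, off_t off)
    {
        return vp->v_op->vop_read(vp, buf, len, off);
    }

Because every call is dispatched through v_op, the core OS needs no compiled-in knowledge of any particular file system.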
Clearly defining the interface between the OS and file systems makes interposition possible.
A layered, or stackable, file system creates a vnode with its own operations vector
to be interposed on another vnode. Each time one of the layered file system's operations
is invoked, the layered file system maps its own vnode to a lower-level vnode, and then
calls the lower-level vnode's operation. To add functionality, the layered file system can
perform additional operations before or after the lower-level operation (e.g., encrypting
data before a write or decrypting data after a read). The key advantage of layered file
systems is that they can change the functionality of a commodity OS at runtime, so hard-to-develop
lower-level file systems do not need to be changed. This is important, because OS
developers often resist change, especially to file systems, where bugs can cause data loss.
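Continuing the sketch above, an interposing layer's write operation might look like the following; the layer_node structure and the encrypt_buf() helper are hypothetical.

    /* Sketch of a linear (interposing) layer, continuing the vnode sketch
     * above.  layer_node and encrypt_buf() are hypothetical. */
    #include <errno.h>
    #include <stdlib.h>
    #include <sys/types.h>

    void encrypt_buf(void *dst, const void *src, size_t len); /* hypothetical */

    struct layer_node {
        struct vnode *lower;    /* the vnode this layer interposes on */
    };

    static int layer_write(struct vnode *vp, const void *buf, size_t len, off_t off)
    {
        struct layer_node *ln = vp->v_data;
        void *cipher = malloc(len);
        int err;

        if (cipher == NULL)
            return ENOMEM;
        encrypt_buf(cipher, buf, len);                 /* pre-processing   */
        /* Map this layer's vnode to the lower vnode and call down. */
        err = ln->lower->v_op->vop_write(ln->lower, cipher, len, off);
        free(cipher);
        return err;
    }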
Rosenthal was among the first to propose layering as a method of extending file systems
[Rosenthal 1990; 1992]. To enable layering, Rosenthal radically changed the VFS internals
of SunOS: each public vnode field was converted into a method, and all knowledge
of vnode types (e.g., directory vs. regular file) was removed from the core OS. Researchers
at UCLA independently developed another layering infrastructure [Heidemann and Popek
1991; 1994] that placed an emphasis on light-weight layers and extensibility. The original
pioneers of layering envisioned creating building blocks that could be composed together
to create more sophisticated and rich file systems. For example, the directory-name lookup
cache (DNLC) could simply be implemented as a file system layer, which returns results
on a cache hit, but passes operations down on a miss [Skinner and Wong 1993].
Layering has not commonly been used to create and compose building-block file systems,
but instead has been widely used to add functionality rapidly and portably to existing
file systems. Many applications of layered file systems are features that could be implemented
as part of the VFS (e.g., unification), but for practical reasons it is easier to develop
them as layered file systems. Several OSes have been designed to support layered file systems,
including Solaris, FreeBSD, and Windows. Several layered file systems are available
for Linux, even though it was not originally designed to support them. Many users use layered
file systems unknowingly as part of antivirus solutions [Symantec 2004; Miretskiy
et al. 2004] and Windows XP's system-restore feature [Harder 2001]. On Unix, a null-layer
file system is used to provide support for accessing one directory through multiple
paths. When the layer additionally modifies the data, useful new functionality like encryption
[Corner and Noble 2002; Halcrow 2004] or compression [Zadok et al. 2001] can be
added. Another class of layered file systems, called fan-out, operates directly on top of
several lower-level file systems. For example, unification file systems merge the contents
of several directories [Pendry and McKusick 1995; Wright et al. 2006]. Fan-out file systems
can also be used for replication, load-balancing, failover, snapshotting, and caching.
The authors of this paper have over fifteen years of combined experience developing
layered file systems on four OSes: Solaris, FreeBSD, Linux, and Windows. We have
developed more than twenty layered file systems that provide encryption, compression,
versioning, tracing, antivirus, unification, snapshotting, replication, checksumming, and
more.
The rest of this paper is organized as follows. In Section 2 we survey alternative techniques
to enhance file system functionality. In Section 3 we describe four models of layered
file system development. We then proceed to describe five broad classes of layered file systems
in Section 4. In Section 5 we describe five general problems and their solutions that
are useful for all types of layered file systems. We conclude in Section 6 with guiding
principles for future OS and layered file system developers.

2. RELATED WORK
In this section we describe alternatives for achieving the extensibility offered by layered file
systems. We discuss four classes of related work, based on the level at which extensibility
is achieved: in hardware, in the device driver, at the system-call level, or in user-level
programs. We present a detailed discussion of layered file system infrastructures in Section 3.
Hardware level. Slice [Anderson et al. 2000] is a storage-system architecture for high-speed
networks with network-attached block storage. Slice interposes a piece of code
called a switching filter in the network hardware to route packets among a group of servers.
Slice appears to the upper level as a single block-oriented storage device. High-level control
information (e.g., files) is unavailable to interposition code at the hardware level, so
such code cannot perform optimizations for specific devices.
Semantically-Smart Disk Systems (SDSs) [Sivathanu et al. 2003] attempt to provide
file-system–like functionality without modifying the file system. Knowledge of a specific
file system is embedded into the storage device, and the device provides additional functionality
that would traditionally be implemented in the file system. Such systems are
relatively easy to deploy, because they do not require modifications to existing file system
code. Layered file systems share a similar goal in terms of reusing and leveraging existing
infrastructures. Unlike a layered file system, an SDS is closely tied to the format of the file
system running on top of it, so porting SDSs to new file systems is difficult.
Device-driver level. Software RAID and Logical Volume Managers (LVMs) introduce
another layer of abstraction between the storage hardware and the file system. They provide
additional features such as increased reliability and performance, while appearing to the
file system as a simple disk, which makes them easy to deploy in an existing infrastructure.
For example, on Linux a Cryptoloop device uses a loopback block driver to encrypt data
stored on a disk or in a file. A new file system is then created within the Cryptoloop device.
Any file system can run on top of a block device-driver extension. However, block device
extensions cannot exploit the control information (e.g., names) that is available at the file
system level.
System-call level. SLIC [Ghormley et al. 1998] is a protected extensibility system for
OSes that uses interposition techniques to enable the addition of a large class of untrusted
extensions to existing code. Several OS extensions can be implemented using SLIC, such
as encryption file systems and a protected environment for binary execution. The Interposition
Agents toolkit [Jones 1993], developed by Microsoft Research, allows a user's
code to be written in terms of high-level objects provided by this interface. The toolkit
was designed to ease interposing code between the clients and the instances of the system
interface, to facilitate adding new functionality like file-reference tracing, customizable file
system views, etc. to existing systems. Similarly, Mediating Connectors [Balzer and Goldman
1999] is a system-call (and library-call) wrapper mechanism for Windows NT that
allows users to trap API calls.
System-call interposition techniques rely on the communication channel between user
space and the kernel, and hence cannot handle operations that bypass that channel (e.g.,
mmap operations and their associated page faults). Also, interposing at the system-call
level imposes overhead on all system calls, even if only a subset of kernel components
(e.g., the file system) needs to be interposed.

User level. Gray-box Information and Control Layers (ICLs) [Arpaci-Dusseau and Arpaci-Dusseau
2001] extend the functionality of OSes by acquiring information about their internal
state. ICLs provide OS-like functionality without modifying existing OS code.
Dust [Burnett et al. 2002] is a direct application of ICLs that uses gray-box knowledge
of the OS's buffer-cache management policy to optimize application performance. For example,
if a Web server first services Web pages that are believed to be in the OS buffer
cache, then both average response time and throughput can be improved.
Blaze's CFS is a cryptographic file system that is implemented as a user-level NFS
server [Blaze 1993]. The OS's unmodified NFS client mounts the NFS server over the
loopback network interface. The SFS toolkit [Mazières 2001] aims to simplify Unix file
system extensibility by allowing development of file systems at the user level. Using the
toolkit, one can implement a simple user-level NFS server and redirect local file system
operations into the user-level implementation. The popularity of the SFS toolkit demonstrates
that developers have observed the complexity of modifying existing time-tested file
systems. SiRiUS [Goh et al. 2003], a file system for securing remote untrusted storage,
and Dabek's CFS [Dabek et al. 2001], a wide-area cooperative file system, were built using
the SFS toolkit.
Filesystem in Userspace [Szeredi 2005], or FUSE, is a hybrid approach that consists
of two parts: (1) a standard kernel-level file system, which passes calls to a user-level
daemon, and (2) a library to easily develop file-system–specific FUSE daemons. Developing
new file systems with FUSE is relatively simple because the user-level daemon can issue
normal system calls (e.g., read) to service a VFS call (e.g., vfs_read). The two main
disadvantages of a FUSE file system are that (1) performance is limited by crossing the
user-kernel boundary, and (2) the file system can only use FUSE's API, which closely
matches the VFS API, whereas kernel file systems may access a richer kernel API (e.g.,
for process and memory management).
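As a sketch of this division of labor, the following minimal read-only FUSE daemon uses the FUSE 2.x high-level API; the /hello file and its contents are invented for the example, and a real daemon would also implement readdir, open, and other callbacks.

    /* Minimal sketch of a FUSE daemon (FUSE 2.x high-level API).  The
     * kernel half of FUSE forwards VFS calls to these user-level
     * callbacks.  The /hello file is invented for the example. */
    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/stat.h>

    static const char *msg = "hello from user space\n";

    static int hello_getattr(const char *path, struct stat *st)
    {
        memset(st, 0, sizeof(*st));
        if (strcmp(path, "/") == 0) {
            st->st_mode = S_IFDIR | 0755;
            st->st_nlink = 2;
        } else if (strcmp(path, "/hello") == 0) {
            st->st_mode = S_IFREG | 0444;
            st->st_nlink = 1;
            st->st_size = strlen(msg);
        } else
            return -ENOENT;
        return 0;
    }

    static int hello_read(const char *path, char *buf, size_t size,
                          off_t off, struct fuse_file_info *fi)
    {
        size_t len = strlen(msg);
        if (strcmp(path, "/hello") != 0)
            return -ENOENT;
        if ((size_t)off >= len)
            return 0;
        if (off + size > len)
            size = len - off;
        memcpy(buf, msg + off, size);  /* serviced entirely in user space */
        return (int)size;
    }

    static struct fuse_operations hello_ops = {
        .getattr = hello_getattr,
        .read    = hello_read,
    };

    int main(int argc, char *argv[])
    {
        /* Mount point and options come from the command line. */
        return fuse_main(argc, argv, &hello_ops, NULL);
    }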
Sun designed and developed Spring as an object-oriented microkernel OS. Spring's
architecture allowed various components, including file systems, to be transparently extended
with user libraries [Khalidi and Nelson 1993]. Spring's design was radically different
from current commodity OSes; as it was a research prototype, it was not deployed
in real systems. K42 [Appavoo et al. 2002] is a new OS under development at IBM
that incorporates innovative mechanisms and modern programming technologies. Various
system functions can be extended at the user level through libraries that provide
a microkernel-like interface. The Exokernel architecture [Engler et al. 1995] implements
minimal components in the kernel and allows user-level applications to customize OS functionality
using library OSes.
In general, user-level extensions are easy to implement, but their performance is not as
good as kernel extensions, because the former involve data copies between the user level
and the kernel level, as well as additional context switches.
3. LAYERING MODELS
On Unix, the idea of layering evolved from the vnode interface [Kleiman 1986], initially
implemented in SunOS in 1984. Most of today's layered file system models use a vnode
interface. Microsoft Windows, on the other hand, uses a message-passing layering model.
We describe three Unix models and the Windows model here; these cover the full range
of popular or commodity models available. In Section 3.1 we describe Rosenthal's object-oriented
model. In Section 3.2 we describe the model contemporaneously developed at
UCLA. In Section 3.3 we describe the layering model found in three modern Unix systems
(Linux, FreeBSD, and Solaris). In Section 3.4 we describe the Windows 2000 and Windows
XP message-passing layering model. We present a summary table of the features
offered by each layering model in Section 3.5.
3.1 Rosenthal's Layering Model
In 1990, Rosenthal developed an experimental prototype of a new VFS for SunOS [Rosenthal
1990] with the goal of layering vnodes, so that file systems could be composed of building
blocks. Rosenthal identified two distinct types of layering. The first, shown in Figure 1(a),
is interposition, in which a higher-level vnode is called before the lower-level vnode,
and can modify the lower-level vnode's arguments, operation, and results. Today, this
is commonly called linear layering or linear stacking. The second, shown in Figure 1(b),
is composition, in which a higher-level vnode performs operations on several lower-level
vnodes. Today, this is commonly called fan-out.
[Figure 1 shows three stacking diagrams: (a) Linear, with user calls entering vnode A, which sits on vnode B; (b) Fan out, with vnode A on top of vnodes B and C; and (c) Fan in, with user calls reaching vnode B both through vnode A and directly.]
Fig. 1. Types of layering. In a linear layer, all operations are intercepted by the top-level vnode, A, which passes
them to a single vnode below. In fan-out, all operations go through the top-level vnode, A, but there are two or more
lower-level vnodes. In fan-in, some operations go through the top-level vnode, A, before going to the lower-level
vnode, B, but some operations can bypass A and directly access B.

Rosenthal introduced two key abstractions to support vnode layering: push and pop, for
inserting and removing vnodes from a stack. All vnode operations go through the highest-level
vnode. Rosenthal replaced all visible attributes in the vnode structure with methods
so that layered file systems could intercept them. Push inserts a vnode at the top of a stack
of vnodes, so that all operations destined for the stack go to the new top vnode.
The top vnode may in turn call the vnodes that are below it. Pop performs the opposite
operation: the top vnode of a stack is removed, and operations go directly to the lower-level
vnode. Two new fields were added to the vnode structure: v_top and v_above.
The v_above pointer points to the vnode that is directly above this one in the stack.
The v_top pointer points to the highest-level vnode in the stack. All vnode operations
go through the highest-level vnode, hence every vnode in the stack essentially becomes
an alias to the highest-level vnode. Unfortunately, this prevents fan-in access (shown in
Figure 1(c)), in which a user process accesses the lower-level file system directly. Fan-in
access is necessary when applications need to access unmodified data; for example,
a backup program should write the encrypted data (stored on the lower-level file system) to
tape, not the unencrypted data (accessible through the top layer). Rosenthal also suggested
several layered file system building blocks, but did not implement any of them for his
prototype.
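The following sketch illustrates push and pop under these rules. The v_above and v_top fields follow the description above; the v_lower down-link is our own addition so the sketch can walk the stack, since the text does not say how Rosenthal enumerated stack members.

    /* Sketch of Rosenthal-style push and pop.  v_above and v_top follow
     * the description in the text; v_lower is a hypothetical down-link
     * added so the sketch can walk the stack. */
    struct vnodeops;

    struct vnode {
        const struct vnodeops *v_op;
        struct vnode *v_above;  /* vnode directly above, NULL at the top  */
        struct vnode *v_lower;  /* hypothetical down-link, NULL at bottom */
        struct vnode *v_top;    /* highest-level vnode in this stack      */
    };

    /* Interpose nvp above the current top; afterwards every member's
     * v_top aliases nvp, so all operations enter the stack there. */
    static void vn_push(struct vnode *vp, struct vnode *nvp)
    {
        struct vnode *old_top = vp->v_top;

        old_top->v_above = nvp;
        nvp->v_lower = old_top;
        nvp->v_above = NULL;
        for (struct vnode *p = nvp; p != NULL; p = p->v_lower)
            p->v_top = nvp;
    }

    /* Remove the top vnode; operations then go directly to the vnode
     * that was below it.  Assumes the stack has at least two members. */
    static struct vnode *vn_pop(struct vnode *vp)
    {
        struct vnode *old_top = vp->v_top;
        struct vnode *new_top = old_top->v_lower;

        new_top->v_above = NULL;
        for (struct vnode *p = new_top; p != NULL; p = p->v_lower)
            p->v_top = new_top;
        return old_top;
    }

Because every member's v_top is rewritten on each push and pop, callers holding any vnode in the stack always enter at the current top, which is exactly the aliasing behavior that precludes fan-in access.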
Rosenthal proposed a set of requirements for a layered vnode interface [Rosenthal 1992].
To support interposition, he concluded that the OS must support replacing a vnode's operations
vector and private data for each interposer. Moreover, all vnode operations must
be mediated through the operations vector. To support composition, Rosenthal concluded
that vnodes must support transactions. The vnode interface assumes that each operation
is atomic [Kleiman 1986]. However, to compose two file systems together, two distinct
lower-level vnode operations must be called; thus the overall operation is not atomic.
Later, a vnode interface for SunOS based on Rosenthal's interposition, along with several
example file systems, was developed [Skinner and Wong 1993]. Of particular note is that some
existing VFS operations took multiple vnode arguments, making it impossible to determine
which vnode's operations vector should handle the operation. For example, link would
take two vnode arguments: the parent directory and the vnode to insert into that directory.
To make each vnode control all of its operations, Skinner and Wong decomposed these
larger vnode operations into smaller operations to manage vnode link counts, to update
directory entries, and to create objects. Skinner and Wong's prototype dealt with issues that
Rosenthal did not, including concurrency, locking, and per-mount operations. They also
developed file system components in a layered manner. Skinner and Wong's implementation
of mount and unmount is a particularly elegant example of push and pop: the OS's
name-lookup code no longer needs special code for mount points. The root vnode of the
file system to be mounted is simply pushed on top of the covered vnode. To unmount a file
system, its root vnode is simply popped from the stack. The directory-name–lookup cache
was also implemented as an interposition layer.
3.2 UCLA's Layering Model
In the early 1990s, Heidemann and Popek also developed an infrastructure for layered file
system development at UCLA [Heidemann and Popek 1991; 1994]. The UCLA model
emphasized general VFS extensibility and the ability to compose many small and efficient
building-block layers into more sophisticated file system services. The UCLA model suggests
that each separate service should be its own layer. For example, UFS should be
divided into at least three different layers: (1) managing disk partitions, (2) a file storage
layer providing arbitrary-length files referenced by a unique identifier (i.e., inode number),
and (3) a hierarchical directory component. Breaking up a file system like UFS would
allow interposition at several points. For example, an intrusion-detection layer could be
inserted above the directory component so it has access to names, but a compression layer
could be inserted between the naming layer and the file storage layer to avoid the implementation
details of hard links.
To provide extensibility, the UCLA model does not have a fixed set of operations.
Instead, each file system provides its own set of operations, and the total set of operations is
the union of all file systems' operations. If a file system does not support a given operation,
then a generic routine for that file system is called. This routine could, for example,
simply return an "Operation not supported" error. However, for a layered file system, a
generic bypass routine can be used. To provide operations with arbitrary arguments, the
UCLA interface transforms all of an operation's arguments into a single structure of arguments.
Along with the structure, meta-data is passed to the operation that identifies each
argument's offset and type. Using this meta-data, the bypass routine can generically map
its own vnodes to lower-level vnodes and invoke the lower-level operation.
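A generic bypass routine under this scheme might look like the sketch below; the descriptor layout and the lower_vnode() and vop_invoke() helpers are assumptions for illustration, not the original UCLA interface.

    /* Sketch of a UCLA-style generic bypass routine.  The descriptor
     * layout and the helpers lower_vnode() and vop_invoke() are
     * illustrative, not from the original system. */
    #define VDESC_MAX_VPS 4

    struct vnode;

    struct vnodeop_desc {
        const char *vdesc_name;              /* e.g., "read"              */
        int vdesc_vp_offsets[VDESC_MAX_VPS]; /* byte offsets of the vnode
                                                arguments, ended by -1    */
    };

    struct vop_generic_args {
        struct vnodeop_desc *a_desc;         /* metadata describing call  */
        /* operation-specific arguments follow in each concrete struct   */
    };

    struct vnode *lower_vnode(struct vnode *vp);   /* hypothetical helper */
    int vop_invoke(struct vop_generic_args *ap);   /* hypothetical dispatch */

    static int layer_bypass(struct vop_generic_args *ap)
    {
        /* Replace each of this layer's vnode arguments with the
         * lower-level vnode it wraps... */
        for (int i = 0; i < VDESC_MAX_VPS &&
                        ap->a_desc->vdesc_vp_offsets[i] != -1; i++) {
            struct vnode **vpp = (struct vnode **)
                ((char *)ap + ap->a_desc->vdesc_vp_offsets[i]);
            *vpp = lower_vnode(*vpp);
        }
        /* ...then re-invoke the same operation one level down. */
        return vop_invoke(ap);
    }

One such routine can thus forward any operation, even operations defined by file systems that did not exist when the layer was written.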
The UCLA model uses mount and unmount to instantiate a layered file system. The
existing mount operation fits well for two reasons: (1) it creates a new subtree of objects
within the file system name space, and (2) creation of subtrees usually requires an existing
file system object (e.g., to mount a UFS file system you need a device node). A file-system
layer often accesses an existing directory in much the same way that UFS uses a device.
For example, to mount an encryption file system, the path to the encrypted data is used.
Caching is essential to provide good performance, but it can pose problems if two or more
layers cache an object. For example, if two layers concurrently cache the same page or
vnode and one modifies it, then the other would read stale data. Similarly, if both
of them modify the page or vnode, then one of them would lose its update. Heidemann
developed a prototype general-purpose cache manager that provided coherent access to all
file system layers [Heidemann and Popek 1995]. The UCLA cache manager requires layers
to request either shared or exclusive ownership of an object before caching it. When
required, the cache manager revokes the existing layers' ownership using a callback. In
most cases, the cache manager automatically determines when two objects represent the
same data. However, if layers may change the semantics of the data, then the cache manager
cannot correlate the upper-level and lower-level file's position. In these cases, the
cache manager treats the upper-level and lower-level objects as two distinct objects, and
all ownership requests for either object are sent to the layer that changes the semantics.
The semantic-changing layer can then invoke the correct callbacks based on its knowledge
of the relationship between the original and transformed data.
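The ownership protocol might be expressed as an interface along the following lines; all of the names are invented for illustration and are not from Heidemann's implementation.

    /* Sketch of an ownership interface in the spirit of the UCLA cache
     * manager.  All names are invented for illustration. */
    enum cm_mode { CM_SHARED, CM_EXCLUSIVE };

    struct cm_object;   /* one cacheable entity (a page or a vnode) */

    /* Callback the cache manager uses to revoke a layer's ownership;
     * the layer must write back and discard its cached copy. */
    typedef void (*cm_revoke_cb)(struct cm_object *obj, void *layer_data);

    /* A layer must acquire ownership before caching an object.  The
     * manager first revokes conflicting ownership held by other layers:
     * every holder for CM_EXCLUSIVE, exclusive holders for CM_SHARED. */
    int cm_acquire(struct cm_object *obj, enum cm_mode mode,
                   cm_revoke_cb revoke, void *layer_data);

    /* Release ownership once the layer no longer caches the object. */
    void cm_release(struct cm_object *obj);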
The Ficus replicated file system and several course projects (e.g., encryption and compression
layers) were developed using the UCLA model [Guy et al. 1990].

3.3 Layering in Modern Unix Systems
In this section we describe layering in three modern Unix-based OSes: Solaris, FreeBSD,
and Linux. We discuss their pros and cons in terms of ease of development and performance.
All three systems use the vnode interface to enable layering functionality, but they
have key implementation differences that influence development ease and performance.
Solaris. The architecture of the Solaris VFS is nearly identical to the classic vnode
architecture. Each vnode has a fixed operations vector. Each operation must be implemented
by every file system (undefined operations are not permitted). Generic operations
for returning "operation not supported," or in some cases success, are available. Mutable
attributes such as size, access time, etc. are not part of the vnode's fields; instead, the vnode
operations include functions for managing such attributes.
The Solaris loopback file system, Lofs [SMCC 1992], is a null pass-through layer that
layers only on directory vnodes.
Thus, for regular files on Lofs, vnode objects are not duplicated. Because Lofs does not
layer on file vnodes, the data pages of files are not duplicated either, resulting in more
efficient memory utilization. However, this makes it harder to extend Lofs to provide
functionality like encryption, where two different page contents need to be maintained at
the lower level and the layered level.
CacheFS [SunSoft 1994] is a layered fan-out file system in Solaris that can mount on
top of two lower-level directories: the source directory and the cache directory. When files
in the source directory are accessed through CacheFS, they are copied into the root of the
cache directory and indexed using a numeric identifier.

FreeBSD. The FreeBSD vnode interface has extensibility based on the UCLA model.
FreeBSD allows dynamic addition of vnode operations. While activating a file system,
the kernel registers the set of vnode operations that the file system supports, and builds an
operations vector for each file system that contains the union of all operations supported
by any file system. File systems provide default routines for operations that they do not
support. FreeBSD uses packed arguments, so layers can generically bypass operations as
in the UCLA model.
FreeBSD's version of the loopback file system is called nullfs [Pendry and McKusick
1995]. It is a simple file system layer that makes no transformations on its arguments, just
passing the requests it receives to the lower file system. FreeBSD's union mounts [Pendry
and McKusick 1995] logically merge the directories that they are mounted on.
Compared to Solaris (and Linux), FreeBSD has the most versatile support for layering
at the file system level. Its architecture allows new vnode operations to be created dynamically.
Packing function arguments in an argument structure allows layers to easily bypass
operations that they do not need to intercept.

Linux. Linux does not have native support for layered file systems. The Linux VFS was
designed to make adding file systems relatively simple by separating generic file system
code from the individual file system implementations. A file system may choose not to
implement a method, and the VFS itself provides a sensible default. For example, a file
system can use generic routines for most data-oriented operations (i.e., reading from and
writing to files, including via mmap), and only needs to provide two primitive operations
to read and write a block of the file. This trend of extracting more and more abstractions
is on the rise in new versions of the Linux kernel. Layered file systems are usually not
able to use these generic methods, because they need to pass the operation to the lower-level file
system; generic operations, however, do not make it more difficult to develop layered file systems.
On the other hand, providing functionality directly within the VFS does make it more difficult
to implement layered file systems. As shown in Figure 2, a layered file system must appear
just like the VFS to the lower-level file system [Zadok and Bădulescu 1999], so it must
replicate this VFS functionality. Worse, as the VFS changes, the layered file system must
also change, or subtle bugs could be introduced.
[Figure 2 shows a layered file system between the VFS and the lower-level file system: its upper interface exhibits file-system behavior, and its lower interface exhibits VFS-like behavior.]
Fig. 2. On Unix, layered file systems consist of two halves: the top half must behave like a file system, and the
bottom half must behave like the VFS.
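The sketch below shows what these two halves look like for a single operation in a Wrapfs-style pass-through layer on a relatively recent Linux kernel; wrapfs_lower_file() is a hypothetical helper that returns the lower struct file kept in the upper file's private data.

    /* Sketch of a pass-through read in a Wrapfs-like Linux layer.
     * wrapfs_lower_file() is a hypothetical helper returning the lower
     * struct file kept in the upper file's private data. */
    #include <linux/fs.h>

    struct file *wrapfs_lower_file(const struct file *file);

    static ssize_t wrapfs_read(struct file *file, char __user *buf,
                               size_t count, loff_t *ppos)
    {
        struct file *lower = wrapfs_lower_file(file);
        ssize_t err;

        /* Toward the lower file system, behave like the VFS: */
        err = vfs_read(lower, buf, count, ppos);
        /* Toward the VFS, behave like a file system: copy up attributes
         * the lower layer may have changed (here, the access time). */
        if (err >= 0)
            file_inode(file)->i_atime = file_inode(lower)->i_atime;
        return err;
    }

The attribute copy in the last step is exactly the kind of VFS-like bookkeeping that the layer must track as the VFS evolves.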
On Linux, the vnode object is divided into two components: a dentry, which represents
the name of the file, and an inode, which represents the actual on-disk file. This separation
simplifies support for hard links. On Linux, VFS operation vectors are fixed: operations
cannot be added, and the prototype of existing operations cannot be changed. The Linux
VFS stores mutable values inside the inode object. This makes it more difficult to maintain
cache coherency between the inodes of the layered file system and the lower-level inodes.
It also causes performance degradation in layered file systems, as the modified values in the
lower-level objects need to be copied to the upper-level objects after each file operation.
The existing VFS API requires modifications to avoid this data redundancy. In particular,
the data fields should not be directly accessible from outside a file system. Instead, reads
and writes of these private fields should be performed by calling file-system methods. Our
evaluation, described in Appendix A, shows that the overheads of these method calls would
be negligible.
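For example, the VFS could export accessor methods along these lines; this is a hypothetical sketch, not an existing Linux interface, and lower_inode() is an invented helper.

    /* Hypothetical accessor-style interface; not an existing Linux API.
     * The VFS would read and write per-inode attributes only through
     * such methods, rather than touching inode fields directly. */
    #include <linux/fs.h>

    struct inode_attr_ops {
        loff_t (*get_size)(struct inode *inode);
        void   (*set_size)(struct inode *inode, loff_t size);
    };

    struct inode *lower_inode(const struct inode *inode);  /* hypothetical */

    /* A layered file system would simply delegate to the lower inode,
     * so the upper-level value can never go stale. */
    static loff_t layer_get_size(struct inode *inode)
    {
        return i_size_read(lower_inode(inode));
    }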
FiST [Zadok 2001] is a high-level layered file system definition language that we developed.
FiST includes Wrapfs [Zadok et al. 1999], a layered vnode interface implementation
for Linux, FreeBSD, and Solaris. Wrapfs is a simple pass-through layer that
intercepts name and data operations. The VFS objects of each entity (file or directory)
are duplicated at the Wrapfs level. On Linux, many file systems have been derived from
Wrapfs, including ones for encryption [Corner and Noble 2002; Halcrow 2004; Shanahan