`
`;
`
`2, Arun M. Krishnakumar1, Chaitanya Patti1, Abhishek Rai1,
`Nikolai Joukov1
`Sunil Satnur1, Avishay Traeger1, and Erez Zadok1
`1 Stony Brook University,
`2 IBM T.J. Watson Research Center
`
Abstract

Storage virtualization and data management are well-known problems for individual users as well as large organizations. Existing storage-virtualization systems either do not support a complete set of possible storage types, do not provide flexible data-placement policies, or do not support per-file conversion (e.g., encryption). This results in suboptimal utilization of resources, inconvenience, low reliability, and poor performance.

We have designed a stackable file system called Redundant Array of Independent Filesystems (RAIF). It combines the data survivability and performance benefits of traditional RAID with the flexibility of composition and ease of development of stackable file systems. RAIF can be mounted on top of directories and thus on top of any combination of network, distributed, disk-based, and memory-based file systems. Individual files can be replicated, striped, or stored with erasure-correction coding on any subset of the underlying file systems.

RAIF has performance similar to RAID's. In configurations with parity, RAIF's write performance is better than that of driver-level and even entry-level hardware RAID systems, because RAIF has better control over data and parity caching.
`
1 Introduction

The importance of files varies greatly. If lost, some files can be easily regenerated, some can be regenerated with difficulty, and some cannot be regenerated at all. The loss of some files, such as financial data, can be catastrophic. Therefore, files must be stored with varying levels of redundancy using data replication or erasure-correction codes. File storage is a complicated set of trade-offs between data protection, performance, data-management convenience, and special infrastructure requirements. For example, some files must be shared by many users, whereas some are only available locally on read-only media; some files must be encrypted; and some can be efficiently compressed. At the same time, every possible storage subcomponent, such as a disk or a remote server, has a set of unique features. Therefore, an ideal storage-virtualization system must support most physical and logical storage devices, as well as support various per-file data-placement policies and per-file special features such as compression and encryption.
Hardware-virtualization systems are limited to specific types of lower storage devices. Driver-level storage-virtualization systems support more device types, but the number of local and SAN device types is still limited. Device-level and driver-level storage-virtualization systems, such as hardware and software RAID [34], operate on raw data blocks without knowledge of meta-data. This means that they use the same storage policies for all files. At most, they try to use statistical information about data accesses in an attempt to increase performance [47]. This also means that information on individual disks in an array has no meaning, and little or no information can be extracted if too many disks fail. File-system-level storage-virtualization systems still support only limited types of storage; many support just one. This is because they have to explicitly support specific NAS and other protocols as well as special features such as compression and encryption.
Stackable file systems are a useful and well-known technique for adding functionality to existing file systems [53, 54]. They allow incremental addition of features and can be dynamically loaded as external kernel modules. Stackable file systems overlay another, lower file system, intercept file system events and data bound from user processes to the lower file system, and in turn use the lower file system's operations and data. Most stackable file systems in the past have assumed a simple one-to-one mapping: the stackable file system was layered on top of one lower directory on a single file system. A different class of stackable file systems that use a one-to-many mapping (fan-out) has previously been suggested [16, 38] and was recently developed from the FiST [48] templates.
We have developed a fan-out RAID-like stackable file system called Redundant Array of Independent Filesystems (RAIF). RAIF is a storage-virtualization system that imposes virtually no restrictions on the underlying stores and allows versatile per-file and per-directory storage policies. RAIF derives its usefulness from three main features: flexibility of configuration, access to high-level information, and easier administration.

First, because RAIF is stackable, it can be mounted over directories and, thus, over any combination of lower file systems. For example, it can be mounted over network file systems like NFS and Samba, AFS distributed file systems, and local file systems, all at the same time. RAIF leverages the well-maintained code and protocols of existing file systems. In fact, RAIF can even utilize future file systems and protocols. Stackable file systems can be mounted on top of each other. Existing stackable file systems offer features like encryption [50], data-integrity verification [27], virus checking [32], versioning [33], tracing [3], and compression [51]. These file systems can be mounted over RAIF as well as below it.
Second, because RAIF operates at the file system level, it has access to high-level file system meta-data that is not available to traditional block-level storage-virtualization systems. This meta-data allows the optimization of redundancy schemes, data placement, and read-ahead algorithms. For example, RAIF can stripe large multimedia files across different branches for performance, and concurrently use two parity pages for important financial data files that must survive even two failures. Dynamic RAIF-level migration allows one to optimize RAIF reliability, resource utilization, and performance.

Third, administration is easier because files are stored on unmodified lower-level file systems. Therefore, the size of these lower file systems can be changed, and they can be backed up easily using standard software. The data is easier to recover in the case of failure because it is stored in files with the same names as their upper-level counterparts. For example, if the files are replicated on several disks, then every such disk will contain an ordinary file system with the same files intact.

A common limitation of software RAIDs is that data recovery is slow and may require a restart from the very beginning if it is interrupted. RAIF performs storage recovery on a per-file basis and can even do it in the background, concurrently with normal activity. If irreparable inconsistencies are detected, RAIF can identify the specific files affected, which is more user-friendly than the lists of block numbers offered by block-level systems.

Like other data striping and replication systems, RAIF can improve performance for multi-process workloads. Even for I/O-intensive single-process workloads that require many synchronous operations, RAIF's performance is comparable with the performance of driver-level software RAID systems. For RAIF configurations similar to standard RAID4 and RAID5, RAIF can outperform driver-level and even some hardware implementations thanks to better control over file system caches. Moreover, most other storage-virtualization systems choose a single data-placement policy, which is not optimal for all files. RAIF can optimize performance for all files because it can use better data-placement policies for individual files or groups of files.

The rest of the paper is organized as follows. Section 2 outlines the design of RAIF. Section 3 describes some interesting implementation details. Section 4 presents an evaluation of RAIF. Section 5 discusses related work. We conclude and outline future directions in Section 6.
2 Design

Existing storage-virtualization systems are limited in three ways: (1) they only virtualize a limited class of storage devices; (2) they support one or a small number of data-placement policies, such as replication on a subset of lower storage devices or striping with parity; and (3) they provide few (usually no) per-file data management features such as compression, consistency validation, or encryption. Even so, existing storage-virtualization systems are complex, expensive, and hard to maintain. This is because traditional storage-virtualization systems must explicitly support different types of lower storage and features.

Stackable file systems are a technique for layering new functionality on top of existing file systems. As seen in Figure 1, a stackable file system's methods are called by the Virtual File System (VFS) as with other file systems, but in turn it uses another file system instead of performing operations on a backing store such as a disk or an NFS server [35, 53]. Before calling the lower-level file system's methods, a stackable file system can modify the operation or the data. Stackable file systems behave like normal file systems from the perspective of the VFS; from the perspective of the underlying file system, they behave like the VFS. Fan-out stackable file systems differ from linear stackable file systems in that they call multiple underlying file systems, or branches.

[Figure 1: Linear file system stacking. The figure shows a user process's rename() call descending through vfs_rename() in the VFS, wrapfs_rename() in the stackable Wrapfs, and ext2_rename() in Ext2.]

Redundant Array of Independent Filesystems (RAIF) is a stackable fan-out file system. We designed it with flexibility in mind. RAIF can be mounted over any set of file systems. This means that it does not have to support all possible storage types and protocols explicitly: it can leverage the code of dozens of existing file systems and NAS protocols. RAIF will even be able to reuse future file systems and protocols without modification. If necessary, RAIF can still be mounted over standard RAID arrays or any other storage-virtualization system. In addition, existing and future stackable file systems can be mounted above RAIF or over any lower branches. This allows one to add important extra features such as compression or encryption in a selective way.
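To make the fan-out idea concrete, here is a minimal user-level sketch (ours, in Python; RAIF itself is a kernel file system, and the branch paths are made up) of how a fan-out layer forwards one operation to every branch, while a linear stackable file system is simply the one-branch case:

    import os

    class FanoutFS:
        # Illustrative fan-out stacking: one upper layer, many lower
        # "branches", each of which is a directory on some file system
        # that is already mounted (NFS, Ext3, tmpfs, and so on).
        def __init__(self, branches):
            self.branches = branches

        def unlink(self, name):
            # A fan-out operation is forwarded to every branch that
            # holds a replica or a stripe of the file.
            for b in self.branches:
                path = os.path.join(b, name)
                if os.path.exists(path):
                    os.unlink(path)

    # A linear stackable file system (Figure 1) is the degenerate case
    # with exactly one branch: FanoutFS(["/mnt/ext2"]).
    raif_like = FanoutFS(["/mnt/cdrom", "/mnt/disk0", "/mnt/nfs"])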
`
`
`
Figure 2 shows an example RAIF file system mounted over several different types of lower storage. In this figure, letters below RAIF denote the branch labels as follows:

C  A read-only CD-ROM file system.

E  A set of Ext3 file systems mounted over local disks.

V  A versioning stackable file system mounted over an NFS client.

N  An encryption stackable file system mounted over an NFS client. All data stored on this remote branch is automatically encrypted.

G  A TmpFS Linux file system that stores data in virtual memory or in the swap area. Because virtual memory may be scarce, the branch data is compressed with a stackable compression file system.

[Figure 2: A possible combination of RAIF fan-out stacking and other file systems stacked linearly. Letters below RAIF denote the branch labels: iso9660 on a CD-ROM (C), Ext3 on local disks (E), VersionFS over NFS (V), NCryptFS over NFS (N), and GzipFS over TmpFS (G).]

2.1 RAIF Data Placement

RAIF is designed at a high level of abstraction and leverages well-maintained, external code. Therefore, we were able to concentrate on storage-virtualization issues rather than support for different types of lower storage devices, placement of individual blocks on these devices, and NAS or SAN protocols. RAIF duplicates the directory structure on all of the lower branches. The data files are stored using different RAIF modes that we call levels, analogous to standard RAID levels:
• RAIF0: The file is striped over a set of lower file systems. The striping unit may be different for different files. As we describe below, this level also allows us to distribute entire files across branches.

• RAIF1: The file is replicated over a set of branches. Note that we define RAIF1 slightly differently from the original RAID level 1 [34]. The original RAID level 1 definition corresponds to RAIF01.

• RAIF4: The file is striped as in RAIF0, but a dedicated branch is used for parity. This level is useful for heterogeneous sets of branches, for example if the parity branch is much faster than the others and the workload has many writes, or if the parity branch is much slower and the workload is read-biased.

• RAIF5: This level is similar to RAIF4, but the parity is rotated through different branches.

• RAIF6: In RAIF6, extra parity branches are used to recover from two or more simultaneous failures.

• RAIF01: This level is a mirrored RAIF0 array.

• RAIF10: Similarly to RAIF01, a mirrored array is striped over a set of branches.
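For the parity levels (RAIF4, RAIF5, and RAIF6), parity is computed as in conventional RAID: the parity unit is the XOR of the data units in a stripe, so any one lost unit can be rebuilt from the survivors. A minimal sketch of this standard computation (ours, not RAIF code):

    def xor_units(units):
        # Parity is the byte-wise XOR of all striping units in a stripe.
        out = bytearray(len(units[0]))
        for u in units:
            for i, byte in enumerate(u):
                out[i] ^= byte
        return bytes(out)

    stripe = [b"unit-one", b"unit-two", b"unit-3!!"]   # equal-sized units
    parity = xor_units(stripe)

    # If one branch fails, XOR of the surviving units and the parity
    # reconstructs the lost unit.
    lost = 1
    survivors = [u for i, u in enumerate(stripe) if i != lost] + [parity]
    assert xor_units(survivors) == stripe[lost]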
`
Small files may occupy only a portion of a single stripe. To distribute space utilization and accesses among branches, we start the stripes of different files on different branches. We use the term starting branch to denote the branch where the file's first data page is located. By default, the starting branch is derived from the file name using a hash function. Also, users may specify starting branches explicitly.

Another use of the starting branch is to randomize the data placement of entire files. In particular, one may use RAIF0 with a large striping unit size. In this case, every small file will be placed entirely on its starting branch, whereas large files will be striped. For RAIF levels 4, 5, and 6, small files will be stored on their starting branches and their parity will be stored on parity branches. Essentially, small files will be automatically replicated on two or three branches, whereas large files will be striped. Also, it is possible to specify a special striping width value (-1) to force files to be stored on their starting branch independently of their size. This is useful to replicate data on two random branches (level 5) or one random and one or two fixed branches (levels 4 and 6).
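The following sketch (ours; the choice of CRC32 as the hash is an assumption, since the paper only says the starting branch is derived from the file name with a hash) shows how the starting branch and the striping unit determine where each page of a RAIF0 file lands, and why a very large striping unit leaves small files whole on their starting branch:

    import zlib

    def starting_branch(name, nbranches):
        # The starting branch is derived from the file name with a hash.
        return zlib.crc32(name.encode()) % nbranches

    def branch_of_page(name, page, unit_pages, nbranches):
        # RAIF0-like striping: consecutive striping units go to
        # consecutive branches, beginning at the starting branch.
        unit = page // unit_pages
        return (starting_branch(name, nbranches) + unit) % nbranches

    # With a small striping unit a large file spreads over all branches;
    # with a very large unit a small file stays on its starting branch.
    for page in range(8):
        print(page, branch_of_page("a.txt", page, unit_pages=2, nbranches=4))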
`
RAIF can assign levels and striping unit sizes to files and directories individually, or by applying regular expressions to file names. For example, if users' home directories are to be stored on a RAIF mounted over a set of branches as shown in Figure 2, one may use the following global rules:

• Use RAIF1 to replicate *.c, *.h, *.doc, and other important files on all local drives (branches E0...EN) and branch V. This way, important files are replicated locally, stored remotely, and also stored with versions, all at the same time.
• Use RAIF4 to stripe large multimedia files on all local drives and keep parity data on the remote NFS server (branch N). This allows us to save precious space on the local drives and still keep parity to recover files in case of a single disk failure. Note that multimedia files are usually not changed frequently; therefore, the remote server is seldom contacted for common read operations.

• Use branch G for intermediate files (e.g., *.o). This way these files are available for repeated compilations but are purged if not used for a while.

• Use RAIF1 over branches C and G to provide the illusion of writing to a read-only file system. As we will describe in Section 2.3, RAIF provides a simplified unification functionality [48]. For example, if a CD-ROM contains source code, one may compile it directly from the CD-ROM. The resulting files will be put onto the G branch, and users will see a merged set of files in a single directory.

• Use RAIF0 or RAIF5 on the E branches for all other files, as they are either regenerable or unimportant.

In addition, individual users may be allowed to define special rules on a per-file or per-directory basis.
`
2.2 RAIF Rules

The meta-information about every file includes the file's starting branch, RAIF level, striping unit, and the set of branches used to store the file. Since the file name is the only information available for a file LOOKUP operation, RAIF needs to maintain a mapping of file names to meta-information in order to perform LOOKUP operations.

In an early RAIF prototype [23], we used a hash function to find the branch with per-file meta-information. However, this approach is not suitable for a storage-virtualization system in which any subset of branches can be used to store a file: RAIF needs to know the file storage configuration in order to determine the subset of branches used to store the file. RAIF's meta-information storage is based on two assumptions: first, we assume that RAIF meta-information does not change frequently; second, we assume that the number of storage policies is much smaller than the number of files. Therefore, every storage policy is associated with a regular expression to match file names. A regular expression, a RAIF level, a striping unit size, and a set of branches together define a RAIF rule. A lookup, then, consists of matching a file name with a rule and looking up the file in the returned set of branches. Rules are stored in special files on a subset of the lower branches. These files are not visible from above RAIF and are read at directory lookup time. This approach allows RAIF to decrease both system time and I/O overheads. Directories with no rules contain no rule files, which is the common case. Also, storing rule files close to the files they relate to increases data locality on the lower branches and thus improves performance. RAIF supports the following three types of rules:
• Global rules are inherited by lower directories. They can be added or removed only manually.

• Local rules are not inherited by lower directories and have higher priority than global rules. Generally, these rules are necessary for exception handling. For example, they can be used to define special policies for an individual file. However, the most common use of local rules is to handle RENAME and rule-change operations, as we will describe in Section 2.2.2.

• Meta rules specify policies for storing RAIF rule files. This way, these special rule files can be replicated on all or only a subset of branches to balance reliability and performance. Also, meta rules are useful if some branches are mounted read-only. Finally, this allows one to dedicate branches to storing RAIF rules. For example, such branches can be networked file systems to allow sharing of the rules between servers. Dedicated branches are thus similar to dedicated meta-data servers in cluster storage systems [1]. Regular expressions in meta rules are matched only against lower directory names.
Let us again consider the example RAIF configuration shown in Figure 2, and assume that we want to store a small directory tree as shown in Figure 3. As we discussed in Section 2.1, we may want to have a global rule for the topmost directory, raif, to stripe all multimedia files on the local hard disks and keep their parity on a remote branch (a RAIF4 configuration). However, one may want to add a special local rule to the dir1 directory to replicate birth.avi on all the local drives and a remote server, because that file contains especially precious data.

[Figure 3: A sample directory tree. /raif contains top.avi and the directories dir1 and dir2; dir1 contains birth.avi and bday.avi; dir2 contains a.txt and party.avi.]
2.2.1 Rule Matching

Upon a file LOOKUP operation, RAIF matches the file name against the regular expressions in all relevant rule files. Rules in every directory are arranged sequentially according to their priority. Local rules have the highest priority and are tried first. Global rules in every directory have precedence over the global rules of the parent directory. This way, every rule has a unique priority value. Therefore, the rule-matching process proceeds by scanning the rules sequentially, starting from the highest-priority local rule in the current directory and finishing at the lowest-priority global rule from the RAIF mount-point directory. Because we assume that rules are not changed frequently, RAIF can use finite-state machines (e.g., using the Aho-Corasick pattern-matching algorithm [2]) to avoid sequential scanning.
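A simplified rendering of a rule and of this priority order (ours; the class and field names are illustrative): local rules of the current directory are tried first, then global rules from the current directory upward to the mount point:

    import re
    from dataclasses import dataclass, field

    @dataclass
    class Rule:
        regex: str        # matched against file names
        level: int        # RAIF level (0, 1, 4, 5, 6, ...)
        unit: int         # striping unit size
        branches: tuple   # subset of branches that store the file

    @dataclass
    class Directory:
        local_rules: list = field(default_factory=list)    # not inherited
        global_rules: list = field(default_factory=list)   # inherited downward

    def match_rule(name, dir_chain):
        # dir_chain[0] is the current directory; the list continues up
        # to the RAIF mount point.
        for rule in dir_chain[0].local_rules:       # highest priority
            if re.fullmatch(rule.regex, name):
                return rule
        for d in dir_chain:                          # nearer directories win
            for rule in d.global_rules:
                if re.fullmatch(rule.regex, name):
                    return rule
        return None                                  # no rule: default policy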
Let us now consider the LOOKUP operation in more detail. First, RAIF matches the file name against all relevant rules using the finite automaton for the current directory. Out of the returned set of rules, RAIF selects the highest-priority rule and looks up the file on one of the branches according to that rule. After this LOOKUP operation, RAIF knows whether the lower file exists and whether it is a regular file or a directory. If the file exists, RAIF looks up the rest of the branches. In the case of a directory LOOKUP, RAIF reads and parses its rules if they exist. If the file being looked up does not exist, RAIF returns from the LOOKUP operation. The rest of the branches are looked up later, upon calls to file or directory CREATE functions, because this is the earliest opportunity for RAIF to know the type of these files. This allows RAIF to minimize the number of LOOKUP calls sent to the lower file systems.
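A self-contained sketch of that flow (ours; real RAIF operates on VFS objects, not dictionaries):

    class Branch:
        # A toy lower branch: maps names to "file", "dir", or nothing.
        def __init__(self, entries):
            self.entries = entries
        def lookup(self, name):
            return self.entries.get(name)      # None models a negative lookup

    def raif_lookup(name, branches):
        # One LOOKUP on a branch chosen by the matched rule.
        kind = branches[0].lookup(name)
        if kind is None:
            # Negative lookup: the other branches are visited later, by
            # CREATE, once the type (file vs. directory) becomes known.
            return None
        for b in branches[1:]:                 # the file exists: complete it
            b.lookup(name)
        if kind == "dir":
            pass                               # here RAIF parses the rule files
        return kind

    b0 = Branch({"a.txt": "file", "dir1": "dir"})
    b1 = Branch({"a.txt": "file", "dir1": "dir"})
    assert raif_lookup("a.txt", [b0, b1]) == "file"
    assert raif_lookup("missing", [b0, b1]) is None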
`
2.2.2 Rule Management

Rules for groups of files allow RAIF to operate with minimal overheads for common file system operations, because only a minimal amount of extra meta-information must be maintained in memory and on the lower branches. This is possible because RAIF rules are kept intact most of the time. Let us consider the three situations in which rule changes are required.

Rename operations change the name of the file. Therefore, the rule that matches a file may be different before and after it is renamed. Note that this is not likely, because most renames do not change the logical meaning of the file being renamed. For example, it is unlikely that the source file main.c would be renamed to a video file main.avi. However, the location of the starting branch may change even in the common case, which is important if the file is striped. Therefore, if the old storage policy does not correspond to the new file name, RAIF has two options: (1) add a local rule for the new file (the regular expression field is the actual new file name), or (2) store the file according to its new name. RAIF adds a local rule for large files and moves the data for small files. The file-size threshold is configurable. Upon an UNLINK operation on a file, the corresponding local rule is deleted if it exists.
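The rename decision can be summarized as follows (a sketch under our naming; the paper does not give RAIF's actual threshold value or interfaces):

    import re

    SMALL_FILE_THRESHOLD = 1 << 20    # configurable in RAIF; 1 MB is our guess

    def after_rename(old_rule, new_rule, new_name, size, add_local_rule, migrate):
        if new_rule == old_rule:
            return                    # common case: the policy did not change
        if size > SMALL_FILE_THRESHOLD:
            # Large file: keep the data in place and pin the old policy
            # with a local rule whose regular expression is the literal
            # new file name.
            add_local_rule(regex=re.escape(new_name), policy=old_rule)
        else:
            # Small file: cheap to move, so store it according to the
            # rule that its new name matches.
            migrate(new_name, old_rule, new_rule)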
Moving a directory may require an update of the rules for that directory. For example, Figure 4 shows the directory tree of Figure 3 with directory dir2 moved into dir1. We can see that if dir1 has any rules, then these rules could conflict with the placement of the files in the old dir2. In that case, RAIF would add the necessary global rules to dir2. Note that the number of rules to add is usually small. In the worst case, the global rules from the directories above may require an addition to a single directory.

[Figure 4: The same directory tree with dir2 moved inside dir1.]
Manual addition, removal, and modification of rules by users is not a frequent operation. However, it may result in conflicts if existing files match the new rule and the new rule does not correspond to the old storage policy. Again, RAIF offers two possible options: (1) moving the files' data; and (2) adding local rules for the affected files. The first option may be useful for storage reconfiguration. Rule changes usually correspond to administrators' desire to modify the storage configuration. For example, an administrator may want to increase the storage capacity and add another local disk to the RAIF configuration shown in Figure 2. In that case, administrators will change the existing rule to stripe data across more branches, and may want to migrate the data for all existing files as well. The second option, adding local rules, corresponds to the standard technique of lazy storage migration. In particular, new files will be stored according to the new rules, whereas old files will be gradually migrated if deleted or completely overwritten.
Rename and directory-movement operations both require changes to rules or files in only one directory. Rule additions, removals, and modifications may require recursive operations and are therefore performed partially in user mode. In particular, a user-level utility traverses the directory tree bottom-up and gradually updates it by calling the kernel-level directory rule-update routines. Per-directory updates are not recursive, and RAIF performs them in the kernel to avoid races. As an example rule-update scenario, consider the directory tree shown in Figure 3. Let us assume that we want to add a global rule to directory raif with the regular expression a.txt. First, the user-level program for rule addition will issue an ioctl call to ask the kernel component to add the rule to directory dir1. Because that directory has no files that match the regular expression a.txt, RAIF will just add that global rule to dir1 and return. Second, the user-level program will do the same for dir2, which has a file that matches the new regular expression. Depending on the file size or additional program flags, RAIF either adds a local rule for that file or moves its data. (Storage policies must be changed only in that case.) If RAIF moves the data, it first creates local rules for every matched file in the directory to ensure that the files' data is always consistent with the rules. Only after doing this does RAIF remove these temporary local rules, add a new global rule to dir2, and return. Third, the user-level program calls RAIF to add a rule to directory raif. Finally, RAIF adds the global rule to raif and removes the temporary global rules from dir1 and dir2.
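The user-level half of this procedure can be sketched as a bottom-up walk; the paper specifies an ioctl interface but not its request codes, so raif_add_rule below is a hypothetical stand-in for that per-directory kernel call:

    import os

    def raif_add_rule(directory, rule):
        # Hypothetical wrapper for RAIF's ioctl-based rule update. Each
        # call updates a single directory inside the kernel, so it cannot
        # race with concurrent file system activity in that directory.
        print("ioctl(RAIF_ADD_RULE)", directory, rule)

    def add_global_rule(mount_point, rule):
        # Visit children before parents (bottom-up), mirroring the
        # example: dir1 and dir2 are updated before raif itself.
        for dirpath, _dirs, _files in os.walk(mount_point, topdown=False):
            raif_add_rule(dirpath, rule)

    add_global_rule("/mnt/raif", "a\\.txt -> RAIF1 on all branches")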
2.3 Merging and Unification Semantics

RAIF levels that stripe files can be used to merge several small file systems together to form a single, larger file system. Similar to Chunkfs [19], such merging decreases the total consistency-checking time. RAIF can merge any number of file systems without having to modify their source code or on-disk data layouts. RAIF can also store parity locally or remotely to increase data reliability.

RAIF's design ensures that files stored on several branches are visible as a unified name-space from above. This allows RAIF to join several file systems together, similar to Unionfs [48]. Thus, one can run programs that require read-write access to the directory structure from a read-only CD-ROM or a DVD-ROM. For example, one may create rules to store rule files and all new files on the writable branches only. This way, one can compile programs from read-only media directly, or temporarily store useful files on RAM-based file systems.
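The unified view can be pictured as a union of per-branch directory listings, with a name that appears on several branches (a replica or a striped file) collapsing into a single entry; a minimal sketch, assuming rule files are hidden by a file name of our invention:

    import os

    def raif_readdir(branches, subdir="."):
        # RAIF duplicates the directory tree on all branches, so the
        # unified listing is the union of the per-branch name sets.
        names = set()
        for b in branches:
            d = os.path.join(b, subdir)
            if os.path.isdir(d):
                names.update(os.listdir(d))
        # Rule files are not visible from above RAIF; ".raifrc" is a
        # made-up name for them.
        return sorted(n for n in names if n != ".raifrc")

    # Sources on a read-only CD-ROM merged with objects built on tmpfs:
    print(raif_readdir(["/mnt/cdrom/src", "/tmp/build"]))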
2.4 RAIF Administration

RAIF's rule-changing process is useful in many cases, such as when adding or removing RAIF branches, or when increasing or decreasing available storage space. However, the rule-changing process may be lengthy, so it is wise to estimate the storage-space changes beforehand. RAIF has user-level utilities for this purpose.

RAIF has a user-level Perl script that accepts high-level descriptions of file types and information about the relative importance of files. Given this information, the script can change rules and even migrate data to increase or decrease storage-space utilization. For example, an administrator could use the script to free N gigabytes of space when necessary [55]. This script can perform automatic storage reconfiguration if started periodically by a cron job. For example, RAIF can store all files with replication while space is plentiful, but gradually reduce the redundancy of data down to the lowest allowed level as space becomes scarce.

Also, user-mode tools can initiate delayed conversion of file storage policies. For example, RAIF can create local rules for files affected by renaming or rule-changing operations during times of high system load. Later, the cron daemon can initiate conversion of these files during periods in which the system is underutilized.

This design allows RAIF to be simple in the kernel and operate with clear and deterministic rules. At the same time, more sophisticated and abstract policies can be processed and defined at the user level.
2.4.1 RAIF Backup and Security

RAIF stores files on the lower file systems as regular files. This provides three main benefits. First, it allows incremental backups on a per-branch basis. These backups can be restored on individual failed branches. Second, RAIF files have the same owners and attributes as their counterparts on the lower branches. Similarly, rule files have the same owners and access permissions as their parent directories. This means that users not allowed to write to a directory are not allowed to create rules in that directory, because they cannot create rule files there. Third, administrators can set up quotas for the lower file systems and control the usage of these individual branches. This could be used to discourage users from setting up rules to replicate all their files.
2.5 Performance

RAIF is a file system, and as such it is located above the file system caches. Therefore, it can efficiently use all available memory for data and meta-data caching. For example, both data and parity are automatically cached in the page cache [13]. This allows RAIF to outperform driver-level and even some hardware RAID controllers (with small caches) under a write-oriented workload.