`
Proceedings of the FAST 2002 Conference on File and Storage Technologies
`
`Monterey, California, USA
`January 28-30, 2002
`
`
`
`
`Venti: a new approach to archival storage
`
`Sean Quinlan and Sean Dorward
`Bell Labs, Lucent Technologies
`
`Abstract
`
`This paper describes a network storage system, called
`Venti, intended for archival data. In this system, a
`unique hash of a block’s contents acts as the block
`identifier for read and write operations. This approach
`enforces a write-once policy, preventing accidental or
`malicious destruction of data. In addition, duplicate
`copies of a block can be coalesced, reducing the
consumption of storage and simplifying the
`implementation of clients. Venti is a building block for
`constructing a variety of storage applications such as
`logical backup, physical backup, and snapshot file
`systems.
`
`We have built a prototype of the system and present
`some preliminary performance results. The system uses
`magnetic disks as the storage technology, resulting in
`an access time for archival data that is comparable to
`non-archival data. The feasibility of the write-once
`model for storage is demonstrated using data from over
`a decade’s use of two Plan 9 file systems.
`
`1. Introduction
`
`Archival storage is a second class citizen. Many
`computer environments provide access to a few recent
`versions of the information stored in file systems and
`databases, though this access can be tedious and may
`require the assistance of a system administrator. Less
`common is the ability for a user to examine data from
`last month or last year or last decade. Such a feature
`may not be needed frequently, but when it is needed it
`is often crucial.
`
`The growth in capacity of storage technologies exceeds
`the ability of many users to generate data, making it
`practical to archive data in perpetuity. Plan 9, the
`computing environment that the authors use, includes a
`file system that stores archival data to an optical
`jukebox [16, 17]. Ken Thompson observed that, for our
`usage patterns, the capacity of the jukebox could be
`considered infinite. In the time it took for us to fill the
`jukebox, the improvement in technology would allow
`us to upgrade to a new jukebox with twice the capacity.
`
`Abundant storage suggests that an archival system
`impose a write-once policy. Such a policy prohibits
`either a user or administrator from deleting or
`modifying data once it is stored. This approach greatly
`reduces the opportunities for accidental or malicious
`data loss and simplifies the system’s implementation.
`
`Moreover, our experience with Plan 9 is that a write-
`once policy changes the way one views storage.
`Obviously, some data is temporary, derivative, or so
`large that it is either undesirable or impractical to retain
`forever and should not be archived. However, once it is
`decided that the data is worth keeping, the resources
`needed to store the data have been consumed and
`cannot be reclaimed. This eliminates the task of
`periodically “cleaning up” and deciding whether the
`data is still worth keeping. More thought is required
`before storing the data to a write-once archive, but as
`the cost of storage continues to fall, this becomes an
`easy decision.
`
`This paper describes the design and implementation of
`an archival server, called Venti. The goal of Venti is to
`provide a write-once archival repository that can be
`shared by multiple client machines and applications. In
`addition, by using magnetic disks as the primary
`storage technology, the performance of the system
`approaches that of non-archival storage.
`
`2. Background
`
`A prevalent form of archival storage is the regular
`backup of data to magnetic tape [15]. A typical scenario
`is to provide backup as a central service for a number of
`client machines. Client software interfaces with a
`database or file system and determines what data to
`back up. The data is copied from the client to the tape
`device, often over a network, and a record of what was
`copied is stored in a catalog database.
`
`Restoring data from a tape backup system can be
`tedious and error prone. The backup system violates the
access permissions of the file system, requiring a system
`administrator or privileged software to perform the task.
`Since they are tedious, restore operations are infrequent
`and problems with the process may go undetected.
Potential sources of error abound: tapes are mislabeled or reused or lost, drives wander out of alignment and
`cannot read their old tapes, technology becomes
`obsolete.
`
`For tape backup systems, a tradeoff exists between the
`performance of backup and restore operations [1]. A
`full backup simplifies the process of restoring data
`since all the data is copied to a continuous region on the
`tape media. For large file systems and databases,
`incremental backups are more efficient to generate, but
`such backups are not self-contained; the data for a
`restore operation
`is
`scattered
`across multiple
`incremental backups and perhaps multiple tapes. The
`conventional solution is to limit the extent of this
`scattering by performing a full backup followed by a
`small number of incremental backups.
`
`File systems such as Plan 9 [16, 17], WAFL [5], and
`AFS [7] provide a more unified approach to the backup
`problem by
`implementing a snapshot feature. A
`snapshot is a consistent read-only view of the file
`system at some point in the past. The snapshot retains
`the file system permissions and can be accessed with
`standard tools (ls, cat, cp, grep, diff) without special
`privileges or assistance from an administrator. In our
`experience, snapshots are a relied-upon and frequently-
`used resource because they are always available and
`easy to access.
`
Snapshots avoid the tradeoff between full and incremental backups. Each snapshot is a complete file system tree, much like a full backup. The implementation, however, resembles an incremental
`backup because the snapshots and the active file system
`share any blocks that remain unmodified; a snapshot
`only requires additional storage for the blocks that have
`changed. To achieve reasonable performance, the
`device that stores the snapshots must efficiently support
`random access, limiting the suitability of tape storage
`for this approach.
`
In the WAFL and AFS systems, snapshots are
`ephemeral; only a small number of recent versions of
`the file system are retained. This policy is reasonable
`since the most recent versions of files are the most
`useful. For these systems, archival storage requires an
`additional mechanism such as tape backup.
`
`The philosophy of the Plan 9 file system is that random
`access storage is sufficiently cheap that it is feasible to
`retain snapshots permanently. The storage required to
`retain all daily snapshots of a file system is surprisingly
`modest; later in the paper we present statistics for two
`file servers that have been in use over the last 10 years.
`
`Like Plan 9, the Elephant file system [18] retains many
`versions of data. This system allows a variety of storage
`reclamation policies that determine when a version of a
`file should be deleted. In particular, “landmark”
`versions of files are retained permanently and provide
`an archival record.
`
`3. The Venti Archival Server
`
`Venti is a block-level network storage system intended
`for archival data. The interface to the system is a simple
`protocol that enables client applications to read and
`write variable sized blocks of data. Venti itself does not
`provide the services of a file or backup system, but
`rather the backend archival storage for these types of
`applications.
`
`Venti identifies data blocks by a hash of their contents.
`By using a collision-resistant hash function with a
`sufficiently large output, it is possible to consider the
`hash of a data block as unique. Such a unique hash is
`called the fingerprint of a block and can be used as the
`address for read and write operations. This approach
`results in a storage system with a number of interesting
`properties.
`
`As blocks are addressed by the fingerprint of their
`contents, a block cannot be modified without changing
`its address; the behavior is intrinsically write-once. This
`property distinguishes Venti from most other storage
`systems, in which the address of a block and its
`contents are independent.
`
`Moreover, writes are idempotent. Multiple writes of the
`same data can be coalesced and do not require
`additional storage space. This property can greatly
`increase the effective storage capacity of the server
`since it does not rely on the behavior of client
`applications. For example, an incremental backup
`application may not be able to determine exactly which
`blocks have changed,
resulting in unnecessary
`duplication of data. On Venti, such duplicate blocks
`will be discarded and only one copy of the data will be
`retained. In fact, replacing the incremental backup with
`a full backup will consume the same amount of storage.
`Even duplicate data from different applications and
`machines can be eliminated if the clients write the data
`using the same block size and alignment.
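
To make these semantics concrete, here is a minimal in-memory sketch of a content-addressed, write-once store. This is our illustration, not the Venti protocol; the class and method names are invented for the example.

    import hashlib

    class VentiLikeStore:
        """Toy sketch of a content-addressed, write-once block store."""
        def __init__(self):
            self.blocks = {}   # fingerprint -> data

        def write(self, data: bytes) -> bytes:
            fp = hashlib.sha1(data).digest()   # 20-byte fingerprint
            # Writes are idempotent: storing the same data twice is
            # coalesced and consumes no additional space.
            self.blocks.setdefault(fp, data)
            return fp

        def read(self, fp: bytes) -> bytes:
            data = self.blocks[fp]
            # Integrity check: the fingerprint of the returned data
            # must match the requested address.
            assert hashlib.sha1(data).digest() == fp
            return data

Writing an identical block always yields the identical fingerprint and stores nothing new, which is exactly the coalescing behavior described above.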
`
`The hash function can be viewed as generating a
`universal name space for data blocks. Without
`cooperating or coordinating, multiple clients can share
`this name space and share a Venti server. Moreover, the
block level interface places few restrictions on the structures and format that clients use to store their data. In contrast, traditional backup and archival systems require more centralized control. For example, backup systems include some form of job scheduler to serialize access to tape devices and may only support a small number of predetermined data formats so that the catalog system can extract pertinent meta-data.

Venti provides inherent integrity checking of data. When a block is retrieved, both the client and the server can compute the fingerprint of the data and compare it to the requested fingerprint. This operation allows the client to avoid errors from undetected data corruption and enables the server to identify when error recovery is necessary.

Using the fingerprint of a block as its identity facilitates features such as replication, caching, and load balancing. Since the contents of a particular block are immutable, the problem of data coherency is greatly reduced; a cache or a mirror cannot contain a stale or out-of-date version of a block.

3.1. Choice of Hash Function

The design of Venti requires a hash function that generates a unique fingerprint for every data block that a client may want to store. Obviously, if the size of the fingerprint is smaller than the size of the data blocks, such a hash function cannot exist since there are fewer possible fingerprints than blocks. If the fingerprint is large enough and randomly distributed, this problem does not arise in practice. For a server of a given capacity, the likelihood that two different blocks will have the same hash value, also known as a collision, can be determined. If the probability of a collision is vanishingly small, we can be confident that each fingerprint is unique.

It is desirable that Venti employ a cryptographic hash function. For such a function, it is computationally infeasible to find two distinct inputs that hash to the same value [10]. This property is important because it prevents a malicious client from intentionally creating blocks that violate the assumption that each block has a unique fingerprint. As an additional benefit, using a cryptographic hash function strengthens a client’s integrity check, preventing a malicious server from fulfilling a read request with fraudulent data. If the fingerprint of the returned block matches the requested fingerprint, the client can be confident the server returned the original data.

Venti uses the Sha1 hash function [13] developed by the US National Institute of Standards and Technology (NIST). Sha1 is a popular hash algorithm for many security systems and, to date, there are no known collisions. The output of Sha1 is a 160 bit (20 byte) hash value. Software implementations of Sha1 are relatively efficient; for example, a 700 MHz Pentium 3 can compute the Sha1 hash of 8 Kbyte data blocks in about 130 microseconds, a rate of 60 Mbytes per second.

Are the 160 bit hash values generated by Sha1 large enough to ensure the fingerprint of every block is unique? Assuming random hash values with a uniform distribution, a collection of n different data blocks, and a hash function that generates b bits, the probability p that there will be one or more collisions is bounded by the number of pairs of blocks multiplied by the probability that a given pair will collide, i.e.

    $p \le \frac{n(n-1)}{2} \cdot \frac{1}{2^b}$

Today, a large storage system may contain a petabyte (10^15 bytes) of data. Consider an even larger system that contains an exabyte (10^18 bytes) stored as 8 Kbyte blocks (~10^14 blocks). Using the Sha1 hash function, the probability of a collision is less than 10^-20. Such a scenario seems sufficiently unlikely that we ignore it and use the Sha1 hash as a unique identifier for a block. Obviously, as storage technology advances, it may become feasible to store much more than an exabyte, at which point it may be necessary to move to a larger hash function. NIST has already proposed variants of Sha1 that produce 256, 384, and 512 bit results [14]. For the immediate future, however, Sha1 is a suitable choice for generating the fingerprint of a block.
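
The bound is easy to evaluate for the exabyte example; the following few lines of Python (our illustration) reproduce the numbers quoted above.

    # Collision bound p <= n(n-1)/2 * 1/2^b for an exabyte of data
    # stored as 8 Kbyte blocks with a 160 bit hash.
    n = 10**18 // (8 * 1024)      # about 1.2 * 10^14 blocks
    b = 160
    p = n * (n - 1) / 2 / 2**b
    print(p)                      # about 5e-21, i.e. below 10^-20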

3.2. Choice of Storage Technology

When the Plan 9 file system was designed in 1989, optical jukeboxes offered high capacity with respectable random access performance and thus were an obvious candidate for archival storage. The last decade, however, has seen the capacity of magnetic disks increase at a far faster rate than optical technologies [20]. Today, a disk array costs less than the equivalent capacity optical jukebox and occupies less physical space. Disk technology is even approaching tape in cost per bit.

`Magnetic disk storage is not as stable or permanent as
`optical media. Reliability can be
`improved with
`technology such as RAID, but unlike write-once optical
`disks, there is little protection from erasure due to
`failures of the storage server or RAID array firmware.
`This issue is discussed in Section 7.
`
`Using magnetic disks for Venti has the benefit of
reducing the disparity in performance between
`conventional and archival storage. Operations that
`previously required data to be restored to magnetic disk
`can be accomplished directly from
`the archive.
`Similarly, the archive can contain the primary copy of
`often-accessed read-only data. In effect, archival data
`need not be further down the storage hierarchy; it is
`differentiated by the write-once policy of the server.
`
`4. Applications
`
`Venti is a building block on which to construct a variety
of storage applications. Venti provides a large repository for data that can be shared by many clients,
`much as tape libraries are currently the foundation of
`many centralized backup systems. Applications need to
`accommodate the unique properties of Venti, which are
`different from traditional block level storage devices,
`but these properties enable a number of interesting
`features.
`
`Applications use the block level service provided by
`Venti to store more complex data structures. Data is
`divided into blocks and written to the server. To enable
`this data to be retrieved, the application must record the
`fingerprints of these blocks. One approach is to pack
`the fingerprints into additional blocks, called pointer
`blocks, that are also written to the server, a process that
`can be repeated recursively until a single fingerprint is
`obtained. This fingerprint represents the root of a tree of
`blocks and corresponds to a hierarchical hash of the
`original data.
`
`A simple data structure for storing a linear sequence of
`data blocks is shown in Figure 1. The data blocks are
`located via a fixed depth tree of pointer blocks which
`itself is addressed by a root fingerprint. Applications
`can use such a structure to store a single file or to
`mimic the behavior of a physical device such as a tape
`or a disk drive. The write-once nature of Venti does not
`allow such a tree to be modified, but new versions of
`the tree can be generated efficiently by storing the new
`or modified data blocks and reusing the unchanged
`sections of the tree as depicted in Figure 2.
`
[Figure 1. A tree structure for storing a linear sequence of blocks. The root fingerprint H(P0) names pointer block P0, whose entries H(P1), H(P2), ... name further pointer blocks; their entries H(D0), H(D1), ... name the data blocks D0, D1, ....]

[Figure 2. Building a new version of the tree. A new root H(P3) names pointer block P3, which reuses H(P1) for the unchanged subtree and points to a new pointer block P4 covering the changed tail of the sequence (D6, D7, D8).]
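
A sketch of how an application might build the tree of Figure 1, assuming the write operation of the store sketched in Section 3. The block and fingerprint sizes follow the paper, but the code is our illustration, not an actual client implementation.

    import hashlib

    BLOCK = 8 * 1024            # assumed data block size
    FP = 20                     # bytes in a Sha1 fingerprint
    PER_PTR = BLOCK // FP       # fingerprints per pointer block

    def write_tree(store, data: bytes) -> bytes:
        """Store data as a tree of pointer blocks; return the root
        fingerprint."""
        # Write the data blocks and collect their fingerprints.
        fps = [store.write(data[i:i + BLOCK])
               for i in range(0, len(data), BLOCK)]
        if not fps:
            fps = [store.write(b"")]
        # Pack fingerprints into pointer blocks, repeating until a
        # single root fingerprint remains.
        while len(fps) > 1:
            fps = [store.write(b"".join(fps[i:i + PER_PTR]))
                   for i in range(0, len(fps), PER_PTR)]
        return fps[0]

Rewriting a modified file repeats this process, but unchanged data blocks and unchanged pointer blocks hash to the same fingerprints and are coalesced by the server, which is how the sharing shown in Figure 2 arises.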
`
`By mixing data and fingerprints in a block, more
`complex data structures can be constructed. For
`example, a structure for storing a file system may
`include three types of blocks: directory, pointer, and
`data. A directory block combines the meta information
`for a file and the fingerprint to a tree of data blocks
`containing the file’s contents. The depth of the tree can
`be determined from the size of the file, assuming the
`pointer and data blocks have a fixed size. Other
`structures are obviously possible. Venti’s block-level
interface leaves the choice of format to client
`applications and different data structures can coexist on
`a single server.
`
The following sections describe three applications that
`use Venti as an archival data repository: a user level
`archive utility called vac, a proposal for a physical level
`backup utility, and our preliminary work on a new
`version of the Plan 9 file system.
`
`
`
`
`4.1. Vac
`
`Vac is an application for storing a collection of files and
`directories as a single object, similar in functionality to
`the utilities tar and zip. With vac, the contents of the
`selected files are stored as a tree of blocks on a Venti
`server. The root fingerprint for this tree is written to a
`vac archive file specified by the user, which consists of
`an ASCII representation of the 20 byte root fingerprint
`plus a fixed header string, and is always 45 bytes long.
`A corresponding program, called unvac, enables the
`user to restore files from a vac archive. Naturally,
`unvac requires access to the Venti server that contains
`the actual data, but in most situations this is transparent.
`For a user, it appears that vac compresses any amount
`of data down to 45 bytes.
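
The arithmetic works out as follows: a 20-byte fingerprint rendered as 40 hex digits, preceded by a 4-character header and followed by a newline, totals 45 bytes. The "vac:" header string in this sketch is our assumption for illustration, not a documented format.

    def write_vac_archive(path: str, root_fp: bytes) -> None:
        # 4-character header + 40 hex digits + newline = 45 bytes.
        with open(path, "w") as f:
            f.write("vac:" + root_fp.hex() + "\n")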
`
`An important attribute of vac is that it writes each file
`as a separate collection of Venti blocks, thus ensuring
`that duplicate copies of a file will be coalesced on the
`server. If multiple users vac the same data, only one
`copy will be stored on the server. Similarly, a user may
`repeatedly vac a directory over time and even if the
`contents of the directory change, the additional storage
`consumed on the server will be related to the extent of
`the changes rather than the total size of the contents.
`Since Venti coalesces data at the block level, even files
`that change may share many blocks with previous
`versions and thus require little space on the server; log
`and database files are good examples of this scenario.
`
`On many Unix systems, the dump utility is used to back
`up file systems. Dump has the ability to perform
`incremental backups of data; a user specifies a dump
`level, and only files that are new or have changed since
`the last dump at this level are written to the archive. To
`implement incremental backups, dump examines the
`modified time associated with each file, which is an
`efficient method of filtering out the unchanged files.
`
`Vac also implements an incremental option based on
`the file modification times. The user specifies an
`existing vac file and this archive is used to reduce the
`number of blocks written to the Venti server. For each
`file, vac examines the modified time in both the file
`system and the vac archive. If they are the same, vac
`copies the fingerprint for the file from the old archive
into the new archive. Copying just the 20-byte fingerprint enables the new archive to include the entire file without reading the data from the file system or writing the data across the network to the Venti server.
`In addition, unlike an incremental dump, the resulting
`archive will be identical to an archive generated without
`the incremental option; it is only a performance
`
`improvement. This means there is no need to have
`multiple levels of backups, some incremental, some
`full, and so restore operations are greatly simplified.
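
A sketch of the modification-time check, with hypothetical helpers: old_archive maps a path to the (mtime, fingerprint) pair recorded in the existing vac file, and write_tree is the tree writer sketched in Section 4.

    import os

    def archive_file(path, old_archive, store):
        mtime = os.stat(path).st_mtime
        old = old_archive.get(path)
        if old is not None and old[0] == mtime:
            # Copy just the 20-byte fingerprint; the file's data is
            # neither read from disk nor sent to the server.
            return old[1]
        with open(path, "rb") as f:
            return write_tree(store, f.read())   # full write path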
`
`A variant of the incremental option improves the
`backup of files without reference to modification times.
`As vac reads a file, it computes the fingerprint for each
`block. Concurrently, the pointer blocks of the old
`archive are examined to determine the fingerprint for
`the block at the same offset in the old version of the
`file. If the fingerprints are the same, the block does not
`need to be written to Venti. Instead, the fingerprint can
`simply be copied into the appropriate pointer block.
`This optimization reduces the number of writes to the
`Venti server, saving both network and disk bandwidth.
`Like the file level optimization above, the resulting vac
`file is no different from the one produced without this
`optimization. It does, however, require the data for the
`file to be read and is only effective if there are a
`significant number of unchanged blocks.
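
The per-block variant can be sketched the same way: compare each block’s fingerprint against the entry at the same offset in the old archive’s pointer blocks, and write only on a mismatch. This is illustrative code, not the vac source.

    import hashlib

    def write_block_incremental(store, block: bytes, old_fp: bytes) -> bytes:
        fp = hashlib.sha1(block).digest()
        if fp == old_fp:
            return fp          # unchanged: no network or disk write
        return store.write(block)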
`
`4.2. Physical backup
`
`Utilities such as vac, tar, and dump archive data at the
`file or logical level: they walk the file hierarchy
`converting both data and meta-data into their own
`internal format. An alternative approach is block-level
or physical backup, in which the disk blocks that make up the file system are directly copied without
`interpretation. Physical backup has a number of benefits
`including simplicity and potentially much higher
`throughput [8]. A physical backup utility for file
`systems that stores the resulting data on Venti appears
`attractive, though we have not yet implemented such an
`application.
`
`The simplest form of physical backup is to copy the raw
contents of one or more disk drives to Venti. The
`backup also includes a tree of pointer blocks, which
`enables access to the data blocks. Like vac, the end
`result is a single fingerprint representing the root of the
`tree; that fingerprint needs to be recorded outside of
`Venti.
`
`Coalescing duplicate blocks is the main advantage of
`making a physical backup to Venti rather than copying
`the data to another storage medium such as tape. Since
`file systems are inherently block based, we expect
`coalescing to be effective. Not only will backups of a
`file system over time share many unchanged blocks, but
`even file systems for different machines that are
`running the same operating system may have many
`blocks in common. As with vac, the user sees a full
`
`Page 6 of 14
`
`
`
`backup of the device, while retaining the storage space
`advantages of an incremental backup.
`
`One enhancement to physical backup is to copy only
`blocks that are actively in use in the file system. For
`most file system formats it is relatively easy to
`determine if a block is in use or free without walking
`the file system hierarchy. Free blocks generally contain
`the remnants of temporary files that were created and
`removed in the time between backups and it is
advantageous not to store such blocks. This
`optimization requires that the backup format be able to
`represent missing blocks, which can easily be achieved
`on Venti by storing a null value for the appropriate
`entry in the pointer tree.
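
A sketch of such a backup pass, with hypothetical helpers read_block and in_use standing in for the file-system-specific parts:

    def backup_device(store, read_block, in_use, nblocks):
        entries = []
        for i in range(nblocks):
            if in_use(i):
                entries.append(store.write(read_block(i)))
            else:
                entries.append(None)   # null entry: free block, not stored
        # The entries would then be packed into a tree of pointer
        # blocks, as in Figure 1, yielding a single root fingerprint.
        return entries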
`
`The random access performance of Venti is sufficiently
`good that it is possible to use a physical backup without
`first restoring it to disk. With operating system support,
`it is feasible to directly mount a backup file system
`image from Venti. Access to this file system is read
`only, but it provides a natural method of restoring a
`subset of files. For situations where a full restore is
`required, it might be possible to do this restore in a lazy
`fashion, copying blocks from Venti to the file system as
`needed, instead of copying the entire contents of the file
`system before resuming normal operation.
`
`The time to perform a physical backup can be reduced
`using a variety of incremental techniques. Like vac, the
`backup utility can compute the fingerprint of each block
`and compare this fingerprint with the appropriate entry
in the pointer tree of a previous backup. This optimization reduces the number of writes to the Venti server. If the file system provides information about which blocks have changed, as is the case with WAFL, the backup utility can avoid even reading the
`unchanged blocks. Again, a major advantage of using
`Venti is that the backup utility can implement these
`incremental techniques while still providing the user
`with a full backup. The backup utility writes the new
`blocks to the Venti server and constructs a pointer tree
`with the appropriate fingerprint for the unchanged
`blocks.
`
`4.3. Plan 9 File system
`
`When combined with a small amount of read/write
`storage, Venti can be used as the primary location for
`data rather than a place to store backups. A new version
`of the Plan 9 file system, which we are developing,
`exemplifies this approach.
`
`Previously, the Plan 9 file system was stored on a
`combination of magnetic disks and a write-once optical
`jukebox. The jukebox furnishes the permanent storage
`for the system, while the magnetic disks act as a cache
`for the jukebox. The cache provides faster file access
`and, more importantly, accumulates the changes to the
`file system during the period between snapshots. When
`a snapshot is taken, new or modified blocks are written
`from the disk cache to the jukebox.
`
`The disk cache can be smaller than the active file
`system, needing only to be big enough to contain the
`daily changes to the file system. However, accesses that
`miss the cache are significantly slower since changing
`platters in the jukebox takes several seconds. This
`performance penalty makes certain operations on old
`snapshots prohibitively expensive. Also, on the rare
`occasions when the disk cache has been reinitialized
`due to corruption, the file server spends several days
`filling the cache before performance returns to normal.
`
`The new version of the Plan 9 file system uses Venti
`instead of an optical jukebox as its storage device.
`Since the performance of Venti is comparable to disk,
`this substitution equalizes access both to the active and
`to the archival view of the file system. It also allows the
`disk cache to be quite small; the cache accumulates
`changes to the file system between snapshots, but does
`not speed file access.
`
`5. Implementation
`
`We have implemented a prototype of Venti. The
`implementation uses an append-only log of data blocks
`and an index that maps fingerprints to locations in this
`log. It also includes a number of features that improve
`robustness and performance. This section gives a brief
`overview of the implementation. Figure 3 shows a
`block diagram of the server.
`
[Figure 3. A block diagram of the Venti prototype. Clients, including a file system, reach the Venti server over the network; the server places a block cache and an index cache in front of the data log and the index.]
`
`Since Venti is intended for archival storage, one goal of
`our prototype is robustness. The approach we have
`
taken is to separate the storage of data blocks from the index used to locate a block. In particular, blocks are stored in an append-only log on a RAID array of disk drives. The simplicity of the append-only log structure eliminates many possible software errors that might cause data corruption and facilitates a variety of additional integrity strategies. A separate index structure allows a block to be efficiently located in the log; however, the index can be regenerated from the data log if required and thus does not have the same reliability constraints as the log itself.

[Figure 4. The format of the data log. The log is a sequence of arenas (arena 0, 1, 2, ...). Each arena holds a header, an append-only section of data blocks, a directory, and a trailer. Each stored block is prefixed by a block header containing magic, fingerprint, type, size, user, wtime, encoding, and esize fields; the directory records each block header along with its offset.]
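
This division of labor can be sketched as follows: writes append to the log and update the index, and the index could always be rebuilt by rescanning the log. This is an illustrative in-memory model, not the on-disk format.

    import hashlib

    class LogStore:
        def __init__(self):
            self.log = bytearray()   # stand-in for the RAID-backed append-only log
            self.index = {}          # fingerprint -> (offset, size)

        def write(self, data: bytes) -> bytes:
            fp = hashlib.sha1(data).digest()
            if fp not in self.index:          # duplicates are coalesced
                self.index[fp] = (len(self.log), len(data))
                self.log += data              # append only; never overwritten
            return fp

        def read(self, fp: bytes) -> bytes:
            off, size = self.index[fp]
            return bytes(self.log[off:off + size])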
`
`The structure of the data log is illustrated in Figure 4.
`To ease maintenance, the log is divided into self-
`contained sections called arenas. Each arena contains a
`large number of data blocks and is sized to facilitate
`operations such as copying to removable media. Within
an arena is a section for data blocks that is filled in an
`append-only manner. In Venti, data blocks are variable
`sized, up to a current limit of 52 Kbytes, but since
`blocks are immutable they can be densely packed into
`an arena without fragmentation.
`
`Each block is prefixed by a header that describes the
`contents of the block. The primary purpose of the
`header is to provide integrity checking during normal
`operation and to assist in data recovery. The header
`includes a magic number, the fingerprint and size of the
block, the time when the block was first written, and the identity of the user that wrote it. The header also includes a user-supplied type identifier, which is explained in Section 7. Note that only one copy of a given block is stored in the log; thus the user and wtime fields
`correspond to the first time the block was stored to the
`server.
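
The header fields just listed can be summarized in a sketch; the field types here are illustrative and do not reflect the on-disk encoding.

    from dataclasses import dataclass

    @dataclass
    class BlockHeader:
        magic: int          # marks the start of a block in the log
        fingerprint: bytes  # 20-byte Sha1 hash of the block's data
        type: int           # user-supplied type identifier (Section 7)
        size: int           # size of the data before compression
        user: str           # identity of the first writer
        wtime: int          # time the block was first written
        encoding: int       # compression algorithm, if any
        esize: int          # size of the data after compression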
`
Before storing a block in the log, an attempt is made to compress its contents. The inclusion of data compression increases the effective capacity of the archive and is simple to add given the log structure. Obviously, some blocks are incompressible. The encoding field in the block header indicates whether the data was compressed and, if so, the algorithm used. The esize field indicates the size of the data after compression, enabling the location of the next block in the arena to be determined. The downside of using compression is the computational cost, typically resulting in a decrease in the rate that blocks can be stored and retrieved. Our prototype uses a custom Lempel-Ziv ’77 [21] algorithm that is optimized for speed. Compression is not a performance bottleneck for our existing server. Future implementations may benefit from hardware solutions.
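
A sketch of the encoding step; zlib stands in here for the prototype’s custom Lempel-Ziv ’77 coder, which we do not reproduce.

    import zlib

    def encode_block(data: bytes):
        compressed = zlib.compress(data, 1)   # level 1: favor speed
        if len(compressed) < len(data):
            return 1, compressed   # encoding = compressed; esize = len(compressed)
        return 0, data             # incompressible: stored verbatim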
`
`In addition to a log of data blocks, an arena includes a
`header, a directory, and a trailer. The header identifies
`the arena. The directory contains a copy of the block
`header and offset for every block in the arena. By
`replicating the headers of all the blocks in one relatively
`small part of the arena, the server can rapidly check or
`rebuild the system’s global block index. The directory
`also facilitates error recovery if part of the arena is
`destroyed or corrupted. The trailer summarizes the
`current state of the arena itself, including the number of
`blocks and the size of the log. Within the arena, the data
`log and the directory start at opposite ends and grow
`towards each other. When the arena is filled, it is
`marked as sealed, and a fingerprint is computed for the
`contents of the entire arena. Sealed arenas are never
`modified.
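
Sealing can be sketched in one step: once an arena fills, its entire contents are fingerprinted and the arena becomes immutable (illustrative only).

    import hashlib

    def seal_arena(arena: bytes) -> bytes:
        # The arena fingerprint covers the header, data log, directory,
        # and trailer; a sealed arena is never modified again.
        return hashlib.sha1(arena).digest()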
`
`
`
`
Table 1. The performance of read and write operations in Mbytes/s for 8 Kbyte blocks

                        Uncached   Index Cache   Block Cache   Raw Raid
    Sequential Reads       0.9         4.2           6.8          14.8
    Random Reads           0.4         0.7            -            1.0
    Virgin Writes          3.7          -             -           12.4
    Duplicate Writes       5.6         6.2           6.5          12.4
`
`only the index lookup, but the entries are much smaller
`and the hit rate correspondingly higher.
`
`Unfortunately, these caches do not speed the process of
`storing a new block to Venti. The server must check
`that the block is not a duplicate by examining the index.
`If the block is not contained on the server, it will
`obviously not be in any cache. Since the fingerprint of
`the block contains no internal structure, the location of
`a fingerprint in the index is essentially random.
`Furthermore, the archival nature of Venti means the
`entire index will not fit in memory because of the large
`number of blocks. Combining these factors means that
`the write performance of Venti will be limited to the
`random IO performance of the index disk, which for
current technology is a few hundred accesses per second.