`
Robbert van Renesse* Andrew S. Tanenbaum Annita Wilschut‡
`
`Dept. of Computer Science
`Vrije Universiteit
`The Netherlands
`
`ABSTRACT
The Bullet server is an innovative file server that outperforms traditional file servers like SUN's NFS by more than a factor of three. It achieves high throughput and low delay through a radically different software design than current file servers in use. Instead of storing files as a sequence of disk blocks, each Bullet server file is stored contiguously, both on disk and in the server's RAM cache. Furthermore, it employs the concept of an immutable file, to improve performance, to enable caching, and to provide a clean semantic model to the user. The paper describes the design and implementation of the Bullet server in detail, presents measurements of its performance, and compares this performance with other well-known file servers running on the same hardware.
`
`1. Introduction
Traditional file systems were designed for small machines, that is, computers with little RAM memory and small disks. The emphasis was on supporting large files using as few resources as possible. To allow dynamic growth of files, files were split into fixed-size blocks that were dynamically allocated, so that a large file could end up scattered all over the disk. Performance suffered, since each block had to be accessed separately. Block management also introduced high overhead: indirect blocks were necessary to administer the files and their blocks. A small part of the computer's little memory was used to keep parts of files in a RAM cache to make access somewhat more efficient.
Today the situation has changed considerably. Machines have enormous RAM memories and huge disks. Files usually fit completely in memory. For example, memory sizes of at least 16 Megabytes are common today, enough to hold most files encountered in practice. Measurements [1] show that the median file size in a UNIX† system is 1 Kbyte and that 99% of all files are less than 64 Kbytes. File systems, however, have not changed yet. Files are still subdivided into blocks. To take some advantage of the new technology, the size of blocks has been increased, and the memory caches have been enlarged. This has led to only a marginal performance improvement.

* This research was supported in part by the Netherlands Organization for Scientific Research (N.W.O.) under grant 125-30-10.
‡ Current affiliation: Dept. of Computer Science, University of Twente, Enschede, The Netherlands.
As part of the Amoeba distributed operating system project [2,3] we have designed and implemented a file server that is intended for current and future computer and disk technology. We have devoted considerable energy to making the file server fast. Since we believe we have achieved this goal, we have named it the Bullet file server. Among its features are support for replication, caching, and immutability.
This paper has five sections. In the following section we present the architectural model of the Bullet file service. In section three we present the implementation of the server, including the data structures and interfaces. The performance of the file server, and a comparison with other file servers such as SUN NFS, are the subject of section four. Section five contains our discussion and conclusions.
`2. Architectural model
The basic idea behind the design of the Bullet file server is to do away with the block model. In fact, we have chosen the extreme, which is to maintain files contiguously throughout the system. That is, files are stored contiguously on disk, cached contiguously in RAM, and kept contiguously in processes' memories. This dictates the choice for whole-file transfer. As a consequence, processors can only operate on files that fit in their physical memory. This affects the way in which we store data structures on files, and how we assign processors to applications. Since most files (about 75%) are accessed in their entirety [4], whole-file transfer optimizes overall scaling and performance, as has also been reported for other systems that do whole-file transfer, such as the Andrew ITC file system [5].
† UNIX is a Registered Trademark of AT&T in the USA and other countries.
`
`
`
`
`Another design choice, which is closely linked to keep-
`ing files contiguous, is to make all files immutable. That is,
`the only operations on files are creation, retrieval, and dele-
`tion; there are no update-in-place operations. Instead, if we
`want to update a data structure that is stored on a file, we do
`this by creating a new file holding the updated data structure.
`In other words, we store files as sequences of versions. Note
`that as we do whole file transfer anyway, this puts no perfor-
mance penalties on the file server. Version mechanisms have
`positive influences on caching, as reported in the Cedar File
`System [6], and on replication. It also presents the possibility
`of keeping versions on write-once storage such as optical
`disks. Although the version mechanism is itself quite interest-
`ing, space limitations prevent us from describing it here. It is
`discussed in [7].
`For most applications this model works well, but there
`are some applications where different solutions will have to be
`found. Each append to a log file, for example, would require
`the whole file to be copied. Similarly, for data bases, a small
`update might incur a large overhead. For log files we have
`implemented a separate server. Data bases can be subdivided
`over many smaller Bullet files, for example based on the iden-
`tifying keys.
Throughout the design we have striven for performance, scalability, and availability. The Bullet file server is the main storage server for the Amoeba distributed operating system, where these issues are of high importance [8,9]. Performance can only be achieved if the overhead of managing storage and replication is low, and the model of contiguity and immutability corresponds to how files are usually accessed. Scalability involves both geographic scalability (Amoeba currently runs in four different countries) and quantitative scalability (there may be thousands of processors accessing files). Availability implies the need for replication.
`Since these issues correlate heavily with those of the
`Amoeba distributed operating system, we will first devote a
`section to Amoeba. In this section we will also describe the
`naming service of Amoeba, which plays a role in how data
`structures, especially large ones, may be stored efficiently on
`immutable files. In the following section we will describe the
`Bullet file interface.
`
`2.1. Amoeba
`Amoeba [2,3] is a distributed operating system that was
`designed and implemented at the Vrije Universiteit in Amster-
dam, and is now being further developed there and at the Centre for Mathematics and Computer Science, also in Amsterdam. It is based on the object model. An object is an abstract
`data type, and operations on it are invoked through remote pro-
`cedure calls. The hardware on which Amoeba runs consists of
`four principal components:
`
`workstations
`dynamically allocatable processors
`specialized services
`gateways
Workstations provide the user interface to Amoeba and are only involved with interactive tasks such as command interpretation and text editing. Consequently they do not deal with very large or dynamically changing files. The dynamically allocatable processors together form the so-called processor pool. These processors may be allocated for compiling or text formatting purposes, or for distributed or parallel algorithms. Among other applications, we have implemented a parallel make [10] and parallel heuristic search [11].
Specialized servers include filing servers such as the Bullet file server, and the directory server. The directory server is used in conjunction with the Bullet server. Its function is to handle naming and protection of Bullet server files and other objects in a simple, uniform way. Servers manage the Amoeba objects, that is, they handle the storage and perform the operations. Gateways provide transparent communication among Amoeba sites currently operating in four different countries (The Netherlands, England, Norway, and Germany).
`All objects in Amoeba are addressed and protected by
`capabilities [3,12]. A capability consists of four parts:
`The server port identifies the server that manages the
`object. It is a 48-bit location-independent number that
`is chosen by the server itself and made known to the
`server’s potential clients.
`The object number identifies the object within the
`server. For example, a file server may manage many
`files, and use the object number to index in a table of
`inodes. An inode contains the position of the file on
`disk, and accounting information.
The rights field specifies which access rights the holder
`of the capability has to the object. For a file server
`there may be a bit indicating the right to read the file,
`another bit for deleting the file, and so on.
The check field is used to protect capabilities against
`forging and tampering. In the case of the file server
`this can be done as follows. Each time a file is created
`the server generates a large random number and stores
`this in the inode for the file. Capabilities for the file
`can be generated by taking the server’s port, the index
`of the file in the inode table, and the required rights.
`The check field can be generated by taking the rights
`and the random number from the inode, and encrypting
`both. If, later, a client shows a capability for a file, its
`validity can be checked by decrypting the check field
`and comparing the rights and the random number.
`Other schemes are described in [12]. Capabilities can
`be cached to avoid decryption for each access.
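As an illustration (and not the actual Amoeba code), the check-field scheme could look as follows in C. Only the 48-bit port width is taken from the text; the other field widths, the names, and the mix() routine are our own assumptions. In particular, mix() is a stand-in for a real encryption function, and validation here recomputes the check field rather than decrypting it.

#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t  port[6];   /* 48-bit server port                     */
    uint32_t object;    /* object number: index into inode table  */
    uint8_t  rights;    /* access rights bit mask                 */
    uint64_t check;     /* check field: protects against forging  */
} capability_t;

/* Stand-in for the real encryption function: combines the rights
   bits with the per-file random number into a check field. */
static uint64_t mix(uint8_t rights, uint64_t random)
{
    uint64_t x = random ^ ((uint64_t)rights * 0x9E3779B97F4A7C15ULL);
    x ^= x >> 31;
    x *= 0xBF58476D1CE4E5B9ULL;
    x ^= x >> 29;
    return x;
}

/* Mint a capability when a file is created. */
capability_t cap_new(const uint8_t port[6], uint32_t object,
                     uint8_t rights, uint64_t random)
{
    capability_t cap;
    memcpy(cap.port, port, 6);
    cap.object = object;
    cap.rights = rights;
    cap.check  = mix(rights, random);
    return cap;
}

/* Check a capability presented by a client against the random
   number stored in the file's inode. */
int cap_valid(const capability_t *cap, uint64_t random)
{
    return cap->check == mix(cap->rights, random);
}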
`
`
`
`
Although capabilities are a convenient way for addressing and protecting objects, they are not usable by human users. For this the directory service maps human-chosen ASCII names to capabilities. Directories are two-column tables, the first column containing names, and the second containing the corresponding capabilities. Directories are objects themselves, and can be addressed by capabilities. By placing directory capabilities in directories an arbitrary naming structure can be built at the convenience of the user. The directory service provides a single global naming space for objects. This has allowed us to link multiple Bullet file servers together, providing one single large file service that crosses international borders [13,14].
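Continuing the illustration, a directory of the kind just described might be represented as below. The entry and table sizes, the type names, and dir_lookup() are hypothetical; only the two-column structure of names and capabilities (reusing capability_t from the previous sketch) is taken from the text.

#include <stddef.h>
#include <string.h>

#define DIR_ENTRIES 64            /* illustrative table size  */

typedef struct {
    char         name[32];        /* human-chosen ASCII name  */
    capability_t cap;             /* corresponding capability */
} dir_entry_t;

typedef struct {
    dir_entry_t entry[DIR_ENTRIES];
    int         used;             /* number of valid entries  */
} directory_t;

/* Look a name up in one directory. A hit may itself yield a
   directory capability, so arbitrary naming structures can be
   walked by repeating the lookup. */
const capability_t *dir_lookup(const directory_t *dir, const char *name)
{
    for (int i = 0; i < dir->used; i++)
        if (strcmp(dir->entry[i].name, name) == 0)
            return &dir->entry[i].cap;
    return NULL;                  /* name not present */
}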
2.2. Bullet Server Interface
`The simple architectural model of the file service is
`reflected in its simple interface. Whole file transfer eliminates
`the need for relatively complicated interfaces to access parts of
`files. Immutability eliminates the need for separate update
`operators. Version management is not part of the file server
`interface, since it is done by the directory service [7].
The Bullet interface consists of four functions:

BULLET.CREATE(SERVER, DATA, SIZE, P-FACTOR) → CAPABILITY
BULLET.SIZE(CAPABILITY) → SIZE
BULLET.READ(CAPABILITY, &DATA)
BULLET.DELETE(CAPABILITY)
The BULLET.CREATE function is the only way to store data on a Bullet server. The SERVER argument specifies which Bullet server to use. This enables users to use more than one Bullet server. The DATA and SIZE arguments describe the contents of the file to be created. A capability for the file is returned for subsequent use.
The P-FACTOR stands for Paranoia Factor. It is a measure of how carefully the file should be stored before BULLET.CREATE can return to the invoker. If the P-FACTOR is zero, BULLET.CREATE will return immediately after the file has been copied to the file server's RAM cache, but before it has been stored on disk. This is fast, but if the server crashes shortly afterwards the file may be lost. If the P-FACTOR is one, the file will be stored on one disk before the client can resume. If the P-FACTOR is N, the file will be stored on N disks before the client can resume. This requires the file server to have at least N disks available for replication. At present we have two disks.
The BULLET.SIZE and BULLET.READ functions are used to retrieve files from a server. First BULLET.SIZE is called to get the size of the file addressed by CAPABILITY, after which local memory is allocated to store its contents. Then BULLET.READ is invoked to get the contents, where &DATA is the address of the allocated local memory. Alternatively, a section of the virtual address space can be reserved, after which the file can be mapped into the virtual memory of the process. In that case the underlying kernel performs the BULLET.READ function. BULLET.DELETE allows files to be discarded from the file server.
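To illustrate how a client uses this interface under the whole-file model, the fragment below copies an existing file. The C bindings (bullet_size() and so on) are hypothetical names for the RPC stubs, not the actual Amoeba stub interfaces; capability_t is the sketch from section 2.1.

#include <stdlib.h>

/* Hypothetical C stubs for the four RPCs. */
extern long bullet_size(capability_t file);
extern int  bullet_read(capability_t file, void *data);
extern int  bullet_create(capability_t server, const void *data,
                          long size, int p_factor, capability_t *out);
extern int  bullet_delete(capability_t file);

/* Copy an existing file: fetch it whole, then create an immutable
   copy on the given server. */
int copy_file(capability_t server, capability_t file, capability_t *copy)
{
    long size = bullet_size(file);      /* 1. ask for the size        */
    if (size < 0)
        return -1;
    void *buf = malloc(size);           /* 2. allocate local memory   */
    if (buf == NULL)
        return -1;
    if (bullet_read(file, buf) != 0) {  /* 3. transfer the whole file */
        free(buf);
        return -1;
    }
    /* 4. P-FACTOR 1: reply only after the file is on one disk. */
    int r = bullet_create(server, buf, size, 1, copy);
    free(buf);
    return r;
}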
`
`3. Implementation
Keeping files contiguous (i.e., not splitting them up into blocks) greatly simplifies file server design. Consequently, the implementation of the file server can be simple. In this section we will discuss an implementation on a 16.7 MHz Motorola 68020-based server with 16 Mbytes of RAM memory and two 800 Mbyte magnetic disk drives. We will describe the disk layout of the file server, the file server cache, and how replication is done.
The disk is divided into two sections. The first is the inode table, each entry of which gives the ownership, location, and size of one file. The second section contains the contiguous files, along with the gaps between files. Inode entry 0 is special, and contains three 4-byte integers:
1) block size: the physical sector size used by the disk hardware;
2) control size: the number of blocks in the control section;
3) data size: the number of blocks in the data section.
`
The remaining inodes describe files. An inode consists of four fields:
1) A 6-byte random number that is used for access protection. It is essentially the key used to decrypt capabilities that are presented to the server.
2) A 2-byte integer that is called the index. The index has no significance on disk, but is used for cache management and will be described later.
3) A 4-byte integer specifying the first block of the file on disk. Files are aligned on blocks (sectors). This alignment is forced by the disk hardware.
4) A 4-byte integer giving the size of the file in bytes.
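Taken together, the on-disk layout described above could be declared as follows. The struct and field names are ours; the widths are the ones stated in the text (the 16-byte inode happens to need no padding on common ABIs, but real code would have to pin the layout down).

#include <stdint.h>

/* One entry of the on-disk inode table (16 bytes). */
struct disk_inode {
    uint8_t  random[6];    /* key for validating capabilities      */
    uint16_t index;        /* mnode index; no significance on disk */
    uint32_t first_block;  /* first disk block of the file         */
    uint32_t size;         /* file size in bytes                   */
};

/* Inode entry 0 is special and describes the disk itself. */
struct disk_descriptor {
    uint32_t block_size;   /* physical sector size                 */
    uint32_t control_size; /* blocks in the control section        */
    uint32_t data_size;    /* blocks in the data section           */
};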
When the file server starts up, it reads the complete inode table into the RAM inode table and keeps it there permanently. By scanning the inodes it can figure out which parts of the disk are free. It uses this information to build a free list in RAM. Unused inodes (inodes that are zero-filled) are also maintained in a list. While scanning the inodes, the file server performs some consistency checks, for example to make sure that files do not overlap. All of the server's remaining memory is used for file caching. At this point the file server is ready for operation and starts awaiting client requests.
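A sketch of this start-up scan, reusing struct disk_inode from the previous sketch. The fixed-size arrays, the MAX_FILES bound, and the function name are our own simplifications; deriving the free list from the collected extents and the overlap check are omitted.

#include <stdint.h>

#define MAX_FILES 1024                     /* illustrative bound        */

struct extent { uint32_t first_block, blocks; };

static struct extent used[MAX_FILES];      /* extents occupied by files */
static uint16_t free_inodes[MAX_FILES];    /* zero-filled inode slots   */
static int n_used, n_free;

void scan_inode_table(const struct disk_inode *tab, int n,
                      uint32_t block_size)
{
    n_used = n_free = 0;
    for (int i = 1; i < n; i++) {          /* entry 0 is the descriptor */
        if (tab[i].size == 0 && tab[i].first_block == 0) {
            free_inodes[n_free++] = (uint16_t)i;  /* unused inode       */
        } else {
            used[n_used].first_block = tab[i].first_block;
            used[n_used].blocks =
                (tab[i].size + block_size - 1) / block_size; /* round up */
            n_used++;
            /* Consistency checks (e.g., files must not overlap) would
               go here; the disk free list is then derived from the
               holes between the used extents. */
        }
    }
}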
The entries in the RAM inode table are in a slightly different format from those on the disk, and are called mnodes. An mnode contains the following information:
1) the inode table index of the corresponding file;
2) a pointer to the file in the RAM cache, if present;
3) an age field to implement an LRU cache strategy.
Just like the disk, the free mnodes and free parts in the RAM cache are maintained using free lists.
`
`
`
`
[Fig. 1. The Bullet disk layout: a disk descriptor (inode entry 0), the inode table, and the contiguous files with holes between them.]

Client requests basically come in three varieties: reading files, creating files, and deleting files. To read a file the client has to provide the capability of the file. The object number in the capability indexes into the inode table to find the inode for the file. Using the random number in the inode, and the check field in the capability, this client's right to read the file can be checked. Next the index field in the inode is inspected to see whether there is a copy of the file in the RAM cache. If the index in the inode is non-zero, there is a copy in the server's RAM cache. The index is used to locate an mnode, which describes where to find the file in memory. Since the file is kept contiguously in RAM, it can be sent to
the client in one RPC operation. The age field is updated to reflect the recent access of the file.

If the index in the inode is zero, the file is not in memory and has to be loaded from disk. First the memory free list is searched to see if there is a part large enough to hold the file. If not, the least recently accessed file, found by checking the age fields in the mnodes, is removed from the RAM cache. This is done by reclaiming the mnode, freeing the associated memory, and clearing the index field in the corresponding inode. This is repeated until enough memory is found. Then an mnode is allocated for this file, and its fields are initialized. The index field in the inode of the file is set to the mnode index. Then the file can be read into the RAM cache. Most modern controllers can do this in a single DMA transfer. The client read operation can now proceed as with files that were already cached.
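The read path just described can be summarized in the following sketch. All names are ours, the helper routines are declared but not implemented, and the real server would also have to handle files larger than the cache; this is an illustration of the hit/miss logic only.

#include <stddef.h>
#include <stdint.h>

struct mnode {
    uint16_t inode_index;  /* which file this cache slot holds */
    void    *data;         /* contiguous copy of the file      */
    uint32_t age;          /* for LRU replacement              */
};

extern struct mnode      mnodes[];  /* slot 0 unused: index 0 = uncached */
extern struct disk_inode inodes[];  /* in-RAM copy of the inode table    */
extern uint32_t          ticks;     /* logical clock for LRU ages        */

extern uint16_t lru_victim(void);            /* mnode with smallest age  */
extern void     mnode_free(uint16_t m);      /* reclaim mnode and memory */
extern uint16_t mnode_alloc(void);
extern void    *cache_alloc(uint32_t bytes); /* NULL: no hole big enough */
extern void     read_contiguous(const struct disk_inode *ip, void *buf);

/* Return a pointer to the whole file, loading it on a cache miss. */
void *file_in_cache(uint16_t i)
{
    struct disk_inode *ip = &inodes[i];
    if (ip->index != 0) {               /* cache hit                  */
        mnodes[ip->index].age = ++ticks;
        return mnodes[ip->index].data;
    }
    void *buf;                          /* cache miss: make room...   */
    while ((buf = cache_alloc(ip->size)) == NULL) {
        uint16_t v = lru_victim();      /* evict least recently used  */
        inodes[mnodes[v].inode_index].index = 0;
        mnode_free(v);
    }
    uint16_t m = mnode_alloc();         /* ...and load the whole file */
    mnodes[m].inode_index = i;
    mnodes[m].data = buf;
    mnodes[m].age  = ++ticks;
    ip->index = m;                      /* link inode to mnode        */
    read_contiguous(ip, buf);           /* single DMA transfer        */
    return buf;
}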
Creating files is much the same as reading files that were not in the cache. A large enough part of cache memory has to be allocated to hold the file, after which it can be filled with the data specified by the client. Also, an inode and a free part in the disk data section have to be allocated. For this we use a first-fit strategy. In our implementation we use a write-through caching scheme, that is, we immediately write the file to disk. The new inode, complete with a new random number, is immediately written as well. For this the whole disk block containing the inode has to be written. The new random number is used to create a capability for the user, which is returned in a reply message.
In our hardware configuration we have two disks that we use as identical replicas. One of the disks is the main disk, from which the file server reads. Disk writes are performed on both disks. If the main disk fails, the file server can proceed uninterruptedly by using the other disk. Recovery is simply done by copying the complete disk. The P-FACTOR in the create operation is used to determine where in the execution a reply is sent back to the client.
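The placement of the reply for our two-disk configuration might look as follows; the helper routines and names are assumptions, not the server's actual code, and the types come from the earlier sketches.

/* Helpers assumed: write one file and its inode to a given disk,
   and send the RPC reply carrying the new capability. */
extern int  disk_write_file(int disk, const struct disk_inode *ip,
                            const void *data);
extern void send_reply(capability_t cap);

#define NDISKS 2                 /* our configuration               */

/* Write-through commit of a newly created file; where the reply is
   sent depends on the P-FACTOR, as described above. */
void create_commit(const struct disk_inode *ip, const void *data,
                   capability_t cap, int p_factor)
{
    if (p_factor == 0)
        send_reply(cap);         /* file is only in the RAM cache   */
    for (int disk = 0; disk < NDISKS; disk++) {
        disk_write_file(disk, ip, data);
        if (disk + 1 == p_factor)
            send_reply(cap);     /* N disks now hold the file       */
    }
    /* A p_factor above NDISKS cannot be satisfied by this server. */
}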
Deleting a file involves checking the capability, then freeing the inode by zeroing it and writing it back to disk. If the file is in the cache, the space in the cache can be freed. The disk free list in RAM has to be updated to include the part previously occupied by the file.
At first glance, it appears that storing files contiguously in memory and on disk is very wasteful due to the external fragmentation problem (gaps between files). However, the fragmentation in memory can be alleviated by compacting part or all of the RAM cache from time to time. The disk fragmentation can also be relieved by compaction every morning at, say, 3 a.m., when the system is lightly loaded.
Still, it is our belief that this trade-off is not unreasonable. In effect, the conscious choice of using contiguous files may require buying, say, an 800 MB disk to store 500 MB worth of files (the rest being lost to fragmentation unless compaction is done). Since our scheme gives an increased
performance of a factor of three (as described in the next section), and disk prices rise only very slowly with disk capacity, a relatively small increment in total file server cost gives a major gain in speed.
The complete code of the file server is less than 30 pages of C. The MC68020 object size of the server, including all library routines, is 23 Kbytes. This small and simple implementation has resulted in a file server that has been operating flawlessly for over a year. The most vulnerable component of the server is the disk, but because of its replication, the complete file server is highly reliable.
`
`4. Performance and Comparison
Figure 2 gives the performance of the Bullet file server. In the first column the delay and bandwidth for read operations are shown. The measurements have been done on a normally loaded Ethernet from a 16 MHz 68020 processor. In all cases the test file is completely in memory, and no disk accesses are necessary. In the second column a create and a delete operation together are measured, and the file is written to both disks. Note that both creation and deletion involve requests to two disks.
To compare this with the SUN NFS file system, we have measured reading and creating files on a SUN 3/50 using a remote SUN 3/180 file server (using 16.7 MHz 68020s and SUN OS 3.9), equipped with a 3 Mbyte buffer cache. The measurements were made on an idle system. To disable local caching on the SUN 3/50, we locked the file using the SUN UNIX lockf primitive. The read test consisted of an lseek followed by a read system call. The write test consisted of consecutively executing creat, write, and close. The SUN NFS file server uses a write-through cache, but writes the file to one disk only. The results are depicted in Figure 3.
These measurements include both the communication time and the file server time. Since Amoeba uses a dedicated processor for the file server, it is impossible to separate communication and file server performance. Observe that reading and creating 1 Mbyte NFS files result in lower bandwidths than reading and creating 64 Kbyte NFS files. The Bullet file server performs read operations three to six times better than the SUN NFS file server for all file sizes. Although the Bullet file server stores the files on two disks, for large files its create bandwidth is ten times that of SUN NFS. For very large files (> 64 Kbytes) the Bullet server even achieves a higher bandwidth for writing than SUN NFS achieves for reading files.
`
`5. Discussion and Conclusions
The simple architectural model of immutable files that are kept contiguous on disk, in memory, and on the network results in a major performance boost. Whole-file transfer minimizes the load on the file server and on the network, allowing the service to be used on a larger scale [5].
`
[Fig. 2. Performance of the Bullet file server for read operations, and create and delete operations together, for file sizes from 1 byte to 1 Mbyte. The delay in msec (a) and bandwidth in Kbytes/sec (b) are given.]
`
Replication for availability is relatively easy. The simple implementation of the server also contributes to high availability. Immutability turned out to be a satisfactory model for most of our storage requirements. Recently we have implemented a UNIX emulation on top of the Bullet service, supporting a wealth of existing software.
Caching of immutable files is straightforward. Checking whether a cached copy of a file is still current is simply done by looking up its capability in the directory service, and comparing it to the capability on which the copy is based.
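In terms of the earlier sketches, this currency check amounts to a lookup and a comparison; the names here are again hypothetical.

#include <string.h>

/* Compare two capabilities field by field. */
int cap_equal(const capability_t *a, const capability_t *b)
{
    return memcmp(a->port, b->port, 6) == 0 &&
           a->object == b->object &&
           a->rights == b->rights &&
           a->check  == b->check;
}

/* A cached copy is current iff the directory still maps the name
   to the capability the copy was fetched under; a new version of
   the file necessarily carries a new capability. */
int cached_copy_current(const directory_t *dir, const char *name,
                        const capability_t *fetched_under)
{
    const capability_t *current = dir_lookup(dir, name);
    return current != NULL && cap_equal(current, fetched_under);
}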
Currently we are investigating how the Bullet file server and the Amoeba directory service can cooperate in providing a general purpose storage system. Goals of this research are high availability, consistency, performance, scalability, and minimal trade-offs between these goals. We have extended the interface to allow generating a new file based on an existing file, so that for a small modification it is no longer necessary to transfer the whole file, while it also allows processors with small memories to handle large files. Files still have to fit completely in the file server's memory so that performance is not affected.
`
`
`
`
[Fig. 3. Performance of the SUN NFS file server for read and create operations, for file sizes from 1 byte to 1 Mbyte. The delay in msec (a) and bandwidth in Kbytes/sec (b) are given.]
`
References
[1] Mullender, S. J. and Tanenbaum, A. S., "Immediate Files," Software-Practice and Experience, Vol. 14, No. 4, pp. 365-368 (April 1984).
[2] Mullender, S. J. and Tanenbaum, A. S., "The Design of a Capability-Based Distributed Operating System," The Computer Journal, Vol. 29, No. 4, pp. 289-300 (March 1986).
[3] Mullender, S. J. and Tanenbaum, A. S., "Protection and Resource Control in Distributed Operating Systems," Computer Networks, Vol. 8, No. 5-6, pp. 421-432 (October 1984).
[4] Ousterhout, J. K., Da Costa, H., Harrison, D., Kunze, J. A., Kupfer, M., and Thompson, J. G., "A Trace-Driven Analysis of the UNIX 4.2 BSD File System," Proc. of the 10th Symp. on Operating Systems Principles, pp. 15-24, Orcas Island, Wash. (December 1985).
[5] Howard, J. H., Kazar, M. L., Menees, S. G., Nichols, D. A., Satyanarayanan, M., Sidebotham, R. N., and West, M. J., "Scale and Performance in a Distributed File System," ACM Trans. on Computer Systems, Vol. 6, No. 1, pp. 51-81 (February 1988).
[6] Gifford, D. K., Needham, R. M., and Schroeder, M. D., "The Cedar File System," Comm. ACM, pp. 288-298 (March 1988).
[7] Renesse, R. van and Tanenbaum, A. S., "Consistency and Availability in the Amoeba Distributed Operating System," Technical Report, Vrije Universiteit, Amsterdam (To Appear).
[8] Renesse, R. van, Staveren, J. M. van, and Tanenbaum, A. S., "The Performance of the Amoeba Distributed Operating System," Software-Practice and Experience (1988).
[9] Renesse, R. van, Staveren, J. M. van, and Tanenbaum, A. S., "The Performance of the World's Fastest Distributed Operating System," ACM Operating Systems Review, Vol. 22, No. 4, pp. 25-34 (October 1988).
[10] Baalbergen, E. H., "Design and Implementation of Parallel Make," Computing Systems, Vol. 1, No. 2, pp. 135-158 (Spring 1988).
[11] Bal, H. E., Renesse, R. van, and Tanenbaum, A. S., "Implementing Distributed Algorithms Using Remote Procedure Calls," Proc. of the 1987 National Computer Conf., pp. 499-506, Chicago, Ill. (June 1987).
[12] Tanenbaum, A. S., Mullender, S. J., and Renesse, R. van, "Using Sparse Capabilities in a Distributed Operating System," Proc. of the 6th Int. Conf. on Distributed Computing Systems, pp. 558-563, Cambridge, MA (May 1986).
[13] Renesse, R. van, Tanenbaum, A. S., Staveren, J. M. van, and Hall, J., "Connecting RPC-Based Distributed Systems Using Wide-Area Networks," Proc. of the 7th Int. Conf. on Distributed Computing Systems, pp. 28-34, Berlin (West) (September 1987).
[14] Renesse, R. van, Staveren, J. M. van, Hall, J., Turnbull, M., Janssen, A. A., Jansen, A. J., Mullender, S. J., Holden, D. B., Bastable, A., Fallmyr, T., Johansen, D., Mullender, K. S., and Zimmer, W., "MANDIS/Amoeba: A Widely Dispersed Object-Oriented Operating System," Proc. of the EUTECO 88 Conf., pp. 823-831, ed. R. Speth, North-Holland, Vienna, Austria (April 1988).
`
`