Design and Implementation of the Sun Network Filesystem
Proceedings of the Summer 1985 USENIX Conference, Portland OR,
June 1985, pp. 119-130.
Russel Sandberg
David Goldberg
Steve Kleiman
The Sun Network Filesystem (NFS) provides transparent, remote access to filesystems. Unlike many other remote filesystem implementations under UNIX†, the NFS is designed to be easily portable to other operating systems and machine architectures. It uses an External Data Representation (XDR) specification to describe protocols in a machine and system independent way. The NFS is implemented on top of a Remote Procedure Call package (RPC) to help simplify protocol definition, implementation, and maintenance.
In order to build the NFS into the UNIX 4.2 kernel in a user transparent way, we decided to add a new interface to the kernel which separates generic filesystem operations from specific filesystem implementations. The "filesystem interface" consists of two parts: the Virtual File System (VFS) interface defines the operations that can be done on a filesystem, while the vnode interface defines the operations that can be done on a file within that filesystem. This new interface allows us to implement and install new filesystems in much the same way as new device drivers are added to the kernel.
In this paper we discuss the design and implementation of the filesystem interface in the kernel and the NFS virtual filesystem. We describe some interesting design issues and how they were resolved, and point out some of the shortcomings of the current implementation. We conclude with some ideas for future enhancements.
Design Goals
The NFS was designed to make sharing of filesystem resources in a network of non-homogeneous machines easier. Our goal was to provide a UNIX-like way of making remote files available to local programs without having to modify, or even recompile, those programs. In addition, we wanted remote file access to be comparable in speed to local file access.
The overall design goals of the NFS were:
Machine and Operating System Independence
    The protocols used should be independent of UNIX so that an NFS server can supply files to many different types of clients. The protocols should also be simple enough that they can be implemented on low end machines like the PC.
Crash Recovery
    When clients can mount remote filesystems from many different servers it is very important that clients be able to recover easily from server crashes.
Transparent Access
    We want to provide a system which allows programs to access remote files in exactly the same way as local files. No pathname parsing, no special libraries, no recompiling. Programs should not be able to tell whether a file is remote or local.
† UNIX is a trademark of Bell Laboratories.
UNIX Semantics Maintained on Client
    In order for transparent access to work on UNIX machines, UNIX filesystem semantics have to be maintained for remote files.
Reasonable Performance
    People will not want to use the NFS if it is no faster than the existing networking utilities, such as rcp, even if it is easier to use. Our design goal is to make NFS as fast as the Sun Network Disk protocol (ND‡), or about 80% as fast as a local disk.
Basic Design
The NFS design consists of three major pieces: the protocol, the server side and the client side.
NFS Protocol
The NFS protocol uses the Sun Remote Procedure Call (RPC) mechanism [1]. For the same reasons that procedure calls help simplify programs, RPC helps simplify the definition, organization, and implementation of remote services. The NFS protocol is defined in terms of a set of procedures, their arguments and results, and their effects. Remote procedure calls are synchronous, that is, the client blocks until the server has completed the call and returned the results. This makes RPC very easy to use since it behaves like a local procedure call.
The NFS uses a stateless protocol. The parameters to each procedure call contain all of the information necessary to complete the call, and the server does not keep track of any past requests. This makes crash recovery very easy; when a server crashes, the client resends NFS requests until a response is received, and the server does no crash recovery at all. When a client crashes no recovery is necessary for either the client or the server. When state is maintained on the server, on the other hand, recovery is much harder. Both client and server need to be able to reliably detect crashes. The server needs to detect client crashes so that it can discard any state it is holding for the client, and the client must detect server crashes so that it can rebuild the server's state.
Using a stateless protocol allows us to avoid complex crash recovery and simplifies the protocol. If a client just resends requests until a response is received, data will never be lost due to a server crash. In fact the client cannot tell the difference between a server that has crashed and recovered, and a server that is slow.
Sun's remote procedure call package is designed to be transport independent. New transport protocols can be "plugged in" to the RPC implementation without affecting the higher level protocol code. The NFS uses the ARPA User Datagram Protocol (UDP) and Internet Protocol (IP) for its transport level. Since UDP is an unreliable datagram protocol, packets can get lost, but because the NFS protocol is stateless and the NFS requests are idempotent, the client can recover by retrying the call until the packet gets through.
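The retry argument above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `LossyServer` class, the `call_until_answered` helper, and the way a lost reply is modelled as `None` are hypothetical and are not part of the NFS or Sun RPC code. The point it tries to show is that a write which names its own offset can be applied twice without changing the result, so blind resending is safe.

```python
import random


class LossyServer:
    """Toy stateless server (hypothetical): every write carries its own offset."""

    def __init__(self, drop_rate, seed=0):
        self.data = bytearray()
        self.calls = 0
        self._rng = random.Random(seed)
        self._drop_rate = drop_rate

    def write(self, offset, payload):
        """Apply the write, then possibly lose the reply (modelled as None)."""
        self.calls += 1
        end = offset + len(payload)
        if len(self.data) < end:
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = payload
        if self._rng.random() < self._drop_rate:
            return None  # the reply datagram was dropped
        return len(self.data)  # stands in for the returned attributes


def call_until_answered(server, offset, payload):
    """Resend the identical request until some reply arrives."""
    while True:
        reply = server.write(offset, payload)
        if reply is not None:
            return reply


server = LossyServer(drop_rate=0.5, seed=1)
call_until_answered(server, 0, b"hello ")
call_until_answered(server, 6, b"world")
```

Because a duplicated request rewrites the same bytes at the same offset, the file contents end up the same no matter how many replies were lost. An append-style request without an explicit offset would not have this property.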
The most common NFS procedure parameter is a structure called a file handle (fhandle or fh) which is provided by the server and used by the client to reference a file. The fhandle is opaque, that is, the client never looks at the contents of the fhandle, but uses it when operations are done on that file.
An outline of the NFS protocol procedures is given below. For the complete specification see the Sun Network Filesystem Protocol Specification [2].
null() returns ()
    Do nothing procedure to ping the server and measure round trip time.
lookup(dirfh, name) returns (fh, attr)
    Returns a new fhandle and attributes for the named file in a directory.
create(dirfh, name, attr) returns (newfh, attr)
    Creates a new file and returns its fhandle and attributes.
remove(dirfh, name) returns (status)
    Removes a file from a directory.
getattr(fh) returns (attr)
    Returns file attributes. This procedure is like a stat call.
‡ ND, the Sun Network Disk Protocol, provides block-level access to remote, sub-partitioned disks.
setattr(fh, attr) returns (attr)
    Sets the mode, uid, gid, size, access time, and modify time of a file. Setting the size to zero truncates the file.
read(fh, offset, count) returns (attr, data)
    Returns up to count bytes of data from a file starting offset bytes into the file. read also returns the attributes of the file.
write(fh, offset, count, data) returns (attr)
    Writes count bytes of data to a file beginning offset bytes from the beginning of the file. Returns the attributes of the file after the write takes place.
rename(dirfh, name, tofh, toname) returns (status)
    Renames the file name in the directory dirfh, to toname in the directory tofh.
link(dirfh, name, tofh, toname) returns (status)
    Creates the file toname in the directory tofh, which is a link to the file name in the directory dirfh.
symlink(dirfh, name, string) returns (status)
    Creates a symbolic link name in the directory dirfh with value string. The server does not interpret the string argument in any way, just saves it and makes an association to the new symbolic link file.
readlink(fh) returns (string)
    Returns the string which is associated with the symbolic link file.
mkdir(dirfh, name, attr) returns (fh, newattr)
    Creates a new directory name in the directory dirfh and returns the new fhandle and attributes.
rmdir(dirfh, name) returns (status)
    Removes the empty directory name from the parent directory dirfh.
readdir(dirfh, cookie, count) returns (entries)
    Returns up to count bytes of directory entries from the directory dirfh. Each entry contains a file name, file id, and an opaque pointer to the next directory entry called a cookie. The cookie is used in subsequent readdir calls to start reading at a specific entry in the directory. A readdir call with the cookie of zero returns entries starting with the first entry in the directory.
statfs(fh) returns (fsstats)
    Returns filesystem information such as block size, number of free blocks, etc.
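The cookie mechanism in readdir can be sketched as a simple paging loop. This is an illustration under stated assumptions, not the protocol itself: here count limits the number of entries instead of bytes, the cookie is just a list index (a real cookie is opaque to the client), and the `eof` flag and the function names `readdir` and `list_directory` are hypothetical.

```python
def readdir(entries, cookie, count):
    """Return (batch, eof): up to `count` entries starting at position `cookie`.

    Each returned entry is (name, fileid, next_cookie), where next_cookie
    identifies the entry that follows it. A cookie of zero means "start at
    the first entry".
    """
    batch = []
    i = cookie
    while i < len(entries) and len(batch) < count:
        name, fileid = entries[i]
        batch.append((name, fileid, i + 1))
        i += 1
    return batch, i >= len(entries)


def list_directory(entries, count):
    """Client-side loop: keep calling readdir, resuming from the last cookie."""
    names = []
    cookie = 0
    while True:
        batch, eof = readdir(entries, cookie, count)
        names.extend(name for name, _, _ in batch)
        if eof or not batch:
            return names
        cookie = batch[-1][2]


DIRECTORY = [(".", 2), ("..", 2), ("a", 10), ("b", 11), ("c", 12)]
```

Since the position travels with each request, the server keeps no per-client "current offset" between calls, which fits the stateless design described earlier.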
New fhandles are returned by the lookup, create, and mkdir procedures, which also take an fhandle as an argument. The first remote fhandle, for the root of a filesystem, is obtained by the client using another RPC based protocol. The MOUNT protocol takes a directory pathname and returns an fhandle if the client has access permission to the filesystem which contains that directory. The reason for making this a separate protocol is that this makes it easier to plug in new filesystem access checking methods, and it separates out the operating system dependent aspects of the protocol. Note that the MOUNT protocol is the only place that UNIX pathnames are passed to the server. In other operating system implementations the MOUNT protocol can be replaced without having to change the NFS protocol.
The NFS protocol and RPC are built on top of an External Data Representation (XDR) specification [3]. XDR defines the size, byte order and alignment of basic data types such as string, integer, union, boolean and array. Complex structures can be built from the basic data types. Using XDR not only makes protocols machine and language independent, it also makes them easy to define. The arguments and results of RPC procedures are defined using an XDR data definition language that looks a lot like C declarations.
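As a rough illustration of what "size, byte order and alignment" mean in practice, the sketch below encodes two basic types the way the published XDR standard describes them: integers as 4-byte big-endian values, and strings as a length word followed by the bytes, zero-padded to a multiple of four. The helper names and the argument layout in `encode_lookup_args` (a fixed-size opaque handle followed by a name) are assumptions for illustration and are not taken from this paper.

```python
import struct


def xdr_uint(value):
    """Unsigned integer: 4 bytes, big-endian."""
    return struct.pack(">I", value)


def xdr_string(text):
    """String: length word, then the bytes, zero-padded to a 4-byte boundary."""
    raw = text.encode("ascii")
    pad = (4 - len(raw) % 4) % 4
    return xdr_uint(len(raw)) + raw + b"\0" * pad


def encode_lookup_args(dirfh, name):
    """Hypothetical wire layout for lookup(dirfh, name).

    The handle is treated as fixed-size opaque data, so it is copied as-is
    and must already be a multiple of four bytes long.
    """
    if len(dirfh) % 4 != 0:
        raise ValueError("opaque handle must be 4-byte aligned")
    return dirfh + xdr_string(name)


encoded = encode_lookup_args(b"\x01" * 32, "passwd")
```

Every field starts on a 4-byte boundary regardless of the host's native word size or byte order, which is what lets clients and servers on different architectures agree on the bytes.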
Server Side
Because the NFS server is stateless, as mentioned above, when servicing an NFS request it must commit any modified data to stable storage before returning results. The implication for UNIX based servers is that requests which modify the filesystem must flush all modified data to disk before returning from the call. This means that, for example on a write request, not only the data block, but also any modified indirect blocks and the block containing the inode must be flushed if they have been modified.
Another modification to UNIX necessary to make the server work is the addition of a generation number in the inode, and a filesystem id in the superblock. These extra numbers make it possible for the server to use the inode number, inode generation number, and filesystem id together as the fhandle for a file. The inode generation number is necessary because the server may hand out an fhandle with an inode number of a file that is later removed and the inode reused. When the original fhandle comes back, the server must be able to tell that this inode number now refers to a different file. The generation number has to be incremented every time the inode is freed.
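A minimal sketch of the generation-number check follows. The `FHandle` and `ToyServer` names and the in-memory bookkeeping are hypothetical; a real server would keep the generation number in the on-disk inode and the filesystem id in the superblock, as described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FHandle:
    """What the server packs into a handle; clients treat it as opaque."""
    fsid: int
    inum: int
    generation: int


class ToyServer:
    def __init__(self, fsid):
        self.fsid = fsid
        self.generation = {}  # inode number -> current generation number
        self.in_use = set()   # inode numbers currently allocated

    def allocate(self, inum):
        """Put an inode into use and hand out a handle for it."""
        self.generation.setdefault(inum, 0)
        self.in_use.add(inum)
        return FHandle(self.fsid, inum, self.generation[inum])

    def free(self, inum):
        """Free the inode; bumping the generation invalidates old handles."""
        self.in_use.discard(inum)
        self.generation[inum] += 1

    def is_stale(self, fh):
        return (fh.fsid != self.fsid
                or fh.inum not in self.in_use
                or fh.generation != self.generation[fh.inum])


srv = ToyServer(fsid=7)
old = srv.allocate(42)
srv.free(42)
new = srv.allocate(42)  # the same inode number, reused for a different file
```

Here `old` and `new` share an inode number, but only `new` passes the check, so a client holding the old handle cannot accidentally read or write the unrelated file that now occupies the inode.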
Client Side
The client side provides the transparent interface to the NFS. To make transparent access to remote files work we had to use a method of locating remote files that does not change the structure of path names. Some UNIX based remote file access schemes use host:path to name remote files. This does not allow real transparent access since existing programs that parse pathnames have to be modified.
Rather than doing a "late binding" of file address, we decided to do the hostname lookup and file address binding once per filesystem by allowing the client to attach a remote filesystem to a directory using the mount program. This method has the advantage that the client only has to deal with hostnames once, at mount time. It also allows the server to limit access to filesystems by checking client credentials. The disadvantage is that remote files are not available to the client until a mount is done.
Transparent access to different types of filesystems mounted on a single machine is provided by a new filesystem interface in the kernel. Each "filesystem type" supports two sets of operations: the Virtual Filesystem (VFS) interface defines the procedures that operate on the filesystem as a whole; and the Virtual Node (vnode) interface defines the procedures that operate on an individual file within that filesystem type. Figure 1 is a schematic diagram of the filesystem interface and how the NFS uses it.
[Figure 1: schematic of the filesystem interface. Surviving labels: System Calls; PC Filesystem; 4.2 Filesystem; NFS Filesystem.]
The Filesystem Interface
The VFS interface is implemented using a structure that contains the operations that can be done on a whole filesystem. Likewise, the vnode interface is a structure that contains the operations that can be done on a node (file or directory) within a filesystem. There is one VFS structure per mounted filesystem in the kernel and one vnode structure for each active node. Using this abstract data type implementation allows the kernel to treat all filesystems and nodes in the same way without knowing which underlying filesystem implementation it is using.
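The idea of a structure of operations can be sketched as a dispatch table. This is only an illustration: the kernel interface is a C structure of function pointers, whereas the sketch uses Python dictionaries, and the names `Vnode`, `vop_getattr`, `LOCAL_OPS` and `REMOTE_OPS` are hypothetical.

```python
class Vnode:
    def __init__(self, ops, private):
        self.ops = ops          # operations table for this filesystem type
        self.private = private  # filesystem-specific data, opaque to callers


def vop_getattr(vnode):
    """Generic entry point: the caller never inspects the filesystem type."""
    return vnode.ops["getattr"](vnode)


def local_getattr(vnode):
    # A local filesystem would read the attributes from its inode.
    return {"size": len(vnode.private), "where": "local"}


def remote_getattr(vnode):
    # An NFS client would issue a getattr RPC here; this stub only labels the result.
    return {"size": vnode.private["size"], "where": "remote"}


LOCAL_OPS = {"getattr": local_getattr}
REMOTE_OPS = {"getattr": remote_getattr}

local_file = Vnode(LOCAL_OPS, b"abc")
remote_file = Vnode(REMOTE_OPS, {"size": 10})
```

Calling `vop_getattr` on either vnode uses the same code path in the caller, which is the property that lets a new filesystem type be added much as a device driver is.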
Each vnode contains a pointer to its parent VFS and a pointer to a mounted-on VFS. This means that any node in a filesystem tree can be a mount point for another filesystem. A root operation is provided in the VFS to return the root vnode of a mounted filesystem. This is used by the pathname traversal routines in the kernel to bridge mount points. The root operation is used instead of just keeping a pointer so that the root vnode for each mounted filesystem can be released. The VFS of a mounted filesystem also contains a back pointer to the vnode on which it is mounted so that pathnames that include ".." can also be traversed across mount points.
In addition to the VFS and vnode operations, each filesystem type must provide mount and mount_root operations to mount normal and root filesystems. The operations defined for the filesystem interface are:
Filesystem Operations
    mount(varies)                                          System call to mount filesystem
    mount_root()                                           Mount filesystem as root
VFS Operations
    unmount(vfs)                                           Unmount filesystem
    root(vfs) returns(vnode)                               Return the vnode of the filesystem root
    statfs(vfs) returns(fsstatbuf)                         Return filesystem statistics
    sync(vfs)                                              Flush delayed write blocks
Vnode Operations
    open(vnode, flags)                                     Mark file open
    close(vnode, flags)                                    Mark file closed
    rdwr(vnode, uio, rwflag, flags)                        Read or write a file
    ioctl(vnode, cmd, data, rwflag)                        Do I/O control operation
    select(vnode, rwflag)                                  Do select
    getattr(vnode) returns(attr)                           Return file attributes
    setattr(vnode, attr)                                   Set file attributes
    access(vnode, mode)                                    Check access permission
    lookup(dvnode, name) returns(vnode)                    Look up file name in a directory
    create(dvnode, name, attr, excl, mode) returns(vnode)  Create a file
    remove(dvnode, name)                                   Remove a file name from a directory
    link(vnode, todvnode, toname)                          Link to a file
    rename(dvnode, name, todvnode, toname)                 Rename a file
    mkdir(dvnode, name, attr) returns(dvnode)              Create a directory
    rmdir(dvnode, name)                                    Remove a directory
    readdir(dvnode) returns(entries)                       Read directory entries
    symlink(dvnode, name, attr, to_name)                   Create a symbolic link
    readlink(vp) returns(data)                             Read the value of a symbolic link
    fsync(vnode)                                           Flush dirty blocks of a file
    inactive(vnode)                                        Mark vnode inactive and do clean up
    bmap(vnode, blk) returns(devnode, mappedblk)           Map block number
    strategy(bp)                                           Read and write filesystem blocks
    bread(vnode, blockno) returns(buf)                     Read a block
    brelse(vnode, buf)                                     Release a block buffer
Notice that many of the vnode procedures map one-to-one with NFS protocol procedures, while other, UNIX dependent procedures such as open, close, and ioctl do not. The bmap, strategy, bread, and brelse procedures are used to do reading and writing using the buffer cache.
Pathname traversal is done in the kernel by breaking the path into directory components and doing a lookup call through the vnode for each component. At first glance it seems like a waste of time to pass only one component with each call instead of passing the whole path and receiving back a target vnode. The main reason for this is that any component of the path could be a mount point for another filesystem, and the mount information is kept above the vnode implementation level. In the NFS filesystem, passing whole pathnames would force the server to keep track of all of the mount points of its clients in order to determine where to break the pathname and this would violate server statelessness. The inefficiency of looking up one component at a time is alleviated with a cache of directory vnodes.
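Component-at-a-time traversal across a mount point can be sketched as follows. The classes and the `traverse` function are hypothetical simplifications: each vnode's lookup is a dictionary access, `mounted_here` plays the role of the mounted-on VFS pointer, and ".." handling and the directory vnode cache are left out.

```python
class Vnode:
    def __init__(self, name, vfs):
        self.name = name
        self.vfs = vfs             # the filesystem this node belongs to
        self.children = {}         # component name -> Vnode
        self.mounted_here = None   # VFS mounted on this node, if any

    def lookup(self, component):
        """Per-filesystem lookup of a single component."""
        return self.children[component]


class VFS:
    def __init__(self, label):
        self.label = label
        self.root_vnode = Vnode("/", self)

    def root(self):
        return self.root_vnode


def traverse(start, path):
    """Look up one component at a time, crossing mount points as they appear."""
    vnode = start
    for component in path.strip("/").split("/"):
        if not component:
            continue
        vnode = vnode.lookup(component)
        # The mount table lives above the filesystem implementation, so the
        # generic code, not the filesystem, switches to the mounted root.
        while vnode.mounted_here is not None:
            vnode = vnode.mounted_here.root()
    return vnode


local = VFS("local")
usr = Vnode("usr", local)
local.root().children["usr"] = usr

remote = VFS("nfs")
lib = Vnode("lib", remote)
remote.root().children["lib"] = lib
usr.mounted_here = remote   # the remote filesystem is mounted on /usr
```

Resolving "/usr/lib" asks the local filesystem only for "usr" and the remote one only for "lib". Neither filesystem needs to know where the other is mounted, which is why the server can stay ignorant of its clients' mount points.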
Implementation of the NFS started in March 1984. The first step in the implementation was modification of the 4.2 kernel to include the filesystem interface. By June we had the first "vnode kernel" running. We did some benchmarks to test the amount of overhead added by the extra interface. It turned out that in most cases the difference was not measurable, and in the worst case the kernel had only slowed down by about 2%. Most of the work in adding the new interface was in finding and fixing all of the places in the kernel that used inodes directly, and code that contained implicit knowledge of inodes or disk layout.
Only a few of the filesystem routines in the kernel had to be completely rewritten to use vnodes. Namei, the routine that does pathname lookup, was changed to use the vnode lookup operation, and cleaned up so that it doesn't use global state. The direnter routine, which adds new directory entries (used by create, rename, etc.), also had to be fixed because it depended on the global state from namei. Direnter also had to be modified to do directory locking during directory rename operations because inode locking is no longer available at this level, and vnodes are never locked.
To avoid having a fixed upper limit on the number of active vnode and VFS structures we added a memory allocator to the kernel so that these and other structures can be allocated and freed dynamically.
A new system call, getdirentries, was added to read directory entries from different types of filesystems. The 4.2 readdir library routine was modified to use the new system call so programs would not have to be rewritten. This change does, however, mean that programs that use readdir have to be relinked.
Beginning in March, the user level RPC and XDR libraries were ported to the kernel and we were able to make kernel to user and kernel to kernel RPC calls in June. We worked on RPC performance for about a month until the round trip time for a kernel to kernel null RPC call was 8.8 milliseconds. The performance tuning included several speed ups to the UDP and IP code in the kernel.
Once RPC and the vnode kernel were in place the implementation of NFS was simply a matter of writing the XDR routines to do the NFS protocol, implementing an RPC server for the NFS procedures in the kernel, and implementing a filesystem interface which translates vnode operations into NFS remote procedure calls. The first NFS kernel was up and running in mid August. At this point we had to make some modifications to the vnode interface to allow the NFS server to do synchronous write operations. This was necessary since unwritten blocks in the server's buffer cache are part of the "client's state".
Our first implementation of the MOUNT protocol was built into the NFS protocol. It wasn't until later that we broke the MOUNT protocol into a separate, user level RPC service. The MOUNT server is a user level daemon that is started automatically when a mount request comes in. It checks the file /etc/exports which contains a list of exported filesystems and the clients that can import them. If the client has import permission, the mount daemon does a getfh system call to convert a pathname into an fhandle which is returned to the client.
On the client side, the mount command was modified to take additional arguments including a filesystem type and options string. The filesystem type allows one mount command to mount any type of filesystem. The options string is used to pass optional flags to the different filesystem mount system calls. For example, the NFS allows two flavors of mount, soft and hard. A hard mounted filesystem will retry NFS calls forever if the server goes down, while a soft mount gives up after a while and returns an error. The problem with soft mounts is that most UNIX programs are not very good about checking return status from system calls so you can get some strange behavior when servers go down. A hard mounted filesystem, on the other hand, will never fail due to a server crash; it may cause processes to hang for a while, but data will not be lost.
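The difference between the two mount flavors comes down to the retry policy, which the following sketch tries to capture. The function `nfs_call`, the `ServerDown` exception, the `soft_retries` limit of three, and the `flaky` test helper are all assumptions for illustration; a real client would also wait for a timeout between attempts, which is omitted here.

```python
class ServerDown(Exception):
    """Raised to the caller when a soft mount gives up (hypothetical name)."""


def nfs_call(send, hard, soft_retries=3):
    """Issue a request via `send`, which returns None when no reply arrives.

    A hard mount retries forever; a soft mount gives up after
    `soft_retries` unanswered attempts and reports an error.
    """
    attempts = 0
    while True:
        attempts += 1
        reply = send()
        if reply is not None:
            return reply
        if not hard and attempts >= soft_retries:
            raise ServerDown("no reply after %d attempts" % attempts)


def flaky(failures_before_reply):
    """Build a `send` function that stays silent N times, then answers."""
    state = {"left": failures_before_reply}

    def send():
        if state["left"] > 0:
            state["left"] -= 1
            return None
        return "ok"

    return send
```

With a server that stays silent five times, the hard call eventually returns while the soft call raises. A program that ignores that error is the source of the strange behavior mentioned above.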
In addition to the MOUNT server, we have added NFS server daemons. These are user level processes that make an nfsd system call into the kernel, and never return. This provides a user context to the kernel NFS server which allows the server to sleep. Similarly, the block I/O daemon, on the client side, is a user level process that lives in the kernel and services asynchronous block I/O requests. Because the RPC requests are blocking, a user context is necessary to wait for read-ahead and write-behind requests to complete. These daemons provide a temporary solution to the problem of handling parallel, synchronous requests in the kernel. In the future we hope to use a light-weight process mechanism in the kernel to handle these requests.
The NFS group started using the NFS in September, and spent the next six months working on performance enhancements and administrative tools to make the NFS easier to install and use. One of the advantages of the NFS was immediately obvious; as the df output below shows, a diskless workstation can have access to more than a Gigabyte of disk!
[df output: only the column headings "avail capacity Mounted on" and the row fragment "mercury:/usr/mercury 301719" survive.]
The Hard Issues
Several hard design issues were resolved during the development of the NFS. One of the toughest was deciding how we wanted to use the NFS. Lots of flexibility can lead to lots of confusion.
Root Filesystems
Our current NFS implementation does not allow shared NFS root filesystems. There are many hard problems associated with shared root filesystems that we just didn't have time to address. For example, many well-known, machine specific files are on the root filesystem, and too many programs use them. Also, sharing a root filesystem implies sharing /tmp and /dev. Sharing /tmp is a problem because programs create temporary files using their process id, which is not unique across machines. Sharing /dev requires a remote device access system. We considered allowing shared access to /dev by making operations on device nodes appear local. The problem with this simple solution is that many programs make special use of the ownership and permissions of device nodes.
Since every client has private storage (either real disk or ND) for the root filesystem, we were able to move machine specific files from shared filesystems into a new directory called /private, and replace those files with symbolic links. Things like /usr/lib/crontab and the whole directory /usr/adm have been moved. This allows clients to boot with only /etc and /bin executables local. The /usr, and other filesystems are then remote mounted.
Filesystem Naming
Servers export whole filesystems, but clients can mount any sub-directory of a remote filesystem on top of a local filesystem, or on top of another remote filesystem. In fact, a remote filesystem can be mounted more than once, and can even be mounted on another copy of itself! This means that clients can have different "names" for filesystems by mounting them in different places.
To alleviate some of the confusion we use a set of basic mounted filesystems on each machine and then let users add other filesystems on top of that. Remember though that this is just policy, there is no mechanism in the NFS to enforce this. User home directories are mounted on /usr/servername. This may seem like a violation of our goals because hostnames are now part of pathnames but in fact the directories could have been called /usr/1, /usr/2, etc. Using server names is just a convenience. This scheme makes workstations look more like timesharing terminals because a user can log in to any workstation and her home directory will be there. It also makes tilde expansion (~username is expanded to the user's home directory) in the C shell work in a network with many workstations.
To avoid the problems of loop detection and dynamic filesystem access checking, servers do not cross mount points on remote lookup requests. This means that in order to see the same filesystem layout as a server, a client has to remote mount each of the server's exported filesystems.
Credentials, Authentication and Security
We wanted to use UNIX style permission checking on the server and client so that UNIX users would see very little difference between remote and local files. RPC allows different authentication parameters to be "plugged-in" to the packet header of each call so we were able to make the NFS use a UNIX flavor authenticator to pass uid, gid, and groups on each call. The server uses the authentication parameters to do permission checking as if the user making the call were doing the operation locally.
The problem with this authentication method is that the mapping from uid and gid to user must be the same on the server and client. This implies a flat uid, gid space over a whole local network. This is not acceptable in the long run and we are working on different authentication schemes. In the mean time, we have developed another RPC based service called the Yellow Pages (YP) to provide a simple, replicated database lookup service [5]. By letting YP handle /etc/passwd and /etc/group we make the flat uid space much easier to administer.
Another issue related to client authentication is super-user access to remote files. It is not clear that the super-user on a workstation should have root access to files on a server machine through the NFS. To solve this problem the server maps user root (uid 0) to user nobody (uid -2) before checking access permission. This solves the problem but, unfortunately, causes some strange behavior for users logged in as root, since root may have fewer access rights to a file than a normal user.
Remote root access also affects programs which are set-uid root and need access to remote user files, for example lpr. To make these programs more likely to succeed we check on the client side for RPC calls that fail with EACCES and retry the call with the real-uid instead of the effective-uid. This is only done when the effective-uid is zero and the real-uid is something other than zero so normal users are not affected.
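The root mapping on the server and the EACCES retry on the client can be sketched together. The permission model here is a deliberate simplification and an assumption: `server_access` grants access only to the file's owner (as if every file had mode 0600), and all function names are hypothetical.

```python
NOBODY = -2  # the uid that root is mapped to, per the text above


def server_effective_uid(uid):
    """Server side: map root to nobody before any permission check."""
    return NOBODY if uid == 0 else uid


def server_access(owner_uid, caller_uid):
    """Toy check (assumption): only the owner may access the file."""
    return server_effective_uid(caller_uid) == owner_uid


def client_access(owner_uid, effective_uid, real_uid):
    """Client side: try the effective uid, then fall back to the real uid.

    The fallback applies only when the effective uid is zero and the real
    uid is not, so ordinary users are unaffected.
    """
    if server_access(owner_uid, effective_uid):
        return True
    if effective_uid == 0 and real_uid != 0:
        return server_access(owner_uid, real_uid)
    return False
```

In this model a set-uid root program run by user 100 (effective uid 0, real uid 100) can reach a file owned by 100 through the fallback, while a process that is genuinely root (both uids 0) is still treated as nobody.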
While restricting super-user access helps to protect remote files, the super-user on a client machine can still gain access by using su to change her effective-uid to the uid of the owner of a remote file.
Concurrent Access and File Locking
The NFS does not support remote file locking. We purposely did not include this as part of the protocol because we could not find a set of locking facilities that everyone agrees is correct. Instead we plan to build separate, RPC based file locking facilities. In this way people can use the locking facility with the flavor of their choice with minimal effort.
Related to the problem of file locking is concurrent access to remote files by multiple clients. In the local filesystem, file modifications are locked at the inode level. This prevents two processes writing to the same file from intermixing data on a single write. Since the server maintains no locks between requests, and a write may span several RPC requests, two clients writing to the same remote file may get intermixed data on long writes.
UNIX Open File Semantics
We tried very hard to make the NFS client obey UNIX filesystem semantics without modifying the server or the protocol. In some cases this was hard to do. For example, UNIX allows removal of open files. A process can open a file, then remove the directory entry for the file so that it has no name anywhere in the filesystem, and still read and write the file. This is a disgusting bit of UNIX trivia and at first we were just not going to support it, but it turns out that all of the programs that we didn't want to have to fix (csh, sendmail, etc.) use this for temporary files.
What we did to make open file removal work on remote files was check in the client VFS remove operation if the file is open, and if so rename it instead of removing it. This makes it (sort of) invisible to the client and still allows reading and writing. The client kernel then removes the new name when the vnode becomes inactive. We call this the 3/4 solution because if the client crashes between the rename and remove a garbage file is left on the server. An entry to cron can be added to clean up on the server.
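The rename-instead-of-remove trick can be sketched with a toy client and server. All names are hypothetical, including the hidden-name pattern `.nfsNNNN`, which is not stated in the text; the server is reduced to a dictionary of names, and "the vnode becomes inactive" is modelled as the client's `close`.

```python
import itertools


class ToyServer:
    def __init__(self):
        self.names = {}  # file name -> contents

    def rename(self, old, new):
        self.names[new] = self.names.pop(old)

    def remove(self, name):
        del self.names[name]


class ToyClient:
    def __init__(self, server):
        self.server = server
        self.open_files = {}  # name as opened -> current name on the server
        self._ids = itertools.count(1)

    def open(self, name):
        self.open_files[name] = name

    def remove(self, name):
        if name in self.open_files:
            # File is open: hide it under a new name instead of removing it.
            hidden = ".nfs%04d" % next(self._ids)  # hypothetical naming scheme
            self.server.rename(name, hidden)
            self.open_files[name] = hidden
        else:
            self.server.remove(name)

    def read(self, name):
        return self.server.names[self.open_files[name]]

    def close(self, name):
        current = self.open_files.pop(name)
        if current != name:
            # Last reference gone: now remove the hidden name for real.
            self.server.remove(current)


server = ToyServer()
server.names["tmp1"] = b"scratch"
client = ToyClient(server)
client.open("tmp1")
client.remove("tmp1")
```

After the remove, the original name is gone but the data is still readable through the open file. If the client crashed before `close`, the hidden file would stay on the server, which is the garbage the cron entry is meant to clean up.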
Another problem associated with remote, open files is that access permission on the file can change while the file is open. In the local case the access permission is only checked when the file is opened, but in the remote case permission is c

