"Appeared at the USENIX Conference & Exhibition, Portland, Oregon, Summer 1985"
Design and Implementation of the Sun Network Filesystem
Russel Sandberg
David Goldberg
Steve Kleiman
Dan Walsh
Bob Lyon
Sun Microsystems, Inc.
2550 Garcia Ave.
Mountain View, CA. 94110
(415) 960-7293
The Sun Network Filesystem (NFS) provides transparent, remote access to filesystems. Unlike many other remote filesystem implementations under UNIX†, the NFS is designed to be easily portable to other operating systems and machine architectures. It uses an External Data Representation (XDR) specification to describe protocols in a machine and system independent way. The NFS is implemented on top of a Remote Procedure Call package (RPC) to help simplify protocol definition, implementation, and maintenance.
In order to build the NFS into the UNIX 4.2 kernel in a user transparent way, we decided to add a new interface to the kernel which separates generic filesystem operations from specific filesystem implementations. The "filesystem interface" consists of two parts: the Virtual File System (VFS) interface defines the operations that can be done on a filesystem, while the vnode interface defines the operations that can be done on a file within that filesystem. This new interface allows us to implement and install new filesystems in much the same way as new device drivers are added to the kernel.
In this paper we discuss the design and implementation of the filesystem interface in the kernel and the NFS virtual filesystem. We describe some interesting design issues and how they were resolved, and point out some of the shortcomings of the current implementation. We conclude with some ideas for future enhancements.
Design Goals
The NFS was designed to make sharing of filesystem resources in a network of non-homogeneous machines easier. Our goal was to provide a UNIX-like way of making remote files available to local programs without having to modify, or even recompile, those programs. In addition, we wanted remote file access to be comparable in speed to local file access.
The overall design goals of the NFS were:
Machine and Operating System Independence
The protocols used should be independent of UNIX so that an NFS server can supply files to many different types of clients. The protocols should also be simple enough that they can be implemented on low end machines like the PC.
Crash Recovery
When clients can mount remote filesystems from many different servers it is very important that clients be able to recover easily from server crashes.
Transparent Access
We want to provide a system which allows programs to access remote files in exactly the same way as local files. No pathname parsing, no special libraries, no recompiling. Programs should not be able to tell whether a file is remote or local.
UNIX Semantics Maintained on Client
In order for transparent access to work on UNIX machines, UNIX filesystem semantics have to be maintained for remote files.
Reasonable Performance
People will not want to use the NFS if it is no faster than the existing networking utilities, such as rcp, even if it is easier to use. Our design goal is to make NFS as fast as the Sun Network Disk protocol (ND‡), or about 80% as fast as a local disk.
† UNIX is a trademark of Bell Laboratories.
Basic Design
The NFS design consists of three major pieces: the protocol, the server side and the client side.
NFS Protocol
The NFS protocol uses the Sun Remote Procedure Call (RPC) mechanism [1]. For the same reasons that procedure calls help simplify programs, RPC helps simplify the definition, organization, and implementation of remote services. The NFS protocol is defined in terms of a set of procedures, their arguments and results, and their effects. Remote procedure calls are synchronous, that is, the client blocks until the server has completed the call and returned the results. This makes RPC very easy to use since it behaves like a local procedure call.
The NFS uses a stateless protocol. The parameters to each procedure call contain all of the information necessary to complete the call, and the server does not keep track of any past requests. This makes crash recovery very easy; when a server crashes, the client resends NFS requests until a response is received, and the server does no crash recovery at all. When a client crashes no recovery is necessary for either the client or the server. When state is maintained on the server, on the other hand, recovery is much harder. Both client and server need to be able to reliably detect crashes. The server needs to detect client crashes so that it can discard any state it is holding for the client, and the client must detect server crashes so that it can rebuild the server's state.
Using a stateless protocol allows us to avoid complex crash recovery and simplifies the protocol. If a client just resends requests until a response is received, data will never be lost due to a server crash. In fact the client cannot tell the difference between a server that has crashed and recovered, and a server that is slow.
Sun's remote procedure call package is designed to be transport independent. New transport protocols can be "plugged in" to the RPC implementation without affecting the higher level protocol code. The NFS uses the ARPA User Datagram Protocol (UDP) and Internet Protocol (IP) for its transport level. Since UDP is an unreliable datagram protocol, packets can get lost, but because the NFS protocol is stateless and the NFS requests are idempotent, the client can recover by retrying the call until the packet gets through.
The most common NFS procedure parameter is a structure called a file handle (fhandle or fh) which is provided by the server and used by the client to reference a file. The fhandle is opaque, that is, the client never looks at the contents of the fhandle, but uses it when operations are done on that file.
An outline of the NFS protocol procedures is given below. For the complete specification see the Sun Network Filesystem Protocol Specification [2].
null() returns ()
Do nothing procedure to ping the server and measure round trip time.
lookup(dirfh, name) returns (fh, attr)
Returns a new fhandle and attributes for the named file in a directory.
create(dirfh, name, attr) returns (newfh, attr)
Creates a new file and returns its fhandle and attributes.
remove(dirfh, name) returns (status)
Removes a file from a directory.
getattr(fh) returns (attr)
Returns file attributes. This procedure is like a stat call.
‡ ND, the Sun Network Disk Protocol, provides block-level access to remote, sub-partitioned disks.
setattr(fh, attr) returns (attr)
Sets the mode, uid, gid, size, access time, and modify time of a file. Setting the size to zero truncates the file.
read(fh, offset, count) returns (attr, data)
Returns up to count bytes of data from a file starting offset bytes into the file. read also returns the attributes of the file.
write(fh, offset, count, data) returns (attr)
Writes count bytes of data to a file beginning offset bytes from the beginning of the file. Returns the attributes of the file after the write takes place.
rename(dirfh, name, tofh, toname) returns (status)
Renames the file name in the directory dirfh, to toname in the directory tofh.
link(dirfh, name, tofh, toname) returns (status)
Creates the file toname in the directory tofh, which is a link to the file name in the directory dirfh.
symlink(dirfh, name, string) returns (status)
Creates a symbolic link name in the directory dirfh with value string. The server does not interpret the string argument in any way, just saves it and makes an association to the new symbolic link file.
readlink(fh) returns (string)
Returns the string which is associated with the symbolic link file.
mkdir(dirfh, name, attr) returns (fh, newattr)
Creates a new directory name in the directory dirfh, and returns the new fhandle and attributes.
rmdir(dirfh, name) returns (status)
Removes the empty directory name from the parent directory dirfh.
readdir(dirfh, cookie, count) returns (entries)
Returns up to count bytes of directory entries from the directory dirfh. Each entry contains a file name, file id, and an opaque pointer to the next directory entry called a cookie. The cookie is used in subsequent readdir calls to start reading at a specific entry in the directory. A readdir call with the cookie of zero returns entries starting with the first entry in the directory.
statfs(fh) returns (fsstats)
Returns filesystem information such as block size, number of free blocks, etc.
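The cookie mechanism of readdir can be sketched in C as follows. This is a minimal illustration, not the protocol's wire format: the directory contents, srv_readdir, list_dir, and the end-of-directory flag are assumptions, and count is measured in entries rather than bytes to keep it short. The client starts with a cookie of zero and resumes from the cookie of the last entry it received; it never interprets the cookie itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical directory held by a simulated server. */
static const char *dir_names[] = { ".", "..", "a.out", "core", "notes" };
#define DIR_LEN (sizeof dir_names / sizeof dir_names[0])

struct entry {
    const char *name;
    unsigned long cookie;   /* opaque to the client: where to resume */
};

/* Simulated readdir(dirfh, cookie, count): returns up to `max` entries
 * starting at `cookie` and sets *eof once the directory is exhausted.
 * Here the cookie happens to be an index; the client does not rely on that. */
static size_t srv_readdir(unsigned long cookie, struct entry *out,
                          size_t max, int *eof)
{
    size_t n = 0;
    while (n < max && cookie < DIR_LEN) {
        out[n].name = dir_names[cookie];
        out[n].cookie = cookie + 1;   /* points at the next entry */
        n++;
        cookie++;
    }
    *eof = (cookie >= DIR_LEN);
    return n;
}

/* Client loop: cookie 0 first, then the cookie of the last entry seen. */
static size_t list_dir(const char **names, size_t cap)
{
    struct entry buf[2];              /* small batches force several calls */
    unsigned long cookie = 0;
    size_t total = 0;
    int eof = 0;

    while (!eof) {
        size_t n = srv_readdir(cookie, buf, 2, &eof);
        for (size_t i = 0; i < n && total < cap; i++)
            names[total++] = buf[i].name;
        if (n > 0)
            cookie = buf[n - 1].cookie;
    }
    return total;
}
```

Because the cookie travels with each request, the server keeps no per-client position in the directory, which is consistent with the stateless design.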
New fhandles are returned by the lookup, create, and mkdir procedures which also take an fhandle as an argument. The first remote fhandle, for the root of a filesystem, is obtained by the client using another RPC based protocol. The MOUNT protocol takes a directory pathname and returns an fhandle if the client has access permission to the filesystem which contains that directory. The reason for making this a separate protocol is that this makes it easier to plug in new filesystem access checking methods, and it separates out the operating system dependent aspects of the protocol. Note that the MOUNT protocol is the only place that UNIX pathnames are passed to the server. In other operating system implementations the MOUNT protocol can be replaced without having to change the NFS protocol.
The NFS protocol and RPC are built on top of an External Data Representation (XDR) specification [3]. XDR defines the size, byte order and alignment of basic data types such as string, integer, union, boolean and array. Complex structures can be built from the basic data types. Using XDR not only makes protocols machine and language independent, it also makes them easy to define. The arguments and results of RPC procedures are defined using an XDR data definition language that looks a lot like C declarations.
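As a concrete illustration of the XDR rules: integers occupy four bytes, most significant byte first, and strings are a four-byte length followed by the bytes, zero-padded to a multiple of four. The helpers below are hypothetical and write into a plain byte buffer; they are not the Sun XDR library's stream interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode an unsigned 32-bit integer in XDR (big-endian) order. */
static size_t xdr_put_u32(unsigned char *buf, uint32_t v)
{
    buf[0] = (unsigned char)(v >> 24);
    buf[1] = (unsigned char)(v >> 16);
    buf[2] = (unsigned char)(v >> 8);
    buf[3] = (unsigned char)v;
    return 4;
}

/* Encode a string: 4-byte length, the bytes, then zero padding so the
 * next item starts on a 4-byte boundary. Returns bytes written. */
static size_t xdr_put_string(unsigned char *buf, const char *s)
{
    size_t len = strlen(s);
    size_t pad = (4 - len % 4) % 4;
    size_t off = xdr_put_u32(buf, (uint32_t)len);

    memcpy(buf + off, s, len);
    memset(buf + off + len, 0, pad);
    return off + len + pad;
}
```

For example, "abc" encodes as the eight bytes 00 00 00 03 61 62 63 00, identical on every machine regardless of its native byte order.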
Server Side
Because the NFS server is stateless, as mentioned above, when servicing an NFS request it must commit any modified data to stable storage before returning results. The implication for UNIX based servers is that requests which modify the filesystem must flush all modified data to disk before returning from the call. This means that, for example on a write request, not only the data block, but also any modified indirect blocks and the block containing the inode must be flushed if they have been modified.
Another modification to UNIX necessary to make the server work is the addition of a generation number in the inode, and a filesystem id in the superblock. These extra numbers make it possible for the server to use the inode number, inode generation number, and filesystem id together as the fhandle for a file. The inode generation number is necessary because the server may hand out an fhandle with an inode number of a file that is later removed and the inode reused. When the original fhandle comes back, the server must be able to tell that this inode number now refers to a different file. The generation number has to be incremented every time the inode is freed.
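The following C sketch shows how a server might pack these three numbers into an opaque fhandle and detect a stale handle after the inode is freed and reused. The 32-byte handle size follows the NFS version 2 specification; the field layout, the toy inode table, and the function names are assumptions for illustration.

```c
#include <stdint.h>
#include <string.h>

#define FHSIZE 32   /* opaque handle size in the NFS version 2 specification */

typedef struct { unsigned char data[FHSIZE]; } fhandle_t;  /* what clients see */

/* Server-private layout packed into the opaque bytes (hypothetical). */
struct fh_fields {
    uint32_t fsid;   /* filesystem id from the superblock */
    uint32_t ino;    /* inode number */
    uint32_t gen;    /* inode generation number */
};

/* Hypothetical inode table: generation number per inode. */
#define NINODE 16
static uint32_t inode_gen[NINODE];

/* Build a handle for inode `ino` (must be < NINODE). */
static fhandle_t make_fh(uint32_t fsid, uint32_t ino)
{
    fhandle_t fh;
    struct fh_fields f = { fsid, ino, inode_gen[ino] };

    memset(&fh, 0, sizeof fh);
    memcpy(fh.data, &f, sizeof f);
    return fh;
}

/* Freeing an inode bumps its generation, invalidating old handles. */
static void free_inode(uint32_t ino) { inode_gen[ino]++; }

/* Returns the inode number, or -1 if the handle is stale or foreign. */
static long fh_to_ino(const fhandle_t *fh, uint32_t my_fsid)
{
    struct fh_fields f;

    memcpy(&f, fh->data, sizeof f);
    if (f.fsid != my_fsid || f.ino >= NINODE || f.gen != inode_gen[f.ino])
        return -1;
    return (long)f.ino;
}
```

Without the generation check, the old handle would silently resolve to whatever new file reused inode 3.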
Client Side
The client side provides the transparent interface to the NFS. To make transparent access to remote files work we had to use a method of locating remote files that does not change the structure of path names. Some UNIX based remote file access schemes use host:path to name remote files. This does not allow real transparent access since existing programs that parse pathnames have to be modified.
Rather than doing a "late binding" of file address, we decided to do the hostname lookup and file address binding once per filesystem by allowing the client to attach a remote filesystem to a directory using the mount program. This method has the advantage that the client only has to deal with hostnames once, at mount time. It also allows the server to limit access to filesystems by checking client credentials. The disadvantage is that remote files are not available to the client until a mount is done.
Transparent access to different types of filesystems mounted on a single machine is provided by a new filesystem interface in the kernel. Each "filesystem type" supports two sets of operations: the Virtual Filesystem (VFS) interface defines the procedures that operate on the filesystem as a whole; and the Virtual Node (vnode) interface defines the procedures that operate on an individual file within that filesystem type. Figure 1 is a schematic diagram of the filesystem interface and how the NFS uses it.
[Figure 1: The Filesystem Interface. Diagram labels: System Calls, PC Filesystem, 4.2 Filesystem, NFS Filesystem, Server Routines, Network.]
The VFS interface is implemented using a structure that contains the operations that can be done on a whole filesystem. Likewise, the vnode interface is a structure that contains the operations that can be done on a node (file or directory) within a filesystem. There is one VFS structure per mounted filesystem in the kernel and one vnode structure for each active node. Using this abstract data type implementation allows the kernel to treat all filesystems and nodes in the same way without knowing which underlying filesystem implementation it is using.
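The "structure of operations" is the familiar C pattern of a table of function pointers. The sketch below is a simplified, hypothetical version with a single operation and invented names (VOP_GETATTR, vn_getattr, v_op, v_data); it shows generic code calling getattr on a local and a remote vnode without knowing which filesystem type is underneath.

```c
#include <stddef.h>

struct vnode;

struct vattr { long size; };

/* Per-filesystem-type table of vnode operations (a small subset). */
struct vnodeops {
    int (*vn_getattr)(struct vnode *vp, struct vattr *va);
};

struct vnode {
    const struct vnodeops *v_op;   /* set by the filesystem that owns it */
    void *v_data;                  /* private: inode, NFS node, ... */
};

/* Generic kernel code calls through the table, never the type directly. */
#define VOP_GETATTR(vp, va) ((vp)->v_op->vn_getattr((vp), (va)))

/* A "local" filesystem whose private data is the file size. */
static int ufs_getattr(struct vnode *vp, struct vattr *va)
{
    va->size = *(long *)vp->v_data;
    return 0;
}
static const struct vnodeops ufs_vnodeops = { ufs_getattr };

/* A "remote" filesystem; a real one would send a getattr RPC here. */
static int nfs_getattr(struct vnode *vp, struct vattr *va)
{
    (void)vp;
    va->size = 4096;               /* pretend reply from the server */
    return 0;
}
static const struct vnodeops nfs_vnodeops = { nfs_getattr };
```

Adding a new filesystem type then means supplying a new operations table, much as a new device driver supplies its entry points.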
Each vnode contains a pointer to its parent VFS and a pointer to a mounted-on VFS. This means that any node in a filesystem tree can be a mount point for another filesystem. A root operation is provided in the VFS to return the root vnode of a mounted filesystem. This is used by the pathname traversal routines in the kernel to bridge mount points. The root operation is used instead of just keeping a pointer so that the root vnode for each mounted filesystem can be released. The VFS of a mounted filesystem also contains a back pointer to the vnode on which it is mounted so that pathnames that include ".." can also be traversed across mount points.
In addition to the VFS and vnode operations, each filesystem type must provide mount and mount_root operations to mount normal and root filesystems. The operations defined for the filesystem interface are:
Filesystem Operations
mount( varies )                                         System call to mount filesystem
mount_root( )                                           Mount filesystem as root
VFS Operations
unmount(vfs)                                            Unmount filesystem
root(vfs) returns(vnode)                                Return the vnode of the filesystem root
statfs(vfs) returns(fsstatbuf)                          Return filesystem statistics
sync(vfs)                                               Flush delayed write blocks
Vnode Operations
open(vnode, flags)                                      Mark file open
close(vnode, flags)                                     Mark file closed
rdwr(vnode, uio, rwflag, flags)                         Read or write a file
ioctl(vnode, cmd, data, rwflag)                         Do I/O control operation
select(vnode, rwflag)                                   Do select
getattr(vnode) returns(attr)                            Return file attributes
setattr(vnode, attr)                                    Set file attributes
access(vnode, mode)                                     Check access permission
lookup(dvnode, name) returns(vnode)                     Look up file name in a directory
create(dvnode, name, attr, excl, mode) returns(vnode)   Create a file
remove(dvnode, name)                                    Remove a file name from a directory
link(vnode, todvnode, toname)                           Link to a file
rename(dvnode, name, todvnode, toname)                  Rename a file
mkdir(dvnode, name, attr) returns(dvnode)               Create a directory
rmdir(dvnode, name)                                     Remove a directory
readdir(dvnode) returns(entries)                        Read directory entries
symlink(dvnode, name, attr, to_name)                    Create a symbolic link
readlink(vp) returns(data)                              Read the value of a symbolic link
fsync(vnode)                                            Flush dirty blocks of a file
inactive(vnode)                                         Mark vnode inactive and do clean up
bmap(vnode, blk) returns(devnode, mappedblk)            Map block number
strategy(bp)                                            Read and write filesystem blocks
bread(vnode, blockno) returns(buf)                      Read a block
brelse(vnode, buf)                                      Release a block buffer
Notice that many of the vnode procedures map one-to-one with NFS protocol procedures, while other, UNIX dependent procedures such as open, close, and ioctl do not. The bmap, strategy, bread, and brelse procedures are used to do reading and writing using the buffer cache.
Pathname traversal is done in the kernel by breaking the path into directory components and doing a lookup call through the vnode for each component. At first glance it seems like a waste of time to pass only one component with each call instead of passing the whole path and receiving back a target vnode. The main reason for this is that any component of the path could be a mount point for another filesystem, and the mount information is kept above the vnode implementation level. In the NFS filesystem, passing whole pathnames would force the server to keep track of all of the mount points of its clients in order to determine where to break the pathname and this would violate server statelessness. The inefficiency of looking up one component at a time is alleviated with a cache of directory vnodes.
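A minimal C sketch of this traversal, assuming a toy in-memory tree in place of real vnodes (struct node, lookup, and traverse are hypothetical), shows why one component at a time is enough: the generic loop checks after every lookup whether the result is a mount point and, if so, continues in the mounted filesystem's root. The directory vnode cache and ".." handling are omitted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory nodes standing in for vnodes. */
struct node {
    const char  *name;
    struct node *child;         /* first entry, if a directory */
    struct node *sibling;       /* next entry in the same directory */
    struct node *mounted_root;  /* root of a filesystem mounted here */
};

/* One-component lookup: all a filesystem (local or NFS) is asked to do. */
static struct node *lookup(struct node *dir, const char *name, size_t len)
{
    for (struct node *n = dir->child; n != NULL; n = n->sibling)
        if (strlen(n->name) == len && strncmp(n->name, name, len) == 0)
            return n;
    return NULL;
}

/* Generic traversal above the filesystem level: split the path, look up
 * each component, and bridge mount points as they are encountered. */
static struct node *traverse(struct node *root, const char *path)
{
    struct node *cur = root;

    while (*path != '\0') {
        while (*path == '/')
            path++;
        if (*path == '\0')
            break;
        size_t len = strcspn(path, "/");
        cur = lookup(cur, path, len);
        if (cur == NULL)
            return NULL;
        while (cur->mounted_root != NULL)   /* cross into mounted fs */
            cur = cur->mounted_root;
        path += len;
    }
    return cur;
}
```

Because mount points are resolved in traverse rather than inside lookup, a remote server never needs to know where its clients have mounted other filesystems.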
Implementation
Implementation of the NFS started in March 1984. The first step in the implementation was modification of the 4.2 kernel to include the filesystem interface. By June we had the first "vnode kernel" running. We did some benchmarks to test the amount of overhead added by the extra interface. It turned out that in most cases the difference was not measurable, and in the worst case the kernel had only slowed down by about 2%. Most of the work in adding the new interface was in finding and fixing all of the places in the kernel that used inodes directly, and code that contained implicit knowledge of inodes or disk layout.
Only a few of the filesystem routines in the kernel had to be completely rewritten to use vnodes. Namei, the routine that does pathname lookup, was changed to use the vnode lookup operation, and cleaned up so that it doesn't use global state. The direnter routine, which adds new directory entries (used by create, rename, etc.), also had to be fixed because it depended on the global state from namei. Direnter also had to be modified to do directory locking during directory rename operations because inode locking is no longer available at this level, and vnodes are never locked.
To avoid having a fixed upper limit on the number of active vnode and VFS structures we added a memory allocator to the kernel so that these and other structures can be allocated and freed dynamically.
A new system call, getdirentries, was added to read directory entries from different types of filesystems. The 4.2 readdir library routine was modified to use the new system call so programs would not have to be rewritten. This change does, however, mean that programs that use readdir have to be relinked.
Beginning in March, the user level RPC and XDR libraries were ported to the kernel and we were able to make kernel to user and kernel to kernel RPC calls in June. We worked on RPC performance for about a month until the round trip time for a kernel to kernel null RPC call was 8.8 milliseconds. The performance tuning included several speed ups to the UDP and IP code in the kernel.
Once RPC and the vnode kernel were in place the implementation of NFS was simply a matter of writing the XDR routines to do the NFS protocol, implementing an RPC server for the NFS procedures in the kernel, and implementing a filesystem interface which translates vnode operations into NFS remote procedure calls. The first NFS kernel was up and running in mid August. At this point we had to make some modifications to the vnode interface to allow the NFS server to do synchronous write operations. This was necessary since unwritten blocks in the server's buffer cache are part of the "client's state".
Our first implementation of the MOUNT protocol was built into the NFS protocol. It wasn't until later that we broke the MOUNT protocol into a separate, user level RPC service. The MOUNT server is a user level daemon that is started automatically when a mount request comes in. It checks the file /etc/exports which contains a list of exported filesystems and the clients that can import them. If the client has import permission, the mount daemon does a getfh system call to convert a pathname into an fhandle which is returned to the client.
On the client side, the mount command was modified to take additional arguments including a filesystem type and options string. The filesystem type allows one mount command to mount any type of filesystem. The options string is used to pass optional flags to the different filesystem mount system calls. For example, the NFS allows two flavors of mount, soft and hard. A hard mounted filesystem will retry NFS calls forever if the server goes down, while a soft mount gives up after a while and returns an error. The problem with soft mounts is that most UNIX programs are not very good about checking return status from system calls so you can get some strange behavior when servers go down. A hard mounted filesystem, on the other hand, will never fail due to a server crash; it may cause processes to hang for a while, but data will not be lost.
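The difference between the two flavors comes down to the client's retry loop. In this sketch the transport hook, the mntopts fields, and the simulated lossy server are assumptions, and the timeout between attempts is left out; a soft mount returns ETIMEDOUT after a fixed number of attempts, while a hard mount keeps resending.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical transport hook: sends one request and returns true
 * if a reply arrived before the timeout. */
typedef bool (*xmit_fn)(void *ctx);

struct mntopts {
    bool hard;      /* hard mount: retry forever */
    int  retrans;   /* soft mount: attempts before giving up */
};

/* Issue an idempotent NFS call, retrying as the mount options allow.
 * Returns 0 on success, or ETIMEDOUT when a soft mount gives up. */
static int nfs_call(const struct mntopts *mo, xmit_fn xmit, void *ctx)
{
    int tries = 0;

    for (;;) {
        if (xmit(ctx))
            return 0;
        /* A real client would wait out a (typically growing) timeout
         * before resending; that is omitted here. */
        if (!mo->hard && ++tries >= mo->retrans)
            return ETIMEDOUT;
    }
}

/* Simulated server that drops the first `drops` requests. */
struct lossy { int drops; int seen; };

static bool lossy_xmit(void *ctx)
{
    struct lossy *l = ctx;
    return l->seen++ >= l->drops;
}
```

The ETIMEDOUT returned by a soft mount is exactly the kind of error that programs which ignore system call status will mishandle.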
In addition to the MOUNT server, we have added NFS server daemons. These are user level processes that make an nfsd system call into the kernel, and never return. This provides a user context to the kernel NFS server which allows the server to sleep. Similarly, the block I/O daemon, on the client side, is a user level process that lives in the kernel and services asynchronous block I/O requests. Because the RPC requests are blocking, a user context is necessary to wait for read-ahead and write-behind requests to complete. These daemons provide a temporary solution to the problem of handling parallel, synchronous requests in the kernel. In the future we hope to use a light-weight process mechanism in the kernel to handle these requests [4].
The NFS group started using the NFS in September, and spent the next six months working on performance enhancements and administrative tools to make the NFS easier to install and use. One of the advantages of the NFS was immediately obvious; as the df output below shows, a diskless workstation can have access to more than a Gigabyte of disk!
[df output: columns include avail, capacity, and Mounted on; recoverable entries include /dev/nd0, mercury:/usr/mercury, and a filesystem mounted on /usr/src.]
The Hard Issues
Several hard design issues were resolved during the development of the NFS. One of the toughest was deciding how we wanted to use the NFS. Lots of flexibility can lead to lots of confusion.
Root Filesystems
Our current NFS implementation does not allow shared NFS root filesystems. There are many hard problems associated with shared root filesystems that we just didn't have time to address. For example, many well-known, machine specific files are on the root filesystem, and too many programs use them. Also, sharing a root filesystem implies sharing /tmp and /dev. Sharing /tmp is a problem because programs create temporary files using their process id, which is not unique across machines. Sharing /dev requires a remote device access system. We considered allowing shared access to /dev by making operations on device nodes appear local. The problem with this simple solution is that many programs make special use of the ownership and permissions of device nodes.
Since every client has private storage (either real disk or ND) for the root filesystem, we were able to move machine specific files from shared filesystems into a new directory called /private, and replace those files with symbolic links. Things like /usr/lib/crontab and the whole directory /usr/adm have been moved. This allows clients to boot with only /etc and /bin executables local. The /usr, and other filesystems are then remote mounted.
Filesystem Naming
Servers export whole filesystems, but clients can mount any sub-directory of a remote filesystem on top of a local filesystem, or on top of another remote filesystem. In fact, a remote filesystem can be mounted more than once, and can even be mounted on another copy of itself! This means that clients can have different "names" for filesystems by mounting them in different places.
To alleviate some of the confusion we use a set of basic mounted filesystems on each machine and then let users add other filesystems on top of that. Remember though that this is just policy; there is no mechanism in the NFS to enforce this. User home directories are mounted on /usr/servername. This may seem like a violation of our goals because hostnames are now part of pathnames, but in fact the directories could have been called /usr/1, /usr/2, etc. Using server names is just a convenience. This scheme makes workstations look more like timesharing terminals because a user can log in to any workstation and her home directory will be there. It also makes tilde expansion (~username is expanded to the user's home directory) in the C shell work in a network with many workstations.
To avoid the problems of loop detection and dynamic filesystem access checking, servers do not cross mount points on remote lookup requests. This means that in order to see the same filesystem layout as a server, a client has to remote mount each of the server's exported filesystems.
Credentials, Authentication and Security
We wanted to use UNIX style permission checking on the server and client so that UNIX users would see very little difference between remote and local files. RPC allows different authentication parameters to be "plugged-in" to the packet header of each call so we were able to make the NFS use a UNIX flavor authenticator to pass uid, gid, and groups on each call. The server uses the authentication parameters to do permission checking as if the user making the call were doing the operation locally.
The problem with this authentication method is that the mapping from uid and gid to user must be the same on the server and client. This implies a flat uid, gid space over a whole local network. This is not acceptable in the long run and we are working on different authentication schemes. In the meantime, we have developed another RPC based service called the Yellow Pages (YP) to provide a simple, replicated database lookup service [5]. By letting YP handle /etc/passwd and /etc/group we make the flat uid space much easier to administer.
Another issue related to client authentication is super-user access to remote files. It is not clear that the super-user on a workstation should have root access to files on a server machine through the NFS. To solve this problem the server maps user root (uid 0) to user nobody (uid -2) before checking access permission. This solves the problem but, unfortunately, causes some strange behavior for users logged in as root, since root may have fewer access rights to a file than a normal user.
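The server-side mapping is small enough to show directly. In this sketch the credential structure and the owner/other permission check are simplified assumptions (group bits are ignored); it illustrates how a client's root can end up with less access than the file's owner.

```c
#define UID_ROOT   0
#define UID_NOBODY (-2)          /* "nobody", as described above */

/* Hypothetical subset of a UNIX-flavor RPC credential. */
struct authunix {
    int uid;
    int gid;
};

/* Applied by the server before any permission check, so the super-user
 * on a client gets no special privilege on the server. */
static struct authunix map_root(struct authunix cred)
{
    if (cred.uid == UID_ROOT)
        cred.uid = UID_NOBODY;
    return cred;
}

/* Simplified read check: owner bit for the owner, "other" bit otherwise. */
static int may_read(struct authunix cred, int owner_uid, int mode)
{
    cred = map_root(cred);
    if (cred.uid == owner_uid)
        return (mode & 0400) != 0;
    return (mode & 0004) != 0;
}
```

With a file owned by uid 100 and mode 0600, the owner can read it but a client's root, now treated as nobody, cannot.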
Remote root access also affects programs which are set-uid root and need access to remote user files, for example lpr. To make these programs more likely to succeed we check on the client side for RPC calls that fail with EACCES and retry the call with the real-uid instead of the effective-uid. This is only done when the effective-uid is zero and the real-uid is something other than zero so normal users are not affected.
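A sketch of the client-side fallback, assuming a hypothetical RPC hook that takes the uid to send and returns 0 or an errno value, and a simulated server that admits only uid 100:

```c
#include <errno.h>
#include <stddef.h>

struct cred { int ruid; int euid; };

/* Hypothetical RPC hook: performs the call as `uid`, returns 0 or errno. */
typedef int (*rpc_fn)(int uid, void *args);

static int call_with_root_fallback(const struct cred *cr, rpc_fn rpc,
                                   void *args)
{
    int err = rpc(cr->euid, args);

    /* Retry only for a set-uid-root program run by an ordinary user. */
    if (err == EACCES && cr->euid == 0 && cr->ruid != 0)
        err = rpc(cr->ruid, args);
    return err;
}

/* Simulated server: only uid 100 may access the file (root is mapped
 * to nobody on the server, so it is refused). */
static int fake_rpc(int uid, void *args)
{
    (void)args;
    return uid == 100 ? 0 : EACCES;
}
```

A set-uid program run by uid 100 succeeds on the retry, while a real root login and unrelated users still get EACCES.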
While restricting super-user access helps to protect remote files, the super-user on a client machine can still gain access by using su to change her effective-uid to the uid of the owner of a remote file.
Concurrent Access and File Locking
The NFS does not support remote file locking. We purposely did not include this as part of the protocol because we could not find a set of locking facilities that everyone agrees is correct. Instead we plan to build separate, RPC based file locking facilities. In this way people can use the locking facility with the flavor of their choice with minimal effort.
Related to the problem of file locking is concurrent access to remote files by multiple clients. In the local filesystem, file modifications are locked at the inode level. This prevents two processes writing to the same file from intermixing data on a single write. Since the server maintains no locks between requests, and a write may span several RPC requests, two clients writing to the same remote file may get intermixed data on long writes.
UNIX Open File Semantics
We tried very hard to make the NFS client obey UNIX filesystem semantics without modifying the server or the protocol. In some cases this was hard to do. For example, UNIX allows removal of open files. A process can open a file, then remove the directory entry for the file so that it has no name anywhere in the filesystem, and still read and write the file. This is a disgusting bit of UNIX trivia and at first we were just not going to support it, but it turns out that all of the programs that we didn't want to have to fix (csh, sendmail, etc.) use this for temporary files.
What we did to make open file removal work on remote files was check in the client VFS remove operation if the file is open, and if so rename it instead of removing it. This makes it (sort of) invisible to the client and still allows reading and writing. The client kernel then removes the new name when the vnode becomes inactive. We call this the 3/4 solution because if the client crashes between the rename and remove a garbage file is left on the server. An entry to cron can be added to clean up on the server.
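The remove-versus-rename decision can be sketched as below. The node structure, the ".nfs" prefix for the temporary name, and the counter used to make it unique are assumptions; the comments mark where a real client would issue the rename and remove RPCs.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical client-side state for one remote file. */
struct nfsnode {
    char name[32];   /* current name in its directory on the server */
    int  opencount;  /* opens of this file on this client */
    int  unlinked;   /* set when a remove was turned into a rename */
};

static unsigned long tmp_seq;  /* makes temporary names distinct */

/* Client VFS remove. Returns 1 if the file was really removed now,
 * 0 if it was renamed because it is still open here. */
static int client_remove(struct nfsnode *np)
{
    if (np->opencount > 0) {
        /* would be a rename RPC to the hidden temporary name */
        snprintf(np->name, sizeof np->name, ".nfs%04lu", ++tmp_seq);
        np->unlinked = 1;
        return 0;
    }
    return 1;        /* would be a remove RPC */
}

/* Called when the vnode becomes inactive. Returns 1 if the temporary
 * name had to be removed. A crash before this call leaves the renamed
 * file behind on the server. */
static int client_inactive(struct nfsnode *np)
{
    if (np->unlinked) {
        np->unlinked = 0;
        return 1;    /* would be a remove RPC for the temporary name */
    }
    return 0;
}
```

The gap between the two RPCs is the missing quarter of the 3/4 solution: nothing on the server knows the renamed file is garbage.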
Another problem associated with remote, open files is that access permission on the file can change while the file is open. In the local case the access permission is only checked when the file is opened, but in the remote case permission is checked on every NFS call. This means that if a client program opens a file, then changes the permission bits so that it no longer has read permission, a subsequent read request will fail. To get around this problem we save the client credentials in the file table at open time, and use them in later file access requests.
Not all of the UNIX open file semantics have been preserved because interactions between two clients using the same remote file cannot be controlled on a single client. For example, if one client opens a file and another client removes that file, the first client's read request will fail even though the file is still open.
Time Skew
Time skew between two clients or a client and a server can cause times associated with a file to be inconsistent. For example, ranlib saves the current time in a library entry, and ld checks the modify time of the library against the time saved in the library. When ranlib is run on a remote file the modify time comes from the server while the current time that gets saved in the library comes from the client. If the server's time is far ahead of the client's it looks to ld like the library is out of date. There were only three programs that we found that were affected by this, ranlib, ls and emacs, so we fixed them.
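The ranlib and ld interaction reduces to one comparison between timestamps taken from two different clocks. In this simplified model (hypothetical function names; the actual ld check may differ in detail), the recorded time comes from the client and the modify time from the server, so a server clock that runs ahead makes a freshly built library look stale.

```c
#include <stdbool.h>
#include <time.h>

/* ld-style check, simplified: the library looks out of date if it was
 * modified after the time ranlib recorded inside it. */
static bool looks_out_of_date(time_t recorded, time_t mtime)
{
    return mtime > recorded;
}

/* Model ranlib on a remote file: the recorded time is the client's clock,
 * while the write that stores it is stamped with the server's clock,
 * which differs from the client's by `server_skew` seconds. */
static bool ranlib_then_ld(time_t client_now, long server_skew)
{
    time_t recorded = client_now;                     /* client clock */
    time_t mtime = client_now + (time_t)server_skew;  /* server clock */
    return looks_out_of_date(recorded, mtime);
}
```

With synchronized clocks the check passes; with the server 300 seconds ahead, the same library is reported as out of date.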
This is a potential problem for any program that compares system time to file modification time. We plan to fix this by limiting the time skew between machines with a time synchronization