OPERATING SYSTEMS REVIEW

A Publication of the
Association for Computing Machinery
Special Interest Group on Operating Systems

OSR Special Issue - Winter 2002
`
`Proceedings of the
`
`Fifth ACM Symposium on
`Operating Systems Design
`and Implementation (OSDI'02)
`
`
`CSCO-1007
`Page 1 of 21
`
`
`
`
OPERATING SYSTEMS REVIEW
`A Publication of the ACM Special Interest Group on Operating Systems
`
Vice-Chairperson
Valérie Issarny
INRIA UR Rocquencourt
Domaine de Voluceau
Rocquencourt - BP 105
78153 Le Chesnay Cedex, France
+33-1-39-63-57-17
(valerie.issarny@inria.fr)
`
`Chairperson
`David Kotz
`Dept. of Computer Science
`Dartmouth College
`6211 Sudikoff Laboratory
`Hanover, NH 03755-3510 USA
`+1-603-646-1439
`(dfk@cs.dartmouth.edu)
`Newsletter Editor
`William M. Waite
`Dept. of Electrical and
`Computer Engineering
`University of Colorado
`Boulder, CO 80309-0425 USA
`+1-303-492-7204
`(William.Waite@Colorado.edu)
`
`Secretary-Treasurer
`Michael Dahlin
`Dept. of Computer Science
`Campus Mail Code: C0500
`University of Texas
`Austin, TX 78712 USA
+1-512-471-9549
(dahlin@cs.utexas.edu)
`Information Director
Amin Vahdat
`D308 LSRC, Box 90129
`Department of Computer Science
`Duke University
`Durham, NC 27708 USA
`+ 1-919-660-6566
`(vahdat@cs.duke.edu)
`
OPERATING SYSTEMS REVIEW is an informal publication of the ACM Special Interest Group on
Operating Systems (SIGOPS), whose scope of interest includes: Computer operating systems and
architecture for multiprogramming, multiprocessing, and time sharing; resource management;
evaluation and simulation; reliability, integrity, and security of data; communications among
computing processors; and computer system modeling analysis.

Membership in SIGOPS (at $15 per annum) is open to ACM members and associate members, and
student members (at $8 per annum). Non-members of SIGOPS may subscribe to OPERATING SYSTEMS
REVIEW at $30 per year. All SIGOPS members receive OPERATING SYSTEMS REVIEW.

SIGOPS membership application forms are available from ACM Headquarters (or see back cover of
this issue), 1515 Broadway, 17th Floor, New York, New York 10036, telephone (212) 869-7440.
Changes of address and other matters pertaining to the SIGOPS mailing list should be directed
to ACM Headquarters, not to any of the SIGOPS officers.

Contributions to OSR are sent to the editor and should be single spaced and printed on one
side of the paper only. Text and diagrams must be contained within a 6.5 inch wide by 8.5 inch
high (165mm by 216mm) area with at least one inch (25mm) top and side margins, a 1.5 inch
(38mm) bottom margin, and be camera-ready (i.e., be an original having sufficient contrast and
density for reproduction). All letters to the editor will be considered for publication unless
accompanied by a request to the contrary. Except for editorial items, all sources of material
appearing in OPERATING SYSTEMS REVIEW will be clearly identified. Items attributed to
individuals will ordinarily be interpreted as personal rather than organizational opinions.
The technical material in this newsletter is not refereed. Deadlines for submitting material
are the 15th of February, May, August, and November. The SIGOPS conference proceedings will be
mailed as the December issue in odd-numbered calendar years, and the ASPLOS conference
proceedings will be mailed in even-numbered calendar years.

OPERATING SYSTEMS REVIEW (ISSN 0163-5980) is published five times a year (January, April,
July, October and December) by the Association for Computing Machinery Inc., 1515 Broadway,
New York, NY 10036. Periodicals postage paid at New York, NY 10001, and at additional mailing
offices. POSTMASTER: Send address changes to OPERATING SYSTEMS REVIEW, ACM, 1515 Broadway,
New York, NY 10036.

Subscriptions: Annual subscription cost of $12.53 is included in the member dues of $15.00
(for students cost is included in $8.00); the non-member annual subscription is $30.00.
`
`
`
`
`
`USENIX Association
`
`
`Proceedings of the
`
`Fifth Symposium on Operating Systems
`
`Design and Implementation
`
`(OSDI '02)
`
`
`Co-sponsored by ACM SIGOPS
`
`In Cooperation with IEEE TCOS
`
`December 9-11, 2002
`Boston, Massachusetts, USA
`
`
`
`
`
`Symposium Organizers
Program Co-Chairs
`David Culler, University of California, Berkeley
`Peter Druschel, Rice University
`Program Committee
`Eric Brewer, University of California, Berkeley
`Miguel Castro, Microsoft Research
`Carla Ellis, Duke University
`Dawson Engler, Stanford University
Deborah Estrin, University of California, Los Angeles/ISI
`Greg Ganger, Carnegie Mellon University
`
`Jim Gray, Microsoft Research
`Jay Lepreau, University of Utah
`Robert Morris, Massachusetts Institute of Technology
`Timothy Roscoe, Intel Research Lab at Berkeley
`Chandu Thekkath, Microsoft Research
`David Wetherall, University of Washington
`Steering Committee
`Michael B. Jones, Microsoft Research
`Frans Kaashoek, Massachusetts Institute of Technology
`The USENIX Association Staff
`
`External Reviewers
`Atul Adya
`Sameer Ajmani
`David Andersen
`Darrell Anderson
`Andrea Arpaci-Dusseau
Remzi Arpaci-Dusseau
`Tuomas Aura
`Godmar Back
`Paul Barham
`Bobby Bhattacharjee
`David Black
`Richard Black
`Trevor Blackwell
`Nikita Borisov
`Thomas Bressoud
`Nevil Brownlee
`Philip Buonadonna
`Andrew Campbell
`John Carter
`Alberto Cerpa
`Surendar Chandra
`Jeff Chase
`Benjie Chen
`Mike Chen
`Peter Chen
`Andy Chou
`Brent Chun
`Patrick Crowley
`Mike Dahlin
`David Dill
`John Douceur
`Eric Eide
`Kevin Eustice
`Ted Faber
`Mike Feeley
`Jason Flinn
`Rick Floyd
`Cedric Fournet
`
`Michael Franklin
`Ayalvadi Ganesh
`Johannes Gehrke
`Garth Gibson
`Daniel Giffin
`Thomer M. Gil
`Steve Gribble
`Dirk Grunwald
`Richard Guy
`Steven Hand
`Timothy Harris
`Chris Hawblitzel
`Mike Hibler
`Michael Hicks
`Wilson Hsieh
`Galen Hunt
`Sitaram Iyer
`Trent Jaeger
`Paul Jardetzky
`Brad Karp
`Kimberly Keeton
`Anne-Marie Kermarrec
`Eddie Kohler
`David Kotz
`Michael Kozuch
`Orran Krieger
`Lakshman Krishnamurthy
`John Kubiatowicz
`Eren Kursun
`Alvin Lebeck
`Philip Levis
`Hank Levy
`Jinyang Li
Kai Li
`Darrell Long
`Ewing Lusk
`Mesaac Makpangou
`Rob Malan
`
`Laurent Massoulie
Roy Maxion
`David Mazieres
`Ethan Miller
`Jeffrey Mogul
`Richard Mortier
`David Mosberger
`Todd Mowry
`Gilles Muller
`Andrew Myers
`David Nagle
`Erich Nahum
`Juan Navarro
`Silvia Nittel
`Brian Noble
`Chris Olston
`Vivek Pai
`Parveen Patel
`Larry Peterson
`David Petrou
`Ian Pratt
`Calton Pu
`Vijay Raghunathan
`Sylvia Ratnasamy
`John Regehr
`Alastair Reid
`Peter Reiher
`Mike Reiter
`Erik Riedel
`Rodrigo Rodrigues
`Michael Roe
`Mendel Rosenblum
`Antony Rowstron
`Constantine Sapuntzakis
`Mahadev Satyanarayanan
`Dan Scales
`Margo Seltzer
`Srinivasan Seshan
`
`Jonathan Shapiro
`Marc Shapiro
`Dan Simon
`Emin Gun Sirer
`Stephen Smalley
`Alex Snoeren
`Ankur Srivastava
`Dave Stewart
`Doug Terry
`Amin Vahdat
`Alistair Veitch
`Geoff Voelker
`Werner Vogels
`James von Behren
`David Wagner
`Carl Waldspurger
`Jonathan Walpole
`Randolph Wang
`Matt Welsh
`Brian White
`Ted Wobber
`Alec Wolman
`Fan Ye
`Erez Zadok
`Ellen Zegura
`Beichuan Zhang
`Ben Zhao
`Lidong Zhou
`Yuanyuan Zhou
`Ben Zorn
`
`
`
`
`OSDI '02: 5th Symposium on
`Operating Systems Design and Implementation
`December 9-11, 2002
`Boston, Massachusetts, USA
`
Message from the Program Chair ............................................................. vii

Index of Authors ............................................................................ ix
`
`Monday, December 9
`
`Decentralized Storage Systems
`Session Chair: Greg Ganger, Carnegie Mellon University
`
FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment .... 1
Atul Adya, William J. Bolosky, Miguel Castro, Gerald Cermak, Ronnie Chaiken, John R. Douceur, Jon Howell,
Jacob R. Lorch, Marvin Theimer, and Roger P. Wattenhofer, Microsoft Research
`
`Taming Aggressive Replication in the Pangaea Wide-Area File System ........ . .. . ... .. .. .. . .. . .... .. .. 15
`Yasushi Saito, Christos Karamanolis, Magnus Karlsson, and Mallik Mahalingam, HP Labs
`
`Ivy: A Read/Write Peer-to-Peer File System .. ... .... ... .. ..... . . . . ... .... .... . .... .. . . .. .... .. .. . . 31
`Athicha Muthitacharoen, Robert Morris, Thomer M. Gil, and Benjie Chen, Massachusetts Institute of Technology
`
`Robustness
`Session Chair: Miguel Castro, Microsoft Research
`
`Defensive Programming: Using an Annotation Toolkit to Build DoS-Resistant Software . .. ...... . .. . ... .... 45
`Xiaohu Qie, Ruoming Pang, and Larry Peterson, Princeton University
`
`Using Model Checking to Debug Device Firmware . . ..... . . . .. .. . .............. .. . .. . . ... . . . ... . .. . 61
`Sanjeev Kumar and Kai Li, Princeton University
`
CMC: A Pragmatic Approach to Model Checking Real Code ......................................... 75
Madanlal Musuvathi, David Y. W. Park, Andy Chou, Dawson R. Engler, and David L. Dill, Stanford University
`
`Kernels
`Session Chair: Carla Ellis, Duke University
`
Practical, Transparent Operating System Support for Superpages ................................ 89
Juan Navarro, Rice University and Universidad Católica de Chile; Sitaram Iyer, Peter Druschel, and Alan Cox,
Rice University
`
`Vertigo: Automatic Performance-Setting for Linux .. ... . .. .. .. ..... . . . .. .. . ... ... . . .... .. ..... .. ... 105
`Krisztian Flautner, ARM Limited; Trevor Mudge, University of Michigan
`
Cooperative I/O: A Novel I/O Semantics for Energy-Aware Applications ......................... 117
Andreas Weissel, Björn Beutel, and Frank Bellosa, University of Erlangen
`
`
`
`
`Physical Interface
`Session Chair: Jay Lepreau, University of Utah
`
`TAG: A Tiny AGgregation Service for Ad-Hoc Sensor Networks . .. ..... . .. . .. . ... .. . .... . . . . .. . . . . . . 131
`Samuel Madden, Michael J. Franklin, and Joseph M. Hellerstein, University of California, Berkeley; Wei Hong,
`Intel Research, Berkeley
`
`Fine-Grained Network Time Synchronization Using Reference Broadcasts .... . . . .... . . . .... .. ... . ... . . 147
`Jeremy Elson, Lewis Girod, and Deborah Estrin, University of California, Los Angeles
`
`Supporting Time-Sensitive Applications on a Commodity OS .. .... . . . . .. . .. ... . . .. . . . ..... ..... .. ... 165
`Ashvin Goel, Luca Abeni, Charles Krasic, Jim Snow, and Jonathan Walpole, Oregon Graduate Institute
`
`Tuesday, December 10
`
`Virtual Machines
`Session Chair: Dawson Engler, Stanford University
`
`Memory Resource Management in VMware ESX Server .... ..... . .. ..... . . .. ..... .. .... . .. . ... .. .. 181
`Carl A. Waldspurger, VMware Inc.
`
`Scale and Performance in the Denali Isolation Kernel . . .... ... . . ... ..... .. . ..... .. ....... . . . ..... .. 195
`Andrew Whitaker, Marianne Shaw, and Steven D. Gribble, University of Washington
`
ReVirt: Enabling Intrusion Analysis Through Virtual-Machine Logging and Replay ............... 211
George W. Dunlap, Samuel T. King, Sukru Cinar, Murtaza A. Basrai, and Peter M. Chen, University of Michigan
`
`Cluster Resource Management
`Session Chair: Eric Brewer, University of California, Berkeley
`
`Integrated Resource Management for Cluster-based Internet Services ..... . . . .. .... ... ... . .... .. .... .. . 225
`Kai Shen, University of Rochester; Hong Tang, University of California, Santa Barbara; Tao Yang, University of
`California, Santa Barbara, and Ask Jeeves, Inc.; Lingkun Chu, University of California, Santa Barbara
`
`Resource Overbooking and Application Profiling in Shared Hosting Platforms .. .. .. ..... ... .. ......... . 239
`Bhuvan Urgaonkar and Prashant Shenoy, University of Massachusetts; Timothy Roscoe, Intel Research, Berkeley
`
`An Integrated Experimental Environment for Distributed Systems and Networks .. .... . . .. . . . . ... . . .. . .. 255
`Brian White, Jay Lepreau, Leigh Stoller, Robert Ricci, Shashi Guruprasad, Mac Newbold, Mike Hibler, Chad
`Barb, and Abhijeet Joglekar, University of Utah
`
`Peer-to-Peer Infrastructure
`Session Chair: Robert Morris, Massachusetts Institute of Technology
`
`Scalability and Accuracy in a Large-Scale Network Emulator ..... ... ...... .... ... ... . . . . ... ... . . ... . 271
`Amin Vahdat, Ken Yocum, Kevin Walsh, Priya Mahadevan, Dejan Kostic, Jeff Chase, and David Becker, Duke
`University
`
Pastiche: Making Backup Cheap and Easy ....................................................... 285
Landon P. Cox, Christopher D. Murray, and Brian D. Noble, University of Michigan
`
`Secure Routing for Structured Peer-to-Peer Overlay Networks . . . . .. .. ........ .. . .. ...... . .. . .. .. .. . . . 299
`Miguel Castro, Microsoft Research; Peter Druschel, Rice University; Ayalvadi Ganesh and Antony Rowstron,
`Microsoft Research; Dan S. Wallach, Rice University
`
`
`
`
`Wednesday, December 11
`
`Network Behavior
`Session Chair: David Wetherall, University of Washington
`
An Analysis of Internet Content Delivery Systems ............................................. 315
Stefan Saroiu, Krishna P. Gummadi, Richard J. Dunn, Steven D. Gribble, and Henry M. Levy, University of
Washington
`
TCP Nice: A Mechanism for Background Transfers ............................................... 329
Arun Venkataramani, Ravi Kokku, and Mike Dahlin, University of Texas at Austin
`
`The Effectiveness of Request Redirection on CDN Robustness .. .... . .. . .. . . .. . . ... . . ..... ....... .. .. 345
`Limin Wang, Vivek Pai, and Larry Peterson, Princeton University
`
`Migration
`Session Chair: Timothy Roscoe, Intel Research, Berkeley
`
`The Design and Implementation of Zap: A System for Migrating Computing Environments .. ....... .. . ... . 361
`Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh, Columbia University
`
Optimizing the Migration of Virtual Computers ................................................ 377
Constantine P. Sapuntzakis, Ramesh Chandra, Ben Pfaff, Jim Chow, Monica S. Lam, and Mendel Rosenblum,
Stanford University
`
`Luna: A Flexible Java Protection System . .. . ... . . . ... . . .. . ........ .. . ..... . ...... . . . . ..... ... ... 391
`Chris Hawblitzel, Dartmouth College; Thorsten von Eicken, Expertcity, Inc.
`
`
`
`
`Ivy: A Read/Write Peer-to-Peer File System
`
`Athicha Muthitacharoen, Robert Morris, Thomer M. Gil, and Benjie Chen
`{ athicha, rtm, thomer, benjie }@lcs.mit.edu
`MIT Laboratory for Computer Science
`200 Technology Square, Cambridge, MA 02139.
`
`Abstract
`
Ivy is a multi-user read/write peer-to-peer file system. Ivy has no centralized or dedicated
components, and it provides useful integrity properties without requiring users to fully trust
either the underlying peer-to-peer storage system or the other users of the file system.

An Ivy file system consists solely of a set of logs, one log per participant. Ivy stores its
logs in the DHash distributed hash table. Each participant finds data by consulting all logs,
but performs modifications by appending only to its own log. This arrangement allows Ivy to
maintain meta-data consistency without locking. Ivy users can choose which other logs to trust,
an appropriate arrangement in a semi-open peer-to-peer system.

Ivy presents applications with a conventional file system interface. When the underlying
network is fully connected, Ivy provides NFS-like semantics, such as close-to-open consistency.
Ivy detects conflicting modifications made during a partition, and provides relevant version
information to application-specific conflict resolvers. Performance measurements on a
wide-area network show that Ivy is two to three times slower than NFS.
`
`1 Introduction
`
This paper describes Ivy, a distributed read/write network file system. Ivy presents a single
file system image that appears much like an NFS [33] file system. In contrast to NFS, Ivy does
not require a dedicated server; instead, it stores all data and meta-data in the DHash [9]
peer-to-peer block storage system. DHash can distribute and replicate blocks, giving Ivy the
potential to be highly available. One possible application of Ivy is to support distributed
projects with loosely affiliated participants.

Building a shared read-write peer-to-peer file system poses a number of challenges. First,
multiple distributed writers make maintenance of consistent file system meta-data difficult.
Second, unreliable participants make locking an unattractive approach for achieving meta-data
consistency. Third, the participants may not fully trust each other, or may not trust that the
other participants' machines have not been compromised by outsiders; thus there should be a
way to ignore or un-do some or all modifications by a participant revealed to be untrustworthy.
Finally, distributing file-system data over many hosts means that the system may have to cope
with operation while partitioned, and may have to help applications repair conflicting updates
made during a partition.
Ivy uses logs to solve the problems described above. Each participant with write access to a
file system maintains a log of changes they have made to the file system. Participants scan
all the logs (most recent record first) to look up file data and meta-data. Each participant
maintains a private snapshot to avoid scanning all but the most recent log entries. The use of
per-participant logs, instead of shared mutable data structures, allows Ivy to avoid using
locks to protect meta-data. Ivy stores its logs in DHash, so a participant's logs are
available even when the participant is not.
`Ivy resists attacks from non-participants, and from
`corrupt DHash servers, by cryptographically verifying
`the data it retrieves from DHash. An Ivy user can cope
`with attacks from other Ivy users by choosing which
`other logs to read when looking for data, and thus which
`other users to trust. Ignoring a log that was once trusted
`might discard useful information or critical meta-data;
`Ivy provides tools to selectively ignore logs and to fix
`broken meta-data.
Ivy provides NFS-like file system semantics when the underlying network is fully connected.
For example, Ivy provides close-to-open consistency. In the case of network partition, DHash
replication may allow participants to modify files in multiple partitions. Ivy's logs contain
version vectors that allow it to detect conflicting updates after partitions merge, and to
provide version information to application-specific conflict resolvers.

The Ivy implementation uses a local NFS loop-back server [22] to provide an ordinary file
system interface. Performance is within a factor of two to three of NFS. The main performance
bottlenecks are network latency and the cost of generating digital signatures on data stored
in DHash.
`This paper makes three contributions. It describes a
`
`
read/write peer-to-peer storage system; previous peer-to-peer systems have supported read-only
data or data writeable by a single publisher. It describes how to design a distributed file
system with useful integrity properties based on a collection of untrusted components.
Finally, it explores the use of distributed hash tables as a building-block for more
sophisticated systems.

Section 2 describes Ivy's design. Section 3 discusses the consistency semantics that Ivy
presents to applications. Section 4 presents tools for dealing with malicious participants.
Sections 5 and 6 describe Ivy's implementation and performance. Section 7 discusses related
work, and Section 8 concludes.
`
`2 Design
`
An Ivy file system consists of a set of logs, one log per participant. A log contains all of
one participant's changes to file system data and meta-data. Each participant appends only to
its own log, but reads from all logs. Participants store log records in the DHash distributed
hash system, which provides per-record replication and authentication. Each participant
maintains a mutable DHash record (called a log-head) that points to the participant's most
recent log record. Ivy uses version vectors [27] to impose a total order on log records when
reading from multiple logs. To avoid the expense of repeatedly reading the whole log, each
participant maintains a private snapshot summarizing the file system state as of a recent
point in time.
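As a rough sketch of how version vectors can yield such a total order (an illustration only:
the field names and the tie-break rule below are assumptions, not Ivy's actual encoding),
records are ordered causally where their vectors are comparable, with a deterministic
tie-break for concurrent records:

```python
# Hypothetical sketch of version-vector ordering; the "writer" and "version"
# field names are invented for illustration.
import functools

def vv_leq(a, b):
    """True if vector a is component-wise <= vector b (happened-before-or-equal)."""
    return all(a.get(k, 0) <= b.get(k, 0) for k in set(a) | set(b))

def order_records(records):
    """Causal order where vectors are comparable; a deterministic tie-break
    (here, the writer's identity) orders concurrent records."""
    def cmp(r1, r2):
        v1, v2 = r1["version"], r2["version"]
        if vv_leq(v1, v2) and not vv_leq(v2, v1):
            return -1
        if vv_leq(v2, v1) and not vv_leq(v1, v2):
            return 1
        # Concurrent (or equal) records: break the tie deterministically.
        return (r1["writer"] > r2["writer"]) - (r1["writer"] < r2["writer"])
    return sorted(records, key=functools.cmp_to_key(cmp))

recs = [
    {"writer": "B", "version": {"A": 1, "B": 1}},
    {"writer": "C", "version": {"A": 1, "C": 1}},  # concurrent with B's record
    {"writer": "A", "version": {"A": 1}},
]
print([r["writer"] for r in order_records(recs)])  # prints ['A', 'B', 'C']
```

A's record causally precedes both of the others; B's and C's records are concurrent, so the
tie-break decides their relative order the same way on every participant.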
The Ivy implementation acts as a local loop-back NFS v3 [6] server, in cooperation with a
host's in-kernel NFS client support. Consequently, Ivy presents file system semantics much
like those of an NFS v3 file server.
`
`2.1 DHash
`
Ivy stores all its data in DHash [9]. DHash is a distributed peer-to-peer hash table mapping
keys to arbitrary values. DHash stores each key/value pair on a set of Internet hosts
determined by hashing the key. This paper refers to a DHash key/value pair as a DHash block.
DHash replicates blocks to avoid losing them if nodes crash.

DHash ensures the integrity of each block with one of two methods. A content-hash block
requires the block's key to be the SHA-1 [10] cryptographic hash of the block's value; this
allows anyone fetching the block to verify the value by ensuring that its SHA-1 hash matches
the key. A public-key block requires the block's key to be a public key, and the value to be
signed using the corresponding private key. DHash refuses to store a value that does not match
the key. Ivy checks the authenticity of all data it retrieves from DHash. These checks prevent
a malicious or buggy DHash node from forging data, limiting it to denying the existence of a
block or producing a stale copy of a public-key block.

Figure 1: Example Ivy view and logs. White boxes are DHash content-hash blocks; gray boxes are
public-key blocks.
Ivy participants communicate only via DHash storage; they don't communicate directly with each
other except when setting up a new file system. Ivy uses DHash content-hash blocks to store
log records. Ivy stores the DHash key of a participant's most recent log record in a DHash
block called the log-head; the log-head is a public-key block, so that the participant can
update its value without changing its key. Each Ivy participant caches content-hash blocks
locally without fear of using stale data, since content-hash blocks are immutable. An Ivy
participant does not cache other participants' log-head blocks, since they may change.
Ivy uses DHash through a simple interface: put(key, value) and get(key). Ivy assumes that,
within any given network partition, DHash provides write-read consistency; that is, if
put(k, v) completes, a subsequent get(k) will yield v. The current DHash implementation does
not guarantee write-read consistency; however, techniques are known which can provide such a
guarantee with high probability [19]. These techniques require that DHash replicate data and
update it carefully, and might significantly decrease performance. Ivy operates best in a
fully connected network, though it has support for conflict detection after operating in a
partitioned network (see Section 3.4).

Ivy would in principle work with other distributed hash tables, such as PAST [32], CAN [29],
Tapestry [41], or Kademlia [21].
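The content-hash half of this scheme can be sketched in a few lines (a toy illustration of the
idea only; the real DHash API, replication, and wire format are not shown in this paper):

```python
# Toy put/get store where a block's key must be the SHA-1 of its value,
# illustrating why a malicious server cannot forge a content-hash block.
import hashlib

class ContentHashStore:
    def __init__(self):
        self._blocks = {}

    def put(self, key, value):
        # Refuse to store a value that does not match its key.
        if hashlib.sha1(value).digest() != key:
            raise ValueError("value does not match content-hash key")
        self._blocks[key] = value

    def get(self, key):
        value = self._blocks[key]
        # The fetcher re-verifies, so a bad server can at worst deny that a
        # block exists -- it cannot substitute forged contents.
        assert hashlib.sha1(value).digest() == key
        return value

data = b"log record bytes"
key = hashlib.sha1(data).digest()
store = ContentHashStore()
store.put(key, data)
assert store.get(key) == data
```

A public-key block would replace the hash check with a signature check against the key, which
is what lets the log-head be updated in place without changing its name.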
`
`2.2 Log Data Structure
`
An Ivy log consists of a linked list of immutable log records. Each log record is a DHash
content-hash block. Table 1 describes fields common to all log records. The prev field
contains the previous record's DHash key. A participant stores the DHash key of its most
recent log record in its log-head block. The log-head is a public-key block with a fixed
DHash key, which makes it easy
`
`
`
`
Field       Use
prev        DHash key of next oldest log record
head        DHash key of log-head
seq         per-log sequence number
timestamp   time at which record was created
version     version vector

Table 1: Fields present in all Ivy log records.
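Rendered as a data structure, the common fields of Table 1 might look like the following
(a hypothetical encoding; the paper does not specify a concrete one):

```python
# Common Ivy log-record fields from Table 1, as an illustrative structure.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LogRecordHeader:
    prev: Optional[bytes]  # DHash key of next oldest log record (None for the oldest)
    head: bytes            # DHash key of this log's log-head
    seq: int               # per-log sequence number
    timestamp: float       # time at which the record was created
    version: dict          # version vector

rec = LogRecordHeader(prev=None, head=b"\x01" * 20, seq=1,
                      timestamp=0.0, version={"A": 1})
```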
`
for other participants to find.

A log record contains information about a single file system modification, and corresponds
roughly to an NFS operation. Table 2 describes the types of log records and the type-specific
fields each contains.

Log records contain the minimum possible information to avoid unnecessary conflicts from
concurrent updates by different participants. For example, a Write log record contains the
newly written data, but not the file's new length or modification time. These attributes
cannot be computed correctly at the time the Write record is created, since the true state of
the file will only be known after all concurrent updates are known. Ivy computes that
information incrementally when traversing the logs, rather than storing it explicitly as is
done in UNIX i-nodes [30].

Ivy records file owners and permission modes, but does not use those attributes to enforce
permissions. A user who wishes to make a file unreadable should instead encrypt the file's
contents. A user should ignore the logs of people who should not be allowed to write the
user's data.

Ivy identifies files and directories using 160-bit i-numbers. Log records contain the
i-number(s) of the files or directories they affect. Ivy chooses i-numbers randomly to
minimize the probability of multiple participants allocating the same i-number for different
files. Ivy uses the 160-bit i-number as the NFS file handle.

Ivy keeps log records indefinitely, because they may be needed to help recover from a
malicious participant or from a network partition.
`
`2.3 Using the Log
`
For the moment, consider an Ivy file system with only one log. Ivy handles non-updating NFS
requests with a single pass through the log. Requests that cause modification use one or more
passes, and then append one or more records to the log. Ivy scans the log starting at the most
recently appended record, pointed to by the log-head. Ivy stops scanning the log once it has
gathered enough data to handle the request.

Ivy appends a record to a log as follows. First, it creates a log record containing a
description of the update, typically derived from arguments in the NFS request. The new
record's prev field is the DHash key of the most recent log record. Then, it inserts the new
record into DHash, signs a new log-head that points to the new log record, and updates the
log-head in DHash.
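The three append steps can be sketched against a generic put/get store (illustration only: the
JSON encoding, the sign() placeholder, and the key handling below are assumptions, not Ivy's
actual formats):

```python
# Sketch of appending a record: create the record -> put it as a
# content-hash block -> sign and publish a new log-head. Encodings invented.
import hashlib
import json
import time

def sign(private_key, message):
    # Placeholder for a real public-key signature over the log-head contents.
    return hashlib.sha1(private_key + message).hexdigest()

def append_record(store, log_head_key, private_key, prev_key, seq, payload):
    # 1. Create a log record whose prev field names the current newest record.
    record = {"prev": prev_key.hex() if prev_key else None,
              "seq": seq, "timestamp": time.time(), **payload}
    data = json.dumps(record, sort_keys=True).encode()
    record_key = hashlib.sha1(data).digest()      # content-hash block key
    # 2. Insert the new record into the store.
    store[record_key] = data
    # 3. Sign a new log-head pointing at the record and update it in place
    #    (the log-head is a public-key block, so its key never changes).
    head = {"latest": record_key.hex(), "sig": sign(private_key, record_key)}
    store[log_head_key] = json.dumps(head).encode()
    return record_key
```

A second append would pass the returned key as prev_key, extending the linked list of
immutable records while only the log-head block ever changes.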
The following text describes how Ivy uses the log to perform selected operations.

File system creation. Ivy builds a new file system by creating a new log with an End record,
an Inode record with a random i-number for the root directory, and a log-head. The user then
mounts the local Ivy server as an NFS file system, using the root i-number as the NFS root
file handle.
File creation. When an application creates a new file, the kernel NFS client code sends the
local Ivy server an NFS CREATE request. The request contains the directory i-number and a file
name. Ivy appends an Inode log record with a new random i-number and a Link record that
contains the i-number, the file's name, and the directory's i-number. Ivy returns the new
file's i-number in a file handle to the NFS client. If the application then writes the file,
the NFS client will send a WRITE request containing the file's i-number, the written data, and
the file offset; Ivy will append a Write log record containing the same information.
File name lookup. System calls such as open() that refer to file names typically generate NFS
LOOKUP requests. A LOOKUP request contains a file name and a directory i-number. Ivy scans the
log to find a Link record with the desired directory i-number and file name, and returns the
file i-number. However, if Ivy first encounters an Unlink record that mentions the same
directory i-number and name, it returns an NFS error indicating that the file does not exist.
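That scan might look like the following sketch (record fields are illustrative, and Rename
handling is omitted for brevity):

```python
# LOOKUP sketch: walk the merged log newest-to-oldest; the first Link or
# Unlink record matching (directory i-number, name) decides the answer.

def lookup(log_newest_first, dir_inum, name):
    """Return the file's i-number, or None if the name does not exist."""
    for rec in log_newest_first:
        if rec.get("dir") == dir_inum and rec.get("name") == name:
            if rec["type"] == "Link":
                return rec["inum"]   # live directory entry
            if rec["type"] == "Unlink":
                return None          # a newer record removed the name
    return None

log = [
    {"type": "Unlink", "dir": 7, "name": "b"},
    {"type": "Link", "dir": 7, "name": "a", "inum": 42},
    {"type": "Link", "dir": 7, "name": "b", "inum": 43},
]
assert lookup(log, 7, "a") == 42      # found
assert lookup(log, 7, "b") is None    # hidden by a newer Unlink
```

Because the scan stops at the first matching record, a newer Unlink correctly hides an older
Link without any in-place mutation of shared state.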
File read. An NFS READ request contains the file's i-number, an offset within the file, and
the number of bytes to read. Ivy scans the log accumulating data from Write records whose
ranges overlap the range of the data to be read, while ignoring data hidden by SetAttrs
records that indicate file truncation.
File attributes. Some NFS requests, including GETATTR, require Ivy to include file attributes
in the reply. Ivy only fully supports the file length, file modification time ("mtime"),
attribute modification time ("ctime"), and link count attributes. Ivy computes these
attributes incrementally as it scans the log. A file's length is determined by either the
write to the highest offset since the last truncation, or by the last truncation. Mtime is
determined by the time stamp in the most recent relevant log record; Ivy must return correct
time attributes because NFS client cache consistency depends on it. Ivy computes the number of
links to a file by counting the number of relevant Link records not canceled by Unlink and
Rename records.
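The incremental computation of length and mtime during a newest-to-oldest scan can be sketched
as follows (record fields are illustrative assumptions, and only truncating SetAttrs records
are modeled):

```python
# Sketch of incremental attribute computation: length is fixed by the
# highest-offset write since the last truncation (or by the truncation
# itself), and mtime by the newest relevant record.

def file_attrs(log_newest_first, inum):
    length, mtime = 0, None
    for rec in log_newest_first:
        if rec.get("inum") != inum:
            continue
        if mtime is None:
            mtime = rec["timestamp"]        # newest relevant record wins
        if rec["type"] == "Write":
            length = max(length, rec["offset"] + len(rec["data"]))
        elif rec["type"] == "SetAttrs" and "size" in rec:
            length = max(length, rec["size"])
            break                           # truncation hides older writes
    return {"length": length, "mtime": mtime}

log = [
    {"type": "Write", "inum": 5, "offset": 10, "data": "abc", "timestamp": 3},
    {"type": "SetAttrs", "inum": 5, "size": 4, "timestamp": 2},
    {"type": "Write", "inum": 5, "offset": 0, "data": "x" * 100, "timestamp": 1},
]
assert file_attrs(log, 5) == {"length": 13, "mtime": 3}
```

The 100-byte write is hidden by the later truncation to 4 bytes, and the newest write (3 bytes
at offset 10) then sets the length to 13.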
`
`
`
Type       Fields                                                                  Meaning
Inode      type (file, directory, or symlink), i-number, mode, owner               create new inode
Write      i-number, offset, data                                                  write data to a file
Link       i-number, i-number of directory, name                                   create a directory entry
Unlink     i-number of directory, name                                             remove a file
Rename     i-number of directory, name, i-number of new directory, new file name   rename a file
Prepare    i-number of directory, file name                                        for exclusive operations
Cancel     i-number of directory, file name                                        for exclusive operations
SetAttrs   i-number, changed attributes                                            change file attributes
End        none                                                                    end of log

Table 2: Summary of Ivy log record types.
`
`Directory listings. Ivy handles READDIR requests
`by accumulating all file names from relevant Link log
`records, taking more recent Unlink and Rename log
`records into account.
`
`adopt the new view block (and newly named file system).
`Ivy's lack of support for automatically adding new users
`to a view is intentional.
`
`2.4 User Cooperation: Views
`
When multiple users write to a single Ivy file system, each source of potentially concurrent
updates must have its own log; this paper refers to such sources as participants. A user who
uses an Ivy file system from multiple hosts concurrently must have one log per host.

The participants in an Ivy file system agree on a view: the set of logs that comprise the file
system. Ivy makes management of shared views convenient by providing a view block, a DHash
content-hash block containing pointers to all log-heads in the view. A view block also
contains the i-number of the root directory. A view block is immutable; if a set of users
wants to form a file system with a different set of logs, they create a new view block.
A user names an Ivy file system with the content-hash key of the view block; this is
essentially a self-certifying pathname [23]. Users creating a new file system must exchange
public keys in advance by some out-of-band means. Once they know each other's public keys, one
of them creates a view block and tells the other users the view block's DHash key.
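A view block and its self-certifying name can be sketched like this (the encoding and field
names are assumptions for illustration; the real block format is not given here):

```python
# Sketch of an immutable view block: a content-hash block listing the
# participants' log-head public keys and the root i-number.
import hashlib
import json

def make_view_block(log_head_pubkeys, root_inum):
    value = json.dumps(
        {"log_heads": sorted(log_head_pubkeys), "root": root_inum},
        sort_keys=True,
    ).encode()
    name = hashlib.sha1(value).hexdigest()  # the file system's name
    return name, value

def verify_view_block(name, value):
    # Anyone holding the name can verify a fetched view block's contents.
    return hashlib.sha1(value).hexdigest() == name

name, block = make_view_block(["pk_alice", "pk_bob"], root_inum="ab12")
assert verify_view_block(name, block)
# Changing the set of logs yields a different name -- a new file system.
name2, _ = make_view_block(["pk_alice", "pk_bob", "pk_carol"], "ab12")
assert name2 != name
```

Because the name is the hash of the contents, a view block cannot be silently altered: adding
or removing a participant necessarily produces a newly named file system.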
Ivy uses the view block key to verify the view block's contents; the contents are the public
keys that name and verify the participants' log-heads. A log-head contains a content-hash key
that names and verifies the most recent log record. It is this reasoning that allows Ivy to
verify it has retrieved correct log records from the untrusted DHash storage system. This
approach requi