throbber
bPERATING
`SYSTEMS
`REVIEW
`
`A Publication of the
`·Association for Computing Machinery
`Special Interest Group on Operating Systems
`
`OSR Special Issue - Winter 2002
`
`Proceedings of the
`
`Fifth ACM Symposium on
`Operating Systems Design
`and Implementation (OSDI'02)
`
`:.·
`
`CSCO-1007
`Page 1 of 21
`
`

`

`I
`
`I
`. I
`
`I
`
`·'I
`I
`
`I
`
`I
`~ I
`
`OPERA TING SYSTEMS REVIEW
`A Publication of the ACM Special Interest Group on Operating Systems
`
`V. Chairperson
`Valerie Issamy
`INRIA Ur Rocquencourt
`Domaine de Voluceau
`Rocquencourt - BP 105 Cedex
`78153 Le Chesnay France
`+33-1-39-63-57-17
`(valerie.issamy@inria.fr)
`
`Chairperson
`David Kotz
`Dept. of Computer Science
`Dartmouth College
`6211 Sudikoff Laboratory
`Hanover, NH 03755-3510 USA
`+1-603-646-1439
`(dfk@cs.dartmouth.edu)
`Newsletter Editor
`William M. Waite
`Dept. of Electrical and
`Computer Engineering
`University of Colorado
`Boulder, CO 80309-0425 USA
`+1-303-492-7204
`(William.Waite@Colorado.edu)
`
`Secretary-Treasurer
`Michael Dahlin
`Dept. of Computer Science
`Campus Mail Code: C0500
`University of Texas
`Austin, TX 78712 USA
`+1-512-471 -9549
`(dahlin@cs.utexas .edu)
`Information Director
`Amin V ahdat
`D308 LSRC, Box 90129
`Department of Computer Science
`Duke University
`Durham, NC 27708 USA
`+ 1-919-660-6566
`(vahdat@cs.duke.edu)
`
`OPERATING SYSTEMS REVIEW is an informal publication of the ACM Special Interest Group on
`Operating Systems (SIGOPS), whose scope of interest includes: Computer operating systems and archi(cid:173)
`tecture for multiprogramming, multiprocessing, and time sharing; resource management; evaluation and
`simulation; reliability, integrity, and security of data; communications among computing processors; and
`computer system modeling analysis.
`Membership in SIGOPS (at $15 per annum) is open to ACM members and associate members, and stu(cid:173)
`dent members (at $8 per annum). Non-members of SIGOPS may subscribe to OPERATING SYSTEMS
`REVIEW at $30 per year. All SIGOPS members receive OPERATING SYSTEMS REVIEW.
`SIGOPS membership application forms are available from ACM Headquarters (or see back cover of this
`issue), 1515 Broadway, 17th Floor, New York, New York 10036, telephone (212) 869-7440. Changes of
`address and other matters pertaining to the SIGOPS mailing list should be directed to ACM Headquarters,
`not to any of the SIGOPS officers.
`Contributions to OSR are sent to the editor and should be single spaced and printed on one side of the
`paper only. Text and diagrams must be contained within a 6.5 inch wide by 8.5 inch high (165mm by
`216mm) area with at least one inch (25mm) top and side margins, 1.5 inch (38mm) bottom margin, and
`be camera-ready (i.e., be an original having sufficient contrast and density for reproduction). All letters to
`the editor will be considered for publication unless accompanied by a request to the contrary. Except for
`editorial items, all sources of material appearing in OPERATING SYSTEMS REVIEW will be clearly
`identified. Items attributed to individuals will ordinarily be interpreted as personal rather than organiza(cid:173)
`tional opinions. The technical material in this newsletter is not refereed. Deadlines for submitting
`material are the 15th of February, May, August, and November. The SIGOPS conference proceed(cid:173)
`ings will be mailed as the December issue in odd numbered calendar years, and the ASPLOS
`conference proceedings will be mailed in even numbered calendar years.
`OPERATING SYSTEMS REVIEW (ISSN 0163-5980) is published five times a year (January, April,
`July, October and December) by the Association for Computing Machinery Inc., 1515 Broadway, New
`York, NY 10036. Periodicals postage paid at New York, NY 10001, and at additional mailing offices.
`POSTMASTER: Send address changes to OPERATING SYSTEMS REVIEW, ACM, 1515 Broadway,
`New York, NY 10036.
`Subscriptions: Annual subscription cost of $12.53 is included in the member dues of $15.00 (for stu(cid:173)
`dents cost is included in $8.00); the non-member annual subscription is $30.00.
`
`--
`
`Page 2 of 21
`
`

`

`USENIX Association
`
`• )
`
`Proceedings of the
`
`Fifth Symposium on Operating Systems
`
`Design and Implementation
`
`(OSDI '02)
`
`APR o 3 zoo~
`, .
`GeQrge Washin t
`Wa!hington "oocn 2Unrversi1y
`,
`0052
`
`Co-sponsored by ACM SIGOPS
`
`In Cooperation with IEEE TCOS
`
`December 9-11, 2002
`Boston, Massachusetts, USA
`
`Page 3 of 21
`
`

`

`, 1·
`
`Symposium Organizers
`Program Chair
`David Culler, University of California, Berkeley
`Peter Druschel, Rice University
`Program Committee
`Eric Brewer, University of California, Berkeley
`Miguel Castro, Microsoft Research
`Carla Ellis, Duke University
`Dawson Engler, Stanford University
`Deborah Estrin, University of California, Los
`Angeles/IS!
`Greg Ganger, Carnegie Mellon University
`
`Jim Gray, Microsoft Research
`Jay Lepreau, University of Utah
`Robert Morris, Massachusetts Institute of Technology
`Timothy Roscoe, Intel Research Lab at Berkeley
`Chandu Thekkath, Microsoft Research
`David Wetherall, University of Washington
`Steering Committee
`Michael B. Jones, Microsoft Research
`Frans Kaashoek, Massachusetts Institute of Technology
`The USENIX Association Staff
`
`External Reviewers
`Atul Adya
`Sameer Ajmani
`David Andersen
`Darrell Anderson
`Andrea Arpaci-Dusseau
`Remzi ArpaCi-Dusseau
`Tuomas Aura
`Godmar Back
`Paul Barham
`Bobby Bhattacharjee
`David Black
`Richard Black
`Trevor Blackwell
`Nikita Borisov
`Thomas Bressoud
`Nevil Brownlee
`Philip Buonadonna
`Andrew Campbell
`John Carter
`Alberto Cerpa
`Surendar Chandra
`Jeff Chase
`Benjie Chen
`Mike Chen
`Peter Chen
`Andy Chou
`Brent Chun
`Patrick Crowley
`Mike Dahlin
`David Dill
`John Douceur
`Eric Eide
`Kevin Eustice
`Ted Faber
`Mike Feeley
`Jason Flinn
`Rick Floyd
`Cedric Fournet
`
`Michael Franklin
`Ayalvadi Ganesh
`Johannes Gehrke
`Garth Gibson
`Daniel Giffin
`Thomer M. Gil
`Steve Gribble
`Dirk Grunwald
`Richard Guy
`Steven Hand
`Timothy Harris
`Chris Hawblitzel
`Mike Hibler
`Michael Hicks
`Wilson Hsieh
`Galen Hunt
`Sitaram Iyer
`Trent Jaeger
`Paul Jardetzky
`Brad Karp
`Kimberly Keeton
`Anne-Marie Kermarrec
`Eddie Kohler
`David Kotz
`Michael Kozuch
`Orran Krieger
`Lakshman Krishnamurthy
`John Kubiatowicz
`Eren Kursun
`Alvin Lebeck
`Philip Levis
`Hank Levy
`Jinyang Li
`KaiLi
`Darrell Long
`Ewing Lusk
`Mesaac Makpangou
`Rob Malan
`
`Laurent Massoulie
`Roy Max.ion
`David Mazieres
`Ethan Miller
`Jeffrey Mogul
`Richard Mortier
`David Mosberger
`Todd Mowry
`Gilles Muller
`Andrew Myers
`David Nagle
`Erich Nahum
`Juan Navarro
`Silvia Nittel
`Brian Noble
`Chris Olston
`Vivek Pai
`Parveen Patel
`Larry Peterson
`David Petrou
`Ian Pratt
`Calton Pu
`Vijay Raghunathan
`Sylvia Ratnasamy
`John Regehr
`Alastair Reid
`Peter Reiher
`Mike Reiter
`Erik Riedel
`Rodrigo Rodrigues
`Michael Roe
`Mendel Rosenblum
`Antony Rowstron
`Constantine Sapuntzakis
`Mahadev Satyanarayanan
`Dan Scales
`Margo Seltzer
`Srinivasan Seshan
`
`Jonathan Shapiro
`Marc Shapiro
`Dan Simon
`Emin Gun Sirer
`Stephen Smalley
`Alex Snoeren
`Ankur Srivastava
`Dave Stewart
`Doug Terry
`Amin Vahdat
`Alistair Veitch
`Geoff Voelker
`Werner Vogels
`James von Behren
`David Wagner
`Carl Waldspurger
`Jonathan Walpole
`Randolph Wang
`Matt Welsh
`Brian White
`Ted Wobber
`Alec Wolman
`Fan Ye
`Erez Zadok
`Ellen Zegura
`Beichuan Zhang
`Ben Zhao
`Lidong Zhou
`Yuanyuan Zhou
`Ben Zorn
`
`Page 4 of 21
`
`

`

`OSDI '02: 5th Symposium on
`Operating Systems Design and Implementation
`December 9-11, 2002
`Boston, Massachusetts, USA
`
`Message from the Program Chair ... . . ... ...... .. . .... . ............... . ... .... .. ... .... . ...... vii
`
`Index of Authors . . ..... .. .. ........... .. ..... ... .. . . ... . .. .. .. . ... . ...... . . . ..... . ... .. ... . . ix
`
`Monday, December 9
`
`Decentralized Storage Systems
`Session Chair: Greg Ganger, Carnegie Mellon University
`
`FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment . ... ... . ...... 1
`Atul Adya, William J. Bolosky, Miguel Castro, Gerald Cermak, Ronnie Chaiken, John R. Douceur, Jon Howell,
`Jacob R. Lorch, Marvin Theimer, and Roger P Wattenhofer, Microsoft Research
`
`Taming Aggressive Replication in the Pangaea Wide-Area File System ........ . .. . ... .. .. .. . .. . .... .. .. 15
`Yasushi Saito, Christos Karamanolis, Magnus Karlsson, and Mallik Mahalingam, HP Labs
`
`Ivy: A Read/Write Peer-to-Peer File System .. ... .... ... .. ..... . . . . ... .... .... . .... .. . . .. .... .. .. . . 31
`Athicha Muthitacharoen, Robert Morris, Thomer M. Gil, and Benjie Chen, Massachusetts Institute of Technology
`
`Robustness
`Session Chair: Miguel Castro, Microsoft Research
`
`Defensive Programming: Using an Annotation Toolkit to Build DoS-Resistant Software . .. ...... . .. . ... .... 45
`Xiaohu Qie, Ruoming Pang, and Larry Peterson, Princeton University
`
`Using Model Checking to Debug Device Firmware . . ..... . . . .. .. . .............. .. . .. . . ... . . . ... . .. . 61
`Sanjeev Kumar and Kai Li, Princeton University
`
`CMC: A Pragmatic Approach to Model Checking Real Code ....... . . . .. .. . ..... .. . ... ... .. .......... 75
`Madanlal Musuvathi, David Y. W Park, Andy Chou, Dawson R. Engler, and David L. Dill, Stanford University
`
`Kernels
`Session Chair: Carla Ellis, Duke University
`
`Practical, Transparent Operating System Support for Superpages . .. .... .. . ...... . . .. .. .. .... .. . . . .. . .. 89
`Juan Navarro, Rice University and Universidad Cat6lica de Chile; Sitaram Iyer, Peter Druschel, and Alan Cox,
`Rice University
`
`Vertigo: Automatic Performance-Setting for Linux .. ... . .. .. .. ..... . . . .. .. . ... ... . . .... .. ..... .. ... 105
`Krisztian Flautner, ARM Limited; Trevor Mudge, University of Michigan
`
`Cooperative 1/0: A Novel 110 Semantics for Energy-Aware Applications .. .. .. .... . ....... . .. .. . ... . . .. 117
`Andreas Weisse[, Bjorn Beutel, and Frank Bellosa, University of Erlangen
`
`Page 5 of 21
`
`

`

`Physical Interface
`Session Chair: Jay Lepreau, University of Utah
`
`TAG: A Tiny AGgregation Service for Ad-Hoc Sensor Networks . .. ..... . .. . .. . ... .. . .... . . . . .. . . . . . . 131
`Samuel Madden, Michael J. Franklin, and Joseph M. Hellerstein, University of California, Berkeley; Wei Hong,
`Intel Research, Berkeley
`
`Fine-Grained Network Time Synchronization Using Reference Broadcasts .... . . . .... . . . .... .. ... . ... . . 147
`Jeremy Elson, Lewis Girod, and Deborah Estrin, University of California, Los Angeles
`
`Supporting Time-Sensitive Applications on a Commodity OS .. .... . . . . .. . .. ... . . .. . . . ..... ..... .. ... 165
`Ashvin Goel, Luca Abeni, Charles Krasic, Jim Snow, and Jonathan Walpole, Oregon Graduate Institute
`
`Tuesday, December 10
`
`Virtual Machines
`Session Chair: Dawson Engler, Stanford University
`
`Memory Resource Management in VMware ESX Server .... ..... . .. ..... . . .. ..... .. .... . .. . ... .. .. 181
`Carl A. Waldspurger, VMware Inc.
`
`Scale and Performance in the Denali Isolation Kernel . . .... ... . . ... ..... .. . ..... .. ....... . . . ..... .. 195
`Andrew Whitaker, Marianne Shaw, and Steven D. Gribble, University of Washington
`
`ReVirt: Enabling Intrusion Analysis Through Virtual-Machine Logging and Replay .. . . .. . . . . . .. .. . . . ... . 211
`George W. Dunlap, Samuel T King, Sukru Cinar, Murtaza A. Basrai, and Peter M. Chen, University of Michigan
`
`Cluster Resource Management
`Session Chair: Eric Brewer, University of California, Berkeley
`
`Integrated Resource Management for Cluster-based Internet Services ..... . . . .. .... ... ... . .... .. .... .. . 225
`Kai Shen, University of Rochester; Hong Tang, University of California, Santa Barbara; Tao Yang, University of
`California, Santa Barbara, and Ask Jeeves, Inc.; Lingkun Chu, University of California, Santa Barbara
`
`Resource Overbooking and Application Profiling in Shared Hosting Platforms .. .. .. ..... ... .. ......... . 239
`Bhuvan Urgaonkar and Prashant Shenoy, University of Massachusetts; Timothy Roscoe, Intel Research, Berkeley
`
`An Integrated Experimental Environment for Distributed Systems and Networks .. .... . . .. . . . . ... . . .. . .. 255
`Brian White, Jay Lepreau, Leigh Stoller, Robert Ricci, Shashi Guruprasad, Mac Newbold, Mike Hibler, Chad
`Barb, and Abhijeet Joglekar, University of Utah
`
`Peer-to-Peer Infrastructure
`Session Chair: Robert Morris, Massachusetts Institute of Technology
`
`Scalability and Accuracy in a Large-Scale Network Emulator ..... ... ...... .... ... ... . . . . ... ... . . ... . 271
`Amin Vahdat, Ken Yocum, Kevin Walsh, Priya Mahadevan, Dejan Kostic, Jeff Chase, and David Becker, Duke
`University
`
`Pastiche: Making Backup Cheap and Easy .. . .. . ... . .. . . . . . . .. ... .... .. .... . .. . .. ...... . ... . .. ... 285
`Landon P. Cox, Christopher D. Murray, and Brian D. Noble, Universit)i of Michigan
`
`Secure Routing for Structured Peer-to-Peer Overlay Networks . . . . .. .. ........ .. . .. ...... . .. . .. .. .. . . . 299
`Miguel Castro, Microsoft Research; Peter Druschel, Rice University; Ayalvadi Ganesh and Antony Rowstron,
`Microsoft Research; Dan S. Wallach, Rice University
`
`Page 6 of 21
`
`

`

`Wednesday, December 11
`
`Network Behavior
`Session Chair: David Wetherall, University of Washington
`
`An Analysis of Internet Content Delivery Systems . ..... ..... . . . ............... . ..... ... ...... . .... 315
`Stefan Saroiu, Krishna P Gummadi, Richard J. Dunn, Steven D. Gribble, and Henry M. Levy, University of
`Washington
`
`TCP Nice: A Mechanism for Background Transfers .. ... ... . .... ........ ... ...... .. . ... . . .. . .... . .. 329
`A run Venkataramani, Ravi Kokku, and Mike Dahlin, University of Texas at Austin
`
`The Effectiveness of Request Redirection on CDN Robustness .. .... . .. . .. . . .. . . ... . . ..... ....... .. .. 345
`Limin Wang, Vivek Pai, and Larry Peterson, Princeton University
`
`Migration
`Session Chair: Timothy Roscoe, Intel Research, Berkeley
`
`The Design and Implementation of Zap: A System for Migrating Computing Environments .. ....... .. . ... . 361
`Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh, Columbia University
`
`Optimizing the Migration of Virtual Computers .. . ... ... . . ..... . ... . ......... ... .... . ... . .. ....... 377
`Constantine P Sapuntzakis, Ramesh Chandra, Ben Pfaff, Jim Chow, Monica S. Lam, and Mendel Rosenblum,
`Stanford University
`
`Luna: A Flexible Java Protection System . .. . ... . . . ... . . .. . ........ .. . ..... . ...... . . . . ..... ... ... 391
`Chris Hawblitzel, Dartmouth College; Thorsten von Eicken, Expertcity, Inc.
`
`Page 7 of 21
`
`

`

`Ivy: A Read/Write Peer-to-Peer File System
`
`Athicha Muthitacharoen, Robert Morris, Thomer M. Gil, and Benjie Chen
`{ athicha, rtm, thomer, benjie }@lcs.mit.edu
`MIT Laboratory for Computer Science
`200 Technology Square, Cambridge, MA 02139.
`
`Abstract
`
`Ivy is a multi-user read/write peer-to-peer file system.
`Ivy has no centralized or dedicated components, and
`it provides useful integrity properties without requiring
`users to fully trust either the underlying peer-to-peer stor(cid:173)
`age system or the other users of the file system.
`An Ivy file system consists solely of a set of logs, one
`log per participant. Ivy stores its logs in the DHash dis(cid:173)
`tributed hash table. Each participant finds data by con(cid:173)
`sulting all logs, but performs modifications by appending
`only to its own log. This arrangement allows Ivy to main(cid:173)
`tain meta-data consistency without locking. Ivy users can
`choose which other logs to trust, an appropriate arrange(cid:173)
`ment in a semi-open peer-to-peer system.
`Ivy presents applications with a conventional file sys(cid:173)
`tem interface. When the underlying network is fully
`connected, Ivy provides NFS-like semantics, such as
`close-to-open consistency. Ivy detects conflicting modi(cid:173)
`fications made during a partition, and provides relevant
`version information to application-specific conflict re(cid:173)
`solvers. Performance measurements on a wide-area net(cid:173)
`work show that Ivy is two to three times slower than
`NFS.
`
`1 Introduction
`
`This paper describes Ivy, a distributed read/write net(cid:173)
`work file system. Ivy presents a single file system im(cid:173)
`age that appears much like an NFS [33] file system. In
`contrast to NFS, Ivy does not require a dedicated server;
`instead, it stores all data and meta-data in the DHash [9]
`peer-to-peer block storage system. DHash can distribute
`and replicate blocks, giving Ivy the potential to be highly
`available. One possible application of Ivy is to support
`distributed projects with loosely affiliated participants.
`Building a shared read-write peer-to-peer file system
`poses a number of challenges. First, multiple distributed
`writers make maintenance of consistent file system meta(cid:173)
`data difficult. Second, unreliable participants make lock(cid:173)
`ing an unattractive approach for achieving meta-data
`consistency. Third, the participants may not fully trust
`each other, or may not trust that the other participants'
`
`machines have not been compromised by outsiders; thus
`there should be a way to ignore or un-do some or all
`modifications by a participant revealed to be untrustwor(cid:173)
`thy. Finally, distributing file-system data over many hosts
`means that the system may have to cope with operation
`while partitioned, and may have to help applications re(cid:173)
`pair conflicting updates made during a partition.
`Ivy uses logs to solve the problems described above.
`Each participant with write access to a file system main(cid:173)
`tains a log of changes they have made to the file sys(cid:173)
`tem. Participants scan all the logs (most recent record
`first) to look up file data and meta-data. Each participant
`maintains a private snapshot to avoid scanning all but the
`most recent log entries. The use of per-participant logs,
`instead of shared mutable data structures, allows Ivy to
`avoid using locks to protect meta-data. Ivy stores its logs
`in DHash, so a participant's logs are available even when
`the participant is not.
`Ivy resists attacks from non-participants, and from
`corrupt DHash servers, by cryptographically verifying
`the data it retrieves from DHash. An Ivy user can cope
`with attacks from other Ivy users by choosing which
`other logs to read when looking for data, and thus which
`other users to trust. Ignoring a log that was once trusted
`might discard useful information or critical meta-data;
`Ivy provides tools to selectively ignore logs and to fix
`broken meta-data.
`Ivy provides NFS-like file system semantics when the
`underlying network is fully connected. For example, Ivy
`provides close-to-open consistency. In the case of net(cid:173)
`work partition, DHash replication may allow participants
`to modify files in multiple partitions. Ivy's logs contain
`version vectors that allow it to detect conflicting updates
`after partitions merge, and to provide version informa(cid:173)
`tion to application-specific conflict resolvers.
`The Ivy implementation uses a local NFS loop-back
`server [22] to provide an ordinary file system interface.
`Performance is within a factor of two to three of NFS .
`The main performance bottlenecks are network latency
`and the cost of generating digital signatures on data
`stored in DHash.
`This paper makes three contributions. It describes a
`
`USENIX Association
`
`5th Symposium on Operating Systems Design and Implementation
`
`31
`
`Page 8 of 21
`
`

`

`~ !
`,.
`,..
`~
`:.·~J
`•' , .,;
`,,;-(cid:173)
`.,:_;;
`~~ ....
`,,,..
`
`read/write peer-to-peer storage system; previous peer(cid:173)
`to-peer systems have supported read-only data or data
`writeable by a single publisher. It describes how to de(cid:173)
`sign a distributed file system with useful integrity prop(cid:173)
`erties based on a collection of untrusted components. Fi(cid:173)
`nally, it explores the use of distributed hash tables as a
`building-block for more sophisticated systems.
`Section 2 describes Ivy's design. Section 3 discusses
`the consistency semantics that Ivy presents to applica(cid:173)
`tions . Section 4 presents tools for dealing with malicious
`participants. Sections 5 and 6 describe Ivy's implementa(cid:173)
`tion and performance. Section 7 discusses related work,
`and Section 8 concludes.
`
`2 Design
`
`An Ivy file system consists of a set of logs, one log
`per participant. A log contains all of one participant's
`changes to file system data and meta-data. Each partic(cid:173)
`ipant appends only to its own log, but reads from all
`logs. Participants store log records in the DHash dis(cid:173)
`tributed hash system, which provides per-record repli(cid:173)
`cation and authentication. Each participant maintains a
`mutable DHash record (called a log-head) that points to
`the participant's most recent log record. Ivy uses version
`vectors [27] to impose a total order on log records when
`reading from multiple logs. To avoid the expense of re(cid:173)
`peatedly reading the whole log, each participant main(cid:173)
`tains a private snapshot summarizing the file system state
`as of a recent point in time.
`The Ivy implementation acts as a local loop-back NFS
`v3 [6] server, in cooperation with a host's in-kernel NFS
`client support. Consequently, Ivy presents file system se(cid:173)
`mantics much like those of an NFS v3 file server.
`
`2.1 DHash
`
`Ivy stores all its data in DHash [9]. DHash is a distributed
`peer-to-peer hash table mapping keys to arbitrary val(cid:173)
`ues. DHash stores each key/value pair on a set oflntemet
`hosts determined by hashing the key. This paper refers to
`a DHash key/value pair as a DHash block. DHash repli(cid:173)
`cates blocks to avoid losing them if nodes crash.
`DHash ensures the integrity of each block with one of
`two methods. A content-hash block requires the block's
`key to be the SHA-1 [10] cryptographic hash of the
`block's value; this allows anyone fetching the block to
`verify the value by ensuring that its SHA-1 hash matches
`the key. A public-key block requires the block's key to be
`a public key, and the value to be signed using the corre(cid:173)
`sponding private key. DHash refuses to store a value that
`does not match the key. Ivy checks the authenticity of all
`data it retrieves from DHash. These checks prevent a ma(cid:173)
`licious or buggy DHash node from forging data, limiting
`
`log-head
`
`......... JJ. ........ JJ ........ .JJ
`
`--v(cid:173)
`log records
`
`Figure 1: Example Ivy view and logs. White boxes are DJiash
`content-hash blocks; gray boxes are public-key blocks.
`
`it to denying the existence of a block or producing a stale
`copy of a public-key block.
`Ivy participants communicate only via DHash stor(cid:173)
`age; they don't communicate directly with each other ex(cid:173)
`cept when setting up a new file system. Ivy uses DHash
`content-hash blocks to store log records. Ivy stores the
`DHash key of a participant's most recent log record
`in a DHash block called the log-head; the log-head is
`a public-key block, so that the participant can update
`its value without changing its key. Each Ivy participant
`caches content-hash blocks locally without fear of us(cid:173)
`ing stale data, since content-hash blocks are immutable.
`An Ivy participant does not cache other participants' log(cid:173)
`head blocks, since they may change.·
`interface:
`simple
`through a
`Ivy uses DHash
`put (key, value) and get (key) . Ivy assumes
`that, within any given network partition, DHash provides
`write-read consistency; that is , if put (k, v) com(cid:173)
`pletes, a subsequent get ( k) will yield v . The current
`DHash implementation does not guarantee write-read
`consistency; however, techniques are known which can
`provide such a guarantee with high probability [19] .
`These techniques require that DHash replicate data and
`update it carefully, and might significantly decrease
`performance. Ivy operates best in a fully connected
`network, though it has support for conflict detection
`after operating in a partitioned network (see Section 3.4).
`Ivy would in principle work with other distributed
`hash tables, such as PAST [32], CAN [29], Tapestry [41],
`or Kademlia [21].
`
`2.2 Log Data Structure
`
`An Ivy log consists of a linked list of immutable log
`records. Each log record is a DHash content-hash block.
`Table 1 describes fields common to all log records. The
`prev field contains the previous record's DHash key. A
`participant stores the DHash key of its most recent log
`record in its log-head block. The log-head is a public(cid:173)
`key block with a fixed DHash key, which makes it easy
`
`32
`
`5th Symposium on Operating Systems Design and Implementation
`
`USENIX Association
`
`Page 9 of 21
`
`

`

`Field
`prev
`head
`seq
`times tamp
`version
`
`Use
`DHash key of next oldest log record
`DHash key of log-head
`per-log sequence number
`time at which record was created
`version vector
`
`Table 1: Fields present in all Ivy log records.
`
`for other participants to find.
`A log record contains information about a single file
`system modification, and corresponds roughly to an NFS
`operation. Table 2 describes the types of log records and
`the type-specific fields each contains.
`Log records contain the minimum possible informa(cid:173)
`tion to avoid unnecessary conflicts from concurrent up(cid:173)
`dates by different participants. For example, a Write
`log record contains the newly written data, but not the
`file's new length or modification time. These attributes
`cannot be computed correctly at the time the Write
`record is created, since the true state of the file will only
`be known after all concurrent updates are known. Ivy
`computes that information incrementally when travers(cid:173)
`ing the logs, rather than storing it explicitly as is done in
`UNIX i-nodes [30].
`Ivy records file owners and permission modes, but
`does not use those attributes to enforce permissions. A
`user who wishes to make a file umeadable should instead
`encrypt the file's contents. A user should ignore the logs
`of people who should not be allowed to write the user's
`data.
`Ivy identifies files and directories using 160-bit i(cid:173)
`numbers. Log records contain the i-number(s) of the files
`or directories they affect. Ivy chooses i-numbers ran(cid:173)
`domly to minimize the probability of multiple partici(cid:173)
`pants allocating the same i-number for different files. Ivy
`uses the 160-bit i-number as the NFS file handle.
`Ivy keeps log records indefinitely, because they may
`be needed to help recover from a malicious participant
`or from a network partition.
`
`2.3 Using the Log
`
`For the moment, consider an Ivy file system with only
`one log. Ivy handles non-updating NFS requests with a
`single pass through the log. Requests that cause modifi(cid:173)
`cation use one or more passes, and then append one or
`more records to the log. Ivy scans the log starting at the
`most recently appended record, pointed to by the log(cid:173)
`head. Ivy stops scanning the log once it has gathered
`enough data to handle the request.
`Ivy appends a record to a log as follows. First, it cre(cid:173)
`ates a log record containing a description of the update,
`
`typically derived from arguments in the NFS request.
`The new record's prev field is the DHash key of the
`most recent log record. Then, it inserts the new record
`into DHash, signs a new log-head that points to the new
`log record, and updates the log-head in DHash.
`The following text describes how Ivy uses the log to
`perform selected operations.
`File system creation. Ivy builds a new file system by
`creating a new log with an End record, an Inode record
`with a random i-number for the root directory, and a log(cid:173)
`head. The user then mounts the local Ivy server as an
`NFS file system, using the root i-number as the NFS root
`file handle.
`File creation. When an application creates a new file,
`the kernel NFS client code sends the local Ivy server an
`NFS CREATE request. The request contains the direc(cid:173)
`tory i-number and a file name. Ivy appends an Inode
`log record with a new random i-number and a Link
`record that contains the i-number, the file's name, and the
`directory's i-number. Ivy returns the new file's i-number
`in a file handle to the NFS client. If the application then
`writes the file, the NFS client will send a WRITE request
`containing the file's i-number, the written data, and the
`file offset; Ivy will append a Write log record contain(cid:173)
`ing the same information.
`File name lookup. System calls such as open () that
`refer to file names typically generate NFS LOOKUP re(cid:173)
`quests. A LOOKUP request contains a file name and a di(cid:173)
`rectory i-number. Ivy scans the log to find a Link record
`with the desired directory i-number and file name, and
`returns the file i-number. However, if Ivy first encoun(cid:173)
`ters a Unlink record that mentions the same directory
`i-number and name, it returns an NFS error indicating
`that the file does not exist.
`File read. An NFS READ request contains the file's i(cid:173)
`number, an offset within the file, and the number of bytes
`to read. Ivy scans the log accumulating data from Write
`records whose ranges overlap the range of the data to be
`read, while ignoring data hidden by SetAttr records
`that indicate file truncation.
`File attributes. Some NFS requests,
`including
`GETATTR, require Ivy to include file attributes in the
`reply. Ivy only fully supports the file length, file mod(cid:173)
`ification time ("mtime"), attribute modification time
`("ctime"), and link count attributes. Ivy computes these
`attributes incrementally as it scans the log. A file's length
`is determined by either the write to the highest offset
`since the last truncation, or by the last truncation. Mtirne
`is determined by the time stamp in the most recent rel(cid:173)
`evant log record; Ivy must return correct time attributes
`because NFS client cache consistency depends on it. Ivy
`computes the number of links to a file by counting the
`number of relevant Link records not canceled by Un-
`1 ink and Rename records .
`
`USENIX Association
`
`5th Symposium on Operating Systems Design and Implementation
`
`33
`
`• t.
`'
`
`Page 10 of 21
`
`

`

`I I
`
`Type
`I node
`Write
`Link
`Unlink
`Rename
`Prepare
`Cancel
`SetAttrs
`End
`
`Fields
`type (file, directory, or symlink), i-m:mber, mode, owner
`i-number, offset, data
`i-number, i-number of directory, name
`i-number of directory, name
`i-number of directory, name, i-number of new directory, new file name
`i-number of directory, file name
`i-number of directory, file name
`i-number, changed attributes
`none
`
`Meaning
`create new inode
`write data to a file
`create a directory entry
`remove a file
`rename a file
`for exclusive operations
`for exclusive operations
`change file attributes
`end of log
`
`Table 2: Summary of Ivy log record types.
`
`Directory listings. Ivy handles READDIR requests
`by accumulating all file names from relevant Link log
`records, taking more recent Unlink and Rename log
`records into account.
`
`adopt the new view block (and newly named file system).
`Ivy's lack of support for automatically adding new users
`to a view is intentional.
`
`2.4 User Cooperation: Views
`
`When multiple users write to a single Ivy file system,
`each source of potentially concurrent updates must have
`its own log; this paper refers to such sources as partici(cid:173)
`pants. A user who uses an Ivy file system from multiple
`hosts concurrently must have one log per host.
`The participants in an Ivy file system agree on a view:
`the set of logs that comprise the file system. Ivy makes
`management of shared views convenient by providing
`a view block, a DHash content-hash block containing
`pointers to all log-heads in the view. A view block also
`contains the i-number of the root directory. A view block
`is immutable; if a set of users wants to form a file system
`with a different set of logs, they create a new view block.
`A user names an Ivy file system with the content-hash
`key of the view block; this is essentially a self-certifying
`pathname [23]. Users creating a new file system must
`exchange public keys in advance by some out-of-band
`means. Once they know each other's public keys, one of
`them creates a view block and tells the other users the
`view block's DHash key.
`Ivy uses the view block key to verify the view block's
`contents; the contents are the public keys that name and
`verify the participants' log-heads. A log-head contains
`a content-hash key that names and verifies the most re(cid:173)
`cent log record. It is this reasoning that allows Ivy to ver(cid:173)
`ify it has retrieved correct log records from the untrusted
`DHash storage system. This approach requi

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket