Petal: Distributed Virtual Disks
`Edward K. Lee and Chandramohan A. Thekkath
`
`Systems Research Center
`Digital Equipment Corporation
`130 Lytton Ave, Palo Alto, CA 94301.
`
`Abstract
`
`The ideal storage system is globally accessible, always available,
`provides unlimited performance and capacity for a large number
`of clients, and requires no management. This paper describes
`the design, implementation, and performance of Petal, a system
`that attempts to approximate this ideal in practice through a novel
`combination of features. Petal consists of a collection of network-
`connected servers that cooperatively manage a pool of physical
`disks. To a Petal client, this collection appears as a highly available
`block-level storage system that provides large abstract containers
`called virtual disks. A virtual disk is globally accessible to all Petal
`clients on the network. A client can create a virtual disk on demand
`to tap the entire capacity and performance of the underlying phys-
`ical resources. Furthermore, additional resources, such as servers
`and disks, can be automatically incorporated into Petal.
`We have an initial Petal prototype consisting of four 225 MHz
`DEC 3000/700 workstations running Digital Unix and connected
`by a 155 Mbit/s ATM network. The prototype provides clients
`with virtual disks that tolerate and recover from disk, server, and
`network failures. Latency is comparable to a locally attached disk,
`and throughput scales with the number of servers. The prototype
`can achieve I/O rates of up to 3150 requests/sec and bandwidth up
`to 43.1 Mbytes/sec.
`
1 Introduction
`
`Currently, managing large storage systems is an expensive and
complicated process. Often a single component failure can halt the entire system and require considerable time and effort to resume
`operation. Moreover, the capacity and performance of individual
`components in the system must be periodically monitored and
`balanced to reduce fragmentation and eliminate hot spots. This
`usually requires manually moving, partitioning, or replicating files
`and directories.
`This paper describes the design, implementation, and perfor-
`mance of Petal, an easy-to-manage distributed storage system.
`Clients, such as file systems and databases, view Petal as a collec-
tion of virtual disks as shown in Figure 1.

Published in The Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems. Copyright © 1996 by the Association for Computing Machinery. All rights reserved. Republished by permission.

[Figure 1: Client View — Petal clients (LFS, NT FS, PC FS, BSD FFS) access virtual disks /dev/vdisk1 through /dev/vdisk5 over a scalable network.]

A Petal virtual disk is a
`container that provides a sparse 64-bit byte storage space. As with
`ordinary magnetic disks, data are read and written to Petal virtual
`disks in blocks. In addition, it has the following novel combination
`of characteristics, which we believe will reduce the complexity of
`managing large storage systems:
`
- It can tolerate and recover from any single component failure such as disk, server, or network.

- It can be geographically distributed to tolerate site failures such as power outages and natural disasters.

- It transparently reconfigures to expand in performance and capacity as new servers and disks are added.

- It uniformly balances load and capacity throughout the servers in the system.

- It provides fast, efficient support for backup and recovery in environments with multiple types of clients, such as file servers and databases.
`
`Petal’s virtual disks allow us to cleanly separate a client’s view
`of storage from the physical resources that are used to implement it.
`This allows us to share the physical resources more flexibly among
`many clients, and to offer important services such as “snapshots”
`and incremental expandability in an efficient manner.
`The disk-like interface offered by Petal provides a lower-level
`service than a distributed file system; however, we believe that a
`distributed file system can be efficiently implemented on top of
Petal, and that the resulting system as a whole will be as cost-effective as a comparable distributed file system implementation
`that accesses local disks directly. By separating the system cleanly
`into a block-level storage system and a file system, and by handling
`many of the distributed systems problems in the block-level storage
`system, we have an overall system that is easier to model, design,
`implement, and tune. This simplicity is particularly important
`when the design is expected to scale to a large size and provide
`reliable data storage over a long period of time. An additional
`benefit is that the block-level interface is useful for supporting
`heterogeneous clients and client applications; that is, we can easily
support many different types of file systems and databases.

[Figure 2: Physical View — file-system clients (LFS, NT FS, PC FS, BSD FFS) reach a collection of storage servers, each with attached disk storage, over a scalable network.]
`We have implemented Petal servers on Alpha workstations run-
`ning Digital Unix connected by the Digital ATM network [2]. A
`Petal client interface exists for Digital Unix and is implemented
`as a kernel device driver, allowing all standard Unix applications,
`utilities, and file systems to run unmodified when using Petal.
`Our implementation exhibits graceful scaling and provides perfor-
`mance that is comparable to local disks while providing significant
`new functionality.
`
`2 Design of Petal
`
`As shown in Figure 2, Petal consists of a pool of distributed storage
`servers that cooperatively implement a single, block-level storage
`system. Clients view the storage system as a collection of vir-
`tual disks and access Petal services via a remote procedure call
`(RPC) [3] interface. A basic principle in the design of the Petal
`RPC interface was to maintain all state needed for ensuring the in-
`tegrity of the storage system in the servers, and maintain only hints
`in the clients. Clients maintain only a small amount of high-level
`mapping information that is used to route read and write requests
`to the “most appropriate” server. If a request is sent to an inappro-
`priate server, the server returns an error code, causing the client to
`update its hints and retry the request.
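As a concrete illustration of this hint-and-retry protocol, the following minimal C sketch shows how a client might route a request and refresh a stale hint. The petal_rpc() stub, the EWRONGSERVER code, and the redirect field are our assumptions for illustration, not the actual Petal RPC interface.

```c
/* Hypothetical client-side routing: state lives in servers, hints in clients. */
#include <stdint.h>

#define EWRONGSERVER (-2)   /* assumed "inappropriate server" error code */

typedef struct {
    int hint_server;        /* hint: server believed responsible for this disk */
} vdisk_hint_t;

/* Assumed RPC stub: returns 0 on success, or EWRONGSERVER (with a corrected
 * server id in *redirect) if this server does not hold the translation. */
int petal_rpc(int server, uint64_t vdisk, uint64_t offset,
              void *buf, int len, int *redirect);

int petal_read(vdisk_hint_t *h, uint64_t vdisk, uint64_t offset,
               void *buf, int len)
{
    int redirect, rc;
    while ((rc = petal_rpc(h->hint_server, vdisk, offset,
                           buf, len, &redirect)) == EWRONGSERVER)
        h->hint_server = redirect;   /* hint was stale: update it and retry */
    return rc;                       /* success, or a genuine error */
}
```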
[Figure 3: Petal Server Modules — the liveness module, global state module, data access module, recovery module, and virtual-to-physical translator.]

Figure 3 illustrates the software structure of Petal. Each of the
`ovals represents a software module. Arrows indicate the use of
`one module by another. Two modules, the liveness module and the
`global state module, manage much of the distributed system aspect
`of Petal. The liveness module ensures that all servers in the system
will agree on each other's operational status, whether running or crashed. This service is used by the other modules, notably the
`global state manager, to guarantee continuous, consistent operation
`of the system as a whole in the face of server and communication
`failures. The operation of the liveness module is based on majority
`consensus and the periodic exchange of “I’m alive” and “You’re
`alive” messages between the servers. These message exchanges
`must be done in a timely manner to ensure progress but can be
`arbitrarily delayed or reordered without affecting correctness.
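A minimal sketch of how such an exchange might be tracked locally appears below; the timeout constant and bookkeeping structure are illustrative assumptions, since the paper states only the properties of the mechanism.

```c
/* Illustrative liveness bookkeeping: a delayed or reordered "I'm alive"
 * message can only refresh a timestamp, never corrupt the state. */
#include <time.h>

#define NSERVERS 4
#define ALIVE_TIMEOUT 5            /* seconds; an assumed tuning value */

static time_t last_heard[NSERVERS];

void on_alive_message(int from)    /* "I'm alive" / "You're alive" arrival */
{
    time_t now = time(NULL);
    if (now > last_heard[from])
        last_heard[from] = now;
}

/* A timeout only *suspects* a crash; the modules above still confirm
 * status by majority consensus before the system acts on it. */
int suspected_down(int server)
{
    return time(NULL) - last_heard[server] > ALIVE_TIMEOUT;
}
```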
`Petal maintains information that describes the current members
`of the storage system and the currently supported virtual disks.
`This information is replicated across all Petal servers in the system.
`The global state manager is responsible for consistently maintain-
`ing this information, which is less than a megabyte in our current
`implementation. Our algorithm for maintaining global state is
`based on Leslie Lamport’s Paxos, or “part-time parliament” algo-
`rithm [14] for implementing distributed, replicated state machines.
`The algorithm assumes that servers fail by ceasing to operate and
`that networks can reorder and lose messages. The algorithm en-
`sures correctness in the face of arbitrary combinations of server and
`communication failures and recoveries, and guarantees progress as
`long as a majority of servers can communicate with each other.
`This ensures that management operations in Petal, such as creat-
`ing, deleting, or snapshotting virtual disks, or adding and deleting
`servers, are fault tolerant.
`The other three modules deal with servicing the read and write
`requests issued by Petal clients. The data access and recovery mod-
`ules control how client data is distributed and stored in the Petal
`storage system. A different set of data access and recovery mod-
`ules exists for each type of redundancy scheme supported by the
`system. We currently support simple data striping without redun-
`dancy and a replication-based redundancy scheme called chained-
`declustering [13]. The desired redundancy scheme for a virtual
`disk is specified when the virtual disk is created. Subsequently,
`the redundancy scheme, and other attributes, can be transparently
`changed via a process called virtual disk reconfiguration. The
`virtual-to-physical address translation module contains common
`routines used by the various data access and recovery modules.
`These routines translate the virtual disk offsets to physical disk
`addresses. The rest of this section will examine specific aspects of
`the system in greater detail.
`
`2.1 Virtual to Physical Translation
`
`This section describes how Petal translates the virtual disk ad-
`dresses used by clients into physical disk addresses. The basic
problem is to translate virtual addresses of the form <virtual-disk-identifier, offset> to physical addresses of the form <server-identifier, disk-identifier, disk-offset>. This translation must be
`done consistently and efficiently in a distributed system where
events that alter virtual disk address translation, such as server failure or recovery, can occur unexpectedly.

[Figure 4: Virtual to Physical Mapping — <vdiskID, offset> is translated to <serverID, diskID, diskOffset> through the virtual disk directory (VDir) and global map (GMap), replicated on Servers 0-3, and a per-server physical map (PMap).]
`Figure 4 illustrates the basic data structures and the steps in the
`translation procedure. There are three important data structures: a
`virtual disk directory (VDir), a global map (GMap), and a physical
`map (PMap). The dotted lines around the virtual disk directory and
`the global map indicate that these are global data structures that are
`replicated and consistently updated on all the servers by the global
`state manager. Each server also has a physical map that is local to
`that server. Translating a client-supplied virtual disk identifier and
`offset into a particular disk offset occurs in three steps as shown in
`Figure 4.
`
`1. The virtual disk directory translates the client-supplied virtual
`disk identifier into a global map identifier.
`
`2. The specified global map determines the server responsible
`for translating the given offset.
`
`3. The physical map at the specified server translates the global
`map identifier and the offset to a physical disk and an offset
`within that disk.
`
`To minimize communication, in almost all cases, the server that
`performs the translation in Step 2 will be the same server that
`performs the translation in Step 3. Thus, if a client has initially
`sent the request to the appropriate server, that server can perform
`all three steps in the translation locally without communicating
`with any other server.
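The following C sketch restates the three-step lookup. All of the types, the striping rule in step 2, and the helper functions are assumptions for illustration; the paper does not specify the actual data-structure layouts.

```c
#include <stdint.h>

typedef struct { int gmap_id; } vdir_entry_t;              /* VDir entry */
typedef struct { int nservers; int servers[16]; } gmap_t;  /* GMap */
typedef struct { int disk_id; uint64_t disk_offset; } phys_addr_t;

int vdir_lookup(uint64_t vdisk_id);                 /* step 1 (replicated) */
phys_addr_t pmap_lookup(int gmap_id, uint64_t off); /* step 3 (server-local) */

/* Assumed striping rule for step 2: responsibility rotates across the
 * virtual disk's servers in 64 Kbyte units (the PMap granularity). */
phys_addr_t translate(const gmap_t *gmaps, uint64_t vdisk_id,
                      uint64_t offset, int *server)
{
    int gmap_id = vdir_lookup(vdisk_id);                    /* step 1 */
    const gmap_t *g = &gmaps[gmap_id];
    *server = g->servers[(offset >> 16) % g->nservers];     /* step 2 */
    return pmap_lookup(gmap_id, offset);                    /* step 3; in the
                          common case it executes locally at *server */
}
```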
`There is one global map per virtual disk that specifies the tuple
`of servers spanned by the virtual disk and the redundancy scheme
`used to protect client data stored on the virtual disk. To tolerate
`server failures, a secondary server can be assigned responsibility
`for mapping the same offset when the primary is not available.
`Global maps are immutable; to change a virtual disk’s tuple of
`servers or redundancy scheme, the virtual disk must be assigned a
`new global map. Section 2.3 describing reconfiguration provides
`more details about this process.
`The physical map is the actual data structure used to translate
`an offset within a virtual disk to a physical disk and an offset
`within that disk. It is similar to a page table in a virtual memory
`system and each physical map entry translates a 64 Kbyte region of
`physical disk. The server that performs the translation will usually
`also perform the disk operations needed to service the original
`client request. The separation of the translation data structures
into global and local physical maps allows us to keep the bulk of the mapping information local. Doing so minimizes the amount of information that must be kept in global data structures that are replicated and, therefore, expensive to update.
`
`2.2 Support for Backup
`
`Petal attempts to simplify a client’s backup procedure by providing
`a common mechanism that can be applied by clients to automate
`the backup and recovery of all data stored on the system. The
mechanism Petal provides is fast, efficient snapshots of virtual disks.
`By using copy-on-write techniques, Petal can quickly create an
`exact copy of a virtual disk at a specified point in time. A client
`treats the snapshot like any other virtual disk, except that it cannot
`be modified.
`Supporting snapshots requires a slightly more complicated
`virtual-to-physical translation procedure than described in the pre-
`vious section.
`In particular, the virtual disk directory does not
`translate a virtual disk identifier to a global map identifier, but
rather to the tuple <global-map-identifier, epoch-number>. The
epoch-number is a monotonically increasing version number that
distinguishes data stored at the same virtual disk offset at different
points in time. The tuple <global-map-identifier, epoch-number>
is then used by the physical map in the last step of the translation.
`When the system creates a snapshot of a virtual disk, a new tuple
`with a later epoch number is created in the virtual disk directory.
`All accesses to the original virtual disk are then made using the
`new epoch number. The older epoch number is used by the newly
`created snapshot. This ensures that any new data written to the
`original virtual disk will create new entries in the new epoch rather
`than overwriting the data in the previous epoch. Also, read requests
`can find the data most recently written to a particular offset by
`looking for the most recent epoch.
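A sketch of the epoch bump performed at snapshot creation appears below; the tuple type and the replicated vdir_set() update are hypothetical names for the directory operations described above.

```c
#include <stdint.h>

typedef struct {
    int gmap_id;
    int epoch;     /* monotonically increasing version number */
} vdir_tuple_t;

/* Assumed directory update, applied consistently on all servers by the
 * global state manager. */
void vdir_set(uint64_t vdisk_id, vdir_tuple_t t);

void create_snapshot(uint64_t orig_id, uint64_t snap_id, vdir_tuple_t cur)
{
    vdir_set(snap_id, cur);    /* snapshot keeps the older epoch (read-only) */
    cur.epoch += 1;
    vdir_set(orig_id, cur);    /* new writes create entries in the new epoch,
                                  never overwriting the snapshot's data */
}
```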
`Creating a snapshot that is consistent at the client application
`level requires pausing the application for the brief time, less than
`one second, it takes to create a Petal snapshot. An alternative
`approach would not require pausing the application and would
`create a “crash-consistent” snapshot, that is, the snapshot would
`be similar to the disk image that would be left after an application
`crashed. Such snapshots could later be made consistent at the
`application level by running an application-dependent recovery
`program such as fsck in the case of Unix file systems. We are
`considering implementing crash-consistent snapshots, but they are
`currently not supported.
`Snapshots can be kept on-line and facilitate the recovery of ac-
`cidentally deleted files. Also, since a snapshot behaves exactly like
`a read-only local disk, a Petal client can use it to create consistent
`archives of data using utilities such as tar.
`
`2.3 Incremental Reconfiguration
`
`Occasionally, it is desirable to change a virtual disk’s redundancy
`scheme or the set of servers over which it is mapped. Such a
`change is often precipitated by the addition or removal of disks and
`servers. This section describes how Petal incorporates new disks
`and servers, and how existing virtual disks can be reconfigured to
`take advantage of these new resources. The former processes are
`described only from the point of view of adding new resources
`but are easily generalized to the removal of resources. The latter
`process is referred to as virtual disk reconfiguration and is the
`primary focus of this section.
`The addition of a disk to a server is handled locally by the given
`server. Subsequent storage allocation requests automatically take
`
`the new disk into consideration. However, for load balance, it
`is desirable to redistribute previously allocated storage to the new
`disk as well. This redistribution is most easily accomplished as part
`of a local background process that periodically moves data among
`disks. We have not yet implemented such a background process in
`Petal. Nonetheless, existing data is redistributed to newly added
`disks as a side-effect of the virtual disk reconfiguration.
`The addition of a Petal server is a global operation composed
`of several steps involving the global state management module
`and the liveness module. First, the new server is added to the
`membership of the Petal storage system. Thereafter, the new server
will participate in any future global operations. Next, the sets of servers used by the liveness module for determining whether a particular server is up or down are adjusted to incorporate the
`new server. Finally, existing virtual disks are reconfigured to take
`advantage of the new server, using the process described below.
`Given the virtual-to-physical translation procedure already de-
`scribed in Section 2.1, and in the absence of any other activity in the
`system, virtual disk reconfiguration can be trivially implemented
`as follows:
`
`1. Create a new global map with the desired redundancy scheme
`and server mapping.
`
`2. Change all virtual disk directory entries that refer to the old
`global map to refer to the new one.
`
`3. Redistribute the data to the servers according to the transla-
`tions specified in the new global map. This data distribution
`could potentially require substantial amounts of network and
`disk traffic.
`
`The challenge is to perform reconfiguration incrementally and con-
`currently with the processing of normal client requests. We find it
`acceptable if the procedure takes a few hours but it must not de-
`grade the performance of the system significantly. For example, if
`a virtual disk is reconfigured because a new server has been added,
`the performance of the virtual disk should gradually increase dur-
`ing reconfiguration from its level before reconfiguration to its level
`after reconfiguration. We will describe our reconfiguration algo-
`rithm in two steps. First, we describe the basic algorithm and then
`a refinement to that algorithm. The refined algorithm is what is
`actually implemented in our system.
`In the basic algorithm, steps one and two, described above, are
`first executed. Next, starting with the translations in the most recent
`epoch that have not yet been moved, data is transferred to the new
`collection of servers as specified by the new global map. Because
`of the amount of data that may need to be moved, reconfiguration
`can take a long time to complete. In the meantime, clients will wish
`to read and write data to a virtual disk that is being reconfigured.
`To accommodate such requests, our read and write procedures are
`designed to function as follows. When a client read request is
`serviced, the old global map is tried if an appropriate translation
`is not found in the new global map. This ensures that translations
`that have not yet been moved will still be found in the old global
`map. Any client write requests will always access only the new
`global map. Also, since we move data starting with the most recent
`epoch, we ensure that read requests will not return data from an
`older epoch than that requested by the client.
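In code, the basic algorithm's request handling might look like the sketch below; gmap_read() and gmap_write() are invented stand-ins for the per-map translation-and-I/O path.

```c
#include <stdint.h>

typedef struct gmap gmap_t;

/* Assumed helpers: return 0 if this map holds a translation for the
 * offset and the I/O succeeds, nonzero otherwise. */
int gmap_read(const gmap_t *g, uint64_t off, void *buf, int len);
int gmap_write(gmap_t *g, uint64_t off, const void *buf, int len);

int read_during_reconfig(const gmap_t *new_map, const gmap_t *old_map,
                         uint64_t off, void *buf, int len)
{
    if (gmap_read(new_map, off, buf, len) == 0)
        return 0;                             /* already relocated */
    return gmap_read(old_map, off, buf, len); /* not yet moved: fall back */
}

int write_during_reconfig(gmap_t *new_map, uint64_t off,
                          const void *buf, int len)
{
    return gmap_write(new_map, off, buf, len); /* writes use only the new map */
}
```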
`The main limitation of the basic algorithm is that server map-
`pings for an entire virtual disk are changed before any data is
`moved. This means that almost every client read request submitted
that is based on the new global map will miss in the new global map and will have to be forwarded to the old one. This will usually require additional communication between servers and has the potential to seriously degrade the performance of the system.

[Figure 5: Chained-Declustering — eight data blocks D0-D7 replicated across Servers 0-3; the two copies of each block reside on neighboring servers, and the dotted rectangle marks the single virtual disk the servers jointly export.]
`The refined algorithm solves the limitation of the basic algorithm
`by relocating only small portions of a virtual disk at a time. The
`basic idea is to break up a virtual disk’s address range into three
`regions: old, new, and fenced. Requests to the old and new regions
`simply use the old and new global maps, respectively. Requests
`to the fenced region, however, use the basic algorithm we have
`described above. Once we have relocated everything in the fenced
`region, it becomes a new region and we fence another part of the
`old region. We repeat until we have moved all the data in the old
`region into the new region.
`By keeping the relative size of the fenced region small, roughly
`one to ten percent of the entire range, we minimize the forward-
`ing overhead. To help guard against fencing off a heavily used
`subrange of the virtual disk, we construct the fenced region by
`collecting small non-contiguous ranges distributed throughout the
`virtual disk, instead of a single contiguous region.
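The sketch below captures one plausible shape of this refinement; the fixed chunking, the stride used to scatter the fenced region, and relocate_chunk() are all our assumptions.

```c
enum region { R_OLD, R_FENCED, R_NEW };

#define NCHUNKS 1024   /* assumed: virtual disk split into fixed chunks */

static enum region state[NCHUNKS];   /* initially all R_OLD */

/* Requests consult the chunk's region: R_OLD and R_NEW use the old and
 * new global maps directly; R_FENCED uses the basic algorithm above. */
enum region classify(unsigned chunk) { return state[chunk]; }

/* One reconfiguration pass: fence every stride-th old chunk (a small,
 * scattered slice of the disk), relocate it, then mark it new. */
void fence_and_move(int pass, int stride, void (*relocate_chunk)(unsigned))
{
    for (unsigned c = pass; c < NCHUNKS; c += stride) {
        if (state[c] != R_OLD)
            continue;
        state[c] = R_FENCED;      /* now served via the basic algorithm */
        relocate_chunk(c);        /* move its data to the new global map */
        state[c] = R_NEW;         /* from here on, only the new map */
    }
}
```

With a stride between 10 and 100, each pass fences roughly one to ten percent of the chunks, scattered across the whole address range rather than concentrated in one contiguous subrange.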
`
`2.4 Data Access and Recovery
`
`This section describes Petal’s chained-declustered [13] data access
`and recovery modules. These modules give clients highly available
`access to data by automatically bypassing failed components. Dy-
`namic load balancing eliminates system bottlenecks by ensuring
`uniform load distribution even in the face of component failures.
`We start by describing the basic idea behind chained-declustering
`and then move into detailed descriptions of exactly what happens
`on each read and write operation.
`Figure 5 illustrates the chained-declustered data placement
`scheme. The dotted rectangle emphasizes that the data on the
`storage servers appear as a single virtual disk to clients. Each
`sequence of letters represents a block of data stored in the stor-
`age system. Note that the two copies of each block of data are
always stored on neighboring servers. Furthermore, every pair of
`neighboring servers has data blocks in common. Because of this
`arrangement, if Server 1 fails, servers 0 and 2 will automatically
`share Server 1’s read load; however, Server 3 will not experience
`any load increase. By performing dynamic load balancing, we can
`do better. For example, since Server 3 has copies of some data
`from servers 0 and 2, servers 0 and 2 can offload some of their
`normal read load on Server 3 and achieve uniform load balancing.
`Chaining the data placement allows each server to offload some
`of its read load to the server either immediately following or pre-
`
`ceding the given server. By cascading the offloading across multi-
`ple servers, a uniform load can be maintained across all surviving
`servers. In contrast, with a simple mirrored redundancy scheme
`that replicates all the data stored on two servers, the failure of
`either would result in a 100% load increase at the other with no
`opportunities for dynamic load balancing. In a system that stripes
`over many mirrored servers, the 100% load increase at this single
`server would reduce the overall system throughput by 50%.
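Reading the layout of Figure 5 as a modular chain gives the placement rule sketched below; the mod-N formulas are our interpretation of the figure rather than code from the paper.

```c
#define NSERVERS 4   /* the four-server configuration of Figure 5 */

/* Block b's primary copy lives on server (b mod N); its secondary copy
 * lives on the next server in the chain. */
int primary_server(unsigned long block)   { return block % NSERVERS; }
int secondary_server(unsigned long block) { return (block + 1) % NSERVERS; }
```

Under this rule, Server 1's blocks are mirrored on servers 0 and 2, which is why its failure lets both neighbors, and by cascaded offloading the whole chain, absorb its read load.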
`Our current prototype implements a simple dynamic load bal-
`ancing scheme. Each client keeps track of the number of requests
`it has pending at each server and always sends read requests to the
`server with the shorter queue length. This works well if most of
`the requests are generated by a few clients but, obviously, would
`not work well if most requests are generated by many clients that
`only occasionally issue I/O requests. The choice of load balancing
`algorithm is currently an active area of research within the Petal
`project.
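The rule itself is tiny. A sketch, assuming per-server pending counts maintained inside the client's driver:

```c
#define NSERVERS 4

static int pending[NSERVERS];  /* this client's outstanding requests */

/* Send the read to whichever copy has the shorter local queue; ties go
 * to the primary. Only this client's own requests are counted, which is
 * why the scheme favors workloads driven by a few busy clients. */
int pick_read_server(int primary, int secondary)
{
    return (pending[secondary] < pending[primary]) ? secondary : primary;
}
```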
`An additional advantage with chained-declustering is that by
`placing all the even-numbered servers at one site and all the odd-
`numbered servers at another site, we can tolerate site failures. A
`disadvantage of chained-declustering relative to simple mirroring
`is that it is less reliable. With simple mirroring, if a server failed,
`only the failure of its mirror server would result in data becoming
`unavailable. With chained-declustering, if a server fails, the failure
`of either one of its two neighboring servers will result in data
becoming unavailable.
`In our implementation of chained-declustering, one of the two
`copies of each data block is denoted the primary and the other is
`denoted the secondary. Read requests can be serviced from either
`the primary or the secondary copy but the servicing of write re-
`quests must always start at the primary, unless the server containing
the primary is down, in which case it may start at the secondary.
`Because we lock copies of the data blocks before reading or writing
`them to guarantee consistency, this ordering guarantee is necessary
`to avoid deadlocks.
`On a read request, the server that receives the request attempts
`to read the requested data.
`If successful, the server returns the
`requested data, otherwise it returns an error code and the client tries
`another server. If a request times out due to network congestion
`or because a server is down, the client will alternately retry the
`primary and secondary servers until either the request succeeds
`or both servers return error codes indicating that it is not possible
`to satisfy the request. Currently, this happens only if both disks
`containing copies of the requested data have been destroyed.
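A sketch of this retry loop, with invented status codes and an invented RPC stub, appears below.

```c
#include <stdint.h>

#define P_OK       0
#define P_TIMEOUT  1   /* congestion or server down: try the other copy */
#define P_GONE     2   /* the disk holding this copy has been destroyed */

int petal_read_rpc(int server, uint64_t off, void *buf, int len);

int read_with_failover(int primary, int secondary,
                       uint64_t off, void *buf, int len)
{
    int servers[2] = { primary, secondary };
    int gone[2] = { 0, 0 };

    for (int i = 0; ; i = 1 - i) {          /* alternate primary/secondary */
        if (gone[0] && gone[1])
            return P_GONE;                  /* both copies unrecoverable */
        if (gone[i])
            continue;
        int rc = petal_read_rpc(servers[i], off, buf, len);
        if (rc == P_OK)
            return P_OK;
        if (rc == P_GONE)
            gone[i] = 1;
        /* on P_TIMEOUT, fall through and retry the other server */
    }
}
```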
`On a write request, the server that receives the request first
`checks to see if it is the primary for the specified data element. If
`it is the primary, it first marks this data element as busy on stable
`storage.
`It then simultaneously sends write requests to its local
`copy and the secondary copy. When both requests complete, the
`busy bit is cleared and the client that issued the request is sent a
`status code indicating the success or failure of the operation. If the
`primary crashes while performing the update, the busy bits are used
`during crash recovery to ensure that the primary and secondary
`copies are consistent. Write-ahead-logging with group commits
`makes updating the busy bits efficient. As a further optimization,
`the clearing of busy bits is done lazily and we maintain a cache
`of the most recently set busy bits. Thus, if write requests display
`locality, a given busy bit will already be set on disk and will not
`require additional I/O.
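The primary's side of this protocol might look like the following sketch. Every helper here (the busy-bit cache, logging, and mirror write) is an assumed stand-in for the mechanisms the text describes, and the two copy writes that Petal issues simultaneously are shown sequentially for brevity.

```c
#include <stdint.h>

int  busy_bit_on_disk(uint64_t elem);   /* cached: recently set, not yet cleared */
void log_set_busy(uint64_t elem);       /* write-ahead log with group commit */
int  write_local_copy(uint64_t elem, const void *buf, int len);
int  write_secondary_copy(int server, uint64_t elem, const void *buf, int len);
void schedule_lazy_clear(uint64_t elem);

int primary_write(int secondary, uint64_t elem, const void *buf, int len)
{
    if (!busy_bit_on_disk(elem))
        log_set_busy(elem);        /* mark busy on stable storage first, so a
                                      crash mid-update is found by recovery */

    int rc1 = write_local_copy(elem, buf, len);
    int rc2 = write_secondary_copy(secondary, elem, buf, len);

    schedule_lazy_clear(elem);     /* busy bit cleared lazily; with locality
                                      the next write finds it already set */
    return (rc1 == 0 && rc2 == 0) ? 0 : -1;
}
```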
`If the server that received the write request is the secondary for
`the specified data element, then it will service the request only
`if it can determine that the server containing the primary copy is
`down. In this case, the secondary marks the data element as stale
`on stable storage before writing it to its local disk. The server
`containing the primary copy will eventually have to bring all data
`elements marked stale up-to-date during its recovery process. A
`similar procedure is used by the primary if the secondary dies.
`
3 Implementation and Performance
`
[Figure 6: Petal Prototype — four Petal clients and four Petal servers, each server with a write-ahead log, connected by a Digital ATM network and jointly exporting a Petal virtual disk.]

Our Petal prototype is illustrated in Figure 6. Four 225 MHz DEC
`3000/700s running Digital Unix act as server machines. Each runs
`a single Petal server, which is a user-level process that accesses the
`physical disks using the Unix raw disk interface, and the network
`using UDP/IP Unix sockets. Each server machine is configured
`with 14 Digital RZ29 disks, each of which is a 3.5 inch SCSI device
`with a 4.3 Gbyte capacity. Each machine uses one of the disks for
write-ahead logging and the remaining disks to store client data. The
`disks are connected to the server machine via two 10 Mbyte/s fast
`SCSI strings using the Digital PMZAA-C host bus adapter.
`Four additional machines running Digital Unix are configured
`as Petal clients to generate load on the servers. Each client’s kernel
`is loaded with the Petal device driver for accessing Petal virtual
`disks. This allows clients to access Petal virtual disks just like local
`disks. Both the servers and clients are connected to each other via
`155 Mbit/s ATM links over a Digital ATM network.
`The entire Petal RPC interface has 24 calls and many of these
`calls are devoted to management functions, such as creating and
`deleting virtual disks, making snapshots, reconfiguring a virtual
`disk, and adding and deleting servers. These calls are typically
`used by user-level utilities to perform tasks such as virtual disk
`creation and monitoring the physical resource pools in the system
to determine when additional servers or disks should be added.
`Petal RPC calls that implement management functions are infre-
`quently executed and generally take less than a second to complete.
`In particular, create and snapshot operations take about 650 mil-
`liseconds. Delete and reconfiguration take about 650 milliseconds
`to initiate, but their total execution time is dependent on the actual
`amount of physical storage associated with the specified virtual
`disk.
`In the remainder of the section, we will report on the perfor-
`mance of accessing a Petal virtual disk and the behavior of file sys-
`tems built on Petal. Our primary performance goals are to provide
latency roughly comparable to a locally attached disk, throughput that scales with the number of servers, and performance that gracefully degrades as servers fail.
`
                        Client Request Latency (ms)
Request            Local Disk   Petal (NVRAM Log)   Petal (RZ29 Log)
512 byte Read           9              10                 10
8 Kbyte Read           11              12                 12
64 Kbyte Read          21              28                 28
512 byte Write         10              12                 19
8 Kbyte Write          12              16                 22
64 Kbyte Write         20              33                 40

Table 1: Latency of a Chained-Declustered Virtual Disk

                 Aggregate Throughput (RZ29 Log)
Request            Normal          Failed          % of Normal
512 byte Read      3150 req/s      2310 req/s         73%
8 Kbyte Read       20 Mbytes/s     14.6 Mbytes/s      73%
64 Kbyte Read      43.1 Mbytes/s   33.7 Mbytes/s      78%
512 byte Write     1030 req/s      1055 req/s        102%
8 Kbyte Write      6.6 Mbytes/s    6.6 Mbytes/s      100%
64 Kbyte Write     12.3 Mbytes/s   12.5 Mbytes/s     101%

Table 2: Normal and Failed Throughput of a Chained-Declustered Virtual Disk
`
[Figure: Relative Throughput scaling chart (axis range 0.00-1.00).]
`
`3.1 Petal Performance
`
`This section examines the read and write performance of a Petal
chained-declustered virtual disk. For a read request, the client makes
`an RPC to a Petal server that simply returns the data from its local
`disk. When a server receives a write request, it first writes a small
`log entry that is used to recover to a consistent state after a server
`crash. Next, the server simultaneously writes the data to its local
`disk and a second disk on a mirror server. When both disk writes
`complete, the first, or primary, server replies to the client. The read
`and write procedures used by Petal are described in greater detail
`in Section 2.4.
`Table 1 compares the read and write latency of a chained-
`declustered Petal virtual disk with a local RZ29 disk. For this
`experiment, a single client generates requests of the specified size
`to random disk offsets. We show Petal performance with two kinds
`of write-ahead-logging devices, an RZ29 disk and an NVRAM de-
`vice simulated using RAM. The log device is used only to service
`write requests and does not affect read performance. Logging to
`NVRAM improves write latency by approximately 7 ms.
`For read requests of 512 bytes and 8 Kbytes, the Petal latency
`is only slightly worse than an RZ29. For 64 Kbyte reads, the
`latency gap widens to 7 ms. Most of the increased latency is due to
`the additional delay in transmitting the data over the network and
`includes the Unix socket, UDP/IP, and ATM hardware overheads,
`which accounts for over 6 ms. The Petal server software and the
`client interface overheads are negligible.
`If we overlapped the
`reading of data from disks with the transfer of data over the network,
`we could eliminate much of this 7 ms overhead.
`Even with an NVRAM log device, Petal write performance is
`worse than a local RZ29 disk. In addition to the network delay in
`sending the data to the primary server, there is an additional delay
`because the primary has to send the data to the mirror server and
`wait for an acknowledgment before returning to the client. The
`latencies due to the network transmissions are approximately 1 ms,
`3 ms, and 12 ms for 512 byte, 8 Kbyte, and 64 Kbyte write requests
`respectively. Also, the arms and the spindles of the primary and
`secondary disks are unsynchronized. This lack of synchronization
`causes write requests to wait for the slower of the primary and
secondary.