`
`A Cost-Effective, High-Bandwidth Storage Architecture
`
`Garth A. Gibson*, David F. Nagle†, Khalil Amiri†, Jeff Butler†, Fay W. Chang*,
`Howard Gobioff*, Charles Hardin†, Erik Riedel†, David Rochberg*, Jim Zelenka*
`
`School of Computer Science*
`Department of Electrical and Computer Engineering†
`Carnegie Mellon University, Pittsburgh, PA 15213
`garth+asplos98@cs.cmu.edu
`
`ABSTRACT
`
`This paper describes the Network-Attached Secure Disk
`(NASD) storage architecture, prototype implementations of
`NASD drives, array management for our architecture, and
`three filesystems built on our prototype. NASD provides scal-
`able storage bandwidth without the cost of servers used
`primarily for transferring data from peripheral networks
`(e.g. SCSI) to client networks (e.g. ethernet). Increasing
`dataset sizes, new attachment technologies, the convergence
`of peripheral and interprocessor switched networks, and the
`increased availability of on-drive transistors motivate and
`enable this new architecture. NASD is based on four main
`principles: direct transfer to clients, secure interfaces via
`cryptographic support, asynchronous non-critical-path
`oversight, and variably-sized data objects. Measurements of
`our prototype system show that these services can be cost-
`effectively integrated into a next generation disk drive ASIC.
`End-to-end measurements of our prototype drive and filesys-
`tems suggest that NASD can support conventional distrib-
`uted filesystems without performance degradation. More
`importantly, we show scalable bandwidth for NASD-special-
`ized filesystems. Using a parallel data mining application,
`NASD drives deliver a linear scaling of 6.2 MB/s per client-
`drive pair, tested with up to eight pairs in our lab.
`
`Keywords
`
`D.4.3 File systems management, D.4.7 Distributed systems,
`B.4 Input/Output and Data Communications.
`1. INTRODUCTION
`Demands for storage bandwidth continue to grow due to
`rapidly increasing client performance, richer data types such
`as video, and data-intensive applications such as data
mining. For storage subsystems to deliver scalable bandwidth,
that is, linearly increasing application bandwidth with
`increasing numbers of storage devices and client processors,
`the data must be striped over many disks and network links
`[Patterson88]. With 1998 technology, most office, engineer-
`ing, and data processing shops have sufficient numbers of
`disks and scalable switched networking, but they access stor-
`age through storage controller and distributed fileserver
`bottlenecks. These bottlenecks arise because a single
`“server” computer receives data from the storage (periph-
`eral) network and forwards it to the client (local area)
`network while adding functions such as concurrency control
`and metadata consistency. A variety of research projects
`have explored techniques for scaling the number of
`machines used to enforce the semantics of such controllers
`or fileservers [Cabrera91, Hartman93, Cao93, Drapeau94,
`Anderson96, Lee96, Thekkath97]. As Section 3 shows, scal-
`ing the number of machines devoted to store-and-forward
`copying of data from storage to client networks is expensive.
`
`This paper makes a case for a new scalable bandwidth stor-
`age architecture, Network-Attached Secure Disks (NASD),
`which separates management and filesystem semantics from
`store-and-forward copying. By evolving the interface for
`commodity storage devices (SCSI-4 perhaps), we eliminate
`the server resources required solely for data movement. As
`with earlier generations of SCSI, the NASD interface is
`simple, efficient and flexible enough to support a wide range
`of filesystem semantics across multiple generations of tech-
`nology. To demonstrate how a NASD architecture can
`deliver scalable bandwidth, we describe a prototype imple-
`mentation of NASD, a storage manager for NASD arrays,
`and a simple parallel filesystem that delivers scalable band-
`width to a parallel data-mining application. Figure 1 illus-
`trates the components of a NASD system and indicates the
`sections describing each.
`
`We continue in Section 2 with a discussion of storage system
`architectures and related research. Section 3 presents
`enabling technologies for NASD. Section 4 presents an over-
`view of our NASD interface, its implementation and perfor-
`mance. Section 5 discusses ports of NFS and AFS
`filesystems to a NASD-based system, our implementation of
`a NASD array management system, a simple parallel filesys-
`tem, and an I/O-intensive data mining application that
`exploits the bandwidth of our prototype. This section reports
`the scalability of our prototype compared to the performance
`of a fast single NFS server. Section 6 discusses active disks,
`the logical extension of NASD to execute application code.
`Section 7 concludes with a discussion of ongoing research.
`
`1
`
`Oracle Ex. 1017, pg. 1
`
`
`
[Figure 1 diagram: a client (application, filesystem, NASD driver, net
protocol, net hardware) connects through a switch to multiple NASD
drives; each drive runs the NASD object system (security, layout, net
protocol, controller, net hardware, HDA; Section 4). A file manager
(filesystem: access control, namespace, consistency; Section 5.1)
handles access control and management traffic, while clients read and
write the drives directly. A storage manager (striping FS: concurrency
control, mapping, redundancy; Section 5.2) coordinates the drives.]
`Figure 1: An overview of a scalable bandwidth NASD system. The major components are annotated with the layering of their logical
`components. The innermost box shows a basic NASD drive as described in Section 4. The larger box contains the essentials for a
`NASD-based filesystem, which adds a file manager and client as detailed in Section 5.1. Finally, the outer box adds a storage manager
to coordinate the drives on which a parallel filesystem is built, as discussed in Section 5.2.
`
`2. BACKGROUND AND RELATED WORK
`Figure 2 illustrates the principal alternative storage architec-
`tures: (1) a local filesystem, (2) a distributed filesystem
`(DFS) built directly on disks, (3) a distributed filesystem
`built on a storage subsystem, (4) a network-DMA distrib-
`uted filesystem, (5) a distributed filesystem using smart
`object-based disks (NASD) and (6) a distributed filesystem
`using a second level of objects for storage management.
`
`The simplest organization (1) aggregates the application,
`file management (naming, directories, access control, con-
`currency control) and low-level storage management. Disk
`data makes one trip over a simple peripheral area network
`such as SCSI or Fibrechannel and disks offer a fixed-size
`block abstraction. Stand-alone computer systems use this
`widely understood organization.
`
`To share data more effectively among many computers, an
`intermediate server machine is introduced (2). If the server
`offers a simple file access interface to clients, the organiza-
`tion is known as a distributed filesystem. If the server pro-
`cesses data on behalf of the clients, this organization is a
`distributed database. In organization (2), data makes a sec-
`ond network trip to the client and the server machine can
`become a bottleneck, particularly since it usually serves
`large numbers of disks to better amortize its cost.
`
`The limitations of using a single central fileserver are
`widely recognized. Companies such as Auspex and Net-
`work Appliance have attempted to improve file server per-
`formance, specifically the number of clients supported,
`through the use of special purpose server hardware and
highly optimized software [Hitz90, Hitz94]. Although not
the topic of this paper, the NASD architecture can improve
`the client-load-bearing capability of traditional filesystems
`by off-loading simple data-intensive processing to NASD
`drives [Gibson97a].
`
`To transparently improve storage bandwidth and reliability
`many systems interpose another computer, such as a RAID
`controller [Patterson88]. This organization (3) adds another
`peripheral network transfer and store-and-forward stage for
`data to traverse.
`
`Provided that the distributed filesystem is reorganized to
`logically “DMA” data rather than copy it through its server,
`a fourth organization (4) reduces the number of network
`transits for data to two. This organization has been exam-
`ined extensively [Drapeau94, Long94] and is in use in the
`HPSS implementation of the Mass Storage Reference
`Model [Watson95, Miller88]. Organization (4) also applies
`to systems where clients are trusted to maintain filesystem
`metadata integrity and implement disk striping and redun-
`dancy [Hartman93, Anderson96]. In this case, client cach-
`ing of metadata can reduce the number of network transfers
`for control messages and data to two. Moreover, disks can
`be attached to client machines which are presumed to be
`independently paid for and generally idle. This eliminates
`additional store-and-forward cost, if clients are idle, without
`eliminating the copy itself.
`
`As described in Section 4, the NASD architecture (5)
`embeds the disk management functions into the device and
`offers a variable-length object storage interface. In this
`organization, file managers enable repeated client accesses
`to specific storage objects by granting a cachable capability.
`
`2
`
`Oracle Ex. 1017, pg. 2
`
`
`
[Figure 2 diagram: the six organizations, (1) local filesystem, (2)
distributed FS, (3) distributed FS with RAID controller, (4) DMA-based
DFS, (5) NASD-based DFS, and (6) NASD-Cheops-based DFS, each drawn as a
stack of application, file manager, object store, and disk, annotated
with the number of computers in the data/control path (values 2, 3, 4,
3/4, 2/2+, and 2/2++) and connected by LAN, PAN, and SAN links carrying
read, write, and bulk data transfer traffic.]
`Figure 2: Evolution of storage architectures for untrusted networks and clients. Boxes are computers, horizontal lines are
`communication paths and vertical lines are internal and external interfaces. LAN is a local area network such as Ethernet or FDDI.
`PAN is a peripheral area network such as SCSI, Fibrechannel or IBM’s ESCON. SAN is an emerging system area network such as
`ServerNet, Myrinet or perhaps Fibrechannel or Ethernet that is common across clients, servers and devices. On the far right, a disk is
`capable of functions such as seek, read, write, readahead, and simple caching. The object store binds blocks into variable-length
`objects and manages the layout of these objects in the storage space offered by the device(s). The file manager provides naming,
`directory hierarchies, consistency, access control, and concurrency control. In NASD, storage management is done by recursion on
`the object interface on the SAN.
`
`Hence, all data and most control travels across the network
`once and there is no expensive store-and-forward computer.
`
`The idea of a simple, disk-like network-attached storage
`server as a building block for high-level distributed filesys-
`tems has been around for a long time. Cambridge’s Univer-
`sal File Server (UFS) used an abstraction similar to NASD
`along with a directory-like index structure [Birrell80]. The
`UFS would reclaim any space that was not reachable from a
`root index. The successor project at Cambridge, CFS, also
`performed automatic reclamation and added undoable (for a
`period of time after initiation) transactions into the filesys-
`tem interface [Mitchell81]. To minimize coupling of file
`manager and device implementations, NASD offers less
`powerful semantics, with no automatic reclamation or trans-
`action rollback.
`
`Using an object interface in storage rather than a fixed-
`block interface moves data layout management to the disk.
`In addition, NASD partitions are variable-sized groupings
`of objects, not physical regions of disk media, enabling the
`total partition space to be managed easily, in a manner simi-
`lar to virtual volumes or virtual disks [IEEE95, Lee96]. We
`also believe that specific implementations can exploit
`NASD’s uninterpreted filesystem-specific attribute fields to
`respond to higher-level capacity planning and reservation
systems such as HP's attribute-managed storage
[Golding95]. Object-based storage is also being pursued for
quality-of-service at the device, transparent performance
optimizations, and drive-supported data sharing
[Anderson98a].
`
ISI's Netstation project [VanMeter96] proposes a form of
object-based storage called Derived Virtual Devices (DVD)
`in which the state of an open network connection is aug-
`mented with access control policies and object metadata,
`provided by the file manager using Kerberos [Neuman94]
`for underlying security guarantees. This is similar to
`NASD’s mechanism except that NASD’s access control pol-
`icies are embedded in unforgeable capabilities separate from
`communication state, so that their interpretation persists (as
`objects) when a connection is terminated. Moreover, Netsta-
`tion’s use of DVD as a physical partition server in VISA
`[VanMeter98] is not similar to our use of NASD as a single-
`object server in a parallel distributed filesystem.
`
`In contrast to the ISI approach, NASD security is based on
`capabilities, a well-established concept for regulating access
`to resources [Dennis66]. In the past, many systems have
`used capabilities that rely on hardware support or trusted
operating system kernels to protect system integrity
`[Wulf74, Wilkes79, Karger88]. Within NASD, we make no
`assumptions about the integrity of the client to properly
`maintain capabilities. Therefore, we utilize cryptographic
`techniques similar to ISCAP [Gong89] and Amoeba
`[Tanenbaum86]. In these systems, both the entity issuing a
`capability and the entity validating a capability must share a
`large amount of private information about all of the issued
`capabilities. These systems are generally implemented as
`single entities issuing and validating capabilities, while in
`NASD these functions are done in distinct machines and no
`per-capability state is exchanged between issuer and valida-
`tor.
`
`To offer disk striping and redundancy for NASD, we layer
`the NASD interface. In this organization (6), a storage man-
`
`3
`
`Oracle Ex. 1017, pg. 3
`
`
`
`ager replaces the file manager’s capability with a set of
`capabilities for the objects that actually make up the high-
`level striped object. This costs an additional control mes-
`sage but once equipped with these capabilities, clients again
`access storage objects directly. Redundancy and striping are
`done within the objects accessible with the client’s set of
`capabilities, not the physical disk addresses.
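
To make this layering concrete, the sketch below shows one way a
client, once handed the set of component capabilities by the storage
manager, could map a logical byte offset in a striped object onto a
particular drive's component object. The structures, names, and layout
policy are illustrative assumptions, not the actual Cheops interface
described in Section 5.2.

# Hypothetical sketch of striping across per-drive component objects;
# names and layout policy are illustrative, not the Cheops implementation.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ComponentCapability:        # stands in for one cryptographic capability
    drive: str                    # which NASD drive holds this component object
    object_id: int

@dataclass
class StripedObject:
    stripe_unit: int                        # bytes per stripe unit
    components: List[ComponentCapability]   # returned by the storage manager
                                            # in one extra control message

def locate(s: StripedObject, offset: int) -> Tuple[ComponentCapability, int]:
    # Map a logical byte offset to (component capability, offset within it).
    unit = offset // s.stripe_unit
    cap = s.components[unit % len(s.components)]
    component_offset = ((unit // len(s.components)) * s.stripe_unit
                        + offset % s.stripe_unit)
    return cap, component_offset

# Example: an object striped over four drives with a 512 KB stripe unit.
obj = StripedObject(512 * 1024,
                    [ComponentCapability(f"nasd{i}", 0x100 + i) for i in range(4)])
cap, off = locate(obj, 3 * 512 * 1024 + 100)
print(cap.drive, off)    # nasd3 100: the client now reads drive nasd3 directly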
`
`Our storage management system, Cheops, differs from other
`storage subsystems with scalable processing power such as
`Swift, TickerTAIP and Petal [Long94, Cao93, Lee96] in that
`Cheops uses client processing power rather than scaling the
`computational power of the storage subsystem. Cheops is
`similar to the Zebra and xFS filesystems except that client
`trust is not required because the client manipulates only
`objects it can access [Hartman93, Anderson96].
`3. ENABLING TECHNOLOGY
`Storage architecture is ready to change as a result of the syn-
`ergy between five overriding factors: I/O bound applica-
`tions, new drive attachment technologies, an excess of on-
`drive transistors, the convergence of peripheral and inter-
`processor switched networks, and the cost of storage sys-
`tems.
`I/O-bound applications: Traditional distributed filesystem
`workloads are dominated by small random accesses to small
`files whose sizes are growing with time, though not dramat-
`ically [Baker91, TPC98]. In contrast, new workloads are
`much more I/O-bound, including data types such as video
`and audio, and applications such as data mining of retail
transactions, medical records, or telecommunication call records.
`New drive attachment technology: The same technology
`improvements that are increasing disk density by 60% per
`year are also driving up disk bandwidth at 40% per year
`[Grochowski96]. High transfer rates have increased pres-
`sure on the physical and electrical design of drive busses,
`dramatically reducing maximum bus length. At the same
`time, people are building systems of clustered computers
`with shared storage. For these reasons, the storage industry
`is moving toward encapsulating drive communication over
`Fibrechannel [Benner96], a serial, switched, packet-based
`peripheral network that supports long cable lengths, more
`ports, and more bandwidth. One impact of NASD is to
`evolve the SCSI command set that is currently being encap-
`sulated over Fibrechannel to take full advantage of the
`promises of that switched-network technology for both
`higher bandwidth and increased flexibility.
`Excess of on-drive transistors: The increasing transistor
`density in inexpensive ASIC technology has allowed disk
`drive designers to lower cost and increase performance by
`integrating sophisticated special-purpose functional units
`into a small number of chips. Figure 3 shows the block dia-
`gram for the ASIC at the heart of Quantum’s Trident drive.
`When drive ASIC technology advances from 0.68 micron
CMOS to 0.35 micron CMOS, designers could insert a 200 MHz
`StrongARM microcontroller, leaving 100,000 gate-equiva-
`lent space for functions such as onchip DRAM or crypto-
`graphic support. While this may seem like a major jump,
Siemens' TriCore integrated microcontroller and ASIC
`architecture promises to deliver a 100 MHz, 3-way issue,
`
[Figure 3 diagram: (a) the current Trident ASIC, 74 mm2 at 0.68 micron;
(b) a next-generation 0.35 micron ASIC, in which the shrink frees about
40 mm2 and 100 Kgates for functions such as cryptography or network
support, and a 0.35 micron StrongARM RISC microprocessor with 8K+8K
caches (200 MHz, 230 Dhrystone MIPS) fits in 27 mm2.]
`Figure 3: Quantum’s Trident disk drive features the ASIC on the left (a). Integrated onto this chip in four independent clock domains are
`10 function units with a total of about 110,000 logic gates and a 3 KB SRAM: a disk formatter, a SCSI controller, ECC detection, ECC
`correction, spindle motor control, a servo signal processor and its SRAM, a servo data formatter (spoke), a DRAM controller, and a
`microprocessor port connected to a Motorola 68000 class processor. By advancing to the next higher ASIC density, this same die area
`could also accommodate a 200 MHz StrongARM microcontroller and still have space left over for DRAM or additional functional units
`such as cryptographic or network accelerators.
`
`4
`
`Oracle Ex. 1017, pg. 4
`
`
`
`Convergence of peripheral and interprocessor networks:
`Scalable computing is increasingly based on clusters of
`workstations. In contrast to the special-purpose, highly reli-
`able, low-latency interconnects of massively parallel pro-
`cessors such as the SP2, Paragon, and Cosmic Cube,
`clusters typically use Internet protocols over commodity
`LAN routers and switches. To make clusters effective, low-
`latency network protocols and user-level access to network
`adapters have been proposed, and a new adapter card inter-
`face, the Virtual Interface Architecture, is being standard-
`ized [Maeda93, Wilkes92, Boden95, Horst95, vonEicken95,
`Intel97]. These developments continue to narrow the gap
`between the channel properties of peripheral interconnects
`and the network properties of client interconnects [Sachs94]
`and make Fibrechannel and Gigabit Ethernet look more
alike than different.
`
`Cost-ineffective storage servers: In high performance dis-
`tributed filesystems, there is a high cost overhead associated
`with the server machine that manages filesystem semantics
`and bridges traffic between the storage network and the cli-
`ent network [Anderson96]. Figure 4 illustrates this problem
`for bandwidth-intensive applications in terms of maximum
`storage bandwidth. Based on these cost and peak perfor-
`mance estimates, we can compare the expected overhead
`cost of a storage server as a fraction of the raw storage cost.
`Servers built from high-end components have an overhead
`that starts at 1,300% for one server-attached disk! Assuming
`dual 64-bit PCI busses that deliver every byte into and out
`of memory once, the high-end server saturates with 14
`disks, 2 network interfaces, and 4 disk interfaces with a
`115% overhead cost. The low cost server is more cost effec-
`tive. One disk suffers a 380% cost overhead and, with a 32-
`bit PCI bus limit, a six disk system still suffers an 80% cost
`overhead.
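
For concreteness, the arithmetic behind these overhead figures can be
reproduced approximately as follows, using the component prices and
peak bandwidths from Figure 4. The helper function is one reading of
the model (machine cost plus the interfaces needed to carry the disks'
aggregate bandwidth, divided by the total disk cost), so its outputs
only approximate the rounded percentages quoted above.

# Sketch of the Figure 4 server-overhead estimate (one reading of the model).
def overhead(machine, n_nics, nic, n_hbas, hba, n_disks, disk):
    # Overhead = (server machine + network interfaces + disk interfaces)
    #            / (total cost of the disks themselves)
    server_cost = machine + n_nics * nic + n_hbas * hba
    return server_cost / (n_disks * disk)

# High-end: $7000 machine, $650 Gigabit Ethernet NICs, $400 Ultra2 SCSI
# adapters, $600 Cheetah disks (18 MB/s each).
print(overhead(7000, 1, 650, 1, 400, 1, 600))    # 13.4 -> ~1,300% for one disk
print(overhead(7000, 2, 650, 4, 400, 14, 600))   # 1.18 -> ~115% at saturation
                                                 # (14 disks, 2 NICs, 4 adapters)

# Low-cost: $1000 machine, $50 Fast Ethernet NICs, $100 Ultra SCSI adapters,
# $300 Medallist disks (10 MB/s each).
print(overhead(1000, 1, 50, 1, 100, 1, 300))     # 3.83 -> ~380% for one disk
print(overhead(1000, 5, 50, 2, 100, 6, 300))     # 0.81 -> ~80% with six disks
                                                 # (five NICs carry 60 MB/s)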
`
`While we can not accurately anticipate the marginal
`increase in the cost of a NASD over current disks, we esti-
`mate that the disk industry would be happy to charge 10%
`more. This bound would mean a reduction in server over-
`
head costs of at least a factor of 10 and in total storage
system cost (neglecting the network infrastructure) of over
50%.

[Figure 4 diagram, cost model components with low-cost / high-end
values: processor unit and system memory, $1000 / $7000; network
interface (100 Mb/s Fast Ethernet / 1 Gb/s Gigabit Ethernet), $50 /
$650; I/O bus, 133 MB/s / 532 MB/s; disk interface (40 MB/s / 80 MB/s
Ultra SCSI), $100 / $400; Seagate Medallist / Cheetah disks, 10 MB/s /
18 MB/s, $300 / $600.]
Figure 4: Cost model for the traditional server architecture. In
this simple model, a machine serves a set of disks to clients
using a set of disk (wide Ultra and Ultra2 SCSI) and network
(Fast and Gigabit Ethernet) interfaces. Using peak bandwidths
and neglecting host CPU and memory bottlenecks, we
estimate the server cost overhead at maximum bandwidth as
the sum of the machine cost and the costs of sufficient
numbers of interfaces to transfer the disks' aggregate
bandwidth divided by the total cost of the disks. While the
prices are probably already out of date, the basic problem of a
high server overhead is likely to remain. We report pairs of
costs and bandwidth estimates. On the left, we show values
for a low cost system built from high-volume components. On
the right, we show values for a high-performance reliable
system built from components recommended for mid-range
and enterprise servers [Pricewatch98].
`4. NETWORK-ATTACHED SECURE DISKS
`Network-Attached Secure Disks (NASD) enable cost-effec-
`tive bandwidth scaling. NASD eliminates the server band-
`width bottleneck by modifying storage devices to transfer
`data directly to clients and also repartitions traditional file
`server or database functionality between the drive, client
`and server as shown in Figure 1. NASD presents a flat name
`space of variable-length objects that is both simple enough
`to be implemented efficiently yet flexible enough for a wide
`variety of applications. Because the highest levels of distrib-
`uted filesystem functionality—global naming, access con-
`trol, concurrency control, and cache coherency—vary
`significantly, we do not advocate that storage devices sub-
`sume the file server entirely. Instead, the residual filesystem,
`the file manager, should define and manage these high-level
`policies while NASD devices should implement simple
`storage primitives efficiently and operate as independently
`of the file manager as possible.
`
`Broadly, we define NASD to be storage that exhibits the fol-
`lowing four properties:
`Direct transfer: Data is transferred between drive and cli-
`ent without indirection or store-and-forward through a file
`server machine.
`Asynchronous oversight: We define asynchronous over-
`sight as the ability of the client to perform most operations
`without synchronous appeal to the file manager. Frequently
`consulted but infrequently changed policy decisions, such as
`authorization decisions, should be encoded into capabilities
`by the file manager and subsequently enforced by drives.
`Cryptographic integrity: By attaching storage to the net-
`work, we open drives to direct attack from adversaries. Thus,
`it is necessary to apply cryptographic techniques to defend
`against potential attacks. Drives ensure that commands and
`data have not been tampered with by generating and verifying
`cryptographic keyed digests. This is essentially the same
`requirement for security as proposed for IPv6 [Deering95].
`
`5
`
`Oracle Ex. 1017, pg. 5
`
`
`
`Object-based interface: To allow drives direct knowledge
`of the relationships between disk blocks and to minimize
`security overhead, drives export variable length “objects”
`instead of fixed-size blocks. This also improves opportuni-
`ties for storage self-management by extending into a disk an
`understanding of the relationships between blocks on the
`disk [Anderson98a].
`
`4.1 NASD Interface
`For the experiments presented in this paper, we have con-
`structed a simple, capability-based, object-store interface
`(documented separately [Gibson97b]). This interface con-
`tains less than 20 requests including: read and write object
`data; read and write object attributes; create and remove
`object; create, resize, and remove partition; construct a
`copy-on-write object version; and set security key. Figure 5
`diagrams the components of a NASD request and illustrates
`the layering of networking and security.
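
As a rough illustration of the scope of this interface, the stub below
groups the listed requests as operations on a drive. The names and
signatures are an illustrative paraphrase, not the actual interface
documented in [Gibson97b].

# Illustrative paraphrase of the NASD object-store request set.
class NASDDrive:
    # Object data and attributes (a capability accompanies every request).
    def read_object(self, cap, partition, obj, offset, length): ...
    def write_object(self, cap, partition, obj, offset, data): ...
    def get_attributes(self, cap, partition, obj): ...
    def set_attributes(self, cap, partition, obj, attrs): ...
    # Object lifetime.
    def create_object(self, cap, partition): ...
    def remove_object(self, cap, partition, obj): ...
    # Soft partitions: variable-sized groupings of objects, not disk regions.
    def create_partition(self, cap, partition): ...
    def resize_partition(self, cap, partition, new_size): ...
    def remove_partition(self, cap, partition): ...
    # Copy-on-write versioning and drive key management.
    def version_object(self, cap, partition, obj): ...
    def set_key(self, cap, level, new_key): ...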
`
Based loosely on the inode interface of a UNIX
`filesystem [McKusick84], our interface provides soft parti-
`tions, control objects, and per-object attributes for prealloca-
`tion, clustering, and capability revocation. Resizeable
`partitions allow capacity quotas to be managed by a drive
`administrator. Objects with well-known names and struc-
`tures allow configuration and bootstrap of drives and parti-
`tions. They also enable filesystems to find a fixed starting
`point for an object hierarchy and a complete list of allocated
`object names. Object attributes provide timestamps, size,
`and allow capacity to be reserved and objects to be linked
`for clustering [deJonge93]. A logical version number on the
`object may be changed by a filesystem to immediately
`revoke a capability (either temporarily or permanently).
`Finally, an uninterpreted block of attribute space is available
`to the file manager to record its own long-term, per-object
`state such as filesystem access control lists or mode bits.
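
A per-object attribute record along these lines might look roughly like
the sketch below; the field names are illustrative, and the
uninterpreted block is simply opaque bytes reserved for the file
manager.

# Illustrative per-object attribute record (field names are assumptions).
from dataclasses import dataclass

@dataclass
class ObjectAttributes:
    size: int = 0                  # object length in bytes
    create_time: float = 0.0       # timestamps maintained by the drive
    modify_time: float = 0.0
    reserved_capacity: int = 0     # preallocation / capacity reservation
    cluster_link: int = 0          # links related objects for clustering
    logical_version: int = 0       # changed by the file manager to revoke
                                   # previously issued capabilities
    fs_specific: bytes = b""       # uninterpreted space for file manager
                                   # state (e.g., access control lists)

def revoke_capabilities(attrs: ObjectAttributes) -> None:
    # Capabilities carrying the old version number now fail verification at
    # the drive, sending clients back to the file manager.
    attrs.logical_version += 1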
`
`NASD security is based on cryptographic capabilities which
`are documented in an earlier publication [Gobioff97]. Capa-
`bilities are protected by a small number of keys organized
`into a four-level hierarchy. The primary use of the keys is to
`manage the key hierarchy and construct capabilities for use
`by clients. Clients obtain capabilities from a file manager
`using a secure and private protocol external to NASD. A
`
`Network Header
`RPC Header
`Security Header
`Capability
`Request Args
`
`Nonce
`Request Digest
`Data
`Overall Digest
`
`Indicates key and security
`options to use when handling
`request
`
`Includes approved logical ver-
`sion number, access
`rights,
`expiration time and accessible
`object region
`
`Protects against replayed and
`delayed requests
`
`Figure 5: Packet diagram of the major components of a NASD
`request in our current prototype. The details of NASD objects,
`requests, and security are documented in separate papers
`[Gibson97b, Gobioff97].
`
`capability contains a public and a private portion. The pub-
`lic portion is a description of what rights are being granted
`for which object. The private portion is a cryptographic key
`generated via a keyed message digest [Bellare96] from the
`public portion and drive’s secret keys. A drive verifies a cli-
`ent’s right to perform an operation by confirming that the
`client holds the private portion of an appropriate capability.
`The client sends the public portion along with each request
`and generates a second digest of the request parameters
`keyed by the private field. Because the drive knows its keys,
`receives the public fields of a capability with each request,
`and knows the current version number of the object, it can
`compute the client’s private field (which the client cannot
`compute on its own because only the file manager has the
`appropriate drive secret keys). If any field has been
`changed, including the object version number, the access
`fails and the client is sent back to the file manager.
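
The sketch below illustrates this style of keyed-digest capability,
using HMAC-SHA1 as a stand-in for the keyed message digest [Bellare96];
the field encodings, function names, and the separate version argument
are simplifying assumptions rather than the prototype's wire format.

# Illustration of NASD-style capability generation and verification.
import hmac, hashlib

def make_private(drive_secret: bytes, public: bytes) -> bytes:
    # File manager: the private portion is a keyed digest of the public portion.
    return hmac.new(drive_secret, public, hashlib.sha1).digest()

def sign_request(private: bytes, public: bytes, nonce: bytes, args: bytes) -> bytes:
    # Client: digest the request parameters, keyed by the private portion.
    return hmac.new(private, public + nonce + args, hashlib.sha1).digest()

def drive_check(drive_secret: bytes, public: bytes, nonce: bytes, args: bytes,
                request_digest: bytes, approved_version: int,
                current_version: int) -> bool:
    # Drive: recompute the private portion from its own secret and the public
    # fields sent with the request, then verify the request digest. A changed
    # object version number (capability revocation) makes the check fail.
    if approved_version != current_version:
        return False
    private = hmac.new(drive_secret, public, hashlib.sha1).digest()
    expected = hmac.new(private, public + nonce + args, hashlib.sha1).digest()
    return hmac.compare_digest(expected, request_digest)

# Example round trip; in practice the public portion encodes the object,
# rights, expiration, accessible region, and approved logical version number.
secret = b"per-drive key shared only with the file manager"
public = b"obj=0x2a;rights=read;region=0-1M;version=7"
private = make_private(secret, public)                  # issued to the client
digest = sign_request(private, public, b"nonce-42", b"READ offset=0 len=8192")
assert drive_check(secret, public, b"nonce-42", b"READ offset=0 len=8192",
                   digest, approved_version=7, current_version=7)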
`
`These mechanisms ensure the integrity of requests in the
`presence of both attacks and simple accidents. Protecting
`the integrity and/or privacy of the data involves crypto-
`graphic operations on all the data which is potentially very
`expensive. Software implementations operating at disk rates
`are not available with the computational resources we
`expect on a disk, but schemes based on multiple DES func-
`tion blocks in hardware can be implemented in a few tens of
`thousands of gates and operate faster than disk data
`rates [Verbauwhede87, Knudsen96]. For the measurements
`reported in this paper, we disabled these security protocols
`because our prototype does not currently support such hard-
`ware.
`4.2 Prototype Implementation
`We have implemented a working prototype of the NASD
`drive software running as a kernel module in Digital UNIX.
Each NASD prototype drive runs on a DEC Alpha 3000/400
(133 MHz, 64 MB, Digital UNIX 3.2g) with two Seagate ST52160 Medallist
`disks attached by two 5 MB/s SCSI busses. While this is
`certainly a bulky “drive”, the performance of this five year
`old machine is similar to what we predict will be available
`in drive controllers soon. We use two physical drives man-
`aged by a software striping driver to approximate the
`10 MB/s rates we expect from more modern drives.
`
`Because our prototype code is intended to operate directly
`in a drive, our NASD object system implements its own
`internal object access, cache, and disk space management
`modules (a total of 16,000 lines of code) and interacts mini-
`mally with Digital UNIX. For communications, our proto-
`type uses DCE RPC 1.0.3 over UDP/IP. The implementation
`of these networking services is quite heavyweight. The
`appropriate protocol suite and implementation is currently
`an issue of active research [Anderson98b, Anderson98c,
VanMeter98].
`
`Figure 6 shows the disks’ baseline sequential access band-
`width as a function of request size, labeled raw read and
`write. This test measures the latency of each request.
`Because these drives have write-behind caching enabled, a
`
`6
`
`Oracle Ex. 1017, pg. 6
`
`
`
`100
`
`Percent CPU Idle
`
`90
`
`80
`
`70
`
`60
`
`50
`
`40
`
`30
`
`20
`
`10
`
`0
`
`w id t h
`
`d
`
`n
`
`a
`
`a t e b
`
`g
`
`g r e
`
`g
`
`A
`
`Average NASD CPU idle
`
`Average client idle
`
`100
`
`90
`
`80
`
`70
`
`60
`
`50
`
`40
`
`30
`
`20
`
`10
`
`Bandwidth (MB/s)
`
`0
`
`1
`
`2
`
`3
`
`4
`
`6
`5
`# of Clients
`
`7
`
`8
`
`9
`
`10
`
`Figure 7: Prototype NASD cache read bandwidth. Read
`bandwidth obtained by clients accessing a single large cached
`file striped over 13 NASD drives with a stripe unit of 512 KB.
`As shown by the client idle values, the limiting factor is the CPU
`power of the clients within this range.
`on-chip counters to measure the code paths of read and
`write operations. These measurements are reported in the
`Total Instructions columns of Table 1. For the one byte
`requests, our measurements with DCPI [Anderson97] also
`show that the prototype consumes 2.2 cycles per instruction
`(CPI). There are many reasons why using these numbers to
`predict drive performance is approximate. Our prototype
`uses an Alpha processor (which has different CPI properties
`than an embedded processor), our estimates neglect poorer
`CPI during copying (which would have hardware assist in a
`real drive), and our communications implementation is
`more expensive than we believe to be appropriate in a drive
`protocol stack. However, these numbers are still useful for
`broadly addressing the question of implementing NASD in
`a drive ASIC.
`
`Table 1 shows that a 200 MHz version of our prototype
`
`write’s actual completion time is not measured accurately,
`resulting in a write throughput (~7 MB/s) that appears to
`exceed the read throughput (~5 MB/s). To evaluate object
`access performance, we modified the prototype to serve
`NASD requests from a user-level process on the same
`machine (without the use of RPC) and compared that to the
`performance of the local filesystem (a variant of Berkeley’s
`FFS [McKusick84]). Figure 6 shows apparent throughput as
`a function of request size with NASD and FFS being
`roughly comparable. Aside from FFS’s strange write-behind
`implementation, the principle differences are that NASD is
`better tuned for disk access (~5 MB/s versus ~2.5 MB/s on
`reads that miss in the cache), while FFS is better tuned for
`cache accesses (fewer copies give it ~48 MB/s versus
`~40 MB/s on reads that hit in the memory cache).
`4.3 Scalabili