is increasing at an astonishing rate, doubling on the order of once every 18 months to two years. The newest generation of RISC processors has performance in the 50 to 60 MIPS range. For example, a recent workstation announced by the Hewlett-Packard Corporation, the HP 9000/730, has been rated at 72 SPECMarks (1 SPECMark is roughly the processing power of a single Digital Equipment Corporation VAX 11/780 on a particular benchmark set). Powerful shared memory multiprocessor systems, now available from companies such as Silicon Graphics and Solbourne, provide well over 100 MIPS performance. One of Amdahl's famous laws equated one MIPS of processing power with one megabit of I/O per second. Obviously such processing rates far exceed anything that can be delivered by existing server, network, or storage architectures.
Unlike processor power, network technology evolves at a slower rate, but when it advances, it does so in order of magnitude steps. In the last decade we have advanced from 3 Mbit/second Ethernet to 10 Mbit/second Ethernet. We are now on the verge of a new generation of network technology, based on fiber-optic interconnect, called FDDI. This technology promises 100 Mbits per second, and at least initially, it will move the server bottleneck from the network to the server CPU or its storage system. With more powerful processors available on the horizon, the performance challenge is very likely to be in the storage system, where a typical magnetic disk can service thirty 8-Kbyte I/O's per second and can sustain a data rate in the range of 1 to 3 Mbytes per second. Even faster networks, in the gigabit range, are now commercially available and will become more widespread as their costs begin to drop [1].
To keep up with the advances in processors and networks, storage systems are also experiencing rapid improvements. Magnetic disks have been doubling in storage capacity once every three years. As disk form factors shrink from 14 inch to 3.5 inch and below, the disks can be made to spin faster, thus increasing the sequential transfer rate. Unfortunately, the random I/O rate is improving only very slowly, owing to mechanically limited positioning delays. Since I/O and data rates are primarily disk actuator limited, a new storage system approach called disk arrays addresses this problem by replacing a small number of large-format disks by a very large number of small-format disks. Disk arrays maintain the high capacity of the storage system, while enormously increasing the number of the system's disk actuators and thus the aggregate I/O and data rate.
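The gap between processor demand and per-disk throughput can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the one-MIPS-per-megabit rule quoted earlier, the per-disk figures given in the text (30 I/O's per second, 1 to 3 Mbytes per second), and ideal striping with no controller or channel overhead; the function names are hypothetical.

```python
import math

def disks_needed(mips, mbytes_per_disk):
    """Disks required to feed a processor under the 1 MIPS : 1 Mbit/s rule.

    Assumes data is spread perfectly across all disks (an idealization).
    """
    required_mbytes = mips / 8.0  # 1 Mbit/s per MIPS, 8 bits per byte
    return math.ceil(required_mbytes / mbytes_per_disk)

def aggregate_io_rate(num_disks, ios_per_disk=30):
    """Aggregate random I/O rate if every actuator works independently."""
    return num_disks * ios_per_disk

# A 100 MIPS system needs 12.5 Mbytes/s of I/O under the rule.
print(disks_needed(100, 1.0))   # disks at 1 Mbyte/s each
print(disks_needed(100, 3.0))   # disks at 3 Mbytes/s each
print(aggregate_io_rate(13))    # I/O's per second across 13 actuators
```

Under these assumptions, a single disk falls well short of the demand, which is the motivation for multiplying the number of actuators.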
The confluence of developments in processors, networks, and storage offers the possibility of extending the client/server model so effectively used in workstation environments to higher performance environments, which integrate supercomputers, near supercomputers, workstations, and storage services on a very high performance network. The technology is rapidly reaching the point where it is possible to think in terms of diskless supercomputers in much the same way as we think about diskless workstations. The network is emerging as the future "backplane" of high-performance systems. The challenge is to develop the new hardware and software architectures that will be suitable for this world of network-based storage.
The emphasis of this paper is on the integration of storage and network services, and the challenges of managing the complex storage hierarchy of the future: file caches, on-line disk storage, near-line data libraries, and off-line archives. We specifically ignore existing mainframe I/O architectures, as these are well described elsewhere (for example, in [2]). The rest of this paper is organized as follows. In the next three sections, we will review the recent advances in interconnect, storage devices, and distributed software, to better understand the underlying changes in network, storage, and software technologies. Section V contains detailed case studies of commercially available high-performance networks, storage servers, and file servers, as well as a prototype high-performance network-attached I/O controller being developed at the University of California, Berkeley. Our summary, conclusions, and suggestions for future research are found in Section VI.
A. Networks, Channels, and Backplanes
Interconnect is a generic term for the "glue" that interfaces the components of a computer system. Interconnect consists of high-speed hardware interfaces and the associated logical protocols. The former consists of physical wires or control registers. The latter may be interpreted by either hardware or software. From the viewpoint of the storage system, interconnect can be classified as high speed networks, processor-to-storage channels, or system backplanes that provide ports to a memory system through direct memory access techniques.
Networks, channels, and backplanes differ in terms of the interconnection distances they can support, the bandwidth and latencies they can achieve, and the fundamental assumptions about the inherent unreliability of data transmission. While no statement we can make is universally true, in general, backplanes can be characterized by parallel wide data paths and centralized arbitration, and are oriented toward read/write "memory mapped" operations. That is, access to control registers is treated identically to memory word access. Networks, on the other hand, provide serial data, distributed arbitration, and support more message-oriented protocols. The latter require a more complex handshake, usually involving the exchange of high-level request and acknowledgment messages. Channels fall between the two extremes, consisting of wide data paths of medium distance and often incorporating simplified versions of networklike protocols.
These considerations are summarized in Table I. Networks typically span more than 1000 m, transmit at 10 Mbit/second (Ethernet) to 100 Mbit/second (FDDI) and beyond, experience latencies measured in several milliseconds (ms), and the network medium itself is considered to be inherently unreliable. Networks include extensive data integrity features within their protocols, including CRC checksums at the packet and message levels, and the explicit acknowledgment of received packets. Channels span small 10's of meters, transmit at anywhere from 4.5 Mbytes/second (IBM channel interfaces) to 100 Mbytes/second (HIPPI channels), incur latencies of under 100 µs per transfer, and have medium reliability. Byte parity at the individual transfer word is usually supported, although packet-level check-summing might also be supported. Backplanes are about 1 m in length, transfer from 40 (VME) to over 100 (FutureBus) Mbytes/second, incur sub-microsecond latencies, and the interconnect is considered to be highly reliable. Backplanes typically support byte parity, although some backplanes (unfortunately) dispense with parity altogether.

Table I Comparison of Network, Channel, and Backplane Attributes
                 Network        Channel        Backplane
Distance         >1000 m        10-100 m       1 m
Bandwidth        10-100 Mb/s    40-1000 Mb/s   …
Latency          high (>ms)     …              low (<µs)
Reliability      …              …              …
Data integrity   …              Byte parity    Byte parity
The comparison is based upon the interconnection distance, transmission bandwidth, transmission latency, inherent reliability, and typical techniques for improving data integrity.
In the remainder of this section, we will look at each of the three kinds of interconnect, network, channel, and backplane, in more detail.
B. Communications Networks and Network Controllers
An excellent overview of networking technology can be found in [3]. For a futuristic view, see [4] and [5]. The decade of the 1980's has seen a slow maturation of network technology, but the 1990's promise much more rapid developments. Today, 10 Mbit/second Ethernets are pervasive, with many environments advancing to the next generation of 100 Mbit/second networks based on the FDDI (Fiber Distributed Data Interface) standard [6]. FDDI provides higher bandwidth, longer distances, and reduced error rates, largely because of the introduction of fiber optics for data transmission. Unfortunately cost, especially for replacing the existing copper wire network with fiber, coupled with disappointing transmission latencies, has slowed the acceptance of these higher speed networks. The latency problems have more to do with FDDI's protocols, which are based on a token passing arbitration scheme, than anything intrinsic in fiber-optic technology.
A network system is decomposed into multiple protocol layers, from the application interface down to the method of physical communication of bits on the network. Figure 1 summarizes the popular seven-layer ISO protocol model. The physical and link levels are closely tied to the underlying transport medium, and deal with the physical interface to the network and the method of acquiring access to it. The network, transport, and session levels focus on the detailed formats of communications packets and the methods for transmitting them from one program to another. The presentation and applications layers define the formats of the data embedded within the packets and the application-specific semantics of that data.

Application    Detailed information about the data being exchanged
Presentation   Data representation
Session        Management of connections between programs
Transport      Delivery of packet sequences
Network        Format of individual packets
Link           Access to and control of transmission medium
Physical       Medium of transmission
Fig. 1. Seven-layer ISO protocol model. The physical layer describes the actual transmission medium, be it coax cable, fiber optics, or a parallel backplane. The link layer describes how stations gain access to the medium. This layer deals with the protocols for arbitration for and obtaining grant permission to the media. The network layer defines the format of data packets to be transmitted over the media, including destination and sender information as well as any check sums. The transport layer is responsible for the reliable delivery of packets. The session layer establishes communication between the sending program and the receiving program. The presentation layer determines the detailed formats of the data embedded within packets. The application layer has the responsibility of understanding how these data should be interpreted within an application's context.
A number of performance measurements of network transmission services point out that the significant overhead is not protocol interpretation (approximately 10% of instructions are spent in interpreting the network headers). The culprits are memory system overheads arising from data movement and operating system overheads related to context switches and data copying [7]-[10]. We will see this again and again in the sections to follow.
The network controller is the collection of hardware and firmware that implements the interface between the network and the host processor. It is typically implemented on a small printed circuit board, and contains its own processor, memory mapped control registers, interface to the network, and small memory to hold messages being transmitted and received. The on-board processor, usually in conjunction with VLSI components within the network interface, implements the physical and link-level protocols of the network.
The interaction between the network controller and the host's memory is depicted in Fig. 2. Lists of blocks containing packets to be sent and packets that have been received are maintained in the host processor's memory. The locations of buffers for these blocks are made known to the network controller, and it will copy packets to and from the request/receive block areas using direct memory access (DMA) techniques. This means that the copy of data across the peripheral bus is under the control of the network controller, and does not require the intervention of the host processor. The controller will interrupt the host whenever a message has been received or sent.


[Figure 2 labels: Network Controller; Peripheral Backplane Bus]
Fig. 2. Network controller/processor memory interaction. The figure describes the interaction between the network controller and the memory of the network node. The controller contains an on-board microprocessor, various memory-mapped control registers through which service requests can be made and status checked, a physical interface to the network media, and a buffer memory to hold request and receive blocks. These contain network messages to be transmitted or which have been received, respectively. A list of pending requests and messages already received resides in the host processor's memory. Direct memory operations (DMA's), under the control of the node processor, copy these blocks to and from this memory.
While this presents a particularly clean interface between the network controller and the operating system, it points out some of the intrinsic memory system latencies that reduce network performance. Consider a message that will be transmitted to the network. First the contents of the message are created within a user application. A call to the operating system results in a process switch and a data copy from the user's address space to the operating system's area. A protocol-specific network header is then appended to the data to form a packaged network message. This must be copied one more time, to place the message into a request block that can be accessed by the network controller. The final copy is the DMA operation that moves the message within the request block to memory within the network controller.
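The chain of copies described above can be traced with a small model. This is only an illustrative sketch: the buffer names, the `HDR:` header bytes, and the choice to place the header in front of the payload are assumptions made for the example, not details taken from any particular operating system or controller.

```python
def send_message(payload: bytes, header: bytes = b"HDR:"):
    """Model the data copies on the transmit path; returns (bytes at controller, copy log)."""
    copies = []

    # The message is created in the user application's address space.
    user_buf = bytearray(payload)

    # Copy 1: system call moves data from user space to the operating system's area.
    kernel_buf = bytes(user_buf)
    copies.append("user space -> operating system area")

    # The protocol header is attached to form a packaged network message.
    packet = header + kernel_buf

    # Copy 2: the packaged message is placed in a request block visible to the controller.
    request_block = bytearray(packet)
    copies.append("operating system area -> request block")

    # Copy 3: DMA moves the request block into the controller's own memory.
    controller_mem = bytes(request_block)
    copies.append("request block -> controller memory (DMA)")

    return controller_mem, copies

data, log = send_message(b"hello")
for step in log:
    print(step)
print(len(log), "copies for", len(data), "bytes on the wire")
```

Each payload byte crosses the memory system three times before it reaches the network, which is why data movement rather than header interpretation dominates the cost.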
Data integrity is the aspect of system reliability concerned with the transmission of correct data and the explicit flagging of incorrect data. An overriding consideration of network protocols is their concern with reliable transmission. Because of the distances involved and the complexity of the transmission path, network transmission is inherently lossy. The solution is to append check-sum protection bits to all network packets and to include explicit acknowledgment as part of the network protocols. For example, if the check sum computed at the receiving end does not match the transmitted check sum, the receiver sends a negative acknowledgment to the sender.
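The check-sum and acknowledgment scheme can be sketched in a few lines. The example below assumes a 32-bit CRC appended as a four-byte trailer and uses the strings `ACK` and `NAK` as stand-ins for the acknowledgment messages; real protocols differ in polynomial, framing, and message format.

```python
import zlib

def make_packet(payload: bytes) -> bytes:
    """Append a 32-bit CRC of the payload as a 4-byte big-endian trailer."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")

def receive(packet: bytes) -> str:
    """Recompute the CRC at the receiver and answer ACK or NAK."""
    payload, sent_crc = packet[:-4], int.from_bytes(packet[-4:], "big")
    return "ACK" if zlib.crc32(payload) == sent_crc else "NAK"

pkt = make_packet(b"block of data")
print(receive(pkt))                      # intact packet

# Flip one bit in the first byte to model a transmission error.
corrupted = bytes([pkt[0] ^ 0x01]) + pkt[1:]
print(receive(corrupted))                # damaged packet
```

On a negative acknowledgment the sender would retransmit the packet; that retry loop is omitted here.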
C. Channel Architectures
Channels provide the logical and physical pathways between I/O controllers and storage devices. They are medium-distance interconnect that carry signals in parallel, usually with some parity technique to provide data integrity. In this subsection, we will describe three alternative channel organizations that characterize the opposite ends of the performance spectrum: SCSI (small computer system interface), HIPPI (high-performance parallel interface), and FCS (fibre channel standard).
1) Small Computer System Interface: SCSI is the channel interface most frequently encountered in small form factor (5.25 in diameter and smaller) disk drives, as well as a wide variety of peripherals such as tape drives, optical disk readers, and image scanners. SCSI treats peripheral devices in a largely device-independent fashion. For example, a disk drive is viewed as a linear byte stream; its detailed structure in terms of sectors, tracks, and cylinders is not visible through the SCSI interface. A SCSI channel can support up to eight devices sharing a common bus with an 8-bit-wide data path. In SCSI terminology, the I/O controller counts as one of these devices, and is called the host bus adapter (HBA). Burst transfers at 4 to 5 Mbytes/s are widely available today. In SCSI terminology, a device that requests service from another device is called the master or the initiator. The device that is providing the service is called the slave or the target.
SCSI provides a high-level message-based protocol for communications between initiators and targets. While this makes it possible to mix widely different kinds of devices on the same channel, it does lead to relatively high overheads. The protocol has been designed to allow initiators to manage multiple simultaneous operations. Targets are intelligent in the sense that they explicitly notify the initiator when they are ready to transmit data or when they need to throttle a transfer.
It is worthwhile to examine the SCSI protocol in some detail, to clearly distinguish what it does from the kinds of messages exchanged on a computer network. The SCSI protocol proceeds in a series of phases, which we summarize below:
- Bus Free: No device currently has the bus allocated.
- Arbitration: Initiators arbitrate for access to the bus. A device's physical address determines its priority.
- Selection: The initiator informs the target that it will participate in an I/O operation.
- Reselection: The target informs the initiator that an outstanding operation is to be resumed. For example, an operation could have been previously suspended because the I/O device had to obtain more data.
- Command: Command bytes are written to the target by the initiator. The target begins executing the operation.
- Data Transfer: The protocol supports two forms of the data transfer phase, Data In and Data Out. The former refers to the movement of data from the target to the initiator. In the latter, data move from the initiator to the target.
- Message: The message phase also comes in two forms, Message In and Message Out. Message In consists of several alternatives. Identify identifies the reselected target. Save Data Pointer saves the place in the current data transfer if the target is about to disconnect. Restore Data Pointer restores this pointer. Disconnect notifies the initiator that the target is about to give up the data bus. Command Complete occurs when the target tells the initiator that the operation has completed. Message Out has just one form: Identify. This is used to identify the requesting initiator and its intended target.
- Status: Just before command completion, the target sends a status message to the initiator.

[Figure 3 labels: Command Setup; Message Out (Identify); Disconnect to fill buffer; Message In (Disconnect); Bus Free; Data Transfer; Data In; Message In (Save Data Ptr); Message In (Identify); Message In (Restore Data Ptr); Command Completion; Message In (Command Complete)]
Fig. 3. SCSI phase transitions on a read. The basic phase sequencing for a read (from disk) operation is shown. First the initiator sets up the read command and sends it to the I/O device. The target device disconnects from the SCSI bus to perform a seek and to begin to fill its internal buffer. It then transfers the data to the initiator. This may be interspersed with additional disconnects, as the transfer gets ahead of the internal buffering. A command complete message terminates the operation. This figure is adapted from [40].
To better understand the sequencing among the phases, see Fig. 3. This illustrates the phase transitions for a typical SCSI read operation. The sequencing of an I/O operation actually begins when the host's operating system establishes data and status blocks within its memory. Next, it issues an I/O command to the HBA, passing it pointers to command, status, and data blocks, as well as the SCSI address of the target device. These are staged from host memory to device-specific queues within the HBA's memory using direct memory access techniques.
Now the I/O operation can begin in earnest. The HBA arbitrates for and wins control of the SCSI bus. It then indicates the target device it wishes to communicate with during the selection phase. The target responds by identifying itself during a following message out phase. Now the actual command, such as "read a sequence of bytes," is transmitted to the device.
We assume that the target device is a disk. If the disk must first seek before it can obtain the requested data, it will disconnect from the bus. It sends a disconnect message to the initiator, which in turn gives up the bus. Note that the HBA can communicate with other devices on the SCSI channel, initiating additional I/O operations. Now the device will seek to the appropriate track and will begin to fill its internal buffer with data. At this point, it needs to reestablish communications with the HBA. The device now arbitrates for and wins control of the bus. It next enters the reselection phase, and identifies itself to the initiator to reestablish communications.
The data transfer phase can now begin. Data are transferred one byte at a time using a simple request/acknowledgment protocol between the target and the initiator. This continues until the need for a disconnect arises again, such as when the target's buffer is emptied, or perhaps the command has completed. If it is the first case, the data pointer must first be saved within the HBA, so we can restart the transfer at a later time. Once the data transfer pointer has been saved, the target sequences through a disconnect, as described above. When the disk is once again ready to transfer, it rearbitrates for the bus and identifies the initiator with which to reconnect. This is followed by a restore data pointer message to reestablish the current position within the data transfer. The data transfer phase can now continue where it left off.
The command completion phase is entered once the data transfer is finished. The target device sends a status message to the initiator, describing any errors that may have been encountered during the operation. The final command completion message completes the I/O operation.
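The read sequence walked through above can be written out as an ordered list of bus phases. The sketch below is a simplified model that assumes one initial disconnect for the seek and a configurable number of further disconnects during the transfer; the function name and the phase label strings are chosen for illustration and are not drawn from the SCSI specification text.

```python
def scsi_read_phases(mid_transfer_disconnects=1):
    """Return the ordered bus phases for a disk read, per the walk-through above."""
    phases = [
        "Bus Free", "Arbitration", "Selection",
        "Message Out (Identify)", "Command",
        # The disk must seek, so it releases the bus.
        "Message In (Disconnect)", "Bus Free",
        # Buffer is filling; the target reconnects.
        "Arbitration", "Reselection", "Message In (Identify)",
        "Data In",
    ]
    for _ in range(mid_transfer_disconnects):
        phases += [
            # Transfer outran the buffer: save position and release the bus.
            "Message In (Save Data Pointer)", "Message In (Disconnect)", "Bus Free",
            # Target is ready again: reconnect and resume where it left off.
            "Arbitration", "Reselection", "Message In (Identify)",
            "Message In (Restore Data Pointer)", "Data In",
        ]
    phases += ["Status", "Message In (Command Complete)", "Bus Free"]
    return phases

for step, phase in enumerate(scsi_read_phases(1), start=1):
    print(f"{step:2d}. {phase}")
```

Even a single read with one mid-transfer disconnect passes through more than twenty phases, which illustrates the protocol overhead noted earlier.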
The SCSI protocol specification is currently undergoing a major revision for higher performance. In the so-called SCSI-1, the basic clock rate on the channel is 10 MHz. In the new SCSI-2, "fast SCSI" increases the clock rate to 20 MHz, doubling the channel's bandwidth from 5 Mbyte/s to 10 Mbyte/s. Recently announced high-performance disk drives support fast SCSI. The revised specification also supports an alternative method of doubling the channel bandwidth, called wide SCSI. This provides a 16-bit data path on the channel rather than SCSI-1's 8-bit width. By combining wide and fast SCSI-2, the channel bandwidth quadruples to 20 Mbyte/s. Some manufacturers of high-performance disk controllers have begun to use SCSI-2 to interface their controllers to a computer host.
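The bandwidth figures quoted above follow from two independent doublings. A minimal sketch, taking the 5 Mbyte/s SCSI-1 figure from the text as the baseline and using a hypothetical function name:

```python
def scsi_bandwidth_mbytes(fast=False, wide=False, base_mbytes=5):
    """Peak channel bandwidth in Mbyte/s.

    fast: doubled transfer rate; wide: 16-bit instead of 8-bit data path.
    """
    rate_factor = 2 if fast else 1
    width_factor = 2 if wide else 1
    return base_mbytes * rate_factor * width_factor

print("SCSI-1:       ", scsi_bandwidth_mbytes())
print("fast:         ", scsi_bandwidth_mbytes(fast=True))
print("wide:         ", scsi_bandwidth_mbytes(wide=True))
print("fast and wide:", scsi_bandwidth_mbytes(fast=True, wide=True))
```

These are peak burst rates; the phase overheads described earlier reduce what a device actually sustains.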
2) High-Performance Parallel Interface: The high performance parallel interface, HIPPI, was originally developed at the Los Alamos National Laboratory in the mid 1980's as a high-speed unidirectional (simplex) point-to-point interface between supercomputers [11]. Thus, two-way communications requires two HIPPI channels, one for commands and write data (the write channel) and one for status and read data (the read channel). Data are transmitted at a nominal rate of 800 Mbit/s (32-bit-wide data path) or 1600 Mbit/s (64-bit-wide data path) in each direction.
The physical interface of the HIPPI channel was standardized in the late 1980's. Its data transfer protocol was designed to be extremely simple and fast. The source of the transfer must first assert a request signal to gain access to the channel. A connection signal grants the channel to the source. However, the source cannot send until the destination asserts ready. This provides a simple flow control mechanism.
The minimum unit of data transfer is the burst. A burst consists of 1 to 256 words (the width is determined by the physical width of the channel; for a 32-bit channel, a burst is 1024 bytes), sent as a continuous stream of words, one per clock period. A burst is in progress as long as the channel's burst signal is asserted. When the burst signal goes unasserted, a CRC (cyclic redundancy check) word computed over the transmitted data words is sent down the channel. Because of the way the protocol is defined, when the destination asserts ready, it means that it must be able to accept a complete burst.
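The burst size and the time a full burst occupies the channel follow directly from the nominal figures given above. The sketch below assumes a full 256-word burst and ignores the trailing CRC word and any handshake time; the word rate is derived from the nominal bit rate and path width rather than quoted from the standard, and the function names are hypothetical.

```python
def burst_bytes(width_bits, words=256):
    """Size of a burst in bytes for a given data path width."""
    return words * width_bits // 8

def words_per_second(rate_mbit, width_bits):
    """Word rate implied by the nominal bit rate and the path width."""
    return rate_mbit * 1_000_000 // width_bits

def burst_time_us(rate_mbit, width_bits, words=256):
    """Microseconds to send one burst, one word per clock period."""
    return words / words_per_second(rate_mbit, width_bits) * 1e6

for rate, width in ((800, 32), (1600, 64)):
    print(width, "bit path:",
          burst_bytes(width), "bytes per burst,",
          words_per_second(rate, width), "words/s,",
          round(burst_time_us(rate, width), 2), "us per burst")
```

Both widths imply the same word rate, so the wider path doubles the burst size without lengthening the burst in time.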
Unfortunately, the upper level protocol (ULP) for performing operations over the channel is still under discussion within the standardization committees. To illustrate the concepts involved in using HIPPI as an interface to storage devices, we restrict our description to the proposal to layer the IPI-3 device generic command set on top of HIPPI, put forward by Maximum Strategies and IBM Corporation [12].
A logical unit of data, sent from a source to a destination, is called a packet. A packet is a sequence of bursts. A special channel signal delineates the start of a new packet. Packets consist of a header, a ULP (upper layer protocol) data set, and fill. The ULP data consist of a command/response field and read/write data field.
Packets fall into three types: command, response, or data-only. A command packet can contain a header burst with an IPI-3 device command, such as read or write, followed by multiple data bursts if the command is a write. A response packet is similar: it contains an IPI-3 response within a header burst, followed by data bursts if the response is a read transfer notification. Data-only packets contain header bursts without command or response fields.
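The three packet types can be distinguished by which header field is present. The following is a loose illustration only: the class name, field names, and string values are hypothetical and do not reflect the actual IPI-3 or HIPPI field layouts.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Packet:
    """Hypothetical model of a packet: one header burst plus optional data bursts."""
    command: Optional[str] = None      # e.g. "read" or "write" (command packets)
    response: Optional[str] = None     # e.g. "transfer notification" (response packets)
    data_bursts: List[bytes] = field(default_factory=list)

    def kind(self) -> str:
        """Classify the packet by which header field is filled in."""
        if self.command is not None:
            return "command"
        if self.response is not None:
            return "response"
        return "data-only"

write_cmd = Packet(command="write", data_bursts=[b"\x00" * 1024])
read_cmd = Packet(command="read")                       # no data follows a read command
read_resp = Packet(response="transfer notification", data_bursts=[b"\x01" * 1024])
more_data = Packet(data_bursts=[b"\x02" * 1024])

for p in (write_cmd, read_cmd, read_resp, more_data):
    print(p.kind(), len(p.data_bursts))
```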
Consider a read operation over a HIPPI channel using the IPI-3 protocol. On the write channel, the slave peripheral device receives a header burst containing a valid read command from the master host processor. This causes the slave to initiate its read operation. When data are available, the slave must gain access to the read channel. When the master is ready to receive, the slave will transmit a response packet. If the response packet contains a transfer notification status, this indicates that the slave is ready to transmit a stream of data. The master will pulse a ready signal to receive subsequent data bursts.
The original HIPPI specification limits the interconnect distance to 25 m over twisted pair copper cable. A bit serial version of HIPPI has recently been proposed [13] which will make it possible to support gigabit/s data transfers over a distance of up to 30 km using optical fiber. McFarland et al. [14] describe a VLSI chip set developed by the Hewlett-Packard Corporation to support serial HIPPI.
3) Fibre Channel Standard: The Fibre Channel Standard is a rapidly emerging specification for high-performance bit-serial point-to-point communications over optical fiber [15]. Much like a HIPPI channel, it has been designed to support high-speed computer or storage device to computer communications, albeit over a bit-serial connection. But unlike HIPPI, FCS purposefully blurs the distinction between networks and channels. FCS has been designed as a multilayer series of protocols to make it suitable as the basis of a high-speed network. A crucial aspect of the standard is its definition of the concept of a switching "fabric," a network formed from switched high-speed links that can be used to communicate between user nodes.
FCS supports three levels of network service: dedicated connections, multiplexed connections, and datagrams. A dedicated connection guarantees sequential delive

