`
RAID-II: A High-Bandwidth Network File Server

Ann L. Drapeau, Ken W. Shirriff, John H. Hartman, Ethan L. Miller, Srinivasan Seshan,
Randy H. Katz, Ken Lutz, David A. Patterson¹, Edward K. Lee², Peter M. Chen³, Garth A. Gibson⁴
`
Abstract

In 1989, the RAID (Redundant Arrays of Inexpensive Disks) group at U.C. Berkeley built a prototype disk array called RAID-I. The bandwidth delivered to clients by RAID-I was severely limited by the memory system bandwidth of the disk array's host workstation. We designed our second prototype, RAID-II, to deliver more of the disk array bandwidth to file server clients. A custom-built crossbar memory system called the XBUS board connects the disks directly to the high-speed network, allowing data for large requests to bypass the server workstation. RAID-II runs Log-Structured File System (LFS) software to optimize performance for bandwidth-intensive applications.

The RAID-II hardware with a single XBUS controller board delivers 20 megabytes/second for large, random read operations and up to 31 megabytes/second for sequential read operations. A preliminary implementation of LFS on RAID-II delivers 21 megabytes/second on large read requests and 15 megabytes/second on large write operations.
`
`1
`
`Introduction
It is essential for future file servers to provide high-bandwidth I/O because of a trend toward bandwidth-intensive applications like multi-media, CAD, object-oriented databases and scientific visualization. Even in well-established application areas such as scientific computing, the size of data sets is growing rapidly due to reductions in the cost of secondary storage and the introduction of faster supercomputers. These developments require faster I/O systems to transfer the increasing volume of data.
High performance file servers will increasingly incorporate disk arrays to provide greater disk bandwidth. RAIDs, or Redundant Arrays of Inexpensive Disks [13], [5], use a collection of relatively small, inexpensive disks to achieve high performance and high reliability in a secondary storage system. RAIDs provide greater disk bandwidth to a file by striping or interleaving the data from a single file across a group of disk drives, allowing multiple disk transfers to occur in parallel. RAIDs ensure reliability by calculating error correcting codes across a group of disks; this redundancy information can be used to reconstruct the data on disks that fail.
`
¹ Computer Science Division, University of California, Berkeley, CA 94720
² DEC Systems Research Center, Palo Alto, CA 94301
³ Computer Science and Engineering Division, University of Michigan, Ann Arbor, MI 48109-2122
⁴ School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3891
`
In this paper, we examine the efficient delivery of bandwidth from a file server that includes a disk array.
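As background for the prototypes discussed below, the redundancy computation in a block-interleaved parity array reduces to a bitwise XOR across the blocks of a stripe, and a failed disk's block is rebuilt by XORing the surviving blocks. The C sketch below only illustrates that idea; the block size, function names and calling conventions are our assumptions, not details of RAID-I or RAID-II.

    #include <stddef.h>
    #include <string.h>

    #define BLOCK_SIZE 4096               /* assumed block size in bytes */

    /* Parity block of a stripe: the bitwise XOR of all data blocks. */
    void compute_parity(unsigned char *data[], int ndisks,
                        unsigned char parity[BLOCK_SIZE])
    {
        memset(parity, 0, BLOCK_SIZE);
        for (int d = 0; d < ndisks; d++)
            for (size_t i = 0; i < BLOCK_SIZE; i++)
                parity[i] ^= data[d][i];
    }

    /* A failed disk's block is recovered by XORing the surviving blocks
       of the same stripe, the parity block included. */
    void reconstruct_block(unsigned char *surviving[], int nsurviving,
                           unsigned char rebuilt[BLOCK_SIZE])
    {
        compute_parity(surviving, nsurviving, rebuilt);
    }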
In 1989, the RAID group at U.C. Berkeley built an initial RAID prototype, called RAID-I [2]. The prototype was constructed using a Sun 4/280 workstation with 128 megabytes of memory, four dual-string SCSI controllers, 28 5.25-inch SCSI disks and specialized disk striping software. We were interested in the performance of a disk array constructed of commercially-available components on two workloads: small, random operations typical of file servers in a workstation environment and large transfers typical of bandwidth-intensive applications. We anticipated that there would be both hardware and software bottlenecks that would restrict the bandwidth achievable by RAID-I. The development of RAID-I was motivated by our desire to gain experience with disk array software, disk controllers and the SCSI protocol.
Experiments with RAID-I show that it performs well when processing small, random I/Os, achieving approximately 275 four-kilobyte random I/Os per second. However, RAID-I proved woefully inadequate at providing high-bandwidth I/O, sustaining at best 2.3 megabytes/second to a user-level application on RAID-I. By comparison, a single disk on RAID-I can sustain 1.3 megabytes/second. The bandwidth of nearly 26 of the 28 disks in the array is effectively wasted because it cannot be delivered to clients.
There are several reasons why RAID-I is ill-suited for high-bandwidth I/O. The most serious is the memory contention experienced on the Sun 4/280 workstation during I/O operations. The copy operations that move data between kernel DMA buffers and buffers in user space saturate the memory system when I/O bandwidth reaches 2.3 megabytes/second. Second, because all I/O on the Sun 4/280 goes through the CPU's virtually addressed cache, data transfers experience interference from cache flushes. Finally, disregarding the workstation's memory bandwidth limitation, high-bandwidth performance is also restricted by the low backplane bandwidth of the Sun 4/280's VME system bus, which becomes saturated at 9 megabytes/second.
The problems RAID-I experienced are typical of many workstations that are designed to exploit large, fast caches for good processor performance, but fail to support adequate primary memory or I/O bandwidth [8], [12].
`
`1063-6897/94
`
`$03.0001994
`
`IEEE
`
`234
`
`Oracle Ex. 1018, pg. 1
`
`
`
In such workstations, the memory system is designed so that only the CPU has a fast, high-bandwidth path to memory. For busses or backplanes farther away from the CPU, available memory bandwidth drops quickly. Thus, file servers incorporating such workstations perform poorly on bandwidth-intensive applications.
`
The design of hardware and software for our second prototype, RAID-II [1], was motivated by a desire to deliver more of the disk array's bandwidth to the file server's clients. Toward this end, the RAID-II hardware contains two data paths: a high-bandwidth path that handles large transfers efficiently using a custom-built crossbar interconnect between the disk system and the high-bandwidth network, and a low-bandwidth path used for control operations and small data transfers. The software for RAID-II was also designed to deliver disk array bandwidth. RAID-II runs LFS, the Log-Structured File System [14], developed by the Sprite operating systems group at Berkeley. LFS treats the disk array as a log, combining small accesses together and writing large blocks of data sequentially to avoid inefficient small write operations. A single XBUS board with 24 attached disks and no file system delivers up to 20 megabytes/second for random read operations and 31 megabytes/second for sequential read operations. With LFS software, the board delivers 21 megabytes/second on sequential read operations and 15 megabytes/second on sequential write operations. Not all of this data is currently delivered to clients because of client and network limitations described in Section 3.4.
This paper describes the design, implementation and performance of the RAID-II prototype. Section 2 describes the RAID-II hardware, including the design choices made to deliver bandwidth, architecture and implementation details, and hardware performance measurements. Section 3 discusses the implementation and performance of LFS running on RAID-II. Section 4 compares other high performance I/O systems to RAID-II, and Section 5 discusses future directions. Finally, we summarize the contributions of the RAID-II prototype.
`
`2
`
`RAID-II
`
`Hardware
`
Figure 1 illustrates the RAID-II storage architecture. The RAID-II storage server prototype spans three racks, with the two outer racks containing disk drives and the center rack containing custom-designed crossbar controller boards and commercial disk controller boards. The center rack also contains a workstation, called the host, which is shown in the figure as logically separate. Each XBUS controller board has a HIPPI connection to a high-speed Ultra Network Technologies ring network that may also connect supercomputers and client workstations. The host workstation has an Ethernet interface that allows transfers between the disk array and clients connected to the Ethernet network.
`
In this section, we discuss the prototype hardware. First, we describe the design decisions that were made to deliver disk array bandwidth. Next, we describe the architecture and implementation, followed by microbenchmark measurements of hardware performance.
`
`2.1
`2.1.1
`
`Disk Array Bandwidth
`Delivering
`High and Low Bandwidth
`Data Paths
`
Disk array bandwidth on our first prototype was limited by low memory system bandwidth on the host workstation. RAID-II was designed to avoid similar performance limitations. It uses a custom-built memory system called the XBUS controller to create a high-bandwidth data path. This data path allows RAID-II to transfer data directly between the disk array and the high-speed, 100 megabytes/second HIPPI network without sending data through the host workstation memory, as occurs in traditional disk array storage servers like RAID-I. Rather, data are temporarily stored in memory on the XBUS board and transferred using a high-bandwidth crossbar interconnect between disks, the HIPPI network, the parity computation engine and memory.
There is also a low-bandwidth data path in RAID-II that is similar to the data path in RAID-I. This path transfers data across the workstation's backplane between the XBUS board and host memory. The data sent along this path include file metadata (data associated with a file other than its contents) and small data transfers. Metadata are needed by file system software for file management, name translation and for maintaining consistency between file caches on the host workstation and the XBUS board. The low-bandwidth data path is also used for servicing data requests from clients on the 10 megabits/second Ethernet network attached to the host workstation.
We refer to the two access modes on RAID-II as the high-bandwidth mode for accesses that use the high-performance data path and standard mode for requests serviced by the low-bandwidth data path. Any client request can be serviced using either access mode, but we maximize utilization and performance of the high-bandwidth data path if smaller requests use the Ethernet network and larger requests use the HIPPI network.
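A minimal sketch of this mode selection follows. The paper does not specify how or where requests are steered between the two paths, so the function, its name and the size threshold below are purely illustrative assumptions.

    #include <stddef.h>

    enum access_mode { STANDARD_MODE, HIGH_BANDWIDTH_MODE };

    /* Hypothetical cutoff; the paper does not give a specific threshold. */
    #define SMALL_REQUEST_CUTOFF (64 * 1024)

    /* Route small requests over the Ethernet/VME path (standard mode) and
       large requests over the HIPPI path (high-bandwidth mode). */
    enum access_mode choose_access_mode(size_t request_bytes)
    {
        return request_bytes < SMALL_REQUEST_CUTOFF ? STANDARD_MODE
                                                    : HIGH_BANDWIDTH_MODE;
    }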
`
`2.1.2
`
`Scaling to Provide Greater Bandwidth
`
The bandwidth of the RAID-II storage server can be scaled by adding XBUS controller boards to a host workstation. To some extent, adding XBUS boards is like adding disks to a conventional file server. An important distinction is that adding an XBUS board to RAID-II increases the I/O bandwidth available to the network, whereas adding a disk to a conventional file server only increases the I/O bandwidth available to that particular file server. The latter is less effective, since the file server's memory system and backplane will soon saturate if it is a typical workstation.

Eventually, adding XBUS controllers to a host workstation will saturate the host's CPU, since the host manages all disk and network transfers. However, since the high-bandwidth data path of RAID-II is independent of the host workstation, we can use a more powerful host to continue scaling storage server bandwidth.
2.2 Architecture and Implementation of RAID-II
Figure 2 illustrates the architecture of the RAID-II file server. The file server's backplane consists of two high-bandwidth (HIPPI) data busses and a low-latency (VME) control bus. The backplane connects the high-bandwidth network interfaces, several XBUS controllers and a host workstation that controls the operation of the XBUS controller boards.
`
`
`
`
Figure 1: The RAID-II File Server and its Clients. RAID-II, shown by dotted lines, is composed of three racks. The host workstation is connected to one or more XBUS controller boards (contained in the center rack of RAID-II) over the VME backplane. Each XBUS controller has a HIPPI network connection that connects to the Ultranet high-speed ring network; the Ultranet also provides connections to client workstations and supercomputers. The XBUS boards also have connections to SCSI disk controller boards. In addition, the host workstation has an Ethernet network interface that connects it to client workstations.
`
Each XBUS board contains interfaces to the HIPPI backplane and VME control bus, memory, a parity computation engine and four VME interfaces that connect to disk controller boards.
`
Figure 3 illustrates the physical packaging of the RAID-II file server, which spans three racks. To minimize the design effort, we used commercially available components whenever possible. Thinking Machines Corporation (TMC) provided a board set for the HIPPI interface to the Ultranet ring network; Interphase Corporation provided VME-based, dual-SCSI, Cougar disk controllers; Sun Microsystems provided the Sun 4/280 file server; and IBM donated disk drives and DRAM. The center rack is composed of three chassis. On top is a VME chassis containing eight Interphase Cougar disk controllers. The center chassis was provided by TMC and contains the HIPPI interfaces and our custom XBUS controller boards. The bottom VME chassis contains the Sun 4/280 host workstation. Two outer racks each contain 72 3.5-inch, 320 megabyte IBM SCSI disks and their power supplies. Each of these racks contains eight shelves of nine disks. RAID-II has a total capacity of 46 gigabytes, although the performance results in the next section are for a single XBUS board controlling 24 disks. We will achieve full array connectivity by adding XBUS boards and increasing the number of disks per SCSI string.
`
Figure 4 is a block diagram of the XBUS disk array controller board, which implements a 4x8, 32-bit wide crossbar interconnect called the XBUS. The crossbar connects four memory modules to eight system components called ports. The XBUS ports include two interfaces to HIPPI network boards, four VME interfaces to Interphase Cougar disk controller boards, a parity computation engine, and a VME interface to the host workstation. Each XBUS transfer involves one of the four memory modules and one of the eight XBUS ports. Each port was intended to support 40 megabytes/second of data transfer for a total of 160 megabytes/second of sustainable XBUS bandwidth.
`
The XBUS is an asynchronous, multiplexed (address/data), crossbar-based interconnect that uses a centralized, priority-based arbitration scheme. Each of the eight 32-bit XBUS ports operates at a cycle time of 80 nanoseconds. The memory is interleaved in sixteen-word blocks. The XBUS supports only two types of bus transactions: reads and writes. Our implementation of the crossbar interconnect was fairly expensive, using 192 16-bit transceivers. Using surface mount packaging, the implementation required 120 square inches or approximately 20% of the XBUS controller's board area. An advantage of the implementation is that the 32-bit XBUS ports are relatively inexpensive.
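The sketch below illustrates the sixteen-word interleaving across the four memory modules described above. It is our own illustration, not taken from the XBUS design; the function names and the assumption of a flat 32-bit word address are ours.

    #include <stdint.h>

    #define WORDS_PER_BLOCK 16   /* memory interleaved in sixteen-word blocks */
    #define NUM_MODULES      4   /* four memory modules behind the crossbar   */

    /* Memory module that serves a given 32-bit word address: consecutive
       sixteen-word blocks rotate across the four modules. */
    unsigned module_for_word(uint32_t word_addr)
    {
        return (word_addr / WORDS_PER_BLOCK) % NUM_MODULES;
    }

    /* Offset of that word within its module. */
    uint32_t offset_in_module(uint32_t word_addr)
    {
        uint32_t block = word_addr / WORDS_PER_BLOCK;
        return (block / NUM_MODULES) * WORDS_PER_BLOCK
               + (word_addr % WORDS_PER_BLOCK);
    }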
`
`236
`
`Oracle Ex. 1018, pg. 3
`
`
`
Figure 2: Architecture of the RAID-II File Server. The host workstation may have several XBUS controller boards attached to its VME backplane. Each XBUS controller board contains interfaces to HIPPI network source and destination boards, buffer memory, a high-bandwidth crossbar, a parity engine, and interfaces to four SCSI disk controller boards.
`
Two XBUS ports implement an interface between the XBUS and the TMC HIPPI boards. Each port is unidirectional, designed to sustain 40 megabytes/second of data transfer and bursts of 100 megabytes/second into 32 kilobyte FIFO interfaces.

Four of the XBUS ports are used to connect the XBUS board to four VME busses, each of which may connect to one or two dual-string Interphase Cougar disk controllers. In our current configuration, we connect three disks to each SCSI string, two strings to each Cougar controller, and one Cougar controller to each XBUS VME interface for a total of 24 disks per XBUS board. The Cougar disk controllers can transfer data at 8 megabytes/second, for a maximum disk bandwidth to a single XBUS board of 32 megabytes/second in our present configuration.

Of the remaining two XBUS ports, one interfaces to a parity computation engine. The last port is the VME control interface linking the XBUS board to the host workstation. It provides the host with access to the XBUS board's memory as well as its control registers. This makes it possible for file server software running on the host to access network headers, file data and metadata in the XBUS memory.
`
`2.3
`
`Performance
`Hardware
`the
`of
`raw performance
`In this section, we present
`RAID-II
`[1] hardware,
`that is, the performance of the hard-
`ware without
`the overhead of a file system.
`In Section 3.4,
`we show how much of
`this raw hardware performance
`can
`be delivered by the tile system to clients.
Figure 5 shows RAID-II performance for random reads and writes. We refer to these performance measurements as hardware system level experiments, since they involve all the components of the system from the disks to the HIPPI network. For these experiments, the disk system is configured as a RAID Level 5 [13] with one parity group of 24 disks.
`
`237
`
`Oracle Ex. 1018, pg. 4
`
`
`
`Figure 4: Structure of XBUS controller board. The board contains four memory modules connected by a 4x8 crossbar
`interconnect
`to eight XBUS ports. Two of these XBUS ports are interfaces to the HIPPI network. Another
`is a parity
`computation
`engine. One is a VME network interface for sending control
`information, and the remaining four are
`VME interfaces connecting to commercial SCSI disk controller boards.
`
(This scheme delivers high bandwidth but exposes the array to data loss during dependent failure modes such as a SCSI controller failure. Techniques for maximizing reliability are beyond the scope of this paper [4], [16], [6].) For reads, data are read from the disk array into the memory on the XBUS board; from there, data are sent over HIPPI, back to the XBUS board, and into XBUS memory. For writes, data originate in XBUS memory, are sent over the HIPPI and then back to the XBUS board to XBUS memory; parity is computed, and then both data and parity are written to the disk array. For both tests, the system is configured with four Interphase Cougar disk controllers, each with two strings of three disks. For both reads and writes, subsequent fixed size operations are at random locations. Figure 5 shows that, for large requests, hardware system level read and write performance reaches about 20 megabytes/second. The dip in read performance for requests of size 768 kilobytes occurs because at that size the striping scheme in this experiment involves a second string on one of the controllers; there is some contention on the controller that results in lower performance when both strings are used. Writes are slower than reads due to the increased disk and memory activity associated with computing and writing parity. While an order of magnitude faster than our previous prototype, RAID-I, this is still well below our target bandwidth of 40 megabytes/second. Below, we show that system performance is limited by that of the commercial disk controller boards and our disk interfaces.
Table 1 shows peak performance of the system when sequential read and write operations are performed.
`
Figure 3: The physical packaging of the RAID-II File Server. Two outer racks contain 144 disks and their power supplies. The center rack contains three chassis: the top chassis holds VME disk controller boards; the center chassis contains XBUS controller boards and HIPPI interface boards, and the bottom VME chassis contains the Sun 4/280 workstation.
`
`238
`
`Oracle Ex. 1018, pg. 5
`
`
`
`25 1
`
`ld-k
`1/0 I&e
`(voslw)
`
`System
`
`15 disks
`I/O Rate
`(1/Os/see)
`14
`422
`
Table 2: Peak I/O rates in operations per second for random, 4 kilobyte reads. Performance for a single disk and for fifteen disks, with a separate process issuing random I/O operations to each disk. The IBM 0661 disk drives used in RAID-II can perform more I/Os per second than the Seagate Wren IV disks used in RAID-I because they have shorter seek and rotation times. Superior I/O rates on RAID-II are also the result of the architecture, which does not require data to be transferred through the host workstation.
`
Figure 5: Hardware System Level Read and Write Performance. RAID-II achieves approximately 20 megabytes/second for both random reads and writes. (Horizontal axis: Request Size (KB).)
`
Table 1: Peak read and write performance for an XBUS board on sequential operations. This experiment was run using the four XBUS VME interfaces to disk controller boards, with a fifth disk controller attached to the XBUS VME control bus interface.
`
These measurements were obtained using the four Cougar boards attached to the XBUS VME interfaces, and in addition, using a fifth Cougar board attached to the XBUS VME control bus interface. For requests of size 1.6 megabytes, read performance is 31 megabytes/second, compared to 23 megabytes/second for writes. Write performance is worse than read performance because of the extra accesses required during writes to calculate and write parity and because sequential reads benefit from the read-ahead performed into track buffers on the disks; writes have no such advantage on these disks.
While one of the main goals in the design of RAID-II was to provide better performance than our first prototype on high-bandwidth operations, we also want the file server to perform well on small, random operations. Table 2 compares the I/O rates achieved on our two disk array prototypes, RAID-I and RAID-II, using a test program that performed random 4 kilobyte reads. In each case, fifteen disks were accessed. The tests used Sprite operating system read and write system calls; the RAID-II test did not use LFS. In each case, a separate process issued 4 kilobyte, randomly distributed I/O requests to each active disk in the system. The performance with a single disk indicates the difference between the Seagate Wren IV disks used in RAID-I and the IBM 0661 disks used in RAID-II. The IBM disk drives have faster rotation and seek times, making random operations quicker.
`
Figure 6: HIPPI Loopback Performance. RAID-II achieves 38.5 megabytes/second in each direction. (Horizontal axis: Request Size (KB).)
`
With fifteen disks active, both systems perform well, with RAID-II achieving over 400 small accesses per second. In both prototypes, the small, random I/O rates were limited by the large number of context switches required on the Sun 4/280 workstation to handle request completions. RAID-I's performance was also limited by the memory bandwidth on the Sun 4/280, since all data went into host memory. Since data need not be transferred to the host workstation in RAID-II, it delivers a higher percentage (78%) of the potential I/O rate from its fifteen disks than does RAID-I (67%).
The remaining performance measurements in this section are for specific components of the XBUS board. Figure 6 shows the performance of the HIPPI network and boards. Data are transferred from the XBUS memory to the HIPPI source board, and then to the HIPPI destination board and back to XBUS memory. Because the network is configured as a loop, there is minimal network protocol overhead. This test focuses on the HIPPI interfaces' raw hardware performance; it does not include measurements for sending data over the Ultranet. In the loopback mode, the overhead of sending a HIPPI packet is about 1.1 milliseconds, mostly due to setting up the HIPPI and XBUS control registers across the slow VME link.
`
`239
`
`Oracle Ex. 1018, pg. 6
`
`
`
`3.1
`
`Next, we illustrate how LFS on RAID-II
`implementation.
`read requests from a client. Finally, we dis-
`handles typical
`cuss the measured performance
`and implementation
`status
`of LFS on RAID-II.
`File System
`The Log-Structured
`tile
`LFS [14]
`is a disk storage manager
`that writes all
`files
`data and metadata to a sequential append-only
`log;
`the
`are not assigned fixed blocks on disk. LFS minimizes
`number of small write operations
`to dksk data for small
`writes are buffered in main memory and are written to disk
`asynchronously when a log segment
`fills. This is in contrast
`to traditional
`file systems, which assign files to fixed blocks
`on dkk.
`In traditional
`file systems, a sequence of random
`file writes results in inefficient
`small, random disk accesses.
`Beeause it avoids
`small write
`operations,
`the Log-
`Structured File System avoids a weakness of
`redundant
`disk arrays that affeets traditional
`file systems. Under a
`traditional
`file system, disk arrays that use large block in-
`terleaving
`(Level 5 RAID [13]) perform poorly on small
`write operations
`because each small write requires
`four
`disk accesse~ reads of
`the old data and parity blocks and
`writes of the new data and parit y blocks. By contrast,
`large
`write operations in disk arrays are efficient since they don’t
`require the reading of old data or parity.
`LFS eliminates
`small writes, grouping them into efficient
`large, sequential
`write operations.
`A secondary benefit of LFS is that crash reeovery is very
`quick.
`LFS periodically
`performs
`checkpoint
`operations
`that record the current state of
`the file system. To reeover
`from a tile system crash,
`the LFS server need only process
`the log from the position of the last checkpoint. By contrast,
`a UNIX tile system consistency checker
`traverses the entire
`directory
`structure in search of
`lost data. Because a RAID
`has a very large storage capacity, a standard UNIX-style
`consistency check on boot would be unacceptably
`slow. For
`a 1 gigabyte file system,
`it
`takes a few seconds to perform
`an LFS file system check, compared with approximately
`20 minutes to check the consistency of a typical UNIX file
`system of comparable size.
`3.2
`LFS on RAID-II
`of LFS for RAID-II must efficiently
`An implementation
`features of the disk array:
`the sep-
`handle two architectural
`arate low- and high-bandwidth
`data paths, and the separate
`memories on the host workstation
`and the XBUS board.
`To handle the two data paths, we changed the original
`Sprite LFS software to separate the handling
`of data and
`metadata. We use the high-bandwidth
`data path to transfer
`data between the disks and the HIPPI network, bypassing
`host workstation memory.
`‘l%e low-bandwidth
`VME con-
`nection between the XBUS and the host workstation
`is used
`to send metadata to the host
`for file name lookup and loca-
`tion of data on disk. The low-bandwidth
`path is also used
`for small data transfers to clients on the Ethernet attached
`to the host.
`also must manage separate memories
`LFS on RAID-II
`on the host workstation
`and the XBUS board.
`‘The host
`memory cache contains metadata as well as files that have
`been read into workstation memory
`for
`transfer over
`the
`Ethernet.
`The cache is managed with a simple Least Re-
`cently Used replacement
`policy. Memory
`on the XBUS
`board is used for prefetch buffers, pipeliningbuffers,
`HIPPI
`net work buffers, and write buffers for LFS segments. The
`
`Number of Disks
`
`Figure 7: Disk read performance for varying number of
`disks on a single SCSI string. Cougar string bandwidth
`is limited to about 3 megabytes/second,
`less than that
`of three disks.
`The dashed line indicates the perfor-
`mance if bandwidth scaled linearly.
`
3.1 The Log-Structured File System

LFS [14] is a disk storage manager that writes all file data and metadata to a sequential append-only log; files are not assigned fixed blocks on disk. LFS minimizes the number of small write operations to disk: data for small writes are buffered in main memory and are written to disk asynchronously when a log segment fills. This is in contrast to traditional file systems, which assign files to fixed blocks on disk. In traditional file systems, a sequence of random file writes results in inefficient small, random disk accesses.

Because it avoids small write operations, the Log-Structured File System avoids a weakness of redundant disk arrays that affects traditional file systems. Under a traditional file system, disk arrays that use large block interleaving (Level 5 RAID [13]) perform poorly on small write operations because each small write requires four disk accesses: reads of the old data and parity blocks and writes of the new data and parity blocks. By contrast, large write operations in disk arrays are efficient since they don't require the reading of old data or parity. LFS eliminates small writes, grouping them into efficient large, sequential write operations.
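A rough sketch of this write buffering follows. It is a simplification for illustration only, not the Sprite LFS code; the segment size, structure layout and helper routine are assumptions.

    #include <stddef.h>
    #include <string.h>

    #define SEGMENT_BYTES (512 * 1024)   /* assumed log segment size */

    struct segment {
        unsigned char buf[SEGMENT_BYTES];
        size_t        used;
    };

    /* Assumed helper: issues one large, sequential write of a full segment
       (data plus parity) to the disk array. */
    void write_segment_to_array(const struct segment *seg);

    /* Buffer a small write in the current in-memory segment and flush only
       when the segment fills, so many small file writes become one large
       sequential disk write.  Assumes len <= SEGMENT_BYTES for brevity. */
    void lfs_append(struct segment *seg, const void *data, size_t len)
    {
        if (seg->used + len > SEGMENT_BYTES) {
            write_segment_to_array(seg);
            seg->used = 0;
        }
        memcpy(seg->buf + seg->used, data, len);
        seg->used += len;
    }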
`
`3
`
`File System
`The RAID-II
`the
`In the last section, we presented measurements of
`performance
`of
`the RAID-II
`hardware without
`the over-
`head of a file system.
`Although
`some applications may
`use RAID-II
`as a large raw disk, most will
`require a file
`system with abstractions
`such as directories and files. The
`file system should deliver
`to applications
`as much of
`the
`bandwidth
`of the RAID-II
`disk array hardware as possible.
`RAID-II’s
`file system is a modified
`version of
`the Sprite
`Log-Structured
`File System (LFS)
`[14].
`In this section, we
`first describe LFS and explain why it
`is particularly
`well-
`-suited for use with disk arrays. Seeond, we describe the
`features of RAID-II
`that require special attention in the LFS
`
`240
`
`Oracle Ex. 1018, pg. 7
`
`
`
The file system keeps the two caches consistent and copies data between them as required.
To hide some of the latency required to service large requests, LFS performs prefetching into XBUS memory buffers. LFS on RAID-II uses XBUS memory to pipeline network and disk operations. For read operations, while one block of data is being sent across the network, the next blocks are being read off the disk. For write operations, full LFS segments are written to disk while newer segments are being filled with data. We are also experimenting with prefetching techniques so small sequential reads can also benefit from overlapping disk and network operations.
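The sketch below illustrates the read pipelining just described, overlapping disk reads with network sends using two buffers. The primitives and the pipeline depth are assumptions for illustration; the actual RAID-II buffering code is not shown in the paper.

    #include <stddef.h>

    #define PIPE_DEPTH 2   /* double buffering: two XBUS memory buffers */

    /* Assumed primitives: start an asynchronous disk read of one block into
       an XBUS buffer, wait for that read, and send a filled buffer over the
       HIPPI network. */
    void start_disk_read(int buf, size_t block);
    void wait_disk_read(int buf);
    void send_over_network(int buf);

    /* While one buffer is being sent across the network, the read of the
       next block into the other buffer is already in flight. */
    void pipelined_read(size_t nblocks)
    {
        for (size_t i = 0; i < nblocks && i < PIPE_DEPTH; i++)
            start_disk_read((int)(i % PIPE_DEPTH), i);

        for (size_t i = 0; i < nblocks; i++) {
            int buf = (int)(i % PIPE_DEPTH);
            wait_disk_read(buf);                 /* block i is now in buf  */
            send_over_network(buf);              /* overlaps with the read */
                                                 /* of block i+1 in flight */
            if (i + PIPE_DEPTH < nblocks)
                start_disk_read(buf, i + PIPE_DEPTH);
        }
    }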
From the user's perspective, the RAID-II file system looks like a standard file system to clients using the slow Ethernet path (standard mode), but requires relinking of applications for clients to use the fast HIPPI path (high-bandwidth mode). To access data over the Ethernet, clients simply use NFS or the Sprite RPC mechanism. The fast data path across the Ultranet uses a special library of file system operations for RAID files: open, read, write, etc. The library converts file operations to operations on an Ultranet socket between the client and the RAID-II server. The advantage of this approach is that it doesn't require changes to the client operating system. Alternatively, these operations could be moved into the kernel on the client and would be transparent to the user-level application. Finally, some operating systems offer dynamically loadable device drivers that would eliminate the need for relinking applications; however, Sprite does not allow dynamic loading.
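As an illustration of this library approach, the sketch below shows how a read call might be turned into operations on the Ultranet socket. The routine name raid_read comes from the paper, but its signature, the command encoding and the error handling here are invented for illustration.

    #include <stdint.h>
    #include <unistd.h>

    /* Hypothetical wire format; the actual RAID-II command encoding is not
       given in the paper. */
    struct raid_cmd {
        uint32_t op;        /* e.g. 1 = read              */
        uint32_t handle;    /* file handle from raid_open */
        uint64_t offset;    /* file position in bytes     */
        uint64_t length;    /* request length in bytes    */
    };

    /* Sketch of the client library: a read call becomes a command written to
       the Ultranet socket, followed by receiving the file data on that same
       socket. */
    ssize_t raid_read(int sock, uint32_t handle, uint64_t offset,
                      void *buf, uint64_t length)
    {
        struct raid_cmd cmd = { 1, handle, offset, length };
        if (write(sock, &cmd, sizeof cmd) != (ssize_t)sizeof cmd)
            return -1;

        uint64_t got = 0;
        while (got < length) {                   /* loop until all data arrive */
            ssize_t n = read(sock, (char *)buf + got, length - got);
            if (n <= 0)
                return -1;
            got += (uint64_t)n;
        }
        return (ssize_t)got;
    }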
`
3.3 Typical High and Low Bandwidth Requests

This section explains how the RAID-II file system handles typical open and read requests from a client over the fast Ultra Network Technologies network and over the slow Ethernet network attached to the host workstation.
In the first example, the client is connected to RAID-II across the high-bandwidth Ultranet network. The client application is linked with a small library that converts RAID file operations into operations on an Ultranet socket connection. First, the application running on the client performs an open operation by calling the library routine raid_open. This call is transformed by the library into socket operations. The library opens a socket between the client and the RAID-II file server. Then the client sends an open command to RAID-II. Finally, the RAID-II host opens the file and informs the client.
Now the application can perform read and write operations on the file. Suppose the application does a large read using the library routine raid_read. As before, the library converts this read operation into socket operations, sending RAID-II a read command that includes file position and request length. RAID-II handles a read request by pipelining disk reads and network sends. First, the file system code allocat