RAID-II: A High-Bandwidth Network File Server

Ann L. Drapeau, Ken W. Shirriff, John H. Hartman, Ethan L. Miller, Srinivasan Seshan, Randy H. Katz, Ken Lutz, David A. Patterson¹, Edward K. Lee², Peter M. Chen³, Garth A. Gibson⁴

¹Computer Science Division, University of California, Berkeley, CA 94720
²DEC Systems Research Center, Palo Alto, CA 94301
³Computer Science and Engineering Division, University of Michigan, Ann Arbor, MI 48109-2122
⁴School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3891
Abstract

In 1989, the RAID (Redundant Arrays of Inexpensive Disks) group at U.C. Berkeley built a prototype disk array called RAID-I. The bandwidth delivered to clients by RAID-I was severely limited by the memory system bandwidth of the disk array's host workstation. We designed our second prototype, RAID-II, to deliver more of the disk array bandwidth to file server clients. A custom-built crossbar memory system called the XBUS board connects the disks directly to the high-speed network, allowing data for large requests to bypass the server workstation. RAID-II runs Log-Structured File System (LFS) software to optimize performance for bandwidth-intensive applications.

The RAID-II hardware with a single XBUS controller board delivers 20 megabytes/second for large, random read operations and up to 31 megabytes/second for sequential read operations. A preliminary implementation of LFS on RAID-II delivers 21 megabytes/second on large read requests and 15 megabytes/second on large write operations.
1 Introduction
It is essential for future file servers to provide high-bandwidth I/O because of a trend toward bandwidth-intensive applications like multi-media, CAD, object-oriented databases and scientific visualization. Even in well-established application areas such as scientific computing, the size of data sets is growing rapidly due to reductions in the cost of secondary storage and the introduction of faster supercomputers. These developments require faster I/O systems to transfer the increasing volume of data.
High performance file servers will increasingly incorporate disk arrays to provide greater disk bandwidth. RAIDs, or Redundant Arrays of Inexpensive Disks [13], [5], use a collection of relatively small, inexpensive disks to achieve high performance and high reliability in a secondary storage system. RAIDs provide greater disk bandwidth to a file by striping or interleaving the data from a single file across a group of disk drives, allowing multiple disk transfers to occur in parallel. RAIDs ensure reliability by calculating error correcting codes across a group of disks; this redundancy information can be used to reconstruct the data on disks that fail. In this paper, we examine the efficient delivery of bandwidth from a file server that includes a disk array.
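To make the striping and parity arithmetic concrete, the following C sketch is our own illustration, not code from RAID-I or RAID-II; the five-disk group, 4-kilobyte striping unit and helper names are assumptions. It maps logical blocks onto disks in a rotated-parity layout of the kind described by RAID Level 5 and computes a stripe's parity as the XOR of its data blocks, which is all that is needed to rebuild any single failed disk.

```c
/* Illustrative sketch of block-interleaved striping with rotated parity
 * (RAID Level 5 style).  Group size, block size and helper names are
 * assumptions for illustration, not taken from RAID-I or RAID-II. */
#include <stdio.h>
#include <string.h>

#define NDISKS 5          /* disks in one parity group: 4 data + 1 parity */
#define BLOCK  4096       /* bytes per striping unit */

/* Map a logical data block to (stripe, disk), rotating the parity disk
 * from stripe to stripe so parity traffic is spread over all disks. */
static void map_block(int logical, int *stripe, int *disk)
{
    *stripe = logical / (NDISKS - 1);
    int slot = logical % (NDISKS - 1);
    int parity_disk = *stripe % NDISKS;
    *disk = (slot >= parity_disk) ? slot + 1 : slot;   /* skip the parity disk */
}

/* Parity is the byte-wise XOR of the data blocks in a stripe. */
static void compute_parity(unsigned char data[NDISKS - 1][BLOCK],
                           unsigned char parity[BLOCK])
{
    memset(parity, 0, BLOCK);
    for (int d = 0; d < NDISKS - 1; d++)
        for (int i = 0; i < BLOCK; i++)
            parity[i] ^= data[d][i];
}

int main(void)
{
    for (int b = 0; b < 8; b++) {
        int stripe, disk;
        map_block(b, &stripe, &disk);
        printf("logical block %d -> stripe %d, disk %d\n", b, stripe, disk);
    }

    /* Reconstruction: XOR-ing the parity with all surviving data blocks
     * regenerates the block stored on a failed disk. */
    static unsigned char data[NDISKS - 1][BLOCK], parity[BLOCK], rebuilt[BLOCK];
    for (int d = 0; d < NDISKS - 1; d++)
        memset(data[d], 0x10 * (d + 1), BLOCK);
    compute_parity(data, parity);

    memcpy(rebuilt, parity, BLOCK);
    for (int d = 0; d < NDISKS - 1; d++)
        if (d != 2)                       /* pretend the disk holding data[2] failed */
            for (int i = 0; i < BLOCK; i++)
                rebuilt[i] ^= data[d][i];
    printf("rebuilt block matches original: %s\n",
           memcmp(rebuilt, data[2], BLOCK) == 0 ? "yes" : "no");
    return 0;
}
```

Reconstruction of a lost block is the same XOR applied to the parity and the surviving blocks, as the last few lines check.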
In 1989, the RAID group at U.C. Berkeley built an initial RAID prototype, called RAID-I [2]. The prototype was constructed using a Sun 4/280 workstation with 128 megabytes of memory, four dual-string SCSI controllers, 28 5.25-inch SCSI disks and specialized disk striping software. We were interested in the performance of a disk array constructed of commercially-available components on two workloads: small, random operations typical of file servers in a workstation environment and large transfers typical of bandwidth-intensive applications. We anticipated that there would be both hardware and software bottlenecks that would restrict the bandwidth achievable by RAID-I. The development of RAID-I was motivated by our desire to gain experience with disk array software, disk controllers and the SCSI protocol.
Experiments with RAID-I show that it performs well when processing small, random I/Os, achieving approximately 275 four-kilobyte random I/Os per second. However, RAID-I proved woefully inadequate at providing high-bandwidth I/O, sustaining at best 2.3 megabytes/second to a user-level application on RAID-I. By comparison, a single disk on RAID-I can sustain 1.3 megabytes/second. The bandwidth of nearly 26 of the 28 disks in the array is effectively wasted because it cannot be delivered to clients.
There are several reasons why RAID-I is ill-suited for high-bandwidth I/O. The most serious is the memory contention experienced on the Sun 4/280 workstation during I/O operations. The copy operations that move data between kernel DMA buffers and buffers in user space saturate the memory system when I/O bandwidth reaches 2.3 megabytes/second. Second, because all I/O on the Sun 4/280 goes through the CPU's virtually addressed cache, data transfers experience interference from cache flushes. Finally, disregarding the workstation's memory bandwidth limitation, high-bandwidth performance is also restricted by the low backplane bandwidth of the Sun 4/280's VME system bus, which becomes saturated at 9 megabytes/second.
The problems RAID-I experienced are typical of many workstations that are designed to exploit large, fast caches for good processor performance, but fail to support adequate primary memory or I/O bandwidth [8], [12]. In such
workstations, the memory system is designed so that only the CPU has a fast, high-bandwidth path to memory. For busses or backplanes farther away from the CPU, available memory bandwidth drops quickly. Thus, file servers incorporating such workstations perform poorly on bandwidth-intensive applications.
The design of hardware and software for our second prototype, RAID-II [1], was motivated by a desire to deliver more of the disk array's bandwidth to the file server's clients. Toward this end, the RAID-II hardware contains two data paths: a high-bandwidth path that handles large transfers efficiently using a custom-built crossbar interconnect between the disk system and the high-bandwidth network, and a low-bandwidth path used for control operations and small data transfers. The software for RAID-II was also designed to deliver disk array bandwidth. RAID-II runs LFS, the Log-Structured File System [14], developed by the Sprite operating systems group at Berkeley. LFS treats the disk array as a log, combining small accesses together and writing large blocks of data sequentially to avoid inefficient small write operations. A single XBUS board with 24 attached disks and no file system delivers up to 20 megabytes/second for random read operations and 31 megabytes/second for sequential read operations. With LFS software, the board delivers 21 megabytes/second on sequential read operations and 15 megabytes/second on sequential write operations. Not all of this data is currently delivered to clients because of client and network limitations described in Section 3.4.
This paper describes the design, implementation and performance of the RAID-II prototype. Section 2 describes the RAID-II hardware, including the design choices made to deliver bandwidth, architecture and implementation details, and hardware performance measurements. Section 3 discusses the implementation and performance of LFS running on RAID-II. Section 4 compares other high performance I/O systems to RAID-II, and Section 5 discusses future directions. Finally, we summarize the contributions of the RAID-II prototype.
2 RAID-II Hardware
Figure 1 illustrates the RAID-II storage architecture. The RAID-II storage server prototype spans three racks, with the two outer racks containing disk drives and the center rack containing custom-designed crossbar controller boards and commercial disk controller boards. The center rack also contains a workstation, called the host, which is shown in the figure as logically separate. Each XBUS controller board has a HIPPI connection to a high-speed Ultra Network Technologies ring network that may also connect supercomputers and client workstations. The host workstation has an Ethernet interface that allows transfers between the disk array and clients connected to the Ethernet network.
In this section, we discuss the prototype hardware. First, we describe the design decisions that were made to deliver disk array bandwidth. Next, we describe the architecture and implementation, followed by microbenchmark measurements of hardware performance.
2.1 Delivering Disk Array Bandwidth

2.1.1 High and Low Bandwidth Data Paths
Disk array bandwidth on our first prototype was limited by low memory system bandwidth on the host workstation. RAID-II was designed to avoid similar performance limitations. It uses a custom-built memory system called the XBUS controller to create a high-bandwidth data path. This data path allows RAID-II to transfer data directly between the disk array and the high-speed, 100 megabytes/second HIPPI network without sending data through the host workstation memory, as occurs in traditional disk array storage servers like RAID-I. Rather, data are temporarily stored in memory on the XBUS board and transferred using a high-bandwidth crossbar interconnect between disks, the HIPPI network, the parity computation engine and memory.
There is also a low-bandwidth data path in RAID-II that is similar to the data path in RAID-I. This path transfers data across the workstation's backplane between the XBUS board and host memory. The data sent along this path include file metadata (data associated with a file other than its contents) and small data transfers. Metadata are needed by file system software for file management, name translation and for maintaining consistency between file caches on the host workstation and the XBUS board. The low-bandwidth data path is also used for servicing data requests from clients on the 10 megabits/second Ethernet network attached to the host workstation.
We refer to the two access modes on RAID-II as the high-bandwidth mode for accesses that use the high-performance data path and standard mode for requests serviced by the low-bandwidth data path. Any client request can be serviced using either access mode, but we maximize utilization and performance of the high-bandwidth data path if smaller requests use the Ethernet network and larger requests use the HIPPI network.
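The split between the two modes can be pictured as a size-based policy. The sketch below is purely illustrative and not part of RAID-II: the 64-kilobyte cutoff and the names are our own assumptions, and the paper does not state how requests are actually assigned to a path beyond the guideline above.

```c
/* Hypothetical illustration of choosing an access mode by request size.
 * The 64 KB threshold is an assumption for illustration only. */
#include <stdio.h>
#include <stddef.h>

enum access_mode { STANDARD_MODE, HIGH_BANDWIDTH_MODE };

static enum access_mode choose_mode(size_t request_bytes)
{
    const size_t threshold = 64 * 1024;   /* assumed cutoff */
    return (request_bytes >= threshold) ? HIGH_BANDWIDTH_MODE : STANDARD_MODE;
}

int main(void)
{
    size_t sizes[] = { 4 * 1024, 256 * 1024, 1600 * 1024 };
    for (int i = 0; i < 3; i++)
        printf("%zu bytes -> %s\n", sizes[i],
               choose_mode(sizes[i]) == HIGH_BANDWIDTH_MODE
                   ? "high-bandwidth mode (HIPPI)"
                   : "standard mode (Ethernet)");
    return 0;
}
```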
2.1.2 Scaling to Provide Greater Bandwidth
The bandwidth of the RAID-II storage server can be scaled by adding XBUS controller boards to a host workstation. To some extent, adding XBUS boards is like adding disks to a conventional file server. An important distinction is that adding an XBUS board to RAID-II increases the I/O bandwidth available to the network, whereas adding a disk to a conventional file server only increases the I/O bandwidth available to that particular file server. The latter is less effective, since the file server's memory system and backplane will soon saturate if it is a typical workstation. Eventually, adding XBUS controllers to a host workstation will saturate the host's CPU, since the host manages all disk and network transfers. However, since the high-bandwidth data path of RAID-II is independent of the host workstation, we can use a more powerful host to continue scaling storage server bandwidth.
2.2 Architecture and Implementation of RAID-II

Figure 2 illustrates the architecture of the RAID-II file server. The file server's backplane consists of two high-bandwidth (HIPPI) data busses and a low-latency (VME) control bus. The backplane connects the high-bandwidth network interfaces, several XBUS controllers and a host workstation that controls the operation of the XBUS controller boards. Each XBUS board contains interfaces to the HIPPI backplane and VME control bus, memory, a parity computation engine and four VME interfaces that connect to disk controller boards.
Figure 1: The RAID-II File Server and its Clients. RAID-II, shown by dotted lines, is composed of three racks. The host workstation is connected to one or more XBUS controller boards (contained in the center rack of RAID-II) over the VME backplane. Each XBUS controller has a HIPPI network connection that connects to the Ultranet high-speed ring network; the Ultranet also provides connections to client workstations and supercomputers. The XBUS boards also have connections to SCSI disk controller boards. In addition, the host workstation has an Ethernet network interface that connects it to client workstations.
Figure 3 illustrates the physical packaging of the RAID-II file server, which spans three racks. To minimize the design effort, we used commercially available components whenever possible. Thinking Machines Corporation (TMC) provided a board set for the HIPPI interface to the Ultranet ring network; Interphase Corporation provided VME-based, dual-SCSI, Cougar disk controllers; Sun Microsystems provided the Sun 4/280 file server; and IBM donated disk drives and DRAM. The center rack is composed of three chassis. On top is a VME chassis containing eight Interphase Cougar disk controllers. The center chassis was provided by TMC and contains the HIPPI interfaces and our custom XBUS controller boards. The bottom VME chassis contains the Sun 4/280 host workstation. Two outer racks each contain 72 3.5-inch, 320 megabyte IBM SCSI disks and their power supplies. Each of these racks contains eight shelves of nine disks. RAID-II has a total capacity of 46 gigabytes, although the performance results in the next section are for a single XBUS board controlling 24 disks. We will achieve full array connectivity by adding XBUS boards and increasing the number of disks per SCSI string.
Figure 4 is a block diagram of the XBUS disk array controller board, which implements a 4x8, 32-bit wide crossbar interconnect called the XBUS. The crossbar connects four memory modules to eight system components called ports. The XBUS ports include two interfaces to HIPPI network boards, four VME interfaces to Interphase Cougar disk controller boards, a parity computation engine, and a VME interface to the host workstation. Each XBUS transfer involves one of the four memory modules and one of the eight XBUS ports. Each port was intended to support 40 megabytes/second of data transfer, for a total of 160 megabytes/second of sustainable XBUS bandwidth.
The XBUS is an asynchronous, multiplexed (address/data), crossbar-based interconnect that uses a centralized, priority-based arbitration scheme. Each of the eight 32-bit XBUS ports operates at a cycle time of 80 nanoseconds. The memory is interleaved in sixteen-word blocks. The XBUS supports only two types of bus transactions: reads and writes. Our implementation of the crossbar interconnect was fairly expensive, using 192 16-bit transceivers. Using surface mount packaging, the implementation required 120 square inches, or approximately 20% of the XBUS controller's board area. An advantage of the implementation is that the 32-bit XBUS ports are relatively inexpensive.
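The interleaving and bandwidth figures can be restated in a few lines of C. This is our own reading of the description, not RAID-II firmware: the address-to-module mapping is an assumption consistent with a sixteen-word, four-module interleave, and the bandwidth lines simply redo the arithmetic implied by the 80 nanosecond cycle time and the 40 megabytes/second design target.

```c
/* Sketch of the XBUS memory interleaving and bandwidth arithmetic as we
 * read the description above; the mapping function is an assumption. */
#include <stdio.h>

#define NMODULES         4    /* memory modules on the crossbar */
#define INTERLEAVE_WORDS 16   /* consecutive 32-bit words per module */

/* Sixteen-word blocks rotate round-robin across the four modules. */
static int module_for_word(unsigned long word_addr)
{
    return (int)((word_addr / INTERLEAVE_WORDS) % NMODULES);
}

int main(void)
{
    for (unsigned long w = 0; w < 64; w += 16)
        printf("words %2lu-%2lu -> module %d\n", w, w + 15, module_for_word(w));

    /* Per-port peak: one 32-bit word (4 bytes) every 80 nanoseconds. */
    double port_peak_mb = 4.0 / 80e-9 / 1e6;       /* ~50 MB/s raw peak */
    double sustained_mb = 40.0;                    /* design target from the text */
    printf("per-port raw peak        = %.0f MB/s\n", port_peak_mb);
    printf("8 ports x %.0f MB/s sustained = %.0f MB/s aggregate\n",
           sustained_mb, 8 * sustained_mb);
    return 0;
}
```

One 32-bit word every 80 nanoseconds is a raw peak of roughly 50 megabytes/second per port, so the 40 megabytes/second sustained target leaves some headroom on each port.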
Figure 2: Architecture of the RAID-II File Server. The host workstation may have several XBUS controller boards attached to its VME backplane. Each XBUS controller board contains interfaces to HIPPI network source and destination boards, buffer memory, a high-bandwidth crossbar, a parity engine, and interfaces to four SCSI disk controller boards.
Two XBUS ports implement an interface between the XBUS and the TMC HIPPI boards. Each port is unidirectional, designed to sustain 40 megabytes/second of data transfer and bursts of 100 megabytes/second into 32 kilobyte FIFO interfaces.

Four of the XBUS ports are used to connect the XBUS board to four VME busses, each of which may connect to one or two dual-string Interphase Cougar disk controllers. In our current configuration, we connect three disks to each SCSI string, two strings to each Cougar controller, and one Cougar controller to each XBUS VME interface, for a total of 24 disks per XBUS board. The Cougar disk controllers can transfer data at 8 megabytes/second, for a maximum disk bandwidth to a single XBUS board of 32 megabytes/second in our present configuration.
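The 24-disk and 32 megabytes/second figures follow directly from that configuration; the snippet below merely multiplies them out and is not part of any RAID-II software.

```c
/* Arithmetic check of the disk configuration described in the text. */
#include <stdio.h>

int main(void)
{
    int disks_per_string   = 3;
    int strings_per_cougar = 2;
    int cougars_per_xbus   = 4;
    double mb_per_cougar   = 8.0;   /* megabytes/second per Cougar controller */

    int disks = disks_per_string * strings_per_cougar * cougars_per_xbus;
    double max_disk_bw = mb_per_cougar * cougars_per_xbus;

    printf("disks per XBUS board: %d\n", disks);                /* 24 */
    printf("max disk bandwidth:   %.0f MB/s\n", max_disk_bw);   /* 32 */
    return 0;
}
```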
Of the remaining two XBUS ports, one interfaces to a parity computation engine. The last port is the VME control interface linking the XBUS board to the host workstation. It provides the host with access to the XBUS board's memory as well as its control registers. This makes it possible for file server software running on the host to access network headers, file data and metadata in the XBUS memory.
2.3 Hardware Performance

In this section, we present the raw performance of the RAID-II [1] hardware, that is, the performance of the hardware without the overhead of a file system. In Section 3.4, we show how much of this raw hardware performance can be delivered by the file system to clients.
Figure 4: Structure of the XBUS controller board. The board contains four memory modules connected by a 4x8 crossbar interconnect to eight XBUS ports. Two of these XBUS ports are interfaces to the HIPPI network. Another is a parity computation engine. One is a VME network interface for sending control information, and the remaining four are VME interfaces connecting to commercial SCSI disk controller boards.

Figure 5 shows RAID-II performance for random reads and writes. We refer to these performance measurements as hardware system level experiments, since they involve all the components of the system from the disks to the HIPPI network. For these experiments, the disk system is
configured as a RAID Level 5 [13] with one parity group of 24 disks. (This scheme delivers high bandwidth but exposes the array to data loss during dependent failure modes such as a SCSI controller failure. Techniques for maximizing reliability are beyond the scope of this paper [4], [16], [6].) For reads, data are read from the disk array into the memory on the XBUS board; from there, data are sent over HIPPI, back to the XBUS board, and into XBUS memory. For writes, data originate in XBUS memory, are sent over the HIPPI and then back to the XBUS board to XBUS memory; parity is computed, and then both data and parity are written to the disk array. For both tests, the system is configured with four Interphase Cougar disk controllers, each with two strings of three disks. For both reads and writes, subsequent fixed size operations are at random locations. Figure 5 shows that, for large requests, hardware system level read and write performance reaches about 20 megabytes/second. The dip in read performance for requests of size 768 kilobytes occurs because at that size the striping scheme in this experiment involves a second string on one of the controllers; there is some contention on the controller that results in lower performance when both strings are used. Writes are slower than reads due to the increased disk and memory activity associated with computing and writing parity. While an order of magnitude faster than our previous prototype, RAID-I, this is still well below our target bandwidth of 40 megabytes/second. Below, we show that system performance is limited by that of the commercial disk controller boards and our disk interfaces.
Figure 3: The physical packaging of the RAID-II File Server. Two outer racks contain 144 disks and their power supplies. The center rack contains three chassis: the top chassis holds VME disk controller boards; the center chassis contains XBUS controller boards and HIPPI interface boards, and the bottom VME chassis contains the Sun 4/280 workstation.
Table 2: Peak I/O rates in operations per second for random, 4 kilobyte reads. Performance for a single disk and for fifteen disks, with a separate process issuing random I/O operations to each disk. The IBM 0661 disk drives used in RAID-II can perform more I/Os per second than the Seagate Wren IV disks used in RAID-I because they have shorter seek and rotation times. Superior I/O rates on RAID-II are also the result of the architecture, which does not require data to be transferred through the host workstation.
Figure 5: Hardware System Level Read and Write Performance. RAID-II achieves approximately 20 megabytes/second for both random reads and writes.
Table 1: Peak read and write performance for an XBUS board on sequential operations. This experiment was run using the four XBUS VME interfaces to disk controller boards, with a fifth disk controller attached to the XBUS VME control bus interface.
Table 1 shows peak performance of the system when sequential read and write operations are performed. These measurements were obtained using the four Cougar boards attached to the XBUS VME interfaces and, in addition, using a fifth Cougar board attached to the XBUS VME control bus interface. For requests of size 1.6 megabytes, read performance is 31 megabytes/second, compared to 23 megabytes/second for writes. Write performance is worse than read performance because of the extra accesses required during writes to calculate and write parity, and because sequential reads benefit from the read-ahead performed into track buffers on the disks; writes have no such advantage on these disks.
While one of the main goals in the design of RAID-II was to provide better performance than our first prototype on high-bandwidth operations, we also want the file server to perform well on small, random operations. Table 2 compares the I/O rates achieved on our two disk array prototypes, RAID-I and RAID-II, using a test program that performed random 4 kilobyte reads. In each case, fifteen disks were accessed. The tests used Sprite operating system read and write system calls; the RAID-II test did not use LFS. In each case, a separate process issued 4 kilobyte, randomly distributed I/O requests to each active disk in the system. The performance with a single disk indicates the difference between the Seagate Wren IV disks used in RAID-I and the IBM 0661 disks used in RAID-II. The IBM disk drives have faster rotation and seek times, making random operations quicker. With fifteen disks active, both systems perform well, with RAID-II achieving over 400 small accesses per second. In both prototypes, the small, random I/O rates were limited by the large number of context switches required on the Sun 4/280 workstation to handle request completions. RAID-I's performance was also limited by the memory bandwidth on the Sun 4/280, since all data went into host memory. Since data need not be transferred to the host workstation in RAID-II, it delivers a higher percentage (78%) of the potential I/O rate from its fifteen disks than does RAID-I (67%).
The remaining performance measurements in this section are for specific components of the XBUS board. Figure 6 shows the performance of the HIPPI network and boards. Data are transferred from the XBUS memory to the HIPPI source board, and then to the HIPPI destination board and back to XBUS memory. Because the network is configured as a loop, there is minimal network protocol overhead. This test focuses on the HIPPI interfaces' raw hardware performance; it does not include measurements for sending data over the Ultranet.

Figure 6: HIPPI Loopback Performance. RAID-II achieves 38.5 megabytes/second in each direction.
In the loopback mode, the overhead of sending a HIPPI packet is about 1.1 milliseconds, mostly due to setting up the HIPPI and XBUS control registers across the slow VME link. (By comparison, an Ethernet packet takes approximately 0.5 milliseconds to transfer.) Due to this control overhead, small requests result in low performance on the HIPPI network. For large requests, however, the XBUS and HIPPI boards support 38 megabytes/second in both directions, which is very close to the maximum bandwidth of the XBUS ports. The HIPPI loopback bandwidth of 38 megabytes/second is three times greater than FDDI and two orders of magnitude greater than Ethernet. Clearly the HIPPI part of the XBUS is not the limiting factor in determining hardware system level performance.
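A simple latency-plus-bandwidth model shows how this fixed per-packet cost shapes throughput. The sketch below is our own illustration, not a measurement: it combines the 1.1 millisecond setup overhead quoted above with a nominal 40 megabytes/second per-direction port rate.

```c
/* Back-of-the-envelope model of HIPPI loopback throughput: a fixed
 * ~1.1 ms control overhead per packet plus a ~40 MB/s per-direction
 * port rate.  The numbers come from the text; the model itself is an
 * illustration, not a measurement. */
#include <stdio.h>

int main(void)
{
    const double overhead_s = 1.1e-3;   /* per-packet setup over the VME link */
    const double peak_bps   = 40e6;     /* XBUS port peak, bytes/second */

    double sizes[] = { 4e3, 64e3, 256e3, 1.6e6 };
    for (int i = 0; i < 4; i++) {
        double t  = overhead_s + sizes[i] / peak_bps;
        double bw = sizes[i] / t / 1e6;  /* effective MB/s */
        printf("%9.0f bytes: %5.1f MB/s effective\n", sizes[i], bw);
    }
    return 0;
}
```

Under these assumptions a 4 kilobyte packet achieves only about 3 megabytes/second, while a 1.6 megabyte transfer reaches roughly 39 megabytes/second, consistent with the measured loopback bandwidth.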
Disk performance is responsible for the lower-than-expected hardware system level performance of RAID-II. One performance limitation is the Cougar disk controller, which only supports about 3 megabytes/second on each of two SCSI strings, as shown in Figure 7. The other limitation is our relatively slow, synchronous VME interface ports, which only support 6.9 megabytes/second on read operations and 5.9 megabytes/second on write operations. This low performance is largely due to the difficulty of synchronizing the asynchronous VME bus for interaction with the XBUS.

Figure 7: Disk read performance for a varying number of disks on a single SCSI string. Cougar string bandwidth is limited to about 3 megabytes/second, less than that of three disks. The dashed line indicates the performance if bandwidth scaled linearly.
3 The RAID-II File System

In the last section, we presented measurements of the performance of the RAID-II hardware without the overhead of a file system. Although some applications may use RAID-II as a large raw disk, most will require a file system with abstractions such as directories and files. The file system should deliver to applications as much of the bandwidth of the RAID-II disk array hardware as possible. RAID-II's file system is a modified version of the Sprite Log-Structured File System (LFS) [14]. In this section, we first describe LFS and explain why it is particularly well-suited for use with disk arrays. Second, we describe the features of RAID-II that require special attention in the LFS implementation. Next, we illustrate how LFS on RAID-II handles typical read requests from a client. Finally, we discuss the measured performance and implementation status of LFS on RAID-II.

3.1 The Log-Structured File System
LFS [14] is a disk storage manager that writes all file data and metadata to a sequential, append-only log; files are not assigned fixed blocks on disk. LFS minimizes the number of small write operations to disk: data for small writes are buffered in main memory and are written to disk asynchronously when a log segment fills. This is in contrast to traditional file systems, which assign files to fixed blocks on disk. In traditional file systems, a sequence of random file writes results in inefficient small, random disk accesses.
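The core of the scheme fits in a few lines. The sketch below is a minimal illustration under our own assumptions (a fixed 1 megabyte segment and invented helper names); it is not the Sprite LFS code, but it shows how small writes accumulate in memory and reach the disks only as large sequential segment writes.

```c
/* Minimal sketch of log-structured write buffering.  Segment size and
 * helper names are assumptions for illustration, not Sprite LFS code. */
#include <stdio.h>
#include <string.h>

#define SEGMENT_SIZE (1 << 20)   /* 1 MB log segment, assumed */

static unsigned char segment[SEGMENT_SIZE];
static size_t seg_used = 0;

/* Stand-in for issuing one large sequential write to the disk array. */
static void flush_segment(void)
{
    printf("writing %zu-byte segment sequentially to the log\n", seg_used);
    seg_used = 0;
}

/* Small file writes are appended to the current segment in memory;
 * the disks see only large sequential transfers. */
static void lfs_write(const void *buf, size_t len)
{
    if (seg_used + len > SEGMENT_SIZE)
        flush_segment();
    memcpy(segment + seg_used, buf, len);
    seg_used += len;
}

int main(void)
{
    unsigned char block[4096] = { 0 };
    for (int i = 0; i < 600; i++)      /* 600 x 4 KB of small writes */
        lfs_write(block, sizeof block);
    flush_segment();                   /* flush the partially filled final segment */
    return 0;
}
```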
Because it avoids small write operations, the Log-Structured File System avoids a weakness of redundant disk arrays that affects traditional file systems. Under a traditional file system, disk arrays that use large block interleaving (Level 5 RAID [13]) perform poorly on small write operations because each small write requires four disk accesses: reads of the old data and parity blocks and writes of the new data and parity blocks. By contrast, large write operations in disk arrays are efficient since they don't require the reading of old data or parity. LFS eliminates small writes, grouping them into efficient large, sequential write operations.
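The four accesses follow directly from the parity arithmetic. The sketch below is our own illustration rather than RAID-II code: it contrasts the read-modify-write parity update forced by a small write (two reads plus two writes) with a full-stripe write, whose parity is computed from the new data alone.

```c
/* Illustration of why a small write to a Level 5 RAID costs four disk
 * accesses, while a full-stripe write needs no reads at all.  Sizes and
 * helper names are hypothetical. */
#include <stdio.h>
#include <string.h>

#define BLOCK      512
#define DATA_DISKS 4

/* Small write: read old data and old parity, compute
 *   new_parity = old_parity XOR old_data XOR new_data,
 * then write new data and new parity (2 reads + 2 writes). */
static void small_write_parity(const unsigned char *old_data,
                               const unsigned char *new_data,
                               const unsigned char *old_parity,
                               unsigned char *new_parity)
{
    for (int i = 0; i < BLOCK; i++)
        new_parity[i] = old_parity[i] ^ old_data[i] ^ new_data[i];
}

/* Full-stripe write: parity is the XOR of the new data blocks only,
 * so no old data or parity needs to be read. */
static void full_stripe_parity(unsigned char data[DATA_DISKS][BLOCK],
                               unsigned char *parity)
{
    memset(parity, 0, BLOCK);
    for (int d = 0; d < DATA_DISKS; d++)
        for (int i = 0; i < BLOCK; i++)
            parity[i] ^= data[d][i];
}

int main(void)
{
    unsigned char old_data[BLOCK], new_data[BLOCK];
    unsigned char old_parity[BLOCK], new_parity[BLOCK];
    memset(old_data, 0xAA, BLOCK);
    memset(new_data, 0x55, BLOCK);
    memset(old_parity, 0xAA, BLOCK);
    small_write_parity(old_data, new_data, old_parity, new_parity);
    printf("first byte of updated parity:     0x%02X\n", new_parity[0]);

    unsigned char stripe[DATA_DISKS][BLOCK], parity[BLOCK];
    memset(stripe, 0x55, sizeof stripe);
    full_stripe_parity(stripe, parity);
    printf("first byte of full-stripe parity: 0x%02X\n", parity[0]);
    return 0;
}
```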
A secondary benefit of LFS is that crash recovery is very quick. LFS periodically performs checkpoint operations that record the current state of the file system. To recover from a file system crash, the LFS server need only process the log from the position of the last checkpoint. By contrast, a UNIX file system consistency checker traverses the entire directory structure in search of lost data. Because a RAID has a very large storage capacity, a standard UNIX-style consistency check on boot would be unacceptably slow. For a 1 gigabyte file system, it takes a few seconds to perform an LFS file system check, compared with approximately 20 minutes to check the consistency of a typical UNIX file system of comparable size.
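Roll-forward recovery can be sketched as follows. The structures and names are assumptions of ours (the actual Sprite LFS checkpoint format is not described in this paper); the point is only that recovery work is proportional to the log written since the last checkpoint rather than to the size of the file system.

```c
/* Sketch of checkpoint-based roll-forward recovery.  The structures and
 * function names are assumptions for illustration only. */
#include <stdio.h>

struct checkpoint {
    unsigned long last_segment;   /* log position recorded at checkpoint time */
};

struct segment_summary {
    unsigned long segno;
    int valid;                    /* written completely before the crash? */
};

/* Roll forward: reapply only the segments written after the checkpoint. */
static void recover(const struct checkpoint *cp,
                    const struct segment_summary *log, int nsegs)
{
    for (int i = 0; i < nsegs; i++) {
        if (log[i].segno <= cp->last_segment || !log[i].valid)
            continue;
        printf("replaying segment %lu\n", log[i].segno);
    }
}

int main(void)
{
    struct checkpoint cp = { .last_segment = 97 };
    struct segment_summary log[] = {
        { 96, 1 }, { 97, 1 }, { 98, 1 }, { 99, 1 }, { 100, 0 },
    };
    recover(&cp, log, 5);          /* replays segments 98 and 99 only */
    return 0;
}
```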
3.2 LFS on RAID-II

An implementation of LFS for RAID-II must efficiently handle two architectural features of the disk array: the separate low- and high-bandwidth data paths, and the separate memories on the host workstation and the XBUS board.

To handle the two data paths, we changed the original Sprite LFS software to separate the handling of data and metadata. We use the high-bandwidth data path to transfer data between the disks and the HIPPI network, bypassing host workstation memory. The low-bandwidth VME connection between the XBUS and the host workstation is used to send metadata to the host for file name lookup and location of data on disk. The low-bandwidth path is also used for small data transfers to clients on the Ethernet attached to the host.

LFS on RAID-II also must manage separate memories on the host workstation and the XBUS board. The host memory cache contains metadata as well as files that have been read into workstation memory for transfer over the Ethernet. The cache is managed with a simple Least Recently Used replacement policy. Memory on the XBUS board is used for prefetch buffers, pipelining buffers, HIPPI network buffers, and write buffers for LFS segments.

The file system keeps the two caches consistent and copies data between them as required.

To hide some of the latency required to service large requests, LFS performs prefetching into XBUS memory buffers. LFS on RAID-II uses XBUS memory to pipeline network and disk operations. For read operations, while one block of data is being sent across the network, the next blocks are being read off the disk. For write operations, full LFS segments are written to disk while newer segments are being filled with data. We are also experimenting with prefetching techniques so small sequential reads can also benefit from overlapping disk and network operations.
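The read pipeline is essentially double buffering between the disk side and the network side of XBUS memory. The sketch below is our own single-machine illustration of that overlap, using two threads and simulated delays; in RAID-II the disk and HIPPI ports run concurrently in hardware, and the buffer count, block size and names here are assumptions.

```c
/* Sketch of the read pipeline as double buffering: one thread plays the
 * disk side, filling buffers, while the main thread plays the network
 * side, draining them.  Buffer count, block size and the simulated I/O
 * delays are assumptions for illustration. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NBUF    2            /* double buffering */
#define NBLOCKS 8            /* blocks in the simulated request */

static char buffers[NBUF][64 * 1024];
static int  filled[NBUF];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *disk_reader(void *arg)
{
    (void)arg;
    for (int b = 0; b < NBLOCKS; b++) {
        int slot = b % NBUF;
        pthread_mutex_lock(&lock);
        while (filled[slot])                 /* wait for the sender to drain it */
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);

        usleep(10000);                       /* pretend to read the block from disk */
        buffers[slot][0] = (char)b;

        pthread_mutex_lock(&lock);
        filled[slot] = 1;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t reader;
    pthread_create(&reader, NULL, disk_reader, NULL);

    for (int b = 0; b < NBLOCKS; b++) {      /* network sender */
        int slot = b % NBUF;
        pthread_mutex_lock(&lock);
        while (!filled[slot])                /* wait for the disk side to fill it */
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);

        usleep(10000);                       /* pretend to send the block over HIPPI */
        printf("sent block %d\n", buffers[slot][0]);

        pthread_mutex_lock(&lock);
        filled[slot] = 0;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
    }
    pthread_join(reader, NULL);
    return 0;
}
```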
From the user's perspective, the RAID-II file system looks like a standard file system to clients using the slow Ethernet path (standard mode), but requires relinking of applications for clients to use the fast HIPPI path (high-bandwidth mode). To access data over the Ethernet, clients simply use NFS or the Sprite RPC mechanism. The fast data path across the Ultranet uses a special library of file system operations for RAID files: open, read, write, etc. The library converts file operations to operations on an Ultranet socket between the client and the RAID-II server. The advantage of this approach is that it doesn't require changes to the client operating system. Alternatively, these operations could be moved into the kernel on the client and would be transparent to the user-level application. Finally, some operating systems offer dynamically loadable device drivers that would eliminate the need for relinking applications; however, Sprite does not allow dynamic loading.
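To give a feel for what such a library looks like, here is a minimal client-side sketch. It is not the RAID-II library: the command layout, the use of a plain TCP socket, the server address and the absence of real error handling are all our own simplifications. Each call simply turns a file operation into a small command message on a socket to the server, as described above.

```c
/* Minimal sketch of a client library that turns RAID file operations
 * into socket operations.  The wire format, message layout and use of
 * TCP are assumptions; the real library runs over an Ultranet socket
 * to the RAID-II server. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

struct raid_file {
    int sock;                 /* socket to the RAID-II file server */
    uint64_t offset;          /* current file position */
};

struct raid_cmd {             /* hypothetical command header, host byte order */
    uint32_t op;              /* 1 = open, 2 = read, 3 = write, ... */
    uint64_t offset;
    uint64_t length;
};

/* raid_open: connect to the server and send an open command. */
int raid_open(struct raid_file *f, const char *server_ip, uint16_t port,
              const char *path)
{
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(port) };
    inet_pton(AF_INET, server_ip, &addr.sin_addr);

    f->sock = socket(AF_INET, SOCK_STREAM, 0);
    if (f->sock < 0 || connect(f->sock, (struct sockaddr *)&addr, sizeof addr) < 0)
        return -1;

    struct raid_cmd cmd = { .op = 1, .offset = 0, .length = strlen(path) };
    if (write(f->sock, &cmd, sizeof cmd) < 0 ||
        write(f->sock, path, strlen(path)) < 0)
        return -1;
    f->offset = 0;
    return 0;
}

/* raid_read: send a read command with file position and length, then
 * pull the reply data off the socket. */
ssize_t raid_read(struct raid_file *f, void *buf, size_t len)
{
    struct raid_cmd cmd = { .op = 2, .offset = f->offset, .length = len };
    if (write(f->sock, &cmd, sizeof cmd) < 0)
        return -1;

    size_t got = 0;
    while (got < len) {                       /* the reply may arrive in pieces */
        ssize_t n = read(f->sock, (char *)buf + got, len - got);
        if (n <= 0)
            break;
        got += n;
    }
    f->offset += got;
    return (ssize_t)got;
}

int main(void)
{
    /* Hypothetical usage; the address, port and path are placeholders. */
    struct raid_file f;
    char buf[8192];
    if (raid_open(&f, "10.0.0.1", 5000, "/raid/bigfile") == 0)
        raid_read(&f, buf, sizeof buf);
    return 0;
}
```

An application would be relinked against such a library and call raid_open and raid_read where it previously called open and read.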
3.3 Typical High and Low Bandwidth Requests

This section explains how the RAID-II file system handles typical open and read requests from a client over the fast Ultra Network Technologies network and over the slow Ethernet network attached to the host workstation.
In the first example, the client is connected to RAID-II across the high-bandwidth Ultranet network. The client application is linked with a small library that converts RAID file operations into operations on an Ultranet socket connection. First, the application running on the client performs an open operation by calling the library routine raid_open. This call is transformed by the library into socket operations. The library opens a socket between the client and the RAID-II file server. Then the client sends an open command to RAID-II. Finally, the RAID-II host opens the file and informs the client.

Now the application can perform read and write operations on the file. Suppose the application does a large read using the library routine raid_read. As before, the library converts this read operation into socket operations, sending RAID-II a read command that includes file position and request length. RAID-II handles a read request by pipelining disk reads and network sends. First, the file system code allocat
