PROCEEDINGS

1988 SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA

CHICAGO, ILLINOIS, JUNE 1-3

SIGMOD RECORD, VOLUME 17, NUMBER 3, SEPTEMBER 1988
`
`
`
The Association for Computing Machinery, Inc.
11 West 42nd Street
New York, New York 10036

Copyright © 1988 by the Association for Computing Machinery, Inc. Copying
without fee is permitted provided that the copies are not made or distributed for
direct commercial advantage, and credit to the source is given. Abstracting with
credit is permitted. For other copying of articles that carry a code at the bottom
of the first page, copying is permitted provided that the per-copy fee indicated in
the code is paid through the Copyright Clearance Center, 27 Congress Street,
Salem, MA 01970. For permission to republish write to: Director of Publications,
Association for Computing Machinery. To copy otherwise, or republish, requires
a fee and/or specific permission.

ISBN 0-89791-268-3

Additional copies may be ordered prepaid from:

ACM Order Department
P.O. Box 64145
Baltimore, MD 21264

Price:
Members: $25.00
All others: $33.00

ACM Order Number: 472880
`
`
`A Case for Redundant Arrays of Inexpensive Disks (RAID)
`
`David A. Patterson, Garth Gibson, and Randy H. Katz
`
`Computer Science Division
`Department of Electrical Engineering and Computer Sciences
`571 Evans Hall
`University of California
`Berkeley, CA 94720
(pattrsn@ginger.berkeley.edu)
`
Abstract. Increasing performance of CPUs and memories will be
squandered if not matched by a similar performance increase in I/O. While
the capacity of Single Large Expensive Disks (SLED) has grown rapidly,
the performance improvement of SLED has been modest. Redundant
Arrays of Inexpensive Disks (RAID), based on the magnetic disk
technology developed for personal computers, offers an attractive
alternative to SLED, promising improvements of an order of magnitude in
performance, reliability, power consumption, and scalability. This paper
introduces five levels of RAIDs, giving their relative cost/performance, and
compares RAID to an IBM 3380 and a Fujitsu Super Eagle.
`
1. Background: Rising CPU and Memory Performance
The users of computers are currently enjoying unprecedented growth
in the speed of computers. Gordon Bell said that between 1974 and 1984,
single chip computers improved in performance by 40% per year, about
twice the rate of minicomputers [Bell 84]. In the following year Bill Joy
predicted an even faster growth [Joy 85]:

MIPS = 2^(Year-1984)

Mainframe and supercomputer manufacturers, having difficulty keeping
pace with the rapid growth predicted by "Joy's Law," cope by offering
multiprocessors as their top-of-the-line product.
But a fast CPU does not a fast system make. Gene Amdahl related
CPU speed to main memory size using this rule [Siewiorek 82]:

Each CPU instruction per second requires one byte of main memory;

If computer system costs are not to be dominated by the cost of memory,
then Amdahl's constant suggests that memory chip capacity should grow
at the same rate. Gordon Moore predicted that growth rate over 20 years
ago:

transistors/chip = 2^(Year-1964)

As predicted by Moore's Law, RAMs have quadrupled in capacity every
two [Moore 75] to three years [Myers 86].
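For concreteness, here is a minimal Python sketch, not part of the original paper, that simply evaluates the two growth rules quoted above; the base years 1984 and 1964 come from the text.

    # Illustrative only: evaluate the growth rules quoted above.
    def joys_law_mips(year):
        """Joy's prediction: MIPS = 2^(year - 1984)."""
        return 2 ** (year - 1984)

    def moores_law_transistors_per_chip(year):
        """Moore's prediction: transistors/chip = 2^(year - 1964)."""
        return 2 ** (year - 1964)

    for year in (1984, 1986, 1988):
        print(year, joys_law_mips(year), moores_law_transistors_per_chip(year))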
Recently the ratio of megabytes of main memory to MIPS has been
defined as alpha [Garcia 84], with Amdahl's constant meaning alpha = 1. In
part because of the rapid drop of memory prices, main memory sizes have
grown faster than CPU speeds and many machines are shipped today with
alphas of 3 or higher.
To maintain the balance of costs in computer systems, secondary
storage must match the advances in other parts of the system. A key
measure of magnetic disk technology is the growth in the maximum number of
bits that can be stored per square inch, or the bits per inch in a track
times the number of tracks per inch. Called M.A.D., for maximal areal
density, the "First Law in Disk Density" predicts [Frank 87]:

MAD = 10^((Year-1971)/10)

Magnetic disk technology has doubled capacity and halved price every three
years, in line with the growth rate of semiconductor memory, and in
practice between 1967 and 1979 the disk capacity of the average IBM data
processing system more than kept up with its main memory [Stevens 81].
Capacity is not the only memory characteristic that must grow
rapidly to maintain system balance, since the speed with which
instructions and data are delivered to a CPU also determines its ultimate
performance. The speed of main memory has kept pace for two reasons:
(1) the invention of caches, showing that a small buffer can be managed
automatically to contain a substantial fraction of memory references;
(2) and the SRAM technology, used to build caches, whose speed has
improved at the rate of 40% to 100% per year.
In contrast to primary memory technologies, the performance of
single large expensive magnetic disks (SLED) has improved at a modest
rate. These mechanical devices are dominated by the seek and the rotation
delays: from 1971 to 1981, the raw seek time for a high-end IBM disk
improved by only a factor of two while the rotation time did not
change [Harker 81]. Greater density means a higher transfer rate when the
information is found, and extra heads can reduce the average seek time, but
the raw seek time only improved at a rate of 7% per year. There is no
reason to expect a faster rate in the near future.
To maintain balance, computer systems have been using even larger
main memories or solid state disks to buffer some of the I/O activity.
This may be a fine solution for applications whose I/O activity has
locality of reference and for which volatility is not an issue, but
applications dominated by a high rate of random requests for small pieces
of data (such as transaction processing) or by a low number of requests for
massive amounts of data (such as large simulations running on
supercomputers) are facing a serious performance limitation.
2. The Pending I/O Crisis
What is the impact of improving the performance of some pieces of a
problem while leaving others the same? Amdahl's answer is now known
as Amdahl's Law [Amdahl 67]:

S = 1 / ((1-f) + f/k)

where:
S = the effective speedup;
f = fraction of work in faster mode; and
k = speedup while in faster mode.
Suppose that some current applications spend 10% of their time in
I/O. Then when computers are 10X faster--according to Bill Joy in just
over three years--then Amdahl's Law predicts effective speedup will be only
5X. When we have computers 100X faster--via evolution of uniprocessors
or by multiprocessors--this application will be less than 10X faster,
wasting 90% of the potential speedup.
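As a quick check of these numbers, here is a minimal Python sketch, not from the paper, that plugs the text's example into Amdahl's Law; f = 0.9 corresponds to the 90% of time that is not spent in I/O.

    def amdahl_speedup(f, k):
        """Amdahl's Law: S = 1 / ((1 - f) + f / k)."""
        return 1.0 / ((1.0 - f) + f / k)

    # Application spends 10% of its time in I/O, so f = 0.9 of the work speeds up.
    print(amdahl_speedup(0.9, 10))    # ~5.3, the "only 5X" in the text
    print(amdahl_speedup(0.9, 100))   # ~9.2, the "less than 10X" in the text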
`
`
`
`
While we can imagine improvements in software file systems via
buffering for near term I/O demands, we need innovation to avoid an I/O
crisis [Boral 83].
`
3. A Solution: Arrays of Inexpensive Disks
Rapid improvements in capacity of large disks have not been the only
target of disk designers, since personal computers have created a market for
inexpensive magnetic disks. These lower cost disks have lower perfor-
mance as well as less capacity. Table I below compares the top-of-the-line
IBM 3380 model AK4 mainframe disk, Fujitsu M2361A "Super Eagle"
minicomputer disk, and the Conner Peripherals CP 3100 personal
computer disk.
`
Characteristics                  IBM 3380   Fujitsu    Conners   3380 v.   2361 v.
                                            M2361A     CP3100    3100      3100
                                                                 (>1 means 3100 is better)
Disk diameter (inches)           14         10.5       3.5       4         3
Formatted Data Capacity (MB)     7500       600        100       .01       .2
Price/MB (controller incl.)      $18-$10    $20-$17    $10-$7    1-2.5     1.7-3
MTTF Rated (hours)               30,000     20,000     30,000    1         1.5
MTTF in practice (hours)         100,000    ?          ?         ?         ?
No. Actuators                    4          1          1         .2        1
Maximum I/O's/second/Actuator    50         40         30        .6        .8
Typical I/O's/second/Actuator    30         24         20        .7        .8
Maximum I/O's/second/box         200        40         30        .2        .8
Typical I/O's/second/box         120        24         20        .2        .8
Transfer Rate (MB/sec)           3          2.5        1         .3        .4
Power/box (W)                    6,600      640        10        660       64
Volume (cu. ft.)                 24         3.4        .03       800       110
Table I. Comparison of IBM 3380 disk model AK4 for mainframe
computers, the Fujitsu M2361A "Super Eagle" disk for minicomputers,
and the Conners Peripherals CP 3100 disk for personal computers. By
"Maximum I/O's/second" we mean the maximum number of average seeks
and average rotates for a single sector access. Cost and reliability
information on the 3380 comes from widespread experience [IBM 87]
[Gawlick 87] and the information on the Fujitsu from the manual [Fujitsu
87], while some numbers on the new CP3100 are based on speculation.
The price per megabyte is given as a range to allow for different prices for
volume discount and different mark-up practices of the vendors. (The 8
watt maximum power of the CP3100 was increased to 10 watts to allow
for the inefficiency of an external power supply, since the other drives
contain their own power supplies).
`
One surprising fact is that the number of I/Os per second per actuator in an
inexpensive disk is within a factor of two of the large disks. In several of
the remaining metrics, including price per megabyte, the inexpensive disk
is superior or equal to the large disks.
The small size and low power are even more impressive since disks
such as the CP3100 contain full track buffers and most functions of the
traditional mainframe controller. Small disk manufacturers can provide
such functions in high volume disks because of the efforts of standards
committees in defining higher level peripheral interfaces, such as the ANSI
X3.131-1986 Small Computer System Interface (SCSI). Such standards
have encouraged companies like Adaptec to offer SCSI interfaces as single
chips, in turn allowing disk companies to embed mainframe controller
functions at low cost. Figure 1 compares the traditional mainframe disk
approach and the small computer disk approach. The same SCSI interface
chip embedded as a controller in every disk can also be used as the direct
memory access (DMA) device at the other end of the SCSI bus.
Such characteristics lead to our proposal for building I/O systems as
arrays of inexpensive disks, either interleaved for the large transfers of
supercomputers [Kim 86][Livny 87][Salem 86] or independent for the many
small transfers of transaction processing. Using the information in Table
I, 75 inexpensive disks potentially have 12 times the I/O bandwidth of the
IBM 3380 and the same capacity, with lower power consumption and cost.
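As a rough back-of-the-envelope check, not from the paper, the Python sketch below derives the 75-disk comparison directly from the Table I entries; the roughly 12X figure uses the typical I/Os per second per box.

    # Illustrative check of the 75-CP3100 vs. one-IBM-3380 claim, using Table I values.
    cp3100 = {"capacity_mb": 100, "typical_ios": 20, "power_w": 10}
    ibm3380 = {"capacity_mb": 7500, "typical_ios": 120, "power_w": 6600}

    n = 75
    print(n * cp3100["capacity_mb"])                            # 7500 MB, same capacity as the 3380
    print(n * cp3100["typical_ios"] / ibm3380["typical_ios"])   # ~12.5 times the I/O rate
    print(n * cp3100["power_w"], "W vs", ibm3380["power_w"], "W")  # 750 W vs 6,600 W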
`
4. Caveats
We cannot explore all issues associated with such arrays in the space
available for this paper, so we concentrate on fundamental estimates of
price-performance and reliability. Our reasoning is that if there are no
advantages in price-performance or terrible disadvantages in reliability, then
there is no need to explore further. We characterize a transaction-processing
workload to evaluate performance of a collection of inexpensive disks, but
remember that such a collection is just one hardware component of a
complete transaction-processing system. While designing a complete TPS
based on these ideas is enticing, we will resist that temptation in this
paper. Cabling and packaging, certainly an issue in the cost and reliability
of an array of many inexpensive disks, is also beyond this paper's scope.
`
[Figure 1: two block diagrams, labeled "Mainframe" and "Small Computer".]

Figure 1. Comparison of organizations for typical mainframe and small
computer disk interfaces. Single chip SCSI interfaces such as the Adaptec
AIC-6250 allow the small computer to use a single chip to be the DMA
interface as well as provide an embedded controller for each disk [Adaptec
87]. (The price per megabyte in Table I includes everything in the shaded
boxes above.)
5. And Now The Bad News: Reliability
The unreliability of disks forces computer systems managers to make
backup versions of information quite frequently in case of failure. What
would be the impact on reliability of having a hundredfold increase in
disks? Assuming a constant failure rate--that is, an exponentially
distributed time to failure--and that failures are independent--both
assumptions made by disk manufacturers when calculating the Mean Time
To Failure (MTTF)--the reliability of an array of disks is:

MTTF of a Disk Array = MTTF of a Single Disk / Number of Disks in the Array

Using the information in Table I, the MTTF of 100 CP 3100 disks is
30,000/100 = 300 hours, or less than 2 weeks. Compared to the 30,000
hour (> 3 years) MTTF of the IBM 3380, this is dismal. If we consider
scaling the array to 1000 disks, then the MTTF is 30 hours or about one
day, requiring an adjective worse than dismal.
Without fault tolerance, large arrays of inexpensive disks are too
unreliable to be useful.
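A minimal Python sketch, not from the paper, of the independent, constant-failure-rate model above; the 30,000-hour disk MTTF is the Table I rating for the CP3100.

    def array_mttf(disk_mttf_hours, num_disks):
        """MTTF of an array of independent disks with exponential lifetimes."""
        return disk_mttf_hours / num_disks

    print(array_mttf(30_000, 100))    # 300 hours, under two weeks
    print(array_mttf(30_000, 1000))   # 30 hours, about a day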
6. A Better Solution: RAID
To overcome the reliability challenge, we must make use of extra
disks containing redundant information to recover the original information
when a disk fails. Our acronym for these Redundant Arrays of Inexpensive
Disks is RAID. To simplify the explanation of our final proposal and to
avoid confusion with previous work, we give a taxonomy of five different
organizations of disk arrays, beginning with mirrored disks and progressing
through a variety of alternatives with differing performance and reliability.
We refer to each organization as a RAID level.
The reader should be forewarned that we describe all levels as if
implemented in hardware solely to simplify the presentation, for RAID
ideas are applicable to software implementations as well as hardware.
Reliability. Our basic approach will be to break the arrays into
reliability groups, with each group having extra "check" disks containing
redundant information. When a disk fails we assume that within a short
time the failed disk will be replaced and the information will be
`
`
`
reconstructed onto the new disk using the redundant information. This
time is called the mean time to repair (MTTR). The MTTR can be reduced
if the system includes extra disks to act as "hot" standby spares; when a
disk fails, a replacement disk is switched in electronically. Periodically a
human operator replaces all failed disks. Here are other terms that we use:
D = total number of disks with data (not including extra check disks);
G = number of data disks in a group (not including extra check disks);
C = number of check disks in a group;
nG = D/G = number of groups;
As mentioned above, we make the same assumptions that disk
manufacturers make--that failures are exponential and independent. (An
earthquake or power surge is a situation where an array of disks might not
fail independently.) Since these reliability predictions will be very high,
we want to emphasize that the reliability is only of the disk-head
assemblies with this failure model, and not the whole software and
electronic system. In addition, in our view the pace of technology means
extremely high MTTF are "overkill"--for, independent of expected lifetime,
users will replace obsolete disks. After all, how many people are still
using 20 year old disks?
The general MTTF calculation for single-error repairing RAID is
given in two steps. First, the group MTTF is:

MTTF_Group = (MTTF_Disk / (G+C)) * 1 / (Probability of another failure in a group before repairing the dead disk)

As more formally derived in the appendix, the probability of a second
failure before the first has been repaired is:

Probability of Another Failure = MTTR / (MTTF_Disk / (No. Disks - 1)) = MTTR / (MTTF_Disk / (G+C-1))

The intuition behind the formal calculation in the appendix comes
from trying to calculate the average number of second disk failures during
the repair time for X single disk failures. Since we assume that disk failures
occur at a uniform rate, this average number of second failures during the
repair time for X first failures is

X * MTTR / (MTTF of remaining disks in the group)

The average number of second failures for a single disk is then

MTTR / (MTTF_Disk / No. of remaining disks in the group)

The MTTF of the remaining disks is just the MTTF of a single disk
divided by the number of good disks in the group, giving the result above.
The second step is the reliability of the whole system, which is
approximately (since MTTF_Group is not quite distributed exponentially):

MTTF_RAID = MTTF_Group / nG

Plugging it all together, we get:

MTTF_RAID = (MTTF_Disk / (G+C)) * (MTTF_Disk / ((G+C-1) * MTTR)) * (1 / nG)
          = (MTTF_Disk)^2 / ((G+C) * nG * (G+C-1) * MTTR)
`
`
Since the formula is the same for each level, we make the abstract
numbers concrete using these parameters as appropriate: D=100 total data
disks, G=10 data disks per group, MTTF_Disk = 30,000 hours, MTTR = 1
hour, with the check disks per group C determined by the RAID level.
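To make the two-step calculation above concrete, here is a small Python sketch, not from the paper, that evaluates the combined formula with these parameters; the value of C for each level is supplied by the caller.

    def raid_mttf(mttf_disk, mttr, d, g, c):
        """Single-error-repairing RAID:
        (MTTF_Disk)^2 / ((G+C) * nG * (G+C-1) * MTTR), with nG = D/G groups."""
        ng = d / g
        return mttf_disk ** 2 / ((g + c) * ng * (g + c - 1) * mttr)

    # D=100 data disks, G=10 per group, MTTF_Disk=30,000 hours, MTTR=1 hour.
    # With one check disk per group (C=1, as in a Level 3 RAID):
    print(raid_mttf(30_000, 1, d=100, g=10, c=1))   # ~820,000 hours, matching Table IV's G=10 entry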
Reliability Overhead Cost. This is simply the extra check
disks, expressed as a percentage of the number of data disks D. As we shall
see below, the cost varies with RAID level from 100% down to 4%.
Useable Storage Capacity Percentage. Another way to
express this reliability overhead is in terms of the percentage of the total
capacity of data disks and check disks that can be used to store data.
Depending on the organization, this varies from a low of 50% to a high of
96%.
Performance. Since supercomputer applications and
transaction-processing systems have different access patterns and rates, we
need different metrics to evaluate both. For supercomputers we count the
number of reads and writes per second for large blocks of data, with large
defined as getting at least one sector from each data disk in a group. During
large transfers all the disks in a group act as a single unit, each reading or
writing a portion of the large data block in parallel.
A better measure for transaction-processing systems is the number of
individual reads or writes per second. Since transaction-processing
systems (e.g., debits/credits) use a read-modify-write sequence of disk
accesses, we include that metric as well. Ideally during small transfers each
disk in a group can act independently, either reading or writing independent
information. In summary, supercomputer applications need a high data rate
while transaction-processing needs a high I/O rate.
For both the large and small transfer calculations we assume the
minimum user request is a sector, that a sector is small relative to a track,
and that there is enough work to keep every device busy. Thus sector size
affects both disk storage efficiency and transfer size. Figure 2 shows the
ideal operation of large and small disk accesses in a RAID.
`
[Figure 2: (a) Single Large or "Grouped" Read (1 read spread over G disks);
(b) Several Small or Individual Reads and Writes (G reads and/or writes spread over G disks).]

Figure 2. Large transfers vs. small transfers in a group of G disks.
The six performance metrics are then the number of reads, writes, and
read-modify-writes per second for both large (grouped) or small (individual)
transfers. Rather than give absolute numbers for each metric, we calculate
efficiency: the number of events per second for a RAID relative to the
corresponding events per second for a single disk. (This is Boral's I/O
bandwidth per gigabyte [Boral 83] scaled to gigabytes per disk.) In this
paper we are after fundamental differences so we use simple, deterministic
throughput measures for our performance metric rather than latency.
Effective Performance Per Disk. The cost of disks can be a
large portion of the cost of a database system, so the I/O performance per
disk--factoring in the overhead of the check disks--suggests the
cost/performance of a system. This is the bottom line for a RAID.
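Where the tables that follow report "Efficiency Per Disk," the figure can be reproduced by dividing the whole-RAID events per second, expressed relative to a single disk, by the total number of disks, check disks included. A minimal Python sketch, not from the paper, under that reading:

    # Illustrative only. D and the slowdown factor S stay symbolic in the tables,
    # so we track just the coefficients of D.
    def per_disk_efficiency(full_raid_coeff, total_disks_coeff):
        return full_raid_coeff / total_disks_coeff

    # Level 1 (mirroring): large reads are 2D/S over 2D disks -> 1.00/S per disk.
    print(per_disk_efficiency(2.0, 2.0))
    # Level 2 with G=10, C=4: large reads are D/S over 1.4D disks -> about .71/S per disk.
    print(round(per_disk_efficiency(1.0, 1.4), 2))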
`
`
`
`
7. First Level RAID: Mirrored Disks
Mirrored disks are a traditional approach for improving reliability of
magnetic disks. This is the most expensive option we consider since all
disks are duplicated (G=1 and C=1), and every write to a data disk is also a
write to a check disk. Tandem doubles the number of controllers for fault
tolerance, allowing an optimized version of mirrored disks that lets reads
occur in parallel. Table II shows the metrics for a Level 1 RAID assuming
this optimization.
`
MTTF                              Exceeds Useful Product Lifetime
                                  (4,500,000 hrs or > 500 years)
Total Number of Disks             2D
Overhead Cost                     100%
Useable Storage Capacity          50%

Events/Sec vs. Single Disk        Full RAID    Efficiency Per Disk
Large (or Grouped) Reads          2D/S         1.00/S
Large (or Grouped) Writes         D/S          .50/S
Large (or Grouped) R-M-W          4D/3S        .67/S
Small (or Individual) Reads       2D           1.00
Small (or Individual) Writes      D            .50
Small (or Individual) R-M-W       4D/3         .67
Table II. Characteristics of Level 1 RAID. Here we assume that writes
are not slowed by waiting for the second write to complete because the
slowdown for writing 2 disks is minor compared to the slowdown S for
writing a whole group of 10 to 25 disks. Unlike a "pure" mirrored scheme
with extra disks that are invisible to the software, we assume an optimized
scheme with twice as many controllers allowing parallel reads to all disks,
giving full disk bandwidth for large reads and allowing the reads of
read-modify-writes to occur in parallel.
`
When individual accesses are distributed across multiple disks, average
queueing, seek, and rotate delays may differ from the single disk case.
Although bandwidth may be unchanged, it is distributed more evenly,
reducing variance in queueing delay and, if the disk load is not too high,
also reducing the expected queueing delay through parallelism [Livny 87].
When many arms seek to the same track then rotate to the described sector,
the average seek and rotate time will be larger than the average for a single
disk, tending toward the worst case times. This effect should not generally
more than double the average access time to a single sector while still
getting many sectors in parallel. In the special case of mirrored disks with
sufficient controllers, the choice between arms that can read any data sector
will reduce the time for the average read seek by up to 45% [Bitton 88].
To allow for these factors but to retain our fundamental emphasis, we
apply a slowdown factor, S, when there are more than two disks in a
group. In general, 1 ≤ S ≤ 2 whenever groups of disks work in parallel.
With synchronous disks the spindles of all disks in the group are
synchronous so that the corresponding sectors of a group of disks pass
under the heads simultaneously [Kurzweil 88], so for synchronous disks
there is no slowdown and S = 1. Since a Level 1 RAID has only one data
disk in its group, we assume that the large transfer requires the same
number of disks acting in concert as found in groups of the higher level
RAIDs: 10 to 25 disks.
Duplicating all disks can mean doubling the cost of the database
system or using only 50% of the disk storage capacity. Such largess
inspires the next levels of RAID.
8. Second Level RAID: Hamming Code for ECC
The history of main memory organizations suggests a way to reduce
the cost of reliability. With the introduction of 4K and 16K DRAMs,
computer designers discovered that these new devices were subject to
losing information due to alpha particles. Since there were many single
bit DRAMs in a system and since they were usually accessed in groups of
16 to 64 chips at a time, system designers added redundant chips to correct
single errors and to detect double errors in each group. This increased the
number of memory chips by 12% to 38%--depending on the size of the
group--but it significantly improved reliability.
As long as all the data bits in a group are read or written together,
there is no impact on performance. However, reads of less than the group
size require reading the whole group to be sure the information is correct,
and writes to a portion of the group mean three steps:

1) a read step to get all the rest of the data;
2) a modify step to merge the new and old information;
3) a write step to write the full group, including check information.

Since we have scores of disks in a RAID and since some accesses are
to groups of disks, we can mimic the DRAM solution by bit-interleaving
the data across the disks of a group and then add enough check disks to
detect and correct a single error. A single parity disk can detect a single
error, but to correct an error we need enough check disks to identify the
disk with the error. For a group size of 10 data disks (G) we need 4 check
disks (C) in total, and if G = 25 then C = 5 [Hamming 50]. To keep down
the cost of redundancy, we assume the group size will vary from 10 to 25.
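The check-disk counts quoted above are consistent with the usual single-error-correcting Hamming bound, which requires enough check positions to name any one of the G + C positions or "no error." A small Python sketch, not from the paper, that reproduces C = 4 for G = 10 and C = 5 for G = 25 under that assumption:

    def check_disks_needed(g):
        """Smallest C with 2^C >= G + C + 1 (single-error-correcting Hamming bound)."""
        c = 1
        while 2 ** c < g + c + 1:
            c += 1
        return c

    print(check_disks_needed(10))  # 4, as in the text
    print(check_disks_needed(25))  # 5, as in the text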
Since our individual data transfer unit is just a sector, bit-interleaved
disks mean that a large transfer for this RAID must be at least G sectors.
Like DRAMs, reads to a smaller amount implies reading a full sector from
each of the bit-interleaved disks in a group, and writes of a single unit
involve the read-modify-write cycle to all the disks. Table III shows the
metrics of this Level 2 RAID.
MTTF                        Exceeds Useful Lifetime

                            G=10                       G=25
                            (494,500 hrs               (103,500 hrs
                            or >50 years)              or 12 years)
Total Number of Disks       1.40D                      1.20D
Overhead Cost               40%                        20%
Useable Storage Capacity    71%                        83%

Events/Sec      Full RAID   Efficiency Per Disk        Efficiency Per Disk
(vs. Single Disk)           L2        L2/L1            L2        L2/L1
Large Reads     D/S         .71/S     71%              .86/S     86%
Large Writes    D/S         .71/S     143%             .86/S     172%
Large R-M-W     D/S         .71/S     107%             .86/S     129%
Small Reads     D/SG        .07/S     6%               .03/S     3%
Small Writes    D/2SG       .04/S     6%               .02/S     3%
Small R-M-W     D/SG        .07/S     9%               .03/S     4%
Table III. Characteristics of a Level 2 RAID. The L2/L1 column gives
the % performance of level 2 in terms of level 1 (>100% means L2 is
faster). As long as the transfer unit is large enough to spread over all the
data disks of a group, the large I/Os get the full bandwidth of each disk,
divided by S to allow all disks in a group to complete. Level 1 large reads
are faster because data is duplicated and so the redundancy disks can also do
independent accesses. Small I/Os still require accessing all the disks in a
group, so only D/G small I/Os can happen at a time, again divided by S to
allow a group of disks to finish. Small Level 2 writes are like small
R-M-W because full sectors must be read before new data can be written
onto part of each sector.
`
For large writes, the level 2 system has the same performance as level
1 even though it uses fewer check disks, and so on a per disk basis it
outperforms level 1. For small data transfers the performance is dismal
either for the whole system or per disk; all the disks of a group must be
accessed for a small transfer, limiting the maximum number of
simultaneous accesses to D/G. We also include the slowdown factor S
since the access must wait for all the disks to complete.
Thus level 2 RAID is desirable for supercomputers but inappropriate
for transaction processing systems, with increasing group size increasing
the disparity in performance per disk for the two applications. In
recognition of this fact, Thinking Machines Incorporated announced a
Level 2 RAID this year for its Connection Machine supercomputer called
the "Data Vault," with G = 32 and C = 8, including one hot standby spare
[Hillis 87].
Before improving small data transfers, we concentrate once more on
lowering the cost.
`
9. Third Level RAID: Single Check Disk Per Group
Most check disks in the level 2 RAID are used to determine which
disk failed, for only one redundant parity disk is needed to detect an error.
These extra disks are truly "redundant" since most disk controllers can
already detect if a disk failed: either through special signals provided in the
disk interface or the extra checking information at the end of a sector used
to detect and correct soft errors. So information on the failed disk can be
reconstructed by calculating the parity of the remaining good disks and
then comparing bit-by-bit to the parity calculated for the original full
group. When these two parities agree, the failed bit was a 0; otherwise it
was a 1. If the check disk is the failure, just read all the data disks and store
the group parity in the replacement disk.
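A minimal Python sketch, not from the paper, of this reconstruction rule, treating each disk's contents as a byte string: the check disk stores the XOR of the data disks, and a failed disk is rebuilt as the XOR of the surviving data disks and the check disk, which is equivalent to the bit-by-bit parity comparison described above.

    from functools import reduce

    def xor_blocks(blocks):
        """Bytewise XOR of equal-length blocks (one block per disk)."""
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    data = [b"\x0f\x10", b"\xaa\x01", b"\x33\x44"]   # three data disks
    parity = xor_blocks(data)                        # contents of the single check disk

    # Disk 1 fails: rebuild it from the remaining data disks plus the check disk.
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]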
Reducing the check disks to one per group (C=1) reduces the overhead
cost to between 4% and 10% for the group sizes considered here. The
performance for the third level RAID system is the same as the Level 2
RAID, but the effective performance per disk increases since it needs fewer
check disks. This reduction in total disks also increases reliability, but
since it is still larger than the useful lifetime of disks, this is a minor
point. One advantage of a level 2 system over level 3 is that the extra
check information associated with each sector to correct soft errors is not
needed, increasing the capacity per disk by perhaps 10%. Level 2 also
allows all soft errors to be corrected "on the fly" without having to reread a
sector. Table IV summarizes the third level RAID characteristics and
Figure 3 compares the sector layout and check disks for levels 2 and 3.
`
MTTF                        Exceeds Useful Lifetime

                            G=10                       G=25
                            (820,000 hrs               (346,000 hrs
                            or >90 years)              or 40 years)
Total Number of Disks       1.10D                      1.04D
Overhead Cost               10%                        4%
Useable Storage Capacity    91%                        96%

Events/Sec      Full RAID   Efficiency Per Disk        Efficiency Per Disk
(vs. Single Disk)           L3     L3/L2   L3/L1       L3     L3/L2   L3/L1
Large Reads     D/S         .91/S  127%    91%         .96/S  112%    96%
Large Writes    D/S         .91/S  127%    182%        .96/S  112%    192%
Large R-M-W     D/S         .91/S  127%    136%        .96/S  112%    142%
Small Reads     D/SG        .09/S  127%    8%          .04/S  112%    3%
Small Writes    D/2SG       .05/S  127%    8%          .02/S  112%    3%
Small R-M-W     D/SG        .09/S  127%    11%         .04/S  112%    5%
`
Table IV. Characteristics of a Level 3 RAID. The L3/L2 column gives
the % performance of L3 in terms of L2 and the L3/L1 column gives it in
terms of L1 (>100% means L3 is faster). The performance for the full
systems is the same in RAID levels 2 and 3, but since there are fewer
check disks the performance per disk improves.
`
Park and Balasubramanian proposed a third level RAID system
without suggesting a particular application [Park 86]. Our calculations
suggest it is a much better match to supercomputer applications than to
transaction processing systems. This year two disk manufacturers have
announced level 3 RAIDs for such applications using synchronized 5.25
inch disks with G=4 and C=1: one from Maxtor and one from Micropolis
[Maginnis 87].
This third level has brought the reliability overhead cost to its lowest
level, so in the last two levels we improve performance of small accesses
without changing cost or reliability.
10. Fourth Level RAID: Independent Reads/Writes
Spreading a transfer across all disks within the group has the
following advantage:
Large or grouped transfer time is reduced because the transfer
bandwidth of the entire array can be exploited.
But it has the following disadvantages as well:
Reading/writing to a disk in a group requires reading/writing to
all the disks in a group; levels 2 and 3 RAIDs can perform only
one I/O at a time per group.
If the disks are not synchronized, you do not see average seek
and rotational delays; the observed delays should move towards
the worst case, hence the S factor in the equations above.
This fourth level RAID improves performance of small transfers through
parallelism--the ability to do more than one I/O per group at a time. We
no longer spread the individual transfer information across several disks,
but keep each individual