A Case for Redundant Arrays of Inexpensive Disks (RAID)
David A. Patterson, Garth Gibson, and Randy H. Katz
Computer Science Division
Department of Electrical Engineering and Computer Sciences
571 Evans Hall
University of California
Berkeley, CA 94720
(pattrsn@ginger.berkeley.edu)
Abstract. Increasing performance of CPUs and memories will be squandered if not matched by a similar performance increase in I/O. While the capacity of Single Large Expensive Disks (SLED) has grown rapidly, the performance improvement of SLED has been modest. Redundant Arrays of Inexpensive Disks (RAID), based on the magnetic disk technology developed for personal computers, offers an attractive alternative to SLED, promising improvements of an order of magnitude in performance, reliability, power consumption, and scalability. This paper introduces five levels of RAIDs, giving their relative cost/performance, and compares RAID to an IBM 3380 and a Fujitsu Super Eagle.
1. Background: Rising CPU and Memory Performance
The users of computers are currently enjoying unprecedented growth in the speed of computers. Gordon Bell said that between 1974 and 1984, single chip computers improved in performance by 40% per year, about twice the rate of minicomputers [Bell 84]. In the following year Bill Joy predicted an even faster growth [Joy 85]:

$$\mathrm{MIPS} = 2^{\,\mathrm{Year}-1984}$$

Mainframe and supercomputer manufacturers, having difficulty keeping pace with the rapid growth predicted by "Joy's Law," cope by offering multiprocessors as their top-of-the-line product.
But a fast CPU does not a fast system make. Gene Amdahl related CPU speed to main memory size using this rule [Siewiorek 82]:

Each CPU instruction per second requires one byte of main memory.

If computer system costs are not to be dominated by the cost of memory, then Amdahl's constant suggests that memory chip capacity should grow at the same rate. Gordon Moore predicted that growth rate over 20 years ago:

$$\mathrm{transistors/chip} = 2^{\,\mathrm{Year}-1964}$$

As predicted by Moore's Law, RAMs have quadrupled in capacity every two [Moore 75] to three years [Myers 86].
Recently the ratio of megabytes of main memory to MIPS has been defined as alpha [Garcia 84], with Amdahl's constant meaning alpha = 1. In part because of the rapid drop of memory prices, main memory sizes have grown faster than CPU speeds and many machines are shipped today with alphas of 3 or higher.
To maintain the balance of costs in computer systems, secondary storage must match the advances in other parts of the system. A key measure
of magnetic disk technology is the growth in the maximum number of bits that can be stored per square inch, or the bits per inch in a track times the number of tracks per inch. Called M.A.D., for maximal areal density, the "First Law in Disk Density" predicts [Frank 87]:

$$\mathrm{MAD} = 10^{(\mathrm{Year}-1971)/10}$$

Magnetic disk technology has doubled capacity and halved price every three years, in line with the growth rate of semiconductor memory, and in practice between 1967 and 1979 the disk capacity of the average IBM data processing system more than kept up with its main memory [Stevens 81].
Capacity is not the only memory characteristic that must grow rapidly to maintain system balance, since the speed with which instructions and data are delivered to a CPU also determines its ultimate performance. The speed of main memory has kept pace for two reasons:
(1) the invention of caches, showing that a small buffer can be managed automatically to contain a substantial fraction of memory references;
(2) the SRAM technology, used to build caches, whose speed has improved at the rate of 40% to 100% per year.
In contrast to primary memory technologies, the performance of single large expensive magnetic disks (SLED) has improved at a modest rate. These mechanical devices are dominated by the seek and the rotation delays: from 1971 to 1981, the raw seek time for a high-end IBM disk improved by only a factor of two while the rotation time did not change [Harker 81]. Greater density means a higher transfer rate when the information is found, and extra heads can reduce the average seek time, but the raw seek time only improved at a rate of 7% per year. There is no reason to expect a faster rate in the near future.
To maintain balance, computer systems have been using even larger main memories or solid state disks to buffer some of the I/O activity. This may be a fine solution for applications whose I/O activity has locality of reference and for which volatility is not an issue, but applications dominated by a high rate of random requests for small pieces of data (such as transaction-processing) or by a low number of requests for massive amounts of data (such as large simulations running on supercomputers) are facing a serious performance limitation.
2. The Pending I/O Crisis
What is the impact of improving the performance of some pieces of a problem while leaving others the same? Amdahl's answer is now known as Amdahl's Law [Amdahl 67]:

$$S = \frac{1}{(1-f) + f/k}$$

where:
S = the effective speedup;
f = fraction of work in faster mode; and
k = speedup while in faster mode.
Suppose that some current applications spend 10% of their time in I/O. Then when computers are 10X faster - according to Bill Joy, in just over three years - Amdahl's Law predicts effective speedup will be only 5X. When we have computers 100X faster - via evolution of uniprocessors or by multiprocessors - this application will be less than 10X faster, wasting 90% of the potential speedup.
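To make the arithmetic concrete, here is a small calculation (ours, not from the paper) that plugs these numbers into Amdahl's Law; the 90% of the work that is not I/O is the part sped up by the factor k.

    # A minimal sketch of the speedups claimed above: f = 0.9 of the work runs
    # k times faster, while the 10% spent in I/O does not speed up at all.

    def amdahl_speedup(f, k):
        return 1.0 / ((1.0 - f) + f / k)

    print(amdahl_speedup(0.9, 10))    # ~5.3: roughly the 5X quoted above
    print(amdahl_speedup(0.9, 100))   # ~9.2: less than 10X, wasting most of the 100X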
price-performance and reliability. Our reasoning is that if there are no advantages in price-performance or terrible disadvantages in reliability, then there is no need to explore further. We characterize a transaction-processing workload to evaluate performance of a collection of inexpensive disks, but remember that such a collection is just one hardware component of a complete transaction-processing system. While designing a complete TPS based on these ideas is enticing, we will resist that temptation in this paper. Cabling and packaging, certainly an issue in the cost and reliability of an array of many inexpensive disks, is also beyond this paper's scope.
[Figure 1 omitted: diagrams contrasting a "Mainframe" and a "Small Computer" disk interface organization.]
Figure 1. Comparison of organizations for typical mainframe and small computer disk interfaces. Single chip SCSI interfaces such as the Adaptec AIC-6250 allow the small computer to use a single chip to be the DMA interface as well as provide an embedded controller for each disk [Adaptec 87]. (The price per megabyte in Table I includes everything in the shaded boxes above.)
5. And Now The Bad News: Reliability
The unreliability of disks forces computer systems managers to make backup versions of information quite frequently in case of failure. What would be the impact on reliability of having a hundredfold increase in disks? Assuming a constant failure rate (that is, an exponentially distributed time to failure) and that failures are independent (both assumptions made by disk manufacturers when calculating the Mean Time To Failure (MTTF)), the reliability of an array of disks is:

$$\text{MTTF of a Disk Array} = \frac{\text{MTTF of a Single Disk}}{\text{Number of Disks in the Array}}$$

Using the information in Table I, the MTTF of 100 CP 3100 disks is 30,000/100 = 300 hours, or less than 2 weeks. Compared to the 30,000 hour (> 3 years) MTTF of the IBM 3380, this is dismal. If we consider scaling the array to 1000 disks, then the MTTF is 30 hours or about one day, requiring an adjective worse than dismal.
Without fault tolerance, large arrays of inexpensive disks are too unreliable to be useful.
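As a quick check (our own arithmetic, not the paper's), the formula above for an unprotected array is a one-liner:

    # A minimal sketch of the array MTTF under the constant-failure-rate,
    # independent-failure assumptions stated above.

    def array_mttf(single_disk_mttf_hours, num_disks):
        return single_disk_mttf_hours / num_disks

    print(array_mttf(30_000, 100))    # 300 hours, i.e. less than two weeks
    print(array_mttf(30_000, 1000))   # 30 hours, i.e. about one day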
6. A Better Solution: RAID
To overcome the reliability challenge, we must make use of extra disks containing redundant information to recover the original information when a disk fails. Our acronym for these Redundant Arrays of Inexpensive Disks is RAID. To simplify the explanation of our final proposal and to avoid confusion with previous work, we give a taxonomy of five different organizations of disk arrays, beginning with mirrored disks and progressing through a variety of alternatives with differing performance and reliability. We refer to each organization as a RAID level.
The reader should be forewarned that we describe all levels as if implemented in hardware solely to simplify the presentation, for RAID ideas are applicable to software implementations as well as hardware.
Reliability. Our basic approach will be to break the arrays into reliability groups, with each group having extra "check" disks containing redundant information. When a disk fails we assume that within a short time the failed disk will be replaced and the information will be
While we can imagine improvements in software file systems via buffering for near term I/O demands, we need innovation to avoid an I/O crisis [Boral 83].
3. A Solution: Arrays of Inexpensive Disks
Rapid improvements in capacity of large disks have not been the only target of disk designers, since personal computers have created a market for inexpensive magnetic disks. These lower cost disks have lower performance as well as less capacity. Table I below compares the top-of-the-line IBM 3380 model AK4 mainframe disk, the Fujitsu M2361A "Super Eagle" minicomputer disk, and the Conner Peripherals CP 3100 personal computer disk.
Characteristics                    IBM       Fujitsu    Conners    3380 v.    2361 v.
                                   3380      M2361A     CP3100     3100       3100
                                                                   (>1 means 3100 is better)
Disk diameter (inches)             14        10.5       3.5        4          3
Formatted Data Capacity (MB)       7500      600        100        .01        .2
Price/MB (controller incl.)        $18-$10   $20-$17    $10-$7     1-2.5      1.7-3
MTTF Rated (hours)                 30,000    20,000     30,000     1          1.5
MTTF in practice (hours)           100,000   ?          ?          ?          ?
No. Actuators                      4         1          1          .2         1
Maximum I/O's/second/Actuator      50        40         30         .6         .8
Typical I/O's/second/Actuator      30        24         20         .7         .8
Maximum I/O's/second/box           200       40         30         .2         .8
Typical I/O's/second/box           120       24         20         .2         .8
Transfer Rate (MB/sec)             3         2.5        1          .3         .4
Power/box (W)                      6,600     640        10         660        64
Volume (cu. ft.)                   24        3.4        .030
Table I. Comparison of IBM 3380 disk model AK4 for mainframe computers, the Fujitsu M2361A "Super Eagle" disk for minicomputers, and the Conners Peripherals CP 3100 disk for personal computers. By "Maximum I/O's/second" we mean the maximum number of average seeks and average rotates for a single sector access. Cost and reliability information on the 3380 comes from widespread experience [IBM 87] [Gawlick 87] and the information on the Fujitsu from the manual [Fujitsu 87], while some numbers on the new CP3100 are based on speculation. The price per megabyte is given as a range to allow for different prices for volume discount and different mark-up practices of the vendors. (The 8 watt maximum power of the CP3100 was increased to 10 watts to allow for the inefficiency of an external power supply, since the other drives contain their own power supplies.)
One surprising fact is that the number of I/Os per second per actuator in an inexpensive disk is within a factor of two of the large disks. In several of the remaining metrics, including price per megabyte, the inexpensive disk is superior or equal to the large disks.
The small size and low power are even more impressive since disks such as the CP3100 contain full track buffers and most functions of the traditional mainframe controller. Small disk manufacturers can provide such functions in high volume disks because of the efforts of standards committees in defining higher level peripheral interfaces, such as the ANSI X3.131-1986 Small Computer System Interface (SCSI). Such standards have encouraged companies like Adaptec to offer SCSI interfaces as single chips, in turn allowing disk companies to embed mainframe controller functions at low cost. Figure 1 compares the traditional mainframe disk approach and the small computer disk approach. The same SCSI interface chip embedded as a controller in every disk can also be used as the direct memory access (DMA) device at the other end of the SCSI bus.
Such characteristics lead to our proposal for building I/O systems as arrays of inexpensive disks, either interleaved for the large transfers of supercomputers [Kim 86][Livny 87][Salem 86] or independent for the many small transfers of transaction processing. Using the information in Table I, 75 inexpensive disks potentially have 12 times the I/O bandwidth of the IBM 3380 and the same capacity, with lower power consumption and cost.
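The arithmetic behind that claim can be checked directly against Table I; the following sketch (ours, using the "typical" rows of the table, not from the paper) does so.

    # 75 CP 3100s versus one IBM 3380, using Table I's typical I/O rates.
    n = 75
    cp3100_capacity_mb, cp3100_typical_ios, cp3100_power_w = 100, 20, 10
    ibm3380_capacity_mb, ibm3380_typical_ios, ibm3380_power_w = 7500, 120, 6600

    print(n * cp3100_capacity_mb)                          # 7500 MB: the same capacity
    print(n * cp3100_typical_ios / ibm3380_typical_ios)    # 12.5: about 12 times the I/O rate
    print(n * cp3100_power_w, "W versus", ibm3380_power_w, "W")   # 750 W versus 6,600 W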
4. Caveats
We cannot explore all issues associated with such arrays in the space available for this paper, so we concentrate on fundamental estimates of
reconstructed onto the new disk using the redundant information. This time is called the mean time to repair (MTTR). The MTTR can be reduced if the system includes extra disks to act as "hot" standby spares; when a disk fails, a replacement disk is switched in electronically. Periodically a human operator replaces all failed disks. Here are other terms that we use:
D = total number of disks with data (not including extra check disks);
G = number of data disks in a group (not including extra check disks);
C = number of check disks in a group;
n_G = D/G = number of groups.
As mentioned above, we make the same assumptions that disk manufacturers make: that failures are exponential and independent. (An earthquake or power surge is a situation where an array of disks might not fail independently.) Since these reliability predictions will be very high, we want to emphasize that the reliability is only of the disk-head assemblies with this failure model, and not the whole software and electronic system. In addition, in our view the pace of technology means extremely high MTTF are "overkill"; for, independent of expected lifetime, users will replace obsolete disks. After all, how many people are still using 20 year old disks?
The general MTTF calculation for single-error repairing RAID is given in two steps. First, the group MTTF is:

$$\mathrm{MTTF}_{Group} = \frac{\mathrm{MTTF}_{Disk}}{G+C} \times \frac{1}{\text{Probability of another failure in a group before repairing the dead disk}}$$

As more formally derived in the appendix, the probability of a second failure before the first has been repaired is:

$$\text{Probability of another failure} = \frac{\mathrm{MTTR}}{\mathrm{MTTF}_{Disk}/(\text{No. Disks}-1)} = \frac{\mathrm{MTTR}}{\mathrm{MTTF}_{Disk}/(G+C-1)}$$

The intuition behind the formal calculation in the appendix comes from trying to calculate the average number of second disk failures during the repair time for X single disk failures. Since we assume that disk failures occur at a uniform rate, this average number of second failures during the repair time for X first failures is

$$\frac{X \times \mathrm{MTTR}}{\text{MTTF of remaining disks in the group}}$$

The average number of second failures for a single disk is then

$$\frac{\mathrm{MTTR}}{\mathrm{MTTF}_{Disk} \,/\, \text{No. of remaining disks in the group}}$$

The MTTF of the remaining disks is just the MTTF of a single disk divided by the number of good disks in the group, giving the result above.
The second step is the reliability of the whole system, which is approximately (since MTTF_Group is not quite distributed exponentially):

$$\mathrm{MTTF}_{RAID} = \frac{\mathrm{MTTF}_{Group}}{n_G}$$

Plugging it all together, we get:

$$\mathrm{MTTF}_{RAID} = \frac{\mathrm{MTTF}_{Disk}}{G+C} \times \frac{\mathrm{MTTF}_{Disk}}{(G+C-1)\times\mathrm{MTTR}} \times \frac{1}{n_G}
= \frac{(\mathrm{MTTF}_{Disk})^2}{(G+C)\times n_G \times (G+C-1)\times\mathrm{MTTR}}
= \frac{(\mathrm{MTTF}_{Disk})^2}{(D + C\times n_G)\times(G+C-1)\times\mathrm{MTTR}}$$
Since the formula is the same for each level, we make the abstract numbers concrete using these parameters as appropriate: D = 100 total data disks, G = 10 data disks per group, MTTF_Disk = 30,000 hours, MTTR = 1 hour, with the check disks per group C determined by the RAID level.
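A short calculation (ours, not the authors') plugs these parameters into the formula above; the results reproduce, to rounding, the MTTF figures quoted for levels 1 through 3 in Tables II through IV below.

    # A minimal sketch evaluating the MTTF_RAID formula with the example
    # parameters: D = 100 data disks, MTTF_Disk = 30,000 hours, MTTR = 1 hour,
    # and (G, C) chosen per RAID level.

    def mttf_raid(d, g, c, mttf_disk=30_000.0, mttr=1.0):
        n_g = d / g                                   # number of groups
        return mttf_disk ** 2 / ((g + c) * n_g * (g + c - 1) * mttr)

    for level, (g, c) in {"level 1": (1, 1), "level 2": (10, 4), "level 3": (10, 1)}.items():
        print(level, round(mttf_raid(100, g, c)))     # ~4,500,000, ~494,500, ~820,000 hours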
Reliability Overhead Cost. This is simply the extra check disks, expressed as a percentage of the number of data disks D. As we shall see below, the cost varies with RAID level from 100% down to 4%.
Useable Storage Capacity Percentage. Another way to express this reliability overhead is in terms of the percentage of the total capacity of data disks and check disks that can be used to store data. Depending on the organization, this varies from a low of 50% to a high of 96%.
Performance. Since supercomputer applications and transaction-processing systems have different access patterns and rates, we need different metrics to evaluate both. For supercomputers we count the number of reads and writes per second for large blocks of data, with large defined as getting at least one sector from each data disk in a group. During large transfers all the disks in a group act as a single unit, each reading or writing a portion of the large data block in parallel.
A better measure for transaction-processing systems is the number of individual reads or writes per second. Since transaction-processing systems (e.g., debits/credits) use a read-modify-write sequence of disk accesses, we include that metric as well. Ideally during small transfers each disk in a group can act independently, either reading or writing independent information. In summary, supercomputer applications need a high data rate while transaction-processing needs a high I/O rate.
For both the large and small transfer calculations we assume the minimum user request is a sector, that a sector is small relative to a track, and that there is enough work to keep every device busy. Thus sector size affects both disk storage efficiency and transfer size. Figure 2 shows the ideal operation of large and small disk accesses in a RAID.
[Figure 2 omitted.]
(a) Single Large or "Grouped" Read (1 read spread over G disks)
(b) Several Small or Individual Reads and Writes (G reads and/or writes spread over G disks)
Figure 2. Large transfer vs. small transfers in a group of G disks.
The six performance metrics are then the number of reads, writes, and read-modify-writes per second for both large (grouped) or small (individual) transfers. Rather than give absolute numbers for each metric, we calculate efficiency: the number of events per second for a RAID relative to the corresponding events per second for a single disk. (This is Boral's I/O bandwidth per gigabyte [Boral 83] scaled to gigabytes per disk.) In this paper we are after fundamental differences so we use simple, deterministic throughput measures for our performance metric rather than latency.
Effective Performance Per Disk. The cost of disks can be a large portion of the cost of a database system, so the I/O performance per disk, factoring in the overhead of the check disks, suggests the cost/performance of a system. This is the bottom line for a RAID.
7. First Level RAID: Mirrored Disks
Mirrored disks are a traditional approach for improving reliability of magnetic disks. This is the most expensive option we consider since all disks are duplicated (G=1 and C=1), and every write to a data disk is also a write to a check disk. Tandem doubles the number of controllers for fault tolerance, allowing an optimized version of mirrored disks that lets reads occur in parallel. Table II shows the metrics for a Level 1 RAID assuming this optimization.
MTTF                             Exceeds Useful Product Lifetime
                                 (4,500,000 hrs or > 500 years)
Total Number of Disks            2D
Overhead Cost                    100%
Useable Storage Capacity         50%

Events/Sec vs. Single Disk       Full RAID       Efficiency Per Disk
Large (or Grouped) Reads         2D/S            1.00/S
Large (or Grouped) Writes        D/S             .50/S
Large (or Grouped) R-M-W         4D/3S           .67/S
Small (or Individual) Reads      2D              1.00
Small (or Individual) Writes     D               .50
Small (or Individual) R-M-W      4D/3            .67

Table II. Characteristics of Level 1 RAID. Here we assume that writes are not slowed by waiting for the second write to complete because the slowdown for writing 2 disks is minor compared to the slowdown S for writing a whole group of 10 to 25 disks. Unlike a "pure" mirrored scheme with extra disks that are invisible to the software, we assume an optimized scheme with twice as many controllers allowing parallel reads to all disks, giving full disk bandwidth for large reads and allowing the reads of read-modify-writes to occur in parallel.
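A small sketch (our reading of Table II, not the authors' code) shows how the per-disk column follows from the full-RAID column: each rate is simply divided by the 2D disks a Level 1 RAID uses.

    # Full-RAID event rates for large transfers, in units of (single disk)/S,
    # divided by the 2D disks of a mirrored array to give the per-disk column.
    D = 100
    full_raid = {
        "large reads":  2 * D,        # either copy of each datum can serve a read
        "large writes": D,            # every write must go to both copies
        "large R-M-W":  4 * D / 3,    # one read (from either copy) plus two writes
    }
    for name, rate in full_raid.items():
        print(f"{name}: {rate / (2 * D):.2f}/S per disk")   # 1.00/S, 0.50/S, 0.67/S

The small-transfer rows are the same expressions without the factor of 1/S.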
When individual accesses are distributed across multiple disks, average queueing, seek, and rotate delays may differ from the single disk case. Although bandwidth may be unchanged, it is distributed more evenly, reducing variance in queueing delay and, if the disk load is not too high, also reducing the expected queueing delay through parallelism [Livny 87]. When many arms seek to the same track then rotate to the described sector, the average seek and rotate time will be larger than the average for a single disk, tending toward the worst case times. This effect should not generally more than double the average access time to a single sector while still getting many sectors in parallel. In the special case of mirrored disks with sufficient controllers, the choice between arms that can read any data sector will reduce the time for the average read seek by up to 45% [Bitton 88].
To allow for these factors but to retain our fundamental emphasis, we apply a slowdown factor, S, when there are more than two disks in a group. In general, 1 < S < 2 whenever groups of disks work in parallel. With synchronous disks the spindles of all disks in the group are synchronous so that the corresponding sectors of a group of disks pass under the heads simultaneously [Kurzweil 88], so for synchronous disks there is no slowdown and S=1. Since a Level 1 RAID has only one data disk in its group, we assume that the large transfer requires the same number of disks acting in concert as found in groups of the higher level RAIDs: 10 to 25 disks.
Duplicating all disks can mean doubling the cost of the database system or using only 50% of the disk storage capacity. Such largess inspires the next levels of RAID.
8. Second Level RAID: Hamming Code for ECC
The history of main memory organizations suggests a way to reduce the cost of reliability. With the introduction of 4K and 16K DRAMs, computer designers discovered that these new devices were subject to losing information due to alpha particles. Since there were many single bit DRAMs in a system and since they were usually accessed in groups of 16 to 64 chips at a time, system designers added redundant chips to correct single errors and to detect double errors in each group. This increased the number of memory chips by 12% to 38%, depending on the size of the group, but it significantly improved reliability.
As long as all the data bits in a group are read or written together, there is no impact on performance. However, reads of less than the group size require reading the whole group to be sure the information is correct, and writes to a portion of the group mean three steps:
1) a read step to get all the rest of the data;
2) a modify step to merge the new and old information;
3) a write step to write the full group, including check information.
Since we have scores of disks in a RAID and since some accesses are to groups of disks, we can mimic the DRAM solution by bit-interleaving the data across the disks of a group and then add enough check disks to detect and correct a single error. A single parity disk can detect a single error, but to correct an error we need enough check disks to identify the disk with the error. For a group size of 10 data disks (G) we need 4 check disks (C) in total, and if G = 25 then C = 5 [Hamming 50]. To keep down the cost of redundancy, we assume the group size will vary from 10 to 25.
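The check-disk counts quoted above come from the usual condition for a single-error-correcting Hamming code; a minimal sketch (ours, not from the paper) of that condition:

    # C check disks can name which of the G + C bit-interleaved disks failed
    # (or report "none") only if 2**C >= G + C + 1.
    def check_disks_needed(g):
        c = 1
        while 2 ** c < g + c + 1:
            c += 1
        return c

    print(check_disks_needed(10))   # 4, as quoted for G = 10
    print(check_disks_needed(25))   # 5, as quoted for G = 25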
Since our individual data transfer unit is just a sector, bit-interleaved disks mean that a large transfer for this RAID must be at least G sectors. Like DRAMs, reads of a smaller amount imply reading a full sector from each of the bit-interleaved disks in a group, and writes of a single unit involve the read-modify-write cycle to all the disks. Table III shows the metrics of this Level 2 RAID.
                            G=10                            G=25
MTTF                        Exceeds Useful Lifetime
                            (494,500 hrs or >50 years)      (103,500 hrs or 12 years)
Total Number of Disks       1.40D                           1.20D
Overhead Cost               40%                             20%
Useable Storage Capacity    71%                             83%

Events/Sec        Full RAID     Efficiency Per Disk         Efficiency Per Disk
(vs. Single Disk)               L2        L2/L1             L2        L2/L1
Large Reads       D/S           .71/S     71%               .86/S     86%
Large Writes      D/S           .71/S     143%              .86/S     172%
Large R-M-W       D/S           .71/S     107%              .86/S     129%
Small Reads       D/SG          .07/S     6%                .03/S     3%
Small Writes      D/2SG         .04/S     6%                .02/S     3%
Small R-M-W       D/SG          .07/S     9%                .03/S     4%
Table III. Characteristics of a Level 2 RAID. The L2/L1 column gives the % performance of level 2 in terms of level 1 (>100% means L2 is faster). As long as the transfer unit is large enough to spread over all the data disks of a group, the large I/Os get the full bandwidth of each disk, divided by S to allow all disks in a group to complete. Level 1 large reads are faster because data is duplicated and so the redundancy disks can also do independent accesses. Small I/Os still require accessing all the disks in a group, so only D/G small I/Os can happen at a time, again divided by S to allow a group of disks to finish. Small Level 2 writes are like small R-M-W because full sectors must be read before new data can be written onto part of each sector.
For large writes, the level 2 system has the same performance as level 1 even though it uses fewer check disks, and so on a per disk basis it outperforms level 1. For small data transfers the performance is dismal either for the whole system or per disk; all the disks of a group must be accessed for a small transfer, limiting the maximum number of simultaneous accesses to D/G. We also include the slowdown factor S since the access must wait for all the disks to complete.
Thus level 2 RAID is desirable for supercomputers but inappropriate for transaction processing systems, with increasing group size increasing the disparity in performance per disk for the two applications. In recognition of this fact, Thinking Machines Incorporated announced a Level 2 RAID this year for its Connection Machine supercomputer called the "Data Vault," with G = 32 and C = 8, including one hot standby spare [Hillis 87].
Before improving small data transfers, we concentrate once more on lowering the cost.
9. Third Level RAID: Single Check Disk Per Group
Most check disks in the level 2 RAID are used to determine which disk failed, for only one redundant parity disk is needed to detect an error. These extra disks are truly "redundant" since most disk controllers can already detect if a disk failed: either through special signals provided in the disk interface or the extra checking information at the end of a sector used to detect and correct soft errors. So information on the failed disk can be reconstructed by calculating the parity of the remaining good disks and then comparing bit-by-bit to the parity calculated for the original full
[Figure 3 (upper portion) omitted: sector 0 of data disks 1-4 together with ECC check information; annotation: each transfer unit is placed into a single sector, and the check info is now calculated over a piece of each transfer unit.]
group. When these two parities agree, the failed bit was a 0; otherwise it was a 1. If the check disk is the failure, just read all the data disks and store the group parity in the replacement disk.
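In code, the bit-by-bit comparison described above is just an exclusive-OR; a minimal sketch (ours, not from the paper), with a single byte standing in for each disk's sector:

    from functools import reduce

    data = [0b10110010, 0b01101100, 0b11100001, 0b00011111]   # G = 4 data disks
    parity = reduce(lambda a, b: a ^ b, data)                  # contents of the check disk

    # Suppose disk 2 fails: XOR the surviving data disks with the stored parity.
    survivors = [d for i, d in enumerate(data) if i != 2]
    rebuilt = reduce(lambda a, b: a ^ b, survivors + [parity])
    assert rebuilt == data[2]                                  # the lost sector is recovered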
Reducing the check disks to one per group (C=1) reduces the overhead cost to between 4% and 10% for the group sizes considered here. The performance for the third level RAID system is the same as the Level 2 RAID, but the effective performance per disk increases since it needs fewer check disks. This reduction in total disks also increases reliability, but since it is still larger than the useful lifetime of disks, this is a minor point. One advantage of a level 2 system over level 3 is that the extra check information associated with each sector to correct soft errors is not needed, increasing the capacity per disk by perhaps 10%. Level 2 also allows all soft errors to be corrected "on the fly" without having to reread a sector. Table IV summarizes the third level RAID characteristics and Figure 3 compares the sector layout and check disks for levels 2 and 3.
                            G=10                            G=25
MTTF                        Exceeds Useful Lifetime
                            (820,000 hrs or >90 years)      (346,000 hrs or 40 years)
Total Number of Disks       1.10D                           1.04D
Overhead Cost               10%                             4%
Useable Storage Capacity    91%                             96%

Events/Sec        Full RAID     Efficiency Per Disk               Efficiency Per Disk
(vs. Single Disk)               L3       L3/L2     L3/L1          L3       L3/L2     L3/L1
Large Reads       D/S           .91/S    127%      91%            .96/S    112%      96%
Large Writes      D/S           .91/S    127%      182%           .96/S    112%      192%
Large R-M-W       D/S           .91/S    127%      136%           .96/S    112%      142%
Small Reads       D/SG          .09/S    127%      8%             .04/S    112%      3%
Small Writes      D/2SG         .05/S    127%      8%             .02/S    112%      3%
Small R-M-W       D/SG          .09/S    127%      11%            .04/S    112%      5%
Table IV. Characteristics of a Level 3 RAID. The L3/L2 column gives the % performance of L3 in terms of L2 and the L3/L1 column gives it in terms of L1 (>100% means L3 is faster). The performance for the full systems is the same in RAID levels 2 and 3, but since there are fewer check disks the performance per disk improves.
Park and Balasubramanian proposed a third level RAID system without suggesting a particular application [Park 86]. Our calculations suggest it is a much better match to supercomputer applications than to transaction processing systems. This year two disk manufacturers have announced level 3 RAIDs for such applications using synchronized 5.25 inch disks with G=4 and C=1: one from Maxtor and one from Micropolis [Maginnis 87].
This third level has brought the reliability overhead cost to its lowest level, so in the last two levels we improve performance of small accesses without changing cost or reliability.
10. Fourth Level RAID: Independent Reads/Writes
Spreading a transfer across all disks within the group has the following advantage:
• Large or grouped transfer time is reduced because transfer bandwidth of the entire array can be exploited.
But it has the following disadvantages as well:
• Reading/writing to a disk in a group requires reading/writing to all the disks in a group; levels 2 and 3 RAIDs can perform only one I/O at a time per group.
• If the disks are not synchronized, you do not see average seek and rotational delays; the observed delays should move towards the worst case, hence the S factor in the equations above.
This fourth level RAID improves performance of small transfers through parallelism - the ability to do more than one I/O per group at a time. We no longer spread the individual transfer information across several disks, but keep each individual unit in a single disk.
The virtue of bit-interleaving is the easy calculation of the Hamming code needed to detect or correct errors in level 2. But recall that in the third level RAID we rely on the disk controller to detect errors within a single disk sector. Hence, if we store an individual transfer unit in a single sector, we can detect errors on an individual read without accessing any other disk. Figure 3 shows the different ways the information is stored in a sector for
RAID levels 2, 3, and 4. By storing a whole transfer unit in a sector, reads can be independent and operate at the maximum rate of a disk yet still detect errors. Thus the primary change between level 3 and 4 is that we interleave data between disks at the sector level rather than at the bit level.
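A small sketch (ours, not from the paper) of that difference in data placement, for a group of G = 4 data disks: with bit interleaving one transfer unit touches every disk, while with sector interleaving it lands on exactly one.

    G = 4                                        # data disks in the group

    def bit_interleave(unit_pieces):
        # Level 2/3 style: spread the pieces of one transfer unit across all G disks.
        return {disk: unit_pieces[disk::G] for disk in range(G)}

    def sector_interleave(units):
        # Level 4 style: place each whole transfer unit on a single disk, round robin.
        return {disk: [u for i, u in enumerate(units) if i % G == disk] for disk in range(G)}

    unit_a = ["a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"]   # pieces of transfer unit "a"
    print(bit_interleave(unit_a))                   # every disk holds part of "a"
    print(sector_interleave(["a", "b", "c", "d"]))  # disk 0 holds all of "a", disk 1 all of "b", ...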
[Figure 3 omitted: sector layouts for RAID levels 2, 3, and 4, showing four transfer units (a, b, c, and d) spread across data disks and check disks (ECC), with sector 0 of check disks 5 and 6; only one check disk is used in level 3.]
