A Case for Redundant Arrays of Inexpensive Disks (RAID)

David A. Patterson, Garth Gibson, and Randy H. Katz

Computer Science Division
Department of Electrical Engineering and Computer Sciences
571 Evans Hall
University of California
Berkeley, CA 94720

Abstract. Increasing performance of CPUs and memories will be squandered if not matched by a similar performance increase in I/O. While the capacity of Single Large Expensive Disks (SLED) has grown rapidly, the performance improvement of SLED has been modest. Redundant Arrays of Inexpensive Disks (RAID), based on the magnetic disk technology developed for personal computers, offers an attractive alternative to SLED, promising improvements of an order of magnitude in performance, reliability, power consumption, and scalability. This paper introduces five levels of RAIDs, giving their relative cost/performance, and compares RAID to an IBM 3380 and a Fujitsu Super Eagle.

1. Background: Rising CPU and Memory Performance

The users of computers are currently enjoying unprecedented growth in the speed of computers. Gordon Bell said that between 1974 and 1984, single chip computers improved in performance by 40% per year, about twice the rate of minicomputers [Bell 84]. In the following year Bill Joy predicted an even faster growth [Joy 85]. Mainframe and supercomputer manufacturers, having difficulty keeping pace with the rapid growth predicted by "Joy's Law," cope by offering multiprocessors as their top-of-the-line product.

But a fast CPU does not a fast system make. Gene Amdahl related CPU speed to main memory size using this rule [Siewiorek 82]: each CPU instruction per second requires one byte of main memory. If computer system costs are not to be dominated by the cost of memory, then Amdahl's constant suggests that memory chip capacity should grow at the same rate. Gordon Moore predicted that growth rate over 20 years:

    transistors/chip = 2^(Year - 1964)

As predicted by Moore's Law, RAMs have quadrupled in capacity every two [Moore 75] to three years [Myers 86]. Recently the ratio of megabytes of main memory to MIPS has been defined as alpha [Garcia 84],
with Amdahl's constant meaning alpha = 1. In part because of the rapid drop of memory prices, main memory sizes have grown faster than CPU speeds and many machines are shipped today with alphas of 3 or higher.

To maintain the balance of costs in computer systems, secondary storage must match the advances in other parts of the system. A key measure of magnetic disk technology is the growth in the maximum number of bits that can be stored per square inch, or the bits per inch in a track times the number of tracks per inch. Called MAD, for maximal areal density, the "First Law in Disk Density" predicts [Frank 87]:

    MAD = 10^((Year - 1971) / 10)

Magnetic disk technology has doubled capacity and halved price every three years, in line with the growth rate of semiconductor memory, and in practice between 1967 and 1979 the disk capacity of the average IBM data processing system more than kept up with its main memory [Stevens 81].

Capacity is not the only memory characteristic that must grow rapidly to maintain system balance, since the speed with which instructions and data are delivered to a CPU also determines its ultimate performance. The speed of main memory has kept pace for two reasons:
(1) the invention of caches, showing that a small buffer can be managed automatically to contain a substantial fraction of memory references;
(2) and the SRAM technology, used to build caches, whose speed has improved at the rate of 40% to 100% per year.

In contrast to primary memory technologies, the performance of single large expensive magnetic disks (SLED) has improved at a modest rate. These mechanical devices are dominated by the seek and the rotation delays: from 1971 to 1981, the raw seek time for a high-end IBM disk improved by only a factor of two while the rotation time did not change [Harker 81]. Greater density means a higher transfer rate when the information is found, and extra heads can reduce the average seek time, but the raw seek time only improved at a rate of 7% per year. There is no reason to expect a faster rate in the near future.

To maintain balance, computer systems have been using even larger main memories or solid state disks to buffer some of the I/O activity. This may be a fine solution for applications whose I/O activity has locality of reference and for which volatility is not an issue, but applications dominated by a high rate of random requests for small pieces of data (such as transaction-processing) or by a low number of requests for massive amounts of data (such as large simulations running on supercomputers) are facing a serious performance limitation.

2. The Pending I/O Crisis

What is the impact of improving the performance of some pieces of a problem while leaving others the same? Amdahl's answer is now known as Amdahl's Law [Amdahl 67]:

    S = 1 / ((1 - f) + f/k)

where
    S = the effective speedup;
    f = fraction of work in faster mode; and
    k = speedup while in faster mode.

Suppose that some current applications spend 10% of their time in I/O. Then when computers are 10X faster--according to Bill Joy in just over three years--then Amdahl's Law predicts effective speedup will be only 5X. When we have computers 100X faster--via evolution of uniprocessors or by multiprocessors--this application will be less than 10X faster, wasting 90% of the potential speedup.
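To make the arithmetic concrete, here is a minimal sketch (ours, not part of the original text) that evaluates Amdahl's Law for an application spending 10% of its time in I/O; the 5X and sub-10X figures quoted above fall out directly.

```python
def amdahl_speedup(f, k):
    """Effective speedup S = 1 / ((1 - f) + f / k), where f is the fraction of
    work that runs in the faster mode and k is the speedup of that fraction."""
    return 1.0 / ((1.0 - f) + f / k)

# If 10% of the time is I/O, only the remaining 90% of the work is sped up by a
# faster CPU, so f = 0.9 and k is the CPU speedup.
for k in (10, 100):
    print(f"CPU {k}X faster -> effective speedup {amdahl_speedup(0.9, k):.1f}X")
# CPU 10X faster  -> effective speedup 5.3X (roughly the 5X quoted above)
# CPU 100X faster -> effective speedup 9.2X (less than 10X, wasting ~90% of the potential)
```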
While we can imagine improvements in software file systems via buffering for near term I/O demands, we need innovation to avoid an I/O crisis [Boral 83].

3. A Solution: Arrays of Inexpensive Disks

Rapid improvements in capacity of large disks have not been the only target of disk designers, since personal computers have created a market for inexpensive magnetic disks. These lower cost disks have lower performance as well as less capacity. Table I below compares the top-of-the-line IBM 3380 model AK4 mainframe disk, Fujitsu M2361A "Super Eagle" minicomputer disk, and the Conner Peripherals CP 3100 personal computer disk.

Characteristics                  IBM       Fujitsu   Conners   3380 v.   2361 v.
                                 3380      M2361A    CP3100    3100      3100
                                                               (>1 means 3100 is better)
Disk diameter (inches)           14        10.5      3.5       4         3
Formatted Data Capacity (MB)     7500      600       100       .01       .2
Price/MB (controller incl.)      $18-$10   $20-$17   $10-$7    1-2.5     1.7-3
MTTF Rated (hours)               30,000    20,000    30,000    1         1.5
MTTF in practice (hours)         100,000   ?         ?         ?         ?
No. Actuators                    4         1         1         .2        1
Maximum I/O's/second/Actuator    50        40        30        .6        .8
Typical I/O's/second/Actuator    30        24        20        .7        .8
Maximum I/O's/second/box         200       40        30        .2        .8
Typical I/O's/second/box         120       24        20        .2        .8
Transfer Rate (MB/sec)           3         2.5       1         .3        .4
Power/box (W)                    6,600     640       10        660       64
Volume (cu. ft.)                 24        3.4       .03       800       110

Table I. Comparison of the IBM 3380 disk model AK4 for mainframe computers, the Fujitsu M2361A "Super Eagle" disk for minicomputers, and the Conners Peripherals CP 3100 disk for personal computers. By "Maximum I/O's/second" we mean the maximum number of average seeks and average rotates for a single sector access. Cost and reliability information on the 3380 comes from widespread experience [IBM 87] [Gawlick 87] and the information on the Fujitsu from the manual [Fujitsu 87], while some numbers on the new CP3100 are based on speculation. The price per megabyte is given as a range to allow for different prices for volume discount and different mark-up practices of the vendors. (The 8 watt maximum power of the CP3100 was increased to 10 watts to allow for the inefficiency of an external power supply, since the other drives contain their own power supplies.)

One surprising fact is that the number of I/Os per second per actuator in an inexpensive disk is within a factor of two of the large disks. In several of the remaining metrics, including price per megabyte, the inexpensive disk is superior or equal to the large disks.

The small size and low power are even more impressive since disks such as the CP3100 contain full track buffers and most functions of the traditional mainframe controller. Small disk manufacturers can provide such functions in high volume disks because of the efforts of standards committees in defining higher level peripheral interfaces, such as the ANSI X3.131-1986 Small Computer System Interface (SCSI). Such standards have encouraged companies like Adaptec to offer SCSI interfaces as single chips, in turn allowing disk companies to embed mainframe controller functions at low cost. Figure 1 compares the traditional mainframe disk approach and the small computer disk approach.
The same SCSI interface chip embedded as a controller in every disk can also be used as the direct memory access (DMA) device at the other end of the SCSI bus.

Such characteristics lead to our proposal for building I/O systems as arrays of inexpensive disks, either interleaved for the large transfers of supercomputers [Kim 86] [Livny 87] [Salem 86] or independent for the many small transfers of transaction processing. Using the information in Table I, 75 inexpensive disks potentially have 12 times the I/O bandwidth of the IBM 3380 and the same capacity, with lower power consumption and cost.

4. Caveats

We cannot explore all issues associated with such arrays in the space available for this paper, so we concentrate on fundamental estimates of price-performance and reliability. Our reasoning is that if there are no advantages in price-performance or terrible disadvantages in reliability, then there is no need to explore further. We characterize a transaction-processing workload to evaluate performance of a collection of inexpensive disks, but remember that such a collection is just one hardware component of a complete transaction-processing system. While designing a complete TPS based on these ideas is enticing, we will resist that temptation in this paper. Cabling and packaging, certainly an issue in the cost and reliability of an array of many inexpensive disks, is also beyond this paper's scope.

Figure 1. Comparison of organizations for typical mainframe and small computer disk interfaces. Single chip SCSI interfaces such as the Adaptec AIC-6250 allow the small computer to use a single chip to be the DMA interface as well as provide an embedded controller for each disk [Adaptec 87]. (The price per megabyte in Table I includes everything in the shaded boxes above.)

5. And Now The Bad News: Reliability

The unreliability of disks forces computer systems managers to make backup versions of information quite frequently in case of failure. What would be the impact on reliability of having a hundredfold increase in disks? Assuming a constant failure rate--that is, an exponentially distributed time to failure--and that failures are independent--both assumptions made by disk manufacturers when calculating the Mean Time To Failure (MTTF)--the reliability of an array of disks is:

    MTTF of a Disk Array = MTTF of a Single Disk / Number of Disks in the Array

Using the information in Table I, the MTTF of 100 CP 3100 disks is 30,000/100 = 300 hours, or less than 2 weeks. Compared to the 30,000 hour (>3 years) MTTF of the IBM 3380, this is dismal. If we consider scaling the array to 1000 disks, then the MTTF is 30 hours or about one day, requiring an adjective worse than dismal. Without fault tolerance, large arrays of inexpensive disks are too unreliable to be useful.
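The two back-of-the-envelope calculations above are easy to reproduce; the following sketch (ours) plugs in the Table I figures. The exact bandwidth ratio depends on which Table I row is used, so treat the numbers as approximate.

```python
# Approximate figures from Table I.
cp3100 = {"capacity_mb": 100, "typical_ios_per_sec": 20, "mttf_hours": 30_000}
ibm3380 = {"capacity_mb": 7_500, "typical_ios_per_sec": 120}

n = 75
print(n * cp3100["capacity_mb"])  # 7500 MB: the same capacity as one 3380
print(n * cp3100["typical_ios_per_sec"] / ibm3380["typical_ios_per_sec"])  # 12.5: ~12x the I/O rate

# Without redundancy, the MTTF of an array of independent disks divides by the disk count.
for disks in (100, 1000):
    print(disks, "disks ->", cp3100["mttf_hours"] / disks, "hours")  # 300 hours; 30 hours
```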
6. A Better Solution: RAID

To overcome the reliability challenge, we must make use of extra disks containing redundant information to recover the original information when a disk fails. Our acronym for these Redundant Arrays of Inexpensive Disks is RAID. To simplify the explanation of our final proposal and to avoid confusion with previous work, we give a taxonomy of five different organizations of disk arrays, beginning with mirrored disks and progressing through a variety of alternatives with differing performance and reliability. We refer to each organization as a RAID level. The reader should be forewarned that we describe all levels as if implemented in hardware solely to simplify the presentation, for RAID ideas are applicable to software implementations as well as hardware.

Reliability. Our basic approach will be to break the arrays into reliability groups, with each group having extra "check" disks containing redundant information. When a disk fails we assume that within a short time the failed disk will be replaced and the information will be
reconstructed onto the new disk using the redundant information. This time is called the mean time to repair (MTTR). The MTTR can be reduced if the system includes extra disks to act as "hot" standby spares; when a disk fails, a replacement disk is switched in electronically. Periodically a human operator replaces all failed disks. Here are other terms that we use:

    D = total number of disks with data (not including extra check disks);
    G = number of data disks in a group (not including extra check disks);
    C = number of check disks in a group;
    nG = D/G = number of groups.

As mentioned above, we make the same assumptions that disk manufacturers make--that failures are exponential and independent. (An earthquake or power surge is a situation where an array of disks might not fail independently.)

Since these reliability predictions will be very high, we want to emphasize that the reliability is only of the disk-head assemblies with this failure model, and not the whole software and electronic system. In addition, in our view the pace of technology means extremely high MTTFs are "overkill"--for, independent of expected lifetime, users will replace obsolete disks. After all, how many people are still using 20 year old disks?

The general MTTF calculation for single-error repairing RAID is given in two steps. First, the group MTTF is:

    MTTF_Group = (MTTF_Disk / (G + C)) * (1 / Probability of another failure in a group before repairing the dead disk)

As more formally derived in the appendix, the probability of a second failure before the first has been repaired is:

    Probability of Another Failure = MTTR / (MTTF_Disk / (No. Disks - 1)) = MTTR / (MTTF_Disk / (G + C - 1))

The intuition behind the formal calculation in the appendix comes from trying to calculate the average number of second disk failures during the repair time for X single disk failures. Since we assume that disk failures occur at a uniform rate, this average number of second failures during the repair time for X first failures is:

    X * MTTR / (MTTF of remaining disks in the group)

The average number of second failures for a single disk is then:

    MTTR / (MTTF_Disk / No. of remaining disks in the group)

The MTTF of the remaining disks is just the MTTF of a single disk divided by the number of good disks in the group, giving the result above.

The second step is the reliability of the whole system, which is approximately (since MTTF_Group is not quite distributed exponentially):

    MTTF_RAID = MTTF_Group / nG

Plugging it all together, we get:

    MTTF_RAID = (MTTF_Disk / (G + C)) * (MTTF_Disk / ((G + C - 1) * MTTR)) * (1 / nG)
              = (MTTF_Disk)^2 / ((G + C) * nG * (G + C - 1) * MTTR)

Since the formula is the same for each level, we make the abstract numbers concrete using these parameters as appropriate: D = 100 total data disks, G = 10 data disks per group, MTTF_Disk = 30,000 hours, MTTR = 1 hour, with the check disks per group C determined by the RAID level.
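A minimal sketch (ours) of the MTTF formula just derived, evaluated with the stated parameters. The check-disk count C is set by the RAID level in the sections that follow; C = 1 is used here purely as an illustration.

```python
def mttf_raid(mttf_disk, mttr, D, G, C):
    """MTTF_RAID = (MTTF_Disk)^2 / ((G + C) * nG * (G + C - 1) * MTTR), with nG = D / G."""
    nG = D / G
    return mttf_disk ** 2 / ((G + C) * nG * (G + C - 1) * mttr)

D, G = 100, 10                 # parameters used throughout the paper
mttf_disk, mttr = 30_000, 1    # hours

hours = mttf_raid(mttf_disk, mttr, D, G, C=1)
print(f"{hours:,.0f} hours, roughly {hours / 8760:.0f} years")
# 818,182 hours, roughly 93 years -- consistent with the 820,000 hour figure quoted
# later for a single check disk per group.
```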
Reliability Overhead Cost. This is simply the extra check disks, expressed as a percentage of the number of data disks D. As we shall see below, the cost varies with RAID level from 100% down to 4%.

Useable Storage Capacity Percentage. Another way to express this reliability overhead is in terms of the percentage of the total capacity of data disks and check disks that can be used to store data. Depending on the organization, this varies from a low of 50% to a high of 96%.

Performance. Since supercomputer applications and transaction-processing systems have different access patterns and rates, we need different metrics to evaluate both. For supercomputers we count the number of reads and writes per second for large blocks of data, with large defined as getting at least one sector from each data disk in a group. During large transfers all the disks in a group act as a single unit, each reading or writing a portion of the large data block in parallel.

A better measure for transaction-processing systems is the number of individual reads or writes per second. Since transaction-processing systems (e.g., debits/credits) use a read-modify-write sequence of disk accesses, we include that metric as well. Ideally during small transfers each disk in a group can act independently, either reading or writing independent information. In summary, supercomputer applications need a high data rate while transaction-processing needs a high I/O rate.

For both the large and small transfer calculations we assume the minimum user request is a sector, that a sector is small relative to a track, and that there is enough work to keep every device busy. Thus sector size affects both disk storage efficiency and transfer size. Figure 2 shows the ideal operation of large and small disk accesses in a RAID.

Figure 2. Large transfers vs. small transfers in a group of G disks: (a) a single large or "grouped" read (1 read spread over G disks); (b) several small or individual reads and writes (G reads and/or writes spread over G disks).

The six performance metrics are then the number of reads, writes, and read-modify-writes per second for both large (grouped) or small (individual) transfers. Rather than give absolute numbers for each metric, we calculate efficiency: the number of events per second for a RAID relative to the corresponding events per second for a single disk. (This is Boral's I/O bandwidth per gigabyte [Boral 83] scaled to gigabytes per disk.) In this paper we are after fundamental differences so we use simple, deterministic throughput measures for our performance metric rather than latency.

Effective Performance Per Disk. The cost of disks can be a large portion of the cost of a database system, so the I/O performance per disk--factoring in the overhead of the check disks--suggests the cost/performance of a system. This is the bottom line for a RAID.
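To make the per-disk metric concrete, here is a small sketch (ours) of the arithmetic used in the tables that follow: the full-RAID rate, already expressed relative to a single disk, is simply divided by the total number of disks, check disks included.

```python
def efficiency_per_disk(full_raid_rate, data_disks, check_disks):
    """full_raid_rate: array events/sec in units of a single disk's rate.
    The per-disk figure charges the check disks against the array's throughput."""
    return full_raid_rate / (data_disks + check_disks)

# Example: with mirroring, large reads can proceed on all 2D arms at once, so the
# full-RAID rate is 2D and the per-disk efficiency is 1.00 (the 1.00/S entry of
# Table II, ignoring the slowdown factor S defined there).
D = 100
print(efficiency_per_disk(2 * D, D, D))   # 1.0
```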
7. First Level RAID: Mirrored Disks

Mirrored disks are a traditional approach for improving reliability of magnetic disks. This is the most expensive option we consider since all disks are duplicated (G = 1 and C = 1), and every write to a data disk is also a write to a check disk. Tandem doubles the number of controllers for fault tolerance, allowing an optimized version of mirrored disks that lets reads occur in parallel. Table II shows the metrics for a Level 1 RAID assuming this optimization.

MTTF                        Exceeds Useful Product Lifetime (4,500,000 hrs or >500 years)
Total Number of Disks       2D
Overhead Cost               100%
Useable Storage Capacity    50%

Events/Sec vs. Single Disk       Full RAID    Efficiency Per Disk
Large (or Grouped) Reads         2D/S         1.00/S
Large (or Grouped) Writes        D/S          .50/S
Large (or Grouped) R-M-W         4D/3S        .67/S
Small (or Individual) Reads      2D           1.00
Small (or Individual) Writes     D            .50
Small (or Individual) R-M-W      4D/3         .67

Table II. Characteristics of Level 1 RAID. Here we assume that writes are not slowed by waiting for the second write to complete because the slowdown for writing 2 disks is minor compared to the slowdown S for writing a whole group of 10 to 25 disks. Unlike a "pure" mirrored scheme with extra disks that are invisible to the software, we assume an optimized scheme with twice as many controllers allowing parallel reads to all disks, giving full disk bandwidth for large reads and allowing the reads of read-modify-writes to occur in parallel.
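One entry in Table II that may not be obvious is the read-modify-write rate of 4D/3. A small sketch (ours, not a derivation from the paper) of one way to arrive at it: count each read-modify-write as one read plus one write, and combine the read rate 2D and write rate D harmonically.

```python
def rmw_event_rate(read_rate, write_rate):
    """Events/sec when every read-modify-write contributes one read and one write:
    2 / (1/read_rate + 1/write_rate), the harmonic combination of the two rates."""
    return 2.0 / (1.0 / read_rate + 1.0 / write_rate)

D = 100
print(rmw_event_rate(2 * D, D))            # 133.3... = 4D/3, the full-RAID R-M-W entry
print(rmw_event_rate(2 * D, D) / (2 * D))  # 0.67 per disk, the .67/S entry with S = 1
```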
When individual accesses are distributed across multiple disks, average queueing, seek, and rotate delays may differ from the single disk case. Although bandwidth may be unchanged, it is distributed more evenly, reducing variance in queueing delay and, if the disk load is not too high, also reducing the expected queueing delay through parallelism [Livny 87]. When many arms seek to the same track then rotate to the described sector, the average seek and rotate time will be larger than the average for a single disk, tending toward the worst case times. This effect should not generally more than double the average access time to a single sector while still getting many sectors in parallel. In the special case of mirrored disks with sufficient controllers, the choice between arms that can read any data sector will reduce the time for the average read seek by up to 45% [Bitton 88].

To allow for these factors but to retain our fundamental emphasis we apply a slowdown factor, S, when there are more than two disks in a group. In general, 1 <= S < 2 whenever groups of disks work in parallel. With synchronous disks the spindles of all disks in the group are synchronized so that the corresponding sectors of a group of disks pass under the heads simultaneously [Kurzweil 88], so for synchronous disks there is no slowdown and S = 1. Since a Level 1 RAID has only one data disk in its group, we assume that the large transfer requires the same number of disks acting in concert as found in groups of the higher level RAIDs: 10 to 25 disks.

Duplicating all disks can mean doubling the cost of the database system or using only 50% of the disk storage capacity. Such largess inspires the next levels of RAID.

8. Second Level RAID: Hamming Code for ECC

The history of main memory organizations suggests a way to reduce the cost of reliability. With the introduction of 4K and 16K DRAMs, computer designers discovered that these new devices were subject to losing information due to alpha particles. Since there were many single bit DRAMs in a system and since they were usually accessed in groups of 16 to 64 chips at a time, system designers added redundant chips to correct single errors and to detect double errors in each group. This increased the number of memory chips by 12% to 38%--depending on the size of the group--but it significantly improved reliability.

As long as all the data bits in a group are read or written together, there is no impact on performance. However, reads of less than the group size require reading the whole group to be sure the information is correct, and writes to a portion of the group mean three steps:

1) a read step to get all the rest of the data;
2) a modify step to merge the new and old information;
3) a write step to write the full group, including check information.

Since we have scores of disks in a RAID and since some accesses are to groups of disks, we can mimic the DRAM solution by bit-interleaving the data across the disks of a group and then add enough check disks to detect and correct a single error. A single parity disk can detect a single error, but to correct an error we need enough check disks to identify the disk with the error. For a group size of 10 data disks (G) we need 4 check disks (C) in total, and if G = 25 then C = 5 [Hamming 50]. To keep down the cost of redundancy, we assume the group size will vary from 10 to 25.
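The check-disk counts quoted above (C = 4 for G = 10 and C = 5 for G = 25) follow from the usual single-error-correcting Hamming condition; a short sketch (ours) of that calculation:

```python
def hamming_check_disks(G):
    """Smallest C with 2**C >= G + C + 1: enough check disks both to detect a single
    error and to identify which of the G + C disks holds it."""
    C = 1
    while 2 ** C < G + C + 1:
        C += 1
    return C

print(hamming_check_disks(10))   # 4
print(hamming_check_disks(25))   # 5
```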
Since our individual data transfer unit is just a sector, bit-interleaved disks mean that a large transfer for this RAID must be at least G sectors. Like DRAMs, reads to a smaller amount imply reading a full sector from each of the bit-interleaved disks in a group, and writes of a single unit involve the read-modify-write cycle to all the disks. Table III shows the metrics of this Level 2 RAID.

                                      G=10                        G=25
MTTF      Exceeds Useful Lifetime     (494,500 hrs or >50 years)  (103,500 hrs or 12 years)
Total Number of Disks                 1.40D                       1.20D
Overhead Cost                         40%                         20%
Useable Storage Capacity              71%                         83%

Events/Sec        Full RAID   Efficiency Per Disk      Efficiency Per Disk
(vs. Single Disk)             L2       L2/L1           L2       L2/L1
Large Reads       D/S         .71/S    71%             .86/S    86%
Large Writes      D/S         .71/S    143%            .86/S    172%
Large R-M-W       D/S         .71/S    107%            .86/S    129%
Small Reads       D/SG        .07/S    6%              .03/S    3%
Small Writes      D/2SG       .04/S    6%              .02/S    3%
Small R-M-W       D/SG        .07/S    9%              .03/S    4%

Table III. Characteristics of a Level 2 RAID. The L2/L1 column gives the % performance of level 2 in terms of level 1 (>100% means L2 is faster). As long as the transfer unit is large enough to spread over all the data disks of a group, the large I/Os get the full bandwidth of each disk, divided by S to allow all disks in a group to complete. Level 1 large reads are faster because data is duplicated and so the redundancy disks can also do independent accesses. Small I/Os still require accessing all the disks in a group, so only D/G small I/Os can happen at a time, again divided by S to allow a group of disks to finish. Small Level 2 writes are like small R-M-W because full sectors must be read before new data can be written onto part of each sector.

For large writes, the level 2 system has the same performance as level 1 even though it uses fewer check disks, and so on a per disk basis it outperforms level 1. For small data transfers the performance is dismal either for the whole system or per disk; all the disks of a group must be accessed for a small transfer, limiting the maximum number of simultaneous accesses to D/G. We also include the slowdown factor S since the access must wait for all the disks to complete.

Thus level 2 RAID is desirable for supercomputers but inappropriate for transaction processing systems, with increasing group size increasing the disparity in performance per disk for the two applications. In recognition of this fact, Thinking Machines Incorporated announced a Level 2 RAID this year for its Connection Machine supercomputer called the "Data Vault," with G = 32 and C = 8, including one hot standby spare [Hillis 87].

Before improving small data transfers, we concentrate once more on lowering the cost.

9. Third Level RAID: Single Check Disk Per Group

Most check disks in the level 2 RAID are used to determine which disk failed, for only one redundant parity disk is needed to detect an error. These extra disks are truly "redundant" since most disk controllers can already detect if a disk failed: either through special signals provided in the disk interface or the extra checking information at the end of a sector used to detect and correct soft errors. So information on the failed disk can be reconstructed by calculating the parity of the remaining good disks and then comparing bit-by-bit to the parity calculated for the original full
group. When these two parities agree, the failed bit was a 0; otherwise it was a 1. If the check disk is the failure, just read all the data disks and store the group parity in the replacement disk.

Reducing the check disks to one per group (C = 1) reduces the overhead cost to between 4% and 10% for the group sizes considered here. The performance for the third level RAID system is the same as the Level 2 RAID, but the effective performance per disk increases since it needs fewer check disks. This reduction in total disks also increases reliability, but since it is still larger than the useful lifetime of disks, this is a minor point. One advantage of a level 2 system over level 3 is that the extra check information associated with each sector to correct soft errors is not needed, increasing the capacity per disk by perhaps 10%. Level 2 also allows all soft errors to be corrected "on the fly" without having to reread a sector. Table IV summarizes the third level RAID characteristics and Figure 3 compares the sector layout and check disks for levels 2 and 3.

                                      G=10                        G=25
MTTF      Exceeds Useful Lifetime     (820,000 hrs or >90 years)  (346,000 hrs or 40 years)
Total Number of Disks                 1.10D                       1.04D
Overhead Cost                         10%                         4%
Useable Storage Capacity              91%                         96%

Events/Sec        Full RAID   Efficiency Per Disk            Efficiency Per Disk
(vs. Single Disk)             L3       L3/L2    L3/L1        L3       L3/L2    L3/L1
Large Reads       D/S         .91/S    127%     91%          .96/S    112%     96%
Large Writes      D/S         .91/S    127%     182%         .96/S    112%     192%
Large R-M-W       D/S         .91/S    127%     136%         .96/S    112%     142%
Small Reads       D/SG        .09/S    127%     8%           .04/S    112%     3%
Small Writes      D/2SG       .05/S    127%     8%           .02/S    112%     3%
Small R-M-W       D/SG        .09/S    127%     11%          .04/S    112%     5%

Table IV. Characteristics of a Level 3 RAID. The L3/L2 column gives the % performance of L3 in terms of L2 and the L3/L1 column gives it in terms of L1 (>100% means L3 is faster). The performance for the full systems is the same in RAID levels 2 and 3, but since there are fewer check disks the performance per disk improves.

Park and Balasubramanian proposed a third level RAID system without suggesting a particular application [Park 86]. Our calculations suggest it is a much better match to supercomputer applications than to transaction processing systems. This year two disk manufacturers have announced level 3 RAIDs for such applications using synchronized 5.25 inch disks with G = 4 and C = 1: one from Maxtor and one from Micropolis [Maginnis 87].

This third level has brought the reliability overhead cost to its lowest level, so in the last two levels we improve performance of small accesses without changing cost or reliability.

10. Fourth Level RAID: Independent Reads/Writes

Spreading a transfer across all disks within the group has the following advantage:

• Large or grouped transfer time is reduced because the transfer bandwidth of the entire array can be exploited.

But it has the following disadvantages as well:

• Reading/writing to a disk in a group requires reading/writing to all the disks in a group; levels 2 and 3 RAIDs can perform only one I/O at a time per group.
• If the disks are not synchronized, you do not see average seek and rotational delays; the observed delays should move towards the worst case, hence the S factor in the equations above.

This fourth level RAID improves performance of small transfers through parallelism--the ability to do more than one I/O per group at a time. We no longer spread the individual transfer information across several disks, but keep each individual unit in a single disk.

The virtue of bit-interleaving is the easy calculation of the Hamming code needed to detect or correct errors in level 2. But recall that in the third level RAID we rely on the disk controller to detect errors within a single disk sector. Hence, if we store an individual transfer unit in a single sector, we can detect errors on an individual read without accessing any other disk. Figure 3 shows the different ways the information is stored in a sector for RAID levels 2, 3, and 4. By storing a whole transfer unit in a sector, reads can be independent and operate at the maximum rate of a disk yet still detect errors. Thus the primary change between level 3 and 4 is that we interleave data between disks at the sector level rather than at the bit level.

Figure 3. Comparison of location of data and check information in sectors for RAID levels 2, 3, and 4 for G = 4. Transfer units a, b, c, and d are spread across data disks 1-4 and check disks 5-7, the check disks holding ECC information such as aECC0, bECC0, ..., ECCa, ..., ECCd. (Only one check disk in level 3; check info is calculated over each transfer unit.) (Each transfer unit in level 4 is placed into a single sector; note that the check info is now calculated over a piece of each transfer unit.) Not shown is the small amount of check information per sector added by the disk controller to detect and correct soft errors within a sector. Remember that we use physical sector numbers and hardware control to explain these ideas, but RAID can be implemented by software using logical sectors and disks.

At first thought you might expect that an individual write to a single sector still involves all the disks in a group since (1) the check disk must be rewritten with the new parity data, and (2) the rest of the data disks must be read to be able to calculate the new parity data. Recall that each parity bit is just a single exclusive OR of all the corresponding data bits in a group. In level 4 RAID, unlike level 3, the parity calculation is much simpler since, if we know the old data value and the old parity value as well as the new data value, we can calculate the new parity information as follows:

    new parity = (old data xor new data) xor old parity

In level 4 a small write then uses 2 disks to perform 4 accesses--2 reads and 2 writes--while a small read involves only one read on one disk. Table V summarizes the fourth level RAID characteristics. Note that all small accesses improve--dramatically for the reads--but the small read-modify-write is still so slow relative to a level 1 RAID that its applicability to transaction processing is doubtful. Recently Salem and Garcia-Molina proposed a Level 4 system [Salem 86].
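A minimal sketch (ours) of the two parity facts used in this section: a failed disk's contents are the XOR of the surviving disks and the parity disk, and a level 4 small write can update the parity from only the old data, new data, and old parity. Short byte strings stand in for sectors here.

```python
from functools import reduce

def parity(sectors):
    """Bitwise XOR across the corresponding bytes of each sector."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*sectors))

data = [b"\x0f\x10\xa5", b"\x33\x44\x55", b"\xf0\x0f\x00"]   # three data "sectors"
p = parity(data)                                             # the check disk's sector

# Reconstruction (levels 3 and 4): the lost sector is the XOR of everything that survives.
rebuilt = parity([data[1], data[2], p])
assert rebuilt == data[0]

# Level 4 small write: new parity = (old data xor new data) xor old parity,
# so the other data disks never need to be read.
new0 = b"\xaa\xbb\xcc"
new_p = parity([data[0], new0, p])
assert new_p == parity([new0, data[1], data[2]])
```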
Before proceeding to the next level we need to explain the performance of small writes in Table V (and hence small read-modify-writes, since they entail the same operations in this RAID). The formula for the small writes divides D by 2 instead of 4 because 2 accesses can proceed in parallel: the old data and old parity can be read at the same time and the new data and new parity can be written at the same time. The performance of small writes is also divided by G because the single check disk in a group must be read and written with every small write in that group, thereby limiting the number of writes that can be performed at a time to the number of groups.

The check disk is the bottleneck, and the final level RAID removes this bottleneck.

Table V. Characteristics of a Level 4 RAID.

                                      G=10             G=25
MTTF      Exceeds Useful Lifetime     820,000 hrs      346,000 hrs
Total Number of Disks
Overhead Cost
Useable Storage Capacity

Events/Sec (vs. Single Disk)      Full RAID
Large Reads                       D/S
Large Writes                      D/S
Large R-M-W                       D/S
Small Reads                       D
Small Writes
Small R-M-W
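A short sketch (ours) of the small-write accounting just described: each group's single check disk allows only one small write in flight per group, and each small write costs a read plus a write on both the data disk and the check disk, overlapped in pairs, so the relative rate works out to D divided by 2 and by G.

```python
def level4_small_writes(D, G, single_disk_rate=1.0):
    """Relative small-write rate for a level 4 array: D / (2 * G).
    D/G groups can each service one small write at a time (the shared check disk
    serializes writes within a group), and each write ties its two disks up for
    two sequential accesses (read old data/parity, then write new data/parity)."""
    groups = D / G
    return groups * single_disk_rate / 2

D, G = 100, 10
print(level4_small_writes(D, G))       # 5.0 small writes/sec, in units of one disk's rate
print(level4_small_writes(D, G) / D)   # 0.05 per data disk
```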
