`Stallmo
`
`US005519844A
[11] Patent Number: 5,519,844
[45] Date of Patent: May 21, 1996
`
`[54] LOGICAL PARTITIONING OF A
`REDUNDANT ARRAY STORAGE SYSTEM
`
[75] Inventor: David C. Stallmo, Boulder, Colo.
`
`[73] Assignee: EMC Corporation, Hopkinton, Mass.
`
Maximum Strategy, Inc., San Jose, CA; Strategy 2 Disk Array Controller Operation Manual (Nov. 2, 1988).
Maximum Strategy, Inc., San Jose, CA; Strategy 1 Disk Array Controller Operation Manual (Date unknown).
`Gibson, G. A., Performance and Reliability in Redundant
`Arrays of Inexpensive Disks (Date Unknown).
`
`[21] Appl. No.: 215,013
`
[22] Filed: Mar. 21, 1994
`
`(List continued on next page.)
`
`Primary Examiner-Stephen M. Baker
`Attorney, Agent, or Firm-John Land
`
`Related U.S. Application Data
`
[63] Continuation of Ser. No. 612,220, Nov. 9, 1990, abandoned.

[51] Int. Cl.6 .............................. G06F 11/10; G06F 12/06
[52] U.S. Cl. ....................................... 395/441; 395/404
[58] Field of Search ............ 371/10.2; 395/575, 395/275, 404, 441
`
[56] References Cited
`
`U.S. PATENT DOCUMENTS
`
3,893,178   7/1975  Sordello ............................ 360/73.02
4,092,732   5/1978  Ouchi .................................. 395/575
4,467,421   8/1984  White .................................. 395/425
4,562,576  12/1985  Ratcliffe ............................. 371/11.1
`
`(List continued on next page.)
`
`FOREIGN PATENT DOCUMENTS
`
9113404   9/1991   WIPO
`
OTHER PUBLICATIONS
`
Patterson, D. A., Gibson, G., and Katz, R. H.; A Case for Redundant Arrays of Inexpensive Disks (RAID), ACM SIGMOD Conf., Jun. 1-3, 1988, pp. 109-116.
Lee, E. K.; Software and Performance Issues in the Implementation of a RAID Prototype (May 1990), Rept. No. UCB/CSD 90/573.
Chen, P., Gibson, G., Katz, R. H., Patterson, D. A., and Schulze, M.; Introduction to Redundant Arrays of Inexpensive Disks (RAID), Compcon Spr. '89.
Chen, P., Gibson, G., Katz, R. H., Patterson, D. A., and Schulze, M., et al.; Evolution of the RAID 2 Architecture (Jun. 12, 1990).

[57] ABSTRACT

`A redundant array storage system that can be configured as
`a RAID 1, 3, 4, or 5 system, or any combination of these
`configurations. The invention includes a configuration data
`structure for addressing a redundant array storage system,
`and a method for configuring a redundant array storage
`system during an initialization process. The redundant array
`storage system includes a set of physical storage units which
`are accessible in terms of block numbers. The physical
`storage units are each configured as one or more logical
`storage units. Each logical storage unit is addressed in terms
`of a channel number, storage unit number, starting block
number, offset number, and number of blocks to be trans-
ferred. Once logical storage units are defined, logical vol-
umes are defined as one or more logical storage units, each
`logical volume having a depth characteristic. After the
`logical volumes are defined, redundancy groups are defined
`as one or more logical volumes. A redundancy level is
`specified for each redundancy group. The redundancy level
`may be none, one, or two. Logical volumes are addressed by
`a host CPU by volume number, initial block number, and
`number of blocks to be transferred. The host CPU also
specifies a READ or WRITE operation. The specified vol-
ume number, initial block number, and number of blocks to
`be transferred are then translated into a corresponding
`channel number, storage unit number, starting block number,
`offset number, and number of blocks to be transferred. With
`the present invention, it is possible for a logical volume to
`span across physical storage units ("vertical partitioning"),
`comprise only a portion of each such physical storage unit
`("horizontal partitioning"), and have definable depth and
`redundancy characteristics.
`
`16 Claims, 6 Drawing Sheets
`
`
`
`
`
`U.S. PATENT DOCUMENTS
`
4,667,362   5/1987  Young et al. .......................... 371/39.1
4,722,085   1/1988  Flora et al. .......................... 371/40.1
4,754,397   6/1988  Varaiya et al. ......................... 361/380
4,761,785   8/1988  Clark et al. ............................ 371/2.2
4,768,193   8/1988  Takemae ............................... 371/10.3
4,775,978  10/1988  Hartness .............................. 371/40.1
4,817,035   3/1989  Timsit ................................. 395/425
4,819,159   4/1989  Shipley et al. ....................... 371/9.1 X
4,849,929   7/1989  Timsit ................................. 395/575
4,870,643   9/1989  Bultman et al. ........................ 371/11.1
4,899,342   2/1990  Potter et al. ......................... 371/10.1
4,914,656   4/1990  Dunphy, Jr. et al. .................... 371/10.2
4,942,579   7/1990  Goodlander et al. ..................... 371/51.1
4,989,206   1/1991  Dunphy, Jr. et al. .................... 371/10.1
4,993,030   2/1991  Krakauer et al. ....................... 371/40.1
5,073,854  12/1991  Martin et al. .......................... 364/425
5,088,081   2/1992  Farr ..................................... 369/54
5,124,987   6/1992  Milligan et al. ....................... 371/10.1
5,129,088   7/1992  Auslander et al. ....................... 395/700
5,130,992   7/1992  Frey, Jr. et al. ...................... 371/40.1
5,134,619   7/1992  Hensen et al. ......................... 371/40.1
5,148,432   9/1992  Gordon et al. ......................... 371/10.1
5,193,184   3/1993  Belsan et al. .......................... 395/600
5,249,279   9/1993  Schmenk et al. ......................... 395/425
5,283,875   2/1994  Gibson et al. .......................... 395/400
`
`OTHER PUBLICATIONS
`
Chen, P.; An Evaluation of Redundant Arrays of Disks Using an Amdahl 5890 (May 1989), Rept. No. UCB/CSD 89/506.
Katz, R. H., Gibson, G. A., and Patterson, D. A.; Disk System Architectures for High Performance Computing (Mar. 1989), Rept. No. UCB/CSD 89/497.
Gray, J., Horst, B., and Walker, M.; Parity Striping of Disc Arrays: Low-Cost Reliable Storage with Acceptable Throughput (Jan. 1990), Tandem Tech. Rept. 90.2.
Schulze, M. E.; Considerations in the Design of a RAID Prototype (Aug. 1988), Rept. No. UCB/CSD 88/448.
Clark and Corrigan; "Application System/400 Performance Characteristics", IBM Systems Journal, vol. 28, No. 3, 1989, pp. 407-423.
Meador, W.; "Disk Array Systems", COMPCON Spring '89, Feb. 27-Mar. 3, 1989, pp. 143-146.
Schleicher, D. et al.; "System Overview of the Application System/400", IBM Systems Journal, vol. 28, No. 3, 1989, pp. 360-375.
`
`
`
[Sheet 1 of 6, FIG. 1: block diagram of a generalized RAID system; a CPU (1) is coupled through a controller (3) to physical storage units S1-S6, with reference numerals 2 and 4 marking the intervening connections.]
`
`
`
[Sheet 2 of 6, FIG. 2A: physical storage units S0-S11 shown against block levels L0-L7, with an X marking each logical block location.]
`
[FIG. 2B: the physical storage units S0-S11 and block levels L0-L7 again, with X marking the logical block locations for the logical organization of the array of FIG. 2A.]
`
`
`
`
[Sheet 3 of 6, FIG. 2C: logical volume #0 and logical volume #1 mapped onto logical storage units S0-S5 and S0'-S5'; each volume's data blocks 0-47 are laid out one block deep across its units at block levels L0-L7.]
`
[FIG. 2D: the same two logical volumes with differing depth characteristics; the data blocks 0-47 of logical volume #0 run several blocks deep down each logical storage unit, while logical volume #1 remains one block deep.]
`
`
`
`
[Sheet 4 of 6, FIG. 2E: logical volume #0 on logical storage units S0-S5, with parity blocks (P) distributed among data blocks 0-39 at block levels L0-L7.]
`
[FIG. 2F: logical volume #0 on logical storage units S0-S5, with both P and Q redundancy blocks distributed among the data blocks.]
`
`
`
[Sheet 5 of 6, FIG. 3A: configuration table for REDUNDANCY GROUP #0, which contains LOGICAL VOLUME #0 (logical disks #0-#5) and LOGICAL VOLUME #1 (logical disks #6-#11); each logical disk entry records a CHANNEL #, STORAGE UNIT #, STARTING BLOCK #, and # OF BLOCKS.]
`
`
`
`
[Sheet 6 of 6, FIG. 3B: configuration table with two redundancy groups; REDUNDANCY GROUP #0 contains LOGICAL VOLUME #0 (logical disks #0-#5) and REDUNDANCY GROUP #1 contains LOGICAL VOLUME #1 (logical disks #6-#11), each logical disk entry again recording a CHANNEL #, STORAGE UNIT #, STARTING BLOCK #, and # OF BLOCKS.]
`
`
`
`
`LOGICAL PARTITIONING OF A
`REDUNDANT ARRAY STORAGE SYSTEM
`
`This is a continuation of application Ser. No. 07/612,220,
`filed on Nov. 9, 1990, now abandoned.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`This invention relates to computer system data storage,
`and more particularly to a redundant array storage system
`that can be configured as a RAID 1, 3, 4, or 5 system, or any
`combination of these configurations.
`2. Description of Related Art
`A typical data processing system generally involves one
`or more storage units which are connected to a Central
`Processor Unit (CPU) either directly or through a control
`unit and a channel. The function of the storage units is to
`store data and programs which the CPU uses in performing
`particular data processing tasks.
Various types of storage units are used in current data
`processing systems. A typical system may include one or
`more large capacity tape units and/or disk drives (magnetic,
`optical, or semiconductor) connected to the system through
`respective control units for storing data.
`However, a problem exists if one of the large capacity
`storage units fails such that information contained in that
`unit is no longer available to the system. Generally, such a
`failure will shut down the entire computer system.
`The prior art has suggested several ways of solving the
`problem of providing reliable data storage. In systems where
`records are relatively small, it is possible to use error
`correcting codes which generate ECC syndrome bits that are
appended to each data record within a storage unit. With
`such codes, it is possible to correct a small amount of data
`that may be read erroneously. However, such codes are
`generally not suitable for correcting or recreating long
`records which are in error, and provide no remedy at all if
a complete storage unit fails. Therefore, a need exists for
`providing data reliability external to individual storage units.
`Other approaches to such "external" reliability have been
`described in the art. A research group at the University of
California, Berkeley, in a paper entitled "A Case for Redun-
dant Arrays of Inexpensive Disks (RAID)", Patterson, et al.,
`Proc. ACM SIGMOD, June 1988, has catalogued a number
`of different approaches for providing such reliability when
`using disk drives as storage units. Arrays of disk drives are
`characterized in one of five architectures, under the acronym
`"RAID" (for Redundant Arrays of Inexpensive Disks).
`A RAID 1 architecture involves providing a duplicate set
`of "mirror" storage units and keeping a duplicate copy of all
`data on each pair of storage units. While such a solution
`solves the reliability problem, it doubles the cost of storage.
`A number of implementations of RAID 1 architectures have
`been made, in particular by Tandem Corporation.
`A RAID 2 architecture stores each bit of each word of
`data, plus Error Detection and Correction (EDC) bits for
`each word, on separate disk drives (this is also known as "bit
`stripping"). For example, U.S. Pat. No. 4,722,085 to Flora et
`al. discloses a disk drive memory using a plurality of
`relatively small, independently operating disk subsystems to
`function as a large, high capacity disk drive having an
`unusually high fault tolerance and a very high data transfer
`bandwidth. A data organizer adds 7 EDC bits (determined
`using the well-known Hamming code) to each 32-bit data
word to provide error detection and error correction capa-
`bility. The resultant 39-bit word is written, one bit per disk
`drive, on to 39 disk drives. If one of the 39 disk drives fails,
`the remaining 38 bits of each stored 39-bit word can be used
`to reconstruct each 32-bit data word on a word-by-word
`basis as each data word is read from the disk drives, thereby
`obtaining fault tolerance.
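One plausible reading of the 7 EDC bits mentioned above is the usual SECDED Hamming arithmetic; the sketch below (illustrative only, not part of the patent) computes it: a single-error-correcting code over m data bits needs the smallest r with 2^r >= m + r + 1, and one further overall parity bit is commonly added for double-error detection.

    #include <stdio.h>

    /* Illustrative only: check bits for a SECDED Hamming code.
       Smallest r with 2^r >= m + r + 1, plus one overall parity bit. */
    static int secded_check_bits(int m)
    {
        int r = 0;
        while ((1 << r) < m + r + 1)
            r++;
        return r + 1;
    }

    int main(void)
    {
        int m = 32;                       /* 32-bit data word  */
        int c = secded_check_bits(m);     /* yields 7 EDC bits */
        printf("%d data + %d EDC = %d-bit word\n", m, c, m + c);
        /* 32 data + 7 EDC = 39-bit word, written one bit per drive */
        return 0;
    }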
`An obvious drawback of such a system is the large
`number of disk drives required for a minimum system (since
most large computers use a 32-bit word), and the relatively
`high ratio of drives required to store the EDC bits (7 drives
`out of 39). A further limitation of a RAID 2 disk drive
`memory system is that the individual disk actuators are
`operated in unison to write each data block, the bits of which
are distributed over all of the disk drives. This arrangement
has a high data transfer bandwidth, since each individual
`disk transfers part of a block of data, the net effect being that
`the entire block is available to the computer system much
`faster than if a single drive were accessing the block. This is
`advantageous for large data blocks. However, this arrange-
ment also effectively provides only a single read/write head
`actuator for the entire storage unit. This adversely affects the
`random access performance of the drive array when data
`files are small, since only one data file at a time can be
`accessed by the "single" actuator. Thus, RAID 2 systems are
generally not considered to be suitable for computer systems
`designed for On-Line Transaction Processing (OLTP), such
`as in banking, financial, and reservation systems, where a
`large number of random accesses to many small data files
`comprises the bulk of data storage and transfer operations.
`A RAID 3 architecture is based on the concept that each
`disk drive storage unit has internal means for detecting a
`fault or data error. Therefore, it is not necessary to store extra
`information to detect the location of an error; a simpler form
`of parity-based error correction can thus be used. In this
`approach, the contents of all storage units subject to failure
`are "Exclusive OR'd" (XOR'd) to generate parity informa-
`tion. The resulting parity information is stored in a single
`redundant storage unit. If a storage unit fails, the data on that
`unit can be reconstructed on to a replacement storage unit by
`XOR'ing the data from the remaining storage units with the
`parity information. Such an arrangement has the advantage
`over the mirrored disk RAID 1 architecture in that only one
`additional storage unit is required for "N" storage units. A
`further aspect of the RAID 3 architecture is that the disk
`drives are operated in a coupled manner, similar to a RAID
`2 system, and a single disk drive is designated as the parity
`unit.
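A minimal sketch of the XOR parity scheme just described (illustrative C with example sizes; the patent itself defines no code): the parity unit holds the XOR of the corresponding bytes of all data units, and the contents of a failed unit are recovered by XOR'ing the surviving data units with the parity unit.

    #include <stddef.h>

    #define N_DATA 4          /* number of data storage units (example) */
    #define BLK    512        /* bytes per block (example)              */

    /* Generate the parity block as the XOR of all data blocks. */
    void make_parity(const unsigned char data[N_DATA][BLK],
                     unsigned char parity[BLK])
    {
        for (size_t i = 0; i < BLK; i++) {
            unsigned char p = 0;
            for (int u = 0; u < N_DATA; u++)
                p ^= data[u][i];
            parity[i] = p;
        }
    }

    /* Rebuild a failed unit's block from the survivors plus parity. */
    void rebuild_block(const unsigned char data[N_DATA][BLK],
                       const unsigned char parity[BLK],
                       int failed_unit, unsigned char out[BLK])
    {
        for (size_t i = 0; i < BLK; i++) {
            unsigned char p = parity[i];
            for (int u = 0; u < N_DATA; u++)
                if (u != failed_unit)
                    p ^= data[u][i];
            out[i] = p;
        }
    }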
`One implementation of a RAID 3 architecture is the
`Micropolis Corporation Parallel Drive Array, Model 1804
`SCSI, that uses four parallel, synchronized disk drives and
`one redundant parity drive. The failure of one of the four
`data disk drives can be remedied by the use of the parity bits
`stored on the parity disk drive. Another example of a RAID
3 system is described in U.S. Pat. No. 4,092,732 to Ouchi.
`A RAID 3 disk drive memory system has a much lower
`ratio of redundancy units to data units than a RAID 2 system.
`However, a RAID 3 system has the same performance
`limitation as a RAID 2 system, in that the individual disk
actuators are coupled, operating in unison. This adversely
`affects the random access performance of the drive array
`when data files are small, since only one data file at a time
`can be accessed by the "single" actuator. Thus, RAID 3
`systems are generally not considered to be suitable for
computer systems designed for OLTP purposes.
A RAID 4 architecture uses the same parity error correc-
`tion concept of the RAID 3 architecture, but improves on the
`
`
`
performance of a RAID 3 system with respect to random reading of small files by "uncoupling" the operation of the individual disk drive actuators, and reading and writing a larger minimum amount of data (typically, a disk sector) to each disk (this is also known as block striping). A further aspect of the RAID 4 architecture is that a single storage unit is designated as the parity unit.

A limitation of a RAID 4 system is that Writing a data block on any of the independently operating data storage units also requires writing a new parity block on the parity unit. The parity information stored on the parity unit must be read and XOR'd with the old data (to "remove" the information content of the old data), and the resulting sum must then be XOR'd with the new data (to provide new parity information). Both the data and the parity records then must be rewritten to the disk drives. This process is commonly referred to as a "Read-Modify-Write" sequence.

Thus, a Read and a Write on the single parity unit occurs each time a record is changed on any of the data storage units covered by the parity record on the parity unit. The parity unit becomes a bottle-neck to data writing operations since the number of changes to records which can be made per unit of time is a function of the access rate of the parity unit, as opposed to the faster access rate provided by parallel operation of the multiple data storage units. Because of this limitation, a RAID 4 system is generally not considered to be suitable for computer systems designed for OLTP purposes. Indeed, it appears that a RAID 4 system has not been implemented for any commercial purpose.

A RAID 5 architecture uses the same parity error correction concept of the RAID 4 architecture and independent actuators, but improves on the writing performance of a RAID 4 system by distributing the data and parity information across all of the available disk drives. Typically, "N+1" storage units in a set (also known as a "redundancy group") are divided into a plurality of equally sized address areas referred to as blocks. Each storage unit generally contains the same number of blocks. Blocks from each storage unit in a redundancy group having the same unit address ranges are referred to as "stripes". Each stripe has N blocks of data, plus one parity block on one storage unit containing parity for the remainder of the stripe. Further stripes each have a parity block, the parity blocks being distributed on different storage units. Parity updating activity associated with every modification of data in a redundancy group is therefore distributed over the different storage units. No single unit is burdened with all of the parity update activity.

For example, in a RAID 5 system comprising 5 disk drives, the parity information for the first stripe of blocks may be written to the fifth drive; the parity information for the second stripe of blocks may be written to the fourth drive; the parity information for the third stripe of blocks may be written to the third drive; etc. The parity block for succeeding stripes typically "precesses" around the disk drives in a helical pattern (although other patterns may be used).

Thus, no single disk drive is used for storing the parity information, and the bottleneck of the RAID 4 architecture is eliminated. An example of a RAID 5 system is described in U.S. Pat. No. 4,761,785 to Clark et al.

As in a RAID 4 system, a limitation of a RAID 5 system is that a change in a data block requires a Read-Modify-Write sequence comprising two Read and two Write operations: the old parity block and old data block must be read and XOR'd, and the resulting sum must then be XOR'd with the new data. Both the data and the parity blocks then must be rewritten to the disk drives. While the two Read operations may be done in parallel, as can the two Write operations, modification of a block of data in a RAID 4 or a RAID 5 system still takes substantially longer than the same operation on a conventional disk. A conventional disk does not require the preliminary Read operation, and thus does not have to wait for the disk drives to rotate back to the previous position in order to perform the Write operation. The rotational latency time alone can amount to about 50% of the time required for a typical data modification operation. Further, two disk storage units are involved for the duration of each data modification operation, limiting the throughput of the system as a whole. Despite the Write performance penalty, RAID 5 type systems have become increasingly popular, since they provide high data reliability with a low overhead cost for redundancy, good Read performance, and fair Write performance.

Although different RAID systems have been designed, to date, such systems are rather inflexible, in that only one type of redundant configuration is implemented in each design. Thus, for example, redundant array storage systems have generally been designed to be only a RAID 3 or only a RAID 5 system. When the principal use of a redundant array storage system is known in advance, such rigidity of design may not pose a problem. However, uses of a storage system can vary over time. Indeed, a user may have need for different types of RAID systems at the same time, but not have the resources to acquire multiple storage systems to meet those needs. As importantly, different users have different needs; designing redundant array storage systems with different RAID configurations to meet such disparate needs is expensive.

It thus would be highly desirable to have a flexible RAID-architecture storage system in which the basic redundancy configuration could be altered for each user, or as a user's needs change. It would also be desirable to have a flexible RAID-architecture storage system in which different types of redundancy configuration can be simultaneously implemented.

The present invention provides such a system.
`
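As a concrete illustration of the Read-Modify-Write sequence discussed above for RAID 4 and RAID 5 arrays (a sketch with assumed helper names, not the patent's implementation), the new parity is simply the old parity XOR'd with the old data and the new data, which accounts for the two Reads and two Writes:

    #define BLK 512   /* bytes per block (example value) */

    /* Assumed, hypothetical device helpers; not defined by the patent. */
    void read_block(int unit, long blk, unsigned char buf[BLK]);
    void write_block(int unit, long blk, const unsigned char buf[BLK]);

    /* Small write to one data block; in RAID 4 the parity_unit is always
       the same unit, in RAID 5 it differs from stripe to stripe.        */
    void small_write(int data_unit, int parity_unit, long blk,
                     const unsigned char new_data[BLK])
    {
        unsigned char old_data[BLK], old_parity[BLK], new_parity[BLK];

        read_block(data_unit, blk, old_data);        /* Read 1  */
        read_block(parity_unit, blk, old_parity);    /* Read 2  */

        for (int i = 0; i < BLK; i++)
            new_parity[i] = old_parity[i] ^ old_data[i] ^ new_data[i];

        write_block(data_unit, blk, new_data);       /* Write 1 */
        write_block(parity_unit, blk, new_parity);   /* Write 2 */
    }
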
`
`SUMMARY OF THE INVENTION
`
`
`The RAID architecture of the present invention is
`extremely flexible, and permits a redundant array storage
system to be configured as a RAID 1, 3, 4, or 5 system, or any
combination of these configurations. The invention com-
prises a configuration data structure for addressing a redun-
dant array storage system, and a method for configuring a
redundant array storage system during an initialization pro-
`cess.
`The redundant array storage system comprises a set of
`physical storage units which are accessible in terms of block
`numbers (a block comprises one or more sectors). As part of
`the initialization process, the physical storage units are each
`configured as one or more logical storage units. Each logical
`storage unit is addressed in terms of a channel number,
`storage unit number, starting block number, and offset
`number (the number of blocks to be transferred is also
`specified when doing transfers).
`Once logical storage units are defined, logical volumes
`are defined as one or more logical storage units, each logical
`volume having a depth characteristic.
`After the logical volumes are defined, redundancy groups
`are defined as one or more logical volumes. In the present
`invention, a redundancy level is specified for each redun-
`dancy group. The redundancy level may be none, one (e.g.,
XOR parity or an error-correction code, such as a Reed-
Solomon code), or two (e.g., XOR parity plus a Reed-
`Solomon error-correction code).
Alternatively, redundancy groups are defined as one or
`more logical storage units, and logical volumes are defined
`as a member of a redundancy group.
`Logical volumes are addressed by a host CPU by volume
`number, initial block number, and number of blocks to be
`transferred. The host CPU also specifies a READ or WRITE
operation. The specified volume number, initial block num-
ber, and number of blocks to be transferred are then trans-
`lated into a corresponding channel number, storage unit
`number, starting block number, offset number, and number
`of blocks to be transferred.
`With the present invention, it is possible for a logical
`volume to span across physical storage units ("vertical
`partitioning"), comprise only a portion of each such physical
`storage unit ("horizontal partitioning"), and have definable 20
`depth and redundancy characteristics.
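One way to picture the configuration data structure and address translation summarized above is the following sketch (hypothetical C declarations with assumed field names; the patent describes the structure in prose and in FIGS. 3A-3B, not as code):

    /* A logical storage unit: a region of one physical storage unit. */
    struct logical_unit {
        int  channel;        /* channel number                        */
        int  unit;           /* storage unit number on that channel   */
        long start_block;    /* starting block number                 */
        long num_blocks;     /* extent of the logical storage unit    */
    };

    /* A logical volume: one or more logical storage units plus depth. */
    struct logical_volume {
        int                  n_units;
        struct logical_unit *units;
        int                  depth;   /* blocks placed on one unit before
                                         moving on to the next (assumed
                                         meaning of the depth field)    */
    };

    /* A redundancy group: one or more logical volumes plus a level of
       redundancy (0 = none, 1 = e.g. XOR parity, 2 = e.g. P plus Q).  */
    struct redundancy_group {
        int                    n_volumes;
        struct logical_volume *volumes;
        int                    redundancy_level;
    };

    /* The host names only (volume, initial block, count, READ/WRITE);
       the controller translates that to (channel, unit, starting block,
       offset, count) using the structures above.                       */
    struct host_request { int volume; long block; long count; int is_write; };
    struct unit_request { int channel; int unit; long start_block;
                          long offset; long count; };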
`The details of the preferred embodiment of the present
`invention are set forth in the accompanying drawings and
`the description below. Once the details of the invention are
`known, numerous additional innovations and changes will
`become obvious to one skilled in the art.
`
`
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a generalized RAID system in accordance with the present invention.

FIG. 2A is a diagram of a model RAID system, showing a typical physical organization.

FIG. 2B is a diagram of a model RAID system, showing a logical organization of the physical array of FIG. 2A, in which each physical storage unit is configured as two logical storage units.

FIG. 2C is a diagram of a model RAID system

Physical Storage Units

A typical physical storage unit, such as a magnetic or
`optical disk drive, comprises a set of one or more rotating
`disks each having at least one read/write transducer head per
surface. Data storage areas known as tracks are concentri-
`cally arranged on the disk surfaces. A disk storage unit may
`have, for example, 500 to 2000 tracks per disk surface. Each
`track is divided into numbered sectors that are commonly
`512 bytes in size. Sectors are the smallest unit of storage area
`that can be accessed by the storage unit (data bits within a
`sector may be individually altered, but only by reading an
`entire sector, modifying selected bits, and writing the entire
`sector back into place). A disk storage unit may have 8 to 50
`sectors per track, and groups of tracks may have differing
numbers of sectors per track on the same disk storage unit
`(e.g., smaller circumference inner tracks may have fewer
`sectors per track, while larger circumference outer tracks
`may have more sectors per track).
`Access to a sector ultimately requires identification of a
`sector by its axial displacement along the set of rotating
`disks, radial displacement on a disk, and circumferential
`displacement around a disk. Two common schemes are used
`for such identification. One scheme identifies a sector by a
`surface or head number (axial displacement), a track number
`(radial displacement), and a sector number (circumferential
`displacement). The second scheme treats all of the tracks
`with the same radius on all disks as a "cylinder", with tracks
`being subsets of a cylinder rather than of a surface. In this
`scheme, a sector is identified by a cylinder number (radial
displacement), a track number (axial displacement), and a
`sector number (circumferential displacement). The present
`invention can be implemented using either form of physical
`identification.
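As a small illustration of the two identification schemes just described (assumed structure and field names, not the patent's):

    /* Scheme 1: surface (head), track, sector. */
    struct hts_address {
        int head;      /* axial displacement: surface number     */
        int track;     /* radial displacement                    */
        int sector;    /* circumferential displacement           */
    };

    /* Scheme 2: cylinder, track within the cylinder, sector. */
    struct cts_address {
        int cylinder;  /* radial displacement                    */
        int track;     /* axial displacement within the cylinder */
        int sector;    /* circumferential displacement           */
    };

    /* Illustrative only: flatten a cylinder/track/sector address to a
       single linear sector number, assuming (unlike many real drives)
       a uniform number of sectors per track.                          */
    long linear_sector(struct cts_address a,
                       int tracks_per_cylinder, int sectors_per_track)
    {
        return ((long)a.cylinder * tracks_per_cylinder + a.track)
                   * sectors_per_track + a.sector;
    }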
It is possible for a higher level storage controller (or even
`the CPU) to keep track of the location of data on a storage
`unit by tracking all involved sectors. This is commonly done
`with magnetic disk drives following the well-known ST-506
`interface standard used in personal computers. Storage units
`addressed in this manner are known as sector-addressable.
However, it is inconvenient in modern computer systems
`for a high-level storage controller to keep track of sector
`addresses by either of the addressing schemes described
`above. Therefore, in the preferred embodiment of the inven-
`tion, an alternative form of storage unit addressing is used
`that maps the sectors of a storage unit to a more tractable
`form.
`This mapping is accomplished by treating one or more
`sectors as a block, as is known in the art, and addressing each
storage unit by block numbers. A block on the storage units
`used in the preferred embodiment of the inventive system
`can vary from 512 bytes up to 4096 bytes, but may be of any
`size (although commonly block sizes are limited to multiples
`of two bytes, for ease of implementation). The storage units
being used must support the specified block size. In addition,
`such storage units mark defective sectors in such a way that
`they are not used to form blocks. (Some storage units can
also dynamically "map out" defective blocks during opera-
tion in order to always present to external devices a set of
contiguously numbered blocks). Each storage unit is then
`considered by a higher level controller to be a "perfect"
`physical device comprising a set of contiguously numbered
`logical blocks. Such units are known as block-addressable.
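A brief sketch of the block-addressable view described above (block size and names are assumptions for illustration): a higher level controller deals only in block numbers and transfer lengths, while the unit itself hides sector geometry and defect handling.

    #define SECTORS_PER_BLOCK 4     /* example: 4 x 512-byte sectors = 2 KB */

    /* All a block-addressable unit needs to be told. */
    struct block_request {
        long first_block;           /* starting block number        */
        long num_blocks;            /* number of blocks to transfer */
    };

    /* Inside a "perfect" unit, block b maps to a contiguous run of sectors. */
    long first_sector_of_block(long b)
    {
        return b * SECTORS_PER_BLOCK;
    }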
`For example, with storage units having a Small Computer
`System Interface ("SCSI"), each storage unit is considered
`to be a contiguous set of blocks. An access request to such
`a unit simply specifies the numbers of the blocks that are to