`Wang et al.
`
`I 1111111111111111 11111 lllll lllll lllll lllll 111111111111111 1111111111 11111111
`US006834326Bl
`
`(10) Patent No.:
`(45) Date of Patent:
`
`US 6,834,326 Bl
`Dec. 21, 2004
`
`(54) RAID METHOD AND DEVICE WITH
`NETWORK PROTOCOL BETWEEN
`CONTROLLER AND STORAGE DEVICES
`
`(75)
`
`Inventors: Peter S. S. Wang, Cupertino, CA (US);
`David C. Lee, San Jose, CA (US)
`
`(73) Assignee: 3Com Corporation, Marlborough, MA
`(US)
`
`( *) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by O days.
`
`(21) Appl. No.: 09/527,904
`
`(22) Filed:
`
`Mar.17, 2000
`
`Related U.S. Application Data
`(60) Provisional application No. 60/180,574, filed on Feb. 4,
`2000.
`Int. Cl.7 ................................................ G06F 12/02
`(51)
`(52) U.S. Cl. ........................ 711/114; 711/162; 711/154;
`711/161; 370/355; 370/356
`(58) Field of Search ............................. 711/114, 4, 162,
`711/161, 154; 709/200; 710/1; 370/351-356
`
`(56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`5,805,920 A * 9/1998 Sprenkle et al. ............... 710/1
`6,000,020 A * 12/1999 Chin et al. .................. 711/162
`6,098,155 A * 8/2000 Chong, Jr.
`. ................. 711/138
`6,138,176 A * 10/2000 McDonald et al.
`............ 710/6
`6,141,788 A * 10/2000 Rosenberg et al.
`......... 714/774
`6,148,414 A * 11/2000 Brown et al. ................. 714/11
`6,272,522 Bl * 8/2001 Lin et al. .................... 709/200
`6,304,942 Bl * 10/2001 DeKoning .................. 711/114
`6,378,039 Bl * 4/2002 Ohara et al. ................ 709/105
`6,529,996 Bl * 3/2003 Nguyen et al.
`............. 711/114
`
`OTHER PUBLICATIONS
`
`D. Meyer, Introduction to IP Multicast (Jun., 1998) from
`www.nanog.org/mtg-9806/ppt/davemeyer/, pp. 1-49.
`T. Maufer and C. Semeria, Introduction to IP Multicast
`Routing, IETF Internet Draft (Apr. or May 1996), pp. 1-54.
`C. Semeria and T. Maufer, Introduction to IP Multicast
`Routing (undated) from www.cs.utexas.edu/users/dragon/
`cs3 78/resources/multl .pdf.
`Juan-Mariano de Goyeneche "Multicast Over TCP/IP
`HOWTO" Mar. 20, 1998 consisting of 34 pages http:/
`www.linuxdoc.org/HOWTO/Multicast-HOWTO.html.
`B. Quinn "IP Multicast Applications: Challenges and Solu(cid:173)
`tions" Nov. 1998 IP Multicast Initiative pp. 1-16.
`B. Quinn "IP Multicast Applications: Challenges and Solu(cid:173)
`tions" Jun. 25, 1999 Stardust Forums pp. 1-25.
`
`* cited by examiner
`
`Primary Examiner-Mano Padmanabhan
`Assistant Examiner-Jasmine Song
`(74) Attorney, Agent, or Firm-Ernest J. Beffel, Jr.; Haynes
`Beffel & Wolfeld LLP
`
`(57)
`
`ABSTRACT
`
`The present invention relates to transparent access to a
`redundant array of inexpensive devices. In particular, it
`provides a method and device for connecting redundant disk
`drives to a controller, preferably an intelligent switch, via a
`network. The disks are controlled by commands transported
`across the network. Commands may be SCSI, IDE/ATA or
`other commands. The network may comprise ethernet, fiber
`channel or other physical layer protocol. Commands can be
`encapsulated in IP packets and transmitted using either a
`reliable or unreliable transport protocol. Multicasting of
`packets from the controller to the multiple disk drives is part
`of the present invention.
`
`74 Claims, 10 Drawing Sheets
`
`Volume 1
`
`Volume 3
`
`Volume 4
`
`Volume 2
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1007, p. 1
`
`
`
`U.S. Patent
`
`Dec. 21, 2004
`
`Sheet 1 of 10
`
`US 6,834,326 Bl
`
`..-
`
`-
`-
`
`A
`
`R
`
`C
`
`T)
`
`A
`
`R
`
`C
`
`T)
`
`~
`
`E
`
`F
`
`G
`
`H
`
`-
`-
`
`E
`
`F
`
`G
`
`H
`
`Disk 1
`
`Disk 2
`
`Disk3
`
`Disk4
`
`Fig. 1
`
`,,..
`
`R
`
`F
`
`.T
`
`N
`
`A
`
`R
`
`T
`
`M
`
`Disk 1
`
`Disk2
`
`-
`
`-
`
`~
`
`-
`
`n
`
`H
`
`L
`
`p
`
`(;
`
`CT
`
`K
`
`0
`
`Disk3
`
`Disk4
`
`~
`
`Fig. 2
`
`-
`
`BlockO
`-
`
`Block 1
`
`Block2
`
`Parity
`Generation
`\
`>----. BlockO, 1,
`J
`2. 3 Paritv
`
`Block3
`
`AO
`
`m
`rn
`m
`
`Al
`
`Rl
`
`(;1
`
`rn
`
`~
`
`-
`
`A2
`
`R2
`
`C2
`
`m
`
`A1
`m
`
`(;1
`
`rn
`
`ask.I
`
`nsk.2
`
`nsk.3
`
`nsk.4
`
`Fig. 3
`
`Atmtv
`
`Btmtv
`
`Ctmtv
`
`Dtmtv
`
`nsk.5
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1007, p. 2
`
`
`
`U.S. Patent
`
`Dec. 21, 2004
`
`Sheet 2 of 10
`
`US 6,834,326 Bl
`
`YD-
`
`Parity
`GenT£nim
`
`-
`
`Aihks
`-
`-
`
`AO
`
`Al
`
`A2
`
`-
`
`-
`
`-
`
`3tmtv
`
`~
`
`Ilskl
`
`Bllrl<s
`
`m
`
`Rl
`
`.._
`
`2JDity
`
`~
`
`-
`
`ffi
`
`Ilsk2
`Fig. 4
`
`Cllrl<s
`-
`-
`
`m
`
`1 JDity
`
`(2
`
`0
`
`....__
`
`D
`
`OJBitv
`
`D1
`
`rn
`
`rn
`
`Ilsk3
`
`Ilsk4
`
`RAID Controller/
`Switch
`
`Volume 1
`
`Volume2
`
`Fig. 5
`
`Volume 1
`
`Volume 3
`
`Volume 4
`
`Volume 2
`
`Fig. 6
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1007, p. 3
`
`
`
`U.S. Patent
`
`Dec. 21, 2004
`
`Sheet 3 of 10
`
`US 6,834,326 Bl
`
`User
`
`Kernel
`
`Applications
`
`Filesystem
`
`Cache System
`
`Block Drivers (SCS ,
`etc.)
`
`TCP
`
`UDP
`
`NetSCSI / NetRAID
`
`IP
`
`802
`
`Fig. 7
`
`Read/
`(cid:141) Multiack
`---------(cid:141)
`Data
`
`\
`
`\
`
`\
`
`\
`
`\
`
`\
`
`\
`
`\ '
`4,8,
`12
`
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`
`, ,
`
`,
`
`I
`I
`I
`
`\
`
`'
`
`'
`
`I
`I
`I
`I
`I
`I
`
`' ,
`' \
`
`\
`
`' \
`
`\
`'
`
`\
`
`' \
`
`'
`
`\
`
`\
`
`\
`
`\
`
`\
`
`\
`
`/
`
`,
`, '
`
`,
`, '
`
`,
`
`/
`
`,
`
`/
`
`,
`
`I
`
`/
`,'
`
`, , , ,
`, ,
`, , ,
`,
`
`1,5,
`9
`
`2,6,
`10
`
`3,7,
`11
`
`Fig. Ba
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1007, p. 4
`
`
`
`U.S. Patent
`
`Dec. 21, 2004
`
`Sheet 4 of 10
`
`US 6,834,326 Bl
`
`Read/
`(cid:141) Multiack
`---------(cid:141)
`Data
`
`1-
`100
`
`101-
`200
`
`201-
`300
`
`301-
`400
`
`Fig. 8b
`
`Read/
`(cid:141) Multiack
`---------(cid:141)
`Data
`
`1-
`100
`
`1-
`100
`
`1-
`100
`
`1-
`100
`
`, , ' I
`,,' ,'
`,,
`,,,
`'
`,,,'
`'
`,
`,,'
`,'
`/
`,,'
`,,
`'
`,,,,,,'
`
`Write/
`(cid:141)
`Data
`---------(cid:141) Multiack
`
`,,,,'
`
`1-
`100
`
`1-
`100
`
`1-
`100
`
`1-
`100
`
`Fig. 8c
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1007, p. 5
`
`
`
`U.S. Patent
`
`Dec. 21, 2004
`
`Sheet 5 of 10
`
`US 6,834,326 Bl
`
`Read/
`(cid:141) Multiack
`------(cid:141)
`Data
`
`1-
`100
`
`101-
`200
`
`201~
`300
`
`P(1-
`300)
`
`Write
`Data
`Ack
`
`1-
`100
`
`201-
`300
`
`P(1-
`300)
`
`Fig. 8d
`
`1-
`100
`
`117
`~
`
`Read/
`(cid:141) Multiack
`-----(cid:141)
`Data
`
`~
`- - • - Write
`~
`~ ----(cid:141)
`Data
`·--,~- ->
`Ack
`~- ~
`12ml
`P(1-
`~
`300)
`
`...;~
`
`Fig. Se
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1007, p. 6
`
`
`
`U.S. Patent
`
`Dec. 21, 2004
`
`Sheet 6 of 10
`
`US 6,834,326 Bl
`
`CDMMAND AaESS METHOD
`
`1RANSPORT
`
`PHYSICAL
`
`Fig. 9
`
`SCSI
`
`NIC
`
`SCSI
`
`NIC
`
`•••
`······---~·-~-~-------------------~--,----·········---... ---········
`
`Local Area Network
`
`®
`
`:
`:
`
`• NetSCSIDisk
`:
`
`NetSCSI Disk
`
`:
`
`•
`
`:------(!)------•
`
`I
`
`Switch
`
`Fig. 10
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1007, p. 7
`
`
`
`U.S. Patent
`
`Dec. 21, 2004
`
`Sheet 7 of 10
`
`US 6,834,326 Bl
`
`user
`
`kernel
`
`1101
`
`1102
`
`1103
`
`1104
`
`1105
`
`1106
`
`1107
`
`1108
`
`1109
`
`Linux Platform
`
`Applications
`
`File System
`
`Block
`Interface
`
`SCSI layer
`
`NetSCSI device driver
`
`IP
`
`wire
`
`Fig. 11
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1007, p. 8
`
`
`
`U.S. Patent
`
`Dec. 21, 2004
`
`Sheet 8 of 10
`
`US 6,834,326 Bl
`
`user
`
`kernel
`
`Linux Platfonn
`
`Applications
`
`File System
`
`RAID
`
`Block
`Interface
`
`SCSI layer
`
`1201
`
`1202
`
`1203
`
`1204
`
`1205
`
`NetSCSI device driver
`
`1206
`
`STP - - - - - - - 1207
`
`1208
`
`1209
`
`1210
`
`IP
`
`wire
`
`Fig. 12
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1007, p. 9
`
`
`
`U.S. Patent
`
`Dec. 21, 2004
`
`Sheet 9 of 10
`
`US 6,834,326 Bl
`
`1301
`
`SCSI layer
`
`Send Process
`~-~--~__r-1310
`ahal 740_queuecommand
`
`Receive Process
`
`done
`
`1317
`
`.----~----.__r-1311
`netscsi_send
`
`del_timer ~ 1316
`
`1315
`
`1314
`
`yes
`
`1312
`
`ahal 740_rx_handler
`
`Packet send
`
`Packet received
`
`Networking Code
`
`1313
`
`Fig. 13
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1007, p. 10
`
`
`
`U.S. Patent
`
`Dec. 21, 2004
`
`Sheet 10 of 10
`
`US 6,834,326 Bl
`
`NetSCSI Disk (Client)
`
`Switch PC (Server)
`
`1401
`
`1403
`
`1407
`
`1411
`
`NCDP
`
`RESPONSE
`
`DISK_SETUP
`
`SWITCH_SETUP
`
`1402
`
`1404
`
`GETDISKP _RAID
`
`"-1406
`
`DISKP_RAID
`
`SETDISKP _RAID
`
`" - 1408
`
`RESPONSE
`
`KEEPALIVE
`
`___r- 1410
`
`-.....___r-1412
`
`Fig. 14
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1007, p. 11
`
`
`
`US 6,834,326 Bl
`
`1
`RAID METHOD AND DEVICE WITH
`NETWORK PROTOCOL BETWEEN
`CONTROLLER AND STORAGE DEVICES
`
`RELATED APPLICATION DATA
`
`Applicant claims the benefit of Provisional Application
`No. 60/180,574 entitled Networked RAID Method and
`Device, filed 4 Feb. 2000.
`
`BACKGROUND OF THE INVENTION
`
`5
`
`2
`of storage devices plus an additional extent for parity data.
`Again, failure of one device can be overcome by recon(cid:173)
`structing the data on the failed device from the remaining
`devices.
`Implementation of RAID has typically required trained
`information technology professionals. Network administra(cid:173)
`tors without formal information technology training may
`find it difficult to choose among the alternative RAID
`configurations. Growth in storage needs may require choice
`10 of a different configuration and migration from an existing
`to a new RAID configuration. The selection of hardware to
`implement an array may involve substantial expense. A
`relatively simple, inexpensive and preferably self-
`configuring scheme for redundant storage is desired.
`Therefore, it is desirable to provide redundant storage
`across a network to assure redundancy while providing
`automatic configuration to reduce the total cost of system
`ownership. It is further desirable to take advantage of
`network-oriented protocols, such as multicasting packets, to
`20 implement redundant storage in an efficient way.
`
`SUMMARY OF INVENTION
`
`The present invention includes using packet multicasting
`to implement redundant configurations of networked storage
`devices. It is preferred to use an intelligent switch, in
`communication with a plurality of storage devices, as a
`RAID controller. The intelligent switch may function as a
`thinserver, supporting a file system such as NFS or CIFS. By
`cascading switches, one or more of the storage devices in
`communication with the switch may actually be another
`switch, appearing as a virtual storage device.
`The present invention includes multicasting command
`packets. In some protocols, such as SCSI, a series of
`commands may be required to establish communications
`between the initiator and the target, followed by data trans-
`fer. Multiple commands can be multicast in single packet.
`Read commands can advantageously be multicast to retrieve
`data from identically located blocks in different partitions or
`extents of data. Write commands can be multicast to estab(cid:173)
`lish communications. Data also may be multicast for certain
`configurations of RAID, such as mirrored disks and byte(cid:173)
`wise striped disks.
`Another aspect of the present invention is use of forward
`error correction to avoid the need for reliable multicast
`protocols. A variety of forward error correction algorithms
`may be used, including parity blocks. Forward error correc(cid:173)
`tion can be implemented on a selective basis. Error correc-
`tion can be implemented when packet loss exceeds a thresh(cid:173)
`old or on a transaction-by-transaction basis. Error correction
`can be supplemented by a dampening mechanism to mini(cid:173)
`mize oscillation.
`Another aspect of the present invention is control of
`multicast implosion. When multicast read commands are
`issued to numerous disks or for a large quantity of data,
`delay and backoff algorithms minimize oscillation.
`The present invention is suitably implemented on a switch
`in communication with multiple storage devices over raw
`ethernet, IP over UDP, fibre channel, or other communica-
`tions protocol.
`
`BRIEF DESCRIPTION OF FIGURES
`FIG. 1 illustrates RAID-1 using two pairs of disks.
`FIG. 2 illustrates RAID O spanning and interleaving of
`data across four disks.
`FIG. 3 illustrates RAID 4 using disk 5 for parity data.
`
`1. Field of the Invention
`The present invention relates to use of multicasting to
`implement a redundant storage protocol such as a redundant
`array of inexpensive devices (RAID), mirroring or striping 15
`of storage devices, where the storage devices are separated
`from a controller by a network. It is preferred for the
`controller to be hosted on an intelligent switch and for the
`controller to use SCSI commands to control the storage
`devices, although it alternatively can use IDE/ATA com(cid:173)
`mands. The network may comprise ethernet, fibre channel or
`another physical layer protocol. One or more commands,
`such SCSI commands, can be encapsulated directly in a
`network packet or encapsulated in IP packet. When com(cid:173)
`mands are encapsulated in IP packets, either a reliable or 25
`unreliable transport protocol can be used. Preferably, the use
`a reliable multicast protocol can be avoided.
`2. Description of Related Art
`Network-attached storage (NAS) is a means to fulfill the
`needs of clients who want fast, scalable, high-bandwidth
`access to their data. It is a means to improve scalability of
`existing distributed file systems by removing the server as a
`bottleneck in using the network as bandwidth to allow
`parallelism by using striping and more efficient data pass.
`Significant to implementing network-attached storage is a
`trend towards increasingly intelligent storage devices. Intel(cid:173)
`ligent storage devices provide functionality previously
`reserved to dedicated servers.
`A second advantage of network-attached storage is 40
`reduced total cost of ownership (TCO). Network-attached
`storage offers convenient placement of storage devices.
`Plug-and-play configuration can be incorporated. Adminis(cid:173)
`tration can be simplified, reducing the load on or need for
`information technology professionals who are needed to 45
`maintain and fine-tune dedicated servers.
`A variation on the traditional client/server or Server
`Integrated Disk (SID) system uses SCSI commands across a
`network. This scheme is referred to as Server Attached Disk
`(SAD). This new model allows storage devices to be arbi- 50
`trarily placed on a network. A SCSI protocol such as
`NetSCSI is used by the server to control the remotely located
`storage devices. Users of a NetSCSI system can have their
`own storage area network (SAN) using their existing net(cid:173)
`works and intelligent SCSI hard disks, without the added 55
`cost and complexity of high-end storage area network sys-
`terns.
`An additional concern that is always present in storage
`systems and may be accentuated by network-attached stor(cid:173)
`age is redundancy. Recovery from failure of an individual
`storage device requires some degree of redundancy. File
`servers have used redundant arrays of inexpensive devices
`(RAID) to provide redundancy. Several RAID levels have
`been defined. RAID level 1, for instance, involves mirroring
`of paired storage extents on distinct physical devices. Fail- 65
`ure of one device leaves the other device available. RAID
`level 4, as an alternative, involves using two or more extents
`
`30
`
`35
`
`60
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1007, p. 12
`
`
`
`US 6,834,326 Bl
`
`3
`FIG. 4 illustrates RAID 5 placing parity data on alternat(cid:173)
`ing disks.
`FIG. 5 illustrates a preferred architectural framework for
`a Switched RAID system.
`FIG. 6 illustrates cascading of virtual disks in a Switched 5
`RAID system.
`FIG. 7 illustrates a preferred software architecture for a
`RAID controller, such as one running on the switch.
`FIGS. Sa-Se illustrate multicasting from a source to 10
`storage devices and unicast responses.
`FIG. 9 illustrates the SCSI architecture layering.
`FIG.10 depicts an embodiment ofNetSCSI utilizing a PC
`as a switch.
`FIG. 11 depicts the layered design of a Linux platform. 15
`FIG. 12 depicts an implementation of RAID on a Linux
`platform.
`FIG. 13 is a flowchart illustrating sending and receiving
`processes for a NetSCSI device driver.
`FIG. 14 illustrates a sequence of messages by which a
`NetSCSI disk may initiate a session.
`
`4
`performance, as it implies that disks are access sequentially.
`The most obvious performance improvement is to interleave
`or stripe the data so that it can be accessed in parallel. Thus,
`n small and inexpensive disks theoretically can provide n
`times the storage of one disk at n times the performance of
`one disk using striping.
`To improve reliability, two mechanisms are used. One is
`to simply copy, or mirror, the data. To mirror the four disks
`four additional disks are required, which is not an efficient
`use of capacity. Another approach is to add a parity disk to
`provide error detection and correction information. Parity is
`calculated as the exclusive-or (XOR) of the remaining data
`disks. If one data disks fails, the data can be recovered by
`reading the remaining data disks, including the parity disk,
`the bit or whole that failed can be reconstructed. A parity
`disk can be implemented on a stripe-wise basis, to minimize
`impact on read performance.
`These features plus variations on how the parity is imple(cid:173)
`mented have been standardized as RAID levels. There are
`20 six levels of RAID defined:
`RAID linear-spanned
`RAID 0-striped
`RAID 1-mirroring
`RAID 2-not used in practice
`25 RAID 3-byte striped with parity disk
`RAID 4-block striped with parity disk
`RAID 5-RAID 4 with rotating parity
`RAID 6 -similar to RAID 5 except for physical to logical
`mapping difference
`30 The different RAID levels have different performance,
`redundancy, storage capacity, reliability and cost character(cid:173)
`istics. Most, but not all levels of RAID offer redundancy
`against disk failure. Of those that offer redundancy, RAID-1
`and RAID-5 are the most popular. RAID-1 offers better
`35 performance, while RAID-5 provides for more efficient use
`of the available storage space. However, tuning for perfor(cid:173)
`mance is an entirely different matter, as performance
`depends strongly on a large variety of factors, from the type
`of application, to the sizes of stripes, blocks, and files.
`RAID-linear is a simple concatenation of partitions to
`create a larger virtual partition. It is used to create a single,
`large partition from a number small drives. Unfortunately,
`this concatenation offers no redundancy, and even decreases
`the overall reliability. That is, if any one disk fails, the
`45 combined partition will fail.
`RAID-1 is involves mirroring. Mirroring is where two or
`more partitions ( or drives), all of the same size, each store
`an exact copy of all data, disk-block by disk-block. The
`logical addresses of corresponding disk-blocks in extents on
`50 different physical drives are the same. Mirroring gives
`strong protection against disk failure. If one disk should fail,
`another exists with the an exact copy of the same data.
`Mirroring can also help improve performance in 1/O-laden
`systems, as read requests can be divided up between several
`55 disks. Unfortunately, mirroring is also the least efficient in
`terms of storage: two mirrored partitions can store no more
`data than a single partition. FIG. 1 illustrates RAID-1. The
`letters refer to the and order of data. That is, the same letter
`indicates the same data. The data is ordered A, B, C, D, E
`60 and so on. The same data is duplicated on two disks: disk 1
`is mirror of disk 2 and disk 3 is mirror of disk 4.
`Striping is the underlying concept behind all RAID levels.
`Other than RAID-1 a stripe is a contiguous sequence of disk
`blocks. A stripe may be as short as a single disk block, or
`65 may consist of thousands of blocks. The RAID drivers split
`up their component disk partitions into stripes; the different
`RAID levels differ in how they organize the stripes, and
`
`DETAILED DESCRIPTION
`
`A form of networked RAID (NetRAID) is required to
`provide the reliability and performance needed for small and
`medium sized enterprises. While RAID can be implemented
`using SCSI-over-IP technology, the present invention will
`leverage multicast capabilities to further improve perfor(cid:173)
`mance. NetRAID may extend SCSI-over-IP (NetSCSI) auto(cid:173)
`configuration mechanisms to provide RAID autoconfigura(cid:173)
`tion. NetSCSI protocol can be enhanced to support
`NetRAID.
`Most storage devices are internal to a storage server and
`are accessed by the server's processor through effectively
`lossless, high-speed system buses. This simplifies the design
`of the storage hardware and software. The conventional
`wisdom is that LAN technology will not suffice in the
`storage world because it cannot provide the performance or
`reliability required. New gigabit Ethernet (GE) technologies 40
`may disprove the conventional wisdom. While the highest
`end of the market may remain loyal to server centric
`technologies, small and medium sized enterprises, espe(cid:173)
`cially price sensitive organizations, are expected to embrace
`new options.
`In existing network systems, file stacks (filesystems and
`block drivers such as SCSI drivers) are not implemented to
`handle data loss. Data loss is generally viewed as a hardware
`error and the result is a time consuming hardware reset
`process. Also, timeouts are not effectively set and used
`which makes simple implementations of GE-based SANs
`more difficult. Minor changes in the implementation of file
`stacks are required to ensure that GE performance is com(cid:173)
`parable to Fibre Channel (FC.) With standardization of
`NetSCSI and NetRAID, appropriate modifications may be
`realized.
`Background: RAID Primer
`Redundant arrays of inexpensive disks (RAID) are used to
`improve the capacity, performance, and reliability of disk
`subsystems, allowing the potential to create high(cid:173)
`performance terrabyte storage using massive arrays of cheap
`disks. RAID can be implemented in software or hardware,
`with the hardware being transparent to existing SCSI or FC
`drivers.
`The premise of RAID is that n duplicate disks can store
`n times more data than one disk. This spanning property, by
`itself, simply provides capacity and no improvement in
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1007, p. 13
`
`
`
`US 6,834,326 Bl
`
`5
`
`5
`what data they put in them. The interplay between the size
`of the stripes, the typical size of files in the file system, and
`their location on the disk is what determines the overall
`performance of the RAID subsystem.
`RAID-0 is much like RAID-linear, except that the com-
`ponent partitions are divided into stripes and then inter(cid:173)
`leaved. Like RAID-linear, the result is a single larger virtual
`partition. Also like RAID-linear, it offers no redundancy, and
`therefore decreases overall reliability: a single disk failure
`will knock out the whole thing. RAID-0 is often claimed to 10
`improve performance over the simpler RAID-linear.
`However, this may or may not be true, depending on the
`characteristics to the file system, the typical size of the file
`as compared to the size of the stripe, and the type of
`workload. The ext2fs file system already scatters files 15
`throughout a partition, in an effort to minimize fragmenta(cid:173)
`tion. Thus, at the simplest level, any given access may go to
`one of several disks, and thus, the interleaving of stripes
`across multiple disks offers no apparent additional advan(cid:173)
`tage. However, there are performance differences, and they 20
`are data, workload, and stripe-size dependent. This is shown
`in FIG. 2. The meaning of letters is the same as that
`described in RAID-1. The data is copied to disk 1, then the
`next data is copied to disk 2 and so on.
`RAID-4 interleaves stripes like RAID-0, but it requires an 25
`additional partition to store parity information. The parity is
`used to offer redundancy: if any one of the disks fail, the data
`on the remaining disks can be used to reconstruct the data
`that was on the failed disk. Given N data disks, and one
`parity disk, the parity stripe is computed by taking one stripe 30
`from each of the data disks, and XOR'ing them together.
`Thus, the storage capacity of a an (N+l)-disk RAID-4 array
`is N, which is a lot better than mirroring (N+l) drives, and
`is almost as good as a RAID-0 setup for large N. Note that
`for N=l, where there is one data drive, and one parity drive, 35
`RAID-4 is a lot like mirroring, in that each of the two disks
`is a copy of each other. However, RAID-4 does not offer the
`read-performance of mirroring, and offers considerably
`degraded write performance. This is because updating the
`parity requires a read of the old parity, before the new parity 40
`can be calculated and written out. In an environment with
`lots of writes, the parity disk can become a bottleneck, as
`each write must access the parity disk. FIG. 3 shows RAID-4
`where N=4. RAID-4 requires a minimum of 3 partitions (or
`drives) to implement.
`RAID-2 and RAID-3 are seldom used in practice, having
`been made somewhat obsolete by modern disk technology.
`RAID-2 is similar to RAID-4, but stores Error Correcting
`Code (ECC) information instead of parity. Since all modem
`disk drives incorporate ECC under the covers, this offers 50
`little additional protection. RAID-2 can offer greater data
`consistency if power is lost during a write; however, battery
`backup and a clean shutdown can offer the same benefits.
`RAID-3 is similar to RAID-4, except that it uses the smallest
`possible stripe size. As a result, any given read will involve 55
`all disks, making overlapping 1/0 requests difficult/
`impossible. In order to avoid delay due to rotational latency,
`RAID-3 requires that all disk drive spindles be synchro(cid:173)
`nized. Most modem disk drives lack spindle(cid:173)
`synchronization ability, or, if capable of it, lack the needed 60
`connectors, cables, and manufacturer documentation. Nei(cid:173)
`ther RAID-2 nor RAID-3 are supported by the Linux
`software RAID drivers.
`RAID-5 avoids the write-bottleneck of RAID-4 resulting
`from the need to update parity using a read of the old parity, 65
`before the new parity can be calculated and written out, by
`alternately storing the parity stripe on each of the drives.
`
`6
`However, write performance is still not as good as for
`mirroring, as the parity stripe must still be read and XOR'ed
`before it is written. Read performance is also not as good as
`it is for mirroring, as, after all, there is only one copy of the
`data, not two or more. RAID-S's principle advantage over
`mirroring is that it offers redundancy and protection against
`single-drive failure, while offering far more storage capacity
`when used with three or more drives. This is illustrated in
`FIG. 4. RAID-5 requires a minimum of 3 partitions (or
`drives) to implement.
`Other RAID levels have been defined by various research(cid:173)
`ers and vendors. Many of these represent the layering of one
`type of raid on top of another. Some require special
`hardware, and others are protected by patent. There is no
`commonly accepted naming scheme for these other levels.
`Sometime the advantages of these other systems are minor,
`or at least not apparent until the system is highly stressed.
`Besides RAID-linear, RAID-0, RAID-1, RAID-4 and
`RAID-5, Linux software RAID does not support any of the
`other variations.
`Background: RAID over SCSI/FC
`Using SCSI, an initiator (say a host CPU) will send a
`command to a particular target (disk) and then the target will
`control the remaining transactions. Because a target might
`take some time to perform the requested operation (e.g.,
`rotate the disk so the right data is under the read head), it
`may release the SCSI bus and allow the initiator to send
`other commands. This allows multiple block read/write
`operations to occur in parallel.
`To implement RAID stripes over SCSI, multiple com(cid:173)
`mands are sent to multiple devices to read or write data. The
`RAID controller will assemble the striped data after it is
`read. The use of multiple commands is inherently inefficient
`and an implementation of RAID using LAN technologies
`can leverage multicast to remove this inefficiency. At the
`same time, other LAN features can be leveraged to improve
`the performance of RAID over a LAN.
`Because fiber channel (FC) can broadcast to a particular
`partition and get multiple responses back, NetRAID can
`apply to FC. The standard way to do this appears to overlay
`SCSI on top of FC and not to use broadcast. FC does have
`the benefit that it can multiplex commands over the fabric
`for improved efficiency over SCSI (instead of serial multi(cid:173)
`plexing used in SCSI).
`45 RAID Using Multicast Possibilities
`The opportunity to improve RAID with multicast rests in
`four areas: read from a mirrored storage device; read from
`striped/spanned storage device; write to mirror/backup stor(cid:173)
`age device; and write to a striped/spanned storage device.
`Multicast reads from target storage devices (disks) are
`much easier than writes; with mirrored storage devices being
`easier to multicast than striped. Mirrored data is duplicate
`data, meaning that the host only cares if it gets back one
`copy and it can ignore the other copies and any lost copies.
`Striped data is not a duplicate data; it is data interleaved
`across multiple targets. If parity data exists (RAID 4/5), then
`the host can afford to lose one response from a disk and still
`be able to build the correct data. Further, the host knows
`what data is missing and can query disks directly.
`Writing data means that the host has to verify that packets
`were successfully received and written, regardless of
`whether it was mirrored or striped. For a large number of
`devices and large amount of data, this becomes problemati(cid:173)
`cal. A "reliable multicast" protocol is clearly not desirable,
`due to the data implosion problem associated with a large
`number of simultaneous responses to a multicast request.
`The design issue in term of writing relates to the size of the
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1007, p. 14
`
`
`
`US 6,834,326 Bl
`
`5
`
`10
`
`8
`the performance cost. The more disks that are mirrored or
`striped, the more processing that is required. The more that
`data access is parallelized by striping across more disks, the
`greater the throughput. Optimal performance depends on the
`type of striping used and application requirements such as
`bulk data transfer versus transaction processing. The more
`numerous the disks striped, the higher the risk of multicast
`implosion. This array of factors militates for autoconfigu(cid:173)
`ration.
`Packet loss may be addressed according to the present
`invention by the use of priority/buffering and FEC in the
`data stream. On well-designed SANs and SANs using 1000-
`Base-FX, the loss rate is likely not to be an issue. The
`mechanisms are in place to reduce the cost to design the
`SAN and to allow a wider geographic area to participate in
`a SAN. The protocol will be designed to ride on top of IP as
`well as Ethernet, the latter being a solution for those who
`desire performance, controllability, and security of a closed
`network. Full duplex, flow controlled Ethernet will mini-
`20 mize packet loss. While the speed of GE mitigates the need
`to make OS changes, the packet loss issue requires changes
`in the operating systems (OS) processing of SCSI timeouts.
`Ultimately, some of the logic implementing the present
`invention should migrate into the NIC chipset and other
`functionality, such as wirespeed security and various per(cid:173)
`formance tricks, should be incorporated. These performance
`features should be transparent to users and application
`software.
`
`7
`data stream being sent, the number of devices being sent to,
`and the packet loss. Packet loss, in turn, relates to the
`number of hops in the network. The more hops or further out
`the data goes, the higher the loss probability. Also affective
`writing strategies is the type of striping/interleaving that is
`performed-bit level, byte/word level, or block level.
`Research results published by others on disk striping using
`SCSI-RAID suggest block level striping. Any implementa(cid:173)
`tion of the present invention should support all levels of
`RAID and allow room for future encoding techniques.
`In striping with parity, updating data is more complicated
`than writing new data. For a stripe across four disks (3 data,
`1 parity), two reads (for the data on the other disks) and two
`writes (for the new data and parity blocks) are done to
`update one block. One way to minimize the disk accesses is 15
`to withhold the write until the other two blocks are written
`to, requiring smart and non-volatile caching.
`In theory, SCSI commands can send and receive up to 256
`blocks of data wi