(12) United States Patent
Wang et al.

(10) Patent No.: US 6,834,326 B1
(45) Date of Patent: Dec. 21, 2004

(54) RAID METHOD AND DEVICE WITH NETWORK PROTOCOL BETWEEN CONTROLLER AND STORAGE DEVICES

(75) Inventors: Peter S. S. Wang, Cupertino, CA (US); David C. Lee, San Jose, CA (US)

(73) Assignee: 3Com Corporation, Marlborough, MA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/527,904

(22) Filed: Mar. 17, 2000

Related U.S. Application Data

(60) Provisional application No. 60/180,574, filed on Feb. 4, 2000.

(51) Int. Cl.7 ................ G06F 12/02
(52) U.S. Cl. ................ 711/114; 711/162; 711/154; 711/161; 370/355; 370/356
(58) Field of Search ........ 711/114, 4, 162, 161, 154; 709/200; 710/1; 370/351-356

(56) References Cited

U.S. PATENT DOCUMENTS

5,805,920 A * 9/1998 Sprenkle et al. ........ 710/1
6,000,020 A * 12/1999 Chin et al. ............ 711/162
6,098,155 A * 8/2000 Chong, Jr. ............. 711/138
6,138,176 A * 10/2000 McDonald et al. ....... 710/6
6,141,788 A * 10/2000 Rosenberg et al. ...... 714/774
6,148,414 A * 11/2000 Brown et al. .......... 714/11
6,272,522 B1 * 8/2001 Lin et al. ............. 709/200
6,304,942 B1 * 10/2001 DeKoning ............... 711/114
6,378,039 B1 * 4/2002 Ohara et al. ........... 709/105
6,529,996 B1 * 3/2003 Nguyen et al. .......... 711/114

OTHER PUBLICATIONS

D. Meyer, Introduction to IP Multicast (Jun. 1998), from www.nanog.org/mtg-9806/ppt/davemeyer/, pp. 1-49.
T. Maufer and C. Semeria, Introduction to IP Multicast Routing, IETF Internet Draft (Apr. or May 1996), pp. 1-54.
C. Semeria and T. Maufer, Introduction to IP Multicast Routing (undated), from www.cs.utexas.edu/users/dragon/cs378/resources/mult1.pdf.
Juan-Mariano de Goyeneche, "Multicast Over TCP/IP HOWTO," Mar. 20, 1998, 34 pages, http://www.linuxdoc.org/HOWTO/Multicast-HOWTO.html.
B. Quinn, "IP Multicast Applications: Challenges and Solutions," Nov. 1998, IP Multicast Initiative, pp. 1-16.
B. Quinn, "IP Multicast Applications: Challenges and Solutions," Jun. 25, 1999, Stardust Forums, pp. 1-25.

* cited by examiner

Primary Examiner: Mano Padmanabhan
Assistant Examiner: Jasmine Song
(74) Attorney, Agent, or Firm: Ernest J. Beffel, Jr.; Haynes Beffel & Wolfeld LLP

(57) ABSTRACT

The present invention relates to transparent access to a redundant array of inexpensive devices. In particular, it provides a method and device for connecting redundant disk drives to a controller, preferably an intelligent switch, via a network. The disks are controlled by commands transported across the network. Commands may be SCSI, IDE/ATA or other commands. The network may comprise ethernet, fibre channel or other physical layer protocol. Commands can be encapsulated in IP packets and transmitted using either a reliable or unreliable transport protocol. Multicasting of packets from the controller to the multiple disk drives is part of the present invention.

74 Claims, 10 Drawing Sheets

[Front-page drawing: a RAID controller/switch connected over a network to Volume 1, Volume 2, Volume 3 and Volume 4.]
U.S. Patent, Dec. 21, 2004, US 6,834,326 B1: Drawing Sheets 1-10

[FIG. 1: RAID-1 mirroring. Data blocks A-D and E-H are duplicated across paired drives: Disk 1/Disk 2 and Disk 3/Disk 4.]
[FIG. 2: RAID-0 spanning and interleaving of data blocks across Disk 1 through Disk 4.]
[FIG. 3: RAID-4. Blocks 0-3 feed parity generation; data blocks (the A, B, C and D series) reside on Disks 1-4, with the corresponding parity blocks (A parity through D parity) on Disk 5.]
[FIG. 4: RAID-5. Data blocks (A0-A2, B0-B1, C0-C1, D0-D1) and parity blocks alternate across Disks 1-4.]
[FIG. 5: a RAID controller/switch connected to Volume 1 and Volume 2.]
[FIG. 6: cascading of virtual disks; a switch presenting Volumes 1-4.]
[FIG. 7: software architecture. User-space applications over a kernel stack of filesystem, cache system, block drivers (SCSI, etc.), TCP and UDP, NetSCSI/NetRAID, IP, and 802 layers.]
[FIG. 8a: multicast read. A Read/Multiack is multicast to four disks; data packets return interleaved (1, 5, 9; 2, 6, 10; 3, 7, 11; 4, 8, 12).]
[FIG. 8b: multicast read of striped blocks 1-100, 101-200, 201-300 and 301-400 from four disks.]
[FIG. 8c: multicast read (Read/Multiack, Data) and multicast write (Write/Data, Multiack) of mirrored blocks 1-100 on four disks.]
[FIG. 8d: Read/Multiack and Data, then Write/Data and Ack, for data blocks 1-100, 101-200, 201-300 and parity P(1-300).]
[FIG. 8e: Read/Multiack with Write/Data and Ack, including parity P(1-300).]
[FIG. 9: SCSI architecture layering: command access method, transport, and physical layers.]
[FIG. 10: NetSCSI embodiment using a PC as a switch. NetSCSI disks, each with a SCSI interface and NIC, attached to a local area network behind the switch.]
[FIG. 11: layered design of a Linux platform (1101-1109): applications in user space; file system, block interface, SCSI layer, NetSCSI device driver and IP in the kernel, down to the wire.]
[FIG. 12: RAID implemented on a Linux platform (1201-1210): as in FIG. 11, with a RAID layer between the file system and the block interface, and an STP layer (1207) below the NetSCSI device driver.]
[FIG. 13: flowchart of send and receive processes for a NetSCSI device driver (1301, 1310-1317). The send process runs aha1740_queuecommand and netscsi_send into the networking code; the receive process handles received packets via aha1740_rx_handler, cancels the timer with del_timer, and completes with done.]
[FIG. 14: session initiation between a NetSCSI disk (client) and a switch PC (server); the message exchange includes NCDP, RESPONSE, DISK_SETUP, SWITCH_SETUP, GETDISKP_RAID, DISKP_RAID, SETDISKP_RAID, RESPONSE and KEEPALIVE (reference numerals 1401-1412).]
RAID METHOD AND DEVICE WITH NETWORK PROTOCOL BETWEEN CONTROLLER AND STORAGE DEVICES

RELATED APPLICATION DATA

Applicant claims the benefit of Provisional Application No. 60/180,574, entitled Networked RAID Method and Device, filed 4 Feb. 2000.
BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to the use of multicasting to implement a redundant storage protocol, such as a redundant array of inexpensive devices (RAID), mirroring or striping of storage devices, where the storage devices are separated from a controller by a network. It is preferred for the controller to be hosted on an intelligent switch and for the controller to use SCSI commands to control the storage devices, although it alternatively can use IDE/ATA commands. The network may comprise ethernet, fibre channel or another physical layer protocol. One or more commands, such as SCSI commands, can be encapsulated directly in a network packet or encapsulated in an IP packet. When commands are encapsulated in IP packets, either a reliable or unreliable transport protocol can be used. Preferably, the use of a reliable multicast protocol can be avoided.

2. Description of Related Art

Network-attached storage (NAS) is a means to fulfill the needs of clients who want fast, scalable, high-bandwidth access to their data. It improves the scalability of existing distributed file systems by removing the server as a bottleneck, using the network's bandwidth to allow parallelism through striping and more efficient data paths. Significant to implementing network-attached storage is a trend towards increasingly intelligent storage devices. Intelligent storage devices provide functionality previously reserved to dedicated servers.

A second advantage of network-attached storage is reduced total cost of ownership (TCO). Network-attached storage offers convenient placement of storage devices. Plug-and-play configuration can be incorporated. Administration can be simplified, reducing the load on or need for information technology professionals who are needed to maintain and fine-tune dedicated servers.

A variation on the traditional client/server or Server Integrated Disk (SID) system uses SCSI commands across a network. This scheme is referred to as Server Attached Disk (SAD). This model allows storage devices to be arbitrarily placed on a network. A SCSI protocol such as NetSCSI is used by the server to control the remotely located storage devices. Users of a NetSCSI system can have their own storage area network (SAN) using their existing networks and intelligent SCSI hard disks, without the added cost and complexity of high-end storage area network systems.

An additional concern that is always present in storage systems, and may be accentuated by network-attached storage, is redundancy. Recovery from failure of an individual storage device requires some degree of redundancy. File servers have used redundant arrays of inexpensive devices (RAID) to provide redundancy. Several RAID levels have been defined. RAID level 1, for instance, involves mirroring of paired storage extents on distinct physical devices. Failure of one device leaves the other device available. RAID level 4, as an alternative, involves using two or more extents of storage devices plus an additional extent for parity data. Again, failure of one device can be overcome by reconstructing the data on the failed device from the remaining devices.

Implementation of RAID has typically required trained information technology professionals. Network administrators without formal information technology training may find it difficult to choose among the alternative RAID configurations. Growth in storage needs may require choice of a different configuration and migration from an existing to a new RAID configuration. The selection of hardware to implement an array may involve substantial expense. A relatively simple, inexpensive and preferably self-configuring scheme for redundant storage is desired.

Therefore, it is desirable to provide redundant storage across a network to assure redundancy while providing automatic configuration to reduce the total cost of system ownership. It is further desirable to take advantage of network-oriented protocols, such as multicasting packets, to implement redundant storage in an efficient way.

SUMMARY OF INVENTION

The present invention includes using packet multicasting to implement redundant configurations of networked storage devices. It is preferred to use an intelligent switch, in communication with a plurality of storage devices, as a RAID controller. The intelligent switch may function as a thinserver, supporting a file system such as NFS or CIFS. By cascading switches, one or more of the storage devices in communication with the switch may actually be another switch, appearing as a virtual storage device.

The present invention includes multicasting command packets. In some protocols, such as SCSI, a series of commands may be required to establish communications between the initiator and the target, followed by data transfer. Multiple commands can be multicast in a single packet. Read commands can advantageously be multicast to retrieve data from identically located blocks in different partitions or extents of data. Write commands can be multicast to establish communications. Data also may be multicast for certain configurations of RAID, such as mirrored disks and byte-wise striped disks.

Another aspect of the present invention is the use of forward error correction to avoid the need for reliable multicast protocols. A variety of forward error correction algorithms may be used, including parity blocks. Forward error correction can be implemented on a selective basis. Error correction can be implemented when packet loss exceeds a threshold or on a transaction-by-transaction basis. Error correction can be supplemented by a dampening mechanism to minimize oscillation.

Another aspect of the present invention is control of multicast implosion. When multicast read commands are issued to numerous disks or for a large quantity of data, delay and backoff algorithms minimize oscillation.

The present invention is suitably implemented on a switch in communication with multiple storage devices over raw ethernet, IP over UDP, fibre channel, or other communications protocol.

BRIEF DESCRIPTION OF FIGURES

FIG. 1 illustrates RAID-1 using two pairs of disks.

FIG. 2 illustrates RAID 0 spanning and interleaving of data across four disks.

FIG. 3 illustrates RAID 4 using disk 5 for parity data.
FIG. 4 illustrates RAID 5 placing parity data on alternating disks.

FIG. 5 illustrates a preferred architectural framework for a Switched RAID system.

FIG. 6 illustrates cascading of virtual disks in a Switched RAID system.

FIG. 7 illustrates a preferred software architecture for a RAID controller, such as one running on the switch.

FIGS. 8a-8e illustrate multicasting from a source to storage devices and unicast responses.

FIG. 9 illustrates the SCSI architecture layering.

FIG. 10 depicts an embodiment of NetSCSI utilizing a PC as a switch.

FIG. 11 depicts the layered design of a Linux platform.

FIG. 12 depicts an implementation of RAID on a Linux platform.

FIG. 13 is a flowchart illustrating sending and receiving processes for a NetSCSI device driver.

FIG. 14 illustrates a sequence of messages by which a NetSCSI disk may initiate a session.
DETAILED DESCRIPTION

A form of networked RAID (NetRAID) is required to provide the reliability and performance needed for small and medium sized enterprises. While RAID can be implemented using SCSI-over-IP technology, the present invention leverages multicast capabilities to further improve performance. NetRAID may extend SCSI-over-IP (NetSCSI) autoconfiguration mechanisms to provide RAID autoconfiguration. The NetSCSI protocol can be enhanced to support NetRAID.

Most storage devices are internal to a storage server and are accessed by the server's processor through effectively lossless, high-speed system buses. This simplifies the design of the storage hardware and software. The conventional wisdom is that LAN technology will not suffice in the storage world because it cannot provide the performance or reliability required. New gigabit Ethernet (GE) technologies may disprove the conventional wisdom. While the highest end of the market may remain loyal to server-centric technologies, small and medium sized enterprises, especially price-sensitive organizations, are expected to embrace new options.

In existing network systems, file stacks (filesystems and block drivers such as SCSI drivers) are not implemented to handle data loss. Data loss is generally viewed as a hardware error, and the result is a time-consuming hardware reset process. Also, timeouts are not effectively set and used, which makes simple implementations of GE-based SANs more difficult. Minor changes in the implementation of file stacks are required to ensure that GE performance is comparable to Fibre Channel (FC). With standardization of NetSCSI and NetRAID, appropriate modifications may be realized.

Background: RAID Primer

Redundant arrays of inexpensive disks (RAID) are used to improve the capacity, performance, and reliability of disk subsystems, allowing the potential to create high-performance terabyte storage using massive arrays of cheap disks. RAID can be implemented in software or hardware, with the hardware being transparent to existing SCSI or FC drivers.

The premise of RAID is that n duplicate disks can store n times more data than one disk. This spanning property, by itself, simply provides capacity and no improvement in performance, as it implies that disks are accessed sequentially. The most obvious performance improvement is to interleave or stripe the data so that it can be accessed in parallel. Thus, n small and inexpensive disks theoretically can provide n times the storage of one disk at n times the performance of one disk using striping.

To improve reliability, two mechanisms are used. One is to simply copy, or mirror, the data. To mirror four disks, four additional disks are required, which is not an efficient use of capacity. Another approach is to add a parity disk to provide error detection and correction information. Parity is calculated as the exclusive-or (XOR) of the data disks. If one data disk fails, the data can be recovered by reading the remaining disks, including the parity disk, and reconstructing the bits that were on the failed disk. A parity disk can be implemented on a stripe-wise basis, to minimize impact on read performance.
These features, plus variations on how the parity is implemented, have been standardized as RAID levels. The commonly defined levels are:

RAID linear: spanned
RAID 0: striped
RAID 1: mirroring
RAID 2: not used in practice
RAID 3: byte striped with parity disk
RAID 4: block striped with parity disk
RAID 5: RAID 4 with rotating parity
RAID 6: similar to RAID 5 except for a physical-to-logical mapping difference

The different RAID levels have different performance, redundancy, storage capacity, reliability and cost characteristics. Most, but not all, levels of RAID offer redundancy against disk failure. Of those that offer redundancy, RAID-1 and RAID-5 are the most popular. RAID-1 offers better performance, while RAID-5 provides more efficient use of the available storage space. However, tuning for performance is an entirely different matter, as performance depends strongly on a large variety of factors, from the type of application to the sizes of stripes, blocks, and files.

RAID-linear is a simple concatenation of partitions to create a larger virtual partition. It is used to create a single large partition from a number of small drives. Unfortunately, this concatenation offers no redundancy, and even decreases the overall reliability: if any one disk fails, the combined partition will fail.

RAID-1 involves mirroring. Mirroring is where two or more partitions (or drives), all of the same size, each store an exact copy of all data, disk-block by disk-block. The logical addresses of corresponding disk-blocks in extents on different physical drives are the same. Mirroring gives strong protection against disk failure. If one disk should fail, another exists with an exact copy of the same data. Mirroring can also help improve performance in I/O-laden systems, as read requests can be divided up between several disks. Unfortunately, mirroring is also the least efficient in terms of storage: two mirrored partitions can store no more data than a single partition. FIG. 1 illustrates RAID-1. The letters refer to the order of the data. That is, the same letter indicates the same data. The data is ordered A, B, C, D, E and so on. The same data is duplicated on two disks: disk 1 is a mirror of disk 2 and disk 3 is a mirror of disk 4.

Striping is the underlying concept behind all RAID levels other than RAID-1. A stripe is a contiguous sequence of disk blocks. A stripe may be as short as a single disk block, or may consist of thousands of blocks. The RAID drivers split up their component disk partitions into stripes; the different RAID levels differ in how they organize the stripes and what data they put in them. The interplay between the size of the stripes, the typical size of files in the file system, and their location on the disk is what determines the overall performance of the RAID subsystem.
RAID-0 is much like RAID-linear, except that the component partitions are divided into stripes and then interleaved. Like RAID-linear, the result is a single larger virtual partition. Also like RAID-linear, it offers no redundancy, and therefore decreases overall reliability: a single disk failure will knock out the whole thing. RAID-0 is often claimed to improve performance over the simpler RAID-linear. However, this may or may not be true, depending on the characteristics of the file system, the typical size of the file as compared to the size of the stripe, and the type of workload. The ext2fs file system already scatters files throughout a partition, in an effort to minimize fragmentation. Thus, at the simplest level, any given access may go to one of several disks, and thus the interleaving of stripes across multiple disks offers no apparent additional advantage. However, there are performance differences, and they are data, workload, and stripe-size dependent. This is shown in FIG. 2. The meaning of the letters is the same as that described for RAID-1. The data is copied to disk 1, then the next data is copied to disk 2, and so on.
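Because RAID-0 assigns consecutive stripes to consecutive disks round-robin, locating a block is simple arithmetic on its logical address. A small sketch in C of one plausible mapping (the stripe-size parameter and the struct are illustrative assumptions, not a layout prescribed by the patent):

    #include <stdint.h>

    struct stripe_loc {
        uint32_t disk;   /* which member disk holds the block */
        uint64_t offset; /* block offset within that disk     */
    };

    /*
     * Map a logical block address onto a RAID-0 array of ndisks
     * members with blocks_per_stripe blocks in each stripe.
     * Consecutive stripes go to consecutive disks, round-robin.
     */
    static struct stripe_loc raid0_map(uint64_t lba,
                                       uint32_t ndisks,
                                       uint32_t blocks_per_stripe)
    {
        uint64_t stripe = lba / blocks_per_stripe; /* global stripe index */
        struct stripe_loc loc;

        loc.disk = (uint32_t)(stripe % ndisks);
        loc.offset = (stripe / ndisks) * blocks_per_stripe
                   + lba % blocks_per_stripe;
        return loc;
    }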
RAID-4 interleaves stripes like RAID-0, but it requires an additional partition to store parity information. The parity is used to offer redundancy: if any one of the disks fails, the data on the remaining disks can be used to reconstruct the data that was on the failed disk. Given N data disks and one parity disk, the parity stripe is computed by taking one stripe from each of the data disks and XOR'ing them together. Thus, the storage capacity of an (N+1)-disk RAID-4 array is N, which is a lot better than mirroring (N+1) drives, and is almost as good as a RAID-0 setup for large N. Note that for N=1, where there is one data drive and one parity drive, RAID-4 is a lot like mirroring, in that each of the two disks is a copy of the other. However, RAID-4 does not offer the read performance of mirroring, and offers considerably degraded write performance. This is because updating the parity requires a read of the old parity before the new parity can be calculated and written out. In an environment with lots of writes, the parity disk can become a bottleneck, as each write must access the parity disk. FIG. 3 shows RAID-4 where N=4. RAID-4 requires a minimum of 3 partitions (or drives) to implement.
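The write penalty just described follows from the XOR algebra: a new parity block can be computed from the old data, the old parity, and the new data, without touching the other data disks. A hedged sketch in C of that read-modify-write update, one common scheme for small writes (the buffer-per-block layout is an illustrative assumption):

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Read-modify-write parity update for one block of a RAID-4 or
     * RAID-5 stripe. Since parity = d0 ^ d1 ^ ... ^ dn, replacing one
     * data block gives: new_parity = old_parity ^ old_data ^ new_data.
     * This is the cost noted above: old_data and old_parity must both
     * be read before the new data and new parity can be written.
     */
    void parity_update(uint8_t *parity, const uint8_t *old_data,
                       const uint8_t *new_data, size_t len)
    {
        size_t i;

        for (i = 0; i < len; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }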
RAID-2 and RAID-3 are seldom used in practice, having been made somewhat obsolete by modern disk technology. RAID-2 is similar to RAID-4, but stores Error Correcting Code (ECC) information instead of parity. Since all modern disk drives incorporate ECC under the covers, this offers little additional protection. RAID-2 can offer greater data consistency if power is lost during a write; however, battery backup and a clean shutdown can offer the same benefits. RAID-3 is similar to RAID-4, except that it uses the smallest possible stripe size. As a result, any given read will involve all disks, making overlapping I/O requests difficult or impossible. In order to avoid delay due to rotational latency, RAID-3 requires that all disk drive spindles be synchronized. Most modern disk drives lack spindle-synchronization ability, or, if capable of it, lack the needed connectors, cables, and manufacturer documentation. Neither RAID-2 nor RAID-3 is supported by the Linux software RAID drivers.

RAID-5 avoids the write bottleneck of RAID-4, which results from the need to read the old parity before the new parity can be calculated and written out, by alternately storing the parity stripe on each of the drives. However, write performance is still not as good as for mirroring, as the parity stripe must still be read and XOR'ed before it is written. Read performance is also not as good as it is for mirroring, as, after all, there is only one copy of the data, not two or more. RAID-5's principal advantage over mirroring is that it offers redundancy and protection against single-drive failure, while offering far more storage capacity when used with three or more drives. This is illustrated in FIG. 4. RAID-5 requires a minimum of 3 partitions (or drives) to implement.
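Rotating the parity only changes which member disk holds each stripe's parity block. A short C sketch of one common placement rule (this particular rotation is an illustrative choice; the patent does not prescribe a layout):

    #include <stdint.h>

    /*
     * For stripe number 'stripe' in an ndisks-member RAID-5 array,
     * return the disk that holds that stripe's parity block. Rotating
     * the result across stripes spreads parity I/O over all members,
     * removing RAID-4's dedicated-parity-disk bottleneck.
     */
    static uint32_t raid5_parity_disk(uint64_t stripe, uint32_t ndisks)
    {
        return (uint32_t)(stripe % ndisks);
    }

    /* Data blocks skip over whichever disk holds the parity. */
    static uint32_t raid5_data_disk(uint64_t stripe, uint32_t ndisks,
                                    uint32_t data_index)
    {
        uint32_t pd = (uint32_t)(stripe % ndisks);
        return (data_index >= pd) ? data_index + 1 : data_index;
    }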
Other RAID levels have been defined by various researchers and vendors. Many of these represent the layering of one type of RAID on top of another. Some require special hardware, and others are protected by patent. There is no commonly accepted naming scheme for these other levels. Sometimes the advantages of these other systems are minor, or at least not apparent until the system is highly stressed. Linux software RAID does not support any variations besides RAID-linear, RAID-0, RAID-1, RAID-4 and RAID-5.

Background: RAID over SCSI/FC

Using SCSI, an initiator (say, a host CPU) will send a command to a particular target (disk) and then the target will control the remaining transactions. Because a target might take some time to perform the requested operation (e.g., rotate the disk so the right data is under the read head), it may release the SCSI bus and allow the initiator to send other commands. This allows multiple block read/write operations to occur in parallel.

To implement RAID stripes over SCSI, multiple commands are sent to multiple devices to read or write data. The RAID controller will assemble the striped data after it is read. The use of multiple commands is inherently inefficient, and an implementation of RAID using LAN technologies can leverage multicast to remove this inefficiency. At the same time, other LAN features can be leveraged to improve the performance of RAID over a LAN.

Because fibre channel (FC) can broadcast to a particular partition and get multiple responses back, NetRAID can apply to FC. The standard way to do this appears to be to overlay SCSI on top of FC and not to use broadcast. FC does have the benefit that it can multiplex commands over the fabric for improved efficiency over SCSI (instead of the serial multiplexing used in SCSI).

RAID Using Multicast Possibilities

The opportunity to improve RAID with multicast rests in four areas: reads from a mirrored storage device; reads from a striped/spanned storage device; writes to a mirror/backup storage device; and writes to a striped/spanned storage device.

Multicast reads from target storage devices (disks) are much easier than writes, with mirrored storage devices being easier to multicast than striped ones. Mirrored data is duplicate data, meaning that the host only cares whether it gets back one copy; it can ignore the other copies and any lost copies. Striped data is not duplicate data; it is data interleaved across multiple targets. If parity data exists (RAID 4/5), then the host can afford to lose one response from a disk and still be able to build the correct data. Further, the host knows what data is missing and can query disks directly.
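In this model, a single read command is multicast once to a group that all member disks have joined, and each disk unicasts its portion back, as in FIGS. 8a-8b. A minimal sketch in C over UDP/IP multicast (the group address, port, and command format are invented for illustration; the patent equally contemplates raw ethernet or fibre channel transports):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Illustrative NetRAID-style read command; not the patent's format. */
    struct read_cmd {
        uint32_t start_block;   /* first logical block to read */
        uint32_t num_blocks;    /* how many blocks are wanted  */
    };

    int main(void)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in group;
        struct read_cmd cmd;
        unsigned char ttl = 1;  /* keep traffic on the local SAN segment */
        char buf[1500];

        memset(&group, 0, sizeof(group));
        group.sin_family = AF_INET;
        group.sin_addr.s_addr = inet_addr("239.1.1.1"); /* assumed group */
        group.sin_port = htons(5000);                   /* assumed port  */

        cmd.start_block = htonl(0);
        cmd.num_blocks = htonl(400);

        setsockopt(sock, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));

        /* One multicast send replaces one unicast command per member disk. */
        sendto(sock, &cmd, sizeof(cmd), 0,
               (struct sockaddr *)&group, sizeof(group));

        /* Each disk unicasts its stripe (or mirror copy) back. A real
           driver would bound this loop with the SCSI-timeout changes
           discussed elsewhere, rather than blocking indefinitely. */
        for (;;) {
            ssize_t n = recvfrom(sock, buf, sizeof(buf), 0, NULL, NULL);
            if (n <= 0)
                break;
            /* reassemble striped data, or keep the first mirrored copy */
        }

        close(sock);
        return 0;
    }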
Writing data means that the host has to verify that packets were successfully received and written, regardless of whether the data was mirrored or striped. For a large number of devices and a large amount of data, this becomes problematical. A "reliable multicast" protocol is clearly not desirable, due to the data implosion problem associated with a large number of simultaneous responses to a multicast request. The design issue in terms of writing relates to the size of the data stream being sent, the number of devices being sent to, and the packet loss. Packet loss, in turn, relates to the number of hops in the network. The more hops or the further out the data goes, the higher the loss probability. Also affecting writing strategies is the type of striping/interleaving that is performed: bit level, byte/word level, or block level. Research results published by others on disk striping using SCSI-RAID suggest block level striping. Any implementation of the present invention should support all levels of RAID and allow room for future encoding techniques.

In striping with parity, updating data is more complicated than writing new data. For a stripe across four disks (3 data, 1 parity), two reads (for the data on the other disks) and two writes (for the new data and parity blocks) are done to update one block. One way to minimize the disk accesses is to withhold the write until the other two blocks are written to, requiring smart and non-volatile caching.

In theory, SCSI commands can send and receive up to 256 blocks of data wi...

[...]

...the performance cost. The more disks that are mirrored or striped, the more processing that is required. The more that data access is parallelized by striping across more disks, the greater the throughput. Optimal performance depends on the type of striping used and application requirements such as bulk data transfer versus transaction processing. The more numerous the disks striped, the higher the risk of multicast implosion. This array of factors militates for autoconfiguration.

Packet loss may be addressed according to the present invention by the use of priority/buffering and FEC in the data stream. On well-designed SANs and SANs using 1000Base-FX, the loss rate is likely not to be an issue. The mechanisms are in place to reduce the cost to design the SAN and to allow a wider geographic area to participate in a SAN. The protocol will be designed to ride on top of IP as well as Ethernet, the latter being a solution for those who desire the performance, controllability, and security of a closed network. Full duplex, flow-controlled Ethernet will minimize packet loss. While the speed of GE mitigates the need to make OS changes, the packet loss issue requires changes in the operating system's (OS) processing of SCSI timeouts.

Ultimately, some of the logic implementing the present invention should migrate into the NIC chipset, and other functionality, such as wirespeed security and various performance tricks, should be incorporated. These performance features should be transparent to users and application software.
