`[19]
`[11] Patent Number:
`6,073,218
`
`DeKoning et al.
`[45] Date of Patent:
`Jun. 6, 2000
`
`U3006073218A
`
[54] METHODS AND APPARATUS FOR COORDINATING SHARED MULTIPLE RAID CONTROLLER ACCESS TO COMMON STORAGE DEVICES
`
[75] Inventors: Rodney A. DeKoning; Gerald J. Fredin, both of Wichita, Kans.
`[73] Assignee: LSI Logic Corp., Milpitas, Calif.
`
`[21] Appl. No.: 08/772,614
`[22]
`Filed:
`Dec. 23, 1996
`
`[51] mt. c1? ...................................................... G06F 13/16
`
`[52] U.S.Cl. ...................
`....711/150;711/114;711/148;
`711/149; 711/151; 711/152; 711/153; 711/162;
`711/168; 710/20; 710/21; 710/38; 710/241;
`714/6; 714/11
`[58] Field of Search ..................................... 711/114, 152,
`711/145, 148, 149, 150, 151, 153, 168,
`162, 714/6, 11; 710/20, 21, 38, 241
`
`[56]
`
`References Cited
`
`Us‘ PATENT DOCUMENTS
`10/1972 page ............................................ 444/1
`3,702,006
`3/1992 Schultz et al.
`395/575
`5,101,492
`
`9/1992 Gordon et 31.
`371/101
`5,148,432
`5/1993 Pfeffer et al.
`395/575
`5,210,860
`.
`9/1993 Schmenk et al.
`.. 395/425
`5,249,279
`
`5/1994 Dias et al.
`395/600
`5,317,731
`7/1994 Fry et a1.
`.. 360/53
`......
`5,331,476
`
`11/1994 Holland et al.
`395/575
`5,367,669
`1/1995 Lai et a1~
`~~~~~~
`706/916
`5,379,417
`1/1995 Fry et a1~
`~~~~~~
`-~ 390/53
`593899324
`2/1995 DCMOSS et al.
`371/511
`5388408
`
`7”]995 SCh‘ffleger
`395/200
`594349970
`.. 395/650
`8/1995 Yokota et a1.
`5,440,743
`.. 395/401
`8/1995 Dang et a1.
`5,446,855
`
`5,455,934 10/1995 Holland et 211.
`”
`395/404
`5,459,864 10/1995 Brent et a1.
`395/650
`
`5,495,601
`2/1996 Narang et al.
`" 395/600
`5,535,365
`7/1996 Barriso et a1.
`" 711/152
`.................... 395/18207
`5,546,535
`8/1996 Stallmo et al.
`
FOREIGN PATENT DOCUMENTS

0493984   7/1992  European Pat. Off. ........ G06F 11/10
0551718   7/1993  European Pat. Off. ........ G06F 11/20
0707269   4/1996  European Pat. Off. ........ G06F 12/08
0645702   3/1995  Germany
9513583   5/1995  WIPO ...................... G06F 12/00
OTHER PUBLICATIONS

A Case for Redundant Arrays of Inexpensive Disks (RAID); David Patterson, Garth Gibson & Randy Katz; Dec. 1987; pp. 1-24.
`
`Primary Examiner—Hiep T Nguyen
[57] ABSTRACT
Methods and associated apparatus for performing concurrent I/O operations on a common shared subset of disk drives (LUNs) by a plurality of RAID controllers. The methods of the present invention are operable in all of a plurality of RAID controllers to coordinate concurrent access to a shared set of disk drives. In addition to providing redundancy features, the plurality of RAID controllers operable in accordance with the methods of the present invention enhance the performance of a RAID subsystem by better utilizing available processing power among the plurality of RAID controllers. Under the methods of the present invention, each of a plurality of RAID controllers may actively process different I/O requests on a common shared subset of disk drives. One of the plurality of controllers is designated as primary with respect to a particular shared subset of disk drives. The plurality of RAID controllers then exchange messages over a communication medium to coordinate concurrent access to the shared subset of disk drives through the primary controller. The messages exchanged include semaphore lock and release requests to coordinate exclusive access during critical operations as well as cache and meta-cache data to maintain cache coherency between the plurality of the RAID controllers with respect to the common shared subset of disk drives. These messages are exchanged via any of several well known communication mediums including a shared memory common to the plurality of controllers and the communication bus connecting the shared subset of disk drives to each of the plurality of controllers.
`
` '20
`
`
`O
`DDIiISVKE
`V
`
`.5110
`
`(List continued on next page.)
`36 Claims, 11 Drawing Sheets
`
`
`
`
`
U.S. PATENT DOCUMENTS (continued)
`
`5,694,571
`
`12/1997 Fuller ...................................... 395/440
`
`ran 6 a. ........................
`,
`.
`t
`81996 B
`t
`1
`395/182 03
`5 548711
`2/1997 Sato et a1.
`................................. 710/52
`5,603,062
`
`9/1997 Suganuma et a1.
`711/114
`5,666,511
`............................. 711/152
`10/1997 Vertti et a1.
`5,678,026
`5,682,537 10/1997 Davies et al.
`........................... 395/726
`
`,
`.
`......................... 395/608
`2/1998 Hayashi et a1.
`5,715,447
`6/1998 DeKonmg et 31'
`~ “1/113
`5761705
`6/1998 Peacock e161.
`..
`. 395/275
`5,764,922
`7/1997 HodgesetaL
`~ 395/821
`5,787,304
`5,845,292 12/1998 Bohannon et a1.
`
`. 707/202
`
`
`
`
[Drawing Sheet 1 of 11, FIG. 1: block diagram of a typical RAID storage subsystem (host computers, RAID controllers each with CPU, program memory, and cache memory, and arrays of disk drives) in which the present invention may be applied]
[Drawing Sheet 2 of 11, FIG. 2: block diagram of a first preferred embodiment in which the RAID controllers communicate via a shared memory bus or via the common disk interface channel]
[Drawing Sheet 3 of 11, FIG. 3: block diagram of a second preferred embodiment in which the controllers (118.1), each holding a semaphore pool (114.1), communicate via one or more multipoint loop media connecting controllers and disk drives]
[Drawing Sheet 4 of 11, FIG. 4: exemplary semaphore pool data structure associated with a LUN; its fields include the primary controller, LUN control info, the number of free and locked semaphores, and a list of semaphore entries associating semaphores with locked stripe(s)]
[Drawing Sheet 5 of 11, FIG. 5: flowchart of primary controller operation; for each LUN for which it is primary the controller allocates and initializes a semaphore pool (500-504), then awaits shared access requests (506); on a request for a LUN for which it is primary (508), it either acquires exclusive access to the required stripe(s) (510, 516) and transmits an exclusive access grant message to the requesting controller (518), or releases exclusive access to the requested stripe(s) (514)]
[Drawing Sheet 6 of 11, FIG. 6: flowcharts detailing acquisition and release of exclusive access; acquisition awaits a free semaphore (600-602) and any pending release of already-locked stripe(s) (604-606), then associates a free semaphore with the requested stripe(s) and locks it (608), incrementing the locked and decrementing the free semaphore counts (610); release unlocks the semaphore associated with the requested stripe(s) (612), decrementing the locked and incrementing the free counts (614)]
[Drawing Sheet 7 of 11, FIG. 7: flowchart of a controller requesting temporary exclusive access to a stripe of a LUN for purposes of performing a requested I/O operation]
[Drawing Sheet 8 of 11, FIG. 8: flowchart detail for a secondary controller; an I/O request requiring a cache update causes the controller to transmit a cache update permission request to the primary controller, await the permission grant message from the primary controller, and then update its local cache]
[Drawing Sheet 9 of 11, FIG. 9: flowchart detail for the primary controller; an I/O request requiring a cache update (900) causes it to transmit cache invalidate messages to all secondary controllers active with respect to the associated LUNs (902) and update its local cache (904)]
[Drawing Sheet 10 of 11, FIG. 10: background daemon flowcharts for distributed cache coherency; the primary controller daemon awaits a cache update permission request (1000), transmits cache invalidate messages to each secondary controller whose cached copies are affected (1002), and transmits a permission grant message to the requesting secondary controller (1004); the secondary controller daemon awaits a cache invalidate message from the primary controller (1010) and invalidates the identified portions of its local cache memory (1012)]
[Drawing Sheet 11 of 11, FIG. 11: block diagram of N-way connectivity in which each controller is primary for some of LUNs A-J and secondary for others (e.g., primary for A-B and secondary for C-G; primary for D-E and secondary for A-C and F-G; primary for H-J and secondary for A-G; or primary for no LUN and secondary for A-J)]
`6,073,218
`
`1
`METHODS AND APPARATUS FOR
`COORDINATING SHARED MULTIPLE RAID
`CONTROLLER ACCESS TO COMMON
`STORAGE DEVICES
`
`RELATED PATENTS
`
The present invention is related to commonly assigned and co-pending U.S. patent application entitled “Methods And Apparatus For Balancing Loads On A Storage Subsystem Among A Plurality Of Controllers”, invented by Charles Binford, Rodney A. DeKoning, and Gerald Fredin, and having an internal docket number of 96-018 and a Ser. No. of 08/772,618, filed concurrently herewith on Dec. 23, 1996, and co-pending U.S. patent application entitled “Methods And Apparatus For Locking Files Within A Clustered Storage Environment”, invented by Rodney A. DeKoning and Gerald Fredin, and having an internal docket number of 96-028 and a Ser. No. of 08/773,470, filed concurrently herewith on Dec. 23, 1996, both of which are hereby incorporated by reference.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`
`The present invention relates to storage subsystems and in
`particular to methods and associated apparatus which pro-
`vide shared access to common storage devices within the
`storage subsystem by multiple storage controllers.
`2. Discussion of Related Art
`
`Modern mass storage subsystems are continuing to pro-
`vide increasing storage capacities to fulfill user demands
`from host computer system applications. Due to this critical
`reliance on large capacity mass storage, demands for
`enhanced reliability are also high. Various storage device
`configurations and geometries are commonly applied to
`meet the demands for higher storage capacity while main-
`taining or enhancing reliability of the mass storage sub-
`systems.
`One solution to these mass storage demands for increased
`capacity and reliability is the use of multiple smaller storage
`modules configured in geometries that permit redundancy of
`stored data to assure data integrity in case of various failures.
`In many such redundant subsystems, recovery from many
common failures can be automated within the storage subsystem itself due to the use of data redundancy, error
`correction codes, and so-called “hot spares” (extra storage
`modules which may be activated to replace a failed, previ-
`ously active storage module). These subsystems are typi-
`cally referred to as redundant arrays of inexpensive (or
`independent) disks (or more commonly by the acronym
`RAID). The 1987 publication by David A. Patterson, et al.,
from University of California at Berkeley entitled A Case for Redundant Arrays of Inexpensive Disks (RAID), reviews the
`fundamental concepts of RAID technology.
`There are five “levels” of standard geometries defined in
`the Patterson publication. The simplest array, a RAID level
`1 system, comprises one or more disks for storing data and
`an equal number of additional “mirror” disks for storing
`copies of the information written to the data disks. The
remaining RAID levels, identified as RAID level 2, 3, 4 and 5 systems, segment the data into portions for storage across several data disks. One or more additional disks are utilized to store error check or parity information.
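
To make the parity mechanism concrete (an added illustration, not an example from the Patterson paper): in the parity-based levels the check disk typically stores the bytewise XOR of the corresponding data portions, so any single lost portion can be rebuilt from the surviving portions and the parity. A minimal sketch:

```python
def xor_parity(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of equal-sized blocks; used as the parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

stripe = [b"AAAA", b"BBBB", b"CCCC"]   # data portions spread across data disks
parity = xor_parity(stripe)            # stored on the parity disk

# If the disk holding stripe[1] fails, XOR of the survivors rebuilds it:
rebuilt = xor_parity([stripe[0], stripe[2], parity])
assert rebuilt == stripe[1]
```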
`RAID storage subsystems typically utilize a control mod-
`ule that shields the user or host system from the details of
`managing the redundant array. The controller makes the
`
`10
`
`15
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`2
`subsystem appear to the host computer as a single, highly
reliable, high capacity disk drive. In fact, the RAID controller may distribute the host computer system supplied
`data across a plurality of the small independent drives with
`redundancy and error checking information so as to improve
`subsystem reliability. Frequently RAID subsystems provide
`large cache memory structures to further improve the per-
`formance of the RAID subsystem. The cache memory is
`associated with the control module such that the storage
`blocks on the disk array are mapped to blocks in the cache.
`This mapping is also transparent to the host system. The host
system simply requests blocks of data to be read or written
`and the RAID controller manipulates the disk array and
`cache memory as required.
`To further improve reliability, it is known in the art to
`provide redundant control modules to reduce the failure rate
`of the subsystem due to control electronics failures. In some
redundant architectures, pairs of control modules are configured such that they control the same physical array of disk
`drives. A cache memory module is associated with each of
`the redundant pair of control modules. The redundant con-
`trol modules communicate with one another to assure that
`
`the cache modules are synchronized. When one of the
`redundant pair of control modules fails, the other stands
`ready to assume control to carry on operations on behalf of
`I/O requests. However, it is common in the art to require host
`intervention to coordinate failover operations among the
`controllers.
`
It is also known that such redundancy methods and structures may be extended to more than two control mod-
`ules. Theoretically, any number of control modules may
`participate in the redundant processing to further enhance
`the reliability of the subsystem.
`However, when all redundant control modules are
`operable, a significant portion of the processing power of the
`redundant control modules is wasted. One controller, often
`referred to as a master or the active controller, essentially
`processes all I/O requests for the RAID subsystem. The
`other redundant controllers, often referred to as slaves or
`passive controllers, are simply operable to maintain a con-
`sistent mirrored status by communicating with the active
`controller. As taught in the prior art, for any particular RAID
logical unit (LUN, a group of disk drives configured to be
`managed as a RAID array), there is a single active controller
`responsible for processing of all
`I/O requests directed
`thereto. The passive controllers do not concurrently manipu-
late data on the same LUN.
`
It is known in the prior art to permit each passive
`controller to be deemed the active controller with respect to
`other LUNs within the RAID subsystem. So long as there is
`but a single active controller with respect to any particular
`LUN, the prior art teaches that there may be a plurality of
`active controllers associated with a RAID subsystem. In
`other words, the prior art teaches that each active controller
`of a plurality of controllers is provided with coordinated
`shared access to a subset of the disk drives. The prior art
`therefore does not teach or suggest that multiple controllers
`may be concurrently active processing different I/O requests
`directed to the same LUN.
`In view of the above it is clear that a need exists for an
`
`improved RAID control module architecture that permits
`scaling of RAID subsystem performance through improved
`connectivity of multiple controllers to shared storage mod-
`ules. In addition, it is desirable to remove the host depen-
`dency for failover coordination. More generally, a need
`exists for an improved storage controller architecture for
`
`improved scalability by shared access to storage devices to
thereby enable parallel processing of multiple I/O requests.
`
`SUMMARY OF THE INVENTION
`
The present invention solves the above and other
`problems, and thereby advances the useful arts, by providing
`methods and associated apparatus which permit all of a
`plurality of storage controllers to share access to common
`storage devices of a storage subsystem. In particular,
`the
`present invention provides for concurrent processing by a
`plurality of RAID controllers simultaneously processing I/O
`requests. Methods and associated apparatus of the present
`invention serve to coordinate the shared access so as to
`
`prevent deadlock conditions and interference of one con-
`troller with the I/O operations of another controller. Notably,
`the present invention provides inter-controller communica-
`tions to obviate the need for host system intervention to
`coordinate failover operations among the controllers.
`Rather, a plurality of controllers share access to common
`storage modules and communicate among themselves to
`permit continued operations in case of failures.
As presented herein the invention is discussed primarily in terms of RAID controllers sharing access to a logical unit (LUN) in the disk array of a RAID subsystem. One of ordinary skill will recognize that the methods and associated apparatus of the present invention are equally applicable to a cluster of controllers commonly attached to shared storage devices. In other words, RAID control management techniques are not required for application of the present invention. Rather, RAID subsystems are a common environment in which the present invention may be advantageously applied. Therefore, as used herein, a LUN (a RAID logical unit) is to be interpreted as equivalent to a plurality of storage devices or a portion of one or more storage devices. Likewise, RAID controller or RAID control module is to be interpreted as equivalent to a storage controller or storage control module. For simplicity of this presentation, RAID terminology will be primarily utilized to describe the invention but should not be construed to limit application of the present invention only to storage subsystems employing RAID techniques.
More specifically, the methods of the present invention utilize communication between a plurality of RAID controlling elements (controllers) all attached to a common region on a set of disk drives (a LUN) in the RAID subsystem. The methods of the present invention transfer messages among the plurality of RAID controllers to coordinate concurrent, shared access to common subsets of disk drives in the RAID subsystem. The messages exchanged between the plurality of RAID controllers include access coordination messages such as stripe lock semaphore information to coordinate shared access to a particular stripe of a particular LUN of the RAID subsystem. In addition, the messages exchanged between the plurality of controllers include cache coherency messages such as cache data and cache meta-data to assure consistency (coherency) between the caches of each of the plurality of controllers.
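
Purely as an illustration, the two classes of messages described above (stripe lock coordination and cache coherency) might be modeled as follows; the type names and fields are hypothetical, not definitions from the patent:

```python
from dataclasses import dataclass
from enum import Enum, auto

class MsgType(Enum):
    # Access coordination messages (stripe lock semaphores)
    LOCK_REQUEST = auto()      # ask the primary for exclusive stripe access
    LOCK_GRANT = auto()        # primary grants the requested lock
    LOCK_RELEASE = auto()      # holder releases the lock after its I/O completes
    # Cache coherency messages (cache data and meta-data)
    CACHE_UPDATE_PERMISSION = auto()  # ask the primary before a cache update
    CACHE_INVALIDATE = auto()         # primary tells peers to drop stale blocks
    CACHE_PERMISSION_GRANT = auto()   # primary allows the pending update

@dataclass
class Message:
    mtype: MsgType
    sender_id: int                 # requesting controller
    lun: int                       # shared LUN the message concerns
    stripes: tuple[int, ...] = ()  # affected stripe numbers, if any
```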
`In particular, one of the plurality of RAID controllers is
`designated as the primary controller with respect to each of
`the LUNs (disk drive subsets) of the RAID subsystem. The
`primary controller is responsible for fairly sharing access to
`the common disk drives of the LUN among all requesting
`RAID controllers. A controller desiring access to the shared
`disk drives of the LUN sends a message to the primary
`controller requesting an exclusive temporary lock of the
`relevant stripes of the LUN. The primary controller returns
`
`10
`
`15
`
`30
`
`35
`
`4O
`
`45
`
`50
`
`55
`
`60
`
`65
`
`4
`a grant of the requested lock in due course when such
`exclusivity is permissible. The requesting controller then
`performs any required I/O operations on the shared devices
`and transmits a lock release to the primary controller when
`the operations have completed. The primary controller man-
`ages the lock requests and releases using a pool of sema-
`phores for all controllers accessing the shared LUNs in the
`subsystem. One of ordinary skill
`in the art will readily
`recognize that the primary/secondary architecture described
`above may be equivalently implemented in a peer-to-peer or
`broadcast architecture.
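
A minimal sketch of the lock bookkeeping just described, assuming a per-LUN semaphore pool in the spirit of FIG. 4; all class and method names here are hypothetical:

```python
import threading
from collections import deque

class SemaphorePool:
    """Per-LUN lock table kept by the primary controller (cf. FIG. 4):
    tracks which stripes are locked and queues waiting requests."""
    def __init__(self, lun: int):
        self.lun = lun
        self.locked_stripes: dict[int, int] = {}  # stripe -> holder controller id
        self.waiters: deque = deque()             # queued (controller, stripes)
        self.mutex = threading.Lock()

    def request_exclusive(self, controller_id: int, stripes: set[int]) -> bool:
        """LOCK_REQUEST handler: grant immediately if no requested stripe is
        held, otherwise queue the request until a release frees them."""
        with self.mutex:
            if stripes.isdisjoint(self.locked_stripes):
                for s in stripes:
                    self.locked_stripes[s] = controller_id
                return True   # primary sends a grant message to the requester
            self.waiters.append((controller_id, stripes))
            return False      # grant deferred

    def release(self, controller_id: int, stripes: set[int]) -> list[tuple[int, set[int]]]:
        """LOCK_RELEASE handler: free the stripes and return any queued
        requests that can now be granted."""
        with self.mutex:
            for s in stripes:
                if self.locked_stripes.get(s) == controller_id:
                    del self.locked_stripes[s]
            grants, still_waiting = [], deque()
            for waiter, wanted in self.waiters:
                if wanted.isdisjoint(self.locked_stripes):
                    for s in wanted:
                        self.locked_stripes[s] = waiter
                    grants.append((waiter, wanted))
                else:
                    still_waiting.append((waiter, wanted))
            self.waiters = still_waiting
            return grants
```

On a lock request the primary would send a grant immediately when request_exclusive returns True, and otherwise defer the grant until a subsequent release returns the queued request.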
`
As used herein, exclusive, or temporary exclusive access, refers to access by one controller which excludes incompatible access by other controllers. One of ordinary skill will recognize that the degree of exclusivity among controllers depends upon the type of access required. For example, exclusive read/write access by one controller may preclude all other controller activity, exclusive write access by one controller may permit read access by other controllers, and similarly, exclusive append access by one controller may permit read and write access to other controllers for unaffected portions of the shared storage area. It is therefore to be understood that the terms “exclusive” and “temporary exclusive access” refer to all such configurations. Such exclusivity is also referred to herein as “coordinated shared access.”
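
The graduated exclusivity described above reduces to a compatibility policy between a held access type and a newly requested one; the table below encodes one plausible policy and is not a policy defined by the patent:

```python
from enum import Enum

class Mode(Enum):
    READ = "read"
    WRITE = "write"
    APPEND = "append"
    READ_WRITE = "read/write"

# COMPATIBLE[(held, requested)] is True when the requested access may
# proceed alongside the held exclusive access (illustrative policy only).
COMPATIBLE = {
    (Mode.READ_WRITE, Mode.READ): False,   # exclusive r/w precludes all activity
    (Mode.READ_WRITE, Mode.WRITE): False,
    (Mode.WRITE, Mode.READ): True,         # exclusive write may permit reads
    (Mode.WRITE, Mode.WRITE): False,
    (Mode.APPEND, Mode.READ): True,        # append may permit reads and writes
    (Mode.APPEND, Mode.WRITE): True,       #   to unaffected portions
    (Mode.APPEND, Mode.APPEND): False,
}

def may_proceed(held: Mode, requested: Mode) -> bool:
    return COMPATIBLE.get((held, requested), False)
```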
`
Since most RAID controllers rely heavily on cache memory subsystems to improve performance, cache data and cache meta-data is also exchanged among the plurality of controllers to assure coherency of the caches on the plurality of controllers which share access to the common LUN. Each controller which updates its cache memory in response to processing an I/O request (or other management related I/O operation) exchanges cache coherency messages to that effect with a designated primary controller for the associated LUN. The primary controller, as noted above, carries the primary burden of coordinating activity relating to the associated LUN. In addition to the exclusive access lock structures and methods noted above, the primary controller also serves as the distributed cache manager (DCM) to coordinate the state of cache memories among all controllers which manipulate data on the associated LUN.
`In particular, a secondary controller (non-primary with
`respect to a particular LUN) wishing to update its cache data
`in response to an I/O request must first request permission of
`the primary controller (the DCM for the associated LUN) for
`the intended update. The primary controller then invalidates
`any other copies of the same cache data (now obsolete)
`within any other cache memory of the plurality of control-
`lers. Once all other copies of the cache data are invalidated,
`the primary controller grants permission to the secondary
`controller which requested the update. The secondary con-
`troller may then complete the associated I/O request and
`update the cache as required. The primary controller (the
`DCM) thereby maintains data structures which map the
`contents of all cache memories in the plurality of controllers
`which contain cache data relating to the associated LUN.
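
A minimal sketch of the DCM bookkeeping just described, assuming an invalidate-before-grant protocol and hypothetical transport callbacks:

```python
class DistributedCacheManager:
    """Primary-controller map of which peers cache which blocks of a LUN:
    before a secondary may update its cache, every other cached copy of
    the affected blocks is invalidated (cf. FIGS. 8-10)."""
    def __init__(self, send_invalidate, send_grant):
        self.cached_by: dict[int, set[int]] = {}  # block -> controller ids
        self.send_invalidate = send_invalidate    # transport hooks (assumed)
        self.send_grant = send_grant

    def handle_update_permission(self, requester: int, blocks: set[int]) -> None:
        # Invalidate all other copies of the soon-to-be-obsolete blocks.
        for block in blocks:
            for peer in self.cached_by.get(block, set()) - {requester}:
                self.send_invalidate(peer, block)
            self.cached_by[block] = {requester}   # requester holds the only copy
        # Only after invalidation is the requester permitted to update.
        self.send_grant(requester, blocks)

# Hypothetical wiring: controller 2 caches block 7; controller 1 then asks
# permission to update blocks 7 and 8.
log = []
dcm = DistributedCacheManager(
    send_invalidate=lambda peer, blk: log.append(("invalidate", peer, blk)),
    send_grant=lambda ctl, blks: log.append(("grant", ctl, sorted(blks))),
)
dcm.cached_by[7] = {2}
dcm.handle_update_permission(requester=1, blocks={7, 8})
# log: ("invalidate", 2, 7) followed by ("grant", 1, [7, 8])
```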
The semaphore lock request and release information and the cache data and meta-data are exchanged between the plurality of shared controllers through any of several communication mediums. A dedicated communication bus interconnecting all RAID controllers may be preferred for performance criteria, but may present cost and complexity problems. Another preferred approach is where the information is exchanged via the communication bus which connects the plurality of controllers to the common subset of disk drives in the common LUN. This communication bus may be any of several industry standard connections, including, for example, SCSI, Fibre Channel, IPI, SSA, PCI, etc. Similarly the host connection bus which connects the plurality of RAID controllers to one or more host computer systems may be utilized as the shared communication medium. In addition, the communication medium may be a shared memory architecture in which a plurality of controllers share access to a common, multiported memory subsystem (such as the cache memory subsystem of each controller).
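
The choice of medium can be hidden behind a small transport interface; the sketch below models only the shared-memory variant, using in-process queues as a stand-in for a common multiported memory, and is illustrative rather than an implementation from the patent:

```python
import queue
from abc import ABC, abstractmethod
from typing import Any

class ControllerLink(ABC):
    """Abstract transport for inter-controller coordination messages;
    concrete links could ride a dedicated bus, the disk-side bus (SCSI,
    Fibre Channel, ...), the host bus, or shared memory."""
    @abstractmethod
    def send(self, dest_controller: int, message: Any) -> None: ...
    @abstractmethod
    def receive(self) -> Any: ...

class SharedMemoryLink(ControllerLink):
    """Shared-memory variant: one mailbox per controller in common memory,
    modeled here with thread-safe in-process queues."""
    def __init__(self, mailboxes: dict[int, queue.Queue], my_id: int):
        self.mailboxes = mailboxes    # shared by all controllers
        self.my_id = my_id
    def send(self, dest_controller: int, message: Any) -> None:
        self.mailboxes[dest_controller].put(message)
    def receive(self) -> Any:
        return self.mailboxes[self.my_id].get()
```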
As used herein, controller (or RAID controller, or control module) includes any device which applies RAID techniques to an attached array of storage devices (disk drives). Examples of such controllers are RAID controllers embedded within a RAID storage subsystem, RAID controllers embedded within an attached host computer system, RAID control techniques constructed as software components within a computer system, etc. The methods of the present invention are similarly applicable to all such controller architectures.
`
Another aspect of the present invention is the capability to achieve N-way connectivity wherein any number of controllers may share access to any number of LUNs within a RAID storage subsystem. A RAID storage subsystem may include any number of control modules. When operated in accordance with the present invention to provide temporary exclusive access to LUNs within commonly attached storage devices, such a RAID subsystem provides redundant paths to all data stored within the subsystem. These redundant paths serve to enhance reliability of the subsystem while, in accordance with the present invention, enhancing performance of the subsystem by performing multiple operations concurrently on common shared LUNs within the storage subsystem.
The configuration flexibility enabled by the present invention permits a storage subsystem to be configured for any control module to access any data within the subsystem, potentially in parallel with other access to the same data by another control module. Whereas the prior art generally utilized two controllers only for purposes of paired redundancy, the present invention permits the addition of controllers for added performance as well as added redundancy. Cache mirroring techniques of the present invention are easily extended to permit (but not require) any number of mirrored cached controllers. By allowing any number of interfaces (i.e., FC-AL loops) on each controller, various sharing geometries may be achieved in which certain storage devices are shared by one subset of controllers but not another. Virtually any mixture of connections may be achieved in RAID architectures under the methods of the present invention which permit any number of controllers to share access to any number of common shared LUNs within the storage devices.

Furthermore, each particular connection of a controller or group of controllers to a particular LUN or group of LUNs may be configured for a different level of access (i.e., read-only, read-write, append-only, etc.). Any controller within a group of commonly connected controllers may configure the geometry of all controllers and LUNs in the storage subsystem and communicate the resultant configuration to all controllers of the subsystem. In a preferred embodiment of the present invention, a master controller is designated and is responsible for all configuration of the subsystem geometry.
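
For example, a subsystem geometry of the kind depicted in FIG. 11 might be communicated as a simple table of per-controller roles; this representation is a hypothetical illustration:

```python
from dataclasses import dataclass, field

@dataclass
class ControllerConfig:
    controller_id: int
    primary_for: set[str] = field(default_factory=set)    # LUNs it coordinates
    secondary_for: set[str] = field(default_factory=set)  # LUNs reached via a primary
    access: dict[str, str] = field(default_factory=dict)  # per-LUN level, e.g. "read-only"

# One geometry in the spirit of FIG. 11: every LUN has exactly one primary,
# and the other controllers reach it as secondaries.
geometry = [
    ControllerConfig(1, primary_for={"A", "B"}, secondary_for={"C", "D", "E", "F", "G"}),
    ControllerConfig(2, primary_for={"D", "E"}, secondary_for={"A", "B", "C", "F", "G"}),
    ControllerConfig(3, primary_for={"F", "G"}, secondary_for={"A", "B", "C", "D", "E"}),
]
```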
`The present invention therefore improves the scalability
`of a RAID storage subsystem such that control modules can
`
`10
`
`15
`
`30
`
`35
`
`4O
`
`45
`
`50
`
`55
`
`60
`
`65
`
`6
`be easily added and configured for parallel access to com-
`mon shared LUNs. Likewise, additional storage devices can
`be added and utilized by any subset of the controllers
`attached thereto within the RAID storage subsystem. A
`RAID subsystem operable in accordance with the present
`invention therefore enhances the scalability of the subsystem
`to improve performance and/or redundancy through the
`N-way connectivity of controllers and storage devices.
`It is therefore an object of the present invention to provide
`methods and associated apparatus for concurrent processing
of I/O requests by RAID controllers on a shared LUN.
`It is a further object of the present invention to provide
`methods and associated apparatus for concurrent access by
`a plurality of RAID controllers to a common LUN.
`It is still a further object of the present invention to
`provide methods and associated apparatus for coordinating
`shared access by a plurality of RAID controllers to a
`common LUN.
`
`It is yet another object of the present invention to provide
`methods and associated apparatus for managing semaphores
to coordinate shared access by a plurality of RAID controllers to a common LUN.
`
`It is still another object of the present invention to provide
`methods and associated apparatus for managing cache data
`to coordinate shared access by a plurality of RAID control-
`lers to a common LUN.
`
`It is further an object of the present invention to provide
`methods and associated apparatus for managing cache meta-
`data to coordinate shared access by a plurality of RAID
`controllers to a common LUN.
`
It is still further an object of the present invention to
`provide methods and associated apparatus for exchanging
`messages via a communication medium between a plurality
`of RAID controllers to coordinate shared access by a plu-
`rality of RAID controllers to a common LUN.
`It is another object of the present invention to provide
methods and associated apparatus which enable N-way
`redundant connectivity within the RAID storage subsystem.
`It is still another object of the present invention to provide
`methods and associated apparatus which improve scalability
`of a RAID storage subsystem for performance.
The above and other objects, aspects, features, and advantages of the present invention will become apparent from the following description and the attached drawing.
`
`BRIEF DESCRIPTION OF THE DRAWING
`
`FIG. 1 is a block diagram of a typical RAID storage
`subsystem in which the structures and methods of the
`present invention may be applied;
`FIG. 2 is a block diagram depicting a first preferred
`embodiment of RAID controllers operable in accordance
`with the methods of the present invention in which the
`controllers communicate via a shared memory bus or via the
`common disk interface channel;
`FIG. 3 is a block diagram depicting a second preferred
`embodiment of RAID controllers operable in accordance
`with the methods of the present invention in which the
`controllers communicate via one or more multipoint loop
`media connecting controllers and disk drives;
`FIG. 4 is a diagram of an exemplary semaphore pool data
`structure associated with a LUN of the RAID subsystem;
`FIG. 5 is a flowchart describing the operation of the
`primary controller in managing exclusive access to one or
`more LUNs;
`
`FIG. 6 contains two flowcharts providing additional
`details of the operation of steps of FIG. 5 which acquire and
`release exclusive access to particular stripes of particular
`LUNs;
`FIG. 7 is a flowchart describing the operation of a
`controller requesting temporary exclusive access to a stripe
`of a LUN for purposes of performing a requested I/O
`operation;
`FIG. 8 is a flowchart providing additional details for
`elements of FIG. 7;
`FIG. 9 is a flowchart providing additional details for
`elements of FIG. 7;
FIG. 10 contains flowcharts describing background daemon
`processing in both primary and secondary controllers for
`maintaining distributed cache coherency; and
`FIG. 11 is a block diagram depicting another preferred
`embodiment of RAID controllers operable in accordance
`with the methods of the present invention in which the
`plurality of controllers communicate via one or more multi-
`point loop media connecting controllers and disk drives.
`
`DETAILED DESCRIPTION OF THE
`PREFERRED EMBODIMENT
`
`While the invention is susceptible to various modifica-
`tions and alternative forms, a specific embodiment thereof
`has been shown by way of example in the drawings and will
`herein be described in detail.
`It should be understood,
`however, that it is not intended to limit the invention to the
`particular form disclosed, but on the contrary, the invention
`is to cover all modifications, equivalents, and alternatives
`falling within the spirit and scope of the invention as defined
`by the appended claims.
`
RAID SUBSYSTEM OVERVIEW