Sicola et al.

(10) Patent No.: US 6,601,187 B1
(45) Date of Patent: Jul. 29, 2003

(54) SYSTEM FOR DATA REPLICATION USING REDUNDANT PAIRS OF STORAGE CONTROLLERS, FIBRE CHANNEL FABRICS AND LINKS THEREBETWEEN

(75) Inventors: Stephen J. Sicola, Monument, CO (US); Susan G. Elkington, Colorado Springs, CO (US); Michael D. Walker, Colorado Springs, CO (US); Paul Guttormsen, Colorado Springs, CO (US); Richard F. Lary, Colorado Springs, CO (US)

(73) Assignee: Hewlett-Packard Development Company, L.P., Houston, TX (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/539,745

(22) Filed: Mar. 31, 2000

(51) Int. Cl.7 ............................ G06F 11/00
(52) U.S. Cl. ........................ 714/6; 711/162
(58) Field of Search ........... 714/6, 9; 711/162, 161

(56) References Cited

U.S. PATENT DOCUMENTS

5,274,645 A     12/1993  Idleman et al. ......... 371/10.1
5,544,347 A  *   8/1996  Yanai et al. ............ 711/162
5,768,623 A      6/1998  Judd et al. ............. 395/857
5,790,775 A      8/1998  Marks et al. ........ 395/182.07
6,178,521 B1 *   1/2001  Filgate
6,457,098 B1 *   9/2002  DeKoning et al. ........ 711/114

OTHER PUBLICATIONS

Sicola, Stephen J. et al., U.S. Patent Application, Fault Tolerant Storage Controller Utilizing Tightly Coupled Dual Controller Modules, Ser. No. 08/071,710, filed Jun. 4, 1993, pp. 1-90.

* cited by examiner

Primary Examiner—Robert Beausoliel
Assistant Examiner—Christopher McCarthy

(57) ABSTRACT

A data replication system having a redundant configuration including dual Fibre Channel fabric links interconnecting each of the components of two data storage sites, wherein each site comprises a host computer and associated data storage array, with redundant array controllers and adapters. Each array controller in the system is capable of performing all of the data replication functions, and each host 'sees' remote data as if it were local. Each array controller has a dedicated link via a fabric to a partner on the remote side of the long-distance link between fabric elements. Each dedicated link does not appear to any host as an available link to them for data access; however, it is visible to the partner array controllers involved in data replication operations. These links are managed by each partner array controller as if being 'clustered' with a reliable data link between them.

16 Claims, 11 Drawing Sheets

[Front-page drawing: excerpt of FIG. 1 (system 100, switched fabric, elements 107 and 110').]
`
`
[Drawing sheets 1-11 of 11 (OCR figure residue; captions and legible box labels retained):

Sheet 1, FIG. 1: long-distance mirroring topology (system 100, switched fabric, elements 107 and 110').
Sheet 2, FIG. 2: switched dual-fabric, disaster-tolerant storage system 100 (site 218 labeled).
Sheet 3, FIG. 3: block diagram of the system of FIG. 2.
Sheet 4, FIG. 4: high-level remote copy set operation (401).
Sheet 5, FIG. 5: controller software architecture: host port target code (505), host port initiator code (510), PPRC manager, VA & cache manager, device services (515/520/525), connections to external devices.
Sheet 6, FIG. 6A: inter-site controller heartbeat timer operation: A1 sends an echo and commands to B1 via link 1 and starts link and command timers; if a timer times out, A1 transfers control to A2, which communicates with B2 via link 2, and B2 accesses LUN X'.
Sheet 7, FIG. 6B: intra-site controller heartbeat timer operation: controllers C and C' set heartbeat timers (635), ping each other, and reset the timers on each ping received (645); if C fails, C''s heartbeat timer times out, and C' performs controller failover and sends data over its own link.
Sheet 8, FIG. 7: synchronous system operation flowchart: host issues write command (701); controller receives write command (705); command sent to VA layer (710); write data to write-back cache, VA retains cache lock (715); send write data to remote target (720); target writes data to write-back cache (725); target sends completion status to initiator (730); PPRC manager notifies VA of completion (735); VA completes write operation (740); status sent to host (745).
Sheet 9, FIG. 8A: asynchronous system operation flowchart: host issues write command (801); controller receives write command (805); command sent to VA layer (810); write data to write-back cache, VA retains cache lock (815); micro-log the LBN extent (820); send host completion status (825); send write data to remote target (830); target writes data to write-back cache (835); target sends remote copy completion status to initiator and the write is marked complete in the micro-log (840/845); on a controller crash the recorded entries feed the micro-merge of FIG. 8B; other processing continues (860).
Sheet 10, FIG. 8B: 'micro-merge' operation flowchart: inhibit host writes (865); 're-play' the remote copy preserving order; if the remote copy is successful, clear the recorded write entry, otherwise do other recovery/error handling; when all entries are clear, allow host writes to resume.
Sheet 11, FIG. 9: example of a link failover operation (storage arrays 203 and 213).]
`SYSTEM FOR DATA REPLICATION USING
`REDUNDANT PAIRS OF STORAGE
`CONTROLLERS, FIBRE CHANNEL FABRICS
`AND LINKS THEREBETWEEN
`
`FIELD OF THE INVENTION
`
The present invention relates generally to error recovery in data storage systems and, more specifically, to a system for providing controller-based remote data replication using a redundantly configured Fibre Channel Storage Area Network to support data recovery after an error event which causes loss of data access at the local site, whether due to a disaster at that site or to a catastrophic storage failure.
`
`10
`
`15
`
BACKGROUND OF THE INVENTION AND PROBLEM
`
It is desirable to provide the ability for rapid recovery of user data from a disaster or significant error event at a data processing facility. This type of capability is often termed 'disaster tolerance'. In a data storage environment, disaster tolerance requirements include providing for replicated data and redundant storage to support recovery after the event. In order to provide a safe physical distance between the original data and the data to be backed up, the data must be migrated from one storage subsystem or physical site to another subsystem or site. It is also desirable for user applications to continue to run while data replication proceeds in the background. Data warehousing, 'continuous computing', and enterprise applications all require remote copy capabilities.
`
`30
`
Storage controllers are commonly utilized in computer systems to off-load from the host computer certain lower level processing functions relating to I/O operations, and to serve as an interface between the host computer and the physical storage media. Given the critical role played by the storage controller with respect to computer system I/O performance, it is desirable to minimize the potential for interrupted I/O service due to storage controller malfunction. Thus, prior workers in the art have developed various system design approaches in an attempt to achieve some degree of fault tolerance in the storage control function.

One prior method of providing storage system fault tolerance accomplishes failover through the use of two controllers coupled in an active/passive configuration. During failover, the passive controller takes over for the active (failing) controller. A drawback to this type of dual configuration is that it cannot support load balancing to increase overall system performance, as only one controller is active and thus utilized at any given time. Furthermore, the passive controller presents an inefficient use of system resources.
`
`40
`
`45
`
Another approach to storage controller fault tolerance is based on a process called 'failover'. Failover is known in the art as a process by which a first storage controller, coupled to a second controller, assumes the responsibilities of the second controller when the second controller fails. 'Failback' is the reverse operation, wherein the second controller, having been either repaired or replaced, recovers control over its originally-attached storage devices. Since each controller is capable of accessing the storage devices attached to the other controller as a result of the failover, there is no need to store and maintain a duplicate copy of the data, i.e., one set stored on the first controller's attached devices and a second (redundant) copy on the second controller's devices.
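Purely as an illustration of the failover/failback behavior just described (this sketch is not from the patent, and all names in it are invented), a minimal Python model might look like this:

    class Controller:
        """Hypothetical dual-active controller model (illustration only)."""

        def __init__(self, name, devices):
            self.name = name
            self.devices = set(devices)   # devices this controller currently serves
            self.own = set(devices)       # devices originally attached to it
            self.partner = None
            self.alive = True

        def pair_with(self, other):
            self.partner, other.partner = other, self

        def fail(self):
            # Failover: the surviving partner assumes the failed controller's
            # responsibilities. No duplicate data copy is needed, because each
            # controller can access the devices attached to the other.
            self.alive = False
            self.partner.devices |= self.own

        def repair(self):
            # Failback: the repaired controller recovers control over its
            # originally-attached storage devices.
            self.alive = True
            self.partner.devices -= self.own
            self.devices = set(self.own)

    a = Controller("A", {"disk0"})
    b = Controller("B", {"disk1"})
    a.pair_with(b)
    a.fail()
    assert b.devices == {"disk0", "disk1"}   # B now serves both device sets
    a.repair()
    assert b.devices == {"disk1"}            # failback restores the original split

The point of the model is that failover is a transfer of responsibility, not of data: both controllers can already reach both device sets.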
`
`55
`
`60
`
`65
`
`2
`US. Pat. No. 5,274,645 (Dec. 38, 1993), to Idleman et al.
`discloses a dual-active configuration of storage controllers
`capable of performing failover without the direct involve-
`ment of the host. However, the direction taken by Idleman
`requires a multi—level storage controller implementation.
`Each controller in the dual-redundant pair includes a two-
`level hierarchy of controllers. When the first level or host-
`interface controller of the first controller detects the failure
`of the second level or device interface controller of the
`
`second controller, it re-configures the data path such that the
`data is directed to the functioning second level controller of
`the second controller. In conjunction, a switching circuit
`re-configures the controller-device interconnections, thereby
`permitting the host to access the storage devices originally
`connected to the failed second level controller through the
`operating second level controller of the second controller.
`Thus, the presence of the first level controllers serves to
`isolate the host computer from the failover operation, but
`this isolation is obtained at added controller cost and com—
`plexity.
Other known failover techniques are based on proprietary buses. These techniques utilize existing host interconnect "hand-shaking" protocols, whereby the host and controller act in cooperative effort to effect a failover operation. Unfortunately, the "hooks" for this and other types of host-assisted failover mechanisms are not compatible with more recently developed, industry-standard interconnection protocols, such as SCSI, which were not developed with failover capability in mind. Consequently, support for dual-active failover in these proprietary bus techniques must be built into the host firmware via the host device drivers. Because SCSI, for example, is a popular industry standard interconnect, and there is a commercial need to support platforms not using proprietary buses, compatibility with industry standards such as SCSI is essential. Therefore, a vendor-unique device driver in the host is not a desirable option.
U.S. patent application Ser. No. 08/071,710, to Sicola et al., describes a dual-active, redundant storage controller configuration in which each storage controller communicates directly with the host and its own attached devices, the access of which is shared with the other controller. Thus, a failover operation may be executed by one of the storage controllers without the assistance of an intermediary controller and without the physical reconfiguration of the data path at the device interface. However, the technology disclosed in Sicola is directed toward a localized configuration, and does not provide for data replication across long distances.
U.S. Pat. No. 5,790,775 (Aug. 4, 1998) to Marks et al. discloses a system comprising a host CPU and a pair of storage controllers in a dual-active, redundant configuration. The pair of storage controllers reside on a common host-side SCSI bus, which serves to couple each controller to the host CPU. Each controller is configured by a system user to service zero or more preferred host-side SCSI IDs, each host-side ID associating the controller with one or more units located thereon and used by the host CPU to identify the controller when accessing one of the associated units. If one of the storage controllers in the dual-active, redundant configuration fails, the surviving one of the storage controllers automatically assumes control of all of the host-side SCSI IDs and subsequently responds to any host requests directed to the preferred, host-side SCSI IDs and associated units of the failed controller. When the surviving controller senses the return of the other controller, it releases control of the failed controller's preferred SCSI IDs back to the returning controller.
`
`
`
`
In another aspect of the Marks invention, the failover is made to appear to the host CPU as simply a re-initialization of the failed controller. Consequently, all transfers outstanding are retried by the host CPU after time-outs have occurred. Marks discloses 'transparent failover', which is an automatic technique that allows for continued operation by a partner controller on the same storage bus as the failed controller. This technique is useful in situations where the host operating system trying to access storage does not have the capability to adequately handle multiple paths to the same storage volumes. Transparent failover makes the failover event look like a 'power-on reset' of the storage device. However, transparent failover has a significant flaw: it is not fault tolerant to the storage bus. If the storage bus fails, all access to the storage device is lost.
U.S. Pat. No. 5,768,623 (Jun. 16, 1998) to Judd et al. describes a system for storing data for a plurality of host computers on a plurality of storage arrays so that data on each storage array can be accessed by any host computer. There is an adapter communication interface (interconnect) between all of the adapters in the system to provide peer-to-peer communications. Each host has an adapter which provides controller functions for a separate array designated as a primary array (i.e., each adapter functions as an array controller). There are also a plurality of adapters that have secondary control of each storage array. A secondary adapter controls a designated storage array when an adapter primarily controlling the designated storage array is unavailable. The adapter communication interface interconnects all adapters, including secondary adapters. Interconnectivity of the adapters is provided by a Serial Storage Architecture (SSA) which includes SCSI as a compatible subset. Judd indicates that the SSA network could be implemented with various topologies, including a switched configuration.
However, the Judd system elements are interconnected in a configuration that comprises three separate loops, one of which requires four separate links. Therefore, this configuration is complex from a connectivity standpoint, and has disadvantages in areas including performance, physical cabling, and the host involvement required to implement the technique. The performance of the Judd invention for data replication and failover is hindered by the 'bucket brigade' of latency to replicate control information about commands in progress and data movement in general. The physical nature of the invention requires many cables and interconnects to ensure fault tolerance and total interconnectivity, resulting in a system which is complex and error prone. The tie-in with host adapters is host operating system (OS) dependent on an OS platform-by-platform basis, such that the idiosyncrasies of each platform must be taken into account for each different OS to be used with the Judd system.

Therefore, there is a clearly felt need in the art for a disaster-tolerant data storage system capable of performing data backup and automatic failover and failback without the direct involvement of the host computer, and wherein all system components are visible to each other via a redundant network which allows for extended clustering, where local and remote sites share data across the network.
`
`55
`
`60
`
`SOLUTION TO THE PROBLEM
`
Accordingly, the above problems are solved, and an advance in the field is accomplished by the system of the present invention, which provides a completely redundant configuration including dual Fibre Channel fabric links interconnecting each of the components of two data storage sites, wherein each site comprises a host computer and associated data storage array, with redundant array controllers and adapters. The present system is unique in that each array controller is capable of performing all of the data replication functions, and each host 'sees' remote data as if it were local.
`
The 'mirroring' of data for backup purposes is the basis for RAID ('Redundant Array of Independent [or Inexpensive] Disks') level 1 systems, wherein all data is replicated on N separate disks, with N usually having a value of 2. Although the concept of storing copies of data at a long distance from each other (i.e., long-distance mirroring) is known, the use of a switched, dual-fabric, Fibre Channel configuration as described herein is a novel approach to disaster-tolerant storage systems. Mirroring requires that the data be consistent across all volumes. In prior art systems which use host-based mirroring (where each host computer sees multiple units), the host maintains consistency across the units. For those systems which employ controller-based mirroring (where the host computer sees only a single unit), the host is not signaled completion of a command until the controller has updated all pertinent volumes. The present invention is, in one aspect, distinguished over the previous two types of systems in that the host computer sees multiple volumes, but the data replication function is performed by the controller. Therefore, a mechanism is required to communicate the association between volumes to the controller. To maintain this consistency between volumes, the system of the present invention provides a mechanism for associating a set of volumes and synchronizing the logging to the set of volumes, so that the log is consistent when it is "played back" to the remote site.
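The patent states this requirement (synchronized logging across an associated volume set) without giving code; the following hypothetical Python sketch illustrates why a single ordered log per volume set keeps the played-back remote copy consistent. All names and structures are invented:

    class AssociatedVolumeSet:
        """Hypothetical: a set of volumes sharing one ordered replication log."""

        def __init__(self, names):
            self.volumes = {n: {} for n in names}
            self.log = []          # single ordered log for the whole set
            self.seq = 0

        def write(self, volume, block, data):
            self.volumes[volume][block] = data        # local write
            self.log.append((self.seq, volume, block, data))
            self.seq += 1

        def play_back(self, remote_volumes):
            # Replaying in sequence order preserves write ordering *across*
            # volumes in the set, which is what keeps the remote copy consistent.
            for _, volume, block, data in self.log:
                remote_volumes[volume][block] = data
            self.log.clear()

    local = AssociatedVolumeSet(["LUN1", "LUN2"])
    remote = {"LUN1": {}, "LUN2": {}}
    local.write("LUN1", 0, "journal entry")
    local.write("LUN2", 0, "table update")   # depends on the LUN1 write
    local.play_back(remote)
    assert remote["LUN1"][0] == "journal entry"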
`
Each array controller in the present system has a dedicated link via a fabric to a partner on the remote side of the long-distance link between fabric elements. Each dedicated link does not appear to any host as an available link to them for data access; however, it is visible to the partner array controllers involved in data replication operations. These links are managed by each partner array controller as if being 'clustered' with a reliable data link between them.
The fabrics comprise two components, a local element and a remote element. An important aspect of the present invention is the fact that the fabrics are 'extended' by standard E ports (extension ports). The use of E ports allows standard Fibre Channel cable to be run between the fabric elements, or a conversion box to be used to convert the data to a form such as telco ATM or IP. The extended fabric allows the entire system to be viewable by both the hosts and storage.
The dual fabrics, as well as the dual array controllers, dual adapters in hosts, and dual links between fabrics, provide high availability and present no single point of failure. A distinction here over the prior art is that previous systems typically use other kinds of links to provide the data replication, resulting in the storage not being readily exposed to hosts on both sides of a link. The present configuration allows for extended clustering, where local and remote site hosts are actually sharing data across the link from one or more storage subsystems with dual array controllers within each subsystem.
The present system is further distinguished over the prior art by other additional features, including independent discovery of initiator to target system and automatic rediscovery after link failure. In addition, device failures, such as controller and link failures, are detected by 'heartbeat' monitoring by each array controller. Furthermore, no special host software is required to implement the above features, because all replication functionality is totally self-contained within each array controller and is performed automatically, without user intervention.
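As a rough sketch of the heartbeat idea (behavior modeled on FIG. 6B; the interval and timeout constants below are invented, since the patent gives none):

    import time

    PING_INTERVAL = 0.1      # invented values; the patent specifies none
    HEARTBEAT_TIMEOUT = 0.35

    class HeartbeatMonitor:
        """Hypothetical heartbeat monitor for one controller (illustration)."""

        def __init__(self, on_failure):
            self.last_ping = time.monotonic()
            self.on_failure = on_failure
            self.failed_over = False

        def receive_ping(self):
            self.last_ping = time.monotonic()   # partner alive: reset the timer

        def poll(self):
            # Called periodically; a missed heartbeat window means the partner
            # (or the path to it) is treated as failed, triggering failover.
            elapsed = time.monotonic() - self.last_ping
            if not self.failed_over and elapsed > HEARTBEAT_TIMEOUT:
                self.failed_over = True
                self.on_failure()

    mon = HeartbeatMonitor(lambda: print("heartbeat lost: performing failover"))
    for _ in range(3):
        time.sleep(PING_INTERVAL)
        mon.receive_ping()      # pings arriving on schedule: nothing happens
        mon.poll()
    time.sleep(HEARTBEAT_TIMEOUT + 0.05)
    mon.poll()                  # silence past the timeout triggers failover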
`
An additional aspect of the present system is the ability to function over two links with data replication traffic. If failure of a link occurs, as detected by the 'initiator' array controller, that array controller will automatically 'failover', or move the base of data replication operations, to its partner controller. At this time, all transfers in flight are discarded, and are therefore lost to the host. The host simply sees a controller failover at the host OS (operating system) level, causing the OS to retry the operations on the partner controller.
`
The array controller partner continues all 'initiator' operations from that point forward. The array controller whose link failed will continuously watch the status of its link to the same controller on the other 'far' side of the link. That status changes to a 'good' link when the array controllers have established reliable communications with each other. When this occurs, the array controller 'initiator' partner will 'failback' the link, moving operations back to the newly reliable link. This procedure re-establishes load balance for data replication operations automatically, without requiring additional features in the array controller or host beyond what is minimally required to allow controller failover.
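A compact, hypothetical rendering of this failover/failback cycle in Python (controller names follow FIG. 2; the logic is inferred from the description above, not taken from the patent):

    class DataReplicationLink:
        """Hypothetical model: two partner controllers share replication
        duty; the one holding the 'initiator' role sends remote copies."""

        def __init__(self, name):
            self.name = name
            self.link_good = True
            self.is_initiator = False
            self.partner = None

        def on_link_down(self):
            # Failover: in-flight transfers are discarded; the host OS sees
            # a controller failover and retries against the partner.
            self.link_good = False
            if self.is_initiator:
                self.is_initiator = False
                self.partner.is_initiator = True

        def on_link_restored(self):
            # Failback: once reliable communication is re-established, the
            # original controller reclaims operations, restoring load balance.
            self.link_good = True
            self.is_initiator = True
            self.partner.is_initiator = False

    a1, a2 = DataReplicationLink("A1"), DataReplicationLink("A2")
    a1.partner, a2.partner = a2, a1
    a1.is_initiator = True
    a1.on_link_down()
    assert a2.is_initiator          # partner now carries replication traffic
    a1.on_link_restored()
    assert a1.is_initiator          # failback re-establishes load balance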
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`10
`
`15
`
`3O
`
The above objects, features and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a diagram showing long distance mirroring;
FIG. 2 illustrates a switched dual fabric, disaster-tolerant storage system;
FIG. 3 is a block diagram of the system shown in FIG. 2;
FIG. 4 is a high-level diagram of a remote copy set operation;
FIG. 5 is a block diagram showing exemplary controller software architecture;
FIG. 6A is a flow diagram showing inter-site controller heartbeat timer operation;
FIG. 6B is a flow diagram showing intra-site controller heartbeat timer operation;
FIG. 7 is a flowchart showing synchronous system operation;
FIG. 8A is a flowchart showing asynchronous system operation;
FIG. 8B is a flowchart showing a 'micro-merge' operation; and
FIG. 9 is a diagram showing an example of a link failover operation.
`
`40
`
`45
`
`55
`
`DETAILED DESCRIPTION
`
The system of the present invention comprises a data backup and remote copy system which provides disaster tolerance. In particular, the present system provides a redundant peer-to-peer remote copy function which is implemented as a controller-based replication of one or more LUNs (Logical Unit Numbers) between two separate pairs of array controllers.
FIG. 1 is a diagram showing long distance mirroring, which is an underlying concept of the present invention. The present system 100 employs a switched, dual-fabric, Fibre Channel configuration to provide a disaster-tolerant storage system. Fibre Channel is the general name of an integrated set of standards developed by the American National Standards Institute (ANSI) which defines protocols for information transfer. Fibre Channel supports multiple physical interface types, multiple protocols over a common physical interface, and a means for interconnecting various interface types. A 'Fibre Channel' may include transmission media such as copper coax or twisted-pair copper wires in addition to (or in lieu of) optical fiber.
As shown in FIG. 1, when host computer 101 writes data to its local storage array, an initiating node, or 'initiator' 111, sends a backup copy of the data to remote 'target' node 112 via a Fibre Channel switched fabric 103. A 'fabric' is a topology (explained in more detail below) which supports dynamic interconnections between nodes through ports connected to the fabric. In FIG. 1, nodes 111 and 112 are connected to respective links 105 and 106 via ports 109. A node is simply a device which has at least one port to provide access external to the device. In the context of the present system 100, a node typically includes an array controller pair and associated storage array. Each port in a node is generically termed an N (or NL) port. Ports 109 (array controller ports) are thus N ports. Each port in a fabric is generically termed an F (or FL) port. In FIG. 1, links 105 and 106 are connected to switched fabric 103 via F ports 107. More specifically, these F ports may be standard E ports (extension ports) or E port/FC-BB port pairs, as explained below.

In general, it is possible for any node connected to a fabric to communicate with any other node connected to other F ports of the fabric, using services provided by the fabric. In a fabric topology, all routing of data frames is performed by the fabric, rather than by the ports. This any-to-any connection service ('peer-to-peer' service) provided by a fabric is integral to a Fibre Channel system. It should be noted that in the context of the present system, although a second host computer 102 is shown (at the target site) in FIG. 1, this computer is not necessary for operation of the system 100 as described herein.
`
An underlying operational concept employed by the present system 100 is the pairing of volumes (or LUNs) on a local array with those on a remote array. The combination of volumes is called a Remote Copy Set. A Remote Copy Set thus consists of two volumes, one on the local array and one on the remote array. For example, as shown in FIG. 1, a Remote Copy Set might consist of LUN 1 (110) on a storage array at site 101 and LUN 1' (110') on a storage array at site 102. The array designated as the 'local' array is called the initiator, while the remote array is called the target. Various methods for synchronizing the data between the local and remote array are possible in the context of the present system. These synchronization methods range from fully synchronous to fully asynchronous data transmission, as explained below. The system user's ability to choose these methods provides the user with the capability to vary system reliability with respect to potential disasters and the recovery after such a disaster. The present system allows choices to be made by the user based on factors which include the likelihood of disasters and the critical nature of the user's data.
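To make the pairing concrete, here is a hypothetical Python sketch of a remote copy set supporting the two ends of that spectrum: a synchronous mode that completes the host write only after the remote target acknowledges it, and an asynchronous mode that completes immediately and replicates afterward. The structure is invented for illustration and omits the cache and micro-log machinery described later:

    SYNC, ASYNC = "synchronous", "asynchronous"

    class RemoteCopySet:
        """One local volume (initiator side) paired with one remote volume
        (target side); hypothetical illustration only."""

        def __init__(self, mode=SYNC):
            self.local = {}        # stands in for the initiator's volume
            self.remote = {}       # stands in for the target's volume
            self.pending = []      # writes not yet replicated (async mode)
            self.mode = mode

        def host_write(self, block, data):
            self.local[block] = data
            if self.mode == SYNC:
                self.remote[block] = data        # wait for remote completion
                return "complete"                # only then report to host
            self.pending.append((block, data))   # async: report at once
            return "complete (replication pending)"

        def replicate_pending(self):
            # Background replication for asynchronous mode, in write order.
            while self.pending:
                block, data = self.pending.pop(0)
                self.remote[block] = data

    rcs = RemoteCopySet(mode=ASYNC)
    print(rcs.host_write(7, "payload"))
    rcs.replicate_pending()
    assert rcs.remote[7] == "payload"

The trade-off the sketch exposes is the one the user chooses between: synchronous mode never reports completion for data the remote site lacks, while asynchronous mode gains latency at the cost of a replication window.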
`
`System Architecture
`
FIG. 2 illustrates an exemplary configuration of the present invention, which comprises a switched dual fabric, disaster-tolerant storage system 100. The basic topology of the present system 100 is that of a switch-based Storage Area Network (SAN). As shown in FIG. 2, data storage sites 218 and 219 each respectively comprise two hosts 101/101A and 102/102A, and two storage array controllers 201/202 and 211/212 connected to storage arrays 203 and 213, respectively. Alternatively, only a single host 101/102, or more than two hosts, may be connected to system 100 at each site 218/219. Storage arrays 203 and 213 typically comprise a plurality of magnetic disk storage devices, but could also include or consist of other types of mass storage devices such as semiconductor memory.
In the configuration of FIG. 2, each host at a particular site is connected to both fabric elements (i.e., switches) located at that particular site. More specifically, at site 218, host 101 is connected to switches 204 and 214 via respective paths 231A and 231B; host 101A is connected to the switches via paths 241A and 241B. Also located at site 218 are array controllers A1 (ref. no. 201) and A2 (ref. no. 202). Array controller A1 is connected to switch 204 via paths 221H and 221D; array controller A2 is connected to switch 214 via paths 222H and 222D. The path suffixes 'H' and 'D' refer to 'Host' and 'Disaster-tolerant' paths, respectively, as explained below. Site 219 has counterpart array controllers B1 (ref. no. 211) and B2 (ref. no. 212), each of which is connected to switches 205 and 215. Note that array controllers B1 and B2 are connected to switches 205 and 215 via paths 251D and 252D, which are, in effect, continuations of paths 221D and 222D, respectively.
In the present system shown in FIG. 2, all storage subsystems (203/204/205 and 213/214/215) and all hosts (101, 101A, 102, and 102A) are visible to each other over the SAN 103A/103B. This configuration provides for high availability with a dual fabric, dual host, and dual storage topology, where a single fabric, host, or storage array can fail and the system can still continue to access other system components via the SAN. As shown in FIG. 2, each fabric 103A/103B employed by the present system 100 includes two switches interconnected by a high-speed link. More specifically, fabric 103A comprises switches 204 and 205 connected by link 223A, while fabric 103B comprises switches 214 and 215 connected by link 223B.
Basic Fibre Channel technology allows the length of links 223A/223B (i.e., the distance between data storage sites) to be as great as 10 KM as per the FC-PH3 specification (see Fibre Channel Standard: Fibre Channel Physical and Signaling Interface, ANSI X3T11). However, distances of 20 KM and greater are possible given improved technology and FC-PH margins with basic Fibre Channel. FC-BB (Fibre Channel Backbone) technology provides the opportunity to extend Fibre Channel over leased telco lines (also called WAN tunneling). In the case wherein FC-BB is used for links 223A and 223B, FC-BB ports are attached to the E ports to terminate the ends of links 223A and 223B.
It is also possible to interconnect each switch pair 204/205 and 214/215 via an Internet link (223A/223B). If the redundant links 223A and 223B between the data storage sites 218/219 are connected to different ISPs (Internet Service Providers) at the same site, for example, there is a high probability of having at least one link operational at any given time. This is particularly true because of the many redundant paths which are available over the Internet between ISPs. For example, switches 204 and 214 could be connected to separate ISPs, and switches 205 and 215 could also be connected to separate ISPs.
FIG. 3 is an exemplary block diagram illustrating additional details of the system shown in FIG. 2. The configuration of the present system 100, as shown in FIG. 3, depicts only one host per site for the sake of simplicity. Each host 101/102 has two adapters 308 which support the dual fabric topology. The hosts typically run multi-pathing software that dynamically allows failover between storage paths as well as static load balancing of storage volumes (LUNs) between the paths to the controller-based storage arrays 201/202 and 211/212. The configuration of system 100 allows for applications using either of the storage arrays 203/213 to continue running given any failure of either fabric 103A/103B or either of the storage arrays.
The array controllers 201/202 and 211/212 employed by the present system 100 have two host ports 109 per array controller, for a total of four connections (ports) per pair in the dual redundant configuration of FIG. 3. Each host port 109 preferably has an optical attachment to the switched fabric, for example, a Gigabit Link Module ('GLM') interface at the controller, which connects to a Gigabit Converter ('GBIC') module comprising the switch interface port 107. Switch interconnection ports 306 also preferably comprise GBIC modules. Each pair of array controllers 201/202 and 211/212 (and associated storage array) is also called a storage node (e.g., 301 and 302), and has a unique Fibre Channel Node Identifier. As shown in FIG. 3, array controller pair A1/A2 comprises storage node 301, and array controller pair B1/B2 comprises storage node 302. Furthermore, each storage node and each port on the array controller has a unique Fibre Channel Port Identifier, such as a World-Wide ID (WWID). In addition, each unit connected