INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)
`
`WORLD INTELLECTUAL PROPERTY ORGANIZATION
`International Bureau
`
(51) International Patent Classification 6: G06F 13/14, 13/00, 13/10, 13/12

(11) International Publication Number: WO 99/26150 A1

(43) International Publication Date: 27 May 1999 (27.05.99)
`
(21) International Application Number: PCT/US98/21203

(22) International Filing Date: 8 October 1998 (08.10.98)
`
(30) Priority Data:
60/065,848    14 November 1997 (14.11.97)    US
09/034,247    4 March 1998 (04.03.98)    US
09/034,248    4 March 1998 (04.03.98)    US
09/034,812    4 March 1998 (04.03.98)    US
`(81) Designated States: AL, AM, AT, AU, AZ, BA, BB, BG, BR,
`BY, CA, CH, CN, CU, CZ, DE, DK, EE, ES, FI, GB, GE,
`GH, GM, HR, HU, ID, IL, IS, JP, KE, KG, KP, KR, KZ,
`LC, LK, LR, LS, LT, LU, LV, MD, MG, MK, MN, MW,
`MX, NO, NZ, PL, PT, RO, RU, SD, SE, SG, SI, SK, SL, TJ,
`TM, TR, TT, UA, UG, UZ, VN, YU, ZW, ARIPO patent
`(GH, GM, KE, LS, MW, SD, SZ, UG, ZW), Eurasian patent
`(AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), European patent
(AT, BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT,
LU, MC, NL, PT, SE), OAPI patent (BF, BJ, CF, CG, CI,
CM, GA, GN, GW, ML, MR, NE, SN, TD, TG).
`
`(71) Applicant: 3WARE, INC. [US/US]; 420 Waverly Street, Palo
`Alto, CA 94301 (US).
`
`(72) Inventors: MCDONALD, James, A.; 940 Colonial Lane,
`Palo Alto, CA 94301 (US). HERZ, John, Peter; 36 Pine
`Lane, Los Altos, CA 94022 (US). ALTMAN, Mitchell,
`A.; 572 Hill Street, #Penthouse, San Francisco, CA 94114
`(US). SMITH, William, Edward, III; 23797 Thurston Court,
`Hayward, CA 94568 (US).
`
`(74) Agent: SIMPSON, Andrew, H.; Knobbe, Martens, Olson and
Bear, LLP, 16th floor, 620 Newport Center Drive, Newport
Beach, CA 92660 (US).
`
`Published
`With international search report.
`Before the expiration of the time limit for amending the
`claims and to be republished in the event of the receipt of
`amendments.
`
`(54) Title: HIGH-PERFORMANCE ARCHITECTURE FOR DISK ARRAY CONTROLLER
`
`(57) Abstract
`
[Front-page figure: an array of ATA drives (1 through N) connected to an array controller (PCI card) containing automated controllers, a coprocessor, a buffer, and RAM, which links to the host PC.]
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1031, p. i
`
A high-performance RAID system for a PC comprises a controller card (70) which controls an array of ATA disk drives (72). The controller card (70) includes an array of automated disk drive controllers (84), each of which controls one respective disk drive (72). The disk drive controllers (84) are connected to a microcontroller (82) by a control bus (86) and are connected to an automated coprocessor (80) by a packet-switched bus (90). The coprocessor (80) accesses system memory (40) and a local buffer (94). In operation, the disk drive controllers (84) respond to controller commands from the microcontroller (82) by accessing their respective disk drives (72), and by sending packets to the coprocessor (80) over the packet-switched bus (90). The packets carry I/O data (in both directions, with the coprocessor filling in packet payloads on I/O writes), and carry transfer commands and target addresses that are used by the coprocessor (80) to access the buffer (94) and system memory (40). The packets also carry special completion values (generated by the microcontroller) and I/O request identifiers that are processed by a logic circuit (144) of the coprocessor (80) to detect the completion of processing of each I/O request. The coprocessor (80) grants the packet-switched bus (90) to the disk drive controllers (84) using a round-robin arbitration protocol which guarantees a minimum I/O bandwidth to each disk drive (72). This minimum I/O bandwidth is preferably greater than the sustained transfer rate of each disk drive (72), so that all drives of the array can operate at the sustained transfer rate without the formation of a bottleneck.
`
`
`
`
`
`
`
`HIGH-PERFORMANCE ARCHITECTURE
`FOR DISK ARRAY CONTROLLER
`FIELD OF THE INVENTION
The present invention relates to disk arrays and, more particularly, to hardware and software
architectures for hardware-implemented RAID (Redundant Array of Inexpensive Disks) and other disk array systems.
`BACKGROUND OF THE INVENTION
A RAID system is a computer data storage system in which data is spread or "striped" across multiple disk
drives. In many implementations, the data is stored in conjunction with parity information such that any data lost
as the result of a single disk drive failure can be automatically reconstructed.
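This parity property can be sketched in C. The sketch below is illustrative only (the stripe-unit size, drive count, and function names are assumptions, not taken from the patent): the parity block is the byte-wise XOR of the data blocks in a stripe, so the block of any single failed drive equals the XOR of the parity block with all surviving data blocks.

```c
#include <stdint.h>
#include <stddef.h>

#define STRIPE_BYTES 16  /* illustrative stripe-unit size; real arrays use KB-sized units */

/* Compute the parity block as the byte-wise XOR of all data blocks in a stripe. */
static void compute_parity(const uint8_t data[][STRIPE_BYTES], size_t ndrives,
                           uint8_t parity[STRIPE_BYTES])
{
    for (size_t i = 0; i < STRIPE_BYTES; i++) {
        uint8_t p = 0;
        for (size_t d = 0; d < ndrives; d++)
            p ^= data[d][i];
        parity[i] = p;
    }
}

/* Rebuild the block of a single failed drive: XOR of the parity block
 * and all surviving data blocks. */
static void rebuild_block(const uint8_t data[][STRIPE_BYTES], size_t ndrives,
                          const uint8_t parity[STRIPE_BYTES], size_t failed,
                          uint8_t out[STRIPE_BYTES])
{
    for (size_t i = 0; i < STRIPE_BYTES; i++) {
        uint8_t v = parity[i];
        for (size_t d = 0; d < ndrives; d++)
            if (d != failed)
                v ^= data[d][i];
        out[i] = v;
    }
}
```

Because XOR is its own inverse, no special-case logic is needed: the same operation that generates parity also reconstructs a lost block.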
`One simple type of RAID implementation is known as "software RAID." With software RAID, software
`(typically part of the operating system) which runs on the host computer is used to implement the various RAID
`control functions. These control functions include, for example, generating drive-specific read/write requests according
`to a striping algorithm, reconstructing lost data when drive failures occur, and generating and checking parity.
`Because these tasks occupy CPU bandwidth, and because the transfer of parity information occupies bandwidth on
`the system bus, software RAID frequently produces a degradation in performance over single disk drive systems.
Where performance is a concern, a "hardware-implemented RAID" system may be used. With hardware-implemented
RAID, the RAID control functions are handled by a dedicated array controller (typically a card) which
`presents the array to the host computer as a single, composite disk drive. Because little or no host CPU bandwidth
`is used to perform the RAID control functions, and because no RAID parity traffic flows across the system bus, little
`or no degradation in performance occurs.
One potential benefit of RAID systems is that the input/output ("I/O") data can be transferred to and from
multiple disk drives in parallel. By exploiting this parallelism (particularly within a hardware-implemented RAID
system), it is possible to achieve a higher degree of performance than is possible with a single disk drive. The two
basic types of performance that can potentially be increased are the number of I/O requests processed per second
("transactional performance") and the number of megabytes of I/O data transferred per second ("streaming
performance").
`Unfortunately, few hardware-implemented RAID systems provide an appreciable increase in performance.
`In many cases, this failure to provide a performance improvement is the result of limitations in the array controller's
`bus architecture. Performance can also be adversely affected by frequent interrupts of the host computer's
`processor.
`In addition, attempts to increase performance have often relied on the use of expensive hardware
`components. For example, some RAID array controllers rely on the use of a relatively expensive microcontroller that
can process I/O data at a high transfer rate. Other designs rely on complex disk drive interfaces, and thus require
`the use of expensive disk drives.
`The present invention addresses these and other limitations in existing RAID architectures.
`
`5
`
`1 O
`
`15
`
`20
`
`25
`
`30
`
`35
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1031, p. 1
`
`
`
`
`SUMMARY OF THE INVENTION
`The present invention provides a high-performance architecture for a hardware-implemented RAID or other
`disk array system. An important benefit of the architecture is that it provides a high degree of performance (both
`transactional and streaming) without the need for disk drives that are based on expensive or complex disk drive
`interfaces.
`In a preferred embodiment, the architecture is embodied within a PC-based disk array system which
`comprises an array controller card which controls an array of ATA disk drives. The controller card includes an array
`of automated ATA disk drive controllers, each of which controls a single, respective ATA drive.
`The controller card also includes an automated coprocessor which is connected to each disk drive controller
`by a packet-switched bus, and which connects as a busmaster to the host PC bus. The coprocessor is also
connected to a local I/O data buffer of the card. As described below, a primary function of the coprocessor is to
transfer I/O data between the disk drive controllers, the system memory, and the buffer in response to commands
received from the disk drive controllers. Another function of the coprocessor is to control all accesses by the disk
drive controllers to the packet-switched bus, to thereby control the flow of I/O data.
`The controller card further includes a microcontroller which connects to the disk drive controllers and to
`the coprocessor by a local control bus. The microcontroller runs a control program which implements a RAID storage
configuration. Because the microcontroller does not process or directly monitor the flow of I/O data (as described
`below), a low-cost, low-performance microcontroller can advantageously be used.
In operation, the controller card processes multiple I/O requests at a time, and can process multiple I/O
requests without interrupting the host computer. As I/O requests are received from the host computer, the
microcontroller generates drive-specific sequences of controller commands (based on the particular RAID configuration),
and dispatches these controller commands over the local control bus to the disk drive controllers. In addition to
containing disk drive commands, these controller commands include transfer commands and target addresses that
are (subsequently) used by the coprocessor to transfer I/O data to and from system memory and the local buffer.
Some of the controller commands also include disk completion values and tokens (I/O request identifiers)
that are used by the coprocessor to monitor the completion status of pending I/O requests. The disk completion
values are generated by the microcontroller such that the application of a specific logic function to all of the disk
completion values for a given I/O request produces a final completion value that is known a priori to the coprocessor.
As described below, this enables the coprocessor to detect the completion of processing of an I/O request without
prior knowledge of the details (number of invoked disk drives, etc.) of the I/O request.
In response to the controller commands, the disk drive controllers access their respective disk drives and
send packets to the coprocessor over the packet-switched bus. These packets carry I/O data (in both directions,
with the coprocessor filling in packet payloads on I/O writes), and carry transfer commands and target addresses that
are used by the coprocessor to access the buffer and system memory. During this process, the coprocessor grants
the packet-switched bus to the disk drive controllers (for the transmission of a single packet) using a round-robin
arbitration protocol which guarantees a minimum I/O bandwidth to each disk drive. The minimum bandwidth is equal
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1031, p. 2
`
`
`
`
to 1/N of the total I/O bandwidth of the packet-switched bus, where N is the number of disk drive controllers (and disk
drives) in the array.

Because this minimum I/O bandwidth is greater than or equal to the sustained transfer rate of each disk
drive, all N drives can operate concurrently at the sustained transfer rate indefinitely without the formation of a
bottleneck. When the packet-switched bus is not being used by all of the disk drive controllers (i.e., one or more
disk drive controllers has no packets to transmit), the arbitration protocol allows other disk drive controllers to use
more than the guaranteed minimum I/O bandwidth. This additional I/O bandwidth may be used, for example, to
transfer I/O data at a rate higher than the sustained transfer rate when the requested I/O data resides in the disk
drive's cache.
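The arbitration behavior described above can be modeled in software as follows. In the actual system it is a logic circuit in the coprocessor; this C model, including the names `pending`, `next_slot`, and `arbitrate`, is purely illustrative. The key property is that the grant pointer rotates past each winner, so in any window of N grants every controller with traffic wins at least once (the 1/N guarantee), while slots left idle by quiet controllers are absorbed by busy ones.

```c
#include <stdbool.h>
#include <stddef.h>

#define NDRIVES 8

/* One request flag per automated drive controller (set when it has a
 * packet to transmit). */
static bool pending[NDRIVES];
static size_t next_slot = 0;  /* rotating arbitration pointer */

/*
 * Grant the packet-switched bus for the transmission of one packet.
 * Starting from the slot after the previous winner, scan all N
 * controllers and pick the first with a packet pending.  Each
 * controller is therefore guaranteed at least one packet slot per N
 * grants (1/N of bus bandwidth).  Returns -1 if the bus is idle.
 */
int arbitrate(void)
{
    for (size_t i = 0; i < NDRIVES; i++) {
        size_t slot = (next_slot + i) % NDRIVES;
        if (pending[slot]) {
            next_slot = (slot + 1) % NDRIVES;  /* rotate past the winner */
            return (int)slot;
        }
    }
    return -1;  /* no controller has a packet to send */
}
```

Granting one packet at a time (rather than one whole transfer) is what keeps per-drive latency bounded: a drive with a long transfer cannot lock out the others between grants.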
The disk drive controllers process their respective sequences of controller commands asynchronously to one
another; thus, the disk drive controllers that are invoked by a given I/O request can finish processing the I/O request
in any order. When a given disk drive controller finishes processing an I/O request, the controller sends a special
completion packet to the coprocessor. This completion packet contains the completion value that was assigned to
the disk drive controller, and contains an identifier (token) of the I/O request.
Upon receiving the completion packet, the coprocessor cumulatively applies the logic function to the
completion value and all other completion values (if any) that have been received for the same I/O request, and
compares the result to the final completion value. If a match occurs, indicating that all disk drives invoked by the
I/O request have finished processing the I/O request, the coprocessor uses the token to inform the host computer
and the microcontroller of the identity of the completed I/O request. Thus, the microcontroller monitors the
completion status of pending I/O requests without directly monitoring the flow of I/O data.
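The completion-value scheme can be sketched in C. The patent speaks only of "a specific logic function"; XOR is assumed here as one natural choice, and all names (`FINAL_VALUE`, `assign_completion_values`, `on_completion_packet`) are hypothetical. The point of the scheme is that the coprocessor needs no per-request bookkeeping beyond one running value per token: it does not know, or need to know, how many drives a request invokes.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TOKENS  16    /* max outstanding I/O requests (illustrative) */
#define FINAL_VALUE 0xA5  /* final completion value known a priori to the coprocessor */

static uint8_t accum[MAX_TOKENS];  /* running XOR per outstanding request token */

/*
 * Microcontroller side: assign one completion value per invoked drive
 * such that the XOR of all of them equals FINAL_VALUE.  (Values should
 * be chosen so that no proper subset also XORs to FINAL_VALUE; the
 * simple pattern below is merely illustrative.)
 */
void assign_completion_values(uint8_t vals[], size_t ndrives)
{
    uint8_t x = 0;
    for (size_t d = 0; d + 1 < ndrives; d++) {
        vals[d] = (uint8_t)(0x80 | d);
        x ^= vals[d];
    }
    vals[ndrives - 1] = (uint8_t)(x ^ FINAL_VALUE);  /* forces the XOR to FINAL_VALUE */
}

/*
 * Coprocessor side: fold each completion packet's value into the running
 * XOR for its token.  A match with FINAL_VALUE means every invoked drive
 * has reported in, regardless of the order in which completions arrive.
 */
bool on_completion_packet(unsigned token, uint8_t value)
{
    accum[token] ^= value;
    return accum[token] == FINAL_VALUE;
}
```

Because XOR is commutative and associative, the drives can finish in any order and the detection logic never changes.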
`BRIEF DESCRIPTION OF THE DRAWINGS
These and other features of the architecture will now be described in further detail with reference to the
drawings of the preferred embodiment, in which:
`Figure 1 illustrates a prior art disk array architecture.
`Figure 2 illustrates a disk array system in accordance with a preferred embodiment of the present invention.
`Figure 3 illustrates the general flow of information between the primary components of the Figure 2 system.
`Figure 4 illustrates the types of information included within the controller commands.
`Figure 5 illustrates a format used for the transmission of packets.
`Figure 6 illustrates the architecture of the system in further detail.
Figure 7 is a flow diagram which illustrates a round-robin arbitration protocol which is used to control
access to the packet-switched bus of Figure 2.
`Figure 8 illustrates the completion logic circuit of Figure 6 in further detail.
`Figure 9 illustrates the transfer/command control circuit of Figure 6 in further detail.
`Figure 10 illustrates the operation of the command engine of Figure 9.
`
`DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1031, p. 3
`
`
`
`
I. Existing RAID Architectures
To illustrate several of the motivations behind the present invention, a prevalent prior art architecture used
within existing PC-based RAID systems will initially be described with reference to Figure 1. As depicted in Figure
1, the architecture includes an array controller card 30 ("array controller") that couples an array of SCSI (Small
Computer Systems Interface) disk drives 32 to a host computer (PC) 34. The array controller 30 plugs into a PCI
(Peripheral Component Interconnect) expansion slot of the host computer 34, and communicates with a host processor
38 and a system memory 40 via a host PCI bus 42. For purposes of this description and the description of the
preferred embodiment, it may be assumed that the host processor 38 is an Intel Pentium™ or other X86-compatible
microprocessor, and that the host computer 34 is operating under either the Windows™ 95 or the Windows™ NT
operating system.
The array controller 30 includes a PCI-to-PCI bridge 44 which couples the host PCI bus 42 to a local PCI
bus 46 of the controller 30, and which acts as a bus master with respect to both busses 42, 46. Two or more
SCSI controllers 50 (three shown in Figure 1) are connected to the local PCI bus 46. Each SCSI controller 50
controls the operation of two or more SCSI disk drives 32 via a respective shared cable 52. The array controller
30 also includes a microcontroller 56 and a buffer 58, both of which are coupled to the local PCI bus by appropriate
bridge devices (not shown). The buffer 58 will typically include appropriate exclusive-OR (XOR) logic 60 for
performing the XOR operations associated with RAID storage protocols.
In operation, the host processor 38 (running under the control of a device driver) sends input/output (I/O)
requests to the microcontroller 56 via the host PCI bus 42, the PCI-to-PCI bridge 44, and the local PCI bus 46. Each
I/O request typically consists of a command descriptor block (CDB) and a scatter-gather list. The CDB is a SCSI
drive command that specifies such parameters as the disk operation to be performed (e.g., read or write), a disk drive
logical block address, and a transfer length. The scatter-gather list is an address list of one or more contiguous
blocks of system memory for performing the I/O operation.
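The shape of such an I/O request can be sketched as a C structure. The field names, array sizes, and layout below are hypothetical (a real driver would follow the SCSI specification and the adapter's interface); the sketch just shows the two parts the text describes: a drive command and an address list of contiguous memory regions.

```c
#include <stdint.h>
#include <stddef.h>

/* One contiguous block of system memory (hypothetical layout). */
struct sg_entry {
    uint32_t phys_addr;  /* physical base address of the block */
    uint32_t length;     /* block length in bytes */
};

/* A host I/O request: a SCSI command descriptor block plus the
 * scatter-gather list naming where in system memory the data lives. */
struct io_request {
    uint8_t cdb[10];       /* e.g. a READ(10) CDB: opcode, logical block address, transfer length */
    size_t  sg_count;      /* number of scatter-gather entries in use */
    struct sg_entry sg[8]; /* address list of contiguous memory blocks */
};

/* Total bytes described by the scatter-gather list (this should match
 * the transfer length encoded in the CDB). */
uint32_t sg_total_bytes(const struct io_request *req)
{
    uint32_t total = 0;
    for (size_t i = 0; i < req->sg_count; i++)
        total += req->sg[i].length;
    return total;
}
```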
The microcontroller 56 runs a firmware program which translates these I/O requests into component, disk-specific
SCSI commands based on a particular RAID configuration (such as RAID 4 or RAID 5), and dispatches these
commands to corresponding SCSI controllers 50. For example, if, based on the particular RAID configuration
implemented by the system, a given I/O request requires data to be read from every SCSI drive 32 of the array, the
microcontroller 56 sends SCSI commands to each of the SCSI controllers 50. The SCSI controllers in turn arbitrate
for control of the local PCI bus 46 to transfer I/O data between the SCSI disks 32 and system memory 40. I/O
data that is being transferred from system memory 40 to the disk drives 32 is initially stored in the buffer 58. The
buffer 58 is also typically used to perform XOR operations, rebuild operations (in response to disk failures), and other
operations associated with the particular RAID configuration. The microcontroller 56 also monitors the processing
of the dispatched SCSI commands, and interrupts the host processor 38 to notify the device driver of completed
transfer operations.
`The Figure 1 architecture suffers from several deficiencies that are addressed by the present invention.
`One such deficiency is that the SCSI drives 32 are expensive in comparison to ATA (AT Attachment) drives. While
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1031, p. 4
`
`
`
`
it is possible to replace the SCSI drives with less expensive ATA drives (see, for example, U.S. Pat. No. 5,506,977),
the use of ATA drives would generally result in a decrease in performance. One reason for the decreased
performance is that ATA drives do not buffer multiple disk commands; thus each ATA drive would normally remain
inactive while a new command is being retrieved from the microcontroller 56. One goal of the present invention is
thus to provide an architecture in which ATA and other low-cost drives can be used while maintaining a high level
of performance.
Another problem with the Figure 1 architecture is that the local PCI bus and the shared cables 52 are
susceptible to being dominated by a single disk drive 32. Such dominance can result in increased transactional
latency, and a corresponding degradation in performance. A related problem is that the local PCI bus 46 is used
both for the transfer of commands and the transfer of I/O data; increased command traffic on the bus 46 can
therefore adversely affect the throughput and latency of data traffic. As described below, the architecture of the
preferred embodiment overcomes these and other problems by using separate control and data busses, and by using
a round-robin arbitration protocol to grant the local data bus to individual drives.
Another problem with the prior art architecture is that because the microcontroller 56 has to monitor the
component I/O transfers that are performed as part of each I/O request, a high-performance microcontroller generally
must be used. As described below, the architecture of the preferred embodiment avoids this problem by shifting the
completion monitoring task to a separate, non-program-controlled device that handles the task of routing I/O data,
and by embedding special completion data values within the I/O data stream to enable such monitoring. This
effectively removes the microcontroller from the I/O data path, enabling the use of a lower cost, lower performance
microcontroller.
Another problem, in at least some RAID implementations, is that the microcontroller 56 interrupts the host
processor 38 multiple times during the processing of a single I/O request. For example, it is common for the
microcontroller 56 to interrupt the host processor 38 at least once for each contiguous block of system memory
referenced by the scatter-gather list. Because there is significant overhead associated with the processing of an
interrupt, the processing of the interrupts significantly detracts from the processor bandwidth that is available for
handling other types of tasks. It is therefore an object of the present invention to provide an architecture in which
the array controller interrupts the host processor no more than once per I/O request.
A related problem, in many RAID architectures, is that when the array controller 30 generates an interrupt
request to the host processor 38, the array controller suspends operation, or at least postpones generating the
following interrupt request, until after the pending interrupt request has been serviced. This creates a potential
bottleneck in the flow of I/O data, and increases the number of interrupt requests that need to be serviced by the
host processor 38. It is therefore an object of the invention to provide an architecture in which the array controller
continues to process subsequent I/O requests while an interrupt request is pending, so that the device driver can
process multiple completed I/O requests when the host processor eventually services an interrupt request.
`The present invention provides a high performance disk array architecture which addresses these and other
`problems with prior art RAID systems. An important aspect of the invention is that the primary performance benefits
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1031, p. 5
`
`
`
`
provided by the architecture are not tied to a particular type of disk drive interface. Thus, the architecture can be
implemented using ATA drives (as in the preferred embodiment described below) and other types of relatively low-cost
drives while providing a high level of performance.
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`II.
`
`System Overview
`A disk array system which embodies the various features of the present invention will now be described
`with reference to the remaining drawings. Throughout this description, reference will be made to various
`implementation-specific details, including, for example, part numbers, industry standards, timing parameters, message
`formats, and widths of data paths. These details are provided in order to fully set forth a preferred embodiment
`of the invention, and not to limit the scope of the invention. The scope of the invention is set forth in the appended
`claims.
`
As depicted in Figure 2, the disk array system comprises an array controller card 70 ("array controller")
that plugs into a PCI slot of the host computer 34. The array controller 70 links the host computer to an array of
ATA disk drives 72 (numbered 1-N in Figure 2), with each drive connected to the array controller by a respective
ATA cable 76. In one implementation, the array controller 70 includes eight ATA ports to permit the connection of
up to eight ATA drives. The use of a separate port per drive 72 enables the drives to be tightly controlled by the
array controller 70, as is desirable for achieving a high level of performance. In the preferred embodiment, the array
controller 70 supports both the ATA mode 4 standard (also known as Enhanced IDE) and the Ultra ATA standard
(also known as Ultra DMA), permitting the use of both types of drives.
As described below, the ability to use less expensive ATA drives, while maintaining a high level of
performance, is an important feature of the invention. It will be recognized, however, that many of the architectural
features of the invention can be used to increase the performance of disk array systems that use other types of
drives, including SCSI drives. It will also be recognized that the disclosed array controller 70 can be adapted for
use with other types of disk drives (including CD-ROM and DVD drives) and mass storage devices (including FLASH
and other solid state memory drives).
In the preferred embodiment, the array of ATA drives 72 is operated as a RAID array using, for example,
a RAID 4 or a RAID 5 configuration. The array controller 70 can alternatively be configured through firmware to
operate the drives using a non-RAID implementation, such as a JBOD (Just a Bunch of Disks) configuration.
With further reference to Figure 2, the array controller 70 includes an automated array coprocessor 80, a
microcontroller 82, and an array of automated controllers 84 (one per ATA drive 72), all of which are interconnected
by a local control bus 86 that is used to transfer command and other control information. (As used herein, the term
"automated" refers to a data processing unit which operates without fetching and executing sequences of macro-instructions.)
The automated controllers 84 are also connected to the array coprocessor 80 by a packet-switched
bus 90. As further depicted in Figure 2, the array coprocessor 80 is locally connected to a buffer 94, and the
microcontroller 82 is locally connected to a read-only memory (ROM) 96 and a random-access memory (RAM) 98.
`
The packet-switched bus 90 handles all I/O data transfers between the automated controllers 84 and the
array coprocessor 80. All transfers on the packet-switched bus 90 flow either to or from the array coprocessor 80,
and all accesses to the packet-switched bus are controlled by the array coprocessor. These aspects of the bus
architecture provide for a high degree of data flow performance without the complexity typically associated with PCI
and other peer-to-peer type bus architectures.

As described below, the packet-switched bus 90 uses a packet-based round-robin protocol that guarantees
that at least 1/N of the bus's I/O bandwidth will be available to each drive during each round-robin cycle (and thus
throughout the course of each I/O transfer). Because this amount (1/N) of bandwidth equals or exceeds the
sustained data transfer rate of each ATA drive 72 (which is typically in the range of 10 Mbytes/sec.), all N drives
can operate concurrently at the sustained data rate without the formation of a bottleneck. For example, in an 8-drive
configuration, all 8 drives can continuously stream 10 Mbytes/second of data to their respective automated
controllers 84, in which case the packet-switched bus 90 will transfer the I/O data to the array coprocessor at a
rate of 80 Mbytes/second. When fewer than N drives are using the packet-switched bus 90, each drive is allocated
more than 1/N of the bus's bandwidth, allowing each drive to transfer data at a rate which exceeds the sustained
data transfer rate (such as when the requested I/O data resides in the disk drive's cache).
In the preferred embodiment, the array coprocessor 80 is implemented using an FPGA, such as a Xilinx
4000-series FPGA. An application-specific integrated circuit (ASIC) or other type of device may alternatively be used.
The general functions performed by the array coprocessor 80 include the following: (i) forwarding I/O requests from
the host processor 38 to the microcontroller 82; (ii) controlling arbitration on the packet-switched bus 90; (iii) routing
I/O data between the automated controllers 84, the system memory 40, and the buffer 94; (iv) performing exclusive-OR,
read-modify-write, and other RAID-related logic operations involving I/O data using the buffer 94; and (v)
monitoring and reporting the completion status of I/O requests. With respect to the PCI bus 42 of the host computer
34, the array coprocessor 80 acts as a PCI initiator (a type of PCI bus master) which initiates memory read and
write operations based on commands received from the automated controllers 84. The operation of the array
coprocessor 80 is further described below.
The buffer 94 is preferably either a 1 megabyte (MB) or 4 MB volatile, random access memory.
Synchronous DRAM or synchronous SRAM may be used for this purpose. All data that is written from the host
computer 34 to the disk array is initially written to this buffer 94. In addition, the array coprocessor 80 uses this
buffer 94 for volume rebuilding (such as when a drive or a drive sector goes bad) and parity generation. Although
the buffer 94 is external to the array coprocessor in the preferred embodiment, it may alternatively be integrated
into the same chip.
The microcontroller 82 used in the preferred embodiment is a Siemens 163. The microcontroller 82 is
controlled by a firmware control program (stored in the ROM 96) that implements a particular RAID or non-RAID
storage protocol. The primary function performed by the microcontroller is to translate I/O requests from the host
computer 34 into sequences of disk-specific controller commands, and to dispatch these commands over the local
control bus 86 to specific automated controllers 84 for processing. As described below, the architecture is such
`
`5
`
`1 O
`
`15
`
`20
`
`25
`
`30
`
`35
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1031, p. 7
`
`
`
`
that the microcontroller 82 does not have to directly monitor the I/O transfers that result from the dispatched
`controller commands, as this task is allocated to the array coprocessor 80 (using an efficient completion token
`scheme which is described below). This aspect of the architecture enables a relatively low cost, low performance
`microcontroller to be used, and reduces the complexity of the control program.
`Although the microcontroller 82 is a separate device in the preferred embodiment, the microcontroller could
`alternatively be integrated into