(12) United States Patent
Jewett et al.

(10) Patent No.: US 7,392,291 B2
(45) Date of Patent: Jun. 24, 2008

(54) ARCHITECTURE FOR PROVIDING BLOCK-LEVEL STORAGE ACCESS OVER A COMPUTER NETWORK

(75) Inventors: Douglas E. Jewett, Round Rock, TX (US); Adam J. Radford, Mission Viejo, CA (US); Bradley D. Strand, Los Gatos, CA (US); Jeffrey D. Chung, Cupertino, CA (US); Joel D. Jacobson, Mountain View, CA (US); Robert B. Haigler, Newark, CA (US); Rod S. Thompson, Sunnyvale, CA (US); Thomas L. Couch, Woodside, CA (US)

(73) Assignee: Applied Micro Circuits Corporation, San Diego, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 1484 days.

(21) Appl. No.: 09/927,894

(22) Filed: Aug. 10, 2001

(65) Prior Publication Data
US 2002/0049825 A1    Apr. 25, 2002

(60) Related U.S. Application Data: Provisional application No. 60/224,664, filed on Aug. 11, 2000.

(51) Int. Cl.
G06F 15/167    (2006.01)
G06F 12/00     (2006.01)

(52) U.S. Cl. ............................ 709/214; 709/215; 711/114

(58) Field of Classification Search ............ 709/203, 212, 213, 214, 215, 216, 217, 226, 227, 228, 229, 231, 239; 711/114
See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

5,127,097 A    6/1992  Mizuta ...................... 711/168

(Continued)

OTHER PUBLICATIONS

Carlson, M., Mali, S., Merhar, M., Monia, C., Rajagopal, M., "A Framework for IP Based Storage," Internet Draft Document printed from ietf.org web site, Nov. 17, 2000, pp. 1-36.

(Continued)

Primary Examiner: Ario Etienne
Assistant Examiner: Sargon N Nano
(74) Attorney, Agent, or Firm: Knobbe, Martens, Olson & Bear LLP

(57) ABSTRACT
`
A network-based storage system comprises one or more block-level storage servers that connect to, and provide disk storage for, one or more host computers ("hosts") over logical network connections (preferably TCP/IP sockets). In one embodiment, each host can maintain one or more socket connections to each storage server, over which multiple I/O operations may be performed concurrently in a non-blocking manner. The physical storage of a storage server may optionally be divided into multiple partitions, each of which may be independently assigned to a particular host or to a group of hosts. Host driver software presents these partitions to user-level processes as one or more local disk drives. When a host initially connects to a storage server in one embodiment, the storage server authenticates the host, and then notifies the host of the ports that may be used to establish data connections and of the partitions assigned to that host.
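The partition-assignment model summarized above can be sketched as follows. This is an illustrative reconstruction, not the patented implementation; the class and method names, and the two-level privilege scheme, are assumptions.

```python
# Minimal sketch of the partition model: each partition of a block server
# may be assigned to a single host or to a group of hosts (possibly with
# different access privileges per host), and unassigned partitions remain
# unavailable.  All names here are illustrative, not from the patent.

READ_ONLY, READ_WRITE = "ro", "rw"

class BlockServerPartitions:
    def __init__(self):
        self.assignments = {}  # partition number -> {host name: privilege}

    def assign(self, partition: int, host: str, privilege: str = READ_WRITE):
        self.assignments.setdefault(partition, {})[host] = privilege

    def partitions_for(self, host: str):
        """Partitions a newly authenticated host would be told about; each
        appears to the host's user-level processes as a local disk drive."""
        return sorted(p for p, hosts in self.assignments.items() if host in hosts)

    def may_write(self, partition: int, host: str) -> bool:
        return self.assignments.get(partition, {}).get(host) == READ_WRITE
```

A host that authenticates successfully would receive only the result of `partitions_for()`, keeping other hosts' partitions invisible to it.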
`
`41 Claims, 9 Drawing Sheets
`
[Front-page drawing: hosts #1 and #2 (102), each with a host driver (HD, 204) and reader/writer (RW, 200) components, linked by socket connections 400 over the network (100) to a block server (104) containing RW components and a server driver (SD, 206).]
`
`Petitioners Microsoft Corporation and HP Inc. - Ex. 1005, p. 1
`
`
`
`
U.S. PATENT DOCUMENTS

5,754,830 A     5/1998  Butts et al. ................ 719/311
5,787,463 A *   7/1998  Gajjar ...................... 711/114
5,867,491 A     2/1999  Derango et al. .............. 370/329
5,974,532 A    10/1999  McLain et al. ............... 712/208
5,996,014 A    11/1999  Uchihori et al. ............. 709/226
6,003,045 A    12/1999  Freitas et al. .............. 707/205
6,076,142 A     6/2000  Corrington et al. ........... 711/114
6,137,796 A    10/2000  Derango et al. .............. 370/389
6,161,165 A *  12/2000  Solomon et al. .............. 711/114
6,339,785 B1*   1/2002  Feigenbaum .................. 709/213
6,378,036 B2*   4/2002  Lerman et al. ............... 711/112
6,393,026 B1*   5/2002  Irwin ....................... 370/401
6,421,711 B1*   7/2002  Blumenau et al. ............. 709/213
6,502,205 B1*  12/2002  Yanai et al. ................ 714/7
6,654,752 B2*  11/2003  Ofek ........................ 707/10
6,662,268 B1*  12/2003  McBrearty et al. ............ 711/114
6,671,776 B1*  12/2003  DeKoning .................... 711/114
6,725,456 B1*   4/2004  Bruno et al. ................ 718/102
6,834,326 B1*  12/2004  Wang et al. ................. 711/114
6,912,668 B1*   6/2005  Brown et al. ................ 714/6
6,955,956 B2*  10/2005  Tanaka et al. ............... 438/166
6,970,869 B1*  11/2005  Slaughter et al. ............ 707/10
6,983,330 B1*   1/2006  Oliveira et al. ............. 709/239
OTHER PUBLICATIONS

Gibson, G., Nagle, D., Amiri, K., Chang, F., Feinberg, E., Gobioff, H., Lee, C., Ozceri, B., Riedel, E., and Rochberg, D., "A Case for Network-Attached Secure Disks," Technical Report CMU-CS-96-142, Department of Electrical and Computer Engineering, Carnegie Mellon University, Sep. 26, 1996, pp. 1-19.
Hotz, S., Van Meter, R., and Finn, G., "Internet Protocols for Network-Attached Peripherals," Proc. Sixth NASA Goddard Conference on Mass Storage Systems and Technologies in cooperation with Fifteenth IEEE Symposium on Mass Storage Systems, Mar. 1998, pp. 1-15.
Lee, E., and Thekkath, C., "Petal: Distributed Virtual Disks," ACM Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, ACM Digital Library, Oct. 1996, pp. 84-92.
Perkins, C., and Harjono, H., "Resource Discovery Protocol for Mobile Computing," Mobile Networks and Applications, vol. 1, issue 4, 1996, pp. 447-455.
Tierney, B., Johnston, W., Herzog, H., Hoo, G., Jin, G., Lee, J., Chen, L., and Rotem, D., "Distributed Parallel Data Storage Systems: A Scalable Approach to High Speed Image Servers," Proceedings of the Second ACM International Conference on Multimedia, Oct. 1994, pp. 399-405.
Van Meter, R., Finn, G., and Hotz, S., "VISA: Netstation's Virtual Internet SCSI Adapter," Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 1998, pp. 71-80.
`
`* cited by examiner
`
[Drawing Sheets 1-9: figure images omitted from this text extraction; the recoverable labels are summarized below.]

FIG. 1 (Sheet 1): Primary hardware components of an example system, including a host computer and a block server interconnected by a network.
FIG. 2 (Sheet 2): Software architecture of the system of FIG. 1, including host-side and server-side device drivers and reader/writer components.
FIG. 3 (Sheet 3): Types of networks and network components that can interconnect the hosts and block servers.
FIG. 4 (Sheet 4): Socket connections 400 linking host computers 102 (each with a host driver 204) over the network 100 to block servers 104 (each with a server driver 206).
FIG. 5 (Sheet 5): Read-operation timing diagram with vertical timelines for OS, HD, HRW, N, SRW, and SD and numbered messages (1H, 2H, 3H, 5H, 6H, 7H).
FIG. 6 (Sheet 6): Write-operation timing diagram for the same components, including an optional cache layer.
FIG. 7 (Sheet 7): Assignment of I/O requests to socket connections.
FIG. 8 (Sheet 8): Partitioning of a block server's storage: Partition 0 (unassigned), Partition 1 (Host 1), Partition 2 (Host 2), Partition 3 (Host 3), Partition 4 (Host 1); tables of assigned host(s), access privileges, and disks/sectors; a config./management server; and a web browser interface (810).
FIG. 9 (Sheet 9): Authentication and discovery flowchart: host attempts to connect to block server; block server accepts the connection request; host receives software versions from the block server; host replies to block server with a selected version; block server checks the response from the host and authenticates; block server sends acknowledgment of the authentication to the host; host sends a request to determine available capacity on the block server; block server sends capacity and the number of IP connections authorized; block server establishes "listen" sockets for data channels from the host (steps 910-950).
`
ARCHITECTURE FOR PROVIDING BLOCK-LEVEL STORAGE ACCESS OVER A COMPUTER NETWORK
`
`PRIORITY CLAIM
`
This application claims the benefit of U.S. Provisional Appl. No. 60/224,664, filed Aug. 11, 2000, the disclosure of which is hereby incorporated by reference.

APPENDICES

This specification includes appendices A-D, which contain details of a commercial implementation of the invention. The appendices are provided for illustrative purposes, and not to define or limit the scope of the invention.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to storage systems for computer networks, and more specifically, relates to software architectures for providing block-level access to storage resources on a network.

2. Description of the Related Art

Various types of architectures exist for allowing host computers to share hard disk drives and other storage resources on a computer network. One common type of architecture involves the use of a central file manager. One problem with this architecture is that the failure of the central file manager can render the entire system inoperable. Another problem is that many software applications are not designed to use a central file manager.

Some storage architectures overcome these deficiencies by allowing the host computers to access the storage resources directly over the network, without the use of a central file manager. Typically, these architectures allow the host to access the storage resources over a network connection at the block level (as opposed to the file level). One problem with this type of architecture is that the failure of an input/output request can cause other pending requests from the same host to be delayed. Another problem is that the architecture is highly vulnerable to network failures. The present invention addresses these and other problems.

SUMMARY OF THE INVENTION

The present invention comprises a system architecture for providing block-level access to storage resources, such as disk arrays, over a computer network without the need for a central file manager. The architecture embodies various inventive features that may be implemented individually or in combination.

One feature of the architecture is that concurrent input/output (I/O) requests from the same host computer ("host") are handled over separate logical network connections or sockets (preferably TCP/IP sockets). For example, in a preferred embodiment, a given host can establish two socket connections with a given block-level storage server, and use one socket to perform one I/O request while using the other socket to perform another I/O request. As a result, the failure or postponement of one I/O request does not block or interfere with other I/O requests.

Another feature of the architecture is that the sockets can be established over multiple networks, including networks of different types and bandwidths, to provide increased fault tolerance. For example, a given host computer and storage server can be connected by two networks that support the TCP/IP protocol, one of which may provide a much lower transfer rate than the other. As long as one of these networks is functioning properly, the host will be able to establish a logical connection to the block server and execute I/O requests.

In one embodiment, the architecture includes a host-side device driver and a host-side reader/writer component that run on the host computers. The architecture also includes a server-side device driver and a server-side reader/writer component that run on the block-level storage servers. The reader/writer components are preferably executed as separate processes that are established in pairs (one host-side reader/writer process and one server-side reader/writer process), with each pair dedicated to a respective socket over a network. For example, if two logical connections are established between a given host computer and a given storage server, each such socket will be managed by a different pair of reader/writer processes. The reader/writer processes and sockets preferably remain persistent over multiple I/O requests. The device drivers and reader/writer processes operate to export the block-level-access interface of the storage servers to the host computers, so that the disk drives of the block servers appear to the host computers as local storage resources.

In accordance with one inventive feature of the architecture, when an I/O request from a host process involves the transfer of more than a threshold quantity of data, the host's device driver divides the I/O request into two or more constituent I/O operations. Each such operation is assigned to a different socket connection with the target storage server such that the constituent operations may be performed, and the associated I/O data transferred, in parallel over the network. This feature of the architecture permits relatively large amounts of network bandwidth to be allocated to relatively large I/O requests.

Another feature of the architecture is a mechanism for dividing the physical storage space or units of a block-level storage server into multiple partitions, and for allocating these partitions to hosts independently of one another. In a preferred embodiment, a partition can be allocated uniquely to a particular host, or can be allocated to a selected group of hosts (in which case different hosts may have different access privileges to the partition). The partition or partitions assigned to a particular host appear, and can be managed, as one or more local disk drives.

Yet another inventive feature of the architecture is an authentication and discovery protocol through which a storage server authenticates a host, and then provides access information to the host, before permitting the host to access storage resources. In a preferred embodiment, when the host is booted up, it initially establishes a configuration socket connection to the storage server. Using this configuration socket, the storage server authenticates the host, preferably using a challenge-response method that is dependent upon a version of the driver software. If the authentication is successful, the storage server provides access information to the host, such as the identities of dynamic ports which may be used by the host for data connections to the storage server, and information about any partitions of the storage server that are assigned to that host. This feature of the architecture provides a high degree of security against unauthorized accesses, and allows storage partitions to be securely assigned to individual hosts.
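The challenge-response authentication described in the Summary can be sketched as follows. The patent states only that the method depends on the driver software version; the use of a random nonce, an HMAC, and the particular key derivation below are all assumptions of this sketch, not the patented scheme.

```python
import hashlib
import hmac
import os

# Hypothetical key derivation: the patent says the challenge-response is
# dependent on the driver software version, so this sketch mixes the
# version string into the key.  The derivation itself is an assumption.
def version_key(version: str, secret: bytes) -> bytes:
    return hashlib.sha256(secret + version.encode()).digest()

class BlockServerAuth:
    """Server side of an illustrative challenge-response handshake over
    the configuration socket."""

    def __init__(self, secret: bytes):
        self.secret = secret

    def make_challenge(self) -> bytes:
        self.challenge = os.urandom(16)  # random nonce sent to the host
        return self.challenge

    def verify(self, version: str, response: bytes) -> bool:
        key = version_key(version, self.secret)
        expected = hmac.new(key, self.challenge, hashlib.sha256).digest()
        return hmac.compare_digest(expected, response)

def host_response(version: str, secret: bytes, challenge: bytes) -> bytes:
    """Host side: prove knowledge of the secret for a given driver version."""
    key = version_key(version, secret)
    return hmac.new(key, challenge, hashlib.sha256).digest()
```

On a successful `verify()`, the server would then send the host its access information: the dynamic data ports and the assigned partitions.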
`
`
`
`
BRIEF DESCRIPTION OF THE DRAWINGS

These and other features will now be described with reference to the drawings of certain embodiments of the invention, which are intended to illustrate, and not limit, the scope of the invention.

FIG. 1 illustrates the primary hardware components of an example system in which the invention may be embodied, including a host computer and a block server.

FIG. 2 illustrates the software architecture of the system of FIG. 1, including host-side and server-side device drivers and reader/writer (RW) components that operate according to the invention.

FIG. 3 illustrates examples of the types of networks and network components that can be used to interconnect the hosts and block servers.

FIG. 4 shows, in example form, how the concurrent socket connections are established between pairs of reader/writer components.

FIG. 5 illustrates the flow of information between components when a host computer performs a read from a block server.

FIG. 6 illustrates the flow of information between components when a host computer performs a write to a block server.

FIG. 7 illustrates how I/O requests are assigned to socket connections transparently to user-level applications, and illustrates how an I/O request may be subdivided for processing over multiple TCP/IP connections.

FIG. 8 illustrates how the physical storage of a block server may be divided into multiple partitions, each of which may be independently allocated to one or more host computers.

FIG. 9 illustrates an authentication and discovery protocol through which a host computer is authenticated by a block server, and then obtains information for accessing the block server.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The system architecture described in this section, and in the attached appendices, embodies various inventive features that may be used individually or in combination. Some of these features may be implemented without others, and/or may be implemented differently than set forth herein, without departing from the scope of the invention as defined by the appended claims.

I. Overview

The present invention comprises a system architecture for providing block-level storage access over one or more computer networks. The architecture is designed to incorporate any number of host computers and block-level storage servers communicating across a network or a combination of networks. In one embodiment, the architecture exports virtualized storage blocks over TCP/IP connections. Because TCP/IP is used for communications between the host computers and block-level storage servers in a preferred embodiment, a variety of network topologies can be used to interconnect the host computers and the block servers of a given system. For example, for relatively small systems, the host computers and storage servers can be interconnected by a hub, while for larger systems, the hub may be replaced with a switch.

Depicted in FIG. 1 are the hardware components of a typical system that embodies the invention. The system includes a host computer 102 ("host") and a block-level IP storage server 104 ("block server") interconnected by a network 100 via respective network interface cards 106, such as 10/100/1000 Base-T or 1000 Base-SX Gigabit Ethernet cards. The host computer 102 may be a standard PC or workstation configured to operate as a server or as a user computer. The block server 104 may be a network-attached IP storage box or device which provides block-level data storage services for host computers 102 on the network 100.

In the illustrated embodiment, the block server 104 includes a disk array controller 110 that controls an array of disk drives 112. A disk array controller 110 of the type described in U.S. Pat. No. 6,098,114 may be used for this purpose, in which case the disk drives 112 may be ATA/IDE drives. The disk array controller may support a variety of disk array configurations, such as RAID 0, RAID 5, RAID 10, and JBOD, and is preferably capable of processing multiple I/O requests in parallel. The block server 104 also includes a CPU board and processor 108 for executing device drivers and related software. The block server may also include volatile RAM (not shown) for caching I/O data, and may include flash or other non-volatile solid-state memory for storing configuration information (see FIG. 8).

In one embodiment, the network 100 may be any type or combination of networks that support TCP/IP sockets, including but not limited to Local Area Networks (LANs), wireless LANs (e.g., 802.11 WLANs), Wide Area Networks (WANs), the Internet, and direct connections. One common configuration is to locally interconnect the hosts 102 and block servers 104 by an Ethernet network to create an Ethernet-based SAN (Storage Area Network). As depicted by dashed lines in FIG. 1, the host and the block server 102, 104 may be interconnected by a second network 100', using a second set of network cards 106', to provide increased fault tolerance (as described below). The two networks 100, 100' may be disparate networks that use different mediums and provide different transfer speeds. Some of the various network options are described in more detail below with reference to FIG. 3.

The software components of the architecture are shown in FIG. 2. The host side 102 of the software architecture includes an operating system (O/S) 202 such as Unix, Windows NT, or Linux; a host-side device driver 204 ("host driver") which communicates with the operating system 202; and a reader/writer (RW) component 200a (also referred to as an "agent") which communicates with the host driver 204. The storage side 104 of the software architecture includes a reader/writer (RW) component 200b and a storage-side device driver 206 ("server driver") that are executed by the CPU board's processor 108 (FIG. 1). The server driver 206 initiates disk operations in response to I/O requests received from the server-side RW component 200b.

The RW components 200a, 200b are preferably executed as separate processes that are established in pairs (one host-side RW process and one server-side RW process), with each pair dedicated to a respective TCP/IP socket over a network 100. The host RW 200a operates generally by "reading" I/O requests from the host driver 204, and "writing" these requests onto the network 100. Similarly, the storage RW 200b operates generally by reading I/O requests from the network 100 and writing these requests to the server driver 206. This process can occur simultaneously with transfers by other RW pairs, and can occur in any direction across the network 100. The RW components 200 also preferably perform error checking of transferred I/O data.

Each RW process (and its corresponding socket) preferably remains persistent on its respective machine 102, 104, and processes I/O requests one at a time on a first-in-first-out basis until the connection fails or is terminated. A host computer 102 establishes a socket by sending a service request
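The reader/writer pairing described above, one host-side and one server-side RW process per persistent socket, each handling requests one at a time on a FIFO basis, can be sketched as follows. This is an illustrative reconstruction in Python: the length-prefixed framing, the sentinel-based shutdown, and the use of `socketpair()` in place of a real TCP/IP connection are all assumptions of the sketch.

```python
import queue
import socket
import threading

def host_rw(requests: queue.Queue, sock: socket.socket) -> None:
    # Host-side reader/writer: "read" I/O requests from the host driver's
    # queue and "write" them onto the network, one at a time, FIFO.
    while True:
        req = requests.get()
        if req is None:  # sentinel: connection terminated
            sock.shutdown(socket.SHUT_WR)
            return
        # Length-prefixed framing (an assumption; the patent does not
        # specify the wire framing).
        sock.sendall(len(req).to_bytes(4, "big") + req)

def server_rw(sock: socket.socket, server_driver) -> None:
    # Server-side reader/writer: read requests off the network and pass
    # each one to the server driver for execution.
    buf = b""
    while True:
        data = sock.recv(4096)
        if not data:  # peer shut down: connection ended
            return
        buf += data
        while len(buf) >= 4:
            n = int.from_bytes(buf[:4], "big")
            if len(buf) < 4 + n:
                break
            server_driver(buf[4:4 + n])
            buf = buf[4 + n:]

def run_pair(reqs):
    # One RW pair per socket; socketpair() stands in for a TCP/IP socket.
    host_sock, server_sock = socket.socketpair()
    received = []
    q = queue.Queue()
    for r in reqs:
        q.put(r)
    q.put(None)
    t1 = threading.Thread(target=host_rw, args=(q, host_sock))
    t2 = threading.Thread(target=server_rw, args=(server_sock, received.append))
    t1.start(); t2.start()
    t1.join(); t2.join()
    host_sock.close(); server_sock.close()
    return received
```

Because each pair owns its socket exclusively, several such pairs can run concurrently without coordinating with one another, which is what makes a failed or slow request on one socket harmless to the others.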
`
`
`
`
over a dedicated configuration socket to the relevant block server 104. Once a socket connection is established between a RW pair 200a, 200b, the socket handles bi-directional traffic between the host computer 102 and block server 104.

In the illustrated embodiment, the RW components 200 run as processes that are separate from the host and server drivers 204, 206, respectively. The host-side 200a and storage-side 200b RW could alternatively be implemented, for example, as one or more of the following: (a) part of the host and server drivers 204, 206 (respectively), (b) separate device drivers 204, 206 (respectively), (c) separate kernel threads, (d) multiple threads within a single process, (e) multiple threads within multiple processes, and (f) multiple processes within a single thread.

A host computer 102 may establish multiple logical connections (sockets) to a given block server 104, and/or establish sockets to multiple different block servers 104 (as discussed below). An important benefit of this feature is that it allows multiple I/O requests from the same host to be processed concurrently (each over a separate socket) in a non-blocking manner: if one socket fails, the I/O requests being performed over other sockets are not affected. Each socket is managed by a respective RW pair.

An important function of the host driver 204 is that of virtualizing the storage provided by the block servers 104, so that all higher-level software processes on the host, such as the operating system and other user-level processes, view the block server storage as one or more local, physical disk drives. To accomplish this task, the host driver dynamically assigns I/O requests to TCP/IP socket connections without revealing the existence of such connections, or any other network details, to user-level processes. The block server 104 preferably appears to the host's user-level processes as a SCSI device, allowing conventional volume managers to be used.

As described below in sub-section III, one embodiment of the architecture permits the physical storage of a block server 104 to be divided into multiple, variable-size partitions. Each such partition may be independently allocated to one or more hosts, and may be configured such that it is viewed and managed as a separate physical disk drive. In other embodiments, block-level access may be provided to the hosts without partitioning.

FIG. 3 shows some of the various networks 100 and network components that may be used to interconnect the hosts 102 and block servers 104 of a given system. These include a hub 302 (commonly used to connect LAN segments), the Internet 304, a router 306 (a computer that forwards packets according to header information), a switch 308 (a device that filters and forwards packets between LAN segments), and a gateway 310 (a computer that interconnects two different types of networks). The system architecture allows any combination of these network options to be used to interconnect a given host computer 102 and block server 104.

An important feature of the architecture is that when the network 100 becomes inundated with traffic, a network administrator can either add network capacity on the fly or change the network hardware without causing any loss of data. The host-side 102 and storage-side 104 software components are configured, using conventional methods, to detect and use new network connections as they become available, and to retry operations until a connection is established. For example, a network administrator could initially connect thirty host computers 102 to a small number of block servers 104 using a network hub 302. When the number of computers reaches a level at which the network hub 302 is no longer suitable, a 1000-port switch could be added to the network 100 and the hub 302 removed without taking the network 100 off-line. The architecture functions this way because the host RW 200a creates a new socket connection to the storage RW 200b automatically as new physical connections become available.

The architecture and associated storage control protocol present the storage resources to the host computers 102 as a logically contiguous array of bytes which are accessible in blocks (e.g., of 512 bytes). The logical data structures of the implementation support byte-level access, but disk drives typically export blocks which are of a predetermined size, in bytes. Thus, to access a given block, a block address (sector number) and a count of the number of blocks (sectors) is provided. In one embodiment, the protocol exports a 64-bit logical block address (LBA) and 64-bit sector count. On write operations, the I/O write data request is packaged into a block structure on the host side 102. The block request and data are sent to the block server 104 over one or more of the socket connections managed by the host RW processes 200a. The architecture also allows data to be stored non-sequentially and allows for the storage medium to efficiently partition space and reclaim unused segments.

Depicted in FIG. 4 are sample socket connections 400 made by RW pairs 200 connecting over a network 100 to link host computers 102 to block servers 104. As mentioned above, the network 100 may actually consist of multiple networks 100, including fully redundant networks 100. Each host computer 102 can open one or more socket connections 400 (using corresponding RW pairs) to any one or more block servers 104 as needed to process I/O requests. New socket connections 400 can be opened, for example, in response to long network response times, failed socket connections 400, the availability of new physical connections, and increases in I/O requests. For example, a host computer 102 can initially open two sockets 400 to a first block server 104, and subsequently open two more sockets 400 to another block server 104 as additional storage resources are needed. Another host computer 102 may have open socket connections 400 to the same set of block servers 104, as shown. As described above, each socket 400 acts as an independent pipeline for handling I/O requests, and remains open until either an error occurs or the host 102 terminates the socket connection 400.

II. Processing of Input/Output Requests

FIGS. 5 and 6 illustrate a network storage protocol that may be used for I/O read operations and write operations (respectively) between a host computer 102 and a block server 104 over a socket connection 400. Located at the tops of the vertical lines in FIGS. 5 and 6 are abbreviations that denote components as follows:

OS = Operating System
HD = Host Driver 204
HRW = Host Computer's Reader/Writer 200a
N = Network
SRW = Server Reader/Writer 200b (of block server)
SD = Server Driver 206 (of block server)

Time increases, but is not shown to scale, in these diagrams moving from top to bottom. Arrows from one vertical line to another generally represent the flow of messages or data between components. An arrow that begins and ends at the same component (vertical line) represents an action performed by that component. The small circles in the figures represent rendezvous events.

In one embodiment, as shown in FIG. 5, the host reader/writer (HRW) initially sends a request 1H to the host driver (HD) for an I/O command packet, indicating that the socket is available for use. This step can be viewed as the message "if
`
`
`
`
you have work to do, give it to me." The host driver eventually responds to this request by returning a command packet that specifies an I/O request, as shown. As represented by the arrow labeled 2H, the host reader/writer (HRW) translates the command packet into a network-generalized order. This step allows different, cross-platform computer languages to function on a common network 100. The local computational transformation of a host command packet, or host language, to a network command packet, or network language, is architecture specific.

At this point, the host reader/writer (HRW) generates two network events, 3H and 4H. Message 4H represents a post of a received network response packet, from 3H, across the network 100 and is discussed below. Message 3H represents the network-generalized command packet being written over a pre-existing "pinned-up" TCP/IP connection. In order for this transfer to occur in the preferred embodiment, a rendezvous must take place with 1S, which represents a network request to receive the command packet. This request 1S has the ability to wait indefinitely if there is no "work" to be done. Once the network rendezvous is satisfied, the command packet is received by the block server's reader/writer (SRW), and is re-translated by the SRW to the server-side language via step 2S. Step 2S is similar to the translation of the host command packet to a network command packet shown in 2H.

As further illustrated in FIG. 5, message 3S represents the server-side reader/writer posting the command packet to the server driver (SD) 206. Included in the command packet are the following: an I/O unit number (a small integer that is a logical identifier for the underlying disk drive partition on any form of storage disks), a command (a small integer indicating the type of command, such as a read operation or a write operation), a starting logical block address (an integer indicating the starting block or sector for the I/O operation), and the block count (an integer indicating the number of blocks or sectors for the I/O operation).

After the command packet is delivered to the server device driver (SD), a response is sent back to the server-side reader/writer (SRW). As depicted by 4S, the server-side reader/writer transforms this response packet from storage-side order to network order.
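The command-packet fields listed above (I/O unit number, command code, starting logical block address, and block count), together with the 64-bit LBA and 64-bit sector count named earlier, suggest a fixed-layout structure that is translated to network byte order in steps 2H and 2S. The sketch below packs such a packet; the 16-bit widths chosen for the unit and command fields, and the field ordering, are assumptions (the text says only "a small integer").

```python
import struct

# Network-order ("network-generalized") command packet, per the fields the
# text names: unit number, command, 64-bit LBA, 64-bit sector count.
# '!' selects big-endian network byte order; H = 16-bit, Q = 64-bit.
CMD_READ, CMD_WRITE = 1, 2
PACKET_FMT = "!HHQQ"  # unit:16, command:16, lba:64, count:64 (widths assumed)

def pack_command(unit: int, command: int, lba: int, count: int) -> bytes:
    """Host-side step 2H: translate a command packet to network order."""
    return struct.pack(PACKET_FMT, unit, command, lba, count)

def unpack_command(packet: bytes):
    """Server-side step 2S: re-translate to the server-side representation."""
    return struct.unpack(PACKET_FMT, packet)
```

Using an explicit network byte order is what lets heterogeneous host and server architectures share one wire format, which is the point of the host-to-network translation the text describes.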