Attanasio et al.

US005371852A

[11] Patent Number: 5,371,852
[45] Date of Patent: Dec. 6, 1994
`
[54] METHOD AND APPARATUS FOR MAKING A CLUSTER OF COMPUTERS APPEAR AS A SINGLE HOST ON A NETWORK

[75] Inventors: Clement R. Attanasio, Peekskill; Stephen E. Smith, Mahopac, both of N.Y.

[73] Assignee: International Business Machines Corporation, Armonk, N.Y.

[21] Appl. No.: 960,742

[22] Filed: Oct. 14, 1992

[51] Int. Cl.5 .................... G06F 13/00
[52] U.S. Cl. ..................... 395/200; 370/85.13
[58] Field of Search .............. 395/200, 500; 370/60, 92, 93, 94.1, 54, 85.13, 85.1, 85.6, 85.8, 60.1, 110.1, 85.11, 95.1; 364/284.4, 242.94, 284.3, 284
`
[56] References Cited

U.S. PATENT DOCUMENTS

4,276,643  6/1981  Laprie et al. ................ 371/8
4,667,287  5/1987  Allen et al. ................. 364/200
4,719,621  1/1988  May .......................... 370/85
4,958,273  9/1990  Anderson et al. .............. 364/200
5,023,780  6/1991  Brearley ..................... 364/200
5,079,765  1/1992  Nakamura ..................... 370/85.13
5,088,032  2/1992  Bosack ....................... 395/200
5,093,920  3/1992  Agrawal et al. ............... 395/800
5,109,515  4/1992  Laggis et al. ................ 395/725
5,125,081  6/1992  Chiba ........................ 395/325
5,166,931 11/1992  Riddle ....................... 370/94.1
5,185,860  2/1993  Wu ........................... 395/200
5,224,205  6/1993  Dinkin et al. ................ 395/200
`
FOREIGN PATENT DOCUMENTS

1002342    1/1991  Belgium
59-117842  7/1984  Japan
63-193739  8/1988  Japan
`
OTHER PUBLICATIONS

D. E. Comer, "Internetworking with TCP/IP, Principles, Protocols and Architecture", Prentice Hall, US, Chapter 7, pp. 91-97; Chapter 11, pp. 159-169; and Chapter 12, pp. 171-203.
"Network Services", Network Programming, Sun Microsystems, Inc., 2550 Garcia Ave., Mountain View, Calif., Chapter 1, pp. 3-30, May 1988.
OSF DCE Release 1.0 S3, DCE Release Notes, Open Software Foundation, 11 Cambridge Center, Cambridge, Mass., Mar. 25, 1991, pp. i-v, 1-1 thru B-2.
J. K. Ousterhout et al., "The Sprite Network Operating System", IEEE Computer, Feb. 1988, US, pp. 23-36.
D. R. Cheriton, "The V Distributed System", Communications of the ACM, Mar. 1988, vol. 31, No. 3, pp. 314-333.
A. Bhide et al., "A Highly Available Network File Server", USENIX Conference, Winter 1991, Dallas, Tex., p. 199.
S. J. Mullender et al., "Amoeba: A Distributed Operating System for the 1990s", IEEE Computer, May 1990, US, pp. 44-53.
L. Peterson et al., "The X-kernel: A Platform for Accessing Internet Resources", IEEE Computer, May 1990, US, pp. 23-33.
A. Litman, "The DUNIX Distributed Operating System", Operating Systems Review, ACM Press, NY, vol. 22, No. 1, Jan. 1988, pp. 42-51.

Primary Examiner—Dale M. Shaw
Assistant Examiner—Moustafa M. Meky
Attorney, Agent, or Firm—Louis J. Percello
[57] ABSTRACT

The present invention provides a method and apparatus for enabling a cluster of computers to appear as a single computer to host computers outside the cluster. A host computer communicates only with a gateway to access destination nodes and processes within the cluster. The gateway has at least one message switch which processes incoming and outgoing port type messages crossing the cluster boundary. This processing comprises examining certain information on the message headers and then changing some of this header information either to route an incoming message to the proper computer node, port and process or to make an outgoing message appear as if originated at the gateway node. The message switch uses a table to match incoming messages to a particular routing function which can be run to perform the changes necessary to correctly route different kinds of messages.

35 Claims, 13 Drawing Sheets
`
[Representative drawing: an encapsulated cluster 200 of nodes 105-107 connected by an interconnect 110, with a gateway 109 joining the cluster to a network 120 and external hosts at the cluster boundary 125.]
`
`
`
[Sheet 1 of 13: FIG. 1A (prior art) shows a cluster 100 of nodes 105-108 connected by a communication means 110, with a gateway G 109 joining the cluster to a network 120 of hosts H 130 at a boundary 125 over connection 127. FIG. 1B (prior art) shows the same cluster with the single gateway 109 connected through connections 127 to a plurality of networks 120 (Network 1 through Network q).]
`
`
`
[Sheet 2 of 13: FIG. 1C (prior art) shows a cluster 100 in which a plurality of nodes act as gateways G1, G2, ..., Gp (109), each connecting the cluster nodes 105-108, joined by interconnect 110, to one or more networks 120 and hosts H 130 over connections 127.]
`
`
`
[Sheet 3 of 13: FIG. 2 shows an encapsulated cluster 200: nodes 105-107 and a gateway 109 joined by an interconnect 110, with the gateway connecting to a network 120 and remote hosts 130 at the cluster boundary 125 over connection 127. The gateway contains ports 230, a message switch 240, and routing functions 250; an incoming message 210 and an outgoing message 220 are shown crossing the boundary.]
`
`
`
[Sheet 4 of 13: FIG. 3A (prior art) shows conceptual protocol layering: application 308, port-to-port (PP) 306, machine-to-machine (MM) 304, and network interface 302. FIG. 3B (prior art) shows the corresponding headers: application data 330 carried in a PP datagram (PP header 336 and data area), carried in turn in an MM datagram (MM header 324, MM data area 325) and a frame (frame header 312 and frame data area). FIG. 3C (prior art) shows the Internet Protocol (IP) header fields: VERS, HLEN, SERVICE TYPE, TOTAL LENGTH, IDENTIFICATION, FLAGS, FRAGMENT OFFSET, TIME TO LIVE, PROTOCOL, HEADER CHECKSUM, SOURCE IP ADDRESS, DESTINATION IP ADDRESS, IP OPTIONS, PADDING, and DATA.]
`
`
`
[Sheet 5 of 13: FIG. 3D (prior art) shows the UDP datagram: UDP source port, UDP destination port, UDP message length, UDP checksum, and data. FIG. 3E (prior art) shows the TCP datagram: source port, destination port, sequence number, acknowledgement number, HLEN, reserved, code bits, window, checksum, urgent pointer, options, padding, and data.]
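
As a concrete illustration of where the message switch finds the fields it keys on, the sketch below extracts the protocol and port numbers from the standard IP/UDP/TCP layouts summarized in FIGS. 3C-3E. It is illustrative only; the sample datagram bytes and the assumed cluster address 9.2.43.1 are not taken from the patent.

```python
# Illustrative sketch: extracting the fields the message switch examines
# (protocol, source/destination address, destination port) from the standard
# IP/UDP/TCP header layouts summarized in FIGS. 3C-3E. The sample datagram
# below is fabricated.
import socket
import struct


def parse_ip_and_port(datagram: bytes):
    """Return (protocol, source ip, destination ip, source port, dest port)."""
    vers_hlen = datagram[0]
    ihl = (vers_hlen & 0x0F) * 4                 # IP header length in bytes
    proto = datagram[9]                          # PROTOCOL field (6=TCP, 17=UDP)
    saddr = socket.inet_ntoa(datagram[12:16])    # SOURCE IP ADDRESS
    daddr = socket.inet_ntoa(datagram[16:20])    # DESTINATION IP ADDRESS
    sport = dport = None
    if proto in (6, 17):                         # TCP and UDP both begin with the ports
        sport, dport = struct.unpack("!HH", datagram[ihl:ihl + 4])
    return proto, saddr, daddr, sport, dport


# A fabricated 20-byte IP header (UDP, 9.2.43.5 -> 9.2.43.1) followed by an
# 8-byte UDP header addressed to port 2049 (NFS), with no payload.
ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 1, 0, 64, 17, 0,
                 socket.inet_aton("9.2.43.5"), socket.inet_aton("9.2.43.1"))
udp = struct.pack("!HHHH", 1019, 2049, 8, 0)
print(parse_ip_and_port(ip + udp))   # (17, '9.2.43.5', '9.2.43.1', 1019, 2049)
```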
`
`
`
[Sheet 6 of 13: FIG. 4 shows a preferred embodiment of the message switch in the gateway, connected over the interconnect 110 to ports on node 1 (105) and node 2 (106).]
`
`
`
[Sheet 7 of 13: FIG. 5A, the first part of the flow chart for routing an incoming message. Top: wait for an MM message (505); read the destination address DADDR in the MM header (510); if DADDR is not the cluster address (515), see FIG. 6; read the protocol type PROTO (520); if PROTO is not a port type protocol (525), process the message in the gateway in the "normal" manner and go to top; otherwise locate and read the destination port DPORT in the PP header (530) and search the message switch table for an entry matching the (DPORT, PROTO) pair (535).]
`
`
`
[Sheet 8 of 13: FIG. 5B, the remainder of the incoming-message flow chart. If a matching entry was found (540): when the entry names a routing function, invoke the routing function, which computes NODE_ADDR and NODE_PORT (565); otherwise set NODE_ADDR to the entry's node field and NODE_PORT to DPORT (560). If no entry was found: for a non-specific port (545), compute NODE_ADDR from DPORT and set NODE_PORT to DPORT; otherwise set NODE_ADDR to the gateway and NODE_PORT to DPORT (555). Then modify the datagram destination port field to NODE_PORT (570), modify the datagram destination address field to NODE_ADDR (575), and send the datagram to the node using the interconnect. FIG. 5 consists of FIGS. 5A and 5B.]
`
`
`
[Sheet 9 of 13: FIG. 6, the flow chart for routing an outgoing message. Top: wait for an MM message (605); read the destination address DADDR in the MM header (610); test whether DADDR is the cluster address (615); if not, read the source address SADDR in the MM header (620); if DADDR is outside the cluster and SADDR is inside the cluster (625), the message is outbound, so change the source address in the header of the outbound message to the cluster address (630); forward the datagram to its destination (640) and go to top.]
`
`
`
[Sheet 10 of 13: FIG. 7 shows data structures used by the MOUNT routing function 700: an incoming IP/UDP mount request 710 naming a filesystem (e.g. /PROJECTS), with IP header 324 and UDP header 336; a cluster export table 720 mapping filesystem names (e.g. /CONTRACTS, /PROJECTS, /PROPOSAL) to node numbers; and a mount port table 730 mapping each node (1 through 4) to its mount port number 735.]
`
`
`
[Sheet 11 of 13: FIG. 8, the flow chart of the MOUNT routing function. Locate and read the filesystem name FSN in the mount request message (805); search the cluster export table for an entry which matches FSN or contains FSN (810); if no matching entry is found (815), return RETCODE=NOT OK (820); otherwise read the node N from the export table (825), search the mount port table for the entry matching that node and read the mount port number P (830), set the function return variables NODE_ADDR=N and NODE_PORT=P (840), and return RETCODE=OK (850).]
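
Taken together, FIGS. 7 and 8 describe a small table-driven lookup. The sketch below follows the flow chart of FIG. 8 step by step; the export and mount port table contents, the port numbers, and the function name are examples assumed for illustration rather than values fixed by the patent.

```python
# Illustrative sketch of the MOUNT routing function of FIGS. 7 and 8: map the
# filesystem named in a mount request to the node that exports it, then to that
# node's mount port. The table values and port numbers below are assumptions.

CLUSTER_EXPORT_TABLE = {        # filesystem name -> node (FIG. 7, 720)
    "/CONTRACTS": 2,
    "/PROJECTS": 3,
    "/PROPOSAL": 3,
}
MOUNT_PORT_TABLE = {1: 635, 2: 636, 3: 637, 4: 638}   # node -> mount port (730), assumed


def mount_routing_function(mount_request):
    """Return (retcode, node_addr, node_port) for a mount request message."""
    fsn = mount_request["filesystem"]                      # step 805
    node = None
    for exported, owner in CLUSTER_EXPORT_TABLE.items():   # step 810: match or contain FSN
        if fsn == exported or fsn.startswith(exported + "/"):
            node = owner
            break
    if node is None:                                       # steps 815/820
        return "NOT OK", None, None
    port = MOUNT_PORT_TABLE[node]                          # step 830
    return "OK", node, port                                # steps 840/850


print(mount_routing_function({"filesystem": "/PROJECTS/plan.doc"}))  # ('OK', 3, 637)
```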
`
`
`
[Sheet 12 of 13: FIG. 9A, the flow chart of the NFS routing function: locate and read the NFS filehandle FH in the NFS request message (920); locate and read the node id N in the NFS filehandle (935); set the function return variables NODE_ADDR=N and NODE_PORT=2049 (940); return RETCODE=OK. FIG. 9B shows the corresponding data structures: an NFS request 900 with IP header 324, UDP header 336, and an NFS filehandle 915 whose filehandle data carries the node id N.]
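
FIG. 9A reduces to an even simpler rule: the destination node id travels inside the NFS filehandle itself, and the NFS port is always 2049. The sketch below assumes a particular filehandle layout (node id in the first two bytes), which is an illustrative choice, not one specified by the figure.

```python
# Illustrative sketch of the NFS routing function of FIGS. 9A-9B. The filehandle
# layout is assumed: here the node id is the first big-endian 16-bit word of the
# opaque filehandle data.
import struct

NFS_PORT = 2049


def nfs_routing_function(filehandle: bytes):
    """Return (retcode, node_addr, node_port) for an NFS request's filehandle."""
    node_id = struct.unpack("!H", filehandle[:2])[0]   # step 935: read node id N
    return "OK", node_id, NFS_PORT                     # step 940: port is always 2049


fh = struct.pack("!H30x", 3)          # fabricated 32-byte filehandle for node 3
print(nfs_routing_function(fh))       # ('OK', 3, 2049)
```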
`
`
`
[Sheet 13 of 13: FIG. 10 shows data structures used by the TCP connection routing functions, in particular rlogin. A cluster gateway 109 connects remote hosts 130 (e.g. 9.2.43.5 and 9.2.43.8) on network 120 to node 1 (105) and node 2 (106), each running an rlogin service on port 513, over the interconnect 110. The message switch 400 contains a message switch table whose entries pair a port and protocol with a node or routing function (e.g. port 513, TCP, remote login routing function; port 2049, UDP, NFS). The remote login routing function 1020 uses a cluster connection table 1030 whose entries map a connection's source address and source port to a cluster node (e.g. 9.2.43.8 port 1022 to node 2; 9.2.43.5 port 1019 to node 1).]
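
FIG. 10 suggests how a connection-oriented service such as rlogin (TCP port 513) can be spread across nodes: a cluster connection table remembers which node serves each (source address, source port) pair, so that every segment of a connection reaches the same node. The sketch below models that bookkeeping; the round-robin choice of node for new connections is an assumption of this sketch, not a policy stated in the figure.

```python
# Illustrative sketch of a TCP connection routing function in the spirit of
# FIG. 10 (rlogin, port 513). The cluster connection table maps a connection's
# (source address, source port) to the node serving it; round-robin placement
# of new connections is assumed here.
import itertools

RLOGIN_PORT = 513
NODES = ["node1", "node2"]                       # nodes running an rlogin daemon
_next_node = itertools.cycle(NODES)

CLUSTER_CONNECTION_TABLE = {}                    # (s_addr, s_port) -> node


def rlogin_routing_function(msg):
    """Return (node_addr, node_port) for one TCP segment addressed to port 513."""
    key = (msg["saddr"], msg["sport"])
    node = CLUSTER_CONNECTION_TABLE.get(key)
    if node is None:                             # first segment of a new connection
        node = next(_next_node)
        CLUSTER_CONNECTION_TABLE[key] = node     # remember it for later segments
    return node, RLOGIN_PORT


print(rlogin_routing_function({"saddr": "9.2.43.8", "sport": 1022}))  # ('node1', 513)
print(rlogin_routing_function({"saddr": "9.2.43.5", "sport": 1019}))  # ('node2', 513)
print(rlogin_routing_function({"saddr": "9.2.43.8", "sport": 1022}))  # ('node1', 513) again
```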
`
`
`
METHOD AND APPARATUS FOR MAKING A CLUSTER OF COMPUTERS APPEAR AS A SINGLE HOST ON A NETWORK

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to the field of clustering computers. More specifically, the invention relates to a computer cluster which appears to be a single host computer when viewed from outside the cluster, e.g. from a network of computers.

2. Description of the Prior Art

The prior art discloses many ways of increasing computing power. Two ways are improving hardware performance and building tightly coupled multiprocessor systems. Hardware technology improvements have provided an approximately 100% increase in computing power every two years. Tightly coupled systems, i.e., systems with multiple processors that all use a single real main storage and input/output configuration, increase computing power by making several processors available for computation.

However, there are limits to these two approaches. Future increases in hardware performance may not be as dramatic as in the past. Tightly-coupled multiprocessor versions of modern, pipelined and cached processors are difficult to design and implement, particularly as the number of processors in the system increases. Sometimes a new operating system has to be provided to make the tightly-coupled systems operate. In addition, overhead costs of multi-processor systems often reduce the performance of these systems as compared to that of a uniprocessor system.

An alternative way of increasing computer power uses loosely-coupled uniprocessor systems. Loosely-coupled systems typically are independent and complete systems which communicate with one another in some way. Often the loosely-coupled systems are linked together on a network, within a cluster, and/or within a cluster which is on a network. In loosely coupled systems in a cluster, at least one of the systems is connected to the network and performs communication functions between the cluster and the network.

In the prior art and also shown in FIG. 1A, clusters 100 comprise two or more computers (also called nodes or computer nodes 105 through 109) connected together by a communication means 110 in order to exchange information. Nodes (105 through 109) may share common resources and cooperate in doing work. The communication means 110 connecting the computers in the cluster together can be any type of high speed communication link known in the art, including: 1. a network link like a token ring, ethernet, or fiber optic connection or 2. a computer bus like a memory or system bus. A cluster, for our purposes, also includes two or more computers connected together on a network 120.

Often, clusters of computers 100 can be connected by various known communications links 120, i.e., networks, to other computers or clusters. The point at which the cluster is connected to the outside network is called a boundary or cluster boundary 125. The connection 127 at the boundary is bi-directional, i.e., there are incoming and outgoing messages at the boundary. Information which originates from a computer (also called a host or host computer) 130 that is on the network 120 outside the cluster, which then crosses the boundary 127, and which finally enters the cluster 100 destined for one node (called a destination node) within the cluster 100, is called an incoming message. Likewise, a message which originates from a node (called a source node) within the cluster 100 and crosses the boundary 125 destined for a host 130 on the network outside the cluster is called an outgoing message. A message from a source node within the cluster 100 to a destination also within the cluster 100 is called an internal message.

The prior art includes clusters 100 which connect to a network 120 through one of the computer nodes in the cluster. This computer, which connects the cluster to the network at the boundary 125, is called a gateway 109. In loosely-coupled systems, gateways 109 process the incoming and outgoing messages. A gateway 109 directs or routes messages to (or from) the correct node in the cluster. Internal messages do not interact with the gateway as such.

FIG. 1B shows a prior art cluster 100, as shown in FIG. 1A, with the gateway 109 connected to a plurality (of number q) of networks 120. In this configuration, each network 120 has a connection 127 to the gateway 109. A cluster boundary 125 is therefore created where the gateway 109 connects to each network 120.

FIG. 1C goes on to show another embodiment of the prior art. In this embodiment, the cluster 100 has more than one computer node (105 through 109) performing the function of a gateway 109. The plurality of gateways 109, designated as G1 through Gp, each connect to one or more networks 120. In FIG. 1C, gateway G1 connects to a number r of networks 120, gateway G2 connects to a number q of networks 120, and gateway Gp connects to a number s of networks 120. Using this configuration, the prior art nodes within the cluster 100 are able to communicate with a large number of hosts 130 on a large number of different networks 120.

All the prior art known to the inventors uses gateways 109 to enable external hosts to individually communicate with each node (105 through 109) in the cluster 100. In other words, the hosts 130 external to the cluster 100 on the network 120 have to provide information about any node (105 through 109) within the cluster 100 before communication can begin with that node. The hosts 130 external to the cluster also have to provide information about the function running on the node which will be accessed or used during the communication. Since communication with each node (105 through 109) must be done individually between any external host 130 and any node within the cluster 100, the cluster 100 appears as multiple, individual computer nodes to hosts outside the cluster. These prior art clusters do not have an image of a single computer when accessed by outside hosts. Examples of prior art which lacks this single computer image follow.

DUNIX is a restructured UNIX kernel which makes the several computer nodes within a cluster appear as a single machine to other nodes within the cluster. System calls entered by nodes inside the cluster enter an "upper kernel" which runs on each node. At this level there is an explicit call to the "switch" component, functionally a conventional Remote Procedure Call (RPC), which routes the message (on the basis of the referred to object) to the proper node. The RPC calls a program which is compiled and run. The RPC is used to set up the communication links necessary to communicate with a second node in the cluster.
`
`
`
A "lower kernel" running on the second node then processes the message. DUNIX is essentially a method for making computers within the cluster compatible; there is no facility for making the cluster appear as a single computer image from outside the cluster.

Amoeba is another system which provides single computer imaging of the multiple nodes within the cluster only if viewed from within the cluster. To accomplish this, Amoeba runs an entirely new base operating system which has to identify and establish communication links with every node within the cluster. Amoeba cannot provide a single computer image of the cluster to a host computer outside the cluster. Amoeba also has to provide an emulator to communicate with nodes running UNIX operating systems.

Sprite is a system which works in an explicitly distributed environment, i.e., the operating system is aware of every node in the cluster. Sprite provides mechanisms for process migration, i.e., moving a partially completed program from one node to another. To do this, Sprite has to execute RPCs each time a new node is accessed. There is no single computer image of the cluster presented to the network hosts outside these systems.

V is a distributed operating system which is able to communicate only with nodes (and other clusters) which are also running V. UNIX does not run on V.

Other techniques for managing distributed system clusters include LOCUS, TCF, and DCE. These systems require that the operating system know of and establish communication with each individual node in a cluster before files or processes can be accessed. However, once the nodes in the cluster are communicating, processes or files can be accessed from any connected node in a transparent way. Thus, the file or process is accessed as if there were only one computer. These systems provide a single system image only for the file name space and process name space in these systems. In these systems, files and processes can not be accessed by host computers outside the cluster unless the host has established communication with a specific node within the cluster which contains the files and/or processes.

3. Statement of Problems with the Prior Art

Prior art computer clusters fail to appear as one entity to any system on the network communicating with them, i.e., the prior art does not offer the network outside its boundary a single computer image. Because of this, i.e., because computers outside the boundary of the cluster (meaning outside the boundary 125 of any gateway 109 of the cluster 100) have to communicate individually with each computer within the cluster, communications with the cluster can be complicated. For example, computers outside the boundary of the cluster (hosts) have to know the location of and processes running on each computer within the cluster with which they are communicating. The host computers need to have the proper communication protocols and access authorization for each node within the cluster in order to establish communication. If a node within the cluster changes its location, adds or deletes a program, changes communication protocol, or changes access authorization, every host computer external to the cluster for which the change is relevant has to be informed and modified in order to reestablish communication with the altered node within the cluster.

The prior art lack of a single computer image to outside host computers also limits cluster modification and reliability. If hosts try to communicate with a node within the cluster which has been removed, is being maintained, or has failed, the communication will fail. If a new node(s) is added to the cluster, i.e., the cluster is horizontally expanded, the new node will be unavailable to communicate with other host computers outside the cluster without adding the proper access codes, protocols, and other required information to the outside hosts.

Accordingly, there has been a long felt need for a cluster of computers which presents a single computer image, i.e., looks like a single computer, to computers external to the cluster (gateway) boundary. A single computer image cluster would have the capability of adding or deleting computers within the cluster; changing and/or moving processes, operating systems, and data among computers within the cluster; changing the configuration of cluster resources; redistributing tasks among the computers within the cluster; and redirecting communications from a failed cluster node to an operating node, without having to modify or notify any computer outside the cluster. Further, computers outside the cluster would be able to access information or run processes within the cluster without changing the environment where they are operating.

Systems like DUNIX, Amoeba, Sprite, and V provide some degree of a single system image from within the cluster (i.e., within the gateway boundaries 125) by writing new kernels (in the case of Amoeba, a totally new operating system.) This requires extensive system design effort. In addition, all the nodes of the cluster must run the system's modified kernel and communicate with servers inside the system using new software and protocols.

LOCUS, TCF and DCE provide single system images only for computers which are part of their clusters and only with respect to file name spaces and process name spaces. In other aspects, the identities of the individual nodes are visible.

OBJECTIVES

An objective of this invention is an improved method and apparatus for routing messages across the boundary of a cluster of computers to make the cluster of computers on a network appear as a single computer image to host computers on the network outside the cluster.

Also an objective of this invention is an improved method and apparatus for routing messages across the boundary of a cluster of computers to enable outside host computers on a network to use the same software and network protocols to access functions and information within the computer cluster as they would use to access those functions and information on a single remote host.

Also an objective of this invention is an improved method and apparatus for routing messages across the boundary of a cluster of computers so that computer nodes within the cluster can communicate with outside hosts on networks such that, from the viewpoint of the outside host, the communication is with a single remote host, i.e., the cluster, rather than with the individual cluster nodes.

A further objective of this invention is an improved method and apparatus for routing messages across the boundary of a cluster of computers so that work requests from outside the cluster can be evenly distributed among the computer nodes in the cluster.
`
`
`
SUMMARY OF THE INVENTION

This invention, called an encapsulated cluster, is a method and apparatus for routing information that crosses the boundary of a computer cluster. The information is in the form of port type messages. Both incoming and outgoing messages are routed so that the cluster appears as a single computer image to the external host. The encapsulated cluster appears as a single host to hosts on the network which are outside the cluster.

The apparatus comprises two or more computer nodes connected together by a communication link, called an interconnect, to form a cluster. (Note that in one embodiment of the invention, the interconnect can be a network.) One of the computers in the cluster, serving as a gateway, is connected to one or more external computers and/or clusters (hosts) through another communication link called a network. A gateway can be connected to more than one network and more than one node in the cluster can be a gateway. Each gateway connection to a network, i.e., boundary, has an address on the network. Each gateway has a message switch which routes incoming and outgoing messages by changing information on the message header based on running a specific routing function that is selected using port and protocol information in port type messages.

Since all incoming messages are addressed to the gateway, the cluster appears as a single computer to hosts outside the cluster that are sending incoming messages to nodes within the cluster. When processing incoming messages, the gateway first reads a protocol field in the message header and analyzes the message to determine if it is a port type message originating from a location outside the cluster. If the message is of port type, the location of the port number on the message is found. This port number and protocol type is used to search for a match to a port specific routing function in a table residing in memory within the message switch. If a table entry is matched, a routing function associated with the entry is selected and run. The routing function routes the message to the proper computer node within the cluster by altering information on the incoming message so that the message is addressed to the proper node within the cluster.
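
To make the table-driven dispatch concrete, the following is a minimal sketch of such a message switch for incoming port type messages. It is an illustration in the spirit of FIGS. 4 and 5, not the patented implementation, which operates on raw datagram headers inside the gateway; the table contents, field names, cluster address, and the NFS helper are assumptions of this sketch.

```python
# Illustrative sketch only: a user-space model of the gateway's message switch
# for incoming port type messages (compare FIGS. 4 and 5). The names and table
# contents here are hypothetical.

CLUSTER_ADDRESS = "9.2.43.1"   # assumed address of the gateway on the network


def nfs_routing_function(msg):
    # Stand-in for FIG. 9A: route to the node named in the NFS filehandle.
    return msg["filehandle_node"], 2049


# Message switch table: (destination port, protocol) -> node or routing function.
MESSAGE_SWITCH_TABLE = {
    (2049, "udp"): {"function": nfs_routing_function},  # NFS requests
    (21, "tcp"): {"node": "node1"},                     # e.g. send all FTP to node 1
}


def route_incoming(msg):
    """Rewrite an incoming message's destination so it reaches the right node."""
    if msg["daddr"] != CLUSTER_ADDRESS or msg["proto"] not in ("tcp", "udp"):
        return None                                      # not an incoming port type message
    entry = MESSAGE_SWITCH_TABLE.get((msg["dport"], msg["proto"]))
    if entry is None:
        node_addr, node_port = "gateway", msg["dport"]   # no match: handle on the gateway
    elif "function" in entry:
        node_addr, node_port = entry["function"](msg)    # port specific routing function
    else:
        node_addr, node_port = entry["node"], msg["dport"]
    msg["daddr"], msg["dport"] = node_addr, node_port    # the only header fields changed
    return msg                                           # now sent over the interconnect


incoming = {"daddr": CLUSTER_ADDRESS, "proto": "udp", "dport": 2049,
            "filehandle_node": "node3"}
print(route_incoming(incoming))   # destination becomes node3, port 2049
```
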
For outgoing messages, originating from a source node within the cluster, the message switch first recognizes that the message is a port type message that will cross the cluster boundary. The message switch then alters the message so that the source address is the gateway address rather than the address of the source node. In this way, computers external to the cluster perceive the message as coming from the gateway computer on the network rather than the sending node within the cluster.
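
A correspondingly small sketch of the outgoing case (compare FIG. 6): the only change needed is to overwrite the source address of a message leaving the cluster with the cluster address, so external hosts only ever see the gateway. The addresses and node names below are hypothetical.

```python
# Illustrative sketch only: source-address rewriting for outgoing messages
# (compare FIG. 6). Addresses and node names are hypothetical.

CLUSTER_ADDRESS = "9.2.43.1"                    # assumed gateway/cluster address
CLUSTER_NODES = {"node1", "node2", "node3"}     # assumed addresses inside the cluster


def route_outgoing(msg):
    """Make a message leaving the cluster appear to come from the gateway."""
    outbound = msg["daddr"] not in CLUSTER_NODES and msg["saddr"] in CLUSTER_NODES
    if outbound:
        msg["saddr"] = CLUSTER_ADDRESS          # external hosts only ever see the gateway
    return msg                                   # forward the datagram to its destination


print(route_outgoing({"saddr": "node2", "daddr": "9.2.43.8", "sport": 1022}))
```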
`
BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-1C show three embodiments of prior art computer clusters that are attached to external communication links like networks.
FIG. 2 shows an embodiment of the present invention.
FIGS. 3a-3e show the general structure of an incoming and outgoing message and a more specific message structure using the internet communication protocol.
FIG. 4 shows a preferred embodiment of a message switch.
FIG. 5 is a flow chart showing the steps performed by the present invention to route an incoming message.
FIG. 6 is a flow chart showing the steps performed by the present invention to route an outgoing message.
FIG. 7 shows data structures used by a function in the message switch which processes a MOUNT request.
FIG. 8 is a flow chart of the computer program performed by the function in the message switch which processes a MOUNT request.
FIGS. 9a-9b show data structures and a flow chart used by a function in the message switch which processes NFS requests.
FIG. 10 shows data structures used by functions in the message switch which process TCP connection service requests, in particular rlogin.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 2 shows one embodiment of an encapsulated cluster 200, the present invention. The cluster comprises a plurality of computer nodes (105 through 109), one of which is a gateway 109. The nodes are connected together by a high speed interconnect 110, e.g., a network or any other link that is commonly used in the art. The gateway is connected with a bidirectional communication link 127 to a network 120. A boundary 125 is defined at the connection point between the network 120 and the gateway 109. Computers, called hosts 130, connect to the network 120 and can communicate with nodes within the cluster by passing messages through the gateway 109. An incoming message 210 is shown as being sent from a host 130, passing through the cluster boundary 125, a gateway port 230, a gateway message switch 240, a gateway routing function 250, the interconnect 110, and ultimately to the destination, the destination node 107 in the cluster 200. In a similar manner, an outgoing message 220 is shown originating at a source node 105 within the cluster 200; passing through the interconnect 110, gateway message switch 240, gateway port 230, cluster boundary 125, and ultimately to the destination host 130.

Although FIG. 2 represents a single cluster 200 with a single gateway 109, it is readily appreciated that one skilled in the art given this disclosure could produce multiple embodiments using this invention. For example, the cluster 200 might have multiple gateways 109 each connected to one or more networks or single host computers. A single gateway 109 may also have a plurality of network connections each of which being capable of communicating with one or more external hosts or one or more external networks. All these embodiments are within the contemplation of the invention.

The encapsulated cluster 200 connects 127 to a high speed communication link 120, here called a network 120. Host computers 130, also connected to the network 120, communicate with the encapsulated cluster 200, and the nodes (105 through 109) within the cluster, over the network 120. The host computers 130 used in the invention include any general purpose computer or processor that can be connected together by the network 120 in any of the many ways known in the art. The preferred network 120 is token-ring or ethernet. However, this high speed communication link 120 might also include other connections known in the art like computer system buses.
`
`
`
`
A host computer 130 could also be an encapsulated cluster of computers 200, i.e., the present invention, which gives the image of a single computer to the network 120.

Nodes (105 through 109) in the encapsulated cluster 200 can also comprise any general purpose computer or processor. An IBM RISC SYSTEM/6000 was the hardware used in the preferred embodiment and is described in the book SA23-2619, "IBM RISC SYSTEM/6000 Technology." (RISC SYSTEM/6000 is a trademark of the IBM corporation.) These nodes may be independent uniprocessors with their own memory and input/output devices. Nodes can also be conventional multiprocessors whose processors share memory and input/output resources.

Nodes (105 through 109) within the cluster are connected together by a high speed communications link called an interconnect 110. This interconnect includes

... to control the communication links, route messages, and access appropriate host computers on the link.

As shown in FIG. 3A, these protocols can be conceptually viewed as being layered, with each protocol layer making use of the services provided by the layer beneath it. The lowest layer, the Network Interface (302), deals at the hardware level with the transmission of data between hosts on a single network of a particular type. Examples of network types are token-ring and ethernet. The Network Interface layer presents an interface to the layers above it which supports the transmission of data between two hosts on one physical network, without having to deal with the requirements of the specific network hardware being used.

The next higher layer, the Machine-to-Machine (MM) layer 304, provides the capability to communi-