US006167427A

United States Patent [19]
Rabinovich et al.

[11] Patent Number: 6,167,427
[45] Date of Patent: Dec. 26, 2000
`
`[54] REPLICATION SERVICE SYSTEM AND
`METHOD FOR DIRECTING THE
`REPLICATION OF INFORMATION SERVERS
`BASED ON SELECTED PLURALITY OF
`SERVERS LOAD
`
`[75]
`
`Inventors: Irina Rabinovich; Michael
`Rabinovich, both of Gillette, N.J.
`
`[73] Assignee: Lucent Technologies Inc., Murray Hill,
`N.J.
`
`5,787,442
`5,832,219
`5,923,837
`
`7/1998 Hacheri et al. ......................... 707/201
`11/1998 Pettus et al. ....................... 395/200.33
`7/1999 Matias ....................................... 714/28
`
`OTHER PUBLICATIONS
`
`Microsoft Press Computer Dictionary, 3rd. Ed., Microsoft
`Press, Sep. 1997.
`Baumgariner et al., Global load balancing strategy for
`distributed computer system, IEEE, Jul. 1988.
`Dandamudi et. al., A hierarchical load sharing policy for
`distributed systems, IEEE, Sep. 1997.
`
`[21] Appl. No.: 08/979,611
`
`[22] Filed:
`
`Nov. 28, 1997
`
`[51]
`[52]
`
`[58]
`
`Int. Cl.7 ........................... G06F 15/16; G06F 15/167
`U.S. Cl. .......................... 709/201; 709/203; 709/217;
`709/218; 709/219; 707/204; 707/205
`Field of Search
`..................................... 709/203, 201,
`709/202, 316, 217, 218, 219, 221, 223,
`226, 229, 235, 238, 249; 707/204, 205,
`200
`
`[56]
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`4,974,256
`5,539,883
`5,588,147
`5,708,812
`5,745,683
`5,774,668
`
`11/1990 Cyr et al. ................................ 379/113
`7/1996 Allon et al. ............................. 395/675
`12/1996 Neeman et al.
`............................ 707/1
`1/1998 Van Dyke et al.
`..................... 395/712
`4/1998 Lee et al. ... .... ... ... ... ... .... ... ... ... 709 /250
`6/1998 Choquier ............................ 395/200.53
`
`Primary Examiner-Le Hien Luu
`Assistant Examiner-Beatriz Prieto
`Attorney, Agent, or Firm-William Ryan
`
`[57]
`
`ABSTRACT
`
`A system an method for efficiently providing access by a
`large number of clients to objects located at a large number
`of information servers. A non-bottleneck solution is pro(cid:173)
`vided to sharing load among servers by migrating or repli(cid:173)
`cating objects over from highly loaded servers to less highly
`loaded servers. Objects that experience low loading are
`deleted to make room for more highly used objects and to
`permit make space for new objects. A naming service is
`provided to provide rapid access to a replica of a requested
`objects, while avoiding directing access requests to servers
`from which replicas of requested objects have been deleted.
`Hierarchical ordering of replication and naming functions
`permits a variety of particular access methods to be realized.
`
`13 Claims, 5 Drawing Sheets
`
`701
`RECEIVE LOAD INFORMATION
`
`702
`DETERMINE SMAX, SMIN
`
`- LOADMIN,SMIN >d ?
`LOADMAX,SMAX
`
`NO
`
`SEND OFFLOAD TO
`PMAX,SMAX
`
`708
`loadmax,R = loadmin,R=avLoadR
`
`709
`
`YES---~~'--------
`OFFLOAD MSG SENT ?
`NO
`
`711
`
`Find S'max S'min such thot
`loadmax,S'max = maxi=1,n{but i not=Smax,Smin){Loadmax,Si)
`loadmin,S'min = mixi=1,n{but i not = Smin,Smox(loodmin,Si)
`pmox,R = pmox,S'max
`pmin,R = = pmin,S'min
`
`712
`losdmax,R = loadmax,Smax;
`loadmin,R = loadmin,Smin'
`pmax,R = pmax,R = pmox,Smax;
`pmin,R = pmin,Smin
`714
`SEND REPORT TO PARENT
`
`713
`SEND REPORT TO PARENT
`
`1
`
`Petitioner Limelight - LN1006
`
`
`
[Sheet 1 of 5: FIG. 1, an overall view of the network system, showing clients 110-1 through 110-N, replication service 130, and servers 140-1 through 140-M. FIG. 2, the replication service arranged as a hierarchy of replicators, with root replicator 210 and subordinate replicators 220-i and 230-i.]
`
`
`
[Sheet 2 of 5: FIG. 3, an application of the replication service arrangement of FIG. 2 (replicators 210, 220-2). FIG. 4, a naming service arrangement for use in the system of FIG. 1.]
`
`
`
[Sheet 3 of 5: FIG. 5, load-balancing flowchart blocks: 500 start; 501 read load from subordinate servers; 502 more servers?; 503 root server?; 504 messages to high and low servers to perform distribution event; 505 pass low and high loads to higher server. FIG. 6, modified flowchart blocks: 600 start; 601 read load from subordinate servers; 602 more servers?; 603 root server?; 604 messages to subordinate high and low servers to perform dist. event; 605 pass low and high loads to higher server; 606 messages to subordinate high and low hosting sets to perform dist. event.]
`
`
`
[Sheet 4 of 5: FIG. 7, preferred load-balancing flowchart blocks: 701 receive load information; 702 determine Smax, Smin; 703 loadmax,Smax - loadmin,Smin > d?; 704 (yes) send OFFLOAD to pmax,Smax; 705 send additional OFFLOAD msgs; 708 loadmax,R = loadmin,R = avLoadR; 709 OFFLOAD msg sent?; 711 find S'max, S'min such that loadmax,S'max = max over i=1..n (i not Smax, Smin) of loadmax,Si and loadmin,S'min = min over i=1..n (i not Smax, Smin) of loadmin,Si, then pmax,R = pmax,S'max and pmin,R = pmin,S'min; 712 loadmax,R = loadmax,Smax, loadmin,R = loadmin,Smin, pmax,R = pmax,Smax, pmin,R = pmin,Smin; 713 send report to parent; 714 send report to parent; end.]
`
`
`
[Sheet 5 of 5: FIG. 8, replica deletion flowchart blocks: 802 host sends DeleteRepl(x) to redirector; 803 redirector deletes the host from the replica set and responds with RESOLVED_CNT(x_p); 804 RESOLVED_CNT = SATISFIED_CNT?; 805 (no) wait; 806 (yes) p deletes x; 807 end.]
`
`
`
`6,167,427
`
`1
`REPLICATION SERVICE SYSTEM AND
`METHOD FOR DIRECTING THE
`REPLICATION OF INFORMATION SERVERS
`BASED ON SELECTED PLURALITY OF
`SERVERS LOAD
`
`FIELD OF THE INVENTION
`
`5
`
The present invention relates generally to the field of distributed information systems. More particularly, the present invention relates, in one aspect, to selective replication and distribution of data objects and services between and among a plurality of information systems. Still more particularly, aspects of the present invention relate to distributed systems and methods for efficient provisioning of objects and services to clients in large (including global) networks.
`
`BACKGROUND OF THE INVENTION
`
`30
`
`35
`
`40
`
`2
`threshold d, the less-loaded server
`a certain distribution
`would obtain some replicas of objects kept on the higher(cid:173)
`loaded server, thus taking up a portion of its load. Due to the
`randomness of choice, all pairs of servers would be
`involved. An approach similar to this is presented in "Adap(cid:173)
`tive load sharing in homogeneous distributed systems," by T.
`L. Casavant and J. G. Kuhl, IEEE Trans. on Software Eng.,
`vol. 2(14), pp. 141-154, Feb. 1988. The Casavant, et al
`paper also defines a threshold for which nodes with a load
`10 below the threshold are constrained
`to not initiate load
`comparison.
`However, as the number of servers grows, each server
`must initiate load comparison more frequently. Otherwise,
`the average time between load distribution events for any
`given pair of nodes will grow linearly with the number of
`nodes, as will the lag between load changes and detection of
`these changes. Since a server has a limit on how frequently
`it can perform load distribution, this solution is not scalable
`to large systems.
`Another approach is to organize hosting servers in a
`connected graph structure, with neighboring nodes perform(cid:173)
`ing pair-wise load distribution. Since the graph is connected
`the load distribution involves all servers. This technique is
`similar to an algorithm described in "Simulations of three
`adaptive,
`decentralized
`controlled,
`job
`scheduling
`algorithms," by J. A Stankovic, Computer Networks, 8, pp.
`199-217, August 1984. One difference is that in the Stank(cid:173)
`ovic paper, when a node sends its load data to a neighbor, it
`includes its information about all other nodes. These tech(cid:173)
`niques also prove to not be scalable as required for large
`networks.
`Another important consideration in establishing a global
`or other large information system is that of a naming service
`for objects and servers. In general, naming services are used
`to map a logical name of an object into the physical name of
`a replica. The main limiting factors for the name service are
`the number of clients (which determines
`the number of
`requests for name resolution), and the number of objects
`(which determines the size of the name-mapping database).
`Another factor is the number of requests from hosting
`servers for updates of name mappings.
`Name services are well known in the art, including the
`Domain Name Service (DNS) used in today's Internet and
`45 described, e.g., in P. V. Mockapetris, "Domain Names(cid:173)
`Concepts and Facilities," Request for Comments 1034,
`DDN Network
`lnfromation Center, SRI International,
`November, 1987. Also used in Internet naming is CCITT
`(now ITU) Recommendation X.500.
`However, mappings between host name and IP address
`seldom change in the DNS scheme. Rather, DNS is prima(cid:173)
`rily an append-only database that permits the addition of
`new host name/IP address mappings; current DNS imple(cid:173)
`mentations support little or no dynamic mapping of host
`55 name to IP address.
`Using current DNS naming service techniques to map
`logical object names
`to physical replicas, name server
`responses cached by clients become incorrect much more
`quickly. Thus, clients must query the name service much
`more often, greatly increasing
`the burden on the name
`service. Weak-consistency schemes for replicating the map(cid:173)
`ping database such as that described in B. W. Lampson,
`"Designing a global name service," Proc. of ACM Conf on
`Principles of Distributed Systems, ppl-10, 1986 result in
`many
`incorrect
`responses
`to clients. These
`incorrect
`responses result in the use of an incorrect physical name to
`access an object-with
`resulting failure and request renewal.
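By way of illustration only, the following Python sketch paraphrases this greedy pairing; the function and variable names are assumptions introduced here, and movement of replicas is abstracted to a simple transfer of load between the paired servers.

    import random

    def greedy_balance_step(loads, d):
        """One round of the localized 'greedy' prior-art approach: every
        server compares its load with one randomly chosen peer, and if the
        difference exceeds the distribution threshold d, the less-loaded
        server takes over some replicas (modeled here as load) from the
        more-loaded one."""
        servers = list(loads)
        for s in servers:
            peer = random.choice([p for p in servers if p != s])
            hi, lo = (s, peer) if loads[s] >= loads[peer] else (peer, s)
            gap = loads[hi] - loads[lo]
            if gap > d:
                loads[hi] -= gap / 2   # replicas migrate from the loaded server
                loads[lo] += gap / 2   # ... to the lightly loaded one
        return loads

As the sketch suggests, every server must keep repeating this comparison; the next paragraph explains why that does not scale.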
`
However, as the number of servers grows, each server must initiate load comparison more frequently. Otherwise, the average time between load distribution events for any given pair of nodes will grow linearly with the number of nodes, as will the lag between load changes and detection of these changes. Since a server has a limit on how frequently it can perform load distribution, this solution is not scalable to large systems.

Another approach is to organize hosting servers in a connected graph structure, with neighboring nodes performing pair-wise load distribution. Since the graph is connected, the load distribution involves all servers. This technique is similar to an algorithm described in "Simulations of three adaptive, decentralized controlled, job scheduling algorithms," by J. A. Stankovic, Computer Networks, 8, pp. 199-217, August 1984. One difference is that in the Stankovic paper, when a node sends its load data to a neighbor, it includes its information about all other nodes. These techniques also prove to not be scalable as required for large networks.

Another important consideration in establishing a global or other large information system is that of a naming service for objects and servers. In general, naming services are used to map a logical name of an object into the physical name of a replica. The main limiting factors for the name service are the number of clients (which determines the number of requests for name resolution), and the number of objects (which determines the size of the name-mapping database). Another factor is the number of requests from hosting servers for updates of name mappings.

Name services are well known in the art, including the Domain Name Service (DNS) used in today's Internet and described, e.g., in P. V. Mockapetris, "Domain Names - Concepts and Facilities," Request for Comments 1034, DDN Network Information Center, SRI International, November 1987. Also used in Internet naming is CCITT (now ITU) Recommendation X.500.

However, mappings between host name and IP address seldom change in the DNS scheme. Rather, DNS is primarily an append-only database that permits the addition of new host name/IP address mappings; current DNS implementations support little or no dynamic mapping of host name to IP address.

Using current DNS naming service techniques to map logical object names to physical replicas, name server responses cached by clients become incorrect much more quickly. Thus, clients must query the name service much more often, greatly increasing the burden on the name service. Weak-consistency schemes for replicating the mapping database, such as that described in B. W. Lampson, "Designing a global name service," Proc. of ACM Conf. on Principles of Distributed Systems, pp. 1-10, 1986, result in many incorrect responses to clients. These incorrect responses result in the use of an incorrect physical name to access an object, with resulting failure and request renewal.
`
`7
`
`
`
`6,167,427
`
`3
`World-Wide-Web (Web) syntax and semantics for object
`names (URLs) are quite different from those in DNS or
`X.500. Use of DNS or X.500-compliant symbolic names in
`networks like the Web would require extensive changes in
`Web browsers.
`
`5
`
`4
`FIG. 7 is a flowchart for a preferred illustrative load
`balancing arrangement.
`FIG. 8 is a flowchart showing an illustrative method for
`deleting a replica.
`
`SUMMARY OF THE INVENTION
`
`10
`
`Limitations of the prior art are overcome and a technical
`advance is made in accordance with the present invention
`described in illustrative embodiments herein.
`In one illustrative embodiment, a method is described for
`achieving the number and placement of object replicas in a
`network of servers. This result is achieved without the need
`for bottleneck-causing global decisions. Using the present
`inventive methods and resulting network, a server network
`is realized in which creation and deletion of replicas is
`minimized when a steady demand for all objects persists for
`an appropriate period of time. Moreover, when such steady
`demand exists the load can be distributed among servers in
`a substantially equal manner.
`In accordance with an illustrative embodiment, the deci(cid:173)
`sion on the number and placement of replicas is made within
`the network. Moreover, the process is dynamic, with replicas
`being created and deleted as demand and geographic origins
`of requests change. The illustrative replication is advanta(cid:173)
`geously transparent to end-user clients, except in improved
`network response.
`illustrative
`In accordance with an aspect of the
`embodiment, no high demand bottlenecks arise which might 30
`require an increase of processing power at any node. Rather,
`as load increases, the number of nodes can be increased to
`handle the increased demand. This result obtains whether the
`high demand is based on the number of clients requesting
`objects increases, or whether the demand arises from an 35
`increase in the number of objects.
`In accordance with one aspect of the present invention, a
`new naming service is introduced for finding object replicas.
`Advantageously,
`the existing DNS name services can be
`used as a first level of indirection in the new naming service.
`In other particular contexts, such as when the character(cid:173)
`istics of languages for specifying object content or when
`certain network protocols are used, the illustrative inventive
`techniques, or their application, may be modified as required
`to achieve desired results.
`
BRIEF DESCRIPTION OF THE DRAWING

The above-summarized description of illustrative embodiments of the present invention will be more fully understood upon a consideration of the following detailed description and the attached drawing, wherein:

FIG. 1 is an overall view of an illustrative network system embodiment of the present invention.

FIG. 2 shows a replication service arrangement comprising a hierarchy of replicators for use in the network of FIG. 1.

FIG. 3 shows an application of the replication service arrangement of FIG. 2.

FIG. 4 shows a naming service arrangement for use in the system of FIG. 1.

FIG. 5 is a flowchart illustrating a load balancing system and method in accordance with one embodiment of the present invention.

FIG. 6 is a flowchart for a modified version of the system and method of FIG. 5.

FIG. 7 is a flowchart for a preferred illustrative load balancing arrangement.

FIG. 8 is a flowchart showing an illustrative method for deleting a replica.
DETAILED DESCRIPTION

Illustrative System Overview

FIG. 1 shows an illustrative application of the present invention. There, a wide area network 100 comprises a large number of servers 140-i, i=1, 2, ..., M, and clients 110-j, j=1, 2, ..., N. Servers 140-i host objects publicly accessible to clients 110-j. A node may play a dual role as a server for some objects and a client for other objects.

Clients access objects by their names. The name of an object allows a client to infer the identity of the hosting server (e.g., its domain name or IP address) and the access method (i.e., the protocol to be used to access the server). For illustrative purposes, it will be assumed that object names embed a domain name of the server that hosts the object. One instance of this environment is the Internet, which comprises, inter alia, Web servers and Web browsers. Such servers typically include computers and related equipment for controlling (often very large) databases, while browsers are typically associated with client terminals or (usually smaller) computers, all as is well known in the art.

The network of FIG. 1 is a network of information systems, where objects provide access to information, and thus do not change as a result of an access by a client. However, an access may result in extensive processing by the server to compute the result of the client's query. For example, in a geographical database, a request for the map of a specified area in a specified scale may require generation of the map, which typically is an expensive operation in terms of server resources.

A load measure exists for each server 140-i in FIG. 1 to allow comparison of load on each server. For example, when the servers operate using the UNIX operating system, the length of the input queue (as measured, e.g., by the output of the uptime command) proves to be convenient for this purpose. In other particular environments, other measures of load may be preferable.

In addition, each individual server can estimate the fraction of its total load due to a given object on the server. In typical operation this is accomplished by monitoring resource consumption (e.g., CPU time, I/O operations, etc.) due to requests for individual objects and dividing up the total load between objects in proportion to their consumption. While the illustrative servers 140 in FIG. 1 may be quite different in total system resources, a server with an average queue length of 1.5 is more heavily loaded than a server with an average queue length of 0.8, regardless of the quantity of processing needed at each server to achieve these queue lengths.
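As a rough illustration of this proportional accounting, the following sketch apportions a server's measured load among its objects; the function and input names are assumptions made here for illustration, not part of the patent text.

    def per_object_load(total_load, resource_use):
        """Apportion a server's measured total load among its objects in
        proportion to the resources (CPU time, I/O operations, etc.) consumed
        on behalf of each object; resource_use maps object name -> consumption."""
        total_use = sum(resource_use.values())
        if total_use == 0:
            return {obj: 0.0 for obj in resource_use}
        return {obj: total_load * use / total_use
                for obj, use in resource_use.items()}

    # Example: with total_load = 1.5 (average run-queue length) and
    # resource_use = {"x": 300, "y": 100}, object x is charged 1.125
    # and object y is charged 0.375.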
In achieving purposes of the present invention, the illustrative embodiment of FIG. 1 provides for replicating an object located on a particular server, say server p, on another server, server q. Alternatively, an object may be migrated from server p to server q. Here x_p denotes a replica of object x on server p, load(p) denotes the load of node p, and load(x_p) denotes the load on node p due to object x. In general, if x is migrated from node p to node q, the reduction of load on p may not be equal to the increase of load on q, due to differences in the processing power of p and q.

Typically, there are two ways by which a system can balance the load: directing client requests to less loaded servers (among those with replicas of the requested object), and migrating or replicating objects between servers.
`
`
`
`6,167,427
`
`5
`selection for a particular client request is typically based on
`the geographic origin of the request (i.e., the closest replica
`is chosen), but the selection may be based on another
`criterion, or at random. Thus, the load, is illustratively
`balanced by replication or migration of objects. The event of
`migrating an object from server p to server q, or creating a
`new replica of an object on q by copying it from p, is called
`a distribution event. Servers p and q are, respectively, the
`source and the recipient in the distribution event.
It is typically desired that when, in a network system like that of FIG. 1, the demand for all objects does not change, the system stabilizes into a state in which the loads are distributed equally among all hosting servers. Thus, when the demand for different objects changes and stabilizes at different levels, the system will eventually re-distribute the load so that it is again distributed equally among hosts. However, small changes in the demand for objects should not trigger load re-distribution; otherwise, the system will hardly ever be stable.
It proves convenient to consider a mechanism for load balancing to be stabilizing if there exist two constants, demandDiff and loadDiff, such that if the variation in time of the request rate for every object x stays within demandDiff, the system eventually reaches a state where no replicas are created or dropped, and the difference between the load of any two hosting servers does not exceed loadDiff.
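The definition can be paraphrased as a predicate over an observed period; in the sketch below the inputs (per-object request-rate histories, final server loads, and a log of replica creations or deletions) are assumptions chosen only to illustrate the roles of demandDiff and loadDiff.

    def satisfies_stabilizing_outcome(request_rate_history, server_loads,
                                      replica_events, demand_diff, load_diff):
        """If every object's request rate varied by no more than demand_diff
        over the period, then no replicas should have been created or dropped
        and any two hosting servers' loads should differ by at most load_diff."""
        demand_steady = all(max(rates) - min(rates) <= demand_diff
                            for rates in request_rate_history.values())
        if not demand_steady:
            return True   # the definition only constrains steady-demand periods
        loads_close = max(server_loads) - min(server_loads) <= load_diff
        return not replica_events and loads_close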
Likewise, it proves advantageous to avoid load balancing of a type in which isolated regions are created with autonomous load distribution in each region. Thus, a load balancing mechanism desirably avoids a condition in which nodes from different regions have significantly different load, even if individual regions are not in the stable state. More formally, for some L and l < L, all nodes are partitioned into three sets, A(L,l) = {p | load(p) >= L}, B(L,l) = {p | l <= load(p) < L}, and C(L,l) = {p | load(p) < l}. A mechanism for load balancing is called contiguous if there exist some constants d and t such that for any L and l < L - d, if no node moves between sets A(L,l), B(L,l) and C(L,l) for time t, then there will be a distribution event either with a source in A and recipient outside A, or with a source outside C and recipient inside C.
The contiguity criterion may be illustrated by a system with four servers i, j, k, and l. A load balancing mechanism can be derived that balances the load between nodes in node pair i and j, and separately between nodes in node pair k and l, with no balancing performed between the node pairs. The demand for objects hosted by the first pair (i, j) will be assumed to be very high, but unstable (so that there are continuous distribution events occurring between i and j). Also, it will be assumed that there is unstable but overall low demand for objects on k and l. Thus the load on i and j greatly exceeds the load on k and l, while fluctuating within each pair. This example shows the importance of both the contiguity criterion as well as the stabilization criterion.
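A minimal sketch of the three-set partition used by the contiguity criterion, with loads echoing the four-server example above; the helper and its names are illustrative assumptions, not part of the patent's protocol.

    def partition_by_load(loads, L, l):
        """Split servers into the sets A, B, C of the contiguity criterion:
        A holds servers with load >= L, C those with load < l, and B the
        rest (l <= load < L)."""
        A = {p for p, v in loads.items() if v >= L}
        C = {p for p, v in loads.items() if v < l}
        B = set(loads) - A - C
        return A, B, C

    # With loads = {"i": 90, "j": 85, "k": 10, "l": 12}, L = 50 and l = 20,
    # servers i and j fall in A while k and l fall in C; a contiguous
    # mechanism must eventually produce a distribution event with a source
    # in A and a recipient outside A, or a source outside C and a recipient
    # inside C.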
Returning to the system of FIG. 1, when an object is created, it is placed on one of the hosting servers and is registered with a name service 120. Registering of a related type is known in a so-called "persistent URL" proposal for use in the World Wide Web (hereinafter "Web") operation in the Internet, but such registering of persistent URLs is not part of the present invention. This registration in accordance with the illustrative embodiment involves sending a message to name service 120 informing that service of the physical name of the object and assigning it a symbolic name. The physical name of an object is the name by which the object can be accessed by the clients (in the Web context, this would be the object's URL). The symbolic name uniquely identifies the object. However, it resolves into the name service identity rather than the identity of the server hosting the object.

The symbolic name is advertised to the users; it is the name known to the clients. When a client wants to access the object, it uses its symbolic name. Since this name resolves into the name service identity, the request for the object actually arrives at the name service 120, as shown by the link between typical client 110-1 and name service 120. Name server 120 finds a corresponding physical name (due to replication, a symbolic name can map into multiple physical names) and sends it as a special "re-direct" response to the client. The client, such as 110-1 in FIG. 1, then uses the physical name received to access the object at the identified server, such as 140-1 in FIG. 1.

Asynchronously, hosting servers periodically report their load to the replication service unit 130. The replication service 130 uses these reports to migrate or replicate objects so that the load is shared among all hosting servers, and replicas are placed close to a majority of requests for the corresponding objects.

When a hosting server 140-j creates or drops a replica of an object, it records this fact at the name service 120. Thus, the name service keeps a mapping of the symbolic name of an object to a set of physical names of currently available replicas. When resolving a symbolic name, the name service chooses one of the current set of physical names, applying a fair procedure and taking into account the geographical location of the requesting client.
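A toy sketch of the registration and re-direct resolution just described, assuming an in-memory map from symbolic names to physical names; the choice function standing in for the fair, geography-aware selection is a placeholder, and none of the names below come from the patent itself.

    name_map = {}   # symbolic name -> set of physical names (replica URLs)

    def register_replica(symbolic, physical):
        """A hosting server records a newly created replica with the name service."""
        name_map.setdefault(symbolic, set()).add(physical)

    def drop_replica(symbolic, physical):
        """A hosting server records that it dropped a replica."""
        name_map.get(symbolic, set()).discard(physical)

    def resolve(symbolic, prefer=None):
        """Return one physical name, to be sent to the client as a
        're-direct' response.  `prefer` stands in for the fair,
        geography-aware choice the text describes; by default an
        arbitrary replica is returned."""
        replicas = name_map[symbolic]
        if prefer is not None:
            return prefer(replicas)
        return next(iter(replicas))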
The Replication Service

One potential bottleneck in the system of FIG. 1 is the replication service. Indeed, as the number of hosting servers and hosted objects increases, the number of load reports the replication service must process grows, and so does the search space for deciding on replica placement. This section describes a scalable distributed architecture for the replication service 130. Scalability is of interest to the present discussion because important target networks are assumed to be large.

The following description of a policy for replica placement is based on the objective of load distribution. A challenge in designing such a policy is that the policy must ensure globally desirable system behavior (according to criteria such as the stabilizing and contiguity criteria discussed above) without creating any bottleneck-producing global decision points. Those skilled in the art will develop other detailed policies and heuristics within the scope of the presently described architecture and infrastructure.

An illustrative structure for the replication service 130 of FIG. 1 is shown in FIG. 2. There, it will be seen that replication service 130 comprises a hierarchy of replicator servers. Each replicator server in FIG. 2 comprises standard network computer functionality, including a central processing unit, main and auxiliary memory and input/output facilities, all arranged to receive information from information servers and other replicators, to perform, inter alia, analyses of load information, and to generate control messages relating to these analyses and to creation and deletion of object replicas. The highest level replicator is at node 210, and replicators 220-j appear at nodes directly below node 210. Only two levels are shown explicitly in the hierarchy of FIG. 2, but other levels of replicators will appear as required for particular networks at respective nodes at other levels in the hierarchy. For very large networks, such nodes may extend over many levels in a hierarchy. Direct descendants of a node in the hierarchy of replicators in FIG. 2 are referred to as subordinates.
`
`
`
`6,167,427
`
`7
`8
`In accordance with this illustrative preferred embodiment,
`the hosting set H(S) be a set of hosting servers in the subtree
`rooted at S. In particular, for a hosting server p, H(p )={p}.
`a protocol is employed which uses three parameters. One of
`The hierarchical structure of FIG. 2, can be used to imple(cid:173)
`these parameters is a system-wide parameter, the distribu(cid:173)
`ment a variety of load re-distribution methods, as will now
`tion threshold, d, which reflects the load difference between
`be described in more detail.
`5 servers required
`to trigger distribution events. Another
`A first solution based on the structure of FIG. 2 is to use
`parameter,
`the deletion
`threshold, u, is host-specific. A
`the replicator hierarchy only to find the hosting servers with
`hosting server p deletes replicas ( except for the last replica)
`the highest and lowest load. In this solution, hosting servers
`of an object such that load(xp)<ur As discussed above, the
`report their load to the lowest level replicators. These lowest
`same rate of requests for the same object may result in
`level replicators then choose the servers with the highest and
`10 different values of the object load at different servers.
`lowest load and pass this information up to their parents. The
`However, up is adjusted for every server such that if object
`parents then choose hosting servers with the highest and
`x qualifies for deletion on one server under certain request
`lowest load, among those reported by their subordinate
`rate, it would qualify for deletion on all servers under about
`replicators. Thus, each replicator reports up the information
`the same rate. The third parameter employed by this illus-
`about the hosting servers with highest and lowest load in its
`15 trative protocol
`is a system-wide stability factor s that
`hosting set. This process continues until it reaches the root,
`reflects the level of variation in server load that should not
`which will identify the hosting servers with the highest and
`cause distribution events. The three parameters are con(cid:173)
`lowest load in the system and re-distribute the load between
`strained by a condition 2*umax+s<d, where umax is the
`them. Since non-root replicators work in parallel, the whole
`maximum deletion threshold across all hosting servers. Each
`process takes place in logarithmic time.
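This bottom-up search for the extremes can be sketched as follows; the Node type, its field names and the recursive traversal are illustrative assumptions (in the system described, the information travels as load reports between servers rather than by recursion).

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Node:
        name: str
        load: float = 0.0                       # meaningful for hosting servers (leaves)
        children: List["Node"] = field(default_factory=list)

    def report_extremes(node: Node) -> Tuple[Node, Node]:
        """Return the hosting servers with the highest and lowest load in the
        subtree rooted at `node`; this is the information each replicator
        passes up to its parent in the first solution (FIG. 5)."""
        if not node.children:                   # leaf = hosting server
            return node, node
        reports = [report_extremes(c) for c in node.children]   # from subordinates
        highest = max((hi for hi, _ in reports), key=lambda n: n.load)
        lowest = min((lo for _, lo in reports), key=lambda n: n.load)
        return highest, lowest

    # At the root, a distribution event is triggered between the two extremes
    # when their load difference exceeds the distribution threshold d.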
The flowchart of FIG. 5 illustrates this process. Starting at block 500 in the flowchart, high and low load information is read from subordinate nodes as indicated by process block 501. This is continued at block 501 until the decision at block 502 indicates that all immediately subordinate nodes have input their information. A further test is made at decision block 503 to determine if the information has been input to the root node from its subordinate nodes. If the node processing inputs is not the root node, the high and low load information is passed to the next higher node at block 505. This process continues until the root server node is the node receiving the inputs. Then, messages are sent to the high and low load nodes as described above and indicated by process block 504 in FIG. 5 to cause a distribution event to occur. This solution can be shown to satisfy the stability and contiguity criteria described above.
Certain inefficiencies can be identified in the process shown in FIG. 5 and described above that may be important in some network applications. First, the work of lower level replicators whose hosting sets do not contain nodes with the globally highest and lowest load is wasted, even if nodes with the load difference over the threshold were found. Second, the load between only two (or, more gener

In accordance with this illustrative preferred embodiment, a protocol is employed which uses three parameters. One of these parameters is a system-wide parameter, the distribution threshold, d, which reflects the load difference between servers required to trigger distribution events. Another parameter, the deletion threshold, u, is host-specific. A hosting server p deletes replicas (except for the last replica) of an object x such that load(x_p) < u_p. As discussed above, the same rate of requests for the same object may result in different values of the object load at different servers. However, u_p is adjusted for every server such that if object x qualifies for deletion on one server under a certain request rate, it would qualify for deletion on all servers under about the same rate. The third parameter employed by this illustrative protocol is a system-wide stability factor s that reflects the level of variation in server load that should not cause distribution events. The three parameters are constrained by the condition 2*u_max + s < d, where u_max is the maximum deletion threshold across all hosting servers. Each replicator is advantageously arranged to have at least three subordinates.
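A small helper can make the constraint on the three parameters concrete; the values in the example are arbitrary and purely illustrative.

    def check_thresholds(d, s, deletion_thresholds):
        """Verify the constraint 2*u_max + s < d relating the distribution
        threshold d, the stability factor s, and the per-host deletion
        thresholds u."""
        u_max = max(deletion_thresholds.values())
        return 2 * u_max + s < d

    # Example: d = 10, s = 2, deletion_thresholds = {"p": 1.5, "q": 2.0}
    # gives 2*2.0 + 2 = 6 < 10, so the constraint is satisfied.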
In accordance with the illustrative protocol, a hosting server p periodically examines all of its objects, attempting to delete object replicas whose load is below u_p. It does not, however, delete the sole replica of an object in the system. It then sends a load report to its parent replicator. Every replicator collects load reports from its subordinates and sends the load report to the higher-level replicator.

The load report from node S (either a hosting server or a replicator) has the form (avLoad_S, H(S), pmax,S, pmin,S, loadmax,S, loadmin,S), where avLoad_S is the average load of the hosting servers in H(S). Thus, in the case of a hosting server, the average load is the actual load. H(S) includes information about the number of these servers; pmax,S and pmin,S are the identities of hosting servers chosen among H(S). If S is a leaf, both pmin,S and pmax,S are the identity of S itself. When S is a leaf, i.e., a hosting server, loadmax,S and loadmin,S are the same and equal to the actual load of S. For a non-leaf server S, pmax,S, pmin,S, loadmax,S and loadmin,S are calculated in the protocol based on reports from the subordinates of S.
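One possible rendering of this load report as a data structure is sketched below; the field names are assumptions made here, and only the leaf case is constructed, since the patent's full DistributeLoad listing for non-leaf nodes is not reproduced in this excerpt.

    from dataclasses import dataclass

    @dataclass
    class LoadReport:
        """Load report sent by node S to its parent replicator."""
        av_load: float      # avLoad_S: average load of hosting servers in H(S)
        host_count: int     # size information about H(S)
        p_max: str          # identity of a most-loaded hosting server in H(S)
        p_min: str          # identity of a least-loaded hosting server in H(S)
        load_max: float     # loadmax,S
        load_min: float     # loadmin,S

    def leaf_report(server_id: str, load: float) -> LoadReport:
        """For a hosting server (leaf), p_max = p_min = the server itself and
        loadmax,S = loadmin,S = its actual load."""
        return LoadReport(load, 1, server_id, server_id, load, load)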
Upon collecting reports from all subordinates, a replicator R executes the DistributeLoad protocol appearing in Listing