`
with four sites and full replication. Suppose that transactions T1 and T2
wish to lock data item Q in exclusive mode. Transaction T1 may
succeed in locking Q at sites S1 and S3, while transaction T2 may
succeed in locking Q at sites S2 and S4. Each then must wait to acquire
the third lock, and hence a deadlock has occurred.
`
`18.4.1.4 Biased Protocol
`The biased protocol is based on a model similar to that of the majority
`protocol. The difference is that requests for shared locks are given more
`favorable treatment than are requests for exclusive locks. The system
`maintains a lock manager at each site. Each manager manages the locks for
`all the data items stored at that site. Shared and exclusive locks are handled
`differently.
`
`• Shared locks: When a transaction needs to lock data item Q, it simply
`requests a lock on Q from the lock manager at one site containing a
`replica of Q.
`• Exclusive locks: When a transaction needs to lock data item Q, it
`requests a lock on Q from the lock manager at all sites containing a
`replica of Q.
`
`As before, the response to the request is delayed until the request can be
`granted.
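The two rules of the biased protocol can be sketched as follows. This is an illustration, not code from the text; `sites_with_replica` and `request_lock` are assumed helper names.

```python
def acquire_biased(item, mode, sites_with_replica, request_lock):
    """Request locks on `item` per the biased protocol; returns granting sites."""
    if mode == "shared":
        # Shared lock: a single site holding a replica suffices.
        targets = [sites_with_replica[item][0]]
    else:
        # Exclusive lock: every site holding a replica must grant the lock.
        targets = list(sites_with_replica[item])
    return [request_lock(site, item, mode) for site in targets]
```

The asymmetry is visible directly: a shared request touches one site, an exclusive request touches all replicas, which is why the protocol favors read-heavy workloads.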
`The scheme has the advantage of imposing less overhead on read
`operations than does the majority protocol. This advantage is especially
`significant in common cases in which the frequency of reads is much
`greater than is the frequency of writes. However, the additional overhead
`on writes is a disadvantage. Furthermore, the biased protocol shares the
`majority protocol's disadvantage of complexity in handling deadlock.
`
18.4.1.5 Primary Copy
In the case of data replication, we may choose one of the replicas as the
primary copy. Thus, for each data item Q, the primary copy of Q must
reside in precisely one site, which we call the primary site of Q.
When a transaction needs to lock a data item Q, it requests a lock at
the primary site of Q. As before, the response to the request is delayed
until the request can be granted.
Thus, the primary copy enables concurrency control for replicated data
to be handled in a manner similar to that for unreplicated data. This
method of handling allows for a simple implementation. However, if the
primary site of Q fails, Q is inaccessible even though other sites containing
a replica may be accessible.
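A minimal sketch of the primary-copy request path, with the failure case made explicit; `primary_site`, `request_lock`, and `site_up` are hypothetical helpers, not part of any described system.

```python
def acquire_primary_copy(item, mode, primary_site, request_lock, site_up):
    """All lock requests for `item`, shared or exclusive, go to its primary site."""
    site = primary_site[item]
    if not site_up(site):
        # The weakness noted above: if the primary site is down, the item is
        # inaccessible even though other replicas may still be reachable.
        raise RuntimeError(f"primary site {site} of {item} unavailable")
    return request_lock(site, item, mode)
```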
`
`
`18.4.2 Timestamping
The principal idea behind the timestamping scheme of Section 6.9 is that
each transaction is given a unique timestamp that is used in deciding the
serialization order. Our first task, then, in extending the centralized
scheme to a distributed scheme is to develop a scheme for generating
unique timestamps. Once this scheme has been developed, our previous
protocols can be applied directly to the nonreplicated environment.
`
`18.4.2.1 Generation of Unique Timestamps
There are two primary methods for generating unique timestamps: one
centralized and one distributed. In the centralized scheme, a single site is
chosen for distributing the timestamps. The site can use a logical counter
or its own local clock for this purpose.
In the distributed scheme, each site generates a unique local timestamp
using either a logical counter or the local clock. The global unique
timestamp is obtained by concatenation of the unique local timestamp with
the site identifier, which must be unique (Figure 18.2). The order of
concatenation is important! We use the site identifier in the least significant
position to ensure that the global timestamps generated in one site are not
always greater than those generated in another site. Compare this
technique for generating unique timestamps with the one we presented in
Section 18.1.2 for generating unique names.
We may still have a problem if one site generates local timestamps at a
faster rate than do other sites. In such a case, the fast site's logical counter
will be larger than that of other sites. Therefore, all timestamps generated
by the fast site will be larger than those generated by other sites. What is
needed is a mechanism to ensure that local timestamps are generated fairly
across the system. To accomplish the fair generation of timestamps, we
define within each site Si a logical clock (LCi), which generates the unique
local timestamp. The logical clock can be implemented as a counter that is
`
Figure 18.2 Generation of unique timestamps (the local unique timestamp
is concatenated with the site identifier to form the global unique identifier).
`
`
`
`586 • Chapter 18: Distributed Coordination
`
incremented after a new local timestamp is generated. To ensure that the
`various logical clocks are synchronized, we require that a site Si advance its
`logical clock whenever a transaction Ti with timestamp <x,y> visits that site
`and x is greater than the current value of LCi. In this case, site Si advances
`its logical clock to the value x + 1.
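The generation and advancement rules above can be sketched as follows, representing a global timestamp as a (local timestamp, site identifier) pair so that ordinary tuple comparison matches the ordering of Figure 18.2. The class is an illustration, not code from the text.

```python
class TimestampGenerator:
    """Distributed unique timestamps: (LCi, site_id), with the site
    identifier in the least significant position (Figure 18.2)."""

    def __init__(self, site_id):
        self.site_id = site_id
        self.lc = 0                      # logical clock LCi

    def next_timestamp(self):
        ts = (self.lc, self.site_id)     # local timestamp + site identifier
        self.lc += 1                     # incremented after each timestamp
        return ts

    def observe(self, ts):
        """Advance LCi to x + 1 when a visiting transaction carries
        timestamp <x, y> with x greater than the current LCi."""
        x, _site = ts
        if x > self.lc:
            self.lc = x + 1
```

Because the logical clock is the most significant component, a slow site that observes a fast site's timestamp immediately catches up, which is exactly the fairness mechanism described above.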
`If the system clock is used to generate timestamps, then timestamps
`are assigned fairly provided that no site has a system clock that runs fast
`or slow. Since clocks may not be perfectly accurate, a technique similar to
`that used for logical clocks must be used to ensure that no clock gets far
`ahead or far behind another clock.
`
`18.4.2.2 Timestamp-Ordering Scheme
The basic timestamp scheme introduced in Section 6.9 can be extended in a
straightforward manner to a distributed system. As in the centralized case,
cascading rollbacks may result if no mechanism is used to prevent a
transaction from reading a data item value that is not yet committed. To
eliminate cascading rollbacks, we can combine the basic timestamp scheme
of Section 6.9 with the 2PC protocol of Section 18.3 to obtain a protocol that
ensures serializability with no cascading rollbacks. We leave the
development of such an algorithm to you.
The basic timestamp scheme just described suffers from the
undesirable property that conflicts between transactions are resolved
through rollbacks, rather than through waits. To alleviate this problem, we
can buffer the various read and write operations (that is, delay them) until
a time when we are assured that these operations can take place without
causing aborts. A read(x) operation by Ti must be delayed if there exists a
transaction Tj that will perform a write(x) operation but has not yet done
so, and TS(Tj) < TS(Ti). Similarly, a write(x) operation by Ti must be
delayed if there exists a transaction Tj that will perform either a read(x) or
write(x) operation and TS(Tj) < TS(Ti). There are various methods for
ensuring this property. One such method, called the conservative
timestamp-ordering scheme, requires each site to maintain a read and write
queue consisting of all the read and write requests, respectively, that are to
be executed at the site and that must be delayed to preserve the above
property. We shall not present the scheme here. Rather, we leave the
development of the algorithm to you.
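The two delay conditions can be expressed directly as predicates. This is a sketch of the tests only, not of the full conservative scheme with its read and write queues; the argument names are assumptions.

```python
def must_delay_read(ts_i, pending_write_ts):
    """Delay read(x) by Ti while an older Tj still has a pending write(x)."""
    return any(ts_j < ts_i for ts_j in pending_write_ts)

def must_delay_write(ts_i, pending_read_ts, pending_write_ts):
    """Delay write(x) by Ti while an older Tj has a pending read(x)
    or write(x)."""
    pending = list(pending_read_ts) + list(pending_write_ts)
    return any(ts_j < ts_i for ts_j in pending)
```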
`
`18.5 • Deadlock Handling
`
`The deadlock-prevention, deadlock-avoidance, and deadlock-detection
`algorithms presented in Chapter 7 can be extended so that they can also be
`used in a distributed system. In the following, we describe several of
`these distributed algorithms.
`
`
`
`
`18.5.1 Deadlock Prevention
`The deadlock-prevention and deadlock-avoidance algorithms presented in
`Chapter 7 can also be used in a distributed system, provided that
`appropriate modifications are made. For example, we can use the
`resource-ordering deadlock-prevention technique by simply defining a
`global ordering among the system resources. That is, all resources in the
`entire system are assigned unique numbers, and a process may request a
`resource (at any processor) with unique number i only if it is not holding a
`resource with a unique number greater than i. Similarly, we can use the
`banker's algorithm in a distributed system by designating one of the
`processes in the system (the banker) as the process that maintains the
`information necessary to carry out the banker's algorithm. Every resource
`request must be channeled through the banker.
`These two schemes can be used in dealing with the deadlock problem
`in a distributed environment. The first scheme is simple to implement and
`requires little overhead. The second scheme can also be implemented
`easily, but it may require too much overhead. The banker may become a
`bottleneck, since the number of messages to and from the banker may be
`large. Thus, the banker's scheme does not seem to be of practical use in a
`distributed system.
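The resource-ordering rule above reduces to a simple check; a sketch, with resource numbers assumed to be globally unique across all sites.

```python
def may_request(held_resource_numbers, requested_number):
    """A process may request resource i only if it holds no resource
    with a unique number greater than i (global ordering)."""
    return all(n <= requested_number for n in held_resource_numbers)
```

A process holding resources 1 and 2 may still request resource 5, but a process holding resource 7 may not; forcing all acquisitions upward in the global order makes a cycle in the wait-for graph impossible.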
`In this section, we present a new deadlock-prevention scheme that is
`based on a timestamp-ordering approach with resource preemption. For
`simplicity, we consider only the case of a single instance of each resource
`type.
`To control the preemption, we assign a unique priority number to each
`process. These numbers are used to decide whether a process Pi should
wait for a process Pj. For example, we can let Pi wait for Pj if Pi has a
priority higher than that of Pj; otherwise Pi is rolled back. This scheme
prevents deadlocks because, for every edge Pi → Pj in the wait-for graph,
Pi has a higher priority than Pj. Thus, a cycle cannot exist.
`One difficulty with this scheme is the possibility of starvation. Some
processes with extremely low priority may always be rolled back. This
`difficulty can be avoided through the use of timestamps. Each process in
`the system is assigned a unique timestamp when it is created. Two
`complementary deadlock-prevention schemes using timestamps have been
`proposed:
`
• The wait-die scheme: This approach is based on a nonpreemptive
technique. When process Pi requests a resource currently held by Pj, Pi
is allowed to wait only if it has a smaller timestamp than does Pj (that
is, Pi is older than Pj). Otherwise, Pi is rolled back (dies). For example,
suppose that processes P1, P2, and P3 have timestamps 5, 10, and 15,
respectively. If P1 requests a resource held by P2, P1 will wait. If P3
requests a resource held by P2, P3 will be rolled back.
`
`
`
`
`
• The wound-wait scheme: This approach is based on a preemptive
technique and is a counterpart to the wait-die scheme. When process
Pi requests a resource currently held by Pj, Pi is allowed to wait only if
it has a larger timestamp than does Pj (that is, Pi is younger than Pj).
Otherwise, Pj is rolled back (Pj is wounded by Pi). Returning to our
previous example, with processes P1, P2, and P3, if P1 requests a
resource held by P2, then the resource will be preempted from P2 and
P2 will be rolled back. If P3 requests a resource held by P2, then P3 will
wait.
`
`Both schemes can avoid starvation, provided that, when a process is
`rolled back, it is not assigned a new timestamp. Since timestamps always
increase, a process that is rolled back will eventually have the smallest
`timestamp. Thus, it will not be rolled back again. There are, however,
`significant differences in the way the two schemes operate.
`
`• In the wait-die scheme, an older process must wait for a younger one
`to release its resource. Thus, the older the process gets, the more it
tends to wait. By contrast, in the wound-wait scheme, an older
`process never waits for a younger process.
`• In the wait-die scheme, if a process Pi dies and is rolled back because
`it requested a resource held by process Pj, then Pi may reissue the
`same sequence of requests when it is restarted. If the resource is still
`held by Pj, then Pi will die again. Thus, Pi may die several times before
`acquiring the needed resource. Contrast this series of events with what
`happens in the wound-wait scheme. Process Pi is wounded and rolled
`back because Pj requested a resource it holds. When Pi is restarted and
`requests the resource now being held by Pj, Pi waits. Thus, there are
`fewer rollbacks in the wound-wait scheme.
`
`The major problem with these two schemes is that unnecessary rollbacks
`may occur.
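The two decision rules can be captured in a few lines; a smaller timestamp means an older process. This is a sketch for illustration only, using the example timestamps from the text.

```python
def wait_die(ts_requester, ts_holder):
    """Nonpreemptive: an older requester waits; a younger requester dies."""
    return "wait" if ts_requester < ts_holder else "requester rolled back"

def wound_wait(ts_requester, ts_holder):
    """Preemptive: an older requester wounds the holder; a younger
    requester waits."""
    return "wait" if ts_requester > ts_holder else "holder rolled back"
```

Note the symmetry: in both schemes it is always the younger process that is rolled back, which is what rules out starvation when rolled-back processes keep their original timestamps.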
`
`18.5.2 Deadlock Detection
`The deadlock-prevention algorithm may preempt resources even if no
`deadlock has occurred. To prevent unnecessary preemptions, we can use a
`deadlock-detection algorithm. We construct a wait-for graph describing the
`resource-allocation state. Since we are assuming only a single resource of
`each type, a cycle in the wait-for graph represents a deadlock.
`The main problem in a distributed system is deciding how to maintain
`the wait-for graph. We elaborate this problem by describing several
`common techniques to deal with this issue. These schemes require that
`each site keep a local wait-for graph. The nodes of the graph correspond
`
`
`
`
to all the processes (local as well as nonlocal) that are currently
holding or requesting any of the resources local to that site. For example,
in Figure 18.3 we have a system consisting of two sites, each maintaining
its local wait-for graph. Note that processes P2 and P3 appear in both
graphs, indicating that the processes have requested resources at both
sites.
These local wait-for graphs are constructed in the usual manner for
local processes and resources. When a process Pi in site A needs a resource
held by process Pj in site B, a request message is sent by Pi to site B, and the
edge Pi → Pj is then inserted in the local wait-for graph of site B.
Clearly, if any local wait-for graph has a cycle, deadlock has occurred.
On the other hand, the fact that there are no cycles in any of the local
wait-for graphs does not mean that there are no deadlocks. To illustrate
this problem, we consider the system depicted in Figure 18.3. Each wait-
for graph is acyclic; nevertheless, a deadlock exists in the system. To prove
that a deadlock has not occurred, we must show that the union of all local
graphs is acyclic. The graph (shown in Figure 18.4) that we obtain by
taking the union of the two wait-for graphs of Figure 18.3 does indeed
contain a cycle, implying that the system is in a deadlock state.
There are a number of different methods for organizing the wait-for
graph in a distributed system. We shall describe several common methods.
`
`18.5.2.1 Centralized Approach
In the centralized approach, a global wait-for graph is constructed as the
union of all the local wait-for graphs. It is maintained in a single process:
the deadlock-detection coordinator. Since there is communication delay in the
system, we must distinguish between two types of wait-for graphs. The
real graph describes the real but unknown state of the system at any
instance in time, as would be seen by an omniscient observer. The
constructed graph is an approximation generated by the coordinator during
the execution of its algorithm. Obviously, the constructed graph must be
generated such that, whenever the detection algorithm is invoked, the
`
Figure 18.3 Two local wait-for graphs (site A and site B).
`
`
`
`
reported results are correct in the sense that, if a deadlock exists, it is
reported properly, and if a deadlock is reported, then the system is indeed
in a deadlock state. As we shall show, it is not easy to construct such
correct algorithms.
There are three different options (points in time) when the wait-for
graph may be constructed:
`
1. Whenever a new edge is inserted in or removed from one of the local
wait-for graphs
2. Periodically, when a number of changes have occurred in a wait-for
graph
3. Whenever the coordinator needs to invoke the cycle-detection
algorithm
`
Let us consider option 1. Whenever an edge is either inserted in or
removed from a local graph, the local site must also send a message to the
coordinator to notify it of this modification. On receiving such a message,
the coordinator updates its global graph. Alternatively, a site can batch a
number of such changes in a single message periodically. Returning to our
previous example, the coordinator process will maintain the global wait-for
graph as depicted in Figure 18.4. When site B inserts an edge in its
local wait-for graph, it also sends a message to the coordinator.
Similarly, when site A deletes the edge P5 → P1, because P1 has released a
resource that was requested by P5, an appropriate message is sent to the
coordinator.
When the deadlock-detection algorithm is invoked, the coordinator
searches its global graph. If a cycle is found, a victim is selected to be
rolled back. The coordinator must notify all the sites that a particular
process has been selected as victim. The sites, in turn, roll back the victim
process.
Note that, in this scheme (option 1), unnecessary rollbacks may occur
as a result of two situations:
`
Figure 18.4 Global wait-for graph for Figure 18.3.
`
`
`
`
• False cycles may exist in the global wait-for graph. To illustrate this
point, we consider a snapshot of the system as depicted in Figure 18.5.
Suppose that P2 releases the resource it is holding in site A, resulting
in the deletion of the edge P1 → P2 in A. Process P2 then requests a
resource held by P3 at site B, resulting in the addition of the edge P2 →
P3 in B. If the insert P2 → P3 message from B arrives before the delete P1
→ P2 message from A, the coordinator may discover the false cycle P1
→ P2 → P3 → P1 after the insert (but before the delete). Deadlock
recovery may be initiated, although no deadlock has occurred.
• Unnecessary rollbacks may also result when a deadlock has indeed
occurred and a victim has been picked, but at the same time one of the
processes was aborted for reasons unrelated to the deadlock (such as
the process exceeding its allocated time). For example, suppose that
site A in Figure 18.3 decides to abort P2. At the same time, the
coordinator has discovered a cycle and picked P3 as a victim. Both P2
and P3 are now rolled back, although only P2 needed to be rolled back.
`
Note that the same problems are inherited by solutions employing the
other two options (that is, options 2 and 3).
Let us now present a centralized deadlock-detection algorithm, using
option 3, which detects all deadlocks that actually occur, and does not
detect false deadlocks. To avoid the report of false deadlocks, we require
that requests from different sites be appended with unique identifiers
(timestamps). When process Pi, at site A, requests a resource from Pj, at
site B, a request message with timestamp TS is sent. The edge Pi → Pj with
the label TS is inserted in the local wait-for graph of A. This edge is inserted in
the local wait-for graph of B only if B has received the request message
and cannot immediately grant the requested resource. A request from Pi to
Pj in the same site is handled in the usual manner; no timestamps are
associated with the edge Pi → Pj. The detection algorithm is as
follows:
`
Figure 18.5 Local and global wait-for graphs (site A, site B, and the
coordinator).
`
`
`
`
`1. The controller sends an initiating message to each site in the system.
`2. On receiving this message, a site sends its local wait-for graph to the
`coordinator. Note that each of these wait-for graphs contains all the
`local information the site has about the state of the real graph. The
`graph reflects an instantaneous state of the site, but it is not
`synchronized with respect to any other site.
`3. When the controller has received a reply from each site, it constructs a
`graph as follows:
`
`a. The constructed graph contains a vertex for every process in the
`system.
b. The graph has an edge Pi → Pj if and only if (1) there is an edge Pi
→ Pj in one of the wait-for graphs, or (2) an edge Pi → Pj with
some label TS appears in more than one wait-for graph.
`
We assert that, if there is a cycle in the constructed graph, then the
system is in a deadlock state. If there is no cycle in the constructed graph,
then the system was not in a deadlock state when the detection algorithm
was invoked as a result of the initiating messages sent by the coordinator (in
step 1).
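Step 3 and the cycle test can be sketched as follows. Local graphs are assumed to be reported as (Pi, Pj, TS) edge triples; for simplicity, the constructed graph here is simply the union of all reported edges, with the timestamp labels dropped once the edges are merged.

```python
def construct_graph(local_graphs):
    """Union the reported local wait-for graphs (edges carry TS labels)."""
    return {(pi, pj) for graph in local_graphs for (pi, pj, _ts) in graph}

def has_cycle(edges):
    """Depth-first search for a cycle in a directed edge set."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(u):
        color[u] = GRAY                  # on the current DFS path
        for v in adj.get(u, []):
            state = color.get(v, WHITE)
            if state == GRAY or (state == WHITE and visit(v)):
                return True              # back edge found: a cycle exists
        color[u] = BLACK                 # fully explored
        return False

    return any(color.get(u, WHITE) == WHITE and visit(u) for u in list(adj))
```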
`
`18.5.2.2 Fully Distributed Approach
`In the fully distributed deadlock-detection algorithm, all controllers share
`equally the responsibility for detecting deadlock. In this scheme, every site
`constructs a wait-for graph that represents a part of the total graph,
`depending on the dynamic behavior of the system. The idea is that, if a
`deadlock exists, a cycle will appear in (at least) one of the partial graphs.
`We present one such algorithm, which involves construction of partial
`graphs in every site.
`Each site maintains its own local wait-for graph. A local wait-for graph
`in this scheme differs from the one described earlier in that we add one
additional node Pex to the graph. An arc Pi → Pex exists in the graph if Pi
is waiting for a data item in another site being held by any process.
Similarly, an arc Pex → Pj exists in the graph if there exists a process at
another site that is waiting to acquire a resource currently being held by Pj
in this local site.
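Building the augmented graph can be sketched as follows; the argument names are assumptions made for illustration, not from the text.

```python
def augment_with_pex(local_edges, waiting_on_remote, holding_for_remote):
    """Add the node Pex to a local wait-for graph: Pi -> Pex when Pi waits
    on another site; Pex -> Pj when some remote process waits on local Pj."""
    edges = set(local_edges)
    edges |= {(p, "Pex") for p in waiting_on_remote}
    edges |= {("Pex", p) for p in holding_for_remote}
    return edges
```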
`To illustrate this situation, we consider the two local wait-for graphs of
`Figure 18.3. The addition of the node Pex in both graphs results in the local
`wait-for graphs shown in Figure 18.6.
If a local wait-for graph contains a cycle that does not involve node Pex,
then the system is in a deadlock state. If, however, there exists a cycle
involving Pex, then this implies that there is a possibility of a deadlock. To
`
`
`
`
Figure 18.6 Augmented local wait-for graphs of Figure 18.3.
`
ascertain whether a deadlock does exist, we must invoke a distributed
deadlock-detection algorithm.
Suppose that, at site Si, the local wait-for graph contains a cycle
involving node Pex. This cycle must be of the form

Pex → Pk1 → Pk2 → ... → Pkn → Pex,

which indicates that transaction Pkn in site Si is waiting to acquire a data
item located in some other site, say, Sj. On discovering this cycle, site Si
sends to site Sj a deadlock-detection message containing information about
that cycle.
When site Sj receives this deadlock-detection message, it updates its
local wait-for graph with the new information. Then, it searches the newly
constructed wait-for graph for a cycle not involving Pex. If one exists, a
deadlock is found and an appropriate recovery scheme is invoked. If a
cycle involving Pex is discovered, then Sj transmits a deadlock-detection
message to the appropriate site, say, Sk. Site Sk, in turn, repeats the
procedure. Thus, after a finite number of rounds, either a deadlock is
discovered, or the deadlock-detection computation halts.
To illustrate this procedure, we consider the local wait-for graphs of
Figure 18.6. Suppose that site S1 discovers the cycle

Pex → P2 → P3 → Pex.

Since P3 is waiting to acquire a data item in site S2, a deadlock-detection
message describing that cycle is transmitted from site S1 to site S2. When
site S2 receives this message, it updates its local wait-for graph, obtaining
the wait-for graph of Figure 18.7. This graph contains the cycle

P2 → P3 → P4 → P2,

which does not include node Pex. Therefore, the system is in a deadlock
state and an appropriate recovery scheme must be invoked.
`
`
`
`
Note that the outcome would be the same if site S2 discovered the
cycle first in its local wait-for graph and sent the deadlock-detection
message to site S1. In the worst case, both sites will discover the cycle at
about the same time, and two deadlock-detection messages will be sent:
one by S1 to S2, and another by S2 to S1. This situation results in
unnecessary message transfer and overhead in updating the two local
wait-for graphs and searching for cycles in both graphs.
To reduce message traffic, we assign to each transaction Pi a unique
identifier, which we denote by ID(Pi). When site Sk discovers that its local
wait-for graph contains a cycle involving node Pex of the form

Pex → Pk1 → Pk2 → ... → Pkn → Pex,

it sends a deadlock-detection message to another site only if

ID(Pkn) < ID(Pk1).

Otherwise, site Sk continues its normal execution, leaving the burden of
initiating the deadlock-detection algorithm to some other site.
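Under the assumption that the cycle is passed as the list of IDs of Pk1 through Pkn (the processes between the two Pex endpoints), the tie-breaking test is a single comparison:

```python
def should_send_detection_message(cycle_ids):
    """cycle_ids: IDs of Pk1 ... Pkn in a cycle Pex -> Pk1 -> ... -> Pkn -> Pex.
    Only the site where ID(Pkn) < ID(Pk1) forwards the detection message,
    so a cycle observed at both ends triggers one message instead of two."""
    return cycle_ids[-1] < cycle_ids[0]
```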
To illustrate this scheme, we consider again the wait-for graphs
maintained at sites S1 and S2 of Figure 18.6. Suppose that both sites
discover these local cycles at about the same time. The cycle in site S1 is of
the form

Pex → P2 → P3 → Pex.

Since ID(P3) > ID(P2), site S1 does not send a deadlock-detection message
to site S2.
`
Figure 18.7 Augmented local wait-for graph in site S2 of Figure 18.6.
`
`
`
`
The cycle in site S2 is of the form

Pex → P3 → P4 → P2 → Pex.

Since ID(P2) < ID(P3), site S2 does send a deadlock-detection message to
site S1, which, on receiving the message, updates its local wait-for graph.
Site S1 then searches for a cycle in the graph and discovers that the system
is in a deadlock state.
`
`18.6 • Election Algorithms
`
`As we pointed out in Section 18.3, many distributed algorithms employ a
`coordinator process that performs functions needed by the other processes
in the system. These functions include enforcing mutual exclusion,
maintaining a global wait-for graph for deadlock detection, replacing a lost
token, or controlling an input or output device in the system. If the
`coordinator process fails due to the failure of the site at which it resides,
`the system can continue execution only by restarting a new copy of the
`coordinator on some other site. The algorithms that determine where a
`new copy of the coordinator should be restarted are called election
`algorithms.
`Election algorithms assume that a unique priority number is associated
`with each active process in the system. For ease of notation, we assume
`that the priority number of process Pi is i. To simplify our discussion, we
assume a one-to-one correspondence between processes and sites, and
`thus refer to both as processes. The coordinator is always the process with
`the largest priority number. Hence, when a coordinator fails, the algorithm
`must elect that active process with the largest priority number. This
`number must be sent to each active process in the system. In addition, the
`algorithm must provide a mechanism for a recovered process to identify
`the current coordinator.
`In this section, we present two interesting examples of election
`algorithms for two different configurations of distributed systems. The
`first algorithm is applicable to systems where every process can send a
`message to every other process in the system. The second algorithm is
`applicable to systems organized as a ring (logically or physically). Both
algorithms require n² messages for an election, where n is the number of
`processes in the system. We assume that a process that has failed knows
on recovery that it indeed has failed and thus takes appropriate actions to
`rejoin the set of active processes.
`
`18.6.1 The Bully Algorithm
`Suppose that process Pi sends a request that is not answered by the
`coordinator within a time interval T. In this situation, it is assumed that
`
`
`
`596 • Chapter 18: Distributed Coordination
`
the coordinator has failed, and Pi tries to elect itself as the new
coordinator. This task is completed through the following algorithm.
`Process Pi sends an election message to every process with a higher
`priority number. Process Pi then waits for a time interval T for an answer
`from any one of these processes.
If no response is received within time T, Pi assumes that all processes
`with numbers greater than i have failed, and elects itself the new
`coordinator. Process Pi restarts a new copy of the coordinator and sends a
`message to inform all active processes with priority numbers less than i
`that Pi is the new coordinator.
`However, if an answer is received, Pi begins a time interval T', waiting
to receive a message informing it that a process with a higher priority
`number has been elected. (Some other process is electing itself coordinator,
`and should report the results within time T'.) If no message is sent within
`T', then the process with a higher number is assumed to have failed, and
`process Pi should restart the algorithm.
`If Pi is not the coordinator, then, at any time during execution, Pi may
`receive one of the following two messages from process Pj:
`
`1. Pj is the new coordinator (j > i). Process Pi, in turn, records this
`information.
2. Pj started an election (j < i). Process Pi sends a response to Pj and
`begins its own election algorithm, provided that Pi has not already
`initiated such an election.
`
`The process that completes its algorithm has the highest number and is
`elected as the coordinator. It has sent its number to all active processes
`with smaller numbers. After a failed process recovers, it immediately
`begins execution of the same algorithm. If there are no active processes
`with higher numbers, the recovered process forces all processes with lower
`numbers to let it become the coordinator process, even if there is a
`currently active coordinator with a lower number. For this reason, the
`algorithm is termed the bully algorithm.
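A much-simplified sketch of the election outcome, replacing the timeouts T and T' with direct knowledge of which processes are alive; a real implementation must use the message and timeout machinery described above.

```python
def bully_elect(initiator, alive):
    """Return the coordinator chosen when `initiator` starts an election
    among the set `alive` of live process numbers."""
    higher = [p for p in alive if p > initiator]
    if not higher:
        return initiator  # no answer within T: elect self and announce
    # Some higher-numbered process answered; the election propagates
    # upward until the highest-numbered live process elects itself.
    return bully_elect(min(higher), alive)
```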
`Let us demonstrate the operation of the algorithm with a simple
`example of a system consisting of processes P 1 through P 4. The operations
`are as follows:
`
1. All processes are active; P4 is the coordinator process.
2. P1 and P4 fail. P2 determines P4 has failed by sending a request that is
not answered within time T. P2 then begins its election algorithm by
sending a request to P3.
3. P3 receives the request, responds to P2, and begins its own algorithm
by sending an election request to P4.
`
`
`
`
4. P2 receives P3's response, and begins waiting for an interval T'.
5. P4 does not respond within an interval T, so P3 elects itself the new
coordinator, and sends the number 3 to P2 and P1 (which P1 does not
receive, since it has failed).
6. Later, when P1 recovers, it sends an election request to P2, P3, and P4.
7. P2 and P3 respond to P1 and begin their own election algorithms. P3
will again be elected, using the same events as before.
8. Finally, P4 recovers and notifies P1, P2, and P3 that it is the current
coordinator. (P4 sends no election requests, since it is the process with
the highest number in the system.)
`
`18.6.2 Ring Algorithm
`The ring algorithm assumes that the links are unidirectional, and that
`processes send their messages to their right neighbors. The main data
`structure used by the algorithm is the active list, a list that contains the
`priority numbers of all active processes in the system when the algorithm
`ends; each process maintains its own active list. The algorithm works as
`follows:
`
`1. If process Pi detects a coordinator failure, it creates a new active list
`that is initially empty. It then sends a message elect(i) to its right
`neighbor, and adds the number i to its active list.
`2. If Pi receives a message elect(j) from the process on the left, it must
`respond in one of three ways:
`a. If this is the first elect message it has seen or sent, Pi creates a new
`active list with the numbers i and j. It then sends the message
`elect(i), followed by the message elect(j).
b. If i ≠ j (that is, the message received does not contain Pi's
number), then Pi adds j to its active list and forwards the message
to its right neighbor.
`c. If i = j (that is, Pi receives the message elect(i)), then the active list
`for Pi now contains the numbers of all the active processes in the
`system. Process Pi can now determine the largest number in the
`active list to identify the new coordinator process.
`
`This algorithm does not specify how a recovering process determines
`the number of the current coordinator process. One solution would be to
`require a recovering process to send an inquiry message. This message is
`forwarded around the ring to the current coordinator, which in turn sends
`a reply containing its number.
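A single-pass simulation of the election's effect. The real algorithm circulates an elect(i) message from every participant, but after one full trip around the ring the initiator's active list holds all live numbers, so the outcome can be sketched as:

```python
def ring_elect(ring, start):
    """One pass of the ring election started by ring[start]; `ring` lists
    process numbers in ring order (messages travel to the right)."""
    n = len(ring)
    active = [ring[start]]          # step 1: fresh active list with own number
    pos = (start + 1) % n
    while pos != start:             # elect messages circulate once around;
        active.append(ring[pos])    # each live process contributes its number
        pos = (pos + 1) % n
    return max(active)              # step 2c: largest number is coordinator
```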
`
`
`
`
`18.7 • Reaching Agreement
`
`For a system to be reliable, we need a mechanism that allows a set of
`processes to agree on a common "value." There are several reasons why
`such an agreement may not take place. First, the communication medium
may be faulty, resulting in lost or garbled messages. Second, the processes
`themselves may be faulty, resulting in unpredictable process behavior. The
`best we can hope for, in this case, is that processes fail in a clean way,
`stopping their execution without deviating from their normal execution
`pattern. In the worst case, processes may send garbled or incorrect
`messages to other processes, or even collaborate with other failed
`processes in an attempt to destroy the integrity of the system.
`This problem has been expressed as the Byzantine generals problem.
`Several divisions of the Byzantine army, each commanded by its own
`general, surround an enemy camp. The Byzantine generals must reach a
`common agreement on whether or not to attack the enemy at dawn. It is
`crucial that all generals agree, since an attack by only some of the divisions
`would result in defeat. The various divisions are geographically dispersed
`and the generals can communicate with one another only via messengers
`who run from camp to camp. There are at least two major reasons why the
`generals may not be able to reach an agreement:
`
`• Messengers may get caught by the enemy and thus may be unable to
`deliver
`their messages. This situation corresponds
`to unreliable
`communication in a computer system, and is discussed further in
`Section 18.7.1.
`• Generals may be traitors, trying to prevent the loyal generals from
`reaching an agreement. This situation corresponds to faulty processes
`in a computer system, and is discussed further in Section 18.7.2.
`
18.7.1 Unreliable Communications
Let us assume that, if processes fail, they do so in a clean way, and that
the communication medium is unreliable. Suppose that process Pi at site
A, which has sent a message to process Pj at site B, needs to know
whether Pj has received the message so that it can decide how to proceed
with its computation. For example, Pi may decide to compute a function S
if Pj has received its message, or to compute a function F if Pj has not
received the message (because of some hardware failure).
To detect failures, we can use a time-out scheme similar to the one
described in Section 16.4.1. When Pi sends out a message, it also specifies
a time interval during which it is willing to wait for an acknowledgment
message from Pj. When Pj receives the message, it immediately