`
`JOHN M. MELLOR-CRUMMEY
`Rice University
`
`and
`
`MICHAEL L. SCOTT
`University of Rochester
`
`Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in
shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend
`to produce large amounts of memory and interconnect contention, introducing performance
`bottlenecks that become markedly more pronounced as applications scale. We argue that this
`problem is not fundamental, and that one can in fact construct busy-wait synchronization
`algorithms that induce no memory or interconnect contention. The key to these algorithms is for
`every processor to spin on separate locally-accessible flag variables, and for some other processor
`to terminate the spin with a single remote write operation at an appropriate time. Flag variables
`may be locally-accessible as a result of coherent caching, or by virtue of allocation in the local
portion of physically distributed shared memory.
We present a new scalable algorithm for spin locks that generates O(1) remote references per
`lock acquisition, independent of the number of processors attempting to acquire the lock. Our
`algorithm provides reasonable latency in the absence of contention, requires only a constant
`amount of space per lock, and requires no hardware support other than a swap-with-memory
instruction. We also present a new scalable barrier algorithm that generates O(1) remote
`references per processor reaching the barrier, and observe that two previously-known barriers
`can likewise be cast in a form that spins only on locally-accessible flag variables. None of these
`barrier algorithms requires hardware support beyond the usual atomicity of memory reads and
`writes.
`We compare the performance of our scalable algorithms with other software approaches to
`busy-wait synchronization on both a Sequent Symmetry and a BBN Butterfly. Our principal
`conclusion is that contention due to synchronization need not be a problem in large-scale
`shared-memory multiprocessors. The existence of scalable algorithms greatly weakens the case
`for costly special-purpose hardware support for synchronization, and provides a case against
`so-called "dance hall" architectures, in which shared memory locations are equally far from all
processors.
`
`This paper was funded in part by the National Science Foundation under Cooperative Agree-
`ment CCR-8809615 and under Institutional Infrastructure grant CDA-8822724.
`Authors' Addresses: J. M. Mellor-Crummey, Center for Research on Parallel Computation, Rice
University, P.O. Box 1892, Houston, TX 77251-1892. M. L. Scott, Computer Science Depart-
`ment, University of Rochester, Rochester, NY 14627.
`Permission to copy without fee all or part of this material is granted provided that the copies are
`not made or distributed for direct commercial advantage, the ACM copyright notice and the title
`of the publication and its date appear, and notice is given that copying is by permission of the
Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or
`specific permission.
© 1991 ACM 0734-2071/91/0200-0021 $01.50
`
`ACM Transactions on Computer Systems, Vol. 9, No. 1, February, 1991, Pages 21-65.
`
`
Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles-cache memories, shared memories; C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors)-interconnection architectures, parallel processors; C.4 [Computer Systems Organization]: Performance of Systems-design studies; D.4.1 [Operating Systems]: Process Management-mutual exclusion, synchronization; D.4.2 [Operating Systems]: Storage Management-storage hierarchies; D.4.8 [Operating Systems]: Performance-measurements
`
`General Terms: Algorithms, Performance
`
`Additional Key Words and Phrases: Atomic operations, barrier synchronization, contention,
`fetch-and-phi, locality, spin locks
`
`1. INTRODUCTION
`Techniques for efficiently coordinating parallel computation on MIMD,
`shared-memory multiprocessors are of growing interest and importance as
`the scale of parallel machines increases. On shared-memory machines, pro-
`cessors communicate by sharing data structures. To ensure the consistency of
`shared data structures, processors perform simple operations by using hard-
`ware-supported atomic primitives and coordinate complex operations by us-
`ing synchronization constructs and conventions to protect against overlap of
`conflicting operations.
`Synchronization constructs can be divided into two classes: blocking con-
`structs that deschedule waiting processes and busy-wait constructs in which
`processes repeatedly test shared variables to determine when they may
`proceed. Busy-wait synchronization is fundamental to parallel programming
`on shared-memory multiprocessors and is preferred over scheduler-based
`blocking when scheduling overhead exceeds expected wait time, when proces-
`sor resources are not needed for other tasks (so that the lower wake-up
`latency of busy waiting need not be balanced against an opportunity cost), or
`when scheduler-based blocking is inappropriate or impossible (for example in
`the kernel of an operating system).
`Two of the most widely used busy-wait synchronization constructs are spin
`locks and barriers. Spin locks provide a means for achieving mutual exclu-
`sion (ensuring that only one processor can access a particular shared data
`structure at a time) and are a basic building block for synchronization
`constructs with richer semantics, such as semaphores and monitors. Spin
`locks are ubiquitously used in the implementation of parallel operating
`systems and application programs. Barriers provide a means of ensuring that
`no processes advance beyond a particular point in a computation until all
`have arrived at that point. They are typically used to separate "phases" of an
`application program. A barrier might guarantee, for example, that all pro-
`cesses have finished updating the values in a shared matrix in step t before
`any processes use the values as input in step t + 1.
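
For illustration, the phase-separation idiom just described might be written as follows using a POSIX barrier. This sketch is ours; the update_row routine and the NROWS and NTHREADS constants are illustrative assumptions, not part of the algorithms in this paper.

    #include <pthread.h>

    #define NTHREADS 4
    #define NROWS    64

    /* pthread_barrier_init(&phase_barrier, NULL, NTHREADS) must be called
       once before the worker threads are created */
    static pthread_barrier_t phase_barrier;

    static void update_row(int row)
    {
        (void) row;   /* compute new values for one row of the shared matrix */
    }

    static void *worker(void *arg)
    {
        int id = *(int *) arg;
        for (int step = 0; step < 100; step++) {
            for (int row = id; row < NROWS; row += NTHREADS)
                update_row(row);                  /* step t: update shared data  */
            pthread_barrier_wait(&phase_barrier); /* no process begins step t+1  */
        }                                         /* until all have finished t   */
        return NULL;
    }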
`The performance of locks and barriers is a topic of great importance. Spin
locks are generally employed to protect very small critical sections, and may
be executed an enormous number of times in the course of a computation.
`Barriers, likewise, are frequently used between brief phases of data-parallel
`algorithms (e.g., successive relaxation), and may be a major contributor to
`run time. Unfortunately, typical implementations of busy-waiting tend to
`produce large amounts of memory and interconnection network contention,
`which causes performance bottlenecks that become markedly more pro-
`nounced in larger machines and applications. As a consequence, the overhead
`of busy-wait synchronization is widely regarded as a serious performance
problem [2, 6, 11, 14, 38, 49, 51].
`When many processors busy-wait on a single synchronization variable,
`they create a hot spot that is the target of a disproportionate share of the
network traffic. Pfister and Norton [38] showed that the presence of hot spots
`can severely degrade performance for all traffic in multistage interconnection
`networks, not just traffic due to synchronizing processors. As part of a larger
`study, Agarwal and Cherian [2] investigated the impact of synchronization on
`overall program performance. Their simulations of benchmarks on a cache-
`coherent multiprocessor indicate that memory references due to synchroniza-
`tion cause cache line invalidations much more often than nonsynchronization
`references. In simulations of the benchmarks on a 64-processor "dance hall"
`machine (in which each access to a shared variable traverses the processor-
`memory interconnection network), they observed that synchronization ac-
`counted for as much as 49 percent of total network traffic.
`In response to performance concerns, the history of synchronization tech-
`niques has displayed a trend toward increasing hardware support. Early
`algorithms assumed only the ability to read and write individual memory
`locations atomically. They tended to be subtle, and costly in time and space,
`requiring both a large number of shared variables and a large number of
`operations to coordinate concurrent invocations of synchronization primitives
`[12, 25, 26, 36, 401. Modern multiprocessors generally include more sophisti-
`cated atomic operations, permitting simpler and faster coordination strate-
`gies. Particularly common are various f etch-and_@ operations [221 which
`atomically read, modify, and write a memory location. Fetch-and_$ opera-
`tions include test-and-set, fetch-and-store (swap), fetch-and-add, and
`c ~ m ~ a r e _ a n d _ s w a ~ . ~
`More recently, there have been proposals for multistage interconnection
networks that combine concurrent accesses to the same memory location [17,
37, 42], multistage networks that have special synchronization variables
embedded in each stage of the network [21], and special-purpose cache
hardware to maintain a queue of processors waiting for the same lock [14, 28,
35]. The principal purpose of these hardware primitives is to reduce the
`impact of busy waiting. Before adopting them it is worth considering the
`extent to which software techniques can achieve a similar result.
`
`--
¹Fetch-and-store exchanges a register with memory. Compare-and-swap compares the contents of
a memory location against a given value, and sets a condition code to indicate whether they are
equal. If so, it replaces the contents of the memory location with a second given value.
`
`For a wide range of shared-memory multiprocessor architectures, we con-
`tend that appropriate design of spin locks and barriers can eliminate all
`busy-wait contention. Specifically, by distributing data structures appropri-
`ately, we can ensure that each processor spins only on locally-accessible
`locations, locations that are not the target of spinning references by any
`other processor. All that is required in the way of hardware support is a
simple set of fetch-and-Φ operations and a memory hierarchy in which each
`processor is able to read some portion of shared memory without using the
`interconnection network. On a machine with coherent caches, processors spin
`on locations in their caches. On a machine in which shared memory is
distributed (e.g., the BBN Butterfly [8], the IBM RP3 [37], or a shared-
memory hypercube [10]), processors spin only on locations in the local portion
`of shared memory.
`The implication of our work is that efficient synchronization algorithms
`can be constructed in software for shared-memory multiprocessors of arbi-
`trary size. Special-purpose synchronization hardware can offer only a small
`constant factor of additional performance for mutual exclusion, and at best a
logarithmic factor for barrier synchronization.² In addition, the feasibility
`and performance of busy-waiting algorithms with local-only spinning pro-
`vides a case against "dance hall" architectures in which shared memory
`locations are equally far from all processors.
`We discuss the implementation of spin locks in Section 2, presenting both
`existing approaches and a new algorithm of our own design. In Section 3 we
`turn to the issue of barrier synchronization, explaining how existing ap-
`proaches can be adapted to eliminate spinning on remote locations, and
`introducing a new design that achieves both a short critical path and the
`theoretical minimum total number of network transactions. We present
`performance results in Section 4 for a variety of spin lock and barrier
`implementations, and discuss the implications of these results for software
`and hardware designers. Our conclusions are summarized in Section 5.
`
2. SPIN LOCKS
In this section we describe a series of five implementations for a mutual-ex-
clusion spin lock. The first four appear in the literature in one form or
`another. The fifth is a novel lock of our own design. Each lock can be seen as
`an attempt to eliminate some deficiency in the previous design. Each as-
sumes a shared-memory environment that includes certain fetch-and-Φ
`operations. As noted above, a substantial body of work has also addressed
`mutual exclusion using more primitive read and write atomicity; the com-
`plexity of the resulting solutions is the principal motivation for the develop-
ment of fetch-and-Φ primitives. Other researchers have considered mutual
exclusion in the context of distributed systems [31, 39, 43, 45, 46], but the
`
²Hardware combining can reduce the time to achieve a barrier from O(log P) to O(1) steps if
processors happen to arrive at the barrier simultaneously.
`
`
`characteristics of message passing are different enough from shared memory
`operations that solutions do not transfer from one environment to the other.
`Our pseudo-code notation is meant to be more or less self explanatory. We
`have used line breaks to terminate statements, and indentation to indicate
`nesting in control constructs. The keyword shared indicates that a declared
`variable is to be shared among all processors. The declaration implies no
`particular physical location for the variable, but we often specify locations in
`comments and/or accompanying text. The keywords processor private indi-
`cate that each processor is to have a separate, independent copy of a declared
`variable. All of our atomic operations are written to take as their first
`argument the address of the memory location to be modified. All but com-
`pare-and-swap take the obvious one additional operand and return the old
contents of the memory location. Compare-and-swap (addr, old, new) is de-
fined as: if addr^ != old return false; addr^ := new; return true. In one
`case (Figure 3) we have used an atomic-add operation whose behavior is the
same as calling fetch-and-add and discarding the result.
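
For readers who want a concrete rendering of these conventions, the following sketch expresses them with C11 atomics; the wrapper names simply mirror our pseudo-code and are not the primitives of any particular machine.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Each wrapper takes the address of the memory location first and, except
       for compare_and_swap, returns the location's old contents. */

    static inline bool test_and_set(atomic_bool *addr)             /* old value */
    { return atomic_exchange(addr, true); }

    static inline int fetch_and_store(atomic_int *addr, int new)   /* swap */
    { return atomic_exchange(addr, new); }

    static inline int fetch_and_add(atomic_int *addr, int delta)   /* old value */
    { return atomic_fetch_add(addr, delta); }

    static inline void atomic_add(atomic_int *addr, int delta)     /* result discarded */
    { atomic_fetch_add(addr, delta); }

    static inline bool compare_and_swap(atomic_int *addr, int old, int new)
    {
        /* if *addr != old, return false; otherwise *addr := new and return true */
        return atomic_compare_exchange_strong(addr, &old, new);
    }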
`
`2.1. The Simple Test-and-Set Lock
`The simplest mutual exclusion lock, found in all operating system textbooks
`and widely used in practice, employs a polling loop to access a Boolean flag
`that indicates whether the lock is held. Each processor repeatedly executes a
`test-and-set instruction in an attempt to change the flag from false to true,
`thereby acquiring the lock. A processor releases the lock by setting it to false.
`The principal shortcoming of the test-and-set lock is contention for the
`flag. Each waiting processor accesses the single shared flag as frequently as
possible, using relatively expensive read-modify-write (fetch-and-Φ) instruc-
`tions. The result is degraded performance not only of the memory bank in
`which the lock resides but also of the processor/memory interconnection
`network and, in a distributed shared-memory machine, the processor that
`owns the memory bank (as a result of stolen bus cycles).
Fetch-and-Φ instructions can be particularly expensive on cache-coherent
`multiprocessors since each execution of such an instruction may cause many
`remote invalidations. To reduce this overhead, the test-and-set lock can be
`modified to use a test-and-set instruction only when a previous read indi-
cates that the test-and-set might succeed. This so-called test-and-test-and-set
technique [44] ensures that waiting processors poll with read
`requests during the time that a lock is held. Once the lock becomes available,
`some fraction of the waiting processors detect that the lock is free and
`perform a test-and-set operation, exactly one of which succeeds, but each of
`which causes remote invalidations on a cache-coherent machine.
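
As an illustration (our own rendering, not code taken from [44]), a test-and-test-and-set lock can be sketched with C11 atomics as follows; waiting processors poll with ordinary loads and attempt the expensive atomic exchange only when the lock appears free.

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef atomic_bool ttas_lock;          /* false = unlocked, true = locked */

    void ttas_acquire(ttas_lock *L)
    {
        for (;;) {
            while (atomic_load(L))          /* poll with ordinary reads        */
                ;                           /* while the lock is held          */
            if (!atomic_exchange(L, true))  /* test-and-set only when the lock */
                return;                     /* appears free; old value false   */
        }                                   /* means we acquired it            */
    }

    void ttas_release(ttas_lock *L)
    {
        atomic_store(L, false);
    }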
The total amount of network traffic caused by busy-waiting on a
test-and-set lock can be reduced further by introducing delay on each
`processor between consecutive probes of the lock. The simplest approach
`employs a constant delay; more elaborate schemes use some sort of backoff on
`unsuccessful probes. Anderson [5] reports the best performance with exponen-
`tial backoff; our experiments confirm this result. Pseudo-code for a
`
type lock = (unlocked, locked)

procedure acquire_lock (L : ^lock)
    delay : integer := 1
    while test_and_set (L) = locked    // returns old value
        pause (delay)                  // consume this many units of time
        delay := delay * 2

procedure release_lock (L : ^lock)
    L^ := unlocked

Fig. 1. Simple test-and-set lock with exponential backoff.
`
`test-and-set lock with exponential backoff appears in Figure 1. Test-and-set
`suffices when using a backoff scheme; test-and-test-and-set is not necessary.
`
2.2. The Ticket Lock
In a test-and-test-and-set lock, the number of read-modify-write operations
`is substantially less than for a simple test-and-set lock but still potentially
`large. Specifically, it is possible for every waiting processor to perform a
`test-and-set operation every time the lock becomes available, even though
`only one can actually acquire the lock. We can reduce the number of
fetch-and-Φ operations to one per lock acquisition with what we call a ticket
`lock. At the same time, we can ensure FIFO service by granting the lock to
`processors in the same order in which they first requested it. A ticket lock is
`fair in a strong sense; it eliminates the possibility of starvation.
`A ticket lock consists of two counters, one containing the number of
`requests to acquire the lock, and the other the number of times the lock has
`been released. A processor acquires the lock by performing a fetch-and-in-
`crement operation on the request counter and waiting until the result (its
`ticket) is equal to the value of the release counter. It releases the lock by
`incrementing the release counter. In the terminology of Reed and Kanodia
[41], a ticket lock corresponds to the busy-wait implementation of a semaphore
`using an eventcount and a sequencer. It can also be thought of as an
optimization of Lamport's bakery lock [24], which was designed for fault-
`tolerance rather than performance. Instead of spinning on the release counter,
`processors using a bakery lock repeatedly examine the tickets of their peers.
`Though it probes with read operations only (and thus avoids the overhead
`of unnecessary invalidations in coherent cache machines), the ticket lock still
`causes substantial memory and network contention through polling of a
common location. As with the test-and-set lock, this contention can be
`reduced by introducing delay on each processor between consecutive probes of
`the lock. In this case, however, exponential backoff is clearly a bad idea.
`Since processors acquire the lock in FIFO order, overshoot in backoff by the
`first processor in line will delay all others as well, causing them to back off
`even farther. Our experiments suggest that a reasonable delay can be
determined by using information not available with a test-and-set lock:
`namely, the number of processors already waiting for the lock. The delay can
`
type lock = record
    next_ticket : unsigned integer := 0
    now_serving : unsigned integer := 0

procedure acquire_lock (L : ^lock)
    my_ticket : unsigned integer := fetch_and_increment (&L->next_ticket)
        // returns old value; arithmetic overflow is harmless
    loop
        pause (my_ticket - L->now_serving)
            // consume this many units of time
            // on most machines, subtraction works correctly despite overflow
        if L->now_serving = my_ticket
            return

procedure release_lock (L : ^lock)
    L->now_serving := L->now_serving + 1

Fig. 2. Ticket lock with proportional backoff.
`
`be computed as the difference between a newly-obtained ticket and the
`current value of the release counter.
`Delaying for an appropriate amount of time requires an estimate of how
`long it will take each processor to execute its critical section and pass the
`lock to its successor. If this time is known exactly, it is in principle possible to
`acquire the lock with only two probes, one to determine the number of
`processors already in line (if any), and another (if necessary) to verify that
`one's predecessor in line has finished with the lock. This sort of accuracy is
`not likely in practice, however, since critical sections do not in general take
`identical, constant amounts of time. Moreover, delaying proportional to the
`expected average time to hold the lock is risky: if the processors already in
`line average less than the expected amount, the waiting processor will delay
`too long and slow the entire system. A more appropriate constant of propor-
`tionality for the delay is the minimum time that a processor can hold the
`lock. Pseudo-code for a ticket lock with proportional backoff appears in
Figure 2. It assumes that the next_ticket and now_serving counters are large
`enough to accommodate the maximum number of simultaneous requests for
`the lock.
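
The same algorithm can be sketched in C11 atomics as follows; the pause_for loop and the base_delay parameter are illustrative stand-ins for a machine-specific delay calibrated to the minimum lock-holding time, not part of the algorithm proper.

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;        /* number of requests so far */
        atomic_uint now_serving;        /* number of releases so far */
    } ticket_lock;                      /* both fields start at 0    */

    static void pause_for(unsigned units)      /* crude delay loop (illustrative) */
    {
        for (volatile unsigned i = 0; i < units; i++)
            ;
    }

    void ticket_acquire(ticket_lock *L, unsigned base_delay)
    {
        unsigned my_ticket = atomic_fetch_add(&L->next_ticket, 1);
        /* returns old value; unsigned overflow is harmless */
        for (;;) {
            unsigned ahead = my_ticket - atomic_load(&L->now_serving);
            if (ahead == 0)
                return;
            pause_for(ahead * base_delay);  /* back off in proportion to the    */
        }                                   /* number of processors ahead of us */
    }

    void ticket_release(ticket_lock *L)
    {
        atomic_fetch_add(&L->now_serving, 1);
    }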
`
`2.3. Array-Based Queuing Locks
Even using a ticket lock with proportional backoff, it is not possible to obtain
`a lock with an expected constant number of network transactions, due to the
unpredictability of the length of critical sections. Anderson [5] and Graunke
`and Thakkar [18] have proposed locking algorithms that achieve the constant
`bound on cache-coherent multiprocessors that support atomic fetch-and-in-
`crement or fetch-and-store, respectively. The trick is for each processor to
`use the atomic operation to obtain the address of a location on which to spin.
`Each processor spins on a different location, in a different cache line. Ander-
`son's experiments indicate that his queuing lock outperforms a test-and-set
`lock with exponential backoff on the Sequent Symmetry when more than six
`
type lock = record
    slots : array [0..numprocs-1] of (has_lock, must_wait)
        := (has_lock, must_wait, must_wait, ..., must_wait)
        // each element of slots should lie in a different memory module
        // or cache line
    next_slot : integer := 0

// parameter my_place, below, points to a private variable
// in an enclosing scope

procedure acquire_lock (L : ^lock, my_place : ^integer)
    my_place^ := fetch_and_increment (&L->next_slot)
        // returns old value
    if my_place^ mod numprocs = 0
        atomic_add (&L->next_slot, -numprocs)
        // avoid problems with overflow; return value ignored
    my_place^ := my_place^ mod numprocs
    repeat while L->slots[my_place^] = must_wait    // spin
    L->slots[my_place^] := must_wait                // init for next time

procedure release_lock (L : ^lock, my_place : ^integer)
    L->slots[(my_place^ + 1) mod numprocs] := has_lock

Fig. 3. Anderson's array-based queuing lock.
`
`processors are competing for access. Graunke and Thakkar's experiments
`indicate that their lock outperforms a test-and-set lock on the same ma-
`chine when more than three processors are competing.
Pseudo-code for Anderson's lock appears in Figure 3;³ Graunke and
Thakkar's lock appears in Figure 4. The likely explanation for the better
performance of Graunke and Thakkar's lock is that fetch-and-store is
supported directly in hardware on the Symmetry; fetch-and-add is not. To
`simulate fetch-and-add, Anderson protected his queue-based lock with an
`outer test-and-set lock. This outer lock (not shown in Figure 3) introduces
`additional overhead and can cause contention when the critical section
`protected by the queue-based lock is shorter than the time required to acquire
`and release the outer lock. In this case competing processors spend their time
`spinning on the outer test-and-set lock rather than the queue-based lock.
`Neither Anderson nor Graunke and Thakkar included the ticket lock in
`their experiments. In qualitative terms both the ticket lock (with propor-
`tional backoff) and the array-based queuing locks guarantee FIFO ordering of
requests. Both the ticket lock and Anderson's lock use an atomic
fetch-and-increment instruction. The ticket lock with proportional backoff is
`likely to require more network transactions on a cache-coherent multiproces-
`sor but fewer on a multiprocessor without coherently cached shared vari-
`ables. The array-based queuing locks require space per lock linear in the
`number of processors whereas the ticket lock requires only a small constant
`
`--
³Anderson's original pseudo-code did not address the issue of overflow, which causes his algo-
rithm to fail unless numprocs is a power of 2. Our variant of his algorithm addresses this problem.
`
`
type lock = record
    slots : array [0..numprocs-1] of Boolean := true
        // each element of slots should lie in a different memory module
        // or cache line
    tail : record
        who_was_last : ^Boolean := 0
        this_means_locked : Boolean := false
        // this_means_locked is a one-bit quantity.
        // who_was_last points to an element of slots.
        // if all elements lie at even addresses, this tail "record"
        // can be made to fit in one word
processor private vpid : integer    // a unique virtual processor index

procedure acquire_lock (L : ^lock)
    (who_is_ahead_of_me : ^Boolean, what_is_locked : Boolean)
        := fetch_and_store (&L->tail, (&slots[vpid], slots[vpid]))
    repeat while who_is_ahead_of_me^ = what_is_locked

procedure release_lock (L : ^lock)
    L->slots[vpid] := not L->slots[vpid]

Fig. 4. Graunke and Thakkar's array-based queuing lock.
`
amount of space.⁴ We provide quantitative comparisons of the locks' perfor-
`mance in Section 4.3.
`
`2.4. A New List-Based Queuing Lock
We have devised a new mechanism called the MCS lock (after our initials)
`that
`
`-guarantees FIFO ordering of lock acquisitions;
`-spins on locally-accessible flag variables only;
`-requires a small constant amount of space per lock; and
`-works equally well (requiring only O(1) network transactions per lock
`acquisition) on machines with and without coherent caches.
`
`The first of these advantages is shared with the ticket lock and the
`array-based queuing locks, but not with the test-and-set lock. The third is
`shared with the test-and-set and ticket locks, but not with the array-based
`queuing locks. The fourth advantage is in large part a consequence of the
`second and is unique to the MCS lock.
`Our lock was inspired by the Queue On Lock Bit (QOLB) primitive
proposed for the cache controllers of the Wisconsin Multicube [14], but is
`implemented entirely in software. It requires an atomic fetch-and-store
`(swap) instruction and benefits from the availability of compare-and-swap.
`Without compare-and-swap, we lose the guarantee of FIFO ordering and
`
⁴At first glance, one might suspect that the flag bits of the Graunke and Thakkar lock could be
allocated on a per-processor basis, rather than a per-lock basis. Once a processor releases a lock,
however, it cannot use its bit for anything else until some other processor has acquired the lock
it released.
`
`
type qnode = record
    next : ^qnode
    locked : Boolean
type lock = ^qnode

// parameter I, below, points to a qnode record allocated
// (in an enclosing scope) in shared memory locally-accessible
// to the invoking processor

procedure acquire_lock (L : ^lock, I : ^qnode)
    I->next := nil
    predecessor : ^qnode := fetch_and_store (L, I)
    if predecessor != nil    // queue was non-empty
        I->locked := true
        predecessor->next := I
        repeat while I->locked    // spin

procedure release_lock (L : ^lock, I : ^qnode)
    if I->next = nil    // no known successor
        if compare_and_swap (L, I, nil)
            // compare_and_swap returns true iff it swapped
            return
        repeat while I->next = nil    // spin
    I->next->locked := false

Fig. 5. The MCS list-based queuing lock.
`
`introduce the theoretical possibility of starvation, though lock acquisitions
`are likely to remain very nearly FIFO in practice.
`Pseudo-code for our lock appears in Figure 5. Every processor using the
`lock allocates a qnode record containing a queue link and a Boolean flag.
`Each processor employs one additional temporary variable during the ac-
`quire-lock operation. Processors holding or waiting for the lock are chained
`together by the links. Each processor spins on its own locally-accessible flag.
`The lock itself contains a pointer to the qnode record for the processor at the
tail of the queue, or nil if the lock is not held. Each processor in the queue
`holds the address of the record for the processor behind it-the processor it
`should resume after acquiring and releasing the lock. Compare-and-swap
`enables a processor to determine whether it is the only processor in the
`queue, and if so remove itself correctly, as a single atomic action. The spin in
`acquire-lock waits for the lock to become free. The spin in release-lock
`compensates for the timing window between the fetch-and-store and the
assignment to predecessor->next in acquire-lock. Both spins are local.
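
For readers who prefer real code to pseudo-code, the following C11 sketch (ours, using default sequentially consistent orderings rather than anything tuned for a particular machine) renders the same algorithm; as in Figure 5, each qnode is assumed to be allocated in memory locally accessible to the processor that passes it in.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct qnode {
        struct qnode *_Atomic next;
        atomic_bool locked;
    } qnode;

    typedef struct { qnode *_Atomic tail; } mcs_lock;   /* tail = NULL: lock free */

    void mcs_acquire(mcs_lock *L, qnode *I)
    {
        atomic_store(&I->next, NULL);
        qnode *predecessor = atomic_exchange(&L->tail, I);
        if (predecessor != NULL) {                /* queue was non-empty */
            atomic_store(&I->locked, true);
            atomic_store(&predecessor->next, I);
            while (atomic_load(&I->locked))       /* spin on our own flag */
                ;
        }
    }

    void mcs_release(mcs_lock *L, qnode *I)
    {
        qnode *successor = atomic_load(&I->next);
        if (successor == NULL) {                  /* no known successor */
            qnode *expected = I;
            if (atomic_compare_exchange_strong(&L->tail, &expected, NULL))
                return;                           /* we were the only waiter */
            while ((successor = atomic_load(&I->next)) == NULL)
                ;                                 /* spin: wait for successor link */
        }
        atomic_store(&successor->locked, false);  /* pass the lock */
    }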
`Figure 6, parts (a) through (e), illustrates a series of acquire-lock and
`release-lock operations. The lock itself is represented by a box containing
`an 'L.' The other rectangles are qnode records. A box with a slash through it
represents a nil pointer. Non-nil pointers are directed arcs. In (a), the lock is
`free. In (b), processor 1 has acquired the lock. It is running (indicated by the
`'R'), thus its locked flag is irrelevant (indicated by putting the 'R' in
`parentheses). In (c), two more processors have entered the queue while the
lock is still held by processor 1. They are blocked spinning on their locked
flags (indicated by the 'B's). In (d), processor 1 has completed and has
changed the locked flag of processor 2 so that it is now running. In (e),
processor 2 has completed, and has similarly unblocked processor 3. If no
more processors enter the queue in the immediate future, the lock will return
to the situation in (a) when processor 3 completes its critical section.

Fig. 6. Pictorial example of MCS locking protocol in the presence of competition.
`Alternative code for the release-lock operation, without compare-and-swap,
`appears in Figure 7. Like the code in Figure 5, it spins on processor-specific,
`locally-accessible memory locations only, requires constant space per lock,
`and requires only O(1) network transactions regardless of whether the ma-
`chine provides coherent caches. Its disadvantages are extra complexity and
`the loss of strict FIFO ordering.
`Parts (e) through (h) of Figure 6 illustrate the subtleties of the alternative
`code for release-lock. In the original version of the lock, compare-and-swap
`ensures that updates to the tail of the queue happen atomically. There are no
`
procedure release_lock (L : ^lock, I : ^qnode)
    if I->next = nil    // no known successor
        old_t