Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

JOHN M. MELLOR-CRUMMEY
Rice University

and

MICHAEL L. SCOTT
University of Rochester
Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in
shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend
to produce large amounts of memory and interconnect contention, introducing performance
bottlenecks that become markedly more pronounced as applications scale. We argue that this
problem is not fundamental, and that one can in fact construct busy-wait synchronization
algorithms that induce no memory or interconnect contention. The key to these algorithms is for
every processor to spin on separate locally-accessible flag variables, and for some other
processor to terminate the spin with a single remote write operation at an appropriate time.
Flag variables may be locally-accessible as a result of coherent caching, or by virtue of
allocation in the local portion of physically distributed shared memory.

We present a new scalable algorithm for spin locks that generates O(1) remote references per
lock acquisition, independent of the number of processors attempting to acquire the lock. Our
algorithm provides reasonable latency in the absence of contention, requires only a constant
amount of space per lock, and requires no hardware support other than a swap-with-memory
instruction. We also present a new scalable barrier algorithm that generates O(1) remote
references per processor reaching the barrier, and observe that two previously-known barriers
can likewise be cast in a form that spins only on locally-accessible flag variables. None of
these barrier algorithms requires hardware support beyond the usual atomicity of memory reads
and writes.

We compare the performance of our scalable algorithms with other software approaches to
busy-wait synchronization on both a Sequent Symmetry and a BBN Butterfly. Our principal
conclusion is that contention due to synchronization need not be a problem in large-scale
shared-memory multiprocessors. The existence of scalable algorithms greatly weakens the case
for costly special-purpose hardware support for synchronization, and provides a case against
so-called "dance hall" architectures, in which shared memory locations are equally far from all
processors.
This paper was funded in part by the National Science Foundation under Cooperative Agreement
CCR-8809615 and under Institutional Infrastructure grant CDA-8822724.
Authors' Addresses: J. M. Mellor-Crummey, Center for Research on Parallel Computation, Rice
University, P.O. Box 1892, Houston, TX 77251-1892. M. L. Scott, Computer Science Department,
University of Rochester, Rochester, NY 14627.
Permission to copy without fee all or part of this material is granted provided that the copies
are not made or distributed for direct commercial advantage, the ACM copyright notice and the
title of the publication and its date appear, and notice is given that copying is by permission
of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee
and/or specific permission.
© 1991 ACM 0734-2071/91/0200-0021 $01.50

ACM Transactions on Computer Systems, Vol. 9, No. 1, February 1991, Pages 21-65.
Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles - cache memories,
shared memories; C.1.2 [Processor Architectures]: Multiple Data Stream Architectures
(Multiprocessors) - interconnection architectures, parallel processors; C.4 [Computer Systems
Organization]: Performance of Systems - design studies; D.4.1 [Operating Systems]: Process
Management - mutual exclusion, synchronization; D.4.2 [Operating Systems]: Storage Management -
storage hierarchies; D.4.8 [Operating Systems]: Performance - measurements

General Terms: Algorithms, Performance

Additional Key Words and Phrases: Atomic operations, barrier synchronization, contention,
fetch-and-phi, locality, spin locks

1. INTRODUCTION
Techniques for efficiently coordinating parallel computation on MIMD, shared-memory
multiprocessors are of growing interest and importance as the scale of parallel machines
increases. On shared-memory machines, processors communicate by sharing data structures. To
ensure the consistency of shared data structures, processors perform simple operations by using
hardware-supported atomic primitives and coordinate complex operations by using synchronization
constructs and conventions to protect against overlap of conflicting operations.

Synchronization constructs can be divided into two classes: blocking constructs that deschedule
waiting processes and busy-wait constructs in which processes repeatedly test shared variables
to determine when they may proceed. Busy-wait synchronization is fundamental to parallel
programming on shared-memory multiprocessors and is preferred over scheduler-based blocking when
scheduling overhead exceeds expected wait time, when processor resources are not needed for
other tasks (so that the lower wake-up latency of busy waiting need not be balanced against an
opportunity cost), or when scheduler-based blocking is inappropriate or impossible (for example,
in the kernel of an operating system).

Two of the most widely used busy-wait synchronization constructs are spin locks and barriers.
Spin locks provide a means for achieving mutual exclusion (ensuring that only one processor can
access a particular shared data structure at a time) and are a basic building block for
synchronization constructs with richer semantics, such as semaphores and monitors. Spin locks
are ubiquitously used in the implementation of parallel operating systems and application
programs. Barriers provide a means of ensuring that no processes advance beyond a particular
point in a computation until all have arrived at that point. They are typically used to separate
"phases" of an application program. A barrier might guarantee, for example, that all processes
have finished updating the values in a shared matrix in step t before any processes use the
values as input in step t + 1.
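
To make the busy-wait idea concrete before turning to specific algorithms, the sketch below
shows a naive centralized barrier using C11 atomics. It is not one of the algorithms presented
in this paper; all arriving processors spin on a single shared flag, which is precisely the kind
of hot spot the following sections set out to eliminate. The names, the fixed NPROCS constant,
and the use of <stdatomic.h> are assumptions of the sketch.

#include <stdatomic.h>
#include <stdbool.h>

#define NPROCS 8                       /* number of participating processors (assumed) */

static atomic_int  count = NPROCS;     /* processors that have not yet arrived */
static atomic_bool sense = false;      /* flipped once per barrier episode */

/* Each thread passes its own local_sense variable, initialized to false. */
void central_barrier(bool *local_sense)
{
    *local_sense = !*local_sense;      /* the episode this thread will wait for */
    if (atomic_fetch_sub(&count, 1) == 1) {
        /* last arrival: reset the count and release everyone else */
        atomic_store(&count, NPROCS);
        atomic_store(&sense, *local_sense);
    } else {
        while (atomic_load(&sense) != *local_sense)
            ;                          /* busy-wait on the single shared flag */
    }
}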
The performance of locks and barriers is a topic of great importance. Spin locks are generally
employed to protect very small critical sections, and may
be executed an enormous number of times in the course of a computation. Barriers, likewise, are
frequently used between brief phases of data-parallel algorithms (e.g., successive relaxation),
and may be a major contributor to run time. Unfortunately, typical implementations of
busy-waiting tend to produce large amounts of memory and interconnection network contention,
which causes performance bottlenecks that become markedly more pronounced in larger machines and
applications. As a consequence, the overhead of busy-wait synchronization is widely regarded as
a serious performance problem [2, 6, 11, 14, 38, 49, 51].

When many processors busy-wait on a single synchronization variable, they create a hot spot that
is the target of a disproportionate share of the network traffic. Pfister and Norton [38] showed
that the presence of hot spots can severely degrade performance for all traffic in multistage
interconnection networks, not just traffic due to synchronizing processors. As part of a larger
study, Agarwal and Cherian [2] investigated the impact of synchronization on overall program
performance. Their simulations of benchmarks on a cache-coherent multiprocessor indicate that
memory references due to synchronization cause cache line invalidations much more often than
nonsynchronization references. In simulations of the benchmarks on a 64-processor "dance hall"
machine (in which each access to a shared variable traverses the processor-memory
interconnection network), they observed that synchronization accounted for as much as 49 percent
of total network traffic.

In response to performance concerns, the history of synchronization techniques has displayed a
trend toward increasing hardware support. Early algorithms assumed only the ability to read and
write individual memory locations atomically. They tended to be subtle, and costly in time and
space, requiring both a large number of shared variables and a large number of operations to
coordinate concurrent invocations of synchronization primitives [12, 25, 26, 36, 40]. Modern
multiprocessors generally include more sophisticated atomic operations, permitting simpler and
faster coordination strategies. Particularly common are various fetch_and_Φ operations [22],
which atomically read, modify, and write a memory location. Fetch_and_Φ operations include
test_and_set, fetch_and_store (swap), fetch_and_add, and compare_and_swap.¹

¹Fetch_and_store exchanges a register with memory. Compare_and_swap compares the contents of a
memory location against a given value, and sets a condition code to indicate whether they are
equal. If so, it replaces the contents of the memory location with a second given value.

More recently, there have been proposals for multistage interconnection networks that combine
concurrent accesses to the same memory location [17, 37, 42], multistage networks that have
special synchronization variables embedded in each stage of the network [21], and
special-purpose cache hardware to maintain a queue of processors waiting for the same lock
[14, 28, 35]. The principal purpose of these hardware primitives is to reduce the impact of busy
waiting. Before adopting them it is worth considering the extent to which software techniques
can achieve a similar result.
For a wide range of shared-memory multiprocessor architectures, we contend that appropriate
design of spin locks and barriers can eliminate all busy-wait contention. Specifically, by
distributing data structures appropriately, we can ensure that each processor spins only on
locally-accessible locations, locations that are not the target of spinning references by any
other processor. All that is required in the way of hardware support is a simple set of
fetch_and_Φ operations and a memory hierarchy in which each processor is able to read some
portion of shared memory without using the interconnection network. On a machine with coherent
caches, processors spin on locations in their caches. On a machine in which shared memory is
distributed (e.g., the BBN Butterfly [8], the IBM RP3 [37], or a shared-memory hypercube [10]),
processors spin only on locations in the local portion of shared memory.
The implication of our work is that efficient synchronization algorithms can be constructed in
software for shared-memory multiprocessors of arbitrary size. Special-purpose synchronization
hardware can offer only a small constant factor of additional performance for mutual exclusion,
and at best a logarithmic factor for barrier synchronization.² In addition, the feasibility and
performance of busy-waiting algorithms with local-only spinning provides a case against "dance
hall" architectures in which shared memory locations are equally far from all processors.
We discuss the implementation of spin locks in Section 2, presenting both existing approaches
and a new algorithm of our own design. In Section 3 we turn to the issue of barrier
synchronization, explaining how existing approaches can be adapted to eliminate spinning on
remote locations, and introducing a new design that achieves both a short critical path and the
theoretical minimum total number of network transactions. We present performance results in
Section 4 for a variety of spin lock and barrier implementations, and discuss the implications
of these results for software and hardware designers. Our conclusions are summarized in
Section 5.

2. SPIN LOCKS

In this section we describe a series of five implementations for a mutual-exclusion spin lock.
The first four appear in the literature in one form or another. The fifth is a novel lock of our
own design. Each lock can be seen as an attempt to eliminate some deficiency in the previous
design. Each assumes a shared-memory environment that includes certain fetch_and_Φ operations.
As noted above, a substantial body of work has also addressed mutual exclusion using more
primitive read and write atomicity; the complexity of the resulting solutions is the principal
motivation for the development of fetch_and_Φ primitives. Other researchers have considered
mutual exclusion in the context of distributed systems [31, 39, 43, 45, 46], but the
characteristics of message passing are different enough from shared memory operations that
solutions do not transfer from one environment to the other.

²Hardware combining can reduce the time to achieve a barrier from O(log P) to O(1) steps if
processors happen to arrive at the barrier simultaneously.
Our pseudo-code notation is meant to be more or less self-explanatory. We have used line breaks
to terminate statements, and indentation to indicate nesting in control constructs. The keyword
shared indicates that a declared variable is to be shared among all processors. The declaration
implies no particular physical location for the variable, but we often specify locations in
comments and/or accompanying text. The keywords processor private indicate that each processor
is to have a separate, independent copy of a declared variable. All of our atomic operations are
written to take as their first argument the address of the memory location to be modified. All
but compare_and_swap take the obvious one additional operand and return the old contents of the
memory location. Compare_and_swap (addr, old, new) is defined as
if addr^ != old return false; addr^ := new; return true. In one case (Figure 3) we have used an
atomic_add operation whose behavior is the same as calling fetch_and_add and discarding the
result.
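
For readers who want to experiment with these primitives on current hardware, the following is a
sketch (our assumption, not part of the paper) of how the pseudo-code's atomic operations can be
expressed with C11 <stdatomic.h>. Each routine takes the address of the location to be modified
as its first argument, matching the convention above; the helper names are hypothetical.

#include <stdatomic.h>
#include <stdbool.h>

/* fetch_and_store (swap): store new_val and return the old contents */
static inline int fetch_and_store(atomic_int *addr, int new_val)
{
    return atomic_exchange(addr, new_val);
}

/* fetch_and_increment: add one and return the old contents */
static inline int fetch_and_increment(atomic_int *addr)
{
    return atomic_fetch_add(addr, 1);
}

/* atomic_add: like fetch_and_add with the result discarded */
static inline void atomic_add(atomic_int *addr, int delta)
{
    atomic_fetch_add(addr, delta);
}

/* compare_and_swap: if *addr = old_val, store new_val and return true;
   otherwise return false (old_val is passed by value, so the caller's
   copy is not modified on failure) */
static inline bool compare_and_swap(atomic_int *addr, int old_val, int new_val)
{
    return atomic_compare_exchange_strong(addr, &old_val, new_val);
}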

2.1. The Simple Test_and_set Lock

The simplest mutual exclusion lock, found in all operating system textbooks and widely used in
practice, employs a polling loop to access a Boolean flag that indicates whether the lock is
held. Each processor repeatedly executes a test_and_set instruction in an attempt to change the
flag from false to true, thereby acquiring the lock. A processor releases the lock by setting it
to false.

The principal shortcoming of the test_and_set lock is contention for the flag. Each waiting
processor accesses the single shared flag as frequently as possible, using relatively expensive
read-modify-write (fetch_and_Φ) instructions. The result is degraded performance not only of the
memory bank in which the lock resides but also of the processor/memory interconnection network
and, in a distributed shared-memory machine, the processor that owns the memory bank (as a
result of stolen bus cycles).
Fetch_and_Φ instructions can be particularly expensive on cache-coherent multiprocessors, since
each execution of such an instruction may cause many remote invalidations. To reduce this
overhead, the test_and_set lock can be modified to use a test_and_set instruction only when a
previous read indicates that the test_and_set might succeed. This so-called
test-and-test_and_set technique [44] ensures that waiting processors poll with read requests
during the time that a lock is held. Once the lock becomes available, some fraction of the
waiting processors detect that the lock is free and perform a test_and_set operation, exactly
one of which succeeds, but each of which causes remote invalidations on a cache-coherent
machine.
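
As a concrete illustration of the technique just described, here is a minimal
test-and-test_and_set lock sketched with C11 atomics. The sketch is ours, not the paper's: the
type and function names are assumptions, and atomic_exchange stands in for the test_and_set
instruction.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool held;            /* initialize to false (lock free) */
} ttas_lock;

void ttas_acquire(ttas_lock *l)
{
    for (;;) {
        /* poll with plain reads while the lock is held */
        while (atomic_load(&l->held))
            ;
        /* the lock looks free: try the expensive atomic operation */
        if (!atomic_exchange(&l->held, true))
            return;              /* old value was false, so we now hold the lock */
    }
}

void ttas_release(ttas_lock *l)
{
    atomic_store(&l->held, false);
}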
The total amount of network traffic caused by busy-waiting on a test_and_set lock can be reduced
further by introducing delay on each processor between consecutive probes of the lock. The
simplest approach employs a constant delay; more elaborate schemes use some sort of backoff on
unsuccessful probes. Anderson [5] reports the best performance with exponential backoff; our
experiments confirm this result. Pseudo-code for a test_and_set lock with exponential backoff
appears in Figure 1.
type lock = (unlocked, locked)

procedure acquire_lock (L : ^lock)
    delay : integer := 1
    while test_and_set (L) = locked        // returns old value
        pause (delay)                      // consume this many units of time
        delay := delay * 2

procedure release_lock (L : ^lock)
    L^ := unlocked

Fig. 1. Simple test_and_set lock with exponential backoff.

Test_and_set suffices when using a backoff scheme; test-and-test_and_set is not necessary.
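
The following is a hedged C11 rendering of the Figure 1 pseudo-code. The pause routine, the unit
of delay, and the cap on the backoff are our assumptions; the paper leaves the delay units
unspecified and does not bound the backoff.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool locked;                  /* initialize to false */
} tas_lock;

static void pause_units(unsigned n)      /* crude delay loop: burn roughly n units */
{
    for (volatile unsigned i = 0; i < n; i++)
        ;
}

void tas_acquire(tas_lock *l)
{
    unsigned delay = 1;
    while (atomic_exchange(&l->locked, true))   /* returns old value */
    {
        pause_units(delay);
        if (delay < (1u << 16))                 /* assumed cap on the backoff */
            delay *= 2;
    }
}

void tas_release(tas_lock *l)
{
    atomic_store(&l->locked, false);
}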

2.2. The Ticket Lock

In a test-and-test_and_set lock, the number of read-modify-write operations is substantially
less than for a simple test_and_set lock but still potentially large. Specifically, it is
possible for every waiting processor to perform a test_and_set operation every time the lock
becomes available, even though only one can actually acquire the lock. We can reduce the number
of fetch_and_Φ operations to one per lock acquisition with what we call a ticket lock. At the
same time, we can ensure FIFO service by granting the lock to processors in the same order in
which they first requested it. A ticket lock is fair in a strong sense; it eliminates the
possibility of starvation.

A ticket lock consists of two counters, one containing the number of requests to acquire the
lock, and the other the number of times the lock has been released. A processor acquires the
lock by performing a fetch_and_increment operation on the request counter and waiting until the
result (its ticket) is equal to the value of the release counter. It releases the lock by
incrementing the release counter. In the terminology of Reed and Kanodia [41], a ticket lock
corresponds to the busy-wait implementation of a semaphore using an eventcount and a sequencer.
It can also be thought of as an optimization of Lamport's bakery lock [24], which was designed
for fault-tolerance rather than performance. Instead of spinning on the release counter,
processors using a bakery lock repeatedly examine the tickets of their peers.

Though it probes with read operations only (and thus avoids the overhead of unnecessary
invalidations in coherent cache machines), the ticket lock still causes substantial memory and
network contention through polling of a common location. As with the test_and_set lock, this
contention can be reduced by introducing delay on each processor between consecutive probes of
the lock. In this case, however, exponential backoff is clearly a bad idea. Since processors
acquire the lock in FIFO order, overshoot in backoff by the first processor in line will delay
all others as well, causing them to back off even farther. Our experiments suggest that a
reasonable delay can be determined by using information not available with a test_and_set lock:
namely, the number of processors already waiting for the lock. The delay can be computed as the
difference between a newly-obtained ticket and the current value of the release counter.
type lock = record
    next_ticket : unsigned integer := 0
    now_serving : unsigned integer := 0

procedure acquire_lock (L : ^lock)
    my_ticket : unsigned integer := fetch_and_increment (&L->next_ticket)
        // returns old value; arithmetic overflow is harmless
    loop
        pause (my_ticket - L->now_serving)
            // consume this many units of time
            // on most machines, subtraction works correctly despite overflow
        if L->now_serving = my_ticket
            return

procedure release_lock (L : ^lock)
    L->now_serving := L->now_serving + 1

Fig. 2. Ticket lock with proportional backoff.

Delaying for an appropriate amount of time requires an estimate of how long it will take each
processor to execute its critical section and pass the lock to its successor. If this time is
known exactly, it is in principle possible to acquire the lock with only two probes, one to
determine the number of processors already in line (if any), and another (if necessary) to
verify that one's predecessor in line has finished with the lock. This sort of accuracy is not
likely in practice, however, since critical sections do not in general take identical, constant
amounts of time. Moreover, delaying proportional to the expected average time to hold the lock
is risky: if the processors already in line average less than the expected amount, the waiting
processor will delay too long and slow the entire system. A more appropriate constant of
proportionality for the delay is the minimum time that a processor can hold the lock.
Pseudo-code for a ticket lock with proportional backoff appears in Figure 2. It assumes that the
next_ticket and now_serving counters are large enough to accommodate the maximum number of
simultaneous requests for the lock.
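
As with Figure 1, the sketch below is a hedged C11 rendering of the Figure 2 pseudo-code. The
BACKOFF_BASE constant (standing in for the minimum time a processor can hold the lock) and the
pause_units routine are assumptions of the sketch.

#include <stdatomic.h>

#define BACKOFF_BASE 32u                 /* assumed "units of time" per waiter ahead of us */

typedef struct {
    atomic_uint next_ticket;             /* number of requests so far; initialize to 0 */
    atomic_uint now_serving;             /* number of releases so far; initialize to 0 */
} ticket_lock;

static void pause_units(unsigned n)      /* crude delay loop: burn roughly n units */
{
    for (volatile unsigned i = 0; i < n; i++)
        ;
}

void ticket_acquire(ticket_lock *l)
{
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);  /* returns old value */
    for (;;) {
        unsigned ns = atomic_load(&l->now_serving);
        if (ns == my_ticket)
            return;
        /* unsigned subtraction is well defined despite wraparound */
        pause_units((my_ticket - ns) * BACKOFF_BASE);
    }
}

void ticket_release(ticket_lock *l)
{
    atomic_fetch_add(&l->now_serving, 1);
}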

2.3. Array-Based Queuing Locks

Even using a ticket lock with proportional backoff, it is not possible to obtain a lock with an
expected constant number of network transactions, due to the unpredictability of the length of
critical sections. Anderson [5] and Graunke and Thakkar [18] have proposed locking algorithms
that achieve the constant bound on cache-coherent multiprocessors that support atomic
fetch_and_increment or fetch_and_store, respectively. The trick is for each processor to use the
atomic operation to obtain the address of a location on which to spin. Each processor spins on a
different location, in a different cache line. Anderson's experiments indicate that his queuing
lock outperforms a test_and_set lock with exponential backoff on the Sequent Symmetry when more
than six processors are competing for access. Graunke and Thakkar's experiments indicate that
their lock outperforms a test_and_set lock on the same machine when more than three processors
are competing.
type lock = record
    slots : array [0..numprocs-1] of (has_lock, must_wait)
        := (has_lock, must_wait, must_wait, ..., must_wait)
        // each element of slots should lie in a different memory module
        // or cache line
    next_slot : integer := 0

// parameter my_place, below, points to a private variable
// in an enclosing scope

procedure acquire_lock (L : ^lock, my_place : ^integer)
    my_place^ := fetch_and_increment (&L->next_slot)
        // returns old value
    if my_place^ mod numprocs = 0
        atomic_add (&L->next_slot, -numprocs)
        // avoid problems with overflow; return value ignored
    my_place^ := my_place^ mod numprocs
    repeat while L->slots[my_place^] = must_wait        // spin
    L->slots[my_place^] := must_wait                    // init for next time

procedure release_lock (L : ^lock, my_place : ^integer)
    L->slots[(my_place^ + 1) mod numprocs] := has_lock

Fig. 3. Anderson's array-based queuing lock.

Pseudo-code for Anderson's lock appears in Figure 3;³ Graunke and Thakkar's lock appears in
Figure 4. The likely explanation for the better performance of Graunke and Thakkar's lock is
that fetch_and_store is supported directly in hardware on the Symmetry; fetch_and_add is not. To
simulate fetch_and_add, Anderson protected his queue-based lock with an outer test_and_set lock.
This outer lock (not shown in Figure 3) introduces additional overhead and can cause contention
when the critical section protected by the queue-based lock is shorter than the time required to
acquire and release the outer lock. In this case competing processors spend their time spinning
on the outer test_and_set lock rather than the queue-based lock.

Neither Anderson nor Graunke and Thakkar included the ticket lock in their experiments. In
qualitative terms both the ticket lock (with proportional backoff) and the array-based queuing
locks guarantee FIFO ordering of requests. Both the ticket lock and Anderson's lock use an
atomic fetch_and_increment instruction. The ticket lock with proportional backoff is likely to
require more network transactions on a cache-coherent multiprocessor, but fewer on a
multiprocessor without coherently cached shared variables. The array-based queuing locks require
space per lock linear in the number of processors, whereas the ticket lock requires only a small
constant amount of space.⁴ We provide quantitative comparisons of the locks' performance in
Section 4.3.
³Anderson's original pseudo-code did not address the issue of overflow, which causes his
algorithm to fail unless numprocs is a power of 2. Our variant of his algorithm addresses this
problem.
type lock = record
    slots : array [0..numprocs-1] of Boolean := true
        // each element of slots should lie in a different memory module
        // or cache line
    tail : record
        who_was_last : ^Boolean := 0
        this_means_locked : Boolean := false
        // this_means_locked is a one-bit quantity.
        // who_was_last points to an element of slots.
        // if all elements lie at even addresses, this tail "record"
        // can be made to fit in one word
processor private vpid : integer    // a unique virtual processor index

procedure acquire_lock (L : ^lock)
    (who_is_ahead_of_me : ^Boolean, what_is_locked : Boolean)
        := fetch_and_store (&L->tail, (&slots[vpid], slots[vpid]))
    repeat while who_is_ahead_of_me^ = what_is_locked

procedure release_lock (L : ^lock)
    L->slots[vpid] := not L->slots[vpid]

Fig. 4. Graunke and Thakkar's array-based queuing lock.
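
To make the array-based approach concrete on present-day hardware, here is a hedged C11 sketch
in the spirit of Anderson's lock (Figure 3). It is ours, not the authors': the type and function
names, the NPROCS and cache-line constants, and the use of <stdatomic.h> are assumptions, and for
brevity the sketch requires NPROCS to be a power of two so that unsigned wraparound of the slot
counter and the modulo operation interact benignly (the general overflow case is handled
explicitly in Figure 3).

#include <stdatomic.h>
#include <stdbool.h>

#define NPROCS 8            /* must be a power of two in this sketch */
#define LINE   64           /* assumed cache-line size in bytes */

typedef struct {
    struct {
        atomic_bool must_wait;
        char pad[LINE - sizeof(atomic_bool)];   /* keep each slot in its own cache line */
    } slots[NPROCS];
    atomic_uint next_slot;
} array_lock;

void array_lock_init(array_lock *l)
{
    atomic_store(&l->next_slot, 0);
    atomic_store(&l->slots[0].must_wait, false);    /* the "has_lock" slot */
    for (unsigned i = 1; i < NPROCS; i++)
        atomic_store(&l->slots[i].must_wait, true);
}

unsigned array_acquire(array_lock *l)   /* returns my_place, needed again on release */
{
    unsigned my_place = atomic_fetch_add(&l->next_slot, 1) % NPROCS;
    while (atomic_load(&l->slots[my_place].must_wait))
        ;                                               /* spin on a private line */
    atomic_store(&l->slots[my_place].must_wait, true);  /* init for next time */
    return my_place;
}

void array_release(array_lock *l, unsigned my_place)
{
    atomic_store(&l->slots[(my_place + 1) % NPROCS].must_wait, false);
}

A caller would obtain a slot with unsigned p = array_acquire(&l), execute its critical section,
and then call array_release(&l, p).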

2.4. A New List-Based Queuing Lock

We have devised a new mechanism called the MCS lock (after our initials) that

-guarantees FIFO ordering of lock acquisitions;
-spins on locally-accessible flag variables only;
-requires a small constant amount of space per lock; and
-works equally well (requiring only O(1) network transactions per lock acquisition) on machines
 with and without coherent caches.

The first of these advantages is shared with the ticket lock and the array-based queuing locks,
but not with the test_and_set lock. The third is shared with the test_and_set and ticket locks,
but not with the array-based queuing locks. The fourth advantage is in large part a consequence
of the second, and is unique to the MCS lock.

Our lock was inspired by the Queue On Lock Bit (QOLB) primitive proposed for the cache
controllers of the Wisconsin Multicube [14], but is implemented entirely in software. It
requires an atomic fetch_and_store (swap) instruction and benefits from the availability of
compare_and_swap. Without compare_and_swap, we lose the guarantee of FIFO ordering and introduce
the theoretical possibility of starvation, though lock acquisitions are likely to remain very
nearly FIFO in practice.

⁴At first glance, one might suspect that the flag bits of the Graunke and Thakkar lock could be
allocated on a per-processor basis, rather than a per-lock basis. Once a processor releases a
lock, however, it cannot use its bit for anything else until some other processor has acquired
the lock it released.
type qnode = record
    next : ^qnode
    locked : Boolean
type lock = ^qnode

// parameter I, below, points to a qnode record allocated
// (in an enclosing scope) in shared memory locally-accessible
// to the invoking processor

procedure acquire_lock (L : ^lock, I : ^qnode)
    I->next := nil
    predecessor : ^qnode := fetch_and_store (L, I)
    if predecessor != nil        // queue was non-empty
        I->locked := true
        predecessor->next := I
        repeat while I->locked   // spin

procedure release_lock (L : ^lock, I : ^qnode)
    if I->next = nil             // no known successor
        if compare_and_swap (L, I, nil)
            return
            // compare_and_swap returns true iff it swapped
        repeat while I->next = nil    // spin
    I->next->locked := false

Fig. 5. The MCS list-based queuing lock.

Pseudo-code for our lock appears in Figure 5. Every processor using the lock allocates a qnode
record containing a queue link and a Boolean flag. Each processor employs one additional
temporary variable during the acquire_lock operation. Processors holding or waiting for the lock
are chained together by the links. Each processor spins on its own locally-accessible flag. The
lock itself contains a pointer to the qnode record for the processor at the tail of the queue,
or nil if the lock is not held. Each processor in the queue holds the address of the record for
the processor behind it: the processor it should resume after acquiring and releasing the lock.
Compare_and_swap enables a processor to determine whether it is the only processor in the queue,
and if so remove itself correctly, as a single atomic action. The spin in acquire_lock waits for
the lock to become free. The spin in release_lock compensates for the timing window between the
fetch_and_store and the assignment to predecessor->next in acquire_lock. Both spins are local.
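
For readers who want to try the algorithm directly, the following is a hedged C11 rendering of
Figure 5. The type and function names are assumptions, and all operations use the default
(sequentially consistent) memory ordering for clarity; a tuned implementation would weaken them.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_qnode {
    struct mcs_qnode *_Atomic next;
    atomic_bool              locked;
} mcs_qnode;

typedef struct {
    mcs_qnode *_Atomic tail;             /* NULL (nil) when the lock is free */
} mcs_lock;

/* I points to a qnode in memory that is locally accessible to the caller */
void mcs_acquire(mcs_lock *l, mcs_qnode *I)
{
    atomic_store(&I->next, NULL);
    mcs_qnode *predecessor = atomic_exchange(&l->tail, I);
    if (predecessor != NULL) {           /* queue was non-empty */
        atomic_store(&I->locked, true);
        atomic_store(&predecessor->next, I);
        while (atomic_load(&I->locked))
            ;                            /* spin on our own flag */
    }
}

void mcs_release(mcs_lock *l, mcs_qnode *I)
{
    mcs_qnode *successor = atomic_load(&I->next);
    if (successor == NULL) {             /* no known successor */
        mcs_qnode *expected = I;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;                      /* we were alone; queue is now empty */
        while ((successor = atomic_load(&I->next)) == NULL)
            ;                            /* wait for the successor to link itself in */
    }
    atomic_store(&successor->locked, false);
}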
Fig. 6. Pictorial example of MCS locking protocol in the presence of competition.

Figure 6, parts (a) through (e), illustrates a series of acquire_lock and release_lock
operations. The lock itself is represented by a box containing an 'L.' The other rectangles are
qnode records. A box with a slash through it represents a nil pointer. Non-nil pointers are
directed arcs. In (a), the lock is free. In (b), processor 1 has acquired the lock. It is
running (indicated by the 'R'), thus its locked flag is irrelevant (indicated by putting the 'R'
in parentheses). In (c), two more processors have entered the queue while the lock is still held
by processor 1. They are blocked spinning on their locked flags (indicated by the 'B's). In (d),
processor 1 has completed and has changed the locked flag of processor 2 so that it is now
running. In (e), processor 2 has completed, and has similarly unblocked processor 3. If no more
processors enter the queue in the immediate future, the lock will return to the situation in (a)
when processor 3 completes its critical section.
Alternative code for the release_lock operation, without compare_and_swap, appears in Figure 7.
Like the code in Figure 5, it spins on processor-specific, locally-accessible memory locations
only, requires constant space per lock, and requires only O(1) network transactions regardless
of whether the machine provides coherent caches. Its disadvantages are extra complexity and the
loss of strict FIFO ordering.

Parts (e) through (h) of Figure 6 illustrate the subtleties of the alternative code for
release_lock. In the original version of the lock, compare_and_swap ensures that updates to the
tail of the queue happen atomically. There are no
procedure release_lock (L : ^lock, I : ^qnode)
    if I->next = nil             // no known successor
        old_t