`
`
`
`
`
`
`
`
`
`
`
`
`
`DECLARATION OF GORDON MACPHERSON
`
`I, Gordon MacPherson, am over twenty-one (21) years of age. I have never been
`convicted of a felony, and I am fully competent to make this declaration. I declare the following
`to be true to the best of my knowledge, information and belief:
`
`1. I am Director, Board Governance & Policy Development of The Institute of Electrical
`and Electronics Engineers, Incorporated (“IEEE”).
`
`2. IEEE is a neutral third party in this dispute.
`
`3. I am not being compensated for this declaration and IEEE is only being reimbursed
`for the cost of the article I am certifying.
`
`4. Among my responsibilities as Director, Board Governance & Policy Development, I
`act as a custodian of certain records for IEEE.
`
`5. I make this declaration based on my personal knowledge and information contained
`in the business records of IEEE.
`
`6. As part of its ordinary course of business, IEEE publishes and makes available
`technical articles and standards. These publications are made available for public
`download through the IEEE digital library, IEEE Xplore.
`
`7. It is the regular practice of IEEE to publish articles and other writings including
`article abstracts and make them available to the public through IEEE Xplore. IEEE
`maintains copies of publications in the ordinary course of its regularly conducted
`activities.
`
`8. The article below has been attached as Exhibit A to this declaration:
`
`
`A. S. Fiske, et al.; “Thread prioritization: a thread scheduling mechanism for
`multiple-context parallel processors”, published in Proceedings of 1995 1st
`IEEE Symposium on High Performance Computer Architecture, date of
`conference January 22-25, 1995.
`
`
`
`
`
`9. I obtained a copy of Exhibit A through IEEE Xplore, where it is maintained in the
`ordinary course of IEEE’s business. Exhibit A is a true and correct copy of the
`Exhibit, as it existed on or about May 11, 2023.
`
`
`445 Hoes Lane Piscataway, NJ 08854
`
`DocuSign Envelope ID: F0A92180-66A9-4C5C-A81C-637CE721AAE6
`
`Realtek Ex. 1012
`Case No. IPR2023-00922
`Page 1 of 15
`
`
`
`
`
`10. The article and abstract from IEEE Xplore shows the date of publication. IEEE
`Xplore populates this information using the metadata associated with the publication.
`
`11. S. Fiske, et al.; “Thread prioritization: a thread scheduling mechanism for multiple-
`context parallel processors”, published in Proceedings of 1995 1st IEEE Symposium
`on High Performance Computer Architecture, date of conference January 22-25,
`1995. Copies of the conference proceedings were made available no later than the
`last day of the conference. The article is currently available for public download from
`the IEEE digital library, IEEE Xplore.
`
`12. I hereby declare that all statements made herein of my own knowledge are true and
`that all statements made on information and belief are believed to be true, and further
`that these statements were made with the knowledge that willful false statements and
`the like are punishable by fine or imprisonment, or both, under 18 U.S.C. § 1001.
`
`I declare under penalty of perjury that the foregoing statements are true and correct.
`
`
`
`
`Executed on:
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`DocuSign Envelope ID: F0A92180-66A9-4C5C-A81C-637CE721AAE6
`
`5/11/2023
`
`Realtek Ex. 1012
`Case No. IPR2023-00922
`Page 2 of 15
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`EXHIBIT A
`
`
`
`DocuSign Envelope ID: F0A92180-66A9-4C5C-A81C-637CE721AAE6
`
`Realtek Ex. 1012
`Case No. IPR2023-00922
`Page 3 of 15
`
`
`
`Thread Prioritization: A Thread Scheduling Mechanism
`for Multiple-Context Parallel Processors*
`
`Stuart Fiske and William J. Dally
`stuart@ai.mit.edu, billd@ai.mit.edu
`Artificial Intelligence Laboratory and Laboratory for Computer Science
`Massachusetts Institute of Technology
`Cambridge, Massachusetts 02139
`
`Abstract
`Multiple-context processors provide register resources
`that allow rapid context switching between several threads
`as a means of tolerating long communication and synchro-
`nization latencies. When scheduling threads on such a pro-
`cessor; we must first decide which threads should have their
`state loaded into the multiple contexts, and second, which
`loaded thread is to execute instructions at any given time.
`In this paper we show that both decisions are important,
`and that incorrect choices can lead to serious performance
`degradation. We propose thread priorithation as a means
`of guiding both levels of scheduling. Each thread has a pri-
`ority that can change dynamically, and that the scheduler
`uses to allocate as many computation resources as possible
`to critical threads. We briejy describe its implementation,
`and we show simulation performance results fora number of
`simple benchmarks in which synchronization performance
`is critical.
`1 Introduction
`Parallel processor performance is critically tied to the
`mechanisms provided for tolerating long latencies that oc-
`cur during remote memory accesses, and processor syn-
`chronization operations. Multiple-context processors [20,
`3, 13, 151 provide multiple register sets to multiplex sev-
`eral threads over a processor pipeline in order to tolerate
`these communication and synchronization latencies. Mul-
`tiple register sets, including multiple instruction pointers,
`allow the state of multiple threads to be loaded and ready
`to run at the same time. Each time the currently executing
`thread misses in the cache or fails a synchronization test,
`the processor can begin executing one of the other threads
`loaded in one of the other hardware contexts.
`For a multiple-context processor as shown in Figure 1,
`there are both loaded and unloaded threads. A thread is
`loaded if its register state is in one of the hardware con-
`texts, and unloaded otherwise. Unloaded threads wait to
`'The research described in this paper was supported by the Advanced
`Research Rojects Agency under ARPA order number 8272. and moni-
`tored by the Air Force Electronic Systems Division under contract number
`F19628-92-C-0045.
`
`Pipeline
`
`-
`
`~
`
`1
`
`
`
`Multiple-context processor
`
`Thread queue in main memory
`
`Figure 1: Multiple-context processor with N contexts.
`
`be loaded in a software scheduling queue. To allow a tra-
`ditional RISC pipeline design, we assume a block multi-
`threading model [23,3], in which blocks of instructions are
`executed from each context in turn, rather than a cycle-by-
`cycle interleaving of instructions from the different con-
`texts [20, 15, 131. At any given time, the processor is
`executing one of the loaded threads. A context switch oc-
`curs when the processor switches from executing one loaded
`thread, to executing another loaded thread, an operation that
`can be done in 1 to 20 cycles, depending on the processor
`design. A thread swap involves swapping a loaded thread
`with an unloaded thread from the software queue. The cost
`of a thread swap is one to two orders of magnitude greater
`than a context switch, because it involves saving and restor-
`ing register state, and manipulating the thread scheduling
`queue.
`The scheduling problem on a multiple-context proces-
`sor involves two components. First, we must decide which
`threads should be loaded in the contexts. Second, in the
`case that there are multiple loaded threads, we must decide
`which one is executing at any given time. In this paper we
`
`0.8186-6445-2195 $04 no c 1995 IEEE
`
`210
`
`Authorized licensed use limited to: IEEE Staff. Downloaded on May 11,2023 at 12:52:42 UTC from IEEE Xplore. Restrictions apply.
`
`Realtek Ex. 1012
`Case No. IPR2023-00922
`Page 4 of 15
`
`
`
`show that it is important to correctly make both types of
`scheduling decisions. If a critical thread is not loaded, then
`no progress can be made along the critical path, and runtime
`performance suffers. If the critical threads are loaded along
`with other non-critical threads and the scheduler treats all
`the loaded threads as equal, then runtime performance suf-
`fers. Time devoted to the non-critical loaded threads could
`potentially be devoted to the critical threads.
`Thread prioritization is a simple means of guiding the
`scheduling of threads on a node. Each thread has a priority
`that indicates the importance of the thread in the overall
`problem. The software scheduler on each node chooses the
`highest priority threads as the loaded threads. On a context
`switch, the hardware scheduler chooses the loaded thread
`with the highest priority as the next thread using simple
`hardware, in order to minimize context switch overhead.
`The goal is to devote as many of the processor resources as
`possible to the tasks that are known to be critical to overall
`performance.
`This paper examines a number of benchmarks that show
`the effects of prioritizingat both levels of scheduling. These
`benchmarks evaluate the performance of barrier synchro-
`nization, queue locks, and fine-grain synchronization. Our
`experiments vary the number of threads and contexts per
`processor. When threads are prioritized, the performance
`of the barrier benchmark improves by up to a factor of 2,
`and the performance of the queue lock benchmark improves
`by up to a factor of 7. For the fine-grain synchronization
`benchmark, performance was improved by up to 26%.
`This paper is organized as follows. Section 2 describes
`thread prioritization and outlines some of the implementa-
`tion details and costs. Section 3 briefly outlines the simula-
`tion environment and assumptions, and Section 4 presents
`the results from a number of simple scheduling experiments.
`Section 5 describes related work, while Section 6 concludes
`the paper and discusses future work.
`
`2 Thread Prioritization
`Thread prioritization involves assigning a priority to all
`the different threads in an application, and then using this
`priority to make thread scheduling decisions. The priority
`reflects the importance of a single thread to the completion
`of a singleapplication. The thread scheduler uses the thread
`priority in a very different way than process scheduling in
`UNIX for instance, where the goal is to achieve good in-
`teractive performance and time sharing between competing
`processes [16]. In our case, the goal is to identify as ex-
`actly as possible a relative order in which threads should
`be run, and devote as many resources as possible to the
`most important threads. Also, the granularity of scheduling
`is much different: in our case the priority is used to make
`scheduling decisions on every hardware context switch in a
`multiple-context processor.
`
`2.1 Priority Thread Scheduling
`Consider an application that consists of a set T of threads
`on each processor, where each processor has C contexts.
`Each thread t; E T has a priority P;, with a higher value
`of Pi indicating a higher thread priority. The hardware and
`software schedulers use the priority to do the scheduling.
`First, the software scheduler uses the priority to decide
`which threads are loaded. Specifically, it chooses a set
`TL of threads to load into the C contexts, and a set TU of
`unloaded threads to remain in a software scheduling queue.
`The scheduler chooses the loaded threads such that PI 2 P,
`for all ti E TL and tu E Tu. Threads of equal priority are
`scheduled in round-robin fashion.
`Second, at each context switch the hardware scheduler
`uses the priority to determine which loaded thread to ex-
`ecute. The scheduler chooses a thread t, E TL such that
`P, = m a + { A } for all tr E TL. If a thread is waiting for
`a memory reference to be satisfied then it is stalled and is
`not considered for scheduling until the memory reference is
`satisfied. If several loaded threads have the same priority,
`then these threads are chosen in round-robin fashion. A
`context switch can occur on a cache miss, on a failed syn-
`chronization test, or on a change of priority of one of the
`threads on the processor. Each change in priority results
`in a re-evaluation of Tu, TL, and t,.
`In this sense, the
`scheduling is preemptive.
`Note as well that thread scheduling as defined here is
`purely a local operation. Each processor has its own set
`of threads, and schedules only these. We do not consider
`dynamic load balancing issues in this paper.
`2.2 Assigning Thread Priorities
`In our benchmarks the user explicitly assigns a priority
`to each thread, and changes this priority as the algorithm
`requires. Although initially the use of thread prioritiza-
`tion is likely to be limited to special runtime libraries (e.g.
`synchronization primitives) and user-available program di-
`rectives, we expect that it will eventually be possible have
`a compiler assign priorities to threads automatically. Au-
`tomatic thread prioritization is particularly straightforward
`when the program can be described as a well defined DAG
`(Directed Acyclic Graph) that can be analyzed and used to
`assign the priorities.
`Prioritizing threads incorrectly can lead to a number of
`deadlock situations. Specifically, if thread A is waiting for
`another thread B to complete some operation, and thread B
`has a low priority that does not allow it to be loaded, then
`deadlock results. Thus the priorities assigned to threads
`must respect the dependencies of the computation.
`One way of avoiding deadlock is to guarantee some sort
`of fairness in the scheduling. If all threads are guaranteed to
`run some amount of time despite their priorities, then we can
`guarantee that deadlock will not result. However, as will be
`
`211
`
`Authorized licensed use limited to: IEEE Staff. Downloaded on May 11,2023 at 12:52:42 UTC from IEEE Xplore. Restrictions apply.
`
`Realtek Ex. 1012
`Case No. IPR2023-00922
`Page 5 of 15
`
`
`
`shown in the examples, doing fair scheduling without regard
`to priority, or not specifying the priority of threads as exactly
`as they could be, can lead to a serious performance penalty.
`Thus, both the hardware and the software schedulers assume
`that the prioritizing of the threads is correct and deadlock
`free. Between threads of the same priority scheduling is
`fair.
`2.3 Hardware Support for Context Switching
`The context switch time is the time spent in switching
`between two active contexts, and is an important parame-
`ter in determining the efficiency of context switching for
`tolerating latency. In order to effectively tolerate latency,
`the total context switch time in a multiple context processor
`must be small [l, 231.
`The context switch time consists of a number of compo-
`nents. For instance, in the APRIL processor [3] the time
`required is 11 cycles, which is used to drain the processor
`pipeline (5 cycles), and execute a trap handler which saves
`state, and chooses the next context to execute using a round
`robin scheme (6 cycles). Duplicating instruction pointers
`and the processor status word for each context would reduce
`this time to just the cost of draining the pipeline, and more
`complicated processor designs could reduce this time fur-
`ther,possiblytoaslittleasasinglecycle[13, 151. Inamore
`software oriented approach, Waldspurger's flexible register
`relocation scheme [22] minimizes software context switch-
`ing cost by allocating registers in each context to maintain
`an active thread data structure. Choosing the next context
`takes 4 to 6 instructions for round-robin scheduling.
`When priority scheduling the active threads,choosing the
`next context on a context switch becomes more complicated,
`and if done in software can potentially increase the context
`switch time well beyond the minimum cost of draining the
`pipeline. For our simulations, we assume hardware support
`for choosing the next context, as shown in Figure 2. Each
`context has a special register, the priority register, into which
`the thread priority is loaded. The priorities of all the contexts
`feed into the context selection circuit that selects the next
`context on a context switch. When several threads have the
`same priority, the hardware scheduler chooses them in round
`robin fashion. Stalling can be incorporated into the active
`thread scheduling scheme by having a stall bit for each
`context indicating if it is waiting for a memory reference.
`3 Simulation Parameters
`3.1 Simulation Environment
`Simulation experiments were run using Proteus [5], a
`simulator for MIMD computer architectures. It allows ar-
`chitectural features and parameters associated with the net-
`work, the memory system, and the processor to be varied.
`Programs are written in C with language extensions for con-
`currency. Simulator function calls support non-local inter-
`
`Figure 2: Priority context selection logic.
`
`actions between processors such as shared-memory opera-
`tions (including a complete set of atomic read-modify-write
`operations), inter-processor interrupts, and message pass-
`ing. Proteus also provides a basic runtime system written in
`C.
`
`One important assumption made by Proteus in order to
`make simulation tractable is that only the references to ad-
`dresses that have been explicitly declaredas shared are simu-
`lated in detail. That is, it is assumed that all local instruction
`and data references (to a thread's stack for instance) hit in
`the cache. This assumption has been found to be a reason-
`able approximation of the case where every single memory
`reference is simulated through the cache [5,9] since the hit
`rates for instructions and local data are typically very high.
`3.2 System Parameters
`The basic system consists of a collection of multiple-
`context processing nodes connected by a high speed inter-
`connection network. Each processing node has a cache and
`some portion of the global memory. We use a 2-dimensional
`torus type network that uses wormhole routing. Each pro-
`cessor has both a shared memory interface, as well as an effi-
`cient message passing interface with high priority interrupts
`that is used in a number of the benchmarks. A variation of
`the shared-memory, directory-based cache coherence proto-
`col described by Chaiken [6] is used to maintain a consistent
`view of memory.
`The system parameters that are kept constant across the
`different simulations are shown in Table 1. They were
`chosen to represent a processor similar to the MIT Alewife
`machine [2,3], with modifications that reflect the increasing
`ratio of processor speed to memory speed, and that allow
`faster hardware context switching.
`The local memory latency is assumed to be 20 cycles.
`This corresponds to the time to fill a cache line from the
`local memory when there is a miss in the cache, the value
`is in the local node memory, and no coherency protocol
`messages have to be sent before returning the data. If the
`data is not in local memory, then the time for the response is
`
`212
`
`Authorized licensed use limited to: IEEE Staff. Downloaded on May 11,2023 at 12:52:42 UTC from IEEE Xplore. Restrictions apply.
`
`Realtek Ex. 1012
`Case No. IPR2023-00922
`Page 6 of 15
`
`
`
`Parameter
`Cache Latency
`Local Memory Latency
`Memory Bandwidth
`Hardware Context Switch Time
`Time to Unload Registers
`Time to Reload Registers
`Network Data Transfer Size
`Network Wire Delay
`Network Switch Delay
`Network Input Bandwidth
`Network Output Bandwidth
`
`Value
`I cycle
`20 cycles
`I wordcycle
`5 cycles
`32 cycles
`32 cycles
`0.5 word
`1 cycle
`1 cycle
`0.5 wordcycle
`0.5 wordlcycle
`
`Table 1: Important system parameters.
`
`variable and depends on factors such as the network traffic,
`and the number of messages that have to be sent to satisfy the
`protocol. The memory bandwidth available is 1 wordcycle.
`The 5 cycle hardware context switch time is the cost
`of draining the pipeline on a context switch. We assume
`that this is the only cost of context switching, and that no
`additional instructions have to be executed.
`The cost of a thread swap is the time to save the regis-
`ters of the first thread, perform various operations on the
`scheduling data structures, and restore the registers of the
`next thread [17]. We assume that there are 32 registers in
`each context, and that there is a single cycle cache read and
`write hit time. The cost of the scheduling and descheduling
`is modeled directly by the cost of the runtime scheduler.
`The total swapping cost is on the order of 1 I5 to 150 cycles,
`which assumes all references hit in the cache. This swap
`time is significant in comparison to the hardware context
`switch time.
`Finally, the network transfers data one half word at a time,
`with a switch delay of 1 cycle, and a point-to-point wire
`delay of 1 cycle. The network input and output bandwidths
`are each 0.5 wordcycle.
`
`4 Experimental Results
`In this section we present the results from three simple
`benchmarks. These experiments concentrate on improv-
`ing synchronization performance, which is crucial to the
`performance of many parallel applications. The first two
`benchmarks are synthetic benchmarks which look at the
`performance of a combining tree barrier, and of a mutual-
`exclusion queue lock. The third benchmark is an imple-
`mentation of Lower-Upper Decomposition (LUD) that uses
`fine-grain synchronization in the form of a FulIEmpty tag
`bit associated with each memory location.
`For each benchmark we consider three different scenar-
`ios:
`
`1. SINGLE: There are several threads, but there is only
`one context so that only a single thread is loaded at a
`time.
`
`2. ALL: There are sufficient contexts so that all threads
`created can be loaded. We use 16 contexts in our
`simulations.
`
`3. LIMITED: There are several contexts, but there are
`potentially more threads than contexts so that only a
`limited number of the available threads are loaded. We
`use 4 contexts in our simulations.
`
`The SINGLE and LIMITED cases represent situations
`in which not all threads can be loaded at the same time.
`These cases can arise in the context of data-dependent thread
`spawning, runtime dynamic partitioning, or in a multipro-
`gramming environment. The ALL case is balanced in the
`Sense that all threads can be loaded at once. The first case
`emphasizes the type of scheduling which is required for
`threads that are not currently loaded. The second case em-
`phasizes the scheduling required for loaded threads. The
`last case combines the two problems to study what must be
`done to schedule both loaded and unloaded threads.
`4.1 Barrier Synchronization
`The first benchmark is a barrier benchmark using a shared
`memory combiningtree [25,18]. In this benchmark, a num-
`ber of threads are spawned on each processor, and these
`threads repeatedly perform a barrier synchronization. The
`first level of the combining tree has a fan-in equal to the
`number of threads on each processor'. The threads on each
`processor first perform a local combine, and then the last
`thread to combine on each local processor participates in a
`global barrier using a radix4 combining tree. The simula-
`tion uses 64 processors, with a large fully associative cache,
`so that only cache invalidation traffic affects performance.
`Prioritization is simple, and is done as follows. When a
`thread arrives at the barrier and it is not the first thread in the
`leafgroup, it decreases itspriority in preparation for thenext
`phase ofthe computation, and begins to spin. The last thread
`to arrive at a leaf n d e maintains its priority, and proceeds
`up the combining tree. Thus on each node, only the thread
`that is participating in the non-local barrier tree is using
`any cycles - it can either be spinning at an intermediate
`node of the combining tree, or it can be proceeding up
`or down the combining tree. Once a thread going back
`down the combining tree reaches the leaves of the tree, it
`decreases its priority to the priority of the other spinning leaf
`threads, and they can all proceed to the next phase of the
`computation. Note that the prioritization required for other
`tree-like barriers including tournament barriers and MCS
`
`'If there is only a single thread per processor. then this first level is
`eliminated.
`
`213
`
`Authorized licensed use limited to: IEEE Staff. Downloaded on May 11,2023 at 12:52:42 UTC from IEEE Xplore. Restrictions apply.
`
`Realtek Ex. 1012
`Case No. IPR2023-00922
`Page 7 of 15
`
`
`
`tree barriers [ 181, is qualitatively similar to the prioritization
`of the combining tree barrier.
`
`4.1.1 Results
`Figure 3 shows the average barrier wait times for the three.
`different scenarios, where the barrier wait time is the time
`spent by each thread waiting at the barrier. Each case is
`discussed below.
`SINGLE: With unprioritized threads, performance of
`the barrier decreases as the number of threads increases due
`to two factors. First, each thread that participates in the
`barrier must be swapped into the context in order to reach
`the barrier. Second, when a thread that is spinning at an in-
`termediate node of the combining tree does an unsuccessful
`poll, the scheduler swaps out this thread, and successively
`load in all the other spinning threads on the local node. It
`does this because it does not differentiate between the lo-
`cally spinning threads and thus treats them all fairly. In the
`prioritized case, the time to perform the barrier increases due
`to the larger number of threads, but once the local barrier has
`been completed and one thread has been chosen to represent
`the node in the global barrier, this thread is never swapped
`out regardless of how often the polling is unsuccessful. As
`a result, the second component which contributed to poor
`performance in the unprioritized case is eliminated. For 4
`threads per processor performance improves by 11%, and
`for 16 threads per processor performance improves 41%.
`ALL: The unprioritized ALL scenario suffers from a
`similar problem to the unprioritized SINGLE scenario, ex-
`cept that no thread swapping is necessary since all threads
`are loaded, only context switching. Although a context
`switch is much cheaper than a full thread swap, the context
`switchs happen more often in the ALL case than thread
`swaps in the SINGLE case because they occur not only on
`failed synchronization tests, but also on cache misses. Each
`time the thread participating in the global barrier misses in
`the cache or does an unsuccessful polling operation, the pro-
`cessor runs through all the other contexts before returning
`to the critical context. It is important to note that the time to
`switch between the contexts is more than simply the number
`of cycles to switch between hardware contexts, in this case 5
`cycles. This is because once the actual context switch takes
`place, the new thread issues some number of instructions,
`until it either misses in the cache, or tests its flag unsuc-
`cessfully and context switches. The prioritized scheduling
`eliminates unnecessary context switching during the global
`barrier with performance improving by 23% for 4 threads,
`and by 47% for 16 threads.
`LIMITED: The case of having more threads than con-
`texts with multiple contexts can potentially suffer from
`the worst of both the SINGLE, and the ALL scenarios.
`With unprioritized threads, each time a non-critical spin-
`
`214
`
`O l " " " ~ " " " " ~
`0 1 2 3 4 5 6 7 8 9 10111213141516
`
`0 1 ' " " " " " " " '
`0 1 2 3 4 5 6 7 8 9 10111213141516
`mmspwp-w
`
`b. ALL
`
`14wO
`
`-
`i 12000-
`
`1ww-
`
`8 0 0 0 -
`-
`6000
`-
`-
`2000
`
`4000
`
`
`"
`"
`"
`"
`' ~ "
`"
`"
`O I
`0 1 2 3 4 5 6 7 8 9 10111213141516
`
`Figure 3: Average barrier wait time for 64 Processors. SIN-
`GLE, ALL, and LIMITED scenarios.
`
`Authorized licensed use limited to: IEEE Staff. Downloaded on May 11,2023 at 12:52:42 UTC from IEEE Xplore. Restrictions apply.
`
`Realtek Ex. 1012
`Case No. IPR2023-00922
`Page 8 of 15
`
`
`
`ning thread runs, it not only checks its flag but also does
`a thread swap if there are other threads on the scheduling
`queue. Thus the time between when the critical thread con-
`text switches to the time it is again the executing thread is
`increased by the time to run through all the other loaded
`threads, where each is checking its flag and then swap-
`ping itself with some other spinning thread on the software
`scheduling queue. This in effect represents a worst case
`scenario in terms of the amount of thread swapping that
`is done. Figure 3c also shows the case when the thread
`priorities are used only by the software scheduler, and not
`the hardware scheduler. In this case the hardware scheduler
`does round-robin scheduling of the loaded threads rather
`than priority scheduling and thus does some amount of un-
`necessary context switching. Wlth 16 threads, software
`thread prioritization without hardware thread prioritization
`reduces the barrier wait time by 47%. whereas doing both
`software and hardware thread prioritization reduces the wait
`time by 57%.
`These three scenarios show that the prioritizing of both
`loaded and unloaded threads is important for the perfor-
`mance of the barrier synchronization. Prioritizing unloaded
`threads in the thread queue is important because it guaran-
`tees that threads that still have to participate in the barrier
`are loaded, and it eliminates unnecessq thread swapping.
`Prioritizing the loaded threads themselves guarantees that
`lower priority spinning threads do not steal cycles from the
`higher priority threads, so that a critical thread can proceed
`as soon as it is able.
`4.2 Queue Locks
`A queue lock is a mutual exclusion mechanism that is
`appropriate for high contention locks [18,4]. Each thread
`inserts itself onto a queue, and then spins on its own flag so
`as to not generate the hot spots and excess network traffic
`that can be generated by simpler Test-and-Set style locks.
`Our implementation of queue locks is inspired by the MCS
`lock of Mellor-Crummey and Scott [ 181. The MCS lock has
`the advantage that the flag on which each thread spins while
`waiting in the queue is locally allocated and generates no
`global traffic while spinning.
`Scheduling threads waiting for a lock raises a number
`of issues. It is clear that once a lock is acquired the thread
`owning the lock should not be swapped out [26, 171. This
`is because all other threads waiting for the lock will be
`unable to make progress until the lock is released, and so
`performance can be seriously degraded. In the case of the
`queue lock there is an additional factor to be considered:
`the order in which threads are going to acquire the lock
`is determined by the order in which threads are inserted
`into the queue. The priority of a spinning thread should be
`determined by the position of the thread in the queue, so that
`when there are multiple spinning threads, the one earlier in
`the queue will be given priority.
`
`The synthetic queue lock benchmark consists of an equal
`number of threads on each processor, each trying to obtain a
`lock. Each thread repeatedly obtains the lock, runs a critical
`section, releases the lock, and then runs a non-critical sec-
`tion. Both a high contention and a low contention case are
`run. In the high contention case, the critical section is about
`100 cycles, and the non-critical section varies between 50
`and 150 cycles, with a uniform distribution. In the low
`contention case, the critical section is the same, but the non-
`criticalsection varies between 5000and 15000cycles witha
`uniform distribution. When executing the non-critical sec-
`tion, a context switch is forced about every 40 cycles to
`simulate a cache miss. The total number of locks acquired
`over a sample period of lo6 cycles was taken as the fig-
`ure of merit. It should be noted that the latency tolerance
`properties of having multiple threads, as well as possible
`cache interference between threads are not measured in this
`benchmark, but rather only the effects of the thread schedul-
`ing. The simulation uses 16 processors, with a large fully
`associative cache so that only invalidation traffic occurs.
`The following variations on the queue lock benchmark
`were run:
`
`Priorityl: the threads repeatedly context switch when
`they poll their local variable and determine that they
`are not next in line in the queue. If there are more
`threads than contexts, then there is also a thread swap
`with an unloaded thread. A thread that acquires the
`lock has high priority so that it is not swapped out, and
`when it releases the lock it decreases its priority.
`
`Priority2: If a thread owns the lock or is trying to
`determine its position in the queue, it has the highest
`priority. If the thread is trying to insert itself into the
`queue, it has the next highest priorityz. Threads that
`are spinning in the queue have a priority based on their
`position in the queue3. Thus whenever a processor has
`several threads waiting