throbber
DECLARATION OF GORDON MACPHERSON
`
`I, Gordon MacPherson, am over twenty-one (21) years of age. I have never been
`convicted of a felony, and I am fully competent to make this declaration. I declare the following
`to be true to the best of my knowledge, information and belief:
`
`1.
`
`I am Director, Board Governance & Policy Development of The Institute of Electrical
`and Electronics Engineers, Incorporated (“IEEE”).
`
`2.
`
`IEEE is a neutral third party in this dispute.
`
`3.
`
`I am not being compensated for this declaration and IEEE is only being reimbursed
`for the cost of the article I am certifying.
`
`4. Among my responsibilities as Director, Board Governance & Policy Development, I
`act as a custodian of certain records for IEEE.
`
`5.
`
`I make this declaration based on my personal knowledge and information contained
`in the business records of IEEE.
`
`6. As part of its ordinary course of business, IEEE publishes and makes available
`technical articles and standards. These publications are made available for public
`download through the IEEE digital library, IEEE Xplore.
`
`7.
`
`It is the regular practice of IEEE to publish articles and other writings including
`article abstracts and make them available to the public through IEEE Xplore. IEEE
`maintains copies of publications in the ordinary course of its regularly conducted
`activities.
`
`8. The article below has been attached as Exhibit A to this declaration:
`
`A. S. Fiske, et al.; “Thread prioritization: a thread scheduling mechanism for
`multiple-context parallel processors”, published in Proceedings of 1995 1st
`IEEE Symposium on High Performance Computer Architecture, date of
`conference January 22-25, 1995.
`
`9.
`
`I obtained a copy of Exhibit A through IEEE Xplore, where it is maintained in the
`ordinary course of IEEE’s business. Exhibit A is a true and correct copy of the
`Exhibit, as it existed on or about May 11, 2023.
`
`445 Hoes Lane Piscataway, NJ 08854
`
`DocuSign Envelope ID: F0A92180-66A9-4C5C-A81C-637CE721AAE6
`
`TCL 1012
`
`

`

`10. The article and abstract from IEEE Xplore shows the date of publication. IEEE
`Xplore populates this information using the metadata associated with the publication.
`
`11. S. Fiske, et al.; “Thread prioritization: a thread scheduling mechanism for multiple-
`context parallel processors”, published in Proceedings of 1995 1st IEEE Symposium
`on High Performance Computer Architecture, date of conference January 22-25,
`1995. Copies of the conference proceedings were made available no later than the
`last day of the conference. The article is currently available for public download from
`the IEEE digital library, IEEE Xplore.
`
`12. I hereby declare that all statements made herein of my own knowledge are true and
`that all statements made on information and belief are believed to be true, and further
`that these statements were made with the knowledge that willful false statements and
`the like are punishable by fine or imprisonment, or both, under 18 U.S.C. § 1001.
`
`I declare under penalty of perjury that the foregoing statements are true and correct.
`
`Executed on:
`
`DocuSign Envelope ID: F0A92180-66A9-4C5C-A81C-637CE721AAE6
`
`5/11/2023
`
`TCL 1012
`
`

`

`EXHIBIT A
`
`DocuSign Envelope ID: F0A92180-66A9-4C5C-A81C-637CE721AAE6
`
`TCL 1012
`
`

`

`Thread Prioritization: A Thread Scheduling Mechanism
`for Multiple-Context Parallel Processors*
`
`Stuart Fiske and William J. Dally
`stuart@ai.mit.edu, billd@ai.mit.edu
`Artificial Intelligence Laboratory and Laboratory for Computer Science
`Massachusetts Institute of Technology
`Cambridge, Massachusetts 02139
`
`Abstract
`Multiple-context processors provide register resources
`that allow rapid context switching between several threads
`as a means of tolerating long communication and synchro-
`nization latencies. When scheduling threads on such a pro-
`cessor; we must first decide which threads should have their
`state loaded into the multiple contexts, and second, which
`loaded thread is to execute instructions at any given time.
`In this paper we show that both decisions are important,
`and that incorrect choices can lead to serious performance
`degradation. We propose thread priorithation as a means
`of guiding both levels of scheduling. Each thread has a pri-
`ority that can change dynamically, and that the scheduler
`uses to allocate as many computation resources as possible
`to critical threads. We briejy describe its implementation,
`and we show simulation performance results fora number of
`simple benchmarks in which synchronization performance
`is critical.
`1 Introduction
`Parallel processor performance is critically tied to the
`mechanisms provided for tolerating long latencies that oc-
`cur during remote memory accesses, and processor syn-
`chronization operations. Multiple-context processors [20,
`3, 13, 151 provide multiple register sets to multiplex sev-
`eral threads over a processor pipeline in order to tolerate
`these communication and synchronization latencies. Mul-
`tiple register sets, including multiple instruction pointers,
`allow the state of multiple threads to be loaded and ready
`to run at the same time. Each time the currently executing
`thread misses in the cache or fails a synchronization test,
`the processor can begin executing one of the other threads
`loaded in one of the other hardware contexts.
`For a multiple-context processor as shown in Figure 1,
`there are both loaded and unloaded threads. A thread is
`loaded if its register state is in one of the hardware con-
`texts, and unloaded otherwise. Unloaded threads wait to
`'The research described in this paper was supported by the Advanced
`Research Rojects Agency under ARPA order number 8272. and moni-
`tored by the Air Force Electronic Systems Division under contract number
`F19628-92-C-0045.
`
`Pipeline
`
`-
`
`~
`
`1
`
`
`
`Multiple-context processor
`
`Thread queue in main memory
`
`Figure 1: Multiple-context processor with N contexts.
`
`be loaded in a software scheduling queue. To allow a tra-
`ditional RISC pipeline design, we assume a block multi-
`threading model [23,3], in which blocks of instructions are
`executed from each context in turn, rather than a cycle-by-
`cycle interleaving of instructions from the different con-
`texts [20, 15, 131. At any given time, the processor is
`executing one of the loaded threads. A context switch oc-
`curs when the processor switches from executing one loaded
`thread, to executing another loaded thread, an operation that
`can be done in 1 to 20 cycles, depending on the processor
`design. A thread swap involves swapping a loaded thread
`with an unloaded thread from the software queue. The cost
`of a thread swap is one to two orders of magnitude greater
`than a context switch, because it involves saving and restor-
`ing register state, and manipulating the thread scheduling
`queue.
`The scheduling problem on a multiple-context proces-
`sor involves two components. First, we must decide which
`threads should be loaded in the contexts. Second, in the
`case that there are multiple loaded threads, we must decide
`which one is executing at any given time. In this paper we
`
`0.8186-6445-2195 $04 no c 1995 IEEE
`
`210
`
`Authorized licensed use limited to: IEEE Staff. Downloaded on May 11,2023 at 12:52:42 UTC from IEEE Xplore. Restrictions apply.
`
`TCL 1012
`
`

`

`show that it is important to correctly make both types of
`scheduling decisions. If a critical thread is not loaded, then
`no progress can be made along the critical path, and runtime
`performance suffers. If the critical threads are loaded along
`with other non-critical threads and the scheduler treats all
`the loaded threads as equal, then runtime performance suf-
`fers. Time devoted to the non-critical loaded threads could
`potentially be devoted to the critical threads.
`Thread prioritization is a simple means of guiding the
`scheduling of threads on a node. Each thread has a priority
`that indicates the importance of the thread in the overall
`problem. The software scheduler on each node chooses the
`highest priority threads as the loaded threads. On a context
`switch, the hardware scheduler chooses the loaded thread
`with the highest priority as the next thread using simple
`hardware, in order to minimize context switch overhead.
`The goal is to devote as many of the processor resources as
`possible to the tasks that are known to be critical to overall
`performance.
`This paper examines a number of benchmarks that show
`the effects of prioritizingat both levels of scheduling. These
`benchmarks evaluate the performance of barrier synchro-
`nization, queue locks, and fine-grain synchronization. Our
`experiments vary the number of threads and contexts per
`processor. When threads are prioritized, the performance
`of the barrier benchmark improves by up to a factor of 2,
`and the performance of the queue lock benchmark improves
`by up to a factor of 7. For the fine-grain synchronization
`benchmark, performance was improved by up to 26%.
`This paper is organized as follows. Section 2 describes
`thread prioritization and outlines some of the implementa-
`tion details and costs. Section 3 briefly outlines the simula-
`tion environment and assumptions, and Section 4 presents
`the results from a number of simple scheduling experiments.
`Section 5 describes related work, while Section 6 concludes
`the paper and discusses future work.
`
`2 Thread Prioritization
`Thread prioritization involves assigning a priority to all
`the different threads in an application, and then using this
`priority to make thread scheduling decisions. The priority
`reflects the importance of a single thread to the completion
`of a singleapplication. The thread scheduler uses the thread
`priority in a very different way than process scheduling in
`UNIX for instance, where the goal is to achieve good in-
`teractive performance and time sharing between competing
`processes [16]. In our case, the goal is to identify as ex-
`actly as possible a relative order in which threads should
`be run, and devote as many resources as possible to the
`most important threads. Also, the granularity of scheduling
`is much different: in our case the priority is used to make
`scheduling decisions on every hardware context switch in a
`multiple-context processor.
`
`2.1 Priority Thread Scheduling
`Consider an application that consists of a set T of threads
`on each processor, where each processor has C contexts.
`Each thread t; E T has a priority P;, with a higher value
`of Pi indicating a higher thread priority. The hardware and
`software schedulers use the priority to do the scheduling.
`First, the software scheduler uses the priority to decide
`which threads are loaded. Specifically, it chooses a set
`TL of threads to load into the C contexts, and a set TU of
`unloaded threads to remain in a software scheduling queue.
`The scheduler chooses the loaded threads such that PI 2 P,
`for all ti E TL and tu E Tu. Threads of equal priority are
`scheduled in round-robin fashion.
`Second, at each context switch the hardware scheduler
`uses the priority to determine which loaded thread to ex-
`ecute. The scheduler chooses a thread t, E TL such that
`P, = m a + { A } for all tr E TL. If a thread is waiting for
`a memory reference to be satisfied then it is stalled and is
`not considered for scheduling until the memory reference is
`satisfied. If several loaded threads have the same priority,
`then these threads are chosen in round-robin fashion. A
`context switch can occur on a cache miss, on a failed syn-
`chronization test, or on a change of priority of one of the
`threads on the processor. Each change in priority results
`in a re-evaluation of Tu, TL, and t,.
`In this sense, the
`scheduling is preemptive.
`Note as well that thread scheduling as defined here is
`purely a local operation. Each processor has its own set
`of threads, and schedules only these. We do not consider
`dynamic load balancing issues in this paper.
`2.2 Assigning Thread Priorities
`In our benchmarks the user explicitly assigns a priority
`to each thread, and changes this priority as the algorithm
`requires. Although initially the use of thread prioritiza-
`tion is likely to be limited to special runtime libraries (e.g.
`synchronization primitives) and user-available program di-
`rectives, we expect that it will eventually be possible have
`a compiler assign priorities to threads automatically. Au-
`tomatic thread prioritization is particularly straightforward
`when the program can be described as a well defined DAG
`(Directed Acyclic Graph) that can be analyzed and used to
`assign the priorities.
`Prioritizing threads incorrectly can lead to a number of
`deadlock situations. Specifically, if thread A is waiting for
`another thread B to complete some operation, and thread B
`has a low priority that does not allow it to be loaded, then
`deadlock results. Thus the priorities assigned to threads
`must respect the dependencies of the computation.
`One way of avoiding deadlock is to guarantee some sort
`of fairness in the scheduling. If all threads are guaranteed to
`run some amount of time despite their priorities, then we can
`guarantee that deadlock will not result. However, as will be
`
`211
`
`Authorized licensed use limited to: IEEE Staff. Downloaded on May 11,2023 at 12:52:42 UTC from IEEE Xplore. Restrictions apply.
`
`TCL 1012
`
`

`

`shown in the examples, doing fair scheduling without regard
`to priority, or not specifying the priority of threads as exactly
`as they could be, can lead to a serious performance penalty.
`Thus, both the hardware and the software schedulers assume
`that the prioritizing of the threads is correct and deadlock
`free. Between threads of the same priority scheduling is
`fair.
`2.3 Hardware Support for Context Switching
`The context switch time is the time spent in switching
`between two active contexts, and is an important parame-
`ter in determining the efficiency of context switching for
`tolerating latency. In order to effectively tolerate latency,
`the total context switch time in a multiple context processor
`must be small [l, 231.
`The context switch time consists of a number of compo-
`nents. For instance, in the APRIL processor [3] the time
`required is 11 cycles, which is used to drain the processor
`pipeline (5 cycles), and execute a trap handler which saves
`state, and chooses the next context to execute using a round
`robin scheme (6 cycles). Duplicating instruction pointers
`and the processor status word for each context would reduce
`this time to just the cost of draining the pipeline, and more
`complicated processor designs could reduce this time fur-
`ther,possiblytoaslittleasasinglecycle[13, 151. Inamore
`software oriented approach, Waldspurger's flexible register
`relocation scheme [22] minimizes software context switch-
`ing cost by allocating registers in each context to maintain
`an active thread data structure. Choosing the next context
`takes 4 to 6 instructions for round-robin scheduling.
`When priority scheduling the active threads,choosing the
`next context on a context switch becomes more complicated,
`and if done in software can potentially increase the context
`switch time well beyond the minimum cost of draining the
`pipeline. For our simulations, we assume hardware support
`for choosing the next context, as shown in Figure 2. Each
`context has a special register, the priority register, into which
`the thread priority is loaded. The priorities of all the contexts
`feed into the context selection circuit that selects the next
`context on a context switch. When several threads have the
`same priority, the hardware scheduler chooses them in round
`robin fashion. Stalling can be incorporated into the active
`thread scheduling scheme by having a stall bit for each
`context indicating if it is waiting for a memory reference.
`3 Simulation Parameters
`3.1 Simulation Environment
`Simulation experiments were run using Proteus [5], a
`simulator for MIMD computer architectures. It allows ar-
`chitectural features and parameters associated with the net-
`work, the memory system, and the processor to be varied.
`Programs are written in C with language extensions for con-
`currency. Simulator function calls support non-local inter-
`
`Figure 2: Priority context selection logic.
`
`actions between processors such as shared-memory opera-
`tions (including a complete set of atomic read-modify-write
`operations), inter-processor interrupts, and message pass-
`ing. Proteus also provides a basic runtime system written in
`C.
`
`One important assumption made by Proteus in order to
`make simulation tractable is that only the references to ad-
`dresses that have been explicitly declaredas shared are simu-
`lated in detail. That is, it is assumed that all local instruction
`and data references (to a thread's stack for instance) hit in
`the cache. This assumption has been found to be a reason-
`able approximation of the case where every single memory
`reference is simulated through the cache [5,9] since the hit
`rates for instructions and local data are typically very high.
`3.2 System Parameters
`The basic system consists of a collection of multiple-
`context processing nodes connected by a high speed inter-
`connection network. Each processing node has a cache and
`some portion of the global memory. We use a 2-dimensional
`torus type network that uses wormhole routing. Each pro-
`cessor has both a shared memory interface, as well as an effi-
`cient message passing interface with high priority interrupts
`that is used in a number of the benchmarks. A variation of
`the shared-memory, directory-based cache coherence proto-
`col described by Chaiken [6] is used to maintain a consistent
`view of memory.
`The system parameters that are kept constant across the
`different simulations are shown in Table 1. They were
`chosen to represent a processor similar to the MIT Alewife
`machine [2,3], with modifications that reflect the increasing
`ratio of processor speed to memory speed, and that allow
`faster hardware context switching.
`The local memory latency is assumed to be 20 cycles.
`This corresponds to the time to fill a cache line from the
`local memory when there is a miss in the cache, the value
`is in the local node memory, and no coherency protocol
`messages have to be sent before returning the data. If the
`data is not in local memory, then the time for the response is
`
`212
`
`Authorized licensed use limited to: IEEE Staff. Downloaded on May 11,2023 at 12:52:42 UTC from IEEE Xplore. Restrictions apply.
`
`TCL 1012
`
`

`

`Parameter
`Cache Latency
`Local Memory Latency
`Memory Bandwidth
`Hardware Context Switch Time
`Time to Unload Registers
`Time to Reload Registers
`Network Data Transfer Size
`Network Wire Delay
`Network Switch Delay
`Network Input Bandwidth
`Network Output Bandwidth
`
`Value
`I cycle
`20 cycles
`I wordcycle
`5 cycles
`32 cycles
`32 cycles
`0.5 word
`1 cycle
`1 cycle
`0.5 wordcycle
`0.5 wordlcycle
`
`Table 1: Important system parameters.
`
`variable and depends on factors such as the network traffic,
`and the number of messages that have to be sent to satisfy the
`protocol. The memory bandwidth available is 1 wordcycle.
`The 5 cycle hardware context switch time is the cost
`of draining the pipeline on a context switch. We assume
`that this is the only cost of context switching, and that no
`additional instructions have to be executed.
`The cost of a thread swap is the time to save the regis-
`ters of the first thread, perform various operations on the
`scheduling data structures, and restore the registers of the
`next thread [17]. We assume that there are 32 registers in
`each context, and that there is a single cycle cache read and
`write hit time. The cost of the scheduling and descheduling
`is modeled directly by the cost of the runtime scheduler.
`The total swapping cost is on the order of 1 I5 to 150 cycles,
`which assumes all references hit in the cache. This swap
`time is significant in comparison to the hardware context
`switch time.
`Finally, the network transfers data one half word at a time,
`with a switch delay of 1 cycle, and a point-to-point wire
`delay of 1 cycle. The network input and output bandwidths
`are each 0.5 wordcycle.
`
`4 Experimental Results
`In this section we present the results from three simple
`benchmarks. These experiments concentrate on improv-
`ing synchronization performance, which is crucial to the
`performance of many parallel applications. The first two
`benchmarks are synthetic benchmarks which look at the
`performance of a combining tree barrier, and of a mutual-
`exclusion queue lock. The third benchmark is an imple-
`mentation of Lower-Upper Decomposition (LUD) that uses
`fine-grain synchronization in the form of a FulIEmpty tag
`bit associated with each memory location.
`For each benchmark we consider three different scenar-
`ios:
`
`1. SINGLE: There are several threads, but there is only
`one context so that only a single thread is loaded at a
`time.
`
`2. ALL: There are sufficient contexts so that all threads
`created can be loaded. We use 16 contexts in our
`simulations.
`
`3. LIMITED: There are several contexts, but there are
`potentially more threads than contexts so that only a
`limited number of the available threads are loaded. We
`use 4 contexts in our simulations.
`
`The SINGLE and LIMITED cases represent situations
`in which not all threads can be loaded at the same time.
`These cases can arise in the context of data-dependent thread
`spawning, runtime dynamic partitioning, or in a multipro-
`gramming environment. The ALL case is balanced in the
`Sense that all threads can be loaded at once. The first case
`emphasizes the type of scheduling which is required for
`threads that are not currently loaded. The second case em-
`phasizes the scheduling required for loaded threads. The
`last case combines the two problems to study what must be
`done to schedule both loaded and unloaded threads.
`4.1 Barrier Synchronization
`The first benchmark is a barrier benchmark using a shared
`memory combiningtree [25,18]. In this benchmark, a num-
`ber of threads are spawned on each processor, and these
`threads repeatedly perform a barrier synchronization. The
`first level of the combining tree has a fan-in equal to the
`number of threads on each processor'. The threads on each
`processor first perform a local combine, and then the last
`thread to combine on each local processor participates in a
`global barrier using a radix4 combining tree. The simula-
`tion uses 64 processors, with a large fully associative cache,
`so that only cache invalidation traffic affects performance.
`Prioritization is simple, and is done as follows. When a
`thread arrives at the barrier and it is not the first thread in the
`leafgroup, it decreases itspriority in preparation for thenext
`phase ofthe computation, and begins to spin. The last thread
`to arrive at a leaf n d e maintains its priority, and proceeds
`up the combining tree. Thus on each node, only the thread
`that is participating in the non-local barrier tree is using
`any cycles - it can either be spinning at an intermediate
`node of the combining tree, or it can be proceeding up
`or down the combining tree. Once a thread going back
`down the combining tree reaches the leaves of the tree, it
`decreases its priority to the priority of the other spinning leaf
`threads, and they can all proceed to the next phase of the
`computation. Note that the prioritization required for other
`tree-like barriers including tournament barriers and MCS
`
`'If there is only a single thread per processor. then this first level is
`eliminated.
`
`213
`
`Authorized licensed use limited to: IEEE Staff. Downloaded on May 11,2023 at 12:52:42 UTC from IEEE Xplore. Restrictions apply.
`
`TCL 1012
`
`

`

`tree barriers [ 181, is qualitatively similar to the prioritization
`of the combining tree barrier.
`
`4.1.1 Results
`Figure 3 shows the average barrier wait times for the three.
`different scenarios, where the barrier wait time is the time
`spent by each thread waiting at the barrier. Each case is
`discussed below.
`SINGLE: With unprioritized threads, performance of
`the barrier decreases as the number of threads increases due
`to two factors. First, each thread that participates in the
`barrier must be swapped into the context in order to reach
`the barrier. Second, when a thread that is spinning at an in-
`termediate node of the combining tree does an unsuccessful
`poll, the scheduler swaps out this thread, and successively
`load in all the other spinning threads on the local node. It
`does this because it does not differentiate between the lo-
`cally spinning threads and thus treats them all fairly. In the
`prioritized case, the time to perform the barrier increases due
`to the larger number of threads, but once the local barrier has
`been completed and one thread has been chosen to represent
`the node in the global barrier, this thread is never swapped
`out regardless of how often the polling is unsuccessful. As
`a result, the second component which contributed to poor
`performance in the unprioritized case is eliminated. For 4
`threads per processor performance improves by 11%, and
`for 16 threads per processor performance improves 41%.
`ALL: The unprioritized ALL scenario suffers from a
`similar problem to the unprioritized SINGLE scenario, ex-
`cept that no thread swapping is necessary since all threads
`are loaded, only context switching. Although a context
`switch is much cheaper than a full thread swap, the context
`switchs happen more often in the ALL case than thread
`swaps in the SINGLE case because they occur not only on
`failed synchronization tests, but also on cache misses. Each
`time the thread participating in the global barrier misses in
`the cache or does an unsuccessful polling operation, the pro-
`cessor runs through all the other contexts before returning
`to the critical context. It is important to note that the time to
`switch between the contexts is more than simply the number
`of cycles to switch between hardware contexts, in this case 5
`cycles. This is because once the actual context switch takes
`place, the new thread issues some number of instructions,
`until it either misses in the cache, or tests its flag unsuc-
`cessfully and context switches. The prioritized scheduling
`eliminates unnecessary context switching during the global
`barrier with performance improving by 23% for 4 threads,
`and by 47% for 16 threads.
`LIMITED: The case of having more threads than con-
`texts with multiple contexts can potentially suffer from
`the worst of both the SINGLE, and the ALL scenarios.
`With unprioritized threads, each time a non-critical spin-
`
`214
`
`O l " " " ~ " " " " ~
`0 1 2 3 4 5 6 7 8 9 10111213141516
`
`0 1 ' " " " " " " " '
`0 1 2 3 4 5 6 7 8 9 10111213141516
`mmspwp-w
`
`b. ALL
`
`14wO
`
`-
`i 12000-
`
`1ww-
`
`8 0 0 0 -
`-
`6000
`-
`-
`2000
`
`4000
`
`
`"
`"
`"
`"
`' ~ "
`"
`"
`O I
`0 1 2 3 4 5 6 7 8 9 10111213141516
`
`Figure 3: Average barrier wait time for 64 Processors. SIN-
`GLE, ALL, and LIMITED scenarios.
`
`Authorized licensed use limited to: IEEE Staff. Downloaded on May 11,2023 at 12:52:42 UTC from IEEE Xplore. Restrictions apply.
`
`TCL 1012
`
`

`

`ning thread runs, it not only checks its flag but also does
`a thread swap if there are other threads on the scheduling
`queue. Thus the time between when the critical thread con-
`text switches to the time it is again the executing thread is
`increased by the time to run through all the other loaded
`threads, where each is checking its flag and then swap-
`ping itself with some other spinning thread on the software
`scheduling queue. This in effect represents a worst case
`scenario in terms of the amount of thread swapping that
`is done. Figure 3c also shows the case when the thread
`priorities are used only by the software scheduler, and not
`the hardware scheduler. In this case the hardware scheduler
`does round-robin scheduling of the loaded threads rather
`than priority scheduling and thus does some amount of un-
`necessary context switching. Wlth 16 threads, software
`thread prioritization without hardware thread prioritization
`reduces the barrier wait time by 47%. whereas doing both
`software and hardware thread prioritization reduces the wait
`time by 57%.
`These three scenarios show that the prioritizing of both
`loaded and unloaded threads is important for the perfor-
`mance of the barrier synchronization. Prioritizing unloaded
`threads in the thread queue is important because it guaran-
`tees that threads that still have to participate in the barrier
`are loaded, and it eliminates unnecessq thread swapping.
`Prioritizing the loaded threads themselves guarantees that
`lower priority spinning threads do not steal cycles from the
`higher priority threads, so that a critical thread can proceed
`as soon as it is able.
`4.2 Queue Locks
`A queue lock is a mutual exclusion mechanism that is
`appropriate for high contention locks [18,4]. Each thread
`inserts itself onto a queue, and then spins on its own flag so
`as to not generate the hot spots and excess network traffic
`that can be generated by simpler Test-and-Set style locks.
`Our implementation of queue locks is inspired by the MCS
`lock of Mellor-Crummey and Scott [ 181. The MCS lock has
`the advantage that the flag on which each thread spins while
`waiting in the queue is locally allocated and generates no
`global traffic while spinning.
`Scheduling threads waiting for a lock raises a number
`of issues. It is clear that once a lock is acquired the thread
`owning the lock should not be swapped out [26, 171. This
`is because all other threads waiting for the lock will be
`unable to make progress until the lock is released, and so
`performance can be seriously degraded. In the case of the
`queue lock there is an additional factor to be considered:
`the order in which threads are going to acquire the lock
`is determined by the order in which threads are inserted
`into the queue. The priority of a spinning thread should be
`determined by the position of the thread in the queue, so that
`when there are multiple spinning threads, the one earlier in
`the queue will be given priority.
`
`The synthetic queue lock benchmark consists of an equal
`number of threads on each processor, each trying to obtain a
`lock. Each thread repeatedly obtains the lock, runs a critical
`section, releases the lock, and then runs a non-critical sec-
`tion. Both a high contention and a low contention case are
`run. In the high contention case, the critical section is about
`100 cycles, and the non-critical section varies between 50
`and 150 cycles, with a uniform distribution. In the low
`contention case, the critical section is the same, but the non-
`criticalsection varies between 5000and 15000cycles witha
`uniform distribution. When executing the non-critical sec-
`tion, a context switch is forced about every 40 cycles to
`simulate a cache miss. The total number of locks acquired
`over a sample period of lo6 cycles was taken as the fig-
`ure of merit. It should be noted that the latency tolerance
`properties of having multiple threads, as well as possible
`cache interference between threads are not measured in this
`benchmark, but rather only the effects of the thread schedul-
`ing. The simulation uses 16 processors, with a large fully
`associative cache so that only invalidation traffic occurs.
`The following variations on the queue lock benchmark
`were run:
`
`Priorityl: the threads repeatedly context switch when
`they poll their local variable and determine that they
`are not next in line in the queue. If there are more
`threads than contexts, then there is also a thread swap
`with an unloaded thread. A thread that acquires the
`lock has high priority so that it is not swapped out, and
`when it releases the lock it decreases its priority.
`
`Priority2: If a thread owns the lock or is trying to
`determine its position in the queue, it has the highest
`priority. If the thread is trying to insert itself into the
`queue, it has the next highest priorityz. Threads that
`are spinning in the queue have a priority based on their
`position in the queue3. Thus whenever a processor has
`several threads waiting in the queue,it will give priority
`to the thread that will next acquire the queue. Note that
`this requires the addition of a count field to the data
`structure in order to keep track of the position in the
`queue, and slightly more complicated lock acquisition
`code.
`
`Signallingl: In this version, the benchmark has been
`modified so that after inserting itself into the queue a
`thread suspends itself. A suspended thread is put into a
`
`*Their is a subtle issue here having to do with a th

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket