Operating System Support for Mitigating Software Scalability Bottlenecks on Asymmetric Multicore Processors

Juan Carlos Saez
Complutense University, Madrid, Spain
jcsaezal@fdi.ucm.es

Manuel Prieto
Complutense University, Madrid, Spain
mpmatias@dacya.ucm.es

Alexandra Fedorova
Simon Fraser University, Vancouver BC, Canada
fedorova@cs.sfu.ca

Hugo Vegas
Complutense University, Madrid, Spain
hugovegas@fdi.ucm.es

ABSTRACT
Asymmetric multicore processors (AMP) promise higher performance per watt than their symmetric counterparts, and it is likely that future processors will integrate a few fast out-of-order cores, coupled with a large number of simpler, slow cores, all exposing the same instruction-set architecture (ISA). It is well known that one of the most effective ways to exploit these systems is to use fast cores to accelerate sequential phases of parallel applications, and to use slow cores for running parallel phases. At the same time, we are not aware of any implementation of this parallelism-aware (PA) scheduling policy in an operating system. So the questions as to whether this policy can be delivered efficiently by the operating system to unmodified applications, and what the associated overheads are, remain open. To answer these questions we created two different implementations of the PA policy in OpenSolaris and evaluated them on real hardware, where asymmetry was emulated via CPU frequency scaling. This paper reports our findings with regard to the benefits and drawbacks of this scheduling policy.

Categories and Subject Descriptors
D.4.1 [Process Management]: Scheduling

General Terms
Performance, Measurement, Algorithms

Keywords
Asymmetric multicore, Scheduling, Operating Systems

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CF'10, May 17–19, 2010, Bertinoro, Italy.
Copyright 2010 ACM 978-1-4503-0044-5/10/05 ...$10.00.

1. INTRODUCTION
An asymmetric multicore processor (AMP) includes cores exposing the same instruction-set architecture, but differing in features, size, speed and power consumption [2, 8]. A typical AMP would contain a number of simple, small and low-power slow cores and a few complex, large and high-power fast cores. It is well known that AMP systems can mitigate scalability bottlenecks in parallel applications by accelerating sequential phases on fast cores [2, 7, 12].

To leverage this potential of AMP systems, threads must be mapped to cores in consideration of the amount of parallelism in the application: if an application is highly parallel, its threads should be mapped to slow cores, but if the application is sequential or is executing a sequential phase, its thread should be mapped to a fast core. A natural place for this Parallelism-Aware (PA) policy is in the operating system. This way, many applications can reap its benefits, potentially without requiring any modifications, and the sharing of scarce fast cores among multiple applications can be fairly arbitrated by the operating system. To the best of our knowledge, there are no OS-level implementations of the PA scheduling policy. As a result, many questions regarding the effectiveness and practicality of this policy remain open.
One open question is how the operating system can effectively detect sequential phases in applications. In some applications unused threads block during the sequential phase, and by monitoring the application's runnable thread count, which is exposed to the OS by most threading libraries, the scheduler can trivially detect a sequential phase. In other applications, however, unused threads busy-wait (or spin) during short periods of time, and so the OS cannot detect these phases simply by monitoring the runnable thread count. To address these scenarios we designed PA Runtime Extensions (PA-RTX) – an interface and library enhancements enabling the threading library to notify the scheduler when a thread spins rather than doing useful work. We implemented PA-RTX in a popular OpenMP runtime, which required only minimal modifications, but the extensions are general enough to be used with other threading libraries.
Another open question is the overhead associated with the PA policy.
Any policy that dedicates fast cores to specific threads is bound to generate migration overheads – a performance degradation that occurs when a thread is moved from one core to another. Performance degradation results from the loss of cache state accumulated on the thread's old core. Upon evaluating these overheads we found that they can be significant (up to 18%) if the fast core is placed in a different memory hierarchy domain from slow cores, but a hardware configuration where a fast core shares a memory hierarchy domain with several slow cores, coupled with a topology-aware scheduler, practically eliminates these overheads.
We evaluate the PA policy on a real multicore system, where "slow" cores were emulated by reducing the clock frequency on the processors, while "fast" cores were configured to run at regular speed. We find that the main benefits from the PA policy are derived for multi-application workloads and when the number of fast cores relative to slow cores is small. In this case, it delivers speedups of up to 40% relative to the OpenSolaris default asymmetry-agnostic scheduler. Previously proposed asymmetry-aware algorithms, which we used for comparison, also do well in some cases, but unlike our parallelism-aware algorithms they do not perform well across the board, because they fail to consider the parallelism of the application.

The key contribution of our work is the evaluation of the operating system technology enabling next-generation asymmetric systems. We are not aware of previous studies investigating the benefits and drawbacks of the PA scheduling policy implemented in a real OS. Our findings provide insights for the design of future asymmetry-aware operating systems and asymmetric hardware alike.

The rest of the paper is structured as follows. Section 2 presents the design and implementation of the PA scheduling algorithm. Section 3 presents experimental results. Section 4 discusses related work. Section 5 summarizes our findings and discusses future work.

2. DESIGN AND IMPLEMENTATION
In Section 2.1 we describe two parallelism-aware algorithms proposed in this work: PA and MinTLP. In Section 2.2 we describe the runtime extensions to PA (PA-RTX). A brief description of other asymmetry-aware algorithms that we use for comparison is provided in Section 2.3.

2.1 PA and MinTLP algorithms
Our algorithms assume an AMP system with two core types: fast and slow. Previous studies concluded that supporting only two core types is optimal for achieving most of the potential gains on AMP [8]; so we expect this configuration to be typical of future systems. More core types may be present in future systems due to variations in the fabrication process. In that case, scheduling must be complemented with other algorithms, designed specifically to address this problem [18].
The goal of the algorithm is to decide which threads should run on fast cores and which on slow cores. In MinTLP, this decision is straightforward: the algorithm selects applications with the smallest thread-level parallelism (hence the name MinTLP) and maps threads of these applications to fast cores. Thread-level parallelism is determined by examining the number of runnable (i.e., not blocked) threads. If not enough fast cores are available to accommodate all these threads, some will be left running on slow cores. MinTLP makes no effort to fairly share fast cores among all "eligible" threads. This algorithm is very simple, but not always fair.
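As a rough illustration, MinTLP's selection step can be sketched as follows. This is our own rendering, not the OpenSolaris implementation; the data structures and names are hypothetical:

    #include <stdlib.h>

    /* Hypothetical per-application record kept by the scheduler. */
    typedef struct app {
        int runnable_threads;        /* current thread-level parallelism */
    } app_t;

    /* Sort ascending by runnable thread count (smallest TLP first). */
    static int by_tlp(const void *a, const void *b)
    {
        const app_t *x = *(const app_t *const *)a;
        const app_t *y = *(const app_t *const *)b;
        return x->runnable_threads - y->runnable_threads;
    }

    /*
     * Grant fast-core slots greedily to the applications with the
     * smallest TLP; threads that do not fit stay on slow cores.
     */
    static void mintlp_assign(app_t **apps, int napps, int fast_cores,
                              int *fast_slots)
    {
        qsort(apps, napps, sizeof(apps[0]), by_tlp);
        for (int i = 0; i < napps; i++) {
            int want = apps[i]->runnable_threads;
            fast_slots[i] = (want < fast_cores) ? want : fast_cores;
            fast_cores -= fast_slots[i];
        }
    }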
The other proposed algorithm, PA, is more sophisticated. It classifies threads dynamically into several categories: MP, HP, and SP. The MP (mildly parallel) category includes threads belonging to applications with a low degree of thread-level parallelism, including single-threaded applications. The HP category includes threads belonging to highly parallel applications. The MP threads will run primarily on fast cores, and the HP threads will run primarily on slow cores. Threads of applications whose runnable thread count exceeds hp_threshold fall into the HP category; the remaining threads fall into the MP category.

A special class SP is reserved for threads of parallel applications that have just entered a sequential phase. These threads get the highest priority for running on fast cores: this provides more opportunities to accelerate sequential phases. To avoid monopolizing fast cores, SP threads are downgraded by the scheduler into the MP class after spending amp_boost_ticks scheduling clock ticks in the SP class.

If there are not enough cores to run all SP and MP threads on fast cores, the scheduler will run some of the threads on slow cores, to preserve load balance. SP threads have a higher priority in using fast cores. The remaining fast cores will be shared among MP threads in a round-robin fashion.

The scheduler keeps track of the count of runnable threads in each application to detect transitions between the aforementioned classes and perform thread-to-core mapping adjustments accordingly. To avoid premature migrations and preserve load balance, PA integrates a thread swapping mechanism that performs those adjustments periodically, instead of reacting to those transitions immediately (MinTLP also integrates a similar swapping mechanism).
When the change in thread-level parallelism cannot be determined by monitoring the runnable thread count, PA relies on the Runtime Extensions, described in the next section. We must also highlight that although our evaluation focuses on multi-threaded single-process applications, the PA and MinTLP algorithms can easily be extended to support multi-process applications using high-level abstractions provided by the operating system, such as process sets.

We studied the sensitivity of the PA algorithm to its configurable parameters, but we are unable to provide the results due to space constraints. We found, however, that it is generally easy to choose good values for these parameters. Based on this sensitivity study, we set amp_boost_ticks to one hundred timeslices (1 second) and hp_threshold to one greater than the number of fast cores. These values ensure acceleration of sequential phases without monopolizing fast cores.
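To make the classification rules concrete, here is a hedged sketch of how a thread's class could be derived on each scheduler tick; the function and variable names are ours, not the actual scheduler's, and the SP bookkeeping is simplified:

    #include <stdbool.h>

    enum pa_class { SP, MP, HP };   /* SP > MP > HP in fast-core priority */

    #define AMP_BOOST_TICKS 100     /* 100 timeslices, roughly 1 second */

    /*
     * Sketch: classify one thread given its application's runnable
     * thread count. hp_threshold is set to (number of fast cores + 1).
     */
    static enum pa_class
    pa_classify(int app_runnable_threads, bool just_entered_serial_phase,
                int ticks_spent_in_sp, int hp_threshold)
    {
        /* A parallel application that just dropped to a single runnable
         * thread enters SP and keeps that boost for amp_boost_ticks. */
        if (just_entered_serial_phase && ticks_spent_in_sp < AMP_BOOST_TICKS)
            return SP;
        /* Otherwise classify by the application's thread-level parallelism. */
        return (app_runnable_threads > hp_threshold) ? HP : MP;
    }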
2.2 PA Runtime Extensions
The base PA algorithm introduced so far relies on monitoring the runnable thread count to detect transitions between serial and parallel phases in the application. However, conventional synchronization primitives found in most threading libraries use an adaptive two-phase approach in which unused threads busy-wait for a while before blocking, to reduce context-switching overheads. While blocking is coordinated with the OS, making it possible to detect phase transitions, spinning is not. Reducing the spinning phase enables the OS to detect more serial phases.
However, in our context it may also lead to excessive migrations and cause substantial overheads (as soon as a fast core becomes idle, PA and MinTLP will immediately migrate a thread to this core). When these busy-waiting phases are frequent, it is helpful to give the scheduler hints that help it avoid mapping spinning threads to fast cores. To that end, we propose two optimizations, which can be implemented in the threading library (applications themselves need not be changed).
2.2.1 Spin-then-notify mode
Our first proposal is a new spin-then-notify waiting mode for synchronization primitives. Its primary goal is to avoid running spinning threads on fast cores and to save these "power-hungry" cores for other threads. In this mode the synchronization primitive notifies the operating system via a system call, after a certain spin threshold, that the thread is busy-waiting rather than doing useful work. Upon notification, the PA scheduler marks this thread as a candidate for migration to slow cores. We have opted to mark threads as migration candidates instead of forcing an immediate migration since this approach avoids premature migrations and allows a seamless integration with the PA and MinTLP swapping mechanisms. The synchronization primitive also notifies the scheduler when a spinning thread finishes the busy wait. In Section 3.2 we explore the advantages of using the new spin-then-notify mode. For this purpose we have modified the OpenMP runtime system to include this new mode in the basic waiting function used by high-level primitives such as mutexes or barriers.
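A minimal sketch of such a waiting function follows. The notification hooks pa_notify_spinning() and pa_notify_done() are hypothetical names for the new system call interface, block_until_set() stands in for OS-assisted blocking, and the threshold values are examples only (Section 3.2 experiments with spin thresholds between 1K and 1m iterations and a blocking threshold of 100m):

    #include <stdatomic.h>

    #define SPIN_THRESHOLD   100000L     /* example spin-notify threshold */
    #define BLOCK_THRESHOLD  100000000L  /* example blocking threshold    */

    /* Hypothetical notification hooks into the PA scheduler. */
    extern void pa_notify_spinning(void); /* mark caller as migration candidate */
    extern void pa_notify_done(void);     /* caller stopped busy-waiting        */
    extern void block_until_set(atomic_int *addr); /* OS-assisted sleep         */

    /* Spin-then-notify(-then-block) wait on a synchronization word. */
    static void wait_spin_then_notify(atomic_int *flag)
    {
        long iters = 0;
        int notified = 0;

        while (atomic_load(flag) == 0) {
            iters++;
            if (iters == SPIN_THRESHOLD) {
                pa_notify_spinning();  /* tell the OS this is not useful work */
                notified = 1;
            }
            if (iters >= BLOCK_THRESHOLD)
                block_until_set(flag); /* two-phase: fall back to blocking */
        }
        if (notified)
            pa_notify_done();          /* thread is doing useful work again */
    }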
Another potentially useful feature of this primitive may arise in the context of scheduling algorithms that map threads on AMP systems based on their relative speedup on fast vs. slow cores (see Section 4). These algorithms typically measure the performance of each thread on fast and slow cores and compute its performance ratio, which determines the relative speedup [5, 9]. If a thread performs busy-waiting it can achieve a very high performance ratio, since a spin loop uses the CPU pipeline very efficiently¹. As a result, these algorithms would map spinning threads to fast cores even though they are not doing useful work. While these implementation issues could be solved via additional hardware support [11], a spin-then-notify primitive could help avoid the problem without needing extra hardware.

¹Best practices in implementing spinlocks dictate using algorithms where a thread spins on a local variable [1], which leads to high instruction throughput.
2.2.2 Exposing the master thread
We have also investigated a simple but effective optimization allowing the application to communicate to the kernel that a particular thread must have a higher priority in running on a fast core. This optimization was inspired by the typical structure of OpenMP do-all applications. In these applications, there is usually a master thread that is in charge of the explicit serial phases at the beginning, in between parallel loops, and at the end of the application (apart from being in charge of its share of the parallel loops). Identifying this master thread to the kernel enables the scheduler to give it a higher priority on the fast core simply because this thread will likely act as the "serial" thread. This hint can speed up do-all applications even without properly detecting serial phases. Our PA Runtime Extensions enable the runtime system to identify the master thread to the scheduler via a new system call. If the pattern of the application changes and another thread takes on this responsibility, the same system call can be used to update this information.
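As a hedged illustration, the runtime-side use of this hint could look as follows; pa_set_master() is our hypothetical wrapper for the new system call, under the assumption that the calling thread declares itself master:

    /* Hypothetical wrapper for the new system call: the calling thread
     * declares itself the application's master thread. */
    extern int pa_set_master(void);

    /* Called by the OpenMP runtime right after environment initialization. */
    void runtime_init_hook(void)
    {
        /* The thread executing main() declares itself master, so the PA
         * scheduler favors it on a fast core whenever it is active. If
         * another thread later takes over the serial work, that thread
         * issues the same call to update the kernel's information. */
        (void) pa_set_master();
    }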
To evaluate this feature, we have modified the OpenMP runtime system to automatically identify the thread executing the main function as the master thread to the kernel, right after initializing the runtime environment. As with the implementation of the spin-then-notify mode, only the OpenMP library needs to be modified, without requiring any change in the applications themselves. Upon receiving this notification, the PA scheduler tries to ensure that the master thread runs on a fast core whenever it is active, but without permanently binding the thread to that core as would be done with other explicit mechanisms based on thread affinities. This way, PA still allows different threads to compete for fast cores according to its policies.
2.3 The other schedulers
We compare PA and MinTLP to three other schedulers proposed in previous work. Round-Robin (RR) equally shares fast and slow cores among all threads [5]. BusyFCs is a simple asymmetry-aware scheduler that guarantees that fast cores never go idle before slow cores [4]. Static-IPC-Driven, which we describe in detail below, assigns fast cores to those threads that experience the greatest relative speedup (in terms of instructions per second) relative to running on slow cores [5]. We implemented all these algorithms in OpenSolaris. Our baseline for comparison is the asymmetry-agnostic default scheduler in OpenSolaris, referred to hereafter as Default.
The Static-IPC-Driven scheduler is based on the design proposed by Becchi and Crowley [5]. Thread-to-core assignments in that algorithm are based on per-thread IPC ratios (quotients of IPCs on fast and slow cores), which determine the relative benefit of running a thread on a particular core type. Threads with the highest IPC ratios are scheduled on fast cores while the remaining threads are scheduled on slow cores. In the original work [5], the IPC-driven scheduler was simulated. That scheduler samples threads' IPC on cores of all types whenever a new program phase is detected. Researchers who attempted an implementation of this algorithm found that such sampling caused large overheads, because frequent cross-core thread migrations were required [16]. To avoid these overheads, we have implemented a static version of the IPC-driven algorithm, where the IPC ratios of all threads are measured a priori. This makes IPC ratios more accurate in some cases [16] and eliminates much of the runtime performance overhead. Therefore, the results of the Static-IPC-Driven scheduler are somewhat optimistic and the speedups of PA and MinTLP relative to Static-IPC-Driven are somewhat pessimistic.
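For illustration, the static assignment step amounts to ranking threads by their a-priori measured IPC ratio; the sketch below uses our own hypothetical naming, not the actual implementation:

    #include <stdlib.h>

    /* Hypothetical per-thread record with a-priori measured IPCs. */
    typedef struct thread_ipc {
        double ipc_fast;            /* IPC measured on a fast core */
        double ipc_slow;            /* IPC measured on a slow core */
    } thread_ipc_t;

    /* Relative benefit of running this thread on a fast core. */
    static double ipc_ratio(const thread_ipc_t *t)
    {
        return t->ipc_fast / t->ipc_slow;
    }

    /* Sort descending by IPC ratio. */
    static int by_ratio_desc(const void *a, const void *b)
    {
        double ra = ipc_ratio(a), rb = ipc_ratio(b);
        return (ra < rb) - (ra > rb);
    }

    /* After sorting, threads[0 .. fast_cores-1] are mapped to fast cores
     * and the remaining threads are mapped to slow cores. */
    static void static_ipc_assign(thread_ipc_t *threads, int n, int fast_cores)
    {
        qsort(threads, n, sizeof(*threads), by_ratio_desc);
        (void) fast_cores;  /* the actual core binding happens elsewhere */
    }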
3. EXPERIMENTS
The evaluation of the PA algorithm was performed on an AMD Opteron system with four quad-core (Barcelona) CPUs, for a total of 16 cores. Each core has private 64KB instruction and data caches, and a private L2 cache of 512KB. A 2MB L3 cache is shared by the four cores on a chip. The system has a NUMA architecture: access to a local memory bank incurs a shorter latency than access to a remote memory bank. Each core is capable of running at a range of frequencies from 1.15 GHz to 2.3 GHz.
Table 1: Classification of selected applications.

    Categories  Benchmarks
    HP-CI       EP(N), vips(P), fma3d(O), ammp(O), RNA(NS), scalparc(M), wupwise(O)
    HP-MI       art(O), equake(O), applu(O), swim(O)
    PS-CI       BLAST(NS), swaptions(P), bodytrack(P), semphy(M), FT(N)
    PS-MI       MG(N), TPC-C(NS), FFTW(NS)
    ST-CI       gromacs(C), sjeng(C), gamess(C), gobmk(C), h264ref(C), hmmer(C), namd(C)
    ST-MI       astar(C), omnetpp(C), soplex(C), milc(C), mcf(C), libquantum(C)

Table 2: Multi-application workloads, Set #1.

    Workload name   Benchmarks
    STCI-PSMI       gamess, FFTW (12,15)
    STCI-PSCI       gamess, BLAST (12,15)
    STCI-PSCI(2)    hmmer, BLAST (12,15)
    STCI-HP         gamess, wupwise (12,15)
    STCI-HP(2)      gobmk, EP (12,15)
    STMI-PSMI       mcf, FFTW (12,15)
    STMI-PSCI       mcf, BLAST (12,15)
    STMI-HP         astar, EP (12,15)
    PSMI-PSCI       FFTW (6,8), BLAST (7,8)
    PSMI-HP         FFTW (6,8), wupwise_m (7,8)
    PSCI-HP         BLAST (6,8), wupwise_m (7,8)
    PSCI-HP(2)      semphy (6,8), EP (7,8)

Since each core is within its own voltage/frequency domain, we are able to vary the frequency of each core independently. We experimented with asymmetric configurations that use two core types: "fast" (a core set to run at 2.3 GHz) and "slow" (a core set to run at 1.15 GHz). We also varied the number of cores in the experimental configurations by disabling some of the cores.

We used three AMP configurations in our experiments: (1) 1FC-12SC – one fast core and 12 slow cores, where the fast core is on its own chip and the other cores on that chip are disabled; (2) 4FC-12SC – four fast cores and 12 slow cores, where each fast core is on a chip with three slow cores; (3) 1FC-3SC – one fast core and three slow cores, all on one chip. Not all configurations are used in all experiments.

Although thread migrations can be effectively exploited by asymmetry-aware schedulers (e.g., to map sequential parts of parallel applications on fast cores), the overhead that they may introduce can lead to performance degradation. Since we also aim to assess the impact of migrations on performance, we opted to select the default asymmetry-unaware scheduler used in OpenSolaris (Default) as our baseline scheduler. Although Default keeps threads on the same core for most of the execution time and thus minimizes thread migrations, its asymmetry-unawareness makes its results much less stable from run to run than those observed for the other schedulers. For that reason, a high number of samples was collected for this scheduler, in an attempt to capture the average behavior more accurately. Overall, we found that Default usually fails to schedule single-threaded applications and sequential phases of parallel applications on fast cores, especially when the number of fast cores is much smaller than the number of slow cores, such as on the 1FC-12SC and 4FC-12SC configurations.

We evaluate the base implementation of the PA algorithm as well as PA with Runtime Extensions. We compare PA to RR, BusyFCs, Static-IPC-Driven, MinTLP and Default. In all experiments, each application was run a minimum of three times, and we measure the average completion time. The observed variance was small in most cases (so it is not reported); where it was large we repeated the experiments for a larger number of trials until the variance reached a low threshold. In multi-application workloads the applications are started simultaneously, and when an application terminates it is restarted repeatedly until the longest application in the set completes at least three times. We report performance as the speedup over Default: the geometric mean of completion times of all executions for a benchmark under a particular asymmetry-aware scheduler is compared to that under Default, and the percentage speedup is reported. In all experiments, the total number of threads (the sum of the number of threads of all applications) was set to match the number of cores in the experimental system, since this is how runtime systems typically configure the number of threads for the CPU-bound workloads that we considered [19].
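The exact formula is not spelled out in the text; under one consistent reading, the reported speedup of a scheduler $S$ over Default for a benchmark with per-run completion times $T_i$ is

$$\mathrm{Speedup}_S = \left(\frac{\mathrm{GM}\big(T^{\mathrm{Default}}\big)}{\mathrm{GM}\big(T^{S}\big)} - 1\right) \times 100\%, \qquad \mathrm{GM}(T) = \Big(\prod_{i=1}^{n} T_i\Big)^{1/n},$$

so a positive value means the asymmetry-aware scheduler completed the benchmark faster than Default.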
Our evaluation section is divided into four parts. In Section 3.1 we introduce the applications and workloads used for evaluation. In Section 3.2 we evaluate the PA runtime extensions. In Section 3.3 we evaluate multi-application workloads. Finally, in Section 3.4 we study the overhead.

3.1 Workload selection
We used applications from the PARSEC [6], SPEC OMP2001, NAS Parallel Benchmarks [3] and MineBench [13] benchmark suites, as well as the TPC-C benchmark implemented over Oracle Berkeley DB [14], BLAST – a bioinformatics benchmark, FFTW – a scientific benchmark performing the fast Fourier transform, and RNA – an RNA sequencing application.
Table 3: Multi-application workloads, Set #2.

    Workload name            Benchmarks
    2STCI-2STMI-1HP          gamess, h264ref, astar, soplex, wupwise (12)
    4STCI-1HP                gromacs, gamess, namd, gobmk, EP (12)
    3STCI-1STMI-1PSCI        gamess, hmmer, gobmk, soplex, semphy (12)
    2STCI-1STMI-1PSMI-1HP    gamess, h264ref, soplex, FFTW (6), equake (7)
    3STCI-3STMI-1HP          gromacs, sjeng, h264ref, libquantum, milc, omnetpp, EP (10)
    3STCI-3STMI-1PSCI        gromacs, sjeng, h264ref, libquantum, milc, omnetpp, BLAST (10)

For multi-application workloads we also used sequential applications from SPEC CPU2006.

We classified applications according to their architectural properties: memory-intensive (MI) or compute-intensive (CI), as well as according to their parallelism: highly parallel (HP), partially sequential (PS) and single-threaded (ST). Memory-intensity was important for a fair comparison with Static-IPC-Driven: CI applications have a higher relative speedup on fast cores [16], and so it was important to include applications of both types in the experiments. The parallelism class was determined by tracing execution via OpenSolaris' DTrace framework and measuring the fraction of time the application spent running with a single runnable thread. Parallel applications where this fraction was greater than 7% were classified as PS, whereas the rest were classified as HP. The ST class includes sequential applications. Table 1 shows the classification of our selected applications according to these classes. The text in parentheses next to the benchmark name indicates the corresponding benchmark suite: O – SPEC OMP2001, P – PARSEC, M – MineBench, N – NAS, C – SPEC CPU2006, and NS – other benchmarks not belonging to any specific suite.
By default, all OpenMP applications were compiled with the native Sun Studio compiler. In order to evaluate the PA Runtime Extensions (Section 3.2) we had to modify the OpenMP runtime system, but the source code for the Sun Studio OpenMP runtime system was not available to us. For that reason, we resorted to using the Linux version of the GCC 4.4 OpenMP runtime system in OpenSolaris². Nevertheless, we observed that the performance of OpenMP applications with Sun Studio and GCC is similar.
Both the OpenMP and POSIX threaded applications used in Sections 3.3 and 3.4 run with adaptive synchronization modes; as such, sequential phases are exposed to the operating system in both cases. In these sections we do not use the runtime extensions with the parallelism-aware algorithms. All OpenMP applications run with the default adaptive synchronization mode used by GCC 4.4 unless otherwise noted (Sun Studio can easily be configured to use a similar adaptive mode). POSIX threaded applications (such as BLAST or bodytrack) use fully blocking modes on all synchronization primitives except those based on POSIX standard mutexes and synchronization barriers, for which an adaptive implementation is provided by OpenSolaris. Unlike OpenMP applications, threads of POSIX applications spin for shorter periods of time before blocking on those adaptive synchronization primitives (these are the default parameters used in OpenSolaris).

²Using this version of the runtime system required augmenting OpenSolaris with a Linux-compatible sys_futex syscall.

[Figure 2: Variations in the sequential fraction seen by the OS when varying the synchronization mode and blocking threshold. Bars compare the sleep, adaptive (1m, 10m and 100m) and spin modes for each benchmark.]

For Section 3.2 we selected ten OpenMP applications: art, equake, applu, fma3d, ammp, FT, MG, scalparc, semphy and RNA. These applications were chosen to cover a wide variety of sequential portions. In the overhead section we analyze ten parallel applications across the aforementioned classes: three HP-CI (RNA, wupwise and vips), two HP-MI (swim and applu), three PS-CI (swaptions, bodytrack and BLAST) and two PS-MI applications (TPC-C and FFTW).

For Section 3.3, we constructed two sets of multi-application workloads. The first set, shown in Table 2, comprises twelve representative pairs of benchmarks across the categories mentioned above. For the sake of completeness, we also experimented with multi-application workloads with more than two applications: Table 3 shows this second set, consisting of six workloads.

3.2 PA Runtime Extensions
We begin by investigating the effect on performance of different synchronization waiting modes under the PA scheduler. In these experiments we demonstrate that using a low blocking threshold effectively exposes sequential phases to the scheduler, but performance can also suffer if the threshold is set too low. We then evaluate PA-RTX and show that it offers performance comparable to purely adaptive approaches and in some cases even improves on it.

In the following experiment we used the 1FC-12SC configuration and tested three different waiting modes: spin, sleep and adaptive. In spin mode unused threads busy-wait the entire time; in sleep mode, they block immediately. We studied the effects of the various synchronization modes on all asymmetry-aware schedulers, but since our results showed that across the schedulers the effects are largely the same, we present the data for the PA scheduler only.
[Figure 1: Speedup from PA using sleep, spin and adaptive modes with different blocking thresholds. Bars compare PA (sleep), PA (adaptive 1m), PA (adaptive 10m), PA (adaptive 100m) and PA (spin) for each benchmark.]

[Figure 3: Speedup from PA with Runtime Extensions. Bars compare PA-base (spin), PA-RTX (master thread), PA-RTX (spin-notif 1k, 10k, 100k and 1m), PA-base (adaptive 10m) and PA-base (adaptive 1m) for RNA, equake, art, applu, fma3d, ammp, FT, MG, scalparc and semphy.]

Figure 1 shows the results. PA runtime extensions are not used in this case. When the spin mode is used, the base PA algorithm delivers hardly any speedup, because it is not aware of the sequential phases. With the adaptive and sleep modes, applications on the right side of the chart experience noticeable speedup. They have large sequential phases, and switching to the adaptive or sleep mode exposes these phases to the scheduler and enables their acceleration on the fast core. Applications on the left side of the chart, however, experience performance degradation. Those with the highest overhead (RNA, equake and applu) run frequent short parallel loops. Despite these being well-balanced applications, the asymmetry of the platform causes the thread running on a fast core to complete its share of these loops earlier. In this case, the sleep mode makes the fast core become idle very often, triggering frequent migrations that introduce substantial overheads. An adaptive mode alleviates this issue, but the blocking threshold must be sufficiently large to remove the overheads completely.
Figure 2 shows how the fraction of time spent in sequential phases, as seen by the OS, changes for different blocking thresholds. This further underscores that the blocking threshold for the adaptive mode must be chosen carefully: choosing a very large threshold reduces the visibility of sequential phases for the OS, but a small one causes overhead, as shown in Figure 1.
We now evaluate the PA algorithm with Runtime Extensions (PA-RTX). We test the spin-then-notify synchronization mode using several spin thresholds. The blocking threshold is set at 100m iterations in all experiments. We also test the feature permitting the application to expose the "master thread". These scenarios are compared with the Default scheduler, where applications use the adaptive mode, and with the base PA algorithm (no RTX) using the spin mode (PA-base (spin)) and the adaptive mode (PA-base (adaptive)). For PA-base (adaptive) we use the two best blocking thresholds: 10m and 1m. Figure 3 shows the results.

Overall, we conclude that PA-RTX is less sensitive to the choice of thresholds than PA-base (adaptive). PA-base (adaptive) hurts performance for several applications, by as much as 26%(!) when a low value of the blocking threshold is used. PA-RTX hurts performance by 4% at most and only in one case, and that happens when the spin threshold is set to an extremely small value of 1K iterations. With adaptive synchronization, a trade-off must be made