`Scalability Bottlenecks on Asymmetric Multicore
`Processors
`
`Juan Carlos Saez
`Complutense University,
`Madrid, Spain
`jcsaezal@fdi.ucm.es
`Manuel Prieto
`Complutense University,
`Madrid, Spain
`mpmatias@dacya.ucm.es
`
`Alexandra Fedorova
`Simon Fraser University,
`Vancouver BC, Canada
`fedorova@cs.sfu.ca
`Hugo Vegas
`Complutense University,
`Madrid, Spain
`hugovegas@fdi.ucm.es
`
`ABSTRACT
`Asymmetric multicore processors (AMP) promise higher per-
`formance per watt than their symmetric counterparts, and
`it is likely that future processors will integrate a few fast
`out-of-order cores, coupled with a large number of simpler,
`slow cores, all exposing the same instruction-set architec-
`ture (ISA). It is well known that one of the most effective
`ways to leverage the effectiveness of these systems is to use
`fast cores to accelerate sequential phases of parallel appli-
`cations, and to use slow cores for running parallel phases.
`At the same time, we are not aware of any implementation
`of this parallelism-aware (PA) scheduling policy in an op-
`erating system. So the questions as to whether this policy
`can be delivered efficiently by the operating system to un-
`modified applications, and what the associated overheads
`are remain open. To answer these questions we created two
`different implementations of the PA policy in OpenSolaris
`and evaluated it on real hardware, where asymmetry was
`emulated via CPU frequency scaling. This paper reports
`our findings with regard to benefits and drawbacks of this
`scheduling policy.
`
`Categories and Subject Descriptors
`D.4.1 [Process Management]: Scheduling
`
`General Terms
`Performance, Measurement, Algorithms
`
`Keywords
`Asymmetric multicore, Scheduling, Operating Systems
`
`Permission to make digital or hard copies of all or part of this work for
`personal or classroom use is granted without fee provided that copies are
`not made or distributed for profit or commercial advantage and that copies
`bear this notice and the full citation on the first page. To copy otherwise, to
`republish, to post on servers or to redistribute to lists, requires prior specific
`permission and/or a fee.
`CF’10, May 17–19, 2010, Bertinoro, Italy.
`Copyright 2010 ACM 978-1-4503-0044-5/10/05 ...$10.00.
`
`INTRODUCTION
`1.
`An asymmetric multicore processor (AMP) includes cores
`exposing the same instruction-set architecture, but differing
`in features, size, speed and power consumption [2, 8]. A
`typical AMP would contain a number of simple, small and
`low-power slow cores and a few complex, large and high-
`power fast cores. It is well known that AMP systems can
`mitigate scalability bottlenecks in parallel applications by
`accelerating sequential phases on fast cores [2, 7, 12].
`To leverage this potential of AMP systems, threads must
`be mapped to cores in consideration of the amount of paral-
`lelism in the application: if an application is highly parallel
`its threads should be mapped to slow cores, but if the ap-
`plication is sequential or is executing a sequential phase its
`thread should be mapped to a fast core. A natural place for
`this Parallelism-Aware (PA) policy is in the operating sys-
`tem. This way, many applications can reap its benefits, po-
`tentially without requiring any modifications, and the shar-
`ing of scarce fast cores among multiple applications can be
`fairly arbitrated by the operating system. To the best of our
`knowledge, there are no OS-level implementations of the PA
`scheduling policy. As a result, many questions regarding the
`effectiveness and practicality of this policy remain open.
`One open question is how can the operating system effec-
`tively detect sequential phases in applications? In some ap-
`plications unused threads block during the sequential phase,
`and by monitoring the application’s runnable thread count,
`which is exposed to the OS by most threading libraries, the
`scheduler can trivially detect a sequential phase. In other
`applications, however, unused threads busy-wait (or spin)
`during short periods of time, and so the OS cannot de-
`tect these phases simply by monitoring the runnable thread
`count. To address these scenarios we designed PA Runtime
`Extensions (PA-RTX) – an interface and library enhance-
`ments enabling the threading library to notify the scheduler
`when a thread spins rather than doing useful work. We im-
`plemented PA-RTX in a popular OpenMP runtime, which
`required only minimal modifications to support them, but
`the extensions are general enough to be used with other
`threading libraries.
`Another open question is the overhead associated with the
`PA policy. Any policy that prioritizes fast cores to specific
`
`Petitioner Samsung Ex-1025, 0001
`
`
`
`threads is bound to generate migration overheads – a per-
`formance degradation that occurs when a thread is moved
`from one core to another. Performance degradation results
`from the loss of cache state accumulated on the thread’s old
`core. Upon evaluating these overheads we found that they
`can be significant (up to 18%) if the fast core is placed in
`a different memory hierarchy domain from slow cores, but
`a hardware configuration where a fast core shares a mem-
`ory hierarchy domain with several slow cores coupled with a
`topology-aware scheduler practically eliminates these over-
`heads.
`We evaluate the PA policy on a real multicore system,
`where “slow” cores were emulated by reducing the clock fre-
`quency on the processors, while “fast” cores were configured
`to run at regular speed. We find that the main benefits from
`the PA policy are derived for multi-application workloads
`and when the number of fast cores relative to slow cores is
`small. In this case, it delivers speedups of up to 40% relative
`to the OpenSolaris default asymmetry-agnostic scheduler.
`Previously proposed asymmetry-aware algorithms, which we
`used for comparison, also do well in some cases, but unlike
`our parallelism-aware algorithms they do not perform well
`across the board, because they fail to consider the paral-
`lelism of the application.
`The key contribution of our work is the evaluation of the
`operating system technology enabling next-generation asym-
`metric systems. We are not aware of previous studies investi-
`gating the benefits and drawbacks of the PA scheduling pol-
`icy implemented in a real OS. Our findings provide insights
`for design of future asymmetry-aware operating systems and
`asymmetric hardware alike.
`The rest of the paper is structured as follows. Section 2
`presents the design and implementation of the PA schedul-
`ing algorithm. Section 3 presents experimental results. Sec-
`tion 4 discusses related work. Section 5 summarizes our
`findings and discusses future work.
`
`2. DESIGN AND IMPLEMENTATION
`In Section 2.1 we describe two parallelism-aware algo-
`rithms proposed in this work: PA and MinTLP. In Sec-
`tion 2.2 we describe the runtime extensions to PA (PA-
`RTX). A brief description of other asymmetry-aware algo-
`rithms that we use for comparison is provided in Section 2.3.
`2.1 PA and MinTLP algorithms
`Our algorithms assume an AMP system with two core
`types: fast and slow. Previous studies concluded that sup-
`porting only two core types is optimal for achieving most of
`the potential gains on AMP [8]; so we expect this configu-
`ration to be typical of future systems. More core types may
`be present in future systems due to variations in the fabrica-
`tion process. In that case, scheduling must be complemented
`with other algorithms, designed specifically to address this
`problem [18].
`The goal of the algorithm is to decide which threads should
`run on fast cores and which on slow cores. In MinTLP, this
`decision is straightforward: the algorithm selects applica-
`tions with the smallest thread-level parallelism (hence the
`name MinTLP) and maps threads of these applications to
`fast cores. Thread-level parallelism is determined by exam-
`ining the number of runnable (i.e., not blocked) threads. If
`not enough fast cores are available to accommodate all these
`threads, some will be left running on slow cores. MinTLP
`
`makes no effort to fairly share fast cores among all “eligible”
`threads. This algorithm is very simple, but not always fair.
`The other proposed algorithm, PA, is more sophisticated.
`It classifies threads dynamically into several categories: MP,
`HP, and SP. The MP (mildly parallel) category includes
`threads belonging to applications with a low degree of thread-
`level parallelism, including the single-threaded applications.
`The HP category includes threads belonging to highly par-
`allel applications. The MP threads will run primarily on
`fast cores, and the HP threads will run primarily on slow
`cores. Threads of applications whose runnable thread count
`exceeds hp_threshold fall into the HP category, the remain-
`ing threads fall into the MP category.
`A special class SP is reserved for threads of parallel ap-
`plications that have just entered a sequential phase. These
`threads will get the highest priority for running on fast cores:
`this provides more opportunities to accelerate sequential
`phases. To avoid monopolizing fast cores, SP threads are
`downgraded by the scheduler into the MP class after spend-
`ing amp_boost_ticks scheduling clock ticks in the SP class.
`If there are not enough cores to run all SP and MP threads
`on fast cores, the scheduler will run some of the threads on
`slow cores, to preserve load balance. SP threads have a
`higher priority in using fast cores. The remaining fast cores
`will be shared among MP threads in a round-robin fashion.
`The scheduler keeps track of the count of runnable threads
`in each applicationto detect transitions between the afore-
`mentioned classes and perform thread-to-core mapping ad-
`justments accordingly. To avoid premature migrations and
`preserve load balance, PA integrates a thread swapping mech-
`anism to perform those adjustments periodically, instead of
`reacting to those transitions immediately (MinTLP also in-
`tegrates a similar swapping mechanism).
`When the change in the thread-level parallelism cannot be
`determined via the monitoring of the runnable thread count,
`PA relies on the Runtime Extensions, described in the next
`Section. We must also highlight that despite the fact that
`our evaluation has been focused on multi-threaded single-
`process applications, the PA and MinTLP algorithms can be
`easily extended to support multi-process applications using
`high-level abstractions provided by the operating system,
`such as process sets.
`Although sensitivity of the PA algorithm to its config-
`urable parameters was studied, we are unable to provide
`the results due to space constraints. We found, however,
`that it is generally easy to choose good values for these pa-
`rameters. After performing such a sensitivity study, we set
`amp_boost_ticks to one hundred timeslices (1 second) and
`hp_threshold to one greater than the number of fast cores.
`These values ensure acceleration of sequential phases with-
`out monopolizing fast cores.
`
`2.2 PA Runtime Extensions
`The base PA algorithm introduced so far relies on moni-
`toring runnable thread count to detect transitions between
`serial and parallel phases in the application. However, con-
`ventional synchronization primitives found in most thread-
`ing libraries use an adaptive two-phase approach where un-
`used threads busy wait for a while before blocking to reduce
`context-switching overheads. While blocking is coordinated
`with the OS, making it possible to detect phase transitions,
`spinning is not. Reducing the spinning phase enables the
`OS to detect more serial phases. However, in our context
`
`Petitioner Samsung Ex-1025, 0002
`
`
`
`it may also lead to excessive migrations and cause substan-
`tial overheads (as soon as a fast core becomes idle PA and
`MinTLP will immediately migrate a thread to this core). In
`the event these busy-waiting phases are frequent, it is help-
`ful to give the scheduler some hints that would help it to
`avoid mapping spinning threads to fast cores. To that end,
`we propose two optimizations, which can be implemented in
`the threading library (applications themselves need not be
`changed).
`2.2.1 Spin-then-notify mode
`Our first proposal is a new spin-then-notify waiting mode
`for synchronization primitives. Its primary goal is to avoid
`running spinning threads on fast cores and save these “power-
`hungry” cores for other threads. In this mode the synchro-
`nization primitive notifies the operating system via a system
`call after a certain spin threshold that the thread is busy-
`waiting rather than doing useful work. Upon notification,
`the PA scheduler marks this thread as a candidate for migra-
`tion to slow cores. We have opted to mark threads as migra-
`tion candidates instead of forcing an immediate migration
`since this approach avoids premature migrations and allows
`a seamless integration with the PA and MinTLP swapping
`mechanisms. The synchronization primitive also notifies the
`scheduler when a spinning thread finishes the busy wait. In
`Section 3.2 we explore the advantages of using the new spin-
`then-notify mode. For this purpose we have modified the
`OpenMP runtime system to include this new mode in the
`basic waiting function used by high-level primitives such as
`mutexes or barriers.
`Another potentially useful feature of this primitive may
`arise in the context of scheduling algorithms that map threads
`on AMP systems based on their relative speedup on fast vs.
`slow cores (see Section 4). These algorithms typically mea-
`sure performance of each thread on fast and slow cores and
`compute its performance ratio, which determines the rela-
`tive speedup [5, 9]. If a thread performs busy-waiting it can
`achieve a very high performance ratio, since a spin loop uses
`the CPU pipeline very efficiently1. As a result, the proposed
`algorithms would map spinning threads to fast cores despite
`they are not doing useful work. Even though these imple-
`mentation issues could be solved via additional hardware
`support [11], a spin-then-notify primitive could help avoid
`the problem without needing extra hardware.
`2.2.2 Exposing the master thread
`We have also investigated a simple but effective optimiza-
`tion allowing the application to communicate to the ker-
`nel that a particular thread must have a higher priority
`in running on a fast core. This optimization was inspired
`by the typical structure of OpenMP do-all applications. In
`these applications, there is usually a master thread that is in
`charge of the explicit serial phases at the beginning, in be-
`tween parallel loops, and at the end of the application (apart
`from being in charge of its share of the parallel loops). Iden-
`tifying this master thread to the kernel enables the scheduler
`to give it a higher priority on the fast core simply because
`this thread will likely act as the “serial” thread. This hint can
`speed up do-all applications even without properly detect-
`ing serial phases. Our PA Runtime Extensions enable the
`
`1Best practices in implementing spinlocks dictate using al-
`gorithms where a thread spins on a local variable [1], which
`leads to a high instruction throughput.
`
`runtime system to identify the master thread to the sched-
`uler via a new system call. If the pattern of the application
`changes and another thread gets this responsibility, the same
`system call can be used to update this information.
`To evaluate this feature, we have modified the OpenMP
`runtime system to automatically identify the thread execut-
`ing the main function as the master thread to the kernel,
`right after initializing the runtime environment. In the same
`way as the implementation of spin-notify mode, only the
`OpenMP library needs to be modified, not requiring any
`change in the applications themselves. Upon receiving this
`notification, the PA scheduler tries to ensure that the master
`thread runs on a fast core whenever it is active, but without
`permanently binding the thread to that core as would be
`done with other explicit mechanisms based on thread affini-
`ties. This way, PA still allows different threads to compete
`for fast cores according to its policies.
`2.3 The other schedulers
`We compare PA and MinTLP to three other schedulers
`proposed in previous work. Round-Robin (RR) equally shares
`fast and slow cores among all threads [5]. BusyFCs is a
`simple asymmetry-aware scheduler that guarantees that fast
`cores never go idle before slow cores [4]. Static-IPC-Driven,
`which we describe in detail below, assigns fast cores to those
`threads that experience the greatest relative speedup (in
`terms of instructions per second) relative to running on slow
`cores [5]. We implemented all these algorithms in Open-
`Solaris. Our baseline for comparison is the asymmetry-
`agnostic default scheduler in OpenSolaris, referred to here-
`after as Default.
`The Static-IPC-Driven scheduler is based on the design
`proposed by Becchi and Crowley [5]. Thread-to-core as-
`signments in that algorithm are done based on per-thread
`IPC ratios (quotients of IPCs on fast and slow cores), which
`determine the relative benefit of running a thread on a par-
`ticular core type. Threads with the highest IPC ratios are
`scheduled on fast cores while remaining threads are sched-
`uled on slow cores. In the original work [5], the IPC-driven
`scheduler was simulated. This scheduler samples threads’
`IPC on cores of all types whenever a new program phase is
`detected. Researchers who attempted an implementation of
`this algorithm found that such sampling caused large over-
`heads, because frequent cross-core thread migrations were
`required [16]. To avoid these overheads, we have imple-
`mented a static version of the IPC-driven algorithm, where
`IPC ratios of all threads are measured a priori. This makes
`IPC ratios more accurate in some cases [16] and eliminates
`much of the runtime performance overhead. Therefore, the
`results of the Static-IPC-Driven scheduler are somewhat op-
`timistic and the speedups of PA and MinTLP relative to
`Static-IPC-Driven are somewhat pessimistic.
`
`3. EXPERIMENTS
`The evaluation of the PA algorithm was performed on
`an AMD Opteron system with four quad-core (Barcelona)
`CPUs. The total number of cores was 16. Each core has
`private 64KB instruction and data caches, and a private L2
`cache of 512KB. A 2MB L3 cache is shared by the four cores
`on a chip. The system has a NUMA architecture. Access
`to a local memory bank incurs a shorter latency than access
`to a remote memory bank. Each core is capable of running
`at a range of frequencies from 1.15 GHz to 2.3 GHz. Since
`
`Petitioner Samsung Ex-1025, 0003
`
`
`
`Table 1: Classification of selected applications.
`
`Categories
`HP-CI
`HP-MI
`PS-CI
`PS-MI
`ST-CI
`ST-MI
`
`Benchmarks
`EP(N), vips(P), fma3d(O), ammp(O), RNA(I), scalparc(M), wupwise (O)
`art(O), equake(O), applu(O), swim(O)
`BLAST(NS), swaptions(P), bodytrack(P), semphy(M), FT(N)
`MG(N), TPC-C(NS), FFTW(NS)
`gromacs(C), sjeng(C), gamess(C), gobmk(C), h264ref(C), hmmer(C), namd(C)
`astar(C), omnetpp(C), soplex(C), milc(C), mcf(C), libquantum(C)
`
`Table 2: Multi-application workloads, Set #1.
`
`Workload name Benchmarks
`STCI-PSMI
`gamess, FFTW (12,15)
`STCI-PSCI
`gamess, BLAST (12,15)
`STCI-PSCI(2)
`hmmer, BLAST (12,15)
`STCI-HP
`gamess, wupwise (12,15)
`STCI-HP(2)
`gobmk, EP (12,15)
`STMI-PSMI
`mcf, FFTW (12,15)
`STMI-PSCI
`mcf, BLAST (12,15)
`STMI-HP
`astar, EP (12,15)
`PSMB-PSCI
`FFTW (6,8), BLAST (7,8)
`PSMB-HP
`FFTW (6,8), wupwise_m (7,8)
`PSCI-HP
`BLAST (6,8), wupwise_m (7,8)
`PSCI-HP(2)
`semphy (6,8), EP (7,8)
`
`each core is within its own voltage/frequency domain, we are
`able to vary the frequency for each core independently. We
`experimented with asymmetric configurations that use two
`core types: “fast” (a core set to run at 2.3 GHz) and “slow” (a
`core set to run at 1.15 GHz). We also varied the number of
`cores in the experimental configurations by disabling some
`of the cores.
`We used three AMP configurations in our experiments:
`(1) 1FC-12SC – one fast core and 12 slow cores, the fast
`core is on its own chip and the other cores on that chip are
`disabled; (2) 4FC-12SC – four fast cores and 12 slow cores,
`each fast core is on a chip with three slow cores; (3) 1FC-
`3SC – one fast core, three slow cores, all on one chip. Not
`all configurations are used in all experiments.
`Although thread migrations can be effectively exploited by
`asymmetry-aware schedulers (e.g. to map sequential parts
`of parallel applications on fast cores), the overhead that they
`may introduce can lead to performance degradation. Since
`we also aim to assess the impact of migrations on perfor-
`mance we opted to select the default asymmetry-unaware
`scheduler used in OpenSolaris (we refer to it as Default
`henceforth) as our baseline scheduler. Despite Default keeps
`threads on the same core for most of the execution time and
`thus minimizes thread migrations, its asymmetry-unawareness
`leads it to offer much more unstable results from run to run
`than the ones observed for the other schedulers. For that
`reason, a high number of samples were collected for this
`scheduler, in an attempt to capture the average behavior
`more accurately. Overall, we found that Default usually
`fails to schedule single-threaded applications and sequential
`phases of parallel application on fast cores, especially when
`the number of fast cores is much smaller than the number
`of slow cores, such as on the 1FC-12SC and 4FC-12SC con-
`figurations.
`
`We evaluate the base implementation of the PA algorithm
`as well PA with Runtime Extensions. We compare PA to
`RR, BusyFCs, Static-IPC-Driven, Min-TLP and to Default.
`In all experiments, each application was run a minimum of
`three times, and we measure the average completion time.
`The observed variance was small in most cases (so it is not
`reported) and where it was large we repeated the experi-
`ments for a larger number of trials until the variance reached
`a low threshold. In multi-application workloads the appli-
`cations are started simultaneously and when an application
`terminates it is restarted repeatedly until the longest appli-
`cation in the set completes at least three times. We report
`performance as the speedup over Default. The geometric
`mean of completion times of all executions for a benchmark
`under a particular asymmetry-aware scheduler is compared
`to that under Default, and percentage speedup is reported.
`In all experiments, the total number of threads (sum of the
`number of threads of all applications) was set to match the
`number of cores in the experimental system, since this is how
`runtime systems typically configure the number of threads
`for the CPU-bound workloads that we considered [19].
`Our evaluation section is divided into four parts. In Sec-
`tion 3.1 we introduce the applications and workloads used
`for evaluation. In Section 3.2 we evaluate PA runtime ex-
`tensions. In Section 3.3 we evaluate multi-application work-
`loads. Finally, in Section 3.4 we study the overhead.
`
`3.1 Workload selection
`We used applications from PARSEC [6], SPEC OMP2001,
`NAS [3] Parallel Benchmarks and MineBench [13] bench-
`mark suites, as well as the TPC-C benchmark implemented
`over Oracle Berkeley DB [14], BLAST – a bioinformatics
`benchmark, FFT-W – a scientific benchmark performing
`the fast Fourier transform, and RNA – an RNA sequencing
`
`Petitioner Samsung Ex-1025, 0004
`
`
`
`Table 3: M u lti-a pplica tion w o rkload s, Set #2.
`
`Worklo a d n a m e
`Benchmarks
`gamess, h264ref, astar , s oplex, wupwise (12)
`2STCI- 2STMI-1HP
`gromacs , gamess, namd, gobmk, EP (12)
`4STCI-1HP
`gamess, hmmer, gobmk, s opl ex, semphy (12)
`3STCI-1STMI-1P SCI
`2STCI-1STMI-1P SMI-1HP gamess, h264ref, s oplex, FFTW (6), equake (7)
`3STCl-3STMI-1HP
`gromacs , s j eng, h264ref, l i bquantum, milc, omnetpp, EP (10)
`3STCl-3STMI-1P SCI
`gromacs , s j eng, h264ref, l ibquantum, milc, omnetpp, BLAST (10)
`
`application. For multi-applica tion workloads we also used
`sequent ial applicat ions from SPEC CPU2006.
`We classified applications according t o their architect ural
`properties: memory-intensive (MI) or compute-intensive (CI),
`as well as according to their parallelism: highly parallel
`(HP), partially sequential (P S) and single-threaded (ST ).
`Memory-int ensity was import ant for fair comparison with
`Stat ic-IPC-Driven. CI applications have a higher relat ive
`speedup on fast cores [16) and so it was important to include
`applications of bot h types in the experiments. P arallelism
`class was determined by tracing execution via OpenSolaris'
`DTrace framework and measuring the fraction of t ime the
`application spent running with a single runnable thread.
`Parallel applications where this fraction was great er t han 7%
`were classified as PS, whereas the rest were classified as HP.
`T he ST class includes sequential applications. T able 1 shows
`the classifica tion of our select ed applica tions according to
`these classes. T he text in parentheses next to t he bench(cid:173)
`mark name indicates the corresponding benchmark suite: 0
`-SP EC OMP2001, P- PARSEC, M - Minebench, N- NAS,
`C - SP EC CP U2006, and NS - other benchmarks not be(cid:173)
`longing to any specific suite.
`By default, all OpenMP applicat ions were compiled with
`the nat ive Sun St udio compiler. In order t o evalua te PA
`Runt ime Extensions (Section 3.2) we had to modify the
`OpenMP runtime system but the source code for the Sun
`Studio OpenMP runt ime system was not available to us. For
`that reason, we resorted to using t he Linux version of the
`GCC 4.4 OpenMP runtime system in OpenSolaris2
`• Never(cid:173)
`theless, we observed that the performance of OpenMP ap(cid:173)
`plications wit h Sun Studio and GCC is similar.
`Both OpenMP and POSIX threaded applica tions used
`in section 3.3 and 3.4 run with adapt ive synchronization
`modes; as such sequential phases are exposed to the oper(cid:173)
`ating system in both cases. In t hese sect ions we do not use
`runtime extensions wit h parallelism-aware algorit hms. All
`OpenMP applications run with t he default adapt ive syn(cid:173)
`chronization mode used by GCC 4.4 unless otherwise noted
`(Sun St udio can be easily configured to use a similar adap(cid:173)
`tive mode). P OSIX threaded applications (such as BLAST
`or bodytrack) use full blocking modes on all synchroniza(cid:173)
`tion primitives but on t hose related to P OSIX standard mu(cid:173)
`texes and synchroniza tion barriers, where an adapt ive im(cid:173)
`plementation is provided by OpenSolaris. Unlike OpenMP
`applications, threads of P OSIX applications spin for shorter
`periods of time before blocking on those adaptive synchro(cid:173)
`niza tion primitives (these are t he default parameters used
`in OpenSolaris) .
`
`2Using such a version of the runt ime system required aug(cid:173)
`ment ing OpenSolaris with a Linux compatible sys_futex
`syscall)
`
`7Cl"Ai
`
`□ Sleep
`
`~
`
`6Cl"Ai
`
`-s > SO"Ai
`
`4Cl"Ai
`
`.0
`C:
`(IJ
`~
`C:
`0
`tl
`~ 2Cl"Ai
`
`■ Adaptive(lm)
`
`□ Adaptive(lOm)
`
`13Adaptive(100m)
`
`a spin
`
`3Cl"Ai
`
`.. ~ l O"Ai
`
`(IJ
`::,
`,;r
`(IJ
`V,
`
`Cl"Ai
`
`<(
`z
`a:
`
`cu
`.><
`"'
`:::,
`C'
`cu
`
`"O
`
`t:
`:::,
`"' a. m
`"'
`a.
`.E
`"'
`
`a.
`E
`E
`"'
`
`I-
`u.
`
`I.!)
`::iE
`
`e
`>,
`"' ~
`a.
`a. E
`~
`Bl
`
`Figu r e 2: Varia tions in the seq u entia l fractio n seen
`by the OS w h e n va rying the syn ch r oniza t ion mode
`a nd blocking t hreshold.
`
`For Section 3.2 we selected ten OpenMP applications:
`art, applu, fma3d, ammp, FT, MG, s cal parc , semphy and
`RNA . These applica tions were chosen to cover a wide variety
`of sequential port ions. In t he overhead section we analyze
`ten parallel applica tions accross the aforementioned classes:
`three HPCI (RNA , wupwise and vips), two HPMI (swim and
`applu), three PSCI (swaptions, bodytrack and BLAST) and
`two PSMI applicat ions (TPC- C and FFTW) .
`For Sect ion 3.3, we constructed two sets of multi-application
`workloads. T he first set, shown in T able 2, comprises twelve
`representa tive pairs of benchmarks across the previous cat(cid:173)
`egories ment ioned above. For the sake of completeness, we
`experimented wit h additional multi-applicat ion workloads
`wit h more than two applica tions. Table 3 shows this second
`set, consisting of six workloads.
`
`3.2 PA Runtime Extensions
`We begin by invest igating the effect on performance when
`using different synchroniza tion waiting modes under t he PA
`scheduler. In these experiments we demonstrat e t hat us(cid:173)
`ing a low blocking t hreshold effect ively exposes sequential
`phases t o the scheduler, but performance can also suffer if
`the t hreshold is set too low. T hen we evaluate PA-RT X
`and show that it offers comparable performance t o purely
`adaptive approaches and in some cases even improves it .
`In the following experiment we used the 1F C-12SC config(cid:173)
`ura tion and tested three different waiting modes: spin, sleep
`and adaptive. In spin mode unused t hreads busy-wait for
`the entire t ime, in sleep mode, they block immedia tely. We
`studied the effects of various synchronizat ion modes on all
`asymmetry-aware schedulers, but since our results showed
`
`Petitioner Samsung Ex-1025, 0005
`
`
`
`70%
`
`60%
`
`<ti -
`
`~
`50%
`~
`::, 40%
`QJ 30%
`... 20%
`0
`QJ
`> 10%
`0
`a. 0%
`::, 1:: -10%
`~ -20%
`VI
`-30%
`
`-40%
`
`-50%
`
`□ PA (sleep)
`
`■ PA {adapt ive l m)
`D PA {adapt ive 10m)
`1:1 PA {adapt ive 100m )
`a PA (spin)
`
`z
`a:
`
`QJ
`~
`"'
`::,
`a-
`QJ
`
`t::
`"'
`
`"O
`m
`"'
`.E
`
`a.
`E
`E
`"'
`
`I-
`u.
`
`~
`~
`
`~
`"'
`a.
`;;
`:ii
`
`>
`.r:::.
`a.
`E
`
`QJ
`VI
`
`F igur e 1 : Speedup fr o m PA u s ing sleep, spin a n d adap t ive m o d es with d iffer e n t blocking thresholds.
`
`~ 70%
`~
`a, 60%
`
`40%
`
`.=!: o. 50%
`<ti
`]
`....
`:i 30%
`~ a, 20%
`0
`ai 10%
`Ei
`0%
`a.
`~-10%
`QJ
`~-20%
`VI
`
`-30%
`
`E.'I PA-base (spin)
`
`□ PA-RTX (master t hread)
`
`■ PA-RTX (spin-notif l k)
`
`□ PA-RTX (spin-notif 10k)
`
`1:1 PA-RTX (spin-notif 100k)
`
`E:I PA-RTX (spin-notif l m)
`
`D PA-base (adaptive 10m)
`
`B PA-base {adaptive l m)
`
`RNA
`
`equake
`
`a rt
`
`applu
`
`fma3d
`
`amm p
`
`FT
`
`MG
`
`seal pare
`
`semphy
`
`Figu re 3: Speed u p from P A with R untime Exten sion s.
`
`that across t he schedulers the effects are largely t he same,
`we present t he dat a for the PA scheduler only.
`Figure 1 shows t he results. P A runt ime extensions are
`not used in t his case. W hen t he spin mode is used, t he base
`PA algorit hm delivers hardly any speedup , because it is not
`aware of t he sequential phases. With the adapt ive and sleep
`modes, applications on the right side of t he chart experi(cid:173)
`ence noticeable speedup. They have large sequent ial phases
`and switching to the adapt ive or sleep mode exposes t hese
`phases to t he scheduler and enables their acceleration on the
`fast core. Applications on the left side of t he chart, however,
`experience performance degradation. T hose with t he high(cid:173)
`est overhead (RNA, equake and appl u) run frequent short
`parallel loops. Despite being well-balanced applications, the
`asymmetry of t he platform causes t he thread running on a
`fast core to complete its share of these loops earlier. In t his
`case, t he sleep mode makes the fast core become idle very
`often, t riggering frequent migrations t hat int roduce subst an(cid:173)
`tial overheads. An adapt ive mode alleviates this issue, but
`the blocking t hreshold must be sufficiently large to remove
`the overheads complet ely.
`Figure 2 shows how t he fraction of time spent in sequen(cid:173)
`tial phases as seen by the OS changes for different block(cid:173)
`ing thresholds. T his further underscores t hat t he blocking
`threshold for t he adapt ive mode must be chosen carefully:
`
`choosing a very large t hreshold reduces t he visibility of se(cid:173)
`quential phases for t he OS, but a small one causes overhead,
`as shown in Figure 1.
`We now evaluate the PA algorithm wit h Runtime Ex(cid:173)
`tensions (PA-RTX). We t est t he spin-then-notify synchro(cid:173)
`nization mode using several spin t hresholds. The blocking
`threshold is set at 100m iterations in all experiments. We
`also test t he feature permitting the application to expose
`the "master thread". T hese scenarios are compared with
`the Default scheduler where applications use t he adapt ive
`mode, and with t he base PA algorithm (no RTX) using the
`spin mode (PA-base (spin)) and the adapt ive mode (PA-base
`(adaptive)). For PA-base (adapt ive) we use the best block(cid:173)
`ing t hresholds: 10m and lm respectively. F igure 3 shows
`the results.
`Overall, we conclude that PA-RTX is less sensit ive to
`the choice of t hresholds t han PA-base (adapt ive). PA-base
`(adaptive) hurts performance for several applications, up to
`a maximum of as much as 26%(!) when a low value of the
`blocking threshold is used. PA-RTX hurts per formance by
`4% at most and only in one case, and that happens when the
`spin threshold is set to an extremely small value of lK it er(cid:173)
`ations. W ith adapt ive synchronization, a trade-off must b e
`mad