Operating System Support for Mitigating Software
Scalability Bottlenecks on Asymmetric Multicore
Processors

Juan Carlos Saez
Complutense University,
Madrid, Spain
jcsaezal@fdi.ucm.es

Manuel Prieto
Complutense University,
Madrid, Spain
mpmatias@dacya.ucm.es

Alexandra Fedorova
Simon Fraser University,
Vancouver BC, Canada
fedorova@cs.sfu.ca

Hugo Vegas
Complutense University,
Madrid, Spain
hugovegas@fdi.ucm.es

ABSTRACT

Asymmetric multicore processors (AMP) promise higher performance per watt than their symmetric counterparts, and it is likely that future processors will integrate a few fast out-of-order cores coupled with a large number of simpler, slow cores, all exposing the same instruction-set architecture (ISA). It is well known that one of the most effective ways to exploit these systems is to use the fast cores to accelerate sequential phases of parallel applications and the slow cores to run parallel phases. At the same time, we are not aware of any implementation of this parallelism-aware (PA) scheduling policy in an operating system. The questions of whether this policy can be delivered efficiently by the operating system to unmodified applications, and what the associated overheads are, remain open. To answer these questions we created two different implementations of the PA policy in OpenSolaris and evaluated them on real hardware, where asymmetry was emulated via CPU frequency scaling. This paper reports our findings with regard to the benefits and drawbacks of this scheduling policy.

Categories and Subject Descriptors
D.4.1 [Process Management]: Scheduling

General Terms
Performance, Measurement, Algorithms

Keywords
Asymmetric multicore, Scheduling, Operating Systems

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CF'10, May 17–19, 2010, Bertinoro, Italy.
Copyright 2010 ACM 978-1-4503-0044-5/10/05 ...$10.00.

1. INTRODUCTION

An asymmetric multicore processor (AMP) includes cores exposing the same instruction-set architecture but differing in features, size, speed and power consumption [2, 8]. A typical AMP would contain a number of simple, small and low-power slow cores and a few complex, large and high-power fast cores. It is well known that AMP systems can mitigate scalability bottlenecks in parallel applications by accelerating sequential phases on fast cores [2, 7, 12].

To leverage this potential of AMP systems, threads must be mapped to cores in consideration of the amount of parallelism in the application: if an application is highly parallel, its threads should be mapped to slow cores, but if the application is sequential or is executing a sequential phase, its thread should be mapped to a fast core. A natural place for this Parallelism-Aware (PA) policy is the operating system. This way, many applications can reap its benefits, potentially without requiring any modifications, and the sharing of scarce fast cores among multiple applications can be fairly arbitrated by the operating system. To the best of our knowledge, there are no OS-level implementations of the PA scheduling policy. As a result, many questions regarding the effectiveness and practicality of this policy remain open.

One open question is how the operating system can effectively detect sequential phases in applications. In some applications unused threads block during the sequential phase, and by monitoring the application's runnable thread count, which is exposed to the OS by most threading libraries, the scheduler can trivially detect a sequential phase. In other applications, however, unused threads busy-wait (or spin) during short periods of time, so the OS cannot detect these phases simply by monitoring the runnable thread count. To address these scenarios we designed PA Runtime Extensions (PA-RTX) – an interface and library enhancements that enable the threading library to notify the scheduler when a thread spins rather than doing useful work. We implemented PA-RTX in a popular OpenMP runtime, which required only minimal modifications, but the extensions are general enough to be used with other threading libraries.

Another open question is the overhead associated with the PA policy. Any policy that gives specific threads preferential access to fast cores
is bound to generate migration overheads – a performance degradation that occurs when a thread is moved from one core to another. The degradation results from the loss of cache state accumulated on the thread's old core. Upon evaluating these overheads we found that they can be significant (up to 18%) if the fast core is placed in a different memory hierarchy domain from the slow cores, but a hardware configuration where a fast core shares a memory hierarchy domain with several slow cores, coupled with a topology-aware scheduler, practically eliminates these overheads.

We evaluate the PA policy on a real multicore system, where "slow" cores were emulated by reducing the clock frequency of the processors, while "fast" cores were configured to run at regular speed. We find that the main benefits of the PA policy are obtained for multi-application workloads and when the number of fast cores relative to slow cores is small. In this case, it delivers speedups of up to 40% relative to the default asymmetry-agnostic scheduler in OpenSolaris. Previously proposed asymmetry-aware algorithms, which we used for comparison, also do well in some cases, but unlike our parallelism-aware algorithms they do not perform well across the board, because they fail to consider the parallelism of the application.

The key contribution of our work is the evaluation of operating system technology enabling next-generation asymmetric systems. We are not aware of previous studies investigating the benefits and drawbacks of the PA scheduling policy implemented in a real OS. Our findings provide insights for the design of future asymmetry-aware operating systems and asymmetric hardware alike.

The rest of the paper is structured as follows. Section 2 presents the design and implementation of the PA scheduling algorithm. Section 3 presents experimental results. Section 4 discusses related work. Section 5 summarizes our findings and discusses future work.

2. DESIGN AND IMPLEMENTATION

In Section 2.1 we describe the two parallelism-aware algorithms proposed in this work: PA and MinTLP. In Section 2.2 we describe the runtime extensions to PA (PA-RTX). A brief description of other asymmetry-aware algorithms that we use for comparison is provided in Section 2.3.

2.1 PA and MinTLP algorithms

Our algorithms assume an AMP system with two core types: fast and slow. Previous studies concluded that supporting only two core types is optimal for achieving most of the potential gains on AMP [8], so we expect this configuration to be typical of future systems. More core types may be present in future systems due to variations in the fabrication process; in that case, scheduling must be complemented with other algorithms designed specifically to address this problem [18].

The goal of both algorithms is to decide which threads should run on fast cores and which on slow cores. In MinTLP, this decision is straightforward: the algorithm selects the applications with the smallest thread-level parallelism (hence the name MinTLP) and maps the threads of these applications to fast cores. Thread-level parallelism is determined by examining the number of runnable (i.e., not blocked) threads. If not enough fast cores are available to accommodate all these threads, some are left running on slow cores. MinTLP makes no effort to fairly share fast cores among all "eligible" threads. This algorithm is very simple, but not always fair.
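
As a rough illustration of this selection step, the following C sketch sorts applications by runnable thread count and hands fast cores to the least parallel ones first. The app_t structure and the commented assign_to_fast_cores() hook are hypothetical names of our own, not the OpenSolaris implementation.

    /* Hypothetical sketch of MinTLP's fast-core assignment (not the actual
     * OpenSolaris code). Applications are sorted by runnable thread count;
     * fast cores are handed out to the least-parallel ones first. */
    #include <stdlib.h>

    typedef struct {
        int id;
        int runnable_threads;   /* thread-level parallelism estimate */
    } app_t;

    static int by_tlp(const void *a, const void *b)
    {
        return ((const app_t *)a)->runnable_threads -
               ((const app_t *)b)->runnable_threads;
    }

    /* Returns the number of fast-core slots left after the assignment. */
    int mintlp_assign(app_t *apps, int napps, int fast_cores)
    {
        qsort(apps, napps, sizeof(app_t), by_tlp);
        for (int i = 0; i < napps && fast_cores > 0; i++) {
            /* Give this application as many fast cores as it can use;
             * any threads that do not fit stay on slow cores. */
            int granted = apps[i].runnable_threads < fast_cores ?
                          apps[i].runnable_threads : fast_cores;
            fast_cores -= granted;
            /* assign_to_fast_cores(apps[i].id, granted);  -- hypothetical hook */
        }
        return fast_cores;
    }
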
The other proposed algorithm, PA, is more sophisticated. It classifies threads dynamically into several categories: MP, HP, and SP. The MP (mildly parallel) category includes threads belonging to applications with a low degree of thread-level parallelism, including single-threaded applications. The HP (highly parallel) category includes threads belonging to highly parallel applications. MP threads will run primarily on fast cores, and HP threads will run primarily on slow cores. Threads of applications whose runnable thread count exceeds hp_threshold fall into the HP category; the remaining threads fall into the MP category.

A special class, SP, is reserved for threads of parallel applications that have just entered a sequential phase. These threads get the highest priority for running on fast cores, which provides more opportunities to accelerate sequential phases. To avoid monopolizing fast cores, SP threads are downgraded by the scheduler into the MP class after spending amp_boost_ticks scheduling clock ticks in the SP class. If there are not enough fast cores to run all SP and MP threads, the scheduler runs some of these threads on slow cores to preserve load balance; SP threads have a higher priority in using fast cores, and the remaining fast cores are shared among MP threads in a round-robin fashion.

The scheduler keeps track of the count of runnable threads in each application to detect transitions between the aforementioned classes and adjusts the thread-to-core mapping accordingly. To avoid premature migrations and preserve load balance, PA performs those adjustments periodically via a thread swapping mechanism instead of reacting to transitions immediately (MinTLP integrates a similar swapping mechanism).
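
The classification rules above can be summarized in a short C sketch, shown below with the parameter values discussed at the end of this section (an SP boost of 100 clock ticks and an hp_threshold of one more than the number of fast cores). The type names and hooks are hypothetical and are not taken from the OpenSolaris code.

    /* Hypothetical sketch of PA's per-thread classification (MP/HP/SP). */
    #define AMP_BOOST_TICKS 100   /* ticks an SP thread keeps its boost (~1 s) */

    typedef enum { CLASS_MP, CLASS_HP, CLASS_SP } pa_class_t;

    typedef struct {
        pa_class_t cls;
        int        boost_ticks_left;  /* only meaningful while cls == CLASS_SP */
    } pa_thread_t;

    /* Called when the runnable-thread count of the thread's application
     * changes. hp_threshold is typically (number of fast cores + 1). */
    void pa_reclassify(pa_thread_t *t, int app_runnable, int hp_threshold)
    {
        if (app_runnable == 1 && t->cls == CLASS_HP) {
            /* A parallel application just entered a sequential phase:
             * boost the surviving thread so it gets a fast core first. */
            t->cls = CLASS_SP;
            t->boost_ticks_left = AMP_BOOST_TICKS;
        } else if (app_runnable > hp_threshold) {
            t->cls = CLASS_HP;        /* highly parallel: prefer slow cores */
        } else if (t->cls != CLASS_SP) {
            t->cls = CLASS_MP;        /* mildly parallel or single-threaded */
        }
    }

    /* Called on every scheduling clock tick for the running thread. */
    void pa_tick(pa_thread_t *t)
    {
        if (t->cls == CLASS_SP && --t->boost_ticks_left <= 0)
            t->cls = CLASS_MP;   /* downgrade to avoid monopolizing fast cores */
    }

With rules of this shape, a parallel application that blocks all but one of its threads is promoted into SP, and the boost expires after roughly one second of fast-core time, matching the behavior described above.
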
When a change in thread-level parallelism cannot be determined by monitoring the runnable thread count, PA relies on the Runtime Extensions described in the next section. We must also highlight that, although our evaluation focuses on multi-threaded single-process applications, the PA and MinTLP algorithms can easily be extended to support multi-process applications using high-level abstractions provided by the operating system, such as process sets.

We studied the sensitivity of the PA algorithm to its configurable parameters but cannot include the results due to space constraints. We found, however, that it is generally easy to choose good values for these parameters. Based on this sensitivity study, we set amp_boost_ticks to one hundred timeslices (1 second) and hp_threshold to one greater than the number of fast cores. These values ensure acceleration of sequential phases without monopolizing fast cores.

2.2 PA Runtime Extensions

The base PA algorithm introduced so far relies on monitoring the runnable thread count to detect transitions between serial and parallel phases in the application. However, conventional synchronization primitives found in most threading libraries use an adaptive two-phase approach in which unused threads busy-wait for a while before blocking, to reduce context-switching overheads. While blocking is coordinated with the OS, making it possible to detect phase transitions, spinning is not. Shortening the spinning phase enables the OS to detect more serial phases. However, in our context
it may also lead to excessive migrations and cause substantial overheads (as soon as a fast core becomes idle, PA and MinTLP will immediately migrate a thread to it). If these busy-waiting phases are frequent, it is helpful to give the scheduler hints that help it avoid mapping spinning threads to fast cores. To that end, we propose two optimizations, which can be implemented in the threading library (the applications themselves need not be changed).

2.2.1 Spin-then-notify mode

Our first proposal is a new spin-then-notify waiting mode for synchronization primitives. Its primary goal is to avoid running spinning threads on fast cores and to save these "power-hungry" cores for other threads. In this mode the synchronization primitive notifies the operating system via a system call, after a certain spin threshold, that the thread is busy-waiting rather than doing useful work. Upon notification, the PA scheduler marks this thread as a candidate for migration to slow cores. We opted to mark threads as migration candidates instead of forcing an immediate migration because this approach avoids premature migrations and allows seamless integration with the PA and MinTLP swapping mechanisms. The synchronization primitive also notifies the scheduler when a spinning thread finishes its busy wait. In Section 3.2 we explore the advantages of the new spin-then-notify mode. For this purpose we modified the OpenMP runtime system to include the new mode in the basic waiting function used by high-level primitives such as mutexes and barriers.
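
To make the mechanism concrete, the sketch below shows how a waiting function in a threading library might implement spin-then-notify. The sched_spin_hint() call and the SPIN_THRESHOLD value are hypothetical stand-ins for the PA-RTX interface, not the actual system call.

    /* Minimal sketch of a spin-then-notify wait loop (assumed PA-RTX
     * interface; the call name and threshold are not from the paper). */
    #include <stdatomic.h>

    #define SPIN_THRESHOLD 10000   /* spin iterations before notifying the kernel */

    /* Stub standing in for the hypothetical PA-RTX notification system call:
     * 1 = the calling thread is busy-waiting, 0 = it resumed useful work. */
    static void sched_spin_hint(int spinning) { (void)spinning; }

    void wait_for_flag(atomic_int *flag)
    {
        long spins = 0;
        int notified = 0;

        while (!atomic_load_explicit(flag, memory_order_acquire)) {
            if (!notified && ++spins == SPIN_THRESHOLD) {
                sched_spin_hint(1);   /* mark this thread as a migration candidate */
                notified = 1;
            }
            /* A production two-phase primitive would eventually block here. */
        }
        if (notified)
            sched_spin_hint(0);       /* busy-wait finished: clear the mark */
    }
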
Another potentially useful feature of this primitive may arise in the context of scheduling algorithms that map threads on AMP systems based on their relative speedup on fast vs. slow cores (see Section 4). These algorithms typically measure the performance of each thread on fast and slow cores and compute its performance ratio, which determines the relative speedup [5, 9]. If a thread performs busy-waiting it can achieve a very high performance ratio, since a spin loop uses the CPU pipeline very efficiently¹. As a result, these algorithms would map spinning threads to fast cores even though they are not doing useful work. Although such issues could be addressed with additional hardware support [11], a spin-then-notify primitive helps avoid the problem without extra hardware.

¹Best practices in implementing spinlocks dictate using algorithms where a thread spins on a local variable [1], which leads to a high instruction throughput.

2.2.2 Exposing the master thread

We also investigated a simple but effective optimization that allows the application to communicate to the kernel that a particular thread should have a higher priority for running on a fast core. This optimization was inspired by the typical structure of OpenMP do-all applications. In these applications there is usually a master thread that is in charge of the explicit serial phases at the beginning, in between parallel loops, and at the end of the application (apart from executing its share of the parallel loops). Identifying this master thread to the kernel enables the scheduler to give it a higher priority on the fast core, simply because this thread will likely act as the "serial" thread. This hint can speed up do-all applications even without properly detecting serial phases. Our PA Runtime Extensions enable the runtime system to identify the master thread to the scheduler via a new system call. If the pattern of the application changes and another thread takes over this responsibility, the same system call can be used to update this information.

To evaluate this feature, we modified the OpenMP runtime system to automatically identify the thread executing the main function as the master thread to the kernel, right after initializing the runtime environment. As with the implementation of the spin-then-notify mode, only the OpenMP library needs to be modified; no change is required in the applications themselves. Upon receiving this notification, the PA scheduler tries to ensure that the master thread runs on a fast core whenever it is active, but without permanently binding the thread to that core, as would be done with other explicit mechanisms based on thread affinities. This way, PA still allows different threads to compete for fast cores according to its policies.
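
The sketch below illustrates how an OpenMP runtime could register the master thread with the kernel right after initialization. sched_set_master() and the hook name are hypothetical names for the new system call and the runtime entry point described above.

    /* Hypothetical sketch: registering the master thread with the PA
     * scheduler right after the OpenMP runtime environment is initialized. */

    /* Stub for the assumed PA-RTX system call; 0 means "the calling thread". */
    static int sched_set_master(int tid) { (void)tid; return 0; }

    void openmp_runtime_init_hook(void)
    {
        /* Called by the thread executing main() during runtime initialization,
         * so the caller is the master thread. If another thread later takes
         * over the serial role, it can issue the same call to update the
         * kernel's notion of the master. */
        sched_set_master(0);
    }
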
2.3 The other schedulers

We compare PA and MinTLP to three other schedulers proposed in previous work. Round-Robin (RR) shares fast and slow cores equally among all threads [5]. BusyFCs is a simple asymmetry-aware scheduler that guarantees that fast cores never go idle before slow cores [4]. Static-IPC-Driven, which we describe in detail below, assigns fast cores to those threads that experience the greatest relative speedup (in terms of instructions per second) when running on fast rather than slow cores [5]. We implemented all these algorithms in OpenSolaris. Our baseline for comparison is the asymmetry-agnostic default scheduler in OpenSolaris, referred to hereafter as Default.

The Static-IPC-Driven scheduler is based on the design proposed by Becchi and Crowley [5]. Thread-to-core assignments in that algorithm are made based on per-thread IPC ratios (quotients of the IPCs on fast and slow cores), which determine the relative benefit of running a thread on a particular core type. Threads with the highest IPC ratios are scheduled on fast cores while the remaining threads are scheduled on slow cores. In the original work [5], the IPC-driven scheduler was simulated; it samples each thread's IPC on cores of all types whenever a new program phase is detected. Researchers who attempted an implementation of this algorithm found that such sampling caused large overheads, because frequent cross-core thread migrations were required [16]. To avoid these overheads, we implemented a static version of the IPC-driven algorithm, where the IPC ratios of all threads are measured a priori. This makes the IPC ratios more accurate in some cases [16] and eliminates much of the runtime performance overhead. Therefore, the results of the Static-IPC-Driven scheduler are somewhat optimistic and the speedups of PA and MinTLP relative to Static-IPC-Driven are somewhat pessimistic.
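
For reference, the snippet below shows the metric that Static-IPC-Driven relies on: a per-thread IPC ratio computed from instruction and cycle counts measured a priori on each core type. The field and function names are our own illustration, not the interface from [5] or [16].

    /* Hypothetical sketch of the IPC-ratio metric used by Static-IPC-Driven:
     * threads with the highest ratio of fast-core IPC to slow-core IPC are
     * the preferred candidates for fast cores. */
    typedef struct {
        unsigned long long instr_fast, cycles_fast;   /* measured on a fast core */
        unsigned long long instr_slow, cycles_slow;   /* measured on a slow core */
    } ipc_sample_t;

    double ipc_ratio(const ipc_sample_t *s)
    {
        double ipc_fast = (double)s->instr_fast / (double)s->cycles_fast;
        double ipc_slow = (double)s->instr_slow / (double)s->cycles_slow;
        return ipc_fast / ipc_slow;   /* relative speedup estimate (fast vs. slow) */
    }
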

3. EXPERIMENTS

The evaluation of the PA algorithm was performed on an AMD Opteron system with four quad-core (Barcelona) CPUs, for a total of 16 cores. Each core has private 64KB instruction and data caches and a private 512KB L2 cache; a 2MB L3 cache is shared by the four cores on a chip. The system has a NUMA architecture: access to a local memory bank incurs a shorter latency than access to a remote memory bank. Each core is capable of running at a range of frequencies from 1.15 GHz to 2.3 GHz. Since each core is within its own voltage/frequency domain, we are able to vary the frequency of each core independently.

Table 1: Classification of selected applications.

Category  Benchmarks
HP-CI     EP(N), vips(P), fma3d(O), ammp(O), RNA(I), scalparc(M), wupwise(O)
HP-MI     art(O), equake(O), applu(O), swim(O)
PS-CI     BLAST(NS), swaptions(P), bodytrack(P), semphy(M), FT(N)
PS-MI     MG(N), TPC-C(NS), FFTW(NS)
ST-CI     gromacs(C), sjeng(C), gamess(C), gobmk(C), h264ref(C), hmmer(C), namd(C)
ST-MI     astar(C), omnetpp(C), soplex(C), milc(C), mcf(C), libquantum(C)

Table 2: Multi-application workloads, Set #1.

Workload name   Benchmarks
STCI-PSMI       gamess, FFTW (12,15)
STCI-PSCI       gamess, BLAST (12,15)
STCI-PSCI(2)    hmmer, BLAST (12,15)
STCI-HP         gamess, wupwise (12,15)
STCI-HP(2)      gobmk, EP (12,15)
STMI-PSMI       mcf, FFTW (12,15)
STMI-PSCI       mcf, BLAST (12,15)
STMI-HP         astar, EP (12,15)
PSMB-PSCI       FFTW (6,8), BLAST (7,8)
PSMB-HP         FFTW (6,8), wupwise_m (7,8)
PSCI-HP         BLAST (6,8), wupwise_m (7,8)
PSCI-HP(2)      semphy (6,8), EP (7,8)

We experimented with asymmetric configurations that use two core types: "fast" (a core set to run at 2.3 GHz) and "slow" (a core set to run at 1.15 GHz). We also varied the number of cores in the experimental configurations by disabling some of the cores.

We used three AMP configurations in our experiments: (1) 1FC-12SC – one fast core and 12 slow cores, where the fast core is on its own chip and the other cores on that chip are disabled; (2) 4FC-12SC – four fast cores and 12 slow cores, where each fast core is on a chip with three slow cores; (3) 1FC-3SC – one fast core and three slow cores, all on one chip. Not all configurations are used in all experiments.

Although thread migrations can be exploited effectively by asymmetry-aware schedulers (e.g., to map sequential parts of parallel applications on fast cores), the overhead they may introduce can lead to performance degradation. Since we also aim to assess the impact of migrations on performance, we selected the default asymmetry-unaware scheduler used in OpenSolaris (referred to hereafter as Default) as our baseline. Although Default keeps threads on the same core for most of their execution time and thus minimizes thread migrations, its asymmetry unawareness makes its results much less stable from run to run than those observed for the other schedulers. For that reason, we collected a larger number of samples for this scheduler, in an attempt to capture its average behavior more accurately. Overall, we found that Default usually fails to schedule single-threaded applications and sequential phases of parallel applications on fast cores, especially when the number of fast cores is much smaller than the number of slow cores, as in the 1FC-12SC and 4FC-12SC configurations.

We evaluate the base implementation of the PA algorithm as well as PA with the Runtime Extensions. We compare PA to RR, BusyFCs, Static-IPC-Driven, MinTLP and Default. In all experiments, each application was run a minimum of three times, and we measured the average completion time. The observed variance was small in most cases (so it is not reported); where it was large, we repeated the experiments for a larger number of trials until the variance fell below a low threshold. In multi-application workloads the applications are started simultaneously, and when an application terminates it is restarted repeatedly until the longest application in the set has completed at least three times. We report performance as the speedup over Default: the geometric mean of the completion times of all executions of a benchmark under a particular asymmetry-aware scheduler is compared to that under Default, and the percentage speedup is reported. In all experiments, the total number of threads (the sum of the number of threads of all applications) was set to match the number of cores in the experimental system, since this is how runtime systems typically configure the number of threads for the CPU-bound workloads that we considered [19].
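
For clarity, the sketch below computes the reported metric: the geometric mean of completion times under a given scheduler is compared against that under Default and expressed as a percentage speedup. The function names are ours and do not come from the paper's measurement scripts.

    /* Sketch of the reported metric: percentage speedup of a scheduler over
     * Default, using geometric means of per-run completion times. */
    #include <math.h>

    double geomean(const double *times, int n)
    {
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(times[i]);
        return exp(log_sum / n);
    }

    /* Positive result means the scheduler completed the benchmark faster. */
    double speedup_percent(const double *sched_times, int n_sched,
                           const double *default_times, int n_default)
    {
        double t_sched   = geomean(sched_times, n_sched);
        double t_default = geomean(default_times, n_default);
        return (t_default / t_sched - 1.0) * 100.0;
    }
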
Our evaluation section is divided into four parts. In Section 3.1 we introduce the applications and workloads used for evaluation. In Section 3.2 we evaluate the PA runtime extensions. In Section 3.3 we evaluate multi-application workloads. Finally, in Section 3.4 we study the overhead.

3.1 Workload selection

We used applications from the PARSEC [6], SPEC OMP2001, NAS Parallel Benchmarks [3] and MineBench [13] benchmark suites, as well as the TPC-C benchmark implemented over Oracle Berkeley DB [14], BLAST – a bioinformatics benchmark, FFTW – a scientific benchmark performing the fast Fourier transform, and RNA – an RNA sequencing

not optimal for workloads that include multi-threaded applications, as we demonstrated in the experimental section. Nevertheless, these algorithms are very effective for workloads consisting only of single-threaded applications, and so we are investigating algorithms that augment PA with information about relative speedups [15].

In addition to the algorithms that attempt to optimize the use of fast cores by discovering the most "efficient" threads for running on these cores, several simpler asymmetry-aware algorithms have been proposed. An algorithm proposed by Balakrishnan et al. ensured that the fast cores do not go idle before slow cores [4] – this is the same approach used by the BusyFCs algorithm, which we used in the evaluation. An algorithm proposed by Li et al. also used a BusyFCs-like policy, but in addition ensured that the fast cores were given a higher load than slow cores [10]. Neither of these schedulers considered thread-level parallelism when making scheduling decisions.

5. CONCLUSIONS AND FUTURE WORK

In this work we have evaluated the benefits and drawbacks of parallelism-aware scheduling policies for AMP systems. While some algorithms that do not consider TLP perform comparably to the proposed algorithms PA and MinTLP in some scenarios, none of them performs well across the board. PA and MinTLP outperform the other schedulers by as much as 40% in some cases. This indicates the importance of considering TLP in asymmetry-aware scheduling. We have also proposed a small set of runtime extensions that complement the OS scheduler and reduce the possibility of wastefully scheduling busy-waiting threads on fast cores.

Overall, our results have shown that while PA and MinTLP effectively schedule parallel workloads, an asymmetry-aware algorithm that relies on relative speedup brings significant performance improvements for single-threaded applications. Designing a comprehensive scheduler that combines both TLP and relative speedup, to cater to both single-threaded and parallel workloads, is an interesting avenue for future work.

Acknowledgements

This research was funded by the Spanish government's research contracts TIN2008-005089 and the Ingenio 2010 Consolider ESP00C-07-20811, by the HiPEAC2 European Network of Excellence, by the National Science and Engineering Research Council of Canada (NSERC) under the Strategic Project Grant program, and by Sun Microsystems. Juan Carlos Saez is supported by a MEC FPU fellowship grant.

6. REFERENCES

[1] T. Anderson. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors. IEEE TPDS, 1(1):6–16, 1990.
[2] M. Annavaram, E. Grochowski, and J. Shen. Mitigating Amdahl's Law through EPI Throttling. In Proc. of ISCA '05, pages 298–309, 2005.
[3] D. H. Bailey, E. Barszcz, J. T. Barton, et al. The NAS Parallel Benchmarks – Summary and Preliminary Results. In Proc. of Supercomputing '91, pages 158–165, 1991.
[4] S. Balakrishnan, R. Rajwar, M. Upton, and K. Lai. The Impact of Performance Asymmetry in Emerging Multicore Architectures. SIGARCH CAN, 33(2):506–517, 2005.
[5] M. Becchi and P. Crowley. Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures. In Proc. of Computing Frontiers '06, pages 29–40, 2006.
[6] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proc. of PACT '08, October 2008.
[7] M. D. Hill and M. R. Marty. Amdahl's Law in the Multicore Era. IEEE Computer, 41(7):33–38, 2008.
[8] R. Kumar, K. I. Farkas, N. Jouppi, et al. Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction. In Proc. of MICRO 36, 2003.
[9] R. Kumar, D. M. Tullsen, P. Ranganathan, et al. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. In Proc. of ISCA '04, 2004.
[10] T. Li, D. Baumberger, D. A. Koufaty, et al. Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures. In Proc. of SC '07, pages 1–11, 2007.
[11] T. Li, A. R. Lebeck, and D. J. Sorin. Spin Detection Hardware for Improved Management of Multithreaded Systems. IEEE TPDS, 17(6):508–521, 2006.
[12] T. Morad, U. Weiser, and A. Kolodny. ACCMP – Asymmetric Cluster Chip Multi-Processing. TR CCIT, 2004.
[13] R. Narayanan, B. Ozisikyilmaz, J. Zambreno, et al. MineBench: A Benchmark Suite for Data Mining Workloads. 2006.
[14] M. A. Olson, K. Bostic, and M. Seltzer. Berkeley DB. In Proc. of USENIX, pages 43–43, 1999.
[15] J. C. Saez, M. Prieto, A. Fedorova, and S. Blagodourov. A Comprehensive Scheduler for Asymmetric Multicore Systems. In Proc. of ACM EuroSys '10, 2010.
[16] D. Shelepov, J. C. Saez, S. Jeffery, et al. HASS: A Scheduler for Heterogeneous Multicore Systems. ACM Operating Systems Review, 43(2), 2009.
[17] M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt. Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures. In Proc. of ASPLOS '09, pages 253–264, 2009.
[18] R. Teodorescu and J. Torrellas. Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors. In Proc. of ISCA '08, 2008.
[19] R. van der Pas. The OMPlab on Sun Systems. In Proc. of IWOMP '05, 2005.