`
`
`
`Operating Systems and Asymmetric Single-ISA CMPs: The Potential
`for Saving Energy
`
`Jeffrey C. Mogul, Jayaram Mudigonda, Nathan Binkert, Partha Ranganathan, Vanish Talwar
`HP Laboratories Palo Alto
`HPL-2007-140
`August 22, 2007*
`
`
`
`
Keywords: CMP, multi-core, energy savings, operating systems
`
`CPUs consume too much power. Modern complex cores sometimes waste
`power on functions that are not useful for the code they run. In particular,
`operating system kernels do not benefit from many power-consuming features
`that were intended to improve application performance. We propose using
`asymmetric single-ISA CMPs (ASISA-CMPs), multicore CPUs where all cores
`execute the same instruction set architecture but have different performance and
`power characteristics, to avoid wasting power on operating systems code. We
`describe various design choices for both hardware and software, describe Linux
`kernel modifications to support ASISA-CMP, and offer some quantified
`estimates that support our proposal.
`
`
`
`
`* Internal Accession Date Only
` Approved for External Publication
`© Copyright 2007 Hewlett-Packard Development Company, L.P.
`
`
`
`
`Operating Systems and Asymmetric Single-ISA CMPs:
`The Potential for Saving Energy
`
Jeffrey C. Mogul, Jayaram Mudigonda, Nathan Binkert, Partha Ranganathan, Vanish Talwar
`Jeff.Mogul@hp.com, Jayaram.Mudigonda@hp.com, binkert@hp.com, Partha.Ranganathan@hp.com, Vanish.Talwar@hp.com
`HP Labs, Palo Alto, CA 94304
`
`Abstract
`
`CPUs consume too much power. Modern complex cores
`sometimes waste power on functions that are not useful for
`the code they run. In particular, operating system kernels do
`not benefit from many power-consuming features that were
`intended to improve application performance. We propose
`using asymmetric single-ISA CMPs (ASISA-CMPs), mul-
`ticore CPUs where all cores execute the same instruction
`set architecture but have different performance and power
`characteristics, to avoid wasting power on operating systems
`code. We describe various design choices for both hardware
`and software, describe Linux kernel modifications to support
`ASISA-CMP, and offer some quantified estimates that sup-
`port our proposal.
`
`1 Introduction
`
`While Moore's Law has delivered exponential increases in
`computation over the past few decades, two well-known
`trends create problems for computer systems: CPUs con-
`sume more and more power, and operating systems do not
`speed up as rapidly as most application code does. Many
`people have addressed these problems separately; we pro-
`pose to address them together.
`Until recently, designers of high-end CPU chips tended to
`improve single-stream performance as much as possible, by
`exploiting instruction-level parallelism and decreasing cycle
`times. Both of these techniques are now hard to sustain, so
`recent CPU designs exploit shrinking VLSI feature sizes by
`using multiple cores, rather than faster clocks. Examples of
these Chip Multi-Processors (CMPs) include the Sun Niagara
`
`processor with eight cores, the quad-core Intel Xeon, and
`dual-core systems from several vendors.
`All commercially-available general-purpose CMPs, as of
`mid-2007, are symmetrical: each CPU is identical, and typ-
`ically, all run at the same clock rate. However, in 2003 Ku-
`mar et al. [13] proposed heterogeneous (or asymmetrical)
`multi-core processors, as a way of reducing power require-
`ments. Their proposal retains the single-Instruction-Set-
`Architecture (single-ISA) model of symmetrical CMPs: all
`cores can execute the same machine code. They observed
`that different implementations of the same ISA had order-
`of-magnitude differences in power consumption (assuming a
`single VLSI process). They further observed that in a multi-
`application workload, or even in phases of a single applica-
`tion, one does not always need the full power and functional-
`ity of the most complex CPU core; if a CMP could switch a
`process between cores with different capabilities, one could
`maintain throughput while decreasing power consumption.
`Since the original study by Kumar et al., several other
`studies [3, 8, 14] have highlighted the benefits from het-
`erogeneity. (Keynote speakers from some major processor
`vendors have also suggested that heterogeneity might be
`commercially interesting [2, 22].) However, all these stud-
`ies looked only at user-mode execution. But we know that
`many workloads spend much or most of their cycles in the
`operating system [23]. We also know that operating system
`(OS) code differs from application code: it avoids the float-
`ing point processor, it branches more often, and it has lower
`cache locality (all reasons why OS speedups lag applica-
`tion speedups on modern CPUs). An asymmetric single-ISA
`CMP (ASISA-CMP) might therefore save power, without re-
`ducing throughput, by executing OS code on a simple, low-
`power core, while using more complex, high-power cores for
`
`
`
`
`application code.
`The main contribution of this paper is to propose and
`evaluate the ASISA-CMP model, in which (1) a multi-core,
`single-ISA CPU includes some “OS-friendly” cores, optim-
`ized to execute OS code without wasting energy, and (2) the
`OS statically and/or dynamically decides which code to ex-
`ecute on which cores, so as to optimize throughput per joule.
`To optimally exploit ASISA-CMP, we expect that the OS
`and the hardware both must change. This paper explores
`the various design considerations for co-evolving the OS and
`hardware, and presents experimental results.
`
`2 Related work
`
`Ousterhout [20] may have been the first to point out that “op-
`erating system performance does not seem to be improving
`at the same rate as the base speed of the underlying hard-
`ware.” He speculated that causes include memory latencies
`and context-switching overheads.
`Nellans et al. [19] measured the fraction of cycles spent in
`the OS for a variety of applications, and re-examined how OS
`performance scales with CPU performance, suggesting that
`interrupt-handling code interferes with caches and branch-
`prediction history tables. They found that many applications
`execute a large fraction of cycles in the OS, and observed
`that “a classic 5 stage pipeline [such as] a 486 is surprisingly
`close in performance to a modern Pentium 4 when execut-
`ing [OS code].” However, instead of proposing an ASISA-
`CMP, they suggest adding a dedicated OS-specific core. (It is
`not entirely clear how far their proposal is from a single-ISA
`CMP.) They did not evaluate this proposal in detail.
`Chakraborty et al. [11] proposed refactoring software so
`that similar “fragments” of code are executed on the same
`core of a CMP. Their initial study treated the OS and the
`user-mode application as two coarse-grained fragments, and
`found speedup in some cases. However, they did not exam-
`ine asymmetric CMPs or the question of power reduction.
`Sanjay Kumar et al. [15] propose a “sidecore” architec-
`ture to support hypervisor operations. In their approach, a
`hypervisor is restructured into multiple components, with
`some running on specialized cores. Their goal was to avoid
`the expensive internal state changes triggered via traps (e.g.,
`VMexit in Intel's VT architecture) to perform privileged hy-
`pervisor functions. Instead, the sidecore approach transfers
`the operation to a remote core “that is already in the appro-
`
`priate state.” This also avoids polluting the guest-core caches
`with code and data from hypervisor operations. (The side-
`core approach is not specifically targeted at saving energy.)
`
`3 Design overview
`
`Our goal is to address two major challenges for multi-core
`systems: how to minimize power consumption while main-
`taining good performance, and how to exploit the parallelism
`offered by a multi-core CPU. These two issues are closely
`linked, but we will try to untangle them somewhat.
`
`3.1 Proportionality in power consumption
`
`We want to maximize the energy efficiency of our computer
`systems, which could be expressed as the useful computa-
`tional work (throughput) per joule expended. Fan et al. [6]
`have observed that the ideal system would consume power
`directly proportional to the rate of useful computational work
`it completes. We refer to this as the “proportionality rule.”
`Such a system would need no additional power management
algorithms, except as might be needed to avoid exceeding
`peak power or cooling limits.
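As a rough illustration in our own notation (not a formula from Fan et al.), the proportionality rule can be written as follows: if a system at utilization u completes useful work at rate u * W_peak, its ideal power draw and efficiency are

    % Sketch of the proportionality rule (our notation, not Fan et al.'s).
    % u: utilization (0 <= u <= 1); P_peak, W_peak: power and useful work rate at full load.
    \[
      P(u) = u \, P_{\mathrm{peak}}
      \qquad\Longrightarrow\qquad
      \frac{W(u)}{P(u)} = \frac{u \, W_{\mathrm{peak}}}{u \, P_{\mathrm{peak}}}
                        = \frac{W_{\mathrm{peak}}}{P_{\mathrm{peak}}}
      \quad \text{for all } u > 0,
    \]
    % i.e., throughput per joule stays constant across the activity range,
    % instead of collapsing at low utilization as it does when idle power is high.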
`Fan et al. argue that “system designers should consider
`power efficiency not simply at peak performance levels but
`across the activity range.” We believe that the ASISA-CMP
`approach, with a careful integration of OS and hardware
`design, can help address this goal. Of course, it is prob-
`ably impossible to design a system that truly complies with
`the proportionality rule, especially since many components
`consume considerable power even when the CPU is idle.
`There are at least two ways that one might design a sys-
`tem to address the proportionality rule. First, one could
`design individual components whose power consumption
`varies with throughput, such as a CPU that supports voltage
`and frequency scaling. Second, one could design a sys-
`tem with a mix of both high-power/high-performance and
`low-power/low-performance components, with a mechanism
`for dynamically varying which components are used (and
`powered up) based on system load.
`The original ASISA-CMP model, as proposed by Kumar
`et al. [13], follows the second approach, without precluding
`the first one. In times of light load, activity shifts to low-
`power cores; in times of heavy load, low-power cores can
`offer additional parallelism without significant increases in
`
`
`
`
`Table 1: Power and relative performance of Alpha cores
`
`area or power consumption.
`In this paper, we extend the ASISA-CMP model by as-
`serting that the ideal low-power core is one that is special-
`ized to execute operating system code. (More broadly, we
`consider “OS-like code,” which we will define in Sec. 4.3.1.)
`This stems from several observations:
- OS code does not proportionately benefit from the potential speedup of complex, high-frequency cores. Thus, running OS code on a simpler core is a better use of power and chip area.
- Most computer systems (with certain exceptions, such as scientific computing) are often idle. If we could power down complex CPU cores during periods when they would otherwise be idle, we could improve proportionality.
`The designs explored in this paper include:
- Multi-core CPUs with a mix of high-power, high-complexity application cores, and low-power, low-complexity OS-friendly cores.
- Operating system modifications to dynamically shift load to the most appropriate core, and (potentially) to power down idle cores.
- Modest hardware changes to improve the efficiency of core-switching.
`Of course, the CPU is not the only power-consuming
`component in a system, and ASISA-CMP does not address
`the power consumed by memory, busses, and I/O devices,
`or the power wasted in power supplies and cooling systems.
`Therefore, even if the CPU were perfectly proportional, the
`entire system would still fail to meet the proportionality rule.
`However, as long as CPUs represent the largest single power
`draws in a system (see Sec. 3.3.1), improving their propor-
`tionality is worthwhile.
`
`3.2 Core heterogeneity
`
`The promise of ASISA-CMP depends critically on two facts
`of CPU core design: (1) for a given process technology, a
`complex core consumes much more power and die area than
`a simple core, and (2) a complex core does not improve OS
`performance nearly as much as it improves application per-
`formance.
`Table 1 shows the relative power consumption, perform-
`ance (in terms of instructions per cycle, or IPC), and sizes
`of various generations of Alpha cores, scaled as if all were
`
`
`
`Table 2: Example power budgets for two typical systems
`
`
`
[Figure 1: Example without thread-level parallelism. Panel (a), a single-core CPU: a system call and a device interrupt both execute on the application's core, adding kernel entry/exit, interrupt entry/exit, and cache-interference overheads to thread A. Panel (b), an ASISA-CMP dual-core CPU: kernel-mode work moves to core 0 while core 1 sits in low-power mode, at the cost of core-switch and cache-affinity overheads.]
`
`4.1 Dynamic core-switching for OS kernel
`code
`
`In any multiprocessor system, performance depends on
`whether there is enough available parallelism to keep all
processing elements busy. Specifically, in an ASISA-CMP-based system there are two ways to optimize throughput/joule:
1. If the system is underutilized: shift OS load to low-power cores, and power down high-power cores, whenever possible.
2. If the system has available parallelism: shift OS load to low-power cores, while keeping the high-power cores as busy as possible with application code.
`First, consider the case where there is no available
`application-level parallelism. Figure 1 illustrates a simple
`example for an application with just one thread. Figure 1(a)
`shows a brief part of the application's execution on a single-
`core CPU. When the application thread makes a system call,
`execution moves from user mode to kernel mode and back.
`Figure 1(b) shows the same execution sequence on a two-
`core ASISA-CMP system. In this case, when the application
`thread makes the system call, the kernel (1) transfers control
`from the “application core” (core 1) to the “OS core” (core
`0); (2) puts core 1 in low-power mode; (3) executes the sys-
`tem call on core 0; (4) wakes up core 1; and (5) transfers
`control back to core 1.
`
- Complex-core with voltage/frequency (V/F) scaling: This could be especially effective at saving power in systems with lots of idle time. However, we suspect that the lowest-power mode for V/F scaling on a complex core is still much higher than the power drawn by an ASISA-CMP CPU with the application core turned off.
- Complex-core with power-down on idle: In such a system, an idle core would wake up on any external interrupt. This approach could also be especially effective at saving power in systems with lots of idle time, especially if the system is not forced to wake up on every clock tick even if there is no work to do. However, this approach only works if the system is truly idle for moderately long periods; a server system handling a minimal rate of request packets, for example, might never stay idle for long enough.
- Lots of simple cores: A CPU with lots of simple cores, each of which could be independently powered off, might allow a close approximation to the proportionality rule. However, this configuration can only get reasonable throughput (and thus reasonable throughput/joule for the full system) if we can solve the general parallel programming problem, a problem that has been elusive for many years. Amdahl's Law may be an inherent limit to this approach.
- Some cores with specialized ISAs: Others have explored this alternative [19]. We have ruled it out for this paper, because we believe that a single-ISA approach makes it much easier to develop and maintain an operating system, and because we have no way to simulate multiple ISAs in one system.
`
`4 Software issues
`
`We believe that the ASISA-CMP approach can be applied to
`OS code in several ways, including:
`1. Dynamically switching between cores in OS kernel
`code; see Sec. 4.1.
`2. Running virtualization “helper processing” on OS-
`friendly cores; see Sec. 4.2.
`3. Running applications, or parts thereof, with “OS-like
code” on OS-friendly cores; see Sec. 4.3.
To date, we have focused all of our implementation and experimental work on the first approach.
`
`
`
`
`for an application with two runnable threads, thread A and
`thread B, where thread A is running at the start of the ex-
`ample, and again the thread makes a system call. Figure 2(a)
`shows the single-core case: while thread A is executing in
`the kernel, thread B remains blocked until the scheduler de-
cides that A has run long enough. Figure 2(b) shows the ASISA-CMP case: when thread A makes its system call, the kernel switches that thread to the OS core (core 0), and thread B can now run on the application core (core 1).
`Again assuming that the switching overheads are not too
`large, the use of ASISA-CMP increases the utilization of the
`application core, because OS execution is moved to a more
`appropriate core.
`
`4.2 Running virtualization helpers on
`OS-friendly cores
`
`So far, we have discussed the operating system as if it were a
`single monolithic kernel. Of course, many operating systems
`designs, both in research and as products, have been decom-
`posed; for example, into a microkernel and a set of server
`processes. Sec. 4.3 discusses the possibility of running dae-
mon processes on an OS-friendly core.
`However, a different kind of decomposition has become
`popular:
`the creation of one or more virtualization layers
`between the traditional operating system and the hardware.
`These layers basically execute OS kernel code, but in a
`different protection domain from the “guest” OS. We be-
`lieve these are clear candidates for execution on OS-friendly
`cores. They also tend to be bound to specific cores, which
`eliminates the performance overhead of core-switching.
`Note that the use of current virtual machine monitors
`(VMM) might undermine the use of dynamic core switching
`for kernel code, because a guest operating system running
`in a virtual machine might not be able to bind a thread to a
`specific core. It is possible that a VMM's interface to the
`guest OS could be augmented to expose the actual cores, or
`at least to expose the distinction between application cores
`and OS-friendly cores, but we have not explored this issue in
`detail.
- Xen's Domain 0: In Xen [4], and probably in similar VM systems, one privileged virtual machine is used to multiplex and manage the I/O operations issued by the other virtual machines. This “Domain 0” probably would be most power-efficient if run on an OS-friendly core.
`
[Figure 2: Example with thread-level parallelism. Panel (a), a single-core CPU: while thread A executes in the kernel, thread B stays blocked until a scheduler interrupt, and a device interrupt adds interrupt entry/exit and cache-interference overheads. Panel (b), an ASISA-CMP dual-core CPU: thread A's kernel-mode work is core-switched to core 0 while thread B runs on core 1, with core-switch, cache-affinity, and cache-interference overheads.]
`
`If the OS core draws significantly less power than the ap-
`plication core while not significantly reducing performance
`on OS code, and if there were no overheads for switching
`cores and for changing the power state of core 0, then the
`ASISA-CMP system would have higher throughput per joule
`than the single-core system. These two “ifs” are two of the
`key questions for this paper: are there real benefits to run-
`ning OS code on specialized cores, and are the overheads
`small enough to avoid overwhelming the benefits?
`Figure 1(a) and (b) also show what happens on the arrival
`of an interrupt. In the single-core system, application exe-
`cution is delayed both for the actual execution of the inter-
`rupt handling code in the OS, and also for interrupt exit and
`entry. In the ASISA-CMP system, the application continues
`to run without delay (except perhaps for memory-access in-
`terference). This would probably be true for any multicore
`system, but in the ASISA-CMP approach, interrupt handling
`happens on a power-optimized core, rather than on a high-
`power core.
`Note that we compare the dual-core ASISA-CMP system
`against a single-core system, rather than against a symmet-
`ric dual-core system, because (in the underutilized case) it
`seems likely that a dual-core CMP would consume more
`power than a single-core system, without any increase in
`throughput.
`Next, consider the case where there is application-level
`parallelism. Figure 2 illustrates another simple example,
`
`
`
`
- I/O and network helper threads: Several researchers
`have proposed running I/O-related portions of the virtu-
`alization environment on separate processors. McAuley
`and Neugebauer [18] proposed “encapsulating oper-
`ating system I/O subsystems in Virtual Channel Pro-
`cessors,” which could run on separate processors. Reg-
`nier et al. [24] and Brecht et al. [5] have proposed run-
`ning network packet processing code on distinct cores.
`For both of these approaches, OS-friendly cores would
`be a good match for optimizing power and performance.
`
`4.3 Running OS-like code on OS-friendly
`cores
`
`Our hardware design is based on the observation that “OS
`code” behaves differently from user code. However, the
`definition of an operating system is quite fluid; the same code
`that executes inside the kernel in a monolithic system, such
`as Linux, may execute in user mode on a microkernel. Thus,
`we should not limit our choice of which code to execute on
`OS-friendly cores based simply on whether it runs in kernel
`mode.
`Typical systems include much code that shares these char-
`acteristics with actual kernel code:
- Libraries: Much of user-mode code actually executes
`in libraries that are tightly integrated with the kernel. In
`microkernels and similar approaches, it might be hard
`to distinguish much of this code from traditional kernel
`code. So, one might expect some library code to run
`most efficiently on an OS-friendly core. On the other
`hand, we suspect that this would require core-switching
`far too frequently, and it might be hard to detect when
`to switch.
- Daemons: Most operating systems (but especially mi-
`crokernels) use daemon processes to execute code that
`does not need to be in the kernel, or that needs to block
`frequently. This code might also be run most efficiently
`on an OS-friendly core, and could easily be bound to the
`appropriate core (most operating systems allow binding
`a process to a CPU).
- Servers: Some applications, such as Web servers,
`might fall into the same category as daemons. However,
`cryptographic code might run better on normal cores,
`and so a secure Web server might have to be refactored
`
`to run optimally on an ASISA-CMP.
`Note that if we use ASISA-CMP to execute entire OS-like
`application processes (or, at least, threads) on OS-friendly
`cores, this eliminates the performance overhead of core-
`switching.
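As a minimal user-space sketch of such binding (assuming, purely for illustration, that core 0 is an OS-friendly core; a real system would learn this from some ASISA-CMP topology interface that we do not define here), a daemon could pin itself with the standard Linux affinity call:

    /* Sketch: pin the calling daemon to an (assumed) OS-friendly core.
     * Core 0 is a placeholder choice, not something ASISA-CMP defines. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void bind_to_os_friendly_core(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(0, &mask);                    /* assumed OS-friendly core */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            exit(1);
        }
    }

A daemon would call this once at startup, before entering its service loop.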
`
`4.3.1 Defining and recognizing “OS-like” code
`
`If it makes sense to run OS-like application code on OS-
`friendly cores, how does the OS decide which applications
`to treat this way? Programmers could simply declare their
`applications to be OS-like, but this is likely to lead to mis-
`takes. Instead, we believe that through monitoring the ap-
`propriate hardware performance counters, the OS can detect
`applications, or perhaps even threads, with OS-like execu-
`tion behavior.
`Automated recognition of OS-like code demands a clear
`definition of the term. While we do not yet have a spe-
`cific definition, the main characteristics of OS-like code
`are likely to include the absence of floating-point opera-
`tions; frequent pointer-chasing (which can defeat complex
`data cache designs); idiosyncratic conditional-branch beha-
`vior [7] (which can defeat branch predictors designed for ap-
`plication code), and frequent block-data copies (which can
`limit the effectiveness of large D-caches).
`Many of these idiosyncratic characteristics could be vis-
`ible via CPU performance counters. (Zhang et al. [26] have
`suggested that performance counters should be directly man-
`aged by the OS as a first-class resource, for similar reasons.)
`We are not sure, however, whether support for ASISA-CMP
`will require novel performance counters.
`We also suspect that an adaptive approach, based on tent-
`atively running a suspected OS-like thread on an OS-friendly
`core, and measuring the resulting change in progress, could
`be effective in determining whether a thread is sufficiently
`OS-like to benefit. This approach requires a means for the
`kernel to determine the progress of a thread, which is an open
`issue; in some cases, the rate at which it issues system calls
`could be a useful indicator.
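A minimal sketch of such a classifier appears below; the counter fields, the sampling interval, and every threshold are hypothetical placeholders, since we have not yet fixed a precise definition of OS-like code:

    /* Hypothetical per-thread classifier for "OS-like" execution behavior.
     * The fields stand in for whatever hardware performance counters the
     * platform exposes over a sampling interval; thresholds are placeholders. */
    struct thread_counters {
        unsigned long instructions;
        unsigned long fp_ops;            /* floating-point operations */
        unsigned long branch_misses;     /* mispredicted conditional branches */
        unsigned long dcache_misses;     /* data-cache misses (pointer chasing, copies) */
    };

    static int looks_os_like(const struct thread_counters *c)
    {
        if (c->instructions < 1000000)             /* too small a sample to judge */
            return 0;
        if (c->fp_ops * 100 > c->instructions)     /* more than ~1% FP: application-like */
            return 0;
        /* High misprediction and D-cache miss ratios suggest the complex
         * core's predictors and caches are being wasted on this thread. */
        return (c->branch_misses * 50 > c->instructions) ||
               (c->dcache_misses * 20 > c->instructions);
    }

A thread flagged this way would then be run tentatively on an OS-friendly core, with its progress monitored as described above.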
`
`4.3.2 Should asymmetry be exposed to user-mode code?
`
`Kernels already provide processor-affinity controls (e.g.,
`sched
`
`
`
`
`5 Implementation of dynamic
`core-switching
`
`For the preliminary experiments in this paper, we made the
`simplest possible changes to Linux (version 2.6.18) to allow
`us to shift certain OS operations to an OS-friendly core. We
`assumed a two-core system, and did not add code to power-
`down (or slow down) idle cores.
Kernels typically execute two categories of code: “bottom-half” interrupt-handler code, and “top-half” system-call code. We used different strategies for each.
`
`5.1 Interrupt-handler code
`
`We believe that interrupt code (often known as “bottom half”
`code) should always prefer an OS-friendly core. The OS can
`configure interrupt delivery so as to ensure this.
`Linux already provides, via the /proc/irq directory in
`the /proc process information pseudo-filesystem, a way to
`set interrupt affinity to a specific core. We do this only for
`the NIC; most other devices have much lower interrupt rates,
`and the timer interrupt must go to all CPUs.
`It is possible that the interrupt load could exceed the ca-
`pacity of the OS core(s) in a CPU. If so, should this load be
`spread out over the other cores? This might improve system
`throughput under some conditions, but it might be better to
`confine interrupt processing to a fixed subset of cores, as a
`way to limit the disruption caused by excessive interrupts. In
`our experiments, we have statically directed interrupts to the
`(single) OS core.
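For example, a small boot-time helper could steer the NIC's interrupt to the OS core by writing a CPU bitmask to the standard smp_affinity file; the IRQ number and the choice of CPU 0 as the OS core below are illustrative only:

    /* Sketch: direct one device interrupt to the (assumed) OS core.
     * /proc/irq/<irq>/smp_affinity accepts a hexadecimal CPU bitmask;
     * "1" means CPU 0 only. The IRQ number used here is hypothetical. */
    #include <stdio.h>

    static int set_irq_affinity(int irq, unsigned int cpu_mask)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%x\n", cpu_mask);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        return set_irq_affinity(19 /* hypothetical NIC IRQ */, 0x1 /* OS core 0 */) ? 1 : 0;
    }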
`
`5.2 System calls
`
`We faced two main challenges in our implementation of core
`switching for system calls: how to decide when to switch,
`and how to reduce the switching costs. Switching cores takes
`time, because
`1. switching involves executing extra instructions;
`2. switching requires transferring some state between
`CPUs;
`3. if the target core is powered down, it could take about
`a thousand cycles to switch [13]. Note that in our cur-
`rent implementation, we do not attempt to power-down
`idle cores; we defer further discussion of this issue to
`Sec. 5.6.
`
`Given a significant cost for core-switching, the tradeoff
`is only beneficial for expensive, frequent system calls (e.g.,
`select or perhaps open), but not fast or rare calls (e.g., getpid
`or exit). (Sec. 7.2 describes simple measurements of relevant
`costs.) Further, for some system calls the decision to switch
`should depend on how much work is to be done. A read
`system call with a buffer size of 10 bytes should not switch,
`whereas one with a buffer size of 10K bytes probably should.
`Our basic approach to core-switching is to modify certain
`system calls so that the basic sequence is:
1. do initial validation of arguments;
2. decide whether there is enough work to do to merit switching, and if so, invoke a core-switching function to switch to an OS core;
3. do the bulk of the system call;
4. if we switched cores, core-switch back to an application core, returning, if possible, to the original core to preserve cache affinity;
5. finish up and return.
Steps 2 and 4 are the modifications we made. The details
`differ slightly for each call that we modified, but generally
`involve only a few lines of new code for each call.
`In our current implementation, we core-switch on these
`system calls: open, stat, read, write, readv, writev, select,
`poll, and sendfile. For read, write, and similar calls, we
`arbitrarily defined “enough work” as 4096 bytes or more; we
`have not yet tried to optimize this choice.
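The sketch below shows the shape of these modifications for a write-like call. asisaSwitchCores is our switching function (Secs. 5.3 and 5.4); the helper names, the argument handling, and the assumption that it takes a target CPU mask are simplifications for illustration, while the 4096-byte threshold is the cutoff described above:

    /* Sketch of a core-switched system-call path (not the literal kernel patch). */
    #define ASISA_WORK_THRESHOLD 4096       /* "enough work" cutoff from above */

    long asisa_sys_write(unsigned int fd, const char __user *buf, size_t count)
    {
        int home_core = smp_processor_id(); /* remember the application core */
        int switched = 0;
        long ret;

        /* Step 2: is there enough work to amortize two core switches? */
        if (count >= ASISA_WORK_THRESHOLD) {
            asisaSwitchCores(asisa_os_cores);   /* assumed OS-core bitmap; move to an OS core */
            switched = 1;
        }

        /* Step 3: the bulk of the call runs here, possibly on the OS core. */
        ret = do_write_work(fd, buf, count);    /* placeholder for the real work */

        /* Step 4: return to an application core, preferably the original one,
         * to preserve cache affinity. */
        if (switched)
            asisaSwitchCores(cpumask_of_cpu(home_core));

        return ret;
    }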
`We have two different
`implementations of the core-
`switching function, asisaSwitchCores: a very slow design
`that works, and a faster design that, unfortunately, we were
`unable to debug before the submission deadline. We describe
`each version below.
`In either case, we dedicate one or more cores as OS cores,
`to execute only OS code, and we modified the kernel to main-
`tain bitmaps designating the OS cores and the application
`cores.
`
`5.3 Core-switching via the Linux migration
`mechanism
`
`Linux already includes a load-balancing facility that can mi-
`grate a thread from one CPU to another [1]. For our initial
`implementation of asisaSwitchCores, we used this mechan-
`ism. This gave us a simple implementation: we block the
`
`
`
`
`
`
`Table 3: Core-switching overheads
`
`calling thread, make it eligible to run only on the target des-
`tination CPU, place it on a queue of migratable processes,
`and then invoke Linux's per-CPU migration thread.
`This is an expensive procedure, since it involves a thread-
`switch on the source CPU, and invokes the scheduler on both
`the source and destination CPUs ... and the entire procedure
`must be done twice per system call.
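In outline, this slow version amounts to little more than the following sketch against the 2.6.18 interfaces; the name asisa_os_cores is our placeholder for the OS-core bitmap described above, and error handling plus the return-to-original-core bookkeeping are omitted:

    /* Slow asisaSwitchCores sketch: reuse Linux's existing migration machinery.
     * set_cpus_allowed() blocks the calling thread, hands it to the per-CPU
     * migration thread, and returns once it is running on a permitted CPU. */
    void asisaSwitchCores(cpumask_t target)
    {
        /* target is either the OS-core bitmap or a specific application core;
         * the scheduler runs on both the source and destination CPUs,
         * which is why this version is expensive. */
        set_cpus_allowed(current, target);
    }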
`
`5.4 Core-switching via a modified scheduler
`
`In an attempt to speed up core-switching, we wrote a ver-
`sion of asisaSwitchCores that directly invokes a modified,
`somewhat simplified version of the Linux scheduler. Linux
`allows a thread (running in the kernel) to call the schedule
`procedure to yield the CPU. We wrote a modified version,
`asisaYield, which deactivates the current thread, places it on
`a special per-CPU queue for the target CPU, does the usual
`scheduler processing to choose the next thread for the source
`CPU, and finally sends an inter-processor interrupt to the
`destination CPU. (Linux already uses this kind of interrupt
`to awaken the scheduler when doing thread migration.)
`When the interrupt arrives at the destination CPU, it in-
`vokes another modified version of the scheduler, asisaGrab,
`which dequeues the thread from its special per-CPU queue,
`bypassing the normal scheduler logic to choose which thread
`to run.
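Structurally, the fast path can be pictured as below. This is only a sketch of the still-unfinished implementation: the per-CPU queue, the assumed asisa_list field added to task_struct, and the point at which asisaGrab is invoked from the destination CPU's scheduler path are all simplifications.

    /* Sketch of the faster switching path (incomplete in our implementation). */
    struct asisa_queue {
        spinlock_t       lock;
        struct list_head threads;            /* threads parked for this CPU */
    };
    static DEFINE_PER_CPU(struct asisa_queue, asisa_queues);

    /* Source CPU: park the current thread for dest_cpu, poke that CPU, and
     * pick another local thread. (The real code must also deactivate the
     * current thread so the source scheduler cannot re-pick it; omitted.) */
    void asisaYield(int dest_cpu)
    {
        struct asisa_queue *q = &per_cpu(asisa_queues, dest_cpu);

        spin_lock(&q->lock);
        list_add_tail(&current->asisa_list, &q->threads);  /* assumed new field */
        spin_unlock(&q->lock);

        smp_send_reschedule(dest_cpu);  /* IPI: same mechanism Linux migration uses */
        schedule();                     /* choose the next thread for this CPU */
    }

    /* Destination CPU, from its scheduler path: take the parked thread
     * directly, bypassing the normal pick-next logic. */
    struct task_struct *asisaGrab(void)
    {
        struct asisa_queue *q = &__get_cpu_var(asisa_queues);
        struct task_struct *t = NULL;

        spin_lock(&q->lock);
        if (!list_empty(&q->threads)) {
            t = list_entry(q->threads.next, struct task_struct, asisa_list);
            list_del(&t->asisa_list);
        }
        spin_unlock(&q->lock);
        return t;
    }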
`
`5.5 Core switching costs
`
`To quantify the cost of core switching, we modified a kernel
`to switch on the getpid system call. This call does almost
`nothing, so it is a good way to measure the overhead. We
wrote a benchmark that bypasses the Linux C library's at-
`tempt to cache the results of getpid; the benchmark pins it-
`self to the application core, then invokes getpid many times
`in a tight loop. We wrote similar benchmarks to measure the
`latency of select with both an empty file-descriptor set (fd-
`set) and a one-element fdset which is always “ready”, and
`of readv (which reads into a vector of
`buffers) reading
`
`
`cache.
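A user-space sketch of the getpid micro-benchmark follows; the iteration count and the assumption that CPU 1 is the application core are illustrative, and the system call is issued through syscall() to bypass the C library's getpid caching:

    /* Sketch of the getpid core-switching micro-benchmark: pin to the
     * (assumed) application core, then time a tight loop of direct getpid
     * system calls, bypassing glibc's cached getpid(). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define ITERATIONS 1000000L              /* illustrative iteration count */

    int main(void)
    {
        cpu_set_t mask;
        struct timeval start, end;
        double usec;
        long i;

        CPU_ZERO(&mask);
        CPU_SET(1, &mask);                   /* assume core 1 is the application core */
        sched_setaffinity(0, sizeof(mask), &mask);

        gettimeofday(&start, NULL);
        for (i = 0; i < ITERATIONS; i++)
            syscall(SYS_getpid);             /* a real system call on every iteration */
        gettimeofday(&end, NULL);

        usec = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_usec - start.tv_usec);
        printf("getpid: %.3f usec/call\n", usec / ITERATIONS);
        return 0;
    }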
`The results in Table 3 were measured on our modified
`Linux 2.6.18 kernel running on a dual-core Xeon model 5160
`(3.0GHz, 64KB L1 caches, 4MB L2 cache) system, with
`and without core-switching enabled. In general, our known-
`to-be-slow core-switching code adds slightly over 4 micro-
`
`
`
`is to drastically reduce the core's voltage and frequency.
`Sec. 6.2 discusses idle-CPU power states in more detail.
`
`6 Architectural support
`
`The ASISA-CMP approach exposes a number of hardware
`design choices, which we discuss in this section. Most of
`these are open issues, since we have not yet done the simu-
`lation studies to explore all of the many options.
`
`6.1 Design of ASISA-CMP processors
`
`How should an “OS processor” in an ASISA-CMP differ
`from the other processors? We strongly favor the single-ISA
`model, since it greatly simplifies the design of the software,
`and because it still allows flexible allocation of computation
`to cores. But given a single ISA, many choices remain:
`- Non-architectural state: Different implementations of
`an ISA often expose different non-architectural state
`to the operating system. (“Non-architectural state” in-
`cludes CPU state that might have to be saved when a
`partially-completed instruction traps on a page fault.)
`The OS needs to be aware of these differences to be
`able to support the appropriate emulation.
`- Floating-point support: Most kernels do not use float-
`ing point at all, and so one might design the OS core
`without any floating-point pipeline or registers.
`If a
`thread with FP code somehow did end up executing on
`the OS core, it could trap into FP-emulation code, pre-
`serving the single-ISA view at a considerable perform-
`ance cost.
`- Caches: CMPs typically have per-core first-level (L1)
`caches, and a shared L2 cache. Cache designers have
`many choices (line size, associativity, total cache size,
`etc.). OS code tends to have different cache behavior
`than application code [23]; for example, the kernel ha