`
`
`
`
`
`
`
`Operating Systems and Asymmetric Single-ISA CMPs: The Potential
`for Saving Energy
`
`Jeffrey C. Mogul, Jayaram Mudigonda, Nathan Binkert, Partha Ranganathan, Vanish Talwar
`HP Laboratories Palo Alto
`HPL-2007-140
`August 22, 2007*
`
`
`
`
`CMP, multi-core,
`energy savings,
`operating systems
`
`CPUs consume too much power. Modern complex cores sometimes waste
`power on functions that are not useful for the code they run. In particular,
`operating system kernels do not benefit from many power-consuming features
`that were intended to improve application performance. We propose using
`asymmetric single-ISA CMPs (ASISA-CMPs), multicore CPUs where all cores
`execute the same instruction set architecture but have different performance and
`power characteristics, to avoid wasting power on operating systems code. We
`describe various design choices for both hardware and software, describe Linux
`kernel modifications to support ASISA-CMP, and offer some quantified
`estimates that support our proposal.
`
`
`
`
`* Internal Accession Date Only
` Approved for External Publication
`© Copyright 2007 Hewlett-Packard Development Company, L.P.
`
`Petitioner Samsung Ex-1024, 0001
`
`
`
`Operating Systems and Asymmetric Single-ISA CMPs:
`The Potential for Saving Energy
`
Jeffrey C. Mogul
Jayaram Mudigonda
Nathan Binkert
Partha Ranganathan
Vanish Talwar
`Jeff.Mogul@hp.com, Jayaram.Mudigonda@hp.com, binkert@hp.com, Partha.Ranganathan@hp.com, Vanish.Talwar@hp.com
`HP Labs, Palo Alto, CA 94304
`
`Abstract
`
`CPUs consume too much power. Modern complex cores
`sometimes waste power on functions that are not useful for
`the code they run. In particular, operating system kernels do
`not benefit from many power-consuming features that were
`intended to improve application performance. We propose
`using asymmetric single-ISA CMPs (ASISA-CMPs), mul-
`ticore CPUs where all cores execute the same instruction
`set architecture but have different performance and power
`characteristics, to avoid wasting power on operating systems
`code. We describe various design choices for both hardware
`and software, describe Linux kernel modifications to support
`ASISA-CMP, and offer some quantified estimates that sup-
`port our proposal.
`
`1 Introduction
`
`While Moore's Law has delivered exponential increases in
`computation over the past few decades, two well-known
`trends create problems for computer systems: CPUs con-
`sume more and more power, and operating systems do not
`speed up as rapidly as most application code does. Many
`people have addressed these problems separately; we pro-
`pose to address them together.
`Until recently, designers of high-end CPU chips tended to
`improve single-stream performance as much as possible, by
`exploiting instruction-level parallelism and decreasing cycle
`times. Both of these techniques are now hard to sustain, so
`recent CPU designs exploit shrinking VLSI feature sizes by
`using multiple cores, rather than faster clocks. Examples of
these Chip Multi-Processors (CMPs) include the Sun Niagara
`
`processor with eight cores, the quad-core Intel Xeon, and
`dual-core systems from several vendors.
`All commercially-available general-purpose CMPs, as of
`mid-2007, are symmetrical: each CPU is identical, and typ-
`ically, all run at the same clock rate. However, in 2003 Ku-
`mar et al. [13] proposed heterogeneous (or asymmetrical)
`multi-core processors, as a way of reducing power require-
`ments. Their proposal retains the single-Instruction-Set-
`Architecture (single-ISA) model of symmetrical CMPs: all
`cores can execute the same machine code. They observed
`that different implementations of the same ISA had order-
`of-magnitude differences in power consumption (assuming a
`single VLSI process). They further observed that in a multi-
`application workload, or even in phases of a single applica-
`tion, one does not always need the full power and functional-
`ity of the most complex CPU core; if a CMP could switch a
`process between cores with different capabilities, one could
`maintain throughput while decreasing power consumption.
`Since the original study by Kumar et al., several other
`studies [3, 8, 14] have highlighted the benefits from het-
`erogeneity. (Keynote speakers from some major processor
`vendors have also suggested that heterogeneity might be
`commercially interesting [2, 22].) However, all these stud-
`ies looked only at user-mode execution. But we know that
`many workloads spend much or most of their cycles in the
`operating system [23]. We also know that operating system
`(OS) code differs from application code: it avoids the float-
`ing point processor, it branches more often, and it has lower
`cache locality (all reasons why OS speedups lag applica-
`tion speedups on modern CPUs). An asymmetric single-ISA
`CMP (ASISA-CMP) might therefore save power, without re-
`ducing throughput, by executing OS code on a simple, low-
`power core, while using more complex, high-power cores for
`
`1
`
`Petitioner Samsung Ex-1024, 0002
`
`
`
`application code.
`The main contribution of this paper is to propose and
`evaluate the ASISA-CMP model, in which (1) a multi-core,
`single-ISA CPU includes some “OS-friendly” cores, optim-
`ized to execute OS code without wasting energy, and (2) the
`OS statically and/or dynamically decides which code to ex-
`ecute on which cores, so as to optimize throughput per joule.
`To optimally exploit ASISA-CMP, we expect that the OS
`and the hardware both must change. This paper explores
`the various design considerations for co-evolving the OS and
`hardware, and presents experimental results.
`
`2 Related work
`
`Ousterhout [20] may have been the first to point out that “op-
`erating system performance does not seem to be improving
`at the same rate as the base speed of the underlying hard-
`ware.” He speculated that causes include memory latencies
`and context-switching overheads.
`Nellans et al. [19] measured the fraction of cycles spent in
`the OS for a variety of applications, and re-examined how OS
`performance scales with CPU performance, suggesting that
`interrupt-handling code interferes with caches and branch-
`prediction history tables. They found that many applications
`execute a large fraction of cycles in the OS, and observed
`that “a classic 5 stage pipeline [such as] a 486 is surprisingly
`close in performance to a modern Pentium 4 when execut-
`ing [OS code].” However, instead of proposing an ASISA-
`CMP, they suggest adding a dedicated OS-specific core. (It is
`not entirely clear how far their proposal is from a single-ISA
`CMP.) They did not evaluate this proposal in detail.
`Chakraborty et al. [11] proposed refactoring software so
`that similar “fragments” of code are executed on the same
`core of a CMP. Their initial study treated the OS and the
`user-mode application as two coarse-grained fragments, and
`found speedup in some cases. However, they did not exam-
`ine asymmetric CMPs or the question of power reduction.
`Sanjay Kumar et al. [15] propose a “sidecore” architec-
`ture to support hypervisor operations. In their approach, a
`hypervisor is restructured into multiple components, with
`some running on specialized cores. Their goal was to avoid
`the expensive internal state changes triggered via traps (e.g.,
`VMexit in Intel's VT architecture) to perform privileged hy-
`pervisor functions. Instead, the sidecore approach transfers
`the operation to a remote core “that is already in the appro-
`
`priate state.” This also avoids polluting the guest-core caches
`with code and data from hypervisor operations. (The side-
`core approach is not specifically targeted at saving energy.)
`
`3 Design overview
`
`Our goal is to address two major challenges for multi-core
`systems: how to minimize power consumption while main-
`taining good performance, and how to exploit the parallelism
`offered by a multi-core CPU. These two issues are closely
`linked, but we will try to untangle them somewhat.
`
`3.1 Proportionality in power consumption
`
`We want to maximize the energy efficiency of our computer
`systems, which could be expressed as the useful computa-
`tional work (throughput) per joule expended. Fan et al. [6]
`have observed that the ideal system would consume power
`directly proportional to the rate of useful computational work
`it completes. We refer to this as the “proportionality rule.”
`Such a system would need no additional power management
algorithms, except as might be needed to avoid exceeding
`peak power or cooling limits.
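The proportionality rule lends itself to a one-line model. The following sketch is our own illustration, not a measurement: the idle and peak wattages are assumed numbers, chosen only to contrast an ideal proportional system with one whose idle power is a large fraction of peak.

```python
# Sketch of the "proportionality rule": an ideal system's power scales
# linearly with utilization, so useful work per joule is constant.
# All wattages below are illustrative assumptions, not measurements.

def power(util, p_idle, p_peak):
    """Power draw at a given utilization, linear between idle and peak."""
    return p_idle + util * (p_peak - p_idle)

def work_per_joule(util, p_idle, p_peak, peak_throughput=1.0):
    """Useful computational work completed per joule expended."""
    return (util * peak_throughput) / power(util, p_idle, p_peak)

# Ideal proportional system: zero idle power.
# Typical system: idle power is a large fraction of peak.
for util in (0.1, 0.5, 1.0):
    ideal = work_per_joule(util, p_idle=0.0, p_peak=100.0)
    typical = work_per_joule(util, p_idle=60.0, p_peak=100.0)
    print(f"util={util:.0%}  ideal={ideal:.4f}  typical={typical:.4f}")
```

The ideal system's work per joule is the same at every utilization; the typical system's efficiency collapses at low utilization, which is exactly the regime that Sec. 3.3.3 argues is most common.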
`Fan et al. argue that “system designers should consider
`power efficiency not simply at peak performance levels but
`across the activity range.” We believe that the ASISA-CMP
`approach, with a careful integration of OS and hardware
`design, can help address this goal. Of course, it is prob-
`ably impossible to design a system that truly complies with
`the proportionality rule, especially since many components
`consume considerable power even when the CPU is idle.
`There are at least two ways that one might design a sys-
`tem to address the proportionality rule. First, one could
`design individual components whose power consumption
`varies with throughput, such as a CPU that supports voltage
`and frequency scaling. Second, one could design a sys-
`tem with a mix of both high-power/high-performance and
`low-power/low-performance components, with a mechanism
`for dynamically varying which components are used (and
`powered up) based on system load.
`The original ASISA-CMP model, as proposed by Kumar
`et al. [13], follows the second approach, without precluding
`the first one. In times of light load, activity shifts to low-
`power cores; in times of heavy load, low-power cores can
`offer additional parallelism without significant increases in
`
`2
`
`Petitioner Samsung Ex-1024, 0003
`
`
`
`area or power consumption.
`In this paper, we extend the ASISA-CMP model by as-
`serting that the ideal low-power core is one that is special-
`ized to execute operating system code. (More broadly, we
`consider “OS-like code,” which we will define in Sec. 4.3.1.)
`This stems from several observations:
• OS code does not proportionately benefit from the potential
speedup of complex, high-frequency cores. Thus, running OS code
on a simpler core is a better use of power and chip area.
• Most computer systems (with certain exceptions, such as
scientific computing) are often idle. If we could power down
complex CPU cores during periods when they would otherwise be
idle, we could improve proportionality.
The designs explored in this paper include:
• Multi-core CPUs with a mix of high-power, high-complexity
application cores, and low-power, low-complexity OS-friendly
cores.
• Operating system modifications to dynamically shift load to
the most appropriate core, and (potentially) to power down idle
cores.
• Modest hardware changes to improve the efficiency of
core-switching.
`Of course, the CPU is not the only power-consuming
`component in a system, and ASISA-CMP does not address
`the power consumed by memory, busses, and I/O devices,
`or the power wasted in power supplies and cooling systems.
`Therefore, even if the CPU were perfectly proportional, the
`entire system would still fail to meet the proportionality rule.
`However, as long as CPUs represent the largest single power
`draws in a system (see Sec. 3.3.1), improving their propor-
`tionality is worthwhile.
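The published Alpha-core ratios in Table 1 (below) can be reduced to a single performance-per-watt figure; the input ratios are from Kumar et al. [13], while the division is our own back-of-the-envelope reduction.

```python
# IPC per unit of normalized average power for the Alpha cores of
# Table 1; (IPC, power) pairs are the published ratios, EV4 = 1.00.
cores = {
    "EV4": (1.00, 1.00),
    "EV5": (1.30, 1.84),
    "EV6": (1.87, 2.86),
    "EV8": (2.14, 12.45),
}
for name, (ipc, pwr) in cores.items():
    print(f"{name}: {ipc / pwr:.2f} IPC per normalized watt")
```

The most complex core delivers roughly one sixth of the smallest core's performance per watt, and that gap should widen for OS code, whose IPC scales worse than SPEC's.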
`
3.2 Core heterogeneity

The promise of ASISA-CMP depends critically on two facts of
CPU core design: (1) for a given process technology, a complex
core consumes much more power and die area than a simple core,
and (2) a complex core does not improve OS performance nearly
as much as it improves application performance.
Table 1 shows the relative power consumption, performance (in
terms of instructions per cycle, or IPC), and sizes of various
generations of Alpha cores, scaled as if all were implemented
in the same process technology. Clearly, the smallest core
delivers significantly more performance per watt and per mm².
In fact, these performance results were based on the SPEC CPU
benchmark suite; since operating system performance generally
scales worse than application performance [20], we believe the
IPC ratios would be even smaller for OS code.

Table 1: Power and relative performance of Alpha cores

Alpha   Average    Peak       Normalized vs. EV4
core    power      power      IPC     area    power
EV4      3.73W      4.97W     1.00    1.00     1.00
EV5      6.88W      9.83W     1.30    1.76     1.84
EV6     10.68W     17.8W      1.87    8.54     2.86
EV8     46.44W     92.88W     2.14   82.2     12.45

All cores scaled to 0.10 micron; IPC based on SPEC CPU benchmarks.
Based on data from Kumar et al. [13]

3.3 Complicating issues

Various issues complicate the question of whether we can
improve throughput/joule by running OS (or OS-like) code on a
special OS-friendly core. We cover many details in subsequent
sections of this paper; here, we expose some general questions.
Many of these can only be resolved by experimentation (possibly
through simulation); we describe our experiments later, in
Sec. 7.
The two key issues, as mentioned above, are the relative power
consumption levels for various system components, and the
relative performance costs and benefits of switching cores. In
order for ASISA-CMP to pay off, we require that it does not
reduce performance faster than it reduces power consumption.

3.3.1 How important is CPU power?

Any real system includes multiple components that draw power,
and ASISA-CMP will not significantly change the energy
consumption of components other than the CPU. In fact, if
ASISA-CMP increases the time required to complete a job (or set
of jobs), the resulting increase in energy consumed by other
system components may outweigh the savings from the CPU cores.
Component power consumption varies tremendously across the
variety of computer systems in use. In particular, laptops and
servers have vastly different balances between components.
Table 2 shows a power breakdown for several typical systems.
The blade server results, taken from [16], show “nameplate”
(maximum) power budgets, and all have either two or (for the
“Large” configuration) four CPUs. The laptop results, taken
from [17], show measured results for an idle system and for one
running the PCMark CPU benchmark; in both cases, Dynamic
Voltage Scaling was disabled and the screen was at full
brightness.

Table 2: Example power budgets for two typical systems

System   Tot.   CPU Watts (%)   Mem   Disk   PCI   Other
Blade servers (all with multiple CPUs; after [16])
Small     248     70 (28%)       48    10     50     70
Med.      442    170 (39%)      112    10     50    100
Large    1025    520 (51%)      320    10     75    100
Laptop (after [17])
Idle     13.1    2.0 (15%)      0.6    0.4   N.A.  10.1
Busy     25.8   13.4 (52%)        1      1   N.A.  10.3

In all cases, except for the idle laptop, CPU power consumption
was the largest single component of system power consumption.
For the Large server and the busy laptop, the CPU (or CPUs)
consumed slightly more than half of the total power. This
suggests that techniques, such as ASISA-CMP, that address CPU
power consumption can have meaningful effects on whole-system
power consumption.
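Because ASISA-CMP touches only the CPU's share of the budget, Table 2 also bounds the whole-system effect, Amdahl-style. In the sketch below, the 50% CPU-power reduction is purely an assumed figure for illustration, not a result of this paper; only the CPU shares come from Table 2.

```python
# Upper bound on whole-system power savings when only CPU power is
# reduced, using the CPU shares from Table 2.
def system_savings(cpu_share, cpu_power_reduction):
    """Fraction of total system power saved."""
    return cpu_share * cpu_power_reduction

# Assumed (hypothetical) 50% cut in CPU power:
for system, share in [("Small blade", 0.28), ("Large blade", 0.51),
                      ("Busy laptop", 0.52)]:
    print(f"{system}: at most {system_savings(share, 0.5):.0%} saved")
```

Even a large CPU-side saving translates into at most a quarter of total power on the Small blade, which is why the CPU share in Table 2 matters so much.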
`
`3.3.2 How does ASISA-CMP affect performance?
`
ASISA-CMP can affect performance in several ways:
• Running OS code on a slower CPU: By design, ASISA-CMP
concedes some performance by running OS code on a slower core.
As argued in Sec. 3.2, this slowdown might be minimal. However,
an application that spends much of its time in the OS could see
a significant performance decline.
• Core switching costs: ASISA-CMP inherently moves a thread of
execution from one core to another for certain system calls.
Core-switching creates longer code paths for these system
calls, and adds state-saving overhead.
• Cache affinity vs. cache interference: We assume that the
cores in a CMP CPU share a single L2 cache but have private L1
caches. Core-switching could affect cache performance in at
least two ways: it could harm cache affinity, by requiring
cache lines (e.g., for the data buffer of a write system call)
to move between L1 caches, or it could reduce cache
interference, by keeping some OS code and data out of the
application core's cache.
• Available parallelism: Given that the incremental cost (in
power and area) for adding an OS core to a CMP is much lower
than for an application core (see Sec. 3.2), if there is
available parallelism in the workload that extends to OS
processing, an ASISA-CMP CPU could support more parallelism
than a symmetric CMP CPU with similar power and area. For
example, a parallel “make” command might benefit from having an
OS core run I/O processing while the application core is
dedicated to an optimizing compiler. Not all workloads will
have this kind of parallelism, of course.
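The trade-off between the first two effects can be made concrete with a toy energy model. Every constant below is an assumption chosen only for illustration (a 10 microsecond system call, a 1.5x OS-core slowdown, and guessed core wattages and switch costs); none is measured in this paper.

```python
# Toy model: energy to execute one system call, with and without
# core-switching. All constants are illustrative assumptions.

def stay_energy(t_call, p_app):
    """Run the call on the application core: t_call seconds at p_app W."""
    return t_call * p_app

def switch_energy(t_call, slowdown, p_os, p_app_idle, t_switch):
    """Switch to the OS core: the call runs slowdown times longer at
    p_os watts (plus two switch overheads), while the application
    core idles at p_app_idle watts."""
    run = slowdown * t_call
    return (run + 2 * t_switch) * p_os + run * p_app_idle

# Assumed: 10 us call, OS core 1.5x slower at 4 W, app core 20 W
# (2 W idle), 1 us switch overhead each way.
base = stay_energy(10e-6, 20.0)
switched = switch_energy(10e-6, 1.5, 4.0, 2.0, 1e-6)
print(f"stay: {base * 1e6:.1f} uJ, switch: {switched * 1e6:.1f} uJ")
```

Under these assumed numbers the switch wins despite the slowdown, but shrinking the call length or growing the switch overhead reverses the outcome, which is precisely the break-even question Sec. 5.2 must answer.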
`
`3.3.3 Idle time
`
`We have described the example in Figure 1 as if the applic-
`ation's system call does useful work. However, it could also
`be blocked waiting for some external event, such as disk I/O
`or the arrival of a Web request. Numerous studies (e.g.,
`[6, 21]) have shown that most computers are idle most of
`the time. Therefore, as pointed out by Fan et al. [6], a use-
ful design for meeting the proportionality rule must significantly
`reduce power consumption during idle periods.
`ASISA-CMP offers the option of powering down the
`high-power application core(s), while maintaining OS func-
`tions on a low-power core. For example, the arrival of a new
`Web (HTTP/TCP) connection normally precedes the arrival
`of the actual HTTP request by at least one network round-trip
`time (typically on the order of milliseconds or more). This
`would allow an OS core to handle the initial TCP connection
`request and then awaken the application core soon enough to
`process the HTTP-level request without any delay.
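The wake-ahead idea implies a simple timing budget: the OS core's connection handling plus the application core's wake-up latency must fit inside the network round-trip time. The latencies below are assumptions for illustration only.

```python
# Timing budget for waking the application core before the HTTP
# request arrives. All latencies are assumed, illustrative values.
rtt = 1e-3                # client round-trip time: ~1 ms
core_wakeup = 100e-6      # assumed app-core wake-up latency: ~100 us
os_core_handling = 20e-6  # assumed TCP-accept handling on the OS core

slack = rtt - (os_core_handling + core_wakeup)
print(f"slack before request arrival: {slack * 1e6:.0f} us")
assert slack > 0  # the application core can be awake in time
```

With a millisecond round trip, even a generous wake-up latency leaves ample slack; the budget only becomes tight for clients on the same LAN.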
`
3.4 Competing approaches

Given that the goal of ASISA-CMP is to improve performance (in
terms of throughput/joule), and it would require changes to the
design of CPU chips, we have to compare it against possible
competing approaches. Other potential alternatives include:
• Complex-core with dynamic Voltage/Frequency scaling: This
could be especially effective at saving power in systems with
lots of idle time. However, we suspect that a complex core in
its lowest-power V/F-scaled mode still draws much more power
than an ASISA-CMP CPU with the application core turned off.
• Complex-core with power-down on idle: In such a system, an
idle core would wake up on any external interrupt. This
approach could also be especially effective at saving power in
systems with lots of idle time, especially if the system is not
forced to wake up on every clock tick when there is no work to
do. However, this approach only works if the system is truly
idle for moderately long periods; a server system handling a
minimal rate of request packets, for example, might never stay
idle for long enough.
• Lots of simple cores: A CPU with lots of simple cores, each
of which could be independently powered off, might allow a
close approximation to the proportionality rule. However, this
configuration can only get reasonable throughput (and thus
reasonable throughput/joule for the full system) if we can
solve the general parallel-programming problem – a problem that
has been elusive for many years. Amdahl's Law may be an
inherent limit to this approach.
• Some cores with specialized ISAs: Others have explored this
alternative [19]. We have ruled it out for this paper, because
we believe that a single-ISA approach makes it much easier to
develop and maintain an operating system, and because we have
no way to simulate multiple ISAs in one system.

4 Software issues

We believe that the ASISA-CMP approach can be applied to OS
code in several ways, including:
1. Dynamically switching between cores in OS kernel code; see
Sec. 4.1.
2. Running virtualization “helper processing” on OS-friendly
cores; see Sec. 4.2.
3. Running applications, or parts thereof, with “OS-like code”
on OS-friendly cores; see Sec. 4.3.
To date, we have focused all of our implementation and
experimental work on the first approach.

4.1 Dynamic core-switching for OS kernel code

In any multiprocessor system, performance depends on whether
there is enough available parallelism to keep all processing
elements busy. Specifically in an ASISA-CMP-based system, there
are two ways to optimize throughput/joule:
1. If the system is underutilized: shift OS load to low-power
cores, and power down high-power cores, whenever possible.
2. If the system has available parallelism: shift OS load to
low-power cores, while keeping the high-power cores as busy as
possible with application code.
First, consider the case where there is no available
application-level parallelism. Figure 1 illustrates a simple
example for an application with just one thread. Figure 1(a)
shows a brief part of the application's execution on a
single-core CPU. When the application thread makes a system
call, execution moves from user mode to kernel mode and back.
Figure 1(b) shows the same execution sequence on a two-core
ASISA-CMP system. In this case, when the application thread
makes the system call, the kernel (1) transfers control from
the “application core” (core 1) to the “OS core” (core 0); (2)
puts core 1 in low-power mode; (3) executes the system call on
core 0; (4) wakes up core 1; and (5) transfers control back to
core 1.

Figure 1: Example without thread-level parallelism.
[Diagram: (a) a single-core CPU: thread A alternates between
user mode and kernel mode, incurring kernel entry/exit
overhead, interrupt entry/exit overhead for a device interrupt,
and cache-interference overhead. (b) an ASISA-CMP dual-core
CPU: thread A runs user code on core 1 and kernel code on core
0; core 1 sits in low-power mode during the system call, the
device interrupt is handled on core 0, and the added costs are
core-switch and cache-affinity overheads.]
`
If the OS core draws significantly less power than the
application core while not significantly reducing performance
on OS code, and if there were no overheads for switching cores
and for changing the power state of core 0, then the ASISA-CMP
system would have higher throughput per joule than the
single-core system. These two “ifs” are two of the key
questions for this paper: are there real benefits to running OS
code on specialized cores, and are the overheads small enough
to avoid overwhelming the benefits?
Figure 1(a) and (b) also show what happens on the arrival of an
interrupt. In the single-core system, application execution is
delayed both for the actual execution of the interrupt-handling
code in the OS, and also for interrupt exit and entry. In the
ASISA-CMP system, the application continues to run without
delay (except perhaps for memory-access interference). This
would probably be true for any multicore system, but in the
ASISA-CMP approach, interrupt handling happens on a
power-optimized core, rather than on a high-power core.
Note that we compare the dual-core ASISA-CMP system against a
single-core system, rather than against a symmetric dual-core
system, because (in the underutilized case) it seems likely
that a dual-core CMP would consume more power than a
single-core system, without any increase in throughput.
Next, consider the case where there is application-level
parallelism. Figure 2 illustrates another simple example, for
an application with two runnable threads, thread A and thread
B, where thread A is running at the start of the example, and
again the thread makes a system call. Figure 2(a) shows the
single-core case: while thread A is executing in the kernel,
thread B remains blocked until the scheduler decides that A has
run long enough. Figure 2(b) shows the ASISA-CMP case: when
thread A makes its system call, the kernel switches that thread
to the OS core (core 0), and thread B can then run on the
application core (core 1).
Again assuming that the switching overheads are not too large,
the use of ASISA-CMP increases the utilization of the
application core, because OS execution is moved to a more
appropriate core.

Figure 2: Example with thread-level parallelism.
[Diagram: (a) a single-core CPU: threads A and B share the
core; while A executes its system call in the kernel, B remains
blocked until a scheduler interrupt, with kernel entry/exit,
interrupt entry/exit, and cache-interference overheads. (b) an
ASISA-CMP dual-core CPU: A's kernel work moves to core 0 so
that B can run on core 1, with core-switch, cache-affinity, and
cache-interference overheads.]

4.2 Running virtualization helpers on OS-friendly cores

So far, we have discussed the operating system as if it were a
single monolithic kernel. Of course, many operating-system
designs, both in research and as products, have been
decomposed; for example, into a microkernel and a set of server
processes. Sec. 4.3 discusses the possibility of running daemon
processes on an OS-friendly core.
However, a different kind of decomposition has become popular:
the creation of one or more virtualization layers between the
traditional operating system and the hardware. These layers
basically execute OS kernel code, but in a different protection
domain from the “guest” OS. We believe these are clear
candidates for execution on OS-friendly cores. They also tend
to be bound to specific cores, which eliminates the performance
overhead of core-switching.
Note that the use of current virtual machine monitors (VMMs)
might undermine the use of dynamic core switching for kernel
code, because a guest operating system running in a virtual
machine might not be able to bind a thread to a specific core.
It is possible that a VMM's interface to the guest OS could be
augmented to expose the actual cores, or at least to expose the
distinction between application cores and OS-friendly cores,
but we have not explored this issue in detail.
• Xen's Domain 0: In Xen [4], and probably in similar VM
systems, one privileged virtual machine is used to multiplex
and manage the I/O operations issued by the other virtual
machines. This “Domain 0” probably would be most
power-efficient if run on an OS-friendly core.
• I/O and network helper threads: Several researchers have
proposed running I/O-related portions of the virtualization
environment on separate processors. McAuley and Neugebauer [18]
proposed “encapsulating operating system I/O subsystems in
Virtual Channel Processors,” which could run on separate
processors. Regnier et al. [24] and Brecht et al. [5] have
proposed running network packet-processing code on distinct
cores. For both of these approaches, OS-friendly cores would be
a good match for optimizing power and performance.
`
`4.3 Running OS-like code on OS-friendly
`cores
`
`Our hardware design is based on the observation that “OS
`code” behaves differently from user code. However, the
`definition of an operating system is quite fluid; the same code
`that executes inside the kernel in a monolithic system, such
`as Linux, may execute in user mode on a microkernel. Thus,
`we should not limit our choice of which code to execute on
`OS-friendly cores based simply on whether it runs in kernel
`mode.
`Typical systems include much code that shares these char-
`acteristics with actual kernel code:
• Libraries: much user-mode code actually executes in libraries
that are tightly integrated with the kernel. In microkernels
and similar approaches, it might be hard to distinguish much of
this code from traditional kernel code. So, one might expect
some library code to run most efficiently on an OS-friendly
core. On the other hand, we suspect that this would require
core-switching far too frequently, and it might be hard to
detect when to switch.
• Daemons: most operating systems (but especially microkernels)
use daemon processes to execute code that does not need to be
in the kernel, or that needs to block frequently. This code
might also be run most efficiently on an OS-friendly core, and
could easily be bound to the appropriate core (most operating
systems allow binding a process to a CPU).
• Servers: Some applications, such as Web servers, might fall
into the same category as daemons. However, cryptographic code
might run better on normal cores, and so a secure Web server
might have to be refactored to run optimally on an ASISA-CMP.
`Note that if we use ASISA-CMP to execute entire OS-like
`application processes (or, at least, threads) on OS-friendly
`cores, this eliminates the performance overhead of core-
`switching.
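Binding such a process to an OS-friendly core needs no new mechanism; on Linux, for example, the existing affinity interface suffices. A minimal sketch follows; the assumption that core 0 is the OS-friendly core is ours, not part of any existing system.

```python
# Sketch of binding an OS-like daemon process to an OS-friendly core,
# using the processor-affinity controls most operating systems already
# provide. Core numbering (core 0 = OS-friendly) is an assumption.
import os

OS_FRIENDLY_CORES = {0}

def bind_to_os_core(pid=0):
    """Pin a process (default: the caller) to the OS-friendly core(s);
    returns the resulting affinity set, or None if unsupported."""
    if hasattr(os, "sched_setaffinity"):  # Linux-only interface
        os.sched_setaffinity(pid, OS_FRIENDLY_CORES)
        return os.sched_getaffinity(pid)
    return None  # affinity control unavailable on this platform

print(bind_to_os_core())
```

An ASISA-CMP kernel would additionally need to export which core numbers are OS-friendly, the issue raised in Sec. 4.3.2.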
`
`4.3.1 Defining and recognizing “OS-like” code
`
`If it makes sense to run OS-like application code on OS-
`friendly cores, how does the OS decide which applications
`to treat this way? Programmers could simply declare their
`applications to be OS-like, but this is likely to lead to mis-
`takes. Instead, we believe that through monitoring the ap-
`propriate hardware performance counters, the OS can detect
`applications, or perhaps even threads, with OS-like execu-
`tion behavior.
`Automated recognition of OS-like code demands a clear
`definition of the term. While we do not yet have a spe-
`cific definition, the main characteristics of OS-like code
`are likely to include the absence of floating-point opera-
`tions; frequent pointer-chasing (which can defeat complex
`data cache designs); idiosyncratic conditional-branch beha-
`vior [7] (which can defeat branch predictors designed for ap-
`plication code), and frequent block-data copies (which can
`limit the effectiveness of large D-caches).
`Many of these idiosyncratic characteristics could be vis-
`ible via CPU performance counters. (Zhang et al. [26] have
`suggested that performance counters should be directly man-
`aged by the OS as a first-class resource, for similar reasons.)
`We are not sure, however, whether support for ASISA-CMP
`will require novel performance counters.
`We also suspect that an adaptive approach, based on tent-
`atively running a suspected OS-like thread on an OS-friendly
`core, and measuring the resulting change in progress, could
`be effective in determining whether a thread is sufficiently
`OS-like to benefit. This approach requires a means for the
`kernel to determine the progress of a thread, which is an open
`issue; in some cases, the rate at which it issues system calls
`could be a useful indicator.
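The counter-based recognition idea might look like the following sketch. The counter fields and the thresholds are invented for illustration and would have to be calibrated experimentally; they are not taken from any measured system.

```python
# Sketch of classifying a thread as "OS-like" from hardware
# performance-counter ratios. Thresholds are illustrative assumptions.
def is_os_like(fp_ops, branches, branch_misses, instructions):
    """Heuristic: little floating point, branch-heavy code whose
    branches are poorly predicted."""
    fp_rate = fp_ops / instructions
    branch_rate = branches / instructions
    miss_rate = branch_misses / branches if branches else 0.0
    return fp_rate < 0.01 and branch_rate > 0.15 and miss_rate > 0.05

# A kernel-ish profile vs. a numeric-application profile (made up):
print(is_os_like(fp_ops=10, branches=200_000, branch_misses=18_000,
                 instructions=1_000_000))   # -> True
print(is_os_like(fp_ops=300_000, branches=80_000, branch_misses=1_500,
                 instructions=1_000_000))   # -> False
```

The adaptive alternative described above would replace these static thresholds with a measured change in progress after tentatively moving the thread.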
`
`4.3.2 Should asymmetry be exposed to user-mode code?
`
Kernels already provide processor-affinity controls (e.g.,
sched_setaffinity on Linux). With asymmetrical cores, the
kernel might also expose core configuration data, to allow
OS-like user code to select the right core.
`
`7
`
`Petitioner Samsung Ex-1024, 0008
`
`
`
`
`5 Implementation of dynamic
`core-switching
`
`For the preliminary experiments in this paper, we made the
`simplest possible changes to Linux (version 2.6.18) to allow
`us to shift certain OS operations to an OS-friendly core. We
`assumed a two-core system, and did not add code to power-
`down (or slow down) idle cores.
Kernels typically execute two categories of code: “bottom-half”
interrupt-handler code, and “top-half” system-call code. We
used different strategies for each.
`
`5.1 Interrupt-handler code
`
`We believe that interrupt code (often known as “bottom half”
`code) should always prefer an OS-friendly core. The OS can
`configure interrupt delivery so as to ensure this.
`Linux already provides, via the /proc/irq directory in
`the /proc process information pseudo-filesystem, a way to
`set interrupt affinity to a specific core. We do this only for
`the NIC; most other devices have much lower interrupt rates,
`and the timer interrupt must go to all CPUs.
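For reference, the /proc/irq interface takes a hexadecimal CPU bitmask. The sketch below computes such a mask and shows (commented out) how it would be written; the IRQ number 19 is hypothetical, and the real one must be looked up in /proc/interrupts.

```python
# Sketch of steering a NIC's interrupts to the OS-friendly core via
# /proc/irq, as described above. IRQ 19 is a hypothetical example.
OS_CORE = 0

def affinity_mask(cpus):
    """Hex CPU bitmask in the format /proc/irq/*/smp_affinity expects."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

print(affinity_mask({OS_CORE}))  # -> 1
# On a live Linux system (as root), one would then write the mask:
# with open("/proc/irq/19/smp_affinity", "w") as f:
#     f.write(affinity_mask({OS_CORE}))
```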
`It is possible that the interrupt load could exceed the ca-
`pacity of the OS core(s) in a CPU. If so, should this load be
`spread out over the other cores? This might improve system
`throughput under some conditions, but it might be better to
`confine interrupt processing to a fixed subset of cores, as a
`way to limit the disruption caused by excessive interrupts. In
`our experiments, we have statically directed interrupts to the
`(single) OS core.
`
`5.2 System calls
`
`We faced two main challenges in our implementation of core
`switching for system calls: how to decide when to switch,
`and how to reduce the switching costs. Switching cores takes
`time, because
`1. switching involves executing extra instructions;
`2. switching requires transferring some state between
`CPUs;
`3. if the target core is powered down, it could take about
`a thousan