Operating Systems and Asymmetric Single-ISA CMPs: The Potential for Saving Energy

Jeffrey C. Mogul, Jayaram Mudigonda, Nathan Binkert, Partha Ranganathan, Vanish Talwar
HP Laboratories Palo Alto
HPL-2007-140
August 22, 2007*

Keywords: CMP, multi-core, energy savings, operating systems

* Internal Accession Date Only
Approved for External Publication
© Copyright 2007 Hewlett-Packard Development Company, L.P.

Operating Systems and Asymmetric Single-ISA CMPs:
The Potential for Saving Energy

Jeffrey C. Mogul, Jayaram Mudigonda, Nathan Binkert, Partha Ranganathan, Vanish Talwar
Jeff.Mogul@hp.com, Jayaram.Mudigonda@hp.com, binkert@hp.com, Partha.Ranganathan@hp.com, Vanish.Talwar@hp.com
HP Labs, Palo Alto, CA 94304

Abstract

CPUs consume too much power. Modern complex cores sometimes waste power on functions that are not useful for the code they run. In particular, operating system kernels do not benefit from many power-consuming features that were intended to improve application performance. We propose using asymmetric single-ISA CMPs (ASISA-CMPs), multicore CPUs where all cores execute the same instruction set architecture but have different performance and power characteristics, to avoid wasting power on operating systems code. We describe various design choices for both hardware and software, describe Linux kernel modifications to support ASISA-CMP, and offer some quantified estimates that support our proposal.

1 Introduction

While Moore's Law has delivered exponential increases in computation over the past few decades, two well-known trends create problems for computer systems: CPUs consume more and more power, and operating systems do not speed up as rapidly as most application code does. Many people have addressed these problems separately; we propose to address them together.

Until recently, designers of high-end CPU chips tended to improve single-stream performance as much as possible, by exploiting instruction-level parallelism and decreasing cycle times. Both of these techniques are now hard to sustain, so recent CPU designs exploit shrinking VLSI feature sizes by using multiple cores, rather than faster clocks. Examples of these Chip Multi-Processors (CMPs) include the Sun Niagara processor with eight cores, the quad-core Intel Xeon, and dual-core systems from several vendors.

All commercially-available general-purpose CMPs, as of mid-2007, are symmetrical: each CPU is identical, and typically, all run at the same clock rate. However, in 2003 Kumar et al. [13] proposed heterogeneous (or asymmetrical) multi-core processors as a way of reducing power requirements. Their proposal retains the single-Instruction-Set-Architecture (single-ISA) model of symmetrical CMPs: all cores can execute the same machine code. They observed that different implementations of the same ISA had order-of-magnitude differences in power consumption (assuming a single VLSI process). They further observed that in a multi-application workload, or even in phases of a single application, one does not always need the full power and functionality of the most complex CPU core; if a CMP could switch a process between cores with different capabilities, one could maintain throughput while decreasing power consumption.

Since the original study by Kumar et al., several other studies [3, 8, 14] have highlighted the benefits of heterogeneity. (Keynote speakers from some major processor vendors have also suggested that heterogeneity might be commercially interesting [2, 22].) However, all these studies looked only at user-mode execution, even though we know that many workloads spend much or most of their cycles in the operating system [23]. We also know that operating system (OS) code differs from application code: it avoids the floating-point processor, it branches more often, and it has lower cache locality (all reasons why OS speedups lag application speedups on modern CPUs). An asymmetric single-ISA CMP (ASISA-CMP) might therefore save power, without reducing throughput, by executing OS code on a simple, low-power core, while using more complex, high-power cores for application code.

The main contribution of this paper is to propose and evaluate the ASISA-CMP model, in which (1) a multi-core, single-ISA CPU includes some "OS-friendly" cores, optimized to execute OS code without wasting energy, and (2) the OS statically and/or dynamically decides which code to execute on which cores, so as to optimize throughput per joule. To optimally exploit ASISA-CMP, we expect that both the OS and the hardware must change. This paper explores the various design considerations for co-evolving the OS and hardware, and presents experimental results.
`
2 Related work

Ousterhout [20] may have been the first to point out that "operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware." He speculated that causes include memory latencies and context-switching overheads.

Nellans et al. [19] measured the fraction of cycles spent in the OS for a variety of applications, and re-examined how OS performance scales with CPU performance, suggesting that interrupt-handling code interferes with caches and branch-prediction history tables. They found that many applications execute a large fraction of cycles in the OS, and observed that "a classic 5 stage pipeline [such as] a 486 is surprisingly close in performance to a modern Pentium 4 when executing [OS code]." However, instead of proposing an ASISA-CMP, they suggest adding a dedicated OS-specific core. (It is not entirely clear how far their proposal is from a single-ISA CMP.) They did not evaluate this proposal in detail.

Chakraborty et al. [11] proposed refactoring software so that similar "fragments" of code are executed on the same core of a CMP. Their initial study treated the OS and the user-mode application as two coarse-grained fragments, and found speedup in some cases. However, they did not examine asymmetric CMPs or the question of power reduction.

Sanjay Kumar et al. [15] propose a "sidecore" architecture to support hypervisor operations. In their approach, a hypervisor is restructured into multiple components, with some running on specialized cores. Their goal was to avoid the expensive internal state changes triggered via traps (e.g., VMexit in Intel's VT architecture) to perform privileged hypervisor functions. Instead, the sidecore approach transfers the operation to a remote core "that is already in the appropriate state." This also avoids polluting the guest-core caches with code and data from hypervisor operations. (The sidecore approach is not specifically targeted at saving energy.)

3 Design overview

Our goal is to address two major challenges for multi-core systems: how to minimize power consumption while maintaining good performance, and how to exploit the parallelism offered by a multi-core CPU. These two issues are closely linked, but we will try to untangle them somewhat.

3.1 Proportionality in power consumption

We want to maximize the energy efficiency of our computer systems, which could be expressed as the useful computational work (throughput) per joule expended. Fan et al. [6] have observed that the ideal system would consume power directly proportional to the rate of useful computational work it completes. We refer to this as the "proportionality rule." Such a system would need no additional power management algorithms, except as might be needed to avoid exceeding peak power or cooling limits.
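
Stated slightly more formally (in our own notation; this equation does not appear in the original text): if a system completes useful work at rate $r$ and draws power $P(r)$, the proportionality rule asks that

\[ P(r) = c \cdot r \qquad\Longrightarrow\qquad \frac{r}{P(r)} = \frac{1}{c}, \]

so that work per joule is a constant $1/c$ independent of utilization, and a fully idle system ($r = 0$) draws no power.
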
Fan et al. argue that "system designers should consider power efficiency not simply at peak performance levels but across the activity range." We believe that the ASISA-CMP approach, with a careful integration of OS and hardware design, can help address this goal. Of course, it is probably impossible to design a system that truly complies with the proportionality rule, especially since many components consume considerable power even when the CPU is idle.

There are at least two ways that one might design a system to address the proportionality rule. First, one could design individual components whose power consumption varies with throughput, such as a CPU that supports voltage and frequency scaling. Second, one could design a system with a mix of both high-power/high-performance and low-power/low-performance components, with a mechanism for dynamically varying which components are used (and powered up) based on system load.

The original ASISA-CMP model, as proposed by Kumar et al. [13], follows the second approach, without precluding the first one. In times of light load, activity shifts to low-power cores; in times of heavy load, low-power cores can offer additional parallelism without significant increases in area or power consumption.

[Table 1: Power and relative performance of Alpha cores]

In this paper, we extend the ASISA-CMP model by asserting that the ideal low-power core is one that is specialized to execute operating system code. (More broadly, we consider "OS-like code," which we will define in Sec. 4.3.1.) This stems from several observations:
- OS code does not proportionately benefit from the potential speedup of complex, high-frequency cores. Thus, running OS code on a simpler core is a better use of power and chip area.
- Most computer systems (with certain exceptions, such as scientific computing) are often idle. If we could power down complex CPU cores during periods when they would otherwise be idle, we could improve proportionality.
The designs explored in this paper include:
- Multi-core CPUs with a mix of high-power, high-complexity application cores, and low-power, low-complexity OS-friendly cores.
- Operating system modifications to dynamically shift load to the most appropriate core, and (potentially) to power down idle cores.
- Modest hardware changes to improve the efficiency of core-switching.
Of course, the CPU is not the only power-consuming component in a system, and ASISA-CMP does not address the power consumed by memory, busses, and I/O devices, or the power wasted in power supplies and cooling systems. Therefore, even if the CPU were perfectly proportional, the entire system would still fail to meet the proportionality rule. However, as long as CPUs represent the largest single power draws in a system (see Sec. 3.3.1), improving their proportionality is worthwhile.

3.2 Core heterogeneity

The promise of ASISA-CMP depends critically on two facts of CPU core design: (1) for a given process technology, a complex core consumes much more power and die area than a simple core, and (2) a complex core does not improve OS performance nearly as much as it improves application performance.

Table 1 shows the relative power consumption, performance (in terms of instructions per cycle, or IPC), and sizes of various generations of Alpha cores, scaled as if all were

[Table 2: Example power budgets for two typical systems]

[Figure 1: Example without thread-level parallelism. (a) Single-core CPU; (b) ASISA-CMP dual-core CPU.]

- Complex-core with voltage/frequency (V/F) scaling: This could be especially effective at saving power in systems with lots of idle time. However, we suspect that the lowest-power V/F state of a complex core still draws much more power than an ASISA-CMP CPU with the application core turned off.
- Complex-core with power-down on idle: In such a system, an idle core would wake up on any external interrupt. This approach could also be especially effective at saving power in systems with lots of idle time, especially if the system is not forced to wake up on every clock tick even if there is no work to do. However, this approach only works if the system is truly idle for moderately long periods; a server system handling a minimal rate of request packets, for example, might never stay idle for long enough.
- Lots of simple cores: A CPU with lots of simple cores, each of which could be independently powered off, might allow a close approximation to the proportionality rule. However, this configuration can only get reasonable throughput (and thus reasonable throughput/joule for the full system) if we can solve the general parallel programming problem – a problem that has been elusive for many years. Amdahl's Law may be an inherent limit to this approach.
- Some cores with specialized ISAs: Others have explored this alternative [19]. We have ruled it out for this paper, because we believe that a single-ISA approach makes it much easier to develop and maintain an operating system, and because we have no way to simulate multiple ISAs in one system.

4 Software issues

We believe that the ASISA-CMP approach can be applied to OS code in several ways, including:
1. Dynamically switching between cores in OS kernel code; see Sec. 4.1.
2. Running virtualization "helper processing" on OS-friendly cores; see Sec. 4.2.
3. Running applications, or parts thereof, with "OS-like code" on OS-friendly cores; see Sec. 4.3.
To date, we have focused all of our implementation and experimental work on the first approach.

4.1 Dynamic core-switching for OS kernel code

In any multiprocessor system, performance depends on whether there is enough available parallelism to keep all processing elements busy. Specifically in an ASISA-CMP-based system, there are two ways to optimize throughput/joule:
1. If the system is underutilized: shift OS load to low-power cores, and power down high-power cores, whenever possible.
2. If the system has available parallelism: shift OS load to low-power cores, while keeping the high-power cores as busy as possible with application code.

First, consider the case where there is no available application-level parallelism. Figure 1 illustrates a simple example for an application with just one thread. Figure 1(a) shows a brief part of the application's execution on a single-core CPU. When the application thread makes a system call, execution moves from user mode to kernel mode and back. Figure 1(b) shows the same execution sequence on a two-core ASISA-CMP system. In this case, when the application thread makes the system call, the kernel (1) transfers control from the "application core" (core 1) to the "OS core" (core 0); (2) puts core 1 in low-power mode; (3) executes the system call on core 0; (4) wakes up core 1; and (5) transfers control back to core 1.

If the OS core draws significantly less power than the application core while not significantly reducing performance on OS code, and if there were no overheads for switching cores and for changing the power state of core 0, then the ASISA-CMP system would have higher throughput per joule than the single-core system. These two "ifs" are two of the key questions for this paper: are there real benefits to running OS code on specialized cores, and are the overheads small enough to avoid overwhelming the benefits?

Figure 1(a) and (b) also show what happens on the arrival of an interrupt. In the single-core system, application execution is delayed both for the actual execution of the interrupt-handling code in the OS, and also for interrupt exit and entry. In the ASISA-CMP system, the application continues to run without delay (except perhaps for memory-access interference). This would probably be true for any multicore system, but in the ASISA-CMP approach, interrupt handling happens on a power-optimized core, rather than on a high-power core.

Note that we compare the dual-core ASISA-CMP system against a single-core system, rather than against a symmetric dual-core system, because (in the underutilized case) it seems likely that a dual-core CMP would consume more power than a single-core system, without any increase in throughput.

Next, consider the case where there is application-level parallelism. Figure 2 illustrates another simple example, for an application with two runnable threads, thread A and thread B, where thread A is running at the start of the example, and again the thread makes a system call. Figure 2(a) shows the single-core case: while thread A is executing in the kernel, thread B remains blocked until the scheduler decides that A has run long enough. Figure 2(b) shows the ASISA-CMP case: when thread A makes its system call, the kernel switches that thread to the OS core (core 0), and thread B can now run on the application core (core 1).

Again assuming that the switching overheads are not too large, the use of ASISA-CMP increases the utilization of the application core, because OS execution is moved to a more appropriate core.

[Figure 2: Example with thread-level parallelism. (a) Single-core CPU; (b) ASISA-CMP dual-core CPU.]

4.2 Running virtualization helpers on OS-friendly cores

So far, we have discussed the operating system as if it were a single monolithic kernel. Of course, many operating system designs, both in research and as products, have been decomposed; for example, into a microkernel and a set of server processes. Sec. 4.3 discusses the possibility of running daemon processes on an OS-friendly core.

However, a different kind of decomposition has become popular: the creation of one or more virtualization layers between the traditional operating system and the hardware. These layers basically execute OS kernel code, but in a different protection domain from the "guest" OS. We believe these are clear candidates for execution on OS-friendly cores. They also tend to be bound to specific cores, which eliminates the performance overhead of core-switching.

Note that the use of current virtual machine monitors (VMMs) might undermine the use of dynamic core switching for kernel code, because a guest operating system running in a virtual machine might not be able to bind a thread to a specific core. It is possible that a VMM's interface to the guest OS could be augmented to expose the actual cores, or at least to expose the distinction between application cores and OS-friendly cores, but we have not explored this issue in detail.

- Xen's Domain 0: In Xen [4], and probably in similar VM systems, one privileged virtual machine is used to multiplex and manage the I/O operations issued by the other virtual machines. This "Domain 0" probably would be most power-efficient if run on an OS-friendly core.

- I/O and network helper threads: Several researchers have proposed running I/O-related portions of the virtualization environment on separate processors. McAuley and Neugebauer [18] proposed "encapsulating operating system I/O subsystems in Virtual Channel Processors," which could run on separate processors. Regnier et al. [24] and Brecht et al. [5] have proposed running network packet processing code on distinct cores. For both of these approaches, OS-friendly cores would be a good match for optimizing power and performance.

4.3 Running OS-like code on OS-friendly cores

Our hardware design is based on the observation that "OS code" behaves differently from user code. However, the definition of an operating system is quite fluid; the same code that executes inside the kernel in a monolithic system, such as Linux, may execute in user mode on a microkernel. Thus, we should not limit our choice of which code to execute on OS-friendly cores based simply on whether it runs in kernel mode.

Typical systems include much code that shares these characteristics with actual kernel code:
- Libraries: much of user-mode code actually executes in libraries that are tightly integrated with the kernel. In microkernels and similar approaches, it might be hard to distinguish much of this code from traditional kernel code. So, one might expect some library code to run most efficiently on an OS-friendly core. On the other hand, we suspect that this would require core-switching far too frequently, and it might be hard to detect when to switch.
- Daemons: most operating systems (but especially microkernels) use daemon processes to execute code that does not need to be in the kernel, or that needs to block frequently. This code might also be run most efficiently on an OS-friendly core, and could easily be bound to the appropriate core (most operating systems allow binding a process to a CPU).
- Servers: Some applications, such as Web servers, might fall into the same category as daemons. However, cryptographic code might run better on normal cores, and so a secure Web server might have to be refactored to run optimally on an ASISA-CMP.
Note that if we use ASISA-CMP to execute entire OS-like application processes (or, at least, threads) on OS-friendly cores, this eliminates the performance overhead of core-switching.

4.3.1 Defining and recognizing "OS-like" code

If it makes sense to run OS-like application code on OS-friendly cores, how does the OS decide which applications to treat this way? Programmers could simply declare their applications to be OS-like, but this is likely to lead to mistakes. Instead, we believe that by monitoring the appropriate hardware performance counters, the OS can detect applications, or perhaps even threads, with OS-like execution behavior.

Automated recognition of OS-like code demands a clear definition of the term. While we do not yet have a specific definition, the main characteristics of OS-like code are likely to include the absence of floating-point operations; frequent pointer-chasing (which can defeat complex data cache designs); idiosyncratic conditional-branch behavior [7] (which can defeat branch predictors designed for application code); and frequent block-data copies (which can limit the effectiveness of large D-caches).

Many of these idiosyncratic characteristics could be visible via CPU performance counters. (Zhang et al. [26] have suggested that performance counters should be directly managed by the OS as a first-class resource, for similar reasons.) We are not sure, however, whether support for ASISA-CMP will require novel performance counters.

We also suspect that an adaptive approach, based on tentatively running a suspected OS-like thread on an OS-friendly core and measuring the resulting change in progress, could be effective in determining whether a thread is sufficiently OS-like to benefit. This approach requires a means for the kernel to determine the progress of a thread, which is an open issue; in some cases, the rate at which it issues system calls could be a useful indicator.
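
To make the adaptive approach concrete, the sketch below (ours, not from the paper) shows what a counter-based screen plus an adaptive confirmation step might look like; the counter fields, thresholds, and function names are all illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    /* One per-thread sample of hardware performance counters, taken over a
     * scheduling quantum.  The fields and thresholds below are illustrative
     * assumptions, not values from the paper. */
    struct counter_sample {
        uint64_t instructions;   /* instructions retired */
        uint64_t fp_ops;         /* floating-point operations retired */
        uint64_t branches;
        uint64_t branch_misses;
        uint64_t syscalls;       /* system calls issued during the quantum */
    };

    /* Heuristic screen: OS-like code avoids floating point, branches often
     * and unpredictably, and issues system calls frequently (Sec. 4.3.1). */
    static bool looks_os_like(const struct counter_sample *s)
    {
        if (s->instructions == 0)
            return false;
        double fp_frac     = (double)s->fp_ops / (double)s->instructions;
        double branch_frac = (double)s->branches / (double)s->instructions;
        double miss_rate   = s->branches ?
                             (double)s->branch_misses / (double)s->branches : 0.0;
        return fp_frac < 0.001 && branch_frac > 0.15 &&
               (miss_rate > 0.05 || s->syscalls > 100);
    }

    /* Adaptive confirmation: after tentatively running the thread on an
     * OS-friendly core for one quantum, keep it there only if its progress
     * (for example, its system-call rate) dropped by less than about 10%. */
    static bool keep_on_os_core(uint64_t progress_on_app_core,
                                uint64_t progress_on_os_core)
    {
        return 10 * progress_on_os_core >= 9 * progress_on_app_core;
    }
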
4.3.2 Should asymmetry be exposed to user-mode code?

Kernels already provide processor-affinity controls (e.g., sched_setaffinity).

5 Implementation of dynamic core-switching

For the preliminary experiments in this paper, we made the simplest possible changes to Linux (version 2.6.18) to allow us to shift certain OS operations to an OS-friendly core. We assumed a two-core system, and did not add code to power down (or slow down) idle cores.

Kernels typically execute two categories of code: "bottom-half" interrupt-handler code, and "top-half" system-call code. We used different strategies for each.

5.1 Interrupt-handler code

We believe that interrupt code (often known as "bottom half" code) should always prefer an OS-friendly core. The OS can configure interrupt delivery so as to ensure this.

Linux already provides, via the /proc/irq directory in the /proc process-information pseudo-filesystem, a way to set interrupt affinity to a specific core. We do this only for the NIC; most other devices have much lower interrupt rates, and the timer interrupt must go to all CPUs.
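
For concreteness, a small user-space helper along the following lines could pin a NIC's interrupt to the OS core by writing a hexadecimal CPU mask to that IRQ's smp_affinity file; treating CPU 0 as the OS core, and taking the IRQ number from the command line, are illustrative assumptions.

    #include <stdio.h>
    #include <stdlib.h>

    /* Pin one IRQ to a single CPU by writing a hexadecimal CPU mask to
     * /proc/irq/<irq>/smp_affinity.  We assume the OS-friendly core is
     * CPU 0 and that the NIC's IRQ number is given on the command line. */
    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <irq-number>\n", argv[0]);
            return 1;
        }
        int irq = atoi(argv[1]);
        unsigned int os_core_mask = 0x1;   /* bit 0 = CPU 0, the OS core */

        char path[64];
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);

        FILE *f = fopen(path, "w");
        if (!f) {
            perror(path);
            return 1;
        }
        fprintf(f, "%x\n", os_core_mask);  /* the mask is written in hex */
        fclose(f);
        return 0;
    }
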
It is possible that the interrupt load could exceed the capacity of the OS core(s) in a CPU. If so, should this load be spread out over the other cores? This might improve system throughput under some conditions, but it might be better to confine interrupt processing to a fixed subset of cores, as a way to limit the disruption caused by excessive interrupts. In our experiments, we have statically directed interrupts to the (single) OS core.

5.2 System calls

We faced two main challenges in our implementation of core switching for system calls: how to decide when to switch, and how to reduce the switching costs. Switching cores takes time, because
1. switching involves executing extra instructions;
2. switching requires transferring some state between CPUs;
3. if the target core is powered down, it could take about a thousand cycles to switch [13]. Note that in our current implementation, we do not attempt to power down idle cores; we defer further discussion of this issue to Sec. 5.6.

Given a significant cost for core-switching, the tradeoff is only beneficial for expensive, frequent system calls (e.g., select or perhaps open), but not for fast or rare calls (e.g., getpid or exit). (Sec. 7.2 describes simple measurements of relevant costs.) Further, for some system calls the decision to switch should depend on how much work is to be done. A read system call with a buffer size of 10 bytes should not switch, whereas one with a buffer size of 10K bytes probably should.

Our basic approach to core-switching is to modify certain system calls so that the basic sequence is:
- do initial validation of arguments;
- decide whether there is enough work to do to merit switching; if so, invoke a core-switching function to switch to an OS core;
- do the bulk of the system call;
- if we switched cores, core-switch back to an application core; we return, if possible, to the original core, to preserve cache affinity;
- finish up and return.
The second and fourth steps are the modifications we made. The details differ slightly for each call that we modified, but generally involve only a few lines of new code for each call.

In our current implementation, we core-switch on these system calls: open, stat, read, write, readv, writev, select, poll, and sendfile. For read, write, and similar calls, we arbitrarily defined "enough work" as 4096 bytes or more; we have not yet tried to optimize this choice.

We have two different implementations of the core-switching function, asisaSwitchCores: a very slow design that works, and a faster design that, unfortunately, we were unable to debug before the submission deadline. We describe each version below.

In either case, we dedicate one or more cores as OS cores, to execute only OS code, and we modified the kernel to maintain bitmaps designating the OS cores and the application cores.
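
As a sketch of the modified sequence for a read-style call (our reconstruction, not the actual patch: asisaSwitchCores is the name used above, while ASISA_SWITCH_THRESHOLD, asisa_pick_os_core(), and do_read_body() are hypothetical stand-ins for existing kernel code):

    #include <stddef.h>

    /* Illustrative declarations; in a real kernel these already exist or
     * would be provided by the ASISA-CMP patch. */
    extern int  smp_processor_id(void);           /* current CPU            */
    extern int  asisa_pick_os_core(void);         /* choose an OS core      */
    extern void asisaSwitchCores(int target_cpu); /* the switching function */
    extern long do_read_body(unsigned int fd, char *buf, size_t count);

    #define ASISA_SWITCH_THRESHOLD 4096  /* "enough work": 4096 bytes or more */

    long asisa_sys_read(unsigned int fd, char *buf, size_t count)
    {
        int home_cpu = -1;
        long ret;

        /* (1) initial validation of arguments happens as before ... */

        /* (2) enough work to merit a switch?  If so, hop to an OS core. */
        if (count >= ASISA_SWITCH_THRESHOLD) {
            home_cpu = smp_processor_id();   /* remember the application core */
            asisaSwitchCores(asisa_pick_os_core());
        }

        /* (3) the bulk of the system call now runs on the OS core. */
        ret = do_read_body(fd, buf, count);

        /* (4) if we switched, switch back, preferably to the original core,
         *     to preserve cache affinity. */
        if (home_cpu >= 0)
            asisaSwitchCores(home_cpu);

        /* (5) finish up and return. */
        return ret;
    }
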
5.3 Core-switching via the Linux migration mechanism

Linux already includes a load-balancing facility that can migrate a thread from one CPU to another [1]. For our initial implementation of asisaSwitchCores, we used this mechanism. This gave us a simple implementation: we block the calling thread, make it eligible to run only on the target destination CPU, place it on a queue of migratable processes, and then invoke Linux's per-CPU migration thread.

[Table 3: Core-switching overheads]

This is an expensive procedure, since it involves a thread-switch on the source CPU, and invokes the scheduler on both the source and destination CPUs ... and the entire procedure must be done twice per system call.
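
A minimal sketch of this slow path (ours, not the authors' code), assuming the Linux 2.6-era set_cpus_allowed() interface, which itself queues a migration request and wakes the per-CPU migration thread:

    #include <linux/sched.h>
    #include <linux/cpumask.h>

    /* Slow-path core switch: restrict the calling thread to the target CPU
     * and let the existing migration machinery move it there.  In the 2.6
     * kernels of this era, set_cpus_allowed() blocks until the per-CPU
     * migration thread has completed the move.  A real implementation would
     * remember and later restore the thread's original cpus_allowed mask. */
    static void asisa_switch_cores_slow(int target_cpu)
    {
        set_cpus_allowed(current, cpumask_of_cpu(target_cpu));
    }
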
5.4 Core-switching via a modified scheduler

In an attempt to speed up core-switching, we wrote a version of asisaSwitchCores that directly invokes a modified, somewhat simplified version of the Linux scheduler. Linux allows a thread (running in the kernel) to call the schedule procedure to yield the CPU. We wrote a modified version, asisaYield, which deactivates the current thread, places it on a special per-CPU queue for the target CPU, does the usual scheduler processing to choose the next thread for the source CPU, and finally sends an inter-processor interrupt to the destination CPU. (Linux already uses this kind of interrupt to awaken the scheduler when doing thread migration.)

When the interrupt arrives at the destination CPU, it invokes another modified version of the scheduler, asisaGrab, which dequeues the thread from its special per-CPU queue, bypassing the normal scheduler logic to choose which thread to run.

5.5 Core switching costs

To quantify the cost of core switching, we modified a kernel to switch on the getpid system call. This call does almost nothing, so it is a good way to measure the overhead. We wrote a benchmark that bypasses the Linux C library's attempt to cache the results of getpid; the benchmark pins itself to the application core, then invokes getpid many times in a tight loop. We wrote similar benchmarks to measure the latency of select, with both an empty file-descriptor set (fdset) and a one-element fdset which is always "ready", and of readv (which reads into a vector of buffers), reading blocks of 64 bytes from a file in the buffer cache.
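
The getpid micro-benchmark could look roughly like the sketch below (ours, not the authors' actual benchmark); it pins itself to an assumed application core (CPU 1 here), calls getpid directly through syscall() so the C library cannot cache the result, and reports the mean latency per call.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define ITERATIONS 1000000L
    #define APP_CORE   1           /* assumed application core; OS core = 0 */

    int main(void)
    {
        /* Pin this process to the application core. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(APP_CORE, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* Time a tight loop of direct getpid system calls. */
        struct timeval start, end;
        gettimeofday(&start, NULL);
        for (long i = 0; i < ITERATIONS; i++)
            (void)syscall(SYS_getpid);
        gettimeofday(&end, NULL);

        double usec = (end.tv_sec - start.tv_sec) * 1e6 +
                      (end.tv_usec - start.tv_usec);
        printf("getpid: %.3f microseconds per call\n", usec / ITERATIONS);
        return 0;
    }
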
The results in Table 3 were measured on our modified Linux 2.6.18 kernel running on a dual-core Xeon model 5160 (3.0GHz, 64KB L1 caches, 4MB L2 cache) system, with and without core-switching enabled. In general, our known-to-be-slow core-switching code adds slightly over 4 microseconds

is to drastically reduce the core's voltage and frequency. Sec. 6.2 discusses idle-CPU power states in more detail.

6 Architectural support

The ASISA-CMP approach exposes a number of hardware design choices, which we discuss in this section. Most of these are open issues, since we have not yet done the simulation studies to explore all of the many options.

6.1 Design of ASISA-CMP processors

How should an "OS processor" in an ASISA-CMP differ from the other processors? We strongly favor the single-ISA model, since it greatly simplifies the design of the software, and because it still allows flexible allocation of computation to cores. But given a single ISA, many choices remain:
- Non-architectural state: Different implementations of an ISA often expose different non-architectural state to the operating system. ("Non-architectural state" includes CPU state that might have to be saved when a partially-completed instruction traps on a page fault.) The OS needs to be aware of these differences to be able to support the appropriate emulation.
- Floating-point support: Most kernels do not use floating point at all, and so one might design the OS core without any floating-point pipeline or registers. If a thread with FP code somehow did end up executing on the OS core, it could trap into FP-emulation code, preserving the single-ISA view at a considerable performance cost.
- Caches: CMPs typically have per-core first-level (L1) caches, and a shared L2 cache. Cache designers have many choices (line size, associativity, total cache size, etc.). OS code tends to have different cache behavior than application code [23]; for example, the kernel has
