`
`
`
`
`
`
`
`Operating Systems and Asymmetric Single-ISA CMPs: The Potential
`for Saving Energy
`
`Jeffrey C. Mogul, Jayaram Mudigonda, Nathan Binkert, Partha Ranganathan, Vanish Talwar
`HP Laboratories Palo Alto
`HPL-2007-140
`August 22, 2007*
`
`
`
`
`CMP, multi-core,
`energy savings,
`operating systems
`
`CPUs consume too much power. Modern complex cores sometimes waste
`power on functions that are not useful for the code they run. In particular,
`operating system kernels do not benefit from many power-consuming features
`that were intended to improve application performance. We propose using
`asymmetric single-ISA CMPs (ASISA-CMPs), multicore CPUs where all cores
`execute the same instruction set architecture but have different performance and
`power characteristics, to avoid wasting power on operating systems code. We
`describe various design choices for both hardware and software, describe Linux
`kernel modifications to support ASISA-CMP, and offer some quantified
`estimates that support our proposal.
`
`
`
`
`* Internal Accession Date Only
` Approved for External Publication
`© Copyright 2007 Hewlett-Packard Development Company, L.P.
`
`Petitioner Samsung Ex-1024, 0001
`
`
`
`Operating Systems and Asymmetric Single-ISA CMPs:
`The Potential for Saving Energy
`
Jeffrey C. Mogul
Jayaram Mudigonda
Nathan Binkert
Partha Ranganathan
Vanish Talwar
`Jeff.Mogul@hp.com, Jayaram.Mudigonda@hp.com, binkert@hp.com, Partha.Ranganathan@hp.com, Vanish.Talwar@hp.com
`HP Labs, Palo Alto, CA 94304
`
`Abstract
`
`CPUs consume too much power. Modern complex cores
`sometimes waste power on functions that are not useful for
`the code they run. In particular, operating system kernels do
`not benefit from many power-consuming features that were
`intended to improve application performance. We propose
`using asymmetric single-ISA CMPs (ASISA-CMPs), mul-
`ticore CPUs where all cores execute the same instruction
`set architecture but have different performance and power
`characteristics, to avoid wasting power on operating systems
`code. We describe various design choices for both hardware
`and software, describe Linux kernel modifications to support
`ASISA-CMP, and offer some quantified estimates that sup-
`port our proposal.
`
`1 Introduction
`
`While Moore's Law has delivered exponential increases in
`computation over the past few decades, two well-known
`trends create problems for computer systems: CPUs con-
`sume more and more power, and operating systems do not
`speed up as rapidly as most application code does. Many
`people have addressed these problems separately; we pro-
`pose to address them together.
`Until recently, designers of high-end CPU chips tended to
`improve single-stream performance as much as possible, by
`exploiting instruction-level parallelism and decreasing cycle
`times. Both of these techniques are now hard to sustain, so
`recent CPU designs exploit shrinking VLSI feature sizes by
`using multiple cores, rather than faster clocks. Examples of
these Chip Multi-Processors (CMPs) include the Sun Niagara
`
`processor with eight cores, the quad-core Intel Xeon, and
`dual-core systems from several vendors.
`All commercially-available general-purpose CMPs, as of
`mid-2007, are symmetrical: each CPU is identical, and typ-
`ically, all run at the same clock rate. However, in 2003 Ku-
`mar et al. [13] proposed heterogeneous (or asymmetrical)
`multi-core processors, as a way of reducing power require-
`ments. Their proposal retains the single-Instruction-Set-
`Architecture (single-ISA) model of symmetrical CMPs: all
`cores can execute the same machine code. They observed
`that different implementations of the same ISA had order-
`of-magnitude differences in power consumption (assuming a
`single VLSI process). They further observed that in a multi-
`application workload, or even in phases of a single applica-
`tion, one does not always need the full power and functional-
`ity of the most complex CPU core; if a CMP could switch a
`process between cores with different capabilities, one could
`maintain throughput while decreasing power consumption.
`Since the original study by Kumar et al., several other
`studies [3, 8, 14] have highlighted the benefits from het-
`erogeneity. (Keynote speakers from some major processor
`vendors have also suggested that heterogeneity might be
`commercially interesting [2, 22].) However, all these stud-
`ies looked only at user-mode execution. But we know that
`many workloads spend much or most of their cycles in the
`operating system [23]. We also know that operating system
`(OS) code differs from application code: it avoids the float-
`ing point processor, it branches more often, and it has lower
`cache locality (all reasons why OS speedups lag applica-
`tion speedups on modern CPUs). An asymmetric single-ISA
`CMP (ASISA-CMP) might therefore save power, without re-
`ducing throughput, by executing OS code on a simple, low-
`power core, while using more complex, high-power cores for
`
`1
`
`Petitioner Samsung Ex-1024, 0002
`
`
`
`application code.
`The main contribution of this paper is to propose and
`evaluate the ASISA-CMP model, in which (1) a multi-core,
`single-ISA CPU includes some “OS-friendly” cores, optim-
`ized to execute OS code without wasting energy, and (2) the
`OS statically and/or dynamically decides which code to ex-
`ecute on which cores, so as to optimize throughput per joule.
`To optimally exploit ASISA-CMP, we expect that the OS
`and the hardware both must change. This paper explores
`the various design considerations for co-evolving the OS and
`hardware, and presents experimental results.
`
`2 Related work
`
`Ousterhout [20] may have been the first to point out that “op-
`erating system performance does not seem to be improving
`at the same rate as the base speed of the underlying hard-
`ware.” He speculated that causes include memory latencies
`and context-switching overheads.
`Nellans et al. [19] measured the fraction of cycles spent in
`the OS for a variety of applications, and re-examined how OS
`performance scales with CPU performance, suggesting that
`interrupt-handling code interferes with caches and branch-
`prediction history tables. They found that many applications
`execute a large fraction of cycles in the OS, and observed
`that “a classic 5 stage pipeline [such as] a 486 is surprisingly
`close in performance to a modern Pentium 4 when execut-
`ing [OS code].” However, instead of proposing an ASISA-
`CMP, they suggest adding a dedicated OS-specific core. (It is
`not entirely clear how far their proposal is from a single-ISA
`CMP.) They did not evaluate this proposal in detail.
`Chakraborty et al. [11] proposed refactoring software so
`that similar “fragments” of code are executed on the same
`core of a CMP. Their initial study treated the OS and the
`user-mode application as two coarse-grained fragments, and
`found speedup in some cases. However, they did not exam-
`ine asymmetric CMPs or the question of power reduction.
`Sanjay Kumar et al. [15] propose a “sidecore” architec-
`ture to support hypervisor operations. In their approach, a
`hypervisor is restructured into multiple components, with
`some running on specialized cores. Their goal was to avoid
`the expensive internal state changes triggered via traps (e.g.,
`VMexit in Intel's VT architecture) to perform privileged hy-
`pervisor functions. Instead, the sidecore approach transfers
`the operation to a remote core “that is already in the appro-
`
`priate state.” This also avoids polluting the guest-core caches
`with code and data from hypervisor operations. (The side-
`core approach is not specifically targeted at saving energy.)
`
`3 Design overview
`
`Our goal is to address two major challenges for multi-core
`systems: how to minimize power consumption while main-
`taining good performance, and how to exploit the parallelism
`offered by a multi-core CPU. These two issues are closely
`linked, but we will try to untangle them somewhat.
`
`3.1 Proportionality in power consumption
`
`We want to maximize the energy efficiency of our computer
`systems, which could be expressed as the useful computa-
`tional work (throughput) per joule expended. Fan et al. [6]
`have observed that the ideal system would consume power
`directly proportional to the rate of useful computational work
`it completes. We refer to this as the “proportionality rule.”
`Such a system would need no additional power management
algorithms, except as might be needed to avoid exceeding
`peak power or cooling limits.
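The proportionality rule lends itself to a one-line model. The following sketch is our own illustration, not a measurement: the idle and peak wattages are assumed numbers, chosen only to contrast an ideal proportional system with one whose idle power is a large fraction of peak.

```python
# Sketch of the "proportionality rule": an ideal system's power scales
# linearly with utilization, so useful work per joule is constant.
# All wattages below are illustrative assumptions, not measurements.

def power(util, p_idle, p_peak):
    """Power draw at a given utilization, linear between idle and peak."""
    return p_idle + util * (p_peak - p_idle)

def work_per_joule(util, p_idle, p_peak, peak_throughput=1.0):
    """Useful computational work completed per joule expended."""
    return (util * peak_throughput) / power(util, p_idle, p_peak)

# Ideal proportional system: zero idle power.
# Typical system: idle power is a large fraction of peak.
for util in (0.1, 0.5, 1.0):
    ideal = work_per_joule(util, p_idle=0.0, p_peak=100.0)
    typical = work_per_joule(util, p_idle=60.0, p_peak=100.0)
    print(f"util={util:.0%}  ideal={ideal:.4f}  typical={typical:.4f}")
```

The ideal system's work per joule is the same at every utilization; the typical system's efficiency collapses at low utilization, which is exactly the regime that Sec. 3.3.3 argues is most common.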
`Fan et al. argue that “system designers should consider
`power efficiency not simply at peak performance levels but
`across the activity range.” We believe that the ASISA-CMP
`approach, with a careful integration of OS and hardware
`design, can help address this goal. Of course, it is prob-
`ably impossible to design a system that truly complies with
`the proportionality rule, especially since many components
`consume considerable power even when the CPU is idle.
`There are at least two ways that one might design a sys-
`tem to address the proportionality rule. First, one could
`design individual components whose power consumption
`varies with throughput, such as a CPU that supports voltage
`and frequency scaling. Second, one could design a sys-
`tem with a mix of both high-power/high-performance and
`low-power/low-performance components, with a mechanism
`for dynamically varying which components are used (and
`powered up) based on system load.
`The original ASISA-CMP model, as proposed by Kumar
`et al. [13], follows the second approach, without precluding
`the first one. In times of light load, activity shifts to low-
`power cores; in times of heavy load, low-power cores can
`offer additional parallelism without significant increases in
`
`2
`
`Petitioner Samsung Ex-1024, 0003
`
`
`
`area or power consumption.
`In this paper, we extend the ASISA-CMP model by as-
`serting that the ideal low-power core is one that is special-
`ized to execute operating system code. (More broadly, we
`consider “OS-like code,” which we will define in Sec. 4.3.1.)
`This stems from several observations:
• OS code does not proportionately benefit from the potential
speedup of complex, high-frequency cores. Thus, running OS code
on a simpler core is a better use of power and chip area.
• Most computer systems (with certain exceptions, such as
scientific computing) are often idle. If we could power down
complex CPU cores during periods when they would otherwise be
idle, we could improve proportionality.
The designs explored in this paper include:
• Multi-core CPUs with a mix of high-power, high-complexity
application cores, and low-power, low-complexity OS-friendly
cores.
• Operating system modifications to dynamically shift load to
the most appropriate core, and (potentially) to power down idle
cores.
• Modest hardware changes to improve the efficiency of
core-switching.
`Of course, the CPU is not the only power-consuming
`component in a system, and ASISA-CMP does not address
`the power consumed by memory, busses, and I/O devices,
`or the power wasted in power supplies and cooling systems.
`Therefore, even if the CPU were perfectly proportional, the
`entire system would still fail to meet the proportionality rule.
`However, as long as CPUs represent the largest single power
`draws in a system (see Sec. 3.3.1), improving their propor-
`tionality is worthwhile.
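The published Alpha-core ratios in Table 1 (below) can be reduced to a single performance-per-watt figure; the input ratios are from Kumar et al. [13], while the division is our own back-of-the-envelope reduction.

```python
# IPC per unit of normalized average power for the Alpha cores of
# Table 1; (IPC, power) pairs are the published ratios, EV4 = 1.00.
cores = {
    "EV4": (1.00, 1.00),
    "EV5": (1.30, 1.84),
    "EV6": (1.87, 2.86),
    "EV8": (2.14, 12.45),
}
for name, (ipc, pwr) in cores.items():
    print(f"{name}: {ipc / pwr:.2f} IPC per normalized watt")
```

The most complex core delivers roughly one sixth of the smallest core's performance per watt, and that gap should widen for OS code, whose IPC scales worse than SPEC's.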
`
3.2 Core heterogeneity

The promise of ASISA-CMP depends critically on two facts of
CPU core design: (1) for a given process technology, a complex
core consumes much more power and die area than a simple core,
and (2) a complex core does not improve OS performance nearly
as much as it improves application performance.
Table 1 shows the relative power consumption, performance (in
terms of instructions per cycle, or IPC), and sizes of various
generations of Alpha cores, scaled as if all were implemented
in the same process technology. Clearly, the smallest core
delivers significantly more performance per watt and per mm².
In fact, these performance results were based on the SPEC CPU
benchmark suite; since operating system performance generally
scales worse than application performance [20], we believe the
IPC ratios would be even smaller for OS code.

Table 1: Power and relative performance of Alpha cores

Alpha   Average    Peak       Normalized vs. EV4
core    power      power      IPC     area    power
EV4      3.73W      4.97W     1.00    1.00     1.00
EV5      6.88W      9.83W     1.30    1.76     1.84
EV6     10.68W     17.8W      1.87    8.54     2.86
EV8     46.44W     92.88W     2.14   82.2     12.45

All cores scaled to 0.10 micron; IPC based on SPEC CPU benchmarks.
Based on data from Kumar et al. [13]

3.3 Complicating issues

Various issues complicate the question of whether we can
improve throughput/joule by running OS (or OS-like) code on a
special OS-friendly core. We cover many details in subsequent
sections of this paper; here, we expose some general questions.
Many of these can only be resolved by experimentation (possibly
through simulation); we describe our experiments later, in
Sec. 7.
The two key issues, as mentioned above, are the relative power
consumption levels for various system components, and the
relative performance costs and benefits of switching cores. In
order for ASISA-CMP to pay off, we require that it does not
reduce performance faster than it reduces power consumption.

3.3.1 How important is CPU power?

Any real system includes multiple components that draw power,
and ASISA-CMP will not significantly change the energy
consumption of components other than the CPU. In fact, if
ASISA-CMP increases the time required to complete a job (or set
of jobs), the resulting increase in energy consumed by other
system components may outweigh the savings from the CPU cores.
Component power consumption varies tremendously across the
variety of computer systems in use. In particular, laptops and
servers have vastly different balances between components.
Table 2 shows a power breakdown for several typical systems.
The blade server results, taken from [16], show “nameplate”
(maximum) power budgets, and all have either two or (for the
“Large” configuration) four CPUs. The laptop results, taken
from [17], show measured results for an idle system and for one
running the PCMark CPU benchmark; in both cases, Dynamic
Voltage Scaling was disabled and the screen was at full
brightness.

Table 2: Example power budgets for two typical systems

System   Tot.   CPU Watts (%)   Mem   Disk   PCI   Other
Blade servers (all with multiple CPUs; after [16])
Small     248     70 (28%)       48    10     50     70
Med.      442    170 (39%)      112    10     50    100
Large    1025    520 (51%)      320    10     75    100
Laptop (after [17])
Idle     13.1    2.0 (15%)      0.6    0.4   N.A.  10.1
Busy     25.8   13.4 (52%)        1      1   N.A.  10.3

In all cases, except for the idle laptop, CPU power consumption
was the largest single component of system power consumption.
For the Large server and the busy laptop, the CPU (or CPUs)
consumed slightly more than half of the total power. This
suggests that techniques, such as ASISA-CMP, that address CPU
power consumption can have meaningful effects on whole-system
power consumption.
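Because ASISA-CMP touches only the CPU's share of the budget, Table 2 also bounds the whole-system effect, Amdahl-style. In the sketch below, the 50% CPU-power reduction is purely an assumed figure for illustration, not a result of this paper; only the CPU shares come from Table 2.

```python
# Upper bound on whole-system power savings when only CPU power is
# reduced, using the CPU shares from Table 2.
def system_savings(cpu_share, cpu_power_reduction):
    """Fraction of total system power saved."""
    return cpu_share * cpu_power_reduction

# Assumed (hypothetical) 50% cut in CPU power:
for system, share in [("Small blade", 0.28), ("Large blade", 0.51),
                      ("Busy laptop", 0.52)]:
    print(f"{system}: at most {system_savings(share, 0.5):.0%} saved")
```

Even a large CPU-side saving translates into at most a quarter of total power on the Small blade, which is why the CPU share in Table 2 matters so much.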
`
`3.3.2 How does ASISA-CMP affect performance?
`
ASISA-CMP can affect performance in several ways:
• Running OS code on a slower CPU: By design, ASISA-CMP
concedes some performance by running OS code on a slower core.
As argued in Sec. 3.2, this slowdown might be minimal. However,
an application that spends much of its time in the OS could see
a significant performance decline.
• Core switching costs: ASISA-CMP inherently moves a thread of
execution from one core to another for certain system calls.
Core-switching creates longer code paths for these system
calls, and adds state-saving overhead.
• Cache affinity vs. cache interference: We assume that the
cores in a CMP CPU share a single L2 cache but have private L1
caches. Core-switching could affect cache performance in at
least two ways: it could harm cache affinity, by requiring
cache lines (e.g., for the data buffer of a write system call)
to move between L1 caches, or it could reduce cache
interference, by keeping some OS code and data out of the
application core's cache.
• Available parallelism: Given that the incremental cost (in
power and area) for adding an OS core to a CMP is much lower
than for an application core (see Sec. 3.2), if there is
available parallelism in the workload that extends to OS
processing, an ASISA-CMP CPU could support more parallelism
than a symmetric CMP CPU with similar power and area. For
example, a parallel “make” command might benefit from having an
OS core run I/O processing while the application core is
dedicated to an optimizing compiler. Not all workloads will
have this kind of parallelism, of course.
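The trade-off between the first two effects can be made concrete with a toy energy model. Every constant below is an assumption chosen only for illustration (a 10 microsecond system call, a 1.5x OS-core slowdown, and guessed core wattages and switch costs); none is measured in this paper.

```python
# Toy model: energy to execute one system call, with and without
# core-switching. All constants are illustrative assumptions.

def stay_energy(t_call, p_app):
    """Run the call on the application core: t_call seconds at p_app W."""
    return t_call * p_app

def switch_energy(t_call, slowdown, p_os, p_app_idle, t_switch):
    """Switch to the OS core: the call runs slowdown times longer at
    p_os watts (plus two switch overheads), while the application
    core idles at p_app_idle watts."""
    run = slowdown * t_call
    return (run + 2 * t_switch) * p_os + run * p_app_idle

# Assumed: 10 us call, OS core 1.5x slower at 4 W, app core 20 W
# (2 W idle), 1 us switch overhead each way.
base = stay_energy(10e-6, 20.0)
switched = switch_energy(10e-6, 1.5, 4.0, 2.0, 1e-6)
print(f"stay: {base * 1e6:.1f} uJ, switch: {switched * 1e6:.1f} uJ")
```

Under these assumed numbers the switch wins despite the slowdown, but shrinking the call length or growing the switch overhead reverses the outcome, which is precisely the break-even question Sec. 5.2 must answer.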
`
`3.3.3 Idle time
`
`We have described the example in Figure 1 as if the applic-
`ation's system call does useful work. However, it could also
`be blocked waiting for some external event, such as disk I/O
`or the arrival of a Web request. Numerous studies (e.g.,
`[6, 21]) have shown that most computers are idle most of
`the time. Therefore, as pointed out by Fan et al. [6], a use-
ful design for meeting the proportionality rule must significantly
`reduce power consumption during idle periods.
`ASISA-CMP offers the option of powering down the
`high-power application core(s), while maintaining OS func-
`tions on a low-power core. For example, the arrival of a new
`Web (HTTP/TCP) connection normally precedes the arrival
`of the actual HTTP request by at least one network round-trip
`time (typically on the order of milliseconds or more). This
`would allow an OS core to handle the initial TCP connection
`request and then awaken the application core soon enough to
`process the HTTP-level request without any delay.
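The wake-ahead idea implies a simple timing budget: the OS core's connection handling plus the application core's wake-up latency must fit inside the network round-trip time. The latencies below are assumptions for illustration only.

```python
# Timing budget for waking the application core before the HTTP
# request arrives. All latencies are assumed, illustrative values.
rtt = 1e-3                # client round-trip time: ~1 ms
core_wakeup = 100e-6      # assumed app-core wake-up latency: ~100 us
os_core_handling = 20e-6  # assumed TCP-accept handling on the OS core

slack = rtt - (os_core_handling + core_wakeup)
print(f"slack before request arrival: {slack * 1e6:.0f} us")
assert slack > 0  # the application core can be awake in time
```

With a millisecond round trip, even a generous wake-up latency leaves ample slack; the budget only becomes tight for clients on the same LAN.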
`
3.4 Competing approaches

Given that the goal of ASISA-CMP is to improve performance (in
terms of throughput/joule), and it would require changes to the
design of CPU chips, we have to compare it against possible
competing approaches. Other potential alternatives include:
• Complex-core with dynamic Voltage/Frequency scaling: This
could be especially effective at saving power in systems with
lots of idle time. However, we suspect that a complex core in
its lowest-power V/F-scaled mode still draws much more power
than an ASISA-CMP CPU with the application core turned off.
• Complex-core with power-down on idle: In such a system, an
idle core would wake up on any external interrupt. This
approach could also be especially effective at saving power in
systems with lots of idle time, especially if the system is not
forced to wake up on every clock tick when there is no work to
do. However, this approach only works if the system is truly
idle for moderately long periods; a server system handling a
minimal rate of request packets, for example, might never stay
idle for long enough.
• Lots of simple cores: A CPU with lots of simple cores, each
of which could be independently powered off, might allow a
close approximation to the proportionality rule. However, this
configuration can only get reasonable throughput (and thus
reasonable throughput/joule for the full system) if we can
solve the general parallel-programming problem – a problem that
has been elusive for many years. Amdahl's Law may be an
inherent limit to this approach.
• Some cores with specialized ISAs: Others have explored this
alternative [19]. We have ruled it out for this paper, because
we believe that a single-ISA approach makes it much easier to
develop and maintain an operating system, and because we have
no way to simulate multiple ISAs in one system.

4 Software issues

We believe that the ASISA-CMP approach can be applied to OS
code in several ways, including:
1. Dynamically switching between cores in OS kernel code; see
Sec. 4.1.
2. Running virtualization “helper processing” on OS-friendly
cores; see Sec. 4.2.
3. Running applications, or parts thereof, with “OS-like code”
on OS-friendly cores; see Sec. 4.3.
To date, we have focused all of our implementation and
experimental work on the first approach.

4.1 Dynamic core-switching for OS kernel code

In any multiprocessor system, performance depends on whether
there is enough available parallelism to keep all processing
elements busy. Specifically in an ASISA-CMP-based system, there
are two ways to optimize throughput/joule:
1. If the system is underutilized: shift OS load to low-power
cores, and power down high-power cores, whenever possible.
2. If the system has available parallelism: shift OS load to
low-power cores, while keeping the high-power cores as busy as
possible with application code.
First, consider the case where there is no available
application-level parallelism. Figure 1 illustrates a simple
example for an application with just one thread. Figure 1(a)
shows a brief part of the application's execution on a
single-core CPU. When the application thread makes a system
call, execution moves from user mode to kernel mode and back.
Figure 1(b) shows the same execution sequence on a two-core
ASISA-CMP system. In this case, when the application thread
makes the system call, the kernel (1) transfers control from
the “application core” (core 1) to the “OS core” (core 0); (2)
puts core 1 in low-power mode; (3) executes the system call on
core 0; (4) wakes up core 1; and (5) transfers control back to
core 1.

Figure 1: Example without thread-level parallelism.
[Diagram: (a) a single-core CPU: thread A alternates between
user mode and kernel mode, incurring kernel entry/exit
overhead, interrupt entry/exit overhead for a device interrupt,
and cache-interference overhead. (b) an ASISA-CMP dual-core
CPU: thread A runs user code on core 1 and kernel code on core
0; core 1 sits in low-power mode during the system call, the
device interrupt is handled on core 0, and the added costs are
core-switch and cache-affinity overheads.]
`
If the OS core draws significantly less power than the
application core while not significantly reducing performance
on OS code, and if there were no overheads for switching cores
and for changing the power state of core 0, then the ASISA-CMP
system would have higher throughput per joule than the
single-core system. These two “ifs” are two of the key
questions for this paper: are there real benefits to running OS
code on specialized cores, and are the overheads small enough
to avoid overwhelming the benefits?
Figure 1(a) and (b) also show what happens on the arrival of an
interrupt. In the single-core system, application execution is
delayed both for the actual execution of the interrupt-handling
code in the OS, and also for interrupt exit and entry. In the
ASISA-CMP system, the application continues to run without
delay (except perhaps for memory-access interference). This
would probably be true for any multicore system, but in the
ASISA-CMP approach, interrupt handling happens on a
power-optimized core, rather than on a high-power core.
Note that we compare the dual-core ASISA-CMP system against a
single-core system, rather than against a symmetric dual-core
system, because (in the underutilized case) it seems likely
that a dual-core CMP would consume more power than a
single-core system, without any increase in throughput.
Next, consider the case where there is application-level
parallelism. Figure 2 illustrates another simple example, for
an application with two runnable threads, thread A and thread
B, where thread A is running at the start of the example, and
again the thread makes a system call. Figure 2(a) shows the
single-core case: while thread A is executing in the kernel,
thread B remains blocked until the scheduler decides that A has
run long enough. Figure 2(b) shows the ASISA-CMP case: when
thread A makes its system call, the kernel switches that thread
to the OS core (core 0), and thread B can then run on the
application core (core 1).
Again assuming that the switching overheads are not too large,
the use of ASISA-CMP increases the utilization of the
application core, because OS execution is moved to a more
appropriate core.

Figure 2: Example with thread-level parallelism.
[Diagram: (a) a single-core CPU: threads A and B share the
core; while A executes its system call in the kernel, B remains
blocked until a scheduler interrupt, with kernel entry/exit,
interrupt entry/exit, and cache-interference overheads. (b) an
ASISA-CMP dual-core CPU: A's kernel work moves to core 0 so
that B can run on core 1, with core-switch, cache-affinity, and
cache-interference overheads.]

4.2 Running virtualization helpers on OS-friendly cores

So far, we have discussed the operating system as if it were a
single monolithic kernel. Of course, many operating-system
designs, both in research and as products, have been
decomposed; for example, into a microkernel and a set of server
processes. Sec. 4.3 discusses the possibility of running daemon
processes on an OS-friendly core.
However, a different kind of decomposition has become popular:
the creation of one or more virtualization layers between the
traditional operating system and the hardware. These layers
basically execute OS kernel code, but in a different protection
domain from the “guest” OS. We believe these are clear
candidates for execution on OS-friendly cores. They also tend
to be bound to specific cores, which eliminates the performance
overhead of core-switching.
Note that the use of current virtual machine monitors (VMMs)
might undermine the use of dynamic core switching for kernel
code, because a guest operating system running in a virtual
machine might not be able to bind a thread to a specific core.
It is possible that a VMM's interface to the guest OS could be
augmented to expose the actual cores, or at least to expose the
distinction between application cores and OS-friendly cores,
but we have not explored this issue in detail.
• Xen's Domain 0: In Xen [4], and probably in similar VM
systems, one privileged virtual machine is used to multiplex
and manage the I/O operations issued by the other virtual
machines. This “Domain 0” probably would be most
power-efficient if run on an OS-friendly core.
• I/O and network helper threads: Several researchers have
proposed running I/O-related portions of the virtualization
environment on separate processors. McAuley and Neugebauer [18]
proposed “encapsulating operating system I/O subsystems in
Virtual Channel Processors,” which could run on separate
processors. Regnier et al. [24] and Brecht et al. [5] have
proposed running network packet-processing code on distinct
cores. For both of these approaches, OS-friendly cores would be
a good match for optimizing power and performance.
`
`4.3 Running OS-like code on OS-friendly
`cores
`
`Our hardware design is based on the observation that “OS
`code” behaves differently from user code. However, the
`definition of an operating system is quite fluid; the same code
`that executes inside the kernel in a monolithic system, such
`as Linux, may execute in user mode on a microkernel. Thus,
`we should not limit our choice of which code to execute on
`OS-friendly cores based simply on whether it runs in kernel
`mode.
`Typical systems include much code that shares these char-
`acteristics with actual kernel code:
• Libraries: much user-mode code actually executes in libraries
that are tightly integrated with the kernel. In microkernels
and similar approaches, it might be hard to distinguish much of
this code from traditional kernel code. So, one might expect
some library code to run most efficiently on an OS-friendly
core. On the other hand, we suspect that this would require
core-switching far too frequently, and it might be hard to
detect when to switch.
• Daemons: most operating systems (but especially microkernels)
use daemon processes to execute code that does not need to be
in the kernel, or that needs to block frequently. This code
might also be run most efficiently on an OS-friendly core, and
could easily be bound to the appropriate core (most operating
systems allow binding a process to a CPU).
• Servers: Some applications, such as Web servers, might fall
into the same category as daemons. However, cryptographic code
might run better on normal cores, and so a secure Web server
might have to be refactored to run optimally on an ASISA-CMP.
`Note that if we use ASISA-CMP to execute entire OS-like
`application processes (or, at least, threads) on OS-friendly
`cores, this eliminates the performance overhead of core-
`switching.
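Binding such a process to an OS-friendly core needs no new mechanism; on Linux, for example, the existing affinity interface suffices. A minimal sketch follows; the assumption that core 0 is the OS-friendly core is ours, not part of any existing system.

```python
# Sketch of binding an OS-like daemon process to an OS-friendly core,
# using the processor-affinity controls most operating systems already
# provide. Core numbering (core 0 = OS-friendly) is an assumption.
import os

OS_FRIENDLY_CORES = {0}

def bind_to_os_core(pid=0):
    """Pin a process (default: the caller) to the OS-friendly core(s);
    returns the resulting affinity set, or None if unsupported."""
    if hasattr(os, "sched_setaffinity"):  # Linux-only interface
        os.sched_setaffinity(pid, OS_FRIENDLY_CORES)
        return os.sched_getaffinity(pid)
    return None  # affinity control unavailable on this platform

print(bind_to_os_core())
```

An ASISA-CMP kernel would additionally need to export which core numbers are OS-friendly, the issue raised in Sec. 4.3.2.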
`
`4.3.1 Defining and recognizing “OS-like” code
`
`If it makes sense to run OS-like application code on OS-
`friendly cores, how does the OS decide which applications
`to treat this way? Programmers could simply declare their
`applications to be OS-like, but this is likely to lead to mis-
`takes. Instead, we believe that through monitoring the ap-
`propriate hardware performance counters, the OS can detect
`applications, or perhaps even threads, with OS-like execu-
`tion behavior.
`Automated recognition of OS-like code demands a clear
`definition of the term. While we do not yet have a spe-
`cific definition, the main characteristics of OS-like code
`are likely to include the absence of floating-point opera-
`tions; frequent pointer-chasing (which can defeat complex
`data cache designs); idiosyncratic conditional-branch beha-
`vior [7] (which can defeat branch predictors designed for ap-
`plication code), and frequent block-data copies (which can
`limit the effectiveness of large D-caches).
`Many of these idiosyncratic characteristics could be vis-
`ible via CPU performance counters. (Zhang et al. [26] have
`suggested that performance counters should be directly man-
`aged by the OS as a first-class resource, for similar reasons.)
`We are not sure, however, whether support for ASISA-CMP
`will require novel performance counters.
`We also suspect that an adaptive approach, based on tent-
`atively running a suspected OS-like thread on an OS-friendly
`core, and measuring the resulting change in progress, could
`be effective in determining whether a thread is sufficiently
`OS-like to benefit. This approach requires a means for the
`kernel to determine the progress of a thread, which is an open
`issue; in some cases, the rate at which it issues system calls
`could be a useful indicator.
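The counter-based recognition idea might look like the following sketch. The counter fields and the thresholds are invented for illustration and would have to be calibrated experimentally; they are not taken from any measured system.

```python
# Sketch of classifying a thread as "OS-like" from hardware
# performance-counter ratios. Thresholds are illustrative assumptions.
def is_os_like(fp_ops, branches, branch_misses, instructions):
    """Heuristic: little floating point, branch-heavy code whose
    branches are poorly predicted."""
    fp_rate = fp_ops / instructions
    branch_rate = branches / instructions
    miss_rate = branch_misses / branches if branches else 0.0
    return fp_rate < 0.01 and branch_rate > 0.15 and miss_rate > 0.05

# A kernel-ish profile vs. a numeric-application profile (made up):
print(is_os_like(fp_ops=10, branches=200_000, branch_misses=18_000,
                 instructions=1_000_000))   # -> True
print(is_os_like(fp_ops=300_000, branches=80_000, branch_misses=1_500,
                 instructions=1_000_000))   # -> False
```

The adaptive alternative described above would replace these static thresholds with a measured change in progress after tentatively moving the thread.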
`
`4.3.2 Should asymmetry be exposed to user-mode code?
`
Kernels already provide processor-affinity controls (e.g.,
sched_setaffinity on Linux). With asymmetrical cores, the
kernel might also expose core configuration data, to allow
OS-like user code to select the right core.
`
`7
`
`Petitioner Samsung Ex-1024, 0008
`
`
`
`
`5 Implementation of dynamic
`core-switching
`
`For the preliminary experiments in this paper, we made the
`simplest possible changes to Linux (version 2.6.18) to allow
`us to shift certain OS operations to an OS-friendly core. We
`assumed a two-core system, and did not add code to power-
`down (or slow down) idle cores.
Kernels typically execute two categories of code: “bottom-half”
interrupt-handler code, and “top-half” system-call code. We
used different strategies for each.
`
`5.1 Interrupt-handler code
`
`We believe that interrupt code (often known as “bottom half”
`code) should always prefer an OS-friendly core. The OS can
`configure interrupt delivery so as to ensure this.
`Linux already provides, via the /proc/irq directory in
`the /proc process information pseudo-filesystem, a way to
`set interrupt affinity to a specific core. We do this only for
`the NIC; most other devices have much lower interrupt rates,
`and the timer interrupt must go to all CPUs.
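For reference, the /proc/irq interface takes a hexadecimal CPU bitmask. The sketch below computes such a mask and shows (commented out) how it would be written; the IRQ number 19 is hypothetical, and the real one must be looked up in /proc/interrupts.

```python
# Sketch of steering a NIC's interrupts to the OS-friendly core via
# /proc/irq, as described above. IRQ 19 is a hypothetical example.
OS_CORE = 0

def affinity_mask(cpus):
    """Hex CPU bitmask in the format /proc/irq/*/smp_affinity expects."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

print(affinity_mask({OS_CORE}))  # -> 1
# On a live Linux system (as root), one would then write the mask:
# with open("/proc/irq/19/smp_affinity", "w") as f:
#     f.write(affinity_mask({OS_CORE}))
```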
`It is possible that the interrupt load could exceed the ca-
`pacity of the OS core(s) in a CPU. If so, should this load be
`spread out over the other cores? This might improve system
`throughput under some conditions, but it might be better to
`confine interrupt processing to a fixed subset of cores, as a
`way to limit the disruption caused by excessive interrupts. In
`our experiments, we have statically directed interrupts to the
`(single) OS core.
`
`5.2 System calls
`
`We faced two main challenges in our implementation of core
`switching for system calls: how to decide when to switch,
`and how to reduce the switching costs. Switching cores takes
`time, because
`1. switching involves executing extra instructions;
`2. switching requires transferring some state between
`CPUs;
`3. if the target core is powered down, it could take about
`a thousan