Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction
`
Rakesh Kumar†, Keith I. Farkas‡, Norman P. Jouppi‡, Parthasarathy Ranganathan‡, Dean M. Tullsen†
`
†Department of Computer Science and Engineering
`University of California, San Diego
`La Jolla, CA 92093-0114
`
‡HP Labs
`1501 Page Mill Road
`Palo Alto, CA 94304
`
`Abstract
`
`This paper proposes and evaluates single-ISA hetero-
`geneous multi-core architectures as a mechanism to re-
`duce processor power dissipation. Our design incorpo-
`rates heterogeneous cores representing different points in
`the power/performance design space; during an applica-
`tion’s execution, system software dynamically chooses the
`most appropriate core to meet specific performance and
`power requirements.
`Our evaluation of this architecture shows significant en-
`ergy benefits. For an objective function that optimizes for
`energy efficiency with a tight performance threshold, for 14
`SPEC benchmarks, our results indicate a 39% average en-
`ergy reduction while only sacrificing 3% in performance.
`An objective function that optimizes for energy-delay with
`looser performance bounds achieves, on average, nearly a
`factor of three improvement in energy-delay product while
`sacrificing only 22% in performance. Energy savings are
`substantially more than chip-wide voltage/frequency scal-
`ing.
`
`1 Introduction
`
`As processors continue to increase in performance and
`speed, processor power consumption and heat dissipation
`have become key challenges in the design of future high-
`performance systems. For example, Pentium-4 class pro-
`cessors currently consume well over 50W and processors in
`the year 2015 are expected to consume close to 300W [1].
Increased power consumption and heat dissipation typically lead to higher costs for thermal packaging, fans, electricity,
`and even air conditioning. Higher-power systems can also
`have a greater incidence of failures.
`In this paper, we propose and evaluate a single-ISA het-
`erogeneous multi-core architecture [26, 27] to reduce pro-
`
`cessor power dissipation. Prior chip-level multiproces-
`sors (CMP) have been proposed using multiple copies of
`the same core (i.e., homogeneous), or processors with co-
`processors that execute a different instruction set. We pro-
`pose that for many applications, core diversity is of higher
`value than uniformity, offering much greater ability to adapt
`to the demands of the application(s). We present a multi-
`core architecture where all cores execute the same instruc-
`tion set, but have different capabilities and performance lev-
`els. At run time, system software evaluates the resource re-
`quirements of an application and chooses the core that can
`best meet these requirements while minimizing energy con-
`sumption. The goal of this research is to identify and quan-
`tify some of the key advantages of this novel architecture in
`a particular execution environment.
`One of the motivations for this proposal is that differ-
`ent applications have different resource requirements dur-
`ing their execution. Some applications may have a large
`amount of instruction-level parallelism (ILP), which can be
`exploited by a core that can issue many instructions per
`cycle (i.e., a wide-issue superscalar CPU). The same core,
`however, might be wasted on an application with little ILP,
`consuming significantly more power than a simpler core
`that is better matched to the characteristics of the applica-
`tion.
`A heterogeneous multi-core architecture could be im-
`plemented by designing a series of cores from scratch, by
`reusing a series of previously-implemented processor cores
`after modifying their interfaces, or a combination of these
`two approaches. In this paper, we consider the reuse of ex-
`isting cores, which allows previous design effort to be amor-
`tized. Given the growth between generations of processors
`from the same architectural family, the entire family can
`typically be incorporated on a die only slightly larger than
`that required by the most advanced core.
`In addition, clock frequencies of the older cores would
`scale with technology, and would be much closer to that
`
`Proceedings of the 36th International Symposium on Microarchitecture (MICRO-36 2003)
`0-7695-2043-X/03 $17.00 © 2003 IEEE
`
`
`
`
`of the latest processor technology than their original imple-
`mentation clock frequency. Then, the primary criterion for
`selecting between different cores would be the performance
`of each architecture and the resulting energy dissipation.
`In this paper, we model one example of a single-ISA
`heterogeneous architecture – it includes four representative
`cores (two in-order cores and two out-of-order cores) from
`an ordered complexity/performance continuum in the Al-
`pha processor roadmap. We show that typical applications
`not only place highly varied demands on an execution archi-
tecture, but also that this demand can vary between phases
`of the same program. We assume the ability to dynami-
`cally switch between cores. This allows the architecture to
`adapt to differences between applications, differences be-
`tween phases in the same application, or changing priori-
ties of the processor or workload over time. We show reductions in processor energy-delay product as high as 84% (a six-fold improvement) for individual applications, and 63% overall. Energy-delay² (the product of energy and the square of the delay) reductions are as high as 75% (a four-fold improvement), and 50% overall. Chip-wide volt-
`age/frequency scaling can do no better than break even on
`this metric. We examine oracle-driven core switching, to
`understand the limits of this approach, as well as realistic
`runtime heuristics for core switching.
`The rest of the paper is organized as follows. Section 2
`discusses the single-ISA heterogeneous multi-core architec-
`ture that we study. Section 3 describes the methodology
`used to study performance and power. Section 4 discusses
`the results of our evaluation while Section 5 discusses re-
`lated work. Section 6 summarizes the work and discusses
`ongoing and future research.
`
`2 Architecture
`
`This section gives an overview of a potential heteroge-
`neous multi-core architecture and core-switching approach.
`The architecture consists of a chip-level multiprocessor
`with multiple, diverse processor cores. These cores all ex-
`ecute the same instruction set, but include significantly dif-
`ferent resources and achieve different performance and en-
`ergy efficiency on the same application. During an appli-
`cation’s execution, the operating system software tries to
`match the application to the different cores, attempting to
`meet a defined objective function. For example, it may be
`trying to meet a particular performance requirement or goal,
`but doing so with maximum energy efficiency.
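To make such an objective function concrete, here is a minimal sketch (ours, not the paper's mechanism) of choosing a core under one policy: maximize energy efficiency subject to a performance floor. The IPS figures are invented for illustration; the watt figures echo the typical-power values reported later in Table 2.

```python
# Hypothetical per-core predictions for the current application phase.
# IPS values are invented; the watt values echo Table 2's typical power.
CORES = {
    "EV4":  {"ips": 4.0e8, "watts": 3.73},
    "EV5":  {"ips": 6.0e8, "watts": 6.88},
    "EV6":  {"ips": 9.0e8, "watts": 10.68},
    "EV8-": {"ips": 1.1e9, "watts": 46.44},
}

def pick_core(cores, perf_floor):
    """Return the core minimizing energy per instruction among cores
    whose performance is within perf_floor of the best core."""
    best_ips = max(c["ips"] for c in cores.values())
    eligible = {name: c for name, c in cores.items()
                if c["ips"] >= perf_floor * best_ips}
    # energy per instruction = watts / IPS
    return min(eligible, key=lambda n: eligible[n]["watts"] / eligible[n]["ips"])
```

A tight performance threshold (say 0.95) keeps execution on EV8-, while loosening it to 0.80 lets the policy migrate to EV6, whose energy per instruction is several times lower.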
`
`2.1 Discussion of Core Switching
`
`There are many reasons why the best core for execution
`may vary over time. The demands of executing code vary
`
`widely between applications; thus, the best core for one ap-
`plication will often not be the best for the next, given a par-
`ticular objective function (assumed to be some combination
`of energy and performance). In addition, the demands of
`a single application can also vary across phases of the pro-
`gram.
`Even the objective function can change over time, as the
`processor changes power conditions (e.g., plugged vs. un-
`plugged, full battery vs. low battery, thermal emergencies),
`as applications switch (e.g., low priority vs. high priority
`job), or even within an application (e.g., a real-time appli-
`cation is behind or ahead of schedule).
The experiments in this paper explore only a subset of these possible changing conditions. Specifically, they examine adaptation to phase changes in single applications.
`However, by simulating multiple applications and several
`objective functions, it also indirectly examines the potential
`to adapt to changing applications and objective functions.
`We believe a real system would see far greater opportuni-
`ties to switch cores to adapt to changing execution and en-
`vironmental conditions than the narrow set of experiments
`exhibited here.
`This work examines a diverse set of execution cores. In a
`processor where the objective function is static (and perhaps
`the workload is well known), some of our results indicate
`that a smaller set of cores (often two) will suffice to achieve
`very significant gains. However, if the objective function
`varies over time or workload, a larger set of cores has even
`greater benefit.
`
`2.2 Choice of cores.
`
`To provide an effective platform for a wide variety of
`application execution characteristics and/or system priority
`functions, the cores on the heterogeneous multi-core pro-
`cessor should cover both a wide and evenly spaced range of
`the complexity/performance design space.
`In this study, we consider a design that takes a se-
`ries of previously implemented processor cores with slight
`changes to their interface – this choice reflects one of the
`key advantages of the CMP architecture, namely the effec-
`tive amortization of design and verification effort. We in-
`clude four Alpha cores – EV4 (Alpha 21064), EV5 (Alpha
`21164), EV6 (Alpha 21264) and a single-threaded version
`of the EV8 (Alpha 21464), referred to as EV8-. These cores
`demonstrate strict gradation in terms of complexity and are
`capable of sharing a single executable. We assume the four
`cores have private L1 data and instruction caches and share
`a common L2 cache, phase-lock loop circuitry, and pins.
We chose the cores of these off-the-shelf processors due to the availability of real power and area data for these processors, except for the EV8, where we use projected numbers [10, 12, 23, 30]. All these processors have 64-bit architectures. Note that technology mapping across a few generations has been shown to be feasible [24].

Figure 1. Relative sizes of the cores used in the study
`
Figure 1 shows the relative sizes of the cores used in the study, assuming they are all implemented in a 0.10 micron technology (the methodology used to obtain this figure is described in the next section). It can be seen that the resulting four-core die is only modestly (within 15%) larger than the EV8- core by itself.
`
`Minor differences in the ISA between processor gener-
`ations are handled easily. Either programs are compiled to
`the least common denominator (the EV4), or we use soft-
`ware traps for the older cores. If extensive use is made of
`the software traps, our mechanisms will naturally shy away
`from those cores, due to the low performance.
`
`For this research, to simplify the initial analysis of this
`new execution paradigm, we assume only one application
`runs at a time on only one core. This design point could
`either represent an environment targeted at a single applica-
`tion at a time, or modeling policies that might be employed
`when a multithreaded multi-core configuration lacks thread
`parallelism. Because we assume a maximum of one thread
`running, the multithreaded features of EV8 are not needed.
`Hence, these are subtracted from the model, as discussed in
`Section 3. In addition, this assumption means that we do
`not need more than one of any core type. Finally, since only
`one core is active at a time, we implement cache coherence
`by ensuring that dirty data is flushed from the current core’s
`L1 data cache before execution is migrated to another core.
`
`This particular choice of architectures also gives a clear
`ordering in both power dissipation and expected perfor-
`mance. This allows the best coverage of the design space
`for a given number of cores and simplifies the design of
`core-switching algorithms.
`
`2.3 Switching applications between cores.
`
`Typical programs go through phases with different exe-
`cution characteristics [35, 39]. Therefore, the best core dur-
`ing one phase may not be best for the next phase. This ob-
`servation motivates the ability to dynamically switch cores
`in mid execution to take full advantage of our heterogeneous
`architecture.
`There is a cost to switching cores, so we must restrict the
`granularity of switching. One method for doing this would
`switch only at operating system timeslice intervals, when
`execution is in the operating system, with user state already
`saved to memory. If the OS decides a switch is in order, it
`powers up the new core, triggers a cache flush to save all
`dirty cache data to the shared L2, and signals the new core
`to start at a predefined OS entry point. The new core would
`then power down the old core and return from the timer in-
`terrupt handler. The user state saved by the old core would
`be loaded from memory into the new core at that time, as
`a normal consequence of returning from the operating sys-
`tem. Alternatively, we could switch to different cores at the
`granularity of the entire application, possibly chosen stati-
`cally. In this study, we consider both these options.
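The timeslice-granularity switch described above can be sketched as follows. `SimCore` and its method names are our own hypothetical stand-ins for a core model, not an interface from the paper.

```python
OS_ENTRY_POINT = 0x1000  # hypothetical OS entry address

class SimCore:
    """Minimal stand-in for a processor core model."""
    def __init__(self, name):
        self.name = name
        self.powered = False
        self.pc = None

    def power_up(self):
        # Staged power-up; roughly one thousand 2.1 GHz cycles
        # in the paper's estimate.
        self.powered = True

    def power_down(self):
        self.powered = False

    def flush_pipeline(self):
        pass  # in-flight instructions are squashed or drained

    def writeback_dirty_l1(self, l2):
        # Only dirty L1 data must reach the shared L2 for coherence.
        l2.append((self.name, "dirty-lines"))

    def start_at(self, pc):
        self.pc = pc

def switch_core(old, new, l2):
    """Core switch at an OS timeslice; user state is already in memory."""
    new.power_up()
    old.flush_pipeline()
    old.writeback_dirty_l1(l2)
    new.start_at(OS_ENTRY_POINT)  # new core resumes in the OS
    old.power_down()              # performed by the new core, per the text
    # Returning from the timer interrupt reloads user state on the new core.
```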
`In this work, we assume that unused cores are com-
`pletely powered down, rather than left idle. Thus, unused
`cores suffer no static leakage or dynamic switching power.
`This does, however, introduce a latency for powering a new
`core up. We estimate that a given processor core can be
`powered up in approximately one thousand cycles of the
`2.1GHz clock. This assumption is based on the observa-
`tion that when we power down a processor core we do not
`power down the phase-lock loop that generates the clock for
`the core. Rather, in our multi-core architecture, the same
`phase-lock loop generates the clocks for all cores. Conse-
`quently, the power-up time of a core is determined by the
`time required for the power buses to charge and stabilize.
`In addition, to avoid injecting excessive noise on the power
`bus bars of the multi-core processor, we assume a staged
`power up would be used.
Moreover, our experiments confirm that switching cores at operating-system timer intervals ensures that the switching overhead has almost no impact on performance,
`even with the most pessimistic assumptions about power-up
`time, software overhead, and cache cold start effects. How-
`ever, these overheads are still modeled in our experiments
`in Section 4.4.
`
`3 Methodology
`
`This section discusses the various methodological chal-
`lenges of this research, including modeling the power, the
`real estate, and the performance of the heterogeneous multi-
`core architecture.
`
`
`
`
Processor       | EV4        | EV5       | EV6            | EV8-
Issue-width     | 2          | 4         | 6 (OOO)        | 8 (OOO)
I-Cache         | 8KB, DM    | 8KB, DM   | 64KB, 2-way    | 64KB, 4-way
D-Cache         | 8KB, DM    | 8KB, DM   | 64KB, 2-way    | 64KB, 4-way
Branch Pred.    | 2KB, 1-bit | 2K-gshare | hybrid 2-level | hybrid 2-level (2X EV6 size)
Number of MSHRs | 2          | 4         | 8              | 16

Table 1. Configuration of the cores
`
`3.1 Modeling of CPU Cores
`
`The cores we simulate are roughly modeled after cores
`of EV4 (Alpha 21064), EV5 (Alpha 21164), EV6 (Alpha
`21264) and EV8-. EV8- is a hypothetical single-threaded
`version of EV8 (Alpha 21464). The data on the resources
`for EV8 was based on predictions made by Joel Emer [12]
`and Artur Klauser [23], conversations with people from the
`Alpha design team, and other reported data [10, 30]. The
`data on the resources of the other cores are based on pub-
`lished literature on these processors [2, 3, 4].
`The multi-core processor is assumed to be implemented
`in a 0.10 micron technology. The cores have private first-
`level caches, and share an on-chip 3.5 MB 7-way set-
`associative L2 cache. At 0.10 micron, this cache will oc-
`cupy an area just under half the die size of the Pentium 4.
`All the cores are assumed to run at 2.1GHz. This is the
`frequency at which an EV6 core would run if its 600MHz,
`0.35 micron implementation was scaled to a 0.10 micron
`technology.
`In the Alpha design, the amount of work per
`pipe stage was relatively constant across processor genera-
`tions [7, 11, 12, 15]; therefore, it is reasonable to assume
`they can all be clocked at the same rate when implemented
`in the same technology (if not as designed, processors with
`similar characteristics certainly could). The input voltage
`for all the cores is assumed to be 1.2V.
`Note that while we took care to model real architectures
`that have been available in the past, we could consider these
`as just sample design points in the continuum of proces-
`sor designs that could be integrated into a heterogeneous
multiple-core architecture. These existing designs already display the diversity of performance and power consumption desired. However, a custom or partially custom design would have greater flexibility in covering the performance and power space in the most appropriate manner, at the cost of the design-time and verification advantages of the approach we follow in this work.
`Table 1 summarizes the configurations that were mod-
`eled for various cores. All architectures are modeled as ac-
`curately as possible, given the parameters in Table 1, on
`a highly detailed instruction-level simulator. However, we
`did not faithfully model every detail of each architecture;
`we were most concerned with modeling the approximate
`spaces each core covers in our complexity/performance
`continuum.
`
`Specific instances of deviations from exact design pa-
`rameters include the following. Associativity of the EV8-
`caches is double the associativity of equally-sized EV6
`caches. EV8- uses a tournament predictor double the size
`of the EV6 branch predictor. All the caches are assumed
`to be non-blocking, but the number of MSHRs is assumed
`to double with successive cores to adjust to increasing issue
`width. All the out-of-order cores are assumed to have big
`enough re-order buffers and large enough load/store queues
`to ensure no conflicts for these structures.
`The various miss penalties and L2 cache access laten-
`cies for the simulated cores were determined using CACTI.
`CACTI [37] provides an integrated model of cache access
`time, cycle time, area, aspect ratio, and power. To calculate
`the penalties, we used CACTI to get access times and then
`added one cycle each for L1-miss detection, going to L2,
`and coming from L2. For calculating the L2 access time,
`we assume that the L2 data and tag access are serialized so
`that the data memories don’t have to be cycled on a miss and
`only the required set is cycled on a hit. Memory latency was
`set to be 150ns.
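Under these assumptions, computing a penalty reduces to rounding a CACTI latency up to whole 2.1 GHz cycles and adding the three one-cycle hops. In the sketch below, the L2 access time is a made-up placeholder; the 150 ns memory latency is from the text.

```python
import math

CLOCK_HZ = 2.1e9  # all cores are clocked at 2.1 GHz

def to_cycles(seconds):
    """Round a latency up to whole cycles; the small epsilon guards
    against floating-point round-off on exact multiples."""
    return math.ceil(seconds * CLOCK_HZ - 1e-9)

L2_ACCESS_S = 5.0e-9    # placeholder CACTI access time, not the paper's value
MEM_LATENCY_S = 150e-9  # memory latency from the text

# L1-miss penalty: L2 access plus one cycle each for miss detection,
# the trip to the L2, and the trip back.
l1_miss_penalty = to_cycles(L2_ACCESS_S) + 3

# Memory penalty: 150 ns is 315 cycles at 2.1 GHz.
mem_penalty = to_cycles(MEM_LATENCY_S)
```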
`
`3.2 Modeling Power
`
`Modeling power for this type of study is a challenge. We
`need to consider cores designed over the time span of more
`than a decade. Power depends not only on the configuration
`of a processor, but also on the circuit design style and pro-
`cess parameters. Also, actual power dissipation varies with
`activity, though the degree of variability again depends on
`the technology parameters as well as the gating style used.
`No existing architecture-level power modeling frame-
`work accounts for all of these factors. Current power mod-
`els like Wattch [8] are primarily meant for activity-based
`architectural level power analysis and optimizations within
`a single processor generation, not as a tool to compare the
`absolute power consumption of widely varied architectures.
`We integrated Wattch into our architectural simulator and
`simulated the configuration of various cores implemented
`in their original technologies to get an estimate of the max-
`imum power consumption of these cores as well as the typ-
`ical power consumption running various applications. We
`found that Wattch did not, in general, reproduce published
`peak and typical power for the variety of processor config-
`urations we are using.
`
`
`
`
`Therefore we use a hybrid power model that uses esti-
`mates from Wattch, along with additional scaling and off-
`set factors to calibrate for technology factors. This model
`not only accounts for activity-based dissipation, but also
`accounts for the design style and process parameter differ-
`ences by relying on measured datapoints from the manufac-
`turers.
`To solve for the calibration factors, this methodology re-
`quires peak and typical power values for the actual proces-
`sors and the corresponding values reported by Wattch. This
`allows us to establish scaling factors that use the output of
`Wattch to estimate the actual power dissipation within the
`expected range for each core. To obtain the values for the
`processor cores, we derive the values from the literature;
`Section 3.2.1 discusses our derivation of peak power, and
`Section 3.2.2 discusses our derivation of typical power. For
`the corresponding Wattch values, we estimate peak power
`for each core given peak activity assumptions for all the
`hardware structures, and use the simulator to derive typical
`power consumed for SPEC2000 benchmarks.
`This methodology then both reproduces published re-
`sults and scales reasonably accurately with activity. While
`this is not a perfect power model, it will be far more accu-
`rate than using Wattch alone, or relying simply on reported
`average power.
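One plausible reading of these scaling and offset factors is a per-core linear fit anchored at the two published datapoints. The sketch below is our formulation, with invented numbers; the paper does not give its exact calibration equations.

```python
def calibrate(wattch_peak, wattch_typical, actual_peak, actual_typical):
    """Solve actual ~= scale * wattch + offset for one core from its
    (Wattch estimate, published value) pairs at peak and typical load."""
    scale = (actual_peak - actual_typical) / (wattch_peak - wattch_typical)
    offset = actual_peak - scale * wattch_peak
    return scale, offset

def calibrated_power(wattch_estimate, scale, offset):
    """Map an activity-based Wattch sample into the core's real range."""
    return scale * wattch_estimate + offset
```

Wattch samples between the two anchor points then land between the core's published typical and peak power, so the model scales with activity while reproducing the measured endpoints.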
`
`3.2.1 Estimating Peak Power
`
`This section details the methodology for estimating peak
`power dissipation of the cores. Table 2 shows our power
`and area estimates for the cores. We start with the peak
`power data of the processors obtained from data sheets and
`conference publications [2, 3, 4, 10, 23]. To derive the peak
`power dissipation in the core of a processor from the pub-
`lished numbers, the power consumed in the L2 caches and at
`the output pins of the processor must be subtracted from the
`published value. Power consumption in the L2 caches under
`peak load was determined using CACTI, starting by finding
`the energy consumed per access and dividing by the effec-
`tive access time. Details on bitouts, the extent of pipelining
`during accesses, etc. were obtained from data sheets (ex-
`cept for EV8-). For the EV8 L2, we assume 32 byte (288
`bits including ECC) transfers on reads and writes to the L1
`cache. We also assume the L2 cache is doubly pumped.
The power dissipation at the output pins is calculated using the formula: P = ½·C·V²·f.
`The values of V (bus voltage), f (effective bus frequency)
`and C (load capacitance) were obtained from data sheets.
`Effective bus frequency was calculated by dividing the peak
`bandwidth of the data bus by the maximum number of data
`output pins which are active per cycle. The address bus was
`assumed to operate at the same effective frequency. For pro-
`cessors like the EV4, the effective frequency of the bus con-
`
`necting to the off-chip cache is different from the effective
`frequency of the system bus, so power must be calculated
`separately for those buses. We assume the probability that
`a bus line changes state is 0.5. For calculating the power
`at the output pins of EV8, we used the projected values for
`V and f. We assumed that half of the pins are input pins.
`Also, we assume that pin capacitance scales as the square
`root of the technology scaling factor. Due to reduced re-
`sources, we assumed that the EV8- core consumes 80% of
`the calculated EV8 core-power. This reduction is primarily
`due to smaller issue queues and register files. The power
`data was then scaled to the 0.10 micron process. For scal-
`ing, we assumed that power dissipation varies directly with
`frequency, quadratically with input voltage, and is propor-
`tional to feature-size.
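The pin-power formula and these process-scaling rules combine as in the sketch below. The helper names and default values are ours; the 0.5 activity factor is the text's assumption.

```python
def pin_power_watts(n_pins, c_load_f, v_bus, f_eff_hz, activity=0.5):
    """Output-pin power: activity * (1/2) * C * V^2 * f per active pin.
    The 0.5 activity is the assumed probability a bus line changes state."""
    return n_pins * activity * 0.5 * c_load_f * v_bus ** 2 * f_eff_hz

def scale_power(p_watts, f_ratio, v_ratio, feature_ratio):
    """Process scaling per the text: linear in frequency, quadratic in
    input voltage, proportional to feature size."""
    return p_watts * f_ratio * v_ratio ** 2 * feature_ratio
```

With the 50 pF load and 2.5 V bus mentioned below for the L2 pins, a single pin toggling at an effective 400 MHz (a made-up frequency) dissipates about 31 mW.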
`The second column in Table 2 summarizes the power
`consumed by the cores at 0.10 micron technology. As can
`be seen from the table, the EV8- core consumes almost 20
`times the peak power and more than 80 times the real estate
`of the EV4 core.
`CACTI was also used to derive the energy per access of
`the shared L2 cache, for use in our simulations. We also es-
`timated power dissipation at the output pins of the L2 cache
`due to L2 misses. For this, we assume 400 output pins.
`We assume a load capacitance of 50pF and a bus voltage of
`2.5V. Again, an activity factor of 0.5 for bit-line transitions
`is assumed. We also ran some experiments with a detailed
`model of off-chip memory access power, but found that the
`level of off-chip activity is highly constant across cores, and
`did not impact our results.
`
`3.2.2 Estimating Typical Power
`
`Values for typical power are more difficult to obtain, so we
`rely on a variety of techniques and sources to arrive at these
`values.
`Typical power for the EV6 and EV8- assume similar
`peak to typical ratios as published data for Intel processors
`of the same generation (the 0.13 micron Pentium 4 [5] for
`EV8-, and the 0.35 micron late-release Pentium Pro [18, 22]
`for the EV6).
EV4 and EV5 typical power is extrapolated from these results and available thermal data [2, 3], assuming an approximately linear increase in power variation over time due to wider-issue processors and the increased application of clock gating.
`These typical values are then scaled in similar ways to
`the peak values (but using measured typical activity) to de-
`rive the power for the cores alone. Table 2 gives the derived
`typical power for each of our cores. Also shown, for each
`core, is the range in power demand for the actual applica-
`tions we run, expressed as a percentage of typical power.
`
`
`
`
Core | Peak-power (Watts) | Core-area (mm²) | Typical-power (Watts) | Range (%)
EV4  | 4.97               | 2.87            | 3.73                  | 92-107
EV5  | 9.83               | 5.06            | 6.88                  | 89-109
EV6  | 17.80              | 24.5            | 10.68                 | 86-113
EV8- | 92.88              | 236             | 46.44                 | 82-128

Table 2. Power and area statistics of the cores

Program | Description
ammp    | Computational Chemistry
applu   | Parabolic/Elliptic Partial Differential Equations
apsi    | Meteorology: Pollutant Distribution
art     | Image Recognition/Neural Networks
bzip2   | Compression
crafty  | Game Playing: Chess
eon     | Computer Visualization
equake  | Seismic Wave Propagation Simulation
fma3d   | Finite-element Crash Simulation
gzip    | Compression
mcf     | Combinatorial Optimization
twolf   | Place and Route Simulator
vortex  | Object-oriented Database
wupwise | Physics/Quantum Chromodynamics

Table 3. Benchmarks simulated.

`3.2.3 Power Model Sensitivity
`
`While our methodology includes several assumptions based
`on common rules-of-thumb used in typical processor de-
`sign, we performed several sensitivity experiments with
`widely different assumptions about the range of power dis-
`sipation in the core. Our results show very little difference
`in the qualitative results in this research. For any reasonable
`assumptions about the range, the power differences between
`cores still dominates the power difference between applica-
`tions on the same core. Furthermore, as noted previously,
`the cores can be considered as just sample design points in
`the continuum of processor designs that could be integrated
`into a heterogeneous multiple-core architecture.
`

3.3 Estimating Chip Area

Table 2 also summarizes the area occupied by the cores at 0.10 micron (also shown in Figure 1). The area of the cores (except EV8-) is derived from published photos of the dies after subtracting the area occupied by I/O pads, interconnection wires, the bus-interface unit, L2 cache, and control logic. Area of the L2 cache of the multi-core processor is estimated using CACTI.

The die size of EV8 was predicted to be 400 mm² [33]. To determine the core size of EV8-, we subtract out the estimated area of the L2 cache (using CACTI). We also account for reduction in the size of register files, instruction queues, reorder buffer, and renaming tables to account for the single-threaded EV8-. For this, we use detailed models of the register bit equivalents (rbe) [31] for register files, reorder buffer and renaming tables at the original and reduced sizes. The sizes of the original and reduced instruction queues were estimated from examination of MIPS R10000 and HP PA-8000 data [9, 25], assuming that the area grows more than linearly with the number of entries. The area data is then scaled for the 0.10 micron process.

3.4 Modeling Performance

In this paper, we simulate the execution of 14 benchmarks from the SPEC2000 benchmark suite, including 7 from SPECint and 7 from SPECfp. These are listed in Table 3.

Benchmarks are simulated using SMTSIM, a cycle-accurate, execution-driven simulator that simulates an out-of-order, simultaneous multithreading processor [38], used in non-multithreading mode for this research. SMTSIM executes unmodified, statically linked Alpha binaries. The simulator was modified to simulate a multi-core processor comprising four heterogeneous cores sharing an on-chip L2 cache and the memory subsystem.

In all simulations in this research we assume a single thread of execution running on one core at a time. Switching execution between cores involves flushing the pipeline of the "active" core and writing back all its dirty L1 cache lines to the L2 cache. The next instruction is then fetched into the pipeline of the new core. The execution time and energy of this overhead, as well as the startup effects on the new core, are accounted for in our simulations of the dynamic switching heuristics in Section 4.4.

The SimPoint tool [36] is used to determine the number of committed instructions that must be fast-forwarded so as to capture representative program behavior during simulation. After fast-forwarding, we simulate 1 billion instructions. All benchmarks are simulated using ref inputs.

4 Discussion and Results

This section examines the effectiveness of single-ISA heterogeneous multi-core designs in reducing the power dissipation of processors. Section 4.1 examines the relative energy efficiency across cores, and how it varies by application and phase. Later sections use this variance, demonstrating both oracle and realistic core-switching heuristics that maximize particular objective functions.
`
`4.1 Variation in Core Performance and Power
`
As discussed in Section 2, this work assumes that the performance ratios between our processor cores are not constant, but vary across benchmarks, as well as over time on a single benchmark. This section verifies that premise.
`
`
`
`
`,,,
`e: o.s
`
`'::
`
`II
`
`Ill
`
`■
`
`•
`
`..
`..
`•
`- •
`"
`OA ..
`• ..
`
`EVB•
`
`EV6
`
`EV5
`
`~VA
`
`-
`
`~
`
Figure 2(a) shows the performance, measured in million instructions committed per second (IPS), of one representative benchmark, applu. In the figure, a separate curve is shown for each of the five cores, with each data point representing the IPS over the preceding 1 million committed instructions.

With applu, there are very clear and distinct phases of performance on each core, and the relative performance of the cores varies significantly between these phases. Nearly all programs show clear phased behavior, although the frequency and variety of phases varies significantly.

If the relative performance of the cores varies over time, it follows that energy efficiency will also vary. Figure 3 shows one metric of energy efficiency (defined in this case as IPS²/Watt) of the various cores for the same benchmark. IPS²/Watt is merely the inverse of the energy-delay product. As can be seen, the relative value of the energy-delay product among cores, and even the ordering of the cores, varies from phase to phase.
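The relationship between IPS²/Watt and energy-delay product can be checked numerically: for a fixed instruction count N, IPS²/Watt × (energy × delay) = N². A small sketch with made-up numbers:

```python
def efficiency_metrics(n_insts, seconds, watts):
    """Interval metrics for one core; the inputs are illustrative."""
    ips = n_insts / seconds
    energy = watts * seconds          # joules
    ed = energy * seconds             # energy-delay product
    ed2 = energy * seconds ** 2       # energy-delay^2 product
    ips2_per_watt = ips ** 2 / watts  # the metric plotted in Figure 3
    return ed, ed2, ips2_per_watt
```

Because ips2_per_watt * ed equals n_insts² for fixed work, ranking cores by IPS²/Watt is equivalent to ranking them by inverse energy-delay product.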
`
4.2 Oracle Heuristics for Dynamic Core Selection

This section examines the limits of the power and efficiency improvements possible with a heterogeneous multi-core architecture. The ideal core-selection algorithm depends heavily on the particular goals of the architecture or application. This section demonstrates oracle algorithms that maximize two sample objective functions. The first optimizes for energy efficiency with a tight performance threshold. The second optimizes for energy-delay product with a looser performance constraint.

These algorithms assume perfect knowledge of the performance and power characteristics at the granularity of in-