Single-ISA Heterogeneous Multi-Core Architectures:
The Potential for Processor Power Reduction

Rakesh Kumar*, Keith I. Farkas†, Norman P. Jouppi†, Parthasarathy Ranganathan†, Dean M. Tullsen*

*Department of Computer Science and Engineering
University of California, San Diego
La Jolla, CA 92093-0114

†HP Labs
1501 Page Mill Road
Palo Alto, CA 94304
Abstract

This paper proposes and evaluates single-ISA heterogeneous multi-core architectures as a mechanism to reduce processor power dissipation. Our design incorporates heterogeneous cores representing different points in the power/performance design space; during an application's execution, system software dynamically chooses the most appropriate core to meet specific performance and power requirements.

Our evaluation of this architecture shows significant energy benefits. For an objective function that optimizes for energy efficiency with a tight performance threshold, our results for 14 SPEC benchmarks indicate a 39% average energy reduction while sacrificing only 3% in performance. An objective function that optimizes for energy-delay with looser performance bounds achieves, on average, nearly a factor of three improvement in energy-delay product while sacrificing only 22% in performance. These energy savings are substantially greater than what chip-wide voltage/frequency scaling can achieve.

1 Introduction

As processors continue to increase in performance and speed, processor power consumption and heat dissipation have become key challenges in the design of future high-performance systems. For example, Pentium-4 class processors currently consume well over 50W, and processors in the year 2015 are expected to consume close to 300W [1]. Increased power consumption and heat dissipation typically lead to higher costs for thermal packaging, fans, electricity, and even air conditioning. Higher-power systems can also have a greater incidence of failures.

In this paper, we propose and evaluate a single-ISA heterogeneous multi-core architecture [26, 27] to reduce processor power dissipation. Prior chip-level multiprocessors (CMPs) have been proposed using multiple copies of the same core (i.e., homogeneous), or processors with co-processors that execute a different instruction set. We propose that for many applications, core diversity is of higher value than uniformity, offering much greater ability to adapt to the demands of the application(s). We present a multi-core architecture where all cores execute the same instruction set, but have different capabilities and performance levels. At run time, system software evaluates the resource requirements of an application and chooses the core that can best meet these requirements while minimizing energy consumption. The goal of this research is to identify and quantify some of the key advantages of this novel architecture in a particular execution environment.

One of the motivations for this proposal is that different applications have different resource requirements during their execution. Some applications may have a large amount of instruction-level parallelism (ILP), which can be exploited by a core that can issue many instructions per cycle (i.e., a wide-issue superscalar CPU). The same core, however, might be wasted on an application with little ILP, consuming significantly more power than a simpler core that is better matched to the characteristics of the application.

A heterogeneous multi-core architecture could be implemented by designing a series of cores from scratch, by reusing a series of previously-implemented processor cores after modifying their interfaces, or by a combination of these two approaches. In this paper, we consider the reuse of existing cores, which allows previous design effort to be amortized. Given the growth between generations of processors from the same architectural family, the entire family can typically be incorporated on a die only slightly larger than that required by the most advanced core.

In addition, the clock frequencies of the older cores would scale with technology, and would be much closer to that
of the latest processor technology than their original implementation clock frequency. Then, the primary criterion for selecting between the different cores would be the performance of each architecture and the resulting energy dissipation.

In this paper, we model one example of a single-ISA heterogeneous architecture: it includes four representative cores (two in-order cores and two out-of-order cores) from an ordered complexity/performance continuum in the Alpha processor roadmap. We show that typical applications not only place highly varied demands on an execution architecture, but also that those demands can vary between phases of the same program. We assume the ability to dynamically switch between cores. This allows the architecture to adapt to differences between applications, differences between phases in the same application, or changing priorities of the processor or workload over time. We show reductions in processor energy-delay product as high as 84% (a six-fold improvement) for individual applications, and 63% overall. Energy-delay² (the product of energy and the square of the delay) reductions are as high as 75% (a four-fold improvement), and 50% overall. Chip-wide voltage/frequency scaling can do no better than break even on this metric. We examine oracle-driven core switching to understand the limits of this approach, as well as realistic runtime heuristics for core switching.

The rest of the paper is organized as follows. Section 2 discusses the single-ISA heterogeneous multi-core architecture that we study. Section 3 describes the methodology used to study performance and power. Section 4 discusses the results of our evaluation, while Section 5 discusses related work. Section 6 summarizes the work and discusses ongoing and future research.

2 Architecture

This section gives an overview of a potential heterogeneous multi-core architecture and core-switching approach. The architecture consists of a chip-level multiprocessor with multiple, diverse processor cores. These cores all execute the same instruction set, but include significantly different resources and achieve different performance and energy efficiency on the same application. During an application's execution, the operating system software tries to match the application to the different cores, attempting to meet a defined objective function. For example, it may be trying to meet a particular performance requirement or goal, but doing so with maximum energy efficiency.

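To make the notion of an objective function concrete, the sketch below (hypothetical C, not code from the paper) picks the lowest-power core whose projected performance stays within a slack of the best core's performance. The IPS numbers, the 20% slack, and the helper names are illustrative assumptions; the power figures only loosely echo the typical-power values reported later in Table 2.

    #include <stdio.h>

    struct core_estimate {
        const char *name;
        double ips;    /* projected instructions per second on this core */
        double watts;  /* projected average power on this core */
    };

    /* Choose the lowest-power core whose projected performance is within
     * `slack` of the best core's performance. */
    static int pick_core(const struct core_estimate *c, int n, double slack)
    {
        double best_ips = 0.0;
        for (int i = 0; i < n; i++)
            if (c[i].ips > best_ips) best_ips = c[i].ips;

        int choice = 0;
        double best_watts = -1.0;
        for (int i = 0; i < n; i++) {
            if (c[i].ips < (1.0 - slack) * best_ips)
                continue;                      /* would violate the performance bound */
            if (best_watts < 0.0 || c[i].watts < best_watts) {
                best_watts = c[i].watts;
                choice = i;
            }
        }
        return choice;
    }

    int main(void)
    {
        /* Illustrative per-core projections for one program phase. */
        struct core_estimate cores[] = {
            { "EV4",  0.4e9,  3.7 }, { "EV5",  0.7e9,  6.9 },
            { "EV6",  1.1e9, 10.7 }, { "EV8-", 1.3e9, 46.4 },
        };
        int best = pick_core(cores, 4, 0.20);  /* 20% performance slack, assumed */
        printf("chosen core: %s\n", cores[best].name);
        return 0;
    }

With these made-up projections, the EV6 is selected: it stays within the 20% performance bound while consuming far less power than the EV8-.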
2.1 Discussion of Core Switching

There are many reasons why the best core for execution may vary over time. The demands of executing code vary widely between applications; thus, the best core for one application will often not be the best for the next, given a particular objective function (assumed to be some combination of energy and performance). In addition, the demands of a single application can also vary across phases of the program.

Even the objective function can change over time, as the processor changes power conditions (e.g., plugged vs. unplugged, full battery vs. low battery, thermal emergencies), as applications switch (e.g., low-priority vs. high-priority job), or even within an application (e.g., a real-time application is behind or ahead of schedule).

The experiments in this paper explore only a subset of these possible changing conditions. Specifically, they examine adaptation to phase changes in single applications. However, by simulating multiple applications and several objective functions, they also indirectly examine the potential to adapt to changing applications and objective functions. We believe a real system would see far greater opportunities to switch cores to adapt to changing execution and environmental conditions than the narrow set of experiments exhibited here.

This work examines a diverse set of execution cores. In a processor where the objective function is static (and perhaps the workload is well known), some of our results indicate that a smaller set of cores (often two) will suffice to achieve very significant gains. However, if the objective function varies over time or workload, a larger set of cores has even greater benefit.

2.2 Choice of cores.

To provide an effective platform for a wide variety of application execution characteristics and/or system priority functions, the cores on the heterogeneous multi-core processor should cover both a wide and evenly spaced range of the complexity/performance design space.

In this study, we consider a design that takes a series of previously implemented processor cores with slight changes to their interface; this choice reflects one of the key advantages of the CMP architecture, namely the effective amortization of design and verification effort. We include four Alpha cores: EV4 (Alpha 21064), EV5 (Alpha 21164), EV6 (Alpha 21264), and a single-threaded version of the EV8 (Alpha 21464), referred to as EV8-. These cores demonstrate strict gradation in terms of complexity and are capable of sharing a single executable. We assume the four cores have private L1 data and instruction caches and share a common L2 cache, phase-lock loop circuitry, and pins.

We chose the cores of these off-the-shelf processors due to the availability of real power and area data for these processors, except for the EV8, where we use projected numbers [10, 12, 23, 30]. All these processors have 64-bit architectures.
[Figure 1. Relative sizes of the cores used in the study (EV4, EV5, EV6, EV8-).]

Note that technology mapping across a few generations has been shown to be feasible [24].

Figure 1 shows the relative sizes of the cores used in the study, assuming they are all implemented in a 0.10 micron technology (the methodology used to obtain this figure is described in the next section). It can be seen that the resulting multi-core, with all four cores together, is only modestly (within 15%) larger than the EV8- core by itself.

Minor differences in the ISA between processor generations are handled easily. Either programs are compiled to the least common denominator (the EV4), or we use software traps for the older cores. If extensive use is made of the software traps, our mechanisms will naturally shy away from those cores, due to the low performance.

For this research, to simplify the initial analysis of this new execution paradigm, we assume that only one application runs at a time, on only one core. This design point could either represent an environment targeted at a single application at a time, or model policies that might be employed when a multithreaded multi-core configuration lacks thread parallelism. Because we assume a maximum of one thread running, the multithreaded features of EV8 are not needed. Hence, these are subtracted from the model, as discussed in Section 3. In addition, this assumption means that we do not need more than one of any core type. Finally, since only one core is active at a time, we implement cache coherence by ensuring that dirty data is flushed from the current core's L1 data cache before execution is migrated to another core.

This particular choice of architectures also gives a clear ordering in both power dissipation and expected performance. This allows the best coverage of the design space for a given number of cores and simplifies the design of core-switching algorithms.

2.3 Switching applications between cores.

Typical programs go through phases with different execution characteristics [35, 39]. Therefore, the best core during one phase may not be best for the next phase. This observation motivates the ability to dynamically switch cores in mid-execution to take full advantage of our heterogeneous architecture.

There is a cost to switching cores, so we must restrict the granularity of switching. One method for doing this would switch only at operating system timeslice intervals, when execution is in the operating system, with user state already saved to memory. If the OS decides a switch is in order, it powers up the new core, triggers a cache flush to save all dirty cache data to the shared L2, and signals the new core to start at a predefined OS entry point. The new core would then power down the old core and return from the timer interrupt handler. The user state saved by the old core would be loaded from memory into the new core at that time, as a normal consequence of returning from the operating system. Alternatively, we could switch to different cores at the granularity of the entire application, possibly chosen statically. In this study, we consider both these options.
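
A minimal sketch of the timeslice-based switching sequence just described, written as hypothetical C with stubbed hardware operations; the function names, their grouping into one routine, and the printouts are illustrative assumptions rather than an interface defined by the paper (in the scheme above, the last two steps would actually run on the new core).

    #include <stdio.h>

    /* Stubs standing in for hardware/OS operations; a real system would
     * implement these against the actual chip and kernel. */
    static void power_up_staged(int core)      { printf("power up core %d (staged)\n", core); }
    static void flush_dirty_l1_to_l2(int core) { printf("flush dirty L1 lines of core %d to shared L2\n", core); }
    static void start_at_os_entry(int core)    { printf("signal core %d to start at the OS entry point\n", core); }
    static void power_down(int core)           { printf("power down core %d\n", core); }

    /* Invoked from the timer-interrupt path, with user state already saved. */
    static void switch_core(int old_core, int new_core)
    {
        power_up_staged(new_core);          /* ~1000 cycles at 2.1 GHz assumed in the text */
        flush_dirty_l1_to_l2(old_core);     /* shared L2 keeps the migrated data visible */
        start_at_os_entry(new_core);        /* new core resumes inside the OS ... */
        power_down(old_core);               /* ... and powers the old core off */
        /* Returning from the timer interrupt on the new core reloads user state. */
    }

    int main(void)
    {
        switch_core(0, 2);  /* e.g., migrate from core 0 (EV4) to core 2 (EV6) */
        return 0;
    }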
In this work, we assume that unused cores are completely powered down, rather than left idle. Thus, unused cores suffer no static leakage or dynamic switching power. This does, however, introduce a latency for powering a new core up. We estimate that a given processor core can be powered up in approximately one thousand cycles of the 2.1 GHz clock (roughly half a microsecond). This assumption is based on the observation that when we power down a processor core, we do not power down the phase-lock loop that generates the clock for the core. Rather, in our multi-core architecture, the same phase-lock loop generates the clocks for all cores. Consequently, the power-up time of a core is determined by the time required for the power buses to charge and stabilize. In addition, to avoid injecting excessive noise on the power bus bars of the multi-core processor, we assume a staged power-up would be used.

In addition, our experiments confirm that switching cores at operating-system timer intervals ensures that the switching overhead has almost no impact on performance, even with the most pessimistic assumptions about power-up time, software overhead, and cache cold-start effects. However, these overheads are still modeled in our experiments in Section 4.4.

3 Methodology

This section discusses the various methodological challenges of this research, including modeling the power, the real estate, and the performance of the heterogeneous multi-core architecture.

  Processor         EV4          EV5          EV6              EV8-
  Issue-width       2            4            6 (OOO)          8 (OOO)
  I-Cache           8KB, DM      8KB, DM      64KB, 2-way      64KB, 4-way
  D-Cache           8KB, DM      8KB, DM      64KB, 2-way      64KB, 4-way
  Branch Pred.      2KB, 1-bit   2K-gshare    hybrid 2-level   hybrid 2-level (2X EV6 size)
  Number of MSHRs   2            4            8                16

Table 1. Configuration of the cores

3.1 Modeling of CPU Cores

The cores we simulate are roughly modeled after the EV4 (Alpha 21064), EV5 (Alpha 21164), EV6 (Alpha 21264), and EV8-. EV8- is a hypothetical single-threaded version of the EV8 (Alpha 21464). The data on the resources for EV8 is based on predictions made by Joel Emer [12] and Artur Klauser [23], conversations with people from the Alpha design team, and other reported data [10, 30]. The data on the resources of the other cores are based on published literature on these processors [2, 3, 4].

The multi-core processor is assumed to be implemented in a 0.10 micron technology. The cores have private first-level caches, and share an on-chip 3.5 MB, 7-way set-associative L2 cache. At 0.10 micron, this cache will occupy an area just under half the die size of the Pentium 4. All the cores are assumed to run at 2.1 GHz. This is the frequency at which an EV6 core would run if its 600 MHz, 0.35 micron implementation were scaled to a 0.10 micron technology. In the Alpha designs, the amount of work per pipe stage was relatively constant across processor generations [7, 11, 12, 15]; therefore, it is reasonable to assume they can all be clocked at the same rate when implemented in the same technology (if not as designed, processors with similar characteristics certainly could be). The input voltage for all the cores is assumed to be 1.2V.

Note that while we took care to model real architectures that have been available in the past, these could be considered as just sample design points in the continuum of processor designs that could be integrated into a heterogeneous multiple-core architecture. These existing designs already display the diversity of performance and power consumption desired. A custom or partially custom design would have much greater flexibility in ensuring that the performance and power space is covered in the most appropriate manner, but at the cost of the design-time and verification advantages of the approach we follow in this work.

Table 1 summarizes the configurations that were modeled for the various cores. All architectures are modeled as accurately as possible, given the parameters in Table 1, on a highly detailed instruction-level simulator. However, we did not faithfully model every detail of each architecture; we were most concerned with modeling the approximate space each core covers in our complexity/performance continuum.

Specific instances of deviations from exact design parameters include the following. The associativity of the EV8- caches is double the associativity of the equally-sized EV6 caches. EV8- uses a tournament predictor double the size of the EV6 branch predictor. All the caches are assumed to be non-blocking, but the number of MSHRs is assumed to double with successive cores to adjust to increasing issue width. All the out-of-order cores are assumed to have re-order buffers and load/store queues large enough to ensure no conflicts for these structures.

The various miss penalties and L2 cache access latencies for the simulated cores were determined using CACTI. CACTI [37] provides an integrated model of cache access time, cycle time, area, aspect ratio, and power. To calculate the penalties, we used CACTI to get access times and then added one cycle each for L1-miss detection, going to L2, and coming from L2. For calculating the L2 access time, we assume that the L2 data and tag accesses are serialized, so that the data memories don't have to be cycled on a miss and only the required set is cycled on a hit. Memory latency was set to 150 ns.

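The latency derivation above reduces to a small unit conversion; the hypothetical C sketch below turns a CACTI access time into cycles at the shared 2.1 GHz clock and adds the three one-cycle steps mentioned in the text. The 5 ns figure is a placeholder, not the paper's actual CACTI output.

    #include <math.h>
    #include <stdio.h>

    #define CORE_CLOCK_GHZ 2.1

    /* L1-miss penalty in cycles: CACTI L2 access time plus one cycle each for
     * L1-miss detection, the trip to L2, and the trip back from L2. */
    static int l1_miss_penalty_cycles(double l2_access_ns)
    {
        int l2_cycles = (int)ceil(l2_access_ns * CORE_CLOCK_GHZ);  /* ns x cycles/ns */
        return l2_cycles + 3;
    }

    int main(void)
    {
        double example_l2_ns = 5.0;  /* illustrative CACTI access time only */
        printf("L1 miss penalty: %d cycles\n", l1_miss_penalty_cycles(example_l2_ns));
        printf("memory latency:  %d cycles\n", (int)ceil(150.0 * CORE_CLOCK_GHZ));  /* 150 ns */
        return 0;
    }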
3.2 Modeling Power

Modeling power for this type of study is a challenge. We need to consider cores designed over a time span of more than a decade. Power depends not only on the configuration of a processor, but also on the circuit design style and process parameters. Also, actual power dissipation varies with activity, though the degree of variability again depends on the technology parameters as well as the gating style used.

No existing architecture-level power modeling framework accounts for all of these factors. Current power models like Wattch [8] are primarily meant for activity-based architectural-level power analysis and optimizations within a single processor generation, not as a tool to compare the absolute power consumption of widely varied architectures. We integrated Wattch into our architectural simulator and simulated the configuration of the various cores implemented in their original technologies to get an estimate of the maximum power consumption of these cores as well as the typical power consumption running various applications. We found that Wattch did not, in general, reproduce published peak and typical power for the variety of processor configurations we are using.

Therefore, we use a hybrid power model that uses estimates from Wattch, along with additional scaling and offset factors to calibrate for technology factors. This model not only accounts for activity-based dissipation, but also accounts for the design style and process parameter differences by relying on measured datapoints from the manufacturers.

To solve for the calibration factors, this methodology requires peak and typical power values for the actual processors and the corresponding values reported by Wattch. This allows us to establish scaling factors that use the output of Wattch to estimate the actual power dissipation within the expected range for each core. To obtain the values for the processor cores, we derive the values from the literature; Section 3.2.1 discusses our derivation of peak power, and Section 3.2.2 discusses our derivation of typical power. For the corresponding Wattch values, we estimate peak power for each core given peak activity assumptions for all the hardware structures, and use the simulator to derive typical power consumed for SPEC2000 benchmarks.

This methodology then both reproduces published results and scales reasonably accurately with activity. While this is not a perfect power model, it will be far more accurate than using Wattch alone, or relying simply on reported average power.

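The paper does not spell out the functional form of these "scaling and offset factors"; one plausible reading is a per-core linear calibration fit through the two known datapoints (peak and typical). The hypothetical C sketch below makes that assumption explicit; the Wattch-side numbers are invented for illustration, while 17.80 W and 10.68 W are the EV6 values from Table 2.

    #include <stdio.h>

    struct calib { double scale, offset; };

    /* Fit P_actual = scale * P_wattch + offset through the two known points
     * (peak, typical) for one core. */
    static struct calib fit(double wattch_peak, double actual_peak,
                            double wattch_typ,  double actual_typ)
    {
        struct calib c;
        c.scale  = (actual_peak - actual_typ) / (wattch_peak - wattch_typ);
        c.offset = actual_peak - c.scale * wattch_peak;
        return c;
    }

    static double calibrated_power(struct calib c, double wattch_estimate)
    {
        return c.scale * wattch_estimate + c.offset;
    }

    int main(void)
    {
        /* Wattch values (30.0, 18.0, 22.0) are illustrative; they are not published in the paper. */
        struct calib ev6 = fit(30.0, 17.80, 18.0, 10.68);
        printf("calibrated EV6 power at Wattch = 22.0: %.2f W\n", calibrated_power(ev6, 22.0));
        return 0;
    }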
3.2.1 Estimating Peak Power

This section details the methodology for estimating the peak power dissipation of the cores. Table 2 shows our power and area estimates for the cores. We start with the peak power data of the processors obtained from data sheets and conference publications [2, 3, 4, 10, 23]. To derive the peak power dissipation in the core of a processor from the published numbers, the power consumed in the L2 caches and at the output pins of the processor must be subtracted from the published value. Power consumption in the L2 caches under peak load was determined using CACTI, starting by finding the energy consumed per access and dividing by the effective access time. Details on bitouts, the extent of pipelining during accesses, etc. were obtained from data sheets (except for EV8-). For the EV8 L2, we assume 32-byte (288 bits including ECC) transfers on reads and writes to the L1 cache. We also assume the L2 cache is doubly pumped.

The power dissipation at the output pins is calculated using the formula P = 0.5 · C · V² · f. The values of V (bus voltage), f (effective bus frequency), and C (load capacitance) were obtained from data sheets. The effective bus frequency was calculated by dividing the peak bandwidth of the data bus by the maximum number of data output pins which are active per cycle. The address bus was assumed to operate at the same effective frequency. For processors like the EV4, the effective frequency of the bus connecting to the off-chip cache is different from the effective frequency of the system bus, so power must be calculated separately for those buses. We assume the probability that a bus line changes state is 0.5. For calculating the power at the output pins of EV8, we used the projected values for V and f. We assumed that half of the pins are input pins. Also, we assume that pin capacitance scales as the square root of the technology scaling factor. Due to reduced resources, we assumed that the EV8- core consumes 80% of the calculated EV8 core power. This reduction is primarily due to smaller issue queues and register files. The power data was then scaled to the 0.10 micron process. For scaling, we assumed that power dissipation varies directly with frequency, quadratically with input voltage, and is proportional to feature size.

The second column in Table 2 summarizes the power consumed by the cores at 0.10 micron technology. As can be seen from the table, the EV8- core consumes almost 20 times the peak power and more than 80 times the real estate of the EV4 core.

CACTI was also used to derive the energy per access of the shared L2 cache, for use in our simulations. We also estimated the power dissipation at the output pins of the L2 cache due to L2 misses. For this, we assume 400 output pins. We assume a load capacitance of 50 pF and a bus voltage of 2.5V. Again, an activity factor of 0.5 for bit-line transitions is assumed. We also ran some experiments with a detailed model of off-chip memory access power, but found that the level of off-chip activity is highly constant across cores, and did not impact our results.

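Restated in code, the pin-power and process-scaling arithmetic looks as follows; this is a hypothetical C sketch that assumes the reconstructed P = 0.5 · C · V² · f form, and the effective bus frequency and the 0.35-micron starting power are placeholders rather than the paper's data-sheet values.

    #include <stdio.h>

    /* Output-pin power: P = a * C * V^2 * f per pin, summed over active pins,
     * where a is the probability a bus line switches in a cycle (0.5 here). */
    static double pin_power_watts(int active_pins, double c_load_f,
                                  double v_bus, double f_eff_hz, double activity)
    {
        return active_pins * activity * c_load_f * v_bus * v_bus * f_eff_hz;
    }

    /* Scale a measured power number to a new design point: directly with
     * frequency, quadratically with voltage, linearly with feature size. */
    static double scale_power(double p, double f_ratio, double v_ratio, double feat_ratio)
    {
        return p * f_ratio * v_ratio * v_ratio * feat_ratio;
    }

    int main(void)
    {
        /* L2 output pins per the text: 400 pins, 50 pF, 2.5 V, activity 0.5;
         * the 400 MHz effective bus frequency is an assumed placeholder. */
        printf("L2 pin power: %.1f W\n",
               pin_power_watts(400, 50e-12, 2.5, 400e6, 0.5));

        /* Rescaling an illustrative 0.35-micron, 600 MHz, 2.2 V power figure
         * to 0.10 micron, 2.1 GHz, 1.2 V (inputs are not measured values). */
        printf("scaled core power: %.1f W\n",
               scale_power(70.0, 2.1 / 0.6, 1.2 / 2.2, 0.10 / 0.35));
        return 0;
    }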
3.2.2 Estimating Typical Power

Values for typical power are more difficult to obtain, so we rely on a variety of techniques and sources to arrive at these values.

Typical power for the EV6 and EV8- assumes peak-to-typical ratios similar to those published for Intel processors of the same generation (the 0.13 micron Pentium 4 [5] for EV8-, and the 0.35 micron late-release Pentium Pro [18, 22] for the EV6).

EV4 and EV5 typical power is extrapolated from these results and available thermal data [2, 3], assuming an approximately linear increase in power variation over time, due to wider-issue processors and the increased application of clock gating.

These typical values are then scaled in similar ways to the peak values (but using measured typical activity) to derive the power for the cores alone. Table 2 gives the derived typical power for each of our cores. Also shown, for each core, is the range in power demand for the actual applications we run, expressed as a percentage of typical power.

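Read literally, this extrapolation applies a published typical-to-peak ratio to the scaled peak number; a minimal sketch under that assumption is shown below. The 0.5 ratio is chosen only because it reproduces the EV8- entry in Table 2, not because it is the published Pentium 4 ratio.

    #include <stdio.h>

    /* Typical power estimated from peak power and a typical/peak ratio taken
     * from a contemporary processor of the same generation. */
    static double typical_from_peak(double peak_watts, double typ_to_peak_ratio)
    {
        return peak_watts * typ_to_peak_ratio;
    }

    int main(void)
    {
        /* 92.88 W is the EV8- peak from Table 2; the 0.5 ratio is an assumption. */
        printf("EV8- typical ~ %.2f W\n", typical_from_peak(92.88, 0.5));
        return 0;
    }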
  Core   Peak-power (Watts)   Core-area (mm²)   Typical-power (Watts)   Range (%)
  EV4    4.97                 2.87              3.73                    92-107
  EV5    9.83                 5.06              6.88                    89-109
  EV6    17.80                24.5              10.68                   86-113
  EV8-   92.88                236               46.44                   82-128

Table 2. Power and area statistics of the cores
  Program   Description
  ammp      Computational Chemistry
  applu     Parabolic/Elliptic Partial Differential Equations
  apsi      Meteorology: Pollutant Distribution
  art       Image Recognition/Neural Networks
  bzip2     Compression
  crafty    Game Playing: Chess
  eon       Computer Visualization
  equake    Seismic Wave Propagation Simulation
  fma3d     Finite-element Crash Simulation
  gzip      Compression
  mcf       Combinatorial Optimization
  twolf     Place and Route Simulator
  vortex    Object-oriented Database
  wupwise   Physics/Quantum Chromodynamics

Table 3. Benchmarks simulated.
3.2.3 Power Model Sensitivity

While our methodology includes several assumptions based on common rules-of-thumb used in typical processor design, we performed several sensitivity experiments with widely different assumptions about the range of power dissipation in the core. Our results show very little difference in the qualitative results of this research. For any reasonable assumptions about the range, the power differences between cores still dominate the power differences between applications on the same core. Furthermore, as noted previously, the cores can be considered as just sample design points in the continuum of processor designs that could be integrated into a heterogeneous multiple-core architecture.

3.3 Estimating Chip Area

Table 2 also summarizes the area occupied by the cores at 0.10 micron (also shown in Figure 1). The area of the cores (except EV8-) is derived from published photos of the dies after subtracting the area occupied by I/O pads, interconnection wires, the bus-interface unit, the L2 cache, and control logic. The area of the L2 cache of the multi-core processor is estimated using CACTI.

The die size of EV8 was predicted to be 400 mm² [33]. To determine the core size of EV8-, we subtract out the estimated area of the L2 cache (using CACTI). We also account for the reduction in the size of the register files, instruction queues, reorder buffer, and renaming tables that results from making EV8- single-threaded. For this, we use detailed models of the register bit equivalents (rbe) [31] for the register files, reorder buffer, and renaming tables at the original and reduced sizes. The sizes of the original and reduced instruction queues were estimated from examination of MIPS R10000 and HP PA-8000 data [9, 25], assuming that the area grows more than linearly with the number of entries. The area data is then scaled for the 0.10 micron process.

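Two of the scaling steps above are easy to make concrete; the hypothetical C sketch below assumes the usual quadratic dependence of area on (linear) feature size and uses a 1.5 exponent purely as a stand-in for "more than linear" growth with queue entries, since the exact exponent is not recoverable from this copy. The 300 mm² input is illustrative, not a measured die area.

    #include <math.h>
    #include <stdio.h>

    /* Scale a core area measured at one feature size to another, assuming
     * area tracks the square of the feature size. */
    static double scale_area(double area_mm2, double from_um, double to_um)
    {
        double r = to_um / from_um;
        return area_mm2 * r * r;
    }

    /* Superlinear growth of instruction-queue area with entry count; the 1.5
     * exponent is an illustrative stand-in for "more than linear". */
    static double queue_area(double base_area, int base_entries, int entries)
    {
        return base_area * pow((double)entries / base_entries, 1.5);
    }

    int main(void)
    {
        printf("%.1f mm^2 at 0.10 micron\n", scale_area(300.0, 0.35, 0.10));
        printf("%.2f x base area for a doubled queue\n", queue_area(1.0, 64, 128));
        return 0;
    }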
3.4 Modeling Performance

In this paper, we simulate the execution of 14 benchmarks from the SPEC2000 benchmark suite, including 7 from SPECint and 7 from SPECfp. These are listed in Table 3.

Benchmarks are simulated using SMTSIM, a cycle-accurate, execution-driven simulator that simulates an out-of-order, simultaneous multithreading processor [38], used in non-multithreading mode for this research. SMTSIM executes unmodified, statically linked Alpha binaries. The simulator was modified to simulate a multi-core processor comprising four heterogeneous cores sharing an on-chip L2 cache and the memory subsystem.

In all simulations in this research, we assume a single thread of execution running on one core at a time. Switching execution between cores involves flushing the pipeline of the "active" core and writing back all its dirty L1 cache lines to the L2 cache. The next instruction is then fetched into the pipeline of the new core. The execution time and energy of this overhead, as well as the startup effects on the new core, are accounted for in our simulations of the dynamic switching heuristics in Section 4.4.

The SimPoint tool [36] is used to determine the number of committed instructions which need to be fast-forwarded so as to capture the representative program behavior during simulation. After fast-forwarding, we simulate 1 billion instructions. All benchmarks are simulated using ref inputs.

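The cost charged for such a switch can be summarized as a simple cycle budget; the sketch below is an assumed decomposition (power-up, pipeline drain, per-line writeback), written as hypothetical C, and is not the simulator's actual accounting.

    #include <stdio.h>

    /* Assumed cycle budget for migrating execution to another core. */
    struct switch_cost {
        long powerup_cycles;    /* ~1000 cycles at 2.1 GHz per Section 2.3 */
        long drain_cycles;      /* flushing the active core's pipeline */
        long writeback_cycles;  /* per dirty L1 line written to the shared L2 */
    };

    static long switch_overhead(struct switch_cost c, long dirty_l1_lines)
    {
        return c.powerup_cycles + c.drain_cycles + dirty_l1_lines * c.writeback_cycles;
    }

    int main(void)
    {
        struct switch_cost c = { 1000, 50, 4 };  /* drain and writeback costs are illustrative */
        printf("overhead for 512 dirty lines: %ld cycles\n", switch_overhead(c, 512));
        return 0;
    }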
4 Discussion and Results

This section examines the effectiveness of single-ISA heterogeneous multi-core designs in reducing the power dissipation of processors. Section 4.1 examines the relative energy efficiency across cores, and how it varies by application and phase. Later sections use this variance, demonstrating both oracle and realistic core-switching heuristics to maximize particular objective functions.

4.1 Variation in Core Performance and Power

As discussed in Section 2, this work assumes that the performance ratios between our processor cores are not constant, but vary across benchmarks, as well as over time on a single benchmark. This section verifies that premise.

[Figures 2 and 3 appear here: per-core curves for applu, with one curve each for EV8-, EV6, EV5, and EV4.]

Figure 2(a) shows the performance, measured in millions of instructions committed per second (IPS), of one representative benchmark, applu. In the figure, a separate curve is shown for each of the four cores, with each data point representing the IPS over the preceding 1 million committed instructions.

With applu, there are very clear and distinct phases of performance on each core, and the relative performance of the cores varies significantly between these phases. Nearly all programs show clear phased behavior, although the frequency and variety of phases varies significantly.

If the relative performance of the cores varies over time, it follows that energy efficiency will also vary. Figure 3 shows one metric of energy efficiency (defined in this case as IPS²/Watt) of the various cores for the same benchmark. IPS²/Watt is merely the inverse of the energy-delay product. As can be seen, the relative value of the energy-delay product among cores, and even the ordering of the cores, varies from phase to phase.

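The claim that IPS²/Watt is (up to a workload-dependent constant) the inverse of the energy-delay product follows in one line; for a fixed instruction count N executed in time T with energy E:

\[
\frac{\mathrm{IPS}^2}{\mathrm{Watt}} \;=\; \frac{(N/T)^2}{E/T} \;=\; \frac{N^2}{E\,T} \;=\; \frac{N^2}{\mathrm{EDP}},
\]

so for a given program (fixed N), ranking cores by IPS²/Watt is the same as ranking them by 1/(E·T).
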
4.2 Oracle Heuristics for Dynamic Core Selection

This section examines the limits of the power and efficiency improvements possible with a heterogeneous multi-core architecture. The ideal core-selection algorithm depends heavily on the particular goals of the architecture or application. This section demonstrates oracle algorithms that maximize two sample objective functions. The first optimizes for energy efficiency with a tight performance threshold. The second optimizes for energy-delay product with a looser performance constraint.

These algorithms assume perfect knowledge of the performance and power characteristics at the granularity of in-
