`
COVER FEATURE

Energy Management for Commercial Servers

Charles Lefurgy, Karthick Rajamani, Freeman Rawson, Wes Felter, Michael Kistler, and Tom W. Keller
IBM Austin Research Lab

As power increasingly shapes commercial systems design, commercial servers can conserve energy by leveraging their unique architecture and workload characteristics.
`
In the past, energy-aware computing was primarily associated with mobile and embedded computing platforms. Servers (high-end, multiprocessor systems running commercial workloads) typically included extensive cooling systems and resided in custom-built rooms for high-power delivery. In recent years, however, as transistor density and demand for computing resources have rapidly increased, even high-end systems face energy-use constraints. Moreover, conventional computers are currently air cooled, and systems are approaching the limits of what manufacturers can build without introducing additional techniques such as liquid cooling. Clearly, good energy management is becoming important for all servers.

Power management challenges for commercial servers differ from those for mobile systems. Techniques for saving power and energy at the circuit and microarchitecture levels are well known,1 and other low-power options are specialized to a server's particular structure and the nature of its workload. Although there has been some progress, a gap still exists between the known solutions and the energy-management needs of servers.

In light of the trend toward isolating disk resources in separate cabinets and accessing them through some form of storage networking, the main focus of energy management for commercial servers is conserving power in the memory and microprocessor subsystems. Because their workloads are typically structured as multiple-application programs, system-wide approaches are more applicable to multiprocessor environments in commercial servers than techniques that are primarily applicable to single-application environments, such as those based on compiler optimizations.
`
COMMERCIAL SERVERS

Commercial servers comprise one or more high-performance processors and their associated caches; large amounts of dynamic random-access memory (DRAM) with multiple memory controllers; and high-speed interface chips for high memory bandwidth, I/O controllers, and high-speed network interfaces.

Servers with multiple processors typically are designed as symmetric multiprocessors (SMPs), which means that the processors share the main memory and any processor can access any memory location. This organization has several advantages:

• Multiprocessor systems can scale to much larger workloads than single-processor systems.
• Shared memory simplifies workload balancing across servers.
• The machine naturally supports the shared-memory programming paradigm that most developers prefer.
• Because it has a large capacity and high-bandwidth memory, a multiprocessor system can efficiently execute memory-intensive workloads.
`
In commercial servers, memory is hierarchical. These servers usually have two or three levels of cache between the processor and the main memory. Typical high-end commercial servers include IBM's p690, HP's 9000 Superdome, and Sun Microsystems' Sun Fire 15K.
`
`
`
Figure 1. A single multichip module in a Power4 system. Each processor has two processor cores, each executing a single program context. Each core contains L1 caches, shares an L2 cache, and connects to an off-chip L3 cache, a memory controller, and main memory.
`
Table 1. Power consumption breakdown for an IBM p670 (all values in watts).

Configuration          Processors   Memory   I/O and other   Processor and    I/O component   Total
                                             components      memory fans      fans
Small configuration    384          318      90              676              144             1,614
Large configuration    840          1,223    90              676              144             2,972
`
Figure 1 shows a high-level organization of processors and memory in a single multichip module (MCM) in an IBM Power4 system. Each Power4 processor contains two processor cores; each core executes a single program context.

The processor's two cores contain L1 caches (not shown) and share an L2 cache, which is the coherence point for the memory hierarchy. Each processor connects to an L3 cache (off-chip), a memory controller, and main memory. In some configurations, processors share the L3 caches. The four processors reside on an MCM and communicate through dedicated point-to-point links. Larger systems such as the IBM p690 consist of multiple connected MCMs.

Table 1 shows the power consumption of two configurations of an IBM p670 server, which is a midrange version of the p690. The top row gives the power breakdown for a small four-way server (single MCM with four single-core chips) with a 128-Mbyte L3 cache and a 16-Gbyte memory. The bottom row gives the breakdown for a larger 16-way server (dual MCM with four dual-core chips) with a 256-Mbyte L3 cache and a 128-Gbyte memory.

The power consumption breakdowns include
`
• the processors, including the MCMs with processor cores and L1 and L2 caches, cache controllers, and directories;
• the memory, consisting of the off-chip L3 caches, DRAM, memory controllers, and high-bandwidth interface chips between the controllers and DRAM;
• I/O and other nonfan components;
• fans for cooling processors and memory; and
• fans for cooling the I/O components.
`
We measured the power consumption at idle. A high-end commercial server typically focuses primarily on performance, and the designs incorporate few system-level power-management techniques. Consequently, idle and active power consumption are similar.

We estimated fan power consumption from product specifications. For the other components of the small configuration, we measured DC power. We estimated the power in the larger configuration by scaling the measurements of the smaller configuration based on relative increases in the component quantities. Separate measurements were made to obtain dual-core processor power consumption.

In the small configuration, processor power is greater than memory power: Processor power accounts for 24 percent of system power, memory power for 19 percent. In the larger configuration, the processors use 28 percent of the power, and memory uses 41 percent. This suggests the need to supplement the conventional, processor-centric approach to energy management with techniques for managing memory energy.

The high power consumption of the computing components generates large amounts of heat, requiring significant cooling capabilities. The fans driving the cooling system consume additional power. Fan power consumption, which is relatively fixed for the system cabinet, dominates the small configuration at 51 percent, and it is a big component of the large configuration at 28 percent. Reducing the power of computing components would allow a commensurate reduction in cooling capacity, therefore reducing fan power consumption.
We did not separate disk power because the measured system chiefly used remote storage and because the number of disks in any configuration varies dramatically. Current high-performance SCSI disks typically consume 11 to 18 watts each when active.

Fortunately, this machine organization suggests several natural options for power management. For example, using multiple, discrete processors allows for mechanisms to turn a subset of the processors off and on as needed. Similarly, multiple cache banks, memory controllers, and DRAM modules provide natural demarcations of power-manageable entities in the memory subsystem. In addition, the processing capabilities of memory controllers, although limited, can accommodate new power-management mechanisms.
`
Energy-management goals

Energy management primarily aims to limit maximum power consumption and improve energy efficiency. Although generally consistent with each other, the two goals are not identical. Some energy-management techniques address both goals, but most implementations focus on only one or the other.

Addressing the power consumption problem is critical to maintaining reliability and reducing cooling requirements. Traditionally, server designs countered increased power consumption by improving the cooling and packaging technology. More recently, designers have used circuit and microarchitectural approaches to reduce thermal stress.

Improving energy efficiency requires either increasing the number of operations per unit of energy consumed or decreasing the amount of energy consumed per operation. Increased energy efficiency reduces the operational costs for the system's power and cooling needs. Energy efficiency is particularly important in large installations such as data centers, where power and cooling costs can be sizable.

A recent energy-management challenge is leakage current in semiconductor circuits, which causes transistors designed for high frequencies to consume power even when they don't switch. Although we discuss some technologies that turn off idle components and reduce leakage, the primary approaches to tackling this problem center on improvements to circuit technology and microarchitecture design.
`
Figure 2. Load variation by hour at (a) a financial Web site and (b) the 1998 Olympics Web site in a one-day period. The number of requests received varies widely depending on time of day and other factors.
`
Server workloads

Commercial server workloads include transaction processing (for Web servers or databases, for example) and batch processing (for noninteractive, long-running programs). Transaction and batch processing offer somewhat different power-management opportunities.

As Figure 2 illustrates, transaction-oriented servers do not always run at peak capacity because their workloads often vary significantly depending on the time of day, day of the week, or other external factors. Such servers have significant buffer capacity to maintain performance goals in the event of unexpected workload increases. Thus, much of the server capacity remains unutilized during normal operation.

Transaction servers that run at peak throughput can impact the latency of individual requests and, consequently, fail to meet response time goals. Batch servers, on the other hand, often have less stringent latency requirements, and they might run at peak throughput in bursts. However, their more relaxed latency requirements mean that sometimes running a large job overnight is sufficient. As a result, both transaction and batch servers have idleness, or slack, that designers can exploit to reduce the energy used.

Server workloads can comprise multiple applications with varying computational and performance requirements. The server systems' organization often matches this variety. For example, a typical e-commerce Web site consists of a first tier of simple page servers organized in a cluster, a second tier of higher performance servers running Web applications, and a third tier of high-performance database servers.
Although such heterogeneous configurations primarily offer cost and performance benefits, they also present a power advantage: Machines optimized for particular tasks and amenable to workload-specific tuning for higher energy efficiencies serve each distinct part of the workload.

PROCESSORS

Processor power and energy management relies primarily on microarchitectural enhancements, but realizing their potential often requires some form of systems software support.

Frequency and voltage scaling

CMOS circuits in modern microprocessors consume power in proportion to their frequency and to the square of their operating voltage (dynamic power is roughly proportional to CV²f, where C is the switched capacitance). However, voltage and frequency are not independent of each other: A CPU can safely operate at a lower voltage only when it runs at a low enough frequency. Thus, reducing frequency and voltage together reduces energy per operation quadratically, but only decreases performance linearly.

Early work on dynamic frequency and voltage scaling (DVS), the ability to dynamically adjust processor frequency and voltage, proposed that instead of running at full speed and sitting idle part of the time, the CPU should dynamically change its frequency to accommodate the current load and eliminate slack.2 To select the proper CPU frequency and voltage, the system must predict CPU use over a future time interval.

Much of the recent work in DVS has sought to develop prediction heuristics,3 but further research on predicting processor use for server workloads is needed. Other barriers to implementing DVS on SMP servers remain. For example, many cache-coherence protocols assume that all processors run at the same frequency. Modifying these protocols to support DVS is nontrivial.

Standard DVS techniques attempt to minimize the time the processor spends running the operating system idle loop. Other researchers have explored the use of DVS to reduce or eliminate the stall cycles that poor memory access latency causes. Because memory access time often limits the performance of data-intensive applications, running the applications at reduced CPU frequency has a limited impact on performance.

Offline profiling or compiler analysis can determine the optimal CPU frequency for an application or application phase. A programmer can insert operations to change frequency directly into the application code, or the operating system can perform the operations during process scheduling.
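To make the interval-based approach concrete, here is a minimal sketch of a utilization-predicting DVS policy. It assumes a Linux-style cpufreq "userspace" governor file for setting the speed; the sysfs path, frequency steps, headroom, and smoothing factor are all illustrative assumptions, not any particular server's mechanism.

```c
#include <stdio.h>

/* Illustrative sysfs path; real systems vary. */
#define SETSPEED "/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed"

static const long freqs_khz[] = { 600000, 1100000, 1700000 };
#define NFREQ (sizeof freqs_khz / sizeof freqs_khz[0])

/* Predict the next interval's utilization with an exponentially
 * weighted moving average of past observations. */
static double predict_util(double history, double observed)
{
    const double alpha = 0.5;          /* smoothing factor (assumed) */
    return alpha * observed + (1.0 - alpha) * history;
}

/* Choose the lowest frequency whose capacity covers the predicted
 * demand, keeping 10 percent headroom to absorb mispredictions. */
static long choose_freq(double predicted_util, long cur_khz)
{
    double demand_khz = predicted_util * (double)cur_khz;
    for (size_t i = 0; i < NFREQ; i++)
        if (0.9 * (double)freqs_khz[i] >= demand_khz)
            return freqs_khz[i];
    return freqs_khz[NFREQ - 1];       /* saturated: run at full speed */
}

static void set_freq(long khz)
{
    FILE *f = fopen(SETSPEED, "w");
    if (f) { fprintf(f, "%ld\n", khz); fclose(f); }
}
```

A policy like this would run once per scheduling interval; predicting server workloads well enough to drive it is the open research problem noted above.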
`
Simultaneous multithreading

Unlike DVS, which reduces the number of idle processor cycles by lowering frequency, simultaneous multithreading (SMT) increases the processor use at a fixed frequency. Single programs exhibit idle cycles because of stalls on memory accesses or an application's inherent lack of instruction-level parallelism. Therefore, SMT maps multiple program contexts (threads) onto the processor concurrently to provide instructions that use the otherwise idle resources. This increases performance by increasing processor use. Because the threads share resources, single threads are less likely to issue highly speculative instructions.

Both effects improve energy efficiency because functional units stay busy executing useful, nonspeculative instructions.5 Support for SMT chiefly involves duplicating the registers describing the thread state, which requires much less power overhead than adding a processor.
`
Processor packing

Whereas DVS and SMT effectively reduce idle cycles on single processors, processor packing operates across multiple processors. Research at IBM shows that average processor use of real Web servers is 11 to 50 percent of their peak capacity.4 In SMP servers, this presents an opportunity to reduce power consumption.

Rather than balancing the load across all processors, leaving them all lightly loaded, processor packing concentrates the load onto the smallest number of processors possible and turns the remaining processors off. The major challenge to effectively implementing processor packing is the current lack of hardware support for turning processors in SMP servers on and off.
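Were such support available, the packing decision itself could be quite simple. The sketch below assumes a Linux-style per-CPU online control file and an 80 percent utilization target; both are illustrative assumptions, not an existing interface on the servers discussed here.

```c
#include <stdio.h>

/* Bring a CPU online (1) or take it offline (0) via the assumed
 * per-CPU control file; cpu0 stays online as the boot processor. */
static int set_cpu_online(int cpu, int online)
{
    char path[64];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/online", cpu);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", online);
    return fclose(f);
}

/* Pack the offered load onto the fewest CPUs that stay below the
 * target utilization, and turn the rest off. */
static void pack_load(int ncpus, double total_util)
{
    const double target = 0.8;                 /* per-CPU cap (assumed) */
    int keep = (int)(total_util / target) + 1; /* CPUs left online */
    for (int cpu = 1; cpu < ncpus; cpu++)
        set_cpu_online(cpu, cpu < keep ? 1 : 0);
}
```

Concentrating the load itself is an ordinary scheduling decision; the sketch only captures the on/off mechanism that, as noted above, current SMP hardware lacks.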
`
Throughput computing

Certain workload types might offer additional opportunities to reduce processor power. Servers typically process many requests, transactions, or jobs concurrently. If these tasks have internal parallelism, developers can replace high-performance processors with a larger number of slower but more energy-efficient processors providing the same throughput. Piranha6 and our Super-Dense Server7 prototype are two such systems. The SDS prototype, operating as a cluster, used half of the energy that a conventional server used on a transaction-processing workload.
`
`
`
`
`
However, throughput computing is not suitable for all applications, so server makers must develop separate systems to provide improved single-thread performance.
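The energy argument behind throughput computing follows directly from the CMOS power relation cited earlier. A rough worked example, assuming voltage scales linearly with frequency, static power is negligible, and the work parallelizes perfectly:

```latex
% Idealized model: dynamic power P = C V^2 f, with V scaling linearly in f,
% so P grows roughly as f^3.
%
% One fast processor at frequency f:    P_1 = C V^2 f
% Two slow processors at frequency f/2: P_2 = 2 C (V/2)^2 (f/2) = (1/4) C V^2 f
%
% Equal aggregate throughput on parallelizable work, at about one quarter
% of the dynamic power.
```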
`
MEMORY POWER

Server memory is typically double-data-rate synchronous dynamic random access memory (DDR SDRAM), which has two low-power modes: power-down and self-refresh. In a large server, the number of memory modules typically surpasses the number of outstanding memory requests, so memory modules are often idle. The memory controllers can put this idle memory into low-power mode until a processor accesses it.

Switching to power-down mode or back to active mode takes only one memory cycle and can reduce idle DRAM power consumption by more than 80 percent. This savings may increase with newer technology.

Self-refresh mode might achieve even greater power savings, but it currently requires several hundred cycles to return to active mode. Thus, realizing this mode's potential benefits requires more sophisticated techniques.
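One plausible controller policy for exploiting power-down mode is an idle timeout, sketched below. The per-rank bookkeeping and the 64-cycle threshold are illustrative assumptions, not a real controller's interface.

```c
#include <stdbool.h>
#include <stdint.h>

enum mode { ACTIVE, POWER_DOWN };

struct rank {
    enum mode mode;
    uint64_t  last_access;    /* cycle of the most recent access */
};

#define IDLE_TIMEOUT 64       /* idle cycles before power-down (assumed) */

/* Called by the controller once per memory cycle for each rank. */
static void rank_tick(struct rank *r, uint64_t now, bool accessed)
{
    if (accessed) {
        /* Exiting power-down costs only about one cycle, so the
         * access can proceed almost immediately. */
        r->mode = ACTIVE;
        r->last_access = now;
    } else if (r->mode == ACTIVE &&
               now - r->last_access >= IDLE_TIMEOUT) {
        r->mode = POWER_DOWN; /* cuts idle DRAM power by >80 percent */
    }
}
```

A self-refresh policy would use a far larger timeout, since its several-hundred-cycle exit latency must be amortized over a much longer idle period.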
`
Data placement

Data distribution in physical memory determines which memory devices a particular memory access uses and consequently the devices' active and idle periods. Memory devices with no active data can operate in a low-power mode.

When a processor accesses memory, memory controllers can bring powered-down memory into an active state before proceeding with the access. To optimize performance, the operating system can activate the memory that a newly scheduled process uses during the context switch period, thus largely hiding the latency of exiting the low-power mode.8,9

Better page allocation policies can also save energy. Allocating new pages to memory devices already in use helps reduce the number of active memory devices.9,10 Intelligent page migration, moving data from one memory device to another to reduce the number of active memory devices, can further reduce energy consumption.9,11

However, minimizing the number of active memory devices also reduces the memory bandwidth available to the application. Some accesses previously made in parallel to several memory devices must be made serially to the same memory device.
Data placement strategies originally targeted low-end systems; hence, developers must carefully evaluate them in the context of commercial server environments, where reduced memory bandwidth can substantially impact performance.
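To make the allocation idea concrete, the following sketch prefers devices that already hold data and wakes an idle device only as a last resort. The rank bookkeeping is invented for illustration.

```c
#include <stddef.h>

#define NRANKS 8   /* power-manageable memory devices (assumed) */

struct rank_state {
    int    has_active_data;   /* device holds at least one allocated page */
    size_t free_pages;
};

/* Energy-aware placement: allocate from devices that are already
 * active, and wake an idle device only when the active ones are
 * full.  Returns the chosen device, or -1 if memory is exhausted. */
static int pick_rank(struct rank_state ranks[NRANKS])
{
    for (int pass = 0; pass < 2; pass++)       /* pass 0: active only */
        for (int r = 0; r < NRANKS; r++)
            if (ranks[r].free_pages > 0 &&
                (pass == 1 || ranks[r].has_active_data))
                return r;
    return -1;
}
```

The tradeoff noted above appears directly here: concentrating pages on few devices saves power but serializes accesses that striped placement would have performed in parallel.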
`
Memory address mapping

Server memory systems provide configurable address mappings that can be used for energy efficiency. These systems partition memory among memory controllers, with each controller responsible for multiple banks of physical memory.

Configurable parameters in the system memory organization determine the interleaving: the mapping from a memory address to its location in a physical memory device. This mapping often occurs at the granularity of the higher-level cache line size. One possible interleaving allocates consecutive cache lines to DRAMs that different memory controllers manage, striping the physical memory across all controllers. Another approach involves mapping consecutive cache lines to the same controller, using all the physical memory under one controller before moving to the next. Under each controller, memory address interleaving can similarly occur across the multiple physical memory banks.

Spreading consecutive accesses, or even a single access, across multiple devices and controllers can improve memory bandwidth, but it may consume more power because the server must activate more memory devices and controllers.

The interactions of interleaving schemes with configurable DRAM parameters such as page-mode policies and burst length have an additional impact on memory latencies and power consumption. Contrary to current practice, in which systems come with a set interleaving scheme, a developer can tune an interleaving scheme to obtain the desired power and performance tradeoff for each workload.
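The two mappings described above are easy to state precisely. In the sketch below, the line size, controller count, and per-controller capacity are illustrative assumptions.

```c
#include <stdint.h>

#define LINE_BITS     7              /* 128-byte cache lines (assumed) */
#define NCTRL         4              /* memory controllers (assumed) */
#define CTRL_CAPACITY (1ull << 32)   /* bytes behind each controller */

/* Striped interleaving: consecutive cache lines rotate across all
 * controllers.  Bandwidth is spread, but every controller and its
 * DRAM must stay active. */
static unsigned ctrl_striped(uint64_t addr)
{
    return (unsigned)((addr >> LINE_BITS) % NCTRL);
}

/* Sequential interleaving: exhaust one controller's memory before
 * moving to the next, so a small footprint touches few controllers
 * and the others can sit in low-power modes. */
static unsigned ctrl_sequential(uint64_t addr)
{
    return (unsigned)(addr / CTRL_CAPACITY);
}
```

A system tuned per workload, as the text suggests, could select between mappings like these (or hybrids) at boot or partition-configuration time.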
`
Memory compression

An orthogonal approach to reducing the active physical memory is to use a system with less memory. IBM researchers have demonstrated that they can compress the data in a server's memory to half its size.12 A modified memory controller compresses and decompresses data when it accesses the memory. However, compression adds an extra step to the memory access, which can be time-consuming. Adding the 32-Mbyte L3 cache used in the IBM compression implementation significantly reduces the traffic to memory, thereby reducing the performance loss that compression causes. In fact, a compressed memory server can perform at nearly the same level as a server with no compression and double the memory.
`
`
`
`
Energy Conservation in Clustered Servers

Ricardo Bianchini, Rutgers University
Ram Rajamony, IBM Austin Research Lab

Power and energy conservation have recently become key concerns for high-performance servers, especially when deployed in large cluster configurations as in data centers and Web-hosting facilities.1

Research teams from Rutgers University2 and Duke University3 have proposed similar strategies for managing energy in Web-server clusters. The idea is to dynamically distribute the load offered to a server cluster so that, under light load, the system can idle some hardware resources and put them in low-power modes. Under heavy load, the system should reactivate the resources and redistribute the load to eliminate performance degradation. Because Web-server clusters replicate file data at all nodes, and traditional server hardware has very high base power (the power consumed when the system is on but idle), these systems dynamically turn entire nodes on and off, effectively reconfiguring the cluster.
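A minimal sketch of such a reconfiguration decision follows. It assumes the offered load is expressed in units of one node's nominal throughput on its bottleneck resource (the network interface in Figure A), and the 75 percent headroom target is invented for illustration; this is not the Rutgers or Duke systems' actual algorithm.

```c
#include <math.h>

/* Decide how many nodes should be on, given the cluster-wide load
 * as a multiple of one node's nominal bottleneck throughput. */
static int nodes_needed(double offered_load, int max_nodes)
{
    const double headroom = 0.75;   /* run nodes at <=75% (assumed) */
    int n = (int)ceil(offered_load / headroom);
    if (n < 1)
        n = 1;                      /* keep at least one node on */
    return n < max_nodes ? n : max_nodes;
}
```

Under light load most nodes power off, attacking the high base power directly; under heavy load all nodes return and the load is redistributed.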
`
Figure A illustrates the behavior of the Rutgers system for a seven-node cluster running a real, but accelerated, Web trace. The figure shows the evolution of the cluster configuration and offered loads on each resource as a function of time. It plots the load on each resource as a percentage of the nominal throughput of the same resource in one node. The figure shows that the network interface is the bottleneck resource throughout the entire experiment.

Figure B shows the cluster power consumption for two versions of the same experiment, again as a function of time. The lower curve (dynamic configuration) represents the reconfigurable version of the system, whereas the higher curve (static configuration) represents a configuration fixed at seven nodes. The figure shows that reconfiguration reduces power consumption significantly for most of the experiment, saving 38 percent in energy.

Figure A. Cluster evolution and per-resource offered loads. The network interface is the bottleneck resource throughout the entire experiment.

Figure B. Power under static and dynamic configurations. Reconfiguration reduces power consumption significantly for most of the experiment, saving 38 percent in energy.

Karthick Rajamani and Charles Lefurgy4 studied how to improve the cluster reconfiguration technique's energy-saving potential by using spare servers and history information about peak server loads. They also modeled the key system and workload parameters that influence this technique.

Mootaz Elnozahy and colleagues5 evaluated different combinations of cluster reconfiguration and two types of dynamic voltage scaling, independent and coordinated, for clusters in which the base power is relatively low. In independent voltage scaling, each server node makes its own decision about what voltage and frequency to use, depending on the load it is receiving. In coordinated voltage scaling, nodes coordinate their voltage and frequency settings. Their simulation results showed that, while either coordinated voltage scaling or reconfiguration is the best technique depending on the workload, combining them is always the best approach.

In other research, Mootaz Elnozahy and colleagues6 proposed using request batching to conserve processor energy under a light load. In this technique, the network interface processor accumulates incoming requests in memory while the server's host processor is in a low-power state. The system awakens the host processor when an accumulated request has been pending for longer than a threshold.

Taliver Heath and colleagues7 considered reconfiguration in the context of heterogeneous server clusters. Their experimental results for a cluster of traditional and blade nodes show that their heterogeneity-conscious server can conserve more than twice as much energy as a heterogeneity-oblivious reconfigurable server.

References

1. R. Bianchini and R. Rajamony, Power and Energy Management for Server Systems, tech. report DCS-TR-528, Dept. Computer Science, Rutgers Univ., 2003.
2. E. Pinheiro et al., "Dynamic Cluster Reconfiguration for Power and Performance," L. Benini, M. Kandemir, and J. Ramanujam, eds., Compilers and Operating Systems for Low Power, Kluwer Academic, 2003. Earlier version published in Proc. Workshop Compilers and Operating Systems for Low Power, 2001.
3. J.S. Chase et al., "Managing Energy and Server Resources in Hosting Centers," Proc. 18th ACM Symp. Operating Systems Principles, ACM Press, 2001, pp. 103-116.
4. K. Rajamani and C. Lefurgy, "On Evaluating Request-Distribution Schemes for Saving Energy in Server Clusters," Proc. 2003 IEEE Int'l Symp. Performance Analysis of Systems and Software, IEEE Press, 2003, pp. 111-122.
5. E.N. Elnozahy, M. Kistler, and R. Rajamony, "Energy-Efficient Server Clusters," Proc. 2nd Workshop Power-Aware Computing Systems, LNCS 2325, Springer, 2003, pp. 179-196.
6. E.N. Elnozahy, M. Kistler, and R. Rajamony, "Energy Conservation Policies for Web Servers," Proc. 4th Usenix Symp. Internet Technologies and Systems, Usenix Assoc., 2003, pp. 99-112.
7. T. Heath et al., "Self-Configuring Heterogeneous Server Clusters," Proc. Workshop Compilers and Operating Systems for Low Power, 2003, pp. 75-84.

Ricardo Bianchini is an assistant professor in the Department of Computer Science, Rutgers University. Contact him at ricardob@cs.rutgers.edu.

Ram Rajamony is a research staff member at IBM Austin Research Lab. Contact him at rajamony@us.ibm.com.
`
Although the modified memory controller and larger caches need additional power, high-memory-capacity servers should expect a net savings. However, for workloads with the highest memory bandwidth requirements and working sets exceeding the compression buffer size, this solution may not be energy efficient. Quantifying compression's power and performance benefit requires further analysis.
`
Cache coherence

Reducing overhead due to cache coherence traffic can also improve energy efficiency. Shared-memory servers often use invalidation protocols to maintain cache coherency. One way to implement these protocols is to have one cache level (typically the L2 cache) track, or snoop, memory requests from all processors.

The cache controllers of all processors in the server share a bus to memory and the other caches. Each cache controller arrives at the appropriate coherence action to maintain consistency between the caches based on its ability to observe memory traffic to and from all caches, and to snoop remote caches. Snoop accesses from other processors greatly increase L2 cache power consumption.

Andreas Moshovos and colleagues introduced a small cache-like structure they term a Jetty at each L2 cache to filter incoming snoop requests.13 The Jetty predicts whether the local cache contains a requested line with no false negatives. Querying the Jetty first and forwarding the snoop request to the L2 cache only when it predicts the line is in the local cache reduces L2 cache energy for all accesses by about 30 percent.
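A small counting filter conveys the flavor of this idea: it may answer "maybe present" wrongly, but a "not present" answer is always correct, so a miss safely skips the L2 snoop. The table size and hash below are illustrative, not the structure proposed in the Jetty paper.

```c
#include <stdbool.h>
#include <stdint.h>

#define JETTY_BUCKETS 4096

/* Each bucket counts how many locally cached lines hash to it.  A
 * zero count proves the line is absent: no false negatives. */
struct jetty {
    uint16_t count[JETTY_BUCKETS];
};

static unsigned bucket(uint64_t line_addr)
{
    /* Multiplicative hash (assumed); any well-mixed hash works. */
    return (unsigned)((line_addr * 0x9E3779B97F4A7C15ull) >> 52)
           % JETTY_BUCKETS;
}

/* Track lines as the local L2 fills and evicts them. */
static void jetty_fill(struct jetty *j, uint64_t a)  { j->count[bucket(a)]++; }
static void jetty_evict(struct jetty *j, uint64_t a) { j->count[bucket(a)]--; }

/* Snoop path: forward the request to the L2 only on a possible hit. */
static bool jetty_may_contain(const struct jetty *j, uint64_t a)
{
    return j->count[bucket(a)] != 0;
}
```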
Craig Saldanha and Mikko Lipasti propose replacing parallel snoop accesses with serial accesses.14 Many snoop cache implementations broadcast the snoop request in parallel to all caches on the SMP bus.
`
`
`
`
Snooping other processors' caches one at a time reduces the total number of snoop accesses. Although this method can reduce the energy and maximum power used, it increases the average cache miss access latency. Therefore, developers must balance the energy savings in the memory system against the energy used to keep other system components active longer.
`
istrators. Software policies can also use higher-level info