COVER FEATURE

Energy Management for Commercial Servers

Charles Lefurgy, Karthick Rajamani, Freeman Rawson, Wes Felter, Michael Kistler, and Tom W. Keller
IBM Austin Research Lab

IEEE Computer, December 2003

As power increasingly shapes commercial systems design, commercial servers can conserve energy by leveraging their unique architecture and workload characteristics.
`
In the past, energy-aware computing was primarily associated with mobile and embedded computing platforms. Servers (high-end, multiprocessor systems running commercial workloads) typically included extensive cooling systems and resided in custom-built rooms for high-power delivery. In recent years, however, as transistor density and demand for computing resources have rapidly increased, even high-end systems face energy-use constraints. Moreover, conventional computers are currently air cooled, and systems are approaching the limits of what manufacturers can build without introducing additional techniques such as liquid cooling. Clearly, good energy management is becoming important for all servers.

Power management challenges for commercial servers differ from those for mobile systems. Techniques for saving power and energy at the circuit and microarchitecture levels are well known,1 and other low-power options are specialized to a server's particular structure and the nature of its workload. Although there has been some progress, a gap still exists between the known solutions and the energy-management needs of servers.

In light of the trend toward isolating disk resources in separate cabinets and accessing them through some form of storage networking, the main focus of energy management for commercial servers is conserving power in the memory and microprocessor subsystems. Because their workloads are typically structured as multiple-application programs, system-wide approaches are more applicable to multiprocessor environments in commercial servers than techniques that are primarily applicable to single-application environments, such as those based on compiler optimizations.
`
COMMERCIAL SERVERS

Commercial servers comprise one or more high-performance processors and their associated caches; large amounts of dynamic random-access memory (DRAM) with multiple memory controllers; and high-speed interface chips for high memory bandwidth, I/O controllers, and high-speed network interfaces.

Servers with multiple processors typically are designed as symmetric multiprocessors (SMPs), which means that the processors share the main memory and any processor can access any memory location. This organization has several advantages:

• Multiprocessor systems can scale to much larger workloads than single-processor systems.
• Shared memory simplifies workload balancing across servers.
• The machine naturally supports the shared-memory programming paradigm that most developers prefer.
• Because it has a large capacity and high-bandwidth memory, a multiprocessor system can efficiently execute memory-intensive workloads.

In commercial servers, memory is hierarchical. These servers usually have two or three levels of cache between the processor and the main memory. Typical high-end commercial servers include IBM's p690, HP's 9000 Superdome, and Sun Microsystems' Sun Fire 15K.
Figure 1. A single multichip module in a Power4 system. Each processor has two processor cores, each executing a single program context. Each core contains L1 caches, shares an L2 cache, and connects to an off-chip L3 cache, a memory controller, and main memory.
`
Table 1. Power consumption breakdown for an IBM p670 (all values in watts).

IBM p670 server        Processors   Memory   I/O and other   Processor and   I/O component   Total
                                             components      memory fans     fans
Small configuration    384          318      90              676             144             1,614
Large configuration    840          1,223    90              676             144             2,972
`
Figure 1 shows a high-level organization of processors and memory in a single multichip module (MCM) in an IBM Power4 system. Each Power4 processor contains two processor cores; each core executes a single program context.

The processor's two cores contain L1 caches (not shown) and share an L2 cache, which is the coherence point for the memory hierarchy. Each processor connects to an L3 cache (off-chip), a memory controller, and main memory. In some configurations, processors share the L3 caches. The four processors reside on an MCM and communicate through dedicated point-to-point links. Larger systems such as the IBM p690 consist of multiple connected MCMs.

Table 1 shows the power consumption of two configurations of an IBM p670 server, which is a midrange version of the p690. The top row gives the power breakdown for a small four-way server (single MCM with four single-core chips) with a 128-Mbyte L3 cache and a 16-Gbyte memory. The bottom row gives the breakdown for a larger 16-way server (dual MCM with four dual-core chips) with a 256-Mbyte L3 cache and a 128-Gbyte memory.

The power consumption breakdowns include

• the processors, including the MCMs with processor cores and L1 and L2 caches, cache controllers, and directories;
• the memory, consisting of the off-chip L3 caches, DRAM, memory controllers, and high-bandwidth interface chips between the controllers and DRAM;
• I/O and other nonfan components;
• fans for cooling processors and memory; and
• fans for cooling the I/O components.

We measured the power consumption at idle. A high-end commercial server typically focuses primarily on performance, and the designs incorporate few system-level power-management techniques. Consequently, idle and active power consumption are similar.

We estimated fan power consumption from product specifications. For the other components of the small configuration, we measured DC power. We estimated the power in the larger configuration by scaling the measurements of the smaller configuration based on relative increases in the component quantities. Separate measurements were made to obtain dual-core processor power consumption.

In the small configuration, processor power is greater than memory power: Processor power accounts for 24 percent of system power, memory power for 19 percent. In the larger configuration, the processors use 28 percent of the power, and memory uses 41 percent. This suggests the need to supplement the conventional, processor-centric approach to energy management with techniques for managing memory energy.
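The shares quoted above follow directly from Table 1. The short calculation below simply reproduces them from the table entries (the article reports whole percentages; for example, the small-configuration memory share computes to roughly 19.7 percent).

```python
# Power shares computed from Table 1 (watts); totals taken from the table.
table1 = {
    "small": {"processors": 384, "memory": 318, "io_other": 90,
              "proc_mem_fans": 676, "io_fans": 144, "total": 1614},
    "large": {"processors": 840, "memory": 1223, "io_other": 90,
              "proc_mem_fans": 676, "io_fans": 144, "total": 2972},
}

for name, row in table1.items():
    total = row["total"]
    fans = row["proc_mem_fans"] + row["io_fans"]
    print(f"{name}: processors {row['processors'] / total:.1%}, "
          f"memory {row['memory'] / total:.1%}, fans {fans / total:.1%}")
```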
The high power consumption of the computing components generates large amounts of heat, requiring significant cooling capabilities. The fans driving the cooling system consume additional power. Fan power consumption, which is relatively fixed for the system cabinet, dominates the small configuration at 51 percent, and it is a big component of the large configuration at 28 percent.
Reducing the power of computing components would allow a commensurate reduction in cooling capacity, therefore reducing fan power consumption.

We did not separate disk power because the measured system chiefly used remote storage and because the number of disks in any configuration varies dramatically. Current high-performance SCSI disks typically consume 11 to 18 watts each when active.

Fortunately, this machine organization suggests several natural options for power management. For example, using multiple, discrete processors allows for mechanisms to turn a subset of the processors off and on as needed. Similarly, multiple cache banks, memory controllers, and DRAM modules provide natural demarcations of power-manageable entities in the memory subsystem. In addition, the processing capabilities of memory controllers, although limited, can accommodate new power-management mechanisms.
`
Energy-management goals

Energy management primarily aims to limit maximum power consumption and improve energy efficiency. Although generally consistent with each other, the two goals are not identical. Some energy-management techniques address both goals, but most implementations focus on only one or the other.

Addressing the power consumption problem is critical to maintaining reliability and reducing cooling requirements. Traditionally, server designs countered increased power consumption by improving the cooling and packaging technology. More recently, designers have used circuit and microarchitectural approaches to reduce thermal stress.

Improving energy efficiency requires either increasing the number of operations per unit of energy consumed or decreasing the amount of energy consumed per operation. Increased energy efficiency reduces the operational costs for the system's power and cooling needs. Energy efficiency is particularly important in large installations such as data centers, where power and cooling costs can be sizable.

A recent energy-management challenge is leakage current in semiconductor circuits, which causes transistors designed for high frequencies to consume power even when they don't switch. Although we discuss some technologies that turn off idle components and reduce leakage, the primary approaches to tackling this problem center on improvements to circuit technology and microarchitecture design.
`
Figure 2. Load variation by hour at (a) a financial Web site and (b) the 1998 Olympics Web site in a one-day period. The number of requests received varies widely depending on time of day and other factors.
`
Server workloads

Commercial server workloads include transaction processing (for Web servers or databases, for example) and batch processing (for noninteractive, long-running programs). Transaction and batch processing offer somewhat different power-management opportunities.

As Figure 2 illustrates, transaction-oriented servers do not always run at peak capacity because their workloads often vary significantly depending on the time of day, day of the week, or other external factors. Such servers have significant buffer capacity to maintain performance goals in the event of unexpected workload increases. Thus, much of the server capacity remains unutilized during normal operation.

Transaction servers that run at peak throughput can impact the latency of individual requests and, consequently, fail to meet response time goals. Batch servers, on the other hand, often have less stringent latency requirements, and they might run at peak throughput in bursts. However, their more relaxed latency requirements mean that sometimes running a large job overnight is sufficient. As a result, both transaction and batch servers have idleness, or slack, that designers can exploit to reduce the energy used.

Server workloads can comprise multiple applications with varying computational and performance requirements. The server systems' organization often matches this variety. For example, a typical e-commerce Web site consists of a first tier of simple page servers organized in a cluster, a second tier of higher-performance servers running Web applications, and a third tier of high-performance database servers.
Although such heterogeneous configurations primarily offer cost and performance benefits, they also present a power advantage: Machines optimized for particular tasks and amenable to workload-specific tuning for higher energy efficiencies serve each distinct part of the workload.

PROCESSORS

Processor power and energy management relies primarily on microarchitectural enhancements, but realizing their potential often requires some form of systems software support.

Frequency and voltage scaling

CMOS circuits in modern microprocessors consume power in proportion to their frequency and to the square of their operating voltage. However, voltage and frequency are not independent of each other: A CPU can safely operate at a lower voltage only when it runs at a low enough frequency. Thus, reducing frequency and voltage together reduces energy per operation quadratically, but only decreases performance linearly.

Early work on dynamic frequency and voltage scaling (DVS), the ability to dynamically adjust processor frequency and voltage, proposed that instead of running at full speed and sitting idle part of the time, the CPU should dynamically change its frequency to accommodate the current load and eliminate slack.2 To select the proper CPU frequency and voltage, the system must predict CPU use over a future time interval.

Much of the recent work in DVS has sought to develop prediction heuristics,3 but further research on predicting processor use for server workloads is needed. Other barriers to implementing DVS on SMP servers remain. For example, many cache-coherence protocols assume that all processors run at the same frequency. Modifying these protocols to support DVS is nontrivial.

Standard DVS techniques attempt to minimize the time the processor spends running the operating system idle loop. Other researchers have explored the use of DVS to reduce or eliminate the stall cycles that poor memory access latency causes. Because memory access time often limits the performance of data-intensive applications, running the applications at reduced CPU frequency has a limited impact on performance.

Offline profiling or compiler analysis can determine the optimal CPU frequency for an application or application phase. A programmer can insert operations to change frequency directly into the application code, or the operating system can perform the operations during process scheduling.
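The interval-based approach described above can be sketched as a simple governor: measure utilization over the last few intervals, predict the next interval's demand, and choose the lowest frequency/voltage pair that covers it. The sketch below is purely illustrative and is not the algorithm of any cited paper; the operating-point table and the weighted-average predictor are assumptions made for the example, and the V^2*f relation appears only as a rough normalization of dynamic power.

```python
# Minimal sketch of an interval-based DVS governor (illustrative only).
# Assumptions: a fixed table of frequency/voltage operating points and a
# simple exponentially weighted moving average as the utilization predictor.

# (frequency in GHz, voltage in volts) -- hypothetical operating points
OPERATING_POINTS = [(0.6, 0.9), (1.0, 1.1), (1.4, 1.2), (1.8, 1.3)]
F_MAX = OPERATING_POINTS[-1][0]

def predict_utilization(history, alpha=0.5):
    """Exponentially weighted moving average of past utilization samples."""
    prediction = history[0]
    for sample in history[1:]:
        prediction = alpha * sample + (1 - alpha) * prediction
    return prediction

def select_operating_point(history):
    """Pick the lowest frequency whose capacity covers the predicted load."""
    predicted = predict_utilization(history)      # fraction of peak work
    demand = predicted * F_MAX                    # required cycles, in GHz
    for freq, volt in OPERATING_POINTS:           # sorted low to high
        if freq >= demand:
            return freq, volt
    return OPERATING_POINTS[-1]

def relative_power(freq, volt):
    """Dynamic CMOS power scales roughly as V^2 * f (normalized to peak)."""
    f_peak, v_peak = OPERATING_POINTS[-1]
    return (volt ** 2 * freq) / (v_peak ** 2 * f_peak)

if __name__ == "__main__":
    utilization_history = [0.35, 0.42, 0.38, 0.40]   # fraction of peak
    freq, volt = select_operating_point(utilization_history)
    print(f"run at {freq} GHz / {volt} V, "
          f"~{relative_power(freq, volt):.0%} of peak dynamic power")
```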
`
Simultaneous multithreading

Unlike DVS, which reduces the number of idle processor cycles by lowering frequency, simultaneous multithreading (SMT) increases processor use at a fixed frequency. Single programs exhibit idle cycles because of stalls on memory accesses or an application's inherent lack of instruction-level parallelism. Therefore, SMT maps multiple program contexts (threads) onto the processor concurrently to provide instructions that use the otherwise idle resources. This increases performance by increasing processor use. Because the threads share resources, single threads are less likely to issue highly speculative instructions.

Both effects improve energy efficiency because functional units stay busy executing useful, nonspeculative instructions.5 Support for SMT chiefly involves duplicating the registers describing the thread state, which requires much less power overhead than adding a processor.
`
Processor packing

Whereas DVS and SMT effectively reduce idle cycles on single processors, processor packing operates across multiple processors. Research at IBM shows that average processor use of real Web servers is 11 to 50 percent of their peak capacity.4 In SMP servers, this presents an opportunity to reduce power consumption.

Rather than balancing the load across all processors, leaving them all lightly loaded, processor packing concentrates the load onto the smallest number of processors possible and turns the remaining processors off. The major challenge to effectively implementing processor packing is the current lack of hardware support for turning processors in SMP servers on and off.
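A minimal sketch of the packing idea follows: estimate aggregate demand, keep online the smallest number of processors that can serve it with some headroom, and concentrate the load there. The headroom value and the notion of an on/off mechanism are assumptions for illustration; as noted above, current SMP hardware generally lacks support for actually turning processors off.

```python
import math

# Minimal sketch of processor packing (illustrative only).
# Assumption: per-CPU utilizations are available and a platform-specific
# mechanism (not shown) can take processors on- and offline.

def pack(cpu_utilizations, target_utilization=0.7):
    """Return how many processors to keep online and the load per processor.

    target_utilization leaves headroom for sudden load increases.
    """
    total_load = sum(cpu_utilizations)          # in units of one CPU
    n_total = len(cpu_utilizations)
    n_online = min(n_total, max(1, math.ceil(total_load / target_utilization)))
    return n_online, total_load / n_online

if __name__ == "__main__":
    # Eight processors, each lightly loaded (around 20 percent busy).
    utilizations = [0.22, 0.18, 0.25, 0.20, 0.15, 0.19, 0.21, 0.17]
    online, per_cpu = pack(utilizations)
    print(f"keep {online} of {len(utilizations)} processors online, "
          f"each ~{per_cpu:.0%} busy; turn the rest off")
```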
`
Throughput computing

Certain workload types might offer additional opportunities to reduce processor power. Servers typically process many requests, transactions, or jobs concurrently. If these tasks have internal parallelism, developers can replace high-performance processors with a larger number of slower but more energy-efficient processors providing the same throughput.

Piranha6 and our Super-Dense Server (SDS)7 prototype are two such systems. The SDS prototype, operating as a cluster, used half of the energy that a conventional server used on a transaction-processing workload. However, throughput computing is not suitable for all applications, so server makers must develop separate systems to provide improved single-thread performance.
`
`
`

`
MEMORY POWER

Server memory is typically double-data rate synchronous dynamic random access memory (DDR SDRAM), which has two low-power modes: power-down and self-refresh. In a large server, the number of memory modules typically surpasses the number of outstanding memory requests, so memory modules are often idle. The memory controllers can put this idle memory into low-power mode until a processor accesses it.

Switching to power-down mode or back to active mode takes only one memory cycle and can reduce idle DRAM power consumption by more than 80 percent. This savings may increase with newer technology.

Self-refresh mode might achieve even greater power savings, but it currently requires several hundred cycles to return to active mode. Thus, realizing this mode's potential benefits requires more sophisticated techniques.
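The controller policy described above amounts to an idle-time threshold per memory device: after a device has been idle for some number of cycles, drop it into power-down (or, with a much longer threshold, self-refresh) and pay a wake-up latency on the next access. The sketch below is illustrative; the thresholds, wake-up latencies, and state names are placeholders, not values from any DDR SDRAM datasheet.

```python
# Sketch of an idle-timeout policy for DRAM low-power modes (illustrative).
# Thresholds and wake-up latencies are placeholders, not datasheet values.

POWER_DOWN_THRESHOLD = 16        # idle cycles before entering power-down
SELF_REFRESH_THRESHOLD = 10_000  # idle cycles before entering self-refresh
WAKE_LATENCY = {"active": 0, "power_down": 1, "self_refresh": 500}

class Rank:
    """One independently power-managed memory device (e.g., a rank)."""

    def __init__(self):
        self.state = "active"
        self.idle_cycles = 0

    def tick(self):
        """Called every memory cycle in which this rank is not accessed."""
        self.idle_cycles += 1
        if self.idle_cycles >= SELF_REFRESH_THRESHOLD:
            self.state = "self_refresh"
        elif self.idle_cycles >= POWER_DOWN_THRESHOLD:
            self.state = "power_down"

    def access(self):
        """Called when a request targets this rank; returns added latency."""
        penalty = WAKE_LATENCY[self.state]
        self.state = "active"
        self.idle_cycles = 0
        return penalty
```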
`
Data placement

Data distribution in physical memory determines which memory devices a particular memory access uses and consequently the devices' active and idle periods. Memory devices with no active data can operate in a low-power mode.

When a processor accesses memory, memory controllers can bring powered-down memory into an active state before proceeding with the access. To optimize performance, the operating system can activate the memory that a newly scheduled process uses during the context-switch period, thus largely hiding the latency of exiting the low-power mode.8,9

Better page allocation policies can also save energy. Allocating new pages to memory devices already in use helps reduce the number of active memory devices.9,10 Intelligent page migration, moving data from one memory device to another to reduce the number of active memory devices, can further reduce energy consumption.9,11

However, minimizing the number of active memory devices also reduces the memory bandwidth available to the application. Some accesses previously made in parallel to several memory devices must be made serially to the same memory device.

Data placement strategies originally targeted low-end systems; hence, developers must carefully evaluate them in the context of commercial server environments, where reduced memory bandwidth can substantially impact performance.
`
Memory address mapping

Server memory systems provide configurable address mappings that can be used for energy efficiency. These systems partition memory among memory controllers, with each controller responsible for multiple banks of physical memory.

Configurable parameters in the system memory organization determine the interleaving, the mapping from a memory address to its location in a physical memory device. This mapping often occurs at the granularity of the higher-level cache line size.

One possible interleaving allocates consecutive cache lines to DRAMs that different memory controllers manage, striping the physical memory across all controllers. Another approach involves mapping consecutive cache lines to the same controller, using all the physical memory under one controller before moving to the next. Under each controller, memory address interleaving can similarly occur across the multiple physical memory banks.

Spreading consecutive accesses, or even a single access, across multiple devices and controllers can improve memory bandwidth, but it may consume more power because the server must activate more memory devices and controllers.

The interactions of interleaving schemes with configurable DRAM parameters such as page-mode policies and burst length have an additional impact on memory latencies and power consumption. Contrary to current practice, in which systems come with a set interleaving scheme, a developer can tune an interleaving scheme to obtain the desired power and performance tradeoff for each workload.
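The two mappings described above differ only in which address bits select the controller. A sketch follows, assuming a 128-byte cache line, four controllers, and 4 Gbytes behind each controller (all assumed parameters): the striped mapping takes the controller number from the bits just above the line offset, while the sequential mapping takes it from the top of the address.

```python
# Sketch of two address-to-controller mappings (parameters are assumptions).
LINE_SIZE = 128                  # bytes per cache line
NUM_CONTROLLERS = 4              # memory controllers
MEMORY_PER_CONTROLLER = 4 << 30  # 4 Gbytes behind each controller

def interleaved_controller(addr):
    """Consecutive cache lines rotate across controllers (striped mapping)."""
    return (addr // LINE_SIZE) % NUM_CONTROLLERS

def sequential_controller(addr):
    """All of one controller's memory is used before moving to the next."""
    return addr // MEMORY_PER_CONTROLLER

if __name__ == "__main__":
    # Four consecutive cache lines: striping touches four controllers (more
    # bandwidth, more devices active); the sequential map touches only one.
    lines = [n * LINE_SIZE for n in range(4)]
    print([interleaved_controller(a) for a in lines])   # [0, 1, 2, 3]
    print([sequential_controller(a) for a in lines])    # [0, 0, 0, 0]
```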
`
Memory compression

An orthogonal approach to reducing the active physical memory is to use a system with less memory. IBM researchers have demonstrated that they can compress the data in a server's memory to half its size.12 A modified memory controller compresses and decompresses data when it accesses the memory. However, compression adds an extra step to the memory access, which can be time-consuming.

Adding the 32-Mbyte L3 cache used in the IBM compression implementation significantly reduces the traffic to memory, thereby reducing the performance loss that compression causes. In fact, a compressed-memory server can perform at nearly the same level as a server with no compression and double the memory.

Although the modified memory controller and larger caches need additional power, high-memory-capacity servers should expect a net savings. However, for workloads with the highest memory bandwidth requirements and working sets exceeding the compression buffer size, this solution may not be energy efficient. Quantifying compression's power and performance benefit requires further analysis.
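A back-of-the-envelope way to see why the large L3 matters: with hit ratio h in the L3, only the (1 - h) fraction of accesses that miss pay the memory and decompression latencies. The latencies below are placeholders, not figures from the IBM implementation.

```python
# Rough average-access-time model for compressed memory (placeholder numbers,
# not measurements from the IBM compressed-memory implementation).

def average_access_time(l3_hit_ratio, t_l3=10, t_mem=100, t_decompress=60):
    """All times in arbitrary cycle units."""
    return l3_hit_ratio * t_l3 + (1 - l3_hit_ratio) * (t_mem + t_decompress)

if __name__ == "__main__":
    for hit in (0.0, 0.8, 0.95):
        print(f"L3 hit ratio {hit:.0%}: "
              f"average access ~{average_access_time(hit):.0f} cycles")
```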
`
Cache coherence

Reducing overhead due to cache-coherence traffic can also improve energy efficiency. Shared-memory servers often use invalidation protocols to maintain cache coherency. One way to implement these protocols is to have one cache level (typically the L2 cache) track, or snoop, memory requests from all processors.

The cache controllers of all processors in the server share a bus to memory and the other caches. Each cache controller arrives at the appropriate coherence action to maintain consistency between the caches based on its ability to observe memory traffic to and from all caches, and to snoop remote caches. Snoop accesses from other processors greatly increase L2 cache power consumption.

Andreas Moshovos and colleagues introduced a small cache-like structure they term a Jetty at each L2 cache to filter incoming snoop requests.13 The Jetty predicts whether the local cache contains a requested line, with no false negatives. Querying the Jetty first and forwarding the snoop request to the L2 cache only when it predicts the line is in the local cache reduces L2 cache energy for all accesses by about 30 percent.

Craig Saldanha and Mikko Lipasti propose replacing parallel snoop accesses with serial accesses.14 Many snoop cache implementations broadcast the snoop request in parallel to all caches on the SMP bus. Snooping other processors' caches one at a time reduces the total number of snoop accesses. Although this method can reduce the energy and maximum power used, it increases the average cache miss access latency. Therefore, developers must balance the energy savings in the memory system against the energy used to keep other system components active longer.
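One way to obtain the "no false negatives" property described above is an inclusion structure that is updated on every local allocation and eviction; a snoop consults it first and is forwarded to the L2 only if the line might be present. The counting filter below is an illustration of that idea, not the organization evaluated in the Jetty paper, which explores several concrete variants.

```python
import hashlib

# Sketch of a snoop filter with no false negatives (illustrative only).

class SnoopFilter:
    def __init__(self, size=1024, hashes=2):
        self.size = size
        self.hashes = hashes
        self.counters = [0] * size        # counting filter, one per bucket

    def _buckets(self, line_addr):
        for i in range(self.hashes):
            digest = hashlib.blake2b(f"{line_addr}:{i}".encode(),
                                     digest_size=4)
            yield int.from_bytes(digest.digest(), "little") % self.size

    def line_allocated(self, line_addr):
        """Called when the local L2 brings a line in."""
        for b in self._buckets(line_addr):
            self.counters[b] += 1

    def line_evicted(self, line_addr):
        """Called when the local L2 evicts a line."""
        for b in self._buckets(line_addr):
            self.counters[b] -= 1

    def may_contain(self, line_addr):
        """False means the line is definitely not cached, so the snoop can be
        filtered. True may be a false positive, in which case the snoop simply
        proceeds to the L2; correctness holds because there are no false
        negatives as long as allocations and evictions are tracked."""
        return all(self.counters[b] > 0 for b in self._buckets(line_addr))
```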

Energy Conservation in Clustered Servers
Ricardo Bianchini, Rutgers University
Ram Rajamony, IBM Austin Research Lab

Power and energy conservation have recently become key concerns for high-performance servers, especially when deployed in large cluster configurations as in data centers and Web-hosting facilities.1

Research teams from Rutgers University2 and Duke University3 have proposed similar strategies for managing energy in Web-server clusters. The idea is to dynamically distribute the load offered to a server cluster so that, under light load, the system can idle some hardware resources and put them in low-power modes. Under heavy load, the system should reactivate the resources and redistribute the load to eliminate performance degradation. Because Web-server clusters replicate file data at all nodes, and traditional server hardware has very high base power (the power consumed when the system is on but idle), these systems dynamically turn entire nodes on and off, effectively reconfiguring the cluster.
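At its simplest, the reconfiguration idea just described reduces to choosing the node count from the most loaded resource plus some headroom; the sketch below is illustrative only and is not the Rutgers or Duke implementation. Loads are expressed, as in Figure A, as a percentage of one node's nominal throughput for each resource.

```python
import math

# Sketch of load-driven cluster reconfiguration (illustrative only; not the
# Rutgers or Duke system).

def nodes_needed(resource_loads, headroom=1.2, min_nodes=1, max_nodes=7):
    """Size the cluster from its bottleneck resource, with some headroom."""
    bottleneck = max(resource_loads.values())     # e.g., the network load
    wanted = math.ceil(bottleneck / 100.0 * headroom)
    return max(min_nodes, min(max_nodes, wanted))

if __name__ == "__main__":
    # Example: the network interface is the bottleneck at 320 percent of one
    # node's nominal throughput.
    loads = {"cpu": 140.0, "disk": 90.0, "network": 320.0}
    print(f"keep {nodes_needed(loads)} nodes on; power down the rest")
```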
`
`
Figure A illustrates the behavior of the Rutgers system for a seven-node cluster running a real, but accelerated, Web trace. The figure shows the evolution of the cluster configuration and offered loads on each resource as a function of time. It plots the load on each resource as a percentage of the nominal throughput of the same resource in one node. The figure shows that the network interface is the bottleneck resource throughout the entire experiment.

Figure B shows the cluster power consumption for two versions of the same experiment, again as a function of time. The lower curve (dynamic configuration) represents the reconfigurable version of the system, whereas the higher curve (static configuration) represents a configuration fixed at seven nodes.
`
`
Figure A. Cluster evolution and per-resource offered loads. The network interface is the bottleneck resource throughout the entire experiment.

Figure B. Power under static and dynamic configurations. Reconfiguration reduces power consumption significantly for most of the experiment, saving 38 percent in energy.
`
`
The figure shows that reconfiguration reduces power consumption significantly for most of the experiment, saving 38 percent in energy.

Karthick Rajamani and Charles Lefurgy4 studied how to improve the cluster reconfiguration technique's energy-saving potential by using spare servers and history information about peak server loads. They also modeled the key system and workload parameters that influence this technique.

Mootaz Elnozahy and colleagues5 evaluated different combinations of cluster reconfiguration and two types of dynamic voltage scaling, independent and coordinated, for clusters in which the base power is relatively low. In independent voltage scaling, each server node makes its own independent decision about what voltage and frequency to use, depending on the load it is receiving. In coordinated voltage scaling, nodes coordinate their voltage and frequency settings. Their simulation results showed that, while either coordinated voltage scaling or reconfiguration is the best single technique depending on the workload, combining them is always the best approach.
In other research, Mootaz Elnozahy and colleagues6 proposed using request batching to conserve processor energy under a light load. In this technique, the network interface processor accumulates incoming requests in memory while the server's host processor is in a low-power state. The system awakens the host processor when an accumulated request has been pending for longer than a threshold.
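A sketch of that batching policy: requests accumulate while the host processor sleeps, and the host is woken as soon as the oldest pending request has waited longer than the batching threshold. The timing value and interfaces below are placeholders for illustration, not details of the cited system.

```python
# Sketch of request batching (illustrative; the threshold is a placeholder).

BATCH_TIMEOUT_MS = 10.0   # maximum time a request may wait before wake-up

class RequestBatcher:
    def __init__(self):
        self.pending = []         # (arrival_time_ms, request) pairs
        self.host_awake = False

    def on_request(self, now_ms, request):
        """Network interface accumulates requests while the host sleeps."""
        self.pending.append((now_ms, request))

    def poll(self, now_ms):
        """Wake the host once the oldest pending request exceeds the timeout,
        then hand it the whole accumulated batch."""
        if self.pending and not self.host_awake:
            oldest_arrival, _ = self.pending[0]
            if now_ms - oldest_arrival >= BATCH_TIMEOUT_MS:
                self.host_awake = True      # leave the low-power state
        if self.host_awake and self.pending:
            batch, self.pending = self.pending, []
            return [req for _, req in batch]
        return []
```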
Taliver Heath and colleagues7 considered reconfiguration in the context of heterogeneous server clusters. Their experimental results for a cluster of traditional and blade nodes show that their heterogeneity-conscious server can conserve more than twice as much energy as a heterogeneity-oblivious reconfigurable server.
`
References
1. R. Bianchini and R. Rajamony, Power and Energy Management for Server Systems, tech. report DCS-TR-528, Dept. Computer Science, Rutgers Univ., 2003.
2. E. Pinheiro et al., "Dynamic Cluster Reconfiguration for Power and Performance," L. Benini, M. Kandemir, and J. Ramanujam, eds., Compilers and Operating Systems for Low Power, Kluwer Academic, 2003. Earlier version published in Proc. Workshop Compilers and Operating Systems for Low Power, 2001.
3. J.S. Chase et al., "Managing Energy and Server Resources in Hosting Centers," Proc. 18th ACM Symp. Operating Systems Principles, ACM Press, 2001, pp. 103-116.
4. K. Rajamani and C. Lefurgy, "On Evaluating Request-Distribution Schemes for Saving Energy in Server Clusters," Proc. 2003 IEEE Int'l Symp. Performance Analysis of Systems and Software, IEEE Press, 2003, pp. 111-122.
5. E.N. Elnozahy, M. Kistler, and R. Rajamony, "Energy-Efficient Server Clusters," Proc. 2nd Workshop Power-Aware Computing Systems, LNCS 2325, Springer, 2003, pp. 179-196.
6. E.N. Elnozahy, M. Kistler, and R. Rajamony, "Energy Conservation Policies for Web Servers," Proc. 4th Usenix Symp. Internet Technologies and Systems, Usenix Assoc., 2003, pp. 99-112.
7. T. Heath et al., "Self-Configuring Heterogeneous Server Clusters," Proc. Workshop Compilers and Operating Systems for Low Power, 2003, pp. 75-84.
`
Ricardo Bianchini is an assistant professor in the Department of Computer Science, Rutgers University. Contact him at ricardob@cs.rutgers.edu.

Ram Rajamony is a research staff member at IBM Austin Research Lab. Contact him at rajamony@us.ibm.com.
`
`
istrators. Software policies can also use higher-level info
