(12) Patent Application Publication
Grochowski et al.
(10) Pub. No.: US 2006/0095807 A1
(43) Pub. Date: May 4, 2006
`
`
`(54) METHOD AND APPARATUS FOR VARYING
`ENERGY PER INSTRUCTION ACCORDING
`TO THE AMOUNT OF AVAILABLE
`PARALLELISM
`
`(75) Inventors: Edward Grochowski, San Jose, CA
`(US); John Shen, San Jose, CA (US);
`Hong Wang, Fremont, CA (US); Doron
`Orenstein, Haifa (IL); Gad S. Sheaffer,
`Haifa (IL); Ronny Ronen, Haifa (IL);
`Murali M. Annavaram, Santa Clara,
`CA (US)
`Correspondence Address:
BLAKELY SOKOLOFF TAYLOR & ZAFMAN
12400 WILSHIRE BOULEVARD
`SEVENTH FLOOR
`LOS ANGELES, CA 90025-1030 (US)
`(73) Assignee: Intel Corporation
`
(21) Appl. No.: 10/952,627

(22) Filed: Sep. 28, 2004
`
`Publication Classification
`
(51) Int. Cl.
G06F 1/30 (2006.01)
(52) U.S. Cl. .......... 713/324; 713/322
`
(57) ABSTRACT

A method and apparatus for changing the configuration of a
multi-core processor is disclosed. In one embodiment, a
throttle module (or throttle logic) may determine the amount
of parallelism present in the currently-executing program,
and change the execution of the threads of that program on
the various cores. If the amount of parallelism is high, then
the processor may be configured to run a larger number of
threads on cores configured to consume less power. If the
amount of parallelism is low, then the processor may be
configured to run a smaller number of threads on cores
configured for greater scalar performance.
`
`
`
[Representative drawing (from FIG. 1): throttle module 110, voltage control 122, frequency control 124.]
`
`Petitioner Mercedes Ex-1022, 0001
`
`
`
[FIG. 1 (Sheet 1 of 11): throttle module 110 connected by signal lines 112, 114, 116, and 118 to core 1 (120, with voltage control 122 and frequency control 124), core 2 (130), core 3 (140), and core 4 (150).]
`
`
`
[FIG. 2 (Sheet 2 of 11): throttle module with thread migration logic, connected by signal lines 224 through 266 to A core 1 (220), A core 2 (222), and B cores including B core 1 (230), B core 31 (232), B core 2 (240), B core 32 (242), B core 30 (260), and B core 60 (262).]
`
`
`
[FIG. 3 (Sheet 3 of 11): throttle module 310 connected by signal lines 312, 314, 316, and 318 to four cores; the first core shows scheduler A (334), scheduler B, prefetch stage (332), other pipeline stages (330), L0 cache A, and L0 cache B; also shown are core 2 (370), core 3 (380), and core 4 (390).]
`
`
`
[FIG. 4 (Sheet 4 of 11): throttle module 410 connected by signal lines 412, 414, 416, and 418 to four cores; the first core shows prefetch stage (430), other pipeline stages (432), a branch predictor, and a further predictor; also shown are core 2, core 3 (480), and core 4 (490).]
`
`
`
[FIG. 5 (Sheet 5 of 11): drawing text not recoverable from the scan.]
`
`
`
[FIG. 6 (Sheet 6 of 11): flowchart. Allocate threads to cores; monitor power in cores and compute error value; integrate sample error value; control changes?; adjust voltage and frequency according to control value.]
`
`
`
[FIG. 7 (Sheet 7 of 11): flowchart. Allocate threads to cores; monitor power in cores and compute error value; integrate sample error value; control changes?; reallocate cores according to control value.]
`
`
`
[FIG. 8 (Sheet 8 of 11): flowchart. Allocate threads to cores; monitor power in cores and compute error value; integrate sample error value; control changes?; adjust amount of optional circuitry on/off in accordance with control value.]
`
`
`
[FIG. 9 (Sheet 9 of 11): flowchart. Allocate threads to cores; monitor power in cores and compute error value; integrate sample error value; control changes?; adjust amount of pred. circuitry on/off in accordance with control value.]
`
`
`
[Sheet 10 of 11: drawing text not recoverable from the scan.]
`
`
`
[Sheet 11 of 11: system block diagram, scanned rotated; partly legible labels include memory, processor, processor core, high-performance graphics, bus bridge, I/O devices, keyboard/mouse, communication devices, data storage, and audio I/O.]
`
`
`
`
`
`
`
`METHOD AND APPARATUS FOR VARYING
`ENERGY PER INSTRUCTION ACCORDING TO
`THE AMOUNT OF AVAILABLE PARALLELISM
`
`FIELD
[0001] The present disclosure relates generally to
microprocessors that may execute programs with varying amounts
of scalar and parallel resource requirements, and more
specifically to microprocessors employing multiple cores.
`
`BACKGROUND
[0002] Computer workloads, in some embodiments, run in
a continuum from those having little inherent parallelism
(being predominantly scalar) to those having significant
amounts of parallelism (being predominantly parallel), and
this nature may vary from segment to segment in the
software. Typical scalar workloads include software
development tools, office productivity suites, and operating
system kernel routines. Typical parallel workloads include 3D
graphics, media processing, and scientific applications.
Scalar workloads may retire 0.2 to 2.0 instructions per clock
(IPC), whereas parallel workloads may achieve throughput in
the range of 4 to several thousand IPC. The latter high IPCs
may be obtainable through the use of instruction-level
parallelism and thread-level parallelism.
[0003] Prior art microprocessors have often been designed
with either scalar or parallel performance as the primary
objective. To achieve high scalar performance, it is often
desirable to reduce execution latency as much as possible.
Micro-architectural techniques to reduce effective latency
include speculative execution, branch prediction, and
caching. The pursuit of high scalar performance has resulted in
large out-of-order, highly speculative, deep pipeline
microprocessors. To achieve high parallel performance, it may be
desirable to provide as much execution throughput
(bandwidth) as possible. Micro-architectural techniques to
increase throughput include wide superscalar processing,
single-instruction-multiple-data instructions, chip-level
multiprocessing, and multithreading.
[0004] Problems may arise when trying to build a
microprocessor that performs well on both scalar and parallel
tasks. One problem may arise from a perception that design
techniques needed to achieve short latency are in some cases
very different from the design techniques needed to achieve
high throughput.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
[0005] The present invention is illustrated by way of
example, and not by way of limitation, in the figures of the
accompanying drawings and in which like reference numerals
refer to similar elements and in which:
[0006] FIG. 1 is a schematic diagram of a processor
including cores configurable by voltage and frequency,
according to one embodiment.
[0007] FIG. 2 is a schematic diagram of a processor
including cores selectable by processing power and power
consumption, according to one embodiment.
[0008] FIG. 3 is a schematic diagram of a processor
including cores configurable by optional performance
circuits, according to one embodiment.
[0009] FIG. 4 is a schematic diagram of a processor
including cores configurable by optional speculative
circuits, according to one embodiment of the present
disclosure.
[0010] FIG. 5 is a schematic diagram of a processor
including cores and details of a throttle, according to one
embodiment of the present disclosure.
[0011] FIG. 6 is a flowchart showing transitioning to
differing core configurations, according to one embodiment
of the present disclosure.
[0012] FIG. 7 is a flowchart showing transitioning to
differing core configurations, according to another
embodiment of the present disclosure.
[0013] FIG. 8 is a flowchart showing transitioning to
differing core configurations, according to another
embodiment of the present disclosure.
[0014] FIG. 9 is a flowchart showing transitioning to
differing core configurations, according to another
embodiment of the present disclosure.
[0015] FIG. 10A is a schematic diagram of a system
including processors with throttles and multiple cores,
according to an embodiment of the present disclosure.
[0016] FIG. 10B is a schematic diagram of a system
including processors with throttles and multiple cores,
according to another embodiment of the present disclosure.
`
`DETAILED DESCRIPTION
[0017] The following description describes techniques for
varying the amount of energy expended to process each
instruction according to the amount of parallelism available
in a software program. In the following description,
numerous specific details such as logic implementations, software
module allocation, bus and other interface signaling
techniques, and details of operation are set forth in order to
provide a more thorough understanding of the present
invention. It will be appreciated, however, by one skilled in the art
that the invention may be practiced without such specific
details. In other instances, control structures, gate level
circuits and full software instruction sequences have not
been shown in detail in order not to obscure the invention.
Those of ordinary skill in the art, with the included
descriptions, will be able to implement appropriate functionality
without undue experimentation. In certain embodiments, the
invention is disclosed in the form of multi-core
implementations of Pentium® compatible processors such as those
produced by Intel® Corporation. However, the invention
may be practiced in other kinds of processors, such as an
Itanium Processor Family compatible processor, an
XScale® family compatible processor, or any of a wide
variety of different general-purpose processors from any of
the processor architectures of other vendors or designers.
Additionally, some embodiments may include or may be
special purpose processors, such as graphics, network,
image, communications, or any other known or otherwise
available type of processor.
[0018] Power efficiency may be measured in terms of
instructions-per-second (IPS) per watt. The IPS/watt metric
is equivalent to energy per instruction, or, more precisely,
IPS/watt is proportional to the reciprocal of energy per
instruction as follows,

$$\frac{\text{IPS}}{\text{Watt}} = \frac{\text{Instructions}}{\text{Joule}} \qquad \text{(EQUATION 1)}$$

An important property of the energy per instruction metric
is that it is independent of the amount of time required to
process an instruction. This makes energy per instruction a
useful metric for throughput performance.
[0019] An approximate analysis of a microprocessor's
power consumption may be performed by modeling the
microprocessor as a capacitor that is charged or discharged
with every instruction processed (for simplicity, the leakage
current and short-circuit switching current may be ignored).
With this assumption, energy per instruction may depend on
only two things: the amount of capacitance toggled to
process each instruction (from fetch to retirement), and
power supply voltage. The well-known formula:

$$E = \frac{C V^{2}}{2} \qquad \text{(EQUATION 2)}$$

which is normally applied to capacitors, may be applied to
microprocessors as well. E is the energy required to process
an instruction; C is the amount of capacitance toggled in
processing the instruction; and V is the power supply
voltage.
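
As a worked illustration of Equation 2, the following minimal Python sketch computes the energy per instruction at two supply voltages. The function name and the capacitance and voltage values are hypothetical, chosen only for illustration; they do not come from the disclosure.

```python
def epi_from_capacitance(capacitance_farads: float, supply_volts: float) -> float:
    """Energy in joules to process one instruction, per E = C * V**2 / 2."""
    return capacitance_farads * supply_volts ** 2 / 2.0


# Hypothetical values, for illustration only.
C_TOGGLED = 10e-9  # farads toggled per instruction (assumed)
e_high = epi_from_capacitance(C_TOGGLED, 1.2)  # higher supply voltage
e_low = epi_from_capacitance(C_TOGGLED, 0.6)   # half the supply voltage

# Energy scales with the square of the voltage, so this ratio is 4
# (up to floating-point rounding).
print(e_high / e_low)
```

Because the energy depends on the square of the supply voltage, halving the voltage in this model cuts the energy per instruction to one quarter for the same toggled capacitance.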
[0020] A microprocessor may operate within a fixed
power budget such as, for example, 100 watts. Averaged
over some time period, the microprocessor's power
consumption should not exceed the power budget regardless of
what the microprocessor or software does. To achieve this
objective, a microprocessor may incorporate some form of
dynamic thermal management. Similarly, a chip-level
multiprocessor may regulate (or throttle) its activities to stay
within a fixed power budget regardless of whether it is
retiring, for example, 0.2 instructions per clock (IPC) or 20
IPC. To deliver good performance, the chip-level
multiprocessor should be able to vary its MIPS/watt, or equivalently
its energy/instruction, over a 100:1 range in this example.
[0021] One approach to designing a microprocessor that
may achieve both high scalar performance and high
throughput performance is to dynamically vary the amount of
energy expended to process each instruction according to the
amount of parallelism available or estimated to be available
in the software. In other words, if there is a small amount of
parallelism, a microprocessor may expend all available
energy processing a few instructions; and, if there is a
greater amount of parallelism, the microprocessor may
expend very little energy in processing each instruction. This
may be expressed as:

$$P = \text{EPI} \times \text{IPS} \qquad \text{(EQUATION 3)}$$

where P is the fixed power budget, EPI is the average energy
per retired instruction, and IPS is the aggregate number of
instructions retired per second across all processor cores.
This embodiment attempts to maintain the total
multiprocessor chip power at a nearly constant level.
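
Equation 3 can be rearranged to give the energy per instruction that a fixed power budget allows at a given retirement rate. The sketch below applies it to the 100 watt, 0.2 IPC versus 20 IPC example. The function name and the clock frequency are assumptions made for illustration, and the clock is held constant in both cases purely to keep the arithmetic simple.

```python
def epi_from_budget(power_budget_watts: float, ipc: float, clock_hz: float) -> float:
    """Average joules per retired instruction allowed by P = EPI * IPS."""
    ips = ipc * clock_hz  # aggregate instructions retired per second
    return power_budget_watts / ips


POWER_BUDGET_W = 100.0  # example budget from the text
CLOCK_HZ = 3.0e9        # assumed clock, identical in both cases

epi_scalar = epi_from_budget(POWER_BUDGET_W, 0.2, CLOCK_HZ)     # low parallelism
epi_parallel = epi_from_budget(POWER_BUDGET_W, 20.0, CLOCK_HZ)  # high parallelism

# The ratio of the two retirement rates (20 / 0.2) sets the EPI range.
ratio = epi_scalar / epi_parallel
print(ratio)
```

Under these assumptions the ratio between the two energy-per-instruction values equals the ratio of the retirement rates, which is the 100:1 range mentioned in the example.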
[0022] Complementary Metal-Oxide-Semiconductor
(CMOS) voltage and frequency scaling may be used to
achieve different energy per instruction ratios. In one
embodiment, logic varies the microprocessor's power
supply voltage and clock frequency in unison according to the
performance and power levels desired. To maintain a
chip-level multiprocessor's total power consumption within a
fixed power budget, voltage and frequency scaling may be
applied dynamically as follows. In phases of low thread
parallelism, a few cores may be run using high supply
voltage and high frequency for best scalar performance. In
phases of high thread parallelism, many cores may be run
using low supply voltage and low frequency for best
throughput performance. Since low power consumption for
inactive cores may be desirable, leakage control techniques
such as dynamic sleep transistors and body bias may be
used.
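
The trade-off between a few fast cores and many slow cores can be illustrated with a simple model. The sketch below assumes, and this is not stated in the disclosure, that dynamic power per core scales as C·V²·f, that clock frequency scales linearly with supply voltage, and that the workload is perfectly parallel. All quantities are normalized to one core running at full voltage, and the function name is hypothetical.

```python
def scaled_operating_point(active_cores: int) -> dict:
    """Normalized operating point that keeps total power at the 1-core level.

    Assumed model: per-core power ~ V**2 * f with f ~ V, so per-core
    power ~ V**3 and n cores fit the budget when V = n ** (-1/3).
    """
    if active_cores < 1:
        raise ValueError("at least one core must be active")
    voltage = active_cores ** (-1.0 / 3.0)
    frequency = voltage  # assumed linear V-f relation
    return {
        "voltage": voltage,
        "frequency": frequency,
        "throughput": active_cores * frequency,  # perfectly parallel work
        "energy_per_instruction": voltage ** 2,  # from E = C * V**2 / 2
    }


for n in (1, 8, 27):
    print(n, scaled_operating_point(n))
```

In this model the product of throughput and energy per instruction stays constant, consistent with a fixed power budget. Real designs have a minimum operating voltage, so the scaling cannot continue indefinitely.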
[0023] Referring now to FIG. 1, a schematic diagram of a
processor including cores configurable by voltage and
frequency is shown, according to one embodiment. Core 1 120,
core 2 130, core 3 140, and core 4 150 are shown, but in other
embodiments there may be more or fewer than four cores in
a processor. One or more of the cores may have a voltage
control circuit and a clock frequency control circuit. FIG. 1
expressly shows core 1 120 possessing voltage control circuit
122 and frequency control circuit 124, but the other cores
may have equivalent circuits as well, or the voltage control
and frequency control logic may be separate logic not
directly associated with a particular core.
[0024] A throttle module 110 may be used to gather
information and make a determination about, or an estimate
of, the amount of parallelism present in the executing
software program. In one embodiment, the amount of
parallelism may be the number of simultaneous threads
supported. In other embodiments, other metrics may be used to
express the amount of parallelism, such as the aggregate
number of instructions retired per second, or the number of
branch instructions that may support speculative
multithreaded execution. Throttle module 110 may utilize
information provided by the operating system to aid in the
determination of the amount of parallelism. In other
embodiments, throttle module 110 may make this determination
using hardware logic within the processor and its cores. The
determination may be made on a continuous basis or
periodically.
[0025] Each time the throttle module 110 makes the
determination of the amount of parallelism in the program, it may
direct cores 120, 130, 140, 150 via signal lines 112, 114, 116,
and 118 to change their voltage and clock frequency. In one
embodiment, signal lines 112, 114, 116, and 118 may also be
used to turn the cores on or off, or to remove power from a
power well containing a core. In other embodiments, the
cores may be turned off by clock gating or instruction
starvation techniques. In one embodiment, if the current
amount of thread level parallelism exceeds a previous
amount by more than a threshold value, then the throttle
module may initiate a transition to running a greater number
of threads by decreasing the voltage and clock frequency in
each core but running the threads on a greater number of
cores. Cores that had previously been turned off may be
turned on to support the larger number of threads. Similarly,
if the current amount of thread level parallelism is less than
a previous amount by more than a threshold value, then the
throttle module may initiate a transition to running fewer
threads by increasing the voltage and clock
frequency in some cores but running the threads on fewer
of these cores. Some cores that had previously been
turned on may be turned off as they may no longer be needed
to support the smaller number of threads.
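
The threshold comparison described above can be sketched as a small decision function. This is an illustrative model only: the names `ThrottleState` and `throttle_step`, the action strings, and the choice to update the stored "previous amount" only when a transition occurs are all assumptions, not details taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class ThrottleState:
    previous_parallelism: int  # amount of parallelism at the last transition
    active_cores: int


def throttle_step(state: ThrottleState, current_parallelism: int,
                  threshold: int, max_cores: int) -> str:
    """Compare current and previous parallelism and pick a transition."""
    delta = current_parallelism - state.previous_parallelism
    if delta > threshold:
        action = "more_cores_lower_voltage_frequency"
    elif -delta > threshold:
        action = "fewer_cores_higher_voltage_frequency"
    else:
        return "hold"  # change is within the threshold; nothing is modified
    # Run one thread per core, bounded by the cores available (assumed policy).
    state.active_cores = max(1, min(max_cores, current_parallelism))
    state.previous_parallelism = current_parallelism
    return action


state = ThrottleState(previous_parallelism=1, active_cores=1)
print(throttle_step(state, 4, threshold=1, max_cores=4), state.active_cores)
print(throttle_step(state, 3, threshold=1, max_cores=4), state.active_cores)
```

Requiring the change to exceed a threshold before acting keeps the processor from reconfiguring on every small fluctuation in the measured parallelism.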
[0026] In one embodiment it may be possible to design a
single-instruction-set-architecture (ISA) heterogeneous
multi-core microprocessor in which different
micro-architectures may be used to span a range of performance and
power. In one embodiment, a chip-level multiprocessor may
be built from two types of processor cores, which may be
referred to as a large core and a small core. The two types
of cores may implement the same instruction set
architecture, use cache coherency to implement shared memory, and
differ only in their micro-architecture. In other
embodiments, the two types of core may implement similar
instruction set architectures, or the small cores may implement a
subset of the instruction set of the large cores. The large core
may be an out-of-order, superscalar, deep pipeline machine,
whereas the small core may be an in-order, scalar, short
pipeline machine. The Intel Pentium 4 processor and Intel
i486 processor are representative of the two classes of cores.
In other embodiments, more than two classes or
performance levels of cores running a substantially similar or
identical instruction set architecture may be used.
[0027] In one embodiment, a chip-level multiprocessor
includes one large core and 25 small cores, with the two
types of cores having a 25:1 ratio in power consumption, a
5:1 ratio in scalar performance, and a 5:1 range of energy per
instruction. The chip-level multiprocessor of this
embodiment may operate as follows. In phases of low thread-level
parallelism, the large core may be run for best scalar
performance. In phases of high thread-level parallelism,
multiple small cores may be run for best throughput
performance.
[0028] At any instant in time, the microprocessor may run
either one large core or 25 small cores. Because the number
of available software threads will vary over time, the
asymmetric multiprocessor should be capable of migrating a
thread between large and small cores. A thread-migration
logic may be implemented to support this function.
[0029] In practice, it may be desirable to allow a few small
cores to run simultaneously with the large core in order to
reduce the throughput performance discontinuity at the point
of switching off the large core. In the previous example, a
discontinuity of 3 units of throughput may result from
switching off the large core and switching on two small
cores. To reduce the percentage of the total throughput lost,
the discontinuity may be moved to occur with a higher
number of running threads by permitting, for example, up to
5 small cores to run simultaneously with the large core if the
power supply will support this for a short period of time.
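
The arithmetic behind the throughput discontinuity can be checked with a short sketch. It assumes, based on the 5:1 scalar performance ratio given earlier, that the large core delivers 5 units of throughput and each small core delivers 1 unit. The function names are hypothetical, and the sketch ignores the upper limit on the number of small cores.

```python
LARGE_CORE_UNITS = 5.0  # from the 5:1 scalar performance ratio (assumed unit)
SMALL_CORE_UNITS = 1.0


def throughput(threads: int, max_small_with_large: int) -> float:
    """Aggregate throughput when the large core may run alongside at most
    max_small_with_large small cores; beyond that, only small cores run."""
    if threads <= 0:
        return 0.0
    if threads <= 1 + max_small_with_large:
        return LARGE_CORE_UNITS + (threads - 1) * SMALL_CORE_UNITS
    return threads * SMALL_CORE_UNITS


def switch_off_loss(max_small_with_large: int) -> tuple:
    """Absolute and fractional throughput lost when one more thread arrives
    and the large core is switched off."""
    last_with_large = 1 + max_small_with_large
    before = throughput(last_with_large, max_small_with_large)
    after = throughput(last_with_large + 1, max_small_with_large)
    return before - after, (before - after) / before


print(switch_off_loss(0))  # large core alone, then two small cores
print(switch_off_loss(5))  # large core plus five small cores, then seven small cores
```

In both cases the absolute drop is 3 units, but allowing five small cores alongside the large core moves the drop to a point where total throughput is higher, so the fraction lost falls from 60% to 30% in this model.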
[0030] Using two types of cores representative of today's
microprocessors, a 4:1 range of energy per instruction is
achievable. As future microprocessors continue to deliver
even higher levels of scalar performance, the range of
possible energy per instruction may be expected to increase
to perhaps 6:1, or well beyond this ratio.
[0031] Referring now to FIG. 2, a schematic diagram of a
processor including cores selectable by processing power
and power consumption is shown, according to one
embodiment. The processor may include a few larger cores, the A
cores, and may also include a larger number of small cores,
the B cores. A core 1 220, A core 2 222, and B cores 1
through 60 230-262 are shown, but in other embodiments
there may be more or fewer than two A cores and sixty B
cores in a processor.
[0032] A throttle module 210 may again be used to gather
information and make a determination about the amount of
parallelism present in the executing software program. In
one embodiment, the amount of parallelism may be the
number of simultaneous threads supported. In other
embodiments, other metrics may be used to express the amount of
parallelism as discussed previously. Throttle module 210
may utilize information provided by the operating system to
aid in the determination of the amount of parallelism. In
other embodiments, throttle module 210 may make this
determination using hardware logic within the processor and
its cores. The determination may be made on a continuous
basis or periodically.
[0033] Because the number of available software threads
may vary over time, the processor of FIG. 2 may include a
thread-migration logic 212 capable of migrating a thread
between large A cores and small B cores. It may be desirable
to allow a few small B cores to run simultaneously with the
large A core in order to reduce the throughput performance
discontinuity at the point of switching off the large A core.
To reduce the percentage of the total throughput lost, the
discontinuity may be moved to occur with a higher number
of running threads by permitting, for example, up to 5 small
cores to run simultaneously with the large core.
[0034] Each time the throttle module 210 makes the
determination of the amount of parallelism in the program,
it may initiate powering the A cores and B cores up or down
using signal lines 224 through 266. In one embodiment, if
the current amount of parallelism exceeds a previous amount
by more than a threshold value, then the throttle module 210
may initiate, using thread-migration logic 212, a transition to
running a greater number of threads which may be run on a
greater number of B cores. B cores that had previously been
turned off may be turned on to support the larger number of
threads, and any A cores which are turned on may be turned
off. Similarly, if the current amount of parallelism is less
than a previous amount by more than a threshold value, then
the throttle module may initiate a transition to running
fewer threads by running the threads on a smaller
number of the A cores. B cores that had previously been
turned on may be turned off as they may no longer be needed
to support the smaller number of threads, and A cores may
be turned on to support the smaller number of threads. As
mentioned above, it may be desirable to allow a few B cores
to run simultaneously with the A cores in order to reduce the
throughput performance discontinuity at the point of
switching off the large core.
[0035] In one embodiment, the throttle module may be
implemented in a manner that does not require a feedback
loop. Here the throttle's control action (e.g. determining
which type and how many cores on which to run the threads)
does not return to affect the input value (e.g. the allocation
and configuration of cores for the threads). In this
embodiment, it may be presumed that each A core 220, 222 may
consume the same amount of power as 25 of the B cores 230
through 262. In other embodiments, differing ratios of power
consumption may be used. The processor may divide its
total power budget into two portions. For each portion, the
power budget may permit either one A core and up to five B
cores to operate at the same time, or no A cores and up to
thirty B cores to operate at the same time. In other
embodiments, the power budget may be divided into portions in
other ways.
`
`
[0036] In one embodiment, the number of running threads
(RT) may be allocated to a quantity of A cores (QAC) and
a quantity of B cores (QBC) according to Table I.
`
TABLE I

 RT    QAC    QBC
  0     0      0
  1     1      0
  2     2      0
  3     2      1
  4     2      2
 ...   ...    ...
 10     2      8
 11     2      9
 12     2     10
 13     1     12
 14     1     13
 15     1     14
 ...   ...    ...
 28     1     27
 29     1     28
 30     1     29
 31     1     30
 32     1     31
 33     1     32
 34     1     33
 35     1     34
 36     1     35
 37     0     37
 38     0     38
 39     0     39
 40     0     40
 ...   ...    ...
 60     0     60
`
[0037] When the number of running threads increases, and
the new thread is started (in one embodiment via an
inter-processor interrupt), the throttle module may determine the
number of currently-running threads. Depending upon the
number of currently-running threads, the new thread may be
assigned to either an A core or a B core in accordance with
Table I above. In this embodiment, for certain cases, such as
when increasing from 12 threads to 13 threads, or from 36
threads to 37 threads, an existing thread running on an A core
would be migrated to running on a B core. When this
migration is complete, both the existing migrated thread and
the new thread may be started. For this reason, in this
embodiment the new thread may exhibit a delay in starting.
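
The allocation in Table I can be sketched as a function of the number of running threads. The rows that Table I elides (for example 5 through 9 threads) are filled in here by assuming the pattern of the neighboring rows continues; that interpolation, like the function names, is an assumption rather than something the disclosure states.

```python
from __future__ import annotations


def allocate_cores(running_threads: int) -> tuple[int, int]:
    """Return (QAC, QBC) for a given RT, following the pattern of Table I."""
    if not 0 <= running_threads <= 60:
        raise ValueError("Table I covers 0 to 60 running threads")
    if running_threads <= 12:
        qac = min(running_threads, 2)  # both A cores may run
    elif running_threads <= 36:
        qac = 1                        # one A core gives way to B cores
    else:
        qac = 0                        # B cores only
    return qac, running_threads - qac


def needs_a_to_b_migration(old_rt: int, new_rt: int) -> bool:
    """True when a thread count change removes an A core from service, so an
    existing thread must move from an A core to a B core."""
    return allocate_cores(new_rt)[0] < allocate_cores(old_rt)[0]


for rt in (12, 13, 36, 37):
    print(rt, allocate_cores(rt))
```

With this rule the A-core count drops at exactly the two transitions the text singles out, 12 to 13 threads and 36 to 37 threads, which is where an existing thread would be migrated before the new thread starts.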
[0038] A similar process may occur when the number of
running threads decreases. When a particular thread
terminates, and its core is halted, various methods may be used to
potentially migrate one of the remaining threads from
running on a B core to running on an A core. This could occur,
for example, when reducing the number of running threads
from 13 threads to 12 threads, or from 37 threads to 36
threads. In one embodiment, a periodic timer may be used to
permit the migration only once in a particular time interval.
This may advantageously prevent overly frequent thread
migrations in cases where threads are rapidly created and
terminated. The affected thread could remain running on the B
core for up to the particular time interval.
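
One way to picture the timer-based limit on migrations is a simple rate limiter. The sketch below is a simplification: it enforces a minimum spacing between migrations rather than modeling a hardware periodic timer, and the class name, method name, and time values are hypothetical.

```python
from typing import Optional


class MigrationRateLimiter:
    """Permit at most one B-to-A migration per time interval (assumed policy)."""

    def __init__(self, interval_s: float) -> None:
        self.interval_s = interval_s
        self._last_migration_s: Optional[float] = None

    def try_migrate(self, now_s: float) -> bool:
        """Return True and record the time if a migration is allowed now."""
        if (self._last_migration_s is not None
                and now_s - self._last_migration_s < self.interval_s):
            return False  # too soon; the thread stays on its B core for now
        self._last_migration_s = now_s
        return True


limiter = MigrationRateLimiter(interval_s=1.0)
for t in (0.0, 0.5, 1.0, 1.2):
    print(t, limiter.try_migrate(t))
```

A request that arrives too early is simply declined, so the affected thread keeps running on its B core until a later request falls outside the interval.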
[0039] In one embodiment, the throttle module may
perform the migrations from A cores to B cores transparently to
the software. The thread migration mechanism of the throttle
module may include a table for mapping logical cores to
physical cores, an interrupt to signal that a core migration may
be needed, microcode or hardwired logic to copy the core's
processor state, and an interconnect network between the
processor's cores. The number of logical cores may be equal
to the number of B cores.
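
The logical-to-physical mapping table can be modeled in a few lines. In this sketch software refers only to a logical core id, while the table decides which physical core currently backs it. The class name, the core labels such as "A1" and "B7", and the use of a dictionary to stand in for copied processor state are all illustrative assumptions; the interrupt and interconnect are not modeled.

```python
class CoreMap:
    """Table mapping logical cores to physical cores (illustrative model)."""

    def __init__(self, num_logical: int) -> None:
        self.table = [None] * num_logical  # index: logical id, value: physical core
        self.state = {}                    # physical core -> processor state

    def assign(self, logical_id: int, physical: str, state: dict) -> None:
        if physical in self.table:
            raise ValueError(f"{physical} is already in use")
        self.table[logical_id] = physical
        self.state[physical] = state

    def migrate(self, logical_id: int, new_physical: str) -> None:
        old = self.table[logical_id]
        if old is None:
            raise ValueError("logical core has no thread to migrate")
        if new_physical in self.table:
            raise ValueError(f"{new_physical} is already in use")
        # Stands in for the microcode or hardwired copy of processor state.
        self.state[new_physical] = self.state.pop(old)
        self.table[logical_id] = new_physical


cores = CoreMap(num_logical=60)
cores.assign(0, "A1", {"pc": 0x1000})
cores.migrate(0, "B7")
print(cores.table[0], cores.state["B7"])
```

Because the logical id does not change during a migration, software that addresses logical core 0 is unaffected by the move from one physical core to another.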
[0040] In another embodiment, the throttle module may
perform the migrations from A cores to B cores in a manner
not transparent to the software. The thread migration may be
performed by the operating system scheduler. The operating
system may track the number of cores with currently
running threads, assign new threads to cores, and migrate
threads from A cores to B cores (or B cores to A cores). The
software thread migration may use equivalent functions to
those described above in regard to the hardware
implementation. In one embodiment, the throttle module operation
may be transparent to application programs although not to
the operating system.
[0041] One alternate way to modulate power consumption
may be to adjust the size or functionality of logic blocks. For
example, variable-sized schedulers, caches, translation
look-aside buffers (TLBs), branch predictors, and other optional
performance circuits may be used to reduce switching
capacitance (and hence energy) when large array sizes are
not needed. In addition to dynamically resizing arrays, it is
also possible to design a large core that degrades its
performance into that of a smaller core by dynamically disabling
execution units, pipeline stages, and other optional
performance circuits. These techniques may be collectively known
as adaptive processing.
[0042] One embodiment of a chip-level multiprocessor
may operate as follows. In phases of low thread parallelism,
a few cores could be run using a first set (for example, all or
many) of the available optional performance circuits on each
core for good scalar performance. In phases of high thread
parallelism, many cores could be operated using fewer
optional performance circuits on each core for good
throughput performance.
[0043] The net result of reducing array sizes and disabling
execution units may be to reduce the capacitance toggled per
instruction. However, switching capacitance might not be
reduced by as much as it would be by designing a smaller core
to begin with. While unused execution hardware may be gated off,
the physical size of the core does not change, and thus the
wire lengths associated with the still active hardware blocks
may remain longer than in a small core.
[0044] An estimate of the possible reduction in energy per
instruction may be made by examining the floorplan of a
large out-of-order microprocessor and determining how
many optional performance circuits may be turned off to
convert the processor into a small in-order machine (keeping
in mind that the blocks cannot be physically moved). The
percentage of processor core area turned off may then be
quantified, which may approximate the reduction in
switching capacitance. From equation (2), energy per instruction is
roughly proportional to the amount of switching
capacitance.
[0045] A rough estimate is that in some cases up to 50%
of the switching capacitance may be turned off, resulting in
a 1x to 2x reduction in energy per instruction. In some
embodiments, the use of leakage control techniques such as
dynamic sleep transistors and body bias in addition to clock
gating may facilitate reducing the energy consumed per
instruction.
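
The 1x to 2x figure follows from the proportionality between energy per instruction and switching capacitance in Equation 2. The sketch below makes the arithmetic explicit, assuming the supply voltage is held constant; the function name is hypothetical.

```python
def epi_reduction_factor(fraction_capacitance_off: float) -> float:
    """Factor by which energy per instruction falls when a fraction of the
    switching capacitance is turned off, at constant supply voltage.

    From E = C * V**2 / 2, E is proportional to C, so keeping a fraction
    (1 - f) of the capacitance reduces E by a factor of 1 / (1 - f).
    """
    if not 0.0 <= fraction_capacitance_off < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction_capacitance_off)


for fraction in (0.0, 0.25, 0.5):
    print(fraction, epi_reduction_factor(fraction))
```

Turning off nothing gives a factor of 1, and turning off half of the switching capacitance gives a factor of 2, which brackets the range quoted in the estimate.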
`
`
[0046] Referring now to FIG. 3, a schematic diagram of a
processor including cores configurable by optional
performance circuits is shown, according to one embodiment. The
FIG. 3 processor may include four cores, core 1 320, core
2 370, core 3 380, and core 4 390. In other embodiments,
more or fewer than four cores may be used. Core 1 320
shows various optional performance circuits. Scheduler A
334 may be coupled to an optional scheduler B 336 which
may enhance performance when turned on. Execution unit
1 340 may be coupled to optional execution units 2 through
4 342, 344, 346 which may enhance performance when
turned on. Level zero (L0) cache A 322 may be coupled to
L0 cache B 324 which may enhance performance when
turned on. TLB A 326 may be coupled to TLB B 328 which
may enhance performance when turned on. A reorder buffer
(ROB) 338 may have a variable number of lines, or may be
turned off altogether to inhibit out-of-order execution.
Finally, a prefetch stage 332, separate from other pipeline
stages 330, may perform speculative fetches when powered
on. In other embodiments, other optional performance
circuits may be used.
[0047] A throttle module 310 again may be used to gather
information and make a determination about the amount of
parallelism present in the executing software program. The
throttle module 310 may be similar to those discussed above
in connection with FIGS. 1 and 2. Each time the throttle
module 310 makes the determination of the amount of
parallelism in the program, it may direct cores 320, 370, 380,
and 390 via signal lines 312, 314, 316, and 318 to change the
number of optional performance circuits that are powered up
or down. In one embodiment, signal lines 312, 314, 316, and
318 may also be used to turn the cores 320, 370, 380, and
`390 on or off. In one embodiment, if the current amount o