US 20060095807A1

(19) United States
(12) Patent Application Publication
     Grochowski et al.

(10) Pub. No.: US 2006/0095807 A1
(43) Pub. Date: May 4, 2006

(54) METHOD AND APPARATUS FOR VARYING ENERGY PER INSTRUCTION
     ACCORDING TO THE AMOUNT OF AVAILABLE PARALLELISM

(75) Inventors: Edward Grochowski, San Jose, CA (US); John Shen,
     San Jose, CA (US); Hong Wang, Fremont, CA (US); Doron
     Orenstein, Haifa (IL); Gad S. Sheaffer, Haifa (IL); Ronny
     Ronen, Haifa (IL); Murali M. Annavaram, Santa Clara, CA (US)

Correspondence Address:
BLAKELY SOKOLOFF TAYLOR & ZAFMAN
12400 WILSHIRE BOULEVARD
SEVENTH FLOOR
LOS ANGELES, CA 90025-1030 (US)

(73) Assignee: Intel Corporation

(21) Appl. No.: 10/952,627

(22) Filed: Sep. 28, 2004

Publication Classification

(51) Int. Cl.
     G06F 1/30 (2006.01)
(52) U.S. Cl. ............ 713/324; 713/322

(57) ABSTRACT
`
A method and apparatus for changing the configuration of a
multi-core processor are disclosed. In one embodiment, a
throttle module (or throttle logic) may determine the amount
of parallelism present in the currently-executing program,
and change the execution of the threads of that program on
the various cores. If the amount of parallelism is high, then
the processor may be configured to run a larger number of
threads on cores configured to consume less power. If the
amount of parallelism is low, then the processor may be
configured to run a smaller number of threads on cores
configured for greater scalar performance.
`
[Representative drawing (FIG. 1): THROTTLE MODULE 110; CORE 1 120 with VOLTAGE CONTROL and FREQUENCY CONTROL 124; CORE 2 130; CORE 3 140; CORE 4; signal lines 114, 116, 118.]
`
`Petitioner Samsung Ex-1022, 0001
`
`
`
Patent Application Publication   May 4, 2006   Sheet 1 of 11   US 2006/0095807 A1

[FIG. 1: THROTTLE MODULE 110; CORE 1 120 with VOLTAGE CONTROL 122 and FREQUENCY CONTROL 124; CORE 2 130; CORE 3 140; CORE 4 150; signal lines 114, 116, 118.]
`
`
`
Sheet 2 of 11

[FIG. 2: THROTTLE MODULE 210 with THREAD MIGRATION LOGIC 212; A CORE 1 220; A CORE 2 222; B CORE 1 230; B CORE 31 232; B CORE 2 240; B CORE 32 242; ...; B CORE 30 260; B CORE 60 262; signal lines 226, 246, 266.]
`
`
`
Sheet 3 of 11

[FIG. 3: THROTTLE MODULE 310; CORE 1 320; CORE 2 370; CORE 3 380; CORE 4 390. Block labels inside CORE 1: SCH A, SCH B, OTHER PIPELINE, ROB, EXE 1, EXE 2, EXE 3, EXE 4, L0 CACHE A, L0 CACHE B, TLB B. Reference numerals legible within CORE 1: 322, 324, 326, 328, 330, 332, 334, 336, 338, 340, 346.]
`
`
`
Sheet 4 of 11

[FIG. 4: THROTTLE MODULE 410; CORE 1 420 containing PREFETCH 430, OTHER PIPELINE 432, BRANCH PREDICTOR 434, OTHER PREDICTOR 436; CORE 2 470; CORE 3 480; CORE 4 490.]
`
`
`
Sheet 5 of 11

[FIG. 5: CORE 1 through CORE M (502, 504, 506, 508); MONITOR 1 through MONITOR M (512, 514, 516, ...). Other block labels: CONVERT TO POWER, DIFF. COMP., INTEG., SAMPLE, THROTTLE, CLOCK CONTROL, SLOW FEEDBACK LOOP, FAST FEEDBACK LOOP. Other legible reference numerals: 530, 534, 538, 540, 544, 548, 550, 560.]
`
`
`
Sheet 6 of 11

[FIG. 6 (flowchart; reference numerals 610, 614, 618, 626): ALLOCATE THREADS TO CORES; MONITOR PWR IN CORES, COMPUTE ERROR VALUE; INTEGRATE/SAMPLE ERROR VALUE; decision (YES/NO); ADJUST VOLT + FREQ ACCORDING TO CONTROL VALUE.]
`
`
`
Sheet 7 of 11

[FIG. 7 (flowchart; reference numerals 710, 714, 718, 726): ALLOCATE THREADS TO CORES; MONITOR PWR IN CORES, COMPUTE ERROR VALUE; INTEGRATE/SAMPLE ERROR VALUE; decision (YES/NO); REALLOCATE CORES ACCORDING TO CONTROL VALUE.]
`
`
`
Sheet 8 of 11

[FIG. 8 (flowchart; reference numerals 810, 814, 818, 826): ALLOCATE THREADS TO CORES; MONITOR PWR IN CORES, COMPUTE ERROR VALUE; INTEGRATE/SAMPLE ERROR VALUE; decision (YES/NO); ADJUST AMOUNT OF OPTIONAL CIRCUITRY ON/OFF IN ACCORDANCE WITH CONTROL VALUE.]
`
`
`
Sheet 9 of 11

[FIG. 9 (flowchart; reference numerals 910, 914, 918, 926): ALLOCATE THREADS TO CORES; MONITOR PWR IN CORES, COMPUTE ERROR VALUE; INTEGRATE/SAMPLE ERROR VALUE; decision (YES/NO); ADJUST AMOUNT OF PRED. CIRCUITRY ON/OFF IN ACCORDANCE WITH CONTROL VALUE.]
`
`
`
Sheet 10 of 11

[FIG. 10A: system schematic. Legible block labels: PROCESSOR (two), CACHE (two), BUS I/F (several), SYSTEM BUS, MEM CNTR, SYS MEM, BIOS EPROM, HI-PERF GRAPHICS, BUS BRIDGE (two), BUS, I/O DEVICES, KEYBOARD, MOUSE, AUDIO I/O, COMM DEVICES, DATA STORAGE, CODE. Reference numerals are not reliably attributable from the extracted text.]
`
`
`
Sheet 11 of 11

[FIG. 10B: system schematic. Legible block labels: PROCESSOR (two), PROC. CORE (two), MCH (two), MEMORY (two), CHIPSET, HIGH-PERF GRAPHICS, BUS BRIDGE, I/O DEVICES, KEYBOARD/MOUSE, COMM DEVICES, AUDIO I/O, DATA STORAGE, CODE. Reference numerals are not reliably attributable from the extracted text.]
`
`
`
`
`METHOD AND APPARATUS FOR VARYING
`ENERGY PER INSTRUCTION ACCORDING TO
`THE AMOUNT OF AVAILABLE PARALLELISM
`
`FIELD
`
[0001] The present disclosure relates generally to microprocessors
that may execute programs with varying amounts
`of scalar and parallel resource requirements, and more
`specifically to microprocessors employing multiple cores.
`
`BACKGROUND
`
`[0002] Computer workloads, in some embodiments, run in
`a continuum from those having little inherent parallelism
`(being predominantly scalar) to those having significant
`amounts of parallelism (being predominantly parallel), and
`this nature may vary from segment to segment in the
software. Typical scalar workloads include software development
tools, office productivity suites, and operating system
kernel routines. Typical parallel workloads include 3D
graphics, media processing, and scientific applications. Scalar
workloads may retire instructions per clock (IPCs) in the
range of 0.2 to 2.0, whereas parallel workloads may achieve
throughput in the range of 4 to several thousand IPC. The
`latter high IPCs may be obtainable through the use of
`instruction-level parallelism and thread-level parallelism.
`
`[0003] Prior art microprocessors have often been designed
`with either scalar or parallel performance as the primary
`objective. To achieve high scalar performance, it is often
`desirable to reduce execution latency as much as possible.
`Micro-architectural techniques to reduce effective latency
include speculative execution, branch prediction, and caching.
The pursuit of high scalar performance has resulted in
large out-of-order, highly speculative, deep pipeline
microprocessors. To achieve high parallel performance, it may be
desirable to provide as much execution throughput (bandwidth)
as possible. Micro-architectural techniques to
increase throughput include wide superscalar processing,
single-instruction-multiple-data instructions, chip-level
multiprocessing, and multithreading.
`
[0004] Problems may arise when trying to build a microprocessor
that performs well on both scalar and parallel
`tasks. One problem may arise from a perception that design
`techniques needed to achieve short latency are in some cases
`very different from the design techniques needed to achieve
`high throughput.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`[0005] The present invention is illustrated by way of
`example, and not by way of limitation, in the figures of the
accompanying drawings and in which like reference numerals
refer to similar elements and in which:
`
`[0006] FIG. 1 is a schematic diagram of a processor
`including cores configurable by voltage and frequency,
`according to one embodiment.
`
`[0007] FIG. 2 is a schematic diagram of a processor
`including cores selectable by processing power and power
`consumption, according to one embodiment.
`
[0008] FIG. 3 is a schematic diagram of a processor
including cores configurable by optional performance circuits,
according to one embodiment.

[0009] FIG. 4 is a schematic diagram of a processor
including cores configurable by optional speculative circuits,
according to one embodiment of the present disclosure.

[0010] FIG. 5 is a schematic diagram of a processor including
cores and details of a throttle, according to one embodiment
of the present disclosure.

[0011] FIG. 6 is a flowchart showing transitioning to
differing core configurations, according to one embodiment
of the present disclosure.

[0012] FIG. 7 is a flowchart showing transitioning to
differing core configurations, according to another embodiment
of the present disclosure.

[0013] FIG. 8 is a flowchart showing transitioning to
differing core configurations, according to another embodiment
of the present disclosure.

[0014] FIG. 9 is a flowchart showing transitioning to
differing core configurations, according to another embodiment
of the present disclosure.

[0015] FIG. 10A is a schematic diagram of a system
including processors with throttles and multiple cores,
according to an embodiment of the present disclosure.

[0016] FIG. 10B is a schematic diagram of a system
including processors with throttles and multiple cores,
according to another embodiment of the present disclosure.

DETAILED DESCRIPTION

[0017] The following description describes techniques for
varying the amount of energy expended to process each
instruction according to the amount of parallelism available
in a software program. In the following description, numerous
specific details such as logic implementations, software
module allocation, bus and other interface signaling techniques,
and details of operation are set forth in order to
provide a more thorough understanding of the present invention.
It will be appreciated, however, by one skilled in the art
that the invention may be practiced without such specific
details. In other instances, control structures, gate level
circuits and full software instruction sequences have not
been shown in detail in order not to obscure the invention.
Those of ordinary skill in the art, with the included descriptions,
will be able to implement appropriate functionality
without undue experimentation. In certain embodiments, the
invention is disclosed in the form of multi-core implementations
of Pentium® compatible processors such as those
produced by Intel® Corporation. However, the invention
may be practiced in other kinds of processors, such as an
Itanium Processor Family compatible processor, an
X-Scale® family compatible processor, or any of a wide
variety of different general-purpose processors from any of
the processor architectures of other vendors or designers.
Additionally, some embodiments may include or may be
special purpose processors, such as graphics, network,
image, communications, or any other known or otherwise
available type of processor.
`
[0018] Power efficiency may be measured in terms of
instructions-per-second (IPS) per watt. The IPS/watt metric
is equivalent to energy per instruction, or, more precisely,
IPS/watt is proportional to the reciprocal of energy per
instruction as follows,

$$\frac{\text{IPS}}{\text{Watt}} = \frac{\text{Instructions}}{\text{Joule}} \qquad \text{(EQUATION 1)}$$

An important property of the energy per instruction metric
is that it is independent of the amount of time required to
process an instruction. This makes energy per instruction a
useful metric for throughput performance.
`
[0019] An approximate analysis of a microprocessor's
power consumption may be performed by modeling the
microprocessor as a capacitor that is charged or discharged
with every instruction processed (for simplicity, the leakage
current and short-circuit switching current may be ignored).
With this assumption, energy per instruction may depend on
only two things: the amount of capacitance toggled to
process each instruction (from fetch to retirement), and
power supply voltage. The well-known formula:

$$E = \frac{C V^{2}}{2} \qquad \text{(EQUATION 2)}$$

which is normally applied to capacitors, may be applied to
microprocessors as well. E is the energy required to process
an instruction; C is the amount of capacitance toggled in
processing the instruction; and V is the power supply
voltage.
`
[0020] A microprocessor may operate within a fixed
power budget such as, for example, 100 watts. Averaged
over some time period, the microprocessor's power consumption
should not exceed the power budget regardless of
what the microprocessor or software does. To achieve this
objective, a microprocessor may incorporate some form of
dynamic thermal management. Similarly, a chip-level
multiprocessor may regulate (or throttle) its activities to stay
within a fixed power budget regardless of whether it is
retiring, for example, 0.2 instructions per clock (IPC) or 20
IPC. To deliver good performance, the chip-level multiprocessor
should be able to vary its MIPS/watt, or equivalently
its energy/instruction, over a 100:1 range in this example.
`
[0021] One approach to designing a microprocessor that
may achieve both high scalar performance and high throughput
performance is to dynamically vary the amount of
energy expended to process each instruction according to the
amount of parallelism available or estimated to be available
in the software. In other words, if there is a small amount of
parallelism, a microprocessor may expend all available
energy processing a few instructions; and, if there is a
greater amount of parallelism, the microprocessor may
expend very little energy in processing each instruction. This
may be expressed as:

$$P = (\text{EPI}) \times (\text{IPS}) \qquad \text{(EQUATION 3)}$$

where P is the fixed power budget, EPI is the average energy
per retired instruction, and IPS is the aggregate number of
instructions retired per second across all processor cores.
This embodiment attempts to maintain the total multiprocessor
chip power at a nearly constant level.
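As a rough numerical illustration of Equation 3, the sketch below rearranges it to EPI = P / IPS and evaluates the two operating points mentioned in the 100-watt example (0.2 IPC and 20 IPC). The clock rate `ASSUMED_CLOCK_HZ` and the function name are assumptions introduced only for this illustration, and the same clock is assumed at both operating points for simplicity.

```python
POWER_BUDGET_W = 100.0      # fixed power budget from the 100-watt example
ASSUMED_CLOCK_HZ = 2.0e9    # hypothetical clock rate, not taken from the disclosure


def energy_per_instruction(power_w, ipc, clock_hz):
    """Return joules per retired instruction, i.e. EPI = P / IPS."""
    if ipc <= 0 or clock_hz <= 0:
        raise ValueError("ipc and clock_hz must be positive")
    ips = ipc * clock_hz    # aggregate instructions retired per second
    return power_w / ips


# Predominantly scalar phase: few instructions retired per clock.
epi_scalar = energy_per_instruction(POWER_BUDGET_W, 0.2, ASSUMED_CLOCK_HZ)
# Predominantly parallel phase: many instructions retired per clock.
epi_parallel = energy_per_instruction(POWER_BUDGET_W, 20.0, ASSUMED_CLOCK_HZ)
# Ratio of the two energies per instruction at constant power.
epi_range = epi_scalar / epi_parallel
```

Under these assumptions the ratio depends only on the two IPC values, which is why a 0.2 to 20 IPC spread at constant power corresponds to roughly a 100:1 spread in energy per instruction.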
`
[0022] Complementary Metal-Oxide-Semiconductor
(CMOS) voltage and frequency scaling may be used to
achieve different energy per instruction ratios. In one
embodiment, logic varies the microprocessor's power supply
voltage and clock frequency in unison according to the
performance and power levels desired. To maintain a chip-level
multiprocessor's total power consumption within a
fixed power budget, voltage and frequency scaling may be
applied dynamically as follows. In phases of low thread
parallelism, a few cores may be run using high supply
voltage and high frequency for best scalar performance. In
phases of high thread parallelism, many cores may be run
using low supply voltage and low frequency for best
throughput performance. Since low power consumption for
inactive cores may be desirable, leakage control techniques
such as dynamic sleep transistors and body bias may be
used.
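As an illustration of how Equation 2 ties supply voltage to energy per instruction, consider one core run at two operating points with the same toggled capacitance C. The constant-C assumption and the voltages below are hypothetical values chosen for the arithmetic, not figures from this disclosure.

```latex
\frac{E_{\text{high}}}{E_{\text{low}}}
  = \frac{C V_{\text{high}}^{2}/2}{C V_{\text{low}}^{2}/2}
  = \left(\frac{V_{\text{high}}}{V_{\text{low}}}\right)^{2},
\qquad
V_{\text{high}} = 1.2\,\text{V},\;
V_{\text{low}} = 0.6\,\text{V}
\;\Rightarrow\;
\frac{E_{\text{high}}}{E_{\text{low}}} = 4
```

Under this simplified model, halving the supply voltage cuts the energy per instruction to one quarter, independent of the clock frequency chosen.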
[0023] Referring now to FIG. 1, a schematic diagram of a
processor including cores configurable by voltage and frequency
is shown, according to one embodiment. Core 1 120,
core 2 130, core 3 140, and core 4 150 are shown, but in other
embodiments there may be more or fewer than four cores in
a processor. One or more of the cores may have a voltage
control circuit and a clock frequency control circuit. FIG. 1
expressly shows core 1 120 possessing voltage control circuit
122 and frequency control circuit 124, but the other cores
may have equivalent circuits as well, or the voltage control
and frequency control logic may be separate logic not
directly associated with a particular core.
[0024] A throttle module 110 may be used to gather
information and make a determination about, or an estimate
of, the amount of parallelism present in the executing
software program. In one embodiment, the amount of parallelism
may be the number of simultaneous threads supported.
In other embodiments, other metrics may be used to
express the amount of parallelism, such as the aggregate
number of instructions retired per second, or the number of
branch instructions that may support speculative multithreaded
execution. Throttle module 110 may utilize information
provided by the operating system to aid in the
determination of the amount of parallelism. In other embodiments,
throttle module 110 may make this determination
using hardware logic within the processor and its cores. The
determination may be made on a continuous basis or
periodically.
[0025] Each time the throttle module 110 makes the determination
of the amount of parallelism in the program, it may
`direct cores 120, 130, 140, 150 via signal lines 112, 114, 116,
`and 118 to change their voltage and clock frequency. In one
`embodiment, signal lines 112, 114, 116, and 118 may also be
`used to turn the cores on or off, or to remove power from a
`power well containing a core. In other embodiments, the
`cores may be turned off by clock gating or instruction
`starvation techniques. In one embodiment, if the current
`amount of thread level parallelism exceeds a previous
`amount by more than a threshold value, then the throttle
`module may initiate a transition to running a greater number
`of threads by decreasing the voltage and clock frequency in
`each core but running the threads on a greater number of
`cores. Cores that had previously been turned off may be
`turned on to support the larger number of threads. Similarly,
`if the current amount of thread level parallelism is less than
`a previous amount by more than a threshold value, then the
`throttle module may initiate a transition to running a fewer
`number of threads by increasing the voltage and clock
`frequency in some cores but running the threads on a fewer
`number of these cores. Some cores that had previously been
`turned on may be turned off as they may no longer be needed
`to support the smaller number of threads.
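The threshold comparison described above can be sketched as follows. This is a minimal illustration, not the disclosed logic: the `CoreConfig` type, the `"high_vf"` and `"low_vf"` labels, the `next_config` name, and the default of four cores are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreConfig:
    active_cores: int       # number of cores currently turned on
    operating_point: str    # hypothetical labels: "high_vf" or "low_vf"


def next_config(current_threads, previous_threads, threshold, config, total_cores=4):
    """Pick a new configuration only when parallelism changed by more than threshold."""
    change = current_threads - previous_threads
    if change > threshold:
        # More parallelism: spread threads over more cores at lower voltage and frequency.
        return CoreConfig(min(current_threads, total_cores), "low_vf")
    if change < -threshold:
        # Less parallelism: run on fewer cores at higher voltage and frequency.
        return CoreConfig(max(1, min(current_threads, total_cores)), "high_vf")
    # Change within the threshold: keep the existing configuration.
    return config
```

Leaving the configuration untouched for changes within the threshold is one simple way to avoid reconfiguring on every small fluctuation in thread count.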
[0026] In one embodiment it may be possible to design a
single-instruction-set-architecture (ISA) heterogeneous
multi-core microprocessor in which different micro-architectures
may be used to span a range of performance and
power. In one embodiment, a chip-level multiprocessor may
be built from two types of processor cores, which may be
referred to as a large core and a small core. The two types
of cores may implement the same instruction set architecture,
use cache coherency to implement shared memory, and
differ only in their micro-architecture. In other embodiments,
the two types of core may implement similar instruction
set architectures, or the small cores may implement a
subset of the instruction set of the large cores. The large core
may be an out-of-order, superscalar, deep pipeline machine,
whereas the small core may be an in-order, scalar, short
pipeline machine. The Intel Pentium 4 processor and Intel
i486 processor are representative of the two classes of cores.
In other embodiments, more than two classes or performance
levels of cores running a substantially similar or
identical instruction set architecture may be used.
[0027] In one embodiment, a chip-level multiprocessor
includes one large core and 25 small cores, with the two
types of cores having a 25:1 ratio in power consumption, a
5:1 ratio in scalar performance, and a 5:1 range of energy per
instruction. The chip-level multiprocessor of this embodiment
may operate as follows. In phases of low thread-level
`parallelism, the large core may be run for best scalar
`performance. In phases of high thread-level parallelism,
`multiple small cores may be run for best throughput perfor(cid:173)
`mance.
`[0028] At any instant in time, the microprocessor may run
`either one large core or 25 small cores. Because the number
of available software threads will vary over time, the asymmetric
multiprocessor should be capable of migrating a
`thread between large and small cores. A thread-migration
`logic may be implemented to support this function.
`[0029]
`In practice, it may be desirable to allow a few small
`cores to run simultaneously with the large core in order to
`reduce the throughput performance discontinuity at the point
`of switching off the large core. In the previous example, a
`discontinuity of 3 units of throughput may result from
`switching off the large core and switching on two small
`cores. To reduce the percentage of the total throughput lost,
`the discontinuity may be moved to occur with a higher
`number of running threads by permitting, for example, up to
`5 small cores to run simultaneously with the large core if the
`power supply will support this for a short period of time.
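The discontinuity arithmetic above can be checked with a small sketch. It assumes relative throughput units in which the large core delivers 5 and each small core delivers 1 (taken from the 5:1 scalar performance ratio), one thread per core, and a switch to small cores only once the thread count exceeds what the large core plus its permitted small cores can hold. The function names are illustrative.

```python
LARGE_CORE_THROUGHPUT = 5   # relative units, from the 5:1 scalar performance ratio
SMALL_CORE_THROUGHPUT = 1


def throughput(threads, max_small_with_large):
    """Aggregate throughput for a thread count under the simplified switching rule."""
    if threads <= 0:
        return 0
    if threads <= 1 + max_small_with_large:
        # Large core stays on; the remaining threads run on small cores.
        return LARGE_CORE_THROUGHPUT + (threads - 1) * SMALL_CORE_THROUGHPUT
    # Large core switched off; every thread runs on a small core.
    return threads * SMALL_CORE_THROUGHPUT


def discontinuity(max_small_with_large):
    """Return (units lost, fraction of throughput lost) at the switch-over point."""
    before = throughput(1 + max_small_with_large, max_small_with_large)
    after = throughput(2 + max_small_with_large, max_small_with_large)
    lost = before - after
    return lost, lost / before


lost_none, fraction_none = discontinuity(0)   # large core never shares the budget
lost_five, fraction_five = discontinuity(5)   # up to 5 small cores alongside it
```

In this model the absolute drop is 3 units in both cases, but it falls from 60 percent of the pre-switch throughput to 30 percent when up to 5 small cores may run beside the large core.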
`[0030] Using two types of cores representative of today's
`microprocessors, a 4:1 range of energy per instruction is
`achievable. As future microprocessors continue to deliver
`even higher levels of scalar performance, the range of
`possible energy per instruction may be expected to increase
`to perhaps 6:1, or well beyond this ratio.
[0031] Referring now to FIG. 2, a schematic diagram of a
processor including cores selectable by processing power
and power consumption is shown, according to one embodiment.
The processor may include a few larger cores, the A
cores, and may also include a larger number of small cores,
the B cores. The A core 1 220, A core 2 222, and B cores 1
through 60 230-262 are shown, but in other embodiments
there may be more or fewer than two A cores and sixty B
cores in a processor.
`[0032] A throttle module 210 may again be used to gather
information and make a determination about the amount of
parallelism present in the executing software program. In
one embodiment, the amount of parallelism may be the
number of simultaneous threads supported. In other embodiments,
other metrics may be used to express the amount of
`parallelism as discussed previously. Throttle module 210
`may utilize information provided by the operating system to
`aid in the determination of the amount of parallelism. In
`other embodiments, throttle module 210 may make this
`determination using hardware logic within the processor and
`its cores. The determination may be made on a continuous
`basis or periodically.
`
`[0033] Because the number of available software threads
may vary over time, the processor of FIG. 2 may include a
`thread-migration logic 212 capable of migrating a thread
`between large A cores and small B cores. It may be desirable
`to allow a few small B cores to run simultaneously with the
`large A core in order to reduce the throughput performance
`discontinuity at the point of switching off the large A core.
`To reduce the percentage of the total throughput lost, the
`discontinuity may be moved to occur with a higher number
`of running threads by permitting, for example, up to 5 small
`cores to run simultaneously with the large core.
`
`[0034] Each time the throttle module 210 makes the
`determination of the amount of parallelism in the program,
`it may initiate powering the A cores and B cores up or down
`using signal lines 224 through 266. In one embodiment, if
`the current amount of parallelism exceeds a previous amount
`by more than a threshold value, then the throttle module 210
`may initiate, using thread-migration logic 212, a transition to
`running a greater number of threads which may be run on a
`greater number of B cores. B cores that had previously been
`turned off may be turned on to support the larger number of
`threads, and any A cores which are turned on may be turned
`off. Similarly, if the current amount of parallelism is less
`than a previous amount by more than a threshold value, then
`the throttle module may initiate a transition to running a
`fewer number of threads by running the threads on a fewer
`number of the A cores. B cores that had previously been
`turned on may be turned off as they may no longer be needed
`to support the smaller number of threads, and A cores may
`be turned on to support the smaller number of threads. As
`mentioned above, it may be desirable to allow a few B cores
`to run simultaneously with the A cores in order to reduce the
throughput performance discontinuity at the point of switching
off the large core.
`
[0035] In one embodiment, the throttle module may be
implemented in a manner that does not require a feedback
loop. Here the throttle's control action (e.g. determining
which type and how many cores on which to run the threads)
does not return to affect the input value (e.g. the allocation
and configuration of cores for the threads). In this embodiment,
it may be presumed that each A core 220, 222 may
consume the same amount of power as 25 of the B cores 230
through 262. In other embodiments, differing ratios of power
consumption may be used. The processor may divide its
total power budget into two portions. For each portion, the
power budget may permit either one A core and up to five B
cores to operate at the same time, or no A cores and up to
thirty B cores to operate at the same time. In other embodiments,
the power budget may be divided into portions in
other ways.
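The two-portion budget rule above can be sketched as a small allocation function. This is one possible reading, assuming one thread per core and a preference for keeping as many A cores running as the thread count allows; the constant and function names are illustrative and not part of the disclosure. Under these assumptions it yields the allocations listed in Table I, for example at the 12-to-13 and 36-to-37 thread boundaries.

```python
A_CORE_POWER = 25     # one A core draws as much power as 25 B cores
PORTION_BUDGET = 30   # per-portion budget in B-core units: 1 A + 5 B, or 30 B
NUM_PORTIONS = 2      # the total power budget is divided into two portions


def allocate(running_threads):
    """Return (QAC, QBC): quantities of A cores and B cores for the thread count."""
    if running_threads < 0:
        raise ValueError("running_threads must be non-negative")
    b_beside_a = PORTION_BUDGET - A_CORE_POWER   # B cores that fit next to one A core
    # Try configurations with the most A-core portions first.
    for a_portions in range(NUM_PORTIONS, -1, -1):
        capacity = (a_portions * (1 + b_beside_a)
                    + (NUM_PORTIONS - a_portions) * PORTION_BUDGET)
        if running_threads <= capacity:
            qac = min(running_threads, a_portions)
            return qac, running_threads - qac
    raise ValueError("more threads than cores permitted by the power budget")
```

Trying the A-heavy configurations first reflects the stated goal of favoring scalar performance when few threads are running.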
`
`
`[0036]
`In one embodiment, the number of running threads
`(RT) may be allocated to a quantity of A cores (QAC) and
`a quantity of B cores (QBC) according to Table I.
`
TABLE I

RT    QAC   QBC
 0     0     0
 1     1     0
 2     2     0
 3     2     1
 4     2     2
...   ...   ...
10     2     8
11     2     9
12     2    10
13     1    12
14     1    13
15     1    14
...   ...   ...
28     1    27
29     1    28
30     1    29
31     1    30
32     1    31
33     1    32
34     1    33
35     1    34
36     1    35
37     0    37
38     0    38
39     0    39
40     0    40
...   ...   ...
60     0    60

[0037] When the number of running threads increases, and
the new thread is started (in one embodiment via an
inter-processor interrupt), the throttle module may determine the
number of currently-running threads. Depending upon the
number of currently-running threads, the new thread may be
assigned to either an A core or a B core in accordance with
Table I above. In this embodiment, for certain cases, such as
when increasing from 12 threads to 13 threads, or from 36
threads to 37 threads, an existing thread running on an A core
would be migrated to running on a B core. When this
migration is complete, both the existing migrated thread and
the new thread may be started. For this reason, in this
embodiment the new thread may exhibit a delay in starting.

[0038] A similar process may occur when the number of
running threads decreases. When a particular thread terminates,
and its core is halted, various methods may be used to
potentially migrate one of the remaining threads from running
on a B core to running on an A core. This could occur,
for example, when reducing the number of running threads
from 13 threads to 12 threads, or from 37 threads to 36
threads. In one embodiment, a periodic timer may be used to
permit the migration only once in a particular time interval.
This may advantageously prevent too frequent thread migrations
in cases where threads are rapidly created and terminated.
The affected thread could remain running on the B
core for up to the particular time interval.

[0039] In one embodiment, the throttle module may perform
the migrations from A cores to B cores transparently to
the software. The thread migration mechanism of the throttle
module may include a table for mapping logical cores to
physical cores, an interrupt to signal that a core migration may
be needed, microcode or hardwired logic to copy the core's
processor state, and an interconnect network between the
processor's cores. The number of logical cores may be equal
to the number of B cores.
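A table for mapping logical cores to physical cores, as mentioned above, might be modeled in software as below. This is only an illustrative sketch: the class name, method names, core labels such as "A1", and the use of a dictionary for processor state are assumptions, and a real mechanism would rely on microcode or hardwired logic rather than Python objects.

```python
class LogicalCoreMap:
    """Hypothetical logical-to-physical core table used to hide migrations."""

    def __init__(self, num_logical):
        # Every logical core starts unmapped (no physical core assigned).
        self._table = {logical: None for logical in range(num_logical)}
        self._state = {}   # physical core id -> saved processor state

    def assign(self, logical, physical, state=None):
        """Bind a logical core to a physical core holding the given state."""
        self._table[logical] = physical
        self._state[physical] = state

    def migrate(self, logical, new_physical):
        """Move a logical core to another physical core; return the old one."""
        old_physical = self._table[logical]
        # Stand-in for the microcode or hardwired copy of processor state.
        self._state[new_physical] = self._state.pop(old_physical, None)
        self._table[logical] = new_physical
        return old_physical

    def physical(self, logical):
        return self._table[logical]

    def state_of(self, physical):
        return self._state.get(physical)
```

Because software would address only the logical core number, the physical core behind it can change without the software observing the move, which is the sense of "transparently" used above.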
`
[0040] In another embodiment, the throttle module may
perform the migrations from A cores to B cores in a manner
not transparent to the software. The thread migration may be
performed by the operating system scheduler. The operating
system may track the number of cores with currently running
threads, assign new threads to cores, and migrate
threads from A cores to B cores (or B cores to A cores). The
software thread migration may use equivalent functions to
those described above in regard to the hardware implementation.
In one embodiment, the throttle module operation
may be transparent to application programs although not to
the operating system.
`
[0041] One alternate way to modulate power consumption
may be to adjust the size or functionality of logic blocks. For
example, variable-sized schedulers, caches, translation look-aside
buffers (TLBs), branch predictors, and other optional
performance circuits may be used to reduce switching
capacitance (and hence energy) when large array sizes are
not needed. In addition to dynamically resizing arrays, it is
also possible to design a large core that degrades its performance
into that of a smaller core by dynamically disabling
execution units, pipeline stages, and other optional performance
circuits. These techniques may be collectively known
as adaptive processing.
`
`[0042] One embodiment of a chip-level multiprocessor
may operate as follows. In phases of low thread parallelism,
`a few cores could be run using a first set (for example, all or
`many) of the available optional performance circuits on each
`core for good scalar performance. In phases of high thread
`parallelism, many cores could be operated using fewer
`optional performance circuits on each core for good
`throughput performance.
`
`[0043] The net result of reducing array sizes and disabling
`execution units may be to reduce the capacitance toggled per
`instruction. However, switching capacitance might not be
`reduced by as much as designing a smaller core to begin
`with. While unused execution hardware may be gated off,
`the physical size of the core does not change, and thus the
`wire lengths associated with the still active hardware blocks
`may remain longer than in a small core.
`
`[0044] An estimate of the possible reduction in energy per
`instruction may be made by examining the floorplan of a
`large out-of-order microprocessor and determining how
`many optional performance circuits may be turned off to
`convert the processor into a small in-order machine (keeping
`in mind that the blocks cannot be physically moved). The
`percentage of processor core area turned off may then be
`