US 20060095807A1

(19) United States
(12) Patent Application Publication
Grochowski et al.

(10) Pub. No.: US 2006/0095807 A1
(43) Pub. Date: May 4, 2006

(54) METHOD AND APPARATUS FOR VARYING ENERGY PER INSTRUCTION ACCORDING TO THE AMOUNT OF AVAILABLE PARALLELISM

(75) Inventors: Edward Grochowski, San Jose, CA (US); John Shen, San Jose, CA (US); Hong Wang, Fremont, CA (US); Doron Orenstein, Haifa (IL); Gad S. Sheaffer, Haifa (IL); Ronny Ronen, Haifa (IL); Murali M. Annavaram, Santa Clara, CA (US)

Correspondence Address:
BLAKELY SOKOLOFF TAYLOR & ZAFMAN
12400 WILSHIRE BOULEVARD
SEVENTH FLOOR
LOS ANGELES, CA 90025-1030 (US)

(73) Assignee: Intel Corporation

(21) Appl. No.: 10/952,627

(22) Filed: Sep. 28, 2004

Publication Classification

(51) Int. Cl.
G06F 1/30 (2006.01)
(52) U.S. Cl. ............................................ 713/324; 713/322
`
(57) ABSTRACT

A method and apparatus for changing the configuration of a multi-core processor is disclosed. In one embodiment, a throttle module (or throttle logic) may determine the amount of parallelism present in the currently-executing program, and change the execution of the threads of that program on the various cores. If the amount of parallelism is high, then the processor may be configured to run a larger amount of threads on cores configured to consume less power. If the amount of parallelism is low, then the processor may be configured to run a smaller amount of threads on cores configured for greater scalar performance.
`
`
`Petitioner Samsung Ex-1022, 0001
`
`

`

Patent Application Publication   May 4, 2006   Sheet 1 of 11   US 2006/0095807 A1

[FIG. 1: block diagram. Throttle module 110 is connected by signal lines (114, 116, and 118 are legible in the scan) to core 1 120, core 2 130, core 3 140, and core 4 150. Core 1 120 includes voltage control 122 and frequency control 124.]
[FIG. 2 (Sheet 2 of 11): block diagram. Throttle module 210 with thread migration logic 212; A core 1 220 and A core 2 222; B core 1 230 and B core 31 232; B core 2 240 and B core 32 242; continuing through B core 30 260 and B core 60 262. Signal lines 226, 246, and 266 are legible in the scan.]
[FIG. 3 (Sheet 3 of 11): block diagram. Throttle module 310 with core 1 320, core 2 370, core 3 380, and core 4 390. Inside core 1 320 the legible block labels are SCH A, SCH B, OTHER PIPELINE, ROB, EXE 1, EXE 2, EXE 3, EXE 4, L0 CACHE A, L0 CACHE B, and TLB B, carrying reference numerals in the range 322 to 346.]
[FIG. 4 (Sheet 4 of 11): block diagram. Throttle module 410 with core 1 420, core 2 470, core 3 480, and core 4 490. Inside core 1 420 the legible block labels are PREFETCH 430, OTHER PIPELINE 432, BRANCH PREDICTOR 434, and OTHER PREDICTOR 436.]
[FIG. 5 (Sheet 5 of 11): block diagram of a throttle. Cores 1 through M (502, 504, 506, 508) each feed a monitor 1 through M (512, 514, 516; the last numeral is illegible in the scan). Remaining legible block labels: CONVERT TO POWER, DIFF. COMP., INTEG., SAMPLE, CONTROL, CLOCK, THROTTLE, SLOW FEEDBACK LOOP, and FAST FEEDBACK LOOP, carrying reference numerals in the range 530 to 560.]
[FIG. 6 (Sheet 6 of 11): flowchart. Steps in order: ALLOCATE THREADS TO CORES; MONITOR PWR IN CORES, COMPUTE ERROR VALUE; INTEGRATE/SAMPLE ERROR VALUE; a decision with NO and YES branches whose label did not survive extraction; ADJUST VOLT + FREQ ACCORDING TO CONTROL VALUE. Legible reference numerals: 610, 614, 618, 626.]
[FIG. 7 (Sheet 7 of 11): flowchart. Steps in order: ALLOCATE THREADS TO CORES; MONITOR PWR IN CORES, COMPUTE ERROR VALUE; INTEGRATE/SAMPLE ERROR VALUE; a decision with NO and YES branches whose label did not survive extraction; REALLOCATE CORES ACCORDING TO CONTROL VALUE. Legible reference numerals: 710, 714, 718, 726.]
[FIG. 8 (Sheet 8 of 11): flowchart. Steps in order: ALLOCATE THREADS TO CORES; MONITOR PWR IN CORES, COMPUTE ERROR VALUE; INTEGRATE/SAMPLE ERROR VALUE; a decision with NO and YES branches whose label did not survive extraction; ADJUST AMOUNT OF OPTIONAL CIRCUITRY ON/OFF IN ACCORDANCE WITH CONTROL VALUE. Legible reference numerals: 810, 814, 818, 826.]
[FIG. 9 (Sheet 9 of 11): flowchart. Steps in order: ALLOCATE THREADS TO CORES; MONITOR PWR IN CORES, COMPUTE ERROR VALUE; INTEGRATE/SAMPLE ERROR VALUE; a decision with NO and YES branches whose label did not survive extraction; ADJUST AMOUNT OF PRED. CIRCUITRY ON/OFF IN ACCORDANCE WITH CONTROL VALUE. Legible reference numerals: 910, 914, 918, 926.]
[FIG. 10A (Sheet 10 of 11): system block diagram. Legible block labels: two PROCESSOR blocks, each with a CACHE and a BUS I/F; SYSTEM BUS; MEM CNTR; SYS MEM; BIOS EPROM; HI-PERF GRAPHICS; BUS BRIDGE (two instances); BUS; I/O DEVICES; KEYBOARD; MOUSE; COMM DEVICES; AUDIO I/O; DATA STORAGE; CODE.]
[FIG. 10B (Sheet 11 of 11): system block diagram. Legible block labels: two PROCESSOR blocks, each with a PROC. CORE and an MCH; two MEMORY blocks; CHIPSET; HIGH-PERF GRAPHICS; BUS BRIDGE; I/O DEVICES; AUDIO I/O; KEYBOARD/MOUSE; COMM DEVICES; DATA STORAGE; CODE.]
`
METHOD AND APPARATUS FOR VARYING ENERGY PER INSTRUCTION ACCORDING TO THE AMOUNT OF AVAILABLE PARALLELISM

FIELD

[0001] The present disclosure relates generally to microprocessors that may execute programs with varying amounts of scalar and parallel resource requirements, and more specifically to microprocessors employing multiple cores.
`
BACKGROUND

[0002] Computer workloads, in some embodiments, run in a continuum from those having little inherent parallelism (being predominantly scalar) to those having significant amounts of parallelism (being predominantly parallel), and this nature may vary from segment to segment in the software. Typical scalar workloads include software development tools, office productivity suites, and operating system kernel routines. Typical parallel workloads include 3D graphics, media processing, and scientific applications. Scalar workloads may retire instructions per clock (IPCs) in the range of 0.2 to 2.0, whereas parallel workloads may achieve throughput in the range of 4 to several thousand IPC. The latter high IPCs may be obtainable through the use of instruction-level parallelism and thread-level parallelism.

[0003] Prior art microprocessors have often been designed with either scalar or parallel performance as the primary objective. To achieve high scalar performance, it is often desirable to reduce execution latency as much as possible. Micro-architectural techniques to reduce effective latency include speculative execution, branch prediction, and caching. The pursuit of high scalar performance has resulted in large out-of-order, highly speculative, deep pipeline microprocessors. To achieve high parallel performance, it may be desirable to provide as much execution throughput (bandwidth) as possible. Micro-architectural techniques to increase throughput include wide superscalar processing, single-instruction-multiple-data instructions, chip-level multiprocessing, and multithreading.

[0004] Problems may arise when trying to build a microprocessor that performs well on both scalar and parallel tasks. One problem may arise from a perception that design techniques needed to achieve short latency are in some cases very different from the design techniques needed to achieve high throughput.
`
BRIEF DESCRIPTION OF THE DRAWINGS

[0005] The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

[0006] FIG. 1 is a schematic diagram of a processor including cores configurable by voltage and frequency, according to one embodiment.

[0007] FIG. 2 is a schematic diagram of a processor including cores selectable by processing power and power consumption, according to one embodiment.

[0008] FIG. 3 is a schematic diagram of a processor including cores configurable by optional performance circuits, according to one embodiment.

[0009] FIG. 4 is a schematic diagram of a processor including cores configurable by optional speculative circuits, according to one embodiment of the present disclosure.

[0010] FIG. 5 is a schematic diagram of a processor including cores and details of a throttle, according to one embodiment of the present disclosure.

[0011] FIG. 6 is a flowchart showing transitioning to differing core configurations, according to one embodiment of the present disclosure.

[0012] FIG. 7 is a flowchart showing transitioning to differing core configurations, according to another embodiment of the present disclosure.

[0013] FIG. 8 is a flowchart showing transitioning to differing core configurations, according to another embodiment of the present disclosure.

[0014] FIG. 9 is a flowchart showing transitioning to differing core configurations, according to another embodiment of the present disclosure.

[0015] FIG. 10A is a schematic diagram of a system including processors with throttles and multiple cores, according to an embodiment of the present disclosure.

[0016] FIG. 10B is a schematic diagram of a system including processors with throttles and multiple cores, according to another embodiment of the present disclosure.

DETAILED DESCRIPTION

[0017] The following description describes techniques for varying the amount of energy expended to process each instruction according to the amount of parallelism available in a software program. In the following description, numerous specific details such as logic implementations, software module allocation, bus and other interface signaling techniques, and details of operation are set forth in order to provide a more thorough understanding of the present invention. It will be appreciated, however, by one skilled in the art that the invention may be practiced without such specific details. In other instances, control structures, gate level circuits and full software instruction sequences have not been shown in detail in order not to obscure the invention. Those of ordinary skill in the art, with the included descriptions, will be able to implement appropriate functionality without undue experimentation. In certain embodiments, the invention is disclosed in the form of multi-core implementations of Pentium® compatible processor such as those produced by Intel® Corporation. However, the invention may be practiced in other kinds of processors, such as an Itanium Processor Family compatible processor, an X-Scale® family compatible processor, or any of a wide variety of different general-purpose processors from any of the processor architectures of other vendors or designers. Additionally, some embodiments may include or may be special purpose processors, such as graphics, network, image, communications, or any other known or otherwise available type of processor.
`
[0018] Power efficiency may be measured in terms of instructions-per-second (IPS) per watt. The IPS/watt metric is equivalent to energy per instruction, or, more precisely, IPS/watt is proportional to the reciprocal of energy per instruction as follows,

$$\frac{\text{IPS}}{\text{Watt}} = \frac{\text{Instructions}}{\text{Joule}} \qquad \text{(EQUATION 1)}$$

An important property of the energy per instruction metric is that it is independent of the amount of time required to process an instruction. This makes energy per instruction a useful metric for throughput performance.
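As a rough numerical illustration of Equation 1, the sketch below computes both quantities from the same two measurements. The function names and the sample figures (2e9 instructions per second at 50 W) are assumptions chosen for the example and do not come from the disclosure.

```python
def ips_per_watt(instructions_per_second: float, power_watts: float) -> float:
    """Instructions per second per watt, which is the same as instructions per joule."""
    return instructions_per_second / power_watts


def energy_per_instruction(instructions_per_second: float, power_watts: float) -> float:
    """Joules per instruction: the reciprocal of IPS/watt."""
    return power_watts / instructions_per_second


# Hypothetical core: 2e9 instructions per second while drawing 50 W.
efficiency = ips_per_watt(2e9, 50.0)        # instructions per joule
epi = energy_per_instruction(2e9, 50.0)     # joules per instruction
```

Because the two values are reciprocals, either one can serve as the efficiency metric; time per instruction does not appear in either expression.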
`
[0019] An approximate analysis of a microprocessor's power consumption may be performed by modeling the microprocessor as a capacitor that is charged or discharged with every instruction processed (for simplicity, the leakage current and short-circuit switching current may be ignored). With this assumption, energy per instruction may depend on only two things: the amount of capacitance toggled to process each instruction (from fetch to retirement), and power supply voltage. The well-known formula:

$$E = \frac{C V^2}{2} \qquad \text{(EQUATION 2)}$$

which is normally applied to capacitors, may be applied to microprocessors as well. E is the energy required to process an instruction; C is the amount of capacitance toggled in processing the instruction; and V is the power supply voltage.
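A minimal sketch of Equation 2 follows, mainly to show the quadratic dependence on supply voltage. The function name and the 1 nF toggled capacitance are hypothetical values picked for illustration.

```python
def switching_energy(capacitance_farads: float, supply_volts: float) -> float:
    """Energy per instruction under the capacitor model, E = C * V**2 / 2.

    Leakage and short-circuit current are ignored, as in the text.
    """
    return capacitance_farads * supply_volts ** 2 / 2.0


# Hypothetical figure: 1 nF of capacitance toggled per instruction.
C_TOGGLED = 1e-9
energy_at_full_voltage = switching_energy(C_TOGGLED, 1.0)
energy_at_half_voltage = switching_energy(C_TOGGLED, 0.5)

# Halving the supply voltage cuts the energy per instruction to one quarter.
voltage_saving_factor = energy_at_full_voltage / energy_at_half_voltage
```

Under this model, lowering the voltage and lowering the toggled capacitance are the only two levers on energy per instruction.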
`
[0020] A microprocessor may operate within a fixed power budget such as, for example, 100 watts. Averaged over some time period, the microprocessor's power consumption should not exceed the power budget regardless of what the microprocessor or software do. To achieve this objective, a microprocessor may incorporate some form of dynamic thermal management. Similarly, a chip-level multiprocessor may regulate (or throttle) its activities to stay within a fixed power budget regardless of whether it is retiring, for example, 0.2 instructions per clock (IPC) or 20 IPC. To deliver good performance, the chip-level multiprocessor should be able to vary its MIPS/watt, or equivalently its energy/instruction, over a 100:1 range in this example.
`
[0021] One approach to designing a microprocessor that may achieve both high scalar performance and high throughput performance is to dynamically vary the amount of energy expended to process each instruction according to the amount of parallelism available or estimated to be available in the software. In other words, if there is a small amount of parallelism, a microprocessor may expend all available energy processing a few instructions; and, if there is a greater amount of parallelism, the microprocessor may expend very little energy in processing each instruction. This may be expressed as:

$$P = (\text{EPI}) \times (\text{IPS}) \qquad \text{(EQUATION 3)}$$

where P is the fixed power budget, EPI is the average energy per retired instruction, and IPS is the aggregate number of instructions retired per second across all processor cores. This embodiment attempts to maintain the total multiprocessor chip power at a nearly constant level.
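Equation 3 can be rearranged to give the energy each instruction is allowed to consume at a given throughput. The sketch below uses the 100 W budget and the 0.2 and 20 IPC figures from the example in the text; the 3 GHz clock and the function name are assumptions made only so that the numbers are concrete.

```python
def epi_budget(power_budget_watts: float, ipc: float, clock_hz: float) -> float:
    """Average energy per retired instruction allowed by P = EPI * IPS."""
    ips = ipc * clock_hz          # aggregate instructions retired per second
    return power_budget_watts / ips


POWER_BUDGET_W = 100.0   # example budget from the text
CLOCK_HZ = 3.0e9         # assumed clock, not from the source

scalar_epi = epi_budget(POWER_BUDGET_W, 0.2, CLOCK_HZ)     # low-parallelism phase
parallel_epi = epi_budget(POWER_BUDGET_W, 20.0, CLOCK_HZ)  # high-parallelism phase

# Ratio of the two energy budgets: about 100, matching the 100:1 range.
epi_range = scalar_epi / parallel_epi
```

The assumed clock cancels out of the ratio, so the 100:1 range depends only on the two IPC figures.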
`
[0022] Complementary Metal-Oxide-Semiconductor (CMOS) voltage and frequency scaling may be used to achieve different energy per instruction ratios. In one embodiment, logic varies the microprocessor's power supply voltage and clock frequency in unison according to the performance and power levels desired. To maintain a chip-level multiprocessor's total power consumption within a fixed power budget, voltage and frequency scaling may be applied dynamically as follows. In phases of low thread parallelism, a few cores may be run using high supply voltage and high frequency for best scalar performance. In phases of high thread parallelism, many cores may be run using low supply voltage and low frequency for best throughput performance. Since low power consumption for inactive cores may be desirable, leakage control techniques such as dynamic sleep transistors and body bias may be used.
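The trade-off described here can be sketched with a simplified model. It assumes dynamic power proportional to C·V²·f and a supply voltage scaled linearly with frequency, so per-core power falls with the cube of the scaling factor; that linear V/f relation, the 3 GHz and 1.2 V reference point, and the function names are all assumptions for illustration, not figures from the disclosure.

```python
def scaled_operating_point(active_cores: int,
                           f_max_hz: float = 3.0e9,
                           v_max: float = 1.2) -> tuple[float, float]:
    """Per-core (frequency, voltage) so that N cores together draw the power
    of one core at (f_max_hz, v_max).

    Simplified model: power ~ C * V**2 * f with V scaled linearly with f,
    leakage ignored, so per-core power scales with scale**3.
    """
    if active_cores < 1:
        raise ValueError("at least one core must be active")
    scale = (1.0 / active_cores) ** (1.0 / 3.0)
    return f_max_hz * scale, v_max * scale


def relative_throughput(active_cores: int) -> float:
    """Aggregate throughput relative to one full-speed core, assuming each
    thread's speed tracks its core's clock frequency."""
    scale = (1.0 / active_cores) ** (1.0 / 3.0)
    return active_cores * scale


frequency_hz, voltage = scaled_operating_point(8)   # eight cores at half speed
gain = relative_throughput(8)                       # about 4x one fast core
```

Under these assumptions, eight cores at half frequency and half voltage fit the power of one fast core while delivering roughly four times its throughput, provided the software supplies eight threads. Real parts cannot lower the voltage indefinitely, so the model overstates the gain at high core counts.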
[0023] Referring now to FIG. 1, a schematic diagram of a processor including cores configurable by voltage and frequency is shown, according to one embodiment. Core 1 120, core 2 130, core 3 140, and core 4 150 are shown, but in other embodiments there may be more or fewer than four cores in a processor. One or more of the cores may have a voltage control circuit and a clock frequency control circuit. FIG. 1 expressly shows core 1 120 possessing voltage control circuit 122 and frequency control circuit 124, but the other cores may have equivalent circuits as well, or the voltage control and frequency control logic may be separate logic not directly associated with a particular core.

[0024] A throttle module 110 may be used to gather information and make a determination about, or an estimate of, the amount of parallelism present in the executing software program. In one embodiment, the amount of parallelism may be the number of simultaneous threads supported. In other embodiments, other metrics may be used to express the amount of parallelism, such as the aggregate number of instructions retired per second, or the number of branch instructions that may support speculative multithreaded execution. Throttle module 110 may utilize information provided by the operating system to aid in the determination of the amount of parallelism. In other embodiments, throttle module 110 may make this determination using hardware logic within the processor and its cores. The determination may be made on a continuous basis or periodically.

[0025] Each time the throttle module 110 makes the determination of the amount of parallelism in the program, it may direct cores 120, 130, 140, 150 via signal lines 112, 114, 116, and 118 to change their voltage and clock frequency. In one embodiment, signal lines 112, 114, 116, and 118 may also be used to turn the cores on or off, or to remove power from a power well containing a core. In other embodiments, the cores may be turned off by clock gating or instruction starvation techniques. In one embodiment, if the current amount of thread level parallelism exceeds a previous amount by more than a threshold value, then the throttle module may initiate a transition to running a greater number of threads by decreasing the voltage and clock frequency in each core but running the threads on a greater number of cores. Cores that had previously been turned off may be turned on to support the larger number of threads. Similarly, if the current amount of thread level parallelism is less than a previous amount by more than a threshold value, then the throttle module may initiate a transition to running a fewer number of threads by increasing the voltage and clock frequency in some cores but running the threads on a fewer number of these cores. Some cores that had previously been turned on may be turned off as they may no longer be needed to support the smaller number of threads.
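The threshold comparison described above can be sketched as a small decision function. This is only an illustration: the names are hypothetical, and treating the "previous amount" as the parallelism recorded at the last transition (rather than at the last sample) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ThrottleState:
    # Amount of thread-level parallelism recorded at the last transition
    # (an assumed reading of "previous amount").
    parallelism: int


def throttle_step(state: ThrottleState, current: int, threshold: int) -> str:
    """Decide whether to reconfigure the cores for the current parallelism."""
    delta = current - state.parallelism
    if delta > threshold:
        # More parallelism: more cores, each at lower voltage and frequency.
        state.parallelism = current
        return "more_cores_lower_vf"
    if -delta > threshold:
        # Less parallelism: fewer cores, each at higher voltage and frequency.
        state.parallelism = current
        return "fewer_cores_higher_vf"
    return "hold"   # change is within the threshold, keep the configuration


state = ThrottleState(parallelism=4)
first = throttle_step(state, 5, threshold=2)    # small change
second = throttle_step(state, 8, threshold=2)   # large increase
third = throttle_step(state, 3, threshold=2)    # large decrease
```

The threshold acts as a dead band, so small fluctuations in the measured parallelism do not trigger repeated voltage and frequency changes.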
[0026] In one embodiment it may be possible to design a single-instruction-set-architecture (ISA) heterogeneous multi-core microprocessor in which different micro-architectures may be used to span a range of performance and power. In one embodiment, a chip-level multiprocessor may be built from two types of processor cores, which may be referred to as a large core and a small core. The two types of cores may implement the same instruction set architecture, use cache coherency to implement shared memory, and differ only in their micro-architecture. In other embodiments, the two types of core may implement similar instruction set architectures, or the small cores may implement a subset of the instruction set of the large cores. The large core may be an out-of-order, superscalar, deep pipeline machine, whereas the small core may be an in-order, scalar, short pipeline machine. The Intel Pentium 4 processor and Intel i486 processor are representative of the two classes of cores. In other embodiments, more than two classes or performance levels of cores running a substantially similar or identical instruction set architecture may be used.

[0027] In one embodiment, a chip-level multiprocessor includes one large core and 25 small cores, with the two types of cores having a 25:1 ratio in power consumption, a 5:1 ratio in scalar performance, and a 5:1 range of energy per instruction. The chip-level multiprocessor of this embodiment may operate as follows. In phases of low thread-level parallelism, the large core may be run for best scalar performance. In phases of high thread-level parallelism, multiple small cores may be run for best throughput performance.
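The three ratios quoted for this embodiment are mutually consistent, since energy per instruction is power divided by instruction rate. A short check, using normalized units in which one small core has power 1 and performance 1 (the normalization and names are assumptions for illustration):

```python
def energy_per_instruction(power_units: float, performance_units: float) -> float:
    """Energy per instruction in normalized units: power / instruction rate."""
    return power_units / performance_units


# Normalized to one small core = 1 unit of power and 1 unit of performance.
large_core_epi = energy_per_instruction(power_units=25.0, performance_units=5.0)
small_core_epi = energy_per_instruction(power_units=1.0, performance_units=1.0)

epi_ratio = large_core_epi / small_core_epi   # 25:1 power over 5:1 performance

# At equal power, 25 small cores deliver 25 units of throughput versus 5
# units for the single large core, given enough threads to occupy them.
throughput_gain = (25 * 1.0) / 5.0
```

The result is the 5:1 range of energy per instruction stated in the text.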
[0028] At any instant in time, the microprocessor may run either one large core or 25 small cores. Because the number of available software threads will vary over time, the asymmetric multiprocessor should be capable of migrating a thread between large and small cores. A thread-migration logic may be implemented to support this function.

[0029] In practice, it may be desirable to allow a few small cores to run simultaneously with the large core in order to reduce the throughput performance discontinuity at the point of switching off the large core. In the previous example, a discontinuity of 3 units of throughput may result from switching off the large core and switching on two small cores. To reduce the percentage of the total throughput lost, the discontinuity may be moved to occur with a higher number of running threads by permitting, for example, up to 5 small cores to run simultaneously with the large core if the power supply will support this for a short period of time.
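The discontinuity arithmetic can be made explicit with a small model. It takes the 5:1 scalar performance ratio from the earlier example (large core = 5 units, small core = 1 unit) and assumes one thread per core; the function names are hypothetical.

```python
LARGE_CORE_THROUGHPUT = 5   # from the 5:1 scalar performance ratio
SMALL_CORE_THROUGHPUT = 1


def throughput(threads: int, max_small_with_large: int) -> int:
    """Aggregate throughput units for a thread count, one thread per core.

    The large core runs alongside at most max_small_with_large small cores;
    beyond that point the large core is switched off and all threads run on
    small cores.
    """
    if threads == 0:
        return 0
    if threads <= 1 + max_small_with_large:
        return LARGE_CORE_THROUGHPUT + (threads - 1) * SMALL_CORE_THROUGHPUT
    return threads * SMALL_CORE_THROUGHPUT


def switch_discontinuity(max_small_with_large: int) -> tuple[int, float]:
    """Throughput drop (absolute, fractional) when the large core is switched off."""
    before = throughput(1 + max_small_with_large, max_small_with_large)
    after = throughput(2 + max_small_with_large, max_small_with_large)
    return before - after, (before - after) / before


drop_alone, fraction_alone = switch_discontinuity(0)         # large core runs alone
drop_with_five, fraction_with_five = switch_discontinuity(5)  # up to 5 small cores alongside
```

In both cases the drop is 3 units, but it falls from 60% of the throughput (5 units down to 2) to 30% (10 units down to 7) when up to five small cores may run with the large core.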
[0030] Using two types of cores representative of today's microprocessors, a 4:1 range of energy per instruction is achievable. As future microprocessors continue to deliver even higher levels of scalar performance, the range of possible energy per instruction may be expected to increase to perhaps 6:1, or well beyond this ratio.

[0031] Referring now to FIG. 2, a schematic diagram of a processor including cores selectable by processing power and power consumption is shown, according to one embodiment. The processor may include a few larger cores, the A cores, and may also include a larger number of small cores, the B cores. The A core 1 220, A core 2 222, and B cores 1 through 60 230-262 are shown, but in other embodiments there may be more or fewer than two A cores and sixty B cores in a processor.
[0032] A throttle module 210 may again be used to gather information and make a determination about the amount of parallelism present in the executing software program. In one embodiment, the amount of parallelism may be the number of simultaneous threads supported. In other embodiments, other metrics may be used to express the amount of parallelism as discussed previously. Throttle module 210 may utilize information provided by the operating system to aid in the determination of the amount of parallelism. In other embodiments, throttle module 210 may make this determination using hardware logic within the processor and its cores. The determination may be made on a continuous basis or periodically.
`
[0033] Because the number of available software threads may vary over time, the processor of FIG. 2 may include a thread-migration logic 212 capable of migrating a thread between large A cores and small B cores. It may be desirable to allow a few small B cores to run simultaneously with the large A core in order to reduce the throughput performance discontinuity at the point of switching off the large A core. To reduce the percentage of the total throughput lost, the discontinuity may be moved to occur with a higher number of running threads by permitting, for example, up to 5 small cores to run simultaneously with the large core.
`
[0034] Each time the throttle module 210 makes the determination of the amount of parallelism in the program, it may initiate powering the A cores and B cores up or down using signal lines 224 through 266. In one embodiment, if the current amount of parallelism exceeds a previous amount by more than a threshold value, then the throttle module 210 may initiate, using thread-migration logic 212, a transition to running a greater number of threads which may be run on a greater number of B cores. B cores that had previously been turned off may be turned on to support the larger number of threads, and any A cores which are turned on may be turned off. Similarly, if the current amount of parallelism is less than a previous amount by more than a threshold value, then the throttle module may initiate a transition to running a fewer number of threads by running the threads on a fewer number of the A cores. B cores that had previously been turned on may be turned off as they may no longer be needed to support the smaller number of threads, and A cores may be turned on to support the smaller number of threads. As mentioned above, it may be desirable to allow a few B cores to run simultaneously with the A cores in order to reduce the throughput performance discontinuity at the point of switching off the large core.
`
[0035] In one embodiment, the throttle module may be implemented in a manner that does not require a feedback loop. Here the throttle's control action (e.g. determining which type and how many cores on which to run the threads) does not return to affect the input value (e.g. the allocation and configuration of cores for the threads). In this embodiment, it may be presumed that each A core 220, 222 may consume the same amount of power as 25 of the B cores 230 through 262. In other embodiments, differing ratios of power consumption may be used. The processor may divide its total power budget into two portions. For each portion, the power budget may permit either one A core and up to five B cores to operate at the same time, or no A cores and up to thirty B cores to operate at the same time. In other embodiments, the power budget may be divided into portions in other ways.
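The two permitted configurations per portion follow from a single budget figure if power is counted in units of one B core. The sketch below infers a per-portion budget of 30 units (25 for an A core plus 5 B cores, or 30 B cores); that inferred constant and the function name are assumptions for illustration.

```python
A_CORE_POWER = 25      # one A core draws as much as 25 B cores (from the text)
B_CORE_POWER = 1
PORTION_BUDGET = 30    # inferred: 1 A + 5 B, or 0 A + 30 B, both cost 30 units


def portion_fits(a_cores: int, b_cores: int) -> bool:
    """True if a core mix stays within one portion of the power budget."""
    used = a_cores * A_CORE_POWER + b_cores * B_CORE_POWER
    return used <= PORTION_BUDGET


one_a_five_b = portion_fits(1, 5)     # allowed by the text
one_a_six_b = portion_fits(1, 6)      # one B core too many
thirty_b = portion_fits(0, 30)        # allowed by the text
two_a = portion_fits(2, 0)            # two A cores exceed a single portion
```

With two such portions, the processor as a whole can run two A cores with ten B cores, one A core with thirty-five B cores, or sixty B cores.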
`
`
[0036] In one embodiment, the number of running threads (RT) may be allocated to a quantity of A cores (QAC) and a quantity of B cores (QBC) according to Table I.

TABLE I

RT     QAC    QBC
0      0      0
1      1      0
2      2      0
3      2      1
4      2      2
...    ...    ...
10     2      8
11     2      9
12     2      10
13     1      12
14     1      13
15     1      14
...    ...    ...
28     1      27
29     1      28
30     1      29
31     1      30
32     1      31
33     1      32
34     1      33
35     1      34
36     1      35
37     0      37
38     0      38
39     0      39
40     0      40
...    ...    ...
60     0      60

[0037] When the number of running threads increases, and the new thread is started (in one embodiment via an inter-processor interrupt), the throttle module may determine the number of currently-running threads. Depending upon the number of currently-running threads, the new thread may be assigned to either an A core or a B core in accordance with Table I above. In this embodiment, for certain cases, such as when increasing from 12 threads to 13 threads, or from 36 threads to 37 threads, an existing thread running on an A core would be migrated to running on a B core. When this migration is complete, both the existing migrated thread and the new thread may be started. For this reason, in this embodiment the new thread may exhibit a delay in starting.

[0038] A similar process may occur when the number of running threads decreases. When a particular thread terminates, and its core is halted, various methods may be used to potentially migrate one of the remaining threads from running on a B core to running on an A core. This could occur, for example, when reducing the number of running threads from 13 threads to 12 threads, or from 37 threads to 36 threads. In one embodiment, a periodic timer may be used to permit the migration only once in a particular time interval. This may advantageously prevent too frequent thread migrations in cases where threads are rapidly created and terminated. The affected thread could remain running on the B core for up to the particular time interval.

[0039] In one embodiment, the throttle module may perform the migrations from A cores to B cores transparently to the software. The thread migration mechanism of the throttle module may include a table for mapping logical cores to physical cores, an interrupt to signal a core migration may be needed, microcode or hardwired logic to copy the core's processor state, and an interconnect network between the processor's cores. The number of logical cores may be equal to the number of B cores.
`
[0040] In another embodiment, the throttle module may perform the migrations from A cores to B cores in a manner not transparent to the software. The thread migration may be performed by the operating system scheduler. The operating system may track the number of cores with currently running threads, assign new threads to cores, and migrate threads from A cores to B cores (or B cores to A cores). The software thread migration may use functions equivalent to those described above with regard to the hardware implementation. In one embodiment, the throttle module operation may be transparent to application programs although not to the operating system.
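A scheduler-side version of the allocation might look like the sketch below. It assumes the pattern implied by the two power-budget portions and by the migration points at 12 to 13 and 36 to 37 running threads: two A cores up to 12 threads, one A core up to 36 threads, and B cores only beyond that. The function names are hypothetical and the limit of 60 threads reflects the sixty B cores of FIG. 2.

```python
def allocate(running_threads: int) -> tuple[int, int]:
    """Return (QAC, QBC): A cores and B cores used for a given thread count."""
    if not 0 <= running_threads <= 60:
        raise ValueError("this sketch covers 0 to 60 running threads")
    if running_threads <= 12:
        qac = min(running_threads, 2)   # both portions may still hold an A core
    elif running_threads <= 36:
        qac = 1                         # one portion has switched to B cores only
    else:
        qac = 0                         # both portions run B cores only
    return qac, running_threads - qac


def needs_a_to_b_migration(old_threads: int, new_threads: int) -> bool:
    """True if growing the thread count forces a thread off an A core."""
    return allocate(new_threads)[0] < allocate(old_threads)[0]


def allows_b_to_a_migration(old_threads: int, new_threads: int) -> bool:
    """True if shrinking the thread count frees budget for another A core."""
    return allocate(new_threads)[0] > allocate(old_threads)[0]


at_twelve = allocate(12)
at_thirteen = allocate(13)
migrate_on_growth = needs_a_to_b_migration(12, 13)
migrate_on_shrink = allows_b_to_a_migration(37, 36)
```

An operating system scheduler could call `allocate` whenever a thread is created or terminates and migrate a thread only when one of the two predicates is true, optionally rate-limited by a timer as the text suggests for frequent thread creation and termination.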
`
[0041] One alternate way to modulate power consumption may be to adjust the size or functionality of logic blocks. For example, variable-sized schedulers, caches, translation look-aside buffers (TLBs), branch predictors, and other optional performance circuits may be used to reduce switching capacitance (and hence energy) when large array sizes are not needed. In addition to dynamically resizing arrays, it is also possible to design a large core that degrades its performance into that of a smaller core by dynamically disabling execution units, pipeline stages, and other optional performance circuits. These techniques may be collectively known as adaptive processing.

[0042] One embodiment of a chip-level multiprocessor may operate as follows. In phases of low thread parallelism, a few cores could be run using a first set (for example, all or many) of the available optional performance circuits on each core for good scalar performance. In phases of high thread parallelism, many cores could be operated using fewer optional performance circuits on each core for good throughput performance.

[0043] The net result of reducing array sizes and disabling execution units may be to reduce the capacitance toggled per instruction. However, switching capacitance might not be reduced by as much as designing a smaller core to begin with. While unused execution hardware may be gated off, the physical size of the core does not change, and thus the wire lengths associated with the still active hardware blocks may remain longer than in a small core.

[0044] An estimate of the possible reduction in energy per instruction may be made by examining the floorplan of a large out-of-order microprocessor and determining how many optional performance circuits may be turned off to convert the processor into a small in-order machine (keeping in mind that the blocks cannot be physically moved). The percentage of processor core area turned off may then be
