`US008615647B2
`
(12) United States Patent
Hum et al.

(10) Patent No.: US 8,615,647 B2
(45) Date of Patent: Dec. 24, 2013

(54) MIGRATING EXECUTION OF THREAD BETWEEN CORES OF DIFFERENT
INSTRUCTION SET ARCHITECTURE IN MULTI-CORE PROCESSOR AND
TRANSITIONING EACH CORE TO RESPECTIVE ON/OFF POWER STATE

(75) Inventors: Herbert Hum, Portland, OR (US); Eric Sprangle,
Austin, TX (US); Doug Carmean, Beaverton, OR (US); Rajesh
Kumar, Portland, OR (US)

(73) Assignee: Intel Corporation, Santa Clara, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is
extended or adjusted under 35 U.S.C. 154(b) by 538 days.

(21) Appl. No.: 12/220,092

(22) Filed: Jul. 22, 2008
`
`
(65) Prior Publication Data

US 2009/0222654 A1    Sep. 3, 2009
`
`
Related U.S. Application Data

(60) Provisional application No. 61/067,737, filed on Feb. 29, 2008.

(51) Int. Cl.
G06F 1/32 (2006.01)
(52) U.S. Cl.
USPC .............................. 713/100; 712/32; 713/323
(58) Field of Classification Search
None
See application file for complete search history.
`
`
`Primary Examiner - Kenneth Kim
`(74) Attorney, Agent, or Firm - Mnemoglyphics, LLC;
`Lawrence M. Mennemeier
`
(57) ABSTRACT

Techniques to control power and processing among a plurality of
asymmetric cores. In one embodiment, one or more asymmetric cores
are power managed to migrate processes or threads among a
plurality of cores according to the performance and power needs
of the system.
`
`20 Claims, 6 Drawing Sheets
`
`
`Petitioner Samsung Ex-1033, 0001
`
`
`
`
`
`
U.S. Patent    Dec. 24, 2013    Sheet 1 of 6    US 8,615,647 B2

[FIG. 1 and FIG. 2: drawings omitted]
`
`
`
[Sheet 2 of 6, FIG. 3: block diagram; labeled blocks: PROCESSOR
(two), PROC. CORE, MCH, MEMORY, CHIPSET, HIGH-PERF GRAPHICS,
BUS BRIDGE, I/O DEVICES, AUDIO I/O, KEYBOARD/MOUSE, COMM
DEVICES, DATA STORAGE, CODE]
`
`
`
[Sheet 3 of 6, FIG. 4: plot titled "Power vs. Performance";
x-axis: Relative Performance (0 to 1); y-axis: ticks 0 to 35
(axis label illegible); point 401 marked]
`
`
`
[Sheet 4 of 6, FIG. 5: flow diagram]
501: IT IS DETERMINED THAT A PROCESS/THREAD/TASK RUNNING ON A
MAIN PROCESSOR CORE OF A MULTI-CORE PROCESSOR MAY BE RUN ON A
LOWER POWER/PERFORMANCE CORE WHILE MAINTAINING AN ACCEPTABLE
PERFORMANCE LEVEL
505: AN EVENT OCCURS IN THE MAIN CORE TO CAUSE STATE FROM THE
CORE TO BE SAVED AND COPIED TO A LOWER POWER/PERFORMANCE CORE
510: THE TRANSFERRED THREAD/PROCESS/TASK IS RESTARTED ON THE
LOWER POWER/PERFORMANCE CORE
515: THE MAIN CORE MAY BE PLACED IN A LOWER POWER STATE
520 (decision): TRANSFERRED TASK ACTIVITY LEVEL EXCEEDS A
THRESHOLD OR NEW PROCESS STARTED ON MAIN CORE?
525: TRANSFER TASK BACK TO MAIN CORE
`
`
`
[Sheet 5 of 6, FIG. 6: drawing omitted; surviving labels: SC-1,
SC-2, SC-N, PC-1, PC-2, PC-N, 610]
`
`
`
[Sheet 6 of 6, FIG. 7 and FIG. 8: drawings omitted; surviving
labels: 701, 710, 715, 720 (FIG. 7); Main Core, ULPC (FIG. 8)]
`
`
`
MIGRATING EXECUTION OF THREAD
BETWEEN CORES OF DIFFERENT
INSTRUCTION SET ARCHITECTURE IN
MULTI-CORE PROCESSOR AND
TRANSITIONING EACH CORE TO
RESPECTIVE ON/OFF POWER STATE
`
`FIELD OF THE INVENTION
`
Embodiments of the invention relate generally to the field of
information processing and more specifically, to the field of
distributing program tasks among various processing elements.
`
`BACKGROUND
`
As more processing throughput is required from modern
microprocessors, it is often at the expense of power consumption.
Some applications, such as mobile internet devices (MIDs),
ultra-mobile personal computers (UMPCs), cellular phones,
personal digital assistants (PDAs), and even laptop/notebook
computers, may benefit from processors that consume relatively
little power. However, achieving relatively high processing
throughput at relatively low power is a challenge, involving
various design trade-offs, depending on the usage models of the
computing platform.
One approach to reducing power in a computing platform when
there is relatively little activity is to place the processor in
a low-power state. However, placing a processor in a low-power
state or returning a processor from a low-power state may
require a non-trivial amount of time. Therefore, it may or may
not be worth the time required to place a processor in a
low-power state or to return the processor from a low-power
state. Furthermore, not all processes and tasks that are run on
a processor require the full processing throughput of the
processor.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
Embodiments of the invention are illustrated by way of example,
and not by way of limitation, in the figures of the accompanying
drawings and in which like reference numerals refer to similar
elements and in which:
FIG. 1 illustrates a block diagram of a microprocessor, in which
at least one embodiment of the invention may be used;
FIG. 2 illustrates a block diagram of a shared bus computer
system, in which at least one embodiment of the invention may be
used;
FIG. 3 illustrates a block diagram of a point-to-point
interconnect computer system, in which at least one embodiment
of the invention may be used;
FIG. 4 is a curve showing the relationship between power and
performance using at least one embodiment of the invention;
FIG. 5 is a flow diagram of operations that may be used for
performing at least one embodiment of the invention;
FIG. 6 illustrates a number of processing units and an activity
level, thermal, or power detection/monitoring unit that may be
used in at least one embodiment;
FIG. 7 illustrates a power management logic according to one
embodiment;
FIG. 8 illustrates a technique to transition between at least
two asymmetric processing cores, according to one embodiment.
`
`DETAILED DESCRIPTION
`
Embodiments of the invention include a microprocessor or
processing system having a number of asymmetric processing
elements. In some embodiments, each processing element is a
processor core, having one or more execution resources, such as
arithmetic logic units (ALUs), instruction decoder, and
instruction retirement unit, among other things. In some
embodiments, the number of asymmetric processing elements has at
least two different processing throughput or performance
capabilities, power consumption characteristics or limits,
voltage supply requirements, clock frequency characteristics,
number of transistors, and/or instruction set architectures
(ISAs). In one embodiment, an asymmetric microprocessor includes
at least one main processor core having larger power consumption
characteristics and/or processing throughput/performance
characteristics than at least one other processing core within
or otherwise associated with the microprocessor.
In one embodiment, a process or task running or intended to run
on a main higher power/performance processing core may be
transferred to one of the other lower power/performance
processing cores for various reasons, including that the process
or task does not require the processing throughput of one of the
main cores, the processor or the system in which it is used is
placed into or otherwise requires a lower-power consumption
condition (such as when running on battery power), and for
increasing the processing throughput of the asymmetric
microprocessor or system in which the higher power/performance
cores and lower power/performance cores are used. For example,
in one embodiment, the asymmetric processing elements may be
used concurrently or otherwise in parallel to perform multiple
tasks or processes, thereby improving the overall throughput of
the processor and processing system.
In one embodiment, the at least one main processing core has a
different ISA than at least one of the at least one processor
cores having a lower power consumption characteristic and/or
processing performance capability. In one embodiment,
instruction translation logic in the form of hardware, software,
or some combination thereof, may be used to translate
instructions for the at least one main processor core into
instructions for the at least one other lower-power/performance
processing core. For example, in one embodiment, one or more of
the main higher power/performance cores may have a complex
instruction set computing (CISC) architecture, such as the "x86"
computing architecture, and therefore performs instructions that
are intended for x86 processor cores. One or more of the lower
power/performance cores may have a different ISA than the main
core, including a reduced instruction set computing (RISC)
architecture, such as an Advanced RISC Machine (ARM) core. In
other embodiments, the main processing element(s) and the lower
power/performance processing element(s) may include other
architectures, such as the MIPS ISA. In other embodiments the
main processing element(s) may have the same ISA as the lower
power/performance element(s) (e.g., x86).
In one embodiment, a number of different threads, processes, or
tasks associated with one or more software programs may be
intelligently moved among and run on a number of different
processing elements, having a number of different processing
capabilities (e.g., operating voltage, performance, power
consumption, clock frequency, pipeline depth, transistor
leakage, ISA), according to the dynamic performance and power
consumption needs of the processor or computer system. For
example, if one process, such as that associated with a
spreadsheet application, does not require the full processing
capabilities of a main, higher performance processor core, but
may instead be run with acceptable performance on a lower-power
core, the process may be transferred to or otherwise run on the
lower power core and the main, higher power processor core may
be placed in a low power state or may just remain idle. By
running threads/processes/tasks on a processor core that better
matches the performance needs of the thread/process/task, power
consumption may be optimized, according to some embodiments.
FIG. 1 illustrates a microprocessor in which at least one
embodiment of the invention may be used. In particular, FIG. 1
illustrates microprocessor 100 having one or more main processor
cores 105 and 110, each being able to operate at a higher
performance level (e.g., instruction throughput) or otherwise
consume more power than one or more low-power cores 115, 120. In
one embodiment, the low-power cores may be operated at the same
or different operating voltage as the main cores. Furthermore,
in some embodiments, the low-power cores may operate at a
different clock speed or have fewer execution resources, such
that they operate at a lower performance level than the main
cores.
In other embodiments, the low-power cores may be of a different
ISA than the main cores. For example, the low-power cores may
have an ARM ISA and the main cores may have an x86 ISA, such
that a program using x86 instructions may need to have these
instructions translated into ARM instructions if a
process/task/thread is transferred to one of the ARM cores.
Because the process/thread/task being transferred may be one
that does not require the performance of one of the main cores,
a certain amount of latency associated with the instruction
translation may be tolerated without noticeable or significant
loss of performance.
Also illustrated in FIG. 1 is at least one other non-CPU
functional unit 117, 118, and 119 which may perform other
non-CPU related operations. In one embodiment, the functional
units 117, 118, and 119 may include functions such as graphics
processing, memory control and I/O or peripheral control, such
as audio, video, disk control, digital signal processing, etc.
The multi-core processor of FIG. 1 also illustrates a cache 123
that each core can access for data or instructions corresponding
to any of the cores.
In one embodiment, logic 129 may be used to monitor performance
or power of any of the cores illustrated in FIG. 1 in order to
determine whether a process/task/thread should be migrated from
one core to another to optimize power and performance. In one
embodiment, logic 129 is associated with the main cores 105 and
110 to monitor an activity level of the cores to determine
whether the processes/threads/tasks running on those cores could
be run on a lower-power core 115, 120 at an acceptable
performance level, thereby reducing the overall power
consumption of the processor. In other embodiments, logic 129
may respond to a power state of the system, such as when the
system goes from being plugged into an A/C outlet to battery
power. In this case, the OS or some other power state monitoring
logic may inform logic 129 of the new power conditions and the
logic 129 may cause a currently running process (or processes
yet to be scheduled to run) to either be transferred (or
scheduled) to a lower-power core (in the case of going from A/C
to battery, for example) or from a lower-power core to a main
core (in the case of going from battery to A/C, for example). In
some embodiments, an operating system (OS) may be responsible
for monitoring or otherwise controlling the power states of the
processor and/or system, such that the logic 129 simply reacts
to the OS's commands to reduce power by migrating
tasks/threads/processes to a core that better matches the
performance needs of the tasks/threads/processes while
accomplishing the power requirements dictated or indicated by
the OS.
In some embodiments, the logic 129 may be hardware logic or
software, which may or may not determine a core(s) on which a
process/task/thread should be run independently of the OS. In
one embodiment, for example, logic 129 is implemented in
software to monitor the activity level of the cores, such as the
main cores, to see if it drops below a threshold level, and in
response thereto, causes one or more processes running on the
monitored core(s) to be transferred to a lower-power core, such
as cores 115 and 120. Conversely, logic 129 may monitor the
activity level of a process running on a lower-power core 115
and 120 in order to determine whether it is rising above a
threshold level, thereby indicating the process should be
transferred to one of the main cores 105, 110. In other
embodiments, logic 129 may independently monitor other
performance or power indicators within the processor or system
and cause processes/threads/tasks to be migrated to cores that
more closely fit the performance needs of the
tasks/processes/threads while meeting the power requirements of
the processor of the system at a given time. In this way, the
power and performance of processor 100 can be controlled without
the programmer or OS being concerned or even aware of the
underlying power state of the processor.
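
The threshold behavior described for logic 129 can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `MigrationPolicy`, the two threshold values, and the normalization of activity to the range 0 to 1 are assumptions made for the example and do not appear in the patent.

```python
from dataclasses import dataclass


@dataclass
class MigrationPolicy:
    """Hypothetical threshold policy; the numeric values are illustrative only."""
    low_threshold: float = 0.30   # activity below this on a main core: move down
    high_threshold: float = 0.70  # activity above this on a low-power core: move up

    def decide(self, current_core: str, activity: float) -> str:
        """Return the core type ('main' or 'low_power') the task should use next."""
        if current_core == "main" and activity < self.low_threshold:
            return "low_power"
        if current_core == "low_power" and activity > self.high_threshold:
            return "main"
        return current_core


def simulate(policy: MigrationPolicy, start_core: str, activity_samples):
    """Apply the policy to a sequence of activity samples; return each placement."""
    core = start_core
    placements = []
    for activity in activity_samples:
        core = policy.decide(core, activity)
        placements.append(core)
    return placements
```

Using two separate thresholds (rather than one) is a design choice of this sketch; it keeps a task whose activity hovers near a single cut-off from bouncing between cores on every sample.
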
In other embodiments, each core in FIG. 1 may be concurrently
running different tasks/threads/processes to get the most
performance benefit possible from the processor. For example, in
one embodiment, a process/thread/task that requires high
performance may be run on a main core 105, 110 concurrently with
a process/thread/task, run on lower-power cores 115, 120, that
doesn't require as high performance as the main cores are able
to deliver. In one embodiment, the programmer determines where
to schedule these tasks/threads/processes, whereas in other
embodiments, these threads/tasks/processes may be scheduled by
an intelligent thread scheduler (not shown) that is aware of the
performance capabilities of each core and can schedule the
threads to the appropriate core accordingly. In other
embodiments, the threads are simply scheduled without regard to
the performance capabilities of the underlying cores and the
threads/processes/tasks are migrated to a more appropriate core
after the activity levels of the cores in response to the
threads/processes/tasks are determined. In this manner, neither
an OS nor a programmer need be concerned about where the
threads/processes/tasks are scheduled, because the
threads/processes/tasks are scheduled on the appropriate core(s)
that best suits the performance requirement of each thread while
maintaining the power requirements of the system or processor.
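
A capability-aware scheduler of the kind mentioned above might, as one possibility, place each thread on the lowest-power core that can still meet its performance demand. The Python sketch below assumes hypothetical relative `capacity` and `power` numbers and a per-thread `demand` estimate; none of these values or names come from the patent, and contention between threads sharing a core is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    name: str
    capacity: float  # relative performance (hypothetical units)
    power: float     # relative power cost (hypothetical units)


def pick_core(demand: float, cores):
    """Pick the lowest-power core whose capacity covers the demand.

    If no core is fast enough, fall back to the highest-capacity core.
    """
    capable = [c for c in cores if c.capacity >= demand]
    if capable:
        return min(capable, key=lambda c: c.power)
    return max(cores, key=lambda c: c.capacity)


def schedule(demands: dict, cores) -> dict:
    """Map each thread name to the name of the core chosen for it."""
    return {thread: pick_core(d, cores).name for thread, d in demands.items()}
```

For instance, with a main core of capacity 1.0 and a low-power core of capacity 0.3, a thread with demand 0.1 lands on the low-power core while a thread with demand 0.9 lands on the main core.
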
In one embodiment, logic 129 may be hardware, software, or some
combination thereof. Furthermore, logic 129 may be distributed
within one or more cores or exist outside the cores while
maintaining electronic connection to the one or more cores to
monitor activity/power and cause threads/tasks/processes to be
transferred to appropriate cores.
FIG. 2, for example, illustrates a front-side-bus (FSB) computer
system in which one embodiment of the invention may be used. Any
processor 201, 205, 210, or 215 may include asymmetric cores
(differing in performance, power, operating voltage, clock
speed, or ISA), which may access information from any local
level one (L1) cache memory 220, 225, 230, 235, 240, 245, 250,
255 within or otherwise associated with one of the processor
cores 223, 227, 233, 237, 243, 247, 253, 257. Furthermore, any
processor 201, 205, 210, or 215 may access information from any
one of the shared level two (L2) caches 203, 207, 213, 217 or
from system memory 260 via chipset 265. One or more of the
processors in FIG. 2 may include or otherwise be associated with
logic 219 to monitor and/or control the scheduling or migration
of processes/threads/tasks between each of the asymmetric cores
of each processor. In one embodiment, logic 219 may be used to
schedule or migrate threads/tasks/processes to or from one
asymmetric core in one processor to another asymmetric core in
another processor.
In addition to the FSB computer system illustrated in FIG. 2,
other system configurations may be used in conjunction with
various embodiments of the invention, including point-to-point
(P2P) interconnect systems and ring interconnect systems. The
P2P system of FIG. 3, for example, may include several
processors, of which only two, processors 370, 380 are shown by
example. Processors 370, 380 may each include a local memory
controller hub (MCH) 372, 382 to connect with memory 32, 34.
Processors 370, 380 may exchange data via a P2P interface 350
using P2P interface circuits 378, 388. Processors 370, 380 may
each exchange data with a chipset 390 via individual P2P
interfaces 352, 354 using point to point interface circuits 376,
394, 386, 398. Chipset 390 may also exchange data with a
high-performance graphics circuit 338 via a high-performance
graphics interface 339. Embodiments of the invention may be
located within any processor having any number of processing
cores, or within each of the P2P bus agents of FIG. 3. In one
embodiment, any processor core may include or otherwise be
associated with a local cache memory (not shown). Furthermore, a
shared cache (not shown) may be included in either processor or
outside of both processors, yet connected with the processors
via P2P interconnect, such that either or both processors' local
cache information may be stored in the shared cache if a
processor is placed into a low power mode. One or more of the
processors or cores in FIG. 3 may include or otherwise be
associated with logic to monitor and/or control the scheduling
or migration of processes/threads/tasks between each of the
asymmetric cores of each processor.
FIG. 4 is a graph illustrating the performance and power
characteristics associated with a processor when scaling voltage
and frequency including techniques according to at least one
embodiment of the invention. Reducing voltage is an efficient
way of reducing power since the frequency scales linearly with
the voltage, while the power scales as the voltage cubed
(power = CV²F). Unfortunately, this efficient voltage scaling
approach only works within a range of voltages; at some point,
"Vmin", the transistor switching frequency does not scale
linearly with voltage. At this point (401), to further reduce
power, the frequency is reduced without dropping the voltage. In
this range, the power scales linearly with the frequency which
is not nearly as attractive as when in the range where voltage
scaling is possible. In one embodiment, power consumption of the
system may be reduced below the minimum point 401 of a typical
multi-core processor having symmetric processing elements by
scheduling or migrating processes/threads/tasks from
higher-performance/power cores to lower-performance/power cores
if appropriate. In FIG. 4, the power/performance curve segment
405 indicates where the overall non-linear power/performance
curve could be extended to enable more power savings, in one
embodiment.
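
The two scaling regimes described for FIG. 4 can be made concrete with a small numerical model. The Python sketch below is a simplification under stated assumptions: performance is taken to be proportional to frequency, leakage power is ignored, and the location of the Vmin point (`perf_at_vmin = 0.4`) is a hypothetical value chosen for illustration, not one read from the figure.

```python
def relative_power(perf: float, perf_at_vmin: float = 0.4) -> float:
    """Relative dynamic power (1.0 at full performance) at relative performance `perf`.

    Above the Vmin point, voltage and frequency scale together, so
    power = C * V^2 * F varies as perf**3.
    Below it, voltage stays at Vmin and only frequency drops, so power
    falls linearly with perf.
    """
    if not 0.0 <= perf <= 1.0:
        raise ValueError("perf must be between 0 and 1")
    if perf >= perf_at_vmin:
        return perf ** 3
    power_at_vmin = perf_at_vmin ** 3
    return power_at_vmin * (perf / perf_at_vmin)
```

In this model, halving performance from 1.0 to 0.5 cuts power by a factor of eight, whereas halving it from 0.4 to 0.2 (below the assumed Vmin point) only halves the power, which is the less attractive linear range the text refers to.
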
FIG. 5 illustrates a flow diagram of operations that may be used
in conjunction with at least one embodiment of the invention. At
operation 501, it is determined that a process/thread/task
running on a main processor core of a multi-core processor may
be run on a lower power/performance core while maintaining an
acceptable performance level. In one embodiment, the
determination could be made by monitoring the activity level of
the main core in response to running the thread/process/task and
comparing it to a threshold value, corresponding to an
acceptable performance metric of the lower power/performance
core. In other embodiments, the determination could be made
based on system power requirements, such as when the system is
running on A/C power versus battery power. In yet other
embodiments, a thread/process/task may be designated to require
only a certain amount of processor performance, for example, by
a programmer, the OS, etc. In other embodiments, other
techniques may be used for determining whether a
task/thread/process could be transferred to a lower
power/performance core, thereby reducing power consumption.
At operation 505, an event (e.g., yield, exception, etc.) occurs
in the main core to cause state from the core to be saved and
copied to a lower power/performance core. In one embodiment, a
handler program is invoked in response to the event to cause the
main core state to be transferred from the main core to a lower
power/performance core. At operation 510, the transferred
thread/process/task is restarted or resumed on the lower
power/performance core. At operation 515, the main core may be
placed in a lower power state (e.g., paused, halted, etc.)
until, at operation 520, either the transferred
process/task/thread requires above a threshold level of
performance, in which case the thread/process/task may be
transferred back to the main core (operation 525) in a similar
manner as it was transferred to the lower power/performance
core, or another task/process/thread is scheduled for execution
on the main core.
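
Operations 505 through 525 can be walked through with a toy model. The Python sketch below is a minimal illustration only: the `Core` class, the representation of core state as a dictionary, the `threshold` value, and the choice to wake the destination core inside `migrate` are all assumptions of the example, and the "new process scheduled on the main core" branch of operation 520 is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    name: str
    power_state: str = "active"                # "active" or "low_power"
    state: dict = field(default_factory=dict)  # context of the task on this core


def migrate(src: Core, dst: Core) -> Core:
    """Save the task state from src, copy it to dst, and resume there."""
    saved = dict(src.state)      # 505: state is saved from the source core ...
    dst.power_state = "active"   # wake the destination if needed (assumption)
    dst.state = saved            # ... and copied to the destination core
    src.state = {}
    return dst                   # 510: the task resumes on the destination


def run_flow(main: Core, small: Core, activity_after_move: float,
             threshold: float = 0.7) -> str:
    """Return the name of the core the task ends up running on."""
    running_on = migrate(main, small)       # 505, 510
    main.power_state = "low_power"          # 515
    if activity_after_move > threshold:     # 520
        running_on = migrate(small, main)   # 525 (also wakes the main core)
    return running_on.name
```

A call such as `run_flow(Core("main", state={"pc": 4096}), Core("low_power", power_state="low_power"), 0.2)` leaves the task on the low-power core with the main core in its low power state.
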
In one embodiment, the thread/process/task transferred from the
main core to the lower power/performance core is first
translated from the ISA of the main core to the ISA of the lower
power/performance core, if the two have different architectures.
For example, in one embodiment, the main core is an x86
architecture core and the lower power/performance core is an ARM
architecture core, in which case instructions of the transferred
thread/process/task may be translated (for example, by a
software binary translation shell) from x86 instructions to ARM
instructions. Because the thread/process/task being transferred
is by definition one that does not require as much performance
as to require it to be run on the main core, a certain amount of
latency may be tolerated in translating the process/task/thread
from the x86 architecture to ARM architecture.
FIG. 6 illustrates a processing apparatus having a number of
individual processing units between which processes/threads/tasks
may be swapped under control of an activity level monitor, or
thermal or power monitor, according to one embodiment. In the
embodiment of FIG. 6, N processing units, processing unit 600-1,
600-2 through 600-N are coupled to a monitor or detection
(generically referred to as "monitor") logic 610. In one
embodiment, the monitor 610 includes an activity, thermal and/or
power monitoring unit that monitors the activity/performance,
power consumption, and/or temperature of the processing units
600-1 through 600-N. In one embodiment, performance counters may
be used to monitor the activity level of processing units 600-1
through 600-N. In one embodiment, the monitor 610 orchestrates
process shifting between processing units in order to manage
power consumption and/or particularly thermal concerns, while
maintaining an acceptable level of performance.
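
One way the process shifting orchestrated by monitor 610 could be pictured is as a choice of a source and a destination unit from per-unit readings. The Python sketch below is a hypothetical illustration: the function name `pick_shift`, the use of a single numeric reading per unit (for example a temperature), and the `limit` parameter are assumptions, not details given in the patent.

```python
def pick_shift(readings: dict, limit: float):
    """Choose a (source, destination) pair of processing units for a shift.

    `readings` maps a unit name to its latest reading (e.g. temperature).
    If the highest reading exceeds `limit`, suggest moving work from that
    unit to the unit with the lowest reading; otherwise return None.
    """
    if not readings:
        return None
    hottest = max(readings, key=readings.get)
    coolest = min(readings, key=readings.get)
    if readings[hottest] <= limit or hottest == coolest:
        return None
    return hottest, coolest
```

For example, with readings of 92.0, 55.0 and 61.0 for units 600-1, 600-2 and 600-N and a limit of 85.0, the sketch suggests shifting work from 600-1 to 600-2.
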
In one embodiment, each processing unit provides a monitor value
that typically reflects activity level, power consumption and/or
temperature information to the monitor 610 via signals such as
processor communication (PC) lines PC-1 through PC-N. The
monitor value may take a variety of forms and may be a variety
of different types of information. For example, the monitor
value may simply be an analog or digital reading of the
temperature of each processing unit. Alternatively, the monitor
value may be a simple or complex activity factor that reflects
the operational activity level of a particular processing unit.
In some embodiments, power consumption information reflected by
the monitor value may include a measured current level or other
indication of how much power is being consumed by the processing
unit. Additionally, some embodiments may convey power
consumption information to the monitor 610 that is a composite
of several of these or other types of known or otherwise
available means of measuring or estimating power consumption.
Accordingly, some power consumption metric which reflects one or
more of these or other power consumption indicators may be
derived. The transmitted monitor value may reflect a temperature
or a power consumption metric, which itself may factor in a
temperature. Serial, parallel, and/or various known or
otherwise available protocols