`
`IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 11, NOVEMBER 2008
`
`Energy Efficient Scheduling of Real-Time
`Tasks on Multicore Processors
`
`Euiseong Seo, Jinkyu Jeong, Seonyeong Park, and Joonwon Lee
`
`Abstract Multicore processors deliver a higher throughput at lower power consumption than unicore processors. In the near future,
`they will thus be widely used in mobile real-time systems. There have been many research on energy-efficient scheduling of
`real-time tasks using DVS. These approaches must be modified for multicore processors, however, since normally all the cores in a
`chip must run at the same performance level. Thus, blindly adopting existing DVS algorithms that do not consider the restriction will
`result in a waste of energy. This article suggests Dynamic Repartitioning algorithm based on existing partitioning approaches of
`multiprocessor systems. The algorithm dynamically balances the task loads of multiple cores to optimize power consumption during
`execution. We also suggest Dynamic Core Scaling algorithm, which adjusts the number of active cores to reduce leakage power
`consumption under low load conditions. Simulation results show that Dynamic Repartitioning can produce energy savings of about
`8 percent even with the best energy-efficient partitioning algorithm. The results also show that Dynamic Core Scaling can reduce
`energy consumption by about 26 percent under low load conditions.
`
`Index Terms Real-time systems, real-time scheduling, low-power design, power-aware systems, multicore processors,
`multiprocessor systems.
`
`Ç
`
`1 INTRODUCTION
`
`MOBILE real-time systems have seen rapidly increasing
`
`use in sensor networks, satellites, and unmanned
`vehicles, as well as personal mobile equipment. Thus, the
`energy efficiency of them is becoming an important issue.
`The processor is one of the most
`important power
`consumers in any computing system. Considering that
`state-of-the-art real-time systems are evolving in complexity
`and scale, the demand for high-performance processors will
`continue to increase. A processor’s performance, however,
`is directly related to its power consumption. As a result, the
`processor power consumption is becoming more important
`issue as their required performance standards increase.
`Over the last decade, manufacturers competed to
`advance the performance of processors by raising the clock
`frequency. However,
`the dynamic power consumption
`Pdynamic of a CMOS-based processor, the power required
`during execution of instructions,
`is related to its clock
`
`frequency f and operating voltage Vdd as Pdynamic / V 2dd f.
`And, the relation Vdd / f also holds in these processors. As
`a result, the dramatically increased power consumption
`caused by high clock frequency has stopped the race, and
`they are now concentrating on other ways to improve
`performance at relatively low clock frequencies.
`
`. E. Seo is with the Department of Computer Science and Engineering,
`Pennsylvania State University, University Park, PA 16803.
`E mail: euiseong@gmail.com.
`. J. Jeong, S. Park, and J. Lee are with the Computer Science Division, Korea
`Advanced Institute of Science and Technology, 373 1 Guseongdong,
`Yuseonggu, Daejeon 305 701, Korea.
`E mail: {jinkyu, parksy}@calab.kaist.ac.kr, joon@kaist.ac.kr.
`
`Manuscript received 29 Oct. 2007; accepted 13 June 2008; published online
`17 June 2008.
`Recommended for acceptance by I. Ahmad, K.W. Cameron, and R. Melhem.
`For information on obtaining reprints of this article, please send e mail to:
`tpds@computer.org, and reference IEEECS Log Number TPDS 2007 10 0397.
`Digital Object Identifier no. 10.1109/TPDS.2008.104.
`
`One of the representative results from this effort is
`multicore architecture [1], which integrates several proces-
`sing units (known as cores) into a single chip. Multicore
`processors, which are quickly becoming mainstream, can
`achieve higher throughput with the same clock frequency.
`Thus, power consumption in them is a linear function of the
`throughput. As the demand for concurrent processing and
`increased energy efficiency grows,
`it
`is expected that
`multicore processors will become widely used in real-time
`systems.
`The problem of scheduling real-time tasks on a multicore
`processor is the same as that of scheduling on a multi-
`processor system. This is an NP-hard problem [2], and
`existing heuristic solutions can be divided into two
`categories. Partitioned scheduling algorithms [3], [4], [5]
`require every execution of a particular task to take place in
`the same processor, while global scheduling algorithms [6],
`[7], [8] permit a given task to be executed upon different
`processors [6]. Partitioned algorithms are based on a divide-
`and-conquer strategy. After all tasks have been assigned to
`their respective cores,
`the tasks in each core can be
`scheduled using well-known algorithms such as Earliest
`Deadline First (EDF) [9] or Rate Monotonic (RM) [10]. Due to
`their simplicity and efficiency, partitioned scheduling
`algorithms are generally preferred over global scheduling
`algorithms.
`In addition to the innovation of multicore architecture,
`many up-to-date processors also use dynamic voltage
`scaling (DVS). DVS adjusts the clock frequency and
`operating voltage on the fly to meet changes in the
`performance demand.
`Multicore processors can also benefit greatly from DVS
`technology. Because all the cores in a chip are in the same
`clock domain, however, they must all operate at the same
`clock frequency and operating voltage [11], [12]. It seems
`
`Authorized licensed use limited to: UNIST. Downloaded on May 26, 2009 at 21:48 from IEEE Xplore. Restrictions apply.
`
`1045-9219/08/$25.00 ß 2008 IEEE
`
`Published by the IEEE Computer Society
`
`Petitioner Samsung Ex-1031, 0001
`
`
`
`SEO ET AL.: ENERGY EFFICIENT SCHEDULING OF REAL TIME TASKS ON MULTICORE PROCESSORS
`
`1541
`
`that this limitation will remain in force for some years at
`least because the design and production of multicore
`processors with independent clock domains is still prohibi-
`tively expensive.
`There has been much research [13], [14], [15], [16], [17]
`on how best to use DVS in a unicore processor for real-
`time tasks.
`In systems consisting of multiple DVS
`processors, DVS scheduling is easily accomplished using
`those existing algorithms on each processor after partition-
`ing [13], [14], [15]. In multicore environments, however,
`the benefit of this approach is greatly reduced by the
`limitation that all cores must share the same clock. Even
`though the performance demands of each core may differ
`at a given scheduling point, this limitation forces all cores
`to work at the highest frequency scheduled. Compared to
`a multiprocessor system, a multicore system will
`thus
`consume more power needlessly if
`the existing DVS
`method is adopted blindly.
`This paper suggests a dynamic, power-conscious, real-
`time scheduling algorithm to resolve this problem.
`In
`general, multicore processors have some caches that are
`shared among their cores. Task migration between cores
`thus requires less overhead than migration between fully
`independent processors. With an exploitation of
`this
`property, Dynamic Repartitioning, which is the suggested
`scheme tries to keep the performance demands of each core
`balanced by migrating tasks from the core with the highest
`demand to the one with the lowest demand. Similar to
`multiprocessor systems, the dynamic performance demand
`of each core is given by existing DVS algorithms, and the
`migration decisions are made at every scheduling time. This
`repartitioning of tasks is expected to reduce the dynamic
`power consumption by lowering the maximum perfor-
`mance demand of the cores at any given moment.
`In addition to dynamic power, there is another source of
`power consumption that must be considered. Different
`from dynamic power, which is consumed during instruc-
`tion execution, leakage power is consumed as long as there
`is electric current in the circuits. In general, this energy loss
`is proportional to the circuit density and the total number of
`circuits in the processor. Leakage power has thus been
`taking up an increasing proportion of the total power, up to
`44 percent in 50 nm technology for an active cycle of a
`uniprocessor [18]. And, it will become even more in a
`multicore processor for the vastly increased circuits.
`In this paper, we also suggest a method of reducing the
`leakage power by adjusting the number of active cores.
`Dynamic Core Scaling decides on the optimal number of
`cores for the current performance demand and tries to meet
`this criterion as far as all deadlines are guaranteed.
`Dynamic Core Scaling is expected to save a considerable
`amount of leakage power in low load periods, where the
`leakage power makes up a large fraction of the total power
`consumption.
`The suggested Dynamic Repartitioning and Dynamic
`Core Scaling methods were evaluated through simulations
`by applying them to a well-known processor power model.
`The target task sets in the simulations were designed to
`demonstrate the behavior of the algorithm under diverse
`environments.
`
`TABLE 1
`Example Task Set [13]
`
`Task
`r 1
`r2
`r-J
`
`Period
`8 ms
`10 ms
`14 ms
`
`WCET
`3 ms
`3 ms
`1 ms
`
`Utilization
`0.375
`0.300
`0.071
`
`cc2
`1
`1
`1
`
`2
`1
`1
`
`The rest of this paper is organized as follows: Section 2
`reviews existing research on the use of DVS in real-time
`unicore processor systems and on the development of
`energy-efficient scheduling algorithms in multiprocessor
`and multicore systems. Section 3 defines the problem and
`describes the power consumption model used in this paper.
`In Section 4, we describe Dynamic Repartitioning algorithm
`as a way of efficiently reducing clock frequencies.
`In
`Section 5, we introduce Dynamic Core Scaling algorithm,
`which reduces the leakage power by adjusting the number
`of activated cores. Section 6 presents simulation results for
`the two algorithms, and Section 7 summarizes our
`conclusions.
`
`2 RELATED WORK
`2.1 DVS on a Unicore Processor
`In this paper, the WCET of a task will be taken as the time
`required to finish the worst-case execution path at
`maximum performance. The actual WCET of a task is the
`scaled value of its WCET to the current performance, and it
`increases linearly as performance degrades. In this paper,
`we will use the term utilization of a task to refer to its WCET
`divided by its period. It means the fraction of processor
`time dedicated to the task at maximum performance. A
`matter of course, the relative utilization of a task which is
`based on its actual WCET is also grows as the performance
`degrades.
`EDF is the optimal algorithm for preemptible periodic
`real-time task scheduling. Defining the utilization of a task
`as the value of its WCET divided by its period, EDF can
`guarantee meeting the deadlines of all task sets for which
`the sum of all task utilization is less than one. Based on this
`property, Pillai and Shin [13] suggested three DVS schedul-
`ing heuristics: Static, Cycle conserving, and Look ahead.
`The Static algorithm adjusts the clock frequency so that
`the total relative utilization of all tasks is 1.0. For example,
`the total relative utilization of the task set in Table 1 is 0.746.
`If the execution time of all tasks is inversely proportional to
`the clock frequency,
`then we can achieve the highest
`possible energy efficiency while still meeting all deadlines
`by scaling the performance down to 0.746 times the
`maximum clock frequency.
`A given task may be finished earlier than its WCET, and
`the actual execution time changes with every period. If the
`tasks in Table 1 have actual execution times as given in
`columns cc1 and cc2 during their first and second periods,
`respectively, then idle periods will result even executing at
`the frequency given by the Static algorithm. This means that
`the frequency could be lowered further.
`To exploit
`this phenomenon,
`the Cycle-conserving
`algorithm adopts the notion of dynamic utilization, which
`
`Authorized licensed use limited to: UNIST. Downloaded on May 26, 2009 at 21:48 from IEEE Xplore. Restrictions apply.
`
`Petitioner Samsung Ex-1031, 0002
`
`
`
`1542
`
`IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 11, NOVEMBER 2008
`
`~ 1.00
`~
`[ 0.75
`
`t.:..
`
`0.621
`lr
`
`0.50
`
`t,
`
`l2
`
`t ,
`
`12
`
`0
`U(t 1) : 0.375- 0.25
`
`10
`
`15 Time
`
`f 1.00
`.,.
`J: 0.75
`0.50
`
`0
`
`1.00
`
`>.
`u
`ii
`::,
`O"
`J: 0.75
`0.50
`
`0.746
`
`t I
`
`(a)
`
`0.546
`
`10
`U(t 1) : 0.375
`
`(b)
`
`0.546
`
`t ,
`
`l2
`
`15
`
`Time
`
`0.496
`
`0.296
`
`10
`5
`0
`U(l 1) : 0.375
`U(t 1) : 0.375---> 0.25
`U(t 2) : 0.3 --+ 0.1
`U(l 2) : 0.3
`
`15
`
`T ime
`
`(c)
`
`Fig. 1. Cycle-conserving algorithm on the example task set [13]. (a) After
`finish of executing 1. (b) After finish of executing 2 and 3. (c) Actual
`execution flow for two rounds.
`
`is updated whenever the tasks are scheduled and finished.
`On the completion of a task, it updates the utilization based
`on the task’s actual execution time. The next time the task
`begins executing, however, its utilization is restored to the
`original value based on the task’s WCET. In this manner,
`the Cycle-conserving algorithm may choose a lower
`frequency than the Static algorithm during the period
`between a task’s completion and the start of its next period.
`It thus saves more energy than the Static algorithm.
`Fig. 1 shows an example of
`the Cycle-conserving
`algorithm at work. The actual execution time of 1 is 2 ms
`(Fig. 1a). The utilization of 1 is thus updated from 3/8 to
`2/8 after its first execution, and the total utilization of the
`task set decreases to 0.621. At this point (Fig. 1b), the
`processor will be operated at 0.621 times the highest
`frequency. The utilization of 2 drops from 3/10 to 1/10
`after completion, and as a result, 3 can be executed at
`0.421 times the highest frequency. The actual execution
`flow under the Cycle-conserving algorithm for both rounds
`is shown in Fig. 1c.
`Cycle conserving is expected to lead to a higher energy
`efficiency than the Static algorithm, because it reduces the
`frequency during idle periods. As shown in Fig. 1c, at the start
`of each new period, it assumes the worst case for the current
`task. As a result, the frequency tends to start high and
`decrease gradually as tasks are completed. If the actual
`execution times of most tasks fall short of their WCET,
`however, it is better to start with a low frequency and defer
`the use of high-frequency processing as long as all deadlines
`
`can be met. This is the basic concept of the Look-ahead
`algorithm. When actual execution times usually fall short of
`their corresponding WCETs, the Look-ahead algorithm gives
`better results than Cycle-conserving. In cases where the
`actual execution times are usually close to their WCETs,
`however, Cycle conserving is the better choice.
`A variety of DVS algorithms have been proposed in
`addition to these. Aydin et al. [14], for example, have
`suggested the Generic Dynamic Reclaiming Algorithm
`(GDRA) and Aggressive Speed Adjustment (AGR) algo-
`rithms. GDRA is in many respects similar to the Cycle-
`conserving algorithm; AGR, however, sets the frequency
`based on the execution history of the tasks. Gruian [15]
`suggested an algorithm that starts at a low frequency and
`increases the processing speed gradually based on the
`statistics of the actual execution times. Kim et al. [17] also
`suggested the method to utilize slack time, which is based
`on the expectation of the slack time occurrences. These
`alternative approaches are helpful in cases, where trends are
`visible in the actual execution times, for example, when the
`most recent execution time is related to the previous one.
`
`2.2 Power-Aware Scheduling on Multiprocessors
`Besides the problem of deciding which task to execute at a
`certain time, multiprocessor real-time systems must also
`decide which processor the task will run on. Partitioned
`scheduling is the most widely used solution to this NP-hard
`problem. In partitioned scheduling, every processor has its
`own task queue, and in an initial stage, each task is
`partitioned into one of these queues. Each processor’s task
`set is scheduled with a single-processor scheduling algo-
`rithm such as EDF or RM [3], [5]. The partitioning itself is
`one variant of the well-known Knapsack problem, for
`which a number of heuristics such as Best Fit, Worst Fit, and
`Next Fit are known to work well.
`The partitioning approach has the advantage of utilizing
`DVS. Any of the many possible DVS algorithms described
`in Section 2.1 can be used to adjust the frequency of each
`processor and its associated task set. To maximize the
`energy efficiency, however, the utilizations of each parti-
`tioned set should be well balanced [4]; this is because the
`dynamic power consumption increases as the cube of the
`utilization.
`Aydin and Yang [4] proved that it is also an NP-hard
`problem to partition a list of tasks into a given number of
`sets that are optimally load balanced, with the guarantee
`that each task set can be scheduled on the system. They also
`showed that among well-known heuristics, worst
`fit
`decreasing (WFD) generates the most balanced sets. WFD
`applies the worst-fit algorithm to the tasks after sorting
`them in order of decreasing task utilization.
`There are also many scheduling heuristics for the variety
`configurations of target environments. Gruian [19] proposed
`a simulated annealing approach in multiprocessor energy
`efficient scheduling with the considerations of precedence
`constraints and predictable execution time for each task.
`Chen et al. [20] suggested an approximation algorithm
`with different approximation bounds for processors with/
`without constraints on processor speeds for the task set with
`common periods. Anderson and Baruah [21] suggested the
`
`Authorized licensed use limited to: UNIST. Downloaded on May 26, 2009 at 21:48 from IEEE Xplore. Restrictions apply.
`
`Petitioner Samsung Ex-1031, 0003
`
`
`
`SEO ET AL.: ENERGY EFFICIENT SCHEDULING OF REAL TIME TASKS ON MULTICORE PROCESSORS
`
`1543
`
`Number of Cores
`Active Cores
`
`Clock Frequency
`
`100%
`
`75%
`
`actual execution times may be shorter than the WCETs,
`which are close to the real world. If this is so, then additional
`energy can be saved with a dynamic approach.
`
`Clock Frequen cy
`'"---•.. -.- -.-.. -.. -.. -~-~.-'.i.o-__ d-__ -.. -.. -.. -.. -.. -.. -. -__ -__ .. _ .. -._-__ -----------~► Time
`
`25%
`
`Fig. 2. Example schedule generated by heuristic algorithm [12] for
`DVS-CMP.
`
`trade-off between increasing the number of processor and
`increasing the performance of each processor is explored,
`and they also suggested algorithms to solve the problem
`with static analysis. However, even though the many works
`have been done, most of them are based on the static
`analysis of the WCETs of tasks and have little consideration
`for utilizing slack time.
`
`2.3 Power-Aware Scheduling on Multicores
`While much research has examined the problem of energy-
`efficient scheduling for single-processor or multiprocessor
`systems, little work has been done on multicore processors.
`Nikitovic and Brorsson [22] assumed an adaptive chip-
`multicore processor (ACMP) architecture,
`in which the
`cores have several operating modes (RUN, STANDBY, and
`DORMANT) and can change their state dynamically and
`independently according to the performance demand. They
`suggested some scheduling heuristics for ACMP systems
`and demonstrated that these heuristics save a significant
`amount of energy for non-real-time task sets compared to a
`high-frequency unicore processor. Although this work
`introduced the benefits of processors with multiple cores
`that can change their operating mode independently, it
`does not take into consideration the demands of real-time
`task sets.
`The first energy-efficient approach to real-time scheduling
`on a multicore processor was suggested by Yang et al. [12],
`who assumed a DVS-enabled chip multiprocessor (DVS-
`CMP). In DVS-CMP systems, all cores share the same clock
`frequency but a core can “sleep” independently if it has no
`work to do. Yang et al. proved that the energy efficient
`scheduling of periodic real-time tasks on DVS-CMP system is
`an NP-hard problem. They thus suggested a heuristic
`algorithm for scheduling a framed, periodic, real-time task
`model. In this model all tasks have the same period, share a
`deadline which is equal to the end of the period, and start at
`the same time. As shown in Fig. 2, the suggested algorithm
`starts executing tasks at a low performance. As time goes on,
`cores with no tasks to run will be set to the sleep state. When
`the number of sleeping cores increases, the frequency must
`also increase to meet the deadlines of tasks that have not been
`finished yet. In this manner the number of cores running in a
`high frequency mode is reduced, and a significant amount of
`energy will be saved. The applications of this algorithm are
`limited, however, because it can be only used for the framed
`real-time systems in which all tasks have same dead-lines
`and starting points. Moreover it is also a static approach. In
`other words, it does not take into account cases where the
`
`3 SYSTEM MODEL
`3.1 Task Set Model
`The assumed target tasks are executed periodically, and
`each should be completed before its given deadline. A
`completed task rests in sleep state until its next period
`begins, at
`the start of which the task will again be
`activated and prepared for execution. The tasks have no
`interdependency.
`A task set T is defined by (1), where i
`is the
`ith individual task in T . Each task has its own predefined
`period pi and WCET wi;
`the latter is defined as the
`maximum execution time required to complete i at the
`highest possible processor frequency. The real worst-case
`execution time of i thus increases from wi as the clock
`frequency decreases. The nearest deadline at the current
`time is defined as di:
`T ¼ 1ðp1; w1Þ; . . .; nðpn; wnÞ
`f
`g:
`
`ð1Þ
`
`task i
`The utilization ui of
`is defined by (2). A
`proportion ui of the total number of cycles of a core will
`be dedicated to executing i:
`ui ¼ wi=pi:
`U, the total utilization of T , is defined as (3):
`X
`
`ð2Þ
`
`U ¼
`
`ui:
`
`8i2T
`
`ð3Þ
`
`The processor S consists of multiple cores and is defined
`in (4). The nth core in S is denoted as Cn. The number of
`cores in S is denoted as m. Each core is assumed to have
`identical structure and performance. We also assume that
`resource sharing between the cores does not introduce any
`interference overhead. We have
`S ¼ fC0; . . .; C mg:
`
`ð4Þ
`
`F , the relative performance of S and the scaling factor for
`the operating clock frequency, is a number between 0 and 1.
`If the performance demand on S is F , then the actual
`frequency is the highest possible frequency of S multiplied
`by the factor F .
`The system follows the partitioned scheduling approach;
`any well-known heuristic such as BFD, NFD, etc., may be
`adopted. The partitioned state of T on S is denoted P, and
`the partitioned task set allocated to core Cn is denoted as Pn.
`The utilization of Pn is defined by (5):
`X
`
`Un ¼
`
`ui:
`
`8i2Pn
`
`ð5Þ
`
`For ease of description and explanation, we further
`define the two functions given by (6) and (7). ðiÞ gives the
`core that i was initially partitioned into, and ðiÞ gives the
`
`Authorized licensed use limited to: UNIST. Downloaded on May 26, 2009 at 21:48 from IEEE Xplore. Restrictions apply.
`
`Petitioner Samsung Ex-1031, 0004
`
`
`
`1544
`
`IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 11, NOVEMBER 2008
`
`core that i
`is currently located in. This distinction is
`necessary because we will dynamically migrate tasks
`between the cores:
`ðiÞ ¼ Cj in which i was initially partitioned;
`
`ð6Þ
`
`ð7Þ
`
`ðiÞ ¼ Cj in which i is currently partitioned:
`Each partitioned task set is scheduled using EDF on its
`corresponding core. The performance demand of each core
`is decided by running the Cycle-conserving algorithm on
`each core individually.
`To apply the Cycle-conserving algorithm, we define
`some dynamically updated variables. The Cycle-conserving
`utilization li of task i, which is initially equal to ui, is
`defined by (8). After the execution of a task, li is updated
`using the most recent actual execution time cci as the
`numerator of (2) instead of wi. After the period pi elapses,
`the task will be executed again and may now meet the worst
`case execution conditions; the utilization of the task will
`thus be reset to ui. As a result, li is updated after every
`deadline of i:
`
`
`li ¼ wi=pi
`cci=pi
`
`if i is unfinished
`if i is finished:
`
`ð8Þ
`
`Ln, the dynamic utilization of core Cn, is defined by (9).
`Ln is the current performance demand on Cn. Thus, as long
`as F is greater than Ln, all the deadlines of tasks in Cn will
`be met by the EDF scheduling algorithm. We will also use L
`to refer to the Cycle-conserving utilization of a core when
`the context is unambiguous. Thus, L of Ci also means Li:
`Ln ¼
`þ
`ð9Þ
`
`X
`
`8 unfinished i2Pn
`
`wi
`pi
`
`:
`
`X
`
`cci
`pi
`
`8 finished i2Pn
`
`3.2 Power Model
`The total power consumption of a CMOS-based processor
`consists of its dynamic power Pdynamic and its leakage power
`Pleakage. We construct a processor model to evaluate the
`energy efficiency of the proposed algorithms.
`Most of Pdynamic
`is the capacitive switching power
`consumed during circuit charging and discharging. Gen-
`erally, it is the largest part in the processor power during
`executing instruction. Pdynamic can be expressed in terms of
`the operating voltage Vdd, the clock frequency f, and the
`switching capacity cl as follows [23]:
`ð10Þ
`
`Pdynamic ¼ cl V 2dd f:
`The clock frequency f is itself related to several other
`factors, as given by (11). The threshold voltage Vth is a
`function of the body bias voltage Vbs, as seen in (12). Here,
`Vth1 , , K1, and K2 are constants depending on the processor
`fabrication technology. Generally, is between 1 and 2, so
`raising Vdd above the threshold voltage enables the
`processor to increase the clock frequency. In the assumed
`processor model, the change of f is assumed to accompany
`with switching Vdd to the lowest allowable point:
`
`f ¼ ðVdd VthÞ
`
`;
`
`LdK6
`Vth ¼ Vth1 K1 Vdd K2 Vbs:
`
`ð11Þ
`
`ð12Þ
`
`TABLE 2
`Constants Based on the 70 nm Technology [24]
`
`Variable
`K1
`K 2
`K 3
`/(4
`Ks
`/(6
`Vb s
`Va,1
`
`Value
`0.063
`0.153
`5.3 X 10- 7
`1.83
`4.19
`5.26 X 10- 12
`-0.7
`0.244
`
`Variable
`I j
`Ct
`Ld
`Ly
`~
`f.m in
`f,nax
`
`Value
`4. 0 X 10-
`4.3 X 10-tO
`37
`4 X 106
`1.5
`1 X 109
`3 X 109
`
`Pleakage is caused by leakage current, which flows even
`while no instructions are being executed. To calculate the
`leakage power consumption, we adopt a processor power
`used in existing research [24], [25]. Pleakage mainly consists of
`the subthreshold leakage current Isubn and the reverse bias
`junction current Ij. Pleakage can be expressed as a function of
`these two variables, as in (13).
`Isubn is defined by (14), where Lg is the number of
`components in the circuit. K3, K4, and K5 are constants
`determined by the processor fabrication technology:
`Pleakage ¼ Lg ðVdd Isubn þ j Vbs j IjÞ;
`
`ð13Þ
`
`ð14Þ
`Isubn ¼ K3 eK4Vdd eK5Vbs :
`The processor cores are assumed to consume both Pdynamic
`and Pleakage while executing instructions, and only Pleakage
`during idle periods.
`A multicore processor actually has some shared compo-
`nents as well, such as processor caches, buses, and so on. In
`this paper, we do not count the power consumption from
`these shared components because our goal is to reduce the
`power consumption of the cores themselves. The power
`consumption of a multicore processor is thus simply
`obtained by summing the power consumption of
`the
`individual cores.
`The core is assumed to have two operating modes: an
`active state, in which it is able to execute instructions and a
`sleep state in which it ceases working and rests with
`minimized power consuming. In the sleep state, the only
`possible operation is a transition into the active state. In this
`paper, we assume that the state transition introduces no
`additional overhead because this factor can be treated easily
`in practical implementations.
`In the sleep state, it is assumed that there is no Pdynamic
`and that Pleakage is 3 percent of Pleakage at the active state with
`current frequency f [26].
`just de-
`To simulate the power consumption model
`scribed, we adopt the realistic constants [24] given in Table 2.
`These constants have been scaled to 70 nm technology and
`are based on the original values [25] of Transmeta’s Crusoe
`Processor which uses 180 nm technology.
`By adopting the constants in Table 2, we obtain the
`power consumption function depicted in Fig. 3. As f
`to Ptotal
`decreases, the ratio of Pleakage
`increases. Below
`1.5 GHz, Pleakage is greater than Pdynamic. Pdynamic increases
`rapidly as f increases, however, and Ptotal thus rapidly
`increases in the high f domain.
`
`Authorized licensed use limited to: UNIST. Downloaded on May 26, 2009 at 21:48 from IEEE Xplore. Restrictions apply.
`
`Petitioner Samsung Ex-1031, 0005
`
`
`
`SEO ET AL.: ENERGY EFFICIENT SCHEDULING OF REAL TIME TASKS ON MULTICORE PROCESSORS
`
`1545
`
`1.0
`
`0.8
`
`0.2
`
`0
`
`- - Total Power
`·""" Dynamic Power
`-
`- - Leakage Power
`
`1.5
`2
`Clock Frequency (GHz)
`
`2.5
`
`3
`
`Fig. 3. Power consumption of a 70 nm core as a function of clock
`frequency.
`
`from the current time to a certain time point t according to
`the current schedule as Mn;t. Importing i into Cdst can be
`i þ Mdst;di < 1:0.
`done only when u0
`If i has the nearest deadline among the unfinished tasks
`scheduled in Csrc, exporting i can be treated as if it was
`finished at that time. Thus, Lsrc can be updated using (9). In
`this paper, we will allow the migration of a task only if it
`has the nearest deadline among the unfinished tasks in Psrc.
`This migration is only effective for the current period.
`After the period, the migrated task should be returned to its
`original core. However, returning to the original core is only
`conceptual. If the condition can be met, the task that was
`executed in a foreign core can be exported into the same
`core again right after the next release and this will be seen
`as the task remains for its next period.
`A migrated task may be finished earlier than its worst
`case execution time. It allows reducing L0
`dst as (17), which is
`
`a combination of (16) and (9):
`dst ¼ Ldst þ cci cmi
`L00
`
`until di
`after di:
`
`ð17Þ
`
`di T
`
`Ldst
`
`The migration operations can be overlapped and taken
`in recursive manner. In other words, a task executed in a
`foreign core that was migrated from the original core can
`be exported to the other core within the period as long as
`the conditions are met. Moreover, if a core has imported a
`task that is not completed yet, it can export multiple tasks
`to the other cores as long as the tasks have the nearest
`deadline among the unfinished tasks in the core at the
`exporting time.
`We now suggest a dynamic approach to balancing the
`dynamic utilizations of each core as follows:
`Notion of dynamic repartitioning. Let us assume that
`the partitioned state P has been generated as defined in
`Section 3.1. We further define the variables Lmax and Lmin as
`the maximum and minimum L values among all cores at
`calling time:
`
`8Cn 2 S; Lmax Ln;
`
`ð18Þ
`
`ð19Þ
`8Cn 2 S; Lmin Ln:
`For each core, Cmax and Cmin, which are the cores that
`if 9k such that
`have Lmax and Lmin, respectively,
`ðk 2 PmaxÞ ^ ðu0k þ Lmin < LmaxÞ, temporally migrating k
`
`from Pmax to Pmin will lower the performance demand of
`processor S.
`
`that ðP0max ¼
`Our approach will replace P with P0
`
`Pmax fkgÞ^ ðP0min¼PminþfkgÞ ^ ð8Cn2 S fCmax; Cming;
`n ¼ PnÞ until dk. Under the initial partitioned state P, all
`P0
`deadlines are guaranteed to be met because 8Cn 2 S, Un < 1
`by the assumption. If the above conditions are met, then all
`deadlines will also be met under P0
`. As this partition is the
`result of migrating k 2 Cmax to Cmin, it follows that 8Cn 2 S,
`Un < 1. Similarly, the repartitioned state P00
`based on P0
`is
`obtained by temporarily migrating a certain task l from C0
`max
`) until dl and P00
`min (as determined on P0
`to C0
`also guarantees
`
`
`all deadlines until a certain task l 2 fP00max [ P00ming.
`
`If we compare Lmax (of P) to L0max (of P0
`), it will always
`be true that Lmax > L0
`max, at least until a new period starts
`for one of the tasks in C0
`max. All the cores except for Cmax
`
`4 DYNAMIC REPARTITIONING
`Because Pdynamic / f 3, minimizing Pdynamic in a multicore
`processor for a given task set is essentially the problem of
`generating partitioned task sets with the most balanced
`utilization [4]. However, even though the initial partitioned
`state is well-balanced the performance demand on each
`core changes frequently during runtime. Thus, to achieve a
`consistently low power consumption,
`the performance
`demand of each core must stay balanced during operation.
`The intuitive way to solve the temporal unbalance is
`migrating some